KHyperLogLog Functions
Presto implements the KHyperLogLog algorithm and data structure. A KHyperLogLog data structure can be created through khyperloglog_agg().
Data Structures
KHyperLogLog is a data sketch that compactly represents the association of two
columns. It is implemented in Presto as a two-level data structure composed of
a MinHash structure whose entries map to HyperLogLog sketches.
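To make the two-level layout concrete, the following Python sketch models it with exact containers: a dictionary plays the role of the MinHash level and a plain set stands in for each HyperLogLog. The names ToyKHLL, hash64, and the parameter k are hypothetical, and the choice of SHA-256 is an assumption made for illustration; Presto's implementation uses probabilistic sketches and is not reproduced here.

```python
import hashlib


def hash64(value):
    """Deterministic 64-bit hash of a value's string form (illustrative choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyKHLL:
    """Exact toy model: keep the k smallest x hashes, each mapped to its y values."""

    def __init__(self, k=4):
        self.k = k
        # hash(x) -> set of y values (the set is a stand-in for a HyperLogLog)
        self.entries = {}

    def add(self, x, y):
        h = hash64(x)
        if h in self.entries:
            self.entries[h].add(y)
            return
        if len(self.entries) < self.k:
            self.entries[h] = {y}
            return
        largest = max(self.entries)
        if h < largest:
            # Evict the largest hash so that only the k smallest remain.
            del self.entries[largest]
            self.entries[h] = {y}


sketch = ToyKHLL(k=4)
for x, y in [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]:
    sketch.add(x, y)
```

Here the number of retained entries approximates what the first level tracks about x, and the size of each inner set is the per-x count that a HyperLogLog would estimate.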
Serialization
KHyperLogLog sketches can be cast to and from varbinary. This allows them to be stored for later use.
Functions
- khyperloglog_agg(x, y) -> KHyperLogLog
Returns the KHyperLogLog sketch that represents the relationship between columns x and y. The MinHash structure summarizes x and the HyperLogLog sketches represent y values linked to x values.
- cardinality(khll) -> bigint
Calculates the cardinality of the MinHash sketch, i.e. the cardinality of x.
- intersection_cardinality(khll1, khll2) -> bigint
Returns the set intersection cardinality of the data represented by the MinHash structures of khll1 and khll2.
- jaccard_index(khll1, khll2) -> double
Returns the Jaccard index of the data represented by the MinHash structures of khll1 and khll2.
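As a reference point for the two set-comparison functions above, this Python sketch computes the exact quantities that the sketches estimate, using plain sets of x values. The function names and the sample data are hypothetical, and returning 0.0 for two empty inputs is a convention chosen here, not documented Presto behavior.

```python
def exact_intersection_cardinality(xs1, xs2):
    """Number of distinct values present in both collections."""
    return len(set(xs1) & set(xs2))


def exact_jaccard_index(xs1, xs2):
    """Size of the intersection divided by the size of the union."""
    a, b = set(xs1), set(xs2)
    union = a | b
    if not union:
        return 0.0  # assumption: convention for two empty inputs
    return len(a & b) / len(union)


xs1 = ["u1", "u2", "u3"]
xs2 = ["u2", "u3", "u4", "u5"]
inter = exact_intersection_cardinality(xs1, xs2)
jac = exact_jaccard_index(xs1, xs2)
```

With this data the two collections share u2 and u3 out of five distinct values overall, so the intersection cardinality is 2 and the Jaccard index is 0.4.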
- uniqueness_distribution(khll) -> map<bigint,double>
For a certain value x', uniqueness is understood as how many y' values are associated with it in the source dataset. This is obtained with the cardinality of the HyperLogLog that is mapped from the MinHash bucket that corresponds to x'. This function returns a histogram that represents the uniqueness distribution, the X-axis being the uniqueness and the Y-axis being the relative frequency of x values.
- uniqueness_distribution(khll, histogramSize) -> map<bigint,double>
Returns the uniqueness histogram with the given number of buckets. If omitted, the value defaults to 256. All uniqueness values greater than histogramSize are accumulated in the last bucket.
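The histogram described above can be illustrated with an exact computation over (x, y) pairs. In this Python sketch, exact_uniqueness_distribution and the sample pairs are hypothetical, and only non-empty buckets are returned, which is a simplification that may differ from the map Presto produces.

```python
from collections import Counter, defaultdict


def exact_uniqueness_distribution(pairs, histogram_size=256):
    """Map each uniqueness value to the fraction of x values that have it."""
    ys_by_x = defaultdict(set)
    for x, y in pairs:
        ys_by_x[x].add(y)
    # Uniqueness values above histogram_size fall into the last bucket.
    counts = Counter(min(len(ys), histogram_size) for ys in ys_by_x.values())
    total = len(ys_by_x)
    return {u: c / total for u, c in sorted(counts.items())}


pairs = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"),
    ("u3", "b"),
    ("u4", "a"), ("u4", "c"),
]
dist_default = exact_uniqueness_distribution(pairs)
dist_capped = exact_uniqueness_distribution(pairs, histogram_size=2)
```

Here u1 has uniqueness 3, u2 and u3 have uniqueness 1, and u4 has uniqueness 2. With the default size the result is {1: 0.5, 2: 0.25, 3: 0.25}; with a size of 2, u1 is folded into the last bucket and the result is {1: 0.5, 2: 0.5}.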
- reidentification_potential(khll, threshold) -> double
The reidentification potential is the ratio of x values that have a uniqueness under the given threshold.
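The ratio can be illustrated with an exact computation over (x, y) pairs. In this Python sketch the function name and the sample data are hypothetical. The description says "under" the threshold without stating whether a uniqueness equal to the threshold counts, so the sketch exposes that choice as an inclusive flag; this flag is an assumption and not a parameter of the Presto function.

```python
from collections import defaultdict


def exact_reidentification_potential(pairs, threshold, inclusive=False):
    """Fraction of x values whose uniqueness falls under the threshold."""
    ys_by_x = defaultdict(set)
    for x, y in pairs:
        ys_by_x[x].add(y)
    if not ys_by_x:
        return 0.0  # assumption: convention for empty input
    if inclusive:
        hits = sum(1 for ys in ys_by_x.values() if len(ys) <= threshold)
    else:
        hits = sum(1 for ys in ys_by_x.values() if len(ys) < threshold)
    return hits / len(ys_by_x)


pairs = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"),
    ("u3", "b"),
    ("u4", "a"), ("u4", "c"),
]
strict = exact_reidentification_potential(pairs, threshold=2)
loose = exact_reidentification_potential(pairs, threshold=2, inclusive=True)
```

With a threshold of 2, the strict reading counts u2 and u3 (2 of 4 values, 0.5), while the inclusive reading also counts u4 (3 of 4 values, 0.75).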
- merge(khll) -> KHyperLogLog
Returns the KHyperLogLog of the aggregate union of the individual KHyperLogLog structures.
- merge_khll(array(khll)) -> KHyperLogLog
Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
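The union performed by the two merge functions can be pictured with exact containers. In this Python sketch each structure is modeled as a dictionary from x to the set of its y values; merge_exact and the sample data are hypothetical, and a real merge combines MinHash and HyperLogLog sketches instead of exact sets.

```python
def merge_exact(structures):
    """Union several x -> set-of-y mappings without modifying the inputs."""
    merged = {}
    for structure in structures:
        for x, ys in structure.items():
            # Each x ends up with every y seen for it in any input.
            merged.setdefault(x, set()).update(ys)
    return merged


day1 = {"u1": {"a"}, "u2": {"b"}}
day2 = {"u1": {"c"}, "u3": {"a"}}
merged = merge_exact([day1, day2])
```

The merged structure holds u1 with {a, c}, u2 with {b}, and u3 with {a}, which is the same result as building one structure over the combined input rows.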