KHyperLogLog Functions
Presto implements the KHyperLogLog
algorithm and data structure. A KHyperLogLog data structure can be created
with khyperloglog_agg().
Data Structures
KHyperLogLog is a data sketch that compactly represents the association of two
columns. It is implemented in Presto as a two-level data structure composed of
a MinHash structure whose entries each map to a HyperLogLog sketch.
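
The two-level layout can be illustrated with a small Python model. This is only a sketch under stated assumptions, not Presto's implementation: it assumes a bottom-k style MinHash (the entries with the smallest hashes are kept), uses an exact Python set as a stand-in for each HyperLogLog, and the names hash64, build_khll and max_entries are hypothetical.

```python
import hashlib


def hash64(value):
    """Deterministic 64-bit hash of a value's string form (illustrative choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_khll(pairs, max_entries=4):
    """Toy two-level sketch: {hash(x): set of y values}.

    Assumption: only the max_entries smallest hashes are retained, as a
    bottom-k MinHash would do. The exact set of y values stands in for
    the HyperLogLog that a real sketch would store per entry.
    """
    sketch = {}
    for x, y in pairs:
        sketch.setdefault(hash64(x), set()).add(y)
    kept = sorted(sketch)[:max_entries]
    return {h: sketch[h] for h in kept}


pairs = [
    ("alice", 1), ("alice", 2), ("bob", 1), ("carol", 3),
    ("dave", 4), ("erin", 5), ("erin", 6),
]
khll = build_khll(pairs, max_entries=4)
for h, ys in khll.items():
    print(h, sorted(ys))
```

With five distinct x values and max_entries=4, one x value is dropped from the first level, which is why functions over a saturated sketch return estimates rather than exact answers.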
Serialization
KHyperLogLog sketches can be cast to and from varbinary. This allows them to
be stored for later use.
Functions
- khyperloglog_agg(x, y) -> KHyperLogLog
Returns the KHyperLogLog sketch that represents the relationship between columns x and y. The MinHash structure summarizes x, and the HyperLogLog sketches represent the y values linked to x values.
- cardinality(khll) -> bigint
Returns the cardinality of the MinHash sketch, i.e. the cardinality of x.
- intersection_cardinality(khll1, khll2) -> bigint
Returns the set intersection cardinality of the data represented by the MinHash structures of khll1 and khll2.
- jaccard_index(khll1, khll2) -> double
Returns the Jaccard index of the data represented by the MinHash structures of khll1 and khll2.
- uniqueness_distribution(khll) -> map<bigint,double>
For a given value x', uniqueness is the number of y' values associated with it in the source dataset. It is obtained from the cardinality of the HyperLogLog that is mapped from the MinHash bucket corresponding to x'. This function returns a histogram that represents the uniqueness distribution, the X-axis being the uniqueness and the Y-axis being the relative frequency of x values.
- uniqueness_distribution(khll, histogramSize) -> map<bigint,double>
Returns the uniqueness histogram with the given number of buckets. If histogramSize is omitted, it defaults to 256. All uniqueness values greater than histogramSize are accumulated in the last bucket.
- reidentification_potential(khll, threshold) -> double
Returns the reidentification potential, which is the ratio of x values that have a uniqueness under the given threshold.
- merge(khll) -> KHyperLogLog
Returns the KHyperLogLog of the aggregate union of the individual KHyperLogLog structures.
- merge_khll(array(khll)) -> KHyperLogLog
Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
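
The semantics of the functions above can be sketched with a toy Python model. It is an illustration under assumptions, not Presto's code: each sketch is a dict from a hash of x to an exact set of y values (standing in for a HyperLogLog), the sketches are kept unsaturated so every answer is exact rather than estimated, the histogram is assumed to have keys 1 through histogramSize, and "under the threshold" is assumed to mean "at most the threshold". The helper names hash64, build and max_entries are hypothetical.

```python
import hashlib
from collections import Counter


def hash64(value):
    """Deterministic 64-bit hash of a value's string form (illustrative choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build(pairs, max_entries=2048):
    """Toy analogue of khyperloglog_agg(x, y): {hash(x): set of y values}."""
    sketch = {}
    for x, y in pairs:
        sketch.setdefault(hash64(x), set()).add(y)
    return {h: sketch[h] for h in sorted(sketch)[:max_entries]}


def cardinality(khll):
    # Exact only while the first level is not saturated.
    return len(khll)


def intersection_cardinality(a, b):
    return len(a.keys() & b.keys())


def jaccard_index(a, b):
    union = a.keys() | b.keys()
    # Returning 0.0 for two empty sketches is a choice made for this toy model.
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def uniqueness_distribution(khll, histogram_size=256):
    # Uniqueness values above histogram_size fall into the last bucket.
    counts = Counter(min(len(ys), histogram_size) for ys in khll.values())
    total = len(khll) or 1
    return {u: counts[u] / total for u in range(1, histogram_size + 1)}


def reidentification_potential(khll, threshold):
    # Assumption: a value counts when its uniqueness is <= threshold.
    hits = sum(1 for ys in khll.values() if len(ys) <= threshold)
    return hits / len(khll) if khll else 0.0


def merge(sketches, max_entries=2048):
    """Toy analogue of merge / merge_khll: union entries, union their y sets."""
    merged = {}
    for sketch in sketches:
        for h, ys in sketch.items():
            merged.setdefault(h, set()).update(ys)
    return {h: merged[h] for h in sorted(merged)[:max_entries]}


a = build([("alice", 1), ("alice", 2), ("bob", 1), ("carol", 3)])
b = build([("bob", 7), ("carol", 3), ("dave", 4)])
merged = merge([a, b])

print(cardinality(a), intersection_cardinality(a, b), jaccard_index(a, b))
print(uniqueness_distribution(a, 4))
print(reidentification_potential(a, 1), cardinality(merged))
```

In this example alice is linked to two y values while bob and carol are linked to one each, so two thirds of the x values in a have a uniqueness of 1. After merging, bob is linked to both 1 and 7, which shows that a merge combines the second-level structures as well as the first-level entries.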