Sketch Functions

Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. Approximate answers are often faster and more efficient to compute than functions which result in full accuracy.

Presto C++ provides support for computing some sketches available in the Apache DataSketches library.

Theta Sketches

Theta sketches enable distinct value counting on datasets and also provide the ability to perform set operations. For more information on Theta sketches, please see the Apache DataSketches Theta sketch documentation.

sketch_theta(x) -> varbinary()

Computes a theta sketch from an input dataset. The output from this function can be used as an input to any of the other sketch_theta_* family of functions.

sketch_theta_estimate(sketch) -> double()

Returns the estimate of distinct values from the input sketch.

sketch_theta_summary(sketch) -> row(estimate double, theta double, upper_bound_std double, lower_bound_std double, retained_entries int)

Returns a summary of the input sketch which includes the distinct values estimate alongside other useful information such as the sketch theta parameter, current error bounds corresponding to 1 standard deviation, and the number of retained entries in the sketch.