Noisy Aggregate Functions¶
Overview¶
Noisy aggregate functions are functions that provide random, noisy
approximations of common aggregations like sum()
, count()
, and
approx_distinct()
as well as sketches like approx_set()
. By
injecting random noise into results, noisy aggregation functions make it more
difficult to determine or confirm the exact data that was aggregated.
While many of these functions resemble differential privacy mechanisms, neither the values returned by these functions nor the query results that incorporate these functions are differentially private in general. See Limitations below for more details. Users who wish to support a strong privacy guarantee should discuss with a suitable technical expert first.
Counts, Sums, and Averages¶
- noisy_count_gaussian(col, noise_scale[, random_seed]) -> bigint()¶
Counts the non-
NULL
values incol
and then adds a normally distributed random double value with 0 mean and standard deviation ofnoise_scale
to the true count. The noisy count is post-processed to be non-negative and rounded to bigint.If provided,
random_seed
is used to seed the random number generator. Otherwise, noise is drawn from a secure random.SELECT noisy_count_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem; -- 60179 (1 row) SELECT noisy_count_gaussian(orderkey, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row)
Note
Unlike
count()
, this function returnsNULL
when the (true) count ofcol
is 0.Distinct counting can be performed using
noisy_count_gaussian(DISTINCT col, ...)
, or withnoisy_approx_distinct_sfm()
. Generally speaking,noisy_count_gaussian()
returns more accurate results but at a larger computational cost.
- noisy_count_if_gaussian(col, noise_scale[, random_seed]) -> bigint()¶
Counts the
TRUE
values incol
and then adds a normally distributed random double value with 0 mean and standard deviation ofnoise_scale
to the true count. The noisy count is post-processed to be non-negative and rounded to bigint.If provided,
random_seed
is used to seed the random number generator. Otherwise, noise is drawn from a secure random.SELECT noisy_count_if_gaussian(orderkey > 10000, 20.0) FROM tpch.tiny.lineitem; -- 50180 (1 row) SELECT noisy_count_if_gaussian(orderkey > 10000, 20.0) FROM tpch.tiny.lineitem WHERE false; -- NULL (1 row)
Note
Unlike
count_if()
, this function returnsNULL
when the (true) count is 0.
- noisy_sum_gaussian(col, noise_scale, lower, upper[, random_seed]) -> double()¶
Calculates the sum over the input values in
col
and then adds a normally distributed random double value with 0 mean and standard deviation of noise_scale. Each value is clipped to the range of [lower
,upper
] before adding to the sum.If provided,
random_seed
is used to seed the random number generator. Otherwise, noise is drawn from a secure random.
- noisy_sum_gaussian(col, noise_scale[, random_seed]) -> double()
Calculates the sum over the input values in
col
and then adds a normally distributed random double value with 0 mean and standard deviation ofnoise_scale
.If provided,
random_seed
is used to seed the random number generator. Otherwise, noise is drawn from a secure random.
- noisy_avg_gaussian(col, noise_scale, lower, upper[, random_seed]) -> double()¶
Calculates the average (arithmetic mean) of all the input values in
col
and then adds a normally distributed random double value with 0 mean and standard deviation ofnoise_scale
. Each value is clipped to the range of [lower
,upper
] before averaging.If provided,
random_seed
is used to seed the random number generator. Otherwise, noise is drawn from a secure random.
- noisy_avg_gaussian(col, noise_scale[, random_seed]) -> double()
Calculates the average (arithmetic mean) of all the input values in
col
and then adds a normally distributed random double value with 0 mean and standard deviation ofnoise_scale
.If provided,
random_seed
is used to seed the random number generator. Otherwise, noise is drawn from a secure random.
Approximate Distinct Counting/Sketching¶
Noisy approximate distinct counting and sketching (analogous to the deterministic HyperLogLog Functions) is supported via the Sketch-Flip-Merge (SFM) data sketch [Hehir2023].
- noisy_approx_set_sfm(col, epsilon[, buckets[, precision]]) -> SfmSketch()¶
Returns an SFM sketch of the input values in
col
. This is analogous to theapprox_set()
function, which returns a (deterministic) HyperLogLog sketch.col
supports many types, similar toHyperLogLog
.epsilon
(double) is a positive number that controls the level of noise in the sketch, as described in [Hehir2023]. Smaller values of epsilon correspond to noisier sketches.buckets
(int) defaults to 4096.precision
(int) defaults to 24.
Note
Unlike
approx_set()
, this function returnsNULL
whencol
is empty. If this behavior is undesirable, usecoalesce()
withnoisy_empty_approx_set_sfm()
.
- noisy_approx_set_sfm_from_index_and_zeros(col_index, col_zeros, epsilon, buckets[, precision]) -> SfmSketch()¶
Returns an SFM sketch of the input values in
col_index
andcol_zeros
.This is similar to
noisy_approx_set_sfm()
except that function calculates aMurmur3Hash128.hash64()
ofcol
, and calculates the SFM PCSA bucket index and number of trailing zeros as described in [FlajoletMartin1985]. In this function, the caller must explicitly calculate the hash bucket index and zeros themselves and pass them as argumentscol_index
andcol_zeros
.col_index
(bigint) must be in the range0..buckets-1
.col_zeros
(bigint) must be in the range0..64
. If it exceedsprecision
, it is cropped toprecision-1
.epsilon
(double) is a positive number that controls the level of noise in the sketch, as described in [Hehir2023]. Smaller values of epsilon correspond to noisier sketches.buckets
(int) is the number of buckets in the SFM PCSA sketch as described in [Hehir2023].precision
(int) defaults to 24.
Note
Like
noisy_approx_set_sfm()
, this function returnsNULL
whencol_index
orcol_zeros
isNULL
. If this behavior is undesirable, usecoalesce()
withnoisy_empty_approx_set_sfm()
.
- noisy_approx_distinct_sfm(col, epsilon[, buckets[, precision]]) -> bigint()¶
Equivalent to
cardinality(noisy_approx_set_sfm(col, epsilon, buckets, precision))
, this returns the approximate cardinality (distinct count) of the columncol
. This is analogous to the (deterministic)approx_distinct()
function.Note
Unlike
approx_distinct()
, this function returnsNULL
whencol
is empty.
- noisy_empty_approx_set_sfm(epsilon[, buckets[, precision]]) -> SfmSketch()¶
Returns an SFM sketch with no items in it. This is analogous to the
empty_approx_set()
function, which returns an empty (deterministic) HyperLogLog sketch.epsilon
(double) is a positive number that controls the level of noise in the sketch, as described in [Hehir2023]. Smaller values of epsilon correspond to noisier sketches.buckets
(int) defaults to 4096.precision
(int) defaults to 24.
- cardinality(SfmSketch) -> bigint()¶
Returns the estimated cardinality (distinct count) of an
SfmSketch
object.
- merge(SfmSketch) -> SfmSketch()¶
An aggregator function that returns a merged
SfmSketch
of the set union of individualSfmSketch
objects, similar tomerge(HyperLogLog)
.SELECT year, cardinality(merge(sketch)) AS annual_distinct_count FROM monthly_sketches GROUP BY 1
- merge_sfm(ARRAY[SfmSketch, ...]) -> SfmSketch()¶
A scalar function that returns a merged
SfmSketch
of the set union of an array ofSfmSketch
objects, similar tomerge_hll()
.SELECT cardinality(merge_sfm(ARRAY[ noisy_approx_set_sfm(col_1, 5.0), noisy_approx_set_sfm(col_2, 5.0), noisy_approx_set_sfm(col_3, 5.0) ])) AS distinct_count_over_3_cols FROM my_table
Limitations¶
While these functions resemble differential privacy mechanisms, the values returned by these functions are not differentially private in general. There are several important limitations to keep in mind if using these functions for privacy-preserving purposes, including:
All noisy aggregate functions return
NULL
when aggregating empty sets. This means aNULL
return value noiselessly indicates the absence of data.GROUP BY
clauses used in combination with noisy aggregation functions reveal non-noisy information: the presence or absence of a group noiselessly indicates the presence or absence of data. See, e.g., [Wilkins2024].Functions relying on floating-point noise may be susceptible to inference attacks such as those identified in [Mironov2012] and [Casacuberta2022].
Casacuberta, S., Shoemate, M., Vadhan, S., & Wagaman, C. (2022). Widespread Underestimation of Sensitivity in Differentially Private Libraries and How to Fix It. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (pp. 471-484).
Hehir, J., Ting, D., & Cormode, G. (2023). Sketch-Flip-Merge: Mergeable Sketches for Private Distinct Counting. In Proceedings of the 40th International Conference on Machine Learning (Vol. 202).
Mironov, I. (2012). On significance of the least significant bits for differential privacy. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (pp. 650-661).
Wilkins, A., Kifer, D., Zhang, D., & Karrer, B. (2024). Exact Privacy Analysis of the Gaussian Sparse Histogram Mechanism. Journal of Privacy and Confidentiality, 14 (1).
Flajolet, P, Martin, G. N. (1985). Probabilistic Counting Algorithms for Data Base Applications. In Journal of Computer and System Sciences, 31:182–209, 1985