Vector researcher Gautam Kamath breaks down the latest developments in robustness and privacy
May 28, 2024
May 28, 2024
By Gautam Kamath
As statistics and machine learning are applied in increasingly broad settings, we need to prepare our methods to face wide-ranging challenges. Some — like gathering data from untrusted sources, or drawing inferences from sensitive, personal information — weren’t dreamed of when our statistical toolkit was originally developed.
Specifically, we need statistical methods that satisfy the following criteria:
Without special care, most of our conventional techniques fail at both these goals. This holds true even for mean estimation, the most basic of statistical tasks. Focusing on this problem, I’ll discuss some recent developments in robustness and privacy, highlighting the surprising conceptual and technical connections between the two.
The core problem is deceptively simple: given a d-dimensional dataset drawn from some probability distribution, output an estimate of the mean. As a minimal technical assumption, we need the variance of the distribution to be bounded in every direction.
This problem is fairly easy to solve: just take the empirical mean of the dataset. With just a linear number of samples (a dataset size that is proportional to d), we get an accurate estimate of the mean.
Problems arise when the estimator must be robust. In this setting, we’ll assume that a small amount (say, 1% of the dataset size) of adversarially chosen data may be added to the dataset (sometimes known as Huber’s contamination model in Statistics). Nevertheless, the goal is to get a good estimate of the mean, where the accuracy degrades only proportional to how much data is contaminated.
It’s not hard to see that this robust setting can cause problems for naive estimators, even with only one corrupted data point in one dimension (hint: what happens to the empirical mean when we add one point very far out?). Fortunately, for the one-dimensional setting, we have alternate statistics which are robust, such as the median.
Unfortunately, the picture is murkier in higher dimensional settings. The natural approach is to try to generalize the median for these settings. One generalization is Tukey’s median, which attempts to find the “deepest point in space” with respect to the dataset. It turns out that it achieves the best accuracy guarantee one can hope for — with the caveat that it’s NP-hard to compute. In other words, it would take prohibitively long to compute in even a 10-dimensional setting. Another popular generalization is the geometric median, which is the point in space that minimizes the sum of the L2 distances to points in the dataset. While the geometric median can be computed efficiently, it runs into a different problem: as the dimensionality of the problem increases, the accuracy quickly degrades. This dichotomy applies to all previous methods for robust mean estimation in high dimensions: either the estimator is not computationally efficient, or the error guarantee degrades as the dimension increases.
We resolved this tension in a joint work with Ilias Diakonikolas, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart published at FOCS 2016. We gave the first robust estimator for the mean which simultaneously a) is efficient to compute and b) has a dimension-independent error guarantee, settling a decades-old open problem in the field. The algorithm computes a “spectral center” of the dataset: using spectral information about the dataset, it finds directions with variance higher than expected, and after projecting onto these directions it discards points that stand out as outliers.
This work sparked a flurry of interest in computationally-efficient, robust, high-dimensional statistics. As one notable example, in a 2020 Annals of Statistics paper, Samuel B. Hopkins introduces the “combinatorial center.” His work takes a similar lens to the problem of efficient mean estimation with sub-Gaussian rates, showing the broad applicability of these algorithmic ideas. But the most interesting connection is yet to come.
It has been repeatedly demonstrated that, without special care, statistics are prone to leaking details about individual data points. This can be problematic if the data corresponds to sensitive information about people.
To alleviate these issues, Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith introduced the definition of differential privacy (DP) in 2006. Roughly speaking, an algorithm is DP if its distribution over outputs is insensitive to adding or removing any data point from its input. Operationally, this implies that one can’t infer much about individual input data points by looking at the output of a DP algorithm.
Intuitively, this sounds like a type of robustness. After all, for both robustness and privacy, an estimator should not be too sensitive to a small number of points in the dataset. However, fully satisfying formalizations of this connection have been elusive.
How do you privately estimate the mean of a multivariate distribution? In a COLT 2020 paper with Vikrant Singhal and Jonathan Ullman, we give an estimator based on adding Gaussian noise to the empirical mean. Since this simple algorithm is just noise addition, it’s clearly computationally efficient. And it achieves the optimal error one can hope for. The caveat: it only achieves a relaxed flavour of DP, known as (ε,δ)-differential privacy, or approximate DP. A variant of this algorithm instead adds Laplace noise: this estimator is again computationally efficient, and achieves the strongest notion of DP (pure, or (ε,0)-differential privacy). However, the error guarantee is weakened by a dimension-dependent factor. Yet another algorithm in our paper is based on the exponential mechanism: rather than adding noise, this algorithm introduces privacy by randomly sampling a solution from a judiciously chosen distribution. This offers strong DP and optimal error guarantees — but the sampling step makes it inefficient to compute.
To summarize, we have algorithms that enjoy any two of the following:
Is there an estimator which achieves all three at the same time?
In a joint work with Hopkins and University of Waterloo MMath ’23 Mahbod Majid at STOC 2022, we answered this question in the affirmative, by designing such an algorithm. We gave a variant of the previously mentioned estimator based on the exponential mechanism. To overcome previous computational barriers associated with the sampling procedure, we use algorithms for efficient log-concave sampling. But critically, at the heart of our algorithm, we employ Hopkins’ combinatorial center algorithm: the same one used for obtaining sub-Gaussian rates, which is also effective in robust settings. This shows an unexpected technical algorithmic connection between robustness, sub-Gaussian rates, and privacy.
Can we go further? Are there more general connections between robustness and privacy? In a STOC 2023 paper with Hopkins, Majid, and Shyam Narayanan (and, simultaneously, by Hilal Asi, Ullman, and Lydia Zakynthinou in an independent work at ICML 2023), we show that this is the case: robustness implies privacy. In particular, we give a black-box transformation that uses a robust estimator to construct a private estimator. In many cases, the optimality of the robust estimator implies the optimality of the private estimator. This connection shows strong conceptual connections between robustness and privacy. Some partial converses are known (in a classic STOC 2009 work by Dwork and Jing Lei, as well as a more recent NeurIPS 2022 paper by Kristian Georgiev and Hopkins), but none seem like the final word on this matter.
Overall, these results give insight into how to design statistical procedures under important considerations like robustness and privacy. Perhaps surprisingly, the technical ideas needed in both cases are quite similar. This hints towards deeper connections between the two areas, which may serve us well as we work towards algorithms that are better suited for the many challenges of the real world.