How Vector Researcher Xi He uses differential privacy to help keep data private

February 2, 2024


By Michael Barclay

“I’m always fascinated by how technologies can improve our lives,” says Vector Faculty Member Xi He. It’s been the focus of her studies since she was a grad student at Duke University in North Carolina. “AI is revolutionizing our lives in many aspects. But privacy is one very important aspect that we need to address: whether my personal data has been used in this process properly. It’s a very important problem to me.”

It’s easy to be cynical about digital privacy: every time we blindly sign some terms-and-agreements page, who knows where our personal data will end up? Whether it’s social media or a consumer rewards program, there are myriad reasons to be suspicious in the age of “surveillance capitalism,” to borrow a term from bestselling writer Shoshana Zuboff. “If your personal information is in the wrong hands,” says Xi, “you may get rejected from certain opportunities, like mortgage applications, insurance and so on. We really want the right data to be in the right hands.” 

“A lot of times, I do sign ‘yes,’” Xi continues, referring to those privacy agreements that few people take the time to read, “because I need to make use of their service. It seems there’s no choice, because it’s either in or out, right? But we can do better so that even though my data is in, they are going to use it properly – so that my sensitive information won’t be leaked to other parties who are not supposed to view my data.” 

One way to help limit the chances of this happening is a privacy-enhancing technology called differential privacy.

Privacy vs. Utility

In analyzing data sets in an AI system, there is always a trade-off between privacy (your info) and utility (what the analyst wants to find out). Previous models for data analysts worked on maximum utility, subject to privacy requirements. But differential privacy, a privacy-preserving framework, turns the trade-off between privacy and utility into a clear optimization problem. A firewall is put up between the data analyst and the private data, and it introduces some “noise” into the answers.

Take a grocer’s rewards card as an example. “If I want to learn the average amount customers spend in this grocery, I must ensure that whoever sees the output would not be able to tell whether your data is being used in this computation or not,” Xi explains. “Let’s take, for example, the most expensive product they sell in the grocery store. My question is: how many people buy that thing? With differential privacy, we’re applying some randomized algorithm to process the answer. 

“If you are the only person who buys it, the answer is ‘one’ – but [a differential privacy algorithm] adds some noise to the answer. When people look at this noisy answer, they won’t know whether that figure of ‘one’ is really coming from you, or coming from the noise. It gives some randomness. But regardless, whether you’re really in the database or not – or if your data has been used in the computation process – the final number is pretty small. It doesn’t have to be exactly ‘one,’ but they know that it’s a small number: not too many people are buying this product.”
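The noisy count Xi describes can be sketched with the Laplace mechanism, one common way to build a differentially private counting query. This is a minimal illustration, not the algorithm used in her systems: the purchase log, the item names, the `noisy_count` helper and the choice of `epsilon=0.5` are all assumptions made for the example, and it assumes each customer contributes at most one record.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match `predicate`."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Assumption: each customer contributes at most one record, so adding or
    # removing one customer changes the count by at most 1 (sensitivity 1).
    # Laplace noise with scale 1/epsilon then gives epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical purchase log: only one customer bought the most expensive item.
purchases = [
    {"customer": "you", "item": "truffle"},
    {"customer": "a", "item": "bread"},
    {"customer": "b", "item": "milk"},
]

answer = noisy_count(purchases, lambda p: p["item"] == "truffle", epsilon=0.5)
print(f"noisy count: {answer:.2f}")
```

Each run prints a different value near the true count of one, so an observer cannot tell whether a single customer's purchase was part of the computation. A smaller `epsilon` means more noise and stronger privacy.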

More privacy—but less accuracy. Xi is okay with that. In the algorithms she’s working on, “I want the answer to have an error of plus or minus 10. Within this range is a certain confidence level. Our system will try to find the best differential-privacy algorithm, with minimum privacy cost, that can achieve that accuracy goal. This is different from the standard design of a differentially private algorithm, which worked on maximum utility, subject to privacy requirements.”
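For a single mechanism, the accuracy-first idea can be worked out in closed form. The sketch below assumes the Laplace mechanism, a counting query with sensitivity 1, and a 95% confidence level (the article gives only the error bound of plus or minus 10). It is not the search performed by systems such as APEx, which choose among several mechanisms; the function name `min_epsilon_for_accuracy` is hypothetical.

```python
import math
import random


def min_epsilon_for_accuracy(alpha, beta, sensitivity=1.0):
    """Smallest epsilon at which Laplace noise satisfies P(|noise| > alpha) <= beta."""
    if alpha <= 0 or not 0 < beta < 1:
        raise ValueError("alpha must be positive and beta must lie in (0, 1)")
    # Laplace noise with scale b has P(|noise| > alpha) = exp(-alpha / b).
    # Setting that equal to beta gives b = alpha / ln(1 / beta),
    # and the privacy cost is epsilon = sensitivity / b.
    return sensitivity * math.log(1.0 / beta) / alpha


# Accuracy goal: error within plus or minus 10, with an assumed 95% confidence.
epsilon = min_epsilon_for_accuracy(alpha=10, beta=0.05)
scale = 1.0 / epsilon

# Empirical check: draw Laplace noise at that scale and count how often
# it stays inside the requested error bound.
rng = random.Random(0)
draws = [
    rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    for _ in range(20000)
]
within = sum(abs(d) <= 10 for d in draws) / len(draws)
print(f"epsilon = {epsilon:.3f}, share of answers within +/-10: {within:.3f}")
```

Tightening the error bound or raising the confidence level increases the required `epsilon`, which is the privacy cost of the extra accuracy.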

Generalizing differential privacy systems

Xi, a Canada CIFAR AI Chair and professor at the David R. Cheriton School of Computer Science at the University of Waterloo, co-wrote her first paper on differential privacy in 2019. By that point, the bigger tech companies were already applying the concept in certain apps, “but it never scaled up to more than that,” she says. “It basically was not scalable at that point in time. We really want to design more general systems or frameworks for people to use: database systems you can just plug into your system and run anything on your sensitive data. You want the system to provide a trustworthy answer to the user.” To this end, she and her team have published a series of papers showcasing systems that provide high-level language support for specifying privacy requirements for the data owner (PrivateSQL, DProvDB) and accuracy requirements for the data analysts (APEx, CacheDP), and that optimize the privacy-utility trade-offs on behalf of the system users.

Just because a company has your private data doesn’t necessarily mean they’re sharing it with third parties. “Your data can sit in there without any privacy leakage, that’s totally fine,” says Xi. “But we make use of it for more interesting, data-driven applications, to ensure there’s no malicious behaviour, and to give people a better idea of how marketing can be improved – without knowing the exact behaviour of a particular person.”

Differential privacy is not yet required by law or regulation. It’s slowly being recognized by government agencies, including the U.S. Census Bureau, which used it when publishing its 2020 data. But it’s far from a mainstream term, and open to confusion. “People do not understand properly what differential privacy really guarantees,” says Xi. For instance, because it relies on randomness to protect privacy, it does not give an exact answer to an analysis, but the scale of the error can be quantified. “It’s very important to explain the trustworthiness of these algorithms.”

Privacy in healthcare

To state the obvious, AI is moving fast, and the clock is ticking on building privacy-enhancing technology (PET) safeguards like differential privacy into it. “I don’t know how much time I have,” says Xi. “I just know that I try my best.” Her work at Vector connects her to members of Vector’s FL4health and AI Engineering teams, with whom she is developing a privacy-preserving federated learning framework for healthcare data.

“Professor He has been an exceptional leader and collaborator in the development of efficient federated learning algorithms with strong privacy guarantees, which are of the utmost importance when working with real-world clinical data,” says David Emerson, an applied machine learning scientist at Vector. “We look forward to continuing to collaborate with her team and pushing the boundaries of ML methods in healthcare.”

Xi was initially drawn to the subject not because of urgency, but for relatively mundane, intellectual reasons: it’s not like her credit card was hacked or someone broke into her phone. “I just think it’s a very important problem, and I can make use of some of my knowledge and skills to improve on this aspect of it. I’m not alone. There are great people working on this problem together.” 

