By Ian Gormely
“The internet never forgets” was once thought to be a foundational truism of life online. But a spate of legal rulings has challenged that notion, and the ability to erase a person’s online footprint is emerging as a bedrock principle of digital privacy rights.
Yet while legislation like the European Union’s “Right to be Forgotten” regulation lays out important digital privacy principles, it fails to offer technical solutions for how such erasure might be achieved in a hyper-connected online world where a single post can be aggregated across multiple channels.
Likewise, deleting someone’s data from a trained AI model is a time-consuming process. It can cost companies valuable resources and delay action on someone’s reasonable request to have their information scrubbed clean.
To address this problem, Nicolas Papernot, a Vector Faculty Member, Canada CIFAR AI Chair and Assistant Professor in the Department of Electrical & Computer Engineering at the University of Toronto, and his team looked at how models could be trained differently, so that these requests are easier to process and a model can be updated without fundamentally altering it. “That’s the guarantee that we want to provide users,” he says.
AI models and algorithms are created by using millions of data points from thousands of people. “You have to assume that these models are direct by-products of the data,” says Papernot. During the training process, where algorithms learn by combing through examples or data points, every data point is used to update all of the model parameters. Every future update will depend on that specific data point. “So if you delete that data point, you should also delete the models.”
Of course, scrapping a model wholesale is generally not an option for researchers or businesses. So Papernot and his co-authors looked at different ways data could be presented to a model so that a deletion request could be handled with small tweaks instead.
Their paper, “Machine Unlearning,” which was recently accepted to the IEEE Symposium on Security and Privacy, the leading conference for computer security and electronic privacy, offers a two-pronged approach. First, they “shard” the data, creating many smaller models as opposed to one big one, thereby restricting the influence of any one data point. “We then ask the different models to vote on the label they predict,” says Papernot. “We count how many votes each class received and output the class that received the most number of votes.”
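To make the sharding-and-voting idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors’ implementation: the toy nearest-centroid “model,” the `ShardedEnsemble` class, the `(owner, value, label)` record format and the round-robin split are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict

def train_centroids(points):
    """Toy 'model' (an assumption for this sketch): mean value per label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _owner, x, label in points:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_one(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

class ShardedEnsemble:
    def __init__(self, data, num_shards):
        # Round-robin split: every data point lives in exactly one shard.
        self.shards = [data[i::num_shards] for i in range(num_shards)]
        # One small model per shard instead of one big model.
        self.models = [train_centroids(s) for s in self.shards]
        self.retrain_count = 0

    def predict(self, x):
        # Each non-empty shard model votes; the most-voted label wins.
        votes = Counter(predict_one(m, x) for m in self.models if m)
        return votes.most_common(1)[0][0]

    def unlearn(self, owner):
        # Only shards that actually held this person's data are retrained.
        for i, shard in enumerate(self.shards):
            kept = [p for p in shard if p[0] != owner]
            if len(kept) != len(shard):
                self.shards[i] = kept
                self.models[i] = train_centroids(kept)
                self.retrain_count += 1

data = [
    ("alice", 1.0, "low"), ("bob", 9.0, "high"), ("carol", 2.0, "low"),
    ("dave", 8.0, "high"), ("erin", 1.5, "low"), ("frank", 9.5, "high"),
]
ensemble = ShardedEnsemble(data, num_shards=3)
print(ensemble.predict(2.5))   # low
ensemble.unlearn("dave")
print(ensemble.retrain_count)  # 1 (only one of the three shard models was rebuilt)
```

In this toy setup, removing one person’s records touches a single shard, so the other two models are left exactly as they were.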
Then they “slice” the shards and present the data to the model in small increments, increasing the amount of data each time while creating checkpoints along the way. “So when someone asks us to unlearn their data, we can revert to the checkpoint that was saved before we started analyzing their data,” saving time and resources.
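The slicing-and-checkpointing step can be sketched in the same spirit. This is a simplified illustration, assuming a toy model whose state is just a running sum and count per label; the `SlicedShard` class, the `update` function and the `(owner, value, label)` record format are hypothetical names introduced here, and the paper’s actual training procedure is more involved.

```python
def update(state, points):
    """Toy incremental training step: fold new points into per-label sums."""
    state = dict(state)  # copy so the earlier checkpoint stays untouched
    for _owner, x, label in points:
        total, count = state.get(label, (0.0, 0))
        state[label] = (total + x, count + 1)
    return state

class SlicedShard:
    def __init__(self, slices):
        self.slices = [list(s) for s in slices]
        # checkpoints[k] is the model state after training on the first k slices.
        self.checkpoints = [{}]
        self.slices_trained = 0
        self._train_from(0)

    def _train_from(self, start):
        # Discard checkpoints taken after `start`, then replay the later slices.
        del self.checkpoints[start + 1:]
        for s in self.slices[start:]:
            self.checkpoints.append(update(self.checkpoints[-1], s))
            self.slices_trained += 1

    @property
    def model(self):
        return self.checkpoints[-1]

    def unlearn(self, owner):
        hits = [k for k, s in enumerate(self.slices)
                if any(p[0] == owner for p in s)]
        if not hits:
            return
        first = hits[0]
        self.slices = [[p for p in s if p[0] != owner] for s in self.slices]
        # Revert to the checkpoint saved before this person's data was first seen.
        self._train_from(first)

shard = SlicedShard([
    [("alice", 1.0, "low"), ("bob", 9.0, "high")],
    [("carol", 2.0, "low"), ("dave", 8.0, "high")],
    [("erin", 3.0, "low"), ("frank", 10.0, "high")],
])
shard.unlearn("erin")
print(shard.slices_trained)  # 4: three slices initially, plus one replayed
```

Because the deleted record sat in the last slice, only that slice is replayed; had it been in the first slice, the whole shard would be retrained, which is the worst case for this scheme.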
Papernot, whose research is focused on areas of privacy and security in machine learning, is not alone in looking at ways to tackle this problem. As AI is integrated into so many facets of society, unlearning data becomes a growing issue for companies.
But he and his team were among the first to look at it, and their approach is more general than those of many of their peers. “We wanted to be completely agnostic as to the kind of algorithm people are using so that you can just throw it in with whatever pipeline you have.”
The goal, he explains, is to make it practical for organizations to receive and process these requests quickly. “If the model takes a week to re-train, that slows down the speed at which an organization will handle these requests. We’re saying that you can do that more regularly and [at a] smaller cost.”