Vector Institute researchers reconvene for the second edition of the Machine Learning Privacy and Security Workshop
September 16, 2024
The second edition of Vector’s Machine Learning Security and Privacy workshop brought together a number of Vector Faculty Members, Faculty Affiliates, Postdoctoral Fellows, and researchers from the wider Vector research community. At the July event, participants discussed ML security and privacy innovations, emerging trends, practical tools and techniques, and research findings.
Advances in technology can present tremendous opportunities along with potentially equally significant risks. The workshop’s primary goal was to build a community of researchers at the intersection of ML systems and dependable and secure computing. Rapid advances in the field have prompted calls not only to create leading-edge tools, but also to ensure that the continued advancement of technology is privacy-preserving, safe, and responsible.
“Problems with robustness, privacy, uncertainty quantification, and more are critical issues that preclude high-stakes real world deployment,” says Vector Faculty Member Gautam Kamath, workshop organizer. “Sometimes specific threats are marginalized and disregarded. But the underlying technical vulnerabilities recur time and time again, and with more serious consequences for failure each time. It is important that the Vector Institute community, in tandem with the broader machine learning community, works together to solve these problems.”
Vector Faculty Affiliate Hassan Ashtiani presenting at the workshop.
A crucial aspect of machine learning deployment is its potential vulnerability to adversarial attacks. In the last 10 years, a rich learning theory literature has emerged to study the mathematical foundations of robust learning (learning in the presence of an adversary at deployment). However, many theoretical works (in both robust and non-robust learning) give guarantees that do not hold once we require the learner to be a computable algorithm, that is, a procedure that always halts on all potential samples. Vector Postdoctoral Fellow Pascale Gourdeau gave a talk on the computability of robust learning, based on joint work with incoming Vector Postdoctoral Fellow Tosca Lechner and Vector Faculty Affiliate Ruth Urner. They showed that adding this simple computability requirement changes the landscape of robust learning considerably. They also introduced a complexity measure that lower bounds, but does not upper bound, the number of samples needed to learn in this framework.
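For context, the robust learning setting studied in this line of work can be formalized through the robust risk: the learner is penalized whenever an adversary can perturb an input within an allowed set so that the prediction becomes wrong. A standard formulation from the robust learning literature (shown here for orientation, not as the talk’s specific definitions) is:

```latex
% Robust risk of a hypothesis h with respect to a distribution D over
% labelled examples (x, y) and a perturbation set U(x) around each input x.
\[
  \mathrm{R}_{U}(h) \;=\; \Pr_{(x,y)\sim D}\bigl[\, \exists\, z \in U(x) \ :\ h(z) \neq y \,\bigr]
\]
% Setting U(x) = \{x\} recovers the usual (non-robust) 0-1 risk; robust
% learning asks for a hypothesis whose robust risk is close to optimal.
```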
Incoming Vector Postdoctoral Fellow Tosca Lechner, who is also a PhD candidate at the University of Waterloo, discussed her work on robust learning with uncertain manipulation capabilities. One challenge facing learned classifiers after deployment is that incoming instances might adapt their feature presentation, or even intentionally mislead the classifier by changing the representation in a way that is imperceptible to humans. The fields of strategic classification and adversarially robust learning deal with these settings. In both fields, the manipulation capabilities, or regions of imperceptible change, are often assumed to be known by the learner. In reality, prior knowledge about these capabilities is more limited.
Based on joint work with Vector Faculty Member and Canada CIFAR AI Chair Shai Ben-David, Vector Faculty Affiliate Ruth Urner and Vinayak Pathak, the talk introduced the notion of adversarially robust and strategically robust PAC learning for a class of plausible candidate manipulation structures and investigated what prior knowledge of the candidates allowed for learning guarantees. For the adversarial setting, abstention as well as additional oracle-access can yield learnability guarantees. Learning the manipulation capabilities from distribution shifts can make strategically robust learning feasible.
Uncertainty quantification (UQ) in deep neural networks (DNNs) plays a crucial role in safety-critical applications, such as medical diagnosis and robotics. A simple point prediction of a DNN, without reporting the model’s confidence in its prediction, can be misleading. For example, a DNN classifier of a lung CT image may predict that a patient is healthy among several possible outcomes, including healthy, pulmonary fibrosis, lung cancer, pneumonia, COPD, and asthma. In such a case, a physician considering AI-assisted diagnostic imaging may request additional diagnostic testing if the model can state that its prediction of the patient being healthy comes with high uncertainty (only 55% confidence) versus a case where the model is 80% confident. In his talk, Vector Faculty Affiliate Reza Samavi, who is also an Associate Professor at Toronto Metropolitan University, presented a conformal prediction method for DNN models where a point prediction is replaced with a set of predictions such that the model is highly confident (say, 90%) that the true outcome is in the set.
Instead of the model simply predicting that the patient is healthy with low confidence, it can predict with high confidence that the outcome is one of the following: healthy, pneumonia, or pulmonary fibrosis. This suggests to the physician that further inquiry may be warranted. In this way, the set translates the heuristic notion of uncertainty to a rigorous one with several advantages compared to distribution-dependent methods of uncertainty quantification, such as MC-Dropout. This approach is distribution-free, can work on any black box model, and makes almost no assumptions. In particular, Samavi’s research showed that with a negligible overhead, the optimal size of the prediction set is achievable. Using evidence already available in the logit layer of a DNN, the classifier can be calibrated post-deployment, particularly when the model is deployed to an out-of-distribution environment, such as when the diagnostic imaging model is trained on a North American population, but is deployed elsewhere. Uncertainty quantification leads to building models with less bias and more fairness.
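To illustrate how such a prediction set can be constructed, the sketch below implements split conformal prediction from softmax scores. The six diagnostic class names, the 90% coverage level, and the synthetic calibration data are illustrative assumptions for this example, not details of Samavi’s method.

```python
import numpy as np

def conformal_quantile(cal_softmax, cal_labels, alpha=0.1):
    """Compute the score threshold from a held-out calibration set.

    Uses the standard split-conformal score (1 - softmax probability of
    the true class) with the finite-sample quantile correction, so that
    prediction sets cover the true label with probability ~ 1 - alpha.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_softmax[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_set(test_softmax, qhat):
    """Return the indices of all classes kept in the prediction set."""
    return np.where(1.0 - test_softmax <= qhat)[0]

# Toy example with six diagnostic classes (illustrative only).
classes = ["healthy", "pulmonary fibrosis", "lung cancer",
           "pneumonia", "COPD", "asthma"]
rng = np.random.default_rng(0)
cal_softmax = rng.dirichlet(np.ones(6), size=500)  # stand-in for model outputs
cal_labels = rng.integers(0, 6, size=500)          # stand-in for true labels
qhat = conformal_quantile(cal_softmax, cal_labels, alpha=0.1)

test_softmax = rng.dirichlet(np.ones(6))           # one new patient's scores
print([classes[i] for i in prediction_set(test_softmax, qhat)])
```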
Vector Faculty Affiliate Hassan Ashtiani, who is also an Associate Professor at McMaster University, talked about solving the classic problem of hypothesis selection under the constraint of local differential privacy (LDP). LDP has been the preferred model of privacy in several sensitive applications used by companies like Apple, Google, and Microsoft. Unlike the central model of privacy, in the LDP model individuals do not need to trust a central entity to collect and handle private data. Instead, privacy is enforced locally, for example on individuals’ personal devices. LDP is also suitable for settings like federated learning, where learning is done in a distributed way. Ashtiani also talked about his recent work with fellow Vector Faculty Affiliate Shahab Asoodeh, who is also an Assistant Professor at McMaster University, and Alireza Pour on the basic problem of hypothesis selection in the LDP model. Of note was the discovery that any sample-optimal algorithm for this problem would require multiple rounds of interaction.
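To give a flavour of how privacy can be enforced locally, the sketch below implements randomized response, a classic textbook ε-LDP mechanism for a single binary attribute. It is included only to illustrate the local model; it is not the hypothesis selection algorithm discussed in the talk.

```python
import numpy as np

def randomized_response(bit, epsilon, rng):
    """Report a private bit under epsilon-local differential privacy.

    The true bit is reported with probability e^eps / (e^eps + 1) and
    flipped otherwise, so no single report reveals much about its owner.
    """
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit

def debiased_mean(reports, epsilon):
    """Aggregate noisy reports into an unbiased estimate of the true mean."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)

# Example: estimate the fraction of users with a sensitive binary attribute.
rng = np.random.default_rng(1)
true_bits = rng.integers(0, 2, size=10_000)
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
print(debiased_mean(reports, epsilon=1.0), true_bits.mean())
```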
Vector Faculty Affiliate Sasho Nikolov presenting at the workshop.
Private statistical estimation aims to compute accurate estimates about a population without revealing private information about any individual. A fundamental problem in this area is mean estimation, in which each individual’s data is encoded as a high dimensional vector (a list of numbers) and the goal is to estimate the average of the vectors. A basic method for private mean estimation is to compute the mean and then ensure privacy by adding carefully correlated noise drawn from the normal distribution to each coordinate. This method has the advantage of being unbiased — the noise is equally likely to increase or decrease the true mean in any direction. In his talk, Vector Faculty Affiliate Aleksandar Nikolov, who is also an Associate Professor at the University of Toronto, showed how to optimize the correlations between the noise coordinates to minimize the error, and proved that, with these optimal correlations, adding normally distributed noise is the most accurate unbiased private mean estimation method.
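For intuition, here is a minimal sketch of the baseline approach with independent (uncorrelated) Gaussian noise, the method whose noise correlations the talk optimizes; the clipping bound, privacy parameters, and synthetic data are illustrative assumptions, not the talk’s construction.

```python
import numpy as np

def private_mean(data, epsilon, delta, clip_norm=1.0, rng=None):
    """Estimate the mean of d-dimensional vectors with the Gaussian mechanism.

    Each row of `data` is one individual's vector, clipped to L2 norm
    `clip_norm` so that replacing one person changes the mean by at most
    2 * clip_norm / n. Independent Gaussian noise is then added to each
    coordinate; the correlated-noise method shapes this covariance instead.
    """
    rng = rng or np.random.default_rng()
    n, d = data.shape
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    clipped = data * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    sensitivity = 2.0 * clip_norm / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)

# Example usage on synthetic data.
rng = np.random.default_rng(2)
x = rng.normal(0.3, 0.1, size=(5_000, 10))
print(private_mean(x, epsilon=1.0, delta=1e-5, rng=rng)[:3])
```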
Concerns about the privacy leakage of end-user data have hampered the use of sophisticated models deployed on the cloud, such as machine learning as a service. One way to mitigate this leakage is to add local differentially private noise to sensitive queries before sending them to the cloud. But this degrades the cloud model’s utility as a side effect, since it generates potentially erroneous outputs on the noisy queries. Vector Faculty Affiliate David Lie, who is also a Professor at the University of Toronto, demonstrated that, instead of accepting this utility loss, one can deploy a trustworthy model on a user-owned device and train it on the noisy, potentially incorrect labels the cloud model returns for the noisy queries; by aggregating the knowledge contained in these responses, the local model can recover the correct labels.
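The sketch below illustrates the general idea under simplifying assumptions: queries are perturbed with Laplace noise before being sent to a stand-in cloud classifier, and a local model is trained on the returned, possibly incorrect labels. The stand-in cloud model, the noise scale, and the scikit-learn classifier are all assumptions made for illustration; this is not Lie’s actual system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def noisy_queries(x, epsilon, sensitivity=1.0, rng=np.random.default_rng()):
    """Add per-coordinate Laplace noise to queries before sending them to the cloud."""
    return x + rng.laplace(0.0, sensitivity / epsilon, size=x.shape)

rng = np.random.default_rng(3)

# Stand-in "cloud model": a fixed linear classifier we can only query.
w_cloud = np.array([1.0, -1.0])
cloud_predict = lambda q: (q @ w_cloud > 0).astype(int)

# Local sensitive data that never leaves the device un-noised.
x_local = rng.normal(0.0, 1.0, size=(2_000, 2))

# Send noisy queries; collect potentially incorrect labels from the cloud.
labels_from_cloud = cloud_predict(noisy_queries(x_local, epsilon=2.0, rng=rng))

# Train a local model on (clean input, noisy label) pairs; with enough
# queries the aggregate signal dominates the label noise.
local_model = LogisticRegression().fit(x_local, labels_from_cloud)
print(local_model.score(x_local, cloud_predict(x_local)))  # agreement with clean labels
```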
Graduate students present their research during a poster session.
The standard setting in ML considers a centralized dataset processed in a tightly integrated system. However, in the real world, data is often distributed across many parties. Directly sharing data may be prohibited due to privacy concerns. This is where federated learning (FL) comes into play. FL enables global models to be collaboratively trained while keeping data in local sites. Only the local model information is shared.
In typical FL, a server coordinates model training among local sites — called clients — and requires them to share model parameters/weights with the server for model aggregation. With the FedAvg algorithm, this aggregation is implemented by averaging the model weights (a minimal sketch appears after this paragraph). However, in traditional FL, relying on a centralized server may increase vulnerability and requires trust in a single party. Additionally, local model architectures may vary due to differing local computational resources, and data distributions may shift across clients. Considering these practical settings, Vector Faculty Member and Canada CIFAR AI Chair Xiaoxiao Li, who is also an Assistant Professor at the University of British Columbia, proposed a novel decentralized FL technique by introducing synthetic anchors. Called DeSA, the technique relaxes restrictions and assumptions to enhance knowledge transfer in FL. Specifically, Li proposed sharing each local site’s synthetic data with differential privacy protections. Called local anchors, these capture local data distributions and are shared among clients before FL training. Clients then aggregate the shared local anchors into global anchors. During FL, clients use both local data and the global anchors to update local models based on the training objective. As for the information to share, clients only need to exchange model output logits with each other, and this exchange can be implemented through knowledge distillation. Furthermore, the global anchors act as a regularizer that harmonizes the features learned by different clients. The overall pipeline, shown in Figure 1, enables clients with heterogeneous data and models to collaborate effectively under privacy considerations.
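For reference, the FedAvg aggregation step mentioned above amounts to a data-size-weighted average of client parameters, as in the minimal sketch below. The client counts and tensor shapes are illustrative; DeSA itself replaces this parameter sharing with anchor- and logit-based knowledge exchange.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client model parameters by a data-size-weighted average.

    `client_weights` is a list of per-client parameter lists (same shapes
    across clients); `client_sizes` holds each client's number of examples.
    """
    total = float(sum(client_sizes))
    coeffs = [n / total for n in client_sizes]
    return [
        sum(c * layer for c, layer in zip(coeffs, layers))
        for layers in zip(*client_weights)
    ]

# Example: three clients, each holding two parameter tensors.
rng = np.random.default_rng(4)
clients = [[rng.normal(size=(4, 2)), rng.normal(size=(2,))] for _ in range(3)]
global_weights = fedavg(clients, client_sizes=[100, 300, 600])
print([w.shape for w in global_weights])
```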
Figure 1: Shared global anchor facilitates secure information exchange among collaborators. (Source: DeSA ICML presentation slides; Huang, C. Y., Srinivas, K., Zhang, X., & Li, X.*, “Overcoming Data and Model Heterogeneities in Decentralized Federated Learning via Synthetic Anchors,” 2024 International Conference on Machine Learning)
Masoumeh Shafieinejad, an Applied Machine Learning Scientist on Vector’s AI Engineering team, spoke about a portfolio of projects focusing on privacy enhancing technologies (PETs), work done in collaboration with Vector stakeholders including Vector Faculty Member and Canada CIFAR AI Chair Xi He, Amii, and RBC.
Focused on promoting privacy by design in the finance and health sectors, Shafieinejad’s work on synthetic data generation for multi-table tabular data has been positively received for adoption by industry. She also advances secure and privacy-preserving collaboration among organizations in the context of multi-party synthetic tabular data generation and federated learning for health data.
As an applied researcher, it is equally important to Shafieinejad to investigate the factors that facilitate or obstruct industry’s adoption of privacy technologies. This investigation is of particular significance given the Canadian government’s upcoming Bill C-27, which aims to enhance privacy protections and address AI risks. Shafieinejad pointed out the need to translate technical privacy evaluation outcomes into meaningful risk assessments for industry and government. She ended her talk by calling for further discussion and collaboration on the topic.
Masoumeh Shafieinejad, Vector Applied Machine Learning Scientist – Privacy Enhancing Technologies, presents at the workshop.
Multidisciplinary studies and multi-stakeholder discussions are essential to address the complex challenges generated by AI’s advancement. Their goal: practical and reliable solutions that ensure safety, foster an agile environment, and enable the secure deployment of innovative AI applications while preserving data privacy.
Despite the research progress described above, significant questions remain within the machine learning security and privacy domain. For instance, will future advancements allow us to train machine learning models using synthetic data? And how can we more effectively integrate privacy and confidentiality guarantees into the design of the next generation of deep learning models?
Watch talks from the first edition of the Vector Machine Learning Security and Privacy Workshop.
Vector Faculty Member and Canada CIFAR AI Chair Gautam Kamath breaks down the latest developments in robustness and privacy.