Vector community explores data privacy research at Machine Learning Privacy and Security Workshop

By Natasha Ali

The recent Machine Learning Privacy and Security Workshop brought together a number of distinguished Vector Institute Faculty Members, Faculty Affiliates and Postdoctoral Fellows. Held at the Vector Institute office in July, the event saw researchers discuss emerging trends and research findings in machine learning security and privacy, as rapid developments prompt calls for immediate measures to regulate model training and protect private information and user data.

Reducing machine learning models’ susceptibility to attacks

Vector Faculty Member Nicholas Papernot presented a unique take on machine learning model theft and appropriate defense mechanisms. As an assistant professor in the Department of Electrical and Computer Engineering at the University of Toronto and a Canada CIFAR AI Chair, Papernot has made significant contributions to the field of machine learning security, pioneering the development of algorithms that facilitated machine learning privacy research.

He focused on prediction-based machine learning models, which are particularly vulnerable to adversarial attacks – attacks that aim to misuse machine learning models in a way that results in consequences unintended by the model owner.

Model extraction, a common type of adversarial attack, enables attackers to mimic and thus “steal” victim machine learning models, bypassing the costly process of dataset curation. By gaining access to model outputs and their prediction processes, an attacker can reproduce models.

Beyond model extraction, machine learning comes with a broad variety of adversarial risks. Attackers can also modify a machine learning model’s training processes such that it falsely interprets malicious information as benign. In forcing models to learn harmful data, attackers can disrupt the chain of command in deep neural networks and transfer falsified training data from one model to another, creating a domino effect of malicious activity.

Papernot pointed out that common defenses against model extraction attacks involve blocking the attacker’s access to the original outputs to disrupt the training of the attacker model. For example, the owner of a victim model can identify whether its model was stolen by embedding “watermarks” into the training data. Unfortunately, this inevitably comes at the cost of model utility. Alternatively, one can use membership inference techniques to compare the suspected machine learning model against the original model to determine if training data and policies were stolen.

However, these approaches are all reactive: they check whether a model was stolen after the fact. Instead, Papernot suggested that one can take preemptive measures to negatively manipulate the cost-benefit trade-off of stealing a machine learning model, thus disincentivizing model theft.

To fend off model extraction attacks before they succeed, Papernot proposed a proactive mechanism that increases the computational costs of model theft without compromising the original models’ functionality or output.

Using an elaborate detection scheme, the victim model can generate puzzles of various difficulty levels that the attacker has to solve to gain access to the model output. As the first defense system to preserve the original model’s accuracy, this method enforces high computational costs for model extraction, thereby deterring attacks.
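The summary above does not spell out how the puzzles are built. As a rough illustration only, the sketch below assumes a hashcash-style proof-of-work puzzle: before a prediction is released, the caller must find a nonce whose SHA-256 hash starts with a given number of zero bits. The function names and the `suspicion` score (a stand-in for whatever the detection scheme outputs) are hypothetical and are not taken from Papernot's work.

```python
import hashlib
import itertools


def make_puzzle(query_id: str, difficulty_bits: int) -> dict:
    """Describe a puzzle: find a nonce whose hash has this many leading zero bits."""
    return {"challenge": query_id, "difficulty_bits": difficulty_bits}


def _leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a hash digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(puzzle: dict, nonce: int) -> bool:
    """Cheap check for the model owner: a single hash evaluation."""
    digest = hashlib.sha256(f"{puzzle['challenge']}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= puzzle["difficulty_bits"]


def solve(puzzle: dict) -> int:
    """Expensive search for the caller: try nonces until one verifies."""
    for nonce in itertools.count():
        if verify(puzzle, nonce):
            return nonce


def difficulty_for(suspicion: float, max_bits: int = 16) -> int:
    """Map a hypothetical suspicion score in [0, 1] to a puzzle difficulty.

    Assumption: ordinary users score near 0 and get trivial puzzles, while
    query patterns that look like extraction score higher.
    """
    suspicion = min(max(suspicion, 0.0), 1.0)
    return round(suspicion * max_bits)
```

In this sketch, each additional difficulty bit doubles the expected number of hashes a caller must compute, while checking a submitted nonce costs the model owner a single hash. The model's predictions themselves are never altered, which is the property the talk emphasized.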

“If you calibrate the puzzle, then the attacker has to spend a lot more compute power and has to wait for longer in order to steal your machine learning model.”

Nicholas Papernot

Vector Faculty Member, Canada CIFAR AI Chair

Papernot believes that this groundbreaking proactive defense mechanism can reduce theft preemptively by restricting access to machine learning models before malicious attackers can infiltrate them.

Developing data poisoning algorithms

Expanding on data manipulation attacks, Vector Faculty Member and Canada CIFAR AI Chair Yaoliang Yu led an informative discussion on data poisoning in neural network models.

In these attacks, a malicious party can inject “poisonous” data into the training datasets of machine learning models. Since modern machine learning techniques rely on large amounts of training data, data poisoning attacks pose a serious threat that compromises the validity of model outputs and leaves them vulnerable to subsequent attacks.
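To make the idea concrete, here is a deliberately tiny sketch of how injected training points can degrade a model. It assumes a one-dimensional nearest-centroid classifier and hand-picked data, all of which are illustrative assumptions; it is not one of the attacks discussed in the talk.

```python
def fit_centroids(points):
    """Train a nearest-centroid classifier: one mean value per label."""
    sums, counts = {}, {}
    for x, label in points:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, points):
    """Fraction of labeled points the classifier gets right."""
    return sum(predict(centroids, x) == y for x, y in points) / len(points)


def poison(train, n_points, x, label):
    """Return a copy of the training set with n_points injected (x, label) pairs."""
    return list(train) + [(x, label)] * n_points


# Hypothetical toy data: class 0 sits near 1, class 1 sits near 4.
CLEAN_TRAIN = [(0.0, 0), (0.5, 0), (1.0, 0), (1.5, 0),
               (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
TEST = [(0.2, 0), (0.8, 0), (1.2, 0), (1.8, 0),
        (2.8, 1), (3.2, 1), (3.8, 1), (4.2, 1)]

clean_model = fit_centroids(CLEAN_TRAIN)
# The attacker injects eight far-away points labeled as class 0,
# dragging that class's centroid past the class 1 centroid.
poisoned_model = fit_centroids(poison(CLEAN_TRAIN, 8, 10.0, 0))
```

Comparing `accuracy(clean_model, TEST)` with `accuracy(poisoned_model, TEST)` shows the effect: the test set is untouched, yet the poisoned model misclassifies every class 0 example because its learned centroid has been moved.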

Yu’s research focuses on designing new data poisoning algorithms to assess their immediate impact on model accuracy and classify them based on their attack strategies. His two recent papers highlighted indiscriminate attacks for image classification, where attackers can obtain training data and corrupt the process of image labeling, yielding inaccurate results in image classification tasks.

In collaboration with fellow researchers at the University of Waterloo, Yu developed data poisoning attacks that generate data points more efficiently and effectively than before. Unlike previous data poisoning methods, the proposed Total Gradient Descent Ascent (TGDA) attack produces thousands of poisoned data points all at once, boosting the speed and performance of the adversarial attack.

These elaborate methods enable researchers to examine the extent of interference in poisoning attacks, making it faster and easier to characterize different types of malicious activities and study their impact on model training and model output.

Combining public and private data training to enhance privacy

Shifting gears to differential privacy, Vector Faculty Member and Canada CIFAR AI Chair Gautam Kamath discussed the pitfalls associated with using sensitive and personal information to train machine learning models.

“The Vector Institute has so much expertise in ML security and privacy! It’s fantastic to have the opportunity to bring everyone together and see everyone’s perspectives on how we tackle the biggest challenges in the field.”

Gautam Kamath

Vector Faculty Member and Canada CIFAR AI Chair

Given the continuous development of large language models that are trained on publicly available data, he expressed concern that “machine learning models are memorizing things in their training datasets which may be things we don’t want to reveal.” Large language models are generally pre-trained on public datasets which are readily available online, such as downloadable texts, Wikipedia articles, and blog posts. However, a conflict of interest arises when large language models are coerced into generating copyrighted material and sensitive data without proper regulation or consent.

This is where differential privacy comes in. Regularly used as an approach to protect user datasets, it enables public sharing of group data and information patterns, while maintaining the privacy of individual user data points and identifiers. As a valuable data analysis mechanism, differential privacy allows researchers and businesses to collect group data and dissect a machine learning model’s utility without compromising sensitive information about the users.
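For a simple group statistic, one standard way to get this guarantee is the Laplace mechanism: compute the true answer, then add noise scaled to how much any one person could change it. The sketch below applies it to a counting query. The function names and the record layout are illustrative assumptions, and `epsilon` is the usual privacy parameter (smaller values mean more noise and stronger privacy).

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count of matching records with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one person's
    record changes the true count by at most 1, so the noise scale is
    1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A caller might run `private_count(users, lambda u: u["age"] >= 50, epsilon=1.0, rng=random.Random())` to publish roughly how many users are 50 or older. The released number stays useful as a group statistic, while no single user's presence or absence shifts its distribution by much.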

Kamath’s proposed approach involves two main steps: pretraining using public datasets, followed by fine-tuning to narrow the process down to task-specific smaller datasets.

“I’m distinguishing between two types of data,” he said. “One is public data which is large and diverse, which we’re going to use for pretraining, and the other one is smaller, private, and more focused on the downstream tasks.”
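The two-step recipe can be sketched on a toy problem. The example below assumes a one-parameter linear model and a DP-SGD-style private step (clip each example's gradient, then add Gaussian noise to the sum); the article does not say which private training algorithm Kamath uses, so that choice, the function names, and the hyperparameters are all assumptions. The sketch also omits the privacy accounting a real system would need to report an actual privacy guarantee.

```python
import random


def grad(w: float, x: float, y: float) -> float:
    """Gradient of the squared error 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def pretrain_public(data, lr: float = 0.1, epochs: int = 50) -> float:
    """Step 1: ordinary gradient descent on large, non-sensitive public data."""
    w = 0.0
    for _ in range(epochs):
        w -= lr * sum(grad(w, x, y) for x, y in data) / len(data)
    return w


def finetune_private(w: float, data, rng: random.Random, lr: float = 0.1,
                     epochs: int = 50, clip: float = 1.0,
                     noise_multiplier: float = 1.0) -> float:
    """Step 2: noisy, clipped gradient steps on the small private dataset."""
    for _ in range(epochs):
        # Clipping bounds how much any single private example can move the model.
        clipped = [max(-clip, min(clip, grad(w, x, y))) for x, y in data]
        # Gaussian noise on the summed gradient masks each individual contribution.
        noisy_sum = sum(clipped) + rng.gauss(0.0, noise_multiplier * clip)
        w -= lr * noisy_sum / len(data)
    return w
```

Starting the private step from the publicly pretrained weight means the noisy updates only need to cover the gap between the public and private tasks, which is the intuition behind using public data to improve private accuracy.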

In an ideal situation, public and private data go hand-in-hand such that public data is used to improve private machine learning accuracy. The caveat, he noted, is that not all publicly available data is appropriate; some datasets may have been plagiarized from external sources while others may be unsuitable for the specific purposes of model training.

Kamath concluded that differential privacy limits how much a machine learning model can depend on sensitive data from any single individual in its training, so that it relies instead on aggregate patterns contained in large-scale datasets from a collection of users. This approach keeps machine learning models from memorizing and sharing private individual data.

Additional highlights from the event included Vector Faculty Member Shai Ben-David’s discussion on the intersection of public and private data training in estimation tasks, Vector Affiliate Reza Samavi’s talk on the robustness of deep neural networks to adversarially modified data, and Vector Postdoctoral Fellow Masoumeh Shafieinejad’s unique take on the applications of differential privacy and data regulation in industry, academia, and healthcare settings.

As machine learning models continue to advance, more data analysis and computational processes are required to successfully train them. With the frequent deployment of training datasets in machine learning algorithms and the rise in large language model applications, the event speakers emphasized an urgent need for intellectual property protection and privacy regulation, as valuable information becomes more susceptible to thefts and malware attacks.

Want to learn more about the Vector Institute’s current research initiatives in machine learning privacy? Click here for the full playlist of talks.
