Vector Research Blog: Is Your Neural Network at Risk? The Pitfall of Adaptive Gradient Optimizers

By Avery Ma, Yangchen Pan, and Amir-massoud Farahmand

tl;dr: Our empirical and theoretical analyses reveal that models trained using stochastic gradient descent exhibit significantly higher robustness to input perturbations than those trained via adaptive gradient methods. This means that certain training techniques make machine learning systems more reliable and less likely to be thrown off by unexpected changes in the input data.

Have you ever wondered about the differences between models trained with various optimizers? Ongoing research focuses on how these optimizers affect a model’s standard generalization performance, that is, its accuracy on the original test set. In this post, we explore how the choice of optimizer can make or break a model’s robustness against input perturbations, whether you are team stochastic gradient descent (SGD) or team adaptive gradient.

**Figure 1:** **Comparison between models trained using SGD, Adam, and RMSProp.** Models trained by different algorithms have similar test accuracy, but there is a distinct robustness difference. [Three scatter plots over seven datasets: MNIST, SVHN, FashionMNIST, Imagenette, CIFAR10, Caltech101, and CIFAR100. Y-axis: standard test accuracy. X-axes, left to right: accuracy under Gaussian perturbations, under ℓ₂-bounded attacks, and under ℓ∞-bounded attacks.]

We start by putting models trained with SGD, Adam, and RMSProp side by side. The results are summarized in Figure 1. We focus on two criteria in this figure. First, all three plots share the same Y-axis, which shows the standard test accuracy. Second, the three X-axes show the accuracy of the model under different types of input perturbations. Models trained by SGD, Adam, and RMSProp are marked with a star, a circle, and a diamond, respectively. Each colored triplet denotes models trained on the same dataset.

There is only a small vertical gap within each triplet, showing that the models have similar standard generalization performance despite being trained by different algorithms.

On the other hand, under all three types of perturbations, there is a large horizontal span with the star always positioned on the far right side among the three. This indicates that models trained by SGD are the clear winners in terms of robustness against perturbations. Similar results can be observed with vision transformers or other data modalities.

Why do models behave differently under perturbations?

To understand this phenomenon, we investigate it through the lens of a frequency-domain analysis. First, we notice that natural datasets contain some frequencies that do not significantly impact the standard generalization performance of models. But here is the twist: under certain optimizers, this type of irrelevant information can actually make the model more vulnerable. Specifically, our main claim is that:

To optimize the standard training objective, models only need to learn how to correctly use relevant information in the data. However, their use of irrelevant information in the data is under-constrained and can lead to solutions sensitive to perturbations.

Because of this, by injecting perturbations into parts of the signal that contain irrelevant information, we observe that models trained by different algorithms exhibit very different performance changes.

Observation I: Irrelevant Frequencies in Natural Signals

To demonstrate that irrelevant frequencies exist when training a neural network classifier, we consider a supervised learning task in which we remove the irrelevant information from the training inputs and then assess the model’s performance on the original test data.

**Figure 2:** **Irrelevant frequencies exist in the natural data.** Accuracy on the original test set remains high when the training inputs are modified by removing parts of the signal with low spectrum energy (left) and high frequencies (right). [Two line plots over seven datasets: MNIST, FashionMNIST, CIFAR10, CIFAR100, SVHN, Caltech101, and Imagenette. Y-axis: accuracy on the original test set. X-axis: p, the percentage of DCT bases removed.]

When we modify the training data by removing parts of the signal that either have low energy (Figure 2, left) or are of high frequency (Figure 2, right), we find that the models’ accuracy on the original test set is largely unaffected. This suggests that, from the perspective of a neural network, the data contain a considerable amount of irrelevant information.
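To make the two kinds of removal concrete, here is a minimal sketch on a single 1D signal. It is an illustration under simplifying assumptions, not the pipeline behind Figure 2: the experiments use images, whereas this sketch uses a 1D orthonormal DCT written in pure Python. The helper names (`dct`, `idct`, `remove_low_energy`, `remove_high_freq`) and the per-signal ranking of coefficient magnitudes are hypothetical choices made for illustration.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def idct(c):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    scales = [math.sqrt((1.0 if k == 0 else 2.0) / n) for k in range(n)]
    return [sum(scales[k] * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]

def remove_low_energy(x, p):
    """Zero the fraction p of DCT coefficients with the smallest magnitude."""
    c = dct(x)
    num_removed = int(p * len(c))
    smallest = sorted(range(len(c)), key=lambda k: abs(c[k]))[:num_removed]
    for k in smallest:
        c[k] = 0.0
    return idct(c)

def remove_high_freq(x, p):
    """Zero the fraction p of DCT coefficients with the highest frequency."""
    c = dct(x)
    num_kept = len(c) - int(p * len(c))
    for k in range(num_kept, len(c)):
        c[k] = 0.0
    return idct(c)

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
low_energy_removed = remove_low_energy(signal, 0.5)
high_freq_removed = remove_high_freq(signal, 0.5)
print([round(v, 3) for v in low_energy_removed])
print([round(v, 3) for v in high_freq_removed])
```

In an experiment of this kind, the filtered signals would replace the training inputs while the test inputs stay untouched.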

This observation leads to the first part of our claim, that models only need to learn how to correctly use the crucial class-defining information from the training data to optimize the training objective. On the other hand, the extent to which they utilize irrelevant information in the data is not well-regulated. This can be problematic and lead to solutions sensitive to perturbations.

Observation II: Model Robustness along Irrelevant Frequencies

Let us now focus on the second part of the claim. If models’ responses to perturbations along the irrelevant frequencies explain their robustness difference, then we should expect a similar accuracy drop between models when perturbations are along relevant frequencies, but a much larger accuracy drop on less robust models when test inputs are perturbed along irrelevant frequencies.

**Figure 3:** **The effect of band-limited Gaussian perturbations on models trained using SGD, Adam, and RMSProp.** Perturbations from the lowest band have a similar effect on all the models, while models’ responses vary significantly when the perturbation focuses on higher frequency bands. [Two line plots, CIFAR100 (left) and Imagenette (right). Y-axis: accuracy change under band-limited Gaussian perturbations. X-axis: perturbed frequency band r, from 0 to 8.]

This leads to our next experiment. Figure 3 shows how classification accuracy degrades under band-limited Gaussian noise from different frequency bands on CIFAR100 and Imagenette. Notice that the perturbation from the lowest band has a similar impact on all the models, regardless of the algorithm used to train them. There is, however, a noticeable difference in how models trained by SGD and by adaptive gradient methods respond to perturbations from higher frequency bands.
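One way to build a band-limited Gaussian perturbation is to sample Gaussian coefficients inside a single frequency band and map them back to the input domain. The sketch below does this for a 1D signal. It rests on several assumptions that are ours, not the post's: the bands are equal-width contiguous slices of a 1D DCT spectrum, the noise is rescaled to a fixed ℓ₂ norm, the signal length is divisible by the number of bands, and the function name `band_limited_noise` is hypothetical.

```python
import math
import random

def dct(x):
    """Orthonormal DCT-II of a 1D signal (used here only to inspect the noise)."""
    n = len(x)
    return [math.sqrt((1.0 if k == 0 else 2.0) / n)
            * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def idct(c):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    scales = [math.sqrt((1.0 if k == 0 else 2.0) / n) for k in range(n)]
    return [sum(scales[k] * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]

def band_limited_noise(n, band, num_bands, norm, rng):
    """Gaussian noise whose DCT coefficients are nonzero only inside one band."""
    width = n // num_bands  # assumes n is divisible by num_bands
    lo, hi = band * width, (band + 1) * width
    coeffs = [rng.gauss(0.0, 1.0) if lo <= k < hi else 0.0 for k in range(n)]
    noise = idct(coeffs)
    scale = norm / math.sqrt(sum(v * v for v in noise))
    return [scale * v for v in noise]

rng = random.Random(0)
signal = [math.sin(2.0 * math.pi * i / 16) for i in range(16)]
noise = band_limited_noise(n=16, band=3, num_bands=4, norm=0.5, rng=rng)
perturbed = [s + d for s, d in zip(signal, noise)]

# The spectrum of the noise is confined to the highest of the four bands.
print([round(v, 3) for v in dct(noise)])
```

Sweeping `band` from the lowest to the highest index while holding `norm` fixed gives perturbations of equal size that differ only in which frequencies they touch.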

This observation shows that when models, during their training phase, do not have mechanisms in place to limit their use of irrelevant frequencies, their performance can be compromised if data along irrelevant frequencies become corrupted at test time.

Linear Regression Analysis with an Over-parameterized Model

In addition to the empirical studies, we theoretically analyze the learning dynamics of gradient descent (GD) and sign gradient descent (signGD), a memory-free version of Adam and RMSProp, with linear models. We briefly introduce the problem setup and summarize key results. For more details, we direct the reader to our paper.

We focus on least squares regression and compare the standard and adversarial risks of the asymptotic solutions obtained by GD and signGD. Motivated by our previous observations, we design a synthetic dataset that mimics the properties of a natural dataset by specifying frequencies that are irrelevant in generating the true target. We are particularly interested in the standard risk:

$$\mathcal{R}_s(w) := \mathbb{E}\left[\tfrac{1}{2}\,\lvert w^\top X - Y \rvert^2\right],$$

where $w$ is the weight vector, $X$ is the input, and $Y$ is the label,

and the adversarial risk under $ℓ₂$-norm bounded perturbations:

$$\mathcal{R}_a(w) := \mathbb{E}\left[\max_{\lVert \Delta x \rVert_2 \le \varepsilon} \tfrac{1}{2}\,\lvert w^\top (X + \Delta x) - Y \rvert^2\right],$$

where $\Delta x$ is an adversarial perturbation whose $ℓ₂$ norm is bounded by $\varepsilon$.
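For a linear model, the inner maximization can be solved by hand. By the Cauchy-Schwarz inequality, the worst-case perturbation points along $w$, with its sign matching the residual, so the worst-case residual is $\lvert w^\top x - y \rvert + \varepsilon \lVert w \rVert_2$. The sketch below checks this numerically for one sample. It assumes a squared loss scaled by 1/2 and a nonzero $w$. The function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loss(w, x, y):
    """Squared loss scaled by 1/2 (an assumption of this sketch)."""
    return 0.5 * (dot(w, x) - y) ** 2

def worst_case_loss(w, x, y, eps):
    """Closed form of the max over ||dx||_2 <= eps of loss(w, x + dx, y)."""
    return 0.5 * (abs(dot(w, x) - y) + eps * math.sqrt(dot(w, w))) ** 2

def worst_case_perturbation(w, x, y, eps):
    """Perturbation of norm eps that attains the maximum (requires w != 0)."""
    direction = 1.0 if dot(w, x) - y >= 0 else -1.0
    norm = math.sqrt(dot(w, w))
    return [eps * direction * wj / norm for wj in w]

w, x, y, eps = [3.0, 4.0], [1.0, 1.0], 5.0, 0.1
closed_form = worst_case_loss(w, x, y, eps)
dx = worst_case_perturbation(w, x, y, eps)
attained = loss(w, [xj + dj for xj, dj in zip(x, dx)], y)

# Random perturbations on the eps-sphere never exceed the closed form.
rng = random.Random(0)
random_losses = []
for _ in range(1000):
    d = [rng.gauss(0.0, 1.0) for _ in w]
    scale = eps / math.sqrt(dot(d, d))
    random_losses.append(loss(w, [xj + scale * dj for xj, dj in zip(x, d)], y))

print(closed_form, attained, max(random_losses))
```

When the residual is zero, the closed form reduces to a term that depends only on $\varepsilon$ and the weight norm.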

Our main results are threefold.

1. Irrelevant information leads to multiple standard risk minimizers. For an arbitrary minimizer, we can obtain its adversarial risk as:

$$\mathcal{R}_a(w^*) = \frac{\varepsilon^2}{2} \lVert w^* \rVert_2^2,$$

where $w^*$ denotes a standard risk minimizer and $\varepsilon$ is the bound on the adversarial perturbation.

This means that a model’s robustness to $ℓ₂$-norm bounded perturbations is inversely related to the norm of its weights: a smaller weight norm implies better robustness.

2. With a sufficiently small learning rate, the standard risks of the solutions obtained by GD and signGD can both be close to 0.

3. Consider a three-dimensional input space. The ratio of the adversarial risk of the signGD solution to that of the GD solution is always greater than 1:

$$\frac{\mathcal{R}_a(w^{\text{signGD}})}{\mathcal{R}_a(w^{\text{GD}})} > 1 + C,$$

where $w^{\text{signGD}}$ and $w^{\text{GD}}$ denote the solutions obtained by signGD and GD, and $C > 0$ is a constant whose value depends on the weight initialization and the data covariance.

The latter two findings are particularly important. They help explain the phenomena observed in Figure 1, specifically the similar levels of standard generalization across models and the variations in their robustness. The last result highlights that the three-dimensional linear model obtained through GD is consistently more robust against $ℓ₂$-norm bounded perturbations than the model obtained from signGD.
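The following toy simulation illustrates the flavour of these results. It is a sketch under our own assumptions and does not reproduce the synthetic dataset from the paper. All inputs lie along a single direction `u` in three dimensions, so the two orthogonal directions carry no information about the target. Both optimizers start from zero and run full-batch. The loss is the squared loss scaled by 1/2, and the adversarial risk uses the closed form 0.5(|r| + ε‖w‖₂)² for a residual r, which follows from the Cauchy-Schwarz inequality. GD updates are linear combinations of the inputs and therefore stay along `u`, while the elementwise sign pushes signGD out of that span.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gradient(w, xs, ys):
    """Gradient of the empirical standard risk 0.5 * mean((w.x - y)^2)."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        r = dot(w, x) - y
        for j, xj in enumerate(x):
            g[j] += r * xj / len(xs)
    return g

def sign(v):
    return (v > 0) - (v < 0)

def train(xs, ys, lr, steps, use_sign):
    w = [0.0, 0.0, 0.0]  # zero initialization (an assumption of this sketch)
    for _ in range(steps):
        g = gradient(w, xs, ys)
        if use_sign:
            g = [sign(gj) for gj in g]  # signGD keeps only the sign of each entry
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w

def standard_risk(w, xs, ys):
    return sum(0.5 * (dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def adversarial_risk(w, xs, ys, eps):
    norm = math.sqrt(dot(w, w))
    return sum(0.5 * (abs(dot(w, x) - y) + eps * norm) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Every input is a multiple of the unit vector u; targets satisfy y = w.x
# for any w with w.u = 1, so the standard risk has many minimizers.
length = math.sqrt(1.0 + 0.01 + 0.01)
u = [1.0 / length, 0.1 / length, 0.1 / length]
scales = [1.0, -0.5, 2.0]
xs = [[a * uj for uj in u] for a in scales]
ys = list(scales)

w_gd = train(xs, ys, lr=0.1, steps=2000, use_sign=False)
w_sign = train(xs, ys, lr=1e-3, steps=2000, use_sign=True)

eps = 0.1
print("standard risk   GD:", standard_risk(w_gd, xs, ys),
      " signGD:", standard_risk(w_sign, xs, ys))
print("squared norm    GD:", dot(w_gd, w_gd), " signGD:", dot(w_sign, w_sign))
print("adversarial risk ratio (signGD / GD):",
      adversarial_risk(w_sign, xs, ys, eps) / adversarial_risk(w_gd, xs, ys, eps))
```

In this toy problem both solutions fit the data almost perfectly, yet the signGD solution has a larger weight norm and hence a larger adversarial risk.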

Connecting the Norm of Linear Models to the Lipschitzness of Neural Networks

The first result from the linear analysis shows that, for a standard risk minimizer, the adversarial risk under $ℓ₂$-norm bounded perturbations is proportional to its squared weight norm, so a smaller weight norm means better robustness. To generalize this result to the deep learning setting, we make a connection between the weight norm and the Lipschitzness of neural networks.

Consider the feed-forward neural network as a series of function compositions:

$$f(x) = (φ_l \circ φ_{l-1} \circ \cdots \circ φ_1)(x),$$

where each $φ_i$ is a linear operation, an activation function, or a pooling operation. Denoting the Lipschitz constant of a function $f$ as $L(f)$, we can establish an upper bound on the Lipschitz constant of the entire feed-forward neural network using:

$$L(f) \le \prod_{i=1}^{l} L(φ_i),$$

where $l$ is the total number of components and $L(φ_i)$ is the Lipschitz constant of the $i$-th component.

Approximating the Lipschitz constants of neural network components, such as convolutions and skip-connections, often depends on the norm of the weights. This enables us to draw a connection between a neural network’s weight norm and its robustness. Essentially, a lower weight norm suggests a smaller upper bound on the Lipschitz constant, indicating that the model is less sensitive to perturbations.
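As a minimal sketch of how such a bound can be computed, consider a small fully connected network with ReLU activations. With respect to the $ℓ₂$ norm, the Lipschitz constant of a linear layer is the spectral norm of its weight matrix, and ReLU is 1-Lipschitz. The code below multiplies these per-layer constants. It is an illustration only: it does not cover the convolution and skip-connection approximations mentioned in the text, the layer representation and function names are hypothetical, and the power iteration assumes the all-ones starting vector is not orthogonal to the top singular direction.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def spectral_norm(m, iters=200):
    """Largest singular value of m, estimated by power iteration on m^T m."""
    mt = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iters):
        v = matvec(mt, matvec(m, v))
        scale = math.sqrt(sum(a * a for a in v))
        v = [a / scale for a in v]
    return math.sqrt(sum(a * a for a in matvec(m, v)))

def lipschitz_upper_bound(layers):
    """Product of per-layer Lipschitz constants for ("linear", W) and ("relu", None)."""
    bound = 1.0
    for kind, weight in layers:
        if kind == "linear":
            bound *= spectral_norm(weight)
        elif kind == "relu":
            bound *= 1.0  # ReLU is 1-Lipschitz
        else:
            raise ValueError("unsupported layer type: " + kind)
    return bound

w1 = [[3.0, 0.0], [0.0, 1.0]]  # spectral norm 3
w2 = [[0.0, 2.0], [1.0, 0.0]]  # spectral norm 2
network = [("linear", w1), ("relu", None), ("linear", w2)]
bound = lipschitz_upper_bound(network)

# Halving every weight halves each spectral norm and shrinks the bound.
halved = [("linear", [[0.5 * a for a in row] for row in w1]),
          ("relu", None),
          ("linear", [[0.5 * a for a in row] for row in w2])]
smaller_bound = lipschitz_upper_bound(halved)
print(bound, smaller_bound)
```

The product is only an upper bound, so a small value guarantees low sensitivity while a large value does not by itself prove that the network is fragile.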

| | MNIST | Fashion | CIFAR10 | CIFAR100 | SVHN | Caltech101 | Imagenette |
|---|---|---|---|---|---|---|---|
| **Upper bound on the Lipschitz constant, $\prod_{i=1}^{l} L(φ_i)$** | | | | | | | |
| SGD | 3.83 | 3.83 | 26.81 | 40.41 | 22.65 | 18.53 | 23.99 |
| Adam | 5.75 | 8.12 | 28.70 | 41.87 | 30.45 | 26.20 | 28.55 |
| RMSProp | 6.21 | 5.11 | 37.75 | 41.71 | 28.31 | 45.84 | 27.11 |
| **Averaged robust accuracy (%)** | | | | | | | |
| SGD | 77.97 | 77.95 | 63.21 | 55.65 | 69.08 | 71.42 | 67.59 |
| Adam | 65.64 | 67.60 | 57.71 | 45.25 | 65.60 | 55.03 | 58.86 |
| RMSProp | 63.54 | 71.34 | 56.47 | 47.55 | 65.37 | 53.16 | 57.98 |

Table 1: **Comparing the upper bound on the Lipschitz constant and the averaged robust accuracy of neural networks.** Notice that across all selected datasets, models trained by SGD have a considerably smaller upper bound compared to models trained by Adam and RMSProp.

The results in Table 1 show that SGD-trained neural networks have considerably smaller upper bounds on their Lipschitz constants, which helps explain why they are more robust to input perturbations than networks trained with adaptive gradient methods, as shown in Figure 1.

Our work highlights the importance of optimizer selection in achieving both generalization and robustness. This insight not only advances our understanding of neural network robustness but also guides future research in developing optimization strategies that maintain high accuracy while being resilient to input perturbations, paving the way for more secure and reliable machine learning applications.

