Machine Learning Robustness: New Challenges and Approaches

By Jonathan Woods
March 29, 2022

This article is part of our Trustworthy AI series. As part of this series, we will release one article per week on the following topics:

Interpretability
Fairness
Governance

In this week’s article, Vector’s Industry Innovation team looks at model robustness: defining its importance in governance, breaking down five novel issues that ML models present, and offering general guidelines for model robustness governance in practice.

For many organizations, turning machine learning’s (ML) promise for new products and optimized operations into reality has been a challenge. Companies have discovered that ML applications differ from traditional software in several important ways. Data complexity, engineering and compute costs, and new governance requirements have all slowed adoption. Governance and controls have presented a particularly difficult challenge. Uncertainty around new risks has been intensified by some very public ML failures, and it should come as no surprise that organizations are cautious about the technology. Considering that standards are still in development and that missteps carry the possibility of real reputational and financial consequences, it makes sense to err on the side of caution, even though the opportunity cost of not using ML systems where they could provide value may be great.

One key to overcoming uncertainty and moving forward with implementations is good ML governance. Fortunately, existing risk frameworks for non-ML models can be adapted to account for learning algorithms and their unique characteristics.

This paper examines one element of ML model risk governance: model robustness. The paper defines its importance in governance, breaks down five novel issues that ML models present, and offers general guidelines for model robustness governance in practice. This paper is written primarily for non-technical professionals to enable them to grasp key concepts, participate in the governance process, and ask informed questions when it comes to responsible ML application. Ultimately, this knowledge can empower organizations to maximize the value they derive from ML systems.

What is model robustness?

Model robustness refers to the degree to which a model’s performance changes when it is used on new data rather than on its training data. Ideally, performance should not deviate significantly. Robustness matters for a number of reasons. First, trust in any tool depends on reliable performance. Trust can erode when an ML system performs in an unpredictable way that is difficult to understand. Second, deviation from anticipated performance may indicate important issues that require attention. These issues can include malicious attacks, unmodeled phenomena, undetected biases, or significant changes in data. To ensure that a model is performing according to its intended purpose, it’s critical to understand, monitor, and manage robustness as part of model risk governance.

Understand and monitor five elements to maintain ML model robustness

Achieving model robustness requires understanding and managing a number of technical challenges. Below, we define these challenges and list considerations for each. This collection should provide a reliable checklist of items to refer to when managing model robustness.

Managing model robustness requires understanding specific challenges related to:

data quality
model decay
feature stability
precision vs. recall
input perturbations

Data quality

Data quality can have a significant effect on model performance. Data quality refers to the accuracy, completeness, and clarity of the data fed into a machine learning system. The lower a dataset scores on each of these dimensions, the more likely the system is to deviate from its typical performance. A number of issues can impact data quality. Datasets can have bias, leading to model outcomes that favour or disfavour certain groups unfairly. Older data may capture patterns that have since evolved, leading the model to deliver results that are no longer relevant. Important scenarios may be underrepresented or not present at all in training data, leading to the omission of features that are key to performance. Practitioners cognizant of the need for robustness should start with data quality, ensuring that training and real-world datasets are complete, accurate, and relevant. Not doing so may compromise the reliability of model output.

Considerations: Consider reviewing and further curating the original dataset before using a model to ensure the dataset is complete, relevant, and accurate.
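As an illustration of the completeness dimension described above, here is a minimal Python sketch that reports the share of missing values for each field in a small dataset. The function name `missing_rates`, the field names, and the sample records are hypothetical and do not come from any particular system. Checks for accuracy and relevance usually need domain-specific rules that this sketch does not attempt.

```python
def missing_rates(records, fields):
    """Return, for each field, the fraction of records where it is absent or None."""
    total = len(records)
    if total == 0:
        raise ValueError("records must not be empty")
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) is None)
        rates[field] = missing / total
    return rates


# Hypothetical sample: four housing records, three of them with a gap.
records = [
    {"location": "A", "size_sqft": 900, "bedrooms": 2},
    {"location": "B", "size_sqft": None, "bedrooms": 3},
    {"location": None, "size_sqft": 1200, "bedrooms": 3},
    {"location": "C", "size_sqft": 1500},  # the "bedrooms" key is absent
]

rates = missing_rates(records, ["location", "size_sqft", "bedrooms"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A report like this could be reviewed before training and again on real-world data, so that gaps are noticed before they affect model output.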

Model decay

Over time, a model’s predictive ability can degrade. In dynamic industries and evolving environments, new data distributions can arise that differ from the historical distributions used to train models. When the difference is significant, it can impact the reliability of a system’s predictions. This difference between training and test data is called dataset shift. When it’s severe – due perhaps to major real-world events that create sudden and significant changes in people’s behaviour – it requires correction. Model correction techniques or retraining of the model with more current data can re-establish model accuracy. Another version of decay is concept drift, which occurs when data distributions remain the same but the relationship between the input data and the outcome the model predicts changes. In this case, a model may produce results that are accurate according to its inputs, but no longer relevant.

Considerations: Monitor for and track any change in model performance, and become familiar with correction techniques that may enable faster and less expensive recalibration than a full retraining of the model on an updated dataset.
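The monitoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `decay_alerts`, the 0.05 tolerance, and the accuracy figures are all assumptions made for the example. In practice, the metric and the tolerance would be chosen for each model.

```python
def decay_alerts(baseline_accuracy, period_accuracies, tolerance=0.05):
    """Return the indices of periods whose accuracy fell more than
    `tolerance` below the baseline measured at deployment."""
    return [
        index
        for index, accuracy in enumerate(period_accuracies)
        if baseline_accuracy - accuracy > tolerance
    ]


# Hypothetical figures: accuracy at launch, then five monthly measurements.
baseline = 0.90
monthly = [0.91, 0.89, 0.86, 0.82, 0.78]

alerts = decay_alerts(baseline, monthly)
print("Periods needing review:", alerts)  # indices 3 and 4 exceed the tolerance
```

Periods flagged this way are candidates for a closer look, and possibly for correction or retraining.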

Feature stability

A feature is an individual property or variable used as an input in an ML system. Consider a model that predicts housing prices. Features may include a house’s location, size, number of bedrooms, previous sale price, or any other number of elements. Frequent variation of important features may impact a model’s stability. Challenges may also arise if previously unseen but relevant observations fall outside of the range observed in training data.

Considerations: While monitoring a model, deliberately track changes in features as an indicator of model stability.
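One way to track feature changes is sketched below in Python. It uses the Population Stability Index (PSI), a commonly used measure of how far a feature’s distribution has moved. The article itself does not prescribe PSI, so treat it as one option among several. The sketch also reports the share of new values that fall outside the range seen in training. All function names and the house-size figures are hypothetical.

```python
import math


def bin_shares(values, low, high, n_bins):
    """Share of values in each of n_bins equal-width bins over [low, high].
    Values outside the range are counted in the nearest edge bin."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        index = int((value - low) / width)
        index = min(max(index, 0), n_bins - 1)
        counts[index] += 1
    return [count / len(values) for count in counts]


def population_stability_index(train, live, n_bins=4, eps=1e-6):
    """PSI between training and live values, with bins set by the training range."""
    low, high = min(train), max(train)
    if low == high:
        raise ValueError("training values must not all be identical")
    expected = bin_shares(train, low, high, n_bins)
    actual = bin_shares(live, low, high, n_bins)
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        psi += (a - e) * math.log(a / e)
    return psi


def out_of_range_share(train, live):
    """Fraction of live values below the training minimum or above its maximum."""
    low, high = min(train), max(train)
    return sum(1 for value in live if value < low or value > high) / len(live)


# Hypothetical house sizes (square feet) at training time and in recent use.
train_sizes = [800, 900, 1000, 1100, 1200, 1300, 1400, 1600]
live_sizes = [1300, 1400, 1500, 1600, 1700, 1800, 1500, 1450]

psi_value = population_stability_index(train_sizes, live_sizes)
outside = out_of_range_share(train_sizes, live_sizes)
print(f"PSI: {psi_value:.2f}, out-of-range share: {outside:.0%}")
```

A PSI of zero means the two distributions fall into the bins identically, and larger values mean a larger shift. What counts as too large is a judgment call for each model and organization.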

Precision vs. recall

Precision is a measure of exactness in model operation. Recall is a measure of completeness or quantity. ‘High precision’ means that a model returns substantially more relevant results than irrelevant ones. ‘High recall’ means that a model returns most of the relevant results that are available. There is often a trade-off between precision and recall during model training, and practitioners must determine what balance is best for any given case. Getting the balance wrong may impact the robustness of the model.

Considerations: Generally, precision is important when a false positive would be a critical problem (e.g., in a banking context, false fraud alerts can create a work volume that can overwhelm a team). Recall is important when a false negative would be a critical problem (e.g., improperly disqualifying marketing leads resulting in missed sales opportunities). The calibration of precision vs. recall should be done for each model.
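To make the trade-off concrete, here is a minimal Python sketch. Precision is computed as true positives divided by all predicted positives, and recall as true positives divided by all actual positives. The scores, labels, and thresholds are hypothetical, and `precision_recall` is an illustrative helper rather than a library function.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when scores at or above `threshold`
    are treated as positive predictions. Labels are 1 (positive) or 0."""
    true_pos = false_pos = false_neg = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            true_pos += 1
        elif predicted_positive and label == 0:
            false_pos += 1
        elif not predicted_positive and label == 1:
            false_neg += 1
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical fraud scores and true outcomes (1 = fraud, 0 = legitimate).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = precision_recall(scores, labels, threshold=0.85)
lenient = precision_recall(scores, labels, threshold=0.35)
print("Strict threshold  (precision, recall):", strict)
print("Lenient threshold (precision, recall):", lenient)
```

With these figures, the strict threshold raises no false alerts but misses half of the fraud cases. The lenient threshold catches every fraud case, but one third of its alerts are false. Which outcome is preferable depends on the cost of each kind of error in the use case.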

Input perturbations

In some cases, models can be tricked by deviations in input data. Even small input disturbances can cause changes in system output. This can be exploited with malicious intent. ‘Data poisoning’ is a type of attack on an ML system in which the data used to train a model is intentionally contaminated to compromise the learning process.

Considerations: Consider mitigation strategies such as thorough data sampling (covering outliers, trends, distributions, etc.) and building a “golden dataset” – a validated dataset with select cases from known sources and for which the expected behaviour is known – to ensure the integrity of input.
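A golden dataset check can be sketched as follows in Python. This is a minimal illustration under stated assumptions: `check_golden_dataset` is a hypothetical helper, and `toy_fraud_model` is a stand-in rule used only so the example runs. A real check would call the deployed model instead.

```python
def check_golden_dataset(predict, golden_cases):
    """Run `predict` on every golden case and return the cases where the
    output differs from the expected behaviour."""
    failures = []
    for case in golden_cases:
        actual = predict(case["input"])
        if actual != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "actual": actual}
            )
    return failures


def toy_fraud_model(transaction):
    """Hypothetical stand-in for a model: flags transactions over 1000."""
    return "flag" if transaction["amount"] > 1000 else "allow"


# Hypothetical golden cases with known, validated expected behaviour.
golden_cases = [
    {"input": {"amount": 50}, "expected": "allow"},
    {"input": {"amount": 5000}, "expected": "flag"},
    {"input": {"amount": 1000}, "expected": "flag"},  # boundary case
]

failures = check_golden_dataset(toy_fraud_model, golden_cases)
for failure in failures:
    print("Mismatch:", failure)
```

Here the stand-in model allows the boundary case that the golden dataset expects to be flagged, so the check reports one mismatch. Running such a check after every retraining or data update can help reveal whether contaminated data has changed the model’s behaviour on cases where the right answer is known.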

How organizations can manage ML robustness

Considering the challenges, how can organizations approach and govern robustness? Here are some general principles that organizations can follow:

Start with the data. Ensure that the collection, labeling, and engineering of data is thorough, complete, and accurate.
Get consensus about precision vs. recall trade-offs. Be aware that this trade-off exists, and get all stakeholders’ input on the risks relating to false negatives and false positives in every use case to ensure that this trade-off is optimized for each application.
Determine a retraining schedule. To maintain model accuracy, identify an appropriate retraining schedule by evaluating the model’s performance and monitoring for degradation over time.

Summing up

The unique operations and characteristics of ML systems present new risks that must be managed and that require adjustments to elements of model risk governance, including robustness. Thankfully, by understanding and breaking down the novel robustness issues that ML models present, organizations can address them and foster trust in the ML models they deploy.
