Responding to major shifts in data: Vector Industry Innovation Report on Dataset Shift Project

August 11, 2021

By Jonathan Woods


Vector’s Industry Innovation team has released the Dataset Shift and Potential Remedies Technical Report, which details experiments and insights gained in the Dataset Shift Project. The project is an industry-academia collaboration, established to equip Vector’s industry sponsor companies with a deeper understanding of dataset shift, its varieties, effective detection strategies, and adaptation techniques. Notably, insights from this report will be presented by project participants at this year’s European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021) on September 13th, 2021, in the tutorial DSML 2021 Data Shift in Machine Learning: What is it and what are potential remedies?

The Dataset Shift Project

Machine learning (ML) systems are trained under the premise that training data and real-world data will have similar distribution patterns. However, in dynamic industries and changing circumstances, new data distribution patterns can emerge that differ significantly from the historical patterns used for training – so much so that they have a major impact on the reliability of predictions. This difference between training and test data (or data used in production) is known as dataset shift, and, when severe enough, it necessitates adaptation. This adaptation can be accomplished either through cumbersome and expensive model retraining or leaner, more focused dataset shift adaptation techniques.

In May 2020, the Vector Institute launched the Dataset Shift Project as a response to the pandemic and the drastic shifts that it (and ensuing policy responses) brought about in citizen and consumer behaviour. The project included 15 participants: five Vector researchers and staff with expertise in machine learning as well as 10 technical professionals from seven Vector industry sponsor companies. It included four hands-on tutorials, developed and facilitated by Vector researchers and staff, in which participants improved their knowledge and skills through experiential learning.

The project covered three types of dataset shift:

Covariate shift: a difference in the distribution of input variables between training data and test data. Covariate shift can occur due to a lack of randomness, inadequate sampling, biased sampling, or a changing
Label shift: a difference in the distribution of class variables (i.e. a model’s classification results) between training data and test output. Label shift may appear when some concepts are undersampled or oversampled in the target domain compared to the source domain.
Concept shift: a difference in the relationship between the input variables and the target variable, so that the same inputs correspond to different outcomes in training data and test data.
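As a rough illustration of how covariate shift in a single numeric feature might be detected, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between training and test values. It is a minimal example under stated assumptions, not the detection method used in the project: the synthetic Gaussian data, the function name ks_statistic, and the use of the approximate 5% critical value are all choices made here for illustration.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    # Walk through the pooled values in increasing order, advancing both
    # empirical CDFs past each value before measuring the gap.
    while i < len(a) and j < len(b):
        value = min(a[i], b[j])
        while i < len(a) and a[i] == value:
            i += 1
        while j < len(b) and b[j] == value:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


# Hypothetical data: one feature observed at training time and in production.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(500)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(500)]     # no shift
test_shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]  # mean has moved

# Approximate critical value at the 5% significance level.
critical = 1.36 * math.sqrt((len(train) + len(test_shifted))
                            / (len(train) * len(test_shifted)))

ks_same = ks_statistic(train, test_same)
ks_shifted = ks_statistic(train, test_shifted)
print(f"critical value: {critical:.3f}")
print(f"KS, same distribution:    {ks_same:.3f}")
print(f"KS, shifted distribution: {ks_shifted:.3f}")
```

A statistic well above the critical value suggests that the feature's distribution has changed. A test like this looks only at the inputs, so it cannot on its own reveal concept shift.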

In three working groups, participants investigated dataset shift in cross-sectional, time series, and image data. These dataset types aligned with participants’ interests, and reflected real, current application potential in their organizations. The following summarizes the objectives and results of each working group.

Cross-sectional data

The purpose of this study was to detect covariate shift in cross-sectional data, and to adapt algorithms and techniques to account for it. The group used the Iowa House Sales Prices dataset from Kaggle and was tasked with predicting house sale prices using data from years 2006 to 2010. The main steps and objectives of this working group were to prepare cross-sectional data for experiments, apply dataset shift analysis algorithms, identify potential shifts, use shift adaptation techniques, and analyze the resulting prediction model. The group demonstrated that adaptation does not improve performance in every case, implying that even the best adaptation models and transformations cannot be applied universally across use cases.
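One common family of covariate shift adaptation techniques reweights training examples so that they resemble the test distribution. The sketch below illustrates the idea with histogram-based importance weights and a weighted least-squares line. It is an assumption-laden toy example, not the working group's method: the data are synthetic (not the house price dataset), and the helper names fit_line, importance_weights, and mse are hypothetical.

```python
def fit_line(xs, ys, weights=None):
    """Weighted least-squares fit of y = intercept + slope * x."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    cov_xy = sum(w * (x - mean_x) * (y - mean_y)
                 for w, x, y in zip(weights, xs, ys))
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope


def bin_of(x, edges):
    """Index of the histogram bin [edges[k], edges[k + 1]) containing x."""
    for k in range(len(edges) - 1):
        if edges[k] <= x < edges[k + 1]:
            return k
    raise ValueError(f"{x} lies outside the histogram range")


def importance_weights(train_x, test_x, edges):
    """Estimate p_test(x) / p_train(x) for each training point from the
    relative bin frequencies of the two samples."""
    n_bins = len(edges) - 1
    train_counts = [0] * n_bins
    test_counts = [0] * n_bins
    for x in train_x:
        train_counts[bin_of(x, edges)] += 1
    for x in test_x:
        test_counts[bin_of(x, edges)] += 1
    ratios = [
        (test_counts[k] / len(test_x)) / (train_counts[k] / len(train_x))
        if train_counts[k] > 0 else 0.0
        for k in range(n_bins)
    ]
    return [ratios[bin_of(x, edges)] for x in train_x]


def mse(model, xs, ys):
    """Mean squared error of a fitted (intercept, slope) pair."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


def target(x):
    return x * x  # nonlinear truth, so a straight line is misspecified


# Synthetic covariate shift: training inputs sit mostly in [0, 1),
# test inputs mostly in [1, 2). The input-output relationship is unchanged.
train_x = [i / 80 for i in range(80)] + [1 + i / 20 for i in range(20)]
test_x = [i / 20 for i in range(20)] + [1 + i / 80 for i in range(80)]
train_y = [target(x) for x in train_x]
test_y = [target(x) for x in test_x]

weights = importance_weights(train_x, test_x, edges=[0.0, 1.0, 2.0])
unweighted_mse = mse(fit_line(train_x, train_y), test_x, test_y)
weighted_mse = mse(fit_line(train_x, train_y, weights), test_x, test_y)
print(f"test MSE without reweighting: {unweighted_mse:.4f}")
print(f"test MSE with reweighting:    {weighted_mse:.4f}")
```

In this toy setting reweighting helps because the linear model cannot fit the curve everywhere and must choose where to be accurate. When a model is flexible enough to fit all regions well, reweighting may bring little benefit, which is in keeping with the group's observation that adaptation does not always improve results.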

Time series data

The purpose of this study was to use transfer learning and adaptive learning as a means to tackle dataset shift in retail – specifically, to estimate the sales of new goods using the data distribution patterns of current or past sales. To do this, the group used the Predict Future Sales dataset from Kaggle, which consists of historical retail-item sales data from January 2013 to October 2015. The group took two approaches:

One approach investigated the use of transfer learning for Long Short-Term Memory networks in order to leverage the learned knowledge from one sale item and transfer it to another item with limited data. Successfully reusing previously-learned knowledge would eliminate the need to train a model from scratch, which is especially important when data is scarce and expensive to obtain.
The other approach involved applying adaptive learning methods, which monitor the performance of the model and update its coefficients if performance deteriorates. Adaptive learning methods were used to correct potential concept shift in the data, as they are known to be robust against concept shift in dynamic environments. This approach was particularly relevant in the context of the COVID-19 pandemic, as the large and sudden shift in human behavior it caused rendered some predictive models inaccurate due to concept shift.

The group demonstrated that adaptive methods outperform non-adaptive methods when concept shift is present, and that the two perform comparably when there is no concept shift. A fuller understanding of the effectiveness of adaptive methods will require testing a wider range of models. The group also concluded that applying transfer learning to a new model can enhance its prediction capabilities, accelerate training, and reduce the cost of retraining a model when limited data is available.
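The adaptive learning idea described above, monitoring performance and updating coefficients when it deteriorates, can be sketched in a few lines. This is a simplified illustration under assumptions made here, not the group's implementation: the one-coefficient model, the synthetic stream, the sliding-window error monitor, and the parameters window, threshold, and lr are all hypothetical choices.

```python
from collections import deque


def run_stream(stream, initial_coef, adapt, window=10, threshold=0.5, lr=0.1):
    """Predict y = coef * x over a stream and return the absolute error per step.

    When adapt is True, the coefficient takes a gradient step whenever the
    mean absolute error over the last `window` steps exceeds `threshold`.
    """
    coef = initial_coef
    recent = deque(maxlen=window)  # sliding window of recent errors
    errors = []
    for x, y in stream:
        residual = y - coef * x
        errors.append(abs(residual))
        recent.append(abs(residual))
        if adapt and sum(recent) / len(recent) > threshold:
            coef += lr * residual * x  # gradient step on the squared error
    return errors


def make_stream(n_steps=200, shift_at=100):
    """Synthetic stream whose input-output relationship flips at `shift_at`
    (concept shift): y = 2x before the shift, y = -x afterwards."""
    stream = []
    for t in range(n_steps):
        x = 1.0 + 0.25 * (t % 5)
        true_coef = 2.0 if t < shift_at else -1.0
        stream.append((x, true_coef * x))
    return stream


def mean(values):
    return sum(values) / len(values)


stream = make_stream()
static_errors = run_stream(stream, initial_coef=2.0, adapt=False)
adaptive_errors = run_stream(stream, initial_coef=2.0, adapt=True)

print(f"before shift: static {mean(static_errors[:100]):.3f}, "
      f"adaptive {mean(adaptive_errors[:100]):.3f}")
print(f"after shift:  static {mean(static_errors[100:]):.3f}, "
      f"adaptive {mean(adaptive_errors[100:]):.3f}")
```

Before the shift the two models behave identically, because the monitor never triggers an update. After the shift the static model keeps its outdated coefficient, while the adaptive one moves toward the new relationship.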

Image data

The purpose of this study was to use few-shot learning methods – methods using very limited training examples – to classify new data. The group’s objectives were to a) reproduce the results of prototypical networks trained on the Omniglot dataset and the mini-ImageNet dataset separately, and b) reproduce the results of model-agnostic meta-learning algorithms trained on the Omniglot dataset.

The working group demonstrated that prototypical networks can tackle dataset shift using few-shot learning on the Omniglot and mini-ImageNet datasets. However, performance dropped significantly when running prototypical networks on different combinations of datasets – for instance, when training networks on the Omniglot dataset and then testing on the mini-ImageNet dataset. The group also demonstrated that model-agnostic meta-learning algorithms could tackle dataset shift when trained on the Omniglot dataset.
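The classification step at the heart of prototypical networks can be sketched as follows: each class is represented by the mean of its few labelled support examples in an embedding space, and a new example is assigned to the class with the nearest prototype. The sketch makes a strong simplifying assumption: it treats the raw 2-D points as embeddings, whereas a real prototypical network learns the embedding with a neural network. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def class_prototypes(support):
    """Map each label to the mean of its support embeddings.

    `support` is a list of (embedding, label) pairs.
    """
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query, prototypes):
    """Return the label whose prototype is nearest to the query embedding."""
    return min(prototypes,
               key=lambda label: squared_distance(query, prototypes[label]))


# A toy 2-way, 2-shot episode: two labelled examples for each of two classes.
support = [
    ([0.0, 0.0], "a"), ([0.0, 2.0], "a"),
    ([4.0, 4.0], "b"), ([6.0, 4.0], "b"),
]
prototypes = class_prototypes(support)
print(prototypes)
print(classify([1.0, 1.0], prototypes))
print(classify([4.0, 5.0], prototypes))
```

Because the prototypes are recomputed from whatever support examples are supplied, new classes can be handled without retraining. The quality of the result still depends on the embedding, which may help explain why performance dropped when the training and test datasets differed.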

The Dataset Shift Project resulted in significant knowledge transfer between Vector researchers and industry participants. Industry participants developed proficiency in dataset shift detection, identification, and correction methodologies, established best practices in accordance with the latest academic and industry standards, and gained skills that can increase the resilience of organizations and their workforces in the face of changing environments. If put into production, these approaches have the potential to deliver enhanced efficiency, adaptability, and cost-savings. Finally, this project also demonstrated the value of collaborative efforts between industry and academia, and laid the groundwork for future projects focused on building a deeper understanding of dataset shift and methods for mitigating its effects in practical settings.

At the upcoming ECML PKDD 2021, three project participants will share insights on four main topics: the principles behind data shift, strategies for detecting dataset shift, adaptation techniques, and advanced topics in data shift for enhancing machine learning models in situations where dataset shift is inevitable. The presenters are Ali Pesaranghader, former Senior AI Research Scientist at CIBC; Mehdi Ataei, a Vector Research Affiliate who led technical elements of the project at Vector; and Sedef Akinli Kocak, Vector Applied AI Project Manager.

