By Jonathan Woods
August 11, 2021
Vector’s Industry Innovation team has released the Dataset Shift and Potential Remedies Technical Report, which details experiments and insights gained in the Dataset Shift Project. The project is an industry-academia collaboration, established to equip Vector’s industry sponsor companies with a deeper understanding of dataset shift, its varieties, effective detection strategies, and adaptation techniques. Notably, insights from this report will be presented by project participants at this year’s European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021) on September 13th, 2021, in the tutorial DSML 2021 Data Shift in Machine Learning: What is it and what are potential remedies?
The Dataset Shift Project
Machine learning (ML) systems are trained under the premise that training data and real-world data will have similar distribution patterns. However, in dynamic industries and changing circumstances, new data distribution patterns can emerge that differ significantly from the historical patterns used for training – so much so that they have a major impact on the reliability of predictions. This difference between training and test data (or data used in production) is known as dataset shift, and, when severe enough, it necessitates adaptation. This adaptation can be accomplished either through cumbersome and expensive model retraining or leaner, more focused dataset shift adaptation techniques.
In May 2020, the Vector Institute launched the Dataset Shift Project as a response to the pandemic and the drastic shifts that it (and ensuing policy responses) brought about in citizen and consumer behaviour. The project included 15 participants: five Vector researchers and staff with expertise in machine learning as well as 10 technical professionals from seven Vector industry sponsor companies. It included four hands-on tutorials, developed and facilitated by Vector researchers and staff, in which participants improved their knowledge and skills through experiential learning.
The project covered three types of dataset shift:
- Covariate shift: a difference in the distribution of input variables between training data and test data. Covariate shift can occur due to a lack of randomness, inadequate sampling, biased sampling, or a changing environment.
- Label shift: a difference in the distribution of the class variable (i.e., the labels a model predicts) between training and test data. Label shift may appear when some classes are undersampled or oversampled in the target domain compared to the source domain.
- Concept shift: a difference in the relationship between a model’s input and output variables between training and test data, such that the mapping learned during training no longer holds.
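To make these categories concrete, here is a minimal detection sketch for the first one. It uses hypothetical synthetic data (not data from the project) and flags covariate shift in a single input feature with a two-sample Kolmogorov-Smirnov statistic, the maximum gap between the two empirical distribution functions:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    gap between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    all_vals = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, all_vals, side="right") / len(a)
    cdf_b = np.searchsorted(b, all_vals, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
train_x = rng.normal(0.0, 1.0, 2000)    # training inputs
shifted_x = rng.normal(1.5, 1.0, 2000)  # test inputs whose distribution moved
same_x = rng.normal(0.0, 1.0, 2000)     # test inputs from the same distribution

print(ks_statistic(train_x, shifted_x))  # large value: covariate shift likely
print(ks_statistic(train_x, same_x))     # small value: no strong evidence
```

In practice, per-feature tests like this are a common first screen; multivariate shifts can instead be detected by training a classifier to distinguish training from production inputs.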
In three working groups, participants investigated dataset shift in cross-sectional, time series, and image data. These dataset types aligned with participants’ interests, and reflected real, current application potential in their organizations. The following summarizes the objectives and results of each working group.
Cross-sectional data
The purpose of this study was to detect covariate shift in cross-sectional data and to adapt algorithms and techniques to account for it. The group used the Iowa House Sales Prices dataset from Kaggle and was tasked with predicting house sale prices using data from 2006 to 2010. The main steps and objectives of this working group were to prepare cross-sectional data for experiments, apply dataset shift analysis algorithms, identify potential shifts, use shift adaptation techniques, and analyze the resulting prediction model. The group demonstrated that adaptation does not necessarily improve performance in all cases, implying that even the best adaptation models and transformations cannot be applied indiscriminately across use cases.
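One standard adaptation technique for covariate shift is importance weighting: training points are reweighted by the density ratio between test and training inputs, so the fitted model targets the production regime. A minimal 1-D sketch (hypothetical Gaussian data, not the Iowa dataset; in practice the density ratio must be estimated, for example with a domain classifier):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D illustration: training inputs come from N(0, 1),
# while production inputs have drifted to N(0.5, 1).
x_train = rng.normal(0.0, 1.0, 2000)
x_test = rng.normal(0.5, 1.0, 2000)

def gauss_pdf(x, mu, sd):
    """Gaussian density, standing in for estimated train/test densities."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Importance weight for each training point: p_test(x) / p_train(x).
w = gauss_pdf(x_train, 0.5, 1.0) / gauss_pdf(x_train, 0.0, 1.0)

# Reweighted training statistics track the shifted test distribution,
# so a model fit with these sample weights adapts to the shift.
plain_mean = x_train.mean()
weighted_mean = np.average(x_train, weights=w)
print(plain_mean, weighted_mean, x_test.mean())
```

The weighted mean of the training inputs lands near the test mean (0.5) while the unweighted mean stays near 0, which is the effect a weighted regression would exploit.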
Time series data
The purpose of this study was to use transfer learning and adaptive learning as a means to tackle dataset shift in retail – specifically, to estimate the sales of new goods using the data distribution patterns of current or past sales. To do this, the group used the Predict Future Sales dataset from Kaggle, which consists of historical retail-item sales data from January 2013 to October 2015. The group took two approaches:
- One approach investigated the use of transfer learning for Long Short-Term Memory networks in order to leverage the learned knowledge from one sale item and transfer it to another item with limited data. Successfully reusing previously-learned knowledge would eliminate the need to train a model from scratch, which is especially important when data is scarce and expensive to obtain.
- The other approach involved applying adaptive learning methods, which monitor the performance of the model and update its coefficients if performance deteriorates. Adaptive learning methods were used to correct potential concept shift in the data, as they are known to be robust against concept shift in dynamic environments. This approach was particularly relevant in the context of the COVID-19 pandemic, as the large and sudden shift in human behavior it caused rendered some predictive models inaccurate due to concept shift.
The group demonstrated that adaptive methods outperform non-adaptive methods when concept shift is present, and that their results are comparable to those obtained when no concept shift occurs. To better understand the effectiveness of adaptive methods, various models must be tested. The group also concluded that applying transfer learning to a new model can enhance its prediction capabilities, accelerate training, and reduce the cost of retraining when only limited data is available.
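The benefit of transferring from a data-rich item to a data-poor one can be shown in a deliberately simplified linear analogue (not the group's LSTM code). Here "transfer" is modeled as ridge regression that shrinks toward pretrained source weights instead of toward zero; all task parameters and sample sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in for the retail setting: a source item with
# plentiful data and a related target item with only 5 observations.
d = 10
w_source_true = rng.normal(0, 1, d)
w_target_true = w_source_true + rng.normal(0, 0.1, d)  # closely related task

X_src = rng.normal(0, 1, (500, d))
y_src = X_src @ w_source_true + rng.normal(0, 0.1, 500)
X_tgt = rng.normal(0, 1, (5, d))
y_tgt = X_tgt @ w_target_true + rng.normal(0, 0.1, 5)

def ridge(X, y, lam, prior):
    """Minimize ||Xw - y||^2 + lam * ||w - prior||^2 in closed form."""
    k = X.shape[1]
    return prior + np.linalg.solve(X.T @ X + lam * np.eye(k),
                                   X.T @ (y - X @ prior))

w_source = ridge(X_src, y_src, 1.0, np.zeros(d))    # "pretrained" model
w_scratch = ridge(X_tgt, y_tgt, 1.0, np.zeros(d))   # target data only
w_transfer = ridge(X_tgt, y_tgt, 1.0, w_source)     # shrink toward source

X_eval = rng.normal(0, 1, (1000, d))
y_eval = X_eval @ w_target_true
err_scratch = np.mean((X_eval @ w_scratch - y_eval) ** 2)
err_transfer = np.mean((X_eval @ w_transfer - y_eval) ** 2)
print(err_scratch, err_transfer)
```

With five target samples in ten dimensions, the from-scratch model cannot pin down the relationship, while starting from the source weights leaves only the small task difference to learn, mirroring the group's motivation for reusing learned knowledge.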
Image data
The purpose of this study was to use few-shot learning methods (methods that learn from very limited training examples) to classify new data. The group’s objectives were to a) reproduce the results of prototypical networks trained on the Omniglot dataset and the mini-ImageNet dataset separately, and b) reproduce the results of model-agnostic meta-learning algorithms trained on the Omniglot dataset.
The working group demonstrated that prototypical networks can tackle dataset shift using few-shot learning on the Omniglot and mini-ImageNet datasets. However, performance dropped significantly when running prototypical networks on different combinations of datasets – for instance, when training networks on the Omniglot dataset and then testing on the mini-ImageNet dataset. The group also demonstrated that model-agnostic meta-learning algorithms could tackle dataset shift when trained on the Omniglot dataset.
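The core of a prototypical network is simple enough to sketch without a learned embedding: each class prototype is the mean of its support examples in embedding space, and queries are assigned to the nearest prototype. A toy numpy version of one 3-way, 5-shot episode, using hypothetical 2-D points in place of CNN embeddings of Omniglot characters:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy episode: 2-D points stand in for learned embeddings.
def make_class(center, n):
    return center + rng.normal(0, 0.3, (n, 2))

centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
support = [make_class(c, 5) for c in centers]         # 3-way, 5-shot support
query = np.vstack([make_class(c, 20) for c in centers])
query_labels = np.repeat(np.arange(3), 20)

# Prototype = mean embedding of each class's support examples.
prototypes = np.array([s.mean(axis=0) for s in support])

# Classify each query by its nearest prototype (squared Euclidean distance).
dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
pred = dists.argmin(axis=1)
accuracy = (pred == query_labels).mean()
print(accuracy)
```

The dataset-shift failure the group observed corresponds to the embedding step this sketch omits: an encoder trained on Omniglot places mini-ImageNet images poorly, so prototypes from one dataset do not separate queries from the other.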
The Dataset Shift Project resulted in significant knowledge transfer between Vector researchers and industry participants. Industry participants developed proficiency in dataset shift detection, identification, and correction methodologies, established best practices in accordance with the latest academic and industry standards, and gained skills that can increase the resilience of organizations and their workforces in the face of changing environments. If put into production, these approaches have the potential to deliver enhanced efficiency, adaptability, and cost-savings. Finally, this project also demonstrated the value of collaborative efforts between industry and academia, and laid the groundwork for future projects focused on building a deeper understanding of dataset shift and methods for mitigating its effects in practical settings.
At the upcoming ECML PKDD 2021, three project participants will share insights on four main topics: the principles behind dataset shift, strategies for detecting dataset shift, adaptation techniques, and advanced topics in dataset shift for enhancing machine learning models in situations where shift is inevitable. The presenters are Ali Pesaranghader, former Senior AI Research Scientist at CIBC and a participant in the Dataset Shift Project; Mehdi Ataei, a Vector Research Affiliate who led technical elements of the project at Vector; and Sedef Akinli Kocak, Vector Applied AI Project Manager.