

DSML 2021
Data Shift in Machine Learning:
What is it and what are potential remedies

Machine learning models are conventionally trained under the premise that the training (source) data and the real-world (target) data are sampled from the same distribution. This assumption can lead to poor predictions in dynamic environments where the distribution of the data changes over time, a phenomenon known as dataset shift.

In most real-world situations, machine learning models have to cope with dataset shift after deployment. The shift in the distribution can be dramatic and can arise from unexpected events, e.g., the outbreak of the COVID-19 pandemic or cyber attacks.

This tutorial will present:

  1. The principles behind data shift.
  2. Strategies for detecting dataset shift.
  3. Techniques for correcting it.
  4. Advanced topics in data shift for enhancing machine learning models in situations where dataset shift is inevitable.


Description

This tutorial aims to provide a comprehensive understanding of dataset shift and explore potential remedies in the face of distribution shift. The learning outcomes for the participants are:

  • Understand the characteristics of dataset shift through real-world examples and the relationship between input features and output changes;
  • Learn the different types of dataset shift and the associated terminology;
  • Learn approaches to detect dataset shift, including domain classification, the multivariate Kolmogorov–Smirnov test, and black box shift estimation;
  • Gain knowledge of different techniques to correct for dataset shift, such as sample re-weighting and mapping to a common feature space;
  • Learn how to adapt an existing ML model to shifts using transfer learning and active learning rather than retraining on the target task from scratch;
  • Get acquainted with computational techniques and libraries through hands-on practice;
  • Become familiar with recent topics and open problems in the field.
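
To give a flavour of the detection approaches listed above, the following is a minimal sketch of a Kolmogorov–Smirnov check and is not taken from the tutorial materials. It makes a deliberate simplification: instead of a true multivariate test, it runs a univariate two-sample KS test on each feature and applies a Bonferroni correction. The p-value uses the standard asymptotic Kolmogorov series, which is only an approximation for small samples. The function names (ks_statistic, ks_pvalue, detect_shift) and the synthetic data are hypothetical.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (approximate)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The p-value is numerically indistinguishable from 1 here.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def detect_shift(source_rows, target_rows, alpha=0.05):
    """Per-feature KS tests with a Bonferroni-corrected threshold.

    Returns one (feature_index, statistic, p_value, shifted) tuple per feature.
    """
    n_features = len(source_rows[0])
    threshold = alpha / n_features  # Bonferroni correction
    report = []
    for f in range(n_features):
        src = [row[f] for row in source_rows]
        tgt = [row[f] for row in target_rows]
        d = ks_statistic(src, tgt)
        p = ks_pvalue(d, len(src), len(tgt))
        report.append((f, d, p, p < threshold))
    return report


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: feature 0 is unchanged, feature 1 has a shifted mean.
    source = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
    target = [[random.gauss(0, 1), random.gauss(1, 1)] for _ in range(500)]
    for f, d, p, shifted in detect_shift(source, target):
        print(f"feature {f}: D={d:.3f}, p={p:.3g}, shifted={shifted}")
```

Testing features one at a time is cheap and easy to interpret, but it can miss shifts that only appear in the joint distribution. Domain classification is one of the listed approaches that looks at all features together.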

All machine learning researchers, practitioners, novices and graduate students will benefit from this tutorial as we cover topics ranging from basic to advanced. Experienced machine learning experts will benefit from the advanced topics sessions as we will discuss state-of-the-art solutions. We will share our slides and hands-on materials with participants before the sessions via Jupyter Notebook or Google Colab.


Outline

Developing adaptive methods to deal with dataset shift is an open problem in machine learning, because the shift complicates the deployment of ML tools. This tutorial covers solutions to such problems in the following sessions:

Duration / Title (Presenter) / Topics Covered
45 min Characterizing Dataset Shift in Machine Learning
(Ali Pesaranghader)
  • Overview of different types and causes of dataset shift
  • Terminology and taxonomy
  • Theory of learning from different domains
  • Limitations in domain adaptation and open problems
45 min Characterizing Dataset Shift in Machine Learning
(Ali Pesaranghader)
  • Covariate shift correction using sample re-weighting
  • Important considerations in determining a reliable importance weight estimation
  • Estimating target label distribution
  • Label shift correction using black-box predictors
  • Active nearest neighbor querying strategy with nearest neighbor prediction
30 min Break
45 min Advanced Topics: Transfer Learning and Active Learning
(Mehdi Ataei)
  • Employing a common learned feature representation
  • Domain-adversarial training of neural networks for generalization
  • Active learning techniques to prioritize the labelling of new data
  • Pool-based and stream-based active learning
45 min Hands-on Practice
(Mehdi Ataei)
  • Practical use of available packages in dataset shift detection and correction
  • Developing machine learning pipelines for dataset shift correction and domain adaptation
15 min Q&A
15 min Conclusion
(Sedef Akinli Kocak)
  • Industrial applications and examples in businesses
  • Remarks
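
As an illustration of the "covariate shift correction using sample re-weighting" topic in the outline above, the sketch below re-weights source samples by w(x) = p_target(x) / p_source(x) so that averages computed on source data approximate averages under the target distribution. It is a toy example under strong assumptions, not the presenters' method: both densities are known one-dimensional Gaussians, whereas in practice the weights have to be estimated from data. All function names and parameters here are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(xs, source_pdf, target_pdf):
    """Importance weight w(x) = p_target(x) / p_source(x) per source sample."""
    return [target_pdf(x) / source_pdf(x) for x in xs]


def weighted_mean(values, weights):
    """Self-normalized importance-weighted average."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


if __name__ == "__main__":
    random.seed(0)
    # Assumed (known) densities: source N(0, 1), target N(1, 1).
    source_pdf = lambda x: gaussian_pdf(x, 0.0, 1.0)
    target_pdf = lambda x: gaussian_pdf(x, 1.0, 1.0)

    xs = [random.gauss(0.0, 1.0) for _ in range(20000)]
    losses = [x * x for x in xs]  # stand-in for a per-sample loss
    weights = importance_weights(xs, source_pdf, target_pdf)

    print("unweighted source average:", sum(losses) / len(losses))
    print("re-weighted (target) estimate:", weighted_mean(losses, weights))
```

The same weights can be passed to a learner as per-sample weights during training. When the two distributions overlap poorly, a few samples receive very large weights and the estimate becomes unstable, which is why the outline lists reliable importance weight estimation as a separate consideration.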


Organizers / Tutorial Leaders

Ali Pesaranghader

Sr. Research Scientist
CIBC

Ali Pesaranghader is a Sr. Research Scientist at the Canadian Imperial Bank of Commerce (CIBC) with primary research interests in adaptive learning, data stream mining, natural language processing, and transfer learning. Ali obtained his Ph.D. in Computer Science with a focus on Adaptive Machine Learning at the University of Ottawa in 2018.


Mehdi Ataei

Research Affiliate
Vector Institute

Mehdi Ataei is a research affiliate at the Vector Institute with a Ph.D. in Computational Physics from the University of Toronto. His research focuses on computational physics, applied mathematics, optimization, and machine learning.


Sedef Akinli Kocak

Project Manager
Vector Institute

Sedef Akinli Kocak is an academic-industry R&D partnership and project manager in the area of AI/ML and an accomplished researcher in the areas of ICT for Sustainability and Advanced Analytics. She has a Ph.D. in Environmental Applied Science and Management from the Data Science Lab at Ryerson University. She is currently with the Vector Institute as an AI Project Manager. She is also a part-time lecturer and supervisor in the Data Science and Analytics Program at Ryerson University.


