August 17, 2021
By Jonathan Woods
Past performance doesn’t guarantee future results, so the axiom goes. In an AI context, nothing embodies this notion more than dataset shift. It’s a crucial yet underappreciated issue for most companies using AI, and one in which Vector industry sponsor CIBC has particular expertise.
Understanding dataset shift first involves recognizing an assumption that’s made when training AI systems: that the data used for training will have a similar distribution to the data used once the system is in production. To illustrate: a retailer may train a system to predict sales in the near future by feeding it sales data from the last three years, assuming broad patterns in that dataset will remain the same, more or less.
In reality, data distributions change. AI systems operate in dynamic and evolving environments, and this means that, eventually, differences will likely arise between training data and data collected for a model’s actual use. That difference is called dataset shift. When it’s severe – possibly due to a big event like a global pandemic – it requires correction. Otherwise, the model’s prediction performance can degrade seriously, or the model has to be re-trained from scratch on new datasets at considerable cost.
CIBC’s Advanced Analytics and Artificial Intelligence team – which works on projects at the bank ranging from improving client experience and personalising offers to reducing fraud and speeding transaction processing – has notable expertise in handling data in a dynamic world. As a Vector Institute sponsor, CIBC recently bolstered that expertise by participating in Vector’s Dataset Shift Project, an industry-academia collaboration led by Vector’s Industry Innovation Team. The project was a response to the pandemic and the drastic shifts that it (and ensuing government responses) brought about in citizen and consumer behaviour. The project focused on equipping participants with a better understanding of dataset shift principles, strategies for detecting shift, and techniques for adapting to it.
At CIBC, the list of areas where dataset shift identification and adaptation techniques may be relevant is long. Ali Pesaranghader, then a Senior AI Research Scientist at CIBC and a participant in the Dataset Shift project, says, “Data shift may potentially appear in forms in client experience, product acquisition, ATM cash demand, fraud detection, treasury deposits, amongst other applications.” He explains that shift may arise from changes in portfolio composition or market conditions “due to a portfolio maturing or COVID-19’s impact on customer’s cross-border transaction behaviour.”
In the Dataset Shift project, Pesaranghader joined the workstream focused on shift in cross-sectional data, which, as the name implies, refers to data collected from a cross-section of some population – maybe customers, companies, or products – at a single point or period in time. The aims in this workstream were first to detect whether dataset shift was present and then, if it was, to correct it by adapting algorithms to restore model performance. Workstream members focused on a variation called covariate shift, which refers to a change in the distribution of a model’s input variables. Covariate shift can occur for several reasons: a model’s training data may have lacked sufficient randomness, sampling of that data may have been biased in some way, or some large-scale event – a financial crisis, a pandemic, or a natural disaster – may have had such a large effect on a subject population that it changed long-standing patterns.
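To make the detection step concrete, here is a minimal sketch of one common way to check for covariate shift: comparing each input feature's distribution in the training data against newly collected data with a two-sample Kolmogorov-Smirnov (KS) test. This is an illustration under stated assumptions, not the workstream's actual method. The feature names, the synthetic data, the significance level, and the Bonferroni correction are all hypothetical choices, and the p-value uses a standard asymptotic approximation.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The maximum gap between two step functions occurs at a data point.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_p_value(d, n, m, terms=100):
    """Asymptotic approximation of the two-sample KS p-value."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        return 1.0  # the series below is unreliable near zero; p is ~1 here
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_shift(train_rows, prod_rows, features, alpha=0.05):
    """Flag features whose distribution differs between the two datasets."""
    threshold = alpha / len(features)  # Bonferroni correction (an assumption)
    report = {}
    for name in features:
        a = [row[name] for row in train_rows]
        b = [row[name] for row in prod_rows]
        d = ks_statistic(a, b)
        p = ks_p_value(d, len(a), len(b))
        report[name] = {"statistic": d, "p_value": p, "shifted": p < threshold}
    return report


def make_demo_data(seed=0, n=500):
    """Synthetic data: 'year_built' drifts, 'overall_quality' does not."""
    rng = random.Random(seed)
    train = [
        {"year_built": rng.gauss(1975, 20), "overall_quality": rng.gauss(6, 1.5)}
        for _ in range(n)
    ]
    prod = [
        {"year_built": rng.gauss(1990, 20), "overall_quality": rng.gauss(6, 1.5)}
        for _ in range(n)
    ]
    return train, prod


if __name__ == "__main__":
    train_rows, prod_rows = make_demo_data()
    features = ["year_built", "overall_quality"]
    for name, result in detect_shift(train_rows, prod_rows, features).items():
        print(name, result)
```

A per-feature test like this is easy to interpret, but it looks at one variable at a time and can miss shifts that only show up in how features vary together.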
The group performed their analysis using house sale price data from Iowa covering 2006 to 2010. This data included house-related features that impact price, like the year houses were built, the neighbourhood they were in, and the overall quality of their materials and finish. They sought to identify shift related to each of these features. In other words, they wanted to determine whether each feature – as a variable that affects price – maintained consistency in its contribution to that price, or whether at some point, it showed a marked difference in its effect. If a variable was found to no longer contribute to a house’s price the way it did in the data used to train the model, it’s likely that predictions about future prices would suffer, and the model may require correction. Through experiments, the group concluded that shift detection and correction techniques are effective and advantageous in most cases where covariate shift occurs, and it created a new set of best practices for dealing with it.
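The article does not say which correction techniques the group used. One widely used approach to covariate shift is importance weighting: training examples that resemble the new data are given more weight when the model is re-fit. The sketch below estimates those weights with a small logistic "domain classifier" that learns to tell training data from new data. Everything here is an assumption for illustration: the single standardized feature, the synthetic data, and the learning rate and step count.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_domain_classifier(train_x, prod_x, lr=0.5, steps=300):
    """Fit P(new data | x) = sigmoid(a + b * x) by gradient descent.

    Training points are labelled 0 and new (production) points 1.
    """
    data = [(x, 0.0) for x in train_x] + [(x, 1.0) for x in prod_x]
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in data:
            err = sigmoid(a + b * x) - y
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / len(data)
        b -= lr * grad_b / len(data)
    return a, b


def importance_weights(train_x, prod_x):
    """Estimate w(x) = p_new(x) / p_train(x) for each training point."""
    a, b = fit_domain_classifier(train_x, prod_x)
    # Rescale by the sample-size ratio so the classifier's odds become
    # a density ratio even when the two datasets differ in size.
    prior = len(train_x) / len(prod_x)
    # For a logistic model, the odds p / (1 - p) equal exp(a + b * x).
    return [prior * math.exp(a + b * x) for x in train_x]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical standardized feature whose mean drifts from 0 to 1.
    train_x = [rng.gauss(0.0, 1.0) for _ in range(500)]
    prod_x = [rng.gauss(1.0, 1.0) for _ in range(500)]
    weights = importance_weights(train_x, prod_x)
    print("unweighted training mean:", sum(train_x) / len(train_x))
    print("weighted training mean:  ", weighted_mean(train_x, weights))
    print("new-data mean:           ", sum(prod_x) / len(prod_x))
```

The resulting weights would typically be passed as per-sample weights when re-fitting the prediction model, so that it concentrates on the regions of the input space that matter after the shift. This only helps when the training data still covers those regions; if the new data falls largely outside the old range, a few examples end up with very large weights and the correction becomes unstable.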
For CIBC’s AI professionals, working with Vector researchers and staff enhances their capabilities in this field, which are already advanced. Andrew Brown, Senior Director of Data Science and AI Research with CIBC (with a Ph.D. in machine learning, having studied under Vector co-founder and Chief Scientific Advisor Geoffrey Hinton) says, “You get an immediate concrete benefit when people work on a project together with guidance, and guided by Vector, they’re steered in the right direction. There’s hands-on exposure to techniques and technologies. That’s a clear benefit that comes out of these projects.”
That benefit was multiplied when Pesaranghader and Mehdi Ataei, a Vector Postgraduate Affiliate who led technical elements of the project at Vector, shared insights related to their work with the CIBC Analytics Edge community, an internal group of machine learning professionals and data scientists from various lines of business. Pesaranghader says, “We compressed whatever we had in the project, and focused on the methods and their applicability in different situations.”
This proved a preview of a larger opportunity for CIBC and Vector to present AI-related thought leadership: the duo, along with Vector Project Manager Sedef Akinli Kocak, will present at the European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases, a top-ranked machine learning conference scheduled for September 2021.
When put into production, this expertise and new advanced toolbox can enable CIBC to maximize the effectiveness and resilience of their AI systems, even during times of sudden and severe external change. That’s a significant technical advantage in a world where it’s become all too clear that such change can and does happen.
_ _ _
Access the Vector Institute Industry Innovation technical report “Dataset Shift and Potential Remedies”, here.