The known unknowns: Vector researcher Geoff Pleiss digs deep into uncertainty to make ML models more accurate

May 3, 2024

By Michael Barclay

“We could get down to a very philosophical debate about what uncertainty even is.”

Talking to Geoff Pleiss about his research might be more than a layperson can process. But with so many questions about the efficacy and accuracy of AI, uncertainty is a key area of research.

The UBC statistician and Canada CIFAR AI Chair defines his work as “quantifying uncertainty of machine-learning models.” The known unknowns. “Uncertainty information is useful in many applications,” Pleiss explains. “We use a machine-learning model to make a prediction, and we want to know how much we can trust that prediction.” 

The obvious example would be a self-driving car, involving what Pleiss calls “safety-critical predictions. I really want a good notion of uncertainty, so that I could basically at any moment tell a human driver they need to intervene. 

“Or in health care: if a neural network is likely to be incorrect — and especially if it’s likely to be incorrect because it’s encountering data that looks nothing like what I saw during training — I really want a good notion of uncertainty.” 

The other application of Pleiss’s research is “in sequential decision-making problems, or reinforcement-learning applications.” An example would be a company running A/B tests on new features and deciding which feature to test next. Uncertainty estimates of this kind are especially useful in medicine.


Says Pleiss, “If I’m running a chemical lab, trying to figure out what molecule I should synthesize next, as part of some early-stage drug development, I could have a really good sense of what set of chemicals I’ve tried so far that worked really well. And then here’s a set of chemicals that I haven’t played around with at all. I could keep doing exploitation experiments: If one sort of set of chemicals is really good so far, do I just keep fine-tuning into this region? Or do I go with this other set of chemicals I know nothing about: they could be really bad, but they could also be really good. How do I balance that trade-off? 

“So bringing it all together, I would say the two sets of applications are safety-critical applications or broader decision-making, especially decisions around experimental design. Or more generally, if we’re using this prediction as part of some decision-making process downstream, we would want to utilize that uncertainty information in order to make a better-informed decision.” 
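The trade-off Pleiss describes, between refining what already works and trying what is unknown, can be illustrated with a short sketch. It uses an upper-confidence-bound rule, a standard technique chosen here for illustration and not necessarily the method Pleiss uses. The function name `pick_next_experiment`, the `beta` parameter, and all of the numbers are hypothetical.

```python
import numpy as np

def pick_next_experiment(pred_mean, pred_std, beta=1.0):
    """Return the index of the candidate with the highest optimistic score.

    The score is predicted quality plus beta times predicted uncertainty.
    beta = 0 is pure exploitation; a larger beta favours exploration.
    """
    scores = np.asarray(pred_mean) + beta * np.asarray(pred_std)
    return int(np.argmax(scores))

# Hypothetical candidates: two well-studied molecules and one untested one.
pred_mean = [0.80, 0.82, 0.50]  # predicted quality (made-up values)
pred_std = [0.02, 0.03, 0.40]   # predictive uncertainty (made-up values)

exploit_choice = pick_next_experiment(pred_mean, pred_std, beta=0.0)
explore_choice = pick_next_experiment(pred_mean, pred_std, beta=1.0)
print(exploit_choice, explore_choice)
```

With `beta=0.0` the rule picks the best-known candidate. With `beta=1.0` the large uncertainty on the untested candidate outweighs its lower predicted quality, so the rule picks it instead. The choice depends entirely on how trustworthy the uncertainty estimates are.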

If we can be somewhat certain about the definition and value of uncertainty, can we be as sure that it applies to the state of AI today, when neural networks are so massive and deal with unimaginable quantities of data?

Exploring uncertainty was simpler with machine-learning models used 20 or 30 years ago. But modern neural networks “pose a big challenge,” says Pleiss, “because they’re very large and unwieldy. A lot of the techniques used historically don’t really apply to neural networks. We don’t really even know what’s going on under the hood. And these models are so big and expensive to train in the first place. Now we’re trying to not just make a prediction, but also to get some notion of uncertainty out of it. It’s a hard problem. I don’t think the community has quite converged on the right way to approach it.”

Pleiss works on what’s called “deep ensembling.” That means training not one but several neural networks on the same task, injecting random elements into each independent training process so that the networks differ slightly from one another. “We get a set of predictions from these neural networks, and now we have this set of predictions rather than a single prediction, and we can see how much variance there is.”
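As a rough illustration of the idea, the sketch below trains a handful of tiny networks that differ only in their random seed, then measures how much their predictions vary. It is a toy under stated assumptions and not Pleiss's experimental setup: the one-hidden-layer network, the synthetic sine data, and names such as `train_tiny_net` and `ensemble_predict` are all hypothetical.

```python
import numpy as np

def train_tiny_net(x, y, seed, hidden=16, steps=500, lr=0.05):
    """Train a one-hidden-layer tanh network with full-batch gradient descent."""
    rng = np.random.default_rng(seed)  # the seed is the only source of randomness
    w1 = rng.normal(0.0, 1.0, size=(1, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    n = len(x)
    for _ in range(steps):
        h = np.tanh(x @ w1 + b1)              # hidden activations, shape (n, hidden)
        err = (h @ w2 + b2) - y               # prediction error, shape (n, 1)
        # Gradients of half the mean squared error.
        grad_w2 = h.T @ err / n
        grad_b2 = err.mean(axis=0)
        grad_h = (err @ w2.T) * (1.0 - h**2)  # backpropagate through tanh
        grad_w1 = x.T @ grad_h / n
        grad_b1 = grad_h.mean(axis=0)
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return lambda x_new: (np.tanh(x_new @ w1 + b1) @ w2 + b2).ravel()

def ensemble_predict(x_train, y_train, x_test, n_members=5):
    """Train n_members networks independently; return mean and variance of predictions."""
    members = [train_tiny_net(x_train, y_train, seed=s) for s in range(n_members)]
    preds = np.stack([m(x_test) for m in members])  # shape (n_members, n_test)
    return preds.mean(axis=0), preds.var(axis=0)

# Synthetic data, made up for this example.
data_rng = np.random.default_rng(0)
x_train = data_rng.uniform(-1.0, 1.0, size=(64, 1))
y_train = np.sin(3.0 * x_train) + 0.1 * data_rng.normal(size=(64, 1))

# One test point inside the training range and one far outside it.
x_test = np.array([[0.0], [3.0]])
mean, var = ensemble_predict(x_train, y_train, x_test)
print("ensemble mean:", mean)
print("ensemble variance:", var)
```

The variance across members is the quantity read as uncertainty: where the networks disagree, the ensemble's prediction deserves less trust.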

Here’s the weird thing: there’s not much variance. Not at all.

“Neural networks are surprisingly homogeneous,” says Pleiss, “no matter how we change the architecture or the training procedure or all of that. They are basically doing the same thing.” Even when they’re based on totally different architectures. “I would expect the space of possible predictions to be growing larger and larger. But in fact, the opposite is happening. They’re collapsing onto one another. They’re all starting to produce exactly the same prediction. From an uncertainty-quantification perspective, this is quite troubling.”

Think about the millions of songs available on streaming services, and how collective taste still collapses on a select few artists enhanced by algorithms. “How do you even go about discovering something in there?” wonders Pleiss. “How do you find that needle in a haystack? In order to get any sort of signal from that noise, you need a really strong set of assumptions, a really strong set of preferences. There are not many sets of strong assumptions that are going to work, and so you are going to end up with some level of homogeneity. 


“What is happening with neural networks is that these models are so big and so complex, that even though we’re training them on these very large training data sets, the space of possible predictions this neural network represents dwarfs the amount of data we’re training them on. There just aren’t many ways to find needles in a haystack, especially as these models are growing larger and larger.” 

Pleiss and his researchers tried to force the models to make diverse predictions. “What was surprising was: that didn’t help,” he says. “In fact, it actually made these models a lot worse. So even though the models were making potentially more diverse predictions, they were also becoming a lot less useful than if you’d ask them: where’s the best place to buy a toothbrush? And it would say ‘Staples’ or something like that.” 

What worked on smaller neural networks no longer works with larger ones, says Pleiss. In fact, the inverse is true. “Even if I took neural networks that were all very small, and tried to make them more diverse, I would see improvements in my accuracy. And if I tried to make these very small neural networks have groupthink, that would lead to worse predictive accuracy. We’re really seeing a phase transition as we go from small predictive models to these very large neural networks. It’s a lot harder and potentially counterproductive to try to get predictive diversity out of these models that would be useful for quantifying uncertainty.”

“A lot of intuitions from the classical statistical and machine learning approaches break down when we’re looking at these very large models. We used to think that making models more diverse should give us better uncertainty information, but this intuition is entirely incorrect with large neural networks. There’s been a large line of work demonstrating how large models defy our standard ways of thinking about statistical modelling. This is one piece of that puzzle.”

