# ICLR 2021: Researchers adopt teaching tricks to train neural networks like they’re students

April 23, 2021

*By Ian Gormely*

Members of Vector’s research community are readying themselves for the 2021 edition of the International Conference on Learning Representations (ICLR), one of the premier deep learning conferences in the world. This year’s conference will be held virtually from May 3 to May 7.

Vector Faculty Members had a number of papers accepted to the conference. ICLR had almost 3000 papers submitted for consideration this year, and only accepted a quarter of them.

Among the accepted papers from Vector Faculty is “Teaching With Commentaries,” which imagines neural networks as students. It was co-authored by Vector co-founder and Chief Scientific Advisor Geoffrey Hinton, Faculty Member David Duvenaud, and researchers Aniruddh Raghu, Maithra Raghu, and Simon Kornblith.

“Real teachers have to learn how to teach,” says Duvenaud, “and you don’t just show students a bunch of data without offering help.” Just as students need context to understand what they’re being taught, “neural networks can benefit from commentaries tailored to the way they learn.” To this end, he and his co-authors tried to build a system that automatically helps the students – the neural networks – learn better.

In doing so, neural networks are able to learn faster and make more accurate predictions (known as reducing overfitting), and researchers are able to gain some insight into how neural networks work, enabling them to better understand the “black box.” “We think that seeing which types of commentaries were helpful should shed some light on what the neural network was taking away from the data under different circumstances.”

When training a classifier, Duvenaud and his co-authors found that it was helpful to emphasize different training data depending on the task. When there is little overlap between the classes, say pictures of cars and planes, where one has wheels and the other has wings, it was best to focus on the most stereotypical examples of each class. But when there is no hard distinction, like an arbitrary cut-off between tall and short people, the classifiers did better when the borderline examples were emphasized.
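One simple way to picture "emphasizing" some training examples over others is to give each example its own weight in the loss. The sketch below is a minimal illustration under that assumption, not the authors' implementation: the function name `weighted_cross_entropy`, the predictions, and the weights are all hypothetical, and the weights are fixed by hand rather than learned.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of per-example binary cross-entropy losses."""
    probs = np.clip(probs, 1e-12, 1 - 1e-12)  # avoid log(0)
    per_example = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.sum(weights * per_example) / np.sum(weights))

# Hypothetical predictions: two confident (stereotypical) examples at the ends,
# two borderline examples in the middle.
probs = np.array([0.9, 0.6, 0.4, 0.1])
labels = np.array([1.0, 1.0, 0.0, 0.0])

uniform = weighted_cross_entropy(probs, labels, np.ones(4))
borderline = weighted_cross_entropy(probs, labels, np.array([0.1, 1.0, 1.0, 0.1]))
stereotypical = weighted_cross_entropy(probs, labels, np.array([1.0, 0.1, 0.1, 1.0]))

print(uniform, borderline, stereotypical)
```

Up-weighting the borderline examples makes them dominate the training signal, while up-weighting the stereotypical ones does the opposite. In the paper's setting, weights of this kind are learned rather than chosen by hand.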

Below are abstracts and simplified summaries for many of the accepted papers co-authored by Vector Faculty Members.

**An Inequality Benchmark for Evaluating Generalization in Theorem Proving**

*Yuhuai Wu, Albert Jiang, Jimmy Ba, Roger Grosse*

Interactive theorem provers give a powerful way to formally verify mathematical theorems and computer software, but it is notoriously difficult and time-consuming for humans to write formal proofs. If we could train a deep learning agent to complete all or part of the proof, that would dramatically expand the scope of what can be formally verified. We developed a dataset consisting of synthetically generated mathematical inequality problems and used this dataset to assess the ability of deep learning systems to generalize to theorem statements unlike what they’ve seen before.

**A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks**

*Renjie Liao, Raquel Urtasun, Richard Zemel*

Graph neural networks (GNNs) have recently become popular in handling graph-structured data like molecule generation for drug discovery and link prediction in social networks. In this paper, we aim to understand why GNNs can generalize from training graphs to unseen testing graphs. First, we show generalization bounds for the popular variants of GNNs via a PAC-Bayesian approach. Our result reveals that the maximum node degree of graphs and spectral norm of weights govern the generalization bounds of GNNs. Moreover, our PAC-Bayes bound improves over the previous Rademacher complexity-based bound, showing a tighter value empirically on both synthetic and real-world graph datasets.
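The two quantities named in the summary, the maximum node degree of a graph and the spectral norm of a weight matrix, are straightforward to compute. The sketch below only computes those quantities for a hypothetical toy graph and weight matrix; it does not reproduce the paper's bound, and the helper names are illustrative.

```python
import numpy as np

def max_node_degree(adjacency):
    """Largest number of neighbours of any node in an undirected, unweighted graph."""
    return int(np.max(np.sum(adjacency, axis=1)))

def spectral_norm(weights):
    """Largest singular value of a weight matrix."""
    return float(np.linalg.norm(weights, ord=2))

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Hypothetical layer weights (diagonal, so the singular values are easy to read off)
weights = np.diag([3.0, 1.0])

print(max_node_degree(adjacency))
print(spectral_norm(weights))
```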

**Bayesian Few-Shot Classification with One-vs-Each Polya-Gamma Augmented Gaussian Processes**

*Jake Snell, Richard Zemel*

Training a deep image classifier is an expensive and time-consuming process. We would like to train a model once and deploy it, but in real life a classifier will encounter hard-to-classify images, including ones from classes it has never seen before. We developed a novel approach based on Gaussian processes that averages over an infinite number of models, weighted by their fit to the new data. Our algorithm is better calibrated (improved estimates of its confidence in its predictions) than previous baselines while exhibiting strong accuracy in this challenging setting.

**CaPC Learning: Confidential and Private Collaborative Learning**

*Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang*

CaPC is a protocol for collaborative machine learning with strong guarantees of privacy and confidentiality. Organizations that trained models locally can now collaborate and jointly make predictions without revealing to one another the inputs they are predicting on, their models, or their training data. CaPC is model agnostic – each of the collaborating parties can use different types of architecture. Our framework improves the fairness of models even when dealing with non-uniform data distribution, especially when combined with active learning. CaPC provides a new way to realize the requirements imposed by privacy legislation while making few changes to existing ML pipelines.

**C-Learning: Horizon-Aware Cumulative Accessibility Estimation**

*Panteha Naderian, Gabriel Loaiza-Ganem, Harry J. Braviner, Anthony L. Caterini, Jesse C. Cresswell, Tong Li, Animesh Garg*

C-learning is a novel reinforcement learning algorithm with applications in robotics and path planning. The main objective of a C-learning agent is to learn efficient paths to prespecified goals. By training the value function with an additional parameter, the horizon, this method addresses three shortcomings of previous work: 1) it learns shorter and more efficient paths to a goal; 2) it learns tasks with less experience and training data; and 3) it finds multiple ways to reach a goal. Users can therefore choose between different paths based on their preferences for speed and reliability.

**Conservative Safety Critics for Exploration**

*Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg*

This paper presents a new approach for safety in the context of reinforcement learning (RL) for robotics. RL is a trial-and-error learning paradigm in which an agent interacts with the environment and is rewarded for desirable behavior and penalized for undesirable behavior, so that desirable behavior is reinforced over time. While training robots to solve a particular task with RL, it is important to avoid undesirable behaviors that could lead to catastrophic failures, for example damaging the robot. This paper presents an algorithm for training RL agents that provably constrains the probability of catastrophic failures, thereby enabling robots to be trained safely.

**Dataset Inference: Ownership Resolution in Machine Learning**

*Pratyush Maini, Mohammad Yaghini, Nicolas Papernot*

Are you concerned that an ML model may be a stolen copy of your proprietary model? We make the pessimistic but realistic observation that one cannot prevent model stealing. Instead, in Dataset Inference, we seek to detect if an adversary has stolen the model after the fact. Our key insight is that the model owner’s most valuable intellectual property is the dataset they trained on. Therefore, no matter how an adversary attempts to steal, their model will contain information private to the victim dataset. Dataset inference uses these signals to distinguish suspect model behavior on samples from training and unseen data and determine whether an adversary has used private knowledge.

**Emergent Road Rules In Multi-Agent Driving Environments**

*Avik Pal, Jonah Philion, Yuan-Hong Liao, Sanja Fidler*

To safely share the road with human drivers, self-driving cars must abide by “road rules” that human drivers follow. “Road rules” include law-enforced rules, like the requirement that vehicles stop at red lights, as well as social rules like the implicit designation of fast lanes. We show that in simulated driving environments in which agents seek to reach their destinations quickly, the agents develop road rules that mimic road rules humans have developed. Our results suggest the feasibility of a new paradigm for self-driving in which agents trained fully in simulation can be deployed in the real world.

**Latent Skill Planning for Exploration and Transfer**

*Kevin Xie, Homanga Bharadhwaj, Danijar Hafner, Animesh Garg, Florian Shkurti*

This paper describes an approach for learning re-usable skills to solve tasks with robots efficiently. We let the robot interact with the environment and build up a world model of how the environment changes in response to the actions of the robot. Using this world model, the robot plans for a sequence of high level skills that are required to solve a particular task, for example walking from point A to B. The key idea of our approach is to learn these skills such that they can be reused for different tasks and slightly different environments. This is important for minimizing the number of interactions the robot has with the environment, which are often expensive and time-consuming.

**No MCMC for me: Amortized Sampling for Fast and Stable Training of Energy-Based Models**

*Will Grathwohl, Jacob Kelly, Milad Hashemi, Mohammad Norouzi, Kevin Swersky, David Duvenaud*

Standard neural network classifiers are simply functions that take in an image and output the probability that the image belongs to each class. There’s a promising alternative way to train these models, called generative modeling, that also learns how to produce realistic images at the same time. One of the advantages of this approach is that it can learn from mostly unlabeled data. However, training these models typically involves an expensive search over images. We show how a machine can learn to perform that search faster during training, letting us fit large models faster, both on large images and on the sorts of unstructured tables of data often seen in business or health settings.

**Planning from Pixels using Inverse Dynamics Models**

*Keiran Paster, Sheila A. McIlraith, Jimmy Ba*

Automatically learning to model the parts of an environment that are relevant to decision making is key to enabling deep reinforcement learning (DRL) agents to solve complicated, real-world tasks. We show that models learned by predicting actions (inverse dynamics) rather than predicting future states (forward dynamics) enable accurate modeling of environment dynamics even in complicated visual environments and can be used to help the agent plan more efficiently. Our new DRL algorithm (GLAMOR) is an exciting step towards enabling agents to model and plan in more complicated environments.

**Teaching With Commentaries**

*Aniruddh Raghu, Maithra Raghu, Simon Kornblith, David Duvenaud, Geoffrey Hinton*

How do teachers learn to teach? One way that teachers help students is by providing commentaries on examples that are being shown to students. We implemented a similar idea for training neural nets. The way our teachers came up with useful commentaries was by simulating a “student” neural network learning from their examples and commentaries, and adjusting the commentaries using backprop to improve the student’s performance. Looking at these commentaries shed light on how the students learned, and which parts of the data were important for different tasks, such as classifying medical images.

**Theoretical Bounds On Estimation Error For Meta-learning**

*James Lucas, Mengye Ren, Irene Raissa Kameni Kameni, Toniann Pitassi, Richard Zemel*

Traditionally, we assume that machine learning models are taught using the same distribution of data that we expect to see in the wild. But this isn’t very realistic. For example, we might have healthcare data from 5 hospitals and want to deploy our model in a new hospital where patient demographics and medical training differ substantially. We investigate the fundamental difficulty of this problem and prove lower-bounds on the best possible performance of any machine learning algorithm in this setting.

**Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding**

*Sana Tonekaboni, Danny Eytan, Anna Goldenberg*

Time series data are often complex and rich in information but sparsely labeled and therefore challenging to model. In this paper, we propose a self-supervised framework for learning generalizable representations for non-stationary time series. Our approach, called Temporal Neighborhood Coding (TNC), takes advantage of the local smoothness of a signal’s generative process to define neighborhoods in time with stationary properties and learns to distinguish neighboring samples using a debiased contrastive objective. Our motivation stems from the medical field, where the ability to model the dynamic nature of time series data is especially valuable for identifying, tracking, and predicting the underlying patients’ states in settings where labeling data is practically impossible.

**Wandering Within A World: Online Contextualized Few-shot Learning**

*Mengye Ren, Michael L. Iuzzolino, Michael C. Mozer, Richard S. Zemel*

We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting, which mimics the visual experience of an agent wandering within a world. We introduce a new dataset based on large scale indoor imagery, and propose a new model that can make use of spatiotemporal contextual information through a combination of short-term and long-term memory.

**When does preconditioning help or hurt generalization?**

*Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, Ji Xu*

One of the most perplexing phenomena in neural net training is that the choice of optimization algorithm affects not only the speed of convergence, but also the generalization ability of the converged solution. One important optimization choice is the preconditioner, which determines how fast parameters move in different directions. We analyze the generalization properties of various preconditioners in the context of linear regression. We find that, in contrast to the popular belief that second-order optimizers generalize worse than first-order ones, the true effect is much more nuanced; we analyze various situations in which preconditioning with second-order information can help or hurt generalization.
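To make the role of a preconditioner concrete, the sketch below runs gradient descent on a small linear regression problem with more parameters than data points, once without preconditioning and once with a diagonal preconditioner. It is a minimal illustration under assumptions of our own (random data, a hand-picked diagonal preconditioner `p`, and a hypothetical helper `preconditioned_gd`), not the analysis from the paper.

```python
import numpy as np

def preconditioned_gd(X, y, p, steps=20000):
    """Gradient descent on mean squared error with a diagonal preconditioner p."""
    n, d = X.shape
    # Step size 1/L, where L is the largest curvature after preconditioning.
    lr = n / np.linalg.norm(X * np.sqrt(p), ord=2) ** 2
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n
        # The preconditioner rescales how fast each parameter moves.
        theta = theta - lr * p * grad
    return theta

# Hypothetical overparameterized problem: 5 data points, 20 parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

theta_plain = preconditioned_gd(X, y, np.ones(20))              # no preconditioning
theta_pre = preconditioned_gd(X, y, np.linspace(0.5, 2.0, 20))  # diagonal preconditioner

print(np.abs(X @ theta_plain - y).max(), np.abs(X @ theta_pre - y).max())
print(np.linalg.norm(theta_plain - theta_pre))
```

Both runs fit the training data, yet they end at different parameter vectors: plain gradient descent started from zero reaches the minimum-norm solution, while the preconditioned run reaches a different one. Which solution predicts better on new data depends on the problem, which is the question the paper examines.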