By Ian Gormely
November 27, 2020

Vector researchers are once again preparing for the premier machine learning conference, the 34th annual conference on Neural Information Processing Systems (NeurIPS). This year’s conference, originally set to be staged in Vancouver, BC, will be virtual, running December 6-12. It will include invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. 

Below are abstracts and simplified summaries for many of the accepted papers and workshops from Vector-affiliated researchers. 

You can read more about Vector’s work at past years’ conferences here and here.

Vector’s research community continues to grow quickly. If you are a Vector-affiliated researcher and don’t see your work represented here, please contact

Conference Papers by Vector Faculty Members & Faculty Affiliates:

Adaptive Gradient Quantization for Data-Parallel SGD
Fartash Faghri (University of Toronto/Vector Institute), Iman Tabrizian (University of Toronto/Vector Institute), Ilia Markov (IST Austria), Dan Alistarh (IST Austria/Neural Magic Inc.), Daniel Roy (University of Toronto/Vector Institute), Ali Ramezani-Kebrya (Vector Institute)
As deep learning scales to larger models and bigger data, researchers are using distributed, parallel algorithms to train faster. This work shows how to reduce the communication overhead by 70%, opening up the possibility for even larger scale compute.

An Implicit Function Learning Approach for Parametric Modal Regression
Yangchen Pan (University of Alberta), Ehsan Imani (University of Alberta), Martha White (University of Alberta), Amir-massoud Farahmand (Vector Institute/University of Toronto)
Learning the relation between the input and the real-valued output is a fundamental problem in machine learning, and is known as the regression problem. The conventional regression methods learn the mean value of an output given its input. This is acceptable when the output for any given input is concentrated around a single mode (unimodal), but it is not when the output has multiple modes. This work develops a new scalable algorithm to learn such a relationship. This is achieved by using the Implicit Function Theorem, which allows us to convert the problem of learning a multi-valued function, which is difficult, to learning a single-valued function, which is easier.

Causal Discovery in Physical Systems from Videos
Yunzhu Li (Massachusetts Institute of Technology), Antonio Torralba (Massachusetts Institute of Technology), Anima Anandkumar (NVIDIA/Caltech), Dieter Fox (NVIDIA/University of Washington), Animesh Garg (University of Toronto/Vector Institute)
Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.

Counterfactual Data Augmentation using Locally Factored Dynamics
Silviu Pitis (University of Toronto/Vector Institute), Elliot Creager (University of Toronto/Vector Institute), Animesh Garg (University of Toronto/Vector Institute)
We detect and leverage local causal independence between objects and features of the world state to improve the sample efficiency of simulated robots in the reinforcement learning setting. We formalize local causal independence using a Local Causal Modeling framework and use it as part of our Counterfactual Data Augmentation algorithm to generate new causally valid data for models to train on.

Curriculum By Smoothing
Samarth Sinha (University of Toronto/Vector Institute), Animesh Garg (University of Toronto/Vector Institute), Hugo Larochelle (Google Brain)
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation. Moreover, recent work in Generative Adversarial Networks (GANs) has highlighted the importance of learning by progressively increasing the difficulty of a learning task (Karras et al.). When learning a network from scratch, the information propagated within the network during the earlier stages of training can contain distortion artifacts due to noise, which can be detrimental to training. In this paper, we propose an elegant curriculum-based scheme that smooths the feature embedding of a CNN using anti-aliasing or low-pass filters. We propose to augment the training of CNNs by controlling the amount of high-frequency information propagated within the CNNs as training progresses, by convolving the output of a CNN feature map of each layer with a Gaussian kernel. By decreasing the variance of the Gaussian kernel, we gradually increase the amount of high-frequency information available within the network for inference. As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data. Our proposed augmented training scheme significantly improves the performance of CNNs on various vision tasks without either adding additional trainable parameters or an auxiliary regularization objective. The generality of our method is demonstrated through empirical performance gains in CNN architectures across four different tasks: transfer learning, cross-task transfer learning, and generative models.
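
The annealing idea in this abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: a 1-D list stands in for a feature map, and the helper names (gaussian_kernel, smooth, sigma_schedule, roughness), the kernel radius, and the linear decay of the standard deviation are all assumptions chosen for illustration.

```python
import math

def gaussian_kernel(sigma, radius=3):
    """Normalized 1-D Gaussian kernel with 2*radius+1 taps (hypothetical helper)."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(features, sigma, radius=3):
    """Low-pass filter a 1-D feature list; edges are handled by clamping indices."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(features)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * features[j]
        out.append(acc)
    return out

def sigma_schedule(epoch, num_epochs, sigma_start=2.0, sigma_end=0.1):
    """Assumed linear decay of the kernel's standard deviation over training."""
    t = epoch / max(num_epochs - 1, 1)
    return sigma_start + t * (sigma_end - sigma_start)

def roughness(signal):
    """Sum of absolute neighbour differences, a crude proxy for high-frequency content."""
    return sum(abs(a - b) for a, b in zip(signal, signal[1:]))

# A rapidly alternating signal stands in for a noisy feature map.
features = [1.0, -1.0] * 8
early = smooth(features, sigma_schedule(0, 10))  # wide kernel: heavy smoothing
late = smooth(features, sigma_schedule(9, 10))   # narrow kernel: almost no smoothing
```

In this toy setting, comparing roughness(early) with roughness(late) shows the intended effect: the wide early kernel suppresses most of the alternating component, while the narrow late kernel lets it through nearly unchanged.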

Deep Learning Versus Kernel Learning: An Empirical Study of Loss Landscape Geometry and Evolution of the Data-Dependent Neural Tangent Kernel
Stanislav Fort (Stanford University/Google Research), Gintare Karolina Dziugaite (Element AI), Mansheej Paul (Stanford University), Sepideh Kharaghani (Element AI), Daniel Roy (University of Toronto/Vector Institute), Surya Ganguli (Stanford University)
We now understand deep learning training in certain limiting regimes, where networks behave like simpler kernel machines. But how do these simplifications relate to real networks that offer stronger empirical performance? In this work, we use an empirical study to connect the geometry of training to the time evolution of the kernel.

Delta-STN: Efficient Bilevel Optimization of Neural Networks using Structured Response Jacobians
Juhan Bae (University of Toronto/Vector Institute), Roger Grosse (University of Toronto/Vector Institute)
Neural net training involves a lot of hyperparameters, i.e. knobs that need to be tuned in order to achieve good performance. We developed an approach to automatically tuning hyperparameters online while a network is training (in contrast with most tuning methods, which require many training runs). The key is to learn the best-response Jacobian, which determines how the optimum of the training objective changes in response to small perturbations to the hyperparameters. This lets us approximately determine how the hyperparameters need to be changed to improve the generalization error.
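
To make the best-response Jacobian concrete, here is a toy sketch in plain Python. It is not the Delta-STN algorithm: the paper learns an approximation for neural networks, whereas this example uses a scalar quadratic problem whose best response is known in closed form. The loss functions, the constants a and b, the step size, and all function names are assumptions made for illustration.

```python
def best_response(lam, a=2.0):
    """Minimizer over w of the toy training loss 0.5*(w - a)**2 + 0.5*lam*w**2."""
    return a / (1.0 + lam)

def best_response_jacobian(lam, a=2.0):
    """Derivative of the minimizer with respect to the hyperparameter lam."""
    return -a / (1.0 + lam) ** 2

def validation_loss(w, b=1.0):
    """Toy validation loss, minimized at w = b."""
    return 0.5 * (w - b) ** 2

def hypergradient(lam, a=2.0, b=1.0):
    """Chain rule: dL_val/dlam = dL_val/dw (at the best response) * dw*/dlam."""
    w_star = best_response(lam, a)
    return (w_star - b) * best_response_jacobian(lam, a)

# Tune lam by gradient descent on the validation loss, using the hypergradient.
lam = 0.2
for _ in range(200):
    lam -= 0.5 * hypergradient(lam)
```

With these assumed constants, the validation loss is minimized when the best response equals b, which happens at lam = 1, and the loop above moves lam toward that value.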

Hybrid Models for Learning to Branch
Prateek Gupta (University of Oxford), Maxime Gasse (Polytechnique Montréal), Elias Khalil (University of Toronto/Vector Institute), Pawan K Mudigonda (University of Oxford), Andrea Lodi (École Polytechnique Montréal), Yoshua Bengio (Mila/Université de Montréal)
A recent Graph Neural Network (GNN) approach for learning to branch has been shown to successfully reduce the running time of branch-and-bound algorithms for Mixed Integer Linear Programming (MILP). While the GNN relies on a GPU for inference, MILP solvers are purely CPU-based. This severely limits its application as many practitioners may not have access to high-end GPUs. In this work, we ask two key questions. First, in a more realistic setting where only a CPU is available, is the GNN model still competitive? Second, can we devise an alternate computationally inexpensive model that retains the predictive power of the GNN architecture? We answer the first question in the negative, and address the second question by proposing a new hybrid architecture for efficient branching on CPU machines. The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching. We evaluate our methods on four classes of MILP problems, and show that they lead to up to 26% reduction in solver running time compared to state-of-the-art methods without a GPU, while extrapolating to harder problems than those they were trained on. The code for this project is publicly available at this https URL.

Exemplar VAEs for Exemplar based Generation and Data Augmentation
Sajad Norouzi (University of Toronto/Vector Institute), David J Fleet (University of Toronto/Vector Institute), Mohammad Norouzi (Google Brain)
Exemplar VAE is a new kind of generative model that combines a neural network encoder-decoder architecture with non-parametric, exemplar-based techniques. The neural network encoder is used to transform an image into a feature space that determines, for a given image, what other images are similar to it. Locations in the feature space that are close to natural images (exemplars) are deemed to represent plausible images. To generate new images according to the model, one first chooses a natural image from a large set of exemplars. One then perturbs it by randomly altering its feature space position, and then transforms that new feature vector back into an image using the neural network decoder. The model performs extremely well in density estimation, and it is shown to be useful for representation learning. One remarkable property of the model is that randomly generated data can be used for generative data augmentation to improve image classifiers.

Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks
Umut Simsekli (Institut Polytechnique de Paris/University of Oxford), Ozan Sener (Intel Labs), George Deligiannidis (University of Oxford), Murat Erdogdu (University of Toronto/Vector Institute)
This paper proves generalization bounds for machine learning models trained with SGD under the assumption that its trajectories can be well-approximated by a heavy tailed diffusion. The generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving diffusion. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of capacity metric.

In Search of Robust Measures of Generalization
Gintare Karolina Dziugaite (Element AI), Alexandre Drouin (Element AI), Brady Neal (Mila), Nitarshan Rajkumar (Mila, Université de Montréal), Ethan Caballero (Mila), Linbo Wang (University of Toronto/Vector Institute), Ioannis Mitliagkas (Mila/University of Montreal), Daniel Roy (University of Toronto/Vector Institute)
How should we evaluate mathematical theories of generalization in deep learning? Recent work proposes using large-scale empirical studies. We argue for the importance of using measures of robustness so that these studies do not mislead us. We find that no existing theories are robust.

Instance Selection for GANs
Terrance DeVries (University of Guelph/Vector Institute), Michal Drozdzal (FAIR), Graham W Taylor (University of Guelph/Vector Institute)
The punchline is that, contrary to ML folklore, “more data is not always better”. We show that by automatically removing data examples from sparse parts of the data manifold, we can improve the sample quality of Generative Adversarial Networks, lower their capacity requirements, and significantly reduce training time. For instance, on 128×128 images, our model needs less than four days of training, while the baseline requires more than two weeks. For 256×256 ImageNet images, this is the first time photorealistic images are obtained without the use of specialized hardware (i.e. hundreds of TPUs).
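
The filtering step described here can be sketched in plain Python under some loud assumptions. This is not the authors' pipeline: the density score (negative mean distance to the k nearest neighbours), the 2-D toy points standing in for image embeddings, and the names knn_density_score and select_instances are all hypothetical choices for illustration.

```python
import math

def knn_density_score(points, k=3):
    """Score each point by the negative mean distance to its k nearest neighbours.

    Higher scores mean denser neighbourhoods. This scoring rule is an
    assumption made for illustration.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(-sum(dists[:k]) / k)
    return scores

def select_instances(points, keep_fraction=0.8, k=3):
    """Keep the keep_fraction of points that sit in the densest regions."""
    scores = knn_density_score(points, k)
    n_keep = max(1, int(len(points) * keep_fraction))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original ordering
    return [points[i] for i in kept]

# Eight clustered points plus two far-away outliers (toy 2-D "embeddings").
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.0, 0.2),
        (5.0, 5.0), (-4.0, 6.0)]
filtered = select_instances(data, keep_fraction=0.8)
```

On this toy data, the two isolated points receive the lowest density scores and are the ones removed, leaving the tight cluster as the training set.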

Learning Agent Representations for Ice Hockey
Guiliang Liu (Simon Fraser University) · Oliver Schulte (Simon Fraser University) · Pascal Poupart (University of Waterloo/RBC Borealis AI/Vector Institute) · Mike Rudd (University of Waterloo/Vector Institute) · Mehrsan Javan (SPORTLOGiQ)
This work presents a new player representation for team sports. The new representation technique is demonstrated in ice hockey, achieving state-of-the-art results in identifying the acting player, estimating expected goals, and predicting the final score difference.

Learning Differential Equations that are Fast to Solve
Jacob Kelly (University of Toronto/Vector Institute), Jesse Bettencourt (University of Toronto/Vector Institute), Matthew Johnson (Google Brain), David Duvenaud (University of Toronto/Vector Institute)
When we model physical systems, some models are easier to approximate and make predictions with than others. Sometimes different models will make almost exactly the same predictions, but one will be much easier to work with. We show how to encourage models to be easier to make predictions with while still agreeing with the data almost as well. Specifically, we show how to do this in a general class of models of continuously-evolving systems called ordinary differential equations.

Learning Deformable Tetrahedral Meshes for 3D Reconstruction*
Jun Gao (University of Toronto) · Wenzheng Chen (University of Toronto) · Tommy Xiang (University of Toronto) · Alec Jacobson (University of Toronto) · Morgan McGuire (NVIDIA) · Sanja Fidler (Vector Institute/University of Toronto/NVIDIA)
*Research done for NVIDIA
3D shape representations that accommodate learning-based 3D reconstruction are an open problem in machine learning and computer graphics. Previous work on neural 3D reconstruction demonstrated benefits, but also limitations, of point cloud, voxel, surface mesh, and implicit function representations. We introduce Deformable Tetrahedral Meshes (DEFTET) as a particular parameterization that utilizes volumetric tetrahedral meshes for the reconstruction problem. Unlike existing volumetric approaches, DEFTET optimizes for both vertex placement and occupancy, and is differentiable with respect to standard 3D reconstruction loss functions. It is thus simultaneously high-precision, volumetric, and amenable to learning-based neural architectures. We show that it can represent arbitrary, complex topology, is both memory and computationally efficient, and can produce high-fidelity reconstructions with a significantly smaller grid size than alternative volumetric approaches. The predicted surfaces are also inherently defined as tetrahedral meshes, thus do not require post-processing. We demonstrate that DEFTET matches or exceeds both the quality of the previous best approaches and the performance of the fastest ones. Our approach obtains high-quality tetrahedral meshes computed directly from noisy point clouds, and is the first to showcase high-quality 3D tet-mesh results using only a single image as input.

Learning Dynamic Belief Graphs to Generalize on Text-Based Games
Ashutosh Adhikari (University of Waterloo) · Xingdi Yuan (Microsoft Research) · Marc-Alexandre Côté (Microsoft Research) · Mikuláš Zelinka (Charles University, Faculty of Mathematics and Physics) · Marc-Antoine Rondeau (Microsoft Research) · Romain Laroche (Microsoft Research) · Pascal Poupart (University of Waterloo/RBC Borealis AI/Vector Institute) · Jian Tang (Mila) · Adam Trischler (Microsoft) · Will Hamilton (McGill)
Playing text-based games requires skills in processing natural language and sequential decision making. Achieving human-level performance on text-based games remains an open challenge, and prior research has largely relied on hand-crafted structured representations and heuristics. In this work, we describe a new technique to plan and generalize in text-based games using graph-structured representations learned end-to-end from raw text.

Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting
Jorge Mendez (University of Pennsylvania), Boyu Wang (University of Western Ontario/Vector Institute), Eric Eaton (University of Pennsylvania)
Policy gradient methods have shown success in learning control policies for high-dimensional dynamical systems. Their biggest downside is the amount of exploration they require before yielding high-performing policies. In a lifelong learning setting, in which an agent is faced with multiple consecutive tasks over its lifetime, reusing information from previously seen tasks can substantially accelerate the learning of new tasks. We provide a novel method for lifelong policy gradient learning that trains lifelong function approximators directly via policy gradients, allowing the agent to benefit from accumulated knowledge throughout the entire training process. We show empirically that our algorithm learns faster and converges to better policies than single-task and lifelong learning baselines, and completely avoids catastrophic forgetting on a variety of challenging domains.

LoCo: Local Contrastive Representation Learning*
Yuwen Xiong (Uber ATG/University of Toronto), Mengye Ren (University of Toronto/Uber ATG), Raquel Urtasun (Uber ATG/Vector Institute)
*Research done for Uber ATG
Deep neural nets typically perform end-to-end backpropagation to learn the weights, a procedure that creates synchronization constraints in the weight update step across layers and is not biologically plausible. Recent advances in unsupervised contrastive representation learning point to the question of whether a learning algorithm can also be made local, that is, the updates of lower layers do not directly depend on the computation of upper layers. While Greedy InfoMax separately learns each block with a local objective, we found that it consistently hurts readout accuracy in state-of-the-art unsupervised contrastive learning algorithms, possibly due to the greedy objective as well as gradient isolation. In this work, we discover that by overlapping local blocks stacked on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedback to lower blocks. This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time. Aside from standard ImageNet experiments, we also show results on complex downstream tasks such as object detection and instance segmentation directly using readout features.

Modeling Continuous Stochastic Processes with Dynamic Normalizing Flows
Ruizhi Deng (Simon Fraser University), Bo Chang (Borealis AI), Marcus Brubaker (Borealis AI/Vector Institute), Greg Mori (Borealis AI), Andreas Lehrmann (Borealis AI)
Normalizing flows transform a simple base distribution into a complex target distribution and have proved to be powerful models for data generation and density estimation. In this work, we propose a novel type of normalizing flow driven by a differential deformation of the Wiener process. As a result, we obtain a rich time series model whose observable process inherits many of the appealing properties of its base process, such as efficient computation of likelihoods and marginals. Furthermore, our continuous treatment provides a natural framework for irregular time series with an independent arrival process, including straightforward interpolation. We illustrate the desirable properties of the proposed model on popular stochastic processes and demonstrate its superior flexibility to variational RNN and latent ODE baselines in a series of experiments on synthetic and real-world data.

MuSCLE: Multi Sweep Compression of LiDAR using Deep Entropy Models*
Sourav Biswas (University of Waterloo), Jerry Liu (Uber ATG), Kelvin Wong (University of Toronto), Shenlong Wang (University of Toronto), Raquel Urtasun (Uber ATG/Vector Institute)
*Research done for Uber ATG
We present a novel compression algorithm for reducing the storage of LiDAR sensor data streams. Our model exploits spatio-temporal relationships across multiple LiDAR sweeps to reduce the bitrate of both geometry and intensity values. Towards this goal, we propose a novel conditional entropy model that models the probabilities of the octree symbols by considering both coarse level geometry and previous sweeps’ geometric and intensity information. We then use the learned probability to encode the full data stream into a compact one. Our experiments demonstrate that our method significantly reduces the joint geometry and intensity bitrate over prior state-of-the-art LiDAR compression methods, with a reduction of 7-17% and 15-35% on the UrbanCity and SemanticKITTI datasets respectively.

On the Ergodicity, Bias and Asymptotic Normality of Randomized Midpoint Sampling Method
Ye He (University of California, Davis), Krishnakumar Balasubramanian (University of California, Davis), Murat Erdogdu (University of Toronto/Vector Institute)
The randomized midpoint method has emerged as an optimal procedure for diffusion-based sampling from a probability distribution. This paper analyzes several probabilistic properties of this method, establishing asymptotic normality and highlighting the relative advantages and disadvantages over other methods. Results in this paper collectively provide several insights into the behavior of the randomized midpoint discretization method, including obtaining confidence intervals for numerical integrations.

Regularized Linear Autoencoders Recover the Principal Components, Eventually
Xuchan Bao (University of Toronto/Vector Institute), James Lucas (University of Toronto/Vector Institute), Sushant Sachdeva (University of Toronto/Vector Institute), Roger Grosse (University of Toronto/Vector Institute)
It’s long been known that autoencoders recover the principal component subspace (the subspace that maximizes the projected variance of the data). We show that, with a particular regularizer, they recover the individual principal components, not just the subspace. However, they do so very slowly; we analyze why this is the case and give an alternative training procedure which recovers the components more efficiently.

Sharpened Generalization Bounds based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms
Mahdi Haghifam (University of Toronto/Vector Institute), Jeffrey Negrea (University of Toronto/Vector Institute), Ashish Khisti (University of Toronto), Daniel Roy (University of Toronto/Vector Institute), Gintare Karolina Dziugaite (Element AI)
No existing theory of generalization for the Langevin algorithm can tie the real-world behavior of the algorithm to strong generalization performance. Building off new notions of conditional mutual information, we present new bounds that yield non-vacuous generalization bounds, even for CIFAR10.

Variational Amodal Object Completion*
Huan Ling (University of Toronto, NVIDIA) · David Acuna (University of Toronto, NVIDIA) · Karsten Kreis (NVIDIA) · Seung Wook Kim (University of Toronto) · Sanja Fidler (Vector Institute/University of Toronto/NVIDIA)
*Research done for NVIDIA
In images of complex scenes, objects are often occluding each other which makes perception tasks such as object detection and tracking, or robotic control tasks such as planning, challenging. To facilitate downstream tasks, it is thus important to reason about the full extent of objects, i.e., seeing behind occlusion, typically referred to as amodal instance completion. In this paper, we propose a variational generative framework for amodal completion, referred to as Amodal-VAE, which does not require any amodal labels at training time, as it is able to utilize widely available object instance masks. We showcase our approach on the downstream task of scene editing where the user is presented with interactive tools to complete and erase objects in photographs. Experiments on complex street scenes demonstrate state-of-the-art performance in amodal mask completion, and showcase high quality scene editing results. Interestingly, a user study shows that humans prefer object completions inferred by our model to the human-labeled ones.

Wavelet Flow: Fast Training of High Resolution Normalizing Flows
Jason Yu (York University), Konstantinos Derpanis (Ryerson University/Vector Institute), Marcus Brubaker (York University/Vector Institute)
Normalizing flows have traditionally been limited to generating low resolution images due to the cost of training.  We introduce a new method based on wavelets which allows for efficient training of high resolution images.  We show that it enables training of high resolution images (e.g., 1024×1024) and is also able to significantly speed up training on standard, low resolution datasets.  In addition, it automatically includes models of lower resolution images and can perform super-resolution with no additional work due to the multi-scale nature of the wavelet representation.

What went wrong and when? Instance-wise feature importance for time-series black-box models
Sana Tonekaboni (University of Toronto/Vector Institute), Shalmali Joshi (Vector Institute), Kieran Campbell (University of British Columbia/Vector Institute), David Duvenaud (University of Toronto/Vector Institute), Anna Goldenberg (Vector Institute/The Hospital for Sick Children)
Explanations of model predictions are particularly important in complex domains such as time-series monitoring in patient care. Time-series explainability is a relatively unexplored area in machine learning (ML) literature so far. We propose a new framework for explaining black-box models by attributing importance to observations based on how much influence they have on a model’s prediction. Unlike previous attempts, our approach accounts for temporal dynamics. This is one of the first works to explore feature attribution and explainability for time-series models. We expect it to be very relevant in healthcare and are currently exploring a variety of applications.

Vector Institute researchers are hosting five workshops

Muslims in ML is an affinity workshop organized by Marzyeh Ghassemi and collaborators. It will focus on both the potential for advancement and harm to Muslims and those in Muslim-majority countries who religiously identify, culturally associate, or are classified by proximity, as “Muslim.”

Machine Learning for Health (ML4H): Advancing Healthcare for All, co-organized by Anna Goldenberg, will expose participants to new questions in machine learning for healthcare and prompt them to reflect on how their work sits within larger healthcare systems.

Machine Learning and the Physical Sciences, organized by Juan Carrasquilla, brings together computer scientists, mathematicians and physical scientists who are interested in applying machine learning to various outstanding physical problems.

Talking to Strangers: Zero-Shot Emergent Communication is an interactive workshop co-organized by Jakob Foerster. Its goal is to explore how artificial agents can spontaneously evolve ad hoc communication by interacting with strangers.

Learning Meaningful Representations of Life, co-organized by Alán Aspuru-Guzik, is designed to bring together trainees and experts in machine learning with those at the very forefront of biological research today to help unlock the secrets of biological systems.
