# Vector research community preps for virtual NeurIPS 2020

November 27, 2020

*By Ian Gormely*

Vector researchers are once again preparing for the premier machine learning conference, the 34th annual conference on Neural Information Processing Systems (NeurIPS). This year’s conference, originally set to be staged in Vancouver, BC, will be virtual, running December 6-12. It will include invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers.

Below are abstracts and simplified summaries for many of the accepted papers and workshops from Vector-affiliated researchers.

You can read more about Vector’s work at past years’ conferences here and here.

*Vector’s research community continues to grow quickly. If you are a Vector-affiliated researcher and don’t see your work represented here, please contact ian.gormely@vectorinstitute.ai*

**Conference Papers by Vector Faculty Members & Faculty Affiliates:**

**Adaptive Gradient Quantization for Data-Parallel SGD**

*Fartash Faghri (University of Toronto/Vector Institute), Iman Tabrizian (University of Toronto/Vector Institute), Ilia Markov (IST Austria), Dan Alistarh (IST Austria/Neural Magic Inc.), Daniel Roy (University of Toronto/Vector Institute), Ali Ramezani-Kebrya (Vector Institute)*

As deep learning scales to larger models and bigger data, researchers are using distributed, parallel algorithms to train faster. This work shows how to reduce the communication overhead by 70%, opening up the possibility for even larger scale compute.

**An Implicit Function Learning Approach for Parametric Modal Regression**

*Yangchen Pan (University of Alberta), Ehsan Imani (University of Alberta), Martha White (University of Alberta), Amir-massoud Farahmand (Vector Institute/University of Toronto)*

Learning the relation between the input and the real-valued output is a fundamental problem in machine learning, and is known as the regression problem. The conventional regression methods learn the mean value of an output given its input. This is acceptable when the output for any given input is concentrated around a single mode (unimodal), but it is not when the output has multiple modes. This work develops a new scalable algorithm to learn such a relationship. This is achieved by using the Implicit Function Theorem, which allows us to convert the problem of learning a multi-valued function, which is difficult, to learning a single-valued function, which is easier.

**Causal Discovery in Physical Systems from Videos**

*Yunzhu Li (Massachusetts Institute of Technology), Antonio Torralba (Massachusetts Institute of Technology), Anima Anandkumar (NVIDIA/Caltech), Dieter Fox (NVIDIA/University of Washington), Animesh Garg (University of Toronto/Vector Institute)*

Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.

**Counterfactual Data Augmentation using Locally Factored Dynamics**

*Silviu Pitis (University of Toronto/Vector Institute), Elliot Creager (University of Toronto/Vector Institute), Animesh Garg (University of Toronto/Vector Institute)*

We detect and leverage local causal independence between objects and features of the world state to improve the sample efficiency of simulated robots in the reinforcement learning setting. We formalize local causal independence using a Local Causal Modeling framework and use it as part of our Counterfactual Data Augmentation algorithm to generate new causally valid data for models to train on.

**Curriculum By Smoothing**

Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation. Moreover, recent work in Generative Adversarial Networks (GANs) has highlighted the importance of learning by progressively increasing the difficulty of a learning task (Karras et al.). When learning a network from scratch, the information propagated within the network during the earlier stages of training can contain distortion artifacts due to noise, which can be detrimental to training. In this paper, we propose an elegant curriculum-based scheme that smooths the feature embedding of a CNN using anti-aliasing or low-pass filters. We propose to augment the training of CNNs by controlling the amount of high-frequency information propagated within the CNNs as training progresses, by convolving the output of a CNN feature map of each layer with a Gaussian kernel. By decreasing the variance of the Gaussian kernel, we gradually increase the amount of high-frequency information available within the network for inference. As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data. Our proposed augmented training scheme significantly improves the performance of CNNs on various vision tasks without either adding additional trainable parameters or an auxiliary regularization objective. The generality of our method is demonstrated through empirical performance gains in CNN architectures across four different tasks: transfer learning, cross-task transfer learning, and generative models.
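The smoothing idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: it works on a single 1D row of feature values rather than CNN feature maps, and the schedule `sigma_schedule` (with its `sigma0`, `decay`, and `every` parameters) and the `roughness` measure are hypothetical choices made only for illustration. It shows that shrinking the Gaussian kernel's standard deviation over training lets more high-frequency content through.

```python
import math

def gaussian_kernel(sigma, radius=3):
    """Return a normalized 1D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, kernel):
    """Convolve a 1D signal with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def sigma_schedule(epoch, sigma0=1.0, decay=0.9, every=5):
    """Hypothetical schedule: multiply sigma by `decay` every `every` epochs."""
    return sigma0 * decay ** (epoch // every)

def roughness(signal):
    """Sum of squared neighbor differences, a crude proxy for high-frequency content."""
    return sum((b - a) ** 2 for a, b in zip(signal, signal[1:]))

# A maximally "high-frequency" toy feature row: values alternate every step.
feature_row = [1.0, -1.0] * 8

for epoch in (0, 20, 40):
    sigma = sigma_schedule(epoch)
    smoothed = smooth(feature_row, gaussian_kernel(sigma))
    print(epoch, round(sigma, 3), round(roughness(smoothed), 4))
```

As the epoch grows, sigma shrinks and the printed roughness rises, mirroring the abstract's claim that a decreasing kernel variance gradually admits more high-frequency information.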

**Deep Learning Versus Kernel Learning: An Empirical Study of Loss Landscape Geometry and Evolution of the Data-Dependent Neural Tangent Kernel**

*Stanislav Fort (Stanford University/Google Research), Gintare Karolina Dziugaite (Element AI), Mansheej Paul (Stanford University), Sepideh Kharaghani (Element AI), Daniel Roy (University of Toronto/Vector Institute), Surya Ganguli (Stanford University)*

We now understand deep learning training in certain limiting regimes, where networks behave like simpler kernel machines. But how do these simplifications relate to real networks that offer stronger empirical performance? In this work, we use an empirical study to connect the geometry of training to the time evolution of the kernel.

**Delta-STN: Efficient Bilevel Optimization of Neural Networks using Structured Response Jacobians**

**Hybrid Models for Learning to Branch**

*Prateek Gupta (University of Oxford), Maxime Gasse (Polytechnique Montréal), Elias Khalil (University of Toronto/Vector Institute), Pawan K Mudigonda (University of Oxford), Andrea Lodi (École Polytechnique Montréal), Yoshua Bengio (Mila/Université de Montréal)*

A recent Graph Neural Network (GNN) approach for learning to branch has been shown to successfully reduce the running time of branch-and-bound algorithms for Mixed Integer Linear Programming (MILP). While the GNN relies on a GPU for inference, MILP solvers are purely CPU-based. This severely limits its application as many practitioners may not have access to high-end GPUs. In this work, we ask two key questions. First, in a more realistic setting where only a CPU is available, is the GNN model still competitive? Second, can we devise an alternate computationally inexpensive model that retains the predictive power of the GNN architecture? We answer the first question in the negative, and address the second question by proposing a new hybrid architecture for efficient branching on CPU machines. The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching. We evaluate our methods on four classes of MILP problems, and show that they lead to up to 26% reduction in solver running time compared to state-of-the-art methods without a GPU, while extrapolating to harder problems than they were trained on. The code for this project is publicly available.

**Exemplar VAEs for Exemplar based Generation and Data Augmentation**

*Sajad Norouzi (University of Toronto/Vector Institute), David J Fleet (University of Toronto/Vector Institute), Mohammad Norouzi (Google Brain)*

Exemplar VAE is a new kind of generative model that combines a neural network encoder-decoder architecture with non-parametric, exemplar-based techniques. The neural network encoder is used to transform an image into a feature space that determines, for a given image, what other images are similar to it. Locations in the feature space that are close to natural images (exemplars) are deemed to represent plausible images. To generate new images according to the model, one first chooses a natural image from a large set of exemplars. One then perturbs it by randomly altering its feature space position, and then transforms that new feature vector back into an image using the neural network decoder. The model performs extremely well in density estimation, and it is shown to be useful for representation learning. One remarkable property of the model is that randomly generated data can be used for generative data augmentation to improve image classifiers.

**Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks**

*Umut Simsekli (Institut Polytechnique de Paris/University of Oxford), Ozan Sener (Intel Labs), George Deligiannidis (University of Oxford), Murat Erdogdu (University of Toronto/Vector Institute)*

This paper proves generalization bounds for machine learning models trained with SGD under the assumption that its trajectories can be well-approximated by a heavy tailed diffusion. The generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving diffusion. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of capacity metric.

**In Search of Robust Measures of Generalization**

*Gintare Karolina Dziugaite (Element AI), Alexandre Drouin (Element AI), Brady Neal (Mila), Nitarshan Rajkumar (Mila, Université de Montréal), Ethan Caballero (Mila), Linbo Wang (University of Toronto/Vector Institute), Ioannis Mitliagkas (Mila/University of Montreal), Daniel Roy (University of Toronto/Vector Institute)*

How should we evaluate mathematical theories of generalization in deep learning? Recent work proposes using large-scale empirical studies. We argue for the importance of using measures of robustness so that these studies do not mislead us. We find that no existing theories are robust.

**Instance Selection for GANs**

*Terrance DeVries (University of Guelph/Vector Institute), Michal Drozdzal (FAIR), Graham W Taylor (University of Guelph/Vector Institute)*

The punchline is that contrary to ML folklore, “more data is not always better”. We show that by automatically removing data examples from sparse parts of the data manifold, we can improve the sample quality of Generative Adversarial Networks, lower their capacity requirements, and significantly reduce training time. For instance, on 128×128 images, our model relies on less than four days of training, while the baseline requires more than two weeks. For 256×256 ImageNet images, this is the first time photorealistic images are obtained without the use of specialized hardware (i.e. hundreds of TPUs).
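To make the idea of removing examples from sparse regions concrete, here is a minimal sketch under stated assumptions. The summary above does not say how density is estimated, so this example substitutes a simple k-nearest-neighbor distance on toy 2D points; the function names, `keep_fraction`, `k`, and the sample `embeddings` are all hypothetical and not taken from the paper.

```python
import math

def knn_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(math.dist(points[index], p)
                   for j, p in enumerate(points) if j != index)
    return dists[k - 1]

def select_instances(points, keep_fraction=0.5, k=2):
    """Keep the `keep_fraction` of points that lie in the densest regions.

    A small k-NN distance is used as a stand-in for high local density.
    """
    scores = [knn_distance(points, i, k) for i in range(len(points))]
    n_keep = max(1, int(len(points) * keep_fraction))
    ranked = sorted(range(len(points)), key=lambda i: scores[i])
    kept = sorted(ranked[:n_keep])  # restore original ordering
    return [points[i] for i in kept]

# Toy "embeddings": three points in a tight cluster and three isolated points.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (-4.0, 6.0), (9.0, -3.0)]

dense_subset = select_instances(embeddings, keep_fraction=0.5, k=2)
print(dense_subset)
```

In this toy setting the three clustered points are retained and the isolated ones are dropped; a generative model would then be trained only on the retained subset.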

**Learning Agent Representations for Ice Hockey**

*Guiliang Liu (Simon Fraser University) · Oliver Schulte (Simon Fraser University) · Pascal Poupart (University of Waterloo/RBC Borealis AI/Vector Institute) · Mike Rudd (University of Waterloo/Vector Institute) · Mehrsan Javan (SPORTLOGiQ)*

This work presents a new player representation for team sports. The new representation technique is demonstrated in ice hockey by achieving state of the art results to identify the acting player, to estimate expected goals, and to predict the final score difference.

**Learning Differential Equations that are Fast to Solve**

*Jacob Kelly (University of Toronto/Vector Institute), Jesse Bettencourt (University of Toronto/Vector Institute), Matthew Johnson (Google Brain), David Duvenaud (University of Toronto/Vector Institute)*

When we model physical systems, some models are easier to approximate and make predictions with than others. Sometimes different models will make almost exactly the same predictions, but one will be much easier to work with. We show how to encourage models to be easier to make predictions with while still agreeing with the data almost as well. Specifically, we show how to do this in a general class of models of continuously-evolving systems called ordinary differential equations.

**Learning Deformable Tetrahedral Meshes for 3D Reconstruction**

**Learning Dynamic Belief Graphs to Generalize on Text-Based Games**

*Ashutosh Adhikari (University of Waterloo) · Xingdi Yuan (Microsoft Research) · Marc-Alexandre Côté (Microsoft Research) · Mikuláš Zelinka (Charles University, Faculty of Mathematics and Physics) · Marc-Antoine Rondeau (Microsoft Research) · Romain Laroche (Microsoft Research) · Pascal Poupart (University of Waterloo/RBC Borealis AI/Vector Institute) · Jian Tang (Mila) · Adam Trischler (Microsoft) · Will Hamilton (McGill)*

Playing text-based games requires skills in processing natural language and sequential decision making. Achieving human-level performance on text-based games remains an open challenge, and prior research has largely relied on hand-crafted structured representations and heuristics. In this work, we describe a new technique to plan and generalize in text-based games using graph-structured representations learned end-to-end from raw text.

**Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting**

*Jorge Mendez (University of Pennsylvania), Boyu Wang (University of Western Ontario/Vector Institute), Eric Eaton (University of Pennsylvania)*

Policy gradient methods have shown success in learning control policies for high-dimensional dynamical systems. Their biggest downside is the amount of exploration they require before yielding high-performing policies. In a lifelong learning setting, in which an agent is faced with multiple consecutive tasks over its lifetime, reusing information from previously seen tasks can substantially accelerate the learning of new tasks. We provide a novel method for lifelong policy gradient learning that trains lifelong function approximators directly via policy gradients, allowing the agent to benefit from accumulated knowledge throughout the entire training process. We show empirically that our algorithm learns faster and converges to better policies than single-task and lifelong learning baselines, and completely avoids catastrophic forgetting on a variety of challenging domains.

**LoCo: Local Contrastive Representation Learning**

**Modeling Continuous Stochastic Processes with Dynamic Normalizing Flows**

**MuSCLE: Multi Sweep Compression of LiDAR using Deep Entropy Models**

*Sourav Biswas (University of Waterloo), Jerry Liu (Uber ATG), Kelvin Wong (University of Toronto), Shenlong Wang (University of Toronto), Raquel Urtasun (Uber ATG/Vector Institute)*

**On the Ergodicity, Bias and Asymptotic Normality of Randomized Midpoint Sampling Method**

*Ye He (University of California, Davis), Krishnakumar Balasubramanian (University of California, Davis), Murat Erdogdu (University of Toronto/Vector Institute)*

The randomized midpoint method has emerged as an optimal procedure for diffusion-based sampling from a probability distribution. This paper analyzes several probabilistic properties of this method, establishing asymptotic normality and highlighting the relative advantages and disadvantages over other methods. Results in this paper collectively provide several insights into the behavior of the randomized midpoint discretization method, including obtaining confidence intervals for numerical integrations.

**Regularized Linear Autoencoders Recover the Principal Components, Eventually**

*Xuchan Bao (University of Toronto/Vector Institute), James Lucas (University of Toronto/Vector Institute), Sushant Sachdeva (University of Toronto/Vector Institute), Roger Grosse (University of Toronto/Vector Institute)*

It’s long been known that autoencoders recover the principal component subspace (the subspace that maximizes the projected variance of the data). We show that, with a particular regularizer, they recover the individual principal components, not just the subspace. However, they do so very slowly; we analyze why this is the case and give an alternative training procedure which recovers the components more efficiently.

**Sharpened Generalization Bounds based on Conditional Mutual Information and an Application to Noisy, Iterative Algorithms**

*Mahdi Haghifam (University of Toronto/Vector Institute), Jeffrey Negrea (University of Toronto/Vector Institute), Ashish Khisti (University of Toronto), Daniel Roy (University of Toronto/Vector Institute), Gintare Karolina Dziugaite (Element AI)*

No existing theory of generalization for the Langevin algorithm can tie the real-world behavior of the algorithm to strong generalization performance. Building off new notions of conditional mutual information, we present new bounds that yield non-vacuous generalization bounds, even for CIFAR10.

**Variational Amodal Object Completion**

**Wavelet Flow: Fast Training of High Resolution Normalizing Flows**

*Jason Yu (York University), Konstantinos Derpanis (Ryerson University/Vector Institute), Marcus Brubaker (York University/Vector Institute)*

Normalizing flows have traditionally been limited to generating low resolution images due to the cost of training. We introduce a new method based on wavelets which allows for efficient training of high resolution images. We show that it enables training of high resolution images (e.g., 1024×1024) and is also able to significantly speed up training on standard, low resolution datasets. In addition, it automatically includes models of lower resolution images and can perform super-resolution with no additional work due to the multi-scale nature of the wavelet representation.

**What went wrong and when? Instance-wise feature importance for time-series black-box models**

*Sana Tonekaboni (University of Toronto/Vector Institute), Shalmali Joshi (Vector Institute), Kieran Campbell (University of British Columbia/Vector Institute), David Duvenaud (University of Toronto/Vector Institute), Anna Goldenberg (Vector Institute/The Hospital for Sick Children)*

Explanations of model predictions are important, particularly in complex domains such as time-series monitoring in patient care. Time-series explainability is a relatively unexplored area in machine learning (ML) literature so far. We propose a new framework for explaining black-box models by attributing importance to observations based on how much influence they have on a model’s prediction. Unlike previous attempts, our approach accounts for temporal dynamics. This is one of the first works to explore feature attribution and explainability for time-series models. We expect it to be very relevant in healthcare and are currently exploring a variety of applications.

**Vector Institute researchers are hosting five workshops**

**Muslims in ML** is an affinity workshop organized by Marzyeh Ghassemi and collaborators. It will focus on both the potential for advancement and harm to Muslims and those in Muslim-majority countries who religiously identify, culturally associate, or are classified by proximity, as “Muslim.”

**Machine Learning for Health (ML4H): Advancing Healthcare for All**, co-organized by Anna Goldenberg, will expose participants to new questions in machine learning for healthcare, and prompt them to reflect on how their work sits within larger healthcare systems.

**Machine Learning and the Physical Sciences**, organized by Juan Carrasquilla, brings together computer scientists, mathematicians and physical scientists who are interested in applying machine learning to various outstanding physical problems.

**Talking to Strangers: Zero-Shot Emergent Communication** is an interactive workshop co-organized by Jakob Foerster. Its goal is to explore the possibilities for artificial agents of evolving ad hoc communication spontaneously, by interacting with strangers.

**Learning Meaningful Representations of Life (LMRL.org)**, co-organized by Alán Aspuru-Guzik, is designed to bring together trainees and experts in machine learning with those at the very forefront of biological research today to help unlock the secrets of biological systems.