Nov 30, 2021

By Ian Gormely

Vector researchers are once again preparing for a virtual Conference on Neural Information Processing Systems (NeurIPS). This year’s conference runs online from December 6 through 14. Papers being presented by Vector Faculty this year break new ground in different fields of AI research, including deep learning, reinforcement learning, computer vision, and responsible AI, and have the potential to impact many facets of daily life and work, from architecture to health.

New work from Vector’s interim research director Graham Taylor was particularly notable as it combined the efforts of his team with those of researcher Jungtaek Kim, whom Taylor connected with at last year’s NeurIPS conference. Both were presenting work related to the building toy LEGO.

This year the combined Canadian-Korean team, which includes Vector researcher Boris Knyazev as well as POSTECH researchers Hyunsoo Chung, Jinhwi Lee, Jaesik Park, and Minsu Cho, is presenting another LEGO-related paper, “Brick-by-Brick: Combinatorial Construction with Deep Reinforcement Learning.” Here they use reinforcement learning to build a structure out of LEGO bricks from a photo. The model has the potential to help create architectural designs from pictures or renderings.

Other notable work comes from Vector Research Director (currently on leave) Richard Zemel and Faculty Member Alireza Makhzani. They and their co-authors Kuan-Chieh Wang, Yan Fu, Ke Li, and Ashish J Khisti look at the vulnerability of neural networks when they experience model inversion attacks, which can reveal training data to unauthorized users. Their paper, “Variational Model Inversion Attacks,” looks at how to improve the accuracy of these attacks, so that the data that is revealed is both realistic and diverse. Their work has the potential to impact ML privacy issues, particularly in health.

Finally, Vector Faculty Member Chris Maddison and Vector Intern Yann Dubois, along with co-authors Benjamin Bloem-Reddy and Karen Ullrich, have developed a new compression paradigm for data that is processed by algorithms rather than seen by humans. In “Lossy Compression for Lossless Prediction” they detail a model that works 1000 times better than JPEG for image compression. Shrinking the size of data has the potential to enable startups and smaller institutions to work with large data sets that currently require prohibitively expensive compute resources.

Below are abstracts and simplified summaries for many of the accepted papers and workshops from Vector Faculty Members.

You can read more about Vector’s work at past years’ conferences here (2020), here (2019), and here (2018).

An Analysis of Constant Step Size SGD in the Non-convex Regime: Asymptotic Normality and Bias

*Lu Yu, Krishnakumar Balasubramanian, Stanislav Volgushev, Murat A. Erdogdu
*Structured non-convex learning problems, for which critical points have favorable statistical properties, arise frequently in statistical machine learning. Algorithmic convergence and statistical estimation rates are well-understood for such problems. However, quantifying the uncertainty associated with the underlying training algorithm is not well-studied in the non-convex setting. In order to address this shortcoming, in this work, we establish an asymptotic normality result for the constant step size stochastic gradient descent (SGD) algorithm–a widely used algorithm in practice. Specifically, based on the relationship between SGD and Markov Chains [DDB19], we show that the average of SGD iterates is asymptotically normally distributed around the expected value of their unique invariant distribution, as long as the non-convex and non-smooth objective function satisfies a dissipativity property. We also characterize the bias between this expected value and the critical points of the objective function under various local regularity conditions. Together, the above two results could be leveraged to construct confidence intervals for non-convex problems that are trained using the SGD algorithm.
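
As a rough illustration of averaging constant step size SGD iterates, the sketch below runs SGD on a toy one-dimensional quadratic, which is an assumption made for brevity (the paper itself targets non-convex objectives). The function name `averaged_sgd`, the objective, the step size, and the noise model are all hypothetical choices, not taken from the paper. Repeating the run with different seeds shows the averaged iterates clustering tightly around a single value, which is the kind of behaviour a confidence interval would be built on.

```python
import random
import statistics


def averaged_sgd(step_size, n_steps, seed):
    """Run constant step size SGD on f(x) = 0.5 * (x - 3)^2 with noisy
    gradients and return the average of all iterates."""
    rng = random.Random(seed)
    x = 0.0
    running_sum = 0.0
    for _ in range(n_steps):
        # Exact gradient (x - 3) plus zero-mean Gaussian noise.
        noisy_grad = (x - 3.0) + rng.gauss(0.0, 1.0)
        x -= step_size * noisy_grad
        running_sum += x
    return running_sum / n_steps


# Repeat the experiment with independent seeds and summarize the averages.
averages = [averaged_sgd(0.1, 5000, seed) for seed in range(200)]
mean_of_averages = statistics.mean(averages)
spread_of_averages = statistics.stdev(averages)
print(f"mean: {mean_of_averages:.3f}, std across runs: {spread_of_averages:.3f}")
```

On this toy problem the gradient is linear, so the averaged iterates centre on the minimizer; the bias the abstract refers to only appears for more general objectives.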

Are My Deep Learning Systems Fair? An Empirical Study of Fixed-Seed Training

*Shangshu Qian, Hung Viet Pham, Thibaud Lutellier, Zeou Hu, Jungwon Kim, Lin Tan, Yaoliang Yu, Jiahao Chen, Sameena Shah
*Deep learning (DL) systems have been gaining popularity in critical tasks such as credit evaluation and crime prediction. Such systems demand fairness. Recent work shows that DL software implementations introduce variance: identical DL training runs (i.e., identical network, data, configuration, software, and hardware) with a fixed seed produce different models. Such variance could make DL models and networks violate fairness compliance laws, resulting in negative social impact. We conduct the first empirical study to quantify the impact of software implementation on the fairness and its variance of DL systems. Our study of 22 mitigation techniques and five baselines reveals up to 12.6% fairness variance across identical training runs with identical seeds. In addition, most debiasing algorithms have a negative impact on the model such as reducing model accuracy, increasing fairness variance, or increasing accuracy variance. Our literature survey shows that while fairness is gaining popularity in artificial intelligence (AI) related conferences, only 34.4% of the papers use multiple identical training runs to evaluate their approach, raising concerns about their results’ validity. We call for better fairness evaluation and testing protocols to improve fairness and fairness variance of DL systems as well as DL research validity and reproducibility at large.

ATISS: Autoregressive Transformers for Indoor Scene Synthesis

*Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, Sanja Fidler
*The ability to synthesize realistic and diverse indoor furniture layouts automatically or based on partial input, unlocks many applications, from better interactive 3D tools to data synthesis for training and simulation. In this paper, we present ATISS, a novel autoregressive transformer architecture for creating diverse and plausible synthetic indoor environments, given only the room type and its floor plan. In contrast to prior work, which poses scene synthesis as sequence generation, our model generates rooms as unordered sets of objects. We argue that this formulation is more natural, as it makes ATISS generally useful beyond fully automatic room layout synthesis. For example, the same trained model can be used in interactive applications for general scene completion, partial room re-arrangement with any objects specified by the user, as well as object suggestions for any partial room. To enable this, our model leverages the permutation equivariance of the transformer when conditioning on the partial scene, and is trained to be permutation-invariant across object orderings. Our model is trained end-to-end as an autoregressive generative model using only labeled 3D bounding boxes as supervision. Evaluations on four room types in the 3D-FRONT dataset demonstrate that our model consistently generates plausible room layouts that are more realistic than existing methods. In addition, it has fewer parameters, is simpler to implement and train and runs up to 8 times faster than existing methods.

Brick-by-Brick: Combinatorial Construction with Deep Reinforcement Learning

*Hyunsoo Chung, Jungtaek Kim, Boris Knyazev, Jinhwi Lee, Graham W. Taylor, Jaesik Park, Minsu Cho
*The Vector Institute team led by Interim Research Director Graham Taylor shows again that computers can play LEGO too. Last year, at the NeurIPS Workshop on ML for Engineering Modeling, Simulation, and Design the team presented a Graph Generative Model that learned from various types of human-created LEGO structures and proposed its own creations. There, they met POSTECH student Jungtaek Kim, who incidentally was presenting his own LEGO-related work. The teams decided to join forces and Kim completed an internship at Vector. In work to appear at NeurIPS, the Korean-Canadian team proposes a new formulation of the LEGO-building problem with deep reinforcement learning. In “Brick-by-Brick”, a LEGO-building reinforcement learning agent accepts incomplete knowledge about the desired target in the form of an image, rather than building from a blank slate. The key innovation in this work is in dealing with a large number of invalid building actions that may compromise the integrity of a build. This work has implications for architectural design, where inspiration may be suggested by a photo or rendering and the agent constructs a buildable plan that respects the complex constraints of the real world.

Characterizing Generalization under Out-of-Distribution Shifts in Deep Metric Learning

*Timo Milbich**, Karsten Roth, Samarth Sinha, Ludwig Schmidt, Marzyeh Ghassemi, Björn Ommer
*Deep Metric Learning (DML) aims to learn representation spaces in which a predefined metric (such as Euclidean distance) relates to the semantic similarity of the input data in a way that allows samples from unseen classes to be clustered based on inherent similarity, even under semantic Out-of-Distribution shifts. However, standard benchmarks used to evaluate the generalization capabilities of different DML methods use fixed train and test splits and thus fixed train-to-test shifts. In practice, the shift at test time is not known a priori, so the default evaluation setting is insufficient to evaluate the practical usability of different DML methods. To address this, we propose a novel protocol to generate sequences of progressively harder semantic shifts for given train-test splits to evaluate the generalization performance of DML methods under more realistic scenarios with different train-to-test shifts. Following that, we provide a thorough evaluation of conceptual approaches to DML and their benefits or shortcomings across train-to-test shifts of varying hardness, investigate links to structural metrics as potential indicators for downstream generalization performance, and introduce few-shot DML as a cheap remedy for consistently improved generalization under more severe OOD shifts.

Clockwork Variational Autoencoders

*Vaibhav Saxena, Jimmy Ba, Danijar Hafner
*Deep learning has enabled algorithms to generate realistic images. However, accurately predicting long video sequences requires understanding long-term dependencies and remains an open challenge. While existing video prediction models succeed at generating sharp images, they tend to fail at accurately predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower intervals. We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects.

Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance

*Hongjian Wang, Mert Gürbüzbalaban, Lingjiong Zhu, Umut Şimşekli, Murat A. Erdogdu
*Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of the second-order moments. In this paper, we provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance, for a class of strongly convex objectives. In the case where the p-th moment of the noise exists for some p∈[1,2), we first identify a condition on the Hessian, coined ‘p-positive (semi-)definiteness’, that leads to an interesting interpolation between positive semi-definite matrices (p=2) and diagonally dominant matrices with non-negative diagonal entries (p=1). Under this condition, we then provide a convergence rate for the distance to the global optimum in Lp. Furthermore, we provide a generalized central limit theorem, which shows that the properly scaled Polyak-Ruppert averaging converges weakly to a multivariate α-stable random vector. Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without necessitating any modification either to the loss function or to the algorithm itself, as typically required in robust statistics. We demonstrate the implications of our results for applications such as linear regression and generalized linear models subject to heavy-tailed data.

Deep Marching Tetrahedra: a Hybrid Representation for High-Resolution 3D Shape Synthesis

*Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, Sanja Fidler
*We introduce DMTet, a deep 3D conditional generative model that can synthesize high-resolution 3D shapes using simple user guides such as coarse voxels. It marries the merits of implicit and explicit 3D representations by leveraging a novel hybrid 3D representation. Compared to the current implicit approaches, which are trained to regress the signed distance values, DMTet directly optimizes for the reconstructed surface, which enables us to synthesize finer geometric details with fewer artifacts. Unlike deep 3D generative models that directly generate explicit representations such as meshes, our model can synthesize shapes with arbitrary topology. The core of DMTet includes a deformable tetrahedral grid that encodes a discretized signed distance function and a differentiable marching tetrahedra layer that converts the implicit signed distance representation to the explicit surface mesh representation. This combination allows joint optimization of the surface geometry and topology as well as generation of the hierarchy of subdivisions using reconstruction and adversarial losses defined explicitly on the surface mesh. Our approach significantly outperforms existing work on conditional shape synthesis from coarse voxel inputs, trained on a dataset of complex 3D animal shapes.

Demystifying and Generalizing BinaryConnect

*Tim Dockhorn, Yaoliang Yu, Eyyüb Sari, Mahdi Zolnouri, Vahid Partovi Nia**
*BinaryConnect (BC) and its many variations have become the de facto standard for neural network quantization, which is crucial for reducing energy consumption and for deployment on low-resource devices. Despite its empirical success, BC has largely remained a “training trick” and a rigorous understanding of its inner workings has yet to be found. In this work, we show that an extension of BC is a nonconvex modification of the generalized conditional gradient algorithm, allowing us to establish its convergence properties with ease. We also present a principled theory for constructing proximal quantizers that turn continuous weights gradually into discrete ones. For the first time, building on our theoretical findings, we rigorously justify the diverging parameter in proximal quantizers, which has remained as an inconsistency between theory and practice until now.
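
The basic BinaryConnect training loop can be sketched in a few lines: keep continuous weights, binarize them with a sign function for the forward pass, and apply the resulting gradient to the continuous weights. The example below is a minimal sketch on a toy linear regression problem; the function names, the target weights, the learning rate, and the clipping range are illustrative assumptions, and it does not implement the proximal quantizers or the generalized conditional gradient view developed in the paper.

```python
import random


def sign(value):
    """Binarize a continuous weight to +1.0 or -1.0."""
    return 1.0 if value >= 0 else -1.0


def train_binaryconnect(target, n_steps=2000, lr=0.05, seed=0):
    """Fit binary weights to a linear teacher with weights `target`."""
    rng = random.Random(seed)
    # Continuous "latent" weights that accumulate the gradient updates.
    weights = [rng.uniform(-0.1, 0.1) for _ in target]
    for _ in range(n_steps):
        x = [rng.gauss(0.0, 1.0) for _ in target]
        y = sum(t * xi for t, xi in zip(target, x))
        # The forward pass uses the binarized weights only.
        binary = [sign(w) for w in weights]
        error = sum(b * xi for b, xi in zip(binary, x)) - y
        # The gradient of 0.5 * error^2 with respect to the binary weights
        # is applied to the continuous weights, then clipped to [-1, 1].
        weights = [
            max(-1.0, min(1.0, w - lr * error * xi))
            for w, xi in zip(weights, x)
        ]
    return [sign(w) for w in weights]


teacher = [1.0, -1.0, 1.0, 1.0, -1.0]
learned = train_binaryconnect(teacher)
print(learned)
```

Once every sign matches the teacher, the error is zero and the continuous weights stop moving, so the binary solution is stable in this toy setting.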

DIB-R++: Learning to Predict Lighting and Material with a Hybrid Differentiable Renderer

*Wenzheng Chen, Joey Litalien, Jun Gao, Zian Wang, Clement Fuji Tsang, Sameh Khamis, Or Litany, Sanja Fidler**
*We consider the challenging problem of predicting intrinsic object properties from a single image by exploiting differentiable renderers. Many previous learning-based approaches for inverse graphics adopt rasterization-based renderers and assume naive lighting and material models, which often fail to account for non-Lambertian, specular reflections commonly observed in the wild. In this work, we propose DIB-R++, a hybrid differentiable renderer which supports these photorealistic effects by combining rasterization and ray-tracing, taking advantage of their respective strengths: speed and realism. Our renderer incorporates environmental lighting and spatially-varying material models to efficiently approximate light transport, either through direct estimation or via spherical basis functions. Compared to more advanced physics-based differentiable renderers leveraging path tracing, DIB-R++ is highly performant due to its compact and expressive shading model, which enables easy integration with learning frameworks for geometry, reflectance and lighting prediction from a single image without requiring any ground-truth. We experimentally demonstrate that our approach achieves superior material and lighting disentanglement on synthetic and real data compared to existing rasterization-based approaches and showcase several artistic applications including material editing and relighting.

Differentiable Annealed Importance Sampling and the Perils of Gradient Noise

*Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, Roger Grosse
*In machine learning, many of our key algorithms (e.g. stochastic gradient descent) compute updates on small batches of data, and our usual experience is that small batches are at least as efficient as large batches (in terms of epochwise convergence). We present and analyze a mini-batch algorithm for estimating the marginal likelihood of a Bayesian model. Surprisingly, we find that (in contrast with the optimization and sampling settings) the mini-batch estimator is inconsistent; our analysis highlights a key obstacle to efficient marginal likelihood estimation.

Distributed Deep Learning In Open Collaborations

*Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitriy Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko
*Training the most powerful neural networks requires computational resources that are often unavailable outside of large organizations, ultimately slowing down scientific progress. In this work, we propose an approach that allows training large neural networks in collaborations that can span the entire globe. Our DeDLOC method can adapt to different connection speeds, making it significantly more efficient than standard methods designed for homogeneous networks. We demonstrate the beneficial properties of DeDLOC in cost-efficient cloud setups and a volunteer experiment, training a high-quality language model for Bengali with 40 participants.

Don’t Generate Me: Training Differentially Private Generative Models with Sinkhorn Divergence

*Tianshi Cao, Alex Bie, Arash Vahdat, Sanja Fidler, Karsten Kreis
*Although machine learning models trained on massive data have led to breakthroughs in several areas, their deployment in privacy-sensitive domains remains limited due to restricted access to data. Generative models trained with privacy constraints on private data can sidestep this challenge, providing indirect access to private data instead. We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy. DP-Sinkhorn minimizes the Sinkhorn divergence, a computationally efficient approximation to the exact optimal transport distance, between the model and data in a differentially private manner and uses a novel technique for controlling the bias-variance trade-off of gradient estimates. Unlike existing approaches for training differentially private generative models, which are mostly based on generative adversarial networks, we do not rely on adversarial objectives, which are notoriously difficult to optimize, especially in the presence of noise imposed by privacy constraints. Hence, DP-Sinkhorn is easy to train and deploy. Experimentally, we improve upon the state-of-the-art on multiple image modeling benchmarks and show differentially private synthesis of informative RGB images.

Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers

*Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg, Allan D. Jepson
*The problem of sequence alignment is central in many AI applications, such as computational biology, video, audio, or multi-modal analysis. While it is easy to align “clean” signals, aligning sequences with outliers is harder and may be generally ambiguous. In this work, we propose Drop-DTW, a new algorithm for aligning sequences with interspersed outliers. Drop-DTW provides the optimal solution for simultaneous outlier detection and alignment of the outlier-free sequences, has an efficient implementation, and can be made differentiable. Using Drop-DTW, we are able to improve the general task of sequence retrieval, unsupervised and weakly-supervised representation learning, and propose a new, effective way to perform step localization in instructional videos. In all the applications, Drop-DTW achieves state-of-the-art results.
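
To make the idea of aligning while dropping outliers concrete, here is a simplified dynamic-programming sketch. It is not the authors' algorithm: it assumes scalar sequences, an absolute-difference matching cost, a single fixed `drop_cost`, and that only elements of the noisy sequence `z` can be dropped, while every element of the clean sequence `x` must be matched at least once. The function name `drop_dtw` and its return format are hypothetical.

```python
import math


def drop_dtw(x, z, drop_cost):
    """Align clean sequence x to noisy sequence z, allowing elements of z
    to be dropped. Returns (total_cost, sorted indices of dropped z items)."""
    n, m = len(x), len(z)
    # cost[i][j]: best cost after processing z[:j], with x[:i] matched so far.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for j in range(1, m + 1):
        for i in range(n + 1):
            # Option 1: drop z[j-1] and stay at the same position in x.
            best, move = cost[i][j - 1] + drop_cost, ("drop", i)
            if i >= 1:
                match = abs(x[i - 1] - z[j - 1])
                # Option 2: match z[j-1] to x[i-1], either continuing on
                # x[i-1] or advancing from x[i-2].
                for prev_i in (i, i - 1):
                    candidate = cost[prev_i][j - 1] + match
                    if candidate < best:
                        best, move = candidate, ("match", prev_i)
            cost[i][j], back[i][j] = best, move
    # Walk the back-pointers to recover which elements of z were dropped.
    dropped, i = [], n
    for j in range(m, 0, -1):
        kind, prev_i = back[i][j]
        if kind == "drop":
            dropped.append(j - 1)
        i = prev_i
    return cost[n][m], sorted(dropped)


clean = [1.0, 2.0, 3.0]
noisy = [1.0, 9.0, 2.0, 2.1, -7.0, 3.0]
total_cost, dropped_indices = drop_dtw(clean, noisy, drop_cost=0.5)
print(total_cost, dropped_indices)
```

In this toy input the values 9.0 and -7.0 are far from every element of the clean sequence, so paying the drop cost is cheaper than matching them.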

Dynamic Bottleneck for Robust Self-Supervised Exploration

*Chenjia Bai, Lingxiao Wang, Lei Han, Animesh Garg, Jianye Hao, Peng Liu, Zhaoran Wang**
*The tradeoff between exploration and exploitation has long been a major challenge in Reinforcement Learning (RL), especially for many real-world applications such as autonomous driving. An effective approach to self-supervised exploration is to design a dense intrinsic reward that motivates the agent to explore novel transitions. However, previous exploration methods become unstable when the states are noisy, e.g., containing dynamics-irrelevant information. For example, in autonomous driving tasks, the states captured by the camera may contain irrelevant objects, such as clouds whose movement resembles Brownian motion. If we measure the novelty of states or the curiosity of transitions through raw observed pixels, exploration is likely to be affected by the dynamics of these irrelevant objects. To solve this problem, we propose a Dynamic Bottleneck (DB) model to obtain a dynamics-relevant representation and discard the noise based on the information-bottleneck principle. We propose DB-bonus to encourage the agent to explore state-action pairs with high information gain. Experiments show DB-bonus outperforms several state-of-the-art exploration methods in noisy environments.

EditGAN: High-Precision Semantic Image Editing

*Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, Sanja Fidler
*Generative adversarial networks (GANs) have recently found applications in image editing. However, most GAN-based image editing methods require large-scale datasets with semantic segmentation annotations for training, only provide high-level control, or merely interpolate between different images. Here, we propose EditGAN, a novel method for high-quality, high-precision semantic image editing, allowing users to edit images by modifying their highly detailed part segmentation masks, e.g., drawing a new mask for the headlight of a car. EditGAN builds on a GAN framework that jointly models images and their semantic segmentations, requiring only a handful of labeled examples, making it a scalable tool for editing. Specifically, we embed an image into the GAN latent space and perform conditional latent code optimization according to the segmentation edit, which effectively also modifies the image. To amortize optimization, we find editing vectors in latent space that realize the edits. The framework allows us to learn an arbitrary number of editing vectors, which can then be directly applied on other images at interactive rates. We experimentally show that EditGAN can manipulate images with an unprecedented level of detail and freedom, while preserving full image quality. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data. We demonstrate EditGAN on a wide variety of image types and quantitatively outperform several previous editing methods on standard editing benchmark tasks.

Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

*Alexander Camuto, George Deligiannidis, Murat A. Erdogdu, Mert Gürbüzbalaban, Umut Şimşekli, Lingjiong Zhu
*Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Well studied in the dynamical systems literature, under mild assumptions, such IFSs can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the ‘complexity’ of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent, SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden-layered neural networks) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound. For modern neural networks, we develop an efficient algorithm to compute the developed bound and support our theory with various experiments on neural networks.

The Future is Log-Gaussian: ResNets and Their Infinite-Depth-and-Width Limit at Initialization

*Mufan (Bill) Li, Mihai Nica, Daniel M. Roy
*The infinite-width limit theory dramatically expanded our understanding of neural networks. Real life networks are, however, too deep: their performance deviates from the infinite-width theory. We study networks with residual connections in the infinite-depth-and-width limit, and show remarkable agreement between theoretical predictions and empirical measurements in real networks.

Grad2Task: Improved few-shot text classification using gradients for task representation

*Jixuan Wang, Kuan-Chieh Wang, Frank Rudzicz, Michael Brudno
*“Pretraining Transformer-based language models on unlabeled text and then fine-tuning them on target tasks has achieved tremendous success on various NLP tasks. However, the fine-tuning stage still requires a large amount of labeled data to achieve good performance. In this work, we propose a meta-learning approach for few-shot text classification, where only a handful of examples are given for each class. During training, our model learns useful prior knowledge from a set of diverse but related tasks. During testing, our model uses the learned knowledge to better solve various downstream tasks in different domains. We use gradients as features to represent the task. Compared with fine-tuning and other meta-learning approaches, we demonstrate better performance on a diverse set of text classification tasks. Our work is an inaugural exploration of using gradient-based task representations for meta-learning.”

Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

*Melih Barsbey, Milad Sefidgaran, Murat A. Erdogdu, Gaël Richard, Umut Şimşekli
*Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes the networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm has a key role in obtaining such compressible networks. Focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently, (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. In the case where these two phenomena occur simultaneously, we prove that the networks are guaranteed to be ‘ℓp-compressible’, and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which indeed confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy-tails, which, in combination with overparametrization, result in compressibility.
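
The link between heavy tails and compressibility can be illustrated without training a network at all. The sketch below is an assumption-laden toy: it draws i.i.d. "weights" from a Gaussian and from a symmetric Pareto distribution with tail index 1.5 (both hypothetical stand-ins for trained weights), keeps the largest 10% by magnitude, and compares the relative pruning error. It does not reproduce the paper's SGD dynamics or its ℓp-compressibility guarantees.

```python
import math
import random


def relative_pruning_error(weights, keep_fraction):
    """Zero all but the largest-magnitude weights and return
    ||w - pruned(w)||_2 / ||w||_2."""
    n_keep = int(len(weights) * keep_fraction)
    squared = sorted((w * w for w in weights), reverse=True)
    total_energy = sum(squared)
    # Energy of the weights that magnitude pruning removes.
    pruned_energy = sum(squared[n_keep:])
    return math.sqrt(pruned_energy / total_energy)


rng = random.Random(0)
n_weights = 10_000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n_weights)]
# Symmetric heavy-tailed samples: random sign times a Pareto variate.
heavy = [rng.choice((-1.0, 1.0)) * rng.paretovariate(1.5) for _ in range(n_weights)]

gaussian_error = relative_pruning_error(gaussian, keep_fraction=0.1)
heavy_error = relative_pruning_error(heavy, keep_fraction=0.1)
print(f"Gaussian: {gaussian_error:.3f}, heavy-tailed: {heavy_error:.3f}")
```

Because a few very large entries carry most of the norm of the heavy-tailed vector, discarding the small entries costs far less than it does for the Gaussian vector.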

How does a Neural Network’s Architecture Impact its Robustness to Noisy Labels?

*Jingling Li, Mozhi Zhang, Keyulu Xu, John P Dickerson, Jimmy Ba
*Noisy labels are inevitable in large real-world datasets. In this work, we explore an area understudied by previous works — how the network’s architecture impacts its robustness to noisy labels. We provide a formal framework connecting the robustness of a network to the alignments between its architecture and target/noise functions. Our framework measures a network’s robustness via the predictive power in its representations — the test performance of a linear model trained on the learned representations using a small set of clean labels. We hypothesize that a network is more robust to noisy labels if its architecture is more aligned with the target function than the noise. To support our hypothesis, we provide both theoretical and empirical evidence across various neural network architectures and different domains. We also find that when the network is well-aligned with the target function, its predictive power in representations could improve upon state-of-the-art (SOTA) noisy-label-training methods in terms of test accuracy and even outperform sophisticated methods that use clean labels.

Identifying and Benchmarking Natural Out-of-Context Prediction Problems

*David Madras, Richard Zemel
*Deep learning systems frequently fail at out-of-context (OOC) prediction, the problem of making reliable predictions on uncommon or unusual inputs or subgroups of the training distribution. To this end, a number of benchmarks for measuring OOC performance have recently been introduced. In this work, we introduce a framework unifying the literature on OOC performance measurement, and demonstrate how rich auxiliary information can be leveraged to identify candidate sets of OOC examples in existing datasets. We present NOOCh: a suite of naturally-occurring “challenge sets”, and show how varying notions of context can be used to probe specific OOC failure modes. Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions.

Learning Domain Invariant Representations in Goal-conditioned Block MDPs

*Beining Han, Chongyi Zheng, Harris Chan, Keiran Paster, Michael R. Zhang, Jimmy Ba*

Deep Reinforcement Learning (RL) is successful in solving many complex Markov Decision Process (MDP) problems. However, agents often face unanticipated environmental changes after deployment in the real world. These changes are often spurious and unrelated to the underlying problem, such as background shifts for visual input agents. Unfortunately, deep RL policies are usually sensitive to these changes and fail to act robustly against them. This resembles the problem of domain generalization in supervised learning. In this work, we study this problem for goal-conditioned RL agents. We propose a theoretical framework in the Block MDP setting that characterizes the generalizability of goal-conditioned policies to new environments. Under this framework, we develop a practical method, PA-SkewFit, that enhances domain generalization. The empirical evaluation shows that our goal-conditioned RL agent can perform well in various unseen test environments, improving by 50% over baselines.

Learning Generalized Gumbel-max Causal Mechanisms

*Guy Lorberbom, Daniel D. Johnson, Chris J. Maddison, Daniel Tarlow, Tamir Hazan*

Counterfactual inference allows us to answer “what if” questions, but the correct answers to these questions cannot be uniquely identified by observing and interacting with the world. We propose a learnable family of causal models which can be trained to give “good” answers to counterfactual queries, based on user-specified criteria. Our models generalize the previously-proposed Gumbel-max structural causal model, and can be used to answer new counterfactual queries not seen at training time.

Learning Optimal Predictive Checklists

*Haoran Zhang, Quaid Morris, Berk Ustun, Marzyeh Ghassemi*

Checklists are commonly used decision aids in the clinical setting. One reason checklists are so effective is their simple form: they can be filled out in a couple of minutes, they do not require any specialized hardware to deploy (only a printed sheet), and they are easily verifiable, unlike black box machine learning models. However, the vast majority of current checklists are created by panels of experts using domain expertise. In this work, we propose a method to create predictive checklists from *data*. Creating checklists from data gives us measurable evaluation criteria (i.e. there is some concrete metric that we can use to evaluate checklists). It also allows for rapid model development: we can make checklists in a matter of hours, instead of needing to wait months for the panel of experts. Our method formulates checklist creation as an integer program which directly minimizes the error rate of the checklist. Crucially, our method also allows for the inclusion of customizable constraints (e.g. on checklist form, performance, or fairness), and yields insight into when a checklist is not an appropriate model for the particular task. We find that our method outperforms existing baseline methods, and present two case studies to demonstrate the practical utility of our method where 1) we train a checklist to predict mortality in ICU patients with group fairness constraints, and 2) we learn a short-form version of the PTSD Checklist for DSM-5 that is faster to complete while maintaining accuracy.
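
As a rough illustration of learning a checklist from data, the sketch below assumes a checklist of the "M of N" form: predict positive when at least M of the N selected yes/no items are checked. It finds the checklist with the lowest training error on a made-up toy dataset by exhaustive search. The exhaustive search is a stand-in for the integer program mentioned above and only scales to a handful of features; the function names and the data are hypothetical.

```python
from itertools import combinations


def checklist_predict(row, items, threshold):
    """Predict 1 when at least `threshold` of the selected binary items are checked."""
    return int(sum(row[i] for i in items) >= threshold)


def error_rate(X, y, items, threshold):
    """Fraction of rows where the checklist disagrees with the label."""
    wrong = sum(
        checklist_predict(row, items, threshold) != label
        for row, label in zip(X, y)
    )
    return wrong / len(y)


def best_checklist(X, y, max_items):
    """Try every item subset and threshold; keep the first one with the lowest error."""
    n_features = len(X[0])
    best = None
    for n in range(1, max_items + 1):
        for items in combinations(range(n_features), n):
            for threshold in range(1, n + 1):
                err = error_rate(X, y, items, threshold)
                if best is None or err < best[0]:
                    best = (err, items, threshold)
    return best


# Toy data (hypothetical): four binary features per patient. The label is 1
# when at least two of features 0, 1, 2 are present; feature 3 is noise.
X = [
    [1, 1, 0, 0], [1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1],
    [1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

err, items, threshold = best_checklist(X, y, max_items=3)
print(f"items={items}, threshold={threshold}, training error={err}")
```

Constraints such as a cap on checklist length correspond here to the `max_items` argument; an integer programming solver would express them as linear constraints instead.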

Learning Tree Interpretation from Object Representation for Deep Reinforcement Learning

*Guiliang Liu, Xiangyu Sun, Oliver Schulte, Pascal Poupart*

Interpreting Reinforcement Learning (RL) policies is important to enhance trust and comply with transparency regulations. We describe a new technique to explain RL policies in terms of high-level object features instead of low-level features such as pixels or raw sensor measurements. The approach constructs an interpretable decision tree over high-level object features that mimics the RL policy we wish to explain.

Lossy Compression for Lossless Prediction

*Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, Chris J. Maddison*

Billions of terabytes of data are collected every year. At these scales, most data is not seen by humans. Instead, it is processed by algorithms. Yet, standard data compressors (e.g. JPEG) are optimized such that reconstructions look similar to humans. In this paper, we lay the theoretical foundations of compression for downstream use by learning algorithms. On the practical side, we propose a simple algorithm for training a generic compressor, which compresses standard images more than 1000x better than JPEG, without hindering downstream machine learning performance. In the long term, we hope that such compression will enable individuals to process data at scales that are currently only possible at large institutions.

Manipulating SGD with Data Ordering Attacks

*Ilia Shumailov, Zakhar Shumaylov, Dmitry Kazhdan, Yiren Zhao, Nicolas Papernot, Murat A. Erdogdu, Ross Anderson*

In this paper we present a novel class of training-time attacks that require no changes to the underlying dataset or model architecture, but instead only change the order in which data are supplied to the model. In essence, we show that data sampling bias is a crucial component of any stochastic optimisation, and that by controlling the randomness, i.e., the order in which data is shown to the model, an attacker can slow learning, stop learning, and sometimes make the model learn things it is not supposed to.
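
To see why ordering alone can matter, the toy sketch below runs one epoch of plain SGD on the same eight numbers in three different orders. It is not the attack from the paper, only an illustration of order sensitivity; the loss, learning rate, data, and function name are all assumptions chosen to keep the example small.

```python
def sgd_one_epoch(data, lr=0.5, w=0.0):
    """One epoch of SGD on the loss (w - x)^2 / 2, one sample per step."""
    for x in data:
        grad = w - x      # derivative of (w - x)^2 / 2 with respect to w
        w -= lr * grad
    return w


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
true_mean = sum(data) / len(data)   # the minimizer of the average loss

mixed_order = [4.0, 5.0, 1.0, 8.0, 3.0, 6.0, 2.0, 7.0]   # an arbitrary mixed order
ascending = sorted(data)                   # chosen order: small values first
descending = sorted(data, reverse=True)    # chosen order: large values first

w_mixed = sgd_one_epoch(mixed_order)
w_up = sgd_one_epoch(ascending)
w_down = sgd_one_epoch(descending)

print(f"target {true_mean}: mixed={w_mixed}, ascending={w_up}, descending={w_down}")
```

The dataset and the update rule are identical in all three runs; only the order differs, yet the two sorted orders end on opposite sides of the true minimizer.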

Medical Dead-ends and Learning to Identify High-risk States and Treatments

*Mehdi Fatemi, Taylor W. Killian, Jayakumar Subramanian, Marzyeh Ghassemi*

Patient-clinician interactions are inherently sequential processes where treatment decisions are made and adapted based on an expert’s understanding of how a patient’s health evolves. While RL has been shown to be a powerful tool for learning optimal decision strategies (learning *what to do*), guarantees for finding these solutions depend on the ability to experiment with possible strategies to collect more data. This type of exploration is not possible in a healthcare setting, making learning optimal strategies impossible. In this work, we propose to invert the RL paradigm in data-limited, safety-critical settings to investigate high-risk treatments as well as patient health states. We train the algorithm to identify treatments *to avoid choosing* so as to keep the patient from irrecoverably negative health outcomes, defined as a medical dead-end. We apply this approach (Dead-end Discovery, or DeD) to a real-world clinical task using the MIMIC-III dataset, treating critically ill patients who have developed sepsis. We establish the existence of dead-ends and demonstrate the utility of DeD, raising warnings that indicate when a patient or treatment embodies elevated or extreme risk of encountering a dead-end and thereby death.

Meta-Learning to Improve Pre-Training

*Aniruddh Raghu, Jonathan Lorraine, Simon Kornblith, Matthew McDermott, David Duvenaud*

Pre-training large models is useful, and is necessary for the state of the art in many machine learning tasks. However, it adds many extra parameters, which are hard to tune. We give a scalable, gradient-based way to tune pre-training parameters. Because exact pre-training gradients are intractable, we approximate them. Specifically, we compose implicit differentiation for the long, almost-converged pre-training stage, with backprop through training for the short fine-tuning stage. We applied approximate pre-training gradients to tune thousands of task weights for graph-based protein function prediction, and to learn an entire data augmentation neural net for contrastive learning on electrocardiograms.

Minimax Optimal Quantile and Semi-Adversarial Regret via Root-Logarithmic Regularizers

*Jeffrey Negrea, Blair Bilodeau, Nicolò Campolongo, Francesco Orabona, Daniel M. Roy*

In prediction with expert advice, one attempts to match the performance of a set of reference predictors/forecasters. In this work, we give provably optimal algorithms for two variants of this task: matching the performance of the top k%, and matching the performance of the best expert whether the data are nice, or naughty, or somewhere in between.
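
For readers new to the setting, the sketch below shows prediction with expert advice using the classic Hedge (exponential weights) algorithm. This is textbook background, not the root-logarithmic regularizer method of the paper. The loss sequence and variable names are made up for illustration; the learning rate and the bound it is compared against are the standard ones for losses in [0, 1].

```python
import math


def hedge(loss_rounds, eta):
    """Run Hedge; return the learner's expected total loss and each expert's total loss."""
    n_experts = len(loss_rounds[0])
    cumulative = [0.0] * n_experts
    learner_total = 0.0
    for losses in loss_rounds:
        # Weights decay exponentially in each expert's past loss.
        # Subtracting the minimum keeps the exponentials numerically stable.
        best_so_far = min(cumulative)
        weights = [math.exp(-eta * (c - best_so_far)) for c in cumulative]
        total_weight = sum(weights)
        probs = [w / total_weight for w in weights]
        # The learner suffers the probability-weighted loss of the experts.
        learner_total += sum(p * l for p, l in zip(probs, losses))
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return learner_total, cumulative


T = 100
loss_rounds = [
    [1.0 if t % 4 == 0 else 0.0,   # expert 0: wrong every fourth round
     float(t % 2),                 # expert 1: wrong every other round
     1.0]                          # expert 2: always wrong
    for t in range(T)
]

n_experts = 3
eta = math.sqrt(8 * math.log(n_experts) / T)
learner_loss, expert_losses = hedge(loss_rounds, eta)

regret = learner_loss - min(expert_losses)       # gap to the best single expert
bound = math.sqrt(T * math.log(n_experts) / 2)   # standard Hedge regret bound
print(f"regret={regret:.3f}, bound={bound:.3f}")
```

Regret against the single best expert is the quantity printed here; the quantile variant mentioned above instead compares the learner with the top k% of experts.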

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices

*Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, Gennady Pekhimenko*

Training of deep neural networks is often accelerated by combining the power of multiple servers with distributed algorithms. Unfortunately, communication-efficient versions of these algorithms frequently require reliable high-speed connections usually available only in dedicated clusters. This work proposes Moshpit All-Reduce, a fault-tolerant scalable algorithm for decentralized averaging that maintains favorable convergence properties compared with regular distributed approaches. We show that Moshpit SGD, a distributed optimization method based on this algorithm, has both strong theoretical guarantees and high practical efficiency. In particular, we demonstrate gains of 1.3-1.5x in large-scale deep learning experiments such as ImageNet classification with ResNet-50 or ALBERT-large pretraining on BookCorpus.
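
To make decentralized averaging concrete, the sketch below arranges nine simulated peers in a hypothetical 3x3 grid and averages first within rows and then within columns. After the two rounds every peer holds the global mean, even though each peer only ever communicates within a group of three. The grid layout, function name, and values are assumptions for illustration; the abstract does not spell out the protocol, and this toy leaves out fault tolerance entirely.

```python
def average_within_groups(values, groups):
    """Replace each peer's value with the mean of the group it belongs to."""
    result = list(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


side = 3
values = [float(v) for v in range(1, side * side + 1)]   # peer i holds the value i + 1
global_mean = sum(values) / len(values)

# Peer index = row * side + col.
rows = [[r * side + c for c in range(side)] for r in range(side)]
cols = [[r * side + c for r in range(side)] for c in range(side)]

after_rows = average_within_groups(values, rows)      # round 1: average inside each row
after_cols = average_within_groups(after_rows, cols)  # round 2: average inside each column

print(after_rows)
print(after_cols, "global mean:", global_mean)
```

The trick is that each column contains exactly one peer from every row, so averaging the row means within a column yields the mean over all peers.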

Neural Hybrid Automata: Learning Dynamics With Multiple Modes and Stochastic Transitions

*Michael Poli, Stefano Massaroli, Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Atsushi Yamashita, Hajime Asama, Jinkyoo Park, Animesh Garg*

Due to their ability to incorporate constraints and domain-specific prior knowledge, implicit neural network models are seeing extensive application to the traditional problems of forecasting and control. Among them, neural differential equations represent a natural choice for continuous-time systems, whose evolution of state variables is described by differential equations. Despite some recent successes, several open questions remain; in particular, it is still unclear how best to leverage this class of models to perform prediction in multi-mode systems subject to discrete events, such as impacts or shocks. These systems, known as Stochastic Hybrid Systems (SHSs), are most common in real-world applications, with notable examples in biological systems, traffic networks, financial markets, and robotics. This work introduces Neural Hybrid Automata (NHA), a scalable multi-stage method based on normalizing flows, neural differential equations and self-supervision on system trajectory data. Neural Hybrid Automata is the first deep learning approach capable of learning and simulating the large class of SHSs from data, without knowledge of the number of modes of operation of a target system. The effectiveness of the NHA blueprint shows how, with careful considerations, implicit models can be applied to most systems, while remaining general and scalable deep learning approaches.

OctField: Hierarchical Implicit Functions for 3D Modeling

*Jia-Heng Tang, Weikai Chen, Jie Yang, Bo Wang, Songrun Liu, Bo Yang, Lin Gao*

Recent advances in localized implicit functions have enabled neural implicit representation to be scalable to large scenes. However, the regular subdivision of 3D space employed by these approaches fails to take into account the sparsity of the surface occupancy and the varying granularities of geometric details. As a result, its memory footprint grows cubically with the input volume, leading to a prohibitive computational cost even at a moderately dense decomposition. In this work, we present a learnable hierarchical implicit representation for 3D surfaces, called OctField, that allows high-precision encoding of intricate surfaces with low memory and computational budget. The key to our approach is an adaptive decomposition of 3D scenes that only distributes local implicit functions around the surface of interest. We achieve this goal by introducing a hierarchical octree structure to adaptively subdivide the 3D space according to the surface occupancy and the richness of part geometry. As the octree is discrete and non-differentiable, we further propose a novel hierarchical network that models the subdivision of octree cells as a probabilistic process and recursively encodes and decodes both octree structure and surface geometry in a differentiable manner. We demonstrate the value of OctField for a range of shape modeling and reconstruction tasks, showing superiority over alternative approaches.

On Empirical Risk Minimization with Dependent and Heavy-Tailed Data

*Abhishek Roy, Krishnakumar Balasubramanian, Murat A. Erdogdu*

In this work, we establish risk bounds for Empirical Risk Minimization (ERM) with both dependent and heavy-tailed data-generating processes. We do so by extending the seminal works of Mendelson [Men15, Men18] on the analysis of ERM with heavy-tailed but independent and identically distributed observations, to the strictly stationary exponentially β-mixing case. Our analysis is based on explicitly controlling the multiplier process arising from the interaction between the noise and the function evaluations on inputs. It allows for the interaction to be even polynomially heavy-tailed, which covers a significantly large class of heavy-tailed models beyond what is analyzed in the learning theory literature. We illustrate our results by deriving rates of convergence for the high-dimensional linear regression problem with dependent and heavy-tailed data.

Parameter Prediction for Unseen Deep Architectures

*Boris Knyazev, Michal Drozdzal, Graham W. Taylor, Adriana Romero-Soriano*

Do we still need SGD or Adam to train neural networks? Recent research by the Vector Institute in collaboration with Facebook AI Research (now Meta) suggests a step towards an alternative approach to train networks. Led by University of Guelph PhD student Boris Knyazev, the team developed a technique to initialize diverse neural network architectures using a “meta-model”. This research challenges the long-held assumption that gradient-based optimizers are required to train deep neural networks. Astonishingly, the meta-model can predict parameters for almost any neural network in just one forward pass, achieving ~60% accuracy on the popular CIFAR-10 dataset without any training. Moreover, while the meta-model was training, it did not observe any network close to the ResNet-50 whose ~25 M parameters it predicted. In the vein of the team’s 2020 work to reduce the computational requirements of GANs, this approach democratizes DL by making the technology accessible to smaller players in the field such as startup companies and not-for-profits. It will be presented at NeurIPS 2021.

Quantifying and Improving Transferability in Domain Generalization

*Guojun Zhang, Han Zhao, Yaoliang Yu, Pascal Poupart*

When transferring a predictor from the lab to the real world, there are always discrepancies between the data in the lab and data in the wild. In this paper, we quantify the transferability of data features and describe a new algorithm to compute transferable features. This work advances the state of the art in data analysis when there is a need to transfer a predictor trained in some domain (e.g., client A) to a new domain (e.g., client B).

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

*Muchen Li, Leonid Sigal*

The ability to localize, or ground, a lingual query describing an entity in an image is a fundamental task for humans and, by extension, any artificial visual recognition system. Specifically, given a query phrase (*e.g.*, “a blue sedan”, “a man with a beard wearing a leather jacket”) the goal is to output a box or a pixel-level mask tightly encompassing the described entity in the image. Most existing methods tackle this problem in two steps: first, localize a set of regions in images that contain potential entities of interest, and, second, see which of these regions best matches the provided query description. The core issue with such methods is that errors in the first stage fundamentally limit performance of the second. In this work we propose a single-stage architecture, which is capable of simultaneous language grounding at both a bounding-box and pixel level. Notably, most prior approaches could do one or the other, but not both. Our model also enables contextualized reasoning by taking into account the entire image, all query phrases of interest and (optionally) lingual context in order to improve the performance. Our model is relatively simple, yet outperforms state-of-the-art methods by a large margin. In addition to being more accurate, our approach is also considerably faster, since it allows localization of multiple query phrases at the same time and at different granularity.

Scalable Neural Data Server: A Data Recommender for Transfer Learning

*Tianshi Cao, Sasha (Alexandre) Doubov, David Acuna, Sanja Fidler*

Absence of large-scale labeled data in the practitioner’s target domain can be a bottleneck to applying machine learning algorithms in practice. Transfer learning is a popular strategy for leveraging additional data to improve the downstream performance, but finding the most relevant data to transfer from can be challenging. Neural Data Server (NDS), a search engine that recommends relevant data for a given downstream task, has been previously proposed to address this problem (Yan et al., 2020). NDS uses a mixture of experts trained on data sources to estimate similarity between each source and the downstream task. Thus, the computational cost to each user grows with the number of sources and requires an expensive training step for each data provider. To address these issues, we propose Scalable Neural Data Server (SNDS), a large-scale search engine that can theoretically index thousands of datasets to serve relevant ML data to end users. SNDS trains the mixture of experts on intermediary datasets during initialization, and represents both data sources and downstream tasks by their proximity to the intermediary datasets. As such, computational cost incurred by users of SNDS remains fixed as new datasets are added to the server, without pre-training for the data providers. We validate SNDS on a plethora of real world tasks and find that data recommended by SNDS improves downstream task performance over baselines. We also demonstrate the scalability of our system by showing its ability to select relevant data for transfer outside of the natural image setting.

Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks

*Xinlin Li, Bang Liu, Yaoliang Yu, Wulong Liu, Chunjing Xu, Vahid Partovi Nia*

Shift neural networks reduce computation complexity by removing expensive multiplication operations and quantizing continuous weights into low-bit discrete values, making them fast and energy-efficient compared to conventional neural networks. However, existing shift networks are sensitive to the weight initialization and yield degraded performance caused by the vanishing gradient and weight sign freezing problems. To address these issues, we propose S3 re-parameterization, a novel technique for training low-bit shift networks. Our method decomposes a discrete parameter in a sign-sparse-shift 3-fold manner. This way, it efficiently learns a low-bit network with weight dynamics similar to full-precision networks and insensitive to weight initialization. Our proposed training method pushes the boundaries of shift neural networks and shows 3-bit shift networks compete with their full-precision counterparts in terms of top-1 accuracy on ImageNet.
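
The sketch below illustrates why shift networks avoid multiplications: when a weight is zero or a signed power of two, multiplying an integer activation by it reduces to a bit shift and an optional negation. The three-part encoding used here (a sign bit, a sparsity bit, and a shift amount) is an assumption made for illustration and may differ from the exact parameterization in the paper; the function names are hypothetical.

```python
def compose_weight(sign_bit, sparse_bit, shift):
    """Build a shift-network weight from three discrete parts.

    sign_bit:   0 means positive, 1 means negative
    sparse_bit: 0 means the weight is zero, 1 means it is non-zero
    shift:      non-negative exponent, so the magnitude is 2 ** shift
    """
    sign = -1 if sign_bit else 1
    return sign * sparse_bit * (1 << shift)


def shift_multiply(x, sign_bit, sparse_bit, shift):
    """Multiply the integer x by the composed weight without using `*`."""
    if not sparse_bit:
        return 0
    shifted = x << shift
    return -shifted if sign_bit else shifted


# Every weight reachable with shift amounts 0..3.
weights = sorted({
    compose_weight(s, z, k)
    for s in (0, 1) for z in (0, 1) for k in range(4)
})
print(weights)
print(shift_multiply(5, sign_bit=1, sparse_bit=1, shift=2))
```

Keeping the sign and the zero/non-zero decision as separate parts is what lets each of them be learned on its own, which is the intuition behind the 3-fold decomposition described above.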

Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation

*David Acuna, Jonah Philion, Sanja Fidler*

Autonomous driving relies on a huge volume of real-world data to be labeled to high precision. Alternative solutions seek to exploit driving simulators that can generate large amounts of labeled data with a plethora of content variations. However, the domain gap between the synthetic and real data remains, raising the following important question: What are the best ways to utilize a self-driving simulator for perception tasks? In this work, we build on top of recent advances in domain-adaptation theory, and from this perspective, propose ways to minimize the reality gap. We primarily focus on the use of labels in the synthetic domain alone. Our approach introduces both a principled way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator. Our method is easy to implement in practice as it is agnostic of the network architecture and the choice of the simulator. We showcase our approach on the bird’s-eye-view vehicle segmentation task with multi-sensor data (cameras, lidar) using an open-source simulator (CARLA), and evaluate the entire framework on a real-world dataset (nuScenes). Last but not least, we show what types of variations (e.g. weather conditions, number of assets, map design, and color diversity) matter to perception networks when trained with driving simulators, and which ones can be compensated for with our domain adaptation technique.

Towards a Unified Information-Theoretic Framework for Generalization

*Mahdi Haghifam, Gintare Karolina Dziugaite, Shay Moran, Daniel M. Roy*

One of the key properties of a learning algorithm is its ability to generalize to unseen data. In this work, we show that information theory yields nearly optimal theories of generalization in many more scenarios than previously thought, providing evidence that viewing learning as a communication channel is a unifying lens.

TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

*Tanzila Rahman, Mengyu Yang, Leonid Sigal*

Audio-visual learning that leverages the relationship between visual and auditory signals is an important sub-field of machine learning and computer vision. Examples of typical tasks that such models can solve include *audio-visual separation and localization*, where the goal is to segment sounds produced by individual objects in an audio signal and/or to localize those objects in a visual scene; and *audio-visual correspondence*, where the goal is often audio-visual retrieval, *e.g.*, retrieval of the corresponding visual for a sound. Most existing approaches for these problems extract information from the necessary modalities (audio or visual) and then construct problem-specific algorithms to merge these representations in order to solve a specific task. This is contrary to current trends in other problem domains, where over the past few years, approaches have largely consolidated around architectures that are designed to learn generic and problem agnostic representations that can then be easily leveraged for specific tasks. In this work, we formulate a generic human-centric audio-visual representation learning, with an explicit goal of improving the state-of-the-art in audio-visual sound source separation. Our transformer model takes three streams of information (video, audio, and human pose) and fuses this information to arrive at enriched representations that can then be used for the final audio-visual sound separation. The use of human pose is inspired by recent works that illustrate that such representations can significantly boost performance in many audio-visual scenarios where often one or more persons are responsible for the sound explicitly (*e.g.*, talking) or implicitly (*e.g.*, sound produced as a function of human manipulating an object). We illustrate that learned representations are general, useful and improve performance on other auxiliary tasks (*e.g.*, forms of cross-modal audio-visual-pose retrieval) by a substantial margin.

Variational Model Inversion Attacks

*Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, Alireza Makhzani*

Given the ubiquity of deep neural networks, it is important that these models do not reveal information about sensitive data that they have been trained on. In model inversion attacks, a malicious user attempts to recover the private dataset used to train a supervised neural network. A successful model inversion attack should generate realistic and diverse samples that accurately describe each of the classes in the private dataset. In this work, we provide a probabilistic interpretation of model inversion attacks, and formulate a variational objective that accounts for both diversity and accuracy. In order to optimize this variational objective, we choose a variational family defined in the code space of a deep generative model, trained on a public auxiliary dataset that shares some structural similarity with the target dataset. Empirically, our method substantially improves performance in terms of target attack accuracy, sample realism, and diversity on datasets of faces and chest X-ray images.