This year’s edition of the International Conference on Machine Learning (ICML) will be a virtual one, running Sunday, July 13th through Saturday, July 18th.

ICML is one of the world’s leading machine learning conferences, as evidenced by the massive volume of ML papers submitted. This year, the conference accepted 1,088 papers from 4,990 submissions, an acceptance rate of 21.8 percent (down from 22.6 percent in 2019). Vector Faculty Members had 19 Papers accepted, with an additional five from Vector Faculty Affiliates. Taken as a whole, Vector-related papers make up 2.21 percent of the total number of accepted papers at ICML 2020.

Below are plain-language summaries of some of the work Vector researchers will present at this year’s edition of the conference. Those marked with an * indicate the paper’s original abstract.

**Vector Faculty**

**Angular Visual Hardness***Beidi Chen (Rice University) · Weiyang Liu (Georgia Tech) · Zhiding Yu (NVIDIA) · Jan Kautz (NVIDIA) · Anshumali Shrivastava (Rice University) · Animesh Garg (University of Toronto, Vector Institute, Nvidia) · Anima Anandkumar (Caltech)*

Deep Learning models for image reasoning have impressive performance but often suffer from poor calibration. They tend to be overconfident, with the model confidence not always reflecting the underlying true ambiguity and hardness of datapoints. We propose a new method of measuring the hardness of data points, a method which we call Angular Visual Hardness — given a particular image, how hard is it for a machine learning model to tell if this is a cinnamon bun or a curled-up puppy?! We also found that AVH has a statistically significant correlation with human visual hardness, so people also find those problems equally hard. The measure of this confusion is important to train better models with lesser data as well as models that can be fairer in decision making with imbalanced datasets. Aligning training with this measure of hardness also enables better generalization to changes in the test data distribution.

**Causal Modeling for Fairness In Dynamical Systems***Elliot Creager (University of Toronto) · David Madras (University of Toronto) · Toniann Pitassi (University of Toronto) · Richard Zemel (Vector Institute)*

In machine learning (ML) applications with fairness concerns, available data are often “missing not at random” due to biases in the historical policy used for decision making. In lending, for example, the true creditworthiness of an applicant is unknown unless a loan is offered, so if a population of applicants was offered fewer loans under the historical policy, this population will be “data scarce” in the resulting dataset. We study the long-term fairness of ML decision making, and show that recent papers from this literature can be recast as causal models. This modeling paradigm is better suited to data-missing-not-at-random problems, and enables the use of tools from the causal inference literature to improve estimation of how future policies will perform on data scarce populations, ultimately improving the search for fairer policies in these settings.

**Convex Representation Learning for Generalized Invariance in Semi-Inner-Product Space***Yingyi Ma (UIC) · Vignesh Ganapathiraman (University of Illinois at Chicago) · Yaoliang Yu (University of Waterloo) · Xinhua Zhang (University of Illinois at Chicago (UIC)*Automatically discovering representations that are invariant to certain transformations (such as changes in pose and facial expressions, translation or rotation of rigid objects, semantic or logic relationships between classes, and structured relationships between entities) has been the key for learning algorithms to cope with the enormous amount of variations of real objects. We develop a regularization framework based on semi-inner-products to induce a wide set of generalized invariances that extend beyond the classic reproducing kernel Hilbert spaces. We provide efficient implementations based on Euclidean embeddings and demonstrate the effectiveness of our algorithm on data augmentation and structured multilabel prediction.

**Cutting out the Middle-Man: Training and Evaluating Energy-Based Models without Sampling***Will Grathwohl (University of Toronto) · Kuan-Chieh Wang (University of Toronto) · Joern-Henrik Jacobsen (Vector Institute and University of Toronto) · David Duvenaud (University of Toronto) · Richard Zemel (Vector Institute)*

If we want to make predictions based on a dataset, we need to fit a model that says how likely different predictions are. Fitting this model involves comparing predictions to the original data. Because there are infinitely many possible predictions, it can be hard to quantify the difference between all possible predictions the model could make and the data. Our paper proposes to use a neural network to learn the difference between the model predictions and the data. Once we’ve fit this network, we can adjust the model predictions to reduce this difference.

**Detecting Out-of-Distribution Examples with Gram Matrices***Chandramouli Shama Sastry (Dalhousie University/Vector Institute) · Sageev Oore (Dalhousie University and Vector Institute)*

Neural networks often “don’t know what they don’t know”: an accurate skin lesion classifier might not only classify a dog image as a particular type of lesion, but it might do so very confidently. This happens because dogs constitute a type (i.e. distribution) of image different from anything that the network has seen during training. One of the hard parts of dealing with this in practice is that the network cannot prepare for it in advance (i.e. it can’t peek at all possible kinds of images it might see when it is deployed!). This research achieves state-of-the-art results on detecting when an unfamiliar type of image is presented. Our algorithm does so by noticing unusual patterns in the network’s internal activations.

**Evaluating Lossy Compression Rates of Deep Generative Models***Sicong Huang (University of Toronto) · Alireza Makhzani (University of Toronto) · Yanshuai Cao (Borealis AI) · Roger Grosse (University of Toronto and Vector Institute)*

Generative Adversarial Networks (GANs) have shown an impressive ability to generate convincing looking images, but we still don’t know how well they match the overall distribution of images. We’ve introduced a method to evaluate their distribution modeling ability in terms of lossy compression: the ability to encode an image using a much smaller number of bits than it would take to encode it exactly. By measuring the tradeoff curve between number of bits and reconstruction error, we get a much more complete picture of the distribution modeling performance, compared with traditional scalar-valued metrics.

**Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations***Florian Tramer (Stanford University) · Jens Behrmann (University of Bremen) · Nicholas Carlini (Google) · Nicolas Papernot (University of Toronto and Vector Institute) · Joern-Henrik Jacobsen (Vector Institute and University of Toronto)*

Our work was about adversarial examples: in perception tasks, adversarial examples are images perturbed to mislead a ML model into making incorrect predictions. The main question we asked in our ICML paper was: is the progress we are making in defending against adversarial example research leading to meaningful progress for the robustness of ML in practice? We showed that this is not entirely the case because of the way we define adversarial examples which makes oversimplifying assumptions in turn exposing models to new forms of attacks.

**Generalization via Derandomization***Jeffrey Negrea (U of T student, at Vector), Gintare Karolina Dziugaite (Element AI), Daniel M. Roy*

There is a wide gap between the performance of deep learning that we observe in practice and that predicted by our best theories. This work demonstrates how to circumvent a theoretical roadblock that seemingly rendered many standard tools ineffective.

**Hierarchical Verification for Adversarial Robustness****Cong Han Lim (Uber ATG) · Raquel Urtasun (Uber ATG) · Ersin Yumer (Uber ATG)*

We introduce a new framework for the exact pointwise `p robustness verification problem that exploits the layer-wise geometric structure of deep feed-forward networks with rectified linear activations (ReLU networks). The activation regions of the network partition the input space, and one can verify the `p robustness around a point by checking all the activation regions within the desired radius. The GeoCert algorithm (Jordan et al., 2019) treats this partition as a generic polyhedral complex in order to detect which region to check next. In contrast, our LayerCert framework considers the nested hyperplane arrangement structure induced by the layers of the ReLU network and explores regions in a hierarchical manner. We show that, under certain conditions on the algorithm parameters, LayerCert provably reduces the number and size of the convex programs that one needs to solve compared to GeoCert. Furthermore, our LayerCert framework allows the incorporation of lower bounding routines based on convex relaxations to further improve performance. Experimental results demonstrate that LayerCert can significantly reduce both the number of convex programs solved and the running time over the state-of-the-art.

**Improved Bounds on Minimax Regret under Logarithmic Loss via Self-Concordance***Blair Bilodeau (University of Toronto), Dylan Foster (MIT), Daniel Roy (Univ of Toronto)*

We are often interested in forecasting future events over time, such as the next day’s weather or stock market performance. In this work, we make significant advances to our understanding of the mathematical limits of forecasting performance.

**Improving Transformer Optimization Through Better Initialization****Xiao Shi Huang (Layer6 AI) · Felipe Perez (Layer6 AI) · Jimmy Ba (University of Toronto) · Maksims Volkovs (Layer6 AI)*

The Transformer architecture has achieved considerable success recently; the key component of the Transformer is the attention layer that enables the model to focus on important regions within an input sequence. Gradient optimization with attention layers can be notoriously difficult requiring tricks such as learning rate warmup to prevent divergence. As Transformer models are becoming larger and more expensive to train, recent research has focused on understanding and improving optimization in these architectures. In this work our contributions are two-fold: we first investigate and empirically validate the source of optimization problems in the encoder-decoder Transformer architecture; we then propose a new weight initialization scheme with theoretical justification, that enables training without warmup or layer normalization. Empirical results on public machine translation benchmarks show that our approach achieves leading accuracy, allowing to train deep Transformer models with 200 layers in both encoder and decoder (over 1000 attention/MLP blocks) without difficulty.

**Linear Mode Connectivity and the Lottery Ticket Hypothesis***Jonathan Frankle (MIT), Gintare Karolina Dziugaite (Element AI), Daniel M. Roy (Vector Institute), Michael Carbin (MIT)*

In this work, we connect two seemingly distinct phenomena in neural network training: linear mode connectivity and the lottery ticket hypothesis. On standard vision networks and benchmarks, we show that sparse trainable subnetworks arise when the effect of minibatch noise drops below a well-defined level.

**Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning***Silviu Pitis (University of Toronto) · Harris Chan (University of Toronto, Vector Institute) · Stephen Zhao (University of Toronto) · Bradly Stadie (Vector Institute) · Jimmy Ba (University of Toronto)*

We design goal-seeking robots that set and pursue goals according to their own past experience and abilities. By setting goals that drive them to sparsely-explored, novel situations, our simulated robots learn to navigate mazes and manipulate blocks in record time, with only a fraction of the environment interactions used by previous state-of-the-art approaches.

**Multi-Agent Routing Value Iteration Network****Quinlan Sykora (Uber ATG) · Mengye Ren (Uber ATG / University of Toronto) · Raquel Urtasun (Uber ATG)*

In this paper we tackle the problem of routing multiple agents in a coordinated manner. This is a complex problem that has a wide range of applications in fleet management to achieve a common goal, such as mapping from a swarm of robots and ride sharing. Traditional methods are typically not designed for realistic environments which contain sparsely connected graphs and unknown traffic, and are often too slow in runtime to be practical. In contrast, we propose a graph neural network based model that is able to perform multi-agent routing based on learned value iteration in a sparsely connected graph with dynamically changing traffic conditions. Moreover, our learned communication module enables the agents to coordinate online and adapt to changes more effectively. We created a simulated environment to mimic realistic mapping performed by autonomous vehicles with unknown minimum edge coverage and traffic conditions; our approach significantly outperforms traditional solvers both in terms of total cost and runtime. We also show that our model trained with only two agents on graphs with a maximum of 25 nodes can easily generalize to situations with more agents and/or nodes.

**Online Bayesian Moment Matching based SAT Solver Heuristics***Haonan Duan (University of Waterloo) · Saeed Nejati (University of Waterloo) · George Trimponias (Noah’s Ark Lab) · Pascal Poupart (University of Waterloo and Borealis AI) · Vijay Ganesh (University of Waterloo, Electrical and Computer Engineering)*

This work describes a Bayesian learning technique to initialize the search for a solution to constraint satisfaction problems formulated as Boolean satisfiability. The approach learns to satisfy most constraints in a fraction of a second, which speeds up the search for a solution in hard cryptographic problems and other industrial combinatorial problems

**Optimizing Long-term Social Welfare in Recommender Systems: A Constrained Matching Approach****Martin Mladenov (Google) · Elliot Creager (University of Toronto) · Omer Ben-Porat (Technion–Israel Institute of Technology) · Kevin Swersky (Google Brain) · Richard Zemel (Vector Institute) · Craig Boutilier (Google)*

Most recommender systems (RS) research assumes that a user’s utility can be maximized independently of the utility of the other agents (e.g., other users, content providers). In realistic settings, this is often not true—the dynamics of an RS ecosystem couple the long-term utility of all agents. In this work, we explore settings in which content providers cannot remain viable unless they receive a certain level of user engagement. We formulate the recommendation problem in this setting as one of equilibrium selection in the induced dynamical system, and show that it can be solved as an optimal constrained matching problem. Our model ensures the system reaches an equilibrium with maximal social welfare supported by a sufficiently diverse set of viable providers. We demonstrate that even in a simple, stylized dynamical RS model, the standard myopic approach to recommendation— always matching a user to the best provider— performs poorly. We develop several scalable techniques to solve the matching problem, and also draw connections to various notions of user regret and fairness, arguing that these outcomes are fairer in a utilitarian sense.

**Semi-Supervised StyleGAN for Disentanglement Learning***Weili Nie (Rice University) · Tero Karras (NVIDIA) · Animesh Garg (University of Toronto, Vector Institute, Nvidia) · Shoubhik Debnath (Nvidia) · Anjul Patney (Nvidia) · Ankit Patel (Rice University, Baylor College of Medicine) · Anima Anandkumar (Amazon AI & Caltech)*

Recent progress in generative models for images has yielded some impressive results. However, the generation is not controllable. Counterfactual generation — where we want to change only one feature of the input image — requires large feature level labels. This problem is called disentanglement in generative models. This means we need datasets with detailed labels such as glasses, beard, eye_color, smile — and this is just for faces. In other natural image domains, this label space can be even larger.This work aims to improve disentanglement without the use of fully labeled datasets. We can achieve a controllable counterfactual generation with only 1% labeled data as compared to other contemporary approaches that require full supervision. The results thus allow us to separate factors of variation in input dataset such as color, shape, facial features, and so on. And using this separability we can now modify the input to change any of these features selectively to imagine what would this person look like with glasses, or what would this room look like with different lighting.

**Stronger and Faster Wasserstein Adversarial Attacks***Kaiwen Wu (University of Waterloo) · Allen Wang (University of Waterloo) · Yaoliang Yu (University of Waterloo)*Deep models are surprisingly vulnerable to adversarial attacks, i.e. imperceptible perturbations that completely change the network’s output, raising serious security concerns on AI systems. We develop two new algorithms to perform faster and stronger adversarial attacks under the recent Wasserstein threat model. Moreover, we demonstrate that our attack algorithms, when combined with adversarial training, largely improve the robustness of deep models against adversarial attacks. We expect our tools to help practitioners better evaluate their model’s robustness and eventually build more robust AI systems.

**Tails of Lipschitz Triangular Flows**

*Priyank Jaini (University of Waterloo, Vector Institute) · Ivan Kobyzev (Borealis AI) · Yaoliang Yu (University of Waterloo) · Marcus Brubaker (Borealis AI)*

**The distribution of real objects (such as human faces or documents written in English) can be learned through “warping” a fixed prior distribution through multi-layered triangular transformations, after which new objects can be easily synthesized and incorporated into various learning algorithms and applications. We study how the tail (i.e., extreme objects) of a distribution is able or unable to change adequately through existing architectures in triangular flows. Consequently, we propose a tail-adaptive flow model that can adapt existing flow models better to heavy-tailed target distributions, such as the ones commonly used in modeling financial risks and extreme values.**

**Vector Faculty Affiliates**

**Black-box Certification and Learning under Adversarial Perturbations****Hassan Ashtiani (McMaster University) · Vinayak Pathak (Scotiabank) · Ruth Urner (York University)*

We formally study the problem of classification under adversarial perturbations, both from the learner’s perspective, and from the viewpoint of a third-party who aims at certifying the robustness of a given black-box classifier. We analyze a PAC-type framework of semi-supervised learning and identify possibility and impossibility results for proper learning of VC-classes in this setting. We further introduce and study a new setting of black-box certification under limited query budget. We analyze this for various classes of predictors and types of perturbation. We also consider the viewpoint of a black-box adversary that aims at finding adversarial examples, showing that the existence of an adversary with polynomial query complexity implies the existence of a robust learner with small sample complexity.**Faster Graph Embeddings via Coarsening****Matthew Fahrbach (Georgia Institute of Technology) · Gramoz Goranci (University of Toronto) · Sushant Sachdeva (University of Toronto) · Richard Peng (Georgia Tech / MSR Redmond) · Chi Wang (Microsoft Research)*

Graph embeddings are a ubiquitous tool for machine learning tasks, such as node classification and link prediction, on graph-structured data. However, computing the embeddings for large-scale graphs is prohibitively inefficient even if we are interested only in a small subset of relevant vertices. To address this, we present an efficient graph coarsening approach, based on Schur complements, for computing the embedding of the relevant vertices. We prove that these embeddings are preserved exactly by the Schur complement graph that is obtained via Gaussian elimination on the non-relevant vertices. As computing Schur complements is expensive, we give a nearly-linear time algorithm that generates a coarsened graph on the relevant vertices that provably matches the Schur complement in expectation in each iteration. Our experiments involving prediction tasks on graphs demonstrate that computing embeddings on the coarsened graph, rather than the entire graph, leads to significant time savings without sacrificing accuracy.

**Generalized and Scalable Optimal Sparse Decision Trees****Jimmy Lin (University of British Columbia) · Chudi Zhong (Duke University) · Diane Hu (Duke University) · Cynthia Rudin (Duke) · Margo Seltzer (University of British Columbia)*

Decision tree optimization is notoriously difficult from a computational perspective but essential for the field of interpretable machine learning. Despite efforts over the past 40 years, only recently have optimization breakthroughs been made that have allowed practical algorithms to find optimal decision trees. These new techniques have the potential to trigger a paradigm shift where it is possible to construct sparse decision trees to efficiently optimize a variety of objective functions without relying on greedy splitting and pruning heuristics that often lead to suboptimal solutions. The contribution in this work is to provide a general framework for decision tree optimization that addresses the two significant open problems in the area: treatment of imbalanced data and fully optimizing over continuous variables. We present techniques that produce optimal decision trees over a variety of objectives including F-score, AUC, and partial area under the ROC convex hull. We also introduce a scalable algorithm that produces provably optimal results in the presence of continuous variables and speeds up decision tree construction by several orders of magnitude relative to the state-of-the art.

**Private Query Release Assisted by Public Data****Raef Bassily (The Ohio State University) · Albert Cheu (Northeastern University) · Shay Moran (IAS, Princeton) · Aleksandar Nikolov (University of Toronto) · Jonathan Ullman (Northeastern University) · Steven Wu (University of Minnesota)*

We study the problem of differentially private query release assisted by access to public data. In this problem, the goal is to answer a large class of statistical queries with error no more than α using a combination of public and private samples. The algorithm is required to satisfy differential privacy only with respect to the private samples. We study the limits of this task in terms of the private and public sample complexities.

First, we show that we can solve the problem for any query class of finite VC-dimension using only d/α public samples and p‾√d3/2/α2 private samples, where d and p are the VC-dimension and dual VC-dimension of , respectively. In comparison, with only private samples, this problem cannot be solved even for simple query classes with VC-dimension one, and without any private samples, a larger public sample of size d/α2 is needed. Next, we give sample complexity lower bounds that exhibit tight dependence on p and α. For the class of decision stumps, we give a lower bound of p‾√/α on the private sample complexity whenever the public sample size is less than 1/α2. Given our upper bounds, this shows that the dependence on p‾√ is necessary in the private sample complexity. We also give a lower bound of 1/α on the public sample complexity for a broad family of query classes, which by our upper bound, is tight in α.

**Time-aware Large Kernel Convolutions****Vasileios Lioutas (Carleton University) · Yuhong Guo (Carleton University)*

To date, most state-of-the-art sequence modeling architectures use attention to build generative models for language based tasks. Some of these models use all the available sequence tokens to generate an attention distribution which results in time complexity of O(n2). Alternatively, they utilize depthwise convolutions with softmax normalized kernels of size k acting as a limited-window self-attention, resulting in time complexity of O(k⋅n). In this paper, we introduce Time-aware Large Kernel (TaLK) Convolutions, a novel adaptive convolution operation that learns to predict the size of a summation kernel instead of using a fixed-sized kernel matrix. This method yields a time complexity of O(n), effectively making the sequence encoding process linear to the number of tokens. We evaluate the proposed method on large-scale standard machine translation, abstractive summarization and language modeling datasets and show that TaLK Convolutions constitute an efficient improvement over other attention/convolution based approaches.