# Vector researchers win top honours at NeurIPS 2022

November 28, 2022

Research

By Ian Gormely

Two Vector papers have won top honours at the 2022 NeurIPS conference. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding,” co-authored by Vector Faculty Member David Fleet, was awarded an Outstanding Paper Award. Meanwhile, “ImageNet Classification with Deep Convolutional Neural Networks,” a 2012 paper co-authored by Vector’s Chief Scientific Advisor Geoffrey Hinton, won the Test of Time award. An additional five papers co-authored by Vector researchers were highlighted by the conference for their high quality.

In total, Vector Faculty, Faculty Affiliates, and Postdocs had 47 papers accepted to this year’s conference, with an additional eight papers accepted to five different workshops.

Fleet’s Outstanding Paper Award-winning paper presents a text-to-image diffusion model that produces an “unprecedented” degree of photorealism and a deep level of language understanding. Hinton’s now-classic paper shocked the computer vision community by almost halving the next-best error rate. It marked a major breakthrough in image recognition, and its influence is still seen today.

Collectively, the accepted papers co-authored by Vector researchers showcase the breadth of work being done in our research community. Among the accepted pieces are five papers co-authored by Vector Faculty Member Nicolas Papernot. A separate, new paper co-authored by both Hinton and Fleet, “A Unified Sequence Interface for Vision Tasks,” shows how a diverse set of “core” computer vision tasks can be unified if formulated in terms of a shared pixel-to-sequence interface.

Also accepted are two papers dealing with foundation models: large, general-purpose models trained on broad data at scale and later specialized for specific tasks. Vector recently identified this as an area of study to which we can apply our experience and expertise to help democratize these technologies. A further pair of papers involves AI models that have been taught to play text-based video games and Minecraft, respectively.

Below are abstracts and simplified summaries for many of the accepted papers and workshops from Vector Faculty Members.

You can read more about Vector’s work at past years’ conferences here (2021), here (2020), here (2019), and here (2018).

**Adaptively Exploiting d-Separators with Causal Bandits**

*Blair Bilodeau, Linbo Wang, Daniel M. Roy*

Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent “causal bandit” algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
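For readers unfamiliar with the setting, the sketch below shows a classical bandit baseline and how cumulative regret is measured. It uses UCB1 on Bernoulli arms, which is a standard textbook algorithm and not the adaptive causal algorithm from the paper; the function name `ucb1`, the arm means, and the horizon are illustrative assumptions.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return cumulative pseudo-regret and pull counts.

    arm_means are hypothetical success probabilities, unknown to the learner.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    totals = [0.0] * n_arms    # summed reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # pull every arm once to initialise the estimates
        else:
            # Pick the arm with the largest optimistic estimate (mean + bonus).
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - arm_means[arm]   # expected loss versus the best arm
    return regret, counts


regret, counts = ucb1([0.2, 0.8], horizon=2000)
```

This baseline ignores any additional observed variables. The abstract's point is that such side information can lower regret when it forms a d-separator, and that an adaptive algorithm should not need to be told whether it does.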

**Amortized Proximal Optimization**

*Juhan Bae, Paul Vicol, Jeff Z. HaoChen, Roger Grosse*

Many optimization algorithms used in machine learning can be viewed as approximations of a proximal point objective which trades off the loss on the current batch of training examples, the amount by which it changes predictions on other examples, and the distance moved in parameter space. We present a way to directly meta-learn optimizers which try to minimize this proximal objective in each step. The learned optimizers are competitive with existing second-order optimization methods for neural networks, but simpler to implement.

**BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing**

*Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel Cahyawijaya, Fabio Barth, Simon Ott, Matthias Samwald, Stephen Bach, Stella Biderman, Mario Sänger, Bo Wang, Alison Callahan, Daniel León Periñán, Théo Gigant, Patrick Haller, Jenny Chim, Jose David Posada, John Michael Giorgi, Karthik Rangasai Sivaraman, Marc Pàmies, Marianna Nezhurina, Robert Martin, Michael Cullan, Moritz Freidank, Nathan Dahlberg, Shubhanshu Mishra, Shamik Bose, Nicholas Michio Broad, Yanis Labrak, Shlok S Deshmukh, Sid Kiblawi, Ayush Singh, Minh Chien Vu, Trishala Neeraj, Jonas Golde, Albert Villanova del Moral, Benjamin Beilharz*

Training and evaluating language models increasingly requires the construction of meta-datasets, diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero shot language model evaluation. We discuss our process for task schema harmonization, data auditing, contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at this https URL.

**Breaking Bad: A Dataset for Geometric Fracture and Reassembly**

*Silvia Sellán, Yun-Chun Chen, Ziyi Wu, Animesh Garg, Alec Jacobson*

We introduce Breaking Bad, a large-scale dataset of fractured objects. Our dataset consists of over one million fractured objects simulated from ten thousand base models. The fracture simulation is powered by a recent physically based algorithm that efficiently generates a variety of fracture modes of an object. Existing shape assembly datasets decompose objects according to semantically meaningful parts, effectively modeling the construction process. In contrast, Breaking Bad models the destruction process of how a geometric object naturally breaks into fragments. Our dataset serves as a benchmark that enables the study of fractured object reassembly and presents new challenges for geometric shape understanding. We analyze our dataset with several geometry measurements and benchmark three state-of-the-art shape assembly deep learning methods under various settings. Extensive experimental results demonstrate the difficulty of our dataset, calling on future research in model designs specifically for the geometric shape assembly task. We host our dataset at this https URL.

**Dataset Distillation using Neural Feature Regression**

*Yongchao Zhou, Ehsan Nezhadarya, Jimmy Ba*

Getting the right data is one of the most critical and challenging parts of building powerful deep-learning systems. However, how can we obtain a higher-quality dataset so the model can learn more efficiently? One potential solution is dataset distillation, which aims to learn a small synthetic dataset that preserves most information from the original dataset. We propose an efficient learning algorithm, “FRePo,” which can distill a compact and informative synthetic dataset from a large, noisy dataset. The distilled dataset allows a model to achieve performance comparable to that of a model trained on the original dataset in only a fraction of the time.

Our paper formulates dataset distillation as a bi-level meta-learning problem. The outer loop optimizes the meta-dataset, and the inner loop trains a model on the distilled data. A key challenge in this formulation is meta-gradient computation, which can be costly in terms of time and memory. We tackle this challenge by efficiently approximating the inner loop optimization, resulting in state-of-the-art performance with a 100x decrease in training time and a 10x reduction in GPU memory compared to previous works. This improvement in training efficiency opens up a variety of uses for the distilled data, ranging from continual learning to neural architecture search. Furthermore, “synthetic data,” in the broader sense of artificial data produced by generative models, can help researchers understand how an otherwise opaque learning machine “sees” the world and potentially address the common concerns in machine learning regarding training data privacy.

**Dataset Inference for Self-Supervised Models**

*Adam Dziedzic, Haonan Duan, Muhammad Ahmad Kaleem, Nikita Dhawan, Jonas Guan, Yannis Cattan, Franziska Boenisch, Nicolas Papernot*

In model extraction attacks, adversaries can steal a machine learning model exposed via a public API by repeatedly querying it and adjusting their own model based on obtained outputs. We propose a new defense against stealing Self-Supervised Learning (SSL) encoders. Unlike traditional model extraction on supervised models that return labels or low-dimensional scores, SSL encoders output representations, which are of significantly higher dimensionality compared to the outputs from supervised models. Recently, ML-as-a-Service providers have commenced offering trained SSL encoders over inference APIs, which transform user inputs into useful representations for a fee. However, the high cost involved in training these models and their exposure to APIs both make black-box extraction a realistic security threat. We introduce a new dataset inference defense, which uses the private data points of the victim encoder as a signature to attribute its ownership in the event of stealing. The intuition is that an encoder’s output representations differ between the victim’s training data and the victim’s test data if the encoder is stolen from the victim, but not if the encoder is trained independently. As part of our evaluation, we also propose measuring the fidelity of stolen encoders and quantifying the effectiveness of the theft detection by leveraging mutual information and distance measurements.

**EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations**

*Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen*

With our partners from University of Bristol and Michigan, we introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked – where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, powered by the Toronto Annotation Suite (https://aidemos.cs.toronto.edu/toras/landing), for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning.

**Generalization Bounds for Stochastic Gradient Descent via Localized ε-Covers**

*Sejun Park, Umut Simsekli, Murat Erdogdu*

In this paper, we propose a new covering technique localized for the trajectories of SGD. This localization provides an algorithm-specific complexity measured by the covering number, which can have dimension-independent cardinality in contrast to standard uniform covering arguments that result in exponential dimension dependency. Based on this localized construction, we show that if the objective function is a finite perturbation of a piecewise strongly convex and smooth function with $P$ pieces, i.e. non-convex and non-smooth in general, the generalization error can be upper bounded by $O\left(\sqrt{(\log n \log(nP))/n}\right)$, where $n$ is the number of data samples. In particular, this rate is independent of dimension and does not require early stopping and decaying step size. Finally, we employ these results in various contexts and derive generalization bounds for multi-index linear models, multi-class support vector machines, and K-means clustering for both hard and soft label setups, improving the known state-of-the-art rates.

**GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images***

*Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, Sanja Fidler*

NVIDIA GET3D is a new AI model that is trained using only 2D images to generate a virtually unlimited number of 3D shapes with high-fidelity textures and complex geometric details. These 3D objects are created in the same format used by popular graphics software applications, allowing users to immediately import their shapes into 3D renderers and game engines for further editing. The generated objects could be used in 3D representations of buildings, outdoor spaces or entire cities, designed for industries including gaming, robotics, architecture and social media. See more at the NVIDIA blog and GET3D video.

**This paper was done by NVIDIA with the involvement of Vector researchers.*

**High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation**

*Jimmy Ba, Murat Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang*

We study the first gradient descent step on the first-layer parameters $W$ in a two-layer neural network: $f(x) = \frac{1}{\sqrt{N}} a^\top \sigma(W^\top x)$, where $W \in \mathbb{R}^{d \times N}$, $a \in \mathbb{R}^{N}$ are randomly initialized, and the training objective is the empirical MSE loss: $\frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2$. In the proportional asymptotic limit where $n, d, N \to \infty$ at the same rate, and an idealized student-teacher setting, we show that the first gradient update contains a rank-1 “spike”, which results in an alignment between the first-layer weights and the linear component of the teacher model $f^*$. To characterize the impact of this alignment, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on $W$ with learning rate $\eta$, when $f^*$ is a single-index model. We consider two scalings of the first step learning rate $\eta$. For small $\eta$, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model, but cannot defeat the best linear model on the input. Whereas for sufficiently large $\eta$, we prove that for certain $f^*$, the same ridge estimator on trained features can go beyond this “linear regime” and outperform a wide range of random features and rotationally invariant kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.

**If Influence Functions are the Answer, Then What is the Question?**

*Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, Roger Grosse*

Influence functions can efficiently estimate what happens to a model when a particular data point is removed from the training set. However, recent work has shown that these estimates are quite poor when applied to neural networks. In this work we decompose this discrepancy into 5 sources of error and investigate their contributions on various architectures and datasets. We find that influence functions are poor matches for actual retraining without a particular data point, but are good approximations to a different object we call the proximal Bregman response function (PBRF). The PBRF can be used to answer many of the original questions motivating influence functions and suggests that current algorithms for influence function estimation give more informative results than previous error analyses would suggest.

**Implications of Model Indeterminacy for Explanations of Automated Decisions**

**In Differential Privacy, There is Truth: on Vote-Histogram Leakage in Ensemble Private Learning**

*Jiaqi Wang, Roei Schuster, I Shumailov, David Lie, Nicolas Papernot*

The paper shows that PATE’s differentially-private mechanism, designed to preserve privacy of training data, actually causes leakage of sensitive internal-computation elements. This can be exploited by adversaries to infer sensitive information, such as an input instance’s membership in a minority group. This surprising result highlights the care that must be taken when using and reasoning about differential privacy to mitigate information leakage.
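As background, the sketch below illustrates the kind of noisy vote aggregation used in PATE (Private Aggregation of Teacher Ensembles): each teacher model votes for a label, noise is added to the vote histogram, and only the winning label is released. The function name `noisy_argmax`, the choice of Laplace noise, and the vote counts are assumptions for illustration, not details taken from the paper.

```python
import random
from collections import Counter


def noisy_argmax(teacher_votes, num_classes, noise_scale, rng):
    """Return the noisy plurality label together with the raw vote histogram.

    Only the label would be released; the histogram is the internal quantity.
    """
    counts = Counter(teacher_votes)
    histogram = [counts.get(c, 0) for c in range(num_classes)]
    # Laplace(0, noise_scale) noise, sampled as the difference of two
    # independent exponentials with mean noise_scale.
    noisy = [
        h + rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
        for h in histogram
    ]
    label = max(range(num_classes), key=lambda c: noisy[c])
    return label, histogram


rng = random.Random(0)
votes = [0] * 90 + [1] * 10          # hypothetical ensemble of 100 teachers
label, histogram = noisy_argmax(votes, num_classes=2, noise_scale=1.0, rng=rng)
```

Because the noise is centred on the true counts, the distribution of released labels depends on the underlying histogram. That dependence is the kind of internal quantity whose leakage the abstract describes.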

**Iterative Scene Graph Generation**

*Siddhesh Khandelwal, Leonid Sigal*

Scene graphs allow for broad understanding of objects and their interactions within a scene. These graphs are characterized by nodes representing objects, each with a spatial location and a class label, and the edges capturing the relationships between object pairs. Effectively generating such graphs, from either images or videos, has emerged as a core problem in computer vision. Due to the extremely large solution space, existing approaches to scene graph generation make certain simplifying assumptions. One such simplification, for example, is assuming that the relationships between object pairs have no bearing on their type/spatial location, which is untrue as the relation “wearing” strongly suggests one of the objects would be a “person”. In this work, we propose a novel framework for scene graph generation that addresses this limitation, and therefore allows the object pair and relationships to be jointly estimated and reasoned about. This is achieved via an iterative procedure where we first generate an initial estimate of the scene graph, and continually refine the detected objects and relationships by leveraging the interactions between them. We find our proposed iterative refinement procedure to outperform existing approaches on this task. In addition, in practice, some of the relations tend to occur much less frequently, which leads to biases during learning. We study this phenomenon and propose an approach that allows us to effectively improve performance on underrepresented relations for a minor decrease in performance on the dominant relations.
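To make the data structure concrete, here is a minimal sketch of a scene graph as described above: nodes carry a class label and a spatial location, and edges carry a relationship between a pair of objects. The class names, field names, and box coordinates are hypothetical and only illustrate the representation, not the paper's generation method.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    label: str    # class label, e.g. "person"
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    # Each relation is stored as (subject_index, predicate, object_index).
    relations: list = field(default_factory=list)

    def add_object(self, label, box):
        """Add a node and return its index."""
        self.objects.append(SceneObject(label, box))
        return len(self.objects) - 1

    def relate(self, subj, predicate, obj):
        """Add a directed edge between two existing nodes."""
        self.relations.append((subj, predicate, obj))

    def triples(self):
        """Yield (subject label, predicate, object label) triples."""
        for s, p, o in self.relations:
            yield (self.objects[s].label, p, self.objects[o].label)


graph = SceneGraph()
person = graph.add_object("person", (10, 20, 110, 300))
hat = graph.add_object("hat", (40, 5, 90, 40))
graph.relate(person, "wearing", hat)
```

The “wearing” edge here is the abstract's own example: knowing the predicate makes “person” a likely subject, which is the dependency that joint estimation can exploit.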

**Learning to Follow Instructions in Text-Based Games**

*Mathieu Tuli, Andrew Li, Pashootan Vaezipoor, Toryn Klassen, Scott Sanner, Sheila McIlraith*

Text-based games are virtual environments that are described in text and manipulated with text commands like “pick up sword” or “unlock gate”. Such games require language understanding and long-term memory, presenting a significant challenge for current AI systems. We observe that state-of-the-art reinforcement learning methods for text-based games are largely incapable of following instructions conveyed as natural language text, leading to low task completion rates. To address this, we translate these instructions into a formal (logical) language that supports task decomposition and progress monitoring. Experiments on 500+ games of the popular TextWorld domain showcase the benefits of our approach in following complex instructions. Beyond text-based games, our results are relevant to natural language instruction-following in a diversity of settings where an AI system must decide how to act over time.

**LION: Latent Point Diffusion Models for 3D Shape Generation***

**Logical Activation Functions: Logit-space equivalents of Probabilistic Boolean Operators**

*Scott C. Lowe, Robert Earle, Jason d’Eon, Thomas Trappenberg, Sageev Oore*

The choice of activation functions and their motivation is a long-standing issue within the neural network community. An individual biological neuron has a lot more complexity than an artificial neuron used in machine learning, and we asked whether we could incorporate a semblance of some of this functionality into artificial neurons, while still using simple abstractions that can be built at scale. Neuronal representations within artificial neural networks are commonly understood as “logits”, representing the likelihood that a feature is present in the stimulus in the form of a log-odds score. For example, an individual neuron within the network may indicate the likelihood of the presence of feathers, a beak, or a door handle, at a particular point within an image. These values are used by later components of the network to determine whether the image is of a duck, say. By considering individual neurons to represent logits, we derived new activation functions that can combine multiple inputs together, in a manner analogous to the dendritic tree of biological neurons. In particular, we derived logit-space operators equivalent to probabilistic Boolean logic-gates AND, OR, and XNOR for independent probabilities. We deployed these new activation functions, both in isolation and in conjunction, to demonstrate their effectiveness on a variety of tasks including image classification, transfer learning, abstract reasoning, and compositional zero-shot learning.
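The idea of a logit-space Boolean operator can be written down directly from the definitions: convert each logit to a probability, apply the probabilistic gate for independent events, and convert back to a logit. The sketch below does exactly that; the function names are illustrative, and the activation functions actually used in the paper may be different (for example, cheaper approximations of these exact forms).

```python
import math


def sigmoid(x):
    """Map a logit (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Map a probability in (0, 1) to a logit."""
    return math.log(p / (1.0 - p))


def and_logit(x, y):
    """Logit of P(A and B) for independent A, B with logits x, y."""
    return logit(sigmoid(x) * sigmoid(y))


def or_logit(x, y):
    """Logit of P(A or B) for independent A, B with logits x, y."""
    return logit(1.0 - (1.0 - sigmoid(x)) * (1.0 - sigmoid(y)))


def xnor_logit(x, y):
    """Logit of P(A and B agree) for independent A, B with logits x, y."""
    p, q = sigmoid(x), sigmoid(y)
    return logit(p * q + (1.0 - p) * (1.0 - q))
```

For two maximally uncertain inputs (`x = y = 0`, i.e. probability 0.5 each), AND gives the logit of 0.25, OR gives the logit of 0.75, and XNOR gives 0. This naive version is not numerically safe for logits of very large magnitude, where the intermediate probability rounds to 0 or 1.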

**MoCoDA: Model-based Counterfactual Data Augmentation**

*Silviu Pitis, Elliot Creager, Ajay Mandlekar, Animesh Garg*

The number of states in a dynamic process is exponential in the number of objects, making reinforcement learning (RL) difficult in complex, multi-object domains. For agents to scale to the real world, they will need to react to and reason about unseen combinations of objects. We argue that the ability to recognize and use local factorization in transition dynamics is a key element in unlocking the power of multi-object reasoning. To this end, we show that (1) known local structure in the environment transitions is sufficient for an exponential reduction in the sample complexity of training a dynamics model, and (2) a locally factored dynamics model provably generalizes out-of-distribution to unseen states and actions. Knowing the local structure also allows us to predict which unseen states and actions this dynamics model will generalize to. We propose to leverage these observations in a novel Model-based Counterfactual Data Augmentation (MoCoDA) framework. MoCoDA applies a learned locally factored dynamics model to an augmented distribution of states and actions to generate counterfactual transitions for RL. MoCoDA works with a broader set of local structures than prior work and allows for direct control over the augmented training distribution. We show that MoCoDA enables RL agents to learn policies that generalize to unseen states and actions. We use MoCoDA to train an offline RL agent to solve an out-of-distribution robotics manipulation task on which standard offline RL algorithms fail.

**The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization**

*Mufan Bill Li, Mihai Nica, Daniel M. Roy*

The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.

**Operator Splitting Value Iteration**

*Amin Rakhsha, Andrew Wang, Mohammad Ghavamzadeh, Amir-massoud Farahmand*

Consider a planning problem for a discounted MDP. Suppose that we have access to an approximate model that is cheap to use in addition to the true dynamics which is expensive to access. For example, the model might be a lower-fidelity, but fast, simulator, and the true dynamics might be a high-fidelity, but slow, simulator. Or in the context of model-based reinforcement learning (MBRL), we have access to a learned model, from which samples can be cheaply acquired, while we can only acquire expensive samples from the unknown true dynamics of the real-world system. Can we use this approximate model to accelerate the computation of the value function? This paper proposes an algorithm called Operator Splitting Value Iteration (OS-VI) that benefits from the approximate model to potentially accelerate the convergence of the value function sequence to the value function with respect to the true dynamics. OS-VI is able to utilize the approximate model without introducing any error to the computed value function. It achieves a much faster convergence rate when the model is accurate enough, which results in fewer queries to the true dynamics. This leads to better computational cost in simulated environments, and potentially better sample-complexity in real-world problems.
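One way to see how an approximate model can speed up convergence without biasing the result is the following sketch for policy evaluation (a fixed policy, so the dynamics reduce to a transition matrix). It assumes the splitting update $V_{k+1} = (I - \gamma \hat{P})^{-1}\,(r + \gamma (P - \hat{P}) V_k)$, whose fixed point is the true value function $V = r + \gamma P V$ for any $\hat{P}$. The function names and the two-state example are illustrative assumptions, and the algorithm in the paper may differ in its details.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def solve_with_model(p_hat, b, gamma, sweeps=500):
    """Approximately solve (I - gamma * P_hat) x = b using only the cheap model."""
    x = list(b)
    for _ in range(sweeps):
        x = [bi + gamma * pi for bi, pi in zip(b, matvec(p_hat, x))]
    return x


def os_vi_policy_evaluation(p_true, p_hat, r, gamma, iterations):
    """Splitting-style iteration: one query to the true dynamics per iteration."""
    v = [0.0] * len(r)
    for _ in range(iterations):
        pv_true = matvec(p_true, v)   # the expensive query
        pv_hat = matvec(p_hat, v)     # cheap
        b = [ri + gamma * (t - h) for ri, t, h in zip(r, pv_true, pv_hat)]
        v = solve_with_model(p_hat, b, gamma)
    return v


def value_iteration(p_true, r, gamma, iterations):
    """Standard iteration V <- r + gamma * P V, for comparison."""
    v = [0.0] * len(r)
    for _ in range(iterations):
        v = [ri + gamma * pv for ri, pv in zip(r, matvec(p_true, v))]
    return v


# Hypothetical two-state chain with a slightly wrong approximate model.
p_true = [[0.9, 0.1], [0.2, 0.8]]
p_hat = [[0.85, 0.15], [0.25, 0.75]]
r = [1.0, 0.0]
v_split = os_vi_policy_evaluation(p_true, p_hat, r, gamma=0.9, iterations=20)
v_plain = value_iteration(p_true, r, gamma=0.9, iterations=20)
```

With the same 20 queries to the true dynamics, the splitting iterate is much closer to the true value function than plain value iteration in this example, even though `p_hat` is not exact.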

**On Learning and Refutation in Noninteractive Local Differential Privacy**

*Alexander Edmonds, Aleksandar Nikolov, Toniann Pitassi*

We study two basic statistical tasks in non-interactive local differential privacy (LDP): learning and refutation. Learning requires finding a concept that best fits an unknown target function (from labelled samples drawn from a distribution), whereas refutation requires distinguishing between data distributions that are well-correlated with some concept in the class, versus distributions where the labels are random. Our main result is a complete characterization of the sample complexity of agnostic PAC learning for non-interactive LDP protocols. We show that the optimal sample complexity for any concept class is captured by the approximate $\gamma_2$ norm of a natural matrix associated with the class. Combined with previous work [Edmonds, Nikolov and Ullman, 2019] this gives an equivalence between learning and refutation in the agnostic setting.

**On the Limitations of Stochastic Pre-processing Defenses**

*Yue Gao, I Shumailov, Kassem Fawaz, Nicolas Papernot*

Defending against adversarial examples remains an open problem. A common belief is that randomness at inference increases the cost of finding adversarial inputs. In this paper we investigate stochastic pre-processing defenses and discover their theoretical and practical limitations. We explain why they are not supposed to make your models more robust to adversarial examples and are vulnerable even against standard non-stochastic attacks.

**Optimality and Stability in Non-Convex Smooth Games**

*Guojun Zhang, Pascal Poupart, Yaoliang Yu*

Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years have seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases. This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games.
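A small, well-known toy case shows how gradient descent ascent (GDA) can fail near a solution. The sketch below runs simultaneous GDA on the bilinear game $f(x, y) = xy$, whose only saddle point is the origin. This example is a standard illustration chosen here as an assumption; it is not taken from the paper, and the game is degenerate in the sense that both second derivatives $\partial^2 f / \partial x^2$ and $\partial^2 f / \partial y^2$ vanish.

```python
import math


def simultaneous_gda(x, y, step, iterations):
    """Descend in x and ascend in y on f(x, y) = x * y, updating both at once."""
    for _ in range(iterations):
        grad_x, grad_y = y, x          # df/dx = y, df/dy = x
        x, y = x - step * grad_x, y + step * grad_y
    return x, y


start = (1.0, 1.0)
x, y = simultaneous_gda(*start, step=0.1, iterations=100)
distance_before = math.hypot(*start)
distance_after = math.hypot(x, y)
```

Each update multiplies the distance to the origin by $\sqrt{1 + \text{step}^2}$, so the iterates spiral away from the saddle point for any positive step size instead of converging to it.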

**Optimizing Data Collection for Machine Learning**

*Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc Law*

Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.

**Partial Identification of Treatment Effects with Implicit Generative Models**

*Vahid Balazadeh Meresht, Vasilis Syrgkanis, Rahul G Krishnan*

Our work proposes a new algorithm for bounding the causal effects of interventions using observational data. This is known as the problem of partial identification. We propose a new method for the partial identification of average treatment effects (ATEs) in general causal graphs using deep generative models. Our method can bound effects in graphs comprising both continuous and discrete random variables. The strategy we adopt uses the uniform average treatment derivative (UATD), the partial derivatives of response functions, to create a regular approximation to the ATE. We prove that our algorithm converges to tight bounds on ATE in linear structural causal models (SCMs). For nonlinear SCMs, we empirically show that using UATD leads to tighter and more stable bounds than methods that directly optimize the ATE.

**Path Independent Equilibrium Networks Can Better Exploit Test-Time Computation**

*Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, J. Zico Kolter, Roger Grosse*

We investigate the ability of neural networks to make use of additional computational resources to perform well on harder problem instances than they were trained on. We identify a property of some trained networks which seems to correlate highly with their generalization performance: path independence, or the degree to which the forward pass of the network converges to the same point regardless of the initialization.

**Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding**

*Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David Fleet, Mohammad Norouzi
*We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.

**The Privacy Onion Effect: Memorization is Relative**

*Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramer
*Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the “layer” of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack. We perform several experiments to study this effect, and understand why it occurs. The existence of this effect has various consequences. For example, it suggests that proposals to defend against memorization without training with rigorous privacy guarantees are unlikely to be effective. Further, it suggests that privacy-enhancing technologies such as machine unlearning could actually harm the privacy of other users.

**Proximal Learning With Opponent-Learning Awareness**

*Stephen Zhao, Chris Lu, Roger Grosse, Jakob Foerster
*Learning With Opponent-Learning Awareness (LOLA) (Foerster et al. [2018a]) is a multi-agent reinforcement learning algorithm that typically learns reciprocity-based cooperation in partially competitive environments. However, LOLA often fails to learn such behaviour on more complex policy spaces parameterized by neural networks, partly because the update rule is sensitive to the policy parameterization. This problem is especially pronounced in the opponent modeling setting, where the opponent’s policy is unknown and must be inferred from observations; in such settings, LOLA is ill-specified because behaviorally equivalent opponent policies can result in non-equivalent updates. To address this shortcoming, we reinterpret LOLA as approximating a proximal operator, and then derive a new algorithm, proximal LOLA (POLA), which uses the proximal formulation directly. Unlike LOLA, the POLA updates are parameterization invariant, in the sense that when the proximal objective has a unique optimum, behaviorally equivalent policies result in behaviorally equivalent updates. We then present practical approximations to the ideal POLA update, which we evaluate in several partially competitive environments with function approximation and opponent modeling. This empirically demonstrates that POLA achieves reciprocity-based cooperation more reliably than LOLA.

**Reconsidering Deep Ensembles**

*Taiga Abe, Estefany Kelly Buchanan, Geoff Pleiss, Richard Zemel, John Cunningham
*Ensembling neural networks is an effective way to increase accuracy, and can often match the performance of individual larger models. This observation poses a natural question: given the choice between a deep ensemble and a single neural network with similar accuracy, is one preferable over the other? Recent work suggests that deep ensembles may offer distinct benefits beyond predictive power: namely, uncertainty quantification and robustness to dataset shift. In this work, we demonstrate limitations to these purported benefits, and show that a single (but larger) neural network can replicate these qualities. First, we show that ensemble diversity, by any metric, does not meaningfully contribute to an ensemble’s ability to detect out-of-distribution (OOD) data, but is instead highly correlated with the relative improvement of a single larger model. Second, we show that the OOD performance afforded by ensembles is strongly determined by their in-distribution (InD) performance, and – in this sense – is not indicative of any “effective robustness.” While deep ensembles are a practical way to achieve improvements to predictive power, uncertainty quantification, and robustness, our results show that these improvements can be replicated by a (larger) single model.

**Residual Multiplicative Filter Networks for Multiscale Reconstruction**

*Shayan Shekarforoush, David Lindell, Marcus Brubaker, David Fleet
*Coordinate networks like Multiplicative Filter Networks (MFNs) and BACON offer some control over the frequency spectrum used to represent continuous signals such as images or 3D volumes. Yet, they are not readily applicable to problems for which coarse-to-fine estimation is required, including various inverse problems in which coarse-to-fine optimization plays a key role in avoiding poor local minima. We introduce a new coordinate network architecture and training scheme that enables coarse-to-fine optimization with fine-grained control over the frequency support of learned reconstructions. This is achieved with two key innovations. First, we incorporate skip connections so that structure at one scale is preserved when fitting finer-scale structure. Second, we propose a novel initialization scheme to provide control over the model frequency spectrum at each stage of optimization. We demonstrate how these modifications enable multiscale optimization for coarse-to-fine fitting to natural images. We then evaluate our model on synthetically generated datasets for the problem of single-particle cryo-EM reconstruction. We learn high resolution multiscale structures, on par with the state-of-the-art.

**SMPL: Simulated Industrial Manufacturing and Process Control Learning Environments**

*Mohan Zhang, Xiaozhou Wang, Benjamin Decardi-Nelson, Bo Song, An Zhang, Jinfeng Liu, Sile Tao, Jiayi Cheng, Xiaohong Liu, Dengdeng Yu, Matthew Poon, Animesh Garg
*Traditional biological and pharmaceutical manufacturing plants are controlled by human workers or pre-defined thresholds. Modernized factories have advanced process control algorithms such as model predictive control (MPC). However, there is little exploration of applying deep reinforcement learning to control manufacturing plants. One of the reasons is the lack of high fidelity simulations and standard APIs for benchmarking. To bridge this gap, we develop an easy-to-use library that includes five high-fidelity simulation environments: BeerFMTEnv, ReactorEnv, AtropineEnv, PenSimEnv and mAbEnv, which cover a wide range of manufacturing processes. We build these environments on published dynamics models. Furthermore, we benchmark online and offline, model-based and model-free reinforcement learning algorithms to provide points of comparison for follow-up research.

**Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction**

*Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, Gennady Pekhimenko
*Transformer-based models have become the dominant model being applied to a variety of different tasks, including question answering, paraphrasing, and now even image processing. However, training them to be effective can be quite expensive, with the cost reaching into the millions of dollars for more recent models. On top of this there is the cost of carbon footprint and time. Our work can reduce these costs by making optimizations to Transformer models that effectively allow training on more data at a time, ultimately decreasing the time required to train the models, saving money and energy. Our results show up to 26% improvement in the number of samples per second that can be processed for popular models due to an up to 2X increase in batch size.

**Uncertainty-Aware Reinforcement Learning for Risk-Sensitive Player Evaluation in Sports Game**

*Guiliang Liu, Yudong Luo, Oliver Schulte, Pascal Poupart
*A major task of sports analytics is player evaluation. Previous methods commonly measured the impact of players’ actions on desirable outcomes (e.g., goals or winning) without considering the risk induced by stochastic game dynamics. In this paper, we design an uncertainty-aware Reinforcement Learning (RL) framework to learn a risk-sensitive player evaluation metric from stochastic game dynamics. To embed the risk of a player’s movements into the distribution of action-values, we model their 1) aleatoric uncertainty, which represents the intrinsic stochasticity in a sports game, and 2) epistemic uncertainty, which is due to a model’s insufficient knowledge regarding Out-of-Distribution (OoD) samples. We demonstrate how a distributional Bellman operator and a feature-space density model can capture these uncertainties. Based on such uncertainty estimation, we propose a Risk-sensitive Game Impact Metric (RiGIM) that measures players’ performance over a season by conditioning on a specific confidence level. Empirical evaluation, based on over 9M play-by-play ice hockey and soccer events, shows that RiGIM correlates highly with standard success measures and has a consistent risk sensitivity.

**A Unified Sequence Interface for Vision Tasks**

*Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David Fleet, Geoffrey E Hinton
*While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of “core” computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
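
To make "formulating the output of each task as a sequence of discrete tokens" more concrete, here is a minimal sketch of how bounding boxes might be flattened into tokens behind a task prompt. The bin count, the vocabulary layout, the prompt token id, and the coordinate ordering are assumptions for illustration only, not the paper's exact scheme.

```python
N_BINS = 1000                    # assumed number of coordinate bins
CLASS_OFFSET = N_BINS            # assumed: class tokens follow the coordinate tokens
TASK_PROMPTS = {"detect": 2000}  # hypothetical task-prompt token id

def quantize(value, size, n_bins=N_BINS):
    """Map a continuous coordinate in [0, size] to an integer bin in [0, n_bins - 1]."""
    bin_id = int(value * n_bins / size)
    return min(max(bin_id, 0), n_bins - 1)

def dequantize(token, size, n_bins=N_BINS):
    """Map a bin index back to the centre of its coordinate interval."""
    return (token + 0.5) / n_bins * size

def encode_boxes(boxes, width, height):
    """Turn (ymin, xmin, ymax, xmax, class_id) boxes into one flat token list."""
    tokens = [TASK_PROMPTS["detect"]]
    for ymin, xmin, ymax, xmax, class_id in boxes:
        tokens += [
            quantize(ymin, height), quantize(xmin, width),
            quantize(ymax, height), quantize(xmax, width),
            CLASS_OFFSET + class_id,
        ]
    return tokens

boxes = [(48.0, 64.0, 240.0, 320.0, 3)]
tokens = encode_boxes(boxes, width=640, height=480)
print(tokens)
```

Because every task's output becomes a list of integers from one vocabulary, the same sequence model and the same token-level loss can in principle be reused across tasks, with only the prompt changing.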

**Washing The Unwashable : On The (Im)possibility of Fairwashing Detection**

*Ali Shahin Shamsabadi, Mohammad Yaghini, Natalie Dullerud, Sierra Wyllie, Ulrich Aïvodji, Aisha Alaagib, Sébastien Gambs, Nicolas Papernot
*Fairwashing is a new threat model where companies abuse the requirement for explainability of their black-box models to hide their potential unfairness and evade its legal consequences. In this paper, we show that using an interpretable model for explaining a black-box model introduces a risk for fairwashing. We theoretically characterize and analyze fairwashing, proving that this phenomenon is difficult to avoid due to an irreducible factor—the unfairness of the black-box model. Based on the theory developed, we propose a novel technique, called FRAUD-Detect (FaiRness AUDit Detection), to detect fairwashed models by measuring a divergence over subpopulation-wise fidelity measures of the interpretable model. We explore ways an adaptive adversary (dishonest company informed of the algorithm) may attempt to evade FRAUD-Detect. Our empirical results show that evading our detector comes at the cost of a significant increase in subpopulation gap, negating fairwashing.

**Video Diffusion Models**

*Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David Fleet
*We present results on video generation using diffusion models. We propose an architecture for video diffusion models which is a natural extension of the standard image architecture. We show that this architecture is effective for jointly training from image and video data. To generate long and higher resolution videos we introduce a new conditioning technique that performs better than previously proposed methods. We present results on text-conditioned video generation and state-of-the-art results on an unconditional video generation benchmark.

**Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos**

*Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, Jeff Clune
*We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Our model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.

**You Can’t Count on Luck: Why Decision Transformers Fail in Stochastic Environments**

*Keiran Paster, Sheila McIlraith, Jimmy Ba
*A recent trend in deep reinforcement learning (RL) has been to treat RL as a supervised prediction problem, where the next action the agent takes is decided probabilistically by selecting the most likely action given some future outcome (e.g., the agent gathers a high amount of rewards). However, in stochastic environments where rewards are affected by randomness, this framework is biased. In this work, we describe the theoretical conditions under which these methods fail and propose a new algorithm that enables RL via supervised learning algorithms such as Decision Transformer to perform optimally even in highly stochastic environments. This paves the way towards a unified approach for prediction, sequence modeling, and optimal decision making.
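
The bias described above can be seen in a toy one-step problem. The sketch below is an illustrative assumption, not the paper's algorithm or experiments: it computes the action distribution obtained by conditioning on the best return seen in the offline data and compares it with the action that maximizes expected return.

```python
# Each action maps to a list of (probability, reward) outcomes.
ENV = {
    "safe":   [(1.0, 1.0)],                 # always pays 1
    "gamble": [(0.05, 10.0), (0.95, 0.0)],  # rarely pays 10, usually 0
}
BEHAVIOUR_POLICY = {"safe": 0.5, "gamble": 0.5}  # assumed offline data collection policy

def expected_return(action):
    """Average reward of an action under the environment's randomness."""
    return sum(p * r for p, r in ENV[action])

def return_conditioned_policy(target_return):
    """P(action | return == target_return) under the offline data distribution."""
    joint = {
        a: BEHAVIOUR_POLICY[a] * sum(p for p, r in ENV[a] if r == target_return)
        for a in ENV
    }
    total = sum(joint.values())
    return {a: w / total for a, w in joint.items()}

# Condition on the highest return that ever appears in the data.
best_seen = max(r for outcomes in ENV.values() for _, r in outcomes)
policy = return_conditioned_policy(best_seen)
chosen = max(policy, key=policy.get)
optimal = max(ENV, key=expected_return)

print(f"conditioned on return {best_seen}: picks '{chosen}' "
      f"(expected return {expected_return(chosen):.2f})")
print(f"expected-return optimum: '{optimal}' "
      f"(expected return {expected_return(optimal):.2f})")
```

Conditioning on a lucky outcome selects the action that made the luck possible, even though that action is worse on average.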

**NeurIPS 2022 Workshops co-organized by Vector Faculty Members**

The Symbiosis of Deep Learning and Differential Equations II – Animesh Garg and David Duvenaud

Learning from Time Series for Health – Anna Goldenberg and Marzyeh Ghassemi

Robustness in Sequence Modeling – Marzyeh Ghassemi

Second Workshop on Efficient Natural Language and Speech Processing (ENLSP-II): The Future of Pre-trained Models – Pascal Poupart

AI for Accelerated Materials Design (AI4Mat) – Alán Aspuru-Guzik