Vector researchers present 80 groundbreaking AI papers at NeurIPS 2025

Researchers from Vector’s vibrant community are presenting groundbreaking work across the full spectrum of artificial intelligence at this year’s Conference on Neural Information Processing Systems (NeurIPS), taking place December 2-7 in San Diego and November 30-December 5 in Mexico City. The conference stands as the world’s premier venue for neural information processing research, bringing together the global community working on the theoretical foundations and practical applications that shape the future of AI.

The research contributions from Vector Faculty Members, Faculty Affiliates, and Distinguished Postdoctoral Fellows at NeurIPS 2025 demonstrate the depth and breadth of innovation emerging from our remarkable research ecosystem. Their accepted work spans critical domains – from next-generation foundation models and diffusion-based generative systems to reinforcement learning breakthroughs and privacy-preserving federated approaches – reflecting a shared commitment to advancing both the fundamental science of machine learning and the development of trustworthy AI systems that address real-world challenges.

Below you will find 80 accepted papers, including collaborations, from Vector Faculty Members, Vector Faculty Affiliates, and Vector Distinguished Postdoctoral Fellows.

ActiveVOI: Value of Information Guided Active Knowledge Acquisition for Open-World Embodied Lifted Regression Planning

Xiaotian Liu, Ali Pesaranghader, Jaehong Kim, Tanmana Sadhu, Hyejeong Jeon, Scott Sanner (Vector Faculty Affiliate)

Abstract

The ability to actively acquire information is essential for open-world planning under partial observability and incomplete knowledge. Existing embodied AI systems typically rely on passive strategies that exhaustively collect object and relational information. However, such passive knowledge acquisition becomes impractical in visually complex domains. For instance, a typical household may contain hundreds of objects, each with its own configuration. Therefore, open-world agents must be able to actively identify which objects are relevant to the task at hand. In this work, we introduce ActiveVOI, a novel zero-shot framework for open-world embodied planning that emphasizes object-centric active knowledge acquisition. ActiveVOI leverages Lifted Regression to generate compact subgoal descriptions that identify task-relevant objects. It also provides a principled approach to quantify the utility of sensing objects using the theory of Value of Information (VOI), guided by commonsense knowledge from large language and vision-language models (LLMs/VLMs). ActiveVOI is evaluated on the visual ALFWorld benchmark, showing substantial improvements over existing LLM- and VLM-based planning methods, and notably even outperforming VLMs that are fine-tuned on ALFWorld data. This work establishes a principled foundation for building embodied agents that actively and efficiently acquire knowledge to plan in open-world environments.

TLDR: We introduce the ActiveVOI framework for active knowledge acquisition to identify, quantify, and prioritize task-relevant information for open-world embodied planning.
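
As a rough illustration of the Value of Information idea referenced in the abstract, the sketch below computes the classical decision-theoretic VOI for a small discrete problem: the expected best utility after observing, minus the best expected utility without observing. The function name and the `prior`, `likelihood`, and `utility` tables are hypothetical; the paper estimates such quantities with commonsense knowledge from LLMs/VLMs, which this toy example does not attempt to model.

```python
def value_of_information(prior, likelihood, utility):
    """Expected gain in best achievable utility from observing before acting.

    prior[s]:          P(state = s)
    likelihood[s][o]:  P(observation = o | state = s)
    utility[a][s]:     utility of taking action a in state s
    (All three tables are illustrative assumptions, not from the paper.)
    """
    states = range(len(prior))
    observations = range(len(likelihood[0]))

    def best_expected_utility(belief):
        # Utility of the best action under the given belief over states.
        return max(sum(belief[s] * row[s] for s in states) for row in utility)

    without_sensing = best_expected_utility(prior)
    with_sensing = 0.0
    for o in observations:
        p_o = sum(prior[s] * likelihood[s][o] for s in states)
        if p_o == 0.0:
            continue  # this observation never occurs
        # Bayes update of the belief after seeing observation o.
        posterior = [prior[s] * likelihood[s][o] / p_o for s in states]
        with_sensing += p_o * best_expected_utility(posterior)
    return with_sensing - without_sensing
```

With a perfectly informative sensor the VOI is large, and with an uninformative sensor it drops to zero, which is the property that lets an agent rank which objects are worth sensing.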

Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning

Wenchang Duan, Yaoliang Yu (Vector Faculty Member), Jiwan He, Yi Shi

Abstract

Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance on challenging tasks, such as those involving long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on a large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).

TLDR: Large fixed context length limits exploration and introduces redundancy in MARL. We propose an adaptive context length optimization method with Fourier-based low-frequency truncation to improve long-term decision-making.
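
To give a feel for what Fourier-based low-frequency truncation does to a temporal signal, the sketch below transforms a one-dimensional sequence, zeroes every frequency bin except the lowest few, and inverts the transform. The function name `low_frequency_truncate` and the `keep` parameter are hypothetical, and a naive DFT is used for self-containment; how the paper builds the central agent's input from such a representation is not shown here.

```python
import cmath
import math


def low_frequency_truncate(signal, keep):
    """Keep only the `keep` lowest frequencies of a real-valued signal.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(signal)
    # Naive O(n^2) discrete Fourier transform.
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    # For a real signal, bins k and n - k describe the same frequency,
    # so bin k is retained when min(k, n - k) < keep.
    truncated = [c if min(k, n - k) < keep else 0 for k, c in enumerate(spectrum)]
    # Inverse transform; the imaginary part is numerical noise for real inputs.
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(truncated)) / n).real
        for t in range(n)
    ]


# A slow trend plus a fast oscillation; truncation recovers the slow trend.
n = 16
slow = [math.cos(2 * math.pi * t / n) for t in range(n)]
noisy = [s + 0.5 * math.cos(2 * math.pi * 6 * t / n) for t, s in enumerate(slow)]
smooth = low_frequency_truncate(noisy, keep=2)
```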

Align Your Flow: Scaling Continuous-Time Flow Map Distillation

Amirmojtaba Sabour, Sanja Fidler (Vector Faculty Member), Karsten Kreis

Abstract

Diffusion- and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades when increasing the number of steps, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step and remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques, generalizing existing consistency and flow matching objectives. We further demonstrate that autoguidance can improve performance, using a low-quality model for guidance during distillation, and an additional boost can be achieved by adversarial finetuning, with minimal loss in sample diversity. We extensively validate our flow map models, called *Align Your Flow*, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64×64 and 512×512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.

TLDR: We develop flow map methods for state-of-the-art few-step generation, generalizing flow, diffusion, and consistency models.

Asymmetric Duos: Sidekicks Improve Uncertainty

Spotlight paper

Tim G. Zhou, Evan Shelhamer (Vector Faculty Member), Geoff Pleiss (Vector Faculty Member)

Abstract

The go-to strategy to apply deep networks in settings where uncertainty informs decisions (ensembling multiple training runs with random initializations) is ill-suited for the extremely large-scale models and practical fine-tuning workflows of today. We introduce a new cost-effective strategy for improving the uncertainty quantification and downstream decisions of a large model (e.g. a fine-tuned ViT-B): coupling it with a less accurate but much smaller “sidekick” (e.g. a fine-tuned ResNet-34) with a fraction of the computational cost. We propose aggregating the predictions of this *Asymmetric Duo* by simple learned weighted averaging. Surprisingly, despite their inherent asymmetry, the sidekick model almost never harms the performance of the larger model. In fact, across five image classification benchmarks, and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only ≈10-20% more computation.
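
One way to picture "learned weighted averaging" of a large model and a small sidekick is sketched below. It makes two assumptions that are not confirmed by the abstract: the averaging happens in logit space, and the two weights are chosen by a small grid search on validation negative log-likelihood rather than by gradient-based learning. All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def duo_predict(large_logits, small_logits, w_large, w_small):
    """Class probabilities from a weighted sum of the two models' logits."""
    combined = [w_large * a + w_small * b for a, b in zip(large_logits, small_logits)]
    return softmax(combined)


def fit_weights(val_large, val_small, labels, grid):
    """Pick the (w_large, w_small) pair from the grid with the lowest validation NLL.

    Grid search is a stand-in chosen for brevity, not the paper's procedure.
    """
    def nll(w_l, w_s):
        return -sum(
            math.log(duo_predict(a, b, w_l, w_s)[y])
            for a, b, y in zip(val_large, val_small, labels)
        )

    return min(((w_l, w_s) for w_l in grid for w_s in grid), key=lambda w: nll(*w))
```

Setting `w_small` to zero recovers the large model alone, so a fitted pair can only match or improve the large model's validation NLL within the searched grid.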

Attention Sinks: A ‘Catch, Tag, and Release’ Mechanism for Embeddings

Stephen Zhang, Mustafa Khan, Vardan Papyan (Vector Faculty Member)

Abstract

Large language models (LLMs) often concentrate their attention on a few specific tokens referred to as *attention sinks*. Common examples include the first token, a prompt-independent sink, and punctuation tokens, which are prompt-dependent. While the tokens causing the sinks often lack direct semantic meaning, the presence of the sinks is critical for model performance, particularly under model compression and KV-caching. Despite their ubiquity, the function, semantic role, and origin of attention sinks—especially those beyond the first token—remain poorly understood. In this work, we conduct a comprehensive investigation demonstrating that attention sinks: *catch* a sequence of tokens, *tag* them using a common direction in embedding space, and *release* them back into the residual stream, where tokens are later retrieved based on the tags they have acquired. Probing experiments reveal these tags carry semantically meaningful information, such as the truth of a statement. These findings extend to reasoning models, where the mechanism spans more heads and explains greater variance in embeddings, or recent models with query-key normalization, where sinks remain just as prevalent. To encourage future theoretical analysis, we introduce a minimal problem which can be solved through the ‘catch, tag, release’ mechanism, and where it emerges through training.

Better Training Data Attribution via Better Inverse Hessian-Vector Products

Andrew Wang, Elisa Nguyen, Runshi Yang, Juhan Bae, Sheila McIlraith (Vector Faculty Member), Roger Grosse (Vector Faculty Member)

Abstract

Training data attribution (TDA) provides insights into which training data is responsible for a learned model behavior. Gradient-based TDA methods such as influence functions and unrolled differentiation both involve a computation that resembles an inverse Hessian-vector product (iHVP), which is difficult to approximate efficiently. We introduce an algorithm (ASTRA) which uses the EKFAC-preconditioner on Neumann series iterations to arrive at an accurate iHVP approximation for TDA. ASTRA is easy to tune, requires fewer iterations than Neumann series iterations, and is more accurate than EKFAC-based approximations. Using ASTRA, we show that improving the accuracy of the iHVP approximation can significantly improve TDA performance.

TLDR: We apply the EKFAC-preconditioner on Neumann series iterations to arrive at an unbiased iHVP approximation for TDA that improves influence function and unrolled differentiation performance.
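
The core numerical idea, a preconditioned Neumann-style iteration for an inverse Hessian-vector product, can be sketched on a tiny dense problem. The iteration is x ← x + P⁻¹(v − Hx), which converges to H⁻¹v when the spectral radius of I − P⁻¹H is below one. In this sketch a diagonal (Jacobi) preconditioner stands in for EKFAC, and an explicit matrix stands in for Hessian-vector products; both are simplifying assumptions, and the function names are hypothetical.

```python
def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def preconditioned_neumann_ihvp(hessian, vec, precond_inv_diag, num_iters):
    """Approximate H^{-1} v via x <- x + P^{-1} (v - H x).

    `precond_inv_diag` is the diagonal of P^{-1}. A diagonal preconditioner
    is an assumption made for brevity; it stands in for EKFAC here.
    """
    x = [0.0] * len(vec)
    for _ in range(num_iters):
        residual = [v - hx for v, hx in zip(vec, matvec(hessian, x))]
        x = [xi + p * r for xi, p, r in zip(x, precond_inv_diag, residual)]
    return x


# Small symmetric positive-definite example.
H = [[4.0, 1.0], [1.0, 3.0]]
v = [1.0, 2.0]
ihvp = preconditioned_neumann_ihvp(H, v, [1 / 4.0, 1 / 3.0], num_iters=50)
```

A better preconditioner shrinks the spectral radius of I − P⁻¹H, so fewer iterations are needed for the same accuracy, which is consistent with the abstract's claim of requiring fewer iterations than plain Neumann series.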

Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking

Chen-Hao (Lance) Chao, Wei-Fang Sun, Hanwen Liang, Chun-Yi Lee, Rahul Krishnan (Vector Faculty Member)

Abstract

Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.

BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris Maddison (Vector Faculty Member), Bo Wang (Vector Faculty Member)

Abstract

Unlocking deep, interpretable biological reasoning from complex genomic data is a paramount challenge for artificial intelligence, hindering critical scientific discovery. Existing DNA foundation models, despite their powerful sequence representation capabilities, often struggle with multi-step reasoning and lack inherent mechanisms for transparent, biologically intuitive explanations. We present BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with a large language model (LLM). This novel connection empowers the LLM to directly process and reason with genomic information as a fundamental input modality, enabling a new form of multimodal biological understanding. BioReason’s capacity for sophisticated, multi-step reasoning is cultivated through a regimen of supervised fine-tuning and targeted reinforcement learning, guiding the integrated system to generate logical and biologically coherent deductions. On challenging benchmarks, including KEGG-based disease pathway prediction (where BioReason improves accuracy by 9 percentage points, from 88% to 97%) and variant effect analysis, BioReason demonstrates an average performance gain of 15% over strong single-modality baselines. A key breakthrough is BioReason’s ability to reason over previously unseen biological entities and articulate its decision-making process through interpretable, step-by-step biological traces mechanistically supporting its predictions. BioReason offers a transformative approach for AI in biology, paving the way for deeper mechanistic insights and accelerated generation of testable hypotheses from genomic data.

TLDR: BioReason introduces a novel DNA-LLM architecture where the LLM directly processes genomic information, achieving superior, interpretable multi-step biological reasoning and accelerating mechanistic discovery.

Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining

Spotlight paper

Raghuveer Thirukovalluru, Rui Meng, Ye Liu, Karthikeyan K, Mingyi Su, Ping Nie, Semih Yavuz, Yingbo Zhou, Wenhu Chen (Vector Faculty Member), Bhuwan Dhingra

Abstract

Contrastive learning (CL) is a prevalent technique for training embedding models, which pulls semantically similar examples (positives) closer in the representation space while pushing dissimilar ones (negatives) further apart. A key source of negatives is ‘in-batch’ examples, i.e., positives from other examples in the batch. Effectiveness of such models is hence strongly influenced by the size and quality of training batches. In this work, we propose ‘Breaking the Batch Barrier’ (B3), a novel batch construction strategy designed to curate high-quality batches for CL. Our approach begins by using a pretrained teacher embedding model to rank all examples in the dataset, from which a sparse similarity graph is constructed. A community detection algorithm is then applied to this graph to identify clusters of examples that serve as strong negatives for one another. The clusters are then used to construct batches that are rich in in-batch negatives. Empirical results on the MMEB multimodal embedding benchmark (36 tasks) demonstrate that our method sets a new state of the art, outperforming previous best methods by +1.3 and +2.9 points at the 7B and 2B model scales, respectively. Notably, models trained with B3 surpass existing state-of-the-art results even with a batch size as small as 64, which is 4–16× smaller than that required by other methods.
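
To see why batch composition matters, it helps to look at a contrastive loss with in-batch negatives. The sketch below is a generic InfoNCE-style loss in which each query's positive sits in the same row and every other row's positive serves as a negative. It illustrates the setting B3 operates in, not B3's batch mining itself; the function name and the temperature value are illustrative assumptions.

```python
import math


def in_batch_contrastive_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss where the positives of other rows act as negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    losses = []
    for i, q in enumerate(queries):
        # Similarity of query i to every positive in the batch.
        scores = [dot(q, p) / temperature for p in positives]
        # Stable log-sum-exp over the whole batch.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the matching positive.
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)
```

If the other rows in a batch are easy to tell apart from the true positive, the loss is already near zero and provides little training signal; batches packed with hard negatives keep the loss informative, which is the motivation for mining batches from clusters of mutually similar examples.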

BridgePure: Limited Protection Leakage Can Break Black-Box Data Protection

Yihan Wang, Yiwei Lu (Vector Faculty Affiliate), Xiao-Shan Gao, Gautam Kamath (Vector Faculty Member), Yaoliang Yu (Vector Faculty Member)

Abstract

Availability attacks, or unlearnable examples, are defensive techniques that allow data owners to modify their datasets in ways that prevent unauthorized machine learning models from learning effectively while maintaining the data’s intended functionality. This has led to the release of popular black-box tools (e.g., APIs) for users to upload personal data and receive protected counterparts. In this work, we show that such black-box protections can be substantially compromised if a small set of unprotected in-distribution data is available. Specifically, we propose a novel threat model of protection leakage, where an adversary can (1) easily acquire (unprotected, protected) pairs by querying the black-box protections with a small unprotected dataset; and (2) train a diffusion bridge model to build a mapping between unprotected and protected data. This mapping, termed BridgePure, can effectively remove the protection from any previously unseen data within the same distribution. BridgePure demonstrates superior purification performance on classification and style mimicry tasks, exposing critical vulnerabilities in black-box data protection. We suggest that practitioners implement multi-level countermeasures to mitigate such risks.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Leonid Sigal (Vector Faculty Member), Roland Memisevic

Abstract

Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which must happen in real time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce LiveCook, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. LiveCook features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on LiveCook and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating live, situated coaching.

TLDR: Current multi-modal LLMs struggle with live, step-by-step task guidance. We built Qualcomm Interactive Cooking (a new dataset with mistake videos and timed feedback) and LiveMamba (a streaming model) to enable better real-time interactive guidance.

Care-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson’s Disease Gait Assessment

Vida Adeli, Ivan Klabučar, Javad Rajabi, Benjamin Filtjens, Soroush Mehraban, Diwei Wang, Trung Hieu Hoang, Minh Do, Hyewon Seo, Candice Muller, Daniel Coelho, Claudia de Oliveira, Pieter Ginis, Moran Gilat, Alice Nieuwboer, Joke Spildooren, J. Mckay, Hyeokhyen Kwon, Gari Clifford, Christine Esper, Stewart Factor, Imari Genias, Amirhossein Dadashzadeh, Leia Shum, Alan Whone, Majid Mirmehdi, Andrea Iaboni, Babak Taati (Vector Faculty Affiliate)

Abstract

Objective gait assessment in Parkinson’s Disease (PD) is limited by the absence of large, diverse, and clinically annotated motion datasets. We introduce Care-PD, the largest publicly available archive of 3D mesh gait data for PD, and the first multi-site collection spanning 9 cohorts from 8 clinical centers. All recordings (RGB video or motion capture) are converted into anonymized SMPL meshes via a harmonized preprocessing pipeline. Care-PD supports two key benchmarks: supervised clinical score prediction (estimating Unified Parkinson’s Disease Rating Scale, UPDRS, gait scores) and unsupervised motion pretext tasks (2D-to-3D keypoint lifting and full-body 3D reconstruction). Clinical prediction is evaluated under four generalization protocols: within-dataset, cross-dataset, leave-one-dataset-out, and multi-dataset in-domain adaptation. To assess clinical relevance, we compare state-of-the-art motion encoders with a traditional gait-feature baseline, finding that encoders consistently outperform handcrafted features. Pretraining on Care-PD reduces MPJPE (from 60.8 mm to 7.5 mm) and boosts PD severity macro-F1 by 17%, underscoring the value of clinically curated, diverse training data. Care-PD and all benchmark code are released for non-commercial research (Code, Data).

TLDR: We introduce Care-PD, a multi-site dataset and benchmark for Parkinson’s gait analysis, enabling robust clinical severity prediction and improving motion representation learning through diverse, anonymized pathological gait data.

CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

Spotlight paper

Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Junwei Ma, Bingru Li, Jesse Cresswell, Rahul Krishnan (Vector Faculty Member)

Abstract

Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out of the box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model requires no further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN/).

TLDR: CausalPFN is a pre-trained transformer that amortizes causal effect estimation: trained once on simulated data-generating processes, it outputs calibrated effects for new observational datasets with zero tuning.

Channel Simulation and Distributed Compression with Ensemble Rejection Sampling

Truong Buu Phan, Ashish Khisti (Vector Faculty Affiliate)

Abstract

We study channel simulation and distributed matching, two fundamental problems with several applications to machine learning, using a recently introduced generalization of the standard rejection sampling (RS) algorithm known as Ensemble Rejection Sampling (ERS). For channel simulation, we propose a new coding scheme based on ERS that achieves a near-optimal coding rate. In this process, we demonstrate that standard RS can also achieve a near-optimal coding rate and generalize the result of Braverman and Garg (2014) to the continuous alphabet setting. Next, as our main contribution, we present a distributed matching lemma for ERS, which serves as the rejection sampling counterpart to the Poisson Matching Lemma (PML) introduced by Li and Anantharam (2021). Our result also generalizes a recent work on importance matching lemma (Phan et al, 2024) and, to our knowledge, is the first result on distributed matching in the family of rejection sampling schemes where the matching probability is close to PML. We demonstrate the practical significance of our approach over prior works by applying it to distributed compression. The effectiveness of our proposed scheme is validated through experiments involving synthetic Gaussian sources and distributed image compression using the MNIST dataset.

TLDR: We provide a new channel simulation approach for distributed compression using Ensemble Rejection Sampling.
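
For readers unfamiliar with the baseline that Ensemble Rejection Sampling generalizes, the sketch below implements standard rejection sampling on a discrete alphabet. It is only the textbook RS algorithm, not ERS and not the paper's coding scheme; the function name and the example distributions are illustrative assumptions.

```python
import random


def rejection_sample(target, proposal, bound, rng):
    """Draw one index distributed according to `target` using draws from `proposal`.

    Requires target[i] <= bound * proposal[i] for every i.
    Returns the accepted index and the number of proposals drawn.
    """
    indices = range(len(proposal))
    trials = 0
    while True:
        trials += 1
        i = rng.choices(indices, weights=proposal)[0]
        # Accept proposal i with probability target[i] / (bound * proposal[i]).
        if rng.random() < target[i] / (bound * proposal[i]):
            return i, trials


rng = random.Random(0)
target = [0.7, 0.2, 0.1]
proposal = [1 / 3, 1 / 3, 1 / 3]
sample, num_trials = rejection_sample(target, proposal, bound=2.1, rng=rng)
```

In channel simulation schemes built on shared randomness, it is typically the index of the accepted proposal that gets communicated, which is why the sketch also returns the trial count.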

CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction

Ella Miray Rajaonson, Mahyar Rajabi Kochi, Luis Martin Mejia Mendoza, Mohamad Moosavi (Vector Faculty Member), Benjamin Sanchez-Lengeling (Vector Faculty Affiliate)

Abstract

Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product used results from a mixture of chemicals. While being a vital part of the industry pipeline, the chemical mixture space remains relatively unexplored by the machine learning community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures, covering a corpus of 11 chemical mixture property prediction tasks, from drug delivery formulations to battery electrolytes, totalling approximately 500k data points gathered and curated from 7 publicly available datasets. CheMixHub introduces various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub

Class-wise Balancing Data Replay for Federated Class-Incremental Learning

Zhuang Qi, Ying-Peng Tang, Lei Meng, Han Yu, Xiaoxiao Li (Vector Faculty Member), Xiangxu Meng

Abstract

Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, its performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class-wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior task knowledge in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model’s overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.
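
The effect of class-level temperature scaling on overconfidence can be seen in a few lines. The sketch below divides each logit by its own temperature before the softmax. How FedCBDR derives temperatures from task dynamics, and its instance-level scaling, are not described in the abstract, so the temperatures used here are purely hypothetical values.

```python
import math


def softmax_with_temperature(logits, temperatures):
    """Softmax over logits divided elementwise by per-class temperatures."""
    scaled = [z / t for z, t in zip(logits, temperatures)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Class 0 plays the role of a majority class with a large positive logit.
logits = [3.0, 1.0]
plain = softmax_with_temperature(logits, [1.0, 1.0])
softened = softmax_with_temperature(logits, [2.0, 1.0])
```

Raising the temperature of the majority class shrinks its positive logit, so `softened` assigns it less probability than `plain` and leaves more mass for the minority class.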

Collapsing Taylor Mode Automatic Differentiation

Felix Dangel (Vector Distinguished Postdoctoral Fellow), Tim Siebert, Marius Zeinhofer, Andrea Walther

Abstract

Computing partial differential equation (PDE) operators via nested backpropagation is expensive, yet popular, and severely restricts their utility for scientific machine learning. Recent advances, like the forward Laplacian and randomizing Taylor mode automatic differentiation (AD), propose forward schemes to address this. We introduce an optimization technique for Taylor mode that ‘collapses’ derivatives by rewriting the computational graph, and demonstrate how to apply it to general linear PDE operators, and randomized Taylor mode. The modifications simply require propagating a sum up the computational graph, which could—or should—be done by a machine learning compiler, without exposing complexity to users. We implement our collapsing procedure and evaluate it on popular PDE operators, confirming it accelerates Taylor mode and outperforms nested backpropagation.

TLDR: We accelerate Taylor mode for practically relevant differential operators by collapsing Taylor coefficients; this can be done automatically with compute graph simplifications.
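
To make the "collapsing" idea concrete, consider the Laplacian of a composition f = g ∘ h with input x in R^D. The derivation below is a sketch based on the chain rule for second directional derivatives, with e_d denoting the d-th standard basis vector; the notation is an assumption made for this illustration and is not taken from the paper.

```latex
\begin{align}
\partial^2 f(x)[e_d, e_d]
  &= \partial^2 g(h(x))\big[\partial h(x)\, e_d,\ \partial h(x)\, e_d\big]
   + \partial g(h(x))\, \partial^2 h(x)[e_d, e_d], \\
\Delta f(x) = \sum_{d=1}^{D} \partial^2 f(x)[e_d, e_d]
  &= \sum_{d=1}^{D} \partial^2 g(h(x))\big[\partial h(x)\, e_d,\ \partial h(x)\, e_d\big]
   + \partial g(h(x)) \Big( \sum_{d=1}^{D} \partial^2 h(x)[e_d, e_d] \Big).
\end{align}
```

Because the last term is linear in the second-order quantities, only their sum needs to be pushed through the graph rather than D separate second-order terms, which matches the abstract's description of propagating a sum up the computational graph.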

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal, Brian Lester, Colin Raffel (Vector Faculty Member), Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben allal, Elie Bakouch, John Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, John Kirchenbauer, Tom Goldstein, Brian Bartoldson, Bhavya Kailkhura, Tyler Murray

Abstract

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pre-training. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training Comma v0.1, a 7 billion parameter LLM trained on 1 trillion tokens of text from the Common Pile. Comma attains competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as LLaMA 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as Comma v0.1’s checkpoints and training mixture.

TLDR: We collect 8 TB of public domain and openly licensed text and use it to pre-train a performant 7B-parameter LLM.

Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

Spotlight paper

Xingyu Chen, Shihao Ma, Runsheng Lin, Jiecong Lin, Bo Wang (Vector Faculty Member)

Abstract

Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce regCon, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that regCon consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, regCon-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Spotlight paper

Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski (Vector Faculty Member), Sergey Tulyakov, Aliaksandr Siarohin

Abstract

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.

TLDR: We propose an improved DPO method tailored towards video diffusion models

DiffBreak: Is Diffusion-Based Purification Robust?

Andre Kassis, Urs Hengartner, Yaoliang Yu (Vector Faculty Member)

Abstract

Diffusion-based purification (DBP) has become a cornerstone defense against adversarial examples (AEs), regarded as robust due to its use of diffusion models (DMs) that project AEs onto the natural data manifold. We refute this core claim, theoretically proving that gradient-based attacks effectively target the DM rather than the classifier, causing DBP’s outputs to align with adversarial distributions. This prompts a reassessment of DBP’s reported robustness, which we attribute to two critical flaws: incorrect gradients and inappropriate evaluation protocols that test only a single random purification of the AE. We show that with proper accounting for stochasticity and resubmission risk, DBP collapses. To support this, we introduce DiffBreak, the first reliable toolkit for differentiation through DBP, eliminating gradient flaws that previously inflated robustness estimates. We also analyze the current defense scheme used for DBP, where classification relies on a single purification, pinpointing its inherent invalidity. We provide a statistically grounded majority-vote (MV) alternative that aggregates predictions across multiple purified copies, showing a partial but meaningful robustness gain. We then propose a novel adaptation of an optimization method against deepfake watermarking, crafting systemic perturbations that defeat DBP even under MV, challenging DBP’s viability.

TLDR: DiffBreak provides the first reliable framework for differentiating through diffusion-based purification, revealing key vulnerabilities under adaptive attacks.
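
The majority-vote (MV) idea mentioned in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical and is not the DiffBreak toolkit's API: `purify` stands in for a stochastic diffusion purifier, `classify` for the downstream classifier, and inputs are plain scalars so the example stays self-contained.

```python
import random
from collections import Counter


def purify(x, rng):
    # Hypothetical stand-in for stochastic purification: perturbs the input
    # with Gaussian noise, so repeated calls give different outputs.
    return x + rng.gauss(0.0, 1.0)


def classify(x):
    # Hypothetical stand-in classifier: thresholds a scalar input.
    return 1 if x > 0 else 0


def majority_vote_predict(x, n_copies=101, seed=0):
    """Classify n_copies independently purified copies of x and
    return the most frequent label."""
    rng = random.Random(seed)
    votes = Counter(classify(purify(x, rng)) for _ in range(n_copies))
    return votes.most_common(1)[0][0]
```

A single purification can flip the label by chance; aggregating over many copies makes the decision depend on the distribution of purified outputs rather than on one random draw.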

Distributional Training Data Attribution: What do Influence Functions Sample?

Spotlight paper

Bruno Mlodozeniec, Isaac Reid, Sam Power, David Krueger, Murat Erdogdu (Vector Faculty Member), Richard Turner, Roger Grosse (Vector Faculty Member)

Abstract

Randomness is an unavoidable part of training deep learning models, yet something that traditional training data attribution algorithms fail to rigorously account for. They ignore the fact that, due to stochasticity in the initialisation and batching, training on the same dataset can yield different models. In this paper, we address this shortcoming by introducing distributional training data attribution (d-TDA), the goal of which is to predict how the distribution of model outputs (over training runs) depends upon the dataset. We demonstrate the practical significance of d-TDA in experiments, e.g. by identifying training examples that drastically change the distribution of some target measurement without necessarily changing the mean. Intriguingly, we also find that influence functions (IFs), a popular but poorly-understood data attribution tool, emerge naturally from our distributional framework as the limit to unrolled differentiation – without requiring restrictive convexity assumptions. This provides a new mathematical motivation for their efficacy in deep learning, and helps to characterise their limitations.

TLDR: This paper introduces distributional training data attribution, a data attribution framework that accounts for stochasticity in deep learning training, enabling a mathematical justification for why influence functions work in this setting.

Don’t be lazy: CompleteP enables compute-efficient deep transformers

Nolan Dey, Bin Zhang, Lorenzo Noci, Mufan Li (Vector Faculty Affiliate), Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness

Abstract

We study compute efficiency of LLM training when using different parameterizations, i.e., rules for adjusting model and optimizer hyperparameters (HPs) as model size changes. Some parameterizations fail to transfer optimal base HPs (such as learning rate) across changes in model depth, requiring practitioners to either re-tune these HPs as they scale up (expensive), or accept sub-optimal training when re-tuning is prohibitive. Even when they achieve HP transfer, we develop theory to show parameterizations may still exist in the lazy learning regime where layers learn only features close to their linearization, preventing effective use of depth and nonlinearity. Finally, we identify and adopt the parameterization we call CompleteP that achieves both depth-wise HP transfer and non-lazy learning in all layers. CompleteP enables a wider range of model width/depth ratios to remain compute-efficient, unlocking shapes better suited for different hardware settings and operational contexts. Moreover, CompleteP enables 12-34% compute efficiency improvements over the prior state-of-the-art.

TLDR: We introduce CompleteP, which offers depth-wise HP transfer, FLOP savings when training deep models, and a larger range of compute-efficient width/depth ratios.

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang (Vector Faculty Member)

Abstract

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE’s feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64.

TLDR: We propose EAGLE-3, observing that it can benefit from scaling up the data.
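
For readers unfamiliar with the draft-and-verify loop that EAGLE-style methods build on, here is a minimal sketch of vanilla greedy speculative decoding. It is not EAGLE-3: there is no feature fusion or training-time test, and both "models" (`target_next`, `draft_next`) are hypothetical toy functions over integer tokens.

```python
def target_next(tokens):
    # Hypothetical "expensive" target model: a deterministic toy rule.
    return (sum(tokens) * 31 + len(tokens)) % 7


def draft_next(tokens):
    # Hypothetical cheap draft model: agrees with the target except when
    # the context length is a multiple of 3.
    guess = target_next(tokens)
    return (guess + 1) % 7 if len(tokens) % 3 == 0 else guess


def greedy_decode(prompt, n_new):
    # Baseline: one target call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify the draft against the target. A real system scores all
        #    k positions in one parallel forward pass; here we simply loop.
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correct the first mismatch, stop
                break
        else:
            # Every drafted token matched: take one extra target token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]
```

Because every emitted token is either confirmed or replaced by the target model's own choice, the output matches plain greedy decoding; the speedup in practice comes from how many drafted tokens are accepted per target pass.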

ELECTRA: A Cartesian Network for 3D Charge Density Prediction with Floating Orbitals

Spotlight paper

Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik (Vector Faculty Member), Tejs Vegge, Arghya Bhowmik

Abstract

We present the Electronic Tensor Reconstruction Algorithm (ELECTRA) – an equivariant model for predicting electronic charge densities using floating orbitals. Floating orbitals are a long-standing concept in the quantum chemistry community that promises more compact and accurate representations by placing orbitals freely in space, as opposed to centering all orbitals at the position of atoms. Finding the ideal placement of these orbitals requires extensive domain knowledge, though, which thus far has prevented widespread adoption. We solve this in a data-driven manner by training a Cartesian tensor network to predict the orbital positions along with orbital coefficients. This is made possible through a symmetry-breaking mechanism that is used to learn position displacements with lower symmetry than the input molecule while preserving the rotation equivariance of the charge density itself. Inspired by recent successes of Gaussian Splatting in representing densities in space, we are using Gaussian orbitals and predicting their weights and covariance matrices. Our method achieves a state-of-the-art balance between computational efficiency and predictive accuracy on established benchmarks.

TLDR: Efficient charge density prediction using floating orbitals

Enhancing Training Data Attribution with Representational Optimization

Spotlight paper

Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin Raffel (Vector Faculty Member), Yiming Yang

Abstract

Training data attribution (TDA) methods aim to measure how training data impacts a model’s predictions. While gradient-based attribution methods, such as influence functions, offer theoretical rigor, their computational costs make them impractical for large-scale applications. Representation-based attribution methods are more efficient, relying on similarity computations between examples in some representation space, but they often lack task-aware and model-specific optimization, limiting their accuracy. To address these challenges, we propose AirRep, a novel representation-based approach that enhances representation quality through task-driven optimization of a representation encoding model. Furthermore, we extend this method beyond single-sample attribution using an attention-based pooling mechanism to effectively estimate the collective influence of groups of samples. Experiments on instruction tuning of large language models demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient. Further analysis highlights its robustness, including generalization to new data and new TDA tasks.

TLDR: AirRep is a text representation model optimized for TDA, offering performance comparable to gradient-based methods while being significantly more efficient.

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia

Chandler Smith, Marwa Abdulhai, Manfred Díaz, Marko Tesic, Rakshit Trivedi, Sasha Vezhnevets, Lewis Hammond, Jesse Clifton, Minsuk Chang, Edgar Duenez-Guzman, John Agapiou, Jayd Matyas, Danny Karmon, Beining Zhang, Jim Dilkes, Akash Kundu, Hieu Minh Nguyen, Emanuel Tewolde, Jebish Purbey, Ram Mohan Rao Kadiyala, Siddhant Gupta, Aliaksei Korshuk, Buyantuev Alexander, Ilya Makarov, Gang Zhao, Rolando Fernandez, Zhihan Wang, Caroline Wang, Jiaxun Cui, Lingyun Xiao, Di Shi, Yoonchang Sung, Muhammad Arrasy Rahman, Peter Stone, Yipeng Kang, Hyeonggeun Yun, Ananya Ananya, Taehun Cha, Zhiqiang Wu, Elizaveta Tennant, Olivia Macmillan-Scott, Marta Segura, Diana Riazi, Fuyang Cui, Sriram Ganapathi (Vector Faculty Affiliate), Toryn Klassen (Vector CIFAR AI Safety Postdoctoral Fellow), Nico Schiavone, Mogtaba Alim, Sheila McIlraith (Vector Faculty Member), Manuel Rios, Oswaldo Peña, Carlos Rojas, Manuela Viviana Chacon-Chamorro, Rubén Manrique, Luis Felipe Giraldo, Nicanor Quijano, Yiding Wang, Yuxuan Chen, Fangwei Zhong, Mengmeng Wang, Wenming Tu, Zhaowei Zhang, Ziang Chen, Zixia Jia, Xue Feng, Zilong Zheng, Chichen Lin, Weijian Fan, Chenao Liu, Sneheel Sarangi, Ziyan Wang, shuqing shi, Yali Du, Avinaash Anand Kulandaivel, Yang Liu, Wu Ruiyang, Chetan Talele, 陆孙嘉, Gema Parreno, Shamika Dhuri, Bain McHale, Tim Baarslag, Dylan Hadfield-Menell, Natasha Jaques, José Hernández-Orallo, Joel Leibo

Abstract

Large language model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. This work introduces an approach to measuring human-appropriate cooperative intelligence, emphasizing an agent’s ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.

TL;DR: In this paper we introduce a method for evaluating cooperation in LLM-based agents with unfamiliar co-players in novel, mixed-motive scenarios, and report the analytical techniques, methods, and results of the 2024 Concordia Contest.

Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski (Vector Faculty Member), Sanja Fidler (Vector Faculty Member), Cengiz Oztireli, Huan Ling, Zan Gojcic, Jiahui Huang

Abstract

Recent advancements in static feed-forward scene reconstruction have demonstrated significant progress in high-quality novel view synthesis. However, these models often struggle with generalizability across diverse environments and fail to effectively handle dynamic content. We present BTimer (short for Bullet Timer), the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target (‘bullet’) timestamp by aggregating information from all the context frames. Such a formulation allows BTimer to gain scalability and generalization by leveraging both static and dynamic scene datasets. Given a casual monocular dynamic video, BTimer reconstructs a bullet-time scene within 150ms while reaching state-of-the-art performance on both static and dynamic scene datasets, even compared with optimization-based approaches.

TLDR: Feed-forward dynamic 3DGS scene reconstruction from videos.

FlashMD: long-stride, universal prediction of molecular dynamics

Spotlight paper

Filippo Bigi, Sanggyu Chong, Agustinus Kristiadi (Vector Faculty Affiliate), Michele Ceriotti

Abstract

Molecular dynamics (MD) provides insights into atomic-scale processes by integrating over time the equations that describe the motion of atoms under the action of interatomic forces. Machine learning models have substantially accelerated MD by providing inexpensive predictions of the forces, but they remain constrained to minuscule time integration steps, which are required by the fast time scale of atomic motion. In this work, we propose FlashMD, a method to predict the evolution of positions and momenta over strides that are between one and two orders of magnitude longer than typical MD time steps. We incorporate considerations on the mathematical and physical properties of Hamiltonian dynamics in the architecture, generalize the approach to allow the simulation of any thermodynamic ensemble, and carefully assess the possible failure modes of a direct MD approach. We validate FlashMD’s accuracy in reproducing equilibrium and time‐dependent properties, using both system‐specific and general-purpose models, extending the ability of MD simulation to reach the long time scales needed to model microscopic processes of high scientific and technological relevance.

TLDR: A method for the prediction of molecular dynamics trajectories using long time steps

Flux4D: Flow-based Unsupervised 4D Reconstruction

Jingkang Wang, Henry Che, Yun Chen, Ze Yang, Lily Goli, Sivabalan Manivasagam, Raquel Urtasun (Vector Faculty Member)

Abstract

Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations, in a fully unsupervised manner. By adopting only photometric losses and enforcing an “as static as possible” regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.

TLDR: Flux4D is a simple, scalable framework for unsupervised 4D reconstruction of large-scale driving scenes.

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Nandan Thakur, Jimmy Lin (Vector Faculty Affiliate), Samuel Havens, Michael Carbin, Omar Khattab, Andrew Drozdov

Abstract

We introduce FreshStack, a holistic framework for automatically building information retrieval (IR) evaluation benchmarks by incorporating challenging questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not improve first-stage retrieval accuracy (two out of five topics), and oracle context helps an LLM generator generate a high-quality RAG answer. We hope FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks.

TL;DR: FreshStack is a framework to build realistic IR & RAG evaluation benchmarks on niche and recent domains from community-asked questions and answers.

From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD

Konstantinos Tsiolis, Alireza Mousavi-Hosseini, Murat Erdogdu (Vector Faculty Member)

Abstract

To understand feature learning dynamics in neural networks, recent theoretical works have focused on gradient-based learning of Gaussian single-index models, where the label is a nonlinear function of a latent one-dimensional projection of the input. While the sample complexity of online SGD is determined by the information exponent of the nonlinear link, recent works improved this by reusing samples or modifying the loss function — transformations which introduce non-correlational updates — and instead are limited by the (potentially much smaller) generative exponent. However, this picture is only valid if the learning rate is sufficiently large. In this paper, we characterize the relationship between learning rate and sample complexity for a broad class of gradient-based algorithms that encapsulates both correlational and non-correlational updates, and demonstrate a phase transition from an “information exponent regime” with small learning rate to a “generative exponent regime” with large learning rate. Our framework covers prior analyses of one-pass SGD and SGD with batch reuse, while also introducing a new layer-wise training algorithm that leverages a two-timescales approach to go beyond correlational queries without reusing samples or modifying the loss from squared error. Our theoretical study demonstrates that the choice of learning rate is as important as the design of the algorithm in achieving statistical and computational efficiency.

General-Reasoner: Advancing LLM Reasoning Across All Domains

Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun MA, Wenhu Chen (Vector Faculty Member)

Abstract

Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). In particular, the “Zero” reinforcement learning introduced by DeepSeek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current works on LLM reasoning mainly focus on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is scarcer. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. Our comprehensive evaluation across benchmarks such as MMLU-Pro, GPQA, SuperGPQA, BBEH, MATH, and AMC demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks. Code, data, and model checkpoints in this work will be released.

A Geometric Analysis of PCA

Ayoub El Hanchi, Murat Erdogdu (Vector Faculty Member), Chris Maddison (Vector Faculty Member)

Abstract

What property of the data distribution determines the excess risk of principal component analysis? In this paper, we provide a precise answer to this question. We establish a central limit theorem for the error of the principal subspace estimated by PCA, and derive the asymptotic distribution of its excess risk under the reconstruction loss. We obtain a non-asymptotic upper bound on the excess risk of PCA that recovers, in the large sample limit, our asymptotic characterization. Underlying our contributions is the following result: we prove that the negative block Rayleigh quotient, defined on the Grassmannian, is generalized self-concordant along geodesics emanating from its minimizer of maximum rotation less than $\pi/4$.

TLDR: We prove the asymptotic normality of PCA on the Grassmannian, and derive a tight non-asymptotic bound on its excess risk using self-concordance.

Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning

Zhuang Qi, Yu Pan, Lei Meng, Sijin Zhou, Han Yu, Xiaoxiao Li (Vector Faculty Member), Xiangxu Meng

Abstract

Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the global prompt refinement with non-interfering attention masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.

Ground-Compose-Reinforce: Tasking Reinforcement Learning Agents through Formal Language

Andrew Li, Toryn Klassen (Vector CIFAR AI Safety Postdoctoral Fellow), Andrew Wang, Parand A. Alamdari, Sheila McIlraith (Vector Faculty Member)

Abstract

Grounding language in perception and action is a key challenge when building situated agents that can interact with humans, or other agents, via language. In the past, addressing this challenge has required manually designing the language grounding or curating massive datasets that associate language with the environment. We propose Ground-Compose-Reinforce, an end-to-end, neurosymbolic framework for training RL agents directly from high-level task specifications—without manually designed reward functions or other domain-specific oracles, and without massive datasets. These task specifications take the form of Reward Machines, automata-based representations that capture high-level task structure and are in some cases autoformalizable from natural language. Critically, we show that Reward Machines can be grounded using limited data by exploiting compositionality. Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications—including behaviours that never appear in pretraining—while non-compositional approaches fail.

TLDR: We train RL agents directly from high-level specifications, without reward functions or domain-specific oracles.
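
To make the notion of a Reward Machine concrete, here is a minimal sketch of one as a finite automaton whose transitions emit rewards. It is an illustration only, not the authors' implementation: the task, state names, and propositions (`holding_cup`, `at_shelf`) are hypothetical, and the propositions are handed in as sets of strings rather than grounded from perception as in the paper.

```python
class RewardMachine:
    """A tiny reward machine: a finite automaton whose transitions emit rewards."""

    def __init__(self, initial, transitions, terminal):
        self.initial = initial
        # Maps (state, proposition) -> (next_state, reward).
        self.transitions = transitions
        self.terminal = terminal
        self.state = initial

    def reset(self):
        self.state = self.initial

    def step(self, true_props):
        """Advance on the set of propositions true at this environment step
        and return the reward emitted by the transition taken."""
        for prop in sorted(true_props):
            key = (self.state, prop)
            if key in self.transitions:
                self.state, reward = self.transitions[key]
                return reward
        return 0.0  # no matching transition: stay in the current state

    def done(self):
        return self.state in self.terminal


# Hypothetical task: "pick up the cup, then reach the shelf".
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "holding_cup"): ("u1", 0.0),
        ("u1", "at_shelf"): ("u2", 1.0),
    },
    terminal={"u2"},
)
```

Reaching the shelf before holding the cup yields no reward here, which is the kind of temporally extended structure a flat reward function cannot express without extra state.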

Improving Energy Natural Gradient Descent through Woodbury, Momentum, and Randomization

Andrés Guzmán-Cordero, Felix Dangel (Vector Distinguished Postdoctoral Fellow), Gil Goldshlager, Marius Zeinhofer

Abstract

Natural gradient methods significantly accelerate the training of Physics-Informed Neural Networks (PINNs), but are often prohibitively costly. We introduce a suite of techniques to improve the accuracy and efficiency of energy natural gradient descent (ENGD) for PINNs. First, we leverage the Woodbury formula to dramatically reduce the computational complexity of ENGD. Second, we adapt the Subsampled Projected-Increment Natural Gradient Descent algorithm from the variational Monte Carlo literature to accelerate the convergence. Third, we explore the use of randomized algorithms to further reduce the computational cost in the case of large batch sizes. We find that randomization accelerates progress in the early stages of training for low-dimensional problems, and we identify key barriers to attaining acceleration in other scenarios. Our numerical experiments demonstrate that our methods outperform previous approaches, achieving the same $L^2$ error as the original ENGD up to $75\times$ faster.

TLDR: We introduce Woodbury’s matrix identity, momentum-like SPRING and randomization to make energy natural gradient descent 75 times faster for PINNs.
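
As a rough illustration of why a Woodbury-type identity can cut the cost, consider a damped least-squares-style update. The notation is an assumption for this sketch and is not taken from the paper: $J \in \mathbb{R}^{N \times P}$ is a Jacobian with $N$ batch points and $P$ parameters, $r \in \mathbb{R}^N$ is a residual vector, and $\lambda > 0$ is a damping term, with the discretized Gramian assumed to have the form $J^\top J$.

```latex
% General Woodbury matrix identity
(A + UCV)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}

% Consequence (push-through form) for a damped Gramian:
\left( J^\top J + \lambda I_P \right)^{-1} J^\top r
  = J^\top \left( J J^\top + \lambda I_N \right)^{-1} r
```

The left-hand side requires solving a $P \times P$ system, while the right-hand side only requires an $N \times N$ solve, which is much cheaper whenever the batch size $N$ is far smaller than the parameter count $P$.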

An Investigation of Memorization Risk in Healthcare Foundation Models

Sana Tonekaboni (Vector Distinguished Postdoctoral Fellow), Lena Stempfle, Adibvafa Fallahpour, Walter Gerych, Marzyeh Ghassemi

Abstract

Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and distinguishes between generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.

TLDR: We propose black-box tests to detect harmful memorization in foundation models trained on structured EHR data. Validated on a public model, our toolkit supports privacy audits by distinguishing generalization from privacy-compromising memorization.

The Leaderboard Illusion

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel Dsouza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng (Vector Faculty Affiliate), Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Abstract

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we found one provider testing 27 private variants before making one model public at the second position on the leaderboard. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. The top two providers have individually received an estimated 19.2% and 20.4% of all data on the arena. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. With conservative estimates, we show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on ArenaHard, a test set from the arena distribution. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena’s evaluation framework and promote fairer, more transparent benchmarking for the field.

TL;DR: Chatbot Arena has become a leading platform for ranking AI models. Our extensive study uncovers hidden dynamics that distort rankings and provides concrete steps to enhance fairness and transparency in evaluation of models on Chatbot Arena.
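
The selection effect described in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' methodology: it assumes every private variant has the same true skill and that each measured score is that skill plus Gaussian noise (`true_skill`, `noise_sd`, and both function names are hypothetical). Under those assumptions, publishing only the best of N variants inflates the reported score even though no variant is actually better.

```python
import random
import statistics


def best_of_n_score(n_variants, true_skill=1200.0, noise_sd=15.0, rng=random):
    """Return the highest noisy score among n_variants equally skilled variants."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))


def mean_reported_score(n_variants, trials=2000, seed=0):
    """Average published score when only the best variant is disclosed."""
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_n_score(n_variants, rng=rng) for _ in range(trials)
    )


if __name__ == "__main__":
    # Compare disclosing a single variant with picking the best of several.
    for n in (1, 5, 27):
        print(n, round(mean_reported_score(n), 1))
```

The gap between the single-variant average and the best-of-N average grows with N, which is the bias that selective disclosure introduces in this simplified setting.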

Learning from positive and unlabeled examples - Finite size sample bounds

Farnam Mansouri, Shai Ben-David (Vector Faculty Member)

Abstract

PU (Positive Unlabeled) learning is a variant of supervised classification learning in which the only labels revealed to the learner are of positively labeled instances. PU learning arises in many real-world applications. Most existing work relies on the simplifying assumption that the positively labeled training data is drawn from the restriction of the data generating distribution to positively labeled instances and/or that the proportion of positively labeled points (a.k.a. the class prior) is known a priori to the learner. This paper provides a theoretical analysis of the statistical complexity of PU learning under a wider range of setups. Unlike most prior work, our study does not assume that the class prior is known to the learner. We prove upper and lower bounds on the required sample sizes (of both the positively labeled and the unlabeled samples).

TLDR: This paper provides sample complexity bounds for learning from positive and unlabeled examples.

Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

Gerard Ben Arous, Murat Erdogdu (Vector Faculty Member), Nuri Mert Vural, Denny Wu

Abstract

We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta}_j, \boldsymbol{x}\rangle\right), \boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where $\sigma$ is the 2nd Hermite polynomial, and $\lbrace \boldsymbol{\theta}_j \rbrace _{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^\beta$ for $\beta \in (0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients $\lambda_j\asymp j^{-\alpha}$ for $\alpha \geq 0$. We provide a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, the sample size, and the model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.
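
To make the data model concrete, here is a minimal sketch of how one sample $(\boldsymbol{x}, y)$ could be drawn. It rests on simplifying assumptions that are not taken from the paper: the signal directions are the first $r$ standard basis vectors (which are orthonormal), $\lambda_j = j^{-\alpha}$ exactly, and $\sigma$ is the unnormalized probabilists' Hermite polynomial $z^2 - 1$, with the proportionality constant dropped. The function names are hypothetical.

```python
import random


def hermite2(z):
    """Second probabilists' Hermite polynomial, unnormalized: z^2 - 1."""
    return z * z - 1.0


def sample_point(d, r, alpha, rng):
    """Draw x ~ N(0, I_d) and the unnormalized target y for that x."""
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Assumption: theta_j is the j-th standard basis vector, so <theta_j, x> = x[j].
    y = sum((j + 1) ** (-alpha) * hermite2(x[j]) for j in range(r))
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)
    d, beta, alpha = 100, 0.5, 1.0
    r = round(d ** beta)  # extensive-width regime: r grows like d^beta
    x, y = sample_point(d, r, alpha, rng)
    print(len(x), y)
```

Because the second Hermite polynomial has zero mean under a standard Gaussian, the target averages to zero over many draws.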

Learning to Clean: Reinforcement Learning for Noisy Label Correction

Marzi Heidari, Hanping Zhang, Yuhong Guo (Vector Faculty Affiliate)

Abstract

The challenge of learning with noisy labels is significant in machine learning, as it can severely degrade the performance of prediction models if not addressed properly. This paper introduces a novel framework that conceptualizes noisy label correction as a reinforcement learning (RL) problem. The proposed approach, Reinforcement Learning for Noisy Label Correction (RLNLC), defines a comprehensive state space representing data and their associated labels, an action space that indicates possible label corrections, and a reward mechanism that evaluates the efficacy of label corrections. RLNLC learns a deep feature representation based policy network to perform label correction through reinforcement learning, utilizing an actor-critic method. The learned policy is subsequently deployed to iteratively correct the noisy training labels and support prediction model training. The effectiveness of RLNLC is demonstrated through extensive experiments on multiple benchmark datasets, where it consistently outperforms existing state-of-the-art techniques for learning from noisy labels.

List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression

Joseph Rowan, Truong Buu Phan, Ashish Khisti (Vector Faculty Affiliate)

Abstract

We study a relaxation of the problem of coupling probability distributions: a list of samples is generated from one distribution and an *accept* is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. (2025) for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the *list matching lemma*. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer across a range of language tasks. Our method also guarantees a certain degree of *drafter invariance* with respect to the output tokens which is not supported by existing schemes. We also provide a theoretical lower bound on the token level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and available to multiple decoders, each with independent side information. We propose a compression technique that is based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.

TLDR: We introduce a technique for coupling probability distributions when several samples are available from one of the distributions, and give applications to multi-draft speculative decoding and distributed lossy compression with side information.
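
For readers unfamiliar with Gumbel-max coupling, the following sketch shows the basic idea: two parties that share the same Gumbel noise and each take the argmax of log-probability plus noise tend to pick the same symbol. The list extension here is a naive illustration and an assumption of this sketch, not the construction from the paper: the list side draws one sample per noise vector, and the single-sample side reuses only the first noise vector. All function names are hypothetical.

```python
import math
import random


def gumbel_noise(size, rng):
    """Draw i.i.d. standard Gumbel variables via G = -log(-log(U))."""
    # max(...) guards against the (very unlikely) draw U == 0.
    return [-math.log(-math.log(max(rng.random(), 1e-300))) for _ in range(size)]


def gumbel_max_sample(probs, noise):
    """Return argmax_i (log p_i + G_i); zero-probability symbols never win."""
    scores = [math.log(p) + g if p > 0 else -math.inf for p, g in zip(probs, noise)]
    return max(range(len(probs)), key=scores.__getitem__)


def list_accept(p, q, list_size, rng):
    """Accept if any of list_size draws from q equals the single draw from p."""
    noises = [gumbel_noise(len(p), rng) for _ in range(list_size)]
    target = gumbel_max_sample(p, noises[0])  # single-sample side, first noise vector
    drafts = [gumbel_max_sample(q, noise) for noise in noises]
    return target in drafts


if __name__ == "__main__":
    rng = random.Random(0)
    p, q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
    for k in (1, 4):
        rate = sum(list_accept(p, q, k, rng) for _ in range(2000)) / 2000
        print(k, rate)
```

In this toy setup the empirical acceptance rate rises as the list grows, which is the quantity a lower bound such as the list matching lemma addresses.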

Locally Optimal Private Sampling: Beyond the Global Minimax

Hrad Ghoukasian, Bonwoo Lee, Shahab Asoodeh (Vector Faculty Affiliate)

Abstract

We study the problem of sampling from a distribution under local differential privacy (LDP). Given a private distribution $P \in \mathcal{P}$, the goal is to generate a single sample from a distribution that remains close to $P$ in $f$-divergence while satisfying the constraints of LDP. This task captures the fundamental challenge of producing realistic-looking data under strong privacy guarantees. While prior work by Park et al. (NeurIPS’24) focuses on global minimax-optimality across a class of distributions, we take a local perspective. Specifically, we examine the minimax error in a neighborhood around a fixed distribution $P_0$, and characterize its exact value, which depends on both $P_0$ and the privacy level. Our main result shows that the local minimax error is determined by the global minimax error when the distribution class $\mathcal{P}$ is restricted to a neighborhood around $P_0$. To establish this, we (1) extend previous work from pure LDP to the more general functional LDP framework, and (2) prove that the globally optimal functional LDP sampler yields the optimal local sampler when constrained to distributions near $P_0$. Building on this, we also derive a simple closed-form expression for the locally minimax-optimal samplers which does not depend on the choice of $f$-divergence. We further argue that this local framework naturally models private sampling with public data, where the public data distribution is represented by $P_0$. In this setting, we empirically compare our locally optimal sampler to existing global methods, and demonstrate that it consistently outperforms global minimax samplers.

LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

Anthony Fuller, Yousef Yassin, Junfeng Wen, Tarek Ibrahim, Daniel Kyrollos, James Green, Evan Shelhamer (Vector Faculty Member)

Abstract

Vision transformers are ever larger, more accurate, and more expensive to compute. At high resolution, the expense is even more extreme as the number of tokens grows quadratically in the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect learning where and what to compute at the same time. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by 17x and time by 4x, and standard recognition tasks that are global (ImageNet classification) and local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.

TLDR: We introduce a selector-extractor framework that extracts high-res features without ever seeing full high-res images to save compute.

LuxDiT: Lighting Estimation with Video Diffusion Transformer

Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski (Vector Faculty Member), Sanja Fidler (Vector Faculty Member), Nandita Vijaykumar (Vector Faculty Member), Zian Wang

Abstract

Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.

Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy and Research

A. Feder Cooper, Christopher Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alex Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Mitchell, Niloofar Mireshghallah, Abigail Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, I Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, Danah Boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy Cyphert, Mark Lemley, Nicolas Papernot (Vector Faculty Member), Katherine Lee

Abstract

“Machine unlearning” is a popular proposed solution for mitigating the existence of content in an AI model that is problematic for legal or moral reasons, including privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of specific information from a generative-AI model’s parameters, e.g., a particular individual’s personal data or the inclusion of copyrighted content in the model’s training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual’s data or reflect the concept of “Spiderman.” Both of these goals–the targeted removal of information from a model and the targeted suppression of information from a model’s outputs–present various technical and substantive challenges. We provide a framework for ML researchers and policymakers to think rigorously about these challenges, identifying several mismatches between the goals of unlearning and feasible implementations. These mismatches explain why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact.

The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control

Ruili Feng, Han Zhang, Zhilei Shu, Zhantao Yang, Longxiang Tang, Zhicai Wang, Andy Zheng, Jie Xiao, Zhiheng Liu, Ruihang Chu, Yukun Huang, Yu Liu, Hongyang Zhang (Vector Faculty Member)

Abstract

We present The Matrix, a foundational realistic world simulator capable of generating infinitely long 720p high-fidelity real-scene video streams with real-time, responsive control in both first- and third-person perspectives. Trained on limited supervised data from video games like Forza Horizon 5 and Cyberpunk 2077, complemented by large-scale unsupervised footage from real-world settings like Tokyo streets, The Matrix allows users to traverse diverse terrains—deserts, grasslands, water bodies, and urban landscapes—in continuous, uncut hour-long sequences. With speeds of up to 16 FPS, the system supports real-time interactivity and demonstrates zero-shot generalization, translating virtual game environments to real-world contexts where collecting continuous movement data is often infeasible. For example, The Matrix can simulate a BMW X3 driving through an office setting—an environment present in neither gaming data nor real-world sources. This approach showcases the potential of game data to advance robust world models, bridging the gap between simulations and real-world applications in scenarios with limited data.

TLDR: This paper introduces The Matrix, a foundational realistic world simulator capable of generating infinitely long 720p high-fidelity real-scene video streams with real-time, responsive control.

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Haonan Duan, Stephen Lu, Caitlin F Harrigan, Nishkrit Desai, Jiarui Lu, Michał Koziarski, Leonardo Cotta, Chris Maddison (Vector Faculty Member)

Abstract

Designing experiments and interpreting results are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs’ iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the challenge of wet-lab costs by running a dry lab of biological systems. These models, encoded in Systems Biology Markup Language, are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems, and released a total of 350 systems at https://huggingface.co/datasets/h4duan/scigym-sbml. Our evaluation shows that while more capable models demonstrated superior performance, all models’ performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.

TL;DR: We introduce a benchmark using simulated biological systems to evaluate LLMs’ scientific discovery capabilities.

MoCha: Towards Movie-Grade Talking Character Generation

Spotlight paper

Cong Wei, Bo Sun (Vector Faculty Affiliate), Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen (Vector Faculty Member)

Abstract

Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a localized audio attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labelled video datasets, we introduce a joint training strategy that leverages both speech-labelled and text-labelled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human evaluation studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, controllability and generalization.

TLDR: We introduce MoCha, the first model for dialogue-driven movie shot generation.

Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction

Yuanpei Gao, Qi Yan, Yan Leng, Renjie Liao (Vector Faculty Member)

Abstract

While deep learning methods have achieved strong performance in time series prediction, their black-box nature and inability to explicitly model underlying stochastic processes often limit their generalization to non-stationary data, especially in the presence of abrupt changes. In this work, we introduce Neural MJD, a neural network based non-stationary Merton jump diffusion (MJD) model. Our model explicitly formulates forecasting as a stochastic differential equation (SDE) simulation problem, combining a time-inhomogeneous Itô diffusion to capture non-stationary stochastic dynamics with a time-inhomogeneous compound Poisson process to model abrupt jumps. To enable tractable learning, we introduce a likelihood truncation mechanism that caps the number of jumps within small time intervals and provide a theoretical error bound for this approximation. Additionally, we propose an Euler-Maruyama with restart solver, which achieves a provably lower error bound in estimating expected states and reduced variance compared to the standard solver. Experiments on both synthetic and real-world datasets demonstrate that Neural MJD consistently outperforms state-of-the-art deep learning and statistical learning methods.

TLDR: A novel neural Merton jump diffusion SDE for probabilistic time series prediction.
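
As background, the classical Merton jump diffusion that this model builds on can be simulated with a plain Euler-Maruyama scheme. The sketch below is not the paper's method: it assumes constant parameters (the paper uses time-inhomogeneous ones), works on the log-price, and uses the standard solver rather than the restart variant. The parameter names and both functions are hypothetical.

```python
import math
import random


def poisson(rate, rng):
    """Knuth's method for sampling a Poisson count (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_log_price(x0, mu, sigma, jump_rate, jump_mean, jump_sd,
                       horizon, steps, rng):
    """Euler-Maruyama path of the log-price under a constant-parameter MJD."""
    dt = horizon / steps
    # Compensator: expected relative jump size, so the price drifts at rate mu.
    k = math.exp(jump_mean + 0.5 * jump_sd ** 2) - 1.0
    drift = mu - 0.5 * sigma ** 2 - jump_rate * k
    path = [x0]
    for _ in range(steps):
        diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        n_jumps = poisson(jump_rate * dt, rng)
        jumps = sum(rng.gauss(jump_mean, jump_sd) for _ in range(n_jumps))
        path.append(path[-1] + drift * dt + diffusion + jumps)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    path = simulate_log_price(0.0, 0.05, 0.2, 1.0, -0.1, 0.15, 1.0, 252, rng)
    print(len(path), path[-1])
```

Each step adds a Gaussian diffusion increment plus a Poisson-distributed number of Gaussian log-jumps, which is how the model represents abrupt changes alongside continuous dynamics.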

On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li (Vector Faculty Member), Christos Thrampoulidis

Abstract

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO’s widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO’s learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO’s group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
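
The "same strength" penalization mentioned above follows from how GRPO assigns credit: each response in a group receives one scalar advantage, computed relative to the group, and every token in that response shares it. A minimal sketch of that standard computation is shown below. It does not implement NTHR; the choice of population standard deviation, the `eps` term, and the function names are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std within the group
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(advantage, num_tokens):
    """Every token of a response inherits the response-level advantage."""
    return [advantage] * num_tokens


if __name__ == "__main__":
    # Two correct (reward 1) and two incorrect (reward 0) responses.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    # All five tokens of the first incorrect response get the same negative value.
    print(token_advantages(advantages[1], 5))
```

Because the negative advantage is broadcast uniformly, tokens that also appear in correct responses are penalized as strongly as the tokens that caused the error.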

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma, Xiaodan Zhu (Vector Faculty Member)

Abstract

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.

TLDR: A comprehensive study into verbal confidence in LLMs and its general robustness as well as its use as the objective for adversarial attacks.

On Traceability in ℓp Stochastic Convex Optimization

Spotlight paper

Sasha Voitovych, Mahdi Haghifam, Idan Attias, Gintare Karolina Dziugaite, Roi Livni, Dan Roy (Vector Faculty Member)

Abstract

In this paper, we investigate the necessity of traceability for accurate learning in stochastic convex optimization (SCO) under $\ell_p$ geometries. Informally, we say a learning algorithm is *$m$-traceable* if, by analyzing its output, it is possible to identify at least $m$ of its training samples. Our main results uncover a fundamental tradeoff between traceability and excess risk in SCO. For every $p\in [1,\infty)$, we establish the existence of an excess risk threshold below which every sample-efficient learner is traceable, with the number of traceable samples being a *constant fraction* of its training sample. For $p\in [1,2]$, this threshold coincides with the best excess risk of differentially private (DP) algorithms, i.e., above this threshold, there exist algorithms that are not traceable, which corresponds to a sharp phase transition. For $p \in (2,\infty)$, this threshold instead gives novel lower bounds for DP learning, partially closing an open problem in this setup. En route to establishing these results, we prove a sparse variant of the fingerprinting lemma, which is of independent interest to the community.

TLDR: We show that in stochastic convex optimization, any algorithm achieving error smaller than the best possible under differential privacy is traceable, with the number of traceable samples matching the statistical sample complexity of learning.

Online Multi-Class Selection with Group Fairness Guarantee

Faraz Zargari, Hossein Jazi, Lyndon Hallett, Bo Sun (Vector Faculty Affiliate), Xiaoqi Tan

Abstract

We study the online multi-class selection problem with group fairness guarantees, where limited resources must be allocated to sequentially arriving agents. Our work addresses two key limitations in the existing literature. First, we introduce a novel lossless rounding scheme that ensures the integral algorithm achieves the same expected performance as any fractional solution. Second, we explicitly address the challenges introduced by agents who belong to multiple classes. To this end, we develop a randomized algorithm based on a relax-and-round framework. The algorithm first computes a fractional solution using a resource reservation approach—referred to as the *set-aside* mechanism—to enforce fairness across classes. The subsequent rounding step preserves these fairness guarantees without degrading performance. Additionally, we propose a learning-augmented variant that incorporates untrusted machine-learned predictions to better balance fairness and efficiency in practical settings.

OpenCUA: Open Foundations for Computer-Use Agents

Spotlight paper

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Li Peihang, Fangyu Lei, Chen Wu, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Hu Jiarui, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Yiheng Xu, Danyang Zhang, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong (Vector Faculty Member), Y. Charles, Zhilin Yang, Tao Yu

Abstract

Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed and proprietary. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to truly open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose AgentNet, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet dataset, a dataset of 27K computer-use data samples spanning various operating systems, applications, and websites; (3) a pipeline that discretizes continuous actions into state-action pairs and synthesizes reflective long chain-of-thought (CoT) reasoning; (4) a training recipe for scalable CUA modeling; and (5) AgentNetBench, a multi-dimensional offline benchmark for faster CUA evaluation. Our AgentNet-7B, fine-tuned on AgentNet dataset, demonstrates strong performance on several CUA benchmarks, achieving a success rate of 20.1% on OSWorld and 21.1% on WindowsAgentArena. Our training recipe, particularly its advanced reasoning mechanisms and strategic data mixture, enables robust performance scaling with increased data size. Further in-depth analysis of our models also demonstrates strong cross-domain generalization and performance scaling with test-time compute. We will release the annotation tool, datasets, code, and models to build open foundations for further CUA research.

Paper2Poster: Benchmarking Multimodal Poster Generation from Long-context Papers

Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He (Vector Faculty Member), Philip Torr

Abstract

Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce Paper2Poster, the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster’s ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text–visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter–Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores; we also find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source Paper2Poster pipeline outperforms GPT-4o-based systems across nearly all metrics while consuming 87% fewer tokens. These findings chart clear directions for the next generation of fully automated poster-generation models.

Pixel Reasoner: Incentivizing Pixel Space Reasoning via Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen (Vector Faculty Member)

Abstract

Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of pixel-space reasoning. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model’s initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel-Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

TLDR: We introduce a novel reasoning paradigm — pixel-space reasoning. We identified the learning trap when cultivating this ability and proposed a curiosity-driven RL approach to address it.

Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph

Gautam Kamath (Vector Faculty Member), Alireza F. Pour, Matthew Regehr, David Woodruff

Abstract

We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a probability distribution from the set $Q$ which is nearly the closest to $p$. Previous algorithms required either $\Omega(k^2)$ queries or many rounds of interactive queries. Technically, we introduce a new object we dub the Scheffé graph, which captures the structure of the differences between distributions in $Q$, and may be of broader interest for hypothesis selection tasks.
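
For context, the classical building block behind hypothesis selection is the pairwise Scheffé test: for two candidates, look at the set where the first assigns more mass than the second, and prefer the candidate whose mass on that set is closer to the empirical frequency. The sketch below shows only this non-private test on a finite domain. It is not the paper's algorithm and does not construct the Scheffé graph; the function name is hypothetical.

```python
def scheffe_winner(q1, q2, samples):
    """Return 0 if q1 wins the Scheffé test against q2 on the samples, else 1.

    q1 and q2 are probability vectors over {0, ..., n-1}; samples are draws
    from the unknown distribution p over the same domain.
    """
    # Scheffé set: outcomes where q1 puts strictly more mass than q2.
    scheffe_set = {x for x in range(len(q1)) if q1[x] > q2[x]}
    empirical = sum(x in scheffe_set for x in samples) / len(samples)
    mass1 = sum(q1[x] for x in scheffe_set)
    mass2 = sum(q2[x] for x in scheffe_set)
    # Prefer the hypothesis whose mass on the set matches the data better.
    return 0 if abs(mass1 - empirical) <= abs(mass2 - empirical) else 1


if __name__ == "__main__":
    q1 = [0.7, 0.2, 0.1]
    q2 = [0.1, 0.2, 0.7]
    samples = [0] * 7 + [1] * 2 + [2] * 1  # data that looks like q1
    print(scheffe_winner(q1, q2, samples))
```

Running this test on every pair of $k$ candidates needs on the order of $k^2$ comparisons, which is the quadratic cost the abstract contrasts against.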

Reconstructing Heterogeneous Biomolecules via Hierarchical Gaussian Mixtures and Part Discovery

Shayan Shekarforoush, David Lindell (Vector Faculty Affiliate), Marcus Brubaker (Vector Faculty Member), David Fleet (Vector Faculty Member)

Abstract

Cryo-EM is a transformational paradigm in molecular biology where computational methods are used to infer 3D molecular structure at atomic resolution from extremely noisy 2D electron microscope images. At the forefront of research is how to model the structure when the imaged particles exhibit non-rigid conformational flexibility and compositional variation where parts are sometimes missing. We introduce a novel 3D reconstruction framework with a hierarchical Gaussian mixture model, inspired in part by Gaussian Splatting for 4D scene reconstruction. In particular, the structure of the model is grounded in an initial process that infers a part-based segmentation of the particle, providing essential inductive bias in order to handle both conformational and compositional variability. The framework, called CryoSPIRE, is shown to reveal biologically meaningful structures on complex experimental datasets, and establishes a new state-of-the-art on CryoBench, a benchmark for cryo-EM heterogeneity methods.

TLDR: We present a part-aware hierarchical GMM-based density model to tackle cryo-EM heterogeneous reconstruction.

Reducing the Probability of Bad Outputs in Language Models Using Probabilistic Inference

Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse (Vector Faculty Member)

Abstract

To avoid bad language model (LM) outputs, there are many alignment approaches (e.g., RLHF, DPO). Ideally, we would like our LM to have zero probability of undesirable outputs. Standard reinforcement learning (RL) would achieve this at optimality (if unregularized). However, in practice, there may be a tradeoff between methods focusing on the expected reward (standard RL) and methods explicitly focused on reducing the probability of undesired outputs. Our goal is to improve this tradeoff, reducing the probability of bad outputs as much as possible, while maintaining performance on expected reward. To do this, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs’ probability. We run experiments to test whether our method provides better reduction of the probability of bad outputs and adversarial robustness, at minimal cost to expected reward, compared to standard RL alignment approaches and other alternatives.

A Reinforcement Learning-based Bidding Strategy for Data Consumers in Auction-based Federated Learning

Xiaoli Tang, Han Yu, Xiaoxiao Li (Vector Faculty Member)

Abstract

Auction-based Federated Learning (AFL) fosters collaboration among self-interested data consumers (DCs) and data owners (DOs). A major challenge in AFL pertains to how DCs select and bid for DOs. Existing methods are generally static, making them ill-suited for dynamic AFL markets. To address this issue, we propose the Reinforcement Learning-based Bidding Strategy for DCs in Auction-based Federated Learning (RLB-AFL). We incorporate historical states into a Deep Q-Network to capture sequential information critical for bidding decisions. To mitigate state space sparsity, where specific states rarely reoccur for each DC during auctions, we incorporate the Gaussian Mixture Model into RLB-AFL. This facilitates soft clustering on sequential states, reducing the state space dimensionality and easing exploration and action-value function approximation. In addition, we enhance the $\epsilon$-greedy policy to help the RLB-AFL agent balance exploitation and exploration, enabling it to be more adaptable in the AFL decision-making process. Extensive experiments under 6 widely used benchmark datasets demonstrate that RLB-AFL achieves superior performance compared to 8 state-of-the-art approaches. It outperforms the best baseline by 10.56% and 3.15% in terms of average total utility

Reliably detecting model failures in deployment without labels

Viet Nguyen, Changjian Shui, Vijay Giri, Siddharth Arya, Amol Verma (Vector Faculty Affiliate), Fahad Razak (Vector Faculty Affiliate), Rahul Krishnan (Vector Faculty Member)

Abstract

The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, which achieves low false positive rates under non-deteriorating shifts, and we provide sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.

TLDR: D-PDDM provably monitors model deterioration requiring no training data during deployment, and performs well on real-world datasets.

ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains

Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer (Vector Faculty Member), Behzad Bozorgtabar

Abstract

This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models (an adaptive test-time model ensemble) that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, thereby enabling domain-specific adaptation. This multi-model strategy overcomes key limitations of single-model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on the classification corruption benchmarks, including ImageNet-C and CIFAR-10/100-C, as well as the Cityscapes→ACDC semantic segmentation task, covering recurring and continuously evolving domain shifts, demonstrate that ReservoirTTA significantly improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods. The code will be released upon acceptance.

TLDR: ReservoirTTA extends test-time adaptation to multiple model adaptation with a fully test-time reservoir of domain-specialist models for robust prolonged/long-horizon adaptation.

RETRO SYNFLOW: Discrete Flow-Matching for Accurate and Diverse Single-Step Retrosynthesis

Robin Yadav, Qi Yan, Guy Wolf, Joey Bose, Renjie Liao (Vector Faculty Member)

Abstract

A fundamental challenge in organic chemistry is identifying and predicting the sequence of reactions that synthesize a desired target molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction—i.e., single-step retrosynthesis—remains difficult, even for state-of-the-art template-free generative methods. These models often struggle to produce an accurate yet diverse set of feasible reactions in a chemically rational manner. In this paper, we propose RETRO SYNFLOW (RSF), a discrete flow-matching framework that formulates single-step retrosynthesis as a Markov bridge between a given product molecule and its corresponding reactants. Unlike prior approaches, RSF introduces a reaction center identification step to extract intermediate structures, or synthons, which serve as a more informative and structured source distribution for the discrete flow model. To further improve the diversity and chemical feasibility of generated samples, RSF incorporates Feynman-Kac (FK) steering with Sequential Monte Carlo (SMC) resampling at inference time. This approach leverages a learned forward-synthesis reward oracle to guide the generation process toward more promising reactant candidates. Empirically, RSF substantially outperforms the previous state-of-the-art methods in top-1 accuracy. In addition, FK-steering significantly improves round-trip accuracy, demonstrating stronger chemical validity and synthetic feasibility, all while maintaining competitive top-k performance. These results establish RSF as a new leading approach for single-step retrosynthesis prediction.

Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA

Shuangyi Chen, Yuanxin Guo, Yue Ju, Hardik Dalal, Zhongwen Zhu, Ashish Khisti (Vector Faculty Affiliate)

Abstract

Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) optimize federated training by reducing computational and communication costs. We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters. Our approach emphasizes the importance of learning up and down projection matrices to enhance expressiveness and robustness. We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches that either generate imperfect model updates or limit expressiveness of the model. We provide a theoretical analysis on a linear model to highlight the importance of learning both the down-projection and up-projection matrices in LoRA. We validate the insights on a non-linear model and separately provide a convergence proof under general conditions. To bridge theory and practice, we conducted extensive experimental evaluations on language models, including RoBERTa-Large and Llama-2-7B, on diverse tasks and FL settings to demonstrate the advantages of RoLoRA over other methods.

TLDR: RoLoRA improves federated fine-tuning via alternating optimization of LoRA, enhancing expressiveness and robustness. It reduces communication costs by half and outperforms alternatives.
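One plausible reading of the "imperfect model updates" mentioned in the abstract is that averaging the two LoRA factors separately across clients does not equal averaging their products. The NumPy sketch below illustrates that arithmetic under this assumption. The shapes, client count, and variable names are hypothetical, and this is not the RoLoRA training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, clients = 6, 2, 3  # hypothetical sizes chosen for illustration

# Each client holds its own up-projection B_i (d x r) and down-projection A_i (r x d).
B = [rng.normal(size=(d, r)) for _ in range(clients)]
A = [rng.normal(size=(r, d)) for _ in range(clients)]

# The update the server would ideally apply: the mean of the products B_i @ A_i.
ideal = np.mean([b @ a for b, a in zip(B, A)], axis=0)

# Averaging both factors separately gives mean(B) @ mean(A), which generally differs.
naive = np.mean(B, axis=0) @ np.mean(A, axis=0)
naive_error = np.linalg.norm(ideal - naive)

# If every client shares the same frozen A in a round and trains only B,
# averaging B is exact because the product is linear in B.
shared_A = A[0]
ideal_shared = np.mean([b @ shared_A for b in B], axis=0)
alternating = np.mean(B, axis=0) @ shared_A
alternating_error = np.linalg.norm(ideal_shared - alternating)

print(f"separate averaging error: {naive_error:.4f}")
print(f"shared-A averaging error: {alternating_error:.2e}")
```

Alternating which factor is frozen from round to round keeps the aggregation exact while still letting both matrices be learned over time.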

SAFE: Multitask Failure Detection for Vision-Language-Action Models

Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski (Vector Faculty Member), Haruki Nishimura, Masha Itkina, Florian Shkurti (Vector Faculty Member)

Abstract

While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction.

Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Christopher Chiu, Silviu Pitis (CIFAR AI Safety Postdoctoral Fellow), Mihaela van der Schaar

Abstract

Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failing to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.

TLDR: We introduce VivaBench, an extendable benchmark that simulates multi-turn medical conversations. We demonstrate that LLM agents are clinically knowledgeable, but limited in their ability to gather information and diagnose from incomplete presentations.

Solving Discrete (Semi) Unbalanced Optimal Transport with Equivalent Transformation Mechanism and KKT-Multiplier Regularization

Weiming Liu, Xinting Liao (Vector Distinguished Postdoctoral Fellow), Jun Dan, Fan Wang, Hua Yu, Junhao Dong, Shunjie Dong, Lianyong Qi, Yew Soon Ong

Abstract

Semi-Unbalanced Optimal Transport (SemiUOT) shows great promise in matching two probability measures by relaxing one of the marginal constraints. Previous solvers often incorporate an entropy regularization term, which can result in inaccurate matching solutions. To address this issue, we focus on determining the marginal probability distribution of SemiUOT with KL divergence using the proposed Equivalent Transformation Mechanism (ETM) approach. Furthermore, we extend the ETM-based method to determine the marginal probability distribution of Unbalanced Optimal Transport (UOT) with KL divergence, validating its generalization. Once the marginal probabilities of UOT/SemiUOT are determined, the problem can be transformed into a classical Optimal Transport (OT) problem. Moreover, we propose a KKT-Multiplier regularization term combined with Multiplier Regularized Optimal Transport (MROT) to achieve more accurate matching results. We conduct several numerical experiments to demonstrate the effectiveness of our proposed methods in addressing UOT/SemiUOT problems.

TLDR: We propose Equivalent Transformation Mechanism with KKT-Multiplier Regularization for solving SemiUOT and UOT.

STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Spotlight paper

Hossein Goli, Michael Gimelfarb, Nathan de Lara, Haruki Nishimura, Masha Itkina, Florian Shkurti (Vector Faculty Member)

Abstract

Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE introduces two technical innovations that make it advantageous for OPE: (1) it prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) it generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.

TLDR: We introduce STITCH-OPE, a guided-diffusion framework for off-policy evaluation that stitches short behavior-conditioned sub-trajectories, uses negative-behavior guidance to correct distribution shift, and outperforms baselines across all metrics.

Token Perturbation Guidance for Diffusion Models

Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, Babak Taati (Vector Faculty Affiliate)

Abstract

Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We also analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. We extensively evaluate TPG on SDXL and Stable Diffusion 2.1, demonstrating nearly a 2x improvement in FID for unconditional generation over the SDXL baseline and showing that TPG closely matches CFG in prompt alignment. Thus, TPG represents a general, condition-agnostic guidance method that extends CFG-like benefits to a broader class of diffusion models.

TLDR: Token Perturbation Guidance (TPG) is a novel framework that applies perturbations directly in token space to guide the diffusion sampling process.
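To make the idea of a norm-preserving shuffle concrete, here is a minimal NumPy sketch. It assumes the perturbation is a random permutation of token positions, which is one operation that leaves the norm of the token matrix unchanged. The function name `shuffle_tokens` and the array shapes are hypothetical, and the sketch is not the authors' implementation.

```python
import numpy as np

def shuffle_tokens(tokens, rng):
    """Randomly permute the rows (token positions) of a (num_tokens, dim) array."""
    perm = rng.permutation(tokens.shape[0])
    return tokens[perm]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # hypothetical intermediate token representations
shuffled = shuffle_tokens(tokens, rng)

# A permutation only reorders rows, so the overall (Frobenius) norm is unchanged.
norm_before = np.linalg.norm(tokens)
norm_after = np.linalg.norm(shuffled)
print(np.isclose(norm_before, norm_after))
```

Because the perturbed representation keeps the same scale as the original, the network sees inputs within its usual range, which is consistent with the stability the abstract attributes to the operation.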

Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski (Vector Faculty Member)

Abstract

Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as personalization or subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods.

TLDR: We present TIRE, a novel method for subject-driven 3D/4D generation that preserves subject identity well.

Uncoupled and Convergent Learning in Monotone Games under Bandit Feedback

Jing Dong, Baoxiang Wang, Yaoliang Yu (Vector Faculty Member)

Abstract

We study the problem of no-regret learning algorithms for general monotone and smooth games and their last-iterate convergence properties. Specifically, we investigate the problem under bandit feedback and strongly uncoupled dynamics, which allows modular development of the multi-player system that applies to a wide range of real applications. We propose a mirror-descent-based algorithm, which converges in $O(T^{-1/4})$ and is also no-regret. The result is achieved by a dedicated use of two regularizations and the analysis of the fixed point thereof. The convergence rate is further improved to $O(T^{-1/2})$ in the case of strongly monotone games. Motivated by practical tasks where the game evolves over time, the algorithm is extended to time-varying monotone games. We provide the first non-asymptotic result in converging monotone games and give improved results for equilibrium tracking games.

Unifying Proportional Fairness in Centroid and Non-Centroid Clustering

Spotlight paper

Benjamin Cookson, Nisarg Shah (Vector Faculty Affiliate), Ziqi Yu

Abstract

Proportional fairness criteria inspired by democratic ideals of proportional representation have received growing attention in the clustering literature. Prior work has investigated them in two separate paradigms. Chen et al. [ICML 2019] study centroid clustering, in which each data point’s loss is determined by its distance to a representative point (centroid) chosen in its cluster. Caragiannis et al. [NeurIPS 2024] study non-centroid clustering, in which each data point’s loss is determined by its maximum distance to any other data point in its cluster. We generalize both paradigms to introduce semi-centroid clustering, in which each data point’s loss is a combination of its centroid and non-centroid losses, and study two proportional fairness criteria: the core and its relaxation, fully justified representation (FJR). Our main result is a novel algorithm which achieves a constant approximation to the core, in polynomial time, even when the distance metrics used for centroid and non-centroid loss measurements are different. We also derive improved results for more restricted loss functions and the weaker FJR criterion, and establish lower bounds in each case.

TLDR: We design proportionally fair clustering methods when each agent’s loss function is determined by both its distance from the other agents in its cluster and to a representative agent in its cluster.

UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

Spotlight paper

Kai He, Ruofan Liang, Jacob Munkberg, Jon Hasselgren, Nandita Vijaykumar (Vector Faculty Member), Alexander Keller, Sanja Fidler (Vector Faculty Member), Igor Gilitschenski (Vector Faculty Member), Zan Gojcic, Zian Wang

Abstract

We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.

Versatile Transferable Unlearnable Example Generator

Zhihao Li, Jiale Cai, Gezheng Xu, Hao Zheng, Qiuyue Li, Fan Zhou, Shichun Yang, Charles Ling, Boyu Wang (Vector Faculty Affiliate)

Abstract

The rapid growth of publicly available data has fueled deep learning advancements but also raises concerns about unauthorized data usage. Unlearnable Examples (UEs) have emerged as a data protection strategy that introduces imperceptible perturbations to prevent unauthorized learning. However, most existing UE methods produce perturbations strongly tied to specific training sets, leading to a significant drop in unlearnability when applied to unseen data or tasks. In this paper, we argue that for broad applicability, UEs should maintain their effectiveness across diverse application scenarios. To this end, we conduct the first comprehensive study on the transferability of UEs across diverse and practical yet demanding settings. Specifically, we identify key scenarios that pose significant challenges for existing UE methods, including varying styles, out-of-distribution classes, resolutions, and architectures. Moreover, we propose Versatile Transferable Generator (VTG), a transferable generator designed to safeguard data across various conditions. Specifically, VTG integrates adversarial domain augmentation into the generator’s training process to synthesize out-of-distribution samples, thereby improving its generalizability to unseen scenarios. Furthermore, we propose a Perturbation-Label Coupling mechanism that leverages contrastive learning to directly align perturbations with class labels. This approach reduces the generator’s reliance on data semantics, allowing VTG to produce unlearnable perturbations in a distribution-agnostic manner. Extensive experiments demonstrate the effectiveness and broad applicability of our approach.

TLDR: A versatile perturbation generator that achieves unlearnability across diverse scenarios.

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Spotlight paper

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen (Vector Faculty Member)

Abstract

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1’s performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. We conduct comprehensive ablations and analysis to provide insights into the effectiveness of our approach.
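The abstract does not spell out the vanishing advantages problem. The sketch below assumes the commonly used group-normalized advantage in GRPO, where each rollout's reward is standardized against the other rollouts for the same prompt. Under that assumption, a group whose rollouts all receive the same reward yields zero advantage and so contributes no learning signal. The function name and the `eps` constant are illustrative, and Selective Sample Replay itself is not implemented here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group with mixed outcomes gives informative, non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout is correct gives all-zero advantages.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

As training progresses and more groups become uniformly correct or uniformly wrong, a growing share of samples would fall into the second case under this formulation.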

What Does It Take to Build a Performant Selective Classifier?

Stephan Rabanser, Nicolas Papernot (Vector Faculty Member)

Abstract

Selective classifiers improve reliability by abstaining on uncertain inputs, yet their performance often lags behind the perfect-ordering oracle that accepts examples in exact order of correctness. We formulate this shortfall as a coverage-uniform selective-classification gap and prove the first finite-sample decomposition that pinpoints five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation or shift-induced slack. Our bound shows that monotone post-hoc calibration cannot reduce the gap, as it preserves the original score ordering; closing the gap therefore requires scoring mechanisms that can modify the ranking induced by the base model. We validate our gap decomposition on synthetic two-moons data and real-world vision benchmarks, isolating each error component via controlled experiments. Results confirm that (i) Bayes noise and limited model capacity alone explain large gaps, (ii) only non-monotone or feature-aware calibrators shrink the ranking term, and (iii) distribution shift adds a distinct slack that must be addressed by robust training. Our decomposition supplies a quantitative error budget and concrete design guidelines for building selective classifiers that approach ideal oracle behavior.

TLDR: We decompose the gap between selective classifiers and the ideal oracle into five measurable sources, showing that only non-monotone scoring methods can reduce it and improve reliability.
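The claim that monotone post-hoc calibration cannot close the gap can be checked on a toy example. The sketch below accepts the highest-scoring fraction of examples and measures accuracy on the accepted set. The scores, correctness labels, and the squaring transform (a stand-in for any monotone recalibration) are made-up illustrations, not data or methods from the paper.

```python
import numpy as np

def selective_accuracy(scores, correct, coverage):
    """Accuracy on the top `coverage` fraction of examples ranked by score."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    n_accept = max(1, int(round(coverage * len(scores))))
    accepted = np.argsort(-scores, kind="stable")[:n_accept]
    return correct[accepted].mean()

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.3])   # hypothetical confidences
correct = np.array([1, 0, 1, 1, 0, 1], dtype=bool)   # hypothetical correctness

base = selective_accuracy(scores, correct, coverage=0.5)

# Squaring is monotone on positive scores, so the ranking and accepted set are unchanged.
calibrated = selective_accuracy(scores ** 2, correct, coverage=0.5)

# A perfect-ordering oracle ranks every correct example ahead of every incorrect one.
oracle = selective_accuracy(correct.astype(float), correct, coverage=0.5)

print(base, calibrated, oracle)
```

The base and recalibrated scores accept the same examples and therefore have identical selective accuracy, while the oracle does strictly better at this coverage. Only a score that reorders examples can move toward the oracle.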

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse (Vector Faculty Member), Eric Xing

Abstract

Large language models (LLMs) are trained on a vast amount of human-written data, but data providers often remain uncredited. In response to this issue, data valuation (or data attribution), which quantifies the contribution or value of each data point to the model output, has been discussed as a potential solution. Nevertheless, applying existing data valuation methods to recent LLMs and their vast training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based data valuation method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation of gradient projection approaches to influence functions to promote trust in the data valuation process. Lastly, we lower the barrier to implementing data valuation systems by introducing LogIX, a software package that can transform existing training code into data valuation code with minimal effort. In our data valuation experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to 6,500x improvement in throughput and 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and the 1B-token dataset.

TLDR: We scale the influence-function-based data valuation method to recent LLMs and their massive training datasets.

When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective

Alireza Mousavi-Hosseini, Clayton Sanford, Denny Wu, Murat Erdogdu (Vector Faculty Member)

Abstract

Theoretical efforts to prove advantages of Transformers in comparison with classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative perspective and prove that even with infinite compute, feedforward and recurrent networks may suffer from larger sample complexity compared to Transformers, as the latter can adapt to a form of dynamic sparsity. Specifically, we consider a sequence-to-sequence data generating model on sequences of length $N$, where the output at each position only depends on $q \ll N$ relevant tokens, and the positions of these tokens are described in the input prompt. We prove that a single-layer Transformer can learn this model if and only if its number of attention heads is at least $q$, in which case it achieves a sample complexity almost independent of $N$, while recurrent networks require $N^{\Omega(1)}$ samples on the same problem. If we simplify this model, recurrent networks may achieve a complexity almost independent of $N$, while feedforward networks still require $N$ samples. Our proposed sparse retrieval model illustrates a natural hierarchy in sample complexity across these architectures.

TLDR: We prove a purely statistical separation between Transformers and other architectures such as feedforward and recurrent networks, where Transformers are more sample-efficient at learning sparse sequence models.