Vector researchers dive into deep learning at ICLR 2025

August 12, 2025


Vector researchers made significant contributions to this year’s International Conference on Learning Representations (ICLR), the world’s leading venue for representation learning and deep learning research, which took place April 24-28, 2025 in Singapore. As the premier conference exploring how machines learn meaningful representations of data, ICLR brought together the global community working on the theoretical foundations and practical applications of deep learning.

Vector’s research portfolio at ICLR 2025 demonstrated the institute’s leadership in core areas of representation learning – from foundational work on neural architectures, optimization, and theoretical understanding to innovative applications spanning multimodal AI, scientific discovery, and responsible machine learning. The accepted papers reflected Vector’s commitment to advancing both the science of how neural networks learn representations and the development of trustworthy AI systems that benefit society.

Below you will find 71 accepted papers, including collaborations, from Vector Faculty Members, Vector Faculty Affiliates, Vector Distinguished Postdoctoral Fellows, and Vector’s AI Engineering Team.

ACES: Automatic Cohort Extraction System for Event-Stream Datasets

Justin Xu, Jack Gallifant, Alistair Johnson (Vector Faculty Affiliate), Matthew McDermott

Abstract

Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task/cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. This paper addresses a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of task/cohort definitions for ML in healthcare and enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to any dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks that learn representations, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality.
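ACES defines cohorts through its own configuration language rather than code, but the core extraction logic is easy to picture. Below is a minimal, hypothetical Python sketch of windowed inclusion/exclusion filtering over event-stream records; the record layout, predicate names, and windowing rule are illustrative assumptions, not the ACES schema or API.

```python
from datetime import datetime, timedelta

# Hypothetical event-stream records (patient_id, timestamp, code); the
# layout and code names are illustrative, not the ACES/MEDS schema.
events = [
    ("p1", datetime(2024, 1, 1), "ICU_ADMISSION"),
    ("p1", datetime(2024, 1, 3), "LAB_CREATININE_HIGH"),
    ("p2", datetime(2024, 2, 1), "ICU_ADMISSION"),
]

# Dataset-specific "predicates" map raw codes to task-level concepts.
predicates = {
    "icu_admission": lambda e: e[2] == "ICU_ADMISSION",
    "aki_marker": lambda e: e[2] == "LAB_CREATININE_HIGH",
}

def extract_cohort(events, inclusion, exclusion, window=timedelta(days=7)):
    """Keep patients with an inclusion event and no exclusion event within
    `window` of it: a toy version of windowed inclusion/exclusion criteria."""
    by_patient = {}
    for e in events:
        by_patient.setdefault(e[0], []).append(e)
    cohort = []
    for pid, evs in by_patient.items():
        for anchor in (e for e in evs if predicates[inclusion](e)):
            in_window = [e for e in evs
                         if anchor[1] <= e[1] <= anchor[1] + window]
            if not any(predicates[exclusion](e) for e in in_window):
                cohort.append((pid, anchor[1]))
    return cohort

# p1 has an AKI marker inside the window, so only p2 qualifies.
print(extract_cohort(events, "icu_admission", "aki_marker"))
```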

Action abstractions for amortized sampling

Oussama Boussif, Léna Ezzine, Joseph Viviano, Michał Koziarski (Vector Faculty Affiliate), Moksh Jain, Nikolay Malkin, Emmanuel Bengio, Rim Assouel, Yoshua Bengio

Abstract

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which takes many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and ‘chunking’ them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are potentially interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.
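To make the ‘chunking’ idea concrete, here is a toy sketch of mining recurring action subsequences from high-reward trajectories and promoting them to single macro-actions. The n-gram counting rule and thresholds are simplifying assumptions, not the paper’s exact procedure.

```python
from collections import Counter

def mine_chunks(trajectories, rewards, n=3, top_k=2, reward_thresh=1.0):
    """Toy version of action 'chunking': count length-n action subsequences
    that recur across high-reward trajectories and promote the most common
    ones to new single actions. (Illustrative, not the paper's exact rule.)"""
    counts = Counter()
    for traj, r in zip(trajectories, rewards):
        if r < reward_thresh:
            continue
        for i in range(len(traj) - n + 1):
            counts[tuple(traj[i:i + n])] += 1
    return [chunk for chunk, _ in counts.most_common(top_k)]

# Primitive actions are ints; two high-reward trajectories share (1, 2, 3).
trajs = [[0, 1, 2, 3, 0], [1, 2, 3, 3, 1], [0, 0, 0, 0, 0]]
rewards = [2.0, 1.5, 0.1]
macros = mine_chunks(trajs, rewards)
action_space = list(range(4)) + macros   # each macro now acts as one action
print(macros)
print(action_space)
```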

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Fengyuan Liu, Nikhil Kandpal, Colin Raffel (Vector Faculty Member)

Abstract

The influence of contextual input on the behavior of large language models (LLMs) has prompted the development of context attribution methods that aim to quantify each context span’s effect on an LLM’s generations. The leave-one-out (LOO) error, which measures the change in the likelihood of the LLM’s response when a given span of the context is removed, provides a principled way to perform context attribution, but can be prohibitively expensive to compute for large models. In this work, we introduce AttriBoT, a series of novel techniques for efficiently computing an approximation of the LOO error for context attribution. Specifically, AttriBoT uses cached activations to avoid redundant operations, performs hierarchical attribution to reduce computation, and emulates the behavior of large target models with smaller proxy models. Taken together, AttriBoT can provide a 300x speedup while remaining more faithful to a target model’s LOO error than prior context attribution methods. This stark increase in performance makes computing context attributions for a given response 30x faster than generating the response itself, empowering real-world applications that require computing attributions at scale. We release a user-friendly and efficient implementation of AttriBoT to enable efficient LLM interpretability as well as encourage future development of efficient context attribution methods.
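The quantity being approximated is straightforward to state in code. The sketch below computes exact leave-one-out attributions with a toy scoring function standing in for an LLM’s log-likelihood; AttriBoT’s actual contribution (activation caching, hierarchical attribution, proxy models) is about making this loop cheap and is not shown.

```python
import math

def loo_attribution(spans, response, loglik):
    """Exact leave-one-out attribution: drop each context span and measure
    the change in the response log-likelihood. `loglik` stands in for an
    LLM scoring call; AttriBoT approximates this loop cheaply."""
    base = loglik(spans, response)
    return [base - loglik(spans[:i] + spans[i + 1:], response)
            for i in range(len(spans))]

def toy_loglik(spans, response):
    # Toy scorer: the response is "likelier" the more word overlap it has
    # with the remaining context (a stand-in for a model's likelihood).
    ctx = " ".join(spans).split()
    overlap = sum(1 for w in response.split() if w in ctx)
    return math.log(1 + overlap)

spans = ["the sky is blue", "cats chase mice", "water boils at 100 C"]
scores = loo_attribution(spans, "why is the sky blue", toy_loglik)
print(scores)  # the first span receives the largest attribution
```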

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, Jeff Clune (Vector Faculty Member)

Abstract

Researchers are investing substantial effort in developing powerful general-purpose agents, wherein Foundation Models are used as modules within agentic systems (e.g. Chain-of-Thought, Self-Reflection, Toolformer). However, the history of machine learning teaches us that hand-designed solutions are eventually replaced by learned solutions. We describe a newly forming research area, Automated Design of Agentic Systems (ADAS), which aims to automatically create powerful agentic system designs, including inventing novel building blocks and/or combining them in new ways. We further demonstrate that there is an unexplored yet promising approach within ADAS where agents can be defined in code and new agents can be automatically discovered by a meta agent programming ever better ones in code. Given that programming languages are Turing Complete, this approach theoretically enables the learning of any possible agentic system: including novel prompts, tool use, workflows, and combinations thereof. We present a simple yet effective algorithm named Meta Agent Search to demonstrate this idea, where a meta agent iteratively programs interesting new agents based on an ever-growing archive of previous discoveries. Through extensive experiments across multiple domains including coding, science, and math, we show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents. Importantly, we consistently observe the surprising result that agents invented by Meta Agent Search maintain superior performance even when transferred across domains and models, demonstrating their robustness and generality. Provided we develop it safely, our work illustrates the potential of an exciting new research direction toward automatically designing ever-more powerful agentic systems to benefit humanity.
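The core of Meta Agent Search is an archive-driven propose-evaluate loop. In the paper the proposer is an LLM that writes complete agent programs in code; the sketch below replaces it with a trivial numeric perturbation (a hypothetical stand-in) purely to show the loop structure.

```python
import random

random.seed(0)

def meta_agent_propose(archive):
    """Stand-in for the meta agent: in ADAS this is an LLM that writes a
    new agent *in code*, conditioned on the archive of past discoveries.
    Here we just perturb the best previous agent's single parameter."""
    best = max(archive, key=lambda a: a["score"])
    return {"threshold": best["threshold"] + random.uniform(-0.2, 0.2)}

def evaluate(agent):
    # Toy task: the score peaks when the agent's threshold is near 0.7.
    return 1.0 - abs(agent["threshold"] - 0.7)

archive = [{"threshold": 0.1, "score": evaluate({"threshold": 0.1})}]
for _ in range(50):                      # Meta Agent Search main loop
    candidate = meta_agent_propose(archive)
    candidate["score"] = evaluate(candidate)
    archive.append(candidate)            # ever-growing archive

print(max(archive, key=lambda a: a["score"]))
```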

Bayesian Optimization via Continual Variational Last Layer Training

Spotlight paper

Paul Brunzema, Mikkel Jordahn, John Willes (Vector Professional Staff), Sebastian Trimpe, Jasper Snoek, James Harrison

Abstract

Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate models for Bayesian optimization (BO) due to their ability to model uncertainty, their strong performance on tasks where correlations are easily captured (such as those defined by Euclidean metrics), and their ability to be efficiently updated online. However, the performance of GPs depends on the choice of kernel, and kernel selection for complex correlation structures is often difficult or must be made bespoke. While Bayesian neural networks (BNNs) are a promising direction for higher capacity surrogate models, they have so far seen limited use due to poor performance on some problem types. In this paper, we propose an approach which shows competitive performance on many problem types, including some that BNNs typically struggle with. We build on variational Bayesian last layers (VBLLs), and connect training of these models to exact conditioning in GPs. We exploit this connection to develop an efficient online training algorithm that interleaves conditioning and optimization. Our findings suggest that VBLL networks significantly outperform GPs and other BNN architectures on tasks with complex input correlations, and match the performance of well-tuned GPs on established benchmark tasks.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors

Clemencia Siro, Guy Gur-Ari, Gaurav Mishra, Stuart Shieber, Jason Phang, Zijie Wang, Kory Mathewson, Giorgio Mariani, Allen Nie, James Y Zou, Behnam Neyshabur, Karl Krauth, Shixiang Gu, Pablo Antonio Moreno Casares, Maarten Sap, Mohit Tiwari, Bill Yuchen Lin, Aykut Erdem, Angelica Chen, Swaroop Mishra, Chenlin Meng, Ashish Sabharwal, James Simon, Louis-Philippe Morency, Kyle Richardson, Emanuele Rodolà, Adam Fisch, Simone Melzi, Kristen Chiafullo, Rif A. Saurous, Shubh Pachchigar, Siamak Shakeri, Aitor Lewkowycz, Yonatan Belinkov, Mihir Kale, Mantas Mazeika, Dar Gilboa, Hongming Zhang, Seung Jae Lee, Owain Evans, Ambrose Slone, David Dohan, Damien Sileo, Mor Geva, Cameron Diao, Christopher Potts, Jekaterina Novikova, Alicia Parrish, Debajyoti Datta, Chitta Baral, Maarten Bosma, Michael Strube, Jiacheng Xu, Trishala Neeraj, Colin Raffel (Vector Faculty Member), Leo Gao, Vishakh Padmakumar, Yu Hou, Christopher Waites, Ellie Pavlick, Pouya Pezeshkpour, Nanyun (Violet) Peng, Gerard de Melo, Martin Potthast, Aarohi Srivastava, Abhinav Rastogi, Abu Awal Md Shoeb, Adam Brown, Adam Santoro, Aditya Gupta, Agnieszka Kluska, Diyi Yang, Akshat Agarwal, Alexander Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew La, Ethan Dyer, Angela Jiang, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Austin Herrick, Avia Efrat, Ayla Karakaş, B. Roberts, Bao Loe, Bartłomiej Bojanowski, Benjamin Inden, Benno Stein, Batuhan Özyurt, Behnam Hedayatnia, Blake Howald, Bryan Orinion, Cameron Dour, Catherine Stinson, Cedrick Argueta, Cesar Ferri, Chandan Singh, Charles Rathkopf, Christian Voigt, Cindy Ramirez, Clara Rivera, Noah Fiedel, Courtney Ashcraft, Dan Garrette, Dan Kilman, C. Freeman, Daniel Levy, Daniel González, Danielle Perszyk, Danny Hernandez, David Jurgens, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Mátyás Schubert, Derek Tam, Dilyar Buzan, Shyam Upadhyay, Dimitri Coelho Mollo, Dylan Schrader, Ekaterina Shutova, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Emma Lam, Eric Tang, Ernie Chang, Ethan Chi, Ethan Jerzak, Ethan Kim, Eunice Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fernando Martínez-Plumed, Francesca Happé, Gloria X Wang, Gonzalo Jaimovitch-Lopez, Gregor Betz, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hayden Bogar, Henry Shevlin, Hiromu Yakura, Hugh Wong, Kumar Shridhar, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, James Zheng, Jan Kocon, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesujoba Alabi, Jillian Tang, Joan Waweru, John Burden, Dieuwke Hupkes, John Balis, Jonathan Batchelder, Jörg Frohberg, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua Rule, Joyce Chua, Kamil Kanclerz, Karthik Gopalakrishnan, Katerina Ignatyeva, Li Zhang, Liam Dugan, Katja Markert, Kaustubh Dhole, Lucas Lam, Kevin Omondi, Kyle McDonell, Laria Reynolds, Lianhui Qin, Lidia Contreras-Ochando, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros-Colón, Lütfi Kerem Senel, Maria Jose Ramirez-Quintana, Maartje Ter Hoeve, Mohit Bansal, Martha Lewis, Maheen Farooqi, Marco Baturan, Marco Marelli, Marco Maru, Marie Tolkiehn, Michael A. Yee, Mario Giulianelli, Michael Gu, Michael Ivanitskiy, Matthias Hagen, Medina Baitemirova, Mike Cain, Mimee Xu, Mitch Walker, Moin Aminnaseri, Mozhdeh Gheini, Nathan Chi, Michael Starritt, Michał Swędrowski, Michele Bevilacqua, Nayeon Lee, Neta Krakover, Nicholas Cameron, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niveditha Iyer, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Parth Doshi, Pascale Fung, Pegah Alipoormolabashi, Liao Peiyuan, Peter W Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Priti Oli, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Paul Pu Liang, Rowan Jacobs, Ryan Stovall, Rylan Yang, Saif Mohammad, Sajant Anand, Sam Dillavou, Sam Wiseman, Samuel Gruetter, Sanghyun Han, Mukund Varma T, Sanjeev Kwatra, Sarah Rous, Sarik Ghazarian, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sepideh Sadeghi, Shadi Hamdan, Sherry Shi, Shikhar Singh, Daphne Ippolito, Shima Asaadi, Shyamolima Debnath, Simon Thormeyer, Sneha Makini, Soo-Hwan Lee, Spencer Torene, Stanislas Dehaene, Stefan Divic, Hanna Hajishirzi, Stephanie Lin, Stephen Prasad, Andrew Dai, Steven Piantadosi, Summer Misherghi, Svetlana Kiritchenko, Tao Li, Tariq Ali, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Adrià Garriga-Alonso, Tiberius Nkinyili, Timofei Kornev, Titus Tunduny, Trenton Chang, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Victoria Nyamai, Vikas Raunak, vinay prabhu, William Saunders, William Zhang, Wout Vossen, Xiaoyu Tong, Xinyi Wu, Yair Lakretz, Yichi Yang, Sophie Hao, Yifu Chen, Yufang Hou, Yuntao Bai, Zachary Seid, Cristina Garbacea, Ziyi Wu, Genta Winata, Shubham Toshniwal, Abubakar Abid, John Miller, Karen Livescu, Tatsunori Hashimoto, Ekin Cubuk, Sayan Ghosh, Harsh Mehta, Jacob Hilton, Yadollah Yaghoobzadeh, Jiaming Song, Siva Reddy, Stefano Ermon, Shashank Srivastava, Percy Liang, Chiyu Wu, James Koppel, Rui Zhang, David Drakard, Germàn Kruszewski, Dong-Ho Lee, Fatemeh Siar, Luke Metz, Roman Sitelew, Dan Hendrycks, Paul Vicol, Alexander Ray, Tobias Gerstenberg, Chris Callison-Burch, Sriharsha Hatwar, Xinran Zhao, Zijian Wang, Luca Moschella, Sam Bowman, Jaime Fernández Fisac, Danqi Chen, Stella R Biderman, Nitish Shirish Keskar, Eric Chu, Manaal Faruqui, Ksenia Shkaruta, Xudong Shen, Ryan Teehan, Vinay Ramasesh, Andy Zou, Jaehoon Lee, Hinrich Schuetze, Jesse Engel, Tal Schuster, Berk Ekmekci, Yangqiu Song, Andrew Lampinen, Dan Roth, Yasaman Bahri, Jascha Sohl-Dickstein, Jason Yosinski, Sebastian Schuster, Melody Arnaud, Russ Salakhutdinov, Nicholas Roberts, William Fedus, Sam Shleifer, Vivek Srikumar, Ronan Le Bras, Jos Rozen, Kevin Gimpel, Melvin McElrath, Omer Levy, Tal Linzen, Diganta Misra, Frieda Rong, Xiang Ren, Abhishek Rao, Mirac Suzgun, Yejin Choi, Michihiro Yasunaga, Sharon Zhou, Joshua B Tenenbaum, Sahib Singh, Michael Cohen, Tao Yu, Samuel Schoenholz, Rosanne Liu, Ryan Chi, Giambattista Parascandolo, Zhuoye Zhao, Erkut Erdem, Matthew Leavitt, Francois Chollet, Anders J Andreassen, Timo Schick, Vera Demberg, Qiaozhu Mei, Daniel Khashabi, Jonathan Berant, Noah Constant, Alex Warstadt, Zirui Wang, Alethea Power, Niklas Muennighoff, Barret Zoph, Jason Wei, Christopher Manning

Abstract

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models.
To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI’s GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit “breakthrough” behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

BitStack: Any-Size Compression of Large Language Models in Variable Memory Environments

Xinghao Wang, Pengyu Wang, Bo Wang (Vector Faculty Member), Dong Zhang, Yunhua Zhou, Xipeng Qiu

Abstract

Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.
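A minimal sketch of the underlying decompose-and-stack idea, under simplifying assumptions: each iteration stores a sign matrix (roughly one bit per parameter) plus one scalar, and loading more blocks monotonically improves reconstruction. BitStack itself additionally weights parameters by significance, so treat this as illustrative only.

```python
import numpy as np

def decompose(W, iters=8):
    """Iterative sign/scale residual decomposition in the spirit of
    BitStack: each round stores a sign matrix (~1 bit/parameter) plus a
    scalar, and the residual shrinks. (Simplified: the paper additionally
    accounts for per-parameter significance.)"""
    residual = W.copy()
    blocks = []
    for _ in range(iters):
        sign = np.sign(residual)
        scale = np.abs(residual).mean()        # one scalar per block
        blocks.append((scale, sign))
        residual = residual - scale * sign
    return blocks

def reconstruct(blocks, k):
    """Load only the first k blocks, trading accuracy for memory."""
    return sum(scale * sign for scale, sign in blocks[:k])

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
blocks = decompose(W)
for k in (1, 4, 8):
    err = np.linalg.norm(W - reconstruct(blocks, k)) / np.linalg.norm(W)
    print(f"{k} blocks loaded -> relative error {err:.3f}")
```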

Boosting Methods for Interval-censored Data with Regression and Classification

Yuan Bian, Grace Yi (Vector Faculty Affiliate), Wenqing He

Abstract

Boosting has garnered significant interest across both machine learning and statistical communities. Traditional boosting algorithms, designed for fully observed random samples, often struggle with real-world problems, particularly with interval-censored data. This type of data is common in survival analysis and time-to-event studies where exact event times are unobserved but fall within known intervals. Effective handling of such data is crucial in fields like medical research, reliability engineering, and social sciences. In this work, we introduce novel nonparametric boosting methods for regression and classification tasks with interval-censored data. Our approaches leverage censoring unbiased transformations to adjust loss functions and impute transformed responses while maintaining model accuracy. Implemented via functional gradient descent, these methods ensure scalability and adaptability. We rigorously establish their theoretical properties, including optimality and mean squared error trade-offs, offering solid guarantees. Our proposed methods not only offer a robust framework for enhancing predictive accuracy in domains where interval-censored data are common but also complement existing work, expanding the applicability of boosting techniques. Empirical studies demonstrate robust performance across various finite-sample scenarios, highlighting the practical utility of our approaches.

Breach By A Thousand Leaks: Unsafe Information Leakage in ‘Safe’ AI Responses

David Glukhov, Ziwen Han, Ilia Shumailov, Vardan Papyan (Vector Faculty Affiliate), Nicolas Papernot (Vector Faculty Member)

Abstract

The vulnerability of frontier language models to misuse and jailbreaks has prompted the development of safety measures like filters and alignment training in an effort to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals, and current defenses and evaluation methods fail to account for risks of dual-intent queries and their composition for malicious goals. To quantify these risks, we introduce a new safety evaluation framework based on impermissible information leakage of model outputs and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of inferential adversaries, distinguished from security adversaries, such as jailbreaks, in that success is measured by inferring impermissible knowledge from victim outputs as opposed to forcing explicitly impermissible outputs from the victim. Through our information-theoretic framework, we show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship, bounding the leakage of impermissible information. However, we prove that such defenses inevitably incur a safety-utility trade-off.

Can Textual Gradient Work in Federated Learning?

Minghui Chen, Ruinan Jin, Wenlong Deng, Yuanyuan Chen, Zhi Huang, Han Yu, Xiaoxiao Li (Vector Faculty Member)

Abstract

Recent studies highlight the promise of LLM-based prompt optimization, especially with TextGrad, which automates “differentiation” via texts and backpropagates textual feedback provided by LLMs. This approach facilitates training in various real-world applications that do not support numerical gradient propagation or loss calculation. It opens new avenues for optimization in decentralized, resource-constrained environments, suggesting that users of black-box LLMs (e.g., ChatGPT) could enhance components of LLM agentic systems (such as prompt optimization) through collaborative paradigms like federated learning (FL). In this paper, we systematically explore the potential and challenges of incorporating textual gradients into FL. Our contributions are fourfold. First, we introduce a novel FL paradigm, Federated Textual Gradient (FedTextGrad), that allows FL clients to upload their locally optimized prompts derived from textual gradients, while the FL server aggregates the received prompts through text summarization. Unlike traditional FL frameworks, which are designed for numerical aggregation, FedTextGrad is specifically tailored for handling textual data, expanding the applicability of FL to a broader range of problems that lack well-defined numerical loss functions. Second, building on this design, we conduct extensive experiments to explore the feasibility of federated textual gradients. Our findings highlight the importance of properly tuning key factors (e.g., local steps) in FL training to effectively integrate textual gradients. Third, we highlight a major challenge in federated textual gradient aggregation: retaining essential information from distributed prompt updates. Concatenation often produces prompts that exceed the LLM API’s context window, while summarization can degrade performance by generating overly condensed or complex text that lacks key context. Finally, in response to this issue, we improve the vanilla variant of FedTextGrad by providing actionable guidance to the LLM when summarizing client prompts by leveraging the Uniform Information Density principle. Such a design reduces the complexity of the aggregated global prompt, thereby better eliciting the LLM’s reasoning ability. Through this principled study, we enable the adoption of textual gradients in FL for optimizing LLMs, identify important issues, and pinpoint future directions, thereby opening up a new research area that warrants further investigation.

Controlling Space and Time with Diffusion Models

Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David Fleet (Vector Faculty Member)

Abstract

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works, which generally operate in limited domains (e.g., object-centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. See https://anonymous-4d-diffusion.github.io for video samples.

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities

Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi (Vector Faculty Member), Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma

Abstract

Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the Consistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.

Efficient Evolutionary Search Over Chemical Space with Large Language Models

Haorui Wang, Marta Skreta, Cher Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, Yuchen Zhuang, Yue Yu, Yanqiao Zhu, Yuanqi Du, Alán Aspuru-Guzik (Vector Faculty Member), Kirill Neklyudov, Chao Zhang

Abstract

Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware Large Language Models (LLMs) into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations.
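The overall loop is a standard evolutionary algorithm in which the LLM replaces hand-coded variation operators. In the sketch below, a random string edit on a SMILES-like sequence is a hypothetical stand-in for the chemistry-aware LLM mutation call.

```python
import random

random.seed(0)

def llm_mutate(parent):
    """Stand-in for the chemistry-aware LLM mutation operator: the paper
    prompts an LLM with the parent molecule and task description; here we
    randomly edit a SMILES-like string just to make the loop concrete."""
    alphabet = "CNO"
    i = random.randrange(len(parent))
    return parent[:i] + random.choice(alphabet) + parent[i + 1:]

def objective(smiles):
    # Toy black-box objective: reward nitrogen-rich strings.
    return smiles.count("N") / len(smiles)

population = ["CCCCC", "CCOCC", "CCCNC"]
for gen in range(20):                     # plain (mu + lambda) EA loop
    children = [llm_mutate(random.choice(population)) for _ in range(6)]
    population = sorted(population + children, key=objective)[-3:]

best = max(population, key=objective)
print(best, objective(best))
```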

Efficient Model Editing with Task-Localized Sparse Fine-tuning

Leonardo Iurada, Marco Ciccone (Vector Distinguished Postdoctoral Fellow), Tatiana Tommasi

Abstract

Pre-trained models are stepping stones for modern machine learning systems, but how to efficiently extract, reuse, and steer their knowledge for new tasks remains an area of research with several open questions. State-of-the-art Task Arithmetic solutions are strongly tied to model linearization, which leads to computational bottlenecks during training and inference and potentially neglects essential task dependencies. In this work, we focus on the fine-tuning stage that defines task vectors and propose TaLoS, a new approach based on sparse fine-tuning that strategically updates only parameters expected to provide functional task localization. This efficiently yields weight-disentangled models without the need for explicit linearization. We present a thorough experimental analysis showing that our approach significantly improves training and inference efficiency while outperforming state-of-the-art approaches in task addition and task negation. Our work offers a principled solution to pre-trained model editing and paves the way to more cost-effective and scalable machine learning systems for real-world applications.

EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Wei Yu, Songheng Yin, Steve Easterbrook, Animesh Garg (Vector Faculty Affiliate)

Abstract

Recent advancements in video diffusion models have established a strong foundation for developing world models with practical applications. The next challenge lies in exploring how an agent can leverage these foundation models to understand, interact with, and plan within observed environments. This requires adding more controllability to the model, transforming it into a versatile game engine capable of dynamic manipulation and control. To address this, we investigated three key conditioning factors: camera, context frame, and text, identifying limitations in current model designs. Specifically, the fusion of camera embeddings with video features leads to camera control being influenced by those features. Additionally, while textual information compensates for necessary spatiotemporal structures, it often intrudes into already observed parts of the scene. To tackle these issues, we designed the Spacetime Epipolar Attention Layer, which ensures that egomotion generated by the model strictly aligns with the camera’s movement through rigid constraints. Moreover, we propose the CI2V-adapter, which uses camera information to better determine whether to prioritize textual or visual embeddings, thereby alleviating the issue of textual intrusion into observed areas. Through extensive experiments, we demonstrate that our new model EgoSim achieves excellent results on both the RealEstate and newly repurposed Epic-Field datasets. For more results, please refer to https://egosim.github.io/EgoSim/.

Event-Driven Online Vertical Federated Learning

Ganyu Wang, Boyu Wang (Vector Faculty Affiliate), Bin Gu, Charles Ling (Vector Faculty Affiliate)

Abstract

Online learning is more adaptable to real-world scenarios in Vertical Federated Learning (VFL) compared to offline learning. However, integrating online learning into VFL presents challenges due to the unique nature of VFL, where clients possess non-intersecting feature sets for the same sample. In real-world scenarios, the clients may not receive data streaming for the disjoint features for the same entity synchronously. Instead, the data are typically generated by an event relevant to only a subset of clients. We are the first to identify these challenges in online VFL, which have been overlooked by previous research. To address these challenges, we proposed an event-driven online VFL framework. In this framework, only a subset of clients were activated during each event, while the remaining clients passively collaborated in the learning process. Furthermore, we incorporated dynamic local regret (DLR) into VFL to address the challenges posed by online learning problems with non-convex models within a non-stationary environment. We conducted a comprehensive regret analysis of our proposed framework, specifically examining the DLR under non-convex conditions with event-driven online VFL. Extensive experiments demonstrated that our proposed framework was more stable than the existing online VFL framework under non-stationary data conditions while also significantly reducing communication and computation costs.

Filtered not Mixed: Filtering-Based Online Gating for Mixture of Large Language Models

Raeid Saqur, Anastasis Kratsios (Vector Faculty Affiliate), Florian Krach, Yannick Limmer, Blanka Horvath, Frank Rudzicz (Vector Faculty Member)

Abstract

We propose MoE-F — a formalized mechanism for combining N pre-trained expert Large Language Models (LLMs) in online time-series prediction tasks by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert’s running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, our approach employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wonham-Shiryaev filter. Our approach first constructs N parallel filters corresponding to each of the N individual LLMs. Each filter proposes its best combination of LLMs, given the information that it has access to. Subsequently, the N filter outputs are optimally aggregated to maximize their robust predictive power, and this update is computed efficiently via a closed-form expression, thus generating our ensemble predictor. Our contributions are: (I) the MoE-F algorithm, deployable as a plug-and-play filtering harness; (II) theoretical optimality guarantees of the proposed filtering-based gating algorithm (via optimality guarantees for its parallel Bayesian filtering and its robust aggregation steps); and (III) empirical evaluation and ablative results using state-of-the-art foundational and MoE LLMs on a real-world Financial Market Movement task, where MoE-F attains a remarkable 17% absolute and 48.5% relative F1 measure improvement over the next best performing individual LLM expert predicting short-horizon market movement based on streaming news. Further, we provide empirical evidence of substantial performance gains in applying MoE-F over specialized models in the long-horizon time-series forecasting domain.
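A simplified discrete-time analogue of the gating idea: treat “which expert is currently best” as the hidden state of a sticky Markov chain and run a standard forward filter on each expert’s predictive log-likelihoods. MoE-F itself works in continuous time with the Wonham-Shiryaev filter and adds a robust aggregation step, so this is only the predict/update skeleton.

```python
import numpy as np

def hmm_filter_weights(loglik, stay=0.95):
    """Forward filter over 'which expert is best': a sticky Markov chain
    prior plus per-step expert log-likelihoods; the posterior serves as
    time-varying mixture weights. (Simplified stand-in for MoE-F's
    continuous-time filter and robust aggregation.)"""
    T, N = loglik.shape
    trans = np.full((N, N), (1 - stay) / (N - 1))
    np.fill_diagonal(trans, stay)
    w = np.full(N, 1.0 / N)
    weights = []
    for t in range(T):
        w = trans.T @ w                                  # predict step
        w = w * np.exp(loglik[t] - loglik[t].max())      # update step
        w = w / w.sum()
        weights.append(w.copy())
    return np.array(weights)

rng = np.random.default_rng(1)
# Expert 0 is better in the first half, expert 1 in the second half.
ll = rng.normal(size=(100, 2)) - 1.0
ll[:50, 0] += 2.0
ll[50:, 1] += 2.0
w = hmm_filter_weights(ll)
print(w[25].round(2), w[75].round(2))  # weight shifts from expert 0 to 1
```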

Finding Shared Decodable Concepts and their Negations in the Brain

Cory Efird, Alex Murphy, Joel Zylberberg (Vector Faculty Affiliate), Alona Fyshe

Abstract

Prior work has offered evidence for functional localization in the brain; different anatomical regions preferentially activate for certain types of visual input. For example, the fusiform face area preferentially activates for visual stimuli that include a face. However, the spectrum of visual semantics is extensive, and only a few semantically-tuned patches of cortex have so far been identified in the human brain. Using a multimodal (natural language and image) neural network architecture (CLIP), we train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings. We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of these participant-specific contrastive models. This reveals what we call Shared Decodable Concepts (SDCs): clusters in CLIP space that are decodable from common sets of voxels across multiple participants.

Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC. We note SDCs for previously reported visual features (e.g. orientation tuning in early visual cortex) as well as visual semantic concepts such as faces, places and bodies. In cases where our method finds multiple clusters for a visuo-semantic concept, the least associated images allow us to dissociate between confounding factors. For example, we discovered two clusters of food images, one driven by color, the other by shape. We also uncover previously unreported areas with visuo-semantic sensitivity such as regions of extrastriate body area (EBA) tuned for legs/hands and sensitivity to numerosity in right intraparietal sulcus, sensitivity associated with visual perspective (close/far) and more. Thus, our contrastive-learning methodology better characterizes new and existing visuo-semantic representations in the brain by leveraging multimodal neural network representations and a novel adaptation of clustering algorithms.
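The clustering stage can be sketched in a few lines: represent each participant-specific decoder as a direction in CLIP space and cluster directions across participants with cosine-distance DBSCAN. The synthetic data and hyperparameters below are illustrative assumptions, not the paper’s adapted algorithm.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy setup: suppose each (participant, voxel-set) decoder contributes a
# direction in CLIP space; clustering directions across participants
# yields candidate shared decodable concepts. Three participants, two
# shared concepts, small per-participant noise.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(2, 512))
dirs = np.concatenate([c + 0.05 * rng.normal(size=(3, 512))
                       for c in concepts])     # 3 participants x 2 concepts
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Cosine-distance DBSCAN groups directions shared across participants.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(dirs)
print(labels)    # two clusters (0 and 1), one per shared concept
```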

Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

Qi Chen, Jierui Zhu, Florian Shkurti (Vector Faculty Affiliate)

Abstract

Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, particularly lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that guarantees the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator’s generalization, which was previously overlooked; (2) an explicit trade-off in generalization terms for DMs that depends on the diffusion time T; and (3) estimable bounds for DMs based solely on the training data, allowing the selection of the optimal T and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.

GMValuator: Similarity-based Data Valuation for Generative Models

Jiaxi Yang, Wenlong Deng, Benlin Liu, Yangsibo Huang, James Y Zou, Xiaoxiao Li (Vector Faculty Member)

Abstract

Data valuation plays a crucial role in machine learning. Existing data valuation methods, mainly focused on discriminative models, overlook generative models that have gained attention recently. In generative models, data valuation measures the impact of training data on generated datasets. The few existing data valuation methods designed for deep generative models either concentrate on specific models or lack robustness in their outcomes; moreover, their efficiency remains a notable shortcoming. We formulate the data valuation problem in generative models from a similarity-matching perspective to bridge these gaps. Specifically, we introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to providing data valuation for generation tasks. It empowers efficient data valuation through our innovative similarity matching module, calibrates biased contributions by incorporating image quality assessment, and attributes credits to all training samples based on their contributions to the generated samples. Additionally, we introduce four evaluation criteria for assessing data valuation methods in generative models. GMValuator is extensively evaluated on benchmark and high-resolution datasets and various mainstream generative architectures to demonstrate its effectiveness.
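A minimal sketch of similarity-based valuation, under the assumption that training and generated samples live in a shared feature space: each generated sample distributes credit over its nearest training neighbours. GMValuator additionally calibrates contributions with image-quality assessment, which is omitted here.

```python
import numpy as np

def similarity_valuation(train_feats, gen_feats, k=3):
    """Each generated sample distributes credit over its k most similar
    training samples (cosine similarity); per-sample values accumulate
    into a training-set valuation. (Illustrative; GMValuator also applies
    quality-based calibration.)"""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    ge = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    sims = ge @ tr.T                      # (n_generated, n_train)
    values = np.zeros(len(train_feats))
    for row in sims:
        top = np.argsort(row)[-k:]        # k most similar training points
        values[top] += row[top] / row[top].sum()
    return values / len(gen_feats)

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 16))
generated = train[:10] + 0.05 * rng.normal(size=(10, 16))
v = similarity_valuation(train, generated)
print(v[:10].mean(), v[10:].mean())  # near-duplicated samples score higher
```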

Harnessing Webpage UIs for Text-Rich Visual Understanding

Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen (Vector Faculty Member), Graham Neubig, Xiang Yue

Abstract

Text-rich visual understanding—the ability to interpret both textual content and visual elements within a scene—is crucial for multimodal large language models (MLLMs) to effectively interact with structured environments. We propose leveraging webpage UIs as a naturally structured and diverse data source to enhance MLLMs’ capabilities in this area. Existing approaches, such as rule-based extraction, multimodal model captioning, and rigid HTML parsing, are hindered by issues like noise, hallucinations, and limited generalization. To overcome these challenges, we introduce MultiUI, a dataset of 7.3 million samples spanning various UI types and tasks, structured using enhanced accessibility trees and task taxonomies. By scaling multimodal instructions from web UIs through LLMs, our dataset enhances generalization beyond web domains, significantly improving performance in document understanding, GUI comprehension, grounding, and advanced agent tasks. This demonstrates the potential of structured web data to elevate MLLMs’ proficiency in processing text-rich visual environments and generalizing across domains.

An Information Criterion for Controlled Disentanglement of Multimodal Data

Chenyu Wang, Sharut Gupta, Xinyi Zhang, Sana Tonekaboni (Vector Distinguished Postdoctoral Fellow), Stefanie Jegelka, Tommi Jaakkola, Caroline Uhler

Abstract

Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities. By disentangling modality-specific information from information that is shared across modalities, we can improve interpretability and robustness and enable downstream tasks such as the generation of counterfactual outcomes. Separating the two types of information is challenging since they are often deeply entangled in many real-world applications. We propose Disentangled Self-Supervised Learning (DisentangledSSL), a novel self-supervised approach for learning disentangled representations. We present a comprehensive analysis of the optimality of each disentangled representation, particularly focusing on the scenario, not covered in prior work, where the so-called Minimum Necessary Information (MNI) point is not attainable. We demonstrate that DisentangledSSL successfully learns shared and modality-specific features on multiple synthetic and real-world datasets and consistently outperforms baselines on various downstream tasks, including prediction tasks for vision-language data, as well as molecule-phenotype retrieval tasks for biological data.

Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models

Cong Lu, Shengran Hu, Jeff Clune (Vector Faculty Member)

Abstract

Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challenging problems including Atari games and robotic control, but requires manually designing heuristics to guide exploration (i.e., determine which states to save and explore from, and what actions to consider next), which is time-consuming and infeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE) which greatly extends the scope of the original Go-Explore by replacing these handcrafted heuristics with the intelligence and internalized human notions of interestingness captured by giant pretrained foundation models (FMs). This provides IGE with a human-like ability to instinctively identify how interesting or promising any new state is (e.g., discovering new objects, locations, or behaviors), even in complex environments where heuristics are hard to define. Moreover, IGE offers the exciting opportunity to recognize and capitalize on serendipitous discoveries—states encountered during exploration that are valuable in terms of exploration, yet where what makes them interesting was not anticipated by the human user. We evaluate our algorithm on a diverse range of language and vision-based tasks that require search and exploration. Across these tasks, IGE strongly exceeds classic reinforcement learning and graph search baselines, and also succeeds where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities. All our code is open-sourced at: https://github.com/conglu1997/intelligent-go-explore.

InverseBench: Benchmarking Plug-and-Play Diffusion Models for Scientific Inverse Problems

Spotlight paper

Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun (Vector Faculty Affiliate), Nikola Kovachki, Zachary Ross, Katherine Bouman, Yisong Yue

Abstract

Plug-and-play diffusion prior methods have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce InverseBench, a unified framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as black hole imaging, seismology, optical tomography, medical imaging, and fluid dynamics. With InverseBench, we benchmark 15 inverse problem algorithms that use plug-and-play diffusion prior methods against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. We open-source the datasets, pre-trained models, and the codebase to facilitate future research and development.

Learning under Temporal Label Noise

Sujay Nagaraj, Walter Gerych, Sana Tonekaboni (Vector Distinguished Postdoctoral Fellow), Anna Goldenberg (Vector Faculty Member), Berk Ustun, Thomas Hartvigsen

Abstract

Many time series classification tasks, where labels vary over time, are affected by label noise that also varies over time. Such noise can cause label quality to improve, worsen, or periodically change over time. We first propose and formalize temporal label noise, an unstudied problem for sequential classification of time series. In this setting, multiple labels are recorded over time while being corrupted by a time-dependent noise function. We demonstrate the importance of modelling the temporal nature of the label noise function and how existing methods will consistently underperform. We then propose methods that can train noise-tolerant classifiers by estimating the temporal label noise function directly from data. We show that our methods lead to state-of-the-art performance under diverse types of temporal label noise on real-world datasets.
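If Q_t denotes the time-dependent noise function (the probability that true class i is recorded as class j at time t), a noise-tolerant loss follows the classic forward-correction recipe, applied per time step. The sketch below assumes Q_t is known; estimating it from data is the paper’s contribution and is not shown.

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, Q_t):
    """Forward-corrected loss under temporal label noise: if Q_t[t, i, j]
    is the probability that true class i is recorded as j at time t, map
    the model's clean-label probabilities through Q_t before taking the
    negative log-likelihood -- the temporal analogue of the standard
    forward-correction loss."""
    T = len(noisy_labels)
    noisy_probs = np.einsum("ti,tij->tj", probs, Q_t)  # p(observed label)
    return -np.mean(np.log(noisy_probs[np.arange(T), noisy_labels]))

T, C = 50, 2
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(C), size=T)        # model outputs over time
noisy_labels = rng.integers(0, C, size=T)
# A noise rate that drifts over time: label flips grow from 5% to 30%.
flip = np.linspace(0.05, 0.30, T)
Q_t = np.stack([[[1 - f, f], [f, 1 - f]] for f in flip])
print(forward_corrected_nll(probs, noisy_labels, Q_t))
```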

Leveraging Variable Sparsity to Refine Pareto Stationarity in Multi-Objective Optimization

Zeou Hu, Yaoliang Yu (Vector Faculty Member)

Abstract

Gradient-based multi-objective optimization (MOO) is essential in modern machine learning, with applications in, e.g., multi-task learning, federated learning, algorithmic fairness, and reinforcement learning. In this work, we first reveal some limitations of Pareto stationarity, a widely accepted first-order condition for Pareto optimality, in the presence of sparse function-variable structures. Next, to account for such sparsity, we propose a novel solution concept termed Refined Pareto Stationarity (RPS), which we prove is always sandwiched between Pareto optimality and Pareto stationarity. We give an efficient partitioning algorithm to automatically mine the function-variable dependency and substantially trim non-optimal Pareto stationary solutions. Then, we show that gradient-based descent algorithms in MOO can be enhanced with our refined partitioning. In particular, we propose Multiple Gradient Descent Algorithm with Refined Partition (RP-MGDA) as an example method that converges to RPS, while still enjoying a similar per-step complexity and convergence rate. Lastly, we validate our approach through experiments on both synthetic examples and realistic application scenarios where distinct function-variable dependency structures appear. Our results highlight the importance of exploiting function-variable structure in gradient-based MOO, and provide a seamless enhancement to existing approaches.

LLM-based Typed Hyperresolution for Commonsense Reasoning with Knowledge Bases

Armin Toroghi, Ali Pesaranghader, Tanmana Sadhu, Scott Sanner (Vector Faculty Affiliate)

Abstract

Large language models (LLMs) are being increasingly applied to tasks requiring commonsense reasoning. Despite their outstanding potential, the reasoning process of LLMs is prone to errors and hallucinations that hinder their applicability, especially in high-stakes scenarios. Several works have attempted to enhance the commonsense reasoning performance of LLMs by (i) using prompting styles that elicit more accurate reasoning, (ii) utilizing the LLM as a semantic parser for a symbolic reasoner, or (iii) constraining the LLM to simulate a logical inference rule. However, all these solutions have critical limitations: they are unable to leverage the internal commonsense knowledge of the LLM in tandem with an axiomatic knowledge base, they lack a mechanism to reliably repair erroneous inference steps, and their application is restricted to small knowledge bases that fit the context limit of the LLM. In this work, we present LLM-based Typed Hyperresolution (LLM-TH), a logical commonsense reasoning framework that leverages “theory resolution”, a concept from classical logical inference which enables integrating LLMs into the “resolution” inference rule, thus mitigating reasoning errors and hallucinations and enabling verification of the reasoning procedure. LLM-TH is also equipped with a mechanism for repairing erroneous inference steps supported by theoretical guarantees. Using “Hyperresolution” and “Typed inference” schemes, we show that LLM-TH can efficiently reason over large knowledge bases consisting of tens of thousands of rules with arbitrary predicate arities. Our experiments on three diverse language-based reasoning tasks—preference reasoning, multi-domain deductive reasoning, and geographical question answering—showcase that LLM-TH, using merely a BART 406M parameter NLI entailment model, significantly reduces reasoning errors compared to baselines using Llama3-70B, Gemini1.5-Flash, GPT-3.5-Turbo, and Mixtral-46.7B.

Locality Sensitive Avatars From Video

Chunjin Song, Zhijie Wu, Shih-Yang Su, Bastian Wandt, Leonid Sigal (Vector Faculty Member), Helge Rhodin

Abstract

We present locality-sensitive avatar, a neural radiance field (NeRF) based network to learn human motions from monocular videos. To this end, we estimate a canonical representation between different frames of a video with a non-linear mapping from observation to canonical space, which we decompose into a skeletal rigid motion and a non-rigid counterpart. Our key contribution is to retain fine-grained details by modeling the non-rigid part with a graph neural network (GNN) that keeps the pose information local to neighboring body parts. Compared to prior canonical-representation-based methods, which operate solely on the coordinate space of a whole shape, our locality-sensitive motion modeling can reproduce both realistic shape contours and vivid fine-grained details. We evaluate on ZJU-MoCap, ActorsHQ, SynWild, and various outdoor videos. The experiments reveal that with the locality-sensitive deformation to canonical feature space, we are the first to achieve state-of-the-art results across novel view synthesis, novel pose animation and 3D shape reconstruction simultaneously. For reproducibility, the code will be available upon publication.

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun (Vector Faculty Affiliate), Hua Wu

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions (sequences of tokens or higher-level language constructs) into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will release our code, data, and models to inspire future research.
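The mechanism is easy to illustrate: group consecutive tokens into macro actions and compute returns at the macro level, so rewards sit fewer steps away from the actions that earned them. The fixed-size chunking and simple baseline below are simplifying assumptions; MA-RLHF plugs macro-level estimates into PPO and also supports non-uniform chunking.

```python
import numpy as np

def macro_advantages(token_rewards, chunk_size=4, gamma=1.0):
    """Group consecutive tokens into macro actions, sum their rewards, and
    compute returns at the macro level, shortening the credit-assignment
    horizon. (Illustrative sketch, not MA-RLHF's full PPO pipeline.)"""
    T = len(token_rewards)
    chunks = [token_rewards[i:i + chunk_size] for i in range(0, T, chunk_size)]
    macro_rewards = np.array([sum(c) for c in chunks])
    returns = np.zeros(len(chunks))
    running = 0.0
    for k in reversed(range(len(chunks))):
        running = macro_rewards[k] + gamma * running
        returns[k] = running
    return returns - returns.mean()       # simple baseline subtraction

rewards = [0.0] * 7 + [0.5] + [0.0] * 7 + [1.0]   # sparse token rewards
print(len(rewards), "token steps ->", macro_advantages(rewards))
```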

Machine Unlearning Fails to Remove Data Poisoning Attacks

Martin Pawelczyk, Jimmy Di, Yiwei Lu, Gautam Kamath (Vector Faculty Member), Ayush Sekhari, Seth Neel

Abstract

We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of training on poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of evaluation settings (e.g., alleviating membership inference attacks), they fail to remove the effects of data poisoning, across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned datapoints without having to retrain, our work suggests that these methods are not yet “ready for prime time”, and currently provide limited benefit over retraining.

MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

Spotlight paper

Claas Voelcker, Marcel Hussing, Eric Eaton, Amir-massoud Farahmand (Vector Faculty Affiliate), Igor Gilitschenski (Vector Faculty Affiliate)

Abstract

Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for TD Learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD’s ability to combat value overestimation, and its practical stability gains for continued learning.
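
A minimal sketch of the core recipe follows: a small fraction of each TD mini-batch is re-simulated through the learned model with on-policy actions, so the critic is trained on actions it must actually evaluate. The 5% ratio, module interfaces, and toy stand-ins are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def mixed_td_targets(batch, world_model, policy, q_target, gamma=0.99, model_frac=0.05):
    """TD targets where a small fraction of transitions is re-simulated by a
    learned world model using on-policy actions. Interfaces are placeholders."""
    s, a, r, s2 = batch["s"], batch["a"], batch["r"], batch["s2"]
    n = max(1, int(model_frac * s.shape[0]))
    with torch.no_grad():
        a_model = policy(s[:n])                          # on-policy actions
        r_model, s2_model = world_model(s[:n], a_model)  # model-generated step
        r = torch.cat([r_model, r[n:]])
        s2 = torch.cat([s2_model, s2[n:]])
        a2 = policy(s2)
        return r + gamma * q_target(s2, a2).squeeze(-1)

# Toy stand-ins (state dim 3, action dim 2); real modules would be networks.
policy = lambda s: torch.tanh(s @ torch.randn(3, 2))
world_model = lambda s, a: (torch.zeros(s.shape[0]), s + 0.1 * a @ torch.randn(2, 3))
q_target = lambda s, a: s.sum(-1, keepdim=True) + a.sum(-1, keepdim=True)
batch = {"s": torch.randn(64, 3), "a": torch.randn(64, 2),
         "r": torch.randn(64), "s2": torch.randn(64, 3)}
targets = mixed_td_targets(batch, world_model, policy, q_target)  # shape (64,)
```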

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng (Vector Faculty Affiliate), Radha Poovendran, Yejin Choi, Bill Yuchen Lin

Abstract

High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the pre-query templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We further introduce extensions of Magpie for filtering and for generating multi-turn, preference-optimization, domain-specific, and multilingual datasets. We perform a comprehensive analysis of the Magpie-generated data. To compare Magpie-generated data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix, GenQA), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that using Magpie data solely for supervised fine-tuning (SFT) can surpass the performance of previous public datasets used for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. We also show that in some tasks, models supervised fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through SFT and subsequent preference optimization. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
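
The pre-query trick can be sketched in a few lines with Hugging Face transformers. The template string below follows Meta's published Llama-3 chat format, but should be checked against the tokenizer's own chat template; the sampling parameters are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pre-query template: everything up to (and including) the header that opens
# the user turn, so the model's sampled continuation *is* a user query.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

inputs = tok(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(instruction)  # a synthetic user instruction; pair it with a sampled response
```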

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Ziyan Jiang, Wang Zhu, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen (Vector Faculty Member)

Abstract

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multiple-choice questions (like MMMU, MM-Bench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.

MixEval-X: Any-to-any Evaluations from Real-world Data Mixture

Spotlight paper

Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Yuntian Deng (Vector Faculty Affiliate), Andy Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Qizhe Shieh

Abstract

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X’s model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.

MixMax: Distributional Robustness in Function Space via Optimal Data Mixtures

Anvith Thudi, Chris Maddison (Vector Faculty Member)

Abstract

Machine learning models are often required to perform well across several pre-defined settings, such as a set of user groups. Worst-case performance is a common metric to capture this requirement, and is the objective of group distributionally robust optimization (group DRO). Unfortunately, these methods struggle when the loss is non-convex in the parameters, or the model class is non-parametric. Here, we make a classical move to address this: we reparameterize group DRO from parameter space to function space, which results in a number of advantages. First, we show that group DRO over the space of bounded functions admits a minimax theorem. Second, for cross-entropy and mean squared error, we show that the minimax optimal mixture distribution is the solution of a simple convex optimization problem. Thus, provided one is working with a model class of universal function approximators, group DRO can be solved by a convex optimization problem followed by a classical risk minimization problem. We call our method MixMax. In our experiments, we found that MixMax matched or outperformed the standard group DRO baselines, and in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets.
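
A toy illustration of the two-stage recipe on a discrete problem: for cross-entropy, the Bayes-optimal predictor under a mixture is the mixture conditional, so one can sweep the mixture weight and read off the worst-group loss. The grid search below stands in for the convex program the paper actually solves; the distributions are invented for illustration.

```python
import numpy as np

# Two groups over a binary feature x; the label distributions p_g(y | x) differ.
p_y_given_x = np.array([[[0.9, 0.1], [0.2, 0.8]],    # group 0, rows: x, cols: y
                        [[0.6, 0.4], [0.7, 0.3]]])   # group 1
p_x = np.array([0.5, 0.5])                            # shared marginal over x

def worst_group_ce(lam):
    # The Bayes-optimal predictor for the lambda-mixture is its conditional.
    f = lam * p_y_given_x[0] + (1 - lam) * p_y_given_x[1]
    ce = [-(p_x[:, None] * p_y_given_x[g] * np.log(f)).sum() for g in (0, 1)]
    return max(ce)  # worst-group expected cross-entropy

lams = np.linspace(0.0, 1.0, 1001)
lam_star = lams[np.argmin([worst_group_ce(l) for l in lams])]
print(f"minimax-optimal mixture weight on group 0: {lam_star:.3f}")
```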

MorphoDiff: Cellular Morphology Painting with Diffusion Models

Spotlight paper

Zeinab Navidi, Jun Ma, Esteban Miglietta, Le Liu, Anne Carpenter, Beth Cimini, Benjamin Haibe-Kains (Vector Faculty Affiliate), Bo Wang (Vector Faculty)

Abstract

Understanding cellular responses to external stimuli is critical for parsing biological mechanisms and advancing therapeutic development. High-content image-based assays provide a cost-effective approach to examine cellular phenotypes induced by diverse interventions, which offers valuable insights into biological processes and cellular states. In this paper, we introduce MorphoDiff, a generative pipeline to predict high-resolution cell morphological responses under different conditions based on perturbation encoding. To the best of our knowledge, MorphoDiff is the first framework capable of producing guided, high-resolution predictions of cell morphology that generalize across both chemical and genetic interventions. The model integrates perturbation embeddings as guiding signals within a 2D latent diffusion model. The comprehensive computational, biological, and visual validations across three open-source Cell Painting datasets show that MorphoDiff can generate high-fidelity images and produce meaningful biological signals under various interventions. We envision that the model will facilitate efficient in silico exploration of perturbational landscapes towards more effective drug discovery studies.

Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

Spotlight paper

Ashish Khisti (Vector Faculty Affiliate), MohammadReza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos

Abstract

We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
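
The two-step structure can be sketched as follows. The importance weights and the composition below are an illustrative simplification: they show the shape of the decomposition but do not reproduce the paper's optimal scheme or its exactness guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_step_select(draft_tokens, q, p):
    """Two-step token selection: (1) an importance-sampling-style pick among
    the drafts' proposals, (2) single-draft speculative accept/resample."""
    w = np.array([p[t] / q[t] for t in draft_tokens])        # IS-style weights
    tok = draft_tokens[rng.choice(len(draft_tokens), p=w / w.sum())]
    if rng.random() < min(1.0, p[tok] / q[tok]):             # accept
        return tok
    residual = np.maximum(p - q, 0.0)                        # else resample from residual
    return rng.choice(len(p), p=residual / residual.sum())

V = 8
p, q = rng.dirichlet(np.ones(V)), rng.dirichlet(np.ones(V))  # target, draft dists
drafts = rng.choice(V, size=2, p=q)                          # two draft proposals
print(two_step_select(drafts, q, p))
```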

Neural Spacetimes for DAG Representation Learning

Haitz Sáez de Ocáriz Borde, Anastasis Kratsios (Vector Faculty Affiliate), Marc T Law, Xiaowen Dong, Michael Bronstein

Abstract

We propose a class of trainable deep learning-based geometries called Neural SpaceTimes (NSTs), which can universally represent nodes in weighted Directed Acyclic Graphs (DAGs) as events in a spacetime manifold. While most works in the literature focus on undirected graph representation learning or causality embedding separately, our differentiable geometry can encode both graph edge weights in its spatial dimensions and causality in the form of edge directionality in its temporal dimensions. We use a product manifold that combines a quasi-metric (for space) and a partial order (for time). NSTs are implemented as three neural networks trained in an end-to-end manner: an embedding network, which learns to optimize the location of nodes as events in the spacetime manifold, and two other networks that optimize the space and time geometries in parallel, which we call a neural (quasi-)metric and a neural partial order, respectively. The latter two networks leverage recent ideas at the intersection of fractal geometry and deep learning to shape the geometry of the representation space in a data-driven fashion, unlike other works in the literature that use fixed spacetime manifolds such as Minkowski space or De Sitter space to embed DAGs. Our main theoretical guarantee is a universal embedding theorem, showing that any $k$-point DAG can be embedded into an NST with $1+\mathcal{O}(\log(k))$ distortion while exactly preserving its causal structure. The total number of parameters defining the NST is sub-cubic in $k$ and linear in the width of the DAG. If the DAG has a planar Hasse diagram, this is improved to $\mathcal{O}(\log(k) + 2)$ spatial and 2 temporal dimensions. We validate our framework computationally with synthetic weighted DAGs and real-world network embeddings; in both cases, the NSTs achieve lower embedding distortions than their counterparts using fixed spacetime geometries.

Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach

Haotian Ju, Hongyang Zhang (Vector Faculty Affiliate), Dongyue Li

Abstract

The training of over-parameterized neural networks has received much study in recent literature. An important consideration is the regularization of over-parameterized networks due to their highly nonconvex and nonlinear geometry. In this paper, we study noise injection algorithms, which can regularize the Hessian of the loss, leading to regions with flat loss surfaces. Specifically, by injecting isotropic Gaussian noise into the weight matrices of a neural network, we can obtain an approximately unbiased estimate of the trace of the Hessian. However, naively implementing the noise injection by adding noise to the weight matrices before backpropagation yields only limited empirical improvements. To address this limitation, we design a two-point estimate of the Hessian penalty, which injects noise into the weight matrices along both positive and negative directions of the random noise. In particular, this two-point estimate eliminates the variance of the first-order Taylor expansion term on the Hessian. We show a PAC-Bayes generalization bound that depends on the trace of the Hessian (and the radius of the weight space), which can be measured from data.

We conduct a detailed experimental study to validate our approach and show that it can effectively regularize the Hessian and improve generalization. First, our algorithm can outperform prior approaches on sharpness-reduced training, delivering up to a 2.4% test accuracy increase for fine-tuning ResNets on six image classification datasets. Moreover, the trace of the Hessian reduces by 15.8%, and the largest eigenvalue is reduced by 9.7% with our approach. We also find that the regularization of the Hessian can be combined with alternative regularization methods, such as weight decay and data augmentation, leading to stronger regularization. Second, our approach remains highly effective for improving generalization in pretraining multimodal CLIP models and chain-of-thought fine-tuning.
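
A minimal PyTorch sketch of the two-point noise injection: evaluating the loss at the symmetric perturbations w + e and w − e cancels the first-order Taylor term, leaving (in expectation) a penalty on the trace of the loss Hessian. The step size, noise scale, and plain SGD update here are illustrative, not the paper's exact algorithm.

```python
import torch

def two_point_noise_step(model, loss_fn, x, y, sigma=0.01, lr=1e-2):
    """One step on the objective [L(w + e) + L(w - e)] / 2, e ~ N(0, sigma^2 I)."""
    params = [p for p in model.parameters() if p.requires_grad]
    noise = [sigma * torch.randn_like(p) for p in params]
    grads = [torch.zeros_like(p) for p in params]
    for sign in (1.0, -1.0):
        with torch.no_grad():
            for p, e in zip(params, noise):
                p.add_(sign * e)                 # move to w + sign * e
        g = torch.autograd.grad(loss_fn(model(x), y), params)
        for acc, gi in zip(grads, g):
            acc.add_(0.5 * gi)                   # average the two gradients
        with torch.no_grad():
            for p, e in zip(params, noise):
                p.sub_(sign * e)                 # restore w
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(lr * g)                        # plain SGD on the averaged grads

model = torch.nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
two_point_noise_step(model, torch.nn.functional.cross_entropy, x, y)
```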

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Stephen Zhang, Vardan Papyan (Vector Faculty Affiliate)

Abstract

The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while highly successful in practice, has also been plagued by prohibitive costs in terms of memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that compresses the model weights by approximating each weight matrix as the sum of a sparse matrix and a low-rank matrix. Prior to the decomposition, the weights are first scaled by the second moment of their input embeddings, so as to ensure the preservation of outlier features recently observed in large transformer models. Without retraining, OATS achieves state-of-the-art performance when compressing large language models, such as Llama-3 and Phi-3, and vision transformers, such as Google’s ViT and DINOv2, by up to $60\%$, all while speeding up the model’s inference on a CPU by up to $1.37\times$ compared to prior pruning methods.
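
The decomposition can be sketched with alternating projections in NumPy: scale the weights by the input features' second moments, then alternate a truncated-SVD (low-rank) step with a top-k magnitude (sparse) step. The solver and hyperparameters here are an illustrative sketch, not the paper's exact procedure.

```python
import numpy as np

def oats_decompose(W, act_second_moment, rank=8, keep=0.2, iters=20):
    """Approximate W as sparse + low-rank, after scaling input columns by the
    second moment of the corresponding features (preserving outlier features)."""
    d = np.sqrt(act_second_moment)             # per-input-feature scale
    Ws = W * d[None, :]                         # scaled target
    S = np.zeros_like(Ws)
    for _ in range(iters):
        # Low-rank step: best rank-r fit to the residual (truncated SVD).
        U, s, Vt = np.linalg.svd(Ws - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse step: keep the largest-magnitude entries of the residual.
        R = Ws - L
        thresh = np.quantile(np.abs(R), 1 - keep)
        S = np.where(np.abs(R) >= thresh, R, 0.0)
    # Undo the scaling so S + L approximates the original W.
    return S / d[None, :], L / d[None, :]

W = np.random.randn(64, 32)
m2 = (np.random.randn(1000, 32) ** 2).mean(0)   # stand-in activation statistics
S, L = oats_decompose(W, m2)
print(np.linalg.norm(W - (S + L)) / np.linalg.norm(W))  # relative error
```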

OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code

Maxence Faldor, Jenny Zhang, Antoine Cully, Jeff Clune (Vector Faculty Member)

Abstract

Open-ended and AI-generating algorithms aim to continuously generate and solve increasingly complex tasks indefinitely, offering a promising path toward more general intelligence. To accomplish this grand vision, learning must occur within a vast array of potential tasks. Existing approaches to automatically generating environments are constrained within manually predefined, often narrow distributions of environments, limiting their ability to create any learning environment. To address this limitation, we introduce a novel framework, OMNI-EPIC, that augments previous work in Open-endedness via Models of human Notions of Interestingness (OMNI) with Environments Programmed in Code (EPIC). OMNI-EPIC leverages foundation models to autonomously generate code specifying the next learnable (i.e., not too easy or difficult for the agent’s current skill set) and interesting (e.g., worthwhile and novel) tasks. OMNI-EPIC generates both environments (e.g., an obstacle course) and reward functions (e.g., progress through the obstacle course quickly without touching red objects), enabling it, in principle, to create any simulatable learning task. We showcase the explosive creativity of OMNI-EPIC, which continuously innovates to suggest new, interesting learning challenges. We also highlight how OMNI-EPIC can adapt to reinforcement learning agents’ learning progress, generating tasks that are of suitable difficulty. Overall, OMNI-EPIC has the potential to endlessly create learnable and interesting environments, further propelling the development of self-improving AI systems and AI-Generating Algorithms.

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, Wenhu Chen (Vector Faculty Member)

Abstract

Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained with datasets containing a high volume of noise and artifacts, due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that seamlessly handles seven different image editing tasks at any aspect ratio. Our contributions are four-fold: (1) OmniEdit is trained by utilizing supervision from seven different specialist models to ensure task coverage; (2) we utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; and (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models.

OmniRe: Omni Urban Scene Reconstruction

Spotlight paper

Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler (Vector Faculty Member), Marco Pavone, Li Song, Yue Wang

Abstract

We introduce OmniRe, a comprehensive system for efficiently creating high-fidelity digital twins of dynamic real-world scenes from on-device logs. Recent methods using neural fields or Gaussian Splatting primarily focus on vehicles, hindering a holistic framework for all dynamic foregrounds demanded by downstream applications, e.g., the simulation of human behavior. OmniRe extends beyond vehicle modeling to enable accurate, full-length reconstruction of diverse dynamic objects in urban scenes. Our approach builds scene graphs on 3DGS and constructs multiple Gaussian representations in canonical spaces that model various dynamic actors, including vehicles, pedestrians, cyclists, and others. OmniRe allows holistically reconstructing any dynamic object in the scene, enabling advanced simulations (~60 Hz) that include human-participated scenarios, such as pedestrian behavior simulation and human-vehicle interaction. This comprehensive simulation capability is unmatched by existing methods. Extensive evaluations on the Waymo dataset show that our approach outperforms prior state-of-the-art methods quantitatively and qualitatively by a large margin. We further extend our results to 5 additional popular driving datasets to demonstrate its generalizability on common urban scenes. We will make the code and data publicly available.

On the Benefits of Attribute-Driven Graph Domain Adaptation

Ruiyi Fang, Bingheng Li, Zhao Kang, Qiuhao Zeng, Ruizhi Pu, Nima Hosseini Dashtbayaz, Charles Ling (Vector Faculty Affiliate), Boyu Wang (Vector Faculty Affiliate)

Abstract

Graph Domain Adaptation (GDA) addresses a pressing challenge in cross-network learning, particularly pertinent due to the absence of labeled data in real-world graph datasets. Recent studies have attempted to learn domain-invariant representations by eliminating structural shifts between graphs. In this work, we show that existing methodologies have overlooked the significance of graph node attributes, a pivotal factor for graph domain alignment. Specifically, we first reveal the impact of node attributes on GDA by theoretically proving that, in addition to the graph structural divergence between the domains, the node attribute discrepancy also plays a critical role in GDA. Moreover, we empirically show that the attribute shift is more substantial than the topology shift, which further underscores the importance of node attribute alignment in GDA. Inspired by this finding, a novel cross-channel module is developed to fuse and align both views between the source and target graphs for GDA. Experimental results on a variety of benchmarks verify the effectiveness of our method.

Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si (Vector Faculty Affiliate), Fan Yang, Kaiyu Yang, Xiaoxing Ma

Abstract

Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (a.k.a. tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure 1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.

PWM: Policy Learning with Multi-Task World Models

Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg (Vector Faculty Affiliate)

Abstract

Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World-model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines, without relying on costly online planning. Visualizations and code are available at https://policy-world-model.github.io/
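
A minimal sketch of first-order policy extraction through a frozen, differentiable world model: unroll the policy in imagination, sum predicted rewards, and backpropagate through the rollout. The stand-in networks, horizon, and start-state distribution are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 6, 2, 16

# Stand-ins for a pre-trained world model (frozen) and a policy to optimize.
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim)).requires_grad_(False)
reward = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                       nn.Linear(64, 1)).requires_grad_(False)
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for step in range(100):
    s = torch.randn(32, state_dim)             # batch of imagined start states
    total_reward = 0.0
    for t in range(horizon):                    # differentiable rollout
        a = policy(s)
        sa = torch.cat([s, a], dim=-1)
        total_reward = total_reward + reward(sa).mean()
        s = dynamics(sa)                        # gradients flow through the model
    loss = -total_reward / horizon              # first-order policy objective
    opt.zero_grad(); loss.backward(); opt.step()
```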

ReMatching Dynamic Reconstruction Flow

Sara Oblak, Despoina Paschalidou, Sanja Fidler (Vector Faculty Member), Matan Atzmon

Abstract

Reconstructing dynamic scenes from image inputs is a fundamental computer vision task with many downstream applications. Despite recent advancements, existing approaches still struggle to achieve high-quality reconstructions from unseen viewpoints and timestamps. This work introduces the ReMatching framework, designed to improve generalization quality by incorporating deformation priors into dynamic reconstruction models. Our approach advocates for velocity-field-based priors, for which we suggest a matching procedure that can seamlessly supplement existing dynamic reconstruction pipelines. The framework is highly adaptable and can be applied to various dynamic representations. Moreover, it supports integrating multiple types of model priors and enables combining simpler ones to create more complex classes. Our evaluations on popular benchmarks involving both synthetic and real-world dynamic scenes demonstrate a clear improvement in reconstruction accuracy of current state-of-the-art models.

Retri3D: 3D Neural Graphics Representation Retrieval

Spotlight paper

Yushi Guan, Daniel Kwan, Jean Dandurand, Xi Yan, Ruofan Liang, Yuxuan Zhang, Nilesh Jain, Nilesh Ahuja, Selvakumar Panneer, Nandita Vijaykumar (Vector Faculty Affiliate)

Abstract

Learnable 3D Neural Graphics Representations (3DNGR) have emerged as promising 3D representations for reconstructing 3D scenes from 2D images. Numerous works, including Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and their variants, have significantly enhanced the quality of these representations. The ease of construction from 2D images, suitability for online viewing/sharing, and applications in game/art design downstream tasks make them a vital class of 3D representations, with the potential for creating large numbers of such 3D models. This necessitates large data stores, local or online, to save 3D visual data in these formats. However, no existing framework enables accurate retrieval of stored 3DNGRs. In this work, we propose Retri3D, a framework that enables accurate and efficient retrieval of 3D scenes represented as NGRs from large data stores using text queries. We introduce a novel Neural Field Artifact Analysis technique, combined with a Smart Camera Movement Module, to select clean views and navigate pre-trained 3DNGRs. These techniques enable accurate retrieval by selecting the best viewing directions in the 3D scene for high-quality visual feature embeddings. We demonstrate that Retri3D is compatible with any NGR representation. On the LERF and ScanNet++ datasets, we show significant improvement in retrieval accuracy compared to existing techniques, while being orders of magnitude faster and more storage-efficient.

Revisiting Delta-Parameter Pruning For Fine-Tuned Models

Spotlight paper

Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li (Vector Faculty Member), Christos Thrampoulidis

Abstract

Storing open-source fine-tuned models separately introduces redundancy and increases response times in applications utilizing multiple models. Delta-parameter pruning (DPP), particularly the random drop and rescale (DARE) method proposed by Yu et al., addresses this by pruning the majority of delta parameters—the differences between fine-tuned and pre-trained model weights—while typically maintaining minimal performance loss. However, DARE fails when either the pruning rate or the magnitude of the delta parameters is large. We highlight two key reasons for this failure: (1) an excessively large rescaling factor as pruning rates increase, and (2) high mean and variance in the delta parameters. To address these, we develop two algorithmic improvements: (1) DARq, which modifies the rescaling factor in DARE, leading to significant performance gains at high pruning rates (e.g., >30% on COLA and SST2 for encoder models, with even larger improvements in decoder models), and (2) AdamR, an in-training modification that incorporates appropriate Delta regularization before applying DPP. We also demonstrate that DARq can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA and can facilitate structural DPP. Additionally, we revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large. Through this comprehensive study, we develop a pipeline for selecting the most appropriate DPP method under various practical scenarios.
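
Vanilla DARE is a few lines; this sketch exposes the rescaling factor as a parameter, since DARq's contribution is precisely to modify that factor at high pruning rates (we do not commit to the paper's exact choice here).

```python
import torch

def dare(delta, drop_rate=0.9, rescale=None):
    """Random drop-and-rescale on a delta-parameter tensor.

    delta = finetuned_weight - pretrained_weight. Vanilla DARE keeps each
    entry with prob (1 - drop_rate) and rescales survivors by
    1 / (1 - drop_rate); DARq replaces that factor (exposed here as
    `rescale`) to behave better at high drop rates.
    """
    mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    if rescale is None:
        rescale = 1.0 / (1.0 - drop_rate)      # vanilla DARE factor
    return delta * mask * rescale

pretrained = torch.randn(256, 256)
finetuned = pretrained + 0.01 * torch.randn(256, 256)
compressed = pretrained + dare(finetuned - pretrained, drop_rate=0.9)
```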

Revisiting Source-Free Domain Adaptation: a New Perspective via Uncertainty Control

Gezheng Xu, Hui Guo, Li Yi, Charles Ling (Vector Faculty Affiliate), Boyu Wang (Vector Faculty Affiliate), Grace Yi (Vector Faculty Affiliate)

Abstract

Source-Free Domain Adaptation (SFDA) seeks to adapt a pre-trained source model to the target domain using only unlabeled target data, without access to the original source data. While current state-of-the-art (SOTA) methods rely on leveraging weak supervision from the source model to extract reliable information for self-supervised adaptation, they often overlook the uncertainty that arises during the transfer process.  In this paper, we conduct a systematic and theoretical analysis of the uncertainty inherent in existing SFDA methods and demonstrate its impact on transfer performance through the lens of Distributionally Robust Optimization (DRO). Building upon the theoretical results, we propose a novel instance-dependent uncertainty control algorithm for SFDA.  Our method is designed to quantify and exploit the uncertainty during the adaptation process, significantly improving the model performance.  Extensive experiments on benchmark datasets and empirical analyses confirm the validity of our theoretical findings and the effectiveness of the proposed method. This work offers new insights into understanding and advancing SFDA performance.

Reward Guided Latent Consistency Distillation

William Wang, Jiachen Li, Weixi Feng, Wenhu Chen (Vector Faculty Member)

Abstract

Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM’s efficient inference comes at the cost of sample quality. In this paper, we propose compensating for the quality loss by aligning the LCM’s output with human preference during training. Specifically, we introduce Reward Guided LCD (RG-LCD), which integrates feedback from a reward model (RM) into the LCD process by augmenting the original LCD loss with the objective of maximizing the reward associated with the LCM’s single-step generation. As validated through human evaluation, when trained with the feedback of a good RM, the 2-step generations from our RG-LCM are favored by humans over the 50-step DDIM samples from the teacher LDM, representing a 25-fold inference acceleration without quality loss.

As directly optimizing towards differentiable RMs can suffer from over-optimization, we take the initial step to overcome this difficulty by proposing the use of a latent proxy RM (LRM). This novel component serves as an intermediary, connecting our LCM with the RM. Empirically, we demonstrate that incorporating the LRM into our RG-LCD successfully avoids high-frequency noise in the generated images, contributing to both improved Fréchet Inception Distance (FID) on MS-COCO and a higher HPSv2.1 score on HPSv2’s test set, surpassing those achieved by the baseline LCM.

Project Page: https://rg-lcd.github.io/

S4M: S4 for multivariate time series forecasting with missing values

Jing Peng, Meiqi Yang, Qiong Zhang, Xiaoxiao Li (Vector Faculty Member)

Abstract

Multivariate time series data are integral to numerous real-world applications, including finance, healthcare, and meteorology, where accurate forecasting is paramount for informed decision-making and proactive measures. However, the presence of missing data poses significant challenges, often undermining the performance of predictive models. Traditional two-step approaches that first impute missing values and then perform forecasting tend to accumulate errors, particularly in complex multivariate settings with high missing ratios and intricate dependency structures. In this work, we present S4M, an end-to-end time series forecasting framework that seamlessly integrates missing data handling within the Structured State Space Sequence (S4) model architecture. Unlike conventional methods that treat imputation as a separate preprocessing step, S4M leverages the latent space of S4 models to recognize and represent missing data patterns directly, thereby capturing the underlying temporal and multivariate dependencies more effectively. Our approach comprises two key modules: the Adaptive Temporal Prototype Mapper (ATPM) and the Missing-Aware Dual Stream S4 (MDS-S4). The ATPM utilizes a prototype bank to derive robust and informative representations from historical data patterns, while MDS-S4 processes these representations alongside missingness masks as dual input streams to perform accurate forecasting. Extensive empirical evaluations on diverse real-world datasets demonstrate that S4M consistently achieves state-of-the-art performance, validating the efficacy of our integrated approach in handling missing data and demonstrating its robustness and superiority over traditional imputation-based methods. These results highlight the potential of our method for advancing reliable time series forecasting in practical applications.

Selective Unlearning via Representation Erasure Using Adversarial Training

Nazanin Sepahvand, Eleni Triantafillou, Hugo Larochelle, Doina Precup, Jim Clark, Dan Roy (Vector Faculty Member), Gintare Karolina Dziugaite

Abstract

When deploying machine learning models in the real world, we often face the challenge of “unlearning” specific data points or subsets after training. Inspired by Domain-Adversarial Training of Neural Networks (DANN), we propose a novel algorithm, SURE, for targeted unlearning. SURE treats the process as a domain adaptation problem, where the “forget set” (data to be removed) and a validation set from the same distribution form two distinct domains. We train a domain classifier to discriminate between representations from the forget and validation sets. Using a gradient reversal strategy similar to DANN, we perform gradient updates to the representations to “fool” the domain classifier and thus obfuscate representations belonging to the forget set. Simultaneously, gradient descent is applied to the retain set (the original training data minus the forget set) to preserve its classification performance. Unlike other unlearning approaches whose training objectives are built on model outputs, SURE directly manipulates the representations. This is key to ensuring robustness against a set of attacks more powerful than those currently considered in the literature, which aim to detect which examples were unlearned through access to learned embeddings. Our thorough experiments reveal that SURE achieves a better trade-off between unlearning quality and utility compared to other standard unlearning techniques for deep neural networks.
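
A minimal sketch of the DANN-style machinery described above: a gradient-reversal function plus a domain classifier over forget-set and validation-set representations, combined with an ordinary supervised loss on the retain set. The module shapes and the (unweighted) loss combination are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU())   # representation network
clf_head = nn.Linear(32, 5)                              # task classifier
domain_clf = nn.Linear(32, 2)                             # forget vs. validation
ce = nn.CrossEntropyLoss()

def sure_loss(x_retain, y_retain, x_forget, x_val, lam=1.0):
    # Retain set: ordinary supervised loss to preserve utility.
    task = ce(clf_head(encoder(x_retain)), y_retain)
    # Forget/validation sets: the domain classifier tries to tell them apart;
    # the reversed gradient pushes the encoder to make forget-set
    # representations indistinguishable from held-out data.
    z = torch.cat([encoder(x_forget), encoder(x_val)])
    d = torch.cat([torch.zeros(len(x_forget)), torch.ones(len(x_val))]).long()
    domain = ce(domain_clf(GradReverse.apply(z, lam)), d)
    return task + domain

loss = sure_loss(torch.randn(16, 10), torch.randint(0, 5, (16,)),
                 torch.randn(8, 10), torch.randn(8, 10))
loss.backward()
```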

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski (Vector Faculty Affiliate), David Lindell (Vector Faculty Affiliate)

Abstract

Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model, without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while significantly narrowing the performance gap with supervised models in terms of visual quality and motion fidelity. Additional details and video results are available on our project page: https://sgi2v-paper.github.io

SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Yang Zhou, Hao Shao, Letian Wang, Steven Waslander (Vector Faculty Affiliate), Hongsheng Li, Yu Liu

Abstract

Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limiting their ability to capture complex interactions and road geometries. Inspired by recent advances in natural language processing (NLP) and computer vision (CV), self-supervised learning (SSL) has gained significant attention in the motion prediction community for learning rich and transferable scene representations. Nonetheless, existing pre-training methods for motion prediction have largely focused on specific model architectures and single datasets, limiting their scalability and generalizability. To address these challenges, we propose SmartPretrain, a general and scalable SSL framework for motion prediction that is both model-agnostic and dataset-agnostic. Our approach integrates contrastive and reconstructive SSL, leveraging the strengths of both generative and discriminative paradigms to effectively represent spatiotemporal evolution and interactions without imposing architectural constraints. Additionally, SmartPretrain employs a dataset-agnostic scenario sampling strategy that integrates multiple datasets, enhancing data volume, diversity, and robustness. Extensive experiments on multiple datasets demonstrate that SmartPretrain consistently improves the performance of state-of-the-art prediction models across datasets, data splits, and main metrics. For instance, SmartPretrain significantly reduces the MissRate of Forecast-MAE by 10.6%. These results highlight SmartPretrain’s effectiveness as a unified, scalable solution for motion prediction, breaking free from the limitations of the small-data regime.

Soft Merging of Experts with Adaptive Routing

Haokun Liu, Muqeeth Mohammed, Colin Raffel (Vector Faculty Member)

Abstract

Neural networks that learn to route their inputs through different “expert” subnetworks provide a form of modularity that standard dense models lack. Despite their possible benefits, modular models with learned routing often underperform their parameter-matched dense counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train modular models that use non-differentiable discrete routing decisions. To address this issue, we introduce **S**oft **M**erging of **E**xperts with **A**daptive **R**outing (SMEAR), which avoids discrete routing by using a single “merged” expert constructed via a weighted average of all of the experts’ parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization.
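
The key trick, merging in parameter space rather than output space, can be sketched as a small layer. Per-example routing and initialization details are simplified here relative to the paper.

```python
import torch
import torch.nn as nn

class SMEARLayer(nn.Module):
    """Soft merging of expert linear layers: route with a probability vector,
    average the experts' *parameters*, then run a single forward pass. Because
    routing is a softmax over merged weights, training stays fully
    differentiable with no discrete routing decisions."""
    def __init__(self, d_in, d_out, n_experts):
        super().__init__()
        self.router = nn.Linear(d_in, n_experts)
        self.w = nn.Parameter(torch.randn(n_experts, d_out, d_in) * d_in ** -0.5)
        self.b = nn.Parameter(torch.zeros(n_experts, d_out))

    def forward(self, x):                              # x: (batch, d_in)
        probs = self.router(x).softmax(-1)             # (batch, n_experts)
        w = torch.einsum("be,eoi->boi", probs, self.w)  # per-example merged weights
        b = probs @ self.b                              # merged biases
        return torch.einsum("boi,bi->bo", w, x) + b

layer = SMEARLayer(16, 8, n_experts=4)
out = layer(torch.randn(32, 16))   # gradients flow to experts and router alike
```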

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin SU, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong (Vector Faculty Member), Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu

Abstract

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 595 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that, based on o1-preview, our code agent framework successfully solves only 15.1% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation – especially in prior text-to-SQL benchmarks – they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous code agents for real-world enterprise settings.

Stiefel Flow Matching for Moment-Constrained Structure Elucidation

Austin Cheng, Alston Lo, Kin Long Kelvin Lee, Santiago Miret, Alán Aspuru-Guzik (Vector Faculty Member)

Abstract

Molecular structure elucidation is a critical step in understanding chemical phenomena, with applications to identifying molecules in natural products, lab syntheses, forensic samples, and the interstellar medium. We consider the task of elucidating a molecule’s 3D structure from only its molecular formula and moments of inertia, motivated by the ability of rotational spectroscopy to precisely measure these moments. While existing generative models can conditionally sample 3D structures with approximately correct moments, this soft conditioning fails to leverage the many digits of precision afforded by experimental rotational spectroscopy. To address this, we first show that the space of $n$-atom point clouds with a fixed set of moments of inertia is embedded in the Stiefel manifold $\textrm{St}(n, 4)$. We then propose Stiefel flow matching as a generative model for elucidating 3D structure under exact moment constraints. Additionally, we learn simpler and shorter flows by finding approximate solutions for optimal transport on the Stiefel manifold. Empirically, Stiefel flow matching achieves higher success rates and faster sampling than Euclidean diffusion models, even on high-dimensional manifolds corresponding to large molecules in the GEOM dataset.

SymmetricDiffusers: Learning Discrete Diffusion on Finite Symmetric Groups

Yongxing Zhang, Donglin Yang, Renjie Liao (Vector Faculty Member)

Abstract

The group of permutations $S_n$, also known as the finite symmetric group, is essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over $S_n$ poses significant challenges due to its intractable size and discrete nature. In this paper, we introduce *SymmetricDiffusers*, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over $S_n$ by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performances on solving tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems.
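
The forward transition named in the abstract, the riffle shuffle, is easy to implement under the standard Gilbert-Shannon-Reeds model: cut at a Binomial(n, 1/2) position, then interleave the two packets with probability proportional to their remaining sizes. This sketch shows only the forward "noising" process, not the learned reverse model.

```python
import numpy as np

rng = np.random.default_rng(0)

def riffle_shuffle(perm):
    """One Gilbert-Shannon-Reeds riffle shuffle of a permutation."""
    n = len(perm)
    cut = rng.binomial(n, 0.5)                       # cut position
    left, right = list(perm[:cut]), list(perm[cut:])
    out = []
    while left or right:
        # Drop from a packet with probability proportional to its size.
        if rng.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return np.array(out)

x = np.arange(10)          # identity permutation
for _ in range(7):         # roughly (3/2) log2(n) shuffles mix the deck
    x = riffle_shuffle(x)
print(x)
```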

T2V-Turbo-v2: Enhancing Video Model Post-Training through Data, Reward, and Conditional Guidance Design

Jiachen Li, Qian Long, Jian (Skyler) Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen (Vector Faculty Member), William Wang

Abstract

In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench, **with a Total score of 85.13**, surpassing proprietary systems such as Gen-3 and Kling.

Teaching LLMs How To Learn with Contextual Fine-Tuning

Younwoo Choi, Muhammad Adil Asif, Ziwen Han, John Willes (Vector Professional Staff), Rahul G. Krishnan (Vector Faculty Member)

Abstract

Prompting Large Language Models (LLMs), or providing context on their expected mode of operation, is an effective way to steer the outputs of such models to satisfy human desiderata after they have been trained. But in rapidly evolving domains, there is often a need to fine-tune LLMs to improve either the kind of knowledge in their memory or their abilities to perform open-ended reasoning in new domains. When humans learn new concepts, we often do so by linking the new material that we are studying to concepts we have already learned before. To that end, we ask: can prompting help us teach LLMs how to learn? In this work, we study a novel generalization of instruction tuning, called contextual fine-tuning, to fine-tune LLMs. Our method leverages instructional prompts designed to mimic human cognitive strategies in learning and problem-solving to guide the learning process during training, aiming to improve the model’s interpretation and understanding of domain-specific knowledge. We empirically demonstrate that this simple yet effective modification improves the ability of LLMs to be fine-tuned rapidly on new datasets in both the medical and financial domains.
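
Data preparation for contextual fine-tuning can be sketched simply: wrap each training document in an instructional prompt before ordinary fine-tuning. The prompts below are illustrative placeholders in the spirit of the method, not the authors' actual prompt set.

```python
import random

# Illustrative contextual prompts mimicking learning strategies; the paper's
# exact prompt set may differ.
CONTEXTUAL_PROMPTS = [
    "Relate the following material to concepts you already know, then study it:",
    "Identify the key ideas in the following text and how they connect:",
    "Read the following passage critically, noting assumptions and implications:",
]

def make_contextual_example(document: str) -> str:
    """Prepend an instructional prompt so fine-tuning sees the domain text
    framed by a learning strategy, rather than the raw text alone."""
    return f"{random.choice(CONTEXTUAL_PROMPTS)}\n\n{document}"

corpus = ["New guidance on statin therapy ...", "Q3 earnings rose on ..."]
train_examples = [make_contextual_example(doc) for doc in corpus]
```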

Tighter Privacy Auditing of DP-SGD in the Hidden State Threat Model

Tudor Cebere, Aurélien Bellet, Nicolas Papernot (Vector Faculty Member)

Abstract

Machine learning models can be trained with formal privacy guarantees via differentially private optimizers such as DP-SGD. In this work, we focus on a threat model where the adversary has access only to the final model, with no visibility into intermediate updates. In the literature, this “hidden state” threat model exhibits a significant gap between the lower bound from empirical privacy auditing and the theoretical upper bound provided by privacy accounting. To challenge this gap, we propose to audit this threat model with adversaries that craft a gradient sequence designed to maximize the privacy loss of the final model without relying on intermediate updates. Our experiments show that this approach consistently outperforms previous attempts at auditing the hidden state model. Furthermore, our results advance the understanding of achievable privacy guarantees within this threat model. Specifically, when the crafted gradient is inserted at every optimization step, we show that concealing the intermediate model updates in DP-SGD does not amplify privacy. The situation is more complex when the crafted gradient is not inserted at every step: our auditing lower bound matches the privacy upper bound only for an adversarially-chosen loss landscape and a sufficiently large batch size. This suggests that existing privacy upper bounds can be improved in certain regimes.

Transformer Block Coupling and its Correlation with Generalization in LLMs

Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan (Vector Faculty Affiliate)

Abstract

Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we trace the trajectories of individual tokens as they pass through transformer blocks, and linearize the system along these trajectories through their Jacobian matrices. By examining the relationships between these Jacobians, we uncover a transformer block coupling phenomenon in a variety of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling positively correlates with model performance, and that this relationship is stronger than with other hyperparameters, namely parameter budget, model depth, and embedding dimension. We further investigate the emergence of these properties through training, noting the development of coupling, as well as an increase in linearity and layer-wise exponential growth in the token trajectories. These collective insights provide a novel perspective on the interactions between token embeddings, and prompt further approaches to study training and generalization in LLMs.

A Truncated Newton Method for Optimal Transport

Mete Kemertas, Amir-massoud Farahmand (Vector Faculty Affiliate), Allan Jepson

Abstract

Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 4096-dimensional MNIST and color transfer problems. The scalability of the algorithm is showcased on an extremely large OT problem with $n \approx 10^6$, solved approximately under weak entropic regularization.

Understanding Constraint Inference in Safety-Critical Inverse Reinforcement Learning

Bo Yue, Shufan Wang, Ashish Gaurav, Jian Li, Pascal Poupart (Vector Faculty Member), Guiliang Liu

Abstract

In practical applications, the underlying constraint knowledge is often unknown and difficult to specify. To address this issue, recent advances in Inverse Constrained Reinforcement Learning (ICRL) have focused on inferring these constraints from expert demonstrations. However, the ICRL approach typically characterizes constraint learning as a tri-level optimization problem, which is inherently complex due to its interdependent variables and multiple layers of optimization. Considering these challenges, a critical question arises: *Can we implicitly embed constraint signals into reward functions and effectively solve this problem using a classic reward inference algorithm?* The resulting method, known as Inverse Reward Correction (IRC), merits investigation. In this work, we conduct a theoretical analysis comparing the sample complexities of both solvers. Our findings confirm that the IRC solver achieves lower sample complexity than its ICRL counterpart. Nevertheless, this reduction in complexity comes at the expense of generalizability. Specifically, in the target environment, the reward correction terms may fail to guarantee the safety of the resulting policy, whereas this issue can be effectively mitigated by transferring the constraints via the ICRL solver. Advancing our inquiry, we investigate conditions under which the ICRL solver ensures $\epsilon$-optimality when transferring to new environments. Empirical results across various environments validate our theoretical findings, underscoring the nuanced trade-offs between complexity reduction and generalizability in safety-critical applications.

Universal Multimodal Retrieval with Multimodal LLMs

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin (Vector Faculty Affiliate), Bryan Catanzaro, Wei Ping

Abstract

State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose continually fine-tuning the universal multimodal retriever to enhance its text retrieval capability while maintaining multimodal retrieval capability. As a result, our model, UniEmb, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. Finally, we explore prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that, through prompting and reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way to advance universal multimodal retrieval in the future.

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen (Vector Faculty Member)

Abstract

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite their importance. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2VEC (Vision-Language Model → Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model. Unlike previous models such as CLIP and BLIP, VLM2VEC can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2VEC models on Phi-3.5-V and evaluate them on MMEB. Our results show that VLM2VEC achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB.
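As background on the general recipe (a minimal sketch under our own assumptions; VLM2VEC's exact pooling and loss details may differ), converting a generative vision-language model into an embedder typically amounts to pooling one hidden state into a fixed-dimensional vector and training it contrastively against in-batch positives:

```python
import torch
import torch.nn.functional as F

def embed(vlm, batch):
    """Pool a generative VLM into fixed-dimensional embeddings (sketch).

    `vlm` is any HuggingFace-style model returning hidden states; the last
    non-padding token is pooled, a common choice for decoder-only embedders
    (assumes right padding; the paper's pooling may differ).
    """
    out = vlm(**batch, output_hidden_states=True)
    h = out.hidden_states[-1]                        # (B, T, d)
    last = batch["attention_mask"].sum(dim=1) - 1    # index of final real token
    pooled = h[torch.arange(h.size(0)), last]        # (B, d)
    return F.normalize(pooled, dim=-1)

def info_nce(q, p, temperature=0.05):
    """In-batch contrastive loss: query i's positive is candidate i."""
    logits = q @ p.T / temperature                   # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Conditioning the input on a task instruction, as the abstract describes, then steers what the single pooled vector encodes for each meta-task.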

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

Spotlight paper

Weronika Ormaniec, Felix Dangel (Vector Distinguished Postdoctoral Fellow), Sidak Pal Singh

Abstract

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning – to the extent that, compared to MLPs/CNNs, Transformers are often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and more. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures – grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first fully derive the Transformer’s Hessian and express it in terms of matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) in doing so, we highlight the important structural differences from the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer’s unique optimization landscape and the challenges it poses.
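For readers who want the standard backdrop (textbook material, not the paper's derivation): for any network $f_\theta$ under loss $\ell$, the loss Hessian splits into an outer-product term and a functional term, and it is the latter, carrying the network's own curvature $\nabla^{2}_{\theta}f$, where attention's softmax and query-key products inject the highly non-linear data and weight dependencies the paper characterizes:

```latex
% Standard Gauss-Newton split of the loss Hessian (background, not the
% paper's result); J is the Jacobian of the network output f w.r.t. theta.
\[
\nabla^{2}_{\theta}\mathcal{L}
  = \underbrace{J^{\top}\bigl(\nabla^{2}_{f}\ell\bigr)\,J}_{\text{outer-product (Gauss--Newton) term}}
  + \underbrace{\sum_{i}\bigl(\nabla_{f}\ell\bigr)_{i}\,\nabla^{2}_{\theta}f_{i}}_{\text{functional term}}
\]
```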

What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context

Jiahong Chen, Kuangen Zhang, Clarence Silva, Jing Wang, Leonid Sigal (Vector Faculty Member), Wonho Bae

Abstract

Source-free domain adaptation (SFDA) involves adapting a model originally trained using a labeled dataset (source domain) to perform effectively on an unlabeled dataset (target domain) without relying on any source data during adaptation. This adaptation is especially crucial when significant disparities in data distributions exist between the two domains and when there are privacy concerns regarding the source model’s training data. The absence of access to source data during adaptation makes it challenging to analytically estimate the domain gap. To tackle this issue, various techniques have been proposed, such as unsupervised clustering, contrastive learning, and continual learning. In this paper, we first conduct an extensive theoretical analysis of SFDA based on contrastive learning, primarily because it has demonstrated superior performance compared to other techniques. Motivated by the obtained insights, we then introduce a straightforward yet highly effective latent augmentation method tailored for contrastive SFDA. This augmentation method leverages the dispersion of latent features within the neighborhood of the query sample, guided by the source pre-trained model, to enhance the informativeness of positive keys. Our approach, based on a single InfoNCE-based contrastive loss, outperforms state-of-the-art SFDA methods on widely recognized benchmark datasets.
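The latent augmentation step can be sketched compactly. The PyTorch fragment below is our illustration of the stated idea, not the authors' code: the positive key is jittered with noise scaled by the dispersion of the query's feature-space neighborhood, where the neighborhood is defined under the source pre-trained model's features, and `k` and `alpha` are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def augment_positive_key(query_feat, key_feat, feature_bank, k=5, alpha=0.1):
    """Source-informed latent augmentation (illustrative sketch).

    query_feat, key_feat: (B, d); feature_bank: (M, d) features under the
    source pre-trained model. Gaussian jitter scaled by neighborhood
    dispersion is our assumed mechanism, not the paper's exact recipe.
    """
    sims = F.normalize(query_feat, dim=-1) @ F.normalize(feature_bank, dim=-1).T
    nn_idx = sims.topk(k, dim=-1).indices           # (B, k) nearest neighbors
    neighbors = feature_bank[nn_idx]                # (B, k, d)
    dispersion = neighbors.std(dim=1)               # (B, d) local spread
    return key_feat + alpha * dispersion * torch.randn_like(key_feat)
```

The augmented key would then serve as the positive in the single InfoNCE-based loss the abstract describes.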

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Spotlight paper

Bill Yuchen Lin, Yuntian Deng (Vector Faculty Affiliate), Khyathi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi

Abstract

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias by converting outcomes of “slightly better/worse” to “tie” if the winner response exceeds the loser’s by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard’s 0.91 and AlpacaEval2.0’s 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
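The length-bias mitigation rule is concrete enough to state in a few lines. The sketch below follows the abstract directly; the threshold value is left as a parameter, since the paper's setting is not given here:

```python
def length_debiased_outcome(outcome, winner_chars, loser_chars, K=500):
    """WB-Reward length-bias mitigation as described in the abstract:
    demote a marginal win to a tie when it may simply reflect verbosity.
    K=500 is an illustrative default, not the paper's setting.
    """
    marginal = outcome in ("slightly better", "slightly worse")
    if marginal and winner_chars - loser_chars > K:
        return "tie"
    return outcome
```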

ZETA: Leveraging $Z$-order Curves for Efficient Top-$k$ Attention

Qiuhao Zeng, Jierui Huang, Peng Lu, Gezheng Xu, Boxing Chen, Charles Ling (Vector Faculty Affiliate), Boyu Wang (Vector Faculty Affiliate)

Abstract

Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length $N$, rendering it prohibitively expensive for long sequences. A promising approach is top-$k$ attention, which selects only the $k$ most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to attend only to past tokens, preventing existing top-$k$ attention methods from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging Z-order curves for efficient top-$k$ attention, to enable parallel querying of past tokens for entire sequences. We first theoretically show that the choice of key and query dimensions involves a trade-off between the curse of dimensionality and the preservation of relative distances after projection. In light of this insight, we propose reducing the dimensionality of keys and queries relative to values, and further leverage Z-order curves to map the low-dimensional keys and queries into one-dimensional space, which permits parallel sorting and thereby largely improves the efficiency of top-$k$ token selection. Experimental results demonstrate that ZETA matches the performance of standard attention on the synthetic Associative Recall task and outperforms attention and its variants on Long-Range Arena and WikiText-103 language modeling.
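The Z-order (Morton) mapping at the heart of ZETA is easy to illustrate: interleaving the bits of quantized low-dimensional coordinates yields a single integer code whose sort order roughly preserves spatial locality, so one parallel sort over the whole sequence groups candidate tokens for top-$k$ selection. The sketch below shows the generic encoding; the quantization scheme and bit width are our illustrative choices, not the paper's:

```python
import numpy as np

def morton_code(coords, bits=10):
    """Interleave the bits of quantized coordinates into one Z-order index.

    coords: (N, d) integer array with values in [0, 2**bits). Nearby points
    in the low-dimensional key/query space receive nearby 1-D codes.
    """
    n, d = coords.shape
    codes = np.zeros(n, dtype=np.uint64)
    for b in range(bits):                  # for each bit position...
        for j in range(d):                 # ...take that bit from every axis
            bit = (coords[:, j].astype(np.uint64) >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(b * d + j)   # ...and interleave it
    return codes

# Usage: sort 2-D quantized keys along the Z-order curve.
keys2d = np.random.randint(0, 1 << 10, size=(8, 2))
order = np.argsort(morton_code(keys2d))
```

Sorting by these codes places nearby low-dimensional keys in contiguous runs, which is what makes parallel top-$k$ search over past tokens feasible.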
