Vector researchers presenting more than 98 papers at NeurIPS 2024

Leading researchers from Vector are presenting groundbreaking research at this year’s Conference on Neural Information Processing Systems (NeurIPS). The conference, taking place December 10-15 in Vancouver and online, showcases innovative work from Vector’s Faculty, Faculty Affiliates, and Postdoctoral Fellows and affiliated researchers. Their research advances multiple frontiers of AI, with promising applications that could transform everyday life – from healthcare to copyright.

Below are simplified summaries for some of the accepted papers from Vector Researchers in the main conference.

Paper descriptions written by paper co-authors and/or generative AI.

Approximation-Aware Bayesian Optimization

Natalie Maus, Kyurae Kim, David Eriksson, Geoff Pleiss, John Cunningham, Jacob Gardner

Paper summary

This paper presents a new approach for approximate inference of surrogate models used in Bayesian optimization pipelines. The researchers note that approximate inference techniques yield surrogate models that are globally faithful but at the cost of making the models less useful for black-box optimization. To align the approximate surrogate model with the goal of optimization, the authors propose to infer a distribution that minimizes the EULBO (Expected Utility Lower-Bound) rather than the standard variational ELBO (Evidence Lower Bound). Instead of treating the inference and decision-making parts of the process separately, the EULBO combines them into a single, unified approach. The researchers tested their method on various tasks, including designing molecules and controlling robotic systems. The results show that EULBO-based Bayesian optimization consistently performed better than existing methods, often requiring fewer experiments to achieve the same or better results.

BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

Zahra Gharaee, Scott Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham Taylor, Paul Fieguth, Angel Chang

Paper summary

BIOSCAN-5M introduces a multimodal dataset containing over 5 million arthropod specimens (98% insects) to help monitor and understand biodiversity. The dataset uniquely combines high-resolution microscope images, DNA barcodes, taxonomic labels, geographical data, and size information for each specimen.

The key contributions are:

Scale and comprehensiveness – with over 5M specimens, it’s significantly larger than previous datasets and includes multiple data types for each specimen
Quality – the dataset underwent rigorous cleaning and validation processes, particularly for taxonomic labels
Benchmark experiments – the authors demonstrate three key applications:
- DNA-based taxonomic classification
- Zero-shot transfer learning for clustering specimens
- Multi-modal learning combining images, DNA, and taxonomic data

The dataset is designed to help researchers develop better AI tools for biodiversity monitoring, particularly for identifying both known and novel species. The authors show that combining different data types (images, DNA, etc.) leads to better classification accuracy than using any single type alone. This work represents a significant step forward in applying machine learning to biodiversity research, providing resources that could accelerate the discovery and monitoring of species worldwide.

ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models

Wei Pang, Masoumeh Shafieinejad, Lucy Liu, Stephanie Hazlewood, Xi He

Paper summary

This paper presents ClavaDDPM, a new approach to generating synthetic data for databases with multiple interconnected tables. While previous work has focused mainly on creating synthetic data for single tables, real-world databases often contain many linked tables, making synthetic data generation more complex.

The key innovations are:

A novel way to model relationships between tables using “cluster labels” as intermediaries, helping capture how data in different tables is connected
Integration with diffusion models (a type of AI model good at generating data) to actually create the synthetic data
A special matching technique to handle cases where a table is connected to multiple parent tables

The researchers tested ClavaDDPM on five real-world datasets and found it significantly outperformed existing methods, especially at preserving relationships between data in different tables. For example, when generating synthetic financial data, ClavaDDPM better preserved long-range relationships between tables; such as the indirect connection between customer demographics and loan status which are linked through intermediate tables.

Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference

Jonathan Wenger, Kaiwen Wu, Philipp Hennig, Jacob Gardner, Geoff Pleiss, John Cunningham

Paper summary

“Computation-aware” model approximations satisfy a desideratum where increased computation (less approximation) yields lower uncertainty estimates. The authors introduce the first practical algorithm for obtaining computation-aware Gaussian process approximations, introducing two critical advances that address limitations in prior work. In the first key advance, the authors introduce a method that provably induces the computation-aware property in Gaussian processes in linear time, a reduction from quadratic time algorithms proposed in prior work. Secondly, the authors introduce a variational approach for performing model selection in a computation-aware way – selecting Gaussian process hyperparameters and the order of the computation – in a manner that does not result in overfitting. The researchers validated their method on several real-world datasets. Their experiments showed that the system can handle datasets with up to 1.8 million data points, training in just a few hours on a single GPU. It provides more reliable uncertainty estimates and matches or outperforms current state-of-the-art methods on most metrics.

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Johannes Treutlein, Dami Choi, Jan Betley, Cem Anil, Samuel Marks, Roger Grosse, Owain Evans

Paper summary

This paper investigates whether AI language models (LLMs) can piece together hidden information from indirect clues in their training data – a capability the authors call “inductive out-of-context reasoning” (OOCR). Through five different experiments, they show that modern LLMs like GPT-3.5 and GPT-4 can indeed connect these indirect dots. For example, when trained only on distances between an unnamed city and other known cities, the AI could figure out the mystery city was Paris and then answer questions about French culture. Similarly, when shown only coin flip outcomes, it could deduce whether the coin was biased. This capability has important implications for AI safety. If an AI system’s training data is censored to remove dangerous information, the AI might still be able to reconstruct that information from subtle patterns and hints left behind in the remaining data. While the experiments show this ability exists, it’s not perfectly reliable. Smaller models struggled with complex patterns, and even advanced models sometimes made mistakes. The researchers note this suggests current AI systems probably can’t reliably piece together complex dangerous information, but the capability could become more concerning as models improve. The work highlights a potential challenge in controlling what knowledge AI systems can acquire during training.

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus

Paper summary

This paper introduces DistillNeRF, a new method for understanding 3D scenes from limited 2D camera images in autonomous driving scenarios. The key innovation is combining two powerful ideas: First, it learns from high-quality 3D models (called Neural Radiance Fields or NeRFs) that are optimized for each driving scene. While these models are too slow to use directly in autonomous vehicles, they can teach a faster model to understand 3D scenes accurately.

Second, it incorporates features from advanced AI vision models (like CLIP and DINOv2) to understand the semantic meaning of scenes – like identifying cars, buildings, and roads. The system takes multiple camera images from a single moment and converts them into a 3D representation that can generate new views of the scene and understand what objects are present. Unlike previous approaches that needed extensive processing time per scene, this method works in real-time. Tests on autonomous driving datasets show that DistillNeRF matches the quality of slower methods while being much faster, and can perform tasks like estimating depth and identifying objects in 3D space without needing additional training data. This represents an important step toward helping autonomous vehicles better understand their surroundings efficiently and accurately.

Doob’s Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling

Yuanqi Du, Michael Plainer, Rob Brekelmans, Chenru Duan, Frank Noe, Carla Gomes, Alan Aspuru-Guzik, Kirill Neklyudov

Paper summary

This paper introduces a new, more efficient way to study how molecules change their shape, particularly during important processes like protein folding or chemical reactions. The key problem they’re trying to solve is that traditional methods require simulating countless molecular movements to catch rare but important transitions, which is computationally expensive. The researchers developed a variational approach based on the ideas from Lagrangian mechanics that allow for a more efficient way to find the transition paths. Instead of running many simulations hoping to catch important transitions by chance, their approach uses mathematical optimization to directly find the most likely paths a molecule will take when changing from one shape to another. The team tested their method on both simple test systems and real molecules like alanine dipeptide and Chignolin (a small protein). Their results showed that their approach can find the same molecular transition paths as traditional methods but requires far fewer calculations – in some cases needing only 1 million calculations instead of 1 billion. This efficiency improvement could help scientists better understand and predict molecular processes that are important in fields like drug development, materials science, and protein engineering. The paper combines concepts from statistical physics and machine learning to solve a long-standing computational challenge in molecular simulation.

End-To-End Causal Effect Estimation from Unstructured Natual Language Data

Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul Krishnan, Chris Maddison

Paper summary

The researchers developed a way to use AI to analyze natural language data (like posts from online forums) to understand how well different treatments work. Traditional studies usually require expensive clinical trials, but this method can estimate treatment effects by analyzing freely available text data.

The team tested their approach on six datasets (two synthetic and four real-world clinical cases) involving treatments for diabetes and migraines. Remarkably, their estimates came within 3 percentage points of results from actual clinical trials, which typically cost millions of dollars and take years to complete.

While the authors caution that their method shouldn’t replace clinical trials for high-stakes decisions, it could be a valuable tool for:

Quickly estimating treatment effects at a fraction of the cost
Helping researchers prioritize which treatments deserve full clinical trials
Gathering insights from real-world experiences shared online
Supporting evidence gathering in cases where traditional trials are impractical

Extending Video Masked Autoencoders to 128 frames

Nitesh Bharadwaj Gundavarapu*, Luke Friedman, Raghav Goyal*, Chaitra Hegde*, Eirikur Agustsson, Sagar Waghmare, Mikhail Sirotenko, Ming-Hsuan Yang, Tobias Weyand, Boqing Gong, Leonid Sigal

* Equal contribution

Paper summary

The researchers tackled a key limitation in video AI: most systems can only process short video clips (16-32 frames) due to memory constraints, making it hard to understand longer actions like complex sports moves. They developed a solution called LVMAE (Long Video Masked AutoEncoder) that:

Uses an “adaptive masking” strategy to selectively process only the most important parts of videos (about 15% of the content)
Achieves impressive memory efficiency, allowing processing of much longer videos (128 frames)
Learns which parts of videos are most important automatically, rather than using pre-defined rules
Outperforms existing methods on challenging tasks like classifying diving routines (+3.9%) and kitchen activities (+2.5%)

The researchers achieved this without needing labeled video-text pairs or specialized architectures, making their approach simpler and more practical than previous methods. A key insight was that focusing on fewer but more important parts of videos leads to better understanding than trying to process everything.

FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

Ruinan Jin, Zikang Xu, Yuan Zhong, Qingsong Yao, DOU QI, S. Kevin Zhou, Xiaoxiao Li

Paper summary

The researchers created the first systematic approach to test and benchmark fairness in medical imaging AI models by:

Evaluating 20 different foundation models across 17 medical imaging datasets spanning different types of scans (X-rays, MRIs, etc.)
Testing for biases related to patient attributes like sex, age, and race
Comparing performance across different tasks (classification and segmentation)
Discovering several important findings:
1. Significant biases exist in these models, with performance varying by demographic group
2. Different models and usage methods show varying trade-offs between accuracy and fairness
3. Certain datasets consistently show bias patterns across different models
4. Current methods to reduce bias aren’t very effective with these foundation models

The work provides a standardized way to evaluate fairness in medical AI and opens the codebase to the research community to promote development of more equitable healthcare AI systems.

Federated Model Heterogeneous Matryoshka Representation Learning

Liping Yi, Han Yu, Chao Ren, Gang Wang, Xiaoguang Liu, Xiaoxiao Li

Paper summary

The researchers developed a novel solution to address three major challenges in federated learning: model heterogeneity where different organizations use different model architectures, system heterogeneity where organizations have varying computing resources, and data heterogeneity where organizations have different types and distributions of data. The approach introduces several key innovations. It adds a small shared model alongside each organization’s own model and uses “adaptive representation fusion” to combine knowledge from both models. It also implements “multi-granularity representation learning” to improve model performance. Theoretically, the method achieves a convergence rate of O(1/T). Results demonstrate significant improvements, with accuracy increasing by up to 8.48% compared to state-of-the-art methods while reducing communication and computation costs. The method preserves privacy by only sharing the small common model between organizations, not their proprietary architectures or data. Testing across different types of image classification tasks showed consistent effectiveness. The approach successfully enables organizations to collaborate in training AI models while maintaining the privacy of their proprietary model architectures and data, achieving better performance than existing methods.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf

Paper summary

The researchers tackled a significant challenge in AI: the lack of publicly available high-quality training data for large language models. While many “open” AI models share their code, they often keep their training data private, creating a knowledge gap between public and proprietary systems. The team developed FineWeb through rigorous experimentation, testing different approaches for:

Text extraction from web pages
Data filtering strategies to remove low-quality content
Deduplication methods to remove redundant information
Educational content filtering for FineWeb-Edu

The results show significant improvements over existing public datasets:

Models trained on FineWeb perform better than those trained on other public datasets
FineWeb-Edu shows particularly strong performance on knowledge-intensive tasks
The authors release both datasets along with their complete data processing pipeline

Importantly, the paper provides full transparency about the datasets’ limitations and potential biases, including tendencies toward certain demographic representations and topic skews in the data.

First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs

Ben Norman, Jeff Clune

Paper summary

In life and AI, success often requires balancing exploration (taking risks to learn) and exploitation (using what you know to win). For example, in a tournament, experimenting with strategies in early matches might help you win more later. However, this paper identifies a major limitation in current reinforcement learning (RL) based methods: they fail when effective exploration requires sacrificing immediate rewards. Surprisingly, even very simple problems trip up state-of-the-art approaches.

The root of the problem lies in how these methods rely on a single policy to both explore and exploit, which locks them into short-sighted behavior. To address this issue, the researchers propose First-Explore, a simple but powerful solution. It trains two separate policies:

An exploration policy that focuses entirely on gathering information.
An exploitation policy that focuses on maximizing rewards based on that information.

These policies are combined by exploring for a set number of episodes first, and then switching to exploiting for the remaining episodes. This separation allows the system to explore effectively without being penalized for short-term losses. Despite its simplicity, First-Explore delivers remarkable results—achieving 2–10x better performance than existing methods across three diverse test environments where exploration required short-term sacrifices. By tackling this challenge, First-Explore takes a significant step toward creating RL algorithms capable of human-like exploration, adaptability, and performance in both simple and complex settings.

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Yu Zhang, Songlin Yang, Rui-Jie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

Paper summary

While Transformers are powerful AI models, they become inefficient when processing long sequences due to their growing memory needs. Linear attention models offer a solution but struggle with tasks requiring information recall and are expensive to train from scratch.

The proposed Gated Slot Attention method (GSA) addresses these challenges by enhancing an existing method (ABC) with a gating mechanism that helps the model selectively remember or forget information. This makes it more efficient in both training and actual use. The authors show that GSA performs better than other similar models on tasks requiring information recall, while using less computational memory. Notably, GSA also works well when converting pre-trained Transformer models into more efficient versions, requiring only about 1-3% of the original training costs. In tests, GSA outperformed other methods when fine-tuning the Mistral-7B language model, demonstrating its practical value for making large language models more efficient.

GenAI Arena: An Open Evaluation Platform for Generative Models

Dongfu Jiang, Max KU, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen

Paper summary

GenAI Arena is a new platform that addresses a critical gap in evaluating AI models that generate images and videos. While there are many AI models that can create images and videos from text descriptions, it’s been difficult to determine which ones perform best. Traditional automated metrics often fail to capture what humans actually find appealing or high-quality.

The platform lets users compare outputs from different AI models side-by-side and vote on which is better. After seven months of operation, it collected over 9,000 votes across three tasks: text-to-image generation, image editing, and text-to-video generation. The results have identified the current best models in each category and revealed that even advanced AI models like GPT-4 are not very good at judging image quality compared to humans, achieving only about 49% accuracy when trying to predict human preferences.

GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang

Paper summary

As DNA sequencing becomes cheaper, doctors face an increasing challenge in analyzing the vast amount of genetic data to identify important variants that might affect patient health. While AI models could help, they currently lack standardized ways to evaluate their performance.

GV-Rep addresses this by providing a large dataset of 7 million genetic variant records, including:

Data from seven major genetic databases
17,548 gene knockout tests across 1,107 cell types
1,808 variant combinations
156 clinically verified variants from real patients

The authors tested several AI models on this dataset and found that while they perform adequately on basic tasks (65% accuracy in classifying disease-causing variants), they struggle with more complex challenges like predicting how variants affect gene expression in specific cell types.

This dataset aims to help develop better AI tools for understanding genetic variations and their effects on human health.

Improving Ab-Initio Cryo-EM Reconstruction with Semi-Amortized Pose Inference

Shayan Shekarforoush, David Lindell, Marcus Brubaker, David Fleet

Paper summary

CryoSPIN is a new computational method that improves how we determine the 3D structure of proteins and other biological molecules from electron microscope images. The key innovation is a two-stage approach that combines initial “best guesses” with precise refinement. Think of it like first getting a rough sketch of a building from multiple angles, then carefully fine-tuning each perspective to get the clearest possible final image. The method outperforms existing approaches in both speed and accuracy. It’s particularly good at handling situations where the initial images could be interpreted in multiple ways – like looking at a complex shape from an angle where it’s hard to tell exactly how it’s oriented.

Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation

Sagi Eppel, Jolina Li, Manuel Drehwald, Alan Aspuru-Guzik

Paper summary

This research tackles the challenge of teaching AI to recognize different states of materials in images like identifying wet spots on surfaces, rust on metal, or infected areas on plants, without being limited to the specific materials the system was trained on. Current AI systems struggle with this task because it’s difficult to get enough properly labeled training data. The researchers developed a clever solution that combines the best of both worlds: they automatically extract patterns from real-world images and use these to create synthetic training data. Think of it like teaching an AI by showing it both real examples and carefully crafted artificial ones that mirror real-world complexity. They also created the first comprehensive benchmark (called MatSeg) to test how well AI systems can identify material states across many different situations – from cooking to construction. When tested against leading AI models like Meta’s Segment Anything Model (SAM), their approach performed significantly better at identifying complex material states. The research team has made their dataset, code, and over 300,000 extracted textures publicly available, which should help other researchers build on this work to improve AI’s understanding of how materials appear and change in the real world.

LLM Dataset Inference: Detect Datasets, not Strings

Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic

Paper summary

Recent lawsuits against AI companies have raised questions about using copyrighted content to train LLMs. While previous research tried to identify if specific text examples were in a model’s training data (called membership inference attacks), this paper shows these methods are unreliable and often no better than random guessing. Instead, the researchers adapt “dataset inference” to the large language model setting – a method to determine if an entire dataset (like a book or collection of articles) was used during training. Their approach combines multiple testing techniques and achieves statistical significance in identifying training datasets without false positives. This is more relevant to real-world copyright cases, where authors typically claim their entire works were used for training, rather than individual sentences.

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language

James Requeima, John Bronskill, Dami Choi, Richard Turner, David Duvenaud

Paper summary

This paper introduces “LLM Processes” (LLMPs), a novel approach that enables Large Language Models to make numerical predictions with probability distributions. The key innovation is that these predictions can be guided by natural language descriptions of the problem context. For example, you can tell the model “this is a temperature measurement from Montreal in January” or “this is a stock price that will eventually go to zero,” and it will adjust its predictions accordingly. The researchers demonstrated that LLMPs can perform as well as specialized statistical tools like Gaussian Processes on various tasks, including regression, forecasting, and image reconstruction. Importantly, LLMPs can incorporate natural language guidance to improve predictions – something traditional statistical methods cannot do. The model can handle missing data, work with multiple dimensions, and produce uncertainty estimates about its predictions.

Local Superior Soups: A Catalyst for Model Merging in Cross-Silo Federated Learning

Minghui Chen, Meirui Jiang, Xin Zhang, DOU QI, Zehua Wang, Xiaoxiao Li

Paper summary

Federated learning allows multiple devices to collaboratively train AI models while keeping data private. However, this process typically requires many back-and-forth communications between devices, which can be slow and resource-intensive. The researchers developed a new method called “Local Superior Soups” (LSS) that significantly reduces the number of communication rounds needed. LSS works by cleverly combining multiple model versions on each device before sharing them with others. It uses two key strategies: a “diversity” term that ensures different model versions explore different aspects of the problem, and an “affinity” term that keeps the models from straying too far from their original starting point. In experiments across four different datasets, LSS achieved better performance with far fewer communication rounds compared to existing methods.

MAmmoTH2: Scaling Instructions from the Web

Xiang Yue, Tianyu Zheng, Ge Zhang, Wenhu Chen

Paper summary

The researchers developed a three-step process to harvest 10 million naturally occurring instruction examples from the internet. First, they find relevant documents; second, they extract question-answer pairs; and third, they refine these pairs using open-source AI models. This approach avoids expensive human annotation or GPT-4 generation that other methods require.

When they trained language models on this data, the results showed significant improvements. For example, their MAmmoTH2-7B model’s performance increased from 11% to 36.7% on math problems and from 36% to 68.4% on grade school math, without using any training data from those specific tests. The model performed well across multiple types of reasoning tasks.

What makes this approach unique is that instead of creating new instruction data, it finds and cleans naturally occurring examples from the web, making it more cost-effective and scalable than existing methods.

Many-shot Jailbreaking

Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, James Sully, Alex Tamkin, Tamera Lanham, Karina Nguyen, Tomasz Korbak, Jared Kaplan, Deep Ganguli, Samuel Bowman, Ethan Perez, Roger Grosse, David Duvenaud

Paper summary

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.

MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev, Anton Bushuiev, Niek de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David Wishart, Liping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus Mak, Soha Hassoun, Florian Huber, Justin J.J. van der Hooft, Michael Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal

Paper summary

The researchers created MassSpecGym, the largest publicly available collection of 231,000 high-quality labeled mass spectrometry spectra representing 29,000 unique molecules. The benchmark defines three key challenges for AI models:

Generating molecular structures from scratch (de novo generation)
Finding matching molecules from a database (molecule retrieval)
Predicting what a molecule’s spectrum would look like (spectrum simulation)

What makes this benchmark valuable is that it standardizes these tasks and makes them accessible to the broader machine learning community, rather than requiring deep expertise in mass spectrometry. The authors also developed a novel way to split the data for training and testing that ensures models are truly learning to generalize rather than memorizing similar molecules. Their evaluation of baseline models shows that while current methods work reasonably well, there’s still significant room for improvement, suggesting this benchmark could help drive progress in molecular discovery.

The Minimax Regret of Sequential Probability Assignment, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood

Ziyi Liu, Idan Attias, Dan Roy

Paper summary

Imagine you need to repeatedly predict probabilities for future events, like weather forecasting, where you get some relevant information (context) before making each prediction. ow well can you do compared to the best expert in hindsight from a given class of experts? This paper studies this fundamental problem. The researchers introduce a new way to measure how hard such prediction tasks are, called the “contextual Shtarkov sum.” They prove this measure perfectly captures the fundamental limits of how well any algorithm can perform. Using this insight, they develop an optimal algorithm called contextual Normalized Maximum Likelihood (cNML). Their theoretical framework extends previous work in two important ways: it can handle cases with more than two possible outcomes (not just binary yes/no predictions), and it works with experts that can use the history of all previous predictions (not just the current context). The researchers also use their new measure to improve existing performance bounds, providing a simpler and tighter analysis than previous work. While the optimal algorithm they develop may be computationally intensive, it provides an important theoretical benchmark and could guide the development of more practical approaches.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max KU, Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

Paper summary

As artificial intelligence language models continue to improve, many standard tests used to evaluate them are becoming less useful, with top models all scoring similarly well. This paper introduces MMLU-Pro, a more challenging and reliable benchmark designed to better distinguish between models’ capabilities. MMLU-Pro enhances the original MMLU benchmark in several ways: it increases answer choices from 4 to 10 options, adds more complex reasoning questions, removes trivial questions, and undergoes expert review. The benchmark covers 14 subjects including mathematics, physics, law, and psychology. Testing shows that MMLU-Pro is significantly more challenging – even the best models score 16-33% lower than on MMLU. Importantly, the benchmark is more stable (less affected by prompt variations) and better reveals real differences between models. For example, while GPT-4 and GPT-4-Turbo score nearly identically on MMLU, there’s a 9% gap between them on MMLU-Pro. The researchers also found that using chain-of-thought reasoning significantly improves performance on MMLU-Pro, suggesting it truly tests reasoning ability rather than just knowledge recall. Even the best current models have substantial room for improvement on this benchmark.

Observational Scaling Laws and the Predictability of Langauge Model Performance

Yangjun Ruan, Chris Maddison, Tatsunori Hashimoto

Paper summary

This paper introduces a cheaper and more efficient way to predict how language models will perform as they get bigger. Instead of needing to train many new models at different sizes (which is very expensive), the researchers found they could make accurate predictions by analyzing data from about 100 existing publicly available models. The key insight is that language model performance can be explained by just a few fundamental “capability dimensions.” These capabilities grow predictably with compute power within each model family, which allows researchers to make forecasts about future performance. The researchers validated their approach by accurately predicting several complex behaviors: when models would develop new abilities, how well they would perform on agent tasks (like GPT-4), and how much benefit they would get from advanced prompting techniques. This new method is significant because it makes scaling analysis much more accessible to researchers who don’t have massive compute budgets. It also provides higher resolution insights since it can use data from many more models than traditional approaches that require training new models from scratch.

On the Efficiency of ERM in Feature Learning

Ayoub El Hanchi, Chris Maddison, Murat Erdogdu

Paper summary

This paper explores how well machine learning algorithms perform when they need to figure out both which features matter and how to use them to make predictions. For example, in trying to predict house prices, you can use many different features including square footage, number of bedrooms, and location. But which one is the best? The researchers discovered something surprising: when you have enough data, algorithms can learn which features to use almost as well as if they’d been told the right features from the start. It’s like the algorithm eventually figures out that square footage matters more than, say, the color of the front door. This is particularly important because it helps explain why complex modern machine learning models work better than expected. The researchers proved mathematically that when only a small number of features are actually useful for predictions, the algorithm can more easily identify them, even when given many possible features to choose from. These findings could help us better understand when and why machine learning works, potentially leading to more efficient and reliable AI systems.

One Sample Fits All: Approximating All Probabilistic Values Simultaneously and Efficiently

Weida Li, Yaoliang Yu

Paper summary

“One Sample Fits All” (OFA) is a new method that efficiently calculates multiple types of probabilistic values – mathematical tools used in AI to assess the importance of data or features. Previously, calculating these values required separate computations for each type, which was computationally expensive and inefficient. Their framework uses a single sampling process to approximate all types of probabilistic values simultaneously, significantly reducing computational costs. They created two variants: one optimized for general use across all types (OFA-A), and another that can be tuned for specific types (OFA-S). The method achieves the current best performance for certain important types of probabilistic values, particularly Beta Shapley values while maintaining strong performance across other types. They also showed how their method connects to existing statistical techniques, specifically least square regression problems. Through extensive theoretical analysis and empirical testing, they demonstrated that their approach not only matches or exceeds the performance of existing methods but does so while being more computationally efficient. This advancement makes it more practical to use these important mathematical tools in real-world AI applications.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Jing Hua Toh, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

Paper summary

Researchers have developed OSWorld, a new testing environment that allows AI agents to interact with real computer operating systems and applications, rather than just simulated environments. This addresses a major gap in current AI testing, where most environments are either non-interactive or limited to specific applications like web browsers. OSWorld includes 369 real-world tasks that test AI agents’ ability to use various applications like spreadsheets, email, and web browsers, similar to how humans use computers. Each task comes with detailed setup instructions and evaluation scripts to measure success accurately. When testing current state-of-the-art AI models (including GPT-4V, Gemini, and Claude-3), the results showed significant limitations. While humans could successfully complete about 72% of tasks, the best AI model achieved only 12.24% success. The AI models particularly struggled with accurately controlling the mouse, understanding complex interfaces, and working across multiple applications. This research highlights the substantial gap between current AI capabilities and human-level computer operation, while providing a comprehensive platform for developing and testing more capable AI systems in the future.

Policy Improvement using Language Feedback Models

Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté

Paper summary

Researchers have developed a new way to help AI systems better follow instructions by creating “Language Feedback Models” (LFMs). Instead of using expensive large language models directly, they created a system that first learns from these models’ feedback about what actions are helpful, then uses this knowledge to train smaller, more efficient AI systems. Think of it like having an expert teacher first provide feedback on student actions, then using that feedback to create a more accessible teaching assistant that can help many students improve. The system proved successful across three different types of tasks: navigating city streets, performing kitchen tasks, and conducting science experiments. Importantly, this approach was not only more effective than using large language models directly but also more cost-efficient. The system could adapt to new situations without additional training and provided feedback that humans could understand and verify. This research represents a significant step forward in making AI systems better at following instructions while keeping costs manageable and maintaining transparency in how the AI makes decisions.

Quantum Deep Equilibrium Models

Philipp Schleich, Marta Skreta, Lasse Kristensen, Rodrigo Vargas-Hernandez, Alan Aspuru-Guzik

Paper summary

Current quantum machine learning faces, among others, the following two key challenges: deep circuits accumulate errors, and evaluating gradients requires many measurements, more of which are necessary for higher parameter counts. The researchers propose QDEQ as a solution – adapting classical deep equilibrium models to quantum computing. Rather than using many explicit circuit layers, QDEQ finds fixed points that effectively simulate an infinite-depth network using much shallower circuits. They test this approach on image classification tasks using 4-10 qubits and find that QDEQ can match or exceed the performance of models with 5x more layers while using significantly fewer parameters. This is particularly important for near-term quantum computers where circuit depth must be minimized.

Representation noising effectively prevents harmful fine-tuning on LLMs

Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Robie Gonzales, Carsten Maple, Subhabrata Majumdar, Hassan Sajjad, Frank Rudzicz

Paper summary

Current safety measures for LLMs can be easily bypassed through fine-tuning, creating a significant risk when releasing open-source models. The researchers propose RepNoise as a solution – a defense mechanism that works by deliberately “noising” (degrading) the model’s internal representations of harmful content across all layers of the network. This makes it much harder for attackers to recover harmful capabilities through fine-tuning, even when they have full access to model weights. RepNoise works by using a three-part loss function that: 1) reduces predictive information about harmful outputs, 2) retains capability on harmless tasks, and 3) pushes harmful representations toward random noise. The method is shown to be effective at defending against harmful fine-tuning while maintaining model performance on benign tasks.

Reward Machines for Deep RL in Noisy and Uncertain Environments

Andrew Li, Zizhao Chen, Toryn Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila McIlraith

Paper summary

Reward Machines offer a framework for formally representing complex, reward-worthy behaviours, while exposing the reward function structure to expedite reinforcement learning (RL). Prior Reward Machine algorithms have traditionally ignored the inherent uncertainty in the occurrence of key events (like reaching a desired location or picking up a particular object), that can present in real-world settings due to noisy sensors or partial observability. The researchers introduce a new Reward Machine framework for training RL agents that are aware of the uncertainty regarding the occurrence of these key events, and that learn to act accordingly. Through theory and experiments, they expose the pitfalls of ignoring or naively incorporating this uncertainty into an agent’s decision making, which can result in unintended or dangerous behaviours. They further demonstrate how the impact of this uncertainty can be mitigated to train RL agents that are safer and more reliable.

SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

Xuanchi Ren, Yifan Lu, Hanxue Liang, Jay Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, Jiahui Huang

Paper summary

This research tackles the challenge of creating detailed 3D models from just a few photographs of a scene. While existing methods either need many overlapping photos or produce blurry results, this new approach called SCube can create high-quality 3D reconstructions from as few as three non-overlapping images in just 20 seconds. The key to SCube’s success is its novel combination of techniques: it uses a hybrid representation called VoxSplat that combines the efficiency of voxels (3D pixels) with the visual quality of 3D Gaussian points. The system works in two stages – first determining the basic structure and geometry of the scene, then filling in the appearance details. The researchers tested SCube on the Waymo self-driving car dataset, showing it outperforms existing methods in both quality and speed. The system can reconstruct large-scale scenes spanning hundreds of meters and has practical applications in autonomous driving, augmented reality, and even converting text descriptions into 3D scenes. This represents a significant step forward in 3D reconstruction technology, making it much more practical to create detailed 3D models from limited photo input.

A Separation in Heavy-Tailed Sampling: Gaussian vs. Stable Oracles for Proximal Samplers

Ye He, Alireza Mousavi-Hosseini, Krishnakumar Balasubramanian, Murat Erdogdu

Paper summary

This paper examines methods for sampling from “heavy-tailed” probability distributions – distributions where extreme values occur more frequently than in standard normal distributions. These arise in many real-world applications, from financial modeling to robust statistics.

The researchers prove a fundamental difference between two approaches to this problem: methods based on Gaussian (normal) distributions versus those based on stable distributions. They show that Gaussian-based methods must inherently take many more steps to achieve high accuracy, while stable-based methods can converge much faster. Specifically, for a desired accuracy ε, Gaussian methods require polynomial time in 1/ε (meaning they get much slower as higher accuracy is needed), while stable methods only need logarithmic time in 1/ε (meaning they remain efficient even for high accuracy requirements). The researchers prove this isn’t just a limitation of current techniques, but a fundamental mathematical barrier. The paper also provides practical implementations for certain cases and proves lower bounds showing their results are essentially optimal. This theoretical work helps explain why certain sampling methods work better in practice and provides guidance for algorithm selection in real applications.

Sequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity

Vahid Balazadeh, Keertana Chidambaram, Viet Nguyen, Rahul G. Krishnan, Vasilis Syrgkanis

Paper summary

The researchers present ExPerior, a novel empirical Bayes approach for sequential decision-making that leverages expert demonstrations while accounting for unobserved contextual information. The algorithm treats expert demonstrations as solutions to related but slightly different problems, using them to establish an informative prior distribution over the learner’s decision space. This approach is particularly valuable in applications like self-driving cars, healthcare, and finance, where experts make decisions using contextual information unavailable to the learning agent. ExPerior employs two methods to learn the prior: a parametric approach utilizing existing knowledge about the prior’s form, and a nonparametric maximum entropy approach for cases lacking such knowledge. The framework outperforms existing baselines across multi-armed bandits, Markov decision processes (MDPs), and partially observable MDPs. For multi-armed bandits, the authors prove that ExPerior’s Bayesian regret correlates with the entropy of the optimal action under the prior distribution, providing theoretical validation for the algorithm’s effectiveness.

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Wenjing Hu, Yuchen Mao, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

Paper summary

Spider2-V introduces a comprehensive benchmark for evaluating multimodal agents’ capabilities in automating data science and engineering workflows. The benchmark features 494 real-world tasks across 20 enterprise-level applications, integrating both code generation and GUI operations in an executable computer environment. Tasks span data warehousing, ingestion, transformation, visualization, and orchestration using tools like BigQuery, dbt, and Airbyte. To ensure reliable evaluation, the authors developed 170 automatic task configurations and 151 customized evaluation metrics. Empirical results reveal significant limitations in current state-of-the-art models – even GPT-4V achieves only 14.0% success rate, with performance dropping to 1.2% on complex tasks requiring over 15 steps. The study identifies key challenges in handling authentic user accounts (10.6% success) and fine-grained GUI operations. The findings suggest that while multimodal agents show promise, they remain far from reliably automating complete data workflows, highlighting crucial areas for improvement in action grounding and complex task execution.

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Yanting Miao, William Loh, Suraj Kothawade, Pascal Poupart, Abdullah Rashwan, Yeqing Li

Paper summary

This research presents a new approach to subject-driven text-to-image generation that addresses the limitations of current methods like DreamBooth and SuTI. The authors introduce the λ-Harmonic reward function, which enables early stopping and provides reliable reward signals for training, alongside RPO, a preference-based reinforcement learning method. The system requires only 3% of the negative samples used by DreamBooth while achieving superior results. Unlike existing methods, RPO fine-tunes only the U-Net component without requiring text encoder training or embedding optimization. The approach achieves state-of-the-art performance on DreamBench with a CLIP-I score of 0.833 and CLIP-T score of 0.314. The system demonstrates strong performance in preserving subject identity while adapting to various contexts, requiring only 5-20 minutes of training time on Cloud TPU V4. The λ-Harmonic function proves particularly effective at preventing overfitting and balancing similarity to reference images with text prompt faithfulness.

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, S Basu, Wenhu Chen, William Yang Wang

Paper summary

T2V-Turbo addresses the key challenge in text-to-video generation: achieving both speed and quality. The system incorporates feedback from multiple reward models – both image-text and video-text – into the consistency distillation process of pre-trained text-to-video models. Unlike previous approaches, T2V-Turbo optimizes rewards for single-step generations, avoiding memory constraints associated with backpropagating through iterative sampling. The model achieves remarkable results, with its 4-step generations outperforming state-of-the-art models on VBench, including proprietary systems like Gen-2 and Pika. Human evaluations confirm that T2V-Turbo’s 4-step generations are preferred over 50-step DDIM samples from teacher models, representing more than 12x acceleration while improving quality. The system requires only 3% of the negative samples used by DreamBooth and achieves training in just 5-20 minutes on Cloud TPU V4.

Temporal-Difference Learning Using Distributed Error Signals

Jonas Guan, Shon Verch, Claas Voelcker, Ethan Jackson, Nicolas Papernot, William Cunningham

Paper summary

The researchers address a fundamental question in biological reward-based learning: how the brain’s nucleus accumbens (NAc) coordinates learning using only locally distributed dopamine signals. They develop Artificial Dopamine, a deep Q-learning algorithm that mirrors this biological constraint by using synchronously distributed, per-layer temporal-difference errors. Unlike traditional approaches using backpropagation, AD cells compute their own local errors and update independently. The system uses forward connections in time to relay information between layers through activations rather than error signals. The algorithm was evaluated on MinAtar games, DeepMind Control Suite tasks, and classic control problems. Results show AD often achieves comparable performance to standard deep RL algorithms that use backpropagation, despite not propagating error signals between layers. The study provides computational evidence that distributed error signals alone may be sufficient for coordinated reward-based learning, offering insights into both biological learning mechanisms and new approaches to artificial neural networks.

Training Data Attribution via Approximate Unrolling

Juhan Bae, Wu Lin, Jonathan Lorraine, Roger Grosse

Paper summary

The paper introduces SOURCE, a new technique for understanding how individual pieces of training data influence a machine learning model’s behavior. This is important because understanding which training examples are most influential helps researchers interpret, debug, and improve AI models. Previous methods either couldn’t handle complex real-world scenarios or required too much computational power to be practical. SOURCE solves this by dividing the training process into segments and analyzing data influence within each segment, using mathematical approximations to keep calculations efficient. The researchers tested SOURCE across various tasks including image classification, text analysis, and language modeling. They found it performed better than existing methods at predicting how removing specific training data would affect the model, especially in challenging scenarios like partially-trained models or multi-stage training processes. The approach is particularly valuable for modern machine learning systems that often use complex training procedures.

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin

Paper summary

WildVision introduces two major contributions to evaluate vision-language AI models: an interactive platform called WildVision-Arena where users can compare different models in real-world scenarios, and WildVision-Bench, a benchmark created from these real-world interactions. The researchers collected over 20,000 conversations and 8,000 user votes, creating one of the largest datasets of human preferences for vision-language models. Their analysis revealed that while top models like GPT-4 perform well on simple tasks, they still struggle with challenges like subtle visual details, spatial reasoning, and expert domain knowledge. The benchmark they developed shows a strong correlation (0.94) with human preferences, suggesting it effectively captures real-world model performance. The platform continues to track the performance of over 20 different vision-language models, providing valuable insights into their strengths and weaknesses. By focusing on real-world interactions rather than traditional benchmarks, this work provides a more practical understanding of how these models perform in actual use cases and highlights areas needing improvement.

Additional affiliated research papers

Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Hao Tang, Keya Hu, Jin Zhou, Si Cheng Zhong, Wei-Long Zheng, Xujie Si, Kevin Ellis

A Comprehensive Analysis on the Learning Curve in Kernel Ridge Regression

Tin Sum Cheng, Aurelien Lucchi, Anastasis Kratsios, David Belius

Conformal Inverse Optimization

Bo Lin, Erick Delage, Timothy Chan

Continual Learning of Foundation Models with Limited Labeled Data

Shuvendu Roy, Elham Dolatabadi, Arash Afkanpour, Ali Etemad

Convolutions and More as Einsum: A Tensor Network Perspective with Advances for Second-Order Methods

Felix Dangel

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads

Fusion Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun

DiffAug: A Diffuse-and-Denoise Augmentation for Training Robust Classifiers

Chandramouli Shama Sastry, Sri Harsha Dumpala, Sageev Oore

EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records

Yeonsu Kwon, Jiho Kim, Gyubok Lee, Seongsu Bae, Daeun Kyung, Wonchul Cha, Tom Pollard, Alistair Johnson, Edward Choi

EHRMamba: Towards Generalizable and Scalable Foundation Models for Electronic Health Records

Adibvafa Fallahpour, Mahshid Alinoori, Wenqian Ye, Xu Cao, Arash Afkanpour, Amrit Krishnan

Energy-Guided Continuous Entropic Barycenter Estimation for General Costs

Alexander Kolesov, Petr Mokrov, Igor Udovichenko, Milena Gazdieva, Gudmund Pammer, Anastasis Kratsios, Evgeny Burnaev, Aleksandr Korotin

Epistemic Integrity in Large Language Models

Bijean Ghafouri, Shahrad Mohammadzadeh, James Zhou, Pratheeksha Nair, Jacob-Junqi Tian, Mayank Goel, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine

Evaluating RAG System Performance: The Impact of Knowledge Cut-off and Fine-Tuning

Omkar Dige, John Willes, D. B. Emerson

Exactly Minimax-Optimal Locally Differentially Private Sampling

Hyun-Young Park, Shahab Asoodeh, Si-Hyeon Lee

Exploring Visual Prompt Tuning for Demographic Adaptation in Foundation Models for Medical Imaging

Artur Parkhimchyk, Amirreza Naziri, Laleh Seyyed-Kalantari

Fact or Fiction? Can LLMs be Reliable Annotators for Political Truths?

Veronica Chatrath, Marcelo Lotif, Shaina Raza

Fairness Of AI Models in vector embedded Chest X-ray representations

Gebreyowhans Hailekiros Bahre, Hassan Hamidi, Francesco Calimeri, Andrew Sellergren, Leo Anthony Celi, Laleh Seyyed-Kalantari

FLAME : Factuality-Aware Alignment for Large Language Models

Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, Xilun Chen

Frequency-aware Generative Models for Multivariate Time-series Imputation

XINYU YANG, Yu Sun, Yuan xiaojie, Xinyang Chen

GaussianCut: Interactive segmentation via graph cut for 3D Gaussian Splatting

Umangi Jain, Ashkan Mirzaei, Igor Gilitschenski

Human-AI Alignment in Chess with Skill-Aware Attention

Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson

Hypergraph-Based Fuzzy Assembled Representation for Open-Set 3D Object Retrieval

Yang Xu, Yifan Feng, Jun Zhang, Jun-Hai Yong, Yue Gao

The Implicit Bias of Heterogeneity towards Invariance: A Study of Multi-Environment Matrix Sensing

Yang Xu, Yihong Gu, Cong Fang

Kronecker-Factored Approximate Curvature for Physics-Informed Neural Networks

Felix Dangel, Johannes Müller, Marius Zeinhofer

L4GM: Large 4D Gaussian Reconstruction Model

Jiawei Ren, Cheng Xie, Ashkan Mirzaei, hanxue liang, xiaohui zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling

Learning from Noisy Labels via Conditional Distributionally Robust Optimization

Hui GUO, Grace Yi, Boyu Wang

Library Learning Doesn’t: The Curious Case of the Single-Use “Library”

Ian Berlot-Attwell, Frank Rudzicz, Xujie Si

Linguistic Collapse: Neural Collapse in (Large) Language Models

Robert Wu, Vardan Papyan

LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation

Bowen Li, Zhaoyu Li, Qiwei Du, Jinqi Luo, Wenshan Wang, Yaqi Xie, Simon Stepputtis, Chen Wang, Katia Sycara, Pradeep Ravikumar, Alexander Gray, Xujie Si, Sebastian Scherer

Minimum Entropy Coupling with Bottleneck

Reza Ebrahimi, Jun Chen, Ashish Khisti

MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures

Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You

NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Scott Yih, Victoria Lin

Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey Allen, Thomas Kipf

Principled Probabilistic Imaging using Diffusion Models as Plug-and-Play Priors

Zihui Wu, Yu Sun, Yifan Chen, Bingliang Zhang, Yisong Yue, Katherine Bouman

Propensity Score Alignment of Unpaired Multimodal Data

Johnny Xi, Jana Osea, Zuheng Xu, Jason Hartford

Proportional Fairness in Non-Centroid Clustering

Ioannis Caragiannis, Evi Micha, Nisarg Shah

QueST: Self-Supervised Skill Abstractions for Learning Continuous Control

Atharva Anil Mete, Haotian Xue, Albert Wilcox, Yongxin Chen, Animesh Garg

Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding

Daniel Severo, Ashish Khisti, Alireza Makhzani

Reinforcement Learning-Guided Semi-Supervised Learning

Marzi Heidari, Hanping Zhang, Yuhong Guo

Safe and Sound- Evaluating Language Models for Bias Mitigation and Understanding

Shaina Raza, Shardul Ghuge, Oluwanifemi Bamgbose, Deval Pandya

Sample-Efficient Agnostic Boosting

Udaya Ghai, Karan Singh

Sample-Efficient Private Learning of Mixtures of Gaussians

Hassan Ashtiani, Mahbod Majid, Shyam Narayanan

Semi-Open 3D Object Retrieval via Hypergraph-Based Hierarchical Equilibrium Representation

Yang Xu, Yifan Feng, Jun Zhang, Jun-Hai Yong, Yue Gao

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Eshta Bhardwaj, Harshit Gujral, Siyi Wu, Ciara Zogheib, Tegan Maharaj, Christoph Becker

SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Shama Sastry, Evangelos Milios, Sageev Oore, Hassan Sajjad

Targeted Sequential Indirect Experiment Design

Elisabeth Ailer, Niclas Dern, Jason Hartford, Niki Kilbertus

Teaching LLMs How To Learn with Contextual Fine-Tuning

Younwoo Choi*, Muhammad Adil Asif*, Ziwen Han, John Willes, Rahul Krishnan

Towards the Dynamics of a DNN Learning Symbolic Interactions

Qihan Ren, Yang Xu, Junpeng Zhang, Yue Xin, Dongrui Liu, Quanshi Zhang

Towards Understanding Evolving Patterns in Sequential Data

QIUHAO Zeng, Long-Kai Huang, Qi CHEN, Charles Ling, Boyu Wang

Variational Last Layers for Bayesian Optimization

Paul Brunzema*, Mikkel Jordahn*, John Willes, Sebastian Trimpe, Jasper Snoek, James Harrison

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Paper summary

Additional affiliated research papers

Related:

Vector researchers tackle real-world AI challenges at ICML 2025

Transforming Youth Mental Health Support: FAIIR’s AI-Powered Crisis Response Model

AI Weather Forecasting Breakthrough: How Canadian Innovation is Transforming Climate Prediction | Aardvark Weather