Vector researchers are presenting over a dozen papers at CVPR 2024
June 17, 2024
Vector researchers are presenting more than 12 papers at this year’s IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 (CVPR 2024). The conference is being held in Seattle, WA from June 17 to 21.
Four of the papers from Vector-affiliated researchers were co-authored by Vector Faculty Member Sanja Fidler, showcasing new methods for detecting 3D objects and generating 3D content, among other breakthroughs. Two more were co-authored by Vector Faculty Member Wenhu Chen, one of which introduces a new algorithm for combining multiple large language models (LLMs) to make online predictions.
Below are simplified summaries for the accepted papers and poster sessions from Vector Faculty Members.
Paper descriptions written by AI and edited by paper co-authors.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Poster Session 3 & Exhibit Hall
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
The MMMU (Massive Multi-discipline Multimodal Understanding) benchmark is a new tool designed to test the capabilities of AI models in understanding and reasoning across a wide range of college-level subjects. Developed by experts from institutions like IN.AI Research, University of Waterloo, and The Ohio State University, it covers six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. MMMU includes 11,500 carefully collected questions from college exams, quizzes, and textbooks, spanning 30 subjects and 183 subfields. These questions are designed to test the AI’s ability to handle different types of images and text simultaneously, such as charts, diagrams, and maps. The aim is to push AI models to exhibit expert-level perception and reasoning. The benchmark has challenged existing AI models, including the proprietary GPT-4V (Vision), which achieved only 56% accuracy, showing there is a lot of room for improvement. This benchmark is expected to encourage the development of advanced multimodal AI models that are not just generalists but experts in handling specific, domain-focused tasks.
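To make the evaluation setup concrete, here is a minimal sketch of how a multimodal model could be scored on MMMU-style multiple-choice items. The record fields and the `my_model_answer` function are hypothetical stand-ins for illustration, not the benchmark's official evaluation code.

```python
# Minimal sketch: scoring a multimodal model on multiple-choice benchmark items.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BenchmarkItem:
    image_path: str        # chart, diagram, map, photo, ...
    question: str
    options: List[str]     # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str            # gold option letter, e.g. "B"
    subject: str           # one of the 30 subjects

def my_model_answer(item: BenchmarkItem) -> str:
    """Hypothetical model call: returns a single option letter."""
    raise NotImplementedError

def evaluate(items: List[BenchmarkItem]) -> Dict[str, float]:
    """Overall and per-subject accuracy."""
    correct, per_subject = 0, defaultdict(lambda: [0, 0])
    for it in items:
        hit = my_model_answer(it).strip().upper() == it.answer.upper()
        correct += hit
        per_subject[it.subject][0] += hit
        per_subject[it.subject][1] += 1
    scores = {s: c / n for s, (c, n) in per_subject.items()}
    scores["overall"] = correct / max(len(items), 1)
    return scores
```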
Instruct-Imagen: Image Generation with Multi-modal Instruction
Poster Session 2 & Exhibit Hall
Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, William Cohen, Ming-Wei Chang, Xuhui Jia
This research introduces the MoE-F algorithm for combining multiple large language models (LLMs) to make online predictions, such as forecasting stock market movements. Rather than mixing the LLMs in a fixed way, MoE-F adapts the weighting of each LLM over time based on its recent performance. The core idea is to treat the problem like a “hidden Markov model”: as if there is an unobservable signal dictating which LLM performs best at each time step. MoE-F uses a mathematical technique called stochastic filtering to estimate this hidden signal from the observable prediction errors of each LLM. Theoretical guarantees are provided for the optimality of the filtering equations and mixing weights. Experiments on a stock market prediction task show MoE-F achieves strong results, with a 17% absolute improvement in F1 score over the best individual LLM. In summary, MoE-F provides an adaptive, theoretically grounded way to combine LLMs for online prediction tasks. It shows the potential to boost performance over using a single model.
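The following is a toy sketch of the general filtering idea described above: treat “which LLM is currently best” as a hidden state that can switch over time, and update a belief over it from each model's observed losses. The exponential loss likelihood, the switching probability, and the function names are illustrative assumptions, not the MoE-F algorithm itself.

```python
import numpy as np

def filtered_mixture_weights(losses, switch_prob=0.05, temperature=1.0):
    """
    Toy forward filter over a hidden "best expert" state.

    losses: (T, K) array, the per-step loss of each of K experts (LLMs).
    Returns: (T, K) array of mixture weights, updated online from the losses seen so far.
    """
    T, K = losses.shape
    belief = np.full(K, 1.0 / K)                                    # prior over which expert is best
    transition = (1.0 - switch_prob) * np.eye(K) + switch_prob / K  # sticky regime switching
    weights = np.zeros((T, K))
    for t in range(T):
        belief = transition @ belief                 # predict: the best expert may have switched
        belief *= np.exp(-losses[t] / temperature)   # update: low loss -> more likely the best
        belief /= belief.sum()
        weights[t] = belief
    return weights

# Usage: combined prediction at step t = sum_k weights[t, k] * prediction of expert k at step t
```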
3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features
Poster Session 3 & Exhibit Hall
Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany
3DiffTection is a new method for detecting 3D objects from a single image using features from a 3D-aware diffusion model. Typical 3D object detection requires annotating large datasets with 3D bounding boxes, which is very time consuming.
3DiffTection gets around this by enhancing features from a pre-trained 2D diffusion model so that they become 3D-aware, which it does in two stages.
At test time, 3DiffTection additionally generates multiple virtual views of the image and aggregates the 3D bounding box predictions from each view to further boost accuracy. On the challenging Omni3D dataset, 3DiffTection significantly outperforms prior state-of-the-art methods like Cube-RCNN for single-image 3D detection. It is also very data efficient, achieving competitive results with only 10% of the training data.
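As a rough illustration of the test-time ensembling step, the sketch below runs a detector on several virtual views and merges the resulting 3D boxes with a simple center-distance suppression. `render_virtual_view` and `detector` are hypothetical callables, and the merging rule is a simplification rather than the paper's exact aggregation scheme.

```python
import numpy as np

def merge_detections(boxes, scores, dist_thresh=0.5):
    """Keep the highest-scoring box among near-duplicates (simple center-distance NMS).
    boxes: (N, 7) array of [x, y, z, w, h, l, yaw]; scores: length-N sequence."""
    order = np.argsort(-np.asarray(scores))
    kept = []
    for i in order:
        if all(np.linalg.norm(boxes[i, :3] - boxes[j, :3]) > dist_thresh for j in kept):
            kept.append(i)
    return boxes[kept], np.asarray(scores)[kept]

def detect_with_virtual_views(image, camera, view_offsets, render_virtual_view, detector):
    """render_virtual_view and detector are hypothetical stand-ins for the diffusion-based
    novel-view synthesis and the 3D detection head, respectively."""
    all_boxes, all_scores = [], []
    for offset in view_offsets:                   # small virtual camera motions around the input
        view, view_cam = render_virtual_view(image, camera, offset)
        boxes, scores = detector(view, view_cam)  # boxes expressed in the reference camera frame
        all_boxes.append(boxes)
        all_scores.extend(scores)
    return merge_detections(np.concatenate(all_boxes, axis=0), all_scores)
```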
XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
Poster Session 1 & Exhibit Hall
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams
This paper introduces X³ (pronounced XCube), a new method for generating high-resolution 3D shapes represented as sparse voxel hierarchies. The key idea is to use a hierarchy of latent diffusion models that generates the 3D shape in a coarse-to-fine manner. At each level of the hierarchy, a variational autoencoder first encodes the sparse voxel grid into a compact latent representation. Then a latent diffusion model generates the next finer level of detail, conditioned on the coarser level. This allows generating shapes with fine geometric details up to 1024³ voxel resolution in under 30 seconds. The generated voxels can include attributes like surface normals and semantic labels. The method outperforms prior work on standard 3D object datasets like ShapeNet and Objaverse. It can generate objects from text descriptions or class labels. The authors also demonstrate generating large-scale outdoor scenes of 100m × 100m at 10cm voxel resolution, a first for this type of generative model. Additional capabilities are shown, like user-guided editing of coarse voxels to control finer details, completing a partial 3D scan into a full shape, and ensembling multiple generated views to improve results. Overall, X³ represents an advance in high-quality 3D content creation.
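The coarse-to-fine sampling loop can be pictured roughly as follows; the per-level `samplers` and `decoders` are hypothetical placeholders for the latent diffusion models and VAE decoders described above, not the authors' code.

```python
def generate_coarse_to_fine(samplers, decoders, condition=None):
    """
    samplers[l]: hypothetical callable that samples a latent for level l with a latent
                 diffusion model, conditioned on the previous (coarser) level's voxels.
    decoders[l]: hypothetical VAE decoder mapping that latent to a sparse voxel grid.
    Returns the finest-level voxel grid (e.g. working up toward 1024^3 resolution).
    """
    voxels = None  # the coarsest level is conditioned only on `condition` (text or class label)
    for sample_latent, decode in zip(samplers, decoders):
        latent = sample_latent(coarse_voxels=voxels, condition=condition)
        voxels = decode(latent)  # each pass produces a finer grid than the last
    return voxels
```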
Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models
Poster Session 2 & Exhibit Hall
Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, Karsten Kreis
Align Your Gaussians (AYG) is a new method for generating dynamic 3D objects and scenes from text descriptions. The key idea is to combine multiple diffusion models that excel at different aspects – text-to-image models for high visual quality, text-to-video models for realistic motion, and 3D-aware models for geometric consistency.
The approach has two main stages: a static 3D object is generated first, and it is then animated over time to produce the full dynamic (“4D”) result.
Several novel techniques are introduced to improve the generation process, such as regularizing the evolving distribution of 3D Gaussians and amplifying motion. The approach also allows extending sequences in time and combining multiple dynamic objects. Experiments show that AYG outperforms the prior state-of-the-art in text-to-4D generation. Example applications include rapidly generating assets for games/VR, synthetic data generation with tracking labels, and creative animation.
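One way to picture the composition of multiple diffusion models is as a weighted sum of score-distillation-style gradients, sketched below. The `noise_predictors`, the weights, and the scheduling details are illustrative assumptions rather than the paper's exact objective.

```python
import torch

def composed_sds_grad(rendered, noise_predictors, weights, prompt_embeds, t, alphas_cumprod):
    """
    rendered: differentiably rendered images/frames of the current 3D Gaussian scene.
    noise_predictors: hypothetical callables, one per diffusion model being composed
                      (text-to-image, text-to-video, 3D-aware, ...).
    Returns a gradient direction to backpropagate through the renderer into the scene parameters.
    """
    noise = torch.randn_like(rendered)
    a_t = alphas_cumprod[t]
    noisy = a_t.sqrt() * rendered + (1.0 - a_t).sqrt() * noise   # forward-diffuse the render
    grad = torch.zeros_like(rendered)
    for predict_noise, w in zip(noise_predictors, weights):
        with torch.no_grad():
            eps_hat = predict_noise(noisy, t, prompt_embeds)     # each model critiques the same sample
        grad = grad + w * (eps_hat - noise)                      # weighted sum of per-model feedback
    return grad
```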
Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata
Poster Session 5 & Exhibit Hall
Dongsu Zhang, Francis Williams, Žan Gojčič, Karsten Kreis, Sanja Fidler, Young Min Kim, Amlan Kar
This paper proposes hierarchical Generative Cellular Automata (hGCA), a new method for generating complete, high-resolution 3D scenes from sparse LiDAR scans captured by autonomous vehicles. The key idea is to grow the 3D geometry in a coarse-to-fine manner. In the first coarse stage, a Generative Cellular Automata model recursively applies local generation rules to grow a low-resolution voxel representation of the scene. A lightweight “planner” module provides global context so the generation stays consistent across the scene. In the second fine stage, hGCA refines the coarse voxels into a high-resolution continuous surface using local implicit functions. Splitting the generation into two stages improves efficiency. On synthetic street scenes, hGCA outperforms prior state-of-the-art methods on metrics measuring completion quality. It also demonstrates strong generalization from synthetic to real LiDAR scans. The method can even generate some novel objects beyond its synthetic training set by taking geometric cues from the real input scans. Potential industry applications include generating simulation environments for autonomous driving, filling in missing areas in 3D maps, and as a step towards creating realistic open worlds for gaming. However, improving geometric quality and generating textures would make the outputs more practically useful.
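The “grow geometry by local rules” idea can be sketched as follows, with a hypothetical `neighbor_logits_model` standing in for the learned rule (and omitting the planner module and the fine refinement stage).

```python
import torch

def grow(seed_voxels, neighbor_logits_model, steps=10):
    """
    seed_voxels: (N, 3) integer tensor of initially occupied cells (e.g. voxelized LiDAR points).
    neighbor_logits_model: hypothetical model returning (N, 27) logits, one per cell in the
                           3x3x3 neighborhood of each currently occupied cell.
    """
    occupied = {tuple(v.tolist()) for v in seed_voxels}
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    for _ in range(steps):
        coords = torch.tensor(sorted(occupied), dtype=torch.long)   # current occupied cells
        probs = torch.sigmoid(neighbor_logits_model(coords))        # (N, 27) occupancy probabilities
        keep = torch.bernoulli(probs).bool()                        # sample which neighbors to occupy
        new_occupied = set()
        for i, cell in enumerate(coords.tolist()):
            for j, off in enumerate(offsets):
                if keep[i, j]:
                    new_occupied.add((cell[0] + off[0], cell[1] + off[1], cell[2] + off[2]))
        occupied = new_occupied
    return occupied
```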
Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning
Poster Session 2 & Exhibit Hall
Wenlong Deng, Christos Thrampoulidis, Xiaoxiao Li
This paper proposes a new method called Shared and Group Prompt Tuning (SGPT) for federated learning with vision transformer models. Federated learning allows training models on data from multiple clients without the data being shared, but performance can suffer if the client data distributions are very different (heterogeneous). SGPT addresses this by learning both shared prompts, which capture common features across clients, and group-specific prompts, which align the model with clusters of similar clients. A prompt selection module determines which group prompts to use for each input. The method employs a block coordinate descent optimization approach, first learning the shared prompts to capture common information and then the group prompts for more specialized knowledge. Theoretically, the authors bound the gap between the federated model’s global performance and its performance on each client’s local data. SGPT is designed to minimize two key factors influencing this gap – generalization error and distribution discrepancy between clients. Experiments on multiple datasets with both label and feature heterogeneity show that SGPT consistently outperforms state-of-the-art federated learning methods. The approach enables training a single global model that can automatically adapt to diverse local data distributions without client-specific fine-tuning.
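A highly simplified sketch of the two-phase training described above might look like this; the local training and averaging helpers are hypothetical stubs, not the paper's implementation.

```python
def federated_prompt_tuning(clients, groups, shared_prompts, group_prompts,
                            local_train_shared, local_train_group, average,
                            rounds_shared=50, rounds_group=50):
    """
    clients: list of client handles; groups: dict mapping group id -> list of similar clients.
    local_train_shared / local_train_group / average: hypothetical helpers for local updates
    and server-side aggregation (e.g. federated averaging of prompt parameters).
    """
    # Phase 1: learn shared prompts that capture information common to all clients.
    for _ in range(rounds_shared):
        updates = [local_train_shared(c, shared_prompts) for c in clients]
        shared_prompts = average(updates)

    # Phase 2: with shared prompts fixed, learn group-specific prompts for clusters of
    # clients with similar data distributions.
    for _ in range(rounds_group):
        for gid, members in groups.items():
            updates = [local_train_group(c, shared_prompts, group_prompts[gid]) for c in members]
            group_prompts[gid] = average(updates)
    return shared_prompts, group_prompts
```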
Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing
Poster Session 5 & Exhibit Hall
Bi’an Du, Xiang Gao, Wei Hu, Renjie Liao
The paper’s authors propose a new method called the part-whole-hierarchy message-passing network for generating realistic 3D objects by assembling their parts. The key concept is to first group similar parts into “super-parts”, predict the poses (positions and orientations) of these super-parts, and then use that information to more accurately predict the poses of the individual parts.
The method uses a two-stage hierarchical process: super-part poses are predicted first, and the individual part poses are then refined conditioned on them.
On the PartNet dataset, this hierarchical approach achieves state-of-the-art results in accurately assembling 3D objects from their parts. Visually inspecting the process shows the method first gets the placement of major components like chair seats and backs correct, before refining the placement of finer parts like legs and arms. Potential applications include computer-aided design tools that can automatically suggest or complete 3D object designs, robotic assembly planning, and generating large amounts of realistic 3D data for simulation and training. The code can be found here.
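At a high level, the two-stage prediction could be organized as in the sketch below, where the grouping and pose networks are hypothetical placeholders for the paper's message-passing modules.

```python
def assemble(part_point_clouds, group_into_super_parts, super_part_pose_net, part_pose_net):
    """Hypothetical high-level wiring of the two-stage pose prediction."""
    # Stage 1: cluster geometrically similar parts and predict one pose per super-part.
    assignments = group_into_super_parts(part_point_clouds)           # part index -> super-part id
    super_poses = super_part_pose_net(part_point_clouds, assignments)

    # Stage 2: predict each part's pose, conditioned on the pose of its super-part.
    part_poses = part_pose_net(part_point_clouds, assignments, super_poses)
    return part_poses  # e.g. (rotation, translation) per part, used to place parts in the object
```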
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
Poster Session 1 & Exhibit Hall
Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li
This paper proposes a new method called Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for segmenting arbitrary objects in images using large pre-trained vision-language models (VLMs). The key idea is to read segmentation masks directly off the attention that the VLM already computes between text and image, rather than training a dedicated segmentation model.
Crucially, PnP-OVSS requires no additional training or pixel-level annotations, even for hyperparameter tuning. It achieves excellent zero-shot performance on standard segmentation benchmarks, not only outperforming training-free baselines by large margins, but also many methods that finetune the VLMs. Potential applications include automatically labeling datasets for computer vision, enabling robots to recognize arbitrary objects on the fly, and creative image editing tools that allow users to select and modify objects by simply typing their names. The key advantage is how simply and effectively the approach extracts open-vocabulary segmentation capabilities from powerful foundation models, with no extra training required.
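As a generic illustration of the training-free recipe, the sketch below turns one relevance map per class name into a segmentation via a per-pixel argmax. `relevance_map` is a hypothetical callable, and the actual method's attention extraction and refinement steps are more involved.

```python
import numpy as np

def segment(image, class_names, relevance_map, background_thresh=0.2):
    """relevance_map(image, name): hypothetical callable returning an (H, W) heatmap of how
    strongly each pixel relates to the given class name, derived from the VLM's attention."""
    maps = np.stack([relevance_map(image, name) for name in class_names], axis=0)  # (C, H, W)
    labels = maps.argmax(axis=0)                        # per-pixel best-matching class index
    labels[maps.max(axis=0) < background_thresh] = -1   # weak relevance everywhere -> background
    return labels
```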
Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach
Poster Session 5 & Exhibit Hall
Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, Jim Little
This paper proposes a new method called Visual Prompting for Generalized Few-Shot Segmentation (GFSS) using a multi-scale transformer decoder architecture. The approach learns “visual prompts” that represent base classes (learned from abundant data) and novel classes (learned from few examples) as embeddings. These prompts cross-attend with image features at multiple scales in a transformer decoder. The novel class prompts are initialized by pooling features from the few example images masked by the ground truth segmentation masks. A key aspect is introducing a unidirectional “causal attention” mechanism where the base prompts can influence the novel prompts, but not vice versa. This helps contextualize the novel prompts while preventing them from degrading the base class representations. The method also enables “transductive prompt tuning” where the visual prompts can be further optimized on the unlabeled test images in an unsupervised manner to adapt to the test distribution.
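The unidirectional constraint can be expressed as an attention mask like the one sketched below, where novel-prompt queries may attend to base-prompt keys but not the reverse. The shapes and the True-means-blocked convention are assumptions for illustration.

```python
import torch

def base_novel_attention_mask(num_base, num_novel):
    """Boolean mask over prompt tokens: True marks a blocked query-key pair
    (the convention used by nn.MultiheadAttention's attn_mask)."""
    n = num_base + num_novel
    mask = torch.zeros(n, n, dtype=torch.bool)   # False = attention allowed
    # Block base-prompt queries (rows) from attending to novel-prompt keys (columns),
    # so novel prompts cannot influence base prompts, while the reverse is allowed.
    mask[:num_base, num_base:] = True
    return mask

# Example: 3 base prompts and 2 novel prompts.
print(base_novel_attention_mask(3, 2))
```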
Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models
Poster Session 2 & Exhibit Hall
Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal
This paper proposes a new method called Prompting Hard or Hardly Prompting (PH2P) for “inverting” a text-to-image diffusion model to find the text prompt most likely to generate a given target image.
A central point is that the recovered prompts are “hard” prompts: actual sequences of vocabulary tokens rather than continuous embeddings, so they stay human-readable and can be edited and reused.
PH2P outperforms baseline approaches, generating prompts that capture the semantics of the target images more accurately. The generated prompts can be used to synthesize diverse yet semantically similar images. Potential applications include tools for creative exploration and editing, where a user provides a target concept image and the system suggests a descriptive prompt that can be further modified. The inverted prompts could also enable evolutionary concept generation by iteratively modifying images and prompts. Finally, the attended regions corresponding to each token could be used for unsupervised object localization and segmentation.
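A simplified version of the inversion recipe is sketched below: optimize prompt token embeddings against the diffusion denoising loss on the target image, then snap each embedding to its nearest vocabulary token. The placeholders `denoising_loss` and `vocab_embeddings` and the single-stage projection are simplifications, not the paper's exact procedure.

```python
import torch

def invert_prompt(target_latents, denoising_loss, vocab_embeddings,
                  num_tokens=16, steps=1000, lr=1e-2):
    """
    denoising_loss(target_latents, prompt, t): hypothetical callable scoring how well the
        current prompt embeddings denoise the target image latents at timestep t.
    vocab_embeddings: (V, D) tensor of the text encoder's token embedding table.
    """
    dim = vocab_embeddings.shape[1]
    prompt = torch.randn(num_tokens, dim, requires_grad=True)
    opt = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, 1000, (1,))                   # random diffusion timestep
        loss = denoising_loss(target_latents, prompt, t)   # lower loss -> prompt explains the image better
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Project each optimized embedding onto the closest vocabulary token to get readable text.
    token_ids = torch.cdist(prompt.detach(), vocab_embeddings).argmin(dim=1)
    return token_ids
```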
UnO: Unsupervised Occupancy Fields for Perception and Forecasting
Poster Session 4 & Exhibit Hall
Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, Raquel Urtasun
UnO (Unsupervised Occupancy Fields) is a method proposed in this paper to learn a 3D world model from unlabeled LiDAR data that can perceive the present environment and forecast its future state. UnO learns to predict 3D occupancy (whether a point in space is occupied by an object or not) over time in a self-supervised way, using future LiDAR sweeps as pseudo-labels. It uses an implicit neural network architecture that allows querying occupancy at any continuous 3D point and future time. UnO can be effectively transferred to downstream tasks. For LiDAR point cloud forecasting, it achieves state-of-the-art results on multiple datasets by adding a lightweight renderer on top of the predicted occupancy. When fine-tuned on limited labeled data for bird’s-eye view semantic occupancy prediction, UnO outperforms fully supervised methods, demonstrating impressive few-shot learning capabilities. Potential applications include safer motion planning for self-driving vehicles by enabling them to reason about the future state of the entire scene, not just detected objects. UnO’s ability to learn from unlabeled data and generalize to rare objects can improve robustness and safety.
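The “query occupancy anywhere, at any time” interface can be sketched as a small implicit decoder on top of a scene encoder, as below. The modules and dimensions are hypothetical placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class OccupancyField(nn.Module):
    def __init__(self, scene_encoder, feature_dim=128):
        super().__init__()
        self.scene_encoder = scene_encoder   # hypothetical: past LiDAR sweeps -> per-query features
        self.decoder = nn.Sequential(
            nn.Linear(feature_dim + 4, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, lidar_sweeps, queries):
        """queries: (N, 4) continuous points (x, y, z, t); returns (N,) occupancy probabilities."""
        features = self.scene_encoder(lidar_sweeps, queries)   # (N, feature_dim), e.g. interpolated
        logits = self.decoder(torch.cat([features, queries], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)
```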
Click here to learn more about the computer vision work being done by Vector researchers.