Vector researcher Wenhu Chen on improving and benchmarking foundation models

August 20, 2024


By Wenhu Chen

The past year has seen great progress in foundation models as they achieve expert-level performance on challenging, real-world problems. In early 2023, the best open-source 7B models, like Llama-2-7B, could only solve 10% of the simple grade-school algebra problems in the GSM8K dataset. A year later, Qwen2-7B can already solve nearly 56% of the American-math-competition problems in the MATH dataset. Similarly, in 2023, video diffusion models like ModelScope were still producing shoddy, highly unrealistic video clips. By mid-2024, multiple video diffusion models, like Sora and Kling, can produce lengthy, smooth, and highly realistic videos. This pace of development is unprecedented. These advances are mostly attributable to pre-training and instruction tuning on larger and higher-quality datasets.

My lab, TIGER-Lab at the University of Waterloo, is primarily devoted to three research directions: 

  • Improving the capabilities of foundation models during the post-training phase, such as instruction tuning or preference optimization. 
  • Building novel benchmarks to evaluate the current progress of foundation models. 
  • Increasing the faithfulness and controllability of generative models to unlock various GenAI applications. 

I want to highlight a few papers we published in these directions.

Instruction Tuning

The post-training phase plays an instrumental role in achieving strong performance on different downstream applications. Instruction tuning is the most commonly adopted post-training enhancement. Previously, instruction tuning was done at small scales (< 1M examples), and there was a common belief that instruction tuning is not meant to improve models’ core capabilities. In “MAmmoTH2: Scaling instructions from the web,” we scaled instruction tuning up to 10M examples to test whether instruction tuning can improve models’ core capabilities. Specifically, we propose an approach that automatically mines educational web documents from pre-training data and then uses open LLMs to extract large-scale, naturally occurring instruction-response pairs. Through instruction tuning on this massive instruction data, we can significantly improve the reasoning capabilities of LLMs like Mistral or Llama-3. Our model MAmmoTH2 obtained state-of-the-art performance on a wide array of reasoning benchmarks. 
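To make the recipe above concrete, here is a minimal sketch of a two-stage mining pipeline of this kind: score web documents for educational value, then prompt an open LLM to extract the question-answer pairs already present in them. The helper names (`score_educational_value`, `extract_qa_pairs`) and the keyword-based scorer are illustrative assumptions, not the actual MAmmoTH2 implementation.

```python
# Illustrative sketch of a web-mining pipeline for instruction data.
# The scoring heuristic and extraction prompt are placeholders, not the
# actual MAmmoTH2 code.

def score_educational_value(document: str) -> float:
    """Placeholder: rate how 'educational' a web document looks.
    A real pipeline would use a trained classifier instead of keywords."""
    keywords = ("theorem", "solution", "example", "exercise", "explain")
    return sum(word in document.lower() for word in keywords) / len(keywords)

def extract_qa_pairs(document: str, llm) -> list[dict]:
    """Placeholder: prompt an open LLM to pull out naturally occurring
    question-answer pairs that already exist in the document."""
    prompt = (
        "Extract every question and its answer from the text below.\n"
        "Return one 'Q: ... A: ...' pair per line.\n\n" + document
    )
    pairs = []
    for line in llm(prompt).splitlines():
        if line.startswith("Q:") and " A: " in line:
            q, a = line[2:].split(" A: ", 1)
            pairs.append({"instruction": q.strip(), "response": a.strip()})
    return pairs

def mine_instructions(corpus, llm, threshold=0.4):
    """Stage 1: keep educational documents; stage 2: extract instruction data."""
    dataset = []
    for doc in corpus:
        if score_educational_value(doc) >= threshold:
            dataset.extend(extract_qa_pairs(doc, llm))
    return dataset
```

The key design point is that the instruction-response pairs are mined, not written by annotators or synthesized from scratch, which is what makes scaling to millions of examples feasible.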

In another work, “MANTIS: Interleaved Multi-Image Instruction Tuning,” we curated an instruction tuning dataset, Mantis-Instruct, to enable existing multimodal models to handle interleaved multimodal input. We show that with a limited amount of instruction tuning on Mantis-Instruct, we can significantly boost model performance on tasks involving interleaved multi-image inputs. Our best model even matches the performance of GPT-4V.
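For readers unfamiliar with the format, an interleaved multi-image instruction example is essentially a conversation whose text contains image placeholders referring to several images at once. The record below is a hypothetical illustration of that idea; the field names and content do not mirror an actual Mantis-Instruct entry.

```python
# Hypothetical interleaved multi-image instruction record (illustrative only).
example = {
    "images": ["kitchen_before.jpg", "kitchen_after.jpg"],
    "conversation": [
        {
            "role": "user",
            # <image> tokens mark where each image appears in the text
            "content": "Here is a room before renovation <image> and after "
                       "<image>. What are the three biggest changes?",
        },
        {
            "role": "assistant",
            "content": "The cabinets were replaced, the wall was repainted, "
                       "and an island was added in the center.",
        },
    ],
}
```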

In “StructLM: Towards Building Generalist Models for Structured Knowledge Grounding,” we construct a high-quality dataset to enhance the ability of LLMs to ground on structured knowledge. By training on top of existing LLMs, we can build a strong foundation model that handles various types of structured knowledge, such as tables and graphs. StructLM achieves state-of-the-art performance on eight different structured knowledge grounding datasets.
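A common way to ground a text-only LLM on structured knowledge is to linearize the structure into text before it reaches the model. The snippet below is a minimal sketch of table linearization under that assumption; the serialization format is illustrative and not the one StructLM itself uses.

```python
# Minimal sketch: linearize a table into a prompt so a text-only LLM can
# reason over it. The pipe-separated format is illustrative, not StructLM's.

def linearize_table(headers, rows):
    lines = [" | ".join(headers)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

headers = ["Country", "Capital", "Population (M)"]
rows = [["Canada", "Ottawa", 38], ["Japan", "Tokyo", 125]]

prompt = (
    "Answer the question using the table.\n\n"
    f"{linearize_table(headers, rows)}\n\n"
    "Question: Which country has the larger population?"
)
print(prompt)
```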


“It is a very exciting time to work and push the frontier of these models. Modern foundation models will completely revolutionize the way we use AI.”

Wenhu Chen

Vector Faculty Member; Assistant Professor, David Cheriton School of Computer Science, University of Waterloo; Canada CIFAR Artificial Intelligence Chair

Evaluation of Foundation Models

To better benchmark modern language models and multimodal models, TIGER-Lab also works on constructing better evaluation suites. Our goal is to test the limits of existing models’ true capabilities on real-world tasks. 

In “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,” we curate the first massive multi-disciplinary multimodal benchmark to evaluate multimodal models’ perceptual and reasoning abilities. The benchmark is notable for its diversity, covering a wide variety of visual inputs such as photographs, diagrams, icons, logos, plots, and charts. This dataset has been widely adopted by the community as one of the standard benchmarks.

In “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark,” we aim to tackle issues with the current MMLU dataset, such as its sensitivity to prompts and its relative simplicity. These issues are mostly due to the fact that MMLU has only four options per question, which gives models shortcuts for making correct guesses. To reduce this randomness, we augment each problem to contain 10 options. This augmentation significantly reduces the shortcut and thus increases the robustness of the benchmark. Furthermore, we also augment the benchmark with more college-level problems to increase its difficulty. Through these upgrades, MMLU-Pro can effectively discriminate between models. This dataset has also been widely adopted, and it serves as an official evaluation benchmark on the Hugging Face LLM leaderboard.
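The effect of moving from four options to ten is easy to quantify: the expected accuracy of blind guessing drops from 25% to 10%, which widens the usable score range between weak and strong models. The snippet below illustrates this with a hypothetical question record; it is not drawn from MMLU-Pro itself.

```python
import random

# Hypothetical MMLU-Pro-style record with 10 options instead of 4.
question = {
    "question": "Which law relates force, mass, and acceleration?",
    "options": [
        "Newton's second law", "Ohm's law", "Hooke's law", "Boyle's law",
        "Coulomb's law", "Kepler's third law", "Faraday's law",
        "Newton's law of cooling", "Ampere's law", "Snell's law",
    ],
    "answer_index": 0,
}

# Expected accuracy of a model that guesses uniformly at random.
trials = 100_000
hits = sum(
    random.randrange(len(question["options"])) == question["answer_index"]
    for _ in range(trials)
)
print(f"Random-guess accuracy with 10 options: {hits / trials:.3f}")  # ~0.10
```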

In “Long-context LLMs Struggle with Long In-context Learning,” we propose a novel approach to evaluating long-context LLMs. Unlike previous long-context benchmarks (summarization or document QA), which mostly test LLMs’ ability to look up information in a long input context, we propose to evaluate their long-context understanding through the lens of in-context learning. With extreme-label classification tasks, long-context LLMs are forced to understand the entire long sequence to capture the full label space. This helps reduce the position bias of existing benchmarks and more truly gauges LLMs’ long-context abilities.
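A rough way to picture the setup: the prompt has to carry at least one demonstration per label, so with hundreds of labels the context grows long and the answer cannot be recovered by looking up a single passage. The sketch below builds such a prompt from dummy data; it illustrates the idea rather than reproducing the benchmark's actual construction.

```python
# Sketch of an extreme-label in-context classification prompt.
# The label set and demonstrations are dummy data for illustration.

labels = [f"intent_{i:03d}" for i in range(150)]           # large label space
demos = [(f"example utterance for {lab}", lab) for lab in labels]

def build_prompt(demonstrations, query):
    lines = ["Classify the sentence into one of the labels shown below.\n"]
    for text, label in demonstrations:                      # one demo per label
        lines.append(f"Sentence: {text}\nLabel: {label}\n")
    lines.append(f"Sentence: {query}\nLabel:")
    return "\n".join(lines)

prompt = build_prompt(demos, "example utterance for intent_042")
print(f"Prompt size: {len(prompt.split())} words covering {len(labels)} labels")
```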

Image/Video Diffusion Models

Generative modeling is also a major direction of TIGER-Lab. Our goal is to build more faithful and controllable generative models for images and videos.

In image-to-video generation tasks, a big issue is the unfaithfulness of the generated video with respect to the given initial frame. In “ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation,” we propose a new dilated temporal attention layer that helps video generation models stay more faithful to the conditioning frame.
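The core intuition can be sketched as a temporal attention in which every frame's keys and values also include features from the conditioning (first) frame, so later frames keep referring back to it. The PyTorch module below is a simplified illustration of that intuition, not the actual ConsistI2V layer, which is more involved.

```python
import torch
import torch.nn as nn

class FirstFrameTemporalAttention(nn.Module):
    """Simplified temporal attention where every frame also attends to the
    conditioning (first) frame. Illustrative only, not the ConsistI2V layer."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) latent video features
        b, f, t, d = x.shape
        # fold spatial tokens into the batch so attention runs along time
        q = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        # prepend first-frame features so every frame can look back at them
        kv = torch.cat([q[:, :1], q], dim=1)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, t, f, d).permute(0, 2, 1, 3)

# Tiny smoke test: 2 videos, 8 frames, 16 spatial tokens, 64-dim features.
video = torch.randn(2, 8, 16, 64)
print(FirstFrameTemporalAttention(64)(video).shape)  # torch.Size([2, 8, 16, 64])
```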

Another issue with video generation models is their speed. The majority of video generation models take more than two minutes to produce a video. Different distillation approaches can accelerate this, but at a big cost to generation quality. In “T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback,” we propose mixing consistency training with reward feedback to balance the two aspects. Our model, T2V-Turbo, is able to maintain both efficiency and quality.
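At a high level, this kind of training mixes a consistency-distillation term with a reward term from pretrained reward models. The function below is a schematic sketch of such a mixed objective under generic placeholders; the weighting and the reward signal are assumptions, not T2V-Turbo's actual settings.

```python
import torch

def mixed_loss(consistency_loss: torch.Tensor,
               reward_scores: torch.Tensor,
               reward_weight: float = 0.1) -> torch.Tensor:
    """Schematic mixed objective: keep the distilled model consistent with
    its teacher while nudging it toward outputs that score well under a
    reward model. The weighting here is a placeholder."""
    # maximizing reward is expressed as minimizing its negative mean
    return consistency_loss - reward_weight * reward_scores.mean()

# Dummy values standing in for real training signals.
print(mixed_loss(torch.tensor(0.8), torch.tensor([0.4, 0.6, 0.5])))
```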

Besides video generation, video editing is also a practical application, where the user aims to edit a given video in a certain way, such as replacing subjects, changing the style, or adding or removing subjects. However, existing video editing approaches are highly ad hoc. In “AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks,” we propose a unified framework to satisfy the different needs of end users. Our approach is training-free and highly compatible with different image editing methods. We show that our method outperforms existing methods by a large margin. AnyV2V has also been widely adopted as a strong baseline in the GenAI community.
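My reading of the plug-and-play design is a two-stage recipe: edit only the first frame with any off-the-shelf image editor, then propagate that edit through the rest of the video with an image-to-video model while reusing the source video's structure. The toy sketch below conveys that flow with hypothetical helper functions; it is not the AnyV2V codebase or API.

```python
# Toy sketch of a first-frame-edit-then-propagate video editing flow.
# All helpers are hypothetical placeholders, not AnyV2V's actual API.

def edit_image(frame, instruction):
    # stand-in for any off-the-shelf image editing model
    return f"{frame}+edited({instruction})"

def invert_source_video(frames):
    # stand-in for inverting the source video so its motion/structure
    # can be reused during generation
    return [f"latent({f})" for f in frames]

def generate_from_first_frame(edited_first_frame, source_latents):
    # stand-in for an image-to-video model conditioned on the edited frame,
    # guided by the inverted source latents
    return [edited_first_frame] + [f"propagated({z})" for z in source_latents[1:]]

def edit_video(frames, instruction):
    edited_first = edit_image(frames[0], instruction)         # stage 1: edit frame 0
    latents = invert_source_video(frames)                     # reuse source dynamics
    return generate_from_first_frame(edited_first, latents)   # stage 2: propagate

print(edit_video(["frame0", "frame1", "frame2"], "make it snowy"))
```

Because the image editor is swappable, the same pipeline covers subject replacement, style change, and subject addition or removal without retraining.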

Conclusion

With the fast pace of foundation model development, we are embracing stronger models by the day. It is a very exciting time to work and push the frontier of these models. Modern foundation models will completely revolutionize the way we use AI. Our lab will continue to work on different aspects of foundation models, like instruction tuning, preference optimization, evaluation, retrieval-augmentation, and visual content generation.
