When AI Meets Human Matters: Evaluating Multimodal Models Through a Human-Centred Lens – Introducing HumaniBench
August 8, 2025
By Shaina Raza and Veronica Chatrath
AI models are rapidly becoming bigger, faster, and more capable at understanding images and text together. However, while accuracy and speed are often celebrated, a key question remains: How well do these models align with human values? Fairness, empathy, inclusivity, and ethical judgment are still elusive for many state-of-the-art systems. That’s where HumaniBench comes in.
Developed as the first comprehensive benchmark for human-centered evaluation of large multimodal models (LMMs), HumaniBench represents a significant step forward in how we assess AI systems. It goes beyond traditional metrics, challenging models on seven essential human-aligned principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness.
HumaniBench is built on a meticulously curated dataset of 32,000 image-question pairs drawn from real-world news articles on diverse, socially relevant topics. For each image, we generate a caption and assign a social-attribute tag (age, gender, race, sport, or occupation) to create rich metadata for downstream task annotation. The annotation pipeline leverages a scalable GPT-4o workflow, followed by rigorous expert verification, ensuring each sample meets high standards of quality and relevance.
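To make that workflow concrete, here is a minimal sketch of what a single GPT-4o annotation call could look like. The prompt wording, tag handling, and output schema are illustrative assumptions rather than the authors' exact pipeline, and every machine-generated record would still pass to human experts for verification.

```python
# Minimal sketch of one GPT-4o annotation step: caption an image and assign a
# social-attribute tag. Prompts, tag set, and schema are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SOCIAL_ATTRIBUTES = ["age", "gender", "race", "sport", "occupation"]

def annotate_image(image_url: str) -> dict:
    """Return a caption and a social-attribute tag for one news image."""
    prompt = (
        "Write a one-sentence caption for this news image, then choose the single "
        f"most relevant social attribute from {SOCIAL_ATTRIBUTES}. "
        'Respond as JSON with keys "caption" and "attribute".'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    # Each record produced here would then be reviewed by expert annotators.
    return json.loads(response.choices[0].message.content)
```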
To reflect the complexity of human contexts, HumaniBench features seven diverse tasks, each mapped to one or more of these human-centric principles.
Using this framework, the team benchmarked 15 leading LMMs, spanning open-source models such as Phi4, Gemma, Llama 3.2, CogVLM2, and LLaVA, as well as proprietary models such as GPT-4o and Gemini 2.0. The results yielded some surprising findings: proprietary models performed well on reasoning, empathy, and general language understanding tasks, while open-source models such as Qwen and Phi4 excelled at visual grounding and robustness tasks. No model performed flawlessly; almost all exhibited discrepancies in their treatment of different demographic groups, particularly across age, race, and language.
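For readers who want to try a single sample themselves, the sketch below runs one image-question pair through an open-source LMM via the Hugging Face transformers library. The checkpoint, prompt format, image URL, and question are illustrative assumptions and are not tied to the official HumaniBench evaluation harness.

```python
# Sketch: ask one question about one image with an open-source LMM
# (llava-hf/llava-1.5-7b-hf). Illustrative only; not the HumaniBench harness.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder image URL; any news-style photo works here.
image = Image.open(requests.get("https://example.com/news_photo.jpg", stream=True).raw)
question = "What activity is the person in this image engaged in?"
prompt = f"USER: <image>\n{question} ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```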
Table 1: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.
Empathy, as a complex social and cognitive task, remains a critical benchmark for evaluating human-centric AI alignment. Results on HumaniBench show that closed-source models generally generated captions with more empathy and a more balanced tone, reflecting both emotional intelligence and sensitivity to context. This sets a valuable precedent for open-source models to follow. At the same time, the visual detection ability of some open-source models, such as Qwen and Phi, goes well beyond that of random self-supervised object-detection classifiers. Overall, these findings highlight both the promise and the limitations of current LMMs, and point to clear opportunities for the open-source community to advance responsible, equitable, and emotionally intelligent AI systems.
Figure 2: HumaniBench principle-aligned scores. Each entry is the mean score of the tasks mapped to that principle (↑ higher is better). †Closed-source; all others open source.
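The caption above also describes how these scores are aggregated: each principle's score is the mean of the scores of the tasks mapped to it. The toy sketch below illustrates that computation; the task names, task-to-principle mapping, and numbers are placeholders, not the official HumaniBench mapping or results.

```python
# Toy aggregation matching the Figure 2 caption: a principle's score is the
# mean of its mapped task scores. Tasks, mapping, and values are placeholders.
from statistics import mean

# model -> task -> score (higher is better)
task_scores = {
    "model_a": {"visual_qa": 0.81, "empathetic_captioning": 0.74, "robustness": 0.66},
}

# principle -> tasks mapped to it (one task may serve several principles)
principle_to_tasks = {
    "understanding": ["visual_qa"],
    "empathy": ["empathetic_captioning"],
    "robustness": ["robustness", "visual_qa"],
}

def principle_scores(scores: dict[str, float]) -> dict[str, float]:
    """Average each principle's mapped task scores."""
    return {
        principle: mean(scores[task] for task in tasks)
        for principle, tasks in principle_to_tasks.items()
    }

print(principle_scores(task_scores["model_a"]))
# {'understanding': 0.81, 'empathy': 0.74, 'robustness': 0.735}
```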
In sum, HumaniBench offers more than a scorecard: it is a diagnostic tool for understanding where models succeed, where they fail, and why that matters for humans. The full dataset, code, and evaluation suite are publicly available here, in the spirit of transparency and open-source collaboration. As we push toward more human-centric AI, HumaniBench stands as a timely benchmark, inviting the field not only to aim higher, but to align better.
For researchers, developers, and anyone invested in the future of responsible AI, HumaniBench offers a path forward: measurable, meaningful, and human-centered.