OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Abstract
OmniVideoBench is a comprehensive benchmark for evaluating audio-visual reasoning in multimodal large language models, addressing modality complementarity and logical consistency.
Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1,000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes in length, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.
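To make the benchmark's structure concrete, here is a minimal sketch of how a single QA item with its reasoning trace might be represented and scored, assuming a multiple-choice format; the field names (`qa_id`, `reasoning_steps`, etc.) and the accuracy function are illustrative assumptions, not the official release format.

```python
from dataclasses import dataclass, field


@dataclass
class OmniVideoBenchItem:
    """Hypothetical schema for one OmniVideoBench QA pair (field names assumed)."""
    qa_id: str                    # unique identifier for the QA pair
    video_id: str                 # one of the 628 source videos
    question: str
    options: list[str]            # candidate answers (multiple-choice assumed)
    answer: str                   # the single manually verified correct option
    question_type: str            # e.g. "temporal reasoning", "counting"
    reasoning_steps: list[str] = field(default_factory=list)  # step-by-step trace


def accuracy(items: list[OmniVideoBenchItem], predictions: dict[str, str]) -> float:
    """Fraction of questions whose predicted option matches the verified answer."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if predictions.get(it.qa_id) == it.answer)
    return correct / len(items)


# Example usage with a single illustrative item and prediction.
item = OmniVideoBenchItem(
    qa_id="qa_0001",
    video_id="vid_042",
    question="Which instrument is playing when the speaker pauses?",
    options=["piano", "violin", "guitar", "flute"],
    answer="violin",
    question_type="temporal reasoning",
    reasoning_steps=[
        "Locate the pause in the speech track.",
        "Identify the instrument audible at that moment in the video.",
    ],
)
print(accuracy([item], {"qa_0001": "violin"}))  # 1.0
```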
Community
Recent advances in multimodal large language models (MLLMs) have shown immense potential in video understanding. However, existing benchmarks often fall short in evaluating true synergistic reasoning across both audio and visual modalities: they may neglect one modality or fail to integrate them in a logically consistent way. To address this, we introduce OmniVideoBench, a large-scale, rigorously designed benchmark created to assess synergistic audio-visual understanding. It places a strong emphasis on modality complementarity and logical consistency. The benchmark includes 1,000 high-quality question-answer (QA) pairs from 628 diverse videos (ranging from a few seconds to 30 minutes in length), each annotated with step-by-step reasoning. Our evaluation of various MLLMs reveals a significant gap between current model performance and human-level reasoning, highlighting the challenges of genuine audio-visual intelligence.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence (2025)
- SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models (2025)
- AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs (2025)
- ExpVid: A Benchmark for Experiment Video Understanding & Reasoning (2025)
- CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning (2025)
- OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding (2025)
- V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs (2025)