VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
Abstract
VideoMathQA evaluates models' ability to perform temporally extended cross-modal reasoning across various mathematical domains in video settings, addressing direct problem solving, conceptual transfer, and deep instructional comprehension.
Mathematical reasoning in real-world video settings presents a fundamentally different challenge from that posed by static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues that are often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to annotate the benchmark, investing over 920 hours of work to ensure high quality. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
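To illustrate the kind of fine-grained diagnosis the multi-step reasoning annotations enable, here is a minimal scoring sketch that checks the final multiple-choice answer and, separately, how many annotated reasoning steps a model's explanation touches. The field names (`answer`, `reasoning_steps`) and the substring-matching heuristic are illustrative assumptions, not the benchmark's actual protocol; see the released evaluation code for the real metrics.

```python
# Hypothetical scoring sketch: final-answer accuracy plus a coarse
# step-coverage score over per-question reasoning annotations.
# Field names and the matching heuristic are assumptions for illustration.
def score_prediction(question: dict, predicted_choice: str, predicted_rationale: str) -> dict:
    # Final-answer correctness for a multiple-choice prediction.
    correct = predicted_choice.strip().upper() == question["answer"].strip().upper()

    # Fraction of annotated reasoning steps mentioned in the model's rationale.
    steps = question.get("reasoning_steps", [])
    covered = sum(1 for step in steps if step.lower() in predicted_rationale.lower())

    return {
        "final_answer_correct": correct,
        "step_coverage": covered / len(steps) if steps else None,
    }
```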
Community
VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the needle-in-a-multimodal-haystack problem, where key information is sparse and spread across different modalities and moments in the video.
Resources
- Project Website: https://mbzuai-oryx.github.io/VideoMathQA
- Dataset Access: https://huggingface.co/datasets/MBZUAI/VideoMathQA (see the loading sketch after this list)
- Leaderboard (Reasoning): https://hanoonar.github.io/VideoMathQA/#leaderboard-2
- Leaderboard (Direct): https://hanoonar.github.io/VideoMathQA/#leaderboard
- GitHub Repository: https://github.com/mbzuai-oryx/VideoMathQA
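A minimal sketch of pulling the benchmark from the Hugging Face Hub with the `datasets` library. The repository ID comes from the dataset link above; the split name and any field names are assumptions, so inspect the actual schema before relying on them.

```python
# Minimal loading sketch: the repo ID is from the dataset link above;
# the split name below is an assumption, not the confirmed configuration.
from datasets import load_dataset

dataset = load_dataset("MBZUAI/VideoMathQA", split="test")

sample = dataset[0]
print(sample.keys())  # inspect the real fields before writing evaluation code
```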