arxiv:2604.05117

Watch Before You Answer: Learning from Visually Grounded Post-Training

Published on Apr 6 · Submitted by Perry the Platypus on Apr 8

Abstract

Vision-language models struggle with video understanding partly because benchmarks and post-training datasets contain questions answerable from text alone; VidGround addresses this by post-training only on visually grounded questions, improving performance.

AI-generated summary

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even more limited than previously assumed: in commonly reported long-video understanding benchmarks, 40-60% of questions can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding. Guided by this observation, we introduce VidGround as a simple yet effective solution: post-training only on the genuinely visually grounded questions, free of linguistic biases. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation paired with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
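The page does not ship code, but the curation step the abstract describes — drop any question that a text-only model can answer without seeing the video — can be sketched as follows. This is a minimal Python sketch under stated assumptions: the VideoQA fields, the blind_answer callable, and the all-trials-correct threshold are illustrative stand-ins, not the authors' actual pipeline.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VideoQA:
    question: str
    choices: List[str]
    answer: str       # ground-truth option label, e.g. "B"
    video_path: str   # deliberately unused by the blind check

def filter_visually_grounded(
    dataset: List[VideoQA],
    blind_answer: Callable[[str, List[str]], str],
    n_trials: int = 3,
) -> List[VideoQA]:
    """Keep only questions a text-only model cannot reliably answer.

    blind_answer is any function (hypothetical here) that queries a
    text-only LLM with the question and answer choices -- but no video --
    and returns an option label. An item is dropped when the blind model
    answers correctly in every trial, since it likely does not require
    visual grounding.
    """
    kept = []
    for item in dataset:
        correct = sum(
            blind_answer(item.question, item.choices) == item.answer
            for _ in range(n_trials)
        )
        if correct < n_trials:  # not consistently answerable from text alone
            kept.append(item)
    return kept

Under this scheme, the retained subset would play the role of the curated data (the 69.1% the paper reports) that is then fed into RL-based post-training.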

Community

Paper submitter

TL;DR: We find that 40-60% of questions in popular video understanding benchmarks (VideoMME, MMVU, etc.) can be answered from text alone, and the same problem also exists in post-training datasets. VidGround is a simple fix: keep only the visually grounded questions. Using only 69.1% of the data, it improves RL post-training by up to +6.2 points and beats several more complex post-training methods.

Takeaway: for video understanding in VLMs, visually grounded data matters more than data volume or algorithmic complexity.

Project page: http://vidground.etuagi.com

Get this paper in your agent:

hf papers read 2604.05117
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
