arxiv:2510.10518

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Published on Oct 12 · Submitted by Jiaheng Liu on Oct 17

Abstract

VideoReward Thinker enhances multimodal reward models with visual reasoning operations and a configurable visual memory window, improving accuracy on video preference benchmarks.

AI-generated summary

Recent advances in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing the use of fewer frames and causing loss of fine-grained detail; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) Rejection Sampling Fine-Tuning on high-quality traces, keeping only samples whose per-dimension and overall judgments are all correct, to further enhance reasoning; and (iii) Group Relative Policy Optimization (GRPO) to strengthen reasoning further. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
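
The loop below is a minimal, hypothetical sketch (not the paper's released code) of how a thinking-with-image reward model can interleave visual operations with reasoning: at each step the model either requests a frame via a select-frame operation or emits a final judgment, and requested frames live in a bounded visual memory window so old evidence is evicted rather than overrunning the context budget. The callables rm_generate and decode_frame, and the action schema, are illustrative assumptions.

from collections import deque
from dataclasses import dataclass


@dataclass
class Judgment:
    per_dimension: dict  # e.g. {"visual quality": "A", "text alignment": "B"}
    overall: str         # "A", "B", or "tie"


def evaluate_pair(prompt, video_a, video_b, rm_generate, decode_frame,
                  window_size=8, max_steps=16):
    """Judge which of two generated videos better matches the prompt, letting
    the reward model fetch frames on demand (hypothetical interface)."""
    memory = deque(maxlen=window_size)   # configurable visual memory window
    transcript = [f"Prompt: {prompt}"]

    for _ in range(max_steps):
        # The RM conditions on the text transcript plus only the frames
        # currently held in the memory window.
        action = rm_generate(transcript, list(memory))

        if action["op"] == "select_frame":
            # Actively acquire new visual evidence; once the window is full,
            # the deque evicts the oldest frame, keeping context use bounded.
            video = video_a if action["video"] == "A" else video_b
            memory.append(decode_frame(video, action["frame_index"]))
            transcript.append(
                f"<select_frame video={action['video']} index={action['frame_index']}>"
            )
        elif action["op"] == "final_answer":
            return Judgment(per_dimension=action["scores"],
                            overall=action["preference"])

    return None  # no judgment produced within the step budget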
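
Two smaller pieces of the training pipeline also lend themselves to a sketch, again under assumptions rather than from the paper's code: the rejection-sampling filter in stage (ii), which keeps only traces whose per-dimension and overall judgments all match the ground truth, and the group-relative advantage used by GRPO in stage (iii), where each rollout's reward is normalized against the other rollouts sampled for the same input.

from statistics import mean, pstdev


def keep_for_rft(pred_dims: dict, pred_overall: str,
                 gold_dims: dict, gold_overall: str) -> bool:
    """Keep a sampled trace only if every per-dimension judgment and the
    overall judgment are correct."""
    dims_correct = all(pred_dims.get(k) == v for k, v in gold_dims.items())
    return dims_correct and pred_overall == gold_overall


def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage: standardize each rollout's reward against the
    mean and standard deviation of its sampling group."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Example: four rollouts for one comparison, rewarded 1.0 when the overall
# judgment is correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))  # -> roughly [0.58, -1.73, 0.58, 0.58]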



Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0