arxiv:2510.10518

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Published on Oct 12 · Submitted by Jiaheng Liu on Oct 17

Abstract

VideoReward Thinker enhances multimodal reward models with visual reasoning operations and a configurable visual memory window, improving accuracy on video preference benchmarks.

AI-generated summary

Recent advances in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing the use of fewer frames and causing loss of fine-grained detail; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) Rejection Sampling Fine-Tuning on high-quality traces, keeping only samples whose per-dimension and overall judgments are all correct, to further enhance reasoning; and (iii) Group Relative Policy Optimization (GRPO) to strengthen reasoning further. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
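
The loop below is a minimal, hypothetical sketch (not the paper's released code) of how a thinking-with-image reward model can interleave visual operations with reasoning: at each step the model either requests a frame via a select-frame operation or emits a final judgment, and requested frames live in a bounded visual memory window so old evidence is evicted rather than overrunning the context budget. The callables rm_generate and decode_frame, and the action schema, are illustrative assumptions.

from collections import deque
from dataclasses import dataclass


@dataclass
class Judgment:
    per_dimension: dict  # e.g. {"visual quality": "A", "text alignment": "B"}
    overall: str         # "A", "B", or "tie"


def evaluate_pair(prompt, video_a, video_b, rm_generate, decode_frame,
                  window_size=8, max_steps=16):
    """Judge which of two generated videos better matches the prompt, letting
    the reward model fetch frames on demand (hypothetical interface)."""
    memory = deque(maxlen=window_size)   # configurable visual memory window
    transcript = [f"Prompt: {prompt}"]

    for _ in range(max_steps):
        # The RM conditions on the text transcript plus only the frames
        # currently held in the memory window.
        action = rm_generate(transcript, list(memory))

        if action["op"] == "select_frame":
            # Actively acquire new visual evidence; once the window is full,
            # the deque evicts the oldest frame, keeping context use bounded.
            video = video_a if action["video"] == "A" else video_b
            memory.append(decode_frame(video, action["frame_index"]))
            transcript.append(
                f"<select_frame video={action['video']} index={action['frame_index']}>"
            )
        elif action["op"] == "final_answer":
            return Judgment(per_dimension=action["scores"],
                            overall=action["preference"])

    return None  # no judgment produced within the step budget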
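
Two smaller pieces of the training pipeline also lend themselves to a sketch, again under assumptions rather than from the paper's code: the rejection-sampling filter in stage (ii), which keeps only traces whose per-dimension and overall judgments all match the ground truth, and the group-relative advantage used by GRPO in stage (iii), where each rollout's reward is normalized against the other rollouts sampled for the same input.

from statistics import mean, pstdev


def keep_for_rft(pred_dims: dict, pred_overall: str,
                 gold_dims: dict, gold_overall: str) -> bool:
    """Keep a sampled trace only if every per-dimension judgment and the
    overall judgment are correct."""
    dims_correct = all(pred_dims.get(k) == v for k, v in gold_dims.items())
    return dims_correct and pred_overall == gold_overall


def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantage: standardize each rollout's reward against the
    mean and standard deviation of its sampling group."""
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Example: four rollouts for one comparison, rewarded 1.0 when the overall
# judgment is correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))  # -> roughly [0.58, -1.73, 0.58, 0.58]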



Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0