## Model Summary

VR-Thinker is the first multimodal reward model built on a Thinking-with-Image reasoning framework, in which the model reasons over video frames step by step and can actively request additional frames while scoring generated videos.
For further details, please refer to the following:
- Paper: https://arxiv.org/pdf/2510.10518
- GitHub: https://github.com/qunzhongwang/vr-thinker
- Contact: Qunzhong Wang
## Quick Start

A sample inference script is provided below:

```python
import warnings

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

warnings.filterwarnings("ignore")
# Load the VR-Thinker checkpoint and its processor
model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
# Two candidate videos generated for the same caption
video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4",  # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4",  # sample video 2
]
prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
N = 96
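# Frame indices 1..N//2 refer to Video 1 and N//2+1..N to Video 2, so N = 96
# gives each video a 48-frame index range in the prompt below.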
prompt_text = \
f"""**Task Description**:
Your task is to compare two videos generated based on the same caption by analyzing their frames in detail. This involves an iterative process of reasoning, zooming in on details, and dynamically selecting frames for further analysis. The provided frames are snapshots from these videos:
- The first four input frames correspond to Video 1.
- The next four input frames correspond to Video 2.
The caption is: {prompt_for_videos}
**Evaluation Dimensions**:
You need to evaluate the videos based on the following dimensions:
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}
**Frames and Analysis Rules**:
- You are provided with 8 sampled input frames:
- The first four input frames are sampled from the first {N//2} actual frames of Video 1.
- The next four input frames are sampled from the first {N//2} actual frames of Video 2.
- These input frames are evenly sampled (e.g., for N = 96, frames 1, 12, 24, 36 for Video 1, and frames 49, 60, 72, 84 for Video 2).
- If the provided input frames are insufficient for a detailed comparison, you must request additional frames:
- Select up to 8 additional frames (4 from each video, ensuring strict correspondence between the two videos, i.e., each Video 2 frame index must equal the corresponding Video 1 frame index + {N//2}).
- Frame selection must be logical and based on specific transitions or critical differences observed in the analysis.
**Process**:
1. **Round 1 Analysis**:
- Start by analyzing the first 8 input frames.
- Compare the videos based on the evaluation dimensions.
- If differences are subtle, identify specific key moments for further comparison and request additional frames.
- Use the `<tool_call>` to select up to 8 additional frames. Example:
`<tool_call>{{"name": "select_frames", "arguments": {{"target_frames": [12, 16, 20, 24, 60, 64, 68, 72]}}}}</tool_call>`
- Use `<recommend answer>` to output your current inclination and confidence level.
2. **Subsequent Rounds**:
- Analyze the newly provided frames.
- If differences remain unclear, request further frames and continue reasoning.
- If the new frames are repetitive or insufficient, adjust your focus to different sets of frames.
- Use `<recommend answer>` to output your current inclination and confidence level until a final answer is reached.
3. **Final Output**:
- After completing your analysis, output exactly one of the following answers:
- `1` if Video 1 is better,
- `2` if Video 2 is better,
- `0` if Video 1 and Video 2 are tied.
- Provide a breakdown of the evaluation dimensions using this format:
`<final answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4 </final answer>`
- **OA** (Overall Assessment): Represents the overall preference.
- **i_1, i_2, i_3, i_4**: One of {{0, 1, 2}}.
4. **Format Requirements**:
- Your analysis must be explicitly structured using the following tags:
- `<snapshot>`: Use this tag to summarize the observations from the current round. This summary is critical because subsequent rounds will rely on your synthesis to track progress and frame-specific details.
- `<think>`: Use this tag to describe your reasoning process, including decisions about frame selection or task approach.
- `<recommend answer>`: Use this tag to output your current inclination, including confidence level:
`<recommend answer> TA = i_1, MQ = i_2, VQ = i_3, OA = i_4, CF = i_5 </recommend answer>`
- **CF** (Confidence): One of {{1, 2, 3, 4}}, where 4 indicates high confidence and 1 indicates low confidence.
- `<final answer>`: Use this tag only when giving the final decision.
"""
sys_prompt = \
"""You are a helpful assistant. \n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:
<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to N.\"}}},
\"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""
content_list = [{"type": "video", "video": url, "nframes": 4 } for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})
messages = [
    {
        "role": "system",
        "content": sys_prompt,
    },
    {
        "role": "user",
        "content": content_list,
    },
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Run one evaluation round and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
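The script above runs a single generation round and prints the raw model output. Because the prompt lets the model either return scores or request more frames via `<tool_call>`, a complete evaluation loop needs to parse that output. The snippet below is a minimal sketch of such a parser, assuming the model follows the tag formats defined in `prompt_text` and `sys_prompt`; the helper names `parse_tool_call` and `parse_scores` are illustrative and not part of the released code.

```python
import json
import re


def parse_tool_call(text: str):
    """Extract requested frame indices from a <tool_call> block, if present."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if call.get("name") == "select_frames":
        return call.get("arguments", {}).get("target_frames")
    return None


def parse_scores(text: str, tag: str = "final answer"):
    """Read per-dimension scores from a <final answer> or <recommend answer> block."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    if match is None:
        return None
    return {key: int(val) for key, val in re.findall(r"(\w+)\s*=\s*(\d+)", match.group(1))}


response = output_text[0]
requested = parse_tool_call(response)
if requested is not None:
    # The model asked for more frames: re-sample these indices from the two videos
    # and run another generation round with the new frames appended to the chat.
    print("Requested frames:", requested)
else:
    print("Scores:", parse_scores(response))  # e.g. {"TA": 1, "MQ": 2, "VQ": 1, "OA": 1}
```

If `parse_tool_call` returns frame indices, you would re-sample those frames from the two videos, add them to the conversation, and run another round of generation; otherwise the `<final answer>` (or latest `<recommend answer>`) block contains the per-dimension scores.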
## Citation

```bibtex
@misc{wang2025vrthinkerboostingvideoreward,
  title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
  author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
  year={2025},
  eprint={2510.10518},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.10518},
}
```
## Model tree for qunwang13/vr-thinker

Base model: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)