Qwen-VL-PRM-7B
metadata
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
model_name: ob11/Qwen-VL-PRM-7B
license: apache-2.0
datasets:
  - ob11/VL-PRM300K-V1-train

Model Summary

Qwen-VL-PRM-7B is a process reward model fine-tuned from Qwen2.5-VL-7B-Instruct on approximately 300,000 examples from the VL-PRM300K dataset. Although it was trained mainly on abstract and elementary reasoning data, it yields strong test-time scaling gains on a range of advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 models.
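
As a hedged illustration of how a PRM drives test-time scaling, the sketch below performs best-of-N selection: a policy model samples several step-by-step solutions, and the PRM's per-step rewards pick the winner. `generate_candidates` and `score_steps` are hypothetical placeholders, not functions shipped with this model, and min-aggregation over steps is just one common choice.

```python
# Hedged best-of-N sketch. `generate_candidates` (a policy model) and
# `score_steps` (this PRM) are hypothetical placeholders; see the Use
# section for a sketch of actual PRM scoring.
from typing import Callable

def best_of_n(
    question: str,
    image_path: str,
    generate_candidates: Callable[[str, str, int], list[list[str]]],
    score_steps: Callable[[str, str, list[str]], list[float]],
    n: int = 8,
) -> list[str]:
    """Return the sampled solution whose weakest step the PRM rates highest."""
    candidates = generate_candidates(question, image_path, n)  # n solutions, each a list of steps
    def chain_value(steps: list[str]) -> float:
        rewards = score_steps(question, image_path, steps)  # one reward per step, in [0, 1]
        return min(rewards)  # judge a chain by its weakest step (min-aggregation)
    return max(candidates, key=chain_value)
```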

Use

Model usage is documented here; a hedged loading-and-scoring sketch follows.
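
Since the usage docs are not reproduced in this card, the following is a minimal sketch of loading the model with transformers and scoring one candidate step. The "+"/"-" judgment prompt, the image file, and the step format are assumptions for illustration, not the documented interface.

```python
# Hedged sketch: load the PRM and score one candidate reasoning step.
# Assumptions not confirmed by this card: the PRM judges a step by emitting
# a "+" (correct) or "-" (incorrect) token; "problem.png" and the prompt
# format are hypothetical. Follow the linked usage docs for the real interface.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "ob11/Qwen-VL-PRM-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("problem.png")  # hypothetical input image
prompt = (
    "Question: <question text>\n"
    "Step 1: <candidate reasoning step>\n"
    "Is this step correct? Answer + or -."
)
messages = [{"role": "user",
             "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits after the prompt

plus_id = processor.tokenizer.convert_tokens_to_ids("+")
minus_id = processor.tokenizer.convert_tokens_to_ids("-")
reward = torch.softmax(logits[[plus_id, minus_id]], dim=-1)[0].item()
print(f"P(step correct) ≈ {reward:.3f}")
```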

Evaluation

Commercial Models

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |

Qwen-2.5-VL Family

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-3B | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| Qwen-2.5-VL-7B | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| Qwen-2.5-VL-32B | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |

Gemma-3 Family

| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Gemma-3-12B | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+12.0) | 41.1 (+4.5) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| Gemma-3-27B | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |

Framework versions

  • TRL: 0.19.1
  • Transformers: 4.55.3
  • PyTorch: 2.7.1
  • Datasets: 3.0.1
  • Tokenizers: 0.21.4
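
To approximately reproduce this environment, the versions listed above can be pinned at install time (a sketch; Python and CUDA constraints are not stated in this card):

```shell
pip install "trl==0.19.1" "transformers==4.55.3" "torch==2.7.1" "datasets==3.0.1" "tokenizers==0.21.4"
```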

Citation

```bibtex
@misc{ong2025vlprms,
      title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned},
      author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
      year={2025},
      eprint={2509.23250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/pdf/2509.23250},
}
```