---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
model_name: ob11/Qwen-VL-PRM-7B
license: apache-2.0
datasets:
- ob11/VL-PRM300K-V1-train
---
# Model Summary
> Qwen-VL-PRM-7B is a process reward model finetuned from Qwen2.5-VL-7B-Instruct on approximately 300,000 examples. Despite being trained mainly on abstract and elementary reasoning datasets, it delivers strong test-time scaling gains on a range of advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 policy models.
- **Logs:** https://wandb.ai/aisg-arf/multimodal-reasoning/runs/pj4oc0qh
- **Repository:** https://github.com/theogbrand/vlprm
- **Paper:** https://arxiv.org/pdf/2509.23250
# Use
Model usage is documented in [VisualPRMv2.py](https://github.com/theogbrand/vlprm/blob/main/eval/tts_eval/reward_guided_search/VisualPRMv2.py) in the project repository.
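For a quick start, the sketch below shows one plausible way to score a single reasoning step with `transformers`. The prompt layout, the image path, and the `"+"`/`"-"` judgment tokens are illustrative assumptions; the authoritative scoring protocol is the `VisualPRMv2.py` script linked above.
```python
# Minimal step-scoring sketch. Assumptions: the prompt layout, the image path,
# and the "+"/"-" judgment tokens are illustrative, not the confirmed protocol.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "ob11/Qwen-VL-PRM-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical example: an image-based puzzle plus one candidate reasoning step.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "puzzle.png"},  # placeholder image path
        {"type": "text", "text": "Question: <question>\nStep 1: <candidate step>\nIs this step correct?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits after the step

# Assumed scoring scheme: normalized probability of a positive judgment token.
plus_id = processor.tokenizer.convert_tokens_to_ids("+")
minus_id = processor.tokenizer.convert_tokens_to_ids("-")
step_score = torch.softmax(logits[[plus_id, minus_id]], dim=-1)[0].item()
print(f"step score: {step_score:.3f}")
```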
# Evaluation
### Commercial Models
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|-------|------|-----------|---------------|-----------|------------|---------|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |
### Qwen-2.5-VL Family
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|-------|------|-----------|---------------|-----------|------------|---------|
| **Qwen-2.5-VL-3B** | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| **Qwen-2.5-VL-7B** | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| **Qwen-2.5-VL-32B** | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |
### Gemma-3 Family
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|-------|------|-----------|---------------|-----------|------------|---------|
| **Gemma-3-12B** | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+14.0) | 41.1 (+12.0) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| **Gemma-3-27B** | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |
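All "+ VL-PRM" rows use the PRM for reward-guided search over candidate solutions sampled from the policy model. The sketch below illustrates the best-of-N idea behind these gains; `generate_candidates` and `score_steps` are hypothetical helpers (the latter along the lines of the scoring snippet above), and the exact search procedure lives in the repository's `tts_eval` code.
```python
# Best-of-N selection sketch: sample N candidate solutions from the policy model,
# score each step with the PRM, and keep the candidate with the best aggregate score.
from typing import Callable

def best_of_n(
    question: str,
    image: str,
    generate_candidates: Callable[[str, str, int], list[list[str]]],  # policy sampler (hypothetical)
    score_steps: Callable[[str, str, list[str]], list[float]],        # PRM scorer (hypothetical)
    n: int = 8,
) -> list[str]:
    candidates = generate_candidates(question, image, n)  # N step-by-step solutions

    def aggregate(steps: list[str]) -> float:
        # One common aggregation choice: a solution is only as good as its weakest step.
        return min(score_steps(question, image, steps))

    return max(candidates, key=aggregate)
```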
### Framework versions
- TRL: 0.19.1
- Transformers: 4.55.3
- PyTorch: 2.7.1
- Datasets: 3.0.1
- Tokenizers: 0.21.4
# Citations
```bibtex
@misc{ong2025vlprms,
    title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned},
    author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
    year={2025},
    eprint={2509.23250},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/pdf/2509.23250},
}
```