---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
model_name: ob11/Qwen-VL-PRM-7B
license: apache-2.0
datasets:
- ob11/VL-PRM300K-V1-train
---

# Model Summary

> Qwen-VL-PRM-7B is a process reward model fine-tuned from Qwen2.5-VL-7B-Instruct on approximately 300,000 examples (VL-PRM300K). Despite being trained mainly on abstract and elementary reasoning datasets, it delivers strong test-time scaling gains on advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 policy models.

- **Logs:** https://wandb.ai/aisg-arf/multimodal-reasoning/runs/pj4oc0qh
- **Repository:** https://github.com/theogbrand/vlprm
- **Paper:** https://arxiv.org/pdf/2509.23250

# Use

Model usage is documented in [`VisualPRMv2.py`](https://github.com/theogbrand/vlprm/blob/main/eval/tts_eval/reward_guided_search/VisualPRMv2.py) in the companion repository; the sketch below shows the general shape of the scoring loop.
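
Since the exact prompt and label format live in that script, the following is only a minimal sketch: it assumes the PRM is queried as a generative judge over one reasoning step at a time, with `image` a `PIL.Image`, and scored by the probability of a `+` versus `-` verdict token. The judge prompt, the `+`/`-` labels, and the `score_step` helper are illustrative assumptions, not the authors' template.

```python
# Minimal sketch only: load the PRM as a standard Qwen2.5-VL checkpoint and
# score one candidate reasoning step. The judge prompt and the "+"/"-"
# verdict tokens below are assumptions for illustration; the real format
# is defined in the linked VisualPRMv2.py.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "ob11/Qwen-VL-PRM-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def score_step(image, question: str, steps_so_far: str, candidate_step: str) -> float:
    """Return P('+' | judge prompt): a hypothetical per-step correctness score."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                f"Question: {question}\n"
                f"Steps so far:\n{steps_so_far}\n"
                f"Next step: {candidate_step}\n"
                "Is the next step correct? Answer with + or - only."
            )},
        ],
    }]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Logits for the first generated token, i.e. the verdict position.
        next_token_logits = model(**inputs).logits[0, -1]
    plus_id = processor.tokenizer.convert_tokens_to_ids("+")
    minus_id = processor.tokenizer.convert_tokens_to_ids("-")
    probs = torch.softmax(next_token_logits[[plus_id, minus_id]], dim=-1)
    return probs[0].item()
```

For best-of-N test-time scaling, sample several full solutions from the policy model, score every step this way, aggregate per candidate (for example, take the minimum step score), and keep the top-scoring candidate; that pattern is sketched after the evaluation tables below.
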
# Evaluation
### Commercial Models
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|-------|------|-----------|---------------|-----------|------------|---------|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |
### Qwen-2.5-VL Family
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|-------|------|-----------|---------------|-----------|------------|---------|
| **Qwen-2.5-VL-3B** | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| **Qwen-2.5-VL-7B** | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| **Qwen-2.5-VL-32B** | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |
### Gemma-3 Family
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|-------|------|-----------|---------------|-----------|------------|---------|
| **Gemma-3-12B** | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+14.0) | 41.1 (+12.0) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| **Gemma-3-27B** | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |
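
The `+ VL-PRM` rows above come from using the PRM to rerank sampled solutions at test time. As a rough, self-contained illustration of one common selection rule (best-of-N with min-aggregation over step scores), with hypothetical helper names; the paper's actual reward-guided search lives in the linked repository and may differ:

```python
# Best-of-N reranking sketch with hypothetical names. `score_step` is any
# per-step PRM scorer taking (previous_steps, candidate_step); the sketch
# in the Use section can be adapted to this shape.
from typing import Callable, List

def select_best(
    candidates: List[List[str]],                    # N sampled solutions, each a list of steps
    score_step: Callable[[List[str], str], float],  # PRM score for a step given its prefix
) -> List[str]:
    """Keep the candidate whose weakest step scores highest (min-aggregation)."""
    def trajectory_score(steps: List[str]) -> float:
        return min(score_step(steps[:i], step) for i, step in enumerate(steps))
    return max(candidates, key=trajectory_score)
```

To reuse the earlier `score_step`, bind its `image` and `question` arguments (e.g. with `functools.partial`) so it matches the two-argument signature expected here.
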
### Framework versions

- TRL: 0.19.1
- Transformers: 4.55.3
- PyTorch: 2.7.1
- Datasets: 3.0.1
- Tokenizers: 0.21.4

# Citations

```bibtex
@misc{ong2025vlprms,
      title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned}, 
      author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
      year={2025},
      eprint={2509.23250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/pdf/2509.23250}, 
}
```