|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- multimodal |
|
|
- visual-reasoning |
|
|
- mathematics |
|
|
- logic |
|
|
- qwen |
|
|
- vppo |
|
|
datasets: |
|
|
- chamber111/VPPO_ViRL39K_train |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-32B-Instruct |
|
|
--- |
|
|
|
|
|
# Model Card for VPPO-32B |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
**VPPO-32B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 32B-parameter version of our model, fine-tuned from `Qwen2.5-VL-32B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.
|
|
|
|
|
The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability. |
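To make the idea concrete, here is a minimal sketch of one way to score per-token visual dependency: compare the policy's next-token distribution with and without the image in the context and measure how much it shifts. The function name and the KL-based measure are illustrative assumptions for exposition, not necessarily the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def token_visual_dependency(logits_with_image: torch.Tensor,
                            logits_without_image: torch.Tensor) -> torch.Tensor:
    """Score each generated token by how much the policy's prediction shifts
    when the image is removed from the context.

    Both inputs have shape (seq_len, vocab_size); the output has shape (seq_len,).
    """
    log_p = F.log_softmax(logits_with_image, dim=-1)    # distribution given text + image
    log_q = F.log_softmax(logits_without_image, dim=-1)  # distribution given text only
    # KL(p || q): large for tokens whose prediction depends heavily on the image.
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)
```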
|
|
|
|
|
As a result, VPPO-32B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence. |
|
|
|
|
|
- **Model type:** Large Vision-Language Model (LVLM) |
|
|
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL) |
|
|
- **Paper:** [`2510.09285`](https://arxiv.org/abs/2510.09285) |
|
|
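Because the model is fine-tuned from Qwen2.5-VL-32B-Instruct, it should load with the standard Qwen2.5-VL workflow in 🤗 Transformers. The sketch below assumes `transformers >= 4.49` and the `qwen-vl-utils` helper package; the repository id `chamber111/VPPO-32B` is an assumption, so substitute the actual Hub id of this model.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "chamber111/VPPO-32B"  # assumed repo id; replace with this model's actual Hub id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/diagram.png"},
            {"type": "text", "text": "Solve the geometry problem shown in the figure and explain your reasoning."},
        ],
    }
]

# Standard Qwen2.5-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=2048)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```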
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K). |
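A minimal snippet for pulling the training data from the Hub is shown below; split names and the column schema are not assumed here, so inspect the printed structure before use.

```python
# Minimal sketch: fetch the RL training data from the Hugging Face Hub.
from datasets import load_dataset

dataset = load_dataset("chamber111/VPPO_ViRL39K_train")
print(dataset)  # inspect split names and features; the schema is not assumed here
```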
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step. |
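As an illustration of how per-token dependency scores could enter a GRPO-style update, the sketch below damps the group-relative advantage on low-dependency tokens and masks gradients outside the top-k fraction of most visually dependent tokens. The helper name and the exact shaping function are assumptions for exposition; the precise VPPO formulation is given in the paper. The `beta_min` and `k` defaults mirror the hyperparameters listed below.

```python
import torch

def vppo_token_weights(dependency: torch.Tensor,
                       beta_min: float = 0.9,
                       k: float = 0.4) -> tuple[torch.Tensor, torch.Tensor]:
    """dependency: (seq_len,) per-token visual-dependency scores for one response.

    Returns (advantage_scale, gradient_mask):
      - advantage_scale in [beta_min, 1]: low-dependency tokens get a damped advantage.
      - gradient_mask: 1 for the top-k fraction of tokens by dependency, 0 elsewhere,
        so the policy update concentrates on the sparse, visually grounded tokens.
    """
    norm = (dependency - dependency.min()) / (dependency.max() - dependency.min() + 1e-8)
    advantage_scale = beta_min + (1.0 - beta_min) * norm

    n_keep = max(1, int(k * dependency.numel()))
    top_idx = torch.topk(dependency, n_keep).indices
    gradient_mask = torch.zeros_like(dependency)
    gradient_mask[top_idx] = 1.0
    return advantage_scale, gradient_mask

# Illustrative per-token objective (A_group is the group-relative GRPO advantage):
#   loss_t = gradient_mask_t * (advantage_scale_t * A_group) * (-log pi(a_t | s_t))
```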
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Base Model:** Qwen2.5-VL-32B-Instruct |
|
|
- **Algorithm:** VPPO |
|
|
- **Epochs:** 2 |
|
|
- **Learning Rate:** 1e-6 |
|
|
- **Rollout Batch Size:** 384 |
|
|
- **Max Response Length:** 2048 |
|
|
- **Entropy Penalty Coefficient:** 0.06 |
|
|
- **Gradient Filtering Ratio (k):** 0.4 |
|
|
- **Advantage Shaping Min (β_min):** 0.9 |
|
|
- **Training Regime:** bf16 mixed precision |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks: |
|
|
- **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12 |
|
|
- **Logic:** LogicVista |
|
|
- **Multi-discipline:** MMMU-Pro |
|
|
|
|
|
#### Metrics |
|
|
|
|
|
Performance is measured by **average accuracy@8**: the average exact-match success rate over 8 independent generations per problem, sampled at temperature 1.0.
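For clarity, the sketch below shows how this metric can be computed. The helper name and the bare string-comparison match rule are simplifying assumptions; in practice, benchmark-specific answer extraction typically precedes the exact match.

```python
from typing import Callable, Sequence

def average_accuracy_at_k(samples: Sequence[Sequence[str]],
                          references: Sequence[str],
                          match: Callable[[str, str], bool] = lambda p, r: p.strip() == r.strip()
                          ) -> float:
    """samples[i] holds the k (here 8) sampled answers for problem i."""
    per_problem = [
        sum(match(pred, ref) for pred in preds) / len(preds)
        for preds, ref in zip(samples, references)
    ]
    return sum(per_problem) / len(per_problem)
```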
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your work, please cite our paper: |
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```bibtex |
|
|
@article{huang2025spotlight, |
|
|
title={Spotlight on Token Perception for Multimodal Reinforcement Learning}, |
|
|
author={Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu}, |
|
|
journal={arXiv preprint arXiv:2510.09285}, |
|
|
year={2025} |
|
|
} |
|
|
``` |