---
license: mit
tags:
- multimodal
- visual-reasoning
- mathematics
- logic
- qwen
- vppo
datasets:
- chamber111/VPPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-32B-Instruct
---
# Model Card for VPPO-32B
## Model Details
### Model Description
**VPPO-32B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 32B-parameter version of our model, fine-tuned from `Qwen2.5-VL-32B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.
The core innovation of VPPO is how it addresses the "uniform learning signal" problem that plagues standard RL fine-tuning: instead of broadcasting a single sequence-level reward to every token in a reasoning chain, VPPO identifies the sparse, critical tokens that depend strongly on visual input and concentrates policy updates on them. This hierarchical "spotlight" mechanism allows the model to develop a more robust, genuinely perception-grounded reasoning capability.
As a result, VPPO-32B delivers significant performance improvements over strong baselines across a wide range of challenging benchmarks spanning mathematics, geometry, and logic, and it also shows more stable training and faster convergence than those baselines.
- **Model type:** Large Vision-Language Model (LVLM)
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)
### Model Sources
- **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL)
- **Paper:** [*Spotlight on Token Perception for Multimodal Reinforcement Learning* (arXiv:2510.09285)](https://arxiv.org/abs/2510.09285)
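### Quick Start
The checkpoint is expected to load with the same `transformers` classes as its base model. The snippet below is a minimal inference sketch, not code from the official repository: the repository id `chamber111/VPPO-32B`, the example image path, and the prompt are assumptions.
```python
# Minimal inference sketch (assumed repo id; standard Qwen2.5-VL loading path
# in recent transformers versions, requires accelerate for device_map="auto").
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "chamber111/VPPO-32B"  # assumed repository id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("geometry_problem.png")  # any local problem image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image. Show your reasoning."},
        ],
    }
]

# Build the chat prompt, then pack text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=2048)

# Strip the prompt tokens before decoding the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```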
## Training Details
### Training Data
The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K).
### Training Procedure
The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
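The repository linked above contains the authoritative implementation. Purely as an illustration, the sketch below shows one way token-level advantage shaping and gradient filtering could look once per-token visual-dependency scores are available; the function name, the score definition, and the exact use of `k` and `β_min` are assumptions (their default values mirror the hyperparameters listed below), not the released code.
```python
# Illustrative sketch only: shapes a GRPO-style token loss with per-token
# visual-dependency scores. Names and the precise weighting scheme are
# assumptions; see the VPPO-RL repository for the real implementation.
import torch


def vppo_token_loss(
    logprobs: torch.Tensor,      # (B, T) log-probs of sampled tokens, current policy
    old_logprobs: torch.Tensor,  # (B, T) log-probs under the rollout policy
    advantages: torch.Tensor,    # (B,) group-relative advantage per response
    visual_dep: torch.Tensor,    # (B, T) per-token visual-dependency scores in [0, 1]
    mask: torch.Tensor,          # (B, T) 1 for response tokens, 0 for padding
    k: float = 0.4,              # keep gradients for the top-k fraction of tokens
    beta_min: float = 0.9,       # floor of the advantage-shaping weight
    clip_eps: float = 0.2,
) -> torch.Tensor:
    # Advantage shaping: scale the shared sequence advantage per token,
    # interpolating between beta_min and 1.0 by visual dependency.
    weight = beta_min + (1.0 - beta_min) * visual_dep          # (B, T)
    shaped_adv = advantages.unsqueeze(1) * weight              # (B, T)

    # Gradient filtering: keep only the most visually dependent tokens.
    thresh = torch.quantile(visual_dep, 1.0 - k, dim=1, keepdim=True)
    keep = (visual_dep >= thresh).float() * mask

    # Standard clipped policy-gradient objective on the kept tokens.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * shaped_adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * shaped_adv
    per_token = -torch.minimum(unclipped, clipped) * keep

    return per_token.sum() / keep.sum().clamp_min(1.0)
```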
#### Training Hyperparameters
- **Base Model:** Qwen2.5-VL-32B-Instruct
- **Algorithm:** VPPO
- **Epochs:** 2
- **Learning Rate:** 1e-6
- **Rollout Batch Size:** 384
- **Max Response Length:** 2048
- **Entropy Penalty Coefficient:** 0.06
- **Gradient Filtering Ratio (k):** 0.4
- **Advantage Shaping Min (β_min):** 0.9
- **Training Regime:** bf16 mixed precision
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:
- **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- **Logic:** LogicVista
- **Multi-discipline:** MMMU-Pro
#### Metrics
Performance is measured by **average accuracy@8**, which is the average success rate over 8 independent generations per problem (at temperature=1.0) using exact-match scoring.
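As a concrete reading of this metric, the sketch below computes average accuracy@8 from per-sample exact-match results. It is a minimal illustration, not the project's evaluation harness, and the `exact_match` helper is hypothetical.
```python
# Minimal illustration of average accuracy@k (not the official evaluation code).
from typing import Callable, Sequence


def average_accuracy_at_k(
    generations: Sequence[Sequence[str]],     # generations[i]: k sampled answers for problem i
    references: Sequence[str],                # gold answer per problem
    exact_match: Callable[[str, str], bool],  # hypothetical scoring helper
    k: int = 8,
) -> float:
    """Average success rate over k independent generations per problem."""
    per_problem = []
    for answers, gold in zip(generations, references):
        hits = sum(exact_match(ans, gold) for ans in answers[:k])
        per_problem.append(hits / k)
    return sum(per_problem) / len(per_problem)


# Toy example with two problems and k=2 for brevity.
score = average_accuracy_at_k(
    generations=[["42", "41"], ["x=3", "x=3"]],
    references=["42", "x=3"],
    exact_match=lambda a, b: a.strip() == b.strip(),
    k=2,
)
print(score)  # 0.75
```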
## Citation
If you use this model in your work, please cite our paper:
**BibTeX:**
```bibtex
@article{huang2025spotlight,
  title   = {Spotlight on Token Perception for Multimodal Reinforcement Learning},
  author  = {Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu},
  journal = {arXiv preprint arXiv:2510.09285},
  year    = {2025}
}
```