---
license: mit
tags:
- multimodal
- visual-reasoning
- mathematics
- logic
- qwen
- vppo
datasets:
- chamber111/VPPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-32B-Instruct
---

# Model Card for VPPO-32B

## Model Details

### Model Description

**VPPO-32B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 32B-parameter version of our model, fine-tuned from `Qwen2.5-VL-32B-Instruct` with a novel reinforcement learning algorithm, **Visually-Perceptive Policy Optimization (VPPO)**.

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability.

As a result, VPPO-32B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.

- **Model type:** Large Vision-Language Model (LVLM)
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)

### Model Sources

- **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL)
- **Paper:** [`2510.09285`](https://arxiv.org/abs/2510.09285)
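
The checkpoint is intended to be loaded through the standard Qwen2.5-VL path in `transformers`. The sketch below is illustrative only: the repository id is a placeholder, and the prompt and generation settings are assumptions rather than the evaluation configuration.

```python
# Minimal inference sketch (illustrative). Assumes a recent transformers
# release with Qwen2.5-VL support; replace the placeholder repo id with the
# actual Hub id of this checkpoint.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "path/to/VPPO-32B"  # placeholder repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("geometry_problem.png")  # any local problem image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Solve the problem in the image. Reason step by step."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
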
## Training Details

### Training Data

The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K).
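
For reference, the training data can be pulled directly from the Hub with `datasets`; the split name and field names below are assumptions, so inspect the dataset schema before use.

```python
# Quick look at the training data (split and field names are assumptions).
from datasets import load_dataset

ds = load_dataset("chamber111/VPPO_ViRL39K_train", split="train")
print(ds)            # number of rows and feature schema
print(ds[0].keys())  # e.g. question / image / answer fields; names may differ
```
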

### Training Procedure

The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
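
To make the procedure concrete, the sketch below illustrates the two token-level operations described above (advantage shaping and gradient filtering) on top of a clipped policy loss. It is an illustrative reconstruction, not the authors' implementation: the dependency measure, normalization, and loss aggregation are assumptions, and the exact formulation is given in the paper and the `VPPO-RL` repository.

```python
# Illustrative sketch of the VPPO token weighting and policy loss (not the
# authors' code). Default k and beta_min match the hyperparameters below.
import torch

def vppo_token_weights(logp_with_image, logp_without_image, k=0.4, beta_min=0.9):
    """Turn per-token visual dependency into shaping weights and a gradient mask."""
    # Visual dependency: how much each token's log-prob drops when the image is removed.
    dependency = (logp_with_image - logp_without_image).clamp(min=0.0)   # [T]
    dep_norm = dependency / (dependency.max() + 1e-8)                    # scale to [0, 1]

    # Advantage shaping: every token keeps at least beta_min of the sequence-level
    # advantage; strongly vision-dependent tokens receive up to the full advantage.
    shaping = beta_min + (1.0 - beta_min) * dep_norm                     # [T]

    # Gradient filtering: only the top-k fraction of tokens (by dependency)
    # contribute gradients to the policy update.
    num_keep = max(1, int(k * dependency.numel()))
    threshold = dependency.topk(num_keep).values.min()
    grad_mask = (dependency >= threshold).float()                        # [T]
    return shaping, grad_mask

def vppo_policy_loss(logp_new, logp_old, advantage, shaping, grad_mask, clip_eps=0.2):
    """GRPO-style clipped objective with VPPO's token-level weighting."""
    ratio = torch.exp(logp_new - logp_old)                               # [T]
    shaped_adv = advantage * shaping                                     # scalar advantage broadcast per token
    unclipped = ratio * shaped_adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * shaped_adv
    per_token = -torch.min(unclipped, clipped) * grad_mask
    return per_token.sum() / grad_mask.sum().clamp(min=1.0)
```
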

#### Training Hyperparameters

- **Base Model:** Qwen2.5-VL-32B-Instruct
- **Algorithm:** VPPO
- **Epochs:** 2
- **Learning Rate:** 1e-6
- **Rollout Batch Size:** 384
- **Max Response Length:** 2048
- **Entropy Penalty Coefficient:** 0.06
- **Gradient Filtering Ratio (k):** 0.4
- **Advantage Shaping Min (β_min):** 0.9
- **Training Regime:** bf16 mixed precision

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:
-   **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
-   **Logic:** LogicVista
-   **Multi-discipline:** MMMU-Pro

#### Metrics

Performance is measured by **average accuracy@8**, which is the average success rate over 8 independent generations per problem (at temperature=1.0) using exact-match scoring.
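
As a reference for the metric itself, a minimal computation of avg accuracy@8 is sketched below; the exact-match checking and data layout are assumptions.

```python
# Minimal sketch of avg accuracy@8. Exact-match checking is assumed to happen
# elsewhere and to yield one boolean per generation.
def avg_accuracy_at_k(results_per_problem, k=8):
    """results_per_problem: one list of k booleans per problem, each marking
    whether an independent generation (temperature=1.0) was an exact match."""
    per_problem = [sum(r[:k]) / k for r in results_per_problem]
    return sum(per_problem) / len(per_problem)

# Example: two problems with 8 generations each -> (6/8 + 4/8) / 2 = 0.625
print(avg_accuracy_at_k([[True] * 6 + [False] * 2, [True] * 4 + [False] * 4]))
```
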

## Citation

If you use this model in your work, please cite our paper:

**BibTeX:**

```bibtex
@article{huang2025spotlight,
  title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
  author={Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu},
  journal={arXiv preprint arXiv:2510.09285},
  year={2025}
}
```