---
license: mit
tags:
- multimodal
- visual-reasoning
- mathematics
- logic
- qwen
- vppo
datasets:
- chamber111/VPPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-32B-Instruct
---
# Model Card for VPPO-32B
## Model Details
### Model Description
**VPPO-32B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 32B-parameter version of our model, fine-tuned from `Qwen2.5-VL-32B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.
The core innovation of VPPO is how it addresses the "uniform learning signal" problem that plagues standard RL fine-tuning: instead of broadcasting a single sequence-level reward to every token in a reasoning chain, VPPO identifies the sparse, critical tokens that depend strongly on visual input and concentrates policy updates on them. This hierarchical "spotlight" mechanism allows the model to develop a more robust, genuinely perception-grounded reasoning capability.
As a result, VPPO-32B delivers significant performance improvements over strong baselines across a wide range of challenging benchmarks spanning mathematics, geometry, and logic, and it also shows more stable training and faster convergence than those baselines.
- **Model type:** Large Vision-Language Model (LVLM)
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-32B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)
### Model Sources
- **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL)
- **Paper:** [*Spotlight on Token Perception for Multimodal Reinforcement Learning* (arXiv:2510.09285)](https://arxiv.org/abs/2510.09285)
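### Quick Start
The checkpoint is expected to load with the same `transformers` classes as its base model. The snippet below is a minimal inference sketch, not code from the official repository: the repository id `chamber111/VPPO-32B`, the example image path, and the prompt are assumptions.
```python
# Minimal inference sketch (assumed repo id; standard Qwen2.5-VL loading path
# in recent transformers versions, requires accelerate for device_map="auto").
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "chamber111/VPPO-32B"  # assumed repository id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("geometry_problem.png")  # any local problem image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image. Show your reasoning."},
        ],
    }
]

# Build the chat prompt, then pack text and image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=2048)

# Strip the prompt tokens before decoding the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```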
## Training Details
### Training Data
The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K).
### Training Procedure
The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
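The repository linked above contains the authoritative implementation. Purely as an illustration, the sketch below shows one way token-level advantage shaping and gradient filtering could look once per-token visual-dependency scores are available; the function name, the score definition, and the exact use of `k` and `β_min` are assumptions (their default values mirror the hyperparameters listed below), not the released code.
```python
# Illustrative sketch only: shapes a GRPO-style token loss with per-token
# visual-dependency scores. Names and the precise weighting scheme are
# assumptions; see the VPPO-RL repository for the real implementation.
import torch


def vppo_token_loss(
    logprobs: torch.Tensor,      # (B, T) log-probs of sampled tokens, current policy
    old_logprobs: torch.Tensor,  # (B, T) log-probs under the rollout policy
    advantages: torch.Tensor,    # (B,) group-relative advantage per response
    visual_dep: torch.Tensor,    # (B, T) per-token visual-dependency scores in [0, 1]
    mask: torch.Tensor,          # (B, T) 1 for response tokens, 0 for padding
    k: float = 0.4,              # keep gradients for the top-k fraction of tokens
    beta_min: float = 0.9,       # floor of the advantage-shaping weight
    clip_eps: float = 0.2,
) -> torch.Tensor:
    # Advantage shaping: scale the shared sequence advantage per token,
    # interpolating between beta_min and 1.0 by visual dependency.
    weight = beta_min + (1.0 - beta_min) * visual_dep          # (B, T)
    shaped_adv = advantages.unsqueeze(1) * weight              # (B, T)

    # Gradient filtering: keep only the most visually dependent tokens.
    thresh = torch.quantile(visual_dep, 1.0 - k, dim=1, keepdim=True)
    keep = (visual_dep >= thresh).float() * mask

    # Standard clipped policy-gradient objective on the kept tokens.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * shaped_adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * shaped_adv
    per_token = -torch.minimum(unclipped, clipped) * keep

    return per_token.sum() / keep.sum().clamp_min(1.0)
```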
#### Training Hyperparameters
- **Base Model:** Qwen2.5-VL-32B-Instruct
- **Algorithm:** VPPO
- **Epochs:** 2
- **Learning Rate:** 1e-6
- **Rollout Batch Size:** 384
- **Max Response Length:** 2048
- **Entropy Penalty Coefficient:** 0.06
- **Gradient Filtering Ratio (k):** 0.4
- **Advantage Shaping Min (β_min):** 0.9
- **Training Regime:** bf16 mixed precision
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:
- **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- **Logic:** LogicVista
- **Multi-discipline:** MMMU-Pro
#### Metrics
Performance is measured by **average accuracy@8**, which is the average success rate over 8 independent generations per problem (at temperature=1.0) using exact-match scoring.
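As a concrete reading of this metric, the sketch below computes average accuracy@8 from per-sample exact-match results. It is a minimal illustration, not the project's evaluation harness, and the `exact_match` helper is hypothetical.
```python
# Minimal illustration of average accuracy@k (not the official evaluation code).
from typing import Callable, Sequence


def average_accuracy_at_k(
    generations: Sequence[Sequence[str]],     # generations[i]: k sampled answers for problem i
    references: Sequence[str],                # gold answer per problem
    exact_match: Callable[[str, str], bool],  # hypothetical scoring helper
    k: int = 8,
) -> float:
    """Average success rate over k independent generations per problem."""
    per_problem = []
    for answers, gold in zip(generations, references):
        hits = sum(exact_match(ans, gold) for ans in answers[:k])
        per_problem.append(hits / k)
    return sum(per_problem) / len(per_problem)


# Toy example with two problems and k=2 for brevity.
score = average_accuracy_at_k(
    generations=[["42", "41"], ["x=3", "x=3"]],
    references=["42", "x=3"],
    exact_match=lambda a, b: a.strip() == b.strip(),
    k=2,
)
print(score)  # 0.75
```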
## Citation
If you use this model in your work, please cite our paper:
**BibTeX:**
```bibtex
@article{huang2025spotlight,
  title   = {Spotlight on Token Perception for Multimodal Reinforcement Learning},
  author  = {Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu},
  journal = {arXiv preprint arXiv:2510.09285},
  year    = {2025}
}
```