---
license: mit
tags:
- multimodal
- visual-reasoning
- mathematics
- logic
- qwen
- vppo
library_name: transformers
pipeline_tag: image-text-to-text
datasets:
- chamber111/VPPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Model Card for VPPO-7B

## Model Details

### Model Description

**VPPO-7B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B-parameter version of our model, fine-tuned from `Qwen2.5-VL-7B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.

The core innovation of VPPO is its ability to address the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust, genuinely perception-grounded reasoning capability.

As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.

- **Model type:** Large Vision-Language Model (LVLM)
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)

### Model Sources

- **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL)
- **Paper:** [`2510.09285`](https://arxiv.org/abs/2510.09285)

## Training Details

### Training Data

The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K).

### Training Procedure

The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.

#### Training Hyperparameters

- **Base Model:** Qwen2.5-VL-7B-Instruct
- **Algorithm:** VPPO
- **Epochs:** 2
- **Learning Rate:** 1e-6
- **Rollout Batch Size:** 384
- **Max Response Length:** 2048
- **Entropy Penalty Coefficient:** 0.06
- **Gradient Filtering Ratio (k):** 0.4
- **Advantage Shaping Min (β_min):** 0.9
- **Training Regime:** bf16 mixed precision

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:

- **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- **Logic:** LogicVista
- **Multi-discipline:** MMMU-Pro

#### Metrics

Performance is measured by **average accuracy@8**: the average success rate over 8 independent generations per problem (sampled at temperature 1.0), scored by exact match. A small sketch of this computation is shown below.
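The following is a minimal sketch of how such a metric could be computed, assuming each problem is a dict with an `answer` field and hypothetical `generate_answers` / `is_exact_match` helpers that wrap the model's sampling and answer normalization; it illustrates the scoring scheme only and is not the released evaluation code.

```python
from typing import Callable, List, Sequence

def average_accuracy_at_k(
    problems: Sequence[dict],
    generate_answers: Callable[[dict, int, float], List[str]],  # hypothetical sampler wrapping the model
    is_exact_match: Callable[[str, str], bool],                 # hypothetical exact-match checker
    k: int = 8,
    temperature: float = 1.0,
) -> float:
    """Average accuracy@k: mean per-problem success rate over k sampled generations."""
    per_problem_scores = []
    for problem in problems:
        # Draw k independent generations for this problem at the given temperature.
        answers = generate_answers(problem, k, temperature)
        # Score each generation by exact match against the reference answer.
        correct = sum(is_exact_match(ans, problem["answer"]) for ans in answers)
        per_problem_scores.append(correct / k)
    # Report the mean per-problem success rate across the benchmark.
    return sum(per_problem_scores) / len(per_problem_scores)
```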
## Citation

If you use this model in your work, please cite our paper:

**BibTeX:**

```bibtex
@misc{huang2025spotlighttokenperceptionmultimodal,
      title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
      author={Siyuan Huang and Xiaoye Qu and Yafu Li and Yun Luo and Zefeng He and Daizong Liu and Yu Cheng},
      year={2025},
      eprint={2510.09285},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09285},
}
```