---
license: mit
tags:
  - multimodal
  - visual-reasoning
  - mathematics
  - logic
  - qwen
  - vppo
library_name: transformers
pipeline_tag: image-text-to-text
datasets:
  - chamber111/VPPO_ViRL39K_train
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
---
# Model Card for VPPO-7B

## Model Details

### Model Description
VPPO-7B is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B-parameter version of our model, fine-tuned from Qwen2.5-VL-7B-Instruct using a novel reinforcement learning algorithm called Visually-Perceptive Policy Optimization (VPPO).
The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability.
As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.
- Model type: Large Vision-Language Model (LVLM)
- Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct
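
VPPO-7B can be served through the standard Qwen2.5-VL interface in Transformers. Below is a minimal single-image inference sketch; the repository id `chamber111/VPPO-7B` and the image path are illustrative assumptions, so substitute the actual checkpoint and input you are using.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# NOTE: repo id and image path below are placeholders for illustration.
model_id = "chamber111/VPPO-7B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("geometry_problem.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem shown in the figure. Reason step by step."},
        ],
    }
]

# Build the chat prompt, then tokenize the text and image together.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```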
### Model Sources

- Repository: VPPO-RL
- Paper: [Spotlight on Token Perception for Multimodal Reinforcement Learning (arXiv:2510.09285)](https://arxiv.org/abs/2510.09285)
## Training Details

### Training Data
The model was fine-tuned on ViRL39K, a diverse collection of multimodal reasoning problems. The original dataset is available on the Hugging Face Hub at [TIGER-Lab/ViRL39K](https://huggingface.co/datasets/TIGER-Lab/ViRL39K); the training split used for this model is listed in this card's metadata as chamber111/VPPO_ViRL39K_train.
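
For a quick look at the data, the training split referenced in this card's metadata can be loaded with the `datasets` library. A minimal sketch follows; the split name `train` is an assumption, so check the dataset configuration before use.

```python
from datasets import load_dataset

# Load the training split referenced in this card's metadata.
# The split name "train" is an assumption; adjust if the dataset uses a different one.
train_ds = load_dataset("chamber111/VPPO_ViRL39K_train", split="train")

print(train_ds)     # dataset size and column names
print(train_ds[0])  # inspect one multimodal reasoning problem
```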
### Training Procedure
The model was trained using our Visually-Perceptive Policy Optimization (VPPO) algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
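
To make the token-level visual dependency step concrete, here is a hedged sketch of one possible proxy: the drop in each generated token's log-probability when the image is removed from the prompt. This is an illustration of the idea rather than the exact estimator from the paper; the two logits tensors are assumed to come from forward passes of the policy with and without the visual input.

```python
import torch
import torch.nn.functional as F

def token_visual_dependency(logits_with_image: torch.Tensor,
                            logits_text_only: torch.Tensor,
                            response_ids: torch.Tensor) -> torch.Tensor:
    """Per-token visual dependency for one rollout (simplified proxy).

    logits_with_image, logits_text_only: [T, vocab] logits over the response,
        from forward passes with and without the image in the prompt.
    response_ids: [T] generated token ids.

    Tokens whose probability collapses once the image is removed are treated
    as perception-critical and receive high dependency scores.
    """
    logp_img = F.log_softmax(logits_with_image, dim=-1)
    logp_txt = F.log_softmax(logits_text_only, dim=-1)
    tok = response_ids.unsqueeze(-1)
    # Positive values => the token relied on the visual input.
    dep = (logp_img.gather(-1, tok) - logp_txt.gather(-1, tok)).squeeze(-1)
    return dep.clamp(min=0)
```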
#### Training Hyperparameters
- Base Model: Qwen2.5-VL-7B-Instruct
- Algorithm: VPPO
- Epochs: 2
- Learning Rate: 1e-6
- Rollout Batch Size: 384
- Max Response Length: 2048
- Entropy Penalty Coefficient: 0.06
- Gradient Filtering Ratio (k): 0.4
- Advantage Shaping Min (β_min): 0.9
- Training Regime: bf16 mixed precision
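
The last two hyperparameters are specific to VPPO. The sketch below shows one way they could plug into a GRPO-style update: normalized dependency scores rescale each token's advantage between β_min = 0.9 and 1.0, and only the top k = 40% most visually dependent tokens keep their gradients. This is an illustrative interpretation of the listed knobs, not the paper's exact update rule.

```python
import torch

def vppo_token_weights(dep: torch.Tensor,
                       advantage: float,
                       k: float = 0.4,
                       beta_min: float = 0.9):
    """Shape the sequence-level advantage and build a gradient mask.

    dep:        [T] per-token visual dependency scores (>= 0).
    advantage:  scalar group-relative advantage of this rollout (GRPO-style).
    k:          fraction of tokens whose gradients are kept (top-k by dep).
    beta_min:   lower bound of the advantage scaling factor.

    Hypothetical illustration: tokens with higher visual dependency get a
    scaling factor closer to 1.0; the rest are attenuated toward beta_min
    and, outside the top-k fraction, masked out of the policy update.
    """
    # Normalize dependency to [0, 1] within this response.
    dep_norm = (dep - dep.min()) / (dep.max() - dep.min() + 1e-8)

    # (a) Advantage shaping: interpolate between beta_min and 1.0.
    beta = beta_min + (1.0 - beta_min) * dep_norm
    shaped_advantage = advantage * beta                    # [T]

    # (b) Gradient filtering: keep only the top-k fraction of tokens.
    n_keep = max(1, int(k * dep.numel()))
    grad_mask = torch.zeros_like(dep)
    grad_mask[dep.topk(n_keep).indices] = 1.0

    return shaped_advantage, grad_mask
```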
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:
- Math & Geometry: Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- Logic: LogicVista
- Multi-discipline: MMMU-Pro
#### Metrics
Performance is measured by average accuracy@8, which is the average success rate over 8 independent generations per problem (at temperature=1.0) using exact-match scoring.
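
Concretely, average accuracy@8 reduces to the sketch below, where `is_exact_match` is a placeholder for each benchmark's answer-normalization and exact-match rule.

```python
def accuracy_at_k(samples_per_problem, references, is_exact_match, k=8):
    """Average accuracy@k: mean over problems of the fraction of the k
    sampled responses (temperature 1.0) that exactly match the reference."""
    per_problem = []
    for samples, ref in zip(samples_per_problem, references):
        assert len(samples) == k
        per_problem.append(sum(is_exact_match(s, ref) for s in samples) / k)
    return sum(per_problem) / len(per_problem)
```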
## Citation
If you use this model in your work, please cite our paper:
BibTeX:
```bibtex
@misc{huang2025spotlighttokenperceptionmultimodal,
      title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
      author={Siyuan Huang and Xiaoye Qu and Yafu Li and Yun Luo and Zefeng He and Daizong Liu and Yu Cheng},
      year={2025},
      eprint={2510.09285},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09285},
}
```
