Model Card for VPPO-7B

Model Details

Model Description

VPPO-7B is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B-parameter version of our model, fine-tuned from Qwen2.5-VL-7B-Instruct using a novel reinforcement learning algorithm called Visually-Perceptive Policy Optimization (VPPO).

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability.

As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks spanning mathematics, geometry, and logic. It also trains more stably and converges faster than standard RL fine-tuning.
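
A minimal inference sketch, assuming the standard Qwen2.5-VL interface in a recent version of transformers; the image path, prompt wording, and decoding parameters are illustrative, not prescribed by this card:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

model_id = "chamber111/VPPO-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Any local image of a reasoning problem; the path is a placeholder.
image = Image.open("problem.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image. Reason step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```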

Model Sources

  • Paper: Spotlight on Token Perception for Multimodal Reinforcement Learning (arXiv:2510.09285)
  • Base model: Qwen2.5-VL-7B-Instruct

Training Details

Training Data

The model was fine-tuned on ViRL39K, a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: TIGER-Lab/ViRL39K.
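
For reference, the dataset can be pulled directly from the Hub with the datasets library; the "train" split name below is an assumption, so check the dataset card for the actual configuration:

```python
from datasets import load_dataset

# ViRL39K multimodal reasoning problems; the split name is an assumption.
virl = load_dataset("TIGER-Lab/ViRL39K", split="train")
print(len(virl))
print(virl[0].keys())
```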

Training Procedure

The model was trained using our Visually-Perceptive Policy Optimization (VPPO) algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
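
The released code is the authoritative reference for VPPO. As a rough illustration only, the per-token weighting described above could look like the sketch below, where the visual-dependency scores (e.g., a per-token divergence between the policy's next-token distributions with and without the image), the normalization, and all names are assumptions rather than the exact VPPO formulation:

```python
import torch

def shape_and_filter_advantages(
    advantages: torch.Tensor,        # (T,) sequence-level advantage broadcast to each token
    visual_dependency: torch.Tensor, # (T,) per-token visual-dependency scores; higher = more image-reliant
    beta_min: float = 0.9,           # Advantage Shaping Min (β_min) from the table below
    k: float = 0.4,                  # Gradient Filtering Ratio (k) from the table below
) -> torch.Tensor:
    """Hypothetical sketch of VPPO-style token weighting, not the official implementation."""
    # Normalize dependency scores to [0, 1] within the response.
    s = (visual_dependency - visual_dependency.min()) / (
        visual_dependency.max() - visual_dependency.min() + 1e-8
    )

    # Advantage shaping: scale each token's advantage between beta_min and 1.0,
    # so visually grounded tokens keep the full learning signal.
    shaped = advantages * (beta_min + (1.0 - beta_min) * s)

    # Gradient filtering: keep only the top-k fraction of tokens by visual dependency,
    # masking out the rest so they contribute no policy gradient.
    num_keep = max(1, int(k * s.numel()))
    keep_idx = torch.topk(s, num_keep).indices
    mask = torch.zeros_like(s)
    mask[keep_idx] = 1.0

    return shaped * mask
```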

Training Hyperparameters

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Algorithm: VPPO
  • Epochs: 2
  • Learning Rate: 1e-6
  • Rollout Batch Size: 384
  • Max Response Length: 2048
  • Entropy Penalty Coefficient: 0.06
  • Gradient Filtering Ratio (k): 0.4
  • Advantage Shaping Min (β_min): 0.9
  • Training Regime: bf16 mixed precision

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:

  • Math & Geometry: Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
  • Logic: LogicVista
  • Multi-discipline: MMMU-Pro

Metrics

Performance is measured by average accuracy@8, which is the average success rate over 8 independent generations per problem (at temperature=1.0) using exact-match scoring.
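
In code, the metric reduces to exact-match scoring averaged over the 8 samples per problem, roughly as follows; the answer-extraction step (parsing the final answer out of the reasoning chain) is problem-specific and omitted here:

```python
def accuracy_at_k(generations: list[str], reference: str) -> float:
    """Average exact-match success rate over k sampled generations for one problem."""
    return sum(g.strip() == reference.strip() for g in generations) / len(generations)

# Example: 8 samples at temperature 1.0 for one problem.
samples = ["42", "42", "41", "42", "42", "42", "40", "42"]
print(accuracy_at_k(samples, "42"))  # 0.75
```

The benchmark score is then the mean of this per-problem rate over all problems in the benchmark.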

Citation

If you use this model in your work, please cite our paper:

BibTeX:

@article{huang2025spotlight,
  title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
  author={Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu},
  journal={arXiv preprint arXiv:2510.09285},
  year={2025}
}