DRPO-7B

This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on the agentica-org/DeepScaleR-Preview-Dataset.

It was fine-tuned as part of the paper DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization (https://arxiv.org/abs/2510.04474).

The code is available at: https://github.com/Optimization-AI/DRPO
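Below is a minimal usage sketch with Hugging Face Transformers. The prompt and sampling settings are illustrative assumptions, not values prescribed by the paper; the repository ID is the one for this model card.

```python
# Minimal sketch: load DRPO-7B and generate a reasoning trace.
# Assumes a recent `transformers` install. The prompt, dtype, and
# sampling settings below are illustrative, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ganglii/DRPO-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in F32; bf16 saves memory
    device_map="auto",
)

# Example math question (hypothetical prompt).
messages = [{"role": "user",
             "content": "What is the sum of the first 100 positive integers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=4096,
                         temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:],
                       skip_special_tokens=True))
```

`max_new_tokens` caps the length of the reasoning trace; adjust it to the difficulty of your problems.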

Below are comparisons with baseline models and baseline methods. The left panel shows results for fine-tuning the 1.5B model and the right panel for the 7B model. Grey lines represent base-model performance before fine-tuning, with generation lengths of 4698 tokens for the 1.5B model and 4119 tokens for the 7B model. Squares denote models trained with reference methods without length penalties (i.e., $\lambda=+\infty$ for DRPO, $\alpha=0$ for RLOO-LP, $\beta=0$ for ALP, $w=0$ for HAPO). Triangles denote models trained by other works.

[Comparison figure: DRPO vs. baseline models and methods, 1.5B setting (left) and 7B setting (right)]

Citation

```bibtex
@article{li2025drpo,
  title={DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization},
  author={Li, Gang and Chen, Yan and Lin, Ming and Yang, Tianbao},
  journal={arXiv preprint arXiv:2510.04474},
  year={2025}
}
```