# DRPO-7B
This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) on the [agentica-org/DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset).
It was fine-tuned as part of the paper [Efficient Reasoning via Decoupled Reward Policy Optimization](https://arxiv.org/abs/2510.04474).
The code is available at https://github.com/Optimization-AI/DRPO.
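
Below is a minimal sketch of loading the model for inference with the Hugging Face `transformers` library. The repo id `Optimization-AI/DRPO-7B` and the generation settings are illustrative assumptions, not values specified by the authors; substitute the id shown on this model page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; replace with the actual id of this model card.
model_id = "Optimization-AI/DRPO-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a 7B model on one GPU
    device_map="auto",
)

prompt = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling hyperparameters here are placeholders, not the paper's settings.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```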
Below are comparisons with baseline models and baseline methods. The left panel shows results for fine-tuning the 1.5B model and the right panel for fine-tuning the 7B model. Grey lines represent base-model performance before fine-tuning, with generation lengths of 4698 tokens for the 1.5B model and 4119 tokens for the 7B model. Squares denote models trained with the reference methods without length penalties (i.e., $\lambda = +\infty$ for DRPO, $\alpha = 0$ for RLOO-LP, $\beta = 0$ for ALP, $w = 0$ for HAPO). Triangles denote models trained by other works.
## Citation
```bibtex
@article{li2025drpo,
  title={DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization},
  author={Li, Gang and Chen, Yan and Lin, Ming and Yang, Tianbao},
  journal={arXiv preprint arXiv:2510.04474},
  year={2025}
}
```