# DRPO-7B
This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) on the [agentica-org/DeepScaleR-Preview-Dataset](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset).
It was fine-tuned as part of the paper [Efficient Reasoning via Decoupled Reward Policy Optimization](https://arxiv.org/abs/2510.04474).
The code is available at https://github.com/Optimization-AI/DRPO.
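
Below is a minimal sketch of loading the model for inference with the Hugging Face `transformers` library. The repo id `Optimization-AI/DRPO-7B` and the generation settings are illustrative assumptions, not values specified by the authors; substitute the id shown on this model page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; replace with the actual id of this model card.
model_id = "Optimization-AI/DRPO-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a 7B model on one GPU
    device_map="auto",
)

prompt = "What is the sum of the first 100 positive integers?"
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling hyperparameters here are placeholders, not the paper's settings.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```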
Below are comparisons with baseline models and baseline methods. The left panel shows results for fine-tuning the 1.5B model and the right panel for fine-tuning the 7B model. Grey lines represent base-model performance before fine-tuning, with generation lengths of 4698 tokens for the 1.5B model and 4119 tokens for the 7B model. Squares denote models trained with the reference methods without length penalties (i.e., $\lambda = +\infty$ for DRPO, $\alpha = 0$ for RLOO-LP, $\beta = 0$ for ALP, $w = 0$ for HAPO). Triangles denote models trained by other works.
## Citation
```bibtex
@article{li2025drpo,
  title={DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization},
  author={Li, Gang and Chen, Yan and Lin, Ming and Yang, Tianbao},
  journal={arXiv preprint arXiv:2510.04474},
  year={2025}
}
```