Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Paper: 2601.18734
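Reproduction of OPSD (On-Policy Self-Distillation) on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Each table compares the base model with the best OPSD, SFT, and GRPO results on AIME24, AIME25, and HMMT25.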

Qwen3-1.7B

| Method | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|
| Base | 47.2% | 35.3% | 21.9% |
| OPSD (best) | 49.2% | 37.5% | 24.4% |
| SFT (best) | 37.5% | 30.8% | 19.2% |
| GRPO (best) | 47.8% | 35.0% | 22.8% |

Qwen3-4B

| Method | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|
| Base | 71.1% | 60.0% | 38.6% |
| OPSD (best) | 62.2% | 57.2% | 34.2% |
| SFT (best) | 62.5% | 58.1% | 33.3% |
| GRPO (best) | 68.9% | 65.0% | 41.9% |

Qwen3-8B

| Method | AIME24 | AIME25 | HMMT25 |
|---|---|---|---|
| Base | 72.8% | 61.7% | 38.6% |
| OPSD (best) | 69.4% | 63.3% | 38.6% |
| SFT (best) | 69.2% | 60.3% | 36.1% |
| GRPO (best) | 72.2% | 65.8% | 40.8% |
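
To make the comparison against the base model easier to read, the scores can be turned into percentage-point differences. The snippet below is a minimal sketch, not part of the original reproduction: the names `TABLES` and `deltas_vs_base` are hypothetical, and the numbers are copied by hand from the three tables above, in the order they appear.

```python
# Scores (%) copied from the three tables above, in table order.
# Each tuple is (AIME24, AIME25, HMMT25).
BENCHMARKS = ("AIME24", "AIME25", "HMMT25")

TABLES = [
    {
        "Base": (47.2, 35.3, 21.9),
        "OPSD": (49.2, 37.5, 24.4),
        "SFT": (37.5, 30.8, 19.2),
        "GRPO": (47.8, 35.0, 22.8),
    },
    {
        "Base": (71.1, 60.0, 38.6),
        "OPSD": (62.2, 57.2, 34.2),
        "SFT": (62.5, 58.1, 33.3),
        "GRPO": (68.9, 65.0, 41.9),
    },
    {
        "Base": (72.8, 61.7, 38.6),
        "OPSD": (69.4, 63.3, 38.6),
        "SFT": (69.2, 60.3, 36.1),
        "GRPO": (72.2, 65.8, 40.8),
    },
]


def deltas_vs_base(table):
    """Return each method's score minus the Base score, in percentage points."""
    base = table["Base"]
    return {
        method: tuple(round(s - b, 1) for s, b in zip(scores, base))
        for method, scores in table.items()
        if method != "Base"
    }


if __name__ == "__main__":
    for index, table in enumerate(TABLES, start=1):
        print(f"Table {index}")
        for method, diffs in deltas_vs_base(table).items():
            cells = ", ".join(f"{b} {d:+.1f}" for b, d in zip(BENCHMARKS, diffs))
            print(f"  {method}: {cells}")
```

For the first table this gives OPSD differences of +2.0, +2.2, and +2.5 points over the base model, while for the second table the OPSD differences are all negative.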