---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# ExGRPO: Learning to Reason from Experience

This repository hosts `ExGRPO-Llama3.1-8B-Zero`, a large language model trained from Llama3.1-8B with the method presented in the paper [ExGRPO: Learning to Reason from Experience](https://huggingface.co/papers/2510.02245).

## Abstract

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm
for improving the reasoning ability of large language models. However, standard
on-policy training discards rollout experiences after a single update, leading
to computational inefficiency and instability. While prior work on RL has
highlighted the benefits of reusing past experience, the role of experience
characteristics in shaping learning dynamics of large reasoning models remains
underexplored. In this paper, we are the first to investigate what makes a
reasoning experience valuable and identify rollout correctness and entropy as
effective indicators of experience value. Based on these insights, we propose
ExGRPO (Experiential Group Relative Policy Optimization), a framework that
organizes and prioritizes valuable experiences, and employs a mixed-policy
objective to balance exploration with experience exploitation. Experiments on
five backbone models (1.5B-8B parameters) show that ExGRPO consistently
improves reasoning performance on mathematical/general benchmarks, with an
average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO
stabilizes training on both stronger and weaker models where on-policy methods
fail. These results highlight principled experience management as a key
ingredient for efficient and scalable RLVR.

## Introduction to ExGRPO

Existing RLVR methods for reasoning tasks predominantly rely on on-policy optimization, which discards online rollouts after a single update, wasting valuable exploration signals and constraining scalability.
We conduct a systematic analysis of experience utility in RLVR and identify rollout correctness (an online proxy for question difficulty) and trajectory entropy as effective indicators of experience quality.
Building on these insights, we propose *ExGRPO*, a novel framework that **strategically manages and replays high-value experiences** through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.

### Key Highlights
- **Experience Value Modeling**: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experience (see the sketch after this list).
- **ExGRPO Framework**: Built on top of GRPO, ExGRPO introduces a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past exploration.
- **Generalization and Stability**: Demonstrates broad applicability across different backbone models and mitigates training collapse of on-policy RLVR in challenging scenarios.
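
To make the two proxy metrics and the bucketed prioritization concrete, here is a minimal, illustrative Python sketch. This is not the authors' implementation: the function names, the entropy computation over per-step vocabulary distributions, and the bucket edges are all assumptions made for exposition.

```python
import math

def rollout_correctness(rewards):
    """Fraction of a question's rollouts that earned a positive verifiable
    reward; used here as an online proxy for question difficulty."""
    return sum(r > 0 for r in rewards) / len(rewards)

def mean_token_entropy(token_distributions):
    """Average per-token entropy of a trajectory. Each element of
    `token_distributions` is the model's probability distribution over
    the vocabulary at one decoding step."""
    entropy_sum = 0.0
    for dist in token_distributions:
        entropy_sum += -sum(p * math.log(p) for p in dist if p > 0)
    return entropy_sum / len(token_distributions)

def correctness_bucket(correctness, edges=(0.25, 0.5, 0.75)):
    """Illustrative bucketed prioritization: group experiences by rollout
    correctness so the most valuable ones can be replayed first. The bucket
    edges here are assumptions, not the paper's configuration."""
    for i, edge in enumerate(edges):
        if correctness < edge:
            return i
    return len(edges)
```

In ExGRPO, scores of this kind determine which past experiences are stored, prioritized, and replayed under the mixed-policy objective.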

## Repository and Project Page

*   **GitHub Repository**: [https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO)
*   **Hugging Face Collection**: [https://huggingface.co/collections/rzzhan/exgrpo-68d8e302efdfe325187d5c96](https://huggingface.co/collections/rzzhan/exgrpo-68d8e302efdfe325187d5c96)

## Released Models

| **Model**                          | **Hugging Face** |  **Base Model** |
|-----------------------------------|------------------|------------------|
| ExGRPO-Qwen2.5-Math-7B-Zero | [rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero](https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero) | Qwen2.5-Math-7B |
| ExGRPO-LUFFY-7B-Continual | [rzzhan/ExGRPO-LUFFY-7B-Continual](https://huggingface.co/rzzhan/ExGRPO-LUFFY-7B-Continual) | LUFFY-Qwen-Math-7B-Zero |
| ExGRPO-Qwen2.5-7B-Instruct | [rzzhan/ExGRPO-Qwen2.5-7B-Instruct](https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct) | Qwen2.5-7B-Instruct |
| ExGRPO-Qwen2.5-Math-1.5B-Zero | [rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero](https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero) | Qwen2.5-Math-1.5B |
| ExGRPO-Llama3.1-8B-Zero | [rzzhan/ExGRPO-Llama3.1-8B-Zero](https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Zero) | Llama3.1-8B |
| ExGRPO-Llama3.1-8B-Instruct | [rzzhan/ExGRPO-Llama3.1-8B-Instruct](https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Instruct) | Llama3.1-8B-Instruct |
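
## Usage

Per the model card metadata, the model loads with the `transformers` text-generation pipeline. A minimal sketch follows; the prompt and sampling settings are illustrative, not tuned values from the paper.

```python
from transformers import pipeline

# Load ExGRPO-Llama3.1-8B-Zero as a standard text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="rzzhan/ExGRPO-Llama3.1-8B-Zero",
    torch_dtype="auto",
    device_map="auto",
)

# Illustrative math prompt; sampling parameters are assumptions.
prompt = "Solve step by step: if 3x + 5 = 20, what is x?"
result = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.6)
print(result[0]["generated_text"])
```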

## Citation

If you find our model, data, or evaluation code useful, please cite our paper:

```bibtex
@article{zhan2025exgrpo,
  title   = {ExGRPO: Learning to Reason from Experience},
  author  = {Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year    = {2025},
  journal = {ArXiv preprint},
  volume  = {2510.02245},
  url     = {https://arxiv.org/abs/2510.02245},
}
```