---
title: ReTool Implementation
emoji: πŸ”§
colorFrom: blue
colorTo: purple
sdk: static
app_file: README.md
pinned: false
license: mit
tags:
- reinforcement-learning
- tool-use
- code-interpreter
- mathematical-reasoning
- rl-training
- ppo
- research-implementation
language: en
library_name: transformers
---


# ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

A PyTorch implementation of **ReTool** from the paper ["ReTool: Reinforcement Learning for Strategic Tool Use in LLMs"](https://arxiv.org/abs/2504.11536) by Feng et al. (2025).

ReTool enhances long-form reasoning by integrating code interpreter execution into the RL training loop, enabling models to learn when and how to invoke computational tools for mathematical problem solving.

<div align="center">
  <img src="assets/retool_rollout_process.png" alt="ReTool Rollout Process" width="80%">
  <p><em>Figure 2: Comparison of standard text-based RL vs ReTool's code-integrated training process</em></p>
</div>

## πŸš€ Key Features

- **Multi-turn Generation**: Dynamic code execution during reasoning with KV-cache optimization
- **Strategic Tool Use**: Learns when and how to invoke code interpreters through RL
- **Interpreter Masking**: Excludes external tool outputs from gradient computation
- **HuggingFace Integration**: Built on HuggingFace Transformers with batched generation and hooks for distributed training


## πŸ“Š Performance

<div align="center">
  <img src="assets/aime_results.png" alt="AIME Results" width="70%">
  <p><em>Figure 1: ReTool achieves 67% accuracy on AIME 2024, significantly outperforming text-based RL (40%)</em></p>
</div>

## πŸ› οΈ Installation

```bash
git clone https://github.com/yourusername/retool-implementation.git
cd retool-implementation/src
pip install -r requirements.txt
```

## 🚧 Current Status

**This is a research implementation based on the ReTool paper.** The core components are implemented but not yet fully tested. 

### What's Implemented βœ…
- Multi-turn generation with KV-cache optimization
- Interpreter token masking for RL training
- Modified PPO loss computation
- Complete training pipeline structure
- Proper tensor handling and batching

### What Needs Testing/Integration πŸ”§
- End-to-end training verification
- Code execution sandbox integration  
- Edge case handling for truncated sequences
- Memory optimization for large models

### For Researchers & Developers

This implementation serves as a foundation for:
- Understanding ReTool's architecture
- Building upon the multi-turn generation approach
- Integrating custom code execution environments
- Extending to other tool-use scenarios

## πŸ“Š Dataset Format

Your dataset should contain dictionaries with:

```python
{
    "prompt": "Solve this math problem: ...",
    "answer": "42"  # Ground truth for reward computation
}
```
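
A list of such dictionaries can be wrapped directly into a Hugging Face `Dataset` and passed as `train_dataset`. A minimal sketch follows; the two example problems are made up for illustration:

```python
from datasets import Dataset

# Toy examples in the format shown above (hypothetical problems)
examples = [
    {"prompt": "Solve this math problem: What is 17 * 24?", "answer": "408"},
    {"prompt": "Solve this math problem: Find gcd(48, 36).", "answer": "12"},
]

# Dataset.from_list builds an in-memory dataset from a list of dicts
train_dataset = Dataset.from_list(examples)
print(train_dataset)  # features: ['prompt', 'answer'], num_rows: 2
```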

## πŸ” How It Works

1. **Multi-turn Generation**: Model generates reasoning step-by-step
2. **Code Detection**: When `</code>` is generated, extract and execute code
3. **Tool Integration**: Append `<interpreter>result</interpreter>` to context
4. **Continued Reasoning**: Model continues with tool feedback
5. **Reward Computation**: Binary reward based on final answer correctness
6. **RL Training**: PPO updates exclude interpreter tokens from the loss (a condensed sketch of steps 1–4 follows below)
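
As a rough illustration of steps 1–4, here is a minimal sketch of the rollout loop. It is not the trainer's actual implementation: `execute_code` is a placeholder for whatever sandbox you integrate, the `<code>`/`<interpreter>` tags follow the steps above, stopping at `</code>` would normally use a stopping criterion rather than post-hoc parsing, and the real `_retool_generate_with_interpreter()` reuses the KV cache across turns instead of re-encoding the full context.

```python
import re
import torch

def generate_with_interpreter(model, tokenizer, prompt, execute_code, max_turns=10):
    """Illustrative multi-turn loop: generate until the model emits </code>,
    run the code, append <interpreter>...</interpreter>, and continue reasoning."""
    context = prompt
    for _ in range(max_turns):
        inputs = tokenizer(context, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(
                **inputs,
                max_new_tokens=1024,
                do_sample=True,
                temperature=0.7,
            )
        # Decode only the newly generated tokens for this turn
        new_text = tokenizer.decode(
            output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        context += new_text

        # Step 2: code detection -- take the last <code>...</code> block, if any
        blocks = re.findall(r"<code>(.*?)</code>", new_text, flags=re.DOTALL)
        if not blocks:
            break  # no tool call in this turn: treat it as the final answer
        result = execute_code(blocks[-1])

        # Step 3: tool integration -- feed the execution result back as context
        context += f"<interpreter>{result}</interpreter>"
    return context
```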

## βš™οΈ Key Components

### ReToolTrainer Class

- `_retool_generate_with_interpreter()`: Multi-turn generation with tool execution
- `_create_interpreter_mask()`: Creates masks for excluding tool outputs
- `_compute_loss()`: Modified PPO loss with interpreter masking
- `_compute_rewards_and_advantages()`: Binary reward computation
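
For intuition, a simplified version of the masking and reward logic might look like the following. This is a sketch of the idea, not the actual method bodies: `interpreter_spans` is assumed to be the (start, end) token positions of tool outputs recorded during generation, and the loss is a standard clipped PPO term averaged only over model-generated tokens.

```python
import torch

def create_interpreter_mask(seq_len, interpreter_spans):
    """1 = model-generated token (trained on), 0 = interpreter output (excluded).
    `interpreter_spans` is a list of (start, end) token positions of tool output."""
    mask = torch.ones(seq_len)
    for start, end in interpreter_spans:
        mask[start:end] = 0.0
    return mask

def masked_ppo_loss(logprobs, old_logprobs, advantages, interpreter_mask, clip_eps=0.2):
    """Clipped PPO objective averaged only over non-interpreter tokens."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    return (per_token * interpreter_mask).sum() / interpreter_mask.sum().clamp(min=1)

def binary_reward(final_answer, ground_truth):
    """Reward 1.0 only if the extracted final answer matches the label."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0
```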

### Configuration Options

```python
trainer = ReToolTrainer(
    # ... model and data ...
    max_turns=10,              # Maximum reasoning turns
    temperature=0.7,           # Generation temperature
    max_completion_length=1024, # Max tokens per turn
    mask_truncated_completions=True,  # Handle incomplete sequences
)
```

## πŸ’‘ Usage Example (Conceptual)

```python
from retool_trainer import ReToolTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# This shows the intended API - full testing in progress
trainer = ReToolTrainer(
    model=AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct"),
    processing_class=AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct"),
    args=TrainingArguments(...),
    train_dataset=your_math_dataset,
    max_turns=10,
)

# trainer.train()  # Full integration testing in progress
```

## πŸ“ˆ Results From Paper

- **AIME 2024**: 67% accuracy (vs 40% text-based RL)
- **AIME 2025**: 49.3% accuracy (vs 36.7% text-based RL)
- **Efficiency**: Converges in 400 steps vs 1080 for baseline
- **Token Efficiency**: 40% reduction in response length

## 🚧 Limitations & TODOs

- [ ] Code execution sandbox integration
- [ ] Support for multiple reward functions
- [ ] Advanced error handling for malformed code
- [ ] Distributed training optimizations
- [ ] Tool selection beyond code interpreter
- [ ] [June 2, 2025 update] Add DAPO trainer




## πŸ“š Citation

```bibtex
@article{feng2025retool,
  title={ReTool: Reinforcement Learning for Strategic Tool Use in LLMs},
  author={Feng, Jiazhan and Huang, Shijue and Qu, Xingwei and Zhang, Ge and Qin, Yujia and Zhong, Baoquan and Jiang, Chengquan and Chi, Jinxin and Zhong, Wanjun},
  journal={arXiv preprint arXiv:2504.11536},
  year={2025}
}
```

## πŸ“„ License

MIT License - see [LICENSE](LICENSE) file for details.

## 🤝 Collaboration Welcome
Looking for teammates with complementary skills:
- **Systems engineers**: Distributed sandbox architecture with load balancing
- **Compute sponsors**: Academic institutions or cloud providers for training runs
- **Experimenters**: End-to-end validation and benchmarking on mathematical reasoning tasks


## πŸ™ Acknowledgments

- Original paper authors for the ReTool framework
- HuggingFace team for the transformers library
- TRL team for GRPO implementation patterns

---

<div align="center">
  <strong>Built with ❀️ for advancing AI reasoning capabilities</strong>
</div>