---

license: mit
---

# CodeFu-7B-v0.1

CodeFu-7B-v0.1 is a 7B-parameter model trained with reinforcement learning for competitive programming tasks. Built on the DeepSeek-R1-Distill-Qwen-7B base model, CodeFu is capable of the algorithmic reasoning needed to solve complex problems and generate efficient C++ solutions.

Specifically, CodeFu-7B-v0.1 achieves **13.7% Pass@1** on the [USACO benchmark](https://princeton-nlp.github.io/USACOBench/), outperforming models more than 4x its size.

Trained solely on problem statements, without access to any ground-truth solutions, CodeFu achieved a >10x performance improvement over its base model, demonstrating the effectiveness of our RL approach.


## Model Specs

- **Base Model**: [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- **Model Size**: 7.61B parameters  
- **License**: MIT
- **Task**: Competitive Programming / Algorithmic Problem Solving

Starting from the [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) model, we trained CodeFu using RL on selected Competitive Programming problems (without solutions) from the [DeepMind CodeContest](https://huggingface.co/datasets/deepmind/code_contests) dataset.


## Evaluation

To assess CodeFu's genuine problem-solving abilities, we used the [USACO benchmark](https://princeton-nlp.github.io/USACOBench/), which consists of 307 high-quality problems from past [USA Computing Olympiad](https://usaco.org/) contests.

| Model | Size | USACO Pass@1 | Notes |
|-------|------|-------------:|-------|
| Claude-3.7-Sonnet | UNK | 31.9 |  | 
| [OlympicCoder-32B](https://huggingface.co/open-r1/OlympicCoder-32B) | 32B | 18.9 | |
| [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | 32B | 17.3 | |
| [Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) | 32B | 16.3| |
| **CodeFu-7B-v0.1** | **7B** | **13.7** |  |
| [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | 32B | 11.7 | |
| [OlympicCoder-7B](https://huggingface.co/open-r1/OlympicCoder-7B) | 7B | 9.1 | |
| GPT-4-1106-preview | UNK | 8.7 | |
| [Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) | 7B | 5.9 | |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 7B | 1.0 | *Base model* |
| GPT-3.5-turbo-1106 | UNK | 0.6 | |

**CodeFu Key Highlights:**
- 📊 **Leading 7B model** on the USACO benchmark
- 🏆 **Outperforms the 32B model of its base family** (13.7% vs 11.7% Pass@1)
- 📈 **>10x improvement** over its 7B base model (13.7% vs 1.0%)

For systematic and robust evaluation, we used standardized code extraction logic across all model responses. This process identifies solution code by parsing either `<code></code>` tags or ```cpp code blocks, always selecting the final code block to ensure we capture each model's ultimate solution after any intermediate reasoning steps. GPT-3.5/4 scores are copied from the [USACO benchmark](https://princeton-nlp.github.io/USACOBench/) as baselines.
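
For illustration, the extraction step can be sketched as follows (a minimal example, not the exact script used in the evaluation; the function name and regexes here are ours):

```python
import re

# Candidate formats for solution code in a model response
CODE_PATTERNS = [
    re.compile(r"<code>(.*?)</code>", re.DOTALL),
    re.compile(r"```cpp\s*(.*?)```", re.DOTALL),
]

def extract_final_code(response: str):
    """Return the code block that appears last in the response, or None."""
    candidates = []  # (start offset, code text)
    for pattern in CODE_PATTERNS:
        for match in pattern.finditer(response):
            candidates.append((match.start(), match.group(1).strip()))
    if not candidates:
        return None  # corresponds to the "No code" status in the results below
    # Keep the block that starts last, i.e. the model's final answer
    return max(candidates, key=lambda c: c[0])[1]
```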



All extracted code solutions are executed with **strict time limit enforcement** - any code exceeding the problem's specified time limit is marked as incorrect, ensuring realistic competitive programming conditions.



All open-weight models were tested using [vLLM](https://github.com/vllm-project/vllm) v0.6.3 with identical sampling parameters: a `temperature` of 0.8 and a `top_p` of 0.95. Claude-3.7-Sonnet was evaluated at a `temperature` of 1.0. We set the maximum output length (`max_tokens`) to 28,672 for all models to ensure sufficient length for reasoning and code solutions. 



### Result analysis



We provide access to the complete CodeFu-7B-v0.1 evaluation results on the USACO benchmark as a [CSV file](codefu-7b-v0.1_usaco.csv.tgz) containing fields such as `problem_name`, `prompt`, `response`, `response_length`, `solution_code`, `status`, and `score`. Notably, the `status` field breaks down as follows:

- Success: 42 cases
- Failure (code runs but is incorrect or times out): 37 cases
- Fail to compile: 8 cases
- No code: 220 cases



Analysis of the response length distribution shows that successful solutions are typically concise, around 5,000 tokens, while unsuccessful attempts often reach the maximum token limit. Some correct solutions do exceed 20,000 tokens, but the vast majority of long responses fall into the "No code" category, where the model engages in extensive reasoning that eventually degenerates into repetitive patterns or incoherent text without producing executable code. Future work is needed on training objectives that better distinguish useful deliberation from unproductive verbosity.
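
For reference, the counts above can be reproduced from the released file with a small pandas sketch (it assumes the archive has been extracted and that the `status` labels match the list above):

```python
import pandas as pd

# Extract the archive first, e.g.: tar -xzf codefu-7b-v0.1_usaco.csv.tgz
df = pd.read_csv("codefu-7b-v0.1_usaco.csv")

# Status breakdown: Success / Failure / Fail to compile / No code
print(df["status"].value_counts())

# Response-length distribution of successful vs. unsuccessful attempts
print(df.groupby(df["status"] == "Success")["response_length"].describe())
```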



## Usage



```python
# CodeFu works with vLLM for inference
# pip install vllm==0.6.3

from vllm import LLM, SamplingParams

model_name = "aws-prototyping/codefu-7b-v0.1"

# Initialize vLLM
llm = LLM(model=model_name, trust_remote_code=True)
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=28672,
)

# The `Hay Bales` problem in USA Computing Olympiad benchmark
prompt = """In your role as an algorithmic problem-solver, write a C++ solution for this problem. Put your thought process in <think> tags and your solution in <code> tags.

Problem:
Problem 1: Hay Bales [Brian Dean, 2011]

The cows are at it again!  Farmer John has carefully arranged N (1 <= N <=
10,000) piles of hay bales, each of the same height.  When he isn't
looking, however, the cows move some of the hay bales between piles, so
their heights are no longer necessarily the same.  Given the new heights of
all the piles, please help Farmer John determine the minimum number of hay
bales he needs to move in order to restore all the piles to their original,
equal heights.

PROBLEM NAME: haybales

INPUT FORMAT:

* Line 1: The number of piles, N (1 <= N <= 10,000).

* Lines 2..1+N: Each line contains the number of hay bales in a single
        pile (an integer in the range 1...10,000).

SAMPLE INPUT:

4
2
10
7
1

INPUT DETAILS:

There are 4 piles, of heights 2, 10, 7, and 1.

OUTPUT FORMAT:

* Line 1: An integer giving the minimum number of hay bales that need
        to be moved to restore the piles to having equal heights.

SAMPLE OUTPUT:

7

OUTPUT DETAILS:

By moving 7 hay bales (3 from pile 2 to pile 1, 2 from pile 2 to pile 4, 2
from pile 3 to pile 4), we can make all piles have height 5.
"""

# Generate solution
outputs = llm.generate([prompt], sampling_params)
solution = outputs[0].outputs[0].text
print(solution)

# Alternative: OpenAI-compatible API server
# Start vLLM server first:
# python -m vllm.entrypoints.openai.api_server --model aws-prototyping/codefu-7b-v0.1 --port 8000

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.completions.create(
    model="aws-prototyping/codefu-7b-v0.1",
    prompt=prompt,
    temperature=0.8,
    top_p=0.95,
    max_tokens=28672,
)

solution = response.choices[0].text
print(solution)
```
We can examine CodeFu's generated [solution](example_response_hay_bales.txt) for this problem, which has been verified as correct.

## Prompt Format

CodeFu works best with structured prompts that request both reasoning and code:
```
[Role] Please solve this programming problem in C++. Show your thinking process in <think> tags and provide your solution in <code> tags.

[Problem Description]
```

Replace `[Role]` with phrases like:
- "As a competitive programming expert"
- "Working as an experienced competitive programmer" 
- "As a master of algorithms and data structures"


## CodeFu Training Pipeline
![train_arch](codefu_arch_01_small.png)

*Figure 1 - CodeFu training pipeline*

We trained CodeFu using the [veRL](https://github.com/volcengine/verl) library. Ray orchestrates the distributed execution and synchronization of vLLM rollout, reward evaluation (code compilation and execution), FSDP model parallelism, and Ulysses sequence parallelism. We set the sequence-parallelism degree to 4 to accommodate long-form reasoning and code generation.
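
To give a flavor of the reward-evaluation side, here is a schematic of how solution scoring can be fanned out with Ray (a simplified sketch; the compiler flags, time limit, and helper layout are illustrative rather than our exact workers):

```python
import subprocess
import tempfile

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def score_solution(source_code: str, test_cases: list, time_limit_s: float = 5.0) -> float:
    """Compile one extracted C++ solution and return its test-case pass ratio."""
    with tempfile.TemporaryDirectory() as tmp:
        src, binary = f"{tmp}/sol.cpp", f"{tmp}/sol"
        with open(src, "w") as f:
            f.write(source_code)
        compile_proc = subprocess.run(["g++", "-O2", "-o", binary, src], capture_output=True)
        if compile_proc.returncode != 0:
            return 0.0  # compilation failure
        passed = 0
        for test_input, expected in test_cases:
            try:
                run = subprocess.run([binary], input=test_input, capture_output=True,
                                     text=True, timeout=time_limit_s)
                passed += int(run.returncode == 0 and run.stdout.strip() == expected.strip())
            except subprocess.TimeoutExpired:
                pass  # exceeding the time limit counts as a failed test
        return passed / max(len(test_cases), 1)

# One remote task per rollout, so compilation and execution overlap across the cluster:
# futures = [score_solution.remote(code, tests) for code, tests in extracted_solutions]
# pass_ratios = ray.get(futures)
```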

We extended the [TinyZero](https://github.com/Jiayi-Pan/TinyZero) code repository by using Ray to manage and distribute reward function calculation. This enables parallel C++ code compilation and evaluation across the same cluster to address the compute-intensive and latency-bound nature of code execution. The entire pipeline is executed as a SageMaker training job running on `ml.p4de.24xlarge` instances. The training pipeline consists of the following steps as shown in Figure 1:

1. **Rollout**: Coding problem prompts are fed into the vLLM inference engine for rolling out potential solutions
2. **Response Generation**: vLLM generates multiple responses (reasoning + code) for each prompt
3. **Code Execution**: Code solutions are extracted from responses, and are compiled and executed by distributed workers (compilers and runtime) managed by Ray
4. **Reward Calculation**: Execution outcomes are used to calculate rewards (i.e., test-case pass ratios), and advantages are computed using group-relative baselines (see the sketch after this list)
5. **Policy Update**: The **Actor** uses advantages and token probabilities to compute the PPO loss, which is then used to update CodeFu's parameters through gradient descent
6. **Iteration**: The process repeats with batches of prompt-response-reward cycles, with Ray managing the distributed sampling, execution, and training synchronization across the pipeline
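
The group-relative baseline in step 4 can be sketched as follows (a simplified per-prompt normalization; details such as the standard-deviation term may differ from the veRL implementation we used):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantages for one prompt's group of rollouts.

    `rewards` has shape (num_rollouts,): the test-case pass ratios of all
    solutions sampled for the same problem. Each rollout is scored against the
    group mean, so no separate critic network is needed for the baseline.
    """
    baseline = rewards.mean()
    return (rewards - baseline) / (rewards.std() + eps)

# Example: 5 rollouts of one problem with pass ratios from the reward step
print(group_relative_advantages(torch.tensor([0.0, 0.25, 1.0, 0.25, 0.0])))
```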

## CodeFu Training Approach
CodeFu employs a 2-stage curriculum learning approach:


| Stage | Data | Max resp. tokens | Batch size | Mini-batch size | # of rollouts | Reward | Focus | # of nodes |
|-------|------|------------------|------------|-----------------|---------------|--------|-------|------------|
| **1** | *easy* problems | 28K | 8 | 8 | 5 | Exponential smoothing w/ public scores | Basic algorithmic reasoning | 1 |
| **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear w/o public scores | Quality and robustness | 4 |


### Reward ###

**Stage 1 (Exponential smoothing with public scores)** - Uses exponential smoothing and both public and private test cases for clearer learning signals on easier problems.

**Stage 2 (Linear without public scores)** - Shifts to linear rewards using only private test cases to encourage robust problem-solving on harder problems.

Here is the pseudocode for the reward calculation across both training stages:
```python
def compute_reward(code_output, public_tests, private_tests, stage):
    # Handle execution failures (same for both stages)
    if not is_executable(code_output):
        return -1

    if compilation_failed(code_output) or exceeds_time_limit(code_output):
        return 0

    # Stage-specific reward calculation for successful execution
    if stage == 1:
        # Exponential smoothing with public + private tests
        passed_public = count_passed(code_output, public_tests)
        passed_private = count_passed(code_output, private_tests)
        total_tests = len(public_tests) + len(private_tests)
        passed_tests = passed_public + passed_private

        pass_ratio = passed_tests / total_tests
        reward = pass_ratio ** 1.5

    elif stage == 2:
        # Linear reward with private tests only
        passed_private = count_passed(code_output, private_tests)
        total_private = len(private_tests)

        reward = passed_private / total_private

    return reward
```

### Data Selection ### 
Training data is sourced from CodeForces problems within the [DeepMind CodeContest](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. We trained on both the *Easy* and *Hard* datasets for approximately 2 epochs.
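
A sketch of this selection with the `datasets` library (the `cf_rating` field follows the CodeContests schema; the exact filters, splits, and deduplication we applied may differ):

```python
from datasets import load_dataset

ds = load_dataset("deepmind/code_contests", split="train")

# Keep CodeForces-rated problems only; unrated problems have cf_rating == 0
easy = ds.filter(lambda ex: 800 <= ex["cf_rating"] <= 1000)    # Stage 1
hard = ds.filter(lambda ex: 1100 <= ex["cf_rating"] <= 2200)   # Stage 2

print(f"easy: {len(easy)} problems, hard: {len(hard)} problems")
```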


### Training Stability ###
We encountered a response length collapse issue when training CodeFu on the *hard problems* dataset - the mean response length would plunge significantly after some period of learning, causing a catastrophic drop in training rewards as shown in Figure 2.

![Mean Response Length](resp_len_bsz_8_issue.png)

*Figure 2 - Mean response length and reward collapsed with a small batch size*

Despite attempts to mitigate this with larger batch sizes (up to 416, as shown in Figure 3), more rollout samples (up to 32), a plain PPO implementation with a co-trained critic model, and KL-coefficient adjustments, the collapse persisted or reappeared after periods of stability.

![Mean Reward](resp_len_bsz_416_issue.png)

*Figure 3 - Mean response length and reward plummet despite the much larger batch size*

We eventually resolved this issue by using a true off-policy PPO learning configuration: we made the mini-batch size at least 4x smaller than the global batch size, so that each rollout batch yields multiple clipped "mini" updates, as in the original [PPO algorithm](https://en.wikipedia.org/wiki/Proximal_policy_optimization). This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to train for multiple epochs on the dataset.

![Mean Reward](resp_len_bsz_256_fix.png) 

*Figure 4 - Mean response length and reward do not collapse during true off-policy learning*
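
To illustrate what these "mini" updates look like, here is a schematic PPO loop (a sketch only; `policy_log_prob_fn` and the sample dictionaries stand in for veRL internals, and the sizes mirror the Stage-2 setting of a 256 batch with mini-batches of 32):

```python
import torch

def ppo_mini_batch_updates(policy_log_prob_fn, optimizer, rollout_batch,
                           mini_batch_size=32, clip_eps=0.2):
    """Apply several clipped PPO updates from a single rollout batch.

    Each sample in `rollout_batch` carries a detached `old_log_prob` (from the
    rollout-time policy) and a group-relative `advantage`; `policy_log_prob_fn`
    re-scores a mini-batch under the current policy with gradients enabled.
    """
    for start in range(0, len(rollout_batch), mini_batch_size):
        mini = rollout_batch[start:start + mini_batch_size]
        old_log_probs = torch.stack([s["old_log_prob"] for s in mini])
        advantages = torch.stack([s["advantage"] for s in mini])
        new_log_probs = policy_log_prob_fn(mini)

        # Clipped surrogate objective: after the first mini-update the policy has
        # moved, so later mini-batches are off-policy with respect to the rollout
        # distribution, and the clipping term bounds each update.
        ratio = torch.exp(new_log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```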

A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods like [DAPO](https://dapo-sia.github.io/), [OPO](https://verl.readthedocs.io/en/latest/algo/opo.html), [Dr.GRPO](https://github.com/sail-sg/understand-r1-zero), and [GSPO](https://qwenlm.github.io/blog/gspo/).


## Citation

CodeFu is developed by the **AWS WWSO Prototyping** team. If you find CodeFu helpful, feel free to cite us:

```bibtex
@misc{codefu2025,
  title={aws-prototyping/codefu-7b-v0.1},
  author={Wu, Chen and Song, Yin and Passenger, Josh},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/aws-prototyping/codefu-7b-v0.1},
  version={0.1}
}
```