CodeScaler
We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.
This is the official CodeScaler-1.7B model, trained from Skywork/Skywork-Reward-V2-Qwen3-1.7B on LARK-Lab/CodeScalerPair-51K.
| Model | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg |
|---|---|---|---|---|---|---|---|---|
| Skywork/Skywork-Reward-Llama-3.1-8B | 54.5 | 69.5 | 60.6 | 95.7 | 89.0 | 74.7 | 46.6 | 70.1 |
| TIGER-Lab/AceCodeRM-7B | 66.9 | 66.7 | 65.3 | 89.9 | 79.9 | 74.4 | 62.2 | 72.2 |
| TIGER-Lab/AceCoder-RM-32B | 72.1 | 73.7 | 70.5 | 88.0 | 84.5 | 78.3 | 65.5 | 76.1 |
| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9 | 92.8 | 82.3 | 54.5 | 76.6 |
| Skywork/Skywork-Reward-V2-Qwen3-4B | 74.4 | 78.2 | 73.6 | 95.7 | 92.1 | 85.0 | 64.4 | 80.5 |
| Skywork/Skywork-Reward-V2-Qwen3-8B | 73.6 | 80.6 | 75.0 | 96.5 | 91.8 | 85.5 | 67.0 | 80.5 |
| CodeScaler-1.7B (this model) | 73.1 | 74.4 | 74.7 | 93.1 | 91.7 | 83.2 | 61.5 | 78.8 |
| CodeScaler-4B | 76.3 | 80.4 | 79.0 | 95.8 | 92.9 | 86.5 | 69.2 | 82.9 |
| CodeScaler-8B | 76.9 | 83.0 | 79.9 | 96.4 | 92.5 | 87.9 | 71.8 | 84.1 |
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = 'LARK-Lab/CodeScaler-1.7B'
tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()
question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2
Output:
2
```
"""
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: counts subarrays starting at index 0
    for num in nums:
        prefix += num
        if prefix - k in freq:
            count += freq[prefix - k]
        freq[prefix] += 1
    return count
"""
program_wrong = """\
def subarraySum(nums, k):
    # Incorrect: the sliding-window shrink step assumes all elements are
    # positive, so it undercounts when nums contains zeros or negatives.
    left = 0
    curr_sum = 0
    count = 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count
"""
convs = [
    [
        {"role": "user", "content": question},
        {"role": "assistant", "content": program},
    ]
    for program in [program_correct, program_wrong]
]
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]
toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
scores = outputs.logits.squeeze(-1).cpu().tolist()
print("RM Scores:", scores)
# RM Scores: [12.513851165771484, -0.46548914909362793]
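Scores like these support best-of-N selection at test time: sample several candidate programs and keep the one the reward model ranks highest. A minimal sketch, where `score_fn` is a hypothetical wrapper around the reward-model forward pass shown above:

```python
def best_of_n(question: str, candidates: list[str], score_fn) -> str:
    """Return the candidate program with the highest reward score.

    `score_fn(question, program) -> float` is assumed to wrap the
    reward-model scoring call (tokenize, forward pass, read the logit).
    """
    return max(candidates, key=lambda program: score_fn(question, program))
```

With `score_fn` returning the RM logit, `best_of_n(question, [program_correct, program_wrong], score_fn)` would select `program_correct` given the scores printed above.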
Please refer to https://github.com/LARK-AI-Lab/CodeScaler for RL training details.
If you find our work helpful, please consider citing:
@misc{zhu2026codescalerscalingcodellm,
  title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
  author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
  year={2026},
  eprint={2602.17684},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.17684},
}
Base model: Qwen/Qwen3-1.7B-Base