CodeScaler
We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.
This is the official CodeScaler-1.7B model, trained from Skywork/Skywork-Reward-V2-Qwen3-1.7B on LARK-Lab/CodeScalerPair-51K.
| Model | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg |
|---|---|---|---|---|---|---|---|---|
| Skywork/Skywork-Reward-Llama-3.1-8B | 54.5 | 69.5 | 60.6 | 95.7 | 89.0 | 74.7 | 46.6 | 70.1 |
| TIGER-Lab/AceCodeRM-7B | 66.9 | 66.7 | 65.3 | 89.9 | 79.9 | 74.4 | 62.2 | 72.2 |
| TIGER-Lab/AceCoder-RM-32B | 72.1 | 73.7 | 70.5 | 88.0 | 84.5 | 78.3 | 65.5 | 76.1 |
| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9 | 92.8 | 82.3 | 54.5 | 76.6 |
| Skywork/Skywork-Reward-V2-Qwen3-4B | 74.4 | 78.2 | 73.6 | 95.7 | 92.1 | 85.0 | 64.4 | 80.5 |
| Skywork/Skywork-Reward-V2-Qwen3-8B | 73.6 | 80.6 | 75.0 | 96.5 | 91.8 | 85.5 | 67.0 | 80.5 |
| CodeScaler-1.7B (this model) | 73.1 | 74.4 | 74.7 | 93.1 | 91.7 | 83.2 | 61.5 | 78.8 |
| CodeScaler-4B | 76.3 | 80.4 | 79.0 | 95.8 | 92.9 | 86.5 | 69.2 | 82.9 |
| CodeScaler-8B | 76.9 | 83.0 | 79.9 | 96.4 | 92.5 | 87.9 | 71.8 | 84.1 |
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = 'LARK-Lab/CodeScaler-1.7B'
tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()
question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2
Output:
2
```
"""
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: counts subarrays starting at index 0
    for num in nums:
        prefix += num
        if prefix - k in freq:
            count += freq[prefix - k]
        freq[prefix] += 1
    return count
"""
program_wrong = """\
def subarraySum(nums, k):
    # Incorrect: the sliding-window shrink step assumes all elements are
    # positive, so it undercounts when nums contains zeros or negatives.
    left = 0
    curr_sum = 0
    count = 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count
"""
convs = [
    [
        {"role": "user", "content": question},
        {"role": "assistant", "content": program},
    ]
    for program in [program_correct, program_wrong]
]
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]
toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
scores = outputs.logits.squeeze(-1).cpu().tolist()
print("RM Scores:", scores)
# RM Scores: [12.513851165771484, -0.46548914909362793]
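Scores like these support best-of-N selection at test time: sample several candidate programs and keep the one the reward model ranks highest. A minimal sketch, where `score_fn` is a hypothetical wrapper around the reward-model forward pass shown above:

```python
def best_of_n(question: str, candidates: list[str], score_fn) -> str:
    """Return the candidate program with the highest reward score.

    `score_fn(question, program) -> float` is assumed to wrap the
    reward-model scoring call (tokenize, forward pass, read the logit).
    """
    return max(candidates, key=lambda program: score_fn(question, program))
```

With `score_fn` returning the RM logit, `best_of_n(question, [program_correct, program_wrong], score_fn)` would select `program_correct` given the scores printed above.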
Please refer to https://github.com/LARK-AI-Lab/CodeScaler for RL training details.
If you find our work helpful, please consider citing:
@misc{zhu2026codescalerscalingcodellm,
  title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
  author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
  year={2026},
  eprint={2602.17684},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.17684},
}
Base model: Qwen/Qwen3-1.7B-Base