CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Paper on arXiv · Code on GitHub · Project Page · Datasets on Hugging Face · Model on Hugging Face

Overview

We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.

This repository hosts the official CodeScaler-1.7B, trained from Skywork/Skywork-Reward-V2-Qwen3-1.7B on LARK-Lab/CodeScalerPair-51K.

Performance on RM-Bench

| Model | Code | Chat | Math | Safety | Easy | Normal | Hard | Avg |
|---|---|---|---|---|---|---|---|---|
| Skywork/Skywork-Reward-Llama-3.1-8B | 54.5 | 69.5 | 60.6 | 95.7 | 89.0 | 74.7 | 46.6 | 70.1 |
| TIGER-Lab/AceCodeRM-7B | 66.9 | 66.7 | 65.3 | 89.9 | 79.9 | 74.4 | 62.2 | 72.2 |
| TIGER-Lab/AceCoder-RM-32B | 72.1 | 73.7 | 70.5 | 88.0 | 84.5 | 78.3 | 65.5 | 76.1 |
| Skywork/Skywork-Reward-V2-Qwen3-1.7B | 72.3 | 69.6 | 71.4 | 92.9 | 92.8 | 82.3 | 54.5 | 76.6 |
| Skywork/Skywork-Reward-V2-Qwen3-4B | 74.4 | 78.2 | 73.6 | 95.7 | 92.1 | 85.0 | 64.4 | 80.5 |
| Skywork/Skywork-Reward-V2-Qwen3-8B | 73.6 | 80.6 | 75.0 | 96.5 | 91.8 | 85.5 | 67.0 | 80.5 |
| CodeScaler-1.7B (this model) | 73.1 | 74.4 | 74.7 | 93.1 | 91.7 | 83.2 | 61.5 | 78.8 |
| CodeScaler-4B | 76.3 | 80.4 | 79.0 | 95.8 | 92.9 | 86.5 | 69.2 | 82.9 |
| CodeScaler-8B | 76.9 | 83.0 | 79.9 | 96.4 | 92.5 | 87.9 | 71.8 | 84.1 |

Usage

RM Scoring

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = 'LARK-Lab/CodeScaler-1.7B'

tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of contiguous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2

Output:
2
```
"""

program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: subarray starting from index 0

    for num in nums:
        prefix += num

        if prefix - k in freq:
            count += freq[prefix - k]

        freq[prefix] += 1

    return count
"""

program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0

    for right in range(len(nums)):
        curr_sum += nums[right]

        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1

        if curr_sum == k:
            count += 1

    return count
"""


convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program
        }
    ] for program in [program_correct, program_wrong]
]


texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
    scores = outputs.logits.squeeze(-1).cpu().tolist()


print("RM Scores:", scores)
# RM Scores: [12.513851165771484, -0.46548914909362793]
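
These scores can also drive test-time inference scaling via best-of-N selection: sample N candidate programs, score each with the reward model, and keep the highest-scoring one. A minimal, model-agnostic sketch (our illustration; `score_fn` is a stand-in for a call to the reward model as shown above):

```python
def best_of_n(question, candidates, score_fn):
    """Return the highest-scoring candidate and its score.

    `score_fn(question, program) -> float` is a placeholder for
    running the reward model on a (question, program) pair.
    """
    scored = [(score_fn(question, prog), prog) for prog in candidates]
    best_score, best_program = max(scored, key=lambda pair: pair[0])
    return best_program, best_score


# Toy usage with a dummy scorer that prefers shorter programs:
program, score = best_of_n(
    "sum two numbers",
    ["def add(a, b): return a + b",
     "def add(a, b):\n    s = a\n    s += b\n    return s"],
    score_fn=lambda q, prog: -len(prog),
)
```

In practice `score_fn` would batch the candidates through the tokenizer and reward model exactly as in the scoring snippet above.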

RL Training

Please refer to https://github.com/LARK-AI-Lab/CodeScaler for RL training details.
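
The repository covers the full RL setup. As a rough illustration of the validity-preserving reward shaping mentioned in the overview (our assumption of the general shape, not the paper's exact formula), one can floor the reward for unparsable completions so the policy is never rewarded for syntactically invalid code; `invalid_penalty` is an assumed hyperparameter:

```python
import ast


def shaped_reward(program: str, rm_score: float,
                  invalid_penalty: float = -10.0) -> float:
    """Validity-preserving reward shaping (illustrative sketch).

    Unparsable code receives a fixed penalty; valid code keeps its
    reward-model score, so syntactic validity is never traded away
    during policy optimization.
    """
    try:
        ast.parse(program)
    except SyntaxError:
        return invalid_penalty
    return rm_score
```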

Citation

If you find our work helpful, please consider citing:

@misc{zhu2026codescalerscalingcodellm,
      title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models}, 
      author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
      year={2026},
      eprint={2602.17684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.17684}, 
}