arxiv:2509.25810

Learning to Reason as Action Abstractions with Scalable Mid-Training RL

Published on Sep 30 · Submitted by Shenao Zhang on Oct 1 · Apple

Abstract

AI-generated summary: Mid-training with action abstractions enhances reinforcement learning in large language models, improving performance and convergence in code generation tasks.

Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline, respectively. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
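To make the two error terms concrete, a schematic decomposition of the kind the analysis describes is sketched below; the notation (pruned action subspace A', optimal values V*) is ours, and the paper's exact statement may differ.

\[
V^{*}(s) - V^{\pi_{\mathrm{RL}}}_{\mathcal{A}'}(s)
\;=\;
\underbrace{\bigl(V^{*}(s) - V^{*}_{\mathcal{A}'}(s)\bigr)}_{\text{value approximation error from pruning to } \mathcal{A}'}
\;+\;
\underbrace{\bigl(V^{*}_{\mathcal{A}'}(s) - V^{\pi_{\mathrm{RL}}}_{\mathcal{A}'}(s)\bigr)}_{\text{RL error during subsequent planning}}
\]

Mid-training controls the first term through which abstractions are kept, and it shapes how quickly online RL can shrink the second.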

Community

Paper author and submitter · edited 25 days ago

We theoretically study how mid-training shapes post-training RL. The findings lead to a scalable algorithm for learning action hierarchies from expert demonstrations, which we successfully apply to 1B Python code data.
See our tweet for more details: https://x.com/ShenaoZhang/status/1973413781565751331
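As a rough illustration of the two-step loop described in the abstract (discover temporally-consistent latent plans via an RL-style reward filter, then fine-tune on the bootstrapped data), here is a minimal, self-contained sketch. All names (`discover_abstractions`, `fine_tune`, the toy `reward`) are placeholders of ours, not the authors' implementation.

```python
# Minimal sketch of an RA3-style mid-training loop (placeholder components).
import random

def reward(prompt: str, completion: str) -> float:
    # Placeholder verifier; in the paper's setting this would be a code
    # execution / unit-test signal (RLVR-style reward).
    return 1.0 if hash((prompt, completion)) % 2 == 0 else 0.0

def sample(policy: dict, prompt: str) -> tuple[str, str]:
    # Placeholder policy: pick a latent plan, then a completion conditioned on it.
    plan = random.choice(policy["plans"])
    return plan, f"{plan} -> solution for {prompt}"

def discover_abstractions(policy: dict, prompts: list[str]) -> list[tuple]:
    # Step 1: keep (prompt, plan, completion) triples whose completion scores well.
    kept = []
    for prompt in prompts:
        plan, completion = sample(policy, prompt)
        if reward(prompt, completion) > 0.5:
            kept.append((prompt, plan, completion))
    return kept

def fine_tune(policy: dict, data: list[tuple]) -> dict:
    # Step 2: stand-in for a gradient update on the bootstrapped data.
    policy["plans"].extend(plan for _, plan, _ in data)
    return policy

policy = {"plans": ["outline a loop", "use recursion", "sort then scan"]}
prompts = [f"task-{i}" for i in range(8)]
for it in range(3):  # iterate the two steps, as in the sequential variational scheme
    data = discover_abstractions(policy, prompts)
    policy = fine_tune(policy, data)
    print(f"iteration {it}: bootstrapped {len(data)} examples")
```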


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 1

Cite arxiv.org/abs/2509.25810 in a model, dataset, or Space README.md to link it from this page.