One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Abstract
One-token rollout (OTR) enhances supervised fine-tuning of large language models by incorporating policy gradient methods to improve generalization using on-policy data.
Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy's distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general-domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for LLM post-training.
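The mechanism described above amounts to a token-level policy-gradient update. Below is a minimal sketch, assuming a ±1 match-based reward, k i.i.d. candidate samples per position, and a plain REINFORCE estimator; the paper's exact reward shaping, sample count, masking, and any baseline or weighting may differ.

```python
# Hypothetical sketch of a one-token-rollout (OTR) style loss.
# Assumptions (not from the paper): reward is +1 if a sampled candidate matches
# the ground-truth token and -1 otherwise; k candidates are drawn i.i.d. per position.
import torch
import torch.nn.functional as F

def otr_loss(logits: torch.Tensor, labels: torch.Tensor, k_samples: int = 8) -> torch.Tensor:
    """logits: (B, T, V) next-token logits; labels: (B, T) ground-truth token ids."""
    log_probs = F.log_softmax(logits, dim=-1)                          # (B, T, V)
    B, T, V = logits.shape

    # One-token "rollout": sample k candidate tokens per position from the current policy.
    with torch.no_grad():
        samples = torch.multinomial(log_probs.exp().reshape(-1, V),
                                    k_samples, replacement=True).reshape(B, T, k_samples)

    # Reward from the supervised data: +1 for a match with the ground truth, -1 otherwise.
    rewards = (samples == labels.unsqueeze(-1)).float() * 2.0 - 1.0    # (B, T, k)

    # Log-probability of each sampled candidate under the current policy.
    sampled_logp = torch.gather(log_probs, dim=-1, index=samples)      # (B, T, k)

    # REINFORCE-style policy-gradient loss, averaged over samples and positions.
    return -(rewards * sampled_logp).mean()
```

In expectation, this pushes probability toward the ground-truth token only in proportion to how often the current policy actually samples it, which is the sense in which the static supervised data is turned into an on-policy signal.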
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification (2025)
- Proximal Supervised Fine-Tuning (2025)
- Anchored Supervised Fine-Tuning (2025)
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting (2025)
- Towards a Unified View of Large Language Model Post-Training (2025)
- More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration (2025)
- CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning (2025)
It seems OTR resonates with SFT/DFT, because they all compute the loss on the next-token logits of the LLM output.
The cross-entropy loss in SFT maximizes the relative probability of the ground-truth token (which suppresses all other tokens), while OTR only computes a policy-gradient loss on the sampled tokens, with the same optimization direction as SFT.
In other words, if OTR samples enough times at every token position, it can be regarded as a weighted version of SFT (e.g. DFT). I don't think it is an on-policy RL method, though it provides a good explanation.
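To make the "weighted SFT" intuition concrete: assuming, purely for illustration, an indicator reward $r(a) = \mathbf{1}[a = y]$ for a sampled candidate $a$ against the ground-truth token $y$ (the paper's exact reward may differ), the expected per-position OTR gradient is

$$
\mathbb{E}_{a \sim \pi_\theta}\!\left[ r(a)\, \nabla_\theta \log \pi_\theta(a) \right]
= \sum_{a} \pi_\theta(a)\, r(a)\, \nabla_\theta \log \pi_\theta(a)
= \pi_\theta(y)\, \nabla_\theta \log \pi_\theta(y),
$$

i.e. the SFT gradient reweighted by the model's own probability of the ground-truth token, which is exactly a DFT-style weighting; a ±1 reward only changes the constant factor, since $\sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a) = \nabla_\theta \sum_a \pi_\theta(a) = 0$.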
Thanks for the great questions! We agree there are interesting connections to SFT/DFT, but two key distinctions define OTR's on-policy nature and improved performance.
First, OTR explicitly penalizes the incorrect tokens it samples. While SFT only implicitly suppresses non-GT tokens, OTR applies a principled, policy-gradient-guided loss to the specific plausible-but-wrong tokens the model actually generates, which offers a more direct and nuanced optimization signal and inevitably leads to a different optimization path than SFT.
Second, the difference is most critical for low-probability GT tokens. A weighted SFT (like DFT) would still force an unstable update toward that hard-to-reach token. In contrast, if OTR doesn't sample the GT token, it makes no update to it and only suppresses the incorrect tokens it did sample. This ensures stable, truly on-policy updates by only making adjustments in regions the model can already reach.
So, while OTR can resemble a weighted SFT in simple cases, its unique handling of these two scenarios is fundamentally different. We believe this is the key to its generalization gains, by truly transforming static data into a dynamic, on-policy signal.
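To make the second point concrete, here is a tiny numerical sketch with hypothetical logits; the DFT-style term is simply the SFT loss reweighted by the detached token probability (our reading of "weighted SFT" in this thread), and the OTR term assumes the same ±1 match reward as the sketch above rather than the paper's exact formulation:

```python
# Tiny illustration with made-up numbers: one position, 5-token vocabulary,
# and a ground-truth token the current policy assigns only ~1% probability.
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 2.5, 0.0, -1.0, -2.0])
log_probs = F.log_softmax(logits, dim=-1)
y = 3                                           # ground-truth token, prob ~0.011

sft_loss = -log_probs[y]                        # ~4.5: a large, hard push toward y
dft_loss = -(log_probs[y].exp().detach() * log_probs[y])  # ~0.05: probability-damped push toward y

samples = torch.tensor([0, 1])                  # the rollout drew only high-probability tokens, missing y
otr_loss = -(-1.0 * log_probs[samples]).mean()  # reward -1 for both samples: suppresses the
                                                # sampled wrong tokens; no explicit term involving y
```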
Thanks again for the great discussion!