arxiv:2510.26697

The End of Manual Decoding: Towards Truly End-to-End Language Models

Published on Oct 30
Submitted by Tian Lan on Oct 31
#3 Paper of the day
Abstract

The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural-language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.

Community

Paper author, paper submitter

AutoDeco is a framework that adds token-level adaptive prediction of decoding parameters to Large Language Models (LLMs). By adding lightweight prediction heads on top of a pre-trained model, AutoDeco dynamically predicts the temperature and top-p for each token during decoding.

Github: https://github.com/Zacks917/AutoDeco
Hugging Face models: https://huggingface.co/collections/Jadeislaw/autodeco
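The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's implementation: the head shapes, the exp/sigmoid activations, and the name `autodeco_step` are assumptions made for the example, and the heads here are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical lightweight heads: single linear maps from the transformer's
# hidden state to one scalar each (temperature and top-p).
d_model, vocab = 16, 10
W_temp = rng.normal(scale=0.1, size=d_model)
W_topp = rng.normal(scale=0.1, size=d_model)

def autodeco_step(hidden, logits):
    # Predict a positive temperature and a top-p value in (0, 1)
    # from the same hidden state that produced the logits.
    temperature = float(np.exp(hidden @ W_temp))              # exp keeps it > 0
    top_p = float(1.0 / (1.0 + np.exp(-(hidden @ W_topp))))   # sigmoid -> (0, 1)
    probs = softmax(logits / temperature)
    # Nucleus (top-p) filtering with the predicted threshold:
    # keep the smallest set of top tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return temperature, top_p, filtered

hidden = rng.normal(size=d_model)
logits = rng.normal(size=vocab)
t, p, dist = autodeco_step(hidden, logits)
```

In the actual architecture the heads would be trained jointly with (or on top of) the base language model, so that `t` and `p` track the context; with random weights only the interface is meaningful.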


This work should cite https://arxiv.org/abs/2411.09661:

Adaptive Decoding via Latent Preference Optimization

During language model decoding, it is known that higher-temperature sampling gives more creative responses, while lower temperatures are more factually accurate. However, such models are commonly applied to general instruction following, which involves both creative and fact-seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters we introduce Latent Preference Optimization (LPO), a general approach to training discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.
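For comparison, the cited approach selects a temperature from a small discrete set at each token via an added layer. The sketch below is hypothetical: the candidate set, the layer shape, and the argmax selection are illustrative only, and the paper trains this layer with LPO rather than leaving it random as here.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative discrete temperature choices, standing in for the paper's
# token-level latent variable over temperatures.
CANDIDATE_TEMPS = np.array([0.3, 0.7, 1.0, 1.5])

d_model = 16
W_choice = rng.normal(scale=0.1, size=(d_model, len(CANDIDATE_TEMPS)))

def adaptive_temperature(hidden, logits):
    # The added layer scores each candidate temperature from the hidden
    # state and picks one; LPO would train W_choice from preferences
    # over outcomes sampled with different choices.
    choice_probs = softmax(hidden @ W_choice)
    idx = int(np.argmax(choice_probs))
    T = float(CANDIDATE_TEMPS[idx])
    return T, softmax(logits / T)

hidden = rng.normal(size=d_model)
logits = rng.normal(size=5)
T, dist = adaptive_temperature(hidden, logits)
```

The key contrast with AutoDeco is that the latent here is discrete (a choice among fixed temperatures) rather than a continuously predicted scalar, which is what motivates the preference-based training objective.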

Paper author

Hi Jason,

Thanks for the comment. Jack kindly reached out to us via email yesterday, and we've already been in touch with him.

We appreciate you both bringing this highly relevant work to our attention—it was an oversight on our part during the literature survey. As we conveyed to him, we are preparing an updated version of our paper, expected on arXiv within two weeks. In this revision, we will be sure to include a discussion that analyzes the connections and differences between our two approaches.

Thanks again for making sure we were aware.


Models citing this paper: 6
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 8