Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Abstract
Language models trained with multi-answer reinforcement learning can generate multiple plausible answers with confidence estimates in a single forward pass, improving diversity and accuracy compared to traditional single-answer approaches.
Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to surface non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical-diagnosis, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single-answer-trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
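The abstract describes modifying the RL objective so that a single generation containing several candidate answers is rewarded at the set level (coverage) and for how well each stated confidence matches correctness (calibration). The paper does not spell out the exact formula here, so the following is only a minimal sketch of what such a set-level reward could look like; the function name, the weights, and the Brier-style calibration term are all illustrative assumptions, not the paper's actual objective.

```python
# Hypothetical set-level reward for multi-answer RL (illustrative sketch,
# NOT the paper's actual objective).

def multi_answer_reward(candidates, gold_answers,
                        coverage_weight=1.0, calib_weight=0.5):
    """candidates: list of (answer, confidence) pairs from one forward pass.
    gold_answers: set of reference answers considered correct."""
    hits = [ans in gold_answers for ans, _ in candidates]
    # Coverage term: did any candidate in the generated set hit a valid answer?
    coverage = 1.0 if any(hits) else 0.0
    # Calibration term: Brier-style penalty comparing each stated confidence
    # to whether that candidate was actually correct.
    brier = sum((conf - float(hit)) ** 2
                for (_, conf), hit in zip(candidates, hits))
    brier /= max(len(candidates), 1)
    return coverage_weight * coverage - calib_weight * brier
```

For example, `multi_answer_reward([("Paris", 0.7), ("Lyon", 0.2)], {"Paris"})` gives coverage 1.0 and a small calibration penalty, so the reward is close to 1; a confidently wrong set would be penalized even when one candidate happens to be correct.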
Community
Current post-training methods for language models implicitly collapse a rich distribution of possible answers into a single dominant output. While this works for benchmark-style tasks, many real-world settings—like medical diagnosis, coding, and ambiguous QA—require reasoning over multiple plausible answers under uncertainty.
This work introduces Multi-Answer Reinforcement Learning, a framework that trains models to generate diverse candidate answers in a single forward pass, along with calibrated confidence estimates. By shifting from single-output optimization to set-level reasoning, the approach improves diversity, coverage (e.g., pass@k), and calibration—while reducing the need for expensive repeated sampling.
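Since the model emits its candidate set and confidences in one pass, the output must follow some parseable convention. The convention below (one "answer (confidence)" pair per line) and the parser are purely hypothetical, to make the single-pass idea concrete; the paper's actual output format is not specified on this page.

```python
import re

# Hypothetical single-pass output format: one "answer (confidence)" per line,
# e.g. "Paris (0.7)". This format is an assumption for illustration only.
def parse_multi_answer(text):
    pairs = []
    for line in text.strip().splitlines():
        m = re.match(r"^\s*(.+?)\s*\(([0-9]*\.?[0-9]+)\)\s*$", line)
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs
```

Under this convention, `parse_multi_answer("Paris (0.7)\nLyon (0.2)")` yields `[("Paris", 0.7), ("Lyon", 0.2)]`, i.e. the full candidate set from a single generation rather than k separate samples.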
The following similar papers were recommended by Librarian Bot via the Semantic Scholar API:
- From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty (2026)
- CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering (2026)
- Improving Parametric Knowledge Access in Reasoning Language Models (2026)
- Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers (2026)
- ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure (2026)
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning (2026)
- Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels (2026)
Get this paper in your agent: `hf papers read 2603.24844`