Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Abstract
Language models trained with multi-answer reinforcement learning can generate multiple plausible answers with confidence estimates in a single forward pass, improving diversity and accuracy compared to traditional single-answer approaches.
Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to surface non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical-diagnosis, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single-answer-trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
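The abstract describes modifying the RL objective so that a single generation containing several candidate answers is rewarded at the set level (coverage) and for how well each stated confidence matches correctness (calibration). The paper does not spell out the exact formula here, so the following is only a minimal sketch of what such a set-level reward could look like; the function name, the weights, and the Brier-style calibration term are all illustrative assumptions, not the paper's actual objective.

```python
# Hypothetical set-level reward for multi-answer RL (illustrative sketch,
# NOT the paper's actual objective).

def multi_answer_reward(candidates, gold_answers,
                        coverage_weight=1.0, calib_weight=0.5):
    """candidates: list of (answer, confidence) pairs from one forward pass.
    gold_answers: set of reference answers considered correct."""
    hits = [ans in gold_answers for ans, _ in candidates]
    # Coverage term: did any candidate in the generated set hit a valid answer?
    coverage = 1.0 if any(hits) else 0.0
    # Calibration term: Brier-style penalty comparing each stated confidence
    # to whether that candidate was actually correct.
    brier = sum((conf - float(hit)) ** 2
                for (_, conf), hit in zip(candidates, hits))
    brier /= max(len(candidates), 1)
    return coverage_weight * coverage - calib_weight * brier
```

For example, `multi_answer_reward([("Paris", 0.7), ("Lyon", 0.2)], {"Paris"})` gives coverage 1.0 and a small calibration penalty, so the reward is close to 1; a confidently wrong set would be penalized even when one candidate happens to be correct.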
Community
Current post-training methods for language models implicitly collapse a rich distribution of possible answers into a single dominant output. While this works for benchmark-style tasks, many real-world settings—like medical diagnosis, coding, and ambiguous QA—require reasoning over multiple plausible answers under uncertainty.
This work introduces Multi-Answer Reinforcement Learning, a framework that trains models to generate diverse candidate answers in a single forward pass, along with calibrated confidence estimates. By shifting from single-output optimization to set-level reasoning, the approach improves diversity, coverage (e.g., pass@k), and calibration—while reducing the need for expensive repeated sampling.
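Since the model emits its candidate set and confidences in one pass, the output must follow some parseable convention. The convention below (one "answer (confidence)" pair per line) and the parser are purely hypothetical, to make the single-pass idea concrete; the paper's actual output format is not specified on this page.

```python
import re

# Hypothetical single-pass output format: one "answer (confidence)" per line,
# e.g. "Paris (0.7)". This format is an assumption for illustration only.
def parse_multi_answer(text):
    pairs = []
    for line in text.strip().splitlines():
        m = re.match(r"^\s*(.+?)\s*\(([0-9]*\.?[0-9]+)\)\s*$", line)
        if m:
            pairs.append((m.group(1), float(m.group(2))))
    return pairs
```

Under this convention, `parse_multi_answer("Paris (0.7)\nLyon (0.2)")` yields `[("Paris", 0.7), ("Lyon", 0.2)]`, i.e. the full candidate set from a single generation rather than k separate samples.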
The following similar papers were recommended by Librarian Bot via the Semantic Scholar API:
- From Entropy to Calibrated Uncertainty: Training Language Models to Reason About Uncertainty (2026)
- CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering (2026)
- Improving Parametric Knowledge Access in Reasoning Language Models (2026)
- Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers (2026)
- ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure (2026)
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning (2026)
- Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels (2026)
Get this paper in your agent: `hf papers read 2603.24844`