Papers
arxiv:2510.00938

Large Reasoning Models Learn Better Alignment from Flawed Thinking

Published on Oct 1
· Submitted by Anthony Peng on Oct 6
#2 Paper of the day
Authors:
,
,
,
,
,
,

Abstract

RECAP, a reinforcement learning method, enhances the safety and robustness of large reasoning models by teaching them to override flawed reasoning and maintain safety without additional training costs.

AI-generated summary

Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.

Community

Paper author Paper submitter
edited 14 days ago

We’d love to bring our recent work to the community — a collaboration between Meta Superintelligence Labs, IBM Research, and Georgia Tech.

We found that flawed thinking can actually help reasoning models learn better! Our method, RECAP, is an RL post-training approach that teaches models to override unsafe reasoning, reroute to safe & helpful answers, and stay robust — all without extra training cost. More info can be found at https://x.com/RealAnthonyPeng/status/1973756324547575873.

If you find our work interesting, we’d really appreciate it if you could help share it with a broader audience.

Paper author Paper submitter

Group 48

RECAP trains LRMs on a mixture of counter-aligned prefilled and standard prompts. Harmful prompts are prefilled with unsafe reasoning, and benign prompts with refusal reasoning, forcing the model to override flawed trajectories to achieve high rewards. This simple recipe teaches models to internalize safety values and remain robust under both clean and adversarial reasoning traces, with no extra cost beyond standard RLHF.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.00938 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.00938 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.00938 in a Space README.md to link it from this page.

Collections including this paper 3