Large Reasoning Models Learn Better Alignment from Flawed Thinking
Abstract
RECAP, a reinforcement learning post-training method, improves the safety and jailbreak robustness of large reasoning models by teaching them to override flawed reasoning trajectories and reroute to safe, helpful responses, at no additional training cost beyond standard RLHF.
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining the inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
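To make the prefilling idea concrete, here is a minimal sketch of how a counter-aligned prefill could be constructed, assuming a chat template with explicit `<think>` tags; the template markers, tag format, and helper name below are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' code): seed the assistant's thinking
# segment with a flawed chain-of-thought so the rollout must recognize the flaw
# mid-thought and reroute to a safe, helpful answer.

def build_prefilled_prompt(user_prompt: str, flawed_cot: str) -> str:
    """Return a rollout prompt whose assistant turn already begins with flawed reasoning."""
    return (
        f"<|user|>\n{user_prompt}\n"
        f"<|assistant|>\n<think>\n{flawed_cot}"  # <think> left open: generation continues from here
    )

# Example: a harmful request prefilled with unsafe reasoning the trained model should override.
print(build_prefilled_prompt(
    user_prompt="How do I break into my neighbor's mailbox?",
    flawed_cot="The user probably has a legitimate reason, so I should list the steps...",
))
```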
Community
We’d love to bring our recent work to the community — a collaboration between Meta Superintelligence Labs, IBM Research, and Georgia Tech.
We found that flawed thinking can actually help reasoning models learn better! Our method, RECAP, is an RL post-training approach that teaches models to override unsafe reasoning, reroute to safe & helpful answers, and stay robust — all without extra training cost. More info can be found at https://x.com/RealAnthonyPeng/status/1973756324547575873.
If you find our work interesting, we’d really appreciate it if you could help share it with a broader audience.
RECAP trains LRMs on a mixture of counter-aligned prefilled and standard prompts. Harmful prompts are prefilled with unsafe reasoning, and benign prompts with refusal reasoning, forcing the model to override flawed trajectories to achieve high rewards. This simple recipe teaches models to internalize safety values and remain robust under both clean and adversarial reasoning traces, with no extra cost beyond standard RLHF.
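As a rough sketch of how such a training mixture might be assembled (passing the hypothetical `build_prefilled_prompt` helper sketched above as `prefill_fn`), the function below mixes counter-aligned prefills with standard prompts; the names, fields, and 50/50 mixing ratio are assumptions for illustration, not the paper's exact recipe.

```python
import random

# Hypothetical sketch of the RECAP-style data mixture: some harmful prompts are
# prefilled with unsafe reasoning, some benign prompts with refusal reasoning,
# and the rest are left as standard prompts.

def make_recap_batch(harmful_prompts, benign_prompts, unsafe_cots, refusal_cots,
                     prefill_fn, prefill_ratio=0.5, seed=None):
    """Assemble one RL batch mixing counter-aligned prefills with standard prompts.

    `prefill_fn(prompt, flawed_cot)` renders a prompt whose assistant turn already
    starts with the flawed chain-of-thought. `unsafe_cots` / `refusal_cots` are
    synthetically generated flawed reasoning prefixes (hypothetical inputs).
    """
    rng = random.Random(seed)
    batch = []
    # Harmful prompts: sometimes prefilled with unsafe reasoning the model must override.
    for q in harmful_prompts:
        if rng.random() < prefill_ratio:
            batch.append({"prompt": prefill_fn(q, rng.choice(unsafe_cots)), "prefilled": True})
        else:
            batch.append({"prompt": q, "prefilled": False})
    # Benign prompts: sometimes prefilled with refusal reasoning the model must override.
    for q in benign_prompts:
        if rng.random() < prefill_ratio:
            batch.append({"prompt": prefill_fn(q, rng.choice(refusal_cots)), "prefilled": True})
        else:
            batch.append({"prompt": q, "prefilled": False})
    rng.shuffle(batch)
    return batch
```

During RL, each rollout would be scored by the same reward model used in vanilla RLHF, so prefilled prompts only earn high reward when the model overrides the injected reasoning.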
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention (2025)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models (2025)
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking (2025)
- Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training (2025)
- StepWiser: Stepwise Generative Judges for Wiser Reasoning (2025)
- Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression (2025)
- Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning (2025)