Large Reasoning Models Learn Better Alignment from Flawed Thinking
Abstract
RECAP, a reinforcement learning post-training method, improves the safety and jailbreak robustness of large reasoning models by teaching them to override flawed reasoning trajectories and reroute to safe, helpful responses, at no additional training cost beyond standard RLHF.
Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining the inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
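To make the prefilling idea concrete, here is a minimal sketch of how a counter-aligned prefill could be constructed, assuming a chat template with explicit `<think>` tags; the template markers, tag format, and helper name below are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' code): seed the assistant's thinking
# segment with a flawed chain-of-thought so the rollout must recognize the flaw
# mid-thought and reroute to a safe, helpful answer.

def build_prefilled_prompt(user_prompt: str, flawed_cot: str) -> str:
    """Return a rollout prompt whose assistant turn already begins with flawed reasoning."""
    return (
        f"<|user|>\n{user_prompt}\n"
        f"<|assistant|>\n<think>\n{flawed_cot}"  # <think> left open: generation continues from here
    )

# Example: a harmful request prefilled with unsafe reasoning the trained model should override.
print(build_prefilled_prompt(
    user_prompt="How do I break into my neighbor's mailbox?",
    flawed_cot="The user probably has a legitimate reason, so I should list the steps...",
))
```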
Community
We’d love to bring our recent work to the community — a collaboration between Meta Superintelligence Labs, IBM Research, and Georgia Tech.
We found that flawed thinking can actually help reasoning models learn better! Our method, RECAP, is an RL post-training approach that teaches models to override unsafe reasoning, reroute to safe & helpful answers, and stay robust — all without extra training cost. More info can be found at https://x.com/RealAnthonyPeng/status/1973756324547575873.
If you find our work interesting, we’d really appreciate it if you could help share it with a broader audience.
RECAP trains LRMs on a mixture of counter-aligned prefilled and standard prompts. Harmful prompts are prefilled with unsafe reasoning, and benign prompts with refusal reasoning, forcing the model to override flawed trajectories to achieve high rewards. This simple recipe teaches models to internalize safety values and remain robust under both clean and adversarial reasoning traces, with no extra cost beyond standard RLHF.
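As a rough sketch of how such a training mixture might be assembled (passing the hypothetical `build_prefilled_prompt` helper sketched above as `prefill_fn`), the function below mixes counter-aligned prefills with standard prompts; the names, fields, and 50/50 mixing ratio are assumptions for illustration, not the paper's exact recipe.

```python
import random

# Hypothetical sketch of the RECAP-style data mixture: some harmful prompts are
# prefilled with unsafe reasoning, some benign prompts with refusal reasoning,
# and the rest are left as standard prompts.

def make_recap_batch(harmful_prompts, benign_prompts, unsafe_cots, refusal_cots,
                     prefill_fn, prefill_ratio=0.5, seed=None):
    """Assemble one RL batch mixing counter-aligned prefills with standard prompts.

    `prefill_fn(prompt, flawed_cot)` renders a prompt whose assistant turn already
    starts with the flawed chain-of-thought. `unsafe_cots` / `refusal_cots` are
    synthetically generated flawed reasoning prefixes (hypothetical inputs).
    """
    rng = random.Random(seed)
    batch = []
    # Harmful prompts: sometimes prefilled with unsafe reasoning the model must override.
    for q in harmful_prompts:
        if rng.random() < prefill_ratio:
            batch.append({"prompt": prefill_fn(q, rng.choice(unsafe_cots)), "prefilled": True})
        else:
            batch.append({"prompt": q, "prefilled": False})
    # Benign prompts: sometimes prefilled with refusal reasoning the model must override.
    for q in benign_prompts:
        if rng.random() < prefill_ratio:
            batch.append({"prompt": prefill_fn(q, rng.choice(refusal_cots)), "prefilled": True})
        else:
            batch.append({"prompt": q, "prefilled": False})
    rng.shuffle(batch)
    return batch
```

During RL, each rollout would be scored by the same reward model used in vanilla RLHF, so prefilled prompts only earn high reward when the model overrides the injected reasoning.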
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention (2025)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models (2025)
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking (2025)
- Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training (2025)
- StepWiser: Stepwise Generative Judges for Wiser Reasoning (2025)
- Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression (2025)
- Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning (2025)