Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
Abstract
We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show that this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning; second, these "incorrect" traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To test the first hypothesis, we use a language model to paraphrase human-annotated traces, shifting their distribution closer to the model's own, and show that this alone improves performance. To test the second hypothesis, we introduce increasingly flawed CoT traces and measure how tolerant models are to these flaws. We demonstrate our findings across several reasoning domains, including math, algorithmic reasoning, and code generation, using the MATH, GSM8K, Countdown, and MBPP datasets, on language models from 1.5B to 9B parameters across the Qwen, Llama, and Gemma families. Our study shows that curating datasets whose distribution is close to the model's own is a critical consideration. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.
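As a rough illustration of the paraphrasing setup described above, the minimal sketch below asks a capable "teacher" model to rewrite a human-annotated CoT trace in its own words before the trace is used for supervised fine-tuning of a smaller student. This is an assumption-laden sketch, not the authors' code: the teacher model name, prompt wording, and generation settings are all illustrative.

```python
# Minimal sketch (not the authors' code): paraphrase human-annotated CoT traces
# with a stronger model so their surface form moves closer to the student's own
# distribution, then fine-tune the student on the paraphrased traces.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "Qwen/Qwen2.5-7B-Instruct"  # assumed teacher; any capable LLM works
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

def paraphrase_trace(question: str, human_cot: str, max_new_tokens: int = 512) -> str:
    """Ask the teacher to restate a human-written CoT in its own words,
    keeping every reasoning step but changing the surface form."""
    prompt = (
        "Rewrite the following solution in your own words. Keep all reasoning "
        f"steps and the final answer.\n\nQuestion: {question}\n\n"
        f"Solution: {human_cot}\n\nRewritten solution:"
    )
    inputs = tok(prompt, return_tensors="pt").to(teacher.device)
    out = teacher.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
    )
    # Decode only the newly generated tokens, not the prompt.
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# The paraphrased (question, trace) pairs would then feed a standard supervised
# fine-tuning run of the student model; no answer verification is applied.
```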
Community
Training on synthetic CoT traces, even traces with wrong final answers, improves reasoning: the traces lie closer to the model's own distribution and still contain valid partial reasoning steps, and training on them can outperform training on human-annotated data. In the paper we explore this observation and provide detailed experimental results and ablations on what models learn from unverified, noisy, and incorrect CoTs.
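One simple way to think about the flaw-tolerance ablation mentioned above is to corrupt a controlled fraction of the reasoning steps before fine-tuning and sweep that fraction. The sketch below is a toy illustration under our own assumptions (the corruption strategy of perturbing numbers is not taken from the paper).

```python
# Toy sketch: build "increasingly flawed" CoT traces by corrupting a controlled
# fraction of reasoning steps, so tolerance to flaws can be measured as that
# fraction grows.
import random
import re

def corrupt_step(step: str, rng: random.Random) -> str:
    """Perturb every number in a reasoning step so the step becomes invalid."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + rng.randint(1, 9)), step)

def inject_flaws(cot_steps: list[str], flaw_rate: float, seed: int = 0) -> list[str]:
    """Corrupt a `flaw_rate` fraction of the steps, chosen at random."""
    rng = random.Random(seed)
    n_flawed = int(round(flaw_rate * len(cot_steps)))
    flawed_idx = set(rng.sample(range(len(cot_steps)), n_flawed))
    return [corrupt_step(s, rng) if i in flawed_idx else s for i, s in enumerate(cot_steps)]

# Example: sweep the flaw rate for a short trace.
trace = ["2 + 3 = 5", "5 * 4 = 20", "So the answer is 20."]
for rate in (0.0, 0.33, 0.66, 1.0):
    print(rate, inject_flaws(trace, rate))
```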
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking? (2025)
- Plantain: Plan-Answer Interleaved Reasoning (2025)
- Efficient Reasoning via Thought-Training and Thought-Free Inference (2025)
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads (2025)
- Training LLMs with LogicReward for Faithful and Rigorous Reasoning (2025)
- Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy (2025)
- Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning (2025)