arxiv:2505.03335

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Published on May 6
Submitted by andrewzh on May 7
#1 Paper of the day

Abstract

Absolute Zero Reasoner (AZR) achieves state-of-the-art performance on coding and mathematical reasoning tasks through self-generated, verified learning without external data.

AI-generated summary

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

Community

Hi!

Hi everyone! Thanks for this great paper, great idea and execution! I wrote this summary:

๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ฐ๐—ฎ๐—ป ๐˜๐—ฟ๐—ฎ๐—ถ๐—ป ๐˜„๐—ถ๐˜๐—ต๐—ผ๐˜‚๐˜ ๐—ฎ๐—ป๐˜† ๐—ฒ๐˜…๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น ๐—ฑ๐—ฎ๐˜๐—ฎ ๐Ÿคฏ

Has the "data wall" just been breached?

Recent RL paradigms often relied on a set of questions and answers that needed to be manually curated. Researchers from Tsinghua University went "why, though?"

🤔 Indeed, why learn from questions designed by a human teacher, when the model can start from its base knowledge and learn by experimenting in a code environment, proposing coding tasks itself and trying to solve them?

Thus they created the "Absolute Zero Reasoner" (AZR), an approach that removes any need for human-curated data.

🎭 Dual roles:
‣ Proposer: Generates challenging but solvable coding tasks
‣ Solver: Attempts to solve those self-proposed tasks

🧪 Three task types: all tasks are defined as triplets of program, input and output (see the Python sketch after this list)
‣ Deduction: given a program and an input, the model must deduce the output
‣ Abduction: given a program and an output, it must find an input that produces that output
‣ Induction: given input/output pairs, it must synthesize a program that maps them
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to get quite far from a single premise and his reasoning alone.
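To make the three task types concrete, here is a minimal Python sketch (my own illustration, not the authors' code) of how one validated (program, input, output) triplet, grounded by a code executor, yields three verifiable tasks:

```python
# Minimal sketch (not the paper's implementation): one validated
# (program, input, output) triplet gives rise to three verifiable tasks,
# all graded by simply executing code.

def run(program: str, x):
    """Execute a proposed program string and apply its function f to x."""
    env = {}
    exec(program, env)  # the code executor grounds validation and verification
    return env["f"](x)

program = "def f(x):\n    return sorted(x)[::-1]"
x, y = [3, 1, 2], [3, 2, 1]
assert run(program, x) == y  # proposer-side validation of the triplet

# Deduction: given (program, x), the solver must predict the output y.
assert run(program, x) == y

# Abduction: given (program, y), the solver proposes an input x_hat;
# any x_hat that reproduces y under execution is accepted.
x_hat = [1, 2, 3]
assert run(program, x_hat) == y

# Induction: given input/output pairs, the solver writes a program g,
# verified by executing it on held-out pairs.
g = "def f(x):\n    return sorted(x, reverse=True)"
assert run(g, [5, 4, 9]) == [9, 5, 4]
print("all three task types verified")
```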

📊 Results:
‣ AZR post-training creates a nice improvement on known models like Qwen2.5-7B
‣ Shows strong cross-domain transfer: coding ↔️ math reasoning

🧠 Other findings:
‣ Having better base performance (general or code-specific) amplifies the gains from AZR training
‣ Researchers warn about "uh-oh moments" (a nod to DeepSeek's "aha moments") where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!

This might signal a new era where models learn through experimentation rather than supervision! Maybe the solution to the "data wall" that many LLM companies are facing?

Super interesting! I'm interested in applying the Absolute Zero Reasoner (AZR) to the domain of physics. How exactly would one go about that? How would one implement it? What kinds of physics tasks/problems could be used (purely textual, purely equations, or mixed forms)?


The task format AZR is designed around is programs, so you would likely need to express your physics problems as programs.
Alternatively, you could fine-tune the AZR model, but you'd need to modify the prompts given to the model. The key question is whether your physics problems have quantifiable, checkable answers beyond just the problem format; you would also need to design a verifiable environment similar to the code executor.
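As a purely hypothetical illustration (not from the paper), such a verifiable physics environment could look something like this, with a trusted numerical simulation playing the role that the code executor plays in AZR; the projectile setup and all names are my own assumptions:

```python
# Hypothetical sketch: a physics verifier standing in for AZR's code executor.
# A task is a parameterised scenario plus a question, and the solver's numeric
# answer is checked against a trusted simulation rather than program output.
import math

def simulate_range(v0: float, angle_deg: float, g: float = 9.81) -> float:
    """Ground-truth range of an ideal projectile on flat ground (no drag)."""
    theta = math.radians(angle_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

def verify(answer: float, v0: float, angle_deg: float,
           rel_tol: float = 1e-3) -> bool:
    """Binary verifiable reward: does the solver's answer match the simulation?"""
    return math.isclose(answer, simulate_range(v0, angle_deg), rel_tol=rel_tol)

# Example: the solver claims a range of ~40.77 m for v0 = 20 m/s at 45 degrees.
print(verify(40.77, v0=20.0, angle_deg=45.0))  # True (within tolerance)
```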

This is a great paper.

My immediate intuition after reading this paper is that the most critical and important point, and the most promising area for future optimization, lies in the design of the "Proposer Reward." This is because it directly influences the quality of the problems generated by the system.

I noticed from your detailed exposition, particularly in Appendix D.4 regarding "Extra Rewards," that you experimented with various additional reward components for the proposer, such as:

  • complexity_reward
  • mean_edit_distance_reward
  • halstead_reward
  • answer_diversity_reward
  • f_input_answer_diversity_reward
  • f_output_answer_diversity_reward

However, it appears these attempts did not yield significant improvements or "seemingly had no effect," as you stated in Appendix D.4: "we did not observe a significant difference in performance."

What I find particularly remarkable, almost "magical," is that the relatively simple learnability reward (r_propose = 0 if the average solve rate r_solve_bar is 0 or 1, and 1 - r_solve_bar otherwise) seems to implicitly optimize all the aforementioned characteristics (complexity, diversity, etc.). Your experiments in Appendix C.4, showing an upward trend in complexity and diversity metrics without any explicit reward for them, indeed demonstrate this fascinating emergent behavior.
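For reference, that learnability reward is just a few lines of code; this is a direct transcription of the formula quoted above (my own sketch, with r_solve_bar estimated from a batch of solver rollouts):

```python
def propose_reward(solver_successes: list[float]) -> float:
    """Learnability reward for the proposer, as quoted above:
    0 if the solver always fails or always succeeds on the proposed task,
    otherwise 1 - average solve rate, so moderately hard tasks score highest."""
    r_solve_bar = sum(solver_successes) / len(solver_successes)
    if r_solve_bar == 0.0 or r_solve_bar == 1.0:
        return 0.0
    return 1.0 - r_solve_bar

print(propose_reward([1, 1, 1, 1]))  # 0.0  -> too easy, no learning signal
print(propose_reward([0, 0, 0, 0]))  # 0.0  -> currently unsolvable
print(propose_reward([1, 0, 0, 0]))  # 0.75 -> hard but learnable
```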

Perhaps the proposer reward warrants more extensive and in-depth research, and there might still be room for further optimization.
