arxiv:2505.03335

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Published on May 6
Submitted by andrewzh on May 7
#1 Paper of the day

Abstract

Absolute Zero Reasoner (AZR) achieves state-of-the-art performance on coding and mathematical reasoning tasks through self-generated, verified learning without external data.

AI-generated summary

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

Community

Hi!

Hi everyone! Thanks for this great paper, great idea and execution! I wrote this summary:

๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ฐ๐—ฎ๐—ป ๐˜๐—ฟ๐—ฎ๐—ถ๐—ป ๐˜„๐—ถ๐˜๐—ต๐—ผ๐˜‚๐˜ ๐—ฎ๐—ป๐˜† ๐—ฒ๐˜…๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น ๐—ฑ๐—ฎ๐˜๐—ฎ ๐Ÿคฏ

Has the "data wall" just been breached?

Recent RL paradigms often relied on a set of questions and answers that needed to be manually curated. Researchers from Tsinghua University went "why, though?"

🤔 Indeed, why learn from questions designed by a human teacher, when the model can start from its base knowledge and learn by experimenting in a code environment, proposing coding tasks itself and trying to solve them?

Thus they created the "Absolute Zero Reasoner" (AZR), an approach that removes any need for human-curated data.

🎭 Dual roles:
‣ Proposer: Generates challenging but solvable coding tasks
‣ Solver: Attempts to solve those self-proposed tasks

🧪 Three task types: all tasks are defined as triplets of program, input and output (see the Python sketch after this list)
‣ Deduction: given a program and an input, the model must deduce the output
‣ Abduction: given a program and an output, it must find an input that produces that output
‣ Induction: given input/output pairs, it must synthesize a program that maps them
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to get quite far from a single premise and his reasoning alone.
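To make the three task types concrete, here is a minimal Python sketch (my own illustration, not the authors' code) of how one validated (program, input, output) triplet, grounded by a code executor, yields three verifiable tasks:

```python
# Minimal sketch (not the paper's implementation): one validated
# (program, input, output) triplet gives rise to three verifiable tasks,
# all graded by simply executing code.

def run(program: str, x):
    """Execute a proposed program string and apply its function f to x."""
    env = {}
    exec(program, env)  # the code executor grounds validation and verification
    return env["f"](x)

program = "def f(x):\n    return sorted(x)[::-1]"
x, y = [3, 1, 2], [3, 2, 1]
assert run(program, x) == y  # proposer-side validation of the triplet

# Deduction: given (program, x), the solver must predict the output y.
assert run(program, x) == y

# Abduction: given (program, y), the solver proposes an input x_hat;
# any x_hat that reproduces y under execution is accepted.
x_hat = [1, 2, 3]
assert run(program, x_hat) == y

# Induction: given input/output pairs, the solver writes a program g,
# verified by executing it on held-out pairs.
g = "def f(x):\n    return sorted(x, reverse=True)"
assert run(g, [5, 4, 9]) == [9, 5, 4]
print("all three task types verified")
```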

📊 Results:
‣ AZR post-training creates a nice improvement on known models like Qwen2.5-7B
‣ Shows strong cross-domain transfer: coding ↔️ math reasoning

🧠 Other findings:
‣ Having better base performance (general or code-specific) amplifies the gains from AZR training
‣ Researchers warn about "uh-oh moments" (a nod to DeepSeek's "aha moments") where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!

This might signal a new era where models learn through experimentation rather than supervision! Maybe the solution to the "data wall" that many LLM companies are facing?

Super interesting! I'm interested in applying the Absolute Zero Reasoner (AZR) to the domain of physics. How exactly would one go about that? How would one implement it? What kinds of physics tasks/problems could be used (purely textual, purely equations, or mixed forms)?


The task format AZR is designed around is programs, so you would likely need to express your physics problems as programs.
Alternatively, you could fine-tune the AZR model, but you'd need to modify the prompts given to the model. The key question is whether your physics problems have quantifiable, checkable answers beyond just the problem format; you would also need to design a verifiable environment similar to the code executor.
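As a purely hypothetical illustration (not from the paper), such a verifiable physics environment could look something like this, with a trusted numerical simulation playing the role that the code executor plays in AZR; the projectile setup and all names are my own assumptions:

```python
# Hypothetical sketch: a physics verifier standing in for AZR's code executor.
# A task is a parameterised scenario plus a question, and the solver's numeric
# answer is checked against a trusted simulation rather than program output.
import math

def simulate_range(v0: float, angle_deg: float, g: float = 9.81) -> float:
    """Ground-truth range of an ideal projectile on flat ground (no drag)."""
    theta = math.radians(angle_deg)
    return v0 ** 2 * math.sin(2 * theta) / g

def verify(answer: float, v0: float, angle_deg: float,
           rel_tol: float = 1e-3) -> bool:
    """Binary verifiable reward: does the solver's answer match the simulation?"""
    return math.isclose(answer, simulate_range(v0, angle_deg), rel_tol=rel_tol)

# Example: the solver claims a range of ~40.77 m for v0 = 20 m/s at 45 degrees.
print(verify(40.77, v0=20.0, angle_deg=45.0))  # True (within tolerance)
```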

This is a great paper.

My immediate intuition after reading this paper is that the most critical and important point, and the most promising area for future optimization, lies in the design of the "Proposer Reward." This is because it directly influences the quality of the problems generated by the system.

I noticed from your detailed exposition, particularly in Appendix D.4 regarding "Extra Rewards," that you experimented with various additional reward components for the proposer, such as:

  • complexity_reward
  • mean_edit_distance_reward
  • halstead_reward
  • answer_diversity_reward
  • f_input_answer_diversity_reward
  • f_output_answer_diversity_reward

However, it appears these attempts did not yield significant improvements or "seemingly had no effect," as you stated in Appendix D.4: "we did not observe a significant difference in performance."

What I find particularly remarkable, almost "magical," is that the relatively simple learnability reward (r_propose = 0 if the average solve rate r_solve_bar is 0 or 1, and 1 - r_solve_bar otherwise) seems to implicitly optimize all the aforementioned characteristics (complexity, diversity, etc.). Your experiments in Appendix C.4, showing an upward trend in complexity and diversity metrics without any explicit reward for them, indeed demonstrate this fascinating emergent behavior.
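For reference, that learnability reward is just a few lines of code; this is a direct transcription of the formula quoted above (my own sketch, with r_solve_bar estimated from a batch of solver rollouts):

```python
def propose_reward(solver_successes: list[float]) -> float:
    """Learnability reward for the proposer, as quoted above:
    0 if the solver always fails or always succeeds on the proposed task,
    otherwise 1 - average solve rate, so moderately hard tasks score highest."""
    r_solve_bar = sum(solver_successes) / len(solver_successes)
    if r_solve_bar == 0.0 or r_solve_bar == 1.0:
        return 0.0
    return 1.0 - r_solve_bar

print(propose_reward([1, 1, 1, 1]))  # 0.0  -> too easy, no learning signal
print(propose_reward([0, 0, 0, 0]))  # 0.0  -> currently unsolvable
print(propose_reward([1, 0, 0, 0]))  # 0.75 -> hard but learnable
```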

Perhaps the proposer reward warrants more extensive and in-depth research, and there might still be room for further optimization.
