
Demystifying Reinforcement Learning in Agentic Reasoning

Paper on arXiv · Open-AgentRL on GitHub · 30K RL Dataset · DemyAgent-4B Model

🎯 About This Repository

This repository contains the DemyAgent-4B model weights, a 4B-parameter agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained with our GRPO-TCR recipe on 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.
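As a minimal usage sketch: since DemyAgent-4B is finetuned from Qwen3-4B-Instruct-2507, we assume it inherits the Qwen ChatML-style prompt format. In practice you would load the tokenizer from `Gen-Verse/DemyAgent-4B` and call `tokenizer.apply_chat_template`; the manual construction below only illustrates what that template produces (the `build_chat_prompt` helper and the example question are ours, not part of the release).

```python
def build_chat_prompt(messages):
    """Render a list of {role, content} dicts into a Qwen-style ChatML prompt.

    Illustrative only: with the real tokenizer, prefer
    tokenizer.apply_chat_template(messages, add_generation_prompt=True).
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt for the model's reply
    return "".join(parts)


prompt = build_chat_prompt([
    {"role": "user", "content": "Compute the remainder of 7^100 modulo 13."},
])
print(prompt)
```

The resulting string is what gets tokenized and fed to the model; decoding then continues from the trailing `<|im_start|>assistant` marker.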

🌟 Introduction

In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

  • 🎯 Data Quality Matters: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
  • ⚑ Training Efficiency: Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency
  • 🧠 Reasoning Strategy: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning

We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.

πŸ“¦ Resources

| Type | Name | Link |
| --- | --- | --- |
| 📊 Dataset | 3K Agentic SFT Data | 🤗 HuggingFace |
| 📊 Dataset | 30K Agentic RL Data | 🤗 HuggingFace |
| 🤖 Model | Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | Qwen3-4B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | DemyAgent-4B | 🤗 HuggingFace |

Note:

  • Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507, respectively, using our 3K Agentic SFT Data
  • DemyAgent-4B is trained through agentic RL on our 30K Agentic RL Data using the GRPO-TCR recipe

πŸ† Performance

We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.

Benchmark Results

| Method | AIME2024 (Math) | AIME2025 (Math) | GPQA-Diamond (Science) | LiveCodeBench-v6 (Code) |
| --- | --- | --- | --- | --- |
| **Self-Contained Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| **Agentic Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| **DemyAgent-4B (Ours)** | 72.6 | 70.0 | 58.5 | 26.8 |

Key Highlights

✨ Despite having only 4B parameters, DemyAgent-4B achieves:

  • πŸ₯‡ State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
  • πŸ₯ˆ Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
  • πŸš€ Competitive performance against 14B-32B models with 4-8Γ— fewer parameters
  • πŸ’‘ Superior efficiency compared to long-CoT models through deliberative tool use

πŸ“ Citation

```bibtex
@article{yu2025demystify,
  title={Demystifying Reinforcement Learning in Agentic Reasoning},
  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
  journal={arXiv preprint arXiv:2510.11701},
  year={2025}
}
```