
Demystifying Reinforcement Learning in Agentic Reasoning

Paper on arXiv · Open-AgentRL on GitHub · 30K RL Dataset · DemyAgent-4B Model

🎯 About This Repository

This repository contains the DemyAgent-4B model weights, a 4B-parameter agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained with our GRPO-TCR recipe on 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.
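As a minimal usage sketch: since DemyAgent-4B is finetuned from Qwen3-4B-Instruct-2507, we assume it inherits the Qwen ChatML-style prompt format. In practice you would load the tokenizer from `Gen-Verse/DemyAgent-4B` and call `tokenizer.apply_chat_template`; the manual construction below only illustrates what that template produces (the `build_chat_prompt` helper and the example question are ours, not part of the release).

```python
def build_chat_prompt(messages):
    """Render a list of {role, content} dicts into a Qwen-style ChatML prompt.

    Illustrative only: with the real tokenizer, prefer
    tokenizer.apply_chat_template(messages, add_generation_prompt=True).
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # generation prompt for the model's reply
    return "".join(parts)


prompt = build_chat_prompt([
    {"role": "user", "content": "Compute the remainder of 7^100 modulo 13."},
])
print(prompt)
```

The resulting string is what gets tokenized and fed to the model; decoding then continues from the trailing `<|im_start|>assistant` marker.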

🌟 Introduction

In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

  • 🎯 Data Quality Matters: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
  • ⚑ Training Efficiency: Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency
  • 🧠 Reasoning Strategy: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning

We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.

πŸ“¦ Resources

| Type | Name | Link |
| --- | --- | --- |
| 📊 Dataset | 3K Agentic SFT Data | 🤗 HuggingFace |
| 📊 Dataset | 30K Agentic RL Data | 🤗 HuggingFace |
| 🤖 Model | Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | Qwen3-4B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | DemyAgent-4B | 🤗 HuggingFace |

Note:

  • Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507, respectively, using our 3K Agentic SFT Data
  • DemyAgent-4B is trained through agentic RL on our 30K Agentic RL Data using the GRPO-TCR recipe

πŸ† Performance

We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.

Benchmark Results

| Method | AIME2024 (Math) | AIME2025 (Math) | GPQA-Diamond (Science) | LiveCodeBench-v6 (Code) |
| --- | --- | --- | --- | --- |
| **Self-Contained Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| **Agentic Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| **DemyAgent-4B (Ours)** | 72.6 | 70.0 | 58.5 | 26.8 |

Key Highlights

✨ Despite having only 4B parameters, DemyAgent-4B achieves:

  • πŸ₯‡ State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
  • πŸ₯ˆ Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
  • πŸš€ Competitive performance against 14B-32B models with 4-8Γ— fewer parameters
  • πŸ’‘ Superior efficiency compared to long-CoT models through deliberative tool use

πŸ“ Citation

```bibtex
@article{yu2025demystify,
  title={Demystifying Reinforcement Learning in Agentic Reasoning},
  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
  journal={arXiv preprint arXiv:2510.11701},
  year={2025}
}
```