## 🎯 About This Repository
This repository contains the DemyAgent-4B model weights, a 4B-parameter agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained with our GRPO-TCR recipe on 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.
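A minimal loading sketch with 🤗 Transformers is shown below; the prompt and generation settings are illustrative placeholders, not the exact configuration used in our evaluations.

```python
# Minimal sketch: load DemyAgent-4B with Hugging Face Transformers.
# Generation settings here are illustrative, not the evaluation setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Gen-Verse/DemyAgent-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```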
## 📖 Introduction
In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

- 🎯 **Data Quality Matters**: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
- ⚡ **Training Efficiency**: Exploration-friendly techniques such as reward clipping and entropy maintenance boost training efficiency (see the sketch after this list)
- 🧠 **Reasoning Strategy**: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning

We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.
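The GRPO-TCR recipe itself is detailed in our paper; purely as an illustration of the exploration-friendly ideas above, the sketch below shows group-relative advantage estimation with reward clipping. The function name, clip bound, and normalization constant are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def group_relative_advantages(rewards, reward_clip=3.0, eps=1e-6):
    """Illustrative GRPO-style advantages for one prompt's rollout group.

    `reward_clip` is a hypothetical bound: clipping extreme rewards keeps a
    few lucky rollouts from dominating the update, which helps preserve
    exploration during training.
    """
    r = np.clip(np.asarray(rewards, dtype=np.float64), -reward_clip, reward_clip)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 rollouts for one prompt with sparse 0/1 correctness rewards.
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0]))
```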
## 📦 Resources
| Type | Name | Link |
|---|---|---|
| 📊 Dataset | 3K Agentic SFT Data | 🤗 HuggingFace |
| 📊 Dataset | 30K Agentic RL Data | 🤗 HuggingFace |
| 🤖 Model | Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | Qwen3-4B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | DemyAgent-4B | 🤗 HuggingFace |
**Note:**
- Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are fine-tuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507, respectively, using our 3K Agentic SFT Data.
- DemyAgent-4B is trained through agentic RL on our 30K Agentic RL Data using the GRPO-TCR recipe.
## 📊 Performance
We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.
### Benchmark Results
| Method | AIME2024 (Math) | AIME2025 (Math) | GPQA-Diamond (Science) | LiveCodeBench-v6 (Code) |
|---|---|---|---|---|
| **Self-Contained Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| **Agentic Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| **DemyAgent-4B (Ours)** | **72.6** | **70.0** | **58.5** | **26.8** |
### Key Highlights
✨ Despite having only 4B parameters, DemyAgent-4B achieves:
- 🥇 **State-of-the-art on AIME2025** (70.0%), outperforming even DeepSeek-R1-Zero (671B)
- 🥈 **Second place among agentic methods on AIME2024** (72.6%) and GPQA-Diamond (58.5%)
- 🚀 **Competitive performance** against 14B-32B models with 4-8× fewer parameters
- 💡 **Superior efficiency** compared to long-CoT models through deliberative tool use (see the sketch below)
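To make "deliberative tool use" concrete, the sketch below shows the general shape of such an agent loop: the model reasons in text and only occasionally emits a tool call, which is executed and fed back. The tags, helper names, and call budget are illustrative assumptions, not our actual scaffold.

```python
import re

def deliberative_step(model_generate, prompt, run_python, max_tool_calls=2):
    """Hypothetical agentic loop with selective (not per-step) tool calls.

    `model_generate` and `run_python` are placeholder callables; the
    <tool>/<result> tags are illustrative, not our actual format.
    """
    transcript = prompt
    for _ in range(max_tool_calls):
        completion = model_generate(transcript)
        transcript += completion
        call = re.search(r"<tool>(.*?)</tool>", completion, re.DOTALL)
        if call is None:  # no tool call: the model answered directly
            break
        result = run_python(call.group(1))
        transcript += f"\n<result>{result}</result>\n"
    return transcript
```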
## 📝 Citation
```bibtex
@article{yu2025demystify,
  title={Demystifying Reinforcement Learning in Agentic Reasoning},
  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
  journal={arXiv preprint arXiv:2510.11701},
  year={2025}
}
```