---
datasets:
- DeepMath-103K
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- reinforcement-learning
- rlvr
- mcts
- math
- iclr-2026
model-index:
- name: DeepSearch-1.5B
  results:
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AIME 2024
      type: text
    metrics:
    - type: avg@32
      value: 53.65
    - type: avg@32
      value: 35.42
    - type: avg@32
      value: 90.39
    - type: avg@32
      value: 92.53
    - type: avg@32
      value: 40.0
    - type: avg@32
      value: 65.72
---
# 🚀 DeepSearch-1.5B

**DeepSearch-1.5B🌟** is a 1.5B-parameter reasoning model trained with **Reinforcement Learning with Verifiable Rewards (RLVR)**, enhanced by **Monte Carlo Tree Search (MCTS)**. Unlike prior approaches that restrict structured search to inference, DeepSearch integrates MCTS *into training*, enabling systematic exploration, fine-grained credit assignment, and efficient replay buffering.

This model achieves **state-of-the-art accuracy among 1.5B reasoning models** while being **5.7× more compute-efficient** than extended RL training baselines.

![Illustration of DeepSearch algorithm](./deepsearch.png)

---

## Model Details

- **Developed by**: Fang Wu\*, Weihao Xuan\*, Heli Qi\*, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
- **Institutional affiliations**: Stanford University, University of Tokyo, RIKEN AIP, University of Washington, UC Berkeley, Amazon AWS, Columbia University
- **Paper**: [DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search](https://huggingface.co/papers/2509.25454)
- **Code**: [GitHub](https://github.com/smiles724/DeepSearch)
- **Base Model**: Nemotron-Research-Reasoning-Qwen-1.5B v2
- **Parameters**: 1.5B
- **Framework**: veRL
- **License**: Apache-2.0

---

## Quickstart

### Environment

```
pip install vllm # vllm>=v0.8.5.post1 should work
pip install transformers # transformers>=4.52.4 should work
```

### Using vLLM to generate

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


def convert_question_to_messages(question: str):
    # Append the step-by-step instruction so the final answer is wrapped in \boxed{}.
    messages = [
        {"role": "user",
         "content": question + " Let's think step by step and output the final answer within \\boxed{}."}
    ]
    return messages


model_id = "fangwu97/DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768
)

model = LLM(
    model=model_id,
    tensor_parallel_size=1
)

# Render the chat template to a plain string; vLLM tokenizes it during generation.
prompt = tokenizer.apply_chat_template(
    convert_question_to_messages("Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."),
    add_generation_prompt=True,
    tokenize=False
)

outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
```

## Performance

All scores are accuracy (%); the model metadata reports the DeepSearch-1.5B values as avg@32.

| Benchmark | Nemotron-RR-Qwen-1.5B v2 | DeepSearch-1.5B |
|-----------|--------------------------|-----------------|
| AIME 2024 | 51.77 | **53.65** |
| AIME 2025 | 32.92 | **35.42** |
| AMC 2023 | 88.83 | **90.39** |
| MATH500 | 92.24 | **92.53** |
| Minerva | 39.75 | **40.00** |
| Olympiad | 64.69 | **65.72** |
| **Average** | 61.70 | **62.95** |

DeepSearch improves average accuracy by **1.25 points** over the best prior 1.5B model, while using **5.7× fewer GPU hours** than extended RL training.

## Training

- **Dataset**: DeepMath-103K (rigorously decontaminated)
- **Training steps**: 100
- **Search strategy**:
  - Global Frontier Selection
  - Entropy-based guidance
  - Replay buffer with solution caching
- **Hardware**: 16× NVIDIA H100 (96GB)
- **Compute**: ~330 GPU hours

---

## Ethical Considerations

- Positive: Reduces training costs and carbon footprint.
- Risks: Systematic exploration methods could be adapted to sensitive domains (e.g., code synthesis).
- Transparency: Full implementation and training details are released for reproducibility.
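
## Scoring Sketch

The Quickstart prompt asks the model to put its final answer inside `\boxed{}`, and the reported metric is avg@32, i.e. accuracy averaged over 32 sampled responses per problem. The snippet below is a minimal sketch of how such a score could be computed for a single problem. It is not the evaluation code used for the numbers above: the helper names `extract_boxed` and `avg_at_k` are hypothetical, the sample responses are made up, and exact string comparison is an assumption that a real math verifier would replace with answer normalization.

```python
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
    return None  # unbalanced braces: treat as no answer


def avg_at_k(responses: Sequence[str], reference: str) -> float:
    """Fraction of sampled responses whose boxed answer equals the reference string."""
    if not responses:
        raise ValueError("need at least one response")
    hits = sum(extract_boxed(r) == reference for r in responses)
    return hits / len(responses)


# Hypothetical sampled outputs for one problem (not real model output).
samples = [
    "... so the sum is \\boxed{70}.",
    "... therefore \\boxed{\\frac{1}{2}}",
    "I could not finish.",
    "First \\boxed{3}, but after checking, \\boxed{70}",
]
print(avg_at_k(samples, "70"))  # 2 of 4 samples match -> 0.5
```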

---

## Citation

```bibtex
@misc{wu2025deepsearch,
  title = {DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author = {Wu, Fang and Xuan, Weihao and Qi, Heli and Lu, Ximing and Tu, Aaron and Li, Li Erran and Choi, Yejin},
  year = {2025},
  eprint = {2509.25454},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  doi = {10.48550/arXiv.2509.25454},
}
```