arxiv:2603.15594

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Published on Mar 16 · Submitted by yuwendu on Mar 17 · #3 Paper of the day

Abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the broader research community's progress in this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., both model and data) to achieve frontier-level performance, built on two core technical innovations: (1) fact-grounded, scalable, controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex multi-hop reasoning tasks with controllable coverage and complexity; and (2) denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise trajectories, thereby enabling the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks, including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent, DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
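To make the first innovation concrete, here is a minimal, hypothetical sketch of the idea behind fact-grounded QA synthesis: expand a reasoning path through a small entity graph, then obfuscate the named entities into indirect descriptions so the resulting question requires multi-hop search to resolve. The toy graph, obfuscation table, and question template below are all invented for illustration and are not the authors' actual pipeline.

```python
# Toy knowledge graph: entity -> list of (relation, neighbor) edges.
GRAPH = {
    "Ada Lovelace": [("collaborated with", "Charles Babbage")],
    "Charles Babbage": [("designed", "Analytical Engine")],
}

# Entity obfuscation: swap a name for an indirect description so an agent
# must resolve each clue by searching rather than by string matching.
OBFUSCATE = {
    "Ada Lovelace": "the mathematician often called the first programmer",
    "Charles Babbage": "the inventor who worked with her",
}

def expand(start: str, hops: int):
    """Topologically expand a reasoning path `hops` edges outward."""
    path, current = [], start
    for _ in range(hops):
        edges = GRAPH.get(current, [])
        if not edges:
            break
        relation, neighbor = edges[0]
        path.append((current, relation, neighbor))
        current = neighbor
    return path

def synthesize(start: str, hops: int = 2):
    """Build a multi-hop question; the answer is the path's final entity."""
    path = expand(start, hops)
    answer = path[-1][2]
    # Obfuscate every entity along the chain; the answer never appears.
    clues = [f"{OBFUSCATE.get(head, head)} {relation}"
             for head, relation, _ in path]
    question = ("What entity results from this chain: "
                + ", then ".join(clues) + "?")
    return question, answer

question, answer = synthesize("Ada Lovelace")
```

Scaling the graph and varying the hop count would be one way to realize the "controllable coverage and complexity" the abstract describes.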

Community

Paper author · Paper submitter

🔓 Shattering the Corporate Data Moat: Meet OpenSeeker, the First Fully Open-Source Frontier Search Agent by a Purely Academic Team.

High-performance search agents have long been a "closed-door game": while model weights are often open, the high-quality training data (the real secret sauce) has remained hidden. We introduce OpenSeeker, a purely academic initiative that achieves SOTA search performance while open-sourcing everything: the models and 100% of the training data.

🚀 Why OpenSeeker is a Game Changer:

🔥 Efficiency Over Scale: We show you don't need millions of samples. OpenSeeker achieves frontier performance with just 11.7k synthesized samples and single-run SFT, with no complex RL or continual pre-training.
🎓 Empowering Academia: We provide the missing high-quality data foundation, enabling researchers to build next-gen agents without corporate-scale resources.

⚔️ Beating Industry Giants:
🇨🇳 48.4% on BrowseComp-ZH: Surpassing Alibaba's Tongyi DeepResearch (46.7%)! We beat their massive CPT + SFT + RL pipeline using SFT only.
🌍 SOTA among ~30B SFT models: 29.5% (BrowseComp), 74.0% (xbench-DeepSearch), 59.4% (WideSearch).

🧪 The "Secret Sauce" Behind the Data:
🕸️ Fact-Grounded QA Synthesis: We reverse-engineer the web graph using Entity Obfuscation to generate complex multi-hop queries.
🧹 Denoised Trajectory Synthesis: Our Asymmetric Context Training teaches students to predict expert actions from raw, noisy HTML, mastering robust information extraction.
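A minimal sketch of the asymmetric-context idea described above, with everything invented for illustration (the `FACT:`-tag summarizer stands in for an LLM, and the trajectory is a toy): past steps in the training context are denoised summaries, while the current observation stays raw and noisy, so the student learns to extract expert actions from messy pages.

```python
def summarize(raw_observation: str, max_len: int = 200) -> str:
    """Stand-in for an LLM summarizer: keep only lines tagged FACT:."""
    facts = [line for line in raw_observation.splitlines()
             if line.startswith("FACT:")]
    return " ".join(facts)[:max_len]

def build_sft_examples(trajectory):
    """trajectory: list of (raw_observation, teacher_action) pairs."""
    examples, clean_history = [], []
    for raw_obs, action in trajectory:
        clean_history.append(summarize(raw_obs))
        examples.append({
            # Past steps are denoised summaries (the clean context)...
            "context": " | ".join(clean_history[:-1]),
            # ...but the current observation stays raw and noisy, so the
            # student must learn robust extraction from messy pages.
            "observation": raw_obs,
            "target_action": action,
        })
    return examples

trajectory = [
    ("<div>ad ad ad</div>\nFACT: the page lists 2021 award winners",
     "search('2021 award winner list')"),
    ("<nav>menu menu</nav>\nFACT: the winner is Jane Doe",
     "answer('Jane Doe')"),
]
sft_examples = build_sft_examples(trajectory)
```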

📦 The "Open" in OpenSeeker: We release the full recipe to democratize deep research: ✅ 11.7k High-Difficulty Data (QA + Trajectories) ✅ OpenSeeker-v1-30B-SFT Model
👨‍💻 GitHub: https://github.com/rui-ye/OpenSeeker
🤗 Data: https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data
🤗 Models: https://huggingface.co/OpenSeeker/OpenSeeker-v1-30B-SFT


Thanks a lot for open-sourcing!


The part that stuck with me is how denoised trajectory synthesis via retrospective summarization lets a 30B model trained on only 11.7k samples reach frontier-level web-search behavior. Coupled with fact-grounded, scalable, controllable QA synthesis, it looks like the data generation design is doing the heavy lifting here, not just model size. Btw, the arxivlens breakdown helped me parse the method details; I found a nice walkthrough here: https://arxivlens.com/PaperView/Details/openseeker-democratizing-frontier-search-agents-by-fully-open-sourcing-training-data-6694-010b2438. I would love to see an ablation removing the retrospective summarization to quantify how much the denoising step actually contributes. The open data and model release is a nice push for reproducibility, and I hope the community builds richer benchmarks to stress-test the approach.

Paper author

Thank you for your interest! We have verified that retrospective summarization improves the BrowseComp score of GPT-OSS-120B from 36 to 45. This suggests that retrospective summarization helps the model answer questions more effectively and also leads to higher-quality synthesized trajectories. However, because generating trajectories and training the model are both costly, we did not run the additional ablation in which trajectories are synthesized without retrospective summarization and then used to train the model.


Models citing this paper 1

Datasets citing this paper 1


Collections including this paper 3