arXiv:2510.20168

DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

Published on Oct 23 · Submitted by Tian Lan on Oct 23
Abstract

AI-generated summary

DeepWideSearch is a benchmark that evaluates agents' ability to integrate deep reasoning and wide-scale information collection, revealing significant challenges and limitations in current agent architectures.

Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications such as comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data items, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve an average success rate of only 2.39% on DeepWideSearch, highlighting the substantial challenge of integrating deep and wide search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow. These expose key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.

Community

The DeepWideSearch benchmark is designed to evaluate LLM-based agents on simultaneous deep reasoning over multi-hop retrieval and wide-scale information collection, a critical capability for real-world tasks like market analysis and business development. An example question: "List all second-tier suppliers of Apple's AirPods, with contact info, location, and certification status." The expected output of such a task is a table: rows are candidate answers to the question, and columns are the attributes the question requires the agent to collect for each candidate, as illustrated in the sketch below.
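To make the output format concrete, here is a minimal sketch of one possible answer table for the AirPods question. The column names and all values are hypothetical placeholders for illustration, not taken from the benchmark.

```python
# Hypothetical illustration of the tabular output format:
# each row is one candidate answer, and each column is an
# attribute the question asks the agent to collect.
# Column names and values are invented for illustration;
# they are not the benchmark's actual schema or data.
answer_table = [
    {
        "supplier": "ExampleTech Co.",
        "contact_info": "sales@exampletech.example",
        "location": "Shenzhen, China",
        "certification_status": "ISO 9001 certified",
    },
    {
        "supplier": "Acme Components Ltd.",
        "contact_info": "+1-555-0100",
        "location": "Taipei, Taiwan",
        "certification_status": "Pending audit",
    },
]
```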

GitHub: https://github.com/AIDC-AI/Marco-Search-Agent
Hugging Face: https://huggingface.co/datasets/AIDC-AI/DeepWideSearch
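For quick inspection, here is a minimal sketch for loading the dataset with the `datasets` library. The split name and record fields are assumptions; check the dataset card for the actual schema.

```python
# Minimal sketch: load DeepWideSearch from the Hugging Face Hub.
# The split name ("train") is an assumption; consult the dataset
# card at huggingface.co/datasets/AIDC-AI/DeepWideSearch for the
# actual splits and fields.
from datasets import load_dataset

ds = load_dataset("AIDC-AI/DeepWideSearch", split="train")
print(ds[0])  # inspect one question record
```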

