DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Abstract
DeepWideSearch is a benchmark that evaluates agents' ability to integrate deep reasoning and wide-scale information collection, revealing significant challenges and limitations in current agent architectures.
Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications such as comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each item requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating deep and wide search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
Community
The DeepWideSearch benchmark is designed to evaluate LLM-based agents on simultaneous deep reasoning over multi-hop retrieval and wide-scale information collection, a critical capability for real-world tasks like market analysis and business development. An example question: "List all second-tier suppliers of Apple's AirPods, with contact info, location, and certification status." The output of such a task is a table: rows are the candidate answers to the question, and columns are the attributes the question requires collecting for each candidate.
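The rows-as-candidates, columns-as-attributes output described above can be sketched as a small data structure. This is an illustrative sketch only, not the benchmark's official answer format; the supplier names and attribute values below are hypothetical.

```python
# Illustrative sketch of a DeepWideSearch-style answer table:
# each row is one candidate entity, each column is an attribute
# the question asks the agent to collect. All values are made up.
from dataclasses import dataclass, field

@dataclass
class AnswerTable:
    columns: list                               # attributes requested by the question
    rows: list = field(default_factory=list)    # one dict per candidate answer

    def add_row(self, **attrs):
        # Keep only the requested attributes; anything the agent
        # failed to retrieve defaults to None.
        self.rows.append({c: attrs.get(c) for c in self.columns})

# Example question: "List all second-tier suppliers of Apple's AirPods,
# with contact info, location, and certification status."
table = AnswerTable(columns=["supplier", "contact", "location", "certified"])
table.add_row(supplier="ExampleCo", contact="info@example.com",
              location="Shenzhen", certified=True)
table.add_row(supplier="SampleParts", location="Taipei")  # contact unknown

print(len(table.rows))           # number of candidates found: 2
print(table.rows[1]["contact"])  # missing attribute -> None
```

Width corresponds to the number of rows the agent must enumerate; depth corresponds to the multi-hop retrieval needed to fill in each cell.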
Github: https://github.com/AIDC-AI/Marco-Search-Agent
Huggingface: https://huggingface.co/datasets/AIDC-AI/DeepWideSearch