DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
Abstract
DeepWideSearch is a benchmark that evaluates agents' ability to integrate deep reasoning and wide-scale information collection, revealing significant challenges and limitations in current agent architectures.
Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications such as comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each item requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating deep and wide search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
Community
The DeepWideSearch benchmark is designed to evaluate LLM-based agents on simultaneous deep reasoning over multi-hop retrieval and wide-scale information collection, a critical capability for real-world tasks like market analysis and business development. An example question: "List all second-tier suppliers of Apple's AirPods, with contact info, location, and certification status." The output of such a task is a table: rows are the candidate answers to the question, and columns are the attributes the question requires collecting for each candidate.
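The rows-as-candidates, columns-as-attributes output described above can be sketched as a small data structure. This is an illustrative sketch only, not the benchmark's official answer format; the supplier names and attribute values below are hypothetical.

```python
# Illustrative sketch of a DeepWideSearch-style answer table:
# each row is one candidate entity, each column is an attribute
# the question asks the agent to collect. All values are made up.
from dataclasses import dataclass, field

@dataclass
class AnswerTable:
    columns: list                               # attributes requested by the question
    rows: list = field(default_factory=list)    # one dict per candidate answer

    def add_row(self, **attrs):
        # Keep only the requested attributes; anything the agent
        # failed to retrieve defaults to None.
        self.rows.append({c: attrs.get(c) for c in self.columns})

# Example question: "List all second-tier suppliers of Apple's AirPods,
# with contact info, location, and certification status."
table = AnswerTable(columns=["supplier", "contact", "location", "certified"])
table.add_row(supplier="ExampleCo", contact="info@example.com",
              location="Shenzhen", certified=True)
table.add_row(supplier="SampleParts", location="Taipei")  # contact unknown

print(len(table.rows))           # number of candidates found: 2
print(table.rows[1]["contact"])  # missing attribute -> None
```

Width corresponds to the number of rows the agent must enumerate; depth corresponds to the multi-hop retrieval needed to fill in each cell.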
Github: https://github.com/AIDC-AI/Marco-Search-Agent
Huggingface: https://huggingface.co/datasets/AIDC-AI/DeepWideSearch