LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild
Abstract
LiveResearchBench and DeepEval together provide a comprehensive framework for evaluating deep research systems across daily-life, enterprise, and academic domains, with a focus on real-time web search, synthesis, and citation-grounded long-form reports.
Deep research -- producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources -- marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
Community
Excited to share our work on deep research!
In this work, we argue that four task design principles are essential for fair comparison in deep research: (1) user-centric, (2) dynamic, (3) unambiguous, and (4) multi-faceted & search-intensive, and LiveResearchBench is guided entirely by these. It includes 100 expert-curated questions with detailed checklists, built through 1,500+ hours of human effort and an 11-step curation and validation pipeline to ensure the benchmark truly satisfies those criteria.
Evaluating open-ended, long-form reports is equally challenging. Simply assigning scores or preferences with a single LLM-as-a-judge leads to unstable, unreliable assessments. To address this, we propose DeepEval, a framework with six metrics that make long-form evaluation robust, so models can’t “hack” the evaluation by generating shorter or safer reports. Using an LLM-ensemble-as-judge, each metric follows a tailored protocol: checklist-based for presentation & coverage, pointwise (additive) for consistency & citation association, pairwise comparison for depth of analysis, and rubric-tree for citation accuracy. This setup yields stable assessment and high agreement with human experts!
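To make the protocol distinctions concrete, here is a minimal Python sketch of how an LLM-ensemble-as-judge could aggregate verdicts under the checklist, pointwise-additive, and pairwise protocols. This is an illustrative assumption, not the DeepEval implementation: `call_judge`, the judge model names, and the prompts are hypothetical placeholders.

```python
# Minimal sketch (not the DeepEval implementation): aggregating verdicts from an
# LLM ensemble under three of the protocols described above. `call_judge` is a
# hypothetical stand-in for a single judge-model call that returns a number.
from statistics import mean
from typing import Callable, List

JudgeFn = Callable[[str, str], float]  # (judge_model, prompt) -> numeric verdict

JUDGES = ["judge-a", "judge-b", "judge-c"]  # assumed ensemble members

def checklist_score(report: str, checklist: List[str], call_judge: JudgeFn) -> float:
    """Checklist protocol (presentation & coverage): fraction of checklist items
    each judge marks as satisfied (0/1), averaged over the ensemble."""
    per_judge = []
    for judge in JUDGES:
        hits = sum(
            call_judge(judge, f"Return 1 if the report satisfies '{item}', else 0.\n\n{report}")
            for item in checklist
        )
        per_judge.append(hits / max(len(checklist), 1))
    return mean(per_judge)

def pointwise_additive_score(report: str, criteria: List[str], call_judge: JudgeFn) -> float:
    """Pointwise additive protocol (consistency & citation association):
    each satisfied criterion adds one point; ensemble totals are averaged."""
    return mean(
        sum(call_judge(judge, f"Award 1 point if the report meets '{c}', else 0.\n\n{report}")
            for c in criteria)
        for judge in JUDGES
    )

def pairwise_preference(report_a: str, report_b: str, call_judge: JudgeFn) -> float:
    """Pairwise protocol (depth of analysis): majority vote over the ensemble;
    returns 1.0 if report A is preferred, else 0.0."""
    votes = [
        call_judge(judge, f"Return 1 if report A shows deeper analysis than B, else 0.\n\nA:\n{report_a}\n\nB:\n{report_b}")
        for judge in JUDGES
    ]
    return float(mean(votes) > 0.5)
```

The rubric-tree protocol for citation accuracy would decompose each citation into hierarchical checks before scoring; it is omitted here to keep the sketch short.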
With LiveResearchBench and DeepEval, we benchmark 17 open and proprietary agentic systems, uncovering current strengths, recurring failure modes, and many interesting insights, including that most systems today act more like deep searchers than deep researchers.
One open challenge we find especially intriguing: for multi-agent systems, scaling up the number of subagents and tool calls can easily surface thousands (>3,000) of webpages. How can we compress this information without losing critical evidence? When content is partially redundant, how do we merge overlaps without discarding unique signals? And when everything is relevant and important but still exceeds context limits, how do we retain the essentials?
More details in the paper--you won’t regret it!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DRBench: A Realistic Benchmark for Enterprise Deep Research (2025)
- Towards Personalized Deep Research: Benchmarks and Evaluations (2025)
- DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis (2025)
- Understanding DeepResearch via Reports (2025)
- WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents (2025)
- FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning (2025)
- Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs (2025)