arXiv:2509.24002

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

Published on Sep 28
· Submitted by Zijian Wu on Oct 1
#2 Paper of the day

Abstract

MCPMark is a comprehensive benchmark for evaluating MCP use in real-world workflows, featuring diverse tasks that require richer interactions with the environment, and reveals that current LLMs perform poorly on these tasks.

AI-generated summary

MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
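To picture the evaluation setup, the sketch below shows the general shape of a minimal agent that operates in a tool-calling loop against an MCP server. It is an illustration under assumptions, not the MCPMark-Agent code; `call_model`, `execute_tool`, and the message format are hypothetical stand-ins for whatever the real harness provides.

```python
# Minimal sketch of a tool-calling agent loop (illustrative only; not the
# MCPMark-Agent implementation). `call_model` and `execute_tool` are
# hypothetical callables standing in for the LLM API and the MCP server.
from dataclasses import dataclass, field


@dataclass
class Transcript:
    messages: list = field(default_factory=list)
    turns: int = 0
    tool_calls: int = 0


def run_task(task_prompt, call_model, execute_tool, max_turns=50):
    """One episode: the model proposes tool calls, the environment executes
    them and feeds results back, until the model stops calling tools or the
    turn budget runs out."""
    t = Transcript(messages=[{"role": "user", "content": task_prompt}])
    while t.turns < max_turns:
        t.turns += 1
        reply = call_model(t.messages)              # one LLM step
        t.messages.append(reply)
        if not reply.get("tool_calls"):             # no tools requested -> done
            break
        for tc in reply["tool_calls"]:              # run each requested tool
            t.tool_calls += 1
            result = execute_tool(tc["name"], tc["arguments"])
            t.messages.append({"role": "tool", "name": tc["name"],
                               "content": result})
    return t  # final state is then checked by the task's verification script
```

Whether an episode counts as a pass is decided by the task's programmatic verification script against the environment's final state; the 16.2 turns and 17.4 tool calls quoted above are averages over such episodes.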

Community

Paper author · Paper submitter

Agents can call tools, but can they actually deliver?
MCPMark stress-tested >30 models through 127 CRUD-heavy tasks across 5 MCP servers, with a minimal but general MCPMark-Agent ensuring fair comparison.
Results: even the best models cap at 52.56% pass@1 / 33.86% pass^4 (a sketch of these metrics follows the links below), while other strong systems like claude-sonnet-4 and o3 stay under 30% pass@1.
We break down why: from implicit errors and context drift to cost-performance tradeoffs.

👉 Paper: https://arxiv.org/pdf/2509.24002
👉 Website: https://mcpmark.ai/
👉 Code: https://github.com/eval-sys/mcpmark
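On the pass@1 and pass^4 numbers quoted above: a common reading is that pass@1 averages success over independent runs, while pass^k credits a task only if all k runs succeed. A minimal sketch under that assumption (the `results` layout is hypothetical, purely for illustration):

```python
# Hedged sketch of the two metrics as commonly defined: pass@1 is the mean
# success rate across runs, pass^k counts a task as solved only if every one
# of its k runs succeeds. The task -> list-of-booleans layout is assumed.
def pass_at_1(results: dict[str, list[bool]]) -> float:
    runs = [ok for attempts in results.values() for ok in attempts]
    return sum(runs) / len(runs)


def pass_pow_k(results: dict[str, list[bool]], k: int = 4) -> float:
    tasks = list(results.values())
    return sum(all(attempts[:k]) for attempts in tasks) / len(tasks)


# Example: two tasks, four runs each.
results = {"task_a": [True, True, False, True],
           "task_b": [True, True, True, True]}
print(pass_at_1(results))      # 0.875
print(pass_pow_k(results, 4))  # 0.5
```

Under this reading, pass^k can never exceed pass@1, which is consistent with the reported 52.56% vs. 33.86%; the gap reflects run-to-run instability.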

Paper author · Paper submitter

I would like to know how the newly released Claude 4.5 Sonnet would have compared on this benchmark. New models are always coming; thanks for the excellent work!

Paper author

Hi, thanks for your interest! You can find claude-sonnet-4.5 results here. For more discussion, you can refer to our X.
[image: claude-sonnet-4.5 results]

