
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

  • Writer: Cerebralink Neurotech Consultant
  • Sep 18
  • 4 min read

By Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, and Eugene Siow (Center for Advanced AI, Accenture; UC Berkeley)




The paper introduces MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and the planning and reasoning needed to solve them. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input–output coupling. In addition, tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows; these capabilities are not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage as well as trajectory-level planning and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
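As a concrete illustration of what connecting an LLM agent to a live MCP server involves at the protocol level, here is a minimal sketch using the official MCP Python SDK to list a server's tools and invoke one of them. The server command, tool name, and arguments are placeholders for illustration and are not taken from MCP-Bench itself.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Placeholder: launch a hypothetical MCP server over stdio.
    params = StdioServerParameters(command="python", args=["example_server.py"])

    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools the server exposes (name, description, input schema).
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

            # Invoke one tool; "search_papers" and its arguments are illustrative only.
            result = await session.call_tool(
                "search_papers", arguments={"query": "tool-use benchmarks"}
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```

Because every server speaks the same protocol, the same client loop can be pointed at any of the 28 servers in the benchmark; only the launch parameters and tool names change.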


1. Introduction

Recent advances in large language models (LLMs) have enabled a new generation of tool-using agents that can interpret natural language instructions, plan multi-step workflows, and interact with external tools to solve complex tasks (OpenAI, 2025c; Comanici et al., 2025; Anthropic, 2025; Yang et al., 2025; Kimi et al., 2025; Zeng et al., 2025; Chen et al., 2025). Such agents are increasingly deployed in real-world domains such as travel (Xie et al., 2024), healthcare (Saab et al., 2024; Mehandru et al., 2024), and finance (Xiao et al., 2024), where solving user queries requires chaining multiple tools, reasoning over structured outputs, and coordinating interdependent operations.

Despite rapid progress in LLM agents, existing benchmarks for tool use remain fundamentally limited. Early efforts such as ToolBench (Qin et al., 2024) and BFCL v3 (Patil et al., 2025a) aggregate large collections of APIs, but these APIs are designed for isolated functionality. As a result, tasks often reduce to few-step tool calls or rely on artificially stitched pipelines, since tool inputs and outputs rarely align naturally across APIs. τ-Bench (Yao et al., 2025) moves a step further by selecting a small set of APIs whose interfaces are relatively compatible, enabling cleaner compositions. However, its coverage is limited to only a handful of domains and tools, making it difficult to scale task diversity or capture the complexity of realistic multi-domain workflows. Together, these benchmarks fall short in modeling realistic dependency chains and stress-testing long-horizon planning.

More recent benchmarks such as MCP-RADAR (Gao et al., 2025) and MCPEval (Liu et al., 2025a) begin to leverage the Model Context Protocol (MCP) (Anthropic et al., 2024), which provides a standardized invocation schema across servers. However, these benchmarks remain narrow in scope: they cover only a few servers with at most several dozen tools, which limits task diversity and makes most workflows relatively short (e.g., a single retrieval followed by a summary). In addition, both existing API-based and MCP-based tool-use benchmarks lack testing of planning capability under fuzzy instructions: tasks typically specify the tool name or execution steps explicitly, so agents are not challenged to infer which tools are appropriate when the instructions are underspecified. Furthermore, they omit evaluation of more complex scenarios such as multi-goal objectives (e.g., booking travel that requires coordinating flights, hotels, and local transport), evidence-based reasoning with information grounding (e.g., generating answers that cite intermediate tool results rather than hallucinating), and cross-domain orchestration (e.g., combining financial tools with news sources to explain stock movements). As summarized in Table 1, none of the existing benchmarks adequately reflects the complexity, fuzziness, and diversity inherent in real-world tool use.



Figure 1: MCP-Bench connects LLM agents to real-world MCP servers exposing 250 structured tools across domains such as finance, science, and research. Tasks are generated via LLM-based synthesis, then executed by the agent through multi-turn tool invocations. Each execution trajectory is evaluated using a combination of rule-based checks and LLM-as-a-Judge scoring, assessing agent performance in tool schema understanding, multi-hop planning, and real-world adaptability.

To overcome these limitations, we introduce MCP-Bench, a large-scale benchmark that evaluates LLM agents in realistic, ecosystem-based tool-use scenarios. As illustrated in Figure 1, MCP-Bench connects agents to a diverse ecosystem of production-grade MCP servers exposing 250 structured tools across domains such as finance, science, and research. Each server provides complementary tools designed to work together (e.g., a scientific computing server integrating data loading, matrix operations, and visualization), while the MCP protocol ensures consistent invocation schemas across servers. This combination enables both realistic intra-server dependency chains and complex cross-server, multi-hop workflows.

Tasks in MCP-Bench are generated automatically via an LLM-based synthesis pipeline: dependency chains are first discovered from tool I/O signatures and then translated into natural-language instructions, and a quality-filtering mechanism ensures solvability and realism. To assess agents in realistic scenarios, each task is further rewritten into a fuzzy, instruction-minimal variant that retains the core objective but omits explicit tool references and execution steps. Examples of MCP-Bench tasks can be found in Table 2 and Table 9.
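To make the synthesis step more tangible, the sketch below is a simplified, hypothetical version of dependency-chain discovery from tool I/O signatures. It is not the authors' pipeline: real MCP tools describe their inputs with JSON Schema, and the matching heuristic, tool names, and field names here are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToolSchema:
    """Simplified view of an MCP tool: required inputs and produced output fields."""
    name: str
    inputs: set[str]   # required input parameter names
    outputs: set[str]  # field names present in the structured output


def find_dependency_chains(tools: list[ToolSchema], max_len: int = 3) -> list[list[str]]:
    """Enumerate chains in which each tool's required inputs are covered by the previous tool's outputs."""
    chains: list[list[str]] = []

    def extend(chain: list[ToolSchema]) -> None:
        if len(chain) >= 2:
            chains.append([t.name for t in chain])
        if len(chain) == max_len:
            return
        used = {t.name for t in chain}
        last = chain[-1]
        for candidate in tools:
            # Heuristic coupling check: the candidate can run directly on the previous tool's output.
            if candidate.name not in used and candidate.inputs and candidate.inputs <= last.outputs:
                extend(chain + [candidate])

    for start in tools:
        extend([start])
    return chains


# Toy catalog with invented tools and fields; real MCP tools expose richer JSON Schemas.
catalog = [
    ToolSchema("search_ticker", {"company_name"}, {"ticker"}),
    ToolSchema("get_price_history", {"ticker"}, {"prices"}),
    ToolSchema("plot_series", {"prices"}, {"chart_url"}),
]

for chain in find_dependency_chains(catalog):
    print(" -> ".join(chain))
# search_ticker -> get_price_history
# search_ticker -> get_price_history -> plot_series
# get_price_history -> plot_series
```

In the paper's pipeline, chains like these are then turned into natural-language task descriptions and finally fuzzified, so the agent must rediscover the chain on its own rather than follow named tools step by step.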
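Similarly, the evaluation described in Figure 1 mixes deterministic rule-based checks with LLM-as-a-Judge scoring. Below is a minimal sketch, under stated assumptions, of how such signals could be combined; the schema-compliance check and the equal weighting are placeholders for illustration, not the paper's actual rubric.

```python
from jsonschema import ValidationError, validate  # third-party: pip install jsonschema


def schema_compliance(calls: list[dict], schemas: dict[str, dict]) -> float:
    """Rule-based signal: fraction of tool calls whose arguments validate against the tool's JSON Schema."""
    if not calls:
        return 0.0
    ok = 0
    for call in calls:
        try:
            validate(instance=call["arguments"], schema=schemas[call["tool"]])
            ok += 1
        except (KeyError, ValidationError):
            pass  # unknown tool or schema violation counts as a failed call
    return ok / len(calls)


def overall_score(calls: list[dict], schemas: dict[str, dict], judge_scores: dict[str, float]) -> float:
    """Blend rule-based and judge-based signals; the 50/50 weighting is an assumption, not the paper's."""
    rule_score = schema_compliance(calls, schemas)
    judge_score = sum(judge_scores.values()) / len(judge_scores)  # e.g. planning, grounding, completion
    return 0.5 * rule_score + 0.5 * judge_score


# Example usage with invented data.
calls = [{"tool": "search_ticker", "arguments": {"company_name": "Acme"}}]
schemas = {
    "search_ticker": {
        "type": "object",
        "required": ["company_name"],
        "properties": {"company_name": {"type": "string"}},
    }
}
print(overall_score(calls, schemas, {"planning": 0.8, "grounding": 0.7, "completion": 0.9}))
```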



 
 