Scale AI
MCP-Atlas Benchmark for Tool-Use Competency
Pages
20
Time to read
44 mins
Publication
Language
English
This technical report introduces MCP-Atlas, a large-scale benchmark for evaluating the tool-use competency of Large Language Models (LLMs) over the Model Context Protocol (MCP). MCP-Atlas comprises 1,000 tasks that require a model to orchestrate multiple tools drawn from 36 real MCP servers exposing 220 tools. Each task requires 3-6 tool calls and is posed as a natural-language prompt that does not name the tools to use, so it tests both tool discovery and orchestration. Evaluation uses a claims-based rubric that awards partial credit according to the factual claims satisfied in the model's final answer. The report details the benchmark's design, including systematic distractors that challenge agents' tool selection. Results indicate that top models achieve pass rates exceeding 50%, with common errors arising in tool usage and task understanding. The report is accompanied by a public release of a 500-task subset to support reproducible evaluation.
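To make the claims-based rubric concrete, here is a minimal sketch of how partial credit could be computed from per-claim judgments. It is an illustration under stated assumptions, not the report's actual scoring code: the names `ClaimResult`, `coverage_score`, and `passes` are hypothetical, claims are assumed to be equally weighted, and the `threshold` default is a placeholder rather than a value taken from the report. How each claim is judged as satisfied (for example by a human or an LLM grader) is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class ClaimResult:
    """One factual claim from a task's rubric and whether the answer satisfied it."""
    claim: str
    satisfied: bool


def coverage_score(results: list[ClaimResult]) -> float:
    """Partial credit: the fraction of rubric claims satisfied (equal weights assumed)."""
    if not results:
        raise ValueError("a task needs at least one claim")
    return sum(r.satisfied for r in results) / len(results)


def passes(results: list[ClaimResult], threshold: float = 0.75) -> bool:
    """Hypothetical pass rule: coverage at or above a threshold (placeholder value)."""
    return coverage_score(results) >= threshold


# Example: a final answer that satisfies three of four rubric claims.
example = [
    ClaimResult("names the correct repository", True),
    ClaimResult("reports the latest release tag", True),
    ClaimResult("gives the release date", True),
    ClaimResult("lists the number of open issues", False),
]
score = coverage_score(example)
```

Under these assumptions, a benchmark-level pass rate would be the share of tasks for which `passes` returns `True`, while the mean of `coverage_score` across tasks would reflect partial credit.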