Scale AI

MCP-Atlas Benchmark for Tool-Use Competency

Time to read

44 mins

Publication

12/18/25

Language

English

Summary

This technical report introduces MCP-Atlas, a large-scale benchmark for evaluating the tool-use competency of Large Language Models (LLMs) through the Model Context Protocol (MCP). MCP-Atlas comprises 1,000 tasks that assess how well LLMs orchestrate multiple tools across 36 real MCP servers and 220 tools. Each task requires 3-6 tool calls and is posed as a natural-language prompt that does not name the tools to use, so models must discover and orchestrate the right tools themselves. Evaluation uses a claims-based rubric that awards partial credit according to the factual claims satisfied in the model's final answer. The report details the benchmark's design, including systematic distractors that challenge agents' tool selection. Results indicate that top models achieve pass rates exceeding 50%; the common errors identified involve tool usage and task understanding. A 500-task subset is publicly released to support reproducible evaluation.
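The claims-based rubric described in the summary can be illustrated with a minimal sketch. It assumes each task has a list of expected factual claims and that some judge has already marked each claim as satisfied or not. The `TaskResult` structure, the function names, and the pass threshold are all hypothetical; the summary does not state how claims are judged or what coverage counts as a pass.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-task judgments: one boolean per rubric claim (hypothetical structure)."""
    task_id: str
    claims_satisfied: list[bool]


def coverage(result: TaskResult) -> float:
    """Partial credit: fraction of rubric claims the final answer satisfies."""
    if not result.claims_satisfied:
        raise ValueError(f"task {result.task_id} has no claims")
    return sum(result.claims_satisfied) / len(result.claims_satisfied)


def pass_rate(results: list[TaskResult], threshold: float) -> float:
    """Share of tasks whose coverage meets `threshold` (an assumed cutoff)."""
    if not results:
        raise ValueError("no results")
    passed = sum(1 for r in results if coverage(r) >= threshold)
    return passed / len(results)


# Illustrative judgments only, not benchmark data.
results = [
    TaskResult("task-a", [True, True, True, False]),    # 3 of 4 claims satisfied
    TaskResult("task-b", [True, False, False, False]),  # 1 of 4 claims satisfied
]
```

Under these assumptions, `coverage(results[0])` gives the partial credit for one task, and `pass_rate(results, threshold=0.75)` aggregates per-task credit into a single pass rate of the kind the summary reports.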
