Benchmark guide

Know what each score is actually testing before you choose a model.

GPQA, AIME, SWE-bench, Code Arena, MMMU, Toolathlon, MCP Atlas, and long-context metrics all answer different product questions.

Reasoning

GPQA Diamond

Graduate-level science questions used to separate frontier reasoning models.

Why it matters: Useful when choosing models for hard expert QA, scientific reasoning, and chained analysis.

GPQA · Reasoning index · Humanity's Last Exam

Source: LLM Stats · retrieved 2026-05-20

Math

AIME 2025

Competition-style math benchmark for exact multi-step problem solving.

Why it matters: Strong signal for symbolic reasoning, contest math, and strict answer checking.

AIME 2025 · Math index · FrontierMath

Source: LLM Stats · retrieved 2026-05-20
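
To make "strict answer checking" concrete, here is a minimal sketch of exact-match scoring. It relies on the fact that AIME answers are integers from 0 to 999; the function names and the parsing rules (whitespace trimmed, leading zeros accepted, anything else rejected) are assumptions for illustration and not the grader used by any particular leaderboard.

```python
from typing import List, Optional


def normalize_aime_answer(raw: str) -> Optional[int]:
    """Parse a final answer as an integer in 0..999; return None if it does not qualify."""
    text = raw.strip()
    # Accept only plain ASCII digits: no signs, decimals, units, or LaTeX.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    return value if 0 <= value <= 999 else None


def exact_match_accuracy(predictions: List[str], references: List[int]) -> float:
    """Fraction of predictions whose parsed value equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize_aime_answer(pred) == ref
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # "7.0" and "1000" are rejected by the strict parser, so only two of four match.
    print(exact_match_accuracy(["042", " 7 ", "7.0", "1000"], [42, 7, 7, 0]))
```

Strict parsing of this kind gives no partial credit: a response that reasons correctly but formats the final answer differently is scored as wrong.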

Coding

SWE-bench Verified

Software engineering benchmark focused on real issue resolution and code changes.

Why it matters: Best paired with Code Arena and security metrics before choosing an agent model.

SWE-bench Verified · SWE-bench Pro · Code Arena

Source: LLM Stats · retrieved 2026-05-20

Code Arena

Preference-style coding leaderboard comparing model outputs in developer tasks.

Why it matters: Adds practical taste and usefulness signals beyond static coding test scores.

Code Arena · Coding index · Terminal Bench

Source: LLM Stats · retrieved 2026-05-20

Multimodal

MMMU / MMMU-Pro

Multimodal academic and visual reasoning benchmark family.

Why it matters: Helpful for picking models that must reason over diagrams, charts, images, and text together.

MMMU · MMMU-Pro · Vision index

Source: LLM Stats · retrieved 2026-05-20

Tool calling

Toolathlon

Tool-use benchmark for planning, calling, and coordinating external capabilities.

Why it matters: Important for agent products where correctness depends on tools instead of chat text alone.

Toolathlon · MCP Atlas · Apex Agents

Source: LLM Stats · retrieved 2026-05-20

Agentic systems

MCP Atlas

Model Context Protocol (MCP) style benchmark for tool-rich agent workflows.

Why it matters: A useful companion metric for teams evaluating practical MCP and agent orchestration behavior.

MCP Atlas · Apex Agents · OSWorld

Source: LLM Stats · retrieved 2026-05-20

Long context

MRCR / Long Context

Long-document retention and retrieval-style benchmark signals.

Why it matters: Critical for legal, research, codebase, and knowledge-base workflows, where an advertised context window size can overstate how much a model actually retains and retrieves.

MRCR v2 · Context window · Long context index

Source: LLM Stats · retrieved 2026-05-20

Composite

Artificial Analysis Intelligence Index

Public composite intelligence index from Artificial Analysis.

Why it matters: Works as a second-source view when comparing model rankings against LLM Stats or arena data.

Intelligence Index · Price · Speed

Source: Artificial Analysis · retrieved 2026-05-20

Preference

Arena ratings

Human or preference-style arena comparisons across model families and modalities.

Why it matters: Useful to balance benchmark scores with perceived answer quality and preference wins.

Arena rating · Category rank · Source overlap

Source: Arena AI · retrieved 2026-05-20
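
To illustrate how pairwise preference wins can turn into a rating, here is a minimal sketch of a standard Elo update. This guide does not state how Arena AI computes its ratings, so treat the Elo formula, the K-factor of 32, and the function names as assumptions chosen for illustration; real leaderboards may fit a different model over all comparisons at once.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; controls how far one vote moves a rating
) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; A wins one preference vote.
    print(update_elo(1000.0, 1000.0, 1.0))
```

Because each update depends on the gap between the two ratings, a win over a much higher-rated model moves the rating more than a win over a weaker one, which is one reason arena ratings can diverge from static benchmark scores.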