Benchmark guide

Know what each score is actually testing before you choose a model.

GPQA, AIME, SWE-bench, Code Arena, MMMU, Toolathlon, MCP Atlas, and long-context metrics all answer different product questions.

Reasoning

GPQA Diamond

Graduate-level science questions used to separate frontier reasoning models.

Why it matters: Useful when choosing models for hard expert QA, scientific reasoning, and chained analysis.
GPQA · Reasoning index · Humanity's Last Exam
Source: LLM Stats · retrieved 2026-05-20

Math

AIME 2025

Competition-style math benchmark for exact multi-step problem solving.

Why it matters: Strong signal for symbolic reasoning, contest math, and strict answer checking.
AIME 2025 · Math index · FrontierMath
Source: LLM Stats · retrieved 2026-05-20

Coding

SWE-bench Verified

Software engineering benchmark focused on real issue resolution and code changes.

Why it matters: Best paired with Code Arena and security metrics before choosing an agent model.
SWE-bench Verified · SWE-bench Pro · Code Arena
Source: LLM Stats · retrieved 2026-05-20

Code Arena

Preference-style coding leaderboard comparing model outputs in developer tasks.

Why it matters: Adds practical taste and usefulness signals beyond static coding test scores.
Code Arena · Coding index · Terminal Bench
Source: LLM Stats · retrieved 2026-05-20

Multimodal

MMMU / MMMU-Pro

Multimodal academic and visual reasoning benchmark family.

Why it matters: Helpful for picking models that must reason over diagrams, charts, images, and text together.
MMMU · MMMU-Pro · Vision index
Source: LLM Stats · retrieved 2026-05-20

Tool calling

Toolathlon

Tool-use benchmark for planning, calling, and coordinating external capabilities.

Why it matters: Important for agent products where correctness depends on tools instead of chat text alone.
Toolathlon · MCP Atlas · Apex Agents
Source: LLM Stats · retrieved 2026-05-20

Agentic systems

MCP Atlas

Model Context Protocol style benchmark for tool-rich agent workflows.

Why it matters: A useful companion metric for teams evaluating practical MCP and agent orchestration behavior.
MCP Atlas · Apex Agents · OSWorld
Source: LLM Stats · retrieved 2026-05-20

Long context

MRCR / Long Context

Long-document retention and retrieval-style benchmark signals.

Why it matters: Critical for legal, research, codebase, and knowledge-base workflows, where an advertised context window size can overstate how much a model actually retains and retrieves.
MRCR v2 · Context window · Long context index
Source: LLM Stats · retrieved 2026-05-20

Composite

Artificial Analysis Intelligence Index

Public composite intelligence index from Artificial Analysis.

Why it matters: Works as a second-source view when comparing model rankings against LLM Stats or arena data.
Intelligence Index · Price · Speed
Source: Artificial Analysis · retrieved 2026-05-20
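
One way to use a composite index as a second-source check is to measure how closely two rankings agree. The sketch below computes a Spearman rank correlation in plain Python. The model names and rank positions are hypothetical placeholders, not real leaderboard data, and the function names are illustrative only. It assumes neither ranking contains ties.

```python
# Hypothetical rankings (1 = best). Names and positions are placeholders,
# not values taken from any real leaderboard.
composite_rank = {"model-a": 1, "model-b": 2, "model-c": 3, "model-d": 4}
arena_rank = {"model-a": 2, "model-b": 1, "model-c": 3, "model-d": 4}


def rerank(ranks, names):
    """Re-number ranks 1..n using only the models in `names`."""
    ordered = sorted(names, key=lambda name: ranks[name])
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(rank_x, rank_y):
    """Spearman rank correlation over the models both sources list.

    Assumes no tied ranks within either source.
    """
    shared = sorted(set(rank_x) & set(rank_y))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")
    x = rerank(rank_x, shared)
    y = rerank(rank_y, shared)
    squared_gaps = sum((x[name] - y[name]) ** 2 for name in shared)
    return 1 - (6 * squared_gaps) / (n * (n * n - 1))


def largest_gaps(rank_x, rank_y):
    """Shared models sorted by how far apart the two sources place them."""
    shared = sorted(set(rank_x) & set(rank_y))
    x = rerank(rank_x, shared)
    y = rerank(rank_y, shared)
    return sorted(shared, key=lambda name: abs(x[name] - y[name]), reverse=True)


agreement = spearman(composite_rank, arena_rank)
print(f"rank agreement: {agreement:.2f}")
print("largest disagreements first:", largest_gaps(composite_rank, arena_rank))
```

A value near 1 means the two sources order the models similarly. A lower value suggests looking at the models with the largest rank gaps before trusting either ranking alone.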

Preference

Arena ratings

Human or preference-style arena comparisons across model families and modalities.

Why it matters: Useful for balancing benchmark scores against perceived answer quality and preference wins.
Arena rating · Category rank · Source overlap
Source: Arena AI · retrieved 2026-05-20