
Anthropic · proprietary

Claude Sonnet 4.5

Claude Sonnet 4.5 by Anthropic appears in 2 sources, with a Reasoning index of 40.2. Best read for: code quality, coding, LLM.

Creator: Anthropic
Release date: 2025-09-29
Knowledge cutoff: Not published
Context: 200K tokens
Input price: $3/M tokens
Output price: $15/M tokens
Modality: text + vision
Country: US
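
The listed prices translate into a per-request cost with simple arithmetic. The sketch below is a minimal illustration, assuming the two list prices above apply uniformly to every token; `estimate_cost` is a hypothetical helper name, not part of any Anthropic or EvalKit API, and it ignores any discounts or surcharges a provider may apply.

```python
# Prices copied from the profile above, in USD per million tokens.
INPUT_PRICE_PER_M = 3.0
OUTPUT_PRICE_PER_M = 15.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Hypothetical helper: USD cost of one request at the listed prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# 10,000 input tokens cost $0.03 and 2,000 output tokens cost $0.03.
print(f"${estimate_cost(10_000, 2_000):.2f}")  # $0.06
```

Output tokens are priced at five times the input rate, so a short prompt with a long completion can cost more than a long prompt with a short one.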

Metrics

All source-backed metrics

LLM Stats Rank: 16 (ranking · source_rank)
Reasoning: 40.2 (reasoning · index_reasoning)
Math: 33.26 (math · index_math)
Coding: 31.46 (coding · index_code)
Writing: 35.23 (writing · index_communication)
Vision: 18.73 (multimodal · index_vision)
Tool calling: 33.59 (tool_calling · index_tool_calling)
Long context: 19.19 (long_context · index_long_context)
Healthcare: 16.6 (domain · index_healthcare)
GPQA: 83.4% (reasoning · gpqa_score)
AIME 2025: 87% (math · aime_2025_score)
Code Arena: 1,412.14 (coding · coding_arena_score)
MMMLU: 89.1% (reasoning · mmmlu_score)
OSWorld: 61.4% (agent · osworld_score)
Terminal Bench: 50% (coding · terminal_bench_score)
TAU2 Retail: 86.2% (agent · tau_bench_retail_score)
Context: 200,000 tokens (context · context)
Speed: 117.59 c/s (performance · throughput)
Latency: 932.75 ms (performance · latency)
Input price: $3/M (pricing · input_price)
Output price: $15/M (pricing · output_price)
SWE-Bench Verified: 82% (metric)

Evidence

Citations and source overlap

FAQ

How should I read this profile?

Treat this as a source-backed model dossier, not an EvalKit-run verification. The public values are replicated from linked sources and kept source-scoped.

Is Claude Sonnet 4.5 verified by EvalKit?

No. EvalKit currently shows 0 verified rows until real run evidence exists.

Why can metrics disagree?

Different sources test different tasks, dates, prompts, and aggregation methods. EvalKit keeps those differences visible instead of merging them into a fake universal score.
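
One way to picture "source-scoped" storage is a table where every value stays attached to the source that reported it. The sketch below is only an illustration of that idea, not EvalKit's actual data model: `MetricRow`, `group_by_metric`, and `verified_count` are hypothetical names, and the sources and values are made-up placeholders rather than real benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRow:
    """Hypothetical record: one value as reported by one source."""
    source: str
    metric: str
    value: float
    verified: bool = False  # would be True only after a real evaluation run


def group_by_metric(rows):
    """Keep each source's value side by side instead of averaging them."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row.metric][row.source] = row.value
    return dict(grouped)


def verified_count(rows):
    """Count the rows that are backed by a verified run."""
    return sum(1 for row in rows if row.verified)


# Placeholder data: two imaginary sources that disagree on one metric.
rows = [
    MetricRow(source="source_a", metric="example_metric", value=1.0),
    MetricRow(source="source_b", metric="example_metric", value=2.0),
]

print(group_by_metric(rows))  # {'example_metric': {'source_a': 1.0, 'source_b': 2.0}}
print(verified_count(rows))   # 0
```

Because the values are never merged, a reader can see that the two sources disagree and check each one against its own methodology.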