Public snapshot workspace

LLM Leaderboard

Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.

How EvalKit scores work

The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
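The estimate rule described above can be sketched in a few lines. This is a minimal illustration rather than EvalKit's actual pipeline: the row layout, the `provider` key, and the function name `fill_with_provider_median` are assumptions made for the example, and the model names are placeholders.

```python
from statistics import median

def fill_with_provider_median(rows, metric):
    """Return one display string per row for `metric`.

    Measured values are shown as-is; missing values are replaced by the
    median of the same provider's measured values and prefixed with '~'.
    (Hypothetical helper: the real estimation rule may differ.)
    """
    measured = {}
    for row in rows:
        value = row.get(metric)
        if value is not None:
            measured.setdefault(row["provider"], []).append(value)

    cells = []
    for row in rows:
        value = row.get(metric)
        if value is not None:
            cells.append(f"{value:g}")
        elif row["provider"] in measured:
            cells.append(f"~{median(measured[row['provider']]):g}")
        else:
            cells.append("")  # nothing to estimate from
    return cells

# Placeholder rows, not leaderboard data.
rows = [
    {"model": "model-a", "provider": "ProviderX", "coding": 50.0},
    {"model": "model-b", "provider": "ProviderX", "coding": 40.0},
    {"model": "model-c", "provider": "ProviderX", "coding": None},
]
cells = fill_with_provider_median(rows, "coding")
print(cells)
```

Keeping the `~` marker attached to the displayed string, instead of storing the estimate as if it were a measurement, makes it harder for an estimated cell to be mistaken for a sourced one further down the line.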

464 tracked rows
8 public sources
462 replicated
414 scored models

Top models

By composite EvalKit score
1. Claude Opus 4.8 (Anthropic): EvalKit score 66.9
2. Claude Mythos Preview (Anthropic): EvalKit score 65.3
3. GPT-5.5 (OpenAI): EvalKit score 64.1

Current leaders

Updated weekly
Best reasoning: Claude Mythos Preview, 72.51
Best value: Gemini 3 Flash, $0.5/M
Longest context: Grok 4 Fast, 2.0M tokens
Best coding: Claude Mythos Preview, 57.83
Fastest: Mistral Small 4, 678.15 c/s
Best open-weight: Kimi K2.6, 58.13
12 of 464 models · 0 verified · 2026-06-01
Filters: Reasoning, Coding, Agent, Code arena, Context, Speed, Pricing $/M, License

| # | Model | Provider | Score | Reasoning | Coding | Agent | Code arena | Context | Speed | Pricing $/M |
|---|-------|----------|-------|-----------|--------|-------|------------|---------|-------|-------------|
| 1 | Claude Opus 4.8 | Anthropic | 66.9 | 65.7 | 52.3 | 82.2 | 805 | 1.0M | 117/s | $5 |
| 2 | Claude Opus 4.7 | Anthropic | 63.7 | 62.5 | 48.8 | 77.3 | 1,927 | 1.0M | 35/s | $5 |
| 3 | Claude Opus 4.6 | Anthropic | 58.4 | 59.5 | 43.6 | 62.7 | 2,125 | 1.0M | 51/s | $5 |
| 4 | Claude Opus 4.5 | Anthropic | 54.9 | 54.2 | 39.8 | 62.3 | 1,614 | ~600K | 23/s | ~$4 |
| 5 | Claude Opus 4.1 | Anthropic | 38.5 | 38.2 | 27.8 | 22.9 | 1,189 | ~600K | ~117/s | ~$4 |
| 6 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 37.8 | ~35.4 | ~25.1 | ~32.2 | ~1,551 | ~600K | ~117/s | ~$4 |
| 7 | Claude Opus 4.7 | Anthropic | 37.8 | ~35.4 | ~25.1 | ~32.2 | ~1,551 | ~600K | ~117/s | ~$4 |
| 8 | Claude Opus 4.6 | Anthropic | 37.8 | ~35.4 | ~25.1 | ~32.2 | ~1,551 | ~600K | ~117/s | ~$4 |
| 9 | Claude Opus 4.7 | Anthropic | 37.8 | ~35.4 | ~25.1 | ~32.2 | ~1,551 | ~600K | ~117/s | ~$4 |
| 10 | Claude Opus 4.5 | Anthropic | 37.8 | ~35.4 | ~25.1 | ~32.2 | ~1,551 | ~600K | ~117/s | ~$4 |
| 11 | Claude Opus 4.6 | Anthropic | 37.8 | ~35.4 | ~25.1 | ~32.2 | ~1,551 | ~600K | ~117/s | ~$4 |
| 12 | Claude Opus 4 | Anthropic | 35.3 | 35.2 | 22.1 | 23 | 932 | ~600K | ~117/s | ~$4 |
#1 Claude Opus 4.8 · Anthropic · LLM · Score 66.9 · Reasoning 65.7 · Coding 52.3 · Agent 82.2 · Replicated from public source: LLM Stats
#2 Claude Opus 4.7 · Anthropic · LLM · Score 63.7 · Reasoning 62.5 · Coding 48.8 · Agent 77.3 · Replicated from public source: LLM Stats
#3 Claude Opus 4.6 · Anthropic · LLM · Score 58.4 · Reasoning 59.5 · Coding 43.6 · Agent 62.7 · Replicated from public source: LLM Stats
#4 Claude Opus 4.5 · Anthropic · LLM · Score 54.9 · Reasoning 54.2 · Coding 39.8 · Agent 62.3 · Replicated from public source: LLM Stats
#5 Claude Opus 4.1 · Anthropic · LLM · Score 38.5 · Reasoning 38.2 · Coding 27.8 · Agent 22.9 · Replicated from public source: LLM Stats
#6 Claude Opus 4.7 (Adaptive Reasoning, Max Effort) · Anthropic · Coding · Score 37.8 · Reasoning ~35.4 · Coding ~25.1 · Agent ~32.2 · Replicated from public source: Artificial Analysis
#7 Claude Opus 4.7 · Anthropic · RAG · Score 37.8 · Reasoning ~35.4 · Coding ~25.1 · Agent ~32.2 · Replicated from public source: Vellum
#8 Claude Opus 4.6 · Anthropic · RAG · Score 37.8 · Reasoning ~35.4 · Coding ~25.1 · Agent ~32.2 · Replicated from public source: Vellum
#9 Claude Opus 4.7 · Anthropic · Coding · Score 37.8 · Reasoning ~35.4 · Coding ~25.1 · Agent ~32.2 · Replicated from public source: Vellum
#10 Claude Opus 4.5 · Anthropic · Coding · Score 37.8 · Reasoning ~35.4 · Coding ~25.1 · Agent ~32.2 · Replicated from public source: Vellum
#11 Claude Opus 4.6 · Anthropic · Coding · Score 37.8 · Reasoning ~35.4 · Coding ~25.1 · Agent ~32.2 · Replicated from public source: Vellum
#12 Claude Opus 4 · Anthropic · LLM · Score 35.3 · Reasoning 35.2 · Coding 22.1 · Agent 23 · Replicated from public source: LLM Stats

Scoring methodology

EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
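As a rough illustration of the weighting, the sketch below combines three component scores with the stated weights. Those weights add up to 90%, and the page does not say how the result is normalized to the 0–100 scale, so dividing by the sum of the weights is an assumption made here; the function name `composite_score` and the example inputs are hypothetical, and the output is not expected to reproduce published EvalKit scores.

```python
# Weights as stated in the methodology text (they sum to 0.90, not 1.00).
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}

def composite_score(metrics, weights=WEIGHTS):
    """Weighted mean of component scores, each on a 0-100 scale.

    ASSUMPTION: the weighted sum is divided by the total weight so the
    result stays on the same 0-100 scale as the inputs. The actual
    normalization step is not documented on the page.
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * metrics[name] for name in weights)
    return round(weighted_sum / total_weight, 1)

# Placeholder component scores, not leaderboard data.
example = {"reasoning": 60.0, "coding": 50.0, "agent": 40.0}
score = composite_score(example)
print(score)
```

Dividing by the total weight keeps a model that scores 100 on every component at 100 overall, which a plain weighted sum with weights totalling 0.90 would not.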

Source transparency

Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists.

Data freshness

The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.
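For readers working with an exported copy of the rows, the source filter and the weekly refresh can be approximated as below. This is a sketch under assumptions: the `source` and `snapshot` field names, both helper functions, and the seven-day threshold are illustrative choices, not a documented EvalKit export format or API.

```python
from datetime import date

def filter_by_source(rows, source):
    """Keep only rows replicated from the given benchmark provider."""
    return [row for row in rows if row["source"] == source]

def is_stale(snapshot, today, max_age_days=7):
    """True if a row's snapshot is older than one weekly refresh cycle."""
    return (today - snapshot).days > max_age_days

# Placeholder rows; the field names are assumed for this example.
rows = [
    {"model": "model-a", "source": "LLM Stats", "snapshot": date(2026, 6, 1)},
    {"model": "model-b", "source": "Vellum", "snapshot": date(2026, 5, 18)},
    {"model": "model-c", "source": "Vellum", "snapshot": date(2026, 6, 1)},
]

vellum_rows = filter_by_source(rows, "Vellum")
today = date(2026, 6, 1)
stale_models = [r["model"] for r in vellum_rows if is_stale(r["snapshot"], today)]
print([r["model"] for r in vellum_rows], stale_models)
```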

