Public snapshot workspace

LLM Leaderboard

Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.

How EvalKit scores work

The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
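As a rough illustration of how a median-based estimate could fill a missing cell, here is a minimal sketch. It assumes a provider-level median and a plain list of row dictionaries. The function name, field names, and sample rows are hypothetical, not EvalKit's actual implementation or data.

```python
from statistics import median


def fill_with_provider_median(rows, metric):
    """Return a display string per model, estimating missing values.

    rows: list of dicts with "model", "provider", and a numeric metric
    (None when no direct public value exists). Hypothetical schema.
    """
    # Collect the known values for each provider.
    known_by_provider = {}
    for row in rows:
        value = row.get(metric)
        if value is not None:
            known_by_provider.setdefault(row["provider"], []).append(value)

    display = {}
    for row in rows:
        value = row.get(metric)
        if value is not None:
            display[row["model"]] = f"{value:g}"
        elif row["provider"] in known_by_provider:
            # Estimated cell: prefix with ~ so it is never mistaken for a measured value.
            estimate = median(known_by_provider[row["provider"]])
            display[row["model"]] = f"~{estimate:g}"
        else:
            display[row["model"]] = "n/a"
    return display


# Made-up rows for illustration only.
rows = [
    {"model": "model-a", "provider": "acme", "coding": 30.0},
    {"model": "model-b", "provider": "acme", "coding": 40.0},
    {"model": "model-c", "provider": "acme", "coding": None},
]
print(fill_with_provider_median(rows, "coding"))
# {'model-a': '30', 'model-b': '40', 'model-c': '~35'}
```

A model-family median would work the same way, grouping on a family key instead of the provider.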

464 tracked rows
8 public sources
462 replicated
414 scored models

Top models

By composite EvalKit score
1. Claude Opus 4.8 (Anthropic): EvalKit score 66.9
2. Claude Mythos Preview (Anthropic): EvalKit score 65.3
3. GPT-5.5 (OpenAI): EvalKit score 64.1

Current leaders

Updated weekly
Best reasoning: Claude Mythos Preview (72.51)
Best value: Gemini 3 Flash ($0.5/M)
Longest context: Grok 4 Fast (2.0M tokens)
Best coding: Claude Mythos Preview (57.83)
Fastest: Mistral Small 4 (678.15 c/s)
Best open-weight: Kimi K2.6 (58.13)
41 of 464 models · 0 verified · 2026-06-01
| # | Model | Provider | Score | Reasoning | Coding | Agent | Code arena | Context | Speed | Pricing $/M |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 64.1 | 62.3 | 51 | 75.3 | 1,948 | 1.1M | 175/s | $5 |
| 2 | GPT-5.4 | OpenAI | 58.2 | 57.6 | 43 | 67.2 | 1,726 | 1.0M | 203/s | $2.5 |
| 3 | GPT-5.3 Codex | OpenAI | 53.4 | 56.2 | 43.4 | 38.1 | 1,367 | 400K | 233/s | $1.75 |
| 4 | GPT-5.2 | OpenAI | 52.6 | 53.5 | 34.6 | 60.6 | 1,519 | 400K | 181/s | $1.75 |
| 5 | GPT-5.5 Pro | OpenAI | 50.4 | 61.6 | ~26.1 | ~44.5 | ~1,613 | ~400K | ~230/s | ~$1.25 |
| 6 | GPT-5.4 mini | OpenAI | 48.9 | 45.8 | 34 | 57.7 | 1,388 | 400K | 283/s | $0.75 |
| 7 | GPT-5.5 Instant | OpenAI | 48.8 | 42.5 | 43.4 | ~43.7 | 1,361 | 400K | 227/s | $5 |
| 8 | GPT-5.2 Codex | OpenAI | 48.4 | 52.1 | 38.9 | 26 | 1,232 | 400K | 204/s | $1.75 |
| 9 | GPT-5.2 Pro | OpenAI | 48.1 | 56.7 | ~26.1 | ~42.2 | ~1,613 | ~400K | ~230/s | ~$1.25 |
| 10 | GPT-5 High | OpenAI | 47.2 | 47.8 | 33.8 | ~41.7 | 1,301 | ~400K | ~230/s | ~$1.25 |
| 11 | GPT-5.1 High | OpenAI | 45.6 | 53.3 | 23.8 | ~39.7 | 1,140 | ~400K | ~230/s | ~$1.25 |
| 12 | GPT-5.1 Medium | OpenAI | 44.9 | 45.9 | 30.4 | ~39.3 | 1,181 | 400K | 263/s | $1.25 |
| 13 | GPT-5.1 | OpenAI | 43.5 | 47.5 | 30.8 | 24.7 | 1,210 | 400K | 319/s | $1.25 |
| 14 | GPT-5.1 Instant | OpenAI | 43.5 | 48.6 | 30.3 | 23.3 | 940 | 400K | 276/s | $1.25 |
| 15 | GPT-5 | OpenAI | 43.2 | 44.7 | 33.9 | 23.2 | 886 | ~400K | ~230/s | ~$1.25 |
| 16 | GPT-5.1 Codex High | OpenAI | 43 | 44.3 | 27.7 | ~37.4 | 721 | ~400K | ~230/s | ~$1.25 |
| 17 | GPT-5 Medium | OpenAI | 42.7 | 43.5 | 27.7 | ~37 | 1,101 | ~400K | ~230/s | ~$1.25 |
| 18 | GPT-5.1 Thinking | OpenAI | 42.7 | 47.5 | 29.8 | 21.9 | 1,087 | ~400K | ~230/s | ~$1.25 |
| 19 | GPT-5.4 nano | OpenAI | 42.4 | 39.9 | 23 | 56.1 | 788 | 400K | 365/s | $0.2 |
| 20 | GPT-5 Codex | OpenAI | 40.6 | 39.9 | 26.8 | ~35 | 75 | ~400K | ~230/s | ~$1.25 |
Rows #1 to #20 above are each replicated from a public source: LLM Stats.

Scoring methodology

EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
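The weighting above can be sketched as a small function. The stated weights sum to 0.90, so this sketch divides by the weight total to keep the result on the same 0–100 scale as the inputs; that renormalization is an assumption, as are the function name and the sample inputs. Since the page does not spell out the exact normalization or per-benchmark inputs, this should not be expected to reproduce the published scores.

```python
# Weights as stated in the methodology text; they sum to 0.90.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}


def composite_score(metrics, weights=WEIGHTS):
    """Weighted average of 0-100 component scores, rounded to one decimal.

    Assumption: the weighted sum is divided by the total weight so the
    result stays on a 0-100 scale even though the weights sum to 0.90.
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * metrics[name] for name in weights)
    return round(weighted_sum / total_weight, 1)


# Made-up component scores for illustration only.
example = {"reasoning": 60.0, "coding": 50.0, "agent": 70.0}
print(composite_score(example))  # 57.8
```

Dividing by the weight total means a model scoring 100 on every component gets a composite of 100, which is one simple way to satisfy the 0–100 normalization.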

Source transparency

Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists.

Data freshness

The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.

