Public snapshot workspace

LLM Leaderboard

Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.

How EvalKit scores work

The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
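The median-based fallback described above can be sketched as follows. This is a minimal illustration, not EvalKit's actual pipeline: the function names, the row layout, and the choice to group by provider only (rather than by model family) are assumptions. The prices are taken from the table on this page, but the subset of rows is chosen for illustration.

```python
from statistics import median


def fill_with_provider_median(rows, field):
    """Fill a missing `field` with the median of the same provider's known values.

    Assumption: grouping is by provider only; the page also mentions
    model-family medians, which are not modeled here.
    """
    known = {}
    for row in rows:
        if row.get(field) is not None:
            known.setdefault(row["provider"], []).append(row[field])

    filled = []
    for row in rows:
        out = dict(row)
        out["estimated"] = False
        if out.get(field) is None and out["provider"] in known:
            out[field] = median(known[out["provider"]])
            out["estimated"] = True
        filled.append(out)
    return filled


def show_price(row):
    """Render a price cell, prefixing estimates with ~ as the leaderboard does."""
    if row["price"] is None:
        return "n/a"
    prefix = "~" if row["estimated"] else ""
    return f"{prefix}${row['price']:g}"


rows = [
    {"model": "Claude Opus 4.8", "provider": "Anthropic", "price": 5},
    {"model": "Claude Sonnet 4.6", "provider": "Anthropic", "price": 3},
    {"model": "Claude Mythos Preview", "provider": "Anthropic", "price": None},
    {"model": "GPT-5.5", "provider": "OpenAI", "price": 5},
]

for row in fill_with_provider_median(rows, "price"):
    print(row["model"], show_price(row))
```

Rows that already have a public value pass through unchanged, so the ~ marker appears only on filled cells. A row whose provider has no known values stays empty instead of receiving a guess.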

464 tracked rows
8 public sources
462 replicated
414 scored models

Top models

By composite EvalKit score
1. Claude Opus 4.8 (Anthropic): EvalKit score 66.9
2. Claude Mythos Preview (Anthropic): EvalKit score 65.3
3. GPT-5.5 (OpenAI): EvalKit score 64.1

Current leaders

Updated weekly
Best reasoning: Claude Mythos Preview (72.51)
Best value: Gemini 3 Flash ($0.5/M)
Longest context: Grok 4 Fast (2.0M tokens)
Best coding: Claude Mythos Preview (57.83)
Fastest: Mistral Small 4 (678.15 c/s)
Best open-weight: Kimi K2.6 (58.13)
464 of 464 models · 0 verified · 2026-06-01
#   Model                  Provider                   Score  Reasoning  Coding  Agent  Code arena  Context  Speed   Price $/M
1   Claude Opus 4.8        Anthropic                  66.9   65.7       52.3    82.2   805         1.0M     117/s   $5
2   Claude Mythos Preview  Anthropic                  65.3   72.5       57.8    40.3   94          ~600K    ~117/s  ~$4
3   GPT-5.5                OpenAI                     64.1   62.3       51      75.3   1,948       1.1M     175/s   $5
4   Claude Opus 4.7        Anthropic                  63.7   62.5       48.8    77.3   1,927       1.0M     35/s    $5
5   Gemini 3.5 Flash       Google                     62.4   59.2       46.4    83.6   1,507       1.0M     189/s   $1.5
6   Qwen3.7 Max            Alibaba Cloud / Qwen Team  62.3   60.3       47.9    76.4   1,213       1.0M     174/s   $1.25
7   Gemini 3.1 Pro         Google                     59.1   59.1       43.2    69.2   2,086       1.0M     129/s   $2.5
8   DeepSeek-V4-Pro-Max    DeepSeek                   59.1   57         43.5    73.6   1,317       1.0M     44/s    $1.74
9   Claude Opus 4.6        Anthropic                  58.4   59.5       43.6    62.7   2,125       1.0M     51/s    $5
10  GPT-5.4                OpenAI                     58.2   57.6       43      67.2   1,726       1.0M     203/s   $2.5
11  GLM-5.1                Zhipu AI                   57.5   54.2       43      71.8   1,802       200K     177/s   $1.4
12  Qwen3.6 Plus           Alibaba Cloud / Qwen Team  56.6   52.1       41.7    74.1   1,211       1.0M     117/s   $0.5
13  Kimi K2.6              Moonshot AI                56.1   58.1       43.7    50     1,547       262K     50/s    $0.95
14  Claude Opus 4.5        Anthropic                  54.9   54.2       39.8    62.3   1,614       ~600K    23/s    ~$4
15  DeepSeek-V4-Flash-Max  DeepSeek                   54.3   51.9       37.6    69     1,127       1.0M     176/s   $0.14
16  GLM-5                  Zhipu AI                   53.4   51.5       36.1    67.8   1,596       200K     262/s   $1
17  GPT-5.3 Codex          OpenAI                     53.4   56.2       43.4    38.1   1,367       400K     233/s   $1.75
18  Claude Sonnet 4.6      Anthropic                  52.8   52.2       36.5    61.3   1,701       200K     121/s   $3
19  GPT-5.2                OpenAI                     52.6   53.5       34.6    60.6   1,519       400K     181/s   $1.75
20  MiniMax M2.7           MiniMax                    51.8   53.1       38.9    46.3   1,156       205K     174/s   $0.3
All 20 rows above are replicated from a public source (LLM Stats). DeepSeek-V4-Pro-Max, GLM-5.1, Kimi K2.6, DeepSeek-V4-Flash-Max, GLM-5, and MiniMax M2.7 are listed as Open LLM; the remaining 14 are listed as LLM.

Scoring methodology

EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
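As a rough sketch of the weighted blend, the snippet below combines the three stated components. The listed weights sum to 90%, so the sketch rescales by the total weight to stay on a 0–100 scale; that rescaling, the function name, and the handling of missing components are assumptions, not EvalKit's documented method.

```python
# Weights as stated in the methodology text; they total 0.90, not 1.00.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}


def composite_score(metrics, weights=WEIGHTS):
    """Weighted mean of the available components, rescaled by total weight.

    Assumption: a component that is missing (None) is dropped and the
    remaining weights are renormalized.
    """
    available = {k: w for k, w in weights.items() if metrics.get(k) is not None}
    if not available:
        raise ValueError("no scored components available")
    total_weight = sum(available.values())
    weighted_sum = sum(metrics[k] * w for k, w in available.items())
    return weighted_sum / total_weight


# Component values for Claude Opus 4.8 from the table on this page.
opus = {"reasoning": 65.7, "coding": 52.3, "agent": 82.2}
print(round(composite_score(opus), 1))
```

For these inputs the sketch yields about 63.2, not the published 66.9, so the published score evidently uses inputs or normalization beyond the three weights given here. Treat the code as an illustration of a weighted blend, not a way to reproduce leaderboard values.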

Source transparency

Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists.
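One way to picture this rule is a row record whose status is derived from what evidence it carries. The class, field names, and status labels below are hypothetical, and the URL is a placeholder; the sketch only illustrates that "verified" requires run evidence while a replicated row keeps its source URL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeaderboardRow:
    model: str
    source_name: Optional[str] = None
    source_url: Optional[str] = None    # exact URL of the public benchmark source
    run_evidence: Optional[str] = None  # hypothetical reference to an EvalKit run

    @property
    def status(self) -> str:
        """Derive the provenance label from the evidence attached to the row."""
        if self.run_evidence:
            return "verified"
        if self.source_url:
            return "replicated"
        return "estimated"


row = LeaderboardRow(
    model="Claude Opus 4.8",
    source_name="LLM Stats",
    source_url="https://example.com/source",  # placeholder, not a real citation
)
print(row.model, row.status)
```

Deriving the label instead of storing it means a row cannot be marked verified by accident: the label changes only when run evidence is attached.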

Data freshness

The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.
