Public snapshot workspace
LLM Leaderboard
Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.
How EvalKit scores work
The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Cells without a direct public metric show an estimate, marked with ~, derived from provider or model-family medians. Source citations accompany every row.
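The median-based estimate described above can be sketched as follows. This is a minimal illustration, not EvalKit's actual pipeline: the row layout, the sample values, and the helper names `provider_median` and `display_value` are all hypothetical, and only the provider-level median is shown (a model-family median would work the same way with a different grouping key).

```python
from statistics import median
from typing import Optional

# Hypothetical rows: (model, provider, metric value or None when no public figure exists).
ROWS = [
    ("model-a", "ProviderX", 1000.0),
    ("model-b", "ProviderX", 200.0),
    ("model-c", "ProviderX", None),  # missing: will be estimated
    ("model-d", "ProviderY", 400.0),
]


def provider_median(rows, provider: str) -> Optional[float]:
    """Median of the known values for one provider, or None if there are none."""
    known = [value for _, p, value in rows if p == provider and value is not None]
    return median(known) if known else None


def display_value(rows, model: str) -> str:
    """Return the cell text: the public value, or a '~'-prefixed estimate."""
    for name, provider, value in rows:
        if name == model:
            if value is not None:
                return f"{value:g}"
            estimate = provider_median(rows, provider)
            # Estimates are always labeled with '~' so they are never mistaken for sourced data.
            return f"~{estimate:g}" if estimate is not None else "n/a"
    raise KeyError(model)


if __name__ == "__main__":
    for model, _, _ in ROWS:
        print(model, display_value(ROWS, model))
```

Keeping the `~` marker in the rendered string, rather than in a separate flag, means an estimate cannot lose its label when a cell is copied elsewhere.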
Top models
By composite EvalKit score
Current leaders
Updated weekly

| Rank | Model | Provider | EvalKit score | Reasoning | Coding | Agent | Code arena | Context | Speed | Pricing $/M | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Claude Opus 4.8 | Anthropic | 66.9 | 65.7 | 52.3 | 82.2 | 805 | 1.0M | 117/s | $5 | LLM |
| #2 | Claude Mythos Preview | Anthropic | 65.3 | 72.5 | 57.8 | 40.3 | 94 | ~600K | ~117/s | ~$4 | LLM |
| #3 | GPT-5.5 | OpenAI | 64.1 | 62.3 | 51 | 75.3 | 1,948 | 1.1M | 175/s | $5 | LLM |
| #4 | Claude Opus 4.7 | Anthropic | 63.7 | 62.5 | 48.8 | 77.3 | 1,927 | 1.0M | 35/s | $5 | LLM |
| #5 | Gemini 3.5 Flash | Google | 62.4 | 59.2 | 46.4 | 83.6 | 1,507 | 1.0M | 189/s | $1.5 | LLM |
| #6 | Qwen3.7 Max | Alibaba Cloud / Qwen Team | 62.3 | 60.3 | 47.9 | 76.4 | 1,213 | 1.0M | 174/s | $1.25 | LLM |
| #7 | Gemini 3.1 Pro | Google | 59.1 | 59.1 | 43.2 | 69.2 | 2,086 | 1.0M | 129/s | $2.5 | LLM |
| #8 | DeepSeek-V4-Pro-Max | DeepSeek | 59.1 | 57 | 43.5 | 73.6 | 1,317 | 1.0M | 44/s | $1.74 | Open LLM |
| #9 | Claude Opus 4.6 | Anthropic | 58.4 | 59.5 | 43.6 | 62.7 | 2,125 | 1.0M | 51/s | $5 | LLM |
| #10 | GPT-5.4 | OpenAI | 58.2 | 57.6 | 43 | 67.2 | 1,726 | 1.0M | 203/s | $2.5 | LLM |
| #11 | GLM-5.1 | Zhipu AI | 57.5 | 54.2 | 43 | 71.8 | 1,802 | 200K | 177/s | $1.4 | Open LLM |
| #12 | Qwen3.6 Plus | Alibaba Cloud / Qwen Team | 56.6 | 52.1 | 41.7 | 74.1 | 1,211 | 1.0M | 117/s | $0.5 | LLM |
| #13 | Kimi K2.6 | Moonshot AI | 56.1 | 58.1 | 43.7 | 50 | 1,547 | 262K | 50/s | $0.95 | Open LLM |
| #14 | Claude Opus 4.5 | Anthropic | 54.9 | 54.2 | 39.8 | 62.3 | 1,614 | ~600K | 23/s | ~$4 | LLM |
| #15 | DeepSeek-V4-Flash-Max | DeepSeek | 54.3 | 51.9 | 37.6 | 69 | 1,127 | 1.0M | 176/s | $0.14 | Open LLM |
| #16 | GLM-5 | Zhipu AI | 53.4 | 51.5 | 36.1 | 67.8 | 1,596 | 200K | 262/s | $1 | Open LLM |
| #17 | GPT-5.3 Codex | OpenAI | 53.4 | 56.2 | 43.4 | 38.1 | 1,367 | 400K | 233/s | $1.75 | LLM |
| #18 | Claude Sonnet 4.6 | Anthropic | 52.8 | 52.2 | 36.5 | 61.3 | 1,701 | 200K | 121/s | $3 | LLM |
| #19 | GPT-5.2 | OpenAI | 52.6 | 53.5 | 34.6 | 60.6 | 1,519 | 400K | 181/s | $1.75 | LLM |
| #20 | MiniMax M2.7 | MiniMax | 51.8 | 53.1 | 38.9 | 46.3 | 1,156 | 205K | 174/s | $0.3 | Open LLM |
Scoring methodology
EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
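As a rough illustration of the weighting above, the sketch below computes a weighted composite from the three stated components. Note that the stated weights sum to 90%, and the page does not say how the result is normalized to 0–100 or whether another input makes up the rest. Dividing by the total weight is therefore an assumption made here, and the function name `composite_score` is hypothetical. The output should not be expected to reproduce the published EvalKit scores.

```python
# Weights as stated in the methodology text; they sum to 0.90, not 1.00.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}


def composite_score(reasoning: float, coding: float, agent: float) -> float:
    """Weighted composite of three 0-100 component scores.

    Assumption (not confirmed by the source): the weighted sum is divided by
    the total weight so that three perfect components yield exactly 100.
    """
    components = {"reasoning": reasoning, "coding": coding, "agent": agent}
    weighted_sum = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return weighted_sum / sum(WEIGHTS.values())


if __name__ == "__main__":
    # Component values taken from one leaderboard row, for illustration only.
    print(round(composite_score(65.7, 52.3, 82.2), 1))
```

Rescaling by the total weight keeps the result on the same 0–100 range as the inputs; a different normalization, or an additional unlisted component, would give different numbers.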
Source transparency
Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists. Read citation policy →
Data freshness
The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.