LLM Leaderboard
Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.
How EvalKit scores work
The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
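As a rough illustration of the estimate rule described above, the following Python sketch fills a missing metric with the median of the known values in the same model family and marks it with ~. The row layout, the `family` key, and the function names `fill_estimates` and `display` are hypothetical. The page does not say how EvalKit groups models or how it chooses between provider and family medians, and the values below are made up for the example.

```python
from statistics import median

def fill_estimates(rows, metric):
    """Map each model to (value, is_estimate) for one metric.

    Assumption: a missing value (None) is replaced by the median of the
    known values in the same family; this grouping is hypothetical.
    """
    known = {}
    for row in rows:
        if row.get(metric) is not None:
            known.setdefault(row["family"], []).append(row[metric])

    filled = {}
    for row in rows:
        value = row.get(metric)
        if value is not None:
            filled[row["model"]] = (value, False)
        elif row["family"] in known:
            filled[row["model"]] = (median(known[row["family"]]), True)
        else:
            filled[row["model"]] = (None, False)  # nothing to estimate from
    return filled

def display(value, is_estimate):
    """Render a cell, prefixing estimates with ~ as the leaderboard does."""
    if value is None:
        return ""
    return f"~{value:g}" if is_estimate else f"{value:g}"

# Made-up rows, not leaderboard data.
rows = [
    {"model": "model-a", "family": "fam-1", "agent": 40.0},
    {"model": "model-b", "family": "fam-1", "agent": 50.0},
    {"model": "model-c", "family": "fam-1", "agent": None},
]
cells = {m: display(*v) for m, v in fill_estimates(rows, "agent").items()}
print(cells)
```

Keeping the estimate flag next to the value, rather than encoding it only in the display string, lets downstream code sort or filter on the number while still labeling the cell.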
Top models
By composite EvalKit score
Current leaders
Updated weekly

| Rank | Model | Provider | EvalKit score | Reasoning | Coding | Agent | Code arena | Context | Speed | Pricing $/M | License |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | GPT-5.5 | OpenAI | 64.1 | 62.3 | 51 | 75.3 | 1,948 | 1.1M | 175/s | $5 | |
| #2 | GPT-5.4 | OpenAI | 58.2 | 57.6 | 43 | 67.2 | 1,726 | 1.0M | 203/s | $2.5 | |
| #3 | GPT-5.3 Codex | OpenAI | 53.4 | 56.2 | 43.4 | 38.1 | 1,367 | 400K | 233/s | $1.75 | |
| #4 | GPT-5.2 | OpenAI | 52.6 | 53.5 | 34.6 | 60.6 | 1,519 | 400K | 181/s | $1.75 | |
| #5 | GPT-5.5 Pro | OpenAI | 50.4 | 61.6 | ~26.1 | ~44.5 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #6 | GPT-5.4 mini | OpenAI | 48.9 | 45.8 | 34 | 57.7 | 1,388 | 400K | 283/s | $0.75 | |
| #7 | GPT-5.5 Instant | OpenAI | 48.8 | 42.5 | 43.4 | ~43.7 | 1,361 | 400K | 227/s | $5 | |
| #8 | GPT-5.2 Codex | OpenAI | 48.4 | 52.1 | 38.9 | 26 | 1,232 | 400K | 204/s | $1.75 | |
| #9 | GPT-5.2 Pro | OpenAI | 48.1 | 56.7 | ~26.1 | ~42.2 | ~1,613 | ~400K | ~230/s | ~$1.25 | |
| #10 | GPT-5 High | OpenAI | 47.2 | 47.8 | 33.8 | ~41.7 | 1,301 | ~400K | ~230/s | ~$1.25 | |
| #11 | GPT-5.1 High | OpenAI | 45.6 | 53.3 | 23.8 | ~39.7 | 1,140 | ~400K | ~230/s | ~$1.25 | |
| #12 | GPT-5.1 Medium | OpenAI | 44.9 | 45.9 | 30.4 | ~39.3 | 1,181 | 400K | 263/s | $1.25 | |
| #13 | GPT-5.1 | OpenAI | 43.5 | 47.5 | 30.8 | 24.7 | 1,210 | 400K | 319/s | $1.25 | |
| #14 | GPT-5.1 Instant | OpenAI | 43.5 | 48.6 | 30.3 | 23.3 | 940 | 400K | 276/s | $1.25 | |
| #15 | GPT-5 | OpenAI | 43.2 | 44.7 | 33.9 | 23.2 | 886 | ~400K | ~230/s | ~$1.25 | |
| #16 | GPT-5.1 Codex High | OpenAI | 43 | 44.3 | 27.7 | ~37.4 | 721 | ~400K | ~230/s | ~$1.25 | |
| #17 | GPT-5 Medium | OpenAI | 42.7 | 43.5 | 27.7 | ~37 | 1,101 | ~400K | ~230/s | ~$1.25 | |
| #18 | GPT-5.1 Thinking | OpenAI | 42.7 | 47.5 | 29.8 | 21.9 | 1,087 | ~400K | ~230/s | ~$1.25 | |
| #19 | GPT-5.4 nano | OpenAI | 42.4 | 39.9 | 23 | 56.1 | 788 | 400K | 365/s | $0.2 | |
| #20 | GPT-5 Codex | OpenAI | 40.6 | 39.9 | 26.8 | ~35 | 75 | ~400K | ~230/s | ~$1.25 | |
Scoring methodology
EvalKit Score = 40% reasoning + 35% coding + 15% agent, normalized to a 0–100 scale. Public benchmark data from 8 independent sources is preferred; estimates fill missing cells and are always labeled.
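The weighted blend can be sketched in a few lines of Python. The weights are the ones stated above; they sum to 90%, and the page does not document how the result is normalized, so dividing by the sum of the weights is an assumption made here to keep the output on a 0–100 scale. The name `composite_score` and the example inputs are hypothetical.

```python
# Weights as stated on the page; note they sum to 0.90, not 1.00.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}

def composite_score(scores, weights=WEIGHTS):
    """Weighted mean of component scores, each assumed to be on 0-100.

    Assumption: dividing by the sum of the weights keeps the result on
    0-100. The page does not document its actual normalization.
    """
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * scores[name] for name in weights)
    return weighted / total_weight

# Made-up component scores, not leaderboard data.
example = {"reasoning": 60.0, "coding": 50.0, "agent": 70.0}
print(round(composite_score(example), 2))
```

Because the normalization step is assumed, this sketch should not be expected to reproduce the scores published in the table.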
Source transparency
Every row links back to its original benchmark source. Replicated rows keep the exact source URL. No score is marked as EvalKit-verified unless run evidence exists.
Data freshness
The leaderboard refreshes weekly. The snapshot date for each row is shown in the detail panel. Use the source filter to view data from a specific benchmark provider.