EvalKit

LLM Leaderboard

Compare LLMs on reasoning, coding, speed, context, and price. Every row is citation-backed — sourced from public benchmarks and traceable to its origin.

How EvalKit scores work

The EvalKit Score is a composite of reasoning (40%), coding (35%), and agent capability (15%), blended across public benchmark data. Rows without a direct public metric show an estimate — marked with ~ — derived from provider or model-family medians. Source citations accompany every row.
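The weighting and the median-based estimates described above can be sketched in a few lines. This is a minimal illustration, not EvalKit's actual implementation: the function names are hypothetical, the stated weights sum to 90% and the source does not say how the remainder is allocated, so the sketch assumes the score is normalized by the weight total. It is not expected to reproduce the published scores.

```python
from statistics import median

# Weights as stated in the text; they sum to 0.90, not 1.00.
WEIGHTS = {"reasoning": 0.40, "coding": 0.35, "agent": 0.15}


def composite_score(metrics, weights=WEIGHTS):
    """Weighted mean over the listed components.

    Dividing by the weight total is an assumption made here so the
    result stays on the same scale as the inputs.
    """
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


def fill_with_median(rows, field):
    """Fill missing values of `field` with the median of the known values.

    Returns (value, is_estimate) pairs; is_estimate=True plays the role
    of the ~ marker on the leaderboard. `rows` is assumed to hold models
    from one provider or model family.
    """
    known = [r[field] for r in rows if r.get(field) is not None]
    if not known:
        raise ValueError(f"no known values for {field!r}")
    fallback = median(known)
    return [
        (r[field], False) if r.get(field) is not None else (fallback, True)
        for r in rows
    ]
```

Calling `composite_score({"reasoning": 65.7, "coding": 52.3, "agent": 82.2})` applies the stated weights to one row's component values. Any gap between that result and the published score reflects the parts of the blend the text does not specify.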

Weekly refresh · 8 public sources · Citation policy

Rank	Model	Provider	Score	Reasoning	Coding	Agent	Code arena	Context	Speed	Pricing $/M
#1	Claude Opus 4.8	Anthropic	66.9	65.7	52.3	82.2	805	1.0M	117/s	$5
#2	Claude Opus 4.7	Anthropic	63.7	62.5	48.8	77.3	1,927	1.0M	35/s	$5
#3	Claude Opus 4.6	Anthropic	58.4	59.5	43.6	62.7	2,125	1.0M	51/s	$5
#4	Claude Opus 4.5	Anthropic	54.9	54.2	39.8	62.3	1,614	~600K	23/s	~$4
#5	Claude Opus 4.1	Anthropic	38.5	38.2	27.8	22.9	1,189	~600K	~117/s	~$4
#6	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	Anthropic	37.8	~35.4	~25.1	~32.2	~1,551	~600K	~117/s	~$4
#7	Claude Opus 4.7	Anthropic	37.8	~35.4	~25.1	~32.2	~1,551	~600K	~117/s	~$4
#8	Claude Opus 4.6	Anthropic	37.8	~35.4	~25.1	~32.2	~1,551	~600K	~117/s	~$4
#9	Claude Opus 4.7	Anthropic	37.8	~35.4	~25.1	~32.2	~1,551	~600K	~117/s	~$4
#10	Claude Opus 4.5	Anthropic	37.8	~35.4	~25.1	~32.2	~1,551	~600K	~117/s	~$4
#11	Claude Opus 4.6	Anthropic	37.8	~35.4	~25.1	~32.2	~1,551	~600K	~117/s	~$4
#12	Claude Opus 4	Anthropic	35.3	35.2	22.1	23	932	~600K	~117/s	~$4
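Cells in the table above mix plain numbers with a ~ estimate marker, thousands separators, K/M suffixes, a /s rate suffix, and a $ prefix. A small parser makes the rows easier to compare programmatically. This is a sketch under assumptions: `parse_cell` is a hypothetical helper, and K and M are read as decimal multipliers (1,000 and 1,000,000), which the source does not confirm.

```python
import re

# Assumed decimal multipliers for the K and M suffixes.
_SUFFIX = {"K": 1_000, "M": 1_000_000}

# Optional ~, optional $, digits with commas and an optional decimal part,
# optional K/M suffix, optional /s rate suffix.
_CELL = re.compile(r"^(~?)\$?([\d,]+(?:\.\d+)?)([KM]?)(?:/s)?$")


def parse_cell(text):
    """Return (value, is_estimate) for cells like '~600K', '1,927', '117/s', '$5'."""
    match = _CELL.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {text!r}")
    tilde, digits, suffix = match.groups()
    value = float(digits.replace(",", "")) * _SUFFIX.get(suffix, 1)
    return value, tilde == "~"
```

Keeping the estimate flag separate from the numeric value lets downstream code sort or filter on measured figures only, instead of treating median-derived entries as equivalent to sourced ones.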


Claude Opus 4.5
Anthropic · Coding
Score: 37.8 · Reasoning: ~35.4 · Coding: ~25.1 · Agent: ~32.2
Replicated from public source: Vellum

#11
Claude Opus 4.6
Anthropic · Coding
Score: 37.8 · Reasoning: ~35.4 · Coding: ~25.1 · Agent: ~32.2
Replicated from public source: Vellum

#12
Claude Opus 4
Anthropic · LLM
Score: 35.3 · Reasoning: 35.2 · Coding: 22.1 · Agent: 23
Replicated from public source: LLM Stats