Google · apache_2_0
Gemma 4 31B
Gemma 4 31B by Google appears in 2 sources, with a Reasoning index of 44.8. Best read for coding, high intelligence, and low cost.
Creator: Google
Release date: 2026-04-02
Knowledge cutoff: Not published
Context: 262K tokens
Input price: $0.14/M tokens
Output price: $0.4/M tokens
Modality: text + vision
Country: US
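The listed prices are per million tokens, so the cost of a single request follows from simple arithmetic. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, not part of any provider SDK, and it assumes that input and output tokens together must fit in the 262,144-token context window, which this profile does not state.

```python
# Prices copied from this profile (USD per million tokens).
INPUT_PRICE_PER_M = 0.14
OUTPUT_PRICE_PER_M = 0.40
# Context window from this profile, in tokens.
CONTEXT_LIMIT = 262_144


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    # Assumption: input and output share the same context window.
    if input_tokens + output_tokens > CONTEXT_LIMIT:
        raise ValueError("request exceeds the listed context window")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# 100,000 input tokens and 2,000 output tokens.
cost = estimate_cost(100_000, 2_000)
print(f"${cost:.4f}")  # prints "$0.0148"
```

Actual billing depends on the provider serving the model, so treat the result as a rough estimate.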
Metrics
All source-backed metrics
LLM Stats Rank: 35 (ranking · source_rank)
LLM Stats Code Index (estimated from arena): 28.78 score (metric)
Reasoning: 44.8 (reasoning · index_reasoning)
Math: 39.45 (math · index_math)
Vision: 27.1 (multimodal · index_vision)
Tool calling: 19.41 (tool_calling · index_tool_calling)
Long context: 29.85 (long_context · index_long_context)
Finance: 41.91 (domain · index_finance)
Legal: 41.89 (domain · index_legal)
Healthcare: 36.32 (domain · index_healthcare)
GPQA: 84.3 % (reasoning · gpqa_score)
Code Arena: 1,135.37 (coding · coding_arena_score)
Humanity Last Exam: 26.5 % (reasoning · hle_score)
MMMLU: 88.4 % (reasoning · mmmlu_score)
MMMU-Pro: 76.9 % (multimodal · mmmu_pro_score)
Context: 262,144 tokens (context · context)
Speed: 211.34 c/s (performance · throughput)
Latency: 465.35 ms (performance · latency)
Input price: 0.14 $/M (pricing · input_price)
Output price: 0.4 $/M (pricing · output_price)
Parameters: 30,700,000,000 params (model · params)
Arena Rating: 1,441.98 (metric)
Arena Rank: 35 (metric)
Vote Count: 5,829 (metric)
Evidence
Citations and source overlap
FAQ
How should I read this profile?
Treat this as a source-backed model dossier, not an EvalKit-run verification. Each public value is copied from its linked source and stays attributed to that source.
Is Gemma 4 31B verified by EvalKit?
No. EvalKit shows 0 verified rows for this model and will continue to do so until real run evidence exists.
Why can metrics disagree?
Different sources test different tasks, dates, prompts, and aggregation methods. EvalKit keeps those differences visible instead of merging them into a fake universal score.
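One way to picture "keeping differences visible" is a structure that stores one value per source for each metric and never averages them. The sketch below is an assumption about how such scoping could work, not EvalKit's actual implementation; `MetricRow`, `scope_by_source`, the source names, and the values are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class MetricRow(NamedTuple):
    source: str   # where the value was published (hypothetical names below)
    metric: str   # metric key
    value: float


def scope_by_source(rows: list[MetricRow]) -> dict[str, dict[str, float]]:
    """Group rows by metric while keeping one value per source."""
    scoped: dict[str, dict[str, float]] = defaultdict(dict)
    for row in rows:
        scoped[row.metric][row.source] = row.value
    return dict(scoped)


# Made-up placeholder rows, not real measurements of any model.
rows = [
    MetricRow("source_a", "example_metric", 10.0),
    MetricRow("source_b", "example_metric", 12.0),
]
scoped = scope_by_source(rows)
# Both values stay side by side under "example_metric"; no average is computed.
print(scoped)
```

A reader comparing sources can then see that the two values differ and check each source's task, date, and prompt before drawing conclusions.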