Benchmark glossary

Each benchmark answers a different product question. EvalKit groups benchmarks into six task categories: reasoning, math, coding, tool calling, multimodal, and long-context.

Examples:
- GPQA: expert reasoning
- AIME: competition math
- SWE-bench: software engineering
- Toolathlon / MCP Atlas: agent tool use
- MRCR: long-context behavior
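
The grouping above can be sketched as a small lookup table. This is only an illustration: the `BENCHMARK_CATEGORY` dictionary and the `group_by_category` helper are hypothetical names, not part of EvalKit, and the category labels are assumed from the list of task types above. No multimodal benchmark is named in the examples, so none appears here.

```python
from collections import defaultdict

# Hypothetical mapping built from the examples above; not EvalKit's actual schema.
BENCHMARK_CATEGORY = {
    "GPQA": "reasoning",
    "AIME": "math",
    "SWE-bench": "coding",
    "Toolathlon": "tool calling",
    "MCP Atlas": "tool calling",
    "MRCR": "long-context",
}


def group_by_category(mapping):
    """Invert a benchmark -> category mapping into category -> [benchmarks]."""
    groups = defaultdict(list)
    for name, category in mapping.items():
        groups[category].append(name)
    return dict(groups)


if __name__ == "__main__":
    for category, names in group_by_category(BENCHMARK_CATEGORY).items():
        print(f"{category}: {', '.join(names)}")
```

Looking up a category first and then picking a benchmark from it mirrors the idea that the product question determines which benchmark is relevant.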