Benchmark glossary
Benchmarks answer different product questions. EvalKit groups them by reasoning, math, coding, tool calling, multimodal, and long-context tasks.
Examples:
- GPQA: expert reasoning
- AIME: competition math
- SWE-bench: software engineering
- Toolathlon / MCP Atlas: agent tool use
- MRCR: long-context behavior
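
As a rough illustration of the grouping, the examples can be written as a small lookup table. This is a sketch in plain Python, not EvalKit's API: the names `BENCHMARK_CATEGORIES` and `benchmarks_for` are hypothetical, and assigning "agent tool use" to the tool calling group and "software engineering" to the coding group is an assumption based on the category names above.

```python
# Hypothetical lookup table for illustration; not part of EvalKit's API.
# The category assigned to each benchmark is an assumption drawn from the
# glossary's one-line descriptions.
BENCHMARK_CATEGORIES = {
    "GPQA": "reasoning",
    "AIME": "math",
    "SWE-bench": "coding",
    "Toolathlon": "tool calling",
    "MCP Atlas": "tool calling",
    "MRCR": "long-context",
}


def benchmarks_for(category: str) -> list[str]:
    """Return the benchmark names filed under a category, in table order."""
    return [name for name, cat in BENCHMARK_CATEGORIES.items() if cat == category]


print(benchmarks_for("tool calling"))
```

A category with no listed example, such as multimodal, returns an empty list here.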