
Benchmark

Also called: capability benchmark / leaderboard

A benchmark is a standardized test for AI models: a fixed dataset of inputs paired with a fixed scoring function. Same test, different models, comparable numbers. The headline score on benchmarks like MMLU, HumanEval, GPQA, or SWE-bench is what gets reported at launch and what populates leaderboards.
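
Stripped to its parts, a benchmark is just a frozen dataset plus a frozen scorer. A minimal sketch, with a made-up two-item dataset and exact-match scoring standing in for the real thing:

```python
# Illustrative only: no real benchmark uses this exact format.
DATASET = [
    {"input": "What is 2 + 2?", "answer": "4"},
    {"input": "What is the capital of France?", "answer": "Paris"},
]

def score(model_output: str, answer: str) -> bool:
    # Exact-match scoring; real benchmarks score multiple-choice letters,
    # run unit tests (HumanEval), or check whether a GitHub issue's tests
    # pass (SWE-bench).
    return model_output.strip() == answer

def run_benchmark(model) -> float:
    # Same dataset, same scorer, any model: that is what makes the numbers comparable.
    return sum(score(model(item["input"]), item["answer"]) for item in DATASET) / len(DATASET)
```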

Benchmarks are how the field measures progress. A new model launches; the lab publishes its benchmark scores; the leaderboard updates; rankings shift. Without benchmarks, "this model is better than that one" would be a vibe. With benchmarks, it is a number with confidence intervals.

The catch: benchmarks are easy to game and harder to interpret than the leaderboard format suggests.

Easy to game: if a benchmark's questions and answers exist on the internet (and most do), they likely appear in training data. A model "scoring 95% on MMLU" might have memorized a meaningful fraction of the answers. Labs try to detect contamination via held-out splits, paraphrased variants, or recency-based cuts (LiveCodeBench filters by problem date), but the arms race is permanent.
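
A minimal sketch of the n-gram overlap style of contamination check (the names, the window size of 13, and the whitespace tokenization are illustrative, not any lab's actual pipeline):

```python
def ngrams(text: str, n: int = 13):
    # 13-token windows are a common choice for contamination checks;
    # the exact n and tokenization vary from lab to lab.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

def is_contaminated(benchmark_item: str, training_index: set, n: int = 13) -> bool:
    # Flag an item if any of its n-gram windows appears verbatim in the
    # training-corpus index. Paraphrased leakage slips straight past this,
    # which is one reason the arms race never ends.
    return bool(ngrams(benchmark_item, n) & training_index)

# The index would be built once over the training corpus, e.g.:
# training_index = set().union(*(ngrams(doc) for doc in corpus))
# clean = [q for q in benchmark_items if not is_contaminated(q, training_index)]
```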

Hard to interpret: a 2-point gap on MMLU between two models is often within noise (run-to-run variance from prompt templates, answer extraction, and sampling can swamp it); a 10-point gap means something. The leaderboard format hides which gaps are real. Sub-category breakdowns reveal more: model A might dominate on math while model B dominates on humanities, and the headline number averages the two into apparent equivalence.
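
A back-of-the-envelope way to size the noise band, treating each question as an independent coin flip. This only captures question-sampling error; variance from prompt templates, answer extraction, and decoding settings comes on top, so even a "statistically significant" small gap may not hold up. The accuracies and benchmark sizes below are illustrative:

```python
import math

def gap_ci_half_width(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> float:
    # ~95% half-width for the difference between two accuracies measured on
    # the same benchmark, under a simple binomial model.
    se_a = math.sqrt(acc_a * (1 - acc_a) / n_questions)
    se_b = math.sqrt(acc_b * (1 - acc_b) / n_questions)
    return z * math.sqrt(se_a ** 2 + se_b ** 2)

# Smaller benchmarks have much wider noise bands.
for n in (200, 1_500, 14_000):
    band = 100 * gap_ci_half_width(0.80, 0.82, n)
    print(f"n={n:>6}: a 2-point gap vs a sampling-noise band of +/- {band:.1f} points")
```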

The current "good" benchmarks for serious capability comparison in 2026 include GPQA (graduate-level, Google-proof science questions), MATH (competition mathematics), SWE-bench (real GitHub issues, multi-file edits), HumanEval+ (HumanEval expanded with additional test cases), and MMLU-Pro (a harder MMLU). Older benchmarks (the original MMLU, HellaSwag, ARC) are mostly saturated and serve as warm-ups.

For a real-world AI product, benchmarks are inputs to a decision, not the decision itself. The decision is "does this model handle MY tasks better than the alternative?" For that, you build your own evals (see evals). Benchmarks are how labs compete; evals are how products ship.
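
A minimal sketch of what "build your own evals" looks like in practice. The `call_model` client, the `my_tasks.jsonl` file, and the substring grader are all placeholders you would replace with your own client, your own cases, and your own task-specific checks:

```python
import json

def grade(output: str, expected: str) -> bool:
    # Placeholder grader: substring match. Real products substitute
    # task-specific checks -- regexes, unit tests, or rubric-based judges.
    return expected.strip().lower() in output.lower()

def run_eval(call_model, cases_path: str) -> float:
    # call_model: your own client function, prompt -> completion string.
    # cases_path: JSONL file of {"prompt": ..., "expected": ...} records.
    passed = total = 0
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)
            passed += grade(call_model(case["prompt"]), case["expected"])
            total += 1
    return passed / total if total else 0.0

# Compare candidates on YOUR tasks, not a leaderboard's:
# for name, client in {"candidate-a": client_a, "candidate-b": client_b}.items():
#     print(name, run_eval(client, "my_tasks.jsonl"))
```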

Want the rest?

There are 40 terms total.

See the full glossary