
MMLU

Also called: Massive Multitask Language Understanding, MMLU benchmark

MMLU (Massive Multitask Language Understanding) is a benchmark of ~16,000 multiple-choice questions across 57 academic subjects: history, biology, law, mathematics, etc. Scores are reported as the percentage of questions answered correctly. It is the standard headline number for "general knowledge / reasoning" capability across language models. Frontier models in 2026 score 88-92%.

Released in 2020 by Hendrycks et al., MMLU is the closest thing the field has to an SAT for language models. The questions are multiple-choice with four options each, drawn from real exams and academic problem sets. Scoring is exact-match on the chosen letter. Random guessing would score 25%; humans average around 35% across all subjects (because nobody is an expert in everything). GPT-3 in 2020 scored ~44%. GPT-4 in 2023: ~86%. Frontier models in 2026: 88-92%.
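
To make the protocol concrete, here is a minimal sketch of MMLU-style exact-match scoring, assuming each item is a dict with "question", "options" (four strings), and "answer" (the gold letter). `ask_model` is a hypothetical stand-in for whatever call actually generates the model's answer; real harnesses also prepend few-shot examples and extract the answer letter more defensively.

```python
CHOICES = "ABCD"

def format_question(item: dict) -> str:
    """Render one item in the standard four-option layout."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, item["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def mmlu_score(items: list[dict], ask_model) -> float:
    """Exact-match accuracy on the chosen letter, as a percentage."""
    correct = sum(
        ask_model(format_question(item)).strip().upper()[:1] == item["answer"]
        for item in items
    )
    return 100.0 * correct / len(items)
```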

The benchmark spans subjects like college mathematics, professional law, clinical medicine, computer security, philosophy, and high school physics. Some sub-categories are dramatically easier (high school world history) than others (college mathematics). The aggregated score smooths over those differences.
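
A toy illustration of that smoothing, with invented per-subject accuracies and illustrative question counts (MMLU subjects do differ substantially in size): whether the harness weights every question equally or every subject equally, a single aggregate buries the per-subject spread.

```python
# Invented scores; real per-subject results vary widely by model.
accuracy = {
    "high_school_world_history": 0.95,
    "college_mathematics": 0.62,
    "professional_law": 0.74,
}
# Illustrative per-subject question counts.
n_questions = {
    "high_school_world_history": 237,
    "college_mathematics": 100,
    "professional_law": 1534,
}

# Question-weighted (micro) vs. subject-weighted (macro) aggregation.
micro = sum(accuracy[s] * n_questions[s] for s in accuracy) / sum(n_questions.values())
macro = sum(accuracy.values()) / len(accuracy)
print(f"micro: {micro:.1%}, macro: {macro:.1%}")  # ~76% vs ~77%
# Either aggregate hides the 33-point gap between the best and worst subject.
```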

MMLU is genuinely useful as a coarse capability measure but has known issues:

  • Saturation: frontier scores are now 88-92%, leaving little headroom. At that level, improvements at the top of the leaderboard are noise as much as signal (a back-of-the-envelope follows this list).
  • Memorization: with so many questions widely circulated, contamination of training data is a real worry. Some labs report MMLU on held-out subsets to address this (a simple overlap check is sketched below).
  • Format gaming: models can be tuned to exploit multiple-choice format quirks rather than actually understanding the material.
  • Coverage: 57 subjects sounds broad, but the questions are weighted toward Western, English-language academic curricula.
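
On the saturation point, a quick binomial back-of-the-envelope shows why half-point gaps at the top mean little. This assumes the ~14,000-question test split and independent questions; clustering within subjects makes the real uncertainty larger, so treat the result as a floor.

```python
import math

def mmlu_standard_error(accuracy: float, n: int = 14_042) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points.

    Treats questions as independent, which understates the true
    uncertainty, so this is a lower bound.
    """
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)

print(f"{mmlu_standard_error(0.90):.2f}")  # ~0.25 points, so a 95% CI is ~±0.5
```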

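And on the memorization point, a minimal sketch of the kind of verbatim-overlap check labs run. The 13-gram window follows the convention from the GPT-3 contamination analysis; `training_docs` is a hypothetical iterable of training documents, and production pipelines do this over tokenized shards at vastly larger scale.

```python
def word_ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams of a string, lowercased."""
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_docs, n: int = 13) -> bool:
    """Flag a benchmark question if any of its n-grams appears verbatim
    in the training data."""
    grams = word_ngrams(question, n)
    return any(grams & word_ngrams(doc, n) for doc in training_docs)
```
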
Newer benchmarks are more discriminating at the frontier. MMLU-Pro raises the difficulty with ten answer options and harder, reasoning-heavy questions. GPQA (Graduate-Level Google-Proof Q&A) consists of questions that PhDs in unrelated fields find hard even with web access. Both have displaced MMLU as the headline numbers for serious capability comparison, while MMLU remains the legacy "warm-up" number on every leaderboard.

When you see a model's MMLU score reported, treat it like a high-school transcript: useful at coarse granularity, deceptive at fine granularity.
