
HumanEval

Also called: HumanEval benchmark, code-generation benchmark

HumanEval is a Python coding benchmark from OpenAI: 164 hand-written programming problems in which the model must write a function that passes a hidden test suite. Scores are reported as "pass@1" (the first attempt passes) or "pass@10" (at least one of 10 attempts passes). Frontier models in 2026 score 90-96% on pass@1.

Released alongside Codex in 2021, HumanEval was the first widely adopted code-generation benchmark. Each of the 164 problems gives the model a function signature and docstring; the model must complete the function body. The completion is then evaluated by running unit tests that the model never sees. If the tests pass, the problem counts as solved.
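To make the mechanics concrete, here is a minimal sketch of the scoring loop. The problem, completion, and check function below are invented for illustration (none are from the actual dataset), and the real harness runs completions in an isolated sandbox rather than a bare exec:

```python
# Illustrative HumanEval-style problem; not from the real dataset.
# The model sees only the prompt: a signature plus a docstring.
PROMPT = '''def count_vowels(s):
    """Return the number of vowels (a, e, i, o, u) in the string s."""
'''

# A candidate completion as a model might emit it (function body only).
COMPLETION = '''    return sum(1 for ch in s.lower() if ch in "aeiou")
'''

# Hidden unit tests; the model never sees these.
CHECK = '''def check(candidate):
    assert candidate("hello") == 2
    assert candidate("xyz") == 0
    assert candidate("AEIOU") == 5
'''

def evaluate(prompt: str, completion: str, check: str) -> bool:
    """Concatenate prompt + completion, define it, and run the hidden tests."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)   # defines count_vowels
        exec(check, namespace)                 # defines check
        namespace["check"](namespace["count_vowels"])
        return True                            # all asserts passed
    except Exception:
        return False                           # syntax error, crash, or failed assert

print(evaluate(PROMPT, COMPLETION, CHECK))  # True for this completion
```

pass@1 over the benchmark is then just the fraction of the 164 problems whose first sampled completion makes evaluate return True.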

The format is small but rigorous. Problems include things like "remove duplicates from a list while preserving order," "find the longest palindromic substring," and "implement a basic calculator." None are research-grade; all are the kind of warm-up coding-interview questions a junior engineer should handle. Top models began saturating the benchmark in 2024.

The standard scoring metrics:

  • pass@1: percentage of problems solved on the first attempt.
  • pass@10: percentage solved by at least one of 10 attempts.
  • pass@100: same idea, 100 attempts.

Higher k means more chances, so pass@10 is always >= pass@1. Most leaderboard reporting uses pass@1.
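When a lab samples n > k completions per problem, the original paper estimates pass@k with the unbiased formula 1 - C(n-c, k)/C(n, k), where c is the number of passing samples, averaged over problems. A sketch of that estimator, in the numerically stable product form the paper uses:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), computed as a running product.
    n: samples drawn for one problem; c: how many of them passed."""
    if n - c < k:
        return 1.0  # fewer than k failures, so every size-k draw contains a pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples on one problem, 37 of them passed.
print(pass_at_k(200, 37, 1))   # 0.185, i.e. c/n when k = 1
print(pass_at_k(200, 37, 10))  # larger, consistent with pass@10 >= pass@1
```

Averaging pass_at_k over all 164 problems gives the reported score.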

Limitations of HumanEval:

  • Coverage: 164 Python problems is a thin slice of real-world coding work. SWE-bench and LiveCodeBench cover much more (multi-file edits, debugging, real GitHub issues).
  • Contamination: HumanEval problems and solutions are widely circulated online. Modern models almost certainly saw them during pretraining.
  • No multi-step: the benchmark is one-shot; it does not test the model's ability to iterate, read error messages, or modify code based on test failures.

For meaningful code evaluation in 2026, the field has moved to SWE-bench (real GitHub issues with multi-file edits), LiveCodeBench (recent, unseen problems with date-based contamination filtering), and MBPP+ (a harder cousin of MBPP). HumanEval is the warm-up they all reference.
