Kevin Champlin

Evals

Also called: evaluations, eval suite

Evals are tests for AI models: anything from a single hand-written prompt to a full benchmark suite of thousands of cases. The discipline of "AI evals" became a field of its own in 2023-2024, as labs and product teams realized they could not ship without measuring model behavior systematically.

The word "evals" covers a wide range. At the broadest, an eval is any structured way of measuring model output. At the narrowest, an eval is a specific dataset of prompts paired with expected outputs and a scoring function.
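The narrow definition can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; `run_model` is a hypothetical stand-in for a real model call, and the canned answers exist only to make the sketch runnable.

```python
def run_model(prompt: str) -> str:
    # Hypothetical model call; replace with a real API client.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")

# The dataset: prompts paired with expected outputs.
dataset = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

# The scoring function: here the simplest possible one, exact match.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

scores = [exact_match(run_model(c["prompt"]), c["expected"]) for c in dataset]
print(sum(scores) / len(scores))  # mean score across the dataset
```

Real scoring functions range from exact match to regex checks to model-graded rubrics, but the shape (dataset in, scalar score out) stays the same.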

Three categories tend to come up:

Capability evals: does the model score well on MMLU, HumanEval, GPQA, MATH, or SWE-bench? These are the leaderboard numbers, and they are how labs market their models. Useful for coarse comparison, easy to overfit to, and increasingly saturated at the frontier.

Behavior evals: does the model refuse the right things, hallucinate at acceptable rates, follow instructions in your specific format, stay on-brand? Internal teams build these for their own products. There is no shared standard; each company's "is-the-model-on-brand" eval is bespoke.

Trace evals: does the agent succeed at a multi-step task end-to-end? Did it call the right tools in the right order? Did it terminate when it should have? These are what you build for agentic systems. Frameworks like Braintrust, Patronus, and OpenAI Evals provide infrastructure for running them.
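The trace-eval questions above reduce to assertions over a recorded trace. A hedged sketch, assuming the trace is just an ordered list of tool-call names (real frameworks record much richer structures; every name here is illustrative):

```python
from typing import List

def tools_in_order(trace: List[str], required: List[str]) -> bool:
    """True if `required` appears in `trace` as a subsequence,
    i.e. the right tools were called in the right order."""
    it = iter(trace)
    return all(step in it for step in required)

# A hypothetical recorded trace from one agent run.
trace = ["search_docs", "read_file", "edit_file", "run_tests", "finish"]

# Did it call the right tools in the right order?
assert tools_in_order(trace, ["search_docs", "edit_file", "run_tests"])
# Did it terminate when it should have?
assert trace[-1] == "finish"
```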

The disciplined version of this work is closer to traditional software testing than to anything previously seen in ML. You write test cases (prompts), run the model on them, score the output (sometimes via another model, "LLM-as-judge"), and watch the score over time as you iterate on prompts, fine-tunes, or model versions. CI pipelines for AI products run their evals on every change, just like unit tests.
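That loop (run, score, gate on the result) can be sketched as a CI check. `call_model` and `judge` are placeholders, not real APIs; in practice the judge is often another LLM prompted with a rubric, and the threshold is whatever your last known-good run scored.

```python
def call_model(prompt: str) -> str:
    return "stub output"   # replace with the model under test

def judge(prompt: str, output: str) -> float:
    return 1.0             # replace with an LLM-as-judge rubric score in [0, 1]

# Test cases: prompts drawn from your product's real traffic.
cases = ["Summarize this ticket...", "Draft a refund reply..."]
mean = sum(judge(p, call_model(p)) for p in cases) / len(cases)

THRESHOLD = 0.9  # fail the build on regression, like a unit test
if mean < THRESHOLD:
    raise SystemExit(f"eval regression: mean score {mean:.2f} < {THRESHOLD}")
print(f"evals passed: mean score {mean:.2f}")
```

The point is the gate at the end: a prompt change or model upgrade that drops the score blocks the merge, the same way a failing unit test would.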

If you are building anything with an LLM in production, evals are how you sleep at night. Without them, every prompt change is a guess and every model upgrade is a coin flip.

Want the rest?

There are 40 terms total.

See the full glossary