
Distillation

Also called knowledge distillation or model distillation.

Distillation is the process of training a small "student" model to mimic the outputs of a larger "teacher" model. Result: a much smaller, faster, cheaper model that captures most of the teacher's capability. Small tiers like Haiku and Flash are widely understood to be distilled from their larger siblings; the cheap tiers exist because of distillation.

The intuition: a big trained model has learned a lot, but most of that knowledge is sparse, redundant, or relevant only to edge cases. A smaller model trained to predict what the big model would say, instead of training from scratch on raw text, can capture the useful patterns much more efficiently than it could learn them from the data on its own.

The standard recipe: run the teacher on a corpus of prompts and save its outputs, often including its full probability distribution over the vocabulary at each step, not just the sampled token. Then train a much smaller student model with a loss that penalizes diverging from the teacher's outputs. The student learns to imitate the teacher's "voice" and reasoning patterns at a fraction of the cost.
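Sketched below, in rough PyTorch, is the core of that loss. It assumes Hugging Face-style causal language models whose forward pass returns an object with a .logits tensor of shape (batch, seq, vocab); the function names (distillation_loss, distill_step) and the temperature of 2.0 are illustrative choices, not any lab's actual recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both next-token distributions so that low-probability tokens
    # still carry signal, then penalize the student for diverging from
    # the teacher's distribution.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # The temperature^2 factor keeps gradient magnitudes comparable across
    # temperatures (the scaling from Hinton et al.'s original recipe).
    return kl * temperature ** 2

def distill_step(student, teacher, input_ids, optimizer, temperature=2.0):
    # The teacher is frozen; only the student receives gradients.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice the soft-target term is usually blended with the ordinary cross-entropy loss on the raw training tokens, with the mixing weight and temperature tuned empirically.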

Distillation is what makes the entire economics of frontier AI work. Frontier-tier models (Opus, GPT-5, Gemini 3 Pro) cost serious money per token. Distilled cheap-tier models (Haiku, Mini, Flash) cost orders of magnitude less because they have orders of magnitude fewer parameters. The cheap tiers are not "lobotomized" versions of the frontier; they are learned imitations, retaining most of the practical capability for routine tasks at a small fraction of the cost.

The gap between teacher and student is real but smaller than the price gap suggests. On routine tasks (summarization, classification, simple Q&A) the student often matches the teacher within a few percentage points on benchmarks while costing 10-50x less. The teacher pulls ahead on hard tasks (multi-step reasoning, novel problems, edge cases) where the student's smaller capacity hurts.

Distillation also enables on-device deployment. Phone-class language models in 2026 are distilled from much larger teachers, which is why a 3B-parameter model on your phone can produce surprisingly fluent text: it learned fluency from a 70B-parameter teacher.
