Kevin Champlin

Training

Also called pre-training / model training

Training is the process of adjusting a model's parameters so that its predictions on the training corpus get steadily less wrong. It is the expensive, one-time-ish phase that produces the weights you later run inference on. A frontier model in 2026 takes months on tens of thousands of GPUs and costs tens to hundreds of millions of dollars to train.

The simple version: show the model a sequence, ask it to predict the next token, compute the loss (how wrong the prediction was), backpropagate to adjust the weights, repeat. Do this trillions of times across a corpus of trillions of tokens.
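That loop can be sketched end to end on a toy model. This is a minimal illustration, not how a real LLM is built: a bigram table of logits stands in for the network, the "corpus" is ten tokens instead of trillions, and the softmax cross-entropy gradient is computed by hand rather than via backpropagation through many layers.

```python
import numpy as np

# Toy next-token predictor: a bigram logits table W[prev, next] stands in
# for the model. Loss is cross-entropy on the true next token; for softmax
# cross-entropy, d(loss)/d(logits) = probs - one_hot, so the update is exact.
vocab = 4
W = np.zeros((vocab, vocab))              # the "weights" we are training
corpus = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]  # tiny stand-in training corpus
lr = 0.5

def step(prev, nxt):
    logits = W[prev]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over the vocabulary
    loss = -np.log(probs[nxt])            # cross-entropy: how wrong were we?
    grad = probs.copy()
    grad[nxt] -= 1.0                      # gradient of loss w.r.t. logits
    W[prev] -= lr * grad                  # gradient-descent weight update
    return loss

for epoch in range(200):                  # "repeat, many times"
    for prev, nxt in zip(corpus, corpus[1:]):
        step(prev, nxt)

# After training, token 0 is (almost) always followed by token 1.
probs = np.exp(W[0] - W[0].max())
probs /= probs.sum()
print(probs[1])                           # near 1.0 after training
```

The same show-predict-score-update cycle, scaled up to billions of parameters and trillions of tokens, is pre-training.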

Training has phases. Pre-training is the big one: a generic objective (next-token prediction) on a huge mixed corpus (web text, books, code, papers, scraped content, sometimes synthetic data). The output is a "base model" that has learned language statistics, factual associations, and a tremendous amount of background knowledge but does not yet behave like an assistant.

Fine-tuning is the smaller, specialized phase: continue training on a curated dataset of instruction/response pairs to teach the model how to be a chat assistant, follow instructions, refuse certain requests, and so on. This phase costs a tiny fraction of pre-training but determines most of the model's user-facing personality.
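A sketch of what one fine-tuning example looks like, with loud assumptions: the chat-template markers (`<|user|>`, `<|assistant|>`, `<|end|>`) are invented for illustration, not any specific model's format, and "tokenization" is per-character. The key idea is the loss mask: the model is typically scored only on producing the response, not on re-predicting the prompt.

```python
# Build a fine-tuning example from an instruction/response pair.
# Template markers below are illustrative, not a real model's format.
def build_example(instruction: str, response: str):
    prompt = f"<|user|>{instruction}<|assistant|>"
    full = prompt + response + "<|end|>"
    tokens = list(full)                   # pretend one "token" per character
    # mask[i] == 1 means token i contributes to the loss;
    # prompt tokens are masked out, response tokens are scored.
    mask = [0] * len(prompt) + [1] * (len(tokens) - len(prompt))
    return tokens, mask

tokens, mask = build_example("Say hi.", "Hi!")
print(sum(mask))  # → 10  (the response "Hi!" plus the 7-char end marker)
```

Curate a few hundred thousand pairs like this, continue training with the masked loss, and the base model learns to behave like an assistant.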

RLHF (or constitutional AI) is a third phase that further shapes the model's behavior using preference data rather than direct examples. See rlhf and constitutional-ai.

Training compute scales with model size and corpus size. A rough rule from the Chinchilla paper: optimal training requires ~20 tokens per parameter. A 70B-parameter model wants ~1.4 trillion training tokens to be "compute-optimal." Most modern frontier models train on substantially more than this; the corpus size has grown faster than the parameter count, which is why effective performance keeps improving even as headline parameter counts plateau.
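The arithmetic above, as a back-of-envelope calculation. The C ≈ 6·N·D compute approximation (six FLOPs per parameter per token, counting forward and backward passes) is a standard rule of thumb, not something stated in this entry.

```python
# Chinchilla back-of-envelope: ~20 training tokens per parameter,
# and total compute C ≈ 6 * N * D (a standard approximation).
params = 70e9                  # N: 70B-parameter model
tokens = 20 * params           # D: compute-optimal corpus size
flops = 6 * params * tokens    # C: total training FLOPs

print(f"{tokens:.2e} tokens")  # → 1.40e+12  (1.4 trillion, as stated above)
print(f"{flops:.1e} FLOPs")    # → 5.9e+23
```

Modern frontier runs "overtrain" well past the 20:1 ratio, which trades extra training compute for a smaller, cheaper-to-serve model.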

The carbon and electricity cost of training is real but often misrepresented. A frontier training run is comparable to the lifetime electricity use of a few hundred American homes: significant, not apocalyptic. Inference at scale (millions of users hitting models continuously) ends up being a much larger total energy footprint than training, a fact that gets less attention than it deserves.

Training data is also the legal and ethical hot zone of modern AI. What is in the corpus determines what the model knows; whose work was used and on what terms is the subject of multiple ongoing lawsuits.

Want the rest?

There are 40 terms total.

See the full glossary