Parameters
A parameter is one number inside the model that gets adjusted during training. A model with "70 billion parameters" has 70 billion such numbers, each tuned by gradient descent to minimize prediction error on the training corpus. Parameter count is a rough proxy for capability and a precise proxy for memory + inference cost.
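To make "adjusted during training" concrete, here is a toy sketch of gradient descent acting on a single made-up parameter. The numbers are invented for illustration; a real model just does this to billions of parameters at once.

```python
# Toy illustration: one parameter tuned by gradient descent.
# The "model" is y = w * x, and we nudge w to reduce squared error.

w = 0.5                 # one parameter, initialized arbitrarily
learning_rate = 0.01    # made-up step size for this example

def loss_and_grad(w, x, y_true):
    y_pred = w * x
    loss = (y_pred - y_true) ** 2        # squared prediction error
    grad = 2 * (y_pred - y_true) * x     # d(loss)/d(w)
    return loss, grad

for step in range(100):
    loss, grad = loss_and_grad(w, x=3.0, y_true=6.0)
    w -= learning_rate * grad            # the "nudge": move w against the gradient

print(w)  # approaches 2.0, the value that minimizes the loss on this example
```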
In a transformer, most of the parameters live in the weight matrices of the attention layers and the feed-forward layers, with the rest in the token embeddings and smaller components like layer norms. Training nudges each parameter slightly to reduce the loss; do this billions of times across trillions of training tokens and the parameters end up encoding everything the model has learned about language, facts, and patterns. The model's "knowledge" is just the values of those numbers.
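A rough back-of-the-envelope count of where those parameters sit, assuming a plain dense transformer block with hypothetical dimensions (ignoring embeddings, biases, and layer norms):

```python
def transformer_block_params(d_model: int, d_ff: int) -> dict:
    """Approximate parameter count for one dense transformer block.

    Assumes the classic layout: Q/K/V/output projections in attention
    plus a two-matrix feed-forward network; biases and norms ignored.
    """
    attention = 4 * d_model * d_model      # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * d_ff      # up-projection and down-projection
    return {"attention": attention, "feed_forward": feed_forward}

# Hypothetical 7B-class dimensions: d_model=4096, d_ff=11008, 32 layers.
per_block = transformer_block_params(d_model=4096, d_ff=11008)
total = 32 * (per_block["attention"] + per_block["feed_forward"])
print(f"{total / 1e9:.1f}B parameters in the blocks alone")  # ~5.0B
```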
Parameter count is what people usually mean by "model size." A 7B model has 7 billion parameters; a 70B model has ten times that. Bigger generally helps: more parameters means more capacity to memorize patterns and represent nuance. The relationship is sub-linear (doubling parameters does not double capability) and other factors matter (training data quality, training compute, architecture choices), but for a fixed training recipe, more parameters usually win.
Parameter count is also a precise measure of how much memory the model takes to run. A 70B-parameter model in 16-bit precision is 140 GB of weights to load, which is why bigger models live on bigger GPUs. Inference cost scales roughly with parameter count too: in a dense model, every generated token requires a forward pass that touches every parameter.
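The memory arithmetic is worth making explicit; the only assumption is bytes per parameter (16-bit weights take 2 bytes each):

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory to hold just the weights, before activations or KV cache."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9  # decimal gigabytes, as in the text

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```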
Frontier models from 2024 onward have been increasingly coy about their exact parameter counts. The reasons are competitive (you do not want to telegraph your scaling to the next lab) and practical (mixture-of-experts architectures make a single "X-billion-parameter" figure ambiguous, since only a fraction of the parameters are active for any given token). What you see published as "size" is usually a pricing tier or a vague label rather than a precise number.
For practical purposes: more parameters means a more capable but more expensive model. Smaller models distilled from bigger ones (Haiku from Sonnet, Flash from Pro) close some of the gap at a fraction of the cost.
Related concepts