
Inference

Also called: model inference, serving

Inference is what happens when you actually use a trained model: running input through the network to get output. It is distinct from training, which produces the model in the first place. Inference is what you pay for per token; training is the one-time cost the lab paid and amortizes across millions of inference calls.

The lifecycle of a language model has two distinct phases. Training is the expensive one-time cost: assemble the corpus, run gradient descent for weeks or months on a fleet of GPUs, end up with a fixed set of weights. Inference is the production cost: take a prompt, run it through those fixed weights, return the response. Every chat turn you send is one inference call.

Training a frontier model in 2026 costs tens to hundreds of millions of dollars. Inference for a single chat turn typically costs fractions of a cent. The economics of frontier AI is the economics of "spend a fortune training, recover it across billions of inference calls."
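
To make the amortization concrete, here is a back-of-the-envelope sketch. Every number in it (training cost, lifetime call volume) is an illustrative round-number assumption, not a published figure:

```python
# Illustrative only: round-number assumptions, not published figures.
training_cost_usd = 100e6  # assumed one-time training spend

# The fixed training cost shrinks per call as call volume grows.
for calls in (1e9, 10e9, 100e9):
    per_call = training_cost_usd / calls
    print(f"{calls:.0e} calls -> ${per_call:.4f} of training cost per call")
```

At billions of calls, the amortized training cost per call drops to fractions of a cent, which is exactly the regime the "recover it across billions of inference calls" framing assumes.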

Inference has its own engineering discipline. Optimizations include:

  • KV caching: store the attention key/value tensors from previous tokens so subsequent tokens do not recompute them. A massive speedup for autoregressive generation (see the first sketch after this list).
  • Batching: process many requests in parallel on the same GPU. Cheaper per token, at the cost of slightly higher latency.
  • Speculative decoding: use a small, fast model to draft tokens, then verify them with the big model. Often speeds generation up by 2-3x (see the second sketch below).
  • Prompt caching: store the encoded representation of static prefixes (system prompts, retrieved corpus) so they do not have to be re-processed on each call. See prompt-caching.
  • Quantization: serve the model at lower precision to fit more requests per GPU. See quantization.
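
Here is a minimal sketch of the KV-caching idea in NumPy: one attention head with toy random weights (the dimension and the weights are assumptions for illustration; real serving stacks cache keys and values per layer and per head). The point is that each decode step projects only the new token and reuses everything already in the cache:

```python
import numpy as np

d = 16  # head dimension (toy assumption)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def attend(x):
    """One decode step: project the new token, append its K/V to the
    cache, then attend over all cached positions. Earlier tokens'
    keys and values are reused, never recomputed."""
    q = x @ W_q
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)                # (seq_len, d)
    V = np.stack(v_cache)                # (seq_len, d)
    scores = K @ q / np.sqrt(d)          # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over cached positions
    return weights @ V                   # attention output for this token

x = rng.standard_normal(d)
for step in range(5):
    x = attend(x)                        # only the new token is projected
    print(f"step {step}: cache holds {len(k_cache)} key/value pairs")
```

Without the cache, step n would redo the key/value projections for all n previous tokens; with it, each step does constant projection work plus one attention pass over the stored entries.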
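
Speculative decoding can be sketched the same way. Both "models" below are stand-in toy functions (a deterministic target and a draft that agrees with it most of the time), and the acceptance rule is the simple greedy one; real implementations verify all drafted tokens in a single batched target pass and use a probabilistic accept/reject that preserves the target model's output distribution:

```python
import random

random.seed(0)
VOCAB = list("abcdefgh")

def target(context):
    # Toy stand-in for the big model: deterministic next "token".
    return VOCAB[hash(context) % len(VOCAB)]

def draft(context):
    # Toy stand-in for the small model: agrees ~80% of the time here.
    return target(context) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then check them against the target.
    Accept the longest agreeing prefix; on the first mismatch, keep
    the target's token instead and stop."""
    proposed, ctx = [], context
    for _ in range(k):
        proposed.append(draft(ctx))
        ctx += proposed[-1]
    accepted, ctx = [], context
    for tok in proposed:
        correct = target(ctx)        # in practice: one batched target pass
        if tok == correct:
            accepted.append(tok)
            ctx += tok
        else:
            accepted.append(correct)  # the target overrides the draft
            break
    return accepted

out = ""
while len(out) < 20:
    out += "".join(speculative_step(out))
print(out)
```

When the draft model usually agrees with the target, each expensive target pass validates several tokens at once instead of producing just one, which is where the 2-3x speedup comes from.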

The inference stack is what determines latency, throughput, and cost. Two labs can ship similarly capable models with very different inference economics depending on their serving infrastructure. Anthropic's prompt caching, for example, makes long-context conversations dramatically cheaper than the same workload on a provider without it.

When you read "tokens per second" or "queries per second" in a model spec, that is an inference performance number. Training performance is a separate axis (and rarely published).
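
To connect a tokens-per-second number to cost, one more back-of-the-envelope sketch; the GPU rental rate, batch size, and per-stream decode speed are all assumed round numbers:

```python
# All three inputs are illustrative assumptions.
gpu_cost_per_hour = 2.50          # assumed hourly GPU rental rate, USD
batch_size = 32                   # assumed concurrent requests per GPU
tokens_per_sec_per_stream = 50    # assumed per-request decode speed

gpu_tokens_per_hour = batch_size * tokens_per_sec_per_stream * 3600
cost_per_million_tokens = gpu_cost_per_hour / gpu_tokens_per_hour * 1e6
print(f"{gpu_tokens_per_hour:,.0f} tokens/hour per GPU")
print(f"${cost_per_million_tokens:.2f} per million output tokens")
```

Double the batch size (memory and latency permitting) and the per-token cost roughly halves, which is why batching and quantization appear together in serving stacks.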
