Transformer
The transformer is the neural network architecture every modern language model is built on. It replaced recurrent networks in 2017 with one core idea: process all tokens of a sequence at once, with an attention mechanism that lets every token look at every other token. That parallelism is what made training at billion-token scale tractable.
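To make "every token looks at every other token" concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. The shapes, names, and toy sizes are illustrative only, not drawn from any particular library; real implementations add multiple heads, masking, and learned projections.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position scores every other
    position, turns the scores into weights, and mixes the values.
    Q, K, V are (seq_len, d) arrays; names follow the standard notation."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (seq_len, seq_len) affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the other positions
    return weights @ V                               # each row is a weighted mix of values

# Toy example: 4 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = attention(x, x, x)   # self-attention: queries, keys, values from the same sequence
print(out.shape)           # (4, 8)
```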
A transformer is a stack of identical blocks. Each block contains an attention layer (which decides, for every position, which other positions matter and how much) and a feed-forward layer (which transforms each position independently). Information flows up the stack, layer by layer, with residual connections preserving the original signal alongside the transformations. The output of the last block is converted into a probability distribution over the vocabulary; sample one token, append it, and run the whole stack again to predict the next.
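The sketch below shows that control flow end to end, again in NumPy with made-up toy dimensions and random weights standing in for trained ones: a stack of blocks, each wrapping attention and a feed-forward layer in residual connections, then a projection to vocabulary probabilities and a sample-append-repeat loop. Layer normalization, positional information, and the causal mask (shown further down) are left out to keep the structure visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_blocks = 16, 50, 4   # toy sizes; real models use thousands and billions

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W):
    # Project to queries/keys/values, score every pair of positions, mix the values.
    q, k, v = x @ W["q"], x @ W["k"], x @ W["v"]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v @ W["o"]

def feed_forward(x, W):
    # Position-wise MLP: transforms each position independently.
    return np.maximum(x @ W["up"], 0.0) @ W["down"]

def transformer_block(x, W):
    x = x + self_attention(x, W)   # residual connection around attention
    x = x + feed_forward(x, W)     # residual connection around the feed-forward layer
    return x

def init_block():
    proj = {name: rng.normal(0, 0.02, (d, d)) for name in ("q", "k", "v", "o")}
    mlp = {"up": rng.normal(0, 0.02, (d, 4 * d)), "down": rng.normal(0, 0.02, (4 * d, d))}
    return proj | mlp

blocks = [init_block() for _ in range(n_blocks)]
embed = rng.normal(0, 0.02, (vocab, d))     # token embedding table
unembed = rng.normal(0, 0.02, (d, vocab))   # final projection to vocabulary logits

def generate(tokens, n_new):
    for _ in range(n_new):
        x = embed[tokens]                   # (seq_len, d)
        for W in blocks:
            x = transformer_block(x, W)     # information flows up the stack
        probs = softmax(x[-1] @ unembed)    # distribution over the next token
        tokens = tokens + [int(rng.choice(vocab, p=probs))]  # sample, append, repeat
    return tokens

print(generate([1, 2, 3], n_new=5))
```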
Modern frontier models are decoder-only transformers: they generate left-to-right, never looking ahead. Encoder-decoder variants (the original 2017 design) live on in translation and a few specialized tasks. The decoder-only design is what powers GPT, Claude, Gemini, Llama, and effectively every chat-shaped model you have heard of.
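"Never looking ahead" is enforced mechanically with a causal mask: each position's attention scores for later positions are set to negative infinity before the softmax, so future tokens receive zero weight. A minimal illustration, with uniform scores for clarity:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions before the softmax so each token can only
    attend to itself and earlier tokens."""
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(np.round(causal_attention_weights(np.zeros((4, 4))), 2))
# Row i spreads its weight only over positions 0..i:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```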
The transformer's key win over its predecessors was scalability. RNNs process tokens sequentially, which means training cannot easily parallelize across the time dimension. Transformers process the whole sequence in parallel, which means a GPU can saturate, which means you can throw billions of training tokens at the model and have it finish in weeks instead of years. The 2017 paper "Attention Is All You Need" is one of the most-cited papers of the decade for exactly that reason.
When you read about a model's "size," what is actually being measured is the parameter count of the transformer (typically a stack of 30-100 blocks). When you read about its "context window," what is being measured is how long a sequence the attention layers can handle before their cost, which grows with the square of the sequence length, becomes impractical. When you read about "training compute," that is the cost of running the transformer over the training corpus.
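For a rough sense of where the parameter count comes from, the sketch below adds up the main weight matrices of a standard decoder-only configuration. It ignores biases, normalization, and positional parameters (which are comparatively tiny), and the config values are just one illustrative setting, roughly matching GPT-2 small's published shape.

```python
def estimate_params(n_blocks, d_model, d_ff, vocab_size):
    """Back-of-envelope transformer parameter count from its main matrices.
    Exact numbers vary with architecture details (gated MLPs, untied
    embeddings, grouped-query attention, and so on)."""
    attention = 4 * d_model * d_model          # query, key, value, output projections
    mlp = 2 * d_model * d_ff                   # feed-forward up- and down-projections
    embeddings = vocab_size * d_model          # token embedding table (tied output head)
    return n_blocks * (attention + mlp) + embeddings

# A GPT-2-small-like config: 12 blocks, d_model=768, d_ff=4*768, ~50k vocabulary.
print(f"{estimate_params(12, 768, 4 * 768, 50_257):,}")   # ~123.5 million, close to the reported ~124M
```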
Related concepts