Kevin Champlin

Attention

Also called: self-attention, attention mechanism, multi-head attention

Attention is the mechanism inside a transformer that lets every token "look at" every other token in the sequence and decide how much each one matters. It is what makes language models good at long-range dependencies: connecting a pronoun to a noun ten sentences earlier, or pulling a number from page one of a document into an answer on page forty.

Mechanically, attention computes three vectors for every token: a query, a key, and a value. The dot product of one token's query with another token's key (scaled by the square root of the key dimension) measures how relevant that other token is to this one. Softmax normalizes those scores into weights, and the weighted sum of value vectors becomes the token's new representation. Run many of these operations in parallel ("multi-head attention"), and each head learns to focus on a different kind of relationship: syntactic, semantic, positional, or stylistic.
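The query/key/value computation above fits in a few lines of NumPy. This is a minimal single-head sketch (function names and the toy dimensions are mine, not from any particular library); real implementations add learned projection matrices, masking, and batching:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K, V: (seq_len, d) arrays of query, key, and value vectors.
    Returns new (seq_len, d) token representations.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

# Toy example: 4 tokens with 8-dimensional vectors. Self-attention means
# queries, keys, and values all come from the same token representations.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = attention(x, x, x)
print(out.shape)  # (4, 8)
```

Multi-head attention runs several copies of this in parallel, each with its own learned projections of Q, K, and V, and concatenates the results.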

The cost of attention scales as the square of the sequence length. Twice as many tokens, four times the compute. This is why context windows have practical limits: the math gets expensive fast. Tricks like sliding-window attention, sparse attention, and various forms of approximate attention are what enable million-token context windows on modern frontier models without the cost exploding.

Attention is the headline contribution of the 2017 transformer paper, "Attention Is All You Need." Before attention, sequence models had to compress all earlier context into a fixed-size hidden state, which throttled long-range reasoning. With attention, the model can in principle look anywhere in the input; the catch is computational cost rather than information loss. That single architectural choice is the load-bearing structure of every modern language model.

When you hear people say "the model is attending to" some part of the input, they are usually being literal. There are tools that visualize attention weights and they reveal real patterns: a model on a question-answering task often shows attention spikes from the question tokens to the relevant supporting passage in the document. The mechanism is not a metaphor.

Want the rest?

There are 40 terms total.

See the full glossary