
Top-k

top-k
Also called top_k sampling

Top-k restricts the model to picking the next token from only the k highest-probability candidates. With k=50, the model considers its top 50 guesses and ignores everything else. Older sibling of top-p; less adaptive but conceptually simpler.

Like top-p, top-k is a way to truncate the long tail of the probability distribution before sampling. Unlike top-p, it always keeps exactly k candidates regardless of the shape of the distribution.
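
A minimal sketch of the mechanism in Python, using a made-up six-token distribution rather than anything model-specific:

```python
import numpy as np

def top_k_sample(probs, k, rng=None):
    """Sample one token id from the k highest-probability candidates."""
    rng = rng or np.random.default_rng()
    # Indices of the k largest probabilities.
    top_idx = np.argpartition(probs, -k)[-k:]
    # Renormalize the surviving mass so it sums to 1, then sample from it.
    top_probs = probs[top_idx] / probs[top_idx].sum()
    return int(rng.choice(top_idx, p=top_probs))

# Toy 6-token vocabulary: with k=3, only the three most likely tokens can ever be drawn.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(top_k_sample(probs, k=3))
```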

This is its main weakness. When the model is highly confident in one token (say 95% probability), top-k=50 still pulls in 49 other tokens that together hold only the remaining 5% of the mass and arguably should not be candidates. The truncation is too generous. Conversely, when the model is uncertain (no token over 3%), top-k=50 may be cutting off real candidates at position 51 and beyond that should still be in play. The truncation is too aggressive. Top-p avoids both failure modes by adapting to the distribution.
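
A quick numerical illustration of both failure modes, again with toy distributions (the peaked and flat shapes here are just stand-ins):

```python
import numpy as np

def top_p_set_size(probs, p):
    """How many tokens top-p keeps: the smallest high-probability prefix covering mass p."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)

vocab = 1000
# Confident model: one token at 95%, the rest splitting the remaining 5%.
peaked = np.full(vocab, 0.05 / (vocab - 1))
peaked[0] = 0.95
# Uncertain model: no token above a fraction of a percent.
flat = np.random.default_rng(0).dirichlet(np.ones(vocab) * 5)

print(top_p_set_size(peaked, 0.95))  # 1 candidate; top-k=50 would still keep 50
print(top_p_set_size(flat, 0.95))    # hundreds of candidates; top-k=50 would cut most of them
```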

Top-k is still useful as a hard upper bound. Setting top-k=100 alongside top-p=0.95 prevents pathological distributions from including thousands of long-tail tokens. The two are complementary: top-p is the primary truncation, top-k is the safety rail.
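
One way the two filters might be composed, as a sketch (real implementations differ in ordering, tie-breaking, and where temperature is applied):

```python
import numpy as np

def truncate(probs, top_p=0.95, top_k=100):
    """Top-p as the primary cut, top-k as a hard cap on the candidate pool."""
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches top_p...
    cutoff = int(np.searchsorted(cumulative, top_p) + 1)
    # ...but never more than top_k candidates, however flat the distribution is.
    cutoff = min(cutoff, top_k)
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()                   # renormalized truncated distribution
```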

Most production chat APIs default top-k to a high value (200, 500, or unlimited) and rely on top-p + temperature for the actual sampling shape. Setting top-k explicitly matters more when you are building local inference pipelines or want deterministic-ish behavior with a known candidate pool.
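
For local pipelines, that usually means passing the sampling arguments explicitly at generation time; for example, with the Hugging Face transformers generate() API (the model choice and values here are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,         # hard cap on the candidate pool
    top_p=0.95,       # primary truncation
    temperature=0.8,
    max_new_tokens=20,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```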

The intuition for k values: very low (k=1 to 5) is greedy or near-greedy, useful for code completion and fact retrieval. Mid (k=40 to 100) is a reasonable creative-writing setting if you are not using top-p. High (k=500+) is effectively no truncation. k=1 is equivalent to greedy decoding: always pick the most probable token. Setting k to the vocabulary size (or leaving it unlimited) samples from the full distribution.
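
For a rough feel of what those regimes keep, here is a purely illustrative Zipf-shaped toy distribution (real model distributions vary widely by context):

```python
import numpy as np

# Toy distribution over a 1,000-token vocabulary, shaped like a Zipf curve.
probs = 1.0 / np.arange(1, 1001)
probs /= probs.sum()

sorted_probs = np.sort(probs)[::-1]
for k in (1, 5, 50, 500):
    print(k, round(float(np.cumsum(sorted_probs)[k - 1]), 3))
# k=1 keeps ~13% of the mass (the greedy pick), k=500 already covers ~91%:
# past a point, raising k barely changes what can be sampled.
```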
