Top-p
Top-p (nucleus sampling) restricts the model to picking the next token from the smallest set of candidates whose cumulative probability reaches at least p. With p=0.9, the model considers only the most probable tokens that together cover 90% of the probability mass; the long tail is ignored. It is smarter than top-k because the cutoff adapts to how peaky the distribution is.
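In code, the mechanism is just a sort, a cumulative sum, and a cutoff. A minimal numpy sketch; the function name and signature are illustrative, not any particular library's API:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample one token index with nucleus (top-p) filtering.

    probs: 1-D array of next-token probabilities summing to 1.
    Keeps the smallest set of highest-probability tokens whose
    cumulative mass reaches p, renormalizes, and samples from it.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest count whose mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

Note that with p=1.0 the nucleus is the whole vocabulary, so this degenerates to plain sampling from the full distribution.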
Imagine the model has just produced a probability distribution over its 100,000-token vocabulary. Most of the mass is concentrated in maybe 10-50 tokens; the rest is a long tail of near-zero possibilities. You do not want the model picking from the tail (that is where hallucination and nonsense live). Top-p is the cleanest way to cut the tail off.
The cutoff adapts to the distribution. When the model is very confident (one token has 95% probability), top-p with p=0.9 means the model picks that one token and only that one. When the model is uncertain (no token over 5%, mass spread across many candidates), top-p with p=0.9 might leave 30 candidates in play. Top-k cannot do that: it always considers exactly k candidates, regardless of how peaky the distribution is. In a peaky distribution, top-k pulls in unlikely tokens unnecessarily; in a flat distribution, top-k cuts off too aggressively. Top-p avoids both failure modes.
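To make the adaptive behavior concrete, here is a small self-contained sketch (numpy again, names illustrative) that counts the nucleus size for a peaky and a flat distribution:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """How many tokens survive top-p filtering for this distribution."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p)) + 1

peaky = np.array([0.95] + [0.05 / 99] * 99)   # one token dominates
flat = np.full(100, 1 / 100)                  # mass spread evenly

print(nucleus_size(peaky))  # 1: the confident token alone clears p=0.9
print(nucleus_size(flat))   # ~90 (float rounding can shift this by one)
# A fixed top-k, say k=40, would keep 40 candidates in both cases:
# too many for the peaky distribution, too few for the flat one.
```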
Most modern chat APIs default top-p to somewhere between 0.9 and 1.0. Setting top-p to 1.0 means "no nucleus filtering": the model can pick from the full distribution (still subject to temperature). Setting it lower (say, 0.5) makes the model very conservative, picking from only the most probable tokens.
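As one concrete example, here is how the parameter looks in the openai Python SDK; other providers expose an equivalent top_p knob, and the model name here is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you actually call
    messages=[{"role": "user", "content": "Give me three taglines for a bakery."}],
    temperature=0.7,
    top_p=0.9,  # sample only from the smallest set covering 90% of the mass
)
print(response.choices[0].message.content)
```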
Top-p combines with temperature. Temperature reshapes the distribution; top-p truncates it. They work in complementary ways. In practice, tuning both at once gets confusing. If you do not know what you want, lower temperature is usually a more direct lever for "be more conservative" than tweaking top-p.
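Order matters: in most implementations temperature is applied first (it reshapes the distribution) and top-p second (it truncates the reshaped one). A self-contained numpy sketch of that pipeline, with illustrative names:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, p=0.9, rng=None):
    """Temperature reshapes, then top-p truncates (illustrative, not a real API)."""
    rng = rng or np.random.default_rng()
    # Step 1, temperature: scale logits, then softmax.
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    # Step 2, top-p: keep the smallest set covering mass p, renormalize, sample.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))
```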
The 2019 paper by Holtzman et al., "The Curious Case of Neural Text Degeneration", introduced nucleus sampling and showed that it dramatically reduced repetitive, low-quality output compared to other decoding strategies. Most modern systems now use it by default.
Related concepts