Extended thinking

extended-thinking

Also called reasoning mode/ chain of thought/ thinking budget

Extended thinking is a mode where the model spends extra inference compute on a hidden internal reasoning trace before producing its visible answer. Trades latency and cost for accuracy on hard problems (math, code, multi-step reasoning). Sold as o1, Claude extended thinking, Gemini deep think, etc.

Photo: Boris Hamer / Pexels

A standard model produces its visible answer in one forward pass: sample tokens, emit them, done. An extended-thinking model first produces an internal reasoning trace (sometimes called "thinking tokens") that does not appear in the visible response. Only after the model has worked through the problem in the hidden trace does it commit to a final answer.

The trace can be short (a few hundred tokens) or long (tens of thousands), depending on a "thinking budget" you can usually set. More thinking, more cost, more latency, generally more accuracy on hard problems. The relationship is not linear: doubling the budget might give 5% accuracy on most tasks. On hard math problems it might give 20%. The wins are concentrated where step-by-step deduction matters.

The trade-offs:

Cost: thinking tokens are billed (typically at output rates). Long-thinking responses can cost dollars per turn instead of cents.
Latency: an extended-thinking response can take 30 seconds to several minutes versus 2-5 seconds for standard chat. Acceptable for batch work, painful for live chat.
Visibility: the trace is usually hidden from the end user (and sometimes from the API). You see "thinking..." or just a longer wait, then the final answer.

Extended thinking is most useful for: math (the answer is in the working), code that requires planning, multi-step logical reasoning, and "chain of thought" style decompositions. It is less useful for: simple lookups, conversational chat, creative writing.

This site does not expose extended thinking on the public chat (cost discipline). When the curated /can and /cannot demos use it, the difference shows clearly: a model with thinking budget catches the cognitive-reflection trap that a model without it falls for. Same model. Different mode. Different answer.

Related concepts

Tokens

tokens

Latency to first token

latency-to-first-token

Want the rest?

There are 40 terms total.

See the full glossary