Kevin Champlin

Latency to first token

Also called: TTFT, first token latency, time to first token

Latency to first token (TTFT) is the time between sending a request and seeing the first token of the response start to stream back. It is what makes a chat interface feel "fast" or "slow" to a user, regardless of how long the full response takes.

Total response time and TTFT measure two different user experiences. Total response time is what you pay for in compute. TTFT is what determines whether the application feels alive. Streaming a 2,000-token response with 200ms TTFT and 12 seconds of total time will feel responsive and fluent. The same response with 4 seconds of TTFT and 8 seconds of total time will feel like the application is broken, even though it is technically faster overall.
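The distinction is easy to see if you time a stream client-side. A minimal sketch, with a simulated token stream standing in for a real API (the names `fake_stream` and `measure` and all timing constants are illustrative):

```python
import time

def fake_stream(ttft_s, n_tokens, per_token_s):
    """Simulated token stream: a prefill delay, then tokens at a steady rate."""
    time.sleep(ttft_s)  # stands in for the model reading the prompt
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"

def measure(stream):
    """Return (ttft, total) in seconds for any token iterator."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
    return ttft, time.perf_counter() - start

ttft, total = measure(fake_stream(0.05, 20, 0.005))
```

The same `measure` function works on any real streaming response, since it only needs an iterator of chunks: TTFT is the gap before the first chunk, total time is the gap before the last.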

Several things shape TTFT:

- Model size: larger models are slower to produce the first token.
- Prompt length: the model has to read your full input before producing the first output token.
- Prompt caching: cache hits radically reduce TTFT because the model is not re-reading the cached prefix.
- Infrastructure routing: which region the request lands in.
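The interaction of prompt length and caching can be captured in a toy model: TTFT is roughly a fixed overhead plus prefill time over the uncached part of the prompt. All constants below are made-up placeholders, not measurements of any real model:

```python
def estimate_ttft(prompt_tokens, cached_prefix_tokens=0,
                  prefill_s_per_token=0.0004, overhead_s=0.15):
    """Toy TTFT model: fixed overhead + prefill over the *uncached* prompt.

    The per-token cost and overhead are illustrative assumptions.
    """
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return overhead_s + uncached * prefill_s_per_token

cold = estimate_ttft(10_000)                               # no cache hit
warm = estimate_ttft(10_000, cached_prefix_tokens=9_500)   # mostly cached
```

Under these toy numbers, caching 9,500 of 10,000 prompt tokens drops the estimate from about 4.15 s to about 0.35 s, which is the kind of order-of-magnitude difference a cache hit produces in practice.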

The side rail on every chat turn here shows TTFT alongside total latency. If TTFT is high but total latency is reasonable, the application is functioning correctly but the model has a lot of input to chew through before responding. If TTFT is low and total latency is also low, you are probably hitting the prompt cache.
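That reading of the side rail can be expressed as a rough heuristic. The function and its thresholds are illustrative, not calibrated to any particular model or deployment:

```python
def diagnose(ttft_s, total_s):
    """Rough interpretation of one chat turn's timing (thresholds are assumptions)."""
    if ttft_s <= 0.5:
        # fast start: little to prefill, or the prefix was cached
        return "fast start: short input or prompt-cache hit"
    if total_s - ttft_s <= ttft_s:
        # most of the wall time went to reading the input, not writing the output
        return "slow start dominates: large uncached input to prefill"
    return "slow start, long generation: big input and long response"

print(diagnose(0.2, 10.0))
print(diagnose(4.0, 8.0))
```

The second case matches the "feels broken" example above: 4 seconds of TTFT against 8 seconds total means half the wait happened before the user saw anything.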

Related concepts

Streaming
Tokens
