Cache read
cache-read
Cache read tokens are input tokens that were served from the prompt cache rather than processed fresh. They are billed at roughly 10 percent of the normal input price. A high cache read count on a turn means the conversation is reusing a static prefix efficiently.
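To make the discount concrete, here is a minimal cost sketch. The base price and the exact 10x multiplier below are illustrative assumptions, not quoted pricing; the point is only how fresh and cached input tokens split a turn's bill.

```python
# Illustrative numbers, not real pricing.
INPUT_PRICE_PER_MTOK = 3.00   # assumed base input price, dollars per million tokens
CACHE_READ_MULTIPLIER = 0.1   # cache reads billed at roughly 10% of the base rate

def turn_input_cost(fresh_tokens: int, cache_read_tokens: int) -> float:
    """Dollar cost of one turn's input, split into fresh vs. cached tokens."""
    fresh = fresh_tokens * INPUT_PRICE_PER_MTOK / 1_000_000
    cached = cache_read_tokens * INPUT_PRICE_PER_MTOK * CACHE_READ_MULTIPLIER / 1_000_000
    return fresh + cached

# A turn with 200 fresh input tokens and 1,800 tokens served from cache:
cost = turn_input_cost(200, 1_800)
```

Under these assumed prices, the 1,800 cached tokens cost less than the 200 fresh ones, which is why a high cache read count is good news on the bill.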
When the side rail of a chat turn shows cache_read: 1.8K, that is 1,800 input tokens the model did not have to re-process from scratch. The cache hit means we wrote that prefix earlier in the session and it has not changed since, so the model loaded it from cache and billed it at a small fraction of the normal input rate.
This number is the single best signal that an application is using caching well. If it is consistently zero across turns of the same conversation, the cache is missing on every request. That usually means the prompt prefix is changing slightly each turn (a stray timestamp, a re-ordered document), so the cache is being silently rewritten instead of read.
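The failure mode is easy to model: a prompt cache matches on an exact byte-for-byte prefix, so anything volatile near the top of the prompt defeats it. The sketch below keys a pretend cache on a hash of the prefix; the hashing, the system string, and the per-turn counter standing in for a timestamp are all illustrative, not how any real cache is implemented.

```python
import hashlib

def cache_key(prefix: str) -> str:
    """A prompt cache matches on the exact prefix; model that as a hash of its bytes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

SYSTEM = "You are a support assistant for Acme Corp."  # hypothetical static instructions

# Stable prefix: identical bytes every turn, so the same key repeats -> cache reads.
stable_keys = {cache_key(SYSTEM) for _ in range(3)}

# Unstable prefix: a per-turn "timestamp" makes every prefix unique -> a miss each turn,
# and the cache is rewritten instead of read.
unstable_keys = {cache_key(f"[ts={turn}] {SYSTEM}") for turn in range(3)}
```

Three turns with a stable prefix collapse to one cache entry; three turns with a timestamp in front produce three distinct entries, i.e. zero reads.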
A healthy chat application on this site shows cache_read climbing as a conversation goes deeper, while cache_creation drops to near zero after the first turn. That is the cache doing exactly what it is supposed to do.
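That healthy shape can be checked directly from per-turn usage numbers. The figures and field names below are made up for illustration; the pattern to look for is a hit rate near zero on turn one that climbs toward one as the conversation deepens.

```python
# Hypothetical per-turn usage from a healthy conversation (illustrative numbers).
turns = [
    {"cache_creation": 1800, "cache_read": 0,    "fresh_input": 200},  # turn 1 writes the prefix
    {"cache_creation": 150,  "cache_read": 1800, "fresh_input": 120},  # later turns read it back
    {"cache_creation": 90,   "cache_read": 1950, "fresh_input": 110},
]

def hit_rate(turn: dict) -> float:
    """Fraction of this turn's input tokens that were served from the prompt cache."""
    total = turn["cache_creation"] + turn["cache_read"] + turn["fresh_input"]
    return turn["cache_read"] / total

rates = [round(hit_rate(t), 2) for t in turns]
```

A monotonically rising hit rate with shrinking cache_creation is the signature of the cache working; flat zeros in `rates` after turn one would point back at an unstable prefix.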
Related concepts