
Prompt caching

Also called: prompt cache, cache

Prompt caching is a feature that lets the provider store a static prefix of your prompt (system instructions, retrieved documents, conversation history) so that subsequent requests re-read those tokens at a fraction of the normal input price. Used well, it cuts input cost by 70 to 90 percent across a session.

A modern chat application sends the same long preamble (system prompt, persona, retrieved corpus) on every turn of a conversation. Without caching, you pay full price to read those static tokens every single time. With caching, the first request pays a small premium to write the cache, and every subsequent request that hits that cache pays roughly 10 percent of normal input price for those same tokens.
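To make the arithmetic concrete, here is a back-of-envelope sketch in Python. The prices and multipliers are illustrative assumptions (fresh input at $3 per million tokens, a 25 percent write premium, reads at 10 percent of fresh price), not any provider's actual price sheet:

```python
# Back-of-envelope cost of a 10-turn chat, with and without caching.
# All prices here are illustrative assumptions, not a real price sheet.
BASE  = 3.00 / 1_000_000    # $ per token for a fresh input read
WRITE = 1.25 * BASE         # cache write carries a small premium
READ  = 0.10 * BASE         # cache read at roughly 10% of fresh price

prefix_tokens = 8_000       # static preamble: system prompt + retrieved docs
turn_tokens   = 200         # volatile tail: the user's latest message
turns         = 10

uncached = turns * (prefix_tokens + turn_tokens) * BASE
cached = (prefix_tokens * WRITE                    # turn 1 writes the cache
          + (turns - 1) * prefix_tokens * READ     # turns 2..10 hit it
          + turns * turn_tokens * BASE)            # volatile tokens stay fresh

print(f"uncached: ${uncached:.3f}")                # $0.246
print(f"cached:   ${cached:.3f}")                  # $0.058
print(f"saved:    {1 - cached / uncached:.0%}")    # ~77%, inside the 70-90% band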

The cache is content-addressed. If you change a single character in the cached prefix, the cache misses and you re-pay to write a new one. This means cache strategy matters: put stable content (system prompt, persona, big retrieved documents) at the very start of your prompt, and put volatile content (the user's latest message) at the end. The cache breakpoint is set explicitly in the API call.
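As one concrete instance, here is a minimal sketch assuming Anthropic's Messages API, one provider that exposes explicit breakpoints via a cache_control block. The model name and preamble are placeholders:

```python
import anthropic

# Placeholder preamble; in practice this is your system prompt plus
# retrieved documents, kept byte-identical across turns so the cache hits.
STABLE_PREFIX = "You are the assistant for this site.\n\n<docs>...</docs>"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,
            # The breakpoint: everything up to and including this block is
            # cached. An identical prefix on the next request reads it at
            # the discounted rate; any edit to it forces a fresh write.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Volatile content goes after the breakpoint, so it never invalidates it.
    messages=[{"role": "user", "content": "latest user message"}],
)

# usage separates writes from hits: cache_creation_input_tokens on the
# first request, cache_read_input_tokens when a later request hits.
print(response.usage)
```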

This site caches its system prompt on every chat turn. The /cost-of-mind dashboard publishes the cache hit rate and the cumulative dollars saved. When you see cache_read: 1.8K on the side rail of a chat turn, that is 1,800 input tokens that were charged at roughly a tenth of fresh input price. The transparency is the point.

