Caching Across Multi Turn Conversations

The hidden cost of a conversational application is that conversations grow, and on every turn the model has to be told everything that came before. A chat assistant on its tenth exchange is not processing one short question, it is processing the system prompt, plus nine previous user messages, plus nine previous responses, plus the new question, all as fresh input. Without caching, you pay full input rate on that entire accumulated transcript every single turn, which means the cost of a conversation does not grow linearly with its length, it grows roughly with the square of it, because each turn re reads a history that keeps getting longer. For any product where users have long sessions, this is one of the largest and least understood lines in the token bill. Caching is the lever that fixes it, but multi turn caching has its own rules, and getting them wrong leaves most of the saving on the table.

Why long conversations cost so much

It is worth making the arithmetic vivid because the intuition is wrong. People assume a ten turn conversation costs ten times a one turn conversation. It does not. Turn one sends a small history. Turn two sends turn one plus its answer. By turn ten, each new request is carrying nine turns of accumulated context, and the sum of all those re reads across the conversation is far more than ten times the first turn. The system prompt alone, often a large block of instructions, gets re sent on every single turn, so a two thousand token system prompt in a twenty turn conversation has been paid for twenty times. The longer and more popular your conversations, the more this dominates, and it is invisible on a per request basis because each individual call looks reasonable. Only the aggregate reveals it.

How caching changes the multi turn economics

Caching attacks this directly. The stable parts of the conversation, above all the system prompt, and then the earlier turns that no longer change, can be cached so that re reading them costs a small fraction of the full rate rather than the whole thing. Instead of paying full price to re send the system prompt on turn twenty, you pay the cheap read rate, and the same applies to the frozen history. Because the history only ever grows at the end, the bulk of it, everything except the newest turn, is identical from one request to the next, which is exactly the condition caching rewards. Applied well, multi turn caching turns the quadratic cost curve back toward something close to linear, because the expensive re reading of unchanged history is now cheap. On a high volume conversational product, that is a transformative saving rather than a marginal one.

Where multi turn caching leaks

The saving depends on the conversation being structured so the cacheable region stays identical, and several common patterns break that. The first is putting anything variable above the stable content: a per turn timestamp, a changing session id, or a dynamically rewritten system prompt at the top means the cache invalidates every turn and you are back to paying full rate. The second is editing or summarizing earlier history mid conversation. The moment you rewrite a turn that was already cached, everything from that point forward invalidates, because the prefix changed. The third is inconsistent formatting of the transcript between turns, where the same history gets serialized slightly differently each time, so the bytes do not match and the cache misses even though the meaning is identical. Each of these turns a designed saving into a leak, and as with all caching, the symptom is a quiet bill rather than an error.

Designing conversations that stay cached

The design principles follow from the failure modes. Keep the system prompt at the very top, frozen and free of any per turn variation, so it is one stable cacheable block reused across the whole session. Append new turns to the end and never rewrite earlier ones, so the history is always a growing prefix plus a fresh tail rather than a constantly edited document. Serialize the transcript identically every turn, same formatting, same whitespace, same ordering, so the bytes match. When you do need to manage a conversation that has grown too long for the context window, do it deliberately at a chosen boundary rather than continuously, because every truncation or summary is an invalidation and you want to pay that cost once on purpose rather than repeatedly by accident. The aim is a conversation whose only changing part, from the cache's point of view, is the newest exchange.

Managing the context window without breaking the cache

Long conversations eventually hit the limit of what fits in the context window, and how you handle that interacts directly with caching. The naive approach, summarizing or trimming the history a little on every turn, is the worst case for cost, because each small edit invalidates the cache and forces a full re read of the new prefix. A better approach is to let the conversation run on a stable cached history for as long as it fits, then perform a single deliberate compaction at a clear threshold, accept the one time invalidation, and resume caching on the new, shorter base. This treats context management as an occasional, planned event rather than a continuous tax, and it preserves the cache for the long stretches between compactions where most of the turns happen. The discipline is to make the expensive moments rare and chosen, not frequent and accidental.

The practices that keep long chats cheap

Freeze the system prompt at the top of every turn, with no per turn timestamps or ids above it.
Append new turns and never rewrite earlier ones, so history stays a growing cacheable prefix.
Serialize the transcript identically each turn so the bytes match and the cache hits.
Compact a long conversation once at a deliberate threshold, not continuously.
Monitor cache hit rate per conversation length, since the leak shows up most on long sessions.
Combine caching with model routing, so the cheaper tiers handle the turns that do not need the top model.

What this is worth, on the bill and in the deal

For a conversational product at scale, multi turn caching is frequently the single largest token saving available, because the cost it removes, the repeated re reading of unchanged history, is both enormous and pure waste. Capturing it lowers the monthly bill immediately, but it also lowers the baseline you commit to Anthropic against. A buyer who commits against an uncached conversational workload is committing to a number inflated by quadratic re reads, and locking that into a multi year term. A buyer who has the caching working commits to the real, far lower cost of the conversation and negotiates from a position of demonstrated efficiency. As with every token lever, the engineering work and the contract work are two halves of the same saving, which is why we run them together.

If your conversations are long, your volume is high, and you suspect the history re reads are quietly dominating your bill, we can audit the workload, fix the caching, and carry the optimized baseline into your Anthropic negotiation. Book a strategy call below, and read the full set of caching and routing patterns in our token optimization playbook.

Read the pillar guide

The token optimization playbook for Claude buyers →

Caching across multi turn conversations.