Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Token Optimization
Token Optimization

Caching strategy for multi turn conversations.

Buyer side guide · 11 minute read

Multi turn conversations are where Claude bills quietly run out of control, and they are also where prompt caching delivers the most. The reason is the same in both cases. Every turn in a conversation resends the history that came before it, so the input grows with each exchange, and by the tenth turn you are paying to reprocess the entire prior conversation on every single call. Caching turns that liability into one of the largest savings available. This is a buyer side guide to the caching strategy that captures it.

We negotiate with Anthropic and optimize the spend beneath the contract, and conversational workloads are among the most common places we find money left on the table. The strategy below is written so the engineering leader who designs it and the procurement leader who funds it can agree on why it matters and how large the prize is.

Why multi turn conversations get expensive

A single Claude call is priced on the tokens it processes. In a conversation, each new turn includes everything that came before, the system prompt, any reference material, and the full back and forth so far, because the model needs that context to respond coherently. That means the input grows with every turn. The first turn is cheap. The twentieth turn carries nineteen turns of history plus the original context, all reprocessed at full input price, every time the user sends a message.

The cost of a long conversation is therefore not linear, it accelerates. A workload of long sessions, a support assistant, a coding companion, a research tool, a tutor, can spend the overwhelming majority of its tokens simply reprocessing context it has already seen. That reprocessing is pure waste, and it is exactly what caching eliminates.

How caching changes the math

Prompt caching lets you mark the stable portion of your input so it is stored and reused rather than reprocessed from scratch. On subsequent calls, the cached portion is billed at a steep discount, up to ninety percent off the normal input rate, on the cached reads. Only the genuinely new tokens, the latest user message, are billed at full price. The conversation history and the fixed context, which were the source of the runaway cost, drop to a fraction of their former price.

The effect on a long conversation is dramatic. Instead of paying full input price for a growing history on every turn, you pay full price only for the small new portion and the cached rate for everything else. The longer the conversation, the larger the saving, because the cached share of the input grows with every turn. Caching does not just trim the bill on conversational workloads, it removes the very mechanism that made them expensive.

What to cache, and in what order

Caching well is a matter of structuring the prompt so the stable parts come first and the changing parts come last. Caching works on prefixes, the leading portion of the input that stays the same, so the design goal is to put everything stable at the front and everything volatile at the back.

  • The system prompt and instructions go first, because they almost never change within a session and often not across sessions.
  • Fixed reference material, a knowledge base, a document under discussion, a code repository, goes next, because it is stable for the life of the conversation.
  • The conversation history goes after that, growing turn by turn, with each completed turn becoming part of the stable prefix for the next call.
  • The newest user message goes last, the only genuinely new input, billed at full price while everything above it reads from cache.

Get this ordering right and the cache hit rate is high, which is the number that determines the saving. Get it wrong, by interleaving changing content into the stable section, and you break the prefix on every call, the cache misses, and you pay full price anyway. The strategy is as much about prompt structure as about flipping caching on.

The cache lifetime question

A cached prefix does not live forever. It has a lifetime, after which it expires and the next call repopulates it at full price before the discount resumes. This matters enormously for conversational design, because the economics depend on the pattern of activity within that window. A user sending messages in a continuous session keeps the cache warm and reaps the discount on every turn. A user who pauses long enough for the cache to expire pays to repopulate it when they return.

The strategic response is to understand your real conversation patterns and design around them. Active, continuous sessions are the ideal case and need little tuning. For workloads with natural pauses, the question becomes whether the activity within a session is dense enough to keep the cache warm through the period that matters. Where it is, caching pays handsomely. Where sessions are sparse and intermittent, the benefit is smaller, and the strategy may shift toward caching the fixed context, which is reused across many users, rather than the per session history.

Caching across users versus within a conversation

There are two distinct caching opportunities in a conversational system, and a good strategy uses both. The first is within a single conversation, where each turn's history is cached for the next turn, the case described above. The second is across many conversations, where a large shared context, the same system prompt and the same knowledge base used by every user, is cached once and read by all of them. The second is often the larger prize in high volume systems, because a single cached context serves thousands of conversations rather than one.

Designing for both means separating the genuinely shared context, which every user sees identically, from the per conversation history, which is unique to each session. Cache the shared context aggressively, because its cost is amortized across the entire user base. Cache the per conversation history where session patterns support it. The combination is what takes a conversational workload's cost down the most.

The trade offs to weigh

Caching is close to free upside, but a complete strategy acknowledges its few trade offs honestly. Writing content into the cache carries a small premium over a normal call, because the system has to store the prefix as well as process it. On a context that will be read many times this is trivial, the write cost is paid once and the discount is reaped on every subsequent read. But on a context that is read only once or twice before it expires, the write premium can outweigh the saving. The strategy, therefore, is to cache aggressively where reuse is high and to leave low reuse content uncached, which is exactly what the cache hit rate will tell you over time.

The second trade off is design discipline. Caching rewards a prompt structure where stable content forms a clean prefix and only the newest input changes. That is a constraint on how you assemble prompts, and a team used to building prompts ad hoc will need to adopt the discipline of separating the stable from the volatile. It is a modest cost, but it is real, and it is worth naming so the engineering team plans for it rather than discovering it. Weighed against the size of the saving on conversational workloads, both trade offs are easily worth accepting, but a strategy that pretends they do not exist will be tuned poorly.

How caching interacts with routing and batch

Caching does not operate in isolation, and the strongest conversational economics come from combining it with the other levers. Routing decides which model handles each turn, and caching then lowers the input cost on whichever model the turn lands on. A conversation routed to Sonnet with a well cached prefix is dramatically cheaper than the same conversation run on Opus with no caching, and the two savings multiply rather than add. For the portions of a conversational system that are not interactive, bulk evaluation of past conversations, offline analysis, generating summaries across a transcript archive, batch layers a further discount on top.

The practical implication is that caching should be designed as one part of a coordinated strategy rather than a standalone switch. Decide the routing first, so you know which model each class of turn uses. Layer caching to cut the input cost of that traffic. Push the offline conversational work into batch. The result on a high volume conversational product is a cost structure a small fraction of the naive version, and caching is the lever that does the heaviest lifting within it, precisely because conversational workloads are so dominated by repeated context.

Measuring whether it is working

Caching is one of the few optimizations where you can see the result directly, so measure it. The number that matters is the cache hit rate, the share of input tokens served from cache rather than reprocessed at full price. A high hit rate means your prompt structure is sound and your session patterns are keeping the cache warm. A low hit rate, despite caching being enabled, almost always means the prompt is structured so the prefix breaks, and it points straight at the fix. Track cost per inference on the conversational workload before and after, and the saving is unambiguous, which is exactly what you want when reporting it upward.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.