Prompt caching takes up to ninety percent off repeated context, but only if your prompts are built to be cached. A small change in structure decides whether you get the saving or miss it.
Prompt caching is one of the most powerful cost levers Anthropic offers, capable of taking up to ninety percent off the cost of the context you reuse across calls. But it is also the lever most often left unpulled, because caching only works when your prompts are structured to be cacheable. A prompt that puts its stable content in the wrong place gets a cache miss on every call and pays full price, while a prompt that organizes the same content correctly gets a cache hit and pays a fraction. The difference is not the content. It is the structure. This is how to design prompts that cache.
Caching rewards repetition that sits in a stable position. When you send a prompt, the system can store the processed form of the early, unchanging portion and reuse it on the next call that begins with the same content. The saving applies to the portion that matched. So caching is not about your prompt being short or simple. It is about your prompt having a large, identical, early section that recurs across many calls. If that section is there and it is positioned first, you cache it and pay a fraction. If it is scattered, or if something variable sits ahead of it, the match breaks and you pay full price. The whole discipline is about protecting that early, stable block.
The first design move is to sort your prompt content into two buckets. The stable bucket is everything that does not change from call to call: the system instructions, the tool definitions, the reference documents, the examples, the policies, the long standing context. The variable bucket is everything specific to this request: the user's question, the current input, the dynamic data. Most prompts mix these freely, which is exactly what defeats caching. The fix is to put all the stable content first, in a consistent order, and all the variable content after it. That single reorganization is what turns a cache miss into a cache hit.
Because the cache matches from the beginning of the prompt forward, anything variable that sits ahead of your stable content breaks the match for everything behind it. A single dynamic value at the top, a timestamp, a session id, a personalized greeting, can cost you the cache on a large block of reference material that follows. The rule is strict: nothing that changes should sit ahead of something stable you want to cache. Front load the unchanging material, keep its order identical across calls, and push every variable element to the end. The savings live in that ordering discipline.
The workloads that benefit most from caching are the ones with large, repeated context. Retrieval augmented generation, where the same knowledge base or document set is sent on many queries, is a prime candidate. Code review and code assistance, where a large codebase or file set forms the context for many requests, is another. Long system prompts with extensive instructions and examples, sent on every call, are pure caching opportunity. Conversational agents that carry a stable persona and policy set across a session cache that set once and reuse it. In each case, the saving scales with how large the stable block is and how often it recurs, so the biggest prompts with the most repetition deliver the most.
Three common mistakes quietly defeat caching. The first is interleaving variable content into the stable block, for example inserting the user's name into a reference document, which breaks the match. The second is letting the stable content vary subtly, perhaps a timestamp embedded in a system prompt or a reordering of tool definitions between calls, so that what looks identical is not. The third is putting the variable content first out of habit, leading with the user's question and following it with the reference material, which is natural to write but exactly wrong for caching. Each mistake is easy to make and easy to fix once you know to look for it.
You cannot improve what you do not measure, and caching is no exception. Track your cache hit rate, the proportion of cacheable content that actually hit the cache, as a first class metric. A low hit rate on a workload with obvious repeated context is a signal that your prompt structure is fighting the cache, and it points you straight to the fix. A high hit rate confirms the saving is landing. Without this measurement, teams often assume they are caching when their structure is quietly missing, and they pay full price while believing they are optimized. The metric is the difference between intended savings and real ones.
Caching does not only lower your bill. Because it can take up to ninety percent off the cost of your repeated context, it lowers your true consumption, which means it lowers the committed spend you need to negotiate with Anthropic. A buyer who caches well can size a smaller, more accurate commit and avoid the overcommitment that costs money. This is why caching belongs in the optimization you do before you size a commit, not after. If you negotiate a commitment against an uncached bill and then implement caching, you may find your optimized consumption no longer reaches the commitment you signed, turning the difference into dead capacity.
The financial case for caching is the headline, but there is a second benefit that matters to the engineering team and helps build support for the work. Cached context does not need to be reprocessed, so a cache hit returns faster than a cache miss. On a workload with a large stable block, the latency improvement can be substantial, because the model skips the work of processing context it has already seen. This turns caching into a double win: the same restructuring that takes up to ninety percent off the cost of the repeated context also makes the application more responsive. When you are making the case internally for the effort of reorganizing prompts, the latency gain is what wins over the engineers who care more about user experience than the bill, and it means the optimization improves the product rather than merely trimming a cost.
Not every prompt is worth restructuring, and a sensible program starts where the return is largest. Rank your workloads by two factors: the size of the stable block and the frequency with which it recurs. A workload with a large stable block sent on millions of calls is the top priority, because the saving is the product of the block size and the call volume. A workload with a small stable block or low volume can wait. This ranking keeps the effort focused on the changes that move the bill, rather than spreading attention evenly across prompts that differ enormously in payoff. In most systems, a handful of high volume, context heavy prompts account for the large majority of the available caching saving, so finding and fixing those first captures most of the benefit quickly.
Caching and model routing are separate levers, but they compound when used together. A workload that has been routed to the right model and then has its stable context cached pays the lower model rate on the variable portion and a fraction of even that on the cached portion. The two techniques attack different parts of the cost, so neither cannibalizes the other. The practical sequence is to route first, choosing the cheapest model that clears the quality bar, and then structure the prompt for caching on that model. Layered this way, routing and caching together push a workload toward the upper end of the savings range, and they do it without any reduction in the quality of the output, because both are about removing cost that bought you nothing rather than removing capability you needed.
Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.
Get a QuoteWeekly intelligence on Anthropic pricing moves and the buyer side counters that work.