Prompt caching on Claude can take up to 90 percent off the cost of repeated input tokens. Here is exactly how the economics work, where the saving comes from, and which workloads stand to gain the most from it.
Prompt caching is the rare optimization that is both large and simple, which is why it is the first lever we reach for when a Claude application is spending more than it should. The headline is that caching can take up to ninety percent off the cost of input tokens that repeat across requests, and for many real workloads a large share of every request is exactly that, the same content sent again and again. Understanding the economics of caching, where the saving comes from and what it depends on, is the difference between leaving that money on the table and capturing it. This piece explains the mechanics in plain terms, so a procurement leader can see the size of the prize and an engineer can see how to claim it.
Every request you send to Claude includes input tokens, and you pay for all of them. In most production applications, a substantial portion of that input is identical from one request to the next: a long system prompt that defines behavior, a set of instructions, reference documents, examples, a knowledge base, or any context that stays the same while only the user question changes. Without caching, you pay full input price for that repeated content on every single call, even though the model has effectively processed it many times before. Across a high volume application, this means paying repeatedly for the same tokens, and at scale the repeated portion can dwarf the part of each request that is actually new. That repeated, re billed content is the waste caching exists to remove.
Caching lets you mark the stable part of your prompt so that, after the first request establishes it, subsequent requests that reuse it are billed at a steeply reduced rate, up to ninety percent below the standard input price for the cached portion. The first request that writes content to the cache costs a little more than a normal request, because writing the cache carries a small premium. Every request after that which hits the cache pays the deeply discounted rate on the repeated tokens, and only the genuinely new content, the user's actual question, is billed at the full input rate. The economics are therefore driven by reuse: the more often a piece of cached content is read before it expires, the more the small write premium is spread across many cheap reads, and the closer your effective saving approaches the full ninety percent on that content.
Caching pays most on workloads with a large, stable context and high request volume against it. A few profiles stand out. Applications with long system prompts, where every request carries the same lengthy instructions, save on that block on every call after the first. Retrieval and document workloads, where the same reference material or knowledge base is queried repeatedly, cache the material once and read it cheaply across many questions. Conversational applications, where the history grows but the early turns and the system context stay fixed, cache the stable prefix. Code workloads, where the same files or codebase context are sent across many requests, cache the shared context. In each case the pattern is identical: a big repeated block, read many times, is exactly the shape caching rewards, and the saving scales with both the size of the block and the frequency of reuse.
Three things govern how much you actually capture. The first is how much of your input is genuinely stable, because only repeated content can be cached, and an application that sends mostly novel input on every request has little to cache. The second is your hit rate, the share of requests that successfully reuse cached content before it expires, which depends on how your traffic is timed and how your prompts are structured. The third is cache structure, because the cached portion must be the stable prefix of the prompt, so content has to be ordered with the fixed material first and the variable material last for the cache to work at all. Get the structure wrong, by putting variable content ahead of stable content, and the cache cannot form, so the saving never arrives. The ninety percent figure is the ceiling on the cached portion, and good design is how you get close to it.
Consider an application where each request carries a large fixed context and a small variable question, and that fixed context makes up the bulk of the input tokens. Without caching, you pay full input price for that whole context on every call. With caching and a high hit rate, the fixed context is billed at up to ninety percent off after the first call, so your input cost collapses toward the cost of just the new question plus a small fraction of the context. On a workload dominated by repeated context, that can cut total input spend by a very large margin, and because input tokens are often the majority of total token cost in context heavy applications, the effect on the overall bill is substantial. This is why caching alone can move a Claude bill noticeably, before you have touched model routing or batch at all.
Caching is one of three primary token levers, and it compounds with the others. Routing across Opus, Sonnet, and Haiku puts each request on the cheapest model that meets the quality bar. Batch processing runs asynchronous work at roughly half the real time rate. Caching takes up to ninety percent off repeated input. Applied together across a real workload, these levers typically reduce aggregate spend by forty to seventy percent, and caching is frequently the largest single contributor on context heavy applications. The reason this matters beyond the monthly invoice is that the optimized baseline is what you should negotiate from. A buyer who caches before committing sizes the Anthropic commitment against lean, real demand rather than against the inflated cost of paying full price for the same tokens over and over.
One objection comes up whenever caching is proposed: if writing to the cache costs more than a normal request, could caching ever cost me money rather than save it? The honest answer is yes, in one narrow case, and understanding it is what lets you deploy caching with confidence everywhere else. The first request that establishes cached content pays a small premium over the standard input rate, because writing the cache carries a modest cost. If that content is then read many times before it expires, the premium is spread across many cheap reads and the average cost per use collapses toward the deep discount, which is the normal and desirable case. The only situation where caching loses is when content is written and then rarely or never read, so you pay the write premium without ever collecting the read discount. This happens when you cache content that does not actually repeat, or when requests are so spread out that the cache expires between them. The lesson is not to avoid caching, it is to cache the content that genuinely repeats and to ensure it is reused within its lifetime, which is exactly what a measured hit rate confirms.
This is why caching rewards thought rather than blanket application. Marking everything as cacheable, including content that varies, produces writes that are never read and can quietly raise your bill. Marking the genuinely stable content and structuring it as a reused prefix produces the high hit rate that delivers the saving. The economics are not automatic, they are a function of design, and the design question is always the same: is this content the same across many requests, and will those requests arrive close enough together to reuse the warm cache? When the answer is yes, caching that content is close to free money. When the answer is no, caching it is the one case where the premium works against you.
Teams often treat caching as something to add late, once the application is built and the bill has already become a concern. It is more powerful as a foundation, because the way you structure prompts determines whether caching is even possible, and retrofitting cache friendly structure into an application that was not designed for it is harder than building it in from the start. The principle is simple: separate the stable from the variable, put the stable content first so it can form a reusable prefix, and keep incidental variation, timestamps and identifiers and reordered lists, out of the cached block so the key matches reliably. An application designed this way caches naturally and captures the saving from day one, while one designed without it leaves the discount stranded behind a structure that prevents the cache from ever forming. Caching is cheap to design in and expensive to bolt on, which is an argument for thinking about it early.
Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.
Get a QuoteWeekly intelligence on Anthropic pricing moves and the buyer side counters that work.