Prompt caching is the single cleanest lever on a Claude bill, cutting the cost of repeated context by up to ninety percent for the workloads that fit it. This is the complete buyer side picture: how it works, where it pays, what to watch, and how it fits a wider optimization plan.
Every call to Claude pays to process the tokens you send. For many production applications, a large portion of those tokens are the same on every call, a long system prompt, a document the model reasons over repeatedly, a fixed set of instructions or examples. You are paying full price to send the model the same context again and again. Prompt caching is the mechanism that stops that waste. It lets the model reuse context it has already processed, so the repeated portion is billed at a steep discount rather than at full input price. For the right workload the saving reaches up to ninety percent on the cached portion, which is why caching is usually the first lever we reach for. This guide explains the whole picture from a buyer's point of view.
When you cache part of a prompt, the model stores its processed form of that context for a short window. On the next call that reuses the same cached context, the model does not reprocess it from scratch. Instead it reads from the cache, and you pay a small fraction of the normal input cost for those tokens. There is a modest cost to write to the cache in the first place, so caching pays off when the cached context is reused enough times to more than recover that write cost. The economics are simple. Write once at a small premium, then read many times at a deep discount, and the more reads per write, the larger the saving. Workloads with high reuse of stable context are exactly where this wins.
Caching turns repeated context from a recurring full price charge into a write once, read many discount. The saving scales with reuse, so the question is always how often the same context is sent.
The best caching candidates share one trait, a substantial block of context that stays the same across many calls. Common high value cases include the following.
Caching only saves money when context repeats. Workloads where every call sends entirely fresh context with little or no overlap get no benefit, and the small cache write cost would make them marginally worse. One off requests, highly variable prompts, and applications where the input changes completely each time are not caching candidates. The discipline is to identify which part of your prompt is stable and which is dynamic, because only the stable part should be cached. Trying to cache content that changes wastes the write cost without earning the read discount. Knowing the line between static and dynamic context is the core skill of caching well.
Caching is powerful but it has a few edges worth understanding before you rely on it.
You can size the opportunity before building anything. Look at a representative workload and break each prompt into its stable and dynamic portions. Estimate how large the stable portion is as a share of total input tokens, and how many times that stable context is reused within the cache window. The saving is roughly the deep discount applied to the cached portion across all the reads, net of the write cost. For a workload where most of the input is stable and reused many times, the cached portion approaches the up to ninety percent saving, which can move the total cost of that workload substantially. Doing this estimate first tells you which workloads are worth refactoring and which are not.
The benefit is not only financial. Because cached context is not reprocessed, calls that read from cache also return faster. For interactive applications this means caching can improve response time at the same time as it cuts cost, which is a rare case where the cheaper option is also the faster one. That makes caching attractive even for latency sensitive workloads, where it can let you keep a more capable model responsive enough for interactive use because the heavy stable context is no longer reprocessed on every call.
Caching is one of several levers that compound. It lowers the cost of context on whatever model you run, so it stacks with model routing that puts the right work on Haiku and Sonnet. It combines with batch, which halves the cost of asynchronous work, and with prompt discipline that trims waste before caching even applies. Our token optimization playbook brings these together into a single method for taking cost out of a Claude deployment without losing quality, with caching as one of the highest return moves on the list. For workloads with heavy repeated context, it is frequently the first thing we implement.
Prompt caching turns repeated context from a recurring full price charge into a write once, read many discount that reaches up to ninety percent on the cached portion. It pays most on workloads with a large block of stable context reused often, long system prompts, document question answering, retrieval, conversation history, and agent context, and it does nothing for prompts that change completely each call. Watch the cache window, structure prompts so stable context comes first, and measure your hit rate rather than assuming it. The saving is real, it is often the cleanest available, and it improves latency at the same time. Download the token optimization playbook to size your caching opportunity and see how it stacks with the other levers.
We find the stable context in your prompts, structure it for high cache hit rates, and stack caching with routing and batch. Download the playbook to size your saving.
Download playbookWeekly intelligence on Anthropic pricing moves and the buyer side counters that work.