Independent buyer side advisory · Anthropic onlyNew York · London
Blog · Prompt Caching
Top of funnel · Informational

Prompt caching on Claude: a complete buyer guide.

Prompt caching is the single cleanest lever on a Claude bill, cutting the cost of repeated context by up to ninety percent for the workloads that fit it. This is the complete buyer side picture: how it works, where it pays, what to watch, and how it fits a wider optimization plan.

Every call to Claude pays to process the tokens you send. For many production applications, a large portion of those tokens are the same on every call, a long system prompt, a document the model reasons over repeatedly, a fixed set of instructions or examples. You are paying full price to send the model the same context again and again. Prompt caching is the mechanism that stops that waste. It lets the model reuse context it has already processed, so the repeated portion is billed at a steep discount rather than at full input price. For the right workload the saving reaches up to ninety percent on the cached portion, which is why caching is usually the first lever we reach for. This guide explains the whole picture from a buyer's point of view.

What caching actually does

When you cache part of a prompt, the model stores its processed form of that context for a short window. On the next call that reuses the same cached context, the model does not reprocess it from scratch. Instead it reads from the cache, and you pay a small fraction of the normal input cost for those tokens. There is a modest cost to write to the cache in the first place, so caching pays off when the cached context is reused enough times to more than recover that write cost. The economics are simple. Write once at a small premium, then read many times at a deep discount, and the more reads per write, the larger the saving. Workloads with high reuse of stable context are exactly where this wins.

Caching turns repeated context from a recurring full price charge into a write once, read many discount. The saving scales with reuse, so the question is always how often the same context is sent.

Where caching pays the most

The best caching candidates share one trait, a substantial block of context that stays the same across many calls. Common high value cases include the following.

  • Long system prompts. If every request carries a large, fixed system prompt with instructions, policies, or examples, that block is identical on every call and caching it removes most of its recurring cost.
  • Document question answering. When users ask many questions against the same document, the document is the stable context. Cache it once and every subsequent question reads from the cache rather than reprocessing the whole document.
  • Retrieval augmented workloads. Where a fixed knowledge base or a frequently reused set of retrieved chunks feeds the model repeatedly, caching the stable portion cuts the cost of the context that does not change.
  • Conversational applications. In a multi turn conversation, the earlier turns are reprocessed on every new turn. Caching the conversation history keeps that growing context from costing full price each time.
  • Code and agent workloads. Tools, instructions, and codebase context that persist across calls in an agent or coding workflow are strong cache candidates because the same large context recurs constantly.

Where it does not help

Caching only saves money when context repeats. Workloads where every call sends entirely fresh context with little or no overlap get no benefit, and the small cache write cost would make them marginally worse. One off requests, highly variable prompts, and applications where the input changes completely each time are not caching candidates. The discipline is to identify which part of your prompt is stable and which is dynamic, because only the stable part should be cached. Trying to cache content that changes wastes the write cost without earning the read discount. Knowing the line between static and dynamic context is the core skill of caching well.

What to watch

Caching is powerful but it has a few edges worth understanding before you rely on it.

  • The cache window. Cached context persists only for a limited time. If your reuse is spread out so that the cache expires between uses, you pay to write it again, which erodes the saving. Caching pays best when reuse is frequent enough to stay within the window.
  • Cache structure. The cached portion generally needs to sit at the stable front of your prompt with the dynamic content after it. Structuring prompts so the unchanging context comes first is what makes caching possible, and a poorly ordered prompt can leave savings on the table.
  • Hit rate. The actual saving depends on your cache hit rate, the share of calls that successfully read from cache. A design that looks cacheable but achieves a low hit rate in production will underdeliver, so the hit rate is the number to measure rather than assume.
  • Write cost on low reuse. Because writing to the cache costs a little, caching context that is reused only once or twice can cost more than it saves. The reuse threshold matters.

How to estimate the saving

You can size the opportunity before building anything. Look at a representative workload and break each prompt into its stable and dynamic portions. Estimate how large the stable portion is as a share of total input tokens, and how many times that stable context is reused within the cache window. The saving is roughly the deep discount applied to the cached portion across all the reads, net of the write cost. For a workload where most of the input is stable and reused many times, the cached portion approaches the up to ninety percent saving, which can move the total cost of that workload substantially. Doing this estimate first tells you which workloads are worth refactoring and which are not.

Caching is also a latency win

The benefit is not only financial. Because cached context is not reprocessed, calls that read from cache also return faster. For interactive applications this means caching can improve response time at the same time as it cuts cost, which is a rare case where the cheaper option is also the faster one. That makes caching attractive even for latency sensitive workloads, where it can let you keep a more capable model responsive enough for interactive use because the heavy stable context is no longer reprocessed on every call.

Where this fits the wider optimization picture

Caching is one of several levers that compound. It lowers the cost of context on whatever model you run, so it stacks with model routing that puts the right work on Haiku and Sonnet. It combines with batch, which halves the cost of asynchronous work, and with prompt discipline that trims waste before caching even applies. Our token optimization playbook brings these together into a single method for taking cost out of a Claude deployment without losing quality, with caching as one of the highest return moves on the list. For workloads with heavy repeated context, it is frequently the first thing we implement.

The takeaway

Prompt caching turns repeated context from a recurring full price charge into a write once, read many discount that reaches up to ninety percent on the cached portion. It pays most on workloads with a large block of stable context reused often, long system prompts, document question answering, retrieval, conversation history, and agent context, and it does nothing for prompts that change completely each call. Watch the cache window, structure prompts so stable context comes first, and measure your hit rate rather than assuming it. The saving is real, it is often the cleanest available, and it improves latency at the same time. Download the token optimization playbook to size your caching opportunity and see how it stacks with the other levers.

Stop paying full price for the same context.

We find the stable context in your prompts, structure it for high cache hit rates, and stack caching with routing and batch. Download the playbook to size your saving.

Download playbook
Start here
Get the spend in your favor.

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.