Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Token Optimization
Token Optimization

Prompt caching on Claude: the 90 percent lever.

Buyer side guide · 11 minute read

Of every lever available to cut a Claude bill, prompt caching is the largest and the most consistently underused. It can reduce the cost of the repeated portion of your prompts by up to ninety percent, and for the right workload that is not a rounding improvement, it is a different invoice. Most teams either do not use caching at all or use it badly, which means they are paying full price, request after request, for tokens they send unchanged thousands of times a day. This is what caching does, where the ninety percent actually comes from, and how to design for it so the saving is real rather than theoretical.

What prompt caching actually is

Every Claude request is made up of input tokens and output tokens. The input often includes a large, stable block that is identical across many requests: a long system prompt, a set of instructions, a knowledge base, a document the model is reasoning over, or the early turns of a conversation. Without caching, you pay the full input rate for that entire block every single time you send it, even though it has not changed.

Prompt caching lets Anthropic store that stable block after the first request and reuse it on subsequent ones. When a later request reuses the cached portion, you pay a small fraction of the normal input rate for those tokens rather than the full price. The cache write, the first time the block is stored, costs slightly more than a normal input token, but every cache read after that is dramatically cheaper. Across a workload that reuses the same context many times, the blended cost of that context falls toward a fraction of what it was, and the often quoted figure of up to ninety percent off the repeated portion is where well designed caching lands.

Where the saving comes from

The ninety percent is not magic and it is not uniform. It applies specifically to the cached input tokens, the stable block you reuse. It does not reduce your output token cost, and it does not help with input that genuinely changes on every request. So the size of your real saving depends entirely on the shape of your workload: the larger and more frequently reused your stable context, the closer your blended cost moves toward that ceiling.

This is why caching pays off spectacularly for some workloads and barely at all for others. A workload that sends a long system prompt and a large reference document with every request, and changes only a short user question at the end, is almost pure cacheable context, and caching transforms its economics. A workload where nearly every token is unique on every call has little to cache and sees little benefit. Understanding which one you have is the first step, because it tells you how much the lever is worth before you spend an hour pulling it.

The workloads that benefit most

Several common enterprise patterns are close to ideal for caching. Customer support assistants that load the same long policy and product context on every conversation. Document analysis pipelines that reason over the same large document across many questions. Code assistants that load the same large codebase context repeatedly. Retrieval augmented systems where a stable instruction block and schema wrap a small changing query. Long running conversations where the early turns stay fixed while only the latest exchange changes. In each of these, a large share of the input is identical request to request, which is exactly the condition caching rewards.

If your workload looks like one of these and you are not caching, you are very likely overpaying by a wide margin. The saving is sitting there untouched, billed in full on every call.

Designing prompts to be cache friendly

Caching rewards structure, and a few design principles turn a mediocre cache hit rate into an excellent one. The governing rule is simple: put the stable content first and the variable content last.

Order matters

Caching works on prefixes. The cache matches from the start of the prompt up to the point where the content first differs. So everything you want cached must sit at the front, in a fixed order, before anything that changes. If a variable element sneaks into the early part of the prompt, it breaks the match for everything after it, and your cache hit rate collapses. Keep the system prompt, instructions, and reference material at the top, byte for byte identical across requests, and place the changing user input at the very end.

Separate the stable from the dynamic

Audit your prompt and classify every section as stable or dynamic. Stable content is anything that does not change between requests within a reasonable window. Dynamic content is anything that varies per request. Then physically reorganize the prompt so all the stable content is grouped at the front. This single refactor is often the difference between a workload that caches well and one that does not, and it is covered in more depth in our broader token optimization work.

Respect the cache lifetime

A cache entry does not live forever; it expires after a period of inactivity. Workloads with steady, frequent traffic keep the cache warm and capture the full benefit. Workloads with sparse, bursty traffic may let the cache expire between requests, forcing repeated cache writes that cost slightly more than a plain request. If your traffic is bursty, the engineering challenge is keeping the cache alive across the gaps, and sometimes the honest answer is that caching is not the right lever for that particular pattern.

How caching interacts with the other levers

Caching is not the only token lever, and it compounds with the others. Batch processing runs eligible asynchronous work at half the standard rate, and caching can apply within batched workloads too. Model routing across Opus, Sonnet, and Haiku, sending each request to the cheapest model that can do the job well, typically cuts aggregate spend by forty to seventy percent, and caching layers on top of whichever model you route to. The largest savings come from stacking these levers rather than treating any one as the whole answer. Caching is usually the first to reach for because its ceiling is the highest, but the full program combines it with routing and batch.

Why this matters at the negotiating table

Caching is not only an engineering win; it is negotiating leverage. When you commit to Anthropic, you commit to a consumption level, and a workload that has been cache optimized commits to a genuinely lower, more efficient number. That protects you from oversizing the commitment and the shortfall that follows, and it lets you argue for your discount from a position of disciplined spend rather than waste the vendor is happy to sell. The team that caches first negotiates from strength, because its committed number is real consumption rather than padding.

The buyer side takeaway

Prompt caching is the largest single token lever on Claude, cutting the cost of repeated context by up to ninety percent, but only for workloads with substantial reused input and only when the prompt is designed to capture it. Classify your context as stable or dynamic, put the stable block first in a fixed order, keep the cache warm with steady traffic, and stack caching with routing and batch for the full effect. Then carry the lower, optimized run rate into your commitment so the saving shows up in both your invoice and your negotiating position. To see exactly how much caching is worth on your specific workload, download the token optimization playbook.

The largest lever, usually untouched.

Download the token optimization playbook, or have us audit your prompts for cacheable context.

Download the playbook

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.