Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Prompt Caching
Prompt Caching

Designing context for maximum cache hits.

Buyer side guide · 11 minute read

Prompt caching on Claude can take up to 90 percent off the cost of the tokens it covers, which makes it one of the most powerful levers on an enterprise bill. But the discount is not automatic. Caching only pays when the cached portion of your context is actually reused, hit after hit, and reuse depends entirely on how you have designed the context. A team that turns caching on without changing how it builds prompts often sees a disappointing result, because their context shifts slightly on every call and the cache never gets a chance to work. The saving is real, but it is earned through design. This guide explains how caching decides what to reuse, why context structure makes or breaks the hit rate, and how to lay out your prompts so the cache pays as often as possible.

How caching decides what to reuse

The mechanism behind prompt caching is simple to state and important to understand, because the design rules follow directly from it. When you cache part of a prompt, the model stores the processed form of that content so the next request can reuse it instead of paying to process it again. The reuse only happens when the new request begins with the exact same content, in the exact same order, up to the cached point. The cache is matched from the start of the prompt forward, so anything before a change is reusable and everything from the first difference onward has to be processed fresh.

That single fact drives the entire discipline. Because matching runs from the beginning of the prompt, the stable content has to come first and the variable content has to come last. If something that changes on every request sits near the top of your prompt, it breaks the match early and almost nothing downstream can be reused, no matter how much of the rest is identical. The whole game of designing for cache hits is arranging your context so the longest possible stable prefix sits at the front.

Why structure makes or breaks the hit rate

Most prompts in production are assembled from several pieces: a system instruction, some reference material or documents, maybe a set of examples, the conversation history, and finally the user's current input. Whether caching pays depends almost entirely on the order those pieces appear and how stable each one is across requests.

The cost of a moving prefix

The common failure is putting something dynamic early. A timestamp at the top of the system prompt, a user name inserted before the reference material, or documents shuffled into a different order on each call all have the same effect: they move the first point of difference toward the front, and everything after it has to be reprocessed. The content may be 95 percent identical to the last request, but if the difference is near the top, the cache captures almost none of it. A moving prefix is the single most common reason caching underdelivers.

The payoff of a stable prefix

The reverse is just as powerful. When the large, expensive, stable content sits at the front, unchanged from request to request, the cache reuses all of it and you pay the full rate only on the small variable tail. A long system instruction, a fixed set of reference documents, and a stable library of examples are exactly the kind of heavy content that benefits most, because they are large and they do not change. Get them to the front and keep them identical, and the bulk of your token cost moves into the cached, heavily discounted portion.

How to lay out the context

The practical design follows one principle: order your context from most stable to least stable, and put the cache boundary as far down the stable section as you can. A few concrete rules make that real.

Put the heavy stable content first

Lead with the things that do not change and cost the most: the system prompt, fixed instructions, reference documents, schemas, and example sets. These are usually the largest part of the prompt and the part you most want to reuse, so they belong at the very front where the cache can capture them. The bigger and more stable the prefix, the larger the saving.

Keep the variable content at the tail

Everything that changes per request, the user's question, the current input, any per request parameters, goes at the end. This content is small relative to the stable prefix, so paying the full rate on it costs little, and placing it last means it never breaks the match on the expensive content above it. The discipline is to resist the temptation to weave dynamic values into the stable section for readability, because every such insertion moves the break point up and forfeits the saving.

Hold the order fixed

Reuse depends on the content being identical, which means the order of documents and sections must not vary between requests. If your application assembles reference material from a set, sort it deterministically so the same inputs always produce the same order. A retrieval step that returns documents in a different sequence each time will defeat the cache even when it returns the same documents, simply because the byte sequence differs.

Group by lifetime

Within the stable section, place the most durable content first and the less durable content after it. Instructions that never change should sit ahead of reference material that updates occasionally, which should sit ahead of anything that refreshes more often. That way a change in the less stable content still preserves the cache on the most stable content above it, rather than invalidating the entire prefix.

Common mistakes that quietly break the cache

Most disappointing caching results trace back to a handful of mistakes that are easy to make and easy to miss, because the application still works perfectly. Nothing breaks visibly when the cache fails to hit. The output is fine, the requests succeed, and the only symptom is a bill that did not fall the way you expected. That silence is what makes these mistakes so common, so it is worth knowing them by name.

The first is inserting a timestamp or a generated identifier near the top of the prompt. It feels harmless, a single short value, but because it changes on every request and sits early, it breaks the match before any of the expensive content can be reused. The second is nondeterministic ordering of retrieved content, where the same documents come back in a different sequence each time. The content is identical but the byte order is not, so the cache sees a different prefix and reprocesses everything. The third is personalizing the system prompt by name or account, which threads a dynamic value through what should be a stable block and forfeits the reuse on the instruction that follows it.

The fourth is subtler: making small edits to the stable content more often than necessary. Every change to the cached prefix invalidates it and forces the next request to rebuild the cache from scratch, so a system prompt that is tweaked daily never gets to amortize across enough requests to pay off. Treat the stable content as something you change deliberately and rarely, not casually, because each change resets the saving. Knowing these four failure modes turns caching from a feature you hope works into one you can verify works.

Verifying the cache is actually working

Because a broken cache is silent, the only way to know your design is paying is to measure it. The metric that matters is the cache hit rate: the share of cacheable tokens that were actually served from the cache rather than reprocessed. A high hit rate means your stable prefix is being reused as intended. A low one means something is breaking the match, and it points you straight at the mistakes above.

The practical discipline is to check the hit rate after any change to how prompts are assembled, not just once at launch. A refactor that improves one feature can accidentally introduce a dynamic value into another, and the hit rate is what catches it before a month of inflated bills does. Watching the hit rate over time also tells you when a workload's traffic has shifted in a way that reduces reuse, so you can respond rather than discovering the erosion at invoice time. Caching is one of the few optimizations where you can directly observe whether it is working, and a team that watches the hit rate keeps the saving that a team relying on hope slowly loses.

Where the design pays most

The workloads that benefit most from this design are the ones that send the same large context repeatedly. A document assistant that answers many questions against the same source material, a support agent that carries a long fixed instruction set, a code tool that reads the same files across a session, and a retrieval workload that shares a common knowledge base across requests are all cases where a well designed stable prefix is reused constantly. In these patterns, the cached portion is large and the hit rate is high, which is exactly the combination that turns the up to 90 percent discount into a major reduction in the bill. The more your traffic repeats heavy context, the more the layout work pays.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.