Independent buyer side advisory · Anthropic onlyNew York · London
Blog · Prompt Caching
Middle of funnel · Commercial investigation

Cache friendly prompt architecture.

Caching does not save money on its own. It saves money when the prompt is built to be cached. The same workload can return up to ninety percent on its context or almost nothing depending on how the prompt is structured. Here is how to architect prompts for high cache hit rates.

Most teams that turn caching on and see a disappointing result do not have a caching problem. They have a prompt architecture problem. Caching rewards prompts that present a large, stable block of context in a consistent way, and it punishes prompts that scatter changing content through the parts that should be reused. The mechanism is the same in both cases. What differs is whether the prompt is built so the cache can actually do its job. This is good news, because it means the saving is a design decision within your control rather than a property of the workload. This piece sets out the architecture that produces high cache hit rates and the common patterns that quietly destroy them.

The one rule everything follows from

Cache friendly architecture comes down to a single principle. Put what stays the same first, put what changes last, and keep the stable part byte for byte identical across calls. A cache read works when the model recognizes that the leading portion of your prompt matches context it has already processed. The moment the stable prefix differs, even slightly, the match breaks and you pay full price again. So the whole discipline is about protecting the consistency of the front of your prompt. Everything below is an application of that one rule to the situations where it tends to get broken.

Stable content first, dynamic content last, and the stable prefix identical on every call. A cache read depends on the leading portion matching exactly, so anything that varies the prefix forfeits the saving.

Separate the static from the dynamic deliberately

The first design task is to look at a prompt and draw a hard line between the parts that are the same on every call and the parts that change. The static side is your system prompt, instructions, guidelines, examples, reference material, and any fixed context the model needs. The dynamic side is the user input, the specific document or record, the current conversation turn, anything per request. Many prompts blur this line, mixing a changing variable into the instructions or appending the stable context after the user input. Separating them cleanly, and committing to that separation everywhere the prompt is built, is the foundation of a cacheable design. If you cannot say which part is stable, the prompt is not ready to cache.

Order matters more than people expect

Once static and dynamic are separated, order them so all the static content sits at the front as one contiguous block and all the dynamic content follows. The mistake to avoid is interleaving, where a dynamic value appears early and is followed by more static content. Anything after the first point of variation cannot be cached, so a single dynamic token near the top can render a large stable block downstream uncacheable. The fix is to push every changing element to the end. If a dynamic value seems to belong in the middle of your instructions, restructure the instructions so the value can move to the bottom while the instruction text stays fixed at the top.

The patterns that silently break caching

Several common habits quietly vary the stable prefix and destroy the hit rate without anyone noticing. Watch for these.

  • Injected timestamps or identifiers. Putting the current date, a request ID, or a session token into the front of the prompt changes the prefix on every call. If the model does not truly need it up front, move it to the dynamic tail or remove it.
  • Inconsistent assembly. If the prompt is built by string concatenation that varies in whitespace, ordering, or formatting between code paths, the prefix differs subtly and the cache misses. The stable block must be assembled identically every time.
  • Reordered context. If retrieved chunks or reference items are assembled in a different order across calls, a prefix that is logically the same is physically different, and the cache cannot match it. Fix the ordering.
  • Per user personalization in the prefix. Dropping a user name or account detail into the system prompt makes the prefix unique per user, which collapses sharing of the cache. Keep personalization in the dynamic portion.
  • Frequent edits to the stable block. Every change to the system prompt or guidelines invalidates the cache. Stable does not just mean static within a call, it means stable across time, so treat the cached block as something you change deliberately, not casually.

Design the stable block to be worth caching

Architecture is not only about consistency, it is also about making the cached block large enough to matter. Caching saves a fraction of the cost of the cached tokens, so the bigger the stable portion relative to the dynamic portion, the larger the saving. This is a reason to consolidate stable context, to bring the instructions, standards, examples, and reference material that the model genuinely benefits from into the cached prefix rather than trimming them away. There is a tension here with prompt economy, since you do not want to pad a prompt with content the model does not need, but where substantial stable context is genuinely useful, putting it in a well structured cached block is far cheaper than sending it fresh each time. The architecture lets you afford richer stable context than you otherwise could.

Validate the architecture against the hit rate

A cache friendly design is a hypothesis until the production hit rate confirms it. The hit rate, the share of calls that read the stable prefix from cache, is the single metric that tells you whether the architecture is working. A high hit rate means the prefix is genuinely stable and consistently assembled. A low hit rate on a prompt you believe is cacheable points to one of the silent breakers above, usually a varying prefix or inconsistent assembly. Instrument the hit rate per workload, treat a drop as a regression worth investigating, and use it to verify that changes to the prompt have not quietly broken the cache. Architecture without measurement tends to drift back toward low hit rates over time.

Make it a standard, not a one time fix

The biggest risk to cache friendly architecture is entropy. A prompt that is well structured today gets a timestamp added next quarter, or a new code path assembles it slightly differently, and the hit rate quietly erodes. The durable fix is to make the architecture a standard that every prompt in the system follows, with the stable prefix assembled by shared code that guarantees consistency, and with the hit rate monitored so regressions surface. Treating cache friendly structure as a convention the whole team builds to, rather than a tuning pass someone did once, is what keeps the saving in place as the application evolves.

Where this fits the wider optimization picture

Prompt architecture is what makes caching deliver, and caching is one of several levers that compound on a Claude bill. A well architected cached prefix lowers the cost of context on whatever model you route to, so it multiplies with sending the right work to Haiku and Sonnet, and it stacks with batch on asynchronous work. Our token optimization playbook brings architecture, caching, routing, and batch together into one method for cutting Claude spend without losing quality. Getting the prompt architecture right is often the step that turns a caching effort from a marginal result into a major one.

The takeaway

Caching saves money only when the prompt is built for it, and the rule is simple: stable content first, dynamic content last, and the stable prefix identical on every call. Separate static from dynamic deliberately, order the prompt so nothing changing appears before the cached block, and watch for the silent breakers, injected timestamps, inconsistent assembly, reordered context, and prefix personalization, that quietly destroy the hit rate. Make the stable block large enough to be worth caching, validate the design against the production hit rate, and enforce it as a standard so it does not erode. Book a strategy call and we will audit your prompt architecture for the patterns costing you the caching saving you should be getting.

Architect the prompt so caching actually pays.

We audit your prompt structure for the patterns that break the cache and redesign for high hit rates across your workloads. Book a strategy call to start.

Book a Strategy Call
Start here
Get the spend in your favor.

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.