Common Caching Mistakes That Waste Tokens

Prompt caching looks like a free saving, and for the workloads that fit it, it nearly is. But caching has a cost structure of its own: you pay a premium to write something into the cache in exchange for a steep discount when you read it back, and that structure can be defeated in several ordinary ways. A team that turns caching on without understanding how it can backfire often ends up paying the write premium repeatedly while capturing little of the read saving, which is worse than not caching at all. These are the mistakes we see most often, and how to avoid each one.

Mistake one: breaking the cache with a moving prefix

Caching works on a stable prefix. The reduced rate applies to the leading portion of your prompt that is identical to a previous call, and the moment that prefix changes, the cache no longer matches and you pay full price again. The most common mistake is putting something variable near the front of the prompt, such as a timestamp, a session identifier, or a per-request value, ahead of the large stable block you meant to cache. That single variable token at the front invalidates everything after it, so the document or system prompt you thought you were caching is paid for in full on every call. The fix is to put all the stable content first and push everything variable to the end, so the cacheable prefix stays identical across calls.
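A minimal sketch of the ordering rule, assuming the prompt is assembled as a plain string. The names build_prompt, shared_prefix_len, and STABLE_DOC are hypothetical, and real caches match on tokens rather than characters, so the character count here is only a rough proxy for how much of the prompt could be reused.

```python
from os.path import commonprefix

# Stand-in for a large stable block (hypothetical content).
STABLE_DOC = "REFERENCE DOCUMENT\n" + "lorem ipsum " * 200


def build_prompt(request_id: str, question: str, variable_first: bool) -> str:
    """Assemble a prompt with the per-request values either first or last."""
    variable = f"request_id={request_id}\nquestion={question}\n"
    return variable + STABLE_DOC if variable_first else STABLE_DOC + variable


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the leading run of characters that two prompts have in common."""
    return len(commonprefix([a, b]))


# Two different requests, built with the variable part first, then last.
bad = shared_prefix_len(build_prompt("a1", "What is X?", True),
                        build_prompt("b2", "What is Y?", True))
good = shared_prefix_len(build_prompt("a1", "What is X?", False),
                         build_prompt("b2", "What is Y?", False))
print(bad, good)
```

With the variable values first, the two prompts diverge within the first few characters, so almost nothing is shared. With them last, the shared prefix covers the whole stable block.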

Mistake two: caching content that does not repeat enough

Caching is an investment. Writing content into the cache costs more than sending it normally once, and you only come out ahead when the cached content is read back enough times to repay that premium and then save. Caching a block that is sent only once always loses money, and a block reused only once or twice may not recover the premium, depending on how the write and read rates compare. Teams make this mistake when they cache indiscriminately, turning caching on for everything rather than for the high-reuse workloads it is designed for. The fix is to cache only where the same content is reused often enough to clear the break-even point, and to measure reuse before assuming a workload qualifies.
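The break-even point can be worked out with a little arithmetic. The sketch below expresses costs relative to the normal input price, so one uncached use costs 1.0. The multipliers in the example (2.0 for a write, 0.1 for a read) are illustrative placeholders, not figures from any price list; substitute the rates that apply to your contract. The function names are hypothetical.

```python
import math


def cached_cost(uses: int, write_multiplier: float, read_multiplier: float) -> float:
    """Relative cost of `uses` calls with caching: one write, then (uses - 1) reads."""
    return write_multiplier + (uses - 1) * read_multiplier


def break_even_uses(write_multiplier: float, read_multiplier: float) -> int:
    """Smallest total number of uses at which caching is strictly cheaper than not caching."""
    if not (read_multiplier < 1.0 <= write_multiplier):
        raise ValueError("expected read_multiplier < 1.0 <= write_multiplier")
    # Solve write + (n - 1) * read < n for n.
    threshold = (write_multiplier - read_multiplier) / (1.0 - read_multiplier)
    return math.floor(threshold) + 1


# Illustrative multipliers only (assumptions, not quoted prices).
print(break_even_uses(2.0, 0.1))
print(cached_cost(2, 2.0, 0.1), cached_cost(3, 2.0, 0.1))
```

Under these assumed rates, two uses still cost slightly more cached than uncached, and the third use is where caching starts to pay. A lower write premium moves the break-even point earlier.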

Mistake three: letting the cache expire between uses

A cache entry does not live forever. If the gap between calls that would reuse the cached content is longer than the cache lifetime, the entry expires and the next call pays the write premium again to recreate it. Workloads with sparse, irregular traffic can end up repeatedly writing a cache that expires before it is read, paying the premium over and over with little saving in between. The fix is to understand the cache lifetime and match it to your traffic pattern, caching where calls cluster closely enough to reuse a live entry, and reconsidering caching for workloads whose calls are too far apart to keep an entry warm.
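One way to check whether a traffic pattern keeps an entry warm is to replay logged call times against an assumed lifetime. This is a sketch under two assumptions: the lifetime (300 seconds here) is a placeholder to replace with your provider's documented value, and each use is assumed to refresh the entry's lifetime. The function name is hypothetical.

```python
def count_writes_and_reads(call_times: list[float], ttl_seconds: float) -> tuple[int, int]:
    """Replay call timestamps (in seconds) against a single cache entry.

    Returns (writes, reads). A call is a read if it arrives before the entry
    expires; otherwise it is a write that recreates the entry.
    """
    writes = 0
    reads = 0
    expires_at = None  # no live entry before the first call
    for t in sorted(call_times):
        if expires_at is not None and t < expires_at:
            reads += 1
        else:
            writes += 1
        # Assumption: every use, read or write, restarts the lifetime.
        expires_at = t + ttl_seconds
    return writes, reads


TTL = 300.0  # placeholder lifetime in seconds (assumption)
clustered = [0, 60, 120, 180]   # calls one minute apart
sparse = [0, 600, 1200, 1800]   # calls ten minutes apart
print(count_writes_and_reads(clustered, TTL))
print(count_writes_and_reads(sparse, TTL))
```

The clustered pattern writes once and reads three times. The sparse pattern writes on every call and never reads, which is the failure described above.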

Mistake four: caching the wrong portion

Not all of a prompt is worth caching. The saving comes from caching a large stable block, and caching a small one captures little while still paying the write overhead. Teams sometimes cache the trivial fixed part of a prompt, a short instruction, while leaving the genuinely large stable content, a long document or example set, uncached because it sits in the wrong position. The fix is to identify the largest stable block in the prompt and structure the request so that block is the cached prefix, rather than caching whatever happens to be at the front.

Mistake five: interleaving stable and variable content

Even when a workload has a large stable block and high reuse, the saving evaporates if the prompt is built so the stable and variable parts are woven together. If the document is split by user-specific notes, or the system prompt is interrupted by per-call values, there is no clean stable prefix to cache. The fix is structural: separate the prompt cleanly into a stable section that comes first and a variable section that comes after, so the cache has an unbroken prefix to work with. A great deal of caching value is captured simply by reorganizing the prompt rather than changing its content.
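The separation can be sketched by tagging each piece of the prompt as stable or variable and moving the stable pieces to the front. The Segment type and both helper functions are hypothetical, and the segment texts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    stable: bool  # True if the text is identical on every call


def stable_first(segments: list[Segment]) -> list[Segment]:
    """Move stable segments ahead of variable ones, keeping the order within each group."""
    return [s for s in segments if s.stable] + [s for s in segments if not s.stable]


def cacheable_prefix(segments: list[Segment]) -> str:
    """Join the leading run of stable segments: the most a prefix cache could reuse."""
    parts = []
    for s in segments:
        if not s.stable:
            break
        parts.append(s.text)
    return "".join(parts)


interleaved = [
    Segment("SYSTEM RULES\n", True),
    Segment("user note: prefers short answers\n", False),
    Segment("DOCUMENT PART 1\n", True),
    Segment("DOCUMENT PART 2\n", True),
]
print(len(cacheable_prefix(interleaved)))
print(len(cacheable_prefix(stable_first(interleaved))))
```

In the interleaved layout only the system rules form a reusable prefix. After reordering, the rules and both document parts do. Whether a given reordering preserves the prompt's meaning is something to check for each prompt.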

Mistake six: ignoring the write cost in low-volume tests

Teams often evaluate caching on a small test run and conclude it is not worth it, because at low volume the write premium dominates and the reads have not accumulated. This understates the production saving, where the same cache is read back thousands of times and the write cost is amortized to almost nothing. The opposite error also happens: a team measures a tiny cached test, sees a saving per call, and assumes it scales linearly without checking whether the production traffic pattern keeps the cache warm. The fix is to evaluate caching against the real production volume and traffic shape, not a small sample that misrepresents the economics in either direction.
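Both errors show up when the same cost ratio is computed for different mixes of writes and reads. In this sketch the write and read counts are taken as given, as if measured from logs, and the multipliers (2.0 and 0.1) are illustrative placeholders rather than quoted prices. The function name is hypothetical.

```python
def relative_cost(writes: int, reads: int,
                  write_multiplier: float, read_multiplier: float) -> float:
    """Cost of the cached block with caching on, divided by its cost with caching off.

    Every call would cost 1.0 uncached. Values below 1.0 mean caching saves money.
    """
    calls = writes + reads
    if calls <= 0:
        raise ValueError("need at least one call")
    return (writes * write_multiplier + reads * read_multiplier) / calls


W, R = 2.0, 0.1  # illustrative multipliers (assumptions)
print(relative_cost(1, 1, W, R))      # tiny test: one write, one read
print(relative_cost(1, 999, W, R))    # production with a warm cache
print(relative_cost(600, 400, W, R))  # production where entries keep expiring
```

Under these assumed rates the tiny test comes out slightly above 1.0 and makes caching look like a loss, the warm production case comes out far below 1.0, and the cold production case comes out above 1.0 even at high volume. Volume alone does not rescue a pattern that keeps rewriting the cache.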

Mistake seven: caching volatile content as if it were stable

Some content looks stable but is not. A document that is updated frequently, a knowledge base that changes through the day, or a context assembled fresh per user will not produce cache hits, because it is not actually identical across calls. Caching it pays the write premium repeatedly for content that never matches a prior entry. The fix is to be honest about what is truly stable. Cache the parts that do not change (the instructions, the fixed examples, the static reference) and leave the genuinely volatile content out of the cached prefix.

How to do caching right

The mistakes share a single root: caching applied without measuring the pattern it depends on. Done right, caching follows a short discipline.

Put all stable content first and all variable content last, so the cacheable prefix stays identical.
Cache only workloads where a large stable block is reused often enough to clear the write cost.
Match caching to your traffic pattern so entries are read back before they expire.
Cache the largest stable block, not whatever sits at the front by accident.
Keep genuinely volatile content out of the cached prefix.
Evaluate against real production volume, not a small sample.

The commercial angle

Caching mistakes do not only waste tokens; they also distort the baseline you negotiate against. A team that believes it is caching effectively, but is actually paying the write premium repeatedly, carries an inflated consumption number into its commit, and commits to more than a correctly cached workload would need. When we review a deal, we check that the caching is actually capturing the saving it claims, because an honest, well-cached baseline produces a smaller commit, less exposure to unused commitment, and a stronger position on the rate. Caching done wrong inflates the very number Anthropic prices against.

How to tell if your caching is actually working

The most dangerous caching mistakes are the ones that hide, because the system runs fine and the bill simply does not fall as much as expected. The way to catch them is to measure the cache directly rather than assuming it works. Look at the share of your input tokens that are being served from the cache at the reduced rate versus the share paid at full price, and look at how often a cache write is followed by enough reads to repay it. A healthy cached workload shows a high proportion of reads against writes and a large share of tokens at the reduced rate. A broken one shows writes that are never read, or a stable block that keeps being paid in full because the prefix is being invalidated. If you are not measuring the read-to-write ratio and the cached share of tokens, you are flying blind, and a caching setup that looks active in the code can be capturing almost nothing in practice. The measurement is the difference between believing caching works and knowing it does.
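Both measurements can be computed from per-call usage records. The sketch below assumes each record carries three disjoint token counters, with field names modeled on the usage counters reported by the Anthropic Messages API (input_tokens, cache_creation_input_tokens, cache_read_input_tokens). Treat the names as assumptions and adapt them to whatever your logs record. The cache_health function is hypothetical.

```python
from typing import Iterable, Mapping


def cache_health(usages: Iterable[Mapping[str, int]]) -> dict[str, float]:
    """Summarize cached share of input tokens and the read-to-write ratio."""
    uncached = written = read = 0
    for u in usages:
        uncached += u.get("input_tokens", 0)                 # paid at the normal rate
        written += u.get("cache_creation_input_tokens", 0)   # paid at the write premium
        read += u.get("cache_read_input_tokens", 0)          # paid at the reduced rate
    total = uncached + written + read
    if written:
        ratio = read / written
    else:
        ratio = float("inf") if read else 0.0
    return {
        "cached_share": read / total if total else 0.0,
        "read_to_write_ratio": ratio,
    }


# Made-up records: one write followed by nine reads of the same 10,000-token block.
healthy = [{"input_tokens": 50, "cache_creation_input_tokens": 10_000}] + [
    {"input_tokens": 50, "cache_read_input_tokens": 10_000} for _ in range(9)
]
# Made-up records: the block is rewritten on every call and never read.
broken = [{"input_tokens": 50, "cache_creation_input_tokens": 10_000} for _ in range(10)]

print(cache_health(healthy))
print(cache_health(broken))
```

The healthy sample shows most input tokens served from the cache and several tokens read for each token written. The broken sample shows a cached share of zero and no reads at all, even though caching is switched on in both.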

A worked example of caching gone wrong

Consider a document question-answering feature built to cache the uploaded document so that repeated questions against it are cheap. The intent is right and the workload is a perfect caching candidate, but the implementation prepends a per-request identifier and a timestamp to the prompt, ahead of the document, for logging. Because that leading content changes on every call, the cacheable prefix never matches a prior entry, so the document is paid in full on every question and the cache writes are never read back. The team sees caching enabled in the code, assumes it is working, and cannot understand why the bill on this feature has not moved. The fix is a one-line reordering: move the identifier and timestamp to the end of the prompt, behind the document, so the document becomes the stable prefix. After the change, the document is paid essentially once and every subsequent question is answered at the reduced rate, and the feature's cost falls sharply. Nothing about the content changed, only the order, and that is typical: most caching failures are structural, not fundamental, and most fixes are cheap once the cause is seen.
