When Caching Does Not Help | Antrophic Negotiations

Caching gets talked about as if it were free money, and for the right workloads it nearly is. But the lever depends entirely on one thing: a large block of context that repeats across calls. Remove that and caching has nothing to work with. A surprising number of workloads simply do not have a stable, repeated context, and for those, caching saves nothing and can even cost a little. The mark of a buyer who understands token optimization is not enthusiasm for caching everywhere; it is knowing when caching is the wrong tool and which lever to use in its place. This is that map.

Caching needs repetition, and these workloads have none

The cases where caching does not help share a single feature: there is no large stable prefix that recurs. Recognizing them saves you from chasing a saving that is not there.

Fully unique requests

If every call carries entirely different content, with no shared block, there is nothing to cache. A workload where each request is a fresh, self-contained prompt, such as a one-off analysis or a unique user message with no fixed preamble, gives the cache no repeated prefix to store. Turning caching on here adds the write overhead with no reads to recover it.

Short prompts

Caching pays off on large stable blocks. When the whole prompt is short, the stable portion is small even if it repeats, so the saving is negligible and may not clear the write cost. Caching a brief instruction across many calls captures little, because there was little to save in the first place. The lever scales with the size of the cached block, and a small block is not worth the machinery.

Constantly changing context

Some workloads assemble fresh context for every call: a live data feed, per-user computed state, a document that updates continuously. The content looks like it could be cached because it occupies the same slot in the prompt, but it is never identical between calls, so it never produces a cache hit. Caching volatile content pays the write premium repeatedly for entries that never match.

Sparse, irregular traffic

A cache entry has a lifetime. If calls that would reuse the same context are spread far enough apart that the entry expires between them, each call recreates the cache and pays the write cost again. Workloads with rare, scattered traffic can lose money on caching even when the content is genuinely stable, because the reuse never happens while the entry is live.
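The effect of entry lifetime on scattered traffic can be sketched with a small simulation. This is a minimal illustration, not any provider's actual behavior: the function name `simulate_cache`, the `ttl_seconds` value, and the `refresh_on_hit` flag (whether a read extends the entry's lifetime) are all assumptions chosen for the example.

```python
def simulate_cache(call_times, ttl_seconds, refresh_on_hit=True):
    """Count cache writes and hits for calls that share one stable prefix.

    call_times: timestamps in seconds.
    ttl_seconds and refresh_on_hit are illustrative assumptions; check the
    real cache lifetime rules of whatever service you use.
    """
    writes = 0
    hits = 0
    expires_at = None  # no live entry yet
    for t in sorted(call_times):
        if expires_at is not None and t < expires_at:
            # Entry is still live: this call reads from the cache.
            hits += 1
            if refresh_on_hit:
                expires_at = t + ttl_seconds
        else:
            # Entry is missing or expired: this call pays the write cost again.
            writes += 1
            expires_at = t + ttl_seconds
    return writes, hits


# Same number of calls, same stable content, different spacing.
steady = simulate_cache([0, 60, 120, 180], ttl_seconds=300)
sparse = simulate_cache([0, 3600, 7200, 10800], ttl_seconds=300)
print("steady traffic (writes, hits):", steady)
print("sparse traffic (writes, hits):", sparse)
```

With calls a minute apart, one write serves every later call. With calls an hour apart and a five-minute lifetime, every call is a write and no read ever happens.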

Reach for a different lever instead

When caching does not help, the workload can usually still be optimized; the saving simply lives in a different lever. The point of a full optimization program is that you have several, and the skill is matching the lever to the workload.

Use model routing

If a workload has no repeated context to cache, the first question is whether it is running on the right model. Routing the request to the cheapest model that handles it well is independent of caching and applies to unique, short, or volatile workloads just as well as to repetitive ones. Model routing across Opus, Sonnet, and Haiku typically cuts aggregate spend 40 to 70 percent versus running everything on the top model, and it does not care whether anything repeats. For a workload caching cannot touch, routing is usually the largest available saving.
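A minimal sketch of the routing idea, assuming you already have some way to judge whether a given tier handles a task well. The `can_handle` callback, the task names, and the `MINIMUM_TIER` table are hypothetical stand-ins for that judgment (for example, results from your own evaluations); the sketch relies only on the tiers being ordered cheapest first and assumes no prices.

```python
# Cheapest first. The ordering is the only thing this sketch relies on.
TIERS_CHEAPEST_FIRST = ["haiku", "sonnet", "opus"]


def route(task, can_handle):
    """Return the cheapest tier for which can_handle(task, tier) is true.

    can_handle is a hypothetical, caller-supplied quality check.
    Falls back to the top tier if no cheaper tier passes.
    """
    for tier in TIERS_CHEAPEST_FIRST:
        if can_handle(task, tier):
            return tier
    return TIERS_CHEAPEST_FIRST[-1]


# Toy quality check for illustration only: each task type has a minimum tier.
MINIMUM_TIER = {"classify": "haiku", "summarize": "sonnet", "plan": "opus"}


def toy_can_handle(task, tier):
    needed = MINIMUM_TIER.get(task, "opus")  # unknown tasks go to the top tier
    return TIERS_CHEAPEST_FIRST.index(tier) >= TIERS_CHEAPEST_FIRST.index(needed)


for task in ("classify", "summarize", "plan"):
    print(task, "->", route(task, toy_can_handle))
```

Nothing in the routing decision looks at whether content repeats, which is why the lever applies to workloads caching cannot touch.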

Use batch

If the workload is asynchronous, with no user waiting on the result, the batch path takes roughly half off regardless of whether the content repeats. A unique, one-off processing job that nothing can cache is still a perfect batch candidate if it does not need an instant answer. Batch attacks the latency tradeoff rather than the repetition, so it covers workloads caching leaves behind.

Reduce the tokens

If a workload is expensive and neither cacheable nor routable to a cheaper model, the remaining lever is to send fewer tokens. Trim the input to what the model needs, strip syntactic overhead and formatting, and constrain the output length and shape. Token reduction works on any workload, because it lowers the base count that every rate is multiplied against, and it is often the right answer for the unique, full-price requests caching cannot help.
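One concrete form of stripping syntactic overhead is sending structured data without pretty-printing. The sketch below uses a made-up `payload` and compares character counts, which are only a rough proxy for tokens; the actual token reduction depends on the tokenizer and should be measured on your own data.

```python
import json

# Hypothetical structured context that would be embedded in a prompt.
payload = {
    "customer": {"id": 1042, "tier": "enterprise"},
    "tickets": [
        {"id": 1, "status": "open", "subject": "Login fails"},
        {"id": 2, "status": "closed", "subject": "Invoice question"},
    ],
}

# Pretty-printed: readable for humans, but padded with newlines and indentation.
pretty = json.dumps(payload, indent=2)

# Compact: no indentation and no spaces after separators.
compact = json.dumps(payload, separators=(",", ":"))

# Same data either way; only the formatting overhead differs.
assert json.loads(pretty) == json.loads(compact) == payload

saved = 1 - len(compact) / len(pretty)
print(f"pretty: {len(pretty)} chars, compact: {len(compact)} chars")
print(f"reduction: {saved:.0%} of characters")
```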

The diagnostic question

Faced with an expensive workload, the sequence is simple. Ask first whether a large block of context repeats across calls. If it does, cache it. If it does not, ask whether the workload is on the right model, and route it if not. Ask whether anyone is waiting on the result, and batch it if not. And in every case, ask whether you are sending more tokens than the job needs, and trim if so. Caching is the first question, not the only one, and a workload that fails the caching test almost always passes one of the others. The mistake is treating caching as the whole program rather than one lever in it.
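The sequence of questions can be written down as a small checklist function. This is a sketch of the reasoning above, not a tool: the `Workload` fields are hypothetical flags you would fill in from your own analysis, and the checks are treated as independent so that a workload can come back with more than one lever.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical flags describing one workload.
    large_repeated_context: bool      # does a large block repeat across calls?
    on_cheapest_capable_model: bool   # is it already on the right model?
    someone_waiting: bool             # does a user need the answer immediately?
    sends_excess_tokens: bool         # more tokens than the job needs?


def levers_for(w):
    """Return the levers the diagnostic questions suggest, in question order."""
    levers = []
    if w.large_repeated_context:
        levers.append("cache")
    if not w.on_cheapest_capable_model:
        levers.append("route")
    if not w.someone_waiting:
        levers.append("batch")
    if w.sends_excess_tokens:
        levers.append("trim")
    return levers


# A unique overnight report: nothing repeats, but three other levers apply.
one_off_report = Workload(
    large_repeated_context=False,
    on_cheapest_capable_model=False,
    someone_waiting=False,
    sends_excess_tokens=True,
)
print(levers_for(one_off_report))  # ['route', 'batch', 'trim']
```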

The commercial angle

Knowing where caching does not help protects the integrity of your optimized baseline. A team that assumes caching covers everything will overestimate its savings and carry a baseline that does not hold, while a team that matches each workload to the lever that actually works arrives at a real, defensible consumption number. That honest number is what we negotiate against, because it produces the right commit, the least exposure to unused commitment, and the strongest position on the rate. Optimization that is matched correctly to each workload is optimization the vendor cannot dispute at renewal.

The cost of forcing caching where it does not fit

It is worth being clear that caching the wrong workload is not neutral; it is negative. Caching carries a write premium: you pay more to put content into the cache than to send it normally once, and you come out ahead only when enough reads follow to recover that premium. On a workload with no repetition, those reads never come, so you have simply added the write cost with nothing on the other side. A team that turns caching on everywhere in the belief that it can only help has, on the unique and volatile workloads, made itself slightly worse off while feeling more optimized. This is why the diagnostic discipline matters. Caching is not a setting you enable globally and forget; it is a lever you apply deliberately to the workloads whose shape rewards it, and withhold from the ones whose shape punishes it. Knowing where not to cache is as much a part of doing caching well as knowing where to.
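The break-even logic can be made concrete with a little arithmetic. In the sketch below, `write_multiplier=1.25` is an assumed, illustrative write premium (the article does not state its size), and `read_multiplier=0.10` corresponds to the "up to 90 percent off" figure for cached reads; substitute your actual rates. The result is the cost of the stable block with caching, relative to sending it at the normal rate on every call.

```python
def cached_cost_ratio(reads_after_write, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of the stable block with caching, relative to no caching.

    One call writes the entry at write_multiplier times the normal rate;
    each following call reads it at read_multiplier times the normal rate.
    Both multipliers are assumptions for illustration, not quoted prices.
    A result above 1.0 means caching made the block more expensive.
    """
    if reads_after_write < 0:
        raise ValueError("reads_after_write must be non-negative")
    calls = 1 + reads_after_write                      # uncached: every call at 1.0
    cached = write_multiplier + reads_after_write * read_multiplier
    return cached / calls


for reads in (0, 1, 5, 20):
    print(f"{reads:>2} reads per write -> ratio {cached_cost_ratio(reads):.3f}")
```

With zero reads the ratio is simply the write multiplier, which is the "slightly worse off" case described above. Under these assumed multipliers a single read is already enough to come out ahead, so the loss is confined to workloads where reads never arrive at all.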

A short map of lever to workload

Because the levers cover different ground, it helps to hold a simple map of which one answers which workload. The map is not rigid, since most real workloads benefit from more than one lever at once, but it tells you where to look first.

Large stable context reused often: cache it, for up to 90 percent off the repeated portion.
No repetition, but running on an expensive model: route it to the cheapest capable model, for a 40 to 70 percent aggregate reduction.
No one waiting on the result: run it on the batch path, for roughly half off.
Expensive, unique, and already on the right model: reduce the tokens, by trimming input and constraining output.
A combination, which most workloads are: apply every lever that fits, because they compound.
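As a rough illustration of how levers compound, the sketch below multiplies the remaining cost fractions. It assumes each saving applies to the whole remaining bill and that the savings are independent, which is a simplification: caching, for instance, discounts only the repeated portion. The function name and the example figures are illustrative.

```python
def remaining_cost_fraction(savings):
    """Fraction of baseline cost left after applying each saving in turn.

    savings: fractions between 0 and 1, e.g. 0.5 for "half off".
    Assumes the savings multiply, which is a simplification.
    """
    fraction = 1.0
    for s in savings:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each saving must be between 0 and 1")
        fraction *= 1.0 - s
    return fraction


# Routing at 50 percent (inside the 40 to 70 percent range) plus batch at half off.
print(remaining_cost_fraction([0.5, 0.5]))  # 0.25
```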

The point of the map is that a workload caching cannot touch is almost never a workload that cannot be optimized at all. It is a workload whose saving lives in a different lever, and the job is to find which one rather than to conclude there is nothing to do.
