Independent buyer side advisory · Anthropic onlyNew York · London
Prompt Caching

Caching and latency: a double win.

Prompt caching does not just cut cost. Because cached tokens skip reprocessing, it also cuts the time to first token. Here is why caching is a double win, and how to capture both halves of it at once.

Buyer side analysis · 9 min read
34%
Average reduction in Claude spend
$40M+
Anthropic commitments advised
100%
Anthropic focus, no other vendor

Most teams adopt prompt caching for the cost saving, which is the right reason, because taking up to ninety percent off repeated input tokens is one of the largest levers available on Claude. What they often miss is that the same mechanism that lowers the bill also lowers latency, and the latency gain frequently matters as much to the product as the cost gain matters to the budget. Caching is one of the rare optimizations where the engineering interest and the finance interest point in exactly the same direction, which makes it an unusually easy decision to get approved and an unusually satisfying one to ship. This piece explains why the two wins come from the same place, and how to make sure you capture both rather than settling for one.

Why one mechanism produces two wins

When a request includes a large block of input, the model has to process those tokens before it can begin generating a response, and that processing takes both money and time. You pay for the tokens, and you wait while they are read. Caching changes both at once. When the stable portion of a prompt is served from cache, it is not reprocessed from scratch, so you are billed at the steep cache discount rather than the full input rate, and the model skips much of the work of ingesting that content before it starts responding. The cost saving and the latency saving are not two separate features bolted together, they are two consequences of the same fact: cached content does not have to be processed again. That is why every request that hits the cache is both cheaper and faster, and why the gains scale together as your hit rate rises.

Where the latency win is most valuable

The latency half of the win lands hardest on applications where users are waiting in real time and the perceived speed shapes the experience. A conversational assistant that carries a long system prompt and growing history pays a latency tax on every turn for reprocessing that context, and caching the stable prefix cuts the time to first token noticeably, making the assistant feel responsive rather than sluggish. A retrieval application that sends the same large reference material with each query reads it from cache instead of reprocessing it, so answers arrive faster. Any product with a big fixed context and a small variable question, the exact profile that caching rewards on cost, is also the profile where the latency improvement is most visible to the user. The double win is concentrated precisely where it matters most.

Capturing both halves at once

The good news is that you do not have to choose, because the same configuration delivers both. Structure the prompt so the stable context forms the cacheable prefix and the variable content comes last, drive the hit rate up by keeping that prefix genuinely stable and clustering requests within the cache lifetime, and both the cost discount and the latency reduction follow automatically on every cache hit. The mistake that leaves value on the table is measuring only one of them. Teams that track cost but not latency may not realize how much faster their product has become, and so fail to credit caching for a user experience improvement they could be promoting. Teams that track latency but not cost may not realize how much they are saving. Instrumenting both, the cache hit rate, the cost per request, and the time to first token, lets you see the full return and tune for it deliberately.

How it fits the optimized baseline

Caching sits alongside the other token levers, and the double win strengthens the case for all of them. Routing across Opus, Sonnet, and Haiku puts each request on the cheapest model that clears the quality bar. Batch processing runs asynchronous work at roughly half the real time rate. Caching takes up to ninety percent off repeated input while also cutting latency, and applied together these levers typically reduce aggregate spend by forty to seventy percent. For a buyer, the relevance goes beyond the monthly bill. The leaner, faster baseline that caching helps produce is the baseline you should commit to when you negotiate with Anthropic, because committing against unoptimized usage locks waste into your contract for the full term, and unused commitment is generally lost rather than refunded. Optimizing first, then committing, is how you negotiate from real demand rather than inefficiency, and caching is one of the levers that makes that baseline both cheaper and better performing at the same time.

Why the latency win is easy to undersell

The cost saving from caching shows up on the invoice, so it is hard to miss, but the latency saving is easy to undersell because it hides in a metric many teams do not watch closely: time to first token. That is the delay between sending a request and the model beginning its response, and on a request with a large input it is dominated by the time spent processing that input before generation can start. Caching the stable portion cuts that processing, so the response begins sooner, but if you only track total cost you will never see the improvement and will not credit caching for it. The teams that benefit most are the ones who measure time to first token alongside cost, because they can see both halves of the win and tune for both. A faster product is a competitive advantage that does not appear on any bill, and a team that ships caching for the cost saving and quietly gains the latency saving is leaving an unmeasured improvement on the table that they could be using to justify further investment or to promote the product experience.

This matters commercially as well as technically, because latency shapes adoption. A real time assistant that responds quickly gets used, and one that lags gets abandoned, so the latency half of the caching win can drive usage and retention in ways that compound the value well beyond the token saving. When an engineering leader and a product leader look at the same caching change, the engineer sees a cheaper bill and the product owner sees a snappier experience, and both are right, because both flow from the same cached tokens skipping reprocessing. That shared benefit is what makes caching one of the easiest optimizations to get aligned support for across an organization.

Where the double win does not apply

Honesty about the limits makes the case stronger. Caching delivers its double win on repeated, stable content, so an application whose every request is largely novel has little to cache and will see neither the cost nor the latency benefit, because there is no warm prefix to reuse. Similarly, asynchronous batch workloads, where no user is waiting in real time, capture the cost saving from caching but place little value on the latency saving, because nothing is gained by a faster time to first token when the result is collected later. For those workloads, batch processing at roughly half the real time rate is often the more relevant lever, and caching is a complement rather than the headline. Knowing where the double win lands, real time products with a big fixed context, and where only one half applies, lets you aim caching at the workloads that reward it most and pair it with the right companion lever everywhere else.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.