Measuring Cache Hit Rate on Claude | Antrophic Negotiations

Turning prompt caching on is the easy part. Knowing whether it is working is where most teams fall short, and the gap costs real money. Caching can take up to ninety percent off repeated input tokens, but only on the requests that actually hit the cache, and a deployment that caches everything in principle while hitting the cache rarely in practice captures a fraction of the available saving. The difference between a cache that is configured and a cache that is performing is the hit rate, the share of cacheable tokens that are served from cache rather than billed at full price. If you are not measuring it, you do not know whether your caching is delivering ninety percent off or close to nothing, and we routinely find applications that believe they are caching well while their hit rate tells a different story. Measuring the hit rate turns caching from an act of faith into a managed lever.

What the hit rate actually measures

When you mark part of a prompt as cacheable, each request falls into one of two outcomes for that content. Either the content is found in the cache and read at the steep discount, a cache read, or it is not found and has to be written, a cache write, which is billed at full input price plus a small write premium. The hit rate is the proportion of your cacheable token volume that is served as reads rather than writes. A high hit rate means most of your repeated content is being read cheaply, which is the saving you wanted. A low hit rate means you keep writing the cache and rarely reading it, paying the write premium repeatedly while almost never collecting the discount, which can cost more than not caching at all. The hit rate is the single number that tells you which world you are in.

Read the usage fields on every response

Claude reports cache activity directly in the usage data returned with each response, and this is where measurement begins. The usage breakdown distinguishes the tokens that were written to the cache from the tokens that were read from it, alongside the ordinary input and output token counts. To measure your hit rate, capture these fields on every call and aggregate them. The calculation is straightforward: compare the volume of cache read tokens against the total cacheable volume, which is reads plus writes, and the read share is your hit rate. Doing this per request gives you the raw signal. Aggregating it across a window, by endpoint, by task type, and over time, turns the signal into something you can act on, because an average across the whole application hides the specific places where caching is failing.

Build the measurement into your observability

A one time check tells you the hit rate today. Production drifts, so you want the hit rate as a standing metric, not a spot reading. Log the cache read and write token counts alongside your existing request logs, and surface the aggregated hit rate in whatever observability or cost dashboard your team already watches. Break it down by the dimensions that matter: which endpoints cache well and which do not, which task types hit and which miss, and how the rate moves across the day as traffic patterns change. This breakdown is what makes the metric useful, because a blended hit rate of, say, sixty percent could mean every endpoint sits at sixty, or it could mean half your endpoints cache beautifully while the other half barely cache at all, and only the latter tells you where to fix things.

What a low hit rate is telling you

A poor hit rate is a diagnosis, and the usual causes are specific. The most common is prompt structure: the cache forms on the stable prefix of a prompt, so if any variable content sits ahead of the stable block, the prefix changes on every request and the cache can never be reused. Reordering so the fixed context comes first and the variable content comes last often transforms the rate on its own. The second common cause is expiry: cached content lives for a limited window, so if your requests against a given context are spaced too far apart, the cache expires between them and every request becomes a fresh write. The third is fragmentation: small differences in supposedly stable content, a timestamp, a session identifier, a reordered list, mean the cache key never matches even though the content is conceptually the same. Each of these is fixable once the hit rate has pointed you at it.

Raising the rate deliberately

Once you know where the misses are, raising the hit rate is concrete engineering. Move all variable content to the end of the prompt so the cacheable prefix is genuinely stable. Strip incidental variation, timestamps, identifiers, and anything that changes without needing to, out of the cached block so the key matches reliably. Where requests against a context are naturally spread out, consider batching or routing them so they cluster within the cache lifetime and reuse the warm cache rather than re writing a cold one. For conversational workloads, structure the cache so the stable system context and early turns are cached as a unit that persists as the conversation grows. Each change is testable, because you can watch the hit rate move in your dashboard as you ship it, which turns optimization into a measured feedback loop rather than guesswork.

Connect the rate to the dollars

The hit rate is a means, not an end. What you care about is the saving, so translate the rate into money. With your token prices and your read and write volumes, you can compute what caching is actually saving against what the same traffic would cost with no caching, and what it would save at a higher hit rate. This is the number that justifies the engineering effort and the number that belongs in a cost review, because it converts a technical metric into a business case. It also reveals the ceiling: an application with little genuinely repeated content has a low cacheable share and limited upside no matter how high the hit rate goes, while an application dominated by stable context has enormous upside that a low hit rate is currently leaving unclaimed. Measuring both the rate and the dollars tells you not just how you are doing but how much is still on the table.

Why this matters at the contract table

Caching that is measured and tuned lowers your real consumption, and real consumption is what you should commit to. A buyer who has driven the hit rate up sizes the Anthropic commitment against the leaner baseline caching produces, rather than against the inflated cost of paying full input price for content that should have been read cheaply. Because unused commitment is generally lost rather than refunded, committing to a baseline you could have optimized away is money spent for nothing. The discipline of measuring cache hit rate therefore pays twice, once on the monthly bill and again on the commitment you negotiate, and it sits alongside model routing and batch as one of the levers that, applied together, typically cut aggregate spend by forty to seventy percent. The buyers who negotiate from strength are the ones whose numbers are measured, and the hit rate is one of the numbers that proves the baseline is honest.

Set a target and alert on regressions

A measured hit rate is most useful when it has a target attached, because a number without a benchmark is hard to act on. The right target depends on the workload: an application with a large, stable context queried frequently should reach a high hit rate, often well above eighty percent, while a workload with naturally spread out traffic may settle lower even when well tuned. Establish what good looks like for each endpoint based on its traffic shape, set that as the target, and treat sustained shortfalls as defects to investigate rather than background noise. Just as important is alerting on regressions, because hit rate does not only start low, it can fall. A code change that reorders a prompt, a new field that introduces variation into a previously stable block, or a shift in traffic timing can quietly collapse a hit rate that was healthy, and without an alert you will not notice until the bill rises. Wiring a regression alert to your hit rate metric catches these silently introduced losses while they are still cheap to fix, which is the difference between a cache that stays tuned and one that decays after launch.

Regression alerting also protects the saving across the natural churn of a codebase. Applications evolve, prompts get edited, and the engineer making a change rarely realizes that a small reordering broke the cache prefix for an entire endpoint. The hit rate is the canary: when it drops, something changed, and the alert points you straight at it. Teams that treat the hit rate as a permanent, monitored metric rather than a one time measurement keep their caching saving intact over time, while teams that measure once and move on watch it erode invisibly as the application changes around them.

Attribute the saving by team and feature

In a large organization, the cache hit rate is not only an engineering metric, it is a governance one, because different teams and features cache with very different discipline and the blended number hides who is doing well and who is not. Breaking the hit rate and the associated cost down by team, by feature, or by endpoint turns an abstract optimization into an accountable one, surfacing the specific owners whose workloads are leaving the discount unclaimed. This attribution is what lets a platform or finance team drive improvement at scale, because it converts caching from a vague aspiration into a measured expectation with a named owner. It also makes the wins visible: a team that raises its hit rate and cuts its cost can be recognized for it, which builds the culture that keeps caching healthy. The same showback discipline that organizations apply to cloud spend applies to token spend, and the cache hit rate is one of the clearest, most actionable signals to attribute, because the path from a low rate to a fix is concrete and the saving is measurable.

Measuring cache hit rate on Claude.