Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Prompt Caching
Prompt Caching

The ROI of a caching refactor.

Buyer side guide · 10 minute read

A caching refactor is one of the clearest investment decisions in your Claude stack, because both sides of the equation are knowable. On one side is a one time engineering cost to restructure your prompts so the stable content is cached. On the other is a recurring reduction in your token bill that continues for as long as the workload runs. When the recurring saving is large relative to the one time cost, the decision is obvious, and for high volume workloads it usually is. The reason refactors get deferred is not that the math is unfavorable. It is that no one has done the math. This guide shows how to size the return on a caching refactor so you can make the call with numbers rather than instinct.

The two sides of the equation

The return on a caching refactor is a comparison between a recurring benefit and a one time cost, and sizing it means putting a real number on each.

The recurring saving

The benefit is the reduction in your token bill once caching is working. Caching takes up to 90 percent off the cost of the tokens it covers, so the saving depends on how much of your prompt is stable content that can be cached and how often that content is reused. For a workload that sends a large fixed context on every request, the stable portion is most of the prompt, and moving it into the cached, heavily discounted tier cuts a large share of the cost. That saving repeats on every request, every day, for the life of the workload, which is what makes it powerful: you pay the engineering cost once and collect the saving indefinitely.

The one time cost

The cost is the engineering effort to restructure the prompts: auditing the context to separate static from dynamic, reordering so the stable content sits first, holding the order deterministic, and testing that output quality is unchanged. For most applications this is measured in engineering days, not weeks, because the changes are to how the prompt is assembled rather than to the underlying logic. It is a bounded, one time investment, and once it is done the saving runs on its own.

How to size it

Putting the two sides together is a short calculation that any team can run with data it already has.

Start with the current monthly spend on the workload you are considering. Estimate the share of each prompt that is stable, reusable content, because that is the part caching can discount. Apply the caching discount to that portion to get the monthly saving. Then estimate the engineering cost of the refactor in days, and convert it to a cost. Divide the one time cost by the monthly saving and you have the payback period in months. For high volume workloads with a large stable context, that payback is often measured in weeks, after which the saving is pure return that continues for as long as the workload runs.

The calculation makes the decision clear because it exposes the asymmetry. The cost is one time and bounded. The saving is recurring and open ended. Even a modest monthly saving repays a bounded engineering cost quickly, and after payback the entire saving is upside. The only workloads where the math does not favor a refactor are low volume ones, where there is little spend to save against the fixed engineering cost, which tells you exactly where to focus the effort.

Where the ROI is strongest

The refactors that pay back fastest share a clear profile, and recognizing it tells you which workloads to refactor first.

The strongest case is high volume traffic with a large, stable context. A document assistant answering many questions against the same source, a support agent carrying a long fixed instruction set, a code tool reading the same files across a session, and a retrieval workload sharing a common knowledge base all send heavy stable content over and over. The stable portion is large, the reuse is constant, and the saving compounds across millions of requests, so the payback is fast and the ongoing return is large.

The weakest case is the opposite: low volume traffic, or prompts that are almost entirely dynamic with little stable content to cache. There the cacheable portion is small or the volume is too low for the saving to outweigh the engineering cost, and the refactor should wait behind higher return work. Sizing the ROI is what separates the two, so you spend the engineering effort where it pays and skip it where it does not.

Why the saving is more durable than most

Not all token savings are equal in how long they last, and a caching refactor produces one of the more durable kinds. A discount won at the negotiating table can erode at the next renewal. A model routing change can drift if traffic shifts. But a caching refactor changes the structure of the prompt itself, and that structure keeps delivering the saving on every request for as long as the workload runs, without anyone having to defend it again. The saving is built into how the application works rather than into an agreement that has to be renewed.

This durability is part of the return, and it is easy to underweight when comparing the refactor against other uses of engineering time. A feature ships and its value is realized once. A caching refactor ships and its value compounds across every future request, quietly, for years. When the payback is measured in weeks and the saving runs for the life of the workload, the total return over time dwarfs the one time cost by a wide margin, which is why these refactors so often turn out to be among the highest return engineering work a team can do on a Claude application.

The risks to price into the estimate

A return estimate is only useful if it is honest about what could go wrong, and a caching refactor has a few risks worth pricing in rather than ignoring. None of them usually changes the conclusion for a high volume workload, but accounting for them turns an optimistic number into one you can defend to a finance team.

The first risk is that the stable share of your prompt is smaller than you assumed, which would lower the saving. The protection against this is to measure the stable share from real prompts rather than estimating it, so the saving in your model reflects what can actually be cached. The second is that the refactor takes longer than scoped, which would raise the cost and lengthen the payback. Because the changes are to prompt assembly rather than core logic, overruns are usually modest, but scoping the work honestly and adding a margin keeps the estimate realistic. The third is that the saving erodes over time if the cache design is not maintained, since a future change can reintroduce a dynamic value that breaks the prefix. The answer is to treat hit rate as a metric you watch, so erosion is caught and corrected rather than silently eating the return.

Pricing these in rather than assuming the best case is what makes the ROI credible. A refactor that pays back in a few weeks on optimistic assumptions still pays back in a couple of months on conservative ones, and presenting the conservative number is what gets the work approved and keeps expectations honest after it ships.

Why the refactor is worth doing now

The cost of deferring a profitable caching refactor is not zero. Every month the workload runs without caching is a month of the saving forgone, and that forgone saving never comes back. A refactor that would pay for itself in a few weeks and then save every month after is losing money for the business each month it sits in the backlog. The recurring nature of the saving cuts both ways: it makes a completed refactor valuable, and it makes a delayed one quietly expensive.

This is also why caching belongs alongside model routing as a first order lever rather than a later optimization. The two compound: routing puts each request on the cheapest model that does it well, and caching discounts the repeated context on top of that. Done together on a high volume workload, they move a large share of the bill, and the caching half of that is a one time engineering cost against a permanent saving, which is about as favorable as an optimization decision gets.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.