How Caching Changes Your Commit Math

Most discussions of prompt caching stop at the bill: turn it on, repeated context is billed at up to ninety percent off, and the monthly number goes down. That is true and it matters, but it understates what caching does. Caching does not only lower what you spend this month; it lowers the baseline you should be committing to for the years ahead. An Anthropic API commitment is a promise to consume a certain dollar amount over a term, and the right size of that promise depends on your real run rate. Caching changes the run rate, often substantially, on exactly the workloads that dominate enterprise spend. So the order in which you do these two things, optimize and commit, decides whether your commitment reflects reality or locks in waste. Cache before you size the commit and you commit to the truth. Commit before you cache and you have promised to buy tokens that caching would have made unnecessary.

The commit is sized on your run rate

Start with how a commitment is sized, because the whole argument turns on it. When you negotiate an API commitment, the central number is your expected consumption over the term, and the discount you earn is a function of how large and how credible that number is. The vendor wants the commit high; that is more guaranteed revenue. Your interest is in committing to what you will genuinely consume and no more, because committed spend that goes unused on Anthropic generally does not roll over and is not refunded; it simply disappears at period end. So the commit should be anchored to your real run rate, meaning the consumption you will actually generate after the optimizations you have decided to make. If you size it against an unoptimized run rate, you are committing to a number you have already decided to make untrue.

Caching moves the run rate, not the margins

What makes caching different from a small efficiency is the size and location of its effect. On the workloads where it applies (RAG, document analysis, anything with a large repeated context), caching does not shave a few percent. It can remove the majority of the input cost, because the large repeated prefix is precisely the expensive part and caching discounts it by up to ninety percent. On a workload dominated by repeated context, the post-caching run rate can be a fraction of the pre-caching run rate. That is not a rounding adjustment to a forecast; it is a different forecast. A commit sized before caching and a commit sized after caching are not two slightly different numbers. On the affected workloads they can differ several-fold. Treating caching as a minor optimization that does not change the commit conversation is the error.
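To make the size of the effect concrete, here is a minimal sketch of the arithmetic, not a pricing calculator. The function name and its parameters are illustrative assumptions: `cached_share` is the share of input tokens that sit in the repeated prefix, `hit_rate` is how often that prefix is actually served from cache, and `write_premium` is any surcharge your price sheet applies when a miss writes the prefix to cache (left at zero here; check your own rates).

```python
def cached_input_multiplier(cached_share, hit_rate,
                            read_discount=0.90, write_premium=0.0):
    """Fraction of the uncached input cost that remains after caching.

    All arguments are fractions between 0 and 1. The defaults are
    assumptions for illustration, not quoted prices.
    """
    uncached_part = 1.0 - cached_share                        # tokens caching never touches
    hit_cost = cached_share * hit_rate * (1.0 - read_discount)  # discounted cache reads
    miss_cost = cached_share * (1.0 - hit_rate) * (1.0 + write_premium)  # misses pay full price or more
    return uncached_part + hit_cost + miss_cost


# A workload where 80% of input is repeated prefix and 90% of requests hit the cache.
print(f"{cached_input_multiplier(0.8, 0.9):.3f}")  # prints 0.352
```

Under these assumed inputs, input cost falls to roughly a third of its uncached level. The result is sensitive to hit rate, so the figure to trust is the one measured in production, not the one assumed in a spreadsheet.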

What happens when you commit first

Consider the buyer who signs the commitment and optimizes afterward, which is the common order because the deal feels urgent and the engineering feels like it can come later. They forecast consumption from current, uncached usage, commit to that number, earn a discount on it, and sign. Then the engineering team ships caching and the real consumption drops well below the commit. Now the buyer is in the worst position the commit structure allows: they are consuming far less than they promised, the gap between consumption and commit is unused commitment that vanishes at period end, and they are effectively paying full price for tokens they never used. The discount they earned on the large commit is cold comfort, because they are buying far more than they need. They have paid for their own future optimization in advance, and the optimization, instead of saving money, has stranded it.
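The stranding described above is simple arithmetic. The sketch below uses hypothetical names and made-up dollar figures, and it assumes consumption is valued at the committed rates and that unused commitment is forfeited at period end.

```python
def stranded_commit(commit, consumption_at_committed_rates):
    """Return (stranded_dollars, utilization) for one commitment period.

    Assumes unused commitment does not roll over and is not refunded.
    """
    used = min(commit, consumption_at_committed_rates)
    stranded = commit - used      # dollars paid for tokens never consumed
    utilization = used / commit   # share of the commit actually used
    return stranded, utilization


# Illustrative only: a commit sized on uncached usage, consumed after caching ships.
stranded, utilization = stranded_commit(1_000_000, 400_000)
print(stranded, utilization)  # prints 600000 0.4
```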

What happens when you cache first

Now the buyer who caches first. They design and ship the caching on their major workloads, measure the new run rate, and only then size the commit. The number they commit to is smaller, because it reflects the cached reality, which means less exposure, less risk of a stranded overcommitment, and a forecast they can actually hit. They give up some headline commit size, and with it perhaps a slightly deeper discount tier, but they gain something worth far more, which is a commitment matched to genuine consumption. The discount on a right-sized commit beats a deeper discount on a commit they will not use, because the unused portion is pure loss and a few extra points of discount rarely cover it. This buyer also arrives at the table as a sophisticated counterparty who has optimized before committing, which is exactly the profile that earns the vendor's respect and its better terms.
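A small comparison shows why the shallower discount can still win. This is a sketch under stated assumptions: the function name, the discount tiers, and the dollar amounts are invented for illustration, and overage is assumed to be billed at the committed rate.

```python
def term_cost(list_price_usage, commit, discount):
    """Dollars paid over the term: the commit is a floor on spend.

    list_price_usage: real (post-caching) consumption valued at list price
    discount: negotiated discount as a fraction, e.g. 0.10
    """
    discounted_usage = list_price_usage * (1.0 - discount)
    return max(commit, discounted_usage)


usage = 500_000  # post-caching consumption at list price (illustrative)

right_sized = term_cost(usage, commit=450_000, discount=0.10)
overcommitted = term_cost(usage, commit=1_000_000, discount=0.20)

# Effective price paid per dollar of list-price usage.
print(round(right_sized / usage, 2), round(overcommitted / usage, 2))
```

In this illustration the right-sized buyer pays about ninety cents per dollar of list-price usage, while the overcommitted buyer pays two dollars despite holding the deeper discount.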

The headroom question

A fair objection is that committing to a lower, cached number leaves less room for growth, and growth is real. The answer is to handle growth deliberately rather than by overcommitting. You size the commit to your optimized run rate plus a reasoned buffer for genuine expected growth, not to an unoptimized number that happens to look like growth headroom but is really just waste. Better still, you negotiate the structure to handle growth without forcing you to prepay for it: overage priced at or near the committed rate so that growth beyond the commit is not punished, a phased ramp that lets the commit step up as consumption genuinely rises, and price protection so the rate holds as you scale. These terms give you headroom without making you commit to spend that caching will remove, which is the right way to leave room for growth.
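One way to picture a phased ramp is as a schedule that starts at the optimized run rate and steps up with expected growth. The sketch below is an illustration only: the function name, the 25 percent growth assumption, and the dollar figures are hypothetical, and a real ramp would be whatever you negotiate.

```python
def phased_commit(optimized_annual_run_rate, annual_growth, years):
    """Commit for each year of the term, stepping up with expected growth."""
    return [
        round(optimized_annual_run_rate * (1.0 + annual_growth) ** year, 2)
        for year in range(years)
    ]


ramp = phased_commit(400_000, annual_growth=0.25, years=3)
print(ramp)       # prints [400000.0, 500000.0, 625000.0]
print(sum(ramp))  # prints 1525000.0
```

Compare the total against a flat commit sized on the uncached number for the same term. The ramp leaves room for growth while keeping each year's promise close to what you expect to consume.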

Build the two scenarios

In practice the way to get this right is to model both run rates explicitly before any commit conversation. Build the consumption forecast as it stands today, uncached, and build it again as it will be once your planned caching is shipped, with realistic assumptions about cache hit rates and the share of your workload that caching actually reaches. The gap between those two forecasts is the spend that caching will remove, and it is the spend you must not commit to. Bringing both scenarios to the table also strengthens your negotiation, because you can show the vendor a credible, optimized number and explain the discipline behind it, rather than negotiating from a single fuzzy figure. The two-scenario model is the document that keeps you from committing to a run rate you have already planned to abandon.
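A two-scenario model can start as small as the sketch below. Everything in it is an assumption for illustration: the `Workload` fields, the sample workloads, and their numbers are hypothetical, output tokens are treated as unaffected by caching, and any cache-write surcharge is ignored.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    monthly_list_spend: float  # today's uncached spend, in dollars
    input_share: float         # share of spend that is input tokens
    cached_share: float        # share of input tokens in the repeated prefix
    hit_rate: float            # expected cache hit rate once caching ships


def cached_spend(w, read_discount=0.90):
    """Monthly spend for one workload after caching (write surcharge ignored)."""
    input_spend = w.monthly_list_spend * w.input_share
    saved = input_spend * w.cached_share * w.hit_rate * read_discount
    return w.monthly_list_spend - saved


def two_scenarios(workloads, months=12):
    """Uncached forecast, cached forecast, and the gap you should not commit to."""
    uncached = sum(w.monthly_list_spend for w in workloads) * months
    cached = sum(cached_spend(w) for w in workloads) * months
    return {"uncached": uncached, "cached": cached, "gap": uncached - cached}


portfolio = [
    Workload("rag-search", 50_000, input_share=0.9, cached_share=0.8, hit_rate=0.9),
    Workload("chat", 20_000, input_share=0.5, cached_share=0.0, hit_rate=0.0),
]
for label, dollars in two_scenarios(portfolio).items():
    print(f"{label}: {dollars:,.0f}")
```

Modeling per workload matters because caching reaches some workloads and not others. A single blended assumption hides which part of the forecast is actually moving.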

Sequence is the whole point

The lesson collapses to a single rule: optimize before you commit, and caching is the optimization most likely to move the commit number on enterprise workloads. The pressure always runs the other way, because the deal feels time-sensitive and the caching work feels deferrable, but giving in to that pressure is how buyers end up locked into spend they have already decided to eliminate. The caching is not the part that can wait; the commitment is. A commitment is a multi-year promise and it should be made last, against the lowest run rate you can credibly establish, with the caching already in place and measured. Sequence is not a detail here. It is the entire difference between a commit that fits and a commit that strands money for years.

Where this fits

Understanding how caching changes the commit math is where token optimization and commitment strategy meet, and getting the sequence right is one of the highest-value moves a buyer can make. For the optimization method, read the pillar guide, the token optimization playbook. If you are approaching a commitment and want both run rate scenarios modeled and the commit sized and negotiated around the cached number, book a strategy call and we will build it with you before you sign.
