Stress Testing Your Commit Assumptions

A committed spend deal with Anthropic rests on a chain of assumptions, and most of those assumptions are never examined. Someone assumes adoption will follow the plan, that prompts will stay the size they are today, that the new feature ships on time, that the model mix holds. Each assumption feels reasonable in isolation, and stacked together they produce a commit number that gets signed and budgeted. Then the year happens. Adoption runs slower than the plan, or a launch slips, or a product decision pushes everything onto a more expensive model, and the commit that looked careful turns out to have been built on a row of dominoes. Stress testing is the discipline of pushing on each assumption before you sign, finding the ones that break the number, and either fixing them or building contract structure to absorb them. It is cheap to do in a spreadsheet and expensive to skip.

Name every assumption explicitly

The first step is to drag the assumptions into the open, because most of them are implicit. A commit forecast usually hides at least six load-bearing assumptions: the adoption curve, the average request volume per user, the input token size per request, the output token size per response, the model mix across Opus, Sonnet, and Haiku, and the optimization gains you have penciled in but not yet shipped. Write each one down as a specific value, not a vibe. Adoption reaches sixty percent of seats by quarter three. Average response is eight hundred output tokens. Seventy percent of traffic routes to Sonnet. Once each assumption is a number on a page, you can test it, and you will often find that two or three of them are doing almost all of the work in the forecast while the rest barely matter. Those are the ones to attack.
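One way to put the assumptions on a page is a small model in which every driver is a named field. The sketch below is illustrative only: the `Assumptions` class, the `annual_spend` function, the flat average adoption rate, and above all the prices in `PRICES` are assumptions made for the example, not figures from any agreement.

```python
from dataclasses import dataclass

# Placeholder prices in dollars per million tokens, as (input, output).
# Illustrative only; substitute the rates from your own agreement.
PRICES = {"opus": (10.0, 50.0), "sonnet": (2.0, 10.0), "haiku": (0.5, 2.5)}


@dataclass(frozen=True)
class Assumptions:
    seats: int                       # licensed seats
    adoption: float                  # average fraction of seats active over the year
    requests_per_user_month: float   # requests per active user per month
    input_tokens: float              # average input tokens per request
    output_tokens: float             # average output tokens per response
    model_mix: dict                  # model name -> share of traffic (sums to 1)
    optimization_gain: float = 0.0   # fraction of spend removed by planned optimizations


def annual_spend(a, prices=PRICES):
    """Annual spend in dollars implied by one set of assumptions."""
    if abs(sum(a.model_mix.values()) - 1.0) > 1e-9:
        raise ValueError("model mix shares must sum to 1")
    requests = a.seats * a.adoption * a.requests_per_user_month * 12
    cost_per_request = sum(
        share * (a.input_tokens * prices[m][0] + a.output_tokens * prices[m][1]) / 1e6
        for m, share in a.model_mix.items()
    )
    return requests * cost_per_request * (1.0 - a.optimization_gain)


plan = Assumptions(
    seats=1000,
    adoption=0.6,
    requests_per_user_month=200,
    input_tokens=2000,
    output_tokens=800,
    model_mix={"opus": 0.1, "sonnet": 0.7, "haiku": 0.2},
)
forecast = annual_spend(plan)
print(f"Forecast annual spend: ${forecast:,.0f}")
```

Because every driver is an explicit field, changing one value and rerunning shows immediately how much of the forecast it carries.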

Push each assumption to its breaking point

Stress testing is not gentle sensitivity analysis where you nudge a number five percent and watch it wobble. It is deliberately pushing each assumption to a plausible extreme and seeing what happens to the commit. What if adoption is half what you planned? What if it is double? What if output tokens run forty percent heavier than your estimate because responses got chattier in production? What if the launch that drives most of the second-half volume slips a full quarter? For each push, you are looking for two things. First, does the assumption move the number enough to matter, which tells you where the real risk lives. Second, does the failure land you in shortfall or in overage, because those are different problems with different remedies. An assumption that can push you into forfeited unused commitment is more dangerous than one that pushes you into overage, especially if you have negotiated overage at the committed rate.
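The pushes can be run mechanically. The sketch below uses a deliberately simple spend model with a single blended price; the parameter names, base values, prices, and push multipliers are all hypothetical. It pushes one assumption at a time to a low and a high extreme, then ranks the assumptions by how far they swing the number and reports whether each extreme lands in shortfall or overage.

```python
def annual_spend(p):
    """Annual spend in dollars for one parameter set (single blended price)."""
    active_users = p["seats"] * p["adoption"]
    requests = active_users * p["requests_per_user_month"] * 12
    cost_per_request = (
        p["input_tokens"] * p["price_in"] + p["output_tokens"] * p["price_out"]
    ) / 1e6
    return requests * cost_per_request


# Hypothetical base case; prices are placeholder dollars per million tokens.
BASE = {
    "seats": 1000, "adoption": 0.4, "requests_per_user_month": 200,
    "input_tokens": 2000, "output_tokens": 800,
    "price_in": 3.0, "price_out": 15.0,
}

# Plausible extremes as (low, high) multipliers on the base value.
PUSHES = {
    "adoption": (0.5, 2.0),        # half the plan, double the plan
    "output_tokens": (1.0, 1.4),   # responses run forty percent heavier
    "input_tokens": (0.8, 1.2),
}


def stress(base, pushes, commit):
    """Push each assumption alone and rank by the swing it causes."""
    rows = []
    for name, (lo, hi) in pushes.items():
        low = annual_spend({**base, name: base[name] * lo})
        high = annual_spend({**base, name: base[name] * hi})
        rows.append({
            "assumption": name,
            "swing": high - low,
            "shortfall": max(0.0, commit - low),   # unused commitment at the low extreme
            "overage": max(0.0, high - commit),    # spend above commit at the high extreme
        })
    return sorted(rows, key=lambda r: r["swing"], reverse=True)


commit = annual_spend(BASE)
results = stress(BASE, PUSHES, commit)
for r in results:
    print(f'{r["assumption"]:>14}: swing ${r["swing"]:,.0f}, '
          f'shortfall ${r["shortfall"]:,.0f}, overage ${r["overage"]:,.0f}')
```

With these made-up inputs, adoption dominates the ranking and is the only push that produces a shortfall, which is the pattern the text tells you to look for.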

The asymmetry you have to respect

Commit risk is not symmetric, and your stress test has to reflect that. If you commit too high and usage falls short, the unused commitment is generally forfeited on an Anthropic agreement, so the downside is the full gap between commit and actual usage, paid for nothing. If you commit too low and usage runs over, you pay overage on the excess, which is bad but recoverable, and far better if you have secured overage at the committed rate rather than at list. That asymmetry means a forecast that is equally likely to be too high or too low is not a neutral bet. The cost of being high is worse than the cost of being low, so a well stress-tested commit usually lands below the midpoint of your range, with the upside handled as protected overage. Your stress test is what reveals this, by showing you the dollar cost of each failure mode rather than just the probability.
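The asymmetry is easy to see in dollars. The sketch below measures "waste" as money paid beyond what the actual usage would have cost at the committed rate. The function names, the usage scenarios and their probabilities, and the 1.25 overage multiplier are all hypothetical, and the model assumes the unit rate does not change with the size of the commit.

```python
def wasted_dollars(commit, usage, overage_multiplier=1.0):
    """Dollars paid beyond the cost of actual usage at the committed rate.

    usage is valued at the committed rate. overage_multiplier is the price of
    overage relative to that rate (1.0 means overage at the committed rate).
    """
    if usage < commit:
        return commit - usage                              # forfeited commitment
    return (usage - commit) * (overage_multiplier - 1.0)   # premium paid on overage


def expected_waste(commit, scenarios, overage_multiplier=1.0):
    """Probability-weighted waste across (usage, probability) scenarios."""
    return sum(p * wasted_dollars(commit, u, overage_multiplier) for u, p in scenarios)


# A symmetric range around a $1.0M midpoint (hypothetical).
scenarios = [(700_000, 0.25), (1_000_000, 0.50), (1_300_000, 0.25)]
candidates = [700_000, 850_000, 1_000_000, 1_150_000, 1_300_000]

for c in candidates:
    print(f"commit ${c:,}: expected waste ${expected_waste(c, scenarios, 1.25):,.0f}")

best = min(candidates, key=lambda c: expected_waste(c, scenarios, overage_multiplier=1.25))
print(f"lowest expected waste at commit ${best:,}")
```

Missing by 300,000 on the high side wastes the full 300,000, while missing by the same amount on the low side wastes only the overage premium, so the lowest-waste commit sits below the midpoint. A real deal may offer a deeper discount for a larger commit, which this sketch ignores, so read it as showing the direction of the bias rather than the right number.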

The assumptions that break commits most often

An adoption curve that is steeper or earlier than real organizational change ever delivers.
Optimization savings that are assumed in the forecast but have not been built, tested, or shipped.
A model mix that drifts toward Opus in production because no routing logic enforces the plan.
Output token sizes estimated from a demo rather than measured from real traffic.
A single launch carrying most of the second-half volume with no slack if it slips.

Separate assumptions you control from ones you do not

Once you know which assumptions move the number, sort them into two piles. Some are inside your control: prompt size, model routing, whether you ship caching, how verbose your responses are. Others are outside it: market adoption, a customer's rollout pace, a launch date that depends on another team. The two piles get different treatment. For the assumptions you control, the stress test becomes a to-do list. If the forecast depends on routing seventy percent of traffic to Sonnet, then build and test that routing before you sign, so the assumption is a fact rather than a hope. For the assumptions you do not control, you cannot fix the uncertainty, so you negotiate structure to absorb it: a phased commit ramp, overage at the committed rate, and a midterm reforecast right. The mistake is treating both piles the same, either by assuming you can control market adoption or by signing away the levers you actually hold.

Turn the stress test into contract structure

The output of a good stress test is not just a better number; it is a list of specific contract asks. If adoption timing is your biggest uncertainty, the answer is a ramp that starts the commit low and steps it up as usage proves out, so you are not prepaying for a curve that has not happened. If output volume is the risk, the answer is overage priced at the committed rate, so the part of the range you could not pin down does not get charged at list. If the whole forecast could be wrong because a major product decision is still open, the answer is a midterm reforecast right that lets you reset the commit once reality is clearer. Each of these maps directly to a failure mode your stress test exposed. That is the difference between a stress test that just worries you and one that arms you, because every break you found becomes a clause that protects you.
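The value of a ramp can be sketched with a few lines of arithmetic. The example below assumes, purely for illustration, that the commit is measured quarter by quarter and that any unused amount in a quarter is forfeited; real agreements may measure the commit over a different period. The quarterly figures are made up and are in thousands of dollars.

```python
def forfeited(commit_by_quarter, usage_by_quarter):
    """Total unused commitment, assuming each quarter is measured on its own."""
    return sum(max(0, c - u) for c, u in zip(commit_by_quarter, usage_by_quarter))


# Same total commitment (1,000) shaped two ways, in thousands of dollars.
flat_commit = [250, 250, 250, 250]
ramped_commit = [100, 200, 300, 400]

# A slow adoption curve that falls a little short of plan in every quarter.
slow_usage = [80, 150, 260, 380]

flat_loss = forfeited(flat_commit, slow_usage)
ramp_loss = forfeited(ramped_commit, slow_usage)
print(f"flat commit forfeits {flat_loss}k, ramped commit forfeits {ramp_loss}k")
```

Under these assumptions the flat schedule forfeits most of its loss in the first two quarters, before the adoption curve has had a chance to arrive, which is exactly the exposure a ramp is meant to remove.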

Pressure test the optimization assumptions hardest

Of all the assumptions in a commit forecast, the optimization gains deserve the most skeptical treatment, because they are the ones most often assumed and least often delivered. A forecast that bakes in a forty percent reduction from model routing, or a ninety percent saving on cached context, is making a promise about engineering work that has not happened yet, may not be prioritized, and may not land as cleanly in production as it did in a test. The stress test here is simple and unforgiving: assume the optimization does not ship, or ships at half the expected benefit, and see what the commit looks like. If the deal only works because of savings you have planned but not built, you are committing against a hope, and the safer path is to build the optimization first and forecast against the result. Savings that exist in a spreadsheet are not savings. Savings that exist in shipped, measured production code are, and only the second kind belongs in a commit you are about to sign. Treat assumed optimization as the most fragile assumption you have, because in practice it usually is.
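The haircut is simple arithmetic. The sketch below takes a hypothetical unoptimized spend and a hypothetical planned saving, then recomputes the spend with the optimization fully shipped, shipped at half benefit, and not shipped at all. The function name and every figure are assumptions made for the example.

```python
def spend_after_optimization(gross_spend, planned_saving, realization):
    """Spend once a planned saving is delivered at a given realization.

    planned_saving: fraction of gross spend the optimization is expected to remove.
    realization:    1.0 = ships as planned, 0.5 = half the benefit, 0.0 = never ships.
    """
    return gross_spend * (1.0 - planned_saving * realization)


gross_spend = 1_000_000.0   # hypothetical spend with no optimization at all
planned_saving = 0.40       # the forty percent reduction baked into the forecast

# The commit was sized as if the optimization ships in full.
commit = spend_after_optimization(gross_spend, planned_saving, 1.0)

exposure = {}
for realization in (1.0, 0.5, 0.0):
    actual = spend_after_optimization(gross_spend, planned_saving, realization)
    exposure[realization] = actual - commit   # spend above commit if savings fall short
    print(f"realization {realization:.0%}: spend ${actual:,.0f}, "
          f"above commit by ${exposure[realization]:,.0f}")
```

If the amount above commit would be billed at list rather than at the committed rate, that line of the output is the number to bring to the negotiation.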

Translate the stress test into a monitoring plan

A stress test done before signing tells you which assumptions are fragile, and that knowledge is wasted if you do not watch those assumptions once the deal is live. The output should include a short monitoring plan: the two or three drivers that carry most of the risk, the threshold at which each one becomes a problem, and the action you will take if it crosses. If adoption is the fragile assumption, watch the active user count against your ramp and know in advance when a shortfall would justify triggering a reforecast. If output verbosity is the risk, watch the average response size and have the prompt fix ready. This turns the stress test from a one-time gate into an ongoing early warning system, so a broken assumption surfaces as a tracked metric weeks before it surfaces as a forfeiture or an overage bill. The buyer who monitors the fragile assumptions can act while there is still time to act, which is the entire reason you identified them in the first place.
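A monitoring plan of this kind fits in a handful of records: a driver, its planned value, a threshold, and the action tied to a breach. The sketch below is one hypothetical shape for it; the `Watch` class, the metric names, the tolerances, and the actions are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watch:
    metric: str        # name of the driver being tracked
    planned: float     # value the forecast assumed
    tolerance: float   # allowed fractional deviation from plan
    bad_side: str      # "below" or "above": which direction is the problem
    action: str        # what to do once the threshold is crossed


def breaches(watches, observed):
    """Return (metric, action) for every watch whose threshold is crossed."""
    alerts = []
    for w in watches:
        deviation = (observed[w.metric] - w.planned) / w.planned
        if w.bad_side == "below" and deviation < -w.tolerance:
            alerts.append((w.metric, w.action))
        elif w.bad_side == "above" and deviation > w.tolerance:
            alerts.append((w.metric, w.action))
    return alerts


plan = [
    Watch("active_users", 600, 0.15, "below", "open the reforecast discussion"),
    Watch("avg_output_tokens", 800, 0.20, "above", "ship the prompt fix"),
]

# Hypothetical readings partway through the term.
this_month = {"active_users": 480, "avg_output_tokens": 900}
alerts = breaches(plan, this_month)
for metric, action in alerts:
    print(f"{metric} crossed its threshold: {action}")
```

Here active users sit twenty percent under plan and trip the alert, while response size is heavier than planned but still inside its tolerance.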

Shrink the risk before you size the deal

The cheapest way to pass a stress test is to reduce the thing being tested. Many of the assumptions that break a commit are the same levers that lower the bill, so doing the optimization work before you size the deal both shrinks the number and tightens the assumptions around it. Routing predictable work to Sonnet and Haiku instead of running everything on Opus pulls the forecast down and removes the model mix drift risk at the same time. Prompt caching on repeated context, which can cut the cost of that context by up to ninety percent, and batch processing on asynchronous work at roughly half the standard rate, both remove expensive tokens from the forecast entirely. When you stress test a commit that has already been optimized, fewer assumptions break, because there is simply less spend exposed to each of them. Optimization is not a separate workstream from commit sizing; it is the first move in getting the size right.
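A rough sketch of how those two levers shrink the number before you size the deal. It uses the discount figures quoted above as defaults and assumes, for simplicity, that the caching and batch discounts stack and that cache writes cost nothing extra. The function name, the spend split, and the shares are hypothetical.

```python
def optimized_spend(input_spend, output_spend, cached_share, batch_share,
                    cache_discount=0.90, batch_discount=0.50):
    """Estimated spend after prompt caching and batch processing.

    cached_share: fraction of input spend that is repeated context served from cache.
    batch_share:  fraction of all traffic that can run asynchronously in batch.
    Assumes the two discounts stack and ignores any extra cost of writing to the cache.
    """
    input_after_cache = input_spend * (1.0 - cached_share * cache_discount)
    total = input_after_cache + output_spend
    return total * (1.0 - batch_share * batch_discount)


# Hypothetical unoptimized forecast, split into input and output spend.
before = 400_000.0 + 600_000.0
after = optimized_spend(
    input_spend=400_000.0,
    output_spend=600_000.0,
    cached_share=0.5,   # half of input spend is repeated context
    batch_share=0.2,    # a fifth of traffic can wait for batch
)
print(f"before ${before:,.0f}, after ${after:,.0f}, saved ${before - after:,.0f}")
```

Only savings measured in production belong in the commit, so a calculation like this is for deciding which optimization to build first, not for setting the number you sign.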

Test the assumptions in combination, not just one at a time

Single-variable stress testing, where you move one assumption and hold the rest, is a useful start, but it understates the real risk, because in a bad year assumptions tend to fail together. The launch slips, which means the adoption curve flattens, which means the optimization savings you penciled in never get prioritized because the team is busy elsewhere, and suddenly three assumptions have broken at once and the commit you sized for one of them failing is exposed to all three. Correlated failure is the scenario that actually hurts, and it never shows up if you only test variables in isolation. Build at least one combined stress scenario where the plausible bad outcomes happen at the same time, because they often do. If the commit still survives that combined case without catastrophic forfeiture, you have a genuinely robust number. If it does not, you have found the scenario the contract structure has to protect against, and you can size the ramp and the reforecast right specifically to cover it.
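A combined scenario is just several overrides applied at once. The sketch below takes two of the failures described above, a flattened adoption curve and a launch that slips, and compares the shortfall from each alone with the shortfall when both happen together. The spend model, the parameter names, and every number are hypothetical.

```python
# Hypothetical base case: the launch carries 40% of planned annual volume.
BASE = {
    "gross_spend": 1_000_000.0,   # planned annual spend if everything goes to plan
    "adoption_factor": 1.0,       # realized adoption relative to plan
    "launch_share": 0.4,          # share of planned volume that depends on the launch
    "launch_volume_lost": 0.0,    # fraction of launch volume lost to a slip
}


def annual_spend(p):
    """Annual spend after scaling for adoption and for volume lost to a late launch."""
    return (p["gross_spend"] * p["adoption_factor"]
            * (1.0 - p["launch_share"] * p["launch_volume_lost"]))


def shortfall(commit, overrides):
    """Unused commitment when the given overrides replace the base assumptions."""
    return max(0.0, commit - annual_spend({**BASE, **overrides}))


FAILURES = {
    "adoption_flattens": {"adoption_factor": 0.7},
    "launch_slips_a_quarter": {"launch_volume_lost": 0.5},
}

commit = annual_spend(BASE)

# Each failure on its own.
single = {name: shortfall(commit, o) for name, o in FAILURES.items()}

# All failures at the same time.
combined_overrides = {}
for o in FAILURES.values():
    combined_overrides.update(o)
combined = shortfall(commit, combined_overrides)

for name, value in single.items():
    print(f"{name}: shortfall ${value:,.0f}")
print(f"combined: shortfall ${combined:,.0f}")
```

With these inputs the combined shortfall is larger than either single case, and it is the figure to size the ramp and the reforecast right against.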
