When a Claude bill is too high, the instinct is to look for waste in the prompts, the retries, or the context length. Those matter, but they are second-order. The first-order driver, the one that moves more money than any other single decision, is which model runs each request. Across the enterprises we work with, getting model routing right cuts aggregate spend by 40 to 70 percent compared with running everything on one premium tier. That is not a rounding error to tidy up later. It is the largest lever you have, and it sits upstream of almost everything else. This guide explains why the model choice carries that much weight, where the range of 40 to 70 percent comes from, and how to capture it without giving up output quality.
Every request your application sends to Claude is priced according to the model that handles it and the number of tokens it reads and writes. The token count is shaped by your prompt and context design, but the per-token price is set entirely by the model. Because the price difference between the Claude tiers is large, the model choice acts as a multiplier on the entire token bill. Choose the premium tier for a request and you pay the premium rate on every token in it. Choose a lower tier for the same request and you pay a fraction of that, for output that, on most tasks, the end user cannot tell apart.
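To make the multiplier concrete, here is a minimal sketch of the pricing relationship described above. The `Rate` class, the `request_cost` function, and both sets of rates are hypothetical placeholders chosen for illustration; they are not published prices, so substitute the figures from your own contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """Price per million tokens in arbitrary currency units (hypothetical values)."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens read and tokens written, each times the model's rate."""
    return (
        input_tokens * rate.input_per_mtok
        + output_tokens * rate.output_per_mtok
    ) / 1_000_000


# Placeholder rates for illustration only, not real prices.
premium = Rate(input_per_mtok=10.0, output_per_mtok=50.0)
lower = Rate(input_per_mtok=4.0, output_per_mtok=20.0)

# Same request, same token counts; only the model's rate differs.
premium_cost = request_cost(premium, input_tokens=2_000, output_tokens=500)
lower_cost = request_cost(lower, input_tokens=2_000, output_tokens=500)
ratio = premium_cost / lower_cost
print(f"premium / lower cost ratio: {ratio:.2f}")
```

The token counts are identical in both calls, so the ratio between the two costs is decided entirely by the rates, which is the sense in which the model acts as a multiplier on the bill.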
This is why model choice dominates. Other optimizations reduce the number of tokens or the frequency of calls, which is real saving, but the model multiplies whatever volume remains. A workload running uniformly on Opus is paying the highest available rate on every token regardless of whether the task needed that capability, and most tasks do not. The gap between that and a properly routed setup is the 40 to 70 percent figure, and it exists because the premium was being applied to work that never required it.
The reason the saving is a range rather than a single number is that it depends on the shape of your traffic. Two factors decide where a given workload lands.
The first factor is your starting point. A company that runs everything on Opus has the most to gain, because almost every request is overpaying. As you move work down to Sonnet and Haiku where it belongs, the saving is dramatic, often near the top of the range. A company that is already partly tiered has less headroom, because some of the saving is already captured, and lands lower in the range. The worse your starting routing, the larger the percentage you recover.
The other factor is how much of your traffic genuinely needs the premium tier. A workload that is mostly summarization, classification, extraction, and retrieval can move almost entirely to Sonnet and Haiku, capturing the high end of the range. A workload that is heavy on complex reasoning has a larger share that legitimately stays on Opus, so the saving is real but smaller. The point is not to push everything down. It is to match each task to the cheapest model that does it well, and the mix of tasks determines how far down the average can move.
Capturing the saving starts with a clear view of what each model is actually for, so routing decisions are deliberate rather than habitual.
Opus is the most capable and the most expensive, and it earns its price on genuinely hard work: complex multi-step reasoning, intricate code generation, and high-value tasks where a wrong answer is costly and the difficulty is real. The mistake is not using Opus. It is using it by default for work that did not need it. Reserve it for the tasks that actually pull ahead on the premium tier, and it becomes a precise tool rather than a blanket cost.
Sonnet is the model that should carry most production traffic. For the large majority of enterprise tasks, summarizing, drafting, answering against context, classifying, and routine code work, Sonnet produces output indistinguishable from Opus to the end user, at a meaningful discount and higher speed. Making Sonnet the default and Opus the exception is the single biggest move in the routing exercise.
Haiku is the cheapest tier and handles the simplest, highest-volume work well: short classification, basic extraction, quick routing decisions, and lightweight transformations. For tasks that do not even need Sonnet, pushing them to Haiku saves again on top, and because these tasks often run at the highest volumes, the saving compounds.
The fear that stops teams from routing aggressively is that cheaper models will degrade output, and the answer to that fear is measurement rather than assumption. The disciplined approach has a clear sequence.
Start by understanding where the money actually goes. Group your requests by task type and measure how much spend each group represents. Almost always, a small number of task types account for most of the bill, and those are where routing changes pay off most. You cannot route what you have not measured.
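The grouping step can be sketched in a few lines. This assumes each logged request already carries a task-type label and a computed cost; the record fields `task_type` and `cost`, the function name, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def spend_by_task_type(records):
    """Sum spend per task type; return (task_type, spend, share) rows, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record["task_type"]] += record["cost"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (task, cost, cost / grand_total if grand_total else 0.0)
        for task, cost in ranked
    ]


# Made-up log records; in practice these would come from your own request logs.
sample_log = [
    {"task_type": "summarization", "cost": 0.040},
    {"task_type": "classification", "cost": 0.005},
    {"task_type": "summarization", "cost": 0.035},
    {"task_type": "code_review", "cost": 0.020},
]

report = spend_by_task_type(sample_log)
for task, cost, share in report:
    print(f"{task:<16} {cost:.3f} {share:.0%}")
```

Sorting by spend puts the task types that dominate the bill at the top of the report, which is where routing changes are worth testing first.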
For each high-volume task type, run a representative sample on the cheaper tier alongside the current one and have the people who own the output judge whether the difference is visible and whether it matters for that use case. This is where teams consistently discover that work they assumed needed Opus runs perfectly well on Sonnet, and work they ran on Sonnet runs fine on Haiku. The testing converts fear into evidence.
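One way to turn the reviewers' side-by-side judgements into a routing decision is a simple tally, sketched below. The three labels, the function name, and the 2 percent tolerance are assumptions for illustration; the right threshold depends on how costly a worse answer is for the task in question.

```python
def cheaper_tier_passes(judgements, max_material_rate=0.02):
    """Decide whether a cheaper tier is acceptable for one task type.

    judgements: one label per sampled request, assigned by the output owner:
      "same"           - no visible difference from the current tier
      "visible_but_ok" - a difference exists but does not matter for this use
      "material"       - the cheaper output is worse in a way that matters
    max_material_rate: assumed tolerance for "material" cases (hypothetical default).
    """
    if not judgements:
        raise ValueError("need at least one judged sample")
    material = sum(1 for label in judgements if label == "material")
    return material / len(judgements) <= max_material_rate


# Illustrative results for two task types, 100 judged samples each.
summaries = ["same"] * 92 + ["visible_but_ok"] * 7 + ["material"] * 1
contract_review = ["same"] * 70 + ["visible_but_ok"] * 20 + ["material"] * 10

print("summaries pass:", cheaper_tier_passes(summaries))
print("contract review passes:", cheaper_tier_passes(contract_review))
```

Keeping "visible but acceptable" separate from "material" reflects the two questions the reviewers are asked: whether the difference is visible, and whether it matters.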
Once you know which model each task type needs, set the routing so every request runs on the cheapest model that does it well. Then revisit it, because both your traffic and the models change over time. A new model version may move the line on what a cheaper tier can handle, and a shift in your task mix may change where the spend concentrates. Routing is not a one-time setup. It is a discipline that keeps the saving from eroding.
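The resulting routing can be as simple as a lookup table, sketched here under stated assumptions. The task-type names and the assignments are illustrative, and the tier labels are plain strings rather than real API model identifiers; you would map them to the model versions you actually deploy.

```python
# Task type -> cheapest tier that passed testing for it (illustrative assignments).
ROUTES = {
    "classification": "haiku",
    "extraction": "haiku",
    "summarization": "sonnet",
    "drafting": "sonnet",
    "complex_reasoning": "opus",
}

# Sonnet as the default and Opus as the exception, as argued above.
DEFAULT_TIER = "sonnet"


def route(task_type: str) -> str:
    """Return the tier for a task type, falling back to the default for unknown types."""
    return ROUTES.get(task_type, DEFAULT_TIER)


print(route("classification"))
print(route("complex_reasoning"))
print(route("something_new"))
```

Keeping the table in one place makes the periodic review concrete: when a new model version arrives or the task mix shifts, rerun the tests and update the entries rather than hunting for model names scattered through the code.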
It helps to see the arithmetic, because the percentages can feel abstract until they touch a real number. Picture a workload sending ten million requests a month, each carrying a similar token load, all running on Opus out of habit. Suppose half of those requests are routine summarization and classification that Sonnet handles with no visible difference, and a further quarter are simple enough for Haiku. Only the remaining quarter genuinely benefits from the premium tier.
Before routing, every one of the ten million requests pays the Opus rate. After routing, only a quarter still does, half drop to the much lower Sonnet rate, and a quarter drop to the lowest Haiku rate. Because Sonnet and Haiku cost a fraction of Opus, the blended cost per request falls sharply even though the volume is unchanged and the output quality, by test, is the same. The aggregate bill lands well inside the 40 to 70 percent reduction band, and it does so purely from moving each request to the cheapest model that does its job. No prompts were shortened, no requests were eliminated, and no caching was applied. The saving came entirely from the model decision.
The example also shows why the starting mix matters so much. A workload where 80 percent of requests genuinely need Opus has far less to gain, because most of the traffic legitimately stays on the premium tier. A workload where only 20 percent need it has enormous headroom. Running the same arithmetic on your own traffic mix is how you turn the headline range into a specific number for your business, and that specific number is usually large enough to make routing the first thing on the optimization list.
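The arithmetic above can be run as a short sketch. The relative costs below are assumptions, with the premium tier normalised to 1.0; they are not published prices, and the Sonnet and Haiku shares in the 80 percent and 20 percent scenarios are also made up, since the text only fixes the Opus share for those cases.

```python
# Relative per-request cost by tier, premium tier normalised to 1.0.
# Placeholder ratios for illustration, not real prices.
RELATIVE_COST = {"opus": 1.0, "sonnet": 0.4, "haiku": 0.1}


def saving_vs_all_premium(mix):
    """Fractional saving of a routed traffic mix versus running everything on Opus.

    mix: mapping of tier -> share of requests; shares must sum to 1.
    Assumes every request carries a similar token load, as in the worked example.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    blended = sum(share * RELATIVE_COST[tier] for tier, share in mix.items())
    return 1.0 - blended / RELATIVE_COST["opus"]


scenarios = {
    # The worked example: a quarter on Opus, half on Sonnet, a quarter on Haiku.
    "worked example": {"opus": 0.25, "sonnet": 0.50, "haiku": 0.25},
    # 80 percent genuinely needs Opus (the remaining split is assumed).
    "mostly hard": {"opus": 0.80, "sonnet": 0.15, "haiku": 0.05},
    # Only 20 percent needs Opus (the remaining split is assumed).
    "mostly routine": {"opus": 0.20, "sonnet": 0.50, "haiku": 0.30},
}

for name, mix in scenarios.items():
    print(f"{name}: {saving_vs_all_premium(mix):.1%} saving")
```

Request volume cancels out of the calculation, so the same function applies whether the workload sends ten million requests a month or ten thousand. Replacing the placeholder ratios and shares with your own measured figures gives the specific number for your traffic.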
The reason so many workloads run uniformly on the premium tier is not analysis. It is fear, usually expressed as a worry that a cheaper model will quietly degrade quality in ways no one notices until a customer complains. That fear is reasonable, and it deserves a real answer rather than dismissal, because the answer is what gives a team the confidence to route aggressively.
The answer is that you do not have to take the saving on faith. You test, on your own traffic, with your own people judging the output, before you change a single production route. When a team runs that test, the result is almost always the same: the difference they feared is invisible on most tasks, and the few tasks where it is real stand out so clearly that the routing decision makes itself. The fear was of an unknown, and the test converts the unknown into a known. After that, routing is no longer a gamble. It is a measured decision with evidence behind it, and the team can defend it to anyone who asks.
It also helps to frame the cost of the fear. Running everything on Opus to avoid a hypothetical quality drop is not free caution. It is a choice to pay two to three times more than necessary on most of your traffic, every day, indefinitely, to insure against a risk you have not measured. Put that way, the prudent move is not to stay on Opus. It is to test, because the testing is cheap and the overpayment it prevents is enormous.
One reason model choice keeps its grip on the bill is that teams treat it as a one-time configuration rather than an ongoing system. They pick a model when they build a feature and never revisit it, so the routing reflects the assumptions of the day it was written rather than the reality of the traffic today. That is how workloads drift onto the wrong tier and stay there: not through a bad decision, but through the absence of a process to revisit the decision.