Independent buyer side advisory · Anthropic onlyNew York · London
Token Optimization

Model mix optimization: routing across Opus, Sonnet, Haiku.

The model you run a task on is the single biggest driver of Claude cost. Match each job to the cheapest model that handles it well and aggregate spend falls 40 to 70 percent.

Buyer side analysis · 9 min read
34%
Average reduction in Claude spend
$40M+
Anthropic commitments advised
100%
Anthropic focus, no other vendor

If you do only one thing to control your Claude spend, make it this: stop running every task on Opus. The choice of model is the largest single lever on cost, larger than any discount you will negotiate, because the models differ in price by a wide margin and most workloads contain a mix of tasks that need different levels of capability. Routing each task to the cheapest model that handles it well, rather than sending everything to the most capable one, typically cuts aggregate spend by forty to seventy percent. This is how to think about the model mix as a buyer, and how to capture that saving without hurting quality.

Three models, three price points, three jobs

Claude comes in a family rather than a single model, and the family is designed so that you can match capability to need. Opus is the most capable and the most expensive, built for the hardest reasoning, the most complex synthesis, and the work where quality is non negotiable. Sonnet sits in the middle, strong enough for the large majority of production tasks at a meaningfully lower cost. Haiku is the fastest and most economical, ideal for high volume, well defined tasks where the work is simpler and the cost per call matters most. The mistake is treating this as a quality ladder where you always want the top rung. It is not a ladder. It is a set of tools, and the goal is to use the right one for each job.

Why uniform Opus use is so expensive

Teams default to Opus for a understandable reason: it is the safest choice for quality, and during development nobody wants to debug a quality problem caused by an underpowered model. So the prototype runs on Opus, it works, and it ships on Opus. The problem is that production workloads are rarely uniform in difficulty. A customer support assistant might use real reasoning for a complex case and trivial pattern matching for a routine one, yet if everything runs on Opus, the trivial cases cost the same premium as the hard ones. Across millions of calls, that premium compounds into a bill several times larger than necessary. Uniform Opus use means paying the top rate for work that a cheaper model would have done just as well.

How to classify your workload

Routing starts with honest classification. Break your workload into task types and ask, for each one, what level of capability it actually requires. Some tasks genuinely need Opus: complex multi step reasoning, nuanced synthesis across long context, work where an error is costly. Many tasks run perfectly well on Sonnet: standard generation, summarization, classification with some nuance, most conversational work. And a meaningful slice belongs on Haiku: simple extraction, routing, short classifications, high volume formatting, anything well defined and repetitive. The classification does not have to be perfect to pay off. Even a rough split that moves the easy tasks off Opus captures most of the saving.

The routing logic itself

Once tasks are classified, the routing can be as simple or as sophisticated as you need. The simplest version is static: this endpoint always uses Haiku, that one always uses Sonnet, the hard one uses Opus. The next level is dynamic: a lightweight classifier looks at the incoming request and routes it to the appropriate model based on its difficulty. A common and effective pattern is a fallback chain, where a cheaper model handles the request first and escalates to a more capable one only when its own confidence is low or the task proves harder than expected. The right level of sophistication depends on your volume and your variance, but even static routing beats sending everything to Opus.

The levers in a good model mix

  • Default to the cheapest model that meets the quality bar for each task, not to the most capable model overall.
  • Reserve Opus for work that genuinely needs its reasoning, where a cheaper model measurably underperforms.
  • Put the high volume, well defined tasks on Haiku, where the cost per call matters most.
  • Use Sonnet as the workhorse for the broad middle of production tasks.
  • Consider a fallback chain so a cheap model handles most requests and escalates only the hard ones.

Protecting quality while you route

The fear that stops teams from routing is quality regression, and it is a fair concern. The way to manage it is to measure rather than guess. Before you move a task to a cheaper model, run a sample of real traffic through both models and compare the outputs against your quality bar. If the cheaper model holds up, move the task with confidence. If it does not, keep that task where it is and look elsewhere. This turns model selection from an article of faith into an evidence based decision, and it usually reveals that far more of the workload can run on cheaper models than the team assumed. Quality and cost are not in as much tension as they feel during development.

How routing interacts with your commit

Model routing does not just lower your monthly bill. It lowers the commitment you need to sign with Anthropic, which compounds the benefit. When you route well, your true consumption falls, so you can size your committed spend to a smaller, more accurate number and avoid the overcommitment that quietly costs buyers money. This is why we always advise optimizing the model mix before sizing a commit. A buyer who negotiates a commit against an unoptimized, Opus heavy bill commits to a level they will never reach once they route, and the gap becomes dead capacity. Optimize first, then commit to the optimized floor.

Stacking routing with caching and batch

Model routing is the biggest lever, but it is not the only one, and the levers stack. Prompt caching can take up to ninety percent off the cost of stable, repeated context, which is especially powerful when combined with the right model. Batch processing takes fifty percent off asynchronous jobs that do not need an immediate response. A workload that routes intelligently across models, caches its stable context, and sends its async work to batch can reach the upper end of that forty to seventy percent saving range. The three techniques address different parts of the cost, so applying them together compounds rather than competes.

Measuring accuracy per dollar, not just accuracy

The reason routing feels risky is that teams measure model choice on accuracy alone, where the most capable model almost always wins. That is the wrong metric. The right one is accuracy per dollar, which asks how much quality you are buying for each unit of spend. On that measure, a cheaper model that handles a task at the quality bar you actually require is the better choice, because the extra accuracy of the premium model is accuracy you are paying for and do not need. Reframing the decision around accuracy per dollar changes the conversation inside an engineering team. Instead of asking which model is best, which always points at Opus, it asks which model clears the bar at the lowest cost, which is the question that actually controls spend. The premium model is worth its price only where the cheaper one measurably fails, and that set of tasks is usually smaller than teams assume.

When a bigger model is actually cheaper

Routing is not always about moving down. There are cases where a more capable model is the cheaper choice overall, and a good routing strategy accounts for them. A more capable model that solves a task in a single pass can cost less than a cheaper model that needs several attempts, retries, and corrections to reach the same result, because the retries multiply the token count. Likewise, a capable model that produces a usable answer the first time avoids the expensive human intervention that a wrong answer triggers downstream. The goal is the lowest total cost for an acceptable result, not the lowest sticker price per call. Sometimes that means Haiku, sometimes Sonnet, and sometimes Opus precisely because its single pass reliability is cheaper than the alternative's repeated failures. Measuring the full cost of each path, including retries and downstream correction, is what reveals these cases.

A worked routing example

Consider a document processing pipeline that handles a mix of tasks: extracting a handful of fields from a form, summarizing a long report, and answering complex analytical questions about the content. Run uniformly on Opus, every task pays the premium rate. Routed properly, the field extraction moves to Haiku, where it runs accurately at a fraction of the cost and at higher speed. The summarization moves to Sonnet, which handles it comfortably. Only the complex analytical questions stay on Opus, where the reasoning genuinely earns the premium. The same workload, the same quality, a dramatically lower bill, because each task runs on the model that fits it. This is the shape of nearly every real routing win: a small slice that needs the top model, a broad middle that Sonnet handles, and a high volume tail that belongs on Haiku.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.