The Routing Logic That Cuts Claude Spend

The most expensive habit in production Claude applications is also the most common one: pointing every request at a single model and leaving it there. Teams pick a model during the prototype, usually the strongest available because it makes the demo work, then ship it and never revisit the choice. The result is a stream of trivial requests (classifications, extractions, short rewrites, routine lookups) all running on a model priced for the hardest reasoning task in the system. The bill reflects the ceiling of difficulty applied uniformly to a workload that is mostly floor. Routing fixes this. It is the practice of sending each request to the cheapest model in the Claude family that meets your quality bar for that request, and it is the single biggest lever most teams have on their spend. Applied to a real workload, routing across Opus, Sonnet, and Haiku typically cuts aggregate spend 40 to 70 percent compared with uniform Opus use.

Why one model for everything is the default

It helps to understand why almost everyone starts in the wrong place. During development, the goal is to get the thing working, and the fastest path to a working prototype is the most capable model, because it forgives weak prompts and handles edge cases without much tuning. That choice is reasonable for a prototype. The problem is that it silently becomes the production architecture. Nobody decides to run cheap requests on an expensive model; it just happens because the model name was hard-coded early and there was never a forcing function to change it. Meanwhile the workload grows and diversifies, so a system that began as one clever feature becomes dozens of request types with wildly different needs, all still pointed at the original model. The overspend is not a single bad decision; it is the absence of any decision after the first one.

The three-tier shape of the Claude family

Routing works because the Claude family is deliberately tiered. Opus is the most capable and the most expensive, built for genuinely hard reasoning, long chains of logic, and tasks where a wrong answer is costly. Sonnet sits in the middle and is the workhorse for the large majority of real workloads, strong enough for most production tasks at a fraction of the Opus rate. Haiku is the fast, inexpensive tier, more than adequate for classification, extraction, routing decisions, short transformations, and the high-volume simple work that often makes up the bulk of request counts. The price gaps between these tiers are large, which is exactly what makes routing pay. Moving a high-volume task from the top tier to the bottom tier is not a marginal saving; it is a step change, and most workloads have a great deal of work sitting one or two tiers higher than it needs to be.
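To make the step change concrete, here is a minimal arithmetic sketch. The function names are hypothetical, and the per-request costs in the usage note are made-up relative units, not actual Claude prices; substitute the rates from your own invoices.

```python
def blended_cost(mix, cost_per_request):
    """Total cost of a routed mix.

    mix: {tier: request_count}
    cost_per_request: {tier: cost}, supplied by the caller (assumed units).
    """
    return sum(cost_per_request[tier] * count for tier, count in mix.items())


def savings_fraction(mix, cost_per_request, baseline_tier="opus"):
    """Fraction saved versus sending every request to baseline_tier."""
    total_requests = sum(mix.values())
    if total_requests == 0:
        raise ValueError("mix must contain at least one request")
    baseline = cost_per_request[baseline_tier] * total_requests
    return 1 - blended_cost(mix, cost_per_request) / baseline


if __name__ == "__main__":
    # Placeholder relative costs for illustration only, NOT real pricing.
    placeholder_costs = {"opus": 10.0, "sonnet": 3.0, "haiku": 1.0}
    routed_mix = {"opus": 100, "sonnet": 300, "haiku": 600}
    print(f"saved: {savings_fraction(routed_mix, placeholder_costs):.0%}")
```

The point of the sketch is the shape of the calculation: the saving depends on how much volume moves down and how wide the price gap is, so both numbers need to come from your own data.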

How the routing logic actually works

A routing layer is simpler than it sounds. At its core it is a function that looks at an incoming request and decides which model should handle it before the call is made. The decision can be based on a few different signals depending on the workload:

Task type. If you know a request is a classification or an extraction, you can route it to Haiku by rule, with no analysis needed, because the task class itself tells you the cheap model is enough.
Complexity signals. Input length, the presence of certain keywords, or a quick, cheap pre-classification step can sort requests into easy and hard buckets, sending the easy ones down a tier.
Confidence and escalation. Start a request on the cheaper model, and only escalate to a stronger one when the cheap model signals uncertainty or fails a validation check. This is the fallback pattern, and it captures the savings on the majority of requests that the cheap model handles fine while preserving quality on the minority that need more.

Most mature systems use a combination: hard rules for the request types that are obviously cheap, and an escalation path for the ambiguous middle. The engineering cost of this layer is modest, usually a small amount of work in front of the existing model calls, and it pays back quickly because it attacks the largest line on the bill.
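The combination described above, hard rules plus an escalation path, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the tier labels, the CHEAP_TASKS set, the length threshold, and the call_model and is_acceptable callables are all hypothetical stand-ins for your own client code and validation checks.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical tier labels, ordered cheapest to most capable.
TIERS = ["haiku", "sonnet", "opus"]

# Assumed set of task types you have already evaluated as safe on the cheap tier.
CHEAP_TASKS = {"classification", "extraction", "short_rewrite"}


@dataclass
class Request:
    task_type: str
    prompt: str


def pick_starting_tier(req: Request, long_input_chars: int = 8000) -> str:
    """Rule-based first choice: task type first, then a crude length signal."""
    if req.task_type in CHEAP_TASKS:
        return "haiku"
    if len(req.prompt) > long_input_chars:  # arbitrary illustrative threshold
        return "opus"
    return "sonnet"


def route(
    req: Request,
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[Request, str], bool],
) -> Tuple[str, str]:
    """Start on the chosen tier and escalate one tier at a time on failure.

    Returns (tier_used, answer). If even the top tier fails validation,
    its answer is returned anyway so the caller can decide what to do.
    """
    start = TIERS.index(pick_starting_tier(req))
    answer = ""
    for tier in TIERS[start:]:
        answer = call_model(tier, req.prompt)
        if is_acceptable(req, answer):
            return tier, answer
    return TIERS[-1], answer
```

In this sketch call_model(tier, prompt) would wrap whatever API client you already use, and is_acceptable would hold the validation check, for example a schema check on extracted fields. Note that escalation pays for the failed cheap call as well as the stronger one, so it saves money only when the cheap tier succeeds most of the time.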

Start by measuring the mix

You cannot route what you have not measured, so the first step is always to look at the actual distribution of your requests. Pull a representative window of traffic and sort it by request type, by frequency, and by how much each type contributes to the bill. Almost every team that does this for the first time finds the same shape: a small number of genuinely hard request types that justify the top model, and a long tail of high-volume, simple requests that are running on it for no reason other than inertia. The measurement tells you where the easy wins are. It also keeps you honest, because routing is only worth doing where the volume is real. Optimizing a request type that runs ten times a day is a waste of effort; optimizing the type that runs ten million times a month is where the money is.
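A first pass at that measurement can be a simple aggregation over request logs. The sketch below assumes each log record is a dict with hypothetical 'task_type' and 'cost' keys; your logging schema will differ, and the cost would normally be derived from token counts and your actual rates.

```python
from collections import defaultdict


def summarize_mix(records):
    """Aggregate request logs by task type.

    records: iterable of dicts with assumed keys 'task_type' and 'cost'.
    Returns one row per task type, sorted by total cost, largest first.
    """
    counts = defaultdict(int)
    costs = defaultdict(float)
    for rec in records:
        counts[rec["task_type"]] += 1
        costs[rec["task_type"]] += rec["cost"]

    total_cost = sum(costs.values()) or 1.0  # avoid dividing by zero on empty logs
    rows = [
        {
            "task_type": task,
            "requests": counts[task],
            "cost": costs[task],
            "cost_share": costs[task] / total_cost,
        }
        for task in counts
    ]
    return sorted(rows, key=lambda row: row["cost"], reverse=True)
```

Sorting by cost rather than by request count keeps attention on the rows that move the bill; a type with high volume but negligible cost share is a lower priority than the count alone suggests.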

Protect quality while you route

The objection to routing is always quality, and it is a fair one, because a cheaper model that gets the answer wrong is not a saving; it is a cost moved somewhere harder to see. The discipline that makes routing safe is evaluation. Before you move a request type down a tier, you test the cheaper model against a representative sample with a clear quality bar, and you only make the move if it holds. After you move it, you keep watching, because workloads drift and a model that was good enough last quarter may need re-checking. The escalation pattern is the safety net here: by starting cheap and escalating only on signals of failure, you get the savings on the easy majority while the hard cases that trip those signals still reach the capable model. Routing done with evaluation is not a quality gamble; it is a disciplined matching of task to tier.
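The pre-move check described above can be expressed as a small gate. This is a sketch under assumptions: candidate is a hypothetical callable that wraps the cheaper model, grade is whatever comparison defines your quality bar, and the 0.95 default pass rate is an arbitrary placeholder, not a recommendation.

```python
def passes_quality_bar(candidate, samples, grade, min_pass_rate=0.95):
    """Check a cheaper tier against a labeled sample before moving traffic.

    candidate: callable prompt -> answer (assumed wrapper around the cheaper model).
    samples: list of (prompt, reference) pairs drawn from real traffic.
    grade: callable (answer, reference) -> bool, your definition of "good enough".
    Returns (ok, pass_rate).
    """
    if not samples:
        raise ValueError("need at least one sample to evaluate")
    passed = sum(1 for prompt, reference in samples if grade(candidate(prompt), reference))
    pass_rate = passed / len(samples)
    return pass_rate >= min_pass_rate, pass_rate
```

Running the same gate on a fresh sample at regular intervals is one way to do the ongoing re-checking, since it reuses the quality bar that justified the move in the first place.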

Why this matters before you commit

Routing is not only an engineering win; it is a commercial one, and the order matters. A buyer who routes before sizing an API commitment is committing to their real, optimized consumption rather than to the inflated number a single-model architecture produces. That means a smaller commit, less exposure to a shortfall, and a stronger position at the table, because you arrive as a disciplined buyer who knows exactly what the spend should be. A buyer who signs a large commit first and routes afterward has locked in spend they no longer need. The sequence that protects your money is to optimize the architecture, prove the lower run rate, and then negotiate the commitment around it.

Where this fits

Routing is the first and largest move in token optimization, and it sits alongside caching, batching, and output control as the levers that compound into real reduction. For the full method, including the measurement approach, the evaluation discipline, and how the levers stack, read the pillar guide, the token optimization playbook. Download it for the routing logic in detail and the worksheet to map your own request mix.
