Building a Model Router for Claude | Antrophic Negotiations

The most common way enterprises overspend on Claude is also the simplest to fix: they send every request to the most capable model. Opus is excellent, and for hard reasoning it earns its price, but most of what an application asks a model to do is not hard reasoning. It is classification, extraction, formatting, routing, summarization, and short answers, and a large share of that work runs perfectly well on Sonnet or Haiku at a fraction of the cost. A model router is the piece of infrastructure that puts each request on the right model automatically, so you stop paying Opus prices for Haiku work. This is how to build one.

What a router is

A router is a thin layer that sits in front of the model and decides, per request, which model should serve it. Instead of your application calling a single model directly, it calls the router, and the router applies a set of rules or a small classifier to pick the cheapest model that will produce an acceptable answer for that request. The output is the same shape regardless of which model handled it, so the rest of your application does not change. The router is small, but it is where the savings live, because it is the only place in the system that can make the cost decision request by request.
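A minimal sketch of that shape in Python. The tier labels haiku and opus are placeholders, the backends are plain callables standing in for whatever model client your application uses, and the names Router, Response, and pick are illustrative rather than part of any SDK.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A backend is any function that takes a prompt and returns text.
# In a real system it would wrap your model client; here it is a stand-in.
Backend = Callable[[str], str]


@dataclass
class Response:
    text: str
    tier: str  # recorded for cost reporting; callers can ignore it


class Router:
    def __init__(self, pick: Callable[[str], str], backends: Dict[str, Backend]):
        self.pick = pick          # decides the tier for a prompt
        self.backends = backends  # tier label -> backend callable

    def complete(self, prompt: str) -> Response:
        tier = self.pick(prompt)
        return Response(text=self.backends[tier](prompt), tier=tier)


# Usage with stub backends and a deliberately crude length-based rule.
backends = {
    "haiku": lambda p: "haiku answer",
    "opus": lambda p: "opus answer",
}
router = Router(pick=lambda p: "haiku" if len(p) < 200 else "opus", backends=backends)
reply = router.complete("Tag this ticket: printer is offline")
```

The caller only ever sees a Response, so swapping the pick function for a smarter one later does not touch application code.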

Why the savings are so large

The price difference across the model tiers is wide. The cheapest model can cost a small fraction of the most capable one per token, so every request you move down a tier without losing quality is a direct, permanent reduction in cost. Because the distribution of real workloads is heavily weighted toward easy requests, moving the easy majority down delivers an outsized effect on the total bill. This is why routing typically cuts aggregate spend 40 to 70 percent versus running everything on Opus. You are not shaving a few percent. You are repricing the largest, easiest part of your workload to its true cost.
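The arithmetic behind that claim can be sketched as a weighted average. Both tables below are made-up illustrations, not published prices or measured traffic: substitute the ratios from your own rate card and the mix from your own logs.

```python
# Hypothetical relative per-token prices, with the top tier normalized to 1.0.
# Replace these with the ratios from your own rate card.
RELATIVE_PRICE = {"haiku": 0.2, "sonnet": 0.6, "opus": 1.0}

# Hypothetical share of token volume that each tier can serve acceptably.
TRAFFIC_MIX = {"haiku": 0.6, "sonnet": 0.3, "opus": 0.1}


def blended_cost(mix, price):
    """Cost of the routed workload relative to sending everything to the top tier."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price[tier] for tier, share in mix.items())


routed = blended_cost(TRAFFIC_MIX, RELATIVE_PRICE)
savings = 1.0 - routed
print(f"routed cost: {routed:.2f} of baseline, savings: {savings:.0%}")
```

Under these assumed numbers the routed workload costs about 0.40 of the all-Opus baseline, a saving of about 60 percent. A wider price gap or an easier mix pushes the figure higher, and a harder mix pulls it lower.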

Step one: classify your requests

Before you can route, you have to know what you are routing. The first step is to look at your actual traffic and sort it by difficulty and importance. A practical way to begin is to group requests into tiers.

Trivial work, such as classification, tagging, extraction, and short formatted answers, where the cheapest model is almost always sufficient.
Moderate work, such as routine summarization, drafting, and standard question answering, where the mid-tier model handles the load well.
Hard work, such as complex reasoning, long multi-step analysis, and high-stakes generation, where the top model earns its price.

Most teams are surprised by how much of their volume falls into the first two tiers. The exercise of classifying traffic is itself valuable, because it tells you where the money is going and how much of it is going to the wrong model.
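One way to run that audit is to hand-label a sample of logged requests and compare each tier's share of volume with its share of spend. The sketch below assumes a hypothetical log format with a "tier" label and a billed "cost" per row; the field names and sample values are invented for illustration.

```python
from collections import defaultdict


def share_by_tier(log):
    """Return each difficulty tier's share of request volume and of spend."""
    if not log:
        raise ValueError("log is empty")
    count = defaultdict(int)
    cost = defaultdict(float)
    for row in log:
        count[row["tier"]] += 1
        cost[row["tier"]] += row["cost"]
    total_cost = sum(cost.values())
    return {
        tier: {
            "volume": count[tier] / len(log),
            "spend": cost[tier] / total_cost if total_cost else 0.0,
        }
        for tier in count
    }


# Hypothetical sample: hand-labeled tier plus the cost actually billed.
sample = [
    {"feature": "tagging", "tier": "trivial", "cost": 0.03},
    {"feature": "tagging", "tier": "trivial", "cost": 0.03},
    {"feature": "summary", "tier": "moderate", "cost": 0.06},
    {"feature": "analysis", "tier": "hard", "cost": 0.08},
]
report = share_by_tier(sample)
```

In this toy sample, trivial work is half the volume and about 30 percent of the spend, which is the kind of figure that tells you how much money is going to the wrong model.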

Step two: choose a routing method

There are two broad ways to route, and most mature systems use a blend of both.

Rule-based routing

The simplest router uses explicit rules tied to the request type or the calling feature. A classification endpoint always routes to the cheapest model, a complex analysis endpoint always routes to the top model, and so on. Rule-based routing is transparent, predictable, and easy to reason about, and for many applications where the request type is known at the call site, it captures the bulk of the savings on its own.
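A rule table of this kind can be as small as a dictionary lookup. The feature names, tier labels, and the choice of default below are all hypothetical; the only real design decision is what to do with a feature that has no rule yet.

```python
# Hypothetical mapping from calling feature to tier label.
RULES = {
    "classify_ticket": "haiku",
    "summarize_thread": "sonnet",
    "contract_analysis": "opus",
}

# Assumption: unknown features go to the middle tier until someone adds a rule.
DEFAULT_TIER = "sonnet"


def route_by_rule(feature: str) -> str:
    """Return the tier for a known feature, or the default for an unknown one."""
    return RULES.get(feature, DEFAULT_TIER)
```

Keeping the table in one place, rather than hard-coding a model at each call site, is what makes the routing easy to audit and change.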

Classifier-based routing

Where the request type is not known in advance, a small, cheap classifier can look at each incoming request and predict which model tier it needs. The classifier itself runs on the cheapest model, so it adds little cost, and it lets you route mixed traffic that does not arrive pre-labeled. The tradeoff is that you have to tune the classifier and accept that it will occasionally misroute, which is why you pair it with a fallback.
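A sketch of such a classifier, assuming the cheap model is reached through a callable named ask_cheap_model that takes a prompt and returns text. That callable, the prompt wording, and the label set are all illustrative. Anything the model returns outside the expected labels is treated as moderate, which is an assumption about where a safe default sits.

```python
from typing import Callable

TIER_FOR = {"trivial": "haiku", "moderate": "sonnet", "hard": "opus"}

PROMPT = (
    "Label the difficulty of the following request with exactly one word: "
    "trivial, moderate, or hard.\n\nRequest:\n{request}"
)


def classify(request: str, ask_cheap_model: Callable[[str], str]) -> str:
    """Ask the cheap model for a difficulty label and map it to a tier."""
    raw = ask_cheap_model(PROMPT.format(request=request)).strip().lower()
    # Unparseable output falls back to the middle tier rather than failing.
    label = raw if raw in TIER_FOR else "moderate"
    return TIER_FOR[label]


# Usage with stubs in place of a real model call.
tier_a = classify("Prove this scheduling problem is NP-hard.", lambda p: " Hard\n")
tier_b = classify("Tag this ticket.", lambda p: "I think it is easy")
```

Here tier_a resolves to opus and tier_b to sonnet, because the second stub did not return one of the three expected labels.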

Step three: build the fallback

A router is only safe to deploy if it can recover from a wrong decision. The fallback is the rule that says: if the cheaper model produces a low-confidence or low-quality answer, escalate the request to the next tier up and try again. With a good fallback, the cost of a misroute is small (an occasional second call), while the cost of being too conservative and sending everything to the top model is enormous. The fallback is what lets you route aggressively toward the cheap models without fearing the edge cases, because the edge cases self-correct.
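The escalation rule can be sketched as a loop over an ordered list of tiers. The tier labels, the backends dictionary, and the is_acceptable check are placeholders: in practice the check might be a schema validation, a confidence score, or a grader, and that choice is yours.

```python
from typing import Callable, Dict, Tuple

LADDER = ["haiku", "sonnet", "opus"]  # cheapest to most capable; labels are placeholders


def complete_with_fallback(
    prompt: str,
    start_tier: str,
    backends: Dict[str, Callable[[str], str]],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str, int]:
    """Try start_tier, escalating one tier at a time until an answer passes.

    Returns (answer, tier_that_answered, number_of_calls). The top tier's
    answer is returned even if it fails the check, since there is nowhere
    left to escalate.
    """
    calls = 0
    for tier in LADDER[LADDER.index(start_tier):]:
        answer = backends[tier](prompt)
        calls += 1
        if is_acceptable(answer):
            break
    return answer, tier, calls


# Usage with stubs: the cheapest backend returns an empty answer and is escalated.
backends = {"haiku": lambda p: "", "sonnet": lambda p: "42", "opus": lambda p: "42"}
result = complete_with_fallback(
    "What is 6 x 7?", "haiku", backends, is_acceptable=lambda a: a.strip() != ""
)
```

The call count in the return value is worth logging, because the share of requests that needed a second call is the direct price of routing aggressively.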

Step four: measure quality, not just cost

The router exists to lower cost without lowering quality, so you have to measure both. Set up an evaluation that scores the router's outputs against a quality bar for each workload, and watch it as you tune the routing thresholds. The goal is to push as much traffic as possible to the cheapest model that still clears the bar, and the only way to know where that line sits is to measure it. A router tuned by cost alone will eventually degrade quality somewhere that matters. A router tuned against an evaluation finds the real frontier between savings and quality and sits on it.
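Tuning a threshold against an evaluation can be sketched as follows. It assumes each evaluation row carries a classifier difficulty score between 0 and 1 and a flag recording whether the cheap model's answer cleared the quality bar. It also assumes, for simplicity, that the expensive model always clears it. The field names and sample rows are hypothetical.

```python
def pick_threshold(rows, candidates, min_quality):
    """Pick the largest threshold whose overall pass rate stays at or above min_quality.

    Requests with difficulty <= threshold go to the cheap model.
    Returns (threshold, share_routed_cheap, pass_rate), or None if no candidate qualifies.
    """
    if not rows:
        raise ValueError("evaluation set is empty")
    best = None
    for t in sorted(candidates):
        cheap = [r for r in rows if r["difficulty"] <= t]
        failures = sum(1 for r in cheap if not r["cheap_ok"])
        pass_rate = 1 - failures / len(rows)
        if pass_rate >= min_quality:
            best = (t, len(cheap) / len(rows), pass_rate)
    return best


# Hypothetical scored evaluation set.
rows = [
    {"difficulty": 0.1, "cheap_ok": True},
    {"difficulty": 0.2, "cheap_ok": True},
    {"difficulty": 0.4, "cheap_ok": True},
    {"difficulty": 0.6, "cheap_ok": False},
    {"difficulty": 0.9, "cheap_ok": False},
]
choice = pick_threshold(rows, candidates=[0.3, 0.5, 0.7], min_quality=0.9)
```

With these rows the function settles on 0.5: it sends 60 percent of traffic to the cheap model with no quality loss, while 0.7 would add one failing request and drop the pass rate to 0.8.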

How routing stacks with the other levers

Routing is the largest single lever, but it compounds with the others. Once a request is routed to the right model, caching a stable shared context cuts the cost of the repeated portion by up to 90 percent, and running asynchronous work on the batch path takes roughly half off the rest. The three levers multiply rather than add, so a request that is routed to Haiku, cached, and batched can cost a tiny fraction of the same request sent naively to Opus in real time. The router is the foundation because it makes the model decision first, and the other levers then apply on top of the right model.
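A worked version of the multiplication, under simplifying assumptions that should be checked against current pricing terms: the relative tier price and cached share are invented, cache writes and output tokens are ignored, and the batch discount is assumed to apply on top of the cached price.

```python
def stacked_cost(tier_price, cached_share, cache_discount=0.9, batch_discount=0.5, batched=True):
    """Relative cost of a request after routing, caching, and batching.

    tier_price    -- price of the chosen tier relative to the top tier (top tier = 1.0)
    cached_share  -- fraction of the request's tokens served from cache
    """
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    # Cached tokens pay the discounted rate; the remainder pays full rate.
    after_cache = cached_share * (1.0 - cache_discount) + (1.0 - cached_share)
    after_batch = (1.0 - batch_discount) if batched else 1.0
    return tier_price * after_cache * after_batch


# Hypothetical request: cheapest tier at 0.2 of top-tier price, 80 percent cacheable.
example = stacked_cost(tier_price=0.2, cached_share=0.8)
```

Under these assumptions the factors are 0.2 for the tier, 0.28 for caching, and 0.5 for batching, giving roughly 0.028 of the naive cost. Adding the three discounts instead of multiplying them would give a meaningless number, which is the sense in which the levers multiply.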

The commercial payoff

A router does more than lower your monthly bill. It lowers the committed spend you need to negotiate, because your aggregate consumption drops once the easy majority of traffic is repriced. A smaller, optimized commit reduces your exposure to unused commitment and strengthens your hand on the rate. This is why we treat routing as both an engineering project and a negotiating one: the optimization you do before you sign directly shrinks the number Anthropic asks you to commit to, and the saving you build into the architecture is a saving the vendor never gets to claw back at renewal.

Where teams get the router wrong

A router is simple in concept and easy to get wrong in practice, and a few failure modes show up again and again. The first is building it once and never revisiting the thresholds, so the routing that was right at launch slowly drifts out of date as traffic changes and new models arrive. A router is a living system, not a one-time configuration, and it needs to be retuned as the workload and the model lineup evolve. The second is routing on the calling feature alone when the difficulty within a feature varies widely, so a feature that mixes easy and hard requests gets pinned to one model and overpays on the easy half or underperforms on the hard half. Where difficulty varies inside a feature, you need the classifier, not just a rule. The third is forgetting the fallback, deploying aggressive routing to cheap models with no escalation path, so the occasional hard request that lands on a weak model simply fails instead of being retried higher up. The fallback is what makes aggressive routing safe, and a router without one is forced to be timid, which surrenders most of the saving.
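The second failure mode suggests a blended decision: pin a feature to a tier only when its difficulty is uniform, and hand everything else to the classifier. A minimal sketch, with hypothetical feature names and a classify callable that stands in for whatever classifier you use:

```python
from typing import Callable

# Features whose difficulty is uniform enough to pin to one tier (hypothetical names).
PINNED = {"classify_ticket": "haiku", "contract_analysis": "opus"}


def route(feature: str, request: str, classify: Callable[[str], str]) -> str:
    """Use the rule when the feature is pinned, otherwise classify the request itself."""
    if feature in PINNED:
        return PINNED[feature]
    return classify(request)


# Usage with a stub classifier that treats long requests as hard.
stub = lambda text: "opus" if len(text) > 100 else "haiku"
easy_chat = route("assistant_chat", "What are your opening hours?", stub)
hard_chat = route("assistant_chat", "Compare these three vendor contracts. " * 5, stub)
```

The same feature now lands on different tiers depending on the request, so a mixed feature no longer overpays on its easy half or underperforms on its hard half.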

Routing is not only about the model tier

The clearest savings come from moving easy work down to cheaper models, but a mature router does more than pick a tier. It can also shape the request for the model it chose, trimming context that a cheaper model does not need, constraining output length so the response does not run long, and selecting the cheapest representation of the input. It can decide whether a request is a caching candidate and route it so the stable prefix is reused, and whether it is asynchronous and belongs on the batch path. In this sense the router becomes the single place where every cost decision is made: the model, the context, the output shape, the caching, and the path are all chosen per request rather than left to the defaults baked into the calling code. That consolidation is valuable on its own, because it gives you one place to measure, tune, and govern cost, instead of cost decisions scattered across every feature that calls the model.
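One way to picture that consolidation is a single plan object produced per request. Everything below is illustrative: the field names, the output caps, and the minimum prefix size for caching are assumptions to be replaced with values from your own evaluation and your provider's documented limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    tier: str               # which model tier serves the request
    max_output_tokens: int  # output cap chosen for that tier
    cache_prefix: bool      # whether to reuse a stable shared prefix
    use_batch: bool         # whether the request goes on the asynchronous path


# Hypothetical per-tier output caps and caching cutoff.
OUTPUT_CAP = {"haiku": 256, "sonnet": 1024, "opus": 4096}
MIN_CACHE_TOKENS = 1024


def plan_request(tier: str, shared_prefix_tokens: int, needs_realtime: bool) -> Plan:
    """Make every cost decision for one request in one place."""
    return Plan(
        tier=tier,
        max_output_tokens=OUTPUT_CAP[tier],
        cache_prefix=shared_prefix_tokens >= MIN_CACHE_TOKENS,
        use_batch=not needs_realtime,
    )


nightly_tagging = plan_request("haiku", shared_prefix_tokens=3000, needs_realtime=False)
```

Logging the plan alongside the response gives you one record per request to measure and govern, instead of reconstructing cost decisions from scattered call sites.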
