Model routing only saves money if the routing decision is correct, and the routing decision is only as good as the classification that drives it. Sending the right requests to Sonnet and Haiku while reserving Opus for the work that needs it sounds simple, but in production you face a stream of incoming requests and have to decide, in real time, which tier each one belongs on. That decision is a classification problem, and how you solve it determines whether your routing captures the saving or quietly leaks it. Classify too crudely and you either overspend by sending easy work to the premium tier or you damage quality by sending hard work to a model that cannot handle it. This guide is about the layer most teams skip: how to classify Claude queries cheaply and accurately so that every request lands on the model that fits it.
Routing gets all the attention, but routing is just the action. Classification is the judgment that tells the router what to do, and without it the router has nothing to act on. A great deal of advice stops at the idea that you should send hard work to Opus and easy work to Sonnet or Haiku, without addressing the practical question of how the system knows, at the moment a request arrives, which kind it is. That gap is where most routing strategies fall apart in production.
The classification step also carries its own cost and risk, which is why it has to be designed deliberately. If you classify every request by asking a model to judge its difficulty, you have added a call in front of every call, and if that classifier runs on an expensive model you may spend more on classifying than you save on routing. If you classify badly, you misroute, and a misroute either wastes money or fails the user. The goal is a classification approach that is cheap enough to run on everything and accurate enough to trust, and getting that balance right is the real work.
There is no single correct classifier. There is a spectrum, from nearly free and crude to costly and precise, and the right answer usually combines more than one.
The cheapest classification uses information you already have. The source of the request, the feature it came from, the user tier, the length of the input, and the type of task the endpoint serves are all signals available before any model runs. A request that arrives at your summarization endpoint is summarization, and you can route it to Sonnet without asking a model anything. A great deal of traffic can be classified this way at essentially zero cost, simply because your application already knows what the request is for. This is the layer to exhaust first, because it is free.
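As a minimal sketch of this rules layer, the lookup below routes on metadata alone and returns `None` when the rules cannot decide. The endpoint names, the tier table, and the 80-character length cutoff are illustrative assumptions, not values from any real deployment.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical endpoint-to-tier table (assumed names, for illustration only).
ENDPOINT_TIERS = {
    "summarize": "sonnet",
    "billing_help": "sonnet",
    "contract_review": "opus",
}

# Assumed cutoff: inputs shorter than this on a mixed endpoint count as easy.
SHORT_INPUT_CHARS = 80


@dataclass
class Request:
    endpoint: str
    user_tier: str
    text: str


def route_by_rules(req: Request) -> Optional[str]:
    """Return a tier from metadata alone, or None if rules cannot decide."""
    tier = ENDPOINT_TIERS.get(req.endpoint)
    if tier is not None:
        return tier
    if len(req.text) < SHORT_INPUT_CHARS:
        return "haiku"
    return None  # hand off to a more expensive classification layer
```

Returning `None` rather than a default tier keeps the rules layer honest: it only answers when the metadata actually settles the question.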
For the traffic that rules cannot resolve, where the same endpoint receives a mix of easy and hard requests, a small fast model can judge difficulty before routing. Running Haiku as a classifier in front of the main call is cheap relative to the cost of misrouting to Opus, and it can read the actual content of the request rather than just its metadata. The classifier's only job is to decide a tier, so its prompt is short and its output is tiny, which keeps the added cost low. This is the workhorse of real-time classification.
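A minimal sketch of such a classifier follows. It assumes a hypothetical `complete` callable that sends a prompt to a small, fast model and returns its text output. The prompt wording, the EASY/HARD labels, and the mapping to tiers are all illustrative assumptions.

```python
from typing import Callable

# Assumed prompt: short, and asks for nothing but a one-word label.
CLASSIFIER_PROMPT = (
    "Classify the difficulty of the user request below.\n"
    "Answer with exactly one word: EASY or HARD.\n\n"
    "Request:\n{request}"
)


def classify_tier(text: str, complete: Callable[[str], str]) -> str:
    """Ask a small model for a one-word label and map it to a tier.

    `complete` is a hypothetical callable wrapping whatever client you use
    to call the small model; it is not part of any real SDK.
    """
    raw = complete(CLASSIFIER_PROMPT.format(request=text))
    label = raw.strip().upper()
    if label.startswith("EASY"):
        return "sonnet"
    # HARD, and any output that cannot be parsed, goes to the premium tier.
    return "opus"
```

Treating unparseable output as hard is a design choice that favors quality over cost; a high-volume, low-stakes workload might reasonably choose the opposite default.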
The third approach inverts the order. Instead of classifying up front, you run the request on the cheaper model first and check the result. If the output meets a confidence or quality bar, you keep it. If it does not, you escalate to the higher tier. This works well when most requests succeed on the cheaper model and only a minority need escalation, because you only pay for the premium tier on the requests that actually required it, and you avoid a separate classification call entirely. The cost is the occasional double run, which is worth it when the escalation rate is low.
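The try-cheap-first pattern can be sketched as follows, together with the expected cost per request that makes the "low escalation rate" condition concrete. The callables `cheap`, `premium`, and `is_good_enough` are hypothetical stand-ins for your own model calls and quality check, and costs are in arbitrary units.

```python
from typing import Callable, Tuple


def answer_with_escalation(
    text: str,
    cheap: Callable[[str], str],
    premium: Callable[[str], str],
    is_good_enough: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Run the cheap model first; escalate only if its output fails the check."""
    draft = cheap(text)
    if is_good_enough(text, draft):
        return draft, "cheap"
    return premium(text), "premium"


def expected_cost(cheap_cost: float, premium_cost: float,
                  escalation_rate: float) -> float:
    """Every request pays the cheap call; escalated ones also pay the premium call."""
    return cheap_cost + escalation_rate * premium_cost
```

Under these assumptions, the approach beats sending everything to the premium tier whenever `expected_cost(...)` is below `premium_cost`, which holds as long as the escalation rate stays under `1 - cheap_cost / premium_cost`.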
Whichever methods you use, the classifier has to earn its place by being both inexpensive and reliable, and a few design choices decide whether it does.
A classifier that runs on every request must be the cheapest thing in the pipeline, because its cost is paid on the full volume. Use the smallest model that can make the distinction reliably, give it a tight prompt that asks for nothing but the tier decision, and keep its output to a single token where you can. The classifier should be a rounding error against the saving it enables, never a meaningful line item of its own.
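One way to check that the classifier really is a rounding error is a back-of-the-envelope break-even calculation. The sketch below assumes the baseline is sending every request to the premium tier, that the classifier routes perfectly, and that costs are per request in arbitrary units; all parameter names are illustrative.

```python
def net_saving_per_request(
    classifier_cost: float,
    cheap_cost: float,
    premium_cost: float,
    easy_fraction: float,
) -> float:
    """Average saving per request versus sending everything to the premium tier.

    The classifier is paid on every request; the saving is earned only on
    the fraction of requests it moves to the cheaper tier.
    """
    return easy_fraction * (premium_cost - cheap_cost) - classifier_cost
```

A negative result means the classifier costs more than the routing saves, which is the failure mode of running the classifier on an expensive model or on traffic that is mostly hard.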
Misclassification is not symmetric, and your classifier should reflect that. Sending a hard request to a cheap model produces a bad answer, which can cost you a customer or a wrong decision. Sending an easy request to the premium model only costs a little money. For each use case, decide which error you can tolerate and bias the classifier accordingly. For high-stakes output, lean toward escalating when unsure. For high-volume, low-stakes work, lean toward the cheaper tier and accept the rare imperfect answer. The classifier is a place to encode that tradeoff explicitly.
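One way to encode that tradeoff explicitly is an expected-cost threshold. The sketch below assumes the classifier can produce a probability `p_hard` that a request is hard, and that you can put rough numbers on the two error costs; both are assumptions, and the function names are illustrative.

```python
def escalation_threshold(cost_overroute: float, cost_underroute: float) -> float:
    """Probability of 'hard' above which escalating has lower expected cost.

    Escalate when p_hard * cost_underroute >= (1 - p_hard) * cost_overroute,
    which rearranges to p_hard >= cost_overroute / (cost_overroute + cost_underroute).
    """
    return cost_overroute / (cost_overroute + cost_underroute)


def choose_tier(p_hard: float, cost_overroute: float, cost_underroute: float) -> str:
    """Pick the tier with the lower expected error cost."""
    if p_hard >= escalation_threshold(cost_overroute, cost_underroute):
        return "premium"
    return "cheap"
```

When a bad answer is assumed to cost nine times as much as an unnecessary premium call, the threshold drops to 0.1, so even a 20 percent chance of a hard request escalates. With equal costs the threshold is 0.5 and the same request stays on the cheaper tier.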
A classifier is only worth trusting if you check that its decisions hold up. Sample its routing decisions and review whether the requests it sent to the cheaper tier actually came back with acceptable output, and whether the requests it escalated genuinely needed it. That review tells you whether the classifier is too aggressive, too cautious, or about right, and it gives you the evidence to adjust the threshold rather than guessing. Without measurement, a classifier drifts out of alignment with reality and the saving erodes silently.
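A minimal sketch of that review, assuming each sampled decision has been labeled by a reviewer with whether the request actually needed the premium tier. The record shape and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ReviewedDecision:
    routed_tier: str       # "cheap" or "premium", as chosen by the classifier
    needed_premium: bool   # reviewer's judgment after reading the output


def routing_error_rates(samples: Iterable[ReviewedDecision]) -> Dict[str, float]:
    """Share of cheap-routed requests that needed premium, and vice versa."""
    items = list(samples)
    cheap = [s for s in items if s.routed_tier == "cheap"]
    premium = [s for s in items if s.routed_tier == "premium"]
    under = sum(1 for s in cheap if s.needed_premium)
    over = sum(1 for s in premium if not s.needed_premium)
    return {
        "under_routed_rate": under / len(cheap) if cheap else 0.0,
        "over_routed_rate": over / len(premium) if premium else 0.0,
    }
```

A high under-routed rate suggests the classifier is too aggressive and is hurting quality; a high over-routed rate suggests it is too cautious and is leaking money.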
It helps to walk a single request through a real cascade to see how the layers cooperate. Imagine a request arriving at a customer support application. The first layer is rules: the request came from the billing help widget, which the application already knows tends to produce straightforward account questions, so the rule routes it to Sonnet without consulting any model. The classification cost so far is zero, because no classifier ran.
Now imagine a request that arrives at the general help box, where the content could be anything from a password reset to a complex dispute. Rules cannot resolve it, so it passes to the second layer, a small Haiku classifier that reads the request and judges its difficulty. The classifier decides this one is routine and sends it to Sonnet. The classifier call was cheap, far cheaper than the cost of having defaulted the request to Opus, and the routing is now correct. For the occasional request the classifier marks as genuinely complex, it routes to Opus, and the premium cost is paid only on the small slice that earned it.
Finally, imagine a borderline request that the classifier sent to Sonnet but that Sonnet handled poorly. The third layer catches it: a confidence check on the output sees that the answer did not meet the bar, and the request escalates to Opus for a second attempt. The user gets a good answer, and the only wasted spend was the one Sonnet call that missed the bar on a single request. Across the full traffic, the vast majority were handled by free rules or a cheap classifier, the premium tier was reserved for the requests that needed it, and the rare escalation protected quality without inflating the average cost. That is the cascade working as designed: each layer absorbs what it can cheaply and passes only the hard remainder upward.
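The three layers of the walkthrough can be sketched as one function. Every callable here (`rules`, `classify`, the entries in `models`, and `is_good_enough`) is a hypothetical stand-in you would supply, and the returned trace exists only to show which layers ran.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_cascade(
    text: str,
    endpoint: str,
    rules: Callable[[str], Optional[str]],
    classify: Callable[[str], str],
    models: Dict[str, Callable[[str], str]],
    is_good_enough: Callable[[str, str], bool],
) -> Tuple[str, List[str]]:
    """Rules first, then a model classifier, then an output check with escalation."""
    trace: List[str] = []

    # Layer 1: free metadata rules.
    tier = rules(endpoint)
    if tier is not None:
        trace.append("rules")
    else:
        # Layer 2: small-model classifier, run only when rules cannot decide.
        tier = classify(text)
        trace.append("classifier")

    answer = models[tier](text)
    trace.append(tier)

    # Layer 3: escalate a weak answer from a cheaper tier to the premium tier.
    if tier != "opus" and not is_good_enough(text, answer):
        answer = models["opus"](text)
        trace.append("opus")

    return answer, trace
```

For the borderline request in the walkthrough, the trace would read `["classifier", "sonnet", "opus"]`, while the billing widget request never gets past `["rules", "sonnet"]`.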
Cost is the reason most teams start classifying, but it is not the only thing the classifier touches. Every approach to classification has a latency and reliability profile, and choosing without considering those can save money on paper while quietly hurting the experience the user actually has. A classifier that adds a model call in front of every request adds that call's latency to every request too, and for a user-facing feature that delay is felt. A confidence-based escalation approach adds latency only on the requests that fail and escalate, but those requests now take the time of two model calls instead of one. The right choice depends on how sensitive the workload is to delay.
For a real-time interactive feature where responsiveness matters, the cheapest classifier in latency terms is the one that uses rules and metadata, because it adds no model call at all. Where a model classifier is needed, keeping it small keeps its latency low as well as its cost, which is another reason to use the smallest model that can make the distinction. For background or asynchronous work where a little extra delay does not matter, you have more freedom, and a confidence-based approach that occasionally runs twice is perfectly acceptable because no user is waiting on it.
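The two latency profiles can be compared with simple arithmetic. This sketch assumes latencies in seconds, measures only the delay each approach adds on top of the model call that finally answers, and uses illustrative parameter names.

```python
from typing import Dict


def added_latency(
    classifier_s: float,
    cheap_s: float,
    escalation_rate: float,
) -> Dict[str, Dict[str, float]]:
    """Extra delay per request for each approach: average and worst case."""
    return {
        # An up-front classifier delays every request by its own call time.
        "upfront_classifier": {"mean": classifier_s, "worst": classifier_s},
        # Escalation delays only failed requests, by the wasted cheap call.
        "escalation": {"mean": escalation_rate * cheap_s, "worst": cheap_s},
    }
```

The average can favor escalation while the worst case favors the classifier, which is why the sensitivity of the workload to tail delay matters more than the mean alone.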
Reliability matters too. The classifier is now a component in the path of every request, and if it fails, the request fails or misroutes. A robust setup has a safe default for when the classifier is uncertain or unavailable, usually erring toward the tier that protects quality, so that a classification failure degrades cost rather than output. Designing the fallback is part of designing the classifier, because a classifier that saves money until the day it breaks is not the saving you wanted.
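A minimal sketch of such a fallback, assuming a hypothetical `classify` callable that raises an exception on timeout or outage and may occasionally return something that is not a known tier. The tier names and the choice of `"opus"` as the safe default are assumptions for illustration.

```python
from typing import Callable, Tuple


def classify_with_fallback(
    text: str,
    classify: Callable[[str], str],
    safe_tier: str = "opus",
    valid_tiers: Tuple[str, ...] = ("haiku", "sonnet", "opus"),
) -> str:
    """Return the classifier's tier, or the safe default if it fails."""
    try:
        tier = classify(text)
    except Exception:
        # Classifier unavailable or timed out: protect quality, accept the cost.
        return safe_tier
    # An unrecognized answer is treated the same way as a failure.
    return tier if tier in valid_tiers else safe_tier
```

In practice you would also want to count how often the fallback fires, since a silent rise in fallbacks shows up only as a higher bill.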
The first version of a classification scheme is rarely the best one, because you build it on assumptions about your traffic that the traffic then corrects. The valuable move is to treat the scheme as something that learns from what it sees rather than a fixed rule set written once. The data the classifier generates is the raw material for improving it, if you keep and look at it.