Model Selection for Agentic Workloads

Agentic workloads break the simple model selection rule. With a single call, you pick the right model for that request and move on. An agent is different: it loops, calling the model repeatedly to plan, act, observe, and revise until a task is done. A task that a human would describe in one sentence might take an agent twenty or fifty model calls to complete. That multiplication is the whole story of agent cost. If every step runs on the most capable model by default, the bill scales with the number of steps, and agents take a lot of steps. Selecting the right model per step, rather than per task, is the difference between an agent that is affordable to run at scale and one that quietly becomes your largest line item.

Why the default destroys the budget

Teams building agents almost always start by wiring the whole loop to the top model, because it gives the best chance of the agent succeeding. That is reasonable for a prototype and ruinous in production. The reason is the step count. A naive agent loop on the top model pays the premium rate on planning steps, on tool selection, on reading tool output, on simple intermediate decisions, and on the final synthesis alike, when only a few of those steps actually need top tier reasoning. The cost is the premium rate multiplied by every step, and the step count for real agent tasks is high enough that the waste compounds into a number that surprises everyone the first time they see the bill.

Decompose the loop by step type

The key move is to stop thinking of the agent as running on one model and start thinking of each step as a separate decision. The steps in an agent loop are not equally hard, and once you separate them you can see which actually need the strongest model.

High level planning and complex reasoning, where the agent decides strategy or works through a genuinely hard problem, often justifies the top model.
Tool selection and routing, deciding which action to take next, is usually a structured decision a mid tier model handles well.
Reading and parsing tool output, extracting the relevant fact from a result, is frequently fine on the cheapest model.
Formatting, summarizing, and routine intermediate steps rarely need anything above the cheapest capable tier.

Once the loop is decomposed this way, the pattern is clear: a small number of steps need real power, and the majority are cheap work that has been running on an expensive model only because the whole loop shared one default.

Use a strong model to plan, cheaper models to execute

A reliable architecture for cost controlled agents is to reserve the strongest model for the steps where reasoning quality determines whether the task succeeds, typically the high level planning, and to run the execution steps on cheaper models. The planner sets the strategy well, and the cheaper models carry out the many smaller steps that follow. This mirrors how you would staff a team: your most expensive expert sets direction, and the routine execution is handled by people whose time costs less. The agent succeeds because the hard thinking was done well, and it stays affordable because the bulk of the calls ran cheap.

Protect quality with escalation

Routing agent steps to cheaper models is only safe if a step that goes wrong can recover. Build escalation into the loop: when a cheaper model produces a low confidence result, fails a check, or the agent detects it is stuck, escalate that step to a stronger model and retry. Escalation lets you default aggressively to the cheap models, because the rare step that needs more power gets it on demand rather than by paying for it everywhere in advance. An agent with good escalation is both cheaper and more robust than one pinned to the top model, because it spends its power where the difficulty actually is.

Watch the context that rides along

Agent cost is not only about which model runs each step, it is also about how much context each step carries. Agents accumulate history as they loop, and if the full transcript is resent on every step, the input tokens grow with the step count and the cost climbs even on cheap models. Manage the context the agent carries forward, summarizing or pruning history so each step sends only what it needs, and cache the stable portions so the repeated context is charged at the reduced rate. The model selection and the context management work together, and an agent that gets both right is dramatically cheaper than one that gets only one.

The commercial angle

Agentic workloads are where unoptimized Claude spend grows the fastest, which makes them where a buyer side review pays off the most. An agent platform sized on a naive, top model loop will carry a committed spend far larger than it needs, and that inflated commit becomes the baseline an uplift grows against at renewal. Decompose the loop, select per step, escalate when needed, and manage the context, and the same platform runs at a fraction of the cost, which means a smaller, defensible commit and far less exposure to unused commitment. We size the deal against the optimized agent, not the naive one, because the difference is often the largest single saving on the table.

Measure cost per task, not per call

With a single call workload, the natural unit of cost is the call. With agents, that unit misleads, because the thing your business cares about is the task, and a task is many calls. The metric that matters is cost per completed task: the total spend across every model call the agent made to finish one unit of work. Measuring per call hides the multiplication that makes agents expensive, while measuring per task surfaces it and lets you see whether a change actually helped. A routing change that lowers the cost of each call but causes the agent to take more steps may not lower the cost per task at all. Track the full cost of finishing a task, and the count of steps it took, and you can tell real improvements from ones that just moved the cost around. This is also the number to bring to a negotiation, because the commit you sign is consumed by tasks, and a vendor pricing your platform wants to understand the unit economics of the work, not the raw call count.

Cheaper steps can mean fewer steps

There is a counterintuitive point worth making about agent cost. It is tempting to assume the cheapest agent always runs every step on the cheapest model, but that is not quite right, because a weak model can take more steps to reach the same result, looping, backtracking, and retrying in ways a stronger model would avoid. If the planning step is underpowered, the agent may wander, and the extra steps can cost more than the stronger planner would have. This is why the reliable architecture reserves real power for the planning and reasoning steps that determine whether the agent stays on track, and saves on the many execution steps that follow a good plan. The goal is not the cheapest model everywhere, it is the lowest cost per completed task, and that is sometimes served by spending more on the few steps that prevent the agent from thrashing.

Govern the loop with limits

Agents need guardrails that single calls do not, because a loop that goes wrong can run up cost without finishing anything. Set a ceiling on the number of steps a task may take, so a stuck agent stops and escalates or fails rather than looping indefinitely on the meter. Cap the context an agent may accumulate, so the input does not grow without bound across a long task. Monitor the cost per task in production and alert when it drifts above expectation, because a regression in the loop or a change in the input distribution can push it up quietly. These limits are cheap to add and they convert the open ended risk of an agent platform into a bounded, predictable cost, which is exactly what you want both for the monthly bill and for the commit you size against it.

Model selection for agentic workloads.