A coding agent feels cheap when an engineer is using it. One prompt, a working change, a happy developer. The invoice tells a different story, because a single agent task is not one call. It is a long loop of reading files, planning, calling tools, reading results, and revising, and every turn in that loop sends and receives tokens. The cost of a coding agent is not the prompt you typed. It is the conversation the agent had with itself to get the work done, and that conversation is almost entirely invisible to the person who triggered it. This guide opens up where the tokens actually go and what to do about it before the bill becomes the topic of a budget meeting.
A chat session is roughly linear. You ask, the model answers, you ask again. A coding agent is a loop with a growing context. To make a change it reads the relevant files, which pulls their contents into the context window. It plans, calls a tool, reads the tool output, which adds more tokens, and then it does it again, often many times, carrying the accumulated history forward on every turn. The context window does not just hold your instruction. It holds the files, the tool results, the plan, and the running transcript, and all of it is billed as input on every single turn of the loop.
This is why a task that produces a twenty line diff can consume the tokens of a very long document. The output is small. The input the model had to read and reread to produce it is large, and it grows with every step. The mental model that matters is not cost per task but cost per turn times turns, where the per turn cost keeps rising as the context accumulates.
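The "cost per turn times turns" model can be sketched in a few lines. This is a minimal illustration, not a billing formula: it assumes a fixed prefix and a constant amount of new context per turn, and the token counts are hypothetical.

```python
def total_input_tokens(prefix: int, growth_per_turn: int, turns: int) -> int:
    """Sum the input billed across a loop whose context grows each turn.

    Turn k (0-based) is billed for the prefix plus everything added so far.
    """
    return sum(prefix + growth_per_turn * k for k in range(turns))


# Hypothetical sizes: a 5,000-token prefix and 2,000 tokens added per turn.
six = total_input_tokens(5_000, 2_000, 6)
twelve = total_input_tokens(5_000, 2_000, 12)
print(six, twelve, round(twelve / six, 2))  # 60000 192000 3.2
```

Under these assumptions, doubling the number of turns more than triples the input tokens billed, because each later turn carries everything the earlier turns added.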
The first hiding place is repeated context. Each turn resends the system prompt, the instructions, and the relevant file contents. A long shared prefix that rides along on every turn is paid for every turn unless something is done about it.
The second is tool output. Agents call tools, and tools return text. A search that returns a hundred matches, a test run that dumps a long log, a file read that pulls in a large module: all of it lands in the context and is billed on the next turn. Verbose tools quietly multiply the cost of every task that uses them.
The third is the loop length. An agent that takes twelve turns to converge costs more than twice one that takes six, because the context is heavier on every later turn. Tasks that are vague, or that send the agent down a wrong path before it corrects, are expensive not because the final answer is hard but because the loop ran long.
The fourth is the model choice. Running every turn of every agent task on the strongest model is the single largest source of avoidable cost. Much of what an agent does, reading a file, running a test, applying a mechanical edit, does not need the top model at all.
Prompt caching is the first lever and the largest. When the shared prefix on each turn, the system prompt, the instructions, the stable file context, is cached, the input cost of that repeated content drops by up to ninety percent. For an agent loop that resends the same heavy prefix on every turn, caching is not a minor optimization. It is the difference between paying full freight on the context twelve times and paying it once, plus a small fraction of that price on each later turn.
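A rough sketch of that arithmetic follows. The numbers are assumptions for illustration: a hypothetical price of $3 per million input tokens, a cache read billed at 10 percent of the base price (the "up to ninety percent" saving above), and a cache write billed at a 25 percent premium. It also assumes the cache stays warm for the whole loop. Check current pricing before relying on any of these figures.

```python
def prefix_cost(prefix_tokens: int, turns: int, price_per_mtok: float,
                use_cache: bool, write_mult: float = 1.25,
                read_mult: float = 0.10) -> float:
    """Dollar cost of resending a stable prefix on every turn (turns >= 1)."""
    base = prefix_tokens / 1_000_000 * price_per_mtok
    if not use_cache:
        return base * turns
    # The first turn writes the cache; later turns read it at the discounted rate.
    return base * write_mult + base * read_mult * (turns - 1)


# Hypothetical: a 20,000-token prefix resent across a twelve-turn loop.
without_cache = prefix_cost(20_000, 12, 3.00, use_cache=False)
with_cache = prefix_cost(20_000, 12, 3.00, use_cache=True)
print(f"{without_cache:.3f} {with_cache:.3f}")  # 0.720 0.141
```

The sketch only covers the stable prefix. Context that changes from turn to turn, such as fresh tool output, is still billed at the normal input rate.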
Model routing is the second. Route the cheap, mechanical turns to Sonnet or Haiku and reserve Opus for the genuinely hard reasoning, the architecture decision, the subtle bug, rather than running the whole loop on the top model. Across a real workload this routing typically cuts aggregate spend by forty to seventy percent versus uniform Opus use, because most turns are not the hard turns.
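One way to picture routing is a lookup from the kind of turn to a model tier. The sketch below is hypothetical throughout: the turn kinds, the tier names, and the mid-tier fallback are illustrative choices, not a description of how any particular agent framework classifies its turns.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # a Haiku-class model
    MID = "mid"      # a Sonnet-class model
    TOP = "top"      # an Opus-class model


# Hypothetical mapping from turn kind to tier; tune it against real traces.
ROUTES = {
    "read_file": Tier.SMALL,
    "run_tests": Tier.SMALL,
    "mechanical_edit": Tier.MID,
    "architecture_decision": Tier.TOP,
    "subtle_bug": Tier.TOP,
}


def route(turn_kind: str) -> Tier:
    """Pick a tier for a turn; unknown kinds fall back to the mid tier."""
    return ROUTES.get(turn_kind, Tier.MID)


print(route("read_file").value, route("subtle_bug").value)  # small top
```

Falling back to the mid tier rather than the top tier is a design choice: it keeps unclassified turns from silently landing on the most expensive model.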
Controlling tool output is the third. Tools that return concise, relevant results instead of raw dumps keep the context lean, which keeps every subsequent turn cheaper. Trimming verbose tool output is one of the highest leverage and least glamorous fixes available.
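A simple form of trimming keeps the start and end of a long tool result and replaces the middle with a marker. The function below is a minimal sketch under that assumption. The name `trim_output` and the default limits are hypothetical, and a real tool might instead filter for failures or matches.

```python
def trim_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first and last lines of tool output and note what was dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough to pass through unchanged
    dropped = len(lines) - head - tail
    marker = f"[... {dropped} lines omitted ...]"
    # Slice from an explicit index so tail=0 keeps nothing from the end.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


# A hypothetical 100-line test log cut down to 11 lines.
log = "\n".join(f"line {i}" for i in range(100))
print(len(trim_output(log, head=5, tail=5).splitlines()))  # 11
```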
And scoping tasks tightly is the fourth. A clear, well bounded task converges in fewer turns than a vague one, and fewer turns means a lighter context and a smaller bill. The cheapest token is the turn the agent never had to take.
It helps to walk through a single task to see where the tokens actually accumulate. An engineer asks an agent to fix a bug. On turn one the agent reads the instruction and a couple of files, pulling their full contents into the context. On turn two it searches the codebase, and the search results join the context. On turn three it reads another file the search surfaced. On turn four it forms a plan, which it writes into the transcript. On turn five it makes an edit and runs the tests, and the test output, possibly long, lands in the context. On turn six it reads a failure and revises. By turn six the context holds the original instruction, three or four files, the search results, the plan, the transcript so far, and a test log, and every one of those turns was billed against the accumulated context up to that point.
The twenty line diff at the end is the only thing the engineer sees, and it is the cheapest part of the whole exchange. The expensive part is that the model read a growing pile of context six times to produce it. This is the mechanism behind every surprise on an agent invoice: the visible output is small, the invisible input is large and compounding, and the person who triggered the task has no view into the loop that ran behind their single prompt.
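The six-turn walkthrough can be traced with made-up numbers. Everything in this sketch is hypothetical: the token counts per step are invented for illustration, and it simplifies by treating whatever a turn adds as billed on that same turn.

```python
# Hypothetical tokens added to the context on each turn of the walkthrough.
steps = [
    ("instruction + two files", 6_000),
    ("search results", 3_000),
    ("another file", 4_000),
    ("plan", 1_000),
    ("edit + test log", 5_000),
    ("failure read + revision", 1_000),
]

context = 0  # tokens currently held in the context window
billed = 0   # cumulative input tokens billed across the loop
for number, (event, added) in enumerate(steps, start=1):
    context += added   # new material joins the context
    billed += context  # the whole context is billed as input this turn
    print(f"turn {number}: {event:<24} context={context:>6} billed={billed:>6}")
```

With these numbers the context ends at 20,000 tokens, but 81,000 input tokens are billed in total, roughly four times what the final context holds.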
You cannot control what you cannot see, and the first real step is to make the hidden cost visible. Track cost at the level of the task, not just the day or the seat, so you can see the distribution: most tasks clustered around a normal cost, and a tail of expensive outliers. Track the input to output token ratio, because a high ratio with no caching is the signature of repeated context being paid for again and again. Track which model handled which turn, because a distribution skewed toward the top model is the largest avoidable cost there is.
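Those three views can be computed from per-turn usage records. The sketch below assumes you already log one record per turn with a task identifier, the model used, token counts, and a cost. The `Turn` fields and the `summarize` helper are hypothetical names, not part of any vendor API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Turn:
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def summarize(turns: list[Turn]) -> dict:
    """Roll per-turn records up into the three views described above."""
    cost_per_task: Counter = Counter()
    turns_per_model: Counter = Counter()
    total_in = total_out = 0
    for t in turns:
        cost_per_task[t.task_id] += t.cost_usd
        turns_per_model[t.model] += 1
        total_in += t.input_tokens
        total_out += t.output_tokens
    return {
        "cost_per_task": dict(cost_per_task),
        "input_output_ratio": total_in / total_out if total_out else float("inf"),
        "model_share": {m: n / len(turns) for m, n in turns_per_model.items()},
    }


# Hypothetical records: two turns of task "a" on a top model, one of task "b".
records = [
    Turn("a", "top", 10_000, 200, 0.50),
    Turn("a", "top", 12_000, 300, 0.25),
    Turn("b", "small", 4_000, 500, 0.25),
]
summary = summarize(records)
print(summary["cost_per_task"], summary["input_output_ratio"])
```

Sorting `cost_per_task` by value surfaces the expensive tail, and `model_share` shows at a glance whether most turns are landing on the top model.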
With those three views in hand, the fixes prioritize themselves. A heavy input to output ratio says turn on caching for the shared prefix. A model mix skewed toward the top model says build routing so the easy turns drop to Sonnet or Haiku. A long tail of expensive tasks says tighten task scope and trim verbose tools so loops converge faster. None of this requires using the agent less. It requires using it the way the pricing rewards, and the measurement is what tells you which lever to pull first.
The reason this matters beyond the engineering team is that agent cost is the part of a Claude bill most likely to surprise finance, because it grows with adoption in a way that is hard to predict from a few early months. A team that measures per task cost and watches the trend can give finance a forecast that holds, instead of a number that doubles between the budget meeting and the invoice.
Before a coding agent deployment scales across an engineering organization, a handful of questions will tell you whether the economics are under control or about to run away. Is caching on for the stable prefix that every session carries, the system prompt, the instructions, the conventions? If not, you are paying full price for that context on every turn of every loop, and turning caching on is the single fastest fix available.
Is there a default model that is not the most expensive one, with the top model reserved for hard reasoning? A model mix skewed toward the strongest model is the largest avoidable cost in any agent deployment, and routing the easy turns down to Sonnet or Haiku typically takes aggregate spend forty to seventy percent below uniform top model use. Do tools return concise output, or do they flood the context with logs and raw dumps that get billed on every subsequent turn? And are tasks scoped tightly enough to converge in a few turns rather than wandering into long, expensive loops?
If the answer to several of these is no, the deployment is not expensive because agents are inherently expensive. It is expensive because the usage is fighting the pricing instead of working with it. The deeper point is that agent usage grows fast and unpredictably, which makes it dangerous to commit against. Drive the per task cost down first with caching, routing, and lean tooling, then size and structure the committed spend around the optimized, realistic trajectory rather than an early curve that will not hold. Our token optimization playbook walks through each of these levers in order with the numbers behind them.
The reason agent cost surprises people is that the expensive part is structurally hidden from the person who triggers it. An engineer writes one clean instruction and sees one clean result, so the natural mental model is that the task cost roughly what a single message costs. The reality is that the model carried out a long internal conversation to produce that result, reading files, calling tools, planning, and revising, and every step of that conversation was billed against a context that grew with each turn. The visible surface of the work is tiny. The billed substance of it is large and entirely below the waterline.
This invisibility is exactly why measurement matters so much. A team that does not track cost at the task level has no way to connect the bill to the behavior that produced it, so the invoice arrives as a shock with no obvious cause. A team that does track it can see the loop length, the input to output ratio, the model mix, and the tail of expensive tasks, and can point to precisely which habit is driving the number. You cannot manage what you cannot see, and with coding agents almost everything that matters is unseen by default. Making it visible is the first move, and it is the move that turns a mysterious, climbing bill into a set of specific, fixable patterns.
These fixes are real and they are worth doing, but the deeper point is commercial. Coding agent usage is volatile and it grows fast, which makes it dangerous to commit against. A buyer who signs a committed spend based on a few months of early agent usage, then watches adoption climb, can blow through the commitment and hit overage at an unprotected rate. The optimization and the negotiation are the same project: drive the per task cost down with caching, routing, and lean tooling, then size and structure the committed spend around the optimized, realistic trajectory rather than the raw early curve.
Our token optimization playbook lays out the agent specific levers alongside the broader ones, with the numbers behind each, so you can see what to fix first and how much it is worth. It is the method we use when we sit on the buyer side and run the optimization underneath an Anthropic deal.