Tool Use and Its Effect on Token Spend

Tool use is the feature that turned Claude from a model that answers questions into a system that gets work done. When you give Claude a set of tools, a search function, a database query, a calculator, a way to call your own services, it can decide which one to invoke, read the result, and act on it. That capability is the foundation of every agent and every assistant that does more than talk. It is also one of the least understood drivers of token spend in the entire Claude estate, because the cost does not show up where teams are looking for it. This is the buyer side view of how tool use moves your bill and how to keep it from running away.

We negotiate Claude contracts for enterprise buyers and optimize the spend underneath them, and tool use is a pattern we see misjudged more often than almost any other. Teams budget for the model call and forget that a single user request can become five, ten, or twenty model calls once tools enter the loop, each one carrying the full conversation and the full tool definitions forward. The model is the same. The token arithmetic is entirely different, and it is the arithmetic that decides the invoice.

Where the tokens actually go

Start with what tool use sends to the model on every call. Each tool you define has a name, a description, and a schema that tells Claude what arguments it accepts. Those definitions are input tokens, and they are sent with every single request in the conversation, not just the first. If you hand Claude a generous toolbox of twenty tools with rich descriptions, you are paying for that entire toolbox on every turn, whether the model uses one tool or none. The convenience of defining many tools is real, and so is the standing cost of carrying them all forward.

Then there is the loop. When Claude decides to call a tool, that is one model call. Your code runs the tool and returns the result. That result, often a chunk of JSON or a block of retrieved text, goes back into the conversation as input, and Claude is called again to decide what to do next. A task that requires three tool calls is therefore at least four model calls, and every one of them carries the growing conversation plus the full tool definitions. The tokens compound with each step, because the context never shrinks inside a single task. It only grows.

A single user request can become a dozen model calls once tools are in the loop, and every one of them carries the whole conversation forward.

This is the core insight a procurement leader needs and an engineering leader already half knows: tool use does not add cost linearly. It adds cost in a loop, and the loop multiplies the context. The most expensive agent workloads are not the ones with the biggest prompts. They are the ones that take the most steps, because each step pays for everything that came before it.

The tool result is the silent line item

The part teams overlook most is the size of the tool results themselves. A search tool that returns ten full documents, a database query that returns a thousand rows, an API call that returns a verbose payload, all of that comes back into the context as input tokens and then rides forward on every subsequent call in the task. A single fat tool result early in a long agent loop is paid for again and again as the loop continues, because it is part of the context that each later call carries.

The fix is to treat tool results as something to shape, not something to dump. Return the fields the model actually needs to make its next decision, not the entire payload. Summarize or truncate large retrievals before they enter the context. Page through results rather than returning them all at once. A tool that returns a tight, relevant result keeps the context lean for the rest of the loop, and on a long running agent that discipline is the difference between a workload that is affordable and one that is not.

Tool definitions are not free to carry

Because tool definitions are sent on every call, the number and verbosity of your tools is a standing tax on the whole conversation. Many teams define every tool the agent might ever need and present the full set on every turn. That is comfortable to build and expensive to run. The better pattern is to expose only the tools that are relevant to the current phase of the task, and to write tool descriptions that are clear but not padded. A description that earns its place tells the model what it needs in as few tokens as possible.

This is where prompt caching becomes a major lever for tool use specifically. Tool definitions are stable across a conversation, which makes them an ideal candidate for caching. With caching applied to the system prompt and the tool definitions, the repeated cost of carrying that block forward drops by up to ninety percent, because the cached portion is billed at a small fraction of the standard input rate on every call after the first. For an agent that takes many steps, caching the stable tool block is one of the highest return changes you can make, and it requires no change to what the agent does.

Talk it through

Find the loops that are costing you

Agent token spend hides in the loop, not the prompt. Book a strategy call and we will map your tool use, find the runaway loops and fat tool results, and show you where caching and routing cut the bill.

Book a Strategy Call

Model routing inside the agent loop

Not every step in an agent loop needs the same model. Deciding which tool to call, parsing a result, and formatting a final answer are different jobs with different difficulty, and running all of them on Opus because the hardest reasoning step needs Opus is a common and expensive default. The buyer side move is to route within the loop: send the heavy reasoning to the model that earns its rate, and send the routine steps, the classification of a result, the simple decision about the next tool, to Sonnet or Haiku. Across a realistic agent workload, routing across Opus, Sonnet, and Haiku typically cuts aggregate spend 40 to 70 percent versus running every step on Opus.

This matters more in tool use than almost anywhere else, because the loop produces many calls and only a few of them are genuinely hard. An agent that takes ten steps to complete a task may need its full reasoning power on two of them and nothing more than a light model on the other eight. Paying Opus rates for all ten is the kind of waste that does not show up in any single call and adds up to a large share of the bill across millions of tasks.

The runaway loop is the real risk

The failure mode that does the most damage is the loop that does not terminate cleanly. An agent that retries a failing tool, that calls a tool, gets an unhelpful result, and calls it again, or that wanders through extra steps before reaching an answer, burns tokens on every detour, and every detour carries the full context. A single poorly bounded agent in production can generate a token bill out of all proportion to the value of the work, and because it is buried in a loop, nobody notices until the invoice arrives.

Controlling this is partly engineering and partly governance. Set a maximum number of steps per task so no single request can loop forever. Handle tool errors so a failure does not trigger an endless retry. Log the number of tool calls per task and watch the distribution, because the tasks in the long tail, the ones taking far more steps than the median, are where the runaway cost lives. A small number of pathological tasks often account for a surprising share of agent spend, and they are invisible until you measure steps per task rather than just total tokens.

A worked example of the arithmetic

Picture a customer support agent built on Claude with tools for searching a knowledge base, looking up an order, and checking a shipping status. A typical request comes in, and the agent searches the knowledge base, which returns five full articles, then looks up the order, which returns the complete order record, then checks shipping, then composes a reply. That is four model calls. The first call carries the system prompt and all three tool definitions. The second carries all of that plus the five articles. The third carries everything plus the order record. The fourth carries the entire accumulated context to write the answer.

The reply the customer sees is short, but the work behind it paid for the tool definitions four times, the five articles three times, and the order record twice. Now optimize it. Cache the system prompt and tool definitions, and their repeated cost drops by up to ninety percent across the four calls. Trim the search tool to return article summaries rather than full text, and the largest tool result shrinks before it ever rides forward. Route the final composition step, which is not hard, to a lighter model. The agent does exactly the same job for the customer, and the token cost of that job falls by a large margin, repeated on every support request from then on.

The numbers here are illustrative rather than a quote, but the shape is exactly what we find. The cost of an agent is governed by the loop and the context that rides through it, not by the visible length of the final answer. Optimize the loop and you optimize the bill.

Why this belongs at the contract table

Tool use is not only an engineering concern. It directly shapes the commitment you should sign. Agent workloads are the fastest growing and least predictable part of most Claude estates, and if you forecast a commitment from an unoptimized agent baseline, you commit to a number inflated by uncached tool definitions, fat tool results, and Opus running every step. Unused commitment on Anthropic generally does not roll over, so over committing on a bloated agent forecast is money you simply lose. Optimize the loop first, measure the real consumption, and commit to that.

There is leverage in it too. A buyer who can show that their agents run with cached tool blocks, trimmed results, and routed steps is a buyer whose consumption is visibly efficient, and efficient consumption is hard to argue up at the table. The seller cannot easily push you toward a larger commit when your usage demonstrates discipline. Optimizing tool use before you negotiate gives you both a smaller number to commit to and a stronger position from which to hold the rate.

The buyer side summary

Tool use makes Claude capable and quietly expensive, because it turns one request into a loop of model calls, each carrying the full conversation and the full tool definitions forward. The tokens hide in the loop, in the tool definitions you carry on every call, and in the tool results you let grow unchecked. Control them by caching the stable system and tool block for up to ninety percent off its repeated cost, trimming tool results to what the next step needs, exposing only relevant tools, routing the easy steps to lighter models, and bounding the loop so no task runs away. Do all of that before you size a commitment, so your forecast reflects efficient agents rather than bloated ones. The result is an agent estate that scales without the bill scaling with it.

If you want to know where your agent loops are leaking tokens, that analysis is exactly where we start. The Token Optimization Field Guide covers tool use alongside caching, routing, and batch, and a strategy call turns it into a concrete plan for your workloads.