Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Token Optimization
Token Optimization

The token optimization levers that actually move the bill.

Buyer side guide · 9 minute read

Most advice on cutting a Claude bill is a long list of small tweaks, and the length is the problem. A buyer reads twenty tips, applies a few, and saves almost nothing, because the tips that matter are buried among the ones that do not. The truth is that a handful of levers do nearly all the work, and the rest are rounding error. This is a buyer side ranking of the token optimization levers that actually move the bill, in the order of how much they move it.

We negotiate with Anthropic and optimize the spend underneath the contract, so this list is drawn from what actually changes invoices, not from what sounds clever. A procurement leader and an engineering leader can use it together to decide where to spend the engineering effort, because effort spent on the wrong lever is as wasted as the tokens it was meant to save.

Lever one: model routing

The single largest lever is which model handles each request. The reflex in most organizations is to send everything to the most capable model, Opus, because it is the safest choice when nobody is watching the cost. But the majority of real traffic, classification, extraction, routing, summarization of short inputs, simple formatting, does not need the top model at all. Sonnet handles most of it at a fraction of the price, and Haiku handles the high volume, low complexity work for less again.

Routing each request to the cheapest model that meets the quality bar, rather than running everything on Opus, is the change that moves the bill most. Across a realistic workload, disciplined routing across Opus, Sonnet, and Haiku typically cuts aggregate spend by forty to seventy percent compared with uniform use of the top model. No other single lever comes close, which is why it belongs first and deserves the most engineering attention.

Lever two: prompt caching

The second lever is caching the parts of your prompts that repeat. Many production workloads send the same large block of context on every call, a system prompt, a knowledge base, a set of instructions, a document being asked about repeatedly. Without caching you pay full input price to process that identical block every single time. With prompt caching, the repeated portion is billed at a steep discount, up to ninety percent off, on the cached reads.

For any workload with a stable, reused context, this lever is enormous, because the savings scale with how often the context repeats. A chatbot grounded in a fixed knowledge base, a coding assistant carrying a large system prompt, a document tool answering many questions about the same file, all see a large share of their input tokens drop to the cached rate. The effort is modest and the payback is immediate, which makes caching the natural second move after routing.

Lever three: batch processing

The third lever is batch. A large amount of enterprise work does not need an answer in the next second. Overnight evaluation runs, bulk classification, backfilling a dataset, generating embeddings or summaries across a corpus, none of these are interactive. Anthropic's batch processing handles exactly this kind of asynchronous work at roughly half the standard price.

The reason batch ranks third rather than first is that it only applies to work that can tolerate delay. But for the organizations that run significant offline workloads, moving that traffic to batch is close to free money, a fifty percent cut on a large slice of spend in exchange for accepting a processing window measured in hours rather than seconds. The lever is identifying which of your workloads are genuinely latency sensitive and which only feel that way out of habit.

Lever four: prompt and output discipline

The fourth lever is the size of what you send and what you ask back. Bloated prompts that stuff in context the model does not need, and unbounded outputs that let the model ramble, both cost tokens directly. Tightening prompts to the context that actually matters, and setting sensible limits on output length, trims the per call cost across every request you make.

This lever is real but smaller than the first three, because it scales with token count rather than with rate or routing. It is the cleanup work that compounds once the big structural moves are in place. A lean prompt on the right model, with the cached context and a bounded output, is the cheapest possible version of a call, and the discipline is what gets you there.

Lever five: retrieval instead of stuffing

The fifth lever is feeding the model only the relevant slice of a large knowledge source rather than the whole thing. Many teams paste an entire document or a large reference into every prompt when the model only needs a paragraph of it. Retrieving and sending the relevant portion, rather than the full corpus, cuts input tokens substantially on knowledge heavy workloads, and it often improves answer quality too, because the model is not distracted by irrelevant context.

How the levers stack

The important point is that these levers multiply rather than add. Routing a workload to a cheaper model, then caching its repeated context, then sending the non urgent portion through batch, compounds into a far larger saving than any one lever alone. A request that moves from Opus to Sonnet, has ninety percent of its input cached, and runs in batch is a small fraction of the cost of the same request run naively on Opus at full input price in real time.

This is why the order matters. Start with routing, because it has the largest single effect and shapes everything downstream. Layer caching on top, because it cuts the input cost of whatever model you land on. Move the latency tolerant work to batch. Then tighten prompts and retrieval to clean up what remains. Worked in that sequence, the levers that actually move the bill take an enterprise Claude spend down by the forty to seventy percent that the headline number describes.

The levers that feel productive but are not

It is worth naming the moves that consume effort without moving the bill, because teams reach for them first precisely because they feel like optimization. Switching to a slightly cheaper third party gateway, shaving a few words off a prompt that was already lean, or hunting for marginal efficiencies in a workload that was never expensive, all of these produce activity and almost no saving. They feel productive because they involve work, but the work is aimed at rounding error.

The discipline is to let measurement, not instinct, choose the target. Rank your workloads by total cost, which is cost per inference multiplied by volume, and apply the big levers to the top of that list. A day spent moving the most expensive workload from Opus to Sonnet returns more than a month spent trimming prompts across workloads that were already cheap. Effort is a budget too, and spending it on the wrong lever is its own form of waste.

Sequencing the work for fastest payback

Because the levers compound, the order in which you apply them determines how quickly the savings arrive. Start with routing on your single most expensive workload, because that one change usually returns the largest absolute saving in the shortest time and funds the rest of the program. With routing in place, turn on caching for the workloads that carry a large repeated context, which is often the next biggest win and requires only modest engineering. Then sweep the latency tolerant jobs into batch, which is close to free money for anything that does not need an instant answer.

Only after those three structural moves are done does it make sense to invest in the finer work of prompt and retrieval discipline, because by then you are optimizing calls that are already on the right model, already cached, and already batched where appropriate. Worked in this sequence, the first changes pay for the program within weeks, and each subsequent layer compounds on a base that is already lean rather than on raw, unoptimized spend. The result is the forty to seventy percent reduction described at the top, captured in the order that delivers it fastest.

The buyer side takeaway

Cutting a Claude bill is not about a long list of small tricks. It is about a short list of large levers applied in the right order. Model routing moves the bill most, prompt caching second, batch third, and prompt and retrieval discipline clean up the rest. Apply them in sequence and they compound. Ignore the order and you spend engineering effort on rounding error while the real savings sit untouched.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.