Most Claude bills are too high for reasons that have nothing to do with the rate Anthropic charges. They are too high because the wrong model is doing the work, the same context is paid for again and again, and jobs that could wait are run in real time. This playbook is the buyer side method we use to cut aggregate Claude spend 40 to 70 percent before we ever sit down to negotiate the contract.
The single largest driver of a Claude bill is which model answers each request. Opus is the most capable and the most expensive. Sonnet sits in the middle and handles the large majority of real workloads at a fraction of the Opus rate. Haiku is the fastest and cheapest and is more than enough for classification, extraction, routing decisions, short responses, and the high volume tasks that quietly make up most of an enterprise traffic mix.
When every request goes to Opus by default, you are paying the top rate for work that a cheaper model would do indistinguishably. Routing means classifying each request and sending it to the cheapest model that meets the quality bar for that task. Done well, routing across Opus, Sonnet, and Haiku typically cuts aggregate spend 40 to 70 percent versus uniform Opus use, with no measurable drop in output quality on the tasks that were over served in the first place.
Routing is not a one time switch. It is a discipline. You classify traffic, you set a default of the cheapest viable model, you reserve Opus for the work that genuinely needs it, and you build a fallback chain so that a low cost first attempt escalates only when it has to. Read the detail on when to use Haiku instead of Sonnet and on reserving Opus for the work that needs it.
Most production prompts repeat a large block of context on every call. A system prompt, a set of instructions, a knowledge base extract, a long document the model reasons over again and again. Without caching you pay full input rate for that block on every single request. Prompt caching lets Anthropic store that repeated context and charge a small fraction to read it back, up to 90 percent off the cost of the cached portion.
The saving is real but it is conditional. Caching only pays when the cached block is stable and reused often enough to clear the write cost. That means designing prompts so the static content sits at the front and the variable content sits at the back, keeping the cached prefix identical across calls, and measuring your cache hit rate so you know the lever is actually firing. A cache friendly prompt architecture is a design decision, not a setting you flip. See cache friendly prompt architecture and measuring cache hit rate on Claude.
A large share of enterprise AI work does not need an answer in the next second. Evaluation runs, document enrichment, back office classification, overnight summarization, content generation pipelines. Anything that can tolerate a delayed response belongs in batch, where Anthropic charges 50 percent of the real time rate for the same model and the same tokens.
The trap is that teams default everything to the synchronous API because it is simpler to build against. Moving the async eligible work to batch is often the fastest 50 percent saving available, and it stacks with caching and routing rather than competing with them. The work is in identifying which jobs are truly async, sizing the batch windows, and handling errors so a single bad record does not stall the run. See the async jobs that belong in batch and combining batch with caching for maximum saving.
Input tokens and output tokens do not cost the same. Output tokens typically run several times the price of input tokens, often around five times. This single fact reshapes where you look for savings. A workload that returns long, verbose responses is far more expensive than its request count implies, and the cheapest fix is frequently to constrain the response rather than touch the model or the context.
Setting hard token ceilings per request, tightening prompts so the model stops over explaining, and trimming long system prompts are low effort changes that move the bill immediately. Reducing retries matters too, because every retry pays for the full generation again. The discipline is to treat output length as a cost lever you control, not a byproduct you accept. See why output tokens cost 5x and how to cut them and setting hard token ceilings per request.
These levers are not just engineering hygiene. They are negotiation leverage. Every dollar you take out of the run rate before you sit down with Anthropic is a dollar you do not commit to and do not forfeit. Buyers who commit first and optimize second end up promising a number they no longer need to spend. Buyers who optimize first negotiate from a smaller, truer baseline and keep the savings.
That is why this playbook is the front half of every engagement we run. We find the savings in the workload, then we set the commitment to the optimized run rate, then we protect the unit price and the overage rate in the contract. The playbook and the negotiation are one motion. When you are ready, we will run it on your stack.
Download the full playbook or have us run it on your workload. Fixed fee or gainshare.
Download playbookWeekly intelligence on Anthropic pricing moves and the buyer side counters that work.