A structured audit finds the spend hiding in a Claude application: the wrong model, uncached context, real-time calls that should be batch, and prompts carrying tokens that do nothing. Here is the method we run.
Most Claude applications were built to work, not to be cheap. The team shipped something that produced good output, and the bill was a secondary concern until it grew large enough to notice. By then the waste is baked in: the wrong model handling routine work, the same context resent uncached on every call, asynchronous jobs running through the expensive real-time path, and prompts that have accumulated tokens nobody can justify. A token waste audit is a structured pass through the application that surfaces all of this and quantifies it. Done properly, it routinely cuts aggregate spend by a large margin without touching output quality. This is the method, laid out so an engineering leader can run it and a procurement leader can read the result.
An audit begins with measurement, because intuition about where the money goes is almost always wrong. Pull usage data broken down as finely as you can: spend by workload, by model, by input versus output tokens, and by time of day. The goal is a map of where the dollars actually are. Almost every audit finds that a small number of workloads account for the majority of the spend, which is good news, because it means the audit work can concentrate where the savings are largest. Do not optimize anything until you know which workloads matter. Effort spent shaving a cheap, low-volume path is effort wasted.
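As a minimal sketch of that spend map, the snippet below aggregates cost by workload and reports each workload's cumulative share of the total. The record fields, workload names, and dollar figures are all hypothetical placeholders; in practice the rows would come from your own usage export or request logs.

```python
from collections import defaultdict

# Hypothetical usage records; replace with rows from your own usage export.
usage = [
    {"workload": "support_chat", "model": "opus", "cost_usd": 4200.0},
    {"workload": "ticket_classifier", "model": "opus", "cost_usd": 15800.0},
    {"workload": "doc_summaries", "model": "sonnet", "cost_usd": 1200.0},
    {"workload": "nightly_backfill", "model": "opus", "cost_usd": 8100.0},
    {"workload": "admin_tools", "model": "haiku", "cost_usd": 30.0},
]

def spend_by_workload(records):
    """Sum cost per workload."""
    totals = defaultdict(float)
    for record in records:
        totals[record["workload"]] += record["cost_usd"]
    return dict(totals)

def ranked_with_cumulative_share(totals):
    """Rank workloads by cost, largest first, with cumulative share of total spend."""
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for workload, cost in ranked:
        running += cost
        rows.append((workload, cost, running / grand_total))
    return rows

rows = ranked_with_cumulative_share(spend_by_workload(usage))
for workload, cost, cumulative in rows:
    print(f"{workload:20s} ${cost:>10,.2f}  cumulative {cumulative:6.1%}")
```

In this made-up data the top two workloads carry more than 80 percent of the spend, which is the kind of concentration the audit should confirm before any optimization starts.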
Once you know where the spend is, the waste falls into four recurring categories. Each has a direct remedy, and most applications leak through all four at once.
The single largest source of waste is running work on a more capable, more expensive model than it needs. Many applications were built on Opus during development, when getting good output mattered more than cost, and never revisited. A large share of real workloads (routine classification, extraction, simple question answering, formatting) runs just as well on Sonnet, and a meaningful slice runs well on Haiku. Audit each high-volume workload by asking which is the cheapest model that still meets the quality bar, and test the answer rather than assuming it. Model routing across Opus, Sonnet, and Haiku typically cuts aggregate spend 40 to 70 percent versus uniform Opus use, which makes this the first place to look.
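The "cheapest model that still meets the quality bar" question can be sketched as a small selection function. The relative costs and evaluation scores below are hypothetical placeholders, not real prices or benchmark results; the scores are assumed to come from running your own evaluation set against each model.

```python
# Hypothetical relative cost per request; substitute your measured cost per workload.
CANDIDATES = [
    {"model": "haiku", "relative_cost": 1.0},
    {"model": "sonnet", "relative_cost": 3.0},
    {"model": "opus", "relative_cost": 5.0},
]

def cheapest_passing_model(candidates, eval_scores, quality_bar):
    """Return the cheapest model whose measured score meets the bar, or None."""
    for candidate in sorted(candidates, key=lambda c: c["relative_cost"]):
        score = eval_scores.get(candidate["model"])
        # A model with no measured score is skipped: untested is not the same as passing.
        if score is not None and score >= quality_bar:
            return candidate["model"]
    return None

# Hypothetical scores from an evaluation run on one workload.
scores = {"haiku": 0.91, "sonnet": 0.97, "opus": 0.98}
print(cheapest_passing_model(CANDIDATES, scores, quality_bar=0.95))
```

Skipping models without a score is deliberate: the function only recommends a downgrade that has actually been tested.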
The second source is paying full price to resend the same context on every call. Look for stable content that appears in request after request: system prompts, instructions, reference documents, few-shot examples. Anything that is identical across calls and is not cached is being billed at full rate every time. Prompt caching takes up to 90 percent off the repeated portion, so identifying uncached stable context and structuring the prompt to cache it is one of the highest-return findings an audit produces. Measure the ratio of repeated to novel tokens in your highest-volume calls, because that ratio tells you the size of the prize.
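To see how the repeated-to-novel ratio sets the size of the prize, here is a rough upper-bound estimate. It assumes the full 90 percent discount quoted above applies to every repeated token and ignores any cache-write surcharge or cache expiry, so treat the result as a ceiling rather than a forecast. The token counts are hypothetical.

```python
def cached_input_saving(repeated_tokens, novel_tokens, cache_discount=0.90):
    """Upper-bound fraction of input-token cost removed by caching the repeated part.

    Assumes every repeated token is served from cache at the discounted rate and
    ignores cache-write surcharges and expiry (simplifying assumptions).
    """
    total = repeated_tokens + novel_tokens
    if total == 0:
        return 0.0
    return (repeated_tokens / total) * cache_discount

# Hypothetical call shape: 8,000 tokens of stable instructions, 2,000 tokens of new input.
saving = cached_input_saving(repeated_tokens=8_000, novel_tokens=2_000)
print(f"Estimated input-cost saving: {saving:.0%}")
```

A call that is 80 percent stable context can shed at most about 72 percent of its input cost under these assumptions, while a call that is mostly novel input gains little.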
The third source is running asynchronous work through the real-time path and paying the premium for immediacy nobody needs. Audit each workload for whether a human is actually waiting on the result. Overnight processing, bulk classification, content pipelines, evaluation runs, and backfills almost never need a real-time answer, and the batch path runs them at roughly half the price. Every workload that can tolerate a processing window and is still on the real-time path is leaking roughly half its cost.
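The "is a human actually waiting" check can be sketched as a filter over a workload inventory. The `Workload` fields, names, and costs are hypothetical, and the 50 percent figure is the "roughly half" from the text, which should be confirmed against current pricing.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    monthly_cost: float   # current monthly spend in USD (hypothetical figures below)
    human_waiting: bool   # True if someone is blocked on the response
    uses_batch: bool      # True if the workload already runs on the batch path

BATCH_DISCOUNT = 0.50  # "roughly half the price"; an assumption to verify

def batch_leaks(workloads):
    """List workloads nobody waits on that still run in real time, largest leak first."""
    leaks = [
        (w.name, w.monthly_cost * BATCH_DISCOUNT)
        for w in workloads
        if not w.human_waiting and not w.uses_batch
    ]
    return sorted(leaks, key=lambda leak: leak[1], reverse=True)

inventory = [
    Workload("support_chat", 4200.0, human_waiting=True, uses_batch=False),
    Workload("nightly_backfill", 8100.0, human_waiting=False, uses_batch=False),
    Workload("eval_runs", 900.0, human_waiting=False, uses_batch=False),
    Workload("bulk_tagging", 3000.0, human_waiting=False, uses_batch=True),
]

for name, leak in batch_leaks(inventory):
    print(f"{name:20s} estimated monthly leak ${leak:,.2f}")
```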
The fourth source is tokens that do nothing. Prompts accumulate cruft over time: instructions that are no longer relevant, examples that could be trimmed, verbose phrasing, duplicated context. On the output side, generations that are longer than they need to be cost real money, and output tokens are the more expensive side of the ledger. Audit your prompts to confirm that every part of their length earns its keep, and check whether you are asking for and paying for more output than the use case requires. Tightening prompts and constraining output length is unglamorous, but across a high-volume application it adds up.
For each finding, estimate the saving before you implement it, so you can prioritize. A finding that affects your highest volume workload is worth more than a larger percentage saving on a minor one. Build a simple model: for each workload, the current cost, the proposed change, and the expected cost after. This turns the audit from a list of ideas into a ranked plan, and it gives the procurement leader a number to hold the engineering work against. It also produces the evidence you will use later in the negotiation, because a documented, optimized run rate is the figure you want your commitment sized against.
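The simple model described above can be sketched as a list of findings sorted by absolute monthly saving. The workload names, changes, and dollar figures are hypothetical.

```python
# Hypothetical findings: current and expected monthly cost in USD per workload.
findings = [
    {"workload": "admin_tools", "change": "trim prompt", "current": 1000.0, "expected": 200.0},
    {"workload": "ticket_classifier", "change": "move to a cheaper model", "current": 20000.0, "expected": 14000.0},
    {"workload": "nightly_backfill", "change": "move to batch", "current": 8000.0, "expected": 4000.0},
]

def rank_findings(items):
    """Attach absolute and percentage savings, then rank by absolute saving."""
    enriched = []
    for item in items:
        saving = item["current"] - item["expected"]
        enriched.append({**item, "saving": saving, "saving_pct": saving / item["current"]})
    return sorted(enriched, key=lambda item: item["saving"], reverse=True)

plan = rank_findings(findings)
for row in plan:
    print(f"{row['workload']:20s} {row['change']:26s} "
          f"saves ${row['saving']:>8,.0f}/month ({row['saving_pct']:.0%})")
print(f"Total estimated saving: ${sum(row['saving'] for row in plan):,.0f}/month")
```

Ranking by dollars rather than percentage is the design choice that matters here: the 30 percent saving on the largest workload outranks the 80 percent saving on the smallest one.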
The four remedies are not alternatives; they compound. The strongest result comes from applying them together on the same workload. A bulk classification job can be moved to a cheaper model, run through batch at half price, with its shared instructions cached and its prompt trimmed, all at once. Each lever multiplies against the others, which is why a thorough audit so often lands a far larger total saving than any single change would suggest. Sequence the work by return, but plan to apply every applicable lever to the workloads that matter.
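A short worked example shows why the levers multiply rather than add. Each multiplier below is the hypothetical fraction of cost that remains after one lever is applied; the model treats every lever as acting on the whole remaining cost and assumes the discounts stack, which is a simplification (caching, for instance, affects only input tokens).

```python
def compounded_cost(baseline, remaining_fractions):
    """Apply each lever's remaining-cost fraction in turn to the baseline."""
    cost = baseline
    for fraction in remaining_fractions.values():
        cost *= fraction
    return cost

# Hypothetical remaining-cost fractions for one bulk classification job.
levers = {
    "cheaper_model": 0.5,   # illustrative, not a real price ratio
    "batch": 0.5,           # "half price" per the text
    "caching": 0.6,         # illustrative net effect on total cost
    "trimmed_prompt": 0.9,  # illustrative
}

baseline = 10_000.0
after = compounded_cost(baseline, levers)
total_saving = 1 - after / baseline
best_single_saving = 1 - min(levers.values())

print(f"Cost after all levers: ${after:,.0f}")
print(f"Combined saving: {total_saving:.1%} vs best single lever: {best_single_saving:.0%}")
```

With these illustrative numbers the job falls from 10,000 to about 1,350 per month, a saving of roughly 86.5 percent, while the best single lever alone would save 50 percent. Adding the individual savings would overstate the result; multiplying the remaining fractions does not.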
An audit is only credible if the output quality survives it. For every model change, run a proper evaluation against your quality bar before and after, so you are not trading cost for a worse product. For every prompt trim, confirm the output is still correct. The point of the audit is to remove waste, not to degrade the application, and the discipline of measuring quality alongside cost is what separates a real optimization from a reckless one. A change that saves money but quietly drops accuracy is not a saving. It is a deferred problem.
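A before-and-after quality gate can be as simple as the sketch below. It assumes you already have a graded evaluation set (1 for a correct output, 0 for an incorrect one) and a tolerance you choose yourself; the sample grades and the one-point tolerance are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def accept_change(baseline_grades, candidate_grades, max_drop=0.01):
    """Accept a change only if mean accuracy drops by no more than max_drop."""
    if not baseline_grades or not candidate_grades:
        raise ValueError("both evaluation runs need at least one graded example")
    baseline_accuracy = mean(baseline_grades)
    candidate_accuracy = mean(candidate_grades)
    accepted = (baseline_accuracy - candidate_accuracy) <= max_drop
    return accepted, baseline_accuracy, candidate_accuracy

# Hypothetical grades on the same ten evaluation examples.
before = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]          # current model and prompt
after_cheaper = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # cheaper model, same accuracy
after_trimmed = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # over-trimmed prompt, accuracy drops

print(accept_change(before, after_cheaper))
print(accept_change(before, after_trimmed))
```

A real evaluation set would be far larger than ten examples; a small set cannot distinguish a genuine regression from noise.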
The audit pays twice. First, it lowers your running cost directly. Second, it lowers the commitment you need to negotiate with Anthropic. A buyer who commits before optimizing signs a number built on a wasteful baseline, which means a larger commit, more exposure to unused commitment, and a weaker position on the rate. A buyer who audits first commits to the optimized run rate, which is smaller, safer, and a stronger basis for the whole negotiation. This is why we treat the token audit and the commercial negotiation as one engagement: the engineering work directly shapes the deal.
A token audit fails when it is treated as a purely engineering exercise or a purely financial one. The engineers know how the application is built and where the calls are made, but they often do not see the cost, because the bill lands somewhere else. The finance or procurement side sees the bill but cannot tell which workload drives it or whether a change would hurt the product. A real audit puts both in the room, with the usage data in front of them, so that every proposed change is judged on cost and on quality at once. The engineering leader can say whether a cheaper model will hold up, and the procurement leader can say whether the saving is worth the work. That shared view is what turns a list of technical ideas into a prioritized plan the business will actually fund and ship.
An audit is not a one-time cleanup, because applications drift back toward waste. New features ship on whatever model the developer reached for, prompts accumulate instructions over time, and workloads that started small grow into major line items without anyone reclassifying them. The practical cadence is to run a full audit before any major commitment negotiation, since the optimized run rate is what you want to commit against, and a lighter review on a regular schedule, quarterly for a fast-moving application, to catch drift before it compounds. Pair the review with a few standing guardrails, such as a default model for new work that is not Opus unless justified, and a check that asynchronous jobs default to batch. Guardrails prevent the waste from accumulating between audits, which is far cheaper than removing it after the fact.
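The two guardrails named above could be enforced with a check like the following, run in code review or CI against a registry of workload configurations. The config fields (`model`, `asynchronous`, `path`, `opus_justification`) are a hypothetical schema, not part of any real tool.

```python
def guardrail_violations(configs):
    """Flag Opus use without a justification and asynchronous jobs not on batch."""
    violations = []
    for config in configs:
        if config["model"] == "opus" and not config.get("opus_justification"):
            violations.append((config["name"], "Opus used without a recorded justification"))
        if config["asynchronous"] and config["path"] != "batch":
            violations.append((config["name"], "asynchronous job is not on the batch path"))
    return violations

# Hypothetical workload registry.
registry = [
    {"name": "contract_review", "model": "opus", "asynchronous": False, "path": "realtime",
     "opus_justification": "fails quality bar on cheaper models"},
    {"name": "new_feature_x", "model": "opus", "asynchronous": False, "path": "realtime"},
    {"name": "weekly_report", "model": "sonnet", "asynchronous": True, "path": "realtime"},
    {"name": "bulk_tagging", "model": "haiku", "asynchronous": True, "path": "batch"},
]

for name, reason in guardrail_violations(registry):
    print(f"{name}: {reason}")
```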
The output of a good audit is not a vague recommendation to optimize. It is a ranked table. Each row is a workload, with its current monthly cost, the specific changes proposed, the model and path it should move to, the caching opportunity, the expected cost after, and a note on the quality validation. Summed, the table gives a single number for the total saving and a sequence for capturing it, starting with the highest-return changes on the highest-volume workloads. That table is useful twice. It directs the engineering work, and it becomes the evidence base for the commitment negotiation, because it documents an optimized run rate that you can defend as the figure your commit should be sized against. An audit that ends in a document like that has done its job. One that ends in good intentions has not.