Retries are the most overlooked line in a Claude bill. When a request fails or returns something the application cannot use, most systems quietly send it again, sometimes two or three times, until it succeeds. Every one of those attempts is a paid request. The user sees one answer, the invoice sees several, and almost nobody traces the gap back to its cause. On a busy production workload, retries can add a tenth to a quarter of total token spend without a single new feature shipping. This is a buyer side guide to finding that waste, understanding where it comes from, and engineering it out before you carry an inflated run rate into a commitment with Anthropic.
A retry is not free, and it is not cheap. When a request fails partway through and you resend it, you pay the full input cost again on the second attempt, because the model has to read the entire prompt from the start. If the first attempt produced partial output before it failed or was rejected, you paid for those output tokens too, and output tokens bill at roughly five times the input rate. So a single failed request that gets retried once can cost more than two clean requests, because you carry the full input twice plus whatever wasted output the failed attempt generated.
Now multiply that across a real workload. Suppose five percent of your requests fail and get retried once, and one percent fail twice. That sounds small, but on a high volume application it represents a meaningful slice of total spend, and it is spend that produces nothing. The customer experience is identical whether the answer came on the first try or the third. The only difference shows up on the invoice, which is exactly why retry waste survives for so long without being noticed.
Retries fall into a handful of categories, and the right fix depends on which kind dominates your workload. Before you change anything, you need to know which bucket your retries land in, because the engineering response is different for each.
The first category is genuine transient failure: a timeout, a rate limit, a momentary network problem. These retries are legitimate, and you do want to retry them, but how you retry matters enormously. A naive retry sends the identical request immediately, which often hits the same condition and fails again, so you pay twice for nothing and then succeed on the third attempt. A disciplined retry uses backoff, waits before resending, and respects the rate limit headers Anthropic returns, so the second attempt actually has a chance of succeeding. The cost difference between naive and disciplined retry on transient failures is large, because naive retry multiplies the failures rather than absorbing them.
The second category is the expensive one: the model returned a response, but the application could not use it. The output did not parse as valid JSON, it was missing a required field, it failed a schema check, or it did not match the format the downstream system expected. The application throws it away and asks again, hoping the next attempt comes back clean. These retries are pure waste, because the model did its work and was paid for it, and then the work was discarded for a formatting reason that better prompt design would have prevented.
The third category is subtler: the response was valid and parseable, but it was not good enough, so the application retried with a different prompt or a higher capability model. This is sometimes the right call, but it is often a sign that the workload is using the wrong model for the job, or that the prompt is underspecified and the model is guessing. Each retry up the model ladder is more expensive than the last, and a workload that routinely escalates from Sonnet to Opus on retry is paying the premium twice rather than routing correctly the first time.
You cannot reduce retry waste you have not measured, and most teams have never measured it. The audit is straightforward but it requires instrumentation that many production systems do not have by default. For a representative window of traffic, capture the total number of model calls, the number of unique logical requests behind those calls, and the ratio between them. That ratio is your retry multiplier, and on an unoptimized workload it is often well above one in a way that surprises the team when they first see it.
Then break the retries down by cause. Tag each retry with why it happened: timeout, rate limit, parse failure, schema failure, quality escalation. The distribution tells you where to spend your engineering effort. If parse and schema failures dominate, the fix is prompt and output discipline. If rate limits dominate, the fix is request pacing and backoff, and possibly a conversation with Anthropic about your throughput allocation. If quality escalations dominate, the fix is model routing. The audit turns a vague sense that retries cost money into a ranked list of specific, fixable causes.
Once you know the distribution, the fixes are concrete and the savings are verifiable. Here are the levers in the order they usually pay off.
The single highest return fix on most workloads is eliminating parse and schema failures. If the model keeps returning output your code cannot read, the prompt is not constraining the format tightly enough. Specify the exact structure you require, give a concrete example of a valid response, and where the application supports it, use the structured output and tool features that force the model to return well formed data rather than hoping it does. A workload that goes from a five percent parse failure rate to near zero stops paying for that entire category of retry, and the change is usually a prompt revision rather than a system rebuild.
For transient failures, replace immediate identical retries with exponential backoff and jitter, and respect the rate limit signals Anthropic sends back. This does two things: it raises the success rate of the second attempt, so you stop paying for failures that retry into the same condition, and it smooths your traffic so you hit rate limits less often in the first place. Pacing your requests to stay inside your throughput allocation is cheaper than retrying the ones that get rejected, and it produces a more predictable spend curve.
For quality escalations, the fix is to get the model choice right the first time rather than retrying up the ladder. If a class of request reliably needs Opus, route it to Opus directly instead of trying Sonnet, failing the quality bar, and then paying for Opus anyway on top of the wasted Sonnet call. If a class of request is escalating only occasionally, investigate whether a better prompt would let the cheaper model succeed consistently. Either way, the retry up the ladder is a signal that routing or prompting needs work, and fixing the root cause is cheaper than paying the escalation tax on every borderline request.
Finally, put a hard ceiling on retry attempts and make failures visible rather than silent. An uncapped retry loop is a runaway cost risk: a request that can never succeed will retry until something stops it, and if nothing stops it, it bills indefinitely. A sane cap, paired with logging that surfaces persistent failures, turns a hidden cost into a visible bug that someone fixes, rather than a quiet tax the invoice absorbs forever.
Retry waste does not just inflate this month's invoice. It inflates the run rate you carry into an Anthropic commitment, and that is where it does lasting damage. When you size a committed spend band, you do it off observed consumption. If that consumption includes a quarter of retry waste, you commit to a number that bakes the waste in, then either overpay against an inflated commitment or face a shortfall when someone finally fixes the retries and consumption drops below the floor. Either way the waste costs you twice: once on every invoice and again on the commitment built on top of it.
The disciplined sequence is to audit and engineer retries down first, let consumption settle at its real efficient level, and only then size the commitment against that clean number. A workload that has eliminated its retry waste commits to a defensible figure and negotiates its discount from a position of efficient spend, rather than padding a commitment the vendor is perfectly happy to sell you. We treat retry reduction as part of the same engagement that produces the consumption forecast, because the two questions are inseparable: what you actually need to commit to depends on what your workload costs once it is running clean.
Retries are a hidden tax that produces nothing and survives because nobody measures the gap between logical requests and actual model calls. Start by instrumenting that gap and breaking it down by cause. Constrain outputs so they parse the first time, back off on transient failures instead of hammering, route to the right model instead of escalating, and cap runaway loops. Then size your commitment against the clean number rather than the inflated one. If you want us to audit a production workload, quantify the retry waste, and turn it into a verified saving, book a strategy call and we will measure it with you.
Book a strategy call and we will audit a production workload and quantify the retry waste hiding in your invoice.
Book a Strategy CallWeekly intelligence on Anthropic pricing moves and the buyer side counters that work.