Error Handling in Claude Batch Jobs | Anthropic Negotiations

Batch is the lever that takes roughly half off the rate for work nobody is waiting on, and it is one of the easiest savings to justify because the model and the output are unchanged. Teams that hesitate do so not because of the saving but because of the operational shape of a batch job. On the real-time path, an error arrives the instant it happens, attached to the request that caused it, and you handle it there and then. In batch, the requests are submitted together and the results come back together at the end of a processing window, so a failure does not announce itself the moment it occurs. It waits in the completed job alongside the successes. If your system is not built to reconcile that, a batch saving can quietly become a reliability problem, with silently dropped records and no one noticing until the data downstream looks wrong. The good news is that none of this is hard. It is ordinary asynchronous engineering, and once you design for it deliberately, batch is just as dependable as the synchronous path and considerably cheaper.

Why errors look different in batch

The core difference is timing and grouping. A synchronous call gives you one request and one response, and if it fails you see the failure immediately in the same code path that made the call, with the full context of what you were doing. A batch job gives you many requests and, when the window closes, many results, some of which may have succeeded and some of which may have failed for their own individual reasons. There is no open connection delivering each outcome inline, so you do not learn about a failure until you collect and inspect the completed job. This means the responsibility for noticing a problem shifts from the moment of the call to the moment of collection, and the unit of attention shifts from a single request to a set of results that must be examined item by item. A team that treats a completed batch as uniformly successful, and simply writes the whole thing downstream, will eventually write a batch that contains failures it never looked for. The fix is to make collection an active reconciliation step rather than a passive copy.

The categories of failure to plan for

Not every failure is the same, and handling them well starts with separating them. There are item-level failures, where most of the batch succeeds but individual requests fail, perhaps because a particular input was malformed, too long, or triggered a content issue. There are job-level problems, where the batch as a whole does not complete as expected within its window. There are transient conditions that would succeed on a second attempt, distinct from permanent failures that will fail every time no matter how many times you retry. And there is the quietest category of all, the request that simply does not come back, where the result set returned is smaller than the set you submitted and the gap is the failure. Each of these calls for a different response, and lumping them together leads either to over-retrying permanent failures, which wastes money and time, or to under-handling transient ones, which loses data that a single retry would have recovered.

A working taxonomy

Item-level failures, isolated to specific requests, which should be collected, classified, and routed to retry or to a dead letter path rather than failing the whole job.
Transient failures, which are worth a bounded number of retries because they are likely to succeed on a later attempt.
Permanent failures, which will never succeed as submitted and must be surfaced to a human or a fix rather than retried indefinitely.
Missing results, where the returned set is smaller than the submitted set, which is the failure mode most likely to pass unnoticed without explicit reconciliation.
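The taxonomy above can be sketched as a small classifier applied to each item at collection time. This is a minimal illustration, not a definitive implementation: the `status` and `error_code` fields and the specific error codes are hypothetical placeholders for whatever your results actually report, and treating unknown codes as permanent is a design assumption chosen so that unfamiliar failures reach a human rather than a retry loop.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"   # worth a bounded number of retries
    PERMANENT = "permanent"   # will never succeed as submitted
    MISSING = "missing"       # submitted, but no result came back


# Hypothetical error codes; substitute whatever your results actually report.
TRANSIENT_CODES = {"timeout", "overloaded", "expired"}


def classify(result: Optional[dict]) -> Optional[FailureKind]:
    """Return the failure kind for one item, or None if it succeeded."""
    if result is None:
        # No result was returned for a request that was submitted.
        return FailureKind.MISSING
    if result.get("status") == "succeeded":
        return None
    if result.get("error_code") in TRANSIENT_CODES:
        return FailureKind.TRANSIENT
    # Unknown codes are treated as permanent so they reach a human
    # instead of being retried indefinitely.
    return FailureKind.PERMANENT
```

Every classification here is made per item, which is what keeps one bad request from failing the rest of the job.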

Reconciliation is the heart of it

The single most important habit in batch error handling is reconciling the submitted set against the returned set on every job. You submitted a known number of requests, each carrying an identifier that ties it back to the record it came from. When the job completes, you walk the results, match each one to its original request by that identifier, and confirm that every request has either a successful result or an explicit failure. Anything with neither is a missing result, and missing results are the dangerous ones because they fail silently. They do not throw an error in your collection code, they simply are not there, and a naive consumer that iterates over whatever came back will skip them without complaint. Building the collection step so it starts from the submitted set and proves that each item is accounted for, rather than starting from the returned set and trusting it to be complete, is the difference between a batch pipeline you can rely on and one that loses data you will only discover later when something downstream does not add up.
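A minimal sketch of that reconciliation step, assuming each result is a dict carrying a hypothetical `id` field that matches the identifier on the submitted request, and a hypothetical `status` field equal to `"succeeded"` on success. The important detail is that the loop iterates over the submitted identifiers, not over the returned results.

```python
from dataclasses import dataclass, field


@dataclass
class Reconciliation:
    succeeded: dict = field(default_factory=dict)   # id -> result
    failed: dict = field(default_factory=dict)      # id -> result
    missing: list = field(default_factory=list)     # ids with no result
    unexpected: list = field(default_factory=list)  # ids never submitted


def reconcile(submitted_ids, results) -> Reconciliation:
    """Account for every submitted id, starting from the submitted set."""
    by_id = {r["id"]: r for r in results}
    report = Reconciliation()
    for request_id in submitted_ids:
        result = by_id.get(request_id)
        if result is None:
            # Neither a success nor an explicit failure: a silent gap.
            report.missing.append(request_id)
        elif result.get("status") == "succeeded":
            report.succeeded[request_id] = result
        else:
            report.failed[request_id] = result
    submitted = set(submitted_ids)
    # Results that match no submitted request are also worth surfacing.
    report.unexpected = [rid for rid in by_id if rid not in submitted]
    return report


report = reconcile(
    ["a", "b", "c"],
    [{"id": "a", "status": "succeeded"}, {"id": "b", "status": "errored"}],
)
print(report.missing)  # ['c']
```

A consumer that looped over the two returned results would never have noticed that `c` was absent; starting from the submitted list makes the gap explicit.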

Retry strategy that does not waste money

Retries are where good intentions become expensive if you are not careful. A blanket policy of retrying every failure as many times as it takes will burn money and time on permanent failures that can never succeed, and a malformed input retried twenty times is twenty wasted attempts. The disciplined approach is to classify before you retry. Transient failures get a bounded number of retries, ideally collected and resubmitted as their own smaller batch so they too enjoy the batch rate rather than being pushed back onto the more expensive synchronous path out of impatience. Permanent failures get zero retries and instead get routed to a dead letter destination where they can be inspected, fixed, and resubmitted deliberately. Item-level failures are handled individually so that one bad request never fails the thousands of good ones around it. This classification is not complicated, but it has to be explicit, because both defaults cost you: retrying everything costs money, and retrying nothing loses data.
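One way to express "classify before you retry" is a function that splits the failed items into a retry batch and a dead letter list. This is a sketch under stated assumptions: the bound of three attempts is arbitrary, `attempts` is a hypothetical per-request counter your pipeline would maintain, and `is_transient` stands in for whatever classification rule you use.

```python
MAX_ATTEMPTS = 3  # assumed bound; tune to your workload


def plan_retries(failures, attempts, is_transient, max_attempts=MAX_ATTEMPTS):
    """Split failed items into a retry batch and a dead letter list.

    failures: dict of id -> failed result
    attempts: dict of id -> number of attempts already made
    is_transient: callable deciding whether a failed result is retryable
    """
    retry_batch, dead_letter = [], []
    for request_id, result in failures.items():
        made = attempts.get(request_id, 1)
        if is_transient(result) and made < max_attempts:
            # Resubmitted later as its own smaller batch.
            retry_batch.append(request_id)
        else:
            # Permanent, or transient but out of attempts: needs a human.
            dead_letter.append(request_id)
    return retry_batch, dead_letter
```

Transient failures that exhaust their attempts land in the dead letter list alongside permanent ones, so nothing is retried without limit and nothing is dropped.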

Partial results and idempotency

Two design properties make batch failure handling far less stressful. The first is comfort with partial results. A batch will not always be all or nothing, and a pipeline that can accept a job where most items succeeded and a few failed, process the successes, and route the failures to a retry batch, is far more robust than one that treats any failure as a reason to discard the whole job. The second is idempotency, meaning that processing the same result twice produces the same outcome as processing it once. Because retries and resubmissions are a normal part of batch operation, you will sometimes process an item more than once, and if your downstream writes are keyed by the request identifier so that a repeat write simply overwrites rather than duplicates, then retries are safe and you never have to worry about a recovered failure creating a double entry. Designing the consuming side to be idempotent turns retries from a source of anxiety into a routine, harmless operation.
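As an illustration of an idempotent downstream write, the sketch below keys a table on the request identifier so that a repeated write replaces the existing row. SQLite from the Python standard library is used purely as a stand-in for your real store, and the `outputs` table and its columns are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store whose primary key is the request identifier."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outputs ("
        "request_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def write_result(conn, request_id, body):
    """Upsert keyed by request id: a repeat write overwrites, never duplicates."""
    conn.execute(
        "INSERT OR REPLACE INTO outputs (request_id, body) VALUES (?, ?)",
        (request_id, body),
    )
    conn.commit()


conn = open_store()
write_result(conn, "rec-1", "first attempt")
write_result(conn, "rec-1", "recovered on retry")  # same key, no second row
```

Because the key is the request identifier rather than an auto-generated row number, processing a recovered item twice leaves exactly one row behind.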

Monitoring a path with no live signal

On the synchronous path, errors are loud, because they happen in front of you. In batch, errors are quiet, because they sit in a completed job until you look. That means monitoring has to be deliberate rather than incidental. The metrics worth standing up are the success rate per job, the count and classification of failures, the count of missing results from reconciliation, and the age of anything sitting in the dead letter path waiting for attention. A sudden rise in item-level failures often points at a change in your input data rather than anything in the model, and catching that quickly depends on watching the failure rate as a first-class metric rather than discovering it when someone notices the output looks thin. The principle is that a batch pipeline needs the same observability discipline as any asynchronous system, and the absence of inline errors is exactly why you must build the visibility in on purpose.
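The metrics named above can be computed once per completed job from numbers the collection step already has. The function below is a rough sketch: the parameter names and the returned keys are hypothetical, and how you ship the values to a monitoring system is left out.

```python
from collections import Counter


def job_metrics(submitted_count, succeeded_count, failure_kinds,
                missing_count, dead_letter_ages_hours):
    """Summarise one completed job into the metrics worth alerting on.

    failure_kinds: one label per failed item, e.g. "transient" or "permanent"
    dead_letter_ages_hours: age in hours of each item awaiting attention
    """
    return {
        "success_rate": (succeeded_count / submitted_count
                         if submitted_count else 0.0),
        "failures_by_kind": dict(Counter(failure_kinds)),
        "missing_results": missing_count,
        "oldest_dead_letter_hours": max(dead_letter_ages_hours, default=0.0),
    }
```

Tracking the oldest dead letter item, rather than only the count, is what reveals a queue nobody is watching.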

Why this protects the saving, and the negotiation

It is worth being clear about why a procurement leader should care about batch error handling and not file it as a purely engineering concern. The reason you moved the workload to batch was to capture roughly half off the rate, and that saving only holds if the batch path is reliable enough to keep the workload on it. If batch jobs lose data and the team loses confidence, the workload gets pulled back to the more expensive synchronous path, and the saving evaporates. Solid error handling is what makes the batch saving durable rather than a brief experiment. This matters at the negotiating table too, because batch sits alongside the other token levers. Routing across Opus, Sonnet, and Haiku puts each request on the cheapest sufficient model, prompt caching takes up to ninety percent off repeated input, and batch takes roughly half off the rate for asynchronous work, and together these levers typically cut aggregate spend by forty to seventy percent. The leaner baseline that results is the baseline you should commit to with Anthropic, because committing against unoptimized usage locks waste into the contract for the full term, and unused commitment is generally lost rather than refunded. A batch path you trust is part of what lets you carry a genuinely optimized number into the negotiation rather than retreating from the saving the first time a job misbehaves.

Designing for resubmission from the start

The teams that find batch error handling painful are usually the ones that bolted it on after the pipeline was already running, and the teams that find it routine are the ones that designed for resubmission from the beginning. The difference is in a few early choices. Carry a stable identifier on every request that ties it to its source record, so that matching results back, retrying failures, and writing idempotently all key off the same value rather than relying on order or position. Keep the failed items separate from the successful ones at collection time, so a retry batch is simply the failed set resubmitted rather than a re-run of the whole job, which both saves money and avoids reprocessing work that already succeeded. Treat the dead letter path as a first-class destination with its own monitoring rather than an afterthought, because the items that land there are exactly the ones that need a human eye, and an unwatched dead letter queue is where data quietly dies. None of these choices is expensive when made early, and all of them are awkward to retrofit, which is why batch reliability is far more a question of initial design than of clever recovery code. A pipeline built with identifiers, separated failure handling, and a watched dead letter path is one where a failed batch is a routine event you handle and move past, not an incident that costs you a morning of forensic reconciliation.
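A minimal sketch of the first two choices, assuming each source record is a dict with a unique, hypothetical `record_id` field and that a request is represented as a plain dict with `id` and `prompt` keys. Real request payloads will have more fields; the point is only where the identifier comes from and how a retry batch is formed.

```python
def build_requests(records, make_prompt):
    """Attach a stable identifier to each request, derived from its record.

    records: iterable of dicts with a unique "record_id" (hypothetical field)
    make_prompt: callable turning one record into the prompt text
    """
    requests, seen = [], set()
    for record in records:
        request_id = f"rec-{record['record_id']}"
        if request_id in seen:
            # Duplicate ids would make reconciliation ambiguous.
            raise ValueError(f"duplicate request id: {request_id}")
        seen.add(request_id)
        requests.append({"id": request_id, "prompt": make_prompt(record)})
    return requests


def retry_requests(requests, failed_ids):
    """A retry batch is only the failed subset, with ids unchanged."""
    failed = set(failed_ids)
    return [r for r in requests if r["id"] in failed]
```

Deriving the identifier from the source record, not from position in the batch, means the same value still identifies the item after it has been resubmitted in a smaller batch with different ordering.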

This design discipline also makes the saving easier to defend to the people who approved moving the workload to batch in the first place. When a finance leader asks whether the batch path is reliable enough to keep the saving, the answer you want to give is that every request is accounted for, every failure is classified and handled, and nothing is lost, backed by the reconciliation numbers your pipeline already produces. A team that can show that level of control keeps the workload on batch and keeps the roughly half-off rate that came with it, while a team that cannot will quietly retreat to the more expensive synchronous path the first time a job misbehaves and lose the saving it worked to capture. Reliable error handling is, in the end, what makes the batch saving permanent rather than provisional.

The buyer checklist

Treat collection as active reconciliation, proving every submitted request has a result or an explicit failure.
Classify failures into transient, permanent, item-level, and missing, and handle each differently.
Retry transient failures in a bounded way as their own batch, and route permanent failures to a dead letter path.
Make downstream writes idempotent so retries and resubmissions never create duplicates.
Monitor success rate, failure classification, and dead letter age so quiet failures become visible.
Keep the batch path reliable so the saving stays captured and the optimized baseline holds into the negotiation.

Batch error handling is ordinary asynchronous engineering applied with discipline, and getting it right is what keeps the batch saving from becoming an operational liability. We help teams design reliable batch pipelines, then carry the optimized baseline into the negotiation with Anthropic. For the full framework on batch, caching, and routing, read the pillar guide and book a call to run it on your workload, starting from the token optimization playbook.
