What Belongs in Batch and What Does Not

Batch processing on Claude carries a fifty percent discount against the standard rate, which makes it one of the cleanest savings available: a flat halving of cost with no loss of quality. The only thing you give up is immediacy. Batch requests are processed asynchronously and returned within a window rather than in the moment, so the work has to be of a kind that can wait. That single constraint is the whole decision. The mistake teams make is not getting the batch mechanics wrong; it is failing to sort their workload by whether anyone is actually waiting for each result. A large share of enterprise Claude usage is work that nobody is waiting for, and every day it runs at the real-time rate it leaves the batch discount on the table. The skill is in telling the two apart reliably, and the test is simpler than it looks.

The one question that decides it

The test for whether a workload belongs in batch is a single question asked honestly: is a person or a process waiting on this specific result before it can proceed? If the answer is no, the work can almost certainly run in batch and should, because there is no reason to pay double for speed that nobody is using. If the answer is yes, it has to stay real time, because the discount is not worth degrading an experience someone is actively waiting through. Everything else is detail. The reason teams overpay is that they never ask the question per workload; they default everything to real time because that is how the first feature was built, and the asynchronous work inherits the synchronous pricing by inertia rather than by any decision that it needs to be fast.

What clearly belongs in batch

Several workload types are almost always batch candidates, because by their nature no one is waiting on any individual result:

Bulk processing of a backlog. Classifying, tagging, summarizing, or extracting from a large set of documents or records, where the whole job has a deadline but no single item does.
Scheduled and overnight jobs. Anything that runs on a timer, such as a nightly enrichment pass, a periodic re-scoring, or a daily digest generation, where the result is needed by morning, not by the next second.
Data preparation and offline enrichment. Building datasets, generating training or evaluation data, annotating content for later use, all of which feed a downstream process rather than a waiting user.
Reporting and analysis that runs ahead of when it is read. Generating analyses that will be reviewed later, where the consumer reads the finished output and never sees the processing time.

In all of these the latency is invisible to anyone who matters, so paying the real-time premium buys nothing. Moving them to batch saves fifty percent on whatever share of your spend they represent, and for many enterprises that share is large.
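The arithmetic behind that saving can be sketched in a few lines. This is a minimal illustration, not a billing tool: the function name `batch_saving` and the spend figures are hypothetical, and the only number taken from the text is the fifty percent discount.

```python
BATCH_DISCOUNT = 0.5  # the fifty percent discount described above


def batch_saving(total_spend: float, batchable_share: float) -> float:
    """Saving from moving the batchable share of spend to batch pricing."""
    if not 0.0 <= batchable_share <= 1.0:
        raise ValueError("batchable_share must be between 0 and 1")
    return total_spend * batchable_share * BATCH_DISCOUNT


# Hypothetical figures: 10,000 per month, a quarter of which nobody waits on.
print(batch_saving(10_000, 0.25))  # 1250.0
```

The saving scales linearly with the batchable share, which is why the sorting exercise matters more than any tuning of the batch mechanics themselves.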

What has to stay real time

The other side is just as clear. Work where a human is in the loop and waiting belongs on the real-time path, full stop. An interactive chat, an assistant responding to a user, a feature that runs inside someone's active workflow, a request whose result gates the next thing the user does: all of these need the immediate response, and the batch discount is irrelevant because the asynchronous return would break the experience. Forcing interactive work into batch to chase the discount is a false economy that trades a real product harm for a cost saving, and it is the opposite mistake from over-defaulting to real time. The point of the sorting is to put each workload where it belongs, not to maximize the batch share regardless of fit.

The ambiguous middle

Between the clear cases sits a middle band that is worth examining rather than guessing at, because much of the available saving hides there. These are workloads that feel real time but are not, or that mix waiting and non-waiting requests under one label. A request triggered by a user action is not necessarily one the user is waiting on; if it kicks off a process whose result the user sees minutes or hours later, it can be batched even though a person initiated it. A feature that returns some results immediately may have a slower tier of analysis behind it that need not be synchronous. The discipline is to look past how the work is triggered and ask only whether the result is awaited. Triggered by a user and awaited by a user are different things, and a lot of batchable work is mislabeled as real time because it was triggered interactively.
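The distinction between triggered and awaited can be made explicit in a routing rule. The sketch below is illustrative only: the `Workload` type, its field names, and the example workloads are all assumptions, chosen to show that the trigger is recorded but never consulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    triggered_by_user: bool  # how the work starts
    result_awaited: bool     # whether a person or process blocks on the result


def route(workload: Workload) -> str:
    """Return 'real_time' only when someone is waiting on this result."""
    # The trigger is deliberately ignored: only waiting decides the path.
    return "real_time" if workload.result_awaited else "batch"


# Hypothetical workloads covering the clear cases and the ambiguous middle.
workloads = [
    Workload("chat reply", triggered_by_user=True, result_awaited=True),
    Workload("report requested now, read tomorrow",
             triggered_by_user=True, result_awaited=False),
    Workload("nightly re-scoring", triggered_by_user=False, result_awaited=False),
]
for w in workloads:
    print(f"{w.name}: {route(w)}")
```

The second workload is the mislabeled kind: a person started it, but nobody is blocked on it, so it routes to batch.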

Mixed workloads can be split

A single feature often contains both kinds of work, and the right move is to split it rather than price the whole thing at the real-time rate because part of it must be fast. If a workflow returns an immediate acknowledgment or a quick first pass and then does deeper processing whose output appears later, the immediate part stays real time and the deferred part goes to batch. Designing this split takes a little engineering, but it captures the discount on the part of the workload that can take it without touching the part that cannot. Teams that treat each feature as a single latency class miss this entirely; teams that decompose features into their waiting and non-waiting components find batchable work inside things that looked unbatchable.
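One way such a split might look in code is shown below. Everything here is a stand-in: `handle_upload`, `quick_pass`, and the in-memory `deferred_requests` list are hypothetical names, and a real system would replace the list with an actual batch submission and the lambda with a real-time model call.

```python
from typing import Callable, Dict, List

# Stand-in for a real batch submission queue (hypothetical).
deferred_requests: List[Dict[str, str]] = []


def handle_upload(doc_id: str, text: str, quick_pass: Callable[[str], str]) -> str:
    """Answer the waiting user now; queue the deep analysis for batch."""
    # Awaited part: the user is blocked on this, so it stays real time.
    acknowledgment = quick_pass(text)
    # Non-awaited part: its output appears later, so it is queued for batch.
    deferred_requests.append(
        {"custom_id": doc_id, "prompt": f"Analyze in depth: {text}"}
    )
    return acknowledgment


# Hypothetical quick pass; in practice this would be a real-time call.
reply = handle_upload(
    "doc-1", "Q3 contract text", lambda t: f"Received {len(t)} characters"
)
print(reply)                   # Received 16 characters
print(len(deferred_requests))  # 1
```

The feature still feels immediate to the user, while the expensive half of the work accumulates in a queue that can be submitted at the batch rate.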

Plan for the batch window

Choosing batch does carry one real obligation, which is to design around the processing window rather than assuming an instant return. Batch results come back within a window, not immediately, so the consuming process has to be built to collect results when they are ready rather than blocking on them. For genuinely asynchronous work this is natural: the job submits and the results are picked up when complete. It does mean, however, that batch is not a drop-in substitution you can flip on without touching the surrounding code. The window also has to fit the deadline. Overnight work with a morning deadline is a comfortable fit, while work needed within a tight few minutes may not be, even if no human is watching, because the window could exceed the deadline. Matching the window to the real deadline is the one piece of planning batch requires.
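Both obligations, fitting the window to the deadline and collecting without blocking, can be sketched as follows. The six-hour window is an assumed figure for illustration only, so substitute whatever window your provider documents. The `get_status` and `fetch_results` callables and the `"ended"` status string are likewise hypothetical placeholders for real client calls.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, Optional

# Assumed worst-case window, for illustration only.
ASSUMED_WINDOW = timedelta(hours=6)


def fits_batch_window(submitted_at: datetime, deadline: datetime,
                      max_window: timedelta = ASSUMED_WINDOW) -> bool:
    """True if the worst-case batch return still meets the deadline."""
    return submitted_at + max_window <= deadline


def collect_if_ready(get_status: Callable[[], str],
                     fetch_results: Callable[[], Any]) -> Optional[Any]:
    """Check once; return results if the batch has finished, else None."""
    if get_status() == "ended":  # hypothetical status value
        return fetch_results()
    return None  # caller checks again later instead of blocking


submitted = datetime(2025, 1, 6, 22, 0)
# Overnight job due at 07:00 the next morning.
print(fits_batch_window(submitted, datetime(2025, 1, 7, 7, 0)))    # True
# Job due ten minutes after submission, even with no human watching.
print(fits_batch_window(submitted, datetime(2025, 1, 6, 22, 10)))  # False
```

A scheduler would call `collect_if_ready` on a timer, which keeps the consuming process free to do other work until the results arrive.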

Where this fits

Sorting workloads into batch and real time is one of the simplest high-return moves in token optimization: a flat fifty percent on whatever share of your spend nobody is waiting for. It stacks with model routing, caching, and output control, and the batch share often combines with caching for a compounded saving. For the full method, the sorting framework, and the worksheet to classify your own workloads, read the pillar guide, the token optimization playbook. Download it to find the batchable work hiding inside your real-time bill.
