Migrating Workloads From Streaming to Batch

Most teams reach for the streaming endpoint by default. It is what you build against first, it returns tokens as they are generated, and it feels like the natural way to call a model. The trouble is that a real time call carries a real time price, and a large fraction of the work running through that endpoint is not actually waiting on a person. Reports that get read tomorrow, overnight enrichment of a data set, bulk classification of a backlog, document summaries that feed a dashboard nobody refreshes by hand: none of these need an answer in the next two seconds, yet many of them are billed as though they do. The Batch API exists precisely for this work, and it runs at roughly half the standard rate. Migrating the right workloads from streaming to batch is one of the cleanest savings available on Claude, because the output is identical and only the delivery timing changes.

What batch actually changes

When you submit a job to the Batch API, you hand Anthropic a set of requests and agree to collect the results within a processing window rather than receiving each response the instant it is ready. In exchange for giving up immediacy, you pay about half of what the same tokens would cost on the synchronous path. Nothing else changes. The model is the same, the prompt is the same, the quality is the same, and the tokens consumed are the same. The only difference is that you are no longer paying a premium for speed you were not using. That is why batch is such an easy win to explain to a finance leader and an engineering leader at the same time. There is no quality tradeoff to debate, only a question of whether a given workload can tolerate a delay measured in minutes to hours rather than seconds.

How to find the candidates

The migration starts with an inventory, not a code change. Pull your traffic and sort it by whether a human is genuinely waiting on the response at the moment it is produced. The honest answer is usually that a meaningful slice is not. Look first for anything triggered by a schedule rather than a click, because a cron job that runs at three in the morning has no user behind it and belongs in batch by definition. Then look for work that is queued and processed asynchronously even today, where the result lands in a table or a file rather than on a screen. Then look at the gray area: features where the response feeds a view that is checked occasionally rather than watched live, where a short delay would never be noticed. Each of these is a candidate, and together they often add up to a large portion of total spend, because background processing tends to be high volume even when it is low visibility.

Signals that a workload belongs in batch

It is triggered by a schedule or a queue rather than a live user action.
The result is written to storage and read later, not streamed to a waiting screen.
A delay of minutes to hours would cause no harm to the experience or the decision it informs.
The volume is high and steady, so the rate saving compounds across many requests.

Migrating without breaking anything

The technical move is smaller than it sounds, because batch is a different way of submitting the same calls rather than a different model. The pattern is to collect the requests you would have streamed, submit them as a job, poll for completion, and then consume the results from the finished job. The work that usually needs attention is the surrounding plumbing: where the results land, how downstream steps are triggered once the job completes, and how you handle the fact that responses now arrive together rather than one at a time. Build the consuming side to read from the completed job and write into the same destination your streaming path used, and the rest of your system does not need to know the delivery method changed. Start with one well understood workload, confirm the outputs match what streaming produced, and then expand to the others once the pattern is proven.

Pairing batch with caching

Batch and prompt caching are complementary rather than competing levers, and the largest savings come from using both. If your batch requests share a large stable prefix, a common set of instructions, a reference document, a fixed schema, then caching that prefix takes up to ninety percent off those repeated input tokens, and the batch discount takes roughly half off the rest. The two savings stack, because one acts on the repeated input and the other acts on the rate. A bulk classification job that sends the same long instruction block with every item is the ideal case: cache the instructions, batch the items, and the combined effect is far larger than either lever alone. Designing the job so the stable content sits at the front, where it can be cached, is worth the small extra effort whenever the prefix is large.

Why this matters before you commit

The reason a buyer should care about this beyond the monthly invoice is that the baseline you migrate to is the baseline you should bring to the negotiating table. Routing across Opus, Sonnet, and Haiku, caching repeated context, and moving asynchronous work to batch typically cut aggregate spend by forty to seventy percent against uniform real time use of a single large model. If you negotiate a committed spend figure against your unoptimized streaming usage, you lock that waste into your contract for the full term, and because unused commitment on Anthropic is generally lost rather than refunded, an inflated commit is money you have already spent whether you use it or not. Optimizing first, then committing, means you are negotiating against real demand rather than against inefficiency you simply had not removed yet. Batch migration is one of the fastest ways to bring that baseline down before the number is set.

Common objections, answered

The objection we hear most is that batch adds latency, and it does, but the question is whether anyone is waiting. For work with a person on the other end, latency is real and batch is the wrong tool. For background work, the latency is invisible, because the result was always going to be read later. A second objection is that batch complicates error handling, since failures surface when the job completes rather than inline, and that is a fair point worth designing for deliberately. The third is inertia: streaming already works, so why touch it. The answer is that the streaming premium on background work is pure waste, and leaving it in place costs real money every month for a speed nobody uses. None of these objections survive contact with a clear inventory of which workloads actually need a live response.

The buyer checklist

Inventory traffic by whether a human is genuinely waiting on the response when it is produced.
Move scheduled, queued, and write then read workloads to the Batch API for roughly half the rate.
Cache the stable prefix inside batch jobs so the input discount and the rate discount stack.
Migrate one proven workload first, confirm outputs match, then expand.
Carry the optimized baseline into the negotiation, since unused commitment is generally lost.

Batch is the lever that costs nothing in quality and pays in rate, and most teams have more eligible work than they think. We audit your traffic, migrate the right workloads, and bring the leaner baseline to the table with Anthropic. For the full framework, read the pillar guide, the token optimization playbook, and download the field guide to start your own inventory.

Migrating workloads from streaming to batch.