Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Token Optimization
Token Optimization

Streaming vs Batch: The Cost Tradeoff

Buyer side analysis · About 11 minutes · The Counteroffer desk

Claude offers two ways to run the same work, and they are priced differently. Real time requests, the streaming calls behind anything a user waits on, are billed at the standard rate. Batch processing, where you submit a set of requests and collect the results later, runs at roughly half that rate. The model is identical and the output is identical. The only difference is whether you need the answer now or can wait. For a large share of enterprise Claude workloads, the answer is that you can wait, and the companies that recognize this are paying half price on the work that does not need to be immediate. This is the buyer side view of the streaming versus batch tradeoff and how to decide where each piece of work belongs.

We negotiate Claude contracts for enterprise buyers and optimize the spend underneath them. Batch is one of the highest return levers we apply, because the saving is large, the change is usually contained, and the only thing you give up is immediacy on work that was never time sensitive to begin with. The hard part is not the technology. It is recognizing how much of your workload is misclassified as real time when it is not.

What you are actually paying for in real time

The standard streaming rate buys you immediacy: the response begins arriving token by token within moments of the request, which is exactly what a user facing chat or an interactive feature requires. That immediacy has real value when a person is waiting, and for those workloads the standard rate is the right price to pay. The mistake is paying for immediacy on work where no one is waiting and the result is consumed minutes or hours later regardless of how fast it was produced.

Batch trades that immediacy for cost. You submit a collection of requests, the work is processed within a turnaround window, and you collect the results when they are ready. Because you have given up the demand for an instant answer, the rate drops to roughly half. For any workload where the consumer of the output is a downstream process, a scheduled job, or a human who will look at the results later, the immediacy you are paying for in streaming is value you are not using.

Real time pricing buys immediacy. If nobody is waiting for the answer, immediacy is value you are paying for and not using.

The classification that decides the bill

The whole tradeoff comes down to one question asked of each workload: does a human or a real time system need this answer the moment it is produced, or is it consumed later? Workloads that genuinely need the instant answer stay on streaming. Workloads that are consumed later belong in batch. Most organizations have never asked this question workload by workload, so their default is streaming for everything, and they pay the real time premium on a large body of work that has no real time requirement.

The candidates for batch are more numerous than teams expect. Overnight enrichment of records, bulk classification and tagging, generating content that will be reviewed before it ships, processing documents that arrive in queues, backfilling analysis over historical data, evaluation and testing runs, and any reporting that is produced on a schedule rather than on demand. None of these has a user watching a spinner. All of them are paying streaming rates by default, and all of them could run at roughly half the cost in batch.

What the tradeoff costs you

Honesty about the downside matters, because batch is not free of cost in every dimension. You give up immediacy, so a workload moved to batch will not return its answer instantly. You take on a turnaround window, which means your pipeline has to tolerate the gap between submission and result. And you add a small amount of orchestration: submitting jobs, tracking them, and collecting results rather than making a simple synchronous call. For a workload that truly needs an instant answer, these costs are disqualifying, which is why those workloads stay on streaming.

For a workload that does not need immediacy, though, these costs are trivial. A turnaround window is irrelevant when the output is consumed hours later anyway. The orchestration is modest and is built once. Against those small costs sits a rate cut of roughly half on the entire volume of the workload, repeated on every run, forever. The tradeoff is heavily favorable for any work that is genuinely asynchronous, which is why the classification exercise is worth doing carefully rather than assuming everything must be real time.

Where the user experience breaks

The failure mode to avoid is moving a workload to batch that quietly needed to be real time, degrading the experience to save money. This happens when a feature looks asynchronous but has a hidden latency expectation, a user who technically gets results later but expects them soon, or a downstream system with a tighter deadline than anyone documented. Before you move a workload to batch, confirm that nothing depends on its immediacy, because a saving that breaks an experience is not a saving. The classification has to be honest in both directions: do not pay real time rates for asynchronous work, but do not batch work that genuinely is not.

Talk it through

Find the workloads that belong in batch

The saving is in the classification, and it is easy to get wrong in both directions. Book a strategy call and we will walk your workloads and identify which ones move to batch safely.

Book a Strategy Call

Batch and caching compound

Batch is most powerful in combination with the other token levers rather than alone. Prompt caching returns up to 90 percent on the stable parts of a prompt, and many batch workloads send a large shared context across every request in the job, which is exactly the pattern caching rewards. A batch job that also caches its shared context captures both savings at once: half rate from batch, and up to ninety percent off the repeated context from caching. Layer in model routing, sending each request to the lightest model in the Claude family that can do the job rather than running everything on Opus, and the same workload attracts three savings stacked together.

This is why we never look at batch in isolation. Across a realistic workload, model routing across Opus, Sonnet, and Haiku typically moves aggregate spend 40 to 70 percent on its own. Adding batch on the asynchronous share and caching on the repeated context pushes the total saving higher still. The streaming versus batch decision is one lever in a system, and its return multiplies when it is applied alongside the others rather than as a standalone fix.

How to run the migration

Moving work to batch is best done as a deliberate pass, not a big rewrite. Start by inventorying your workloads and classifying each as genuinely real time or actually asynchronous, using the single question of whether immediacy is required. Rank the asynchronous workloads by volume, because the saving is proportional to volume and the largest workloads return the most for the least effort. Move the biggest asynchronous workload first, confirm nothing downstream depended on its immediacy, measure the saving, and proceed down the list.

Treat the turnaround window as a design parameter, not an afterthought. Make sure each migrated pipeline tolerates the gap, has sensible handling for the results when they arrive, and degrades gracefully if a job is delayed. Done this way, the migration is low risk and the savings are durable, because once a workload is correctly classified as batch it stays cheaper on every future run without further work.

A worked example of the saving

Numbers make the tradeoff concrete. Suppose a company runs a nightly enrichment job over a large set of records, classifying and tagging each one with Claude, and the results are loaded into a data warehouse the next morning for analysts to use during the day. Today that job runs on the standard streaming API, because that is how it was first built, and it represents a meaningful share of monthly Claude spend. Nobody is waiting on it in real time. The records are processed overnight and consumed hours later. It is, by definition, asynchronous work paying a real time premium.

Moving that job to batch cuts its rate by roughly half immediately, with no change to the output and no degradation anyone would notice, because the morning deadline is unaffected by whether the work finished at two in the morning or four. Layer caching on the shared instructions that accompany every record, and the repeated context drops by up to ninety percent on top of the batch saving. Route the classification to a lighter model in the Claude family rather than Opus, since classification rarely needs the most capable model, and the rate on the bulk of the volume falls again. One workload, correctly reclassified, attracts three stacked savings, and it keeps returning them on every nightly run from then on.

The numbers here are illustrative rather than a quote, but the shape is exactly what we see. The largest savings in most Claude estates are not hiding in clever prompt tricks. They are sitting in plain view, in high volume asynchronous workloads that were built on streaming out of habit and never reclassified. The batch decision is where a great deal of that money lives.

Why this matters at the contract table

The streaming versus batch decision is not only an engineering choice. It changes the size of the commitment you should sign. If you forecast your Claude spend from a baseline where everything runs at the real time rate, you will commit to a larger number than you need, and unused commitment on Anthropic generally does not roll over. Moving the asynchronous share to batch before you forecast lowers your real consumption, which lowers the commitment you should make, which means you negotiate from a smaller and more accurate position. A buyer who optimizes streaming versus batch before sizing the deal commits to what they will actually spend, not to an unoptimized baseline.

It also strengthens your hand. A buyer who can demonstrate that they run their asynchronous work at half rate is a buyer the seller cannot easily talk into a larger commitment, because the consumption is visibly efficient. Efficiency is leverage, and the batch decision is one of the clearest demonstrations of it you can bring to a negotiation.

The buyer side summary

Streaming and batch run the same Claude work at different prices, and the only thing that separates them is whether you need the answer immediately. Classify each workload honestly, keep the genuinely real time work on streaming, and move the asynchronous work, which is usually more of your volume than you assume, to batch at roughly half the rate. Combine batch with caching and model routing to stack the savings, migrate the largest asynchronous workloads first, and respect the turnaround window so you never degrade an experience to save money. Then forecast and negotiate from the optimized baseline. The result is a materially lower bill on work that never needed to be expensive.

If you want to know how much of your workload can safely move to batch, that classification is exactly where we start. The Token Optimization Field Guide covers batch alongside the other levers, and a strategy call turns it into a migration plan for your workloads.

Run the asynchronous half at half price.

Book a strategy call. We will classify your workloads and find the spend that belongs in batch.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.