Independent buyer side advisory · Anthropic onlyNew York · London
Blog · Batch Processing
Middle of funnel · Commercial investigation

Combining batch with caching for maximum saving.

Batch returns fifty percent on the rate and caching returns up to ninety percent on the cached tokens. Used together they compound, but only if you stack them in the right order and protect the cache inside the batch. Here is how to run both levers on the same workload so one does not quietly cancel the other.

Batch and caching are the two largest token levers Anthropic gives you, and most teams use them separately, on different workloads, by different people, at different times. That is a missed opportunity, because on a large class of asynchronous work the two levers apply to the same requests at the same time, and when they do they compound. Batch halves the rate you pay per token. Caching removes up to ninety percent of the cost of the stable context inside each request. Run them together correctly and the effective cost of a unit of work falls far below what either lever achieves alone. Run them together carelessly and the batch quietly breaks the cache, so the invoice looks fine while you have silently lost the larger of the two savings. This piece sets out how the two levers interact, the order to apply them in, and the failure mode that costs teams the cache without anyone noticing.

What each lever actually discounts

To stack the two correctly you have to be precise about what each one touches, because they discount different parts of the bill. Batch applies a flat discount to the standard input and output token rate in exchange for accepting a completion window rather than an instant response. It does not care what is in the request. It just charges half for the whole thing. Caching works differently. It discounts the cost of reprocessing a stable block of context that recurs across requests, returning up to ninety percent on those cached tokens, but it does nothing for output and nothing for the unique part of each request. So batch discounts everything by a fixed fraction, while caching discounts the repeated input deeply but leaves output and novel input alone. Understanding that division is what lets you predict how much stacking will save and why the order matters.

Batch halves the rate on the whole request. Caching removes up to ninety percent of the cost of the repeated context. They touch different parts of the bill, which is exactly why they compound rather than overlap.

The sequence that captures both

Stacking is not just turning both features on. There is a correct order, and reversing it leaves money behind even when all the levers are in use. The sequence is route first, cache second, batch last.

  • Route first. Send the work to the cheapest model that does the job well. Most batch eligible enrichment, classification, and extraction does not need Opus, and running it on Haiku or Sonnet lowers the base rate that everything downstream is discounted from.
  • Cache second. Structure the prompt so the stable context sits at the front and recurs identically, so the repeated input reads from cache rather than being reprocessed. Now you are caching a cheaper model.
  • Batch last. Submit the work as a job so the rate discount lands on the smallest possible number, the already routed, already cached request.

The order matters because each step changes the base for the next. Routing first means you cache and batch a cheaper model. Caching second means batch is halving a smaller input bill. Batch last means the rate discount applies to the lowest figure you can get the request down to. Teams that batch a workload still running uniformly on Opus with an uncacheable prompt are using one lever on a number that two earlier levers should have shrunk first.

The trap that quietly kills the cache inside a batch

This is the failure mode we see most often, and it is subtle enough that teams ship it without noticing. Caching depends on the shared prefix being identical and recurring close enough together to stay warm. When you assemble a batch, the order and grouping of the requests inside it determines whether the cache stays warm across the run. If a pipeline interleaves requests that use several different prompt versions or reference documents into one batch, the prefix keeps changing, the cache keeps missing, and you pay full input price on work you believed was cached. The batch discount still applies, so the line item goes down and the invoice looks acceptable, but you have silently forfeited the larger of the two levers and nobody flagged it because the number fell anyway. A batch that looks cheaper but lost its cache is the most expensive kind of saving, because everyone believes the problem is already solved.

How to keep the cache warm across a batch

  • Group requests by prompt version and reference document so each group shares one identical prefix, rather than mixing prefixes within a single batch.
  • Submit related groups close together in time so the cached prefix does not expire between the requests that should be reading it.
  • Pin the prefix structure. Keep stable content at the front and variable content at the end so the cache boundary is clean and consistent across every request in the run.
  • Version your prompts explicitly so a quiet edit does not invalidate the cache for an entire batch without anyone realizing the hit rate just collapsed.

How to size the combined saving before you build

Because the two levers touch different parts of the bill, you can estimate the combined effect before committing engineering time, and you should, so the project is justified by a number rather than a hope. Look at four things for the workload in question. The prefix ratio tells you how much of each request is stable and shared versus unique per call, which sets how much caching can reach. The output share matters because long outputs benefit from batch but not from caching, since caching does not discount output. The reuse window tells you how often the same prefix recurs, which sets the achievable cache hit rate. And the deadline tolerance confirms the work can actually wait inside the batch window. A workload with a high prefix ratio, modest output, a tight reuse window, and real deadline tolerance is the ideal stacking candidate and will show the largest combined saving. A workload with mostly unique input and large output gains from batch but little from caching, so you stack what applies and do not force the lever that does not fit.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.