Batch returns fifty percent on the rate and caching returns up to ninety percent on the cached tokens. Used together they compound, but only if you stack them in the right order and protect the cache inside the batch. Here is how to run both levers on the same workload so one does not quietly cancel the other.
Batch and caching are the two largest token levers Anthropic gives you, and most teams use them separately, on different workloads, by different people, at different times. That is a missed opportunity, because on a large class of asynchronous work the two levers apply to the same requests at the same time, and when they do they compound. Batch halves the rate you pay per token. Caching removes up to ninety percent of the cost of the stable context inside each request. Run them together correctly and the effective cost of a unit of work falls far below what either lever achieves alone. Run them together carelessly and the batch quietly breaks the cache, so the invoice looks fine while you have silently lost the larger of the two savings. This piece sets out how the two levers interact, the order to apply them in, and the failure mode that costs teams the cache without anyone noticing.
To stack the two correctly you have to be precise about what each one touches, because they discount different parts of the bill. Batch applies a flat discount to the standard input and output token rate in exchange for accepting a completion window rather than an instant response. It does not care what is in the request. It just charges half for the whole thing. Caching works differently. It discounts the cost of reprocessing a stable block of context that recurs across requests, returning up to ninety percent on those cached tokens, but it does nothing for output and nothing for the unique part of each request. So batch discounts everything by a fixed fraction, while caching discounts the repeated input deeply but leaves output and novel input alone. Understanding that division is what lets you predict how much stacking will save and why the order matters.
Batch halves the rate on the whole request. Caching removes up to ninety percent of the cost of the repeated context. They touch different parts of the bill, which is exactly why they compound rather than overlap.
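To make that division concrete, here is a minimal sketch of the per-request arithmetic. The multipliers restate the figures above (half for batch, up to ninety percent off cached input); the `Request` fields, the function name, and the rates are hypothetical, and the sketch assumes the two discounts multiply and ignores any premium charged for writing the cache. Check current pricing before relying on the numbers.

```python
from dataclasses import dataclass

# Illustrative multipliers taken from the prose, not quoted prices.
BATCH_MULTIPLIER = 0.5       # batch: half the standard rate on the whole request
CACHE_READ_MULTIPLIER = 0.1  # cache hit: up to ninety percent off the cached prefix


@dataclass
class Request:
    prefix_tokens: int  # stable context shared across requests
    unique_tokens: int  # input specific to this request
    output_tokens: int  # generated tokens (caching never discounts these)


def request_cost(req, input_rate, output_rate, *, batch=False, cache_hit=False):
    """Cost of one request, in whatever unit the per-token rates use."""
    prefix_mult = CACHE_READ_MULTIPLIER if cache_hit else 1.0
    cost = (
        req.prefix_tokens * input_rate * prefix_mult  # only the prefix is cacheable
        + req.unique_tokens * input_rate              # novel input pays full rate
        + req.output_tokens * output_rate             # output pays full rate
    )
    # Batch scales the entire request, whatever caching has already removed.
    return cost * (BATCH_MULTIPLIER if batch else 1.0)
```

With a hypothetical request of 8,000 prefix tokens, 1,000 unique tokens, and 500 output tokens, at made-up rates of 1.0 per input token and 5.0 per output token, the function gives roughly 11,500 with neither lever, 5,750 with batch only, 4,300 with a cache hit only, and 2,150 with both.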
Stacking is not just turning both features on. There is a correct order, and reversing it leaves money behind even when every lever is in use. The sequence is route first, cache second, batch last, where routing means sending each request to the cheapest model that can handle it.
The order matters because each step changes the base for the next. Routing first means caching and batch apply to a cheaper model's rates. Caching second means batch is halving a smaller input bill. Batch last means the rate discount applies to the lowest figure you can get the request down to. Teams that batch a workload still running uniformly on Opus with an uncacheable prompt are using one lever on a number that two earlier levers should have shrunk first.
This is the failure mode we see most often, and it is subtle enough that teams ship it without noticing. Caching depends on the shared prefix being identical and recurring close enough together to stay warm. When you assemble a batch, the order and grouping of the requests inside it determine whether the cache stays warm across the run. If a pipeline interleaves requests that use several different prompt versions or reference documents into one batch, the prefix keeps changing, the cache keeps missing, and you pay full input price on work you believed was cached. The batch discount still applies, so the line item goes down and the invoice looks acceptable, but you have silently forfeited the larger of the two levers and nobody flagged it because the number fell anyway. A batch that looks cheaper but lost its cache is the most expensive kind of saving, because everyone believes the problem is already solved.
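One way to guard against this before submission is to group requests by their shared prefix and measure how often the prefix changes in the assembled order. The sketch below uses only the standard library. The request shape (a dict with `"prefix"` and `"body"` keys) and the function names are hypothetical, and whether the service processes a batch in the order you submit it is an assumption to verify. Submitting one batch per prefix group is the stricter variant.

```python
import hashlib
from collections import defaultdict


def prefix_key(prefix: str) -> str:
    """Stable fingerprint of a prefix; any byte difference yields a new key."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def group_by_prefix(requests):
    """Map each prefix fingerprint to the requests that share it.

    `requests` is a list of dicts with hypothetical keys "prefix" and "body".
    Groups appear in order of first occurrence.
    """
    groups = defaultdict(list)
    for req in requests:
        groups[prefix_key(req["prefix"])].append(req)
    return dict(groups)


def order_for_cache(requests):
    """Flatten the groups so identical prefixes sit next to each other."""
    ordered = []
    for group in group_by_prefix(requests).values():
        ordered.extend(group)
    return ordered


def prefix_switches(requests):
    """Count prefix changes between consecutive requests (lower is better)."""
    keys = [prefix_key(r["prefix"]) for r in requests]
    return sum(1 for a, b in zip(keys, keys[1:]) if a != b)
```

Comparing `prefix_switches` before and after `order_for_cache` gives a cheap pre-flight check: a fully grouped run switches prefix once per distinct prefix minus one, and anything higher suggests interleaving.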
Because the two levers touch different parts of the bill, you can estimate the combined effect before committing engineering time, and you should, so the project is justified by a number rather than a hope. Look at four things for the workload in question. The prefix ratio tells you how much of each request is stable and shared versus unique per call, which sets how much caching can reach. The output share matters because long outputs benefit from batch but not from caching, since caching does not discount output. The reuse window tells you how often the same prefix recurs, which sets the achievable cache hit rate. And the deadline tolerance confirms the work can actually wait inside the batch window. A workload with a high prefix ratio, modest output, a short reuse window (the same prefix recurring often enough to stay warm), and real deadline tolerance is the ideal stacking candidate and will show the largest combined saving. A workload with mostly unique input and large output gains from batch but little from caching, so you stack what applies and do not force the lever that does not fit.
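The four inputs can be turned into a rough pre-project estimate. The sketch below is a back-of-envelope model, not a pricing calculator. It assumes `output_share` is the fraction of the baseline bill spent on output, `prefix_ratio` is the fraction of input tokens that are shared prefix, `hit_rate` stands in for the reuse window, and a cache hit removes ninety percent of the prefix cost. The function name and parameters are hypothetical, and cache-write costs are ignored.

```python
def estimated_saving(prefix_ratio, output_share, hit_rate, deadline_ok,
                     cache_discount=0.9, batch_discount=0.5):
    """Fraction of the baseline bill saved by stacking caching and batch.

    The baseline bill is normalised to 1.0. All inputs are fractions in [0, 1],
    except `deadline_ok`, which says whether the work can wait for a batch.
    """
    for name, value in (("prefix_ratio", prefix_ratio),
                        ("output_share", output_share),
                        ("hit_rate", hit_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    input_share = 1.0 - output_share
    prefix_cost = input_share * prefix_ratio          # reachable by caching
    unique_cost = input_share * (1.0 - prefix_ratio)  # never cached

    # Only cache hits earn the discount; misses pay the full input rate.
    prefix_cost *= 1.0 - hit_rate * cache_discount

    remaining = output_share + unique_cost + prefix_cost
    if deadline_ok:
        remaining *= 1.0 - batch_discount  # batch scales whatever is left
    return 1.0 - remaining
```

For an illustrative workload with a prefix ratio of 0.9, an output share of 0.2, and a hit rate of 0.8 that can tolerate the batch window, the estimate is a saving of about 76 percent. Flip it to a prefix ratio of 0.1 and an output share of 0.7 and the estimate is about 51 percent, barely above what batch alone would give, which matches the point about not forcing the lever that does not fit.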