RAG Workloads and the Caching Opportunity

Retrieval augmented generation, or RAG, is one of the most common enterprise patterns on Claude, and one of the most expensive when it is built without optimization. The pattern is simple: retrieve relevant documents, stuff them into the context, and ask the model to answer grounded in what you supplied. The cost problem is just as simple. Every call carries a large block of context, much of it the same from one request to the next, and you pay for all of it every time. This is exactly the situation prompt caching was built for. Used well, caching turns the repeated context in a RAG workload from your largest line item into a fraction of it, at up to 90 percent off the repeated portion. This is the buyer side view of why RAG and caching belong together.

Why RAG is expensive by default

The cost of a RAG call is dominated by input tokens. You are sending the retrieved passages, often a system prompt and instructions, sometimes few shot examples, and the user query. The retrieved content and the instructions are usually large, and across a high volume application that input is sent again and again. Because input is metered per token on every call, a RAG application's bill is mostly the cost of repeatedly transmitting context the model has effectively already seen. A naive RAG system pays full price to resend the same instructions and overlapping documents thousands of times a day.

What caching changes

Prompt caching lets you mark a stable portion of the context so that, on repeat calls, the model reuses the cached version instead of reprocessing it at full price. The cached portion bills at a steep discount, up to 90 percent off, while only the genuinely new part of each request, typically the user query and any freshly retrieved passages, pays the standard rate. The output is identical. You are not trading quality for the saving, you are simply refusing to pay full price to send the same tokens repeatedly.

What to cache in a RAG system

The art is identifying which parts of your context are stable enough to cache and structuring the prompt so the cache is hit as often as possible.

The system prompt and instructions

The instructions that tell the model how to behave rarely change between calls. They are an obvious caching candidate and should sit at the front of the context so the cached prefix is as long as possible.

Stable reference material

Many RAG systems include a core set of documents that appear in most queries: a policy manual, a product catalog, a knowledge base section. Where the same material is retrieved repeatedly, caching it captures a large saving, because that content would otherwise be resent at full price on every relevant call.

Few shot examples

If you guide the model with examples, those examples are static and belong in the cached portion. They are often substantial in token terms, and caching them removes a recurring cost you were paying on every single call.

Designing the prompt for cache hits

Caching only pays when the cache is actually hit, and that depends on prompt architecture. The key principle is to order the context from most stable to most variable. Put the system prompt, instructions, and stable reference material first, and the volatile content, the freshly retrieved passages and the user query, last. This maximizes the length of the cacheable prefix that stays identical across calls. A prompt that interleaves stable and variable content, or that puts the query before the reference material, breaks the cache and forfeits the saving. Cache friendly architecture is mostly a matter of deliberate ordering.

Caching compounds with the other levers

Caching is strong alone and stronger combined. In a RAG workload it stacks naturally with model routing and batch.

Caching plus model routing

Not every RAG query needs the most capable model. A straightforward lookup against well retrieved context often runs well on Sonnet or even Haiku, while only the hardest reasoning needs Opus. Routing across Opus, Sonnet, and Haiku so each query runs on the cheapest capable model typically cuts aggregate spend 40 to 70 percent versus running everything on Opus, and caching reduces the input cost on top of whatever model handles the call.

Caching plus batch

Where RAG runs asynchronously, such as bulk enrichment or offline question answering over a corpus, the batch path takes roughly half off the real time price, and caching the shared context compounds with it. A batch job that reuses the same instructions and reference material across thousands of items can cache once and pay the reduced rate per item, on top of the batch discount.

The commercial angle

Caching a RAG workload does not only cut the invoice, it lowers the commit you need to negotiate. A RAG application is often a large and growing line of spend, and optimizing it before you commit means the committed number you sign with Anthropic reflects the efficient cost, not the wasteful one. That smaller commit reduces your exposure to unused commitment and strengthens your hand on the rate. Optimize the RAG workload first, then commit to the optimized figure.

Measuring the opportunity before you build

Before you refactor anything, measure the shape of your RAG calls so you know the size of the prize. For a representative sample of requests, break the input into the portion that is stable across calls, the instructions, the system prompt, the recurring reference material, and the portion that is genuinely new, the user query and the freshly retrieved passages. The ratio between them is the headline number. A RAG application where eighty percent of the input is stable is a far bigger caching opportunity than one where most of the input changes every call. This measurement also tells you where to focus, because it shows which workloads carry the most repeated context and therefore stand to gain the most from caching. Optimize from the data, not from the assumption that all RAG calls look alike, because they do not.

When retrieval itself is the problem

Sometimes the cheapest token is the one you never send. A common pattern in poorly tuned RAG systems is over retrieval, where the application pulls far more context than the model needs to answer well, on the theory that more is safer. Every extra passage is input you pay for on every call, and beyond a point it does not improve the answer and can even hurt it by diluting the relevant material. An audit of a RAG workload should look at retrieval quality, not just caching. Tightening the retrieval so it returns fewer, more relevant passages cuts the input cost directly and improves the answer at the same time. Caching reduces what you pay for the context you do send, and better retrieval reduces how much context you need to send at all. The two work together, and the second is often overlooked because teams reach for caching first.

A worked example

Take a customer support assistant that answers questions grounded in a large product knowledge base. Each call carries a long system prompt, a fixed set of behavioral instructions, several retrieved articles, and the user's question. Built naively, every call pays full price for all of it, and the knowledge base content overlaps heavily from one question to the next. Restructure it: put the system prompt and instructions first and cache them, cache the core articles that appear in most answers, tighten retrieval so only the genuinely relevant passages are pulled, and route the straightforward questions to a cheaper model while reserving the hardest for Opus. The assistant gives the same answers, but the input cost that dominated the bill collapses, because the repeated context is now cached at up to ninety percent off and the volatile portion is smaller and runs on a cheaper model. The savings come entirely from architecture, with no change a user would ever notice.

RAG workloads and the caching opportunity.