Caching for RAG and Document Analysis

Retrieval augmented generation and document analysis are two of the most common Claude workloads in the enterprise, and they share a feature that makes them ideal candidates for caching: they send large amounts of the same context over and over. A RAG system prepends the same system instructions and often the same retrieved passages across many queries. A document analysis pipeline asks question after question against the same long document. In both cases, the expensive part of the request, the large block of input tokens, is being paid for fresh on every call even though it has not changed. Prompt caching exists precisely to stop that. When the stable portion of the input is cached, repeated reads of it are billed at up to ninety percent off the normal input rate, which on these workloads is the difference between a cost that scales painfully with query volume and one that barely moves. The catch is that the saving only lands if the context is structured so the cache hits, and that is a design choice, not a default.

Why these workloads waste so much

The waste in an uncached RAG or document workload is structural and easy to miss because each individual request looks reasonable. The system assembles a prompt, sends it, gets an answer, and moves on. What is invisible in any single request is that the large context block at the front, the instructions, the schema, the retrieved documents, the long source text, is identical or nearly identical to the block sent on the last request and the one before that. Across thousands or millions of queries, you are paying full input price for the same tokens thousands or millions of times. The user query at the end is small and genuinely varies; the large context at the front is the cost and it largely repeats. Caching attacks exactly the part of the bill that is pure repetition, which is why on these workloads it is often the single largest available saving after model routing.

How caching changes the economics

The mechanism is straightforward. You mark the stable prefix of your prompt as cacheable. The first request that includes it pays a small premium to write it into the cache. Every subsequent request that reuses the same prefix reads it from the cache at a steep discount, up to ninety percent below the standard input rate, rather than paying the full price to reprocess it. For a document analysis session asking twenty questions of one long document, that means you pay close to full price to load the document once and a fraction of the price on the next nineteen questions. For a RAG system with a large stable instruction block and frequently reused passages, it means the fixed part of every prompt becomes cheap while you still pay normally for the small varying query. The arithmetic is dramatic precisely because the repeated portion is the large one.

Structure the prompt for cache hits

Caching only pays when the cache hits, and a hit requires that the cached prefix be byte for byte identical from one request to the next. This is where most of the design work lives, and it comes down to ordering. The prompt has to be arranged with the stable content first and the variable content last:

Put the unchanging material, system instructions, schemas, fixed examples, and the source document or stable retrieved passages, at the front, in a consistent order.
Put the part that changes from request to request, the user's specific question, at the end, after the cacheable prefix.
Keep the prefix stable. If anything inside the cached region changes, even slightly, even a timestamp or a reordered passage, the cache misses and you pay full price as if there were no cache at all.

The common failure is interleaving variable content into the prefix, a dynamic value sprinkled near the top, or retrieved passages assembled in a different order each time, which breaks the byte for byte match and silently destroys the saving. A prompt that is not deliberately ordered for caching usually does not cache, even if caching is switched on.

The RAG specific design

RAG adds a wrinkle, because the retrieved passages are by nature dynamic, selected per query. The instinct is that this defeats caching, but it does not if you separate the layers. The system instructions and any fixed context are genuinely stable and should sit in a cached prefix that hits on every single query, capturing a large saving on its own. Above that, you can often cache at the level of frequently retrieved passages: if certain documents are returned for a large share of queries, caching them when they appear captures repeat reads even though the full retrieval set varies. The design principle is to find the stable layers within an apparently dynamic workload and cache each at its own level, rather than concluding that because retrieval varies, nothing can be cached. Most RAG systems have more stable content than their authors assume.

Mind the cache lifetime

A cached prefix does not live forever; it expires after a window, and a request that arrives after expiry pays to write the cache again rather than reading it cheaply. This makes the timing of your traffic part of the economics. A document analysis session where the questions come in a burst captures the saving beautifully, because the reads cluster inside the cache lifetime. A workload where requests against the same context are spread far apart may see the cache expire between them, so you keep paying the write premium without enough reads to amortize it. The practical implication is to batch related requests together in time where you can, so that the reads against a cached prefix happen while it is still warm. Caching rewards locality of access, and structuring your traffic to create that locality is part of capturing the full saving.

Why this matters before you commit

For RAG and document analysis, caching is not a marginal tweak, it can reshape the entire cost curve of the workload, which means it should be in place before you size any commitment. A buyer who caches these workloads first commits to a consumption number that reflects the cached reality, far lower than the uncached figure, rather than locking in spend for repetition they could have made nearly free. Sizing a commit against uncached RAG costs and then turning on caching afterward means committing to tokens you will never use. The disciplined sequence is to design the caching, prove the lower run rate, and then negotiate the commitment around the optimized number.

Where this fits

Caching for RAG and document analysis is one of the highest leverage moves in token optimization, sitting alongside model routing, batch, and output control. For the full method, the prompt structures, the cache lifetime planning, and how the levers compound, read the pillar guide, the token optimization playbook. Download it for the cache friendly prompt patterns and the worksheet to find the stable layers in your own workload.

Caching for RAG and document analysis.