Independent buyer side advisory · Anthropic onlyNew York · London
Home · Blog · Token Optimization
Token Optimization

Context window discipline for long documents.

Buyer side guide · 14 minute read

Long document workloads are where Claude bills quietly balloon. The pattern is familiar: a team needs the model to reason over a large contract, a research report, a codebase, or a stack of policy documents, so they load the entire thing into the context window on every request. It works, the answers are good, and the invoice climbs in a way nobody quite traces back to its cause. The cause is almost always the same: tokens spent loading context the task did not need. Context window discipline is the practice of putting only what the task requires in front of the model, and on long document workloads it is one of the highest return optimizations available.

Why long documents are so expensive

You pay for every input token on every request. A long document can run to tens or hundreds of thousands of tokens, and when you load the whole thing for a question that only touches a few pages, you are paying the full input cost for the entire document to answer a question about a fraction of it. Multiply that by the number of questions your workload asks per day, and the waste compounds into a very large number.

The temptation to load everything comes from a reasonable fear: leave something out and the model might miss the answer. But indiscriminate loading trades a large, certain cost for protection against a risk that good retrieval design largely removes. The discipline is not about loading less and hoping; it is about loading the right context so accuracy holds while cost falls. There is also a quality dimension. An enormous context can dilute the model's attention, burying the relevant passage among thousands of irrelevant ones, so trimming context can improve answers as well as cost.

The two failure modes

Long document workloads tend to fail in one of two directions, and the right fix depends on which one you have.

Loading too much, too often

The first failure is loading the full document on every request when most requests only need a slice of it. This is the more common and more expensive pattern. The fix is retrieval: pull the relevant passages for each question rather than loading the whole document every time.

Reloading the same context repeatedly

The second failure is sending the same large, stable document on request after request without caching it. Here the document genuinely is needed in full, but you are paying full input price for it each time rather than caching it. The fix is prompt caching, which can cut the cost of that repeated context by up to ninety percent. Many long document workloads suffer from both failures at once, and the two fixes are complementary rather than competing.

Discipline one: retrieve, do not stuff

For workloads where each question touches only part of a document or document set, retrieval is the foundational discipline. Instead of loading everything, you index the content, retrieve the passages relevant to the specific question, and place only those in the context window. A well designed retrieval layer answers the same questions with a small fraction of the tokens, because it loads the few pages that matter rather than the hundreds that do not.

The objection is always accuracy: what if retrieval misses the relevant passage? This is a real engineering concern and it is solvable. Good retrieval is tuned and tested against a representative set of real questions, with the recall measured rather than assumed. You can also retrieve generously, pulling more context than the minimum to create a safety margin, and still load far less than the whole document. The goal is not the smallest possible context; it is the smallest context that reliably contains the answer, and that target is usually a tiny fraction of the full document.

Discipline two: cache what stays stable

When the workload genuinely needs the full document in context, the lever is caching rather than retrieval. If you are reasoning over the same large document across many questions, the document is stable context that should be cached after the first request and reused cheaply on the rest. This requires structuring the prompt so the stable document sits at the front, in a fixed order, with only the changing question at the end, because caching matches from the start of the prompt up to the first point of difference. Get that ordering right and the repeated document cost falls toward a fraction of its original level, even though you are still loading it in full.

The two disciplines combine naturally. Retrieve to load only the relevant portion of a large corpus, and cache the stable instruction and schema that wrap every retrieval. Or, for a single large document queried many times, cache the whole document and skip retrieval entirely. The right combination depends on whether your cost is driven by loading too much per request or by reloading the same thing too often, which is why the audit comes first.

Discipline three: trim the prompt around the document

Beyond the document itself, long document prompts often carry bloated instructions, redundant examples, and verbose formatting that add tokens without adding value. A common pattern is a system prompt that has grown over months as people added clauses without removing any, until it is several times longer than it needs to be. Because that overhead is paid on every request, trimming it compounds across the whole workload. The discipline is to audit the non document portion of the prompt as ruthlessly as the document portion, cutting instructions to the minimum that produces the quality you need.

Output discipline matters too. Long document tasks often produce long outputs, and output tokens are billed at a higher rate than input. Asking for a concise, structured answer rather than an expansive one, and specifying the format you actually need, cuts the more expensive side of the bill. A workload that loads a trimmed context and returns a tight, structured output costs a fraction of one that stuffs the window and returns prose.

How to audit a long document workload

The work starts with measurement, because you cannot optimize what you have not quantified. For a representative sample of real requests, capture how many input tokens are document context, how many are instructions, and how many are genuinely variable per request. Then ask, for each request, how much of the loaded document the answer actually drew on. The gap between what you loaded and what the answer used is your opportunity, and on most unoptimized long document workloads that gap is enormous.

From there the path is clear. If most requests use only a slice of the document, build retrieval. If most requests need the whole document but reload it each time, build caching. If the instructions and outputs are bloated, trim them. Usually you do some of all three. This is exactly the kind of structured audit we run as part of a token optimization engagement, because the savings are large, specific, and easy to verify once the measurement is in place.

Why this belongs in your commercial strategy

Context discipline is an engineering practice, but it is also a commercial one. The consumption number you carry into an Anthropic commitment should reflect a disciplined workload, not a wasteful one. A team that loads whole documents indiscriminately commits to an inflated run rate, then either overpays against it or risks a shortfall when discipline finally arrives. A team that has audited and tightened its long document workloads commits to a real, defensible number, and negotiates its discount from a position of efficient spend rather than padding the vendor is happy to sell. The optimization protects you twice: once on the invoice and again on the commitment.

This is also why the work is worth doing before a renewal or a new commitment rather than after. The savings from context discipline change the number you should be committing to, and committing first then optimizing leaves money stranded in a use it or lose it commitment you no longer need. Sequence the optimization ahead of the negotiation and both work in your favor.

The buyer side takeaway

Long document workloads waste tokens by loading context the task does not need, and the waste compounds across every request. The disciplines are straightforward: retrieve the relevant passages rather than stuffing the whole document, cache the stable context you genuinely need in full, and trim the instructions and outputs around it. Audit a representative sample to find the gap between what you load and what the answer uses, then apply the fix that matches your failure mode. Do the work before you commit, so the run rate you negotiate against is real. If you want us to audit a long document workload and quantify the saving, book a strategy call and we will measure it.

Load what the task needs, nothing more.

Book a strategy call and we will audit a long document workload and quantify the saving.

Book a Strategy Call

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.