Independent buyer side advisory · Anthropic onlyNew York · London
Token Optimization

The real cost of long system prompts.

A system prompt is billed on every call, so a heavy one is a tax you pay millions of times. Here is what that actually costs on Claude, how caching changes the math, and what to do about it before you sign a commitment.

Buyer side analysis · 9 min read
34%
Average reduction in Claude spend
$40M+
Anthropic commitments advised
100%
Anthropic focus, no other vendor

The system prompt feels free because you write it once. It is the most expensive sentence to forget in all of Claude cost work. A system prompt is prepended to every single call a workload makes, which means its token weight is multiplied by your total call volume. A few hundred tokens of extra instruction, harmless in isolation, becomes a line item once you send it a few million times. Teams routinely discover that the largest single source of waste in a high volume application is not the user input and not the output, but the standing instruction block that nobody had reviewed since the application launched. This piece explains the math plainly, shows how caching reshapes it, and lays out what a buyer should do before committing to a number with Anthropic.

The multiplication that catches people out

The cost of a system prompt is simple arithmetic that is easy to ignore. Take the prompt token count, multiply by the number of calls, multiply by the input token rate. The first number feels small, the second is large, and the product is what lands on the invoice. The reason it surprises people is that the system prompt is invisible in day to day work. Developers see the user input and the model output because those change with every request. The system prompt sits in a configuration file, unchanging, out of sight, and out of mind, quietly riding along on every call. Once you put the volume next to it, the picture changes. A workload running at high volume can spend more on its standing instructions than on everything the user actually typed.

Output is more expensive, but input volume is relentless

It is worth being precise about where system prompt cost sits. System prompt tokens are input tokens, and input is the cheaper side of the ledger, with output billed at a higher rate. So a long system prompt is not the most expensive token per token. What makes it costly is that it is paid on every call without exception, where output length at least varies with the task. The relentlessness is the problem. A heavy system prompt is a fixed cost stamped on every request whether that request was large or tiny, which means low value calls carry the full weight of the instruction block just like high value ones. That is why trimming the system prompt often improves the unit economics of exactly the cheap, high frequency calls where margin matters most.

How caching changes the calculation

Prompt caching changes this picture, and a buyer needs to understand how before deciding the system prompt does not matter. Caching takes up to 90 percent off the cost of repeated input, and the system prompt is the most repeated input there is, identical on every call. Cached properly, a long system prompt becomes far cheaper to resend, because you pay close to full rate only to write it to the cache and a small fraction to read it thereafter. This is real and large, and it should be the first thing any team with a heavy system prompt does. But caching does not make length free. The cache write is billed, cache entries expire and have to be refreshed, and any change to the prompt invalidates the cache and forces a fresh write. A long system prompt that changes often gets the worst of both worlds: it is too dynamic to cache well and too heavy to send raw. The lesson is to structure the prompt so the stable part is large and cacheable and the changing part is small.

What a heavy prompt costs you beyond the bill

There is a second cost to a long system prompt that does not appear on the invoice. Every token of instruction is a token of context the model has to attend to, and very long instruction blocks can dilute the model focus, bury the important rules among the unimportant ones, and make behavior harder to predict. Teams sometimes add instruction after instruction to fix a problem, when the real issue is that the prompt has grown so long the model is no longer reliably following the rule that was already there. Trimming a bloated system prompt frequently improves output quality at the same time as it lowers cost, because the rules that remain are clearer and carry more weight. The long prompt was costing you twice: once in tokens and once in reliability.

What to do about it

The remedy is a short, disciplined pass. First, measure the system prompt token count and multiply by call volume so you know the size of the prize. Second, cut instructions that no longer change behavior, testing one change at a time against a quality bar. Third, deduplicate anything the prompt states more than once. Fourth, tighten verbose language into precise rules. Fifth, structure what remains so the large stable block is cacheable and the small variable part sits outside it. Sixth, confirm with evaluation that the output held. None of this is exotic. It is the same compression discipline applied to the single highest leverage block in the application, and because that block rides on every call, the return per hour of engineering time is usually the best in the whole optimization program.

Why this matters before you commit

Here is where the system prompt connects to the negotiation, and why a bottom of funnel buyer should care. When you commit to a spend level with Anthropic, you commit against your run rate. If that run rate is inflated by a heavy, uncached system prompt riding on millions of calls, you will commit to a larger number than your optimized application needs, pay more across the whole term, and expose yourself to unused commitment if usage comes in lower than the bloated baseline implied. Fixing the system prompt before you commit shrinks the baseline, which shrinks the commit, which lowers your risk and strengthens your position on the rate. The order matters. Optimize first, then commit to the optimized number. A buyer who signs first and trims later has paid for the waste twice, once in the running bill and once in an oversized commitment.

The buyer checklist

  • Measure the system prompt token count and multiply by call volume to size the cost.
  • Cut instructions that no longer change behavior, testing one change at a time.
  • Deduplicate and tighten what remains into precise rules.
  • Structure the prompt so the stable block is large and cacheable, the variable part small.
  • Confirm output quality held with an evaluation set before and after.
  • Optimize the baseline before committing, so the commit is sized against the lean run rate.

The system prompt is the cheapest thing to write and one of the most expensive to leave alone. For the full optimization framework across model routing, caching, and batch, read the pillar guide, the token optimization playbook. If you are heading into a commitment and want the baseline cleaned up first, get a quote and we will run the optimization and the negotiation as one engagement.

Paying for a prompt you never read again?

Get a quote and we will trim the system prompt, cut the baseline, and size your Anthropic commitment against the lean number.

Get a Quote
Get started
Tell us what you are negotiating.

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.