Independent buyer side advisory · Anthropic only · New York · London
Token Optimization

Output length control as a cost strategy.

Output tokens cost several times more than input tokens on Claude. Controlling how much the model writes is one of the cheapest, fastest ways to cut your bill without touching quality.

Buyer side analysis · 9 min read
34% average reduction in Claude spend
$40M+ in Anthropic commitments advised
100% Anthropic focus, no other vendor

Here is a fact most teams know but few act on: on Claude, as on most major models, output tokens cost several times more than input tokens. That asymmetry has a direct consequence. The cheapest token to remove from your bill is an output token, and the easiest output token to remove is the one the model wrote that nobody needed. Output length control is one of the most overlooked cost strategies precisely because it feels too simple to matter. It matters a great deal, because the expensive side of the ledger is also the side you control most directly.

Why output costs more than input

Input and output are priced differently for a structural reason: generating a token is more computationally demanding than reading one. The model can process the tokens of your prompt in parallel, in a single pass. It has to produce the response one token at a time, and each new token requires its own pass through the model. So the rate per output token is higher; on Anthropic's published list prices the multiple has typically been about five. For a workload where responses are long, the output side can dominate the bill even when the prompts are large. This is why two applications with similar input volumes can have very different costs: the one that produces verbose responses pays far more, because it is generating more of the expensive token.
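To make the asymmetry concrete, here is a minimal sketch of the arithmetic. The rates are illustrative assumptions (3 and 15 dollars per million tokens, a five to one ratio), not a quote of anyone's current pricing, and `request_cost` is a hypothetical helper.

```python
# Illustrative rates in dollars per million tokens. These are assumptions
# for the example; substitute the rates on your own contract.
INPUT_RATE_PER_MTOK = 3.00
OUTPUT_RATE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the illustrative rates above."""
    input_cost = input_tokens * INPUT_RATE_PER_MTOK
    output_cost = output_tokens * OUTPUT_RATE_PER_MTOK
    return (input_cost + output_cost) / 1_000_000


# Two applications with the same 2,000 token prompt.
terse_cost = request_cost(input_tokens=2_000, output_tokens=50)
verbose_cost = request_cost(input_tokens=2_000, output_tokens=800)

# Share of the verbose request's cost that comes from output alone.
verbose_output_share = (800 * OUTPUT_RATE_PER_MTOK) / (verbose_cost * 1_000_000)

print(f"terse:   ${terse_cost:.5f} per request")
print(f"verbose: ${verbose_cost:.5f} per request")
print(f"output share of verbose cost: {verbose_output_share:.0%}")
```

Under these assumed rates the verbose application costs roughly 2.7 times as much per request despite an identical prompt, and about two thirds of its cost is output.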

Verbosity is a cost you are paying for nothing

Models, left to their own defaults, tend toward thoroughness. They restate the question, explain their reasoning at length, add caveats, and wrap up with a summary. In a chat interface a human reads, some of that is welcome. In a production pipeline that extracts a value, classifies an input, or returns a structured result, all of it is waste. The system does not read the preamble. It parses the answer, so every word of explanation is an output token you paid a premium for and then discarded. Verbosity in an automated workload is pure cost with no return, and it is everywhere once you start looking.

The levers for controlling output

  • Ask for the format you actually need: a value, a label, a short structured object, not a paragraph around it.
  • Set a maximum output length so a response cannot run away into hundreds of unneeded tokens.
  • Instruct the model to skip preamble and restatement and return only the answer.
  • Prefer structured output, where the schema itself bounds the length, over free text.
  • Remove the standing instructions that invite long answers when the task does not need them.

Tell the model what shape you want

The single most effective control is to specify the output format precisely. If you need a category, ask for the category and nothing else. If you need three fields, ask for those three fields in a structured form. The model is highly responsive to instructions about format, and a clear instruction to return only the answer, with no explanation, can cut the output token count dramatically on a task that previously produced a paragraph. This costs you nothing in quality, because the explanation was never used. It simply stops you paying the premium rate for content the system discards.
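As a sketch of what "ask for the category and nothing else" can look like, the snippet below pairs a format instruction with a strict parser. The label set, the prompt wording, and both function names are hypothetical; the point is that the parser accepts a bare label only, so any padding the model adds is surfaced as an error instead of being paid for silently.

```python
LABELS = ("billing", "technical", "account", "other")  # hypothetical label set


def build_system_prompt(labels=LABELS) -> str:
    """System prompt that asks for the label and nothing around it."""
    return (
        "Classify the support ticket into exactly one category.\n"
        f"Allowed categories: {', '.join(labels)}.\n"
        "Respond with the category name only. "
        "No explanation, no preamble, no punctuation."
    )


def parse_label(raw_response: str, labels=LABELS) -> str:
    """Accept the response only if it is a bare label; otherwise fail loudly."""
    candidate = raw_response.strip().lower()
    if candidate not in labels:
        raise ValueError(f"expected one of {labels}, got {raw_response!r}")
    return candidate


print(build_system_prompt())
print(parse_label(" Billing\n"))
```

Failing on anything other than a bare label is a deliberate choice here: a lenient parser that digs the label out of a paragraph hides exactly the verbosity you are trying to remove.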

Cap the length where it can run away

Some tasks have well behaved output and some do not. For the ones that can run long, setting a maximum output length is a cheap insurance policy. It prevents the occasional response that balloons to many times the typical size from quietly inflating your average. Bear in mind that the cap is a hard cutoff: it does not make the model write more concisely; it simply stops generation when the limit is reached. So set it with headroom above the length a good answer actually needs, so that it does not truncate legitimate responses, but low enough to catch the runaway, and track how often responses hit it. This is particularly valuable in high volume workloads, where a small number of very long responses can move the whole bill. A conservative global cap gives you a backstop across every call, and tighter caps per workload do the real work.
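On the Anthropic Messages API the cap is the `max_tokens` request parameter, and a response that was cut off by it reports a `stop_reason` of `max_tokens`. One way to choose the number is from your own logs. The sketch below takes a high quantile of observed output lengths and adds headroom; the function names, the 99th percentile, the 1.25 multiplier, and the sample data are all illustrative assumptions.

```python
import math


def suggest_max_tokens(output_token_counts, quantile_pct=99, headroom=1.25):
    """Pick a cap: a high quantile of observed output lengths, plus headroom.

    quantile_pct and headroom are illustrative defaults, not recommendations.
    """
    if not output_token_counts:
        raise ValueError("need at least one observed output length")
    ordered = sorted(output_token_counts)
    # Nearest rank quantile: smallest value with quantile_pct% of data at or below it.
    rank = max(1, math.ceil(len(ordered) * quantile_pct / 100))
    return math.ceil(ordered[rank - 1] * headroom)


def mean_with_cap(output_token_counts, cap):
    """Average billed output length if every response were cut off at cap."""
    return sum(min(count, cap) for count in output_token_counts) / len(output_token_counts)


# Hypothetical sample: 97 typical responses, two longer ones, one runaway.
observed = [40] * 97 + [60, 80, 2_500]
cap = suggest_max_tokens(observed)
uncapped_mean = sum(observed) / len(observed)
capped_mean = mean_with_cap(observed, cap)
print(f"cap={cap} uncapped_mean={uncapped_mean:.1f} capped_mean={capped_mean:.1f}")
```

In this made-up sample a single runaway response accounts for more than a third of the average output length. Any response that hits the cap is truncated, so the share of responses ending with `stop_reason` equal to `max_tokens` is worth watching after the cap goes in.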

Structured output bounds the cost

When the downstream system needs a structured result, ask for one directly. A schema that defines the fields and their types naturally bounds the output, because the model fills the fields rather than composing prose around them. Structured output is usually what the pipeline wanted anyway, so this is not a compromise. It is alignment between what you ask for and what you consume. The side effect is cost control: a tightly defined object is short, the parsing is reliable, and there is no room for the model to add the explanatory padding that drives up the output token count in free text.
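A minimal sketch of the consuming side, using only the Python standard library: the three fields, the allowed sentiment values, and the `parse_ticket_result` helper are hypothetical. The reply is accepted only if it is exactly the object the pipeline asked for.

```python
import json
from dataclasses import dataclass

SENTIMENTS = {"positive", "neutral", "negative"}  # hypothetical allowed values


@dataclass(frozen=True)
class TicketResult:
    category: str
    sentiment: str
    urgent: bool


def parse_ticket_result(raw_response: str) -> TicketResult:
    """Parse the model's JSON reply and reject anything outside the schema."""
    data = json.loads(raw_response)  # raises ValueError on prose or padding
    if not isinstance(data, dict) or set(data) != {"category", "sentiment", "urgent"}:
        raise ValueError(f"unexpected shape: {raw_response!r}")
    if not isinstance(data["category"], str):
        raise ValueError("category must be a string")
    if not isinstance(data["sentiment"], str) or data["sentiment"] not in SENTIMENTS:
        raise ValueError("sentiment outside the allowed values")
    if not isinstance(data["urgent"], bool):
        raise ValueError("urgent must be true or false")
    return TicketResult(**data)


reply = '{"category": "billing", "sentiment": "negative", "urgent": true}'
print(parse_ticket_result(reply))
```

Because the parser rejects extra keys and surrounding prose, the output stays bounded by the schema, and a response that drifts back toward free text shows up as a parse failure.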

Watch the standing instructions

Often the cause of verbose output is not the immediate request but a standing instruction in the system prompt that tells the model to be thorough, to explain its reasoning, or to be helpful and complete. These instructions made sense for one use case and then propagated to others where they only add cost. Audit your system prompts for language that invites length, and scope it to the tasks that actually benefit. A reasoning heavy task may genuinely need the model to think out loud. A high volume classification task does not, and removing the invitation to elaborate cuts the output on every call.
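The audit itself can start as a simple text scan. The sketch below flags phrases that tend to invite long answers; the phrase list, the workload names, and the `audit_system_prompts` helper are assumptions for illustration, and a match is a prompt to review, not proof of waste.

```python
import re

# Hypothetical phrase list; extend it with the wording your own prompts use.
LENGTH_INVITING_PATTERNS = (
    r"\bbe thorough\b",
    r"\bexplain your reasoning\b",
    r"\bstep by step\b",
    r"\bin detail\b",
    r"\bhelpful and complete\b",
    r"\bcomprehensive\b",
)


def audit_system_prompts(prompts_by_workload):
    """Map each workload to the length inviting phrases found in its system prompt."""
    findings = {}
    for workload, prompt in prompts_by_workload.items():
        hits = [
            match.group(0).lower()
            for pattern in LENGTH_INVITING_PATTERNS
            for match in re.finditer(pattern, prompt, flags=re.IGNORECASE)
        ]
        if hits:
            findings[workload] = hits
    return findings


prompts = {
    "ticket_classifier": "You are a helpful assistant. Be thorough and explain your reasoning.",
    "label_only": "Return only the category name.",
}
findings = audit_system_prompts(prompts)
print(findings)
```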

Where output control fits the bigger picture

Output length control is one lever among several, and it works best alongside the others. Routing each task to the cheapest capable model addresses the rate. Caching addresses the repeated input. Batch processing addresses the timing, trading immediacy for a lower rate on work that can wait. Output control addresses the volume of the expensive token. Together these techniques commonly cut aggregate spend by forty to seventy percent versus an unoptimized baseline. Output control is the one that requires the least engineering, often just a clearer instruction and a length cap, which makes it the natural place to start. It pays back immediately and it interacts cleanly with everything else.

The reasoning case, handled carefully

There is one place where cutting output can hurt quality, and it is worth naming so you do not overcorrect. On genuinely hard reasoning tasks, the model often produces better answers when it is allowed to work through the problem before committing to a conclusion, and that working is output you are paying for. Suppressing it on a task that needs it can degrade the result. The discipline is therefore selective rather than blanket. On simple, high volume tasks, cut the output hard, because the elaboration adds cost and nothing else. On the small set of tasks that genuinely require reasoning, allow the model the room it needs, and treat that output as a cost that buys real quality. The mistake is applying either rule everywhere: stripping reasoning from tasks that need it, or permitting verbosity on tasks that do not. Match the output discipline to the difficulty of the task, and you cut output where it is waste while keeping it where it buys quality.
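One way to keep the discipline selective is to make it an explicit policy per task type, with terse as the default and reasoning as an opt in. Everything in this sketch is hypothetical: the task names, the token limits, and the instruction wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPolicy:
    max_tokens: int
    instruction: str


# Hypothetical policies and limits; tune both per workload.
TERSE = OutputPolicy(
    max_tokens=50,
    instruction="Return only the answer. No explanation.",
)
REASONED = OutputPolicy(
    max_tokens=2_000,
    instruction="Work through the problem, then state the final answer on the last line.",
)

# Only tasks listed here get room to reason; everything else defaults to terse.
REASONING_TASKS = {"contract_risk_review", "root_cause_analysis"}


def policy_for(task_type: str) -> OutputPolicy:
    """Choose the output policy for a task, defaulting to the terse one."""
    return REASONED if task_type in REASONING_TASKS else TERSE


print(policy_for("ticket_classification"))
print(policy_for("root_cause_analysis"))
```

Defaulting to terse means a new workload has to justify its extra output, which is the cheaper direction to be wrong in for high volume tasks.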

Measuring output as a metric

Output control sticks when you measure it. Track the average output length per workload, and watch for the workloads where the average is high relative to what the downstream system actually consumes. That gap is your opportunity, and it is invisible until you measure it. A classification endpoint that returns paragraphs when it needs a single label will show a large gap immediately, pointing you straight to the fix. Tracking output length per workload also catches regressions, where a prompt change quietly reintroduces verbosity and inflates the bill before anyone notices. Like cache hit rate and model mix, output length is a first class cost metric, and the teams that watch it hold their savings while the teams that do not watch their bills drift back up as the system evolves.
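A minimal sketch of the regression check, assuming your request logs give you a workload name and an output token count per call. The function names, the 20 percent tolerance, and the sample figures are illustrative.

```python
from collections import defaultdict


def mean_output_by_workload(records):
    """records: iterable of (workload, output_tokens) pairs from your request logs."""
    totals = defaultdict(lambda: [0, 0])  # workload -> [token_sum, request_count]
    for workload, output_tokens in records:
        totals[workload][0] += output_tokens
        totals[workload][1] += 1
    return {workload: total / count for workload, (total, count) in totals.items()}


def find_regressions(baseline, current, tolerance=0.20):
    """Workloads whose mean output grew more than tolerance over the baseline."""
    return sorted(
        workload
        for workload, mean_now in current.items()
        if workload in baseline and mean_now > baseline[workload] * (1 + tolerance)
    )


# Hypothetical logs: a prompt change made the classifier start writing paragraphs.
last_week = mean_output_by_workload([("classify", 12), ("classify", 14), ("summarize", 300)])
this_week = mean_output_by_workload([("classify", 90), ("classify", 110), ("summarize", 310)])
regressed = find_regressions(last_week, this_week)
print(regressed)
```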

A quick audit you can run today

If you want a fast read on whether output control will help you, sample a few hundred real responses from your highest volume workload and look at two things: how long the responses are, and how much of each response the downstream system actually uses. If the responses are long and the system parses a small structured answer out of them, you are paying the premium output rate for content that is discarded on every call, and a clearer instruction plus a length cap will cut the bill immediately with no quality cost. If the responses are already tight and fully consumed, output control is not your lever and you should look at model routing or caching instead. The audit takes an afternoon and it tells you exactly whether this is the place to start.
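The two measurements in that audit reduce to a few lines. In this sketch each sample pairs a full response with the part the downstream system actually used; character counts stand in for tokens as a rough proxy, and the helper name, the 50 percent threshold, and the sample strings are assumptions.

```python
def audit_sample(samples, min_wasted_share=0.5):
    """samples: list of (full_response, part_actually_used) string pairs.

    Character counts stand in for tokens here; that is a rough proxy.
    """
    total_chars = sum(len(full) for full, _ in samples)
    used_chars = sum(len(used) for _, used in samples)
    if total_chars == 0:
        raise ValueError("sample contains no response text")
    wasted_share = 1 - used_chars / total_chars
    return {
        "mean_response_chars": total_chars / len(samples),
        "wasted_share": wasted_share,
        "output_control_is_a_lever": wasted_share >= min_wasted_share,
    }


# Hypothetical sample from a classification workload.
sample = [
    ("Based on the content of the ticket, the most fitting category is: billing", "billing"),
    ("This message describes a login failure, so I would classify it as: account", "account"),
]
report = audit_sample(sample)
print(report)
```

A high wasted share on a high volume workload points to output control as the place to start. A low one suggests looking at model routing or caching instead.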


Not affiliated with Anthropic PBC. Independent buyer side advisory only.