Independent buyer side advisory · Anthropic onlyNew York · London
Model Selection

The tokenizer differences that affect cost.

Claude bills by the token, and the path from your text to a token count is not obvious. The same content can cost meaningfully more or less depending on how it tokenizes. Here is what drives the count, where it quietly inflates your bill, and how to measure and control it.

Buyer side analysis · 11 min read
34%
Average reduction in Claude spend
$40M+
Anthropic commitments advised
100%
Anthropic focus, no other vendor

Every conversation about Claude pricing eventually reduces to one unit: the token. You are billed for the tokens you send and the tokens the model returns, so the entire cost of an application is the number of tokens it moves multiplied by the rate. Most buyers focus only on the rate, because the rate is what gets negotiated. Far fewer pay attention to the token count, even though it is the other half of the equation and often the half you have more direct control over. The path from raw text to a billed token count is not intuitive, and the differences in how content tokenizes can move your bill more than a hard fought discount on the rate. This is the part of the cost most teams never look at.

What a token actually is

A token is not a word and not a character. It is a chunk of text the model's tokenizer has learned to treat as a unit, and a token can be a whole common word, a fragment of a longer word, a punctuation mark, or a piece of whitespace. As a rough guide, ordinary English prose runs a little under one token per word, but that average hides a lot of variation. The key point for a buyer is that the token count for a given piece of content is a property of how that specific text tokenizes, not a simple function of its length in words or characters. Two passages of the same word count can carry different token counts, and you pay for the tokens.

Where the count quietly inflates

The reason this matters commercially is that several common kinds of content tokenize far less efficiently than plain prose, and applications send a great deal of exactly that content without realizing the cost.

Structured data and markup

JSON, XML, HTML, and similar structured formats are full of punctuation, brackets, quotes, and repeated keys, and all of that tokenizes into many small tokens. A payload that looks compact to a human can carry a token count far above its apparent size. When an application stuffs large structured blobs into the prompt, it is often paying for a great deal of syntactic overhead that carries little of the meaning the model actually needs.

Code

Source code tokenizes densely for the same reasons as structured data, plus identifiers, indentation, and symbols that fragment into multiple tokens. Sending large code files into a prompt is one of the more expensive things an application can do per unit of apparent content, which matters enormously for code assistance and analysis workloads.

Non English text and unusual characters

Languages and scripts that are less common in the tokenizer's training tend to fragment into more tokens per unit of meaning than English. The same message can cost more to send in one language than another, simply because of how it tokenizes. Emoji, unusual symbols, and certain Unicode sequences can also expand into surprising numbers of tokens.

Whitespace and formatting

Repeated whitespace, deep indentation, and heavy formatting all consume tokens. Content that has been pretty printed for human readability often carries formatting tokens that the model does not need, and which you nonetheless pay for on every call.

Output tokens are the expensive half

It is easy to fixate on the input, because that is the part you write, but output tokens are generated by the model and are typically priced higher than input tokens. A verbose model response costs more than a concise one carrying the same information, and an application that does not constrain output length is paying a premium on every call for words it did not need. Controlling the shape and length of the output is one of the highest leverage things a team can do, because it works on the more expensive side of the meter and applies to every single response.

How to measure your real token economics

You cannot manage what you do not measure, and most teams have never looked at their token counts by workload. The first step is to instrument your traffic so you can see, per feature and per request type, how many input and output tokens you are actually moving. That data almost always reveals surprises: a feature that seemed cheap is sending huge structured payloads, a logging or context injection step is quietly doubling the input, an unconstrained output is running long on every call. Measuring by workload turns the abstract idea of token efficiency into a ranked list of where the tokens, and therefore the money, are going.

What to do once you can see it

With the measurement in place, the levers become concrete. The aim is to send the model only the tokens it needs to do the job, and to take back only the tokens you need from its answer.

  • Trim structured payloads to the fields the model actually uses, rather than passing whole objects with syntactic overhead.
  • Strip unnecessary whitespace and formatting from content before it goes into the prompt.
  • Constrain output length and format explicitly, so the model returns concise, structured answers instead of running long.
  • Summarize or compress long context before sending it, so you pay for the meaning rather than the raw volume.
  • Be deliberate about how you encode code and data in prompts, choosing the most token efficient representation that still works.

How this interacts with the other levers

Token efficiency multiplies the value of every other optimization. A request that has been trimmed of wasted tokens is cheaper on whichever model serves it, so the savings stack with model routing. A stable context that you cache is cheaper to cache if it was efficient to begin with, and caching takes up to 90 percent off the repeated portion on top. Batch takes roughly half off the asynchronous remainder. The token count is the base that all the other levers operate on, which is why getting it right early makes everything downstream cheaper. It is the least glamorous lever and one of the most broadly effective.

The commercial angle

Token efficiency is not only an engineering saving, it is a negotiating one. Your committed spend is sized against your aggregate token consumption, so trimming the wasted tokens out of your workloads lowers the consumption you are committing to. A buyer who walks into a negotiation having already removed the syntactic overhead, the unconstrained outputs, and the bloated payloads is committing to a smaller, cleaner number, which reduces exposure to unused commitment and strengthens the position on the rate. The vendor prices against what you use. The less you waste, the less you commit, and the less there is for an uplift to grow against at renewal.

Why estimates from word count mislead

The most common planning error is to size a workload, or a whole contract, on a word count or a character count converted by a rough rule of thumb. The rule that a token is roughly three quarters of a word holds for ordinary English prose and breaks down badly everywhere else. A workload heavy in code, structured data, or non English text can carry far more tokens per apparent unit than the rule predicts, so a forecast built on word count understates the real consumption, sometimes by a wide margin. The danger is not academic. If you size a committed spend on an estimate that assumes prose efficiency, and your actual workload tokenizes densely, you blow through the commit into overage faster than planned, and the gap is pure surprise cost. The fix is to count tokens on representative samples of your actual content rather than estimating from length, so the number you plan and commit against reflects how your specific content really tokenizes.

The hidden multipliers in real applications

Beyond the raw content, several application patterns quietly multiply the token count in ways teams rarely notice until they look. Each is ordinary, and each can double or more the tokens a workload moves.

Repeated context on every call

Applications that prepend the same large instruction block, document, or example set to every request pay for that block on every call. The per call count looks reasonable, but multiplied across the call volume it becomes one of the largest line items in the bill, and it is exactly the pattern that caching exists to attack. Until it is cached, every call repays the full cost of context that never changed.

Conversation history that grows

Chat and agent applications that resend the full conversation history on every turn see the input grow with the length of the conversation. A long session can end up sending thousands of tokens of history on a turn whose new content is a single sentence, and the cost climbs with every exchange. Managing the history, summarizing or pruning it, is often a larger saving than any change to the model.

Verbose, unconstrained output

A model left to answer freely will often return more than the application needs: preamble, restated context, explanation the caller will discard. Because output tokens are the more expensive half of the meter, this verbosity is costly on every single response, and constraining the output to the shape and length actually used is one of the highest leverage fixes available.

Your Anthropic number is negotiable.

Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.

Get a Quote

The Counteroffer

Weekly intelligence on Anthropic pricing moves and the buyer side counters that work.

Get a Quote · Book a Strategy Call · The Counteroffer · Blog · New York · London Not affiliated with Anthropic PBC. Independent buyer side advisory only.