Measuring cost per inference on Claude

Buyer side guide · 9 minute read · By Fredrik Filipsson · Published May 29, 2026 · Updated June 12, 2026

You cannot optimize what you cannot measure, and most enterprises running Claude are flying with one number: the monthly invoice. The invoice tells you what you spent but nothing about why, where, or whether it was reasonable. The metric that changes that is cost per inference, the fully loaded price of a single Claude call broken down by the workload it served. This is a buyer side guide to building that metric, reading what it reveals, and using it to cut spend and size a commitment.

We negotiate with Anthropic and optimize the spend beneath the contract, and the first thing we build with any client is this measurement, because every saving and every negotiation depends on it. Both the engineering leader who instruments it and the procurement leader who acts on it need to read it the same way.

What cost per inference actually means

A single Claude call costs the sum of its input tokens at the input rate and its output tokens at the output rate, adjusted for the model used and for any caching or batch discount that applied. Cost per inference is that figure, calculated per call and then aggregated by whatever dimension you care about: feature, model, customer, or team. It is the unit economics of your use of Claude, and it turns a single opaque bill into a map of where the money actually goes.
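
As a minimal sketch of that arithmetic, the function below prices one call. The model names and per million token prices are hypothetical placeholders, not Anthropic's rates, and the two multipliers simply encode the roughly ninety percent cache discount and roughly fifty percent batch discount described in this guide. It also assumes the two discounts stack and ignores any charge for writing to the cache, so check both against your own contract.

```python
# Hypothetical prices in USD per million tokens. Placeholders only,
# not Anthropic's actual rates: substitute your contracted prices.
PRICES = {
    "large-model": {"input": 15.0, "output": 75.0},
    "small-model": {"input": 1.0, "output": 5.0},
}
CACHE_READ_MULTIPLIER = 0.10  # assumption: cached input billed at ~90% off
BATCH_MULTIPLIER = 0.50       # assumption: batch calls billed at ~50% off


def call_cost(model, uncached_input, cached_input, output, batch=False):
    """Return the USD cost of one call under the assumed price table."""
    price = PRICES[model]
    cost = (
        uncached_input * price["input"]
        + cached_input * price["input"] * CACHE_READ_MULTIPLIER
        + output * price["output"]
    ) / 1_000_000
    # Assumption: the batch discount applies on top of any cache discount.
    return cost * BATCH_MULTIPLIER if batch else cost
```

Summing `call_cost` over every call in a group and dividing by the number of calls gives that group's cost per inference.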

The reason it matters is that aggregate spend hides everything useful. Two workloads can cost the same in total while one runs a million cheap calls and the other runs a few thousand expensive ones. They need completely different optimization. Only a per inference view, sliced by workload, tells you which is which.
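
A small illustration of the point, with invented workload names and invented figures: both workloads below land on the same total, and only the per call view separates them.

```python
# Illustrative figures only: two workloads with identical total spend.
workloads = {
    "autocomplete": {"calls": 1_000_000, "total_usd": 2_000.0},
    "contract_review": {"calls": 4_000, "total_usd": 2_000.0},
}

# Cost per inference is total spend divided by call volume.
per_call = {
    name: w["total_usd"] / w["calls"] for name, w in workloads.items()
}

for name, cost in per_call.items():
    print(f"{name}: ${cost:.4f} per call")
```

The first workload is a volume problem and the second is a unit price problem, which is why the same total calls for different levers.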

The components you have to capture

To build the metric properly you need to record, for every call, the things that determine its cost.

Input tokens and output tokens, separately, because they are priced differently and output is usually the more expensive of the two.
The model used, since the same call costs very different amounts on Opus, Sonnet, and Haiku.
Cached versus uncached input, because input read from the cache is billed at up to ninety percent off (writing to the cache carries its own surcharge) and a workload's real cost depends heavily on its cache hit rate.
Whether the call ran in batch, which applies a roughly fifty percent discount and changes the per inference figure substantially.
A tag identifying the workload, feature, or team the call belongs to, so the cost can be attributed rather than pooled.

That last item, the tag, is the one most teams skip and the one that makes the metric useful. Without attribution you have a precise number for total spend and no idea which part of the business to talk to about it. With attribution you can hand each team its own cost per inference and let ownership do the rest.
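
One way to picture the per call record, as a sketch rather than a prescribed schema: the field names below are our own illustration of the items listed above, and `cost_usd` is assumed to have been computed already from your price table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One logged call. Field names are illustrative, not a standard."""
    workload: str             # the attribution tag
    model: str
    input_tokens: int         # uncached input
    cached_input_tokens: int  # input served from cache
    output_tokens: int
    batch: bool
    cost_usd: float           # assumed precomputed from your price table


def cost_per_inference_by_workload(records):
    """Average cost per call for each workload tag."""
    totals = defaultdict(lambda: [0.0, 0])  # tag -> [total cost, call count]
    for r in records:
        totals[r.workload][0] += r.cost_usd
        totals[r.workload][1] += 1
    return {tag: total / count for tag, (total, count) in totals.items()}
```

Drop the `workload` field and the function has nothing to group by, which is the whole argument for tagging.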

What the metric reveals

Once you can see cost per inference by workload, the expensive patterns become obvious, and they are almost always the same ones. A feature running every call on Opus when Sonnet would do shows up as a high per inference cost with no quality justification. A workload that sends a large repeated context with no caching shows up as a high input cost that caching would slash. A batch eligible job running in real time shows up as paying double for latency it does not need. None of these are visible in the monthly invoice. All of them are obvious the moment you measure per inference and group by workload.
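
Two of these patterns reduce to simple ratios over the logged calls. The sketch below assumes each call is a dict with the hypothetical keys shown; the key names are ours, not fields from any API.

```python
def cache_hit_rate(calls):
    """Share of input tokens that were served from cache."""
    cached = sum(c["cached_input_tokens"] for c in calls)
    total = cached + sum(c["uncached_input_tokens"] for c in calls)
    return cached / total if total else 0.0


def batch_share(calls):
    """Share of calls that ran through batch rather than in real time."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if c["batch"]) / len(calls)
```

A workload with a large repeated context and a hit rate near zero is a caching candidate, and a latency tolerant job with a batch share near zero is a batch candidate.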

The metric also reveals the opposite, the workloads that are already lean, so you do not waste engineering effort optimizing calls that are already cheap. This is the difference between a cost program that targets the real spend and one that tinkers at the edges. Measurement tells you where the money is, so you cut where it counts.

Turning the metric into savings

With the data in hand, optimization stops being guesswork. You rank workloads by total cost, which is cost per inference multiplied by volume, and you start at the top. For each high cost workload the metric tells you which lever applies. High model cost points to routing. High uncached input points to caching. Real time processing of latency tolerant work points to batch. You apply the indicated lever, then watch the per inference figure fall, which both confirms the saving and quantifies it for finance.
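
The ranking and the lever selection can be sketched in a few lines. Everything here is illustrative: the workload figures are invented, the flag names are hypothetical, and the fifty percent cache floor is an arbitrary threshold you would tune to your own workloads.

```python
def rank_workloads(stats):
    """Sort by total cost (cost per inference x volume), highest first."""
    return sorted(
        stats, key=lambda s: s["cost_per_call"] * s["calls"], reverse=True
    )


def indicated_levers(s, cache_floor=0.5):
    """Name the levers a workload's profile points to."""
    levers = []
    if s["on_top_tier"] and not s["needs_top_tier"]:
        levers.append("routing")
    if s["repeated_context"] and s["cache_hit_rate"] < cache_floor:
        levers.append("caching")
    if s["latency_tolerant"] and not s["batch"]:
        levers.append("batch")
    return levers


# Invented example data.
stats = [
    {"name": "search", "cost_per_call": 0.002, "calls": 1_000_000,
     "on_top_tier": False, "needs_top_tier": False, "repeated_context": False,
     "cache_hit_rate": 0.0, "latency_tolerant": False, "batch": False},
    {"name": "review", "cost_per_call": 0.50, "calls": 10_000,
     "on_top_tier": True, "needs_top_tier": False, "repeated_context": True,
     "cache_hit_rate": 0.1, "latency_tolerant": True, "batch": False},
]
ranked = rank_workloads(stats)
plan = {s["name"]: indicated_levers(s) for s in ranked}
```

Whether a workload genuinely needs the top tier model is a quality judgment the flag cannot make for you; the sketch only records the answer once you have it.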

This closes the loop. The metric identifies the target, names the lever, and then measures the result. A cost program built on it produces savings you can prove rather than savings you assert, which matters enormously when you are reporting up to a CFO who wants the number, not the narrative.

Using it in the negotiation

Cost per inference is not only an engineering tool; it is also a negotiating asset. When you sit down to size a commitment with Anthropic, the per inference data gives you a forecast grounded in measured unit economics rather than in a guess at total spend. You can model how your aggregate will move as routing, caching, and batch roll out, and commit to the optimized figure with confidence. A buyer who walks in with this measurement negotiates from knowledge. A buyer who walks in with only an invoice negotiates from hope.
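
A rough sketch of that forecast, under the assumption that you can estimate, per workload, a monthly call volume and the fractional reduction in cost per call you expect once the levers are applied. The key names and every figure in the usage note are hypothetical.

```python
def forecast_annual_spend(workloads):
    """Project annual spend from per workload unit economics.

    Each workload dict carries hypothetical keys:
      monthly_calls       expected call volume per month
      cost_per_call       measured cost per inference today, in USD
      expected_reduction  your own estimate of the fractional saving (0 to 1)
    """
    monthly = sum(
        w["monthly_calls"] * w["cost_per_call"] * (1 - w["expected_reduction"])
        for w in workloads
    )
    return monthly * 12
```

For example, a single workload at 100,000 calls a month, $0.02 per call, and an expected fifty percent reduction projects to about $12,000 a year. The estimate is only as good as the reduction figures you feed it, so size the commitment against measured results where you have them.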

It also protects you over the life of the deal. Tracking cost per inference through the term tells you immediately if a workload starts drifting expensive, before it shows up as an overage charge or blows through a commitment. The metric is both how you cut the bill today and how you keep it cut tomorrow.

Where to get the data

The good news is that the raw material for this metric already exists. Every Claude API response reports the tokens it consumed, broken into input and output, and indicates the cache and batch treatment that applied. The work is not collecting new data. It is capturing what is already returned and attaching the context that makes it meaningful: which model served the call and which workload it belonged to. A thin logging layer around your API calls, writing each call's token counts, model, cache and batch state, and workload tag to a store you can query, is enough to build the whole metric.
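
A thin logging layer can be as small as the function below. It is a sketch: the usage attribute names follow our understanding of the usage object on an Anthropic Messages API response and should be checked against the current API reference, the `batch` flag is passed in by the caller rather than read from the response, and the stand-in object replaces a real API call so the example runs offline.

```python
import io
import json
import time
from types import SimpleNamespace


def log_call(usage, model, workload, batch, sink):
    """Write one JSON line per call with the cost relevant fields."""
    record = {
        "ts": time.time(),
        "workload": workload,  # your attribution tag
        "model": model,
        "batch": batch,        # supplied by the caller in this sketch
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        # Cache fields may be absent or None, so default them to 0.
        "cache_read_input_tokens":
            getattr(usage, "cache_read_input_tokens", 0) or 0,
        "cache_creation_input_tokens":
            getattr(usage, "cache_creation_input_tokens", 0) or 0,
    }
    sink.write(json.dumps(record) + "\n")
    return record


# Stand-in for the usage object on a real response (values are made up).
fake_usage = SimpleNamespace(
    input_tokens=1200,
    output_tokens=300,
    cache_read_input_tokens=1000,
    cache_creation_input_tokens=None,
)
sink = io.StringIO()
record = log_call(fake_usage, "small-model", "support-triage", False, sink)
```

In production the sink would be a file, a queue, or a database writer, and the function would be called once after every API response.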

Most teams already log their API calls for debugging or audit. Extending that log with the cost relevant fields is a small change with an outsized payoff, because it turns logs you keep anyway into the foundation of a cost program. The aim is a single queryable record where you can ask what any feature, team, or customer cost over any period, sliced however the conversation requires.
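
The queryable record does not need heavy infrastructure to prototype. The sketch below uses an in-memory SQLite table with a hypothetical schema and invented rows, and answers the question of what each workload, team, or model cost over a period.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per call, cost already computed.
conn.execute(
    "CREATE TABLE calls "
    "(ts TEXT, workload TEXT, team TEXT, model TEXT, cost_usd REAL)"
)
rows = [  # invented example data, ISO dates so text comparison works
    ("2026-05-03", "search", "growth", "small-model", 0.002),
    ("2026-05-04", "search", "growth", "small-model", 0.004),
    ("2026-05-10", "review", "legal", "large-model", 0.50),
    ("2026-06-01", "review", "legal", "large-model", 0.60),
]
conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", rows)


def cost_by(conn, dimension, start, end):
    """Calls, total cost, and cost per inference by dimension for a period."""
    if dimension not in {"workload", "team", "model"}:
        raise ValueError(f"unsupported dimension: {dimension}")
    query = (
        f"SELECT {dimension}, COUNT(*), SUM(cost_usd), AVG(cost_usd) "
        f"FROM calls WHERE ts >= ? AND ts < ? "
        f"GROUP BY {dimension} ORDER BY SUM(cost_usd) DESC"
    )
    return conn.execute(query, (start, end)).fetchall()


may_by_workload = cost_by(conn, "workload", "2026-05-01", "2026-06-01")
```

The dimension is checked against a fixed set before it is placed in the query, since column names cannot be passed as bound parameters.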

Turning the metric into a habit

A measurement that is built once and then ignored decays quickly, because workloads change, new features ship, and yesterday's lean call becomes today's expensive one. The metric earns its keep only when it becomes a habit, a number reviewed on a regular cadence rather than pulled out in a crisis. A simple monthly review of cost per inference by workload, with the biggest movers flagged, catches drift while it is still cheap to fix and keeps the savings you have already won from quietly eroding.
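
The monthly review can start from a comparison as simple as this sketch, which flags workloads whose cost per inference moved by more than a threshold between two periods. The twenty percent threshold is an arbitrary starting point, not a recommendation.

```python
def biggest_movers(previous, current, threshold=0.20):
    """Workloads whose cost per inference changed by at least `threshold`.

    `previous` and `current` map workload name to cost per call in USD.
    Returns (workload, fractional change) pairs, largest move first.
    """
    movers = []
    for workload, now in current.items():
        before = previous.get(workload)
        if not before:
            continue  # new workloads have no baseline; review them separately
        change = (now - before) / before
        if abs(change) >= threshold:
            movers.append((workload, change))
    return sorted(movers, key=lambda m: abs(m[1]), reverse=True)
```

Increases are the drift to investigate, and decreases confirm that a lever you applied is still holding.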

This habit also changes behavior upstream. When teams can see their own cost per inference, ownership follows naturally, because the number is concrete and attributable rather than buried in a shared bill nobody feels responsible for. Engineers start to consider the cost of a design choice at the moment they make it, and product owners weigh the per inference cost of a feature against its value. The metric stops being a finance artifact and becomes part of how the organization builds, which is the point at which a cost program becomes self sustaining rather than a periodic cleanup.

The buyer side takeaway

The monthly invoice is not a measurement, it is a symptom. Cost per inference, captured per call with input and output tokens, model, cache state, batch state, and a workload tag, is the metric that turns spend into something you can manage. It reveals which workloads are expensive and why, names the lever that fixes each one, proves the saving after, and grounds your commitment forecast in real unit economics. Build it first, because every other move in a Claude cost program depends on being able to see the number.

Not affiliated with Anthropic PBC. Independent buyer side advisory only.