Token Optimization Metrics Your CFO Will Want

When a Claude bill crosses into seven figures, the finance team stops treating it as a line item and starts treating it as a program to manage. The trouble is that the data engineering teams look at (raw token counts, model names, request volumes) does not translate into anything a CFO can govern. A finance leader does not care that you sent four billion input tokens last quarter. They care whether the spend is buying proportional value, whether it is growing faster than the business, and how much of the committed spend with Anthropic is genuinely required versus padding that protects nobody. The job of a good optimization program is to turn raw usage into a small set of metrics that a CFO can read in thirty seconds and act on. Here are the metrics that matter, why each one earns its place on the dashboard, and how to read them together.

Cost per outcome, not cost per token

The single most useful metric is cost per outcome, where an outcome is a unit of business value your application produces. For a support tool, that is cost per resolved ticket. For a coding assistant, cost per accepted change. For a document pipeline, cost per processed document. Token cost is an input to this number, but the number itself is what a CFO can reason about, because it ties spend to something the business already measures. When you report cost per outcome rather than cost per token, two things happen. The conversation stops being about whether AI is expensive in the abstract and starts being about whether each outcome is priced sensibly. And the optimization work gets a target that finance recognizes, because lowering cost per outcome is a margin improvement they can take to the board.

Cost per outcome also exposes the workloads that are quietly unprofitable. Most enterprises find that a small set of use cases consume a large share of the spend while producing modest value, and that the averages hide it. Breaking cost per outcome out by use case is usually the moment a finance team sees where the money is actually going, and it reframes optimization from a vague engineering chore into a clear list of the few places worth fixing first.
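As a rough illustration of the breakdown described above, here is a minimal Python sketch that computes cost per outcome by use case and ranks the most expensive outcomes first. The use case names, spend figures, and outcome counts are hypothetical, and how spend gets attributed to a use case is assumed to be solved upstream.

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    spend_usd: float  # Claude spend attributed to this use case (attribution assumed done upstream)
    outcomes: int     # e.g. resolved tickets, accepted changes, processed documents

def cost_per_outcome(uc: UseCase) -> float:
    """Dollars spent per unit of business value; infinite if nothing was produced."""
    return uc.spend_usd / uc.outcomes if uc.outcomes else float("inf")

def rank_by_cost_per_outcome(use_cases):
    """Most expensive outcome first, so costly workloads surface at the top."""
    return sorted(use_cases, key=cost_per_outcome, reverse=True)

# Hypothetical figures, for illustration only.
use_cases = [
    UseCase("support_tickets", spend_usd=120_000.0, outcomes=80_000),
    UseCase("code_review", spend_usd=300_000.0, outcomes=20_000),
    UseCase("doc_pipeline", spend_usd=60_000.0, outcomes=600_000),
]

total_spend = sum(uc.spend_usd for uc in use_cases)
for uc in rank_by_cost_per_outcome(use_cases):
    share = uc.spend_usd / total_spend
    print(f"{uc.name:16s} ${cost_per_outcome(uc):8.2f} per outcome  {share:6.1%} of spend")
```

Whether a given cost per outcome is too high still depends on what the outcome is worth to the business, which this sketch does not attempt to model.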

Model mix and the blended rate

The second metric is the share of spend running on each model (Opus, Sonnet, and Haiku) and the blended rate that results. This matters because the largest single lever in most Claude bills is routing work to the cheapest model that handles it well. Opus is the most capable and the most expensive. Sonnet handles a wide band of production work at a fraction of the cost. Haiku is cheaper still and is right for high-volume, lower-complexity tasks. When everything runs on Opus out of habit, the blended rate is near the top of the range and the bill is far higher than the work requires. Model routing across the three typically cuts aggregate spend 40 to 70 percent versus uniform Opus use, so the model mix is not a technical curiosity; it is the headline efficiency number.

Report it as a simple breakdown: what percentage of tokens and of dollars sits on each model, and what the blended cost per token is as a result. A CFO can read that instantly. If ninety percent of the spend is on Opus, the optimization opportunity is obvious and large. As the program moves work down to Sonnet and Haiku where quality holds, the blended rate falls, and the trend line of that rate over time is one of the cleanest ways to show optimization is working.
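The breakdown described above could be computed along these lines. This is a sketch, not a billing tool: the token and dollar figures are hypothetical and do not reflect actual Anthropic pricing, and the blended rate is expressed per million tokens as an assumed reporting convention.

```python
# Hypothetical quarterly usage per model: (tokens, dollars billed).
# These figures are illustrative and are not real price points.
usage = {
    "opus": (2_000_000_000, 90_000.0),
    "sonnet": (1_500_000_000, 13_500.0),
    "haiku": (500_000_000, 1_500.0),
}

def model_mix(usage):
    """Return {model: (share_of_tokens, share_of_dollars)}."""
    total_tokens = sum(tokens for tokens, _ in usage.values())
    total_usd = sum(usd for _, usd in usage.values())
    return {
        model: (tokens / total_tokens, usd / total_usd)
        for model, (tokens, usd) in usage.items()
    }

def blended_rate_per_mtok(usage):
    """Total dollars divided by total tokens, expressed per million tokens."""
    total_tokens = sum(tokens for tokens, _ in usage.values())
    total_usd = sum(usd for _, usd in usage.values())
    return total_usd / (total_tokens / 1_000_000)

for model, (token_share, usd_share) in model_mix(usage).items():
    print(f"{model:7s} {token_share:6.1%} of tokens  {usd_share:6.1%} of dollars")
print(f"blended rate: ${blended_rate_per_mtok(usage):.2f} per million tokens")
```

Showing token share and dollar share side by side makes the imbalance visible: in this made-up data, Opus carries half the tokens but well over four fifths of the dollars.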

Cache hit rate

The third metric is the prompt cache hit rate. Prompt caching lets you reuse a stable portion of context (a long system prompt, a reference document, a fixed instruction set) at up to 90 percent off the repeated portion. The hit rate tells you how much of your eligible repeated context is actually being served from cache rather than paid for in full every time. A low hit rate on a workload with large stable context is money left on the table. A high hit rate means the engineering has captured the saving. For a finance leader, the cache hit rate is a proxy for how disciplined the team is about not paying twice for the same tokens, and improvements in it map directly to a lower bill with no change in what the application does.
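One way to make the hit rate concrete is sketched below, under stated assumptions. The definition used here (tokens served from cache divided by all cache-eligible tokens) is one reasonable reading of the metric, the function and parameter names are hypothetical, and the input rate is a placeholder. The 0.90 discount is the "up to 90 percent" ceiling from the text, so the missed saving is an upper bound; it also ignores any cost of writing to the cache.

```python
def cache_hit_rate(cache_read_tokens: int, uncached_eligible_tokens: int) -> float:
    """Share of cache-eligible context tokens that were actually served from cache."""
    eligible = cache_read_tokens + uncached_eligible_tokens
    return cache_read_tokens / eligible if eligible else 0.0

def missed_saving_usd(uncached_eligible_tokens: int,
                      input_rate_per_mtok: float,
                      cache_discount: float = 0.90) -> float:
    """Upper-bound estimate of what the uncached eligible tokens cost in lost discount.

    cache_discount is an assumption taken from the "up to 90 percent" figure.
    """
    return uncached_eligible_tokens / 1_000_000 * input_rate_per_mtok * cache_discount

# Hypothetical workload: 600M eligible tokens served from cache, 400M paid in full.
rate = cache_hit_rate(600_000_000, 400_000_000)
missed = missed_saving_usd(400_000_000, input_rate_per_mtok=10.0)  # placeholder rate
print(f"cache hit rate: {rate:.0%}, missed saving up to ${missed:,.0f}")
```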

Batch share of eligible work

The fourth metric is the share of asynchronous work running through the batch path. Batch processing runs eligible work at roughly half the price of real-time calls, and the only thing traded is immediacy. Any workload where no human is waiting on the result in the moment is a candidate. The metric to track is what fraction of your batch-eligible volume is actually on the batch path. Teams routinely run overnight jobs, bulk classification, and reprocessing through the real-time path out of habit, paying double for latency they never use. Reporting the batch share, and the gap between current and achievable, gives finance a clear and durable saving to push for, because the discount comes from how the work is scheduled rather than from a promotion that can be withdrawn.
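The batch share and the gap to achievable can be expressed in a few lines. This sketch assumes you can already classify volume as batch-eligible, which is the hard part in practice; the names are hypothetical, and the 0.5 discount is the "roughly half" figure from the text rather than a quoted price.

```python
def batch_share(on_batch_tokens: int, eligible_tokens: int) -> float:
    """Fraction of batch-eligible volume that actually runs on the batch path."""
    return on_batch_tokens / eligible_tokens if eligible_tokens else 0.0

def achievable_saving_usd(eligible_realtime_usd: float, batch_discount: float = 0.5) -> float:
    """Estimated saving from moving remaining eligible real-time spend to batch.

    batch_discount is an assumption based on "roughly half the price".
    """
    return eligible_realtime_usd * batch_discount

# Hypothetical quarter: 1B eligible tokens, of which 250M already run on batch,
# and $80,000 of eligible work still billed at the real-time rate.
share = batch_share(250_000_000, 1_000_000_000)
saving = achievable_saving_usd(80_000.0)
print(f"batch share: {share:.0%} of eligible volume, achievable saving about ${saving:,.0f}")
```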

Output token ratio

The fifth metric is the ratio of output tokens to input tokens, watched at the workload level. Output tokens cost several times more than input tokens, so verbose responses inflate the bill out of proportion to their length. A workload that returns long, padded answers when a short structured one would do is overpaying on the most expensive part of the invoice. Tracking the output ratio surfaces the prompts that are generating waste, and tightening response length through clear instructions and output limits is one of the faster wins available. For finance, this metric explains why two workloads with similar request volumes can have very different costs, and it points at a fix that needs prompt design rather than new infrastructure.
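To see why similar request volumes can produce very different costs, the sketch below computes the output ratio and the share of cost attributable to output for two hypothetical workloads. The price multiple of 5 is an assumption standing in for "several times more"; substitute the actual ratio for the model in use.

```python
def output_ratio(input_tokens: int, output_tokens: int) -> float:
    """Output tokens per input token for a workload."""
    return output_tokens / input_tokens if input_tokens else float("inf")

def output_cost_share(input_tokens: int, output_tokens: int,
                      output_price_multiple: float = 5.0) -> float:
    """Share of token cost that comes from output.

    output_price_multiple is an assumed output-to-input price ratio.
    """
    weighted_output = output_tokens * output_price_multiple
    total = input_tokens + weighted_output
    return weighted_output / total if total else 0.0

# Hypothetical workloads with identical input volume: (input_tokens, output_tokens).
workloads = {
    "ticket_summary": (1_000_000, 100_000),
    "report_writer": (1_000_000, 500_000),
}

for name, (tokens_in, tokens_out) in workloads.items():
    print(f"{name:15s} ratio {output_ratio(tokens_in, tokens_out):.2f}  "
          f"output is {output_cost_share(tokens_in, tokens_out):.0%} of token cost")
```

Under these assumptions, the verbose workload spends most of its money on output even though output is only a third of its total tokens.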

Commitment utilization and the headroom number

The sixth metric is how your actual spend tracks against your committed spend with Anthropic, and how much headroom or shortfall that implies. This is the metric that connects optimization to the contract. Committed spend that goes unused is, in most agreements, simply lost, so a commit set too high is a guaranteed overpayment. A commit set too low exposes you to overage at a worse rate. The utilization curve, plotted against the commit, tells a CFO whether the company is on track to use what it promised, whether it is heading for a shortfall, and whether the next commit should be larger, smaller, or restructured. Crucially, this number should be read after optimization, not before. The whole point of optimizing first is that a lower, cleaner spend baseline means a smaller commit, less exposure to unused commitment, and more room to negotiate the rate.
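A simple way to read spend against the commit is a run-rate projection, sketched below. It assumes a linear run rate over the term, which real usage rarely follows exactly, and a 12-month term by default; the function name, field names, and figures are all hypothetical. A negative gap means projected spend falls short of the commit, and a positive gap means projected overage.

```python
def commitment_projection(commit_usd: float, spend_to_date_usd: float,
                          months_elapsed: float, term_months: float = 12) -> dict:
    """Project end-of-term spend against the commit, assuming a linear run rate."""
    if not 0 < months_elapsed <= term_months:
        raise ValueError("months_elapsed must be in (0, term_months]")
    projected = spend_to_date_usd / months_elapsed * term_months
    return {
        "utilization_to_date": spend_to_date_usd / commit_usd,
        "projected_spend_usd": projected,
        "projected_utilization": projected / commit_usd,
        # Negative: projected spend is below the commit. Positive: projected overage.
        "projected_gap_usd": projected - commit_usd,
    }

# Hypothetical contract: $1.2M annual commit, $450k spent after six months.
result = commitment_projection(1_200_000.0, 450_000.0, months_elapsed=6)
print(f"projected utilization: {result['projected_utilization']:.0%}, "
      f"gap versus commit: ${result['projected_gap_usd']:,.0f}")
```

Running the same projection on the post-optimization spend baseline, rather than the current one, gives the number to size the next commit against.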

How the metrics work together

Individually each metric is useful. Together they tell a story a finance leader can govern by. Cost per outcome says whether the spend is justified. Model mix and the blended rate say whether you are using the right tool for each job. Cache hit rate, batch share, and output ratio say whether the engineering is disciplined about waste. Commitment utilization says whether the contract is sized to reality. Read in sequence, they take a CFO from "is this spend reasonable?" all the way to "is the deal with Anthropic the right shape?", which is exactly the journey a buyer-side program is meant to support.

The discipline that makes this work is reporting the same small set every period and watching the trends rather than chasing one-off snapshots. A blended rate falling quarter over quarter, a cache hit rate climbing, a batch share rising, and commitment utilization landing near plan: those four trend lines are a credible picture of a program that is working. They also become the evidence base for the next negotiation, because a vendor account team is far more willing to move on rate and terms when the buyer can show a disciplined, optimized, well-measured consumption story rather than an unmanaged one.

Leading and lagging indicators

It helps to split the six metrics into the ones that predict and the ones that confirm. Cache hit rate, batch share, and output ratio are leading indicators, because they tell you how disciplined the engineering is right now and where the next saving will come from before it shows up on the invoice. Cost per outcome, the blended rate, and commitment utilization are lagging indicators, because they confirm whether the discipline has translated into the result that finance cares about. A healthy program shows the leading indicators improving first and the lagging ones following a period or two later. When a CFO can see that sequence (rising cache hit rate and batch share this quarter, falling blended rate and cost per outcome next quarter), they have evidence that the program is causal rather than coincidental, which is what justifies continued investment in the optimization work.

Reporting cadence and who reads what

The metrics serve different audiences at different rhythms, and forcing all of them into one report for one reader weakens the whole picture. The engineering owners need the leading indicators weekly or per sprint, because they act on them directly through routing rules, caching changes, and prompt revisions. The finance leadership needs the lagging indicators and the commitment utilization monthly or quarterly, because they govern at that cadence and act through budget and the contract. The mistake is to send a CFO a weekly token report they cannot act on, or to give engineers a quarterly summary that arrives too late to change anything. Match the metric to the reader and the cadence to the decision, and each number lands with someone who can actually move it.

A one page picture

In practice the whole thing fits on a single page. At the top sits cost per outcome by use case, the number that says whether the spend is justified. Below it come the model mix and blended rate, the headline efficiency lever. Then come the three discipline metrics (cache hit rate, batch share, and output ratio), each shown as current against achievable so the gap is visible. At the bottom is commitment utilization plotted against the commit, with the headroom or shortfall called out. Four trend arrows (blended rate down, cache hit rate up, batch share up, utilization near plan) tell the governance story at a glance. A page like this turns an unmanaged seven-figure bill into something a finance leader can actually run, and it becomes the evidence base the next negotiation is built on.

Common mistakes in measuring

A few measurement errors recur often enough to name. The first is reporting averages that hide the few workloads driving most of the cost, when the breakdown by use case is where the action is. The second is tracking token counts instead of dollars, which engineers find natural and finance cannot use. The third is watching snapshots instead of trends, so a single good or bad period gets over-read. The fourth is measuring the spend before optimization rather than after, then carrying that inflated baseline into a commitment negotiation, which locks in waste you have already identified. Avoiding these four is most of what separates a dashboard that drives decisions from one that just decorates a slide.

Where this fits

These metrics are the finance-facing layer of a full optimization program. For the underlying levers (the routing, caching, and batch work that move the numbers), read the pillar guide, the token optimization playbook, and book a strategy call so we can build the dashboard against your real usage and turn it into a smaller, better-negotiated commitment.
