The right model is rarely the most capable model. It is the one that delivers the most correct work per dollar on your specific task. That is a number you can measure, and once you have it, the model choice stops being an argument and becomes a calculation.
Model selection arguments inside engineering teams tend to be unwinnable, because they are framed around the wrong question. The debate is usually whether Opus is better than Sonnet, or whether Haiku is good enough, and stated that way it has no answer, because a more capable model is always better in the abstract and the question of good enough has no objective threshold. The way out is to stop asking which model is best and start asking which model delivers the most accuracy per dollar for the task in front of you. That is a measurable quantity. Once a team has it for a given workload, the model choice stops being a matter of opinion and seniority and becomes a number that anyone can read off a table. This piece is about how to build that number and use it to make the call on evidence.
A more capable model will, on average, produce better output than a less capable one. That is true and it is also useless as a basis for a decision, because it ignores cost entirely. If the top model is right ninety-six percent of the time and the middle model is right ninety-four percent of the time on your task, but the top model costs several times as much per request, then for any workload where a small error rate is acceptable or recoverable, the middle model is the better commercial choice by a wide margin. The best model wins the quality contest and loses the value contest. Accuracy per dollar captures both halves of the decision in one figure, which is why it is the metric that actually settles the argument. It forces the comparison onto the ground that matters, which is how much correct work you get for what you pay.
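To make that arithmetic concrete, here is a minimal sketch. The two accuracies are the ones from the example above; the per-request costs are hypothetical placeholders, chosen only so that the top model costs five times as much, and are not real prices.

```python
def correct_per_dollar(accuracy: float, cost_per_request: float) -> float:
    """Expected number of correct answers bought by one dollar."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return accuracy / cost_per_request


# Accuracies come from the example above. The per-request costs are
# hypothetical placeholders chosen so the top model costs five times as much.
top = correct_per_dollar(accuracy=0.96, cost_per_request=0.05)
middle = correct_per_dollar(accuracy=0.94, cost_per_request=0.01)

print(f"top:    {top:.1f} correct answers per dollar")
print(f"middle: {middle:.1f} correct answers per dollar")
print(f"middle delivers {middle / top:.1f}x the correct work per dollar")
```

Under these assumed costs, a two-point accuracy gap is swamped by a fivefold price gap: the middle model buys nearly five times as many correct answers per dollar.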
The metric starts with a clear definition of accuracy that fits your task, because accuracy is not one thing. For a classification task it is the rate of correct labels. For an extraction task it is the share of fields pulled correctly. For a generation task it is harder, and usually means a graded judgment against a rubric, whether by human review or by a model grading against criteria you have validated. The definition does not have to be perfect, but it has to be consistent, applied the same way to every model you compare, and it has to reflect what actually matters for the downstream use of the output. A definition that scores something the business does not care about will produce a confident number that leads you to the wrong model. Spend the time to define accuracy in terms of the real cost of being wrong.
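As an illustration of how the definition changes with the task, here is a sketch of two scoring functions, one for classification and one for extraction. The function names and the dictionary-per-record layout are assumptions made for this example, not a prescribed format; a rubric-graded generation task would need its own scorer.

```python
from typing import Mapping, Sequence


def label_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Classification: share of examples whose label matches ground truth."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def field_accuracy(
    predicted: Sequence[Mapping[str, str]],
    expected: Sequence[Mapping[str, str]],
) -> float:
    """Extraction: share of ground-truth fields pulled correctly.

    A field missing from the prediction counts as wrong.
    """
    if len(predicted) != len(expected):
        raise ValueError("need one prediction per expected record")
    total = 0
    correct = 0
    for pred, gold in zip(predicted, expected):
        for field, value in gold.items():
            total += 1
            correct += pred.get(field) == value
    if total == 0:
        raise ValueError("no ground-truth fields to score")
    return correct / total
```

Whichever scorer you choose, the same function has to be applied unchanged to every model's output, otherwise the numbers are not comparable.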
With accuracy defined, you need a test set that represents the real workload. The single most common mistake here is testing on easy or cherry-picked examples, which flatters the cheaper model and leads to a decision that falls apart in production when the hard cases arrive. A good test set is drawn from real traffic, includes the difficult and edge cases in roughly the proportion they actually occur, and is large enough that the accuracy numbers are stable rather than noise. It also needs ground truth, the known correct answer for each example, so you can score against it. Building this set is the most labor-intensive part of the exercise and it is the part teams are tempted to skip, but the metric is only as trustworthy as the set it is measured on. A representative test set is the foundation; everything else is arithmetic on top of it.
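One way to check whether a set is "large enough" is to put a confidence interval around the measured accuracy. The sketch below uses the Wilson score interval, which is a choice made for this example rather than something the method requires; the example counts are invented.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a measured accuracy (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Invented counts: the same 94% measured accuracy on two test-set sizes.
small_set = wilson_interval(correct=94, total=100)
large_set = wilson_interval(correct=1880, total=2000)

print(f"n=100:  {small_set[0]:.3f} to {small_set[1]:.3f}")
print(f"n=2000: {large_set[0]:.3f} to {large_set[1]:.3f}")
```

On 100 examples the interval runs from roughly 88% to 97%, so a measured 94% cannot be told apart from 96%. On 2,000 examples it narrows to roughly 93% to 95%, which is tight enough to separate the two.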
Now the measurement itself. Run the same test set through each candidate model (Opus, Sonnet, and Haiku) with the same prompt, and record two things for each: the accuracy against your definition, and the cost, computed from the input and output token counts at each model's rates. From those you get accuracy per dollar directly: the number of correct answers divided by what it cost to produce them. Laid out as a table with one row per model, the comparison usually tells a clear story.
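A minimal sketch of that calculation follows. Every figure in it is a placeholder: the model labels, token counts, per-million-token prices, and correct counts are invented for illustration and are neither measured results nor real rate-card prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    name: str
    correct: int                  # examples scored correct on the test set
    total: int                    # examples in the test set
    input_tokens: int             # input tokens summed over the whole run
    output_tokens: int            # output tokens summed over the whole run
    input_price_per_mtok: float   # dollars per million input tokens
    output_price_per_mtok: float  # dollars per million output tokens

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def cost(self) -> float:
        return (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000

    @property
    def correct_per_dollar(self) -> float:
        return self.correct / self.cost


def comparison_table(runs: list[ModelRun]) -> str:
    """One row per model, best correct-answers-per-dollar first."""
    rows = sorted(runs, key=lambda r: r.correct_per_dollar, reverse=True)
    lines = [f"{'model':<8}{'accuracy':>10}{'cost ($)':>10}{'correct/$':>11}"]
    for r in rows:
        lines.append(
            f"{r.name:<8}{r.accuracy:>10.1%}{r.cost:>10.2f}{r.correct_per_dollar:>11.1f}"
        )
    return "\n".join(lines)


# Placeholder inputs only: invented prices and invented accuracy counts.
runs = [
    ModelRun("large", 960, 1000, 2_000_000, 500_000, 10.0, 50.0),
    ModelRun("mid", 940, 1000, 2_000_000, 500_000, 2.0, 10.0),
    ModelRun("small", 850, 1000, 2_000_000, 500_000, 0.5, 2.5),
]
print(comparison_table(runs))
```

With these placeholder figures the cheapest model tops the raw ratio, since its price advantage is far larger than its accuracy deficit. Whether that ranking holds depends on what its extra errors cost.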
The table replaces the argument. Instead of debating which model is better, the team reads which model produces the most correct work per dollar for this task, and chooses it.
A complete accuracy per dollar number accounts for what happens after a wrong answer, because errors are rarely free. If a cheap model's mistake gets caught by a validation check and triggers a retry on a more expensive model, the true cost of the cheap model includes those retries, not just its sticker price. If an error slips through to a human reviewer, the cost includes their time. If a wrong answer reaches a customer, the cost can be far larger and harder to quantify. The point is not to drown the metric in every conceivable downstream effect, but to make sure the comparison reflects the real economics of being wrong on this particular task. A cheap model that looks cheap until you count its retries and escalations may not be cheap at all, and only a metric that includes those second-order costs will tell you so.
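One simple way to fold those costs in is an expected-cost model, sketched below. It assumes every wrong answer ends in exactly one of three outcomes (retried on a fallback model, caught by a human reviewer, or reaching the customer), and every number in it is a hypothetical placeholder. Real error handling is often messier than this.

```python
def expected_cost_per_request(
    base_cost: float,       # sticker price of one request on this model
    error_rate: float,      # share of requests answered wrongly
    share_retried: float,   # share of errors caught by validation and retried
    retry_cost: float,      # cost of one retry on the fallback model
    share_reviewed: float,  # share of errors caught by a human reviewer
    review_cost: float,     # cost of the reviewer's time per error
    escaped_cost: float,    # estimated cost of an error reaching a customer
) -> float:
    """Sticker price plus the expected cost of handling this model's errors."""
    share_escaped = 1.0 - share_retried - share_reviewed
    if share_escaped < -1e-12:
        raise ValueError("share_retried + share_reviewed must not exceed 1")
    share_escaped = max(share_escaped, 0.0)
    cost_per_error = (
        share_retried * retry_cost
        + share_reviewed * review_cost
        + share_escaped * escaped_cost
    )
    return base_cost + error_rate * cost_per_error


# Hypothetical downstream costs, identical for both models being compared.
downstream = dict(
    share_retried=0.6, retry_cost=0.045,
    share_reviewed=0.3, review_cost=0.50,
    escaped_cost=5.00,
)
cheap = expected_cost_per_request(base_cost=0.002, error_rate=0.15, **downstream)
mid = expected_cost_per_request(base_cost=0.009, error_rate=0.06, **downstream)

print(f"cheap model: ${cheap:.5f} per request")
print(f"mid model:   ${mid:.5f} per request")
```

Under these assumptions, the model with the lower sticker price ends up about twice as expensive per request once its errors are priced in. Different downstream costs would give a different answer, which is the reason to measure them for your own task.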
An accuracy per dollar measurement is a snapshot, not a permanent verdict. Models change, your workload drifts, the mix of easy and hard cases shifts as your product evolves, and the prices themselves move. A model choice that was right when you measured it can quietly become wrong, which is why the measurement deserves to be repeated on a cadence rather than treated as settled forever. Teams that build the test set once and keep it can rerun the comparison cheaply whenever something changes, which turns model selection from a one-time fight into a maintained decision. The investment is the test set; once you have it, staying on the right model is inexpensive.
Measuring accuracy per dollar is what makes routing decisions defensible and turns model selection into evidence rather than opinion. It is the backbone of disciplined token optimization, and it pairs with caching, batch processing, and output control to drive real reduction in spend. For the full method, the test set design, and the worksheet to run your own comparison, read the pillar guide, the token optimization playbook.