The first thing to understand about Claude API pricing is that there is no single rate. There is a structure, and the structure has more moving parts than most buyers realize when they look at a published price page. Rates differ by model tier, they differ between input and output, and they change again once caching and batch enter the picture. A buyer who reads only the headline number for one model is seeing a small corner of the real pricing, and the gap between that headline and the rate you actually pay across a real workload is where most of the money lives. This is the buyer side view of how Claude API rates are structured across the model tiers and how to read them.
We negotiate Claude contracts and study Anthropic pricing exclusively, so the structure below is the map we work from. We deliberately describe it in relative terms rather than printing specific per token figures, because the exact numbers move and a published rate today is a stale rate tomorrow. What does not move is the shape of the pricing, and the shape is what lets you reason about your bill and your commitment with confidence.
Three tiers, three price points
Claude is offered as a family of models at different capability tiers, and the price tracks the capability. Opus is the most capable and the most expensive per token. Sonnet sits in the middle, balancing capability and cost. Haiku is the lightest and the cheapest. The spread between them is large, not marginal, which is the single most important fact for controlling spend: the same task run on the top tier versus a lower tier can differ in cost by a wide multiple, and a great deal of enterprise work does not need the top tier at all.
This is why model selection is the dominant lever in Claude economics. The decision of which tier to send a given piece of work to is not a technical nicety. It is a pricing decision made thousands or millions of times a day, and getting it right is worth more than almost any other optimization. Routing work across Opus, Sonnet, and Haiku, sending each task to the lightest model that can do it well, typically cuts aggregate spend 40 to 70 percent versus running everything on the top tier.
Input and output are priced differently
Within each tier, the rate splits in two. Input tokens, everything you send to the model, are priced lower. Output tokens, everything the model writes back, are priced several times higher. This asymmetry holds across the tiers and it has a direct consequence for how you should think about cost: the words Claude generates are the expensive half of the bill, so verbose responses cost far more than verbose prompts. A workload that returns long answers is paying the premium half repeatedly, and trimming output is often the cheapest saving available because it attacks the costlier side of the ledger.
For a procurement leader, the practical takeaway is that you cannot estimate a workload's cost from input volume alone. You have to weight output more heavily, because each output token costs a multiple of each input token. A feature that sends a short prompt and receives a long response can cost far more than its prompt size suggests, and the only way to see it is to measure input and output separately rather than counting total tokens.
Caching changes the input rate
The published input rate is not what you pay on repeated content. Prompt caching lets you mark stable parts of a prompt, a system prompt, a set of instructions, a reference document, a block of tool definitions, so that after the first call they are billed at a small fraction of the standard input rate. The saving on cached content runs up to ninety percent. For any workload that sends the same context repeatedly, and most production workloads do, caching turns a large recurring input cost into a small one.
This means the effective input rate for a well designed workload is materially below the headline. A buyer reasoning about cost from the published input number, without accounting for caching, overstates their real rate on any workload with stable context. The structure rewards architecture: design your prompts so the stable content is cacheable, and the rate you actually pay drops well below the list.
Batch halves the rate on async work
The other major adjustment is batch processing. Work that does not need an immediate answer can be submitted as a batch and processed within a turnaround window, and that work runs at roughly half the standard rate. The model and the output are identical. The only thing you give up is immediacy. For the large share of enterprise work that is genuinely asynchronous, overnight enrichment, bulk classification, scheduled reporting, batch cuts the rate in half on the entire volume of that work.
Batch and caching stack. A batch job that also caches its shared context captures both savings at once, half rate from batch and up to ninety percent off the repeated context from caching. Layer model routing on top, running the batch on the lightest suitable tier, and the same workload attracts three savings together. The headline rate for the top tier is the worst case you would ever pay, and a well optimized workload pays a small fraction of it.
Anthropic Claude Pricing 2026
The rate structure is one piece of the picture. Our pricing guide lays out the tiers, the commit bands, and the levers that move the real rate you pay, with the buyer side detail Anthropic does not publish.
Get the Anthropic Claude Pricing 2026 guideCommit bands move the rate again
Everything above is the list structure. On top of it sits committed spend, where you agree to a level of usage over a term in exchange for a discount off the rates. The discount scales with the size of the commit through a set of bands, so a larger commitment unlocks a better rate, up to the point where the bands flatten. This is the part of the pricing that is genuinely negotiable, and it is where a buyer side advisor earns their fee, because the difference between the rate Anthropic first offers and the rate a well prepared buyer secures is often substantial.
The catch is that committed spend cuts both ways. Unused commitment generally does not roll over, so a commit larger than your real usage is money you forfeit. The right commit is sized to optimized consumption, after caching, batch, and routing have lowered your real volume, not to an unoptimized baseline. Read the rate structure and the commit bands together, because the discount you negotiate only helps if the commit underneath it is the right size.
The buyer side summary
There is no single Claude API rate. There is a structure: three tiers priced by capability with a wide spread between them, output priced several times above input within each tier, caching that cuts the input rate on stable content by up to ninety percent, batch that halves the rate on asynchronous work, and commit bands that discount the whole thing in exchange for a sized commitment. The headline number for the top tier is the most you would ever pay, and a workload optimized across all of these levers pays a small fraction of it. Read the structure, not the headline, and size your commitment to what you will really spend.
To see the full pricing picture, including the commit bands and the levers that move the real rate, the Anthropic Claude Pricing 2026 guide lays it all out from the buyer side.