Extended thinking lets Claude reason longer before it answers, and those reasoning tokens are billed like any other output. Used well it lifts quality on hard tasks. Used by default it quietly inflates your bill. Here is how to tell the difference.
Extended thinking is one of the more powerful capabilities in the Claude lineup, and one of the easiest to overspend on without noticing. The idea is simple: instead of answering immediately, the model works through a problem step by step before producing its final response, and that reasoning lifts quality on tasks that genuinely require it. The catch is that the reasoning is made of tokens, and those tokens are billed at output rates, which are the most expensive tokens you buy. A request that uses extended thinking can cost several times what the same request costs without it, because the model may generate a long chain of reasoning before it ever writes the answer the user sees. On the right task that is money well spent. On the wrong task, or applied as a blanket default, it is one of the quietest sources of avoidable spend in a Claude deployment.
When extended thinking is on, the model produces internal reasoning in addition to its final output, and you pay for both. The reasoning can be far longer than the answer itself, especially on complex problems, which is exactly when you want it and exactly when it costs the most. The economic question is therefore not whether extended thinking is good, it plainly is on hard work, but whether the quality lift on a given task is worth the multiple on cost. That answer varies enormously by task. On a genuinely difficult reasoning problem, the lift is large and the spend is justified. On a routine classification or a simple extraction, the model did not need to deliberate, the reasoning adds little, and you have paid a premium for thinking the task never required.
The work that justifies extended thinking shares a profile: it is hard, the answer is not obvious, multiple steps or constraints interact, and the cost of a wrong answer is high. Complex analysis, multi step planning, intricate debugging, careful reasoning over many interacting facts, anything where a fast surface answer would be wrong in ways that matter. On this work, the reasoning tokens are not overhead, they are the product, because they are how the model arrives at an answer a faster pass would have missed. If you are paying for extended thinking on tasks like these and the quality shows it, the economics work, and turning it off to save tokens would be a false economy that costs you more in wrong answers than it saves in spend.
The waste shows up when extended thinking is left on as a global setting rather than chosen per task. Many teams enable it once, see quality improve on the hard cases, and never scope it, so it runs on every request including the large volume of simple ones that gained nothing from it. The result is a bill inflated across the board to buy a quality lift that only a fraction of traffic actually received. The same pattern that drives Opus overspending drives extended thinking overspending: a powerful capability applied uniformly instead of selectively. The fix is the same in spirit, match the capability to the task, and turn it on where it earns its cost rather than everywhere by default.
Treat extended thinking as a routing decision, not a global switch. Segment your traffic by task type, identify which types involve the hard reasoning that justifies the spend, and enable extended thinking only for those. For the routine majority, run without it and capture the saving. Where a task is borderline, test it both ways against your quality bar and let the evidence decide whether the lift is worth the multiple. The same discipline that governs choosing between Opus, Sonnet, and Haiku governs whether to think extensively at all, and the two decisions interact: a hard task may justify both a stronger model and extended reasoning, while a simple task justifies neither. Scoping extended thinking to the work that needs it, alongside routing across models, caching repeated context, and batching async work, is how aggregate spend comes down by the margins disciplined optimization delivers.
Extended thinking that runs everywhere does not just raise your monthly bill, it inflates the baseline you commit to. If you size an Anthropic commitment against usage padded with unnecessary reasoning tokens, you lock that waste into your contract for the full term, and unused efficiency does not come back to you, because committed spend you could have avoided is spent regardless. Scoping extended thinking before you commit means your baseline reflects the reasoning you actually need, the commit band you land in is honest, and the rate you negotiate applies to real demand. Optimization and negotiation are the same project viewed from two ends, and extended thinking is one of the levers that sits squarely in both.
Extended thinking is not only an on or off choice, it has a magnitude, because you can influence how much the model reasons before answering. This matters economically, because the cost scales with the length of the reasoning, and more thinking is not uniformly better. On many tasks there is a point of diminishing returns where additional reasoning stops improving the answer and simply adds tokens you pay for. The discipline is to find, for each task type that justifies extended thinking at all, the reasoning budget that delivers the quality lift without paying for deliberation the answer did not need. Test the same task at different thinking budgets against your quality bar and watch where the quality curve flattens, then set the budget there. A task that needs deep reasoning gets a generous budget, a task that needs only a little gets a modest one, and you stop paying for thinking past the point where it earns its cost. This is the same evidence based approach that governs model selection, applied to the depth of reasoning rather than the choice of model.
The interaction with model choice is worth making explicit, because the two decisions compound. A genuinely hard task may justify both a stronger model and a meaningful thinking budget, and the combination is appropriate because the work demands it. But the worst case for the budget is a simple task running on a strong model with extended thinking left on by default, which stacks three premiums, an expensive model, unnecessary reasoning tokens, and output rates on those tokens, onto work that needed none of them. Scoping both the model and the thinking budget per task type, rather than setting either globally, is how you avoid paying compounded premiums on routine work while preserving the full capability for the hard tasks that earn it.
You cannot manage what you do not measure, and extended thinking is easy to overlook because its cost hides inside the output token count rather than appearing as a separate line. To control it, instrument your usage so you can see how many reasoning tokens each task type is generating and what they are costing, then compare that against the quality lift the reasoning is actually producing. Often this exercise alone reveals the waste: a task type generating long reasoning chains for answers that a quick pass would have gotten right, or extended thinking quietly enabled on endpoints nobody intended. Surfacing reasoning token spend by task type in your cost dashboard turns extended thinking from an invisible drift into a managed lever, the same way measuring cache hit rate turns caching from an act of faith into something you tune. The teams that control extended thinking spend are the ones who can see it, and the first step is almost always making it visible.
Get a quote for a bounded engagement. Fixed fee or gainshare, no risk to you.
Get a QuoteWeekly intelligence on Anthropic pricing moves and the buyer side counters that work.