Token Optimization for Voice and Multimodal

Text is not the only thing priced in tokens. When you send Claude an image, a document, or audio that has been transcribed upstream, that input is converted into tokens and billed like any other input. Teams that have carefully optimized their text prompts often have no idea what their multimodal inputs cost, because the conversion from a picture or a page to a token count is invisible at the point of use. A high-resolution image or a long document can carry a token cost far larger than the text prompt that accompanies it, and on a voice or multimodal application running at scale, that cost is frequently the largest and least examined line on the bill. This is the buyer-side view of optimizing Claude spend on voice and multimodal workloads.

We negotiate Claude contracts for enterprise buyers and optimize the spend underneath them, and multimodal is one of the areas where the gap between perceived cost and real cost is widest. The work to close that gap is usually quick, the savings are durable, and the change rarely touches anything a user would notice. If you run a multimodal application and have not measured what your non-text inputs cost, you are almost certainly carrying spend you could remove this quarter.

Images cost more than you think

An image sent to Claude is priced according to its size, and the token cost scales with resolution. A large, high-detail image consumes a substantial number of input tokens before a single word of your prompt is counted. The mistake teams make is sending images at full resolution by default, on the assumption that bigger is safer, when the task at hand (reading a label, classifying a photo, extracting a figure) rarely needs the full detail. You pay the high token cost on every image, repeated across every request, for resolution the task never used.

The fix is to right-size the image before it reaches the model. Downscale to the smallest resolution that still lets the model do the job, crop to the region that matters rather than sending the whole frame, and avoid sending the same image repeatedly across turns when one read would do. For a workload processing millions of images, reducing the average image to the resolution the task actually needs is one of the largest savings available, and it costs nothing in quality because the detail you removed was never used.
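A minimal sketch of what right-sizing is worth, assuming the rough estimate in Anthropic's vision documentation that an image costs about width × height / 750 tokens. That divisor, the function names, and the example dimensions are assumptions for illustration; check the current documentation and measure real usage before relying on the numbers.

```python
import math

# Assumption (not from this article): a rough estimate of image input tokens
# is (width_px * height_px) / 750. Verify against current documentation.
TOKENS_PER_PIXEL_DIVISOR = 750


def estimate_image_tokens(width: int, height: int) -> int:
    """Rough input-token estimate for an image of the given pixel size."""
    return math.ceil(width * height / TOKENS_PER_PIXEL_DIVISOR)


def fit_within(width: int, height: int, max_long_edge: int) -> tuple[int, int]:
    """Scale dimensions down (never up) so the long edge is at most max_long_edge."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    original = (1568, 1176)  # hypothetical source image size
    target = fit_within(*original, max_long_edge=784)  # hypothetical per-task limit
    print(estimate_image_tokens(*original), "->", estimate_image_tokens(*target))
```

Because the estimate scales with pixel area, halving each edge cuts the estimated tokens to roughly a quarter. The right `max_long_edge` is a per-task decision: pick it by testing the smallest size at which the model still answers correctly.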

A high-resolution image can cost more in tokens than the entire text prompt beside it, for detail the task never needed.

Documents are tokens too

The same logic governs document inputs. A long PDF or a multi-page report sent to Claude is converted into tokens across its full length, and if you send the whole document when the answer lives on two pages, you pay for all the pages you did not need. Voice and multimodal pipelines often bolt document handling on without examining the volume, so a feature that answers questions about a contract may be loading the entire contract into context on every question and paying for the full text again each time.

Shape the document input the way you would shape any context. Retrieve the relevant sections rather than sending the whole file. Where the same document is read repeatedly, you have exactly the pattern prompt caching rewards: cache the stable document content and the repeated cost of carrying it forward drops by up to ninety percent. A document question-and-answer feature that caches the source and retrieves only the relevant passages costs a fraction of one that reloads the full text on every query.
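The arithmetic behind that claim can be sketched as follows. The read multiplier of 0.10 matches the "up to ninety percent" figure above; the write multiplier of 1.25 and every token count are assumptions for illustration, so substitute the rates in your own contract.

```python
# Assumptions (illustrative, not contract terms): writing to the cache is
# billed at 1.25x the base input rate and reading from it at 0.10x.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10


def uncached_cost_units(doc_tokens: int, question_tokens: int, queries: int) -> float:
    """Input cost, in base-rate token units, when the full document is resent on every query."""
    return float(queries * (doc_tokens + question_tokens))


def cached_cost_units(doc_tokens: int, question_tokens: int, queries: int) -> float:
    """Input cost when the document is cached on the first query and read from cache after.

    Assumes every later query arrives before the cache entry expires.
    """
    if queries <= 0:
        return 0.0
    first = doc_tokens * CACHE_WRITE_MULTIPLIER + question_tokens
    later = (queries - 1) * (doc_tokens * CACHE_READ_MULTIPLIER + question_tokens)
    return first + later


if __name__ == "__main__":
    # Hypothetical workload: a 100,000-token contract, 200-token questions, 20 questions.
    print(uncached_cost_units(100_000, 200, 20))
    print(cached_cost_units(100_000, 200, 20))
```

Under these assumptions a single query is slightly more expensive with caching because of the write premium, so the saving depends on the document actually being read more than once within the cache lifetime.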

Voice pipelines hide cost in the transcript

A voice workload usually reaches Claude as text, after speech has been transcribed upstream. The token cost then lives in the transcript, and transcripts are verbose: they carry filler, repetition, and the full back-and-forth of a spoken exchange, all of which becomes input tokens. A long call transcript fed to Claude in full is a large input, and if your pipeline sends the whole running transcript on every turn of a live conversation, the cost compounds exactly the way an agent loop does, with each turn carrying everything that came before.

Control it by trimming the transcript to what the model needs, summarizing earlier turns rather than carrying them verbatim, and caching the stable parts of the context. On the output side, voice replies should be concise because output tokens are the expensive half of the bill, and a spoken assistant that rambles is paying a premium rate for words the listener did not need. Tight transcripts in, tight responses out, and caching on the stable context together bring a voice workload's cost down sharply.
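To see how the compounding works, here is a small sketch comparing a pipeline that resends the whole transcript on every turn with one that keeps only the most recent turns verbatim plus a fixed-size summary of the rest. The function names, window size, and token counts are hypothetical, and the sketch ignores the cost of producing the summaries themselves.

```python
def full_history_input_tokens(turn_tokens: list[int]) -> int:
    """Total input tokens when every turn resends the whole transcript so far."""
    total = 0
    running = 0
    for tokens in turn_tokens:
        running += tokens  # transcript grows by this turn
        total += running   # and the whole transcript is billed again
    return total


def windowed_input_tokens(turn_tokens: list[int], window: int, summary_tokens: int) -> int:
    """Total input tokens when each turn sends the last `window` turns verbatim,
    plus a fixed-size summary once older turns have been dropped."""
    total = 0
    for i in range(len(turn_tokens)):
        start = max(0, i + 1 - window)
        recent = sum(turn_tokens[start : i + 1])
        total += recent + (summary_tokens if start > 0 else 0)
    return total


if __name__ == "__main__":
    turns = [100] * 10  # hypothetical: ten turns of 100 tokens each
    print(full_history_input_tokens(turns))
    print(windowed_input_tokens(turns, window=3, summary_tokens=50))
```

The full-history total grows with the square of the number of turns, while the windowed total grows linearly, which is why the gap widens on long calls.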

Route multimodal work like any other

Model routing applies to multimodal just as it does to text. Many multimodal tasks, classifying an image, extracting a field from a document, tagging a transcript, are not hard reasoning problems and do not need the most capable model. Running every multimodal call on Opus because a few hard cases need it is the same expensive default we see everywhere. Route the routine multimodal work to a lighter model in the Claude family and reserve the top model for the genuinely difficult cases, and the rate on the bulk of your volume falls. Across a realistic workload, routing across Opus, Sonnet, and Haiku typically cuts aggregate spend 40 to 70 percent versus uniform Opus use.
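A routing rule can start out very simple. The sketch below is one possible shape, not a prescribed implementation: the task labels, the `needs_deep_reasoning` flag, and the tier names are hypothetical, and the tiers would need to be mapped to the model identifiers you actually use.

```python
from dataclasses import dataclass

# Hypothetical task labels for routine multimodal work; adjust to your workload.
ROUTINE_TASKS = {"classify_image", "extract_field", "tag_transcript"}


@dataclass(frozen=True)
class Task:
    kind: str
    needs_deep_reasoning: bool = False  # hypothetical flag set by an upstream check


def choose_tier(task: Task) -> str:
    """Pick a model tier: the top tier only for hard cases, the lightest for routine work."""
    if task.needs_deep_reasoning:
        return "opus"
    if task.kind in ROUTINE_TASKS:
        return "haiku"
    return "sonnet"


if __name__ == "__main__":
    print(choose_tier(Task("classify_image")))
    print(choose_tier(Task("extract_field", needs_deep_reasoning=True)))
    print(choose_tier(Task("summarize_contract")))
```

Defaulting unknown task kinds to the middle tier is a design choice: it avoids both paying top rate by accident and sending an unclassified task to a model that may be too light for it.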

Why this belongs at the contract table

Multimodal spend matters when you size a commitment, because it is both large and easy to misjudge. If you forecast your Claude commitment from a baseline of full-resolution images, whole-document loads, and untrimmed transcripts, you commit to a number inflated by inputs you have not yet optimized. Unused commitment on Anthropic generally does not roll over, so overcommitting on an unoptimized multimodal forecast is money lost outright. Right-size the inputs, cache the stable content, route the routine work, then measure the real consumption and commit to that. A buyer who optimizes multimodal before negotiating commits to efficient usage and holds a stronger position on the rate.

The buyer-side summary

Images, documents, and transcripts are all priced in tokens, and on a voice or multimodal workload they are usually the largest and least examined part of the bill. Right-size images to the resolution the task needs, retrieve relevant document sections rather than whole files, trim and summarize transcripts, cache stable content for up to ninety percent off its repeated cost, keep responses concise, and route routine multimodal work to lighter models. Do that before you size a commitment, so your forecast reflects efficient inputs rather than full-resolution defaults. The result is a multimodal application that scales without the input cost scaling with it.

If you want to know what your images, documents, and transcripts are actually costing you, that is exactly where we start. The Token Optimization Field Guide covers multimodal alongside caching, routing, and batch, and a quote turns it into a plan for your workloads.