Why can a cheaper LLM cost more than an expensive one?

Because tokenization is not standardized. OpenAI uses tiktoken; Anthropic and Google use proprietary tokenizers, and the same text produces a different token count on each. A model with a lower per-token rate can split your text into more tokens, producing a higher bill for the identical input. The sticker rate ranks the rate, not the bill - and for code, JSON, or non-English text the ranking can flip.

What is a "tokenizer tax"?

The hidden cost difference that comes from one model tokenizing your text into more tokens than another. Claude Fable 5, for example, carried a roughly 35% effective cost premium over a naive token-for-token comparison once its 1M-context workload was actually tokenized and metered. That third of the bill is invisible on the pricing page and only appears when you run real content through the model.

How should I compare LLM API costs accurately?

Compare cost per task, not cost per token, and measure it on your own workload. Run a representative real task through each candidate model, read the token counts each API actually reports (input, output, cached splits), multiply by that model's real rates including any long-context or cache pricing, and rank by $/task at your quality bar. The cheapest model that still passes your eval wins - which is frequently not the one with the lowest sticker rate.

The Token Count Isn't the Bill: Why Tokenizer Differences Break Your LLM Cost Comparisons

Author: UsageBox

TL;DR (June 2026): The price-per-million-token number on a pricing page is not comparable across providers, because the token is not a standard unit. OpenAI tokenizes with tiktoken; Anthropic and Google use their own proprietary schemes - and the same prompt yields a different token count on each. So a model with a lower sticker rate can produce a higher bill for the identical text if its tokenizer splits that text into more tokens. We have already seen this bite at the high end: Claude Fable 5's effective cost carried a roughly 35% tokenizer tax versus a naive token-for-token comparison. The only honest way to compare LLM cost is to stop comparing $/token and start comparing $/task - run your real workload through each model, count the tokens each one actually charges, and multiply by that model's real rates.

Pricing pages invite a tempting shortcut: line up the per-million-token rates, pick the lowest, done. It is wrong often enough to be dangerous, and the reason is structural rather than a matter of a hidden fee. We touched on the symptom in "why price per token lies"; this is the specific mechanism underneath it - the unit you are comparing is defined differently by each vendor.

A token is a per-vendor unit, not a standard one

Tokenization is the step that chops your text into the chunks a model bills for, and it is not standardized. OpenAI's GPT models use the open tiktoken library; Anthropic and Google use proprietary tokenizers. Feed the same paragraph - the same code file, the same system prompt - to three providers and you get three different token counts. The rate card prices a unit that each vendor measures with a different ruler.

The consequence is that $/token comparisons are comparing different things. Provider A at a lower per-token rate can lose to Provider B at a higher rate if A's tokenizer turns your text into 20% more tokens. The sticker price ranks the rate; it does not rank the bill. For your specific content - and code, JSON, non-English text, and structured prompts all tokenize differently - the ranking can flip.
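To make the flip concrete, here is a minimal sketch with made-up numbers. The rates, the token counts, and the helper name `effective_cost` are all assumptions for illustration, not any provider's actual pricing or tokenizer output.

```python
def effective_cost(tokens: int, rate_per_million: float) -> float:
    """Dollar cost of `tokens` tokens at a per-million-token rate."""
    return tokens * rate_per_million / 1_000_000


# Hypothetical numbers: the same text, tokenized by two providers.
# Provider A has the lower sticker rate but produces 20% more tokens.
tokens_b = 100_000
tokens_a = 120_000               # 20% more tokens for the identical text
rate_a, rate_b = 2.60, 3.00      # assumed $/1M tokens

cost_a = effective_cost(tokens_a, rate_a)
cost_b = effective_cost(tokens_b, rate_b)

# A is cheaper per token, yet more expensive for this text.
print(f"A: ${cost_a:.4f}  B: ${cost_b:.4f}")
```

The break-even point is just the ratio of the rates: with these assumed rates, A stays cheaper only while its token count is below 3.00 / 2.60, or roughly 1.15 times B's count.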

The 35% tokenizer tax was the warning shot

This is not a rounding-error concern. When Claude Fable 5 launched, the real cost of its 1M-context capability carried what we measured as a ~35% tokenizer tax relative to a naive token-for-token comparison - detailed in the Fable 5 real-cost breakdown. A third of the bill, invisible on the pricing page, surfaced only when the same workload was actually tokenized and metered. If a tokenizer difference can move a flagship model's effective cost by a third, it can absolutely invert a "cheaper" pick on the workloads you run every day.

It also quietly corrupts every downstream estimate. A migration plan that assumes "Model B is 30% cheaper per token, so the bill drops 30%" is using the wrong unit; the real change depends on how Model B's tokenizer handles your text, which can erase or even reverse the savings. The list price told you about the rate. It told you nothing reliable about your bill.
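The arithmetic behind that migration estimate can be sketched as follows. The function name `real_bill_change` and the token ratios are hypothetical; the actual ratio for your content has to be measured, not assumed.

```python
def real_bill_change(rate_discount: float, token_ratio: float) -> float:
    """Fractional change in the bill after migrating to a new model.

    rate_discount: how much cheaper the new model is per token (0.30 = 30%).
    token_ratio:   new model's token count divided by the old model's count
                   for the same text.
    Returns a negative number for savings and a positive one for an increase.
    """
    return (1.0 - rate_discount) * token_ratio - 1.0


# Assumed token ratios for illustration only.
for ratio in (1.00, 1.20, 1.50):
    change = real_bill_change(0.30, ratio)
    print(f"token ratio {ratio:.2f}: bill changes by {change:+.0%}")
```

Under this simple model, a 30% per-token discount is fully erased once the new tokenizer produces about 1 / 0.70, or roughly 1.43 times as many tokens, and anything above that reverses the savings.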

Compare $/task, measured on your own workload

The fix is to change the unit of comparison from $/token to $/task, and to get the token counts from reality instead of arithmetic. The method follows the same discipline as measuring agent token cost and comparing list price vs real cost:

1. Pick a representative task from your actual workload - a real prompt, real context, real expected output - not a synthetic 1,000-token benchmark.
2. Run it through each candidate model and read the token counts the API actually reports for that model - input, output, and any cached split. Do not estimate; let each vendor's tokenizer tell you its own count.
3. Multiply each model's real counts by its real rates (including long-context surcharges, cache reads/writes, and batch discounts where they apply) to get a true $/task per model.
4. Rank by $/task at your quality bar - the cheapest model that still passes your eval, not the cheapest sticker rate. A slightly higher rate that tokenizes your content tighter and one-shots the task can win outright.
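The steps above can be sketched in a few lines. Everything here is an assumption for illustration: the model names, the rates, the token counts, and the `Rates`/`Usage` field names are hypothetical and do not mirror any provider's response format. Long-context surcharges and batch discounts are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Assumed $/1M-token rates for one model (illustrative fields)."""
    input: float
    output: float
    cached_input: float


@dataclass
class Usage:
    """Token counts as reported by the provider for one run of the task."""
    input: int            # uncached input tokens
    output: int
    cached_input: int = 0


def cost_per_task(usage: Usage, rates: Rates) -> float:
    """Dollar cost of one task from reported counts and that model's rates."""
    return (
        usage.input * rates.input
        + usage.output * rates.output
        + usage.cached_input * rates.cached_input
    ) / 1_000_000


def cheapest_passing(results: dict[str, tuple[float, bool]]) -> str:
    """Name of the lowest-$/task model among those that passed the eval.

    Raises ValueError if no model passed.
    """
    passing = {name: cost for name, (cost, passed) in results.items() if passed}
    return min(passing, key=passing.get)


# Hypothetical measurements for one representative task: (cost, passed_eval).
results = {
    "model-a": (cost_per_task(Usage(12_000, 900, 4_000), Rates(2.60, 10.0, 0.26)), True),
    "model-b": (cost_per_task(Usage(10_000, 700, 4_000), Rates(3.00, 12.0, 0.30)), True),
    "model-c": (cost_per_task(Usage(11_000, 800), Rates(0.50, 2.0, 0.05)), False),
}

print(cheapest_passing(results))
```

In this made-up data, model-c has the lowest cost but fails the eval, and model-b beats model-a despite higher sticker rates because it uses fewer tokens for the same task.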

Done once per workload, this turns model selection from a pricing-page guess into a measured decision - and it is the only comparison that survives contact with a real bill.

The honest take

Treating the token as a universal unit is the most common cost-comparison mistake of 2026, and it is baked into nearly every "cheapest LLM" list. The token is a per-vendor artifact; the only number that travels is effective cost on your own workload, measured. That is also why a cost system has to record what each provider actually billed rather than re-deriving cost from a token count and a sticker rate - the sticker rate and the real bill diverge the moment tokenizers differ. UsageBox meters the tokens each provider actually charged, per model, so your cross-provider comparison - and your margin math - is built on the bill, not the brochure.

Key Topics

•tokenizer
•tiktoken
•cost per token
•cost per task
•LLM pricing comparison
•tokenizer tax
•AI cost
•model selection
•2026


