The Token Count Isn't the Bill: Why Tokenizer Differences Break Your LLM Cost Comparisons

The price-per-million-token number on a pricing page is not comparable across providers, because the token is not a standard unit. OpenAI tokenizes with tiktoken; Anthropic and Google use proprietary schemes, and the same prompt yields a different token count on each. So a model with a lower sticker rate can produce a higher bill for the identical text if its tokenizer splits that text into more tokens - Claude Fable 5 carried a ~35% tokenizer tax versus a naive token-for-token comparison. Code, JSON, and non-English text tokenize differently enough to flip a "cheaper" pick. The only honest comparison is $/task, not $/token: run a representative real task through each model, read the token counts each API actually reports, multiply by real rates (including long-context and cache pricing), and rank by cost-per-task at your quality bar.

TL;DR (June 2026): The price-per-million-token number on a pricing page is not comparable across providers, because the token is not a standard unit. OpenAI tokenizes with tiktoken; Anthropic and Google use their own proprietary schemes - and the same prompt yields a different token count on each. So a model with a lower sticker rate can produce a higher bill for the identical text if its tokenizer splits that text into more tokens. We have already seen this bite at the high end: Claude Fable 5's effective cost carried a roughly 35% tokenizer tax versus a naive token-for-token comparison. The only honest way to compare LLM cost is to stop comparing $/token and start comparing $/task - run your real workload through each model, count the tokens each one actually charges, and multiply by that model's real rates.

Pricing pages invite a tempting shortcut: line up the per-million-token rates, pick the lowest, done. It is wrong often enough to be dangerous, and the reason is structural rather than a hidden fee. We touched on the symptom in why price per token lies; this is the specific mechanism underneath it - the unit you are comparing is defined differently by each vendor.

A token is a per-vendor unit, not a standard one

Tokenization is the step that chops your text into the chunks a model bills for, and it is not standardized. OpenAI's GPT models use the open tiktoken library; Anthropic and Google use proprietary tokenizers. Feed the same paragraph - the same code file, the same system prompt - to three providers and you get three different token counts. The rate card prices a unit that each vendor measures with a different ruler.
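
To make the "different ruler" point concrete, here is a minimal sketch using three toy tokenizers. They are illustrative stand-ins invented for this example, not tiktoken and not any vendor's real scheme; the function names and the sample string are assumptions. The only point is that the same text produces different counts under different splitting rules.

```python
import re

def whitespace_tokens(text: str) -> list[str]:
    """Toy tokenizer A: split on whitespace only."""
    return text.split()

def word_punct_tokens(text: str) -> list[str]:
    """Toy tokenizer B: each word and each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def fixed_chunk_tokens(text: str, size: int = 4) -> list[str]:
    """Toy tokenizer C: fixed-width character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# A small JSON payload; structured text tends to expose the differences.
sample = '{"user_id": 42, "tags": ["a", "b"]}'

counts = {
    "whitespace": len(whitespace_tokens(sample)),
    "word_punct": len(word_punct_tokens(sample)),
    "fixed_chunk": len(fixed_chunk_tokens(sample)),
}
print(counts)
```

Real tokenizers are far more sophisticated than these, but the effect is the same in kind: the count belongs to the tokenizer, not to the text.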

The consequence is that $/token comparisons are comparing different things. Provider A at a lower per-token rate can lose to Provider B at a higher rate if A's tokenizer turns your text into 20% more tokens: at that inflation, A's bill is the higher one whenever its rate discount is smaller than about 17% (1 - 1/1.2). The sticker price ranks the rate; it does not rank the bill. For your specific content - and code, JSON, non-English text, and structured prompts all tokenize differently - the ranking can flip.

The 35% tokenizer tax was the warning shot

This is not a rounding-error concern. When Claude Fable 5 launched, the real cost of its 1M-context capability carried what we measured as a ~35% tokenizer tax relative to a naive token-for-token comparison - detailed in the Fable 5 real-cost breakdown. Roughly a third on top of the naive estimate, invisible on the pricing page, surfaced only when the same workload was actually tokenized and metered. If a tokenizer difference can move a flagship model's effective cost by a third, it can absolutely invert a "cheaper" pick on the workloads you run every day.

It also quietly corrupts every downstream estimate. A migration plan that assumes "Model B is 30% cheaper per token, so the bill drops 30%" is using the wrong unit; the real change depends on how Model B's tokenizer handles your text, which can erase or even reverse the savings. The list price told you about the rate. It told you nothing reliable about your bill.
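
The arithmetic behind that migration mistake is short enough to sketch. The bill scales with the rate ratio times the token ratio, so a discount on the rate alone says little. The function names and the token-inflation values below are hypothetical inputs chosen for illustration, not measurements of any model.

```python
def bill_ratio(rate_ratio: float, token_ratio: float) -> float:
    """New bill divided by old bill for the same text.

    rate_ratio:  new $/token divided by old $/token (0.70 means 30% cheaper).
    token_ratio: new token count divided by old token count.
    """
    return rate_ratio * token_ratio

def break_even_token_ratio(rate_ratio: float) -> float:
    """Token inflation at which the cheaper rate saves exactly nothing."""
    return 1.0 / rate_ratio

rate_ratio = 0.70  # "Model B is 30% cheaper per token"

# Hypothetical token-inflation values for Model B's tokenizer on your text.
for token_ratio in (1.00, 1.20, 1.35, 1.50):
    change = (bill_ratio(rate_ratio, token_ratio) - 1.0) * 100
    print(f"tokens x{token_ratio:.2f} -> bill changes by {change:+.1f}%")

print(f"break-even token ratio: {break_even_token_ratio(rate_ratio):.3f}")
```

Under these assumed inputs, the full 30% saving only appears when both tokenizers produce the same count. Once Model B needs more than about 1.43 times the tokens, the "cheaper" model costs more.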

Compare $/task, measured on your own workload

The fix is to change the unit of comparison from $/token to $/task, and to get the token counts from reality instead of arithmetic. The method follows the same discipline as measuring agent token cost and list price vs real cost:

  1. Pick a representative task from your actual workload - a real prompt, real context, real expected output - not a synthetic 1,000-token benchmark.
  2. Run it through each candidate model and read the token counts the API actually reports for that model - input, output, and any cached split. Do not estimate; let each vendor's tokenizer tell you its own count.
  3. Multiply each model's real counts by its real rates (including long-context surcharges, cache reads/writes, and batch discounts where they apply) to get a true $/task per model.
  4. Rank by $/task at your quality bar - the cheapest model that still passes your eval, not the cheapest sticker rate. A slightly higher rate that tokenizes your content tighter and one-shots the task can win outright.

Done once per workload, this turns model selection from a pricing-page guess into a measured decision - and it is the only comparison that survives contact with a real bill.
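
The four steps can be sketched as a small calculation. Everything named here is an assumption for illustration: the `Usage` and `Rates` types, the model names, and all token counts and prices are made up. In practice the counts would come from each provider's reported usage for your task, and the rates would be whatever applies after long-context surcharges or batch discounts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rates:
    """USD per million tokens. All figures used below are invented."""
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float = 0.0
    cache_write_per_m: float = 0.0

@dataclass(frozen=True)
class Usage:
    """Token counts as reported by the provider for one run of the task."""
    input_tokens: int  # uncached input tokens
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0

def cost_per_task(usage: Usage, rates: Rates) -> float:
    """Dollar cost of one task from reported counts and applicable rates."""
    micro_dollars = (
        usage.input_tokens * rates.input_per_m
        + usage.output_tokens * rates.output_per_m
        + usage.cache_read_tokens * rates.cache_read_per_m
        + usage.cache_write_tokens * rates.cache_write_per_m
    )
    return micro_dollars / 1_000_000

def rank(candidates: dict[str, tuple[Usage, Rates, bool]]) -> list[tuple[str, float]]:
    """Cheapest-first ranking of the models that passed the quality eval."""
    costs = [
        (name, cost_per_task(usage, rates))
        for name, (usage, rates, passed_eval) in candidates.items()
        if passed_eval
    ]
    return sorted(costs, key=lambda item: item[1])

# Hypothetical results for the same task: (reported usage, rates, passed eval?)
candidates = {
    "model_a": (Usage(12_000, 1_800), Rates(2.50, 10.00), True),
    "model_b": (Usage(9_500, 1_400), Rates(3.00, 12.00), True),
    "model_c": (Usage(9_000, 1_200), Rates(1.00, 4.00), False),
}

for name, cost in rank(candidates):
    print(f"{name}: ${cost:.4f} per task")
```

With these invented numbers, model_b has the higher sticker rate but the lower cost per task because it reports fewer tokens for the same work. model_c is the cheapest on paper but is excluded because it fails the eval.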

The honest take

Treating the token as a universal unit is the most common cost-comparison mistake of 2026, and it is baked into nearly every "cheapest LLM" list. The token is a per-vendor artifact; the only number that travels is effective cost on your own workload, measured. That is also why a cost system has to record what each provider actually billed rather than re-deriving cost from a token count and a sticker rate - the sticker rate and the real bill diverge the moment tokenizers differ. UsageBox meters the tokens each provider actually charged, per model, so your cross-provider comparison - and your margin math - is built on the bill, not the brochure.

Key Topics

  • tokenizer
  • tiktoken
  • cost per token
  • cost per task
  • LLM pricing comparison
  • tokenizer tax
  • AI cost
  • model selection
  • 2026
