TL;DR (June 2026): The price-per-million-token number on a pricing page is not comparable across providers, because the token is not a standard unit. OpenAI tokenizes with tiktoken; Anthropic and Google use their own proprietary schemes - and the same prompt yields a different token count on each. So a model with a lower sticker rate can produce a higher bill for the identical text if its tokenizer splits that text into more tokens. We have already seen this bite at the high end: Claude Fable 5's effective cost carried a roughly 35% tokenizer tax versus a naive token-for-token comparison. The only honest way to compare LLM cost is to stop comparing $/token and start comparing $/task - run your real workload through each model, count the tokens each one actually charges, and multiply by that model's real rates.
Pricing pages invite a tempting shortcut: line up the per-million-token rates, pick the lowest, done. It is wrong often enough to be dangerous, and the reason is structural rather than a matter of a hidden fee. We touched on the symptom in why price per token lies; this is the specific mechanism underneath it - the unit you are comparing is defined differently by each vendor.
A token is a per-vendor unit, not a standard one
Tokenization is the step that chops your text into the chunks a model bills for, and it is not standardized. OpenAI's GPT models use the open tiktoken library; Anthropic and Google use proprietary tokenizers. Feed the same paragraph - the same code file, the same system prompt - to three providers and you get three different token counts. The rate card prices a unit that each vendor measures with a different ruler.
The consequence is that $/token comparisons compare different things. Provider A at a lower per-token rate can lose to Provider B at a higher rate if A's tokenizer turns your text into 20% more tokens: unless A's rate is more than about 17% lower than B's, A's bill is the larger one. The sticker price ranks the rate; it does not rank the bill. For your specific content the ranking can flip, because code, JSON, non-English text, and structured prompts all tokenize differently.
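To make the flip concrete, here is a minimal sketch of the arithmetic. Every number in it is hypothetical: the provider names, the rates, and the token counts are invented for illustration and are not measurements of any real model.

```python
# Hypothetical numbers throughout: not real rates, not real token counts.

def bill_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` tokens at a per-million-token rate."""
    return tokens * usd_per_million / 1_000_000

# The same text, as counted by two different (hypothetical) tokenizers.
tokens_a = 1_200_000   # Provider A splits the text into 20% more tokens
tokens_b = 1_000_000

rate_a = 2.70          # USD per million tokens: 10% below B on the rate card
rate_b = 3.00

cost_a = bill_usd(tokens_a, rate_a)
cost_b = bill_usd(tokens_b, rate_b)

# A wins on the rate card but loses on the bill.
print(f"A: ${cost_a:.2f}  B: ${cost_b:.2f}")
```

With these inputs, Provider A has the lower rate and the higher bill, which is the whole problem in two lines of multiplication.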
The 35% tokenizer tax was the warning shot
This is not a rounding-error concern. When Claude Fable 5 launched, the real cost of its 1M-context capability carried what we measured as a ~35% tokenizer tax relative to a naive token-for-token comparison - detailed in the Fable 5 real-cost breakdown. Roughly a third on top of the naive estimate, invisible on the pricing page, surfaced only when the same workload was actually tokenized and metered. If a tokenizer difference can move a flagship model's effective cost by a third, it can absolutely invert a "cheaper" pick on the workloads you run every day.
It also quietly corrupts every downstream estimate. A migration plan that assumes "Model B is 30% cheaper per token, so the bill drops 30%" is using the wrong unit; the real change depends on how Model B's tokenizer handles your text, which can erase or even reverse the savings. The list price told you about the rate. It told you nothing reliable about your bill.
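The migration arithmetic can be sketched in a few lines. This is an illustrative model, not a measurement: it assumes the bill is simply tokens times rate, and the inflation values are example inputs rather than observed tokenizer behavior. The function names are hypothetical.

```python
def bill_ratio(rate_discount: float, token_inflation: float) -> float:
    """New bill divided by old bill.

    rate_discount:   0.30 means the new model's per-token rate is 30% lower.
    token_inflation: 0.20 means its tokenizer yields 20% more tokens
                     for the same text.
    """
    return (1 - rate_discount) * (1 + token_inflation)


def break_even_inflation(rate_discount: float) -> float:
    """Token inflation at which the rate discount is fully cancelled."""
    return 1 / (1 - rate_discount) - 1


# A "30% cheaper per token" model under different (illustrative) inflations.
for inflation in (0.0, 0.20, 0.35, 0.50):
    change = bill_ratio(0.30, inflation) - 1
    print(f"{inflation:>4.0%} more tokens -> bill changes by {change:+.1%}")

print(f"break-even inflation: {break_even_inflation(0.30):.1%}")
```

Under this simple model, a 30% rate discount is erased once the new tokenizer produces about 43% more tokens for the same text, and anything beyond that makes the "cheaper" model more expensive.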
Compare $/task, measured on your own workload
The fix is to change the unit of comparison from $/token to $/task, and to get the token counts from reality instead of arithmetic. The method follows the same discipline as measuring agent token cost and list price vs real cost:
- Pick a representative task from your actual workload - a real prompt, real context, real expected output - not a synthetic 1,000-token benchmark.
- Run it through each candidate model and read the token counts the API actually reports for that model: input, output, and the cached versus uncached breakdown where one is reported. Do not estimate; let each vendor's tokenizer tell you its own count.
- Multiply each model's real counts by its real rates (including long-context surcharges, cache reads/writes, and batch discounts where they apply) to get a true $/task per model.
- Rank by $/task at your quality bar - the cheapest model that still passes your eval, not the cheapest sticker rate. A slightly higher rate that tokenizes your content tighter and one-shots the task can win outright.
Done once per workload, this turns model selection from a pricing-page guess into a measured decision - and it is the only comparison that survives contact with a real bill.
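The steps above can be sketched as a small ranking routine. This is a minimal illustration under stated assumptions: the `Rates` and `Usage` field names are hypothetical and do not match any vendor's response schema, the model names, rates, and counts are invented, and long-context surcharges and batch discounts are left out for brevity (they would be folded into the rates you pass in).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Illustrative fields, not a vendor schema."""
    input_usd_per_m: float
    output_usd_per_m: float
    cached_usd_per_m: float


@dataclass(frozen=True)
class Usage:
    """Token counts copied from what the provider's API reported for one task."""
    input_tokens: int            # uncached input tokens
    output_tokens: int
    cached_input_tokens: int = 0


def cost_per_task(usage: Usage, rates: Rates) -> float:
    """Reported counts times that model's own rates, in USD."""
    return (
        usage.input_tokens * rates.input_usd_per_m
        + usage.output_tokens * rates.output_usd_per_m
        + usage.cached_input_tokens * rates.cached_usd_per_m
    ) / 1_000_000


def rank_by_task_cost(runs):
    """runs: list of (model_name, usage, rates, passed_eval).

    Returns (model_name, cost) pairs for models that passed, cheapest first.
    """
    passing = [(name, cost_per_task(u, r)) for name, u, r, ok in runs if ok]
    return sorted(passing, key=lambda pair: pair[1])


# Hypothetical models, rates, and counts for one representative task.
runs = [
    ("model-a", Usage(12_000, 1_500), Rates(2.00, 8.00, 0.50), True),
    ("model-b", Usage(9_000, 1_100), Rates(2.50, 10.00, 0.25), True),
    ("model-c", Usage(8_000, 900), Rates(0.50, 2.00, 0.10), False),  # failed eval
]

for name, cost in rank_by_task_cost(runs):
    print(f"{name}: ${cost:.4f} per task")
```

In this made-up data, model-b has the higher rate card but the lower cost per task because it used fewer tokens, and model-c is excluded despite being cheapest because it did not pass the eval. Filtering on the quality bar before sorting keeps a failing model from ever winning on price.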
The honest take
Treating the token as a universal unit is the most common cost-comparison mistake of 2026, and it is baked into nearly every "cheapest LLM" list. The token is a per-vendor artifact; the only number that travels is effective cost on your own workload, measured. That is also why a cost system has to record what each provider actually billed rather than re-deriving cost from a token count and a sticker rate - the sticker rate and the real bill diverge the moment tokenizers differ. UsageBox meters the tokens each provider actually charged, per model, so your cross-provider comparison - and your margin math - is built on the bill, not the brochure.