Why does prompt caching make my AI cost tracking inaccurate?

Because a cached request still reports the full input-token count in the API response, but the provider bills those tokens at a cache-read rate (around 10% of standard input) rather than the standard rate. Any tracker that multiplies total input tokens by the standard input rate therefore overstates cost on cache-heavy workloads, by a factor that approaches 10x as the workload approaches 100% cache reads. The LiteLLM team logged this exact problem as "Anthropic cost tracking inaccurate for cached usage" (LIT-3771) in June 2026.

How should I track cached token usage correctly?

Log three separate counters per request instead of one: uncached input tokens (standard rate), cache write tokens (the one-time write premium), and cache read tokens (the discounted reuse rate). Price each at its own rate. This gives you the true cost and your cache hit ratio (reads divided by total cacheable input), which tells you whether caching is working before the invoice does. Rolling reads and writes into a single input count hides both numbers.

How much can prompt caching actually save?

Published results in 2026 range from 59% to 70% on real workloads, and caching (up to 90% off cached reads at Anthropic and OpenAI) can stack with batch processing (around 50% off) to bring effective per-call cost to roughly 25% of standard rates. But those savings are only visible if your meter keeps cache reads, cache writes, and uncached input separate - otherwise the dashboard cannot show whether the optimization paid off.

Prompt Caching Is Quietly Breaking Your AI Cost Tracking (Cache Reads vs Writes, and the Numbers That Lie)

Author: UsageBox

TL;DR (June 2026): Prompt caching is the best per-call cost lever there is - up to 90% off repeated context at Anthropic and OpenAI, stackable with batch discounts to roughly a quarter of standard rates. But it quietly breaks the one thing a cost system exists to do: report the truth. A cached request still emits a large input-token count, so any tracker that multiplies total input tokens by the standard input rate overstates your spend on cache-heavy workloads - and, worse, hides whether caching is working at all. The failure is visible in the wild: the LiteLLM team's June stability sprint logged "Anthropic cost tracking inaccurate for cached usage" (LIT-3771), with an enterprise customer confirming it in production. The fix is not a discount - it is an accounting rule: meter cache writes, cache reads, and uncached input as three separate things, price each at its own rate, and your dashboard goes from lying to load-bearing.

Most of the prompt-caching writing in 2026 is a savings pitch, and the savings are real - we walked through the receipts in the Anthropic vs OpenAI vs Gemini caching cost math. This is the less glamorous follow-up: once caching is on, your cost numbers are probably wrong, and they are wrong in the direction that makes you distrust the savings you just earned. If you cannot see the win, you will not defend it the next time finance asks why the bill moved.

Why a cached call still looks expensive to a naive tracker

The mechanics are simple and that is exactly why they get missed. When you reuse a large system prompt or document context, the provider does not bill that context at the standard input rate. It bills a cache write the first time (often at a premium over standard input), then a cache read on every subsequent call at a deep discount - around 10% of the input rate at Anthropic and OpenAI. The token count in the API response, however, still reports the full context. A 900K-token cached prompt reports 900K input tokens whether it was a cache miss billed at the write premium or a cache hit billed at a tenth of the standard rate.

So the bug writes itself. A cost tracker that does input_tokens × standard_input_rate - the default in most homegrown dashboards and more than one vendor SDK - will report a cache-heavy workload at close to its uncached price. On a workload that is 90% cache reads billed at a tenth of the input rate, the real input cost is 0.1 + 0.9 × 0.1 = 0.19 of the naive figure, so you are reporting roughly five times the real input cost, and the overstatement approaches ten times as the read share approaches 100%. As our piece on measuring agent token cost argued, the estimate is not the bill; with caching, the gap between the two stops being a rounding error and becomes the whole story.
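The size of the gap depends on how much of the input is served from cache. The sketch below compares the single-bucket estimate with a split-rate calculation; the prices are illustrative assumptions (a cache-read rate fixed at 10% of the input rate), the function names are hypothetical, and cache-write premiums are left out to keep the arithmetic visible.

```python
# Illustrative prices in USD per million tokens (assumptions, not a real price sheet).
STANDARD_INPUT_RATE = 3.00
CACHE_READ_RATE = 0.30  # assumed to be 10% of the standard input rate


def naive_input_cost(total_input_tokens: int) -> float:
    """Single-bucket estimate: every input token priced at the standard rate."""
    return total_input_tokens * STANDARD_INPUT_RATE / 1_000_000


def split_input_cost(uncached_tokens: int, cache_read_tokens: int) -> float:
    """Price uncached input and cache reads at their own rates."""
    return (
        uncached_tokens * STANDARD_INPUT_RATE
        + cache_read_tokens * CACHE_READ_RATE
    ) / 1_000_000


def overstatement_factor(total_input_tokens: int, cache_read_share: float) -> float:
    """How many times the naive estimate exceeds the split-rate cost."""
    cache_read_tokens = round(total_input_tokens * cache_read_share)
    uncached_tokens = total_input_tokens - cache_read_tokens
    return naive_input_cost(total_input_tokens) / split_input_cost(
        uncached_tokens, cache_read_tokens
    )


if __name__ == "__main__":
    for share in (0.50, 0.90, 0.99, 1.00):
        factor = overstatement_factor(1_000_000, share)
        print(f"{share:.0%} cache reads -> naive estimate is {factor:.2f}x the split cost")
```

Under these assumed rates the factor is about 5.3x at a 90% read share and reaches 10x only when every input token is a cache read.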

This is not hypothetical - it is in the issue tracker

The clearest evidence that this is a real, current problem rather than a theoretical one came out of LiteLLM, the open-source proxy a huge share of teams route their LLM traffic through. Its June 15 "Stability Sprint" roadmap surfaced, among the structural cost-calculator fixes, the line item "Anthropic cost tracking inaccurate for cached usage" (LIT-3771) - with a LiteLLM Enterprise customer commenting that they were "experiencing this exact issue in production." When the most widely deployed LLM cost-tracking layer has an open ticket for cached-usage accuracy, the safe assumption is that your own tracker has the same bug and no ticket.

It compounds across the stack. If your proxy mis-prices cached usage, every downstream artifact inherits the error: the per-customer margin report, the per-feature cost allocation, the budget alert that fires too early (or, if your meter drops the cache-write premium, the one that fires too late). Caching is supposed to be the safe optimization. Mis-metered, it becomes the one that makes your whole cost ledger untrustworthy.

The accounting rule that fixes it

Braintrust's 2026 token-tracking guidance puts the rule plainly: cached and uncached tokens must be logged separately, because rolling them into a single input count overstates spend on cache-heavy workloads and obscures whether caching is reducing costs at all. ProjectDiscovery's published 59% caching saving was only knowable because they compared effective per-token rates against what the same volume would have cost at standard rates - you cannot compute that ratio if your meter collapses reads and writes into one bucket. Concretely, instrument three counters per request, not one:

Uncached input tokens at the standard input rate - the genuinely new context.
Cache write tokens at the cache-write rate - the one-time premium to seed the cache.
Cache read tokens at the cache-read rate (~10% of input) - where the savings actually live.

With those three lines, two numbers fall out that the single-bucket meter can never produce: your true cost (sum the three at their real rates) and your cache hit ratio (reads ÷ total cacheable input), which is the leading indicator of whether the optimization is paying off before the invoice confirms it. That is the difference between a dashboard that reports the past and one that lets you steer.
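A minimal sketch of the three-counter meter, assuming the counters have already been split out of each provider response. The class and field names are hypothetical, the rates in the usage example are illustrative, and "total cacheable input" is taken here to mean cache reads plus cache writes, which is one reasonable reading rather than a fixed definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rates:
    """USD per million tokens; supply values from your provider's price sheet."""
    uncached_input: float
    cache_write: float
    cache_read: float


@dataclass(frozen=True)
class RequestUsage:
    """Three counters per request instead of one (hypothetical field names)."""
    uncached_input_tokens: int
    cache_write_tokens: int
    cache_read_tokens: int


def summarize(usages: Iterable[RequestUsage], rates: Rates) -> dict:
    """Return true cost, cache hit ratio, and saving versus all-standard pricing."""
    uncached = writes = reads = 0
    for u in usages:
        uncached += u.uncached_input_tokens
        writes += u.cache_write_tokens
        reads += u.cache_read_tokens

    true_cost = (
        uncached * rates.uncached_input
        + writes * rates.cache_write
        + reads * rates.cache_read
    ) / 1_000_000
    # What the same token volume would have cost with no caching at all.
    standard_cost = (uncached + writes + reads) * rates.uncached_input / 1_000_000
    cacheable = writes + reads  # assumption: cacheable input = writes + reads

    return {
        "true_cost": true_cost,
        "cache_hit_ratio": reads / cacheable if cacheable else 0.0,
        "saving_vs_standard": 1 - true_cost / standard_cost if standard_cost else 0.0,
    }


if __name__ == "__main__":
    rates = Rates(uncached_input=3.00, cache_write=3.75, cache_read=0.30)
    # One request seeds a 90K-token cache; nine more reuse it.
    usages = [RequestUsage(10_000, 90_000, 0)] + [RequestUsage(10_000, 0, 90_000)] * 9
    print(summarize(usages, rates))
```

The saving figure is the same comparison described above: effective spend against what the identical volume would have cost at the standard rate, which is only computable because reads and writes were never merged.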

The worked example: a one-word message that cost about $20

The clearest illustration of all of this arrived in late July 2026, from a subscriber who had exhausted a $200 Max plan and topped up $250 in credits. Their next action was to send a single test message: "hey". The on-screen token count showed almost nothing. Their credit balance dropped by roughly $20. Longer messages sent afterwards in the same conversation cost noticeably less.

That looks like a billing fault and is not one. Two mechanisms account for it, and both are invisible in a per-message token display:

Every message rebills the whole conversation. Pressing enter sends your system context, MCP definitions, skills, and every prior turn from both sides. The billed input for a one-word message is the entire accumulated context, which on a long session runs to hundreds of thousands of tokens. The word "hey" is a rounding error inside it.
A new conversation is not empty. As the best explanation in that discussion pointed out, most of the cost of a fresh chat is not your text at all: it is memories, custom instructions, and in some configurations recent conversation history injected on your behalf. You did not write it, you are not shown it, and you pay for it.

The same user later reported 847.4k tokens consumed for around $200, which is the arithmetic working exactly as designed once you accept that the billed unit is context rather than keystrokes. The messages that followed cost less because by then the expensive part was cached: that is the cache-write-then-cache-read curve from the section above, seen from the invoice instead of the API response.

Two consequences are worth taking away. First, per-message cost is a meaningless number to reason about; per-conversation cost with a cache-write and cache-read split is the only view that predicts a bill. Second, if your tracker attributes spend to the message that triggered it rather than to the context that carried it, your most expensive conversations will look like your cheapest, because the turn that pays the cache-write bill is usually the shortest thing in the session.
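The shape of that curve can be reproduced with a deliberately simplified model. The sketch below assumes that the first turn writes all injected context plus the message to the cache, that every later turn reads the accumulated context and writes only its new tokens, and that the cache never expires. Output tokens and assistant replies are ignored, the rates are illustrative, and the function name is hypothetical; it does not attempt to reproduce the dollar figures in the anecdote.

```python
# Illustrative prices in USD per million tokens (assumptions, not a real price sheet).
CACHE_WRITE_RATE = 3.75
CACHE_READ_RATE = 0.30


def turn_input_costs(injected_context_tokens: int, message_tokens: list[int]) -> list[float]:
    """Input cost of each turn under a write-once, read-thereafter cache model."""
    cached_tokens = 0  # context already sitting in the cache
    costs = []
    for turn, message in enumerate(message_tokens):
        # The first turn also carries everything injected on the user's behalf.
        new_tokens = message + (injected_context_tokens if turn == 0 else 0)
        cost = (
            cached_tokens * CACHE_READ_RATE + new_tokens * CACHE_WRITE_RATE
        ) / 1_000_000
        costs.append(cost)
        cached_tokens += new_tokens  # this turn's tokens are cached for the next one
    return costs


if __name__ == "__main__":
    # A one-token opener followed by two 500-token messages over 400K of context.
    for turn, cost in enumerate(turn_input_costs(400_000, [1, 500, 500])):
        print(f"turn {turn}: ${cost:.4f}")
```

In this model the one-token opener is the most expensive turn by an order of magnitude, and the longer messages that follow are cheap because they mostly pay the read rate on context that is already cached.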

The honest take

Prompt caching is not the problem; single-bucket token accounting is. The deeper lesson is the one we keep returning to in why usage metering needs its own database: a meter that records "tokens" instead of "billable events with a type and a rate" will be wrong the moment pricing stops being one flat number per token - and caching, tiered long-context pricing, and batch discounts mean it already has. The teams that trust their AI cost numbers in 2026 are the ones that record cache reads, cache writes, and uncached input as distinct, separately-priced events from the moment of ingest. UsageBox meters them that way on purpose, so the fast number on your dashboard equals the true number on the vendor invoice - including the caching win you actually earned.
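As an illustration of what "billable events with a type and a rate" might look like at ingest, here is a small sketch. The schema, names, and rates are assumptions made for the example and do not describe UsageBox's actual data model; `Decimal` is used so that summed money values do not pick up floating-point drift.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Iterable


class TokenKind(Enum):
    UNCACHED_INPUT = "uncached_input"
    CACHE_WRITE = "cache_write"
    CACHE_READ = "cache_read"


@dataclass(frozen=True)
class BillableEvent:
    """One priced line per token type per request (hypothetical schema)."""
    request_id: str
    kind: TokenKind
    tokens: int
    rate_per_mtok: Decimal  # USD per million tokens, recorded when the event is ingested

    @property
    def cost(self) -> Decimal:
        return Decimal(self.tokens) * self.rate_per_mtok / Decimal(1_000_000)


def cost_by_kind(events: Iterable[BillableEvent]) -> dict[TokenKind, Decimal]:
    """Total cost per token type, so reads and writes stay visible in reports."""
    totals = {kind: Decimal("0") for kind in TokenKind}
    for event in events:
        totals[event.kind] += event.cost
    return totals


if __name__ == "__main__":
    events = [
        BillableEvent("r1", TokenKind.UNCACHED_INPUT, 10_000, Decimal("3.00")),
        BillableEvent("r1", TokenKind.CACHE_WRITE, 90_000, Decimal("3.75")),
        BillableEvent("r2", TokenKind.CACHE_READ, 90_000, Decimal("0.30")),
    ]
    for kind, total in cost_by_kind(events).items():
        print(kind.value, total)
```

Storing the rate on the event rather than looking it up at report time is a design choice: a later price change or a new discount tier then adds a new kind or rate without rewriting history.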

Key Topics

• prompt caching
• cost tracking
• cache reads
• cache writes
• token accounting
• LiteLLM
• AI cost
• usage metering
• 2026
