Prompt Caching Is Quietly Breaking Your AI Cost Tracking (Cache Reads vs Writes, and the Numbers That Lie)

Prompt caching is the best per-call cost lever in 2026 - up to 90% off repeated context, stackable with batch discounts to ~25% of standard rates - but it quietly breaks cost tracking. A cached request still reports the full input-token count, so any tracker that multiplies total input tokens by the standard rate overstates spend on cache-heavy workloads (up to ~10x) and hides whether caching is working at all. The bug is real and current: the LiteLLM team logged "Anthropic cost tracking inaccurate for cached usage" (LIT-3771) in its June stability sprint, with an enterprise customer confirming it in production. The fix is an accounting rule, not a discount: meter cache writes, cache reads, and uncached input as three separately-priced events, and your dashboard goes from lying to load-bearing - surfacing both true cost and cache hit ratio.

TL;DR (June 2026): Prompt caching is the best per-call cost lever there is - up to 90% off repeated context at Anthropic and OpenAI, stackable with batch discounts to roughly a quarter of standard rates. But it quietly breaks the one thing a cost system exists to do: report the truth. A cached request still emits a large input-token count, so any tracker that multiplies total input tokens by the standard input rate overstates your spend on cache-heavy workloads - and, worse, hides whether caching is working at all. The failure is visible in the wild: the LiteLLM team's June stability sprint logged "Anthropic cost tracking inaccurate for cached usage" (LIT-3771), with an enterprise customer confirming it in production. The fix is not a discount - it is an accounting rule: meter cache writes, cache reads, and uncached input as three separate things, price each at its own rate, and your dashboard goes from lying to load-bearing.

Most of the prompt-caching writing in 2026 is a savings pitch, and the savings are real - we walked through the receipts in the Anthropic vs OpenAI vs Gemini caching cost math. This is the less glamorous follow-up: once caching is on, your cost numbers are probably wrong, and they are wrong in the direction that makes you distrust the savings you just earned. If you cannot see the win, you will not defend it the next time finance asks why the bill moved.

Why a cached call still looks expensive to a naive tracker

The mechanics are simple and that is exactly why they get missed. When you reuse a large system prompt or document context, the provider does not bill that context at the standard input rate. It bills a cache write the first time (often at a premium over standard input), then a cache read on every subsequent call at a deep discount - around 10% of the input rate at Anthropic and OpenAI. The token counts in the API response, however, still add up to the full context; depending on the provider, the cached portion is either folded into the headline input count or reported in separate fields. A 900K-token cached prompt shows up as 900K input tokens either way, whether it was a cache miss billed at the write premium or a cache hit billed at a tenth.
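To make the split concrete, here is a minimal sketch that normalizes two usage payloads into the same three-way breakdown. It works on plain dicts rather than SDK objects. The field names follow our reading of the providers' documented usage payloads (Anthropic's `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; OpenAI's `prompt_tokens` and `prompt_tokens_details.cached_tokens`), so check them against the current API references before relying on them. `InputBreakdown`, `from_anthropic`, and `from_openai` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class InputBreakdown(NamedTuple):
    uncached: int     # tokens billed at the standard input rate
    cache_write: int  # tokens billed at the cache-write rate
    cache_read: int   # tokens billed at the cache-read rate


def from_anthropic(usage: dict) -> InputBreakdown:
    # Assumption: the three categories arrive as separate fields and
    # input_tokens does not include the cached portions.
    return InputBreakdown(
        uncached=usage.get("input_tokens") or 0,
        cache_write=usage.get("cache_creation_input_tokens") or 0,
        cache_read=usage.get("cache_read_input_tokens") or 0,
    )


def from_openai(usage: dict) -> InputBreakdown:
    # Assumption: prompt_tokens includes the cached tokens, so they are
    # subtracted out to recover the uncached remainder.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return InputBreakdown(
        uncached=(usage.get("prompt_tokens") or 0) - cached,
        cache_write=0,  # this payload carries no separate cache-write count
        cache_read=cached,
    )
```

Both payload shapes describe the same request differently: one folds cached tokens into the headline count, the other splits them out. A tracker that reads only the headline field will over-count in the first case and under-count in the second, which is why normalizing at ingest matters.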

So the bug writes itself. A cost tracker that does input_tokens × standard_input_rate - the default in most homegrown dashboards and more than one vendor SDK - will report a cache-heavy workload at close to its uncached price. On a workload that is 90% cache reads, with reads billed at about 10% of the input rate, the naive figure is roughly five times the real input cost (1.0 against 0.9 × 0.1 + 0.1 = 0.19), and it climbs toward ten times as the read share approaches 100%. As our piece on measuring agent token cost argued, the estimate is not the bill; with caching, the gap between the two stops being a rounding error and becomes the whole story.
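The size of the overstatement is easy to check by hand. The sketch below assumes cache reads cost 10% of the standard input rate (the approximate figure quoted above) and ignores cache writes to keep the arithmetic simple; `overstatement_factor` is a hypothetical helper, not part of any library.

```python
def overstatement_factor(read_share: float, read_multiplier: float = 0.10) -> float:
    """Naive input cost divided by true input cost.

    read_share: fraction of input tokens served as cache reads (0.0 to 1.0).
    read_multiplier: cache-read price as a fraction of the standard input
        rate. The 0.10 default is an assumption based on the ~10% figure.
    """
    if not 0.0 <= read_share <= 1.0:
        raise ValueError("read_share must be between 0 and 1")
    naive = 1.0  # every token priced at the standard rate
    true = (1.0 - read_share) + read_share * read_multiplier
    return naive / true


for share in (0.50, 0.90, 0.99, 1.00):
    factor = overstatement_factor(share)
    print(f"{share:.0%} cache reads -> naive is {factor:.1f}x the true input cost")
```

Under these assumptions the factor is about 1.8x at 50% reads, 5.3x at 90%, 9.2x at 99%, and exactly 10x only when every input token is a cache read. Ten times is the ceiling, not the typical case, but even a 5x error is enough to bury the savings.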

This is not hypothetical - it is in the issue tracker

The clearest evidence that this is a real, current problem rather than a theoretical one came out of LiteLLM, the open-source proxy a huge share of teams route their LLM traffic through. Its June 15 "Stability Sprint" roadmap surfaced, among the structural cost-calculator fixes, the line item "Anthropic cost tracking inaccurate for cached usage" (LIT-3771) - with a LiteLLM Enterprise customer commenting that they were "experiencing this exact issue in production." When the most widely deployed LLM cost-tracking layer has an open ticket for cached-usage accuracy, the safe assumption is that your own tracker has the same bug and no ticket.

It compounds across the stack. If your proxy mis-prices cached usage, every downstream artifact inherits the error: the per-customer margin report, the per-feature cost allocation, the budget alert that fires too early (or, if you double-count cache writes, the one that never fires). Caching is supposed to be the safe optimization. Mis-metered, it becomes the one that makes your whole cost ledger untrustworthy.

The accounting rule that fixes it

Braintrust's 2026 token-tracking guidance puts the rule plainly: cached and uncached tokens must be logged separately, because rolling them into a single input count overstates spend on cache-heavy workloads and obscures whether caching is reducing costs at all. ProjectDiscovery's published 59% caching saving was only knowable because they compared effective per-token rates against what the same volume would have cost at standard rates - you cannot compute that ratio if your meter collapses reads and writes into one bucket. Concretely, instrument three counters per request, not one:

  1. Uncached input tokens at the standard input rate - the genuinely new context.
  2. Cache write tokens at the cache-write rate - the one-time premium to seed the cache.
  3. Cache read tokens at the cache-read rate (~10% of input) - where the savings actually live.

With those three lines, two numbers fall out that the single-bucket meter can never produce: your true cost (sum the three at their real rates) and your cache hit ratio (reads ÷ total cacheable input), which is the leading indicator of whether the optimization is paying off before the invoice confirms it. That is the difference between a dashboard that reports the past and one that lets you steer.
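A minimal sketch of the three-counter meter follows. The rates are illustrative placeholders in USD per million tokens, not a quote from any price list, and "total cacheable input" is taken here to mean cache reads plus cache writes, which is our assumption about the intended denominator. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputRates:
    """USD per million tokens. Placeholder values only."""
    uncached: float
    cache_write: float
    cache_read: float


@dataclass(frozen=True)
class InputCounters:
    """Token counts for one request, or summed over many requests."""
    uncached_tokens: int
    cache_write_tokens: int
    cache_read_tokens: int


def true_input_cost(c: InputCounters, r: InputRates) -> float:
    # Each counter is priced at its own rate.
    return (
        c.uncached_tokens * r.uncached
        + c.cache_write_tokens * r.cache_write
        + c.cache_read_tokens * r.cache_read
    ) / 1_000_000


def naive_input_cost(c: InputCounters, r: InputRates) -> float:
    # The single-bucket meter: everything priced at the standard rate.
    total = c.uncached_tokens + c.cache_write_tokens + c.cache_read_tokens
    return total * r.uncached / 1_000_000


def cache_hit_ratio(c: InputCounters) -> float:
    # Assumption: cacheable input = cache reads + cache writes.
    cacheable = c.cache_read_tokens + c.cache_write_tokens
    return c.cache_read_tokens / cacheable if cacheable else 0.0


rates = InputRates(uncached=3.00, cache_write=3.75, cache_read=0.30)
warm = InputCounters(uncached_tokens=2_000, cache_write_tokens=0,
                     cache_read_tokens=900_000)

print(f"true:  ${true_input_cost(warm, rates):.3f}")
print(f"naive: ${naive_input_cost(warm, rates):.3f}")
print(f"hit ratio: {cache_hit_ratio(warm):.0%}")
```

With these placeholder rates, the warm 900K-token request costs about $0.28 in input while the single-bucket meter reports about $2.71. The hit ratio is most informative when the counters are summed over a window of requests rather than read per call.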

The honest take

Prompt caching is not the problem; single-bucket token accounting is. The deeper lesson is the one we keep returning to in why usage metering needs its own database: a meter that records "tokens" instead of "billable events with a type and a rate" will be wrong the moment pricing stops being one flat number per token - and caching, tiered long-context pricing, and batch discounts mean it already has. The teams that trust their AI cost numbers in 2026 are the ones that record cache reads, cache writes, and uncached input as distinct, separately-priced events from the moment of ingest. UsageBox meters them that way on purpose, so the fast number on your dashboard equals the true number on the vendor invoice - including the caching win you actually earned.

Key Topics

  • prompt caching
  • cost tracking
  • cache reads
  • cache writes
  • token accounting
  • LiteLLM
  • AI cost
  • usage metering
  • 2026
