The Router Pattern: Cut AI Costs 45-85% by Sending Each Task to the Cheapest Capable Model

The frontier-to-workhorse price spread is now ~180x on output (Claude Fable 5 at $10/$50 per MTok vs DeepSeek V4 Flash at $0.14/$0.28), which makes model routing the largest single cost lever in production AI. This guide covers routing versus cascading, precisely defined; the published 45-85% savings numbers at ~95% retained quality; the 2026 gateway landscape (LiteLLM, OpenRouter, Cloudflare/Kong AI Gateway, Foundry router); the four failure modes; and why per-task metering is the non-skippable prerequisite that determines your actual ceiling.

11 min read

Tags: model routing, LLM cascade, AI cost optimization, LiteLLM, OpenRouter, DeepSeek, cost per task, AI FinOps, June 2026

TL;DR (June 2026): The price gap between frontier and workhorse models is now roughly two orders of magnitude: Claude Fable 5 at $10/$50 per million tokens versus DeepSeek V4 Flash at $0.14/$0.28, a spread of about 70x on input and 180x on output. Routing each request to the cheapest model that can actually handle it, whether by one-shot routing or sequential cascading, reliably cuts production AI bills by 45 to 85% while keeping ~95% of quality, per the published case studies and a 2026 academic survey of the field. A Hacker News post claiming a 99% cost cut by switching from Claude to DeepSeek went viral in late May; the durable version of that stunt is a router. Here is how the pattern works, when each variant wins, the gateway tooling that exists so you do not build it from scratch, and the one prerequisite nobody can skip: per-task, per-model measurement.

Every AI cost conversation in 2026 eventually arrives at the same observation: most of the requests you send to a frontier model did not need a frontier model. The boilerplate edit, the JSON extraction, the summary of a short document: a $0.30-per-million workhorse handles them indistinguishably from a $50-per-million flagship. The router pattern is the engineering response to that observation, and after two years of tooling maturation it has gone from blog-post idea to the default architecture for cost-conscious production AI. The viral version this spring was a Hacker News post titled "I cut my AI API costs 99% by switching from Claude to DeepSeek"; the threads beneath it, and the Ask HN thread "How are people forecasting AI API costs for agent workflows?", show where the real engineering interest is. This is the full picture.

The spread that makes routing worth it

Routing is only as valuable as the price gap it exploits, and the June 2026 gap is the widest the industry has seen:

| Tier | Example model | Input / Output per MTok | Output price vs Fable 5 |
| --- | --- | --- | --- |
| Frontier | Claude Fable 5 | $10 / $50 | 1x |
| Flagship | Claude Opus 4.8 | $5 / $25 | 2x cheaper |
| Reasoning workhorse | DeepSeek V4 Pro | $1.74 / $3.48 | ~14x cheaper |
| Workhorse | DeepSeek V4 Flash | $0.14 / $0.28 | ~180x cheaper |

Two structural notes make the spread even wider in practice. Caching multiplies it: DeepSeek's cache hits price input at $0.0028 per million, and Anthropic's cached reads at $1 per million, so a router that also routes repeated context intelligently stacks a second discount on top (the mechanics are in our prompt caching guide). And the cheap tier is genuinely capable now: the budget agentic class handles tool use and structured output that required a flagship eighteen months ago.
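
The ratios are easy to re-derive whenever a provider reprices. A minimal sketch, using only the list prices quoted in this article; the dictionary keys and the `spread` helper are illustrative names, not part of any SDK:

```python
# USD per million tokens (input, output), copied from the table above.
PRICES = {
    "Claude Fable 5": (10.00, 50.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "DeepSeek V4 Flash": (0.14, 0.28),
}

# Cached-input prices quoted in the paragraph above, USD per million tokens.
CACHED_INPUT = {"Anthropic": 1.00, "DeepSeek": 0.0028}


def spread(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (input_ratio, output_ratio) between two models in PRICES."""
    exp_in, exp_out = PRICES[expensive]
    cheap_in, cheap_out = PRICES[cheap]
    return exp_in / cheap_in, exp_out / cheap_out


in_ratio, out_ratio = spread("Claude Fable 5", "DeepSeek V4 Flash")
cached_ratio = CACHED_INPUT["Anthropic"] / CACHED_INPUT["DeepSeek"]

print(f"input: {in_ratio:.0f}x, output: {out_ratio:.0f}x, cached input: {cached_ratio:.0f}x")
# → input: 71x, output: 179x, cached input: 357x
```

The input and output ratios differ by more than 2x, so the spread you actually see depends on how output-heavy your traffic is.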

Routing versus cascading, precisely

The two variants get conflated constantly, and they have different cost profiles:

  • Routing is a one-shot decision. A classifier (heuristics, a small model, or learned embeddings) inspects the request before execution and sends it to exactly one model. Cost: one inference plus a near-free classification. Risk: misclassification sends a hard task to a weak model and you ship a bad answer. Best for: high-volume traffic with recognizable task shapes, such as support triage, extraction, and code completion.
  • Cascading is sequential escalation. Every request tries the cheapest model first; a verifier (confidence score, self-check, or rubric) decides whether the answer is good enough, and failures escalate to the next tier. Cost: one cheap inference on every request, plus a second, pricier inference whenever the cheap model fails. Risk profile: much safer on quality, because the strong model backstops everything. Best for: tasks where wrong answers are expensive and volume is moderate.
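
The difference between the two variants is easiest to see side by side. The sketch below is a toy under stated assumptions: `Model`, `route`, `cascade`, and `is_json` are hypothetical names, the "models" are lambdas standing in for real API clients, and the costs are arbitrary units rather than real prices.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    cost: float                   # illustrative cost units per call, not real prices
    call: Callable[[str], str]    # stand-in for a real API client


def route(prompt: str, classify: Callable[[str], str],
          models: dict[str, Model]) -> tuple[str, float]:
    """One-shot routing: classify before execution, then call exactly one model."""
    model = models[classify(prompt)]
    return model.call(prompt), model.cost


def cascade(prompt: str, tiers: list[Model],
            verify: Callable[[str], bool]) -> tuple[str, float]:
    """Sequential escalation: cheapest tier first, escalate while the verifier fails."""
    if not tiers:
        raise ValueError("cascade needs at least one tier")
    spent = 0.0
    answer = ""
    for model in tiers:
        answer = model.call(prompt)
        spent += model.cost
        if verify(answer):
            break
    # If every tier fails verification, the last (strongest) answer is returned.
    return answer, spent


def is_json(text: str) -> bool:
    """A schema-style verifier: pass only if the output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Toy models: the cheap one only copes with short prompts.
cheap = Model("cheap", 1.0, lambda p: '{"ok": true}' if len(p) < 40 else "sorry, I got confused")
strong = Model("strong", 50.0, lambda p: '{"ok": true}')
models = {"cheap": cheap, "strong": strong}

easy = "Extract the date."
hard = "Reconcile these three conflicting contract clauses and explain why."

print(route(hard, lambda p: "cheap" if len(p) < 40 else "strong", models))  # one call, strong model
print(cascade(easy, [cheap, strong], is_json))   # cheap answer passes, no escalation
print(cascade(hard, [cheap, strong], is_json))   # cheap answer fails, pays for both tiers
```

Swapping in a classifier that always answers `"cheap"` shows the routing risk directly: the hard prompt comes back as unparseable text, and nothing downstream catches it.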

The published numbers cluster tightly: 45 to 85% cost reduction at roughly 95% retained quality across the case studies, with routing plus semantic caching reported at 60%+ and full cascades at the high end. A 2026 academic survey (arXiv 2603.04445, "Dynamic Model Routing and Cascading for Efficient LLM Inference") now catalogs dozens of production-grade techniques, which is the moment a pattern stops being a trick and becomes infrastructure.

Why the 99% headline is real and also not your number

The HN poster who cut costs 99% did it by moving everything from a frontier model to DeepSeek. That is not routing but repricing, and it only works if literally none of your traffic needed the frontier model, in which case you were simply overpaying before. A router's savings are bounded by your actual task mix: if 70% of your traffic by token volume is workhorse-suitable, routing that 70% to a model 50x cheaper leaves you paying 0.30 + 0.70/50 ≈ 0.31 of the original bill, a cut of roughly two thirds, and no architecture can do better without changing the work itself. This is why the honest first step is not picking a gateway but measuring your task distribution: what fraction of your requests, by token volume, could a cheap model serve at acceptable quality? Teams that meter per task already know this number. Teams that do not are guessing at the single variable that determines the whole project's ROI, the same measurement gap we keep finding in list price versus real cost.
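
The ceiling reduces to one line of arithmetic. A minimal sketch, assuming the bill is proportional to token volume and that a single price ratio applies to everything you downgrade; `routing_savings` is an illustrative name:

```python
def routing_savings(cheap_share: float, price_ratio: float) -> float:
    """Fraction of the bill saved when `cheap_share` of token volume
    moves to a model that is `price_ratio` times cheaper."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    if price_ratio < 1.0:
        raise ValueError("price_ratio must be at least 1")
    return cheap_share * (1.0 - 1.0 / price_ratio)


# The example from the text: 70% of traffic, 50x cheaper model.
print(f"{routing_savings(0.70, 50):.1%}")    # → 68.6%

# The 99% headline: everything moved, ~180x cheaper model.
print(f"{routing_savings(1.00, 180):.1%}")   # → 99.4%
```

Because `1 - 1/price_ratio` is already close to 1 at these spreads, the savings track `cheap_share` almost one for one. A wider price gap barely moves the number; a better-measured task mix does.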

The build: four layers, mostly off the shelf

  1. Gateway. You almost certainly should not write the dispatch layer yourself. LiteLLM and OpenRouter dominate the open and hosted ends respectively; Cloudflare AI Gateway, Kong AI Gateway, and Bifrost serve the platform-team crowd; Microsoft Foundry ships a managed model router. All of them speak the OpenAI-compatible interface, which makes the router transparent to application code.
  2. Policy. Start embarrassingly simple: route by request type and length. "Extraction and summarization under 4K tokens goes to Flash; everything touching production code goes to Opus; Fable 5 by explicit opt-in only." Static rules capture most of the savings on day one. Learned routers (classify-then-dispatch on embeddings) add single-digit percentage points and should be earned, not started with.
  3. Verification, if cascading. The cheap-model answer needs a pass/fail signal: schema validation for structured output, unit tests for code, an LLM-judge rubric for prose. The verifier is the cascade's quality floor, so invest here before adding tiers. A cascade without a real verifier is just routing with extra steps and double the latency.
  4. Measurement. The layer that decides whether any of this worked. You need cost per task, per model, per route, over time: which routes downgraded successfully, which escalated, what the realized savings were against the single-model baseline. This is metering, the same per-event discipline as cost-per-task benchmarking, and it doubles as your early-warning system when a provider's sideways repricing quietly changes which route is cheapest.
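
The policy and measurement layers fit in a few dozen lines. The sketch below encodes the example rules from layer 2 and a per-route ledger for layer 4. Everything in it is hypothetical scaffolding (`Request`, `pick_model`, `Meter`, the short model keys), and the fall-through to Opus for unmatched traffic is an assumption the quoted rules do not specify. Prices are the list prices from the table earlier in the article.

```python
from collections import defaultdict
from dataclasses import dataclass

# USD per million tokens (input, output), from the price table in this article.
PRICES = {"flash": (0.14, 0.28), "opus": (5.00, 25.00), "fable": (10.00, 50.00)}


@dataclass
class Request:
    task_type: str
    input_tokens: int
    touches_prod_code: bool = False
    frontier_opt_in: bool = False


def pick_model(req: Request) -> str:
    """Static policy mirroring the example rules in layer 2."""
    if req.frontier_opt_in:
        return "fable"
    if req.touches_prod_code:
        return "opus"
    if req.task_type in {"extraction", "summarization"} and req.input_tokens < 4_000:
        return "flash"
    return "opus"  # assumption: unmatched traffic stays on the flagship


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


class Meter:
    """Accumulates realized cost per (route, model) against a single-model baseline."""

    def __init__(self, baseline_model: str = "opus"):
        self.baseline_model = baseline_model
        self.rows = defaultdict(lambda: {"tasks": 0, "cost": 0.0, "baseline": 0.0})

    def record(self, route: str, model: str, input_tokens: int, output_tokens: int) -> None:
        row = self.rows[(route, model)]
        row["tasks"] += 1
        row["cost"] += cost_usd(model, input_tokens, output_tokens)
        row["baseline"] += cost_usd(self.baseline_model, input_tokens, output_tokens)

    def realized_savings(self) -> float:
        cost = sum(r["cost"] for r in self.rows.values())
        baseline = sum(r["baseline"] for r in self.rows.values())
        return 1 - cost / baseline if baseline else 0.0


meter = Meter()
for req, output_tokens in [
    (Request("extraction", 1_200), 150),
    (Request("summarization", 3_000), 400),
    (Request("refactor", 6_000, touches_prod_code=True), 2_000),
]:
    meter.record(req.task_type, pick_model(req), req.input_tokens, output_tokens)

print(f"realized savings vs all-Opus baseline: {meter.realized_savings():.1%}")
```

In this toy mix, two of three tasks downgrade, yet the realized savings land under a third, because the one task that stayed on Opus carries most of the tokens. That is the task-mix ceiling showing up in the ledger.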

The failure modes, because there are real ones

  • Quality drift you cannot see. The router downgrades a route, output quality dips 4%, and nobody notices for a quarter because nothing alerts on it. Mitigation: sample routed traffic into an eval set continuously, not just at rollout.
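  • Latency stacking in cascades. Every escalation is a full extra round trip. In a two-tier cascade with a 30% escalation rate, the 95th-percentile request is almost always an escalated one, so p95 latency is roughly the cheap call plus the verifier plus the strong call. Budget it explicitly, or route latency-sensitive paths one-shot.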
  • Provider concentration in the cheap tier. The workhorse tier is dominated by a handful of providers with their own rate limits, occasional brownouts, and (notably for DeepSeek) data-residency questions some enterprises cannot accept. A router is also a failover layer: configure the second-cheapest capable model as automatic fallback.
  • The router becomes a bill of its own. LLM-classify-every-request designs spend tokens to save tokens. Keep classification to heuristics or a sub-$0.10-per-million model, or the overhead eats the margin on short requests.
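
Two of these failure modes can be put in numbers before anything ships. The sketch below uses made-up latencies (400 ms cheap call, 50 ms verifier, 2,000 ms strong call) and counts input tokens only; the helper names are illustrative, and the prices are the per-million input prices quoted earlier.

```python
def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile on integer data (pct in 1..100)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = -(-len(ordered) * pct // 100)   # ceil(n * pct / 100) without floats
    return ordered[rank - 1]


CHEAP_MS, VERIFY_MS, STRONG_MS = 400, 50, 2_000   # hypothetical latencies


def cascade_latencies(n: int, escalation_pct: int) -> list[int]:
    """Latency of n requests when escalation_pct percent fall through to tier two."""
    escalated = n * escalation_pct // 100
    fast = [CHEAP_MS + VERIFY_MS] * (n - escalated)
    slow = [CHEAP_MS + VERIFY_MS + STRONG_MS] * escalated
    return fast + slow


def net_saving_usd(request_tokens: int, strong_price: float, cheap_price: float,
                   classifier_tokens: int, classifier_price: float) -> float:
    """Saving from one successful downgrade minus the cost of classifying it.
    Prices are USD per million input tokens; output tokens are ignored here."""
    gross = request_tokens * (strong_price - cheap_price) / 1_000_000
    overhead = classifier_tokens * classifier_price / 1_000_000
    return gross - overhead


latencies = cascade_latencies(1_000, 30)
print(percentile(latencies, 50), percentile(latencies, 95))   # → 450 2450

# A 200-token request downgraded from $5.00 to $0.14 per million input tokens,
# classified with a 500-token prompt on either a flagship or a $0.10 model.
print(net_saving_usd(200, 5.00, 0.14, 500, 5.00) > 0)   # → False
print(net_saving_usd(200, 5.00, 0.14, 500, 0.10) > 0)   # → True
```

The median looks healthy while the p95 carries the full stacked round trip, which is why a median-only latency dashboard hides this failure mode.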

The honest take

The router pattern is the rare AI cost lever that is neither a vendor negotiation nor a quality sacrifice: it is just engineering, and in 2026 the tooling is mature enough that a competent team ships a static-policy router in a week. But the pattern has a dependency order that the enthusiasm consistently gets backward. Measurement comes first, because the task distribution determines the ceiling; policy comes second; gateways and learned routers come last. Teams that bolt a router onto unmetered traffic discover they cannot answer the only question that matters, "what did this save us, and what did it cost in quality?", and quietly turn it off three months later. Teams that meter per task first find the router pays for itself before the gateway's free tier runs out. The 99% headline belongs to someone whose workload never needed the expensive model. Your number is smaller, it is knowable in an afternoon, and at the price spreads now on the menu, it is very likely the largest single line-item reduction available to you this quarter.
