What is the LLM router pattern?

An architecture that sends each AI request to the cheapest model capable of handling it instead of sending everything to one expensive model. It has two variants: routing (a one-shot classification decides the model before execution) and cascading (every request tries a cheap model first and escalates to stronger tiers only when a verifier rejects the answer). Published production results cluster at 45 to 85% cost reduction while retaining roughly 95% of output quality.

How much can model routing actually save?

The ceiling is set by your task mix, not the architecture. If 70% of your traffic can run on a workhorse model that is at least 50x cheaper (DeepSeek V4 Flash at $0.14/$0.28 per MTok versus Claude Fable 5 at $10/$50 is a 70 to 180x gap), routing that share cuts the total bill by roughly two thirds. Case studies report 45 to 85% savings; the viral 99% figure came from moving an entire workload that never needed a frontier model, which is repricing rather than routing.

What is the difference between routing and cascading?

Routing classifies a request once and sends it to exactly one model: cheap, fast, but a misclassification ships a weak answer. Cascading runs the cheapest model first and escalates failures to stronger models via a verifier: safer on quality because the strong model backstops everything, but it double-pays on escalations and adds latency. High-volume recognizable traffic suits routing; expensive-to-get-wrong tasks suit cascading.

What tools exist for LLM routing in 2026?

Mature options at every layer: LiteLLM (open source) and OpenRouter (hosted) as general gateways; Cloudflare AI Gateway, Kong AI Gateway, and Bifrost for platform teams; Microsoft Foundry includes a managed model router. All expose OpenAI-compatible interfaces so application code does not change. The components worth building in-house are the routing policy and the verification logic, because both encode your specific quality bar.

What are the risks of routing to cheaper models?

Four recur in production: silent quality drift on downgraded routes (mitigate with continuous eval sampling, not just rollout testing); p95 latency stacking from cascade escalations; provider concentration and data-residency constraints in the cheap tier (configure automatic fallback); and router overhead itself when classification uses an expensive model on short requests. None are fatal; all need explicit budgets.

What do I need before implementing a model router?

Per-task, per-model metering. The savings ceiling equals the fraction of traffic a cheap model can serve, and without measuring cost per task across your real workload you cannot estimate that fraction, set routing policy, or prove the router worked afterward. Measurement first, simple static routing rules second, gateways and learned classifiers last: that is the dependency order that survives contact with production.

The Router Pattern: Cut AI Costs 45-85% by Sending Each Task to the Cheapest Capable Model

TL;DR (June 2026): The price gap between frontier and workhorse models is now about two orders of magnitude: Claude Fable 5 at $10/$50 per million tokens versus DeepSeek V4 Flash at $0.14/$0.28, a spread of roughly 70x on input and 180x on output. Routing each request to the cheapest model that can actually handle it, whether by one-shot routing or sequential cascading, cuts production AI bills by 45 to 85% while keeping ~95% of quality, per the published case studies and a 2026 academic survey of the field. A Hacker News post claiming a 99% cost cut by switching from Claude to DeepSeek went hot in late May; the durable version of that stunt is a router. Here is how the pattern works, when each variant wins, the gateway tooling that exists so you do not build it from scratch, and the one prerequisite nobody can skip: per-task, per-model measurement.

Every AI cost conversation in 2026 eventually arrives at the same observation: most of the requests you send to a frontier model did not need a frontier model. The boilerplate edit, the JSON extraction, the summary of a short document: a $0.30-per-million workhorse handles them indistinguishably from a $50-per-million flagship. The router pattern is the engineering response to that observation, and after two years of tooling maturation it has gone from blog-post idea to the default architecture for cost-conscious production AI. The viral version this spring was a Hacker News post titled "I cut my AI API costs 99% by switching from Claude to DeepSeek"; the threads beneath it, and the Ask HN "How are people forecasting AI API costs for agent workflows?", show where the real engineering interest is. This is the full picture.

The spread that makes routing worth it

Routing is only as valuable as the price gap it exploits, and the June 2026 gap is the widest the industry has seen:

Tier	Example model	Input / Output per MTok	vs Fable 5 output
Frontier	Claude Fable 5	$10 / $50	1x
Flagship	Claude Opus 4.8	$5 / $25	2x cheaper
Reasoning workhorse	DeepSeek V4 Pro	$1.74 / $3.48	~14x cheaper
Workhorse	DeepSeek V4 Flash	$0.14 / $0.28	~180x cheaper

Two structural notes make the spread even wider in practice. Caching multiplies it: DeepSeek prices cache-hit input at $0.0028 per million tokens and Anthropic prices cached reads at $1 per million, so a router that also routes repeated context intelligently stacks a second discount on top (the mechanics are in our prompt caching guide). And the cheap tier is genuinely capable now: the budget agentic class handles tool use and structured output that required a flagship eighteen months ago.

Routing versus cascading, precisely

The two variants get conflated constantly, and they have different cost profiles:

Routing is a one-shot decision. A classifier (heuristics, a small model, or learned embeddings) inspects the request before execution and sends it to exactly one model. Cost: one inference plus a near-free classification. Risk: misclassification sends a hard task to a weak model and you ship a bad answer. Best for: high-volume traffic with recognizable task shapes, support triage, extraction, code completion.
Cascading is sequential escalation. Every request tries the cheapest model first; a verifier (confidence score, self-check, or rubric) decides whether the answer is good enough, and failures escalate to the next tier. Cost: one cheap inference plus a verification on every request, and a second, stronger inference whenever the cheap model fails. Risk: much lower on quality, because the strong model backstops everything. Best for: tasks where wrong answers are expensive and volume is moderate.
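The two cost profiles can be compared with a few lines of arithmetic. The sketch below is illustrative only: the function names, the unit costs (a cheap call costs 1, a strong call costs 50), and the assumption that the classifier and the verifier are free and never wrong are ours, not measured figures.

```python
def routing_cost(cheap_share: float, cheap: float, strong: float,
                 classifier: float = 0.0) -> float:
    """Expected cost per request when a one-shot classifier picks exactly one model."""
    return classifier + cheap_share * cheap + (1.0 - cheap_share) * strong


def cascade_cost(escalation_rate: float, cheap: float, strong: float,
                 verifier: float = 0.0) -> float:
    """Expected cost per request when every request tries the cheap model first."""
    # The cheap call and the verifier run on every request; the strong call
    # runs only on the share that the verifier rejects.
    return cheap + verifier + escalation_rate * strong


# Hypothetical unit costs: cheap call = 1, strong call = 50.
# Same task mix in both cases: 70% of requests are cheap-suitable.
routing = routing_cost(cheap_share=0.7, cheap=1.0, strong=50.0)
cascading = cascade_cost(escalation_rate=0.3, cheap=1.0, strong=50.0)
print(f"routing: {routing:.1f}  cascading: {cascading:.1f}")
```

Under these assumptions the cascade costs about 16.0 units per request against about 15.7 for routing. The difference is the cheap attempt that was paid for and thrown away on the 30% of requests that escalated; what the cascade buys with it is that no unverified cheap answer ships.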

The published numbers cluster tightly: 45 to 85% cost reduction at roughly 95% retained quality across the case studies, with routing plus semantic caching reported at 60%+ and full cascades at the high end. A 2026 academic survey (arXiv 2603.04445, "Dynamic Model Routing and Cascading for Efficient LLM Inference") now catalogs dozens of production-grade techniques, which is the moment a pattern stops being a trick and becomes infrastructure.

Why the 99% headline is real and also not your number

The HN poster who cut costs 99% did it by moving everything from a frontier model to DeepSeek. That is repricing, not routing, and it only works if none of your traffic needed the frontier model, in which case you were simply overpaying before. A router's savings are bounded by your actual task mix: if 70% of your traffic is workhorse-suitable, routing that 70% to a model 50x cheaper cuts the total bill by roughly two thirds (the routed share shrinks to about 1.4% of the original bill while the remaining 30% is unchanged), and no architecture can do better without changing the work itself. This is why the honest first step is not picking a gateway but measuring your task distribution: what fraction of your requests, by token volume, could a cheap model serve at acceptable quality? Teams that meter per task already know this number. Teams that do not are guessing at the single variable that determines the whole project's ROI, the same measurement gap we keep finding in list price versus real cost.
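The two-thirds figure follows from one line of arithmetic. As a minimal sketch, assume the routable share is measured as a fraction of the single-model bill and that the cheap model consumes the same number of tokens; the function name is ours.

```python
def savings_fraction(routable_share: float, price_ratio: float) -> float:
    """Fraction of the single-model bill saved by moving `routable_share`
    of the spend to a model that is `price_ratio` times cheaper."""
    return routable_share * (1.0 - 1.0 / price_ratio)


# 70% routable traffic at two different price gaps.
for ratio in (50, 180):
    print(ratio, round(savings_fraction(0.7, ratio), 3))
```

At a 50x gap the saving is about 68.6%; at 180x it is about 69.6%. Widening the price gap barely moves the result, because the saving can never exceed the routable share itself, which is why the task mix sets the ceiling rather than the price list.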

The build: four layers, mostly off the shelf

Gateway. You almost certainly should not write the dispatch layer yourself. LiteLLM and OpenRouter dominate the open and hosted ends respectively; Cloudflare AI Gateway, Kong AI Gateway, and Bifrost serve the platform-team crowd; Microsoft Foundry ships a managed model router. All of them speak the OpenAI-compatible interface, which makes the router transparent to application code.
Policy. Start embarrassingly simple: route by request type and length. "Extraction and summarization under 4K tokens goes to Flash; everything touching production code goes to Opus; Fable 5 by explicit opt-in only." Static rules capture most of the savings on day one. Learned routers (classify-then-dispatch on embeddings) add single-digit percentage points and should be earned, not started with.
Verification, if cascading. The cheap-model answer needs a pass/fail signal: schema validation for structured output, unit tests for code, an LLM-judge rubric for prose. The verifier is the cascade's quality floor, so invest here before adding tiers. A cascade without a real verifier is just routing with extra steps and double the latency.
Measurement. The layer that decides whether any of this worked. You need cost per task, per model, per route, over time: which routes downgraded successfully, which escalated, what the realized savings were against the single-model baseline. This is metering, the same per-event discipline as cost-per-task benchmarking, and it doubles as your early-warning system when a provider's sideways repricing quietly changes which route is cheapest.
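The policy, verification, and measurement layers above can be sketched together in plain Python. Everything named here is hypothetical: the tier labels, the `choose_tier` and `cascade` helpers, the flagship default for unclassified requests, and the stub functions that stand in for real gateway calls. It is a sketch of the shape, not a reference implementation.

```python
import json

# Hypothetical tier labels; map them to real model IDs in your gateway config.
WORKHORSE, FLAGSHIP, FRONTIER = "workhorse", "flagship", "frontier"


def choose_tier(task_type: str, prompt_tokens: int, frontier_opt_in: bool = False) -> str:
    """Static policy: route by request type and length."""
    if frontier_opt_in:
        return FRONTIER  # frontier tier only by explicit opt-in
    if task_type in {"extraction", "summarization"} and prompt_tokens < 4000:
        return WORKHORSE
    return FLAGSHIP  # assumed conservative default for everything else


def has_required_keys(answer: str, required: set[str]) -> bool:
    """Schema-style verifier: the answer must be a JSON object with the required keys."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data)


def cascade(prompt, tiers, verify, log):
    """Try tiers cheapest first; log every attempt so escalations can be metered."""
    answer = ""
    for name, call in tiers:
        answer = call(prompt)
        passed = verify(answer)
        log.append({"tier": name, "passed": passed})
        if passed:
            break
    return answer  # the last tier's answer is returned even if it failed


# Stub models standing in for real gateway calls.
def cheap_model(prompt: str) -> str:
    return "not json"


def strong_model(prompt: str) -> str:
    return json.dumps({"invoice_id": "A-1", "total": 42.0})


attempts: list[dict] = []
result = cascade(
    "Extract invoice_id and total as JSON.",
    [(WORKHORSE, cheap_model), (FLAGSHIP, strong_model)],
    lambda text: has_required_keys(text, {"invoice_id", "total"}),
    attempts,
)
print(choose_tier("extraction", 1200), result, attempts)
```

The `attempts` log is the seed of the measurement layer: joined with per-model token prices, it yields cost per task, per model, and per route, along with the escalation rate for each route.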

The failure modes, because there are real ones

Quality drift you cannot see. The router downgrades a route, output quality dips 4%, and nobody notices for a quarter because nothing alerts on it. Mitigation: sample routed traffic into an eval set continuously, not just at rollout.
Latency stacking in cascades. Every escalation is a full extra round trip. In a two-tier cascade with a 30% escalation rate, the p95 request is an escalated one, so p95 latency is roughly the cheap call plus the strong call. Budget it explicitly, or route latency-sensitive paths one-shot.
Provider concentration in the cheap tier. The workhorse tier is dominated by a handful of providers with their own rate limits, occasional brownouts, and (notably for DeepSeek) data-residency questions some enterprises cannot accept. A router is also a failover layer: configure the second-cheapest capable model as automatic fallback.
The router becomes a bill of its own. LLM-classify-every-request designs spend tokens to save tokens. Keep classification to heuristics or a sub-$0.10-per-million model, or the overhead eats the margin on short requests.
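Two of these budgets, cascade latency and classifier overhead, can be estimated before anything is built. The sketch below rests on simplifying assumptions of ours: each tier has a fixed latency, the classifier reads the full prompt and emits negligible output, and the latency figures are placeholders. The prices are the Opus 4.8 and V4 Flash rows from the table above.

```python
def cascade_latency(cheap_s: float, strong_s: float, escalation_rate: float):
    """Mean and p95 latency of a two-tier cascade with fixed per-tier latencies."""
    mean = cheap_s + escalation_rate * strong_s
    # If more than 5% of requests escalate, the 95th-percentile request
    # is one that paid for both calls.
    p95 = cheap_s + strong_s if escalation_rate > 0.05 else cheap_s
    return mean, p95


def classifier_overhead_share(in_tokens, out_tokens, strong_price, cheap_price,
                              classifier_in_price):
    """Share of a downgraded request's saving that the classifier consumes.
    Prices are USD per million tokens; model prices are (input, output) pairs."""
    saved = (in_tokens * (strong_price[0] - cheap_price[0])
             + out_tokens * (strong_price[1] - cheap_price[1])) / 1e6
    overhead = in_tokens * classifier_in_price / 1e6
    return overhead / saved


# Placeholder latencies: 0.5 s cheap call, 2.0 s strong call, 30% escalation.
mean_s, p95_s = cascade_latency(0.5, 2.0, 0.3)

# A short request: 200 input tokens, 50 output tokens, downgraded from
# Opus 4.8 ($5/$25) to V4 Flash ($0.14/$0.28).
flagship_classifier = classifier_overhead_share(200, 50, (5.0, 25.0), (0.14, 0.28), 5.0)
cheap_classifier = classifier_overhead_share(200, 50, (5.0, 25.0), (0.14, 0.28), 0.10)
print(mean_s, p95_s, flagship_classifier, cheap_classifier)
```

With these inputs the mean latency rises to about 1.1 s and the p95 to 2.5 s. A classifier priced like the flagship consumes roughly 45% of the saving on that short request, while a $0.10-per-million classifier consumes under 1%.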

The honest take

The router pattern is the rare AI cost lever that is neither a vendor negotiation nor a quality sacrifice: it is just engineering, and in 2026 the tooling is mature enough that a competent team ships a static-policy router in a week. But the pattern has a dependency order that the enthusiasm consistently gets backward. Measurement comes first, because the task distribution determines the ceiling; policy comes second; gateways and learned routers come last. Teams that bolt a router onto unmetered traffic discover they cannot answer the only question that matters, "what did this save us, and what did it cost in quality?", and quietly turn it off three months later. Teams that meter per task first find the router pays for itself before the gateway's free tier runs out. The 99% headline belongs to someone whose workload never needed the expensive model. Your number is smaller, it is knowable in an afternoon, and at the price spreads now on the menu, it is very likely the largest single line-item reduction available to you this quarter.

Key Topics

• model routing
• LLM cascade
• AI cost optimization
• LiteLLM
• OpenRouter
• DeepSeek
• cost per task
• AI FinOps
• June 2026
