TL;DR (June 2026): The cheapest place to control AI cost is not your application code - it is the LLM gateway, the proxy every model call already passes through. Put one in front of your providers and you get four cost levers in a single chokepoint: per-key and per-team token quotas, hard spend budgets, model routing to cheaper models, and response caching - plus the thing that makes all of it auditable: a metering point that sees every call, including the retries and SDK-internal requests your app-level tracking never logs. The HN front page in June carried a "self-hosted AI proxy with token quotas" Show HN for exactly this reason. Here is what a gateway buys you, the build-vs-buy options (LiteLLM, Portkey, Helicone, OpenRouter, Cloudflare AI Gateway), and why metering belongs at the gateway, not the app.

Most teams instrument AI cost in the wrong place. They wrap their own chat() helper, log the token counts it returns, and call that their usage data. Then the bill arrives larger than the dashboard, and nobody can explain the gap. The gap is everything that did not go through the helper: a framework's automatic retries, a tool-calling loop inside an SDK, a background summarization job, a second team using the same API key. The application layer only sees the calls it made on purpose. The gateway sees all of them, because the gateway is the only thing every request has in common.

That is the case for a gateway as your primary cost-control plane. It is one place to enforce policy and one place to meter, sitting between your code and the model providers.

What a gateway actually gives you

An LLM gateway (also called an AI proxy or AI gateway) is a thin service that exposes an OpenAI-compatible API to your code and talks to one or many providers behind it. Routing every call through it gives you, in one place:

Per-key and per-team token quotas. Issue a virtual key per team, customer, or environment, and cap each one's tokens or spend independently. A leaked or runaway key burns its own budget, not the whole company's.
Hard spend budgets. A monthly or daily ceiling per key. Once the ceiling is hit, the gateway returns an error instead of forwarding the call and incurring a charge - the gateway-level version of the hard spend cap that stops a runaway agent from bankrupting you.
Model routing. Send each request to the cheapest model that can handle it (the router pattern), enforced at the proxy so application code does not have to know which model served it.
Response caching. Cache identical or semantically similar requests so a repeated call incurs no provider charge.
Observability and metering. A usage event for every call - model, tokens in/out, cache status, latency, the key that made it - emitted whether your app meant to make the call or not.
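To make the routing and caching levers concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the model names, the length-based routing rule, and the `Gateway` class are hypothetical, and the cache is exact-match only (semantic caching would need an embedding lookup instead of a hash).

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable

CHEAP_MODEL = "small-model"    # hypothetical model name
STRONG_MODEL = "large-model"   # hypothetical model name


def pick_model(prompt: str, max_cheap_chars: int = 500) -> str:
    """Toy routing rule (an assumption): short prompts go to the cheap model."""
    return CHEAP_MODEL if len(prompt) <= max_cheap_chars else STRONG_MODEL


def cache_key(model: str, prompt: str) -> str:
    """Exact-match cache key: a hash of the model plus the prompt."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class Gateway:
    call_provider: Callable[[str, str], str]  # (model, prompt) -> response text
    cache: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def complete(self, key_id: str, prompt: str) -> str:
        model = pick_model(prompt)
        ck = cache_key(model, prompt)
        hit = ck in self.cache
        if not hit:
            # Only a cache miss reaches the provider and costs money.
            self.cache[ck] = self.call_provider(model, prompt)
        # One usage event per call, whether or not it was served from cache.
        self.events.append({"key_id": key_id, "model": model, "cache_hit": hit})
        return self.cache[ck]
```

Note that this sketch shares cached responses across keys. Whether that is acceptable depends on your tenancy model; a per-key cache would add `key_id` to the hashed payload.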

The build-vs-buy landscape

You do not have to write a gateway. The category matured over the last two years, and the choice is mostly self-host versus managed:

Gateway	Deployment	Cost levers it ships	Best when
LiteLLM	Self-host (open source) or managed	Virtual keys, per-key budgets and TPM/RPM limits, routing, caching, spend logs	You want full control and an OpenAI-compatible proxy in front of 100+ providers
Portkey	Managed or self-host	Budgets, guardrails, routing, caching, observability	You want batteries-included governance without running it yourself
Helicone	Managed or self-host (proxy or async)	Per-request logging, cost analytics, caching, rate limits	Observability-first, minimal integration friction
OpenRouter	Managed marketplace	One key across many models, unified billing, fallback routing	You want many models behind one account without per-provider contracts
Cloudflare AI Gateway	Managed (edge)	Caching, rate limiting, analytics, fallback	You are already on Cloudflare and want an edge-cached proxy

The self-hosted end of this list is what showed up on Hacker News as a free "AI proxy with token quotas and local control" - the same idea, owned end to end. Self-host when you need the usage data to stay inside your perimeter or you want to avoid a per-call markup; buy when you would rather not operate one more service.

Per-key quotas are the lever most teams skip

The single highest-leverage thing a gateway gives you is the virtual key per consumer. Instead of one shared provider key with one shared blast radius, you mint a key per team, per customer, or per feature, each with its own quota and budget. That changes three things at once:

Blast radius. A leaked key or a looping agent exhausts one budget, then errors out - it cannot run up the whole org's bill.
Attribution. Every call is already tagged with the key that made it, so "which team spent this" is answered at write time, not reconstructed later.
Chargeback. Per-key spend is the raw material for billing customers or showing internal teams their real cost-to-serve - the problem that breaks per-seat pricing the moment a user runs an agent.
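The three effects above can be sketched with a small key registry. This is an illustrative model under stated assumptions, not any particular gateway's API: `VirtualKey`, `KeyRegistry`, and `BudgetExceeded` are hypothetical names, amounts are integer cents to avoid floating-point drift, and the cost is assumed to be known (or estimated) when `charge` is called.

```python
from collections import defaultdict
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised instead of forwarding the call once a key's budget is spent."""


@dataclass
class VirtualKey:
    key_id: str
    owner: str           # team, customer, or feature this key belongs to
    budget_cents: int    # hard ceiling for the current period
    spent_cents: int = 0


class KeyRegistry:
    def __init__(self):
        self._keys = {}

    def mint(self, key_id: str, owner: str, budget_cents: int) -> None:
        self._keys[key_id] = VirtualKey(key_id, owner, budget_cents)

    def charge(self, key_id: str, cost_cents: int) -> None:
        """Blast radius: only this key's budget is checked and consumed."""
        key = self._keys[key_id]
        if key.spent_cents + cost_cents > key.budget_cents:
            raise BudgetExceeded(f"{key_id} would exceed {key.budget_cents} cents")
        key.spent_cents += cost_cents

    def spend_by_owner(self) -> dict:
        """Attribution and chargeback: spend is already grouped by key owner."""
        totals = defaultdict(int)
        for key in self._keys.values():
            totals[key.owner] += key.spent_cents
        return dict(totals)
```

A looping agent on one key hits `BudgetExceeded` while every other key keeps working, and `spend_by_owner()` needs no after-the-fact log reconstruction.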

Where metering plugs in

A gateway gives you spend logs. Turning those into invoices - or into a number you would defend in a billing dispute - is a separate job, and it is where most homegrown gateway setups fall down. The log is append-only chatter; a bill needs idempotent ingest (the same request retried must not be counted twice), dimensioned roll-ups (per key, per customer, per model), and a frozen monthly total that does not drift after you have invoiced.

So the clean architecture is: the gateway is the enforcement point (quotas, budgets, routing), and it emits a usage event per call into a metering layer that is the accounting point. The gateway answers "should this call proceed"; the meter answers "what does this account owe, and can I prove it". Trying to make the gateway's request log double as your billing ledger is the same mistake as treating a raw SQL table as a metering database - it works until the first dispute or the first double-charged retry.
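One way to picture that split is a handler that asks a policy function whether the call may proceed and then hands exactly one usage event to a separate meter. The names here (`UsageEvent`, `Meter`, `ListMeter`, `handle_call`) and the provider's return shape are assumptions made for the sketch; the event fields follow the list given earlier in this article.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class UsageEvent:
    event_id: str     # assigned once by the gateway and reused on redelivery
    key_id: str
    model: str
    tokens_in: int
    tokens_out: int
    cache_hit: bool
    latency_ms: int


class Meter(Protocol):
    """The accounting point: anything that can ingest a usage event."""
    def ingest(self, event: UsageEvent) -> None: ...


class ListMeter:
    """Stand-in meter that only collects events, for illustration."""
    def __init__(self):
        self.events = []

    def ingest(self, event: UsageEvent) -> None:
        self.events.append(event)


def handle_call(key_id, model, prompt, *, allow, provider, meter):
    """Enforcement: `allow(key_id)` decides whether the call proceeds.
    Accounting: every call that ran produces one event for `meter`."""
    if not allow(key_id):
        return None  # a real gateway would return an HTTP error here
    start = time.monotonic()
    # Assumed provider shape: returns (text, tokens_in, tokens_out).
    text, tokens_in, tokens_out = provider(model, prompt)
    latency_ms = int((time.monotonic() - start) * 1000)
    meter.ingest(UsageEvent(str(uuid.uuid4()), key_id, model,
                            tokens_in, tokens_out, False, latency_ms))
    return text
```

The handler never computes what an account owes; it only decides and reports, which keeps the billing logic out of the request path.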

What the gateway cannot solve on its own is duplicate and late delivery: a retried request, a webhook that fires twice, a collector that ships at-least-once. The meter dedupes by event id; the gateway does not.
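Dedupe by event id can be sketched in a few lines. This is a minimal in-memory illustration with a hypothetical `DedupingMeter` class and an assumed event shape (`event_id`, `key_id`, `tokens`); a production meter would persist the seen ids durably rather than hold them in a set.

```python
from collections import defaultdict


class DedupingMeter:
    def __init__(self):
        self._seen = set()                    # event ids already counted
        self.tokens_by_key = defaultdict(int)

    def ingest(self, event: dict) -> bool:
        """Count the event once. Return False if it was a duplicate."""
        if event["event_id"] in self._seen:
            return False
        self._seen.add(event["event_id"])
        self.tokens_by_key[event["key_id"]] += event["tokens"]
        return True


meter = DedupingMeter()
event = {"event_id": "evt-1", "key_id": "team-a", "tokens": 150}

# An at-least-once collector may deliver the same event more than once.
first = meter.ingest(event)
second = meter.ingest(event)
```

Here `first` is `True`, `second` is `False`, and `team-a` is billed for 150 tokens rather than 300.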

The honest build-vs-buy on metering

You can self-host the gateway cheaply - that part is genuinely a solved, open-source problem. The accounting layer behind it is the part that is expensive to get right: write-ahead durability, dedupe under retry, hourly roll-ups, and frozen periods so invoices stop moving. That is the database problem, not the proxy problem, and it is the reason metering keeps getting absorbed into payments and platforms rather than rebuilt per company.
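Hourly roll-ups and frozen periods can be illustrated with a toy ledger. The `PeriodLedger` class is hypothetical, timestamps are assumed to be timezone-aware UTC, and routing late events for a frozen month into an adjustments list is one possible policy chosen for the sketch, not the only one.

```python
from collections import defaultdict
from datetime import datetime, timezone


class PeriodLedger:
    def __init__(self):
        self.hourly = defaultdict(int)  # (key_id, "YYYY-MM-DDTHH") -> tokens
        self.frozen = {}                # "YYYY-MM" -> {key_id: tokens}
        self.adjustments = []           # late events for already-frozen months

    def record(self, key_id: str, ts: datetime, tokens: int) -> None:
        month = ts.strftime("%Y-%m")
        if month in self.frozen:
            # The invoiced total must not move, so park the event instead.
            self.adjustments.append((key_id, ts, tokens))
            return
        self.hourly[(key_id, ts.strftime("%Y-%m-%dT%H"))] += tokens

    def freeze(self, month: str) -> None:
        """Snapshot per-key totals for the month so invoices stop moving."""
        totals = defaultdict(int)
        for (key_id, hour), tokens in self.hourly.items():
            if hour.startswith(month):
                totals[key_id] += tokens
        self.frozen[month] = dict(totals)

    def month_total(self, key_id: str, month: str) -> int:
        if month in self.frozen:
            return self.frozen[month].get(key_id, 0)
        return sum(tokens for (k, hour), tokens in self.hourly.items()
                   if k == key_id and hour.startswith(month))


ledger = PeriodLedger()
ledger.record("team-a", datetime(2026, 5, 31, 23, 10, tzinfo=timezone.utc), 100)
ledger.record("team-a", datetime(2026, 5, 31, 23, 50, tzinfo=timezone.utc), 50)
ledger.freeze("2026-05")
# A late event for May arrives after invoicing and does not change the total.
ledger.record("team-a", datetime(2026, 5, 30, 1, 0, tzinfo=timezone.utc), 999)
```

After the late event, `ledger.month_total("team-a", "2026-05")` still returns the frozen figure, and the late usage waits in `ledger.adjustments` for whatever correction process you choose.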

Put the gateway in front for control. Put a real meter behind it for the money. Treat the gateway log as the signal and the meter as the source of truth, and your AI cost stops being a monthly surprise.

Key Topics

•LLM gateway
•AI proxy
•token quotas
•spend caps
•per-key budgets
•LiteLLM
•cost control
•usage metering
•2026

The LLM Gateway Is Your Cheapest Cost Lever: Token Quotas, Per-Key Budgets, and Where Metering Lives (2026)
