TL;DR (June 2026): "You probably don't need to use Opus" is the loudest cost take of the year - open-weight models (DeepSeek, Qwen, GLM, Llama-class) handle most production work at a fraction of frontier price. But the follow-on claim, "so just self-host and stop paying the API," hides a break-even most teams get wrong. Self-hosting trades a per-token bill for a per-hour GPU bill, and a per-hour bill is only cheap if the GPU stays busy. The crossover is a utilization question: at high steady throughput self-hosting wins decisively; at spiky or low volume the idle GPU makes it more expensive than the API it replaced. Here is the actual math, the three honest options (self-host, hosted open-model inference, frontier API), and why you must measure cost per task to know which side of the line you are on.
The open-weight argument is correct on its own terms. The price gap between a frontier model and a capable open workhorse is now roughly two orders of magnitude, and for boilerplate edits, extraction, classification, and short summaries the workhorse's output is effectively indistinguishable from the frontier model's. We made that case in "DeepSeek, GLM, Qwen and Kimi as agentic workhorses." The question this article answers is the next one: once you have decided a cheap open model is good enough, should you rent it by the token from an inference provider, or run it yourself on GPUs?
The break-even is a utilization problem
An API bills you per million tokens. A GPU bills you per hour whether it is saturated or idle. So the only way to compare them is to convert your GPU's hourly cost into an effective dollars-per-million-tokens figure, which depends entirely on how many tokens that GPU actually pushes per hour:
effective $/Mtok = (GPU $/hour) ÷ (million tokens served per hour)
The numerator is fixed by your hardware contract. The denominator is set by your real traffic, not your benchmark. A GPU that can theoretically serve millions of tokens an hour but sits at 5% utilization overnight and on weekends is paying full freight to do almost nothing - and its effective per-token cost balloons accordingly. This is why the same hardware can be cheaper or far more expensive than an API, depending only on how busy you keep it.
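To make the formula concrete, here is a minimal sketch in Python. The GPU rate and peak throughput are hypothetical placeholders, not measurements or quotes from any provider; substitute your own contract price and observed traffic.

```python
def effective_cost_per_mtok(gpu_dollars_per_hour: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Effective $/Mtok = (GPU $/hour) / (million tokens actually served per hour)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens really served, not the benchmark ceiling.
    served_mtok_per_hour = peak_tokens_per_hour * utilization / 1_000_000
    return gpu_dollars_per_hour / served_mtok_per_hour


# Hypothetical inputs, for illustration only.
GPU_RATE = 2.00                   # assumed $/hour for one GPU
PEAK_TOKENS_PER_HOUR = 4_000_000  # assumed throughput when fully saturated

if __name__ == "__main__":
    for u in (1.0, 0.5, 0.05):
        cost = effective_cost_per_mtok(GPU_RATE, PEAK_TOKENS_PER_HOUR, u)
        print(f"utilization {u:.0%}: ${cost:.2f} per million tokens")
```

With these assumed inputs the same GPU costs $0.50 per million tokens when saturated, $1.00 at half load, and $10.00 at 5% utilization. The hardware bill never changed; only the denominator did.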
Three options, side by side
| Option | Cost shape | Cheap when | Expensive when |
|---|---|---|---|
| Frontier API (e.g. Claude Fable 5 / Opus) | High $/token, zero ops | Low volume, hardest tasks, you value zero ops | High volume of easy tasks - you overpay for capability you don't use |
| Hosted open-model API (Together, Fireworks, DeepInfra, Groq, OpenRouter) | Low $/token, zero ops | Spiky or unpredictable volume; you want cheap models without running GPUs | Very high steady volume where the provider's margin is your savings left on the table |
| Self-hosted open weights (your GPUs) | Per-hour GPU, you own ops | High, steady, predictable throughput that keeps the GPU saturated; data must stay in your perimeter | Low or bursty utilization - you pay for idle silicon, plus the ops you now own |
The middle row is the one most teams skip past and the one that wins most often: hosted open-model inference gives you the cheap open weights without the GPU lease or the ops. It is the pragmatic default until your volume is both large and steady enough to justify owning hardware.
The hidden costs of self-hosting
The per-hour GPU rate is the visible cost. The crossover math quietly understates the cost of self-hosting, because several costs never show up in a pricing-page comparison:
- Idle time. Traffic is diurnal and bursty. If you size for peak, you pay for peak 24/7; if you size for average, you queue or drop at peak. Either way real utilization is well under 100%, and the effective per-token cost rises to match.
- Operations. You now own inference-server tuning, batching, model updates, GPU driver and CUDA breakage, autoscaling, and on-call. That is engineering time the API priced in for you.
- Cold starts and scaling. Spinning GPUs up and down to chase demand introduces latency and waste; keeping them warm reintroduces idle cost.
- Capability ceiling. The hardest 10% of tasks may still need a frontier model, so you often run a hybrid and pay for both worlds anyway.
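The idle-time trade-off in the first bullet can be sketched numerically. The demand profile below is a made-up day (million tokens demanded per hour), and the two helper functions are illustrative names, not part of any library.

```python
def utilization(hourly_demand, capacity):
    """Fraction of paid-for capacity that actually serves tokens."""
    served = sum(min(d, capacity) for d in hourly_demand)
    return served / (capacity * len(hourly_demand))


def overflow_fraction(hourly_demand, capacity):
    """Fraction of demanded tokens above capacity (queued or dropped)."""
    overflow = sum(max(d - capacity, 0.0) for d in hourly_demand)
    return overflow / sum(hourly_demand)


# Hypothetical diurnal profile: quiet night, busy day, moderate evening.
demand = [0.5] * 8 + [4.0] * 8 + [1.5] * 8  # Mtok demanded per hour
peak = max(demand)                           # 4.0
average = sum(demand) / len(demand)          # 2.0

if __name__ == "__main__":
    for label, capacity in (("peak", peak), ("average", average)):
        print(f"sized for {label}: "
              f"utilization {utilization(demand, capacity):.0%}, "
              f"overflow {overflow_fraction(demand, capacity):.0%}")
```

On this assumed profile, sizing for peak leaves the fleet at 50% utilization with nothing dropped, while sizing for average reaches only about 67% utilization and pushes a third of the demand into a queue. Neither choice gets close to the 100% that a pricing-page comparison implicitly assumes.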
When self-hosting genuinely wins
It does win, and clearly, in a specific shape: high, steady, predictable volume on a fixed open model, where the GPU stays busy enough that the effective per-token cost drops well below the hosted-API rate - and the savings are large enough to absorb the ops burden. Data residency or latency requirements that rule out third-party APIs also push you here regardless of the pure cost math. Outside that shape, a hosted open-model API almost always wins on total cost of ownership.
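One way to locate that shape is to solve the effective-cost formula for the utilization at which self-hosting merely ties the hosted price. The sketch below is a simplification: it assumes a single blended hosted price per million tokens and an amortized hourly ops cost, and every number in it is hypothetical.

```python
def break_even_utilization(gpu_dollars_per_hour: float,
                           ops_dollars_per_hour: float,
                           peak_mtok_per_hour: float,
                           hosted_price_per_mtok: float) -> float:
    """Utilization at which self-hosted $/Mtok equals the hosted API price.

    Derived from: (gpu + ops) / (peak * u) = hosted_price.
    A result above 1.0 means self-hosting cannot win even when saturated.
    """
    hourly_cost = gpu_dollars_per_hour + ops_dollars_per_hour
    api_bill_at_full_load = peak_mtok_per_hour * hosted_price_per_mtok
    return hourly_cost / api_bill_at_full_load


if __name__ == "__main__":
    # Assumed: $2/h GPU, $1/h amortized ops, 4 Mtok/h peak throughput.
    for hosted_price in (1.50, 0.50):
        u = break_even_utilization(2.0, 1.0, 4.0, hosted_price)
        verdict = "never breaks even" if u > 1 else f"breaks even at {u:.0%}"
        print(f"hosted at ${hosted_price:.2f}/Mtok: self-hosting {verdict}")
```

With these placeholder inputs, a hosted price of $1.50 per million tokens puts the break-even at 50% sustained utilization, while a hosted price of $0.50 puts it at 150%, which no GPU can reach. Note how much the ops term moves the answer: it is half the GPU rate here, and it raises the break-even by the same proportion.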
You can't pick a side without cost-per-task
Every comparison above collapses to one measurement you have to take from your own traffic: cost per task, at your quality bar, on each option. The sticker price per token does not tell you which option is cheapest, because tasks differ in token shape and because self-hosting's real number depends on your utilization. We argued this in "cost per task is the new AI benchmark": run a representative task through each option, read the tokens each actually consumes, fold in the GPU-hour amortization for the self-host case, and rank by cost-per-task.
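A rough sketch of that ranking follows. Every price and token count is a hypothetical placeholder, not a real provider rate, and the self-hosted case assumes input and output tokens cost the same to serve, which is a simplification.

```python
def token_priced_cost(input_tokens, output_tokens,
                      input_price_per_mtok, output_price_per_mtok):
    """Cost of one task on a per-token API (frontier or hosted open model)."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def self_hosted_cost(total_tokens, gpu_dollars_per_hour, served_mtok_per_hour):
    """Cost of one task with the GPU-hour amortized over tokens really served."""
    effective_price_per_mtok = gpu_dollars_per_hour / served_mtok_per_hour
    return total_tokens / 1_000_000 * effective_price_per_mtok


# Hypothetical measurements: each option consumes different tokens
# for the same representative task.
costs = {
    "frontier": token_priced_cost(2000, 500, 15.0, 75.0),
    "hosted-open": token_priced_cost(2200, 700, 0.5, 1.5),
    "self-hosted": self_hosted_cost(2200 + 700, 2.0, 2.0),
}
ranking = sorted(costs, key=costs.get)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: ${costs[name]:.5f} per task")
```

With these assumed numbers the hosted open model ranks cheapest, self-hosting second, and the frontier API last. Raise `served_mtok_per_hour` in the self-hosted case and the order flips, which is the utilization argument expressed per task.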
Doing that requires metering every model call with enough dimensions to attribute cost back to a workload - which model served it, how many tokens, whether it was cached, which feature triggered it. Without that, "self-hosting saved us money" is a feeling, not a number. With it, the crossover stops being a debate and becomes a line on a chart you can see yourself cross. For the full stack of levers around this decision, see the 6-layer playbook for reducing LLM API costs.
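As a minimal sketch of what such metering might look like, the record below carries the dimensions named above. The field names and the `cost_by_workload` helper are illustrative assumptions, not the schema of any particular tool, and the sample records are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    model: str          # which model served the call
    feature: str        # which product feature triggered it
    input_tokens: int
    output_tokens: int
    cached: bool        # whether the call was served from cache
    cost_usd: float     # cost attributed to this call


def cost_by_workload(records):
    """Sum attributed cost per (feature, model) pair."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.feature, r.model)] += r.cost_usd
    return dict(totals)


# Invented sample records, for illustration only.
records = [
    CallRecord("open-model-a", "summarize", 1800, 300, False, 0.002),
    CallRecord("open-model-a", "summarize", 1800, 300, True, 0.001),
    CallRecord("frontier-model-b", "code-review", 4000, 900, False, 0.120),
]

if __name__ == "__main__":
    for (feature, model), total in cost_by_workload(records).items():
        print(f"{feature} on {model}: ${total:.3f}")
```

Grouping by feature and model is only one possible cut; the same records can be re-aggregated by cache status or time window once they are captured.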