Cost Per Task Is the New AI Benchmark: Composer 2.5 and the Workhorse-Model Economics of 2026

Author: UsageBox

The short answer: The benchmark that decides your AI bill is neither score nor price per token; it is cost per task. On Artificial Analysis's Coding Agent Index, Cursor's Composer 2.5 lands third at index 62 for about $0.07 per task on its standard tier, while the two models above it (Claude Opus 4.7 at 66, GPT-5.5 at 65) cost $4.10 and $4.82 per task. That is roughly sixty times the cost of the standard tier, and about ten times the cost of Composer's $0.44 Fast tier, for three to four index points. For most of the work most teams run, the points are not worth the multiple.

The catch is that cost per task is a property of your traffic, not a number on a launch slide, and the cheapest sticker price does not always win it. Composer 2.5 is also locked inside one editor with no API, and the "cheap tier" is no longer uniformly getting cheaper: Gemini 3.5 Flash shipped at three times the price of the 3-Flash preview it follows and six times the output price of Gemini 3.1 Flash-Lite. The only way to know your real number is to meter it.

For two years the AI conversation was a contest for the top of the benchmark. In 2026 the conversation enterprises are actually having is about the bill. The interesting models this year are not the smartest, they are the "workhorse" class: fast, cheap, and close enough to the frontier that the gap stops mattering once you multiply it by your real call volume. Cursor's Composer 2.5 is the clearest example, and the way to read it correctly is to stop looking at the score and the per-token price, and look at one number instead.

Cost per task, not cost per token

Price per million tokens is the headline number every launch leads with, and it is close to useless for budgeting. Two models at the same per-token price can differ by an order of magnitude in what they cost to finish a task, because they differ in how many tokens they burn getting there: reasoning and "thinking" tokens, retries, tool-call round trips, and wasted context. The unit that maps to your invoice is dollars per completed task, and it is the unit the independent Artificial Analysis Coding Agent Index now reports.
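To make the difference concrete, here is a minimal sketch of how dollars per task could be computed from a token breakdown. The Attempt record, the two example models, and their token counts are hypothetical, and the sketch assumes reasoning tokens are billed at the output rate, which you should check against your provider's pricing.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Token usage for one model call made while working on a task (hypothetical schema)."""
    input_tokens: int
    output_tokens: int     # visible output
    reasoning_tokens: int  # assumption: billed at the output rate


def cost_per_task(attempts, input_price_per_m, output_price_per_m):
    """Dollars spent across every call, retries included, needed to finish one task."""
    total = 0.0
    for a in attempts:
        total += a.input_tokens / 1_000_000 * input_price_per_m
        total += (a.output_tokens + a.reasoning_tokens) / 1_000_000 * output_price_per_m
    return total


# Two hypothetical models at the same list price: $5 in / $25 out per 1M tokens.
lean = [Attempt(20_000, 2_000, 0)]                           # one call, no reasoning tokens
chatty = [Attempt(60_000, 4_000, 30_000) for _ in range(3)]  # three round trips, heavy reasoning

lean_cost = cost_per_task(lean, 5.00, 25.00)
chatty_cost = cost_per_task(chatty, 5.00, 25.00)
print(f"lean: ${lean_cost:.2f}  chatty: ${chatty_cost:.2f}  ratio: {chatty_cost / lean_cost:.0f}x")
```

With these made-up token counts, the same price sheet yields about $0.15 for one task and $3.45 for the other, a gap of roughly 23 times that the per-token price never shows.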

Source: Artificial Analysis Coding Agent Index. Composer 2.5 was measured inside Cursor, Claude Opus 4.7 (max reasoning) inside Claude Code, GPT-5.5 (xhigh reasoning) inside Codex. Composer 2.5 standard at $0.07 per task is roughly one sixtieth the cost of the two higher-scoring agents; its Fast tier at $0.44 is roughly one tenth.

That is the whole workhorse argument in one chart. The frontier agents win the index, but they win it at a price most teams cannot pay across their real call volume. When a task costs four to five dollars instead of seven cents, "just use the best model for everything" stops being a strategy and becomes a budget incident.
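A quick way to see the budget incident is to multiply the measured per-task figures by a call volume. The per-task costs below are the ones quoted above; the 50,000 tasks per month is a hypothetical volume chosen only for illustration, so substitute your own.

```python
# Measured cost per task as quoted above (Artificial Analysis Coding Agent Index).
COST_PER_TASK = {
    "Composer 2.5 (standard)": 0.07,
    "Composer 2.5 (Fast)": 0.44,
    "Claude Opus 4.7": 4.10,
    "GPT-5.5": 4.82,
}

TASKS_PER_MONTH = 50_000  # hypothetical volume, not from the index


def monthly_spend(tasks_per_month):
    """Projected monthly dollars per model if every task ran on that one model."""
    return {model: cost * tasks_per_month for model, cost in COST_PER_TASK.items()}


spend = monthly_spend(TASKS_PER_MONTH)
for model, dollars in spend.items():
    print(f"{model:<26} ${dollars:>12,.2f}")
```

At that assumed volume the standard tier comes to about $3,500 a month and GPT-5.5 to about $241,000, which is the difference between a line item and a budget meeting.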

But is the cheap one actually good?

Close enough that the gap is a rounding error for most work. On Cursor's own internal CursorBench v3.1, Composer 2.5 scores 63.2% against Opus 4.7's 64.8%, a 1.6-point difference. On the independent Coding Agent Index it sits at 62 against 65 and 66 for GPT-5.5 and Opus 4.7. Plot capability against cost and the shape of the decision is obvious.

Vertical axis: Artificial Analysis Coding Agent Index, a composite of multiple coding benchmarks. Horizontal axis: measured cost per task. Up and to the left is better value. Composer 2.5 gives up three to four index points and cuts cost per task by at least an order of magnitude. For the bulk of routine coding work, that is the trade most teams should take, routing only the few genuinely hard tasks to the frontier.

This is why the workhorse class is the real story of 2026 and not the frontier. The hardest problems, the upfront architecture and the gnarly debugging, are worth the frontier model and its price. The vast majority of what an agent actually does, writing the code, editing files, running the loop, is routine, and a model a few points off the top finishes it for a fraction of the cost. The skill is no longer "pick the best model," it is "send each task to the cheapest model that still clears the bar."

The catch the chart does not show: the availability tax

Composer 2.5 has a constraint that never appears in a price table: it runs only inside Cursor. There is no public REST endpoint, so you cannot call it from your own agent, your CI pipeline, or a gateway, and you cannot route to it programmatically. The seven-cent task is real, but it is fenced inside one product. That is a genuine cost, paid in lock-in and lost optionality rather than dollars per token, and it is exactly the kind of thing a per-token chart cannot see. The general lesson holds for every "cheap" model: the sticker price is the start of the cost question, not the answer. List price and real cost are different numbers, and the gap is where budgets go wrong.

The counterforce: cheap is not getting uniformly cheaper

It is tempting to assume models only ever get cheaper, so the budget problem solves itself. It does not. The clearest evidence is Google's own "cheap" tier: Gemini 3.5 Flash launched at $1.50 per million input and $9.00 per million output, which is three times the price of the 3-Flash preview it follows and six times the price of Gemini 3.1 Flash-Lite. Google is putting an expensive model behind its highest-volume products on purpose, and as one widely-read analysis put it, all three major labs are now openly "probing the price tolerance of their API customers."

Source: Google Gemini API pricing. The model branded "Flash" rose to six times the output price of the prior Flash-Lite. The cheap workhorse tier and the flagship tier are diverging, not converging, which means you cannot assume next quarter's bill will be smaller by default.

So the two forces are pulling in opposite directions at once. Specialist and self-built models like Composer 2.5 are crushing cost per task, while flagship and even "fast" managed models are testing how much more you will pay. The net effect on your bill is not predictable from any vendor's chart. It depends entirely on which models run which of your workloads, and how heavily.

List prices, for reference

List output price is the number to be most skeptical of, but it is useful context. All figures are USD per 1M tokens, from each provider's pricing page.

Model	Input $/1M	Output $/1M	Class	API access
Cursor Composer 2.5 (standard)	$0.50	$2.50	Workhorse	Cursor only
Cursor Composer 2.5 (Fast)	$3.00	$15.00	Workhorse	Cursor only
Gemini 3.1 Flash-Lite	$0.25	$1.50	Workhorse	Open API
Gemini 3.5 Flash	$1.50	$9.00	Fast / general	Open API
Claude Opus 4.7	$5.00	$25.00	Frontier	Open API
GPT-5.5	$5.00	$30.00	Frontier	Open API

Read the output column next to the cost-per-task chart and the disconnect is the point: Composer's standard output rate ($2.50) is one tenth of Claude Opus 4.7's ($25) and one twelfth of GPT-5.5's ($30), yet its measured cost per task is roughly one sixtieth to one seventieth of theirs ($0.07 against $4.10 and $4.82), because the workhorse also burns fewer tokens per task. Per-token price understates the gap. Only the per-task measurement on real work captures it.
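The arithmetic behind that disconnect can be sketched from the two sets of figures in this article. The "implied token ratio" is a rough back-of-envelope quantity, not a measurement: it assumes cost scales only with output-priced tokens at list price and ignores input tokens, caching, and any plan discounts.

```python
# List output prices ($ per 1M tokens) and measured cost per task ($), as quoted in this article.
MODELS = {
    "Claude Opus 4.7": {"output_price": 25.00, "cost_per_task": 4.10},
    "GPT-5.5": {"output_price": 30.00, "cost_per_task": 4.82},
}
COMPOSER = {"output_price": 2.50, "cost_per_task": 0.07}


def gap(frontier, workhorse=COMPOSER):
    """Return (price ratio, cost-per-task ratio, implied token ratio) versus the workhorse."""
    price_ratio = frontier["output_price"] / workhorse["output_price"]
    task_ratio = frontier["cost_per_task"] / workhorse["cost_per_task"]
    # Simplifying assumption: whatever the price ratio does not explain
    # is attributed to the number of tokens burned per task.
    implied_token_ratio = task_ratio / price_ratio
    return price_ratio, task_ratio, implied_token_ratio


for name, figures in MODELS.items():
    price_ratio, task_ratio, token_ratio = gap(figures)
    print(f"{name}: {price_ratio:.0f}x price, {task_ratio:.0f}x cost per task, "
          f"~{token_ratio:.1f}x implied tokens")
```

On these numbers the price gap is 10x and 12x, the per-task gap is about 59x and 69x, and the remainder works out to a little under six times as many billed tokens per task under the stated assumption.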

What this means operationally

The enterprise version of this story is not a model launch, it is a budget meeting. The most heated topic among engineering leaders this year is that AI spend is unpredictable and climbing, and that no one feels they have it solved. The strategies that are emerging are all variations on one move: stop treating the frontier as the default, and start managing models like a portfolio.

1. Route by task, not by habit

Send the upfront planning and the genuinely hard problems to the frontier, and send the high-volume routine execution to the workhorse. Done well, model routing captures most of the frontier's quality on the few tasks that need it and most of the workhorse's savings on the many that do not. The trap is routing on vibes. The routing decision should be made from measured cost and quality per task per model, or you are just guessing with extra steps. Practitioners are already documenting order-of-magnitude savings from exactly this kind of switch, but the ones who keep the savings are the ones who measured before and after. We break down the architecture itself, routing versus cascading, the gateway tooling, and the failure modes, in the router pattern guide.
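As a minimal sketch of routing on measurements rather than vibes, the rule "cheapest model that still clears the bar" might look like the following. The ModelStats shape, the model labels, and every number in the example are hypothetical; in practice they would come from your own metered traffic per task class.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    cost_per_task: float  # measured dollars per completed task
    success_rate: float   # measured fraction of tasks passing your quality check


def pick_model(stats, quality_bar):
    """Cheapest model whose measured success rate clears the bar.

    If nothing clears it, fall back to the highest-quality model available.
    """
    eligible = [s for s in stats if s.success_rate >= quality_bar]
    if eligible:
        return min(eligible, key=lambda s: s.cost_per_task)
    return max(stats, key=lambda s: s.success_rate)


# Hypothetical measurements for two task classes.
routine_edits = [ModelStats("workhorse", 0.07, 0.93), ModelStats("frontier", 4.10, 0.95)]
hard_debugging = [ModelStats("workhorse", 0.12, 0.55), ModelStats("frontier", 4.80, 0.82)]

print(pick_model(routine_edits, quality_bar=0.90).name)   # workhorse clears the bar
print(pick_model(hard_debugging, quality_bar=0.80).name)  # only frontier clears the bar
```

The design choice worth noting is that the bar is set per task class, so the same two models can trade places depending on the work, and the decision changes automatically when the measurements do.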

2. Cap spend by team and capability

"Unlimited" AI access is being replaced everywhere by explicit budgets, and the durable pattern is a spend cap that enforces rather than warns. Set it by what a team actually needs, give an exploration team headroom, and keep the high-volume production paths on the cheap models by default.
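The difference between a cap that enforces and one that warns is whether the call is refused before the money is spent. Here is a small in-memory sketch of the enforcing kind; the class names, team names, and dollar limits are all hypothetical, and a production version would need persistent, concurrency-safe storage.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push a team past its cap."""


class SpendCap:
    def __init__(self, limits_usd):
        self.limits = dict(limits_usd)
        self.spent = {team: 0.0 for team in self.limits}

    def charge(self, team, estimated_cost_usd):
        """Record spend before the call is made; refuse it if the cap would be exceeded."""
        projected = self.spent[team] + estimated_cost_usd
        if projected > self.limits[team]:
            raise BudgetExceeded(
                f"{team}: ${projected:.2f} would exceed cap of ${self.limits[team]:.2f}"
            )
        self.spent[team] = projected


# The exploration team gets headroom; production stays on a tighter cap.
caps = SpendCap({"exploration": 500.00, "production": 200.00})
caps.charge("production", 150.00)

try:
    caps.charge("production", 75.00)  # would reach $225, over the $200 cap
    blocked = False
except BudgetExceeded as err:
    blocked = True
    print(f"blocked: {err}")
```

A refused call leaves the recorded spend unchanged, so the caller can retry on a cheaper model instead of failing outright.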

3. Measure cost per task, continuously

Every point above depends on one capability: seeing per-model, per-workload cost per task on your own traffic, not on a launch slide. Benchmark scores age in weeks and list prices mislead by design. The number that matters is the one you measure yourself, and it has to be live, because both the models and their prices are moving every month.

Where UsageBox fits

UsageBox is built to make cost per task a number you query instead of a number you argue about. Instrument every model call with the model id, the full token breakdown including reasoning tokens, a quality signal, and the task outcome, and the questions that decide your AI budget become reports over your own data: what does each workload cost per task on each model, which routes would be cheaper without losing quality, and which team is about to blow its cap this week. When the next workhorse ships and the next flagship raises its price, you re-run the query instead of re-running the argument. Pick the workhorse on your data, enforce the caps, and let the chart settle the rest.
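To illustrate the kind of report this describes, here is a generic sketch of a per-workload, per-model cost-per-task query over call records. It is not UsageBox's API: the CallRecord fields, the workload and model labels, and the sample data are hypothetical and stand in for whatever your metering layer actually stores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One metered model call (hypothetical schema)."""
    task_id: str
    workload: str
    model: str
    cost_usd: float
    task_succeeded: bool  # set on the call that completed the task


def cost_per_completed_task(records):
    """Total spend per (workload, model) divided by the number of tasks that finished.

    Spend on retries and on tasks that never succeeded stays in the numerator,
    so wasted calls raise the reported cost instead of disappearing.
    """
    spend = defaultdict(float)
    completed = defaultdict(set)
    for r in records:
        key = (r.workload, r.model)
        spend[key] += r.cost_usd
        if r.task_succeeded:
            completed[key].add(r.task_id)
    return {k: (spend[k] / len(completed[k]) if completed[k] else None) for k in spend}


records = [
    CallRecord("t1", "edit", "workhorse", 0.05, False),  # first attempt failed
    CallRecord("t1", "edit", "workhorse", 0.04, True),   # retry succeeded
    CallRecord("t2", "edit", "workhorse", 0.06, False),  # never completed
    CallRecord("t3", "edit", "frontier", 4.00, True),
]

report = cost_per_completed_task(records)
for (workload, model), dollars in report.items():
    print(workload, model, dollars)
```

Dividing by completed tasks rather than by calls is the point: in the sample data the workhorse spent $0.15 to finish one task even though no single call cost more than six cents.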

FAQ

What is "cost per task" and why does it matter more than price per token? Cost per task is the total dollars a model spends to finish one unit of real work, including reasoning tokens, retries, and tool-call round trips. Two models at the same per-token price can differ tenfold in cost per task because they differ in how many tokens they burn. It is the unit that maps to your invoice; price per token is not.

How much does Cursor Composer 2.5 cost? List pricing is $0.50 per million input and $2.50 per million output tokens on the standard tier, and $3.00 / $15.00 on the Fast tier. Measured on the Artificial Analysis Coding Agent Index, that worked out to about $0.07 per task standard and $0.44 Fast, against $4.10 for Claude Opus 4.7 and $4.82 for GPT-5.5.

Is Composer 2.5 as good as Opus 4.7 or GPT-5.5? Almost. It scores 63.2% on Cursor's CursorBench v3.1 against 64.8% for Opus 4.7, and 62 on the Coding Agent Index against 65 for GPT-5.5 and 66 for Opus 4.7. It gives up three to four index points and costs roughly one sixtieth as much per task on the standard tier (about one tenth on the Fast tier), which is the trade most routine work should take.

Can I use Composer 2.5 through an API? No. Composer 2.5 runs only inside Cursor; there is no public REST endpoint. You cannot call it from your own agent, CI, or a model router. That lack of API access is a real cost in lock-in and lost optionality that the low per-token price does not reflect.

Are AI models getting cheaper over time? Not uniformly. Specialist and self-built workhorse models are driving cost per task down sharply, but several flagship and "fast" managed models are raising prices: Gemini 3.5 Flash launched at six times the output price of Gemini 3.1 Flash-Lite. Your bill depends on which models run your workloads, not on a general trend.

How do I actually control AI coding costs? Route each task to the cheapest model that clears your quality bar, cap spend per team with enforcement rather than warnings, and measure cost per task per model on your own traffic continuously. All three depend on metering, which is what UsageBox provides.

Sources

Cost-per-task and Coding Agent Index figures are from Artificial Analysis; per-token list prices are from each provider's official pricing page; the Gemini 3.5 Flash price-increase analysis is from Simon Willison.

Key Topics

• cost per task
• Cursor Composer 2.5
• Gemini 3.5 Flash
• GPT-5.5
• Claude Opus 4.7
• workhorse models
• model routing
• LLM cost
• AI FinOps
• token pricing
• spend caps
• June 2026
