The short version: If you bill per request, per API call, or per GB, AI crawlers and scrapers can quietly become a line item on your customers' invoices and on your own infrastructure bill. The most upvoted version of this story in 2026 is a developer whose host charged for 11 million requests from a single Meta crawler in 15 days, on top of 2.5 million from Perplexity and 800,000-plus each from GPTBot and Claude. Robots.txt will not save you, because it is advisory and several agents ignore it. The durable fix is to classify each event as human or bot at ingestion and exclude non-billable traffic before it reaches a meter or an invoice.
The problem in one screenshot
In early 2026 a webdev post titled "Meta's crawler made 11 MILLION requests to my site in 15 days. Vercel charged me for every single one" reached roughly 3,200 upvotes and 400 comments. The author's own server logs told the story: against 24.6 million real user requests, Meta's crawler generated 11.2 million, Perplexity 2.5 million, Googlebot 1.2 million, Amazon 1.1 million, OpenAI's GPTBot 827,000, and Claude 819,000. As the author put it, Meta alone was "sending nearly half as much traffic as my actual users," roughly 750,000 requests per day from one bot over those 15 days, and nearly ten times what Googlebot sent.
That post resonated because the pain is universal once you bill on consumption. The same surge shows up in a follow-up thread, "AI crawlers are chewing through my staging bandwidth now," where a small SaaS found its staging environment hit hard enough to land on the bill. When your pricing is usage-based, every one of those requests is a candidate to be metered. The question stops being "how do I block bots" and becomes "how do I make sure I never charge anyone, including myself, for traffic no human asked for."
What actually counts as billable usage
Before you can exclude bot traffic you need a written definition of billable usage. The useful line is intent: a billable event is one a customer's human user or their own authenticated integration deliberately caused. A crawler indexing public pages, a scraper harvesting training data, an uptime probe, and your own CI hitting staging are all non-billable by that definition, even though they consume the same CPU and bandwidth.
Two distinct cost surfaces are at stake, and they pull in the same direction:
- Your customer's invoice. If your product meters your customer's end users (page views, API calls they expose, GB served), crawler traffic against their tenant inflates what you charge them. They will dispute it, and they will be right.
- Your own infrastructure bill. The egress, function invocations, and database reads that crawlers trigger are real money you pay your cloud or edge provider, whether or not you pass them on.
The same classification decision protects both. Tag an event as bot once, and you can drop it from the customer meter and account for it separately in your own cost model.
Why robots.txt is not the answer
The instinctive reach is for robots.txt. It does not solve a billing problem, for a reason worth stating plainly: robots.txt is advisory, not enforced. It works only through voluntary compliance, and it offers no technical block against an agent that decides to ignore it. Reputable search crawlers honor it; an increasing share of AI scrapers do not bother to check it at all.
The clearest documented case is Perplexity. In 2025 Cloudflare reported that customers who had disallowed Perplexity in robots.txt, and even added firewall rules against its declared crawlers, still saw their content accessed. According to that report, Perplexity fell back to an undeclared user agent impersonating Chrome on macOS and rotated IPs across different networks to evade blocks, a pattern observed across tens of thousands of domains. Whatever you conclude about that dispute, the operational lesson is firm: a request can reach your billable surface no matter what your robots.txt says, so the exclusion logic has to live where you count, not in a file the crawler is free to ignore.
Detecting bot and crawler traffic
Detection runs on a spectrum from cheap and honest to expensive and adversarial. Layer it, because no single signal is sufficient.
User-agent matching for declared bots
The well-behaved AI crawlers announce themselves, and matching their user-agent strings catches the bulk of volume from cooperative operators. The 2026 roster to recognize includes OpenAI's GPTBot, OAI-SearchBot, and ChatGPT-User; Anthropic's ClaudeBot; PerplexityBot and Perplexity-User; Google's Google-Extended; Amazon's Amazonbot; Meta's meta-externalagent and FacebookBot; Apple's Applebot-Extended; and Common Crawl's CCBot. Two of these, Google-Extended and Applebot-Extended, are robots.txt control tokens rather than strings sent in the User-Agent header, so they matter for policy but will not show up in request matching. A maintained allow/deny list of these strings, refreshed regularly because the list changes, is the floor of any classification step.
User-agent matching has a known ceiling: it trusts a header the client controls. It catches honest bots and misses the ones spoofing a browser, which is exactly the population that evades robots.txt.
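A minimal sketch of that floor, assuming a simple case-insensitive substring match is good enough for declared crawlers. The function name `match_declared_bot` and the tuple layout are illustrative, not part of any library, and the token list is only the roster named above minus the two robots.txt-only tokens.

```python
from typing import Optional

# Crawler tokens taken from the roster in the text. Google-Extended and
# Applebot-Extended are omitted because they are robots.txt control tokens,
# not values that appear in the User-Agent header. The list is not exhaustive
# and needs regular refreshing.
DECLARED_AI_CRAWLERS = (
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Perplexity-User",
    "Amazonbot",
    "meta-externalagent",
    "FacebookBot",
    "CCBot",
)


def match_declared_bot(user_agent: Optional[str]) -> Optional[str]:
    """Return the first declared crawler token found in the header, else None."""
    if not user_agent:
        return None
    lowered = user_agent.lower()
    for token in DECLARED_AI_CRAWLERS:
        if token.lower() in lowered:
            return token
    return None
```

A `None` result only means no declared crawler matched. It says nothing about whether the client is human, which is why the later layers exist.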
Verified-bot validation by reverse DNS
To trust a user-agent claim, verify it. Major crawlers publish IP ranges or support reverse-DNS validation: you reverse-resolve the requesting IP to a hostname, check that the hostname belongs to the operator's domain, then forward-resolve that hostname and confirm it maps back to the same IP. This is how edge providers build their verified-bot allowlists. It separates a real Googlebot from something merely calling itself Googlebot, and it is the difference between a signal you can bill on and one you cannot.
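The reverse-then-forward check can be sketched with the standard library alone. This is an illustration under assumptions: `verify_bot_ip` and the `TRUSTED_SUFFIXES` table are hypothetical names, the Googlebot suffixes should be confirmed against Google's current verification guidance, and the resolvers are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Hostname suffixes per operator. Assumption: verify these against each
# operator's published guidance and extend the table for other crawlers.
TRUSTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
}


def _reverse(ip: str) -> str:
    # PTR lookup: IP address to primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # Forward lookup: hostname to every address it resolves to.
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def verify_bot_ip(
    ip: str,
    claimed_bot: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS for a user-agent claim."""
    suffixes = TRUSTED_SUFFIXES.get(claimed_bot)
    if not suffixes:
        return False  # no verification rule for this bot: do not trust the claim
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False  # PTR record is outside the operator's domain
        # Plain string comparison; IPv6 addresses may need normalizing first.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

DNS lookups are slow relative to a request, so in practice the result would be cached per IP rather than computed inline for every event.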
Edge and WAF bot management for the adversarial tail
For traffic that spoofs browsers and rotates IPs, behavioral bot management at the edge is the practical tool. Cloudflare and Vercel both maintain verified-bot allowlists and bot-score signals, and both let you exclude verified bots from evaluation while challenging or flagging the suspicious remainder. Cloudflare exposes a verified_bot boolean and a bot score; Vercel's bot management automatically excludes verified bots such as Google's crawler from evaluation. The value to billing is the label these systems attach to a request. If the edge has already decided a request is a bot, that verdict can ride along into your event stream and drive the exclusion.
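One way the edge verdict could ride along is as two fields on the event, mapped to a traffic type at ingestion. The sketch below assumes your edge configuration forwards a verified-bot flag and a numeric bot score to the origin. `EdgeVerdict`, `traffic_type_from_edge`, and the cutoff of 30 are illustrative assumptions, not a provider API, and the cutoff should be tuned against your own traffic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: lower scores mean "more likely automated", and anything below
# this cutoff is treated as a suspected bot.
SUSPECT_BELOW = 30


@dataclass(frozen=True)
class EdgeVerdict:
    verified_bot: bool
    bot_score: Optional[int]  # None when the edge supplied no score


def traffic_type_from_edge(verdict: EdgeVerdict) -> Tuple[str, str]:
    """Return (traffic_type, deciding_signal) for one request."""
    if verdict.verified_bot:
        return ("verified-bot", "edge verified-bot")
    if verdict.bot_score is None:
        # No signal from the edge: fall through to other layers or default.
        return ("human", "no edge signal")
    if verdict.bot_score < SUSPECT_BELOW:
        return ("suspected-bot", "edge bot-score")
    return ("human", "edge bot-score")
```

Returning the deciding signal alongside the label keeps the audit trail honest: a reviewer can later see whether a request was excluded because the edge verified it or because a score fell under the threshold.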
Where to exclude it: at the meter, not after the invoice
Detection produces a label. The architecture question is where that label gets acted on, and the answer is as early as possible. Excluding bot events at ingestion, before they ever become billable records, is cleaner than netting them out of an invoice after the fact. An invoice you correct is a dispute you already had; an event you never billed is a dispute that never happened. This is the same principle behind treating ingestion as the boundary where application events become billable records, covered in the companion piece on building a usage ingestion pipeline that does not lose revenue.
The pattern that holds up:
- Tag every event with a classification at the source or the edge. Attach a traffic_type of human, verified-bot, or suspected-bot, plus the signal that decided it (matched user-agent, verified by rDNS, edge bot-score). Carry it on the event the same way you carry the idempotency identifier.
- Filter before aggregation. The meter sums only events labeled billable. Non-billable events are not discarded; they are kept on a separate, non-billed stream so you retain a full record.
- Keep the bot events for cost accounting. Those filtered events are exactly the data you need to understand your own infrastructure spend and to show a customer, with evidence, why their bill is lower than their raw request count.
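The three steps above can be sketched as a single pass over tagged events. This is an illustration, not UsageBox's implementation: `UsageEvent`, `meter`, and the field names other than traffic_type are hypothetical, and the sketch assumes only events tagged human are billable.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Assumption: only this traffic type reaches the billable aggregate.
BILLABLE_TYPES = frozenset({"human"})


@dataclass(frozen=True)
class UsageEvent:
    event_id: str      # idempotency identifier, carried alongside the label
    customer_id: str
    quantity: int
    traffic_type: str  # "human", "verified-bot", or "suspected-bot"
    signal: str        # what decided the label, e.g. "verified by rDNS"


def meter(events: Iterable[UsageEvent]) -> Tuple[Dict[str, int], List[UsageEvent]]:
    """Sum billable events per customer; keep everything else on a side stream."""
    billable_totals: Dict[str, int] = defaultdict(int)
    non_billed: List[UsageEvent] = []
    for event in events:
        if event.traffic_type in BILLABLE_TYPES:
            billable_totals[event.customer_id] += event.quantity
        else:
            # Retained, not discarded: this is the cost-accounting record.
            non_billed.append(event)
    return dict(billable_totals), non_billed
```

Because the filter is an allowlist, an event with a missing or unrecognized label lands on the non-billed stream rather than on an invoice, which is the safer failure mode for a billing system.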
UsageBox is built to be that filter layer. Because every event flows through ingestion before it is metered, you tag non-human traffic with a classification field and exclude it from the billable aggregate while keeping it queryable. The customer's invoice reflects human and authenticated usage; the bot stream stays visible for your own FinOps. You decide what counts, the meter never sees what should not be billed, and the audit trail explains the difference.
Make the exclusion visible to customers
The exclusion only builds trust if customers can see it. The fastest way to turn a "why is this bill so high" ticket into a non-event is to show usage broken out by traffic type: human, verified bot (excluded), suspected bot (excluded). A customer who can see that you filtered 11 million crawler requests out of their meter is a customer who trusts the number you did charge. This is the transparency lever that customer-facing usage visibility provides, and it is the same evidence base that resolves disputes when they do arise, as covered in handling billing disputes with audit trails. It is also the difference between the calm conversation and the furious one described in what to do when customers complain about unexpected bills.
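A breakdown like that can be derived from the same tagged events. The sketch below is a hypothetical helper, assuming each input row is a (traffic_type, quantity) pair for one customer and that only human traffic is billed; `usage_breakdown` and the row keys are illustrative names.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Display order for the customer-facing view.
TRAFFIC_TYPES = ("human", "verified-bot", "suspected-bot")


def usage_breakdown(rows: Iterable[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Aggregate request counts per traffic type and mark which are billed."""
    counts: Counter = Counter()
    for traffic_type, quantity in rows:
        if traffic_type not in TRAFFIC_TYPES:
            # Surface unlabeled data instead of silently hiding it.
            raise ValueError(f"unknown traffic_type: {traffic_type!r}")
        counts[traffic_type] += quantity
    return [
        {
            "traffic_type": traffic_type,
            "requests": counts[traffic_type],
            "billed": traffic_type == "human",
        }
        for traffic_type in TRAFFIC_TYPES
    ]
```

Every traffic type appears in the output even when its count is zero, so a customer sees an explicit "0 excluded" rather than a missing row.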
A short checklist
| Step | Prevents |
|---|---|
| Written definition of billable usage based on human or authenticated intent | Ad hoc, inconsistent decisions about what to charge |
| Maintained user-agent list for declared AI crawlers (GPTBot, ClaudeBot, PerplexityBot, meta-externalagent, and more) | Billing honest, identifiable crawler volume |
| Reverse-DNS or IP-range verification on user-agent claims | Trusting a spoofed Googlebot |
| Edge or WAF bot-score label carried onto the event | Missing browser-spoofing scrapers that ignore robots.txt |
| Exclusion applied at ingestion, before aggregation | Invoice corrections and disputes after the fact |
| Bot events retained on a non-billed stream | Losing the evidence for cost accounting and customer trust |
| Traffic-type breakdown shown to customers | "Why is this so high" tickets on inflated raw counts |
None of this stops crawlers from hitting your origin; that is a separate infrastructure fight. What it does is guarantee that non-human traffic never lands on a customer's invoice and never gets mistaken for revenue in your own numbers. For the rest of the metering integrity story, see how to count each event exactly once in idempotent usage metering, and how to enforce limits on the traffic that is billable in usage API enforcement.