Should you bill customers for bot and crawler traffic?

No, not by default. A billable event is one a human user or an authenticated integration deliberately caused. AI crawlers, scrapers, uptime probes, and your own CI traffic consume the same resources but were not requested by the customer, so charging for them inflates the invoice and produces disputes. The right approach is to classify each event as human or bot and exclude non-human traffic from the billable aggregate, while keeping it on a separate stream for your own infrastructure cost accounting.

Why does robots.txt not stop bot traffic from being billed?

Because robots.txt is advisory, not enforced. It works only through voluntary compliance and provides no technical block against an agent that ignores it. Reputable search crawlers honor it, but several AI scrapers do not check it at all. Cloudflare documented Perplexity continuing to access content after customers had disallowed it in robots.txt and even added firewall rules; it did so by falling back to an undeclared user agent and rotating IPs. The practical conclusion is that exclusion logic must live at your metering layer, where you count, not in a file the crawler can ignore.

How do you detect AI crawler and bot traffic?

Layer three signals. First, match declared user-agent strings such as GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, Amazonbot, meta-externalagent, Applebot-Extended, and CCBot. Second, verify those claims with reverse-DNS or published IP-range validation so a spoofed Googlebot does not slip through. Third, for browser-spoofing scrapers, use edge or WAF bot management (Cloudflare verified-bot and bot score, Vercel bot management) and carry that verdict onto the event. No single signal is enough on its own.

Where should bot traffic be excluded from billing?

At ingestion, before events are aggregated into a meter, not by correcting the invoice afterward. Tag each event with a traffic type (human, verified bot, suspected bot) plus the signal that decided it, then have the meter sum only the billable events. Keep the excluded bot events on a separate non-billed stream so you retain a full record for cost accounting and for showing customers why their bill is lower than their raw request count. An event you never billed is a dispute that never happened.

How much can AI crawlers actually inflate a usage bill?

Substantially. One widely shared 2026 case showed a single site logging 11.2 million requests from Meta crawlers in 15 days, 2.5 million from Perplexity, plus 827,000 from GPTBot and 819,000 from Claude, against 24.6 million real user requests, with the host charging for all of it. On a per-request or per-GB plan, non-human traffic on that scale can rival or exceed legitimate usage, which is why excluding it at the meter directly protects both the customer invoice and your own infrastructure spend.

Should You Bill for Bot and Crawler Traffic? Keeping Non-Human Usage Out of Metered Invoices

Author: UsageBox

The short version: If you bill per request, per API call, or per GB, AI crawlers and scrapers can quietly become a line item on your customers' invoices and on your own infrastructure bill. The most upvoted version of this story in 2026 comes from a developer whose host charged for 11 million requests from a single Meta crawler in 15 days, on top of 2.5 million from Perplexity and 800,000-plus each from GPTBot and Claude. Robots.txt will not save you, because it is advisory and several agents ignore it. The durable fix is to classify each event as human or bot at ingestion and exclude non-billable traffic before it reaches a meter or an invoice.

The problem in one screenshot

In early 2026 a webdev post titled "Meta's crawler made 11 MILLION requests to my site in 15 days. Vercel charged me for every single one" reached roughly 3,200 upvotes and 400 comments. The author's own server logs told the story: against 24.6 million real user requests, Meta's crawler generated 11.2 million, Perplexity 2.5 million, Googlebot 1.2 million, Amazon 1.1 million, OpenAI's GPTBot 827,000, and Claude 819,000. As the author put it, Meta alone was "sending nearly half as much traffic as my actual users," roughly 750,000 requests per day from one bot and roughly nine times what Googlebot sent.

That post resonated because the pain is universal once you bill on consumption. The same surge shows up in a follow-up thread, "AI crawlers are chewing through my staging bandwidth now," where a small SaaS found its staging environment hit hard enough to land on the bill. When your pricing is usage-based, every one of those requests is a candidate to be metered. The question stops being "how do I block bots" and becomes "how do I make sure I never charge anyone, including myself, for traffic no human asked for."

What actually counts as billable usage

Before you can exclude bot traffic you need a written definition of billable usage. The useful line is intent: a billable event is one a customer's human user or their own authenticated integration deliberately caused. A crawler indexing public pages, a scraper harvesting training data, an uptime probe, and your own CI hitting staging are all non-billable by that definition, even though they consume the same CPU and bandwidth.
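One way to keep that written definition from drifting is to encode it once and import it everywhere a billing decision is made. The sketch below is illustrative only: the `Cause` enum, its member names, and `is_billable` are hypothetical names chosen for this example, not part of any product API.

```python
from enum import Enum


class Cause(Enum):
    """Who or what caused a usage event (hypothetical taxonomy)."""
    HUMAN_USER = "human_user"
    AUTHENTICATED_INTEGRATION = "authenticated_integration"
    CRAWLER = "crawler"
    SCRAPER = "scraper"
    UPTIME_PROBE = "uptime_probe"
    INTERNAL_CI = "internal_ci"


# The written definition: only deliberate customer-side intent is billable.
BILLABLE_CAUSES = frozenset({Cause.HUMAN_USER, Cause.AUTHENTICATED_INTEGRATION})


def is_billable(cause: Cause) -> bool:
    """Return True only for causes the definition treats as billable."""
    return cause in BILLABLE_CAUSES
```

Keeping the billable set in one place means a change to the policy is a one-line diff that can be reviewed, rather than a scattering of ad hoc conditions.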

Two distinct cost surfaces are at stake, and they pull in the same direction:

Your customer's invoice. If your product meters your customer's end users (page views, API calls they expose, GB served), crawler traffic against their tenant inflates what you charge them. They will dispute it, and they will be right.
Your own infrastructure bill. The egress, function invocations, and database reads that crawlers trigger are real money you pay your cloud or edge provider, whether or not you pass them on.

The same classification decision protects both. Tag an event as bot once, and you can drop it from the customer meter and account for it separately in your own cost model.

Why robots.txt is not the answer

The instinctive fix is robots.txt. It does not solve a billing problem, for a reason worth stating plainly: robots.txt is advisory, not enforced. It works only through voluntary compliance, and it offers no technical block against an agent that decides to ignore it. Reputable search crawlers honor it; an increasing share of AI scrapers do not bother to check it at all.

The clearest documented case is Perplexity. In 2025 Cloudflare reported that customers who had disallowed Perplexity in robots.txt, and even added firewall rules against its declared crawlers, still saw their content accessed. Perplexity was observed falling back to an undeclared user agent that impersonated Chrome on macOS and rotating IPs across different networks to evade blocks, a pattern seen across tens of thousands of domains. Whatever you conclude about that dispute, the operational lesson is firm: a request can reach your billable surface no matter what your robots.txt says, so the exclusion logic has to live where you count, not in a file the crawler is free to ignore.

Detecting bot and crawler traffic

Detection runs on a spectrum from cheap and honest to expensive and adversarial. Layer it, because no single signal is sufficient.

User-agent matching for declared bots

The well-behaved AI crawlers announce themselves, and matching their user-agent strings catches the bulk of volume from cooperative operators. The 2026 roster to recognize includes OpenAI's GPTBot, OAI-SearchBot, and ChatGPT-User; Anthropic's ClaudeBot; PerplexityBot and Perplexity-User; Google's Google-Extended; Amazon's Amazonbot; Meta's meta-externalagent and FacebookBot; Apple's Applebot-Extended; and Common Crawl's CCBot. A maintained allow/deny list of these strings, refreshed regularly because the list changes, is the floor of any classification step.

User-agent matching has a known ceiling: it trusts a header the client controls. It catches honest bots and misses the ones spoofing a browser, which is exactly the population that evades robots.txt.
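A minimal sketch of that floor, assuming a case-insensitive substring match is good enough for a first pass. The token list is copied from the roster above; the function name `match_declared_bot` is made up for this example, and a production list would be loaded from a maintained source rather than hard-coded.

```python
from typing import Optional

# Tokens taken from the roster above; refresh regularly, the list changes.
DECLARED_BOT_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended",
    "Amazonbot",
    "meta-externalagent", "FacebookBot",
    "Applebot-Extended",
    "CCBot",
)


def match_declared_bot(user_agent: Optional[str]) -> Optional[str]:
    """Return the first declared-bot token found in the header, else None.

    This trusts a client-controlled header, so a match is a claim to be
    verified, and a miss is not proof of a human.
    """
    if not user_agent:
        return None
    lowered = user_agent.lower()
    for token in DECLARED_BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None
```

As far as we know, Google-Extended and Applebot-Extended are robots.txt control tokens rather than strings sent in request headers, so they may never match here; they are kept only so the list mirrors the roster.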

Verified-bot validation by reverse DNS

To trust a user-agent claim, verify it. Major crawlers publish IP ranges or support reverse-DNS validation, where you resolve the requesting IP back to the operator's domain and forward-resolve it again to confirm. This is how edge providers build their verified-bot allowlists. It separates a real Googlebot from something merely calling itself Googlebot, and it is the difference between a signal you can bill on and one you cannot.
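The round trip described above can be sketched with the Python standard library. This is an assumption-laden example, not a drop-in verifier: `verify_crawler_ip` is a hypothetical name, the allowed suffixes must come from each operator's own documentation, and `socket.gethostbyname_ex` only returns IPv4 addresses, so IPv6 would need `socket.getaddrinfo` instead. The resolvers are injectable so the logic can be tested without network access.

```python
import socket
from typing import Callable, List, Sequence


def _reverse(ip: str) -> str:
    # PTR lookup: IP -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    # A lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    Suffixes should start with a dot (for example ".googlebot.com") so that
    a lookalike such as "evilgooglebot.com" does not pass.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        # The hostname must resolve back to the same IP.
        return ip in forward(hostname)
    except OSError:
        return False
```

DNS lookups are slow relative to a request, so in practice the result would be cached per IP, or replaced with a check against the operator's published IP ranges.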

Edge and WAF bot management for the adversarial tail

For traffic that spoofs browsers and rotates IPs, behavioral bot management at the edge is the practical tool. Cloudflare and Vercel both maintain verified-bot allowlists and bot-score signals, and both let you exclude verified bots from evaluation while challenging or flagging the suspicious remainder. Cloudflare exposes a verified_bot boolean and a bot score; Vercel's bot management automatically excludes verified bots such as Google's crawler from evaluation. The value to billing is the label these systems attach to a request. If the edge has already decided a request is a bot, that verdict can ride along into your event stream and drive the exclusion.
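How the verdict reaches your application depends on the provider and on how you forward it, so the sketch below takes the two values as plain arguments instead of naming any header or field. It assumes a score scale where lower means more likely automated, and the cutoff of 30 is an illustrative default to tune, not a recommendation from either vendor. The function name and the returned labels are hypothetical.

```python
from typing import Optional, Tuple


def classify_from_edge(
    verified_bot: Optional[bool],
    bot_score: Optional[int],
    suspect_below: int = 30,  # assumed cutoff; lower score = more bot-like
) -> Tuple[str, str]:
    """Map an edge verdict to (traffic_type, deciding_signal).

    Missing values (None) mean the edge supplied no verdict, in which case
    the event falls through to the default label.
    """
    if verified_bot:
        return ("verified-bot", "edge verified-bot")
    if bot_score is not None and bot_score < suspect_below:
        return ("suspected-bot", "edge bot-score")
    return ("human", "no bot signal")
```

Returning the deciding signal alongside the label keeps the decision explainable later, when a customer or a finance team asks why an event was excluded.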

Where to exclude it: at the meter, not after the invoice

Detection produces a label. The architecture question is where that label gets acted on, and the answer is as early as possible. Excluding bot events at ingestion, before they ever become billable records, is cleaner than netting them out of an invoice after the fact. An invoice you correct is a dispute you already had; an event you never billed is a dispute that never happened. This is the same principle behind treating ingestion as the boundary where application events become billable records, covered in the companion piece on building a usage ingestion pipeline that does not lose revenue.

The pattern that holds up:

Tag every event with a classification at the source or the edge. Attach a traffic_type of human, verified-bot, or suspected-bot, plus the signal that decided it (matched user-agent, verified by rDNS, edge bot-score). Carry it on the event the same way you carry the idempotency identifier.
Filter before aggregation. The meter sums only events labeled billable. Non-billable events are not discarded; they are kept on a separate, non-billed stream so you retain a full record.
Keep the bot events for cost accounting. Those filtered events are exactly the data you need to understand your own infrastructure spend and to show a customer, with evidence, why their bill is lower than their raw request count.
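The three steps above can be sketched as a single pass over already-tagged events. Everything here is illustrative: `UsageEvent`, its field names, and `ingest` are hypothetical and do not describe the UsageBox API, and the in-memory set used for deduplication stands in for whatever idempotency store a real pipeline would use.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class UsageEvent:
    event_id: str       # idempotency identifier
    customer_id: str
    quantity: int
    traffic_type: str   # "human", "verified-bot", or "suspected-bot"
    decided_by: str     # e.g. "matched user-agent", "verified by rDNS"


BILLABLE_TYPES = frozenset({"human"})


def ingest(events: Iterable[UsageEvent]) -> Tuple[Dict[str, int], List[UsageEvent]]:
    """Split events into per-customer billable totals and a non-billed stream."""
    billable_totals: Dict[str, int] = defaultdict(int)
    non_billed: List[UsageEvent] = []
    seen = set()
    for event in events:
        if event.event_id in seen:  # drop duplicate deliveries
            continue
        seen.add(event.event_id)
        if event.traffic_type in BILLABLE_TYPES:
            billable_totals[event.customer_id] += event.quantity
        else:
            non_billed.append(event)  # kept, not discarded
    return dict(billable_totals), non_billed
```

The meter only ever receives the first return value; the second is the record you keep for cost accounting and for explaining the gap between raw and billed counts.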

UsageBox is built to be that filter layer. Because every event flows through ingestion before it is metered, you tag non-human traffic with a classification field and exclude it from the billable aggregate while keeping it queryable. The customer's invoice reflects human and authenticated usage; the bot stream stays visible for your own FinOps. You decide what counts, the meter never sees what should not be billed, and the audit trail explains the difference.

Make the exclusion visible to customers

The exclusion only builds trust if customers can see it. The fastest way to turn a "why is this bill so high" ticket into a non-event is to show usage broken out by traffic type: human, verified bot (excluded), suspected bot (excluded). A customer who can see that you filtered 11 million crawler requests out of their meter is a customer who trusts the number you did charge. This is the transparency lever that customer-facing usage visibility provides, and it is the same evidence base that resolves disputes when they do arise, as covered in handling billing disputes with audit trails. It is also the difference between the calm conversation and the furious one described in what to do when customers complain about unexpected bills.
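A minimal sketch of that breakdown, assuming events already carry one of the three traffic types. The function names and the row shape are invented for this example; a real dashboard would read from the metering store instead of an in-memory iterable.

```python
from typing import Dict, Iterable, List, Tuple

# Which traffic types end up on the invoice.
BILLED = {"human": True, "verified-bot": False, "suspected-bot": False}


def usage_breakdown(counts: Iterable[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Sum (traffic_type, quantity) pairs into one row per traffic type.

    An unknown traffic type raises KeyError, which is deliberate: an
    unlabeled event should fail loudly, not silently land in a bucket.
    """
    totals: Dict[str, int] = {traffic_type: 0 for traffic_type in BILLED}
    for traffic_type, quantity in counts:
        totals[traffic_type] += quantity
    return [
        {"traffic_type": t, "requests": n, "billed": BILLED[t]}
        for t, n in totals.items()
    ]


def format_breakdown(rows: List[Dict[str, object]]) -> str:
    """Render rows as fixed-width text lines for a usage page or email."""
    lines = []
    for row in rows:
        status = "billed" if row["billed"] else "excluded"
        lines.append(f"{row['traffic_type']:<14}{row['requests']:>12,}  {status}")
    return "\n".join(lines)
```

Showing the excluded rows next to the billed one is the point: the customer sees both the raw volume and the number that was actually charged.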

A short checklist

Step	Prevents
Written definition of billable usage based on human or authenticated intent	Ad hoc, inconsistent decisions about what to charge
Maintained user-agent list for declared AI crawlers (GPTBot, ClaudeBot, PerplexityBot, meta-externalagent, and more)	Billing honest, identifiable crawler volume
Reverse-DNS or IP-range verification on user-agent claims	Trusting a spoofed Googlebot
Edge or WAF bot-score label carried onto the event	Missing browser-spoofing scrapers that ignore robots.txt
Exclusion applied at ingestion, before aggregation	Invoice corrections and disputes after the fact
Bot events retained on a non-billed stream	Losing the evidence for cost accounting and customer trust
Traffic-type breakdown shown to customers	"Why is this so high" tickets on inflated raw counts

None of this stops crawlers from hitting your origin; that is a separate infrastructure fight. What it does is guarantee that non-human traffic never lands on a customer's invoice and never gets mistaken for revenue in your own numbers. For the rest of the metering integrity story, see how to count each event exactly once in idempotent usage metering, and how to enforce limits on the traffic that is billable in usage API enforcement.

Key Topics

•bot traffic
•AI crawlers
•usage metering
•billable usage
•metering integrity
•GPTBot
•robots.txt
•usage-based billing
