Bot Traffic

Identifying and filtering automated visitors to protect analytics data quality

The Bot Problem

Not all HTTP requests come from humans. A significant percentage of web traffic is automated — bots, crawlers, scrapers, and scripts. For analytics, this is a fundamental data quality problem: if you can't distinguish humans from bots, your metrics are meaningless.

Types of Bots

| Bot Type | Purpose | Example | Analytics Impact |
| --- | --- | --- | --- |
| Search engine crawlers | Index content for search | Googlebot, Bingbot | Inflate page views |
| SEO/monitoring bots | Check uptime, rankings, links | Ahrefs, Screaming Frog | Inflate page views |
| Social media bots | Generate previews/cards | Twitterbot, facebookexternalhit | Inflate page views, distort referrers |
| RSS/feed readers | Pull content updates | Feedly, Inoreader | Inflate page views |
| AI training crawlers | Scrape content for LLM training | GPTBot, ClaudeBot, Bytespider | High volume, distort all metrics |
| Scraper bots | Extract data, prices, content | Custom scripts | Inflate views, may stress server |
| Spam bots | Submit forms, post comments | Various | Corrupt form/conversion data |
| DDoS / attack bots | Overwhelm server | Botnets | Massive metric distortion |
| Click fraud bots | Fake ad clicks | Botnets | Inflate click/conversion metrics |
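
In practice, this classification is usually encoded as a lookup table of User-Agent substrings. Below is a minimal sketch, assuming a hypothetical BOT_SIGNATURES mapping built from the example bots in the table; real signature lists are far larger, and the exact substrings vendors use may differ.

```python
# Hypothetical mapping of User-Agent substrings to bot categories,
# built from the example bots in the table above.
BOT_SIGNATURES = {
    "Googlebot": "search engine crawler",
    "Bingbot": "search engine crawler",
    "AhrefsBot": "SEO/monitoring bot",
    "Screaming Frog": "SEO/monitoring bot",
    "Twitterbot": "social media bot",
    "facebookexternalhit": "social media bot",
    "Feedly": "RSS/feed reader",
    "GPTBot": "AI training crawler",
    "ClaudeBot": "AI training crawler",
    "Bytespider": "AI training crawler",
}

def classify_bot(user_agent: str) -> str | None:
    """Return a bot category if the User-Agent matches a known signature, else None."""
    ua = user_agent.lower()
    for substring, category in BOT_SIGNATURES.items():
        if substring.lower() in ua:
            return category
    return None  # unknown: either a human or a bot that hides itself

# Googlebot identifies itself, so a substring match is enough.
print(classify_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
# -> search engine crawler
```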

Scale of the Problem

Industry estimates put automated traffic at 30–50% of all web requests. For a new site with little organic traffic, the bot percentage can be much higher — sometimes the majority of your "visitors" are bots. If you are building a course project, you will see this firsthand: your analytics data will contain bot visits, and you need to account for them.

Good Bots vs. Bad Bots

  • Good bots: Search crawlers (you want Google to index you), uptime monitors, feed readers. They typically identify themselves via User-Agent and respect robots.txt.
  • Bad bots: Scrapers, spammers, attack bots, click fraud. They disguise themselves, ignore robots.txt, and consume resources without providing value.
  • Gray area: AI training crawlers — beneficial for visibility, or content theft? This is an active debate. Many sites now block GPTBot and similar crawlers via robots.txt.
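
To make the robots.txt approach concrete, here is a small sketch using Python's standard urllib.robotparser. The two Disallow entries are illustrative rather than a recommendation; the point is that only well-behaved crawlers ever consult them.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that disallows two AI training crawlers site-wide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks before fetching; a bad bot simply ignores this file.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True (no rule applies)
```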

Bot Detection Methods

  1. User-Agent filtering — Most good bots identify themselves ("Googlebot", "Bingbot"). But User-Agent is trivially spoofable — bad bots often claim to be Chrome.
  2. robots.txt — A convention, not enforcement. Good bots respect it; bad bots ignore it. Not a detection method per se, but a filtering signal.
  3. Rate limiting — Humans don't request 100 pages per second. Abnormal request rates signal automation (see the sliding-window sketch after this list).
  4. Behavioral analysis — Bots don't scroll, don't move the mouse, don't hesitate before clicking. Client-side analytics can detect the absence of human behavior patterns.
  5. JavaScript execution — Many simple bots don't execute JavaScript. If your client-side beacon never fires but the server log shows a request, it's likely a bot.
  6. CAPTCHAs — Force proof of humanity. Effective but degrades UX. Use sparingly.
  7. IP reputation lists — Known bot networks and data center IPs. Services like Project Honeypot maintain lists.
  8. Honeypot fields — Hidden form fields that humans never fill out but bots do (see the sketch after this list).
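
To make the rate-limiting idea concrete, here is a minimal sliding-window sketch keyed by client IP; the window size and threshold are illustrative numbers, not recommendations.

```python
import time
from collections import defaultdict, deque

# Illustrative limits: at most 100 requests per 10-second window per IP.
WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 100

_request_times: dict[str, deque] = defaultdict(deque)

def is_rate_limited(client_ip: str, now: float | None = None) -> bool:
    """Record a request and return True if this IP exceeded its budget for the window."""
    now = time.monotonic() if now is None else now
    times = _request_times[client_ip]
    times.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while times and now - times[0] > WINDOW_SECONDS:
        times.popleft()
    return len(times) > MAX_REQUESTS_PER_WINDOW
```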
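
And here is a sketch of the honeypot check, assuming a hypothetical hidden form field named website that is invisible to humans (hidden with CSS) but auto-filled by naive bots.

```python
def looks_like_spam_bot(form_data: dict[str, str]) -> bool:
    """Humans never see the hidden 'website' field, so any value there is a bot signal."""
    return bool(form_data.get("website", "").strip())

# A bot that auto-fills every field trips the honeypot; a human submission does not.
print(looks_like_spam_bot({"name": "Ada", "comment": "Hi!", "website": "http://spam.example"}))  # True
print(looks_like_spam_bot({"name": "Ada", "comment": "Hi!", "website": ""}))                     # False
```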

Combining these signals gives a layered detection flow (a code sketch follows the diagram):

Request arrives
    │
    ├── Check IP reputation ──▶ Known bot IP? ──▶ Flag/block
    │
    ├── Check User-Agent ────▶ Known bot UA? ──▶ Flag (good bot) or block (bad bot)
    │
    ├── Check rate ──────────▶ Abnormal frequency? ──▶ Rate limit / CAPTCHA
    │
    ├── Check JS execution ──▶ Beacon fired? ──▶ No = likely bot
    │
    └── Check behavior ──────▶ Mouse movement? Scroll? ──▶ No = likely bot
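
The flow above can be approximated in code. The sketch below assumes each request has already been joined with its client-side beacon data into a hypothetical RequestInfo record; the IP list, signature list, and threshold are placeholders.

```python
from dataclasses import dataclass

# Placeholder signals; a real deployment would load these from reputation feeds.
KNOWN_BOT_IPS = {"203.0.113.7"}
KNOWN_BOT_UA_SUBSTRINGS = ("googlebot", "bingbot", "gptbot", "claudebot", "bytespider")

@dataclass
class RequestInfo:
    ip: str
    user_agent: str
    requests_last_minute: int
    beacon_fired: bool        # did the client-side JavaScript beacon report this view?
    scrolled_or_moved: bool   # were any scroll or mouse events observed?

def classify(req: RequestInfo) -> str:
    """Walk the layered checks from the diagram above and return a verdict."""
    if req.ip in KNOWN_BOT_IPS:
        return "bot (IP reputation)"
    ua = req.user_agent.lower()
    if any(sig in ua for sig in KNOWN_BOT_UA_SUBSTRINGS):
        return "bot (self-identified User-Agent)"
    if req.requests_last_minute > 100:  # illustrative threshold
        return "suspicious (abnormal request rate)"
    if not req.beacon_fired:
        return "likely bot (no JavaScript execution)"
    if not req.scrolled_or_moved:
        return "likely bot (no human behavior signals)"
    return "likely human"

print(classify(RequestInfo("198.51.100.9", "Mozilla/5.0", 3,
                           beacon_fired=False, scrolled_or_moved=False)))
# -> likely bot (no JavaScript execution)
```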

Impact on Analytics

If you don't filter bots, your analytics data is compromised in multiple ways:

  • Page view counts are inflated — bots hit pages humans never visit
  • Bounce rate is distorted — bots hit one page and leave
  • Geographic data is skewed — bots run from data centers, not homes
  • Time-on-page is wrong — bots don't "spend time" on pages
  • Conversion funnels break — bots inflate the top of the funnel but never convert

For the course project: you must consider bot traffic when interpreting your analytics data. If your "busiest page" has hundreds of views but zero scroll events, those are not real users.
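
One way to put that into practice, sketched under the assumption that you can pull per-page counts of page views and scroll events out of your analytics store (the page paths and numbers below are made up):

```python
# Placeholder per-page counts; in practice these come from your analytics store.
page_views = {"/": 412, "/pricing": 380, "/blog/launch": 95}
scroll_events = {"/": 120, "/pricing": 0, "/blog/launch": 61}

def suspicious_pages(views: dict[str, int], scrolls: dict[str, int],
                     min_views: int = 50, max_scroll_ratio: float = 0.05) -> list[str]:
    """Flag pages with many views but almost no scroll events: a strong bot signal."""
    flagged = []
    for page, view_count in views.items():
        if view_count < min_views:
            continue  # too little data to judge
        if scrolls.get(page, 0) / view_count <= max_scroll_ratio:
            flagged.append(page)
    return flagged

print(suspicious_pages(page_views, scroll_events))  # ['/pricing']
```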