The Bot Problem
Not all HTTP requests come from humans. A significant percentage of web traffic is automated — bots, crawlers, scrapers, and scripts. For analytics, this is a fundamental data quality problem: if you can't distinguish humans from bots, your metrics are meaningless.
Types of Bots
| Bot Type | Purpose | Example | Analytics Impact |
|---|---|---|---|
| Search engine crawlers | Index content for search | Googlebot, Bingbot | Inflate page views |
| SEO/monitoring bots | Check uptime, rankings, links | Ahrefs, Screaming Frog | Inflate page views |
| Social media bots | Generate previews/cards | Twitterbot, facebookexternalhit | Inflate page views, distort referrers |
| RSS/feed readers | Pull content updates | Feedly, Inoreader | Inflate page views |
| AI training crawlers | Scrape content for LLM training | GPTBot, ClaudeBot, Bytespider | High volume, distort all metrics |
| Scraper bots | Extract data, prices, content | Custom scripts | Inflate views, may stress server |
| Spam bots | Submit forms, post comments | Various | Corrupt form/conversion data |
| DDoS / attack bots | Overwhelm server | Botnets | Massive metric distortion |
| Click fraud bots | Fake ad clicks | Botnets | Inflate click/conversion metrics |
Scale of the Problem
Industry estimates put automated traffic at 30–50% of all web requests. For a new site with little organic traffic, the bot percentage can be much higher — sometimes the majority of your "visitors" are bots. Students building course projects will see this firsthand: your analytics data will contain bot visits, and you need to account for them.
Good Bots vs. Bad Bots
- Good bots: Search crawlers (you want Google to index you), uptime monitors, feed readers. They typically identify themselves via User-Agent and respect robots.txt.
- Bad bots: Scrapers, spammers, attack bots, click fraud. They disguise themselves, ignore robots.txt, and consume resources without providing value.
- Gray area: AI training crawlers — beneficial for visibility, or content theft? This is an active debate. Many sites now block GPTBot and similar crawlers via robots.txt (an example follows this list).
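For illustration, a robots.txt that welcomes a search crawler but opts out of the AI training crawlers named above might look like the following. Remember that compliance is voluntary; this is a request, not enforcement:

```
# Allow search engine crawlers
User-agent: Googlebot
Allow: /

# Opt out of AI training crawlers (honored only by bots that choose to comply)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```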
Bot Detection Methods
- User-Agent filtering — Most good bots identify themselves ("Googlebot", "Bingbot"). But the User-Agent header is trivially spoofable; bad bots often claim to be Chrome. (See the first sketch after this list.)
- robots.txt — A convention, not enforcement. Good bots respect it; bad bots ignore it. Not a detection method per se, but a filtering signal.
- Rate limiting — Humans don't request 100 pages per second. Abnormal request rates signal automation. (See the sliding-window sketch below.)
- Behavioral analysis — Bots don't scroll, don't move the mouse, and don't hesitate before clicking. Client-side analytics can detect the absence of human behavior patterns.
- JavaScript execution — Many simple bots don't execute JavaScript. If your client-side beacon never fires but the server log shows a request, it's likely a bot. (See the set-difference sketch below.)
- CAPTCHAs — Force proof of humanity. Effective, but they degrade UX; use sparingly.
- IP reputation lists — Known bot networks and data-center IPs can be matched against published lists; services like Project Honeypot maintain them.
- Honeypot fields — Hidden form fields that humans never see but bots dutifully fill in. (See the last sketch below.)
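A minimal User-Agent filter can be a substring check. In this sketch the pattern list and the `looks_like_bot` helper are illustrative assumptions, not a maintained blocklist:

```python
# Minimal User-Agent classifier. The substring list is illustrative,
# not exhaustive -- real deployments use maintained, regularly updated lists.
KNOWN_BOT_SUBSTRINGS = [
    "googlebot", "bingbot", "gptbot", "claudebot", "bytespider",
    "ahrefsbot", "facebookexternalhit", "twitterbot", "feedly",
    "bot", "crawler", "spider",  # generic catch-alls, prone to false positives
]

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known bot pattern.

    This only catches bots that identify themselves; a spoofed UA
    claiming to be Chrome sails straight through.
    """
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_SUBSTRINGS)

print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0) Chrome/120"))  # False
```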
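Rate limiting can start as a per-IP sliding window. The thresholds here (20 requests per 10 seconds) and the `is_rate_abnormal` helper are arbitrary assumptions for illustration:

```python
import time
from collections import defaultdict, deque

# Flag an IP that exceeds MAX_REQUESTS within WINDOW_SECONDS.
# Thresholds are made up; tune them against your own traffic.
WINDOW_SECONDS = 10
MAX_REQUESTS = 20

_request_times: dict[str, deque] = defaultdict(deque)

def is_rate_abnormal(ip: str, now: float | None = None) -> bool:
    """Record one request from `ip` and report whether its recent
    request rate looks automated."""
    now = time.monotonic() if now is None else now
    times = _request_times[ip]
    times.append(now)
    # Drop timestamps that have fallen out of the window.
    while times and now - times[0] > WINDOW_SECONDS:
        times.popleft()
    return len(times) > MAX_REQUESTS
```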
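The JavaScript-execution check reduces to a set difference, assuming each page view carries a request ID that is logged both server-side and in the beacon payload (the IDs below are invented):

```python
def probable_bot_views(server_log_ids: set[str], beacon_ids: set[str]) -> set[str]:
    """Page views seen by the server but never confirmed by the
    client-side beacon -- likely bots that don't execute JavaScript.
    (Also catches humans with JS disabled or blocked beacons, so
    treat this as a signal, not proof.)"""
    return server_log_ids - beacon_ids

server_side = {"req-001", "req-002", "req-003", "req-004"}
beacon_side = {"req-002", "req-004"}
print(probable_bot_views(server_side, beacon_side))  # {'req-001', 'req-003'}
```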
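A honeypot check is a one-liner on the server, assuming the form includes a hidden field (named `website` here, an arbitrary choice) that CSS hides from humans:

```python
def is_spam_submission(form_data: dict[str, str]) -> bool:
    """Reject submissions where the hidden honeypot field was filled.
    Humans never see the field; naive bots fill every input they find."""
    return bool(form_data.get("website", "").strip())
```

Combined, these signals form the layered triage pipeline sketched below: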
```
Request arrives
│
├── Check IP reputation ──▶ Known bot IP? ──▶ Flag/block
│
├── Check User-Agent ────▶ Known bot UA? ──▶ Flag (good bot) or block (bad bot)
│
├── Check rate ──────────▶ Abnormal frequency? ──▶ Rate limit / CAPTCHA
│
├── Check JS execution ──▶ Beacon fired? ──▶ No = likely bot
│
└── Check behavior ──────▶ Mouse movement? Scroll? ──▶ No = likely bot
```
Impact on Analytics
If you don't filter bots, your analytics data is compromised in multiple ways:
- Page view counts are inflated — bots hit pages humans never visit
- Bounce rate is distorted — bots hit one page and leave
- Geographic data is skewed — bots run from data centers, not homes
- Time-on-page is wrong — bots don't "spend time" on pages
- Conversion funnels break — bots inflate the top of the funnel but never convert
For the course project: you must consider bot traffic when interpreting your analytics data. If your "busiest page" has hundreds of views but zero scroll events, those are not real users.
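One way to operationalize that sanity check, assuming view counts come from server logs and scroll counts from client-side events (all page paths and numbers below are made up):

```python
from collections import Counter

def suspicious_pages(page_views: Counter, scroll_events: Counter,
                     min_views: int = 100) -> list[str]:
    """Pages with many server-side views but zero client-side scroll
    events -- a strong hint the 'visitors' are bots."""
    return [page for page, views in page_views.items()
            if views >= min_views and scroll_events[page] == 0]

views = Counter({"/pricing": 340, "/blog/launch": 95, "/wp-login.php": 812})
scrolls = Counter({"/pricing": 180, "/blog/launch": 40})
print(suspicious_pages(views, scrolls))  # ['/wp-login.php']
```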