The Bot Problem
Not all HTTP requests come from humans. A significant percentage of web traffic is automated — bots, crawlers, scrapers, and scripts. For analytics, this is a fundamental data quality problem: if you can't distinguish humans from bots, your metrics are meaningless.
Types of Bots
| Bot Type | Purpose | Example | Analytics Impact |
|---|---|---|---|
| Search engine crawlers | Index content for search | Googlebot, Bingbot | Inflate page views |
| SEO/monitoring bots | Check uptime, rankings, links | Ahrefs, Screaming Frog | Inflate page views |
| Social media bots | Generate previews/cards | Twitterbot, facebookexternalhit | Inflate page views, distort referrers |
| RSS/feed readers | Pull content updates | Feedly, Inoreader | Inflate page views |
| AI training crawlers | Scrape content for LLM training | GPTBot, ClaudeBot, Bytespider | High volume, distort all metrics |
| Scraper bots | Extract data, prices, content | Custom scripts | Inflate views, may stress server |
| Spam bots | Submit forms, post comments | Various | Corrupt form/conversion data |
| DDoS / attack bots | Overwhelm server | Botnets | Massive metric distortion |
| Click fraud bots | Fake ad clicks | Botnets | Inflate click/conversion metrics |
Scale of the Problem
Industry estimates put automated traffic at 30–50% of all web requests. For a new site with little organic traffic, the bot percentage can be much higher — sometimes the majority of your "visitors" are bots. Students building course projects will see this firsthand: your analytics data will contain bot visits, and you need to account for them.
Good Bots vs. Bad Bots
- Good bots: Search crawlers (you want Google to index you), uptime monitors, feed readers. They typically identify themselves via User-Agent and respect robots.txt.
- Bad bots: Scrapers, spammers, attack bots, click fraud. They disguise themselves, ignore robots.txt, and consume resources without providing value.
- Gray area: AI training crawlers — beneficial for visibility, or content theft? This is an active debate. Many sites now block GPTBot and similar crawlers via robots.txt (an example follows this list).
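For illustration, a robots.txt that welcomes a search crawler but opts out of the AI training crawlers named above might look like the following. Remember that compliance is voluntary; this is a request, not enforcement:

```
# Allow search engine crawlers
User-agent: Googlebot
Allow: /

# Opt out of AI training crawlers (honored only by bots that choose to comply)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```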
Bot Detection Methods
- User-Agent filtering — Most good bots identify themselves ("Googlebot", "Bingbot"). But the User-Agent header is trivially spoofable; bad bots often claim to be Chrome. (See the first sketch after this list.)
- robots.txt — A convention, not enforcement. Good bots respect it; bad bots ignore it. Not a detection method per se, but a filtering signal.
- Rate limiting — Humans don't request 100 pages per second. Abnormal request rates signal automation. (See the sliding-window sketch below.)
- Behavioral analysis — Bots don't scroll, don't move the mouse, and don't hesitate before clicking. Client-side analytics can detect the absence of human behavior patterns.
- JavaScript execution — Many simple bots don't execute JavaScript. If your client-side beacon never fires but the server log shows a request, it's likely a bot. (See the set-difference sketch below.)
- CAPTCHAs — Force proof of humanity. Effective, but they degrade UX; use sparingly.
- IP reputation lists — Known bot networks and data-center IPs can be matched against published lists; services like Project Honeypot maintain them.
- Honeypot fields — Hidden form fields that humans never see but bots dutifully fill in. (See the last sketch below.)
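A minimal User-Agent filter can be a substring check. In this sketch the pattern list and the `looks_like_bot` helper are illustrative assumptions, not a maintained blocklist:

```python
# Minimal User-Agent classifier. The substring list is illustrative,
# not exhaustive -- real deployments use maintained, regularly updated lists.
KNOWN_BOT_SUBSTRINGS = [
    "googlebot", "bingbot", "gptbot", "claudebot", "bytespider",
    "ahrefsbot", "facebookexternalhit", "twitterbot", "feedly",
    "bot", "crawler", "spider",  # generic catch-alls, prone to false positives
]

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known bot pattern.

    This only catches bots that identify themselves; a spoofed UA
    claiming to be Chrome sails straight through.
    """
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_SUBSTRINGS)

print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0) Chrome/120"))  # False
```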
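Rate limiting can start as a per-IP sliding window. The thresholds here (20 requests per 10 seconds) and the `is_rate_abnormal` helper are arbitrary assumptions for illustration:

```python
import time
from collections import defaultdict, deque

# Flag an IP that exceeds MAX_REQUESTS within WINDOW_SECONDS.
# Thresholds are made up; tune them against your own traffic.
WINDOW_SECONDS = 10
MAX_REQUESTS = 20

_request_times: dict[str, deque] = defaultdict(deque)

def is_rate_abnormal(ip: str, now: float | None = None) -> bool:
    """Record one request from `ip` and report whether its recent
    request rate looks automated."""
    now = time.monotonic() if now is None else now
    times = _request_times[ip]
    times.append(now)
    # Drop timestamps that have fallen out of the window.
    while times and now - times[0] > WINDOW_SECONDS:
        times.popleft()
    return len(times) > MAX_REQUESTS
```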
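The JavaScript-execution check reduces to a set difference, assuming each page view carries a request ID that is logged both server-side and in the beacon payload (the IDs below are invented):

```python
def probable_bot_views(server_log_ids: set[str], beacon_ids: set[str]) -> set[str]:
    """Page views seen by the server but never confirmed by the
    client-side beacon -- likely bots that don't execute JavaScript.
    (Also catches humans with JS disabled or blocked beacons, so
    treat this as a signal, not proof.)"""
    return server_log_ids - beacon_ids

server_side = {"req-001", "req-002", "req-003", "req-004"}
beacon_side = {"req-002", "req-004"}
print(probable_bot_views(server_side, beacon_side))  # {'req-001', 'req-003'}
```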
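A honeypot check is a one-liner on the server, assuming the form includes a hidden field (named `website` here, an arbitrary choice) that CSS hides from humans:

```python
def is_spam_submission(form_data: dict[str, str]) -> bool:
    """Reject submissions where the hidden honeypot field was filled.
    Humans never see the field; naive bots fill every input they find."""
    return bool(form_data.get("website", "").strip())
```

Combined, these signals form the layered triage pipeline sketched below: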
```
Request arrives
│
├── Check IP reputation ──▶ Known bot IP? ──▶ Flag/block
│
├── Check User-Agent ────▶ Known bot UA? ──▶ Flag (good bot) or block (bad bot)
│
├── Check rate ──────────▶ Abnormal frequency? ──▶ Rate limit / CAPTCHA
│
├── Check JS execution ──▶ Beacon fired? ──▶ No = likely bot
│
└── Check behavior ──────▶ Mouse movement? Scroll? ──▶ No = likely bot
```
Impact on Analytics
If you don't filter bots, your analytics data is compromised in multiple ways:
- Page view counts are inflated — bots hit pages humans never visit
- Bounce rate is distorted — bots hit one page and leave
- Geographic data is skewed — bots run from data centers, not homes
- Time-on-page is wrong — bots don't "spend time" on pages
- Conversion funnels break — bots inflate the top of the funnel but never convert
For the course project: you must consider bot traffic when interpreting your analytics data. If your "busiest page" has hundreds of views but zero scroll events, those are not real users.
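One way to operationalize that sanity check, assuming view counts come from server logs and scroll counts from client-side events (all page paths and numbers below are made up):

```python
from collections import Counter

def suspicious_pages(page_views: Counter, scroll_events: Counter,
                     min_views: int = 100) -> list[str]:
    """Pages with many server-side views but zero client-side scroll
    events -- a strong hint the 'visitors' are bots."""
    return [page for page, views in page_views.items()
            if views >= min_views and scroll_events[page] == 0]

views = Counter({"/pricing": 340, "/blog/launch": 95, "/wp-login.php": 812})
scrolls = Counter({"/pricing": 180, "/blog/launch": 40})
print(suspicious_pages(views, scrolls))  # ['/wp-login.php']
```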