Threats to Data Quality
No analytics system gives you perfectly accurate data. Every collection method has blind spots, and multiple factors corrupt or reduce the data you receive:
| Threat | Impact | Mitigation |
|---|---|---|
| Bots and crawlers | Inflate page views, distort behavior metrics | Filter by User-Agent, use CAPTCHAs, analyze behavior patterns |
| Ad blockers | Block third-party analytics scripts entirely | Use first-party collection (same domain), server-side logging |
| Cookie clearing | Breaks session continuity — returning users appear as new | Accept approximation; use server-side session tracking |
| Browser caching | Cached pages generate no server request | Client-side scripts still fire; combine methods |
| CDN caching | CDN-served pages never reach your server logs | CDN analytics APIs; client-side collection |
| Device switching | Same user on phone and laptop appears as two users | Login-based identity; accept approximation |
| VPN / Proxy | IP-based geolocation becomes inaccurate | Use Accept-Language, timezone from JS for location hints |
| JavaScript disabled | Client-side analytics fails completely | Server logs as fallback; `<noscript>` pixel tracking |
| Incognito / private mode | No persistent cookies; every visit looks new | Accept approximation; focus on session-level data |
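The "Filter by User-Agent" mitigation in the table can be sketched as follows. This is a minimal illustration, not a complete bot filter: the marker list, the function names, and the `user_agent` dictionary key are all assumptions made for the example. It only catches bots that identify themselves honestly, which is why the table pairs it with behavior analysis.

```python
import re
from typing import Optional

# Hypothetical marker list for illustration; a real deployment would
# maintain a much longer, regularly updated list.
BOT_MARKERS = re.compile(
    r"bot|crawl|spider|slurp|headless|curl|wget|python-requests",
    re.IGNORECASE,
)


def looks_like_bot(user_agent: Optional[str]) -> bool:
    """Return True when the User-Agent is missing or contains a bot marker."""
    if not user_agent or not user_agent.strip():
        # Real browsers practically always send a User-Agent header.
        return True
    return bool(BOT_MARKERS.search(user_agent))


def filter_human_hits(hits):
    """Keep only hits whose User-Agent does not look automated.

    Each hit is assumed to be a dict with a 'user_agent' key.
    """
    return [hit for hit in hits if not looks_like_bot(hit.get("user_agent"))]
```

A bot that sends a copied browser User-Agent passes this check untouched, so treat the result as a lower bound on automated traffic.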
The Client-Server Gap
The fundamental challenge is the client-server gap. The JavaScript analytics code may never load (blocked, disabled, or cut off by a slow connection), so client-side analytics undercounts requests that server logs record. Server logs, in turn, miss page views served from a browser or CDN cache, which a client-side script can still report. Neither method captures everything.
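One way to see the gap in practice is to count page views per URL from both sources and compare them. The sketch below assumes you have already extracted two plain lists of URLs, one from server logs and one from client-side beacons; the function name and the report field names are hypothetical.

```python
from collections import Counter


def collection_gap(server_urls, beacon_urls):
    """Compare per-URL hit counts from server logs and client-side beacons.

    Both arguments are assumed to be iterables of URL strings, one entry
    per recorded page view.
    """
    server = Counter(server_urls)
    beacon = Counter(beacon_urls)
    report = {}
    for url in sorted(set(server) | set(beacon)):
        s, b = server[url], beacon[url]
        report[url] = {
            "server_hits": s,
            "beacon_hits": b,
            # Net surplus of server requests: script blocked, disabled, or a bot.
            "server_only": max(s - b, 0),
            # Net surplus of beacons: page likely served from a cache.
            "beacon_only": max(b - s, 0),
        }
    return report
```

The surpluses are net differences per URL, not matched requests, so a URL with both cached views and blocked scripts will understate each effect. Matching individual views would need a shared request identifier in both data sets.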
State management directly affects data quality: if cookies are cleared, sessions break. If incognito mode is used, there is no persistence between visits. If third-party cookies are blocked (as they increasingly are), cross-domain tracking fails.
Analytics Summary
| Concept | Key Takeaway |
|---|---|
| Why analytics | Find errors users never report, understand behavior, measure what works |
| Analytics stack | Collector → Storage → Reporting — you build all three in this course |
| Collection methods | Server logs (automatic), network capture (obsolete for HTTPS), client-side scripts (flexible) |
| What to collect | Headers and URLs automatically; JS adds clicks, scrolls, errors, timing |
| Enriching server logs | Extend log formats with custom headers, Client Hints, and script-set cookies to upgrade logs from basic hit counts to rich analytics data |
| 1st vs 3rd party | First-party = you own the data; third-party = vendor owns it and builds cross-site profiles |
| Browser fingerprinting | Combines browser/device attributes to identify users without cookies — effective but ethically fraught and regulated by GDPR |
| Broad vs targeted | Collect what you need, not what you might need — GDPR requires purpose |
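| Privacy & consent | GDPR, ePrivacy, and CCPA govern analytics collection. Under GDPR and ePrivacy, consent is generally required for non-essential cookies and fingerprinting, while CCPA works on an opt-out basis; privacy-preserving tools that avoid both techniques largely sidestep the problem |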
| Error tracking | Catch silent failures with window.onerror and sendBeacon() |
| Performance / Web Vitals | LCP, INP, CLS measure real user experience. Collect via the PerformanceObserver API and sendBeacon(). Google uses these as ranking signals. |
| User behavior | Funnel analysis reveals where users drop off; absence of action is data |
| Session replay | DOM-reconstruction replay, not video — powerful but privacy-sensitive |
| Observability | Logs + Metrics + Traces; OpenTelemetry is the vendor-neutral standard |
| Bot traffic | 30–50% of web traffic is automated — filter bots or your metrics are meaningless; compare server logs to client-side beacons as a first-pass filter |
| Data quality | Every method has blind spots — combine methods and accept approximation |
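The first-pass bot filter named in the "Bot traffic" row (compare server logs to client-side beacons) might look like the sketch below. The client identifier (for example a hash of IP address and User-Agent), the `min_hits` threshold, and the function name are assumptions for illustration, not part of any particular tool.

```python
from collections import Counter


def flag_suspected_bots(server_hits, beacon_client_ids, min_hits=5):
    """Flag clients that request many pages but never send a beacon.

    server_hits: iterable of (client_id, url) pairs taken from server logs.
    beacon_client_ids: iterable of client ids seen in client-side beacons.
    min_hits: hypothetical threshold so one-off visitors are not flagged.
    """
    hits_per_client = Counter(client_id for client_id, _url in server_hits)
    seen_in_beacons = set(beacon_client_ids)
    return sorted(
        client_id
        for client_id, count in hits_per_client.items()
        if count >= min_hits and client_id not in seen_in_beacons
    )
```

This also flags real users who block scripts or disable JavaScript, so the output is a list of candidates for closer inspection rather than a verdict.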