Logging and Monitoring

Understanding what your server is doing and when things go wrong

Why Logging Matters

Web server logs are your primary window into server behavior. Without proper logging, debugging becomes guesswork. With good logging, you can answer questions like: Which endpoints are slowest? Which clients are generating errors? What happened right before the crash?

Web servers typically maintain two types of logs:

| Log Type   | Purpose                                | When to Check                                |
| ---------- | -------------------------------------- | -------------------------------------------- |
| Access Log | Records every request: who, what, when | Traffic analysis, debugging, security audits |
| Error Log  | Records problems and failures          | When something breaks, startup issues        |

Anatomy of an Access Log

Here's a typical access log entry. Let's break it down:

192.168.1.100 - john [15/Nov/2024:14:32:18 +0000] "GET /products/widget HTTP/1.1" 200 4523 "https://example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
| Field      | Value                             | Meaning                        |
| ---------- | --------------------------------- | ------------------------------ |
| Client IP  | 192.168.1.100                     | Who made the request           |
| Identity   | -                                 | RFC 1413 ident (rarely used)   |
| User       | john                              | Authenticated username (or -)  |
| Timestamp  | [15/Nov/2024:14:32:18 +0000]      | When the request completed     |
| Request    | "GET /products/widget HTTP/1.1"   | Method, path, protocol         |
| Status     | 200                               | HTTP response code             |
| Size       | 4523                              | Response body size in bytes    |
| Referer    | "https://example.com/"            | Page that linked here          |
| User-Agent | "Mozilla/5.0..."                  | Client software identification |
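Because every field has a fixed position, a small parser can turn a raw line into a structured object. A minimal sketch, assuming the combined format exactly as shown above (`parseAccessLine` is an illustrative name):

```javascript
// Sketch: split a Combined Log Format line into named fields.
// The regex mirrors the field table above.
const COMBINED = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

function parseAccessLine(line) {
    const m = COMBINED.exec(line);
    if (m === null) return null; // line is not in combined format
    const [, ip, ident, user, time, request, status, size, referer, userAgent] = m;
    const [method, path, protocol] = request.split(' ');
    return {
        ip, ident, user, time, method, path, protocol,
        status: Number(status),
        bytes: size === '-' ? 0 : Number(size), // "-" means no body was sent
        referer, userAgent
    };
}
```

Running it on the sample entry above yields an object with `status: 200` and `bytes: 4523`.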

Standard Log Formats

The web has standardized log formats that tools understand universally:

Common Log Format (CLF)

The original Apache format—minimal but universal:

%h %l %u %t "%r" %>s %b
192.168.1.100 - - [15/Nov/2024:14:32:18 +0000] "GET /page HTTP/1.1" 200 4523

Combined Log Format

CLF plus referer and user-agent—the most common format:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"

JSON Log Format

Modern systems prefer structured JSON for easier parsing:

{
  "timestamp": "2024-11-15T14:32:18.123Z",
  "client_ip": "192.168.1.100",
  "method": "GET",
  "path": "/products/widget",
  "status": 200,
  "bytes": 4523,
  "duration_ms": 45,
  "referer": "https://example.com/",
  "user_agent": "Mozilla/5.0..."
}

Why JSON?

JSON logs are easier to parse programmatically, can include arbitrary fields, and integrate better with modern log aggregation tools like Elasticsearch, Loki, and Datadog.
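The difference is easy to see in code. A quick sketch of filtering structured logs, assuming one JSON object per line (NDJSON) with the fields shown above; `findServerErrors` is a hypothetical helper:

```javascript
// Sketch: with one JSON object per line, filtering is just parse + predicate,
// with no format-specific regex.
function findServerErrors(ndjson) {
    return ndjson
        .split('\n')
        .filter(Boolean)                        // drop blank lines
        .map((line) => JSON.parse(line))
        .filter((entry) => entry.status >= 500);
}

const sample = [
    '{"timestamp":"2024-11-15T14:32:18.123Z","path":"/ok","status":200}',
    '{"timestamp":"2024-11-15T14:32:19.001Z","path":"/boom","status":502}'
].join('\n');
// findServerErrors(sample) keeps only the 502 entry
```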

Configuring Log Formats

Nginx

# Define custom log format
log_format combined_extra '$remote_addr - $remote_user [$time_local] '
                          '"$request" $status $body_bytes_sent '
                          '"$http_referer" "$http_user_agent" '
                          'rt=$request_time uct=$upstream_connect_time '
                          'uht=$upstream_header_time urt=$upstream_response_time';

# JSON format
log_format json_logs escape=json '{'
    '"timestamp":"$time_iso8601",'
    '"client_ip":"$remote_addr",'
    '"method":"$request_method",'
    '"path":"$uri",'
    '"query":"$args",'
    '"status":$status,'
    '"bytes":$body_bytes_sent,'
    '"duration":$request_time,'
    '"upstream_time":"$upstream_response_time",'  # quoted: may be "-" or empty for non-proxied requests
    '"referer":"$http_referer",'
    '"user_agent":"$http_user_agent",'
    '"request_id":"$request_id"'
'}';

# Skip health checks in the access log (used via if= below)
map $request_uri $loggable {
    ~^/health  0;
    default    1;
}

server {
    # Use the custom formats
    access_log /var/log/nginx/access.log combined_extra if=$loggable;
    access_log /var/log/nginx/access.json.log json_logs;

    # Error log with level
    error_log /var/log/nginx/error.log warn;
}

Apache

# Define custom log format
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common

# Add response time
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_time

# JSON format
LogFormat "{ \"timestamp\":\"%{%Y-%m-%dT%H:%M:%S}t\", \"client_ip\":\"%a\", \"method\":\"%m\", \"path\":\"%U\", \"query\":\"%q\", \"status\":%>s, \"bytes\":%B, \"duration\":%D, \"referer\":\"%{Referer}i\", \"user_agent\":\"%{User-Agent}i\" }" json

<VirtualHost *:80>
    ServerName example.com

    # JSON access log
    CustomLog /var/log/apache2/access.json.log json

    # Error log
    ErrorLog /var/log/apache2/error.log
    LogLevel warn

    # Conditional logging: skip health checks in the combined log
    SetEnvIf Request_URI "^/health" no_log
    CustomLog /var/log/apache2/access.log combined env=!no_log
</VirtualHost>

Node.js (Express with Morgan)

const express = require('express');
const morgan = require('morgan');
const fs = require('fs');

const app = express();

// Create write stream for file logging
const accessLogStream = fs.createWriteStream(
    '/var/log/node/access.log',
    { flags: 'a' }
);

// Combined format to file
app.use(morgan('combined', { stream: accessLogStream }));

// Dev format to console
app.use(morgan('dev'));

// Custom format
morgan.token('request-id', (req) => req.headers['x-request-id']);
app.use(morgan(':method :url :status :response-time ms - :request-id'));

// Custom JSON logger
app.use((req, res, next) => {
    const start = Date.now();

    res.on('finish', () => {
        const log = {
            timestamp: new Date().toISOString(),
            method: req.method,
            path: req.path,
            query: req.query,
            status: res.statusCode,
            duration: Date.now() - start,
            ip: req.ip,
            userAgent: req.headers['user-agent'],
            requestId: req.headers['x-request-id']
        };
        console.log(JSON.stringify(log));
    });

    next();
});

Error Logs

Error logs capture problems—they're what you check first when something breaks:

# Nginx error log examples
2024/11/15 14:32:18 [error] 1234#0: *5678 open() "/var/www/html/missing.html" failed (2: No such file or directory)
2024/11/15 14:32:19 [crit] 1234#0: *5679 connect() to 127.0.0.1:3000 failed (111: Connection refused)
2024/11/15 14:32:20 [warn] 1234#0: *5680 upstream server temporarily disabled

Error Log Levels

| Level  | Nginx  | Apache | Use For                 |
| ------ | ------ | ------ | ----------------------- |
| emerg  | emerg  | emerg  | System unusable         |
| alert  | alert  | alert  | Immediate action needed |
| crit   | crit   | crit   | Critical conditions     |
| error  | error  | error  | Error conditions        |
| warn   | warn   | warn   | Warning conditions      |
| notice | notice | notice | Normal but significant  |
| info   | info   | info   | Informational           |
| debug  | debug  | debug  | Debugging (verbose!)    |

Debug Level Warning

Never run debug level in production—it generates enormous amounts of data and can impact performance. Use it only for specific troubleshooting, then turn it off.

Correlation IDs

When a request touches multiple services, how do you trace it? Correlation IDs (also called request IDs or trace IDs) solve this:

Browser             Nginx                App Server          Database
   │                  │                      │                   │
   │──── Request ────▶│                      │                   │
   │                  │── X-Request-ID: abc ▶│                   │
   │                  │                      │── trace_id: abc ─▶│
   │                  │                      │◀──────────────────│
   │                  │◀─────────────────────│                   │
   │◀─────────────────│                      │                   │

All logs include: request_id=abc
Search logs for "abc" → see the entire request journey
# Use the client's request ID if provided, else Nginx's built-in $request_id
map $http_x_request_id $req_id {
    default   $http_x_request_id;
    ""        $request_id;  # built-in variable; Nginx generates one per request
}

# Include in logs (log_format belongs at http level, not inside server)
log_format with_id '$remote_addr - [$time_local] "$request" '
                   '$status request_id=$req_id';

server {
    access_log /var/log/nginx/access.log with_id;

    location / {
        # Pass to upstream
        proxy_set_header X-Request-ID $req_id;

        # Also add to response for debugging
        add_header X-Request-ID $req_id;

        proxy_pass http://app;
    }
}
const { v4: uuidv4 } = require('uuid');

// Middleware to handle request IDs
app.use((req, res, next) => {
    // Use incoming ID or generate new one
    req.requestId = req.headers['x-request-id'] || uuidv4();

    // Include in response
    res.setHeader('X-Request-ID', req.requestId);

    // Make available for logging
    req.log = {
        info: (msg, data) => console.log(JSON.stringify({
            timestamp: new Date().toISOString(),
            level: 'info',
            requestId: req.requestId,
            message: msg,
            ...data
        })),
        error: (msg, data) => console.error(JSON.stringify({
            timestamp: new Date().toISOString(),
            level: 'error',
            requestId: req.requestId,
            message: msg,
            ...data
        }))
    };

    next();
});

// Usage in routes
app.get('/api/users/:id', async (req, res) => {
    req.log.info('Fetching user', { userId: req.params.id });

    try {
        const user = await db.getUser(req.params.id);
        req.log.info('User found', { userId: user.id });
        res.json(user);
    } catch (err) {
        req.log.error('User lookup failed', { error: err.message });
        res.status(500).json({ error: 'Internal error' });
    }
});

Log Rotation

Logs grow forever—without rotation, they'll fill your disk. Log rotation archives old logs and optionally compresses or deletes them:

Linux logrotate

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily               # Rotate daily
    missingok           # Don't error if log missing
    rotate 14           # Keep 14 days
    compress            # Gzip old logs
    delaycompress       # Don't compress most recent backup
    notifempty          # Don't rotate empty files
    create 0640 nginx adm  # Permissions for new file
    sharedscripts       # Run postrotate once for all logs
    postrotate
        # Signal Nginx to reopen logs
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 `cat /var/run/nginx.pid`
        fi
    endscript
}

Size-Based Rotation

/var/log/nginx/*.log {
    size 100M           # Rotate when reaches 100MB
    rotate 10           # Keep 10 rotations
    compress
    missingok
    notifempty
}

Why USR1 Signal?

Nginx (and many servers) hold log files open. After rotation, you must signal the process to close and reopen its log files, or it will continue writing to the renamed/deleted file.

Key Metrics to Monitor

Beyond logs, monitoring metrics gives you real-time insight into server health:

Example dashboard snapshot:

Request Rate: 1,234 req/sec
P95 Latency: 45 ms
Error Rate: 0.1 %
Active Connections: 847

Essential Metrics

| Metric                 | What It Measures           | Warning Signs                         |
| ---------------------- | -------------------------- | ------------------------------------- |
| Request Rate           | Requests per second        | Sudden drops or spikes                |
| Latency P50/P95/P99    | Response time distribution | P99 > 1s, growing gap between P50/P99 |
| Error Rate             | % of 5xx responses         | > 1% sustained                        |
| Active Connections     | Current open connections   | Near max limits                       |
| Upstream Response Time | How long backends take     | Increasing trend                      |
| Bandwidth              | Bytes in/out               | Approaching network limits            |
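Several of these metrics can be collected in-process before reaching for a full monitoring stack. A rough sketch (`createMetrics` is an illustrative name; real deployments usually export counters through a metrics client library such as a Prometheus client instead):

```javascript
// Rough sketch of in-process counters for request rate, error rate, and latency.
function createMetrics() {
    const durations = [];   // per-request latencies in ms
    let requests = 0;
    let errors = 0;

    return {
        record(statusCode, durationMs) {
            requests += 1;
            if (statusCode >= 500) errors += 1;
            durations.push(durationMs);
        },
        snapshot() {
            const sorted = [...durations].sort((a, b) => a - b);
            // nearest-rank percentile approximation
            const pct = (q) =>
                sorted.length ? sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * q))] : 0;
            return {
                requests,
                errorRate: requests ? errors / requests : 0,
                p50: pct(0.5),
                p95: pct(0.95),
                p99: pct(0.99)
            };
        }
    };
}
```

Calling `record(res.statusCode, durationMs)` from a response-finish handler, as in the Express logging middleware shown earlier, is enough to feed it.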

Nginx Status Module

Nginx provides a built-in status page for basic metrics:

# Enable stub_status module
server {
    listen 8080;
    server_name localhost;

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;    # Only local access
        deny all;
    }
}

# Output:
# Active connections: 291
# server accepts handled requests
#  16630948 16630948 31070465
# Reading: 6 Writing: 179 Waiting: 106
| Field              | Meaning                                                   |
| ------------------ | --------------------------------------------------------- |
| Active connections | Current client connections                                |
| accepts            | Total accepted connections                                |
| handled            | Total handled connections (should equal accepts)          |
| requests           | Total requests (can exceed connections due to keep-alive) |
| Reading            | Reading client request headers                            |
| Writing            | Writing response to client                                |
| Waiting            | Keep-alive connections waiting for a request              |
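The stub_status page is plain text, so consuming it from a script means a little parsing. A sketch against the sample output above (`parseStubStatus` is an illustrative name; it assumes well-formed stub_status text):

```javascript
// Sketch: turn stub_status text into an object, following the sample
// output layout shown in the config block above.
function parseStubStatus(text) {
    const active = Number(/Active connections:\s+(\d+)/.exec(text)[1]);
    const [accepts, handled, requests] = /\n\s*(\d+)\s+(\d+)\s+(\d+)/
        .exec(text).slice(1).map(Number);
    const [, reading, writing, waiting] =
        /Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)/.exec(text).map(Number);
    return { active, accepts, handled, requests, reading, writing, waiting };
}

// e.g. fetch('http://localhost:8080/nginx_status').then(r => r.text()).then(parseStubStatus)
```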

Logging vs Observability

Modern systems distinguish between three pillars of observability:

| Pillar  | What                           | When to Use                  | Tools                           |
| ------- | ------------------------------ | ---------------------------- | ------------------------------- |
| Logs    | Discrete events with context   | Debugging specific issues    | ELK, Loki, Splunk               |
| Metrics | Numeric measurements over time | Dashboards, alerting, trends | Prometheus, Datadog, CloudWatch |
| Traces  | Request flow across services   | Distributed debugging        | Jaeger, Zipkin, X-Ray           |
                 ┌─────────────────────────────────────────┐
                 │              OBSERVABILITY              │
                 └─────────────────────────────────────────┘
                                     │
         ┌───────────────────────────┼───────────────────────────┐
         ▼                           ▼                           ▼
 ┌───────────────┐          ┌───────────────┐          ┌───────────────┐
 │     LOGS      │          │    METRICS    │          │    TRACES     │
 │               │          │               │          │               │
 │ "What         │          │ "How many     │          │ "Where did    │
 │  happened"    │          │  per second"  │          │  time go"     │
 │               │          │               │          │               │
 │ High detail   │          │ Aggregated    │          │ Distributed   │
 │ Search-based  │          │ Time-series   │          │ Causal        │
 └───────────────┘          └───────────────┘          └───────────────┘

Start Simple

For most web servers, good logs with correlation IDs and basic metrics are sufficient. Add distributed tracing when you have multiple services and need to understand request flow across them.

Practical Log Analysis

Common command-line tools for analyzing logs:

Finding Errors

# Recent 500 errors
grep '" 500 ' /var/log/nginx/access.log | tail -20

# Count errors by status code
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# 5xx errors in last hour
awk -v date="$(date -d '1 hour ago' '+%d/%b/%Y:%H')" '$4 ~ date && $9 ~ /^5/' access.log

Traffic Analysis

# Top 10 IP addresses
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

# Top requested paths
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10

# Requests per minute
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c

# Bandwidth by path
awk '{sum[$7]+=$10} END {for (path in sum) print sum[path], path}' access.log | sort -rn | head

Performance Analysis

# Slow requests (using combined_extra format with request_time)
grep 'rt=' access.log | awk -F'rt=' '{print $2}' | awk '{print $1}' | sort -rn | head -20

# Average response time (only lines containing rt=)
awk -F'rt=' 'NF > 1 {sum += $2; count++} END {if (count) print sum / count}' access.log

# P95 latency (simple approximation)
grep 'rt=' access.log | awk -F'rt=' '{print $2}' | awk '{print $1}' | sort -n | \
    awk '{all[NR]=$1} END {print all[int(NR*0.95)]}'

Summary