Logging and Monitoring

Understanding what your server is doing and when things go wrong

Why Logging Matters

Web server logs are your primary window into server behavior. Without proper logging, debugging becomes guesswork. With good logging, you can answer questions like: Which endpoints are slowest? Which clients are generating errors? What happened right before the crash?

Web servers typically maintain two types of logs:

| Log Type   | Purpose                                | When to Check                                |
| ---------- | -------------------------------------- | -------------------------------------------- |
| Access Log | Records every request: who, what, when | Traffic analysis, debugging, security audits |
| Error Log  | Records problems and failures          | When something breaks, startup issues        |

Anatomy of an Access Log

Here's a typical access log entry. Let's break it down:

192.168.1.100 - john [15/Nov/2024:14:32:18 +0000] "GET /products/widget HTTP/1.1" 200 4523 "https://example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
| Field      | Value                             | Meaning                        |
| ---------- | --------------------------------- | ------------------------------ |
| Client IP  | 192.168.1.100                     | Who made the request           |
| Identity   | -                                 | RFC 1413 ident (rarely used)   |
| User       | john                              | Authenticated username (or -)  |
| Timestamp  | [15/Nov/2024:14:32:18 +0000]      | When the request completed     |
| Request    | "GET /products/widget HTTP/1.1"   | Method, path, protocol         |
| Status     | 200                               | HTTP response code             |
| Size       | 4523                              | Response body size in bytes    |
| Referer    | "https://example.com/"            | Page that linked here          |
| User-Agent | "Mozilla/5.0..."                  | Client software identification |
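Because every field has a fixed position, a small parser can turn a raw line into a structured object. A minimal sketch, assuming the combined format exactly as shown above (`parseAccessLine` is an illustrative name):

```javascript
// Sketch: split a Combined Log Format line into named fields.
// The regex mirrors the field table above.
const COMBINED = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

function parseAccessLine(line) {
    const m = COMBINED.exec(line);
    if (m === null) return null; // line is not in combined format
    const [, ip, ident, user, time, request, status, size, referer, userAgent] = m;
    const [method, path, protocol] = request.split(' ');
    return {
        ip, ident, user, time, method, path, protocol,
        status: Number(status),
        bytes: size === '-' ? 0 : Number(size), // "-" means no body was sent
        referer, userAgent
    };
}
```

Running it on the sample entry above yields an object with `status: 200` and `bytes: 4523`.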

Standard Log Formats

The web has standardized log formats that tools understand universally:

Common Log Format (CLF)

The original Apache format—minimal but universal:

%h %l %u %t "%r" %>s %b
192.168.1.100 - - [15/Nov/2024:14:32:18 +0000] "GET /page HTTP/1.1" 200 4523

Combined Log Format

CLF plus referer and user-agent—the most common format:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"

JSON Log Format

Modern systems prefer structured JSON for easier parsing:

{
  "timestamp": "2024-11-15T14:32:18.123Z",
  "client_ip": "192.168.1.100",
  "method": "GET",
  "path": "/products/widget",
  "status": 200,
  "bytes": 4523,
  "duration_ms": 45,
  "referer": "https://example.com/",
  "user_agent": "Mozilla/5.0..."
}

Why JSON?

JSON logs are easier to parse programmatically, can include arbitrary fields, and integrate better with modern log aggregation tools like Elasticsearch, Loki, and Datadog.
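The difference is easy to see in code. A quick sketch of filtering structured logs, assuming one JSON object per line (NDJSON) with the fields shown above; `findServerErrors` is a hypothetical helper:

```javascript
// Sketch: with one JSON object per line, filtering is just parse + predicate,
// with no format-specific regex.
function findServerErrors(ndjson) {
    return ndjson
        .split('\n')
        .filter(Boolean)                        // drop blank lines
        .map((line) => JSON.parse(line))
        .filter((entry) => entry.status >= 500);
}

const sample = [
    '{"timestamp":"2024-11-15T14:32:18.123Z","path":"/ok","status":200}',
    '{"timestamp":"2024-11-15T14:32:19.001Z","path":"/boom","status":502}'
].join('\n');
// findServerErrors(sample) keeps only the 502 entry
```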

Configuring Log Formats

Nginx

# Define custom log format
log_format combined_extra '$remote_addr - $remote_user [$time_local] '
                          '"$request" $status $body_bytes_sent '
                          '"$http_referer" "$http_user_agent" '
                          'rt=$request_time uct=$upstream_connect_time '
                          'uht=$upstream_header_time urt=$upstream_response_time';

# JSON format
log_format json_logs escape=json '{'
    '"timestamp":"$time_iso8601",'
    '"client_ip":"$remote_addr",'
    '"method":"$request_method",'
    '"path":"$uri",'
    '"query":"$args",'
    '"status":$status,'
    '"bytes":$body_bytes_sent,'
    '"duration":$request_time,'
    '"upstream_time":"$upstream_response_time",'  # quoted: may be "-" or empty for non-proxied requests
    '"referer":"$http_referer",'
    '"user_agent":"$http_user_agent",'
    '"request_id":"$request_id"'
'}';

# Skip health checks in the access log (used via if= below)
map $request_uri $loggable {
    ~^/health  0;
    default    1;
}

server {
    # Use the custom formats
    access_log /var/log/nginx/access.log combined_extra if=$loggable;
    access_log /var/log/nginx/access.json.log json_logs;

    # Error log with level
    error_log /var/log/nginx/error.log warn;
}

Apache

# Define custom log format
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common

# Add response time
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_time

# JSON format
LogFormat "{ \"timestamp\":\"%{%Y-%m-%dT%H:%M:%S}t\", \"client_ip\":\"%a\", \"method\":\"%m\", \"path\":\"%U\", \"query\":\"%q\", \"status\":%>s, \"bytes\":%B, \"duration\":%D, \"referer\":\"%{Referer}i\", \"user_agent\":\"%{User-Agent}i\" }" json

<VirtualHost *:80>
    ServerName example.com

    # JSON access log
    CustomLog /var/log/apache2/access.json.log json

    # Error log
    ErrorLog /var/log/apache2/error.log
    LogLevel warn

    # Conditional logging: skip health checks in the combined log
    SetEnvIf Request_URI "^/health" no_log
    CustomLog /var/log/apache2/access.log combined env=!no_log
</VirtualHost>

Node.js (Express with Morgan)

const express = require('express');
const morgan = require('morgan');
const fs = require('fs');

const app = express();

// Create write stream for file logging
const accessLogStream = fs.createWriteStream(
    '/var/log/node/access.log',
    { flags: 'a' }
);

// Combined format to file
app.use(morgan('combined', { stream: accessLogStream }));

// Dev format to console
app.use(morgan('dev'));

// Custom format
morgan.token('request-id', (req) => req.headers['x-request-id']);
app.use(morgan(':method :url :status :response-time ms - :request-id'));

// Custom JSON logger
app.use((req, res, next) => {
    const start = Date.now();

    res.on('finish', () => {
        const log = {
            timestamp: new Date().toISOString(),
            method: req.method,
            path: req.path,
            query: req.query,
            status: res.statusCode,
            duration: Date.now() - start,
            ip: req.ip,
            userAgent: req.headers['user-agent'],
            requestId: req.headers['x-request-id']
        };
        console.log(JSON.stringify(log));
    });

    next();
});

Error Logs

Error logs capture problems—they're what you check first when something breaks:

# Nginx error log examples
2024/11/15 14:32:18 [error] 1234#0: *5678 open() "/var/www/html/missing.html" failed (2: No such file or directory)
2024/11/15 14:32:19 [crit] 1234#0: *5679 connect() to 127.0.0.1:3000 failed (111: Connection refused)
2024/11/15 14:32:20 [warn] 1234#0: *5680 upstream server temporarily disabled

Error Log Levels

| Level  | Nginx  | Apache | Use For                 |
| ------ | ------ | ------ | ----------------------- |
| emerg  | emerg  | emerg  | System unusable         |
| alert  | alert  | alert  | Immediate action needed |
| crit   | crit   | crit   | Critical conditions     |
| error  | error  | error  | Error conditions        |
| warn   | warn   | warn   | Warning conditions      |
| notice | notice | notice | Normal but significant  |
| info   | info   | info   | Informational           |
| debug  | debug  | debug  | Debugging (verbose!)    |

Debug Level Warning

Never run debug level in production—it generates enormous amounts of data and can impact performance. Use it only for specific troubleshooting, then turn it off.

Correlation IDs

When a request touches multiple services, how do you trace it? Correlation IDs (also called request IDs or trace IDs) solve this:

Browser             Nginx                App Server          Database
   │                  │                      │                   │
   │──── Request ────▶│                      │                   │
   │                  │── X-Request-ID: abc ▶│                   │
   │                  │                      │── trace_id: abc ─▶│
   │                  │                      │◀──────────────────│
   │                  │◀─────────────────────│                   │
   │◀─────────────────│                      │                   │

All logs include: request_id=abc
Search logs for "abc" → see the entire request journey
# Use the client's request ID if provided, else Nginx's built-in $request_id
map $http_x_request_id $req_id {
    default   $http_x_request_id;
    ""        $request_id;  # built-in variable; Nginx generates one per request
}

# Include in logs (log_format belongs at http level, not inside server)
log_format with_id '$remote_addr - [$time_local] "$request" '
                   '$status request_id=$req_id';

server {
    access_log /var/log/nginx/access.log with_id;

    location / {
        # Pass to upstream
        proxy_set_header X-Request-ID $req_id;

        # Also add to response for debugging
        add_header X-Request-ID $req_id;

        proxy_pass http://app;
    }
}
const { v4: uuidv4 } = require('uuid');

// Middleware to handle request IDs
app.use((req, res, next) => {
    // Use incoming ID or generate new one
    req.requestId = req.headers['x-request-id'] || uuidv4();

    // Include in response
    res.setHeader('X-Request-ID', req.requestId);

    // Make available for logging
    req.log = {
        info: (msg, data) => console.log(JSON.stringify({
            timestamp: new Date().toISOString(),
            level: 'info',
            requestId: req.requestId,
            message: msg,
            ...data
        })),
        error: (msg, data) => console.error(JSON.stringify({
            timestamp: new Date().toISOString(),
            level: 'error',
            requestId: req.requestId,
            message: msg,
            ...data
        }))
    };

    next();
});

// Usage in routes
app.get('/api/users/:id', async (req, res) => {
    req.log.info('Fetching user', { userId: req.params.id });

    try {
        const user = await db.getUser(req.params.id);
        req.log.info('User found', { userId: user.id });
        res.json(user);
    } catch (err) {
        req.log.error('User lookup failed', { error: err.message });
        res.status(500).json({ error: 'Internal error' });
    }
});

Log Rotation

Logs grow forever—without rotation, they'll fill your disk. Log rotation archives old logs and optionally compresses or deletes them:

Linux logrotate

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily               # Rotate daily
    missingok           # Don't error if log missing
    rotate 14           # Keep 14 days
    compress            # Gzip old logs
    delaycompress       # Don't compress most recent backup
    notifempty          # Don't rotate empty files
    create 0640 nginx adm  # Permissions for new file
    sharedscripts       # Run postrotate once for all logs
    postrotate
        # Signal Nginx to reopen logs
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 `cat /var/run/nginx.pid`
        fi
    endscript
}

Size-Based Rotation

/var/log/nginx/*.log {
    size 100M           # Rotate when reaches 100MB
    rotate 10           # Keep 10 rotations
    compress
    missingok
    notifempty
}

Why USR1 Signal?

Nginx (and many servers) hold log files open. After rotation, you must signal the process to close and reopen its log files, or it will continue writing to the renamed/deleted file.

Key Metrics to Monitor

Beyond logs, monitoring metrics gives you real-time insight into server health:

Example dashboard snapshot:

Request Rate: 1,234 req/sec
P95 Latency: 45 ms
Error Rate: 0.1 %
Active Connections: 847

Essential Metrics

| Metric                 | What It Measures           | Warning Signs                         |
| ---------------------- | -------------------------- | ------------------------------------- |
| Request Rate           | Requests per second        | Sudden drops or spikes                |
| Latency P50/P95/P99    | Response time distribution | P99 > 1s, growing gap between P50/P99 |
| Error Rate             | % of 5xx responses         | > 1% sustained                        |
| Active Connections     | Current open connections   | Near max limits                       |
| Upstream Response Time | How long backends take     | Increasing trend                      |
| Bandwidth              | Bytes in/out               | Approaching network limits            |
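Several of these metrics can be collected in-process before reaching for a full monitoring stack. A rough sketch (`createMetrics` is an illustrative name; real deployments usually export counters through a metrics client library such as a Prometheus client instead):

```javascript
// Rough sketch of in-process counters for request rate, error rate, and latency.
function createMetrics() {
    const durations = [];   // per-request latencies in ms
    let requests = 0;
    let errors = 0;

    return {
        record(statusCode, durationMs) {
            requests += 1;
            if (statusCode >= 500) errors += 1;
            durations.push(durationMs);
        },
        snapshot() {
            const sorted = [...durations].sort((a, b) => a - b);
            // nearest-rank percentile approximation
            const pct = (q) =>
                sorted.length ? sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * q))] : 0;
            return {
                requests,
                errorRate: requests ? errors / requests : 0,
                p50: pct(0.5),
                p95: pct(0.95),
                p99: pct(0.99)
            };
        }
    };
}
```

Calling `record(res.statusCode, durationMs)` from a response-finish handler, as in the Express logging middleware shown earlier, is enough to feed it.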

Nginx Status Module

Nginx provides a built-in status page for basic metrics:

# Enable stub_status module
server {
    listen 8080;
    server_name localhost;

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;    # Only local access
        deny all;
    }
}

# Output:
# Active connections: 291
# server accepts handled requests
#  16630948 16630948 31070465
# Reading: 6 Writing: 179 Waiting: 106
| Field              | Meaning                                                   |
| ------------------ | --------------------------------------------------------- |
| Active connections | Current client connections                                |
| accepts            | Total accepted connections                                |
| handled            | Total handled connections (should equal accepts)          |
| requests           | Total requests (can exceed connections due to keep-alive) |
| Reading            | Reading client request headers                            |
| Writing            | Writing response to client                                |
| Waiting            | Keep-alive connections waiting for a request              |
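The stub_status page is plain text, so consuming it from a script means a little parsing. A sketch against the sample output above (`parseStubStatus` is an illustrative name; it assumes well-formed stub_status text):

```javascript
// Sketch: turn stub_status text into an object, following the sample
// output layout shown in the config block above.
function parseStubStatus(text) {
    const active = Number(/Active connections:\s+(\d+)/.exec(text)[1]);
    const [accepts, handled, requests] = /\n\s*(\d+)\s+(\d+)\s+(\d+)/
        .exec(text).slice(1).map(Number);
    const [, reading, writing, waiting] =
        /Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)/.exec(text).map(Number);
    return { active, accepts, handled, requests, reading, writing, waiting };
}

// e.g. fetch('http://localhost:8080/nginx_status').then(r => r.text()).then(parseStubStatus)
```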

Logging vs Observability

Modern systems distinguish between three pillars of observability:

| Pillar  | What                           | When to Use                  | Tools                           |
| ------- | ------------------------------ | ---------------------------- | ------------------------------- |
| Logs    | Discrete events with context   | Debugging specific issues    | ELK, Loki, Splunk               |
| Metrics | Numeric measurements over time | Dashboards, alerting, trends | Prometheus, Datadog, CloudWatch |
| Traces  | Request flow across services   | Distributed debugging        | Jaeger, Zipkin, X-Ray           |
                 ┌─────────────────────────────────────────┐
                 │              OBSERVABILITY              │
                 └─────────────────────────────────────────┘
                                     │
         ┌───────────────────────────┼───────────────────────────┐
         ▼                           ▼                           ▼
 ┌───────────────┐          ┌───────────────┐          ┌───────────────┐
 │     LOGS      │          │    METRICS    │          │    TRACES     │
 │               │          │               │          │               │
 │ "What         │          │ "How many     │          │ "Where did    │
 │  happened"    │          │  per second"  │          │  time go"     │
 │               │          │               │          │               │
 │ High detail   │          │ Aggregated    │          │ Distributed   │
 │ Search-based  │          │ Time-series   │          │ Causal        │
 └───────────────┘          └───────────────┘          └───────────────┘

Start Simple

For most web servers, good logs with correlation IDs and basic metrics are sufficient. Add distributed tracing when you have multiple services and need to understand request flow across them.

Practical Log Analysis

Common command-line tools for analyzing logs:

Finding Errors

# Recent 500 errors
grep '" 500 ' /var/log/nginx/access.log | tail -20

# Count errors by status code
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# 5xx errors in last hour
awk -v date="$(date -d '1 hour ago' '+%d/%b/%Y:%H')" '$4 ~ date && $9 ~ /^5/' access.log

Traffic Analysis

# Top 10 IP addresses
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

# Top requested paths
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10

# Requests per minute
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c

# Bandwidth by path
awk '{sum[$7]+=$10} END {for (path in sum) print sum[path], path}' access.log | sort -rn | head

Performance Analysis

# Slow requests (using combined_extra format with request_time)
grep 'rt=' access.log | awk -F'rt=' '{print $2}' | awk '{print $1}' | sort -rn | head -20

# Average response time (only lines containing rt=)
awk -F'rt=' 'NF > 1 {sum += $2; count++} END {if (count) print sum / count}' access.log

# P95 latency (simple approximation)
grep 'rt=' access.log | awk -F'rt=' '{print $2}' | awk '{print $1}' | sort -n | \
    awk '{all[NR]=$1} END {print all[int(NR*0.95)]}'

Summary