Performance Tuning

Optimizing your web server for speed and scalability

Performance Fundamentals

Web server performance depends on a few key factors: how many requests can be handled concurrently (workers and connections), how much memory each connection consumes (buffers), how long connections are held open (timeouts and keep-alive), how much CPU is spent on work such as compression, and the limits the operating system imposes.

Tuning involves balancing these factors against your specific workload. A configuration that's optimal for serving static files may be wrong for proxying to an application server.

Measure First

Never tune blindly. Always benchmark before and after changes. What seems like an improvement may actually hurt performance under real load.
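
A minimal before-and-after workflow, sketched here with the wrk tool covered later in this section (the URL, test duration, and reload command are placeholders for your own setup):

# Capture a baseline, change exactly one setting, then re-run the same test
wrk -t2 -c100 -d30s http://localhost/ | tee before.txt
nginx -t && nginx -s reload          # apply the single change
wrk -t2 -c100 -d30s http://localhost/ | tee after.txt
diff before.txt after.txt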

Worker Processes and Connections

The most fundamental tuning parameter is how many workers handle requests:

# nginx.conf

# Worker processes - usually one per CPU core
worker_processes auto;  # Let Nginx detect CPU count

# Alternative: set explicitly
# worker_processes 4;

# CPU affinity (pin workers to cores)
worker_cpu_affinity auto;

events {
    # Connections per worker
    worker_connections 4096;

    # Accept multiple connections at once
    multi_accept on;

    # Use efficient connection method
    use epoll;  # Linux (already the default there)
    # use kqueue;  # FreeBSD/macOS
}

# Calculate max clients:
# max_clients = worker_processes × worker_connections

Apache tunes the same dimension through its MPM (Multi-Processing Module) settings:

# Apache MPM configuration

# Event MPM (recommended for high traffic)
<IfModule mpm_event_module>
    ServerLimit         16
    StartServers        4
    MinSpareThreads     25
    MaxSpareThreads     75
    ThreadsPerChild     25
    MaxRequestWorkers   400
    MaxConnectionsPerChild 10000
</IfModule>

# Worker MPM (alternative)
<IfModule mpm_worker_module>
    ServerLimit         16
    StartServers        4
    MinSpareThreads     25
    MaxSpareThreads     75
    ThreadLimit         64
    ThreadsPerChild     25
    MaxRequestWorkers   400
</IfModule>

# Prefork MPM (for mod_php, older apps)
<IfModule mpm_prefork_module>
    StartServers        5
    MinSpareServers     5
    MaxSpareServers     10
    MaxRequestWorkers   256
    MaxConnectionsPerChild 10000
</IfModule>

Node.js applications can scale across cores with the built-in cluster module:

const cluster = require('cluster');
const os = require('os');
const express = require('express');

if (cluster.isPrimary) {
    const numCPUs = os.cpus().length;
    console.log(`Primary ${process.pid} starting ${numCPUs} workers`);

    // Fork workers
    for (let i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    // Replace dead workers
    cluster.on('exit', (worker, code, signal) => {
        console.log(`Worker ${worker.process.pid} died, starting new one`);
        cluster.fork();
    });

} else {
    const app = express();

    // Increase max listeners for high concurrency
    require('events').EventEmitter.defaultMaxListeners = 100;

    app.get('/', (req, res) => {
        res.send('Hello from worker ' + process.pid);
    });

    const server = app.listen(3000);

    // Tune socket options
    server.maxConnections = 1000;
    server.keepAliveTimeout = 65000;
}

Max Concurrent Connections = Worker Processes × Connections per Worker

Setting             Guideline            Notes
──────────────────  ───────────────────  ─────────────────────────────────────────────────────────────
worker_processes    Number of CPU cores  More won't help; CPU-bound work can't parallelize beyond core count
worker_connections  1024–4096            Limited by OS file descriptor limits
multi_accept        on                   Accept all pending connections at once
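
Plugging in the example values above (assuming 4 cores, so 4 workers) gives a quick sanity check:

4 worker_processes × 4096 worker_connections = 16,384 theoretical concurrent connections

Real capacity is lower; see the file descriptor limits and the "Connections × 2" note below.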

File Descriptor Limits

Each connection uses a file descriptor. The OS limits these by default:

# Check current limits
ulimit -n        # Soft limit
ulimit -Hn       # Hard limit

# Check system-wide
cat /proc/sys/fs/file-max

# Set for current session
ulimit -n 65535

# Permanent: /etc/security/limits.conf
nginx    soft    nofile    65535
nginx    hard    nofile    65535
www-data soft    nofile    65535
www-data hard    nofile    65535

# System-wide: /etc/sysctl.conf
fs.file-max = 2097152

# Apply sysctl changes
sysctl -p

For Nginx, also set in the config:

# nginx.conf (outside any block)
worker_rlimit_nofile 65535;
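
To confirm that a running worker actually picked up the new limit, read its limits from /proc (a sketch; the pgrep pattern assumes the standard "nginx: worker process" title):

# Max open files as seen by one running Nginx worker
cat /proc/$(pgrep -o -f 'nginx: worker')/limits | grep -i 'open files'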

Connections × 2

When proxying, each client connection requires a connection to the upstream—doubling file descriptor usage. Account for this when calculating limits.
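
As a worked example using the figures from this page (4 workers, 4096 connections each, all traffic proxied):

4 workers × 4096 connections × 2 (client + upstream) = 32,768 descriptors

so the 65,535 limit set above leaves comfortable headroom.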

Buffer Tuning

Buffers affect memory usage and performance. Too small means more disk I/O; too large wastes memory:

# Nginx buffer settings
http {
    # Client request buffers
    client_body_buffer_size 16k;      # POST body buffer
    client_header_buffer_size 1k;     # Header buffer
    large_client_header_buffers 4 8k; # Large headers

    # Proxy buffers (for upstream responses)
    proxy_buffering on;
    proxy_buffer_size 4k;             # First part of response
    proxy_buffers 8 16k;              # Buffer pool
    proxy_busy_buffers_size 24k;      # Can send while reading

    # FastCGI buffers
    fastcgi_buffer_size 4k;
    fastcgi_buffers 8 16k;

    # Output buffers
    output_buffers 2 32k;

    # Temporary file paths (when buffers overflow)
    client_body_temp_path /var/cache/nginx/client_temp;
    proxy_temp_path /var/cache/nginx/proxy_temp;
}

Buffer                   Default     Increase When
───────────────────────  ──────────  ─────────────────────────────────────
client_body_buffer_size  8k/16k      Large form submissions, file uploads
proxy_buffers            8 × 4k/8k   Large upstream responses
proxy_buffer_size        4k/8k       Large response headers

Memory Calculation

Total memory ≈ workers × connections × (client_buffer + proxy_buffers). For example, 4 workers × 4096 connections × 32 KB of buffers ≈ 512 MB minimum.
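
To compare the estimate with reality, check the resident memory of the worker processes under load (a sketch using standard ps options):

# RSS (in KB) per Nginx process
ps -o pid,rss,cmd -C nginx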

Timeout Optimization

Timeouts balance user experience against resource consumption:

# Nginx timeouts
http {
    # Client timeouts
    client_body_timeout 12s;     # Receiving body
    client_header_timeout 12s;   # Receiving headers
    send_timeout 10s;            # Sending response

    # Keep-alive (persistent connections)
    keepalive_timeout 65s;       # How long to keep open
    keepalive_requests 1000;     # Max requests per connection

    # Proxy timeouts
    proxy_connect_timeout 10s;   # Connecting to upstream
    proxy_send_timeout 60s;      # Sending to upstream
    proxy_read_timeout 60s;      # Reading from upstream

    # FastCGI timeouts
    fastcgi_connect_timeout 10s;
    fastcgi_send_timeout 60s;
    fastcgi_read_timeout 60s;
}

When to Adjust

Symptom                             Adjustment                    Trade-off
──────────────────────────────────  ────────────────────────────  ─────────────────────────────────────
504 Gateway Timeout                 Increase proxy_read_timeout   Slow backends hold connections longer
Slowloris susceptibility            Decrease client timeouts      May affect slow legitimate clients
High memory from idle connections   Decrease keepalive_timeout    More TCP handshakes
Many TIME_WAIT sockets              Increase keepalive_requests   Connections held longer

Keep-Alive Tuning

HTTP keep-alive reuses TCP connections for multiple requests, avoiding handshake overhead:

Without Keep-Alive:
─────────────────────
Client        Server
  │──SYN────▶│
  │◀─SYN-ACK─│
  │──ACK────▶│
  │──GET────▶│
  │◀─Response│
  │──FIN────▶│
  │          │
  │──SYN────▶│   ← New connection!
  │◀─SYN-ACK─│
  │──ACK────▶│
  │──GET────▶│
  │◀─Response│

Overhead: ~3 RTT for every request (handshake + request + teardown)

With Keep-Alive:
─────────────────────
Client        Server
  │──SYN────▶│
  │◀─SYN-ACK─│
  │──ACK────▶│
  │──GET────▶│
  │◀─Response│
  │──GET────▶│   ← Reuse!
  │◀─Response│
  │──GET────▶│   ← Reuse!
  │◀─Response│
  │──FIN────▶│

Overhead: ~3 RTT once, then ~1 RTT per request
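
You can observe the reuse directly: when curl is given two URLs in one invocation it keeps the connection open, and with keep-alive enabled its verbose log reports the second request reusing it (localhost is a placeholder):

# Discard the bodies, keep the verbose log, and filter for the reuse message
curl -sv http://localhost/ http://localhost/ 2>&1 >/dev/null | grep -i 're-using'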

Upstream Keep-Alive

Keep-alive to backend servers is equally important:

# Nginx upstream keep-alive
upstream backend {
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;

    # Keep connections alive to upstream
    keepalive 32;           # Connection pool size per worker
    keepalive_requests 1000; # Max requests per connection
    keepalive_timeout 60s;   # Idle timeout
}

server {
    location / {
        proxy_pass http://backend;

        # Required for upstream keep-alive
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
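
To check that the pool is actually being used, count established connections from Nginx to the backends while under load (ports 3000 and 3001 match the example upstream above):

# Established connections to the example backends
ss -tn state established '( dport = :3000 or dport = :3001 )'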

Compression Trade-offs

Compression reduces bandwidth but costs CPU. Balance based on your bottleneck:

# Nginx gzip configuration
http {
    gzip on;
    gzip_vary on;
    gzip_proxied any;

    # Compression level (1-9)
    gzip_comp_level 5;  # Sweet spot: good compression, moderate CPU

    # Minimum size to compress
    gzip_min_length 256;

    # Compress these MIME types
    gzip_types
        text/plain
        text/css
        text/javascript
        application/javascript
        application/json
        application/xml
        image/svg+xml;

    # Pre-compressed files (best of both worlds)
    gzip_static on;  # Serve .gz files if they exist
}

Level   Compression   CPU Cost   Use Case
─────   ───────────   ────────   ──────────────────────────────
1       ~60%          Low        High-traffic, CPU-limited
5       ~75%          Medium     General purpose
9       ~78%          High       Bandwidth-limited, low traffic
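
These percentages vary with content; a quick way to measure the trade-off for your own assets is to compress a representative file at each level (app.js here is a placeholder):

# Compressed size in bytes at levels 1, 5, and 9
for level in 1 5 9; do
    printf 'level %s: ' "$level"
    gzip -"$level" -c app.js | wc -c
done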

Pre-compression

For static files, pre-compress at build time (gzip -9). Nginx serves the .gz file directly with gzip_static, getting maximum compression with zero runtime CPU cost.
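
A build-step sketch, assuming GNU gzip 1.6+ for the -k (keep original) flag and /var/www/html as the document root:

# Pre-compress text assets; originals are kept so Nginx can still serve uncompressed
find /var/www/html -type f \( -name '*.html' -o -name '*.css' -o -name '*.js' -o -name '*.svg' \) \
    -exec gzip -9 -k -f {} \;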

Benchmarking Tools

Always measure performance before and after changes. These tools help:

Apache Benchmark (ab)

# Install
apt install apache2-utils  # Debian/Ubuntu
brew install httpd         # macOS

# Basic benchmark: 1000 requests, 10 concurrent
ab -n 1000 -c 10 http://localhost/

# With keep-alive
ab -n 1000 -c 10 -k http://localhost/

# POST with data
ab -n 1000 -c 10 -p data.json -T application/json http://localhost/api

Concurrency Level:      10
Time taken for tests:   0.523 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      1250000 bytes
HTML transferred:       1000000 bytes
Requests per second:    1912.04 [#/sec] (mean)
Time per request:       5.230 [ms] (mean)
Time per request:       0.523 [ms] (mean, across all concurrent requests)
Transfer rate:          2334.52 [Kbytes/sec] received

Percentage of the requests served within a certain time (ms)
  50%      5
  66%      5
  75%      6
  90%      7
  95%      8
  99%     12
 100%     15 (longest request)

wrk (Modern Alternative)

# Install
apt install wrk          # Debian/Ubuntu
brew install wrk         # macOS

# Basic benchmark: 2 threads, 100 connections, 30 seconds
wrk -t2 -c100 -d30s http://localhost/

# With Lua script for custom requests
wrk -t2 -c100 -d30s -s post.lua http://localhost/api

Running 30s test @ http://localhost/
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.12ms    2.34ms   45.67ms   89.23%
    Req/Sec     9.87k     1.23k    12.34k    78.90%
  591234 requests in 30.00s, 723.45MB read
Requests/sec:  19707.80
Transfer/sec:      24.12MB
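
The post.lua script referenced above isn't shown; a minimal version using wrk's Lua scripting API might look like this (the JSON body is a placeholder):

# Create a simple script that turns every request into a JSON POST
cat > post.lua <<'EOF'
wrk.method = "POST"
wrk.body   = '{"name": "test"}'
wrk.headers["Content-Type"] = "application/json"
EOF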

Key Metrics to Watch

Metric            What It Tells You        Warning Signs
────────────────  ───────────────────────  ────────────────────────────────
Requests/sec      Throughput capacity      Doesn't scale with concurrency
Latency (mean)    Average response time    Increases under load
Latency (P99)     Worst-case experience    Much higher than mean
Failed requests   Errors under load        Any failures
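
wrk reports the P99 figure only when asked; pass --latency to print the full distribution:

# Adds a latency distribution (50/75/90/99%) to the summary
wrk -t2 -c100 -d30s --latency http://localhost/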

Profiling Bottlenecks

Benchmarks show the symptom; profiling finds the cause:

System-Level Profiling

# CPU usage by process
top -p $(pgrep -d',' nginx)
htop

# I/O wait
iostat -x 1

# Network connections
ss -s                     # Summary
ss -tuln                  # Listening ports
ss -tn state time-wait | wc -l  # TIME_WAIT count

# File descriptors
ls /proc/$(cat /var/run/nginx.pid)/fd | wc -l

# Open files by Nginx
lsof -p $(cat /var/run/nginx.pid) | wc -l

Nginx Stub Status

# Enable stub_status
server {
    listen 8080;
    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}

# Check during load test
watch -n1 'curl -s localhost:8080/nginx_status'

# Output:
Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106

Reading and Writing count connections actively receiving a request or sending a response; Waiting counts idle keep-alive connections. If "accepts" and "handled" diverge, workers are hitting a resource limit (such as worker_connections) and dropping connections.

Identifying Bottlenecks

Symptom                      Likely Bottleneck         Solution
───────────────────────────  ────────────────────────  ────────────────────────────────
High CPU, throughput flat    CPU-bound                 More workers, faster CPU
Low CPU, high latency        I/O-bound                 SSD, more RAM for cache
High memory, OOM kills       Buffer/connection bloat   Reduce buffers, connections
Many TIME_WAIT sockets       Connection churn          Enable keep-alive
"Too many open files"        File descriptor limit     Increase ulimit
Upstream timeout errors      Slow backend              Scale backend, increase timeouts

Linux Kernel Tuning

For high-traffic servers, kernel parameters matter:

# /etc/sysctl.conf

# TCP memory
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Connection handling
net.core.somaxconn = 65535              # Listen queue size
net.core.netdev_max_backlog = 65535     # Network interface queue
net.ipv4.tcp_max_syn_backlog = 65535    # SYN queue

# TIME_WAIT reduction
net.ipv4.tcp_tw_reuse = 1               # Reuse TIME_WAIT sockets
net.ipv4.tcp_fin_timeout = 15           # Faster FIN timeout

# Keep-alive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3

# File descriptors
fs.file-max = 2097152

# Apply changes
sysctl -p

Test Carefully

Kernel tuning can destabilize systems. Test in staging first, change one parameter at a time, and monitor for unintended effects.
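
sysctl -w applies a single parameter immediately without touching /etc/sysctl.conf, which makes the one-at-a-time approach easy to follow (net.core.somaxconn is just an example):

# Apply one value at runtime, verify it, and only persist it once testing passes
sysctl -w net.core.somaxconn=65535
sysctl net.core.somaxconn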

Common Performance Patterns

Static Site (HTML/CSS/JS/Images)

# Optimized for static content
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    multi_accept on;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    # Aggressive caching
    open_file_cache max=10000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    gzip on;
    gzip_static on;

    keepalive_timeout 65;
    keepalive_requests 1000;
}

Reverse Proxy (API Gateway)

# Optimized for proxying
worker_processes auto;
worker_rlimit_nofile 65535;

http {
    # Minimal buffering for real-time
    proxy_buffering off;
    # Or tuned buffering for throughput
    # proxy_buffer_size 8k;
    # proxy_buffers 16 32k;

    upstream api {
        server 127.0.0.1:3000;
        server 127.0.0.1:3001;
        keepalive 64;
    }

    server {
        location /api {
            proxy_pass http://api;
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Appropriate timeouts
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
        }
    }
}

WebSocket Support

# Optimized for long-lived connections
http {
    # Long timeouts for persistent connections
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;

    # WebSocket upgrade
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        location /ws {
            proxy_pass http://websocket_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
        }
    }
}

Summary