# Performance Tuning

## Performance Fundamentals

Web server performance depends on a few key factors:

- **Throughput** — how many requests per second can be handled
- **Latency** — how long each request takes
- **Concurrency** — how many simultaneous connections are supported
- **Resource efficiency** — CPU, memory, and network utilization

Tuning involves balancing these factors against your specific workload. A configuration that's optimal for serving static files may be wrong for proxying to an application server.
> **Measure first:** Never tune blindly. Always benchmark before and after changes. What seems like an improvement may actually hurt performance under real load.
## Worker Processes and Connections

The most fundamental tuning parameter is how many workers handle requests:
```nginx
# nginx.conf

# Worker processes - usually one per CPU core
worker_processes auto;    # Let Nginx detect the CPU count
# Alternative: set explicitly
# worker_processes 4;

# CPU affinity (pin workers to cores)
worker_cpu_affinity auto;

events {
    # Connections per worker
    worker_connections 4096;

    # Accept multiple connections at once
    multi_accept on;

    # Use an efficient event notification method
    use epoll;      # Linux (default on Linux)
    # use kqueue;   # FreeBSD/macOS
}

# Calculate max clients:
# max_clients = worker_processes × worker_connections
```
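To sanity-check what `worker_processes auto;` will actually start on a given machine, compare the core count with the running worker processes — a quick check, not part of the Nginx config:

```bash
# Number of CPU cores visible to the OS
nproc

# Count running Nginx workers after a reload; should match the core count
ps -ef | grep '[n]ginx: worker' | wc -l
```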
Apache's equivalent knobs live in the MPM (Multi-Processing Module) configuration:

```apache
# Event MPM (recommended for high traffic)
<IfModule mpm_event_module>
    ServerLimit              16
    StartServers             4
    MinSpareThreads          25
    MaxSpareThreads          75
    ThreadsPerChild          25
    MaxRequestWorkers        400
    MaxConnectionsPerChild   10000
</IfModule>

# Worker MPM (alternative)
<IfModule mpm_worker_module>
    ServerLimit              16
    StartServers             4
    MinSpareThreads          25
    MaxSpareThreads          75
    ThreadLimit              64
    ThreadsPerChild          25
    MaxRequestWorkers        400
</IfModule>

# Prefork MPM (for mod_php, older apps)
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers          10
    MaxRequestWorkers        256
    MaxConnectionsPerChild   10000
</IfModule>
```

For the threaded MPMs, `MaxRequestWorkers` cannot exceed `ServerLimit × ThreadsPerChild` (here 16 × 25 = 400), so raise these values together.
Node.js applications get the same one-worker-per-core layout with the built-in `cluster` module:

```javascript
const cluster = require('cluster');
const os = require('os');
const express = require('express');

// Note: cluster.isPrimary requires Node 16+; older versions use cluster.isMaster
if (cluster.isPrimary) {
  const numCPUs = os.cpus().length;
  console.log(`Primary ${process.pid} starting ${numCPUs} workers`);

  // Fork one worker per CPU core
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  // Replace dead workers
  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died, starting a new one`);
    cluster.fork();
  });
} else {
  const app = express();

  // Increase max listeners for high concurrency
  require('events').EventEmitter.defaultMaxListeners = 100;

  app.get('/', (req, res) => {
    res.send('Hello from worker ' + process.pid);
  });

  const server = app.listen(3000);

  // Tune socket options
  server.maxConnections = 1000;
  server.keepAliveTimeout = 65000;
}
```
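Assuming the file above is saved as `server.js` (a name chosen here for illustration), a quick smoke test shows requests being spread across worker PIDs:

```bash
node server.js &
sleep 1

# Each response may report a different worker PID
for i in 1 2 3 4; do curl -s http://localhost:3000/; echo; done
```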
| Setting | Guideline | Notes |
|---|---|---|
| `worker_processes` | Number of CPU cores | More won't help; CPU-bound work can't parallelize beyond the core count |
| `worker_connections` | 1024–4096 | Limited by OS file descriptor limits |
| `multi_accept` | `on` | Accept all pending connections at once |
## File Descriptor Limits

Each connection uses a file descriptor, and the OS limits these by default:
```bash
# Check current limits
ulimit -n     # Soft limit
ulimit -Hn    # Hard limit

# Check system-wide
cat /proc/sys/fs/file-max

# Set for current session
ulimit -n 65535

# Permanent: /etc/security/limits.conf
nginx    soft nofile 65535
nginx    hard nofile 65535
www-data soft nofile 65535
www-data hard nofile 65535

# System-wide: /etc/sysctl.conf
fs.file-max = 2097152

# Apply sysctl changes
sysctl -p
```
For Nginx, also set the limit in the config:

```nginx
# nginx.conf (outside any block)
worker_rlimit_nofile 65535;
```
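To confirm the running process actually picked up the new limit, inspect it via `/proc` (using the same pid file path as the profiling commands later in this chapter):

```bash
# Effective open-file limit of the running Nginx master process
grep 'open files' /proc/$(cat /var/run/nginx.pid)/limits
```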
> **Connection × 2:** When proxying, each client connection requires a connection to the upstream—doubling file descriptor usage. Account for this when calculating limits: 10,000 proxied clients consume roughly 20,000 descriptors before log files and other handles are counted.
## Buffer Tuning

Buffers affect memory usage and performance. Too small means more disk I/O; too large wastes memory:
```nginx
# Nginx buffer settings
http {
    # Client request buffers
    client_body_buffer_size 16k;         # POST body buffer
    client_header_buffer_size 1k;        # Header buffer
    large_client_header_buffers 4 8k;    # Oversized headers

    # Proxy buffers (for upstream responses)
    proxy_buffering on;
    proxy_buffer_size 4k;           # First part of the response (headers)
    proxy_buffers 8 16k;            # Buffer pool
    proxy_busy_buffers_size 24k;    # Can send to client while still reading

    # FastCGI buffers
    fastcgi_buffer_size 4k;
    fastcgi_buffers 8 16k;

    # Output buffers
    output_buffers 2 32k;

    # Temporary file paths (used when buffers overflow)
    client_body_temp_path /var/cache/nginx/client_temp;
    proxy_temp_path /var/cache/nginx/proxy_temp;
}
```
| Buffer | Default | Increase When |
|---|---|---|
| `client_body_buffer_size` | 8k/16k | Large form submissions, file uploads |
| `proxy_buffers` | 8 4k/8k | Large upstream responses |
| `proxy_buffer_size` | 4k/8k | Large response headers |
> **Memory calculation:** Total buffer memory = workers × connections × (client buffers + proxy buffers). For 4 workers × 4096 connections × 32 KB of buffers, plan for at least 512 MB.
## Timeout Optimization

Timeouts balance user experience against resource consumption:
```nginx
# Nginx timeouts
http {
    # Client timeouts
    client_body_timeout 12s;      # Receiving the body
    client_header_timeout 12s;    # Receiving headers
    send_timeout 10s;             # Sending the response

    # Keep-alive (persistent connections)
    keepalive_timeout 65s;        # How long to keep a connection open
    keepalive_requests 1000;      # Max requests per connection

    # Proxy timeouts
    proxy_connect_timeout 10s;    # Connecting to the upstream
    proxy_send_timeout 60s;       # Sending to the upstream
    proxy_read_timeout 60s;       # Reading from the upstream

    # FastCGI timeouts
    fastcgi_connect_timeout 10s;
    fastcgi_send_timeout 60s;
    fastcgi_read_timeout 60s;
}
```
### When to Adjust
| Symptom | Adjustment | Trade-off |
|---|---|---|
| 504 Gateway Timeout | Increase `proxy_read_timeout` | Slow backends hold connections longer |
| Slowloris susceptibility | Decrease client timeouts | May affect slow legitimate clients |
| High memory from idle connections | Decrease `keepalive_timeout` | More TCP handshakes |
| Many TIME_WAIT sockets | Increase `keepalive_requests` | Connections held longer |
## Keep-Alive Tuning

HTTP keep-alive reuses a TCP connection for multiple requests, avoiding repeated handshake overhead.
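On the client side, two directives control this — a minimal sketch using the same values as the timeout settings above:

```nginx
http {
    keepalive_timeout 65s;      # How long an idle client connection stays open
    keepalive_requests 1000;    # Requests served per connection before closing
}
```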
### Upstream Keep-Alive

Keep-alive to backend servers is equally important:
```nginx
# Nginx upstream keep-alive
upstream backend {
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;

    # Keep connections alive to the upstream
    keepalive 32;               # Idle connection pool size per worker
    keepalive_requests 1000;    # Max requests per connection
    keepalive_timeout 60s;      # Idle timeout
}

server {
    location / {
        proxy_pass http://backend;

        # Required for upstream keep-alive
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

The last two directives matter: Nginx defaults to HTTP/1.0 with `Connection: close` for proxied requests, so without them every upstream connection is torn down after a single request.
## Compression Trade-offs

Compression reduces bandwidth but costs CPU. Balance based on your bottleneck:
```nginx
# Nginx gzip configuration
http {
    gzip on;
    gzip_vary on;
    gzip_proxied any;

    # Compression level (1-9)
    gzip_comp_level 5;      # Sweet spot: good compression, moderate CPU

    # Minimum response size to compress
    gzip_min_length 256;

    # Compress these MIME types
    gzip_types
        text/plain
        text/css
        text/javascript
        application/javascript
        application/json
        application/xml
        image/svg+xml;

    # Pre-compressed files (best of both worlds)
    gzip_static on;         # Serve .gz files if they exist
}
```
| Level | Compression | CPU Cost | Use Case |
|---|---|---|---|
| 1 | ~60% | Low | High-traffic, CPU-limited |
| 5 | ~75% | Medium | General purpose |
| 9 | ~78% | High | Bandwidth-limited, low traffic |
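A quick way to verify that compression is actually applied is to request a page with an `Accept-Encoding` header and inspect the response headers:

```bash
# GET the page, discard the body, and print the response headers;
# expect "Content-Encoding: gzip" for compressible types over gzip_min_length
curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' http://localhost/ | grep -i 'content-encoding'
```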
> **Pre-compression:** For static files, pre-compress at build time (`gzip -9`). Nginx serves the `.gz` file directly with `gzip_static`, getting maximum compression with zero runtime CPU cost.
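A minimal build-step sketch, assuming static assets live under `/var/www/html` (adjust the path and extensions to your site):

```bash
# Create .gz siblings at maximum compression, keeping the originals (-k)
find /var/www/html -type f \
  \( -name '*.html' -o -name '*.css' -o -name '*.js' -o -name '*.svg' \) \
  -exec gzip -9 -k -f {} \;
```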
## Benchmarking Tools

Always measure performance before and after changes. These tools help:

### Apache Benchmark (ab)
```bash
# Install
apt install apache2-utils    # Debian/Ubuntu
brew install httpd           # macOS

# Basic benchmark: 1000 requests, 10 concurrent
ab -n 1000 -c 10 http://localhost/

# With keep-alive
ab -n 1000 -c 10 -k http://localhost/

# POST with data
ab -n 1000 -c 10 -p data.json -T application/json http://localhost/api
```
### wrk (Modern Alternative)
```bash
# Install
apt install wrk    # Debian/Ubuntu
brew install wrk   # macOS

# Basic benchmark: 2 threads, 100 connections, 30 seconds
wrk -t2 -c100 -d30s http://localhost/

# With a Lua script for custom requests
wrk -t2 -c100 -d30s -s post.lua http://localhost/api
```
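One possible `post.lua` for the last command — wrk exposes the request template through a global `wrk` table, so a POST body and headers can be set like this:

```lua
-- Hypothetical post.lua: send a JSON POST instead of the default GET
wrk.method = "POST"
wrk.body   = '{"name": "benchmark"}'
wrk.headers["Content-Type"] = "application/json"
```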
### Key Metrics to Watch
| Metric | What It Tells You | Warning Signs |
|---|---|---|
| Requests/sec | Throughput capacity | Doesn't scale with concurrency |
| Latency (mean) | Average response time | Increases under load |
| Latency (P99) | Worst-case experience | Much higher than mean |
| Failed requests | Errors under load | Any failures |
## Profiling Bottlenecks

Benchmarks show the symptom; profiling finds the cause:

### System-Level Profiling
```bash
# CPU usage by process
top -p $(pgrep -d',' nginx)
htop

# I/O wait
iostat -x 1

# Network connections
ss -s                             # Summary
ss -tuln                          # Listening ports
ss -tn state time-wait | wc -l    # TIME_WAIT count

# File descriptors
ls /proc/$(cat /var/run/nginx.pid)/fd | wc -l    # Open files by Nginx
lsof -p $(cat /var/run/nginx.pid) | wc -l
```
### Nginx Stub Status
```nginx
# Enable stub_status
server {
    listen 8080;

    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}
```

```bash
# Check during a load test
watch -n1 'curl -s localhost:8080/nginx_status'
```

Sample output:

```text
Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
```

`accepts` and `handled` should be equal — a gap means connections are being dropped, usually from hitting a limit. `Waiting` counts idle keep-alive connections.
### Identifying Bottlenecks

As a rule of thumb: workers pegged at 100% CPU point to compression, TLS, or application work; high I/O wait points to disk (logging, or temp files from overflowing buffers); a large TIME_WAIT count points to connection churn that keep-alive can absorb; and file descriptor counts near the limit mean raising `worker_rlimit_nofile` before tuning anything else.
## Linux Kernel Tuning

For high-traffic servers, kernel parameters matter:
```conf
# /etc/sysctl.conf

# TCP memory
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Connection handling
net.core.somaxconn = 65535              # Listen queue size
net.core.netdev_max_backlog = 65535     # Network interface queue
net.ipv4.tcp_max_syn_backlog = 65535    # SYN queue

# TIME_WAIT reduction
net.ipv4.tcp_tw_reuse = 1               # Reuse TIME_WAIT sockets
net.ipv4.tcp_fin_timeout = 15           # Faster FIN timeout

# Keep-alive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3

# File descriptors
fs.file-max = 2097152
```

```bash
# Apply changes
sysctl -p
```
> **Test carefully:** Kernel tuning can destabilize systems. Test in staging first, change one parameter at a time, and monitor for unintended effects.
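Individual parameters can be applied at runtime first and only written to `/etc/sysctl.conf` once they prove out:

```bash
# Apply a single setting immediately (reverts on reboot)
sysctl -w net.core.somaxconn=65535

# Verify the current value
sysctl net.core.somaxconn
```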
## Common Performance Patterns

### Static Site (HTML/CSS/JS/Images)
```nginx
# Optimized for static content
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    multi_accept on;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;

    # Aggressive caching of file metadata
    open_file_cache max=10000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    gzip on;
    gzip_static on;

    keepalive_timeout 65;
    keepalive_requests 1000;
}
```
### Reverse Proxy (API Gateway)
```nginx
# Optimized for proxying
worker_processes auto;
worker_rlimit_nofile 65535;

http {
    # Minimal buffering for real-time responses
    proxy_buffering off;
    # Or tuned buffering for throughput:
    # proxy_buffer_size 8k;
    # proxy_buffers 16 32k;

    upstream api {
        server 127.0.0.1:3000;
        server 127.0.0.1:3001;
        keepalive 64;
    }

    server {
        location /api {
            proxy_pass http://api;
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Appropriate timeouts
            proxy_connect_timeout 5s;
            proxy_read_timeout 60s;
        }
    }
}
```
### WebSocket Support
```nginx
# Optimized for long-lived connections
http {
    # Long timeouts for persistent connections
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;

    # WebSocket upgrade handling
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        location /ws {
            proxy_pass http://websocket_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
        }
    }
}
```
## Summary

- **Measure first** — never tune without benchmarks
- **Workers = CPU cores** — more doesn't help
- **Connections × workers = capacity** — plan for your expected load
- **File descriptors** — raise limits before you need them
- **Buffers** — size them based on your response sizes
- **Keep-alive** — enable for both clients and upstreams
- **Compression level 5** — a good balance for most cases
- **Profile bottlenecks** — CPU, I/O, or network?
- **Benchmark with realistic load** — use wrk or ab with your actual traffic patterns