Linux Performance Tuning — CPU, Memory, and I/O Optimization
Your server handles 1,000 requests per second — here's how to push it to 10,000. Performance tuning isn't about guessing; it's about measuring, identifying bottlenecks, and making targeted changes to CPU scheduling, memory management, and disk I/O.
Baseline: Know Before You Tune
Before changing anything, capture your baseline. You cannot improve what you haven't measured.
# Capture a 10-second system snapshot
vmstat 1 10
The key columns to watch: r (run queue — processes waiting for CPU), b (blocked on I/O), si/so (swap in/out — should be zero on healthy systems), and wa (I/O wait percentage).
# CPU utilization per core — critical for spotting single-core bottlenecks
mpstat -P ALL 1 5
If one core is at 100% while others are idle, your workload isn't parallelized — no amount of kernel tuning will fix that.
# Memory overview — the real picture, not just 'free'
grep -E "MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|Dirty|Writeback" /proc/meminfo
MemAvailable is the number you care about — it accounts for reclaimable cache. If MemFree is low but MemAvailable is healthy, your system is fine.
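Those meminfo fields can be reduced to one health number. Here is a minimal sketch (the helper name is my own, not a standard tool) that reports MemAvailable as an integer percentage of MemTotal; it reads meminfo-format text on stdin, so it works on a saved snapshot as well as the live file:

```shell
# mem_available_pct: print MemAvailable as an integer percentage of MemTotal.
# Feed it /proc/meminfo (or any saved copy of it) on stdin.
mem_available_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d\n", a * 100 / t}'
}

# Usage:
#   mem_available_pct < /proc/meminfo
```

As a rough rule of thumb, a value that stays in the single digits under normal load is worth investigating before it turns into swap activity.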
CPU Performance Tuning
CPU Frequency Governors
Modern CPUs dynamically adjust their clock speed. The governor policy determines how aggressively this happens.
# Check current governor for all CPUs
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# List available governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# Set performance governor (max frequency, no ramp-up latency)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo "performance" | sudo tee "$cpu"
done
| Governor | Behavior | Use Case |
|---|---|---|
| performance | Always max frequency | Latency-sensitive workloads, databases |
| powersave | Always min frequency | Idle servers, cost optimization |
| ondemand | Ramps up on load | General purpose (legacy) |
| schedutil | Kernel scheduler-driven | Modern default, good balance |
For production databases and real-time systems, performance is non-negotiable. The microseconds saved on frequency ramp-up matter at scale.
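It is worth verifying that every core actually accepted the governor, since cpufreq drivers differ in which governors they expose. A small sketch (the function name is my own) that reads one governor per line on stdin and succeeds only if all of them match:

```shell
# all_cores_using: exit 0 only if every input line equals the wanted governor.
all_cores_using() {
  want="$1"
  awk -v want="$want" '$0 != want { bad++ } END { exit bad > 0 }'
}

# Usage:
#   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor \
#     | all_cores_using performance && echo "all cores on performance"
```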
CPU Affinity and Isolation
Pin critical processes to specific cores and keep the kernel off them.
# Isolate CPUs 2-7 from the kernel scheduler (add to GRUB_CMDLINE_LINUX)
# isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7
# Pin a process to CPUs 2-3
taskset -c 2-3 ./your-application
# Check a process's CPU affinity mask (pidof can return several PIDs; -s takes one)
taskset -p "$(pidof -s nginx)"
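Note that taskset -p reports the allowed-CPU mask, not the core a process is on right now. For the latter, read the PSR column from ps ($$ below is this shell's PID, a stand-in for your real process):

```shell
# PSR is the processor the task last ran on.
ps -o pid,psr,comm -p $$
```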
Memory Optimization
Swappiness and Dirty Ratios
The kernel's default memory behavior is tuned for desktop workloads. Production servers need different settings.
# Check current values
sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs
# Production tuning for database servers
sudo sysctl -w vm.swappiness=10 # Prefer keeping pages in RAM (default: 60)
sudo sysctl -w vm.dirty_ratio=40 # Max dirty pages before synchronous flush (default: 20)
sudo sysctl -w vm.dirty_background_ratio=10 # Start background flush at 10% (default: 10)
sudo sysctl -w vm.dirty_expire_centisecs=3000 # Flush dirty pages after 30 seconds (default: 3000)
| Parameter | Default | Database Server | Web Server |
|---|---|---|---|
| vm.swappiness | 60 | 10 | 10 |
| vm.dirty_ratio | 20 | 40 | 20 |
| vm.dirty_background_ratio | 10 | 10 | 5 |
| vm.overcommit_memory | 0 | 2 | 0 |
For Redis specifically, set vm.overcommit_memory=1 — Redis forks for background saves and needs the kernel to allow overcommit.
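Keep in mind that sysctl -w changes are runtime-only and vanish on reboot. To make them stick, drop them into /etc/sysctl.d (the filename below is my own convention; the values mirror the database-server column above):

```shell
sudo tee /etc/sysctl.d/99-db-tuning.conf >/dev/null <<'EOF'
vm.swappiness = 10
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
EOF
sudo sysctl --system   # reload every sysctl drop-in, including the new file
```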
Transparent Huge Pages (THP)
THP can cause latency spikes in database workloads due to compaction overhead.
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable THP for database servers (Redis, MongoDB, PostgreSQL all recommend this)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
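The THP setting also resets on reboot. One common way to persist it is a oneshot systemd unit (a sketch; the unit name and paths are assumptions, adapt to your distro):

```shell
sudo tee /etc/systemd/system/disable-thp.service >/dev/null <<'EOF'
[Unit]
Description=Disable Transparent Huge Pages
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now disable-thp.service
```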
NUMA Awareness
On multi-socket servers, memory access latency depends on which socket owns the memory. Ignoring NUMA can halve your throughput.
# Check NUMA topology
numactl --hardware
# Run a process on NUMA node 0 (local memory only)
numactl --cpunodebind=0 --membind=0 ./your-application
# Check per-node memory stats
numastat -c
If numastat shows high other_node hits, your processes are accessing remote memory — bind them to the correct NUMA node.
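To quantify this, the kernel also exports per-node counters in /sys/devices/system/node/nodeN/numastat. A small sketch (the helper name is my own) that turns them into a remote-allocation percentage; it reads name/value pairs on stdin so it can be run against a saved copy:

```shell
# numa_miss_pct: numa_miss as an integer percentage of all allocations.
numa_miss_pct() {
  awk '$1 == "numa_hit" {h=$2} $1 == "numa_miss" {m=$2} END {printf "%d\n", m * 100 / (h + m)}'
}

# Usage:
#   numa_miss_pct < /sys/devices/system/node/node0/numastat
```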
I/O Optimization
I/O Schedulers
The I/O scheduler determines how disk requests are ordered and merged.
# Check current scheduler per device
cat /sys/block/sda/queue/scheduler
# Available schedulers on modern kernels
# [mq-deadline] kyber bfq none
# Set scheduler for NVMe (none is best — the device has its own scheduler)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# Set scheduler for spinning disks
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
| Scheduler | Best For | Why |
|---|---|---|
| none | NVMe SSDs | No overhead, device handles scheduling |
| mq-deadline | SATA SSDs, HDDs | Enforces request deadlines, prevents starvation |
| bfq | Desktop, interactive workloads | Fair bandwidth distribution |
| kyber | Fast SSDs, low-latency targets | Lightweight, latency-focused |
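The table above collapses into a tiny device-name rule. A sketch (the function name is my own; it assumes NVMe namespaces are named nvme*) for picking and applying a scheduler per device:

```shell
# pick_scheduler: map a block device name to a scheduler per the table above.
pick_scheduler() {
  case "$1" in
    nvme*) echo none ;;        # NVMe: the device schedules for itself
    *)     echo mq-deadline ;; # SATA SSDs and HDDs
  esac
}

# Apply to every block device (needs root):
#   for q in /sys/block/*/queue/scheduler; do
#     dev=${q#/sys/block/}; dev=${dev%%/*}
#     echo "$(pick_scheduler "$dev")" | sudo tee "$q" >/dev/null
#   done
```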
Read-Ahead and Queue Depth
# Check current read-ahead (in 512-byte sectors)
blockdev --getra /dev/sda
# Increase read-ahead for sequential workloads (e.g., Kafka, log processing)
sudo blockdev --setra 4096 /dev/sda # 2MB read-ahead
# Check and set queue depth for NVMe
cat /sys/block/nvme0n1/queue/nr_requests
echo 1024 | sudo tee /sys/block/nvme0n1/queue/nr_requests
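Because blockdev --setra takes 512-byte sectors, the read-ahead arithmetic above is easy to get wrong. A one-line helper (my own, not part of blockdev) that converts KiB to sectors:

```shell
# kib_to_sectors: convert KiB to 512-byte sectors for blockdev --setra.
kib_to_sectors() { echo $(( $1 * 1024 / 512 )); }

# 2 MB read-ahead, equivalent to --setra 4096:
#   sudo blockdev --setra "$(kib_to_sectors 2048)" /dev/sda
```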
Monitoring I/O in Real Time
# Per-device I/O stats with 1-second intervals
iostat -xz 1 5
Key columns: %util (device saturation — above 80% is a problem), await (average request time in ms), r_await/w_await (read/write latency), and aqu-sz (average queue length).
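Eyeballing those columns across many devices gets tedious, so a quick filter helps. A sketch (the function name is my own; it assumes %util is the last field, which holds for recent sysstat versions but should be verified against your build):

```shell
# flag_saturated: print "device %util" for device lines whose last field
# exceeds the given threshold. Skips headers and the CPU-stats lines.
flag_saturated() {
  awk -v max="$1" 'NF > 2 && $1 ~ /^[a-z]/ && $1 != "Device" && $NF + 0 > max { print $1, $NF }'
}

# Usage:
#   iostat -xz 1 5 | flag_saturated 80
```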
Putting It All Together: A Production Tuning Script
#!/bin/bash
# production-tune.sh — Apply performance tuning for a web/API server
# Writes to /sys and sysctl, so it must run as root.
set -u
shopt -s nullglob   # skip loops cleanly when a device class is absent
if [ "$(id -u)" -ne 0 ]; then
    echo "error: run as root" >&2
    exit 1
fi
# CPU: performance governor
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo "performance" > "$gov" 2>/dev/null
done
# Memory
sysctl -w vm.swappiness=10
sysctl -w vm.dirty_ratio=20
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.overcommit_memory=0
# Disable THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# I/O: none for NVMe, mq-deadline for SATA
for dev in /sys/block/nvme*; do
echo none > "$dev/queue/scheduler" 2>/dev/null
done
for dev in /sys/block/sd*; do
echo mq-deadline > "$dev/queue/scheduler" 2>/dev/null
done
# Network: increase socket buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
echo "Performance tuning applied."
Validating Your Changes
After tuning, run a real benchmark — not a synthetic one.
# Use your actual workload, but here's a quick stress test
# Install stress-ng for controlled testing
stress-ng --cpu 4 --vm 2 --vm-bytes 1G --io 4 --timeout 60s --metrics-brief
Watch vmstat 1, iostat -xz 1, and mpstat -P ALL 1 in separate terminals during the test. The numbers tell you whether your tuning had the intended effect.
Next up: we dive into the specific kernel parameters (sysctl) that transform a default Linux install into a production-grade server — covering network stack tuning, connection tracking, and file descriptor limits.
