Linux Kernel Parameters (sysctl) Every SRE Should Tune
Default kernel settings are for laptops — here's how to tune for production. Every Linux server ships with conservative defaults designed for general-purpose use. If you're running a web server handling thousands of concurrent connections, a database, or a Kubernetes node, those defaults are actively hurting you.
How sysctl Works
The sysctl interface exposes tunable kernel parameters through /proc/sys/. Every parameter can be read and (usually) written at runtime without a reboot.
# Read a parameter
sysctl net.core.somaxconn
# Write a parameter (temporary — lost on reboot)
sudo sysctl -w net.core.somaxconn=65535
# Read directly from procfs
cat /proc/sys/net/core/somaxconn
# Make changes permanent — add to /etc/sysctl.d/99-production.conf
echo "net.core.somaxconn = 65535" | sudo tee -a /etc/sysctl.d/99-production.conf
# Apply all sysctl files
sudo sysctl --system
Always use files in /etc/sysctl.d/ with a numbered prefix — they're loaded in order, and 99-production.conf ensures your settings override everything else.
Network Stack Tuning
This is where the biggest gains live. With defaults, a busy server starts refusing or dropping connections at a few thousand concurrent clients; with tuning, the same hardware can track 100,000 or more.
Socket and Connection Backlog
# Maximum pending connections in the listen queue
# Default: 4096 (was 128 before kernel 5.4)
net.core.somaxconn = 65535
# Maximum queued packets when interface receives faster than kernel processes
net.core.netdev_max_backlog = 65535
# SYN backlog — pending half-open connections
net.ipv4.tcp_max_syn_backlog = 65535
somaxconn is only a ceiling: the kernel uses the smaller of the application's listen() backlog and somaxconn. Nginx, for example, defaults to backlog=511 in its listen directive; raise it to match, or the sysctl change does nothing.
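The interaction can be sketched with shell arithmetic. The values below are illustrative, not read from a live kernel:

```shell
# The kernel silently caps the backlog passed to listen() at somaxconn
somaxconn=65535      # tuned net.core.somaxconn
app_backlog=511      # nginx's default listen backlog
effective=$(( app_backlog < somaxconn ? app_backlog : somaxconn ))
echo "effective listen queue: $effective"
```

Even with somaxconn at 65535, the queue here is 511: the application value is the bottleneck.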
TCP Memory and Buffer Sizes
# TCP socket buffer sizes: min, default, max (in bytes)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Global socket buffer limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# Total TCP memory (in pages, not bytes; a page is typically 4096 bytes — check getconf PAGESIZE)
# Let the kernel auto-tune this; only set if you know your workload
# net.ipv4.tcp_mem = 786432 1048576 1572864
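Because tcp_mem is denominated in pages, it's easy to misjudge the actual memory budget. A quick conversion of the commented-out example thresholds above, assuming the common 4096-byte page size:

```shell
# tcp_mem is in pages; convert to bytes assuming 4 KiB pages
page=4096
low=786432; pressure=1048576; high=1572864
echo "no pressure below:   $(( low * page / 1024 / 1024 )) MiB"
echo "pressure threshold:  $(( pressure * page / 1024 / 1024 )) MiB"
echo "hard limit:          $(( high * page / 1024 / 1024 / 1024 )) GiB"
```

Those example numbers translate to roughly 3 GiB / 4 GiB / 6 GiB, which is why you should only pin them if you actually know your workload's footprint.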
Connection Reuse and TIME_WAIT
On servers that open many outbound connections (proxies, load balancers, API clients), you'll exhaust ephemeral ports because of connections stuck in TIME_WAIT. Here's the fix.
# Reuse TIME_WAIT sockets for new outgoing connections
# (safe, unlike the removed tcp_tw_recycle; requires tcp_timestamps, on by default)
net.ipv4.tcp_tw_reuse = 1
# Ephemeral port range — widen it
net.ipv4.ip_local_port_range = 1024 65535
# Reduce FIN_WAIT2 timeout (default: 60 seconds)
net.ipv4.tcp_fin_timeout = 15
# Enable TCP keepalive tuning
net.ipv4.tcp_keepalive_time = 600 # Start probes after 600s idle (default: 7200)
net.ipv4.tcp_keepalive_intvl = 30 # Probe interval (default: 75)
net.ipv4.tcp_keepalive_probes = 5 # Probes before dropping (default: 9)
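Worked out, these three knobs bound how long a dead peer can occupy a socket: the kernel waits through the idle time, then sends probes at the interval until the probe budget is spent. The arithmetic, using the values above:

```shell
# Worst-case dead-peer detection = keepalive_time + keepalive_intvl * keepalive_probes
tuned=$(( 600 + 30 * 5 ))
stock=$(( 7200 + 75 * 9 ))
echo "tuned:   ${tuned}s  (12.5 minutes)"
echo "default: ${stock}s (~2.2 hours)"
```

A dead connection now gets reaped in 12.5 minutes instead of over two hours.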
| Parameter | Default | Production | Why |
|---|---|---|---|
| tcp_tw_reuse | 0 (2 = loopback-only on newer kernels) | 1 | Reuse TIME_WAIT sockets for outgoing connections |
| ip_local_port_range | 32768 60999 | 1024 65535 | More ephemeral ports |
| tcp_fin_timeout | 60 | 15 | Faster FIN_WAIT2 teardown |
| tcp_keepalive_time | 7200 | 600 | Detect dead connections sooner |
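The port-range row deserves a back-of-the-envelope check. Linux holds TIME_WAIT sockets for a fixed 60 seconds (TCP_TIMEWAIT_LEN, not tunable via sysctl), so the sustainable rate of new outbound connections to a single destination is roughly the port range divided by 60. Illustrative arithmetic:

```shell
# Each TIME_WAIT socket pins an ephemeral port to that destination for 60s
tw_seconds=60
default_ports=$(( 60999 - 32768 + 1 ))
tuned_ports=$(( 65535 - 1024 + 1 ))
echo "default: ~$(( default_ports / tw_seconds )) new conn/s per destination"
echo "tuned:   ~$(( tuned_ports / tw_seconds )) new conn/s per destination"
```

Roughly 470 versus 1075 connections per second before port exhaustion; tcp_tw_reuse raises the ceiling further by recycling those ports early.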
TCP Optimization
# Enable BBR congestion control (much better than cubic for long-distance)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# Enable window scaling and timestamps
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
# Enable selective acknowledgments
net.ipv4.tcp_sack = 1
# Disable slow start after idle (critical for bursty workloads)
net.ipv4.tcp_slow_start_after_idle = 0
BBR congestion control is a game-changer. Developed by Google, it models the path's bottleneck bandwidth and round-trip time instead of backing off on packet loss, so it achieves significantly higher throughput than cubic on lossy or high-latency networks. Pair it with the fq qdisc, which provides the packet pacing BBR relies on.
# Verify BBR is active
sysctl net.ipv4.tcp_congestion_control
# Check available algorithms
sysctl net.ipv4.tcp_available_congestion_control
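If bbr doesn't appear in the available list, the usual cause is that the tcp_bbr module isn't loaded yet (most distros build it as a module rather than into the kernel). A plausible fix, assuming a systemd-based distro:

```shell
# Load the BBR module now, and load it automatically at boot
sudo modprobe tcp_bbr
echo "tcp_bbr" | sudo tee /etc/modules-load.d/bbr.conf
```

After loading the module, re-run sysctl --system so the congestion-control setting takes effect.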
Memory Parameters
Overcommit and OOM Behavior
# Overcommit mode:
# 0 = heuristic (default), 1 = always allow, 2 = strict (never overcommit)
vm.overcommit_memory = 0
# For Redis/fork-heavy workloads:
# vm.overcommit_memory = 1
# OOM killer behavior: whether to panic instead of killing processes
# 0 = invoke the OOM killer (default); 1 = kernel panic on OOM
# (set to 1 only when you'd rather reboot than run degraded)
vm.panic_on_oom = 0
# Control how aggressively the OOM killer targets processes
# Set per-process in /proc/<pid>/oom_score_adj (-1000 to 1000)
# Protect a critical process from the OOM killer
# (sudo echo > file fails because the redirect runs unprivileged; use tee)
echo -1000 | sudo tee /proc/$(pidof -s postgres)/oom_score_adj
# Make a non-critical process a preferred OOM target
echo 500 | sudo tee /proc/$(pidof -s logstash)/oom_score_adj
Swappiness and Cache Pressure
# How aggressively to swap (0 = only to avoid OOM, 100 = swap eagerly)
vm.swappiness = 10
# How aggressively to reclaim inode/dentry cache (default: 100)
# Lower values keep filesystem metadata in memory longer
vm.vfs_cache_pressure = 50
File Descriptor Limits
Every socket, file, and pipe is a file descriptor. Default limits are far too low for servers.
# System-wide maximum file descriptors
fs.file-max = 2097152
# Check current usage
cat /proc/sys/fs/file-nr
# Output: allocated free max
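A sketch of turning that one-line output into a utilization figure. The values are hardcoded samples so the snippet runs anywhere; in production, read the real line from /proc/sys/fs/file-nr:

```shell
# file-nr fields: allocated handles, freed-but-unused (always 0 since 2.6), max
line="1572864 0 2097152"   # sample; live: line=$(cat /proc/sys/fs/file-nr)
set -- $line
allocated=$1; max=$3
pct=$(( allocated * 100 / max ))
echo "file handles: ${allocated}/${max} (${pct}% of max)"
```

Anything sustained above ~80% is a sign to raise fs.file-max or find the descriptor leak.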
But fs.file-max is only half the story. Per-process limits come from ulimits (PAM for login sessions, systemd for services), not sysctl.
# Check current per-process limits
ulimit -n
# Set in /etc/security/limits.conf or /etc/security/limits.d/99-production.conf
# * soft nofile 1048576
# * hard nofile 1048576
# root soft nofile 1048576
# root hard nofile 1048576
# For systemd services, set in the unit file:
# [Service]
# LimitNOFILE=1048576
# Verify a running process's actual limits
cat /proc/$(pidof nginx)/limits | grep "Max open files"
Connection Tracking (nf_conntrack)
If you're running iptables/nftables or Kubernetes (which uses iptables heavily), connection tracking is critical.
# Check current conntrack table usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Increase conntrack table size (default is often 65536)
net.netfilter.nf_conntrack_max = 1048576
# Hash table size (set at module load time, should be conntrack_max / 4)
# Add to /etc/modprobe.d/nf_conntrack.conf:
# options nf_conntrack hashsize=262144
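The hashsize relationship is simple enough to derive rather than memorize, following the conntrack_max / 4 rule of thumb stated above:

```shell
# ~4 conntrack entries per hash bucket keeps lookups fast
conntrack_max=1048576
hashsize=$(( conntrack_max / 4 ))
echo "options nf_conntrack hashsize=$hashsize"
```

If you later raise nf_conntrack_max, recompute hashsize and update the modprobe.d file to match, or lookups degrade into long bucket scans.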
# Reduce conntrack timeouts to free entries faster
# (established default is 432000 seconds = 5 days; one day is plenty)
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 15
When nf_conntrack_count reaches nf_conntrack_max, the kernel drops new connections; the only hint is nf_conntrack: table full, dropping packet in dmesg. This is one of the most common failure modes in Kubernetes clusters under load.
# Monitor conntrack drops in real time
dmesg -w | grep conntrack
# Quick check for drops (the conntrack CLI ships in the conntrack / conntrack-tools package)
conntrack -C # current count
conntrack -S # stats including drops
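A minimal threshold check you could drop into a cron job or a node-exporter textfile script. Sample values are hardcoded so the sketch is self-contained; in production, substitute reads of nf_conntrack_count and nf_conntrack_max:

```shell
# Warn before the conntrack table fills and packets start dropping
count=900000   # live: $(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=1048576    # live: $(cat /proc/sys/net/netfilter/nf_conntrack_max)
pct=$(( count * 100 / max ))
if [ "$pct" -ge 80 ]; then
    echo "WARN: conntrack table at ${pct}% of max — drops imminent"
fi
```

Alerting at 80% leaves headroom to raise nf_conntrack_max before the table actually fills.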
Complete Production sysctl Configuration
# /etc/sysctl.d/99-production.conf
# Network
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_slow_start_after_idle = 0
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# Memory
vm.swappiness = 10
vm.vfs_cache_pressure = 50
vm.overcommit_memory = 0
# File descriptors
fs.file-max = 2097152
# Conntrack
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
# Security hardening
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
# Apply and verify
sudo sysctl --system
sysctl -a | grep -E "somaxconn|swappiness|file-max|conntrack_max"
Validating Changes Under Load
Don't just set and forget. Validate under realistic load.
# Monitor key metrics during load testing
watch -n1 'echo "=== Conntrack ===" && conntrack -C && echo "=== File descriptors ===" && cat /proc/sys/fs/file-nr && echo "=== TCP states ===" && ss -s'
In the next post, we go even deeper — under the hood of containers themselves. We'll build a container from scratch using nothing but namespaces, cgroups, and chroot to understand what Docker is really doing.
