Linux Kernel Parameters (sysctl) Every SRE Should Tune
Default kernel settings are for laptops — here's how to tune for production. Every Linux server ships with conservative defaults designed for general-purpose use. If you're running a web server handling thousands of concurrent connections, a database, or a Kubernetes node, those defaults are actively hurting you.
How sysctl Works
The sysctl interface exposes tunable kernel parameters through /proc/sys/. Every parameter can be read and (usually) written at runtime without a reboot.
# Read a parameter
sysctl net.core.somaxconn
# Write a parameter (temporary — lost on reboot)
sudo sysctl -w net.core.somaxconn=65535
# Read directly from procfs
cat /proc/sys/net/core/somaxconn
# Make changes permanent — add to /etc/sysctl.d/99-production.conf
echo "net.core.somaxconn = 65535" | sudo tee -a /etc/sysctl.d/99-production.conf
# Apply all sysctl files
sudo sysctl --system
Always use files in /etc/sysctl.d/ with a numbered prefix — they're loaded in order, and 99-production.conf ensures your settings override everything else.
Network Stack Tuning
This is where the biggest gains live. With defaults, a busy server starts refusing or dropping connections at a few thousand concurrent clients; with tuning, the same hardware can track 100,000 or more.
Socket and Connection Backlog
# Maximum pending connections in the listen queue
# Default: 4096 (was 128 before kernel 5.4)
net.core.somaxconn = 65535
# Maximum queued packets when interface receives faster than kernel processes
net.core.netdev_max_backlog = 65535
# SYN backlog — pending half-open connections
net.ipv4.tcp_max_syn_backlog = 65535
somaxconn is only a ceiling: the kernel uses the smaller of the application's listen() backlog and somaxconn. Nginx, for example, defaults to backlog=511 in its listen directive; raise it to match, or the sysctl change does nothing.
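The interaction can be sketched with shell arithmetic. The values below are illustrative, not read from a live kernel:

```shell
# The kernel silently caps the backlog passed to listen() at somaxconn
somaxconn=65535      # tuned net.core.somaxconn
app_backlog=511      # nginx's default listen backlog
effective=$(( app_backlog < somaxconn ? app_backlog : somaxconn ))
echo "effective listen queue: $effective"
```

Even with somaxconn at 65535, the queue here is 511: the application value is the bottleneck.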
TCP Memory and Buffer Sizes
# TCP socket buffer sizes: min, default, max (in bytes)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Global socket buffer limits
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# Total TCP memory (in pages, not bytes; a page is typically 4096 bytes — check getconf PAGESIZE)
# Let the kernel auto-tune this; only set if you know your workload
# net.ipv4.tcp_mem = 786432 1048576 1572864
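Because tcp_mem is denominated in pages, it's easy to misjudge the actual memory budget. A quick conversion of the commented-out example thresholds above, assuming the common 4096-byte page size:

```shell
# tcp_mem is in pages; convert to bytes assuming 4 KiB pages
page=4096
low=786432; pressure=1048576; high=1572864
echo "no pressure below:   $(( low * page / 1024 / 1024 )) MiB"
echo "pressure threshold:  $(( pressure * page / 1024 / 1024 )) MiB"
echo "hard limit:          $(( high * page / 1024 / 1024 / 1024 )) GiB"
```

Those example numbers translate to roughly 3 GiB / 4 GiB / 6 GiB, which is why you should only pin them if you actually know your workload's footprint.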
Connection Reuse and TIME_WAIT
On servers that open many outbound connections (proxies, load balancers, API clients), you'll exhaust ephemeral ports because of connections stuck in TIME_WAIT. Here's the fix.
# Reuse TIME_WAIT sockets for new outgoing connections
# (safe, unlike the removed tcp_tw_recycle; requires tcp_timestamps, on by default)
net.ipv4.tcp_tw_reuse = 1
# Ephemeral port range — widen it
net.ipv4.ip_local_port_range = 1024 65535
# Reduce FIN_WAIT2 timeout (default: 60 seconds)
net.ipv4.tcp_fin_timeout = 15
# Enable TCP keepalive tuning
net.ipv4.tcp_keepalive_time = 600 # Start probes after 600s idle (default: 7200)
net.ipv4.tcp_keepalive_intvl = 30 # Probe interval (default: 75)
net.ipv4.tcp_keepalive_probes = 5 # Probes before dropping (default: 9)
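Worked out, these three knobs bound how long a dead peer can occupy a socket: the kernel waits through the idle time, then sends probes at the interval until the probe budget is spent. The arithmetic, using the values above:

```shell
# Worst-case dead-peer detection = keepalive_time + keepalive_intvl * keepalive_probes
tuned=$(( 600 + 30 * 5 ))
stock=$(( 7200 + 75 * 9 ))
echo "tuned:   ${tuned}s  (12.5 minutes)"
echo "default: ${stock}s (~2.2 hours)"
```

A dead connection now gets reaped in 12.5 minutes instead of over two hours.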
| Parameter | Default | Production | Why |
|---|---|---|---|
| tcp_tw_reuse | 0 (2 = loopback-only on newer kernels) | 1 | Reuse TIME_WAIT sockets for outgoing connections |
| ip_local_port_range | 32768 60999 | 1024 65535 | More ephemeral ports |
| tcp_fin_timeout | 60 | 15 | Faster FIN_WAIT2 teardown |
| tcp_keepalive_time | 7200 | 600 | Detect dead connections sooner |
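The port-range row deserves a back-of-the-envelope check. Linux holds TIME_WAIT sockets for a fixed 60 seconds (TCP_TIMEWAIT_LEN, not tunable via sysctl), so the sustainable rate of new outbound connections to a single destination is roughly the port range divided by 60. Illustrative arithmetic:

```shell
# Each TIME_WAIT socket pins an ephemeral port to that destination for 60s
tw_seconds=60
default_ports=$(( 60999 - 32768 + 1 ))
tuned_ports=$(( 65535 - 1024 + 1 ))
echo "default: ~$(( default_ports / tw_seconds )) new conn/s per destination"
echo "tuned:   ~$(( tuned_ports / tw_seconds )) new conn/s per destination"
```

Roughly 470 versus 1075 connections per second before port exhaustion; tcp_tw_reuse raises the ceiling further by recycling those ports early.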
TCP Optimization
# Enable BBR congestion control (much better than cubic for long-distance)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# Enable window scaling and timestamps
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
# Enable selective acknowledgments
net.ipv4.tcp_sack = 1
# Disable slow start after idle (critical for bursty workloads)
net.ipv4.tcp_slow_start_after_idle = 0
BBR congestion control is a game-changer. Developed by Google, it models the path's bottleneck bandwidth and round-trip time instead of backing off on packet loss, so it achieves significantly higher throughput than cubic on lossy or high-latency networks. Pair it with the fq qdisc, which provides the packet pacing BBR relies on.
# Verify BBR is active
sysctl net.ipv4.tcp_congestion_control
# Check available algorithms
sysctl net.ipv4.tcp_available_congestion_control
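If bbr doesn't appear in the available list, the usual cause is that the tcp_bbr module isn't loaded yet (most distros build it as a module rather than into the kernel). A plausible fix, assuming a systemd-based distro:

```shell
# Load the BBR module now, and load it automatically at boot
sudo modprobe tcp_bbr
echo "tcp_bbr" | sudo tee /etc/modules-load.d/bbr.conf
```

After loading the module, re-run sysctl --system so the congestion-control setting takes effect.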
Memory Parameters
Overcommit and OOM Behavior
# Overcommit mode:
# 0 = heuristic (default), 1 = always allow, 2 = strict (never overcommit)
vm.overcommit_memory = 0
# For Redis/fork-heavy workloads:
# vm.overcommit_memory = 1
# OOM killer behavior: whether to panic instead of killing processes
# 0 = invoke the OOM killer (default); 1 = kernel panic on OOM
# (set to 1 only when you'd rather reboot than run degraded)
vm.panic_on_oom = 0
# Control how aggressively the OOM killer targets processes
# Set per-process in /proc/<pid>/oom_score_adj (-1000 to 1000)
# Protect a critical process from the OOM killer
# (sudo echo > file fails because the redirect runs unprivileged; use tee)
echo -1000 | sudo tee /proc/$(pidof -s postgres)/oom_score_adj
# Make a non-critical process a preferred OOM target
echo 500 | sudo tee /proc/$(pidof -s logstash)/oom_score_adj
Swappiness and Cache Pressure
# How aggressively to swap (0 = only to avoid OOM, 100 = swap eagerly)
vm.swappiness = 10
# How aggressively to reclaim inode/dentry cache (default: 100)
# Lower values keep filesystem metadata in memory longer
vm.vfs_cache_pressure = 50
File Descriptor Limits
Every socket, file, and pipe is a file descriptor. Default limits are far too low for servers.
# System-wide maximum file descriptors
fs.file-max = 2097152
# Check current usage
cat /proc/sys/fs/file-nr
# Output: allocated free max
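A sketch of turning that one-line output into a utilization figure. The values are hardcoded samples so the snippet runs anywhere; in production, read the real line from /proc/sys/fs/file-nr:

```shell
# file-nr fields: allocated handles, freed-but-unused (always 0 since 2.6), max
line="1572864 0 2097152"   # sample; live: line=$(cat /proc/sys/fs/file-nr)
set -- $line
allocated=$1; max=$3
pct=$(( allocated * 100 / max ))
echo "file handles: ${allocated}/${max} (${pct}% of max)"
```

Anything sustained above ~80% is a sign to raise fs.file-max or find the descriptor leak.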
But fs.file-max is only half the story. Per-process limits come from ulimits (PAM for login sessions, systemd for services), not sysctl.
# Check current per-process limits
ulimit -n
# Set in /etc/security/limits.conf or /etc/security/limits.d/99-production.conf
# * soft nofile 1048576
# * hard nofile 1048576
# root soft nofile 1048576
# root hard nofile 1048576
# For systemd services, set in the unit file:
# [Service]
# LimitNOFILE=1048576
# Verify a running process's actual limits
cat /proc/$(pidof nginx)/limits | grep "Max open files"
Connection Tracking (nf_conntrack)
If you're running iptables/nftables or Kubernetes (which uses iptables heavily), connection tracking is critical.
# Check current conntrack table usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Increase conntrack table size (default is often 65536)
net.netfilter.nf_conntrack_max = 1048576
# Hash table size (set at module load time, should be conntrack_max / 4)
# Add to /etc/modprobe.d/nf_conntrack.conf:
# options nf_conntrack hashsize=262144
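The hashsize relationship is simple enough to derive rather than memorize, following the conntrack_max / 4 rule of thumb stated above:

```shell
# ~4 conntrack entries per hash bucket keeps lookups fast
conntrack_max=1048576
hashsize=$(( conntrack_max / 4 ))
echo "options nf_conntrack hashsize=$hashsize"
```

If you later raise nf_conntrack_max, recompute hashsize and update the modprobe.d file to match, or lookups degrade into long bucket scans.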
# Reduce conntrack timeouts to free entries faster
# (established default is 432000 seconds = 5 days; one day is plenty)
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 15
When nf_conntrack_count reaches nf_conntrack_max, the kernel drops new connections; the only hint is nf_conntrack: table full, dropping packet in dmesg. This is one of the most common failure modes in Kubernetes clusters under load.
# Monitor conntrack drops in real time
dmesg -w | grep conntrack
# Quick check for drops (the conntrack CLI ships in the conntrack / conntrack-tools package)
conntrack -C # current count
conntrack -S # stats including drops
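A minimal threshold check you could drop into a cron job or a node-exporter textfile script. Sample values are hardcoded so the sketch is self-contained; in production, substitute reads of nf_conntrack_count and nf_conntrack_max:

```shell
# Warn before the conntrack table fills and packets start dropping
count=900000   # live: $(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=1048576    # live: $(cat /proc/sys/net/netfilter/nf_conntrack_max)
pct=$(( count * 100 / max ))
if [ "$pct" -ge 80 ]; then
    echo "WARN: conntrack table at ${pct}% of max — drops imminent"
fi
```

Alerting at 80% leaves headroom to raise nf_conntrack_max before the table actually fills.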
Complete Production sysctl Configuration
# /etc/sysctl.d/99-production.conf
# Network
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_slow_start_after_idle = 0
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# Memory
vm.swappiness = 10
vm.vfs_cache_pressure = 50
vm.overcommit_memory = 0
# File descriptors
fs.file-max = 2097152
# Conntrack
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
# Security hardening
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
# Apply and verify
sudo sysctl --system
sysctl -a | grep -E "somaxconn|swappiness|file-max|conntrack_max"
Validating Changes Under Load
Don't just set and forget. Validate under realistic load.
# Monitor key metrics during load testing
watch -n1 'echo "=== Conntrack ===" && conntrack -C && echo "=== File descriptors ===" && cat /proc/sys/fs/file-nr && echo "=== TCP states ===" && ss -s'
In the next post, we go even deeper — under the hood of containers themselves. We'll build a container from scratch using nothing but namespaces, cgroups, and chroot to understand what Docker is really doing.
