Linux Troubleshooting Like a Pro — strace, lsof, tcpdump
The app works on staging but fails in production — here's a systematic way to find out why. Every seasoned SRE has a mental decision tree for production incidents, and the tools are always the same: strace to see what a process is doing, lsof to see which files it has open, tcpdump to see what's on the wire, and ss to see socket state. Master these four and you can debug almost anything.
The Troubleshooting Framework
Before reaching for tools, follow this mental model:
- What changed? (deploy, config change, traffic spike, upstream dependency)
- Where is the bottleneck? (CPU, memory, disk, network)
- Narrow down with the right tool for the layer
| Symptom | First Tool | What to Look For |
|---|---|---|
| Process hangs/slow | strace | Blocked system calls |
| "Too many open files" | lsof | File descriptor leaks |
| Connection timeouts | ss, tcpdump | Socket states, packet drops |
| High I/O wait | iostat, iotop | Disk saturation |
| OOM kills | dmesg, /proc/meminfo | Memory pressure |
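Before reaching for the specialized tools in the table, a quick pass over the resource layers often narrows the search. A minimal sketch (the 80% disk cutoff is an arbitrary choice):

```shell
# Sketch: fast first pass over CPU, memory, and disk before deeper tools
uptime                                 # load averages vs. core count
free -h                                # memory and swap pressure
df -h | awk 'NR == 1 || $5+0 > 80'     # filesystems over 80% full (arbitrary cutoff)
```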
strace: See What a Process Is Doing
strace intercepts system calls — the interface between a process and the kernel. If a process is hung, strace tells you exactly which system call it's blocked on.
Attaching to a Running Process
# Attach to a running process (doesn't restart it); quoting $(pidof nginx)
# passes every nginx PID, and strace accepts a whitespace-separated list
sudo strace -p "$(pidof nginx)" -f -e trace=network -t
# Flags explained:
# -p PID : attach to running process
# -f : follow child processes (critical for forking servers)
# -e trace= : filter by syscall category (network, file, process, memory)
# -t : timestamp each line
Real Scenario: Why Is the App Slow?
Your Node.js app has 5-second response times. Is it CPU-bound, waiting on disk, or waiting on a network call?
# Trace the process and summarize time spent in each syscall
sudo strace -p $(pidof node) -c -f -S time
This produces a table showing total time, calls, and errors per syscall. If read or write on a socket dominates, the app is waiting on an external service. If futex dominates, it's waiting on a lock.
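If you redirect that summary to a file, a short awk pass can rank the offenders. A sketch using an invented sample table in place of real `strace -c` output:

```shell
# Sketch: rank syscalls by "% time" from a saved strace -c summary.
# The heredoc below is invented sample output standing in for the real table.
cat > /tmp/strace-summary.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.10    4.812345        1203      4000           read
 30.55    2.367890         591      4005        12 futex
  7.35    0.569000          14     40000           write
EOF
# Skip the header and divider, print "syscall %time", highest first
awk 'NR > 2 && $1 ~ /^[0-9]/ {print $NF, $1}' /tmp/strace-summary.txt | sort -k2 -rn
```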
Finding Why a Process Can't Open a File
# Trace file-related syscalls only
sudo strace -p $(pidof myapp) -e trace=openat,access,stat -f 2>&1 | grep -i "ENOENT\|EACCES"
ENOENT means file not found. EACCES means permission denied. The full path is in the strace output — no guessing required.
Tracing a Command from Start
# Trace an entire command execution, write output to a file
# (-T appends the time spent in each syscall as <seconds>)
strace -f -o /tmp/strace.log -tt -T curl https://api.example.com/health
# Find the slowest syscalls by sorting on the <duration> field
grep '<' /tmp/strace.log | sort -t'<' -k2 -rn | head -20
lsof: Everything Is a File
In Linux, network sockets, pipes, devices — everything is a file descriptor. lsof lists them all.
Finding What Has a File Open
# What process has port 8080 open?
sudo lsof -i :8080
# What files does nginx have open? (pidof -s returns a single PID;
# nginx runs a master plus workers, and these paths need exactly one)
sudo lsof -p $(pidof -s nginx) | head -30
# How many file descriptors is a process using?
sudo ls /proc/$(pidof -s nginx)/fd | wc -l
# Compare against its limit
cat /proc/$(pidof -s nginx)/limits | grep "Max open files"
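Those two numbers are easier to compare side by side. A small sketch that uses the current shell ($$) as a stand-in PID so it runs without root:

```shell
# Sketch: fd usage vs. the soft limit as one number. Uses the current
# shell ($$) so it runs without root; substitute any PID you can read.
pid=$$
open=$(ls /proc/$pid/fd | wc -l)
limit=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
echo "fds in use: $open of $limit"
```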
Real Scenario: Disk Full but Can't Find the File
A classic production issue: df shows the disk is full, but du doesn't account for all the space. The culprit is a deleted file still held open by a process.
# Find deleted files still held open (consuming disk space)
sudo lsof +L1
# The output shows deleted files and which process holds them
# To reclaim space: restart the process, or truncate the fd:
# sudo truncate -s 0 /proc/<pid>/fd/<fd_number>
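You can reproduce the whole situation safely to see what lsof +L1 reports. A sketch using a throwaway file in /tmp and a background tail to hold the descriptor open:

```shell
# Sketch: reproduce the deleted-but-still-open situation in /tmp.
# tail keeps the fd open after rm, so the space is not yet reclaimed.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10 status=none
tail -f /tmp/bigfile >/dev/null 2>&1 &
TAILPID=$!
sleep 1
rm /tmp/bigfile
# The fd symlink now shows "(deleted)"; sudo lsof +L1 would list it too
DELETED=$(ls -l /proc/$TAILPID/fd 2>/dev/null | grep -c deleted || true)
echo "deleted-but-open fds: $DELETED"
kill $TAILPID
```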
Finding Network Connections
# All network connections by a specific process
sudo lsof -i -a -p $(pidof java)
# All connections involving a specific remote host
sudo lsof -i @10.0.0.50
# All connections in a specific state
sudo lsof -i -sTCP:ESTABLISHED
sudo lsof -i -sTCP:CLOSE_WAIT
tcpdump: What's on the Wire
When the application says it sent the request and the server says it never received it, tcpdump tells you who's lying.
Basic Captures
# Capture HTTP traffic on port 80 (options go before the filter expression)
sudo tcpdump -i eth0 -n -A port 80 | head -100
# Capture traffic to a specific host
sudo tcpdump -i any -n host 10.0.0.50
# Capture DNS queries
sudo tcpdump -i any -n -l port 53
# Write to a file for Wireshark analysis
sudo tcpdump -i eth0 -w /tmp/capture.pcap -c 1000
Real Scenario: Connection Timeouts to an Upstream Service
Your app times out connecting to a database. Is the SYN getting through? Is the server responding?
# Capture only TCP handshake packets (SYN, SYN-ACK, RST)
sudo tcpdump -i eth0 -n -tt "host 10.0.0.50 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)"
If you see SYN but no SYN-ACK, the remote server isn't responding (firewall, service down, or route issue). If you see SYN-ACK followed by RST, something is rejecting the connection after the handshake.
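When the capture is saved as text, counting flags makes the diagnosis concrete. A sketch over an invented capture excerpt (addresses and timestamps are made up):

```shell
# Sketch: quantify what the capture shows. The heredoc stands in for a
# saved `tcpdump -n -tt` text capture of the handshake attempt.
cat > /tmp/handshake.log <<'EOF'
1700000001.000000 IP 10.0.0.10.51000 > 10.0.0.50.5432: Flags [S], seq 1, length 0
1700000002.000000 IP 10.0.0.10.51000 > 10.0.0.50.5432: Flags [S], seq 1, length 0
1700000004.000000 IP 10.0.0.10.51000 > 10.0.0.50.5432: Flags [S], seq 1, length 0
EOF
SYNS=$(grep -c 'Flags \[S\],' /tmp/handshake.log || true)
SYNACKS=$(grep -c 'Flags \[S\.\],' /tmp/handshake.log || true)
# Repeated SYNs with no SYN-ACK = retransmission: nothing is answering
echo "SYN: $SYNS, SYN-ACK: $SYNACKS"
```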
Analyzing Application-Level Issues
# Capture and display HTTP request/response headers
sudo tcpdump -i eth0 -A -s 0 'tcp port 8080 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' | grep -E "^(GET|POST|HTTP|Host|Content)"
ss: Socket Statistics
ss is the modern replacement for netstat — faster and more informative.
# All listening TCP sockets with process info
ss -tlnp
# All established connections with timer information
ss -tnop state established
# Count connections per state
ss -s
# Count TIME_WAIT connections (a common issue on high-traffic servers); tail skips the header line
ss -tan state time-wait | tail -n +2 | wc -l
# Find CLOSE_WAIT connections (indicates application bug — not closing sockets)
ss -tanp state close-wait
Real Scenario: Detecting Connection Leaks
If CLOSE_WAIT count keeps growing, your application isn't closing connections properly. The remote end closed, but your app hasn't called close().
# Monitor CLOSE_WAIT connections over time (tail -n +2 drops ss's header from each count)
watch -n5 'echo "CLOSE_WAIT: $(ss -tanp state close-wait | tail -n +2 | wc -l) | ESTABLISHED: $(ss -tan state established | tail -n +2 | wc -l) | TIME_WAIT: $(ss -tan state time-wait | tail -n +2 | wc -l)"'
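When the count keeps growing, the next question is which process owns the leaked sockets. A sketch that aggregates by process name, using invented ss output in place of the real thing:

```shell
# Sketch: attribute CLOSE_WAIT sockets to their owning process. The heredoc
# mimics `sudo ss -tanp state close-wait` output; names and PIDs are invented.
cat > /tmp/closewait.txt <<'EOF'
Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
1      0      10.0.0.10:8080      10.0.0.99:43210    users:(("myapp",pid=1234,fd=87))
1      0      10.0.0.10:8080      10.0.0.99:43214    users:(("myapp",pid=1234,fd=91))
1      0      10.0.0.10:8080      10.0.0.98:51234    users:(("worker",pid=1300,fd=12))
EOF
# Leaked sockets per process name; the biggest count is your suspect
grep -o '"[^"]*"' /tmp/closewait.txt | sort | uniq -c | sort -rn
```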
/proc Debugging: The Kernel's Window
The /proc filesystem exposes process internals directly.
# What is the process doing right now?
cat /proc/$(pidof myapp)/status | grep -E "State|Threads|VmRSS|VmSize"
# What are its environment variables? (useful for debugging config issues)
cat /proc/$(pidof myapp)/environ | tr '\0' '\n' | grep DATABASE
# What was the exact command that started it?
cat /proc/$(pidof myapp)/cmdline | tr '\0' ' '
# What is its current working directory?
readlink /proc/$(pidof myapp)/cwd
# What are its memory maps? (useful for shared library issues)
cat /proc/$(pidof myapp)/maps | grep libssl
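One more /proc trick worth knowing: wchan reports the kernel function a sleeping process is blocked in, a zero-cost peek before attaching strace.

```shell
# Sketch: which kernel function is the process sleeping in?
# A running (not sleeping) process reports 0. Uses $$ as a stand-in PID.
cat /proc/$$/wchan; echo
```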
dmesg: Kernel Messages
When things go really wrong — OOM kills, hardware errors, filesystem corruption — the kernel tells you in dmesg.
# Recent kernel messages with human-readable timestamps
dmesg -T --level=err,warn | tail -50
# Watch for OOM kills
dmesg -T | grep -i "oom\|killed process\|out of memory"
# Watch for disk errors
dmesg -T | grep -iE "error|fail|I/O" | grep -i "sd[a-z]\|nvme"
# Follow kernel messages in real time
dmesg -Tw
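Once you find an OOM line, the victim's PID, name, and memory footprint are all in it. A sketch that extracts the victim from an invented dmesg line (the PID, name, and sizes are made up):

```shell
# Sketch: pull the OOM victim out of kernel output. The heredoc mimics
# a `dmesg -T` line; all values are invented for illustration.
cat > /tmp/dmesg-sample.txt <<'EOF'
[Mon Jan  1 12:00:00 2024] Out of memory: Killed process 4321 (myapp) total-vm:8123456kB, anon-rss:7000000kB, file-rss:0kB
EOF
grep -o 'Killed process [0-9]* ([^)]*)' /tmp/dmesg-sample.txt
```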
Real-World Debugging Workflow
Here's a complete workflow for debugging a production issue: "The API is returning 502 errors intermittently."
# Step 1: Check if the backend process is running and healthy
systemctl status myapp
ss -tlnp | grep 8080
# Step 2: Check resource pressure
vmstat 1 5 # CPU and memory overview
iostat -xz 1 3 # Disk I/O
free -h # Memory (check for swap usage)
# Step 3: Check for OOM kills or kernel issues
dmesg -T | tail -30
# Step 4: Check application logs (last 5 minutes)
journalctl -u myapp --since "5 minutes ago" --no-pager
# Step 5: Trace what the process is doing
sudo strace -p $(pidof myapp) -f -e trace=network -c
# Step 6: Check socket states for connection issues
ss -tanp | grep 8080 | awk '{print $1}' | sort | uniq -c | sort -rn
# Step 7: Capture actual traffic if needed
sudo tcpdump -i lo port 8080 -w /tmp/debug.pcap -c 500
Each step narrows the problem. By steps 4 and 5 you usually know whether it's a resource issue, an application bug, or a network problem; the remaining steps confirm the root cause.
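The steps above can be wrapped into a one-shot triage script that collects everything into a single report. A sketch with placeholder service name and port, guarded so missing tools are skipped rather than fatal:

```shell
#!/bin/sh
# Sketch of the workflow as a one-shot triage script.
# SERVICE and PORT are placeholders; adjust for your app.
SERVICE="${1:-myapp}"
PORT="${2:-8080}"
OUT="/tmp/triage-$(date +%s).txt"
step() { printf '\n== %s ==\n' "$1" >> "$OUT"; }

: > "$OUT"
step "service status"
{ command -v systemctl >/dev/null && systemctl status "$SERVICE" --no-pager; } >> "$OUT" 2>&1 || true
step "listening sockets on :$PORT"
{ command -v ss >/dev/null && ss -tlnp | grep ":$PORT"; } >> "$OUT" 2>&1 || true
step "memory"
free -h >> "$OUT" 2>&1 || true
step "recent kernel messages"
{ dmesg -T | tail -15; } >> "$OUT" 2>&1 || true
echo "triage report: $OUT"
```

From here, steps 5-7 (strace, ss state counts, tcpdump) are interactive and stay manual.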
Debugging often reveals filesystem issues — next we'll compare ext4, XFS, and Btrfs to help you choose the right filesystem for your workload and understand why your I/O patterns matter.
