Linux Troubleshooting Like a Pro — strace, lsof, tcpdump
The app works on staging but fails in production — here's a systematic way to find out why. Every seasoned SRE has a mental decision tree for production incidents, and the tools are always the same: strace to see what a process is doing, lsof to see which files it has open, tcpdump to see what's on the wire, and ss to see socket state. Master these four and you can debug almost anything.
The Troubleshooting Framework
Before reaching for tools, follow this mental model:
- What changed? (deploy, config change, traffic spike, upstream dependency)
- Where is the bottleneck? (CPU, memory, disk, network)
- Narrow down with the right tool for the layer
| Symptom | First Tool | What to Look For |
|---|---|---|
| Process hangs/slow | strace | Blocked system calls |
| "Too many open files" | lsof | File descriptor leaks |
| Connection timeouts | ss, tcpdump | Socket states, packet drops |
| High I/O wait | iostat, iotop | Disk saturation |
| OOM kills | dmesg, /proc/meminfo | Memory pressure |
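Before reaching for the specialized tools in the table, a quick pass over the resource layers often narrows the search. A minimal sketch (the 80% disk cutoff is an arbitrary choice):

```shell
# Sketch: fast first pass over CPU, memory, and disk before deeper tools
uptime                                 # load averages vs. core count
free -h                                # memory and swap pressure
df -h | awk 'NR == 1 || $5+0 > 80'     # filesystems over 80% full (arbitrary cutoff)
```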
strace: See What a Process Is Doing
strace intercepts system calls — the interface between a process and the kernel. If a process is hung, strace tells you exactly which system call it's blocked on.
Attaching to a Running Process
# Attach to a running process (doesn't restart it); quoting $(pidof nginx)
# passes every nginx PID, and strace accepts a whitespace-separated list
sudo strace -p "$(pidof nginx)" -f -e trace=network -t
# Flags explained:
# -p PID : attach to running process
# -f : follow child processes (critical for forking servers)
# -e trace= : filter by syscall category (network, file, process, memory)
# -t : timestamp each line
Real Scenario: Why Is the App Slow?
Your Node.js app has 5-second response times. Is it CPU-bound, waiting on disk, or waiting on a network call?
# Trace the process and summarize time spent in each syscall
sudo strace -p $(pidof node) -c -f -S time
This produces a table showing total time, calls, and errors per syscall. If read or write on a socket dominates, the app is waiting on an external service. If futex dominates, it's waiting on a lock.
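If you redirect that summary to a file, a short awk pass can rank the offenders. A sketch using an invented sample table in place of real `strace -c` output:

```shell
# Sketch: rank syscalls by "% time" from a saved strace -c summary.
# The heredoc below is invented sample output standing in for the real table.
cat > /tmp/strace-summary.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.10    4.812345        1203      4000           read
 30.55    2.367890         591      4005        12 futex
  7.35    0.569000          14     40000           write
EOF
# Skip the header and divider, print "syscall %time", highest first
awk 'NR > 2 && $1 ~ /^[0-9]/ {print $NF, $1}' /tmp/strace-summary.txt | sort -k2 -rn
```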
Finding Why a Process Can't Open a File
# Trace file-related syscalls only
sudo strace -p $(pidof myapp) -e trace=openat,access,stat -f 2>&1 | grep -i "ENOENT\|EACCES"
ENOENT means file not found. EACCES means permission denied. The full path is in the strace output — no guessing required.
Tracing a Command from Start
# Trace an entire command execution, write output to a file
# (-T appends the time spent in each syscall as <seconds>)
strace -f -o /tmp/strace.log -tt -T curl https://api.example.com/health
# Find the slowest syscalls by sorting on the <duration> field
grep '<' /tmp/strace.log | sort -t'<' -k2 -rn | head -20
lsof: Everything Is a File
In Linux, network sockets, pipes, devices — everything is a file descriptor. lsof lists them all.
Finding What Has a File Open
# What process has port 8080 open?
sudo lsof -i :8080
# What files does nginx have open? (pidof -s returns a single PID;
# nginx runs a master plus workers, and these paths need exactly one)
sudo lsof -p $(pidof -s nginx) | head -30
# How many file descriptors is a process using?
sudo ls /proc/$(pidof -s nginx)/fd | wc -l
# Compare against its limit
cat /proc/$(pidof -s nginx)/limits | grep "Max open files"
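Those two numbers are easier to compare side by side. A small sketch that uses the current shell ($$) as a stand-in PID so it runs without root:

```shell
# Sketch: fd usage vs. the soft limit as one number. Uses the current
# shell ($$) so it runs without root; substitute any PID you can read.
pid=$$
open=$(ls /proc/$pid/fd | wc -l)
limit=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
echo "fds in use: $open of $limit"
```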
Real Scenario: Disk Full but Can't Find the File
A classic production issue: df shows the disk is full, but du doesn't account for all the space. The culprit is a deleted file still held open by a process.
# Find deleted files still held open (consuming disk space)
sudo lsof +L1
# The output shows deleted files and which process holds them
# To reclaim space: restart the process, or truncate the fd:
# sudo truncate -s 0 /proc/<pid>/fd/<fd_number>
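You can reproduce the whole situation safely to see what lsof +L1 reports. A sketch using a throwaway file in /tmp and a background tail to hold the descriptor open:

```shell
# Sketch: reproduce the deleted-but-still-open situation in /tmp.
# tail keeps the fd open after rm, so the space is not yet reclaimed.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10 status=none
tail -f /tmp/bigfile >/dev/null 2>&1 &
TAILPID=$!
sleep 1
rm /tmp/bigfile
# The fd symlink now shows "(deleted)"; sudo lsof +L1 would list it too
DELETED=$(ls -l /proc/$TAILPID/fd 2>/dev/null | grep -c deleted || true)
echo "deleted-but-open fds: $DELETED"
kill $TAILPID
```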
Finding Network Connections
# All network connections by a specific process
sudo lsof -i -a -p $(pidof java)
# All connections involving a specific remote host
sudo lsof -i @10.0.0.50
# All connections in a specific state
sudo lsof -i -sTCP:ESTABLISHED
sudo lsof -i -sTCP:CLOSE_WAIT
tcpdump: What's on the Wire
When the application says it sent the request and the server says it never received it, tcpdump tells you who's lying.
Basic Captures
# Capture HTTP traffic on port 80 (options go before the filter expression)
sudo tcpdump -i eth0 -n -A port 80 | head -100
# Capture traffic to a specific host
sudo tcpdump -i any -n host 10.0.0.50
# Capture DNS queries
sudo tcpdump -i any -n -l port 53
# Write to a file for Wireshark analysis
sudo tcpdump -i eth0 -w /tmp/capture.pcap -c 1000
Real Scenario: Connection Timeouts to an Upstream Service
Your app times out connecting to a database. Is the SYN getting through? Is the server responding?
# Capture only TCP handshake packets (SYN, SYN-ACK, RST)
sudo tcpdump -i eth0 -n -tt "host 10.0.0.50 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)"
If you see SYN but no SYN-ACK, the remote server isn't responding (firewall, service down, or route issue). If you see SYN-ACK followed by RST, something is rejecting the connection after the handshake.
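When the capture is saved as text, counting flags makes the diagnosis concrete. A sketch over an invented capture excerpt (addresses and timestamps are made up):

```shell
# Sketch: quantify what the capture shows. The heredoc stands in for a
# saved `tcpdump -n -tt` text capture of the handshake attempt.
cat > /tmp/handshake.log <<'EOF'
1700000001.000000 IP 10.0.0.10.51000 > 10.0.0.50.5432: Flags [S], seq 1, length 0
1700000002.000000 IP 10.0.0.10.51000 > 10.0.0.50.5432: Flags [S], seq 1, length 0
1700000004.000000 IP 10.0.0.10.51000 > 10.0.0.50.5432: Flags [S], seq 1, length 0
EOF
SYNS=$(grep -c 'Flags \[S\],' /tmp/handshake.log || true)
SYNACKS=$(grep -c 'Flags \[S\.\],' /tmp/handshake.log || true)
# Repeated SYNs with no SYN-ACK = retransmission: nothing is answering
echo "SYN: $SYNS, SYN-ACK: $SYNACKS"
```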
Analyzing Application-Level Issues
# Capture and display HTTP request/response headers
sudo tcpdump -i eth0 -A -s 0 'tcp port 8080 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' | grep -E "^(GET|POST|HTTP|Host|Content)"
ss: Socket Statistics
ss is the modern replacement for netstat — faster and more informative.
# All listening TCP sockets with process info
ss -tlnp
# All established connections with timer information
ss -tnop state established
# Count connections per state
ss -s
# Count TIME_WAIT connections (a common issue on high-traffic servers); tail skips the header line
ss -tan state time-wait | tail -n +2 | wc -l
# Find CLOSE_WAIT connections (indicates application bug — not closing sockets)
ss -tanp state close-wait
Real Scenario: Detecting Connection Leaks
If CLOSE_WAIT count keeps growing, your application isn't closing connections properly. The remote end closed, but your app hasn't called close().
# Monitor CLOSE_WAIT connections over time (tail -n +2 drops ss's header from each count)
watch -n5 'echo "CLOSE_WAIT: $(ss -tanp state close-wait | tail -n +2 | wc -l) | ESTABLISHED: $(ss -tan state established | tail -n +2 | wc -l) | TIME_WAIT: $(ss -tan state time-wait | tail -n +2 | wc -l)"'
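When the count keeps growing, the next question is which process owns the leaked sockets. A sketch that aggregates by process name, using invented ss output in place of the real thing:

```shell
# Sketch: attribute CLOSE_WAIT sockets to their owning process. The heredoc
# mimics `sudo ss -tanp state close-wait` output; names and PIDs are invented.
cat > /tmp/closewait.txt <<'EOF'
Recv-Q Send-Q Local Address:Port  Peer Address:Port  Process
1      0      10.0.0.10:8080      10.0.0.99:43210    users:(("myapp",pid=1234,fd=87))
1      0      10.0.0.10:8080      10.0.0.99:43214    users:(("myapp",pid=1234,fd=91))
1      0      10.0.0.10:8080      10.0.0.98:51234    users:(("worker",pid=1300,fd=12))
EOF
# Leaked sockets per process name; the biggest count is your suspect
grep -o '"[^"]*"' /tmp/closewait.txt | sort | uniq -c | sort -rn
```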
/proc Debugging: The Kernel's Window
The /proc filesystem exposes process internals directly.
# What is the process doing right now?
cat /proc/$(pidof myapp)/status | grep -E "State|Threads|VmRSS|VmSize"
# What are its environment variables? (useful for debugging config issues)
cat /proc/$(pidof myapp)/environ | tr '\0' '\n' | grep DATABASE
# What was the exact command that started it?
cat /proc/$(pidof myapp)/cmdline | tr '\0' ' '
# What is its current working directory?
readlink /proc/$(pidof myapp)/cwd
# What are its memory maps? (useful for shared library issues)
cat /proc/$(pidof myapp)/maps | grep libssl
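One more /proc trick worth knowing: wchan reports the kernel function a sleeping process is blocked in, a zero-cost peek before attaching strace.

```shell
# Sketch: which kernel function is the process sleeping in?
# A running (not sleeping) process reports 0. Uses $$ as a stand-in PID.
cat /proc/$$/wchan; echo
```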
dmesg: Kernel Messages
When things go really wrong — OOM kills, hardware errors, filesystem corruption — the kernel tells you in dmesg.
# Recent kernel messages with human-readable timestamps
dmesg -T --level=err,warn | tail -50
# Watch for OOM kills
dmesg -T | grep -i "oom\|killed process\|out of memory"
# Watch for disk errors
dmesg -T | grep -iE "error|fail|I/O" | grep -i "sd[a-z]\|nvme"
# Follow kernel messages in real time
dmesg -Tw
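Once you find an OOM line, the victim's PID, name, and memory footprint are all in it. A sketch that extracts the victim from an invented dmesg line (the PID, name, and sizes are made up):

```shell
# Sketch: pull the OOM victim out of kernel output. The heredoc mimics
# a `dmesg -T` line; all values are invented for illustration.
cat > /tmp/dmesg-sample.txt <<'EOF'
[Mon Jan  1 12:00:00 2024] Out of memory: Killed process 4321 (myapp) total-vm:8123456kB, anon-rss:7000000kB, file-rss:0kB
EOF
grep -o 'Killed process [0-9]* ([^)]*)' /tmp/dmesg-sample.txt
```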
Real-World Debugging Workflow
Here's a complete workflow for debugging a production issue: "The API is returning 502 errors intermittently."
# Step 1: Check if the backend process is running and healthy
systemctl status myapp
ss -tlnp | grep 8080
# Step 2: Check resource pressure
vmstat 1 5 # CPU and memory overview
iostat -xz 1 3 # Disk I/O
free -h # Memory (check for swap usage)
# Step 3: Check for OOM kills or kernel issues
dmesg -T | tail -30
# Step 4: Check application logs (last 5 minutes)
journalctl -u myapp --since "5 minutes ago" --no-pager
# Step 5: Trace what the process is doing
sudo strace -p $(pidof myapp) -f -e trace=network -c
# Step 6: Check socket states for connection issues
ss -tanp | grep 8080 | awk '{print $1}' | sort | uniq -c | sort -rn
# Step 7: Capture actual traffic if needed
sudo tcpdump -i lo port 8080 -w /tmp/debug.pcap -c 500
Each step narrows the problem. By steps 4 and 5 you usually know whether it's a resource issue, an application bug, or a network problem; the remaining steps confirm the root cause.
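The steps above can be wrapped into a one-shot triage script that collects everything into a single report. A sketch with placeholder service name and port, guarded so missing tools are skipped rather than fatal:

```shell
#!/bin/sh
# Sketch of the workflow as a one-shot triage script.
# SERVICE and PORT are placeholders; adjust for your app.
SERVICE="${1:-myapp}"
PORT="${2:-8080}"
OUT="/tmp/triage-$(date +%s).txt"
step() { printf '\n== %s ==\n' "$1" >> "$OUT"; }

: > "$OUT"
step "service status"
{ command -v systemctl >/dev/null && systemctl status "$SERVICE" --no-pager; } >> "$OUT" 2>&1 || true
step "listening sockets on :$PORT"
{ command -v ss >/dev/null && ss -tlnp | grep ":$PORT"; } >> "$OUT" 2>&1 || true
step "memory"
free -h >> "$OUT" 2>&1 || true
step "recent kernel messages"
{ dmesg -T | tail -15; } >> "$OUT" 2>&1 || true
echo "triage report: $OUT"
```

From here, steps 5-7 (strace, ss state counts, tcpdump) are interactive and stay manual.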
Debugging often reveals filesystem issues — next we'll compare ext4, XFS, and Btrfs to help you choose the right filesystem for your workload and understand why your I/O patterns matter.
