Linux Process Management — ps, top, kill and Beyond
It's 3 AM. Your pager goes off. The production server is crawling. CPU is at 100%. Memory is gone. Something is eating your server alive, and you need to find it and stop it — fast. Knowing how to manage Linux processes isn't optional for a DevOps engineer; it's survival.
Understanding Processes — PIDs, Parents, and States
Every running program in Linux is a process with a unique PID (Process ID). Processes form a tree: every user-space process descends from PID 1, the init process (systemd on most modern distributions), which the kernel starts at boot.
# See the full process tree
pstree -p
# systemd(1)─┬─sshd(1234)───sshd(5678)───bash(5680)───vim(5700)
# ├─nginx(2000)─┬─nginx(2001)
# │ └─nginx(2002)
# └─dockerd(3000)───containerd(3001)
# What's PID 1 on your system?
ps -p 1 -o comm=
# systemd
Processes live in different states:
| State | Code | Meaning |
|---|---|---|
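| Running | R | Running on a CPU or runnable (waiting in the run queue) |
| Sleeping | S | Interruptible sleep, waiting for an event or I/O |
| Disk Sleep | D | Uninterruptible sleep, usually waiting on I/O; ignores signals, including SIGKILL, until the wait ends |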
| Stopped | T | Paused (Ctrl+Z or debugger) |
| Zombie | Z | Finished but parent hasn't collected exit status |
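To see how these states are distributed on a live system, you can tally the first character of each STAT value. This is a minimal sketch that assumes procps `ps`; `count_states` is a hypothetical helper name, not a standard tool.

```shell
# count_states (hypothetical helper): tally process states from STAT values on stdin.
# Only the first character is the state code; later characters are modifier flags.
count_states() {
  cut -c1 | sort | uniq -c | sort -rn
}

# Feed it the STAT column of every process (the trailing "=" suppresses the header).
ps -eo stat= | count_states
```

On a healthy server most processes should be in S. A growing count of D or Z is usually the first hint of an I/O problem or a parent that is not reaping its children.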
ps — Snapshot of Running Processes
The ps command is your first tool for investigation. It accepts three option styles: UNIX (with a dash, as in ps -ef), BSD (no dash, as in ps aux), and GNU long options (as in --sort). Most of the examples here use the BSD form.
# The classic: show all processes with details
ps aux
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
# root 1 0.0 0.1 169432 13256 ? Ss Feb24 0:05 /usr/lib/systemd/systemd
# www-data 2100 85.3 12.4 1024000 512000 ? R 03:01 42:15 /usr/bin/php-fpm
# Find a specific process (the [n] trick stops grep from matching itself)
ps aux | grep '[n]ginx'
# or: pgrep -a nginx
# root 2000 0.0 0.1 65432 5432 ? Ss Feb24 0:00 nginx: master
# www 2001 0.2 0.5 72000 20480 ? S Feb24 1:12 nginx: worker
# Show processes in tree format
ps auxf
# Show specific columns only
ps -eo pid,ppid,user,%cpu,%mem,stat,start,time,comm --sort=-%cpu | head -20
# Find the top 5 memory consumers
ps aux --sort=-%mem | head -6
That --sort=-%cpu flag is gold when you're hunting a CPU hog. The minus sign means descending order.
top and htop — Real-Time Monitoring
While ps gives you a snapshot, top gives you a live view.
# Basic top — press 'q' to quit
top
# Useful top shortcuts:
# P — Sort by CPU usage
# M — Sort by memory usage
# k — Kill a process (enter PID)
# c — Toggle full command path
# 1 — Show individual CPU cores
# H — Show threads
# Run top in batch mode for scripting
top -b -n 1 | head -20
# Filter top to show only one user's processes
top -u www-data
For a much better experience, use htop:
# Install htop
sudo apt install htop # Debian/Ubuntu
sudo dnf install htop # Fedora; on RHEL and its clones, enable the EPEL repository first
# Launch htop
htop
# htop advantages over top:
# - Color-coded CPU/memory bars
# - Mouse support (click to sort, select)
# - Tree view (F5)
# - Search processes (F3)
# - Filter processes (F4)
# - Kill with signal selection (F9)
Kill Signals — Asking vs Telling vs Forcing
Killing a process isn't just kill -9. There are different signals for different situations:
| Signal | Number | Meaning | Use When |
|---|---|---|---|
| SIGHUP | 1 | Hangup / reload config | Reload Nginx, Apache without restart |
| SIGINT | 2 | Interrupt (Ctrl+C) | Graceful stop from terminal |
| SIGQUIT | 3 | Quit with core dump | Debugging crashes |
| SIGTERM | 15 | Terminate gracefully | Default kill — try this first |
| SIGKILL | 9 | Force kill immediately | Last resort only |
| SIGSTOP | 19 | Pause process | Freeze a runaway process temporarily |
| SIGCONT | 18 | Resume paused process | Continue after SIGSTOP |
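The numbers for SIGTERM and SIGKILL are the same everywhere, but some others, SIGSTOP and SIGCONT among them, differ between CPU architectures (the values in the table are the x86/ARM ones). If in doubt, ask the shell, or use signal names instead of numbers. A small sketch using the standard `kill -l`:

```shell
# Translate a signal number to its name on this machine.
kill -l 15    # prints TERM
kill -l 9     # prints KILL

# List every signal the shell knows about.
kill -l
```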
# Always try graceful termination first
kill 2100 # Sends SIGTERM (default)
kill -15 2100 # Same thing, explicit
# Wait a few seconds. Still running?
kill -9 2100 # Nuclear option — SIGKILL
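# Kill by name instead of PID (both send SIGTERM by default)
pkill -f "php-fpm" # -f matches the full command line; preview the matches first with: pgrep -af "php-fpm"
killall nginx # Matches the process name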
# Reload config without restarting (Nginx, Apache, HAProxy)
kill -HUP $(cat /var/run/nginx.pid)
# or
sudo systemctl reload nginx
# Pause a runaway process while you investigate
kill -STOP 2100 # Freeze it
# ... investigate the issue ...
kill -CONT 2100 # Resume it
# or
kill -9 2100 # Kill it
Rule of thumb: Always try SIGTERM (15) first. Give the process 5-10 seconds to clean up. Only use SIGKILL (9) if it refuses to stop. Why? SIGKILL can't be caught — the process dies immediately without cleaning up temp files, closing database connections, or flushing buffers.
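The rule of thumb can be wrapped in a small function so the grace period is never skipped under pressure. This is a sketch rather than a hardened tool: `stop_gracefully` is a hypothetical name, and the 10-second default is an assumption taken from the guidance above.

```shell
# stop_gracefully (hypothetical helper): SIGTERM first, SIGKILL only after a timeout.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
stop_gracefully() {
  pid=$1
  timeout=${2:-10}

  # Ask politely; give up if the signal cannot be sent (no such process, or no permission).
  kill -TERM "$pid" 2>/dev/null || return 1

  # Poll once per second; kill -0 sends nothing and only tests that the PID exists.
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
    timeout=$((timeout - 1))
  done

  # Still alive after the grace period: force it.
  if kill -0 "$pid" 2>/dev/null; then
    echo "PID $pid ignored SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
  fi
  return 0
}
```

For example, `stop_gracefully 2100 5` gives PID 2100 five seconds to exit before forcing it.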
nice and renice — Process Priority
Linux uses nice values from -20 (highest priority) to 19 (lowest priority). The default is 0, and only root can lower a nice value (that is, raise priority).
# Start a CPU-heavy backup with low priority
nice -n 19 tar -czf /backup/full-backup.tar.gz /var/www/
# Check a process's nice value
ps -o pid,ni,comm -p 2100
# Change priority of a running process
sudo renice -n 10 -p 2100 # Lower priority
sudo renice -n -5 -p 2100 # Higher priority (needs root)
# Renice all processes of a user
sudo renice -n 15 -u jenkins
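This is especially useful during business hours. Need to run a big backup or log analysis? Start it at nice 19 so it yields the CPU to production traffic whenever both want to run. Note that nice only affects CPU scheduling; for disk-heavy jobs, ionice (from util-linux) can lower I/O priority as well, where the I/O scheduler supports it.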
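One detail that is easy to miss: the value passed to `nice -n` is an adjustment relative to the current niceness, and the result saturates at 19. A quick way to see this, assuming a coreutils-style `nice` that prints the current niceness when run with no arguments:

```shell
# With no arguments, nice prints the niceness it is running at.
nice                 # current shell's niceness, typically 0

# Run the inner "nice" at the lowest priority and let it report its own value.
nice -n 19 nice      # prints 19

# Adjustments are relative and saturate at 19, so stacking cannot exceed it.
nice -n 10 nice -n 15 nice   # prints 19 when starting from 0
```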
Background Jobs and nohup
You SSH into a server, start a long-running task, and your connection drops. The process dies. Here's how to prevent that.
# Run in background with &
./long-migration.sh &
# [1] 12345
# But it still receives SIGHUP when the connection drops. Use nohup to ignore it:
nohup ./long-migration.sh > migration.log 2>&1 &
# Or use disown to detach an already-running job
./long-migration.sh
# Ctrl+Z to pause
bg # Resume in background
disown -h %1 # Keep the job, but don't send it SIGHUP when the shell exits
# Job control commands
jobs # List background jobs
fg %1 # Bring job 1 to foreground
bg %1 # Send job 1 to background
For production, use screen or tmux instead:
# Start a tmux session
tmux new -s migration
# Run your command
./long-migration.sh
# Detach: Ctrl+B, then D
# Reconnect later:
tmux attach -t migration
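For scripted jobs where an interactive tmux session is not practical, the nohup pattern from earlier can be wrapped so that the PID is captured for later checks. This is a sketch; `run_detached` is a hypothetical helper name.

```shell
# run_detached (hypothetical helper): start a command that ignores SIGHUP,
# send its output to a log file, and print its PID.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
  logfile=$1
  shift
  # stdin comes from /dev/null so the job never waits on the terminal.
  nohup "$@" > "$logfile" 2>&1 < /dev/null &
  echo "$!"
}
```

A typical use would be `pid=$(run_detached migration.log ./long-migration.sh)`, followed later by `kill -0 "$pid" && echo "still running"` to check on it.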
Zombie Processes — The Undead
A zombie process has finished executing but its parent hasn't called wait() to collect its exit status. A zombie uses no CPU and almost no memory, but it still occupies a slot in the process table and holds on to its PID.
# Find zombie processes
ps aux | awk '$8 ~ /^Z/'
# Count zombies
ps aux | awk '$8 ~ /^Z/' | wc -l
# Find the parent of each zombie (the PPID column); works with zero or many zombies
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
# You can't kill a zombie; it's already dead.
# Nudge the parent to reap it, or terminate the parent so init adopts and reaps the zombies
kill -s CHLD <parent_pid> # Ask the parent to reap (only helps if it handles SIGCHLD)
kill <parent_pid> # Terminate the parent if it won't
A handful of zombies is normal. Hundreds of zombies means the parent process has a bug — it's not handling child process exits properly.
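When zombies pile up, the useful question is which parent is responsible. Here is a minimal sketch that groups zombies by parent PID, assuming procps `ps`; `zombie_parents` is a hypothetical helper name.

```shell
# zombie_parents (hypothetical helper): read "PPID STAT" pairs on stdin and
# print each parent PID with the number of zombie children it has not reaped.
zombie_parents() {
  awk '$2 ~ /^Z/ { count[$1]++ }
       END { for (ppid in count) print ppid, count[ppid] }'
}

# Feed it the parent PID and state of every process (headers suppressed with "=").
ps -eo ppid=,stat= | zombie_parents
```

A parent that appears with a large count is the process to inspect or restart.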
The /proc Filesystem — Process X-Ray
Every process gets a directory under /proc/<PID>/ with detailed info. This is where ps and top actually get their data.
# Pick a process to investigate
PID=2100
# What command started it?
cat /proc/$PID/cmdline | tr '\0' ' '
# What environment variables does it see?
cat /proc/$PID/environ | tr '\0' '\n' | sort
# What files does it have open?
ls -la /proc/$PID/fd/ | head -20
# How much memory is it really using?
cat /proc/$PID/status | grep -i vm
# What's its current working directory?
ls -la /proc/$PID/cwd
# What binary is it running?
ls -la /proc/$PID/exe
# System-wide info
cat /proc/loadavg # Load averages
cat /proc/meminfo # Memory details
cat /proc/cpuinfo # CPU details
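The individual lookups above can be combined into a one-shot summary. This is a sketch that reads only standard Linux /proc files; `proc_summary` and the choice of fields are assumptions for illustration.

```shell
# proc_summary (hypothetical helper): print key facts about one process from /proc.
# Usage: proc_summary PID
proc_summary() {
  dir="/proc/$1"
  [ -d "$dir" ] || { echo "no such process: $1" >&2; return 1; }

  # Selected fields from the human-readable status file.
  awk '/^(Name|State|PPid|Threads|VmRSS):/' "$dir/status"

  # Each entry in fd/ is one open file descriptor
  # (reading another user's fd directory may need root).
  echo "OpenFDs: $(ls "$dir/fd" 2>/dev/null | wc -l)"
}

proc_summary $$    # inspect the current shell
```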
Real-World Scenario: Hunt the CPU Hog
Here's the complete workflow for that 3 AM incident:
# Step 1: Quick overview — what's the load?
uptime
# 03:15:00 up 45 days, load average: 12.50, 8.30, 3.10
# Load is 12.5 on a 4-core machine — that's bad
# Step 2: Find the top CPU consumers
ps aux --sort=-%cpu | head -5
# Step 3: Get more details on the offending process
PID=2100
ls -la /proc/$PID/exe
cat /proc/$PID/cmdline | tr '\0' ' '
# Step 4: Check how long it's been running
ps -o pid,etime,pcpu,pmem,comm -p $PID
# Step 5: Freeze it while you investigate
kill -STOP $PID
# Step 6: Decide: restart gracefully or force kill
kill -TERM $PID # Try graceful first
kill -CONT $PID # A stopped process can't act on SIGTERM until it is resumed
sleep 5
ps -p $PID > /dev/null 2>&1 && kill -9 $PID # Force if still alive
# Step 7: Verify system recovered
uptime
free -h
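Step 1 judges the load average against the number of cores, and that arithmetic is easy to get wrong at 3 AM. A small sketch of the calculation follows; `load_per_core` is a hypothetical helper. As a rough guide, a sustained value well above 1.0 per core means tasks are queuing.

```shell
# load_per_core (hypothetical helper): divide a load average by the core count.
# Usage: load_per_core LOAD CORES
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load / cores }'
}

# The example from step 1: load 12.50 on 4 cores.
load_per_core 12.50 4    # prints 3.1

# Live values: 1-minute load from /proc/loadavg, core count from nproc.
load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```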
