Linux Process Management — ps, top, kill and Beyond

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

It's 3 AM. Your pager goes off. The production server is crawling. CPU is at 100%. Memory is gone. Something is eating your server alive, and you need to find it and stop it — fast. Knowing how to manage Linux processes isn't optional for a DevOps engineer; it's survival.

Understanding Processes — PIDs, Parents, and States

Every running program in Linux is a process with a unique PID (Process ID). Processes form a tree: every user-space process has a parent except PID 1, the init process the kernel starts at boot (systemd on most modern distributions).

# See the full process tree
pstree -p
# systemd(1)─┬─sshd(1234)───sshd(5678)───bash(5680)───vim(5700)
#            ├─nginx(2000)─┬─nginx(2001)
#            │             └─nginx(2002)
#            └─dockerd(3000)───containerd(3001)

# What's PID 1 on your system?
ps -p 1 -o comm=
# systemd
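
The parent links in that tree can also be followed by hand. A minimal sketch, assuming the procps `ps` found on most distributions; `ancestry` is a hypothetical helper name, not a standard command. It prints a process and each of its ancestors until it runs out of parents:

```shell
# Print "PID COMMAND" for a process and each of its ancestors.
ancestry() {
    pid=${1:-$$}
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        # A trailing "=" after a column name suppresses the header line
        printf '%s %s\n' "$pid" "$(ps -o comm= -p "$pid")"
        # Move up one level; PID 1 reports a parent of 0, which ends the loop
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}

ancestry $$
```

Run from an SSH session, the output should resemble one branch of the pstree listing read bottom-up: your shell first, then sshd, then PID 1. Inside a container the chain is usually much shorter.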

Processes live in different states:

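| State | Code | Meaning |
|---|---|---|
| Running | R | Running on a CPU or runnable (waiting in the run queue) |
| Sleeping | S | Interruptible sleep, waiting for an event or I/O |
| Disk Sleep | D | Uninterruptible sleep, usually waiting on I/O |
| Stopped | T | Paused by a job-control signal (Ctrl+Z) or a debugger |
| Zombie | Z | Finished but parent hasn't collected exit status |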
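
To see how these states are distributed on a live system, the first letter of the STAT column can be tallied. A small sketch, assuming procps `ps`; `state_counts` is a hypothetical helper name:

```shell
# Count processes by the first letter of their STAT column.
state_counts() {
    ps -e -o stat= | awk '{ print substr($1, 1, 1) }' | sort | uniq -c | sort -rn
}

state_counts
```

On a healthy server most processes are in S. Recent kernels may also report I (idle kernel threads), which is not in the table above. A growing count of D or Z is usually the number worth investigating.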

ps — Snapshot of Running Processes

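The ps command is your first tool for investigation. It accepts two major syntax styles: BSD options without a dash (ps aux) and UNIX options with a dash (ps -ef, ps -eo ...). The BSD-style ps aux is the form you'll see most often in day-to-day troubleshooting.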

# The classic: show all processes with details
ps aux
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
# root 1 0.0 0.1 169432 13256 ? Ss Feb24 0:05 /usr/lib/systemd/systemd
# www-data 2100 85.3 12.4 1024000 512000 ? R 03:01 42:15 /usr/bin/php-fpm

# Find a specific process (the [n] keeps grep from matching its own command line)
ps aux | grep '[n]ginx'
# root 2000 0.0 0.1 65432 5432 ? Ss Feb24 0:00 nginx: master
# www 2001 0.2 0.5 72000 20480 ? S Feb24 1:12 nginx: worker

# Show processes in tree format
ps auxf

# Show specific columns only
ps -eo pid,ppid,user,%cpu,%mem,stat,start,time,comm --sort=-%cpu | head -20

# Find the top 5 memory consumers
ps aux --sort=-%mem | head -6

That --sort=-%cpu flag is gold when you're hunting a CPU hog. The minus sign means descending order.
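
When a script needs only the PIDs, pgrep (shipped alongside ps in procps on most distributions) can replace the ps-and-grep pipeline. A brief sketch; `pids_of` and `pids_of_user` are hypothetical helper names:

```shell
# Print PIDs of processes whose name is exactly $1.
pids_of() {
    pgrep -x "$1"
}

# Same lookup, restricted to one user: pids_of_user NAME USER
pids_of_user() {
    pgrep -x -u "$2" "$1"
}

sleep 30 &          # throwaway process to look up
pids_of sleep       # the list includes the PID stored in $!
kill $!             # clean up
```

The -x flag requires an exact name match, which avoids surprises such as a search for "ssh" also returning sshd.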

top and htop — Real-Time Monitoring

While ps gives you a snapshot, top gives you a live view.

# Basic top — press 'q' to quit
top

# Useful top shortcuts:
# P — Sort by CPU usage
# M — Sort by memory usage
# k — Kill a process (enter PID)
# c — Toggle full command path
# 1 — Show individual CPU cores
# H — Show threads

# Run top in batch mode for scripting
top -b -n 1 | head -20

# Filter top to show only one user's processes
top -u www-data

For a much better experience, use htop:

# Install htop
sudo apt install htop # Debian/Ubuntu
sudo dnf install htop # Fedora (on RHEL, enable the EPEL repository first)

# Launch htop
htop

# htop advantages over top:
# - Color-coded CPU/memory bars
# - Mouse support (click to sort, select)
# - Tree view (F5)
# - Search processes (F3)
# - Filter processes (F4)
# - Kill with signal selection (F9)

Kill Signals — Asking vs Telling vs Forcing

Killing a process isn't just kill -9. There are different signals for different situations:

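| Signal | Number | Meaning | Use When |
|---|---|---|---|
| SIGHUP | 1 | Hangup / reload config | Reload Nginx, Apache without restart |
| SIGINT | 2 | Interrupt (Ctrl+C) | Graceful stop from terminal |
| SIGQUIT | 3 | Quit with core dump | Debugging crashes |
| SIGTERM | 15 | Terminate gracefully | Default kill; try this first |
| SIGKILL | 9 | Force kill immediately | Last resort only |
| SIGSTOP | 19 | Pause process | Freeze a runaway process temporarily |
| SIGCONT | 18 | Resume paused process | Continue after SIGSTOP |

The SIGSTOP and SIGCONT numbers shown are for x86 and ARM; they differ on some other architectures, so prefer signal names (kill -STOP) over numbers in scripts.
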
# Always try graceful termination first
kill 2100 # Sends SIGTERM (default)
kill -15 2100 # Same thing, explicit

# Wait a few seconds. Still running?
kill -9 2100 # Nuclear option — SIGKILL

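# Kill by name instead of PID
pgrep -af "php-fpm" # Preview first: -f matches anywhere in the full command line
pkill -f "php-fpm"
killall nginx # Matches by process name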

# Reload config without a full restart (Nginx shown; check each daemon's docs for its reload signal)
sudo kill -HUP "$(cat /var/run/nginx.pid)"
# or
sudo systemctl reload nginx

# Pause a runaway process while you investigate
kill -STOP 2100 # Freeze it
# ... investigate the issue ...
kill -CONT 2100 # Resume it
# or
kill -9 2100 # Kill it

Rule of thumb: Always try SIGTERM (15) first. Give the process 5-10 seconds to clean up. Only use SIGKILL (9) if it refuses to stop. Why? SIGKILL can't be caught — the process dies immediately without cleaning up temp files, closing database connections, or flushing buffers.
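
That rule of thumb can be wrapped in a small function. This is a minimal sketch rather than a hardened tool: `graceful_kill` is a hypothetical name, and the 10-second default grace period is an assumption taken from the upper end of the range above.

```shell
# graceful_kill PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed,
# 2 if the process could not be signalled at all.
graceful_kill() {
    pid=$1
    grace=${2:-10}

    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "cannot signal PID $pid (gone, or not permitted)" >&2
        return 2
    fi

    waited=0
    while [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
        # kill -0 sends no signal; it only checks that the PID still exists
        kill -0 "$pid" 2>/dev/null || return 0
    done

    # Still alive after the grace period: force it
    kill -KILL "$pid" 2>/dev/null
    return 1
}
```

Usage would look like `graceful_kill 2100 5`. One caveat: kill -0 also succeeds for a zombie that has not been reaped yet, so for such a process the function may wait out the full grace period.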

nice and renice — Process Priority

Linux nice values range from -20 (highest priority) to 19 (lowest priority). The default is 0. Any user can raise the nice value of their own processes; by default only root can lower it.

# Start a CPU-heavy backup with low priority
nice -n 19 tar -czf /backup/full-backup.tar.gz /var/www/

# Check a process's nice value
ps -o pid,ni,comm -p 2100

# Change priority of a running process
sudo renice -n 10 -p 2100 # Lower priority
sudo renice -n -5 -p 2100 # Higher priority (needs root)

# Renice all processes of a user
sudo renice -n 15 -u jenkins

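This is incredibly useful during business hours. Need to run a big backup or log analysis? Run it at nice 19 so it yields the CPU to production traffic. Keep in mind that nice only affects CPU scheduling; disk I/O priority is a separate setting (see ionice).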
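
A quick way to confirm what niceness a command actually gets: run with no arguments, nice prints the niceness it was started with. A small sketch; `low_priority` is a hypothetical wrapper name:

```shell
# Run any command at the lowest CPU priority.
low_priority() {
    nice -n 19 "$@"
}

nice                        # niceness of the current shell
low_priority nice           # niceness the wrapped command runs with
low_priority sh -c 'nice'   # child processes inherit it
```

Because -n is an adjustment relative to the current niceness and the result is capped at 19, the wrapper lands on the lowest priority whenever the shell itself starts at 0 or above.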

Background Jobs and nohup

You SSH into a server, start a long-running task, and your connection drops. The process dies. Here's how to prevent that.

# Run in background with &
./long-migration.sh &
# [1] 12345

# But it still dies when you disconnect! Use nohup:
nohup ./long-migration.sh > migration.log 2>&1 &

# Or use disown to detach an already-running job
./long-migration.sh
# Ctrl+Z to pause
bg # Resume in background
disown -h %1 # Detach from terminal

# Job control commands
jobs # List background jobs
fg %1 # Bring job 1 to foreground
bg %1 # Send job 1 to background
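
What nohup really does is start the command with SIGHUP ignored, and SIGHUP is the signal a shell typically passes to its jobs when the terminal goes away. The sketch below checks that behaviour directly; `survives_hup` is a hypothetical helper name, and the one-second pauses are an arbitrary allowance for startup and signal delivery.

```shell
# survives_hup COMMAND [ARGS...]
# Starts the command in the background, sends it SIGHUP, and returns 0
# if it is still running afterwards, 1 if the signal ended it.
survives_hup() {
    "$@" >/dev/null 2>&1 &
    pid=$!
    sleep 1                          # let the command start up
    kill -HUP "$pid" 2>/dev/null
    sleep 1                          # let the signal take effect
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"            # clean up the survivor
        wait "$pid" 2>/dev/null
        return 0
    fi
    return 1
}

survives_hup sleep 30 && echo "sleep: survived" || echo "sleep: ended by SIGHUP"
survives_hup nohup sleep 30 && echo "nohup sleep: survived" || echo "nohup sleep: ended by SIGHUP"
```

A program that installs its own SIGHUP handler after starting can still override what nohup set up, which is one reason a terminal multiplexer is the more dependable choice for long jobs.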

For production, use screen or tmux instead:

# Start a tmux session
tmux new -s migration

# Run your command
./long-migration.sh

# Detach: Ctrl+B, then D
# Reconnect later:
tmux attach -t migration

Zombie Processes — The Undead

A zombie process has finished executing but its parent hasn't called wait() to collect its exit status. Zombies use no CPU and almost no memory (only a process-table entry), but each one holds a PID until its parent reaps it.

# Find zombie processes
ps aux | awk '$8 ~ /Z/ {print}'

# Count zombies
ps aux | awk '$8 ~ /Z/' | wc -l

# Find the parent of each zombie (see the PPID column)
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'

# You can't kill a zombie: it's already dead.
# Only its parent can reap it, so act on the parent instead
kill -s CHLD <parent_pid> # Nudge the parent to reap (helps only if it handles SIGCHLD)
kill <parent_pid> # Otherwise stop the parent; PID 1 adopts and reaps its zombies

A handful of zombies is normal. Hundreds of zombies means the parent process has a bug — it's not handling child process exits properly.
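
To practise spotting zombies without waiting for a buggy program, one can be created on purpose. A minimal sketch, assuming procps `ps`; `list_zombies` and `make_zombie` are hypothetical helper names:

```shell
# List zombies as: PID PPID STAT COMMAND
list_zombies() {
    ps -e -o pid= -o ppid= -o stat= -o comm= | awk '$3 ~ /^Z/'
}

# Start a parent that never calls wait(): the inner shell launches
# "sleep 1" in the background, then replaces itself with "sleep 5".
# When "sleep 1" exits, it stays a zombie until that parent exits.
# Prints the parent's PID.
make_zombie() {
    sh -c 'sleep 1 & exec sleep 5' >/dev/null 2>&1 &
    echo $!
}

parent=$(make_zombie)
sleep 2              # give the child time to exit
list_zombies         # the defunct sleep appears with PPID equal to $parent
kill "$parent"       # once the parent is gone, PID 1 normally reaps the zombie
```

The separate -o flags are deliberate: with procps, writing `-o pid=,ppid=` would be read as a single pid column with a custom header.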

The /proc Filesystem — Process X-Ray

Every process gets a directory under /proc/<PID>/ with detailed info. This is where ps and top actually get their data.

# Pick a process to investigate
PID=2100

# What command started it?
tr '\0' ' ' < /proc/$PID/cmdline; echo

# What environment variables did it start with? (needs the same user or root)
tr '\0' '\n' < /proc/$PID/environ | sort

# What files does it have open?
ls -la /proc/$PID/fd/ | head -20

# How much memory is it really using? (VmRSS is the resident set)
grep -i '^vm' /proc/$PID/status

# What's its current working directory?
ls -la /proc/$PID/cwd

# What binary is it running?
ls -la /proc/$PID/exe

# System-wide info
cat /proc/loadavg # Load averages
cat /proc/meminfo # Memory details
cat /proc/cpuinfo # CPU details
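
Several of those lookups can be combined into a one-shot summary. A small sketch that reads only /proc, so it works even where ps is not installed; `proc_summary` is a hypothetical helper name:

```shell
# proc_summary PID: print a few key facts straight from /proc
proc_summary() {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # Name, State, PPid and VmRSS are fields of /proc/<PID>/status
    grep -E '^(Name|State|PPid|VmRSS):' "/proc/$pid/status"
    # Each entry under fd/ is one open file descriptor
    printf 'OpenFDs:\t%s\n' "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
}

proc_summary $$
```

Kernel threads have no VmRSS line, and the fd count shows 0 for processes owned by another user unless you run this as root.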

Real-World Scenario: Hunt the CPU Hog

Here's the complete workflow for that 3 AM incident:

# Step 1: Quick overview — what's the load?
uptime
# 03:15:00 up 45 days, load average: 12.50, 8.30, 3.10
# Load is 12.5 on a 4-core machine — that's bad

# Step 2: Find the top CPU consumers
ps aux --sort=-%cpu | head -5

# Step 3: Get more details on the offending process
PID=2100
ls -la /proc/$PID/exe
cat /proc/$PID/cmdline | tr '\0' ' '

# Step 4: Check how long it's been running
ps -o pid,etime,pcpu,pmem,comm -p $PID

# Step 5: Freeze it while you investigate
kill -STOP $PID

# Step 6: Decide whether to stop it gracefully or force kill
kill -TERM $PID # Try graceful first
kill -CONT $PID # A stopped process can't act on SIGTERM until it is resumed
sleep 5
ps -p $PID > /dev/null 2>&1 && kill -9 $PID # Force if still alive

# Step 7: Verify system recovered
uptime
free -h
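
The "12.5 on a 4-core machine" judgement in Step 1 is just load divided by core count. A short sketch of that arithmetic; `load_ratio` is a hypothetical helper name, and treating a sustained value above roughly 1.0 per core as trouble is a common rule of thumb rather than a hard limit:

```shell
# load_ratio LOAD CORES: print load per core, to one decimal place
load_ratio() {
    awk -v avg="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", avg / cores }'
}

load_ratio 12.50 4      # the example above: prints 3.1

# Live values: 1-minute load from /proc/loadavg, core count from nproc
load_ratio "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

On Linux the load average also counts processes in uninterruptible sleep (state D), so a high ratio with idle CPUs usually points at I/O rather than a CPU hog.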
