
Shell Scripting Series Part 3 — Error Handling, Logging, and Production Scripts

· 8 min read
Goel Academy
DevOps & Cloud Learning Hub

Your scripts work on your machine — here's how to make them production-ready. In Parts 1 and 2, we learned the fundamentals and built real scripts. Now we'll add the guardrails that separate "works on my laptop" from "safe to run in production at 3 AM with nobody watching."

The set Command — Your Safety Net

Every production script should start with set -euo pipefail. Here's what each flag does and why it matters.

#!/bin/bash
set -euo pipefail

# -e : Exit immediately if any command fails
# -u : Treat unset variables as errors (catches typos)
# -o pipefail : A pipeline fails if ANY command in it fails, not just the last one

Let's see why each flag matters.

#!/bin/bash
# WITHOUT set -e: this script continues after the failed command
cd /nonexistent/directory # Fails (prints an error), but the script keeps going
rm -rf * # Runs in the WRONG directory — disaster!

# WITH set -e: script stops at the first failure
set -e
cd /nonexistent/directory # Script exits here
rm -rf * # Never runs — crisis averted

The pipefail flag catches hidden failures in pipes:

#!/bin/bash
# WITHOUT pipefail:
grep "pattern" huge-file.log | sort | head -5
# If grep fails (file not found), the pipeline exit code is from 'head' (success!)

# WITH pipefail:
set -o pipefail
grep "pattern" huge-file.log | sort | head -5
# Now the pipeline correctly reports grep's failure

Flag         Without It                               With It
-e           Script continues after errors            Script exits on first error
-u           Unset variables expand to empty strings  Unset variables cause immediate exit
-o pipefail  Pipeline exit code = last command        Pipeline exit code = first failing command
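The -u flag deserves its own demonstration, since variable typos are among the most common scripting bugs. Here is a minimal sketch (the variable names are invented for illustration):

```shell
#!/bin/bash
# Without -u, a typo silently expands to an empty string.
# "bakcup_dir" is a deliberate misspelling of "backup_dir".
backup_dir="/var/backups"

output=$(bash -c 'echo "Deleting $bakcup_dir/old"' 2>&1)
echo "without -u: $output"   # the path collapses to "/old"

# With -u, the same typo aborts the script immediately:
if ! bash -c 'set -u; echo "Deleting $bakcup_dir/old"' 2>/dev/null; then
  echo "with -u: aborted on the unset variable"
fi
```

Without -u, that `rm`-style command would have operated on the wrong path; with -u, bash refuses to expand the unset name at all.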

Trap — Cleaning Up After Yourself

trap lets you run cleanup code when your script exits, regardless of whether it succeeded or failed.

#!/bin/bash
set -euo pipefail

# Create a temp directory for working files
WORK_DIR=$(mktemp -d)
LOG_FILE="/var/log/myscript.log"

# Cleanup function — runs no matter how the script exits
cleanup() {
local exit_code=$?
echo "[$(date)] Script exiting with code: $exit_code" >> "$LOG_FILE"

# Remove temp files
if [[ -d "$WORK_DIR" ]]; then
rm -rf "$WORK_DIR"
echo "[$(date)] Cleaned up temp dir: $WORK_DIR" >> "$LOG_FILE"
fi

# Release lock file
[[ -f /tmp/myscript.lock ]] && rm -f /tmp/myscript.lock

exit "$exit_code"
}

# Register the trap. EXIT fires on any exit: normal completion, a failure
# under set -e, or after a fatal signal such as INT (Ctrl+C) or TERM (kill)
trap cleanup EXIT

# Now use WORK_DIR safely — it will ALWAYS be cleaned up
echo "Working in: $WORK_DIR"
cp /etc/important-config "$WORK_DIR/"
# ... do processing ...

# Even if the script crashes here, cleanup runs

Trap Signals Reference

Signal  Triggered By                    Common Use
EXIT    Script exits (any reason)       Clean up temp files
ERR     A command fails (with set -e)   Log the failure
INT     User presses Ctrl+C             Graceful shutdown
TERM    kill command                    Graceful shutdown
HUP     Terminal closes                 Reload config

#!/bin/bash
set -euo pipefail

# Different traps for different situations
trap 'echo "Error on line $LINENO. Command: $BASH_COMMAND"' ERR
trap 'echo "Script interrupted by user"; exit 130' INT
trap 'echo "Script terminated"; exit 143' TERM

echo "Running... press Ctrl+C to test INT trap"
sleep 60

Exit Codes — Communicating Success and Failure

Exit codes are how scripts talk to each other. Zero means success, anything else means failure.

#!/bin/bash
set -euo pipefail

# Define meaningful exit codes
readonly EXIT_SUCCESS=0
readonly EXIT_GENERAL_ERROR=1
readonly EXIT_INVALID_ARGS=2
readonly EXIT_DEPENDENCY_MISSING=3
readonly EXIT_PERMISSION_DENIED=4
readonly EXIT_TIMEOUT=5

check_dependencies() {
local missing=()
for cmd in curl jq aws; do
if ! command -v "$cmd" > /dev/null 2>&1; then
missing+=("$cmd")
fi
done

if [[ ${#missing[@]} -gt 0 ]]; then
echo "ERROR: Missing dependencies: ${missing[*]}"
echo "Install with: sudo apt install ${missing[*]}"
exit $EXIT_DEPENDENCY_MISSING
fi
}

validate_args() {
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <environment> <action>"
echo " environment: dev, staging, production"
echo " action: deploy, rollback, status"
exit $EXIT_INVALID_ARGS
fi
}

validate_args "$@"
check_dependencies
echo "All checks passed"
exit $EXIT_SUCCESS
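Meaningful exit codes only pay off when something consumes them. A sketch of a calling script that branches on the codes defined above; `./deploy.sh` is a placeholder name for that script:

```shell
#!/bin/bash
# Hypothetical caller branching on exit codes. "./deploy.sh" stands in
# for the validation script above; the numbers match its EXIT_* values.
set +e                          # allow failure so we can inspect the code
./deploy.sh production deploy
status=$?
set -e

case "$status" in
  0) echo "Deploy succeeded" ;;
  2) echo "Invalid arguments: check the usage message" ;;
  3) echo "Missing dependencies: install them and retry" ;;
  *) echo "Deploy failed with exit code $status" ;;
esac
```

Capturing `$?` into a variable immediately matters: every subsequent command overwrites it.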

Logging — Structured Output for Production

Replace scattered echo statements with a proper logging function.

#!/bin/bash
set -euo pipefail

# Logging configuration
LOG_FILE="/var/log/myapp/deploy.log"
LOG_LEVEL="${LOG_LEVEL:-INFO}"
SCRIPT_NAME=$(basename "$0")

# Ensure log directory exists
mkdir -p "$(dirname "$LOG_FILE")"

# Log function with levels and timestamps
log() {
local level="$1"
shift
local message="$*"
local timestamp
timestamp=$(date '+%Y-%m-%d %H:%M:%S')

# Log level filtering
declare -A levels=([DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3 [FATAL]=4)
local current_level=${levels[$LOG_LEVEL]:-1}
local msg_level=${levels[$level]:-1}

[[ $msg_level -lt $current_level ]] && return 0

local log_line="[$timestamp] [$level] [$SCRIPT_NAME:$$] $message"

# Write to file
echo "$log_line" >> "$LOG_FILE"

# Also write to stderr for terminal visibility
case "$level" in
ERROR|FATAL) echo -e "\033[0;31m$log_line\033[0m" >&2 ;;
WARN) echo -e "\033[0;33m$log_line\033[0m" >&2 ;;
INFO) echo "$log_line" >&2 ;;
DEBUG) echo -e "\033[0;36m$log_line\033[0m" >&2 ;;
esac
}

# Usage
log INFO "Deployment starting"
log DEBUG "Environment: production"
log WARN "Disk usage is at 78%"
log ERROR "Connection to database failed"
log INFO "Deployment complete"
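Because LOG_LEVEL defaults to INFO, DEBUG noise disappears in production but can be re-enabled per run with `LOG_LEVEL=DEBUG ./script.sh`. The comparison at the heart of log() can be sketched in isolation:

```shell
#!/bin/bash
# Minimal demo of the level-filtering logic used in log() above:
# messages whose numeric level is below LOG_LEVEL's are dropped.
declare -A levels=([DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3 [FATAL]=4)

should_log() {
  local msg_level=${levels[$1]:-1}
  local current_level=${levels[${LOG_LEVEL:-INFO}]:-1}
  [[ $msg_level -ge $current_level ]]
}

LOG_LEVEL=WARN
should_log DEBUG && echo "DEBUG shown" || echo "DEBUG filtered"
should_log ERROR && echo "ERROR shown" || echo "ERROR filtered"
```

With LOG_LEVEL=WARN, the DEBUG message is filtered and the ERROR message passes through.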

File Locking — Prevent Duplicate Runs

When a script runs via cron, you need to ensure only one instance runs at a time. Without locking, overlapping runs can corrupt data.

#!/bin/bash
set -euo pipefail

LOCK_FILE="/tmp/$(basename "$0" .sh).lock"

acquire_lock() {
if [[ -f "$LOCK_FILE" ]]; then
local lock_pid
lock_pid=$(cat "$LOCK_FILE")

# Check if the process that created the lock is still running
if kill -0 "$lock_pid" 2>/dev/null; then
echo "ERROR: Another instance is already running (PID: $lock_pid)"
exit 1
else
echo "WARN: Removing stale lock file (PID $lock_pid is dead)"
rm -f "$LOCK_FILE"
fi
fi

# Create lock with our PID
echo $$ > "$LOCK_FILE"
}

release_lock() {
rm -f "$LOCK_FILE"
}

# Acquire lock and ensure it's released on exit
acquire_lock
trap release_lock EXIT

echo "Running with lock (PID: $$)..."
# Your actual script logic here
sleep 30 # Simulating long-running work
echo "Done!"

A more robust approach uses flock, which handles locking at the kernel level:

#!/bin/bash
set -euo pipefail

LOCK_FILE="/tmp/$(basename "$0" .sh).lock"

# flock -n: non-blocking, exit immediately if can't acquire lock
# fd 200 stays open for the script's lifetime; the kernel releases the
# lock automatically when the fd closes, even if the script crashes
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
echo "Another instance is already running. Exiting."
exit 0
fi

# Lock is held for the duration of the script
echo "Running with flock (PID: $$)..."
# Your script logic here
sleep 30
echo "Done!"
# Lock is automatically released when the script exits
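flock can also wrap an unmodified command straight from a crontab, so the script itself needs no locking code at all. The paths below are examples; flock ships with util-linux on Linux:

```shell
# Crontab entry guarding a script with flock (example paths):
# */5 * * * * flock -n /tmp/backup.lock /usr/local/bin/backup.sh

# The same one-liner works interactively: the second invocation of a
# still-running command would exit immediately instead of overlapping.
flock -n /tmp/flock-demo.lock echo "got the lock"
```

In this form flock holds the lock only for the duration of the wrapped command, which is exactly the cron use case.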

The Production-Grade Template

Here is a complete template that combines everything we've covered. Use this as the starting point for every production script.

#!/bin/bash
#
# Script: production-template.sh
# Purpose: [Describe what this script does]
# Usage: ./production-template.sh <arg1> <arg2>
# Author: Goel Academy
#

set -euo pipefail
IFS=$'\n\t'  # split words only on newlines and tabs, not spaces

# --- Configuration ---
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.sh}.log"
readonly LOCK_FILE="/tmp/${SCRIPT_NAME%.sh}.lock"
readonly WORK_DIR=$(mktemp -d -t "${SCRIPT_NAME%.sh}-XXXXXX")

# --- Logging ---
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$1] $2" | tee -a "$LOG_FILE"; }
info() { log "INFO" "$1"; }
warn() { log "WARN" "$1" >&2; }
error() { log "ERROR" "$1" >&2; }
fatal() { log "FATAL" "$1" >&2; exit 1; }

# --- Cleanup ---
cleanup() {
local exit_code=$?
[[ -d "$WORK_DIR" ]] && rm -rf "$WORK_DIR"
[[ -f "$LOCK_FILE" ]] && rm -f "$LOCK_FILE"

if [[ $exit_code -eq 0 ]]; then
info "Script completed successfully"
else
error "Script failed with exit code: $exit_code"
fi
}
trap cleanup EXIT
trap 'error "Error on line $LINENO: $BASH_COMMAND"; exit 1' ERR
trap 'warn "Interrupted by user"; exit 130' INT
trap 'warn "Terminated"; exit 143' TERM

# --- Locking ---
acquire_lock() {
if [[ -f "$LOCK_FILE" ]]; then
local pid
pid=$(cat "$LOCK_FILE")
if kill -0 "$pid" 2>/dev/null; then
fatal "Another instance running (PID: $pid)"
fi
rm -f "$LOCK_FILE"
fi
echo $$ > "$LOCK_FILE"
}

# --- Validation ---
check_root() {
[[ $EUID -eq 0 ]] || fatal "This script must be run as root"
}

check_dependencies() {
local deps=("$@")
for dep in "${deps[@]}"; do
command -v "$dep" > /dev/null 2>&1 || fatal "Missing dependency: $dep"
done
}

usage() {
cat << EOF
Usage: $SCRIPT_NAME [options] <argument>

Options:
-h, --help Show this help
-v, --verbose Enable verbose output
-d, --dry-run Show what would be done without doing it

Arguments:
argument Description of the argument
EOF
exit 0
}

# --- Main ---
main() {
local verbose=false
local dry_run=false

# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
-h|--help) usage ;;
-v|--verbose) verbose=true; shift ;;
-d|--dry-run) dry_run=true; shift ;;
-*) fatal "Unknown option: $1" ;;
*) break ;;
esac
done

acquire_lock
check_dependencies "curl" "jq"

info "Starting $SCRIPT_NAME"
info "Working directory: $WORK_DIR"
info "Verbose: $verbose | Dry run: $dry_run"

# === Your script logic goes here ===


# ===================================

info "All done!"
}

main "$@"

Common Patterns Cheat Sheet

Pattern                  Code
Exit on error            set -euo pipefail
Cleanup on exit          trap cleanup EXIT
Log with timestamp       echo "[$(date)] message"
Prevent duplicate runs   flock -n 200 or a PID file
Require root             [[ $EUID -eq 0 ]] || exit 1
Check dependency         command -v curl > /dev/null 2>&1
Default value            ${VAR:-default}
Script directory         $(cd "$(dirname "$0")" && pwd)
Temp directory           mktemp -d -t name-XXXXXX
Error line number        trap 'echo "line $LINENO"' ERR

This wraps up our Shell Scripting series. Go back to Part 1 — Variables, Loops, and Functions or Part 2 — Real-World Automation Scripts to review the fundamentals.