
Shell Scripting Series Part 3 — Error Handling, Logging, and Production Scripts

· 8 min read
Goel Academy
DevOps & Cloud Learning Hub

Your scripts work on your machine — here's how to make them production-ready. In Parts 1 and 2, we learned the fundamentals and built real scripts. Now we'll add the guardrails that separate "works on my laptop" from "safe to run in production at 3 AM with nobody watching."

The set Command — Your Safety Net

Every production script should start with set -euo pipefail. Here's what each flag does and why it matters.

#!/bin/bash
set -euo pipefail

# -e : Exit immediately if any command fails
# -u : Treat unset variables as errors (catches typos)
# -o pipefail : A pipeline fails if ANY command in it fails, not just the last one

Let's see why each flag matters.

#!/bin/bash
# WITHOUT set -e: this script continues after the failed command
cd /nonexistent/directory # Fails (prints an error), but the script keeps going
rm -rf * # Runs in the WRONG directory — disaster!

# WITH set -e: script stops at the first failure
set -e
cd /nonexistent/directory # Script exits here
rm -rf * # Never runs — crisis averted

The pipefail flag catches hidden failures in pipes:

#!/bin/bash
# WITHOUT pipefail:
grep "pattern" huge-file.log | sort | head -5
# If grep fails (file not found), the pipeline exit code is from 'head' (success!)

# WITH pipefail:
set -o pipefail
grep "pattern" huge-file.log | sort | head -5
# Now the pipeline correctly reports grep's failure

Flag         Without It                               With It
-e           Script continues after errors            Script exits on first error
-u           Unset variables expand to empty strings  Unset variables cause immediate exit
-o pipefail  Pipeline exit code = last command        Pipeline exit code = first failing command
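The -u flag deserves its own demonstration, since variable typos are among the most common scripting bugs. Here is a minimal sketch (the variable names are invented for illustration):

```shell
#!/bin/bash
# Without -u, a typo silently expands to an empty string.
# "bakcup_dir" is a deliberate misspelling of "backup_dir".
backup_dir="/var/backups"

output=$(bash -c 'echo "Deleting $bakcup_dir/old"' 2>&1)
echo "without -u: $output"   # the path collapses to "/old"

# With -u, the same typo aborts the script immediately:
if ! bash -c 'set -u; echo "Deleting $bakcup_dir/old"' 2>/dev/null; then
  echo "with -u: aborted on the unset variable"
fi
```

Without -u, that `rm`-style command would have operated on the wrong path; with -u, bash refuses to expand the unset name at all.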

Trap — Cleaning Up After Yourself

trap lets you run cleanup code when your script exits, regardless of whether it succeeded or failed.

#!/bin/bash
set -euo pipefail

# Create a temp directory for working files
WORK_DIR=$(mktemp -d)
LOG_FILE="/var/log/myscript.log"

# Cleanup function — runs no matter how the script exits
cleanup() {
local exit_code=$?
echo "[$(date)] Script exiting with code: $exit_code" >> "$LOG_FILE"

# Remove temp files
if [[ -d "$WORK_DIR" ]]; then
rm -rf "$WORK_DIR"
echo "[$(date)] Cleaned up temp dir: $WORK_DIR" >> "$LOG_FILE"
fi

# Release lock file
[[ -f /tmp/myscript.lock ]] && rm -f /tmp/myscript.lock

exit "$exit_code"
}

# Register the trap. EXIT fires on any exit: normal completion, a failure
# under set -e, or after a fatal signal such as INT (Ctrl+C) or TERM (kill)
trap cleanup EXIT

# Now use WORK_DIR safely — it will ALWAYS be cleaned up
echo "Working in: $WORK_DIR"
cp /etc/important-config "$WORK_DIR/"
# ... do processing ...

# Even if the script crashes here, cleanup runs

Trap Signals Reference

Signal  Triggered By                    Common Use
EXIT    Script exits (any reason)       Clean up temp files
ERR     A command fails (with set -e)   Log the failure
INT     User presses Ctrl+C             Graceful shutdown
TERM    kill command                    Graceful shutdown
HUP     Terminal closes                 Reload config

#!/bin/bash
set -euo pipefail

# Different traps for different situations
trap 'echo "Error on line $LINENO. Command: $BASH_COMMAND"' ERR
trap 'echo "Script interrupted by user"; exit 130' INT
trap 'echo "Script terminated"; exit 143' TERM

echo "Running... press Ctrl+C to test INT trap"
sleep 60

Exit Codes — Communicating Success and Failure

Exit codes are how scripts talk to each other. Zero means success, anything else means failure.

#!/bin/bash
set -euo pipefail

# Define meaningful exit codes
readonly EXIT_SUCCESS=0
readonly EXIT_GENERAL_ERROR=1
readonly EXIT_INVALID_ARGS=2
readonly EXIT_DEPENDENCY_MISSING=3
readonly EXIT_PERMISSION_DENIED=4
readonly EXIT_TIMEOUT=5

check_dependencies() {
local missing=()
for cmd in curl jq aws; do
if ! command -v "$cmd" > /dev/null 2>&1; then
missing+=("$cmd")
fi
done

if [[ ${#missing[@]} -gt 0 ]]; then
echo "ERROR: Missing dependencies: ${missing[*]}"
echo "Install with: sudo apt install ${missing[*]}"
exit $EXIT_DEPENDENCY_MISSING
fi
}

validate_args() {
if [[ $# -lt 2 ]]; then
echo "Usage: $0 <environment> <action>"
echo " environment: dev, staging, production"
echo " action: deploy, rollback, status"
exit $EXIT_INVALID_ARGS
fi
}

validate_args "$@"
check_dependencies
echo "All checks passed"
exit $EXIT_SUCCESS
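Meaningful exit codes only pay off when something consumes them. A sketch of a calling script that branches on the codes defined above; `./deploy.sh` is a placeholder name for that script:

```shell
#!/bin/bash
# Hypothetical caller branching on exit codes. "./deploy.sh" stands in
# for the validation script above; the numbers match its EXIT_* values.
set +e                          # allow failure so we can inspect the code
./deploy.sh production deploy
status=$?
set -e

case "$status" in
  0) echo "Deploy succeeded" ;;
  2) echo "Invalid arguments: check the usage message" ;;
  3) echo "Missing dependencies: install them and retry" ;;
  *) echo "Deploy failed with exit code $status" ;;
esac
```

Capturing `$?` into a variable immediately matters: every subsequent command overwrites it.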

Logging — Structured Output for Production

Replace scattered echo statements with a proper logging function.

#!/bin/bash
set -euo pipefail

# Logging configuration
LOG_FILE="/var/log/myapp/deploy.log"
LOG_LEVEL="${LOG_LEVEL:-INFO}"
SCRIPT_NAME=$(basename "$0")

# Ensure log directory exists
mkdir -p "$(dirname "$LOG_FILE")"

# Log function with levels and timestamps
log() {
local level="$1"
shift
local message="$*"
local timestamp
timestamp=$(date '+%Y-%m-%d %H:%M:%S')

# Log level filtering
declare -A levels=([DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3 [FATAL]=4)
local current_level=${levels[$LOG_LEVEL]:-1}
local msg_level=${levels[$level]:-1}

[[ $msg_level -lt $current_level ]] && return 0

local log_line="[$timestamp] [$level] [$SCRIPT_NAME:$$] $message"

# Write to file
echo "$log_line" >> "$LOG_FILE"

# Also write to stderr for terminal visibility
case "$level" in
ERROR|FATAL) echo -e "\033[0;31m$log_line\033[0m" >&2 ;;
WARN) echo -e "\033[0;33m$log_line\033[0m" >&2 ;;
INFO) echo "$log_line" >&2 ;;
DEBUG) echo -e "\033[0;36m$log_line\033[0m" >&2 ;;
esac
}

# Usage
log INFO "Deployment starting"
log DEBUG "Environment: production"
log WARN "Disk usage is at 78%"
log ERROR "Connection to database failed"
log INFO "Deployment complete"
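Because LOG_LEVEL defaults to INFO, DEBUG noise disappears in production but can be re-enabled per run with `LOG_LEVEL=DEBUG ./script.sh`. The comparison at the heart of log() can be sketched in isolation:

```shell
#!/bin/bash
# Minimal demo of the level-filtering logic used in log() above:
# messages whose numeric level is below LOG_LEVEL's are dropped.
declare -A levels=([DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3 [FATAL]=4)

should_log() {
  local msg_level=${levels[$1]:-1}
  local current_level=${levels[${LOG_LEVEL:-INFO}]:-1}
  [[ $msg_level -ge $current_level ]]
}

LOG_LEVEL=WARN
should_log DEBUG && echo "DEBUG shown" || echo "DEBUG filtered"
should_log ERROR && echo "ERROR shown" || echo "ERROR filtered"
```

With LOG_LEVEL=WARN, the DEBUG message is filtered and the ERROR message passes through.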

File Locking — Prevent Duplicate Runs

When a script runs via cron, you need to ensure only one instance runs at a time. Without locking, overlapping runs can corrupt data.

#!/bin/bash
set -euo pipefail

LOCK_FILE="/tmp/$(basename "$0" .sh).lock"

acquire_lock() {
if [[ -f "$LOCK_FILE" ]]; then
local lock_pid
lock_pid=$(cat "$LOCK_FILE")

# Check if the process that created the lock is still running
if kill -0 "$lock_pid" 2>/dev/null; then
echo "ERROR: Another instance is already running (PID: $lock_pid)"
exit 1
else
echo "WARN: Removing stale lock file (PID $lock_pid is dead)"
rm -f "$LOCK_FILE"
fi
fi

# Create lock with our PID
echo $$ > "$LOCK_FILE"
}

release_lock() {
rm -f "$LOCK_FILE"
}

# Acquire lock and ensure it's released on exit
acquire_lock
trap release_lock EXIT

echo "Running with lock (PID: $$)..."
# Your actual script logic here
sleep 30 # Simulating long-running work
echo "Done!"

A more robust approach uses flock, which handles locking at the kernel level:

#!/bin/bash
set -euo pipefail

LOCK_FILE="/tmp/$(basename "$0" .sh).lock"

# flock -n: non-blocking, exit immediately if can't acquire lock
# fd 200 stays open for the script's lifetime; the kernel releases the
# lock automatically when the fd closes, even if the script crashes
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
echo "Another instance is already running. Exiting."
exit 0
fi

# Lock is held for the duration of the script
echo "Running with flock (PID: $$)..."
# Your script logic here
sleep 30
echo "Done!"
# Lock is automatically released when the script exits
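flock can also wrap an unmodified command straight from a crontab, so the script itself needs no locking code at all. The paths below are examples; flock ships with util-linux on Linux:

```shell
# Crontab entry guarding a script with flock (example paths):
# */5 * * * * flock -n /tmp/backup.lock /usr/local/bin/backup.sh

# The same one-liner works interactively: the second invocation of a
# still-running command would exit immediately instead of overlapping.
flock -n /tmp/flock-demo.lock echo "got the lock"
```

In this form flock holds the lock only for the duration of the wrapped command, which is exactly the cron use case.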

The Production-Grade Template

Here is a complete template that combines everything we've covered. Use this as the starting point for every production script.

#!/bin/bash
#
# Script: production-template.sh
# Purpose: [Describe what this script does]
# Usage: ./production-template.sh <arg1> <arg2>
# Author: Goel Academy
#

set -euo pipefail
IFS=$'\n\t'  # split words only on newlines and tabs, not spaces

# --- Configuration ---
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME="$(basename "$0")"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.sh}.log"
readonly LOCK_FILE="/tmp/${SCRIPT_NAME%.sh}.lock"
readonly WORK_DIR=$(mktemp -d -t "${SCRIPT_NAME%.sh}-XXXXXX")

# --- Logging ---
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$1] $2" | tee -a "$LOG_FILE"; }
info() { log "INFO" "$1"; }
warn() { log "WARN" "$1" >&2; }
error() { log "ERROR" "$1" >&2; }
fatal() { log "FATAL" "$1" >&2; exit 1; }

# --- Cleanup ---
cleanup() {
local exit_code=$?
[[ -d "$WORK_DIR" ]] && rm -rf "$WORK_DIR"
[[ -f "$LOCK_FILE" ]] && rm -f "$LOCK_FILE"

if [[ $exit_code -eq 0 ]]; then
info "Script completed successfully"
else
error "Script failed with exit code: $exit_code"
fi
}
trap cleanup EXIT
trap 'error "Error on line $LINENO: $BASH_COMMAND"; exit 1' ERR
trap 'warn "Interrupted by user"; exit 130' INT
trap 'warn "Terminated"; exit 143' TERM

# --- Locking ---
acquire_lock() {
if [[ -f "$LOCK_FILE" ]]; then
local pid
pid=$(cat "$LOCK_FILE")
if kill -0 "$pid" 2>/dev/null; then
fatal "Another instance running (PID: $pid)"
fi
rm -f "$LOCK_FILE"
fi
echo $$ > "$LOCK_FILE"
}

# --- Validation ---
check_root() {
[[ $EUID -eq 0 ]] || fatal "This script must be run as root"
}

check_dependencies() {
local deps=("$@")
for dep in "${deps[@]}"; do
command -v "$dep" > /dev/null 2>&1 || fatal "Missing dependency: $dep"
done
}

usage() {
cat << EOF
Usage: $SCRIPT_NAME [options] <argument>

Options:
-h, --help Show this help
-v, --verbose Enable verbose output
-d, --dry-run Show what would be done without doing it

Arguments:
argument Description of the argument
EOF
exit 0
}

# --- Main ---
main() {
local verbose=false
local dry_run=false

# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
-h|--help) usage ;;
-v|--verbose) verbose=true; shift ;;
-d|--dry-run) dry_run=true; shift ;;
-*) fatal "Unknown option: $1" ;;
*) break ;;
esac
done

acquire_lock
check_dependencies "curl" "jq"

info "Starting $SCRIPT_NAME"
info "Working directory: $WORK_DIR"
info "Verbose: $verbose | Dry run: $dry_run"

# === Your script logic goes here ===


# ===================================

info "All done!"
}

main "$@"

Common Patterns Cheat Sheet

Pattern                  Code
Exit on error            set -euo pipefail
Cleanup on exit          trap cleanup EXIT
Log with timestamp       echo "[$(date)] message"
Prevent duplicate runs   flock -n 200 or a PID file
Require root             [[ $EUID -eq 0 ]] || exit 1
Check dependency         command -v curl > /dev/null 2>&1
Default value            ${VAR:-default}
Script directory         $(cd "$(dirname "$0")" && pwd)
Temp directory           mktemp -d -t name-XXXXXX
Error line number        trap 'echo "line $LINENO"' ERR

This wraps up our Shell Scripting series. Go back to Part 1 — Variables, Loops, and Functions or Part 2 — Real-World Automation Scripts to review the fundamentals.