Linux High Availability — Keepalived, HAProxy, and Clustering
Single point of failure = guaranteed downtime. Your app might be perfect, your code might be clean, but if it runs on one server and that server dies at 3 AM, your customers see a blank page. High availability isn't optional for production -- it's the minimum bar. Let's build infrastructure that survives server failures automatically.
HA Architecture Overview
A typical Linux HA setup has three layers:
| Layer | Tool | Purpose |
|---|---|---|
| Virtual IP (VIP) | Keepalived | Floating IP that moves between servers on failure |
| Load Balancing | HAProxy | Distributes traffic, health checks backends |
| Clustering | Pacemaker/Corosync | Manages shared resources, prevents split-brain |
The simplest production pattern: two HAProxy nodes running Keepalived, sharing a virtual IP. Traffic hits the VIP, the active HAProxy distributes it to backend servers. If the active HAProxy dies, Keepalived moves the VIP to the standby in under 3 seconds.
Keepalived — Virtual IP Failover with VRRP
VRRP (Virtual Router Redundancy Protocol) allows two or more servers to share a virtual IP address. Only one server holds the IP at any time. If that server fails, another takes over immediately.
Installation
# Install on BOTH HA nodes
sudo apt update && sudo apt install -y keepalived
# Optional: the ip_vs module is only needed if you also use Keepalived's
# built-in LVS load balancing (virtual_server blocks); plain VRRP
# failover requires no extra kernel module
sudo modprobe ip_vs
lsmod | grep ip_vs
Master Node Configuration
# On the MASTER node (ha-node-1)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
router_id ha-node-1
script_user root
enable_script_security
}
vrrp_script check_haproxy {
script "/usr/bin/systemctl is-active haproxy"
interval 2
weight -20
fall 3
rise 2
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass MySecret   # VRRP uses only the first 8 characters of auth_pass
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_haproxy
}
notify_master "/usr/local/bin/ha-notify.sh MASTER"
notify_backup "/usr/local/bin/ha-notify.sh BACKUP"
notify_fault "/usr/local/bin/ha-notify.sh FAULT"
}
EOF
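The numbers in this config interact: when check_haproxy fails, weight -20 is subtracted from the master's priority of 100, dropping its effective priority to 80, below the backup's 90, so the backup wins the next VRRP election and takes the VIP. The arithmetic, as a runnable sketch:

```shell
#!/bin/bash
# VRRP election arithmetic when the master's haproxy check fails
MASTER_PRIO=100      # priority on ha-node-1
BACKUP_PRIO=90       # priority on ha-node-2
TRACK_WEIGHT=-20     # weight from the check_haproxy vrrp_script

EFFECTIVE=$((MASTER_PRIO + TRACK_WEIGHT))
echo "master effective priority: $EFFECTIVE"
if [ "$EFFECTIVE" -lt "$BACKUP_PRIO" ]; then
    echo "backup outbids the master: VIP moves to ha-node-2"
fi
```

If the weight were only -5, the degraded master would still outrank the backup and keep the VIP, so size the weight to guarantee a flip.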
Backup Node Configuration
# On the BACKUP node (ha-node-2)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
global_defs {
router_id ha-node-2
script_user root
enable_script_security
}
vrrp_script check_haproxy {
script "/usr/bin/systemctl is-active haproxy"
interval 2
weight -20
fall 3
rise 2
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 51
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass MySecret   # must match the master; only the first 8 characters count
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_haproxy
}
}
EOF
# Start Keepalived on both nodes
sudo systemctl enable --now keepalived
Notification Script
sudo tee /usr/local/bin/ha-notify.sh << 'SCRIPT'
#!/bin/bash
STATE=$1
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$TIMESTAMP] Keepalived transition to $STATE" >> /var/log/keepalived-notify.log
case $STATE in
MASTER)
echo "[$TIMESTAMP] This node is now MASTER — VIP is active here" >> /var/log/keepalived-notify.log
# Optionally send Slack/PagerDuty alert
;;
BACKUP)
echo "[$TIMESTAMP] This node is now BACKUP" >> /var/log/keepalived-notify.log
;;
FAULT)
echo "[$TIMESTAMP] FAULT detected!" >> /var/log/keepalived-notify.log
;;
esac
SCRIPT
sudo chmod +x /usr/local/bin/ha-notify.sh
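To page a human on transitions, the notify script can also post to a webhook. A minimal sketch, assuming a Slack-style incoming webhook; the payload builder is separated out so it can be tested, and WEBHOOK_URL is a placeholder you would set yourself:

```shell
#!/bin/bash
# Build a JSON alert payload for ha-notify.sh; post it with something like:
#   curl -fsS -X POST -H 'Content-Type: application/json' \
#        -d "$(build_payload "$STATE" "$(hostname)")" "$WEBHOOK_URL"
build_payload() {
    local state="$1" host="$2"
    printf '{"text":"Keepalived on %s transitioned to %s"}' "$host" "$state"
}

build_payload MASTER ha-node-1
```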
Verify Failover
# On the master — check VIP is assigned
ip addr show eth0 | grep 192.168.1.100
# Watch VRRP advertisements
sudo tcpdump -i eth0 -n vrrp
# Simulate failure — stop keepalived on master
sudo systemctl stop keepalived
# On the backup — VIP should appear within 3 seconds
ip addr show eth0 | grep 192.168.1.100
# Restore master — VIP moves back (preemption)
sudo systemctl start keepalived
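To put a number on the failover window, run a probe loop from a third machine while you stop keepalived on the master. A sketch with the network check stubbed out so it runs anywhere; on a real client, replace the stub with the ping shown in the comment:

```shell
#!/bin/bash
# Count failed VIP probes during a failover drill.
# Real check for a client machine:
#   check_vip() { ping -c1 -W1 192.168.1.100 > /dev/null 2>&1; }
check_vip() { false; }    # stub: simulate an unreachable VIP

down=0
for _ in 1 2 3; do        # on a real drill, loop until interrupted
    check_vip || down=$((down + 1))
    # sleep 1             # 1-second probe spacing on a real drill
done
echo "VIP unreachable for $down probe(s)"
```

A clean `systemctl stop keepalived` sends a priority-0 advertisement and fails over almost instantly; a hard kill takes roughly three advertisement intervals before the backup declares the master dead.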
HAProxy — Load Balancing and Health Checks
HAProxy distributes incoming traffic across multiple backend servers and automatically removes unhealthy ones.
Installation and Configuration
sudo apt install -y haproxy
sudo tee /etc/haproxy/haproxy.cfg << 'EOF'
global
log /dev/log local0
maxconn 4096
user haproxy
group haproxy
daemon
stats socket /var/run/haproxy.sock mode 660 level admin
defaults
log global
mode http
option httplog
option dontlognull
option forwardfor
timeout connect 5s
timeout client 30s
timeout server 30s
retries 3
option redispatch
frontend http_front
bind *:80
bind *:443 ssl crt /etc/ssl/certs/myapp.pem
redirect scheme https if !{ ssl_fc }
default_backend web_servers
# Route API traffic to API backends
acl is_api path_beg /api
use_backend api_servers if is_api
backend web_servers
balance roundrobin
option httpchk GET /health
http-check expect status 200
server web1 10.0.1.10:8080 check inter 5s fall 3 rise 2
server web2 10.0.1.11:8080 check inter 5s fall 3 rise 2
server web3 10.0.1.12:8080 check inter 5s fall 3 rise 2 backup
backend api_servers
balance leastconn
option httpchk GET /api/health
http-check expect status 200
server api1 10.0.1.20:3000 check inter 5s fall 3 rise 2
server api2 10.0.1.21:3000 check inter 5s fall 3 rise 2
listen stats
bind *:8404
stats enable
stats uri /stats
stats refresh 5s
stats auth admin:SecureStatsPass
stats admin if TRUE
EOF
# Validate config
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
# Start HAProxy
sudo systemctl enable --now haproxy
HAProxy Load Balancing Algorithms
| Algorithm | Flag | Best For |
|---|---|---|
| Round Robin | roundrobin | Equal-capacity servers |
| Least Connections | leastconn | Long-lived connections (APIs, WebSocket) |
| Source IP Hash | source | Session persistence without cookies |
| URI Hash | uri | Cache servers (same URL = same backend) |
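One nuance behind the source row: hashing on client IP keeps sessions sticky, but it reshuffles many clients whenever a server joins or leaves the pool. If clients accept cookies, HAProxy's cookie insertion gives steadier persistence; a sketch reusing the web_servers backend from above:

```
backend web_servers
    balance roundrobin
    cookie SRV insert indirect nocache
    option httpchk GET /health
    http-check expect status 200
    server web1 10.0.1.10:8080 check cookie web1
    server web2 10.0.1.11:8080 check cookie web2
```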
Monitoring HAProxy
# Check backend health via stats socket
echo "show stat" | sudo socat stdio /var/run/haproxy.sock | cut -d, -f1,2,18 | column -t -s,
# Check server states
echo "show servers state" | sudo socat stdio /var/run/haproxy.sock
# Drain a server for maintenance (stop new connections gracefully)
echo "set server web_servers/web1 state drain" | sudo socat stdio /var/run/haproxy.sock
# Re-enable the server
echo "set server web_servers/web1 state ready" | sudo socat stdio /var/run/haproxy.sock
# View the stats dashboard
echo "HAProxy stats: http://<your-vip>:8404/stats"
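The CSV from show stat is wide, so a field-exact filter helps when you only want trouble spots. A sketch that runs against sample data so it is self-contained; on a live node, pipe the socket output through the same filter (field 18 of the CSV is the status column):

```shell
#!/bin/bash
# Print only servers whose status is not UP (skip headers and frontends).
# Live usage:
#   echo "show stat" | sudo socat stdio /var/run/haproxy.sock | filter_unhealthy
filter_unhealthy() {
    awk -F, '!/^#/ && NF >= 18 && $18 != "" && $18 != "UP" && $18 != "OPEN" {print $1 "/" $2 ": " $18}'
}

# Sample rows trimmed to the first 18 CSV fields for illustration
sample='# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status
web_servers,web1,0,0,1,2,,50,1000,2000,0,0,0,0,0,0,0,UP
web_servers,web2,0,0,0,0,,10,200,400,0,0,0,3,0,0,0,DOWN'

printf '%s\n' "$sample" | filter_unhealthy
```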
Pacemaker and Corosync — Full Clustering
For more complex HA scenarios (shared storage, database failover, multi-resource management), Pacemaker and Corosync provide a full clustering framework.
# Install on all cluster nodes
sudo apt install -y pacemaker corosync pcs
# Set password for the cluster admin user
sudo passwd hacluster
# Start the PCS daemon
sudo systemctl enable --now pcsd
# Authenticate nodes (run from one node)
sudo pcs host auth ha-node-1 ha-node-2 -u hacluster
# Create the cluster
sudo pcs cluster setup myha-cluster ha-node-1 ha-node-2
# Start the cluster
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Check cluster status
sudo pcs status
Configure Cluster Resources
# Add a Virtual IP resource
sudo pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 cidr_netmask=24 \
op monitor interval=10s
# Add HAProxy as a cluster resource
sudo pcs resource create HAProxy systemd:haproxy \
op monitor interval=10s timeout=30s
# Ensure VIP and HAProxy run on the same node
sudo pcs constraint colocation add HAProxy with VirtualIP INFINITY
# Ensure VIP starts before HAProxy
sudo pcs constraint order VirtualIP then HAProxy
# Prevent split-brain with STONITH (fencing)
# For VMs, use fence_virsh; for cloud, use fence_aws/fence_azure_arm
# Note: with stonith-enabled=true but no fence device defined, Pacemaker
# will refuse to start resources -- create a fence device first
sudo pcs property set stonith-enabled=true
# View all constraints
sudo pcs constraint config --full   # "pcs constraint show" is deprecated in newer pcs
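The same colocation and ordering can be expressed in one step with a resource group; members of a group are kept on the same node and started in listed order. A sketch using the two resources defined above:

```
# Alternative to the two explicit constraints: one resource group
sudo pcs resource group add HAGroup VirtualIP HAProxy
```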
Split-Brain Prevention
Split-brain is the most dangerous HA failure: both nodes think they're the master and both hold the VIP. This causes data corruption with databases and duplicate request processing.
Prevention strategies:
| Strategy | How It Works |
|---|---|
| STONITH/Fencing | Surviving node forcibly powers off the failed node |
| Quorum | Odd number of nodes; majority required to operate |
| Watchdog timer | Hardware timer reboots unresponsive node |
| Unicast VRRP | Direct node communication (no multicast dependency) |
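A concrete example for the unicast row: Keepalived multicasts VRRP advertisements to 224.0.0.18 by default, which many cloud and filtered networks drop. Pinning advertisements to unicast peers avoids that dependency; a sketch for ha-node-1, assuming the nodes' real IPs are 192.168.1.11 and 192.168.1.12 (swap the two addresses on ha-node-2):

```
# Add inside vrrp_instance VI_1 in /etc/keepalived/keepalived.conf
unicast_src_ip 192.168.1.11    # this node's real interface IP
unicast_peer {
    192.168.1.12               # the peer node's real IP
}
```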
# Configure a watchdog timer for automatic reboot on hang
sudo tee /etc/modules-load.d/watchdog.conf << 'EOF'
softdog
EOF
sudo modprobe softdog
# Two-node clusters get "two_node: 1" in corosync.conf automatically from
# "pcs cluster setup", so quorum survives a single peer failure; avoid the
# old no-quorum-policy=ignore workaround and rely on two_node plus fencing
sudo pcs property set stonith-enabled=true
# Verify fencing is configured
sudo pcs stonith status
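Enabling STONITH is only half the job; a fence device must exist for Pacemaker to act on. A rough sketch for the fence_virsh case, where the hypervisor address, credentials, and VM name are placeholders; parameter names vary across fence-agents versions, so check `pcs stonith describe fence_virsh` on your system first:

```
# Hypothetical libvirt fencing for ha-node-2 (verify parameters with
# "pcs stonith describe fence_virsh" before using)
sudo pcs stonith create fence-ha-node-2 fence_virsh \
    ipaddr=192.168.1.5 login=root passwd=hypervisor-pass \
    port=ha-node-2-vm pcmk_host_list=ha-node-2 \
    op monitor interval=60s
```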
Full HA Health Check
Use this script to verify your HA setup is working correctly:
#!/bin/bash
echo "=== HA Stack Health Check ==="
# Check Keepalived
VRRP_STATE=$(sudo journalctl -u keepalived --no-pager -n 50 | grep -oP '(MASTER|BACKUP)' | tail -1)
echo "Keepalived state: ${VRRP_STATE:-UNKNOWN}"
# Check VIP
VIP="192.168.1.100"
if ip addr show | grep -q "$VIP"; then
echo "VIP $VIP: ACTIVE on this node"
else
echo "VIP $VIP: not on this node (expected if BACKUP)"
fi
# Check HAProxy
if systemctl is-active haproxy > /dev/null 2>&1; then
BACKENDS=$(echo "show stat" | sudo socat stdio /var/run/haproxy.sock 2>/dev/null | awk -F, '$18 == "UP" {n++} END {print n+0}')
echo "HAProxy: RUNNING ($BACKENDS backends UP)"
else
echo "HAProxy: DOWN"
fi
# Check Pacemaker (if used)
if command -v pcs &> /dev/null; then
ONLINE_LINE=$(sudo pcs status 2>/dev/null | grep 'Online:')
echo "Pacemaker: ${ONLINE_LINE:-status unavailable}"
fi
echo "=== Done ==="
High availability is not a one-time setup. Test failover regularly, monitor your VRRP transitions, and practice disaster scenarios during maintenance windows. The worst time to find out your HA doesn't work is during an actual outage.
Your infrastructure is resilient. Now let's make sure YOU are ready -- next up: the top 50 Linux interview questions for DevOps and SRE roles.
