
Kubernetes Disaster Recovery — Velero, etcd Backup, and DR Strategy

9 min read · Goel Academy · DevOps & Cloud Learning Hub

Your Kubernetes cluster will fail. Maybe not today, maybe not this quarter, but the combination of cloud provider outages, human error, and software bugs guarantees that at some point your cluster will be unavailable. The question is not if — it is whether you can recover in minutes instead of hours, and whether you lose zero data instead of the last six hours.

What You Need to Back Up

Kubernetes has three categories of data that need protection:

| Data Type | What It Contains | Where It Lives | Backup Tool |
|---|---|---|---|
| Cluster state | Deployments, Services, ConfigMaps, Secrets, CRDs | etcd | Velero, etcdctl |
| Persistent data | Databases, uploads, application state | PersistentVolumes (EBS, PD, Azure Disk) | CSI Snapshots, Velero + Restic |
| Configuration | Helm values, Kustomize overlays, GitOps repos | Git | Git (it is its own backup) |

Most teams only back up cluster state and discover during recovery that their database volumes are gone. Back up all three.

Velero — Installation and Configuration

Velero is the standard tool for Kubernetes backup and restore. It snapshots cluster resources and optionally copies PersistentVolume data to object storage.

Install Velero

# Download Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.14.0/velero-v1.14.0-linux-amd64.tar.gz
tar -xvf velero-v1.14.0-linux-amd64.tar.gz
sudo mv velero-v1.14.0-linux-amd64/velero /usr/local/bin/

# Create credentials file for AWS
cat > credentials-velero <<EOF
[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY
EOF

# Install Velero with AWS provider
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket velero-backups-prod \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero \
--use-node-agent \
--default-volumes-to-fs-backup

# Verify installation
velero get backup-locations
kubectl get pods -n velero

For Azure

# Install Velero with Azure provider
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.10.0 \
--bucket velero-backups \
--backup-location-config resourceGroup=backup-rg,storageAccount=velerobackups \
--snapshot-location-config resourceGroup=backup-rg,subscriptionId=YOUR_SUB_ID \
--secret-file ./credentials-velero \
--use-node-agent

Scheduled Backups with Velero

One-time backups are useless. You need automated, scheduled backups that run without human intervention.

# Create a scheduled backup — every 6 hours, retain for 30 days
velero schedule create prod-full-backup \
--schedule="0 */6 * * *" \
--ttl 720h \
--include-namespaces production,staging \
--default-volumes-to-fs-backup

# Create a more frequent backup for critical namespace — every hour
velero schedule create prod-critical-hourly \
--schedule="0 * * * *" \
--ttl 168h \
--include-namespaces production \
--include-resources deployments,services,configmaps,secrets,persistentvolumeclaims

# List scheduled backups
velero get schedules

# Check backup status (Backup CRs support kubectl's --sort-by; the Velero CLI does not)
kubectl get backups -n velero --sort-by=.metadata.creationTimestamp
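If you manage Velero declaratively (for example, through GitOps), the same 6-hourly schedule can be expressed as a Schedule custom resource instead of a CLI call — a sketch equivalent to the first command above:

```yaml
# Schedule CR equivalent to the 6-hourly CLI schedule above
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-full-backup
  namespace: velero
spec:
  schedule: "0 */6 * * *"
  template:
    ttl: 720h0m0s                 # retain for 30 days
    includedNamespaces:
      - production
      - staging
    defaultVolumesToFsBackup: true
```

Committing this manifest to Git means the backup policy itself is versioned and survives a cluster rebuild.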

On-Demand Backup Before Risky Operations

# Before a major upgrade or migration, take a manual backup
velero backup create pre-upgrade-$(date +%Y%m%d-%H%M) \
--include-namespaces production \
--default-volumes-to-fs-backup \
--wait

# Verify the backup completed successfully
velero backup describe pre-upgrade-20260110-1430 --details

Restoring to a New Cluster

This is the part that matters. A backup you cannot restore is decoration.

# List available backups
velero get backups

# Restore everything from a backup to a new cluster
velero restore create --from-backup prod-full-backup-20260110-060000 --wait

# Restore only specific namespaces
velero restore create --from-backup prod-full-backup-20260110-060000 \
--include-namespaces production \
--wait

# Restore specific resources only (e.g., just ConfigMaps and Secrets)
velero restore create --from-backup prod-full-backup-20260110-060000 \
--include-resources configmaps,secrets \
--include-namespaces production \
--wait

# Check restore status
velero restore describe <restore-name> --details

# Verify restored resources
kubectl get all -n production
kubectl get pvc -n production

etcd Backup and Restore — The Manual Approach

If you manage your own control plane (kubeadm, k3s, or bare-metal), you need direct etcd backups. This is your nuclear option — it restores the entire cluster state.

Backup etcd

# Find etcd pod and certificates
kubectl get pods -n kube-system -l component=etcd

# Take a snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20260110.db --write-table

# Copy to remote storage
aws s3 cp /backup/etcd-snapshot-20260110.db s3://etcd-backups-prod/

Automate with CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              image: bitnami/etcd:3.5
              command: ["/bin/sh", "-c"]
              args:
                - |
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  # Upload to S3 (add aws-cli or use sidecar)
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup-dir
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup-dir
              hostPath:
                path: /var/backup/etcd
          restartPolicy: OnFailure

Restore etcd

# Stop kube-apiserver (move the static pod manifest)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260110.db \
--data-dir=/var/lib/etcd-restored \
--initial-cluster="master-1=https://10.0.1.10:2380" \
--initial-advertise-peer-urls=https://10.0.1.10:2380 \
--name=master-1

# Replace etcd data directory
sudo mv /var/lib/etcd /var/lib/etcd-old
sudo mv /var/lib/etcd-restored /var/lib/etcd

# Restart kube-apiserver
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces

Backing Up PersistentVolumes with CSI Snapshots

Velero handles resources, but for volume-level consistency you want CSI VolumeSnapshots. These are provider-native snapshots — fast and crash-consistent.

# First, ensure a VolumeSnapshotClass exists
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain  # Keep snapshot even if VolumeSnapshot object is deleted
---
# Create a snapshot of a PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-db-snapshot-20260110
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: postgres-data
---
# Restore from snapshot by creating a new PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-db-snapshot-20260110
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
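Crash-consistent is not the same as application-consistent. For databases, Velero can run a command inside the pod just before the backup via pod annotations — a sketch, assuming a hypothetical Postgres deployment with a container named postgres:

```yaml
# Pod template annotations (hypothetical Postgres pod): Velero runs the
# pre hook inside the named container before backing up the pod's volumes
metadata:
  annotations:
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "psql -U postgres -c CHECKPOINT"]'
    pre.hook.backup.velero.io/timeout: 120s
```

The exact flush command depends on the database; the annotation mechanism is Velero's standard backup-hook interface.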

DR Strategies Compared

Not every application needs active-active across three regions. Pick the strategy that matches your RTO/RPO requirements and budget:

| Strategy | RTO | RPO | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup-Restore | Hours | Hours (last backup) | Low | Low | Dev/staging, non-critical apps |
| Pilot Light | 30-60 min | Minutes | Medium | Medium | Internal tools, batch processing |
| Active-Passive | 5-15 min | Near-zero | High | High | Customer-facing apps, SaaS |
| Active-Active | ~0 | Zero | Very High | Very High | Financial services, critical APIs |

Pilot Light Setup

A pilot light DR keeps a minimal cluster running in the DR region with the infrastructure ready but no application workloads. On failure, you scale up and restore.

# DR region cluster — minimal size
# Keep the control plane running, 1-2 small nodes
# ArgoCD syncs manifests but replicas are set to 0

# In the DR cluster, override replicas
kubectl scale deployment --all --replicas=0 -n production

# On failover:
# 1. Restore the most recent backup from the schedule (for any state not in Git)
velero restore create --from-schedule prod-full-backup

# 2. Scale up deployments
kubectl scale deployment payment-api --replicas=3 -n production
kubectl scale deployment user-service --replicas=3 -n production

# 3. Update DNS to point to DR cluster
aws route53 change-resource-record-sets --hosted-zone-id Z123 \
--change-batch file://failover-dns.json
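The failover-dns.json passed above is a standard Route 53 change batch; a sketch, with the record name and DR load balancer hostname as placeholders:

```json
{
  "Comment": "Fail over app traffic to the DR cluster",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "dr-ingress.eu-west-1.example.com" }
        ]
      }
    }
  ]
}
```

Keeping the TTL low (60 seconds here) is what makes DNS failover fast; a long TTL on the production record quietly inflates your real RTO.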

Cross-Region and Cross-Cloud DR

For true disaster recovery, your backup must be in a different region or cloud entirely.

# Configure Velero with cross-region backup location
velero backup-location create dr-backup \
--provider aws \
--bucket velero-dr-eu-west \
--config region=eu-west-1 \
--access-mode ReadWrite

# Configure replication on the S3 bucket (AWS CLI)
aws s3api put-bucket-replication \
--bucket velero-backups-prod \
--replication-configuration file://replication-config.json

# For cross-cloud: back up to GCS from an AWS cluster
velero backup-location create gcs-backup \
--provider gcp \
--bucket velero-backups-gcp \
--config serviceAccount=velero@project.iam.gserviceaccount.com
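The replication-config.json referenced above follows the S3 replication configuration schema; a sketch with placeholder account ID and role name (the role must be allowed to replicate between the two buckets):

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-velero-backups",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::velero-dr-eu-west"
      }
    }
  ]
}
```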

DR Testing Procedures

A DR plan that has never been tested is a wish list. Test quarterly at minimum.

DR Test Runbook:

  1. Announce the test — Schedule a maintenance window, notify stakeholders
  2. Take a fresh backup — velero backup create dr-test-$(date +%Y%m%d)
  3. Provision the DR cluster — Use Cluster API or Terraform
  4. Restore the backup — velero restore create --from-backup dr-test-20260110
  5. Verify application health — Run smoke tests against the DR cluster
  6. Measure RTO — Time from "disaster declared" to "application serving traffic"
  7. Measure RPO — Check the timestamp of the latest data in the restored database
  8. Document findings — What worked, what broke, what needs fixing
  9. Tear down — Delete the DR test cluster to avoid costs
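Step 7 above reduces to simple date arithmetic: the RPO is the gap between the moment the disaster was declared and the newest row in the restored database. A small helper sketch, with both timestamps passed in as placeholder strings (GNU date assumed):

```shell
#!/bin/bash
# rpo_minutes DISASTER_TIME LAST_DATA_TIME
# Prints the measured RPO in whole minutes. Timestamps are UTC strings;
# in a real test, LAST_DATA_TIME would come from something like
# SELECT max(created_at) FROM orders; on the restored database.
rpo_minutes() {
  local declared last_data
  declared=$(date -u -d "$1" +%s)
  last_data=$(date -u -d "$2" +%s)
  echo $(( (declared - last_data) / 60 ))
}

rpo_minutes "2026-01-10 06:47:00" "2026-01-10 06:00:00"   # prints 47
```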

#!/bin/bash
# Smoke test script for DR verification (shebang must be the first line)
CLUSTER_URL="https://dr-cluster.example.com"

echo "Checking pod health..."
kubectl get pods -n production --field-selector status.phase!=Running

echo "Testing payment API..."
curl -sf "$CLUSTER_URL/api/health" || echo "FAIL: payment-api"

echo "Testing user service..."
curl -sf "$CLUSTER_URL/api/users/health" || echo "FAIL: user-service"

echo "Checking PVC data..."
kubectl exec -n production deploy/postgres -- psql -c "SELECT count(*) FROM orders;"

echo "DR verification complete at $(date)"

Disaster recovery is insurance. It costs time and money to set up, and you hope you never need it. But when your primary region goes dark at 2 AM, the difference between "we restored in 12 minutes" and "we are rebuilding from scratch" is entirely determined by the work you did before the disaster. Set up Velero today, automate your etcd backups, test your restores quarterly, and sleep better knowing your clusters can survive the worst.