
Kubernetes Disaster Recovery — Velero, etcd Backup, and DR Strategy

9 min read · Goel Academy · DevOps & Cloud Learning Hub

Your Kubernetes cluster will fail. Maybe not today, maybe not this quarter, but the combination of cloud provider outages, human error, and software bugs guarantees that at some point your cluster will be unavailable. The question is not if — it is whether you can recover in minutes instead of hours, and whether you lose zero data instead of the last six hours.

What You Need to Back Up

Kubernetes has three categories of data that need protection:

| Data Type | What It Contains | Where It Lives | Backup Tool |
|---|---|---|---|
| Cluster state | Deployments, Services, ConfigMaps, Secrets, CRDs | etcd | Velero, etcdctl |
| Persistent data | Databases, uploads, application state | PersistentVolumes (EBS, PD, Azure Disk) | CSI Snapshots, Velero + Restic |
| Configuration | Helm values, Kustomize overlays, GitOps repos | Git | Git (it is its own backup) |

Most teams only back up cluster state and discover during recovery that their database volumes are gone. Back up all three.

Velero — Installation and Configuration

Velero is the standard tool for Kubernetes backup and restore. It snapshots cluster resources and optionally copies PersistentVolume data to object storage.

Install Velero

# Download Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.14.0/velero-v1.14.0-linux-amd64.tar.gz
tar -xvf velero-v1.14.0-linux-amd64.tar.gz
sudo mv velero-v1.14.0-linux-amd64/velero /usr/local/bin/

# Create credentials file for AWS
cat > credentials-velero <<EOF
[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY
EOF

# Install Velero with AWS provider
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket velero-backups-prod \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero \
--use-node-agent \
--default-volumes-to-fs-backup

# Verify installation
velero get backup-locations
kubectl get pods -n velero

For Azure

# Install Velero with Azure provider
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.10.0 \
--bucket velero-backups \
--backup-location-config resourceGroup=backup-rg,storageAccount=velerobackups \
--snapshot-location-config resourceGroup=backup-rg,subscriptionId=YOUR_SUB_ID \
--secret-file ./credentials-velero \
--use-node-agent

Scheduled Backups with Velero

One-time backups are useless. You need automated, scheduled backups that run without human intervention.

# Create a scheduled backup — every 6 hours, retain for 30 days
velero schedule create prod-full-backup \
--schedule="0 */6 * * *" \
--ttl 720h \
--include-namespaces production,staging \
--default-volumes-to-fs-backup

# Create a more frequent backup for critical namespace — every hour
velero schedule create prod-critical-hourly \
--schedule="0 * * * *" \
--ttl 168h \
--include-namespaces production \
--include-resources deployments,services,configmaps,secrets,persistentvolumeclaims

# List scheduled backups
velero get schedules

# Check backup status (Backup CRs support kubectl's --sort-by; the Velero CLI does not)
kubectl get backups -n velero --sort-by=.metadata.creationTimestamp
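If you manage Velero declaratively (for example, through GitOps), the same 6-hourly schedule can be expressed as a Schedule custom resource instead of a CLI call — a sketch equivalent to the first command above:

```yaml
# Schedule CR equivalent to the 6-hourly CLI schedule above
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-full-backup
  namespace: velero
spec:
  schedule: "0 */6 * * *"
  template:
    ttl: 720h0m0s                 # retain for 30 days
    includedNamespaces:
      - production
      - staging
    defaultVolumesToFsBackup: true
```

Committing this manifest to Git means the backup policy itself is versioned and survives a cluster rebuild.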

On-Demand Backup Before Risky Operations

# Before a major upgrade or migration, take a manual backup
velero backup create pre-upgrade-$(date +%Y%m%d-%H%M) \
--include-namespaces production \
--default-volumes-to-fs-backup \
--wait

# Verify the backup completed successfully
velero backup describe pre-upgrade-20260110-1430 --details

Restoring to a New Cluster

This is the part that matters. A backup you cannot restore is decoration.

# List available backups
velero get backups

# Restore everything from a backup to a new cluster
velero restore create --from-backup prod-full-backup-20260110-060000 --wait

# Restore only specific namespaces
velero restore create --from-backup prod-full-backup-20260110-060000 \
--include-namespaces production \
--wait

# Restore specific resources only (e.g., just ConfigMaps and Secrets)
velero restore create --from-backup prod-full-backup-20260110-060000 \
--include-resources configmaps,secrets \
--include-namespaces production \
--wait

# Check restore status
velero restore describe <restore-name> --details

# Verify restored resources
kubectl get all -n production
kubectl get pvc -n production

etcd Backup and Restore — The Manual Approach

If you manage your own control plane (kubeadm, k3s, or bare-metal), you need direct etcd backups. This is your nuclear option — it restores the entire cluster state.

Backup etcd

# Find etcd pod and certificates
kubectl get pods -n kube-system -l component=etcd

# Take a snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot-20260110.db --write-table

# Copy to remote storage
aws s3 cp /backup/etcd-snapshot-20260110.db s3://etcd-backups-prod/

Automate with CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              image: bitnami/etcd:3.5
              command: ["/bin/sh", "-c"]
              args:
                - |
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  # Upload to S3 (add aws-cli or use sidecar)
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup-dir
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup-dir
              hostPath:
                path: /var/backup/etcd
          restartPolicy: OnFailure

Restore etcd

# Stop kube-apiserver (move the static pod manifest)
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260110.db \
--data-dir=/var/lib/etcd-restored \
--initial-cluster="master-1=https://10.0.1.10:2380" \
--initial-advertise-peer-urls=https://10.0.1.10:2380 \
--name=master-1

# Replace etcd data directory
sudo mv /var/lib/etcd /var/lib/etcd-old
sudo mv /var/lib/etcd-restored /var/lib/etcd

# Restart kube-apiserver
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

# Verify cluster health
kubectl get nodes
kubectl get pods --all-namespaces

Backing Up PersistentVolumes with CSI Snapshots

Velero handles resources, but for volume-level consistency you want CSI VolumeSnapshots. These are provider-native snapshots — fast and crash-consistent.

# First, ensure a VolumeSnapshotClass exists
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain  # Keep snapshot even if VolumeSnapshot object is deleted
---
# Create a snapshot of a PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-db-snapshot-20260110
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: postgres-data
---
# Restore from snapshot by creating a new PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-restored
  namespace: production
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-db-snapshot-20260110
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
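Crash-consistent is not the same as application-consistent. For databases, Velero can run a command inside the pod just before the backup via pod annotations — a sketch, assuming a hypothetical Postgres deployment with a container named postgres:

```yaml
# Pod template annotations (hypothetical Postgres pod): Velero runs the
# pre hook inside the named container before backing up the pod's volumes
metadata:
  annotations:
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "psql -U postgres -c CHECKPOINT"]'
    pre.hook.backup.velero.io/timeout: 120s
```

The exact flush command depends on the database; the annotation mechanism is Velero's standard backup-hook interface.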

DR Strategies Compared

Not every application needs active-active across three regions. Pick the strategy that matches your RTO/RPO requirements and budget:

| Strategy | RTO | RPO | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup-Restore | Hours | Hours (last backup) | Low | Low | Dev/staging, non-critical apps |
| Pilot Light | 30-60 min | Minutes | Medium | Medium | Internal tools, batch processing |
| Active-Passive | 5-15 min | Near-zero | High | High | Customer-facing apps, SaaS |
| Active-Active | ~0 | Zero | Very High | Very High | Financial services, critical APIs |

Pilot Light Setup

A pilot light DR keeps a minimal cluster running in the DR region with the infrastructure ready but no application workloads. On failure, you scale up and restore.

# DR region cluster — minimal size
# Keep the control plane running, 1-2 small nodes
# ArgoCD syncs manifests but replicas are set to 0

# In the DR cluster, override replicas
kubectl scale deployment --all --replicas=0 -n production

# On failover:
# 1. Restore the most recent backup from the schedule (for any state not in Git)
velero restore create --from-schedule prod-full-backup

# 2. Scale up deployments
kubectl scale deployment payment-api --replicas=3 -n production
kubectl scale deployment user-service --replicas=3 -n production

# 3. Update DNS to point to DR cluster
aws route53 change-resource-record-sets --hosted-zone-id Z123 \
--change-batch file://failover-dns.json
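The failover-dns.json passed above is a standard Route 53 change batch; a sketch, with the record name and DR load balancer hostname as placeholders:

```json
{
  "Comment": "Fail over app traffic to the DR cluster",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "dr-ingress.eu-west-1.example.com" }
        ]
      }
    }
  ]
}
```

Keeping the TTL low (60 seconds here) is what makes DNS failover fast; a long TTL on the production record quietly inflates your real RTO.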

Cross-Region and Cross-Cloud DR

For true disaster recovery, your backup must be in a different region or cloud entirely.

# Configure Velero with cross-region backup location
velero backup-location create dr-backup \
--provider aws \
--bucket velero-dr-eu-west \
--config region=eu-west-1 \
--access-mode ReadWrite

# Configure replication on the S3 bucket (AWS CLI)
aws s3api put-bucket-replication \
--bucket velero-backups-prod \
--replication-configuration file://replication-config.json

# For cross-cloud: back up to GCS from an AWS cluster
velero backup-location create gcs-backup \
--provider gcp \
--bucket velero-backups-gcp \
--config serviceAccount=velero@project.iam.gserviceaccount.com
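The replication-config.json referenced above follows the S3 replication configuration schema; a sketch with placeholder account ID and role name (the role must be allowed to replicate between the two buckets):

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "replicate-velero-backups",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": {
        "Bucket": "arn:aws:s3:::velero-dr-eu-west"
      }
    }
  ]
}
```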

DR Testing Procedures

A DR plan that has never been tested is a wish list. Test quarterly at minimum.

DR Test Runbook:

  1. Announce the test — Schedule a maintenance window, notify stakeholders
  2. Take a fresh backup — velero backup create dr-test-$(date +%Y%m%d)
  3. Provision the DR cluster — Use Cluster API or Terraform
  4. Restore the backup — velero restore create --from-backup dr-test-20260110
  5. Verify application health — Run smoke tests against the DR cluster
  6. Measure RTO — Time from "disaster declared" to "application serving traffic"
  7. Measure RPO — Check the timestamp of the latest data in the restored database
  8. Document findings — What worked, what broke, what needs fixing
  9. Tear down — Delete the DR test cluster to avoid costs
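Step 7 above reduces to simple date arithmetic: the RPO is the gap between the moment the disaster was declared and the newest row in the restored database. A small helper sketch, with both timestamps passed in as placeholder strings (GNU date assumed):

```shell
#!/bin/bash
# rpo_minutes DISASTER_TIME LAST_DATA_TIME
# Prints the measured RPO in whole minutes. Timestamps are UTC strings;
# in a real test, LAST_DATA_TIME would come from something like
# SELECT max(created_at) FROM orders; on the restored database.
rpo_minutes() {
  local declared last_data
  declared=$(date -u -d "$1" +%s)
  last_data=$(date -u -d "$2" +%s)
  echo $(( (declared - last_data) / 60 ))
}

rpo_minutes "2026-01-10 06:47:00" "2026-01-10 06:00:00"   # prints 47
```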

#!/bin/bash
# Smoke test script for DR verification (shebang must be the first line)
CLUSTER_URL="https://dr-cluster.example.com"

echo "Checking pod health..."
kubectl get pods -n production --field-selector status.phase!=Running

echo "Testing payment API..."
curl -sf "$CLUSTER_URL/api/health" || echo "FAIL: payment-api"

echo "Testing user service..."
curl -sf "$CLUSTER_URL/api/users/health" || echo "FAIL: user-service"

echo "Checking PVC data..."
kubectl exec -n production deploy/postgres -- psql -c "SELECT count(*) FROM orders;"

echo "DR verification complete at $(date)"

Disaster recovery is insurance. It costs time and money to set up, and you hope you never need it. But when your primary region goes dark at 2 AM, the difference between "we restored in 12 minutes" and "we are rebuilding from scratch" is entirely determined by the work you did before the disaster. Set up Velero today, automate your etcd backups, test your restores quarterly, and sleep better knowing your clusters can survive the worst.