Kubernetes Production Readiness Checklist — 25 Things Before Going Live
You have built your app, containerized it, written the Kubernetes manifests, and it works great in staging. Now someone says "let's go to production" and suddenly you are wondering what you forgot. This checklist exists because every production incident I have seen traces back to skipping something obvious during setup.
Run through these 25 items before your first production deploy. Each one includes a command to verify it, so this is not just theory — you can audit your cluster right now.
Cluster Setup (Items 1-5)
1. High Availability Control Plane
Your control plane must survive a node failure. On managed Kubernetes (EKS, GKE, AKS), this is automatic. On self-managed clusters, run 3+ control plane nodes across availability zones.
# Check control plane node count
kubectl get nodes -l node-role.kubernetes.io/control-plane -o wide
# On managed K8s, verify the cluster is multi-AZ
# EKS:
aws eks describe-cluster --name prod-cluster --query 'cluster.resourcesVpcConfig.subnetIds'
# Expect: subnets in 3 different AZs
2. Node Autoscaling Configured
Cluster Autoscaler or Karpenter must be running. Without it, a traffic spike leaves new pods stuck in Pending with nowhere to schedule.
# Check if Cluster Autoscaler is running
kubectl get deployment cluster-autoscaler -n kube-system
# Or check for Karpenter
kubectl get deployment karpenter -n karpenter
# Verify node pools / provisioners exist
kubectl get nodepools.karpenter.sh # Karpenter
aws eks list-nodegroups --cluster-name prod-cluster # EKS; then verify ASG min/max in the console
3. Resource Quotas on All Namespaces
Without quotas, one team can consume the entire cluster. Every production namespace needs a ResourceQuota.
# Check which namespaces have quotas
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
quota=$(kubectl get resourcequota -n $ns --no-headers 2>/dev/null | wc -l)
echo "$ns: $quota quotas"
done
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services.loadbalancers: "5"
4. PodDisruptionBudgets for Critical Services
PDBs prevent Kubernetes from evicting too many pods during node drains or upgrades. Without them, a rolling node upgrade can take your entire service down.
# List all PDBs
kubectl get pdb --all-namespaces
# Check if critical deployments have PDBs
kubectl get deployment -n production -o name | while read deploy; do
name=$(echo $deploy | cut -d/ -f2)
pdb=$(kubectl get pdb -n production -o jsonpath="{.items[?(@.spec.selector.matchLabels.app=='$name')].metadata.name}" 2>/dev/null)
echo "$name -> PDB: ${pdb:-MISSING}"
done
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
  namespace: production
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: payment-api
5. Separate Node Pools for System and Application Workloads
System components (monitoring, ingress, logging) should not compete with application pods for resources. Use dedicated node pools with taints.
# Check node labels and taints
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
POOL:.metadata.labels.node-pool,\
TAINTS:.spec.taints[*].key
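As a sketch of the pattern, taint the system pool and give system workloads a matching toleration plus a node selector. The `dedicated=system` key/value and the `node-pool: system` label are an assumed naming convention, not anything Kubernetes mandates:

```yaml
# Pod spec fragment for a system workload (e.g. an ingress controller).
# The "dedicated=system" taint and "node-pool: system" label are
# illustrative conventions -- match whatever your node pools actually use.
spec:
  nodeSelector:
    node-pool: system
  tolerations:
    - key: dedicated
      operator: Equal
      value: system
      effect: NoSchedule
```

Application pods without the toleration cannot land on the system pool, and the nodeSelector keeps system pods off application nodes.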
Security (Items 6-10)
6. RBAC with Least Privilege
No one should have cluster-admin unless they are a platform engineer. Developers get namespace-scoped roles.
# Find all cluster-admin bindings (should be minimal)
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name + " -> " + (.subjects[]?.name // "unknown")'
# Count total ClusterRoleBindings vs RoleBindings
echo "ClusterRoleBindings: $(kubectl get clusterrolebindings --no-headers | wc -l)"
echo "RoleBindings: $(kubectl get rolebindings --all-namespaces --no-headers | wc -l)"
7. Network Policies Enforced
By default, every pod can talk to every other pod. Network policies restrict traffic to only what is needed.
# Check if any network policies exist
kubectl get networkpolicy --all-namespaces
# Verify your CNI supports network policies (Calico, Cilium, Weave — yes; Flannel — no)
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl get pods -n kube-system -l k8s-app=cilium # Cilium
# Default deny all ingress in production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
8. Pod Security Standards Enforced
Pod Security Admission (PSA) replaces the deprecated PodSecurityPolicy. Enforce restricted or baseline profiles.
# Check namespace labels for PSA enforcement
kubectl get namespaces -o json | \
jq -r '.items[] | .metadata.name + ": " + (.metadata.labels["pod-security.kubernetes.io/enforce"] // "NOT SET")'
# Label production namespace with restricted enforcement
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
9. Secrets Encrypted at Rest
By default, Kubernetes stores Secrets base64-encoded (not encrypted) in etcd. Enable encryption at rest.
# On managed K8s, check provider docs. On self-managed:
# Verify encryption config exists on API server
ps aux | grep kube-apiserver | grep encryption-provider-config
# Test: create a secret and verify it is encrypted in etcd
kubectl create secret generic test-encryption --from-literal=key=value -n default
# Then check etcd directly — you should see encrypted data, not plaintext
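On self-managed clusters, the file passed via `--encryption-provider-config` follows the upstream EncryptionConfiguration format. A minimal sketch (the key below is a placeholder; generate your own with `head -c 32 /dev/urandom | base64`):

```yaml
# EncryptionConfiguration sketch for the kube-apiserver.
# The aescbc key is a placeholder -- never commit a real key to git.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}  # Fallback so existing plaintext secrets stay readable
```

Provider order matters: the first provider encrypts new writes, while later ones are only used to read existing data.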
10. Container Image Scanning in CI/CD
Never deploy unscanned images. Use Trivy, Grype, or Snyk in your CI pipeline.
# Scan an image with Trivy
trivy image --severity HIGH,CRITICAL myregistry/payment-api:v2.1.0
# In CI/CD (GitHub Actions example):
# - name: Scan image
# uses: aquasecurity/trivy-action@master
# with:
# image-ref: myregistry/payment-api:${{ github.sha }}
# exit-code: 1 # Fail the build on HIGH/CRITICAL vulns
Reliability (Items 11-15)
11. Liveness and Readiness Probes on Every Container
Without probes, Kubernetes cannot detect crashed or unhealthy containers. Traffic goes to broken pods.
# Find pods without readiness probes
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.containers[].readinessProbe == null) |
.metadata.namespace + "/" + .metadata.name'
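A probe sketch for reference. The `/ready` and `/healthz` paths and port 8080 are assumptions; point them at whatever health endpoints your app actually exposes:

```yaml
# Container spec fragment with both probes.
# Readiness gates traffic; liveness restarts a hung container.
containers:
  - name: payment-api
    readinessProbe:
      httpGet:
        path: /ready     # Assumed endpoint -- use your app's real path
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz   # Assumed endpoint
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

Keep the liveness probe more lenient than readiness: a flapping liveness probe turns a slow pod into a restart loop.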
12. Resource Requests and Limits Set
Pods without requests get evicted first during node pressure. Pods without limits can consume unbounded resources.
# Find containers without resource requests
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | .metadata.namespace + "/" + .metadata.name + " " +
(.spec.containers[] | select(.resources.requests == null) | .name)'
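For reference, a container spec fragment with both set. The numbers here are placeholders; size requests from observed usage (e.g. your Prometheus metrics), not guesses:

```yaml
# Requests reflect typical usage (used for scheduling and QoS);
# limits cap the worst case. Values below are illustrative only.
containers:
  - name: payment-api
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```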
13. Pod Anti-Affinity for Critical Deployments
Replicas of the same service should spread across nodes. Otherwise, one node failure kills all replicas.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: ["payment-api"]
            topologyKey: kubernetes.io/hostname
14. Horizontal Pod Autoscaler (HPA) for Variable Workloads
Fixed replica counts waste money at low traffic and crash at high traffic.
# List all HPAs and their status
kubectl get hpa --all-namespaces
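A minimal `autoscaling/v2` HPA sketch. The 70% CPU target and the 3-20 replica range are common starting points, not recommendations for every workload:

```yaml
# HPA targeting the payment-api Deployment on average CPU utilization.
# minReplicas/maxReplicas and the 70% target are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that CPU-based HPA requires resource requests (item 12) and a running metrics-server to work at all.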
15. Graceful Shutdown with preStop Hooks and terminationGracePeriodSeconds
Pods need time to drain connections before terminating. Default 30s is often not enough.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: payment-api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]  # Wait for LB to deregister
Observability (Items 16-20)
16. Metrics Collection (Prometheus/Datadog/CloudWatch)
You cannot manage what you cannot measure. At minimum, collect CPU, memory, network, and disk I/O per pod and node.
# Verify Prometheus is scraping targets
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring &
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'
17. Centralized Logging
Pod logs disappear when pods restart. Send them to a centralized system — EFK stack, Loki, or a cloud logging service.
# Verify Fluentd/Fluent Bit is running on all nodes
kubectl get daemonset -n logging
kubectl get pods -n logging -o wide # Should have one pod per node
18. Alerting Rules for Critical Conditions
Metrics without alerts are just dashboards no one watches. Set up alerts for pod restarts, high error rates, node pressure, and PVC usage.
# Check Prometheus alerting rules
kubectl get prometheusrules --all-namespaces
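As a sketch of what one such rule looks like (this assumes the Prometheus Operator and kube-state-metrics are installed; the threshold and labels are illustrative):

```yaml
# PrometheusRule sketch: alert when a container restarts repeatedly.
# Requires the Prometheus Operator CRDs and the
# kube_pod_container_status_restarts_total metric from kube-state-metrics.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```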
19. Distributed Tracing
For microservices, logs and metrics are not enough. You need traces to follow a request across services.
# Verify Jaeger or Tempo is running
kubectl get deployment -n tracing
kubectl get svc -n tracing
20. Dashboard for Cluster and Application Health
Grafana dashboards give your team a single pane of glass. Import community dashboards for Kubernetes (ID: 315, 6417, 13770).
# Verify Grafana is running
kubectl get svc grafana -n monitoring
Operations (Items 21-25)
21. etcd Backup Automated
etcd holds your entire cluster state. Lose it without a backup and you are rebuilding from scratch.
# For self-managed clusters, verify etcd backup CronJob
kubectl get cronjob -n kube-system | grep etcd
# For managed clusters, verify provider backup is enabled
# EKS: Automatic. GKE: Automatic. AKS: Check backup settings.
22. Disaster Recovery Plan Documented and Tested
A backup you have never restored is not a backup — it is a hope.
# Verify Velero is installed and backups are running
velero get backup-locations
velero get backups
velero get schedules
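A Velero Schedule sketch for a recurring backup. The cron expression, namespace selection, and 30-day TTL are illustrative; adjust them to your RPO:

```yaml
# Daily 02:00 backup of the production namespace, kept for 30 days.
# Schedule name and retention are placeholders.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - production
    ttl: 720h
```

Then actually restore from one of these backups into a scratch cluster at least once a quarter; that is the "tested" part of this item.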
23. Cluster Upgrade Strategy Defined
Kubernetes releases every 4 months. You need a plan for upgrading control plane and nodes with zero downtime.
# Check current version and available upgrades
kubectl version # --short was removed in kubectl 1.28; compact output is now the default
# EKS:
aws eks describe-cluster --name prod-cluster --query 'cluster.version'
# Check if you are more than 1 minor version behind
24. GitOps for All Deployments
No one should be running kubectl apply against production from their laptop. Use ArgoCD or Flux.
# Verify ArgoCD is running and apps are synced
kubectl get applications -n argocd
argocd app list # Check sync status
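For reference, a minimal Argo CD Application sketch. The repo URL, path, and project are placeholders for your own GitOps repo layout:

```yaml
# Argo CD Application syncing a manifest directory into production.
# repoURL and path are placeholders -- point them at your real repo.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/k8s-manifests.git
    targetRevision: main
    path: apps/payment-api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Delete resources removed from git
      selfHeal: true  # Revert manual kubectl drift
```

With `selfHeal` on, a hotfix applied by hand from a laptop gets reverted automatically, which is exactly the enforcement this item is about.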
25. Runbooks for Common Incidents
When the pager fires at 3 AM, no one wants to think from first principles. Write runbooks for: pod CrashLoopBackOff, node NotReady, PVC full, certificate expired, OOMKilled.
# This one is not a kubectl command — it is a team process check:
# - Do runbooks exist in your wiki/repo?
# - Are they linked from your alerting tool?
# - Has every on-call engineer read them?
# - Were they tested in the last quarter?
The Checklist Summary
| # | Item | Category | Priority |
|---|---|---|---|
| 1 | HA control plane | Cluster | Critical |
| 2 | Node autoscaling | Cluster | Critical |
| 3 | Resource quotas | Cluster | High |
| 4 | PodDisruptionBudgets | Cluster | High |
| 5 | Separate node pools | Cluster | Medium |
| 6 | RBAC least privilege | Security | Critical |
| 7 | Network policies | Security | Critical |
| 8 | Pod Security Standards | Security | High |
| 9 | Secrets encryption at rest | Security | High |
| 10 | Image scanning in CI/CD | Security | High |
| 11 | Liveness/Readiness probes | Reliability | Critical |
| 12 | Resource requests/limits | Reliability | Critical |
| 13 | Pod anti-affinity | Reliability | High |
| 14 | HPA for variable workloads | Reliability | Medium |
| 15 | Graceful shutdown | Reliability | High |
| 16 | Metrics collection | Observability | Critical |
| 17 | Centralized logging | Observability | Critical |
| 18 | Alerting rules | Observability | Critical |
| 19 | Distributed tracing | Observability | Medium |
| 20 | Health dashboards | Observability | Medium |
| 21 | etcd backup automated | Operations | Critical |
| 22 | DR plan tested | Operations | Critical |
| 23 | Upgrade strategy | Operations | High |
| 24 | GitOps deployments | Operations | High |
| 25 | Incident runbooks | Operations | High |
Print this list. Tape it to your monitor. Go through it item by item before your next production launch. The items marked Critical are non-negotiable — skip them and you are building on sand. The High and Medium items are what separate a cluster that works from a cluster that works reliably at 3 AM on a Saturday.
