
Kubernetes Production Readiness Checklist — 25 Things Before Going Live

9 min read
Goel Academy
DevOps & Cloud Learning Hub

You have built your app, containerized it, written the Kubernetes manifests, and it works great in staging. Now someone says "let's go to production" and suddenly you are wondering what you forgot. This checklist exists because every production incident I have seen traces back to skipping something obvious during setup.

Run through these 25 items before your first production deploy. Each one includes a command to verify it, so this is not just theory — you can audit your cluster right now.

Cluster Setup (Items 1-5)

1. High Availability Control Plane

Your control plane must survive a node failure. On managed Kubernetes (EKS, GKE, AKS), this is automatic. On self-managed clusters, run 3+ control plane nodes across availability zones.

# Check control plane node count
kubectl get nodes -l node-role.kubernetes.io/control-plane -o wide

# On managed K8s, verify the cluster is multi-AZ
# EKS:
aws eks describe-cluster --name prod-cluster --query 'cluster.resourcesVpcConfig.subnetIds'

# Expect: subnets in 3 different AZs

2. Node Autoscaling Configured

Cluster Autoscaler or Karpenter must be running. Without it, traffic spikes will cause pod scheduling failures.

# Check if Cluster Autoscaler is running
kubectl get deployment cluster-autoscaler -n kube-system

# Or check for Karpenter
kubectl get deployment karpenter -n karpenter

# Verify node pools / provisioners exist
kubectl get nodepools.karpenter.sh # Karpenter
eksctl get nodegroup --cluster prod-cluster # EKS: verify ASG min/max sizes

3. Resource Quotas on All Namespaces

Without quotas, one team can consume the entire cluster. Every production namespace needs a ResourceQuota.

# Check which namespaces have quotas
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
quota=$(kubectl get resourcequota -n $ns --no-headers 2>/dev/null | wc -l)
echo "$ns: $quota quotas"
done
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services.loadbalancers: "5"

4. PodDisruptionBudgets for Critical Services

PDBs prevent Kubernetes from evicting too many pods during node drains or upgrades. Without them, a rolling node upgrade can take your entire service down.

# List all PDBs
kubectl get pdb --all-namespaces

# Check if critical deployments have PDBs
kubectl get deployment -n production -o name | while read deploy; do
name=$(echo $deploy | cut -d/ -f2)
pdb=$(kubectl get pdb -n production -o jsonpath="{.items[?(@.spec.selector.matchLabels.app=='$name')].metadata.name}" 2>/dev/null)
echo "$name -> PDB: ${pdb:-MISSING}"
done
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
  namespace: production
spec:
  minAvailable: 2 # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: payment-api

5. Separate Node Pools for System and Application Workloads

System components (monitoring, ingress, logging) should not compete with application pods for resources. Use dedicated node pools with taints.

# Check node labels and taints (the pool label key varies by provider,
# e.g. eks.amazonaws.com/nodegroup or cloud.google.com/gke-nodepool)
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
POOL:.metadata.labels.node-pool,\
TAINTS:.spec.taints[*].key
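To wire this up, taint the system pool so application pods are repelled, and give system workloads a matching toleration plus a node selector. A sketch, assuming an illustrative `node-pool: system` label and `dedicated=system` taint (set these in your node pool config, or taint nodes with `kubectl taint nodes <node> dedicated=system:NoSchedule`):

```yaml
# Pod spec fragment for system workloads (monitoring, ingress, logging)
spec:
  tolerations:
  - key: dedicated          # illustrative taint key
    operator: Equal
    value: system
    effect: NoSchedule
  nodeSelector:
    node-pool: system       # illustrative label; use your provider's pool label
```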

Security (Items 6-10)

6. RBAC with Least Privilege

No one should have cluster-admin unless they are a platform engineer. Developers get namespace-scoped roles.

# Find all cluster-admin bindings (should be minimal)
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name + " -> " + (.subjects[]?.name // "unknown")'

# Count total ClusterRoleBindings vs RoleBindings
echo "ClusterRoleBindings: $(kubectl get clusterrolebindings --no-headers | wc -l)"
echo "RoleBindings: $(kubectl get rolebindings --all-namespaces --no-headers | wc -l)"
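A minimal namespace-scoped, read-only role for developers might look like this sketch; the `developers` group name is an assumption about your identity provider:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-readonly
  namespace: production
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "pods/log", "deployments", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-readonly-binding
  namespace: production
subjects:
- kind: Group
  name: developers          # illustrative group from your IdP
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer-readonly
  apiGroup: rbac.authorization.k8s.io
```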

7. Network Policies Enforced

By default, every pod can talk to every other pod. Network policies restrict traffic to only what is needed.

# Check if any network policies exist
kubectl get networkpolicy --all-namespaces

# Verify your CNI supports network policies (Calico, Cilium, Weave — yes; Flannel — no)
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl get pods -n kube-system -l k8s-app=cilium # Cilium
# Default deny all ingress in production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress

8. Pod Security Standards Enforced

Pod Security Admission (PSA) replaces the deprecated PodSecurityPolicy. Enforce restricted or baseline profiles.

# Check namespace labels for PSA enforcement
kubectl get namespaces -o json | \
jq -r '.items[] | .metadata.name + ": " + (.metadata.labels["pod-security.kubernetes.io/enforce"] // "NOT SET")'
# Label production namespace with restricted enforcement
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted

9. Secrets Encrypted at Rest

By default, Kubernetes stores Secrets base64-encoded (not encrypted) in etcd. Enable encryption at rest.

# On managed K8s, check provider docs. On self-managed:
# Verify encryption config exists on API server
ps aux | grep kube-apiserver | grep encryption-provider-config

# Test: create a secret and verify it is encrypted in etcd
kubectl create secret generic test-encryption --from-literal=key=value -n default
# Then check etcd directly — you should see encrypted data, not plaintext
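On self-managed clusters, encryption is enabled by pointing the API server's `--encryption-provider-config` flag at a config file like the sketch below. The file path and key name are illustrative; for production, a KMS provider is generally preferable to a static `aescbc` key:

```yaml
# /etc/kubernetes/enc/encryption-config.yaml (path is illustrative)
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>   # generate, do not commit to Git
  - identity: {}   # fallback so pre-existing plaintext secrets stay readable
```

After enabling it, rewrite existing secrets so they get encrypted: `kubectl get secrets --all-namespaces -o json | kubectl replace -f -`.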

10. Container Image Scanning in CI/CD

Never deploy unscanned images. Use Trivy, Grype, or Snyk in your CI pipeline.

# Scan an image with Trivy
trivy image --severity HIGH,CRITICAL myregistry/payment-api:v2.1.0

# In CI/CD (GitHub Actions example):
# - name: Scan image
# uses: aquasecurity/trivy-action@master
# with:
# image-ref: myregistry/payment-api:${{ github.sha }}
# exit-code: 1 # Fail the build on HIGH/CRITICAL vulns

Reliability (Items 11-15)

11. Liveness and Readiness Probes on Every Container

Without probes, Kubernetes cannot detect crashed or unhealthy containers. Traffic goes to broken pods.

# Find pods without readiness probes
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(any(.spec.containers[]; .readinessProbe == null)) |
.metadata.namespace + "/" + .metadata.name'
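For reference, a probe configuration might look like this sketch; the endpoint paths and port are assumptions about your app:

```yaml
containers:
- name: payment-api
  readinessProbe:      # gates traffic: pod is removed from Service endpoints while failing
    httpGet:
      path: /healthz/ready   # illustrative endpoint
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:       # restarts the container when failing
    httpGet:
      path: /healthz/live    # illustrative endpoint
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
```

Keep the liveness probe less aggressive than the readiness probe, or a slow dependency can trigger restart loops instead of just removing the pod from rotation.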

12. Resource Requests and Limits Set

Pods without requests get evicted first during node pressure. Pods without limits can consume unbounded resources.

# Find containers without resource requests
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | .metadata.namespace + "/" + .metadata.name + " " +
(.spec.containers[] | select(.resources.requests == null) | .name)'
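A container spec with requests and limits, as a sketch with illustrative values:

```yaml
containers:
- name: payment-api
  resources:
    requests:          # what the scheduler reserves on the node
      cpu: 250m
      memory: 256Mi
    limits:            # hard ceiling; exceeding the memory limit gets the container OOMKilled
      memory: 512Mi
```

Setting a memory limit without a CPU limit is a common pattern: memory overuse is dangerous to neighbors, while CPU limits can cause unnecessary throttling.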

13. Pod Anti-Affinity for Critical Deployments

Replicas of the same service should spread across nodes. Otherwise, one node failure kills all replicas.

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["payment-api"]
          topologyKey: kubernetes.io/hostname

14. Horizontal Pod Autoscaler (HPA) for Variable Workloads

Fixed replica counts waste money at low traffic and crash at high traffic.

# List all HPAs and their status
kubectl get hpa --all-namespaces
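If no HPA exists yet, a minimal `autoscaling/v2` manifest looks like this sketch (requires metrics-server; the replica counts and threshold are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out above 70% of requested CPU
```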

15. Graceful Shutdown with preStop Hooks and terminationGracePeriodSeconds

Pods need time to drain connections before terminating. The default grace period of 30 seconds is often not enough.

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: payment-api
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 10"] # Wait for LB to deregister

Observability (Items 16-20)

16. Metrics Collection (Prometheus/Datadog/CloudWatch)

You cannot manage what you cannot measure. At minimum, collect CPU, memory, network, and disk I/O per pod and node.

# Verify Prometheus is scraping targets
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring &
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'

17. Centralized Logging

Pod logs disappear when pods restart. Send them to a centralized system — EFK stack, Loki, or a cloud logging service.

# Verify Fluentd/Fluent Bit is running on all nodes
kubectl get daemonset -n logging
kubectl get pods -n logging -o wide # Should have one pod per node

18. Alerting Rules for Critical Conditions

Metrics without alerts are just dashboards no one watches. Set up alerts for pod restarts, high error rates, node pressure, and PVC usage.

# Check Prometheus alerting rules
kubectl get prometheusrules --all-namespaces
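With the Prometheus Operator, alerts live in PrometheusRule resources. A sketch for the pod-restart alert, assuming kube-state-metrics is installed (names, labels, and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts
  namespace: monitoring
spec:
  groups:
  - name: pods
    rules:
    - alert: PodRestartingFrequently
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
      for: 10m                     # must hold for 10 minutes before firing
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted >3 times in the last hour"
```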

19. Distributed Tracing

For microservices, logs and metrics are not enough. You need traces to follow a request across services.

# Verify Jaeger or Tempo is running
kubectl get deployment -n tracing
kubectl get svc -n tracing

20. Dashboard for Cluster and Application Health

Grafana dashboards give your team a single pane of glass. Import community dashboards for Kubernetes (ID: 315, 6417, 13770).

# Verify Grafana is running
kubectl get svc grafana -n monitoring

Operations (Items 21-25)

21. etcd Backup Automated

etcd holds your entire cluster state. Lose it without a backup and you are rebuilding from scratch.

# For self-managed clusters, verify etcd backup CronJob
kubectl get cronjob -n kube-system | grep etcd

# For managed clusters, verify provider backup is enabled
# EKS: Automatic. GKE: Automatic. AKS: Check backup settings.
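For self-managed kubeadm clusters, the backup CronJob can be a sketch like this one. The certificate paths assume the default kubeadm layout, the image tag should match your etcd version, and in practice you would ship snapshots off the node (e.g. to object storage) rather than a hostPath:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"          # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true        # reach etcd on 127.0.0.1
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.9-0   # match your etcd version
            command:
            - /bin/sh
            - -c
            - >
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d%H%M).db
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - {name: etcd-certs, mountPath: /etc/kubernetes/pki/etcd, readOnly: true}
            - {name: backup, mountPath: /backup}
          volumes:
          - name: etcd-certs
            hostPath: {path: /etc/kubernetes/pki/etcd}
          - name: backup
            hostPath: {path: /var/backups/etcd}
```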

22. Disaster Recovery Plan Documented and Tested

A backup you have never restored is not a backup — it is a hope.

# Verify Velero is installed and backups are running
velero get backup-locations
velero get backups
velero get schedules
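If no schedule exists yet, a daily Velero Schedule is only a few lines. This sketch backs up the production namespace with 30-day retention (names and times are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production
  namespace: velero
spec:
  schedule: "0 2 * * *"            # 2 AM daily
  template:
    includedNamespaces: ["production"]
    ttl: 720h                      # keep backups for 30 days
```

Then actually test it: `velero restore create --from-backup <backup-name>` into a scratch cluster or namespace, at least once a quarter.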

23. Cluster Upgrade Strategy Defined

Kubernetes ships a minor release roughly every four months, and each version is supported for about a year. You need a plan for upgrading the control plane and nodes with zero downtime.

# Check current version and available upgrades
kubectl version # note: the --short flag was removed in kubectl 1.28
# EKS:
aws eks describe-cluster --name prod-cluster --query 'cluster.version'
# Check if you are more than 1 minor version behind

24. GitOps for All Deployments

No one should be running kubectl apply against production from their laptop. Use ArgoCD or Flux.

# Verify ArgoCD is running and apps are synced
kubectl get applications -n argocd
argocd app list # Check sync status
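An ArgoCD Application ties a Git path to a target namespace. A sketch with an illustrative repository URL:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests   # illustrative repo
    targetRevision: main
    path: apps/payment-api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```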

25. Runbooks for Common Incidents

When the pager fires at 3 AM, no one wants to think from first principles. Write runbooks for: pod CrashLoopBackOff, node NotReady, PVC full, certificate expired, OOMKilled.

# This one is not a kubectl command — it is a team process check:
# - Do runbooks exist in your wiki/repo?
# - Are they linked from your alerting tool?
# - Has every on-call engineer read them?
# - Were they tested in the last quarter?

The Checklist Summary

| # | Item | Category | Priority |
|---|------|----------|----------|
| 1 | HA control plane | Cluster | Critical |
| 2 | Node autoscaling | Cluster | Critical |
| 3 | Resource quotas | Cluster | High |
| 4 | PodDisruptionBudgets | Cluster | High |
| 5 | Separate node pools | Cluster | Medium |
| 6 | RBAC least privilege | Security | Critical |
| 7 | Network policies | Security | Critical |
| 8 | Pod Security Standards | Security | High |
| 9 | Secrets encryption at rest | Security | High |
| 10 | Image scanning in CI/CD | Security | High |
| 11 | Liveness/Readiness probes | Reliability | Critical |
| 12 | Resource requests/limits | Reliability | Critical |
| 13 | Pod anti-affinity | Reliability | High |
| 14 | HPA for variable workloads | Reliability | Medium |
| 15 | Graceful shutdown | Reliability | High |
| 16 | Metrics collection | Observability | Critical |
| 17 | Centralized logging | Observability | Critical |
| 18 | Alerting rules | Observability | Critical |
| 19 | Distributed tracing | Observability | Medium |
| 20 | Health dashboards | Observability | Medium |
| 21 | etcd backup automated | Operations | Critical |
| 22 | DR plan tested | Operations | Critical |
| 23 | Upgrade strategy | Operations | High |
| 24 | GitOps deployments | Operations | High |
| 25 | Incident runbooks | Operations | High |

Print this list. Tape it to your monitor. Go through it item by item before your next production launch. The items marked Critical are non-negotiable — skip them and you are building on sand. The High and Medium items are what separate a cluster that works from a cluster that works reliably at 3 AM on a Saturday.