Kubernetes Production Readiness Checklist — 25 Things Before Going Live
You have built your app, containerized it, written the Kubernetes manifests, and it works great in staging. Now someone says "let's go to production" and suddenly you are wondering what you forgot. This checklist exists because every production incident I have seen traces back to skipping something obvious during setup.
Run through these 25 items before your first production deploy. Each one includes a command to verify it, so this is not just theory — you can audit your cluster right now.
Cluster Setup (Items 1-5)
1. High Availability Control Plane
Your control plane must survive a node failure. On managed Kubernetes (EKS, GKE, AKS), this is automatic. On self-managed clusters, run 3+ control plane nodes across availability zones.
# Check control plane node count
kubectl get nodes -l node-role.kubernetes.io/control-plane -o wide
# On managed K8s, verify the cluster is multi-AZ
# EKS:
aws eks describe-cluster --name prod-cluster --query 'cluster.resourcesVpcConfig.subnetIds'
# Expect: subnets in 3 different AZs
2. Node Autoscaling Configured
Cluster Autoscaler or Karpenter must be running. Without it, a traffic spike leaves new pods stuck in Pending with nowhere to schedule.
# Check if Cluster Autoscaler is running
kubectl get deployment cluster-autoscaler -n kube-system
# Or check for Karpenter
kubectl get deployment karpenter -n karpenter
# Verify node pools / provisioners exist
kubectl get nodepools.karpenter.sh # Karpenter
aws eks list-nodegroups --cluster-name prod-cluster # EKS; then verify ASG min/max in the console
3. Resource Quotas on All Namespaces
Without quotas, one team can consume the entire cluster. Every production namespace needs a ResourceQuota.
# Check which namespaces have quotas
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
quota=$(kubectl get resourcequota -n $ns --no-headers 2>/dev/null | wc -l)
echo "$ns: $quota quotas"
done
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services.loadbalancers: "5"
4. PodDisruptionBudgets for Critical Services
PDBs prevent Kubernetes from evicting too many pods during node drains or upgrades. Without them, a rolling node upgrade can take your entire service down.
# List all PDBs
kubectl get pdb --all-namespaces
# Check if critical deployments have PDBs
kubectl get deployment -n production -o name | while read deploy; do
name=$(echo $deploy | cut -d/ -f2)
pdb=$(kubectl get pdb -n production -o jsonpath="{.items[?(@.spec.selector.matchLabels.app=='$name')].metadata.name}" 2>/dev/null)
echo "$name -> PDB: ${pdb:-MISSING}"
done
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
  namespace: production
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: payment-api
5. Separate Node Pools for System and Application Workloads
System components (monitoring, ingress, logging) should not compete with application pods for resources. Use dedicated node pools with taints.
# Check node labels and taints
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
POOL:.metadata.labels.node-pool,\
TAINTS:.spec.taints[*].key
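As a sketch of the pattern, taint the system pool and give system workloads a matching toleration plus a node selector. The `dedicated=system` key/value and the `node-pool: system` label are an assumed naming convention, not anything Kubernetes mandates:

```yaml
# Pod spec fragment for a system workload (e.g. an ingress controller).
# The "dedicated=system" taint and "node-pool: system" label are
# illustrative conventions -- match whatever your node pools actually use.
spec:
  nodeSelector:
    node-pool: system
  tolerations:
    - key: dedicated
      operator: Equal
      value: system
      effect: NoSchedule
```

Application pods without the toleration cannot land on the system pool, and the nodeSelector keeps system pods off application nodes.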
Security (Items 6-10)
6. RBAC with Least Privilege
No one should have cluster-admin unless they are a platform engineer. Developers get namespace-scoped roles.
# Find all cluster-admin bindings (should be minimal)
kubectl get clusterrolebindings -o json | \
jq -r '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name + " -> " + (.subjects[]?.name // "unknown")'
# Count total ClusterRoleBindings vs RoleBindings
echo "ClusterRoleBindings: $(kubectl get clusterrolebindings --no-headers | wc -l)"
echo "RoleBindings: $(kubectl get rolebindings --all-namespaces --no-headers | wc -l)"
7. Network Policies Enforced
By default, every pod can talk to every other pod. Network policies restrict traffic to only what is needed.
# Check if any network policies exist
kubectl get networkpolicy --all-namespaces
# Verify your CNI supports network policies (Calico, Cilium, Weave — yes; Flannel — no)
kubectl get pods -n kube-system -l k8s-app=calico-node # Calico
kubectl get pods -n kube-system -l k8s-app=cilium # Cilium
# Default deny all ingress in production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
8. Pod Security Standards Enforced
Pod Security Admission (PSA) replaces the deprecated PodSecurityPolicy. Enforce restricted or baseline profiles.
# Check namespace labels for PSA enforcement
kubectl get namespaces -o json | \
jq -r '.items[] | .metadata.name + ": " + (.metadata.labels["pod-security.kubernetes.io/enforce"] // "NOT SET")'
# Label production namespace with restricted enforcement
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
9. Secrets Encrypted at Rest
By default, Kubernetes stores Secrets base64-encoded (not encrypted) in etcd. Enable encryption at rest.
# On managed K8s, check provider docs. On self-managed:
# Verify encryption config exists on API server
ps aux | grep kube-apiserver | grep encryption-provider-config
# Test: create a secret and verify it is encrypted in etcd
kubectl create secret generic test-encryption --from-literal=key=value -n default
# Then check etcd directly — you should see encrypted data, not plaintext
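On self-managed clusters, the file passed via `--encryption-provider-config` follows the upstream EncryptionConfiguration format. A minimal sketch (the key below is a placeholder; generate your own with `head -c 32 /dev/urandom | base64`):

```yaml
# EncryptionConfiguration sketch for the kube-apiserver.
# The aescbc key is a placeholder -- never commit a real key to git.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>
      - identity: {}  # Fallback so existing plaintext secrets stay readable
```

Provider order matters: the first provider encrypts new writes, while later ones are only used to read existing data.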
10. Container Image Scanning in CI/CD
Never deploy unscanned images. Use Trivy, Grype, or Snyk in your CI pipeline.
# Scan an image with Trivy
trivy image --severity HIGH,CRITICAL myregistry/payment-api:v2.1.0
# In CI/CD (GitHub Actions example):
# - name: Scan image
# uses: aquasecurity/trivy-action@master
# with:
# image-ref: myregistry/payment-api:${{ github.sha }}
# exit-code: 1 # Fail the build on HIGH/CRITICAL vulns
Reliability (Items 11-15)
11. Liveness and Readiness Probes on Every Container
Without probes, Kubernetes cannot detect crashed or unhealthy containers. Traffic goes to broken pods.
# Find pods without readiness probes
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.containers[].readinessProbe == null) |
.metadata.namespace + "/" + .metadata.name'
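A probe sketch for reference. The `/ready` and `/healthz` paths and port 8080 are assumptions; point them at whatever health endpoints your app actually exposes:

```yaml
# Container spec fragment with both probes.
# Readiness gates traffic; liveness restarts a hung container.
containers:
  - name: payment-api
    readinessProbe:
      httpGet:
        path: /ready     # Assumed endpoint -- use your app's real path
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz   # Assumed endpoint
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

Keep the liveness probe more lenient than readiness: a flapping liveness probe turns a slow pod into a restart loop.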
12. Resource Requests and Limits Set
Pods without requests get evicted first during node pressure. Pods without limits can consume unbounded resources.
# Find containers without resource requests
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | .metadata.namespace + "/" + .metadata.name + " " +
(.spec.containers[] | select(.resources.requests == null) | .name)'
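For reference, a container spec fragment with both set. The numbers here are placeholders; size requests from observed usage (e.g. your Prometheus metrics), not guesses:

```yaml
# Requests reflect typical usage (used for scheduling and QoS);
# limits cap the worst case. Values below are illustrative only.
containers:
  - name: payment-api
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```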
13. Pod Anti-Affinity for Critical Deployments
Replicas of the same service should spread across nodes. Otherwise, one node failure kills all replicas.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: ["payment-api"]
            topologyKey: kubernetes.io/hostname
14. Horizontal Pod Autoscaler (HPA) for Variable Workloads
Fixed replica counts waste money at low traffic and crash at high traffic.
# List all HPAs and their status
kubectl get hpa --all-namespaces
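A minimal `autoscaling/v2` HPA sketch. The 70% CPU target and the 3-20 replica range are common starting points, not recommendations for every workload:

```yaml
# HPA targeting the payment-api Deployment on average CPU utilization.
# minReplicas/maxReplicas and the 70% target are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that CPU-based HPA requires resource requests (item 12) and a running metrics-server to work at all.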
15. Graceful Shutdown with preStop Hooks and terminationGracePeriodSeconds
Pods need time to drain connections before terminating. Default 30s is often not enough.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: payment-api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]  # Wait for LB to deregister
Observability (Items 16-20)
16. Metrics Collection (Prometheus/Datadog/CloudWatch)
You cannot manage what you cannot measure. At minimum, collect CPU, memory, network, and disk I/O per pod and node.
# Verify Prometheus is scraping targets
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring &
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'
17. Centralized Logging
Pod logs disappear when pods restart. Send them to a centralized system — EFK stack, Loki, or a cloud logging service.
# Verify Fluentd/Fluent Bit is running on all nodes
kubectl get daemonset -n logging
kubectl get pods -n logging -o wide # Should have one pod per node
18. Alerting Rules for Critical Conditions
Metrics without alerts are just dashboards no one watches. Set up alerts for pod restarts, high error rates, node pressure, and PVC usage.
# Check Prometheus alerting rules
kubectl get prometheusrules --all-namespaces
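As a sketch of what one such rule looks like (this assumes the Prometheus Operator and kube-state-metrics are installed; the threshold and labels are illustrative):

```yaml
# PrometheusRule sketch: alert when a container restarts repeatedly.
# Requires the Prometheus Operator CRDs and the
# kube_pod_container_status_restarts_total metric from kube-state-metrics.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts
  namespace: monitoring
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```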
19. Distributed Tracing
For microservices, logs and metrics are not enough. You need traces to follow a request across services.
# Verify Jaeger or Tempo is running
kubectl get deployment -n tracing
kubectl get svc -n tracing
20. Dashboard for Cluster and Application Health
Grafana dashboards give your team a single pane of glass. Import community dashboards for Kubernetes (ID: 315, 6417, 13770).
# Verify Grafana is running
kubectl get svc grafana -n monitoring
Operations (Items 21-25)
21. etcd Backup Automated
etcd holds your entire cluster state. Lose it without a backup and you are rebuilding from scratch.
# For self-managed clusters, verify etcd backup CronJob
kubectl get cronjob -n kube-system | grep etcd
# For managed clusters, verify provider backup is enabled
# EKS: Automatic. GKE: Automatic. AKS: Check backup settings.
22. Disaster Recovery Plan Documented and Tested
A backup you have never restored is not a backup — it is a hope.
# Verify Velero is installed and backups are running
velero get backup-locations
velero get backups
velero get schedules
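A Velero Schedule sketch for a recurring backup. The cron expression, namespace selection, and 30-day TTL are illustrative; adjust them to your RPO:

```yaml
# Daily 02:00 backup of the production namespace, kept for 30 days.
# Schedule name and retention are placeholders.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - production
    ttl: 720h
```

Then actually restore from one of these backups into a scratch cluster at least once a quarter; that is the "tested" part of this item.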
23. Cluster Upgrade Strategy Defined
Kubernetes releases every 4 months. You need a plan for upgrading control plane and nodes with zero downtime.
# Check current version and available upgrades
kubectl version # --short was removed in kubectl 1.28; compact output is now the default
# EKS:
aws eks describe-cluster --name prod-cluster --query 'cluster.version'
# Check if you are more than 1 minor version behind
24. GitOps for All Deployments
No one should be running kubectl apply against production from their laptop. Use ArgoCD or Flux.
# Verify ArgoCD is running and apps are synced
kubectl get applications -n argocd
argocd app list # Check sync status
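For reference, a minimal Argo CD Application sketch. The repo URL, path, and project are placeholders for your own GitOps repo layout:

```yaml
# Argo CD Application syncing a manifest directory into production.
# repoURL and path are placeholders -- point them at your real repo.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/k8s-manifests.git
    targetRevision: main
    path: apps/payment-api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true     # Delete resources removed from git
      selfHeal: true  # Revert manual kubectl drift
```

With `selfHeal` on, a hotfix applied by hand from a laptop gets reverted automatically, which is exactly the enforcement this item is about.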
25. Runbooks for Common Incidents
When the pager fires at 3 AM, no one wants to think from first principles. Write runbooks for: pod CrashLoopBackOff, node NotReady, PVC full, certificate expired, OOMKilled.
# This one is not a kubectl command — it is a team process check:
# - Do runbooks exist in your wiki/repo?
# - Are they linked from your alerting tool?
# - Has every on-call engineer read them?
# - Were they tested in the last quarter?
The Checklist Summary
| # | Item | Category | Priority |
|---|---|---|---|
| 1 | HA control plane | Cluster | Critical |
| 2 | Node autoscaling | Cluster | Critical |
| 3 | Resource quotas | Cluster | High |
| 4 | PodDisruptionBudgets | Cluster | High |
| 5 | Separate node pools | Cluster | Medium |
| 6 | RBAC least privilege | Security | Critical |
| 7 | Network policies | Security | Critical |
| 8 | Pod Security Standards | Security | High |
| 9 | Secrets encryption at rest | Security | High |
| 10 | Image scanning in CI/CD | Security | High |
| 11 | Liveness/Readiness probes | Reliability | Critical |
| 12 | Resource requests/limits | Reliability | Critical |
| 13 | Pod anti-affinity | Reliability | High |
| 14 | HPA for variable workloads | Reliability | Medium |
| 15 | Graceful shutdown | Reliability | High |
| 16 | Metrics collection | Observability | Critical |
| 17 | Centralized logging | Observability | Critical |
| 18 | Alerting rules | Observability | Critical |
| 19 | Distributed tracing | Observability | Medium |
| 20 | Health dashboards | Observability | Medium |
| 21 | etcd backup automated | Operations | Critical |
| 22 | DR plan tested | Operations | Critical |
| 23 | Upgrade strategy | Operations | High |
| 24 | GitOps deployments | Operations | High |
| 25 | Incident runbooks | Operations | High |
Print this list. Tape it to your monitor. Go through it item by item before your next production launch. The items marked Critical are non-negotiable — skip them and you are building on sand. The High and Medium items are what separate a cluster that works from a cluster that works reliably at 3 AM on a Saturday.
