Kubernetes Troubleshooting — CrashLoopBackOff, ImagePullBackOff, and Pending Pods
It is Friday afternoon. You deploy a new version of the payment service. The rollout stalls. Pods are stuck in CrashLoopBackOff. The previous version is still serving traffic (thanks to the rolling update strategy), but capacity is already degraded and the rollout will not recover on its own. You need a systematic approach, not a panicked kubectl delete pod loop.
Common Pod Failure States
Before diving into fixes, here is a reference table of the most common pod failure states and what each actually means:
| Status | What Happened | First Command to Run |
|---|---|---|
| CrashLoopBackOff | Container starts, crashes, restarts, crashes again | kubectl logs <pod> --previous |
| ImagePullBackOff | Kubernetes cannot pull the container image | kubectl describe pod <pod> |
| Pending | Pod cannot be scheduled to any node | kubectl describe pod <pod> (check Events) |
| OOMKilled | Container exceeded its memory limit | kubectl describe pod <pod> (check Last State) |
| Evicted | Node ran out of disk or memory | kubectl describe pod <pod> |
| CreateContainerConfigError | ConfigMap or Secret reference is wrong | kubectl describe pod <pod> |
| Init:Error | Init container failed | kubectl logs <pod> -c <init-container> |
| Terminating (stuck) | Finalizers or preStop hook hanging | kubectl get pod <pod> -o yaml (check finalizers) |
The Systematic Troubleshooting Flow
Regardless of the failure, follow this four-step process:
# Step 1: Check events — this is where K8s tells you WHAT happened
kubectl describe pod <pod-name> -n <namespace>
# Scroll to the "Events" section at the bottom
# Step 2: Check logs — this is where your APPLICATION tells you WHY
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Logs from the crashed container
kubectl logs <pod-name> -n <namespace> -c <container> # Specific container in multi-container pod
# Step 3: Check the resource definition — is the YAML correct?
kubectl get pod <pod-name> -n <namespace> -o yaml
# Step 4: Exec into the container — test from inside
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
Debugging CrashLoopBackOff
CrashLoopBackOff means the container starts, fails, and Kubernetes keeps restarting it with exponential backoff (10s, 20s, 40s, doubling up to a cap of 5 minutes). The root cause is almost always in the application or its configuration, not in Kubernetes itself.
Common causes and fixes:
# 1. Application error — check previous container logs
kubectl logs payment-api-7d9b4-x2k8m --previous
# Look for: stack traces, missing env vars, failed DB connections
# 2. Wrong command or entrypoint
kubectl get pod payment-api-7d9b4-x2k8m -o jsonpath='{.spec.containers[0].command}'
# Fix: Check that the command in your Deployment matches the image's expected entrypoint
# 3. Missing config — ConfigMap or Secret not mounted correctly
kubectl describe pod payment-api-7d9b4-x2k8m | grep -A5 "Environment"
# Look for: <unset>, missing volume mounts
# 4. Health check failing too fast
kubectl describe pod payment-api-7d9b4-x2k8m | grep -A10 "Liveness"
# Fix: Increase initialDelaySeconds, lower failureThreshold during startup
A common pattern: the app needs 30 seconds to start, but the liveness probe checks at 10 seconds and kills the container before it is ready. This creates a restart loop that looks like CrashLoopBackOff but is actually a misconfigured probe.
# Fix: Use a startup probe for slow-starting containers
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 2        # Allows up to 60 seconds (30 × 2s) for startup
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
Debugging ImagePullBackOff
Kubernetes cannot download the container image. The Events section of kubectl describe pod will tell you exactly why.
kubectl describe pod frontend-5c8b9-rn2vl | grep -A10 "Events"
# Common messages:
# "Failed to pull image": repository does not exist or authentication required
# "manifest unknown": the tag does not exist
# "unauthorized": registry credentials are missing or expired
Fixes for each scenario:
# Wrong image name or tag — verify it exists
docker pull myregistry.io/frontend:v3.2.1
# If this fails locally, the image does not exist
# Private registry — create or update the image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.io \
  --docker-username=myuser \
  --docker-password=mypassword \
  --namespace=production
# Reference it in your deployment under spec.template.spec:
#   imagePullSecrets:
#   - name: regcred
# Expired credentials — delete and recreate the secret
kubectl delete secret regcred -n production
kubectl create secret docker-registry regcred ...
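The secret only takes effect once the pod spec references it. A minimal Deployment fragment wiring it up (names match the example secret above):

```yaml
# Deployment fragment — the pull secret must be listed in the POD TEMPLATE,
# not at the Deployment's top level
spec:
  template:
    spec:
      imagePullSecrets:
      - name: regcred
      containers:
      - name: frontend
        image: myregistry.io/frontend:v3.2.1
```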
Debugging Pending Pods
A Pending pod means the scheduler cannot find a node to place it on. This is almost always a resource or constraint issue.
kubectl describe pod database-0 | grep -A20 "Events"
# Common messages:
# "Insufficient cpu" — no node has enough CPU
# "Insufficient memory" — no node has enough memory
# "0/3 nodes are available: 3 node(s) had taint" — taints blocking scheduling
# "persistentvolumeclaim not found" — PVC does not exist
# "0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims"
# Check available resources on nodes
kubectl describe nodes | grep -A6 "Allocated resources"
# Check if taints are blocking
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check if the PVC exists and is bound
kubectl get pvc -n production
# STATUS should be "Bound", not "Pending"
# Check if node affinity or nodeSelector is too restrictive
kubectl get pod database-0 -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels | grep <required-label>
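If the Events show a taint blocking scheduling and the pod is actually supposed to run on those nodes, add a matching toleration. A sketch, assuming a hypothetical dedicated=database:NoSchedule taint on the target nodes:

```yaml
# Pod spec fragment — tolerates the (hypothetical) taint
# dedicated=database:NoSchedule so the scheduler will consider those nodes
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "database"
  effect: "NoSchedule"
```

Note that a toleration only permits scheduling onto tainted nodes; pair it with a nodeSelector or node affinity if the pod must also land there.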
Debugging OOMKilled
OOMKilled means the container used more memory than its resources.limits.memory allows. The Linux OOM killer terminates the process.
# Confirm OOMKilled
kubectl describe pod worker-abc123 | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Check the memory limit
kubectl get pod worker-abc123 -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Output: 256Mi — probably too low
Fix by either optimizing the application's memory usage or increasing the limit:
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi   # Increased from 256Mi
Use the Vertical Pod Autoscaler (VPA) in recommendation mode to find the right values based on actual usage.
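In recommendation mode the VPA observes actual usage and publishes suggested requests without evicting or resizing anything. A sketch, assuming the worker Deployment from the example above and that the VPA CRDs are installed in the cluster:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Off"   # Recommend only — never evict or resize pods
```

Read the recommendations with kubectl describe vpa worker-vpa and copy them into resources.requests once you trust them.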
Debugging Node Issues
When a node goes NotReady, its pods are marked unreachable, and after the default eviction timeout (about five minutes) they are terminated and rescheduled onto other nodes.
# Check node status and conditions
kubectl get nodes
kubectl describe node node-3 | grep -A15 "Conditions"
# Look for: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
# Check kubelet logs on the node (SSH required)
journalctl -u kubelet -f --lines=100
# Check system resources on the node
kubectl top nodes
# NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-1   2450m        61%    12480Mi         78%
# node-2   1890m        47%    8960Mi          56%
# node-3   3900m        97%    15200Mi         95%   ← Problem
Debugging Services (No Endpoints)
You created a Service, but traffic is not reaching your pods. The most common cause: a label selector mismatch.
# Check if the Service has endpoints
kubectl get endpoints my-service -n production
# If ENDPOINTS is <none>, the selector does not match any pods
# Compare the Service selector with pod labels
kubectl get svc my-service -n production -o jsonpath='{.spec.selector}'
# {"app":"my-app","version":"v2"}
kubectl get pods -n production -l app=my-app,version=v2
# If no pods returned, the labels do not match
# Check if pods are Ready (only Ready pods are added to endpoints)
kubectl get pods -n production -l app=my-app -o wide
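The mismatch usually comes from the Service selector and the pod template labels drifting apart. Side by side, the two fields that must agree (names are illustrative):

```yaml
# Service — spec.selector must match the POD TEMPLATE labels exactly
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
    version: v2
  ports:
  - port: 80
    targetPort: 8080
---
# Deployment — spec.template.metadata.labels is what the Service matches,
# not metadata.labels on the Deployment itself
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
      version: v2
  template:
    metadata:
      labels:
        app: my-app
        version: v2
    spec:
      containers:
      - name: my-app
        image: my-app:v2
        ports:
        - containerPort: 8080
```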
Ephemeral Containers with kubectl debug
Some containers are built on distroless or scratch images — they have no shell, no curl, no debugging tools. Use ephemeral containers to attach a debug container to the running pod:
# Attach a debug container with full networking tools
kubectl debug -it payment-api-7d9b4-x2k8m \
  --image=nicolaka/netshoot \
  --target=payment-api
# Now you can run debug commands inside the pod's network namespace
curl localhost:8080/health
nslookup postgres-service.production.svc.cluster.local
tcpdump -i eth0 port 5432
ss -tlnp
# Debug a CrashLoopBackOff pod by copying it with a different command
kubectl debug payment-api-7d9b4-x2k8m -it \
  --copy-to=debug-pod \
  --container=payment-api \
  -- /bin/sh
# This creates a copy of the pod but overrides the entrypoint to /bin/sh
# so you can inspect the filesystem and environment without the app crashing
Common Network Issues Checklist
When network connectivity fails between pods or services, work through this checklist:
# 1. Can the pod resolve DNS?
kubectl exec -it <pod> -- nslookup kubernetes.default
# 2. Can the pod reach the Service IP?
kubectl exec -it <pod> -- curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>
# 3. Is a NetworkPolicy blocking traffic?
kubectl get networkpolicies -n <namespace>
# 4. Is kube-proxy running on the node?
kubectl get pods -n kube-system -l k8s-app=kube-proxy
# 5. Are iptables/ipvs rules correct?
kubectl exec -it <kube-proxy-pod> -n kube-system -- iptables -t nat -L KUBE-SERVICES | grep <service-name>
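If step 3 turns up a NetworkPolicy, check whether it selects your pods and whether any rule allows the traffic — once any policy selects a pod, all other ingress to it is denied by default. A minimal allow rule, with illustrative labels:

```yaml
# Allows ingress to app=my-app pods on TCP 8080 from pods labeled
# role=frontend in the same namespace; everything else stays blocked
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8080
```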
Wrapping Up
Kubernetes troubleshooting does not have to be a guessing game. Start with events (kubectl describe), then check logs (kubectl logs --previous), then inspect the resource definition, and finally exec into the pod. Every failure state has a predictable set of causes, and the four-step flow will get you to the answer faster than randomly deleting pods and redeploying.
Once you can monitor, log, secure, and troubleshoot your cluster, the next challenge is deploying to it reliably. In the next post, we will set up GitOps with ArgoCD — declarative, Git-driven deployments that eliminate kubectl apply from your workflow entirely.
