
Kubernetes Troubleshooting — CrashLoopBackOff, ImagePullBackOff, and Pending Pods

8 min read
Goel Academy
DevOps & Cloud Learning Hub

It is Friday afternoon. You deploy a new version of the payment service. The rollout stalls. Pods are stuck in CrashLoopBackOff. The previous version is still serving traffic (thanks to rolling updates), but if you do not fix this soon, the old ReplicaSet will scale down and you have an outage. You need a systematic approach, not a panicked kubectl delete pod loop.

Common Pod Failure States

Before diving into fixes, here is a reference table of the pod statuses you are most likely to encounter and what each actually means:

Status | What Happened | First Command to Run
--- | --- | ---
CrashLoopBackOff | Container starts, crashes, restarts, crashes again | kubectl logs <pod> --previous
ImagePullBackOff | Kubernetes cannot pull the container image | kubectl describe pod <pod>
Pending | Pod cannot be scheduled to any node | kubectl describe pod <pod> (check Events)
OOMKilled | Container exceeded its memory limit | kubectl describe pod <pod> (check Last State)
Evicted | Node ran out of disk or memory | kubectl describe pod <pod>
CreateContainerConfigError | ConfigMap or Secret reference is wrong | kubectl describe pod <pod>
Init:Error | Init container failed | kubectl logs <pod> -c <init-container>
Terminating (stuck) | Finalizers or preStop hook hanging | kubectl get pod <pod> -o yaml (check finalizers)
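The table above condenses into a small triage helper. This is just an illustrative sketch — the pod_triage function name and its output format are our own, not a real kubectl feature:

```shell
#!/bin/sh
# pod_triage: given a pod name and its status, print the first
# diagnostic command worth running (per the table above).
pod_triage() {
  pod="$1"; status="$2"
  case "$status" in
    CrashLoopBackOff)  echo "kubectl logs $pod --previous" ;;
    Init:Error)        echo "kubectl logs $pod -c <init-container>" ;;
    Terminating)       echo "kubectl get pod $pod -o yaml" ;;
    *)                 echo "kubectl describe pod $pod" ;;
  esac
}

pod_triage payment-api-7d9b4-x2k8m CrashLoopBackOff
# kubectl logs payment-api-7d9b4-x2k8m --previous
```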

The Systematic Troubleshooting Flow

Regardless of the failure, follow this four-step process:

# Step 1: Check events — this is where K8s tells you WHAT happened
kubectl describe pod <pod-name> -n <namespace>
# Scroll to the "Events" section at the bottom

# Step 2: Check logs — this is where your APPLICATION tells you WHY
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Logs from the crashed container
kubectl logs <pod-name> -n <namespace> -c <container> # Specific container in multi-container pod

# Step 3: Check the resource definition — is the YAML correct?
kubectl get pod <pod-name> -n <namespace> -o yaml

# Step 4: Exec into the container — test from inside
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

Debugging CrashLoopBackOff

CrashLoopBackOff means the container starts, fails, and Kubernetes keeps restarting it with exponential backoff (10s, 20s, 40s, doubling up to a 5-minute cap). The root cause is almost always inside the container or its pod spec — Kubernetes is only the messenger.
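The backoff schedule can be sketched in a few lines of shell — purely illustrative arithmetic, doubling from 10 seconds and capping at 300 (the 5-minute ceiling):

```shell
# Print the approximate wait before each restart in CrashLoopBackOff:
# the delay doubles from 10s and is capped at 300s (5 minutes).
delay=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
# restart 1: wait 10s ... restart 6: wait 300s (and stays at 300s)
```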

Common causes and fixes:

# 1. Application error — check previous container logs
kubectl logs payment-api-7d9b4-x2k8m --previous
# Look for: stack traces, missing env vars, failed DB connections

# 2. Wrong command or entrypoint
kubectl get pod payment-api-7d9b4-x2k8m -o jsonpath='{.spec.containers[0].command}'
# Fix: Check that the command in your Deployment matches the image's expected entrypoint

# 3. Missing config — ConfigMap or Secret not mounted correctly
kubectl describe pod payment-api-7d9b4-x2k8m | grep -A5 "Environment"
# Look for: <unset>, missing volume mounts

# 4. Health check failing too fast
kubectl describe pod payment-api-7d9b4-x2k8m | grep -A10 "Liveness"
# Fix: Increase initialDelaySeconds, lower failureThreshold during startup

A common pattern: the app needs 30 seconds to start, but the liveness probe fails at 10 seconds and kills the container before it is ready. The pod shows CrashLoopBackOff even though the application itself is healthy — the probe is simply misconfigured.

# Fix: Use a startup probe for slow-starting containers
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 2        # Allows up to 60 seconds for startup
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

Debugging ImagePullBackOff

Kubernetes cannot download the container image. The Events section of kubectl describe pod will tell you exactly why.

kubectl describe pod frontend-5c8b9-rn2vl | grep -A10 "Events"
# Common messages:
# "Failed to pull image": repository does not exist or authentication required
# "manifest unknown": the tag does not exist
# "unauthorized": registry credentials are missing or expired

Fixes for each scenario:

# Wrong image name or tag — verify it exists
docker pull myregistry.io/frontend:v3.2.1
# If this also fails locally, the image or tag does not exist — or your credentials lack access

# Private registry — create or update the image pull secret
kubectl create secret docker-registry regcred \
--docker-server=myregistry.io \
--docker-username=myuser \
--docker-password=mypassword \
--namespace=production

# Reference it in your deployment
# spec.template.spec.imagePullSecrets:
# - name: regcred

# Expired credentials — delete and recreate the secret
kubectl delete secret regcred -n production
kubectl create secret docker-registry regcred ...
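For reference, here is how the pull secret is wired into a Deployment manifest. This is an illustrative fragment — the deployment name, labels, and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend            # illustrative name
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      imagePullSecrets:
        - name: regcred     # must exist in the same namespace as the pod
      containers:
        - name: frontend
          image: myregistry.io/frontend:v3.2.1
```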

Debugging Pending Pods

A Pending pod means the scheduler cannot find a node to place it on. This is almost always a resource or constraint issue.

kubectl describe pod database-0 | grep -A20 "Events"
# Common messages:
# "Insufficient cpu" — no node has enough CPU
# "Insufficient memory" — no node has enough memory
# "0/3 nodes are available: 3 node(s) had taint" — taints blocking scheduling
# "persistentvolumeclaim not found" — PVC does not exist
# "0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims"

# Check available resources on nodes
kubectl describe nodes | grep -A6 "Allocated resources"

# Check if taints are blocking
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check if the PVC exists and is bound
kubectl get pvc -n production
# STATUS should be "Bound", not "Pending"

# Check if node affinity or nodeSelector is too restrictive
kubectl get pod database-0 -o jsonpath='{.spec.nodeSelector}'
kubectl get nodes --show-labels | grep <required-label>

Debugging OOMKilled

OOMKilled means the container used more memory than its resources.limits.memory allows. The Linux OOM killer terminates the process.

# Confirm OOMKilled
kubectl describe pod worker-abc123 | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137

# Check the memory limit
kubectl get pod worker-abc123 -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Output: 256Mi — probably too low
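Exit code 137 is not arbitrary: when a process is killed by a signal, the exit code reported is 128 plus the signal number, and the OOM killer sends SIGKILL (signal 9). A quick sanity check:

```shell
# Exit codes above 128 encode the terminating signal: 128 + signal number.
# SIGKILL is signal 9, so an OOM-killed container exits with 137.
echo $(( 128 + 9 ))
# 137
```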

Fix by either optimizing the application's memory usage or increasing the limit:

resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi         # Increased from 256Mi

Use the Vertical Pod Autoscaler (VPA) in recommendation mode to find the right values based on actual usage.
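A minimal sketch of a VPA in recommendation-only mode follows. This assumes the VPA components are installed in your cluster (they are not part of core Kubernetes), and the target name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa          # illustrative name
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  updatePolicy:
    updateMode: "Off"       # recommend only — never evict or resize pods
```

With updateMode "Off", read the recommendations via kubectl describe vpa worker-vpa and copy them into your resource requests yourself.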

Debugging Node Issues

When a node goes NotReady, the pods on it become unavailable; after the default 5-minute toleration they are evicted and rescheduled onto healthy nodes.

# Check node status and conditions
kubectl get nodes
kubectl describe node node-3 | grep -A15 "Conditions"
# Look for: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable

# Check kubelet logs on the node (SSH required)
journalctl -u kubelet -f --lines=100

# Check system resources on the node
kubectl top nodes
# NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-1   2450m        61%    12480Mi         78%
# node-2   1890m        47%    8960Mi          56%
# node-3   3900m        97%    15200Mi         95%   ← Problem

Debugging Services (No Endpoints)

You created a Service, but traffic is not reaching your pods. The most common cause: a label selector mismatch.

# Check if the Service has endpoints
kubectl get endpoints my-service -n production
# If ENDPOINTS is <none>, the selector does not match any pods

# Compare the Service selector with pod labels
kubectl get svc my-service -n production -o jsonpath='{.spec.selector}'
# {"app":"my-app","version":"v2"}

kubectl get pods -n production -l app=my-app,version=v2
# If no pods returned, the labels do not match

# Check if pods are Ready (only Ready pods are added to endpoints)
kubectl get pods -n production -l app=my-app -o wide

Ephemeral Containers with kubectl debug

Some containers are built on distroless or scratch images — they have no shell, no curl, no debugging tools. Use ephemeral containers to attach a debug container to the running pod:

# Attach a debug container with full networking tools
kubectl debug -it payment-api-7d9b4-x2k8m \
--image=nicolaka/netshoot \
--target=payment-api

# Now you can run debug commands inside the pod's network namespace
curl localhost:8080/health
nslookup postgres-service.production.svc.cluster.local
tcpdump -i eth0 port 5432
ss -tlnp

# Debug a CrashLoopBackOff pod by copying it with a different command
kubectl debug payment-api-7d9b4-x2k8m -it \
--copy-to=debug-pod \
--container=payment-api \
-- /bin/sh
# This creates a copy of the pod but overrides the entrypoint to /bin/sh
# so you can inspect the filesystem and environment without the app crashing

Common Network Issues Checklist

When network connectivity fails between pods or services, work through this checklist:

# 1. Can the pod resolve DNS?
kubectl exec -it <pod> -- nslookup kubernetes.default

# 2. Can the pod reach the Service IP?
kubectl exec -it <pod> -- curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>

# 3. Is a NetworkPolicy blocking traffic?
kubectl get networkpolicies -n <namespace>

# 4. Is kube-proxy running on the node?
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# 5. Are iptables/ipvs rules correct?
kubectl exec -it <kube-proxy-pod> -n kube-system -- iptables -t nat -L KUBE-SERVICES | grep <service-name>

Wrapping Up

Kubernetes troubleshooting does not have to be a guessing game. Start with events (kubectl describe), then check logs (kubectl logs --previous), then inspect the resource definition, and finally exec into the pod. Every failure state has a predictable set of causes, and the four-step flow will get you to the answer faster than randomly deleting pods and redeploying.

Once you can monitor, log, secure, and troubleshoot your cluster, the next challenge is deploying to it reliably. In the next post, we will set up GitOps with ArgoCD — declarative, Git-driven deployments that eliminate kubectl apply from your workflow entirely.