Docker Health Checks — Don't Route Traffic to Dead Containers
Your container is "running." The process has PID 1, Docker says status is Up 47 minutes, and everything looks fine. Except the application inside crashed 20 minutes ago, the event loop is deadlocked, or the database connection pool is exhausted. Traffic keeps flowing in. Users keep getting 502 errors. Docker has no idea anything is wrong because "running" and "healthy" are not the same thing.
Why Running Does Not Mean Healthy
Docker monitors one thing by default: is PID 1 alive? If the main process is running, the container is "Up." But applications fail in ways that keep the process alive:
- Deadlocks. The application thread is stuck waiting on a lock. The process is alive. No requests are served.
- Connection pool exhaustion. All database connections are leaked. The app is alive but every query times out.
- Memory pressure. The app is swapping to disk. Technically alive, effectively dead.
- Dependency failure. The upstream API is down. The app starts but cannot do anything useful.
- Corrupted state. A background job failed silently. The app serves stale or incorrect data.
Without health checks, Docker (and any load balancer in front of it) will keep sending traffic to a container that cannot handle it.
The HEALTHCHECK Instruction
Add a HEALTHCHECK to your Dockerfile to tell Docker how to verify your application is actually working.
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
CMD wget -q --spider http://localhost:3000/health || exit 1
CMD ["node", "server.js"]
| Parameter | Default | What It Does |
|---|---|---|
| --interval | 30s | Time between health checks |
| --timeout | 30s | How long a single check may run before it counts as failed |
| --start-period | 0s | Startup grace period; failures during it do not count toward retries |
| --retries | 3 | Consecutive failures before marking unhealthy |
The --start-period is critical. Java applications, for example, might take 30-60 seconds to start. Without a start period, those early failed checks count against the retry budget, and the container can be flagged unhealthy before it has even finished booting.
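For a slow-booting JVM service, a sketch might look like the following. This is a hypothetical Spring Boot image; the jar name, port, and the /actuator/health path are assumptions for illustration:

```dockerfile
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY app.jar .
# Give the JVM a full minute before failed checks start counting against retries
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
  CMD wget -q --spider http://localhost:8080/actuator/health || exit 1
CMD ["java", "-jar", "app.jar"]
```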
Health Check Commands
HTTP Health Check
The most common pattern. Your application exposes a /health endpoint that returns 200 when everything is working.
# Using wget (available in Alpine)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD wget -q --spider http://localhost:8080/health || exit 1
# Using curl (if installed)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl --fail --silent http://localhost:8080/health || exit 1
Your health endpoint should check actual dependencies, not just return 200:
// Express.js health endpoint
app.get('/health', async (req, res) => {
try {
// Check database connection
await db.query('SELECT 1');
// Check Redis connection
await redis.ping();
res.status(200).json({ status: 'healthy' });
} catch (err) {
res.status(503).json({ status: 'unhealthy', error: err.message });
}
});
TCP Health Check
For services that do not speak HTTP (databases, message queues, custom TCP services):
# Check if a TCP port is accepting connections
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD nc -z localhost 5432 || exit 1
# PostgreSQL health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD pg_isready -U postgres || exit 1
# Redis health check
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
CMD redis-cli ping | grep -q PONG || exit 1
# MySQL health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD mysqladmin ping -h localhost -u root --password=$MYSQL_ROOT_PASSWORD || exit 1
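Minimal images sometimes ship without nc. If the image has bash, you can fall back on its built-in /dev/tcp pseudo-device. A sketch; the host and port are placeholders:

```shell
# Probe a TCP port via bash's /dev/tcp pseudo-device; prints "open" or "closed".
probe() {
  host="$1"
  port="$2"
  if bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

probe localhost 5432
```

Inside a HEALTHCHECK this collapses to a one-liner: CMD bash -c 'exec 3<>/dev/tcp/localhost/5432' || exit 1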
Script-Based Health Check
For complex checks that need multiple validations:
#!/bin/sh
# healthcheck.sh
# Check 1: HTTP endpoint responds
wget -q --spider http://localhost:8080/health || exit 1
# Check 2: Disk usage below 90%
DISK_USAGE=$(df -P /app/data | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
exit 1
fi
# Check 3: No zombie processes (the [d] bracket trick keeps grep itself out of the count)
ZOMBIES=$(ps aux | grep -c '[d]efunct')
if [ "$ZOMBIES" -gt 5 ]; then
exit 1
fi
exit 0
COPY healthcheck.sh /healthcheck.sh
RUN chmod +x /healthcheck.sh
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD /healthcheck.sh
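Parsing df output is the fragile part of a script like this. You can sanity-check the parsing outside a container by feeding it a canned line; the numbers below are made up:

```shell
# A fabricated df -P output line: filesystem, blocks, used, available, use%, mount
LINE="/dev/sda1 102400 92160 10240 90% /app/data"

# Same pipeline as the health check script: grab field 5, strip the % sign
DISK_USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
echo "$DISK_USAGE"

# 90 is not greater than 90, so this line prints "ok"
[ "$DISK_USAGE" -gt 90 ] && echo "over threshold" || echo "ok"
```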
Health States
Docker tracks three health states for containers with health checks configured:
| State | Meaning | Container Running? | Traffic Routed? |
|---|---|---|---|
| starting | Within start-period, check not yet passed | Yes | Depends on config |
| healthy | Last N checks passed | Yes | Yes |
| unhealthy | Last N checks failed (retries exceeded) | Yes | Should be stopped |
# Check health status
docker inspect myapp --format '{{.State.Health.Status}}'
# healthy
# See recent health check results
docker inspect myapp --format '{{json .State.Health}}' | python3 -m json.tool
{
"Status": "unhealthy",
"FailingStreak": 5,
"Log": [
{
"Start": "2025-06-07T10:30:00.123Z",
"End": "2025-06-07T10:30:05.456Z",
"ExitCode": 1,
"Output": "wget: server returned error: HTTP/1.1 503 Service Unavailable"
}
]
}
Filter Containers by Health
# List only healthy containers
docker ps --filter health=healthy
# List unhealthy containers (potential problems)
docker ps --filter health=unhealthy
# List containers still starting up
docker ps --filter health=starting
# Combine with format for monitoring scripts
docker ps --filter health=unhealthy --format "{{.Names}}: {{.Status}}"
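That filter output pipes cleanly into scripts. A tiny hypothetical formatter for an alerting hook might look like this; the function and message format are illustrative, not part of Docker:

```shell
# Turn "name status..." lines (as produced by a format string like
# "{{.Names}} {{.Status}}") into one alert line per container.
format_alerts() {
  while read -r name rest; do
    echo "ALERT: $name is unhealthy ($rest)"
  done
}

# In real use: docker ps --filter health=unhealthy --format "{{.Names}} {{.Status}}" | format_alerts
printf 'api Up 10 minutes (unhealthy)\n' | format_alerts
```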
Health Checks in Docker Compose
Compose health checks become powerful with depends_on conditions. Instead of just waiting for a container to start, you can wait until it is actually ready.
# docker-compose.yml
services:
db:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: secret
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
api:
build: .
ports:
- "3000:3000"
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "wget --spider -q http://localhost:3000/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 15s
nginx:
image: nginx:alpine
ports:
- "80:80"
depends_on:
api:
condition: service_healthy
With condition: service_healthy, the api container will not start until both db and redis report healthy. And nginx will not start until api is healthy. This solves the classic "app crashes because the database is not ready yet" problem.
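Outside Compose, in a deploy script for instance, you can get the same effect by polling docker inspect. A sketch; wait_for is a hypothetical helper that takes the probe command as arguments, so it is easy to test in isolation:

```shell
#!/bin/sh
# Run the given command every 2s until it prints "healthy", giving up after 30 tries.
wait_for() {
  tries=0
  while [ "$tries" -lt 30 ]; do
    [ "$("$@" 2>/dev/null)" = "healthy" ] && return 0
    tries=$((tries + 1))
    sleep 2
  done
  return 1
}

# Typical use before running migrations:
# wait_for docker inspect db --format '{{.State.Health.Status}}'
```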
Override Health Check at Runtime
You can override or disable health checks without modifying the Dockerfile:
# Override health check command
docker run -d \
--health-cmd "curl -f http://localhost:8080/ready || exit 1" \
--health-interval 15s \
--health-timeout 5s \
--health-retries 3 \
--health-start-period 30s \
myapp:latest
# Disable health check entirely (testing/debugging)
docker run -d --no-healthcheck myapp:latest
Orchestrator Comparison
Docker health checks map to similar concepts in orchestration platforms, but each has its own nuances.
| Feature | Docker HEALTHCHECK | K8s Liveness Probe | K8s Readiness Probe | K8s Startup Probe |
|---|---|---|---|---|
| Purpose | Container health | Restart on failure | Traffic routing | Startup protection |
| Check types | CMD only | HTTP, TCP, exec, gRPC | HTTP, TCP, exec, gRPC | HTTP, TCP, exec, gRPC |
| On failure | Marks unhealthy | Restarts container | Removes from Service | Delays liveness check |
| Built-in restart | No (needs policy) | Yes (automatic) | No | No |
Kubernetes liveness probes are more aggressive — they actually restart the container on failure. Docker health checks only mark the container as unhealthy. A restart policy gives you automatic recovery from crashes, but note the gap: restart policies fire when the process exits, so plain Docker will never restart a container that is unhealthy yet still running. Swarm mode replaces unhealthy tasks automatically; standalone Docker needs an external watcher for that.
# Restart policy recovers from crashes; the health check surfaces a hung-but-running app
docker run -d --name api \
--restart=on-failure:5 \
--health-cmd "wget --spider -q http://localhost:3000/health || exit 1" \
--health-interval 30s \
--health-retries 3 \
myapp:latest
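To close the gap for unhealthy-but-running containers on standalone Docker, one community option is the willfarrell/autoheal image, which watches the Docker socket and restarts containers that report unhealthy. A typical invocation looks like the following; treat it as a starting point and check the image's documentation for current flags:

```shell
docker run -d --name autoheal \
  --restart=always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
```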
Debugging Unhealthy Containers
When a container goes unhealthy, here is how to diagnose the problem:
# Step 1: Check the health log
docker inspect myapp --format '{{json .State.Health.Log}}' | python3 -m json.tool
# Step 2: Run the health check manually
docker exec myapp wget --spider -q http://localhost:3000/health
echo $?
# 0 = healthy, 1 = unhealthy
# Step 3: Check if the health endpoint works
docker exec myapp wget -qO- http://localhost:3000/health
# Step 4: Check resource usage (maybe OOM or CPU starved)
docker stats --no-stream myapp
# Step 5: Check application logs
docker logs --tail 50 myapp
Common health check failures:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Unhealthy immediately on start | start-period too short | Increase --start-period |
| Intermittent unhealthy | Timeout too aggressive | Increase --timeout, reduce --interval |
| Healthy then permanently unhealthy | Application issue (OOM, deadlock) | Check app logs, increase resources |
| Health check command not found | Missing tool in image | Install wget/curl or use a different check |
Wrapping Up
Health checks are a simple addition that prevents a significant class of outages — the kind where your monitoring says everything is "green" because the container is running, but users are getting errors because the application inside is broken. Add an HTTP health check to every web service, use depends_on: condition: service_healthy in Compose, and combine health checks with restart policies for self-healing containers.
In the next post, we will cover Docker Environment Variables — the right way to handle configuration across development, staging, and production environments without baking secrets into your images.
