Docker Health Checks — Don't Route Traffic to Dead Containers

8 min read
Goel Academy
DevOps & Cloud Learning Hub

Your container is "running." The process has PID 1, Docker says status is Up 47 minutes, and everything looks fine. Except the application inside crashed 20 minutes ago, the event loop is deadlocked, or the database connection pool is exhausted. Traffic keeps flowing in. Users keep getting 502 errors. Docker has no idea anything is wrong because "running" and "healthy" are not the same thing.

Why Running Does Not Mean Healthy

Docker monitors one thing by default: is PID 1 alive? If the main process is running, the container is "Up." But applications fail in ways that keep the process alive:

  • Deadlocks. The application thread is stuck waiting on a lock. The process is alive. No requests are served.
  • Connection pool exhaustion. All database connections are leaked. The app is alive but every query times out.
  • Memory pressure. The app is swapping to disk. Technically alive, effectively dead.
  • Dependency failure. The upstream API is down. The app starts but cannot do anything useful.
  • Corrupted state. A background job failed silently. The app serves stale or incorrect data.

Without health checks, Docker (and any load balancer in front of it) will keep sending traffic to a container that cannot handle it.
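
To see the gap for yourself, compare what Docker reports with what the application actually does. A quick check like the one below (assuming a container named myapp listening on port 3000; adjust names and ports to your setup) will happily report the container as Up even while the app inside is unresponsive:

# Docker only reports process status, not application health
docker ps --filter name=myapp --format "{{.Names}}: {{.Status}}"
# myapp: Up 47 minutes

# Meanwhile a direct request may hang, time out, or return an error
curl -i --max-time 5 http://localhost:3000/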

The HEALTHCHECK Instruction

Add a HEALTHCHECK to your Dockerfile to tell Docker how to verify your application is actually working.

FROM node:20-alpine
WORKDIR /app
COPY . .
RUN npm ci --only=production

HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

CMD ["node", "server.js"]

| Parameter | Default | What It Does |
|---|---|---|
| --interval | 30s | Time between health checks |
| --timeout | 30s | How long a single check may run before it counts as a failure |
| --start-period | 0s | Grace period for container startup; failures in this window do not count toward retries |
| --retries | 3 | Consecutive failures before marking unhealthy |

The --start-period is critical. Java applications, for example, might take 30-60 seconds to start. Without a start period, the health check fails immediately and the container is marked unhealthy before it even finishes booting.
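
For a slow-starting service you can be generous with the grace window. Here is a sketch for a JVM app; the port, the /actuator/health path, and the assumption that curl exists in the image are placeholders for whatever your application actually provides:

# Slow-starting JVM app: generous start period so boot time is not counted as failure
HEALTHCHECK --interval=30s --timeout=5s --start-period=90s --retries=3 \
CMD curl --fail --silent http://localhost:8080/actuator/health || exit 1

Failures during the start period do not count toward the retry limit, and the first successful check ends the grace period early, so a long --start-period costs nothing when the app happens to boot quickly.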

Health Check Commands

HTTP Health Check

The most common pattern. Your application exposes a /health endpoint that returns 200 when everything is working.

# Using wget (available in Alpine)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

# Using curl (if installed)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl --fail --silent http://localhost:8080/health || exit 1

Your health endpoint should check actual dependencies, not just return 200:

// Express.js health endpoint
app.get('/health', async (req, res) => {
  try {
    // Check database connection
    await db.query('SELECT 1');
    // Check Redis connection
    await redis.ping();
    res.status(200).json({ status: 'healthy' });
  } catch (err) {
    res.status(503).json({ status: 'unhealthy', error: err.message });
  }
});

TCP Health Check

For services that do not speak HTTP (databases, message queues, custom TCP services):

# Check if a TCP port is accepting connections
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD nc -z localhost 5432 || exit 1

# PostgreSQL health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD pg_isready -U postgres || exit 1

# Redis health check
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
CMD redis-cli ping | grep -q PONG || exit 1

# MySQL health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD mysqladmin ping -h localhost -u root --password=$MYSQL_ROOT_PASSWORD || exit 1
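
If the image ships neither nc nor a client CLI, bash's built-in /dev/tcp pseudo-device is a possible fallback. Note the assumptions: this requires bash, which Alpine-based images do not include by default, and it only proves the port is open, not that the service behind it can do useful work:

# Succeeds only if something is listening on localhost:5432
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD bash -c ': < /dev/tcp/localhost/5432' || exit 1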

Script-Based Health Check

For complex checks that need multiple validations:

#!/bin/sh
# healthcheck.sh

# Check 1: HTTP endpoint responds
wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1

# Check 2: Disk usage below 90%
DISK_USAGE=$(df /app/data | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
exit 1
fi

# Check 3: No zombie processes
ZOMBIES=$(ps aux | grep -c defunct)
if [ "$ZOMBIES" -gt 5 ]; then
exit 1
fi

exit 0
COPY healthcheck.sh /healthcheck.sh
RUN chmod +x /healthcheck.sh

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD /healthcheck.sh

Health States

Docker tracks three health states for containers with health checks configured:

| State | Meaning | Container Running? | Traffic Routed? |
|---|---|---|---|
| starting | Within start-period, check not yet passed | Yes | Depends on config |
| healthy | Last N checks passed | Yes | Yes |
| unhealthy | Last N checks failed (retries exceeded) | Yes | Should be stopped |

# Check health status
docker inspect myapp --format '{{.State.Health.Status}}'
# healthy

# See recent health check results
docker inspect myapp --format '{{json .State.Health}}' | python3 -m json.tool
{
    "Status": "unhealthy",
    "FailingStreak": 5,
    "Log": [
        {
            "Start": "2025-06-07T10:30:00.123Z",
            "End": "2025-06-07T10:30:05.456Z",
            "ExitCode": 1,
            "Output": "wget: server returned error: HTTP/1.1 503 Service Unavailable"
        }
    ]
}
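
Health transitions also show up on the Docker events stream, which is handy for watching state changes live. The filter below works on recent Docker releases; the exact filter syntax is worth verifying against your version:

# Stream health state changes as they happen (optionally scoped to one container)
docker events --filter event=health_status --filter container=myapp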

Filter Containers by Health

# List only healthy containers
docker ps --filter health=healthy

# List unhealthy containers (potential problems)
docker ps --filter health=unhealthy

# List containers still starting up
docker ps --filter health=starting

# Combine with format for monitoring scripts
docker ps --filter health=unhealthy --format "{{.Names}}: {{.Status}}"
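
Wrapped in a small script, that filter becomes a cheap monitoring hook: for example, a cron job that prints the offenders and exits non-zero when anything is unhealthy. This is only a sketch; wire it into whatever alerting you already have:

#!/bin/sh
# check-health.sh: exit 1 if any container on this host is unhealthy
UNHEALTHY=$(docker ps --filter health=unhealthy --format '{{.Names}}')

if [ -n "$UNHEALTHY" ]; then
    echo "Unhealthy containers detected:"
    echo "$UNHEALTHY"
    exit 1
fi

echo "All containers healthy"
exit 0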

Health Checks in Docker Compose

Compose health checks become powerful with depends_on conditions. Instead of just waiting for a container to start, you can wait until it is actually ready.

# docker-compose.yml
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3

  api:
    build: .
    ports:
      - "3000:3000"
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD-SHELL", "wget --spider -q http://localhost:3000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    depends_on:
      api:
        condition: service_healthy

With condition: service_healthy, the api container will not start until both db and redis report healthy. And nginx will not start until api is healthy. This solves the classic "app crashes because the database is not ready yet" problem.
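
Recent Compose releases (the v2 plugin) can also block on the same health information from the command line with the --wait flag, which is convenient in CI; check docker compose up --help on your version before relying on it:

# Start everything and block until services are running and healthy
# (non-zero exit if a service never becomes healthy)
docker compose up -d --wait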

Override Health Check at Runtime

You can override or disable health checks without modifying the Dockerfile:

# Override health check command
docker run -d \
--health-cmd "curl -f http://localhost:8080/ready || exit 1" \
--health-interval 15s \
--health-timeout 5s \
--health-retries 3 \
--health-start-period 30s \
myapp:latest

# Disable health check entirely (testing/debugging)
docker run -d --no-healthcheck myapp:latest

Orchestrator Comparison

Docker health checks map to similar concepts in orchestration platforms, but each has its own nuances.

| Feature | Docker HEALTHCHECK | K8s Liveness Probe | K8s Readiness Probe | K8s Startup Probe |
|---|---|---|---|---|
| Purpose | Container health | Restart on failure | Traffic routing | Startup protection |
| Check types | CMD only | HTTP, TCP, exec, gRPC | HTTP, TCP, exec, gRPC | HTTP, TCP, exec, gRPC |
| On failure | Marks unhealthy | Restarts container | Removes from Service | Delays liveness check |
| Built-in restart | No (status only) | Yes (automatic) | No | No |

Kubernetes liveness probes are more aggressive: they actually restart the container on failure. Docker health checks only mark the container as unhealthy; the engine takes no action on that status by itself. A restart policy helps only when the process exits, so it will not recover a container that is running but unhealthy. Acting on health status requires Docker Swarm (which replaces unhealthy tasks), another orchestrator, or an external watcher.

# Restart policy recovers crashes (process exits); the health check flags "running but broken"
# Something still has to act on the unhealthy status (see the watcher sketch below)
docker run -d --name api \
--restart=on-failure:5 \
--health-cmd "wget --spider -q http://localhost:3000/health || exit 1" \
--health-interval 30s \
--health-retries 3 \
myapp:latest
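
If you are not running Swarm or Kubernetes, a small watcher is one way to close the loop. This is a minimal sketch only: the poll interval and the blunt "just restart it" response are arbitrary choices, and community projects such as autoheal solve the same problem more robustly:

#!/bin/sh
# restart-unhealthy.sh: naive self-healing loop for a standalone Docker host
while true; do
    for NAME in $(docker ps --filter health=unhealthy --format '{{.Names}}'); do
        echo "$(date -u) restarting unhealthy container: $NAME"
        docker restart "$NAME"
    done
    sleep 30
done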

Debugging Unhealthy Containers

When a container goes unhealthy, here is how to diagnose the problem:

# Step 1: Check the health log
docker inspect myapp --format '{{json .State.Health.Log}}' | python3 -m json.tool

# Step 2: Run the health check manually
docker exec myapp wget --spider -q http://localhost:3000/health
echo $?
# 0 = healthy, non-zero = unhealthy

# Step 3: Check if the health endpoint works
docker exec myapp wget -qO- http://localhost:3000/health

# Step 4: Check resource usage (maybe OOM or CPU starved)
docker stats --no-stream myapp

# Step 5: Check application logs
docker logs --tail 50 myapp

Common health check failures:

| Symptom | Likely Cause | Fix |
|---|---|---|
| Unhealthy immediately on start | start-period too short | Increase --start-period |
| Intermittent unhealthy | Timeout too aggressive | Increase --timeout, reduce --interval |
| Healthy then permanently unhealthy | Application issue (OOM, deadlock) | Check app logs, increase resources |
| Health check command not found | Missing tool in image | Install wget/curl or use a different check |

Wrapping Up

Health checks are a simple addition that prevents a significant class of outages — the kind where your monitoring says everything is "green" because the container is running, but users are getting errors because the application inside is broken. Add an HTTP health check to every web service, use depends_on: condition: service_healthy in Compose, and make sure something actually acts on the unhealthy status: restart policies cover crashes, while an orchestrator or a small watcher covers containers that are running but broken.

In the next post, we will cover Docker Environment Variables — the right way to handle configuration across development, staging, and production environments without baking secrets into your images.