Docker Health Checks — Don't Route Traffic to Dead Containers
Your container is "running." The process has PID 1, Docker says status is Up 47 minutes, and everything looks fine. Except the application inside crashed 20 minutes ago, the event loop is deadlocked, or the database connection pool is exhausted. Traffic keeps flowing in. Users keep getting 502 errors. Docker has no idea anything is wrong because "running" and "healthy" are not the same thing.
Why Running Does Not Mean Healthy
Docker monitors one thing by default: is PID 1 alive? If the main process is running, the container is "Up." But applications fail in ways that keep the process alive:
- Deadlocks. The application thread is stuck waiting on a lock. The process is alive. No requests are served.
- Connection pool exhaustion. All database connections are leaked. The app is alive but every query times out.
- Memory pressure. The app is swapping to disk. Technically alive, effectively dead.
- Dependency failure. The upstream API is down. The app starts but cannot do anything useful.
- Corrupted state. A background job failed silently. The app serves stale or incorrect data.
Without health checks, Docker (and any load balancer in front of it) will keep sending traffic to a container that cannot handle it.
The HEALTHCHECK Instruction
Add a HEALTHCHECK to your Dockerfile to tell Docker how to verify your application is actually working.
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
CMD wget -q --spider http://localhost:3000/health || exit 1
CMD ["node", "server.js"]
| Parameter | Default | What It Does |
|---|---|---|
| --interval | 30s | Time between health checks |
| --timeout | 30s | How long a single check may run before it counts as failed |
| --start-period | 0s | Startup grace period; failures during it do not count toward retries |
| --retries | 3 | Consecutive failures before marking unhealthy |
The --start-period is critical. Java applications, for example, might take 30-60 seconds to start. Without a start period, those early failed checks count against the retry budget, and the container can be flagged unhealthy before it has even finished booting.
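For a slow-booting JVM service, a sketch might look like the following. This is a hypothetical Spring Boot image; the jar name, port, and the /actuator/health path are assumptions for illustration:

```dockerfile
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app
COPY app.jar .
# Give the JVM a full minute before failed checks start counting against retries
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
  CMD wget -q --spider http://localhost:8080/actuator/health || exit 1
CMD ["java", "-jar", "app.jar"]
```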
Health Check Commands
HTTP Health Check
The most common pattern. Your application exposes a /health endpoint that returns 200 when everything is working.
# Using wget (available in Alpine)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD wget -q --spider http://localhost:8080/health || exit 1
# Using curl (if installed)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl --fail --silent http://localhost:8080/health || exit 1
Your health endpoint should check actual dependencies, not just return 200:
// Express.js health endpoint
app.get('/health', async (req, res) => {
try {
// Check database connection
await db.query('SELECT 1');
// Check Redis connection
await redis.ping();
res.status(200).json({ status: 'healthy' });
} catch (err) {
res.status(503).json({ status: 'unhealthy', error: err.message });
}
});
TCP Health Check
For services that do not speak HTTP (databases, message queues, custom TCP services):
# Check if a TCP port is accepting connections
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD nc -z localhost 5432 || exit 1
# PostgreSQL health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD pg_isready -U postgres || exit 1
# Redis health check
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
CMD redis-cli ping | grep -q PONG || exit 1
# MySQL health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD mysqladmin ping -h localhost -u root --password=$MYSQL_ROOT_PASSWORD || exit 1
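Minimal images sometimes ship without nc. If the image has bash, you can fall back on its built-in /dev/tcp pseudo-device. A sketch; the host and port are placeholders:

```shell
# Probe a TCP port via bash's /dev/tcp pseudo-device; prints "open" or "closed".
probe() {
  host="$1"
  port="$2"
  if bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

probe localhost 5432
```

Inside a HEALTHCHECK this collapses to a one-liner: CMD bash -c 'exec 3<>/dev/tcp/localhost/5432' || exit 1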
Script-Based Health Check
For complex checks that need multiple validations:
#!/bin/sh
# healthcheck.sh
# Check 1: HTTP endpoint responds
wget -q --spider http://localhost:8080/health || exit 1
# Check 2: Disk usage below 90%
DISK_USAGE=$(df -P /app/data | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
exit 1
fi
# Check 3: No zombie processes (the [d] bracket trick keeps grep itself out of the count)
ZOMBIES=$(ps aux | grep -c '[d]efunct')
if [ "$ZOMBIES" -gt 5 ]; then
exit 1
fi
exit 0
COPY healthcheck.sh /healthcheck.sh
RUN chmod +x /healthcheck.sh
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD /healthcheck.sh
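Parsing df output is the fragile part of a script like this. You can sanity-check the parsing outside a container by feeding it a canned line; the numbers below are made up:

```shell
# A fabricated df -P output line: filesystem, blocks, used, available, use%, mount
LINE="/dev/sda1 102400 92160 10240 90% /app/data"

# Same pipeline as the health check script: grab field 5, strip the % sign
DISK_USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
echo "$DISK_USAGE"

# 90 is not greater than 90, so this line prints "ok"
[ "$DISK_USAGE" -gt 90 ] && echo "over threshold" || echo "ok"
```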
Health States
Docker tracks three health states for containers with health checks configured:
| State | Meaning | Container Running? | Traffic Routed? |
|---|---|---|---|
| starting | Within start-period, check not yet passed | Yes | Depends on config |
| healthy | Last N checks passed | Yes | Yes |
| unhealthy | Last N checks failed (retries exceeded) | Yes | Should be stopped |
# Check health status
docker inspect myapp --format '{{.State.Health.Status}}'
# healthy
# See recent health check results
docker inspect myapp --format '{{json .State.Health}}' | python3 -m json.tool
{
"Status": "unhealthy",
"FailingStreak": 5,
"Log": [
{
"Start": "2025-06-07T10:30:00.123Z",
"End": "2025-06-07T10:30:05.456Z",
"ExitCode": 1,
"Output": "wget: server returned error: HTTP/1.1 503 Service Unavailable"
}
]
}
Filter Containers by Health
# List only healthy containers
docker ps --filter health=healthy
# List unhealthy containers (potential problems)
docker ps --filter health=unhealthy
# List containers still starting up
docker ps --filter health=starting
# Combine with format for monitoring scripts
docker ps --filter health=unhealthy --format "{{.Names}}: {{.Status}}"
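That filter output pipes cleanly into scripts. A tiny hypothetical formatter for an alerting hook might look like this; the function and message format are illustrative, not part of Docker:

```shell
# Turn "name status..." lines (as produced by a format string like
# "{{.Names}} {{.Status}}") into one alert line per container.
format_alerts() {
  while read -r name rest; do
    echo "ALERT: $name is unhealthy ($rest)"
  done
}

# In real use: docker ps --filter health=unhealthy --format "{{.Names}} {{.Status}}" | format_alerts
printf 'api Up 10 minutes (unhealthy)\n' | format_alerts
```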
Health Checks in Docker Compose
Compose health checks become powerful with depends_on conditions. Instead of just waiting for a container to start, you can wait until it is actually ready.
# docker-compose.yml
services:
db:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: secret
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
timeout: 5s
retries: 5
start_period: 30s
redis:
image: redis:7-alpine
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
api:
build: .
ports:
- "3000:3000"
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
healthcheck:
test: ["CMD-SHELL", "wget --spider -q http://localhost:3000/health || exit 1"]
interval: 30s
timeout: 5s
retries: 3
start_period: 15s
nginx:
image: nginx:alpine
ports:
- "80:80"
depends_on:
api:
condition: service_healthy
With condition: service_healthy, the api container will not start until both db and redis report healthy. And nginx will not start until api is healthy. This solves the classic "app crashes because the database is not ready yet" problem.
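Outside Compose, in a deploy script for instance, you can get the same effect by polling docker inspect. A sketch; wait_for is a hypothetical helper that takes the probe command as arguments, so it is easy to test in isolation:

```shell
#!/bin/sh
# Run the given command every 2s until it prints "healthy", giving up after 30 tries.
wait_for() {
  tries=0
  while [ "$tries" -lt 30 ]; do
    [ "$("$@" 2>/dev/null)" = "healthy" ] && return 0
    tries=$((tries + 1))
    sleep 2
  done
  return 1
}

# Typical use before running migrations:
# wait_for docker inspect db --format '{{.State.Health.Status}}'
```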
Override Health Check at Runtime
You can override or disable health checks without modifying the Dockerfile:
# Override health check command
docker run -d \
--health-cmd "curl -f http://localhost:8080/ready || exit 1" \
--health-interval 15s \
--health-timeout 5s \
--health-retries 3 \
--health-start-period 30s \
myapp:latest
# Disable health check entirely (testing/debugging)
docker run -d --no-healthcheck myapp:latest
Orchestrator Comparison
Docker health checks map to similar concepts in orchestration platforms, but each has its own nuances.
| Feature | Docker HEALTHCHECK | K8s Liveness Probe | K8s Readiness Probe | K8s Startup Probe |
|---|---|---|---|---|
| Purpose | Container health | Restart on failure | Traffic routing | Startup protection |
| Check types | CMD only | HTTP, TCP, exec, gRPC | HTTP, TCP, exec, gRPC | HTTP, TCP, exec, gRPC |
| On failure | Marks unhealthy | Restarts container | Removes from Service | Delays liveness check |
| Built-in restart | No (needs policy) | Yes (automatic) | No | No |
Kubernetes liveness probes are more aggressive — they actually restart the container on failure. Docker health checks only mark the container as unhealthy. A restart policy gives you automatic recovery from crashes, but note the gap: restart policies fire when the process exits, so plain Docker will never restart a container that is unhealthy yet still running. Swarm mode replaces unhealthy tasks automatically; standalone Docker needs an external watcher for that.
# Restart policy recovers from crashes; the health check surfaces a hung-but-running app
docker run -d --name api \
--restart=on-failure:5 \
--health-cmd "wget --spider -q http://localhost:3000/health || exit 1" \
--health-interval 30s \
--health-retries 3 \
myapp:latest
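To close the gap for unhealthy-but-running containers on standalone Docker, one community option is the willfarrell/autoheal image, which watches the Docker socket and restarts containers that report unhealthy. A typical invocation looks like the following; treat it as a starting point and check the image's documentation for current flags:

```shell
docker run -d --name autoheal \
  --restart=always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
```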
Debugging Unhealthy Containers
When a container goes unhealthy, here is how to diagnose the problem:
# Step 1: Check the health log
docker inspect myapp --format '{{json .State.Health.Log}}' | python3 -m json.tool
# Step 2: Run the health check manually
docker exec myapp wget --spider -q http://localhost:3000/health
echo $?
# 0 = healthy, 1 = unhealthy
# Step 3: Check if the health endpoint works
docker exec myapp wget -qO- http://localhost:3000/health
# Step 4: Check resource usage (maybe OOM or CPU starved)
docker stats --no-stream myapp
# Step 5: Check application logs
docker logs --tail 50 myapp
Common health check failures:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Unhealthy immediately on start | start-period too short | Increase --start-period |
| Intermittent unhealthy | Timeout too aggressive | Increase --timeout, reduce --interval |
| Healthy then permanently unhealthy | Application issue (OOM, deadlock) | Check app logs, increase resources |
| Health check command not found | Missing tool in image | Install wget/curl or use a different check |
Wrapping Up
Health checks are a simple addition that prevents a significant class of outages — the kind where your monitoring says everything is "green" because the container is running, but users are getting errors because the application inside is broken. Add an HTTP health check to every web service, use depends_on: condition: service_healthy in Compose, and combine health checks with restart policies for self-healing containers.
In the next post, we will cover Docker Environment Variables — the right way to handle configuration across development, staging, and production environments without baking secrets into your images.
