Monitor Docker Containers — cAdvisor, Prometheus, and Grafana
docker stats shows you what is happening right now. It does not show you what happened at 3 AM when response times spiked. It does not alert you when a container's memory is trending toward its limit. It does not graph CPU usage over the past week to help you right-size your resource limits. For real monitoring, you need metrics collection, storage, visualization, and alerting. The standard stack for Docker is cAdvisor + Prometheus + Grafana, and you can have it running in under fifteen minutes.
docker stats — The Starting Point
docker stats is built into Docker and requires no setup. It is good for quick glances but has fundamental limitations.
# Real-time stats for all containers
docker stats
# CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
# abc123 api 2.45% 245MiB / 512MiB 47.8% 1.2MB / 500kB 10MB / 0B 12
# def456 worker 78.2% 480MiB / 512MiB 93.7% 500kB / 200kB 5MB / 1MB 45
# ghi789 db 12.5% 1.2GiB / 2GiB 60.0% 3MB / 15MB 50MB / 200MB 28
# Specific containers with custom format
docker stats --no-stream --format \
"table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}\t{{.PIDs}}"
Limitations of docker stats:
- No historical data — only shows the current moment.
- No alerting — you have to be watching when something goes wrong.
- No trend analysis — cannot see memory creeping up over hours.
- No dashboards — text output only.
- No correlation — cannot overlay CPU with response times.
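If you need a stopgap before the full stack is in place, docker stats can at least be sampled on a schedule. This is a sketch of a poor-man's history logger; the OUT path, the CSV columns, and the cron line are assumptions, not anything built into Docker.

```shell
#!/bin/sh
# Hypothetical stopgap: append timestamped docker stats snapshots to a CSV
# so there is at least something to grep when a container misbehaved overnight.
OUT=${OUT:-/var/log/docker-stats.csv}

snapshot() {
  # One CSV row per container, prefixed with a UTC timestamp
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  docker stats --no-stream --format '{{.Name}},{{.CPUPerc}},{{.MemUsage}}' |
    while IFS= read -r line; do
      printf '%s,%s\n' "$ts" "$line"
    done
}

# Run once and append to the log:
#   snapshot >> "$OUT"
# Or from cron, e.g. every minute:
#   * * * * * /usr/local/bin/stats-snapshot.sh
```

This gives you grep-able history but none of the querying, graphing, or alerting the rest of this post covers.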
cAdvisor — Container Metrics Collection
cAdvisor (Container Advisor) by Google collects resource usage and performance data from running containers. It exposes metrics in Prometheus format, which makes it the natural data source for the rest of this stack.
# Run cAdvisor as a container
docker run -d \
--name cadvisor \
--volume /:/rootfs:ro \
--volume /var/run:/var/run:ro \
--volume /sys:/sys:ro \
--volume /var/lib/docker/:/var/lib/docker:ro \
--volume /dev/disk/:/dev/disk:ro \
--publish 8080:8080 \
--privileged \
--device /dev/kmsg \
gcr.io/cadvisor/cadvisor:latest
# cAdvisor web UI: http://localhost:8080
# Prometheus metrics endpoint: http://localhost:8080/metrics
# Verify cAdvisor is collecting metrics
curl -s http://localhost:8080/metrics | head -20
# container_cpu_usage_seconds_total{name="api",...} 45.234
# container_memory_usage_bytes{name="api",...} 256901120
# container_network_receive_bytes_total{name="api",...} 1258291
# container_fs_usage_bytes{name="api",...} 10485760
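The /metrics endpoint and the metric names above are what cAdvisor actually exposes; here is a small helper (the function name is hypothetical) that pulls one container's current memory usage out of that output. The awk line is a sketch, not a full Prometheus text-format parser.

```shell
# Pull one container's container_memory_usage_bytes value from cAdvisor's
# Prometheus-format /metrics output (endpoint and metric name per cAdvisor).
CADVISOR=${CADVISOR:-http://localhost:8080}

container_mem_bytes() {
  # $1 = container name as it appears in the name="..." label
  curl -s "$CADVISOR/metrics" |
    awk -v n="name=\"$1\"" \
      'index($0, "container_memory_usage_bytes{") == 1 && index($0, n) { print $NF }'
}

# container_mem_bytes api   (prints the raw byte count, e.g. 256901120)
```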
Prometheus — Metrics Storage and Querying
Prometheus scrapes metrics from cAdvisor at regular intervals, stores them as time-series data, and provides a query language (PromQL) for analysis.
# prometheus.yml — Prometheus configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts.yml"
scrape_configs:
# Scrape cAdvisor for container metrics
- job_name: "cadvisor"
static_configs:
- targets: ["cadvisor:8080"]
# Scrape Prometheus itself
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
  # Scrape Docker daemon metrics (if enabled; on Linux, add
  # extra_hosts: "host.docker.internal:host-gateway" to the Prometheus
  # service so this name resolves)
- job_name: "docker"
static_configs:
- targets: ["host.docker.internal:9323"]
# Enable Docker daemon metrics (optional)
# Add to /etc/docker/daemon.json
{
"metrics-addr": "0.0.0.0:9323",
"experimental": true
}
# Restart Docker: sudo systemctl restart docker
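After the daemon.json change and restart, it is worth confirming the engine is actually serving metrics on :9323. Docker's engine metrics are prefixed engine_daemon_; the helper function name here is just for illustration.

```shell
# Count engine_daemon_* series on the daemon metrics endpoint.
# A zero means metrics are not enabled (or the address/port is wrong).
docker_metrics_count() {
  curl -s "${1:-http://localhost:9323}/metrics" | grep -c '^engine_daemon_'
}

# docker_metrics_count
```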
The Complete Monitoring Stack
Here is a production-ready docker-compose file that runs the entire monitoring stack.
# docker-compose.monitoring.yml
services:
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
ports:
- "8080:8080"
restart: unless-stopped
networks:
- monitoring
prometheus:
image: prom/prometheus:latest
container_name: prometheus
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
- prometheus-data:/prometheus
ports:
- "9090:9090"
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
- "--web.enable-lifecycle"
restart: unless-stopped
networks:
- monitoring
grafana:
image: grafana/grafana:latest
container_name: grafana
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=changeme
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
ports:
- "3000:3000"
restart: unless-stopped
depends_on:
- prometheus
networks:
- monitoring
volumes:
prometheus-data:
grafana-data:
networks:
monitoring:
driver: bridge
# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d
# Access the tools:
# cAdvisor: http://localhost:8080
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/changeme)
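Once the stack is up, confirm Prometheus is actually scraping its targets. The /api/v1/targets endpoint is part of Prometheus's stable HTTP API; the python one-liner just flattens the JSON (jq works equally well).

```shell
# List every active scrape target and its health as "job health" pairs.
scrape_health() {
  curl -s "${1:-http://localhost:9090}/api/v1/targets" |
    python3 -c 'import json, sys
for t in json.load(sys.stdin)["data"]["activeTargets"]:
    print(t["labels"]["job"], t["health"])'
}

# scrape_health   (every job — cadvisor, prometheus, docker — should report "up")
```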
Key Metrics to Track
| Metric | PromQL Query | What It Tells You | Alert Threshold |
|---|---|---|---|
| CPU usage | rate(container_cpu_usage_seconds_total[5m]) | CPU cores consumed per container | > 80% of limit |
| Memory usage | container_memory_usage_bytes | Current memory consumption | > 85% of limit |
| Memory limit % | container_memory_usage_bytes / container_spec_memory_limit_bytes * 100 | How close to OOM kill | > 90% |
| Network RX | rate(container_network_receive_bytes_total[5m]) | Incoming network bandwidth | Unusual spike |
| Network TX | rate(container_network_transmit_bytes_total[5m]) | Outgoing network bandwidth | Unusual spike |
| Disk read | rate(container_fs_reads_bytes_total[5m]) | Disk read throughput | Sustained high I/O |
| Disk write | rate(container_fs_writes_bytes_total[5m]) | Disk write throughput | Sustained high I/O |
| Restart count | changes(container_start_time_seconds[1h]) (on Kubernetes: kube_pod_container_status_restarts_total) | How often a container restarts | > 3 in 10 minutes |
| Container uptime | time() - container_start_time_seconds | How long since last restart | Unexpected restart |
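Every query in the table can also be run outside Grafana via the Prometheus HTTP API (/api/v1/query is a stable endpoint). promq is a hypothetical helper name; --data-urlencode handles the PromQL escaping for you.

```shell
# Run an instant PromQL query against Prometheus and print the raw JSON result.
PROM=${PROM:-http://localhost:9090}

promq() {
  curl -s "$PROM/api/v1/query" --data-urlencode "query=$1"
}

# promq 'container_memory_usage_bytes{name="api"}'
# promq 'rate(container_cpu_usage_seconds_total{name="api"}[5m])'
```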
Grafana Dashboards for Docker
Grafana turns Prometheus queries into visual dashboards. The easiest way to get started is to import community dashboards.
# Auto-provision Grafana datasource
# ./grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
# Popular Grafana dashboard IDs for Docker:
# 893 — Docker and system monitoring
# 14282 — cAdvisor container metrics
# 1229 — Docker Prometheus monitoring
# Import via Grafana UI:
# 1. Go to Dashboards → Import
# 2. Enter dashboard ID (e.g., 14282)
# 3. Select Prometheus as the data source
# 4. Click Import
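The four UI steps can also be scripted (a sketch follows): grafana.com serves each community dashboard's JSON, and POST /api/dashboards/import is a real Grafana endpoint. The DS_PROMETHEUS input name is the common convention for these dashboards but is not guaranteed for all of them; the credentials match the compose file above.

```shell
# Download a community dashboard from grafana.com and import it into the
# local Grafana, binding its Prometheus input to our "Prometheus" datasource.
import_dashboard() {
  # $1 = grafana.com dashboard ID, e.g. 14282
  json=$(curl -s "https://grafana.com/api/dashboards/$1/revisions/latest/download")
  curl -s -u admin:changeme -H 'Content-Type: application/json' \
    -d "{\"dashboard\": $json, \"overwrite\": true,
         \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\",
                       \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}" \
    http://localhost:3000/api/dashboards/import
}

# import_dashboard 14282
```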
Useful PromQL Queries
# Top 5 containers by CPU usage
topk(5, rate(container_cpu_usage_seconds_total{name!=""}[5m]))
# Containers approaching memory limit (> 80%)
container_memory_usage_bytes{name!=""} /
container_spec_memory_limit_bytes{name!=""} * 100 > 80
# Network receive throughput per container (MiB/s)
rate(container_network_receive_bytes_total{name!=""}[5m]) / 1024 / 1024
# Container restart rate (restarts per hour; use changes(), not increase():
# the start time is a gauge, not a counter)
changes(container_start_time_seconds{name!=""}[1h])
# Disk I/O per container
rate(container_fs_writes_bytes_total{name!=""}[5m])
+ rate(container_fs_reads_bytes_total{name!=""}[5m])
Alerting on Container Health
Set up Prometheus alerting rules to get notified before problems become outages.
# prometheus/alerts.yml
groups:
- name: container_alerts
rules:
- alert: ContainerHighMemory
expr: |
container_memory_usage_bytes{name!=""} /
container_spec_memory_limit_bytes{name!=""} * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} memory usage > 85%"
description: "{{ $labels.name }} is using {{ $value | printf \"%.1f\" }}% of its memory limit."
      - alert: ContainerHighCPU
        # 0.8 here means 0.8 CPU cores, not 80% of a configured limit
        expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high CPU usage"
      - alert: ContainerRestarting
        # changes() counts jumps in the start-time gauge; increase() only works on counters
        expr: changes(container_start_time_seconds{name!=""}[10m]) > 3
labels:
severity: critical
annotations:
summary: "Container {{ $labels.name }} restarting frequently"
      # absent() returns a series with no labels, so {{ $labels.name }} would be
      # empty here; name each watched container explicitly instead
      - alert: ContainerDown
        expr: absent(container_last_seen{name="api"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container api is down (no metrics for 1 minute)"
# Verify alerts are loaded in Prometheus
curl http://localhost:9090/api/v1/rules
# Check Prometheus UI → Alerts tab: http://localhost:9090/alerts
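The same information is available programmatically: /api/v1/alerts is a stable Prometheus endpoint listing pending and firing alerts. The python one-liner is just a sketch for flattening the JSON response.

```shell
# Print each active alert as "alertname state" (e.g. "ContainerHighMemory firing").
firing_alerts() {
  curl -s "${1:-http://localhost:9090}/api/v1/alerts" |
    python3 -c 'import json, sys
for a in json.load(sys.stdin)["data"]["alerts"]:
    print(a["labels"]["alertname"], a["state"])'
}

# firing_alerts
```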
Container Resource Usage Trends
The real value of Prometheus comes from tracking trends over time. A container that uses 300 MB today and 350 MB tomorrow is probably leaking memory, and that trend is invisible to docker stats.
# Memory usage trend over 24 hours (predict OOM in the next 4 hours)
predict_linear(container_memory_usage_bytes{name="api"}[24h], 4 * 3600)
> container_spec_memory_limit_bytes{name="api"}
# CPU usage comparison: this week vs last week
rate(container_cpu_usage_seconds_total{name="api"}[5m])
- rate(container_cpu_usage_seconds_total{name="api"}[5m] offset 7d)
# Average memory usage over the past week (for right-sizing)
avg_over_time(container_memory_usage_bytes{name="api"}[7d])
Monitoring: Production vs Development
| Aspect | Development | Production |
|---|---|---|
| Scrape interval | 30s (less overhead) | 15s (more granularity) |
| Retention | 7 days | 30-90 days |
| Alerting | Disabled or Slack only | PagerDuty / OpsGenie |
| Dashboards | Basic overview | Per-service dashboards |
| cAdvisor | Optional (use docker stats) | Required |
| Grafana auth | Default admin/admin | SSO / LDAP |
| Prometheus storage | Local volume | Remote write to Thanos/Mimir |
| Network | Bridge | Overlay with monitoring network |
# Production additions to the monitoring stack
services:
alertmanager:
image: prom/alertmanager:latest
volumes:
- ./alertmanager/config.yml:/etc/alertmanager/config.yml:ro
ports:
- "9093:9093"
networks:
- monitoring
node-exporter:
image: prom/node-exporter:latest
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- "--path.procfs=/host/proc"
- "--path.sysfs=/host/sys"
- "--path.rootfs=/rootfs"
ports:
- "9100:9100"
networks:
- monitoring
Add Node Exporter for host-level metrics (disk space, system CPU, available memory) alongside cAdvisor for container-level metrics. Together they give you complete visibility.
Wrapping Up
docker stats is a flashlight. Prometheus + Grafana is a security camera system with motion detection. The flashlight is useful for quick checks, but it cannot tell you what happened while you were not looking. With cAdvisor collecting container metrics, Prometheus storing and querying them, and Grafana visualizing trends, you get historical analysis, predictive alerting, and the data you need to right-size your resource limits. The fifteen minutes it takes to set up the monitoring stack will save you hours of debugging when something goes wrong at 3 AM.
In the next post, we will cover Docker Compose in Production — profiles, depends_on health conditions, restart policies, resource limits, and when it is actually appropriate to use Compose for production workloads.
