
Prometheus and Grafana — Set Up Production Monitoring in 15 Minutes

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

You have read about the Golden Signals and the three pillars of observability. Now it is time to stop theorizing and start measuring. In this post, we will set up a complete monitoring stack — Prometheus for metrics collection, Grafana for visualization, node_exporter for system metrics, and Alertmanager for routing alerts to Slack. Everything runs locally with Docker Compose, and everything follows production-ready patterns.

How Prometheus Works

Prometheus uses a pull-based model. Instead of your services pushing metrics to a central server, Prometheus scrapes HTTP endpoints on a schedule. This is a critical architectural difference from push-based systems like StatsD or CloudWatch.
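To make the pull model concrete, here is a toy sketch using only Python's standard library: a fake app serves a counter at /metrics in the text exposition format, and a one-line "scrape" fetches it over HTTP, exactly as Prometheus does on its schedule. This is illustrative only; a real application would use the official prometheus_client library rather than hand-rolling the format.

```python
# Toy illustration of the pull model: the app exposes /metrics as plain text,
# and the scraper (Prometheus) fetches it over HTTP on a schedule.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 1027  # pretend the app increments this counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP http_requests_total Total HTTP requests served.\n"
            "# TYPE http_requests_total counter\n"
            f'http_requests_total{{method="GET",status="200"}} {REQUEST_COUNT}\n'
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# This is all a "scrape" is: an HTTP GET against the /metrics endpoint.
url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
print(scraped)
server.shutdown()
```

Prometheus repeats this GET against every configured target, every scrape_interval, and stores the parsed samples in its TSDB.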

Your services expose /metrics endpoints:

┌──────────┐  ┌──────────┐  ┌──────────┐
│ App :8080│  │ App :8081│  │ Node Exp │
│ /metrics │  │ /metrics │  │  :9100   │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     ▼             ▼             ▼
┌──────────────────────────────────────────┐
│           Prometheus :9090               │
│     Scrapes targets every 15s            │
│     Stores time-series data (TSDB)       │
│     Evaluates alert rules                │
└──────────────┬───────────────────────────┘
               │
     ┌─────────┴─────────┐
     ▼                   ▼
┌────────────┐    ┌──────────────┐
│  Grafana   │    │ Alertmanager │
│   :3000    │    │    :9093     │
│ Dashboards │    │  → Slack     │
└────────────┘    │  → PagerDuty │
                  └──────────────┘

Full Stack with Docker Compose

Create a project directory and add this docker-compose.yml:

# docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: changeme
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alert rules
  scrape_timeout: 10s       # Timeout per scrape

# Load alert rules
rule_files:
  - "alert-rules.yml"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# Scrape configurations
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # System metrics via node_exporter
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "server-01"
          environment: "production"

  # Your application (add your own apps here)
  - job_name: "my-app"
    metrics_path: /metrics
    static_configs:
      - targets: ["app:8080"]
        labels:
          service: "api"
          environment: "production"

  # Scrape Docker containers with labels (service discovery)
  # Uncomment if using Docker service discovery:
  # - job_name: "docker"
  #   docker_sd_configs:
  #     - host: unix:///var/run/docker.sock
  #       refresh_interval: 30s

PromQL: Querying Your Metrics

PromQL is Prometheus's query language. These are the functions you will use daily:

# rate() — per-second rate of a counter over a time window
# "How many requests per second in the last 5 minutes?"
rate(http_requests_total[5m])

# sum() — aggregate across all labels
# "Total request rate across all instances"
sum(rate(http_requests_total[5m]))

# sum by() — aggregate and group
# "Request rate broken down by status code"
sum by (status_code) (rate(http_requests_total[5m]))

# histogram_quantile() — calculate percentiles from histograms
# "95th percentile request duration"
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# increase() — total increase of a counter over a time window
# "Total errors in the last hour"
increase(http_errors_total[1h])

# Combining queries — error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# Node exporter — available memory percentage
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Node exporter — disk space used percentage
100 - (node_filesystem_avail_bytes{mountpoint="/"}
       / node_filesystem_size_bytes{mountpoint="/"} * 100)

# CPU usage percentage (all cores averaged)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
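If rate() and histogram_quantile() feel opaque, their core arithmetic is small. The sketch below is a simplified Python rendering of both ideas (real Prometheus additionally handles counter resets and extrapolates to the window boundaries): a per-second rate from two counter samples, and a quantile interpolated linearly inside cumulative le buckets.

```python
# rate(): per-second increase of a counter between two samples.
def simple_rate(v1, t1, v2, t2):
    return (v2 - v1) / (t2 - t1)

# Counter went from 1200 to 1500 requests over a 300s (5m) window:
req_rate = simple_rate(1200, 0, 1500, 300)  # -> 1.0 requests/second

# histogram_quantile(): linear interpolation inside cumulative "le" buckets.
def simple_histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # interpolate linearly within the bucket that contains the target rank
            return prev_bound + (bound - prev_bound) * (
                (target - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 60 finished under 0.1s, 90 under 0.5s, all 100 under 1.0s
latency_buckets = [(0.1, 60), (0.5, 90), (1.0, 100)]
p95 = simple_histogram_quantile(0.95, latency_buckets)  # -> 0.75 seconds
```

This is also why bucket boundaries matter: the quantile is an interpolation, so a P95 of 0.75s really means "somewhere between 0.5s and 1.0s".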

Grafana Data Source Provisioning

Auto-configure Grafana to connect to Prometheus on startup:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

Importing Pre-Built Dashboards

Do not build dashboards from scratch. Import the community dashboards first, then customize:

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: "Default"
    orgId: 1
    folder: "Imported"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards/json

# Download popular dashboards from Grafana.com
mkdir -p grafana/provisioning/dashboards/json

# Node Exporter Full (ID: 1860) — THE system monitoring dashboard
curl -o grafana/provisioning/dashboards/json/node-exporter.json \
  "https://grafana.com/api/dashboards/1860/revisions/latest/download"

# Docker monitoring (ID: 893)
curl -o grafana/provisioning/dashboards/json/docker.json \
  "https://grafana.com/api/dashboards/893/revisions/latest/download"

# Prometheus 2.0 Stats (ID: 3662)
curl -o grafana/provisioning/dashboards/json/prometheus-stats.json \
  "https://grafana.com/api/dashboards/3662/revisions/latest/download"

Alert Rules

# prometheus/alert-rules.yml
groups:
  - name: system-alerts
    rules:
      # High CPU usage sustained for 10 minutes
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 85% for 10 minutes. Current value: {{ $value | printf \"%.1f\" }}%"
          runbook_url: "https://wiki.internal/runbooks/high-cpu"

      # Disk space running low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Less than 15% disk space remaining. Current: {{ $value | printf \"%.1f\" }}%"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been down for more than 2 minutes."

  - name: application-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          # $value here is a 0–1 ratio, so format it with humanizePercentage
          description: "Error rate is above 1%. Current: {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1s for {{ $labels.service }}"
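Note what the for: clause buys you in these rules: an alert does not fire the moment its expression becomes true. It enters a pending state, fires only once the condition has held continuously for the full duration, and drops back to inactive the moment the expression clears. A simplified Python sketch of that state machine (illustrative only; real Prometheus tracks this per label set, alongside much more state):

```python
# Sketch of the "for:" clause: inactive -> pending -> firing, resetting to
# inactive whenever the expression stops being true.
def alert_states(samples, threshold, for_seconds, interval):
    """samples: one metric value per evaluation interval; returns the state
    the alert would be in after each evaluation."""
    states = []
    pending_since = None
    for i, value in enumerate(samples):
        now = i * interval
        if value <= threshold:
            pending_since = None          # condition cleared: reset the timer
            states.append("inactive")
        else:
            if pending_since is None:
                pending_since = now       # condition just became true
            if now - pending_since >= for_seconds:
                states.append("firing")
            else:
                states.append("pending")
    return states

# CPU% evaluated every 60s against threshold 85, with for: 180s (3m):
cpu = [70, 90, 92, 91, 95, 60, 99]
states = alert_states(cpu, threshold=85, for_seconds=180, interval=60)
```

The brief dip to 60 resets the timer, which is exactly why for: suppresses flapping: a transient spike never pages anyone.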

Alertmanager: Route Alerts to Slack

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  receiver: "slack-default"
  group_by: ["alertname", "severity"]
  group_wait: 30s       # Wait before sending first notification
  group_interval: 5m    # Wait between notifications for same group
  repeat_interval: 4h   # Re-send if still firing after 4 hours

  routes:
    # Critical alerts go to their own channel with a shorter repeat interval
    # (add a pagerduty_configs receiver here if you also want to page)
    - match:
        severity: critical
      receiver: "slack-critical"
      repeat_interval: 1h

    # Warning alerts go to the default Slack channel
    - match:
        severity: warning
      receiver: "slack-default"
      repeat_interval: 4h

receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#monitoring-warnings"
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

  - name: "slack-critical"
    slack_configs:
      - channel: "#monitoring-critical"
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
        send_resolved: true

# Silence alerts during maintenance windows
# Use the Alertmanager UI at :9093 or the amtool CLI:
# amtool silence add alertname="HighCPUUsage" --duration=2h --comment="Planned maintenance"
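The routing section above is a first-match-wins walk over the child routes, falling back to the root receiver when nothing matches. A simplified Python sketch of that decision (hypothetical helper; real Alertmanager routing is a tree and also supports regex matchers, nested routes, and the continue flag):

```python
# Sketch of Alertmanager routing: try each child route in order, first
# matching route wins, otherwise fall back to the root receiver.
ROOT_RECEIVER = "slack-default"
ROUTES = [
    ({"severity": "critical"}, "slack-critical"),
    ({"severity": "warning"}, "slack-default"),
]

def route(alert_labels):
    for matchers, receiver in ROUTES:
        # a route matches when every matcher label equals the alert's label
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return ROOT_RECEIVER

critical_target = route({"alertname": "DiskSpaceLow", "severity": "critical"})
info_target = route({"alertname": "Heartbeat", "severity": "info"})
```

An alert labeled severity: critical lands in #monitoring-critical; anything unmatched, like a hypothetical severity: info, falls through to the root receiver.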

Recording Rules for Performance

If you query the same expensive expression repeatedly (in dashboards or alerts), use recording rules to pre-compute it:

# Register this file in prometheus/prometheus.yml under rule_files
# prometheus/recording-rules.yml
groups:
  - name: request-rates
    interval: 15s
    rules:
      # Pre-compute total request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute error-rate ratio (0–1, as the _ratio suffix indicates)
      - record: job:http_errors:rate5m_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute P99 latency by job
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          )

Now your dashboards query job:http_requests:rate5m instead of recomputing sum by (job) (rate(http_requests_total[5m])) every 15 seconds across 50 panels.
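The rule names follow Prometheus's documented level:metric:operations convention: the aggregation level (the labels you grouped by), the underlying metric name, then the operations applied. A tiny, hypothetical Python helper shows the shape, and how to read a name back:

```python
# level:metric:operations — the documented naming convention for
# Prometheus recording rules. Hypothetical helper, for illustration only.
def recording_rule_name(level, metric, operations):
    return f"{level}:{metric}:{operations}"

name = recording_rule_name("job", "http_requests", "rate5m")
level, metric, operations = name.split(":")
```

Reading job:http_requests:rate5m left to right: aggregated by job, over http_requests_total, via rate over 5m. Consistent names make a recorded series self-describing on any dashboard.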

Launch and Verify

# Start the full stack
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets (should show all targets as UP)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check Alertmanager is reachable
curl -s http://localhost:9093/api/v2/status | jq '.cluster.status'

# Open in browser:
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin / changeme)
# Alertmanager: http://localhost:9093
# Node metrics: http://localhost:9100/metrics

Federation for Scale

When one Prometheus instance is not enough, use federation to aggregate metrics from multiple Prometheus servers:

# On the central/global Prometheus, add this scrape config:
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'  # Pull all job metrics (".+" — Prometheus rejects selectors whose matchers can all match empty)
        - "up"           # Pull up/down status
    static_configs:
      - targets:
          - "prometheus-us-east:9090"
          - "prometheus-eu-west:9090"
          - "prometheus-ap-south:9090"

This gives you a global view while each regional Prometheus handles its own scraping and local alerting. For truly large-scale setups, look at Thanos or Cortex, which add long-term storage and global querying on top of Prometheus.
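The honor_labels: true setting matters here. On a label conflict, honor_labels: true keeps the federated series' own labels; with false, the central server's target labels win and the originals are renamed with an exported_ prefix. A simplified Python sketch of that merge:

```python
# Sketch of honor_labels during a federation scrape: how scraped-series
# labels and the scraping server's target labels are merged on conflict.
def merge_labels(scraped, target, honor_labels):
    merged = dict(scraped)
    for k, v in target.items():
        if k in merged:
            if honor_labels:
                continue  # keep the federated series' own label value
            merged["exported_" + k] = merged[k]  # preserve the original
            merged[k] = v                        # target label wins
        else:
            merged[k] = v
    return merged

scraped = {"job": "my-app", "instance": "app:8080"}
target = {"job": "federate", "instance": "prometheus-us-east:9090"}

kept = merge_labels(scraped, target, honor_labels=True)
rewritten = merge_labels(scraped, target, honor_labels=False)
```

Without honor_labels: true, every federated series would appear to come from job="federate", destroying the per-service breakdown you federated for in the first place.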


You now have a production-grade monitoring stack running locally. In the next post, we will tackle artifact management — understanding JFrog Artifactory, Nexus, and container registries for managing the binaries your CI/CD pipeline produces.