
Prometheus and Grafana — Set Up Production Monitoring in 15 Minutes

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

You have read about the Golden Signals and the three pillars of observability. Now it is time to stop theorizing and start measuring. In this post, we will set up a complete monitoring stack — Prometheus for metrics collection, Grafana for visualization, node_exporter for system metrics, and Alertmanager for routing alerts to Slack. Everything runs locally with Docker Compose, and everything follows production-ready patterns.

How Prometheus Works

Prometheus uses a pull-based model. Instead of your services pushing metrics to a central server, Prometheus scrapes HTTP endpoints on a schedule. This is a critical architectural difference from push-based systems like StatsD or CloudWatch.
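To make the pull model concrete, here is a toy sketch using only Python's standard library: a fake app serves a counter at /metrics in the text exposition format, and a one-line "scrape" fetches it over HTTP, exactly as Prometheus does on its schedule. This is illustrative only; a real application would use the official prometheus_client library rather than hand-rolling the format.

```python
# Toy illustration of the pull model: the app exposes /metrics as plain text,
# and the scraper (Prometheus) fetches it over HTTP on a schedule.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 1027  # pretend the app increments this counter

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP http_requests_total Total HTTP requests served.\n"
            "# TYPE http_requests_total counter\n"
            f'http_requests_total{{method="GET",status="200"}} {REQUEST_COUNT}\n'
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# This is all a "scrape" is: an HTTP GET against the /metrics endpoint.
url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
print(scraped)
server.shutdown()
```

Prometheus repeats this GET against every configured target, every scrape_interval, and stores the parsed samples in its TSDB.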

Your services expose /metrics endpoints:

┌──────────┐  ┌──────────┐  ┌──────────┐
│ App :8080│  │ App :8081│  │ Node Exp │
│ /metrics │  │ /metrics │  │  :9100   │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
     ▼             ▼             ▼
┌──────────────────────────────────────────┐
│           Prometheus :9090               │
│     Scrapes targets every 15s            │
│     Stores time-series data (TSDB)       │
│     Evaluates alert rules                │
└──────────────┬───────────────────────────┘
               │
     ┌─────────┴─────────┐
     ▼                   ▼
┌────────────┐    ┌──────────────┐
│  Grafana   │    │ Alertmanager │
│   :3000    │    │    :9093     │
│ Dashboards │    │  → Slack     │
└────────────┘    │  → PagerDuty │
                  └──────────────┘

Full Stack with Docker Compose

Create a project directory and add this docker-compose.yml:

# docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: changeme
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate alert rules
  scrape_timeout: 10s       # Timeout per scrape

# Load alert rules
rule_files:
  - "alert-rules.yml"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

# Scrape configurations
scrape_configs:
  # Prometheus monitors itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # System metrics via node_exporter
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "server-01"
          environment: "production"

  # Your application (add your own apps here)
  - job_name: "my-app"
    metrics_path: /metrics
    static_configs:
      - targets: ["app:8080"]
        labels:
          service: "api"
          environment: "production"

  # Scrape Docker containers with labels (service discovery)
  # Uncomment if using Docker service discovery:
  # - job_name: "docker"
  #   docker_sd_configs:
  #     - host: unix:///var/run/docker.sock
  #       refresh_interval: 30s

PromQL: Querying Your Metrics

PromQL is Prometheus's query language. These are the functions you will use daily:

# rate() — per-second rate of a counter over a time window
# "How many requests per second in the last 5 minutes?"
rate(http_requests_total[5m])

# sum() — aggregate across all labels
# "Total request rate across all instances"
sum(rate(http_requests_total[5m]))

# sum by() — aggregate and group
# "Request rate broken down by status code"
sum by (status_code) (rate(http_requests_total[5m]))

# histogram_quantile() — calculate percentiles from histograms
# "95th percentile request duration"
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# increase() — total increase of a counter over a time window
# "Total errors in the last hour"
increase(http_errors_total[1h])

# Combining queries — error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# Node exporter — available memory percentage
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Node exporter — disk space used percentage
100 - (node_filesystem_avail_bytes{mountpoint="/"}
       / node_filesystem_size_bytes{mountpoint="/"} * 100)

# CPU usage percentage (all cores averaged)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
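If rate() and histogram_quantile() feel opaque, their core arithmetic is small. The sketch below is a simplified Python rendering of both ideas (real Prometheus additionally handles counter resets and extrapolates to the window boundaries): a per-second rate from two counter samples, and a quantile interpolated linearly inside cumulative le buckets.

```python
# rate(): per-second increase of a counter between two samples.
def simple_rate(v1, t1, v2, t2):
    return (v2 - v1) / (t2 - t1)

# Counter went from 1200 to 1500 requests over a 300s (5m) window:
req_rate = simple_rate(1200, 0, 1500, 300)  # -> 1.0 requests/second

# histogram_quantile(): linear interpolation inside cumulative "le" buckets.
def simple_histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # interpolate linearly within the bucket that contains the target rank
            return prev_bound + (bound - prev_bound) * (
                (target - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 60 finished under 0.1s, 90 under 0.5s, all 100 under 1.0s
latency_buckets = [(0.1, 60), (0.5, 90), (1.0, 100)]
p95 = simple_histogram_quantile(0.95, latency_buckets)  # -> 0.75 seconds
```

This is also why bucket boundaries matter: the quantile is an interpolation, so a P95 of 0.75s really means "somewhere between 0.5s and 1.0s".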

Grafana Data Source Provisioning

Auto-configure Grafana to connect to Prometheus on startup:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

Importing Pre-Built Dashboards

Do not build dashboards from scratch. Import the community dashboards first, then customize:

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: "Default"
    orgId: 1
    folder: "Imported"
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards/json

# Download popular dashboards from Grafana.com
mkdir -p grafana/provisioning/dashboards/json

# Node Exporter Full (ID: 1860) — THE system monitoring dashboard
curl -o grafana/provisioning/dashboards/json/node-exporter.json \
  "https://grafana.com/api/dashboards/1860/revisions/latest/download"

# Docker monitoring (ID: 893)
curl -o grafana/provisioning/dashboards/json/docker.json \
  "https://grafana.com/api/dashboards/893/revisions/latest/download"

# Prometheus 2.0 Stats (ID: 3662)
curl -o grafana/provisioning/dashboards/json/prometheus-stats.json \
  "https://grafana.com/api/dashboards/3662/revisions/latest/download"

Alert Rules

# prometheus/alert-rules.yml
groups:
  - name: system-alerts
    rules:
      # High CPU usage sustained for 10 minutes
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 85% for 10 minutes. Current value: {{ $value | printf \"%.1f\" }}%"
          runbook_url: "https://wiki.internal/runbooks/high-cpu"

      # Disk space running low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Less than 15% disk space remaining. Current: {{ $value | printf \"%.1f\" }}%"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been down for more than 2 minutes."

  - name: application-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          # $value here is a 0–1 ratio, so format it with humanizePercentage
          description: "Error rate is above 1%. Current: {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1s for {{ $labels.service }}"
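Note what the for: clause buys you in these rules: an alert does not fire the moment its expression becomes true. It enters a pending state, fires only once the condition has held continuously for the full duration, and drops back to inactive the moment the expression clears. A simplified Python sketch of that state machine (illustrative only; real Prometheus tracks this per label set, alongside much more state):

```python
# Sketch of the "for:" clause: inactive -> pending -> firing, resetting to
# inactive whenever the expression stops being true.
def alert_states(samples, threshold, for_seconds, interval):
    """samples: one metric value per evaluation interval; returns the state
    the alert would be in after each evaluation."""
    states = []
    pending_since = None
    for i, value in enumerate(samples):
        now = i * interval
        if value <= threshold:
            pending_since = None          # condition cleared: reset the timer
            states.append("inactive")
        else:
            if pending_since is None:
                pending_since = now       # condition just became true
            if now - pending_since >= for_seconds:
                states.append("firing")
            else:
                states.append("pending")
    return states

# CPU% evaluated every 60s against threshold 85, with for: 180s (3m):
cpu = [70, 90, 92, 91, 95, 60, 99]
states = alert_states(cpu, threshold=85, for_seconds=180, interval=60)
```

The brief dip to 60 resets the timer, which is exactly why for: suppresses flapping: a transient spike never pages anyone.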

Alertmanager: Route Alerts to Slack

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  receiver: "slack-default"
  group_by: ["alertname", "severity"]
  group_wait: 30s       # Wait before sending first notification
  group_interval: 5m    # Wait between notifications for same group
  repeat_interval: 4h   # Re-send if still firing after 4 hours

  routes:
    # Critical alerts go to their own channel with a shorter repeat interval
    # (add a pagerduty_configs receiver here if you also want to page)
    - match:
        severity: critical
      receiver: "slack-critical"
      repeat_interval: 1h

    # Warning alerts go to the default Slack channel
    - match:
        severity: warning
      receiver: "slack-default"
      repeat_interval: 4h

receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#monitoring-warnings"
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

  - name: "slack-critical"
    slack_configs:
      - channel: "#monitoring-critical"
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          Runbook: {{ .Annotations.runbook_url }}
          {{ end }}
        send_resolved: true

# Silence alerts during maintenance windows
# Use the Alertmanager UI at :9093 or the amtool CLI:
# amtool silence add alertname="HighCPUUsage" --duration=2h --comment="Planned maintenance"
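The routing section above is a first-match-wins walk over the child routes, falling back to the root receiver when nothing matches. A simplified Python sketch of that decision (hypothetical helper; real Alertmanager routing is a tree and also supports regex matchers, nested routes, and the continue flag):

```python
# Sketch of Alertmanager routing: try each child route in order, first
# matching route wins, otherwise fall back to the root receiver.
ROOT_RECEIVER = "slack-default"
ROUTES = [
    ({"severity": "critical"}, "slack-critical"),
    ({"severity": "warning"}, "slack-default"),
]

def route(alert_labels):
    for matchers, receiver in ROUTES:
        # a route matches when every matcher label equals the alert's label
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return ROOT_RECEIVER

critical_target = route({"alertname": "DiskSpaceLow", "severity": "critical"})
info_target = route({"alertname": "Heartbeat", "severity": "info"})
```

An alert labeled severity: critical lands in #monitoring-critical; anything unmatched, like a hypothetical severity: info, falls through to the root receiver.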

Recording Rules for Performance

If you query the same expensive expression repeatedly (in dashboards or alerts), use recording rules to pre-compute it:

# Register this file in prometheus/prometheus.yml under rule_files
# prometheus/recording-rules.yml
groups:
  - name: request-rates
    interval: 15s
    rules:
      # Pre-compute total request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute error-rate ratio (0–1, as the _ratio suffix indicates)
      - record: job:http_errors:rate5m_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute P99 latency by job
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          )

Now your dashboards query job:http_requests:rate5m instead of recomputing sum by (job) (rate(http_requests_total[5m])) every 15 seconds across 50 panels.
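The rule names follow Prometheus's documented level:metric:operations convention: the aggregation level (the labels you grouped by), the underlying metric name, then the operations applied. A tiny, hypothetical Python helper shows the shape, and how to read a name back:

```python
# level:metric:operations — the documented naming convention for
# Prometheus recording rules. Hypothetical helper, for illustration only.
def recording_rule_name(level, metric, operations):
    return f"{level}:{metric}:{operations}"

name = recording_rule_name("job", "http_requests", "rate5m")
level, metric, operations = name.split(":")
```

Reading job:http_requests:rate5m left to right: aggregated by job, over http_requests_total, via rate over 5m. Consistent names make a recorded series self-describing on any dashboard.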

Launch and Verify

# Start the full stack
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets (should show all targets as UP)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check Alertmanager is reachable
curl -s http://localhost:9093/api/v2/status | jq '.cluster.status'

# Open in browser:
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin / changeme)
# Alertmanager: http://localhost:9093
# Node metrics: http://localhost:9100/metrics

Federation for Scale

When one Prometheus instance is not enough, use federation to aggregate metrics from multiple Prometheus servers:

# On the central/global Prometheus, add this scrape config:
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'  # Pull all job metrics (".+" — Prometheus rejects selectors whose matchers can all match empty)
        - "up"           # Pull up/down status
    static_configs:
      - targets:
          - "prometheus-us-east:9090"
          - "prometheus-eu-west:9090"
          - "prometheus-ap-south:9090"

This gives you a global view while each regional Prometheus handles its own scraping and local alerting. For truly large-scale setups, look at Thanos or Cortex, which add long-term storage and global querying on top of Prometheus.
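The honor_labels: true setting matters here. On a label conflict, honor_labels: true keeps the federated series' own labels; with false, the central server's target labels win and the originals are renamed with an exported_ prefix. A simplified Python sketch of that merge:

```python
# Sketch of honor_labels during a federation scrape: how scraped-series
# labels and the scraping server's target labels are merged on conflict.
def merge_labels(scraped, target, honor_labels):
    merged = dict(scraped)
    for k, v in target.items():
        if k in merged:
            if honor_labels:
                continue  # keep the federated series' own label value
            merged["exported_" + k] = merged[k]  # preserve the original
            merged[k] = v                        # target label wins
        else:
            merged[k] = v
    return merged

scraped = {"job": "my-app", "instance": "app:8080"}
target = {"job": "federate", "instance": "prometheus-us-east:9090"}

kept = merge_labels(scraped, target, honor_labels=True)
rewritten = merge_labels(scraped, target, honor_labels=False)
```

Without honor_labels: true, every federated series would appear to come from job="federate", destroying the per-service breakdown you federated for in the first place.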


You now have a production-grade monitoring stack running locally. In the next post, we will tackle artifact management — understanding JFrog Artifactory, Nexus, and container registries for managing the binaries your CI/CD pipeline produces.