# Monitor Linux Servers with Prometheus and Grafana
Your server crashed last night and nobody noticed until morning. The disk filled up at 2 AM, the OOM killer took out your application at 3 AM, and your team found out from angry customers at 9 AM. This is what happens when you run production without monitoring. Let's fix that in 15 minutes.
## Architecture Overview
The monitoring stack we're building follows a simple pull-based model:
| Component | Role | Default Port |
|---|---|---|
| node_exporter | Exposes Linux metrics as HTTP endpoints | 9100 |
| Prometheus | Scrapes, stores, and queries metrics | 9090 |
| Grafana | Visualizes metrics with dashboards | 3000 |
| Alertmanager | Routes alerts to Slack, email, PagerDuty | 9093 |
Prometheus pulls metrics from node_exporter every 15 seconds. Grafana queries Prometheus to render dashboards. Alertmanager fires notifications when thresholds are breached.
## Installing node_exporter
node_exporter is the agent that runs on every Linux server you want to monitor. It exposes hundreds of hardware and OS metrics.
```bash
# Download node_exporter (v1.7.0 shown; check the releases page for the latest)
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract and install
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create a dedicated unprivileged user
sudo useradd --no-create-home --shell /bin/false node_exporter
```
Now create a systemd service so it starts automatically and survives reboots:
```bash
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.tcpstat

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
```
Verify it's working:
```bash
# You should see hundreds of metrics
curl -s http://localhost:9100/metrics | head -20

# Check a few specific metrics
curl -s http://localhost:9100/metrics | grep 'node_cpu_seconds_total'
curl -s http://localhost:9100/metrics | grep 'node_memory_MemAvailable_bytes'
curl -s http://localhost:9100/metrics | grep 'node_filesystem_avail_bytes'
```
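If you want a single number rather than raw metric lines, the text format is easy to post-process. Here's a small helper (illustrative, not part of the stack) that computes the memory-used percentage from metrics piped to stdin, using the same two gauges Prometheus will scrape:

```bash
# mem_used_pct: read node_exporter's text format on stdin and print the
# percentage of memory currently in use.
mem_used_pct() {
  awk '/^node_memory_MemAvailable_bytes/ {avail = $2}
       /^node_memory_MemTotal_bytes/     {total = $2}
       END { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Usage against a live exporter:
# curl -s http://localhost:9100/metrics | mem_used_pct
```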
## Installing and Configuring Prometheus
```bash
# Download Prometheus (v2.48.1 shown; check the releases page for the latest)
cd /tmp
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.48.1/prometheus-2.48.1.linux-amd64.tar.gz
tar xvfz prometheus-2.48.1.linux-amd64.tar.gz

# Install binaries
sudo mv prometheus-2.48.1.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.48.1.linux-amd64/promtool /usr/local/bin/

# Create directories and a dedicated user
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
```
Create the Prometheus configuration:
```bash
sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'linux-servers'
    static_configs:
      - targets:
          - 'localhost:9100'    # local server
          - '10.0.1.10:9100'    # web-server-1
          - '10.0.1.11:9100'    # web-server-2
          - '10.0.1.20:9100'    # db-server-1
        labels:
          env: production
          team: platform

  - job_name: 'staging-servers'
    static_configs:
      - targets:
          - '10.0.2.10:9100'
        labels:
          env: staging
EOF
```
Create the systemd service for Prometheus:
```bash
sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
```
Validate the config and check targets:
```bash
# Validate configuration syntax
promtool check config /etc/prometheus/prometheus.yml

# Check that Prometheus is scraping its targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | head -30
```
## Key Linux Metrics to Monitor
These are the metrics that actually matter for production Linux servers:
| Metric | PromQL Query | What It Tells You |
|---|---|---|
| CPU usage | `100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Per-instance CPU utilization |
| Memory usage | `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100` | Real memory pressure |
| Disk usage | `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100` | Filesystem fill percentage |
| Disk I/O | `rate(node_disk_io_time_seconds_total[5m])` | Fraction of time the disk is busy (near 1 = saturated) |
| Network in | `rate(node_network_receive_bytes_total[5m]) * 8` | Inbound bandwidth (bits/sec) |
| Network out | `rate(node_network_transmit_bytes_total[5m]) * 8` | Outbound bandwidth (bits/sec) |
| Load average | `node_load15 / on(instance) count by(instance)(node_cpu_seconds_total{mode="idle"})` | 15m load normalized by CPU count |
| Open files | `node_filefd_allocated` | File descriptor pressure |
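You can try any of these queries outside Grafana by hitting Prometheus's instant-query API directly. A tiny wrapper (`promq` is a hypothetical helper name, not part of Prometheus) keeps the URL-encoding out of your way:

```bash
# promq: run a PromQL expression through Prometheus's instant-query API.
# -G sends the --data-urlencode payload as query-string parameters.
promq() {
  curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=$1"
}

# Example: memory usage per instance (pipe through jq or python3 -m json.tool):
# promq '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'
```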
## Writing Alert Rules
This is where monitoring becomes useful -- you get paged before things break, not after.
```bash
sudo tee /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: linux-server-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"

      - alert: MemoryRunningLow
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory critical on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root partition is {{ $value }}% full"

      - alert: ServerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server {{ $labels.instance }} is DOWN"
          description: "node_exporter has been unreachable for 1+ minute"

      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O saturated on {{ $labels.instance }}"
EOF
```
```bash
# Validate alert rules
promtool check rules /etc/prometheus/alert_rules.yml

# Reload the Prometheus config without a restart (enabled by --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```
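The Prometheus config above points alerting at an Alertmanager on localhost:9093, but Alertmanager needs its own configuration before alerts go anywhere. A minimal sketch that routes everything to Slack -- the webhook URL and channel are placeholders you must replace, and the path assumes you've installed Alertmanager the same way as Prometheus:

```bash
sudo mkdir -p /etc/alertmanager
sudo tee /etc/alertmanager/alertmanager.yml << 'EOF'
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      # Placeholder -- replace with your Slack incoming-webhook URL
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME/TOKEN'
        channel: '#ops-alerts'
        send_resolved: true
EOF
```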
## Setting Up Grafana
```bash
# Install Grafana (Debian/Ubuntu)
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | \
  sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana

# Start Grafana
sudo systemctl enable --now grafana-server
```
Now add Prometheus as a data source and import the standard Linux dashboard:
```bash
# Add Prometheus as a data source via the API (default credentials admin:admin)
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

# Import the official Node Exporter Full dashboard (ID 1860). The import API
# needs the full dashboard JSON, so download it from grafana.com first
# (jq must be installed):
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o /tmp/node-exporter-full.json
jq '{dashboard: ., overwrite: true,
     inputs: [{name: "DS_PROMETHEUS", type: "datasource",
               pluginId: "prometheus", value: "Prometheus"}]}' \
  /tmp/node-exporter-full.json | \
curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" -d @-

echo "Grafana is running at http://localhost:3000 (admin / admin)"
```
For the quickest setup, open Grafana in your browser, go to Dashboards > Import, and enter dashboard ID 1860. This gives you a fully built Linux monitoring dashboard with CPU, memory, disk, network, and system metrics -- all pre-configured.
## Quick Health Check Script
Here's a script you can run to verify the entire monitoring stack is healthy:
```bash
#!/bin/bash
echo "=== Monitoring Stack Health Check ==="

# Check node_exporter
if curl -sf http://localhost:9100/metrics > /dev/null; then
    echo "[OK] node_exporter is running"
else
    echo "[FAIL] node_exporter is not responding"
fi

# Check Prometheus
if curl -sf http://localhost:9090/-/healthy > /dev/null; then
    echo "[OK] Prometheus is healthy"
    TARGETS=$(curl -s http://localhost:9090/api/v1/targets | python3 -c "
import sys, json
data = json.load(sys.stdin)
active = [t for t in data['data']['activeTargets'] if t['health'] == 'up']
print(f'{len(active)} targets up')
")
    echo "  $TARGETS"
else
    echo "[FAIL] Prometheus is not responding"
fi

# Check Grafana
if curl -sf http://localhost:3000/api/health > /dev/null; then
    echo "[OK] Grafana is running"
else
    echo "[FAIL] Grafana is not responding"
fi

echo "=== Done ==="
```
## Production Considerations
A few things you'll want to handle before calling this production-ready:
**Storage sizing:** Prometheus uses about 1-2 bytes per sample on disk. With 500 series per server, scraped every 15 seconds, one server generates roughly 2.88M samples/day -- about 3-5 MB/day. Plan retention accordingly.
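A quick sanity check of those numbers (same assumptions: 500 series per server, 15 s scrape interval, ~1.5 bytes/sample):

```bash
awk 'BEGIN {
  samples = 500 * 86400 / 15                 # samples per server per day
  mb_day  = samples * 1.5 / 1024 / 1024      # at ~1.5 bytes/sample
  printf "%d samples/day, %.1f MB/day, %.0f MB for 30d retention\n",
         samples, mb_day, mb_day * 30
}'
```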
**Firewall rules:** Lock down port 9100 so that only your Prometheus server can scrape it. Never expose node_exporter to the internet.
```bash
# Allow only the Prometheus server (10.0.1.5 here -- use your own IP) to reach node_exporter
sudo ufw allow from 10.0.1.5 to any port 9100
sudo ufw deny 9100
```
**TLS and authentication:** In production, put reverse proxies with TLS in front of all components, or configure Prometheus's built-in TLS support.
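One common pattern is nginx as a TLS-terminating proxy in front of Grafana. A sketch, with a placeholder hostname and certificate paths you'd replace with your own:

```bash
sudo tee /etc/nginx/sites-available/grafana << 'EOF'
server {
    listen 443 ssl;
    server_name grafana.example.com;   # placeholder hostname

    ssl_certificate     /etc/ssl/certs/grafana.crt;    # your certificate
    ssl_certificate_key /etc/ssl/private/grafana.key;  # your private key

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }
}
EOF
sudo ln -sf /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```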
Next up: your monitoring is useless if your backups don't work. In the next post, we'll cover Linux Backup & Disaster Recovery with rsync, tar, and automated backup scripts.
