# Monitor Linux Servers with Prometheus and Grafana
Your server crashed last night and nobody noticed until morning. The disk filled up at 2 AM, the OOM killer took out your application at 3 AM, and your team found out from angry customers at 9 AM. This is what happens when you run production without monitoring. Let's fix that in 15 minutes.
## Architecture Overview
The monitoring stack we're building follows a simple pull-based model:
| Component | Role | Default Port |
|---|---|---|
| node_exporter | Exposes Linux metrics as HTTP endpoints | 9100 |
| Prometheus | Scrapes, stores, and queries metrics | 9090 |
| Grafana | Visualizes metrics with dashboards | 3000 |
| Alertmanager | Routes alerts to Slack, email, PagerDuty | 9093 |
Prometheus pulls metrics from node_exporter every 15 seconds. Grafana queries Prometheus to render dashboards. Alertmanager fires notifications when thresholds are breached.
## Installing node_exporter
node_exporter is the agent that runs on every Linux server you want to monitor. It exposes hundreds of hardware and OS metrics.
```bash
# Download node_exporter (v1.7.0 shown; check the releases page for the latest)
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract and install
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create a dedicated unprivileged user
sudo useradd --no-create-home --shell /bin/false node_exporter
```
Now create a systemd service so it starts automatically and survives reboots:
```bash
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.tcpstat

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter
```
Verify it's working:
```bash
# You should see hundreds of metrics
curl -s http://localhost:9100/metrics | head -20

# Check a few specific metrics
curl -s http://localhost:9100/metrics | grep 'node_cpu_seconds_total'
curl -s http://localhost:9100/metrics | grep 'node_memory_MemAvailable_bytes'
curl -s http://localhost:9100/metrics | grep 'node_filesystem_avail_bytes'
```
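If you want a single number rather than raw metric lines, the text format is easy to post-process. Here's a small helper (illustrative, not part of the stack) that computes the memory-used percentage from metrics piped to stdin, using the same two gauges Prometheus will scrape:

```bash
# mem_used_pct: read node_exporter's text format on stdin and print the
# percentage of memory currently in use.
mem_used_pct() {
  awk '/^node_memory_MemAvailable_bytes/ {avail = $2}
       /^node_memory_MemTotal_bytes/     {total = $2}
       END { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Usage against a live exporter:
# curl -s http://localhost:9100/metrics | mem_used_pct
```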
## Installing and Configuring Prometheus
```bash
# Download Prometheus (v2.48.1 shown; check the releases page for the latest)
cd /tmp
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.48.1/prometheus-2.48.1.linux-amd64.tar.gz
tar xvfz prometheus-2.48.1.linux-amd64.tar.gz

# Install binaries
sudo mv prometheus-2.48.1.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.48.1.linux-amd64/promtool /usr/local/bin/

# Create directories and a dedicated user
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
```
Create the Prometheus configuration:
```bash
sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'linux-servers'
    static_configs:
      - targets:
          - 'localhost:9100'    # local server
          - '10.0.1.10:9100'    # web-server-1
          - '10.0.1.11:9100'    # web-server-2
          - '10.0.1.20:9100'    # db-server-1
        labels:
          env: production
          team: platform

  - job_name: 'staging-servers'
    static_configs:
      - targets:
          - '10.0.2.10:9100'
        labels:
          env: staging
EOF
```
Create the systemd service for Prometheus:
```bash
sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
```
Validate the config and check targets:
```bash
# Validate configuration syntax
promtool check config /etc/prometheus/prometheus.yml

# Check that Prometheus is scraping its targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | head -30
```
## Key Linux Metrics to Monitor
These are the metrics that actually matter for production Linux servers:
| Metric | PromQL Query | What It Tells You |
|---|---|---|
| CPU usage | `100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Per-instance CPU utilization |
| Memory usage | `(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100` | Real memory pressure |
| Disk usage | `(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100` | Filesystem fill percentage |
| Disk I/O | `rate(node_disk_io_time_seconds_total[5m])` | Fraction of time the disk is busy (near 1 = saturated) |
| Network in | `rate(node_network_receive_bytes_total[5m]) * 8` | Inbound bandwidth (bits/sec) |
| Network out | `rate(node_network_transmit_bytes_total[5m]) * 8` | Outbound bandwidth (bits/sec) |
| Load average | `node_load15 / on(instance) count by(instance)(node_cpu_seconds_total{mode="idle"})` | 15m load normalized by CPU count |
| Open files | `node_filefd_allocated` | File descriptor pressure |
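You can try any of these queries outside Grafana by hitting Prometheus's instant-query API directly. A tiny wrapper (`promq` is a hypothetical helper name, not part of Prometheus) keeps the URL-encoding out of your way:

```bash
# promq: run a PromQL expression through Prometheus's instant-query API.
# -G sends the --data-urlencode payload as query-string parameters.
promq() {
  curl -sG http://localhost:9090/api/v1/query --data-urlencode "query=$1"
}

# Example: memory usage per instance (pipe through jq or python3 -m json.tool):
# promq '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100'
```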
## Writing Alert Rules
This is where monitoring becomes useful -- you get paged before things break, not after.
```bash
sudo tee /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: linux-server-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"

      - alert: MemoryRunningLow
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory critical on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root partition is {{ $value }}% full"

      - alert: ServerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server {{ $labels.instance }} is DOWN"
          description: "node_exporter has been unreachable for 1+ minute"

      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O saturated on {{ $labels.instance }}"
EOF
```
```bash
# Validate alert rules
promtool check rules /etc/prometheus/alert_rules.yml

# Reload the Prometheus config without a restart (enabled by --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```
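The Prometheus config above points alerting at an Alertmanager on localhost:9093, but Alertmanager needs its own configuration before alerts go anywhere. A minimal sketch that routes everything to Slack -- the webhook URL and channel are placeholders you must replace, and the path assumes you've installed Alertmanager the same way as Prometheus:

```bash
sudo mkdir -p /etc/alertmanager
sudo tee /etc/alertmanager/alertmanager.yml << 'EOF'
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      # Placeholder -- replace with your Slack incoming-webhook URL
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME/TOKEN'
        channel: '#ops-alerts'
        send_resolved: true
EOF
```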
## Setting Up Grafana
```bash
# Install Grafana (Debian/Ubuntu)
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | \
  sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana

# Start Grafana
sudo systemctl enable --now grafana-server
```
Now add Prometheus as a data source and import the standard Linux dashboard:
```bash
# Add Prometheus as a data source via the API (default credentials admin:admin)
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

# Import the official Node Exporter Full dashboard (ID 1860). The import API
# needs the full dashboard JSON, so download it from grafana.com first
# (jq must be installed):
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o /tmp/node-exporter-full.json
jq '{dashboard: ., overwrite: true,
     inputs: [{name: "DS_PROMETHEUS", type: "datasource",
               pluginId: "prometheus", value: "Prometheus"}]}' \
  /tmp/node-exporter-full.json | \
curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" -d @-

echo "Grafana is running at http://localhost:3000 (admin / admin)"
```
For the quickest setup, open Grafana in your browser, go to Dashboards > Import, and enter dashboard ID 1860. This gives you a fully built Linux monitoring dashboard with CPU, memory, disk, network, and system metrics -- all pre-configured.
## Quick Health Check Script
Here's a script you can run to verify the entire monitoring stack is healthy:
```bash
#!/bin/bash
echo "=== Monitoring Stack Health Check ==="

# Check node_exporter
if curl -sf http://localhost:9100/metrics > /dev/null; then
    echo "[OK] node_exporter is running"
else
    echo "[FAIL] node_exporter is not responding"
fi

# Check Prometheus
if curl -sf http://localhost:9090/-/healthy > /dev/null; then
    echo "[OK] Prometheus is healthy"
    TARGETS=$(curl -s http://localhost:9090/api/v1/targets | python3 -c "
import sys, json
data = json.load(sys.stdin)
active = [t for t in data['data']['activeTargets'] if t['health'] == 'up']
print(f'{len(active)} targets up')
")
    echo "  $TARGETS"
else
    echo "[FAIL] Prometheus is not responding"
fi

# Check Grafana
if curl -sf http://localhost:3000/api/health > /dev/null; then
    echo "[OK] Grafana is running"
else
    echo "[FAIL] Grafana is not responding"
fi

echo "=== Done ==="
```
## Production Considerations
A few things you'll want to handle before calling this production-ready:
**Storage sizing:** Prometheus uses about 1-2 bytes per sample on disk. With 500 series per server, scraped every 15 seconds, one server generates roughly 2.88M samples/day -- about 3-5 MB/day. Plan retention accordingly.
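A quick sanity check of those numbers (same assumptions: 500 series per server, 15 s scrape interval, ~1.5 bytes/sample):

```bash
awk 'BEGIN {
  samples = 500 * 86400 / 15                 # samples per server per day
  mb_day  = samples * 1.5 / 1024 / 1024      # at ~1.5 bytes/sample
  printf "%d samples/day, %.1f MB/day, %.0f MB for 30d retention\n",
         samples, mb_day, mb_day * 30
}'
```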
**Firewall rules:** Lock down port 9100 so that only your Prometheus server can scrape it. Never expose node_exporter to the internet.
```bash
# Allow only the Prometheus server (10.0.1.5 here -- use your own IP) to reach node_exporter
sudo ufw allow from 10.0.1.5 to any port 9100
sudo ufw deny 9100
```
**TLS and authentication:** In production, put reverse proxies with TLS in front of all components, or configure Prometheus's built-in TLS support.
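One common pattern is nginx as a TLS-terminating proxy in front of Grafana. A sketch, with a placeholder hostname and certificate paths you'd replace with your own:

```bash
sudo tee /etc/nginx/sites-available/grafana << 'EOF'
server {
    listen 443 ssl;
    server_name grafana.example.com;   # placeholder hostname

    ssl_certificate     /etc/ssl/certs/grafana.crt;    # your certificate
    ssl_certificate_key /etc/ssl/private/grafana.key;  # your private key

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }
}
EOF
sudo ln -sf /etc/nginx/sites-available/grafana /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```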
Next up: your monitoring is useless if your backups don't work. In the next post, we'll cover Linux Backup & Disaster Recovery with rsync, tar, and automated backup scripts.
