Monitor Linux Servers with Prometheus and Grafana

· 7 min read
Goel Academy
DevOps & Cloud Learning Hub

Your server crashed last night and nobody noticed until morning. The disk filled up at 2 AM, the OOM killer took out your application at 3 AM, and your team found out from angry customers at 9 AM. This is what happens when you run production without monitoring. Let's fix that in 15 minutes.

Architecture Overview

The monitoring stack we're building follows a simple pull-based model:

Component       Role                                        Default Port
node_exporter   Exposes Linux metrics as an HTTP endpoint   9100
Prometheus      Scrapes, stores, and queries metrics        9090
Grafana         Visualizes metrics with dashboards          3000
Alertmanager    Routes alerts to Slack, email, PagerDuty    9093

Prometheus pulls metrics from node_exporter every 15 seconds. Grafana queries Prometheus to render dashboards. Alertmanager fires notifications when thresholds are breached.

Installing node_exporter

node_exporter is the agent that runs on every Linux server you want to monitor. It exposes hundreds of hardware and OS metrics.

# Download the latest node_exporter
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract and install
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create a dedicated user
sudo useradd --no-create-home --shell /bin/false node_exporter

Now create a systemd service so it starts automatically and survives reboots:

sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes \
    --collector.tcpstat

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter

Verify it's working:

# You should see hundreds of metrics
curl -s http://localhost:9100/metrics | head -20

# Check specific metrics
curl -s http://localhost:9100/metrics | grep 'node_cpu_seconds_total'
curl -s http://localhost:9100/metrics | grep 'node_memory_MemAvailable_bytes'
curl -s http://localhost:9100/metrics | grep 'node_filesystem_avail_bytes'

Installing and Configuring Prometheus

# Download Prometheus
cd /tmp
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.48.1/prometheus-2.48.1.linux-amd64.tar.gz
tar xvfz prometheus-2.48.1.linux-amd64.tar.gz

# Install binaries
sudo mv prometheus-2.48.1.linux-amd64/prometheus /usr/local/bin/
sudo mv prometheus-2.48.1.linux-amd64/promtool /usr/local/bin/

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo useradd --no-create-home --shell /bin/false prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

Create the Prometheus configuration:

sudo tee /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'linux-servers'
    static_configs:
      - targets:
          - 'localhost:9100'    # local server
          - '10.0.1.10:9100'    # web-server-1
          - '10.0.1.11:9100'    # web-server-2
          - '10.0.1.20:9100'    # db-server-1
        labels:
          env: production
          team: platform

  - job_name: 'staging-servers'
    static_configs:
      - targets:
          - '10.0.2.10:9100'
        labels:
          env: staging
EOF

Create the systemd service for Prometheus:

sudo tee /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --storage.tsdb.retention.time=30d \
    --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

Validate the config and check targets:

# Validate configuration syntax
promtool check config /etc/prometheus/prometheus.yml

# Check Prometheus is scraping targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | head -30

Key Linux Metrics to Monitor

These are the metrics that actually matter for production Linux servers:

Metric         PromQL Query                                                               What It Tells You
CPU Usage      100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)           Overall CPU utilization
Memory Usage   (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100    Real memory pressure
Disk Usage     (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100       Filesystem fill percentage
Disk I/O       rate(node_disk_io_time_seconds_total[5m])                                  I/O saturation (near 1 = saturated)
Network In     rate(node_network_receive_bytes_total[5m]) * 8                             Inbound bandwidth (bits/sec)
Network Out    rate(node_network_transmit_bytes_total[5m]) * 8                            Outbound bandwidth (bits/sec)
Load Average   node_load15 / count(node_cpu_seconds_total{mode="idle"})                   Normalized 15m load
Open Files     node_filefd_allocated                                                      File descriptor pressure
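You can evaluate any of these queries against the Prometheus HTTP API before wiring them into dashboards or alerts. A quick sketch, assuming Prometheus is listening on its default port:

```shell
# Run the memory-usage query as an instant query via the HTTP API;
# the JSON response carries one result per instance
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100' \
  | python3 -m json.tool
```

The same queries can also be explored interactively in the expression browser at http://localhost:9090/graph.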

Writing Alert Rules

This is where monitoring becomes useful -- you get paged before things break, not after.

sudo tee /etc/prometheus/alert_rules.yml << 'EOF'
groups:
  - name: linux-server-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5+ minutes"

      - alert: MemoryRunningLow
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory critical on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root partition is {{ $value }}% full"

      - alert: ServerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server {{ $labels.instance }} is DOWN"
          description: "node_exporter is unreachable for 1+ minute"

      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk I/O saturated on {{ $labels.instance }}"
EOF

# Validate alert rules
promtool check rules /etc/prometheus/alert_rules.yml

# Reload Prometheus config without restart
curl -X POST http://localhost:9090/-/reload
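After the reload, it's worth confirming that Prometheus actually picked up the rule group and seeing whether anything is already pending or firing. A quick check, assuming the default port:

```shell
# List the rule groups Prometheus has loaded...
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool | head -40

# ...and any alerts currently in pending or firing state
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool
```

An empty `alerts` array here is the good outcome; the rules are loaded but no thresholds are breached.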

Setting Up Grafana

# Install Grafana (Debian/Ubuntu)
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key
echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | \
sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana

# Start Grafana
sudo systemctl enable --now grafana-server

Now add Prometheus as a data source and import the standard Linux dashboard:

# Add Prometheus data source via API (default admin:admin)
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true
  }'

# Import the official Node Exporter Full dashboard (grafana.com ID 1860):
# download its JSON, wrap it in an import payload, then POST it
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o /tmp/node-exporter-full.json

python3 - <<'PYEOF'
import json
dash = json.load(open('/tmp/node-exporter-full.json'))
json.dump({
    "dashboard": dash,
    "overwrite": True,
    "inputs": [{"name": "DS_PROMETHEUS", "type": "datasource",
                "pluginId": "prometheus", "value": "Prometheus"}],
}, open('/tmp/import-payload.json', 'w'))
PYEOF

curl -X POST http://admin:admin@localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  --data-binary @/tmp/import-payload.json

echo "Grafana is running at http://localhost:3000 (admin / admin)"

For the quickest setup, open Grafana in your browser, go to Dashboards > Import, and enter dashboard ID 1860. This gives you a fully built Linux monitoring dashboard with CPU, memory, disk, network, and system metrics -- all pre-configured.
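If you'd rather avoid API calls entirely, Grafana can also provision the data source declaratively from a file it reads at startup. A sketch, using Grafana's standard provisioning directory:

```shell
# Declare the Prometheus data source as a provisioning file;
# Grafana applies it on (re)start -- no clicking, no API call
sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF

sudo systemctl restart grafana-server
```

This approach also keeps the data source definition in version control alongside your Prometheus configs.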

Quick Health Check Script

Here's a script you can run to verify the entire monitoring stack is healthy:

#!/bin/bash
echo "=== Monitoring Stack Health Check ==="

# Check node_exporter
if curl -sf http://localhost:9100/metrics > /dev/null; then
    echo "[OK] node_exporter is running"
else
    echo "[FAIL] node_exporter is not responding"
fi

# Check Prometheus
if curl -sf http://localhost:9090/-/healthy > /dev/null; then
    echo "[OK] Prometheus is healthy"
    TARGETS=$(curl -s http://localhost:9090/api/v1/targets | python3 -c "
import sys, json
data = json.load(sys.stdin)
active = [t for t in data['data']['activeTargets'] if t['health'] == 'up']
print(f'{len(active)} targets up')
")
    echo "  $TARGETS"
else
    echo "[FAIL] Prometheus is not responding"
fi

# Check Grafana
if curl -sf http://localhost:3000/api/health > /dev/null; then
    echo "[OK] Grafana is running"
else
    echo "[FAIL] Grafana is not responding"
fi

echo "=== Done ==="

Production Considerations

A few things you'll want to handle before calling this production-ready:

Storage sizing: Prometheus needs roughly 1-2 bytes per sample on disk. With 500 active series per server, scraped every 15 seconds, one server generates about 2.9M samples/day -- on the order of 3-6 MB/day. Multiply by your server count and retention window (30 days in the config above) when sizing /var/lib/prometheus.
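To put numbers on it for your own fleet, the arithmetic is simple enough to script. A back-of-envelope sketch (the 2 bytes/sample figure is the pessimistic end of the range; adjust the fleet numbers to match yours):

```shell
#!/bin/bash
# Rough Prometheus TSDB storage estimate for a fleet
SERVERS=10             # machines running node_exporter
SERIES_PER_SERVER=500  # active time series per machine
SCRAPE_INTERVAL=15     # seconds between scrapes
BYTES_PER_SAMPLE=2     # pessimistic end of the 1-2 B/sample range
RETENTION_DAYS=30      # must match --storage.tsdb.retention.time

SAMPLES_PER_DAY=$(( SERVERS * SERIES_PER_SERVER * 86400 / SCRAPE_INTERVAL ))
TOTAL_MB=$(( SAMPLES_PER_DAY * BYTES_PER_SAMPLE * RETENTION_DAYS / 1024 / 1024 ))

echo "${SAMPLES_PER_DAY} samples/day across the fleet"
echo "~${TOTAL_MB} MB of TSDB storage for ${RETENTION_DAYS}d retention"
```

For ten servers with the defaults above, that works out to a bit under 2 GB for a month of data -- small, but worth re-running whenever you add exporters with high series counts.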

Firewall rules: Lock down port 9100 to only allow your Prometheus server to scrape it. Never expose node_exporter to the internet.

# Allow only Prometheus server to reach node_exporter
sudo ufw allow from 10.0.1.5 to any port 9100
sudo ufw deny 9100

TLS and authentication: In production, use reverse proxies with TLS in front of all components, or configure Prometheus's built-in TLS support.
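As a sketch of the reverse-proxy approach -- assuming nginx is installed and substituting your own certificate for the hypothetical paths and hostname below -- terminating TLS in front of Grafana looks like this:

```shell
# Hypothetical cert paths and hostname; swap in your own
# (e.g. a Let's Encrypt certificate)
sudo tee /etc/nginx/conf.d/grafana.conf << 'EOF'
server {
    listen 443 ssl;
    server_name grafana.example.com;

    ssl_certificate     /etc/ssl/certs/grafana.example.com.crt;
    ssl_certificate_key /etc/ssl/private/grafana.example.com.key;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
    }
}
EOF

sudo nginx -t && sudo systemctl reload nginx
```

The same pattern works for the Prometheus and Alertmanager UIs; pair it with firewall rules so the backend ports are only reachable from the proxy.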


Next up: your monitoring is useless if your backups don't work. In the next post, we'll cover Linux Backup & Disaster Recovery with rsync, tar, and automated backup scripts.