Monitoring 101 — Metrics, Logs, Traces, and the Golden Signals

7 min read
Goel Academy
DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes with a PagerDuty alert: "CPU usage above 90%." You drag yourself out of bed, SSH into the server, and discover the CPU spike was caused by a log rotation cron job that runs every night. It resolved itself two minutes later. This happens three times a week. You start ignoring alerts. Then one night, the database actually fills up and takes down production. Nobody notices for 47 minutes because the team has learned to silence their phones.

The Three Pillars of Observability

Monitoring tells you when something is wrong. Observability tells you why. Modern systems need both, and they are built on three pillars:

Metrics are numeric measurements collected over time — CPU usage, request count, memory consumption. They are cheap to store, fast to query, and ideal for dashboards and alerts.

Logs are timestamped records of discrete events — an error message, a user login, a database query. They give you the detail that metrics cannot.

Traces follow a single request as it travels through multiple services in a distributed system. They show you where time is being spent and where failures happen across service boundaries.

Request comes in:
┌──────────────────────────────────────────────────────────┐
│ TRACE: order-123                                         │
│                                                          │
│ [API Gateway] ─────► [Order Service] ─────► [Payment]    │
│     12ms                   45ms               230ms      │
│                             │                            │
│                             ▼                            │
│                      [Inventory DB]                      │
│                           89ms                           │
│                                                          │
│ Total: 376ms                                             │
└──────────────────────────────────────────────────────────┘

METRICS tell you: "P99 latency spiked to 800ms"
LOGS tell you: "Payment gateway returned timeout at 14:32:07"
TRACES tell you: "The Payment service call took 720ms out of 800ms total"
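The trace above can be modeled as a list of spans, each recording which service did the work and how long it took. This is a toy sketch, not a real tracing client; the service names and durations are taken from the example diagram, and it assumes the calls run sequentially (so the durations sum to the total).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    service: str
    parent: Optional[str]
    duration_ms: int

# Spans from the example trace order-123 above
spans = [
    Span("api-gateway", None, 12),
    Span("order-service", "api-gateway", 45),
    Span("payment", "order-service", 230),
    Span("inventory-db", "order-service", 89),
]

# The slowest span is where to start debugging
slowest = max(spans, key=lambda s: s.duration_ms)
total_ms = sum(s.duration_ms for s in spans)

print(f"Total: {total_ms}ms, slowest: {slowest.service} ({slowest.duration_ms}ms)")
```

Real tracing systems (Jaeger, Tempo) store exactly this kind of parent/child span tree, plus timestamps, so overlapping and parallel calls are handled correctly.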

Metric Types

Not all metrics are created equal. Understanding the types is critical for writing useful queries.

| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Only goes up (resets on restart) | http_requests_total | Request counts, errors, bytes sent |
| Gauge | Goes up and down | memory_usage_bytes | CPU, memory, queue depth, temperature |
| Histogram | Counts observations in configurable buckets | request_duration_seconds | Latency distributions, response sizes |
| Summary | Similar to histogram but calculates quantiles client-side | rpc_duration_seconds | Pre-calculated percentiles |

# Counter: Use rate() to get per-second request rate
rate(http_requests_total[5m])

# Gauge: Direct value — how much memory right now?
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Histogram: Calculate the 95th percentile latency
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))

# Counter: Total errors in the last hour
increase(http_errors_total[1h])
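To build intuition for what rate() does with a counter, here is a simplified sketch in Python. It is not how Prometheus computes rate() internally (Prometheus also extrapolates to the window boundaries), but it captures the core idea: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the window length.

```python
def counter_rate(samples):
    """Per-second rate over (timestamp_s, value) counter samples.

    A drop between samples means the counter reset (e.g. process
    restart), so we count the new value as the increase from zero.
    """
    increase = 0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1  # reset: restart from 0
    window = samples[-1][0] - samples[0][0]
    return increase / window

# Counter scraped every 15s; the process restarts at t=30
samples = [(0, 100), (15, 250), (30, 40), (45, 190)]
print(counter_rate(samples))  # (150 + 40 + 150) / 45 requests per second
```

This reset handling is why you should always wrap counters in rate() or increase() instead of graphing them raw.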

Structured vs Unstructured Logs

Unstructured logs are human-readable but machine-hostile. Structured logs are both.

# Unstructured (hard to parse, hard to search):
[2025-05-10 14:32:07] ERROR: Failed to process order #4521 for user john@example.com - Payment gateway timeout after 30s

# Structured JSON (easy to query, filter, aggregate):
{
  "timestamp": "2025-05-10T14:32:07Z",
  "level": "error",
  "message": "Failed to process order",
  "service": "order-service",
  "order_id": 4521,
  "user_email": "john@example.com",
  "error_type": "payment_gateway_timeout",
  "timeout_seconds": 30,
  "trace_id": "abc123def456"
}

With structured logs, you can query your log aggregator for all errors from the order-service where error_type = payment_gateway_timeout in the last hour. With unstructured logs, you are writing regex and praying.
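Emitting structured logs does not require a special library. Here is a minimal sketch using only Python's standard logging module; the JsonFormatter class and the "fields" convention are ad-hoc choices for illustration (in production you would more likely reach for structlog or python-json-logger):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "order-service",
        }
        # Merge any structured fields attached via extra={"fields": ...}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Failed to process order",
             extra={"fields": {"order_id": 4521,
                               "error_type": "payment_gateway_timeout"}})
```

Every line this logger emits is valid JSON, so a log aggregator can index each field without any parsing rules.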

Google's Four Golden Signals

Google's SRE book defines four signals that matter most for any user-facing system. If you can only monitor four things, monitor these:

| Signal | What It Measures | Example Metric | Alert When |
|---|---|---|---|
| Latency | Time to serve a request | request_duration_seconds | P99 > 500ms for 5 min |
| Traffic | Demand on the system | http_requests_total | Rate drops > 50% suddenly |
| Errors | Rate of failed requests | http_errors_total / http_requests_total | Error rate > 1% for 5 min |
| Saturation | How full your system is | cpu_usage, memory_usage, disk_usage | Disk > 85%, CPU > 90% sustained |

# Golden Signals in PromQL:

# 1. Latency — 99th percentile response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# 2. Traffic — requests per second
sum(rate(http_requests_total[5m]))

# 3. Errors — error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# 4. Saturation — CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

RED and USE Methods

Two complementary frameworks for monitoring different layers of your stack:

RED Method (for request-driven services like APIs and web apps):

  • Rate — requests per second
  • Errors — failed requests per second
  • Duration — distribution of request latency

USE Method (for infrastructure resources like CPU, memory, network):

  • Utilization — percentage of resource being used
  • Saturation — work that is queued or waiting
  • Errors — count of error events

Apply RED to your services:
API Gateway → Rate: 2,400 rps | Errors: 0.3% | Duration: P99 120ms
Order Service → Rate: 800 rps | Errors: 1.2% | Duration: P99 340ms ← Problem!
Payment Service → Rate: 200 rps | Errors: 0.1% | Duration: P99 89ms

Apply USE to your infrastructure:
CPU → Utilization: 72% | Saturation: 0 runqueue | Errors: 0
Memory → Utilization: 85% | Saturation: 0 swap | Errors: 0
Disk IO → Utilization: 45% | Saturation: 12 iowait | Errors: 2 ← Problem!
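A RED dashboard like the one above boils down to a simple threshold check per service. This sketch uses the example numbers from above; the 1% error budget and 300ms P99 budget are illustrative thresholds, not recommendations for your system:

```python
# Hypothetical per-service RED readings, matching the example above
services = {
    "api-gateway":     {"rate_rps": 2400, "error_pct": 0.3, "p99_ms": 120},
    "order-service":   {"rate_rps": 800,  "error_pct": 1.2, "p99_ms": 340},
    "payment-service": {"rate_rps": 200,  "error_pct": 0.1, "p99_ms": 89},
}

def red_problems(services, max_error_pct=1.0, max_p99_ms=300):
    """Flag services breaching the (assumed) error-rate or latency budget."""
    return [name for name, red in services.items()
            if red["error_pct"] > max_error_pct or red["p99_ms"] > max_p99_ms]

print(red_problems(services))  # ['order-service']
```

Only order-service trips the check here, on both error rate and latency, which matches where the arrow points in the example.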

Monitoring vs Observability

These terms are not interchangeable:

Monitoring is collecting predefined data to detect known failure modes. You decide in advance what to measure. "Alert me if CPU exceeds 90%."

Observability is the ability to ask arbitrary questions about your system's internal state from its external outputs. "Why did checkout latency spike for users in Europe between 2 and 3 PM yesterday?"

Monitoring answers: "Is anything broken?" Observability answers: "What broke, where, and why?"

You need both. Monitoring catches the obvious problems. Observability helps you debug the subtle ones.

SLOs, SLIs, and SLAs

These three acronyms form the foundation of reliability engineering:

SLI (Service Level Indicator):
A measurement of service behavior.
Example: "The proportion of requests served in under 300ms"
Formula: (requests < 300ms) / (total requests) = 99.2%

SLO (Service Level Objective):
A target value for an SLI.
Example: "99.5% of requests should complete in under 300ms"
This is an INTERNAL target your team sets.

SLA (Service Level Agreement):
A contract with consequences.
Example: "If uptime drops below 99.9%, customer gets service credits"
This is an EXTERNAL promise with financial penalties.

The relationship: SLI measures it, SLO sets the target, SLA makes it a contract. Always set your SLO tighter than your SLA so you have a buffer before you owe customers money.
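The buffer between SLO and SLA is easy to quantify as an error budget: the amount of unreliability your objective permits. A quick sketch, using the 99.9% SLA and a hypothetical tighter 99.5% internal SLO over a 30-day window:

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Downtime allowed by an availability target over a rolling window."""
    return (1 - slo_pct / 100) * window_days * 24 * 60

# External SLA at 99.9%: about 43.2 minutes of downtime per 30 days
print(error_budget_minutes(99.9))

# Internal SLO at 99.5%: 216 minutes per 30 days
print(error_budget_minutes(99.5))
```

Spend the budget deliberately: deploys, experiments, and planned maintenance all draw from it, and when it is exhausted you slow down and focus on reliability.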

Alert Fatigue and How to Avoid It

Alert fatigue kills monitoring programs. When every alert is urgent, no alert is urgent.

# BAD: Alerts that cause fatigue
- alert: HighCPU
  expr: cpu_usage > 80   # Triggers constantly
  for: 1m                # Too short — transient spikes
  labels:
    severity: critical   # Everything is "critical"

# GOOD: Actionable alerts tied to user impact
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m                # Sustained for 5 minutes
  labels:
    severity: warning    # Appropriate severity
  annotations:
    summary: "Error rate above 1% for 5 minutes"
    runbook: "https://wiki.internal/runbooks/high-error-rate"
    dashboard: "https://grafana.internal/d/api-overview"

Rules for healthy alerting:

  1. Every alert must be actionable — if you cannot do anything about it, it is a log entry, not an alert
  2. Every alert must have a runbook link
  3. Use for durations of 5-15 minutes to avoid transient spikes
  4. Separate warning (investigate during business hours) from critical (wake someone up)
  5. Track your alert-to-incident ratio — if less than 50% of alerts result in real incidents, your alerts are too noisy

The Tooling Landscape

| Category | Tools | Notes |
|---|---|---|
| Metrics | Prometheus, Datadog, CloudWatch, InfluxDB | Prometheus is the OSS standard |
| Logs | Loki, Elasticsearch, Splunk, CloudWatch Logs | Loki is the lightweight choice |
| Traces | Jaeger, Zipkin, Tempo, Datadog APM | OpenTelemetry is unifying the standard |
| Dashboards | Grafana, Datadog, Kibana | Grafana works with everything |
| Alerting | Alertmanager, PagerDuty, OpsGenie | Alertmanager pairs with Prometheus |
| All-in-One | Datadog, New Relic, Dynatrace | Expensive but comprehensive |

The open-source stack of Prometheus + Grafana + Loki + Tempo covers all three pillars at zero licensing cost (Grafana's own "LGTM" acronym stands for Loki, Grafana, Tempo, and Mimir, a Prometheus-compatible metrics backend). It is what most startups and mid-size companies run.


Now that you understand what to monitor and why, let us get hands-on. In the next post, we will set up Prometheus and Grafana from scratch and build production-grade dashboards and alerts in 15 minutes.