Monitoring 101 — Metrics, Logs, Traces, and the Golden Signals

7 min read
Goel Academy
DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes with a PagerDuty alert: "CPU usage above 90%." You drag yourself out of bed, SSH into the server, and discover the CPU spike was caused by a log rotation cron job that runs every night. It resolved itself two minutes later. This happens three times a week. You start ignoring alerts. Then one night, the database actually fills up and takes down production. Nobody notices for 47 minutes because the team has learned to silence their phones.

The Three Pillars of Observability

Monitoring tells you when something is wrong. Observability tells you why. Modern systems need both, and they are built on three pillars:

Metrics are numeric measurements collected over time — CPU usage, request count, memory consumption. They are cheap to store, fast to query, and ideal for dashboards and alerts.

Logs are timestamped records of discrete events — an error message, a user login, a database query. They give you the detail that metrics cannot.

Traces follow a single request as it travels through multiple services in a distributed system. They show you where time is being spent and where failures happen across service boundaries.

Request comes in:
┌──────────────────────────────────────────────────────────┐
│ TRACE: order-123                                         │
│                                                          │
│ [API Gateway] ─────► [Order Service] ─────► [Payment]    │
│     12ms                   45ms               230ms      │
│                             │                            │
│                             ▼                            │
│                      [Inventory DB]                      │
│                           89ms                           │
│                                                          │
│ Total: 376ms                                             │
└──────────────────────────────────────────────────────────┘

METRICS tell you: "P99 latency spiked to 800ms"
LOGS tell you: "Payment gateway returned timeout at 14:32:07"
TRACES tell you: "The Payment service call took 720ms out of 800ms total"
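The trace above can be modeled as a list of spans, each recording which service did the work and how long it took. This is a toy sketch, not a real tracing client; the service names and durations are taken from the example diagram, and it assumes the calls run sequentially (so the durations sum to the total).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    service: str
    parent: Optional[str]
    duration_ms: int

# Spans from the example trace order-123 above
spans = [
    Span("api-gateway", None, 12),
    Span("order-service", "api-gateway", 45),
    Span("payment", "order-service", 230),
    Span("inventory-db", "order-service", 89),
]

# The slowest span is where to start debugging
slowest = max(spans, key=lambda s: s.duration_ms)
total_ms = sum(s.duration_ms for s in spans)

print(f"Total: {total_ms}ms, slowest: {slowest.service} ({slowest.duration_ms}ms)")
```

Real tracing systems (Jaeger, Tempo) store exactly this kind of parent/child span tree, plus timestamps, so overlapping and parallel calls are handled correctly.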

Metric Types

Not all metrics are created equal. Understanding the types is critical for writing useful queries.

| Type | Description | Example | Use Case |
|---|---|---|---|
| Counter | Only goes up (resets on restart) | http_requests_total | Request counts, errors, bytes sent |
| Gauge | Goes up and down | memory_usage_bytes | CPU, memory, queue depth, temperature |
| Histogram | Counts observations in configurable buckets | request_duration_seconds | Latency distributions, response sizes |
| Summary | Similar to histogram but calculates quantiles client-side | rpc_duration_seconds | Pre-calculated percentiles |

# Counter: Use rate() to get per-second request rate
rate(http_requests_total[5m])

# Gauge: Direct value — how much memory right now?
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Histogram: Calculate the 95th percentile latency
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))

# Counter: Total errors in the last hour
increase(http_errors_total[1h])
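To build intuition for what rate() does with a counter, here is a simplified sketch in Python. It is not how Prometheus computes rate() internally (Prometheus also extrapolates to the window boundaries), but it captures the core idea: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the window length.

```python
def counter_rate(samples):
    """Per-second rate over (timestamp_s, value) counter samples.

    A drop between samples means the counter reset (e.g. process
    restart), so we count the new value as the increase from zero.
    """
    increase = 0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1  # reset: restart from 0
    window = samples[-1][0] - samples[0][0]
    return increase / window

# Counter scraped every 15s; the process restarts at t=30
samples = [(0, 100), (15, 250), (30, 40), (45, 190)]
print(counter_rate(samples))  # (150 + 40 + 150) / 45 requests per second
```

This reset handling is why you should always wrap counters in rate() or increase() instead of graphing them raw.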

Structured vs Unstructured Logs

Unstructured logs are human-readable but machine-hostile. Structured logs are both.

# Unstructured (hard to parse, hard to search):
[2025-05-10 14:32:07] ERROR: Failed to process order #4521 for user john@example.com - Payment gateway timeout after 30s

# Structured JSON (easy to query, filter, aggregate):
{
  "timestamp": "2025-05-10T14:32:07Z",
  "level": "error",
  "message": "Failed to process order",
  "service": "order-service",
  "order_id": 4521,
  "user_email": "john@example.com",
  "error_type": "payment_gateway_timeout",
  "timeout_seconds": 30,
  "trace_id": "abc123def456"
}

With structured logs, you can query your log aggregator for all errors from the order-service where error_type = payment_gateway_timeout in the last hour. With unstructured logs, you are writing regex and praying.
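Emitting structured logs does not require a special library. Here is a minimal sketch using only Python's standard logging module; the JsonFormatter class and the "fields" convention are ad-hoc choices for illustration (in production you would more likely reach for structlog or python-json-logger):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "order-service",
        }
        # Merge any structured fields attached via extra={"fields": ...}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Failed to process order",
             extra={"fields": {"order_id": 4521,
                               "error_type": "payment_gateway_timeout"}})
```

Every line this logger emits is valid JSON, so a log aggregator can index each field without any parsing rules.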

Google's Four Golden Signals

Google's SRE book defines four signals that matter most for any user-facing system. If you can only monitor four things, monitor these:

| Signal | What It Measures | Example Metric | Alert When |
|---|---|---|---|
| Latency | Time to serve a request | request_duration_seconds | P99 > 500ms for 5 min |
| Traffic | Demand on the system | http_requests_total | Rate drops > 50% suddenly |
| Errors | Rate of failed requests | http_errors_total / http_requests_total | Error rate > 1% for 5 min |
| Saturation | How full your system is | cpu_usage, memory_usage, disk_usage | Disk > 85%, CPU > 90% sustained |

# Golden Signals in PromQL:

# 1. Latency — 99th percentile response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# 2. Traffic — requests per second
sum(rate(http_requests_total[5m]))

# 3. Errors — error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# 4. Saturation — CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

RED and USE Methods

Two complementary frameworks for monitoring different layers of your stack:

RED Method (for request-driven services like APIs and web apps):

  • Rate — requests per second
  • Errors — failed requests per second
  • Duration — distribution of request latency

USE Method (for infrastructure resources like CPU, memory, network):

  • Utilization — percentage of resource being used
  • Saturation — work that is queued or waiting
  • Errors — count of error events

Apply RED to your services:
API Gateway → Rate: 2,400 rps | Errors: 0.3% | Duration: P99 120ms
Order Service → Rate: 800 rps | Errors: 1.2% | Duration: P99 340ms ← Problem!
Payment Service → Rate: 200 rps | Errors: 0.1% | Duration: P99 89ms

Apply USE to your infrastructure:
CPU → Utilization: 72% | Saturation: 0 runqueue | Errors: 0
Memory → Utilization: 85% | Saturation: 0 swap | Errors: 0
Disk IO → Utilization: 45% | Saturation: 12 iowait | Errors: 2 ← Problem!
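A RED dashboard like the one above boils down to a simple threshold check per service. This sketch uses the example numbers from above; the 1% error budget and 300ms P99 budget are illustrative thresholds, not recommendations for your system:

```python
# Hypothetical per-service RED readings, matching the example above
services = {
    "api-gateway":     {"rate_rps": 2400, "error_pct": 0.3, "p99_ms": 120},
    "order-service":   {"rate_rps": 800,  "error_pct": 1.2, "p99_ms": 340},
    "payment-service": {"rate_rps": 200,  "error_pct": 0.1, "p99_ms": 89},
}

def red_problems(services, max_error_pct=1.0, max_p99_ms=300):
    """Flag services breaching the (assumed) error-rate or latency budget."""
    return [name for name, red in services.items()
            if red["error_pct"] > max_error_pct or red["p99_ms"] > max_p99_ms]

print(red_problems(services))  # ['order-service']
```

Only order-service trips the check here, on both error rate and latency, which matches where the arrow points in the example.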

Monitoring vs Observability

These terms are not interchangeable:

Monitoring is collecting predefined data to detect known failure modes. You decide in advance what to measure. "Alert me if CPU exceeds 90%."

Observability is the ability to ask arbitrary questions about your system's internal state from its external outputs. "Why did checkout latency spike for users in Europe between 2 and 3 PM yesterday?"

Monitoring answers: "Is anything broken?" Observability answers: "What broke, where, and why?"

You need both. Monitoring catches the obvious problems. Observability helps you debug the subtle ones.

SLOs, SLIs, and SLAs

These three acronyms form the foundation of reliability engineering:

SLI (Service Level Indicator):
A measurement of service behavior.
Example: "The proportion of requests served in under 300ms"
Formula: (requests < 300ms) / (total requests) = 99.2%

SLO (Service Level Objective):
A target value for an SLI.
Example: "99.5% of requests should complete in under 300ms"
This is an INTERNAL target your team sets.

SLA (Service Level Agreement):
A contract with consequences.
Example: "If uptime drops below 99.9%, customer gets service credits"
This is an EXTERNAL promise with financial penalties.

The relationship: SLI measures it, SLO sets the target, SLA makes it a contract. Always set your SLO tighter than your SLA so you have a buffer before you owe customers money.
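The buffer between SLO and SLA is easy to quantify as an error budget: the amount of unreliability your objective permits. A quick sketch, using the 99.9% SLA and a hypothetical tighter 99.5% internal SLO over a 30-day window:

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Downtime allowed by an availability target over a rolling window."""
    return (1 - slo_pct / 100) * window_days * 24 * 60

# External SLA at 99.9%: about 43.2 minutes of downtime per 30 days
print(error_budget_minutes(99.9))

# Internal SLO at 99.5%: 216 minutes per 30 days
print(error_budget_minutes(99.5))
```

Spend the budget deliberately: deploys, experiments, and planned maintenance all draw from it, and when it is exhausted you slow down and focus on reliability.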

Alert Fatigue and How to Avoid It

Alert fatigue kills monitoring programs. When every alert is urgent, no alert is urgent.

# BAD: Alerts that cause fatigue
- alert: HighCPU
  expr: cpu_usage > 80   # Triggers constantly
  for: 1m                # Too short — transient spikes
  labels:
    severity: critical   # Everything is "critical"

# GOOD: Actionable alerts tied to user impact
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m                # Sustained for 5 minutes
  labels:
    severity: warning    # Appropriate severity
  annotations:
    summary: "Error rate above 1% for 5 minutes"
    runbook: "https://wiki.internal/runbooks/high-error-rate"
    dashboard: "https://grafana.internal/d/api-overview"

Rules for healthy alerting:

  1. Every alert must be actionable — if you cannot do anything about it, it is a log entry, not an alert
  2. Every alert must have a runbook link
  3. Use for durations of 5-15 minutes to avoid transient spikes
  4. Separate warning (investigate during business hours) from critical (wake someone up)
  5. Track your alert-to-incident ratio — if less than 50% of alerts result in real incidents, your alerts are too noisy

The Tooling Landscape

| Category | Tools | Notes |
|---|---|---|
| Metrics | Prometheus, Datadog, CloudWatch, InfluxDB | Prometheus is the OSS standard |
| Logs | Loki, Elasticsearch, Splunk, CloudWatch Logs | Loki is the lightweight choice |
| Traces | Jaeger, Zipkin, Tempo, Datadog APM | OpenTelemetry is unifying the standard |
| Dashboards | Grafana, Datadog, Kibana | Grafana works with everything |
| Alerting | Alertmanager, PagerDuty, OpsGenie | Alertmanager pairs with Prometheus |
| All-in-One | Datadog, New Relic, Dynatrace | Expensive but comprehensive |

The open-source stack of Prometheus + Grafana + Loki + Tempo covers all three pillars at zero licensing cost (Grafana's own "LGTM" acronym stands for Loki, Grafana, Tempo, and Mimir, a Prometheus-compatible metrics backend). It is what most startups and mid-size companies run.


Now that you understand what to monitor and why, let us get hands-on. In the next post, we will set up Prometheus and Grafana from scratch and build production-grade dashboards and alerts in 15 minutes.