Incident Management — On-Call, Runbooks, and Blameless Postmortems

9 min read
Goel Academy
DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes. The checkout service is returning 500 errors, revenue is dropping, and the on-call engineer has no idea where to start. There is no runbook, no clear escalation path, and last time this happened the fix was "someone restarted the pod." This is what happens when you treat incident management as an afterthought. In this post, we will build an incident management process from the ground up — one that detects problems fast, resolves them faster, and actually prevents them from recurring.

The Incident Lifecycle

Every incident follows the same lifecycle, whether it is a minor blip or a full production outage:

   Detect   →   Triage    →   Respond   →   Resolve   →   Review
      │            │             │             │            │
      ▼            ▼             ▼             ▼            ▼
  Monitoring    Assess       Diagnose        Fix       Postmortem
   alerts      severity         and          and       and action
    fire      and assign     mitigate       verify        items
                  IC

The goal is not zero incidents — that is impossible. The goal is to minimize the impact and duration of each incident, and to learn from every one.

Severity Levels

A clear severity framework ensures everyone agrees on urgency and response expectations:

| Severity | Description | Examples | Response Time | Update Frequency | Who Is Paged |
|----------|-------------|----------|---------------|------------------|--------------|
| SEV1 | Critical — complete outage or data loss | Site down, data breach, payment processing failed | 5 minutes | Every 15 minutes | On-call + IC + Engineering Manager |
| SEV2 | Major — significant degradation | Checkout slow (>5s), search broken, 10% error rate | 15 minutes | Every 30 minutes | On-call + IC |
| SEV3 | Minor — partial degradation | One region slow, non-critical feature broken, elevated error rate | 30 minutes | Every 2 hours | On-call |
| SEV4 | Low — cosmetic or minor issue | Typo in UI, non-critical job delayed, warning threshold crossed | Next business day | Daily | Ticket assigned |
```yaml
# PagerDuty escalation policy example
escalation_policy:
  name: "Checkout Service"
  repeat_enabled: true
  num_loops: 3

  escalation_rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule_reference
          id: "primary-oncall"        # Primary on-call engineer

    - escalation_delay_in_minutes: 10
      targets:
        - type: schedule_reference
          id: "secondary-oncall"      # Backup on-call

    - escalation_delay_in_minutes: 15
      targets:
        - type: user_reference
          id: "engineering-manager"   # Engineering manager
```
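To sanity-check an escalation policy, it helps to trace who gets paged if an alert goes unacknowledged. A minimal sketch in Python — a hypothetical in-memory model of the delays above, not the PagerDuty API:

```python
from dataclasses import dataclass

@dataclass
class EscalationRule:
    delay_minutes: int  # how long this target has to ack before escalating
    target: str

# Mirrors the policy above (illustrative names, not real schedule IDs)
RULES = [
    EscalationRule(5, "primary-oncall"),
    EscalationRule(10, "secondary-oncall"),
    EscalationRule(15, "engineering-manager"),
]

def pages_until_ack(ack_after_minutes: int, rules=RULES) -> list[str]:
    """Return the targets paged before the alert is acknowledged.

    The first target is paged immediately; each subsequent target is
    paged only after the previous one's delay elapses without an ack.
    """
    paged, elapsed = [], 0
    for rule in rules:
        paged.append(rule.target)
        elapsed += rule.delay_minutes
        if ack_after_minutes <= elapsed:
            break
    return paged
```

For example, an ack after 7 minutes pages both the primary and the backup, while an ack within 5 minutes stops at the primary — a quick way to verify the delays match your response-time expectations.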

On-Call Best Practices

```yaml
# On-call rotation configuration
on_call_rotation:
  schedule:
    rotation_length: 7 days            # 1-week shifts
    handoff_time: "10:00 AM Monday"    # During business hours
    handoff_meeting: true              # 15-minute sync at handoff

  team_size:
    minimum: 5                   # Minimum team size for a sustainable rotation
    max_on_call_percentage: 25   # No more than 25% of time on-call

  compensation:
    weekday_on_call: "Per company policy"
    weekend_on_call: "Per company policy"
    incident_response: "Time off in lieu"

  expectations:
    acknowledge_time: 5 minutes    # Ack alert within 5 minutes
    response_time: 15 minutes      # Start investigating within 15 minutes
    laptop_required: true          # Must have laptop accessible
    max_alerts_per_shift: 2        # If exceeded, escalate to management

  health:
    max_consecutive_pages: 3   # After 3 pages in one night, next day off
    quarterly_review: true     # Review on-call load every quarter
    alert_noise_budget: 10     # Max non-actionable alerts per week
```
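Thresholds like `max_alerts_per_shift` and `alert_noise_budget` only help if someone actually checks them. A small sketch of a weekly health check in Python, assuming a hypothetical list of `(engineer, actionable)` alert records exported from your paging tool:

```python
from collections import Counter

MAX_ALERTS_PER_SHIFT = 2   # from the expectations above
ALERT_NOISE_BUDGET = 10    # max non-actionable alerts per week

def shift_health_report(alerts):
    """Flag engineers paged beyond the per-shift limit and check the
    team-wide noise budget. `alerts` is a list of (engineer, actionable)."""
    per_engineer = Counter(engineer for engineer, _ in alerts)
    noisy = sum(1 for _, actionable in alerts if not actionable)
    return {
        "over_limit": [e for e, n in per_engineer.items()
                       if n > MAX_ALERTS_PER_SHIFT],
        "non_actionable": noisy,
        "noise_budget_ok": noisy <= ALERT_NOISE_BUDGET,
    }

# Hypothetical week of alerts
alerts = [
    ("alice", True), ("alice", False), ("alice", False),
    ("bob", True),
]
```

Running this in the quarterly review (or weekly, automated) turns the health section from aspiration into an enforced policy.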

On-Call Handoff Template

## On-Call Handoff — Week of July 14

### Outgoing: Alice → Incoming: Bob

### Current State
- All services GREEN
- Deployment freeze lifted as of Friday
- Redis cluster upgraded to 7.2 on Wednesday (monitor for issues)

### Active Issues
- [ ] Intermittent timeout on user-service (ticket #4521, not yet root-caused)
- [ ] Elevated disk usage on db-replica-03 (70%, threshold is 80%)

### Recent Changes
- Monday: Deployed checkout-api v2.14.0 (added retry logic)
- Wednesday: Redis cluster rolling upgrade (7.0 → 7.2)
- Friday: Updated Prometheus alert thresholds for latency SLO

### Runbook Updates
- Updated: runbooks/redis-failover.md (new cluster topology)
- New: runbooks/checkout-retry-debugging.md

### Known Noisy Alerts (safe to snooze)
- `CronJobMissedSchedule` on report-generator (runs at 2AM, sometimes delayed)

Runbook Structure

A runbook is a step-by-step guide for responding to a specific alert or incident. The engineer reading it is probably stressed and sleep-deprived — make it clear, concise, and copy-pasteable.

# Runbook: High Error Rate on Checkout API

## Alert
`HighErrorRate` — checkout-api error rate exceeds 1% for 5 minutes

## Impact
Users cannot complete purchases. Revenue impact: ~$X per minute.

## Quick Diagnosis (< 2 minutes)

1. Check whether the error rate is real or a monitoring artifact:

   ```bash
   kubectl logs -l app=checkout-api --tail=50 -n production | grep -c "ERROR"
   ```

2. Check pod health:

   ```bash
   kubectl get pods -l app=checkout-api -n production
   kubectl top pods -l app=checkout-api -n production
   ```

3. Check recent deployments:

   ```bash
   kubectl rollout history deployment/checkout-api -n production
   ```

4. Check downstream dependencies:

   ```bash
   # Payment gateway
   curl -s https://api.payment-provider.com/health

   # Database
   kubectl exec -it deploy/checkout-api -n production -- \
     pg_isready -h db-primary.internal -p 5432
   ```

## Common Causes and Fixes

### Cause 1: Bad deployment

```bash
# Roll back to the previous version
kubectl rollout undo deployment/checkout-api -n production

# Verify the rollback
kubectl rollout status deployment/checkout-api -n production
```

### Cause 2: Database connection pool exhausted

```bash
# Check active connections
kubectl exec -it deploy/checkout-api -n production -- \
  psql -h db-primary.internal -U app -c "SELECT count(*) FROM pg_stat_activity;"

# Restart pods to reset connection pools
kubectl rollout restart deployment/checkout-api -n production
```

### Cause 3: Payment gateway outage

```bash
# Check the payment gateway status page:
# https://status.payment-provider.com

# If the outage is confirmed, enable the fallback payment processor:
kubectl set env deployment/checkout-api -n production \
  PAYMENT_GATEWAY=fallback
```

## Escalation

If unresolved after 15 minutes, escalate to:

- Payments team lead: @alice (Slack) / +1-555-0123
- Database on-call: @bob (Slack)

Incident Commander Role

For SEV1 and SEV2 incidents, an Incident Commander (IC) takes charge. The IC does not debug — they coordinate.

```yaml
incident_commander:
  responsibilities:
    - Declare the incident and set severity
    - Open the incident Slack channel (#inc-YYYYMMDD-short-description)
    - Assign roles (IC, Communications Lead, Subject Matter Experts)
    - Drive the timeline and decisions
    - Ensure regular status updates are sent
    - Decide when the incident is resolved
    - Schedule the postmortem

  does_not:
    - Debug the issue (that is the SME's job)
    - Write code or run commands
    - Communicate with customers (that is the Comms Lead)

  phrases_to_use:
    - "What is the current theory?"
    - "What are we trying next?"
    - "When will we have an update?"
    - "Do we need to escalate?"
    - "Let's timebox this approach to 10 minutes."
```

Communication Templates

## Internal Status Update (Slack — every 15 min for SEV1)

**Incident: Checkout API 5xx errors**
**Severity:** SEV1
**Status:** Investigating
**IC:** @alice
**Duration:** 23 minutes

**Impact:** ~30% of checkout requests failing. Estimated revenue impact: $X/min.

**Current Theory:** Database connection pool exhausted after traffic spike.

**Actions In Progress:**
- @bob is increasing connection pool size and restarting pods
- @carol is checking if the traffic spike is organic or a bot attack

**Next Update:** 15 minutes or when status changes.

---

## External Status Page Update

**Title:** Checkout Service Degradation
**Status:** Investigating

We are currently investigating reports of errors during checkout.
Some customers may experience failures when completing purchases.
Our engineering team is actively working to resolve this.
We will provide an update within 30 minutes.
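Sending these updates on a regular cadence is easy to forget mid-incident, so many teams script it. A minimal sketch in Python, assuming a Slack incoming webhook; the `incident` dict fields and the `post_update` helper are illustrative, not a real tool's API:

```python
import json
import urllib.request

def format_status_update(incident: dict) -> str:
    """Render an incident dict into the internal status-update template."""
    actions = "\n".join(f"- {a}" for a in incident["actions"])
    return (
        f"*Incident: {incident['title']}*\n"
        f"*Severity:* {incident['severity']}\n"
        f"*Status:* {incident['status']}\n"
        f"*IC:* {incident['ic']}\n"
        f"*Duration:* {incident['duration_minutes']} minutes\n\n"
        f"*Impact:* {incident['impact']}\n\n"
        f"*Actions In Progress:*\n{actions}\n\n"
        f"*Next Update:* 15 minutes or when status changes."
    )

def post_update(webhook_url: str, text: str) -> None:
    """POST the update to a Slack incoming webhook (payload: {"text": ...})."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

During a SEV1, the IC (or a chatbot acting on their behalf) can call `format_status_update` every 15 minutes with the current state, so the template is always filled in consistently.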

Blameless Postmortem Template

# Postmortem: Checkout API Outage — July 19, 2025

## Summary
On July 19 at 14:32 UTC, the checkout API began returning 500 errors
for approximately 30% of requests. The incident lasted 47 minutes and
affected an estimated 2,300 users. Revenue impact: ~$X.

## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 14:32 | Monitoring detects error rate above 1% threshold |
| 14:33 | PagerDuty alerts primary on-call (Alice) |
| 14:35 | Alice acknowledges alert, begins investigation |
| 14:38 | Alice declares SEV1, opens #inc-20250719-checkout |
| 14:42 | Root cause identified: DB connection pool exhausted |
| 14:45 | Bob increases pool size from 20 to 50, begins rolling restart |
| 14:52 | Pods restarted, error rate dropping |
| 15:05 | Error rate below 0.1%, monitoring confirms recovery |
| 15:19 | Incident resolved, all-clear communicated |

## Root Cause
The database connection pool was configured with a maximum of 20
connections per pod. A marketing campaign drove a 3x traffic spike
starting at 14:15. By 14:32, all pods had exhausted their connection
pools, causing new requests to queue and eventually timeout.

## Contributing Factors
- Connection pool size was set during initial deployment and never revisited
- No alert on connection pool utilization (only on error rate)
- Load testing did not cover 3x traffic scenarios
- Marketing did not notify engineering about the campaign

## What Went Well
- Alert fired within 3 minutes of impact starting
- IC was assigned within 5 minutes
- Root cause was identified in 7 minutes
- Clear runbook existed for database connection issues

## What Went Wrong
- No connection pool utilization monitoring
- Traffic spike was not anticipated
- Rolling restart took 7 minutes (pods have 60s grace period)

## Action Items
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|
| Add connection pool utilization alert (>80%) | Alice | P1 | OPS-1234 |
| Increase default pool size to 50 | Bob | P1 | OPS-1235 |
| Add 3x traffic load test to CI pipeline | Carol | P2 | OPS-1236 |
| Create process for marketing to notify eng of campaigns | Dave | P2 | OPS-1237 |
| Reduce pod grace period from 60s to 15s | Alice | P3 | OPS-1238 |

## Lessons Learned
Connection pool sizing should be reviewed whenever traffic patterns
change. We will add pool utilization to our standard dashboard and
set alerts at 80% utilization.

Incident Metrics

Track these metrics to measure and improve your incident management process:

```yaml
incident_metrics:
  # Mean Time to Detect — how long until we notice
  MTTD:
    definition: "Time from incident start to first alert"
    target: "< 5 minutes for SEV1/SEV2"
    improve_by: "Better monitoring coverage, lower alert thresholds"

  # Mean Time to Acknowledge — how long until someone responds
  MTTA:
    definition: "Time from alert firing to human acknowledgment"
    target: "< 5 minutes for SEV1, < 15 minutes for SEV2"
    improve_by: "On-call process, escalation policies, alert routing"

  # Mean Time to Resolve — how long until it is fixed
  MTTR:
    definition: "Time from incident start to full resolution"
    target: "< 1 hour for SEV1, < 4 hours for SEV2"
    improve_by: "Runbooks, automation, faster rollbacks"

  # Incident frequency — how often incidents happen
  frequency:
    definition: "Number of SEV1/SEV2 incidents per month"
    target: "Trending downward quarter over quarter"
    improve_by: "Postmortem action items, reliability investments"
```
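These definitions translate directly into a calculation over your incident history. A sketch in Python, assuming per-incident timestamps in `HH:MM` UTC; the record format is illustrative, and real data would come from your paging tool's API:

```python
from datetime import datetime
from statistics import mean

def minutes_between(a: str, b: str) -> float:
    """Minutes from time a to time b, both "HH:MM" on the same day."""
    fmt = "%H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).seconds / 60

def incident_means(incidents):
    """Compute MTTD, MTTA, and MTTR in minutes across incidents."""
    return {
        "MTTD": mean(minutes_between(i["start"], i["alert"]) for i in incidents),
        "MTTA": mean(minutes_between(i["alert"], i["ack"]) for i in incidents),
        "MTTR": mean(minutes_between(i["start"], i["resolved"]) for i in incidents),
    }

# Hypothetical records; the first mirrors the postmortem timeline above
incidents = [
    {"start": "14:32", "alert": "14:33", "ack": "14:35", "resolved": "15:19"},
    {"start": "09:00", "alert": "09:04", "ack": "09:10", "resolved": "09:45"},
]
```

Plotting these means per month makes the trends visible; a rising MTTA usually signals on-call process problems, while a rising MTTR points at gaps in runbooks or rollback tooling.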

Building an Incident Culture

The most important part of incident management is not the tools or the templates — it is the culture. A blameless culture means:

  1. Focus on systems, not individuals. "The monitoring did not detect the issue" instead of "Alice did not notice the issue."
  2. Postmortems are mandatory. Every SEV1 and SEV2 gets a postmortem within 5 business days.
  3. Action items are tracked. A postmortem without follow-through is just a writing exercise.
  4. On-call is sustainable. If engineers dread on-call, they will leave. Invest in alert quality, runbooks, and compensation.
  5. Practice incident response. Run game days and chaos engineering exercises so the first real incident is not also the first time the team uses the process.

You can detect, respond to, and learn from incidents. But there is one class of incident that deserves special attention — secrets leaking into the wrong hands. In the next post, we will cover secrets management with HashiCorp Vault, SOPS, and Sealed Secrets.