Incident Management — On-Call, Runbooks, and Blameless Postmortems

9 min read
Goel Academy
DevOps & Cloud Learning Hub

It is 3 AM. Your phone buzzes. The checkout service is returning 500 errors, revenue is dropping, and the on-call engineer has no idea where to start. There is no runbook, no clear escalation path, and last time this happened the fix was "someone restarted the pod." This is what happens when you treat incident management as an afterthought. In this post, we will build an incident management process from the ground up — one that detects problems fast, resolves them faster, and actually prevents them from recurring.

The Incident Lifecycle

Every incident follows the same lifecycle, whether it is a minor blip or a full production outage:

   Detect   →   Triage    →   Respond   →   Resolve   →   Review
      │            │             │             │            │
      ▼            ▼             ▼             ▼            ▼
  Monitoring    Assess       Diagnose        Fix       Postmortem
   alerts      severity         and          and       and action
    fire      and assign     mitigate       verify        items
                  IC

The goal is not zero incidents — that is impossible. The goal is to minimize the impact and duration of each incident, and to learn from every one.

Severity Levels

A clear severity framework ensures everyone agrees on urgency and response expectations:

| Severity | Description | Examples | Response Time | Update Frequency | Who Is Paged |
|----------|-------------|----------|---------------|------------------|--------------|
| SEV1 | Critical — complete outage or data loss | Site down, data breach, payment processing failed | 5 minutes | Every 15 minutes | On-call + IC + Engineering Manager |
| SEV2 | Major — significant degradation | Checkout slow (>5s), search broken, 10% error rate | 15 minutes | Every 30 minutes | On-call + IC |
| SEV3 | Minor — partial degradation | One region slow, non-critical feature broken, elevated error rate | 30 minutes | Every 2 hours | On-call |
| SEV4 | Low — cosmetic or minor issue | Typo in UI, non-critical job delayed, warning threshold crossed | Next business day | Daily | Ticket assigned |
```yaml
# PagerDuty escalation policy example
escalation_policy:
  name: "Checkout Service"
  repeat_enabled: true
  num_loops: 3

  escalation_rules:
    - escalation_delay_in_minutes: 5
      targets:
        - type: schedule_reference
          id: "primary-oncall"        # Primary on-call engineer

    - escalation_delay_in_minutes: 10
      targets:
        - type: schedule_reference
          id: "secondary-oncall"      # Backup on-call

    - escalation_delay_in_minutes: 15
      targets:
        - type: user_reference
          id: "engineering-manager"   # Engineering manager
```
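To sanity-check an escalation policy, it helps to trace who gets paged if an alert goes unacknowledged. A minimal sketch in Python — a hypothetical in-memory model of the delays above, not the PagerDuty API:

```python
from dataclasses import dataclass

@dataclass
class EscalationRule:
    delay_minutes: int  # how long this target has to ack before escalating
    target: str

# Mirrors the policy above (illustrative names, not real schedule IDs)
RULES = [
    EscalationRule(5, "primary-oncall"),
    EscalationRule(10, "secondary-oncall"),
    EscalationRule(15, "engineering-manager"),
]

def pages_until_ack(ack_after_minutes: int, rules=RULES) -> list[str]:
    """Return the targets paged before the alert is acknowledged.

    The first target is paged immediately; each subsequent target is
    paged only after the previous one's delay elapses without an ack.
    """
    paged, elapsed = [], 0
    for rule in rules:
        paged.append(rule.target)
        elapsed += rule.delay_minutes
        if ack_after_minutes <= elapsed:
            break
    return paged
```

For example, an ack after 7 minutes pages both the primary and the backup, while an ack within 5 minutes stops at the primary — a quick way to verify the delays match your response-time expectations.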

On-Call Best Practices

```yaml
# On-call rotation configuration
on_call_rotation:
  schedule:
    rotation_length: 7 days            # 1-week shifts
    handoff_time: "10:00 AM Monday"    # During business hours
    handoff_meeting: true              # 15-minute sync at handoff

  team_size:
    minimum: 5                   # Minimum team size for a sustainable rotation
    max_on_call_percentage: 25   # No more than 25% of time on-call

  compensation:
    weekday_on_call: "Per company policy"
    weekend_on_call: "Per company policy"
    incident_response: "Time off in lieu"

  expectations:
    acknowledge_time: 5 minutes    # Ack alert within 5 minutes
    response_time: 15 minutes      # Start investigating within 15 minutes
    laptop_required: true          # Must have laptop accessible
    max_alerts_per_shift: 2        # If exceeded, escalate to management

  health:
    max_consecutive_pages: 3   # After 3 pages in one night, next day off
    quarterly_review: true     # Review on-call load every quarter
    alert_noise_budget: 10     # Max non-actionable alerts per week
```
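Thresholds like `max_alerts_per_shift` and `alert_noise_budget` only help if someone actually checks them. A small sketch of a weekly health check in Python, assuming a hypothetical list of `(engineer, actionable)` alert records exported from your paging tool:

```python
from collections import Counter

MAX_ALERTS_PER_SHIFT = 2   # from the expectations above
ALERT_NOISE_BUDGET = 10    # max non-actionable alerts per week

def shift_health_report(alerts):
    """Flag engineers paged beyond the per-shift limit and check the
    team-wide noise budget. `alerts` is a list of (engineer, actionable)."""
    per_engineer = Counter(engineer for engineer, _ in alerts)
    noisy = sum(1 for _, actionable in alerts if not actionable)
    return {
        "over_limit": [e for e, n in per_engineer.items()
                       if n > MAX_ALERTS_PER_SHIFT],
        "non_actionable": noisy,
        "noise_budget_ok": noisy <= ALERT_NOISE_BUDGET,
    }

# Hypothetical week of alerts
alerts = [
    ("alice", True), ("alice", False), ("alice", False),
    ("bob", True),
]
```

Running this in the quarterly review (or weekly, automated) turns the health section from aspiration into an enforced policy.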

On-Call Handoff Template

## On-Call Handoff — Week of July 14

### Outgoing: Alice → Incoming: Bob

### Current State
- All services GREEN
- Deployment freeze lifted as of Friday
- Redis cluster upgraded to 7.2 on Wednesday (monitor for issues)

### Active Issues
- [ ] Intermittent timeout on user-service (ticket #4521, not yet root-caused)
- [ ] Elevated disk usage on db-replica-03 (70%, threshold is 80%)

### Recent Changes
- Monday: Deployed checkout-api v2.14.0 (added retry logic)
- Wednesday: Redis cluster rolling upgrade (7.0 → 7.2)
- Friday: Updated Prometheus alert thresholds for latency SLO

### Runbook Updates
- Updated: runbooks/redis-failover.md (new cluster topology)
- New: runbooks/checkout-retry-debugging.md

### Known Noisy Alerts (safe to snooze)
- `CronJobMissedSchedule` on report-generator (runs at 2AM, sometimes delayed)

Runbook Structure

A runbook is a step-by-step guide for responding to a specific alert or incident. The engineer reading it is probably stressed and sleep-deprived — make it clear, concise, and copy-pasteable.

# Runbook: High Error Rate on Checkout API

## Alert
`HighErrorRate` — checkout-api error rate exceeds 1% for 5 minutes

## Impact
Users cannot complete purchases. Revenue impact: ~$X per minute.

## Quick Diagnosis (< 2 minutes)

1. Check whether the error rate is real or a monitoring artifact:

   ```bash
   kubectl logs -l app=checkout-api --tail=50 -n production | grep -c "ERROR"
   ```

2. Check pod health:

   ```bash
   kubectl get pods -l app=checkout-api -n production
   kubectl top pods -l app=checkout-api -n production
   ```

3. Check recent deployments:

   ```bash
   kubectl rollout history deployment/checkout-api -n production
   ```

4. Check downstream dependencies:

   ```bash
   # Payment gateway
   curl -s https://api.payment-provider.com/health

   # Database
   kubectl exec -it deploy/checkout-api -n production -- \
     pg_isready -h db-primary.internal -p 5432
   ```

## Common Causes and Fixes

### Cause 1: Bad deployment

```bash
# Roll back to the previous version
kubectl rollout undo deployment/checkout-api -n production

# Verify the rollback
kubectl rollout status deployment/checkout-api -n production
```

### Cause 2: Database connection pool exhausted

```bash
# Check active connections
kubectl exec -it deploy/checkout-api -n production -- \
  psql -h db-primary.internal -U app -c "SELECT count(*) FROM pg_stat_activity;"

# Restart pods to reset connection pools
kubectl rollout restart deployment/checkout-api -n production
```

### Cause 3: Payment gateway outage

```bash
# Check the payment gateway status page:
# https://status.payment-provider.com

# If the outage is confirmed, enable the fallback payment processor:
kubectl set env deployment/checkout-api -n production \
  PAYMENT_GATEWAY=fallback
```

## Escalation

If unresolved after 15 minutes, escalate to:

- Payments team lead: @alice (Slack) / +1-555-0123
- Database on-call: @bob (Slack)

Incident Commander Role

For SEV1 and SEV2 incidents, an Incident Commander (IC) takes charge. The IC does not debug — they coordinate.

```yaml
incident_commander:
  responsibilities:
    - Declare the incident and set severity
    - Open the incident Slack channel (#inc-YYYYMMDD-short-description)
    - Assign roles (IC, Communications Lead, Subject Matter Experts)
    - Drive the timeline and decisions
    - Ensure regular status updates are sent
    - Decide when the incident is resolved
    - Schedule the postmortem

  does_not:
    - Debug the issue (that is the SME's job)
    - Write code or run commands
    - Communicate with customers (that is the Comms Lead)

  phrases_to_use:
    - "What is the current theory?"
    - "What are we trying next?"
    - "When will we have an update?"
    - "Do we need to escalate?"
    - "Let's timebox this approach to 10 minutes."
```

Communication Templates

## Internal Status Update (Slack — every 15 min for SEV1)

**Incident: Checkout API 5xx errors**
**Severity:** SEV1
**Status:** Investigating
**IC:** @alice
**Duration:** 23 minutes

**Impact:** ~30% of checkout requests failing. Estimated revenue impact: $X/min.

**Current Theory:** Database connection pool exhausted after traffic spike.

**Actions In Progress:**
- @bob is increasing connection pool size and restarting pods
- @carol is checking if the traffic spike is organic or a bot attack

**Next Update:** 15 minutes or when status changes.

---

## External Status Page Update

**Title:** Checkout Service Degradation
**Status:** Investigating

We are currently investigating reports of errors during checkout.
Some customers may experience failures when completing purchases.
Our engineering team is actively working to resolve this.
We will provide an update within 30 minutes.
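Sending these updates on a regular cadence is easy to forget mid-incident, so many teams script it. A minimal sketch in Python, assuming a Slack incoming webhook; the `incident` dict fields and the `post_update` helper are illustrative, not a real tool's API:

```python
import json
import urllib.request

def format_status_update(incident: dict) -> str:
    """Render an incident dict into the internal status-update template."""
    actions = "\n".join(f"- {a}" for a in incident["actions"])
    return (
        f"*Incident: {incident['title']}*\n"
        f"*Severity:* {incident['severity']}\n"
        f"*Status:* {incident['status']}\n"
        f"*IC:* {incident['ic']}\n"
        f"*Duration:* {incident['duration_minutes']} minutes\n\n"
        f"*Impact:* {incident['impact']}\n\n"
        f"*Actions In Progress:*\n{actions}\n\n"
        f"*Next Update:* 15 minutes or when status changes."
    )

def post_update(webhook_url: str, text: str) -> None:
    """POST the update to a Slack incoming webhook (payload: {"text": ...})."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

During a SEV1, the IC (or a chatbot acting on their behalf) can call `format_status_update` every 15 minutes with the current state, so the template is always filled in consistently.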

Blameless Postmortem Template

# Postmortem: Checkout API Outage — July 19, 2025

## Summary
On July 19 at 14:32 UTC, the checkout API began returning 500 errors
for approximately 30% of requests. The incident lasted 47 minutes and
affected an estimated 2,300 users. Revenue impact: ~$X.

## Timeline (all times UTC)
| Time | Event |
|-------|-------|
| 14:32 | Monitoring detects error rate above 1% threshold |
| 14:33 | PagerDuty alerts primary on-call (Alice) |
| 14:35 | Alice acknowledges alert, begins investigation |
| 14:38 | Alice declares SEV1, opens #inc-20250719-checkout |
| 14:42 | Root cause identified: DB connection pool exhausted |
| 14:45 | Bob increases pool size from 20 to 50, begins rolling restart |
| 14:52 | Pods restarted, error rate dropping |
| 15:05 | Error rate below 0.1%, monitoring confirms recovery |
| 15:19 | Incident resolved, all-clear communicated |

## Root Cause
The database connection pool was configured with a maximum of 20
connections per pod. A marketing campaign drove a 3x traffic spike
starting at 14:15. By 14:32, all pods had exhausted their connection
pools, causing new requests to queue and eventually timeout.

## Contributing Factors
- Connection pool size was set during initial deployment and never revisited
- No alert on connection pool utilization (only on error rate)
- Load testing did not cover 3x traffic scenarios
- Marketing did not notify engineering about the campaign

## What Went Well
- Alert fired within 3 minutes of impact starting
- IC was assigned within 5 minutes
- Root cause was identified in 7 minutes
- Clear runbook existed for database connection issues

## What Went Wrong
- No connection pool utilization monitoring
- Traffic spike was not anticipated
- Rolling restart took 7 minutes (pods have 60s grace period)

## Action Items
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|
| Add connection pool utilization alert (>80%) | Alice | P1 | OPS-1234 |
| Increase default pool size to 50 | Bob | P1 | OPS-1235 |
| Add 3x traffic load test to CI pipeline | Carol | P2 | OPS-1236 |
| Create process for marketing to notify eng of campaigns | Dave | P2 | OPS-1237 |
| Reduce pod grace period from 60s to 15s | Alice | P3 | OPS-1238 |

## Lessons Learned
Connection pool sizing should be reviewed whenever traffic patterns
change. We will add pool utilization to our standard dashboard and
set alerts at 80% utilization.

Incident Metrics

Track these metrics to measure and improve your incident management process:

```yaml
incident_metrics:
  # Mean Time to Detect — how long until we notice
  MTTD:
    definition: "Time from incident start to first alert"
    target: "< 5 minutes for SEV1/SEV2"
    improve_by: "Better monitoring coverage, lower alert thresholds"

  # Mean Time to Acknowledge — how long until someone responds
  MTTA:
    definition: "Time from alert firing to human acknowledgment"
    target: "< 5 minutes for SEV1, < 15 minutes for SEV2"
    improve_by: "On-call process, escalation policies, alert routing"

  # Mean Time to Resolve — how long until it is fixed
  MTTR:
    definition: "Time from incident start to full resolution"
    target: "< 1 hour for SEV1, < 4 hours for SEV2"
    improve_by: "Runbooks, automation, faster rollbacks"

  # Incident frequency — how often incidents happen
  frequency:
    definition: "Number of SEV1/SEV2 incidents per month"
    target: "Trending downward quarter over quarter"
    improve_by: "Postmortem action items, reliability investments"
```
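These definitions translate directly into a calculation over your incident history. A sketch in Python, assuming per-incident timestamps in `HH:MM` UTC; the record format is illustrative, and real data would come from your paging tool's API:

```python
from datetime import datetime
from statistics import mean

def minutes_between(a: str, b: str) -> float:
    """Minutes from time a to time b, both "HH:MM" on the same day."""
    fmt = "%H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).seconds / 60

def incident_means(incidents):
    """Compute MTTD, MTTA, and MTTR in minutes across incidents."""
    return {
        "MTTD": mean(minutes_between(i["start"], i["alert"]) for i in incidents),
        "MTTA": mean(minutes_between(i["alert"], i["ack"]) for i in incidents),
        "MTTR": mean(minutes_between(i["start"], i["resolved"]) for i in incidents),
    }

# Hypothetical records; the first mirrors the postmortem timeline above
incidents = [
    {"start": "14:32", "alert": "14:33", "ack": "14:35", "resolved": "15:19"},
    {"start": "09:00", "alert": "09:04", "ack": "09:10", "resolved": "09:45"},
]
```

Plotting these means per month makes the trends visible; a rising MTTA usually signals on-call process problems, while a rising MTTR points at gaps in runbooks or rollback tooling.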

Building an Incident Culture

The most important part of incident management is not the tools or the templates — it is the culture. A blameless culture means:

  1. Focus on systems, not individuals. "The monitoring did not detect the issue" instead of "Alice did not notice the issue."
  2. Postmortems are mandatory. Every SEV1 and SEV2 gets a postmortem within 5 business days.
  3. Action items are tracked. A postmortem without follow-through is just a writing exercise.
  4. On-call is sustainable. If engineers dread on-call, they will leave. Invest in alert quality, runbooks, and compensation.
  5. Practice incident response. Run game days and chaos engineering exercises so the first real incident is not also the first time the team uses the process.

You can detect, respond to, and learn from incidents. But there is one class of incident that deserves special attention — secrets leaking into the wrong hands. In the next post, we will cover secrets management with HashiCorp Vault, SOPS, and Sealed Secrets.