
SRE Principles — SLOs, Error Budgets, and Toil Reduction

Goel Academy · DevOps & Cloud Learning Hub · 9 min read

Your service is "up." But is it reliable? Can you quantify exactly how reliable it is? Can you answer whether it is reliable enough for your users, and whether you are spending too much engineering time keeping it that way? Site Reliability Engineering gives you a framework to answer all of these questions with data instead of gut feelings. SRE was born at Google in 2003, and its principles now drive reliability practices at companies of every size.

What Is SRE?

Site Reliability Engineering is a discipline that applies software engineering practices to infrastructure and operations problems. Ben Treynor Sloss, the founder of Google's SRE team, defined it as "what happens when you ask a software engineer to design an operations function."

SRE is not a rebrand of ops. It is a specific set of practices, roles, and cultural norms built around one idea: reliability is a feature, and it should be engineered with the same rigor as any other feature.

SRE vs DevOps

| Aspect | DevOps | SRE |
|--------|--------|-----|
| Origin | Community-driven movement | Google (2003) |
| Focus | Culture, collaboration, automation | Reliability, measurability, engineering |
| Approach | Break down silos between dev and ops | Apply software engineering to ops |
| Key practices | CI/CD, IaC, monitoring, collaboration | SLOs, error budgets, toil reduction, blameless postmortems |
| Metrics | DORA metrics (deploy frequency, lead time) | SLIs, SLOs, error budget burn rate |
| On-call | Shared responsibility | Structured rotations with caps (max 25% of time) |
| Automation | Automate everything | Automate toil; cap toil at 50% of SRE time |
| Relationship | Cultural framework | Concrete implementation of DevOps principles |

DevOps tells you what to do (automate, collaborate, measure). SRE tells you how to do it (SLOs, error budgets, toil budgets).

SLIs — Service Level Indicators

An SLI is a quantitative measure of a specific aspect of the service. It answers the question: "How is our service performing right now?"

```yaml
# Common SLIs and how to measure them

# Availability — proportion of successful requests
# SLI = (total requests - error requests) / total requests
availability:
  good_events: "http_requests_total{status!~'5..'}"
  total_events: "http_requests_total"

# Latency — proportion of requests faster than a threshold
# SLI = requests under 300ms / total requests
latency:
  good_events: "http_request_duration_seconds_bucket{le='0.3'}"
  total_events: "http_request_duration_seconds_count"

# Throughput — requests processed per second
throughput:
  measurement: "rate(http_requests_total[5m])"

# Correctness — proportion of correct responses
correctness:
  good_events: "responses_with_correct_data_total"
  total_events: "responses_total"

# Freshness — proportion of data updated within threshold
freshness:
  good_events: "data_updated_within_threshold_total"
  total_events: "data_freshness_checks_total"
```
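Each ratio-style SLI above boils down to good events divided by total events. A minimal Python sketch of that arithmetic (the function name and sample counts are illustrative):

```python
def ratio_sli(good_events, total_events):
    """Ratio-style SLI: the fraction of events that were 'good'."""
    if total_events == 0:
        # No traffic in the window; a common convention is to report 100%
        return 1.0
    return good_events / total_events

# Availability: non-5xx responses out of all responses
availability = ratio_sli(good_events=9_995_000, total_events=10_000_000)
print(f"{availability:.2%}")  # 99.95%
```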

SLOs — Service Level Objectives

An SLO is a target value for an SLI. It answers the question: "How reliable should our service be?"

The Nines Table

| Availability | Annual Downtime | Monthly Downtime | Weekly Downtime |
|--------------|-----------------|------------------|-----------------|
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.65 hours | 50.4 minutes |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
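The downtime columns follow directly from the availability target. A quick sketch to reproduce them (the table's monthly column assumes an average calendar month of about 30.44 days, so a 30-day rolling window gives slightly smaller numbers):

```python
def allowed_downtime_minutes(availability, window_days):
    """Downtime budget over a window, given an availability target (e.g. 0.999)."""
    return (1 - availability) * window_days * 24 * 60

# Four nines over a year
print(round(allowed_downtime_minutes(0.9999, 365), 1))  # 52.6
# Three nines over a 30-day rolling window
print(round(allowed_downtime_minutes(0.999, 30), 1))    # 43.2
```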
```yaml
# Example SLO document
service: checkout-api
owner: payments-team

slos:
  - name: Availability
    description: "The checkout API returns successful responses"
    sli: "Ratio of non-5xx responses to total responses"
    target: 99.95%
    window: 30 days (rolling)

  - name: Latency (P99)
    description: "99th percentile checkout latency is under 500ms"
    sli: "Ratio of requests completing in under 500ms"
    target: 99.0%
    window: 30 days (rolling)

  - name: Latency (P50)
    description: "Median checkout latency is under 100ms"
    sli: "Ratio of requests completing in under 100ms"
    target: 99.9%
    window: 30 days (rolling)
```

SLAs — Service Level Agreements

An SLA is a business contract that defines consequences when SLOs are not met. SLAs are always less strict than SLOs — your SLO is your internal target, your SLA is your external commitment.

SLI → SLO → SLA relationship:

```text
SLI: "99.97% of requests returned 2xx in the last 30 days"
     (what we measured)

SLO: "99.95% availability over a 30-day rolling window"
     (what we target internally)

SLA: "99.9% availability per calendar month, or customer receives 10% credit"
     (what we promise externally with financial consequences)

Buffer: SLO (99.95%) is stricter than SLA (99.9%).
This gives you time to react before breaching the SLA.
```
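That buffer can be checked mechanically. A hypothetical helper (the function name, thresholds, and status strings are illustrative, not part of any standard API):

```python
def reliability_status(measured, slo=0.9995, sla=0.999):
    """Classify measured availability against the internal SLO and external SLA."""
    if sla >= slo:
        raise ValueError("the SLA must be less strict than the SLO")
    if measured >= slo:
        return "healthy"      # meeting the internal target
    if measured >= sla:
        return "slo-breach"   # inside the buffer: react before the SLA is at risk
    return "sla-breach"       # contractual consequences (e.g. service credits)

print(reliability_status(0.9997))  # healthy
print(reliability_status(0.9992))  # slo-breach
```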

Error Budgets

The error budget is the amount of unreliability you are allowed. If your SLO is 99.95% availability over 30 days, your error budget is 0.05% of total requests — roughly 21.6 minutes of full downtime per 30-day window.

Calculating Error Budgets

```text
# Error budget = 1 - SLO target
# For a 99.95% availability SLO over 30 days:

Error budget = 1 - 0.9995 = 0.0005 (0.05%)

# In minutes: 30 days * 24 hours * 60 minutes * 0.0005 = 21.6 minutes
# In requests: if you serve 10,000,000 requests/month,
#   error budget = 10,000,000 * 0.0005 = 5,000 failed requests allowed
```

Error Budget Burn Rate Alerts

```yaml
# Prometheus alert rules for error budget burn rate
# Based on Google's multi-window, multi-burn-rate approach

groups:
  - name: slo-burn-rate
    rules:
      # Fast burn — consumes 2% of the 30-day budget in 1 hour
      # Alert fires if you will exhaust the budget in ~2 days
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.0005)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "High error budget burn rate"
          description: "Error rate is 14.4x the SLO threshold. Budget will be exhausted in ~2 days."

      # Slow burn — consumes 5% of the 30-day budget in 6 hours
      - alert: SlowErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          ) > (6 * 0.0005)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          ) > (6 * 0.0005)
        for: 15m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: "Slow error budget burn rate"
          description: "Error rate is 6x the SLO threshold. Budget will be exhausted in ~5 days."
```
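The 14.4x and 6x multipliers are not magic numbers. They fall out of one formula: burn rate = (fraction of budget you are willing to lose) x (SLO window / alert window). A sketch of that arithmetic:

```python
def burn_rate_threshold(budget_fraction, window_hours, slo_window_days=30):
    """Burn-rate multiplier that consumes budget_fraction of the SLO-window
    error budget within window_hours."""
    return budget_fraction * (slo_window_days * 24) / window_hours

# Fast burn: 2% of the 30-day budget in 1 hour
print(round(burn_rate_threshold(0.02, 1), 1))  # 14.4
# Slow burn: 5% of the 30-day budget in 6 hours
print(round(burn_rate_threshold(0.05, 6), 1))  # 6.0
```

Multiplying these by the SLO's allowed error rate (0.0005) gives the thresholds used in the alert expressions.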

When the Error Budget Is Exhausted

Error budget policy — what happens when the budget runs out:

1. Feature freeze
- No new features are deployed until the budget is restored
- All engineering effort shifts to reliability improvements

2. Mandatory postmortem review
- Review all incidents that consumed the budget
- Identify systemic issues, not just individual failures

3. Increased testing requirements
- New deployments require additional load testing
- Canary deployment percentage is reduced

4. Stakeholder notification
- Product managers are informed that velocity will slow
- Leadership reviews the reliability roadmap

5. Budget restoration
- Budget is restored by the natural passage of time (rolling window)
- Or by deploying reliability improvements that reduce error rate

Toil — The Enemy of Engineering

Toil is manual, repetitive, automatable work that scales linearly with service growth. SRE caps toil at 50% of an SRE's time — the other 50% must be spent on engineering work that permanently reduces toil or improves reliability.

```yaml
# Toil characteristics — all must be true:
toil_definition:
  manual: "A human runs the task (not automated)"
  repetitive: "It happens over and over, not a one-time task"
  automatable: "A machine could do it (it requires no human judgment)"
  tactical: "It is reactive, not strategic"
  no_lasting_value: "It does not permanently improve the service"
  scales_linearly: "More traffic or more services = more toil"

# Examples of toil:
toil_examples:
  - "Manually restarting a service after an OOM kill"
  - "Copying logs from servers to a shared drive"
  - "Running database migrations by hand"
  - "Manually scaling up instances before a traffic spike"
  - "Rotating certificates by SSH-ing into each server"
  - "Reviewing and approving routine access requests"

# Examples of NOT toil (engineering work):
not_toil:
  - "Building an auto-restart mechanism for OOM kills"
  - "Setting up centralized logging with ELK"
  - "Writing a database migration pipeline"
  - "Implementing autoscaling policies"
  - "Building certificate auto-rotation with cert-manager"
```

Toil Reduction Strategies

```text
# Strategy 1: Eliminate — remove the need entirely
# Before: Manually rotating TLS certificates every 90 days
# After:  cert-manager auto-renews via Let's Encrypt

# Strategy 2: Automate — script the manual process
# Before: SSH into server, check disk usage, clean up old logs
# After:  Cron job with logrotate and disk usage alerting

# Strategy 3: Self-service — let developers handle it themselves
# Before: SRE manually creates DNS records for new services
# After:  ExternalDNS controller auto-creates records from Ingress annotations

# Strategy 4: Reduce frequency — make the task happen less often
# Before: Manually approving every deployment to staging
# After:  Auto-approve staging deploys, require approval only for production
```

SRE Team Models

```text
Model 1: Embedded SRE
┌─────────────────┐
│  Product Team   │
│  5 developers   │
│  1 SRE          │ ← SRE is part of the team
└─────────────────┘
Best for: Small orgs, teams that need full ownership

Model 2: Centralized SRE Team
┌──────────────────┐
│     SRE Team     │
│      8 SREs      │──▶ Supports Team A, Team B, Team C
└──────────────────┘
Best for: Consistent practices, shared infrastructure

Model 3: Consulting / Enabling SRE
┌──────────────────┐
│     SRE Team     │
│      5 SREs      │──▶ Reviews, guides, but does not own services
└──────────────────┘
Teams own their own reliability.
SREs provide tools, frameworks, and reviews.
Best for: Mature orgs where teams can self-serve
```

Implementing SLOs with Prometheus

```yaml
# Prometheus recording rules for SLO tracking
groups:
  - name: slo-recording-rules
    rules:
      # Total requests in the last 30 days
      - record: slo:http_requests:total_30d
        expr: sum(increase(http_requests_total[30d]))

      # Failed requests in the last 30 days
      - record: slo:http_requests:errors_30d
        expr: sum(increase(http_requests_total{status=~"5.."}[30d]))

      # Current availability (30-day rolling)
      - record: slo:availability:ratio_30d
        expr: |
          1 - (
            sum(increase(http_requests_total{status=~"5.."}[30d]))
            / sum(increase(http_requests_total[30d]))
          )

      # Error budget remaining (as a ratio)
      # For a 99.95% SLO:
      - record: slo:error_budget:remaining_ratio
        expr: |
          1 - (
            (1 - slo:availability:ratio_30d)
            / (1 - 0.9995)
          )
```
```promql
# Grafana dashboard queries

# Current availability (should be above 99.95%)
slo:availability:ratio_30d * 100

# Error budget remaining (starts at 100%, decreases toward 0%)
slo:error_budget:remaining_ratio * 100

# One-day burn rate (1 means today's error rate would consume
# exactly the full budget if sustained for the whole 30 days)
(
  sum(increase(http_requests_total{status=~"5.."}[1d]))
  / sum(increase(http_requests_total[1d]))
) / (1 - 0.9995)
```
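The budget-remaining rule reduces to simple arithmetic, shown here in Python (the target value comes from the 99.95% SLO above; sample error ratios are illustrative):

```python
def budget_remaining(error_ratio_30d, slo_target=0.9995):
    """Fraction of the error budget still unspent.

    Mirrors the slo:error_budget:remaining_ratio recording rule:
    1 - (error ratio / allowed error ratio).
    """
    return 1 - error_ratio_30d / (1 - slo_target)

print(round(budget_remaining(0.00025), 2))  # 0.5  (half the budget spent)
print(round(budget_remaining(0.0001), 2))   # 0.8  (a fifth of the budget spent)
```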

Error Budget Policy Document

```markdown
# Error Budget Policy — Checkout API

## Service: checkout-api
## SLO: 99.95% availability (30-day rolling window)
## Owner: payments-team
## Approved by: VP Engineering

### Budget Thresholds and Actions

| Budget Remaining | Status   | Actions |
|------------------|----------|---------|
| > 50%            | Green    | Normal velocity. Ship features. |
| 25% - 50%        | Yellow   | Increase canary duration. Review recent deploys. |
| 10% - 25%        | Orange   | Feature freeze for this service. Focus on reliability. |
| 0% - 10%         | Red      | Full freeze. All hands on reliability. Stakeholder review. |
| Exhausted        | Critical | No deploys until budget is restored. Mandatory postmortem. |

### Review Cadence

- Weekly: SRE team reviews budget status in Monday standup
- Monthly: SLO review with product and engineering leadership
- Quarterly: SLO targets reassessed based on user impact data
```
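The threshold table can be encoded so dashboards, bots, and humans agree on the current status. A sketch using the cutoffs from the policy above (the function name and status strings are illustrative):

```python
def budget_status(remaining_fraction):
    """Map remaining error budget (1.0 = untouched, 0.0 = exhausted) to a status."""
    if remaining_fraction <= 0:
        return "critical"  # exhausted: no deploys, mandatory postmortem
    if remaining_fraction <= 0.10:
        return "red"       # full freeze, all hands on reliability
    if remaining_fraction <= 0.25:
        return "orange"    # feature freeze for this service
    if remaining_fraction <= 0.50:
        return "yellow"    # longer canaries, review recent deploys
    return "green"         # normal velocity, ship features

print(budget_status(0.8))    # green
print(budget_status(0.3))    # yellow
print(budget_status(-0.01))  # critical
```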

Defining SLOs is not a one-time exercise. Start with a target you can actually meet, measure it for a quarter, and then tighten it based on what your users actually need. A 99.99% SLO sounds impressive, but if your users are happy at 99.9%, you are over-investing in reliability at the expense of feature velocity.


You know how to measure reliability. But what happens when things break? In the next post, we will cover incident management — on-call rotations, runbooks, severity levels, and the blameless postmortem process that turns failures into lessons.