# Chaos Engineering — Break Your System Before It Breaks You
Netflix famously runs a tool called Chaos Monkey that randomly kills production servers — on purpose. It sounds insane until you realize their systems survived the 2017 AWS S3 outage while half the internet went down. That's chaos engineering: deliberately injecting failure so your systems learn to handle it gracefully.
## What Is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. It was pioneered at Netflix in 2010 when they migrated from data centers to AWS and realized they couldn't predict how cloud failures would cascade.
The core idea is simple:
- **Traditional approach:** wait for failure → scramble to fix → write postmortem
- **Chaos engineering:** inject failure → observe behavior → fix weaknesses → repeat
You're not trying to cause outages. You're trying to discover weaknesses before real incidents expose them at 3 AM on a Saturday.
## Chaos Engineering Principles

The discipline follows five key principles defined by Netflix:

```yaml
chaos_engineering_principles:
  1_build_hypothesis:
    description: "Define what 'steady state' looks like"
    example: "Order throughput stays above 1000 req/s"
  2_vary_real_world_events:
    description: "Inject realistic failures, not contrived ones"
    examples:
      - Server crash
      - Network partition
      - Clock skew
      - Dependency latency spike
  3_run_in_production:
    description: "Staging doesn't reflect real traffic patterns"
    caveat: "Start small, expand blast radius gradually"
  4_automate_and_run_continuously:
    description: "One-off tests go stale; continuous chaos finds regressions"
  5_minimize_blast_radius:
    description: "Start with a single host, not the entire fleet"
    progression: "1 pod → 1 node → 1 AZ → 1 region"
```
## Chaos Monkey and the Simian Army
Netflix built an entire "army" of chaos tools:
| Tool | What It Does |
|---|---|
| Chaos Monkey | Randomly terminates instances in production |
| Latency Monkey | Injects artificial delays in RESTful calls |
| Conformity Monkey | Shuts down instances that don't follow best practices |
| Chaos Gorilla | Simulates an entire AWS Availability Zone going down |
| Chaos Kong | Simulates an entire AWS Region going down |
| Janitor Monkey | Cleans up unused cloud resources |
| Security Monkey | Finds security violations and vulnerable configs |
Most of the Simian Army has since been retired. Netflix's modern successors are Failure Injection Testing (FIT) and the ChAP platform built on top of it, which run more controlled, targeted experiments; in the open-source world, tools like Chaos Toolkit fill a similar role.
## LitmusChaos for Kubernetes

LitmusChaos is a CNCF project that brings chaos engineering natively to Kubernetes:

```yaml
# litmus-chaos-experiment.yaml — Kill a random pod
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=payment-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # total chaos window, in seconds
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between successive pod kills
              value: "10"
            - name: FORCE                  # "false" = graceful pod deletion
              value: "false"
            - name: PODS_AFFECTED_PERC     # kill 50% of matching pods
              value: "50"
        probe:
          - name: check-payment-health
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-api.payments:8080/health
              method:
                get:
                  criteria: ==            # probe passes while /health returns 200
                  responseCode: "200"
            mode: Continuous              # evaluated throughout the chaos window
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
```
Litmus provides 50+ pre-built experiments, a web UI (Litmus Portal), and integrates with CI/CD pipelines so you can run chaos tests before every release.
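Running the engine from the manifest above looks like this. The commands are a sketch assuming the Litmus operator and the `pod-delete` ChaosExperiment CR are already installed; the ChaosResult name follows Litmus's `<engine>-<experiment>` convention:

```shell
# Apply the ChaosEngine defined in the manifest above
kubectl apply -f litmus-chaos-experiment.yaml

# Watch the engine progress (initialized → running → completed)
kubectl get chaosengine payment-chaos -n payments -w

# Litmus records the outcome in a ChaosResult; the verdict is Pass or Fail
kubectl get chaosresult payment-chaos-pod-delete -n payments \
  -o jsonpath='{.status.experimentStatus.verdict}'
```

A `Fail` verdict here means the continuous health probe broke during the kill window, which is exactly the weakness the experiment exists to surface.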
## AWS Fault Injection Simulator

If you're on AWS, FIS (since renamed the AWS Fault Injection Service) is the managed chaos offering. An experiment template is a JSON document describing targets, actions, and stop conditions:
```json
{
  "description": "Simulate AZ degradation for payment service",
  "targets": {
    "paymentInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Service": "payment-api",
        "Environment": "production"
      },
      "selectionMode": "PERCENT(25)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT5M"
      },
      "targets": {
        "Instances": "paymentInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:PaymentErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISRole"
}
```
The `stopConditions` are critical — they act as an automatic kill switch. If your error-rate alarm fires, FIS halts the experiment immediately and reverts actions that support rollback (here, the stopped instances are started again).
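Once the template JSON above is saved to a file, the experiment can be registered and started with the AWS CLI. The file name and the template ID are illustrative:

```shell
# Register the experiment template (the JSON above, saved locally)
aws fis create-experiment-template --cli-input-json file://fis-template.json

# Kick off a run; note the experiment ID in the response
aws fis start-experiment --experiment-template-id EXT1234567890abcdef

# Check status: running, completed, or stopped (stop condition fired)
aws fis list-experiments
```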
## Chaos Experiments Reference
Here's a table of common experiments organized by failure domain:
| Experiment | Category | Tool | Blast Radius | Difficulty |
|---|---|---|---|---|
| Kill random pod | Compute | Litmus, Gremlin | Low | Beginner |
| Network latency (200ms) | Network | tc, Gremlin | Medium | Beginner |
| CPU stress (90%) | Resource | stress-ng, Litmus | Low | Beginner |
| Disk fill (95%) | Resource | Litmus, FIS | Medium | Intermediate |
| DNS failure | Network | CoreDNS config | High | Intermediate |
| AZ failure | Infrastructure | FIS, Chaos Kong | High | Advanced |
| Region failover | Infrastructure | FIS, custom | Critical | Expert |
| Clock skew | Time | Gremlin, toxiproxy | Medium | Intermediate |
| Kafka broker loss | Data | Litmus, custom | High | Advanced |
| Database failover | Data | FIS, manual | High | Advanced |
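As a concrete instance of the network-latency row, `tc` with the `netem` qdisc injects the 200 ms delay directly on a host. The interface name is an assumption and the commands require root:

```shell
# Add 200ms (±50ms jitter) of latency to all egress traffic on eth0
sudo tc qdisc add dev eth0 root netem delay 200ms 50ms

# Verify the qdisc is active
tc qdisc show dev eth0

# Kill switch: remove the fault
sudo tc qdisc del dev eth0 root
```

Because this affects every packet leaving the interface, prefer running it on a single canary host first, per the blast-radius principle.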
## Running Your First Chaos Experiment

Follow the steady-state hypothesis approach:

```bash
# Step 1: Define steady state
# "The payment API returns 200 for /health and p99 latency < 500ms"

# Step 2: Baseline measurement
kubectl top pods -n payments
curl -s -o /dev/null -w "%{time_total}\n" http://payment-api/health

# Step 3: Inject failure — kill 1 of 3 pods
kubectl delete pod payment-api-7d8f9b-x2k4n -n payments

# Step 4: Observe
# - Does the health check still pass?
# - Did latency spike?
# - Did any requests fail?
# - How long until the replacement pod was ready?

# Step 5: Document findings
# "Pod recovery took 45 seconds. During that window, 12 requests
#  got 503 errors because the readiness probe was too aggressive.
#  Action: increase replica count to 4, tune probe timing."
```
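During Steps 3 and 4 it helps to quantify the impact rather than eyeball it. A minimal sketch, assuming the same health endpoint, that counts failed checks across the chaos window (run it in a second terminal before deleting the pod):

```shell
# Poll /health once per second for 60s and count non-200 responses
failures=0
for i in $(seq 1 60); do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 http://payment-api/health)
  [ "$code" = "200" ] || failures=$((failures + 1))
  sleep 1
done
echo "failed checks: ${failures}/60"
```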
## Blast Radius Management

Never go from zero to "kill a region." Use a progressive approach:

```
Chaos Maturity Progression:

Level 0: No chaos testing
    ↓
Level 1: Kill a single pod in staging
    ↓
Level 2: Kill pods in production (single service, low traffic)
    ↓
Level 3: Network faults between services
    ↓
Level 4: Dependency failures (DB, cache, queue)
    ↓
Level 5: AZ failure simulation
    ↓
Level 6: Automated continuous chaos in production
    ↓
Level 7: Region failover testing (Game Days)
```
At each level, ensure you have:
- Observability: You can see the impact in real-time
- Kill switches: You can stop the experiment instantly
- Runbooks: The team knows what to do if things go sideways
- Communication: Stakeholders know a chaos test is running
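For Litmus experiments, the kill switch is built into the engine itself: patching `engineState` to `stop` aborts a running experiment (engine name taken from the earlier manifest):

```shell
# Abort the running chaos experiment immediately
kubectl patch chaosengine payment-chaos -n payments \
  --type merge -p '{"spec":{"engineState":"stop"}}'
```

Whatever tool you use, rehearse the kill switch before the experiment, not during it.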
## Game Days

A game day is a scheduled, team-wide chaos exercise — think of it as a fire drill for your infrastructure:

```
Game Day Agenda (4 hours):

09:00 - Briefing
        - Review systems in scope
        - Confirm kill switches and rollback plans
        - Assign roles: Experimenters, Observers, Incident Commander
09:30 - Experiment 1: Single service pod failure
        - Inject, observe, document
10:30 - Experiment 2: Database failover
        - Trigger RDS failover, measure recovery
11:30 - Experiment 3: Upstream dependency timeout
        - Simulate payment gateway 30s latency
12:00 - Debrief
        - What broke? What held?
        - Action items with owners and deadlines
        - Update runbooks
```
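Experiment 2's database failover can be triggered on a Multi-AZ RDS instance with a forced failover reboot. The instance identifier is an assumption:

```shell
# Force a failover to the standby in the other AZ (Multi-AZ deployments only)
aws rds reboot-db-instance \
  --db-instance-identifier payment-db \
  --force-failover

# Watch recovery: poll until the status returns to "available"
aws rds describe-db-instances --db-instance-identifier payment-db \
  --query 'DBInstances[0].DBInstanceStatus'
```

Time how long your application takes to reconnect after the DNS endpoint flips; that number belongs in the debrief.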
Game days build team confidence and muscle memory. The worst time to practice incident response is during an actual incident.
## When NOT to Do Chaos Engineering
Chaos engineering isn't always appropriate:
❌ DON'T run chaos if:
- You don't have basic monitoring (you can't observe the impact)
- Your system has known, unfixed critical bugs
- You don't have rollback procedures
- It's a peak traffic period (Black Friday, launch day)
- You haven't communicated with stakeholders
- There's no incident response process
- You're doing it to "prove" a system is broken (use load testing instead)
✅ DO run chaos when:
- You have solid observability (metrics, logs, traces)
- You've defined steady-state hypotheses
- Kill switches are tested and working
- The team is available to respond
- You start with the smallest possible blast radius
- Leadership supports the practice
## Gremlin: Managed Chaos Platform

For teams that want chaos engineering without building their own tooling, Gremlin is a commercial option. The commands below illustrate its CLI style; exact flag names, targeting, and halt conditions vary by version and are typically configured through Gremlin's web UI or API:

```bash
# Inject CPU stress (flags are illustrative, not verbatim Gremlin syntax)
gremlin attack cpu \
  --length 300 \
  --percent 90 \
  --targets "tag.service=payment-api" \
  --halt-condition "datadog.metric('payment.error_rate') > 5"

# Network blackhole: drop all traffic to a dependency
gremlin attack network blackhole \
  --length 120 \
  --hostnames "auth-service.internal" \
  --targets "tag.service=payment-api"
```
Gremlin provides a SaaS dashboard, automatic halt conditions, and detailed reports — useful for teams that need to demonstrate compliance with resilience testing requirements.
## Closing Note
Chaos engineering reveals the gap between what you think will happen during a failure and what actually happens. It turns "I hope our system handles this" into "I've seen our system handle this." In the next post, we'll shift focus to DevSecOps — integrating security into your pipeline without slowing down delivery.
