
Chaos Engineering — Break Your System Before It Breaks You

Goel Academy · DevOps & Cloud Learning Hub · 7 min read

Netflix famously runs a tool called Chaos Monkey that randomly kills production servers — on purpose. It sounds insane until you realize their systems survived the 2017 AWS S3 outage while half the internet went down. That's chaos engineering: deliberately injecting failure so your systems learn to handle it gracefully.

What Is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. It was pioneered at Netflix in 2010 when they migrated from data centers to AWS and realized they couldn't predict how cloud failures would cascade.

The core idea is simple:

Traditional approach:  Wait for failure → Scramble to fix → Write postmortem
Chaos engineering: Inject failure → Observe behavior → Fix weaknesses → Repeat

You're not trying to cause outages. You're trying to discover weaknesses before real incidents expose them at 3 AM on a Saturday.
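The inject → observe → fix loop can be sketched in a few lines. This is a toy illustration, not any particular tool's API: `check_steady_state` and `inject_failure` are hypothetical hooks you would wire to your own monitoring and chaos tooling, and the "system" here is just a list of replica names.

```python
import random

# Minimal sketch of the inject -> observe -> fix loop.
# check_steady_state() and inject_failure() are hypothetical hooks;
# here they act on a toy list of instances.

def run_experiment(check_steady_state, inject_failure):
    """One iteration of the loop; never start from a broken baseline."""
    if not check_steady_state():
        return "aborted: system not in steady state"
    inject_failure()
    if check_steady_state():
        return "hypothesis held: system absorbed the failure"
    return "weakness found: fix it before re-running"

# Toy wiring: a service with 3 replicas should survive losing one.
instances = ["pod-a", "pod-b", "pod-c"]
result = run_experiment(
    check_steady_state=lambda: len(instances) >= 2,
    inject_failure=lambda: instances.pop(random.randrange(len(instances))),
)
print(result)  # hypothesis held: system absorbed the failure
```

The aborted-baseline branch matters: if the system is already unhealthy, an experiment tells you nothing and only adds noise to the incident.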

Chaos Engineering Principles

The discipline follows five key principles defined by Netflix:

chaos_engineering_principles:
  1_build_hypothesis:
    description: "Define what 'steady state' looks like"
    example: "Order throughput stays above 1000 req/s"

  2_vary_real_world_events:
    description: "Inject realistic failures, not contrived ones"
    examples:
      - Server crash
      - Network partition
      - Clock skew
      - Dependency latency spike

  3_run_in_production:
    description: "Staging doesn't reflect real traffic patterns"
    caveat: "Start small, expand blast radius gradually"

  4_automate_and_run_continuously:
    description: "One-off tests go stale; continuous chaos finds regressions"

  5_minimize_blast_radius:
    description: "Start with a single host, not the entire fleet"
    progression: "1 pod → 1 node → 1 AZ → 1 region"
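Principle 1 is the most concrete: a steady-state hypothesis is just an assertion over observed metrics. A minimal sketch, using the throughput example above — the sample values are illustrative stand-ins for readings you would pull from Prometheus, CloudWatch, or similar:

```python
# "Steady state" as code: an assertion over a window of metric samples.
# The readings below are illustrative, not from a real system.

STEADY_STATE_MIN_RPS = 1000  # "Order throughput stays above 1000 req/s"

def steady_state_holds(throughput_samples, threshold=STEADY_STATE_MIN_RPS):
    """The hypothesis holds only if every sample in the window clears the bar."""
    return all(rps >= threshold for rps in throughput_samples)

print(steady_state_holds([1240, 1180, 1010, 1105]))  # True: hypothesis held
print(steady_state_holds([1240, 940, 1010]))         # False: weakness found
```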

Chaos Monkey and the Simian Army

Netflix built an entire "army" of chaos tools:

| Tool | What It Does |
| --- | --- |
| Chaos Monkey | Randomly terminates instances in production |
| Latency Monkey | Injects artificial delays in RESTful calls |
| Conformity Monkey | Shuts down instances that don't follow best practices |
| Chaos Gorilla | Simulates an entire AWS Availability Zone going down |
| Chaos Kong | Simulates an entire AWS Region going down |
| Janitor Monkey | Cleans up unused cloud resources |
| Security Monkey | Finds security violations and vulnerable configs |

Modern successors include the open-source Chaos Toolkit and Netflix's internal Failure Injection Testing (FIT) platform, which provide more controlled, targeted experiments.

Litmus Chaos for Kubernetes

LitmusChaos is a CNCF project that brings chaos engineering natively to Kubernetes:

# litmus-chaos-experiment.yaml — Kill a random pod
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=payment-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "50"
        probe:
          - name: check-payment-health
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-api.payments:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3

Litmus provides 50+ pre-built experiments, a web UI (Litmus Portal), and integrates with CI/CD pipelines so you can run chaos tests before every release.
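When Litmus runs in CI, the pipeline gate usually keys off the `verdict` field of the resulting ChaosResult resource (`Pass`, `Fail`, or `Awaited`). A sketch of that gate — in a real pipeline you would fetch the JSON with something like `kubectl get chaosresult payment-chaos-pod-delete -o json` (the `<engine>-<experiment>` naming is an assumption here); the payload below is inlined for illustration:

```python
import json

# Gate a release on a Litmus ChaosResult verdict.
# The payload is inlined; in CI it would come from kubectl.
raw = """
{
  "status": {
    "experimentStatus": {
      "phase": "Completed",
      "verdict": "Pass"
    }
  }
}
"""

def chaos_gate(result_json):
    verdict = json.loads(result_json)["status"]["experimentStatus"]["verdict"]
    return verdict == "Pass"

print("ship it" if chaos_gate(raw) else "block the release")
```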

AWS Fault Injection Simulator

If you're on AWS, FIS is the managed chaos service:

{
  "description": "Simulate AZ degradation for payment service",
  "targets": {
    "paymentInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Service": "payment-api",
        "Environment": "production"
      },
      "selectionMode": "PERCENT(25)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT5M"
      },
      "targets": {
        "Instances": "paymentInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789:alarm:PaymentErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISRole"
}

The stopConditions are critical — they act as an automatic kill switch. If your error-rate alarm fires, FIS immediately stops the experiment and reverses the actions it can (stopped instances, for example, are started again).
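FIS implements this for you, but the kill-switch pattern itself is simple: poll a guard metric while the fault is active and halt the instant it breaches. A sketch with hypothetical hooks — `get_error_rate` and `halt` would be wired to your metrics store and chaos tool:

```python
# Kill-switch pattern: abort the experiment the moment a guard metric breaches.
# get_error_rate() and halt() are hypothetical hooks.

ERROR_RATE_LIMIT = 5.0  # percent, mirroring the CloudWatch alarm threshold

def run_with_kill_switch(get_error_rate, halt, duration_s=300, poll_s=10):
    for elapsed in range(0, duration_s, poll_s):
        rate = get_error_rate()
        if rate > ERROR_RATE_LIMIT:
            halt()  # stop injecting faults immediately
            return f"halted at t={elapsed}s (error rate {rate:.1f}%)"
        # a real loop would time.sleep(poll_s) here
    return "experiment completed within guardrails"

# Toy run: the error rate breaches the limit on the third poll.
readings = iter([1.2, 3.8, 6.5, 7.1])
print(run_with_kill_switch(lambda: next(readings), halt=lambda: None))
# halted at t=20s (error rate 6.5%)
```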

Chaos Experiments Reference

Here's a table of common experiments organized by failure domain:

| Experiment | Category | Tool | Blast Radius | Difficulty |
| --- | --- | --- | --- | --- |
| Kill random pod | Compute | Litmus, Gremlin | Low | Beginner |
| Network latency (200ms) | Network | tc, Gremlin | Medium | Beginner |
| CPU stress (90%) | Resource | stress-ng, Litmus | Low | Beginner |
| Disk fill (95%) | Resource | Litmus, FIS | Medium | Intermediate |
| DNS failure | Network | CoreDNS config | High | Intermediate |
| AZ failure | Infrastructure | FIS, Chaos Kong | High | Advanced |
| Region failover | Infrastructure | FIS, custom | Critical | Expert |
| Clock skew | Time | Gremlin, toxiproxy | Medium | Intermediate |
| Kafka broker loss | Data | Litmus, custom | High | Advanced |
| Database failover | Data | FIS, manual | High | Advanced |

Running Your First Chaos Experiment

Follow the steady-state hypothesis approach:

# Step 1: Define steady state
# "The payment API returns 200 for /health and p99 latency < 500ms"

# Step 2: Baseline measurement
kubectl top pods -n payments
curl -w "%{time_total}\n" http://payment-api/health

# Step 3: Inject failure — kill 1 of 3 pods
kubectl delete pod payment-api-7d8f9b-x2k4n -n payments

# Step 4: Observe
# - Does the health check still pass?
# - Did latency spike?
# - Did any requests fail?
# - How long until the replacement pod was ready?

# Step 5: Document findings
# "Pod recovery took 45 seconds. During that window, 12 requests
# got 503 errors because readiness probe was too aggressive.
# Action: Increase replica count to 4, tune probe timing."
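Step 4 can be quantified rather than eyeballed. A sketch that replays recorded status codes from the probe loop and turns them into the failure count and recovery window you would put in the findings — the one-probe-per-second cadence and the sample codes are assumptions for illustration:

```python
# Turn probe results from the chaos window into findings numbers.
# The codes and the 1 s probe interval are illustrative assumptions.

codes = [200] * 5 + [503] * 12 + [200] * 8  # recorded during the chaos window

def summarize(status_codes, probe_interval_s=1):
    failed = sum(1 for c in status_codes if c != 200)
    fail_idx = [i for i, c in enumerate(status_codes) if c != 200]
    recovery_s = (fail_idx[-1] + 1) * probe_interval_s if fail_idx else 0
    return {"failed_requests": failed, "recovery_seconds": recovery_s}

print(summarize(codes))  # {'failed_requests': 12, 'recovery_seconds': 17}
```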

Blast Radius Management

Never go from zero to "kill a region." Use a progressive approach:

Chaos Maturity Progression:

Level 0: No chaos testing

Level 1: Kill a single pod in staging

Level 2: Kill pods in production (single service, low traffic)

Level 3: Network faults between services

Level 4: Dependency failures (DB, cache, queue)

Level 5: AZ failure simulation

Level 6: Automated continuous chaos in production

Level 7: Region failover testing (Game Days)

At each level, ensure you have:

  • Observability: You can see the impact in real-time
  • Kill switches: You can stop the experiment instantly
  • Runbooks: The team knows what to do if things go sideways
  • Communication: Stakeholders know a chaos test is running

Game Days

A game day is a scheduled, team-wide chaos exercise — think of it as a fire drill for your infrastructure:

Game Day Agenda (4 hours):

09:00 - Briefing
- Review systems in scope
- Confirm kill switches and rollback plans
- Assign roles: Experimenters, Observers, Incident Commander

09:30 - Experiment 1: Single service pod failure
- Inject, observe, document

10:30 - Experiment 2: Database failover
- Trigger RDS failover, measure recovery

11:30 - Experiment 3: Upstream dependency timeout
- Simulate payment gateway 30s latency

12:00 - Debrief
- What broke? What held?
- Action items with owners and deadlines
- Update runbooks

Game days build team confidence and muscle memory. The worst time to practice incident response is during an actual incident.

When NOT to Do Chaos Engineering

Chaos engineering isn't always appropriate:

❌ DON'T run chaos if:
- You don't have basic monitoring (you can't observe the impact)
- Your system has known, unfixed critical bugs
- You don't have rollback procedures
- It's a peak traffic period (Black Friday, launch day)
- You haven't communicated with stakeholders
- There's no incident response process
- You're doing it to "prove" a system is broken (use load testing instead)

✅ DO run chaos when:
- You have solid observability (metrics, logs, traces)
- You've defined steady-state hypotheses
- Kill switches are tested and working
- The team is available to respond
- You start with the smallest possible blast radius
- Leadership supports the practice
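The checklist above works well as a preflight gate that refuses to start an experiment until every precondition holds. The flag names below are hypothetical; in practice each would be a real check (alarm status, on-call calendar, change-freeze calendar, announcement sent):

```python
# Preflight gate: no chaos until every precondition holds.
# Flag names are hypothetical; wire each to a real check.

def preflight(checks):
    missing = sorted(name for name, ok in checks.items() if not ok)
    return (not missing, missing)

go, blockers = preflight({
    "observability_in_place": True,
    "steady_state_defined": True,
    "kill_switch_tested": True,
    "team_available": True,
    "outside_peak_traffic": True,
    "stakeholders_notified": False,  # announcement not sent yet
})
print("go" if go else f"no-go, fix: {blockers}")  # no-go, fix: ['stakeholders_notified']
```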

Gremlin: Managed Chaos Platform

For teams that want chaos engineering without building tooling:

# Gremlin CLI — inject CPU stress
gremlin attack cpu \
  --length 300 \
  --percent 90 \
  --targets "tag.service=payment-api" \
  --halt-condition "datadog.metric('payment.error_rate') > 5"

# Gremlin CLI — network blackhole (drop all traffic to a dependency)
gremlin attack network blackhole \
  --length 120 \
  --hostnames "auth-service.internal" \
  --targets "tag.service=payment-api"

Gremlin provides a SaaS dashboard, automatic halt conditions, and detailed reports — useful for teams that need to demonstrate compliance with resilience testing requirements.

Closing Note

Chaos engineering reveals the gap between what you think will happen during a failure and what actually happens. It turns "I hope our system handles this" into "I've seen our system handle this." In the next post, we'll shift focus to DevSecOps — integrating security into your pipeline without slowing down delivery.