
Chaos Engineering — Break Your System Before It Breaks You

Goel Academy · DevOps & Cloud Learning Hub · 7 min read

Netflix famously runs a tool called Chaos Monkey that randomly kills production servers — on purpose. It sounds insane until you realize their systems survived the 2017 AWS S3 outage while half the internet went down. That's chaos engineering: deliberately injecting failure so your systems learn to handle it gracefully.

What Is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. It was pioneered at Netflix in 2010 when they migrated from data centers to AWS and realized they couldn't predict how cloud failures would cascade.

The core idea is simple:

Traditional approach:  Wait for failure → Scramble to fix → Write postmortem
Chaos engineering: Inject failure → Observe behavior → Fix weaknesses → Repeat

You're not trying to cause outages. You're trying to discover weaknesses before real incidents expose them at 3 AM on a Saturday.
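The inject → observe → fix loop can be sketched in a few lines. This is a toy illustration, not any particular tool's API: `check_steady_state` and `inject_failure` are hypothetical hooks you would wire to your own monitoring and chaos tooling, and the "system" here is just a list of replica names.

```python
import random

# Minimal sketch of the inject -> observe -> fix loop.
# check_steady_state() and inject_failure() are hypothetical hooks;
# here they act on a toy list of instances.

def run_experiment(check_steady_state, inject_failure):
    """One iteration of the loop; never start from a broken baseline."""
    if not check_steady_state():
        return "aborted: system not in steady state"
    inject_failure()
    if check_steady_state():
        return "hypothesis held: system absorbed the failure"
    return "weakness found: fix it before re-running"

# Toy wiring: a service with 3 replicas should survive losing one.
instances = ["pod-a", "pod-b", "pod-c"]
result = run_experiment(
    check_steady_state=lambda: len(instances) >= 2,
    inject_failure=lambda: instances.pop(random.randrange(len(instances))),
)
print(result)  # hypothesis held: system absorbed the failure
```

The aborted-baseline branch matters: if the system is already unhealthy, an experiment tells you nothing and only adds noise to the incident.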

Chaos Engineering Principles

The discipline follows five key principles defined by Netflix:

chaos_engineering_principles:
  1_build_hypothesis:
    description: "Define what 'steady state' looks like"
    example: "Order throughput stays above 1000 req/s"

  2_vary_real_world_events:
    description: "Inject realistic failures, not contrived ones"
    examples:
      - Server crash
      - Network partition
      - Clock skew
      - Dependency latency spike

  3_run_in_production:
    description: "Staging doesn't reflect real traffic patterns"
    caveat: "Start small, expand blast radius gradually"

  4_automate_and_run_continuously:
    description: "One-off tests go stale; continuous chaos finds regressions"

  5_minimize_blast_radius:
    description: "Start with a single host, not the entire fleet"
    progression: "1 pod → 1 node → 1 AZ → 1 region"
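Principle 1 is the most concrete: a steady-state hypothesis is just an assertion over observed metrics. A minimal sketch, using the throughput example above — the sample values are illustrative stand-ins for readings you would pull from Prometheus, CloudWatch, or similar:

```python
# "Steady state" as code: an assertion over a window of metric samples.
# The readings below are illustrative, not from a real system.

STEADY_STATE_MIN_RPS = 1000  # "Order throughput stays above 1000 req/s"

def steady_state_holds(throughput_samples, threshold=STEADY_STATE_MIN_RPS):
    """The hypothesis holds only if every sample in the window clears the bar."""
    return all(rps >= threshold for rps in throughput_samples)

print(steady_state_holds([1240, 1180, 1010, 1105]))  # True: hypothesis held
print(steady_state_holds([1240, 940, 1010]))         # False: weakness found
```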

Chaos Monkey and the Simian Army

Netflix built an entire "army" of chaos tools:

| Tool | What It Does |
| --- | --- |
| Chaos Monkey | Randomly terminates instances in production |
| Latency Monkey | Injects artificial delays in RESTful calls |
| Conformity Monkey | Shuts down instances that don't follow best practices |
| Chaos Gorilla | Simulates an entire AWS Availability Zone going down |
| Chaos Kong | Simulates an entire AWS Region going down |
| Janitor Monkey | Cleans up unused cloud resources |
| Security Monkey | Finds security violations and vulnerable configs |

Modern successors include the open-source Chaos Toolkit and Netflix's internal Failure Injection Testing (FIT) platform, which provide more controlled, targeted experiments.

Litmus Chaos for Kubernetes

LitmusChaos is a CNCF project that brings chaos engineering natively to Kubernetes:

# litmus-chaos-experiment.yaml — Kill a random pod
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=payment-api
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "50"
        probe:
          - name: check-payment-health
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-api.payments:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3

Litmus provides 50+ pre-built experiments, a web UI (Litmus Portal), and integrates with CI/CD pipelines so you can run chaos tests before every release.
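When Litmus runs in CI, the pipeline gate usually keys off the `verdict` field of the resulting ChaosResult resource (`Pass`, `Fail`, or `Awaited`). A sketch of that gate — in a real pipeline you would fetch the JSON with something like `kubectl get chaosresult payment-chaos-pod-delete -o json` (the `<engine>-<experiment>` naming is an assumption here); the payload below is inlined for illustration:

```python
import json

# Gate a release on a Litmus ChaosResult verdict.
# The payload is inlined; in CI it would come from kubectl.
raw = """
{
  "status": {
    "experimentStatus": {
      "phase": "Completed",
      "verdict": "Pass"
    }
  }
}
"""

def chaos_gate(result_json):
    verdict = json.loads(result_json)["status"]["experimentStatus"]["verdict"]
    return verdict == "Pass"

print("ship it" if chaos_gate(raw) else "block the release")
```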

AWS Fault Injection Simulator

If you're on AWS, FIS is the managed chaos service:

{
  "description": "Simulate AZ degradation for payment service",
  "targets": {
    "paymentInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Service": "payment-api",
        "Environment": "production"
      },
      "selectionMode": "PERCENT(25)"
    }
  },
  "actions": {
    "stopInstances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT5M"
      },
      "targets": {
        "Instances": "paymentInstances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789:alarm:PaymentErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789:role/FISRole"
}

The stopConditions are critical — they act as an automatic kill switch. If your error-rate alarm fires, FIS immediately stops the experiment and reverses the actions it can (stopped instances, for example, are started again).
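FIS implements this for you, but the kill-switch pattern itself is simple: poll a guard metric while the fault is active and halt the instant it breaches. A sketch with hypothetical hooks — `get_error_rate` and `halt` would be wired to your metrics store and chaos tool:

```python
# Kill-switch pattern: abort the experiment the moment a guard metric breaches.
# get_error_rate() and halt() are hypothetical hooks.

ERROR_RATE_LIMIT = 5.0  # percent, mirroring the CloudWatch alarm threshold

def run_with_kill_switch(get_error_rate, halt, duration_s=300, poll_s=10):
    for elapsed in range(0, duration_s, poll_s):
        rate = get_error_rate()
        if rate > ERROR_RATE_LIMIT:
            halt()  # stop injecting faults immediately
            return f"halted at t={elapsed}s (error rate {rate:.1f}%)"
        # a real loop would time.sleep(poll_s) here
    return "experiment completed within guardrails"

# Toy run: the error rate breaches the limit on the third poll.
readings = iter([1.2, 3.8, 6.5, 7.1])
print(run_with_kill_switch(lambda: next(readings), halt=lambda: None))
# halted at t=20s (error rate 6.5%)
```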

Chaos Experiments Reference

Here's a table of common experiments organized by failure domain:

| Experiment | Category | Tool | Blast Radius | Difficulty |
| --- | --- | --- | --- | --- |
| Kill random pod | Compute | Litmus, Gremlin | Low | Beginner |
| Network latency (200ms) | Network | tc, Gremlin | Medium | Beginner |
| CPU stress (90%) | Resource | stress-ng, Litmus | Low | Beginner |
| Disk fill (95%) | Resource | Litmus, FIS | Medium | Intermediate |
| DNS failure | Network | CoreDNS config | High | Intermediate |
| AZ failure | Infrastructure | FIS, Chaos Kong | High | Advanced |
| Region failover | Infrastructure | FIS, custom | Critical | Expert |
| Clock skew | Time | Gremlin, toxiproxy | Medium | Intermediate |
| Kafka broker loss | Data | Litmus, custom | High | Advanced |
| Database failover | Data | FIS, manual | High | Advanced |

Running Your First Chaos Experiment

Follow the steady-state hypothesis approach:

# Step 1: Define steady state
# "The payment API returns 200 for /health and p99 latency < 500ms"

# Step 2: Baseline measurement
kubectl top pods -n payments
curl -w "%{time_total}\n" http://payment-api/health

# Step 3: Inject failure — kill 1 of 3 pods
kubectl delete pod payment-api-7d8f9b-x2k4n -n payments

# Step 4: Observe
# - Does the health check still pass?
# - Did latency spike?
# - Did any requests fail?
# - How long until the replacement pod was ready?

# Step 5: Document findings
# "Pod recovery took 45 seconds. During that window, 12 requests
# got 503 errors because readiness probe was too aggressive.
# Action: Increase replica count to 4, tune probe timing."
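Step 4 can be quantified rather than eyeballed. A sketch that replays recorded status codes from the probe loop and turns them into the failure count and recovery window you would put in the findings — the one-probe-per-second cadence and the sample codes are assumptions for illustration:

```python
# Turn probe results from the chaos window into findings numbers.
# The codes and the 1 s probe interval are illustrative assumptions.

codes = [200] * 5 + [503] * 12 + [200] * 8  # recorded during the chaos window

def summarize(status_codes, probe_interval_s=1):
    failed = sum(1 for c in status_codes if c != 200)
    fail_idx = [i for i, c in enumerate(status_codes) if c != 200]
    recovery_s = (fail_idx[-1] + 1) * probe_interval_s if fail_idx else 0
    return {"failed_requests": failed, "recovery_seconds": recovery_s}

print(summarize(codes))  # {'failed_requests': 12, 'recovery_seconds': 17}
```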

Blast Radius Management

Never go from zero to "kill a region." Use a progressive approach:

Chaos Maturity Progression:

Level 0: No chaos testing

Level 1: Kill a single pod in staging

Level 2: Kill pods in production (single service, low traffic)

Level 3: Network faults between services

Level 4: Dependency failures (DB, cache, queue)

Level 5: AZ failure simulation

Level 6: Automated continuous chaos in production

Level 7: Region failover testing (Game Days)

At each level, ensure you have:

  • Observability: You can see the impact in real-time
  • Kill switches: You can stop the experiment instantly
  • Runbooks: The team knows what to do if things go sideways
  • Communication: Stakeholders know a chaos test is running

Game Days

A game day is a scheduled, team-wide chaos exercise — think of it as a fire drill for your infrastructure:

Game Day Agenda (4 hours):

09:00 - Briefing
- Review systems in scope
- Confirm kill switches and rollback plans
- Assign roles: Experimenters, Observers, Incident Commander

09:30 - Experiment 1: Single service pod failure
- Inject, observe, document

10:30 - Experiment 2: Database failover
- Trigger RDS failover, measure recovery

11:30 - Experiment 3: Upstream dependency timeout
- Simulate payment gateway 30s latency

12:00 - Debrief
- What broke? What held?
- Action items with owners and deadlines
- Update runbooks

Game days build team confidence and muscle memory. The worst time to practice incident response is during an actual incident.

When NOT to Do Chaos Engineering

Chaos engineering isn't always appropriate:

❌ DON'T run chaos if:
- You don't have basic monitoring (you can't observe the impact)
- Your system has known, unfixed critical bugs
- You don't have rollback procedures
- It's a peak traffic period (Black Friday, launch day)
- You haven't communicated with stakeholders
- There's no incident response process
- You're doing it to "prove" a system is broken (use load testing instead)

✅ DO run chaos when:
- You have solid observability (metrics, logs, traces)
- You've defined steady-state hypotheses
- Kill switches are tested and working
- The team is available to respond
- You start with the smallest possible blast radius
- Leadership supports the practice
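The checklist above works well as a preflight gate that refuses to start an experiment until every precondition holds. The flag names below are hypothetical; in practice each would be a real check (alarm status, on-call calendar, change-freeze calendar, announcement sent):

```python
# Preflight gate: no chaos until every precondition holds.
# Flag names are hypothetical; wire each to a real check.

def preflight(checks):
    missing = sorted(name for name, ok in checks.items() if not ok)
    return (not missing, missing)

go, blockers = preflight({
    "observability_in_place": True,
    "steady_state_defined": True,
    "kill_switch_tested": True,
    "team_available": True,
    "outside_peak_traffic": True,
    "stakeholders_notified": False,  # announcement not sent yet
})
print("go" if go else f"no-go, fix: {blockers}")  # no-go, fix: ['stakeholders_notified']
```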

Gremlin: Managed Chaos Platform

For teams that want chaos engineering without building tooling:

# Gremlin CLI — inject CPU stress
gremlin attack cpu \
  --length 300 \
  --percent 90 \
  --targets "tag.service=payment-api" \
  --halt-condition "datadog.metric('payment.error_rate') > 5"

# Gremlin CLI — network blackhole (drop all traffic to a dependency)
gremlin attack network blackhole \
  --length 120 \
  --hostnames "auth-service.internal" \
  --targets "tag.service=payment-api"

Gremlin provides a SaaS dashboard, automatic halt conditions, and detailed reports — useful for teams that need to demonstrate compliance with resilience testing requirements.

Closing Note

Chaos engineering reveals the gap between what you think will happen during a failure and what actually happens. It turns "I hope our system handles this" into "I've seen our system handle this." In the next post, we'll shift focus to DevSecOps — integrating security into your pipeline without slowing down delivery.