SRE Chaos Engineering Testing Systems

Fire departments do not wait for a real fire to test their equipment and routes. They run drills — controlled, planned exercises that reveal weaknesses before a real emergency exposes them. Chaos engineering works the same way: teams deliberately inject failures into running systems to discover how they fail before those failures happen on their own at the worst possible time.

What Is Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. The core idea is: if your system will eventually encounter failures — and it will — you want to discover how it responds on your own terms, not during a real emergency.

Netflix popularized chaos engineering with Chaos Monkey, a tool that randomly terminates production instances. To live with it, engineers had to build services that tolerate the loss of any server at any time.

The Four Steps of a Chaos Experiment

Step 1: Define the Steady State

The steady state is the normal, healthy behavior of the system. Define it precisely, using measurable metrics, before the experiment begins. You cannot tell whether an experiment broke something unless you know what "normal" looks like.

Steady State for Checkout Service:
- Error rate: below 0.5%
- p95 latency: below 300ms
- Successful checkouts per minute: 420-480
- SLO: 99.95% availability
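
As a rough illustration, the steady state above can be written as explicit checks, so that "normal" is something a script can test rather than a judgment call. The sketch below is in Python. The Metrics class, its field names, and the sample values are hypothetical; only the thresholds come from the checkout example. The availability SLO is left out because it is measured over a longer window than a single sample.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    """One sample of checkout service metrics (hypothetical field names)."""
    error_rate_pct: float      # failed requests, as a percentage
    p95_latency_ms: float      # 95th percentile latency in milliseconds
    checkouts_per_min: float   # successful checkouts per minute


def steady_state_violations(m: Metrics) -> List[str]:
    """Return the steady-state conditions this sample violates (empty if healthy)."""
    violations = []
    if not m.error_rate_pct < 0.5:
        violations.append("error rate is not below 0.5%")
    if not m.p95_latency_ms < 300:
        violations.append("p95 latency is not below 300ms")
    if not 420 <= m.checkouts_per_min <= 480:
        violations.append("successful checkouts per minute outside 420-480")
    return violations


# Illustrative samples, not real measurements.
healthy = Metrics(error_rate_pct=0.2, p95_latency_ms=240, checkouts_per_min=450)
degraded = Metrics(error_rate_pct=1.4, p95_latency_ms=310, checkouts_per_min=390)

print(steady_state_violations(healthy))   # prints []
print(steady_state_violations(degraded))  # lists all three violations
```

Running the same check before, during, and after the experiment gives a consistent definition of "broken" for the later steps.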

Step 2: Form a Hypothesis

Predict what will happen when you inject a fault. A good hypothesis is specific and falsifiable: it names the fault, the expected response, and measurable thresholds that the experiment can confirm or refute.

Hypothesis: "If the payment service loses connectivity to the primary
             database for 30 seconds, it will fall back to the secondary
             database within 5 seconds, and the checkout error rate
             will stay below 2%."
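
One way to keep a hypothesis falsifiable is to record its thresholds as data and check observations against them. The following sketch assumes the 5-second failover limit and the 2% error-rate ceiling from the hypothesis above. The class names, fields, and observed values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hypothesis:
    """Thresholds taken from the written hypothesis."""
    max_failover_s: float = 5.0        # fall back within 5 seconds
    max_error_rate_pct: float = 2.0    # error rate stays below 2%


@dataclass(frozen=True)
class Observation:
    """What was measured during the experiment (hypothetical fields)."""
    failover_s: Optional[float]   # None if no failover was observed
    peak_error_rate_pct: float


def refutations(h: Hypothesis, o: Observation) -> List[str]:
    """Return the reasons the hypothesis was refuted (empty if it held)."""
    reasons = []
    if o.failover_s is None:
        reasons.append("no failover to the secondary database was observed")
    elif o.failover_s > h.max_failover_s:
        reasons.append(f"failover took {o.failover_s}s, limit is {h.max_failover_s}s")
    if not o.peak_error_rate_pct < h.max_error_rate_pct:
        reasons.append(
            f"peak error rate {o.peak_error_rate_pct}% is not below {h.max_error_rate_pct}%"
        )
    return reasons


hypothesis = Hypothesis()
# Made-up observations: one that supports the hypothesis, one that refutes it.
print(refutations(hypothesis, Observation(failover_s=3.2, peak_error_rate_pct=1.1)))  # prints []
print(refutations(hypothesis, Observation(failover_s=9.0, peak_error_rate_pct=4.5)))  # two reasons
```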

Step 3: Run the Experiment

Inject the fault in a controlled way. Start with the smallest possible blast radius, meaning the scope of users or systems the experiment can affect. Begin in a non-production environment such as development or staging. Once the experiment is well understood and has proven safe there, run it in production during off-peak hours.

Chaos Experiment Blast Radius Escalation:
------------------------------------------
Stage 1: Inject fault in development environment (no user impact)
Stage 2: Inject fault in staging environment (internal users only)
Stage 3: Inject fault in production for 1% of users during low-traffic hours
Stage 4: Inject fault in production for 10% of users during business hours
Stage 5: Inject fault across full production (only after all previous stages pass)
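
The percentage-based stages can be approximated by assigning each user to a stable bucket, so that the same users stay in the experiment between requests and the 1% cohort is a subset of the 10% cohort. This is a minimal sketch of one common approach (hash-based bucketing), not a description of any specific tool. The function names and the experiment label are hypothetical.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> int:
    """Map a user to a stable bucket in the range 0-99 for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_blast_radius(user_id: str, experiment: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return bucket(user_id, experiment) < percent


users = [f"user-{i}" for i in range(10_000)]
stage3 = {u for u in users if in_blast_radius(u, "db-failover", 1)}
stage4 = {u for u in users if in_blast_radius(u, "db-failover", 10)}

print(len(stage3), len(stage4))  # roughly 1% and 10% of the users
print(stage3 <= stage4)          # True: widening the radius keeps earlier users
```

Including the experiment name in the hash means different experiments hit different users, so the same small group does not absorb every fault.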

Step 4: Analyze and Learn

Compare actual behavior during the experiment to the steady state and the hypothesis. If the system behaved as predicted, document that resilience. If it did not, document what broke and create action items to fix it.
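
Part of the analysis is working out how far the system drifted from steady state and how long it took to return. The sketch below assumes error-rate samples collected during the experiment and sorted by time. The sample data and function names are made up, and the 0.5% threshold is taken from the steady-state example.

```python
from typing import List, Optional, Tuple

# (seconds since fault injection, error rate in percent)
Sample = Tuple[float, float]


def peak_error_rate(samples: List[Sample]) -> float:
    """Highest error rate seen during the experiment."""
    return max(rate for _, rate in samples)


def seconds_to_recover(samples: List[Sample], threshold_pct: float = 0.5) -> Optional[float]:
    """Time of the first sample from which the error rate stays below the threshold.

    Returns None if the final samples are still above the threshold
    (or if there are no samples at all).
    """
    recovered_at = None
    for t, rate in samples:
        if rate < threshold_pct:
            if recovered_at is None:
                recovered_at = t   # start of a healthy run
        else:
            recovered_at = None    # unhealthy again, reset
    return recovered_at


# Illustrative data only.
samples = [(0, 0.2), (5, 1.8), (10, 1.1), (15, 0.4), (20, 0.3)]
print(peak_error_rate(samples))     # prints 1.8
print(seconds_to_recover(samples))  # prints 15
```

In this made-up run the peak stays under a 2% ceiling, but recovery takes 15 seconds, which is the kind of gap between prediction and behavior that becomes an action item.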

Types of Failures to Inject

Failure Type         What It Simulates                Example Experiment
-------------------  -------------------------------  --------------------------------------------
Instance failure     A server or container crashes    Kill 20% of API server pods
Network partition    Two services cannot communicate  Block traffic between payment and database
Latency injection    A dependency responds slowly     Add 500ms delay to all calls to auth service
Resource exhaustion  CPU, memory, or disk is full     Fill disk to 95% on logging host
Dependency failure   External service goes down       Return errors from the payment gateway mock
Data corruption      Bad data enters the system       Send malformed payloads to message queue
Clock skew           Servers have different times     Advance clock by 5 minutes on one server
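
To make one row of the table concrete, here is a minimal sketch of latency injection at the application level, assuming the call to the dependency is an ordinary Python function. The decorator name, its parameters, and the check_token stand-in for the auth service are hypothetical. Dedicated chaos tools often inject delay at the network layer instead.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, sleep=time.sleep, rng=random.random):
    """Decorator that delays a call by `delay_s` seconds with the given probability.

    `sleep` and `rng` are parameters so the behavior can be tested without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)          # the injected fault
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.5)            # 500ms, as in the table
def check_token(token: str) -> bool:
    """Hypothetical stand-in for a call to the auth service."""
    return token == "valid"


if __name__ == "__main__":
    start = time.monotonic()
    print(check_token("valid"))         # prints True, after the injected delay
    print(f"call took {time.monotonic() - start:.2f}s")
```

Setting probability below 1.0 delays only a share of calls, which is one way to keep the blast radius small.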

Game Days

A Game Day is a planned chaos exercise where the whole team participates. The team picks a scenario — "what if our primary data center goes offline?" — and runs a structured exercise to test their response. Game Days combine technical chaos experiments with incident response drills.

Game Day Flow:
--------------
[Pre-Game]    Define scenario, steady state, and hypothesis.
              Prepare rollback plans.
              Notify stakeholders.

[Game Day]    Execute chaos experiment.
              On-call team responds as if it were a real incident.
              Observers document what works and what breaks.

[Post-Game]   Debrief discussion.
              Document findings.
              Create action items for discovered weaknesses.

Chaos Engineering Tools

Chaos Monkey (Netflix): Randomly terminates production instances
Gremlin: Commercial platform for scheduling and managing chaos experiments
Litmus: Open-source chaos engineering for Kubernetes
Chaos Toolkit: Open-source framework for writing chaos experiments as code
AWS Fault Injection Service (formerly Fault Injection Simulator): Managed chaos tooling for AWS workloads

When Not to Run Chaos Experiments

Chaos engineering requires discipline. An experiment run at the wrong time can cause a real incident instead of revealing a potential one.

Do not run chaos experiments when the error budget is already low or exhausted.
Do not run production experiments during peak traffic without extensive staging validation first.
Do not run experiments without a tested rollback plan and monitoring in place.
Do not skip the blast radius escalation stages — always start small.
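
The rules above can be turned into a preflight check that runs before any experiment. The sketch below also works through the error-budget arithmetic for the 99.95% SLO used earlier: over a 30-day window, 0.05% of 43,200 minutes is about 21.6 minutes of allowed unavailability. The 30-day window, the 50% minimum-remaining-budget policy, and all function and parameter names are assumptions for illustration, not fixed rules.

```python
from typing import List


def error_budget_minutes(slo_pct: float, window_days: int) -> float:
    """Minutes of unavailability the SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_pct) / 100.0 * window_minutes


def preflight_blockers(
    budget_used_minutes: float,
    budget_total_minutes: float,
    rollback_tested: bool,
    monitoring_in_place: bool,
    earlier_stages_passed: bool,
    min_remaining_fraction: float = 0.5,   # hypothetical policy threshold
) -> List[str]:
    """Return the reasons not to run the experiment (empty means go ahead)."""
    blockers = []
    remaining = 1.0 - budget_used_minutes / budget_total_minutes
    if remaining < min_remaining_fraction:
        blockers.append("error budget is low or exhausted")
    if not rollback_tested:
        blockers.append("no tested rollback plan")
    if not monitoring_in_place:
        blockers.append("monitoring is not in place")
    if not earlier_stages_passed:
        blockers.append("earlier blast radius stages have not passed")
    return blockers


budget = error_budget_minutes(99.95, 30)
print(round(budget, 1))  # prints 21.6

# 18 of 21.6 minutes already spent this window: too little budget left.
print(preflight_blockers(18.0, budget, True, True, True))
```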

Key Points

Chaos engineering discovers system weaknesses on your terms before real incidents find them.
Every experiment follows four steps: define steady state, hypothesize, run, analyze.
Start with the smallest blast radius and escalate only after each stage proves safe.
Game Days combine technical fault injection with full incident response practice.
Never run chaos experiments when the error budget is already depleted.
