DevOps Chaos Engineering

Chaos engineering is the discipline of intentionally injecting failures into a system to discover weaknesses before they cause real outages. The idea seems counterintuitive — deliberately breaking things in production. But the alternative is worse: discovering those weaknesses during an unplanned incident at 2 AM, under pressure, with users impacted.

Netflix coined the term and pioneered the practice. Their tool, Chaos Monkey, randomly terminates production servers to ensure their systems can survive unexpected failures.

The Core Principle

Systems fail in unpredictable ways. Hardware breaks, networks partition, services time out, memory leaks accumulate, dependencies go down. Teams that assume their systems are resilient without testing that assumption are guessing. Chaos engineering turns that assumption into proven knowledge.

The hypothesis-driven process:

  1. Define a steady-state hypothesis: "The checkout service handles 500 requests/second with a p99 latency below 200ms."
  2. Introduce a controlled failure: Terminate one of three checkout service replicas.
  3. Observe: Does the system maintain its steady state? Did load balancing redistribute traffic? Did Kubernetes restart the pod?
  4. Learn: If the steady state breaks, a real weakness was found — and can be fixed before an unplanned outage exposes it.
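Step 2 of this process can be expressed declaratively. As a sketch, the Chaos Mesh manifest below (the tool is covered later in this article) terminates exactly one checkout replica; the name, namespace, and label are illustrative:

# Hypothetical Chaos Mesh manifest: terminate exactly one checkout replica
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-kill-one      # illustrative name
  namespace: staging           # start in staging, not production
spec:
  action: pod-kill
  mode: one                    # affect exactly one matching pod
  selector:
    labelSelectors:
      app: checkout-service    # assumed label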

Types of Chaos Experiments

Infrastructure Failures

  • Terminate a random EC2 instance or Kubernetes pod.
  • Shut down an entire availability zone.
  • Fill a server's disk to 100% capacity.
  • Exhaust CPU or memory on a specific node.
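The last bullet maps directly onto a Chaos Mesh StressChaos resource. This is a sketch with assumed names and labels, not a production-ready manifest:

# Hypothetical Chaos Mesh StressChaos: exhaust CPU inside one pod
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-exhaustion
  namespace: staging
spec:
  mode: one
  selector:
    labelSelectors:
      app: webapp              # assumed label
  stressors:
    cpu:
      workers: 4               # number of stress workers
      load: 100                # target CPU load per worker, in percent
  duration: "5m"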

Network Failures

  • Add 200ms latency between the app and database.
  • Simulate 30% packet loss between two services.
  • Block all traffic to an external API dependency.
  • Simulate DNS resolution failures.
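The packet-loss scenario above can be sketched as a Chaos Mesh NetworkChaos resource that drops traffic in one direction between two services; the service names and labels here are assumptions:

# Hypothetical Chaos Mesh NetworkChaos: 30% packet loss toward the database
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-packet-loss
  namespace: staging
spec:
  action: loss
  mode: all
  selector:
    labelSelectors:
      app: orders-service      # assumed source pods
  direction: to                # only affect traffic from source to target
  target:
    mode: all
    selector:
      labelSelectors:
        app: postgres          # assumed database pods
  loss:
    loss: "30"                 # drop 30% of packets
  duration: "3m"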

Application Failures

  • Kill a specific microservice to test circuit breakers.
  • Inject errors into a service's responses (HTTP 500s).
  • Simulate slow responses (timeouts) from a dependency.
  • Corrupt the response of a caching layer.
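Injecting HTTP 500s, as in the second bullet, can be done at the proxy level with a Chaos Mesh HTTPChaos resource. A minimal sketch, assuming a service labeled inventory-service listening on port 8080:

# Hypothetical Chaos Mesh HTTPChaos: rewrite response status codes to 500
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: inventory-500s
  namespace: staging
spec:
  mode: all
  selector:
    labelSelectors:
      app: inventory-service   # assumed label
  target: Response             # tamper with responses, not requests
  port: 8080                   # assumed service port
  replace:
    code: 500                  # force an HTTP 500 status
  duration: "2m"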

State Failures

  • Flush the Redis cache — does the application degrade gracefully or crash?
  • Lose a database replica — does failover work automatically?
  • Corrupt a configuration value — what happens?

The Chaos Engineering Process

  1. Start in staging: Never start chaos experiments in production. Validate the experiment design in a staging environment first.
  2. Define steady state: Establish baseline metrics — what does "healthy" look like? (Request rate, latency, error rate, business metrics).
  3. Form a hypothesis: "When the payment service loses 50% of its instances, the system will continue to process payments with less than 5% error rate due to our load balancer and auto-scaling."
  4. Design the experiment: What failure to inject, at what magnitude, for how long.
  5. Minimize blast radius: Start small. A chaos experiment should affect the minimum necessary scope to test the hypothesis. Limit to one service, one region, or a percentage of traffic.
  6. Run the experiment: Inject the failure. Observe monitoring dashboards continuously.
  7. Document results: Did steady state hold? What happened? What was learned?
  8. Fix weaknesses found: A failed experiment is a success — it found a real gap before an outage did.
  9. Automate and schedule: Once an experiment is validated, run it regularly to prevent regression.
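Step 9 is supported directly by Chaos Mesh through its Schedule resource, which re-runs a validated experiment on a cron schedule. A sketch, with assumed names and labels:

# Hypothetical Chaos Mesh Schedule: re-run a validated pod-kill weekly
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
  namespace: staging
spec:
  schedule: "0 10 * * 1"       # every Monday at 10:00
  historyLimit: 5              # keep the last 5 runs
  concurrencyPolicy: Forbid    # never overlap runs
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: checkout-service  # assumed label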

Chaos Monkey and the Simian Army

Netflix created a suite of chaos tools collectively called the Simian Army:

Tool                 What It Does
Chaos Monkey         Randomly terminates production instances
Latency Monkey       Injects artificial network delays between services
Chaos Gorilla        Simulates an entire AWS availability zone outage
Chaos Kong           Simulates an entire AWS region failure
Conformity Monkey    Shuts down instances that do not follow best practices

Chaos Mesh – Chaos Engineering for Kubernetes

Chaos Mesh is a powerful open-source chaos engineering platform built specifically for Kubernetes. It provides a rich set of failure types via Kubernetes Custom Resources.

# Inject network latency into the payment service pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency-experiment
  namespace: production
spec:
  action: delay
  mode: random-max-percent
  value: "50"              # Affect up to 50% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "5m"           # Run for 5 minutes
---
# Randomly kill up to 30% of checkout pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-kill
  namespace: production
spec:
  action: pod-kill
  mode: random-max-percent
  value: "30"
  selector:
    labelSelectors:
      app: checkout-service
  duration: "2m"

LitmusChaos – Another Kubernetes Chaos Tool

LitmusChaos is a CNCF project offering a library of pre-built chaos experiments called ChaosExperiments. Teams pick from a catalog and run them without writing custom code.
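Running a catalog experiment means attaching it to a target application with a ChaosEngine resource. A sketch using the catalog's pod-delete experiment; the namespace, labels, and service account are assumptions:

# Hypothetical LitmusChaos ChaosEngine running the pod-delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-chaos
  namespace: staging
spec:
  appinfo:
    appns: staging
    applabel: app=checkout-service   # assumed label
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin  # assumed service account
  experiments:
    - name: pod-delete               # experiment from the Litmus catalog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"            # seconds
            - name: CHAOS_INTERVAL
              value: "10"            # seconds between pod deletions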

AWS Fault Injection Simulator (FIS)

AWS Fault Injection Simulator is Amazon's managed chaos engineering service. It integrates natively with EC2, ECS, EKS, RDS, and other AWS services. No external tools needed.

# AWS FIS Experiment Template (simplified)
{
  "description": "Terminate 30% of webapp instances",
  "targets": {
    "webappInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Environment": "production", "App": "webapp" },
      "selectionMode": "PERCENT(30)"
    }
  },
  "actions": {
    "terminateInstances": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "webappInstances" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789:alarm:CriticalErrorRate"
    }
  ]
}

The stopConditions field is critical — if a CloudWatch alarm fires during the experiment, FIS stops automatically. This is the safety net that makes production chaos experiments manageable.

GameDays

A GameDay is a scheduled event where the team intentionally causes failures and practices responding to them. Unlike automated chaos experiments, GameDays are collaborative exercises:

  • The team announces the GameDay in advance — engineers prepare runbooks and monitoring.
  • A "chaos team" injects failures during the event without telling the "operations team" exactly what they're doing.
  • The operations team responds as if it were a real incident — using alerting, dashboards, and runbooks.
  • After the event, the full team conducts a blameless retrospective.

GameDays build confidence, validate documentation, identify gaps in monitoring and runbooks, and improve team coordination under pressure — before a real crisis tests those things.

Building Resilience Through Chaos Engineering

What chaos experiments typically reveal:

  • Missing circuit breakers — one slow service takes down dependent services.
  • Auto-scaling triggers set too conservatively — traffic spikes cause outages before new instances start.
  • No graceful degradation — the entire app fails when one non-critical service is unavailable.
  • Incomplete runbooks — engineers cannot restore service without tribal knowledge.
  • Missing retries — a single transient error causes permanent request failure.
  • Wrong alert thresholds — the team only learns about an outage via customer complaints.

Summary

  • Chaos engineering intentionally injects failures to find weaknesses before real outages do.
  • The process follows: define steady state → hypothesize → minimize blast radius → inject → observe → fix.
  • Always start in staging. Use stop conditions when running experiments in production.
  • Chaos Mesh and LitmusChaos are the leading chaos tools for Kubernetes environments.
  • AWS Fault Injection Simulator provides managed chaos engineering for AWS infrastructure.
  • GameDays build team resilience through structured, practiced incident response exercises.
