SRE Distributed Systems Reliability Patterns

A single store can serve customers well. A chain of a hundred stores in different cities faces entirely different challenges: what happens when one store runs low on stock? What if two stores compete for the same shipment? How do you keep prices consistent across all of them? Distributed systems face similar coordination problems — and reliability patterns are the proven solutions engineers reach for to solve them.

What Makes Distributed Systems Hard

A distributed system is a collection of independent components — servers, services, or processes — that work together to appear as a single system to the user. Distributed systems offer scale and fault tolerance, but they introduce new failure modes that single-server systems never experience.

The 8 Fallacies of Distributed Computing

Engineers who are new to distributed systems often make assumptions that turn out to be wrong. These are called the eight fallacies:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology does not change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Every one of these statements is false in practice. Reliability patterns exist to handle the real world that these fallacies ignore.

Retry with Exponential Backoff

When a request to a downstream service fails due to a transient error (a brief network hiccup, a temporary timeout), retrying the request often succeeds. But retrying immediately and aggressively makes the problem worse — the struggling service receives even more traffic.

Exponential backoff spaces retries with increasing delays, giving the downstream service time to recover.

Retry with Exponential Backoff:
---------------------------------
Attempt 1: Fails at 0ms     → wait 100ms before retry
Attempt 2: Fails at 100ms   → wait 200ms before retry
Attempt 3: Fails at 300ms   → wait 400ms before retry
Attempt 4: Fails at 700ms   → wait 800ms before retry
Attempt 5: Fails at 1,500ms → give up, return error to caller

With jitter (a randomized delay), retries from many clients are spread out over time instead of arriving together as a retry storm.
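The schedule above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: `TransientError`, `backoff_delays`, and `retry_with_backoff` are hypothetical names, the 100 ms base delay and five attempts mirror the example, and the "full jitter" strategy (a random delay between zero and the computed backoff) is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delays(base_ms=100, factor=2, max_attempts=5):
    """Delays in ms to wait after each failed attempt except the last."""
    return [base_ms * factor ** i for i in range(max_attempts - 1)]


def retry_with_backoff(operation, max_attempts=5, base_ms=100, jitter=True,
                       retry_on=(TransientError,),
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is reached."""
    delays = backoff_delays(base_ms, 2, max_attempts)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and return the error to the caller
            delay_ms = delays[attempt - 1]
            if jitter:
                # Full jitter: random delay in [0, delay_ms) so that many
                # clients do not all retry at the same instant.
                delay_ms = rng() * delay_ms
            sleep(delay_ms / 1000)
```

The `sleep` and `rng` parameters are injectable only so the behavior can be tested without real waiting. Errors that are not listed in `retry_on` propagate immediately, since retrying a permanent failure only adds load.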

Circuit Breaker Pattern (Deep Dive)

Introduced in Topic 11, the circuit breaker deserves deeper examination in the context of distributed systems. Without circuit breakers, a slow or failing dependency can hold open thousands of connections, exhausting thread pools and memory across the entire service fleet.

Without Circuit Breaker:
-------------------------
Service A calls Service B (which is slow — 30s timeout)
1,000 requests queued waiting for Service B
Threads exhausted in Service A
Service A becomes slow too
Service C calls Service A and also slows down
CASCADE: entire system degrades from one slow service

With Circuit Breaker:
---------------------
Service B starts failing
Circuit breaker trips after 5 consecutive failures
Service A returns fast errors immediately (no thread wait)
Service B gets no load → recovers faster
After a cool-down, circuit lets a test request through → Service B healthy → circuit closes
Normal operation resumes

Circuit Breaker Configuration

Parameter              What It Controls                          Example Value
Failure threshold      How many failures before circuit opens    5 consecutive errors
Open duration          How long to stay open before testing      30 seconds
Half-open test volume  How many test requests to send            3 requests
Success threshold      Successes needed to close circuit         2 out of 3
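To make the closed, open, and half-open states concrete, here is a rough single-threaded sketch that uses the four parameters from the table as constructor arguments. The class and method names (`CircuitBreaker`, `CircuitOpenError`, `call`) are hypothetical, the sketch is not thread-safe, and real libraries differ in how they count failures and probe for recovery.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=30.0,
                 half_open_tests=3, success_threshold=2,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.half_open_tests = half_open_tests
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.tests_sent = 0
        self.test_successes = 0

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                # Fail fast: no thread waits on the broken dependency.
                raise CircuitOpenError("circuit open; dependency not called")
            # Open duration elapsed: start probing with test requests.
            self.state = "half_open"
            self.tests_sent = 0
            self.test_successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half_open":
            self.tests_sent += 1
            remaining = self.half_open_tests - self.tests_sent
            # Re-open once the success threshold can no longer be reached.
            if self.test_successes + remaining < self.success_threshold:
                self._open()
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _on_success(self):
        if self.state == "half_open":
            self.tests_sent += 1
            self.test_successes += 1
            if self.test_successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # only consecutive failures count
```

A caller would wrap each request to the dependency, for example `breaker.call(lambda: fetch_from_service_b())`, where `fetch_from_service_b` stands in for whatever client function the service already has.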

Bulkhead Pattern

A bulkhead on a ship is a wall that divides the hull into separate compartments. If one compartment floods, the others stay dry. The ship does not sink. The bulkhead pattern applies the same principle to software: isolate components so a failure in one cannot drain resources from the others.

Without Bulkheads:
------------------
Search, Checkout, Profile — all share one thread pool
Slow search requests fill the pool
Checkout requests cannot get a thread
Checkout is broken even though it has no bug

With Bulkheads:
---------------
Search → own thread pool (100 threads)
Checkout → own thread pool (200 threads)
Profile → own thread pool (50 threads)
Slow search uses only its 100 threads
Checkout threads unaffected ✅
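One way to express this isolation in Python is to give each component its own `ThreadPoolExecutor` from the standard library. The sketch below is illustrative only: the `Bulkheads` class and `POOL_SIZES` mapping are hypothetical names, and the pool sizes are simply the numbers from the example, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor

# Pool sizes copied from the example above; illustrative, not tuned.
POOL_SIZES = {"search": 100, "checkout": 200, "profile": 50}


class Bulkheads:
    """One dedicated thread pool per component."""

    def __init__(self, pool_sizes):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size,
                                     thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, component, fn, *args):
        # Work for a component can only occupy that component's threads.
        return self._pools[component].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

With this layout, slow search requests can queue up behind the search pool's threads while `submit("checkout", ...)` still finds a free worker in the checkout pool.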

Timeout Pattern

Every network call in a distributed system must have a timeout. Without timeouts, a request to a failed service waits forever, blocking resources indefinitely. Timeouts bound the worst-case waiting time and allow the calling service to fail fast and try alternatives.

Setting timeouts correctly is an art. Too short: too many false failures on slightly slow responses. Too long: resources blocked too long during real failures. A good starting point is to set the timeout at the 99th percentile latency observed in normal operation multiplied by three.
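As a worked example of that starting point, the following sketch computes a suggested timeout from a list of observed latencies. The function names are hypothetical, the percentile uses the simple nearest-rank method, and the multiplier of three is the rule of thumb from the paragraph above rather than a universal constant.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_timeout_ms(latencies_ms, multiplier=3):
    """Starting-point timeout: p99 latency times a safety multiplier."""
    return percentile(latencies_ms, 99) * multiplier


# 99 requests took 120 ms and one outlier took 900 ms.
observed = [120] * 99 + [900]
timeout_ms = suggested_timeout_ms(observed)  # p99 is 120 ms, so 360 ms
```

The result is only a first guess. It should be revisited as traffic patterns change, and the latency samples should come from normal operation rather than from an incident.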

Idempotency and Safe Retries

An operation is idempotent if performing it multiple times produces the same result as performing it once. When a network request is retried (for example, because the original response was lost in transit), idempotency ensures the retry does not cause duplicate effects.

Non-idempotent operation — DANGEROUS to retry:
  POST /charge-card amount=99.99
  Response lost in network
  Retry: POST /charge-card amount=99.99
  Result: Customer charged twice ❌

Idempotent operation — safe to retry:
  POST /charge-card amount=99.99 idempotency-key=order-7831-txn-1
  Response lost in network
  Retry: POST /charge-card amount=99.99 idempotency-key=order-7831-txn-1
  Result: Server recognizes key, returns same result, no duplicate charge ✅
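On the server side, the idempotency key is typically used to look up a previously stored response before doing any work. The sketch below shows the idea with an in-memory dictionary. `PaymentService` and `charge_card` are hypothetical names, and a real service would keep the keys in durable storage and guard against two concurrent requests carrying the same key.

```python
from decimal import Decimal


class PaymentService:
    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # every charge actually applied

    def charge_card(self, amount, idempotency_key=None):
        if idempotency_key is not None and idempotency_key in self._results:
            # Known key: replay the stored response, apply no new charge.
            return self._results[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount,
                    "charge_id": len(self.charges)}
        if idempotency_key is not None:
            self._results[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge_card(Decimal("99.99"), "order-7831-txn-1")
retry = service.charge_card(Decimal("99.99"), "order-7831-txn-1")
# Both calls return the same response and only one charge is recorded.
```

Without a key, each call is treated as a new charge, which is exactly the dangerous case shown in the first example.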

Graceful Degradation

Graceful degradation means serving a reduced but still useful experience when a component fails, rather than showing an error page. An e-commerce site whose recommendation engine is down should still show the product page — just without personalized recommendations. A core function stays available; a non-critical function degrades.

News website, comment service down:
  Hard failure: "Error: page unavailable" (terrible)
  Graceful degradation: Article loads normally.
                        Comment section shows "Comments unavailable right now."
                        Core reading experience intact ✅
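In code, the news-site example comes down to treating the core call and the optional call differently. The sketch below assumes two hypothetical callables, `fetch_article` and `fetch_comments`, supplied by the caller. A failure in the first is allowed to propagate because there is no page without the article, while a failure in the second is replaced by a fallback message.

```python
def render_article(article_id, fetch_article, fetch_comments):
    # Core function: if this fails there is nothing useful to show,
    # so the error is allowed to propagate.
    article = fetch_article(article_id)

    # Non-critical function: degrade to a placeholder on any failure.
    try:
        comments = fetch_comments(article_id)
        comment_section = "\n".join(comments) or "No comments yet."
    except Exception:
        comment_section = "Comments unavailable right now."

    return f"{article}\n\n{comment_section}"
```

In practice the optional call would also carry its own short timeout, so that a slow comment service cannot delay the article itself.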

Key Points

  • Distributed systems fail in ways that single-server systems do not, so plan for network unreliability from the start.
  • Exponential backoff with jitter prevents retry storms during dependency failures.
  • Circuit breakers stop cascading failures by isolating broken dependencies early.
  • Bulkheads give each component its own resources, so one component's failure cannot starve the others.
  • Idempotency makes retries safe; graceful degradation keeps core functions running when secondary features fail.
