Microservices Fault Tolerance Patterns

In a microservices system with dozens of services, something will fail. A database slows down. A service crashes. A network packet gets lost. The goal of fault tolerance is to make the overall system survive these failures without crashing completely or leaving users stuck.

The Cascade Failure Problem

Imagine Service A calls Service B. Service B is slow. Service A waits. More requests come in. All of them wait for Service B. Service A's resources fill up. Now Service A is also slow. Service C calls Service A and gets stuck too. The slowness spreads through the entire system like a traffic jam.

CASCADE FAILURE
===============
[Service C] --> [Service A] --> [Service B]  (Service B is slow)
     |               |               |
  waiting         waiting         slow DB
  waiting         waiting
  waiting

All services pile up waiting. The whole system grinds to a halt.

Fault tolerance patterns prevent this cascade effect.

Pattern 1: Timeouts

Set a maximum wait time for every network call. If the response does not arrive within the timeout period, stop waiting and handle the situation.

WITHOUT TIMEOUT:
Order Service calls Payment Service
Order Service waits... waits... waits... (forever)
Thread is stuck. Next request comes in. Another thread stuck. System fills up.

WITH TIMEOUT (2 seconds):
Order Service calls Payment Service
Waits up to 2 seconds
If no response: returns error to user immediately
Thread is freed. Next request handled normally.

Timeouts prevent one slow service from holding all threads in the calling service hostage.
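The timeout behaviour above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production implementation: the names call_with_timeout and slow_payment_call are hypothetical, and in practice you would usually pass a timeout directly to your HTTP client. Note that the worker thread is not cancelled when the caller gives up; only the caller is freed.

```python
import concurrent.futures
import time


def slow_payment_call(delay_s):
    """Hypothetical stand-in for a Payment Service call that takes delay_s seconds."""
    time.sleep(delay_s)
    return "payment accepted"


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread and stop waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Give up waiting so the caller can answer the user right away.
        raise TimeoutError(f"no response within {timeout_s}s")
    finally:
        # Do not block on the worker; it finishes (or fails) in the background.
        pool.shutdown(wait=False)
```

A caller would wrap the call in try/except TimeoutError and return an error to the user in the except branch, for example call_with_timeout(slow_payment_call, 2.0, 0.1).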

Pattern 2: Retry

If a request fails, try again. Many failures are temporary — a brief network blip, a momentary spike in the called service. Retrying after a short wait often succeeds.

RETRY WITH EXPONENTIAL BACKOFF
================================
Attempt 1 fails --> wait 1 second --> retry
Attempt 2 fails --> wait 2 seconds --> retry
Attempt 3 fails --> wait 4 seconds --> retry
Attempt 4 fails --> give up, return error

Each wait doubles, which gives a recovering service progressively
more room to recover. Adding a small random offset (jitter) to each
wait keeps many clients from retrying at the same moment.
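The schedule in the diagram can be sketched as follows. The function name retry_with_backoff and its parameters are assumptions made for illustration. The optional jitter flag is an extra that is not part of the diagram: it randomizes each wait so that many clients do not retry in lockstep. The sleep parameter exists only so the waits can be observed without actually sleeping.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay_s=1.0, jitter=False,
                       sleep=time.sleep):
    """Call fn(); after each failure wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Sketch only: real code should catch just the transient errors
            # (timeouts, connection resets), not every exception.
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay_s * 2 ** (attempt - 1)  # doubles each time
            if jitter:
                delay = random.uniform(0, delay)  # spread clients out
            sleep(delay)
```

With the defaults, a call that keeps failing waits 1, 2, and 4 seconds and then gives up after the fourth attempt, matching the diagram.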

Retries are safe only when the operation is idempotent. If you retry a payment request, you must not charge the customer twice. Design operations so that calling them multiple times produces the same result as calling them once.
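One common way to make a payment call idempotent is to have the client send a unique key with each logical request, so the server can recognize a retry. The sketch below is an in-memory illustration under that assumption; the PaymentService class, its charge method, and the key format are all hypothetical.

```python
class PaymentService:
    """Hypothetical in-memory payment service that deduplicates by key."""

    def __init__(self):
        self._receipts = {}      # idempotency key -> receipt of the first charge
        self.total_charged = 0

    def charge(self, idempotency_key, amount):
        # A retried request carries the same key: return the stored receipt
        # instead of charging again.
        if idempotency_key in self._receipts:
            return self._receipts[idempotency_key]
        self.total_charged += amount
        receipt = {"key": idempotency_key, "amount": amount, "status": "charged"}
        self._receipts[idempotency_key] = receipt
        return receipt
```

Calling charge("order-42", 100) twice returns the same receipt and leaves total_charged at 100, so a retry after a lost response does not bill the customer a second time.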

Pattern 3: Circuit Breaker

The Circuit Breaker pattern is named after the electrical circuit breaker in your home. When the electrical load is too high, the breaker trips — it opens the circuit to prevent damage. When the problem is fixed, you reset the breaker and current flows again.

CIRCUIT BREAKER STATES
=======================

CLOSED (normal)
Service A calls Service B normally.
If 5 failures happen within 10 seconds...
          |
          v
OPEN (broken)
Service A stops calling Service B immediately.
All requests return a fallback response without trying.
After 30 seconds...
          |
          v
HALF-OPEN (testing)
Allows 1 request through to test if Service B recovered.
  Success --> back to CLOSED
  Failure --> back to OPEN for another 30 seconds

The circuit breaker stops the cascade failure. When Service B is unhealthy, Service A does not waste threads trying to call it. It returns quickly with a fallback, freeing up resources.
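The three states can be sketched as a small Python class. This is a simplified, single-threaded illustration that uses the thresholds from the diagram (5 failures within 10 seconds, 30 seconds open); the class and parameter names are assumptions, and a real implementation would also need locking so that only one trial request passes in the HALF-OPEN state. The clock parameter exists so time can be controlled in tests.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, open_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock
        self.state = "CLOSED"
        self.failures = []       # timestamps of recent failures
        self.opened_at = None

    def call(self, fn, fallback):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.open_s:
                return fallback()        # fail fast, no downstream call
            self.state = "HALF_OPEN"     # cool-down elapsed: allow one trial
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            return fallback()
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"        # trial succeeded: resume normal calls
            self.failures.clear()
        return result

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)              # trial failed: open again
            return
        # Keep only failures inside the sliding window, then count them.
        self.failures = [t for t in self.failures if now - t <= self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()
```

A caller would use it as breaker.call(call_service_b, use_cached_value), where both arguments are zero-argument functions supplied by the caller.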

Pattern 4: Fallback

When a service call fails or the circuit is open, return a pre-defined alternative response instead of an error.

RECOMMENDATION SERVICE EXAMPLE
================================
Product Service calls Recommendation Service
Recommendation Service is down

WITHOUT FALLBACK: User sees an error page

WITH FALLBACK:
Product Service detects failure
Returns pre-cached popular items as recommendations
User sees a working page with slightly less personalized content

Degraded experience is better than a broken experience.
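The recommendation example might look like this in Python. It is a minimal sketch: POPULAR_ITEMS, get_recommendations, and the fetch_personalized callback are hypothetical names, and the "personalized" flag is an added convenience so the page can tell which kind of result it received.

```python
# Hypothetical pre-cached list of popular items, refreshed elsewhere.
POPULAR_ITEMS = ["bestseller-1", "bestseller-2", "bestseller-3"]


def get_recommendations(user_id, fetch_personalized, fallback_items=POPULAR_ITEMS):
    """Return personalized items, or the cached popular items if the call fails."""
    try:
        return {"items": fetch_personalized(user_id), "personalized": True}
    except Exception:
        # Recommendation Service is down or too slow: degrade, do not break.
        return {"items": list(fallback_items), "personalized": False}
```

The Product Service renders the page from the "items" list in both cases, so the user never sees the failure.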

Pattern 5: Bulkhead

The Bulkhead pattern comes from ship design. A ship is divided into watertight compartments (bulkheads). If one compartment floods, the others stay dry. The ship does not sink.

BULKHEAD IN MICROSERVICES
==========================
Order Service makes calls to:
  - Payment Service
  - Inventory Service
  - Shipping Service

WITHOUT BULKHEAD:
All three use a shared pool of 100 threads.
Payment Service gets slow --> takes 90 threads.
Inventory and Shipping calls have only 10 threads left. Slow too.

WITH BULKHEAD:
Payment Service:   30 dedicated threads
Inventory Service: 30 dedicated threads
Shipping Service:  30 dedicated threads

Payment Service gets slow --> 30 threads fill up.
Inventory and Shipping still have their full 30 threads. Unaffected.
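A bulkhead can be sketched with one concurrency limit per downstream service. Note the substitution: the example above uses dedicated thread pools, while this sketch uses a semaphore that caps how many calls to each service may be in flight at once and rejects the rest immediately. The isolation effect is the same. The Bulkhead class, the BulkheadFullError exception, and the limit of 30 are illustrative assumptions.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a downstream service already has its maximum calls in flight."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Reject immediately instead of queueing behind a slow service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn(*args)
        finally:
            self._slots.release()


# One compartment per downstream service, as in the diagram above.
bulkheads = {name: Bulkhead(name, 30)
             for name in ("payment", "inventory", "shipping")}
```

If Payment Service slows down and all 30 of its slots are taken, further payment calls fail fast with BulkheadFullError, while bulkheads["inventory"] and bulkheads["shipping"] keep their full capacity.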

Pattern 6: Health Checks

Every service exposes a health check endpoint. Load balancers and orchestration tools call this endpoint regularly. If the service reports unhealthy, traffic stops routing to it until it recovers.

GET /health

Response if healthy:
{ "status": "OK", "db": "connected", "cache": "connected" }

Response if unhealthy:
{ "status": "DOWN", "db": "disconnected" }

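The load balancer polls this endpoint and automatically stops sending
traffic to unhealthy instances. Many load balancers decide based on the
HTTP status code (for example 200 versus 503) rather than the body.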
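A health endpoint along these lines can be sketched with Python's built-in http.server module. Several details are assumptions made for illustration: check_dependencies is a hypothetical placeholder for real database and cache probes, and the choice of HTTP 200 for healthy and 503 for unhealthy is a common convention rather than something the responses above specify.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies():
    """Hypothetical probes; a real service would ping its database and cache."""
    return {"db": "connected", "cache": "connected"}


def build_health_response(deps):
    """Turn dependency states into an HTTP status code and a JSON-ready body."""
    healthy = all(state == "connected" for state in deps.values())
    status_code = 200 if healthy else 503
    body = {"status": "OK" if healthy else "DOWN", **deps}
    return status_code, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status_code, body = build_health_response(check_dependencies())
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status_code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# To serve it locally (the port is an arbitrary choice):
# HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Keeping build_health_response separate from the handler makes the health logic easy to test without starting a server.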

Combining Patterns

These patterns work best together. A production service typically layers several of them around each remote call:

SERVICE CALL WITH FULL PROTECTION
===================================
[Caller] --> Set timeout (2s)
          --> Check circuit breaker (is it OPEN? skip call, use fallback)
          --> Make call to [Downstream Service] (in dedicated bulkhead)
          --> If fails: retry with backoff (up to 3 times)
          --> If all retries fail: return fallback response
          --> Record failure count for circuit breaker
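The flow above can be sketched as one small wrapper class. This is a deliberately simplified illustration, not how any particular library does it: ProtectedClient and all of its parameters are hypothetical, the breaker counts consecutive failed calls instead of failures in a time window, and the code is not thread-safe. A dedicated thread pool acts as both the bulkhead and the mechanism for enforcing the timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ProtectedClient:
    """Hypothetical wrapper layering the patterns around one downstream call."""

    def __init__(self, fn, fallback, timeout_s=2.0, max_retries=3,
                 base_delay_s=1.0, max_concurrent=30, failure_threshold=5,
                 open_s=30.0, sleep=time.sleep, clock=time.monotonic):
        self._fn = fn
        self._fallback = fallback
        self._timeout_s = timeout_s
        self._max_retries = max_retries
        self._base_delay_s = base_delay_s
        self._failure_threshold = failure_threshold
        self._open_s = open_s
        self._sleep = sleep
        self._clock = clock
        # Bulkhead: this downstream gets its own bounded pool of threads.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._failures = 0        # consecutive failed calls (after retries)
        self._opened_at = None    # None means the circuit is CLOSED

    def call(self, *args):
        probing = False
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._open_s:
                return self._fallback(*args)   # OPEN: skip the call entirely
            probing = True                     # HALF-OPEN: one trial, no retries
        attempts = 1 if probing else self._max_retries + 1
        for attempt in range(attempts):
            try:
                future = self._pool.submit(self._fn, *args)
                result = future.result(timeout=self._timeout_s)  # timeout
            except Exception:
                if attempt < attempts - 1:
                    self._sleep(self._base_delay_s * 2 ** attempt)  # backoff
                continue
            self._failures, self._opened_at = 0, None   # success closes the circuit
            return result
        # Every attempt failed: record it and possibly open the circuit.
        self._failures += 1
        if probing or self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
        return self._fallback(*args)
```

A caller would build one client per downstream service, for example ProtectedClient(call_payment_service, payment_fallback), where both functions are supplied by the application and accept the same arguments.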

Libraries like Resilience4j (Java), Polly (.NET), and Hystrix (Java, now in maintenance mode) implement these patterns so you do not have to build them from scratch.
