Microservices Fault Tolerance Patterns
In a microservices system with dozens of services, something will fail. A database slows down. A service crashes. A network packet gets lost. The goal of fault tolerance is to make the overall system survive these failures without crashing completely or leaving users stuck.
The Cascade Failure Problem
Imagine Service A calls Service B. Service B is slow. Service A waits. More requests come in. All of them wait for Service B. Service A's resources fill up. Now Service A is also slow. Service C calls Service A and gets stuck too. The slowness spreads through the entire system like a traffic jam.
CASCADE FAILURE
===============
[Service C] --> [Service A] --> [Service B]   (Service B is slow)
     |               |               |
  waiting         waiting         slow DB
  waiting         waiting
  waiting
All services pile up waiting. The whole system grinds to a halt.
Fault tolerance patterns prevent this cascade effect.
Pattern 1: Timeouts
Set a maximum wait time for every network call. If the response does not arrive within the timeout period, stop waiting and handle the situation.
WITHOUT TIMEOUT:
  Order Service calls Payment Service
  Order Service waits... waits... waits... (forever)
  Thread is stuck. Next request comes in. Another thread stuck.
  System fills up.
WITH TIMEOUT (2 seconds):
  Order Service calls Payment Service
  Waits up to 2 seconds
  If no response: returns error to user immediately
  Thread is freed. Next request handled normally.
Timeouts prevent one slow service from holding all threads in the calling service hostage.
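As a rough illustration of the 2-second timeout above, here is a minimal Python sketch using asyncio. The names call_payment_service and place_order are hypothetical, and the simulated delay stands in for a real network call. In thread-based code the same idea is usually expressed through the HTTP client's own timeout setting, for example the timeout argument of urllib.request.urlopen.

```python
import asyncio


async def call_payment_service(delay_s: float) -> str:
    # Hypothetical stand-in for a network call; delay_s simulates latency.
    await asyncio.sleep(delay_s)
    return "payment accepted"


async def place_order(delay_s: float, timeout_s: float = 2.0) -> str:
    try:
        # Wait at most timeout_s; wait_for cancels the call if it takes longer.
        return await asyncio.wait_for(call_payment_service(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Stop waiting and handle the situation instead of hanging.
        return "error: payment service timed out"
```

Calling asyncio.run(place_order(5.0)) gives up after 2 seconds and returns the error string, so the caller is free to handle the next request.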
Pattern 2: Retry
If a request fails, try again. Many failures are temporary: a brief network blip, or a momentary load spike in the called service. Retrying after a short wait often succeeds.
RETRY WITH EXPONENTIAL BACKOFF
================================
Attempt 1 fails --> wait 1 second --> retry
Attempt 2 fails --> wait 2 seconds --> retry
Attempt 3 fails --> wait 4 seconds --> retry
Attempt 4 fails --> give up, return error
Each wait doubles, so a struggling service gets progressively more time to recover instead of being hammered by immediate retries. Adding a small random delay (jitter) to each wait keeps many clients from retrying at the same moment.
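Retries are only safe for idempotent operations. If you retry a payment request, you must not charge the customer twice. Design operations so that calling them multiple times produces the same result as calling them once, for example by having the client send a unique request ID that the server uses to detect duplicates.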
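The backoff schedule above (1, 2, 4 seconds, then give up) could be sketched in Python roughly as follows. The function name retry_with_backoff and its parameters are hypothetical. The jitter parameter and the choice to retry only on ConnectionError are assumptions of this sketch, not something the schedule above prescribes.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay_s=1.0,
                       jitter=0.1, sleep=time.sleep):
    """Call operation(); on ConnectionError wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # last attempt failed: give up and surface the error
            delay = base_delay_s * (2 ** attempt)  # doubles on every attempt
            # Assumption: up to 10% random jitter so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))
```

The sleep parameter exists only so the waits can be replaced in tests. The operation passed in should be idempotent, for the reason given above.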
Pattern 3: Circuit Breaker
The Circuit Breaker pattern is named after the electrical circuit breaker in your home. When the electrical load is too high, the breaker trips — it opens the circuit to prevent damage. When the problem is fixed, you reset the breaker and current flows again.
CIRCUIT BREAKER STATES
=======================
CLOSED (normal)
  Service A calls Service B normally.
  If 5 failures happen within 10 seconds...
      |
      v
OPEN (broken)
  Service A stops calling Service B immediately.
  All requests return a fallback response without trying.
  After 30 seconds...
      |
      v
HALF-OPEN (testing)
  Allows 1 request through to test if Service B recovered.
  Success --> back to CLOSED
  Failure --> back to OPEN for another 30 seconds
The circuit breaker stops the cascade failure. When Service B is unhealthy, Service A does not waste threads trying to call it. It returns quickly with a fallback, freeing up resources.
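The three states can be sketched as a small Python class. This is a simplified, single-threaded illustration using the thresholds from the diagram (5 failures in 10 seconds, 30 seconds open). The names CircuitBreaker and CircuitOpenError are hypothetical, and a real implementation would also need locking so that only one trial request passes in the HALF-OPEN state.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, open_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock          # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = deque()     # timestamps of recent failures
        self.opened_at = 0.0

    def call(self, operation):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.open_s:
                raise CircuitOpenError("circuit is open")
            self.state = "HALF_OPEN"  # wait is over: let a trial call through
        try:
            result = operation()
        except Exception:
            self._on_failure(now)
            raise
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"     # trial call succeeded
        return result

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)           # trial call failed: open again
            return
        self.failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()
```

A caller would typically catch CircuitOpenError and return a fallback response instead of propagating the error.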
Pattern 4: Fallback
When a service call fails or the circuit is open, return a pre-defined alternative response instead of an error.
RECOMMENDATION SERVICE EXAMPLE
================================
Product Service calls Recommendation Service
Recommendation Service is down
WITHOUT FALLBACK:
  User sees an error page
WITH FALLBACK:
  Product Service detects failure
  Returns pre-cached popular items as recommendations
  User sees a working page with slightly less personalized content
Degraded experience is better than a broken experience.
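A minimal Python sketch of the recommendation example might look like this. POPULAR_ITEMS, fetch_recommendations, and the shape of the returned dictionary are all hypothetical. The fetch function here always fails, to simulate the Recommendation Service being down.

```python
# Hypothetical pre-cached list, refreshed elsewhere while the service is healthy.
POPULAR_ITEMS = ["bestseller-1", "bestseller-2", "bestseller-3"]


def fetch_recommendations(user_id):
    # Stand-in for the real remote call; always fails in this sketch.
    raise ConnectionError("recommendation service is down")


def recommendations_with_fallback(user_id, fetch=fetch_recommendations):
    try:
        return {"items": fetch(user_id), "personalized": True}
    except (ConnectionError, TimeoutError):
        # Degrade gracefully: serve popular items instead of an error page.
        return {"items": list(POPULAR_ITEMS), "personalized": False}
```

The personalized flag is one way to let the page know it is showing degraded content, for instance to change a "Recommended for you" heading.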
Pattern 5: Bulkhead
The Bulkhead pattern comes from ship design. A ship is divided into watertight compartments (bulkheads). If one compartment floods, the others stay dry. The ship does not sink.
BULKHEAD IN MICROSERVICES
==========================
Order Service makes calls to:
- Payment Service
- Inventory Service
- Shipping Service
WITHOUT BULKHEAD:
  All three use a shared pool of 100 threads.
  Payment Service gets slow --> takes 90 threads.
  Inventory and Shipping calls have only 10 threads left. Slow too.
WITH BULKHEAD:
  Payment Service: 30 dedicated threads
  Inventory Service: 30 dedicated threads
  Shipping Service: 30 dedicated threads
  Payment Service gets slow --> 30 threads fill up.
  Inventory and Shipping still have their full 30 threads. Unaffected.
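One way to sketch this isolation in Python is shown below. Note that it swaps the dedicated thread pools described above for a semaphore that caps concurrent calls per downstream service, which gives the same isolation with less machinery. The names Bulkhead and BulkheadFullError are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a downstream service has used up its share of capacity."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Reject immediately instead of queueing behind a slow service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One compartment per downstream service, 30 slots each as in the example above.
bulkheads = {name: Bulkhead(name, 30)
             for name in ("payment", "inventory", "shipping")}
```

With this setup, a slow Payment Service can occupy at most its own 30 slots. Calls made through bulkheads["inventory"] and bulkheads["shipping"] are unaffected.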
Pattern 6: Health Checks
Every service exposes a health check endpoint. Load balancers and orchestration tools call this endpoint regularly. If the service reports unhealthy, traffic stops routing to it until it recovers.
GET /health
Response if healthy:
{ "status": "OK", "db": "connected", "cache": "connected" }
Response if unhealthy:
{ "status": "DOWN", "db": "disconnected" }
Load balancer reads this response and stops sending traffic
to unhealthy instances automatically.
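The response bodies above could be produced by a function along these lines. This is a sketch: health_response and the probe callables are hypothetical, and returning HTTP 503 for the unhealthy case is an assumption, chosen because many load balancers and orchestrators decide based on the status code rather than the body.

```python
import json


def health_response(checks):
    """checks maps a dependency name to a zero-argument probe.

    A probe returns True when the dependency is reachable.
    Returns (http_status_code, json_body).
    """
    details = {}
    healthy = True
    for name, probe in checks.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        details[name] = "connected" if ok else "disconnected"
        healthy = healthy and ok
    body = {"status": "OK" if healthy else "DOWN", **details}
    # Assumption: 200 when healthy, 503 when any dependency is down.
    return (200 if healthy else 503), json.dumps(body)
```

A GET /health handler in whatever web framework the service uses would call this function and send back the status code and body unchanged.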
Combining Patterns
These patterns work best together. A production service typically layers several of them around each remote call:
SERVICE CALL WITH FULL PROTECTION
===================================
[Caller] --> Set timeout (2s)
--> Check circuit breaker (is it OPEN? skip call, use fallback)
--> Make call to [Downstream Service] (in dedicated bulkhead)
--> If fails: retry with backoff (up to 3 times)
--> If all retries fail: return fallback response
--> Record failure count for circuit breaker
Libraries such as Resilience4j (Java), Polly (.NET), and Hystrix (Java, now in maintenance mode) implement these patterns so you do not have to build them from scratch.
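Purely to illustrate how the steps in the flow above fit together, here is a compact hand-rolled Python sketch. In practice one of the libraries named above would normally be used instead. The class ProtectedCall and all its parameters are hypothetical. It simplifies in several ways: the breaker counts consecutive failures rather than failures in a time window, the timeout is passed to the operation rather than enforced here, the bulkhead slot is held across retries, and nothing is thread-safe beyond the semaphore.

```python
import threading
import time


class ProtectedCall:
    """Timeout, circuit breaker, bulkhead, retry, and fallback for one downstream."""

    def __init__(self, operation, fallback, timeout_s=2.0, max_retries=3,
                 base_delay_s=1.0, max_concurrent=30, failure_threshold=5,
                 open_s=30.0, clock=time.monotonic, sleep=time.sleep):
        self.operation = operation    # called as operation(timeout_s)
        self.fallback = fallback      # zero-argument callable
        self.timeout_s = timeout_s
        self.max_retries = max_retries
        self.base_delay_s = base_delay_s
        self.failure_threshold = failure_threshold
        self.open_s = open_s
        self.clock = clock
        self.sleep = sleep
        self.slots = threading.BoundedSemaphore(max_concurrent)
        self.failures = 0             # consecutive failures (simplification)
        self.opened_at = None         # None means the breaker is not OPEN

    def __call__(self):
        # Circuit breaker: while OPEN, skip the call and use the fallback.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_s:
                return self.fallback()
            # HALF-OPEN: allow calls again; the failure count is still at the
            # threshold, so a single failure re-opens the breaker.
            self.opened_at = None
        # Bulkhead: reject at once if this downstream's slots are all in use.
        if not self.slots.acquire(blocking=False):
            return self.fallback()
        try:
            for attempt in range(1 + self.max_retries):
                try:
                    result = self.operation(self.timeout_s)
                    self.failures = 0  # success closes the breaker
                    return result
                except (ConnectionError, TimeoutError):
                    # Record the failure for the circuit breaker.
                    self.failures += 1
                    if self.failures >= self.failure_threshold:
                        self.opened_at = self.clock()
                        break
                    if attempt < self.max_retries:
                        # Exponential backoff: 1s, 2s, 4s with the defaults.
                        self.sleep(self.base_delay_s * 2 ** attempt)
            return self.fallback()     # all retries failed or breaker tripped
        finally:
            self.slots.release()
```

A caller would construct one ProtectedCall per downstream service and invoke it like a function. The operation is expected to honor the timeout it is given, for example by passing it to its HTTP client.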
