Microservices Scaling Strategies

One of the most powerful benefits of microservices is the ability to scale each service independently. When one part of the system experiences high demand, you scale only that part, not the entire application. This lesson covers the strategies, techniques, and trade-offs involved in scaling a microservices system effectively.

The Scale Cube

The Scale Cube, introduced in the book The Art of Scalability by Martin Abbott and Michael Fisher, describes three independent dimensions of scaling. A microservices system can use all three.

SCALE CUBE
===========

         Z-axis
         (Partition by data)
         ^
         |
         |          Y-axis
         |         (Split by function = microservices)
         +---------->
        /
       / X-axis
      v  (Run more copies)

X-AXIS: Horizontal Scaling
  Run multiple identical copies of a service.
  A load balancer distributes traffic across all copies.
  Simple to apply. Works for stateless services.

Y-AXIS: Functional Decomposition
  Split one large service into smaller services by function.
  Each function scales independently.
  This is what microservices architecture is.

Z-AXIS: Data Partitioning (Sharding)
  Split one service into multiple instances, each serving a subset of data.
  Users A-M go to Instance 1. Users N-Z go to Instance 2.
  Used when X-axis scaling alone is insufficient.

X-Axis: Horizontal Scaling in Practice

Running more copies of a service is the most common scaling technique. A load balancer distributes incoming requests across all running instances.

HORIZONTAL SCALING EXAMPLE
===========================
Normal traffic:
  [Load Balancer] --> [Order Service Instance 1]
                  --> [Order Service Instance 2]

Black Friday spike:
  [Load Balancer] --> [Order Service Instance 1]
                  --> [Order Service Instance 2]
                  --> [Order Service Instance 3]
                  --> [Order Service Instance 4]
                  --> [Order Service Instance 5]

After the spike ends, scale back down to 2 instances.
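
The distribution step can be sketched in a few lines. This is a minimal illustration, assuming a simple round-robin policy; the class name RoundRobinBalancer and the instance names are made up for the example, and real load balancers also apply health checks and other policies.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out instances in rotation; a toy stand-in for a real load balancer."""

    def __init__(self, instances):
        self._rotation = cycle(list(instances))

    def scale_to(self, instances):
        # Replace the pool (scale up or down) and restart the rotation.
        self._rotation = cycle(list(instances))

    def next_instance(self):
        return next(self._rotation)


# Normal traffic: two instances share the requests.
balancer = RoundRobinBalancer(["order-1", "order-2"])
normal = [balancer.next_instance() for _ in range(4)]

# Spike: scale out to five instances; each new request goes to the next one.
balancer.scale_to([f"order-{i}" for i in range(1, 6)])
spike = [balancer.next_instance() for _ in range(5)]
```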

Horizontal scaling works best when services are stateless: a request handled by Instance 1 can be handled equally well by Instance 3. No session state is stored inside the service itself. User session data lives in a shared external store such as Redis, not in the service's memory.

Stateless Services

STATEFUL SERVICE (cannot scale horizontally)
=============================================
User logs in --> Instance 1 stores session in memory
Next request --> hits Instance 2 (user is not logged in here)
User sees: "Please log in again"

STATELESS SERVICE (scales horizontally)
=========================================
User logs in --> Instance 1 stores session in Redis
Next request --> hits Instance 2 (reads session from Redis)
User sees: their logged-in dashboard. Seamless.

All instances share external state (Redis, DB).
No instance holds state only it knows about.
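
The difference can be sketched as follows. This is an illustration only: SessionStore is an in-memory stand-in for an external store such as Redis, and all class and method names are hypothetical.

```python
class SessionStore:
    """In-memory stand-in for a shared external store such as Redis."""

    def __init__(self):
        self._data = {}

    def set(self, session_id, user):
        self._data[session_id] = user

    def get(self, session_id):
        return self._data.get(session_id)


class StatefulInstance:
    """Keeps sessions in its own memory, invisible to other instances."""

    def __init__(self):
        self._sessions = {}

    def login(self, session_id, user):
        self._sessions[session_id] = user

    def handle(self, session_id):
        user = self._sessions.get(session_id)
        return f"dashboard for {user}" if user else "Please log in again"


class StatelessInstance:
    """Keeps nothing locally; every lookup goes to the shared store."""

    def __init__(self, store):
        self._store = store

    def login(self, session_id, user):
        self._store.set(session_id, user)

    def handle(self, session_id):
        user = self._store.get(session_id)
        return f"dashboard for {user}" if user else "Please log in again"


# Stateful: login on instance 1, next request lands on instance 2.
s1, s2 = StatefulInstance(), StatefulInstance()
s1.login("sess-1", "alice")
stateful_result = s2.handle("sess-1")

# Stateless: both instances read the same external store.
store = SessionStore()
t1, t2 = StatelessInstance(store), StatelessInstance(store)
t1.login("sess-1", "alice")
stateless_result = t2.handle("sess-1")
```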

Y-Axis: Scaling by Function

Different services have different scaling needs. The Search Service handles millions of queries per minute. The Admin Service handles tens of requests per hour. Running the same number of instances for both wastes money.

FUNCTION-SPECIFIC SCALING
==========================
Service              Traffic Level   Instances
===================  ==============  =========
Search Service       Very High       50 instances
Order Service        High            20 instances
Payment Service      Medium          10 instances
Notification Service Medium          8 instances
Report Service       Low             2 instances
Admin Service        Very Low        1 instance

Total cost optimized.
Each service sized to its actual load.
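
One rough way to arrive at per-service instance counts is to divide peak load by what a single instance can handle and add some headroom. The sketch below assumes that simple model; the function name, the 25% headroom default, and the traffic figures are illustrative and not measurements from a real system.

```python
import math


def instances_needed(peak_rps, rps_per_instance, headroom=0.25, minimum=1):
    """Estimate instance count from peak requests/second plus spare headroom."""
    required = math.ceil(peak_rps * (1 + headroom) / rps_per_instance)
    # Never go below the minimum, even for near-idle services.
    return max(minimum, required)


# Hypothetical figures: a busy search service and a nearly idle admin service.
search_instances = instances_needed(peak_rps=4000, rps_per_instance=100)
admin_instances = instances_needed(peak_rps=0.01, rps_per_instance=100)
```

In practice the per-instance capacity comes from load testing each service, not from a guess.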

Z-Axis: Data Sharding

When a single database becomes a bottleneck even with many service instances reading from it, sharding splits the data across multiple database instances.

DATABASE SHARDING EXAMPLE
==========================
Order Service has 1 billion orders. One database is too slow.

Shard by user_id:
  Shard 1: users 0 - 9,999,999       (Orders DB 1)
  Shard 2: users 10,000,000 - 19,999,999  (Orders DB 2)
  Shard 3: users 20,000,000 - 29,999,999  (Orders DB 3)

User USR-5001234 always reads/writes to Shard 1.
User USR-15000000 always reads/writes to Shard 2.

Each shard handles a fraction of the total load.
Adding a shard increases capacity.

Sharding adds complexity: queries that span multiple shards require aggregation logic. Teams usually implement sharding only when vertical scaling (using a bigger server) is no longer cost-effective.
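
A range-based router for the example above might look like the sketch below. The shard names, the USR- prefix parsing, and the count_all_orders helper are assumptions made for illustration; production systems often use hashing or a lookup service instead of fixed ranges.

```python
USERS_PER_SHARD = 10_000_000
SHARDS = ["orders-db-1", "orders-db-2", "orders-db-3"]


def shard_for(user_id: str) -> str:
    """Map a user ID such as 'USR-5001234' to the shard holding its orders."""
    numeric_id = int(user_id.split("-", 1)[1])
    index = numeric_id // USERS_PER_SHARD
    if index >= len(SHARDS):
        raise ValueError(f"no shard configured for {user_id}")
    return SHARDS[index]


def count_all_orders(count_on_shard):
    """Cross-shard query: ask every shard, then aggregate in the service."""
    return sum(count_on_shard(shard) for shard in SHARDS)


first = shard_for("USR-5001234")
second = shard_for("USR-15000000")
```

The count_all_orders helper shows why cross-shard queries are more expensive: one logical query becomes one query per shard plus a merge step.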

Caching for Scale

Many requests ask for the same data repeatedly. Fetching it from the database every time wastes time and database resources. Caching stores frequently accessed data in fast memory (like Redis) so the database is not queried every time.

CACHING FLOW
=============
Request arrives: "Get product details for SKU-005"

Step 1: Check cache (Redis)
  Cache HIT: Return data in < 1ms. Done. No DB query.
  Cache MISS: Fetch from database (10-50ms).
              Store result in cache with 5-minute expiry.
              Return data to caller.

Cache hit rate of 90% means:
  90% of requests answered in < 1ms
  Only 10% hit the database
  Database read load reduced by roughly 90%

CACHE DIAGRAM
=============
[Service] --> [Redis Cache] (fast, in-memory)
               cache miss |
                          v
               [Database] (slower, disk-based)

Cache invalidation, knowing when to remove stale data from the cache, is the hardest part of a caching strategy. For example, a product price update must clear the cached product details so callers get the new price.
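
The read path and the invalidation step can be sketched together. This is a minimal cache-aside illustration, assuming an in-memory TTLCache in place of Redis and a plain dict in place of the database; all names here are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry; a stand-in for Redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key):
        self._entries.pop(key, None)


class ProductService:
    def __init__(self, database, cache):
        self._db = database  # plain dict standing in for the database
        self._cache = cache
        self.db_reads = 0

    def get_product(self, sku):
        cached = self._cache.get(sku)
        if cached is not None:
            return cached  # cache hit: no database query
        self.db_reads += 1  # cache miss: read the database
        product = dict(self._db[sku])
        self._cache.set(sku, product)
        return product

    def update_price(self, sku, price):
        self._db[sku]["price"] = price
        self._cache.invalidate(sku)  # drop the stale entry


db = {"SKU-005": {"name": "Widget", "price": 10}}
service = ProductService(db, TTLCache(ttl_seconds=300))
service.get_product("SKU-005")  # miss, reads the database
service.get_product("SKU-005")  # hit, served from the cache
service.update_price("SKU-005", 12)
fresh = service.get_product("SKU-005")  # miss again after invalidation
```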

Auto-Scaling in Kubernetes

Kubernetes automates horizontal scaling with the Horizontal Pod Autoscaler (HPA). The HPA watches metrics such as CPU and memory utilization and adjusts the number of running Pods automatically.

HPA CONFIGURATION EXAMPLE
===========================
For Order Service:
  Minimum pods: 2
  Maximum pods: 20
  Target CPU utilization: 60%

At 2 pods, CPU hits 80%:
  HPA scales to 3 pods (ceil(2 x 80 / 60) = 3). CPU drops to about 53%.

At 3 pods, traffic drops, CPU at 15%:
  HPA scales down to 2 pods (the minimum) after the cool-down period.

Kubernetes Cluster Autoscaler also adds/removes nodes
if the cluster itself runs out of capacity to schedule new pods.
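
The Kubernetes documentation describes the core HPA calculation as desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric), clamped to the configured minimum and maximum. The sketch below applies that formula to the Order Service settings; the function name is hypothetical, and the real controller also applies a tolerance band and a stabilization window that this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_cpu, target_cpu, min_pods, max_pods):
    """Core HPA ratio calculation, clamped to the configured pod range."""
    raw = math.ceil(current_replicas * current_cpu / target_cpu)
    return max(min_pods, min(max_pods, raw))


# Order Service settings: 2 to 20 pods, 60% target CPU utilization.
scale_up = desired_replicas(2, current_cpu=80, target_cpu=60, min_pods=2, max_pods=20)
scale_down = desired_replicas(3, current_cpu=15, target_cpu=60, min_pods=2, max_pods=20)
```

Here scale_up evaluates to 3 and scale_down to 2, because the raw result of 1 is raised to the configured minimum.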

Rate Limiting for Controlled Scale

Accepting unlimited traffic eventually overwhelms any system. Rate limiting caps the number of requests a client, or all clients together, can send in a given time window. This protects services during traffic spikes and prevents any one client from monopolizing resources.

RATE LIMITING LAYERS
=====================
Layer 1 (API Gateway):
  Global rate limit: 10,000 requests/second across all clients
  Per-client limit: 100 requests/minute per API key

Layer 2 (Individual Service):
  Service-level limit: 5,000 orders/minute max

When limit is exceeded:
  HTTP 429 Too Many Requests returned immediately
  Client backs off and retries after the wait period (often given in a Retry-After header)
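
A per-client limit like the one above can be sketched with a fixed-window counter. This is one simple algorithm among several (token bucket and sliding window are common alternatives); the class name, the in-memory counters, and the injectable clock are assumptions made for the example.

```python
import time


class FixedWindowLimiter:
    """Allows at most `limit` requests per client in each time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._counters = {}  # api_key -> (window_start, count)

    def check(self, api_key):
        """Return (status_code, retry_after_seconds) for one request."""
        now = self._clock()
        window_start, count = self._counters.get(api_key, (now, 0))
        if now - window_start >= self._window:
            window_start, count = now, 0  # window elapsed: start a new one
        if count >= self._limit:
            # Over the limit: reject and tell the client how long to wait.
            self._counters[api_key] = (window_start, count)
            return 429, window_start + self._window - now
        self._counters[api_key] = (window_start, count + 1)
        return 200, 0


# 100 requests per minute per API key, driven by a fake clock for the demo.
now = [0.0]
limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: now[0])
statuses = [limiter.check("key-A")[0] for _ in range(100)]
rejected = limiter.check("key-A")
```

A fixed window allows short bursts at window boundaries, which is one reason gateways often use token buckets instead.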

Performance Testing Before Scaling

Adding more instances helps only if the service itself is the bottleneck. If the database is the bottleneck, adding service instances makes things worse, because more instances hit the already slow database harder.

FIND THE BOTTLENECK FIRST
===========================
Load test: simulate 10,000 users placing orders simultaneously

Observation: Order Service CPU at 20%. Database CPU at 99%.

WRONG solution: Add more Order Service instances (10 --> 30)
Result: 30 instances all hammer the already-saturated database. Worse.

RIGHT solution: Optimize database (add indexes, add read replicas,
                add caching layer in front of DB)
Result: Database CPU drops to 50%. System handles 10,000 users.
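
The reasoning can be captured in a deliberately simplified model: end-to-end throughput is limited by the slowest stage. The function name and capacity figures below are made up for illustration, and the model ignores the extra contention that makes an overloaded database even slower.

```python
def system_throughput(instances, per_instance_rps, database_rps):
    """Throughput is capped by whichever stage has less capacity."""
    return min(instances * per_instance_rps, database_rps)


# Hypothetical capacities: each instance handles 500 rps, the database 2,000 rps.
before = system_throughput(instances=10, per_instance_rps=500, database_rps=2000)
more_instances = system_throughput(instances=30, per_instance_rps=500, database_rps=2000)
faster_database = system_throughput(instances=10, per_instance_rps=500, database_rps=6000)
```

In this model, tripling the instance count changes nothing, while raising database capacity lifts the limit until the service tier becomes the bottleneck.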

Scaling Readiness Checklist

Services are stateless: session data lives in an external cache, not in service memory.
Database connections are pooled: services do not open a new connection per request.
Health check endpoints exist: load balancers route only to healthy instances.
Graceful shutdown is implemented: a stopping instance finishes current requests before exiting.
Configuration comes from environment variables: instances start with the right config without manual setup.
Auto-scaling rules are defined and tested: the system scales before users feel the impact.
Load tests have been run at 2x and 5x expected peak traffic: you know where the limits are before production finds them.
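
Two of the checklist items, health checks and graceful shutdown, interact closely, so a small sketch may help. It assumes a hypothetical RequestDrainer helper that a service would call around each request; web frameworks and Kubernetes provide their own hooks for this, which the sketch does not model.

```python
import threading


class RequestDrainer:
    """Tracks in-flight requests so a stopping instance can finish them first."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._accepting = True
        self._idle = threading.Event()
        self._idle.set()  # no requests in flight at start

    def try_begin(self):
        """Call before handling a request; False means the instance is draining."""
        with self._lock:
            if not self._accepting:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end(self):
        """Call after a request finishes."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def is_healthy(self):
        """Health check: report unhealthy once shutdown has begun."""
        with self._lock:
            return self._accepting

    def shutdown(self, timeout=None):
        """Stop accepting work, then wait for in-flight requests to finish."""
        with self._lock:
            self._accepting = False
        return self._idle.wait(timeout)


drainer = RequestDrainer()
drainer.try_begin()                           # one request in progress
drained_early = drainer.shutdown(timeout=0.01)  # times out: request still running
drainer.end()                                 # the request completes
drained = drainer.shutdown(timeout=0.01)      # now the instance can exit
```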

Scaling in a microservices system is an operational discipline, not a one-time task. As traffic patterns change, revisit scaling configurations, caching strategies, and database performance regularly to keep the system running smoothly at any load level.
