System Design Scalability Patterns

Scalability is the ability of a system to handle growing amounts of work by adding resources. A scalable system performs consistently whether serving 100 users or 100 million users. Scalability patterns are proven techniques that architects apply to ensure systems grow without degrading in performance, reliability, or cost-effectiveness.

Building a scalable system is not about buying a bigger machine. It is about designing the system so that adding more standard machines multiplies capacity predictably and affordably.

Vertical Scaling vs Horizontal Scaling

Vertical Scaling (Scale Up)

Vertical scaling means upgrading the existing machine — adding more CPU cores, RAM, or faster storage to the same server.

Before scaling:              After vertical scaling:
Server: 4 CPU, 16GB RAM   →  Server: 32 CPU, 256GB RAM
                             (same machine, more power)

Handles: 1,000 req/s      →  Handles: 8,000 req/s

Advantages: No application changes needed. Simple to implement.

Limits: Hardware limits exist — machines cannot grow infinitely. Also, one machine is a single point of failure.

Horizontal Scaling (Scale Out)

Horizontal scaling adds more machines to the pool. Traffic distributes across all machines via a load balancer.

Before:                 After horizontal scaling:
1 server                4 servers
4 CPU, 16GB RAM         Each: 4 CPU, 16GB RAM
                        Load Balancer distributes traffic

Handles: 1,000 req/s  → Handles: 4,000 req/s
Add 4 more servers    → Handles: 8,000 req/s (linear scaling)

Advantages: No single-machine hardware ceiling — capacity grows by adding nodes. Fault tolerant (losing one server does not stop the system).

Requires: Stateless application design, shared state in external storage, load balancer.

| Aspect          | Vertical Scaling                      | Horizontal Scaling                          |
| Implementation  | Simple (upgrade machine)              | Complex (stateless app, load balancer)      |
| Cost            | Expensive for high specs              | Cheaper commodity hardware                  |
| Fault Tolerance | Single point of failure               | Redundant nodes                             |
| Scaling Limit   | Hardware ceiling                      | Virtually unlimited                         |
| Best For        | Databases (early stages), quick fixes | Web servers, API servers, stateless workers |

Stateless Services

Stateless service design is the prerequisite for horizontal scaling. A stateless service stores no user-specific data locally — all state lives in a shared external store (database or cache). Any server can handle any request at any time.

Stateful (Cannot scale horizontally easily):
User logs in → Server A stores session in local memory
Next request → Must go to Server A (session not on Server B)
Server A crashes → User loses session, must log in again

Stateless (Scales perfectly):
User logs in → Server A issues JWT token, stores nothing locally
Next request → Can go to Server B (JWT contains all needed info)
Server A crashes → Server B handles next request seamlessly
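The stateless flow above can be sketched as a minimal signed session token. This is a simplification of the JWT idea using only the standard library — `SECRET`, `issue_token`, and `verify_token` are hypothetical names, and a real deployment would use a proper JWT library with expiry and key rotation:

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared key: known to every server, so any server can
# issue or verify a token without storing anything locally.
SECRET = b"shared-secret-known-to-all-servers"

def issue_token(user_id: str) -> str:
    """'Server A' issues a token; no local session storage needed."""
    payload = base64.urlsafe_b64encode(json.dumps({"sub": user_id}).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def verify_token(token: str):
    """'Server B' can verify a token issued by any other server."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or tampered token
    return json.loads(base64.urlsafe_b64decode(payload))["sub"]

token = issue_token("alice")   # issued by one server
print(verify_token(token))     # verified by a different server → alice
```

Because all needed state travels inside the token, the load balancer can send every request to any server.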

Database Scaling Patterns

Read Replicas

The primary database handles all writes. Multiple read replicas handle read queries. This pattern is the most impactful for read-heavy workloads (most web applications).

Write:  Application → Primary DB
Read:   Application → Read Replica 1
        Application → Read Replica 2
        Application → Read Replica 3

If 80% of queries are reads:
Adding 4 read replicas → Read capacity 4× without touching write path
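The write/read split above can be sketched as a routing wrapper. `RoutingConnection` is a hypothetical name, the target strings are placeholders, and no SQL is actually executed — real systems do this inside drivers or proxies at the connection layer:

```python
import itertools

class RoutingConnection:
    """Sends writes to the primary, spreads reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)  # round-robin reads

    def execute(self, sql: str):
        target = self.primary if self._is_write(sql) else next(self.replicas)
        return target, sql  # a real router would run sql on target here

    @staticmethod
    def _is_write(sql: str) -> bool:
        return sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}

router = RoutingConnection("primary", ["replica-1", "replica-2", "replica-3"])
print(router.execute("INSERT INTO orders VALUES (1)")[0])  # primary
print(router.execute("SELECT * FROM orders")[0])           # replica-1
print(router.execute("SELECT * FROM users")[0])            # replica-2
```

One caveat the sketch ignores: replicas lag the primary slightly, so reads that must see a just-written row should go to the primary.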

Connection Pooling

Opening a new database connection for every request is expensive (takes 50-100ms). A connection pool maintains a set of open connections that requests borrow and return.

Without connection pool:
Request 1 → Open connection (50ms) → Query (5ms) → Close connection
Request 2 → Open connection (50ms) → Query (5ms) → Close connection

With connection pool (20 connections kept open):
Request 1 → Borrow connection (0ms) → Query (5ms) → Return to pool
Request 2 → Borrow connection (0ms) → Query (5ms) → Return to pool

Connection pool tools: PgBouncer (PostgreSQL), HikariCP (Java)
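The borrow/return cycle can be sketched with a queue-backed pool. `connect` here is a hypothetical factory standing in for a real driver call; production pools like PgBouncer and HikariCP add health checks, max lifetimes, and overflow handling:

```python
import queue

class ConnectionPool:
    """Keeps N connections open; requests borrow and return them."""

    def __init__(self, size: int, connect):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())  # pay the open cost once, up front

    def acquire(self, timeout: float = 5.0):
        # ~0ms when a connection is free; blocks (up to timeout) when
        # all connections are borrowed, then raises queue.Empty
        return self._pool.get(timeout=timeout)

    def release(self, conn) -> None:
        self._pool.put(conn)

# Hypothetical connect function standing in for a real DB driver call
pool = ConnectionPool(size=20, connect=lambda: object())
conn = pool.acquire()
# ... run query on conn ...
pool.release(conn)
```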

Caching as a Scalability Pattern

Caching at multiple layers removes load from databases and downstream services. Layered caching creates a "cache hierarchy" where data is served from the fastest available source.

Cache Hierarchy (fastest to slowest):

L1: In-process memory cache (nanoseconds)
   → Language-level dictionary/map in the application
   → Lost when process restarts
   → Tiny capacity (100MB)

L2: Distributed cache - Redis (microseconds)
   → Shared across all application servers
   → Survives restarts
   → Medium capacity (10s of GB)

L3: Database read replicas (milliseconds)
   → Read-optimized copies
   → Large capacity (terabytes)

L4: Primary Database (milliseconds, but limited connections)
   → Source of truth, only for uncached data
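A lookup through the hierarchy might look like this sketch. `l2_get` and `db_get` are injected stand-ins for a Redis GET and a replica query; real layered caches also need TTLs and invalidation, which this omits:

```python
class LayeredCache:
    """L1 in-process dict → L2 shared cache → database, promoting hits."""

    def __init__(self, l2_get, db_get):
        self.l1 = {}          # in-process memory (lost on restart)
        self.l2_get = l2_get  # e.g. a Redis GET, injected for testing
        self.db_get = db_get  # replica/primary query, slowest layer

    def get(self, key):
        if key in self.l1:              # L1: nanoseconds
            return self.l1[key]
        value = self.l2_get(key)        # L2: microseconds
        if value is None:
            value = self.db_get(key)    # L3/L4: milliseconds
        self.l1[key] = value            # promote so the next hit is L1
        return value
```

With fakes for the lower layers, the second `get` of the same key never reaches the database — exactly the load reduction the hierarchy exists for.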

Async Processing and Queues

Moving time-consuming tasks off the main request path enables web servers to stay responsive under heavy load.

Without async:
HTTP Request → Process everything inline (slow, blocks server thread)
User waits 10 seconds → Poor experience

With async queue:
HTTP Request → Create task in queue → Return 200 immediately
Background worker → Process task (takes 10 seconds, no user impact)
Result stored → User polls for the result or is notified when done

Result: Web servers handle 100× more requests per second
        (no longer blocked by slow operations)
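The hand-off can be sketched with an in-process queue and a background thread. In production the queue would be an external broker (RabbitMQ, SQS, etc.) and the worker a separate process; `handle_request` and the uppercase "work" are illustrative stand-ins:

```python
import queue
import threading

tasks = queue.Queue()
results = {}

def handle_request(task_id: str, payload: str) -> int:
    """The web handler: enqueue the task and return immediately."""
    tasks.put((task_id, payload))
    return 200  # user is not kept waiting for the slow work

def worker() -> None:
    """Background worker: drains the queue off the request path."""
    while True:
        task_id, payload = tasks.get()
        results[task_id] = payload.upper()  # stand-in for slow work
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
status = handle_request("t1", "resize video")  # returns instantly
tasks.join()  # only this demo waits; real callers poll or get notified
print(status, results["t1"])  # → 200 RESIZE VIDEO
```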

Auto-Scaling

Auto-scaling automatically adjusts the number of running server instances based on real-time metrics like CPU utilization, request count, or queue depth.

Traffic pattern for an e-commerce site:

Midnight:  200 users   → 2 servers running
Morning:   5,000 users → Auto-scale to 8 servers
Noon:      10,000 users → Auto-scale to 16 servers
Flash sale: 100,000 users → Auto-scale to 80 servers
Evening:   3,000 users → Scale down to 6 servers

Cost: Pay only for servers actually needed
      No need to over-provision "just in case"

Trigger rules (example):
Scale OUT: If average CPU > 70% for 5 minutes → Add 2 instances
Scale IN:  If average CPU < 30% for 15 minutes → Remove 1 instance

Cloud auto-scaling tools: AWS Auto Scaling Groups, Google Cloud Managed Instance Groups, Azure VMSS
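The example trigger rules translate directly into a small decision function. `desired_change` and its parameters are hypothetical; real autoscalers also enforce cooldown periods and minimum/maximum instance counts:

```python
def desired_change(avg_cpu: float, minutes_sustained: float) -> int:
    """Instance delta per the example trigger rules above.

    avg_cpu           -- average CPU utilization in percent
    minutes_sustained -- how long that average has held
    """
    if avg_cpu > 70 and minutes_sustained >= 5:
        return +2   # scale OUT: add 2 instances
    if avg_cpu < 30 and minutes_sustained >= 15:
        return -1   # scale IN: remove 1 instance
    return 0        # within the comfortable band: do nothing

print(desired_change(85, 6))   # → 2
print(desired_change(20, 20))  # → -1
print(desired_change(50, 60))  # → 0
```

Asymmetric rules (scale out fast, scale in slowly) are deliberate: they prevent "flapping" when load hovers near a threshold.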

Content Delivery and Edge Computing

Serving content from geographically distributed edge servers reduces origin server load and delivers lower latency globally. Edge computing extends this by running business logic at CDN edge nodes — near the user rather than in a central data center.

Traditional:
User in Mumbai → Request → US origin server → Response
Latency: 250ms each way = 500ms round trip

With CDN + Edge Compute:
User in Mumbai → Nearest CDN node in Mumbai (10ms away)
Edge node runs logic → Personalized response from 10ms away
Latency: 20ms round trip

Origin server load: Reduced 90%+ (CDN absorbs static/cacheable traffic)

Database Denormalization

Normalized databases minimize data redundancy through joins. But joins are expensive at scale. Denormalization intentionally stores redundant data to eliminate joins and speed up reads.

Normalized (3 tables, requires JOIN):
SELECT u.name, p.title, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE o.id = 12345;

Denormalized (1 table, no JOIN needed):
Orders table stores everything:
| order_id | user_name | user_email | product_title | product_price | order_total |
| 12345    | Alice     | alice@... | Laptop        | 999           | 999         |

Read: Single-row lookup in one table, no joins
Write: Must update user_name in all orders if name changes (trade-off)

CQRS: Command Query Responsibility Segregation

CQRS separates the read model and write model of an application. Write operations (commands) use a normalized, write-optimized data store. Read operations (queries) use a denormalized, read-optimized data store (often a pre-built materialized view).

Write side (Command):
POST /placeOrder → Order Service → Write to normalized Order DB

Read side (Query):
GET /order/12345 → Query Service → Read from pre-computed Orders View
(Already joined and denormalized, returns instantly)

Sync: Message queue propagates writes to read-side
      Order created → Event published → Read model updated

Benefit: Each side optimized independently for its workload
Used by: Event-sourced systems, high-read e-commerce platforms
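A compact sketch of the two models, assuming in-memory dicts for both stores and a direct function call standing in for the message queue; all names here are illustrative:

```python
# Write side: normalized rows keyed by id. Read side: pre-joined view.
users_db = {"u1": {"name": "Alice", "email": "alice@example.com"}}
orders_db = {}    # write model (normalized)
orders_view = {}  # read model (denormalized, pre-computed)

def place_order(order_id, user_id, title, total):
    """Command: write to the normalized store, then sync the view."""
    orders_db[order_id] = {"user_id": user_id, "title": title, "total": total}
    on_order_created(order_id)  # stand-in for an event on a message queue

def on_order_created(order_id):
    """Projector: builds the denormalized read row (the 'join' runs once,
    at write time, instead of on every read)."""
    o = orders_db[order_id]
    orders_view[order_id] = {
        "user_name": users_db[o["user_id"]]["name"],
        "title": o["title"],
        "total": o["total"],
    }

def get_order(order_id):
    """Query: one read from the pre-computed view, no joins."""
    return orders_view[order_id]

place_order("12345", "u1", "Laptop", 999)
print(get_order("12345")["user_name"])  # → Alice
```

Because the queue in real systems is asynchronous, the read model is eventually consistent: a query issued immediately after a command may briefly see stale data.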

Event Sourcing

Instead of storing the current state of data, event sourcing stores the full history of events that led to the current state. The current state is reconstructed by replaying events.

Traditional (Current State):
Bank account table:
| account_id | balance |
| 1          | $350    |

Event Sourcing (Event Log):
| event_id | account | event        | amount | timestamp  |
| 1        | acc-1   | OPENED       | $500   | 2024-01-01 |
| 2        | acc-1   | WITHDRAWAL   | -$200  | 2024-01-05 |
| 3        | acc-1   | DEPOSIT      | $100   | 2024-01-10 |
| 4        | acc-1   | WITHDRAWAL   | -$50   | 2024-01-12 |

Current balance = 500 - 200 + 100 - 50 = $350 (replay events)

Benefits: Full audit trail, time travel queries, rebuild any view, undo operations
Drawback: Reading current state requires replaying many events (mitigated with snapshots)
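Replaying the log above is just a fold over the event amounts. The `snapshot_balance` parameter sketches the snapshot mitigation (start from a saved balance and replay only the events recorded after it):

```python
def replay(events, snapshot_balance=0):
    """Reconstruct current state by folding events over a starting point."""
    balance = snapshot_balance
    for e in events:
        balance += e["amount"]
    return balance

log = [
    {"event": "OPENED",     "amount": 500},
    {"event": "WITHDRAWAL", "amount": -200},
    {"event": "DEPOSIT",    "amount": 100},
    {"event": "WITHDRAWAL", "amount": -50},
]

print(replay(log))                          # → 350, matching the table
print(replay(log[2:], snapshot_balance=300))  # snapshot + newer events → 350
```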

The Scale Cube

The Scale Cube (from "The Art of Scalability") defines three axes for scaling an application:

         Y-axis (Functional decomposition)
         ↑    Split by business capability
         |    = Microservices
         |
         +----------→ X-axis (Horizontal duplication)
        /             Add more identical instances
       /              = Load balancing
      ↗
Z-axis (Data partitioning)
Split by data subset
= Sharding

Combining all three axes:
X: 10 identical Order Service instances (load balanced)
Y: Separate User, Product, Order, Payment services
Z: Each service's database sharded by customer ID
= Massively scalable system
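The Z-axis split can be sketched as a hash-based shard router. `shard_for` is a hypothetical helper; production systems often use consistent hashing instead, so that changing the shard count moves less data:

```python
import hashlib

def shard_for(customer_id: str, num_shards: int = 4) -> int:
    """Z-axis: route each customer's data to one fixed shard.

    Hashing (rather than e.g. first letter of the ID) spreads
    customers evenly and avoids hot shards.
    """
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same customer always lands on the same shard:
print(shard_for("customer-42") == shard_for("customer-42"))  # → True
```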

Summary

Scalability is designed into a system through multiple complementary patterns. Horizontal scaling adds identical nodes behind a load balancer. Read replicas and connection pooling scale the database layer. Caching removes load from downstream systems. Async queues keep request handlers fast. Auto-scaling matches capacity to demand in real time. CQRS and event sourcing enable the read and write paths to optimize independently. No single pattern is sufficient — highly scalable systems combine all of these thoughtfully.
