System Design Scalability Patterns
Scalability is the ability of a system to handle growing amounts of work by adding resources. A scalable system performs consistently whether serving 100 users or 100 million users. Scalability patterns are proven techniques that architects apply to ensure systems grow without degrading in performance, reliability, or cost-effectiveness.
Building a scalable system is not about buying a bigger machine. It is about designing the system so that adding more standard machines multiplies capacity predictably and affordably.
Vertical Scaling vs Horizontal Scaling
Vertical Scaling (Scale Up)
Vertical scaling means upgrading the existing machine — adding more CPU cores, RAM, or faster storage to the same server.
Before scaling:
Server: 4 CPU, 16GB RAM
Handles: 1,000 req/s

After vertical scaling (same machine, more power):
Server: 32 CPU, 256GB RAM
Handles: 8,000 req/s
Advantages: No application changes needed. Simple to implement.
Limits: Hardware limits exist — machines cannot grow infinitely. Also, one machine is a single point of failure.
Horizontal Scaling (Scale Out)
Horizontal scaling adds more machines to the pool. Traffic distributes across all machines via a load balancer.
Before:
1 server, 4 CPU, 16GB RAM
Handles: 1,000 req/s

After horizontal scaling:
4 servers, each 4 CPU, 16GB RAM
Load balancer distributes traffic
Handles: 4,000 req/s

Add 4 more servers → Handles: 8,000 req/s (linear scaling)
Advantages: No single-machine hardware ceiling. Fault tolerant (losing one server does not stop the system).
Requires: Stateless application design, shared state in external storage, load balancer.
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Implementation | Simple (upgrade machine) | Complex (stateless app, load balancer) |
| Cost | Expensive for high specs | Cheaper commodity hardware |
| Fault Tolerance | Single point of failure | Redundant nodes |
| Scaling Limit | Hardware ceiling | Virtually unlimited |
| Best For | Databases (early stages), quick fixes | Web servers, API servers, stateless workers |
Stateless Services
Stateless service design is the prerequisite for horizontal scaling. A stateless service stores no user-specific data locally — all state lives in a shared external store (database or cache). Any server can handle any request at any time.
Stateful (cannot scale horizontally easily):
User logs in → Server A stores session in local memory
Next request → Must go to Server A (session not on Server B)
Server A crashes → User loses session, must log in again

Stateless (scales horizontally):
User logs in → Server A issues a JWT, stores nothing locally
Next request → Can go to Server B (JWT contains all needed info)
Server A crashes → Server B handles next request seamlessly
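As a concrete sketch of the stateless pattern, the snippet below issues and verifies a JWT using the PyJWT library (the secret, the claim names, and the 30-minute lifetime are illustrative choices, not prescriptions):

```python
# Minimal stateless-auth sketch using PyJWT (pip install PyJWT).
# SECRET and the 30-minute lifetime are illustrative assumptions.
import datetime
import jwt  # PyJWT

SECRET = "shared-signing-key"  # in practice: loaded from a secrets manager

def issue_token(user_id: str) -> str:
    """Any server can issue this; nothing is stored locally."""
    payload = {
        "sub": user_id,
        "exp": datetime.datetime.now(datetime.timezone.utc)
               + datetime.timedelta(minutes=30),
    }
    return jwt.encode(payload, SECRET, algorithm="HS256")

def handle_request(token: str) -> str:
    """Any server can verify this; the token carries all needed state."""
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    return f"hello, user {claims['sub']}"

token = issue_token("alice")   # "Server A" issues the token
print(handle_request(token))   # "Server B" serves the next request
```

Because the token itself carries the user's identity, any server holding the shared signing key can serve the request.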
Database Scaling Patterns
Read Replicas
The primary database handles all writes. Multiple read replicas handle read queries. This pattern is the most impactful for read-heavy workloads (most web applications).
Write: Application → Primary DB
Read: Application → Read Replica 1
Application → Read Replica 2
Application → Read Replica 3
If 80% of queries are reads:
Adding 4 read replicas → Read capacity 4× without touching write path
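A sketch of how an application can route statements to the primary or a replica, assuming illustrative DSN strings and a simple round-robin policy (a real driver such as psycopg2 would open connections from these DSNs):

```python
# Minimal primary/replica routing sketch. DSNs and the SQL-prefix
# heuristic are illustrative; real systems often route per transaction.
import itertools

class ReadWriteRouter:
    def __init__(self, primary_dsn: str, replica_dsns: list[str]):
        self.primary = primary_dsn
        self.replicas = itertools.cycle(replica_dsns)  # round-robin reads

    def dsn_for(self, sql: str) -> str:
        """Send writes to the primary, spread reads across replicas."""
        is_write = sql.lstrip().split()[0].upper() in {
            "INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP"
        }
        return self.primary if is_write else next(self.replicas)

router = ReadWriteRouter(
    "postgres://primary:5432/app",
    [f"postgres://replica{i}:5432/app" for i in (1, 2, 3)],
)
print(router.dsn_for("SELECT * FROM orders"))    # replica1, then replica2, ...
print(router.dsn_for("INSERT INTO orders ..."))  # always the primary
```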
Connection Pooling
Opening a new database connection for every request is expensive (takes 50-100ms). A connection pool maintains a set of open connections that requests borrow and return.
Without connection pool:
Request 1 → Open connection (50ms) → Query (5ms) → Close connection
Request 2 → Open connection (50ms) → Query (5ms) → Close connection

With connection pool (20 connections kept open):
Request 1 → Borrow connection (0ms) → Query (5ms) → Return to pool
Request 2 → Borrow connection (0ms) → Query (5ms) → Return to pool

Connection pool tools: PgBouncer (PostgreSQL), HikariCP (Java)
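Here is a minimal pool sketch using only the Python standard library; `make_conn` stands in for a real driver call such as `psycopg2.connect`:

```python
# Minimal connection-pool sketch. make_conn is a stand-in for a real
# driver call; a production pool also handles health checks and timeouts.
import queue
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self, make_conn, size: int = 20):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):            # pay the connect cost once, up front
            self._pool.put(make_conn())

    @contextmanager
    def connection(self):
        conn = self._pool.get()          # borrow (blocks if pool exhausted)
        try:
            yield conn
        finally:
            self._pool.put(conn)         # return to the pool, never close

# Demo with a stand-in "connection" object:
pool = ConnectionPool(make_conn=lambda: object(), size=2)
with pool.connection() as conn:
    print("borrowed", id(conn))          # no 50ms handshake on the hot path
```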
Caching as a Scalability Pattern
Caching at multiple layers removes load from databases and downstream services. Layered caching creates a "cache hierarchy" where data is served from the fastest available source.
Cache Hierarchy (fastest to slowest):

L1: In-process memory cache (nanoseconds)
→ Language-level dictionary/map in the application
→ Lost when process restarts
→ Tiny capacity (100MB)

L2: Distributed cache, e.g. Redis (microseconds)
→ Shared across all application servers
→ Survives restarts
→ Medium capacity (10s of GB)

L3: Database read replicas (milliseconds)
→ Read-optimized copies
→ Large capacity (terabytes)

L4: Primary database (milliseconds, but limited connections)
→ Source of truth, only for uncached data
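The read-through lookup across this hierarchy can be sketched as follows; the dict standing in for Redis and the fake `db_lookup` are simulations that keep the example self-contained:

```python
# Simulated read-through cache hierarchy. In production, l2 would be a
# Redis client and db_lookup a real database query.
l1 = {}   # in-process cache: lost on restart, tiny capacity
l2 = {}   # stand-in for a shared Redis cache

def db_lookup(key: str) -> str:
    print(f"  (slow database hit for {key})")
    return f"value-of-{key}"

def get(key: str) -> str:
    if key in l1:               # L1: nanoseconds
        return l1[key]
    if key in l2:               # L2: microseconds over the network
        l1[key] = l2[key]       # promote to L1 for next time
        return l2[key]
    value = db_lookup(key)      # L3/L4: milliseconds
    l2[key] = value             # populate caches on the way back up
    l1[key] = value
    return value

get("user:42")   # misses everywhere, hits the database
get("user:42")   # served from L1, no database load
```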
Async Processing and Queues
Moving time-consuming tasks off the main request path enables web servers to stay responsive under heavy load.
Without async:
HTTP Request → Process everything inline (slow, blocks server thread)
User waits 10 seconds → Poor experience
With async queue:
HTTP Request → Create task in queue → Return 200 immediately
Background worker → Process task (takes 10 seconds, no user impact)
Result stored → User polled/notified when done
Result: Web servers handle 100× more requests per second
(no longer blocked by slow operations)
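A minimal runnable sketch of the pattern using the Python standard library; in production the queue would be an external broker (RabbitMQ, SQS, Kafka) and the worker a separate process, but a thread keeps the example self-contained:

```python
# Simulated async-queue pattern: the HTTP handler enqueues and returns
# immediately; a background worker does the slow work.
import queue
import threading
import time

tasks: queue.Queue = queue.Queue()

def worker():
    while True:
        job = tasks.get()
        time.sleep(2)                   # stand-in for the slow operation
        print(f"finished {job} in the background")
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_http_request(order_id: int) -> str:
    tasks.put(f"generate-invoice-{order_id}")  # enqueue, do not wait
    return "200 OK, task queued"               # respond immediately

print(handle_http_request(12345))   # returns instantly
tasks.join()                        # demo only: wait so the worker print appears
```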
Auto-Scaling
Auto-scaling automatically adjusts the number of running server instances based on real-time metrics like CPU utilization, request count, or queue depth.
Traffic pattern for an e-commerce site:
Midnight: 200 users → 2 servers running
Morning: 5,000 users → Auto-scale to 8 servers
Noon: 10,000 users → Auto-scale to 16 servers
Flash sale: 100,000 users → Auto-scale to 80 servers
Evening: 3,000 users → Scale down to 6 servers
Cost: Pay only for servers actually needed
No need to over-provision "just in case"
Trigger rules (example):
Scale OUT: If average CPU > 70% for 5 minutes → Add 2 instances
Scale IN: If average CPU < 30% for 15 minutes → Remove 1 instance
Cloud auto-scaling tools: AWS Auto Scaling Groups, Google Cloud Managed Instance Groups, Azure VMSS
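The trigger rules above can be expressed as a small decision function; the thresholds and step sizes mirror the example, while the floor and ceiling guardrails are illustrative additions:

```python
# Sketch of an auto-scaling decision step. A real control loop would run
# this periodically against metrics from a monitoring system.
def desired_change(avg_cpu: float, minutes_sustained: int,
                   current: int, floor: int = 2, ceiling: int = 80) -> int:
    """Return how many instances to add (+) or remove (-)."""
    if avg_cpu > 0.70 and minutes_sustained >= 5:
        return min(2, ceiling - current)    # scale OUT by 2, capped at ceiling
    if avg_cpu < 0.30 and minutes_sustained >= 15:
        return max(-1, floor - current)     # scale IN by 1, never below floor
    return 0

print(desired_change(avg_cpu=0.85, minutes_sustained=6, current=8))   # +2
print(desired_change(avg_cpu=0.20, minutes_sustained=20, current=8))  # -1
print(desired_change(avg_cpu=0.20, minutes_sustained=20, current=2))  # 0 (at floor)
```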
Content Delivery and Edge Computing
Serving content from geographically distributed edge servers reduces origin server load and delivers lower latency globally. Edge computing extends this by running business logic at CDN edge nodes — near the user rather than in a central data center.
Traditional:
User in Mumbai → Request → US origin server → Response
Latency: 250ms each way = 500ms round trip

With CDN + Edge Compute:
User in Mumbai → Nearest CDN node in Mumbai (10ms away)
Edge node runs logic → Personalized response from 10ms away
Latency: 20ms round trip
Origin server load: Reduced 90%+ (CDN absorbs static/cacheable traffic)
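A simulated sketch of edge-node logic: cacheable paths are absorbed at the edge, and a trivial piece of personalization runs without contacting the origin. All paths, names, and latencies here are illustrative:

```python
# Simulated edge handler: serve from the local edge cache when possible,
# run cheap logic at the edge, and escape to the origin only on a miss.
edge_cache = {"/logo.png": b"<cached bytes>"}

def fetch_from_origin(path: str) -> bytes:
    print(f"  (500ms round trip to origin for {path})")
    return b"<origin bytes>"

def edge_handler(path: str, country: str) -> bytes:
    if path in edge_cache:                 # absorbed at the edge: ~20ms
        return edge_cache[path]
    if path == "/greeting":                # logic at the edge, no origin trip
        return f"hello from your nearest node, {country}!".encode()
    body = fetch_from_origin(path)         # only uncached traffic escapes
    edge_cache[path] = body
    return body

edge_handler("/logo.png", "IN")    # edge cache hit
edge_handler("/greeting", "IN")    # computed at the edge
```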
Database Denormalization
Normalized databases minimize data redundancy by splitting data across tables and recombining it at query time with joins. But joins are expensive at scale. Denormalization intentionally stores redundant data to eliminate joins and speed up reads.
Normalized (3 tables, requires JOINs):

SELECT u.name, p.title, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE o.id = 12345;

Denormalized (1 table, no JOIN needed): the orders table stores everything:

| order_id | user_name | user_email | product_title | product_price | order_total |
|---|---|---|---|---|---|
| 12345 | Alice | alice@... | Laptop | 999 | 999 |

Read: a fast single-table lookup
Write: must update user_name in every order row if the name changes (the trade-off)
CQRS: Command Query Responsibility Segregation
CQRS separates the read model and write model of an application. Write operations (commands) use a normalized, write-optimized data store. Read operations (queries) use a denormalized, read-optimized data store (often a pre-built materialized view).
Write side (Command):
POST /placeOrder → Order Service → Write to normalized Order DB
Read side (Query):
GET /order/12345 → Query Service → Read from pre-computed Orders View
(Already joined and denormalized, returns instantly)
Sync: Message queue propagates writes to read-side
Order created → Event published → Read model updated
Benefit: Each side optimized independently for its workload
Used by: Event-sourced systems, high-read e-commerce platforms
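An in-memory sketch of the CQRS flow, with dicts and a list standing in for the normalized write store, the denormalized read view, and the message broker (the entity names are illustrative):

```python
# Simulated CQRS: commands write the normalized store and publish events;
# a projector builds the read view; queries are single key lookups.
orders_db = {}     # write side: normalized source of truth
orders_view = {}   # read side: pre-joined, query-optimized view
event_queue = []   # stand-in for a real message broker

def place_order(order_id: int, user: str, product: str, total: int):
    """Command: write normalized, then publish an event."""
    orders_db[order_id] = {"user": user, "product": product, "total": total}
    event_queue.append(("OrderCreated", order_id))

def project_events():
    """Consumer: build the denormalized read model from events."""
    while event_queue:
        _, order_id = event_queue.pop(0)
        o = orders_db[order_id]
        orders_view[order_id] = f"{o['user']} bought {o['product']} (${o['total']})"

def get_order(order_id: int) -> str:
    """Query: a single key lookup, no joins."""
    return orders_view[order_id]

place_order(12345, "Alice", "Laptop", 999)
project_events()            # in production: an async consumer process
print(get_order(12345))
```

Note the inherent trade-off this makes visible: the read model is eventually consistent, so a query issued before the projector runs sees the old view.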
Event Sourcing
Instead of storing the current state of data, event sourcing stores the full history of events that led to the current state. The current state is reconstructed by replaying events.
Traditional (current state): a bank account table:

| account_id | balance |
|---|---|
| 1 | $350 |

Event sourcing (event log):

| event_id | account | event | amount | timestamp |
|---|---|---|---|---|
| 1 | acc-1 | OPENED | $500 | 2024-01-01 |
| 2 | acc-1 | WITHDRAWAL | -$200 | 2024-01-05 |
| 3 | acc-1 | DEPOSIT | $100 | 2024-01-10 |
| 4 | acc-1 | WITHDRAWAL | -$50 | 2024-01-12 |

Current balance = 500 - 200 + 100 - 50 = $350 (replay events)

Benefits: Full audit trail, time travel queries, rebuild any view, undo operations
Drawback: Reading current state requires replaying many events (mitigated with snapshots)
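A sketch of replaying the event log above, plus a snapshot that avoids replaying from the very first event each time:

```python
# Replay sketch for the event log above. The snapshot cadence (here,
# after 2 events) is an illustrative choice.
events = [
    ("OPENED", 500),
    ("WITHDRAWAL", -200),
    ("DEPOSIT", 100),
    ("WITHDRAWAL", -50),
]

def replay(events, starting_balance: int = 0) -> int:
    """Fold the event amounts to reconstruct the current state."""
    balance = starting_balance
    for _, amount in events:
        balance += amount
    return balance

print(replay(events))   # 350: full replay from the beginning

# Snapshot mitigation: persist the state as of event 2, then replay
# only the events that came after it.
snapshot = {"balance": replay(events[:2]), "upto": 2}
print(replay(events[snapshot["upto"]:], snapshot["balance"]))  # 350, 2 events replayed
```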
The Scale Cube
The Scale Cube (from "The Art of Scalability") defines three axes for scaling an application:
Y-axis (Functional decomposition)
↑ Split by business capability
| = Microservices
|
+----------→ X-axis (Horizontal duplication)
/ Add more identical instances
/ = Load balancing
↗
Z-axis (Data partitioning)
Split by data subset
= Sharding
Combining all three axes:
X: 10 identical Order Service instances (load balanced)
Y: Separate User, Product, Order, Payment services
Z: Each service's database sharded by customer ID
= Massively scalable system
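As a small illustration of the Z-axis, the function below routes each customer to a shard by hashing the customer ID. The shard names and count are illustrative; note that naive modulo hashing reshuffles keys whenever the shard count changes, which consistent hashing avoids:

```python
# Z-axis sketch: deterministic shard routing by hashed customer ID.
import hashlib

SHARDS = [f"orders-shard-{i}" for i in range(4)]  # illustrative shard names

def shard_for(customer_id: str) -> str:
    """Hash the key so customers spread evenly across shards."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))   # the same customer always lands
print(shard_for("customer-42"))   # on the same shard
```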
Summary
Scalability is designed into a system through multiple complementary patterns. Horizontal scaling adds identical nodes behind a load balancer. Read replicas and connection pooling scale the database layer. Caching removes load from downstream systems. Async queues keep request handlers fast. Auto-scaling matches capacity to demand in real time. CQRS and event sourcing enable the read and write paths to optimize independently. No single pattern is sufficient — highly scalable systems combine all of these thoughtfully.
