SRE Capacity Planning and Traffic Management
A water utility plans its reservoir for summer demand — not just for today's usage. They study historical patterns, account for population growth, and build in safety margin. SRE teams apply the same foresight to software infrastructure. They call it capacity planning: figuring out how much resource the system needs before it runs out, not after.
What Is Capacity Planning
Capacity planning is the process of predicting future resource needs and ensuring the system has enough to handle projected demand — with a safety buffer. Resources include CPU, memory, disk storage, network bandwidth, database connections, and API rate limits.
Capacity Planning Cycle:
-----------------------
MEASURE current usage and trends
↓
FORECAST future demand based on business growth
↓
PLAN the resources needed to meet that demand
↓
PROVISION resources ahead of projected need
↓
MONITOR actual vs predicted usage
↓
REPEAT each quarter
Why Teams Get Capacity Wrong
Capacity failures are almost always predictable in retrospect. The database that ran out of storage was filling steadily for months. The web server that crashed under load had been at 85 percent CPU for weeks. The problem was not missing data; it was that no one was watching the data over a long enough time horizon.
Two Common Failure Patterns
Pattern 1: The Slow Burn
Storage usage grows 3% per month. Nobody notices for six months. Month 7: disk is full. Service crashes. Data loss risk.
Pattern 2: The Surprise Spike
Marketing runs a promotional campaign. Traffic spikes 10x the normal rate for one day. Nobody told the SRE team. Server crashes. Campaign is embarrassingly broken.
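The slow burn is easy to forecast once someone does the arithmetic. The sketch below estimates how many months remain before a disk fills, assuming compound monthly growth. The function name and the 82% starting point are illustrative assumptions; the text above gives only the 3% growth rate.

```python
import math


def months_until_full(used_fraction: float, monthly_growth: float) -> int:
    """Return the number of whole months until usage reaches 100% of capacity.

    used_fraction:  current usage as a fraction of capacity (0.82 means 82% full).
    monthly_growth: compound growth per month (0.03 means 3%).
    """
    if not 0 < used_fraction < 1:
        raise ValueError("used_fraction must be between 0 and 1")
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    # Solve used_fraction * (1 + g)^n >= 1 for the smallest whole n.
    return math.ceil(math.log(1 / used_fraction) / math.log(1 + monthly_growth))
```

With the assumed starting point, months_until_full(0.82, 0.03) returns 7, which is the kind of early warning a quarterly capacity review could act on.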
Load Testing and Benchmarking
Load testing measures how the system behaves under increasing traffic before that traffic arrives in production. Teams use load testing to find the breaking point: at what traffic level does the system start degrading, and at what level does it fail completely?
Key Load Test Metrics
- Throughput: Maximum requests per second the system handles successfully.
- Latency degradation point: The traffic level at which response times start increasing significantly.
- Failure threshold: The traffic level at which errors start appearing.
- Recovery time: How quickly the system returns to normal behavior once a traffic spike subsides.
Load Test Results: Checkout Service
-------------------------------------
Traffic Level    Avg Latency    Error Rate
500 req/s        85ms           0.0%     ← Normal operating range
1,000 req/s      92ms           0.0%     ← Healthy headroom
2,000 req/s      180ms          0.0%     ← Latency climbing
3,000 req/s      850ms          0.3%     ← Degraded: SLO at risk
4,000 req/s      3,200ms        8.2%     ← Near failure
4,500 req/s      Timeouts       35%      ← Practical limit
Conclusion: Current capacity limit is ~3,500 req/s before SLO breach.
Recommendation: Scale to support 7,000 req/s to allow 2x safety margin.
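One way to read results like these is to bracket the SLO breach between the last passing step and the first failing step. The sketch below does that for the checkout numbers above. The SLO thresholds (1,000 ms average latency, 1% errors) are assumptions chosen for illustration, since the text does not state the actual SLO, and all names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class LoadTestStep(NamedTuple):
    rps: int
    avg_latency_ms: Optional[float]  # None when requests timed out
    error_rate_pct: float


# Figures copied from the checkout service results above.
RESULTS = [
    LoadTestStep(500, 85, 0.0),
    LoadTestStep(1000, 92, 0.0),
    LoadTestStep(2000, 180, 0.0),
    LoadTestStep(3000, 850, 0.3),
    LoadTestStep(4000, 3200, 8.2),
    LoadTestStep(4500, None, 35.0),
]


def meets_slo(step: LoadTestStep, max_latency_ms: float, max_error_pct: float) -> bool:
    """A step passes only if it returned a latency and both limits hold."""
    return (
        step.avg_latency_ms is not None
        and step.avg_latency_ms <= max_latency_ms
        and step.error_rate_pct <= max_error_pct
    )


def slo_bracket(
    results: List[LoadTestStep], max_latency_ms: float, max_error_pct: float
) -> Tuple[Optional[int], Optional[int]]:
    """Return (last passing rps, first failing rps), walking up the traffic levels."""
    last_pass = None
    for step in sorted(results, key=lambda s: s.rps):
        if not meets_slo(step, max_latency_ms, max_error_pct):
            return last_pass, step.rps
        last_pass = step.rps
    return last_pass, None  # no tested level breached the SLO
```

Under the assumed thresholds, the breach falls between 3,000 and 4,000 req/s, which is consistent with the ~3,500 req/s estimate in the conclusion. A stricter SLO would move the bracket lower.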
Traffic Management Techniques
Capacity planning ensures resources exist. Traffic management controls how demand reaches those resources — smoothing out spikes, shedding excess load, and protecting critical services when the system is under pressure.
Load Balancing
A load balancer distributes incoming requests across multiple servers so no single server is overwhelmed. It also detects unhealthy servers and stops sending traffic to them until they recover.
Without Load Balancer:
All 5,000 requests → Server A
Server A crashes

With Load Balancer:
1,667 requests → Server A
1,667 requests → Server B
1,666 requests → Server C
All servers healthy ✅
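A minimal sketch of the idea, assuming a simple round-robin policy (one of several common distribution strategies; the text does not name a specific one). The class and method names are hypothetical, and a real load balancer would run health checks itself rather than being told which servers are unhealthy.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Rotate through servers in order, skipping any marked unhealthy."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        self._healthy: Set[str] = set(self._servers)
        self._next = 0

    def mark_unhealthy(self, server: str) -> None:
        self._healthy.discard(server)

    def mark_healthy(self, server: str) -> None:
        if server in self._servers:
            self._healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

Routing 5,000 requests through three healthy servers with this sketch gives the 1,667 / 1,667 / 1,666 split shown above. After mark_unhealthy("B"), traffic rotates between the remaining two servers until B is marked healthy again.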
Auto-Scaling
Auto-scaling automatically adds or removes servers based on current demand. When traffic increases, new servers start within minutes. When traffic drops, excess servers shut down to save cost.
Auto-Scaling in Action:
9 AM: 300 req/s → 3 servers running
12 PM: 1,200 req/s → 8 servers running (auto-scaled up)
3 PM: 800 req/s → 6 servers (scaled back)
6 PM: 2,000 req/s → 12 servers (peak traffic)
10 PM: 150 req/s → 2 servers (night hours)
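The core scaling decision can be sketched as a single calculation: divide current traffic by a per-server target and clamp the result to a minimum and maximum fleet size. The 170 req/s target and the 2 to 20 server bounds below are assumptions for illustration. This will not reproduce every row of the timeline above, because real auto-scalers also apply cooldowns and other policies.

```python
import math


def desired_servers(
    current_rps: float,
    target_rps_per_server: float,
    min_servers: int = 2,
    max_servers: int = 20,
) -> int:
    """Return how many servers are needed to keep each one at or below its target load."""
    if target_rps_per_server <= 0:
        raise ValueError("target_rps_per_server must be positive")
    needed = math.ceil(current_rps / target_rps_per_server)
    # Never scale below the floor (availability) or above the ceiling (cost).
    return max(min_servers, min(max_servers, needed))
```

The minimum keeps a baseline running through quiet hours, and the maximum caps spend if traffic (or a bug in the metric) runs away.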
Rate Limiting
Rate limiting restricts how many requests a single client (user, API key, or IP address) can make in a given time window. It prevents one misbehaving client from consuming all available capacity and degrading service for everyone else.
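As a rough illustration, the sketch below implements a sliding-window limiter keyed by client ID. It is one of several common algorithms (token bucket and fixed window are others), and the class name, parameters, and injectable clock are assumptions made for the example.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowRateLimiter:
    """Allow each client at most `limit` requests per `window_seconds`."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = self._clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Because each client has its own queue of timestamps, one client exhausting its quota has no effect on any other client's requests.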
Load Shedding
When the system is overwhelmed and cannot serve all requests, load shedding deliberately rejects lower-priority requests to protect the most important ones. A payment service under extreme load might shed requests to its reporting API while protecting checkout transactions.
Priority Tiers Under Load:
--------------------------
TIER 1 (protect always): Checkout, payment processing, user authentication
TIER 2 (shed second): Analytics reporting, non-critical API calls
TIER 3 (shed first): Dev and test traffic, internal batch jobs
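A tiered admission check might look like the sketch below. It assumes the lowest-priority tier (the highest tier number) is shed before the next one up. The request type names, the utilization signal, and the 80% and 95% thresholds are all hypothetical values chosen for illustration.

```python
# Hypothetical mapping from request type to priority tier (1 = most important).
TIER_BY_REQUEST_TYPE = {
    "checkout": 1,
    "payment": 1,
    "auth": 1,
    "analytics_report": 2,
    "non_critical_api": 2,
    "dev_test": 3,
    "batch_job": 3,
}


def max_tier_admitted(
    utilization: float, shed_tier3_at: float = 0.80, shed_tier2_at: float = 0.95
) -> int:
    """Return the lowest-priority tier still admitted at this utilization (0.0 to 1.0)."""
    if utilization >= shed_tier2_at:
        return 1  # extreme load: only tier 1 gets through
    if utilization >= shed_tier3_at:
        return 2  # heavy load: tier 3 is shed
    return 3  # normal load: everything is admitted


def should_admit(request_type: str, utilization: float) -> bool:
    """Decide whether to serve a request or reject it to protect higher tiers."""
    # Unknown request types are treated as lowest priority (an assumption).
    tier = TIER_BY_REQUEST_TYPE.get(request_type, 3)
    return tier <= max_tier_admitted(utilization)
```

Tier 1 traffic is admitted at every utilization level in this sketch, so checkout keeps working while reporting and batch requests are rejected early.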
Circuit Breakers
A circuit breaker monitors calls to a downstream dependency. When that dependency starts failing above a threshold, the circuit breaker "trips" and stops sending requests to it — returning a fast error instead of waiting for a slow timeout. This prevents a single failing service from cascading through the entire system.
Circuit Breaker States:
------------------------
CLOSED: Normal — requests pass through
OPEN: Dependency failing — requests rejected immediately
HALF-OPEN: Testing — a small percentage of requests let through
to check if the dependency has recovered
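The three states can be sketched as a small state machine. This version makes two simplifying assumptions that differ from production libraries: it trips on a count of consecutive failures rather than a failure rate, and in the half-open state it lets one trial request through after a timeout rather than a percentage of traffic. All names and default values are hypothetical.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable, failing fast")
            self.state = "HALF_OPEN"  # timeout elapsed: allow a trial request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "CLOSED"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial request reopens immediately; otherwise wait for the threshold.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

Callers wrap each downstream call, for example breaker.call(fetch_inventory, item_id) with a hypothetical fetch_inventory function, and handle CircuitOpenError by returning a fallback or a fast error.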
Key Points
- Capacity planning predicts future resource needs before the system runs out.
- Load testing reveals the breaking point before production traffic finds it.
- Load balancing, auto-scaling, and rate limiting distribute and regulate traffic.
- Load shedding protects the most important operations when the system is overwhelmed.
- Circuit breakers prevent cascading failures when a downstream dependency degrades.
