SRE Capacity Planning and Traffic Management

A water utility sizes its reservoirs for peak summer demand, not just for today's usage. Its planners study historical patterns, account for population growth, and build in a safety margin. SRE teams apply the same foresight to software infrastructure. They call it capacity planning: figuring out how much of each resource the system needs before it runs out, not after.

What Is Capacity Planning

Capacity planning is the process of predicting future resource needs and ensuring the system has enough to handle projected demand — with a safety buffer. Resources include CPU, memory, disk storage, network bandwidth, database connections, and API rate limits.

Capacity Planning Cycle:
-----------------------
MEASURE current usage and trends
          ↓
FORECAST future demand based on business growth
          ↓
PLAN the resources needed to meet that demand
          ↓
PROVISION resources ahead of projected need
          ↓
MONITOR actual vs predicted usage
          ↓
REPEAT each quarter
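The FORECAST and PLAN steps reduce to a small calculation. The sketch below is a minimal illustration. It assumes demand grows at a steady compound monthly rate and that each server handles a known, fixed request rate. The function names and all input numbers are hypothetical.

```python
import math


def forecast_peak(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Project peak demand, assuming steady compound monthly growth."""
    return current_peak_rps * (1 + monthly_growth) ** months


def servers_needed(forecast_rps: float, per_server_rps: float, buffer: float) -> int:
    """Servers required to serve the forecast plus a safety buffer (0.30 = 30%)."""
    required_rps = forecast_rps * (1 + buffer)
    # Round up: a fractional server still has to be provisioned as a whole one.
    return math.ceil(required_rps / per_server_rps)


# Hypothetical inputs: 1,000 req/s peak today, 5% monthly growth,
# a one-quarter horizon, 250 req/s per server, and a 30% buffer.
peak_next_quarter = forecast_peak(1_000, 0.05, 3)    # about 1,157.6 req/s
print(servers_needed(peak_next_quarter, 250, 0.30))  # 7
```

Rounding up with math.ceil is deliberate. Six servers would provide only 1,500 req/s, which is slightly below the roughly 1,505 req/s that the forecast plus buffer calls for.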

Why Teams Get Capacity Wrong

Capacity failures are almost always predictable in retrospect. The database that ran out of storage had been filling steadily for months. The web server that crashed under load had been running at 85 percent CPU for weeks. The problem was not missing data. It was that no one was watching the data over the right time horizon.

Two Common Failure Patterns

Pattern 1: The Slow Burn
   Storage usage grows 3% per month.
   Nobody notices for six months.
   Month 7: disk is full. Service crashes. Data loss risk.

Pattern 2: The Surprise Spike
   Marketing runs a promotional campaign.
   Traffic spikes to 10x the normal rate for one day.
   Nobody told the SRE team.
   Server crashes. Campaign is embarrassingly broken.
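A simple trend check can catch the slow burn months ahead. The sketch below is minimal. It assumes disk usage is sampled once a month as a percentage of total capacity and grows roughly linearly. The three-month alert horizon and the sample values are hypothetical.

```python
def monthly_growth(samples_pct: list[float]) -> float:
    """Least-squares slope of usage, in percentage points per month."""
    n = len(samples_pct)
    if n < 2:
        raise ValueError("need at least two monthly samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_pct) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_pct))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def months_until_full(samples_pct: list[float]) -> float:
    """Months until usage reaches 100%, extrapolating the linear trend."""
    slope = monthly_growth(samples_pct)
    if slope <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return (100.0 - samples_pct[-1]) / slope


def should_alert(samples_pct: list[float], horizon_months: float = 3.0) -> bool:
    """Alert while there is still time to act, not once the disk is full."""
    return months_until_full(samples_pct) <= horizon_months


usage = [82.0, 85.0, 88.0, 91.0, 94.0]  # hypothetical: +3 points per month
print(monthly_growth(usage))     # 3.0
print(months_until_full(usage))  # 2.0
print(should_alert(usage))       # True
```

Linear extrapolation is the simplest choice. Usage that grows by a percentage of itself each month would need a compound model instead.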

Load Testing and Benchmarking

Load testing measures how the system behaves under increasing traffic before that traffic arrives in production. Teams use load testing to find the breaking point: at what traffic level does the system start degrading, and at what level does it fail completely?

Key Load Test Metrics

Throughput: Maximum requests per second the system handles successfully.
Latency degradation point: The traffic level at which response times start increasing significantly.
Failure threshold: The traffic level at which errors start appearing.
Recovery time: How quickly latency and error rates return to normal once a traffic spike subsides.

Load Test Results: Checkout Service
-------------------------------------
Traffic Level   Avg Latency   Error Rate
500 req/s       85ms          0.0%        ← Normal operating range
1,000 req/s     92ms          0.0%        ← Healthy headroom
2,000 req/s     180ms         0.0%        ← Latency climbing
3,000 req/s     850ms         0.3%        ← Degraded — SLO at risk
4,000 req/s     3,200ms       8.2%        ← Near failure
4,500 req/s     Timeouts      35%         ← Practical limit

Conclusion: Current capacity limit is ~3,500 req/s before SLO breach.
Recommendation: Scale to support 7,000 req/s, a 2x safety margin over the current limit.
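The conclusion can be reached mechanically from the table. The sketch below encodes the checkout results and estimates the limit as the midpoint between the highest level that met the SLO and the first level that did not. The SLO thresholds (1,000 ms average latency, 1% errors) are hypothetical; the lesson does not state the actual SLO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadTestResult:
    rps: int                         # offered load, requests per second
    avg_latency_ms: Optional[float]  # None when requests timed out
    error_rate: float                # fraction: 0.003 means 0.3%


# Hypothetical SLO thresholds, chosen for illustration only.
MAX_AVG_LATENCY_MS = 1_000.0
MAX_ERROR_RATE = 0.01


def meets_slo(r: LoadTestResult) -> bool:
    if r.avg_latency_ms is None:  # timeouts always violate the SLO
        return False
    return r.avg_latency_ms <= MAX_AVG_LATENCY_MS and r.error_rate <= MAX_ERROR_RATE


def estimate_limit(results: list[LoadTestResult]) -> float:
    """Midpoint between the highest passing level and the first failing one."""
    last_pass = None
    for res in sorted(results, key=lambda res: res.rps):
        if meets_slo(res):
            last_pass = res.rps
        elif last_pass is None:
            raise ValueError("no tested traffic level met the SLO")
        else:
            return (last_pass + res.rps) / 2
    raise ValueError("SLO never breached; test at higher traffic levels")


CHECKOUT = [  # the checkout-service table above
    LoadTestResult(500, 85, 0.0),
    LoadTestResult(1_000, 92, 0.0),
    LoadTestResult(2_000, 180, 0.0),
    LoadTestResult(3_000, 850, 0.003),
    LoadTestResult(4_000, 3_200, 0.082),
    LoadTestResult(4_500, None, 0.35),
]

limit = estimate_limit(CHECKOUT)
print(limit)      # 3500.0
print(limit * 2)  # 7000.0, the target with a 2x safety margin
```

The true breach point lies somewhere between 3,000 and 4,000 req/s. Running extra test steps inside that range would narrow the estimate.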

Traffic Management Techniques

Capacity planning ensures resources exist. Traffic management controls how demand reaches those resources — smoothing out spikes, shedding excess load, and protecting critical services when the system is under pressure.

Load Balancing

A load balancer distributes incoming requests across multiple servers so no single server is overwhelmed. It also detects unhealthy servers and stops sending traffic to them until they recover.

Without Load Balancer:             With Load Balancer:
All 5,000 requests → Server A      1,667 requests → Server A
Server A crashes                   1,667 requests → Server B
                                   1,666 requests → Server C
                                   All servers healthy ✅
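Plain round-robin produces the even split in the diagram. The sketch below is minimal. It assumes an external health check reports server status to the balancer; mark_healthy and mark_unhealthy are hypothetical stand-ins for that mechanism. Real load balancers also offer other strategies, such as weighted and least-connections balancing.

```python
from collections import Counter


class RoundRobinBalancer:
    """Round-robin load balancer that skips servers marked unhealthy."""

    def __init__(self, servers: list[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(servers)
        self._next = 0  # index of the next server in the rotation

    def mark_unhealthy(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_healthy(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, checking each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["Server A", "Server B", "Server C"])
counts = Counter(lb.pick() for _ in range(5_000))
print(dict(counts))  # {'Server A': 1667, 'Server B': 1667, 'Server C': 1666}

lb.mark_unhealthy("Server B")  # traffic now alternates between A and C
```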

Auto-Scaling

Auto-scaling automatically adds or removes servers based on current demand. When traffic increases, new servers start within minutes. When traffic drops, excess servers shut down to save cost.

Auto-Scaling in Action:

9 AM:   300 req/s → 3 servers running
12 PM:  1,200 req/s → 8 servers running (auto-scaled up)
3 PM:   800 req/s → 6 servers (scaled back)
6 PM:   2,000 req/s → 12 servers (peak traffic)
10 PM:  150 req/s → 2 servers (night hours)
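Auto-scalers commonly use a target-tracking rule: choose a target load per server and run enough servers to stay at or under it. The Kubernetes Horizontal Pod Autoscaler, for example, documents the ratio desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). When the metric is the average requests per server, that reduces to ceil(total load / target per server). The sketch below is minimal and assumes a hypothetical target of 150 req/s per server, with bounds of 2 to 12 servers.

```python
import math


def desired_servers(total_rps: float, target_rps_per_server: float,
                    min_servers: int, max_servers: int) -> int:
    """Enough servers to keep each one at or below the target load,
    clamped to the configured minimum and maximum fleet size."""
    wanted = math.ceil(total_rps / target_rps_per_server)
    return max(min_servers, min(max_servers, wanted))


# Hypothetical policy: 150 req/s per server, between 2 and 12 servers.
for label, rps in [("12 PM", 1_200), ("3 PM", 800), ("6 PM", 2_000), ("10 PM", 150)]:
    print(label, desired_servers(rps, 150, min_servers=2, max_servers=12))
# 12 PM 8
# 3 PM 6
# 6 PM 12   (14 wanted, capped at max_servers)
# 10 PM 2   (1 wanted, raised to min_servers)
```

With these assumed settings, the sketch reproduces the 12 PM through 10 PM rows of the timeline above. Production auto-scalers also apply cooldown or stabilization windows so that short-lived fluctuations do not make the fleet thrash.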

Rate Limiting

Rate limiting restricts how many requests a single client (user, API key, or IP address) can make in a given time window. It prevents one misbehaving client from consuming all available capacity and degrading service for everyone else.
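One common rate-limiting algorithm is the token bucket. Each client may burst up to a fixed number of requests, and its allowance refills at a steady rate. Fixed-window and sliding-window counters are alternatives. The sketch below is a minimal in-memory version that assumes a single process; deployments with many instances usually share counters through a data store. The capacity, refill rate, and client IDs are hypothetical.

```python
import time
from typing import Callable


class TokenBucketLimiter:
    """Per-client token bucket: each client may burst up to `capacity`
    requests and regains `rate_per_sec` tokens per second."""

    def __init__(self, capacity: float, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate_per_sec
        self.clock = clock
        # client id -> (tokens remaining, time of last update)
        self.buckets: dict[str, tuple[float, float]] = {}

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(client_id, (self.capacity, now))
        # Refill for the time elapsed since this client's last request.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client_id] = (tokens - 1, now)
            return True
        self.buckets[client_id] = (tokens, now)
        return False


fake_now = [0.0]  # fake clock so the example is deterministic
limiter = TokenBucketLimiter(capacity=5, rate_per_sec=1, clock=lambda: fake_now[0])
print([limiter.allow("api-key-123") for _ in range(7)])
# [True, True, True, True, True, False, False]
print(limiter.allow("api-key-456"))  # True: other clients are unaffected
fake_now[0] = 2.0
print(limiter.allow("api-key-123"))  # True: two tokens refilled after 2 s
```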

Load Shedding

When the system is overwhelmed and cannot serve all requests, load shedding deliberately rejects lower-priority requests to protect the most important ones. A payment service under extreme load might shed requests to its reporting API while protecting checkout transactions.

Priority Tiers Under Load:
--------------------------
TIER 1 (protect always): Checkout, payment processing, user authentication
TIER 2 (shed second):    Analytics reporting, non-critical API calls
TIER 3 (shed first):     Dev and test traffic, internal batch jobs
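The tiers can drive a shedding decision from a single load signal, with the lowest-priority tier rejected first. The sketch below is minimal. It assumes utilization (for example, CPU or in-flight requests as a fraction of a limit) is measured elsewhere. The 70% and 85% thresholds and the tier names are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 1     # checkout, payments, authentication
    STANDARD = 2     # analytics reporting, non-critical API calls
    BEST_EFFORT = 3  # dev/test traffic, internal batch jobs


# Hypothetical thresholds: utilization above which each tier is rejected.
SHED_ABOVE = {
    Tier.BEST_EFFORT: 0.70,
    Tier.STANDARD: 0.85,
    Tier.CRITICAL: None,  # never shed
}


def should_shed(tier: Tier, utilization: float) -> bool:
    """True if a request of this tier should be rejected at this load."""
    threshold = SHED_ABOVE[tier]
    return threshold is not None and utilization > threshold


for load in (0.60, 0.80, 0.95):
    kept = [t.name for t in Tier if not should_shed(t, load)]
    print(f"{load:.0%} utilization -> serving {kept}")
# 60% utilization -> serving ['CRITICAL', 'STANDARD', 'BEST_EFFORT']
# 80% utilization -> serving ['CRITICAL', 'STANDARD']
# 95% utilization -> serving ['CRITICAL']
```

A rejected request should fail fast with a clear error, such as HTTP 503, so that clients can back off instead of waiting.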

Circuit Breakers

A circuit breaker monitors calls to a downstream dependency. When that dependency starts failing above a threshold, the circuit breaker "trips" and stops sending requests to it — returning a fast error instead of waiting for a slow timeout. This prevents a single failing service from cascading through the entire system.

Circuit Breaker States:
------------------------
CLOSED:    Normal; requests pass through
OPEN:      Dependency failing; requests rejected immediately until a
           cooldown period expires
HALF-OPEN: Testing; a small number of trial requests are let through
           to check whether the dependency has recovered (success closes
           the circuit, failure reopens it)
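The three states map onto a small state machine. The sketch below is minimal and makes several simplifying assumptions. It trips after a run of consecutive failures rather than tracking a failure rate. In HALF-OPEN it lets a single trial call through, which is only safe for single-threaded use; concurrent callers would need a lock. The class names and the threshold and timeout values are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is OPEN (the fast error)."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures that trip it
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency unavailable; failing fast")
            self.state = "HALF-OPEN"  # cooldown over: allow a trial request
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self) -> None:
        self.failures += 1
        # A failed trial reopens immediately; otherwise trip at the threshold.
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=10.0, clock=lambda: now[0])


def flaky() -> str:
    raise TimeoutError("downstream timed out")


for _ in range(3):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass
print(breaker.state)               # OPEN

now[0] = 10.0                      # cooldown elapsed
print(breaker.call(lambda: "ok"))  # ok (the HALF-OPEN trial succeeds)
print(breaker.state)               # CLOSED
```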

Key Points

Capacity planning predicts future resource needs before the system runs out.
Load testing reveals the breaking point before production traffic finds it.
Load balancing, auto-scaling, and rate limiting distribute and regulate traffic.
Load shedding protects the most important operations when the system is overwhelmed.
Circuit breakers prevent cascading failures when a downstream dependency degrades.
