SD Rate Limiting and Throttling
Rate limiting is the practice of controlling how many requests a client can make to a service within a defined time period. Throttling is the practice of slowing down requests that exceed a defined limit (in casual usage the term also covers outright rejection). Together, these mechanisms protect services from overload, prevent abuse, and ensure fair access for all users.
Think of a popular amusement park ride. The ride allows only 30 people every 5 minutes. If 100 people rush in at once, 70 wait in line or leave. The queue is rate limiting — it controls the flow, protects the ride's capacity, and ensures everyone gets a fair experience.
Why Rate Limiting Matters
- Mitigates Denial of Service (DoS) attacks: Strict limits stop a flood of requests from exhausting application resources, though large volumetric attacks also require network-level defenses.
- Prevents API abuse: A poorly behaved client (bug or bad actor) cannot monopolize server resources.
- Ensures fair usage: No single user consumes all available capacity, leaving others with slow or no service.
- Controls costs: Third-party APIs charge per request. Rate limiting prevents runaway costs from bugs or abuse.
- Protects downstream services: A rate-limited API protects databases and dependent services from cascading overload.
Rate Limiting Algorithms
1. Fixed Window Counter
A counter tracks requests within a fixed time window. When the counter reaches the limit, additional requests are rejected until the next window starts.
```
Limit: 100 requests per minute
Window: 12:00:00 → 12:01:00

12:00:01 → Request 1   (counter: 1)
12:00:30 → Request 50  (counter: 50)
12:00:59 → Request 100 (counter: 100) ← LIMIT REACHED
12:00:59 → Request 101 → REJECTED (429 Too Many Requests)
12:01:00 → New window! Counter resets to 0
12:01:01 → Request 1   (counter: 1) ← Allowed again
```
Problem — Window edge burst:
```
User sends 100 requests at 12:00:59 (last second of window)
User sends 100 requests at 12:01:00 (first second of next window)
Result: 200 requests in 2 seconds! Double the intended limit.
```
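A minimal single-process sketch of the fixed window counter, assuming an in-memory counter (the class and method names below are illustrative, not from a specific library):

```python
import time

class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window_seconds`."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window = -1  # index of the current fixed window
        self.count = 0    # requests seen in the current window

    def allow(self) -> bool:
        window = int(time.time() // self.window_seconds)
        if window != self.window:
            # A new window has begun: reset the counter.
            self.window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # over the limit: the caller should return 429

limiter = FixedWindowLimiter(limit=100, window_seconds=60)
print(limiter.allow())  # True until 100 requests land in the same minute
```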
2. Sliding Window Log
For every request, the system records its exact timestamp in a log. To decide if a request is allowed, it counts how many timestamps exist within the last N seconds. Most accurate but memory-intensive.
```
Limit: 5 requests per 10 seconds
Log: [12:00:01, 12:00:03, 12:00:07, 12:00:09, 12:00:10]

New request at 12:00:11:
→ Remove timestamps older than 12:00:01 (10 seconds ago)
→ Count remaining: 5 (all 5 are within window)
→ Count = limit → REJECT

New request at 12:00:12:
→ Remove 12:00:01 (now 11 seconds old, outside window)
→ Count remaining: 4
→ Count < limit → ALLOW
```
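A sketch of the same logic, assuming only accepted requests are recorded in the log (some variants also record rejected ones, which makes the limiter stricter under sustained pressure):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: at most `limit` requests in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self) -> bool:
        now = time.time()
        # Drop timestamps that have aged out of the sliding window.
        while self.log and now - self.log[0] >= self.window_seconds:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

limiter = SlidingWindowLog(limit=5, window_seconds=10)
```

The memory cost is visible here: one timestamp per accepted request, per user, which is why this variant is usually reserved for low-limit, high-value APIs.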
3. Sliding Window Counter (Hybrid)
Combines fixed window counters with a weighted approximation of the previous window. Less memory than sliding log but more accurate than fixed window.
```
Limit: 100 requests per minute
Previous window (12:00 - 12:01): 80 requests
Current window (12:01 - 12:02): 40 seconds elapsed (40/60 ≈ 67% complete)

Estimated rate = 80 × (1 - 0.67) + current_count
               = 80 × 0.33 + current_count
               ≈ 26 + current_count

If current_count = 75:
Estimated rate = 26 + 75 = 101 → REJECT (exceeds 100)
```
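A sketch of the weighted estimate above, again single-process with illustrative names:

```python
import time

class SlidingWindowCounter:
    """Estimate: previous_count × remaining window fraction + current_count."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = int(time.time() // window_seconds)
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # Roll forward: the old current window becomes "previous" only if it
            # is exactly one window back; anything older counts as zero.
            self.previous_count = (
                self.current_count if window == self.current_window + 1 else 0
            )
            self.current_count = 0
            self.current_window = window
        elapsed = (now % self.window_seconds) / self.window_seconds
        estimated = self.previous_count * (1 - elapsed) + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```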
4. Token Bucket Algorithm
A bucket holds tokens. Each token represents permission to make one request. Tokens refill at a constant rate up to a maximum capacity. Requests consume tokens. If the bucket is empty, the request is rejected. This allows short bursts while maintaining an average rate.
```
Bucket capacity: 10 tokens
Refill rate: 2 tokens per second
Start: 10 tokens (full)

T=0: Burst of 8 requests → Uses 8 tokens → 2 remaining
T=1: 2 tokens added → 4 tokens
T=2: 2 tokens added → 6 tokens
T=2: 5 requests → Uses 5 tokens → 1 remaining
T=2: 6th request → Uses last token → 0 remaining
T=2: 7th request → Bucket empty → REJECTED
T=3: 2 tokens added → 2 tokens
```
Advantage: Allows bursts (great for bursty traffic patterns like APIs). The burst capacity is the bucket size.
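A lazy-refill sketch of a token bucket: instead of a background timer, it tops up tokens whenever a request arrives, based on the time elapsed since the last check (a common implementation trick):

```python
import time

class TokenBucket:
    """Token bucket: bursts up to `capacity`, long-run rate of `refill_rate`/sec."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full, so an initial burst is allowed
        self.last_refill = time.time()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.time()
        # Lazily add tokens for the time elapsed since the last check.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=10, refill_rate=2)  # the example above
```

The `cost` parameter is a small extension: expensive operations can consume several tokens per request.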
5. Leaky Bucket Algorithm
Requests enter a bucket (queue). Requests exit the bucket at a fixed, constant rate regardless of how fast they arrive. This produces a perfectly smooth output rate, like water draining through a hole at a constant speed.
```
Bucket capacity: 10 requests
Drain rate: 1 request per second

T=0: 5 requests arrive → Bucket: [R1, R2, R3, R4, R5]
T=0: Processes R1 (1/sec rate)
T=1: 6 more requests arrive (total 10 pending)
T=1: Processes R2
...
T=9: Bucket full (10 requests)
T=9: New request arrives → Bucket overflow → REJECTED

Output: Always smooth, 1 request/second processed.
```
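A sketch modeling the bucket as a bounded queue; here `_drain` simply removes requests at the fixed rate, standing in for the worker that would actually process them:

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: a bounded queue drained at a fixed, constant rate."""

    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests processed per second
        self.queue = deque()
        self.last_drain = time.time()

    def offer(self, request) -> bool:
        self._drain()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # bucket overflow: reject

    def _drain(self):
        now = time.time()
        leaked = int((now - self.last_drain) * self.drain_rate)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # hand off to the worker at the fixed rate
            self.last_drain = now
```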
| Algorithm | Burst Support | Accuracy | Memory Use | Best For |
|---|---|---|---|---|
| Fixed Window | Yes (window edges) | Low | Very low | Simple APIs |
| Sliding Window Log | No | Highest | High | Premium APIs |
| Sliding Window Counter | Minimal | High | Low | Most web APIs |
| Token Bucket | Yes (controlled) | High | Low | APIs with burst tolerance |
| Leaky Bucket | No (smooths bursts) | High | Medium | Downstream protection |
Rate Limiting Strategies
Per User / Per API Key
Each user or API key gets its own limit. Most common for public APIs.
```
User A: 1,000 requests/hour (free tier)
User B: 10,000 requests/hour (pro tier)
User C: unlimited (enterprise tier)
```
Per IP Address
Limits apply per client IP. Effective against simple attacks but problematic when many users share one IP (e.g., a corporate NAT or a university network).
Per Endpoint
Different endpoints have different limits based on their cost. Expensive operations have stricter limits.
```
GET  /users        → 1,000 requests/minute (cheap read)
POST /upload-video → 5 requests/minute (expensive, storage-heavy)
POST /send-email   → 10 requests/minute (prevent spam)
GET  /health       → No limit (monitoring endpoint)
```
Global Rate Limit
The entire API accepts a maximum number of requests regardless of who makes them. Protects the server from total overload even if individual limits are not exceeded.
```
Total API capacity: 100,000 requests/second
Even if every user is within their individual limits,
the global cap ensures the servers never overload.
```
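A sketch of how these strategies compose: each request must pass every applicable scope. The limit tables and scope names below are hypothetical, and in a real limiter each scope would track its own counter with its own window:

```python
# Hypothetical limit tables; values mirror the examples above.
TIER_LIMITS = {"free": 1_000, "pro": 10_000, "enterprise": None}     # requests/hour
ENDPOINT_LIMITS = {"POST /upload-video": 5, "POST /send-email": 10}  # requests/minute
GLOBAL_LIMIT = 100_000                                               # requests/second

def checks_for(tier: str, endpoint: str) -> list[tuple[str, int]]:
    """Return every (scope, limit) pair a request must pass; None means unlimited."""
    checks = [("global", GLOBAL_LIMIT)]
    if TIER_LIMITS.get(tier) is not None:
        checks.append((f"tier:{tier}", TIER_LIMITS[tier]))
    if endpoint in ENDPOINT_LIMITS:
        checks.append((f"endpoint:{endpoint}", ENDPOINT_LIMITS[endpoint]))
    return checks

print(checks_for("free", "POST /send-email"))
# [('global', 100000), ('tier:free', 1000), ('endpoint:POST /send-email', 10)]
```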
Where to Implement Rate Limiting
Architecture: where rate limiting lives

```
Option 1: API Gateway (most common)
Client → [Rate Limiter at API Gateway] → Application Servers

Option 2: Middleware in the application
Client → Application Server → [Rate limit middleware checks] → Process

Option 3: Dedicated Rate Limiting Service
Client → [Rate Limit Service] → Application Servers
All servers query the same Rate Limit Service → Distributed rate limiting
```
Distributed Rate Limiting
A single-server rate limiter fails when the application runs on multiple servers. A user could bypass limits by hitting different servers.
```
Problem without distributed rate limiting:
User sends 10 requests to Server 1 (limit: 10) → All allowed
User sends 10 requests to Server 2 (limit: 10) → All allowed
Total: 20 requests, double the limit!

Solution: Shared rate limit counter in Redis
All servers check and update the same counter in Redis:
Server 1 and Server 2 both read/write to the Redis counter
Counter: 15/10 → Limit exceeded regardless of which server receives the request
```
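A common pattern for the shared counter is a fixed-window key in Redis, incremented atomically by every server. A sketch using the redis-py client (host and key format are illustrative):

```python
import time
import redis  # assumes the redis-py client and a reachable Redis server

r = redis.Redis(host="localhost", port=6379)

def allow(user_id: str, limit: int = 10, window_seconds: int = 60) -> bool:
    """Fixed-window counter shared by every app server via Redis."""
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)                    # atomic increment across all servers
    pipe.expire(key, window_seconds)  # stale window keys clean themselves up
    count, _ = pipe.execute()
    return count <= limit
```

One caveat: INCR and EXPIRE run as two commands here; if a crash between them is a concern, a small Lua script can make the pair atomic on the Redis side.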
Rate Limit Response Headers
Best-practice APIs always return headers informing clients of their current rate limit status, so clients can adjust their behavior proactively instead of hitting 429 errors.
HTTP Response Headers:
X-RateLimit-Limit: 100 (maximum requests allowed per window)
X-RateLimit-Remaining: 23 (requests remaining in current window)
X-RateLimit-Reset: 1735690000 (Unix timestamp when window resets)
Retry-After: 45 (seconds until next request allowed, on 429)
HTTP Status for exceeded limit:
429 Too Many Requests
{
"error": "Rate limit exceeded",
"message": "Try again in 45 seconds",
"retryAfter": 45
}
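On the client side, a well-behaved caller reads these headers and backs off instead of retrying immediately. A sketch using the `requests` library, assuming Retry-After arrives in its delta-seconds form (the header can also be an HTTP date):

```python
import time
import requests  # third-party HTTP client; URL below is illustrative

def get_with_backoff(url: str, max_retries: int = 3) -> requests.Response:
    """Honor 429 + Retry-After instead of hammering the API."""
    for _ in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        wait = int(resp.headers.get("Retry-After", 1))
        time.sleep(wait)  # back off for the server-suggested interval
    return resp
```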
Throttling vs Rate Limiting
| Aspect | Rate Limiting | Throttling |
|---|---|---|
| Action when exceeded | Reject request (429 error) | Slow down processing (add delay) |
| User experience | Hard stop | Gradual degradation |
| Server protection | Immediate protection | Graceful protection |
| Use case | Public APIs, security boundaries | Internal services, bulk operations |
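The difference is easy to see in code: a throttling wrapper waits for capacity instead of returning 429. This sketch reuses the TokenBucket class from the token bucket section above:

```python
import time

def throttled_call(bucket, handler, *args):
    """Throttling: delay until a token is available, rather than rejecting."""
    while not bucket.allow():
        time.sleep(0.05)  # degrade gradually instead of hard-stopping the caller
    return handler(*args)
```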
Summary
Rate limiting and throttling protect services from overload, abuse, and DDoS attacks while ensuring fair access for all users. The token bucket algorithm handles bursty traffic gracefully, making it ideal for most APIs. The leaky bucket algorithm smooths traffic for sensitive downstream systems. Distributed rate limiting using a shared Redis counter ensures limits hold across all application servers. Always communicate limits to clients through response headers so well-behaved clients can self-regulate and avoid hitting hard limits.
