Rate Limiting and Throttling

Rate limiting is the practice of controlling how many requests a client can make to a service within a defined time period. Throttling is the action of slowing down or rejecting requests that exceed a defined limit. Together, these mechanisms protect services from overload, prevent abuse, and ensure fair access for all users.

Think of a popular amusement park ride. The ride allows only 30 people every 5 minutes. If 100 people rush in at once, 70 wait in line or leave. The queue is rate limiting — it controls the flow, protects the ride's capacity, and ensures everyone gets a fair experience.

Why Rate Limiting Matters

  • Mitigates Denial of Service (DoS) attacks: Strict rate limits stop a flood of requests from reaching and overwhelming application servers.
  • Prevents API abuse: A poorly behaved client (bug or bad actor) cannot monopolize server resources.
  • Ensures fair usage: No single user consumes all available capacity, leaving others with slow or no service.
  • Controls costs: Third-party APIs charge per request. Rate limiting prevents runaway costs from bugs or abuse.
  • Protects downstream services: A rate-limited API protects databases and dependent services from cascading overload.

Rate Limiting Algorithms

1. Fixed Window Counter

A counter tracks requests within a fixed time window. When the counter reaches the limit, additional requests are rejected until the next window starts.

Limit: 100 requests per minute
Window: 12:00:00 → 12:01:00

12:00:01 → Request 1   (counter: 1)
12:00:30 → Request 50  (counter: 50)
12:00:59 → Request 100 (counter: 100) ← LIMIT REACHED
12:00:59 → Request 101 → REJECTED (429 Too Many Requests)
12:01:00 → New window! Counter resets to 0
12:01:01 → Request 1   (counter: 1) ← Allowed again

Problem — Window edge burst:

User sends 100 requests at 12:00:59 (last second of window)
User sends 100 requests at 12:01:00 (first second of next window)

Result: 200 requests in 2 seconds! Double the intended limit.
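A minimal sketch of a fixed window counter, using an `allow()` check with an injectable clock so the traces above can be reproduced (the class and parameter names are illustrative, not from any particular library):

```python
import time

class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        # A new window has begun: align to the window boundary and reset.
        if now - self.window_start >= self.window:
            self.window_start = now - (now % self.window)
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Note that nothing in this code prevents the edge-burst problem described above: the counter resets abruptly at each window boundary.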

2. Sliding Window Log

For every request, the system records its exact timestamp in a log. To decide if a request is allowed, it counts how many timestamps exist within the last N seconds. Most accurate but memory-intensive.

Limit: 5 requests per 10 seconds

Log: [12:00:01, 12:00:03, 12:00:07, 12:00:09, 12:00:10]

New request at 12:00:11:
→ Remove timestamps older than 12:00:01 (10 seconds ago)
→ Count remaining: 5 (all 5 are within window)
→ Count = limit → REJECT

New request at 12:00:12:
→ Remove 12:00:01 (now 11 seconds old, outside window)
→ Count remaining: 4
→ Count < limit → ALLOW
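The sliding window log can be sketched with a deque of timestamps; eviction keeps entries that are exactly at the window boundary, matching the example above (names are illustrative):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: one timestamp per request, count the last N seconds."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps strictly older than the window.
        while self.log and now - self.log[0] > self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

The memory cost is visible here: the log holds up to `limit` timestamps per client, which is why this approach is reserved for cases where exactness matters.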

3. Sliding Window Counter (Hybrid)

Combines fixed window counters with a weighted approximation of the previous window. Less memory than sliding log but more accurate than fixed window.

Limit: 100 requests per minute

Previous window (12:00 - 12:01): 80 requests
Current window (12:01 - 12:02): 40 seconds elapsed (40/60 = 67% complete)

Estimated rate = 80 × (1 - 0.67) + current_count
               = 80 × 0.33 + current_count
               = 26 + current_count

If current_count = 75:
Estimated rate = 26 + 75 = 101 → REJECT (exceeds 100)
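The weighted estimate above reduces to a one-line formula; a small sketch (function name is illustrative):

```python
def sliding_window_allow(prev_count, curr_count, elapsed_fraction, limit):
    """Approximate the sliding-window count by weighting the previous
    window by the fraction of it still inside the sliding window."""
    estimated = prev_count * (1 - elapsed_fraction) + curr_count
    return estimated < limit
```

With the numbers from the example (80 previous, 75 current, 40 of 60 seconds elapsed), the estimate of roughly 26.7 + 75 exceeds 100, so the request is rejected.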

4. Token Bucket Algorithm

A bucket holds tokens. Each token represents permission to make one request. Tokens refill at a constant rate up to a maximum capacity. Requests consume tokens. If the bucket is empty, the request is rejected. This allows short bursts while maintaining an average rate.

Bucket capacity: 10 tokens
Refill rate: 2 tokens per second

Start: 10 tokens (full)

T=0:   Burst of 8 requests → Uses 8 tokens → 2 remaining
T=1:   2 tokens added → 4 tokens
T=2:   2 tokens added → 6 tokens
T=2:   5 requests → Uses 5 tokens → 1 remaining
T=2:   6th request → Uses last token → 0 remaining
T=2:   7th request → 0 tokens → REJECTED
T=3:   2 tokens added → 3 tokens

Advantage: Allows bursts (great for bursty traffic patterns like APIs). The burst capacity is the bucket size.
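A minimal token bucket sketch that refills lazily based on elapsed time rather than with a background timer (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket: refills continuously, allows bursts up to `capacity`."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = float(capacity)    # start full
        self.last = 0.0

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Lazy refill is a common design choice: it avoids a timer per bucket and computes the token count only when a request arrives.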

5. Leaky Bucket Algorithm

Requests enter a bucket (queue). Requests exit the bucket at a fixed, constant rate regardless of how fast they arrive. This produces a perfectly smooth output rate, like water draining through a hole at a constant speed.

Bucket capacity: 10 requests
Drain rate: 1 request per second

T=0: 5 requests arrive → Bucket: [R1, R2, R3, R4, R5]
T=0: Processes R1 (1/sec rate) → 4 pending
T=1: 6 more requests arrive → 10 pending → Bucket full
T=1: New request arrives → Bucket overflow → REJECTED
T=1: Processes R2 → 9 pending

Output: Always smooth, 1 request/second processed.
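A leaky bucket can be sketched as a bounded queue with a lazy drain; in a real system the drained items would be dispatched to workers, which this sketch only notes in a comment (names are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue incoming requests, drain at a fixed rate."""

    def __init__(self, capacity, drain_rate):
        self.capacity = capacity
        self.drain_rate = drain_rate   # requests drained per second
        self.queue = deque()
        self.last_drain = 0.0

    def _drain(self, now):
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()   # in a real system: dispatch to a worker
            self.last_drain = now

    def offer(self, request, now=None):
        now = time.time() if now is None else now
        self._drain(now)
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True                # queued; processed at the drain rate
        return False                   # overflow
```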
Algorithm Comparison

Algorithm              | Burst Support       | Accuracy | Memory Use | Best For
Fixed Window           | Yes (window edges)  | Low      | Very low   | Simple APIs
Sliding Window Log     | No                  | Highest  | High       | Premium APIs
Sliding Window Counter | Minimal             | High     | Low        | Most web APIs
Token Bucket           | Yes (controlled)    | High     | Low        | APIs with burst tolerance
Leaky Bucket           | No (smooths bursts) | High     | Medium     | Downstream protection

Rate Limiting Strategies

Per User / Per API Key

Each user or API key gets its own limit. Most common for public APIs.

User A: 1,000 requests/hour (free tier)
User B: 10,000 requests/hour (pro tier)
User C: unlimited (enterprise tier)

Per IP Address

Limits apply per client IP. Effective against simple attacks but problematic when many users share one IP (e.g., a corporate NAT or a university network).

Per Endpoint

Different endpoints have different limits based on their cost. Expensive operations have stricter limits.

GET  /users        → 1,000 requests/minute  (cheap read)
POST /upload-video → 5 requests/minute      (expensive, storage-heavy)
POST /send-email   → 10 requests/minute     (prevent spam)
GET  /health       → No limit              (monitoring endpoint)
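Per-endpoint limits are often expressed as a simple lookup table. A sketch using the routes from the example above (the table, `limit_for` helper, and default value are all hypothetical):

```python
# Per-minute limits keyed by (method, path); None means no limit.
ENDPOINT_LIMITS = {
    ("GET", "/users"): 1000,        # cheap read
    ("POST", "/upload-video"): 5,   # expensive, storage-heavy
    ("POST", "/send-email"): 10,    # prevent spam
    ("GET", "/health"): None,       # monitoring endpoint, unlimited
}

def limit_for(method, path, default=100):
    """Look up a route's per-minute limit; `default` covers unlisted routes."""
    return ENDPOINT_LIMITS.get((method, path), default)
```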

Global Rate Limit

The entire API accepts a maximum number of requests regardless of who makes them. Protects the server from total overload even if individual limits are not exceeded.

Total API capacity: 100,000 requests/second
Even if every user is within their individual limits,
the global cap ensures the servers never overload.

Where to Implement Rate Limiting

Architecture: Where rate limiting lives

Option 1: API Gateway (most common)
Client → [Rate Limiter at API Gateway] → Application Servers

Option 2: Middleware in the application
Client → Application Server → [Rate Limit middleware checks] → Process

Option 3: Dedicated Rate Limiting Service
Client → [Rate Limit Service] → Application Servers
All servers query the same Rate Limit Service → Distributed rate limiting

Distributed Rate Limiting

A single-server rate limiter fails when the application runs on multiple servers. A user could bypass limits by hitting different servers.

Problem without distributed rate limiting:
User sends 10 requests to Server 1 (limit: 10) → All allowed
User sends 10 requests to Server 2 (limit: 10) → All allowed
Total: 20 requests — double the limit!

Solution: Shared rate limit counter in Redis
All servers check and update the same counter in Redis:
Server 1 and Server 2 both read/write to Redis counter
Counter: 15/10 → Limit exceeded regardless of which server receives request
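The shared-counter pattern can be sketched as a fixed-window check against any store exposing Redis-style `incr` and `expire` operations. With real Redis, INCR is atomic, so every application server sees the same count; the key format and function name here are illustrative:

```python
import time

def allow_request(store, user_id, limit, window_seconds, now=None):
    """Fixed-window check against a shared counter.

    `store` is any client exposing Redis-style incr(key) and
    expire(key, ttl) -- e.g. a redis-py client in production.
    """
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    count = store.incr(key)                 # shared, atomic increment
    if count == 1:                          # first request in this window:
        store.expire(key, window_seconds)   # let the key expire with the window
    return count <= limit
```

Because the key embeds the window number, all servers increment the same counter for the same user and window, closing the bypass described above.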

Rate Limit Response Headers

Best-practice APIs always return headers informing clients of their current rate limit status, so clients can adjust their behavior proactively instead of hitting 429 errors.

HTTP Response Headers:
X-RateLimit-Limit: 100          (maximum requests allowed per window)
X-RateLimit-Remaining: 23       (requests remaining in current window)
X-RateLimit-Reset: 1735690000   (Unix timestamp when window resets)
Retry-After: 45                 (seconds until next request allowed, on 429)

HTTP Status for exceeded limit:
429 Too Many Requests
{
  "error": "Rate limit exceeded",
  "message": "Try again in 45 seconds",
  "retryAfter": 45
}
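Building these headers from limiter state is straightforward. A sketch (the `X-RateLimit-*` names are a widely used convention rather than a formal standard, and the function name is illustrative):

```python
def rate_limit_headers(limit, remaining, reset_ts, now):
    """Build informational rate limit headers from the current limiter state."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if remaining <= 0:
        # Only meaningful alongside a 429 response.
        headers["Retry-After"] = str(max(0, reset_ts - now))
    return headers
```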

Throttling vs Rate Limiting

Aspect               | Rate Limiting                    | Throttling
Action when exceeded | Reject request (429 error)       | Slow down processing (add delay)
User experience      | Hard stop                        | Gradual degradation
Server protection    | Immediate protection             | Graceful protection
Use case             | Public APIs, security boundaries | Internal services, bulk operations
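The "slow down instead of reject" behavior can be sketched as a retry loop with exponential backoff around any limiter that exposes an `allow()` check (the delay values and function name are illustrative):

```python
import time

def throttled_call(fn, limiter, base_delay=0.5, max_delay=8.0):
    """Throttling sketch: delay and retry instead of rejecting outright.

    `limiter` is any object with an allow() -> bool method.
    """
    delay = base_delay
    while not limiter.allow():
        time.sleep(delay)                   # slow the caller down
        delay = min(delay * 2, max_delay)   # exponential backoff
    return fn()
```

The caller always succeeds eventually but experiences the gradual degradation described in the table above, rather than a hard 429 stop.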

Summary

Rate limiting and throttling protect services from overload, abuse, and DDoS attacks while ensuring fair access for all users. The token bucket algorithm handles bursty traffic gracefully, making it ideal for most APIs. The leaky bucket algorithm smooths traffic for sensitive downstream systems. Distributed rate limiting using a shared Redis counter ensures limits hold across all application servers. Always communicate limits to clients through response headers so well-behaved clients can self-regulate and avoid hitting hard limits.
