REST API Rate Limiting, Caching, and Performance
A REST API that works correctly on day one can fail catastrophically on day 100 when traffic grows. Performance is not an afterthought — it defines the user experience, the server costs, and the resilience of your entire system. This page covers three powerful tools: rate limiting to protect your API from abuse, caching to serve responses faster while reducing server load, and performance patterns that keep your API responsive under pressure.
Part 1: Rate Limiting
What Rate Limiting Does
Rate limiting sets a maximum number of requests a client can make within a defined time window. Think of it like a highway tollbooth — only a fixed number of cars can pass per minute. When the limit is reached, additional requests are held back or rejected until the window resets.
WITHOUT RATE LIMITING:
Normal user: GET /search 10x per minute → Server handles fine
Attacker: GET /search 50,000x per minute → Server crashes
Effect on normal users during an attack:
Error: 503 Service Unavailable (server overwhelmed)
WITH RATE LIMITING:
Normal user: GET /search 10x per minute → ✓ All served
Attacker: GET /search attempt 50,000x → After limit hit:
429 Too Many Requests
Server: Calm. Normal users unaffected.
Rate Limiting Algorithms
Algorithm 1: Fixed Window
Window: 60 seconds | Limit: 100 requests
Time 0:00 ──────────────────────────── Time 1:00
Requests: 1, 2, 3, ... 100 → 101st blocked (429)
Time 1:00: Counter resets to 0
Requests: 1, 2, 3, ...
Problem with Fixed Window:
0:50 → 100 requests (all allowed, limit reached)
1:10 → 100 more requests (window reset, all allowed)
Result: 200 requests in a 20-second burst at the window boundary!
Simple to implement. Not great at preventing bursts.
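The boundary burst can be reproduced with a minimal in-memory sketch. The class name `FixedWindowLimiter`, the `client_id` key, and the explicit `now` argument (seconds, passed in so the example is deterministic) are assumptions for illustration; a production limiter would normally keep its counters in a shared store.

```python
class FixedWindowLimiter:
    """Counts requests per client in fixed, non-overlapping windows."""

    def __init__(self, limit: int = 100, window_seconds: int = 60):
        self.limit = limit
        self.window_seconds = window_seconds
        # client_id -> (window index, requests counted in that window)
        self._counters = {}

    def allow(self, client_id: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        current_window, count = self._counters.get(client_id, (window, 0))
        if current_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            self._counters[client_id] = (window, count)
            return False  # caller would answer 429
        self._counters[client_id] = (window, count + 1)
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at t=50s, then 100 more at t=70s (just after the reset at t=60s)
burst = [limiter.allow("client-a", 50.0) for _ in range(100)]
burst += [limiter.allow("client-a", 70.0) for _ in range(100)]
allowed_in_burst = sum(burst)  # every request passes despite the 100/minute limit
```

All 200 requests are accepted within 20 seconds because they straddle the reset, which is the weakness described above.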
Algorithm 2: Sliding Window
Window: 60 seconds | Limit: 100 requests
(counts requests in the rolling last 60 seconds, not fixed intervals)
At 10:30:45, system counts requests from 10:29:45 onward.
At 10:30:46, system counts requests from 10:29:46 onward.
The window slides forward continuously.
No boundary burst problem. More accurate.
Slightly more expensive to compute.
Best choice for most APIs.
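One way to implement a sliding window is to keep a log of recent request timestamps per client. The sketch below is an illustration under that assumption; `SlidingWindowLimiter` and the injected `now` value are hypothetical names. Memory grows with the limit, which is part of the extra cost mentioned above.

```python
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: remembers the timestamps of recent requests."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._timestamps = {}  # client_id -> deque of request times

    def allow(self, client_id: str, now: float) -> bool:
        log = self._timestamps.setdefault(client_id, deque())
        # Drop timestamps that have fallen out of the rolling window
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False  # caller would answer 429
        log.append(now)
        return True


sliding = SlidingWindowLimiter(limit=100, window_seconds=60.0)
first_burst = [sliding.allow("client-a", 50.0) for _ in range(100)]
second_burst = [sliding.allow("client-a", 70.0) for _ in range(100)]
```

With the same traffic that defeats a fixed window (100 requests at 0:50 and 100 more at 1:10), the second burst is rejected because the first 100 requests are still inside the rolling 60 seconds.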
Algorithm 3: Token Bucket
CONCEPT:
A bucket holds tokens (max: 100 tokens).
Tokens refill at a rate of 10/second.
Each request consumes 1 token.
If bucket is empty → request rejected.
Bucket State Over Time:
Time 0: [100 tokens] ← fully refilled
10 requests arrive: [90 tokens]
1 second passes: [100 tokens] ← refilled (capped at max)
100 requests burst: [0 tokens]
101st request: REJECTED (429)
10 seconds pass: [100 tokens] ← refilled
ADVANTAGE: Allows short bursts of traffic naturally.
Clients can "save up" tokens for busy periods.
USE CASE: APIs where occasional traffic spikes are legitimate.
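The bucket timeline above can be replayed with a small sketch. `TokenBucket` and its parameter names are assumptions for illustration, and the clock is passed in as `now` (seconds) so the behaviour is reproducible.

```python
class TokenBucket:
    """Token bucket that refills lazily, based on time since the last call."""

    def __init__(self, capacity: float = 100.0, refill_rate: float = 10.0,
                 now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start with a full bucket
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.last_refill)
        # Refill for the elapsed time, never exceeding the capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would answer 429


bucket = TokenBucket(capacity=100.0, refill_rate=10.0)
for _ in range(10):
    bucket.allow(now=0.0)                                   # 90 tokens left
burst_results = [bucket.allow(now=1.0) for _ in range(100)]  # refilled to 100, then drained
rejected = bucket.allow(now=1.0)                             # the 101st request
```

Refilling lazily on each call avoids a background timer; the token count is only brought up to date when a request actually arrives.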
Algorithm 4: Leaky Bucket
CONCEPT:
Requests enter the bucket from the top (any rate).
Requests leave the bucket (processed) at a fixed rate.
Bucket overflows (requests rejected) when full.
[Client sends] ─────────────────────────────────
100 req/sec → │ BUCKET (capacity: 50) │
│ │
│ ← 10 req/sec processed → │
└───────────────────────────────┘
Overflow (>50 queued) → 429 rejected
ADVANTAGE: Smooths out traffic spikes. No bursts reach the server.
USE CASE: Background processing queues. Payment processors.
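A leaky bucket can be sketched as a bounded queue that drains at a fixed rate. The names `LeakyBucket`, `offer`, and `processed` are hypothetical, and the clock is passed in as `now`; a real service would hand drained requests to a worker instead of appending them to a list.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate; overflow is rejected."""

    def __init__(self, capacity: int = 50, leak_rate: float = 10.0,
                 now: float = 0.0):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.processed = []         # stand-in for "handed to the server"
        self.last_leak = now

    def _leak(self, now: float) -> None:
        # Whole requests that should have drained since the last leak
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.processed.append(self.queue.popleft())
            self.last_leak += drained / self.leak_rate

    def offer(self, request, now: float) -> bool:
        self._leak(now)
        if len(self.queue) >= self.capacity:
            return False  # bucket is full: caller would answer 429
        self.queue.append(request)
        return True


leaky = LeakyBucket(capacity=50, leak_rate=10.0)
accepted = sum(leaky.offer(i, now=0.0) for i in range(100))  # 100 arrive at once
late_arrival = leaky.offer("late", now=1.0)                  # 10 have drained by now
```

Only 50 of the 100 simultaneous requests are queued, and the server side never sees more than 10 per second regardless of how fast clients send.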
What to Rate Limit
RATE LIMIT BY:
IP Address → Default. Catches anonymous abuse.
Drawback: Shared IPs (offices, NAT) affect multiple users.
User/API Key → More precise. Authenticated users get individual quotas.
One user's abuse doesn't affect others.
Endpoint → Different limits per route based on cost and sensitivity.
Combination → Best: per-user AND per-IP AND per-endpoint
ENDPOINT-SPECIFIC LIMITS (recommended):
POST /auth/login → 5 / minute (prevent password brute-force)
POST /auth/register → 3 / minute (prevent bot registrations)
POST /password-reset → 3 / hour (prevent email flooding)
GET /products → 500 / minute (lightweight read)
POST /orders → 10 / minute (meaningful transaction)
GET /search → 100 / minute (moderate)
POST /bulk-upload → 2 / minute (heavy operation)
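The per-endpoint table above can be expressed as a lookup that a limiter consults before counting a request. This is a sketch only: `ENDPOINT_LIMITS`, `DEFAULT_LIMIT`, the fallback value, and the `ratelimit:...` key format are assumptions, not a prescribed scheme.

```python
# (method, path) -> (max requests, window in seconds), taken from the table above
ENDPOINT_LIMITS = {
    ("POST", "/auth/login"): (5, 60),
    ("POST", "/auth/register"): (3, 60),
    ("POST", "/password-reset"): (3, 3600),
    ("GET", "/products"): (500, 60),
    ("POST", "/orders"): (10, 60),
    ("GET", "/search"): (100, 60),
    ("POST", "/bulk-upload"): (2, 60),
}
DEFAULT_LIMIT = (100, 60)  # assumed fallback for routes not listed


def limit_for(method: str, path: str):
    """Return (limit, window_seconds) for a route."""
    return ENDPOINT_LIMITS.get((method.upper(), path), DEFAULT_LIMIT)


def counter_keys(method: str, path: str, ip: str, user_id=None):
    """Build one counter key per dimension: per-IP and, if known, per-user."""
    endpoint = f"{method.upper()} {path}"
    keys = [f"ratelimit:ip:{ip}:{endpoint}"]
    if user_id is not None:
        keys.append(f"ratelimit:user:{user_id}:{endpoint}")
    return keys
```

A request would be allowed only if every key returned by `counter_keys` is still under the limit returned by `limit_for`, which gives the combined per-user, per-IP, per-endpoint behaviour.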
Rate Limit Response Headers
200 OK Response with Rate Limit Info:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 37
X-RateLimit-Reset: 1720900060 ← Unix timestamp when window resets
429 Too Many Requests Response:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1720900060
Retry-After: 23 ← seconds until client can retry
Body:
{
"error": "rate_limit_exceeded",
"message": "Too many requests. Please wait 23 seconds.",
"retryAfter": 23
}
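The headers and body above could be assembled by a small helper like the following. It is framework-agnostic and returns a plain `(status, headers, body)` tuple; the function names are assumptions for illustration.

```python
import json
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Headers attached to every response, successful or not."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),  # Unix timestamp of the reset
    }


def too_many_requests(limit: int, reset_epoch: int, now: float):
    """Build a 429 response as (status, headers, body)."""
    retry_after = max(1, math.ceil(reset_epoch - now))  # whole seconds to wait
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    headers["Content-Type"] = "application/json"
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Please wait {retry_after} seconds.",
        "retryAfter": retry_after,
    })
    return 429, headers, body
```

Deriving `Retry-After` from the same reset timestamp keeps the two values consistent, so a client can rely on either one.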
Part 2: Caching
Caching stores the result of an expensive operation so future requests can be served from the stored result instead of repeating the work. For an API, this means sending a stored response instead of hitting the database again. Done correctly, caching can make your API 10x–100x faster for repeated requests.
The Coffee Shop Analogy
WITHOUT CACHING:
Customer: "What's today's menu?"
Barista: [Walks to kitchen, checks with chef, walks back] "Here it is."
Next customer: "What's today's menu?"
Barista: [Walks to kitchen again...]
(Repeated for every single customer: slow and exhausting)
WITH CACHING:
Customer 1: "What's today's menu?"
Barista: [Walks to kitchen once] "Here it is." [Posts menu on board]
Customers 2–200: "What's today's menu?"
Barista: [Points to board] "There it is." (Instant, no kitchen trip)
Cache invalidation: When the menu changes, tear down the old board and post a new one.
HTTP Caching Headers
HTTP has built-in caching mechanisms. Using them correctly means browsers and CDNs cache your responses automatically — your server does not even receive those requests.
Cache-Control Header
Cache-Control: max-age=3600
→ "This response is fresh for 3600 seconds (1 hour)."
→ Client or CDN can serve it without asking the server for 1 hour.
Cache-Control: no-cache
→ "Always revalidate with the server before using cached copy."
→ Client can cache it but MUST check freshness each time.
Cache-Control: no-store
→ "Do not cache this at all. Never save a copy."
→ Use for sensitive data: bank balances, private messages.
Cache-Control: private
→ "Only the end user's browser may cache this, not CDNs."
→ Use for user-specific data: profile info, shopping cart.
Cache-Control: public
→ "Any cache (CDN, proxy, browser) may store this."
→ Use for shared content: product catalog, public articles.
COMMON COMBINATIONS:
Public static asset: Cache-Control: public, max-age=86400
User profile: Cache-Control: private, max-age=300
Bank balance: Cache-Control: no-store
Product list: Cache-Control: public, max-age=3600
ETag — Conditional Requests
ETag is a fingerprint of the response content.
When content changes, the ETag changes.
FLOW:
Step 1: Client requests product list
GET /products
→ Response:
HTTP/1.1 200 OK
ETag: "abc123xyz"
Cache-Control: max-age=60
Body: [list of products]
Step 2: 60 seconds later, cache expires. Client revalidates:
GET /products
If-None-Match: "abc123xyz" ← "I have this version. Still valid?"
Step 3a: Products UNCHANGED — server responds:
HTTP/1.1 304 Not Modified
ETag: "abc123xyz"
Body: (empty — no data sent!)
→ Client uses its cached copy. Saves bandwidth completely.
Step 3b: Products CHANGED — server responds:
HTTP/1.1 200 OK
ETag: "def456uvw" ← new ETag
Body: [updated product list]
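The revalidation flow above might look like this on the server side. It is a simplified sketch: the ETag is assumed to be a truncated SHA-256 of the serialized body, `get_products` is a hypothetical handler, and only a single exact `If-None-Match` value is compared (the real header may also carry a list of tags or `*`).

```python
import hashlib
import json


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the response bytes (one common approach)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def get_products(products: list, if_none_match=None):
    """Return (status, headers, body) for GET /products."""
    body = json.dumps(products, sort_keys=True).encode("utf-8")
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "max-age=60"}
    if if_none_match == etag:
        return 304, headers, b""  # client copy is still valid, send no body
    return 200, headers, body
```

Because the tag is computed from the body, any change to the product list yields a different ETag and the next conditional request falls through to a full 200 response.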
Last-Modified
Server includes last modification timestamp:
Last-Modified: Mon, 15 Jan 2024 10:30:00 GMT
Client revalidates with:
If-Modified-Since: Mon, 15 Jan 2024 10:30:00 GMT
Server:
→ Not changed → 304 Not Modified (no body)
→ Changed → 200 OK (new body + new Last-Modified)
ETag is preferred over Last-Modified because:
- Last-Modified has one-second resolution, so two changes within the same second look identical
- An ETag is typically derived from the content (for example, a hash), so it changes whenever the content does
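A time-based check can be sketched with the standard library's HTTP date helpers. `check_modified` is a hypothetical function name, and `last_modified` is assumed to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def check_modified(last_modified: datetime, if_modified_since=None):
    """Return (status, headers) for a resource last changed at `last_modified`."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        # HTTP dates carry whole seconds only, so ignore microseconds
        if since is not None and last_modified.replace(microsecond=0) <= since:
            return 304, headers
    return 200, headers


changed_at = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
fresh_status, fresh_headers = check_modified(changed_at)
```

The truncation to whole seconds in the comparison is the same limitation that makes ETags the more precise validator.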
Server-Side Caching
HTTP caching helps the client and CDN. Server-side caching speeds up your database queries.
WITHOUT SERVER CACHE:
Request → API → Database Query (50ms) → Response
1,000 requests/second = 1,000 database queries/second
WITH SERVER CACHE (Redis):
Request 1 → API → Cache MISS → Database (50ms) → Store in Redis → Response
Request 2 → API → Cache HIT → Redis (1ms) → Response (50x faster!)
Requests 3-999 → Cache HIT → Redis (1ms) each
DATABASE LOAD COMPARISON:
Without cache: 1,000 queries/second
With cache (90% hit rate): 100 queries/second
↑ 10x reduction in database load
Cache Invalidation Strategies
The hardest problem in caching: knowing when to clear stale data.
STRATEGY 1: TTL (Time-To-Live)
Cache entry expires after fixed time (e.g., 5 minutes).
Simple. Stale data possible for up to TTL duration.
Good for: Product catalog, public stats, pricing.
STRATEGY 2: Event-Based Invalidation
Clear cache immediately when data changes.
User updates profile → delete user:456 cache key
Admin updates product → delete products:list cache key
Accurate. Requires code discipline.
Good for: User data, inventory levels, real-time data.
STRATEGY 3: Cache-Aside (Lazy Loading)
1. Request comes in
2. Check cache → HIT: return cached data
3. Cache MISS: fetch from DB, store in cache, return data
Data only enters cache when requested.
Good for: Read-heavy data with unpredictable access patterns.
STRATEGY 4: Write-Through
Every write updates the database AND the cache simultaneously.
Cache always current. Slightly slower writes (two operations).
Good for: Data read very frequently right after being written.
CACHE KEY DESIGN:
user:profile:456 → user 456's profile
products:list:page:1 → page 1 of products
search:laptops:sort:price → search results for laptops sorted by price
order:1001 → order 1001 details
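Cache-aside, TTL expiry, and event-based invalidation can be combined in a few lines. In this sketch an in-memory dictionary stands in for Redis and `FakeDatabase` stands in for a real database; both, along with the function names and the 300-second TTL, are assumptions made so the example runs on its own.

```python
class TTLCache:
    """Tiny in-memory stand-in for a cache server such as Redis."""

    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:      # TTL elapsed: treat as a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl, now):
        self._store[key] = (now + ttl, value)

    def delete(self, key):
        self._store.pop(key, None)


class FakeDatabase:
    def __init__(self):
        self.profiles = {456: {"name": "Ada"}}
        self.query_count = 0

    def fetch_profile(self, user_id):
        self.query_count += 1
        return dict(self.profiles[user_id])


def get_profile(db, cache, user_id, now, ttl=300.0):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"user:profile:{user_id}"
    cached = cache.get(key, now)
    if cached is not None:
        return cached
    profile = db.fetch_profile(user_id)
    cache.set(key, profile, ttl, now)
    return profile


def update_profile(db, cache, user_id, changes):
    """Event-based invalidation: write to the database, then drop the key."""
    db.profiles[user_id].update(changes)
    cache.delete(f"user:profile:{user_id}")
```

Deleting the key on write, rather than rewriting it, keeps the write path simple; the next read repopulates the cache with fresh data.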
Part 3: API Performance Patterns
Pagination — Never Return Everything at Once
WRONG — Returns all 50,000 products:
GET /products
Response: [50,000 items, 45MB payload, 8 seconds load time]
RIGHT — Paginated responses:
GET /products?page=1&limit=20
Response:
{
"data": [20 products],
"pagination": {
"page": 1,
"limit": 20,
"total": 50000,
"totalPages": 2500
}
}
THREE PAGINATION STYLES:
1. Offset Pagination (most common):
GET /products?offset=40&limit=20
Simple. Suffers from "page drift" if items are added/removed mid-browse.
2. Cursor-Based Pagination (best for real-time data):
GET /products?cursor=eyJpZCI6NDB9&limit=20
Uses an opaque cursor pointing to a position in the dataset.
Stable even when new items are inserted.
Used by Twitter, Facebook, Stripe.
3. Keyset Pagination (best performance):
GET /products?after_id=40&limit=20
Uses the last-seen ID as the starting point.
Very fast even on tables with millions of rows.
No OFFSET scan in the database (offset gets slower as page number grows).
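Keyset pagination with an opaque cursor can be sketched against an in-memory SQLite table. The table layout and function names are assumptions for illustration; the cursor is simply the last-seen ID wrapped in base64-encoded JSON, which is how the example value `eyJpZCI6NDB9` above decodes (`{"id":40}`).

```python
import base64
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(i, f"Product {i}") for i in range(1, 101)],
)


def encode_cursor(last_id: int) -> str:
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor))["id"]


def page_after(conn, cursor=None, limit=20):
    """Return (rows, next_cursor); next_cursor is None on a short last page."""
    after_id = decode_cursor(cursor) if cursor else 0
    rows = conn.execute(
        # Seeks directly to the position via the primary key, no OFFSET scan
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    next_cursor = encode_cursor(rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor
```

Keeping the cursor opaque lets the server change what it encodes later (for example, a sort key plus an ID) without breaking clients.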
Field Selection — Let Clients Request What They Need
PROBLEM: Mobile app needs only product name and price.
API always returns all 20 fields per product.
Unnecessary data wastes bandwidth and parse time.
SOLUTION: Support field selection via query parameter.
GET /products?fields=id,name,price
Response (only requested fields):
[
{ "id": 1, "name": "Laptop", "price": 999 },
{ "id": 2, "name": "Mouse", "price": 29 }
]
Without field selection:
[
{
"id": 1, "name": "Laptop", "price": 999,
"description": "...(500 chars)...",
"sku": "LAP-001",
"categoryId": 5,
"weight": 2.1,
"dimensions": {...},
"images": [...],
"inventory": {...},
"reviews": [...]
},
...
]
Field selection reduces payload by 60-90% for common use cases.
GraphQL is another approach that gives clients full control over fields.
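A `fields` parameter can be handled with a small projection step before serialization. In this sketch `ALLOWED_FIELDS` and `select_fields` are hypothetical names; rejecting unknown fields is one possible policy, and silently ignoring them is another.

```python
ALLOWED_FIELDS = {"id", "name", "price", "description", "sku", "categoryId"}


def select_fields(items: list, fields_param=None) -> list:
    """Project each item onto the comma-separated fields the client asked for."""
    if not fields_param:
        return items  # no parameter: return full objects
    requested = [f.strip() for f in fields_param.split(",") if f.strip()]
    unknown = [f for f in requested if f not in ALLOWED_FIELDS]
    if unknown:
        # A handler would typically translate this into a 400 response
        raise ValueError(f"Unknown fields: {', '.join(unknown)}")
    return [{f: item[f] for f in requested if f in item} for item in items]


products = [
    {"id": 1, "name": "Laptop", "price": 999, "sku": "LAP-001", "categoryId": 5},
    {"id": 2, "name": "Mouse", "price": 29, "sku": "MOU-001", "categoryId": 7},
]
slim = select_fields(products, "id,name,price")
```

Validating against an allow-list also prevents clients from requesting internal attributes that were never meant to be exposed.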
Compression
API responses are text (JSON). Text compresses extremely well.
GZIP or Brotli compression reduces payload size by 60–80%.
CLIENT REQUEST:
GET /products
Accept-Encoding: gzip, br
SERVER RESPONSE:
HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: application/json
Body: [compressed binary data]
Size comparison for a 100KB JSON response:
Uncompressed: 100 KB
GZIP: 20 KB (80% smaller)
Brotli: 15 KB (85% smaller)
Decompression time on the client: a few milliseconds.
Bandwidth saving: significant, especially on mobile networks.
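The negotiation can be sketched with the standard library's `gzip` module. `maybe_compress` is a hypothetical helper, the sample payload is made up, and the header parsing is deliberately simplified (it ignores `q=` weights). Most deployments let the web server or a framework middleware do this instead.

```python
import gzip
import json

# Made-up payload with repetitive JSON, similar to a product listing
payload = json.dumps([
    {"id": i, "name": f"Product {i}", "price": 9.99,
     "description": "A sample product description. " * 5}
    for i in range(500)
]).encode("utf-8")


def maybe_compress(body: bytes, accept_encoding: str = ""):
    """Return (body, headers), gzip-compressing if the client accepts it."""
    encodings = [e.split(";")[0].strip().lower()
                 for e in accept_encoding.split(",")]
    headers = {"Vary": "Accept-Encoding"}  # caches must key on this header
    if "gzip" in encodings:
        headers["Content-Encoding"] = "gzip"
        return gzip.compress(body), headers
    return body, headers


compressed, gzip_headers = maybe_compress(payload, "gzip, br")
savings = 1 - len(compressed) / len(payload)  # fraction of bytes saved
```

The actual ratio depends on how repetitive the JSON is, so the percentages above are best treated as typical rather than guaranteed.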
Asynchronous Processing for Slow Operations
PROBLEM: Client requests report generation (takes 30 seconds).
If server keeps connection open for 30 seconds:
→ Client may timeout
→ Server ties up a thread for 30 seconds
→ Scales poorly under multiple simultaneous requests
SOLUTION: Accept and process asynchronously.
Step 1: Client submits job
POST /reports
Body: { "type": "sales", "dateRange": "2024-Q1" }
Step 2: Server immediately responds:
HTTP/1.1 202 Accepted
{
"jobId": "job_789",
"status": "processing",
"statusUrl": "/reports/job_789/status"
}
Step 3: Background worker processes the report (30 seconds)
Step 4: Client polls for status
GET /reports/job_789/status
→ { "status": "processing", "progress": 45 }
Step 5: Report complete
GET /reports/job_789/status
→ { "status": "complete", "downloadUrl": "/reports/job_789/download" }
Step 6: Client downloads
GET /reports/job_789/download
→ [full report data]
202 Accepted = "I received it. Processing in background."
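The submit-then-poll flow can be sketched with an in-memory job table. Everything here is an illustrative assumption: the `jobs` dictionary stands in for a durable job store, `run_job` stands in for a background worker, and the ID counter starts at 789 only to mirror the example above.

```python
import itertools

jobs = {}                          # job_id -> job record (stand-in for a real store)
_job_ids = itertools.count(789)    # hypothetical ID generator


def submit_report(params: dict):
    """POST /reports: record the job and answer immediately with 202."""
    job_id = f"job_{next(_job_ids)}"
    jobs[job_id] = {"status": "processing", "params": params, "result": None}
    return 202, {
        "jobId": job_id,
        "status": "processing",
        "statusUrl": f"/reports/{job_id}/status",
    }


def run_job(job_id: str) -> None:
    """What a background worker would do once it picks up the job."""
    job = jobs[job_id]
    job["result"] = f"{job['params']['type']} report for {job['params']['dateRange']}"
    job["status"] = "complete"


def get_status(job_id: str):
    """GET /reports/{job_id}/status"""
    job = jobs.get(job_id)
    if job is None:
        return 404, {"error": "job_not_found"}
    body = {"status": job["status"]}
    if job["status"] == "complete":
        body["downloadUrl"] = f"/reports/{job_id}/download"
    return 200, body
```

The request handler never does the slow work itself; it only records the job, which is why it can respond in milliseconds regardless of how long the report takes.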
Database Query Optimization for APIs
COMMON PERFORMANCE KILLERS:
1. N+1 Query Problem
Getting 10 orders and then making 10 separate queries for each order's user.
BAD (11 queries for 10 orders):
orders = db.query("SELECT * FROM orders LIMIT 10")
for order in orders:
user = db.query("SELECT * FROM users WHERE id = ?", order.user_id)
GOOD (2 queries total):
orders = db.query("SELECT * FROM orders LIMIT 10")
user_ids = [o.user_id for o in orders]
placeholders = ", ".join("?" for _ in user_ids)  # one "?" per ID
users = db.query(f"SELECT * FROM users WHERE id IN ({placeholders})", user_ids)
# Map users to orders in memory
BEST (1 query with JOIN):
SELECT orders.*, users.name, users.email
FROM orders
JOIN users ON orders.user_id = users.id
LIMIT 10
2. Missing Database Indexes
Without index: Full table scan → scans every row
With index: Direct lookup → finds row instantly
Add indexes on columns used in WHERE, JOIN, and ORDER BY.
3. Returning Columns You Don't Need
BAD: SELECT * FROM products
GOOD: SELECT id, name, price FROM products
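The batched and JOIN variants can be tried against an in-memory SQLite database. The schema, sample rows, and function names below are assumptions for illustration; the index on `orders.user_id` and the explicit column lists reflect points 2 and 3 above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE INDEX idx_orders_user_id ON orders(user_id);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"User {i}", f"user{i}@example.com") for i in range(1, 4)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, (i % 3) + 1, 10.0 * i) for i in range(1, 11)],
)


def orders_with_users_batched(conn, limit=10):
    """Two queries in total: one for orders, one for all their users."""
    orders = conn.execute(
        "SELECT id, user_id, total FROM orders ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in orders})
    if not user_ids:
        return []
    placeholders = ", ".join("?" for _ in user_ids)  # one "?" per ID
    users = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
    ).fetchall()
    names = dict(users)  # id -> name, joined to orders in memory
    return [(order_id, names[user_id], total) for order_id, user_id, total in orders]


def orders_with_users_joined(conn, limit=10):
    """One query: let the database do the join."""
    return conn.execute(
        "SELECT orders.id, users.name, orders.total "
        "FROM orders JOIN users ON orders.user_id = users.id "
        "ORDER BY orders.id LIMIT ?",
        (limit,),
    ).fetchall()
```

Both functions return the same rows; the batched form is useful when the two tables live in different services or databases and a JOIN is not available.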
Connection Pooling
Opening a new database connection for every request is expensive. It takes 20–100ms just to establish the connection.
WITHOUT CONNECTION POOL:
Request 1 → Open DB connection (50ms) → Query (5ms) → Close → Response
Request 2 → Open DB connection (50ms) → Query (5ms) → Close → Response
...each request pays 50ms connection cost
WITH CONNECTION POOL:
API Startup: Open 10 database connections. Keep them open.
Request 1 → Borrow connection → Query (5ms) → Return → Response
Request 2 → Borrow connection → Query (5ms) → Return → Response
...no connection overhead. Connections reused.
Pool Size Guidelines:
Database connections use server RAM.
Too few → requests queue waiting for a free connection
Too many → database RAM exhausted
Sweet spot: Usually 10–50 connections depending on DB server specs.
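The borrow-and-return cycle can be sketched with a thread-safe queue. `ConnectionPool` is a hypothetical class built on SQLite purely so the example is self-contained; real applications would normally use the pooling provided by their database driver or ORM.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and lends them out."""

    def __init__(self, size: int = 10, database: str = ":memory:"):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection be used by any worker thread
            self._pool.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks until a connection is free; raises queue.Empty after `timeout`
        conn = self._pool.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)  # always return it, even if the query failed

    def available(self) -> int:
        return self._pool.qsize()


pool = ConnectionPool(size=2)
with pool.connection() as db_conn:
    answer = db_conn.execute("SELECT 1 + 1").fetchone()[0]
```

The `timeout` is what turns "too few connections" into a visible error instead of an indefinitely hanging request.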
CDN for API Responses
CDN (Content Delivery Network) = Servers distributed globally that cache your API responses close to users.
WITHOUT CDN:
User in India requests GET /products
→ Request travels to US-East server (150ms one way)
→ Database query (10ms)
→ Response travels back to India (150ms)
Total: ~310ms
WITH CDN:
User in India requests GET /products (first time)
→ CDN node in Mumbai: MISS → fetches from US-East → caches locally
→ Response: ~315ms (a miss costs about the same as no CDN)
User 2 in India requests GET /products (after CDN cache)
→ CDN node in Mumbai: HIT → serves from Mumbai
→ Response: ~5ms (about 98% faster)
CDNs work for public, cacheable API responses (product lists, public articles, shared reference data). They don't help with private, user-specific, or real-time data.
Performance Monitoring — Know Before Users Complain
KEY METRICS TO TRACK:
Response Time (Latency):
P50 (median): 50% of requests faster than this
P95: 95% of requests faster than this
P99: 99% of requests faster than this
Aim for: P50 < 100ms, P95 < 500ms, P99 < 1000ms
Throughput: Requests per second your API handles
Error Rate: % of requests returning 4xx or 5xx
Cache Hit Rate: % of requests served from cache
ALERTS TO SET UP:
→ P95 latency exceeds 1 second → investigate immediately
→ Error rate exceeds 1% → something is broken
→ Cache hit rate drops below 80% → cache invalidation issue
→ Rate limit hits spike → potential attack or client bug
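Latency percentiles can be computed from raw samples with the nearest-rank method, shown here as a rough sketch. The function name and the synthetic sample data are assumptions; monitoring systems usually compute percentiles from histograms rather than full sample lists.

```python
def percentile(samples: list, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic data: 1ms, 2ms, ... 100ms
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
p95_alert = p95 > 1000              # mirrors the "P95 exceeds 1 second" alert
```

Tracking P95 and P99 alongside the median matters because a healthy average can hide a slow tail that a meaningful share of users still experience.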
Key Points
- Rate limiting protects your API from brute-force attacks, DoS floods, and accidental client bugs. Use stricter limits on authentication endpoints and relaxed limits on read endpoints.
- The token bucket algorithm suits APIs with legitimate traffic bursts. The sliding window algorithm suits most standard API use cases.
- Always return 429 Too Many Requests with a Retry-After header so clients know when to try again.
- Cache-Control: public, max-age=3600 lets CDNs and browsers cache shared content. Cache-Control: no-store prevents caching for sensitive data like bank balances.
- ETags enable conditional requests: when content hasn't changed, the server returns 304 Not Modified with an empty body, saving bandwidth.
- Use cursor-based pagination for large datasets. Offset pagination slows down dramatically as page numbers grow.
- Long operations should return 202 Accepted immediately and process in the background. Provide a status URL for polling.
- The N+1 query problem is the most common API performance killer. Batch your database queries or use JOINs.
