REST API Rate Limiting, Caching, and Performance

A REST API that works correctly on day one can fail catastrophically on day 100 when traffic grows. Performance is not an afterthought — it defines the user experience, the server costs, and the resilience of your entire system. This page covers three powerful tools: rate limiting to protect your API from abuse, caching to serve responses faster while reducing server load, and performance patterns that keep your API responsive under pressure.

Part 1: Rate Limiting

What Rate Limiting Does

Rate limiting sets a maximum number of requests a client can make within a defined time window. Think of it like a highway tollbooth — only a fixed number of cars can pass per minute. When the limit is reached, additional requests are held back or rejected until the window resets.

  WITHOUT RATE LIMITING:

  Normal user:  GET /search 10x per minute  → Server handles fine
  Attacker:     GET /search 50,000x per minute → Server crashes

  Effect on normal users during an attack:
  Error: 503 Service Unavailable (server overwhelmed)

  WITH RATE LIMITING:

  Normal user:  GET /search 10x per minute   → ✓ All served
  Attacker:     GET /search attempt 50,000x  → After limit hit:
                                               429 Too Many Requests
  Server: Calm. Normal users unaffected.

Rate Limiting Algorithms

Algorithm 1: Fixed Window

  Window: 60 seconds | Limit: 100 requests

  Time 0:00 ──────────────────────────── Time 1:00
  Requests: 1, 2, 3, ... 100  →  101st blocked (429)
  Time 1:00: Counter resets to 0
  Requests: 1, 2, 3, ...

  Problem with Fixed Window:
  0:50  →  100 requests (all allowed, limit reached)
  1:10  →  100 more requests (window reset at 1:00, all allowed)
  Result: 200 requests in a 20-second span straddling the window boundary!

  Simple to implement. Not great at preventing bursts.
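
The boundary burst can be reproduced with a minimal fixed-window counter. This is a sketch for a single client, not a production limiter: the class name FixedWindowLimiter and the explicit `now` argument (seconds, passed in so the example is deterministic) are assumptions made for illustration.

```python
class FixedWindowLimiter:
    """Counts requests per fixed window; the counter resets at each boundary."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None  # index of the window currently being counted
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # A new window has started: the previous count is forgotten.
            self.current_window = window
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
# 100 requests at 0:50 (end of the first window), 100 more at 1:10.
late_burst = sum(limiter.allow(50.0) for _ in range(100))
early_burst = sum(limiter.allow(70.0) for _ in range(100))
print(late_burst + early_burst)  # 200 accepted within 20 seconds
```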

Algorithm 2: Sliding Window

  Window: 60 seconds | Limit: 100 requests
  (counts requests in the rolling last 60 seconds, not fixed intervals)

  At 10:30:45, system counts requests from 10:29:45 onward.
  At 10:30:46, system counts requests from 10:29:46 onward.
  The window slides forward continuously.

  No boundary burst problem.
  More accurate. Slightly more expensive to compute.
  Best choice for most APIs.
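
One way to implement a sliding window is to keep a log of recent request timestamps. The sketch below assumes a single client and an in-memory deque; SlidingWindowLimiter and the explicit `now` argument are illustrative names, and real deployments often use an approximation (or a shared store) to avoid keeping every timestamp.

```python
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: remembers the timestamp of each accepted request."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_seconds
        # Drop timestamps that have slid out of the rolling window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


sliding = SlidingWindowLimiter(limit=100, window_seconds=60)
at_0_50 = sum(sliding.allow(50.0) for _ in range(100))
at_1_10 = sum(sliding.allow(70.0) for _ in range(100))
print(at_0_50, at_1_10)  # 100 0: the second burst is still inside the window
```

Unlike the fixed window, the burst at 1:10 is rejected because the 100 requests from 0:50 are still within the last 60 seconds.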

Algorithm 3: Token Bucket

  CONCEPT:
  A bucket holds tokens (max: 100 tokens).
  Tokens refill at a rate of 10/second.
  Each request consumes 1 token.
  If bucket is empty → request rejected.

  Bucket State Over Time:
  Time 0:    [100 tokens] ← fully refilled
  10 requests arrive: [90 tokens]
  1 second passes:    [100 tokens] ← refilled (capped at max)
  100 requests burst: [0 tokens]
  101st request:      REJECTED (429)
  10 seconds pass:    [100 tokens] ← refilled

  ADVANTAGE: Allows short bursts of traffic naturally.
             Clients can "save up" tokens for busy periods.
  USE CASE:  APIs where occasional traffic spikes are legitimate.
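
The bucket state trace above can be sketched as follows. TokenBucket is a hypothetical class for one client; time is passed in explicitly as `now` (seconds) so the numbers match the trace.

```python
class TokenBucket:
    """Token bucket: refills continuously, each request spends one token."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full
        self.last_refill = now

    def allow(self, now: float) -> bool:
        elapsed = now - self.last_refill
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True


bucket = TokenBucket(capacity=100, refill_per_second=10)
served = sum(bucket.allow(0.0) for _ in range(10))   # 90 tokens left
burst = sum(bucket.allow(1.0) for _ in range(101))   # refilled to 100, then drained
print(served, burst)  # 10 100: the 101st request in the burst is rejected
```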

Algorithm 4: Leaky Bucket

  CONCEPT:
  Requests enter the bucket from the top (any rate).
  Requests leave the bucket (processed) at a fixed rate.
  Bucket overflows (requests rejected) when full.

  [Client sends] ─────────────────────────────────
  100 req/sec →  │      BUCKET (capacity: 50)    │
                 │                               │
                 │  ← 10 req/sec processed →     │
                 └───────────────────────────────┘
                 Overflow (>50 queued) → 429 rejected

  ADVANTAGE: Smooths out traffic spikes. No bursts reach the server.
  USE CASE:  Background processing queues. Payment processors.
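
A leaky bucket can be sketched as a level that drains at a fixed rate. The version below only tracks how full the bucket is and decides accept or reject; the worker that actually processes queued requests at 10 per second is left out. LeakyBucket and `offer` are illustrative names, not a library API.

```python
class LeakyBucket:
    """Leaky bucket: requests fill the bucket, which drains at a fixed rate."""

    def __init__(self, capacity: float, leak_per_second: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.level = 0.0  # how many requests are currently queued
        self.last_leak = now

    def offer(self, now: float) -> bool:
        elapsed = now - self.last_leak
        # Drain whatever was processed since the last call.
        self.level = max(0.0, self.level - elapsed * self.leak_per_second)
        self.last_leak = now
        if self.level + 1 > self.capacity:
            return False  # bucket full: overflow, respond with 429
        self.level += 1
        return True


leaky = LeakyBucket(capacity=50, leak_per_second=10)
accepted_now = sum(leaky.offer(0.0) for _ in range(100))
accepted_later = sum(leaky.offer(1.0) for _ in range(100))
print(accepted_now, accepted_later)  # 50 10
```

After one second only 10 slots have drained, so only 10 of the next 100 requests fit, no matter how fast the client sends.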

What to Rate Limit

  RATE LIMIT BY:

  IP Address    → Default. Catches anonymous abuse.
                  Drawback: Shared IPs (offices, NAT) affect multiple users.

  User/API Key  → More precise. Authenticated users get individual quotas.
                  One user's abuse doesn't affect others.

  Endpoint      → Different limits per route based on cost and sensitivity.

  Combination   → Best: per-user AND per-IP AND per-endpoint

  ENDPOINT-SPECIFIC LIMITS (recommended):
  POST /auth/login        →   5 / minute  (prevent password brute-force)
  POST /auth/register     →   3 / minute  (prevent bot registrations)
  POST /password-reset    →   3 / hour    (prevent email flooding)
  GET  /products          → 500 / minute  (lightweight read)
  POST /orders            →  10 / minute  (meaningful transaction)
  GET  /search            → 100 / minute  (moderate)
  POST /bulk-upload       →   2 / minute  (heavy operation)

Rate Limit Response Headers

  200 OK Response with Rate Limit Info:
  HTTP/1.1 200 OK
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 37
  X-RateLimit-Reset: 1720900060  ← Unix timestamp when window resets

  429 Too Many Requests Response:
  HTTP/1.1 429 Too Many Requests
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 0
  X-RateLimit-Reset: 1720900060
  Retry-After: 23  ← seconds until client can retry

  Body:
  {
    "error": "rate_limit_exceeded",
    "message": "Too many requests. Please wait 23 seconds.",
    "retryAfter": 23
  }
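
The two responses above could be assembled roughly like this. The function names and the (status, headers, body) tuple are assumptions for illustration; the X-RateLimit-* names are a widely used convention rather than a formal standard, while Retry-After is a standard HTTP header.

```python
import json
import math


def rate_limit_headers(limit: int, remaining: int, reset_at: int) -> dict:
    """Headers attached to every response so clients can track their quota."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),  # Unix timestamp of the next reset
    }


def too_many_requests(limit: int, reset_at: int, now: float):
    """Build a 429 response with Retry-After in whole seconds."""
    retry_after = max(0, math.ceil(reset_at - now))
    headers = rate_limit_headers(limit, 0, reset_at)
    headers["Retry-After"] = str(retry_after)
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Please wait {retry_after} seconds.",
        "retryAfter": retry_after,
    })
    return 429, headers, body


status, headers, body = too_many_requests(limit=100, reset_at=1720900060, now=1720900037)
print(status, headers["Retry-After"])  # 429 23
```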

Part 2: Caching

Caching stores the result of an expensive operation so future requests can be served from the stored result instead of repeating the work. For an API, this means sending a stored response instead of hitting the database again. Done correctly, caching can make repeated requests an order of magnitude or more faster, for example a 1ms cache read instead of a 50ms database query.

The Coffee Shop Analogy

  WITHOUT CACHING:
  Customer: "What's today's menu?"
  Barista:  [Walks to kitchen, checks with chef, walks back] "Here it is."
  Next customer: "What's today's menu?"
  Barista:  [Walks to kitchen again...] 
  (Repeated for every single customer — slow and exhausting)

  WITH CACHING:
  Customer 1: "What's today's menu?"
  Barista: [Walks to kitchen once] "Here it is." [Posts menu on board]
  Customers 2–200: "What's today's menu?"
  Barista: [Points to board] "There it is." (Instant, no kitchen trip)
  
  Cache invalidation: When the menu changes, tear down the old board
  and post a new one.

HTTP Caching Headers

HTTP has built-in caching mechanisms. Using them correctly means browsers and CDNs cache your responses automatically. While a cached copy is still fresh, your server does not even receive those requests.

Cache-Control Header

  Cache-Control: max-age=3600
  → "This response is fresh for 3600 seconds (1 hour)."
  → Client or CDN can serve it without asking the server for 1 hour.

  Cache-Control: no-cache
  → "Always revalidate with the server before using cached copy."
  → Client can cache it but MUST check freshness each time.

  Cache-Control: no-store
  → "Do not cache this at all. Never save a copy."
  → Use for sensitive data: bank balances, private messages.

  Cache-Control: private
  → "Only the end user's browser may cache this, not CDNs."
  → Use for user-specific data: profile info, shopping cart.

  Cache-Control: public
  → "Any cache (CDN, proxy, browser) may store this."
  → Use for shared content: product catalog, public articles.

  COMMON COMBINATIONS:
  Public static asset:  Cache-Control: public, max-age=86400
  User profile:         Cache-Control: private, max-age=300
  Bank balance:         Cache-Control: no-store
  Product list:         Cache-Control: public, max-age=3600
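
The common combinations can be captured in a small helper so that every handler picks its policy the same way. This is only a sketch; `cache_control` and its parameters are hypothetical names, not part of any framework.

```python
def cache_control(*, sensitive: bool = False, user_specific: bool = False, max_age: int = 0) -> str:
    """Return a Cache-Control value for the kind of data being served."""
    if sensitive:
        # Never stored anywhere: bank balances, private messages.
        return "no-store"
    scope = "private" if user_specific else "public"
    return f"{scope}, max-age={max_age}"


print(cache_control(max_age=86400))                    # public, max-age=86400
print(cache_control(user_specific=True, max_age=300))  # private, max-age=300
print(cache_control(sensitive=True))                   # no-store
```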

ETag — Conditional Requests

  ETag is a fingerprint of the response content.
  When content changes, the ETag changes.

  FLOW:

  Step 1: Client requests product list
  GET /products
  → Response:
     HTTP/1.1 200 OK
     ETag: "abc123xyz"
     Cache-Control: max-age=60
     Body: [list of products]

  Step 2: 60 seconds later, cache expires. Client revalidates:
  GET /products
  If-None-Match: "abc123xyz"  ← "I have this version. Still valid?"

  Step 3a: Products UNCHANGED — server responds:
  HTTP/1.1 304 Not Modified
  ETag: "abc123xyz"
  Body: (empty — no data sent!)
  → Client uses its cached copy. Only headers are sent, so the bandwidth for the body is saved.

  Step 3b: Products CHANGED — server responds:
  HTTP/1.1 200 OK
  ETag: "def456uvw"  ← new ETag
  Body: [updated product list]
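
On the server side, the flow above might look like the sketch below. It assumes the ETag is derived from a hash of the serialized body and that the client sends a single ETag in If-None-Match; real handlers also need to deal with lists of ETags, weak validators, and `*`. The function names are illustrative.

```python
import hashlib
import json


def make_etag(body: bytes) -> str:
    """Fingerprint of the response body, quoted as the ETag syntax requires."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def get_products(products: list, if_none_match: str = None):
    """Return (status, headers, body), answering 304 when the client's copy is current."""
    body = json.dumps(products, sort_keys=True).encode("utf-8")
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "max-age=60"}
    if if_none_match == etag:
        return 304, headers, b""  # no body: the client reuses its cached copy
    return 200, headers, body


catalog = [{"id": 1, "name": "Laptop"}, {"id": 2, "name": "Mouse"}]
status_first, headers_first, _ = get_products(catalog)
status_again, _, body_again = get_products(catalog, if_none_match=headers_first["ETag"])
print(status_first, status_again, len(body_again))  # 200 304 0
```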

Last-Modified

  Server includes last modification timestamp:
  Last-Modified: Mon, 15 Jan 2024 10:30:00 GMT

  Client revalidates with:
  If-Modified-Since: Mon, 15 Jan 2024 10:30:00 GMT

  Server:
  → Not changed → 304 Not Modified (no body)
  → Changed     → 200 OK (new body + new Last-Modified)

  ETag is preferred over Last-Modified because:
  - Last-Modified has one-second resolution, so two changes within the
    same second are indistinguishable
  - A strong ETag (e.g., a content hash) changes whenever the content changes

Server-Side Caching

  HTTP caching helps the client and CDN.
  Server-side caching speeds up your database queries.

  WITHOUT SERVER CACHE:
  Request → API → Database Query (50ms) → Response
  1,000 requests/second = 1,000 database queries/second

  WITH SERVER CACHE (Redis):
  Request 1  → API → Cache MISS → Database (50ms) → Store in Redis → Response
  Request 2  → API → Cache HIT  → Redis (1ms) → Response  (50x faster!)
  Requests 3-999 → Cache HIT → Redis (1ms) each

  DATABASE LOAD COMPARISON:
  Without cache: 1,000 queries/second
  With cache (90% hit rate): 100 queries/second
  ↑ 10x reduction in database load

Cache Invalidation Strategies

  The hardest problem in caching: knowing when to clear stale data.

  STRATEGY 1: TTL (Time-To-Live)
  Cache entry expires after fixed time (e.g., 5 minutes).
  Simple. Stale data possible for up to TTL duration.
  Good for: Product catalog, public stats, pricing.

  STRATEGY 2: Event-Based Invalidation
  Clear cache immediately when data changes.
  User updates profile → delete user:456 cache key
  Admin updates product → delete products:list cache key
  Accurate. Requires code discipline.
  Good for: User data, inventory levels, real-time data.

  STRATEGY 3: Cache-Aside (Lazy Loading)
  1. Request comes in
  2. Check cache → HIT: return cached data
  3. Cache MISS: fetch from DB, store in cache, return data
  Data only enters cache when requested.
  Good for: Read-heavy data with unpredictable access patterns.

  STRATEGY 4: Write-Through
  Every write updates the database AND the cache simultaneously.
  Cache always current.
  Slightly slower writes (two operations).
  Good for: Data read very frequently right after being written.

  CACHE KEY DESIGN:
  user:profile:456            → user 456's profile
  products:list:page:1        → page 1 of products
  search:laptops:sort:price   → search results for laptops sorted by price
  order:1001                  → order 1001 details
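
Cache-aside, TTL expiry, and event-based invalidation can be combined in a few lines. The sketch below uses an in-memory dictionary as a stand-in for Redis and a fake database dictionary; TTLCache, FAKE_DB, and the helper functions are all hypothetical, and the 300-second TTL is an arbitrary choice.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache server: each entry expires after its TTL."""

    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # TTL elapsed: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds: float) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key) -> None:
        self._entries.pop(key, None)


FAKE_DB = {456: {"id": 456, "name": "Ada"}}
db_reads = {"count": 0}


def get_profile(cache: TTLCache, user_id: int) -> dict:
    key = f"user:profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                          # cache HIT
    db_reads["count"] += 1                     # cache MISS: go to the database
    profile = dict(FAKE_DB[user_id])
    cache.set(key, profile, ttl_seconds=300)
    return profile


def update_profile(cache: TTLCache, user_id: int, **changes) -> None:
    FAKE_DB[user_id].update(changes)
    cache.delete(f"user:profile:{user_id}")    # event-based invalidation


fake_now = [0.0]
cache = TTLCache(clock=lambda: fake_now[0])
get_profile(cache, 456)
get_profile(cache, 456)
print(db_reads["count"])  # 1: the second call was served from the cache
```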

Part 3: API Performance Patterns

Pagination — Never Return Everything at Once

  WRONG — Returns all 50,000 products:
  GET /products
  Response: [50,000 items, 45MB payload, 8 seconds load time]

  RIGHT — Paginated responses:
  GET /products?page=1&limit=20
  Response:
  {
    "data": [20 products],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 50000,
      "totalPages": 2500
    }
  }

  THREE PAGINATION STYLES:

  1. Offset Pagination (most common):
     GET /products?offset=40&limit=20
     Simple. Suffers from "page drift" if items are added/removed mid-browse.

  2. Cursor-Based Pagination (best for real-time data):
     GET /products?cursor=eyJpZCI6NDB9&limit=20
     Uses an opaque cursor pointing to a position in the dataset.
     Stable even when new items are inserted.
     Used by Twitter, Facebook, Stripe.

  3. Keyset Pagination (best performance):
     GET /products?after_id=40&limit=20
     Uses the last-seen ID as the starting point.
     Very fast even on tables with millions of rows.
     No OFFSET scan in the database (offset gets slower as page number grows).
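
Cursor-based and keyset pagination are often combined: the opaque cursor simply encodes the last-seen key. The sketch below uses SQLite and assumes the cursor is URL-safe base64 of a small JSON object, which happens to be how the example cursor eyJpZCI6NDB9 decodes (to {"id":40}). Real APIs may sign or encrypt their cursors; the table and function names are illustrative.

```python
import base64
import json
import sqlite3


def encode_cursor(last_id: int) -> str:
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor))["id"]


def fetch_page(conn, cursor=None, limit=20):
    """Keyset query: seek past the last-seen ID instead of using OFFSET."""
    after_id = decode_cursor(cursor) if cursor else 0
    rows = conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()
    # A short page means the end was reached; a full final page yields one
    # extra cursor whose next page is empty.
    next_cursor = encode_cursor(rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(i, f"Product {i}") for i in range(1, 101)],
)

page, next_cursor = fetch_page(conn, cursor="eyJpZCI6NDB9", limit=20)
print(page[0][0], page[-1][0])  # 41 60
```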

Field Selection — Let Clients Request What They Need

  PROBLEM: Mobile app needs only product name and price.
           API always returns all 20 fields per product.
           Unnecessary data wastes bandwidth and parse time.

  SOLUTION: Support field selection via query parameter.

  GET /products?fields=id,name,price

  Response (only requested fields):
  [
    { "id": 1, "name": "Laptop", "price": 999 },
    { "id": 2, "name": "Mouse",  "price": 29 }
  ]

  Without field selection:
  [
    {
      "id": 1, "name": "Laptop", "price": 999,
      "description": "...(500 chars)...",
      "sku": "LAP-001",
      "categoryId": 5,
      "weight": 2.1,
      "dimensions": {...},
      "images": [...],
      "inventory": {...},
      "reviews": [...]
    },
    ...
  ]

  Field selection can reduce payload size by 60-90% when clients need only a few of the available fields.
  GraphQL is another approach that gives clients full control over fields.
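
A minimal sketch of handling the `fields` parameter is shown below. It assumes an allow-list of selectable fields (ALLOWED_FIELDS is a made-up subset) so that clients cannot request internal columns, and it rejects unknown names instead of silently ignoring them, which is one possible design choice.

```python
ALLOWED_FIELDS = {"id", "name", "price", "description", "sku"}


def parse_fields(raw):
    """Turn 'id,name,price' into a validated list; None means 'all fields'."""
    if not raw:
        return None
    requested = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = [name for name in requested if name not in ALLOWED_FIELDS]
    if unknown:
        raise ValueError(f"Unknown fields: {', '.join(unknown)}")
    return requested


def select_fields(items, fields):
    """Keep only the requested keys of each item."""
    if fields is None:
        return items
    return [{name: item[name] for name in fields if name in item} for item in items]


products = [
    {"id": 1, "name": "Laptop", "price": 999, "sku": "LAP-001", "description": "..."},
    {"id": 2, "name": "Mouse", "price": 29, "sku": "MOU-001", "description": "..."},
]
slim = select_fields(products, parse_fields("id,name,price"))
print(slim[0])  # {'id': 1, 'name': 'Laptop', 'price': 999}
```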

Compression

  API responses are text (JSON). Text compresses extremely well.
  GZIP or Brotli compression often reduces JSON payload size by 60–85%,
  depending on how repetitive the content is.

  CLIENT REQUEST:
  GET /products
  Accept-Encoding: gzip, br

  SERVER RESPONSE:
  HTTP/1.1 200 OK
  Content-Encoding: gzip
  Content-Type: application/json
  Body: [compressed binary data]

  Size comparison for a 100KB JSON response:
  Uncompressed: 100 KB
  GZIP:          20 KB  (80% smaller)
  Brotli:        15 KB  (85% smaller)

  Decompression time on the client: a few milliseconds.
  Bandwidth saving: significant, especially on mobile networks.
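
Most web servers and frameworks handle compression for you, but the negotiation is simple enough to sketch with the standard library. The example below supports gzip only (Brotli is not in the Python standard library) and ignores q-values in Accept-Encoding, which a real implementation should honor. `encode_body` is a hypothetical helper.

```python
import gzip
import json


def encode_body(body: bytes, accept_encoding: str):
    """Compress the body with gzip when the client advertises support for it."""
    offered = {part.split(";")[0].strip().lower() for part in accept_encoding.split(",")}
    if "gzip" in offered:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), headers
    return body, {}


sample = [
    {"id": i, "name": f"Product {i}", "price": 10 + i, "description": "A sample product."}
    for i in range(1000)
]
raw = json.dumps(sample).encode("utf-8")
compressed, headers = encode_body(raw, "gzip, br")
print(len(raw), len(compressed), headers["Content-Encoding"])
```

The exact sizes depend on the data; repetitive JSON like this sample compresses far better than already-compressed content such as images.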

Asynchronous Processing for Slow Operations

  PROBLEM: Client requests report generation (takes 30 seconds).
           If server keeps connection open for 30 seconds:
           → Client may timeout
           → Server ties up a thread for 30 seconds
           → Scales poorly under multiple simultaneous requests

  SOLUTION: Accept and process asynchronously.

  Step 1: Client submits job
  POST /reports
  Body: { "type": "sales", "dateRange": "2024-Q1" }

  Step 2: Server immediately responds:
  HTTP/1.1 202 Accepted
  {
    "jobId": "job_789",
    "status": "processing",
    "statusUrl": "/reports/job_789/status"
  }

  Step 3: Background worker processes the report (30 seconds)

  Step 4: Client polls for status
  GET /reports/job_789/status
  → { "status": "processing", "progress": 45 }

  Step 5: Report complete
  GET /reports/job_789/status
  → { "status": "complete", "downloadUrl": "/reports/job_789/download" }

  Step 6: Client downloads
  GET /reports/job_789/download
  → [full report data]

  202 Accepted = "I received it. Processing in background."
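
The submit-then-poll flow can be sketched with a thread pool and an in-memory job registry. This is a single-process illustration only: a real service would typically use a durable queue and store job state somewhere that survives restarts. `build_report`, `submit_report`, and `report_status` are hypothetical names, and progress reporting is omitted.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job_id -> Future


def build_report(params: dict) -> dict:
    # Stand-in for the slow work (the 30-second report in the example above).
    return {"type": params["type"], "dateRange": params["dateRange"], "rows": []}


def submit_report(params: dict):
    """POST /reports: queue the job and answer immediately with 202."""
    job_id = f"job_{uuid.uuid4().hex[:8]}"
    jobs[job_id] = executor.submit(build_report, params)
    return 202, {
        "jobId": job_id,
        "status": "processing",
        "statusUrl": f"/reports/{job_id}/status",
    }


def report_status(job_id: str):
    """GET /reports/{job_id}/status: report where the job stands."""
    future = jobs.get(job_id)
    if future is None:
        return 404, {"error": "unknown_job"}
    if not future.done():
        return 200, {"status": "processing"}
    if future.exception() is not None:
        return 200, {"status": "failed"}
    return 200, {"status": "complete", "downloadUrl": f"/reports/{job_id}/download"}


status, accepted = submit_report({"type": "sales", "dateRange": "2024-Q1"})
print(status, accepted["statusUrl"])
```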

Database Query Optimization for APIs

  COMMON PERFORMANCE KILLERS:

  1. N+1 Query Problem
  Fetching 10 orders, then running one extra query per order to load that order's user.

  BAD (11 queries for 10 orders):
  orders = db.query("SELECT * FROM orders LIMIT 10")
  for order in orders:
      user = db.query("SELECT * FROM users WHERE id = ?", order.user_id)

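  GOOD (2 queries total):
  orders = db.query("SELECT * FROM orders LIMIT 10")
  user_ids = [o.user_id for o in orders]
  # One "?" per ID; most drivers will not expand a list bound to a single "?"
  placeholders = ", ".join("?" for _ in user_ids)
  users = db.query(f"SELECT * FROM users WHERE id IN ({placeholders})", *user_ids)
  # Map users to orders in memory
  users_by_id = {u.id: u for u in users}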

  BEST (1 query with JOIN):
  SELECT orders.*, users.name, users.email
  FROM orders
  JOIN users ON orders.user_id = users.id
  LIMIT 10

  2. Missing Database Indexes
  Without index: Full table scan → reads every row
  With index: Index lookup → finds matching rows in roughly O(log n) steps

  Add indexes on columns used in WHERE, JOIN, and ORDER BY.

  3. Returning Columns You Don't Need
  BAD:  SELECT * FROM products
  GOOD: SELECT id, name, price FROM products
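
The three approaches to the N+1 problem can be compared directly by counting queries. The sketch below runs against an in-memory SQLite database; CountingDB is a hypothetical wrapper that exists only to count how many statements each approach issues.

```python
import sqlite3


class CountingDB:
    """Thin wrapper that counts how many queries are executed."""

    def __init__(self, conn) -> None:
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(i, f"User {i}") for i in range(1, 6)])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, (i % 5) + 1) for i in range(1, 11)])


def n_plus_one(db):
    orders = db.query("SELECT id, user_id FROM orders ORDER BY id LIMIT 10")
    return [
        (oid, db.query("SELECT name FROM users WHERE id = ?", (uid,))[0][0])
        for oid, uid in orders
    ]


def batched(db):
    orders = db.query("SELECT id, user_id FROM orders ORDER BY id LIMIT 10")
    user_ids = sorted({uid for _, uid in orders})
    placeholders = ", ".join("?" for _ in user_ids)
    users = dict(db.query(f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids))
    return [(oid, users[uid]) for oid, uid in orders]


def joined(db):
    return db.query(
        "SELECT orders.id, users.name FROM orders "
        "JOIN users ON orders.user_id = users.id ORDER BY orders.id LIMIT 10"
    )


counts = {}
results = {}
for approach in (n_plus_one, batched, joined):
    db = CountingDB(conn)
    results[approach.__name__] = approach(db)
    counts[approach.__name__] = db.queries
print(counts)  # {'n_plus_one': 11, 'batched': 2, 'joined': 1}
```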

Connection Pooling

  Opening a new database connection for every request is expensive.
  It can take 20–100ms just to establish the connection (network handshake, TLS, authentication).

  WITHOUT CONNECTION POOL:
  Request 1 → Open DB connection (50ms) → Query (5ms) → Close → Response
  Request 2 → Open DB connection (50ms) → Query (5ms) → Close → Response
  ...each request pays 50ms connection cost

  WITH CONNECTION POOL:
  API Startup: Open 10 database connections. Keep them open.

  Request 1 → Borrow connection → Query (5ms) → Return → Response
  Request 2 → Borrow connection → Query (5ms) → Return → Response
  ...no connection overhead. Connections reused.

  Pool Size Guidelines:
  Database connections use server RAM.
  Too few → requests queue waiting for a free connection
  Too many → database RAM exhausted
  Sweet spot: Usually 10–50 connections depending on DB server specs.
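
In practice you would use the pool built into your database driver or ORM, but the borrow-and-return idea fits in a few lines. The sketch below is an assumption-laden toy: ConnectionPool is a hypothetical class, SQLite stands in for a networked database, and there is no health checking or reconnection.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and lends them out."""

    def __init__(self, size: int, connect) -> None:
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks while all connections are borrowed; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


opened = []


def connect():
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    opened.append(conn)
    return conn


pool = ConnectionPool(size=2, connect=connect)
for _ in range(100):
    with pool.connection() as conn:
        conn.execute("SELECT 1").fetchone()
print(len(opened))  # 2: one hundred queries reused the same two connections
```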

CDN for API Responses

  CDN (Content Delivery Network) = Servers distributed globally
  that cache your API responses close to users.

  WITHOUT CDN:
  User in India requests GET /products
  → Request travels to US-East server (150ms one way)
  → Database query (10ms)
  → Response travels back to India (150ms)
  Total: ~310ms

  WITH CDN:
  User in India requests GET /products (first time)
  → CDN node in Mumbai: MISS → fetches from US-East → caches locally
  → Response: ~315ms (about the same as without a CDN)

  User 2 in India requests GET /products (after CDN cache)
  → CDN node in Mumbai: HIT → serves from Mumbai
  → Response: ~5ms (about 98% faster!)

  CDNs work for public, cacheable API responses (product lists,
  public articles, shared reference data).
  They don't help with private, user-specific, or real-time data.

Performance Monitoring — Know Before Users Complain

  KEY METRICS TO TRACK:

  Response Time (Latency):
  P50 (median): 50% of requests faster than this
  P95: 95% of requests faster than this
  P99: 99% of requests faster than this

  Aim for:
  P50 < 100ms
  P95 < 500ms
  P99 < 1000ms

  Throughput: Requests per second your API handles
  Error Rate: % of requests returning 4xx or 5xx
  Cache Hit Rate: % of requests served from cache
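
Percentiles are usually computed by your monitoring system, but the definition is easy to check by hand. The sketch below uses the nearest-rank method, which is one of several common definitions (tools that interpolate can report slightly different values). The sample latencies are made up.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = list(range(1, 101))  # 100 hypothetical samples: 1ms .. 100ms
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)  # 50 95 99
```

Averages hide slow outliers, which is why the targets above are stated in percentiles: a handful of multi-second requests barely moves the mean but shows up clearly in P99.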

  ALERTS TO SET UP:
  → P95 latency exceeds 1 second → investigate immediately
  → Error rate exceeds 1% → something is broken
  → Cache hit rate drops below 80% → cache invalidation issue
  → Rate limit hits spike → potential attack or client bug

Key Points

Rate limiting protects your API from brute-force attacks, DoS floods, and accidental client bugs. Use stricter limits on authentication endpoints and relaxed limits on read endpoints.
The token bucket algorithm suits APIs with legitimate traffic bursts. The sliding window algorithm suits most standard API use cases.
Always return 429 Too Many Requests with a Retry-After header so clients know when to try again.
Cache-Control: public, max-age=3600 lets CDNs and browsers cache shared content. Cache-Control: no-store prevents caching for sensitive data like bank balances.
ETags enable conditional requests — when content hasn't changed, the server returns 304 Not Modified with an empty body, saving bandwidth.
Use cursor-based pagination for large datasets. Offset pagination slows down dramatically as page numbers grow.
Long operations should return 202 Accepted immediately and process in the background. Provide a status URL for polling.
The N+1 query problem is one of the most common API performance killers. Batch your database queries or use JOINs.
