System Design: Real-World Case Studies

Real-world case studies bring all system design concepts together into concrete, applied examples. Understanding how well-known systems are designed — and the specific problems they solved — builds the intuition needed to design new systems confidently. This topic walks through four classic system design problems, applying every concept covered in this course.

Case Study 1: Design a URL Shortener (like bit.ly)

Requirements

Functional:

  • Given a long URL, generate a short URL (e.g., bit.ly/abc123)
  • Visiting the short URL redirects to the original long URL
  • Short URLs expire after a configured time period

Non-Functional:

  • 100 million new URLs created per day
  • 10 billion redirects per day (~115,000 redirects/second)
  • Reads (redirects) are 100× more frequent than writes (URL creation)
  • 99.9% uptime requirement

Scale Estimation

Writes: 100M URLs/day = ~1,160 writes/second
Reads:  10B redirects/day = ~115,740 reads/second
Storage per URL: 500 bytes (long URL + metadata)
5-year storage: 100M × 365 × 5 × 500 bytes ≈ 91 TB

Core Design Decisions

Short URL Generation:

Option 1: Hash the long URL
MD5("https://estudy247.com/long-article") → Take first 7 characters
Risk: Collisions (two different URLs could produce same hash)
Solution: Check if short code exists, append counter if collision

Option 2: Unique ID + Base62 encoding
Generate auto-incrementing ID: 12345678
Encode the ID in base62 (0-9, a-z, A-Z): 12345678 → "PNFQ"
7-character base62 codes give 62^7 ≈ 3.5 trillion combinations → roughly 96 years of IDs at 100M URLs/day
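
A minimal sketch of the ID-to-code step in Python, assuming the 0-9/a-z/A-Z alphabet used in the example above; in a real system the integer ID would come from the database or a dedicated ID-generation service.

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def encode_base62(n: int) -> str:
    """Convert an auto-incrementing integer ID into a short base62 code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))      # most significant digit first

print(encode_base62(12345678))            # "PNFQ"
print(encode_base62(62**7 - 1))           # "ZZZZZZZ" -- the largest 7-character code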

Architecture

Client
  ↓
Load Balancer (Round Robin across API servers)
  ↓
API Servers (Stateless, horizontally scalable)
  ↓              ↓
Cache          Database
(Redis:        (PostgreSQL)
 short→long    short_url | long_url | created_at | expires_at
 mapping)
  ↑
  Cache HIT (95%+ for popular links): Return immediately
  Cache MISS: Fetch from DB, store in cache, return

Redirect flow:
GET /abc123
→ Check Redis cache → HIT → Return 301 Redirect to long URL
→ Cache MISS → Query DB → Store in Redis → Return 301 Redirect

Key design insight:
301 (Moved Permanently): Browser caches the redirect → Future clicks bypass the server entirely (but per-click analytics are lost)
302 (Found, i.e. temporary): Server handles every click → Better analytics tracking
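
A minimal sketch of the redirect path, assuming a Flask app in front of Redis; lookup_long_url is a hypothetical stand-in for the PostgreSQL query, and the one-hour cache TTL is an illustrative choice.

import redis
from flask import Flask, abort, redirect

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def lookup_long_url(short_code):
    # Hypothetical stand-in for:
    #   SELECT long_url FROM urls WHERE short_url = %s AND expires_at > now()
    return None

@app.route("/<short_code>")
def follow(short_code):
    long_url = cache.get(short_code)               # cache HIT: ~95% of requests stop here
    if long_url is None:                           # cache MISS: fall back to the database
        long_url = lookup_long_url(short_code)
        if long_url is None:
            abort(404)
        cache.set(short_code, long_url, ex=3600)   # keep the mapping warm for an hour
    return redirect(long_url, code=301)            # switch to 302 if per-click analytics matter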

Case Study 2: Design a Notification System

Requirements

Functional:

  • Send push notifications, SMS, and emails
  • Support millions of notifications per day
  • Allow prioritization (critical alerts vs promotional)
  • Track delivery status (sent, delivered, failed)

Architecture

Triggering Services (Order Service, Marketing Service, etc.)
  ↓
Notification Service API
  ↓
Priority Queue (Message Broker - Kafka)
  High Priority Topic:  Password resets, security alerts
  Normal Priority Topic: Order confirmations, shipping updates
  Low Priority Topic:   Promotions, newsletters
  ↓
Notification Workers (pull from appropriate topics)
  ↓
  +------------------+------------------+------------------+
  |                  |                  |                  |
Push Worker       Email Worker       SMS Worker
(FCM/APNS)        (SendGrid/SES)     (Twilio/SNS)
  ↓                  ↓                  ↓
Mobile Device     Email Server       Phone (SMS)
  ↓
Delivery Status DB (tracks each notification)
  ↓
Analytics Dashboard (delivery rates, failure rates)
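
A sketch of how a triggering service could publish onto the priority topics shown in the diagram above, assuming the kafka-python client; the topic names and the event shape are illustrative assumptions.

import json
from kafka import KafkaProducer

TOPIC_BY_PRIORITY = {
    "high":   "notifications.high",     # password resets, security alerts
    "normal": "notifications.normal",   # order confirmations, shipping updates
    "low":    "notifications.low",      # promotions, newsletters
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enqueue_notification(user_id, channel, payload, priority="normal"):
    """Publish one notification event; channel workers consume the matching topic."""
    event = {"user_id": user_id, "channel": channel, "payload": payload}
    producer.send(TOPIC_BY_PRIORITY[priority], value=event)

enqueue_notification(42, "push", {"title": "Password reset requested"}, priority="high")
producer.flush()   # block until the event has been handed to the broker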

Key Design Decisions

Retry Logic:

Failed notification → Retry with exponential backoff:
Attempt 1: Immediately
Attempt 2: 30 seconds later
Attempt 3: 5 minutes later
Attempt 4: 30 minutes later
Attempt 5: 2 hours later
After 5 failures → Move to Dead Letter Queue → Alert team
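
A sketch of that retry schedule in Python; send and dead_letter are hypothetical callables supplied by the worker (the provider call and the dead-letter-queue publish), and a real worker would re-enqueue with a delay rather than sleep in-process.

import time

# Delay before each attempt, mirroring the schedule above (seconds).
RETRY_DELAYS = [0, 30, 5 * 60, 30 * 60, 2 * 60 * 60]

def deliver_with_retries(notification, send, dead_letter):
    """Try up to five times with escalating delays, then hand off to the dead letter queue."""
    for attempt, delay in enumerate(RETRY_DELAYS, start=1):
        time.sleep(delay)                      # sketch only: a real worker re-enqueues instead
        try:                                   # of blocking the process for hours
            send(notification)
            return True
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
    dead_letter(notification)                  # after 5 failures: DLQ + alert the team
    return False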

Rate Limiting per User:

Limit: Max 10 push notifications per user per day
       Max 3 SMS per user per day
       Max 1 promotional email per user per day
→ Prevents notification fatigue, protects user experience
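
A sketch of those per-user daily caps using Redis counters that expire at the end of the day (UTC); the key naming scheme and channel names are assumptions.

import redis
from datetime import datetime, timedelta, timezone

r = redis.Redis(decode_responses=True)

DAILY_LIMITS = {"push": 10, "sms": 3, "promo_email": 1}

def seconds_until_midnight_utc():
    now = datetime.now(timezone.utc)
    midnight = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return int((midnight - now).total_seconds())

def allow_notification(user_id, channel):
    """Increment today's counter for this user/channel; allow only while within the cap."""
    key = f"notif:{channel}:{user_id}:{datetime.now(timezone.utc):%Y%m%d}"
    count = r.incr(key)
    if count == 1:                                   # first notification today: set expiry
        r.expire(key, seconds_until_midnight_utc())
    return count <= DAILY_LIMITS[channel]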

Case Study 3: Design a Social Media Feed (like Twitter/X)

Requirements

Functional:

  • Users post tweets (short messages)
  • Users follow other users
  • Home timeline shows latest tweets from all followed users
  • Tweets include text, images, links

Non-Functional:

  • 300 million active users
  • 100,000 tweets posted per second
  • Read-heavy: timeline views far exceed tweet posts

Feed Generation Approaches

Pull Model (Fanout on Read):

When user opens timeline:
→ Fetch the list of accounts the user follows (say, 500 accounts)
→ Query each account's recent tweets
→ Merge, sort by timestamp
→ Return timeline

Problem: Opening timeline requires 500+ queries → Slow!
Better for: Users following very few accounts
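
A sketch of fanout-on-read, assuming hypothetical get_followed_ids and get_recent_tweets helpers over the tweet store (each returning tweets newest-first); note that it pays one query per followed account, which is exactly the cost described above.

import heapq
from itertools import islice

def build_timeline_on_read(user_id, get_followed_ids, get_recent_tweets, limit=50):
    """Pull model: query every followed account at read time and merge by timestamp."""
    followed = get_followed_ids(user_id)                                 # e.g. 500 account IDs
    per_account = [get_recent_tweets(fid, limit) for fid in followed]    # 500 queries!
    merged = heapq.merge(*per_account, key=lambda t: t["created_at"], reverse=True)
    return list(islice(merged, limit))                                   # newest `limit` tweets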

Push Model (Fanout on Write):

When user posts a tweet:
→ Find all followers (say: 10,000 followers)
→ Write this tweet into each follower's pre-built timeline cache
→ When any follower opens their feed, the timeline is already built → Return instantly

Pre-built Timeline Cache (Redis):
User 42's timeline: [tweet789, tweet456, tweet123, ...]

Problem: Celebrity with 50M followers posts → 50M cache writes!
Better for: Most regular users
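
A sketch of fanout-on-write into per-follower Redis lists; get_follower_ids is a hypothetical helper over the social graph, and the 800-entry cap per timeline is an assumed limit to bound cache size.

import redis

r = redis.Redis(decode_responses=True)
TIMELINE_LENGTH = 800          # keep only the newest entries per user (assumed cap)

def fan_out_tweet(tweet_id, author_id, get_follower_ids):
    """Push model: write the new tweet ID into every follower's prebuilt timeline."""
    for follower_id in get_follower_ids(author_id):          # 10,000 writes for 10k followers
        key = f"timeline:{follower_id}"
        pipe = r.pipeline()
        pipe.lpush(key, tweet_id)                             # newest first
        pipe.ltrim(key, 0, TIMELINE_LENGTH - 1)               # drop entries beyond the cap
        pipe.execute()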

Hybrid Approach (Twitter's actual solution):

Regular users (< 10M followers): Push model (fanout on write)
Celebrities (> 10M followers):   Pull model (read at feed generation time)

Feed for a user following only regular accounts: served straight from the prebuilt cache
Feed for a user following celebrities:           prebuilt cache + celebrity tweets pulled and merged at read time (see the sketch below)
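
A sketch of that hybrid read path: start from the prebuilt cache filled by fanout-on-write, then pull followed celebrities live. get_tweets_by_ids, get_recent_tweets, and followed_celebrities are hypothetical helpers over the tweet store and social graph.

import redis

r = redis.Redis(decode_responses=True)

def read_timeline(user_id, get_tweets_by_ids, get_recent_tweets, followed_celebrities, limit=50):
    """Hybrid read: cached fanout for regular accounts, live pull for celebrity accounts."""
    cached_ids = r.lrange(f"timeline:{user_id}", 0, limit - 1)   # filled by fanout on write
    timeline = get_tweets_by_ids(cached_ids)                     # hydrate IDs into tweet objects
    for celeb_id in followed_celebrities(user_id):               # pull path: no fanout on write
        timeline.extend(get_recent_tweets(celeb_id, limit))
    timeline.sort(key=lambda t: t["created_at"], reverse=True)   # newest first
    return timeline[:limit]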

Architecture

Tweet Creation:
Client → API Gateway → Tweet Service → Tweet DB (MySQL sharded by TweetID)
                                     → Media Service (images → S3 → CDN)
                                     → Fanout Service → User timeline caches (Redis)

Feed Read:
Client → API Gateway → Timeline Service → Redis cache → Render feed
                                       → Merge celebrity tweets (for regular users)

Case Study 4: Design a Ride-Sharing System (like Uber)

Requirements

Functional:

  • Rider requests a ride with pickup and dropoff locations
  • System matches rider with nearest available driver
  • Both rider and driver see real-time location updates
  • Trip completes, payment processes automatically

Location Tracking Challenge

Millions of drivers update their location every 5 seconds.
5,000,000 drivers × 1 update/5 sec = 1,000,000 location writes/second

Solution: Location Service with write-optimized storage (sketched after this list)
- Use Cassandra for location data (high write throughput)
- Driver locations stored as: { driverID, lat, lng, timestamp }
- Recent location in Redis (fast read for matching)
- Historical locations in Cassandra (analytics, route replay)
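
A sketch of the ingest path for one location update, assuming redis-py for the hot store and kafka-python to feed an asynchronous Cassandra history writer; the key and topic names are illustrative.

import json
import time
import redis
from kafka import KafkaProducer

r = redis.Redis(decode_responses=True)
history = KafkaProducer(bootstrap_servers="localhost:9092",
                        value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def record_location(driver_id, lat, lng):
    """Hot store for matching + async stream for the Cassandra history writer."""
    point = {"driverID": driver_id, "lat": lat, "lng": lng, "timestamp": time.time()}
    r.hset(f"driver:{driver_id}", mapping={"lat": lat, "lng": lng, "ts": point["timestamp"]})
    history.send("driver-locations", value=point)      # consumed and written to Cassandra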

Geospatial Matching

Problem: Rider requests ride in Mumbai.
How to find all drivers within 5km efficiently?

Naive approach: Check every driver's location → 5M calculations → Too slow

Solution: Geohashing
Divide the world into a grid of cells. Each cell has a unique string (geohash).
Nearby locations share the same geohash prefix.

Mumbai driver at lat 19.07, lng 72.87 → Geohash: "te7u6…"
Rider at         lat 19.08, lng 72.88 → Geohash: "te7ud…"

Both start with "te7u" → same grid cell, ~1.5 km apart → Nearby!
(Longer shared prefixes mean smaller cells: 4 characters ≈ a 39 × 20 km cell, 6 characters ≈ 1.2 × 0.6 km.)

Query: Find all drivers whose geohash starts with the rider's prefix (plus the 8 neighbouring cells,
since two nearby points can sit just across a cell boundary) → Fast index scan
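
A self-contained sketch of geohash encoding and prefix matching; production systems would use a library (such as python-geohash) or a database's native geospatial index, and the in-memory driver dictionary here is a stand-in for that index.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lng, precision=6):
    """Interleave longitude and latitude bits, emitting one base32 character per 5 bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    code, bits, bit_count, use_lng = [], 0, 0, True
    while len(code) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            bit = int(lng >= mid)
            lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = int(lat >= mid)
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = bits * 2 + bit
        bit_count += 1
        use_lng = not use_lng
        if bit_count == 5:
            code.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(code)

# In-memory stand-in for the driver index; real systems keep this in Redis or a DB index.
drivers = {"driver-1": (19.07, 72.87), "driver-2": (19.10, 72.90), "driver-3": (28.61, 77.21)}

rider_prefix = geohash(19.08, 72.88, precision=4)        # "te7u"
nearby = [d for d, (la, ln) in drivers.items()
          if geohash(la, ln, precision=4) == rider_prefix]
print(nearby)    # the two Mumbai drivers share the rider's cell; the Delhi driver does not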

Real-Time Location Updates (WebSockets)

HTTP (polling) approach:
Rider app: "Where is driver?" → Server responds
Rider app: Wait 2 seconds → "Where is driver?" → Server responds
→ Many requests, delayed updates, server overhead

WebSocket approach:
Client and server maintain persistent two-way connection
Driver app → WebSocket → Location server → Updates all connected riders instantly
→ No polling, instant updates, efficient
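
A minimal sketch of the push path, assuming a recent version of the Python websockets library and a single-process, in-memory map of trip_id → connected rider sockets; message shapes are illustrative.

import asyncio
import json
import websockets

riders_by_trip = {}   # trip_id → set of connected rider websockets

async def handler(websocket):
    async for raw in websocket:
        msg = json.loads(raw)
        if msg["type"] == "subscribe":                    # rider app joins a trip
            riders_by_trip.setdefault(msg["trip_id"], set()).add(websocket)
        elif msg["type"] == "location":                   # driver app pushes a location update
            for rider in riders_by_trip.get(msg["trip_id"], set()):
                await rider.send(json.dumps({"lat": msg["lat"], "lng": msg["lng"]}))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()                            # run forever

if __name__ == "__main__":
    asyncio.run(main())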

System Architecture

+--------+  WebSocket  +----------+  Kafka   +----------+
| Driver |-----------> | Location | -------> | Matching |
|  App   |             | Service  |          | Service  |
+--------+             +----------+          +----------+
                           |                      |
                        Cassandra              Redis (active
                        (history)              driver pool)
                                                   ↑
                                               Geohash index

+--------+  HTTP  +----------+               +----------+
| Rider  |------> |  Trip    | ------------> | Payment  |
|  App   |        | Service  |               | Service  |
+--------+        +----------+               +----------+
    ↑                  |                          |
    | WebSocket         → Notifications         Stripe/
    | (driver            (Push + SMS)           Braintree
    | location)

Common Patterns Across All Case Studies

Pattern Used       | URL Shortener      | Notifications      | Social Feed        | Ride Sharing
-------------------|--------------------|--------------------|--------------------|------------------
Caching            | Redis (URL map)    | User preferences   | Timeline cache     | Driver locations
Message Queue      | No                 | Kafka (priority)   | Fanout queue       | Location updates
Load Balancing     | API servers        | Worker nodes       | Feed servers       | All services
Horizontal Scaling | API + DB sharding  | Worker scaling     | Tweet DB sharding  | Location service
Async Processing   | Expiry cleanup     | All notifications  | Fanout writes      | Payment, receipts

How to Approach Any System Design Problem

Use this framework for any system design interview or real-world design:

  1. Clarify requirements – Ask about scale, features, and priorities. Confirm functional and non-functional requirements.
  2. Estimate scale – Calculate writes/second, reads/second, storage over 5 years.
  3. Define the API – What endpoints does the system expose? What do they accept and return?
  4. High-level design – Draw the major components: client, API gateway, services, caches, databases, queues.
  5. Deep dive into bottlenecks – Identify the hardest parts (fanout, location queries, payment consistency) and explain solutions.
  6. Address failure scenarios – What happens if the database goes down? If the queue fills up? If a service crashes?
  7. Trade-offs – Acknowledge what decisions sacrifice (e.g., AP vs CP, cost vs performance).

Summary

Real-world systems combine every concept from this course: caching, load balancing, sharding, replication, queues, CDN, rate limiting, and security — all working together. A URL shortener demonstrates read-heavy caching. A notification system shows priority queues and retry logic. A social feed reveals the fanout problem and hybrid push-pull strategies. A ride-sharing system highlights real-time geospatial challenges. Mastering system design means recognizing these patterns and knowing when and how to apply each one. The goal is always the same: build a system that is fast, reliable, scalable, and secure at any scale.
