System Design: Real-World Case Studies
Real-world case studies bring all system design concepts together into concrete, applied examples. Understanding how well-known systems are designed — and the specific problems they solved — builds the intuition needed to design new systems confidently. This topic walks through four classic system design problems, applying every concept covered in this course.
Case Study 1: Design a URL Shortener (like bit.ly)
Requirements
Functional:
- Given a long URL, generate a short URL (e.g., bit.ly/abc123)
- Visiting the short URL redirects to the original long URL
- Short URLs expire after a configured time period
Non-Functional:
- 100 million new URLs created per day
- 10 billion redirects per day (~115,000 redirects/second)
- Reads (redirects) are 100× more frequent than writes (URL creation)
- 99.9% uptime requirement
Scale Estimation
Writes: 100M URLs/day ≈ 1,160 writes/second
Reads: 10B redirects/day ≈ 115,740 reads/second
Storage per URL: 500 bytes (long URL + metadata)
5-year storage: 100M × 365 × 5 × 500 bytes ≈ 91 TB
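As a sanity check, here is the arithmetic behind these estimates in a few lines of Python (no assumptions beyond the figures above):

```python
SECONDS_PER_DAY = 24 * 60 * 60                        # 86,400

writes_per_sec = 100_000_000 / SECONDS_PER_DAY        # ~1,157 writes/s
reads_per_sec = 10_000_000_000 / SECONDS_PER_DAY      # ~115,740 reads/s
five_year_bytes = 100_000_000 * 365 * 5 * 500         # ~9.1e13 bytes

print(f"{writes_per_sec:,.0f} writes/s, {reads_per_sec:,.0f} reads/s")
print(f"{five_year_bytes / 1e12:.0f} TB over 5 years")
```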
Core Design Decisions
Short URL Generation:
Option 1: Hash the long URL
MD5("https://estudy247.com/long-article") → Take first 7 characters
Risk: Collisions (two different URLs could produce same hash)
Solution: Check whether the short code already exists; on collision, append a counter and re-hash
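A minimal sketch of the hash-based approach; `exists` is a hypothetical callable standing in for a lookup against whatever datastore holds the mappings:

```python
import hashlib

def shorten_by_hash(long_url: str, exists) -> str:
    """Derive a 7-char code from MD5 (used non-cryptographically here).

    `exists` is a hypothetical callable that checks the datastore for a
    code already mapped to a *different* long URL.
    """
    salt = 0
    while True:
        candidate = hashlib.md5(f"{long_url}{salt or ''}".encode()).hexdigest()[:7]
        if not exists(candidate):
            return candidate
        salt += 1  # collision: append a counter and re-hash
```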
Option 2: Unique ID + Base62 encoding
Generate auto-incrementing ID: 12345678
Encode in base62 (a-z, A-Z, 0-9): 12345678 → "ZXP0"
7-character base62 = 62^7 ≈ 3.5 trillion unique codes → roughly 96 years of capacity at 100M URLs/day
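A minimal base62 encoder using the alphabet ordering above (any consistent ordering works; the exact output characters depend on it):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

print(encode_base62(12345678))  # "ZXP0" with this alphabet ordering
```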
Architecture
Client
  ↓
Load Balancer (Round Robin across API servers)
  ↓
API Servers (Stateless, horizontally scalable)
  ↓                          ↓
Cache (Redis:            Database (PostgreSQL):
short → long mapping)    short_url | long_url | created_at | expires_at

Cache HIT (95%+ for popular links): Return immediately
Cache MISS: Fetch from DB, store in cache, return

Redirect flow:
GET /abc123
  → Check Redis cache
      HIT  → Return 301 Redirect to long URL
      MISS → Query DB → Store in Redis → Return 301 Redirect

Key design insight:
301 (Permanent Redirect): Browser caches it → Future clicks bypass the server entirely
302 (Temporary Redirect): Server handles every click → Better analytics tracking
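A sketch of the redirect flow as a cache-aside handler. Flask and redis-py are illustrative choices, and `fetch_from_db` is a hypothetical stub for the PostgreSQL lookup:

```python
import redis
from flask import Flask, abort, redirect

app = Flask(__name__)
cache = redis.Redis(decode_responses=True)  # short_code -> long_url

def fetch_from_db(code: str):
    """Hypothetical DB lookup: returns the long URL, or None if unknown."""
    ...

@app.route("/<code>")
def follow(code: str):
    long_url = cache.get(code)               # 1. try the cache
    if long_url is None:
        long_url = fetch_from_db(code)       # 2. miss: fall back to the DB
        if long_url is None:
            abort(404)
        cache.set(code, long_url, ex=3600)   # 3. populate cache, 1h TTL
    return redirect(long_url, code=301)      # 301: browsers cache the hop
```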
Case Study 2: Design a Notification System
Requirements
Functional:
- Send push notifications, SMS, and emails
- Support millions of notifications per day
- Allow prioritization (critical alerts vs promotional)
- Track delivery status (sent, delivered, failed)
Architecture
Triggering Services (Order Service, Marketing Service, etc.)
  ↓
Notification Service API
  ↓
Priority Queue (Message Broker - Kafka)
  High Priority Topic:   Password resets, security alerts
  Normal Priority Topic: Order confirmations, shipping updates
  Low Priority Topic:    Promotions, newsletters
  ↓
Notification Workers (pull from the appropriate topics)
  ↓
Push Worker          Email Worker         SMS Worker
(FCM/APNS)           (SendGrid/SES)       (Twilio/SNS)
  ↓                    ↓                    ↓
Mobile Device        Email Server         Phone (SMS)
  ↓
Delivery Status DB (tracks each notification)
  ↓
Analytics Dashboard (delivery rates, failure rates)
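One way a worker can respect topic priority, sketched with kafka-python. The topic names and the drain-high-first policy are assumptions, not the only reasonable design:

```python
from kafka import KafkaConsumer

# One consumer per priority tier, highest first.
TOPICS = ["notifications.high", "notifications.normal", "notifications.low"]
consumers = [
    KafkaConsumer(topic, bootstrap_servers="localhost:9092", group_id="workers")
    for topic in TOPICS
]

def handle(record):
    """Hypothetical dispatch to the push/email/SMS workers."""
    print(record.topic, record.value)

while True:
    for consumer in consumers:               # check higher-priority topics first
        batch = consumer.poll(timeout_ms=100)
        if batch:
            for records in batch.values():
                for record in records:
                    handle(record)
            break  # re-check higher tiers before draining lower ones
```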
Key Design Decisions
Retry Logic:
Failed notification → Retry with exponential backoff:
  Attempt 1: Immediately
  Attempt 2: 30 seconds later
  Attempt 3: 5 minutes later
  Attempt 4: 30 minutes later
  Attempt 5: 2 hours later
After 5 failures → Move to Dead Letter Queue → Alert the team
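A sketch of that schedule; `send` and `move_to_dead_letter_queue` are hypothetical hooks into the delivery and queueing layers:

```python
import time

# Delay before each attempt, mirroring the schedule above (in seconds).
BACKOFF = [0, 30, 5 * 60, 30 * 60, 2 * 3600]

def deliver_with_retries(notification, send, move_to_dead_letter_queue):
    for delay in BACKOFF:
        time.sleep(delay)
        if send(notification):                # hypothetical: True on success
            return True
    move_to_dead_letter_queue(notification)   # 5 failures: give up, alert
    return False
```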
Rate Limiting per User:
Limit: Max 10 push notifications per user per day
Max 3 SMS per user per day
Max 1 promotional email per user per day
→ Prevents notification fatigue and protects the user experience
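A minimal daily-quota check with Redis counters, assuming the limits above and a hypothetical key scheme:

```python
import redis

r = redis.Redis()
DAILY_LIMITS = {"push": 10, "sms": 3, "promo_email": 1}

def allow(user_id: str, channel: str) -> bool:
    """Increment the user's daily counter; deny once the quota is hit."""
    key = f"notif:{channel}:{user_id}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 24 * 3600)   # first send today: start the 24h window
    return count <= DAILY_LIMITS[channel]
```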
Case Study 3: Design a Social Media Feed (like Twitter/X)
Requirements
Functional:
- Users post tweets (short messages)
- Users follow other users
- Home timeline shows latest tweets from all followed users
- Tweets include text, images, links
Non-Functional:
- 300 million active users
- 100,000 tweets posted per second
- Read-heavy: timeline views far exceed tweet posts
Feed Generation Approaches
Pull Model (Fanout on Read):
When a user opens their timeline:
  → Fetch the list of all accounts the user follows (say: 500 accounts)
  → Query each account's recent tweets
  → Merge, sort by timestamp
  → Return timeline

Problem: Opening the timeline requires 500+ queries → Slow!
Better for: Users following very few accounts
Push Model (Fanout on Write):
When a user posts a tweet:
  → Find all followers (say: 10,000 followers)
  → Write the tweet into each follower's pre-built timeline cache
  → When any follower opens their feed → Already built! Return instantly

Pre-built Timeline Cache (Redis):
  User 42's timeline: [tweet789, tweet456, tweet123, ...]

Problem: A celebrity with 50M followers posts → 50M cache writes!
Better for: Most regular users
Hybrid Approach (Twitter's actual solution):
Regular users (< 10M followers): Push model (fanout on write)
Celebrities (> 10M followers): Pull model (their tweets are fetched at read time)
Building any feed: Prebuilt cache (from fanout) + followed celebrities' recent tweets merged live when the user opens the feed
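A sketch of the hybrid decision at post time, using the 10M threshold above. `get_followers` and the Redis list layout are illustrative assumptions:

```python
import redis

r = redis.Redis()
CELEBRITY_THRESHOLD = 10_000_000
TIMELINE_LEN = 800  # keep only recent entries per cached timeline

def on_new_tweet(author_id: int, tweet_id: int, get_followers):
    followers = get_followers(author_id)       # hypothetical follower lookup
    if len(followers) >= CELEBRITY_THRESHOLD:
        return  # celebrity: skip fanout; readers pull these tweets live
    pipe = r.pipeline()
    for follower_id in followers:              # fanout on write
        key = f"timeline:{follower_id}"
        pipe.lpush(key, tweet_id)
        pipe.ltrim(key, 0, TIMELINE_LEN - 1)   # cap timeline length
    pipe.execute()
```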
Architecture
Tweet Creation:
Client → API Gateway → Tweet Service → Tweet DB (MySQL sharded by TweetID)
→ Media Service (images → S3 → CDN)
→ Fanout Service → User timeline caches (Redis)
Feed Read:
Client → API Gateway → Timeline Service → Redis cache → Render feed
→ Merge celebrity tweets (for regular users)
Case Study 4: Design a Ride-Sharing System (like Uber)
Requirements
Functional:
- Rider requests a ride with pickup and dropoff locations
- System matches rider with nearest available driver
- Both rider and driver see real-time location updates
- Trip completes, payment processes automatically
Location Tracking Challenge
Millions of drivers update their location every 5 seconds.
5,000,000 drivers × 1 update/5 sec = 1,000,000 location writes/second
Solution: Location Service with write-optimized storage
- Use Cassandra for location data (high write throughput)
- Driver locations stored as: { driverID, lat, lng, timestamp }
- Recent location in Redis (fast read for matching)
- Historical locations in Cassandra (analytics, route replay)
Geospatial Matching
Problem: A rider requests a ride in Mumbai. How do we find all drivers within 5 km efficiently?

Naive approach: Check every driver's location → 5M distance calculations per request → Too slow

Solution: Geohashing
Divide the world into a grid of cells. Each cell has a unique string (geohash).
Nearby locations share the same geohash prefix.

  Mumbai driver at lat 19.07, lng 72.87 → Geohash: "te7uddh"
  Rider at lat 19.08, lng 72.88 → Geohash: "te7uddk"
  Both start with "te7udd" → Same neighborhood → Nearby!

Query: Find all drivers whose geohash starts with "te7udd" → Fast index scan
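To make the prefix property concrete, here is a from-scratch encoder using the standard geohash base32 alphabet (in production you would more likely use a library or your database's geospatial index):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def geohash(lat: float, lng: float, length: int = 7) -> str:
    """Encode a point by alternately bisecting longitude and latitude."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    code, bits, bit_count = [], 0, 0
    even = True  # even bit positions refine longitude
    while len(code) < length:
        if even:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits, lng_lo = (bits << 1) | 1, mid
            else:
                bits, lng_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:                # every 5 bits -> one base32 char
            code.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(code)

# Nearby points share a prefix:
print(geohash(19.07, 72.87))  # driver's cell
print(geohash(19.08, 72.88))  # rider's cell: same leading characters
```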
Real-Time Location Updates (WebSockets)
HTTP (polling) approach:
  Rider app: "Where is the driver?" → Server responds
  Rider app: Wait 2 seconds → "Where is the driver?" → Server responds
  → Many requests, delayed updates, server overhead

WebSocket approach:
  Client and server maintain a persistent two-way connection
  Driver app → WebSocket → Location server → Updates all connected riders instantly
  → No polling, instant updates, efficient
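A sketch of the push path, assuming the Python `websockets` package (recent versions, where the handler receives a single connection argument). The `subscribers` registry is hypothetical and would be populated when riders connect:

```python
import asyncio
import json
import websockets

# Hypothetical in-memory registry: trip_id -> rider connections
subscribers: dict[str, set] = {}

async def handler(ws):
    # Treat each incoming message as a driver location update.
    async for message in ws:
        update = json.loads(message)  # {"trip_id": ..., "lat": ..., "lng": ...}
        for rider_ws in subscribers.get(update["trip_id"], set()):
            await rider_ws.send(message)  # forward instantly, no polling

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```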
System Architecture
+--------+  WebSocket   +----------+   Kafka   +----------+
| Driver | -----------> | Location | --------> | Matching |
|  App   |              | Service  |           | Service  |
+--------+              +----------+           +----------+
                             |                      |
                         Cassandra             Redis (active
                         (history)             driver pool)
                                                    ↑
                                               Geohash index

+--------+   HTTP   +----------+            +----------+
| Rider  | -------> |   Trip   | ---------> | Payment  |
|  App   |          | Service  |            | Service  |
+--------+          +----------+            +----------+
    ↑                    |                       |
    | WebSocket     Notifications            Stripe/
    | (driver       (Push + SMS)             Braintree
    | location)
Common Patterns Across All Case Studies
| Pattern Used | URL Shortener | Notifications | Social Feed | Ride Sharing |
|---|---|---|---|---|
| Caching | Redis (URL map) | User preferences | Timeline cache | Driver locations |
| Message Queue | No | Kafka (priority) | Fanout queue | Location updates |
| Load Balancing | API servers | Worker nodes | Feed servers | All services |
| Horizontal Scaling | API + DB sharding | Worker scaling | Tweet DB sharding | Location service |
| Async Processing | Expiry cleanup | All notifications | Fanout writes | Payment, receipts |
How to Approach Any System Design Problem
Use this framework for any system design interview or real-world design:
- Clarify requirements – Ask about scale, features, and priorities. Confirm functional and non-functional requirements.
- Estimate scale – Calculate writes/second, reads/second, storage over 5 years.
- Define the API – What endpoints does the system expose? What do they accept and return?
- High-level design – Draw the major components: client, API gateway, services, caches, databases, queues.
- Deep dive into bottlenecks – Identify the hardest parts (fanout, location queries, payment consistency) and explain solutions.
- Address failure scenarios – What happens if the database goes down? If the queue fills up? If a service crashes?
- Trade-offs – Acknowledge what each decision sacrifices (e.g., AP vs CP, cost vs performance).
Summary
Real-world systems combine every concept from this course: caching, load balancing, sharding, replication, queues, CDN, rate limiting, and security — all working together. A URL shortener demonstrates read-heavy caching. A notification system shows priority queues and retry logic. A social feed reveals the fanout problem and hybrid push-pull strategies. A ride-sharing system highlights real-time geospatial challenges. Mastering system design means recognizing these patterns and knowing when and how to apply each one. The goal is always the same: build a system that is fast, reliable, scalable, and secure at any scale.
