🏝️ From Zero to Production-Ready Architect

System Design Masterclass

A deep-dive journey through High-Level Design (HLD) and Low-Level Design (LLD). Build the mental models, trade-off frameworks, and engineering intuition used by senior engineers at top tech companies.

20 Modules · Interactive Quizzes · Real Diagrams · Case Studies
Module 01

Prerequisites & Foundations

Before we architect skyscrapers, we pour the foundation. This module ensures you have the conceptual toolkit (networking, databases, concurrency, and environment setup) required to reason about distributed systems without hand-waving.

1.1 What You Need Before Starting

System design sits at the intersection of computer science, software engineering, and product thinking. You do not need a PhD, but you do need comfort with certain primitives. Think of it like learning to navigate the open ocean: you don't need to build the boat on day one, but you must understand tides, compass headings, and how sails catch wind.

✅ Required Comfort Level

  • Basic programming in any language (Python, Java, JavaScript, Go)
  • Understanding of variables, functions, loops, and basic data structures
  • Familiarity with HTTP (you've made or consumed a REST API)
  • Willingness to learn math-lite concepts (Big-O, percentages, orders of magnitude)

🎯 Helpful But Not Mandatory

  • Prior backend or full-stack development experience
  • Exposure to cloud platforms (AWS, GCP, Azure)
  • Database query experience (SQL or NoSQL)
  • Systems programming or OS course material

Analogy (The Island Cartographer): A cartographer mapping an archipelago doesn't need to have sailed every route, but they must understand scale, coordinates, and how islands connect via shipping lanes. System designers map software islands (services) connected by network lanes (APIs, queues, databases).

1.2 Networking Fundamentals

Every distributed system is, at its core, computers talking to each other over a network. When you design a chat app, a payment gateway, or a video streaming platform, you are really designing who talks to whom, over what protocol, with what latency budget, and what happens when the message never arrives.

The Network Stack: How Data Travels from App to Wire

[Diagram: CLIENT and SERVER each run the same four layers and are connected through the Internet / LAN]

  • Application (HTTP/gRPC)
  • Transport (TCP/UDP)
  • Network (IP)
  • Link / Physical

Total Round-Trip Time (RTT) = propagation + processing + queuing
Same-region: ~1–5 ms | Cross-continent: ~100–300 ms

Key Concepts You Must Internalize

IP Address & DNS

An IP address is a street address for a machine. DNS is the phone book that translates api.example.com into 203.0.113.42. In system design, DNS is also a load distribution tool (round-robin, geo-routing).

TCP vs UDP

TCP is reliable, ordered, and connection-oriented, like registered mail with delivery confirmation. Use it for HTTP, database connections, and file transfers. UDP is fire-and-forget, like shouting across a lagoon. Use it for live video, gaming, and DNS queries, where speed beats guaranteed delivery.

HTTP/HTTPS & REST

HTTP is the lingua franca of web APIs. REST is an architectural style using HTTP verbs (GET, POST, PUT, DELETE) on resources identified by URLs. HTTPS adds TLS encryption, which is non-negotiable for production systems handling user data.

Latency, Bandwidth, Throughput

Latency is how long one request takes (ms). Bandwidth is pipe width (Mbps). Throughput is completed requests per second (RPS/QPS). A wide pipe (bandwidth) doesn't help if each message takes forever (latency).
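To make the pipe analogy concrete, here is a minimal sketch under a deliberately simplified model (an assumption for illustration, not a full network model): delivery time is a fixed latency plus the time the payload spends on the wire. The function name and the sample numbers are hypothetical.

```python
def transfer_time_ms(payload_bytes: int, latency_ms: float, bandwidth_mbps: float) -> float:
    """Approximate delivery time: fixed latency plus time on the wire."""
    bits = payload_bytes * 8
    wire_ms = bits / (bandwidth_mbps * 1_000_000) * 1000
    return latency_ms + wire_ms


# A 1 KB API response over a cross-continent link (100 ms latency).
fast_pipe = transfer_time_ms(1_000, latency_ms=100, bandwidth_mbps=1_000)
slow_pipe = transfer_time_ms(1_000, latency_ms=100, bandwidth_mbps=100)

# Ten times more bandwidth saves well under a millisecond here:
# for small messages, latency dominates.
print(f"1 Gbps: {fast_pipe:.3f} ms, 100 Mbps: {slow_pipe:.3f} ms")
```

For small payloads the latency term dominates, which is why moving data closer to the user often helps more than buying a wider pipe.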

1.3 Database Fundamentals

Data is the treasure buried on every island in your architecture. Choosing where and how to store it determines consistency, scalability, and operational complexity. At a foundation level, understand the two great families of databases and their trade-offs.

SQL vs NoSQL: When to Use Which

SQL (Relational)

  • Structured schema (tables, rows)
  • ACID transactions
  • Strong consistency
  • JOINs across tables
  • Vertical scaling is the primary path
  • Examples: PostgreSQL, MySQL
  • Best for: financial records, orders, user accounts, relational data

NoSQL (Non-Relational)

  • Flexible / schema-less models
  • Horizontal scaling is native
  • Eventual consistency is common
  • Document, key-value, graph, column
  • Optimized for specific access patterns
  • Examples: MongoDB, Redis, Cassandra
  • Best for: feeds, sessions, analytics, high-write workloads, caching

The ACID vs BASE Mental Model

ACID (Atomicity, Consistency, Isolation, Durability) guarantees that database transactions behave predictably, which is critical for banking. BASE (Basically Available, Soft state, Eventually consistent) accepts temporary inconsistency in exchange for availability and partition tolerance, which is common in globally distributed systems. You'll revisit this deeply when we cover CAP theorem in Module 13.
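As a small illustration of the "A" in ACID, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The accounts table and the transfer function are hypothetical; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # overdraws, so the whole transaction rolls back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer debits and credits inside the transaction, then fails the balance check; the rollback undoes both statements together.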

-- Foundational SQL you'll encounter in LLD discussions
CREATE TABLE users (
  id         BIGSERIAL PRIMARY KEY,
  email      VARCHAR(255) UNIQUE NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

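-- Note: in PostgreSQL the UNIQUE constraint on email already creates an index,
-- so this explicit index is redundant and is shown only for the syntax.
CREATE INDEX idx_users_email ON users(email);
-- Indexes are the "table of contents": they trade write speed for read speed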

1.4 Computer Science Building Blocks

These concepts appear in every system design discussion. You don't need to implement a B-tree from scratch, but you must speak fluently about them when justifying design decisions.

Concept | Layman's Terms | System Design Relevance
Big-O Notation | How cost grows as input grows | Choosing algorithms, estimating database query cost
Hash Tables | Instant lookup by key (O(1) average) | Caching, sharding keys, consistent hashing
Trees & Graphs | Hierarchical or connected data structures | File systems, org charts, social networks, DNS
Queues & Stacks | FIFO vs LIFO processing order | Message queues, job schedulers, undo buffers
Concurrency | Multiple things happening "at once" | Thread pools, race conditions, locks, async I/O
Memory Hierarchy | Fast/small (CPU cache) → slow/big (disk) | Why caching layers exist at every level
The Memory Hierarchy: Why Caching Is Everywhere

CPU cache, L1/L2 (~ns) → RAM (~100 ns) → SSD (~100 μs) → Network / Disk (~ms)
Each level is ~10–100× slower but ~10–100× larger

1.5 Environment & Tooling Setup

Hands-on practice reinforces theory. Set up a lightweight environment for sketching architectures, running local services, and experimenting with APIs.

1. Diagramming Tools

  • Excalidraw (free, hand-drawn aesthetic): great for interviews
  • draw.io / diagrams.net: professional architecture diagrams
  • Mermaid: diagram-as-code in Markdown (used in this course)

2. Local Development Stack

# Recommended baseline tooling
# macOS (Homebrew)
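# Note: the docker formula installs the CLI only; running containers also
# needs a Docker engine such as Docker Desktop.
brew install git node python@3.12 docker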

# Verify installations
git --version && node --version && python3 --version && docker --version

# Optional: run local Redis + PostgreSQL via Docker
docker run -d --name local-redis -p 6379:6379 redis:7-alpine
docker run -d --name local-postgres -e POSTGRES_PASSWORD=dev \
  -p 5432:5432 postgres:16-alpine

3. API Testing

Install curl (built into macOS/Linux) or use Postman / HTTPie to probe REST endpoints. Understanding request/response cycles is essential for API design modules later.

curl -X GET https://api.github.com/users/octocat
curl -X POST https://httpbin.org/post -H "Content-Type: application/json" \
  -d '{"message": "hello from system design course"}'

1.6 How to Use This Masterclass

  1. Read sequentially first. Modules build on each other. Skipping to "Design Twitter" without understanding caching is like sailing without charts.
  2. Sketch as you read. Redraw every diagram from memory on paper or Excalidraw. Active recall beats passive rereading.
  3. Complete every quiz. Each module ends with MCQs designed to surface gaps in understanding. Read explanations even for questions you got right.
  4. Time-box deep dives. Aim for 15–20 minutes per module section, 5 minutes per quiz. The full course targets 5–6 hours.
  5. Revisit case studies. Modules 16–18 apply everything. Return to them after completing the theory modules.
Module 02

Introduction to System Design

What exactly is "system design"? Why does every senior engineering interview include it? This module defines the discipline, distinguishes high-level from low-level design, and introduces the structured thinking framework you'll use throughout this course.

2.1 What Is System Design?

System design is the process of defining the architecture, components, modules, interfaces, and data flows for a software system to satisfy specified requirements. It answers: "Given a problem like 'build Instagram' or 'process 1 million payments per hour,' how do we decompose it into reliable, scalable, maintainable pieces?"

It is not coding. It is not picking React vs Vue. It is the engineering decision-making layer that sits above individual features: deciding how services communicate, where state lives, what fails gracefully, and what trade-offs you accept.

Analogy (Designing a Resort Island): Building one beach bungalow is "feature development." Designing the entire resort is system design: where the power plant goes, how freshwater reaches every villa, how guests move between islands, and what happens during a hurricane. You're the master planner, not the carpenter.

The System Design Landscape

[Diagram: design flows top-down through four stages]

  1. Product Requirements
  2. HIGH-LEVEL DESIGN (HLD): services, APIs, data stores, scaling strategy
  3. LOW-LEVEL DESIGN (LLD): classes, methods, schemas, algorithms
  4. Implementation & Deployment

HLD Answers

  • "What services exist?"
  • "How do they scale?"
  • "Where is data stored?"
  • "What fails first?"
  • "What are the bottlenecks?"

LLD Answers

  • "What classes exist?"
  • "What methods do they expose?"
  • "What indexes on tables?"
  • "How is concurrency handled?"
  • "What design patterns apply?"

Non-Functional Requirements Permeate Every Layer: Scalability • Availability • Latency • Security • Cost • Maintainability

2.2 High-Level Design vs Low-Level Design

This is the single most important distinction in the entire course. Confusing HLD with LLD is like confusing a city zoning map with a building's floor plan: both are "design," but they operate at different zoom levels.

Dimension | High-Level Design (HLD) | Low-Level Design (LLD)
Focus | Architecture & components | Internal structure & logic
Audience | Architects, tech leads, interviewers | Implementing engineers
Artifacts | Architecture diagrams, data flow, API contracts | Class diagrams, sequence diagrams, ER schemas
Key Questions | Microservices or monolith? SQL or NoSQL? | Which pattern? Which data structure? Thread-safe?
Example (URL Shortener) | Load balancer → API servers → Redis cache → DB cluster | UrlService.createShortUrl(), base62 encoding, DB schema
When in SDLC | Early, before major implementation | Just before / during implementation
HLD Example: URL Shortener Architecture (Bird's-Eye View)

[Diagram components: Client, Load Balancer, API Server 1, API Server 2, Redis Cache, PostgreSQL Primary DB, Read Replica]

HLD: boxes and arrows, with no class names and no SQL queries
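The LLD column of the comparison table names base62 encoding as a typical low-level detail. One common way to apply it, sketched here as an assumption rather than the course's prescribed method, is to encode a numeric database ID into a short code. The alphabet order is an arbitrary choice for this example.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 symbols (ordering is a convention)
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(n: int) -> str:
    """Encode a non-negative integer (e.g., a row ID) as a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


code = base62_encode(123_456_789)
print(code, base62_decode(code))
```

Seven base62 characters cover 62^7 (about 3.5 trillion) distinct values, which is why short codes of that length go a long way.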

2.3 Functional vs Non-Functional Requirements

Every system design session begins with requirements. Splitting them correctly prevents you from over-engineering features nobody asked for while ignoring the constraints that actually break production systems.

Functional Requirements (FRs)

What the system does: features and behaviors.

  • Users can create a short URL from a long URL
  • Users can redirect via the short URL
  • Users can view click analytics
  • Custom alias URLs are supported

Non-Functional Requirements (NFRs)

How well the system performs: quality attributes.

  • Scalability: 100M URLs, 10K reads/sec
  • Availability: 99.99% uptime
  • Latency: Redirect < 100ms p99
  • Durability: Zero URL data loss
  • Security: Rate limiting, abuse prevention

Pro tip: In interviews, always clarify NFRs before drawing boxes. "How many users? Read/write ratio? Latency target? Consistency requirements?" These numbers drive every subsequent decision: cache or not, SQL or NoSQL, sync or async.

2.4 The Structured Design Process

Senior engineers don't freestyle. They follow a repeatable framework that ensures nothing critical is missed. Internalize this 7-step process; you'll use it in every case study module.

The 7-Step System Design Framework

  1. Clarify Requirements (FRs + NFRs)
  2. Estimate Scale, Back-of-Envelope (DAU, QPS, storage)
  3. Define API / Interface (REST endpoints)
  4. High-Level Architecture, HLD (boxes & arrows)
  5. Deep Dive Critical Components (DB schema, cache)
  6. Identify Bottlenecks & Trade-offs (CAP, SPOFs)
  7. Low-Level Design, if time permits (classes, patterns)

Step 2 in Action: Back-of-the-Envelope Math

Estimation separates senior engineers from junior ones. You don't need exact numbers; you need the right order of magnitude.

Example: URL Shortener scale estimation
─────────────────────────────────────
Assumptions:
  • 100M new URLs/month
  • Read:Write ratio = 100:1
  • Average URL stored = 500 bytes
  • Retention = 5 years

Writes/sec  = 100M / (30 × 24 × 3600) ≈ 40 writes/sec
Reads/sec   = 40 × 100 = 4,000 reads/sec
Storage     = 100M × 12 × 5 × 500B = 3 TB (raw, before replication)

→ Reads dominate → aggressive caching (Redis) is essential
→ Writes are modest → single DB shard may suffice initially
→ 3 TB is manageable → no exotic storage needed yet
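The same arithmetic can be kept as a tiny script so the assumptions are easy to change. This is a sketch: the function and parameter names are hypothetical, and a 30-day month is assumed as in the estimate above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, as in the estimate above


def estimate(new_urls_per_month, read_write_ratio, bytes_per_url, retention_years):
    """Return (writes/sec, reads/sec, raw storage in bytes)."""
    writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH
    reads_per_sec = writes_per_sec * read_write_ratio
    storage_bytes = new_urls_per_month * 12 * retention_years * bytes_per_url
    return writes_per_sec, reads_per_sec, storage_bytes


writes, reads, storage = estimate(
    new_urls_per_month=100_000_000,
    read_write_ratio=100,
    bytes_per_url=500,
    retention_years=5,
)
print(f"{writes:.1f} writes/s, {reads:.0f} reads/s, {storage / 1e12:.1f} TB")
```

The unrounded write rate is a little under 39 per second; rounding to 40 and 4,000 keeps the mental math easy without changing the order of magnitude.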

2.5 Trade-offs: The Core Skill

There is no perfect architecture, only trade-offs aligned with requirements. System design is less about finding the "right answer" and more about articulating why you chose A over B given constraints C and D.

Common System Design Trade-off Spectrum

[Diagram: spectrum labels] Consistency, Availability, Low Latency, Strong Durability, Simplicity, Flexibility, Low Cost

You usually pick a point on each spectrum

When you propose a design, always pair it with: "I'm choosing X because [requirement]. The trade-off is Y, which we mitigate by Z." This sentence structure alone elevates interview performance dramatically.

2.6 Architectural Styles Preview

Before diving into individual patterns in later modules, orient yourself to the three dominant architectural styles. We'll explore each in depth in Module 4.

🏠 Monolith

Single deployable unit. Simple to develop and debug. Harder to scale individual components.

🏘️ Microservices

Independent services with own databases. Scales teams and components. Adds network complexity.

⚡ Serverless

Functions as a service. Zero server management. Cold starts and vendor lock-in are trade-offs.

Module 03

Requirements Engineering & Constraints

The difference between a senior engineer and a junior one often appears in the first five minutes of a design session: seniors interrogate requirements before touching a whiteboard. This module teaches you to extract, classify, prioritize, and constrain requirements like an architect surveying land before laying a foundation.

3.1 Why Requirements Come First

A system designed for 100 users looks nothing like one designed for 100 million. A banking ledger demands different consistency guarantees than a social media "like" counter. Requirements are the contract between problem and solution. Miss them, and you optimize for the wrong thing entirely.

Analogy (The Tide Surveyor): Before building a pier, surveyors measure tides, storm surges, and seabed depth. Skipping this step means your pier floods at high tide or collapses in a storm. Requirements engineering is your tide survey: it tells you what forces your system must withstand.

Requirements Flow Into Every Design Decision

[Diagram: Stakeholder Needs → Requirements (FRs + NFRs + Constraints) → three decision areas → Architecture & Technology Choices]

  • Scale & Capacity: sharding, caching
  • Consistency Model: ACID vs eventual
  • Security Posture: auth, encryption

3.2 Functional Requirements: The Feature Contract

Functional requirements describe what the system must do. They should be specific, testable, and unambiguous. Vague requirements like "the system should be fast" are NFRs in disguise; functional requirements name concrete behaviors.

Writing Testable Functional Requirements

Weak (Vague) | Strong (Testable)
Users can share content | Users can generate a shareable link valid for 7 days
System handles search | Users can search products by name with prefix matching
Support notifications | Users receive push notifications within 30s of a new message

User Stories vs Use Cases

User stories (Agile): "As a [role], I want [feature], so that [benefit]." Great for prioritization. Use cases: step-by-step interaction flows including edge cases and alternate paths. Great for design completeness. In system design interviews, narrate use cases aloud while the interviewer nods or redirects.

Use Case: Create Short URL
─────────────────────────
Actor:     Registered user
Precond:   User is authenticated
Main flow:
  1. User submits long URL
  2. System validates URL format
  3. System generates unique 7-char code
  4. System persists mapping
  5. System returns short URL
Alt flow 3a: User provides custom alias → check uniqueness
Alt flow 4a: DB unavailable → return 503, do not return broken link
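The use case above maps almost line by line onto code. The sketch below is one possible reading, with several assumptions: a plain dict stands in for the database, codes are random rather than derived from an ID, and `StoreUnavailable`, `DownStore`, and the `short.example` domain are hypothetical names invented for the example.

```python
import secrets
import string
from typing import Optional
from urllib.parse import urlparse

ALPHABET = string.ascii_letters + string.digits
CODE_LENGTH = 7
BASE_URL = "https://short.example/"  # hypothetical domain


class StoreUnavailable(Exception):
    """Hypothetical error standing in for 'DB unavailable' (alt flow 4a)."""


def new_code() -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))


def create_short_url(store, long_url: str, alias: Optional[str] = None):
    """Return (http_status, body) for the main flow and both alternate flows."""
    parts = urlparse(long_url)                   # step 2: validate URL format
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return 400, "invalid URL"
    try:
        if alias is not None:                    # alt flow 3a: custom alias
            if alias in store:
                return 409, "alias already taken"
            code = alias
        else:                                    # step 3: unique 7-char code
            code = new_code()
            while code in store:
                code = new_code()
        store[code] = long_url                   # step 4: persist mapping
    except StoreUnavailable:
        return 503, "service unavailable"        # alt flow 4a: no broken link
    return 201, BASE_URL + code                  # step 5: return short URL


class DownStore(dict):
    """Simulates an unreachable database: every write fails."""

    def __setitem__(self, key, value):
        raise StoreUnavailable()


store = {}
status, short_url = create_short_url(store, "https://example.com/some/long/path")
print(status, short_url)
```

Narrating the alternate flows this way makes it obvious which HTTP status each failure maps to before any architecture is drawn.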

3.3 Non-Functional Requirements: The Quality Taxonomy

NFRs are where system design lives. They are often under-specified in interviews on purpose, because the interviewer wants to see you ask the right clarifying questions. Memorize this taxonomy:

The NFR Wheel: Eight Quality Attributes

Scalability • Availability • Reliability • Latency • Security • Maintainability • Cost • Compliance

Scalability

Can the system handle 10× growth without redesign? Horizontal vs vertical scaling path.

Availability

Uptime SLA (99.9% ≈ 8.8 hrs downtime/year). Redundancy, failover.

Latency

p50, p95, p99 response times. Tail latency matters at scale.

Durability

Once acknowledged, data survives crashes. Replication, backups.
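Availability percentages are easier to reason about once converted into a downtime budget. A minimal sketch, assuming a 365-day year and a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(sla)
    print(f"{sla}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

Each extra "nine" cuts the budget by a factor of ten: three nines allow roughly 8.76 hours a year, four nines under an hour.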

The "ilities" interview trick: When stuck, run through scalability, availability, reliability, maintainability, security, latency, durability, and cost. You will surface at least two NFRs the interviewer expected you to ask about.

3.4 Constraints: The Boundaries You Cannot Cross

Constraints are hard limits. Unlike NFRs (which are targets you optimize toward), constraints are immovable walls. Ignoring them invalidates your entire design.

Technical Constraints

Must use existing PostgreSQL cluster. Must integrate with legacy SOAP API. Must run on-premises (no cloud). Team only knows Python.

Business Constraints

Launch in 3 months. Budget capped at $5K/month infra. Must support only US users initially. No third-party data sharing.

Regulatory Constraints

GDPR (EU data residency). HIPAA (health data encryption). PCI-DSS (payment card handling). SOC 2 audit requirements.

Constraint Triangle: Pick Two, Sacrifice One

Fast (Time) • Cheap (Cost) • Good (Quality)

You can optimize for two. The third suffers.

3.5 Scope Management: MoSCoW Prioritization

Not every requirement ships in v1. MoSCoW prevents scope creep and forces explicit prioritization, which is critical in interviews when the interviewer says "we have 45 minutes."

Priority | Meaning | URL Shortener Example
Must Have | Non-negotiable for launch | Create short URL, redirect, uniqueness
Should Have | Important but not blocking | Custom aliases, expiration dates
Could Have | Nice if time permits | Click analytics dashboard
Won't Have | Explicitly out of scope | QR code generation, A/B testing

In interviews, state your assumptions: "I'll treat analytics as a Could Have and focus the core design on redirect latency and scale." This shows product thinking and time management.

3.6 The Clarification Question Bank

Memorize and adapt these questions. Ask 5–8 at the start of any design session. The answers reshape your entire architecture.

SCALE
  • How many daily active users (DAU)? Total registered users?
  • Read-to-write ratio?
  • Expected growth rate (6 months, 1 year)?
  • Peak vs average traffic (burst factor)?

PERFORMANCE
  • Latency targets (p50, p99)?
  • Throughput (requests/sec)?
  • Real-time vs batch acceptable?

DATA
  • How much data stored per entity?
  • Retention period?
  • Consistency requirements (strong vs eventual)?
  • Can we lose data? Under what conditions?

USERS & GEO
  • Global or single region?
  • Mobile, web, or both?
  • Authenticated vs anonymous users?

CONSTRAINTS
  • Existing tech stack or greenfield?
  • Budget / team size / timeline?
  • Regulatory requirements (GDPR, HIPAA)?
Module 04

High-Level Design (HLD) Architecture

High-Level Design is the art of drawing the right boxes and arrows. You decide what major components exist, how they communicate, and where data flows, all without writing a single line of implementation code. This module covers architectural styles, when to use each, and how to produce interview-grade HLD diagrams.

4.1 Anatomy of a Good HLD Diagram

A strong HLD diagram answers four questions at a glance: Who calls whom? Where does data live? What are the failure points? How does traffic scale? It uses consistent notation, labels protocols on arrows, and groups related components.

HLD Diagram Legend: Standard Notation

  • Service: API / app server
  • Cache: Redis, Memcached
  • Database: SQL / NoSQL
  • Queue: Kafka, SQS
  • Client: browser / mobile
  • Solid line (e.g., HTTPS) = sync request/response
  • Dashed line = async / event

Label every arrow with its protocol: HTTP, gRPC, TCP, AMQP

4.2 Monolith vs Microservices vs Serverless

The most common architectural decision. There is no universal winner, only the right fit for your team's size, scale, and operational maturity.

Three Architectural Styles Compared

  • Monolith: Auth, Payment, and User modules in a single deployable unit. ✓ Simple. ✗ Scales as one.
  • Microservices: Auth, Payment, and User as independent services with APIs. ✓ Scale independently. ✗ Ops complexity.
  • Serverless: functions (λ fn1, λ fn2, λ fn3) run on demand from event triggers. ✓ Zero ops. ✗ Cold starts.

Start with a monolith, then extract microservices when team or scale demands it (evolutionary architecture). "MonolithFirst" is Martin Fowler's recommended default for new products.
Factor | Monolith | Microservices | Serverless
Team size | 1–10 engineers | 10–100+ (Conway's Law) | Small teams, event workloads
Deploy complexity | Low | High (CI/CD per service) | Very low
Scaling | Vertical + replicate all | Per-service horizontal | Auto-scale per function
Best for | MVPs, early stage | Large orgs, varied scale | Spiky, infrequent workloads

4.3 Layered (N-Tier) Architecture

The classic pattern: separate presentation, business logic, and data access into distinct layers. Each layer only talks to the layer directly below it. Simple, well-understood, and still the backbone of most enterprise applications.

Three-Tier Architecture
Presentation Layer Web UI, Mobile App, API Gateway Business Logic Layer Services, validation, orchestration Data Access Layer ORM, repositories, query builders Database

When to use: CRUD-heavy business apps, internal tools, e-commerce backends. Watch out for: the "anemic domain model" where the logic layer becomes a thin pass-through. Keep business rules in the business layer, not scattered in controllers.

4.4 Event-Driven Architecture (EDA)

Instead of services calling each other directly (tight coupling), producers emit events to a message broker. Consumers subscribe and react independently. This enables loose coupling, async processing, and natural audit trails.

Event-Driven Flow: Order Placed Example

[Diagram: Order Svc publishes to a Message Broker (Kafka / RabbitMQ); the Inventory, Payment, and Notification services consume]

Event: OrderPlaced { orderId, userId, items }
Each consumer processes independently

Benefits: decoupling, resilience (consumers can retry), scalability (add consumers without changing producer). Costs: eventual consistency, debugging complexity, need for idempotent consumers.
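The flow and the idempotency requirement can be sketched with a toy in-memory broker. This is an illustration only: the `Broker` class is hypothetical and dispatches synchronously, whereas a real broker such as Kafka or RabbitMQ delivers asynchronously over the network.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: topic name -> list of subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer does not know who is listening.
        for handler in self._subscribers[topic]:
            handler(event)


class InventoryConsumer:
    """Idempotent consumer: remembers processed order IDs and skips repeats."""

    def __init__(self):
        self.reserved = []
        self._seen = set()

    def handle(self, event):
        if event["orderId"] in self._seen:
            return  # duplicate delivery: already processed
        self._seen.add(event["orderId"])
        self.reserved.extend(event["items"])


broker = Broker()
inventory = InventoryConsumer()
notifications = []  # a naive, non-idempotent consumer for contrast

broker.subscribe("OrderPlaced", inventory.handle)
broker.subscribe("OrderPlaced", lambda e: notifications.append(e["orderId"]))

event = {"orderId": "o-1", "userId": "u-7", "items": ["book", "pen"]}
broker.publish("OrderPlaced", event)
broker.publish("OrderPlaced", event)  # simulated redelivery after a retry

print(inventory.reserved, notifications)
```

After the redelivery, the idempotent consumer has reserved the items once, while the naive one has recorded the order twice. That difference is the reason idempotent consumers appear in the cost column.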

4.5 Data Flow & Component Interaction

Every HLD must show read path and write path separately, because they often have different performance characteristics and caching strategies.

Read Path (URL Shortener redirect):
  Client → CDN edge → Load Balancer → API Server → Redis cache (hit?) → return 301
                                              ↓ miss
                                         PostgreSQL read replica → populate cache → return 301

Write Path (create short URL):
  Client → Load Balancer → API Server → generate code → PostgreSQL primary (write)
                                                    → async replicate to read replicas
                                                    → optionally warm Redis cache
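The read path above is the cache-aside pattern. A minimal sketch, assuming plain dicts as stand-ins for Redis and PostgreSQL; the `UrlRepository` class and its method names are hypothetical.

```python
class UrlRepository:
    """Cache-aside lookup with dicts standing in for Redis and PostgreSQL."""

    def __init__(self, database: dict):
        self.cache = {}            # stand-in for Redis
        self.database = database   # stand-in for PostgreSQL
        self.db_reads = 0          # counts how often the database is hit

    def resolve(self, code: str):
        """Read path: try the cache, fall back to the database, then populate."""
        if code in self.cache:
            return self.cache[code]          # cache hit
        self.db_reads += 1                   # cache miss
        long_url = self.database.get(code)
        if long_url is not None:
            self.cache[code] = long_url      # populate the cache for next time
        return long_url

    def create(self, code: str, long_url: str):
        """Write path: the database is the source of truth; cache warming is optional."""
        self.database[code] = long_url
        self.cache[code] = long_url


repo = UrlRepository({"abc1234": "https://example.com/a"})
first = repo.resolve("abc1234")    # miss: reads the database
second = repo.resolve("abc1234")   # hit: served from the cache
repo.create("xyz9876", "https://example.com/b")
third = repo.resolve("xyz9876")    # hit: warmed on write
print(first, second, third, repo.db_reads)
```

Three lookups cost a single database read here, which is the effect a high read:write ratio makes worthwhile.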

4.6 Choosing Your Architecture Style

Use this decision flowchart mentally during interviews:

Architecture Selection Decision Tree

  1. New product, small team? Yes → Monolith. No → continue.
  2. Varied scale per feature? Yes → Microservices. No → continue.
  3. Spiky / event-driven? Yes → Serverless + queues. No → Monolith or modular monolith.

Always state your choice AND the trade-off accepted: "Monolith now; extract the payment service when it needs independent scaling."
Module 05

API Design & Communication Patterns

APIs are the contracts between your system's islands. Poor API design creates coupling, versioning nightmares, and cascading failures. This module covers REST principles, protocol selection (REST vs gRPC vs GraphQL), sync vs async patterns, and production-grade API hygiene.

5.1 REST API Design Principles

REST (Representational State Transfer) models your system as resources identified by URLs, manipulated via HTTP verbs. Good REST APIs are predictable, cacheable, and self-describing.

Resource-Oriented URL Design (URL Shortener)
────────────────────────────────────────────
POST   /api/v1/urls              Create short URL       → 201 Created
GET    /api/v1/urls/{code}       Get URL metadata       → 200 OK
GET    /api/v1/urls/{code}/stats Get click analytics    → 200 OK
DELETE /api/v1/urls/{code}       Deactivate short URL   → 204 No Content

GET    /{code}                   Redirect (not /api/)   → 301 Moved Permanently
                                 (browsers cache 301; use 302 if every click must be counted)

Anti-patterns to avoid:
  POST /api/createShortUrl       (verb in URL: use nouns)
  GET  /api/deleteUrl?id=123     (mutation via GET: never)
  GET  /api/v1/getAllUsers       (RPC style disguised as REST)

HTTP Status Codes You Must Know

200 OK: successful GET/PUT
201 Created: successful POST
204 No Content: successful DELETE
400 Bad Request: client error
401 Unauthorized: auth required
404 Not Found: resource missing
409 Conflict: duplicate resource
429 Too Many Requests: rate limited
500 Internal Server Error
503 Service Unavailable: overload

5.2 REST vs gRPC vs GraphQL

Three dominant API paradigms, each optimized for different constraints. Choosing the wrong one creates friction for years.

API Paradigm Comparison

REST

  • JSON over HTTP
  • Human-readable
  • Cacheable (HTTP)
  • Over/under-fetching
  • Best: public APIs, CRUD, web
  • Examples: Twitter, Stripe, GitHub

gRPC

  • Protobuf binary
  • HTTP/2 multiplexing
  • Strong typing + codegen
  • Bi-directional streaming
  • Best: internal microservices
  • Examples: Google, Netflix internal

GraphQL

  • Single endpoint
  • Client specifies fields
  • No over-fetching
  • Complex server caching
  • Best: mobile, varied clients
  • Examples: Facebook, GitHub API v4

5.3 Synchronous vs Asynchronous Communication

Synchronous: caller waits for response (HTTP request/response). Simple mental model, but creates temporal coupling: if the callee is slow or down, the caller suffers. Asynchronous: caller sends message and continues (queue/event). Decouples availability but introduces complexity (ordering, duplicates, eventual consistency).

Sync vs Async Communication Patterns

  • Synchronous (Request/Response): Client sends req → Service returns resp (client waits)
  • Asynchronous (Message Queue): Producer → Queue → Consumer (producer continues immediately)

Use sync for: user-facing reads, auth checks, real-time responses
Use async for: emails, analytics, image processing, order fulfillment

5.4 Idempotency, Retries & Rate Limiting

Distributed systems fail mid-request. Networks drop packets. Clients retry. Without idempotency, a payment API called twice charges the customer twice.

Idempotent operations produce the same result no matter how many times they're executed. GET, PUT, DELETE are naturally idempotent. POST is not; solve this with Idempotency-Key headers stored in Redis with a TTL.

// Client sends idempotency key on POST
POST /api/v1/payments
Headers: Idempotency-Key: "550e8400-e29b-41d4-a716-446655440000"

// Server logic (pseudocode):
cached_response = redis.get(idempotency_key);
if (cached_response != null) {
  return cached_response;  // duplicate request: return the stored result
}
result = process_payment(request);
redis.setex(idempotency_key, 86400, result);  // keep the result for 24 hours
return result;
// Caveat: two concurrent requests with the same key can both miss the cache
// here. In production, claim the key atomically first (e.g., SET with NX).

// Rate limiting (Token Bucket algorithm)
// 100 requests/minute per API key → return 429 when exceeded
// Headers: X-RateLimit-Remaining: 42, X-RateLimit-Reset: 1620000000
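The token bucket named in the comments can be sketched in a few lines. This is a single-process illustration under stated assumptions: the `TokenBucket` class is hypothetical, it keeps state in memory rather than in a shared store, and the injectable `clock` parameter exists only so the example can be driven without sleeping.

```python
import time


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_sec: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed; False means respond with 429."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# 100 requests/minute for one API key, driven by a fake clock.
fake_now = [0.0]
bucket = TokenBucket(capacity=100, refill_per_sec=100 / 60, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(101)]  # 101 requests at the same instant
fake_now[0] += 1.0                            # one second refills about 1.67 tokens
after_wait = bucket.allow()
print(burst.count(True), burst[-1], after_wait)
```

The bucket admits a burst up to its capacity, rejects the next request, and recovers as tokens drip back in, which is the behavior that distinguishes it from a fixed-window counter.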

5.5 API Versioning & Documentation

APIs evolve. Versioning strategy prevents breaking existing clients. Common approaches:

  • URL versioning: /api/v1/users (simple, explicit, most common)
  • Header versioning: Accept: application/vnd.myapi.v2+json (clean URLs, harder to test in browser)
  • Query param: /api/users?version=2 (least recommended)

Document APIs with OpenAPI (Swagger) specs, which can be used to generate interactive docs, client SDKs, and mock servers. In system design interviews, listing 3–4 key endpoints with request/response shapes demonstrates API thinking without writing full specs.

Module 06

Data Modeling & Database Selection

Data outlives code. The schema and storage engine you choose on day one constrains every feature for years. This module teaches you to model entities, choose between SQL and NoSQL, understand normalization trade-offs, and plan for horizontal data scaling.

6.1 Entity-Relationship Modeling

Before picking a database, model your entities (nouns: User, Order, URL) and relationships (verbs: places, contains, maps-to). This ER diagram drives your schema regardless of SQL or NoSQL.

ER Diagram: URL Shortener

User (id, email) 1:N creates → ShortUrl (code, long_url) 1:N tracks → ClickEvent (timestamp, ip)

-- Relational schema derived from ER diagram
CREATE TABLE short_urls (
  id          BIGSERIAL PRIMARY KEY,
  code        VARCHAR(10) UNIQUE NOT NULL,
  long_url    TEXT NOT NULL,
  user_id     BIGINT REFERENCES users(id),
  created_at  TIMESTAMPTZ DEFAULT NOW(),
  expires_at  TIMESTAMPTZ,
  is_active   BOOLEAN DEFAULT TRUE
);
-- Redirect lookups use the index that UNIQUE on code already creates in PostgreSQL
CREATE INDEX idx_short_urls_user ON short_urls(user_id);   -- user's URLs list

6.2 Normalization vs Denormalization

Normalization eliminates redundancy: data lives in one place (3NF). Updates are consistent but reads may require JOINs. Denormalization duplicates data for faster reads, which is common in read-heavy systems at scale.

Normalized (3NF)

users table + orders table + order_items table. Insert order = 3 table writes. Read order with items = JOIN query.

Denormalized

orders document embeds user_name and items array. Read order = single document fetch. Update user name = update many documents.

Rule of thumb: Normalize for write-heavy, consistency-critical systems (banking). Denormalize for read-heavy systems where query speed matters more than storage redundancy (feeds, analytics dashboards).
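The trade-off shows up even with plain in-memory data. In this sketch (all names and records are made up for illustration), the normalized layout needs a second lookup on read but a single write on rename, while the denormalized layout reads in one step but must touch every copy on rename.

```python
# Normalized: the user's name lives in exactly one place.
users = {1: {"name": "Ada"}}
orders = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 1}]

# Denormalized: each order document embeds its own copy of the name.
order_docs = [{"id": 10, "user_name": "Ada"}, {"id": 11, "user_name": "Ada"}]


def order_view_normalized(order):
    """Reading needs a lookup in a second collection (the 'JOIN')."""
    return {"id": order["id"], "user_name": users[order["user_id"]]["name"]}


def rename_user_normalized(user_id, new_name):
    users[user_id]["name"] = new_name
    return 1  # records written


def rename_user_denormalized(old_name, new_name):
    written = 0
    for doc in order_docs:
        if doc["user_name"] == old_name:
            doc["user_name"] = new_name
            written += 1
    return written


writes_normalized = rename_user_normalized(1, "Ada L.")
writes_denormalized = rename_user_denormalized("Ada", "Ada L.")
print(writes_normalized, writes_denormalized, order_view_normalized(orders[0]))
```

With two orders the difference is one write versus two; with millions of embedded copies, a rename becomes a background job.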

6.3 SQL vs NoSQL: Decision Framework

Database Selection Decision Tree

  1. Need ACID transactions? Yes → SQL (PostgreSQL). No → continue.
  2. Fixed or flexible schema? Fixed → SQL. Flexible → Document DB (MongoDB).
  3. Access pattern?
     • Key lookup → Redis / DynamoDB
     • Time-series → InfluxDB / TimescaleDB
     • Graph → Neo4j
     • Wide-column → Cassandra
     • Search → Elasticsearch

Polyglot Persistence: use multiple DBs for different jobs. PostgreSQL (transactions) + Redis (cache) + Elasticsearch (search) is normal and expected at scale.

6.4 Sharding & Partitioning

When a single database node can't hold your data or serve your query load, you partition (split) data across multiple nodes. Horizontal partitioning (sharding) splits rows by a shard key. Vertical partitioning splits columns or tables by feature.

Horizontal Sharding by User ID
[Diagram: Application → Shard Router (user_id % 3) → Shard 0 (user_id % 3 = 0), Shard 1, Shard 2]

Shard key selection is critical: a bad key (e.g., country) creates hot shards. Good keys distribute evenly (user_id hash, UUID). Avoid cross-shard JOINs; design queries to hit a single shard.
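A minimal sketch of hash-based shard routing, assuming string keys and a fixed shard count. The function name `shard_for` and the `user:<id>` key format are illustrative choices, not the API of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to get the same answer everywhere.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Route 10,000 hypothetical user IDs across 3 shards and count each shard's load.
counts = [0, 0, 0]
for user_id in range(10_000):
    counts[shard_for(f"user:{user_id}", 3)] += 1

print(counts)  # roughly even split across the three shards
```

Note that plain modulo routing remaps most keys when the shard count changes; the consistent hashing algorithm listed in Module 8 is the usual way to limit that movement.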

6.5 Data Access Patterns Drive Everything

The single most important database design principle: design your schema around how data is read and written, not around how it looks on a whiteboard.

Access Pattern | Storage Choice | Why
Lookup by primary key | SQL B-tree index / DynamoDB | O(log n) or O(1) retrieval
Session / hot key cache | Redis | Sub-ms in-memory access
Full-text search | Elasticsearch | Inverted indexes for text
Time-series metrics | TimescaleDB / InfluxDB | Optimized for time-range queries
Social graph traversal | Neo4j / adjacency lists | JOINs on graphs are expensive in SQL
Module 07

Caching Strategies

Caching is often the highest-ROI optimization in system design. A well-placed cache can turn 100ms database queries into 1ms memory lookups and, for read-heavy workloads with a high hit rate, cut database load by 90% or more. This module covers the main caching layers, patterns, and eviction policies, plus the infamous cache invalidation problem.

7.1 Why Caching Exists: The Memory Hierarchy at Scale

You learned in Module 1 that CPU cache is roughly 100× faster than RAM, which is roughly 1000× faster than disk. Distributed systems follow the same principle: keep hot data as close to the consumer as possible. Every cache layer trades freshness for speed.

The Distributed Cache Hierarchy
[Diagram: Browser Cache → CDN Edge (~10ms; static assets) → Reverse Proxy / API GW → Application Cache (Redis ~1ms; hot keys, sessions) → Database (~5-50ms; source of truth). Each layer absorbs traffic before it hits the next.]

Analogy (The Beach Snack Shack): Instead of every tourist sailing to the mainland warehouse (database) for water, you place snack shacks (caches) at the beach, pier, and hotel lobby. Most requests never leave the island. You restock shacks periodically; that's cache invalidation.

7.2 Core Caching Patterns

Four fundamental patterns govern how application code interacts with cache and database. Know all four; interviews often ask you to pick one and justify it.

Cache-Aside (Lazy Loading): Most Common Pattern
[Diagram: App, Cache, DB. Read path: 1. get from cache (HIT → return); 2. on miss, read from DB; 3. populate cache. Cache-aside is the simplest and most common pattern.]
Pattern | How It Works | Trade-off
Cache-Aside | App checks cache; on miss, reads DB and populates cache | Simple; stale data possible between writes
Read-Through | Cache itself loads from DB on miss | Cleaner app code; cache library must support it
Write-Through | Write goes to cache AND DB synchronously | Consistent; higher write latency
Write-Behind | Write to cache; async flush to DB later | Fast writes; risk of data loss on crash
// Cache-Aside pseudocode (URL redirect)
function getLongUrl(shortCode):
  cached = redis.get("url:" + shortCode)
  if cached:
    return cached                          // cache HIT (~1ms)

  row = db.query("SELECT long_url FROM short_urls WHERE code = ?", shortCode)
  if not row:
    return null                            // unknown code: caller responds 404
  redis.setex("url:" + shortCode, 3600, row.long_url)  // TTL 1 hour
  return row.long_url                      // cache MISS (~10ms)

7.3 CDN Caching: Edge Proximity

A Content Delivery Network (CDN) caches static and dynamic content at edge servers geographically close to users. A user in Tokyo hits a Tokyo edge node instead of your US-origin server, which can cut latency from roughly 300ms to 20ms.

Cache-Control headers control CDN behavior: max-age=3600 (cache 1 hour), no-cache (revalidate every time), private (browser only, not CDN). For URL shortener redirects, CDN can cache 301 responses for popular short codes.

7.4 Eviction Policies & TTL

Caches have finite memory. When full, something must go. TTL (Time To Live) expires entries automatically. Eviction policies decide what to remove when memory is full.

LRU (Least Recently Used)

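Evict the item not accessed for the longest time. Redis offers it as the allkeys-lru and volatile-lru policies (an approximated LRU); note that the Redis default maxmemory-policy is noeviction. Good general-purpose policy.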

LFU (Least Frequently Used)

Evict the item accessed fewest times. Better when access patterns have long-tail popularity (viral content).
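The LRU policy described above can be sketched in a few lines, assuming an in-process cache with a fixed entry count. The class name `LRUCache` is illustrative; production caches such as Redis approximate LRU rather than tracking exact order.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", "1")
cache.put("b", "2")
cache.get("a")       # touching "a" makes "b" the least recently used
cache.put("c", "3")  # over capacity: "b" is evicted
```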

TTL strategy: Short TTL (60s) for frequently changing data. Long TTL (24h) for static content. Jitter TTL (random ±10%) to prevent synchronized mass expiration.
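As a small illustration of TTL jitter, the sketch below spreads a base TTL by a random factor of up to ±10%. The helper name `jittered_ttl` and the 3600-second base are assumptions for the example.

```python
import random


def jittered_ttl(base_ttl_seconds: int, jitter_fraction: float = 0.10) -> int:
    """Return the base TTL scaled by a random factor in [1 - jitter, 1 + jitter]."""
    factor = 1.0 + random.uniform(-jitter_fraction, jitter_fraction)
    return max(1, round(base_ttl_seconds * factor))


# 1,000 keys written at the same moment now expire across a window of about 12 minutes
# instead of all at once.
ttls = [jittered_ttl(3600) for _ in range(1000)]
print(min(ttls), max(ttls))
```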

7.5 Cache Invalidation: The Hard Problem

Phil Karlton famously said: "There are only two hard things in Computer Science: cache invalidation and naming things." When source data changes, stale cache entries must be updated or removed.

TTL-based expiration

Simplest: let entries expire naturally and accept a bounded staleness window.

Write-invalidate

On DB write, delete cache key. Next read repopulates. Most common with cache-aside.

Write-update

On DB write, update cache entry directly. Keeps cache warm but more complex.

// Write-invalidate on URL update
function updateLongUrl(shortCode, newUrl):
  db.execute("UPDATE short_urls SET long_url = ? WHERE code = ?", newUrl, shortCode)
  redis.del("url:" + shortCode)   // invalidate; next read refreshes cache

7.6 Thundering Herd & Cache Stampede

When a popular cache key expires, thousands of concurrent requests all miss simultaneously and hammer the database. This is a cache stampede. Mitigations:

  • Mutex / lock: Only one request rebuilds cache; others wait or return stale
  • Probabilistic early expiration: Randomly refresh before TTL expires
  • Never expire hot keys: Background refresh before expiration
  • Request coalescing: Deduplicate in-flight requests for same key
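The mutex mitigation from the list above can be sketched with one lock per cache key, so that only the first thread to miss rebuilds the entry. This is an in-process illustration under stated assumptions: `slow_db_lookup` stands in for a real database call, and a distributed deployment would need a shared lock (for example in Redis) instead of `threading.Lock`.

```python
import threading
import time

cache = {}
key_locks = {}
registry_lock = threading.Lock()  # guards cache, key_locks and db_calls
db_calls = 0


def slow_db_lookup(key: str) -> str:
    """Hypothetical expensive query; counts how often the database is hit."""
    global db_calls
    with registry_lock:
        db_calls += 1
    time.sleep(0.05)
    return f"value-for-{key}"


def get_with_single_flight(key: str) -> str:
    with registry_lock:
        if key in cache:
            return cache[key]
        lock = key_locks.setdefault(key, threading.Lock())
    with lock:  # only one thread per key gets past this point at a time
        with registry_lock:
            if key in cache:  # re-check: another thread may have filled it
                return cache[key]
        value = slow_db_lookup(key)
        with registry_lock:
            cache[key] = value
        return value


# 50 concurrent requests for the same cold key.
threads = [
    threading.Thread(target=get_with_single_flight, args=("url:abc",))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(db_calls)  # the database was queried once, not 50 times
```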
Module 08

Load Balancing & Horizontal Scaling

One server has limits: CPU cores, memory, network bandwidth. Load balancing distributes traffic across multiple servers so no single machine becomes the bottleneck. Combined with horizontal scaling, this is how systems grow from handling hundreds to millions of requests per second.

8.1 Vertical vs Horizontal Scaling

Vertical scaling (scale up): add more CPU/RAM to one machine. Simple but has a ceiling, and cost grows faster than capacity at the top end (the biggest cloud instances can cost on the order of 10× more for 2× the performance). Horizontal scaling (scale out): add more machines. The path to internet scale, but requires load balancing and stateless design.

Horizontal Scaling with Load Balancer
[Diagram: Clients → Load Balancer (distributes traffic) → Server 1, Server 2, Server 3. Add Server 4, 5, N... without changing clients. Scale-out benefits: no single point of failure, linear throughput growth, rolling deployments. Requires stateless app servers.]

8.2 Layer 4 vs Layer 7 Load Balancing

L4 (Transport Layer)

Routes based on IP + port. Fast, no content inspection. Cannot route by URL path or HTTP headers.

Examples: AWS NLB, HAProxy (TCP mode)

L7 (Application Layer)

Routes based on HTTP headers, URL path, cookies. Can terminate SSL, inject headers, route /api to one pool and /static to another.

Examples: AWS ALB, Nginx, Envoy

8.3 Load Balancing Algorithms

Algorithm | Behavior | Best For
Round Robin | Rotate through servers sequentially | Equal-capacity, uniform requests
Weighted Round Robin | More traffic to more powerful servers | Mixed instance sizes
Least Connections | Route to server with fewest active connections | Long-lived connections, variable request duration
IP Hash | Same client IP → same server | Session affinity without cookies
Consistent Hashing | Minimal redistribution when servers added/removed | Distributed caches, sharding
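To make the last row of the table concrete, here is a minimal consistent hashing sketch with virtual nodes. The class name `HashRing`, the node names, and the choice of 100 virtual nodes per server are illustrative assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring: each node owns many points ("virtual nodes") on the ring."""

    def __init__(self, nodes, vnodes: int = 100) -> None:
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key-{i}" for i in range(10_000)]
before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(len(moved) / len(keys))  # only a fraction of keys move, and all of them to cache-d
```

With plain `hash(key) % N` routing, going from 3 to 4 servers would remap roughly three quarters of the keys; on the ring, the only keys that move are those the new server takes over.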

8.4 Stateless Servers & Session Affinity

For true horizontal scaling, application servers must be stateless: any server can handle any request. Session data lives in Redis, not server memory. When state is unavoidable, sticky sessions (session affinity) route the same user to the same server, but this complicates scaling and failover.

Best practice: Externalize all state to Redis/DB. Avoid sticky sessions unless absolutely required (WebSocket connections are a common exception).

8.5 Health Checks & Auto-Scaling

Load balancers continuously health check backends, for example HTTP GET /health every 10s. Unhealthy servers are removed from rotation automatically. Auto-scaling groups add/remove servers based on CPU, request count, or custom metrics, so you pay only for the capacity you need.

Auto-scaling policy example:
  Scale OUT when: avg CPU > 70% for 3 minutes
  Scale IN  when: avg CPU < 30% for 10 minutes
  Min instances: 2  |  Max instances: 20  |  Desired: 4

Health check endpoint:
  GET /health → 200 { "status": "ok", "db": "connected", "redis": "connected" }
Module 09

Message Queues & Async Processing

Not every operation needs an immediate response. Message queues decouple producers from consumers, absorb traffic spikes, and enable reliable background processing. This module covers queue fundamentals, delivery guarantees, and when to reach for Kafka vs RabbitMQ vs SQS.

9.1 Why Message Queues?

Without queues, every operation is synchronous: the user waits for email sending, image resizing, and analytics logging before seeing "Order Placed." Queues let you acknowledge fast and process slow.

  • Decoupling: producer doesn't know about consumers
  • Buffering: absorb traffic spikes without overwhelming downstream
  • Reliability: messages persist if consumer is temporarily down
  • Scalability: add more consumers to process faster
Point-to-Point vs Pub/Sub
[Diagram: Point-to-Point (Queue): Producer → Queue → Consumer; one message → one consumer. Pub/Sub (Topic): Publisher → Topic → Sub A, Sub B; one message → all subscribers. Use queues for task distribution (order processing) and topics for event broadcasting (user signed up → email + analytics + CRM).]

9.2 Delivery Guarantees

The hardest problem in messaging: ensuring messages are processed exactly once in a world where networks fail and consumers crash. In practice, you choose a guarantee and design idempotent consumers.

Guarantee | Meaning | Risk / Typical Use
At-most-once | Fire and forget; message may be lost | Use only where data loss is acceptable (metrics, logs)
At-least-once | Message delivered ≥1 times; consumer must be idempotent | Duplicates possible; most common in production
Exactly-once | Processed precisely once | Expensive; Kafka transactions, or dedup at consumer

9.3 Kafka vs RabbitMQ vs SQS

Apache Kafka

Distributed commit log. High throughput, message replay, event sourcing. Retains messages for days/weeks. Best for event streams and analytics pipelines.

RabbitMQ

Traditional message broker. Complex routing (exchanges, bindings). Messages deleted after ack. Best for task queues and RPC patterns.

AWS SQS

Fully managed, serverless queue. Standard queues give at-least-once delivery with best-effort ordering; FIFO queues preserve order and deduplicate for exactly-once processing. Best for AWS-native async workloads with zero ops.

9.4 Dead Letter Queues & Backpressure

When a message fails processing repeatedly (poison message), it moves to a Dead Letter Queue (DLQ) for manual inspection, preventing infinite retry loops that block the queue.

Backpressure occurs when consumers can't keep up with producers. Solutions: scale consumers, throttle producers, increase queue capacity, or shed load (drop low-priority messages).
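The dead letter queue flow can be sketched with an in-memory queue. The three-attempt limit, the message shape (`body`, `attempts`), and the `handler` function are assumptions for illustration; managed brokers such as SQS or RabbitMQ implement the same idea through configuration.

```python
from collections import deque

MAX_ATTEMPTS = 3


def drain(queue: deque, dead_letters: list, handler) -> list:
    """Process every message; retry failures, then park them in the DLQ."""
    processed = []
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
            processed.append(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            if message["attempts"] >= MAX_ATTEMPTS:
                # Give up: keep the failure reason for manual inspection.
                dead_letters.append({**message, "error": str(exc)})
            else:
                queue.append(message)  # requeue for a later retry
    return processed


def handler(body: str) -> None:
    """Hypothetical consumer that cannot parse one specific payload."""
    if body == "poison":
        raise ValueError("cannot parse payload")


queue = deque({"body": b, "attempts": 0} for b in ["order-1", "poison", "order-2"])
dlq = []
done = drain(queue, dlq, handler)
print(done, dlq)
```

The healthy messages are processed despite the poison message, which ends up in `dlq` after its third failed attempt instead of blocking the queue.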

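// Idempotent consumer pattern
function processOrderEvent(event):
  // SET with NX + EX atomically claims the event id with a 24h expiry
  // (plain SETNX takes no TTL argument)
  if redis.set("processed:" + event.id, 1, NX, EX=86400):
    // caveat: a crash after the claim but before completion causes the retry to be skipped
    charge_payment(event)
    send_confirmation_email(event)
  else:
    log("Duplicate event, skipping")  // safe to ignore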

9.5 Event Sourcing Preview

Instead of storing current state, event sourcing stores every state change as an immutable event log. Current state is reconstructed by replaying events. Kafka's commit log is naturally suited for this pattern; we'll see it applied in case studies.
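A minimal sketch of replaying an event log to rebuild state, using a hypothetical account balance. The event kinds (`deposited`, `withdrawn`) and amounts are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # events are immutable once recorded
class Event:
    kind: str    # "deposited" or "withdrawn"
    amount: int  # in cents


def apply(balance: int, event: Event) -> int:
    """Pure function: current state + one event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events) -> int:
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 10_000), Event("withdrawn", 2_500), Event("deposited", 500)]
print(replay(log))      # current balance
print(replay(log[:2]))  # balance as of the second event ("time travel")
```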

Module 10

Low-Level Design (LLD) Fundamentals

HLD tells you what services exist. LLD tells you how they're built inside: classes, methods, algorithms, database schemas, and interaction sequences. This module bridges architecture diagrams to implementable code through SOLID principles, UML diagrams, and structured design thinking.

10.1 HLD to LLD: The Zoom-In Transition

After HLD defines the URL Shortener's boxes (API Server, Redis, PostgreSQL), LLD zooms into the API Server box and asks: What classes exist? What methods do they expose? How does encoding work? What exceptions are thrown?

From HLD Box to LLD Class Diagram
[Class diagram: API Server (HLD box) → UrlController (+createUrl(req), +redirect(code)); UrlService (+createShortUrl(), +resolveUrl()); UrlRepository (+save(), +findByCode()); Base62Encoder (+encode(id), +decode(s))]

10.2 SOLID Principles

SOLID guides maintainable object-oriented design. Internalize these; interviewers probe them in LLD rounds.

S: Single Responsibility

A class should have one reason to change. UrlService handles URL logic; EmailService handles email. Don't mix them.

O: Open/Closed

Open for extension, closed for modification. Add new encoding strategies (Base62, Base64) via interface without changing UrlService.

L: Liskov Substitution

Subtypes must be substitutable for base types. Any UrlRepository implementation (PostgreSQL, MongoDB) must honor the same contract.

I: Interface Segregation

Don't force classes to implement methods they don't use. Separate ReadableUrlStore and WritableUrlStore if consumers differ.

D: Dependency Inversion

Depend on abstractions, not concretions. UrlService depends on the UrlRepository interface, not PostgreSQL directly, which enables testing with mocks.

10.3 Sequence Diagrams: Object Interactions Over Time

Sequence diagrams show who calls whom, in what order, over time. Essential for LLD interviews when explaining a use case flow.

Sequence Diagram: Create Short URL
[Sequence: Client → Controller: POST /urls; Controller → Service: createShortUrl(); Service → Repository: save(url); Repository → Service: id; Service → Controller: shortUrl; Controller → Client: 201 Created]

10.4 LLD Code Structure: Layered Implementation

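// Interface (Dependency Inversion); methods are async because they perform I/O
interface UrlRepository {
  save(url: ShortUrl): Promise<ShortUrl>
  findByCode(code: string): Promise<ShortUrl | null>
  existsByCode(code: string): Promise<boolean>
}

// Service layer (business logic)
class UrlService {
  constructor(
    private repo: UrlRepository,
    private encoder: UrlEncoder,
    private cache: CacheClient
  ) {}

  async createShortUrl(longUrl: string, userId?: string): Promise<ShortUrl> {
    this.validateUrl(longUrl)
    const code = await this.generateUniqueCode()
    const url = new ShortUrl(code, longUrl, userId)
    const saved = await this.repo.save(url)
    await this.cache.set(`url:${code}`, longUrl, 3600)  // warm the cache, TTL 1 hour
    return saved
  }

  private validateUrl(longUrl: string): void {
    new URL(longUrl)  // throws TypeError on a malformed URL
  }

  private async generateUniqueCode(maxAttempts = 5): Promise<string> {
    // retry loop with collision detection
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      // Assumption: UrlEncoder exposes encode(id: number), like Base62Encoder in 10.1
      const code = this.encoder.encode(Math.floor(Math.random() * 62 ** 7))
      if (!(await this.repo.existsByCode(code))) return code
    }
    throw new Error("Could not generate a unique short code")
  }
}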

10.5 LLD Interview Approach

  1. Clarify scope: "Are we designing the full system or one component?"
  2. Identify entities: nouns become classes (User, Order, ParkingSpot)
  3. Define relationships: associations, inheritance, composition
  4. Walk through use cases: draw sequence diagrams for 2-3 main flows
  5. Handle edge cases: concurrency, validation, error handling
  6. Discuss extensibility: how would you add feature X without rewriting?
Module 11

OOP Design Patterns for LLD

Design patterns are reusable solutions to recurring object-oriented design problems. In LLD interviews, naming the right pattern, and explaining why it fits, signals senior-level thinking. This module covers the patterns you'll use most, with concrete examples from parking lots, payment systems, and notification engines.

11.1 What Are Design Patterns?

A design pattern is a proven template for structuring classes and their interactions. Patterns are not copy-paste code; they are a shared vocabulary. Saying "I'll use Strategy for payment methods" instantly communicates intent to any experienced engineer.

The Gang of Four (GoF) Pattern Categories
[Diagram: Creational (how objects are born): Singleton, Factory, Builder, Prototype, Abstract Factory. Structural (how objects compose): Adapter, Decorator, Facade, Proxy, Composite, Bridge. Behavioral (how objects communicate): Strategy, Observer, Command, State, Iterator, Template Method. Focus on the 8-10 patterns that cover most LLD interviews.]

Interview tip: Don't name-drop patterns without context. Always follow: "I'm using [Pattern] because [problem], which gives us [benefit] at the cost of [trade-off]."

11.2 Creational Patterns

Singleton: One Instance Only

Guarantees a class has exactly one instance with global access. Use for database connection pools, configuration managers, thread pools. Caution: overuse creates hidden dependencies and makes testing hard.

class DatabaseConnectionPool {
  private static instance: DatabaseConnectionPool
  private constructor() {}  // prevent direct instantiation

  static getInstance(): DatabaseConnectionPool {
    if (!DatabaseConnectionPool.instance) {
      DatabaseConnectionPool.instance = new DatabaseConnectionPool()
    }
    return DatabaseConnectionPool.instance
  }
}

Factory / Factory Method: Delegate Object Creation

Encapsulates object creation logic. Client asks for a "Notification" without knowing if it's Email, SMS, or Push. Classic LLD example: Design a Notification System.

interface Notification { send(message: string, recipient: string): void }

class EmailNotification implements Notification { /* ... */ }
class SmsNotification implements Notification { /* ... */ }
class PushNotification implements Notification { /* ... */ }

class NotificationFactory {
  static create(type: 'email' | 'sms' | 'push'): Notification {
    switch (type) {
      case 'email': return new EmailNotification()
      case 'sms':   return new SmsNotification()
      case 'push':  return new PushNotification()
    }
  }
}

Builder: Construct Complex Objects Step by Step

When an object has many optional fields (Order with items, discounts, shipping, gift wrap). Builder provides fluent API and validates before construction.

const order = new OrderBuilder()
  .setCustomer(userId)
  .addItem(productId, quantity)
  .applyCoupon('SAVE10')
  .setShippingAddress(address)
  .build()  // validates all required fields before creating Order

11.3 Structural Patterns

Adapter: Bridge Incompatible Interfaces

Wraps a legacy or third-party API so it conforms to your interface. Your payment system expects PaymentProcessor; the Stripe SDK has a different API, and an Adapter bridges the gap.

Decorator: Add Behavior Without Subclassing

Wraps an object to add responsibilities dynamically. A base Coffee can be wrapped with MilkDecorator, then WhipDecorator. In systems: add logging, caching, or encryption layers around a service.

Decorator Pattern: Layering Behaviors
[Diagram: BaseService wrapped by LoggingDecorator, CachingDecorator, AuthDecorator. Each wrapper delegates to the inner object, adding its own behavior.]

Facade: Simplified Interface to Complex Subsystem

OrderFacade.placeOrder() internally coordinates inventory, payment, shipping, and notification services; the client sees one simple method.

Proxy: Stand-In With Controlled Access

Proxy controls access to a real object. Use cases: lazy loading (load image only when displayed), access control, remote proxy (RPC stub), caching proxy.

11.4 Behavioral Patterns: The LLD Workhorses

Strategy: Swap Algorithms at Runtime

Define a family of algorithms, encapsulate each, and make them interchangeable. One of the most frequently used patterns in LLD interviews.

// Parking Lot LLD: different pricing strategies
interface PricingStrategy {
  calculateFee(entryTime: Date, exitTime: Date): number
}

class HourlyPricing implements PricingStrategy { /* $5/hour */ }
class FlatRatePricing implements PricingStrategy { /* $20/day */ }
class WeekendPricing implements PricingStrategy { /* 1.5× multiplier */ }

class ParkingTicket {
  constructor(private strategy: PricingStrategy) {}
  getFee(entry: Date, exit: Date) { return this.strategy.calculateFee(entry, exit) }
}

Observer: Publish/Subscribe Within Code

When one object's state change must notify many dependents. OrderSubject notifies EmailObserver, InventoryObserver, AnalyticsObserver on order placement. Mirrors event-driven architecture at the code level.

Observer Pattern: Order Placed Event
[Diagram: OrderSubject → EmailObserver, InventoryObserver, AnalyticsObserver. notify() broadcasts to all registered observers.]

Command: Encapsulate Actions as Objects

Turn requests into objects with execute() and undo(). Powers undo/redo in text editors, job queues, and transaction systems. Each command stores the receiver and parameters.
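As a minimal sketch of the Command pattern with undo/redo, the example below treats a list of strings as the document. The class names `AppendText` and `History` are illustrative assumptions.

```python
class AppendText:
    """Command that appends text to a document and knows how to reverse itself."""

    def __init__(self, document: list, text: str) -> None:
        self.document = document  # the receiver
        self.text = text          # the parameter

    def execute(self) -> None:
        self.document.append(self.text)

    def undo(self) -> None:
        self.document.pop()


class History:
    """Invoker that tracks executed commands for undo/redo."""

    def __init__(self) -> None:
        self._done = []
        self._undone = []

    def run(self, command) -> None:
        command.execute()
        self._done.append(command)
        self._undone.clear()  # a new action invalidates the redo stack

    def undo(self) -> None:
        if self._done:
            command = self._done.pop()
            command.undo()
            self._undone.append(command)

    def redo(self) -> None:
        if self._undone:
            command = self._undone.pop()
            command.execute()
            self._done.append(command)


document = []
history = History()
history.run(AppendText(document, "Hello"))
history.run(AppendText(document, "world"))
history.undo()
after_undo = list(document)
history.redo()
after_redo = list(document)
print(after_undo, after_redo)
```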

State: Behavior Changes With Internal State

Object behavior changes when its state changes. A VendingMachine behaves differently in Idle, HasMoney, Dispensing, and OutOfStock states; each state is its own class implementing a common interface.

interface VendingState {
  insertCoin(machine: VendingMachine): void
  selectProduct(machine: VendingMachine, code: string): void
  dispense(machine: VendingMachine): void
}

class IdleState implements VendingState {
  insertCoin(m: VendingMachine) { m.setState(new HasMoneyState()); m.addCredit(1) }
  selectProduct(m: VendingMachine, code: string) { throw new Error("Insert coin first") }
  dispense(m: VendingMachine) { throw new Error("No product selected") }
}

11.5 Pattern Selection Guide for LLD Interviews

Problem You Face | Reach For | LLD Example
Multiple interchangeable algorithms | Strategy | Payment methods, pricing rules, routing
Notify many components on state change | Observer | Order events, stock price alerts
Create objects without specifying exact class | Factory | Notification channels, DB drivers
Object behavior depends on current state | State | Elevator, vending machine, workflow
Undo/redo or queue operations | Command | Text editor, task scheduler
Add features without modifying class | Decorator | Logging, caching, compression layers
Simplify complex subsystem API | Facade | Order placement, checkout flow
Integrate incompatible third-party API | Adapter | Legacy payment gateway wrapper

11.6 Anti-Patterns to Avoid

  • God Object: one class does everything, which violates SRP and is untestable
  • Pattern overload: forcing Factory + Strategy + Decorator + Observer into a simple CRUD app
  • Premature abstraction: "We might need 10 payment methods someday" (YAGNI applies)
  • Singleton abuse: global mutable state makes unit testing a nightmare
  • Inheritance over composition: deep class hierarchies tend to violate Liskov substitution; favor interfaces + composition

Golden rule: Start simple. Introduce a pattern only when you can name the specific problem it solves. Interviewers reward clarity over complexity.

Module 12

Concurrency & Thread Safety

Modern servers handle thousands of requests simultaneously. Concurrency unlocks performance, but shared mutable state is behind many of the production bugs that can't be reproduced locally. This module teaches you to reason about threads, races, locks, deadlocks, and the design patterns that keep multi-threaded systems correct.

12.1 Concurrency vs Parallelism

Concurrency is about dealing with many things at once: structuring your program so multiple tasks make progress. Parallelism is about doing many things at once: literally executing on multiple CPU cores simultaneously. You can have concurrency without parallelism (single-core time-slicing) and parallelism without much concurrency (embarrassingly parallel batch jobs).

Processes vs Threads
[Diagram: Process A has its own memory space (heap, stack) with Threads 1-3; threads share the heap and each has its own stack. Process B has isolated memory (no shared state) with Thread 1. Processes communicate via IPC (pipes, sockets). Microservices ≈ processes; threads within a service share memory.]

Analogy (Shared Kitchen): Threads are chefs in the same kitchen (shared memory). If two chefs grab the same knife (shared variable) without coordinating, someone gets cut (race condition). Processes are separate kitchens: safer, but slower to pass ingredients between them.

12.2 Race Conditions & Critical Sections

A race condition occurs when the correctness of your program depends on the unpredictable timing of thread execution. The classic example: two threads increment a shared counter; both read 0, both write 1, and the result is 1 instead of 2.

// RACE CONDITION: counter may end up less than 1000
let counter = 0

// Thread A and Thread B each run this 500 times:
counter = counter + 1   // NOT atomic! Read → increment → write = 3 steps

// After 1000 total increments across 2 threads, counter might be 847, not 1000

A critical section is the code region that accesses shared resources. Only one thread may execute the critical section at a time, which is enforced by synchronization primitives.

Race Condition Timeline: Lost Update
[Timeline: Thread A reads 0; Thread B reads 0; Thread A writes 1; Thread B writes 1. Counter goes 0 → 1 where 2 was expected. Interleaved reads/writes cause lost updates without synchronization.]

12.3 Synchronization Primitives

Primitive | What It Does | Use Case
Mutex (Lock) | Only one thread holds the lock at a time | Protecting critical sections
Read-Write Lock | Many readers OR one writer | Read-heavy caches, config stores
Semaphore | Limits concurrent access to N threads | Connection pool (max 10 DB connections)
Atomic Operations | Hardware-guaranteed indivisible read-modify-write | Counters, flags without full locks
Condition Variable | Thread waits until a condition is signaled | Producer-consumer queues, thread pools
// Mutex-protected counter: thread safe
mutex = new Mutex()
counter = 0

function increment():
  mutex.lock()
  try:
    counter = counter + 1    // critical section: only one thread here
  finally:
    mutex.unlock()

// Semaphore: limit concurrent DB connections to 10
dbSemaphore = new Semaphore(10)

function queryDatabase(sql):
  dbSemaphore.acquire()
  try:
    return db.execute(sql)
  finally:
    dbSemaphore.release()

12.4 Deadlock: When Threads Wait Forever

A deadlock occurs when two or more threads are blocked forever, each waiting for a resource held by another. All four Coffman conditions must be true simultaneously:

  1. Mutual exclusion: resource held by only one thread at a time
  2. Hold and wait: thread holds one resource while waiting for another
  3. No preemption: resources can't be forcibly taken away
  4. Circular wait: A waits for B, B waits for A
Classic Deadlock: Two Threads, Two Locks
[Diagram: Thread A holds Lock 1 and waits for Lock 2; Thread B holds Lock 2 and waits for Lock 1. DEADLOCK.]

Deadlock Prevention Strategies

  • Lock ordering: always acquire Lock 1 before Lock 2, which breaks circular wait
  • Lock timeout: tryLock(timeout), then abort and retry instead of waiting forever
  • Minimize lock scope: hold locks for the shortest time possible
  • Avoid nested locks: redesign to need only one lock
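The lock ordering strategy from the list above can be sketched with a money transfer that needs two locks. The `Account` class and the rule "lock the lower account id first" are assumptions for the example; any fixed global order works.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int) -> None:
        self.id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    """Move money between two distinct accounts, always locking in ascending id order."""
    first, second = sorted((src, dst), key=lambda account: account.id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_transfers(src: Account, dst: Account, times: int) -> None:
    for _ in range(times):
        transfer(src, dst, 1)


a = Account(1, 1000)
b = Account(2, 1000)

# Opposite-direction transfers would risk deadlock if each thread locked "src first".
workers = [
    threading.Thread(target=run_transfers, args=(a, b, 500)),
    threading.Thread(target=run_transfers, args=(b, a, 500)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(a.balance, b.balance)
```

Because both threads acquire the lock of account 1 before account 2, the circular wait condition can never form, regardless of transfer direction.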

12.5 Thread-Safe Design Patterns

Immutability

Objects that cannot change after creation are inherently thread-safe. No locks needed. Java String, Python tuple, event objects in event sourcing.

Thread-Local Storage

Each thread gets its own copy of a variable. No sharing = no races. Request context, DB transaction handles per thread.

Thread Pool

Fixed set of worker threads processing a task queue. Avoids thread creation overhead. Tomcat, Node.js worker threads, Java ExecutorService.

Concurrent Collections

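java.util.concurrent collections, Python's queue.Queue, Go channels. Prefer built-in thread-safe data structures to rolling your own locks. (Python's asyncio queues are designed for coroutines on a single event loop and are not thread-safe.)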

// Producer-Consumer with blocking queue (thread-safe by design)
queue = new BlockingQueue<Task>(capacity=100)

// Producer thread
function producer():
  while running:
    task = generateTask()
    queue.put(task)          // blocks if queue full (backpressure)

// Consumer threads (pool of N workers)
function consumer():
  while running:
    task = queue.take()      // blocks if queue empty
    process(task)

12.6 Concurrency in LLD & System Design

In LLD interviews, concurrency appears in parking lots (multiple entry/exit gates), elevators (multiple requests), ticket booking (seat reservation races), and rate limiters. Key questions to address:

  • What shared state exists? Who reads/writes it?
  • What happens if two users book the same seat simultaneously?
  • Can you use optimistic locking (version numbers) vs pessimistic locking (mutex)?
  • Should you use database transactions (SELECT FOR UPDATE) instead of in-memory locks?

Optimistic vs Pessimistic locking: Pessimistic = lock the row before reading (safe, lower concurrency). Optimistic = read freely, check version on write, retry if conflict (higher concurrency, good when conflicts are rare). E-commerce inventory with low contention → optimistic. Bank transfers → pessimistic.
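A minimal sketch of optimistic locking with a version number, assuming an in-memory stand-in for a database row. `VersionedStore` and `reserve_one` are hypothetical names; in a real database the check would be a conditional update such as `UPDATE ... WHERE version = ?`, which the internal lock models here.

```python
import threading


class VersionedStore:
    """Stands in for a table row with a value column and a version column."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # models the database's atomic conditional UPDATE

    def read(self):
        with self._lock:
            return self._value, self._version

    def compare_and_set(self, new_value: int, expected_version: int) -> bool:
        with self._lock:
            if self._version != expected_version:
                return False  # someone else wrote first; caller must retry
            self._value = new_value
            self._version += 1
            return True


def reserve_one(store: VersionedStore) -> bool:
    """Try to take one unit of stock; retry on version conflicts."""
    while True:
        stock, version = store.read()
        if stock == 0:
            return False
        if store.compare_and_set(stock - 1, version):
            return True


store = VersionedStore(5)  # 5 units in stock
results = []

threads = [
    threading.Thread(target=lambda: results.append(reserve_one(store)))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), store.read())
```

Twenty buyers compete for five units; exactly five succeed, and the stock never goes negative even though no buyer holds a lock while deciding.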

Module 13

Reliability, Fault Tolerance & CAP Theorem

Production systems fail: disks crash, networks partition, deployments go wrong. Reliability is not about preventing all failures; it's about designing systems that continue serving users correctly despite them. This module covers availability math, fault tolerance patterns, the CAP theorem, consistency models, and replication strategies that underpin every distributed database decision.

13.1 Reliability, Availability & Durability

Three terms are often confused; each measures a different dimension of system trustworthiness:

Reliability

System performs correctly even when things go wrong. Fault tolerance + recoverability + absence of data corruption.

Availability

System is operational and responding to requests. Measured as uptime percentage (SLA).

Durability

Once data is acknowledged as written, it survives crashes. Replication + backups ensure no loss.

The Nines of Availability

Availability | Downtime / Year | Typical Use
99% (two nines) | 3.65 days | Internal tools, dev environments
99.9% (three nines) | 8.76 hours | Standard SaaS products
99.99% (four nines) | 52.6 minutes | Payment systems, e-commerce
99.999% (five nines) | 5.26 minutes | Telecom, critical infrastructure

Key insight: As a rule of thumb, each additional nine costs roughly 10× more in engineering and infrastructure. Don't over-engineer; match availability targets to business requirements.
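The downtime figures in the table follow from simple arithmetic, sketched below along with the standard formulas for components in series and for redundant replicas. The function names are illustrative, and the replica formula assumes independent failures, which real systems only approximate.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def serial_availability(*components: float) -> float:
    """All components must be up: availabilities multiply."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(single: float, replicas: int) -> float:
    """Any one of N independent replicas can serve: all must fail for an outage."""
    return 1 - (1 - single) ** replicas


for target in (99.0, 99.9, 99.99, 99.999):
    print(target, round(downtime_hours_per_year(target), 3), "hours/year")

print(serial_availability(0.999, 0.999))  # two 99.9% services chained together
print(redundant_availability(0.99, 2))    # two 99% replicas behind a load balancer
```

Chaining services lowers availability, while adding replicas raises it, which is the arithmetic behind removing single points of failure.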

13.2 Fault Tolerance & Eliminating SPOFs

A Single Point of Failure (SPOF) is any component whose failure takes down the entire system. Fault tolerance means redundancy at every critical layer: no single server, rack, or data center is indispensable.

Eliminating Single Points of Failure
[Diagram: With SPOF: single load balancer → single API server → single database; any failure = total outage. Fault tolerant: LB 1 + LB 2 → API 1, API 2, API 3 → DB Primary + Replica; N+1 redundancy at every layer. Also eliminate SPOFs in DNS, power, network switches, and deployment pipelines. Use multi-AZ and multi-region deployment for disaster recovery.]

Failover Strategies

  • Active-Passive: standby replica takes over on primary failure (simpler, but standby capacity sits idle and failover takes time)
  • Active-Active: all nodes serve traffic simultaneously (higher utilization, conflict resolution needed)
  • Health-check driven: load balancer detects failure and reroutes within seconds

13.3 The CAP Theorem: The Fundamental Trade-off

Eric Brewer's CAP theorem states that a distributed data store cannot guarantee all three of the following at once; when a network partition occurs, it must give up either consistency or availability:

C: Consistency

Every read receives the most recent write or an error

A: Availability

Every request receives a non-error response (may be stale)

P: Partition Tolerance

System continues despite network failures between nodes

CAP Theorem: Pick Two During a Partition
[Diagram: triangle of Consistency, Availability, Partition Tolerance. CP: MongoDB, HBase, Redis. AP: Cassandra, DynamoDB, CouchDB. CA: only possible on a single node (no partition).]

Critical nuance: Partitions WILL happen in distributed systems, so P is non-negotiable. The real choice is CP vs AP during a partition: reject requests to stay consistent (CP), or serve potentially stale data to stay available (AP).

13.4 PACELC: CAP Extended

Daniel Abadi's PACELC theorem extends CAP: If there is a Partition (P), choose Availability (A) or Consistency (C). Else (E), choose Latency (L) or Consistency (C). This captures the trade-off even when the network is healthy: strong consistency often requires coordination that adds latency.

System | During Partition | Normal Operation
DynamoDB / Cassandra | AP: stay available | EL: low latency, eventual consistency
MongoDB / HBase | CP: reject writes | EC: consistent but higher latency
PostgreSQL (single node) | CA (no partition possible) | EC: strong consistency

13.5 Consistency Models

Consistency exists on a spectrum, not just "strong" or "eventual." Choose based on what your users can tolerate.

Strong Consistency

After a write completes, all reads see the new value. Required for bank balances, inventory counts. Implemented via consensus (Paxos, Raft) or single-leader replication with synchronous replicas or leader-only reads.

Eventual Consistency

Given no new writes, all replicas converge to the same value eventually. Acceptable for social media likes, view counts, DNS. This is the default in Cassandra and DynamoDB.

Causal Consistency

Middle ground: if event A caused event B, everyone sees A before B. Good for chat messages, comment threads.

Read-Your-Writes Consistency

A user always sees their own updates. Critical for profile edits, settings changes. Route user reads to the node that handled their write.

13.6 Replication Strategies

Leader-Follower (Primary-Replica) Replication
[Diagram: the Leader (Primary) receives all writes and asynchronously replicates to Follower 1 (read replica), Follower 2, and Follower 3.]
Sync replication = strong consistency, higher latency | Async = faster writes, replication lag

Strategy | Description | Example
Leader-Follower | One leader handles writes; followers replicate and serve reads | PostgreSQL, MySQL replication
Multi-Leader | Multiple nodes accept writes; conflict resolution needed | Multi-region CouchDB, offline-first apps
Leaderless | Any node accepts reads/writes; quorum-based consistency | Cassandra, DynamoDB (quorum reads/writes)
// Quorum consistency (leaderless: Cassandra/DynamoDB)
// N = number of replicas, W = write quorum, R = read quorum
// Strong consistency when W + R > N

N = 3 replicas
W = 2  (write must succeed on 2 of 3 nodes)
R = 2  (read from 2 of 3 nodes)
W + R = 4 > N = 3  →  guaranteed to overlap → consistent read
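The overlap claim can be checked by brute force. This sketch enumerates every possible write set and read set for small N and confirms that they always share a node exactly when W + R > N. The function name is hypothetical.

```python
from itertools import combinations


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-node write set shares at least one node with every R-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(always_overlap(3, 2, 2))  # the N=3, W=2, R=2 configuration
print(always_overlap(3, 1, 1))  # W + R <= N: a read can miss the latest write
```

The shared node is what lets a reader see the newest version, provided replicas carry version numbers or timestamps to pick the latest copy.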

13.7 Designing for Failure: Interview Framework

In system design interviews, always address failure explicitly. Walk through this checklist:

  1. Identify SPOFs: What single component failure kills the system?
  2. Add redundancy: N+1 at every layer (servers, DB replicas, AZs)
  3. Choose CAP position: CP for banking, AP for social feeds; justify it
  4. Define failover: automatic vs manual, RTO (recovery time) and RPO (data loss window)
  5. Plan degradation: what features can be disabled under stress? (circuit breakers, graceful degradation)
  6. Backup & restore: how do you recover from catastrophic failure?
Module 14

Security in System Design

Security is not a feature you bolt on at the end; it is an architectural property woven through every layer. A system design interview that ignores authentication, encryption, and threat modeling will be incomplete. This module teaches you to design systems that protect confidentiality, integrity, and availability against real-world attacks.

14.1 The CIA Triad & Defense in Depth

All security goals reduce to three pillars. Every control you design maps to at least one:

Confidentiality

Only authorized parties access data. Encryption, access controls, least privilege.

Integrity

Data is not altered without authorization. Hashing, digital signatures, audit logs.

Availability

System remains accessible to authorized users. DDoS protection, redundancy, rate limiting.

Defense in depth layers multiple security controls so no single failure compromises the system, like a castle with moat, walls, guards, and a vault.

Defense in Depth: Security Layers
[Diagram: security layers]
  • Perimeter: WAF, DDoS protection, CDN
  • Network: Firewalls, VPC, TLS, mTLS
  • Application: AuthN, AuthZ, input validation
  • Data: Encryption at rest, field-level encryption
  • Secrets & Keys: Vault, KMS, rotation
Each layer protects independently: a breach of one ≠ total compromise

14.2 Authentication vs Authorization

These are distinct concerns; conflating them is a common design mistake.

Authentication (AuthN)

"Who are you?"

  • Password + bcrypt/argon2 hashing
  • Multi-factor authentication (MFA)
  • OAuth 2.0 / OpenID Connect (SSO)
  • API keys, mTLS for service-to-service

Authorization (AuthZ)

"What can you do?"

  • Role-Based Access Control (RBAC)
  • Attribute-Based Access Control (ABAC)
  • JWT claims / OAuth scopes
  • Policy engines (OPA, AWS IAM)

OAuth 2.0 Authorization Code Flow
[Diagram: participants User, Client App, Auth Server, Resource API. 1. Login redirect. 2. Auth code. 3. Exchange code → access token. 4. API call with Bearer token.]
// JWT structure (Header.Payload.Signature)
// Payload contains claims. It is only base64url-encoded, not encrypted: never store secrets in a JWT!
{
  "sub": "user-123",
  "role": "admin",
  "exp": 1620000000,
  "iat": 1619996400
}

// API request
GET /api/v1/users
Authorization: Bearer eyJhbGciOiJIUzI1NiIs...

// Server validates: signature valid? not expired? role permits action?
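The server-side validation steps (signature valid? not expired?) can be sketched with the standard library alone. This is a minimal HS256 illustration under stated assumptions: the helper names are hypothetical, the secret is a placeholder, and a production service would normally use a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(payload).encode())
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def verify_jwt(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    header_b64, payload_b64, signature_b64 = token.split(".")
    # Pin the algorithm: never let the token's own header choose it.
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(b64url_decode(payload_b64))
    if payload["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload


token = sign_jwt({"sub": "user-123", "role": "admin", "exp": int(time.time()) + 3600}, b"demo-secret")
print(verify_jwt(token, b"demo-secret")["role"])
```

The role check ("does this role permit the action?") happens after verification, using the returned claims.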

14.3 Encryption: In Transit & At Rest

Encryption in transit: TLS 1.3 for all client-server and service-to-service communication. HTTPS is non-negotiable. Internal microservices should use mTLS (mutual TLS) so services authenticate each other.

Encryption at rest: Database-level encryption (AES-256), disk encryption, and field-level encryption for PII (SSN, credit cards). Use a Key Management Service (AWS KMS, HashiCorp Vault); never hardcode keys.

Technique | Purpose | Example
TLS / HTTPS | Encrypt data in transit | All public APIs, web traffic
AES-256 at rest | Encrypt stored data | Database disk encryption, S3 SSE
bcrypt / argon2 | One-way hash passwords | Never store plaintext passwords
HMAC / SHA-256 | Integrity verification | Webhook signatures, JWT signing
Tokenization | Replace sensitive data with tokens | PCI-DSS credit card handling
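As one example from the table, webhook signature verification with HMAC-SHA256 might look like the sketch below. The function names, the secret, and the hex encoding of the signature are assumptions; real providers each define their own header name and encoding.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, received_signature: str) -> bool:
    expected = sign_body(secret, body)
    # compare_digest runs in constant time to resist timing attacks.
    return hmac.compare_digest(expected, received_signature)


secret = b"shared-webhook-secret"  # placeholder; load from a secret store in practice
body = b'{"event": "payment.succeeded"}'
print(verify_webhook(secret, body, sign_body(secret, body)))
```

Verification must run over the raw bytes received, before any JSON parsing or re-serialization changes them.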

14.4 Common Attack Vectors & Defenses

SQL Injection

Attacker injects SQL via input fields. Defense: parameterized queries / prepared statements; never concatenate user input into SQL.

Cross-Site Scripting (XSS)

Malicious scripts injected into web pages. Defense: output encoding, Content-Security-Policy headers, sanitize user HTML.

CSRF (Cross-Site Request Forgery)

Tricks authenticated user into unwanted actions. Defense: CSRF tokens, SameSite cookies, verify Origin header.

DDoS (Distributed Denial of Service)

Overwhelms servers with traffic. Defense: CDN absorption, rate limiting, WAF, auto-scaling, anycast routing.

Broken Access Control

User accesses another user's data (IDOR). Defense: authorize every request server-side, never trust client-sent user IDs alone.

// WRONG: vulnerable to SQL injection
query = "SELECT * FROM users WHERE email = '" + userInput + "'"

// RIGHT: parameterized query
query = "SELECT * FROM users WHERE email = ?"
db.execute(query, [userInput])

// Rate limiting at API gateway
// 100 req/min per IP → 429 Too Many Requests
// Mitigates brute force and abusive clients; volumetric DDoS also needs upstream absorption (CDN, WAF)
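The difference between the two queries can be demonstrated end to end with an in-memory SQLite database. The table contents and the payload string are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a@example.com", "Ada"), ("b@example.com", "Bob")],
)

user_input = "nobody@example.com' OR '1'='1"  # classic injection payload

# Vulnerable: the payload becomes part of the SQL text and matches every row.
unsafe_sql = "SELECT name FROM users WHERE email = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the driver binds the value separately, so it is compared as a literal string.
safe_rows = conn.execute("SELECT name FROM users WHERE email = ?", (user_input,)).fetchall()

print(len(unsafe_rows), len(safe_rows))
```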

14.5 Secrets Management & Least Privilege

Never commit secrets to git. Use dedicated secret stores with automatic rotation:

  • HashiCorp Vault: dynamic secrets, encryption as a service
  • AWS Secrets Manager / GCP Secret Manager: cloud-native rotation
  • Environment injection: secrets injected at runtime, not baked into images

Principle of least privilege: every service account, API key, and IAM role gets only the minimum permissions required. A compromised read-only analytics service shouldn't be able to delete production databases.

14.6 Zero Trust & Security in Interviews

Zero Trust assumes no user or service is trusted by default, even inside the corporate network. Every request is authenticated, authorized, and encrypted. Micro-segmentation limits blast radius if one service is compromised.

Security Checklist for System Design Interviews

  1. Authentication: How do users/services prove identity? (OAuth, JWT, API keys)
  2. Authorization: Who can access what? (RBAC, resource-level checks)
  3. Encryption: TLS in transit, encryption at rest for PII
  4. Input validation: Sanitize all external input at API boundary
  5. Rate limiting: Prevent abuse and brute force
  6. Audit logging: Who did what, when; immutable logs for forensics
  7. Compliance: GDPR, HIPAA, PCI-DSS if applicable

Pro tip: Mentioning security proactively, even briefly, distinguishes senior candidates. "I'll place an API gateway with TLS termination, JWT validation, and rate limiting before traffic hits services" shows production thinking.
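Gateway rate limiting can be sketched as a fixed-window counter per client IP. This is an illustrative, single-process version with a hypothetical class name. A real gateway would typically keep the counters in a shared store and expire old windows.

```python
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit: int = 100, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        # (client_ip, window index) -> request count; old windows are never
        # evicted in this sketch, which a real implementation must handle.
        self.counts = defaultdict(int)

    def check(self, client_ip: str, now: float) -> int:
        """Return the HTTP status the gateway would send: 200 if allowed, 429 if over the limit."""
        key = (client_ip, int(now // self.window))
        self.counts[key] += 1
        return 200 if self.counts[key] <= self.limit else 429


limiter = FixedWindowLimiter(limit=3, window_seconds=60)
print([limiter.check("203.0.113.7", now=10.0) for _ in range(4)])
```

Fixed windows allow a burst of up to twice the limit across a window boundary. Sliding-window or token-bucket variants smooth that out at the cost of more state.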

Module 15

Observability & Monitoring

You built it, you deployed it, it's serving traffic, but is it healthy? Observability is the discipline of understanding a system's internal state from its external outputs. Without metrics, logs, and traces, debugging a production incident at 3 AM is guesswork. This module teaches you to instrument systems like a senior SRE.

15.1 Monitoring vs Observability

Monitoring tells you when something is wrong: predefined dashboards and alerts fire when thresholds breach. Observability lets you ask why, exploring arbitrary questions about system behavior you didn't anticipate when writing alerts.

Monitoring

"Is CPU above 80%?" Known unknowns. Dashboards, alerts, uptime checks. Reactive: you defined what to watch in advance.

Observability

"Why did checkout latency spike for users in EU only?" Unknown unknowns. Ad-hoc queries across metrics, logs, and traces.

Analogy (Ship Navigation): Monitoring is the dashboard warning light ("engine temperature high"). Observability is the ability to inspect any part of the engine, review the captain's log, and trace the ship's route to diagnose why temperature rose.

15.2 The Three Pillars of Observability

Metrics, Logs, and Traces: The Three Pillars
Metrics: numeric time-series data (CPU, memory, QPS, latency); aggregatable and cheap to store. Tools: Prometheus, Datadog, Grafana, CloudWatch. Best for: dashboards, alerts.
Logs: discrete event records; timestamped text/JSON; rich context per event. Tools: ELK Stack, Loki, Splunk, CloudWatch Logs. Best for: debugging, audit.
Traces: request journey across services; span hierarchy with timing; distributed context. Tools: Jaeger, Zipkin, OpenTelemetry, X-Ray. Best for: latency debugging.

The pillars are complementary: metrics tell you something is wrong, logs tell you what happened, traces tell you where in the call chain it happened. Correlating all three (via trace IDs in logs) is the gold standard.

15.3 Golden Signals & RED/USE Methods

Google's SRE team defines four Golden Signals every user-facing service should monitor:

Latency: time to serve a request (distinguish success vs error latency)
Traffic: demand on the system (requests/sec, connections)
Errors: rate of failed requests (5xx, timeouts, exceptions)
Saturation: how "full" the system is (CPU, memory, queue depth)

RED Method (for services)

Rate (requests/sec) · Error rate · Duration (latency distribution)

USE Method (for resources)

Utilization · Saturation · Errors, applied per resource: CPU, memory, disk, network.

# Example Prometheus metrics (RED)
http_requests_total{method="GET", status="200", endpoint="/api/urls"} 45230
http_requests_total{method="GET", status="500", endpoint="/api/urls"} 12
http_request_duration_seconds{quantile="0.99", endpoint="/api/urls"} 0.087

# Alert rule (Prometheus 2.x rule-file YAML)
# sum() aggregates across label sets so the ratio is errors / all requests
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
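To show what the RED numbers mean, the sketch below derives rate, error ratio, and p99 duration from raw request records. It is illustrative only: the function name and input shape are assumptions, and a metrics system would compute these from counters and histograms rather than raw samples.

```python
def red_summary(requests: list[tuple[int, float]], window_seconds: float) -> dict:
    """Summarize a non-empty list of (status_code, duration_seconds) pairs seen in one window."""
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank p99: the smallest sample with at least 99% of samples at or below it.
    rank = (99 * len(durations) + 99) // 100  # integer ceil(0.99 * n)
    return {
        "rate_per_sec": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_seconds": durations[rank - 1],
    }


sample = [(200, 0.010)] * 98 + [(500, 0.500), (200, 0.020)]
print(red_summary(sample, window_seconds=10))
```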

15.4 SLIs, SLOs, SLAs & Error Budgets

Reliability targets must be measurable and agreed upon. This hierarchy connects engineering to business:

SLI (Service Level Indicator)

A measured metric. "Percentage of successful HTTP requests" or "p99 redirect latency."

SLO (Service Level Objective)

Internal target. "99.9% of requests succeed" or "p99 latency < 100ms."

SLA (Service Level Agreement)

Contractual commitment to customers with financial penalties. The SLO should be stricter than the SLA to provide a buffer.

Error budget: If SLO is 99.9% monthly, you have 0.1% budget for failures (~43 minutes/month). When budget is exhausted, freeze feature launches and focus on reliability. This aligns product velocity with stability.
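The error-budget arithmetic is easy to reproduce. The sketch below assumes a 30-day month, and both function names are hypothetical.

```python
def error_budget_minutes(slo_pct: float, days: int = 30) -> float:
    """Minutes of allowed unavailability in the period for a time-based SLO."""
    return days * 24 * 60 * (1 - slo_pct / 100)


def budget_remaining(slo_pct: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative once it is blown)."""
    allowed_failures = total_requests * (1 - slo_pct / 100)
    return 1 - failed_requests / allowed_failures


print(f"{error_budget_minutes(99.9):.1f} minutes per 30 days")
print(f"{budget_remaining(99.9, 1_000_000, 250):.0%} of the budget left")
```

A remaining fraction at or below zero is the signal to pause feature launches and spend the time on reliability.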

15.5 Distributed Tracing

In microservices, a single user request traverses dozens of services. Distributed tracing assigns a unique trace_id at the edge and propagates it through every service call, creating a waterfall of spans.

Distributed Trace: Checkout Request Waterfall
[Diagram: waterfall on a 0-250 ms timeline. Spans: API GW; Order Svc (180 ms); Payment (120 ms); Inventory (40 ms); DB query (110 ms, the bottleneck). Trace ID propagated via headers: traceparent, X-Request-ID.]

OpenTelemetry is the vendor-neutral standard for generating traces, metrics, and logs. Instrument once, export to Jaeger, Datadog, or Honeycomb.
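Trace propagation can be sketched by hand using the W3C `traceparent` header layout (`version-traceid-spanid-flags`). In practice an OpenTelemetry SDK does this for you. The function names here are hypothetical and the code skips the validation a real implementation performs.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace at the edge: fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


edge = new_traceparent()
downstream = child_traceparent(edge)
print(edge)
print(downstream)
```

Because every hop keeps the trace ID and only changes the span ID, the backend can stitch the spans into one waterfall.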

15.6 Alerting & On-Call Best Practices

  • Alert on symptoms, not causes: "Users experiencing high error rate" not "CPU is 82%"
  • Every alert must be actionable: if no one can do anything, it's a dashboard metric, not an alert
  • Reduce noise: group related alerts, use alert routing (PagerDuty, Opsgenie)
  • Severity levels: P1 (wake someone up) vs P3 (fix next business day)
  • Runbooks: every P1/P2 alert links to step-by-step remediation docs

// Structured logging: JSON for machine parsing
{
  "timestamp": "2024-06-15T10:23:45.123Z",
  "level": "ERROR",
  "service": "url-shortener-api",
  "trace_id": "abc123def456",
  "message": "Database connection timeout",
  "user_id": "user-789",
  "duration_ms": 5023,
  "endpoint": "POST /api/v1/urls"
}
// trace_id links this log entry to the distributed trace in Jaeger
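A log line of this shape can be produced with the standard `logging` module and a small custom formatter. This is a sketch under assumptions: the `JsonFormatter` class is hypothetical, the service name is hardcoded for brevity, and the timestamp uses local time at second resolution.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "url-shortener-api",
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("trace_id", "user_id", "duration_ms", "endpoint"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("url-shortener-api")
logger.addHandler(handler)

logger.error(
    "Database connection timeout",
    extra={"trace_id": "abc123def456", "duration_ms": 5023, "endpoint": "POST /api/v1/urls"},
)
```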

15.7 Observability in System Design Interviews

Mentioning observability proactively signals production experience. Cover these in every design:

  1. What to measure: Golden signals / RED metrics for each service
  2. SLOs: "99.9% redirect success, p99 < 100ms"
  3. Health checks: /health endpoint for load balancer
  4. Tracing: propagate trace ID across services for debugging
  5. Alerting: alert on error rate and latency SLO breaches
  6. Dashboards: per-service Grafana boards for on-call engineers
Module 16 · Case Study

Design a URL Shortener (bit.ly / TinyURL)

The canonical system design interview question. We'll walk through the complete 7-step framework (requirements, scale, API, HLD, deep dives, trade-offs, and LLD), producing an interview-ready design.

16.1 Step 1 โ€” Clarify Requirements

Functional Requirements

  • Given a long URL → return a short unique URL
  • Given short URL → 301 redirect to original
  • Optional custom alias (e.g., /my-link)
  • Optional expiration date
  • Analytics: click count per URL (Should Have)

Non-Functional Requirements

  • 100M new URLs/month, 5-year retention
  • Read:Write ratio = 100:1
  • Redirect latency p99 < 100ms
  • 99.99% availability for redirects
  • Short codes: as short as possible (base62)

16.2 Step 2 โ€” Back-of-the-Envelope Estimation

Writes/sec  = 100M / (30 × 24 × 3600)  ≈ 40/sec
Reads/sec   = 40 × 100                  ≈ 4,000/sec
Storage     = 100M × 12 × 5 × 500 bytes ≈ 3 TB (raw)
QPS peak    = 4,000 × 3 (burst factor) ≈ 12,000 reads/sec peak

→ Reads dominate → Redis cache + CDN essential
→ Writes modest → single DB shard OK initially
→ 3 TB → plan sharding at ~10 TB
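The same estimate as a few lines of arithmetic, using the inputs stated in the requirements. The variable names are illustrative, and the unrounded results come out slightly below the rounded interview figures.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

new_urls_per_month = 100_000_000
read_write_ratio = 100
retention_years = 5
bytes_per_record = 500
burst_factor = 3

writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH
reads_per_sec = writes_per_sec * read_write_ratio
storage_bytes = new_urls_per_month * 12 * retention_years * bytes_per_record
peak_reads_per_sec = reads_per_sec * burst_factor

print(f"writes/sec ~ {writes_per_sec:.1f}")
print(f"reads/sec  ~ {reads_per_sec:.0f}")
print(f"storage    = {storage_bytes / 1e12:.1f} TB")
print(f"peak reads ~ {peak_reads_per_sec:.0f}/sec")
```

Rounding up to 40, 4,000, and 12,000 is the usual interview convention: it keeps the mental math easy and errs on the side of more capacity.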

16.3 Step 3 โ€” API Design

POST   /api/v1/urls          → { long_url, custom_alias?, ttl? }  → 201 { short_url }
GET    /api/v1/urls/{code}   → metadata + click stats              → 200
DELETE /api/v1/urls/{code}   → deactivate                          → 204
GET    /{code}               → 301 redirect (separate from /api/)

16.4 Step 4 โ€” High-Level Architecture

URL Shortener: Complete HLD
[Diagram: Client, CDN Edge, Load Balancer, two API Servers, Redis, PostgreSQL (Primary + Replica), Kafka → Analytics Service.]
Read path: CDN → Redis → DB replica | Write path: API → DB primary → cache invalidate

16.5 Step 5 โ€” Deep Dive: Short Code Generation

Two approaches; discuss both in interviews:

Counter + Base62

DB auto-increment ID → encode to base62 (a-z, A-Z, 0-9). 7 chars = 62⁷ ≈ 3.5 trillion URLs. Simple, no collisions. Requires centralized counter (single DB or range allocation per server).

MD5/Hash + Collision Retry

Hash long URL, take first 7 chars. Collision risk: retry with salt. No centralized counter but collisions increase with scale.

# Base62 encode (Python)
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def encode(num):
    result = ""
    while num > 0:
        result = CHARS[num % 62] + result
        num = num // 62
    # 'a' is the zero digit, so left-padding with 'a' does not change the value
    return result.rjust(7, "a")
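A matching decoder makes the scheme easy to sanity-check with a round trip. The sketch below repeats the encoder so it is self-contained. The `decode` function and the `INDEX` lookup table are additions for illustration.

```python
CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
INDEX = {ch: i for i, ch in enumerate(CHARS)}  # character -> digit value


def encode(num: int, width: int = 7) -> str:
    result = ""
    while num > 0:
        result = CHARS[num % 62] + result
        num //= 62
    return result.rjust(width, "a")  # 'a' is the zero digit


def decode(code: str) -> int:
    num = 0
    for ch in code:
        num = num * 62 + INDEX[ch]
    return num


code = encode(123456789)
print(code, decode(code))
```

Sequential IDs produce guessable codes. If that matters, one common mitigation is to permute the ID before encoding, which is outside the scope of this sketch.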

16.6 Steps 6-7: Trade-offs & LLD Touchpoints

  • 301 vs 302 redirect: 301 is permanent and cached by browsers (CDN-friendly and fewer origin hits, but repeat clicks may bypass analytics); 302 is temporary, so every click reaches the server and the destination can change
  • Cache-aside for reads; write-invalidate on URL update
  • Async analytics via Kafka: don't block the redirect for click logging
  • LLD classes: UrlService, UrlRepository, Base62Encoder, CacheClient
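Cache-aside reads with write-invalidate can be sketched in a few lines. This is an illustrative stand-in, not the course's LLD: plain dictionaries play the roles of Redis and PostgreSQL, and the method names are assumptions.

```python
from typing import Optional


class UrlService:
    def __init__(self, db: dict):
        self.db = db       # stand-in for the database
        self.cache = {}    # stand-in for Redis
        self.db_reads = 0  # counts cache misses that hit the database

    def resolve(self, code: str) -> Optional[str]:
        if code in self.cache:          # cache hit: no database access
            return self.cache[code]
        self.db_reads += 1              # cache miss: read through to the database
        long_url = self.db.get(code)
        if long_url is not None:
            self.cache[code] = long_url  # populate the cache for later reads
        return long_url

    def update(self, code: str, long_url: str) -> None:
        self.db[code] = long_url
        self.cache.pop(code, None)      # write-invalidate: next read reloads from the database


service = UrlService({"abc1234": "https://example.com/old"})
service.resolve("abc1234")
service.resolve("abc1234")
print(service.db_reads)
```

Invalidating rather than updating the cache on write keeps the write path simple and avoids caching values nobody reads.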
Module 17 · Case Study

Design a Chat System (WhatsApp / Messenger)

Real-time messaging at scale requires WebSockets, presence tracking, message ordering, and offline delivery. This case study covers one-on-one and group chat for 500M DAU.

17.1 Requirements & Scale

  • FRs: 1:1 chat, group chat (max 500), media sharing, read receipts, online presence
  • NFRs: 500M DAU, message delivery < 500ms, 99.9% availability, offline message sync
  • Scale: 500M DAU × 40 msgs/day = 20B msgs/day ≈ 230K msgs/sec

17.2 Architecture โ€” WebSockets & Chat Servers

Chat System HLD
[Diagram: User A, User B, Load Balancer, Chat Server 1, Chat Server 2, Redis (presence), Cassandra, Kafka, Media (S3), Push Svc.]
WebSocket persistent connections · Redis maps user_id → chat_server for routing

Key design: Clients maintain persistent WebSocket connections to chat servers. Redis stores user_id → server_id mapping so messages route to the correct server. Cassandra stores message history (write-heavy, time-series friendly).

17.3 Message Flow & Offline Delivery

  1. User A sends message → Chat Server 1 → persist to Cassandra → publish to Kafka
  2. Lookup User B in Redis → connected to Chat Server 2 → push via WebSocket
  3. If User B is offline → store in inbox table → send push notification via APNs/FCM; deliver the stored messages when B reconnects
  4. Group chat: fan-out message to all member connections (or fan-out on read for large groups)
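The routing and offline steps above can be sketched with in-memory stand-ins. The `ChatRouter` class is hypothetical: a dictionary replaces the Redis presence map, a list per user replaces the inbox table, and persistence, Kafka, and push notifications are left out.

```python
from collections import defaultdict


class ChatRouter:
    def __init__(self):
        self.presence = {}                  # user_id -> server_id (Redis stand-in)
        self.inbox = defaultdict(list)      # user_id -> messages awaiting delivery
        self.delivered = defaultdict(list)  # server_id -> (user_id, message) pushed over WebSocket

    def send(self, to_user: str, message: str) -> str:
        server = self.presence.get(to_user)
        if server is None:
            # Recipient offline: queue for later sync (a push notification would go out here).
            self.inbox[to_user].append(message)
            return "queued"
        self.delivered[server].append((to_user, message))
        return "delivered"

    def connect(self, user_id: str, server_id: str) -> list:
        """Register presence and flush any messages stored while the user was offline."""
        self.presence[user_id] = server_id
        pending = self.inbox.pop(user_id, [])
        for message in pending:
            self.delivered[server_id].append((user_id, message))
        return pending


router = ChatRouter()
router.connect("user-a", "chat-server-1")
print(router.send("user-b", "hello"))          # user-b is offline
print(router.connect("user-b", "chat-server-2"))
```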

17.4 Trade-offs

  • WebSocket vs long polling: WebSocket for real-time; fallback to long polling for restrictive networks
  • Small groups (such as the 500-member cap here): fan-out on write. Much larger groups or channels (1000+): fan-out on read
  • Message ordering: per-conversation sequence numbers; causal ordering for group chats
  • Sticky sessions required for WebSocket, or use Redis pub/sub between chat servers
Module 18 · Case Study

Design a News Feed (Twitter / Instagram)

The news feed is the core of every social platform. The central design decision: pre-compute feeds on write (fan-out on write) or assemble on read (fan-out on read). This module walks through both and when to hybridize.

18.1 Requirements & Scale

  • FRs: Post tweets/photos, follow users, view personalized home feed, like/comment
  • NFRs: 300M DAU, feed load < 500ms, 500 follows max per user (simplified)
  • Scale: 300M DAU, 200M posts/day, avg user follows 200 people, read:write ≈ 100:1

18.2 Fan-out on Write vs Fan-out on Read

Fan-out Strategies Compared
Fan-out on Write (Push): celebrity posts → push to all follower feed caches. ✓ Fast reads. ✗ Slow for users with millions of followers.
Fan-out on Read (Pull): user opens feed → query all followed users' posts. ✓ Fast writes. ✗ Slow reads for users following many accounts.
Hybrid (Twitter/Instagram approach): fan-out on write for normal users · fan-out on read for celebrities (>1M followers). Merge at read time from both the pre-computed cache and celebrity posts.

18.3 Feed Architecture

Post service writes to posts table. Fan-out worker pushes post IDs into each follower's Redis feed cache (sorted set by timestamp). Feed service reads top N post IDs from cache, hydrates full post content from posts store.

// Redis feed cache per user (sorted set, score = timestamp)
ZADD feed:user-123  1620000000  "post-456"
ZADD feed:user-123  1620003600  "post-789"

// Get home feed: top 20 most recent
ZREVRANGE feed:user-123 0 19

// Celebrity threshold: if follower_count > 1M → skip fan-out, pull on read
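The hybrid strategy can be sketched in memory: push post IDs to follower caches for ordinary authors, and pull celebrity posts at read time. Everything here is an illustrative assumption, including the `FeedService` name, the in-memory lists standing in for Redis sorted sets, and the configurable threshold (the text uses 1M followers).

```python
from collections import defaultdict


class FeedService:
    def __init__(self, celebrity_threshold: int = 1_000_000):
        self.threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> follower ids
        self.following = defaultdict(set)         # user -> authors they follow
        self.feed_cache = defaultdict(list)       # user -> [(timestamp, post_id)] pushed on write
        self.posts_by_author = defaultdict(list)  # author -> [(timestamp, post_id)]

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > self.threshold

    def post(self, author: str, post_id: str, timestamp: int) -> None:
        self.posts_by_author[author].append((timestamp, post_id))
        if not self.is_celebrity(author):
            # Fan-out on write: push into every follower's pre-computed feed.
            for follower in self.followers[author]:
                self.feed_cache[follower].append((timestamp, post_id))

    def home_feed(self, user: str, limit: int = 20) -> list:
        entries = set(self.feed_cache[user])  # a set avoids duplicates when merging
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity posts at request time.
                entries.update(self.posts_by_author[author])
        return [post_id for _, post_id in sorted(entries, reverse=True)[:limit]]
```

A quick usage example with a tiny threshold so the celebrity path is exercised:

```python
feeds = FeedService(celebrity_threshold=2)
for fan in ("u1", "u2", "u3"):
    feeds.follow(fan, "celeb")
feeds.follow("u1", "bob")
feeds.post("bob", "post-456", 1620000000)
feeds.post("celeb", "post-789", 1620003600)
print(feeds.home_feed("u1"))
```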

18.4 Key Trade-offs

  • Storage: pre-computed feeds use more storage (post ID × followers) but enable fast reads
  • Ranking: production feeds use ML ranking; start with chronological, mention ranking as an extension
  • Media: store images/videos in object storage served through a CDN (S3 + CloudFront); the feed cache stores only metadata + URLs
Module 19

Interview Framework & Best Practices

Knowledge alone doesn't pass interviews; execution does. This module synthesizes everything into a battle-tested framework for the 45-60 minute system design interview, with communication tactics, time management, and common pitfalls.

19.1 The 45-Minute Timeline

Phase | Time | Activity
Clarify | 5 min | Requirements, scale, constraints; ask 5-8 questions
Estimate | 5 min | Back-of-envelope: QPS, storage, bandwidth
API + HLD | 15 min | Draw architecture: boxes, arrows, data flows
Deep Dive | 15 min | Interviewer-directed: DB, cache, bottlenecks
Wrap-up | 5 min | Trade-offs, extensions, what you'd do with more time

19.2 Communication Tactics That Win

  • Think aloud: narrate your reasoning; silence makes interviewers nervous
  • State assumptions: "I'll assume 100M DAU unless you say otherwise"
  • Propose, don't dictate: "I'd lean toward Redis here. Does that align with your constraints?"
  • Name trade-offs: every decision gets a "because X, trade-off is Y, mitigated by Z"
  • Check in: "Should I go deeper on the database layer or move to caching?"
  • Draw while talking: diagrams on whiteboard/Excalidraw beat pure verbal descriptions

19.3 Common Mistakes to Avoid

❌ Jumping to microservices without justification
❌ Skipping requirements clarification
❌ Over-engineering (Kafka for 10 users/sec)
❌ Ignoring single points of failure
❌ No numbers: "a lot of users" without QPS
❌ Silent for 10 minutes drawing

19.4 What Interviewers Evaluate

Problem Solving

Structured approach, handles ambiguity

Technical Depth

Knows how components work, not just names

Trade-off Analysis

Articulates why A over B given constraints

Communication

Clear, collaborative, receptive to hints

Module 20

Capstone Review & Next Steps

You've completed all 20 modules. This capstone ties the full curriculum together with a mental map of everything you've learned, a self-assessment checklist, and a roadmap for continued mastery.

20.1 The Complete Mental Map

System Design Knowledge Map
[Diagram: Requirements & Scale; High-Level Design; Caching; Load Balancing; Message Queues; Low-Level Design; Reliability/CAP; Security; Observability; all leading to a Production-Ready System. Case Studies: URL Shortener · Chat · News Feed.]

20.2 Self-Assessment Checklist

Can you confidently explain each of these without notes?

☐ CAP theorem and when to choose CP vs AP
☐ Cache-aside pattern and invalidation
☐ Horizontal vs vertical scaling
☐ Leader-follower DB replication
☐ Consistent hashing for sharding
☐ OAuth 2.0 flow and JWT structure
☐ Fan-out on write vs read for feeds
☐ WebSocket chat architecture
☐ SLI / SLO / error budgets
☐ Strategy and Observer design patterns

20.3 Continued Learning Roadmap

  • Practice: Design 2 systems/week on Excalidraw: Uber, Netflix, Dropbox, Rate Limiter
  • Read: "Designing Data-Intensive Applications" by Martin Kleppmann (the bible)
  • Watch: System Design Interview channels with narrated mock interviews
  • Build: Implement a URL shortener or chat app; theory becomes intuition through code
  • Mock interviews: Pramp, interviewing.io, or peer practice with this course's frameworks

🏝️ Congratulations!

You've completed the System Design Masterclass: all 20 modules, 200+ quiz questions, and 3 full case studies. You have the foundation of a senior engineer. Now go build, practice, and ace those interviews.