Chapter 1 · System Design Interview

Scaling From One User To Millions

A walk-through of the architectural moves you make as load grows — from a single box to a multi-region distributed system. Each step solves a specific failure mode of the previous one.

Study Notes · 15 slides
02 / 15

The Scaling Journey

Scale is added one bottleneck at a time. You don't design for millions on day one — you respond to the next constraint as it appears.

1. Single Server: Web + DB on one box
2. Tier Separation: Split web & DB
3. Horizontal Web: LB + replicas
4. Caching Layer: Cache + CDN
5. Distributed: Multi-DC + sharding

Each step costs complexity

Every new layer adds operational surface area — failure modes, deploys, debugging.

Wait for the signal

Real metrics (latency, error rate, queue depth) tell you when to move, not predictions.

03 / 15
1

Single Server Setup

One machine hosts the web app, the database, and serves everything to the user directly. It works until it doesn't.

What lives here

App process, database, static assets, logs — all on a single VM or physical host.

Why start here

Fastest to ship, cheapest to run, easiest to debug. Most production systems should begin this way.

First failure mode

One process eats all the CPU or memory and takes the whole site down. No isolation between web traffic and DB load.

The DNS detail

User's browser resolves your domain to the server's IP via DNS, then hits it directly over HTTP.

Diagram: User Browser → (HTTP) → Single Server hosting the Web App and Database, plus static files, logs, and cron.
04 / 15
2

Choosing A Database: SQL vs NoSQL

The first real architectural decision. It's about data shape and access patterns, not "modern vs old."

Pick SQL (Postgres, MySQL) when…

You need transactions, joins, strong consistency, or your data is well-structured and relational. Default choice for most CRUD apps.
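As a rough illustration of what "you need transactions" means in practice, the sketch below moves money between two rows so that either both updates happen or neither does. It uses Python's built-in `sqlite3` as a stand-in for Postgres or MySQL; the `accounts` table and the `transfer` function are hypothetical names chosen for this example.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts atomically."""
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes the debit above.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
```

If the second update never runs, the first one is rolled back with it, which is the guarantee that is awkward to rebuild by hand on a store without multi-row transactions.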

Pick NoSQL (DynamoDB, Mongo, Cassandra) when…

Schema is flexible or evolving, writes are huge, you need horizontal scale out of the box, or data is denormalised (logs, events, time-series).

The honest answer

For the first million users, a managed Postgres almost always wins. NoSQL solves problems most apps never have.

Diagram, left: Relational (SQL). ACID, joins, schema. Postgres, MySQL, SQL Server. Strong consistency. Vertical scale by default.
Diagram, right: Non-relational (NoSQL). Flexible, sharded, fast. DynamoDB, MongoDB, Cassandra, Redis. Eventual consistency (often). Horizontal scale by design.
05 / 15
3

Vertical vs Horizontal Scaling

Make the box bigger, or add more boxes. The first is simpler; the second is the only one that keeps working.

Vertical (scale up)

Bigger CPU, more RAM, faster disk. Zero code changes. Caps out at whatever the cloud provider offers — and a single failure still kills you.

Horizontal (scale out)

Add more identical servers behind a load balancer. Requires the app to be stateless. Failure of one node ≠ outage. This is what every internet-scale system uses.

The rule

Start vertical because it needs no code changes. Switch to horizontal the moment availability matters more than simplicity.
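The availability argument for scaling out can be made concrete with a little arithmetic. The sketch below assumes nodes fail independently and that each one is up 99% of the time; both are illustrative assumptions, not figures from these notes.

```python
def availability(per_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent servers is up."""
    return 1.0 - (1.0 - per_node) ** nodes

for n in (1, 2, 3):
    # One node: about 99%. Two nodes: about 99.99%. Three: about 99.9999%.
    print(n, f"{availability(0.99, n):.6f}")
```

Real failures are often correlated (same rack, same deploy, same bug), so treat these numbers as an upper bound on what adding nodes buys you.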

Diagram: Vertical is one bigger server (one node, more power). Horizontal is many identical nodes.
06 / 15
4

Load Balancer

A traffic director that sits in front of your web tier. The public sees one IP; behind it, many servers share the work.

What it solves

Distributes incoming requests across healthy servers. Removes the single point of failure at the web tier.

Health checks

The LB pings each server; if a server stops responding, traffic is rerouted automatically. The user never knows.

Public vs private IPs

Only the LB has a public IP. Web servers sit on a private network and are unreachable from the internet — a security win.

Common algorithms

Round-robin, least-connections, IP-hash. Pick based on whether requests are uniform or sessions matter.
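The three algorithms and the health-check behaviour can be sketched in a few lines. This is a toy model, not how any particular load balancer is implemented; the `LoadBalancer` class, its method names, and the server names used with it are all hypothetical.

```python
import hashlib
import itertools

class LoadBalancer:
    """Toy balancer that only ever picks from the currently healthy servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)             # updated by health checks
        self.active = {s: 0 for s in self.servers}   # open connections per server
        self._rr = itertools.cycle(self.servers)

    def mark(self, server, is_up):
        """Record the result of a health check."""
        if is_up:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def _candidates(self):
        pool = [s for s in self.servers if s in self.healthy]
        if not pool:
            raise RuntimeError("no healthy backends")
        return pool

    def round_robin(self):
        self._candidates()  # fail fast if nothing is healthy
        while True:
            server = next(self._rr)
            if server in self.healthy:
                return server

    def least_connections(self):
        return min(self._candidates(), key=lambda s: self.active[s])

    def ip_hash(self, client_ip):
        pool = self._candidates()
        digest = hashlib.sha256(client_ip.encode()).digest()
        return pool[int.from_bytes(digest[:8], "big") % len(pool)]
```

Round-robin suits uniform requests, least-connections suits requests of uneven duration, and IP-hash keeps a client on the same server for as long as the healthy pool does not change. When the pool does change, many clients are remapped.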

Diagram: Users → Load Balancer → Web Server 1, Web Server 2, Web Server 3.
07 / 15
5

Database Replication

Once the web tier is redundant, the database becomes the single point of failure. Replication fixes that and scales reads.

Primary / Replica model

One primary handles writes. Replicas mirror its data and serve reads. Most workloads are read-heavy, so this scales nicely.

Failover

If the primary dies, a replica is promoted. With asynchronous replication, the last few writes that had not yet reached that replica can be lost; brief unavailability during promotion is normal.

Replication lag

Replicas lag the primary, usually by milliseconds but sometimes by seconds under load. Read-after-write workflows (e.g. "just posted, see your post") often need to read from the primary.

Reliability bonus

Replicas across availability zones survive datacenter-level outages.
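One common way to handle the read-after-write problem is to pin a user's reads to the primary for a short window after they write. The sketch below is a minimal version of that idea; the `ReplicatedRouter` class, the 5-second window, and the string node names are assumptions made for illustration.

```python
import itertools
import time

class ReplicatedRouter:
    """Sends writes to the primary and reads to replicas, with a pin window."""

    def __init__(self, primary, replicas, pin_seconds=5.0, clock=time.monotonic):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)
        self.pin_seconds = pin_seconds
        self._clock = clock
        self._last_write = {}  # user_id -> time of that user's latest write

    def for_write(self, user_id):
        self._last_write[user_id] = self._clock()
        return self.primary

    def for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < self.pin_seconds:
            # A replica may not have this user's write yet.
            return self.primary
        return next(self._replicas)
```

Other users still read from replicas, so the primary only absorbs the small share of reads that must be fresh.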

Diagram: Web Server sends writes to the Primary DB and reads to the replicas; the Primary DB replicates to Replica 1 and Replica 2.
08 / 15
6

Caching Tier

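Memory access is orders of magnitude faster than disk: roughly 100,000× versus a spinning-disk seek, and still around 1,000× versus an SSD read. A cache puts the hot data in front of the database so most reads never touch it.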

Read-through pattern

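App checks the cache first. On hit, return immediately. On miss, query the DB, write the result back to cache, return. When the application does the lookup and write-back itself, as here, the pattern is usually called cache-aside; in a strict read-through setup the cache layer loads from the DB on a miss.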
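The check, miss, query, write-back flow described above can be sketched as follows. `TTLCache` is an in-process stand-in for a cache such as Redis, `db` is a plain dict standing in for a real query, and `get_user` is a hypothetical function name.

```python
import time

class TTLCache:
    """In-process stand-in for a shared cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at)
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def get_user(user_id, cache, db, ttl=60.0):
    key = f"user:{user_id}"
    cached = cache.get(key)      # 1. check the cache
    if cached is not None:
        return cached            #    hit: return immediately
    row = db[user_id]            # 2-3. miss: query the database
    cache.set(key, row, ttl)     # 4. write back to the cache
    return row
```

The `ttl` argument is where the "pick a TTL" trade-off shows up: a larger value means fewer database reads but staler data.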

Pick a TTL

Too long → stale data. Too short → cache barely helps. Tune per data type. Frequently-read, rarely-changed data is the sweet spot.

Eviction policy

LRU is the default for a reason. LFU and FIFO exist for special access patterns.
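For reference, LRU eviction is small enough to sketch with the standard library. `LRUCache` is a hypothetical class written for this example; production caches implement eviction internally.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used key once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry
```

With capacity 2, putting `a` and `b`, reading `a`, then putting `c` evicts `b`, because `a` was touched more recently.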

Common pitfall

Cache stampede — when a hot key expires and a thousand requests all rebuild it at once. Mitigate with locks or staggered TTLs.
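Both mitigations can be sketched briefly. The lock below is per key and lives inside one process, which is an assumption made to keep the example self-contained; across many web servers the lock would have to live in a shared store. `get_or_rebuild` and `jittered_ttl` are hypothetical names, and `cache` is a plain dict.

```python
import random
import threading

_locks = {}
_locks_guard = threading.Lock()

def _lock_for(key):
    """Return the lock dedicated to `key`, creating it on first use."""
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def jittered_ttl(base, spread=0.1, rng=random):
    """Vary a TTL by +/- `spread` so hot keys do not all expire together."""
    return base * (1 + rng.uniform(-spread, spread))

def get_or_rebuild(key, cache, rebuild):
    value = cache.get(key)
    if value is not None:
        return value
    with _lock_for(key):        # only one thread rebuilds a given key
        value = cache.get(key)  # re-check: another thread may have filled it
        if value is None:
            value = rebuild()
            cache[key] = value
        return value
```

The second `cache.get` inside the lock is what turns a thousand concurrent rebuilds into one: every waiting thread finds the value already populated when it finally acquires the lock.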

Diagram: Web App → (1) check Cache (Redis) → (2) miss → (3) query Database → (4) write back to cache.
09 / 15
7

Content Delivery Network (CDN)

A globally distributed cache for static assets. The user fetches from a server near them, not from your origin halfway across the world.

What goes on a CDN

Images, CSS, JS, fonts, videos — anything that doesn't change per user. Increasingly also cached API responses and HTML.

How it works

CDN has edge servers in dozens of cities. DNS routes the user to the nearest one. First request fills the edge cache; subsequent requests serve from edge.

Why latency drops

The speed of light is fixed. A round trip from Mumbai to Frankfurt typically takes 100 to 150 ms; a round trip from Mumbai to a Mumbai edge takes about 10 ms.
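A back-of-the-envelope calculation shows where the physical floor sits. The sketch assumes light travels through fibre at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and that Mumbai to Frankfurt is about 6,500 km in a straight line; both figures are approximations supplied for this example.

```python
FIBER_KM_PER_S = 200_000  # assumed speed of light in optical fibre

def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    return 2 * distance_km * 1000 / FIBER_KM_PER_S

print(min_rtt_ms(6_500))  # assumed Mumbai to Frankfurt distance
print(min_rtt_ms(50))     # assumed distance to an in-city edge
```

Under these assumptions the long-haul floor is about 65 ms and the in-city floor is under 1 ms. Observed round trips are higher because cables do not follow straight lines and every router, TLS handshake, and server adds delay, but the gap between the two cases remains.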

Invalidation matters

Use versioned URLs (`app.v42.js`) or cache-busting query strings. Manual cache purges are slow and error-prone.
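One common way to produce versioned URLs automatically is to embed a hash of the file's content in its name, so the URL changes exactly when the content does. The sketch below shows that idea; `versioned_name` and the 8-character digest length are choices made for this example, not something the notes prescribe.

```python
import hashlib
from pathlib import PurePosixPath

def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    p = PurePosixPath(path)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))
```

Calling `versioned_name("static/app.js", data)` returns a name of the form `static/app.<hash>.js`. Because an unchanged file keeps its name, such assets can be cached at the edge with a very long TTL and never need a purge.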

Diagram: Users in IN, US, and EU each reach their local CDN edge (Edge IN, Edge US, Edge EU); edges fill their cache from the Origin.
10 / 15
8

Stateless Web Tier

For horizontal scaling to actually work, web servers must be interchangeable. That means moving state out of them.

The problem with sticky sessions

If user A's session lives in server 1's memory, the LB must always route them to server 1. Lose server 1, lose the session. Defeats horizontal scaling.

Move state to shared storage

Sessions → Redis. User uploads → S3/object storage. Anything per-user lives outside the web process.
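The effect of moving sessions out of the web process can be shown with two servers sharing one store. `SessionStore` here is a dict-backed stand-in for something like Redis, and both class names are hypothetical.

```python
import secrets

class SessionStore:
    """Stand-in for a shared session store; here just a dict."""

    def __init__(self):
        self._data = {}

    def create(self, user_id):
        sid = secrets.token_urlsafe(16)
        self._data[sid] = {"user_id": user_id}
        return sid

    def load(self, sid):
        return self._data.get(sid)


class WebServer:
    """Holds no per-user state of its own."""

    def __init__(self, name, sessions):
        self.name = name
        self.sessions = sessions  # shared, not owned by this server

    def login(self, user_id):
        return self.sessions.create(user_id)

    def whoami(self, sid):
        session = self.sessions.load(sid)
        return None if session is None else session["user_id"]
```

A session created through one `WebServer` instance is readable through any other instance built on the same store, so the load balancer is free to send each request anywhere.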

Now servers are cattle, not pets

Any request can hit any server. Auto-scaling actually works. Deploys are rolling restarts with zero downtime.

Stateless ≠ no state anywhere

The system still has state — it's just centralised in services built for it (databases, caches, blob stores).

Diagram: LB → three stateless Web servers → shared Session Store and Object Storage.
11 / 15
9

Multiple Data Centers

A whole region can go offline — power, fibre cuts, cloud-provider outages. Multi-DC keeps users served when one region dies.

GeoDNS routing

DNS routes each user to the closest healthy data center. If a region fails its health check, traffic shifts to the next nearest.
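The routing rule amounts to "nearest healthy data center, else the next nearest". A minimal sketch follows; the `PREFERENCES` table, the region codes, and the data center names are hypothetical, and a real GeoDNS service derives proximity from the resolver's location rather than a hand-written table.

```python
from dataclasses import dataclass

@dataclass
class DataCenter:
    name: str
    healthy: bool = True

# Hypothetical preference order per user region, nearest first.
PREFERENCES = {
    "IN": ["apac", "us"],
    "US": ["us", "apac"],
}

def resolve(region, datacenters):
    """Return the name of the nearest data center that passes its health check."""
    for name in PREFERENCES[region]:
        if datacenters[name].healthy:
            return name
    raise RuntimeError("no healthy data center")
```

Because DNS answers are cached by resolvers, a real failover only takes effect as records expire, so short record TTLs matter for how quickly traffic shifts.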

Data sync is the hard part

Each DC needs its own DB copy. Cross-region replication is asynchronous; you accept some lag in exchange for resilience.

Test the failover

A failover plan that's never been exercised will fail when you need it. Run drills.

Cost reality

You're roughly doubling infrastructure cost. Justify with uptime SLA, not engineering aesthetics.

Diagram: Users in IN and US → GeoDNS → Data Center APAC (Web + DB + Cache) and Data Center US (Web + DB + Cache), linked by async replication.
12 / 15
10

Message Queue

Not every task needs to happen during the request. A queue lets producers and consumers run at their own pace, and lets the system absorb spikes.

Decouple producer and consumer

Web servers drop jobs onto the queue and return immediately. Workers process them whenever they're free.

Smooths bursts

A signup flood doesn't crash the email service — emails queue up and get sent at a sustainable rate.

Different scaling axes

Heavy processing? Add more workers. Heavy traffic? Add more web servers. They scale independently.
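The producer and worker roles can be sketched with the standard library's in-process queue. This stands in for a real broker such as SQS or RabbitMQ only to show the shape of the pattern; `run_workers` and its parameters are hypothetical.

```python
import queue
import threading

def run_workers(jobs, handler, worker_count=3):
    """Queue every job, then drain the queue with `worker_count` threads."""
    q = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = q.get()
            if job is None:  # sentinel: no more work for this worker
                return
            out = handler(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:         # the producer only enqueues, it never waits on the work
        q.put(job)
    for _ in threads:
        q.put(None)          # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Raising `worker_count` increases processing throughput without touching the producer, which is the independent scaling axis described above. Results arrive in completion order, not submission order.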

Typical tools

SQS, RabbitMQ, Kafka. Kafka is for event streams; SQS/RabbitMQ for plain job queues.

Diagram: Web Server → (publish) → Queue (job · job · job) → (consume) → Worker 1, Worker 2, Worker 3.
13 / 15
11

Logging, Metrics, Automation

At scale, you can't ssh into a box and figure things out. Observability and automation aren't optional — they're how the system stays alive.

Logging

Structured logs, centralised (ELK, Loki, CloudWatch). Searchable across all servers. Every request gets a trace ID.
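A structured log line with a trace ID might be produced as in the sketch below, using only the standard `logging` module. The JSON field names, `make_logger`, and `handle_request` are assumptions for this example; log aggregators accept many shapes.

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

def make_logger(name="app", stream=None):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger

def handle_request(logger, trace_id=None):
    """Attach one trace ID to every line logged while serving a request."""
    trace_id = trace_id or uuid.uuid4().hex
    log = logging.LoggerAdapter(logger, {"trace_id": trace_id})
    log.info("request started")
    log.info("request finished")
    return trace_id
```

Searching the aggregator for a single `trace_id` then returns every line for that request, regardless of which server emitted it.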

Metrics

Time-series of latency, error rate, queue depth, CPU. The four golden signals (latency, traffic, errors, saturation) cover most needs.

Alerting

Alert on user-visible symptoms (p99 latency, error rate), not low-level causes (CPU at 80%). Cause-based alerts generate pager fatigue.
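A symptom-based alert rule can be as simple as the sketch below, which computes p99 latency with the nearest-rank method and compares it, along with the error rate, against thresholds. The 500 ms and 1% limits and the function names are illustrative assumptions; real thresholds come from your SLOs.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile for `pct` in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def should_page(latencies_ms, errors, requests,
                p99_limit_ms=500, error_limit=0.01):
    """Page only when users would notice: slow tail or elevated error rate."""
    p99 = percentile(latencies_ms, 99)
    error_rate = errors / requests if requests else 0.0
    return p99 > p99_limit_ms or error_rate > error_limit
```

Nothing in the rule looks at CPU or memory: those belong on a dashboard for diagnosis once a symptom alert has fired.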

Automation

CI/CD, infrastructure-as-code, auto-scaling, automated failover. Anything done manually twice should be scripted.

Diagram: Web, App, and Worker nodes send logs to a Log Aggregator and metrics to a Metrics Pipeline, which feeds Alerts and a Dashboard. CI/CD and IaC handle deploys, scaling, and rollback.
14 / 15
12

Database Sharding

When the primary database can't be made any bigger, you split the data across many databases. Each shard owns a slice.

Shard key chooses the slice

e.g. `user_id % 4` puts each user on one of 4 shards. The application or a proxy routes queries to the right shard.
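Application-side routing for the `user_id % 4` example might look like the sketch below. The connection strings in `SHARD_DSNS` are hypothetical placeholders, as are the function names.

```python
SHARD_COUNT = 4  # matches the `user_id % 4` example

# Hypothetical connection strings, one per shard.
SHARD_DSNS = [f"postgres://shard{i}.internal/app" for i in range(SHARD_COUNT)]

def shard_for(user_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Map a user to a shard index using the modulo of the shard key."""
    return user_id % shard_count

def dsn_for(user_id: int) -> str:
    """Pick the database that owns this user's rows."""
    return SHARD_DSNS[shard_for(user_id)]
```

Every query that carries a `user_id` can be routed this way; queries that do not carry the shard key have to fan out to all shards.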

Picking the key is everything

A bad key creates hot shards (one DB does all the work). Want roughly even read/write distribution across keys.

Hard tradeoffs

Cross-shard joins become expensive or impossible. Resharding (changing the key or the number of shards) is a major migration. Plan it before you need it.
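To see why resharding is painful with modulo routing, the sketch below counts how many users would land on a different shard if the shard count grew from 4 to 5. The function name and the ID range are assumptions for illustration.

```python
def moved_fraction(user_ids, old_count, new_count):
    """Share of users whose shard changes when the shard count changes."""
    ids = list(user_ids)
    moved = sum(1 for u in ids if u % old_count != u % new_count)
    return moved / len(ids)

print(moved_fraction(range(100_000), 4, 5))  # prints 0.8
```

Adding a single shard relocates about 80% of the data in this example, which is why the move is a migration and not a config change. Schemes such as consistent hashing are designed to reduce how much data moves.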

Last resort, not first

Try read replicas, caching, and bigger boxes first. Sharding adds permanent complexity.

Diagram: App / Proxy routes by user_id % N to Shard 1 (users 0–999K), Shard 2 (users 1M–2M), and Shard 3 (users 2M–3M).
Watch out: hot shards, cross-shard joins, resharding pain, schema drift. Sharding is a last resort.
15 / 15
Chapter 1 · Summary

Principles That Apply At Every Stage

The architectures above are just tools. The thinking that decides when to reach for each one is what gets you through a system-design interview.

Solve the next bottleneck, not the future

Over-engineering early is a faster way to fail than under-engineering. Add complexity in response to real signals.

Statelessness is a superpower

Any server can serve any request. It's the precondition for horizontal scale, zero-downtime deploys, and graceful failure.

Cache aggressively, invalidate carefully

Most reads in most systems can be served from memory. The hard part is keeping it fresh enough.

Asynchrony decouples failure

Queues let one slow part of the system not bring down the rest. Trade latency for resilience.

You can't manage what you can't see

Observability is not a phase 2 deliverable. It's the only way to know which bottleneck to fix next.

Failure is normal

Disks die, networks partition, regions go offline. Design assuming each component can fail at any moment.
