You don't architect for millions on day one. You start with one box and answer one
bottleneck at a time — each new layer exists to fix a specific failure mode of the
one before it. This is that ladder, drawn rung by rung.
Every box below — load balancer, cache, queue, shards — earns its place by solving a
problem you can measure, not one you can imagine. Here's where the whole
chapter is headed: the architecture a system grows into once it has survived real
traffic. Don't memorise it — we'll assemble it one piece at a time, and each piece
will make sense as the answer to "what just broke?"
Where we're headed. Static assets come from the CDN; everything
else enters through the load balancer, which spreads requests across an
interchangeable web tier. Reads hit the cache first, writes go to the
primary DB (replicas serve more reads), and slow work is handed to a
queue for workers to drain.
Primary source
Alex Xu, System Design Interview — An Insider's Guide (Vol. 1),
Chapter 1. The author's site, bytebytego.com,
is the highest-trust companion. This lesson mirrors the
slide deck; jump straight to any slide with the
Slide N chips below.
Twelve moves take you from a hobby project to an internet-scale system. Each rung
names the move and the failure it answers. Skim it now — the
rest of the chapter walks each rung, with the matching slide one tap away.
1Single server Slide 3Web app + database on one box. Fails when one process starves the other, or the box dies.
2Pick the right database Slide 4SQL by default; NoSQL for flexible schema or huge writes. The wrong choice locks in pain later.
3Scale the web tier out Slide 5Many identical servers, not one giant one. A single big box is still a single point of failure.
4Load balancer Slide 6One public IP fronts many private servers. Removes the web tier as a single point of failure.
5Database replication Slide 7One primary for writes, replicas for reads. Now the database isn't the lone weak point either.
6Caching tier Slide 8Hot data in memory, in front of the DB. Most reads should never touch disk.
7CDN Slide 9Static assets from an edge near the user. You can't beat the speed of light — so move closer to it.
8Stateless web tier Slide 10Push session state to shared storage. Sticky sessions quietly defeat horizontal scaling.
9Multiple data centers Slide 11GeoDNS routes users to the nearest healthy region. A whole region can go offline.
10Message queue Slide 12Decouple producers from consumers; absorb spikes. Not every task needs to finish inside the request.
11Logging, metrics, automation Slide 13Observability + CI/CD + IaC. At scale you can't SSH in and eyeball it.
12Database sharding Slide 14Split data across many DBs by a shard key. The last resort, when the primary can't grow further.
One VM runs everything. Fastest to ship, cheapest to run, easiest to debug
— and the right place for almost every system to start.
The user's browser resolves your domain to this server's IP via DNS and talks to it
directly over HTTP. Its first failure mode is the lack of isolation: one process eats
all the CPU or memory and takes the whole site down. The entire system is one machine
away from an outage.
Key idea
Begin with the simplest thing that could possibly work. The reasons to add a layer
come from production metrics — latency, error rate, queue depth — not from a diagram
you drew before launch.
The first decision that's genuinely hard to reverse. Frame it around data
shape and access patterns, not "modern vs old." Both are correct for
different workloads.
Transactions, joins, strong consistency. The default for CRUD.Flexible schema, huge writes, horizontal scale out of the box.
The honest answer
For the first million users, a managed Postgres almost always wins. NoSQL solves
problems most applications never have — choose it because a real access pattern demands
it, not for the scalability halo.
Vertical needs zero code changes but caps out and keeps a single point of
failure. Horizontal has no ceiling — at the price of statelessness (stage 8).
The rule
Start vertical because it's free and simple. Switch to horizontal the moment
availability matters more than simplicity — only horizontal scaling survives a node
dying.
The public sees one IP; behind it, many servers share the work. The LB
health-checks each one and routes around failures — invisibly to the user.
There's a security dividend too: only the load balancer holds a public IP; the web
servers sit on a private network, unreachable from the internet. Distribution strategy
is a tuning knob — round-robin or least-connections for uniform requests, IP-hash when
a user must stick to one server.
One primary takes every write; replicas mirror it and serve
reads. Most workloads are read-heavy, so this buys a lot of headroom cheaply. If the
primary dies, a replica is promoted.
Watch out · replication lag
Replicas trail the primary by seconds. A read-after-write flow — "I just posted,
show me my post" — can return stale data from a lagging replica. Route those specific
reads to the primary, or design the UI to tolerate the lag.
Memory is roughly 100,000× faster than disk. On a hit the app returns
immediately; on a miss it queries the DB, writes the result back, and returns.
Two knobs decide whether a cache helps. A TTL that's too long serves
stale data; too short and the cache barely earns its keep — frequently read,
rarely changed data is the sweet spot. An eviction policy (LRU by
default) decides what to drop when memory fills.
Watch out · cache stampede
When a hot key expires, a thousand requests can all miss at once and rebuild it
simultaneously, hammering the database. Mitigate with a rebuild lock or jittered TTLs
so keys don't expire in lockstep.
The reason latency drops is physics. Mumbai↔Frankfurt is ~150 ms at
best; Mumbai to a Mumbai edge is ~10 ms. You can't speed up light, so you
move the data closer.
Invalidation matters
Use versioned URLs like app.v42.js or cache-busting query strings.
Manual purges are slow and error-prone — bake the version into the URL so a new deploy
is simply a new file.
Move per-user state out of the web process: sessions to Redis, uploads to
object storage. Now any request can hit any server — servers become cattle, not
pets.
The trap is the sticky session: if user A's session lives in server
1's memory, the LB must always send them back to server 1 — and a dead server means a
dropped session, defeating the whole point of scaling out. With state centralised,
auto-scaling works and deploys become zero-downtime rolling restarts.
Key idea
Stateless doesn't mean no state anywhere — it means state lives in services built to
hold it (databases, caches, blob stores), not in the memory of a web process that might
vanish at any moment.
A whole region can go offline. GeoDNS sends each user to the nearest healthy
DC and shifts traffic away from one that fails its health check.
The hard part is data: each DC needs its own copy, and cross-region replication is
asynchronous — you accept some lag in exchange for resilience. A failover plan
that's never been exercised will fail when you need it, so run drills. And you're
roughly doubling infrastructure cost, justified by an uptime SLA, not by aesthetics.
Web servers drop a job and return immediately; workers drain it whenever
they're free. A signup flood queues up instead of crashing the email service.
Decoupling lets the two sides scale on independent axes: heavy
processing means more workers, heavy traffic means more web servers. Use SQS or
RabbitMQ for plain job queues; Kafka when you need a durable, replayable event stream.
At scale you can't SSH into a box and reason your way to the problem. Observability and
automation aren't a phase-2 nicety — they're how the system stays alive and how you
know which bottleneck to fix next.
Logging — structured, centralised (ELK, Loki, CloudWatch),
searchable across every server, with a trace ID on every request.
Metrics — the four golden signals: latency, traffic, errors,
saturation. They cover most of what you need to watch.
Alerting — page on user-visible symptoms (p99 latency, error
rate), not low-level causes (CPU at 80%). Alerting on causes trains your team to
ignore the pager.
Automation — CI/CD, infrastructure-as-code, auto-scaling,
automated failover. Anything done by hand twice should be scripted.
The shard key decides which database a row lives on. A bad key creates
hot shards where one DB does most of the work.
Last resort, not first
Sharding adds permanent complexity — cross-shard joins get expensive or impossible,
and resharding is a major migration. Exhaust read replicas, caching, and bigger boxes
first. It's the one move on this ladder you can't easily walk back.
The architectures above are just tools. What carries you through an interview — and
through real on-call life — is the judgement about when to reach for each one:
Solve the next bottleneck, not the future. Over-engineering early
fails faster than under-engineering. Add complexity in response to real signals.
Statelessness is a superpower. It's the precondition for
horizontal scale, zero-downtime deploys, and graceful failure.
Cache aggressively, invalidate carefully. Most reads can come from
memory; the hard part is keeping it fresh enough.
Asynchrony decouples failure. Queues let one slow component avoid
taking down the rest. You trade latency for resilience.
You can't manage what you can't see. Observability is how you know
which bottleneck is next.
Failure is normal. Disks die, networks partition, regions go dark.
Design assuming each component can fail at any moment.
Active recall
Cover the answers. Say each one out loud before you tap to check.
Why start with a single server even for a "serious" product?
Think shipping speed and where the signal to scale comes from.
It's the fastest to ship, cheapest to run, and easiest to debug.
You add layers in response to real production metrics, not predictions — so the
single box is both the right start and the source of the data that tells you what
to do next.
What must be true of the web tier before horizontal scaling works?
It's about where session state lives.
It must be stateless — no per-user state in a
server's local memory. Sessions go to a shared store (Redis), uploads to object
storage. Then any request can hit any server and sticky sessions can't pin a user
to a box that might die.
Primary/replica replication scales reads — what bug does it introduce?
Replicas aren't instantaneous.
Replication lag. Replicas trail the primary by
seconds, so a read-after-write flow ("see the post I just made") can show stale
data. Route those reads to the primary or design the UI to tolerate the lag.
Why does a CDN reduce latency — and what limit does it work around?
Physics, not software.
The speed of light is fixed, so a far-away round trip has an
irreducible floor (~150 ms Mumbai↔Frankfurt). A CDN serves assets from an edge
near the user (~10 ms), moving the data closer rather than trying to move light
faster.
What is cache stampede, and how do you prevent it?
A hot key expires…
When a hot key expires, many requests miss at once and all rebuild
it together, hammering the DB. Prevent it with a rebuild lock (one request
repopulates) or jittered TTLs so keys don't expire in lockstep.
Why is sharding the last resort on the ladder?
Think reversibility.
It adds permanent complexity — cross-shard joins get expensive or
impossible, and resharding is a major migration. Replicas, caching, and bigger
boxes are all easier to walk back, so you exhaust them first.
Check yourself
Q1 Your read-heavy app's database is the bottleneck and a single point of failure. What's the most natural next move?
Why: Primary/replica replication directly removes the
DB as a single point of failure and scales reads — exactly the read-heavy
case. Sharding is a later, heavier, harder-to-reverse step.
Q2 A user reports that immediately after posting, their post sometimes doesn't appear. Most likely cause?
Why: Read-after-write against a replica that trails the
primary by seconds is the classic symptom. Fix by routing that read to the primary.
Q3 Why do sticky sessions undermine horizontal scaling?
Why: If state lives in one server's RAM, the LB must
always route that user there — and a dead server means a dropped session. Moving
state to shared storage makes servers interchangeable.
Q4 Which is the best reason to put a message queue between your web tier and an email-sending worker?
Why: The queue lets the web tier return immediately and
absorbs bursts; workers drain the backlog at a sustainable rate, and the two sides
scale independently.
Q5 You're starting a new CRUD product expecting its first users. Which database choice does the chapter actually recommend?
Why: For most CRUD apps up to a million users, managed
Postgres wins on transactions, joins, and simplicity. NoSQL solves problems most
apps never have — pick it for a real access pattern, not the scalability halo.