Chapter 1 · System Design Fundamentals

Scaling From One User to Millions

You don't architect for millions on day one. You start with one box and answer one bottleneck at a time — each new layer exists to fix a specific failure mode of the one before it. This is that ladder, drawn rung by rung.


Reading time: ~12 min · Prerequisites: none · Format: diagram-first walkthrough · Next: Back-of-the-Envelope Estimation

Every box below — load balancer, cache, queue, shards — earns its place by solving a problem you can measure, not one you can imagine. Here's where the whole chapter is headed: the architecture a system grows into once it has survived real traffic. Don't memorise it — we'll assemble it one piece at a time, and each piece will make sense as the answer to "what just broke?"

Where we're headed. Static assets come from the CDN; everything else enters through the load balancer, which spreads requests across an interchangeable web tier. Reads hit the cache first, writes go to the primary DB (replicas serve more reads), and slow work is handed to a queue for workers to drain.

Primary source

Alex Xu, System Design Interview — An Insider's Guide (Vol. 1), Chapter 1. The author's site, bytebytego.com, is the highest-trust companion. This lesson mirrors the slide deck; jump straight to any slide with the Slide N chips below.

The scaling ladder Slide 2

Twelve moves take you from a hobby project to an internet-scale system. Each rung names the move and the failure it answers. Skim it now — the rest of the chapter walks each rung, with the matching slide one tap away.

1 Single server Slide 3 Web app + database on one box. Fails when one process starves the other, or the box dies.

2 Pick the right database Slide 4 SQL by default; NoSQL for flexible schema or huge writes. The wrong choice locks in pain later.

3 Scale the web tier out Slide 5 Many identical servers, not one giant one. A single big box is still a single point of failure.

4 Load balancer Slide 6 One public IP fronts many private servers. Removes the web tier as a single point of failure.

5 Database replication Slide 7 One primary for writes, replicas for reads. Now the database isn't the lone weak point either.

6 Caching tier Slide 8 Hot data in memory, in front of the DB. Most reads should never touch disk.

7 CDN Slide 9 Static assets from an edge near the user. You can't beat the speed of light, so move the data closer to the user.

8 Stateless web tier Slide 10 Push session state to shared storage. Sticky sessions quietly defeat horizontal scaling.

9 Multiple data centers Slide 11 GeoDNS routes users to the nearest healthy region. A whole region can go offline.

10 Message queue Slide 12 Decouple producers from consumers; absorb spikes. Not every task needs to finish inside the request.

11 Logging, metrics, automation Slide 13 Observability + CI/CD + IaC. At scale you can't SSH in and eyeball it.

12 Database sharding Slide 14 Split data across many DBs by a shard key. The last resort, when the primary can't grow further.

Start on one box Slide 3

One VM runs everything. Fastest to ship, cheapest to run, easiest to debug — and the right place for almost every system to start.

The user's browser resolves your domain to this server's IP via DNS and talks to it directly over HTTP. Its first failure mode is the lack of isolation: one process eats all the CPU or memory and takes the whole site down. The entire system is one machine away from an outage.

Key idea

Begin with the simplest thing that could possibly work. The reasons to add a layer come from production metrics — latency, error rate, queue depth — not from a diagram you drew before launch.

Choosing a database Slide 4

The first decision that's genuinely hard to reverse. Frame it around data shape and access patterns, not "modern vs old." SQL and NoSQL are each correct for different workloads.

SQL (relational): transactions, joins, strong consistency. The default for CRUD.

NoSQL: flexible schema, huge writes, horizontal scale out of the box.

The honest answer

For the first million users, a managed Postgres almost always wins. NoSQL solves problems most applications never have — choose it because a real access pattern demands it, not for the scalability halo.
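To make "transactions, joins" concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for a managed Postgres so it runs anywhere; the tables, columns, and the transfer rule are hypothetical and only there to illustrate the idea.

```python
import sqlite3

def make_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE accounts (user_id INTEGER REFERENCES users(id), balance INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "asha"), (2, "ben")])
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move credits atomically: either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if the body raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE user_id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE user_id = ?", (amount, dst))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE user_id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """A join: combine rows from two tables in one query."""
    return conn.execute(
        "SELECT u.name, a.balance FROM users u "
        "JOIN accounts a ON a.user_id = u.id ORDER BY u.id"
    ).fetchall()
```

A failed transfer leaves both balances untouched, which is the guarantee you would otherwise have to hand-build on a store without multi-row transactions.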

Vertical vs horizontal scaling Slide 5

Vertical needs zero code changes but caps out at the biggest machine you can buy and keeps a single point of failure. Horizontal has no hard ceiling, at the price of statelessness (rung 8).

The rule

Start vertical because it needs no code changes and is simple. Switch to horizontal the moment availability matters more than simplicity: only horizontal scaling survives a node dying.
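A back-of-the-envelope way to see why. Assume each server is up with probability a and that servers fail independently (a strong assumption: shared power, networks, and bad deploys all break it). Then the chance that at least one of n servers is up is 1 - (1 - a)^n. The function name below is illustrative.

```python
def pool_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent servers is up."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return 1 - (1 - node_availability) ** nodes
```

Under that assumption, one 99% box gives you 99%, while two of them behind a load balancer give roughly 99.99%. The load balancer's own availability is ignored in this sketch.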

The load balancer Slide 6

The public sees one IP; behind it, many servers share the work. The LB health-checks each one and routes around failures — invisibly to the user.

There's a security dividend too: only the load balancer holds a public IP; the web servers sit on a private network, unreachable from the internet. Distribution strategy is a tuning knob — round-robin or least-connections for uniform requests, IP-hash when a user must stick to one server.
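The three strategies and the health-check behaviour can be sketched in a few lines. This is a toy in-process model, not how any particular load balancer is implemented; the class and method names are made up for illustration.

```python
import hashlib
import itertools

class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self.active = {s: 0 for s in self.servers}  # open connections per server
        self._cycle = itertools.cycle(self.servers)

    def mark(self, server, is_healthy):
        """Record the result of a health check."""
        if is_healthy:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def _candidates(self):
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers")
        return candidates

    def round_robin(self):
        """Take turns, skipping servers that failed their health check."""
        self._candidates()  # raises if nothing is healthy
        while True:
            server = next(self._cycle)
            if server in self.healthy:
                return server

    def least_connections(self):
        """Pick the healthy server with the fewest open connections."""
        return min(self._candidates(), key=lambda s: self.active[s])

    def ip_hash(self, client_ip):
        """Same client IP maps to the same server while the healthy set is unchanged."""
        candidates = self._candidates()
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        return candidates[int.from_bytes(digest[:8], "big") % len(candidates)]
```

Note the cost of IP-hash in this sketch: when the healthy set changes, the modulo changes too, so many clients get remapped to a different server.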

Database replication Slide 7

One primary takes every write; replicas mirror it and serve reads. Most workloads are read-heavy, so this buys a lot of headroom cheaply. If the primary dies, a replica is promoted.

Watch out · replication lag

Replicas trail the primary, often by milliseconds and sometimes by seconds or more under load. A read-after-write flow ("I just posted, show me my post") can return stale data from a lagging replica. Route those specific reads to the primary, or design the UI to tolerate the lag.
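One common way to "route those specific reads to the primary" is to pin a user to the primary for a short window after they write. A minimal sketch; the class name and the 5-second window are assumptions, and in practice the window should comfortably exceed the lag you actually observe.

```python
import time

class ReplicaAwareRouter:
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id):
        """Call this whenever the user writes (writes always go to the primary)."""
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        """Recent writers read from the primary; everyone else reads a replica."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"
```

Only users who just wrote pay the cost of a primary read, so the replicas keep absorbing the bulk of read traffic.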

The caching tier Slide 8

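A memory read (~100 ns) is roughly 100,000× faster than a seek on a spinning disk (~10 ms); the gap to SSDs is smaller but still large. On a hit the app returns immediately; on a miss it queries the DB, writes the result back to the cache, and returns.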

Two knobs decide whether a cache helps. A TTL that's too long serves stale data; too short and the cache barely earns its keep — frequently read, rarely changed data is the sweet spot. An eviction policy (LRU by default) decides what to drop when memory fills.
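The hit/miss flow and both knobs fit in a small sketch. This is an in-process toy standing in for a real cache server; the class and function names are illustrative, and a cached value of None is treated as a miss to keep the code short.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (value, expires_at), oldest use first

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:   # TTL knob: expired entries count as misses
            del self._data[key]
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)  # eviction knob: drop least recently used

def read_through(cache, key, load_from_db):
    """Hit: return from memory. Miss: query the DB, write back, return."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value
```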

Watch out · cache stampede

When a hot key expires, a thousand requests can all miss at once and rebuild it simultaneously, hammering the database. Mitigate with a rebuild lock or jittered TTLs so keys don't expire in lockstep.
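Both mitigations can be sketched together. The lock here is a per-key threading.Lock, which only protects a single process; across many web servers you would need a lock held in a shared store, which this sketch does not attempt. The function names, the ±10% spread, and the dict-based cache are assumptions for illustration.

```python
import random
import threading
import time

_locks = {}
_locks_guard = threading.Lock()

def _lock_for(key):
    """One lock per cache key, created on first use."""
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def jittered_ttl(base_seconds, spread=0.1):
    """Randomise the TTL by +/- spread so keys written together expire apart."""
    return base_seconds * (1 + random.uniform(-spread, spread))

def get_or_rebuild(cache, key, rebuild, base_ttl=60.0):
    """`cache` is a dict of key -> (value, expires_at)."""
    def fresh_entry():
        entry = cache.get(key)
        if entry is not None and time.monotonic() < entry[1]:
            return entry
        return None

    entry = fresh_entry()
    if entry is None:
        with _lock_for(key):           # only one thread rebuilds this key
            entry = fresh_entry()      # another thread may have rebuilt it meanwhile
            if entry is None:
                entry = (rebuild(key), time.monotonic() + jittered_ttl(base_ttl))
                cache[key] = entry
    return entry[0]
```

The second check inside the lock is what turns a thousand simultaneous misses into one database query: the waiters find the freshly written entry and return it.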

Content delivery network Slide 9

The reason latency drops is physics. A Mumbai↔Frankfurt round trip is on the order of ~150 ms over real networks; Mumbai to a Mumbai edge is ~10 ms. You can't speed up light, so you move the data closer.

Invalidation matters

Use versioned URLs like app.v42.js or cache-busting query strings. Manual purges are slow and error-prone — bake the version into the URL so a new deploy is simply a new file.
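One way to bake the version into the URL is to derive it from the file's contents rather than a build number like v42, so the name changes exactly when the bytes change. A sketch of that variant; the function name and the 8-character digest length are arbitrary choices.

```python
import hashlib

def versioned_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Turn app.js into app.<content-hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension: just append the digest
        return f"{filename}.{digest}"
    return f"{stem}.{digest}.{ext}"
```

Because an unchanged file keeps its name, the CDN and browsers can cache it for a long time, and a deploy that changes the file publishes a new URL instead of requiring a purge.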

Make the web tier stateless Slide 10

Move per-user state out of the web process: sessions to Redis, uploads to object storage. Now any request can hit any server — servers become cattle, not pets.

The trap is the sticky session: if user A's session lives in server 1's memory, the LB must always send them back to server 1 — and a dead server means a dropped session, defeating the whole point of scaling out. With state centralised, auto-scaling works and deploys become zero-downtime rolling restarts.

Key idea

Stateless doesn't mean no state anywhere — it means state lives in services built to hold it (databases, caches, blob stores), not in the memory of a web process that might vanish at any moment.
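In code, the difference is simply where the session dictionary lives. In this sketch a plain in-memory class stands in for the shared store (Redis in the text) so the example runs without a server; all class and method names are hypothetical.

```python
import secrets

class SessionStore:
    """Stand-in for a shared store; every web server talks to the same instance."""
    def __init__(self):
        self._data = {}

    def create(self, user_id):
        session_id = secrets.token_urlsafe(16)
        self._data[session_id] = {"user_id": user_id}
        return session_id

    def load(self, session_id):
        return self._data.get(session_id)

class WebServer:
    """Holds no per-user state of its own, so any instance can serve any request."""
    def __init__(self, name, sessions):
        self.name = name
        self.sessions = sessions

    def login(self, user_id):
        return self.sessions.create(user_id)

    def handle(self, session_id):
        session = self.sessions.load(session_id)
        if session is None:
            return f"{self.name}: 401 not logged in"
        return f"{self.name}: hello {session['user_id']}"
```

A user who logs in on one server and lands on another for the next request is still recognised, and losing a server loses no sessions.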

Multiple data centers Slide 11

A whole region can go offline. GeoDNS sends each user to the nearest healthy DC and shifts traffic away from one that fails its health check.

The hard part is data: each DC needs its own copy, and cross-region replication is asynchronous — you accept some lag in exchange for resilience. A failover plan that's never been exercised will fail when you need it, so run drills. And you're roughly doubling infrastructure cost, justified by an uptime SLA, not by aesthetics.
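The routing decision itself is simple to sketch: filter to healthy data centers, then pick the closest. Real GeoDNS services estimate location from the resolver's or client's IP and use their own health checks; here the coordinates and health flags are supplied directly, and the data-center list is made up.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def pick_datacenter(user_location, datacenters):
    """Return the name of the nearest data center that passes its health check."""
    healthy = [dc for dc in datacenters if dc["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy data center")
    nearest = min(healthy, key=lambda dc: haversine_km(user_location, dc["location"]))
    return nearest["name"]
```

Flipping one data center's health flag is all it takes for its users to be sent to the next-nearest one, which is why the data in that next-nearest region has to be ready for them.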

The message queue Slide 12

Web servers drop a job and return immediately; workers drain it whenever they're free. A signup flood queues up instead of crashing the email service.

Decoupling lets the two sides scale on independent axes: heavy processing means more workers, heavy traffic means more web servers. Use SQS or RabbitMQ for plain job queues; Kafka when you need a durable, replayable event stream.
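The producer/consumer split looks like this in miniature. Python's in-process queue.Queue stands in for a real broker such as SQS or RabbitMQ, and "sending" an email is just appending to a list; the function names and job shape are illustrative.

```python
import queue
import threading

def handle_signup(jobs, email):
    """Web tier: enqueue the slow work and return straight away."""
    jobs.put({"type": "welcome_email", "to": email})
    return "202 Accepted"

def email_worker(jobs, outbox):
    """Worker tier: drain jobs at its own pace; a None job means shut down."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            outbox.append(job["to"])  # stand-in for actually sending the email
        finally:
            jobs.task_done()

def run_workers(jobs, outbox, count):
    """Start `count` workers; add more of these when processing is the bottleneck."""
    workers = [threading.Thread(target=email_worker, args=(jobs, outbox)) for _ in range(count)]
    for w in workers:
        w.start()
    return workers
```

A burst of signups only makes the queue longer. The web tier's response time stays flat, and the backlog drains as fast as the workers can go.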

Logging, metrics, automation Slide 13

At scale you can't SSH into a box and reason your way to the problem. Observability and automation aren't a phase-2 nicety — they're how the system stays alive and how you know which bottleneck to fix next.

Logging — structured, centralised (ELK, Loki, CloudWatch), searchable across every server, with a trace ID on every request.
Metrics — the four golden signals: latency, traffic, errors, saturation. They cover most of what you need to watch.
Alerting — page on user-visible symptoms (p99 latency, error rate), not low-level causes (CPU at 80%). Alerting on causes trains your team to ignore the pager.
Automation — CI/CD, infrastructure-as-code, auto-scaling, automated failover. Anything done by hand twice should be scripted.
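Two of these ideas are easy to show in code: a structured log line that carries a trace ID, and an alert rule that pages on symptoms (p99 latency, error rate) rather than causes. A minimal sketch; the function names and the thresholds (500 ms, 1%) are placeholders, not recommendations.

```python
import json
import uuid

def new_trace_id():
    """Generate an ID at the edge and pass it along with the request."""
    return uuid.uuid4().hex

def format_event(trace_id, event, **fields):
    """One JSON object per log line, so a central log store can index and search it."""
    return json.dumps({"trace_id": trace_id, "event": event, **fields}, sort_keys=True)

def p99(latencies_ms):
    """99th percentile by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]

def should_page(latencies_ms, errors, requests, p99_limit_ms=500, error_rate_limit=0.01):
    """Page on what users feel: slow responses or failed requests."""
    return p99(latencies_ms) > p99_limit_ms or errors / requests > error_rate_limit
```

Searching the central log store for one trace_id then shows a single request's path across every server it touched.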

Database sharding Slide 14

The shard key decides which database a row lives on. A bad key creates hot shards where one DB does most of the work.

Last resort, not first

Sharding adds permanent complexity — cross-shard joins get expensive or impossible, and resharding is a major migration. Exhaust read replicas, caching, and bigger boxes first. It's the one move on this ladder you can't easily walk back.
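A sketch of the simplest scheme, hashing the shard key and taking it modulo the shard count, makes both warnings visible. This is one possible scheme, not the only one; the function names and key formats are hypothetical.

```python
import hashlib
from collections import Counter

def shard_for(shard_key: str, shard_count: int) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def load_per_shard(shard_keys, shard_count):
    """Count how many rows land on each shard for a given choice of key."""
    return Counter(shard_for(k, shard_count) for k in shard_keys)

def moved_fraction(shard_keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    keys = list(shard_keys)
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)
```

Keying by something with few distinct values (say, a country code) sends every row for that value to one shard, which is the hot-shard problem. With a plain modulo, growing from 4 to 5 shards relocates about four keys in five, which is why resharding is a migration rather than a config change.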

The principles underneath Slide 15

The architectures above are just tools. What carries you through an interview — and through real on-call life — is the judgement about when to reach for each one:

Solve the next bottleneck, not the future. Over-engineering early fails faster than under-engineering. Add complexity in response to real signals.
Statelessness is a superpower. It's the precondition for horizontal scale, zero-downtime deploys, and graceful failure.
Cache aggressively, invalidate carefully. Most reads can come from memory; the hard part is keeping it fresh enough.
Asynchrony decouples failure. Queues keep one slow component from taking down the rest. You trade latency for resilience.
You can't manage what you can't see. Observability is how you know which bottleneck is next.
Failure is normal. Disks die, networks partition, regions go dark. Design assuming each component can fail at any moment.

Active recall

Cover the answers. Say each one out loud before you tap to check.

Why start with a single server even for a "serious" product?

Think shipping speed and where the signal to scale comes from.

It's the fastest to ship, cheapest to run, and easiest to debug. You add layers in response to real production metrics, not predictions — so the single box is both the right start and the source of the data that tells you what to do next.

What must be true of the web tier before horizontal scaling works?

It's about where session state lives.

It must be stateless — no per-user state in a server's local memory. Sessions go to a shared store (Redis), uploads to object storage. Then any request can hit any server and sticky sessions can't pin a user to a box that might die.

Primary/replica replication scales reads — what bug does it introduce?

Replicas aren't instantaneous.

Replication lag. Replicas trail the primary, by anything from milliseconds to seconds, so a read-after-write flow ("see the post I just made") can show stale data. Route those reads to the primary or design the UI to tolerate the lag.

Why does a CDN reduce latency — and what limit does it work around?

Physics, not software.

The speed of light is fixed, so a far-away round trip has an irreducible floor, and real routes add to it (~150 ms Mumbai↔Frankfurt in practice). A CDN serves assets from an edge near the user (~10 ms), moving the data closer rather than trying to move light faster.

What is cache stampede, and how do you prevent it?

A hot key expires…

When a hot key expires, many requests miss at once and all rebuild it together, hammering the DB. Prevent it with a rebuild lock (one request repopulates) or jittered TTLs so keys don't expire in lockstep.

Why is sharding the last resort on the ladder?

Think reversibility.

It adds permanent complexity — cross-shard joins get expensive or impossible, and resharding is a major migration. Replicas, caching, and bigger boxes are all easier to walk back, so you exhaust them first.

Check yourself

Q1 Your read-heavy app's database is the bottleneck and a single point of failure. What's the most natural next move?

Why: Primary/replica replication directly removes the DB as a single point of failure and scales reads — exactly the read-heavy case. Sharding is a later, heavier, harder-to-reverse step.

Q2 A user reports that immediately after posting, their post sometimes doesn't appear. Most likely cause?

Why: Read-after-write against a replica that trails the primary is the classic symptom. Fix by routing that read to the primary.

Q3 Why do sticky sessions undermine horizontal scaling?

Why: If state lives in one server's RAM, the LB must always route that user there — and a dead server means a dropped session. Moving state to shared storage makes servers interchangeable.

Q4 Which is the best reason to put a message queue between your web tier and an email-sending worker?

Why: The queue lets the web tier return immediately and absorbs bursts; workers drain the backlog at a sustainable rate, and the two sides scale independently.

Q5 You're starting a new CRUD product expecting its first users. Which database choice does the chapter actually recommend?

Why: For most CRUD apps up to a million users, managed Postgres wins on transactions, joins, and simplicity. NoSQL solves problems most apps never have — pick it for a real access pattern, not the scalability halo.