A walk-through of the architectural moves you make as load grows — from a single box to a multi-region distributed system. Each step solves a specific failure mode of the previous one.
Scale is added one bottleneck at a time. You don't design for millions on day one — you respond to the next constraint as it appears.
Web + DB on one box
Split web & DB
LB + replicas
Cache + CDN
Multi-DC + sharding
Every new layer adds operational surface area — failure modes, deploys, debugging.
Real metrics (latency, error rate, queue depth) tell you when to move, not predictions.
One machine hosts the web app, the database, and serves everything to the user directly. It works until it doesn't.
App process, database, static assets, logs — all on a single VM or physical host.
Fastest to ship, cheapest to run, easiest to debug. Most production systems should begin this way.
One process eats all the CPU or memory and takes the whole site down. No isolation between web traffic and DB load.
User's browser resolves your domain to the server's IP via DNS, then hits it directly over HTTP.
The first real architectural decision. It's about data shape and access patterns, not "modern vs old."
You need transactions, joins, strong consistency, or your data is well-structured and relational. Default choice for most CRUD apps.
Schema is flexible or evolving, writes are huge, you need horizontal scale out of the box, or data is denormalised (logs, events, time-series).
For the first million users, a managed Postgres almost always wins. NoSQL solves problems most apps never have.
Make the box bigger, or add more boxes. The first is simpler; the second is the only one that keeps working.
Bigger CPU, more RAM, faster disk. Zero code changes. Caps out at whatever the cloud provider offers — and a single failure still kills you.
Add more identical servers behind a load balancer. Requires the app to be stateless. Failure of one node ≠ outage. This is what every internet-scale system uses.
Start vertical because it's free. Switch horizontal the moment availability matters more than simplicity.
A traffic director that sits in front of your web tier. The public sees one IP; behind it, many servers share the work.
Distributes incoming requests across healthy servers. Removes the single point of failure at the web tier.
The LB pings each server; if a server stops responding, traffic is rerouted automatically. The user never knows.
Only the LB has a public IP. Web servers sit on a private network and are unreachable from the internet — a security win.
Round-robin, least-connections, IP-hash. Pick based on whether requests are uniform or sessions matter.
Once the web tier is redundant, the database becomes the single point of failure. Replication fixes that and scales reads.
One primary handles writes. Replicas mirror its data and serve reads. Most workloads are read-heavy, so this scales nicely.
If the primary dies, a replica is promoted. Some lag is tolerated; brief unavailability during promotion is normal.
Replicas are seconds behind. Read-after-write workflows (e.g. "just posted, see your post") often need to read from the primary.
Replicas across availability zones survive datacenter-level outages.
Memory is ~100,000× faster than disk. A cache puts the hot data in front of the database so most reads never touch it.
App checks the cache first. On hit, return immediately. On miss, query the DB, write the result back to cache, return.
Too long → stale data. Too short → cache barely helps. Tune per data type. Frequently-read, rarely-changed data is the sweet spot.
LRU is the default for a reason. LFU and FIFO exist for special access patterns.
Cache stampede — when a hot key expires and a thousand requests all rebuild it at once. Mitigate with locks or staggered TTLs.
A globally distributed cache for static assets. The user fetches from a server near them, not from your origin halfway across the world.
Images, CSS, JS, fonts, videos — anything that doesn't change per user. Increasingly also cached API responses and HTML.
CDN has edge servers in dozens of cities. DNS routes the user to the nearest one. First request fills the edge cache; subsequent requests serve from edge.
Speed of light is fixed. A request from Mumbai to Frankfurt is ~150ms minimum. From Mumbai to a Mumbai edge: ~10ms.
Use versioned URLs (`app.v42.js`) or cache-busting query strings. Manual cache purges are slow and error-prone.
For horizontal scaling to actually work, web servers must be interchangeable. That means moving state out of them.
If user A's session lives in server 1's memory, the LB must always route them to server 1. Lose server 1, lose the session. Defeats horizontal scaling.
Sessions → Redis. User uploads → S3/object storage. Anything per-user lives outside the web process.
Any request can hit any server. Auto-scaling actually works. Deploys are rolling restarts with zero downtime.
The system still has state — it's just centralised in services built for it (databases, caches, blob stores).
A whole region can go offline — power, fibre cuts, cloud-provider outages. Multi-DC keeps users served when one region dies.
DNS routes each user to the closest healthy data center. If a region fails its health check, traffic shifts to the next nearest.
Each DC needs its own DB copy. Cross-region replication is asynchronous; you accept some lag in exchange for resilience.
A failover plan that's never been exercised will fail when you need it. Run drills.
You're roughly doubling infrastructure cost. Justify with uptime SLA, not engineering aesthetics.
Not every task needs to happen during the request. A queue lets producers and consumers run at their own pace, and lets the system absorb spikes.
Web servers drop jobs onto the queue and return immediately. Workers process them whenever they're free.
A signup flood doesn't crash the email service — emails queue up and get sent at a sustainable rate.
Heavy processing? Add more workers. Heavy traffic? Add more web servers. They scale independently.
SQS, RabbitMQ, Kafka. Kafka is for event streams; SQS/RabbitMQ for plain job queues.
At scale, you can't ssh into a box and figure things out. Observability and automation aren't optional — they're how the system stays alive.
Structured logs, centralised (ELK, Loki, CloudWatch). Searchable across all servers. Every request gets a trace ID.
Time-series of latency, error rate, queue depth, CPU. The four golden signals (latency, traffic, errors, saturation) cover most needs.
Alert on user-visible symptoms (p99 latency, error rate), not low-level causes (CPU 80%). The latter generates pager fatigue.
CI/CD, infrastructure-as-code, auto-scaling, automated failover. Anything done manually twice should be scripted.
When the primary database can't be made any bigger, you split the data across many databases. Each shard owns a slice.
e.g. `user_id % 4` puts each user on one of 4 shards. The application or a proxy routes queries to the right shard.
A bad key creates hot shards (one DB does all the work). Want roughly even read/write distribution across keys.
Cross-shard joins become expensive or impossible. Resharding (changing the key) is a major migration. Plan it before you need it.
Try replicas, caching, read-only replicas, and bigger boxes first. Sharding adds permanent complexity.
The architectures above are just tools. The thinking that decides when to reach for each one is what gets you through a system-design interview.
Over-engineering early is a faster way to fail than under-engineering. Add complexity in response to real signals.
Any server can serve any request. It's the precondition for horizontal scale, zero-downtime deploys, and graceful failure.
Most reads in most systems can be served from memory. The hard part is keeping it fresh enough.
Queues let one slow part of the system not bring down the rest. Trade latency for resilience.
Observability is not a phase 2 deliverable. It's the only way to know which bottleneck to fix next.
Disks die, networks partition, regions go offline. Design assuming each component can fail at any moment.