Chapter 6 · System Design Fundamentals
Design a Key-Value Store
This is the chapter where the whole distributed-systems toolkit snaps together — replication, quorums, the CAP tradeoff, anti-entropy, and LSM storage — into a single coherent design. Start with a hash map on one box; finish with a ring of replicas that survives real-world failure.
▶ Open the companion slides
A key-value store is the simplest possible database — two methods, put(key, value)
and get(key) — and yet building one that holds tens of billions of keys and never
refuses a write forces you to confront every hard problem in distributed systems at once.
That is exactly why it is the canonical interview question: it is a survey course disguised as a single design.
Alex Xu, System Design Interview (Vol. 1), Chapter 6. Companion deck: slides — jump to any slide with the Slide N chips.
Go deeper: the canonical reference is the Dynamo paper — DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007). Almost every mechanism below traces back to it.
What we're actually building Slide 2
The API is two methods; the difficulty is entirely in the non-functional pressure. Before drawing rings, agree the contract and the operating envelope: tens of billions of opaque keys, six-figure read/write QPS, hundreds to thousands of commodity nodes across several data centres. Two design goals fight each other from here on — always-writable availability versus per-operation tunable consistency.
Surface API
put(key, value) stores an opaque blob under a short string key. get(key) returns the most recent value, or a tombstone if it was deleted. No range scans, no secondary indexes, no transactions.
The two goals in tension
Availability: accept writes even when a replica, a rack, or a whole zone is down. Consistency: tunable per call — strong reads, fast eventual reads, or a quorum — but the system never silently chooses for you.
Single node: one hash map Slide 3
A key-value store on one machine is twenty lines: a process-local hash map for O(1) put/get,
plus an append-only write-ahead log fsync'd before each ack for durability.
Understanding precisely why it cannot survive reality justifies every section that follows —
the working set must fit in RAM, one bad disk loses everything, and vertical scaling buys time, not a solution.
Spread keys around a ring Slide 4
To outgrow one box, partition the keyspace with consistent hashing (Chapter 5):
hash the key onto a circle, walk clockwise to the first node. The payoff over a naive
key % N is that adding or removing a node moves only ~1/N of keys —
its neighbours' slices — instead of reshuffling everything. Virtual nodes give one
physical machine many small arcs, smoothing load and letting beefier hardware carry more tokens.
Replication: never store anything once Slide 5
Pick a replication factor N — usually three. The node the ring points at is the
coordinator; the next N−1 nodes clockwise also store the value. Together they
form the key's preference list. Crucially, you skip replicas that share a
rack or zone already in the list, so every key has copies in at least two failure domains —
otherwise a single rack outage takes all three copies at once.
The CAP corner you can't escape Slide 6
Once data is replicated across machines, the network can split them. When it does, you cannot offer both linearisable consistency and full availability at once — you must choose. CAP is a worst-case constraint that bites only during a partition; outside one you can have all three. Every distributed KV store, in that moment, is making this call, whether or not it admits it.
| Choice | What it does under partition | What it gives up |
|---|---|---|
| CP | Refuses writes (or reads) on the minority side until the partition heals. No client ever sees stale or conflicting state. Spanner, HBase, etcd, ZooKeeper. | Availability — some users get errors during the split. |
| AP | Both sides keep accepting writes; divergent versions are reconciled later. The store stays writable through the outage. Dynamo, Cassandra, Riak. | Consistency — apps may read an older or conflicting value. |
Pick AP if availability is the product; pick CP if a wrong answer is worse than no answer. A KV store built for shopping carts and sessions is almost always AP — that is the path we follow for the rest of the chapter.
N, W, R — and the overlap rule Slide 7
"AP" doesn't mean "no consistency" — it means consistency is a dial. Three knobs set it:
N replicas per key, W acks required before a write succeeds, and R
responses required before a read returns. The single most important inequality in the chapter governs
when a read is guaranteed to see the latest write.
The intuition: if your W writers and your R readers together exceed the
N replicas, by the pigeonhole principle at least one replica was written and read,
so the latest value is in the read response — the coordinator just picks the newest. Tuning the two
knobs lets the same cluster serve fast writes and strong reads, or the reverse:
| Config (N=3) | Profile | Strong read? |
|---|---|---|
| W=1, R=1 | Fastest, fully eventual — both calls return on one ack. | No |
| W=2, R=2 | Balanced quorum — the common default. | Yes (4>3) |
| W=1, R=3 | Fast writes, strong reads — read all replicas. | Yes (4>3) |
| W=3, R=1 | Strong, read-optimised — write all, read one. | Yes (4>3) |
What "consistent" actually promises Slide 8
Quorum settings are mechanism; consistency models are the contract those settings expose to the application. The same cluster can serve different guarantees on different code paths.
Strong consistency
Every read returns the most recently completed write, as if on one global timeline. Achieved with full quorums (W=N or R=N) or consensus like Paxos/Raft. Cost: a slow or partitioned replica blocks the operation.
Eventual consistency
If writes stop, all replicas converge — eventually. Meanwhile a reader may see an older version or watch a value briefly go backwards. Cost: application code must be idempotent and conflict-tolerant.
Read-your-writes
A session-scoped guarantee: a client never sees its own writes regress. Implemented with a sticky replica or a version attached to later reads. A pragmatic middle ground for user-facing features.
Other useful flavours
Monotonic reads never go backwards; causal consistency preserves "A happened before B" for every observer; bounded staleness caps lag at t seconds or k versions. Real systems compose several.
Vector clocks: who really wrote last? Slide 9
Under an AP design, two clients can write the same key on different replicas during a partition. The store
must detect this and either pick a deterministic winner or hand both versions back to the application.
A vector clock tags each write with a per-replica counter, e.g. [(A,1),(B,0),(C,0)];
a replica increments its own counter on each write. If one version's counters are component-wise ≤ another's,
they are causally ordered — keep the newer. If neither dominates, they are concurrent:
return both as siblings and let the caller merge (a cart unions items, a counter sums).
Attach a wall-clock timestamp and let the larger one win. Fast and stateless, but it silently drops a write whenever clocks disagree or two writes are truly concurrent. Use LWW only where losing a write is genuinely acceptable.
Keep writing when replicas vanish Slide 10
Hosts reboot, switches flap, garbage collectors stall. An AP store treats a short-lived failure as a routing problem, not a data-loss event. Two mechanisms keep it writable, and a third tells it who is alive.
Anti-entropy with Merkle trees Slide 11
Hints cover minutes. When a replica is gone long enough that hints expire, two replicas must compare
their entire keyspaces and copy whatever is missing — but a full scan is impossibly expensive.
A Merkle tree hashes each keyspace range into a leaf, hashes children up to a single root
fingerprint, then lets two replicas exchange just the root: if equal, they are already in sync. If not,
they descend only into the subtrees whose hashes differ, isolating the exact divergent ranges
in O(log n) comparisons instead of a linear scan. It also runs as a periodic background repair and catches silent disk corruption.
The write path: commit log → memtable → SSTable Slide 12
Each node's storage engine is an LSM tree: it makes writes cheap by turning random
updates into sequential appends. A write is appended to a durable commit log and
fsync'd, then inserted into an in-memory sorted memtable. When the memtable
fills, it is frozen and flushed to disk as an immutable, sorted SSTable; background
compaction later merges SSTables, dropping superseded versions and tombstones.
The read path: memtable → bloom filter → disk Slide 13
Because the newest version of a key might be in the memtable or in any of several SSTables, a read could touch many files — that is the read amplification cost of an LSM design. The bloom filter is what keeps it sane: a tiny probabilistic structure per SSTable that answers "definitely not here" or "possibly here," letting the engine skip most files without a disk seek.
Active recall
Cover the answers. Say each one out loud before you tap to check.
key % N?1/N of keys — its neighbours' arcs —
instead of remapping the whole keyspace. Virtual nodes smooth the load.W + R > N. Then every read set intersects every write set in at
least one replica, which carries the latest write — so a read can see it.Check yourself
W + R > N. Only 2 + 2 = 4 > 3,
so the read and write sets are forced to share a replica that holds the newest value.