Chapter 6 · System Design Fundamentals

Design a Key-Value Store

This is the chapter where the whole distributed-systems toolkit snaps together — replication, quorums, the CAP tradeoff, anti-entropy, and LSM storage — into a single coherent design. Start with a hash map on one box; finish with a ring of replicas that survives real-world failure.

Open the companion slides
Reading time ~14 min Prerequisites Ch 5 · Consistent Hashing, Ch 4 Audio 🔊 Hinglish read-aloud Next Distributed Unique ID Generator

A key-value store is the simplest possible database — two methods, put(key, value) and get(key) — and yet building one that holds tens of billions of keys and never refuses a write forces you to confront every hard problem in distributed systems at once. That is exactly why it is the canonical interview question: it is a survey course disguised as a single design.

Primary source

Alex Xu, System Design Interview (Vol. 1), Chapter 6. Companion deck: slides — jump to any slide with the Slide N chips.

Go deeper: the canonical reference is the Dynamo paper — DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store" (SOSP 2007). Almost every mechanism below traces back to it.

What we're actually building Slide 2

The API is two methods; the difficulty is entirely in the non-functional pressure. Before drawing rings, agree the contract and the operating envelope: tens of billions of opaque keys, six-figure read/write QPS, hundreds to thousands of commodity nodes across several data centres. Two design goals fight each other from here on — always-writable availability versus per-operation tunable consistency.

Surface API

put(key, value) stores an opaque blob under a short string key. get(key) returns the most recent value, or a tombstone if it was deleted. No range scans, no secondary indexes, no transactions.

The two goals in tension

Availability: accept writes even when a replica, a rack, or a whole zone is down. Consistency: tunable per call — strong reads, fast eventual reads, or a quorum — but the system never silently chooses for you.

Single node: one hash map Slide 3

A key-value store on one machine is twenty lines: a process-local hash map for O(1) put/get, plus an append-only write-ahead log fsync'd before each ack for durability. Understanding precisely why it cannot survive reality justifies every section that follows — the working set must fit in RAM, one bad disk loses everything, and vertical scaling buys time, not a solution.

Spread keys around a ring Slide 4

To outgrow one box, partition the keyspace with consistent hashing (Chapter 5): hash the key onto a circle, walk clockwise to the first node. The payoff over a naive key % N is that adding or removing a node moves only ~1/N of keys — its neighbours' slices — instead of reshuffling everything. Virtual nodes give one physical machine many small arcs, smoothing load and letting beefier hardware carry more tokens.

Replication: never store anything once Slide 5

Pick a replication factor N — usually three. The node the ring points at is the coordinator; the next N−1 nodes clockwise also store the value. Together they form the key's preference list. Crucially, you skip replicas that share a rack or zone already in the list, so every key has copies in at least two failure domains — otherwise a single rack outage takes all three copies at once.

RACK A RACK B RACK C RACK D N1 N2 N3 N4 N5 N6 N7 N8 hash(k) 1 skip 2 3 preference list N = 3 · rack-aware
The key replicates to the next N = 3 nodes clockwise, skipping N4 because it shares Rack B with N3 — so the three copies land in three distinct racks.

The CAP corner you can't escape Slide 6

Once data is replicated across machines, the network can split them. When it does, you cannot offer both linearisable consistency and full availability at once — you must choose. CAP is a worst-case constraint that bites only during a partition; outside one you can have all three. Every distributed KV store, in that moment, is making this call, whether or not it admits it.

ChoiceWhat it does under partitionWhat it gives up
CPRefuses writes (or reads) on the minority side until the partition heals. No client ever sees stale or conflicting state. Spanner, HBase, etcd, ZooKeeper.Availability — some users get errors during the split.
APBoth sides keep accepting writes; divergent versions are reconciled later. The store stays writable through the outage. Dynamo, Cassandra, Riak.Consistency — apps may read an older or conflicting value.
You can't pretend you avoided it

Pick AP if availability is the product; pick CP if a wrong answer is worse than no answer. A KV store built for shopping carts and sessions is almost always AP — that is the path we follow for the rest of the chapter.

N, W, R — and the overlap rule Slide 7

"AP" doesn't mean "no consistency" — it means consistency is a dial. Three knobs set it: N replicas per key, W acks required before a write succeeds, and R responses required before a read returns. The single most important inequality in the chapter governs when a read is guaranteed to see the latest write.

WRITE · W = 2 R1 R2 READ · R = 2 R2 R3 R2 is in both sets it carries the latest write N = 3 replicas: R1 · R2 · R3 W + R > N 2 + 2 > 3 ✓ sets overlap
With N = 3, W = 2, R = 2, the write set and read set must share at least one replica. That replica holds the newest value, so the read is guaranteed to see it.
W + R > N  →  every read set intersects every write set in ≥ 1 replica

The intuition: if your W writers and your R readers together exceed the N replicas, by the pigeonhole principle at least one replica was written and read, so the latest value is in the read response — the coordinator just picks the newest. Tuning the two knobs lets the same cluster serve fast writes and strong reads, or the reverse:

Config (N=3)ProfileStrong read?
W=1, R=1Fastest, fully eventual — both calls return on one ack.No
W=2, R=2Balanced quorum — the common default.Yes (4>3)
W=1, R=3Fast writes, strong reads — read all replicas.Yes (4>3)
W=3, R=1Strong, read-optimised — write all, read one.Yes (4>3)

What "consistent" actually promises Slide 8

Quorum settings are mechanism; consistency models are the contract those settings expose to the application. The same cluster can serve different guarantees on different code paths.

Strong consistency

Every read returns the most recently completed write, as if on one global timeline. Achieved with full quorums (W=N or R=N) or consensus like Paxos/Raft. Cost: a slow or partitioned replica blocks the operation.

Eventual consistency

If writes stop, all replicas converge — eventually. Meanwhile a reader may see an older version or watch a value briefly go backwards. Cost: application code must be idempotent and conflict-tolerant.

Read-your-writes

A session-scoped guarantee: a client never sees its own writes regress. Implemented with a sticky replica or a version attached to later reads. A pragmatic middle ground for user-facing features.

Other useful flavours

Monotonic reads never go backwards; causal consistency preserves "A happened before B" for every observer; bounded staleness caps lag at t seconds or k versions. Real systems compose several.

Vector clocks: who really wrote last? Slide 9

Under an AP design, two clients can write the same key on different replicas during a partition. The store must detect this and either pick a deterministic winner or hand both versions back to the application. A vector clock tags each write with a per-replica counter, e.g. [(A,1),(B,0),(C,0)]; a replica increments its own counter on each write. If one version's counters are component-wise ≤ another's, they are causally ordered — keep the newer. If neither dominates, they are concurrent: return both as siblings and let the caller merge (a cart unions items, a counter sums).

Last-write-wins is the lazy cousin

Attach a wall-clock timestamp and let the larger one win. Fast and stateless, but it silently drops a write whenever clocks disagree or two writes are truly concurrent. Use LWW only where losing a write is genuinely acceptable.

Keep writing when replicas vanish Slide 10

Hosts reboot, switches flap, garbage collectors stall. An AP store treats a short-lived failure as a routing problem, not a data-loss event. Two mechanisms keep it writable, and a third tells it who is alive.

Sloppy quorum
A preference-list replica is unreachable, so the coordinator writes to the next healthy node clockwise instead. W acks still come back; the write still succeeds.
Hinted handoff
The stand-in tags the write with a hint — "this belongs to N6." When N6 returns, peers replay the buffered hints and the data lands where it should.
Failure detection
A gossip protocol spreads heartbeats so every node has an eventually-consistent view of who is alive — no central master to fail.

Anti-entropy with Merkle trees Slide 11

Hints cover minutes. When a replica is gone long enough that hints expire, two replicas must compare their entire keyspaces and copy whatever is missing — but a full scan is impossibly expensive. A Merkle tree hashes each keyspace range into a leaf, hashes children up to a single root fingerprint, then lets two replicas exchange just the root: if equal, they are already in sync. If not, they descend only into the subtrees whose hashes differ, isolating the exact divergent ranges in O(log n) comparisons instead of a linear scan. It also runs as a periodic background repair and catches silent disk corruption.

The write path: commit log → memtable → SSTable Slide 12

Each node's storage engine is an LSM tree: it makes writes cheap by turning random updates into sequential appends. A write is appended to a durable commit log and fsync'd, then inserted into an in-memory sorted memtable. When the memtable fills, it is frozen and flushed to disk as an immutable, sorted SSTable; background compaction later merges SSTables, dropping superseded versions and tombstones.

put(k,v) client Commit Log append + fsync Memtable sorted · in RAM SSTable n (newest) SSTable n-1 SSTable n-2 … immutable · sorted · on disk flush compaction merges & dedupes
Durability first (commit log), then RAM speed (memtable); flushed SSTables are immutable and sorted, and compaction quietly pays down the disk debt.

The read path: memtable → bloom filter → disk Slide 13

Because the newest version of a key might be in the memtable or in any of several SSTables, a read could touch many files — that is the read amplification cost of an LSM design. The bloom filter is what keeps it sane: a tiny probabilistic structure per SSTable that answers "definitely not here" or "possibly here," letting the engine skip most files without a disk seek.

1 Check the memtable The most recent write lives in RAM — a hit returns in microseconds, no disk at all.
2 Ask each SSTable's bloom filter "Definitely not" skips the file entirely; only "possibly" files are opened. This avoids the vast majority of seeks.
3 Block index → one block For a candidate file, a sparse index points to the right block; we read exactly that one block from disk.
4 Newest-wins merge If the key surfaces in several places, keep the version with the highest sequence number or timestamp. Hot keys are served from page/block cache.

Active recall

Cover the answers. Say each one out loud before you tap to check.

Why a consistent-hashing ring instead of key % N?
Think about adding a node.
Adding or removing a node moves only ~1/N of keys — its neighbours' arcs — instead of remapping the whole keyspace. Virtual nodes smooth the load.
What does the preference list skip, and why?
Failure domains.
Replicas in a rack or zone already represented, so the N copies land in distinct failure domains — one rack outage can't take all three.
State the quorum overlap rule and what it guarantees.
An inequality.
W + R > N. Then every read set intersects every write set in at least one replica, which carries the latest write — so a read can see it.
During a partition, what must an AP store give up?
Two of three.
Strong consistency. It keeps accepting writes on both sides and reconciles the divergent versions later; apps may read a stale or conflicting value.
Concurrent vector clocks — what does the store do?
Neither dominates.
Return both versions as siblings and let the application merge them (cart unions items, counter sums). LWW would silently drop one.
What keeps an LSM read from touching every SSTable?
Probabilistic.
A bloom filter per SSTable answers "definitely not here," so the engine skips most files without a disk seek; only "possibly here" files are opened.

Check yourself

Q1 With N = 3, which configuration guarantees a read sees the latest write?
Why: The overlap rule is W + R > N. Only 2 + 2 = 4 > 3, so the read and write sets are forced to share a replica that holds the newest value.
Q2 A network partition splits the cluster. What does an AP store choose to do?
Why: AP keeps both sides writable and merges divergent versions after the partition heals, trading away strong consistency during the split.
Q3 Two vector clocks where neither is component-wise ≤ the other mean the writes are…
Why: Neither dominating means the writes are concurrent. The store returns both as siblings for the application to merge — LWW would silently lose one.
Q4 A preference-list replica is down mid-write. Sloppy quorum plus hinted handoff will…
Why: The coordinator writes to the next healthy node with a hint, so W is still met; when the original returns, the buffered hint is replayed to it.
Q5 In the LSM read path, what is the bloom filter's job?
Why: A per-SSTable bloom filter answers "definitely not here," letting the engine avoid opening files that cannot contain the key, cutting most disk seeks.