A guided tour through the moving parts of a distributed, highly available, eventually consistent storage system — from a single hash map to a ring of replicas surviving real-world failure.
System Design Notebook · Chapter 06
02 / 14
Brief
What we're actually building
Before drawing rings and quorums, agree on the contract and the operating envelope. The two surface methods are tiny; everything else is non-functional pressure.
Surface API
put(key, value) — store or overwrite an opaque blob keyed by a short string.
get(key) — return the most recent value, or a tombstone if the key was deleted.
Scale targets
Tens of billions of keys, values from bytes to a few hundred kilobytes.
Six-figure read/write QPS, with hot keys an order of magnitude above the mean.
Cluster sized in hundreds to thousands of commodity nodes spread over several data centres.
Availability
Always-writable goal. The store must accept writes even when some replicas, a rack, or an entire zone is unreachable.
Consistency
Tunable per operation. Callers may ask for strong reads, fast eventually-consistent reads, or a middle quorum. The system never silently picks for them.
Out of scope
Range scans, secondary indexes, multi-key transactions, and a SQL surface. These belong in a different chapter.
03 / 14
Starting point
One box, one hash map
A key-value store on a single machine is twenty lines of code. Understanding why it cannot survive contact with reality justifies every section that follows.
The naive version
A process-local hash map gives O(1) average put and get.
Reads and writes never leave the CPU; latency is dominated by network and serialisation.
Disk durability is bolted on by appending each write to a log before acknowledging.
Why it falls over
Memory ceiling. The working set must fit in RAM, which caps you well before billions of keys.
Single point of failure. One bad disk, one kernel panic, one cable pull, and all data is gone or unreachable.
No headroom. Vertical scaling buys time, not solutions. Eventually traffic exceeds what one box can serve.
A single-node KV store with write-ahead log
04 / 14
Partitioning
Spread the keys around a ring
Hash the key, place it on a circle, walk clockwise to the first node. When a node joins or leaves only its neighbours' slices move — not the entire keyspace.
Hash ring with virtual node ticks — each physical node owns many small arcs
How keys find a home
Hash the key with a uniform function (MD5, MurmurHash) to land on a 128-bit circle.
Each node claims a token on the ring. A key is owned by the first node clockwise from its hash.
Virtual nodes let one physical machine hold many tokens, smoothing out the inevitable lumps a small cluster would otherwise show.
Why a ring instead of key % N
Adding a node moves only 1/N of keys on average, not all of them.
Removing a failed node hands its arcs to immediate neighbours — recovery is local.
Heterogeneous hardware is handled by giving beefier nodes more virtual tokens.
05 / 14
Replication
Don't store anything just once
Pick a replication factor N — usually three. Place those replicas thoughtfully so a single rack or zone never wipes out all copies of the same key.
The N replicas of a key
The first replica is the coordinator — the node the ring originally pointed at.
The next N−1 nodes clockwise also store the value. This is the preference list.
Replicas are skipped if they sit in a rack or zone already represented, so every key has copies in at least two failure domains.
Sizing the factor
N = 3 is the industry default: survives one failure cleanly, two with caveats.
Larger N buys durability but costs disk and write amplification.
Cross-region replication is layered on top, not folded into the same ring.
Walking clockwise, skipping replicas in the same rack
06 / 14
Tradeoffs
The CAP corner you can't escape
When the network splits, you cannot offer both linearisable consistency and full availability. Every distributed KV store is, in that moment, choosing.
CP — Consistent under partition
Refuse writes (or reads) on the minority side until the partition heals. No client ever sees stale or conflicting state. The cost is downtime for some users when the network misbehaves.
Spanner, HBase, etcd, ZooKeeper.
AP — Available under partition
Both sides keep accepting writes. Divergent versions are reconciled later. The cost is application code that must tolerate seeing an older or conflicting value.
Dynamo, Cassandra, Riak.
What the theorem doesn't say
Outside a partition you can — and should — have all three. CAP is a worst-case constraint, not a steady-state one.
Real systems are tunable per request: a single deployment may serve quorum reads for some keys and last-write-wins for others.
Latency hides inside the C in practice. The PACELC refinement asks: when there is no partition, are you choosing latency or consistency?
The honest framing
Pick AP if availability is the product. Pick CP if a wrong answer is worse than no answer. Don't pretend you've avoided the choice — you've just made it implicitly.
07 / 14
Quorum
N, W, R and the overlap rule
Quorum makes the consistency dial tunable. Three knobs control how many replicas must respond before the coordinator returns to the client.
Knob
Meaning
N
Replicas per key in the preference list.
W
Acks required before a write is considered successful.
R
Responses required before a read is returned.
The overlap rule
If W + R > N, any read set intersects every write set in at least one replica. That replica carries the latest value, so the coordinator can detect and return it.
A W=2 write: two acks are enough to commit, the third can lag
08 / 14
Models
What "consistent" actually promises
Quorum settings are mechanism. Consistency models are the contract those mechanisms expose to the application.
Strong consistency
Every read returns the value of the most recently completed write, as if there were one global timeline. Achieved with full quorums (W=N or R=N) or a consensus protocol like Paxos or Raft.
Cost: a slow or partitioned replica blocks the read or write.
Eventual consistency
If writes stop, all replicas converge to the same value — eventually. In the meantime a reader may see an older version, or briefly observe the value going backwards.
Cost: application code must be idempotent and conflict-tolerant.
Read-your-writes
A session-scoped guarantee: a client never sees its own writes regress. Implemented by routing the same session through a sticky replica, or by attaching a version vector to subsequent reads.
A pragmatic middle ground for user-facing features.
Other useful flavours
Monotonic reads — successive reads never see time go backwards. Causal consistency — if A happened before B, every observer sees that order. Bounded staleness — reads may be stale, but only by at most t seconds or k versions. Real systems often compose several of these.
09 / 14
Versioning
Vector clocks, or: who really wrote last?
Under an AP design, two clients can write to the same key on different replicas during a partition. The store needs to detect this and either pick a winner deterministically or hand both versions back to the application.
Vector clocks
Each replica tags every write with a counter: [(A,1),(B,0),(C,0)].
On a subsequent write, the writing replica increments its own counter.
Two versions V₁ and V₂ are causally ordered if one's counters are component-wise ≤ the other's — keep the newer.
If neither dominates, the versions are concurrent. Return both as siblings and let the caller merge — a shopping cart unions items, a counter sums, etc.
Last-write-wins
Simpler: attach a wall-clock or hybrid timestamp; the larger value wins.
Fast and stateless, but quietly drops one write when clocks disagree or when two writes truly are concurrent. Use only where losing a write is acceptable.
Two writes diverge during a partition, surface as siblings on heal
10 / 14
Transient failure
Keep writing when replicas go missing
Hosts reboot, switches flap, garbage collectors stall. The store treats short-lived failures as a routing problem, not a data-loss event.
Sloppy quorum
If a replica in the preference list is unreachable, the coordinator picks the next healthy node clockwise on the ring and writes there instead. W acks still come back, the operation still succeeds, and availability survives a localised outage.
Hinted handoff
The stand-in node tags the write with a hint — "this really belongs to N5." When N5 returns, its peers replay the buffered hints to it, and the data lands where it should. The hint store is bounded; if a node stays down too long, anti-entropy takes over.
Failure detection
A gossip protocol exchanges heartbeats so every node has an approximate, eventually-consistent view of who is alive. There is no central master to fail.
N6 is down; N8 buffers the write and hands it back later
11 / 14
Permanent repair
Anti-entropy with Merkle trees
When a replica is gone long enough that hints expire, replicas must compare their entire keyspaces and copy whatever's missing. A naïve scan is impossibly expensive — Merkle trees let two nodes find their differences in logarithmic comparisons.
Compare roots, descend only into mismatching subtrees
How it works
Each replica partitions its keyspace into fixed ranges and hashes the contents of each range into a leaf.
Internal nodes hash their children; the root summarises the whole partition in one fingerprint.
Two replicas exchange the root first. If equal, they are already in sync — done.
If not, they descend only into the subtrees whose hashes differ, isolating the exact ranges that need to be streamed across.
When it kicks in
Periodic background process between replica pairs, throttled so it doesn't starve foreground traffic.
After a node returns from long downtime, before it's allowed back into the quorum.
After detecting silent corruption — disks lie occasionally, and the tree will notice.
12 / 14
On-disk write path
Commit log → memtable → SSTable
Writes are made cheap by turning them into sequential appends. The storage engine is essentially an LSM tree: hot data in RAM, cold data in immutable sorted files on disk.
Commit log
Every write is appended to a per-node log on durable storage and fsync'd before the ack. Sequential I/O is fast; the log lets us replay any state lost in a crash.
Memtable
An in-memory sorted structure (skip list or balanced tree). Absorbs writes at RAM speed. When it crosses a size or age threshold, it is frozen and flushed to disk as a new immutable SSTable.
SSTables & compaction
Sorted String Tables: immutable, indexed files. Background compaction merges them, drops superseded versions and tombstones, and keeps read amplification bounded.
13 / 14
On-disk read path
Memtable, bloom filter, then disk
A read may have to consult the memtable plus several SSTables to find the latest version of a key. Bloom filters and block indexes keep the disk I/O sane.
The lookup ladder
Memtable first. The most recent write is in RAM; if it's there, we're done in microseconds.
Bloom filter per SSTable. A small probabilistic structure says "definitely not here" or "possibly here." Skipping definitely-not files avoids most disk seeks.
Block index → block. For the SSTables that might contain the key, a sparse index points to the right block; we read that one block.
Newest-wins merge. If the key appears in several SSTables, take the version with the highest sequence number or timestamp.
Read amplification
One logical read can touch several files — that's the cost of an LSM design.
Bloom filters slash false candidates; compaction strategy bounds the number of levels the read has to traverse.
Hot keys are served from OS page cache or an internal block cache; very hot keys never touch disk at all.
Memtable → bloom filter → block index → data block → merge
14 / 14
Closing
What to carry away
A distributed key-value store is not one big idea — it's a stack of small, correct mechanisms layered carefully. These are the principles worth remembering when you sketch your own.
— I —
Hash, don't modulo
Consistent hashing with virtual nodes spreads load evenly and keeps the cost of joining or losing a node proportional to 1/N.
— II —
Replicate across failure domains
Three copies on three machines in three racks is the floor, not the ceiling. Anything less is single-zone storage with extra steps.
— III —
Make consistency a knob
Expose N, W, R per request. Strong, eventual, and read-your-writes are valid choices for different paths in the same product.
— IV —
Plan for divergence
Vector clocks or LWW: pick one consciously. Either way, design data types that merge gracefully — counters that sum, sets that union, carts that combine.
— V —
Repair, don't just retry
Hinted handoff covers minutes of failure; Merkle-tree anti-entropy covers hours and days. Both have to be running for replicas to stay honest.
— VI —
Turn random writes into sequential ones
Commit log plus memtable plus SSTable is how you get cheap durable writes, with compaction quietly paying down the disk debt in the background.