01 / 14
Chapter Six

Design a & Key-Value Store

A guided tour through the moving parts of a distributed, highly available, eventually consistent storage system — from a single hash map to a ring of replicas surviving real-world failure.

System Design Notebook · Chapter 06
02 / 14
Brief

What we're actually building

Before drawing rings and quorums, agree on the contract and the operating envelope. The two surface methods are tiny; everything else is non-functional pressure.

Surface API

put(key, value) — store or overwrite an opaque blob keyed by a short string.

get(key) — return the most recent value, or a tombstone if the key was deleted.

Scale targets

  • Tens of billions of keys, values from bytes to a few hundred kilobytes.
  • Six-figure read/write QPS, with hot keys an order of magnitude above the mean.
  • Cluster sized in hundreds to thousands of commodity nodes spread over several data centres.

Availability

Always-writable goal. The store must accept writes even when some replicas, a rack, or an entire zone is unreachable.

Consistency

Tunable per operation. Callers may ask for strong reads, fast eventually-consistent reads, or a middle quorum. The system never silently picks for them.

Out of scope

Range scans, secondary indexes, multi-key transactions, and a SQL surface. These belong in a different chapter.

03 / 14
Starting point

One box, one hash map

A key-value store on a single machine is twenty lines of code. Understanding why it cannot survive contact with reality justifies every section that follows.

The naive version

  • A process-local hash map gives O(1) average put and get.
  • Reads and writes never leave the CPU; latency is dominated by network and serialisation.
  • Disk durability is bolted on by appending each write to a log before acknowledging.

Why it falls over

  • Memory ceiling. The working set must fit in RAM, which caps you well before billions of keys.
  • Single point of failure. One bad disk, one kernel panic, one cable pull, and all data is gone or unreachable.
  • No headroom. Vertical scaling buys time, not solutions. Eventually traffic exceeds what one box can serve.
Client put/get SINGLE NODE In-Memory Hash Map "user:42" → {…} "cart:7" → {…} "sess:a9" → {…} Append-Only Log fsync on each put disk
A single-node KV store with write-ahead log
04 / 14
Partitioning

Spread the keys around a ring

Hash the key, place it on a circle, walk clockwise to the first node. When a node joins or leaves only its neighbours' slices move — not the entire keyspace.

N1 N2 N3 N4 hash(k) → walks to N2 hash ring 0 → 2¹²⁸
Hash ring with virtual node ticks — each physical node owns many small arcs

How keys find a home

  • Hash the key with a uniform function (MD5, MurmurHash) to land on a 128-bit circle.
  • Each node claims a token on the ring. A key is owned by the first node clockwise from its hash.
  • Virtual nodes let one physical machine hold many tokens, smoothing out the inevitable lumps a small cluster would otherwise show.

Why a ring instead of key % N

  • Adding a node moves only 1/N of keys on average, not all of them.
  • Removing a failed node hands its arcs to immediate neighbours — recovery is local.
  • Heterogeneous hardware is handled by giving beefier nodes more virtual tokens.
05 / 14
Replication

Don't store anything just once

Pick a replication factor N — usually three. Place those replicas thoughtfully so a single rack or zone never wipes out all copies of the same key.

The N replicas of a key

  • The first replica is the coordinator — the node the ring originally pointed at.
  • The next N−1 nodes clockwise also store the value. This is the preference list.
  • Replicas are skipped if they sit in a rack or zone already represented, so every key has copies in at least two failure domains.

Sizing the factor

  • N = 3 is the industry default: survives one failure cleanly, two with caveats.
  • Larger N buys durability but costs disk and write amplification.
  • Cross-region replication is layered on top, not folded into the same ring.
RACK A RACK B RACK C RACK D N1 N2 N3 N4 N5 N6 N7 N8 hash(k) 1 skip 2 3 preference list, N=3, rack-aware
Walking clockwise, skipping replicas in the same rack
06 / 14
Tradeoffs

The CAP corner you can't escape

When the network splits, you cannot offer both linearisable consistency and full availability. Every distributed KV store is, in that moment, choosing.

CP — Consistent under partition

Refuse writes (or reads) on the minority side until the partition heals. No client ever sees stale or conflicting state. The cost is downtime for some users when the network misbehaves.

Spanner, HBase, etcd, ZooKeeper.

AP — Available under partition

Both sides keep accepting writes. Divergent versions are reconciled later. The cost is application code that must tolerate seeing an older or conflicting value.

Dynamo, Cassandra, Riak.

What the theorem doesn't say

  • Outside a partition you can — and should — have all three. CAP is a worst-case constraint, not a steady-state one.
  • Real systems are tunable per request: a single deployment may serve quorum reads for some keys and last-write-wins for others.
  • Latency hides inside the C in practice. The PACELC refinement asks: when there is no partition, are you choosing latency or consistency?

The honest framing

Pick AP if availability is the product. Pick CP if a wrong answer is worse than no answer. Don't pretend you've avoided the choice — you've just made it implicitly.

07 / 14
Quorum

N, W, R and the overlap rule

Quorum makes the consistency dial tunable. Three knobs control how many replicas must respond before the coordinator returns to the client.

KnobMeaning
NReplicas per key in the preference list.
WAcks required before a write is considered successful.
RResponses required before a read is returned.

The overlap rule

If W + R > N, any read set intersects every write set in at least one replica. That replica carries the latest value, so the coordinator can detect and return it.

N=3, W=3, R=1 · read-heavy, strong N=3, W=1, R=3 · write-heavy, strong N=3, W=2, R=2 · balanced quorum N=3, W=1, R=1 · fastest, eventual
Client Coordinator N=3, W=2 put R1 R2 R3 ack ack slow OK (2/3) 2 acks reach W ⇒ commit
A W=2 write: two acks are enough to commit, the third can lag
08 / 14
Models

What "consistent" actually promises

Quorum settings are mechanism. Consistency models are the contract those mechanisms expose to the application.

Strong consistency

Every read returns the value of the most recently completed write, as if there were one global timeline. Achieved with full quorums (W=N or R=N) or a consensus protocol like Paxos or Raft.

Cost: a slow or partitioned replica blocks the read or write.

Eventual consistency

If writes stop, all replicas converge to the same value — eventually. In the meantime a reader may see an older version, or briefly observe the value going backwards.

Cost: application code must be idempotent and conflict-tolerant.

Read-your-writes

A session-scoped guarantee: a client never sees its own writes regress. Implemented by routing the same session through a sticky replica, or by attaching a version vector to subsequent reads.

A pragmatic middle ground for user-facing features.

Other useful flavours

Monotonic reads — successive reads never see time go backwards. Causal consistency — if A happened before B, every observer sees that order. Bounded staleness — reads may be stale, but only by at most t seconds or k versions. Real systems often compose several of these.

09 / 14
Versioning

Vector clocks, or: who really wrote last?

Under an AP design, two clients can write to the same key on different replicas during a partition. The store needs to detect this and either pick a winner deterministically or hand both versions back to the application.

Vector clocks

  • Each replica tags every write with a counter: [(A,1),(B,0),(C,0)].
  • On a subsequent write, the writing replica increments its own counter.
  • Two versions V₁ and V₂ are causally ordered if one's counters are component-wise ≤ the other's — keep the newer.
  • If neither dominates, the versions are concurrent. Return both as siblings and let the caller merge — a shopping cart unions items, a counter sums, etc.

Last-write-wins

  • Simpler: attach a wall-clock or hybrid timestamp; the larger value wins.
  • Fast and stateless, but quietly drops one write when clocks disagree or when two writes truly are concurrent. Use only where losing a write is acceptable.
time Replica A Replica B Replica C v1 [A:1, B:0, C:0] network partition v2a [A:2, B:0, C:0] v2b [A:1, B:1, C:0] partition heals conflict: neither dominates → siblings v3 [A:2, B:1, C:0] (app-merged)
Two writes diverge during a partition, surface as siblings on heal
10 / 14
Transient failure

Keep writing when replicas go missing

Hosts reboot, switches flap, garbage collectors stall. The store treats short-lived failures as a routing problem, not a data-loss event.

Sloppy quorum

If a replica in the preference list is unreachable, the coordinator picks the next healthy node clockwise on the ring and writes there instead. W acks still come back, the operation still succeeds, and availability survives a localised outage.

Hinted handoff

The stand-in node tags the write with a hint — "this really belongs to N5." When N5 returns, its peers replay the buffered hints to it, and the data lands where it should. The hint store is bounded; if a node stays down too long, anti-entropy takes over.

Failure detection

A gossip protocol exchanges heartbeats so every node has an approximate, eventually-consistent view of who is alive. There is no central master to fail.

Coordinator N5 N6 N7 N8 unreachable hint: "for N6" replay when N6 returns sloppy quorum + hinted handoff
N6 is down; N8 buffers the write and hands it back later
11 / 14
Permanent repair

Anti-entropy with Merkle trees

When a replica is gone long enough that hints expire, replicas must compare their entire keyspaces and copy whatever's missing. A naïve scan is impossibly expensive — Merkle trees let two nodes find their differences in logarithmic comparisons.

Replica A Replica B H₀ H₁ H₂ h1 h2 h3 h4 H₀' H₁ H₂' h1 h2 h3 h4' H₀ ≠ H₀' H₁ = H₁' (skip) H₂ ≠ H₂' (descend) found: leaf h4 disagrees → sync that range O(log n) comparisons to locate a divergent block
Compare roots, descend only into mismatching subtrees

How it works

  • Each replica partitions its keyspace into fixed ranges and hashes the contents of each range into a leaf.
  • Internal nodes hash their children; the root summarises the whole partition in one fingerprint.
  • Two replicas exchange the root first. If equal, they are already in sync — done.
  • If not, they descend only into the subtrees whose hashes differ, isolating the exact ranges that need to be streamed across.

When it kicks in

  • Periodic background process between replica pairs, throttled so it doesn't starve foreground traffic.
  • After a node returns from long downtime, before it's allowed back into the quorum.
  • After detecting silent corruption — disks lie occasionally, and the tree will notice.
12 / 14
On-disk write path

Commit log → memtable → SSTable

Writes are made cheap by turning them into sequential appends. The storage engine is essentially an LSM tree: hot data in RAM, cold data in immutable sorted files on disk.

put(k,v) Commit Log append + fsync durability Memtable sorted map (RAM) recent writes flush when full SSTable n (newest) SSTable n-1 SSTable n-2 SSTable … SSTables on disk immutable, sorted Compaction merge + dedupe drops tombstones disk

Commit log

Every write is appended to a per-node log on durable storage and fsync'd before the ack. Sequential I/O is fast; the log lets us replay any state lost in a crash.

Memtable

An in-memory sorted structure (skip list or balanced tree). Absorbs writes at RAM speed. When it crosses a size or age threshold, it is frozen and flushed to disk as a new immutable SSTable.

SSTables & compaction

Sorted String Tables: immutable, indexed files. Background compaction merges them, drops superseded versions and tombstones, and keeps read amplification bounded.

13 / 14
On-disk read path

Memtable, bloom filter, then disk

A read may have to consult the memtable plus several SSTables to find the latest version of a key. Bloom filters and block indexes keep the disk I/O sane.

The lookup ladder

  • Memtable first. The most recent write is in RAM; if it's there, we're done in microseconds.
  • Bloom filter per SSTable. A small probabilistic structure says "definitely not here" or "possibly here." Skipping definitely-not files avoids most disk seeks.
  • Block index → block. For the SSTables that might contain the key, a sparse index points to the right block; we read that one block.
  • Newest-wins merge. If the key appears in several SSTables, take the version with the highest sequence number or timestamp.

Read amplification

  • One logical read can touch several files — that's the cost of an LSM design.
  • Bloom filters slash false candidates; compaction strategy bounds the number of levels the read has to traverse.
  • Hot keys are served from OS page cache or an internal block cache; very hot keys never touch disk at all.
get(k) Memtable hit? return SSTables check each Bloom filter "maybe?" / "no" Block index point to block Data block read one block Merge versions newest wins return v
Memtable → bloom filter → block index → data block → merge
14 / 14
Closing

What to carry away

A distributed key-value store is not one big idea — it's a stack of small, correct mechanisms layered carefully. These are the principles worth remembering when you sketch your own.

— I —
Hash, don't modulo

Consistent hashing with virtual nodes spreads load evenly and keeps the cost of joining or losing a node proportional to 1/N.

— II —
Replicate across failure domains

Three copies on three machines in three racks is the floor, not the ceiling. Anything less is single-zone storage with extra steps.

— III —
Make consistency a knob

Expose N, W, R per request. Strong, eventual, and read-your-writes are valid choices for different paths in the same product.

— IV —
Plan for divergence

Vector clocks or LWW: pick one consciously. Either way, design data types that merge gracefully — counters that sum, sets that union, carts that combine.

— V —
Repair, don't just retry

Hinted handoff covers minutes of failure; Merkle-tree anti-entropy covers hours and days. Both have to be running for replicas to stay honest.

— VI —
Turn random writes into sequential ones

Commit log plus memtable plus SSTable is how you get cheap durable writes, with compaction quietly paying down the disk debt in the background.

Chapter 6 · Design a Key-Value Store
← → / space · F fullscreen
📖 Read the lesson