Chapter 5 · System Design Fundamentals
Consistent Hashing
Sharding data and spreading cache keys across a fleet of servers is only easy until the fleet changes size. Consistent hashing is the technique that lets a cluster grow, shrink, or lose a node while moving only a small slice of the keys — and it is a building block that surfaces again and again in interviews.
▶ Open the companion slidesThe crucial reframe: the hard part of distribution is not the first placement — it is the re-placement. Any hash function can spread keys evenly across a fixed set of servers. The question that separates a toy from a production system is what happens the moment a server is added, removed, or dies. Consistent hashing answers that question by minimizing how much the assignment has to change.
Alex Xu, System Design Interview (Vol. 1), Chapter 5. Companion deck: slides — jump to any slide with the Slide N chips. Go deeper: the original idea comes from Karger et al., "Consistent Hashing and Random Trees" (STOC 1997).
The problem: mod-N breaks on resize Slide 2
The naive way to spread keys over N servers is server = hash(key) mod N. It is fast and even — until N changes. Add or remove a single node and the modulus changes for every key, so almost all of them land on a different bucket. For a cache that is a stampede of misses hitting the origin; for a database it is a wholesale rebalance and downtime.
The fix: bend the hash space into a ring Slide 3
Pick a hash function with a large output space — say the 128-bit range of MD5. Lay that range end to end and join the two ends into a circle. Every possible hash value is now a single point on the ring, and servers and keys both map onto the same circle through the same hash function. Crucially, a server's spot depends only on its identifier, never on the count of nodes — so resizing no longer touches the formula.
Servers claim positions by hashing themselves Slide 4
Compute hash(A), hash(B), and so on for each node's stable identifier — its IP, hostname, or a synthetic label. Each output lands the node somewhere on the circle. By convention, a server owns the arc of the ring that ends at its position: the next server clockwise marks the boundary of its territory. Most implementations give every node a copy of the membership list, so any node can answer a lookup locally without a central coordinator.
A key belongs to the first server clockwise Slide 5
Hash the key, drop it onto the ring, then walk clockwise until you hit a server — that server owns the key. The walk is conceptual; in code it is a binary search over the sorted node positions, so lookup is O(log N). The same rule covers reads, writes, and replica placement, and there is no seam: a key past the last server simply wraps past 0 to the first.
Adding a node rearranges only one arc Slide 6
Suppose a new server E joins between B and C. Every key that used to walk past E to reach C now stops at E instead — and nothing else moves. Keys elsewhere on the ring keep the same successor. On average, joining a cluster of N nodes relocates only about 1/N of the keys, versus nearly all of them under mod-N.
Removing a node folds its arc into the next Slide 7
Removal is the mirror image. If a server fails or is decommissioned, every key it owned was, by definition, in the arc ending at its position. Those keys simply walk one step further to the next surviving server clockwise; no other arc is disturbed. A failure is just a node deletion — convenient, but, as the next section shows, also exactly where the trouble starts.
Side by side: what moves when membership changes
The whole value of the ring is captured in one column — the fraction of keys that have to relocate when a single node joins or leaves.
| Property | Mod-N hashing | Consistent hashing (ring) |
|---|---|---|
| Keys moved on a node join | ≈ (N−1)/N | ≈ 1/N |
| Keys moved on a node leave | ≈ (N−1)/N | ≈ 1/N |
| Routing table to maintain | none, but recompute on resize | none — the hash is the map |
| Lookup cost | O(1) | O(log N) |
| Even load by default | yes, for fixed N | no — needs virtual nodes |
The catch: a small ring is lumpy Slide 8
A uniform hash spreads keys evenly only on average. With just a handful of randomly placed points on a huge circle, some arcs end up far wider than others — and the server at the clockwise end of a wide arc owns a disproportionate share of the keyspace and its traffic. Worse, when a node fails its entire arc lands on a single neighbor, whose load can instantly double and knock it over too.
With N = 4 or N = 8 real servers, variance dominates: gaps of 2× or 3× the mean are common, and skewed real-world traffic only compounds a hot arc. The fix is to put far more points on the ring.
Virtual nodes smooth the load Slide 9
Instead of placing each server once, place it dozens or hundreds of times. Each replica is a virtual node, generated by hashing labels like A#0, A#1, A#2, and so on — all still owned by the same physical server. The law of large numbers now works for us: each physical server ends up owning roughly the same total arc length, and when one dies, its many small arcs are absorbed by many different neighbors rather than one.
Heterogeneous capacity, easily
Give a beefier machine more virtual nodes than a small one. Capacity is dialed in by changing a single integer per server.
The cost: a bigger ring
50 servers × 200 vnodes = 10,000 positions. Lookups stay logarithmic, but the sorted structure uses more memory on every node.
Where you'll meet it in real infra Slide 10
The pattern shows up wherever a fleet of servers must agree on which one owns which piece of data while the fleet churns underneath. Three families dominate.
Distributed caches
Clients hash request keys onto a ring of cache servers; adding or rebooting one box re-warms only its share, not the whole fleet. Memcached client libraries, Ketama, Twemproxy.
Partitioned databases
Each row's primary key is hashed onto a ring; the owning node holds the data and forwards copies to a few clockwise neighbors as replicas. Amazon Dynamo, Cassandra, Riak, ScyllaDB.
Edge and CDNs
An edge layer hashes an object's URL to pick which cache node serves it; nodes join and leave with most content staying put. Akamai-style edges, Cloudflare, Discord guild sharding.
The common thread
Stateless or partly-stateful servers, constant churn, and a need to keep most data exactly where it is during a rolling deploy or a single failure.
Tradeoffs: it solves one problem and adds smaller ones Slide 11
Adopting the ring is rarely an afternoon's work. None of the following are showstoppers, but they are the issues that consume the rest of the design.
Membership and gossip
Nodes must learn quickly when peers join or leave. Most clusters layer a gossip protocol on the ring to converge the membership view — an extra subsystem with its own failure modes.
Replica placement
The next clockwise nodes may share a rack, zone, or power domain. Production rules skip neighbors until rack and region diversity is satisfied.
Bounded load
Even with virtual nodes the spread is statistical and some shards run hot. Modern variants add per-node capacity caps and overflow to the next node.
Rebalance is data movement
Only a small share of keys move, but that share is still physically copied. Throttle the transfer so live reads and writes keep their bandwidth.
Active recall
Cover the answers. Say each one out loud before you tap to check.
hash(key) mod N for almost every key, so
roughly (N−1)/N of all keys remap at once — a cache stampede or a
wholesale database rebalance.Check yourself
hash(key) mod N behave badly on resize?