Chapter Five System Design · Distribution

Chapter 05

Consistent Hashing

A scheme for spreading keys across a changing set of servers — so that adding or losing a node shifts only a small slice of the data, rather than every key in the system.

A System Design Primer · Distributed Storage

02 · The Problem Why mod-N breaks

The naive approach

Mod-N hashing collapses the moment you resize the cluster.

If you map keys to servers with server = hash(key) mod N, the assignment is fast and even — until N changes. Adding or removing a single node reshuffles nearly every key. For a cache, that is a stampede; for a database, that is downtime.

The formula

Each key is hashed to a number, then taken modulo the server count. With 4 servers, key k lives on bucket hash(k) % 4.

Resize → rehash storm

Move from 4 servers to 5 and the modulus changes. Roughly (N-1)/N of all keys land on a different bucket. At 4→5 that is about 80% of the cache invalidated at once.

Why it hurts

Origin servers see a thundering herd. Warm caches go cold. Replicated data must be rebalanced wholesale. The cluster takes minutes or hours to recover.

A single new server invalidates most of the cache

03 · The Concept The hash ring

A circular hash space

Bend the hash space into a ring.

Pick a hash function with a large output space — for instance the 128-bit space of MD5. Imagine its full range laid end to end and joined into a circle. Every possible hash value is a single point on that circle. Servers and keys both live on the same ring.

One shared coordinate system

Both nodes and keys map to positions on the ring through the same hash function. There is no separate routing table to maintain.

Position is determined by hash, not by count

A server's spot on the ring depends only on its identifier — its IP, hostname, or a chosen label. The total number of nodes never enters the calculation.

The ring is huge and sparse

Typical hash spaces hold 2^32 or 2^128 positions. A real cluster of dozens of servers occupies a tiny, scattered fraction of the circle.

A continuous, wrap-around hash range, walked in one direction

04 · Servers Anchoring nodes

Step one

Each server claims its position by hashing its own identifier.

Compute hash(server-A), hash(server-B), and so on. The output lands each node somewhere on the ring. Because the hash function distributes outputs roughly uniformly, the nodes scatter around the circle — though, as we will see, never perfectly evenly.

A, B, C, D placed by hash output

Identifier in, position out

The label hashed is up to you — IP address, DNS name, or a synthetic ID. What matters is that it is stable; the same node always lands in the same spot.

Each node owns the arc behind it

By convention, a server is responsible for the segment of the ring ending at its position. The next server clockwise marks the boundary of its territory.

Uneven gaps are expected

With only a handful of points, hash randomness leaves big arcs and small arcs. Some servers naturally end up with more of the ring than others — a problem we will address shortly.

Each peer knows the whole ring

In most implementations, every node carries a copy of the membership list. Lookups can be answered locally without a central coordinator.

05 · Lookup Walk clockwise

Step two

A key belongs to the first server you meet when walking clockwise.

Hash the key. Drop it onto the ring at that position. Then move in the agreed direction — clockwise by convention — until you encounter a server. That server owns the key. The walk is purely conceptual; in practice it is a binary search over the sorted list of node positions.

One rule fits everything

Reads, writes, and replica placement all use the same rule. There is no separate routing table that must stay in sync with membership.

Lookup is O(log N)

Each node holds the sorted positions of all peers. Finding the successor of a key's hash is a binary search — fast even with thousands of virtual nodes.

Wrap-around at the top

If a key's hash falls past the last server's position, the walk continues past 0 and lands on the first server. The ring has no seam.

Each arc is the territory of the server at its clockwise end

06 · Adding Only neighbors move

Elasticity, the gentle way

Add a node and only one arc rearranges.

Suppose a new server E joins between B and C. Every key that previously walked past E to reach C now stops at E instead. Keys living anywhere else on the ring are untouched — their successor server is unchanged.

Before · 4 nodes

Keys k1, k2, k3 all walk clockwise to land on C.

After · E joins between B and C

k1, k2, k3 now stop at E. All other keys keep the same owner.

On average, joining a cluster of N nodes moves only about 1/N of the keys — compared to nearly all of them under mod-N hashing.

07 · Removing Failure and shrink

The symmetric case

Lose a node, and its arc folds into the next one clockwise.

If a server fails or is decommissioned, every key it owned was, by definition, in the arc ending at its position. Those keys are simply walked one step further around the ring — to the next surviving server. No other arc is disturbed.

Before · B still healthy

k1 and k2 fall in the arc owned by B.

After · B fails

k1 and k2 walk past where B was and stop at C. A and D are untouched.

Failure is just a node deletion. The neighbor inherits the load — which is convenient and, as we will see next, also exactly where the trouble starts.

08 · The Catch Uneven arcs, hot spots

The first real problem

A small number of nodes makes the ring lumpy.

With only a handful of randomly placed points on a huge circle, some arcs are inevitably much wider than others. The server at the clockwise end of a wide arc owns a disproportionate share of the keyspace — and therefore traffic. Worse, when a node fails, its entire arc falls onto a single neighbor, which may already be busy.

A inherits the long arc — and the lion's share of keys

Random placement is not even placement

A uniform hash function spreads on average. With only N=4 or N=8 actual servers, the variance dominates — gaps of 2× or 3× the mean are common.

Cascade on failure

When a server dies, the whole of its arc lands on a single neighbor. That neighbor's load instantly doubles, which can knock it over too.

Skewed workloads compound it

Real traffic is rarely uniform across the keyspace either. A hot arc on a small ring is the worst of both worlds.

We need many more points on the ring

Increasing the number of positions per server smooths the distribution. That is the idea behind virtual nodes.

09 · Virtual Nodes The fix for hot spots

The standard refinement

Give every physical server many positions on the ring.

Instead of placing each server once, place it dozens, hundreds, or even thousands of times. Each replica is a virtual node, generated by hashing labels like A#0, A#1, A#2, and so on. The physical server still owns each of those positions, but the law of large numbers now works in our favor.

Smoother distribution

With many replicas per node, each physical server ends up owning roughly the same total arc length — even though individual arcs vary.

Failures spread instead of cascade

When a physical node dies, its many virtual positions vanish from the ring. Their arcs are absorbed by many different neighbors, not just one.

Heterogeneous capacity, made easy

A beefier machine can be given more virtual nodes than a smaller one. Capacity is dialed in by changing a single integer per server.

Cost: a bigger ring to search

A cluster of 50 servers with 200 vnodes each has 10,000 positions. Lookups are still logarithmic, but the structure consumes more memory on every node.

Each color is one physical server, scattered across the ring

10 · In Practice Where you'll meet it

Real-world deployments

Consistent hashing quietly powers a lot of the infrastructure you already use.

The pattern shows up wherever a fleet of stateless or partly-stateful servers must agree on which one is responsible for which piece of data, with churn happening underneath. The same hash-ring idea appears in caches, in databases, and at the edges of content networks.

Caching tier

Distributed caches

Cache clients hash request keys onto a ring of cache servers. When a cache box is added or rebooted, only its share of the keyspace re-warms instead of the whole fleet — which is the difference between a hiccup and an outage at the origin.

Memcached client libraries · Ketama · Twemproxy

Distributed databases

Partitioned storage

Each row's primary key is hashed onto a ring of storage nodes. The owning node holds the data and forwards copies to a small fixed number of clockwise neighbors as replicas. Adding capacity moves only the affected partitions.

Amazon Dynamo · Apache Cassandra · Riak · ScyllaDB

Edge and CDNs

Request routing at scale

A CDN's edge layer hashes the URL of an object to pick which cache node should serve it. As nodes join and leave the pool, the bulk of cached content stays exactly where it was — keeping hit rates high even during rolling deploys.

Akamai-style edges · Cloudflare · Discord guild sharding

11 · Tradeoffs It is not free

Costs and complications

Consistent hashing solves one problem and introduces several smaller ones.

Adopting the ring is rarely a single afternoon's work. Membership has to be tracked, replicas have to be placed thoughtfully, and rebalancing has to be paced. None of these are showstoppers — but they are the issues that consume the rest of the design.

M

Membership and gossip

Every node must learn quickly when peers join or leave. Most clusters layer a gossip protocol on top of the ring to converge the membership view. That is an extra subsystem with its own failure modes.

R

Replica placement

Putting copies on the next clockwise nodes is simple, but those nodes may share a rack, a zone, or a power domain. Production systems extend the basic rule to skip over neighbors until rack and region diversity is satisfied.

B

Bounded load balance

Even with virtual nodes, the distribution is statistical. Some shards run hot. Modern variants add capacity caps on each node and overflow to the next — at the cost of a slightly more complex lookup.

D

Data movement during rebalance

Adding a node only shifts a small share of keys — but that share still has to be physically copied. The transfer must be throttled so that ongoing reads and writes are not starved of bandwidth.

12 · Summary Take this with you

Closing

Six principles to carry forward.

1

The hash space is a circle, not a list.

Wrap the output of your hash function end to end. Servers and keys both live as points on the same circle.

2

Each key belongs to the next server clockwise.

One rule for placement, lookup, and replica selection. No external routing table to keep in sync.

3

Joining or leaving disturbs only one arc.

On average, about 1/N of the keys move when a node arrives or departs — versus nearly all of them under mod-N.

4

Use many virtual nodes per server.

Replicating each physical node hundreds of times around the ring evens out arc sizes and spreads failures across many neighbors.

5

Replicate across failure domains, not just neighbors.

Walk the ring to choose replicas, but skip over peers that share a rack or zone with one already chosen.

6

Rebalancing is data movement — pace it.

The math is gentle, but the network is not. Throttle transfers so live traffic keeps its share of the pipe.