Chapter Seven

Distributed
Unique ID Generator

From a single auto-increment column to 64-bit Snowflake IDs minted independently on hundreds of machines — without ever colliding.

SYSTEM DESIGN SLIDES · 11 USE ARROW KEYS  ·  F FOR FULLSCREEN
02 · Goals Chapter 7

What we actually need from an ID

Before picking a scheme, pin down what "good" looks like. Five properties shape every design decision that follows.

01
Globally unique
No two machines, no two threads, no two days may ever produce the same ID — not even by accident on a clock-rewind.
02
Roughly time-ordered
Newer IDs should sort after older ones. This makes B-tree indexes happy and lets us paginate by ID without a separate timestamp column.
03
Compact
64 bits fits in a register, a BIGINT column, and most network headers. 128-bit identifiers double storage on every row and every index page.
04
High throughput
Tens of thousands of IDs per second per node, with sub-millisecond latency. No round trip to a central server on the hot path.
05
No single point of failure
If one node dies, the rest keep minting. The scheme survives partial outages, network partitions, and rolling restarts.
03 · Approach 1 Chapter 7

Database auto-increment

The default in every relational database. One table, one counter, one row at a time. It works until it doesn't.

How it works

The database keeps an internal counter. Every INSERT grabs the next integer and hands it back. Uniqueness is guaranteed by the engine's transaction log.

App #1 App #2 App #3 single DB counter = 9,431 every write contends for the same row lock
What's nice
  • Trivial to implement — one column, zero new infrastructure
  • IDs are dense, sorted, and small (a 64-bit BIGINT)
  • Strong uniqueness backed by transactions
Why it doesn't scale
  • The whole world serializes on one counter row — write throughput caps at whatever a single primary can handle
  • A failover can replay or skip IDs unless replication is strictly synchronous
  • Sharding the DB breaks the global counter; you'd need a coordinator that re-introduces the same bottleneck
  • Network round trip on every insert adds latency you can't optimize away
Verdict: fine for a side project. Fatal for anything that needs to mint IDs on more than one machine concurrently.
04 · Approach 2 Chapter 7

UUID v4 — pure randomness

Generate 128 random bits on each client. No coordination, no central server, no network call. The collision probability is so small it might as well be zero.

Anatomy of a v4 UUID

f47ac10b-58cc-4372-a567-0e02b2c3d479 122 random bits version = 4 variant bits 128 bits total · 36 chars in canonical form
Pros
  • Zero coordination — generate offline, on a phone, in a Lambda
  • No SPOF, no central authority
  • Collision odds are astronomically low
Cons
  • 128 bits — twice the storage of a BIGINT, on every row, every index
  • Random order kills B-tree locality: inserts hit random pages, splitting them constantly
  • Can't sort by ID to get chronological order
  • Verbose in URLs and logs
UUID v7 fixes most of these by prefixing a millisecond timestamp — basically a friendly cousin of Snowflake in 128-bit clothing.
05 · Approach 3 Chapter 7

Ticket server — one counter to rule them all

Flickr's classic move: a tiny dedicated service whose only job is to hand out the next integer. Every app server asks it for an ID over the network.

Topology

App A App B App C Ticket Server next() → 84,201 single counter standby? (tricky) one network hop per ID, one place for everything to fail
Pros
  • IDs are dense, monotonic, and small
  • Logic is dead simple — just an atomic counter
  • Easy to reason about; easy to audit
Cons
  • It is the single point of failure
  • Every ID needs a network round trip — latency floor of milliseconds
  • Failover is hard: a hot standby must never hand out the same number, even after a network partition
  • Throughput limited by one machine's atomic-increment rate
Mitigation: batch out ranges (e.g. 1,000 IDs at a time). That softens the latency hit but the SPOF remains.
06 · Approach 4 Chapter 7

Snowflake — coordinated by design

Twitter's insight: if we slice a 64-bit integer into pre-agreed fields, every machine can mint IDs locally and still guarantee global uniqueness. No network call, no central counter.

The four fields

  • Sign bit (1) — always zero, so the ID is a positive signed long
  • Timestamp (41 bits) — milliseconds since a custom epoch; gives roughly 69 years of headroom
  • Machine ID (10 bits) — up to 1,024 distinct generator nodes (often split into datacenter and worker)
  • Sequence (12 bits) — 4,096 IDs per millisecond per machine
Total throughput per node: 4,096,000 IDs/sec. Across 1,024 machines: over 4 billion IDs/sec. That is more than enough for any product that has ever existed.

Why this works

Each machine knows its own ID and tracks its own per-millisecond sequence. Two machines can never collide because their machine-ID bits differ. The same machine can't collide with itself because either the millisecond has advanced or the sequence counter has.

Sorting by the integer value sorts by time (because the timestamp sits in the high bits) — exactly what B-tree indexes want.

1 41 bits 10 bits 12 bits sign timestamp machine sequence
07 · Bit layout Chapter 7

Sixty-four bits, four jobs

A scaled-up diagram of one Snowflake ID. The width of each block is exactly proportional to the number of bits it occupies.

63 bit position → 0 0 TIMESTAMP 41 bits · ms since custom epoch MACHINE 10 bits SEQUENCE 12 bits sign always 0 ~69 years of values 2^41 ms ≈ 69.7 years 1,024 nodes e.g. 5 dc + 5 worker 4,096 / ms / node resets every millisecond EXAMPLE 0 | 00010110010111010110010101100010110101 | 0001100101 | 000000010111 = 1605994281047920663
Sign · 1 bit
Reserved; keeps the value positive when treated as a signed 64-bit integer.
Timestamp · 41 bits
Millisecond offset from your chosen epoch (e.g. 2020-01-01). Max 2,199,023,255,551.
Machine · 10 bits
Identifies the generator. Assigned at boot via config, Zookeeper, or env var.
Sequence · 12 bits
Per-millisecond counter on each machine. Resets to zero every new millisecond.
08 · Edge cases Chapter 7

Bursts and clock skew

Two failure modes haunt every Snowflake implementation: a node trying to mint more than 4,096 IDs in one millisecond, and a system clock that jumps backwards.

Burst overflow

If the sequence counter reaches its ceiling (4,095) within the same millisecond, the generator must wait for the next millisecond before producing more IDs. This is a busy-wait — usually a few hundred nanoseconds — and is invisible to callers.

ms 0 ms 1 ms 2 ms 3 4 IDs 4,096 IDs · seq exhausted → block until ms 2 resume from seq 0

Clock going backwards

NTP slews, leap seconds, and VM migrations can pull the system clock backwards. If we naively used the new (smaller) timestamp, we'd mint IDs that look older than ones already issued — and possibly duplicate them.

Defensive strategy: remember the last timestamp used. If now < lastTimestamp, either wait until the clock catches up (small skews) or refuse to generate IDs and alert (large skews). Some implementations also use the high bits of the machine ID as a logical-clock guard so that a restart with a regressed clock can be detected.

Operational tip: Run NTP in slew-only mode, alert when offset exceeds a few hundred milliseconds, and prefer monotonic clocks where the runtime exposes them.
09 · Approach 5 Chapter 7

Database-segmented ranges

A pragmatic middle ground used at Instagram, Flickr, and inside many internal services. The database hands out blocks of N IDs to each node, and the node serves them locally.

range allocator next_range = [3000, 4000) Node A [1000–2000) · used 800/1000 serves locally, no network call Node B [2000–3000) · used 220/1000 fresh range, just allocated Node C [3000–4000) about to refill will ask DB at 90% used
Why teams reach for this
  • One DB round trip per thousand IDs, not per ID
  • Nodes are independent on the hot path — generator throughput is local
  • Resulting IDs are dense, sortable BIGINTs
  • Simple mental model: just a counter, but chunked
Trade-offs to plan for
  • If a node crashes mid-range, the unused IDs in that range are lost forever (gaps appear)
  • The allocator DB is still a SPOF, though it's only touched occasionally — pair it with a replica
  • Across the fleet, IDs are roughly time-ordered but not strictly so — Node A might be on 1,800 while Node B is already on 2,500
10 · Compare Chapter 7

Side by side

Same five approaches, scored on the requirements we wrote down on slide two.

Auto-increment

Size64 bits
CoordinationDB lock
Throughputlow
SPOFyes
Time-sortedyes
Best forsingle-node apps

UUID v4

Size128 bits
Coordinationnone
Throughputhuge
SPOFnone
Time-sortedno
Best foroffline / client IDs

Ticket server

Size64 bits
Coordinationcentral RPC
Throughputmedium
SPOFyes
Time-sortedyes
Best forsmall fleets

Snowflake

Size64 bits
Coordinationmachine-ID only
Throughputhuge
SPOFnone
Time-sortedyes
Best forlarge fleets

Segmented

Size64 bits
CoordinationDB · per range
Throughputhigh
SPOFsoft
Time-sortedapprox
Best forexisting SQL stacks
Rule of thumb: if you already operate a strong stateful store (Postgres, MySQL, ZK) and only a handful of nodes mint IDs, segmented ranges are the cheapest win. If you have hundreds of producers and want zero hot-path coordination, ship Snowflake.
11 · Closing Chapter 7

Principles to take with you

Whichever scheme you ship, these ideas survive the choice.

01
Coordinate as little as possible
Every network hop on the ID path is latency you can never get back. Push uniqueness into the structure of the ID itself.
02
Put time in the high bits
Sortable IDs make indexes, pagination, range scans, and replication lag debugging dramatically easier — for free.
03
Budget your bit width
Decide how many years of timestamps, how many machines, and how many IDs per millisecond you actually need. Then size the fields.
04
Treat clocks as adversarial
Wall clocks drift, slew, and occasionally reverse. Detect regression, refuse to mint when it happens, and alert loudly.
05
Gaps are usually fine
Lost IDs from a crashed range or skipped sequence are cheap. Duplicates are catastrophic. Optimize for the right failure mode.
06
Pick the simplest thing that fits
A BIGINT auto-increment is right until it isn't. Snowflake is overkill for a five-server fleet. Match the scheme to the scale you'll actually reach.
← → · Space · F
📖 Read the lesson