arrows / space · F = fullscreen
Chapter 13
Design a Search
Autocomplete System
Suggest likely query completions while the user is still typing. The whole challenge is shipping the right list in under a hundred milliseconds, again and again, across billions of keystrokes.
swiss j obs — full stack swiss alps
Slide 02 · Scope

What the system must do

Autocomplete sits in the user's hot path. Every extra millisecond is felt directly while they type, so the constraints are unusually strict for what looks like a simple lookup.

Functional

  • Given a prefix string, return the top K matching queries.
  • Ranked by popularity (with optional personalisation later).
  • Updates as the user adds or deletes characters.
  • Multi-language, case-insensitive, tolerant of trailing spaces.

Out of scope (for now)

  • Spelling correction beyond what the index already absorbs.
  • Semantic search and embeddings.
  • Long-form sentence completion.

Non-functional

  • Latency: end-to-end under 100 ms at p99.
  • Freshness: trending queries should appear within hours, not days.
  • Throughput: every keystroke is a request — expect ~10x query QPS.
  • Availability: graceful degradation, never a 5xx in the typeahead box.
  • Cost: cheap per lookup; the box is on every page.

Back-of-envelope

10 M DAU × 10 queries/day × 6 keystrokes/query
   ≈ 600 M requests/day
   ≈ 7 K QPS average, 20–30 K peak

Slide 03 · Core Data Structure

The trie: a tree of shared prefixes

A trie stores strings character by character. Every word that shares a prefix shares a path, so finding "all queries starting with swi" is just walking three edges and reading the subtree underneath.

s c w u i s a o t d ° s c w u a o swi• sus• cat• cod• root teal node = end of a word
A toy trie holding "swi", "sus", "cat", "cod" — one path per shared prefix.
Slide 04 · Ranking Signal

Attach a frequency to every terminal

A trie tells us which strings exist. To rank suggestions we also need to know which ones people actually search for. Store a frequency counter on each terminal node — the count of times that exact query was issued in the recent window.

s w u i e s m ° s w u swiss f = 8 400 sweet f = 1 250 sushi f = 3 600 summit f = 740

What the count represents

The number of times that exact query string was logged over the chosen time window (often a sliding 30-day count). Newer counts can be weighted higher with exponential decay so trends bubble up.

Where it gets stored

  • Only on terminal nodes (intermediate nodes don't have a query).
  • Updated by an offline aggregation job, not on the live request path.
  • Can be a small integer once normalised — rank order matters, not absolute counts.
Slide 05 · The Obvious Algorithm

Naïve lookup: walk the prefix, then explore everything below

The textbook algorithm: descend to the prefix node by walking one edge per character, then do a depth-first search of the subtree, collect every terminal, and sort by frequency. It works — until the prefix is short.

Steps per request

  • Walk the prefix: O(P) where P is the prefix length.
  • DFS the subtree: visit every descendant terminal.
  • Sort or heap-select the top K by frequency.
  • Return the K winners.

Why it falls apart

The subtree under s can hold millions of distinct queries. Doing a full traversal on every keystroke for every user blows the latency budget by orders of magnitude.

prefix "s" → ~10M nodes DFS > 500 ms

Complexity, at a glance

StepCost
Descend to prefix nodeO(P)
DFS subtree (N terminals)O(N)
Top-K via heapO(N log K)
Total per keystrokeO(P + N log K)

P is tiny but N grows with traffic. For the popular short prefixes, N is exactly where we cannot afford linear work.

Slide 06 · The Speed Trick

Cache the answer at every prefix node

Instead of recomputing the top K each time, precompute and store the list directly on every node. A lookup becomes: walk to the node, return its cached list. O(P) total — constant in the size of the corpus.

s w i root s top-5 swiss sushi sport soup song w top-5 swiss sweet swim swap switch i top-5 swiss swim swing swipe swift return cached list no scan needed
Each node along the prefix path "swi" carries its own precomputed top-K. The query is a pointer chase.
Slide 07 · The Trade-off

You bought speed with memory — now prune

Storing K strings at every node multiplies memory cost by roughly K. For a trie of 50 M nodes and K = 10, that is 500 M cached strings. Pruning techniques bring it back to a workable footprint.

Where the cost lives

  • Each prefix node holds an array of K (suggestion, score) pairs.
  • Long-tail prefixes (10+ chars) hardly serve traffic but still cost RAM.
  • Strings can be stored as offsets into a single string pool.

Pruning strategies

  • Min-frequency cutoff: drop queries searched fewer than N times.
  • Max prefix depth: stop precomputing past depth 6 or so — long prefixes have few candidates, a quick DFS is fine.
  • Variable K: store more suggestions for short, hot prefixes; fewer for deep, rare ones.
  • Compress the trie: collapse single-child chains into a radix / Patricia tree.
Memory per design choice naïve trie 1.0x + top-K cached ~10x + min-freq cutoff ~6x + depth cap (6) ~3.5x + radix compression ~2x
Slide 08 · Build Pipeline

Build the trie offline from query logs

The serving trie is a derived artifact. A periodic batch job reads raw search logs, aggregates them into query → count, builds a fresh trie with precomputed top-K, and publishes it to the serving fleet as an immutable blob.

Query logs raw events S3 / HDFS / Kafka Aggregate group + count + decay Spark / MapReduce Build trie insert + precompute K serialize to blob Publish push to serving fleet blob store + sync Serve in-memory read-only ~hours of data ~tens of minutes ~tens of minutes atomic swap live in seconds Offline build, online serve runs every few hours; nothing about the live request path mutates state
Slide 09 · Freshness

Two answers to "how fresh is fresh enough?"

Batch rebuilds give you new trends every few hours. For most products that is fine. When breaking news or live events demand minute-level freshness, layer a streaming update path on top.

Path A · Delayed batch refresh

simplecheapstale ~hours
  • Rebuild the trie every 1–6 hours from the rolling log window.
  • Atomic swap into serving nodes — readers never see a partial state.
  • Good default. Most queries don't trend that fast.

Path B · Streaming top-up

fresh ~minutesmore complextwo systems to reason about
  • Tail the query log via Kafka / Flink in real time.
  • Maintain a small delta trie of recently-spiking queries.
  • At query time, merge the base trie's top-K with the delta's top-K.
  • Delta is wiped every batch swap.
Base trie refreshed every few hours ~tens of GB immutable, in RAM Delta trie updated by stream ~tens of MB mutable, small merge top-K per request final suggestions
Slide 10 · Cache Hot Prefixes

Most keystrokes hit the same few hundred prefixes

Query prefixes follow a steep Zipf-style curve. A small cache at the edge or in Redis can absorb the bulk of traffic, leaving the trie servers responsible for the long tail.

user CDN edge ~5 ms prefix → top-K Redis ~15 ms per-region cache Trie server ~30–50 ms in-memory trie ~80% of traffic ~15% of traffic ~5% of traffic cache cascade — each layer protects the next

What to cache where

  • CDN: short, popular prefixes that are the same for everyone. TTL of a few minutes.
  • Redis (regional): longer prefixes, per-locale variants.
  • Trie server: everything else — the long tail and personalised reranks.

Invalidation

  • TTL-based expiry — staleness on the edge is acceptable for autocomplete.
  • On a fresh trie deploy, bump a version tag so caches re-fill lazily.
  • Never block a request waiting for cache; fall through quickly.
Slide 11 · Horizontal Scale

Shard the trie when one box no longer fits

A full trie can outgrow a single machine's RAM. Two common partitioning schemes, each with different trade-offs.

By first character

routing is trivialskewed
  • Shard A holds prefixes starting with a–c, shard B with d–g, etc.
  • Router picks the shard from the first byte of the prefix.
  • Problem: shard for "s" is enormous; shard for "z" is tiny.
  • Mitigate by splitting hot ranges further (sa-, se-, sh-, …).

By hash of the prefix

even loadduplicates data
  • Each (prefix, top-K) pair stored on hash(prefix) mod N.
  • Distribution is uniform, no hot shards.
  • Cost: the trie structure itself isn't shared across shards — you store redundant prefix slices.
  • Often used as a flat key–value store rather than a trie at all.
By first character shard 1 a–c shard 2 d–j shard 3 s (hot!) shard 4 t–z load skew: shard 3 sees 5x the traffic By hash(prefix) mod N shard 1 ~25% shard 2 ~25% shard 3 ~25% shard 4 ~25% even load, lookup is one hop router resolves prefix → shard via consistent hash
Slide 12 · Takeaways

Principles to carry forward

01

The data structure is the design

A trie isn't an implementation detail. Choosing it locks in O(P) lookup and prefix sharing — the only reason any of this fits in latency budget.

02

Precompute on writes, not reads

The cheapest read is the one that returns a pointer. Move the top-K work to the offline build, leave nothing to do at query time.

03

Treat the serving trie as immutable

Build offline, swap atomically. Mutation on the hot path leads to locking, hard-to-reason-about consistency, and ops pain.

04

Match freshness to the product

Batch every few hours covers most use cases. Stream-merged deltas exist for the minority who genuinely need them — not by default.

05

Cache cascades absorb the curve

Prefix popularity is Zipfian. A small edge cache plus a regional Redis takes the load off the trie fleet for nearly free.

06

Memory is a knob you control

Min-frequency cutoffs, depth caps, and radix compression turn the speed-for-memory trade-off into a dial, not a cliff.