Chapter 9 · System Design Fundamentals
Design a Web Crawler
A deceptively simple loop — download a page, extract its links, repeat — that becomes a hard distributed-systems problem at web scale. Here politeness and traps matter as much as throughput, and the cleverness lives in the queues, not the loop.
▶ Open the companion slidesStrip a crawler to its core and it is a four-step cycle: a frontier hands out a URL, a fetcher downloads the page, a parser extracts text and links, and a store keeps the bytes. The freshly found links feed back into the frontier, and the wheel turns again. Everything hard — billions of URLs, not hammering one host, not running forever — is about controlling that loop, not changing it.
Alex Xu, System Design Interview (Vol. 1), Chapter 9. Companion deck: slides — jump to any slide with the Slide N chips. Go deeper: the Robots Exclusion Protocol (RFC 9309).
What a crawler is for Slide 2
The mechanics barely change between jobs; the constraints do. Decide the goal first — coverage, freshness, fidelity, or budget — and the rest of the design follows from it. Four jobs cover most of what crawlers are built to do.
Search indexing
Walk the open web and feed an inverted index. Coverage is wide; freshness matters most for popular pages. The classic motivator behind Googlebot.
Web archiving
Take periodic snapshots for posterity. The Internet Archive prizes fidelity and historical completeness over freshness.
Data mining
Pull structured data — prices, catalogs — from known sites. Heavy on parsing rules; light on breadth. Closest to ETL in spirit.
Monitoring
Re-check a known set of URLs for diffs — broken links, edits, defacements, leaks. Small surface, tight freshness window.
The crawl loop Slide 4
Read the cycle first — it is the system. Seed URLs prime the frontier; the fetcher downloads HTML; the parser pulls out text and links; content is checked for duplicates and stored. The extracted links pass through a URL-seen? filter and the survivors flow back into the frontier, closing the loop. Each box is a place to apply a limit.
Requirements in tension Slide 3
Four goals pull in different directions, and the whole design lives on the tradeoff curve between them. Crawl aggressively and you damage target sites and get banned; crawl politely and throughput drops. Name what each one actually demands before you start drawing boxes.
| Requirement | What it demands | Tension |
|---|---|---|
| Scale | Tens of billions of URLs known, billions fetched per day; petabytes stored. | Single-node thinking dies on day one. |
| Politeness | Obey robots.txt and crawl-delay; never hammer one host in parallel. | Directly caps per-host throughput. |
| Robustness | Survive malformed HTML, timeouts, redirect chains, 5xx storms, traps. | Retrying forever is as bad as giving up. |
| Freshness | News in minutes, static pages in weeks; importance-weighted recrawl. | Re-fetching everything wastes scarce budget. |
These are non-functional requirements doing the real work. The functional surface — "fetch pages, follow links, store results" — is almost trivial by comparison.
The URL frontier — two queues stacked Slide 5
The frontier decides what to crawl next and when. It does this with two rows of queues. A prioritizer scores each URL and drops it into one of several front queues by importance. A router then maps each URL to one of many back queues, one host per queue. A back-queue selector with a per-host timer releases URLs so a single host is never hit too fast. Front decides priority; back decides politeness.
Why BFS, not DFS Slide 11
The frontier behaves like a breadth-first queue, and that is deliberate. Depth-first would chase one site down a deep chain of links — fetching page after page from the same host back to back, exactly the hammering politeness forbids. Breadth-first spreads the work across many hosts at once, so the per-host pacing has other work to do while one host cools down.
A crawler that ignores robots.txt or crawl-delay and opens parallel
connections to one host gets your IP blocked — and your crawler's reputation with it. Per-host pacing
isn't good manners; it's the thing that keeps you able to crawl at all.
Politeness Slide 6
Treat every origin as a guest in someone else's home. Three rules carry most of the weight: read the welcome mat, throttle per host, and make politeness hold across all your workers, not just one.
Send an honest User-Agent that names the crawler and links a policy page; back off on
429 / 503 by doubling the delay. Identified, well-behaved bots get the benefit of
the doubt — anonymous ones get blocked first.
Deduplication — two different hashes Slide 7
The web has more copies than originals, so the crawler dedupes twice, with two different hashes. URL-seen stops us fetching the same address twice: canonicalize the URL (lowercase host, strip default port and tracking params) and hash that. Content-seen stops us storing the same article ten times under ten addresses: hash the page's normalized content and compare. The first hashes an address; the second hashes a body.
URL-seen (the address)
Canonicalize, then check membership before pushing into the frontier. At scale a Bloom filter in memory sits in front of an authoritative key-value store — an O(1) check per URL keeps the hot path cheap.
Content-seen (the body)
Hash the normalized page text into a fingerprint and compare. Mirrors, sessionized URLs, and tracking parameters all serve identical content from different addresses — only a content hash catches those.
DNS resolution as its own subsystem Slide 8
DNS is the hidden bottleneck. A naive crawler does a fresh lookup per URL and turns its resolver into a
smoking pile; at billions of fetches a day, name resolution is a tier-one dependency that needs its own
cache. Because URL extraction tends to discover many pages from the same host, a local
host → IP cache (honouring TTLs) hits well over 90% of the time.
Cache, two tiers
An in-memory cache on each fetcher catches the bulk; a shared cluster-wide resolver absorbs the misses so no node bombards the public DNS roots on its own.
Bound the wait
DNS calls are synchronous in many libraries — one slow domain stalls a worker. Wrap resolution in a deadline (say 2 s), pre-resolve in batches, and negatively cache dead hosts.
Crawler traps Slide 9
Follow every link reachable from a seed and you will never finish — some traps are accidental, some adversarial. The defences are mostly limits (depth, count, shape), not cleverness. Cap everything, and bound loops finish.
| Trap | What it looks like | Defence |
|---|---|---|
| Spider trap | Infinite calendar "next month" links; bot mazes of generated content. | Cap path depth and URLs per host per day; stop when novelty drops. |
| Dynamic / infinite URLs | Each visit mints a new session id, so addresses are endless for one page. | Strip known session params during canonicalization. |
| Huge / redirect loops | Pages that never end, or A→B→A redirect chains. | Cap page size; hard-cap redirects (e.g. 5) and track URLs seen in a fetch. |
| Malformed HTML | Broken markup, soft-404s returning 200 OK, cloaked bot-only content. | Parse defensively; detect by text patterns and bot-vs-clean comparison. |
Freshness — when to come back Slide 10
Crawling a page once is easy; knowing when to return is what separates a search index from a museum. The recrawl interval is set per-URL by importance × change rate — a busy homepage of a major site earns minutes, a hobbyist's static archive earns weeks.
high importance × high churn → minutes · low importance × static → weeks
Estimate change rate from observed history, weight importance with PageRank or query traffic, and lean on
free signals: sitemap lastmod, RSS feeds, and conditional If-Modified-Since /
ETag GETs — a 304 confirms freshness for almost nothing.
Scaling — partition by host hash Slide 11
One machine can't crawl the web, so distribute the work by hashing hosts, not URLs. Each shard owns a slice of hosts and runs its own frontier, fetchers, and per-host pacing — so politeness stays local and shards never coordinate rate limits. The BFS expansion just gets wider as more shards join. A URL extracted for another host is routed to whichever shard owns it.
Shard on hash(host)
Hashing hosts (the same consistent-hashing idea from Chapter 5) keeps every per-host queue and timer inside one shard, so the politeness invariant holds without cross-shard chatter.
Share the global state
The visited URL set, content store, DNS resolver, and robots cache are shared services — so dedup and storage stay globally consistent even as shards multiply.
Storage — big bytes vs small facts Slide 12
Page bodies are large, immutable, and read in bulk — perfect for object storage. Metadata is small, hot, and queried by URL — perfect for a database. Don't mix the two. And keep the URL-seen set in a fast in-memory structure (a Bloom filter) so the hot dedup check never waits on disk.
Per-URL metadata — last fetched, status, ETag, content hash, blob pointer, depth — lives in a wide-column store or partitioned Postgres. The link graph sits apart because its access pattern is traversal, not lookup. Parsed pages get pushed to a stream so indexers consume crawl events independently.
Principles Slide 13
The crawler is a feedback loop
Pages produce links; links seed pages. Frontier, fetcher, parser, store are just plumbing around that loop.
Politeness is non-negotiable
robots.txt, per-host pacing, honest user agents. Lose any one and your IPs — and reputation — get banned.
Dedupe twice
By URL (canonicalize + Bloom filter) and by content (content hash / SimHash). The web has more copies than originals.
Partition by host, share the rest
Sharding on host hash keeps politeness local; visited sets, content store, and DNS stay shared so global invariants hold.
Cap everything
Depth, redirects, retries, URLs per host per day, time per fetch. Bounded loops finish; the web won't stop on its own.
Recrawl by importance
Importance × change rate sets cadence; cheap conditional GETs confirm freshness. Crawl budget is the scarcest resource.
Active recall
Cover the answers. Say each one out loud before you tap to check.