Chapter 9 · System Design Fundamentals

Design a Web Crawler

A deceptively simple loop — download a page, extract its links, repeat — that becomes a hard distributed-systems problem at web scale. Here politeness and traps matter as much as throughput, and the cleverness lives in the queues, not the loop.

▶ Open the companion slides

Reading time ~12 min Prerequisites Ch 8 · URL Shortener, Ch 5 · Consistent Hashing Audio 🔊 Hinglish read-aloud Next Design a Notification System

Strip a crawler to its core and it is a four-step cycle: a frontier hands out a URL, a fetcher downloads the page, a parser extracts text and links, and a store keeps the bytes. The freshly found links feed back into the frontier, and the wheel turns again. Everything hard — billions of URLs, not hammering one host, not running forever — is about controlling that loop, not changing it.

Primary source

Alex Xu, System Design Interview (Vol. 1), Chapter 9. Companion deck: slides — jump to any slide with the Slide N chips. Go deeper: the Robots Exclusion Protocol (RFC 9309).

What a crawler is for Slide 2

The mechanics barely change between jobs; the constraints do. Decide the goal first — coverage, freshness, fidelity, or budget — and the rest of the design follows from it. Four jobs cover most of what crawlers are built to do.

Search indexing

Walk the open web and feed an inverted index. Coverage is wide; freshness matters most for popular pages. The classic motivator behind Googlebot.

Web archiving

Take periodic snapshots for posterity. The Internet Archive prizes fidelity and historical completeness over freshness.

Data mining

Pull structured data — prices, catalogs — from known sites. Heavy on parsing rules; light on breadth. Closest to ETL in spirit.

Monitoring

Re-check a known set of URLs for diffs — broken links, edits, defacements, leaks. Small surface, tight freshness window.

The crawl loop Slide 4

Read the cycle first — it is the system. Seed URLs prime the frontier; the fetcher downloads HTML; the parser pulls out text and links; content is checked for duplicates and stored. The extracted links pass through a URL-seen? filter and the survivors flow back into the frontier, closing the loop. Each box is a place to apply a limit.

The crawler is a feedback loop. Pages produce links; links seed pages. The URL-seen and content-seen filters are what keep it from looping forever or hoarding copies.

Requirements in tension Slide 3

Four goals pull in different directions, and the whole design lives on the tradeoff curve between them. Crawl aggressively and you damage target sites and get banned; crawl politely and throughput drops. Name what each one actually demands before you start drawing boxes.

Requirement	What it demands	Tension
Scale	Tens of billions of URLs known, billions fetched per day; petabytes stored.	Single-node thinking dies on day one.
Politeness	Obey robots.txt and crawl-delay; never hammer one host in parallel.	Directly caps per-host throughput.
Robustness	Survive malformed HTML, timeouts, redirect chains, 5xx storms, traps.	Retrying forever is as bad as giving up.
Freshness	News in minutes, static pages in weeks; importance-weighted recrawl.	Re-fetching everything wastes scarce budget.

These are non-functional requirements doing the real work. The functional surface — "fetch pages, follow links, store results" — is almost trivial by comparison.

The URL frontier — two queues stacked Slide 5

The frontier decides what to crawl next and when. It does this with two rows of queues. A prioritizer scores each URL and drops it into one of several front queues by importance. A router then maps each URL to one of many back queues, one host per queue. A back-queue selector with a per-host timer releases URLs so a single host is never hit too fast. Front decides priority; back decides politeness.

Two stacked queues. Front queues rank by importance; back queues pace each host. A timer keyed on next-allowed-fetch time releases one URL at a time so no origin is overwhelmed.

Why BFS, not DFS Slide 11

The frontier behaves like a breadth-first queue, and that is deliberate. Depth-first would chase one site down a deep chain of links — fetching page after page from the same host back to back, exactly the hammering politeness forbids. Breadth-first spreads the work across many hosts at once, so the per-host pacing has other work to do while one host cools down.

Politeness is survival, not courtesy

A crawler that ignores robots.txt or crawl-delay and opens parallel connections to one host gets your IP blocked — and your crawler's reputation with it. Per-host pacing isn't good manners; it's the thing that keeps you able to crawl at all.

Politeness Slide 6

Treat every origin as a guest in someone else's home. Three rules carry most of the weight: read the welcome mat, throttle per host, and make politeness hold across all your workers, not just one.

1 · read robots.txt

pull /robots.txt once per host, cache it; honour Disallow, Allow, Crawl-delay

2 · pace per host

one connection at a time; wait max(crawl-delay, 1s) between requests

3 · stay polite when distributed

shard hosts so one worker owns each → the per-host timer is never split

Send an honest User-Agent that names the crawler and links a policy page; back off on 429 / 503 by doubling the delay. Identified, well-behaved bots get the benefit of the doubt — anonymous ones get blocked first.

Deduplication — two different hashes Slide 7

The web has more copies than originals, so the crawler dedupes twice, with two different hashes. URL-seen stops us fetching the same address twice: canonicalize the URL (lowercase host, strip default port and tracking params) and hash that. Content-seen stops us storing the same article ten times under ten addresses: hash the page's normalized content and compare. The first hashes an address; the second hashes a body.

URL-seen (the address)

Canonicalize, then check membership before pushing into the frontier. At scale a Bloom filter in memory sits in front of an authoritative key-value store — an O(1) check per URL keeps the hot path cheap.

Content-seen (the body)

Hash the normalized page text into a fingerprint and compare. Mirrors, sessionized URLs, and tracking parameters all serve identical content from different addresses — only a content hash catches those.

DNS resolution as its own subsystem Slide 8

DNS is the hidden bottleneck. A naive crawler does a fresh lookup per URL and turns its resolver into a smoking pile; at billions of fetches a day, name resolution is a tier-one dependency that needs its own cache. Because URL extraction tends to discover many pages from the same host, a local host → IP cache (honouring TTLs) hits well over 90% of the time.

Cache, two tiers

An in-memory cache on each fetcher catches the bulk; a shared cluster-wide resolver absorbs the misses so no node bombards the public DNS roots on its own.

Bound the wait

DNS calls are synchronous in many libraries — one slow domain stalls a worker. Wrap resolution in a deadline (say 2 s), pre-resolve in batches, and negatively cache dead hosts.

Crawler traps Slide 9

Follow every link reachable from a seed and you will never finish — some traps are accidental, some adversarial. The defences are mostly limits (depth, count, shape), not cleverness. Cap everything, and bound loops finish.

Trap	What it looks like	Defence
Spider trap	Infinite calendar "next month" links; bot mazes of generated content.	Cap path depth and URLs per host per day; stop when novelty drops.
Dynamic / infinite URLs	Each visit mints a new session id, so addresses are endless for one page.	Strip known session params during canonicalization.
Huge / redirect loops	Pages that never end, or A→B→A redirect chains.	Cap page size; hard-cap redirects (e.g. 5) and track URLs seen in a fetch.
Malformed HTML	Broken markup, soft-404s returning 200 OK, cloaked bot-only content.	Parse defensively; detect by text patterns and bot-vs-clean comparison.

Freshness — when to come back Slide 10

Crawling a page once is easy; knowing when to return is what separates a search index from a museum. The recrawl interval is set per-URL by importance × change rate — a busy homepage of a major site earns minutes, a hobbyist's static archive earns weeks.

recrawl interval ≈ f( 1 / importance , 1 / change rate )
high importance × high churn → minutes · low importance × static → weeks

Estimate change rate from observed history, weight importance with PageRank or query traffic, and lean on free signals: sitemap lastmod, RSS feeds, and conditional If-Modified-Since / ETag GETs — a 304 confirms freshness for almost nothing.

Scaling — partition by host hash Slide 11

One machine can't crawl the web, so distribute the work by hashing hosts, not URLs. Each shard owns a slice of hosts and runs its own frontier, fetchers, and per-host pacing — so politeness stays local and shards never coordinate rate limits. The BFS expansion just gets wider as more shards join. A URL extracted for another host is routed to whichever shard owns it.

Shard on hash(host)

Hashing hosts (the same consistent-hashing idea from Chapter 5) keeps every per-host queue and timer inside one shard, so the politeness invariant holds without cross-shard chatter.

Share the global state

The visited URL set, content store, DNS resolver, and robots cache are shared services — so dedup and storage stay globally consistent even as shards multiply.

Storage — big bytes vs small facts Slide 12

Page bodies are large, immutable, and read in bulk — perfect for object storage. Metadata is small, hot, and queried by URL — perfect for a database. Don't mix the two. And keep the URL-seen set in a fast in-memory structure (a Bloom filter) so the hot dedup check never waits on disk.

blob store

raw HTML

S3 / GCS / WARC

database

metadata

status, ETag, hash

graph

link edges

for PageRank

in-memory

URL-seen

bloom + cache

Per-URL metadata — last fetched, status, ETag, content hash, blob pointer, depth — lives in a wide-column store or partitioned Postgres. The link graph sits apart because its access pattern is traversal, not lookup. Parsed pages get pushed to a stream so indexers consume crawl events independently.

Principles Slide 13

The crawler is a feedback loop

Pages produce links; links seed pages. Frontier, fetcher, parser, store are just plumbing around that loop.

Politeness is non-negotiable

robots.txt, per-host pacing, honest user agents. Lose any one and your IPs — and reputation — get banned.

Dedupe twice

By URL (canonicalize + Bloom filter) and by content (content hash / SimHash). The web has more copies than originals.

Partition by host, share the rest

Sharding on host hash keeps politeness local; visited sets, content store, and DNS stay shared so global invariants hold.

Cap everything

Depth, redirects, retries, URLs per host per day, time per fetch. Bounded loops finish; the web won't stop on its own.

Recrawl by importance

Importance × change rate sets cadence; cheap conditional GETs confirm freshness. Crawl budget is the scarcest resource.

Active recall

Cover the answers. Say each one out loud before you tap to check.

What are the four stages of the crawl loop?

Out, down, parse, back.

Frontier → fetcher → parser → store. Extracted links pass a URL-seen filter and flow back into the frontier, closing the loop.

Front queues vs back queues — what does each decide?

What vs when.

Front = priority (which URL to crawl next, by importance). Back = politeness (one host per queue; a per-host timer decides when it may be hit).

URL-seen vs content-seen — what does each hash?

Address vs body.

URL-seen hashes the normalized URL to avoid refetching an address. Content-seen hashes the page content to avoid re-storing the same body under different URLs.

Why BFS instead of DFS?

Think per-host load.

DFS chases one host deep and back-to-back, which is the hammering politeness forbids. BFS spreads work across many hosts, so per-host pacing always has other work.

Why does DNS need its own subsystem?

One lookup per URL.

A fresh lookup per URL crushes the resolver. DNS is a tier-one dependency needing a local + cluster cache, bounded timeouts, and batch pre-resolution.

What sets a page's recrawl interval?

Two factors, multiplied.

Importance × change rate, per-URL. Busy important pages recrawl in minutes; static unimportant ones in weeks. Conditional GETs confirm freshness cheaply.

Check yourself

Q1 In the URL frontier, what do the back queues decide?

Why: Back queues hold one host each and a per-host timer paces them — they decide when a host may be hit. Front queues decide what to crawl by priority.

Q2 Two different URLs return byte-for-byte identical HTML. Which catches it?

Why: The addresses differ, so URL-seen passes both. Only a hash of the page content sees that the bodies match and skips storing the duplicate.

Q3 Why does a web crawler traverse breadth-first, not depth-first?

Why: DFS chases one site's links back-to-back, pounding that host. BFS spreads work across many hosts so per-host pacing always has other work to do.

Q4 Why is DNS resolution treated as its own crawler subsystem?

Why: A fresh lookup for every URL turns DNS into a bottleneck, so it gets its own cache (local plus cluster-wide), bounded timeouts, and batch pre-resolution.

Q5 When sharding a distributed crawler, what is hashed to a shard?

Why: Hashing on host keeps every per-host queue and timer inside one shard, so the politeness invariant holds without shards coordinating rate limits.