Chapter 9 · System Design

Design a Web Crawler

A distributed program that walks the public web — discovering links, fetching pages politely, and pouring the results into an index, an archive, or a watchful pipeline. We treat it as a system, not a script.

Systems Notebook · Chapter 09
Use ← → or Space · F for fullscreen
02 / 13
Why crawl

A crawler is a quiet workhorse for very different jobs

The mechanics are similar, but the constraints — coverage, freshness, latency, budget — change with the goal. Decide which job first; the design follows.

Use case 01

Search indexing

Walk the open web, parse content, and feed an inverted index. Coverage is wide; freshness matters most for popular pages. The classic motivator behind Googlebot and Bingbot.

Use case 02

Web archive

Take periodic snapshots of pages for posterity. The Internet Archive treats fidelity and historical completeness as the prize, not freshness.

Use case 03

Change monitoring

Re-check a known set of URLs for diffs — broken links, content edits, defacements, leaks. Smaller surface, tighter freshness window, often per-customer.

Use case 04

Price & catalog scraping

Pull structured data from competitor sites or marketplaces. Heavy on parsing rules and anti-bot evasion; light on breadth. Closest to ETL in spirit.

Use ← → or Space · F for fullscreen
03 / 13
Requirements

Scale, politeness, robustness, freshness — in tension

Each goal pushes the design in a different direction. Aggressive crawling damages target sites; polite crawling lowers throughput. The system lives on this tradeoff curve.

Scale

Tens of billions of URLs known, a few billion fetched per day at the top end. Storage in petabytes; bandwidth in tens of gigabits per second. Single-node thinking dies on the first day.

Politeness

Obey robots.txt, respect crawl-delay, never hammer a host with parallel connections. Treat every origin as a guest in someone else's home.

Robustness

Tolerate malformed HTML, timeouts, redirect chains, 5xx storms, captive portals, TLS errors. A crawler that retries forever is as bad as one that gives up immediately.

Freshness

News sites need recrawl in minutes; corporate landing pages can wait weeks. Importance-weighted scheduling beats blanket policies every time.

10¹⁰
Pages known
10⁹/day
Fetched at peak
~100 KB
Avg page size
PB scale
Stored content
Use ← → or Space · F for fullscreen
04 / 13
Architecture

The loop: frontier → fetcher → parser → store

Every crawler, no matter how grand, is a feedback loop. URLs flow out of a frontier, pages flow back in, and freshly discovered links seed the next turn of the wheel.

Seed URLs homepages, sitemaps URL Frontier priority + politeness queues, sharded ⟲ feedback target Fetcher Pool async HTTP workers DNS + TLS + retry DNS Resolver cached, distributed robots.txt cache per-host policy Parser HTML → text + links strip boilerplate URL Extractor normalize, canonicalize filter, dedupe Content Store blob storage + metadata DB Visited Set bloom + DB new URLs flow back into the frontier cold start work queue network I/O CPU + text persistence
Use ← → or Space · F for fullscreen
05 / 13
URL Frontier

Two queues stacked: importance first, then politeness

The frontier picks what to crawl next and when. A front-row of priority queues selects by importance; a back-row of per-host queues paces each origin so we never overwhelm it.

Incoming extracted URLs Prioritizer score → tier FRONT QUEUES (priority) F₁ high F₂ F₃ F₄ low Biased Selector weighted draw BACK QUEUES (one per host) example.com delay 1s news.io delay 5s shop.co delay 2s blog.net delay 3s Host-timer heap min-heap keyed on next-allowed-fetch time to Fetcher Pool next URL whose host is ready & whose priority wins Two-stage queue: priority decides what · politeness decides when
Use ← → or Space · F for fullscreen
06 / 13
Politeness

Be a guest, not a guest who breaks the door

A polite crawler keeps host operators happy and itself unbanned. Three rules carry most of the weight: read the welcome mat, throttle per host, and respect that politeness must hold across all your workers, not just one.

1 · robots.txt

Before crawling any host, pull /robots.txt and cache it. Honour Disallow, Allow, and Crawl-delay. Refresh once a day or when it expires.

2 · Per-host rate limit

Open at most one connection to a host at a time, and wait a configurable delay between requests — often max(crawl-delay, 1s). The host-timer heap enforces this.

3 · Distributed delay

If ten workers exist, they must share the per-host counter. Either shard hosts so one worker owns each, or coordinate via a central token store. Don't let parallelism erase politeness.

Honest identity

Send a User-Agent that names the crawler and links to a policy page. Operators block anonymous bots first; identified ones get the benefit of doubt.

Backoff on errors

On 429 / 503, double the delay and cap retries. Treat connection refusals as a hint to slow down for that host for the rest of the hour.

Bandwidth budget

Cap total egress per host per day. A small site shouldn't see more than a sliver of its bandwidth used by you, even if you have permission.

Use ← → or Space · F for fullscreen
07 / 13
Duplicate detection

Two different jobs: same URL, and same content

URL-level dedupe stops us from fetching the same address twice. Content-level dedupe stops us from storing the same article ten times under ten different addresses — mirrors, sessionized URLs, and tracking parameters create that mess.

Visited URL set

Before pushing a URL into the frontier, canonicalize it (lowercase host, strip default port, drop tracking params) and check membership. At scale, a Bloom filter sits in front of an authoritative key-value store.

Canonicalization rules

Normalize scheme, host case, percent-encoding, default ports, and trailing slashes. Decide which query parameters carry meaning vs. tracking. Mistakes here multiply your crawl budget.

SimHash for near-duplicates

Hash the page's shingles into a single 64-bit fingerprint where Hamming distance approximates content similarity. Pages within k=3 bits are treated as duplicates of the canonical version.

MinHash for set similarity

Better for comparing larger documents or full sites — estimate Jaccard similarity between shingle sets without comparing them directly. Useful for spotting scraped mirrors of well-known content.

~30%
Web is duplicate content
64 bits
SimHash fingerprint
k ≤ 3
Typical match threshold
O(1)
Bloom lookup per URL
Use ← → or Space · F for fullscreen
08 / 13
DNS at scale

Resolving names becomes its own subsystem

A naive crawler does a fresh DNS lookup per URL — and quickly turns its upstream resolver into a smoking pile. At billions of fetches a day, DNS is a tier-one dependency that needs caching, sharding, and timeouts.

Local resolver cache

Every fetcher keeps an in-memory map of host → IP honouring TTLs. Hit rates above 90% are typical because URL extraction tends to discover many pages from the same host.

Cluster-wide cache

A shared resolver (a fleet of recursive DNS servers behind a load balancer) absorbs misses from the local cache and prevents every node from independently bombarding the public DNS root.

Bounded timeouts

DNS calls are synchronous in many libraries — a single slow domain can stall a worker. Wrap resolution in a deadline (e.g., 2 seconds) and fail the URL fast on miss.

Pre-resolve in batches

When the frontier emits work, look up unresolved hosts in parallel before handing URLs to fetchers. This decouples DNS latency from fetch latency.

Negative caching

Cache NXDOMAIN and SERVFAIL responses too — for a shorter TTL — so a dead host doesn't get re-queried on every reference.

Respect TTL, don't pin

Hosts change IPs for CDN reasons. Pinning resolves for hours saves a few lookups but creates a long tail of mysterious connection failures.

Use ← → or Space · F for fullscreen
09 / 13
Traps

The web tries to keep you running forever

A crawler that follows every link reachable from a seed will not finish. Some traps are accidental, some are adversarial. Defences are mostly limits — depth, count, and shape — rather than cleverness.

Infinite calendars

A page links to "next month" forever. Cap path depth, cap URLs per host per day, detect monotonic counters in query strings and trim them.

Redirect loops

A → B → A, or chains hundreds long. Hard-cap redirects (e.g., 5), and track URLs seen during a single fetch to break cycles.

Session-id explosion

Each visit gets a new session in the URL; the crawler sees infinite distinct addresses for the same content. Strip known session params during canonicalization.

Dynamic / JS-rendered pages

Lots of content only appears after JS runs. Either render a small subset with a headless browser, or rely on server-side HTML and accept partial coverage.

Soft-404 and cloaking

Servers return 200 OK with a "not found" body, or different content for bots. Detect by text patterns and by comparing bot vs. clean-browser responses on samples.

Bot mazes

Some sites deliberately link bots into endless generated content. The cure: per-host quotas + diminishing returns detection (low novelty rate → stop).

Use ← → or Space · F for fullscreen
10 / 13
Refresh policy

Importance × change rate decides recrawl interval

Crawling a page once is the easy part. Knowing when to come back is what separates a search index from a museum. The right cadence is per-URL, not per-site.

Estimate change rate

Track how often each URL's content actually changes between fetches. Pages with high churn earn shorter recrawl intervals; static pages drift to the long-tail bucket.

Weight by importance

A change on the homepage of a major news site matters more than a change on a hobbyist's archive page. Use signals like PageRank, inbound link count, or query traffic to weight refresh priority.

Sitemaps and feeds

When a host publishes a sitemap with lastmod or an RSS feed, use it. Free signal beats clever inference every time.

Conditional GETs

Send If-Modified-Since or ETag headers. A 304 response costs almost nothing and confirms freshness without re-downloading bytes.

Tiered cadences

Bucket URLs into tiers — minutes, hours, days, weeks. Move URLs between tiers based on observed change history, not initial guesses.

Decay unimportant pages

A page that never changes and no one queries can be checked annually, or removed from the rotation entirely. Crawl budget is the scarcest resource.

Use ← → or Space · F for fullscreen
11 / 13
Distributed crawling

Partition the frontier by host hash, then expand BFS

A single machine can't crawl the web. Distribute the work by hashing hosts (not URLs) so politeness stays local: each shard owns a slice of hosts, manages its own per-host queues, and never needs to coordinate rate limits with peers.

BFS expansion from seeds s₀ depth 0 — seeds depth 1 depth 2 depth 3 — fanout explodes Frontier acts as the BFS queue; politeness pacing serializes per-host steps. Sharded by hash(host) URL Router shard = hash(host) mod N Shard 1 hosts a–f frontier + fetchers + politeness ctrl self-contained Shard 2 hosts g–m Shard 3 hosts n–s Shard N hosts t–z extracted URLs to other hosts → routed to the right shard Shared services visited URL set · content store · DNS resolver · robots cache shared so duplicates and storage stay globally consistent
Use ← → or Space · F for fullscreen
12 / 13
Storage

Big bytes go to blobs, small facts go to a database

Page bodies are large, immutable, and accessed in bulk — perfect for object storage. Metadata is small, hot, and queried by URL or by foreign key — perfect for a database. Don't mix the two.

Content blobs

Raw HTML (and rendered text) for each fetched URL go into object storage — S3, GCS, HDFS. Group by date or by content hash. Compress with gzip or zstd; expect ~3× shrinkage on HTML.

WARC archives

The web archive community packages raw responses (headers + body) into WARC files, often hundreds of MB each. Easy to ship, easy to replay; cheap per-page overhead.

Content fingerprints

Hash each page's normalized text; store the hash next to the blob pointer. Used by dedupe (same content under different URLs) and change detection (same URL, new content?).

Metadata DB

Per-URL records: last fetched at, last status code, ETag, content hash, length, MIME, blob pointer, priority tier, host, depth, error counters. A wide-column store (Cassandra, Bigtable) or partitioned Postgres works.

Link graph

Edges (from-URL → to-URL) are the input to PageRank-style importance scoring and to broken-link reports. Store separately because the access pattern is graph traversal, not URL lookup.

Downstream feed

Push parsed pages to a stream (Kafka, Pub/Sub) so indexers, classifiers, and monitors can consume crawl events independently. Decouples crawl rate from downstream throughput.

Use ← → or Space · F for fullscreen
13 / 13
Summary

Six principles for crawler design

01

The crawler is a feedback loop

Pages produce links; links seed more pages. Everything else — frontier, fetcher, parser, store — is plumbing around that loop.

02

Politeness is non-negotiable

robots.txt, per-host pacing, and identified user agents. Lose any of these and your IPs get banned and your reputation along with them.

03

Dedupe twice — by URL and by content

Canonicalization and Bloom filters stop same-URL refetches; SimHash and MinHash stop near-duplicate storage. The web has more copies than originals.

04

Partition by host, share the rest

Sharding the frontier on host hash keeps politeness local. Visited sets, content stores, and DNS sit on shared infrastructure so global invariants hold.

05

Cap everything

Depth, redirects, retries, URLs per host per day, time per fetch. The web will happily run your crawler forever if you let it; bounded loops finish.

06

Recrawl by importance, not uniformly

Cheap conditional GETs, observed change history, and importance signals together decide cadence. Crawl budget is the scarcest resource you own.

End of Chapter 09 · Systems Notebook
← → · Space · F · 1–9
📖 Read the lesson