A distributed program that walks the public web — discovering links, fetching pages politely, and pouring the results into an index, an archive, or a watchful pipeline. We treat it as a system, not a script.
The mechanics are similar, but the constraints — coverage, freshness, latency, budget — change with the goal. Decide which job first; the design follows.
Walk the open web, parse content, and feed an inverted index. Coverage is wide; freshness matters most for popular pages. The classic motivator behind Googlebot and Bingbot.
Take periodic snapshots of pages for posterity. The Internet Archive treats fidelity and historical completeness as the prize, not freshness.
Re-check a known set of URLs for diffs — broken links, content edits, defacements, leaks. Smaller surface, tighter freshness window, often per-customer.
Pull structured data from competitor sites or marketplaces. Heavy on parsing rules and anti-bot evasion; light on breadth. Closest to ETL in spirit.
Each goal pushes the design in a different direction. Aggressive crawling damages target sites; polite crawling lowers throughput. The system lives on this tradeoff curve.
Tens of billions of URLs known, a few billion fetched per day at the top end. Storage in petabytes; bandwidth in tens of gigabits per second. Single-node thinking dies on the first day.
Obey robots.txt, respect crawl-delay, never hammer a host with parallel connections. Treat every origin as a guest in someone else's home.
Tolerate malformed HTML, timeouts, redirect chains, 5xx storms, captive portals, TLS errors. A crawler that retries forever is as bad as one that gives up immediately.
News sites need recrawl in minutes; corporate landing pages can wait weeks. Importance-weighted scheduling beats blanket policies every time.
Every crawler, no matter how grand, is a feedback loop. URLs flow out of a frontier, pages flow back in, and freshly discovered links seed the next turn of the wheel.
The frontier picks what to crawl next and when. A front-row of priority queues selects by importance; a back-row of per-host queues paces each origin so we never overwhelm it.
A polite crawler keeps host operators happy and itself unbanned. Three rules carry most of the weight: read the welcome mat, throttle per host, and respect that politeness must hold across all your workers, not just one.
Before crawling any host, pull /robots.txt and cache it. Honour Disallow, Allow, and Crawl-delay. Refresh once a day or when it expires.
Open at most one connection to a host at a time, and wait a configurable delay between requests — often max(crawl-delay, 1s). The host-timer heap enforces this.
If ten workers exist, they must share the per-host counter. Either shard hosts so one worker owns each, or coordinate via a central token store. Don't let parallelism erase politeness.
Send a User-Agent that names the crawler and links to a policy page. Operators block anonymous bots first; identified ones get the benefit of doubt.
On 429 / 503, double the delay and cap retries. Treat connection refusals as a hint to slow down for that host for the rest of the hour.
Cap total egress per host per day. A small site shouldn't see more than a sliver of its bandwidth used by you, even if you have permission.
URL-level dedupe stops us from fetching the same address twice. Content-level dedupe stops us from storing the same article ten times under ten different addresses — mirrors, sessionized URLs, and tracking parameters create that mess.
Before pushing a URL into the frontier, canonicalize it (lowercase host, strip default port, drop tracking params) and check membership. At scale, a Bloom filter sits in front of an authoritative key-value store.
Normalize scheme, host case, percent-encoding, default ports, and trailing slashes. Decide which query parameters carry meaning vs. tracking. Mistakes here multiply your crawl budget.
Hash the page's shingles into a single 64-bit fingerprint where Hamming distance approximates content similarity. Pages within k=3 bits are treated as duplicates of the canonical version.
Better for comparing larger documents or full sites — estimate Jaccard similarity between shingle sets without comparing them directly. Useful for spotting scraped mirrors of well-known content.
A naive crawler does a fresh DNS lookup per URL — and quickly turns its upstream resolver into a smoking pile. At billions of fetches a day, DNS is a tier-one dependency that needs caching, sharding, and timeouts.
Every fetcher keeps an in-memory map of host → IP honouring TTLs. Hit rates above 90% are typical because URL extraction tends to discover many pages from the same host.
A shared resolver (a fleet of recursive DNS servers behind a load balancer) absorbs misses from the local cache and prevents every node from independently bombarding the public DNS root.
DNS calls are synchronous in many libraries — a single slow domain can stall a worker. Wrap resolution in a deadline (e.g., 2 seconds) and fail the URL fast on miss.
When the frontier emits work, look up unresolved hosts in parallel before handing URLs to fetchers. This decouples DNS latency from fetch latency.
Cache NXDOMAIN and SERVFAIL responses too — for a shorter TTL — so a dead host doesn't get re-queried on every reference.
Hosts change IPs for CDN reasons. Pinning resolves for hours saves a few lookups but creates a long tail of mysterious connection failures.
A crawler that follows every link reachable from a seed will not finish. Some traps are accidental, some are adversarial. Defences are mostly limits — depth, count, and shape — rather than cleverness.
A page links to "next month" forever. Cap path depth, cap URLs per host per day, detect monotonic counters in query strings and trim them.
A → B → A, or chains hundreds long. Hard-cap redirects (e.g., 5), and track URLs seen during a single fetch to break cycles.
Each visit gets a new session in the URL; the crawler sees infinite distinct addresses for the same content. Strip known session params during canonicalization.
Lots of content only appears after JS runs. Either render a small subset with a headless browser, or rely on server-side HTML and accept partial coverage.
Servers return 200 OK with a "not found" body, or different content for bots. Detect by text patterns and by comparing bot vs. clean-browser responses on samples.
Some sites deliberately link bots into endless generated content. The cure: per-host quotas + diminishing returns detection (low novelty rate → stop).
Crawling a page once is the easy part. Knowing when to come back is what separates a search index from a museum. The right cadence is per-URL, not per-site.
Track how often each URL's content actually changes between fetches. Pages with high churn earn shorter recrawl intervals; static pages drift to the long-tail bucket.
A change on the homepage of a major news site matters more than a change on a hobbyist's archive page. Use signals like PageRank, inbound link count, or query traffic to weight refresh priority.
When a host publishes a sitemap with lastmod or an RSS feed, use it. Free signal beats clever inference every time.
Send If-Modified-Since or ETag headers. A 304 response costs almost nothing and confirms freshness without re-downloading bytes.
Bucket URLs into tiers — minutes, hours, days, weeks. Move URLs between tiers based on observed change history, not initial guesses.
A page that never changes and no one queries can be checked annually, or removed from the rotation entirely. Crawl budget is the scarcest resource.
A single machine can't crawl the web. Distribute the work by hashing hosts (not URLs) so politeness stays local: each shard owns a slice of hosts, manages its own per-host queues, and never needs to coordinate rate limits with peers.
Page bodies are large, immutable, and accessed in bulk — perfect for object storage. Metadata is small, hot, and queried by URL or by foreign key — perfect for a database. Don't mix the two.
Raw HTML (and rendered text) for each fetched URL go into object storage — S3, GCS, HDFS. Group by date or by content hash. Compress with gzip or zstd; expect ~3× shrinkage on HTML.
The web archive community packages raw responses (headers + body) into WARC files, often hundreds of MB each. Easy to ship, easy to replay; cheap per-page overhead.
Hash each page's normalized text; store the hash next to the blob pointer. Used by dedupe (same content under different URLs) and change detection (same URL, new content?).
Per-URL records: last fetched at, last status code, ETag, content hash, length, MIME, blob pointer, priority tier, host, depth, error counters. A wide-column store (Cassandra, Bigtable) or partitioned Postgres works.
Edges (from-URL → to-URL) are the input to PageRank-style importance scoring and to broken-link reports. Store separately because the access pattern is graph traversal, not URL lookup.
Push parsed pages to a stream (Kafka, Pub/Sub) so indexers, classifiers, and monitors can consume crawl events independently. Decouples crawl rate from downstream throughput.
Pages produce links; links seed more pages. Everything else — frontier, fetcher, parser, store — is plumbing around that loop.
robots.txt, per-host pacing, and identified user agents. Lose any of these and your IPs get banned and your reputation along with them.
Canonicalization and Bloom filters stop same-URL refetches; SimHash and MinHash stop near-duplicate storage. The web has more copies than originals.
Sharding the frontier on host hash keeps politeness local. Visited sets, content stores, and DNS sit on shared infrastructure so global invariants hold.
Depth, redirects, retries, URLs per host per day, time per fetch. The web will happily run your crawler forever if you let it; bounded loops finish.
Cheap conditional GETs, observed change history, and importance signals together decide cadence. Crawl budget is the scarcest resource you own.