Chapter 9 · Hinglish · Listening Lesson

Design a Web Crawler

A deceptively simple loop (download a page, extract its links, repeat) that becomes a hard distributed-systems problem at web scale.

What a web crawler is for Slide 2

Chapter nine: design a web crawler. A web crawler is a distributed program that runs across the public web, discovers links, fetches pages politely, and feeds the results into an index, an archive, or a pipeline. The mechanics are nearly the same for every job, but the constraints change, so settle the goal first. There are four common use cases. First, search indexing, where the whole open web is walked to feed an inverted index, as Googlebot does. Second, web archiving, where periodic snapshots are taken and fidelity matters most, not freshness. Third, data mining, where structured data such as prices and catalogs is extracted from known sites; this is close to ETL. And fourth, monitoring, where a known set of URLs is rechecked for diffs, such as broken links or content edits. Decide the goal, and the rest of the design follows from it.

Requirements that pull against each other Slide 3

Four goals pull in different directions, and the whole design lives on the tradeoffs between them. First, scale. More than ten billion URLs are known, about a billion pages are fetched every day, and storage runs to petabytes. Single-node thinking dies on day one. Second, politeness. Obey robots.txt and crawl-delay, and never open parallel connections to one host. Politeness directly lowers per-host throughput. Third, robustness. Survive malformed HTML, timeouts, redirect chains, and storms of 5xx errors. Retrying forever is as bad as giving up immediately. And fourth, freshness. News pages need recrawling within minutes, while static pages can wait for weeks. Remember, the real work is done by these non-functional requirements; the functional surface is almost trivial.

The crawl loop Slide 4

The heart of a crawler is a four-step loop. First, the URL frontier hands out a URL. Then the fetcher downloads that page's HTML. Then the parser extracts text and links from the HTML. Then a deduplication filter checks the content and writes new content to storage. The extracted links pass through a URL-seen filter, and only those not seen before go back into the frontier, which closes the loop. This is the feedback loop: pages produce links, and links seed new pages. Everything else, meaning the frontier, fetcher, parser, and store, is just plumbing around this loop. Most important of all, every box in the loop is a place where you can impose a limit, such as depth, count, time, or bytes. Without those limits the crawler never finishes.

[Figure: the crawl loop. Seed URLs (homepages, sitemaps) → URL frontier (what and when) → Fetcher (downloads HTML) → Parser (text + links) → Content-seen? (page-content hash) → new content → Storage (blob + metadata). Extracted links → URL-seen? (normalized-URL hash) → unseen URLs back to the frontier. Two filters keep the loop under control: URL-seen blocks a repeated address, content-seen blocks a repeated body. Every box is a place to cap depth, count, time, or bytes.]
The crawler is a feedback loop. Pages produce links, and links seed pages. The URL-seen and content-seen filters stop it from running forever.
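To make the loop concrete, here is a minimal sketch of it in Python. It is an illustration under stated assumptions, not a production crawler: `FAKE_WEB` is a made-up in-memory stand-in for HTTP fetching, links are pulled out with a simple regular expression, the frontier is a plain FIFO queue, and `max_pages` is the only cap applied.

```python
from collections import deque
import hashlib
import re

# Hypothetical in-memory "web": URL -> HTML body. A real fetcher would use HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/x">x</a> <a href="http://b.example/">b</a>',
    "http://a.example/x": '<a href="http://a.example/">home</a>',
    "http://b.example/": '<a href="http://a.example/x">x</a>',
    "http://b.example/copy": '<a href="http://a.example/x">x</a>',  # same body as b.example/
}

LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, max_pages=100):
    frontier = deque(seeds)      # URL frontier (plain FIFO in this sketch)
    url_seen = set(seeds)        # URL-seen filter
    content_seen = set()         # content-seen filter (hashes of bodies)
    storage = {}                 # URL -> body, stands in for the blob store
    fetched = 0
    while frontier and fetched < max_pages:   # the count cap bounds the loop
        url = frontier.popleft()
        body = FAKE_WEB.get(url)              # fetch
        if body is None:
            continue
        fetched += 1
        links = LINK_RE.findall(body)         # parse
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in content_seen:        # content-seen check
            content_seen.add(digest)
            storage[url] = body
        for link in links:                    # URL-seen check
            if link not in url_seen:
                url_seen.add(link)
                frontier.append(link)
    return storage
```

Starting from the seeds `http://a.example/` and `http://b.example/copy`, the page at `http://b.example/` is fetched but not stored, because its body hash matches the copy that was stored earlier.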

The URL frontier's two rows of queues Slide 5

The URL frontier decides two things: what to crawl and when. It does this with two rows of queues. First, a prioritizer scores each URL and places it in one of the front queues according to its importance. The front queues decide priority. Then a router maps each URL to one of the back queues, where each queue holds exactly one host. The back queues decide politeness. Finally, a back-queue selector, which keeps a timer for every host, releases URLs one at a time so that no single host is ever hit too fast. The way to remember it is simple. Front means priority, that is, what to crawl. Back means politeness, that is, when to crawl. The timer holds a host back until its next allowed fetch time arrives, so parallelism can never cancel out politeness.
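The two rows of queues can be sketched roughly as follows. This is a simplified illustration, not a reference design: the class name `Frontier`, the fixed per-host `delay`, and the choice to drain every front queue into the back queues on each call are all assumptions made to keep the example short. The caller supplies the priority score and the current time, and URLs are assumed to be absolute with a valid host.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Two-stage frontier sketch: front queues rank, back queues pace."""

    def __init__(self, levels=3, delay=1.0):
        self.front = [deque() for _ in range(levels)]  # index 0 = most important
        self.back = {}           # host -> deque holding URLs of that host only
        self.ready = []          # min-heap of (next_allowed_time, host)
        self.next_allowed = {}   # host -> earliest time of its next fetch
        self.delay = delay       # assumed fixed gap per host, in seconds

    def add(self, url, priority):
        # Prioritizer: clamp the caller's score to a valid front-queue level.
        level = min(max(priority, 0), len(self.front) - 1)
        self.front[level].append(url)

    def _route(self, now):
        # Router: move URLs, highest priority first, into per-host back queues.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).hostname
                if host not in self.back:
                    self.back[host] = deque()
                    start = max(now, self.next_allowed.get(host, now))
                    heapq.heappush(self.ready, (start, host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, else None."""
        self._route(now)
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.back[host].popleft()
        self.next_allowed[host] = now + self.delay   # per-host timer
        if self.back[host]:
            heapq.heappush(self.ready, (now + self.delay, host))
        else:
            del self.back[host]
        return url
```

Because the timer lives in `next_allowed` and not in the queue itself, a host that runs out of URLs and then receives a new one is still held until its delay has passed.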

Why BFS and not DFS Slide 11

The frontier behaves like a breadth-first queue, and that is deliberate. If you went depth-first, the crawler would chase one site's links all the way down, fetching page after page back to back from the same host, which is exactly the hammering that politeness forbids. Breadth-first spreads the work across many hosts at once, so per-host pacing always has other work to do while one host is cooling down. And keep one big warning in mind. Politeness is not courtesy, it is survival. If a crawler ignores robots.txt or crawl-delay and opens parallel connections to one host, its IP gets blocked and its reputation suffers as well. Per-host pacing is not good manners; it is what lets you keep crawling.

Politeness Slide 6

Politeness means treating every origin as if you were a guest in someone else's house. Three rules carry most of the weight. First, robots.txt. Before crawling any host, pull its /robots.txt and cache it, and obey Disallow, Allow, and Crawl-delay. Second, a per-host rate limit. Open only one connection to a host at a time, and wait between requests, typically for the larger of the crawl-delay and one second. Third, distributed delay. If there are ten workers, they would have to share a per-host counter. It is better to shard hosts so that each host is owned by exactly one worker and the per-host timer is never split. Along with this, send an honest User-Agent that names the crawler and points to a policy page, and double the delay on a 429 or 503 response. Recognized, well-behaved bots get the benefit of the doubt; anonymous bots get blocked first.
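The first two rules and the backoff can be sketched with Python's standard `urllib.robotparser`. The crawler name `ExampleCrawler`, the class `HostPolicy`, and the `MAX_DELAY` ceiling are assumptions for illustration; the robots.txt text is passed in as a string so that the example does not touch the network.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"   # hypothetical crawler name
MIN_DELAY = 1.0                 # floor from the text: at least one second
MAX_DELAY = 300.0               # assumed ceiling so the backoff stays bounded


class HostPolicy:
    """Per-host politeness state built from that host's robots.txt."""

    def __init__(self, robots_txt):
        self.robots = RobotFileParser()
        self.robots.parse(robots_txt.splitlines())
        declared = self.robots.crawl_delay(USER_AGENT)   # None if not declared
        # Wait for the larger of the declared crawl-delay and one second.
        self.delay = max(float(declared or 0), MIN_DELAY)

    def allowed(self, url):
        return self.robots.can_fetch(USER_AGENT, url)

    def record_status(self, status):
        # Double the delay when the host signals overload.
        if status in (429, 503):
            self.delay = min(self.delay * 2, MAX_DELAY)
        return self.delay
```

One `HostPolicy` per host, owned by the single worker responsible for that host, keeps the timer from being split across workers.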

Deduplication's two hashes Slide 7

The web holds more copies than originals, so the crawler dedupes twice, with two different hashes. First, URL-seen. This stops the same address from being fetched again. To do it, canonicalize the URL, that is, lowercase the host, remove the default port, and drop tracking params, then hash the normalized URL. Second, content-seen. This stops the same article from being stored ten times under ten different addresses. To do it, hash the page's normalized content and compare. The first hash is of an address, the second is of a body. Mirrors, sessionized URLs, and tracking parameters all serve the same content from different addresses, and whatever canonicalization fails to collapse, only the content hash can catch. At scale, a Bloom filter for URL-seen is kept in memory in front of an authoritative key-value store, giving a cheap, roughly O(1) check on every URL.
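Here is a small sketch of the two hashes. The parameter list in `DROP_PARAMS`, the sorting of query parameters, and the whitespace-and-case normalization in `content_key` are assumptions chosen for illustration; real crawlers tune all three.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking and session parameters; a real list would be longer.
DROP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()           # lowercase the host
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"               # keep only non-default ports
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROP_PARAMS           # drop tracking params
    ]
    query = urlencode(sorted(kept))                 # assumed: order-insensitive
    return urlunsplit((scheme, host, parts.path or "/", query, ""))  # no fragment


def url_key(url):
    """Hash of an address: feeds the URL-seen filter."""
    return hashlib.sha256(canonicalize(url).encode("utf-8")).hexdigest()


def content_key(html):
    """Hash of a body: feeds the content-seen filter."""
    normalized = " ".join(html.split()).lower()     # assumed normalization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two addresses that differ only in case, default port, parameter order, or dropped parameters produce the same `url_key`; two bodies that differ only in whitespace or case produce the same `content_key`.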

DNS as its own subsystem Slide 8

DNS is the hidden bottleneck. A naive crawler does a fresh lookup for every URL and turns its resolver into a smoking heap. At a billion fetches per day, name resolution is a tier-one dependency that needs its own DNS cache. Because URL extraction often discovers many pages on the same host, a local host-to-IP cache that respects TTLs hits more than ninety percent of the time. Two tiers work well. An in-memory cache on each fetcher catches the bulk of lookups, and a shared cluster-wide resolver absorbs the misses so that no node bombards the public DNS roots on its own. Also bound the wait. DNS calls are synchronous in many libraries, so one slow domain can stall a worker. Wrap resolution in a deadline, such as two seconds, pre-resolve in batches, and negatively cache dead hosts.
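The per-fetcher tier might look something like the sketch below. The `resolve` callable is a hypothetical hook that returns an `(ip, ttl_seconds)` pair and raises `OSError` on failure; in practice it would call the shared cluster resolver. The deadline and batch pre-resolution from the text are not shown here.

```python
import time


class DnsCache:
    """TTL-respecting host -> IP cache with negative caching (sketch)."""

    def __init__(self, resolve, negative_ttl=60.0, clock=time.monotonic):
        self.resolve = resolve            # hypothetical: host -> (ip, ttl_seconds)
        self.negative_ttl = negative_ttl  # assumed lifetime for "dead host" entries
        self.clock = clock
        self.entries = {}                 # host -> (ip or None, expires_at)
        self.hits = 0
        self.misses = 0

    def lookup(self, host):
        now = self.clock()
        cached = self.entries.get(host)
        if cached is not None and cached[1] > now:
            self.hits += 1
            return cached[0]              # may be None for a known-dead host
        self.misses += 1
        try:
            ip, ttl = self.resolve(host)
        except OSError:
            ip, ttl = None, self.negative_ttl   # remember the failure too
        self.entries[host] = (ip, now + ttl)
        return ip
```

Tracking `hits` and `misses` makes it easy to check whether the cache really reaches the hit rate the design expects.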

Crawler traps Slide 9

If you follow every link reachable from a seed, you will never finish. Some traps are accidental, some adversarial. The defences are mostly limits, not cleverness. Remember four common traps. First, the spider trap, such as a calendar that always offers a link to the next month, or bot mazes that keep generating content. The cure: cap path depth, cap URLs per host per day, and stop when novelty starts to fall. Second, dynamic or infinite URLs, where every visit mints a new session id and the addresses for one page become endless. The cure: strip known session params during canonicalization. Third, huge pages or redirect loops, such as A to B and back to A. The cure: cap page size, put a hard cap on redirects, such as five, and track the URLs seen within a single fetch. And fourth, malformed HTML and the soft 404, which returns 200 OK with a not-found body. The cure: parse defensively and detect it from text patterns.
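Two of these limits are easy to show in code: the path-depth cap and the redirect cap with loop detection. In this sketch, `MAX_PATH_DEPTH` is an assumed value, and `redirect_target` is a hypothetical callable that returns the next URL in a redirect chain or `None` when the URL is a final page; a real fetcher would read it from the `Location` header.

```python
from urllib.parse import urlsplit

MAX_REDIRECTS = 5     # hard cap suggested in the text
MAX_PATH_DEPTH = 8    # assumed value; tune per crawl


def path_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def too_deep(url):
    return path_depth(url) > MAX_PATH_DEPTH


def resolve_redirects(url, redirect_target):
    """Follow a redirect chain.

    Returns the final URL, or None if the chain loops or needs more
    than MAX_REDIRECTS hops.
    """
    seen = {url}                          # URLs seen within this one fetch
    for _ in range(MAX_REDIRECTS + 1):
        nxt = redirect_target(url)
        if nxt is None:
            return url                    # reached a final page
        if nxt in seen:
            return None                   # loop such as A -> B -> A
        seen.add(nxt)
        url = nxt
    return None                           # chain longer than the cap
```

Both checks are plain limits: neither tries to understand why the site behaves as it does, which is the point of the section.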

Freshness: when to come back Slide 10

Crawling a page once is easy; deciding when to come back is what separates a search index from a museum. The recrawl interval is set per URL from importance times change rate: the higher the product, the shorter the interval. A big site's busy homepage earns minutes; a hobbyist's static archive page earns weeks. Estimate change rate from observed history, weight importance by PageRank or query traffic, and lean on free signals. A sitemap's lastmod, RSS feeds, and conditional GETs with If-Modified-Since or an ETag are all free signals. A 304 response confirms that the page has not changed at almost no cost, without downloading the bytes again. Remember, visit the most important and fastest-changing pages first, and for a page that never changes and that nobody queries, visit once a year or drop it from the rotation.
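One possible way to turn "importance times change rate" into a number is sketched below. The exact formula (a base interval divided by the product, clamped between one minute and one year), the `base` value, and the 0.5 prior for pages with no history are all assumptions for illustration; the text does not prescribe a specific formula. The last helper builds the standard conditional-request headers.

```python
MIN_INTERVAL = 60.0                  # one minute, in seconds
MAX_INTERVAL = 365 * 24 * 3600.0     # one year, in seconds


def estimate_change_rate(changed_flags):
    """Fraction of past visits on which the page had changed (0 to 1)."""
    if not changed_flags:
        return 0.5                   # assumed prior for a page with no history
    return sum(changed_flags) / len(changed_flags)


def recrawl_interval(importance, change_rate, base=3600.0):
    """Seconds until the next visit: a higher score gives a shorter wait."""
    score = importance * change_rate
    if score <= 0:
        return MAX_INTERVAL          # never changes or never matters
    return min(max(base / score, MIN_INTERVAL), MAX_INTERVAL)


def conditional_headers(etag=None, last_modified=None):
    """Headers that let the server answer 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

A 304 reply to such a request can be recorded as an "unchanged" visit in the history that `estimate_change_rate` reads, so the free signal also improves the schedule.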

Scaling: partition by host hash Slide 11

One machine cannot crawl the whole web, so distribute the work by a hash of hosts, not of URLs. Each shard holds a slice of hosts and runs its own frontier, its own fetchers, and its own per-host pacing. This keeps politeness local, and shards never have to coordinate rate limits with each other. BFS expansion simply widens as more shards join. If an extracted URL belongs to a different host, it is routed to the shard that owns that host. Hashing on hosts is the same consistent-hashing idea that appeared in Chapter 5, and it keeps every per-host queue and timer inside a single shard. But some state must stay global. The visited-URL set, content store, DNS resolver, and robots cache are shared services, so that deduplication and storage stay globally consistent across the whole system no matter how many shards are added.
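Routing by host can be sketched with a small consistent-hash ring. The class name `HostRing`, the shard names, and the number of virtual nodes are assumptions for illustration; the key point is that the hash input is the host, never the full URL.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _h(text):
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class HostRing:
    """Consistent-hash ring that maps a URL's host to its owning shard."""

    def __init__(self, shards, vnodes=64):
        # Each shard gets `vnodes` points on the ring to even out the load.
        points = sorted(
            (_h(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self.keys = [key for key, _ in points]
        self.owners = [shard for _, shard in points]

    def shard_for(self, url):
        host = (urlsplit(url).hostname or "").lower()   # hash the host only
        index = bisect.bisect(self.keys, _h(host)) % len(self.keys)
        return self.owners[index]
```

Every URL on the same host lands on the same shard, and removing a shard moves only the hosts that shard owned, so the other shards keep their per-host queues and timers.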

Storage: big bytes versus small facts Slide 12

Page bodies are large, immutable, and read in bulk, which is perfect for object storage. Metadata is small, hot, and queried by URL, which is perfect for a database. Do not mix the two. Raw HTML for every fetched URL goes into a blob store, such as S3, GCS, or WARC files, compressed with gzip or zstd. Per-URL metadata, such as last fetched, status, ETag, content hash, and blob pointer, lives in a wide-column store or partitioned Postgres. Keep the link graph separate, because its access pattern is traversal, not lookup, and it is the input to importance scoring such as PageRank. One more important point: keep the URL-seen set in a fast in-memory structure, namely a Bloom filter, so the hot dedup check never waits on disk. Push parsed pages onto a stream so that indexers and monitors can consume crawl events independently.
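The in-memory URL-seen check can be sketched as a small Bloom filter sitting in front of an authoritative store. The sizes, the double-hashing scheme built on SHA-256, and the use of a Python `set` as a stand-in for the key-value store are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def seen_before(url, bloom, store):
    """`store` stands in for the authoritative key-value store."""
    if bloom.might_contain(url) and url in store:
        return True              # confirmed by the authoritative store
    bloom.add(url)               # new URL, or a Bloom false positive
    store.add(url)
    return False
```

A "no" from the filter is always correct, so the slower store lookup happens only when the filter answers "maybe".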

Recap Slide 13

So remember six things. One, the crawler is a feedback loop: pages produce links and links seed pages; everything else is plumbing. Two, politeness is non-negotiable, meaning robots.txt, per-host pacing, and honest user agents; lose any one of them and your IP gets banned. Three, dedupe twice, once by URL with canonicalization and a Bloom filter, and once by content with a content hash, because the web holds more copies than originals. Four, partition by host and share the rest, so politeness stays local while the visited set, content store, and DNS stay globally consistent. Five, cap everything, meaning depth, redirects, retries, URLs per host per day, and time per fetch, because only bounded loops terminate. And six, recrawl by importance, not uniformly, because importance times change rate sets the cadence and crawl budget is the scarcest resource.

What's next

That is chapter nine. Try one exercise and think it through. Two pages have different URLs but exactly the same HTML. Which dedup layer catches this, URL-seen or content-seen, and what exactly would you hash? Then flip it: the same URL returns new content the next day. Which layer notices that, and what does it tell the freshness scheduler? Say the answers to both questions out loud to yourself. In the next chapter we will design a notification system.