Reaching Into Archives · Lesson 1

The Index Hidden at the End

How you can pull one file out of a zip without unpacking everything else.

An hour ago I reached into 17 archives and pulled the same file out of each one, a single SENSEX.csv, even though every archive held ~150 index CSVs plus a pile of futures and options data. I did not unpack them. The gigabytes stayed compressed; I touched only the file I needed.

This is not a hack; it is a structural fact about how a zip is built. Understand that one fact and the whole technique follows from it: you stop unpacking archives and start reaching into them. This lesson is that fact.

Your mission is general Unix / data-engineering fluency. So we will not memorize commands. We will understand the structure that makes these commands possible, so that you can reason about any archive in a pipeline, not just the one in front of you.

An archive is a container with its own table of contents

A zip is not a folder. It is one file: every member stored back to back, and then, at the very end, a central directory: a list with one record per member, giving its name, its size, and, most importantly, its byte offset within the file.[1]
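To make "one record per member" concrete, here is a minimal Python sketch using the standard-library zipfile module. It builds a tiny throwaway archive in memory (the member names and values are made up for illustration) and reads back the three fields the index stores for each member: name, size, and byte offset.

```python
import io
import zipfile

# Build a small throwaway archive in memory; names and values are illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("NIFTY.csv", "date,close\nd1,50\n")
    zf.writestr("SENSEX.csv", "date,close\nd1,100\nd2,101\n")
    zf.writestr("BANKEX.csv", "date,close\nd1,75\n")

# Opening for reading parses the central directory at the end of the file;
# no member data is decompressed here.
with zipfile.ZipFile(buf) as zf:
    records = [(info.filename, info.file_size, info.header_offset)
               for info in zf.infolist()]

for name, size, offset in records:
    print(f"{name:12} size={size:3} bytes  offset={offset}")
```

Each offset points at the start of that member's local header, which is where a reader seeks when you ask for that name.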

So a reader never scans the whole archive. It jumps to the end, reads the End of Central Directory (EOCD) record, follows it to the index, looks up the name you asked for, and seeks straight to those bytes.[2] The other 149 members are never read.
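Those steps can be sketched by hand in Python with only struct and seek. Treat this as a simplified illustration, not a full reader: the helper name find_member_offset is hypothetical, and the sketch assumes a single-disk archive with no ZIP64 extensions and no archive comment that happens to contain the EOCD signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD_FMT = "<4s4H2LH"      # fixed part of the EOCD record, 22 bytes
CDIR_SIG = b"PK\x01\x02"
CDIR_FMT = "<4s6H3L5H2L"   # fixed part of one central directory record, 46 bytes


def find_member_offset(fp, wanted):
    """Return the byte offset of `wanted`'s local header (hypothetical helper)."""
    # Step 1: jump to the end and look for the EOCD record in the tail.
    fp.seek(0, io.SEEK_END)
    file_size = fp.tell()
    tail_len = min(file_size, 22 + 65535)  # EOCD plus the largest possible comment
    fp.seek(file_size - tail_len)
    tail = fp.read(tail_len)
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no EOCD record found")
    eocd = struct.unpack(EOCD_FMT, tail[pos:pos + 22])
    total_entries, cd_size, cd_offset = eocd[4], eocd[5], eocd[6]

    # Step 2: follow the EOCD to the central directory and read only the index.
    fp.seek(cd_offset)
    cdir = fp.read(cd_size)

    # Step 3: walk the records until the requested name turns up.
    i = 0
    for _ in range(total_entries):
        fields = struct.unpack(CDIR_FMT, cdir[i:i + 46])
        if fields[0] != CDIR_SIG:
            raise ValueError("corrupt central directory")
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        name = cdir[i + 46:i + 46 + name_len].decode("utf-8")
        if name == wanted:
            return fields[16]  # relative offset of the local header
        i += 46 + name_len + extra_len + comment_len
    raise KeyError(wanted)


# Throwaway archive for the demo; member names are illustrative.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("NIFTY.csv", "SENSEX.csv", "BANKEX.csv"):
        zf.writestr(name, "date,close\n")

offset = find_member_offset(buf, "SENSEX.csv")
buf.seek(offset)
signature = buf.read(4)  # a local file header starts with b"PK\x03\x04"
print(offset, signature)
```

In practice you would let unzip or a library do this for you; the point is only that the lookup touches the tail of the file and the index, never the other members.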

[Figure: one .zip file, bytes running left to right. Member 1 (a CSV), member 2 (a CSV), ⋯, member k (SENSEX.csv), ⋯, member 150, then the CENTRAL DIRECTORY (names + offsets: the index), then the EOCD end marker. ① The reader starts at the EOCD, ② goes to the index, ③ uses the offset to seek straight to SENSEX.csv.]
The index lives at the end. The reader hops there first, then seeks to the member it names: it reads the index plus one file, not the whole archive.

The consequence: random access

Because the index stores offsets, you get random access: the ability to jump to any byte without reading what came before it. The payoff is concrete: with plain sequential reading you would scan, on average, half the archive to find one file; with the index you read only the index and that one file.[2]
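You can measure that payoff directly. The sketch below is an illustration under assumptions: CountingBytesIO is a hypothetical helper, and the archive is a synthetic one with 150 uncompressed members. It counts how many bytes Python's zipfile module actually reads to fetch a single member and compares that with the total archive size.

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that tallies how many bytes callers read (hypothetical helper)."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


# Synthetic archive: 150 stored (uncompressed) members of 10,000 bytes each.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w", compression=zipfile.ZIP_STORED) as zf:
    for k in range(150):
        zf.writestr(f"member_{k:03}.csv", b"x" * 10_000)

archive_bytes = raw.getvalue()
total = len(archive_bytes)
archive = CountingBytesIO(archive_bytes)

# Fetch one member from deep inside the archive.
with zipfile.ZipFile(archive) as zf:
    data = zf.read("member_120.csv")

print(f"archive size: {total} bytes")
print(f"bytes read:   {archive.bytes_read} bytes")
```

The bytes read should come out as roughly the index plus one member, a small fraction of the archive, no matter where in the file that member sits.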

This one property separates two operations that people often conflate:

Three verbs you actually type

Every archive tool offers the same three verbs. Learn them as a set (list, stream-one, bulk) and you can reach into zip and rar the same way:

The same three moves, two tools. list → stream-one → bulk
# 1 · LIST: read only the index (the central directory). Always do this first.
unzip -l  data.zip                 # names + sizes, instantly
unrar l   data.rar

# 2 · STREAM ONE member to stdout: decompress just that file, nothing else written to disk
unzip -p  data.zip  SENSEX.csv  > SENSEX.csv
unrar p -inul  data.rar  SENSEX.csv  > SENSEX.csv   # -inul keeps unrar's status messages out of the output

# 3 · BULK extract everything to a folder: the default, usually more than you need
unzip -d  out/  data.zip
unrar x   data.rar  out/
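The same three verbs exist outside the shell. As a hedged sketch, here they are in Python's standard-library zipfile module, run against a throwaway in-memory archive with made-up contents. It covers zip only, since the standard library has no rar reader.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Throwaway archive for the demo; names and values are illustrative.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("SENSEX.csv", "date,close\nd1,100\nd2,101\n")
    zf.writestr("NIFTY.csv", "date,close\nd1,50\n")

with zipfile.ZipFile(buf) as zf:
    # 1 · LIST: answered from the central directory alone (like unzip -l).
    names = zf.namelist()

    # 2 · STREAM ONE: a file-like object over a single member (like unzip -p).
    with zf.open("SENSEX.csv") as member:
        first_line = member.readline().decode("utf-8").rstrip("\n")

    # 3 · BULK: write every member to a folder (like unzip -d).
    with tempfile.TemporaryDirectory() as out_dir:
        zf.extractall(out_dir)
        extracted = sorted(p.name for p in Path(out_dir).iterdir())

print(names)
print(first_line)
print(extracted)
```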

The real move: stream, don't extract

Verb 2 is the one that changes how you work. -p / p send the member's bytes to stdout instead of writing a file,[3] so you can pipe them straight into the next program and get the job done without writing a single temp file:

Reach in, inspect, never touch the disk
# Peek at the first rows of one member inside a 50 MB archive
unzip -p data.zip SENSEX.csv | head

# Count rows, or feed straight into awk / a program: no extraction step
unzip -p data.zip SENSEX.csv | wc -l
unzip -p data.zip SENSEX.csv | awk -F, '$2 > 80000'

This is exactly what drove the extraction you saw: unzip -p for the months stored as zip, unrar p for the months stored as rar. One member streamed out of each archive, the other ~149 untouched. Extracting would have written gigabytes of files that I would then have deleted. Streaming wrote only what I asked for.
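The same one-member-per-archive loop can be sketched in Python for the zip case. This is an illustration under assumptions, not the script that was actually run: the archive names, the CSV layout, and the helper stream_member_rows are hypothetical, and rar archives are left out because the standard library cannot read them.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def stream_member_rows(archive_path, member_name):
    """Yield CSV rows from one member without extracting anything to disk."""
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text)


with tempfile.TemporaryDirectory() as tmp:
    # Build three throwaway monthly archives; names and values are made up.
    for month in ("2024-01", "2024-02", "2024-03"):
        with zipfile.ZipFile(Path(tmp) / f"{month}.zip", "w") as zf:
            zf.writestr("SENSEX.csv",
                        f"date,close\n{month}-01,100\n{month}-02,101\n")
            zf.writestr("OTHER.csv", "date,close\n")

    # Reach into each archive and stream only SENSEX.csv.
    row_counts = {}
    for path in sorted(Path(tmp).glob("*.zip")):
        rows = list(stream_member_rows(path, "SENSEX.csv"))
        row_counts[path.name] = len(rows) - 1  # minus the header row

print(row_counts)
```

Because the helper is a generator, rows can be handed to the next stage one at a time, the way a shell pipe hands lines to head or awk.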

Your win for today

You can now (1) explain why pulling one file out of an archive is cheap (the central directory gives random access) and (2) reach in three ways: -l to list, -p to stream one member into a pipe, -d to bulk-extract. You have stopped thinking “unzip the whole thing” and started thinking “seek to the file you need.”

Recall check

Don't reread; retrieve. Effortful recall is what turns this into a memory that lasts until next week. Answer from your own head; feedback comes immediately.

Primary source: read this next

Wikipedia, ZIP (file format), “Structure” section. The clearest accessible account of the central directory, the EOCD record, and how placing the index at the end gives random access. ~10 minutes. For authoritative byte-level detail, the spec itself is PKWARE's APPNOTE.TXT.

I am your teacher, so use me. Want to try this on one of your own archives (your BSE zips, some .tar.gz, a node_modules tarball)? Paste the path and I will show -l / -p on it. Curious why a .tar.gz cannot do this cheaply, or why that nested .rar needed a temp file? That is Lesson 2, but ask now if it is bugging you.

Sources

  1. PKWARE, APPNOTE.TXT, .ZIP File Format Specification. Defines the central directory and EOCD records (names, sizes, relative offsets).
  2. Wikipedia, ZIP (file format). “A directory placed at the end … identifies what files are in the ZIP and where … allowing a file listing without reading the entire archive.”
  3. unzip(1) man page (Info-ZIP). -p extracts to stdout (pipe); -l lists; -d sets the extraction directory.