The same three verbs in Python, plus one superpower the shell doesn't have.
The SENSEX extraction wasn't some clever shell one-liner: it was a Python script reaching into 17 archives. In code you get the same list, stream-one, and bulk verbs, but you also get a move the command line can't make: hold an archive in memory and hand the reader a seekable buffer. That one trick is why a zip-inside-a-zip never needed a temp file.
The standard library mirrors L1 exactly. The key difference: read() gives you bytes in memory, not a file on disk.[1]
zipfile: list, stream-one, bulk

    import zipfile, io

    # open once; the central directory is read for you
    with zipfile.ZipFile("data.zip") as z:
        names = z.namelist()             # LIST: names from the index
        data = z.read("SENSEX.csv")      # STREAM ONE: bytes, in RAM (no file written)
        text = data.decode()             # bytes -> str when you need text
        with z.open("SENSEX.csv") as f:  # or a file-like stream, line by line
            for line in io.TextIOWrapper(f):
                process(line)
        z.extractall("out/")             # BULK: write everything to disk
Remember the rule from Lesson 2: a zip reader has to seek, and a pipe can't. But seekable doesn't have to mean on disk: RAM can be seekable too. io.BytesIO wraps a bunch of bytes in a file-like object that supports seek().[2] So you can open a zip that lives only in memory, for example one you just pulled out of another zip.
    with zipfile.ZipFile(outer_path) as oz:                     # outer zip on disk
        inner_bytes = oz.read("BSE_IDX_1MIN_20250203.zip")      # inner zip -> bytes in RAM
        with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:    # RAM buffer IS seekable
            csv = iz.read("SENSEX.csv")                         # reach into the inner zip, done
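The nested-zip pattern above can be exercised end to end without any file on disk. This is a minimal sketch, assuming Python 3.9+; `make_zip`, `read_nested`, the member name `inner.zip`, and the sample CSV row are hypothetical names and data invented for illustration, not taken from the SENSEX script.

```python
import io
import zipfile


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build a zip archive in memory and return its raw bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        for name, payload in files.items():
            z.writestr(name, payload)
    # Closing the ZipFile does not close the buffer we passed in.
    return buf.getvalue()


def read_nested(outer: bytes, inner_name: str, member: str) -> bytes:
    """Read `member` from a zip stored inside another zip, never touching disk."""
    with zipfile.ZipFile(io.BytesIO(outer)) as oz:
        inner_bytes = oz.read(inner_name)  # inner archive as bytes in RAM
        with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:
            return iz.read(member)


# Hypothetical sample data, only to exercise the pattern.
inner = make_zip({"SENSEX.csv": b"time,close\n09:15,77000\n"})
outer = make_zip({"inner.zip": inner})
csv_bytes = read_nested(outer, "inner.zip", "SENSEX.csv")
```

Each level costs one `read()` plus one `BytesIO` wrap, so every nested archive is held fully in RAM while it is open. That is fine for small archives and worth reconsidering for very large ones.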
tarfile mirrors L2. Because tar is a streaming format, extractfile() gives you a file-like stream that you read sequentially rather than by random access.[3]
tarfile: the same on both .tar and .tar.gz

    import tarfile

    with tarfile.open("data.tar.gz") as t:      # compression auto-detected
        names = t.getnames()                    # LIST: walks the archive
        f = t.extractfile("dir/SENSEX.csv")     # STREAM ONE: a file-like object
        data = f.read()                         # read its bytes
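The tarfile calls above can be tried on an archive built in memory, reading the member line by line instead of in one `read()`. This is a minimal sketch, assuming Python 3.9+; `make_tar_gz`, `read_lines`, and the sample CSV row are hypothetical and exist only for illustration. It also guards against `extractfile()` returning `None`, which happens when the member exists but is not a regular file (a directory, for example).

```python
import io
import tarfile


def make_tar_gz(files: dict[str, bytes]) -> bytes:
    """Build a gzip-compressed tar archive in memory and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as t:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # tar headers need the size up front
            t.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_lines(archive: bytes, member: str) -> list[str]:
    """Stream one member of a tar archive as text lines, in order."""
    # The default read mode detects gzip/bz2/xz compression automatically.
    with tarfile.open(fileobj=io.BytesIO(archive)) as t:
        f = t.extractfile(member)
        if f is None:
            raise ValueError(f"{member!r} is not a regular file")
        with io.TextIOWrapper(f, encoding="utf-8") as text:
            return [line.rstrip("\n") for line in text]


# Hypothetical sample data, only to exercise the pattern.
archive = make_tar_gz({"dir/SENSEX.csv": b"time,close\n09:15,77000\n"})
lines = read_lines(archive, "dir/SENSEX.csv")
```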
The in-memory buffer is a superpower with one hard edge. It works because zipfile and tarfile accept a file object. But when the reader is a separate program, such as unrar called via subprocess, it can't see your Python memory; it needs a real path on disk. No BytesIO will save you there.
That is the whole logic behind the three handlers in the SENSEX script:
| Situation | Reader | Seekable source? | Temp file? |
|---|---|---|---|
| rar already on disk | unrar p (subprocess) | the file itself | No, it already has a path |
| zip nested in a zip | zipfile + BytesIO | RAM buffer | No, it stays in memory |
| rar nested in a zip | unrar p (subprocess) | needs a real path | Yes, it had to be spilled to disk |
So the heuristic: in-process libraries take a seekable buffer; out-of-process tools take a path. Reach for BytesIO when a Python library will do the reading; reach for a temp file when you have to shell out.
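The temp-file half of that heuristic can be sketched as a small helper that spills bytes to disk, hands the path to an external command, and cleans up afterwards. This is only an illustration under stated assumptions: `spilled` and `run_on_bytes` are hypothetical names, and a child Python process stands in for the external reader so the example runs without unrar installed. It is not the SENSEX script's actual handler.

```python
import os
import subprocess
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def spilled(data: bytes, suffix: str = ""):
    """Write `data` to a temp file, yield its path, delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # The file is closed here, so another process can open it by path.
        yield path
    finally:
        os.remove(path)


def run_on_bytes(cmd_prefix: list[str], data: bytes, suffix: str = "") -> bytes:
    """Run an external tool on in-memory bytes by giving it a real path."""
    with spilled(data, suffix) as path:
        result = subprocess.run(cmd_prefix + [path], capture_output=True, check=True)
    return result.stdout


# Stand-in for an out-of-process reader: a child Python that prints the file size.
# With a real tool the prefix would be something like ["unrar", "p"], followed by
# the archive path and the member name; that argument layout is an assumption here.
stand_in = [sys.executable, "-c", "import os, sys; print(os.path.getsize(sys.argv[1]))"]
out = run_on_bytes(stand_in, b"x" * 1234, suffix=".rar")
```

`mkstemp` plus an explicit close is used instead of an open `NamedTemporaryFile` because on some platforms (notably Windows) a second process cannot open a temp file that is still held open by the first.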
Now you can reach into archives from Python (zipfile / tarfile, list / read-one / bulk), and you have the in-memory pattern, ZipFile(BytesIO(...)), that opens one archive nested inside another without a temp file. You also know its one limit: a subprocess needs a real path. This is a loader pattern you will reuse in every ingestion script.
Retrieve, don't re-read. Answer from memory; feedback is immediate.
Python docs: zipfile. Skim ZipFile, namelist(), read(), and open(); note that ZipFile accepts a file object as well as a path, which is what makes the BytesIO trick work. ~10 minutes. Read tarfile and io.BytesIO alongside it.
[1] zipfile. ZipFile.namelist(), read() (returns bytes), open() (file-like), extractall(); accepts a path or a file object.
[2] io.BytesIO. An in-memory binary stream with full seek()/read(): a seekable file that never touches disk.
[3] tarfile. open() (auto-detects gzip/bz2/xz), getnames(), extractfile() (file-like, sequential).