Reaching Into Archives · Lesson 3

Reaching in from code

The same three verbs in Python, plus one superpower the shell doesn't have.

The SENSEX extraction wasn't a clever shell one-liner; it was a Python script reaching into 17 archives. In code you get the same list → stream-one → bulk verbs, plus one move the command line can't make: hold an archive in memory and hand the reader a seekable buffer. That single trick is why a zip-inside-a-zip never needed a temp file.

Your mission: Data-engineering fluency lives in code, not just in the terminal. Ingestion pipelines reach into archives programmatically, and this is the lesson that turns shell intuition into a Python pattern you'll reuse in every loader.

The same three verbs, in Python

The standard library mirrors Lesson 1 exactly. The key difference: read() gives you bytes in memory, not a file on disk.[1]

zipfile: list, stream-one, bulk
import io
import zipfile

def process(line):                       # placeholder: swap in your own per-line logic
    pass

# open once; the central directory is read for you
with zipfile.ZipFile("data.zip") as z:
    names = z.namelist()                 # LIST: names from the index

    data  = z.read("SENSEX.csv")         # STREAM ONE: bytes, in RAM (no file written)
    text  = data.decode()                # bytes -> str when you need text (UTF-8 by default)

    with z.open("SENSEX.csv") as f:      # or a file-like stream, line by line
        for line in io.TextIOWrapper(f, encoding="utf-8"):
            process(line)

    z.extractall("out/")                 # BULK: write everything to disk
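
If you want to try the three verbs without a real data.zip on hand, here is a minimal sketch that builds a throwaway archive in a temp directory first. The file names and CSV rows are made up for illustration; they are not the SENSEX data.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Hypothetical sample content, invented for this sketch.
SAMPLE = {
    "SENSEX.csv": "ts,close\n09:15,77000\n09:16,77010\n",
    "README.txt": "demo\n",
}

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "data.zip"
    with zipfile.ZipFile(zip_path, "w") as z:       # build a throwaway archive
        for name, body in SAMPLE.items():
            z.writestr(name, body)

    with zipfile.ZipFile(zip_path) as z:
        names = z.namelist()                        # LIST
        raw = z.read("SENSEX.csv")                  # STREAM ONE: bytes in RAM
        with z.open("SENSEX.csv") as f:             # STREAM ONE: line by line
            lines = [ln.rstrip("\n") for ln in io.TextIOWrapper(f, encoding="utf-8")]
        out_dir = Path(tmp) / "out"
        z.extractall(out_dir)                       # BULK: everything to disk
        extracted = sorted(p.name for p in out_dir.iterdir())
```

Only the last verb touches the disk; names, raw, and lines all live in memory.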

Superpower: a seekable buffer in memory

Recall the rule from Lesson 2: a zip reader has to seek, and a pipe can't. But seekable doesn't have to mean on disk; RAM can be seekable too. io.BytesIO wraps a bunch of bytes in a file-like object that supports seek().[2] So you can open a zip that exists only in memory, for example one you just pulled out of another zip.
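
To see the seekable part for yourself, here is a small sketch that builds a tiny zip purely in RAM and then reopens it from a BytesIO. The member name and its text are invented for the demo.

```python
import io
import zipfile

# Build a tiny zip entirely in RAM (contents are made up for this sketch).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hi from RAM\n")

zip_bytes = buf.getvalue()               # plain bytes, as if they came from elsewhere

ram_file = io.BytesIO(zip_bytes)         # wrap the bytes in a file-like object
can_seek = ram_file.seekable()           # BytesIO reports that it supports seek()
ram_file.seek(-4, io.SEEK_END)           # jump near the end, much as a zip reader does
tail_pos = ram_file.tell()
ram_file.seek(0)                         # and back to the start

with zipfile.ZipFile(ram_file) as z:     # ZipFile takes the buffer in place of a path
    names = z.namelist()
    text = z.read("hello.txt").decode()
```

A pipe would fail at the seek() calls; the buffer does not, and that is all the zip reader needs.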

[Figure: outer.zip (on disk) → oz.read() → BytesIO(bytes) (seekable, in RAM) → wrap → ZipFile(buf) (reads the inner zip) → .read() → SENSEX.csv]
Zip-inside-a-zip, entirely in memory: the inner bytes become a seekable buffer, so no temp file is ever written.
The real move from the SENSEX job (Feb 2025: daily zips nested inside an outer zip)
import io
import zipfile

# outer_path: path to the outer zip on disk, supplied by the caller
with zipfile.ZipFile(outer_path) as oz:                  # outer zip on disk
    inner_bytes = oz.read("BSE_IDX_1MIN_20250203.zip")  # inner zip -> bytes in RAM
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:  # RAM buffer IS seekable
        csv = iz.read("SENSEX.csv")                     # reach into the inner zip, done
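
The same move can be wrapped as a reusable helper. This is a sketch, not the SENSEX script's actual code: make_zip and read_nested are hypothetical names introduced here, and the CSV row is made up so the example runs without any files on disk.

```python
import io
import zipfile

def make_zip(files):
    """Return the bytes of a zip holding {name: bytes} (helper for this sketch only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, body in files.items():
            z.writestr(name, body)
    return buf.getvalue()

def read_nested(outer, inner_name, member):
    """Read `member` from the zip `inner_name` stored inside the zip `outer`.

    `outer` may be a path or a seekable file object; nothing is written to disk.
    """
    with zipfile.ZipFile(outer) as oz:
        inner_bytes = oz.read(inner_name)                  # inner zip -> bytes in RAM
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:   # RAM buffer is seekable
        return iz.read(member)

# Made-up payload standing in for one day's file.
inner = make_zip({"SENSEX.csv": b"ts,close\n09:15,77000\n"})
outer = make_zip({"BSE_IDX_1MIN_20250203.zip": inner})
csv_bytes = read_nested(io.BytesIO(outer), "BSE_IDX_1MIN_20250203.zip", "SENSEX.csv")
```

Note that the whole inner archive sits in RAM at once, so this suits archives that comfortably fit in memory.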

Tar in code: streaming, file-like members

tarfile mirrors Lesson 2. Because tar is a streaming format with no central index, extractfile() gives you a file-like stream meant to be read sequentially, not an index you can jump around in.[3]

tarfile: the same on both .tar and .tar.gz
import tarfile

with tarfile.open("data.tar.gz") as t:     # compression auto-detected
    names = t.getnames()                   # LIST: walks the archive
    f = t.extractfile("dir/SENSEX.csv")    # STREAM ONE: a file-like object (None for non-file members)
    data = f.read()                        # read its bytes
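
tarfile also accepts a file object through its fileobj argument, so the in-memory idea carries over. Below is a hedged sketch: make_tar_gz and read_member are hypothetical helpers, and the archive contents are invented so the block runs on its own.

```python
import io
import tarfile

def make_tar_gz(files):
    """Return the bytes of a .tar.gz holding {name: bytes} (helper for this sketch only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as t:
        for name, body in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(body)                  # tar headers carry the size up front
            t.addfile(info, io.BytesIO(body))
    return buf.getvalue()

def read_member(tar_bytes, member):
    """Return one member's bytes, or None if it is not a regular file."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as t:   # default mode: compression auto-detected
        f = t.extractfile(member)                  # None for directories and similar
        return None if f is None else f.read()

# Made-up payload for the demo.
archive = make_tar_gz({"dir/SENSEX.csv": b"ts,close\n09:15,77000\n"})
with tarfile.open(fileobj=io.BytesIO(archive)) as t:
    names = t.getnames()                           # LIST
data = read_member(archive, "dir/SENSEX.csv")      # STREAM ONE
```

The None check matters in real archives, where directory entries show up in getnames() alongside files.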

Where even code can't avoid a temp file

The in-memory buffer is a superpower with one hard edge. It works because zipfile and tarfile accept a file object. But when the reader is a separate program, say unrar called through subprocess, it can't see your Python memory; it needs a real path on disk. No BytesIO will save you there.

This is the whole logic behind the three handlers in the SENSEX script:

Situation            | Reader                | Seekable source?   | Temp file?
rar already on disk  | unrar p (subprocess)  | the file itself    | No: it already has a path
zip nested in a zip  | zipfile + BytesIO     | RAM buffer         | No: stays in memory
rar nested in a zip  | unrar p (subprocess)  | needs a real path  | Yes: had to spill to disk

So the heuristic: in-process libraries take a seekable buffer; out-of-process tools take a path. Reach for BytesIO when a Python library will do the reading; reach for a temp file when you have to shell out.
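
The spill side of that heuristic might look like the sketch below. It is an illustration under assumptions, not the SENSEX script's handler: run_on_spilled_bytes and size_cmd are hypothetical names, and a child Python process stands in for the external reader so the example runs without unrar installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_on_spilled_bytes(data, build_cmd, suffix=""):
    """Write `data` to a temp file, run an external command on its path, return stdout.

    `build_cmd` is a hypothetical callback that turns the temp path into an argv list.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"spill{suffix}"
        path.write_bytes(data)                     # the spill: RAM -> real file
        result = subprocess.run(
            build_cmd(str(path)), capture_output=True, check=True
        )
        return result.stdout                       # temp dir is removed on exit

def size_cmd(path):
    """Stand-in for an external reader: a child process that prints the file size."""
    return [sys.executable, "-c",
            "import os, sys; print(os.path.getsize(sys.argv[1]))", path]

reported = run_on_spilled_bytes(b"x" * 1234, size_cmd, suffix=".rar")
```

For the nested-rar case, build_cmd would presumably return something shaped like ["unrar", "p", path, member]; the exact arguments are an assumption here, since the lesson only names unrar p.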

Your win today

You can now reach into archives from Python (zipfile / tarfile, list / read-one / bulk), and you have the in-memory pattern, ZipFile(BytesIO(...)), which opens an archive nested inside another without a temp file. You also know its one limit: a subprocess needs a real path. This is a loader pattern you'll reuse in every ingestion script.

Primary source: read this next

Python docs: zipfile. Skim ZipFile, namelist(), read(), and open(); note that the ZipFile constructor accepts a file object as well as a path, which is what makes the BytesIO trick work. About 10 minutes. Pair it with tarfile and io.BytesIO.

I'm your teacher, so use me. Want to wire this into your backtester's data loader? Show me how it ingests the BSE archives and I'll refactor it to stream members instead of extracting them to disk, or to walk a folder of nested zips in memory. Just ask.

Sources

  1. Python Standard Library: zipfile. ZipFile.namelist(), read() (returns bytes), open() (file-like), extractall(); ZipFile accepts a path or a file object.
  2. Python Standard Library: io.BytesIO. An in-memory binary stream with full seek()/read() support; a seekable file that never touches disk.
  3. Python Standard Library: tarfile. open() (gzip/bz2/xz auto-detected), getnames(), extractfile() (file-like, sequential).