The same three verbs in Python — plus one superpower the shell doesn’t have.
The SENSEX extraction wasn’t a clever shell one-liner — it was a Python script reaching into 17 archives. In code you get the same list → stream-one → bulk verbs, but you also gain one move the command line can’t make: hold an archive in memory and hand a reader a seekable buffer. That single trick is why a zip-inside-a-zip needed no temp file.
The standard library mirrors L1 exactly. The key difference: read() hands you bytes in memory, not a file on disk. [1]
zipfile: list, stream-one, bulk

    import io
    import zipfile

    # open once; the central directory is read for you
    with zipfile.ZipFile("data.zip") as z:
        names = z.namelist()                  # LIST: names from the index
        data = z.read("SENSEX.csv")           # STREAM ONE: bytes, in RAM (no file written)
        text = data.decode()                  # bytes → str when you need text (UTF-8 by default)
        with z.open("SENSEX.csv") as f:       # or a file-like stream, line by line
            for line in io.TextIOWrapper(f, encoding="utf-8"):
                process(line)                 # process() is a placeholder for your own handler
        z.extractall("out/")                  # BULK: write everything to disk

Remember the rule from Lesson 2: a zip reader must seek, and a pipe can’t. But seekable doesn’t have to mean on disk. RAM can be seekable too. io.BytesIO wraps a bytes object in a file-like object that supports seek(). [2] So you can open a zip that exists only in memory, for instance one you just pulled out of another zip.

    import io
    import zipfile

    # outer_path: path to the outer zip on disk
    with zipfile.ZipFile(outer_path) as oz:
        inner_bytes = oz.read("BSE_IDX_1MIN_20250203.zip")    # inner zip → bytes in RAM
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:      # RAM buffer IS seekable
        csv_bytes = iz.read("SENSEX.csv")                     # reach into the inner zip, done

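To try the nested-zip pattern without any real data, here is a minimal self-contained sketch. It first builds both archives in memory, then reads the inner member back with no temp file. The helpers `make_zip` and `read_nested`, the member name `inner.zip`, and the CSV contents are all made up for illustration; they are not taken from the SENSEX script.

```python
import io
import zipfile


def make_zip(members):
    """Build a zip archive in memory from {name: bytes}; return its raw bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        for name, payload in members.items():
            z.writestr(name, payload)
    return buf.getvalue()


def read_nested(outer, inner_name, member):
    """Read `member` from a zip stored inside another zip, with no temp file.

    `outer` may be a path or any seekable binary file object.
    """
    with zipfile.ZipFile(outer) as oz:
        inner_bytes = oz.read(inner_name)              # inner zip -> bytes in RAM
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:
        return iz.read(member)


# Illustrative data only (hypothetical names and values)
inner = make_zip({"SENSEX.csv": b"time,close\n09:15,77000.5\n"})
outer = make_zip({"inner.zip": inner})

csv_bytes = read_nested(io.BytesIO(outer), "inner.zip", "SENSEX.csv")
print(csv_bytes.decode().splitlines()[0])
```

Because `ZipFile` takes either a path or a file object, the same `read_nested` call should work when `outer` is a filename on disk.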
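tarfile mirrors L2. Because tar is a streaming format with no central index, extractfile() hands you a file-like object that is best read front to back rather than treated as a random-access handle: reaching a position inside a gzip-compressed tar means decompressing everything before it. [3]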
tarfile: works on .tar and .tar.gz alike

    import tarfile

    with tarfile.open("data.tar.gz") as t:      # compression auto-detected
        names = t.getnames()                    # LIST: walks the archive
        f = t.extractfile("dir/SENSEX.csv")     # STREAM ONE: a file-like object
        data = f.read()                         # read its bytes

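The same calls can be exercised end to end without a file on disk. The sketch below builds a small .tar.gz in memory and reads one member back. `make_tar_gz` and `read_member` are hypothetical helpers and the data is invented. One detail worth guarding against: extractfile() returns None for members that are not regular files (directories, for example), and raises KeyError when the name is absent.

```python
import io
import tarfile


def make_tar_gz(members):
    """Build a gzip-compressed tar in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as t:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)            # the tar header must carry the size
            t.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive_bytes, member):
    """Return one member's bytes, or None if it is not a regular file."""
    # Default mode "r" detects the compression transparently.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as t:
        f = t.extractfile(member)               # KeyError if the name is absent
        return f.read() if f is not None else None


# Illustrative data only (hypothetical names and values)
blob = make_tar_gz({"dir/SENSEX.csv": b"time,close\n09:15,77000.5\n"})
data = read_member(blob, "dir/SENSEX.csv")
```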
The in-memory buffer is a superpower with one hard edge. It works because zipfile/tarfile accept a file object. But when the reader is a separate program — like unrar, called via subprocess — it can’t see your Python memory; it needs a real path on disk. No BytesIO will save you there.
That’s the whole logic behind the three handlers in the SENSEX script:
| Situation | Reader | Seekable source? | Temp file? |
|---|---|---|---|
| rar already on disk | unrar p (subprocess) | the file itself | No — path exists |
| zip nested in a zip | zipfile + BytesIO | RAM buffer | No — in memory |
| rar nested in a zip | unrar p (subprocess) | needs a real path | Yes — must spill |
So the heuristic: in-process libraries take a seekable buffer; out-of-process tools take a path. Reach for BytesIO when a Python library will do the reading; reach for a temp file when you must shell out.
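The "must spill" row can be sketched as a small context manager that writes bytes to a temporary file, hands out the path, and cleans up afterwards. This is an illustration under stated assumptions, not the SENSEX script's handler: `spilled` is a hypothetical helper, the payload is fake, and a child Python process stands in for an external tool such as unrar so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def spilled(data, suffix=""):
    """Write bytes to a temporary file and yield its path; delete it on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload" + suffix)
        with open(path, "wb") as fh:
            fh.write(data)
        yield path


# Placeholder bytes; in practice these would come from oz.read(...) on the outer zip.
payload = b"pretend these are the bytes of a nested .rar"

with spilled(payload, suffix=".rar") as path:
    # Stand-in for an external reader: a separate process that only sees the path.
    result = subprocess.run(
        [sys.executable, "-c",
         "import os, sys; print(os.path.getsize(sys.argv[1]))", path],
        capture_output=True, text=True, check=True,
    )

reported_size = int(result.stdout)
```

The child process never touches the parent's memory; everything it learns about the payload arrives through the path, which is why the spill is unavoidable for out-of-process readers.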
You can now reach into archives from Python — zipfile / tarfile, list / read-one / bulk — and you own the in-memory pattern (ZipFile(BytesIO(...))) that opens an archive nested in another without a temp file. You also know its one limit: a subprocess needs a real path. That’s a loader pattern you’ll reuse in every ingestion script.
Python docs: zipfile. Skim ZipFile, namelist(), read(), and open(); note that the ZipFile constructor accepts a file object as well as a path, which is what makes the BytesIO trick work. ~10 minutes. Pair it with tarfile and io.BytesIO.
[1] zipfile. ZipFile.namelist(), read() (returns bytes), open() (file-like), extractall(); accepts a path or a file object.

[2] io.BytesIO. An in-memory binary stream with full seek()/read(): a seekable file that never touches disk.

[3] tarfile. open() (auto-detects gzip/bz2/xz), getnames(), extractfile() (file-like, sequential).