Reaching Into Archives · Lesson 3 of the course

Reaching In From Code

The same three verbs in Python — plus one superpower the shell doesn’t have.

The SENSEX extraction wasn’t a clever shell one-liner — it was a Python script reaching into 17 archives. In code you get the same list → stream-one → bulk verbs, but you also gain one move the command line can’t make: hold an archive in memory and hand a reader a seekable buffer. That single trick is why a zip-inside-a-zip needed no temp file.

Your mission

Data-engineering fluency lives in code, not just the terminal. Ingestion pipelines reach into archives programmatically, and this is the lesson that turns the shell intuition into a Python pattern you’ll reuse in every loader you write.

The same three verbs, in Python

The standard library mirrors Lesson 1 exactly. The key difference: read() hands you bytes in memory, not a file on disk. [1]

zipfile — list, stream-one, bulk

import zipfile, io

def process(line):                       # placeholder: swap in your own per-line handler
    pass

# open once; the central directory is read for you
with zipfile.ZipFile("data.zip") as z:
    names = z.namelist()                 # LIST: names from the index

    data  = z.read("SENSEX.csv")         # STREAM ONE: bytes, in RAM (no file written)
    text  = data.decode("utf-8")         # bytes → str (assumes the CSV is UTF-8)

    with z.open("SENSEX.csv") as f:      # or a file-like stream, line by line
        for line in io.TextIOWrapper(f, encoding="utf-8"):
            process(line)

    z.extractall("out/")                 # BULK: write everything to disk
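To try the three verbs without a real data file, the sketch below writes a tiny zip into a temporary directory and reads it back. The member names and CSV rows are made up for illustration, and `first_column` is a hypothetical helper, not part of zipfile. It also assumes the text is UTF-8.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def first_column(z: zipfile.ZipFile, member: str) -> list[str]:
    """Stream one member row by row and collect its first column."""
    with z.open(member) as raw:
        # newline="" lets the csv module handle line endings itself
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return [row[0] for row in csv.reader(text)]


with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "data.zip"

    # Build a throwaway archive with two illustrative members.
    with zipfile.ZipFile(zip_path, "w") as z:
        z.writestr("SENSEX.csv", "time,close\n09:15,100.0\n09:16,101.5\n")
        z.writestr("NOTES.txt", "illustrative data only\n")

    with zipfile.ZipFile(zip_path) as z:
        names = z.namelist()                    # LIST
        raw_bytes = z.read("SENSEX.csv")        # STREAM ONE, as bytes in RAM
        times = first_column(z, "SENSEX.csv")   # STREAM ONE, row by row
        out_dir = Path(tmp) / "out"
        z.extractall(out_dir)                   # BULK, written to disk
        extracted = sorted(p.name for p in out_dir.iterdir())
```

Only the last call touches the disk on the read side; `names`, `raw_bytes`, and `times` all come straight out of the archive.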

The superpower: a seekable buffer in memory

Remember the rule from Lesson 2: a zip reader must seek, and a pipe can’t. But seekable doesn’t have to mean on disk; RAM can be seekable too. io.BytesIO wraps a bytes object in a file-like object that supports seek(). [2] So you can open a zip that exists only in memory, for instance one you just pulled out of another zip.

Zip-inside-a-zip, entirely in memory — the inner bytes become a seekable buffer, so no temp file is ever written.

The actual move from the SENSEX job (Feb 2025: daily zips nested in one outer zip)

import zipfile, io

# outer_path: path to the outer zip on disk, supplied by the caller
with zipfile.ZipFile(outer_path) as oz:                    # outer zip on disk
    inner_bytes = oz.read("BSE_IDX_1MIN_20250203.zip")     # inner zip → bytes in RAM
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as iz:   # RAM buffer IS seekable
        csv_bytes = iz.read("SENSEX.csv")                  # reach into the inner zip: done
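The same move generalizes to any nesting depth. The sketch below is one possible way to walk nested zips entirely in memory; `make_zip` and `iter_leaf_members` are hypothetical helpers, the `!/` path separator is an arbitrary choice, and the archive contents are made up so the example runs on its own.

```python
import io
import zipfile


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build a zip in memory and return its raw bytes (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name, payload in members.items():
            z.writestr(name, payload)
    return buf.getvalue()


def iter_leaf_members(source, prefix=""):
    """Yield (path, bytes) for every non-zip member, descending into nested zips.

    `source` is anything ZipFile accepts: a path or a seekable file object.
    """
    with zipfile.ZipFile(source) as z:
        for info in z.infolist():
            if info.is_dir():
                continue
            data = z.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # The inner archive exists only as bytes; BytesIO makes it seekable.
                yield from iter_leaf_members(io.BytesIO(data), prefix=path + "!/")
            else:
                yield path, data


inner = make_zip({"SENSEX.csv": b"time,close\n09:15,100.0\n"})
outer = make_zip({"BSE_IDX_1MIN_20250203.zip": inner, "README.txt": b"demo\n"})
found = dict(iter_leaf_members(io.BytesIO(outer)))
```

The trade-off is memory: each nested archive is held in RAM in full while it is being read, which is fine for small daily files and worth reconsidering for very large ones.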

tar in code: streaming, file-like members

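tarfile mirrors Lesson 2. Because tar has no central index, finding a member means walking the archive, and extractfile() hands you a file-like object that you normally read front to back. It returns None for members that are not regular files, such as directories. [3]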

tarfile — works on .tar and .tar.gz alike

import tarfile

with tarfile.open("data.tar.gz") as t:     # compression auto-detected
    names = t.getnames()                   # LIST: walks the archive
    f = t.extractfile("dir/SENSEX.csv")    # STREAM ONE: a file-like object
    if f is None:                          # None when the member is not a regular file
        raise ValueError("dir/SENSEX.csv is not a regular file")
    data = f.read()                        # read its bytes
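tarfile accepts a file object as well, through the `fileobj` argument of `tarfile.open()`. As a minimal sketch, the example below builds a small .tar.gz in memory and reads one member back; `make_tar_gz` and `read_member` are hypothetical helpers and the contents are made up.

```python
import io
import tarfile


def make_tar_gz(members: dict[str, bytes]) -> bytes:
    """Build a gzip-compressed tar in memory and return its bytes (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as t:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)          # tar headers need the size up front
            t.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive_bytes: bytes, name: str) -> bytes:
    """Return one member's bytes from a tar held in memory."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as t:
        f = t.extractfile(name)               # raises KeyError if the name is absent
        if f is None:                         # directories and similar entries
            raise ValueError(f"{name!r} is not a regular file")
        with f:
            return f.read()


archive = make_tar_gz({"dir/SENSEX.csv": b"time,close\n09:15,100.0\n"})
with tarfile.open(fileobj=io.BytesIO(archive)) as t:
    names = t.getnames()
data = read_member(archive, "dir/SENSEX.csv")
```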

Where code still can’t dodge a temp file

The in-memory buffer is a superpower with one hard edge. It works because zipfile/tarfile accept a file object. But when the reader is a separate program — like unrar, called via subprocess — it can’t see your Python memory; it needs a real path on disk. No BytesIO will save you there.

That’s the whole logic behind the three handlers in the SENSEX script:

Situation	Reader	Seekable source?	Temp file?
rar already on disk	`unrar p` (subprocess)	the file itself	No — path exists
zip nested in a zip	`zipfile` + `BytesIO`	RAM buffer	No — in memory
rar nested in a zip	`unrar p` (subprocess)	needs a real path	Yes — must spill

So the heuristic: in-process libraries take a seekable buffer; out-of-process tools take a path. Reach for BytesIO when a Python library will do the reading; reach for a temp file when you must shell out.
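The "must spill" case can be sketched as follows. This is an illustration under assumptions, not the SENSEX script itself: `run_tool_on_bytes` is a hypothetical helper, and a child Python process that reports a file's size stands in for the external reader, so the example runs without unrar installed.

```python
import os
import subprocess
import sys
import tempfile


def run_tool_on_bytes(data: bytes, argv_prefix: list[str], suffix: str = "") -> bytes:
    """Spill `data` to a temp file, run argv_prefix + [path], return the tool's stdout."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The file is closed before the child starts, so the child can open it
        # on platforms that refuse to share open files.
        result = subprocess.run(argv_prefix + [path], capture_output=True, check=True)
        return result.stdout
    finally:
        os.remove(path)                       # clean up even if the tool fails


# Stand-in for an out-of-process reader: prints the size of the file it is given.
size_tool = [sys.executable, "-c", "import os, sys; print(os.path.getsize(sys.argv[1]))"]
reported = run_tool_on_bytes(b"not really a rar", size_tool, suffix=".rar")
```

In the nested-rar situation from the table, `data` would be the inner rar's bytes read from the outer zip, and the command would be the `unrar p` call; its exact flags and member arguments are left out here.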

Your win today

You can now reach into archives from Python — zipfile / tarfile, list / read-one / bulk — and you own the in-memory pattern (ZipFile(BytesIO(...))) that opens an archive nested in another without a temp file. You also know its one limit: a subprocess needs a real path. That’s a loader pattern you’ll reuse in every ingestion script.

Primary source — read this next

Python docs: zipfile. Skim ZipFile, namelist(), read(), and open(); note that the ZipFile constructor accepts either a path or a file-like object, which is what makes the BytesIO trick work. ~10 minutes. Pair it with tarfile and io.BytesIO.

I’m your teacher — use me. Want to wire this into your backtester’s data loader? Show me how it ingests the BSE archives and I’ll refactor it to stream members instead of extracting to disk — or to walk a folder of nested zips in memory. Ask away.

Sources

[1] Python Standard Library: zipfile. ZipFile.namelist(), read() (returns bytes), open() (file-like), extractall(); the ZipFile constructor accepts a path or a file object.
[2] Python Standard Library: io.BytesIO. An in-memory binary stream with full seek()/read(), a seekable file that never touches disk.
[3] Python Standard Library: tarfile. open() (auto-detects gzip/bz2/xz), getnames(), extractfile() (file-like; returns None for members that are not regular files).