Reaching Into Archives · Lesson 1 of the course

The Index at the End

Why you can pull one file out of a zip without unpacking the rest.

An hour ago I reached into 17 archives and pulled out one file from each (a single SENSEX.csv), even though every archive held ~150 index CSVs plus a pile of futures and options. I never unpacked them. Gigabytes stayed compressed; I touched only the file I wanted.

That isn't a trick: it's a structural fact about how a zip is built. Learn the one fact and the whole technique falls out of it: you stop unpacking archives and start reaching into them. This lesson is that fact.

Your mission: general Unix / data-engineering fluency. So we won't memorise commands; we'll understand the structure that makes the commands possible, so you can reason about any archive in a pipeline, not just the one in front of you.

An archive is a container with a table of contents

A zip is not a folder. It's one file: every member stored back-to-back and then, at the very end, a central directory, which is a list with one record per member giving its name, its size, and its byte offset inside the file.[1]

So a reader never scans the whole archive. It jumps to the end, reads the End of Central Directory (EOCD) record, follows it to the index, looks up the one name you asked for, and seeks straight to those bytes.[2] The other 149 members are never read.

[Figure: one .zip file, bytes running left to right. Members 1 through 150 (each a CSV, SENSEX.csv among them) are followed by the central directory (names + offsets, the index) and then the EOCD end marker. The reader (1) starts at the EOCD, (2) follows it to the index, and (3) uses the stored offset to seek straight to SENSEX.csv.]
The index lives at the end. The reader hops to it first, then seeks back to the one member it names, reading the index plus one file, not the whole archive.
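
To make the three hops concrete, here is a minimal Python sketch that reads the index by hand. It is an illustration under stated assumptions, not how any particular tool is implemented: the archive is built in memory, the member names and values are made up, and the hypothetical helper `read_index` assumes no archive comment, no ZIP64, and a single-disk archive.

```python
import io
import struct
import zipfile

# Build a tiny archive in memory. Names and values are made up for illustration.
members = {
    "NIFTY.csv": "date,close\n2024-01-01,21700\n",
    "SENSEX.csv": "date,close\n2024-01-01,72000\n",
}
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, text in members.items():
        zf.writestr(name, text)
raw = buf.getvalue()


def read_index(raw):
    """Return {name: (uncompressed_size, local_header_offset)}.

    Hypothetical helper: assumes no archive comment, no ZIP64, one disk.
    """
    # Hop 1: with no comment, the EOCD record is exactly the last 22 bytes.
    sig, _, _, _, total, cd_size, cd_offset, _ = struct.unpack(
        "<4s4H2LH", raw[-22:]
    )
    if sig != b"PK\x05\x06":
        raise ValueError("EOCD not at the very end (archive comment present?)")

    # Hop 2: walk the central directory. Each entry is a fixed 46-byte header
    # followed by a variable-length name, extra field, and comment.
    index = {}
    pos = cd_offset
    for _ in range(total):
        f = struct.unpack_from("<4s6H3L5H2L", raw, pos)
        if f[0] != b"PK\x01\x02":
            raise ValueError("bad central directory entry")
        size, name_len, extra_len, comment_len = f[9], f[10], f[11], f[12]
        offset = f[16]
        name = raw[pos + 46 : pos + 46 + name_len].decode("utf-8")
        index[name] = (size, offset)
        pos += 46 + name_len + extra_len + comment_len
    return index


index = read_index(raw)
size, offset = index["SENSEX.csv"]

# Hop 3: the stored offset points at the member's local file header.
print("SENSEX.csv:", size, "bytes, local header at byte", offset)
print("signature at that offset:", raw[offset : offset + 4])
```

In practice you would let `zipfile` (or `unzip`) do this parsing; the point of the sketch is only that nothing before the central directory has to be read to learn where SENSEX.csv lives.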

The consequence: random access

Because the index stores offsets, you get random access: the ability to jump straight to any member without reading what came before it. The payoff is concrete: with plain sequential reading you'd scan, on average, half the archive to find one file; with the index you read only the index and the file itself.[2]
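
One way to see the payoff is to count the bytes a reader actually touches. The sketch below is an assumption-laden illustration: `CountingBytesIO` is a hypothetical wrapper, the 150 members are synthetic, and they are stored uncompressed only so the sizes are predictable.

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that counts how many bytes are actually read from it."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


# 149 filler members plus SENSEX.csv, all stored (uncompressed) and synthetic.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for i in range(1, 150):
        zf.writestr(f"INDEX_{i:03d}.csv", b"x" * 20_000)
    zf.writestr("SENSEX.csv", b"date,close\n" + b"2024-01-01,72000\n" * 1_000)
archive = buf.getvalue()

# Read exactly one member through the counting wrapper.
counted = CountingBytesIO(archive)
with zipfile.ZipFile(counted) as zf:
    data = zf.read("SENSEX.csv")

print(f"archive size: {len(archive):,} bytes")
print(f"member size : {len(data):,} bytes")
print(f"bytes read  : {counted.bytes_read:,} bytes")
```

The bytes read should come out close to the size of the index plus the one member, a small fraction of the archive; the exact figure depends on the `zipfile` implementation in your Python version.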

That single property splits two operations people constantly conflate:

Three verbs you actually type

Every archive tool exposes the same three verbs. Learn them as a set (list, stream-one, bulk) and you can reach into zip and rar identically:

Same three moves, two tools. list → stream-one → bulk
# 1 · LIST: read only the index (the central directory). Always do this first.
unzip -l  data.zip                 # names + sizes, instantly
unrar l   data.rar

# 2 · STREAM ONE member to stdout: decompress just that file, no bulk extraction
unzip -p       data.zip  SENSEX.csv  > SENSEX.csv
unrar p -inul  data.rar  SENSEX.csv  > SENSEX.csv   # -inul silences unrar's own messages so only file data reaches stdout

# 3 · BULK extract everything to a folder: the default, usually more than you need
unzip -d  out/  data.zip
unrar x   data.rar  out/
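
The same three verbs exist when the reader is a program rather than a shell. This is a rough Python sketch using the standard-library `zipfile` module; the in-memory archive stands in for data.zip, and its member names and values are made up. It covers zip only, since the standard library has no rar reader.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Stand-in for data.zip, built in memory; names and values are illustrative.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("NIFTY.csv", "date,close\n2024-01-01,21700\n")
    zf.writestr("SENSEX.csv", "date,close\n2024-01-01,72000\n")

with zipfile.ZipFile(buf) as zf:
    # 1. LIST: names and sizes come from the central directory alone.
    listing = [(info.filename, info.file_size) for info in zf.infolist()]

    # 2. STREAM ONE: decompress a single member; nothing is written to disk.
    with zf.open("SENSEX.csv") as member:
        first_line = member.readline().decode("utf-8").rstrip("\n")

    # 3. BULK: extract every member into a (temporary) folder.
    with tempfile.TemporaryDirectory() as out_dir:
        zf.extractall(out_dir)
        extracted = sorted(p.name for p in Path(out_dir).iterdir())

print(listing)
print(first_line)
print(extracted)
```

`infolist()` roughly plays the role of `unzip -l`, `open()` of `unzip -p`, and `extractall()` of a bulk `unzip -d`.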

The move that matters: stream, don't extract

Verb 2 is the one that changes how you work. -p / p send the member's bytes to stdout instead of writing a file,[3] so you can pipe it straight into the next program and never write a temp file at all:

Reach in, inspect, never hit the disk
# Peek at the first rows of one member inside a 50 MB archive
unzip -p data.zip SENSEX.csv | head

# Count rows, or feed straight into awk / a program, with no extraction step
unzip -p data.zip SENSEX.csv | wc -l
unzip -p data.zip SENSEX.csv | awk -F, '$2 > 80000'
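
The pipe idea carries over to code: a member can be consumed row by row as it decompresses. The sketch below mirrors the awk filter under a few assumptions that the lesson does not state: the CSV has a header row, the value of interest is in the second column, and the rows shown are made-up numbers. `rows_above` is a hypothetical helper name.

```python
import csv
import io
import zipfile

# Stand-in archive with made-up rows; a real SENSEX.csv may be laid out differently.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(
        "SENSEX.csv",
        "date,close\n"
        "2024-01-01,79000.0\n"
        "2024-01-02,80500.0\n"
        "2024-01-03,81250.5\n",
    )


def rows_above(archive, member_name, threshold):
    """Yield rows whose second column exceeds threshold, streaming the member."""
    with zipfile.ZipFile(archive) as zf, zf.open(member_name) as member:
        # Wrap the binary member stream so csv can read it as text, line by line.
        text = io.TextIOWrapper(member, encoding="utf-8", newline="")
        reader = csv.reader(text)
        next(reader)  # skip the assumed header row
        for row in reader:
            if float(row[1]) > threshold:
                yield row


high = list(rows_above(buf, "SENSEX.csv", 80000))
for row in high:
    print(row)
```

Because the rows are yielded one at a time, the member never has to exist as a file on disk, and the other members are never decompressed.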

This is exactly what powered the extraction you watched: unzip -p for the zip months, unrar p for the rar months, with one member streamed out of each archive and the other ~149 left untouched. Extract would have written gigabytes of files I'd then delete. Stream wrote only what I asked for.

Your win today

You can now (1) explain why grabbing one file from an archive is cheap (the central directory gives random access) and (2) reach in three ways: -l to list, -p to stream one member into a pipe, -d to bulk-extract. You've stopped thinking "unzip the whole thing" and started thinking "seek to the file I want."

Recall check

Don't re-read; retrieve. Effortful recall is what turns this into memory you'll still have next week. Answer from your head; feedback is instant.

Primary source β€” read this next

Wikipedia: ZIP (file format), the “Structure” section. The clearest accessible account of the central directory, the EOCD record, and why the index sitting at the end gives random access. ~10 minutes. For the authoritative byte-level detail, the spec itself is PKWARE’s APPNOTE.TXT.

I’m your teacher, so use me. Want to try this on one of your own archives (your BSE zips, a .tar.gz, a node_modules tarball)? Paste a path and I’ll show you the -l / -p on it. Curious why .tar.gz can’t do this cheaply, or why that nested .rar needed a temp file? That’s Lesson 2, but ask now if it’s nagging you.

Sources

  1. PKWARE: APPNOTE.TXT, .ZIP File Format Specification. Defines the central directory and EOCD records (names, sizes, relative offsets).
  2. Wikipedia: ZIP (file format). “A directory placed at the end … identifies what files are in the ZIP and where … allowing a file listing without reading the entire archive.”
  3. unzip(1) man page (Info-ZIP). -p extracts to stdout (pipe); -l lists; -d sets the extraction directory.