Why you can pull one file out of a zip without unpacking the rest.
An hour ago I reached into 17 archives and pulled one file out of each, a single SENSEX.csv, even though every archive held ~150 index CSVs plus a pile of futures and options. I never unpacked them. Gigabytes stayed compressed; I touched only the file I wanted.
That isn't a trick; it's a structural fact about how a zip is built. Learn the one fact and the whole technique falls out of it: you stop unpacking archives and start reaching into them. This lesson is that fact.
A zip is not a folder. It's one file: every member stored back-to-back and then, at the very end, a central directory, which is a list with one record per member giving its name, its size, and its byte offset inside the file.[1]
So a reader never scans the whole archive. It jumps to the end, reads the End of Central Directory record, follows it to the index, looks up the one name you asked for, and seeks straight to those bytes.[2] The other 149 members are never read.
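The lookup path described above can be sketched by hand in Python. This is an illustration under stated assumptions, not how any particular tool is implemented: the archive is built in memory, the member names other than SENSEX.csv and all row values are made up, and the parser ignores Zip64, encryption, and archive comments that happen to contain the EOCD signature.

```python
import io
import struct
import zipfile
import zlib


def build_demo_zip() -> bytes:
    """Build a tiny in-memory archive (names and values are hypothetical)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("NIFTY.csv", "date,close\n2024-01-01,100\n")
        zf.writestr("SENSEX.csv", "date,close\n2024-01-01,200\n")
        zf.writestr("BANKEX.csv", "date,close\n2024-01-01,300\n")
    return buf.getvalue()


def read_member(data: bytes, wanted: str) -> bytes:
    """Return one member's bytes by following EOCD -> central directory -> offset."""
    # Step 1: find the End of Central Directory record near the end of the file.
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("no End of Central Directory record found")
    (_sig, _disk, _cd_disk, _n_here, n_total,
     _cd_size, cd_offset, _comment_len) = struct.unpack_from("<4s4H2IH", data, eocd)

    # Step 2: walk the central directory until the requested name turns up.
    pos = cd_offset
    for _ in range(n_total):
        fields = struct.unpack_from("<4s6H3I5H2I", data, pos)
        method, comp_size = fields[4], fields[8]
        name_len, extra_len, comment_len = fields[10:13]
        local_offset = fields[16]
        name = data[pos + 46:pos + 46 + name_len].decode("utf-8")
        if name == wanted:
            break
        pos += 46 + name_len + extra_len + comment_len
    else:
        raise KeyError(wanted)

    # Step 3: jump to the member's local header and skip past it to the data.
    local = struct.unpack_from("<4s5H3I2H", data, local_offset)
    start = local_offset + 30 + local[9] + local[10]
    raw = data[start:start + comp_size]
    if method == 0:          # stored, no compression
        return raw
    if method == 8:          # raw deflate stream
        return zlib.decompress(raw, -15)
    raise NotImplementedError(f"compression method {method}")


demo = build_demo_zip()
print(read_member(demo, "SENSEX.csv").decode())
```

In practice `zipfile.ZipFile(...).read("SENSEX.csv")` does all of this for you; the manual version only makes the three jumps visible.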
Because the index stores offsets, you get random access: the ability to jump to any byte without reading what came before. The payoff is concrete: with plain sequential reading you'd scan, on average, half the archive to find one file; with the index you read only the index and the file itself.[2]
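One way to see the payoff is to count the bytes a reader actually pulls from the archive. The sketch below is an assumption-laden demo rather than a benchmark: `CountingBytesIO` and `build_archive` are hypothetical helpers, the 150 members are synthetic filler, and members are stored uncompressed so the arithmetic stays obvious.

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that tallies how many bytes callers read from it."""

    def __init__(self, initial: bytes = b""):
        super().__init__(initial)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive(n_members: int = 150, member_size: int = 20_000) -> bytes:
    """Build an archive of equally sized members; one of them is SENSEX.csv."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(n_members):
            name = "SENSEX.csv" if i == n_members // 2 else f"INDEX_{i:03d}.csv"
            zf.writestr(name, b"x" * member_size)
    return buf.getvalue()


archive = build_archive()
src = CountingBytesIO(archive)
with zipfile.ZipFile(src) as zf:
    payload = zf.read("SENSEX.csv")   # only the index and this member are read

print(f"archive size: {len(archive)} bytes")
print(f"bytes read:   {src.bytes_read} bytes")
```

The second number should come out at roughly one member plus the central directory, a small fraction of the archive, because the reader seeks past everything else.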
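That single property splits two operations people constantly conflate: extracting an archive, which writes every member to disk, and streaming one member, which decompresses only the bytes you asked for.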
Every archive tool exposes the same three verbs. Learn them as a set (list, stream-one, bulk) and you can reach into zip and rar identically:
# 1 · LIST: read only the index (the central directory). Always do this first.
unzip -l data.zip # names + sizes, instantly
unrar l data.rar
# 2 · STREAM ONE member to stdout: decompress just that file, nothing to disk
unzip -p data.zip SENSEX.csv > SENSEX.csv
unrar p data.rar SENSEX.csv > SENSEX.csv
# 3 · BULK extract everything to a folder: the default, usually more than you need
unzip -d out/ data.zip
unrar x data.rar out/
Verb 2 is the one that changes how you work. -p / p send the member's bytes to stdout instead of writing a file,[3] so you can pipe it straight into the next program and never write a temp file at all:
# Peek at the first rows of one member inside a 50 MB archive
unzip -p data.zip SENSEX.csv | head
# Count rows, or feed straight into awk / a program, with no extraction step
unzip -p data.zip SENSEX.csv | wc -l
unzip -p data.zip SENSEX.csv | awk -F, '$2 > 80000'
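The same stream-one idea carries over when the next program is your own code. The sketch below is a rough Python analogue of the awk filter above, not a drop-in replacement: `rows_above` is a hypothetical helper, the archive is built in memory, and the row values are invented for illustration. Unlike awk, it skips any row whose second column is not numeric, which includes the header.

```python
import csv
import io
import zipfile


def rows_above(archive, member, threshold):
    """Yield CSV rows of `member` whose second column exceeds `threshold`.

    `archive` may be a path or a file-like object. The member is decompressed
    as it is read, so nothing is written to disk.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                try:
                    value = float(row[1])
                except (IndexError, ValueError):
                    continue  # header or malformed row
                if value > threshold:
                    yield row


# Build a small archive in memory; the numbers are made up.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("SENSEX.csv", "date,close\nd1,79000.0\nd2,81000.5\nd3,85000.25\n")
    zf.writestr("OTHER.csv", "date,close\nd1,1.0\n")
sample_bytes = sample.getvalue()

for row in rows_above(io.BytesIO(sample_bytes), "SENSEX.csv", 80000):
    print(",".join(row))
```

Passing a filename such as "data.zip" instead of the in-memory buffer works the same way, since `zipfile.ZipFile` accepts either.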
This is exactly what powered the extraction you watched: unzip -p for the zip months, unrar p for the rar months, with one member streamed out of each archive and the other ~149 left untouched. Extract would have written gigabytes of files I'd then delete. Stream wrote only what I asked for.
You can now (1) explain why grabbing one file from an archive is cheap (the central directory gives random access) and (2) reach in three ways: -l to list, -p to stream one member into a pipe, and a plain unzip, with -d to choose the folder, to bulk-extract. You've stopped thinking "unzip the whole thing" and started thinking "seek to the file I want."
Don't re-read; retrieve. Effortful recall is what turns this into memory you'll still have next week. Answer from your head; feedback is instant.
Wikipedia: ZIP (file format), the "Structure" section. The clearest accessible account of the central directory, the EOCD record, and why the index sitting at the end gives random access. ~10 minutes. For the authoritative byte-level detail, the spec itself is PKWARE's APPNOTE.TXT.
.tar.gz, a node_modules tarball)? Paste a path and I'll show you the -l / -p on it. Curious why .tar.gz can't do this cheaply, or why that nested .rar needed a temp file? That's Lesson 2, but ask now if it's nagging you.
-p extracts to stdout (pipe); -l lists; -d sets the extraction directory.