Reaching Into Archives · Lesson 2 of the course

Streaming vs Seekable

Why tar can’t do what zip does — and why that .rar needed a temp file.

In Lesson 1 you reached into a zip and pulled one member out instantly, because the central directory let the reader seek. Try the exact same move on a .tar.gz and it crawls. Try to pipe a .rar into unrar and it flat-out refuses. Both surprises come from one split: random-access vs streaming, and seekable vs not.

Your mission: General Unix fluency means knowing which archives you can reach into cheaply and which you can’t, before you write the pipeline. This lesson builds that judgement.

Two families of archive

Every archive you’ll meet falls on one side of a line:

Family	Examples	Has an index?	List / grab one member
Random-access	`.zip`, `.rar`	Yes: zip’s central directory; rar’s member headers, which a reader can seek between	Cheap: seek to it
Streaming	`.tar`, `.gz`, `.tar.gz`	No	Costly: read front-to-back

tar (short for tape archive) was designed for magnetic tape: a sequential device you read front-to-back. So it has no central directory at all. Each member is just a header (name, size, timestamps) immediately followed by its bytes, then the next header, and so on. [1] There is no index to jump to: “no way of knowing how many files a tar archive contains unless the whole archive is traversed.” [1]
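To make that layout concrete, here is a minimal sketch that builds a one-member archive and peeks at its raw bytes. It assumes a tar that accepts `--format=ustar` (GNU tar and bsdtar both do), where each header occupies one 512-byte block; the file names are made up for the demo.

```shell
# Build a tiny one-member archive in a scratch directory (names are illustrative).
demo=$(mktemp -d)
printf 'hello\n' > "$demo/a.txt"
tar --format=ustar -cf "$demo/a.tar" -C "$demo" a.txt

# Block 0 is the header: the member name sits in its first bytes.
name=$(head -c 5 "$demo/a.tar")

# Block 1 is where the member's own bytes begin, right after the header.
body=$(dd if="$demo/a.tar" bs=512 skip=1 count=1 2>/dev/null | head -c 5)

echo "header starts with: $name"
echo "data starts with:   $body"
rm -rf "$demo"
```

With more members the pattern simply repeats (header, data, header, data), which is why a reader can only find member N by stepping over everything before it.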

Same goal, opposite cost. ZIP seeks once. TAR walks the whole thing — there’s nowhere to jump to.

`.tar.gz` makes it worse: one solid stream

Compress a tar with gzip and you wrap the entire archive in a single compressed stream. Now the data isn’t just index-less, it’s solid: “to find the 50th file, you must uncompress and read files 1 through 49 first.” [1] So tar -tzf big.tar.gz has to decompress and scan the whole thing just to list it. There is no cheap “jump to one member”, because the structure to jump with doesn’t exist.

The trap: A 2 GB .tar.gz looks just like a 2 GB .zip in your file manager. But “give me one file from it” is a seek in the zip and a full decompress-scan in the tar.gz. Format dictates cost.
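One way to see the solid stream for yourself is to spell out what the z flag does. The sketch below (file names are invented for the demo) lists a small .tar.gz twice, once with -z and once as an explicit decompress-then-walk pipeline, and then pulls out the last member the same way.

```shell
# Build a three-member .tar.gz in a scratch directory (names are illustrative).
demo=$(mktemp -d)
for i in 1 2 3; do printf 'row %s\n' "$i" > "$demo/f$i.csv"; done
tar -czf "$demo/all.tar.gz" -C "$demo" f1.csv f2.csv f3.csv

# What -z does, spelled out: decompress the single gzip stream, then walk the tar.
via_flag=$(tar -tzf "$demo/all.tar.gz")
via_pipe=$(gzip -dc "$demo/all.tar.gz" | tar -tf -)

# Grabbing the LAST member still means decompressing everything before it.
last=$(gzip -dc "$demo/all.tar.gz" | tar -xO -f - f3.csv)

echo "$via_pipe"
echo "$last"
rm -rf "$demo"
```

Both listings match because there is only one way in: through the front of the gzip stream.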

You can still stream one out — it’s just sequential

The three verbs from Lesson 1 still exist for streaming formats. They work; they’re simply doing a front-to-back read under the hood:

The same list → stream-one → bulk, for tar & gzip

# LIST — walks the whole archive (no index to read)
tar -tf   archive.tar
tar -tzf  archive.tar.gz            # add z for gzip-compressed

# STREAM ONE member to stdout — note -O (capital o = "to stdout")
tar -xO -f  archive.tar     path/in/archive.csv | head
tar -xzO -f archive.tar.gz  path/in/archive.csv | head

# BULK extract to a directory (-C needs the directory to exist already)
mkdir -p out
tar -xf  archive.tar     -C out/
tar -xzf archive.tar.gz  -C out/

# A lone .gz wraps ONE file — just decompress its single stream to stdout
zcat       prices.csv.gz | head     # zcat == gunzip -c == gzip -dc
gzip -dc   prices.csv.gz | wc -l

-O is the tar equivalent of unzip -p: it writes one member straight to stdout, with nothing written to disk. [2] And zcat is the whole story for a single .gz: gzip compresses one stream, so there’s no member to pick; you just decompress it. [3]
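If you want to confirm the “nothing written to disk” part, a small self-contained check like the following can help. It is only a sketch: the directory layout and file names are invented, and it uses gzip -dc rather than zcat because that spelling behaves the same across systems.

```shell
# Scratch layout: src/ holds the original file, work/ starts out empty.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/work"
printf 'id,price\n1,9.50\n' > "$demo/src/prices.csv"
tar -czf "$demo/data.tar.gz" -C "$demo/src" prices.csv
gzip -c "$demo/src/prices.csv" > "$demo/prices.csv.gz"

# Stream one member to stdout while standing inside the empty work/ directory.
streamed=$(cd "$demo/work" && tar -xzO -f ../data.tar.gz prices.csv)

# work/ is still empty afterwards: -O extracted nothing to disk.
left_behind=$(ls -A "$demo/work")

# A lone .gz has no members to choose from: decompress its one stream and count lines.
lines=$(gzip -dc "$demo/prices.csv.gz" | wc -l | tr -d ' ')

echo "$streamed"
echo "files left in work/: '$left_behind'"
echo "lines in prices.csv.gz: $lines"
rm -rf "$demo"
```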

The seekable rule — why that `.rar` needed a temp file

Now the payoff. Remember the July data: .rar files sitting inside a .zip. I couldn’t pipe each rar into unrar; I had to write it to a temp file first. Here’s exactly why.

A pipe is sequential-only: you can read the bytes flowing past, but you can’t seek backwards or jump ahead. A random-access reader (unzip, unrar) needs to seek: in a zip, to the central directory at the end and then back to a member’s offset; in a rar, from one member header to the next. So it needs a seekable file, a real file on disk the OS can jump around in. A pipe can’t provide that, so unrar refuses stdin.

You have…	Reader needs…	Through a pipe?	So you must…
tar / gz stream	sequential read	✅ works	just pipe it (`zcat … |`)
zip / rar (nested)	seek (random access)	❌ no seek in a pipe	spill to a temp file, then read

The rule of thumb: streaming formats flow through pipes; random-access formats often need a seekable file. When a seekable-only tool meets a pipe, you give it a temp file. That workaround comes from the reader’s need to seek, not from a bug in the tool. (In code you can sometimes dodge even that by handing the reader an in-memory seekable buffer, which is exactly Lesson 3.)
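The temp-file workaround can be wrapped up once and reused. Below is a minimal sketch of that idea: spill_then_run is a hypothetical helper name, not a standard tool. It copies stdin into a temp file, runs the given command with that file as its last argument, and cleans up afterwards.

```shell
# spill_then_run CMD [ARGS...]
# Copy stdin to a temp file (which IS seekable), run CMD ARGS... TMPFILE,
# then delete the temp file and pass CMD's exit status through.
spill_then_run() {
  spill_tmp=$(mktemp) || return 1
  cat > "$spill_tmp"
  "$@" "$spill_tmp"
  spill_status=$?
  rm -f "$spill_tmp"
  return "$spill_status"
}

# Demo with a command that just takes a file path:
printf 'abc' | spill_then_run wc -c

# Intended use (assumes unzip and unrar are installed; archive names are illustrative):
#   unzip -p outer.zip inner.rar | spill_then_run unrar lb
```

The cost is one extra copy of the inner archive on disk, which is the price of giving a seeking reader something it can seek in.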

Your win today

You can now classify any archive before touching it: random-access (zip/rar: seek, cheap to grab one) or streaming (tar/gz: walk, costly to grab one). You can reach into the streaming ones with tar -O / zcat, and you can explain the seekable rule that forces a temp file when a pipe meets unrar. Format → cost → the right verb.
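That classification is simple enough to write down as a lookup. The function below is a rough sketch under an obvious assumption: it trusts the file extension and never inspects the bytes, so a misnamed file will fool it. archive_family is a made-up name for illustration.

```shell
# archive_family NAME: print the family and the usual listing command for NAME.
# Decides by file extension only; it does not open the file.
archive_family() {
  case "$1" in
    *.tar.gz|*.tgz) echo "streaming (solid): tar -tzf" ;;
    *.tar)          echo "streaming: tar -tf" ;;
    *.gz)           echo "streaming (single file): gzip -dc" ;;
    *.zip)          echo "random-access: unzip -l" ;;
    *.rar)          echo "random-access: unrar l" ;;
    *)              echo "unknown" ;;
  esac
}

archive_family backup.tar.gz
archive_family july.zip
```

The .tar.gz pattern has to come before the plain .gz pattern, because case takes the first match.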

Primary source — read this next

Wikipedia: tar (computing), “Format details” and limitations. The clearest account of why tar has no index and what that costs. ~8 minutes. For the verbs, the authoritative reference is the GNU tar manual, Extracting Specific Files (and the GNU gzip manual for zcat).

I’m your teacher — use me. Want to feel the difference? Time unzip -l on a big zip vs tar -tzf on a big .tar.gz and ask me to explain the gap. Or hand me a real archive and I’ll tell you which family it’s in and the cheapest way to reach into it.

Sources

1. Wikipedia: tar (computing). “The tar format does not have a central directory … no way of knowing how many files a tar archive contains unless the whole archive is traversed.” Tape-archive origin; solid .tar.gz.
2. GNU tar manual: Writing to Standard Output (-O/--to-stdout) and Extracting Specific Files.
3. GNU gzip manual. zcat = gunzip -c = gzip -dc: decompress a single stream to stdout.