Reaching Into Archives · Reference
Glossary
The shared vocabulary for this course. Every lesson adheres to these definitions. When a term is added or sharpened, it changes here first.
How to use
This is a reference, not a lesson — skim it, don't memorise it. Lessons link here; come back when a term is fuzzy.
- Archive
- A single file that packages many files (and folders) together, usually compressed. A container — not a folder.
.zip, .rar, .7z, .tar, .tar.gz are all archives.
- Member (entry)
- One file stored inside an archive. The thing you reach in to grab. Avoid: "file in the zip" (ambiguous with the zip itself).
- Central directory
- The archive's built-in table of contents: one record per member listing its name, size, and — crucially — its byte offset inside the file. In a ZIP it lives at the end of the file. This index is what makes selective extraction possible. — PKWARE APPNOTE.TXT Avoid: "the header", "the file table".
- End of Central Directory record (EOCD)
- A small fixed marker at the very end of a ZIP that points back to where the central directory begins. A reader finds the EOCD first, then jumps to the index. — Wikipedia, ZIP (file format)
- Random access
- The ability to jump straight to any byte offset in a file without reading what comes before it. Because a ZIP's central directory stores offsets, a reader can seek directly to one member. Avoid: "direct access".
- Sequential / streaming access
- Reading a file front-to-back, byte after byte, with no seeking. The only option for streaming formats and for data arriving through a pipe. Avoid: "linear read".
- Streaming format
- An archive/compression format with no index — you must read it sequentially.
gzip and tar are streaming formats, which is why a .tar.gz can't cheaply locate one member without scanning from the start. Contrast with ZIP's random access. (Covered in L2.)
- Extract
- Decompress a member (or all members) and write it to disk as a real file.
unzip -d, unrar x. The default, and often more than you need.
- Stream to stdout (pipe-to-stdout)
- Decompress one member and send its bytes to standard output instead of writing a file — so you can pipe it onward or redirect it yourself.
unzip -p, unrar p. The selective-extraction verb. — unzip(1)
- stdout / pipe
- Standard output is a process's default output channel; a pipe (|) wires one process's stdout into the next's input. Data in a pipe is sequential only: you can't seek in it.
- Seekable file
- A real on-disk file the OS can jump around in (seek to any offset). A pipe is not seekable. Tools that need random access (e.g. unrar) therefore can't read from a pipe; they need a seekable file, which is why a nested .rar must first be written to a temp file. (Covered in L2.)
- Solid archive
- An archive that compresses its members as one continuous stream rather than independently (common in .rar/.7z). Great ratio, but extracting one member may require decompressing everything before it: random access in name, sequential in cost. (Covered in L2.)
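The central directory, EOCD, and random access entries above can be seen in a few lines of Python. This is a minimal sketch, not a reference implementation: the member names and the helper functions (build_demo_zip, list_index, read_member, find_eocd) are hypothetical and exist only for illustration, and the archive is built in memory so that nothing touches disk.

```python
import io
import zipfile


def build_demo_zip():
    """Build a tiny ZIP in memory (hypothetical member names, for illustration)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("docs/readme.txt", "hello from the readme\n")
        zf.writestr("data/values.csv", "a,b\n1,2\n")
    return buf.getvalue()


def list_index(zip_bytes):
    """Return (name, uncompressed size, local header offset) for each member.

    zipfile fills these fields from the central directory, so no member
    data is decompressed here.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [(i.filename, i.file_size, i.header_offset) for i in zf.infolist()]


def read_member(zip_bytes, name):
    """Decompress exactly one member; zipfile seeks to it via the index."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)


def find_eocd(zip_bytes):
    """Locate the End of Central Directory record by its signature."""
    return zip_bytes.rfind(b"PK\x05\x06")


demo = build_demo_zip()
for name, size, offset in list_index(demo):
    print(f"{name}: {size} bytes, local header at offset {offset}")
print("EOCD signature at byte", find_eocd(demo), "of", len(demo))
print(read_member(demo, "data/values.csv").decode(), end="")
```

Reading data/values.csv here never decompresses docs/readme.txt. With a solid or streaming format, the same request would mean decoding everything stored before the wanted member.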
Tools & code
- -p / -O / -so (to stdout)
- The “stream one member to stdout” flag, by tool:
unzip -p, tar -O (--to-stdout), 7z -so. With tar and 7z the flag modifies an extract command (tar -xO, 7z e -so). Same job each time: decompress one member to stdout, write nothing to disk.
- zcat / gzip -dc
- Decompress a single .gz stream to stdout. zcat = gunzip -c = gzip -dc. A lone .gz wraps one file, so there is no member to pick. (Source: GNU gzip)
- Universal reader (bsdtar · 7z)
- One tool that reads many formats.
bsdtar (libarchive) auto-detects format and compression behind one -tf/-xf interface; 7z handles 7z/zip/tar/gz and more. The interface is uniform; the cost still follows the format. — libarchive
- BytesIO (in-memory buffer)
- Bytes wrapped as a seekable file-like object in RAM (Python io.BytesIO). Lets an in-process reader open an archive that exists only in memory, e.g. a zip nested in another, with no temp file. (Source: Python io)
- zipfile · tarfile
- Python’s standard-library archive readers.
zipfile gives random access (read(name) → bytes); tarfile gives streaming members (extractfile() → a file-like object). Both accept a path or a file object.
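The BytesIO and zipfile · tarfile entries above can be sketched together. This example rests on a few assumptions: the member names (inner.zip, note.txt, a.txt, b.txt) and the helpers make_zip, read_nested, make_tar_gz, and read_from_tar_stream are hypothetical, and every archive is built in memory purely for the demonstration. It shows a zip nested in another zip being opened through io.BytesIO with no temp file, and a .tar.gz being read in tarfile's streaming mode ("r|gz"), where members are visited in order and nothing is seeked.

```python
import io
import tarfile
import zipfile


def make_zip(members):
    """Build a ZIP in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_nested(outer_bytes, inner_name, member_name):
    """Read one member of a zip stored inside another zip, with no temp file."""
    with zipfile.ZipFile(io.BytesIO(outer_bytes)) as outer:
        inner_bytes = outer.read(inner_name)  # random access into the outer zip
    # BytesIO makes the inner archive seekable, which zipfile requires.
    with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
        return inner.read(member_name)


def make_tar_gz(members):
    """Build a .tar.gz in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_tar_stream(tar_gz_bytes, member_name):
    """Scan a .tar.gz front to back and return the first matching member."""
    # "r|gz" is tarfile's streaming mode: sequential reads only.
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r|gz") as tf:
        for info in tf:
            if info.name == member_name:
                return tf.extractfile(info).read()
    raise KeyError(member_name)


outer = make_zip({"inner.zip": make_zip({"note.txt": b"nested hello"})})
print(read_nested(outer, "inner.zip", "note.txt").decode())

tgz = make_tar_gz({"a.txt": b"first", "b.txt": b"second"})
print(read_from_tar_stream(tgz, "b.txt").decode())
```

The tar reader has to pass over a.txt before it reaches b.txt, while the zip reader jumps straight to the requested member. That is the sense in which the interface looks uniform but the cost follows the format.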