From 3910213fd904a83a563bdf63ca8a4190fb14bf8c Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 19 Feb 2026 01:51:43 +0000 Subject: [PATCH] feat: add file formats documentation (image, audio/video, archive, document) Comprehensive coverage of how non-plain-text files work: - File Formats: magic bytes, headers/trailers, hex walkthroughs, endianness - Image Formats: JPEG DCT pipeline, PNG chunks, GIF, WebP, AVIF internals - File Metadata: EXIF structure, GPS coordinates, XMP, ID3, privacy implications - Audio and Video Formats: codecs vs containers, MP3 frames, MP4 boxes, streaming - Archive and Compression: ZIP central directory, TAR headers, LZ77/DEFLATE, zstd - Document Formats: PDF object graph, DOCX/XLSX ZIP structure, EPUB internals https://claude.ai/code/session_01Q7ZjU9KDPBgT8yWJyA9GXq --- Computer Science MOC.md | 9 + .../Archive and Compression Formats.md | 304 +++++++++++ Computer Science/Audio and Video Formats.md | 294 ++++++++++ Computer Science/Document Formats.md | 356 +++++++++++++ Computer Science/File Formats.md | 278 ++++++++++ Computer Science/File Metadata.md | 503 ++++++++++++++++++ Computer Science/Image Formats.md | 359 +++++++++++++ Tools MOC.md | 1 + 8 files changed, 2104 insertions(+) create mode 100644 Computer Science/Archive and Compression Formats.md create mode 100644 Computer Science/Audio and Video Formats.md create mode 100644 Computer Science/Document Formats.md create mode 100644 Computer Science/File Formats.md create mode 100644 Computer Science/File Metadata.md create mode 100644 Computer Science/Image Formats.md diff --git a/Computer Science MOC.md b/Computer Science MOC.md index 6e0baae..4d7f56a 100644 --- a/Computer Science MOC.md +++ b/Computer Science MOC.md @@ -70,6 +70,15 @@ Fundamental CS concepts, data structures, algorithms, and system design. 
- [[Testing Strategies]] — Unit, integration, E2E, BDD, property-based +### File Formats & Media + +- [[File Formats]] — Binary structure, magic bytes, headers, and trailers +- [[Image Formats]] — JPEG, PNG, GIF, WebP, AVIF internals +- [[File Metadata]] — EXIF, GPS, XMP, ID3 metadata systems +- [[Audio and Video Formats]] — Codecs, containers, streaming +- [[Archive and Compression Formats]] — ZIP, tar, gzip, zstd, brotli +- [[Document Formats]] — PDF, DOCX, EPUB internals + ### Reference - [[Technical Measurements]] diff --git a/Computer Science/Archive and Compression Formats.md b/Computer Science/Archive and Compression Formats.md new file mode 100644 index 0000000..48add0b --- /dev/null +++ b/Computer Science/Archive and Compression Formats.md @@ -0,0 +1,304 @@ +--- +title: Archive and Compression Formats +aliases: + - Compression Formats + - Archive Formats + - ZIP Format + - Compression Algorithms +tags: + - cs + - fundamentals + - file-formats +type: concept +status: complete +difficulty: fundamentals +created: "2026-02-19" +--- + +# Archive and Compression Formats + +How files are bundled together (archiving) and made smaller (compression) — from ZIP internals to modern algorithms like Zstandard and Brotli. + +## Archive vs Compression + +These are separate concepts, often combined: + +| Concept | What It Does | Examples | +|---------|-------------|----------| +| **Archive** | Bundles multiple files into one | TAR, CPIO, AR | +| **Compression** | Reduces file size | gzip, bzip2, zstd, brotli, LZMA | +| **Both** | Bundles + compresses | ZIP, 7z, RAR | + +TAR was designed for tape archives and handles **only** archiving. 
Compression is applied separately: + +```bash +# Archive only (no compression) +tar cf archive.tar files/ + +# Archive + gzip +tar czf archive.tar.gz files/ + +# Archive + zstandard +tar --zstd -cf archive.tar.zst files/ + +# Archive + bzip2 +tar cjf archive.tar.bz2 files/ + +# Archive + xz (LZMA2) +tar cJf archive.tar.xz files/ +``` + +--- + +## Compression Algorithms + +### Comparison + +| Algorithm | Ratio | Compress Speed | Decompress Speed | Used In | +|-----------|-------|----------------|------------------|---------| +| DEFLATE | Good | Moderate | Fast | ZIP, gzip, PNG, HTTP | +| LZ4 | Low | Very fast | Very fast | Filesystem compression, real-time | +| Zstandard (zstd) | Excellent | Fast | Very fast | Kernel, packaging, databases | +| Brotli | Excellent | Slow | Fast | HTTP (WOFF2, web assets) | +| LZMA/LZMA2 | Best | Very slow | Moderate | 7z, xz | +| bzip2 | Good | Slow | Slow | Legacy, some distro packages | +| LZW | Moderate | Fast | Fast | GIF (legacy) | +| Snappy | Low | Very fast | Very fast | Google internal, Hadoop | + +### How LZ77/DEFLATE Works + +Most general-purpose compression builds on LZ77, which replaces repeated sequences with back-references: + +``` +Input: "ABCABCABCXYZ" + ↓ +Step 1: Output literal "ABC" +Step 2: See "ABC" repeats → output (distance=3, length=6) +Step 3: Output literal "XYZ" + +Compressed: ABC <3,6> XYZ + +The decoder reads forward: + "ABC" → emit as-is + <3,6> → go back 3 chars, copy 6 chars → "ABCABC" + "XYZ" → emit as-is + Result: "ABCABCABCXYZ" +``` + +DEFLATE combines LZ77 with Huffman coding — after finding repeated patterns, it Huffman-encodes the literals and back-references for additional compression. + +### Zstandard (zstd) + +Facebook/Meta's modern replacement for gzip. Uses finite state entropy (ANS) instead of Huffman coding and has a dictionary mode for compressing many small items. 
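The dictionary idea can be tried without zstd installed. The sketch below is a stand-in, not zstd: it uses the preset-dictionary (`zdict`) option of Python's standard `zlib` module (DEFLATE) to show why priming a compressor with typical content helps on small records. The sample records and field names are invented for illustration.

```python
import json
import zlib

# Hypothetical log records; the field names are made up for this sketch.
samples = [
    {"user_id": i, "event": "login", "status": "ok", "region": "eu-west-1"}
    for i in range(20)
]
# The "dictionary" is simply bytes of typical content the compressor may reference.
dictionary = "".join(json.dumps(s) for s in samples).encode()


def compress(data, zdict=None):
    """DEFLATE-compress data, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(blob, zdict):
    """Inverse of compress(); the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


record = json.dumps(
    {"user_id": 4242, "event": "login", "status": "ok", "region": "eu-west-1"}
).encode()

plain = compress(record)
primed = compress(record, dictionary)
print(len(record), len(plain), len(primed))
```

A single small record has almost no internal repetition, so the unprimed stream gains little; the primed one can back-reference the dictionary instead. zstd's trained dictionaries apply the same principle with a purpose-built training step.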
+ +```bash +# Compress (default level 3) +zstd file.dat + +# Compress with level 19 (highest level without --ultra) +zstd -19 file.dat + +# Train a dictionary on similar files (e.g., JSON logs) +zstd --train samples/* -o dictionary + +# Compress using the trained dictionary +zstd -D dictionary file.json +``` + +Key advantage: zstd decompression speed is nearly constant regardless of compression level. You can spend more time compressing (once) and decompress quickly (many times). + +### Brotli + +Google's algorithm optimized for web content. Includes a built-in dictionary of common web strings (HTML tags, CSS properties, JavaScript keywords). + +```bash +# Compress for web serving (level 11 = max) +brotli -q 11 styles.css + +# Served over HTTP with the response header: +# Content-Encoding: br +``` + +Typical web asset savings over gzip: 15-25% smaller. + +--- + +## ZIP + +The most widely used archive+compression format. Also the foundation for DOCX, XLSX, JAR, APK, EPUB, and many other formats. + +### ZIP File Structure + +ZIP is unusual — the authoritative file index is at the **end** of the file, not the beginning: + +``` +┌────────────────────────────────────────────┐ +│ Local File Header 1 │ +│ 50 4B 03 04 (PK..) ← signature │ +│ version, flags, compression method │ +│ CRC-32, sizes, filename │ +│ [File Data 1 - compressed] │ +├────────────────────────────────────────────┤ +│ Local File Header 2 │ +│ [File Data 2 - compressed] │ +├────────────────────────────────────────────┤ +│ ...more files... │ +├────────────────────────────────────────────┤ +│ Central Directory │ ← The actual index +│ 50 4B 01 02 (PK..) ← entry signature │ +│ Entry for File 1 (offset, size, name) │ +│ Entry for File 2 (offset, size, name) │ +│ ... │ +├────────────────────────────────────────────┤ +│ End of Central Directory Record │ +│ 50 4B 05 06 (PK..) 
← EOCD signature │ +│ Number of entries │ +│ Central directory offset │ +│ Comment │ +└────────────────────────────────────────────┘ +``` + +**Why the index is at the end:** ZIP was designed for appending files. You can add files to a ZIP without rewriting the entire archive — just append new local entries and write a new central directory. + +### ZIP Hex Walkthrough + +``` +Offset Hex Meaning +00000000 50 4B 03 04 Local file header signature +00000004 14 00 Version needed: 2.0 +00000006 00 00 Flags: none +00000008 08 00 Compression: DEFLATE +0000000A 4A 7D Mod time (MS-DOS format) +0000000C 54 59 Mod date (MS-DOS format) +0000000E XX XX XX XX CRC-32 +00000012 XX XX XX XX Compressed size +00000016 XX XX XX XX Uncompressed size +0000001A 09 00 Filename length: 9 +0000001C 00 00 Extra field length: 0 +0000001E 68 65 6C 6C 6F 2E 74 78 74 "hello.txt" (not null-terminated) +00000027 [compressed data...] DEFLATE'd file content +``` + +### ZIP Compression Methods + +| Value | Method | Notes | +|-------|--------|-------| +| 0 | Stored | No compression (used for already-compressed files) | +| 8 | DEFLATE | Standard, universal support | +| 9 | DEFLATE64 | Larger window, rare | +| 12 | bzip2 | Better ratio, less common | +| 14 | LZMA | 7-Zip's algorithm, uncommon in ZIP | +| 93 | Zstandard | Modern, gaining support | +| 95 | XZ (LZMA2) | Very high ratio | + +### ZIP-Based Formats + +| Format | Extension | Contents | +|--------|-----------|----------| +| Office Open XML | `.docx`, `.xlsx`, `.pptx` | XML + media files | +| Java Archive | `.jar` | `.class` files + manifest | +| Android Package | `.apk` | DEX + resources + manifest | +| EPUB | `.epub` | XHTML + CSS + images | +| OpenDocument | `.odt`, `.ods`, `.odp` | XML + media | +| XPI (Firefox ext) | `.xpi` | Web extension files | +| IPSW (iOS firmware) | `.ipsw` | Firmware images | + +--- + +## TAR (Tape Archive) + +Unix archiving format from 1979. No compression — purely bundles files with metadata. 
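That TAR only bundles, and never compresses, can be checked with Python's standard `tarfile` module. A minimal sketch: it writes one in-memory member (the name `hello.txt` and its contents are arbitrary) and confirms that the payload sits in the archive byte for byte, in 512-byte blocks.

```python
import io
import tarfile

payload = b"hello tar, hello tar, hello tar\n"

buf = io.BytesIO()
# Mode "w" writes plain tar: no gzip, bzip2, or xz layer is applied.
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

raw = buf.getvalue()
header = raw[:512]                      # first block: the member's header
data = raw[512:512 + len(payload)]      # second block: the file contents

print(header[:9])        # member name at the start of the header block
print(data == payload)   # contents are stored verbatim
print(len(raw) % 512)    # archive length is a whole number of blocks
```

Running `gzip` (or `zstd`, `xz`) over the resulting bytes is what turns this into a `.tar.gz`-style file; the two steps stay independent.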
+ +### TAR Header Structure + +Each file is preceded by a 512-byte header block: + +``` +Offset Size Field +0 100 Filename (null-terminated) +100 8 File mode (octal ASCII) +108 8 Owner UID (octal ASCII) +116 8 Group GID (octal ASCII) +124 12 File size (octal ASCII) +136 12 Modification time (Unix epoch, octal) +148 8 Header checksum +156 1 Type flag ('0'=file, '5'=directory, '2'=symlink) +157 100 Link target name +257 6 "ustar" magic +263 2 Version "00" +265 32 Owner username +297 32 Group name +329 8 Device major +337 8 Device minor +345 155 Filename prefix (for paths > 100 chars) +500 12 Padding to 512 bytes +``` + +File data follows immediately, padded to a 512-byte boundary. The archive ends with two consecutive 512-byte blocks of zeros. + +**Note:** TAR headers are entirely ASCII-encoded octal numbers, making them partially human-readable in a hex editor. + +--- + +## Gzip + +The standard compression wrapper on Unix. Compresses a single stream using DEFLATE. + +### Gzip Header + +``` +1F 8B ← Magic number +08 ← Compression method: DEFLATE +XX ← Flags (FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT) +XX XX XX XX ← Modification time (Unix epoch, LE) +XX ← Extra flags (compression level hint) +XX ← OS (0=FAT, 3=Unix, 7=macOS, 11=NTFS) +[optional: original filename, null-terminated] +[optional: comment, null-terminated] +[DEFLATE compressed data] +XX XX XX XX ← CRC-32 of original data +XX XX XX XX ← Original size mod 2^32 +``` + +The trailing CRC-32 and size allow integrity verification after decompression. + +--- + +## 7z + +7-Zip's native format. Supports multiple compression methods and solid compression (compressing multiple files as a single stream for better ratio). 
+ +### 7z Signature + +``` +37 7A BC AF 27 1C ← Magic bytes: "7z" + 4 signature bytes +00 04 ← Format version +[header with offsets to compressed streams and metadata] +``` + +7z achieves the best compression ratios among common formats by using LZMA2 with large dictionaries and solid compression, at the cost of slower compression speed and higher memory usage. + +--- + +## Choosing a Format + +| Scenario | Recommended | Why | +|----------|-------------|-----| +| General file sharing | ZIP | Universal support, every OS handles it natively | +| Unix/Linux packages | `.tar.gz` or `.tar.zst` | Standard convention, preserves permissions/ownership | +| Maximum compression | `.tar.xz` or `.7z` | Best ratios, worth the slower compression | +| Fast compression | `.tar.zst` or `.tar.lz4` | Near-instant compress/decompress | +| Web assets | Brotli (`.br`) | Best ratio for HTTP, built-in web dictionary | +| Incremental backups | TAR (append mode) | Add files without rewriting | +| Cross-platform distribution | ZIP | Zero dependency on any platform | +| Container/Docker layers | gzip or zstd | OCI standard, broad registry support | + +--- + +## Related + +- [[File Formats]] — Parent overview of file format concepts +- [[File Metadata]] — Metadata preserved in archives (timestamps, permissions) +- [[Document Formats]] — PDF, DOCX, EPUB (ZIP-based formats) +- [[Build Systems]] — Build tools that produce archives +- [[Deployment]] — Container images and artifact packaging diff --git a/Computer Science/Audio and Video Formats.md b/Computer Science/Audio and Video Formats.md new file mode 100644 index 0000000..545e5b4 --- /dev/null +++ b/Computer Science/Audio and Video Formats.md @@ -0,0 +1,294 @@ +--- +title: Audio and Video Formats +aliases: + - Video Formats + - Audio Formats + - Media Formats + - Codecs + - Video Codecs + - Audio Codecs +tags: + - cs + - fundamentals + - file-formats + - media +type: concept +status: complete +difficulty: fundamentals +created: "2026-02-19" 
+--- + +# Audio and Video Formats + +How audio and video are encoded, compressed, and packaged — the distinction between codecs (compression) and containers (packaging). + +## Codec vs Container + +The most important concept in media formats: **codecs** and **containers** are separate things. + +| Concept | What It Does | Examples | +|---------|-------------|----------| +| **Codec** | Compresses/decompresses audio or video data | H.264, H.265, AV1, VP9, AAC, Opus | +| **Container** | Packages codec streams + metadata into a file | MP4, MKV, WebM, AVI, MOV, OGG | + +A container holds one or more streams (video, audio, subtitles, metadata) that can each use a different codec: + +``` +┌─ MP4 Container ──────────────────────────────┐ +│ │ +│ Stream 0: Video → H.264 codec, 1080p 24fps │ +│ Stream 1: Audio → AAC codec, 48kHz stereo │ +│ Stream 2: Audio → AAC codec, 48kHz 5.1 │ +│ Stream 3: Subtitle → timed text (tx3g) │ +│ Metadata: title, duration, GPS, chapters │ +│ │ +└───────────────────────────────────────────────┘ +``` + +--- + +## Video Codecs + +### Codec Comparison + +| Codec | Standard | Compression | Licensing | Browser Support | Typical Use | +|-------|----------|-------------|-----------|-----------------|-------------| +| H.264 (AVC) | MPEG-4 Part 10 | Good | Patented (MPEG LA) | Universal | Web, streaming, Blu-ray | +| H.265 (HEVC) | MPEG-H Part 2 | ~50% better than H.264 | Patented (expensive) | Safari, some others | 4K broadcast, Apple ecosystem | +| AV1 | Alliance for Open Media | ~30% better than H.265 | Royalty-free | ~95% browsers | YouTube, Netflix, web | +| VP9 | Google | ~similar to H.265 | Royalty-free | ~97% browsers | YouTube (legacy) | +| VP8 | Google | ~similar to H.264 | Royalty-free | Wide | WebRTC (legacy) | +| ProRes | Apple | Visually lossless | Proprietary | N/A | Professional editing | + +### How Video Compression Works + +Video codecs exploit three types of 
redundancy: + +**Spatial** — Within a single frame, nearby pixels are similar (same as image compression). + +**Temporal** — Consecutive frames are mostly identical. Instead of storing every pixel for every frame, store the differences (motion vectors + residuals). + +**Perceptual** — Human vision is less sensitive to certain details. Quantize aggressively in areas the eye won't notice. + +### Frame Types + +| Type | Name | Description | Size | +|------|------|-------------|------| +| I-frame | Intra | Complete image (like a JPEG). Seek point. | Largest | +| P-frame | Predicted | References previous frames. Stores only differences. | Medium | +| B-frame | Bidirectional | References both past and future frames. | Smallest | + +``` +I ← P ← P ← B ← B ← P ← P ← I ← P ← P ... +▲ ▲ +Keyframe (seekable) Keyframe (seekable) +└──────── GOP (Group of Pictures) ────────┘ +``` + +**GOP (Group of Pictures)** — The sequence between keyframes. Longer GOPs = better compression but slower seeking. Streaming services typically use 2-4 second GOPs. 
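The GOP trade-off is simple arithmetic. A small sketch, assuming a constant frame rate and a fixed keyframe interval (the helper name is made up for illustration): a GOP lasting `gop_seconds` at `fps` holds `fps * gop_seconds` frames, which is also roughly how many frames a player may have to decode when seeking to the end of a GOP.

```python
def gop_length_frames(fps: float, gop_seconds: float) -> int:
    """Number of frames between consecutive keyframes (keyframe included)."""
    return round(fps * gop_seconds)


for seconds in (2, 4, 10):
    frames = gop_length_frames(24, seconds)
    # Worst case for a seek: start at the keyframe and decode the whole GOP.
    print(f"{seconds}s GOP at 24 fps: keyframe every {frames} frames")
```

Encoders usually take this value as a frame count rather than a duration, so the conversion is worth doing explicitly when targeting a segment length.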
+ +--- + +## Audio Codecs + +### Codec Comparison + +| Codec | Type | Bitrate Range | Quality | Typical Use | +|-------|------|---------------|---------|-------------| +| MP3 (MPEG-1 Layer 3) | Lossy | 128-320 kbps | Good | Legacy music distribution | +| AAC (Advanced Audio) | Lossy | 96-256 kbps | Better than MP3 | Streaming, Apple, YouTube | +| Opus | Lossy | 6-510 kbps | Best lossy codec | VoIP, WebRTC, streaming | +| Vorbis | Lossy | 64-500 kbps | Good | OGG containers, games | +| FLAC | Lossless | 800-1400 kbps | Perfect | Archival, audiophile | +| ALAC | Lossless | 800-1400 kbps | Perfect | Apple ecosystem | +| WAV/PCM | Uncompressed | 1411 kbps (CD) | Perfect | Recording, editing | +| AC-3 (Dolby Digital) | Lossy | 192-640 kbps | Good | DVD, Blu-ray, streaming surround | + +### How Audio Compression Works (MP3) + +```mermaid +graph LR + A[PCM Audio] --> B[Subband Filter] + B --> C[Psychoacoustic Model] + C --> D[Quantization] + D --> E[Huffman Encoding] + E --> F[MP3 Frames] + + style A fill:#E8E8E8 + style F fill:#90EE90 +``` + +1. **Subband filtering** — Split audio into 32 frequency subbands using a polyphase filter bank. +2. **MDCT** — Modified Discrete Cosine Transform on each subband for finer frequency resolution. +3. **Psychoacoustic model** — Determine which frequencies are inaudible due to masking: + - **Frequency masking:** A loud tone makes nearby quieter tones inaudible + - **Temporal masking:** A loud sound masks softer sounds just before and after it +4. **Quantization** — Allocate bits based on the psychoacoustic model. Inaudible frequencies get fewer (or zero) bits. +5. **Huffman encoding** — Entropy-code the quantized values. + +### MP3 Frame Structure + +MP3 is a **frame-based** format. 
Each frame is independently decodable (enabling seeking and streaming): + +``` +FF FB ← Sync word (11 bits all 1s) + header bits + Bits 12-13: MPEG version (11 = MPEG1) + Bits 14-15: Layer (01 = Layer III) + Bit 16: CRC protection + Bits 17-20: Bitrate index + Bits 21-22: Sample rate (00 = 44100 Hz) + Bit 23: Padding + Bit 24: Private + Bits 25-26: Channel mode (00 = stereo) +[Side information] ← Huffman table selections, scalefactors +[Main data] ← Huffman-coded frequency data +``` + +Each frame at 128 kbps/44.1 kHz contains 1152 audio samples (~26ms of audio). + +--- + +## Container Formats + +### Container Comparison + +| Container | Extension | Video Codecs | Audio Codecs | Features | Common Use | +|-----------|-----------|-------------|-------------|----------|------------| +| MP4 | `.mp4`, `.m4a`, `.m4v` | H.264, H.265, AV1 | AAC, AC-3, Opus | Chapters, subtitles, metadata | Web, streaming | +| MKV | `.mkv`, `.mka` | Anything | Anything | Most flexible, multiple tracks | Desktop media | +| WebM | `.webm` | VP8, VP9, AV1 | Vorbis, Opus | Web-optimized subset of MKV | Web video | +| AVI | `.avi` | Legacy codecs | PCM, MP3 | Simple but limited | Legacy | +| MOV | `.mov` | H.264, H.265, ProRes | AAC, ALAC | Apple's MP4 variant | Apple ecosystem, editing | +| OGG | `.ogg`, `.ogv` | Theora | Vorbis, Opus | Open standard | Open source | +| FLAC | `.flac` | N/A | FLAC only | Lossless audio | Audiophile, archival | +| WAV | `.wav` | N/A | PCM (usually) | RIFF-based, uncompressed | Recording, editing | + +### MP4 Box Structure + +MP4 files are organized as nested "boxes" (atoms), each with a type and size: + +``` +Offset Hex Meaning +00000000 00 00 00 20 Box size: 32 bytes +00000004 66 74 79 70 Box type: "ftyp" (file type) +00000008 69 73 6F 6D Major brand: "isom" +0000000C 00 00 02 00 Minor version: 512 +00000010 69 73 6F 6D 69 73 6F 32 Compatible: "isomiso2" +00000018 61 76 63 31 6D 70 34 31 Compatible: "avc1mp41" +``` + +Key boxes: + +``` +ftyp ← File type / brand 
declaration +moov ← Movie metadata (MUST exist) +├── mvhd ← Movie header (duration, timescale) +├── trak ← Track (one per stream) +│ ├── tkhd ← Track header (dimensions, duration) +│ └── mdia ← Media data +│ ├── mdhd ← Media header (timescale, language) +│ ├── hdlr ← Handler (video/audio/subtitle) +│ └── minf ← Media information +│ └── stbl ← Sample table (codec config, offsets, sizes) +└── udta ← User data / metadata +mdat ← Actual compressed media data (bulk of the file) +``` + +**Fast-start (web streaming):** The `moov` box must appear **before** `mdat` for progressive download to work. Videos encoded without this require the entire file to download before playback starts. Fix with: + +```bash +ffmpeg -i input.mp4 -movflags +faststart output.mp4 +``` + +### RIFF / WAV Structure + +WAV files use the RIFF container — a simple chunk-based format: + +``` +52 49 46 46 [file size-8] ← "RIFF" + remaining size +57 41 56 45 ← "WAVE" format identifier + +66 6D 74 20 [chunk size] ← "fmt " chunk (audio format) + 01 00 ← Format: 1 (PCM) + 02 00 ← Channels: 2 (stereo) + 44 AC 00 00 ← Sample rate: 44100 + 10 B1 02 00 ← Byte rate: 176400 + 04 00 ← Block align: 4 + 10 00 ← Bits per sample: 16 + +64 61 74 61 [chunk size] ← "data" chunk + [PCM audio samples...] ← Raw audio data +``` + +At 16-bit stereo 44.1 kHz (CD quality), uncompressed audio is ~10 MB per minute. 
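The header fields and the ~10 MB figure can be reproduced with Python's standard `wave` and `struct` modules. A minimal sketch, assuming the canonical 44-byte PCM header that `wave` writes (real-world files may carry extra chunks before `data`):

```python
import io
import struct
import wave

CHANNELS, SAMPLE_WIDTH, SAMPLE_RATE = 2, 2, 44100  # stereo, 16-bit, CD rate

buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(CHANNELS)
    wav.setsampwidth(SAMPLE_WIDTH)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(b"\x00" * CHANNELS * SAMPLE_WIDTH * 100)  # 100 silent frames

# RIFF header + "fmt " chunk + "data" chunk header, all little-endian.
(riff, riff_size, wave_id, fmt_id, fmt_size, audio_format, channels,
 sample_rate, byte_rate, block_align, bits, data_id, data_size) = struct.unpack(
    "<4sI4s4sIHHIIHH4sI", buf.getvalue()[:44]
)

# byte rate = sample rate * channels * bytes per sample
bytes_per_minute = byte_rate * 60
print(riff, wave_id, fmt_id, data_id)
print(channels, sample_rate, byte_rate, block_align, bits)
print(f"{bytes_per_minute / 1_000_000:.2f} MB per minute")
```

The byte rate is derived, not independent: 44100 samples/s × 2 channels × 2 bytes gives the 176400 shown in the hex dump, and multiplying by 60 gives the per-minute size.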
+ +--- + +## Streaming Formats + +Streaming video uses segmented delivery rather than single-file download: + +| Protocol | Format | Segments | Use Case | +|----------|--------|----------|----------| +| HLS | `.m3u8` playlist + `.ts` or `.mp4` segments | 2-10 second chunks | Apple, Safari, most CDNs | +| DASH | `.mpd` manifest + `.mp4` segments | 2-10 second chunks | Cross-platform, YouTube | +| WebRTC | Real-time packets | Per-frame | Video calls, live P2P | + +### Adaptive Bitrate + +Both HLS and DASH support **adaptive bitrate streaming** — multiple quality levels encoded, client switches based on bandwidth: + +``` +Master Playlist (HLS): + → 1080p @ 5 Mbps (strong connection) + → 720p @ 2.5 Mbps + → 480p @ 1 Mbps + → 360p @ 500 kbps (weak connection) + +Client monitors download speed and switches quality +between segments to minimize buffering. +``` + +--- + +## Practical Reference + +### Common Web Recommendations + +| Content | Format | Codec | Why | +|---------|--------|-------|-----| +| Video (broad support) | MP4 | H.264 + AAC | Universal browser support | +| Video (modern) | MP4 or WebM | AV1 + Opus | Best compression, royalty-free | +| Audio (music) | MP4 | AAC | Small, good quality | +| Audio (speech) | WebM or OGG | Opus | Best at low bitrates | +| Audio (lossless) | FLAC | FLAC | Open, widely supported | +| Live/real-time | WebRTC | VP8/VP9/AV1 + Opus | Low latency | + +### Useful Commands + +```bash +# Inspect media file +ffprobe -v quiet -show_format -show_streams input.mp4 + +# Convert video codec +ffmpeg -i input.mov -c:v libx264 -c:a aac output.mp4 + +# Extract audio from video +ffmpeg -i video.mp4 -vn -c:a copy audio.m4a + +# Re-mux without re-encoding (change container) +ffmpeg -i input.mkv -c copy output.mp4 + +# Add fast-start for web streaming +ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4 +``` + +--- + +## Related + +- [[File Formats]] — Parent overview of file format concepts +- [[File Metadata]] — EXIF, ID3, XMP metadata 
systems +- [[Image Formats]] — Still image format internals +- [[Archive and Compression Formats]] — Compression algorithms shared with media diff --git a/Computer Science/Document Formats.md b/Computer Science/Document Formats.md new file mode 100644 index 0000000..ee0f237 --- /dev/null +++ b/Computer Science/Document Formats.md @@ -0,0 +1,356 @@ +--- +title: Document Formats +aliases: + - PDF Format + - Office Formats + - EPUB Format + - Document File Formats +tags: + - cs + - fundamentals + - file-formats +type: concept +status: complete +difficulty: fundamentals +created: "2026-02-19" +--- + +# Document Formats + +How documents are stored digitally — from PDF's page description model to Office XML's ZIP-based structure and EPUB's web-standards approach. + +## Overview + +| Format | Structure | Editable | Layout | Use Case | +|--------|-----------|----------|--------|----------| +| PDF | Binary object graph | Difficult | Fixed (pixel-perfect) | Print, contracts, archival | +| DOCX | ZIP of XML | Yes (Word) | Reflowable | Business documents | +| ODT | ZIP of XML | Yes (LibreOffice) | Reflowable | Open-source documents | +| EPUB | ZIP of XHTML + CSS | Yes | Reflowable | E-books | +| RTF | Text markup | Yes | Basic | Legacy interchange | +| Plain text | Raw bytes | Yes | None | Code, logs, notes | +| LaTeX | Text markup | Yes (source) | Fixed (compiled) | Academic papers, math | + +--- + +## PDF (Portable Document Format) + +Created by Adobe in 1993, now ISO standard 32000. Designed for **fixed-layout** documents that look identical everywhere. + +### How PDF Works + +A PDF is not a sequence of pages like you'd expect. 
It's an **object graph** — a collection of numbered objects that reference each other: + +``` +┌─────────────────────────────────────────────┐ +│ Header: %PDF-1.7 │ +├─────────────────────────────────────────────┤ +│ Body: Numbered objects │ +│ │ +│ 1 0 obj (Catalog - root of document) │ +│ → points to Pages object │ +│ 2 0 obj (Pages - page tree) │ +│ → points to individual Page objects │ +│ 3 0 obj (Page 1) │ +│ → points to Content stream + Resources │ +│ 4 0 obj (Content stream - drawing commands) │ +│ → moveto, lineto, show text, draw image │ +│ 5 0 obj (Font - embedded or referenced) │ +│ → TrueType/Type1/CID font data │ +│ 6 0 obj (Image - embedded raster) │ +│ → JPEG/CCITT/Flate compressed pixels │ +│ │ +├─────────────────────────────────────────────┤ +│ Cross-Reference Table (xref) │ +│ Maps object numbers → byte offsets │ +├─────────────────────────────────────────────┤ +│ Trailer │ +│ Points to: Catalog, Info dict, xref offset │ +│ startxref [byte offset to xref] │ +│ %%EOF │ +└─────────────────────────────────────────────┘ +``` + +### PDF Hex Walkthrough + +``` +Offset Content Meaning +00000000 25 50 44 46 2D 31 2E 37 "%PDF-1.7" header +00000008 0A 25 E2 E3 CF D3 0A Binary comment (signals binary content) + + ...objects... + + 1 0 obj Object 1, generation 0 + << /Type /Catalog Catalog dictionary + /Pages 2 0 R >> Reference to object 2 + endobj + + 4 0 obj Content stream + << /Length 44 >> + stream + BT Begin Text + /F1 12 Tf Font F1, 12pt + 100 700 Td Move to (100, 700) + (Hello, World!) Tj Draw text + ET End Text + endstream + endobj +``` + +### PDF Content Streams + +Page content uses a PostScript-like drawing language: + +| Operator | Meaning | Example | +|----------|---------|---------| +| `BT` / `ET` | Begin/end text block | `BT ... 
ET` | +| `Tf` | Set font and size | `/F1 12 Tf` | +| `Td` | Move text position | `100 700 Td` | +| `Tj` | Show text string | `(Hello) Tj` | +| `m` | Move to point | `100 200 m` | +| `l` | Line to point | `300 400 l` | +| `S` | Stroke path | `S` | +| `f` | Fill path | `f` | +| `re` | Rectangle | `50 50 200 100 re` | +| `Do` | Draw XObject (image) | `/Img1 Do` | +| `cm` | Transform matrix | `1 0 0 1 100 200 cm` | + +### PDF Incremental Updates + +PDFs can be updated **without rewriting** — new objects and a new xref table are appended to the end: + +``` +[Original PDF content] +[Original xref] +[Original trailer + %%EOF] + +[New/modified objects] ← Appended +[New xref (references new objects)] +[New trailer + %%EOF] +``` + +This is how form filling and digital signatures work — the original content is preserved and new data is layered on. It also means "deleted" content may still exist in the file. + +### PDF Versions and Features + +| Feature | PDF Version | +|---------|-------------| +| Basic text and images | 1.0 (1993) | +| Interactive forms | 1.2 | +| JavaScript | 1.3 | +| Transparency | 1.4 | +| Embedded multimedia | 1.5 | +| AES encryption | 1.6 | +| 3D content | 1.6 | +| XFA forms | 1.5 | +| PDF/A (archival) | Based on 1.4-1.7 | +| PDF 2.0 (ISO 32000-2) | 2.0 (2017) | + +--- + +## Office Open XML (DOCX, XLSX, PPTX) + +Microsoft Office's format since 2007. A **ZIP archive** containing XML files, media, and relationships. 
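Because the package is an ordinary ZIP, standard tools can open it. A minimal sketch using Python's `zipfile`: it builds a stand-in package in memory with two placeholder parts (the XML bodies are stubs, so this is not a file Word would open) and then reads a part back, the same way one would with the path to a real `.docx`.

```python
import io
import zipfile

# Placeholder parts: the paths follow the OOXML layout, the XML bodies are stubs.
parts = {
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<document>Hello</document>",
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as pkg:
    for name, xml in parts.items():
        pkg.writestr(name, xml)

buf.seek(0)
looks_like_zip = zipfile.is_zipfile(buf)

with zipfile.ZipFile(buf) as pkg:
    names = pkg.namelist()
    body = pkg.read("word/document.xml").decode("utf-8")

print(looks_like_zip)
print(names)
print(body)
```

For a real document, replacing `buf` with a filename is the only change needed; the XML can then be handed to any XML parser.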
+ +### DOCX Structure + +```bash +$ unzip -l document.docx + [Content_Types].xml ← MIME type registry + _rels/.rels ← Root relationships + word/document.xml ← Main document content + word/styles.xml ← Style definitions + word/settings.xml ← Document settings + word/fontTable.xml ← Font declarations + word/theme/theme1.xml ← Theme (colors, fonts) + word/media/image1.png ← Embedded images + word/_rels/document.xml.rels ← Document relationships + docProps/core.xml ← Dublin Core metadata + docProps/app.xml ← Application metadata +``` + +### Document.xml Content + +```xml + + + + + + + + + + + Introduction + + + + +``` + +### XLSX Structure + +Spreadsheets split data across multiple XML files: + +``` +xl/worksheets/sheet1.xml ← Cell data (row/col/value) +xl/sharedStrings.xml ← String table (cells reference by index) +xl/styles.xml ← Number formats, fonts, fills +xl/workbook.xml ← Sheet names, defined names +``` + +Cell values reference the shared strings table by index, so repeated strings are stored once: + +```xml + + + Name + Revenue + + + + + 0 + 1 + 50000 + +``` + +--- + +## OpenDocument Format (ODF) + +ISO standard, used by LibreOffice and other open-source suites. Also ZIP-based with XML content, but uses different schemas. + +| Office XML | ODF Equivalent | +|-----------|----------------| +| `.docx` | `.odt` (text) | +| `.xlsx` | `.ods` (spreadsheet) | +| `.pptx` | `.odp` (presentation) | + +Structure is similar: ZIP containing `content.xml`, `styles.xml`, `meta.xml`, `META-INF/manifest.xml`. + +--- + +## EPUB + +The standard e-book format. Essentially a **website in a ZIP file** — XHTML pages + CSS + images + metadata. 
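One consequence of being a plain ZIP shows up in how EPUB files are written: the `mimetype` entry goes first and is stored uncompressed. A minimal sketch with Python's `zipfile` (the container XML is a stub, so the result is not a complete, valid EPUB) that checks where those bytes land:

```python
import io
import zipfile

MIMETYPE = b"application/epub+zip"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as epub:
    # First entry, stored without compression, so the type string is readable in place.
    epub.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                  compress_type=zipfile.ZIP_STORED)
    # Later entries may be compressed; this body is only a stub.
    epub.writestr("META-INF/container.xml", "<container/>",
                  compress_type=zipfile.ZIP_DEFLATED)

raw = buf.getvalue()
name_offset = raw.find(b"mimetype")   # filename follows the 30-byte local header
type_offset = raw.find(MIMETYPE)      # content follows the 8-byte filename

print(raw[:4], name_offset, type_offset)
```

With no extra field in the header, the fixed 30-byte local file header plus the 8-byte name puts the type string at a predictable position, which is what lets tools identify an EPUB by reading a few dozen bytes.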
+ +### EPUB Structure + +```bash +$ unzip -l book.epub + mimetype ← Must be first, uncompressed: "application/epub+zip" + META-INF/container.xml ← Points to the OPF file + OEBPS/content.opf ← Package document (manifest + spine) + OEBPS/toc.ncx ← Table of contents (EPUB 2) + OEBPS/nav.xhtml ← Navigation document (EPUB 3) + OEBPS/chapter1.xhtml ← Content pages + OEBPS/chapter2.xhtml + OEBPS/styles/main.css ← Stylesheets + OEBPS/images/cover.jpg ← Images +``` + +### Key EPUB Files + +**content.opf** — The manifest and reading order: + +```xml + + + The Great Novel + Author Name + en + isbn:978-0-123456-78-9 + + + + + + + + + + + + + + +``` + +**mimetype file** — The `mimetype` entry MUST be the first file in the ZIP and stored without compression, so its filename starts at byte offset 30 (right after the 30-byte local file header) and its content at byte offset 38. This allows identification without full ZIP parsing: + +``` +Offset Content +00000000 50 4B 03 04 ZIP local file header +... +0000001E 6D 69 6D 65 74 79 70 65 "mimetype" +00000026 61 70 70 6C 69 63 61 74 "application/epub+zip" + 69 6F 6E 2F 65 70 75 62 + 2B 7A 69 70 +``` + +### EPUB 2 vs EPUB 3 + +| Feature | EPUB 2 | EPUB 3 | +|---------|--------|--------| +| Content | XHTML 1.1 | HTML5 | +| Styling | CSS 2.1 | CSS3 | +| Navigation | NCX (XML) | nav.xhtml (HTML5) | +| Scripting | No | JavaScript (limited) | +| Audio/Video | No | HTML5 `