diff --git a/Computer Science MOC.md b/Computer Science MOC.md
index 6e0baae..4d7f56a 100644
--- a/Computer Science MOC.md
+++ b/Computer Science MOC.md
@@ -70,6 +70,15 @@ Fundamental CS concepts, data structures, algorithms, and system design.
- [[Testing Strategies]] — Unit, integration, E2E, BDD, property-based
+### File Formats & Media
+
+- [[File Formats]] — Binary structure, magic bytes, headers, and trailers
+- [[Image Formats]] — JPEG, PNG, GIF, WebP, AVIF internals
+- [[File Metadata]] — EXIF, GPS, XMP, ID3 metadata systems
+- [[Audio and Video Formats]] — Codecs, containers, streaming
+- [[Archive and Compression Formats]] — ZIP, tar, gzip, zstd, brotli
+- [[Document Formats]] — PDF, DOCX, EPUB internals
+
### Reference
- [[Technical Measurements]]
diff --git a/Computer Science/Archive and Compression Formats.md b/Computer Science/Archive and Compression Formats.md
new file mode 100644
index 0000000..48add0b
--- /dev/null
+++ b/Computer Science/Archive and Compression Formats.md
@@ -0,0 +1,304 @@
+---
+title: Archive and Compression Formats
+aliases:
+ - Compression Formats
+ - Archive Formats
+ - ZIP Format
+ - Compression Algorithms
+tags:
+ - cs
+ - fundamentals
+ - file-formats
+type: concept
+status: complete
+difficulty: fundamentals
+created: "2026-02-19"
+---
+
+# Archive and Compression Formats
+
+How files are bundled together (archiving) and made smaller (compression) — from ZIP internals to modern algorithms like Zstandard and Brotli.
+
+## Archive vs Compression
+
+These are separate concepts, often combined:
+
+| Concept | What It Does | Examples |
+|---------|-------------|----------|
+| **Archive** | Bundles multiple files into one | TAR, CPIO, AR |
+| **Compression** | Reduces file size | gzip, bzip2, zstd, brotli, LZMA |
+| **Both** | Bundles + compresses | ZIP, 7z, RAR |
+
+TAR was designed for tape archives and handles **only** archiving. Compression is applied separately:
+
+```bash
+# Archive only (no compression)
+tar cf archive.tar files/
+
+# Archive + gzip
+tar czf archive.tar.gz files/
+
+# Archive + zstandard
+tar --zstd -cf archive.tar.zst files/
+
+# Archive + bzip2
+tar cjf archive.tar.bz2 files/
+
+# Archive + xz (LZMA2)
+tar cJf archive.tar.xz files/
+```
+
+---
+
+## Compression Algorithms
+
+### Comparison
+
+| Algorithm | Ratio | Compress Speed | Decompress Speed | Used In |
+|-----------|-------|----------------|------------------|---------|
+| DEFLATE | Good | Moderate | Fast | ZIP, gzip, PNG, HTTP |
+| LZ4 | Low | Very fast | Very fast | Filesystem compression, real-time |
+| Zstandard (zstd) | Excellent | Fast | Very fast | Kernel, packaging, databases |
+| Brotli | Excellent | Slow | Fast | HTTP (WOFF2, web assets) |
+| LZMA/LZMA2 | Best | Very slow | Moderate | 7z, xz |
+| bzip2 | Good | Slow | Slow | Legacy, some distro packages |
+| LZW | Moderate | Fast | Fast | GIF (legacy) |
+| Snappy | Low | Very fast | Very fast | Google internal, Hadoop |
+
+### How LZ77/DEFLATE Works
+
+Most general-purpose compression builds on LZ77, which replaces repeated sequences with back-references:
+
+```
+Input: "ABCABCABCXYZ"
+ ↓
+Step 1: Output literal "ABC"
+Step 2: See "ABC" repeats → output (distance=3, length=6)
+Step 3: Output literal "XYZ"
+
+Compressed: ABC <3,6> XYZ
+
+The decoder reads forward:
+ "ABC" → emit as-is
+ <3,6> → go back 3 chars, copy 6 chars → "ABCABC"
+ "XYZ" → emit as-is
+ Result: "ABCABCABCXYZ"
+```
+
+DEFLATE combines LZ77 with Huffman coding — after finding repeated patterns, it Huffman-encodes the literals and back-references for additional compression.
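+
+The decoding steps above can be sketched in a few lines of Python. This is only an illustration of LZ77 back-references: the token representation (strings for literals, `(distance, length)` tuples for copies) and the name `lz77_decode` are assumptions made for this example, not DEFLATE's actual bit-level encoding.
+
+```python
+def lz77_decode(tokens):
+    """Expand literals (str) and (distance, length) back-references."""
+    out = []
+    for token in tokens:
+        if isinstance(token, str):
+            out.extend(token)  # literal run: emit characters as-is
+        else:
+            distance, length = token
+            start = len(out) - distance
+            # Copy one character at a time so that a copy may overlap
+            # the text it is producing (length > distance).
+            for i in range(length):
+                out.append(out[start + i])
+    return "".join(out)
+
+
+tokens = ["ABC", (3, 6), "XYZ"]
+print(lz77_decode(tokens))  # ABCABCABCXYZ
+```
+
+The character-by-character copy is what lets `<3,6>` work even though only three characters exist when the copy starts.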
+
+### Zstandard (zstd)
+
+Facebook/Meta's modern replacement for gzip. Uses Finite State Entropy (an ANS-based coder) alongside Huffman coding and has a dictionary mode for compressing many small items.
+
+```bash
+# Compress (default level 3)
+zstd file.dat
+
+# Compress with level 19 (max practical)
+zstd -19 file.dat
+
+# Train a dictionary on similar files (e.g., JSON logs)
+zstd --train samples/* -o dictionary
+
+# Compress using dictionary
+zstd -D dictionary file.json
+```
+
+Key advantage: zstd decompression speed is nearly constant regardless of compression level. You can spend more time compressing (once) and decompress quickly (many times).
+
+### Brotli
+
+Google's algorithm optimized for web content. Includes a built-in dictionary of common web strings (HTML tags, CSS properties, JavaScript keywords).
+
+```bash
+# Compress for web serving (level 11 = max)
+brotli -q 11 styles.css
+
+# HTTP responses carrying the result use the header:
+#   Content-Encoding: br
+```
+
+Typical web asset savings over gzip: 15-25% smaller.
+
+---
+
+## ZIP
+
+The most widely used archive+compression format. Also the foundation for DOCX, XLSX, JAR, APK, EPUB, and many other formats.
+
+### ZIP File Structure
+
+ZIP is unusual — the authoritative file index is at the **end** of the file, not the beginning:
+
+```
+┌────────────────────────────────────────────┐
+│ Local File Header 1 │
+│ 50 4B 03 04 (PK..) ← signature │
+│ version, flags, compression method │
+│ CRC-32, sizes, filename │
+│ [File Data 1 - compressed] │
+├────────────────────────────────────────────┤
+│ Local File Header 2 │
+│ [File Data 2 - compressed] │
+├────────────────────────────────────────────┤
+│ ...more files... │
+├────────────────────────────────────────────┤
+│ Central Directory │ ← The actual index
+│ 50 4B 01 02 (PK..) ← entry signature │
+│ Entry for File 1 (offset, size, name) │
+│ Entry for File 2 (offset, size, name) │
+│ ... │
+├────────────────────────────────────────────┤
+│ End of Central Directory Record │
+│ 50 4B 05 06 (PK..) ← EOCD signature │
+│ Number of entries │
+│ Central directory offset │
+│ Comment │
+└────────────────────────────────────────────┘
+```
+
+**Why the index is at the end:** ZIP was designed for appending files. You can add files to a ZIP without rewriting the entire archive — just append new local entries and write a new central directory.
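+
+A reader therefore starts from the tail of the file. The sketch below is a minimal illustration using only the standard library: it builds a two-file archive in memory and then locates the End of Central Directory record by searching backwards for its signature. The helper name `read_eocd` is made up for this example, and the simple `rfind` ignores edge cases such as a signature-like byte sequence inside the archive comment, and ZIP64.
+
+```python
+import io
+import struct
+import zipfile
+
+EOCD_SIGNATURE = b"PK\x05\x06"
+
+
+def read_eocd(data: bytes):
+    """Find the EOCD record and return (entry count, CD offset, CD size)."""
+    pos = data.rfind(EOCD_SIGNATURE)
+    if pos < 0:
+        raise ValueError("no EOCD record found")
+    # Fixed 22-byte part: signature, two disk numbers, two entry counts,
+    # central directory size and offset, comment length (little-endian).
+    (_sig, _disk, _cd_disk, _entries_on_disk, entries_total,
+     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
+    return entries_total, cd_offset, cd_size
+
+
+# Build a small archive in memory so no external file is needed.
+buf = io.BytesIO()
+with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
+    zf.writestr("a.txt", "first file")
+    zf.writestr("b.txt", "second file")
+data = buf.getvalue()
+
+entries, cd_offset, cd_size = read_eocd(data)
+print(entries, data[cd_offset:cd_offset + 4])
+```
+
+The four bytes at the reported offset are the signature of the first central directory entry (`50 4B 01 02`).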
+
+### ZIP Hex Walkthrough
+
+```
+Offset Hex Meaning
+00000000 50 4B 03 04 Local file header signature
+00000004 14 00 Version needed: 2.0
+00000006 00 00 Flags: none
+00000008 08 00 Compression: DEFLATE
+0000000A 4A 7D Mod time (MS-DOS format)
+0000000C 54 59 Mod date (MS-DOS format)
+0000000E XX XX XX XX CRC-32
+00000012 XX XX XX XX Compressed size
+00000016 XX XX XX XX Uncompressed size
+0000001A 09 00 Filename length: 9
+0000001C 00 00 Extra field length: 0
+0000001E 68 65 6C 6C 6F 2E 74 78 74 "hello.txt" (not null-terminated)
+00000027 [compressed data...] DEFLATE'd file content
+```
+
+### ZIP Compression Methods
+
+| Value | Method | Notes |
+|-------|--------|-------|
+| 0 | Stored | No compression (files already compressed) |
+| 8 | DEFLATE | Standard, universal support |
+| 9 | DEFLATE64 | Larger window, rare |
+| 12 | bzip2 | Better ratio, less common |
+| 14 | LZMA | Algorithm from 7-Zip, uncommon in ZIP |
+| 93 | Zstandard | Modern, gaining support |
+| 95 | XZ (LZMA2) | Very high ratio |
+
+### ZIP-Based Formats
+
+| Format | Extension | Contents |
+|--------|-----------|----------|
+| Office Open XML | `.docx`, `.xlsx`, `.pptx` | XML + media files |
+| Java Archive | `.jar` | `.class` files + manifest |
+| Android Package | `.apk` | DEX + resources + manifest |
+| EPUB | `.epub` | XHTML + CSS + images |
+| OpenDocument | `.odt`, `.ods`, `.odp` | XML + media |
+| XPI (Firefox ext) | `.xpi` | Web extension files |
+| IPSW (iOS firmware) | `.ipsw` | Firmware images |
+
+---
+
+## TAR (Tape Archive)
+
+Unix archiving format from 1979. No compression — purely bundles files with metadata.
+
+### TAR Header Structure
+
+Each file is preceded by a 512-byte header block:
+
+```
+Offset Size Field
+0 100 Filename (null-terminated)
+100 8 File mode (octal ASCII)
+108 8 Owner UID (octal ASCII)
+116 8 Group GID (octal ASCII)
+124 12 File size (octal ASCII)
+136 12 Modification time (Unix epoch, octal)
+148 8 Header checksum
+156 1 Type flag ('0'=file, '5'=directory, '2'=symlink)
+157 100 Link target name
+257 6 "ustar" magic
+263 2 Version "00"
+265 32 Owner username
+297 32 Group name
+329 8 Device major
+337 8 Device minor
+345 155 Filename prefix (for paths > 100 chars)
+500 12 Padding to 512 bytes
+```
+
+File data follows immediately, padded to a 512-byte boundary. The archive ends with two consecutive 512-byte blocks of zeros.
+
+**Note:** Numeric fields in TAR headers are stored as ASCII octal strings and names as plain text, which makes headers largely human-readable in a hex editor.
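+
+As a rough illustration of the header layout, the following sketch writes a one-file ustar archive in memory with Python's `tarfile` module and decodes a few fields by offset. The function name `parse_tar_header` is an assumption for this example, and only the name, size, magic, and checksum fields are handled.
+
+```python
+import io
+import tarfile
+
+
+def parse_tar_header(block: bytes):
+    """Decode a few fields from one 512-byte ustar header block."""
+    name = block[0:100].rstrip(b"\0").decode("utf-8")
+    size = int(block[124:136].rstrip(b"\0 "), 8)  # octal ASCII
+    stored_checksum = int(block[148:156].rstrip(b"\0 "), 8)
+    # The checksum is the sum of all header bytes, with the checksum
+    # field itself counted as eight ASCII spaces.
+    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:512])
+    magic = block[257:262]
+    return name, size, magic, stored_checksum == computed
+
+
+payload = b"hello tar\n"
+buf = io.BytesIO()
+with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
+    info = tarfile.TarInfo(name="hello.txt")
+    info.size = len(payload)
+    tar.addfile(info, io.BytesIO(payload))
+data = buf.getvalue()
+
+print(parse_tar_header(data[:512]))
+print(data[512:512 + len(payload)])  # file data starts in the next block
+```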
+
+---
+
+## Gzip
+
+The standard compression wrapper on Unix. Compresses a single stream using DEFLATE.
+
+### Gzip Header
+
+```
+1F 8B ← Magic number
+08 ← Compression method: DEFLATE
+XX ← Flags (FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT)
+XX XX XX XX ← Modification time (Unix epoch, LE)
+XX ← Extra flags (compression level hint)
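+XX ← OS (0=FAT, 3=Unix, 7=Macintosh, 11=NTFS)
+[optional: extra field, if FEXTRA is set]
+[optional: original filename, null-terminated, if FNAME is set]
+[optional: comment, null-terminated, if FCOMMENT is set]
+[optional: 2-byte header CRC, if FHCRC is set]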
+[DEFLATE compressed data]
+XX XX XX XX ← CRC-32 of original data
+XX XX XX XX ← Original size mod 2^32
+```
+
+The trailing CRC-32 and size allow integrity verification after decompression.
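+
+A minimal sketch of that check, using only the standard library: it compresses some sample bytes, unpacks the fixed 10-byte header and the 8-byte trailer with `struct`, and compares the trailer against the decompressed data. The sample payload is arbitrary.
+
+```python
+import gzip
+import struct
+import zlib
+
+original = b"hello gzip " * 20
+blob = gzip.compress(original)
+
+# Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS.
+magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", blob[:10])
+
+# Trailer: CRC-32 and size (mod 2**32) of the uncompressed data.
+crc32, isize = struct.unpack("<II", blob[-8:])
+
+restored = gzip.decompress(blob)
+print(magic.hex(), method, isize)
+print(crc32 == zlib.crc32(restored), isize == len(restored) % 2**32)
+```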
+
+---
+
+## 7z
+
+7-Zip's native format. Supports multiple compression methods and solid compression (compressing multiple files as a single stream for better ratio).
+
+### 7z Signature
+
+```
+37 7A BC AF 27 1C ← Magic bytes: "7z" + 4 signature bytes
+00 04 ← Format version
+[header with offsets to compressed streams and metadata]
+```
+
+7z achieves the best compression ratios among common formats by using LZMA2 with large dictionaries and solid compression, at the cost of slower compression speed and higher memory usage.
+
+---
+
+## Choosing a Format
+
+| Scenario | Recommended | Why |
+|----------|-------------|-----|
+| General file sharing | ZIP | Universal support, every OS handles it natively |
+| Unix/Linux packages | `.tar.gz` or `.tar.zst` | Standard convention, preserves permissions/ownership |
+| Maximum compression | `.tar.xz` or `.7z` | Best ratios, worth the slower compression |
+| Fast compression | `.tar.zst` or `.tar.lz4` | Near-instant compress/decompress |
+| Web assets | Brotli (`.br`) | Best ratio for HTTP, built-in web dictionary |
+| Incremental backups | TAR (append mode) | Add files without rewriting |
+| Cross-platform distribution | ZIP | Zero dependency on any platform |
+| Container/Docker layers | gzip or zstd | OCI standard, broad registry support |
+
+---
+
+## Related
+
+- [[File Formats]] — Parent overview of file format concepts
+- [[File Metadata]] — Metadata preserved in archives (timestamps, permissions)
+- [[Document Formats]] — PDF, DOCX, EPUB (ZIP-based formats)
+- [[Build Systems]] — Build tools that produce archives
+- [[Deployment]] — Container images and artifact packaging
diff --git a/Computer Science/Audio and Video Formats.md b/Computer Science/Audio and Video Formats.md
new file mode 100644
index 0000000..545e5b4
--- /dev/null
+++ b/Computer Science/Audio and Video Formats.md
@@ -0,0 +1,294 @@
+---
+title: Audio and Video Formats
+aliases:
+ - Video Formats
+ - Audio Formats
+ - Media Formats
+ - Codecs
+ - Video Codecs
+ - Audio Codecs
+tags:
+ - cs
+ - fundamentals
+ - file-formats
+ - media
+type: concept
+status: complete
+difficulty: fundamentals
+created: "2026-02-19"
+---
+
+# Audio and Video Formats
+
+How audio and video are encoded, compressed, and packaged — the distinction between codecs (compression) and containers (packaging).
+
+## Codec vs Container
+
+The most important concept in media formats: **codecs** and **containers** are separate things.
+
+| Concept | What It Does | Examples |
+|---------|-------------|----------|
+| **Codec** | Compresses/decompresses audio or video data | H.264, H.265, AV1, VP9, AAC, Opus |
+| **Container** | Packages codec streams + metadata into a file | MP4, MKV, WebM, AVI, MOV, OGG |
+
+A container holds one or more streams (video, audio, subtitles, metadata) that can each use a different codec:
+
+```
+┌─ MP4 Container ──────────────────────────────┐
+│ │
+│ Stream 0: Video → H.264 codec, 1080p 24fps │
+│ Stream 1: Audio → AAC codec, 48kHz stereo │
+│ Stream 2: Audio → AAC codec, 48kHz 5.1 │
+│ Stream 3: Subtitle → timed text (tx3g) │
+│ Metadata: title, duration, GPS, chapters │
+│ │
+└───────────────────────────────────────────────┘
+```
+
+---
+
+## Video Codecs
+
+### Codec Comparison
+
+| Codec | Standard | Compression | Licensing | Browser Support | Typical Use |
+|-------|----------|-------------|-----------|-----------------|-------------|
+| H.264 (AVC) | MPEG-4 Part 10 | Good | Patented (MPEG LA) | Universal | Web, streaming, Blu-ray |
+| H.265 (HEVC) | MPEG-H Part 2 | ~50% better than H.264 | Patented (expensive) | Safari, some others | 4K broadcast, Apple ecosystem |
+| AV1 | Alliance for Open Media (AOMedia) | ~30% better than H.265 | Royalty-free | ~95% browsers | YouTube, Netflix, next-gen streaming |
+| VP9 | Google | ~similar to H.265 | Royalty-free | ~97% browsers | YouTube (legacy) |
+| VP8 | Google | ~similar to H.264 | Royalty-free | Wide | WebRTC (legacy) |
+| ProRes | Apple | Visually lossless | Proprietary | N/A | Professional editing |
+
+### How Video Compression Works
+
+Video codecs exploit three types of redundancy:
+
+**Spatial** — Within a single frame, nearby pixels are similar (same as image compression).
+
+**Temporal** — Consecutive frames are mostly identical. Instead of storing every pixel for every frame, store the differences (motion vectors + residuals).
+
+**Perceptual** — Human vision is less sensitive to certain details. Quantize aggressively in areas the eye won't notice.
+
+### Frame Types
+
+| Type | Name | Description | Size |
+|------|------|-------------|------|
+| I-frame | Intra | Complete image (like a JPEG). Seek point. | Largest |
+| P-frame | Predicted | References previous frames. Stores only differences. | Medium |
+| B-frame | Bidirectional | References both past and future frames. | Smallest |
+
+```
+I ← P ← P ← B ← B ← P ← P ← I ← P ← P ...
+▲ ▲
+Keyframe (seekable) Keyframe (seekable)
+└──────── GOP (Group of Pictures) ────────┘
+```
+
+**GOP (Group of Pictures)** — The sequence between keyframes. Longer GOPs = better compression but slower seeking. Streaming services typically use 2-4 second GOPs.
+
+---
+
+## Audio Codecs
+
+### Codec Comparison
+
+| Codec | Type | Bitrate Range | Quality | Typical Use |
+|-------|------|---------------|---------|-------------|
+| MP3 (MPEG-1 Layer 3) | Lossy | 128-320 kbps | Good | Legacy music distribution |
+| AAC (Advanced Audio) | Lossy | 96-256 kbps | Better than MP3 | Streaming, Apple, YouTube |
+| Opus | Lossy | 6-510 kbps | Best lossy codec | VoIP, WebRTC, streaming |
+| Vorbis | Lossy | 64-500 kbps | Good | OGG containers, games |
+| FLAC | Lossless | 800-1400 kbps | Perfect | Archival, audiophile |
+| ALAC | Lossless | 800-1400 kbps | Perfect | Apple ecosystem |
+| WAV/PCM | Uncompressed | 1411 kbps (CD) | Perfect | Recording, editing |
+| AC-3 (Dolby Digital) | Lossy | 192-640 kbps | Good | DVD, Blu-ray, streaming surround |
+
+### How Audio Compression Works (MP3)
+
+```mermaid
+graph LR
+ A[PCM Audio] --> B[Subband Filter]
+ B --> M[MDCT]
+ M --> C[Psychoacoustic Model]
+ C --> D[Quantization]
+ D --> E[Huffman Encoding]
+ E --> F[MP3 Frames]
+
+ style A fill:#E8E8E8
+ style F fill:#90EE90
+```
+
+1. **Subband filtering** — Split audio into 32 frequency subbands using a polyphase filter bank.
+2. **MDCT** — Modified Discrete Cosine Transform on each subband for finer frequency resolution.
+3. **Psychoacoustic model** — Determine which frequencies are inaudible due to masking:
+ - **Frequency masking:** A loud tone makes nearby quieter tones inaudible
+ - **Temporal masking:** A loud sound masks softer sounds just before and after it
+4. **Quantization** — Allocate bits based on the psychoacoustic model. Inaudible frequencies get fewer (or zero) bits.
+5. **Huffman encoding** — Entropy-code the quantized values.
+
+### MP3 Frame Structure
+
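+MP3 is a **frame-based** format. Each frame starts with its own sync word and header, so a decoder can resynchronize at any frame boundary (enabling seeking and streaming), although the bit reservoir lets a frame's main data begin in an earlier frame: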
+
+```
+FF FB ← Sync word (11 bits all 1s) + header bits
+ Bits 12-13: MPEG version (11 = MPEG1)
+ Bits 14-15: Layer (01 = Layer III)
+ Bit 16: CRC protection
+ Bits 17-20: Bitrate index
+ Bits 21-22: Sample rate (00 = 44100 Hz)
+ Bit 23: Padding
+ Bit 24: Private
+ Bits 25-26: Channel mode (00 = stereo)
+[Side information] ← Huffman table selections, scalefactors
+[Main data] ← Huffman-coded frequency data
+```
+
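+Every MPEG-1 Layer III frame contains 1152 audio samples, about 26 ms of audio at 44.1 kHz. At 128 kbps that works out to 417 or 418 bytes per frame, depending on the padding bit.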
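+
+The header fields listed above can be decoded with a few shifts and masks. The sketch below handles only MPEG-1 Layer III and uses the example header bytes `FF FB 90 00` (128 kbps, 44.1 kHz, no padding); the function name `parse_mp3_header` and the returned tuple are choices made for this illustration.
+
+```python
+# Bitrates in kbps for MPEG-1 Layer III; index 0 means "free format".
+BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
+SAMPLE_RATES_HZ = [44100, 48000, 32000]  # MPEG-1
+SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III
+
+
+def parse_mp3_header(header: bytes):
+    """Return (bitrate bps, sample rate Hz, frame bytes, frame duration ms)."""
+    word = int.from_bytes(header[:4], "big")
+    if word >> 21 != 0x7FF:
+        raise ValueError("sync word not found")
+    version_bits = (word >> 19) & 0b11
+    layer_bits = (word >> 17) & 0b11
+    if version_bits != 0b11 or layer_bits != 0b01:
+        raise ValueError("this sketch handles MPEG-1 Layer III only")
+    bitrate_index = (word >> 12) & 0xF
+    rate_index = (word >> 10) & 0b11
+    if bitrate_index in (0, 15) or rate_index == 3:
+        raise ValueError("free-format or reserved value")
+    bitrate = BITRATES_KBPS[bitrate_index] * 1000
+    sample_rate = SAMPLE_RATES_HZ[rate_index]
+    padding = (word >> 9) & 1
+    frame_bytes = 144 * bitrate // sample_rate + padding
+    duration_ms = 1000 * SAMPLES_PER_FRAME / sample_rate
+    return bitrate, sample_rate, frame_bytes, duration_ms
+
+
+print(parse_mp3_header(bytes.fromhex("FFFB9000")))
+```
+
+The factor 144 is 1152 samples divided by 8 bits per byte, so the frame size is simply the frame duration multiplied by the byte rate.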
+
+---
+
+## Container Formats
+
+### Container Comparison
+
+| Container | Extension | Video Codecs | Audio Codecs | Features | Common Use |
+|-----------|-----------|-------------|-------------|----------|------------|
+| MP4 | `.mp4`, `.m4a`, `.m4v` | H.264, H.265, AV1 | AAC, AC-3, Opus | Chapters, subtitles, metadata | Web, streaming |
+| MKV | `.mkv`, `.mka` | Anything | Anything | Most flexible, multiple tracks | Desktop media |
+| WebM | `.webm` | VP8, VP9, AV1 | Vorbis, Opus | Web-optimized subset of MKV | Web video |
+| AVI | `.avi` | Legacy codecs | PCM, MP3 | Simple but limited | Legacy |
+| MOV | `.mov` | H.264, H.265, ProRes | AAC, ALAC | Apple QuickTime format, the basis for MP4 | Apple ecosystem, editing |
+| OGG | `.ogg`, `.ogv` | Theora | Vorbis, Opus | Open standard | Open source |
+| FLAC | `.flac` | N/A | FLAC only | Lossless audio | Audiophile, archival |
+| WAV | `.wav` | N/A | PCM (usually) | RIFF-based, uncompressed | Recording, editing |
+
+### MP4 Box Structure
+
+MP4 files are organized as nested "boxes" (atoms), each with a type and size:
+
+```
+Offset Hex Meaning
+00000000 00 00 00 20 Box size: 32 bytes
+00000004 66 74 79 70 Box type: "ftyp" (file type)
+00000008 69 73 6F 6D Major brand: "isom"
+0000000C 00 00 02 00 Minor version: 512
+00000010 69 73 6F 6D 69 73 6F 32 Compatible: "isomiso2"
+00000018 61 76 63 31 6D 70 34 31 Compatible: "avc1mp41"
+```
+
+Key boxes:
+
+```
+ftyp ← File type / brand declaration
+moov ← Movie metadata (MUST exist)
+├── mvhd ← Movie header (duration, timescale)
+├── trak ← Track (one per stream)
+│ ├── tkhd ← Track header (dimensions, duration)
+│ └── mdia ← Media data
+│ ├── mdhd ← Media header (timescale, language)
+│ ├── hdlr ← Handler (video/audio/subtitle)
+│ └── minf ← Media information
+│ └── stbl ← Sample table (codec config, offsets, sizes)
+└── udta ← User data / metadata
+mdat ← Actual compressed media data (bulk of the file)
+```
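+
+Because every box begins with a 4-byte big-endian size and a 4-byte type, walking the top level of a file takes only a short loop. The following is a sketch under simplifying assumptions: `iter_boxes` is a name invented here, the sample bytes are the `ftyp` box from the hex walkthrough followed by an empty `free` box added for illustration, and nested boxes are not descended into.
+
+```python
+import struct
+
+
+def iter_boxes(data: bytes, offset: int = 0, end=None):
+    """Yield (type, payload_start, payload_end) for each box in data."""
+    end = len(data) if end is None else end
+    while offset + 8 <= end:
+        size, box_type = struct.unpack_from(">I4s", data, offset)
+        header = 8
+        if size == 1:  # 64-bit size stored after the type
+            size = struct.unpack_from(">Q", data, offset + 8)[0]
+            header = 16
+        elif size == 0:  # box runs to the end of the data
+            size = end - offset
+        if size < header:
+            raise ValueError("corrupt box size")
+        yield box_type.decode("ascii"), offset + header, offset + size
+        offset += size
+
+
+ftyp = bytes.fromhex(
+    "00000020 66747970 69736F6D 00000200"
+    "69736F6D 69736F32 61766331 6D703431"
+)
+data = ftyp + struct.pack(">I4s", 8, b"free")
+
+boxes = list(iter_boxes(data))
+print(boxes)
+
+_, start, stop = boxes[0]
+payload = data[start:stop]
+major_brand = payload[:4]
+compatible = [payload[i:i + 4] for i in range(8, len(payload), 4)]
+print(major_brand, compatible)
+```
+
+Calling `iter_boxes` again on the payload range of a container box such as `moov` would list its children in the same way.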
+
+**Fast-start (web streaming):** The `moov` box must appear **before** `mdat` for progressive download to work. Videos encoded without this require the entire file to download before playback starts. Fix with:
+
+```bash
+ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4
+```
+
+### RIFF / WAV Structure
+
+WAV files use the RIFF container — a simple chunk-based format:
+
+```
+52 49 46 46 [file size-8] ← "RIFF" + remaining size
+57 41 56 45 ← "WAVE" format identifier
+
+66 6D 74 20 [chunk size] ← "fmt " chunk (audio format)
+ 01 00 ← Format: 1 (PCM)
+ 02 00 ← Channels: 2 (stereo)
+ 44 AC 00 00 ← Sample rate: 44100
+ 10 B1 02 00 ← Byte rate: 176400
+ 04 00 ← Block align: 4
+ 10 00 ← Bits per sample: 16
+
+64 61 74 61 [chunk size] ← "data" chunk
+ [PCM audio samples...] ← Raw audio data
+```
+
+At 16-bit stereo 44.1 kHz (CD quality), uncompressed audio is ~10 MB per minute.
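+
+The byte-rate arithmetic can be checked against a real header. This sketch writes one second of silent CD-quality audio to memory with the standard `wave` module and reads the `fmt ` chunk back with `struct`; it assumes the canonical layout in which `fmt ` is the first chunk, which is what `wave` produces.
+
+```python
+import io
+import struct
+import wave
+
+buf = io.BytesIO()
+with wave.open(buf, "wb") as w:
+    w.setnchannels(2)      # stereo
+    w.setsampwidth(2)      # 16-bit samples
+    w.setframerate(44100)
+    w.writeframes(b"\x00" * 4 * 44100)  # one second of silence
+data = buf.getvalue()
+
+riff, riff_size, wave_id = struct.unpack_from("<4sI4s", data, 0)
+chunk_id, chunk_size = struct.unpack_from("<4sI", data, 12)
+(audio_format, channels, sample_rate,
+ byte_rate, block_align, bits) = struct.unpack_from("<HHIIHH", data, 20)
+
+print(riff, wave_id, chunk_id, chunk_size)
+print(channels, sample_rate, byte_rate, block_align, bits)
+print(byte_rate * 60 / 1_000_000, "MB per minute")
+```
+
+The byte rate is channels × sample rate × bytes per sample (2 × 44100 × 2 = 176400), which gives about 10.6 MB per minute.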
+
+---
+
+## Streaming Formats
+
+Streaming video uses segmented delivery rather than single-file download:
+
+| Protocol | Format | Segments | Use Case |
+|----------|--------|----------|----------|
+| HLS | `.m3u8` playlist + `.ts` or `.mp4` segments | 2-10 second chunks | Apple, Safari, most CDNs |
+| DASH | `.mpd` manifest + `.mp4` segments | 2-10 second chunks | Cross-platform, YouTube |
+| WebRTC | Real-time packets | Per-frame | Video calls, live P2P |
+
+### Adaptive Bitrate
+
+Both HLS and DASH support **adaptive bitrate streaming** — multiple quality levels encoded, client switches based on bandwidth:
+
+```
+Master Playlist (HLS):
+ → 1080p @ 5 Mbps (strong connection)
+ → 720p @ 2.5 Mbps
+ → 480p @ 1 Mbps
+ → 360p @ 500 kbps (weak connection)
+
+Client monitors download speed and switches quality
+between segments to minimize buffering.
+```
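+
+The switching decision can be sketched as a simple selection over the ladder above. This is a deliberately simplified model: the 0.8 safety factor and the name `pick_rendition` are assumptions for illustration, and real players also weigh buffer level, screen size, and recent throughput history.
+
+```python
+# (label, required bandwidth in bits per second), highest quality first.
+RENDITIONS = [
+    ("1080p", 5_000_000),
+    ("720p", 2_500_000),
+    ("480p", 1_000_000),
+    ("360p", 500_000),
+]
+
+
+def pick_rendition(measured_bps: float, safety: float = 0.8) -> str:
+    """Return the best rendition that fits within a safety margin."""
+    budget = measured_bps * safety
+    for label, bitrate in RENDITIONS:
+        if bitrate <= budget:
+            return label
+    return RENDITIONS[-1][0]  # nothing fits: fall back to the lowest quality
+
+
+for bandwidth in (8_000_000, 3_000_000, 400_000):
+    print(bandwidth, pick_rendition(bandwidth))
+```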
+
+---
+
+## Practical Reference
+
+### Common Web Recommendations
+
+| Content | Format | Codec | Why |
+|---------|--------|-------|-----|
+| Video (broad support) | MP4 | H.264 + AAC | Universal browser support |
+| Video (modern) | MP4 or WebM | AV1 + Opus | Best compression, royalty-free |
+| Audio (music) | MP4 | AAC | Small, good quality |
+| Audio (speech) | WebM or OGG | Opus | Best at low bitrates |
+| Audio (lossless) | FLAC | FLAC | Open, widely supported |
+| Live/real-time | WebRTC | VP8/VP9/AV1 + Opus | Low latency |
+
+### Useful Commands
+
+```bash
+# Inspect media file
+ffprobe -v quiet -show_format -show_streams input.mp4
+
+# Convert video codec
+ffmpeg -i input.mov -c:v libx264 -c:a aac output.mp4
+
+# Extract audio from video
+ffmpeg -i video.mp4 -vn -c:a copy audio.m4a
+
+# Re-mux without re-encoding (change container)
+ffmpeg -i input.mkv -c copy output.mp4
+
+# Add fast-start for web streaming
+ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4
+```
+
+---
+
+## Related
+
+- [[File Formats]] — Parent overview of file format concepts
+- [[File Metadata]] — EXIF, ID3, XMP metadata systems
+- [[Image Formats]] — Still image format internals
+- [[Archive and Compression Formats]] — Compression algorithms shared with media
diff --git a/Computer Science/Document Formats.md b/Computer Science/Document Formats.md
new file mode 100644
index 0000000..ee0f237
--- /dev/null
+++ b/Computer Science/Document Formats.md
@@ -0,0 +1,356 @@
+---
+title: Document Formats
+aliases:
+ - PDF Format
+ - Office Formats
+ - EPUB Format
+ - Document File Formats
+tags:
+ - cs
+ - fundamentals
+ - file-formats
+type: concept
+status: complete
+difficulty: fundamentals
+created: "2026-02-19"
+---
+
+# Document Formats
+
+How documents are stored digitally — from PDF's page description model to Office XML's ZIP-based structure and EPUB's web-standards approach.
+
+## Overview
+
+| Format | Structure | Editable | Layout | Use Case |
+|--------|-----------|----------|--------|----------|
+| PDF | Binary object graph | Difficult | Fixed (pixel-perfect) | Print, contracts, archival |
+| DOCX | ZIP of XML | Yes (Word) | Reflowable | Business documents |
+| ODT | ZIP of XML | Yes (LibreOffice) | Reflowable | Open-source documents |
+| EPUB | ZIP of XHTML + CSS | Yes | Reflowable | E-books |
+| RTF | Text markup | Yes | Basic | Legacy interchange |
+| Plain text | Raw bytes | Yes | None | Code, logs, notes |
+| LaTeX | Text markup | Yes (source) | Fixed (compiled) | Academic papers, math |
+
+---
+
+## PDF (Portable Document Format)
+
+Created by Adobe in 1993, now ISO standard 32000. Designed for **fixed-layout** documents that look identical everywhere.
+
+### How PDF Works
+
+A PDF is not a sequence of pages like you'd expect. It's an **object graph** — a collection of numbered objects that reference each other:
+
+```
+┌─────────────────────────────────────────────┐
+│ Header: %PDF-1.7 │
+├─────────────────────────────────────────────┤
+│ Body: Numbered objects │
+│ │
+│ 1 0 obj (Catalog - root of document) │
+│ → points to Pages object │
+│ 2 0 obj (Pages - page tree) │
+│ → points to individual Page objects │
+│ 3 0 obj (Page 1) │
+│ → points to Content stream + Resources │
+│ 4 0 obj (Content stream - drawing commands) │
+│ → moveto, lineto, show text, draw image │
+│ 5 0 obj (Font - embedded or referenced) │
+│ → TrueType/Type1/CID font data │
+│ 6 0 obj (Image - embedded raster) │
+│ → JPEG/CCITT/Flate compressed pixels │
+│ │
+├─────────────────────────────────────────────┤
+│ Cross-Reference Table (xref) │
+│ Maps object numbers → byte offsets │
+├─────────────────────────────────────────────┤
+│ Trailer │
+│ Points to: Catalog, Info dict, xref offset │
+│ startxref [byte offset to xref] │
+│ %%EOF │
+└─────────────────────────────────────────────┘
+```
+
+### PDF Hex Walkthrough
+
+```
+Offset Content Meaning
+00000000 25 50 44 46 2D 31 2E 37 "%PDF-1.7" header
+00000008 0A 25 E2 E3 CF D3 0A Binary comment (signals binary content)
+
+ ...objects...
+
+ 1 0 obj Object 1, generation 0
+ << /Type /Catalog Catalog dictionary
+ /Pages 2 0 R >> Reference to object 2
+ endobj
+
+ 4 0 obj Content stream
+ << /Length 44 >>
+ stream
+ BT Begin Text
+ /F1 12 Tf Font F1, 12pt
+ 100 700 Td Move to (100, 700)
+ (Hello, World!) Tj Draw text
+ ET End Text
+ endstream
+ endobj
+```
+
+### PDF Content Streams
+
+Page content uses a PostScript-like drawing language:
+
+| Operator | Meaning | Example |
+|----------|---------|---------|
+| `BT` / `ET` | Begin/end text block | `BT ... ET` |
+| `Tf` | Set font and size | `/F1 12 Tf` |
+| `Td` | Move text position | `100 700 Td` |
+| `Tj` | Show text string | `(Hello) Tj` |
+| `m` | Move to point | `100 200 m` |
+| `l` | Line to point | `300 400 l` |
+| `S` | Stroke path | `S` |
+| `f` | Fill path | `f` |
+| `re` | Rectangle | `50 50 200 100 re` |
+| `Do` | Draw XObject (image) | `/Img1 Do` |
+| `cm` | Transform matrix | `1 0 0 1 100 200 cm` |
+
+### PDF Incremental Updates
+
+PDFs can be updated **without rewriting** — new objects and a new xref table are appended to the end:
+
+```
+[Original PDF content]
+[Original xref]
+[Original trailer + %%EOF]
+
+[New/modified objects] ← Appended
+[New xref (references new objects)]
+[New trailer + %%EOF]
+```
+
+This is how form filling and digital signatures work — the original content is preserved and new data is layered on. It also means "deleted" content may still exist in the file.
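+
+One way to see the layering is to list every `startxref` in a file, since each revision ends with its own. The sketch below runs on a made-up byte string that only imitates the layout (the offsets 120 and 480 are invented and it is not a valid PDF); `pdf_revisions` is a hypothetical helper name.
+
+```python
+import re
+
+
+def pdf_revisions(data: bytes):
+    """Return the xref offset named by each startxref, oldest first."""
+    pattern = rb"startxref\s+(\d+)\s+%%EOF"
+    return [int(m.group(1)) for m in re.finditer(pattern, data)]
+
+
+# Simplified stand-in for a PDF with one incremental update.
+sample = (
+    b"%PDF-1.7\n...original objects...\n"
+    b"xref\n...\ntrailer\n<< /Root 1 0 R >>\nstartxref\n120\n%%EOF\n"
+    b"...appended objects...\n"
+    b"xref\n...\ntrailer\n<< /Root 1 0 R /Prev 120 >>\nstartxref\n480\n%%EOF\n"
+)
+
+revisions = pdf_revisions(sample)
+print(len(revisions), revisions)
+```
+
+A reader uses the last `startxref` first, then follows the `/Prev` entry in each trailer back to older cross-reference tables, which is why earlier revisions remain recoverable.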
+
+### PDF Versions and Features
+
+| Feature | PDF Version |
+|---------|-------------|
+| Basic text and images | 1.0 (1993) |
+| Interactive forms | 1.2 |
+| JavaScript | 1.3 |
+| Transparency | 1.4 |
+| Embedded multimedia | 1.5 |
+| XFA forms | 1.5 |
+| AES encryption | 1.6 |
+| 3D content | 1.6 |
+| PDF/A (archival) | Based on 1.4-1.7 |
+| PDF 2.0 (ISO 32000-2) | 2.0 (2017) |
+
+---
+
+## Office Open XML (DOCX, XLSX, PPTX)
+
+Microsoft Office's format since 2007. A **ZIP archive** containing XML files, media, and relationships.
+
+### DOCX Structure
+
+```bash
+$ unzip -l document.docx
+ [Content_Types].xml ← MIME type registry
+ _rels/.rels ← Root relationships
+ word/document.xml ← Main document content
+ word/styles.xml ← Style definitions
+ word/settings.xml ← Document settings
+ word/fontTable.xml ← Font declarations
+ word/theme/theme1.xml ← Theme (colors, fonts)
+ word/media/image1.png ← Embedded images
+ word/_rels/document.xml.rels ← Document relationships
+ docProps/core.xml ← Dublin Core metadata
+ docProps/app.xml ← Application metadata
+```
+
+### Document.xml Content
+
+```xml
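+<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
+  <w:body>
+    <w:p>
+      <w:r>
+        <w:t>Introduction</w:t>
+      </w:r>
+    </w:p>
+  </w:body>
+</w:document>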
+```
+
+### XLSX Structure
+
+Spreadsheets split data across multiple XML files:
+
+```
+xl/worksheets/sheet1.xml ← Cell data (row/col/value)
+xl/sharedStrings.xml ← String table (cells reference by index)
+xl/styles.xml ← Number formats, fonts, fills
+xl/workbook.xml ← Sheet names, defined names
+```
+
+Cell values reference the shared strings table by index, so repeated strings are stored once:
+
+```xml
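+<!-- xl/sharedStrings.xml -->
+<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
+  <si><t>Name</t></si>
+  <si><t>Revenue</t></si>
+</sst>
+
+<!-- xl/worksheets/sheet1.xml -->
+<row r="1">
+  <c r="A1" t="s"><v>0</v></c>
+  <c r="B1" t="s"><v>1</v></c>
+  <c r="C1"><v>50000</v></c>
+</row>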
+```
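+
+Resolving those indices is a small lookup. The sketch below parses two inline XML strings with `xml.etree.ElementTree`; the sample markup is a cut-down assumption of what `sharedStrings.xml` and `sheet1.xml` contain, and `read_cells` is a name chosen for this example.
+
+```python
+import xml.etree.ElementTree as ET
+
+NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
+
+SHARED_STRINGS = """<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
+  <si><t>Name</t></si>
+  <si><t>Revenue</t></si>
+</sst>"""
+
+SHEET = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
+  <sheetData>
+    <row r="1">
+      <c r="A1" t="s"><v>0</v></c>
+      <c r="B1" t="s"><v>1</v></c>
+      <c r="C1"><v>50000</v></c>
+    </row>
+  </sheetData>
+</worksheet>"""
+
+
+def read_cells(sheet_xml: str, strings_xml: str) -> dict:
+    """Map cell references to values, resolving shared-string indices."""
+    strings = [
+        si.findtext("m:t", default="", namespaces=NS)
+        for si in ET.fromstring(strings_xml).findall("m:si", NS)
+    ]
+    cells = {}
+    for c in ET.fromstring(sheet_xml).iterfind(".//m:c", NS):
+        raw = c.findtext("m:v", namespaces=NS)
+        # t="s" marks the value as an index into the shared strings table.
+        cells[c.get("r")] = strings[int(raw)] if c.get("t") == "s" else float(raw)
+    return cells
+
+
+print(read_cells(SHEET, SHARED_STRINGS))
+```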
+
+---
+
+## OpenDocument Format (ODF)
+
+ISO standard, used by LibreOffice and other open-source suites. Also ZIP-based with XML content, but uses different schemas.
+
+| Office XML | ODF Equivalent |
+|-----------|----------------|
+| `.docx` | `.odt` (text) |
+| `.xlsx` | `.ods` (spreadsheet) |
+| `.pptx` | `.odp` (presentation) |
+
+Structure is similar: ZIP containing `content.xml`, `styles.xml`, `meta.xml`, `META-INF/manifest.xml`.
+
+---
+
+## EPUB
+
+The standard e-book format. Essentially a **website in a ZIP file** — XHTML pages + CSS + images + metadata.
+
+### EPUB Structure
+
+```bash
+$ unzip -l book.epub
+ mimetype ← Must be first, uncompressed: "application/epub+zip"
+ META-INF/container.xml ← Points to the OPF file
+ OEBPS/content.opf ← Package document (manifest + spine)
+ OEBPS/toc.ncx ← Table of contents (EPUB 2)
+ OEBPS/nav.xhtml ← Navigation document (EPUB 3)
+ OEBPS/chapter1.xhtml ← Content pages
+ OEBPS/chapter2.xhtml
+ OEBPS/styles/main.css ← Stylesheets
+ OEBPS/images/cover.jpg ← Images
+```
+
+### Key EPUB Files
+
+**content.opf** — The manifest and reading order:
+
+```xml
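+<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
+  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
+    <dc:title>The Great Novel</dc:title>
+    <dc:creator>Author Name</dc:creator>
+    <dc:language>en</dc:language>
+    <dc:identifier id="pub-id">isbn:978-0-123456-78-9</dc:identifier>
+  </metadata>
+
+  <manifest>
+    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
+    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
+    <item id="ch2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
+    <item id="css" href="styles/main.css" media-type="text/css"/>
+  </manifest>
+
+  <spine>
+    <itemref idref="ch1"/>
+    <itemref idref="ch2"/>
+  </spine>
+</package>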
+```
+
+**mimetype file**: The `mimetype` entry MUST be the first file in the ZIP, stored without compression and without an extra field, so that its content starts at byte offset 38 (30-byte local file header + 8-byte filename). This allows identification without full ZIP parsing:
+
+```
+Offset Content
+00000000 50 4B 03 04 ZIP local file header
+...
+0000001E 6D 69 6D 65 74 79 70 65 "mimetype"
+00000026 61 70 70 6C 69 63 61 74 "application/epub+zip"
+ 69 6F 6E 2F 65 70 75 62
+ 2B 7A 69 70
+```
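+
+The fixed-offset check is easy to try out. This sketch builds a tiny EPUB-like archive in memory with `zipfile` and tests the bytes at offsets 0, 30, and 38. The function name `looks_like_epub` is invented for the example, and the `<container/>` content is a placeholder, not a valid `container.xml`.
+
+```python
+import io
+import zipfile
+
+MIMETYPE = b"application/epub+zip"
+
+
+def looks_like_epub(data: bytes) -> bool:
+    """Check the fixed-offset signature without parsing the ZIP structure."""
+    return (
+        data[0:4] == b"PK\x03\x04"
+        and data[30:38] == b"mimetype"
+        and data[38:38 + len(MIMETYPE)] == MIMETYPE
+    )
+
+
+buf = io.BytesIO()
+with zipfile.ZipFile(buf, "w") as zf:
+    # The mimetype entry goes first and is stored (method 0), uncompressed.
+    zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
+                compress_type=zipfile.ZIP_STORED)
+    zf.writestr("META-INF/container.xml", "<container/>",
+                compress_type=zipfile.ZIP_DEFLATED)
+epub_bytes = buf.getvalue()
+
+print(looks_like_epub(epub_bytes))
+```
+
+An ordinary ZIP whose first entry is something else fails the check, because the bytes at offset 30 hold a different filename.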
+
+### EPUB 2 vs EPUB 3
+
+| Feature | EPUB 2 | EPUB 3 |
+|---------|--------|--------|
+| Content | XHTML 1.1 | HTML5 |
+| Styling | CSS 2.1 | CSS3 |
+| Navigation | NCX (XML) | nav.xhtml (HTML5) |
+| Scripting | No | JavaScript (limited) |
+| Audio/Video | No | HTML5 `