Universal eBook → Markdown converter and cleaner. Handles all formats, all artifacts, all chapter styles automatically.
Transform your entire eBook library into clean, readable Markdown files with a single command. allmark intelligently strips away the cruft—frontmatter, backmatter, headers, footers, page numbers, and metadata—leaving only the pure narrative content.
- 📚 Universal Format Support: Convert 40+ formats to clean Markdown (10 verified: EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
- 🧹 Intelligent Cleaning: Automatically removes frontmatter, backmatter, headers, footers, page numbers
- 🔧 OCR Repair: Fixes broken hyphenation, ligatures, and common OCR artifacts
- 📖 Chapter Detection: Standardizes chapter markers across different formats
- 🎯 Artifact Removal: Strips ebook metadata, CSS classes, Calibre IDs, and other cruft
- 🛡️ Safety First: Never removes more than 50% of content (built-in safety check)
- 📊 Progress Tracking: SQLite database logs all conversions with statistics
- 📄 JSONL Export: Token-based text chunking for ML/AI training datasets
- 🎛️ Flexible Splitting: Paragraph-aware or strict token boundary splitting
- 🏷️ Custom Metadata: Add arbitrary metadata to JSONL records
- Statistical Analysis: Uses document structure analysis to intelligently identify and remove non-content sections
- Dialogue-Aware: Preserves paragraph breaks in dialogue while merging broken narrative paragraphs
- Format Agnostic: Same great results whether your source is a scanned PDF or a modern EPUB
- Zero Configuration: Works out of the box with sensible defaults
- Batch Processing: Convert entire libraries with a single command
- ML-Ready Output: Direct JSONL export with configurable chunk sizes for training datasets
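The paragraph-aware splitting mentioned above can be pictured as greedy packing of whole paragraphs up to the token budget. A minimal sketch, assuming whitespace tokenization (allmark's actual tokenizer and function names may differ):

```python
import re

def chunk_paragraph_aware(text: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Illustrative sketch: tokens are approximated as whitespace-separated
    words, which is not necessarily how allmark counts them.
    """
    chunks, current, current_len = [], [], 0
    for para in re.split(r"\n\s*\n", text.strip()):
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A single oversize paragraph stays whole: paragraph-aware mode
        # tolerates overshoot rather than breaking mid-paragraph.
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Strict splitting (`--strict-split`) would instead cut at exact token boundaries, ignoring the paragraph structure this sketch preserves.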
```
pip install git+https://github.com/dcondrey/allmark.git
```

Using pip:

```
git clone https://github.com/dcondrey/allmark.git
cd allmark
pip install -e .
```

Using Poetry:
```
git clone https://github.com/dcondrey/allmark.git
cd allmark
poetry install
poetry shell
```

Using Conda:
```
git clone https://github.com/dcondrey/allmark.git
cd allmark
conda env create -f environment.yml
conda activate allmark
```

allmark has zero Python dependencies - it uses only the Python stdlib!
| Tool | Purpose | Required? |
|---|---|---|
| pandoc | EPUB, DOCX converter | ✅ Yes |
| pdftotext (poppler) | PDF text extraction | ✅ Yes |
| ebook-convert (Calibre) | FB2, MOBI fallback | Optional |
PDF Extraction:
- Uses pdftotext with `-layout` mode (preserves formatting)
- Falls back to `-raw` mode if layout fails
- Final fallback to ebook-convert if both fail
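That fallback chain amounts to trying each extractor in order and keeping the first non-empty result. A sketch under that assumption (the injectable `run` parameter and both function names are purely illustrative, not allmark's real API):

```python
import subprocess

def pdf_fallback_commands(pdf_path: str) -> list[list[str]]:
    """Ordered argv lists for the extraction attempts described above."""
    return [
        ["pdftotext", "-layout", pdf_path, "-"],  # preserve page layout
        ["pdftotext", "-raw", pdf_path, "-"],     # raw content-stream order
        # Final fallback; real code would read the file ebook-convert writes
        # rather than expecting text on stdout.
        ["ebook-convert", pdf_path, pdf_path + ".txt"],
    ]

def extract_pdf_text(pdf_path: str, run=subprocess.run) -> str:
    """Return output of the first command that succeeds with non-empty text."""
    for argv in pdf_fallback_commands(pdf_path):
        try:
            result = run(argv, capture_output=True, text=True, check=True)
        except (OSError, subprocess.CalledProcessError):
            continue  # tool missing or failed: try the next extractor
        if result.stdout and result.stdout.strip():
            return result.stdout
    raise RuntimeError(f"all extraction methods failed for {pdf_path}")
```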
macOS (Homebrew):

```
brew install pandoc poppler
brew install --cask calibre  # optional
```

Ubuntu/Debian:

```
sudo apt-get install pandoc poppler-utils
sudo apt-get install calibre  # optional
```

Windows (Chocolatey):

```
choco install pandoc poppler
choco install calibre  # optional
```

```
allmark
# or
allmark --help
```

```
# Convert all ebooks in a directory (with intelligent cleaning)
allmark --in /path/to/ebooks

# Output goes to same directory by default
# Verified formats: .epub, .html, .docx, .pdf, .txt, .md, .rtf, .odt, .tex, .rst
# Additional (with Calibre): .mobi, .azw3, .kf8, .fb2, .djvu
```

📚 Convert entire library to Markdown
```
allmark --in ~/Books --out ~/Books-Markdown
```

🤖 Create ML training dataset with JSONL
```
# Convert to JSONL with 1024 token chunks
allmark --in ./books --jsonl --token-size 1024

# With custom metadata for training
allmark --in ./books --jsonl --metadata ./book_info.json
```

Example book_info.json:
```json
{
  "genre": "science_fiction",
  "language": "en",
  "dataset": "training_v1"
}
```

📄 Convert without cleaning (preserve everything)
```
allmark --in ./books --no-strip
# Keeps: frontmatter, backmatter, headers, footers, page numbers, metadata
```

⚡ Strict token splitting for exact chunk sizes
```
allmark --in ./books --jsonl --token-size 512 --strict-split
# Splits at exact token boundaries, ignoring paragraph breaks
```

| Option | Description | Default |
|---|---|---|
| `--in, --input <dir>` | Input directory containing ebook files | Required |
| `--out, --output <dir>` | Output directory for markdown files | Same as `--in` |
| `--no-strip` | Skip cleaning (preserve all content) | Cleaning enabled |
| `--force` | Force reconversion of existing files | Skip existing |
| `--no-clean-md` | Skip cleaning existing .md files | Clean .md files |
| `--db <path>` | Conversion log database path | ./conversion_log.db |
| `--jsonl` | Also create JSONL output with chunks | Markdown only |
| `--token-size <n>` | Max tokens per JSONL chunk | 512 |
| `--strict-split` | Split at exact token boundaries | Paragraph-aware |
| `--metadata <file>` | JSON file with custom metadata for JSONL | None |
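The `--jsonl` plus `--metadata` combination boils down to merging each chunk's bookkeeping fields with the user's metadata dict. A sketch of that merge (the `write_jsonl` function and its signature are hypothetical; only the record field names follow this README):

```python
import json

def write_jsonl(chunks, source_file, markdown_file, metadata=None, path="out.jsonl"):
    """Write one JSON record per (text, token_count) chunk, merging custom metadata."""
    metadata = metadata or {}
    with open(path, "w", encoding="utf-8") as f:
        for i, (text, token_count) in enumerate(chunks):
            record = {
                "text": text,
                "chunk_index": i,
                "total_chunks": len(chunks),
                "token_count": token_count,
                "source_file": source_file,
                "markdown_file": markdown_file,
                "split_mode": "paragraph_aware",
                **metadata,  # e.g. genre/language/dataset from --metadata
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the metadata dict is unpacked last, its keys would override the per-chunk fields on a name clash; real code would likely guard against that.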
```
# Example 1: Basic conversion with cleaning
allmark --in ./ebooks

# Example 2: Separate output directory
allmark --in ./source-books --out ./clean-markdown

# Example 3: Raw conversion (no cleaning)
allmark --in ./books --no-strip

# Example 4: Force reconversion
allmark --in ./books --force

# Example 5: Create ML training dataset
allmark --in ./books --jsonl --token-size 1024 --metadata ./metadata.json

# Example 6: Custom everything
allmark --in ./books --out ./md --db ~/conversion.db --force
```

When using --jsonl, each record contains:
```jsonc
{
  "text": "Chunk of narrative text...",
  "chunk_index": 0,
  "total_chunks": 25,
  "token_count": 487,
  "source_file": "book.epub",
  "markdown_file": "book.md",
  "split_mode": "paragraph_aware",
  // ... plus any custom metadata from --metadata file
  "genre": "fiction",
  "language": "en"
}
```

allmark processes files through a comprehensive pipeline:
- Format Conversion: Uses pandoc/pdftotext to convert to markdown
- OCR Repair: Fixes broken hyphens, ligatures, soft hyphens
- Artifact Removal: Strips images, links, CSS classes, ebook metadata
- Code Block Detection: Removes non-literary code/markup blocks
- Header/Footer Removal: Statistical detection of repeating elements
- Page Number Removal: Multiple pattern matching
- TOC Removal: Detects and removes table of contents
- Document Analysis: Understands prose density and narrative structure
- Frontmatter/Backmatter Trimming: Removes copyright pages, author bios, etc.
- Chapter Standardization: Normalizes chapter markers to `# Chapter N`
- Typography Normalization: Fixes quotes, dashes, ellipses
- Markdown Validation: Ensures proper markdown formatting
- Paragraph Merging: Intelligently rejoins broken paragraphs
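Conceptually, the pipeline above is function composition with a guard. A minimal sketch of staged cleaning plus the 50% safety check (shown here per stage; whether allmark applies the check per stage or over the whole document is an assumption):

```python
def run_pipeline(text, stages):
    """Apply cleaning stages in order; skip any stage whose output
    drops below half of its input (the safety check described above)."""
    for stage in stages:
        cleaned = stage(text)
        if len(cleaned) < 0.5 * len(text):
            continue  # stage removed too much: keep the previous text
        text = cleaned
    return text
```

Each stage is a plain `str -> str` function (OCR repair, header removal, and so on), so stages can be tested and reordered independently.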
```
allmark/
├── src/
│   └── allmark/
│       ├── __init__.py     # Package initialization
│       ├── __main__.py     # CLI entry point
│       ├── cli.py          # Command-line interface
│       ├── converter.py    # Main conversion logic
│       ├── cleaners.py     # Text cleaning functions
│       ├── analyzers.py    # Document analysis
│       ├── ocr.py          # OCR artifact repair
│       └── utils.py        # Utility functions
├── setup.py                # pip installation
├── pyproject.toml          # Modern Python packaging
├── environment.yml         # Conda environment
└── README.md               # This file
```
```
# Clone the repository
git clone https://github.com/dcondrey/allmark.git
cd allmark

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# OR: Install with pinned dev dependencies for reproducible environment
pip install -r requirements-dev.txt
pip install -e .
```

```
pytest
pytest --cov=allmark  # with coverage
```

```
black src/
flake8 src/
mypy src/
```

Contributions are welcome! Here's how you can help:
- Report bugs: Open an issue with details and reproduction steps
- Suggest features: Share your ideas via GitHub issues
- Submit PRs: Fork, create a feature branch, and submit a pull request
- Improve docs: Help make the documentation clearer
See Development Guide for setup instructions.
MIT License - see LICENSE file for details.
Copyright (c) 2025 David Condrey
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code documentation
Built with:
- Pandoc - Universal document converter
- Poppler - PDF rendering and text extraction
- Python standard library - Zero Python dependencies!
- Python Dependencies: 0 (pure stdlib!)
- Verified Formats: 10 formats (EPUB, HTML, DOCX, PDF, TXT, MD, RTF, ODT, LaTeX, RST)
- Additional Formats: 30+ with Calibre (MOBI, AZW3, KF8, DjVu, legacy formats)
- Cleaning Stages: 17-stage intelligent pipeline
- Safety Checks: Never removes >50% of content
- Output Formats: Markdown, JSONL
- Test Coverage: Coming soon!
These formats work out-of-the-box with just Pandoc + poppler-utils:
- EPUB (.epub, .epub3) - Modern ebooks
- HTML (.html, .htm, .xhtml) - Web pages
- DOCX (.docx) - Microsoft Word 2007+
- PDF (.pdf) - Portable documents
- TXT/MD (.txt, .text, .md) - Plain text
- RTF (.rtf) - Rich text format
- ODT (.odt) - LibreOffice documents
- LaTeX (.tex, .latex) - Academic documents
- RST (.rst) - Python documentation
Requires `brew install calibre` or `apt install calibre`:
- MOBI (.mobi) - Mobipocket/Kindle
- AZW3/KF8 (.azw3, .kf8) - Amazon Kindle
- FB2 (.fb2) - FictionBook (Russian format)
- DjVu (.djvu) - Scanned documents (also needs djvulibre)
Implemented but untested (require Calibre):
- Microsoft Reader (.lit), Sony Reader (.lrf), Palm (.pdb, .pml, .prc)
- RocketBook (.rb), TomeRaider (.tcr), XPS (.xps)
- And 15+ other obsolete formats from the 2000s
Total: 40+ formats supported in code, 10 verified working, 15 example files
See examples/ directory for test files in 15 different formats!
Made with ❤️ for book lovers and data scientists