An automated system that converts Microsoft Word documents into an interactive web-based reader backed by JSON content. It can be adapted for any large document with a chapter/section structure.
The chapter-viewer directory is a fully self-contained React application that can be used independently:
- Extract and use separately - Copy the chapter-viewer folder to create your own book viewer
- Reusable for any book - Just provide your own JSON content
- No dependencies on parent project - All book data stored within the viewer directory
- Ready to deploy - Complete standalone web application
- Easy to customize - Modern React codebase with clear structure
# Copy the viewer to create your own project
cp -r chapter-viewer my-book-viewer
cd my-book-viewer
# Add your book content to book_content_json/
# (Follow the JSON format described in chapter-viewer/README.md)
# Install and run
pnpm install
pnpm dev

The viewer becomes a universal book reader - perfect for documentation, handbooks, manuals, or any structured content!
See chapter-viewer/README.md for detailed standalone usage instructions.
After building your book, you can distribute the complete viewer:
# Build your book
make build
# The chapter-viewer directory is now self-contained!
# Package it for distribution:
tar -czf my-book-viewer.tar.gz chapter-viewer/
# Or just copy it anywhere:
cp -r chapter-viewer /path/to/my-book-viewer
# Recipients can use it immediately:
cd my-book-viewer
pnpm install
pnpm dev

The chapter-viewer directory contains:
- All book content in book_content_json/
- All images in book_content_json/chapter_XX/pictures/
- Complete React application
- Ready to run with no external dependencies
This makes it perfect for:
- Distributing documentation as a web app
- Hosting on GitHub Pages, Netlify, Vercel
- Sharing as an offline viewer
- Creating multiple book viewers from one codebase

Key features:
- One-command build - Single make build converts entire document
- Smart chapter detection - Automatically identifies chapters and sections
- Image extraction - Extracts all images including WMF conversion
- Table processing - Preserves complex table structures including headers in cells
- Format preservation - Maintains bold, italic, fonts, alignment
- TOC validation - Automatically extracts and validates table of contents
- Dual output format - Generates both JSON and Markdown simultaneously
- Verification tools - Built-in integrity checking
- React web viewer - Responsive mobile-friendly interface
You MUST convert automatic numbering to fixed text before processing.
With automatic numbering, Word/LibreOffice keeps section numbers (like "3.1", "4.2.3") in the document's internal list structure rather than in the paragraph text, so they are invisible to the parser. This causes missing sections and failed TOC validation.
Quick Fix:
- LibreOffice: Select All → Format → Lists → No List → Save
- Word: Select All → Ctrl+Shift+N → Numbering → None → Save
See DOCUMENT_PREPARATION_GUIDE.md for detailed instructions, verification steps, and troubleshooting.
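
As a rough pre-flight check, the sketch below counts paragraphs that still carry list numbering. It is not part of the project's tooling: the function name `count_auto_numbered_paragraphs` is hypothetical, and it assumes automatic numbering shows up as a `w:numPr` element directly on the paragraph in `word/document.xml`. Numbering inherited from a paragraph style would not be caught, so treat a zero count as a hint rather than proof.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_auto_numbered_paragraphs(docx_path):
    """Count paragraphs that carry a direct w:numPr (list numbering) element."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    count = 0
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # w:numPr sits inside the paragraph properties element w:pPr
        if paragraph.find(f"{{{W_NS}}}pPr/{{{W_NS}}}numPr") is not None:
            count += 1
    return count
```

Calling `count_auto_numbered_paragraphs("original-book.docx")` after following the Quick Fix should ideally return 0.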
# 1. Install dependencies
make install-deps
# 2. Build the book
make build
# 3. Start the web viewer
make viewer

Open your browser to http://localhost:3000 to view the book.
- Python 3.8+ with python-docx
- ImageMagick 7+ - Image processing
- Ghostscript - PDF to PNG conversion
- LibreOffice - WMF to PDF conversion
- Node.js 16+ - Web viewer
macOS:
brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps

Linux (Ubuntu/Debian):
sudo apt-get install imagemagick ghostscript libreoffice python3-pip nodejs npm
make install-deps

macOS LibreOffice Setup:
If LibreOffice was installed via DMG (not Homebrew), run:
make setup-libreoffice

This creates a symlink so ImageMagick can access LibreOffice.
Word Document
↓
1. Extract TOC automatically
2. Extract chapters & sections with TOC validation
3. Parse text with formatting
4. Extract images (WMF → PNG)
5. Process tables (including headers in cells)
6. Build navigation index
↓
Interactive Web Viewer
- TOC Extraction - Automatically extracts Table of Contents from document
- Chapter Detection - Identifies chapters by N.0 headings (e.g., "1.0 Introduction")
- Section Parsing - Subdivides chapters into N.X sections (e.g., "1.1", "1.2")
- Content Extraction - Preserves formatting, images, tables, footnotes
- Table Cell Headers - Detects and processes section headers inside table cells
- WMF Conversion - Converts Windows Metafiles to PNG via LibreOffice → PDF → PNG
- Index Building - Creates navigation structure with statistics
make build # Build complete book content
make rebuild-all # Clean and rebuild from scratch
make clean # Remove generated files

make dev # Build and start viewer in one command
make viewer # Start chapter-viewer dev server
make status # Show current project status
make stats # Display content statistics

make check-deps # Verify all dependencies installed
make verify # Check image integrity and content

project-root/
├── build_book.py               # Main build system (JSON output)
├── verify_images.py            # Image verification tool
├── Makefile                    # Build automation
├── setup_libreoffice.sh        # LibreOffice configuration helper
├── requirements.txt            # Python dependencies
├── LICENSE                     # GPL-3.0 license
│
├── original-book.docx          # Source document (not in repo)
│
├── markdown_chapters/          # Markdown export (optional, not in repo)
│   ├── README.md               # Navigation index
│   └── chapter_XX/             # Chapter directories
│       ├── section_X_X.md      # Section content
│       └── pictures/           # Extracted images
│
└── chapter-viewer/             # STANDALONE React web application
    ├── book_content_json/      # Book data (self-contained!)
    │   ├── index.json          # Navigation index
    │   ├── toc_structure.json  # Table of contents
    │   └── chapter_XX/         # Chapter directories
    │       ├── chapter.json    # Chapter metadata
    │       ├── section_XX.json # Section content
    │       └── pictures/       # Chapter images
    ├── src/                    # React source code
    ├── public/
    │   └── book_content_json/  # Symlink to ../book_content_json/
    ├── package.json
    └── README.md               # Standalone usage guide
Each section file contains:
{
"chapter_number": 1,
"chapter_title": "1.0 FIRST CHAPTER",
"content": [
{
"type": "paragraph",
"index": 0,
"text": "Full paragraph text",
"runs": [
{"text": "Bold text", "bold": true, "font_size": 12.0}
],
"alignment": "LEFT (0)"
},
{
"type": "table",
"rows": 3,
"cols": 2,
"cells": [...]
}
],
"statistics": {
"paragraphs": 78,
"tables": 1,
"images": 10
}
}

Chapter detection handles both standard chapters (N.0 format) and appendix-style chapters (starting with N.1):
- Regular chapters: Start with N.0 heading (e.g., "1.0 Introduction")
- Appendix chapters: Start with N.1 section (e.g., "24.1 First Section")
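
The two chapter styles above can be told apart with a small amount of state. The following is a minimal sketch, not the logic in build_book.py: `HEADING_RE` and `classify_heading` are hypothetical names, and the rule "an N.1 heading opens a chapter only when no N.0 heading was seen for N" is an assumption about how appendix-style chapters might be recognized.

```python
import re

# Hypothetical pattern: dotted number followed by a title, e.g. "1.0 Introduction"
HEADING_RE = re.compile(r"^(\d+)\.(\d+)((?:\.\d+)*)\s+(\S.*)$")


def classify_heading(text, seen_chapters):
    """Return ("chapter" | "section", chapter_number), or None for non-headings.

    seen_chapters is a set that is updated as chapters are opened.
    """
    match = HEADING_RE.match(text.strip())
    if match is None:
        return None
    chapter = int(match.group(1))
    minor = int(match.group(2))
    deeper = match.group(3)  # non-empty for headings such as "4.2.3"
    if minor == 0 and not deeper:
        seen_chapters.add(chapter)
        return ("chapter", chapter)
    if chapter not in seen_chapters and minor == 1 and not deeper:
        # Appendix-style: the N.1 heading opens chapter N
        seen_chapters.add(chapter)
        return ("chapter", chapter)
    return ("section", chapter)
```

Because the decision depends on which chapters were already opened, headings need to be fed in document order.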
- Extracts entire Table of Contents
- Excludes TOC paragraphs from actual content
- Cross-validates TOC against actual content
- Generates detailed discrepancy report
- Uses actual content titles as source of truth
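
The cross-validation described above amounts to comparing two mappings from section number to title. Here is a hedged sketch of such a comparison; `compare_toc_to_content` and the shape of its report are illustrative assumptions, not the format the build actually prints.

```python
def compare_toc_to_content(toc_entries, content_headings):
    """Compare two dicts that map a section number ("3.1") to its title.

    Returns a discrepancy report. Content titles are reported alongside
    TOC titles so the content version can be kept as the source of truth.
    """
    report = {"missing_in_content": [], "missing_in_toc": [], "title_mismatch": []}
    for number, toc_title in toc_entries.items():
        if number not in content_headings:
            report["missing_in_content"].append(number)
        elif content_headings[number].strip().lower() != toc_title.strip().lower():
            report["title_mismatch"].append(
                {"number": number, "toc": toc_title, "content": content_headings[number]}
            )
    report["missing_in_toc"] = [n for n in content_headings if n not in toc_entries]
    return report
```

Titles are compared case-insensitively here so that an all-caps heading such as "1.0 FIRST CHAPTER" does not count as a mismatch; whether the real validator does the same is not stated.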
Automatically converts Windows Metafile images using the conversion chain:
WMF → LibreOffice → PDF → Ghostscript → PNG
This ensures all images are properly displayed in modern web browsers.
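
For readers who want to reproduce the chain by hand, the sketch below builds the two command lines involved. It assumes the `libreoffice` and `gs` binaries are on PATH and calls them directly; how build_book.py actually invokes the tools (for example through ImageMagick) is not shown here, and the function names and the 150 dpi default are illustrative choices.

```python
import subprocess
from pathlib import Path


def build_conversion_commands(wmf_path, out_dir, dpi=150):
    """Return the command lines for WMF -> PDF and PDF -> PNG (nothing is run)."""
    wmf_path, out_dir = Path(wmf_path), Path(out_dir)
    pdf_path = out_dir / (wmf_path.stem + ".pdf")
    png_path = out_dir / (wmf_path.stem + ".png")
    # Step 1: LibreOffice renders the metafile to PDF in out_dir
    to_pdf = ["libreoffice", "--headless", "--convert-to", "pdf",
              "--outdir", str(out_dir), str(wmf_path)]
    # Step 2: Ghostscript rasterizes the PDF to a 24-bit PNG
    to_png = ["gs", "-dBATCH", "-dNOPAUSE", "-dQUIET", "-sDEVICE=png16m",
              f"-r{dpi}", f"-sOutputFile={png_path}", str(pdf_path)]
    return to_pdf, to_png


def convert_wmf(wmf_path, out_dir, dpi=150):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    for command in build_conversion_commands(wmf_path, out_dir, dpi):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the chain easy to inspect or log before anything is converted.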
The system detects and processes section headers that appear inside table cells, maintaining proper chapter/section hierarchy even when headers are embedded in complex table layouts.
Edit build_book.py to customize:
INPUT_DOCX = "original-book.docx"
JSON_DIR = "chapter-viewer/book_content_json"
EXCEPTIONS_FILE = "conf/exceptions.conf"

The build system provides real-time feedback showing:
- Number of TOC entries extracted
- Chapters and sections detected
- Paragraphs and tables processed
- Images extracted and converted
- Build completion time
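
The paragraph, table, and image counts can also be recomputed from the generated JSON after a build. The sketch below is an independent illustration rather than the code behind make stats: `total_statistics` is a hypothetical helper, and it assumes section files follow the chapter_XX/section_XX.json layout and carry the "statistics" block shown in the section format above.

```python
import json
from collections import Counter
from pathlib import Path


def total_statistics(json_dir):
    """Sum the "statistics" block of every section_*.json under json_dir."""
    totals = Counter()
    for section_file in sorted(Path(json_dir).glob("chapter_*/section_*.json")):
        with open(section_file, encoding="utf-8") as handle:
            stats = json.load(handle).get("statistics", {})
        for key in ("paragraphs", "tables", "images"):
            totals[key] += stats.get(key, 0)
    return dict(totals)
```

For example, `total_statistics("chapter-viewer/book_content_json")` would return a dict with the three totals, which can be compared against the numbers printed during the build.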
# Check if LibreOffice is accessible
libreoffice --version
# If not found, configure it
make setup-libreoffice
# Rebuild
make rebuild-all

# Check image integrity
make verify
# If issues found, rebuild
make rebuild-all

# Check what's missing
make check-deps
# Install dependencies
make install-deps

# Clean and rebuild
make clean
make build
# Force browser refresh
# Chrome/Firefox: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)

The system automatically extracts the Table of Contents directly from the document during the build process:
# TOC is automatically extracted during build
make build
# TOC structure is extracted internally and used for validation

Features:
- Automatically extracts all TOC entries from document
- Handles extra spaces in numbering (e.g., "3. 1", "21. 2")
- Supports Unicode smart quotes in titles
- Filters out false positives (dosages, measurements)
- Validates section headers against TOC during parsing
- Handles headers inside table cells
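
Several of the features above come down to normalizing a candidate line before accepting it as a TOC entry. The following sketch shows one possible approach under stated assumptions: the unit list in `UNIT_RE`, the choice to map smart quotes to ASCII, and the name `parse_toc_line` are all hypothetical and may differ from what build_book.py does.

```python
import re

# Map Unicode smart quotes to their ASCII equivalents
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
# Hypothetical unit list used to reject lines such as "2.5 mg twice daily"
UNIT_RE = re.compile(r"^(mg|g|kg|ml|l|mm|cm|m)\b", re.IGNORECASE)
# Dotted number (spaces allowed after each dot), then the title
ENTRY_RE = re.compile(r"^(\d+(?:\.\s*\d+)+)\s+(.+?)\s*$")


def parse_toc_line(line):
    """Return (number, title) for a TOC-like line, or None if it is not one."""
    match = ENTRY_RE.match(line.strip().translate(SMART_QUOTES))
    if match is None:
        return None
    number = re.sub(r"\s+", "", match.group(1))  # "3. 1" becomes "3.1"
    title = match.group(2)
    if UNIT_RE.match(title):
        return None  # looks like a dosage or measurement, not a heading
    return number, title
```

A unit-based filter is only a heuristic; a title that genuinely begins with a unit abbreviation would be rejected as well.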
To process your own Word document:
- Prepare your document - Convert automatic numbering to fixed text (see DOCUMENT_PREPARATION_GUIDE.md)
- Place your .docx file in the project root
- Name the book "original-book.docx" or else update INPUT_DOCX in build_book.py
- Create conf/exceptions.conf if you have known numbering errors
- Run make rebuild-all
If your document has known numbering inconsistencies, create conf/exceptions.conf:
# Format: wrong_number = correct_number
10.7.7 = 10.7.5
10.7.8 = 10.7.6
21.4.3 = 21.2.3
The system will automatically correct these during parsing.
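
To make the file format concrete, here is a minimal sketch of reading such a file and applying it. The helper names `parse_exceptions` and `correct_number` are hypothetical, and treating everything after `#` as a comment is an assumption based on the example above.

```python
def parse_exceptions(text):
    """Parse "wrong_number = correct_number" lines into a dict."""
    corrections = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # strip comments, skip blanks
        if not line:
            continue
        wrong, separator, correct = line.partition("=")
        if not separator or not wrong.strip() or not correct.strip():
            raise ValueError(f"Malformed exception line: {raw_line!r}")
        corrections[wrong.strip()] = correct.strip()
    return corrections


def correct_number(number, corrections):
    """Return the corrected section number, or the input if no rule applies."""
    return corrections.get(number, number)
```

With the example file above, `correct_number("10.7.7", corrections)` yields "10.7.5", while numbers without a rule pass through unchanged.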
After build, check:
- Console output shows TOC extraction and validation statistics
- Build process reports number of TOC entries extracted and numbered entries found
- DOCUMENT_PREPARATION_GUIDE.md - START HERE - Document preparation (convert automatic numbering)
- WMF_CONVERSION_GUIDE.md - Image conversion guide
- MARKDOWN_GENERATION.md - Markdown output feature guide
- chapter-viewer/README.md - Web viewer documentation
- CONTRIBUTING.md - Contribution guidelines
- build_book.py - Main build system with integrated TOC extraction
- verify_images.py - Image verification tool
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run make verify to check integrity
- Submit a pull request
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
This means you can:
- Use commercially
- Modify the code
- Distribute
- Use privately
Under the conditions:
- Disclose source
- License and copyright notice
- Same license for derivatives
- State changes made
See LICENSE file for full details.
- python-docx - Word document parsing
- ImageMagick - Image processing
- LibreOffice - Document conversion
- React - Web viewer interface
- Vite - Build tooling
For issues, questions, or suggestions:
- Check the troubleshooting section above
- Review existing issues on GitHub
- Create a new issue with:
  - System information (OS, Python version, etc.)
  - Output of make check-deps
  - Error messages or unexpected behavior
  - Steps to reproduce
Potential future enhancements:
- Support for more document formats (PDF, EPUB input)
- Full-text search in viewer
- Export to EPUB/PDF from JSON
- Image optimization options
- Multi-language support
- Cloud deployment guides
- Docker containerization
Note: This repository does not include source Word documents or generated content. You'll need to provide your own document to process.