Document Conversion System

An automated system that converts Microsoft Word documents into an interactive web-based reader backed by JSON content. It can be adapted to any large document with a chapter/section structure.

🎯 Standalone Chapter Viewer

The chapter-viewer directory is a fully self-contained React application that can be used independently:

✅ Extract and use separately - Copy the chapter-viewer folder to create your own book viewer
✅ Reusable for any book - Just provide your own JSON content
✅ No dependencies on parent project - All book data stored within the viewer directory
✅ Ready to deploy - Complete standalone web application
✅ Easy to customize - Modern React codebase with clear structure

Using the Viewer Standalone

# Copy the viewer to create your own project
cp -r chapter-viewer my-book-viewer
cd my-book-viewer

# Add your book content to book_content_json/
# (Follow the JSON format described in chapter-viewer/README.md)

# Install and run
pnpm install
pnpm dev

The viewer becomes a universal book reader - perfect for documentation, handbooks, manuals, or any structured content!

See chapter-viewer/README.md for detailed standalone usage instructions.

Distributing Your Book as a Standalone Viewer

After building your book, you can distribute the complete viewer:

# Build your book
make build

# The chapter-viewer directory is now self-contained!
# Package it for distribution:
tar -czf my-book-viewer.tar.gz chapter-viewer/

# Or just copy it anywhere:
cp -r chapter-viewer /path/to/my-book-viewer

# Recipients can use it immediately
# (the directory is named chapter-viewer if unpacked from the tarball):
cd my-book-viewer
pnpm install
pnpm dev

The chapter-viewer directory contains:

✅ All book content in book_content_json/
✅ All images in book_content_json/chapter_XX/pictures/
✅ Complete React application
✅ Ready to run with nothing needed from the parent project (only pnpm install)

This makes it perfect for:

📦 Distributing documentation as a web app
🌐 Hosting on GitHub Pages, Netlify, Vercel
💿 Sharing as an offline viewer
📚 Creating multiple book viewers from one codebase

Features

🚀 One-command build - A single make build converts the entire document
📚 Smart chapter detection - Automatically identifies chapters and sections
🖼️ Image extraction - Extracts all images including WMF conversion
📊 Table processing - Preserves complex table structures including headers in cells
🎨 Format preservation - Maintains bold, italic, fonts, alignment
✅ TOC validation - Automatically extracts and validates table of contents
📝 Dual output format - Generates both JSON and Markdown simultaneously
🔍 Verification tools - Built-in integrity checking
📱 React web viewer - Responsive mobile-friendly interface

⚠️ Document Preparation (CRITICAL FIRST STEP!)

You MUST convert automatic numbering to fixed text before processing.

Automatic numbering in Word/LibreOffice stores section numbers (like "3.1", "4.2.3") as list formatting in the document's internal structure rather than in the paragraph text. Headings then appear unnumbered to the parser, which causes missing sections and failed TOC validation.

Quick Fix:

LibreOffice: Select All → Format → Lists → No List → Save
Word: Select All → Ctrl+Shift+N → Numbering → None → Save

📖 See DOCUMENT_PREPARATION_GUIDE.md for detailed instructions, verification steps, and troubleshooting.
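
A quick way to confirm that preparation worked is to check that headings carry their numbers as literal text. The sketch below is an illustration only and not part of the project: `count_numbered` and the sample strings are hypothetical, and it assumes the paragraph text has already been read out of the document (for example with python-docx).

```python
import re

# Matches a leading section number such as "3.1" or "4.2.3" followed by a title.
NUMBER_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)+\s+\S")


def count_numbered(paragraphs):
    """Return how many paragraph strings start with a literal section number."""
    return sum(1 for text in paragraphs if NUMBER_PREFIX.match(text))


# Hypothetical heading text before preparation: numbers are list formatting,
# so they are absent from the text itself.
before = ["Introduction", "Scope", "Definitions"]
# Hypothetical heading text after preparation: numbers are part of the text.
after = ["1.0 Introduction", "1.1 Scope", "1.2 Definitions"]

print(count_numbered(before), count_numbered(after))
```

If the count stays at zero for headings you know are numbered, the document most likely still uses automatic numbering.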

Quick Start

# 1. Install dependencies
make install-deps

# 2. Build the book
make build

# 3. Start the web viewer
make viewer

Open your browser to http://localhost:3000 to view the book.

System Requirements

Required Dependencies

Python 3.8+ with python-docx
ImageMagick 7+ - Image processing
Ghostscript - PDF to PNG conversion
LibreOffice - WMF to PDF conversion
Node.js 16+ - Web viewer

Installation

macOS:

brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps

Linux (Ubuntu/Debian):

sudo apt-get install imagemagick ghostscript libreoffice python3-pip nodejs npm
make install-deps

macOS LibreOffice Setup:

If LibreOffice was installed via DMG (not Homebrew), run:

make setup-libreoffice

This creates a symlink so ImageMagick can access LibreOffice.

What It Does

Build Pipeline

Word Document
    ↓
1. Extract TOC automatically
2. Extract chapters & sections with TOC validation
3. Parse text with formatting
4. Extract images (WMF → PNG)
5. Process tables (including headers in cells)
6. Build navigation index
    ↓
Interactive Web Viewer

Detailed Steps

TOC Extraction - Automatically extracts Table of Contents from document
Chapter Detection - Identifies chapters by N.0 headings (e.g., "1.0 Introduction")
Section Parsing - Subdivides chapters into N.X sections (e.g., "1.1", "1.2")
Content Extraction - Preserves formatting, images, tables, footnotes
Table Cell Headers - Detects and processes section headers inside table cells
WMF Conversion - Converts Windows Metafiles to PNG via LibreOffice → PDF → PNG
Index Building - Creates navigation structure with statistics

Usage

Build Commands

make build           # Build complete book content
make rebuild-all     # Clean and rebuild from scratch
make clean           # Remove generated files

Development Commands

make dev             # Build and start viewer in one command
make viewer          # Start chapter-viewer dev server
make status          # Show current project status
make stats           # Display content statistics

Verification Commands

make check-deps      # Verify all dependencies installed
make verify          # Check image integrity and content

Project Structure

project-root/
├── build_book.py                    # Main build system (JSON output)
├── verify_images.py                 # Image verification tool
├── Makefile                         # Build automation
├── setup_libreoffice.sh             # LibreOffice configuration helper
├── requirements.txt                 # Python dependencies
├── LICENSE                          # GPL-3.0 license
│
├── original-book.docx               # Source document (not in repo)
│
├── markdown_chapters/               # Markdown export (optional, not in repo)
│   ├── README.md                    # Navigation index
│   └── chapter_XX/                  # Chapter directories
│       ├── section_X_X.md           # Section content
│       └── pictures/                # Extracted images
│
└── chapter-viewer/                  # STANDALONE React web application
    ├── book_content_json/           # Book data (self-contained!)
    │   ├── index.json               # Navigation index
    │   ├── toc_structure.json       # Table of contents
    │   └── chapter_XX/              # Chapter directories
    │       ├── chapter.json         # Chapter metadata
    │       ├── section_XX.json      # Section content
    │       └── pictures/            # Chapter images
    ├── src/                         # React source code
    ├── public/
    │   └── book_content_json/       # Symlink to ../book_content_json/
    ├── package.json
    └── README.md                    # Standalone usage guide
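
The layout above can be explored with a few lines of Python. This is a minimal sketch that relies only on the directory and file naming shown in the tree; `list_sections` is a hypothetical helper, and the temporary directory stands in for a real chapter-viewer/book_content_json/.

```python
import tempfile
from pathlib import Path


def list_sections(content_root):
    """Map each chapter directory name to its sorted section file names."""
    layout = {}
    for chapter_dir in sorted(Path(content_root).glob("chapter_*")):
        if chapter_dir.is_dir():
            layout[chapter_dir.name] = sorted(
                p.name for p in chapter_dir.glob("section_*.json")
            )
    return layout


# Build a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    chapter = Path(tmp) / "chapter_01"
    chapter.mkdir()
    for name in ("chapter.json", "section_01.json", "section_02.json"):
        (chapter / name).write_text("{}")
    layout = list_sections(tmp)

print(layout)
```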

Output Format

JSON Structure

Each section file contains:

{
  "chapter_number": 1,
  "chapter_title": "1.0 FIRST CHAPTER",
  "content": [
    {
      "type": "paragraph",
      "index": 0,
      "text": "Full paragraph text",
      "runs": [
        {"text": "Bold text", "bold": true, "font_size": 12.0}
      ],
      "alignment": "LEFT (0)"
    },
    {
      "type": "table",
      "rows": 3,
      "cols": 2,
      "cells": [...]
    }
  ],
  "statistics": {
    "paragraphs": 78,
    "tables": 1,
    "images": 10
  }
}
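
A section file in this shape can be consumed with the standard json module. The sketch below uses a trimmed copy of the example above (the statistics and cells are shortened for illustration); `count_blocks` and `bold_texts` are hypothetical helpers, not functions from the project.

```python
import json
from collections import Counter

# Trimmed sample in the documented shape; values are illustrative.
SECTION_JSON = """
{
  "chapter_number": 1,
  "chapter_title": "1.0 FIRST CHAPTER",
  "content": [
    {"type": "paragraph", "index": 0, "text": "Full paragraph text",
     "runs": [{"text": "Bold text", "bold": true, "font_size": 12.0}],
     "alignment": "LEFT (0)"},
    {"type": "table", "rows": 3, "cols": 2, "cells": []}
  ],
  "statistics": {"paragraphs": 1, "tables": 1, "images": 0}
}
"""


def count_blocks(section):
    """Count content blocks by their "type" field."""
    return Counter(block["type"] for block in section["content"])


def bold_texts(section):
    """Collect the text of every run marked bold in paragraph blocks."""
    return [
        run["text"]
        for block in section["content"]
        for run in block.get("runs", [])
        if run.get("bold")
    ]


section = json.loads(SECTION_JSON)
print(count_blocks(section), bold_texts(section))
```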

Key Features

Smart Chapter Detection

Handles both standard chapters (N.0 format) and appendix-style chapters (starting with N.1):

Regular chapters: Start with N.0 heading (e.g., "1.0 Introduction")
Appendix chapters: Start with N.1 section (e.g., "24.1 First Section")
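
The two heading styles can be told apart with a regular expression. The following is a minimal sketch under the assumption that headings are plain strings; `classify_heading` and `group_chapters` are hypothetical names, and the real detection in build_book.py may work differently.

```python
import re

# chapter number, first sub-number, any deeper levels, then the title
HEADING = re.compile(r"^(\d+)\.(\d+)((?:\.\d+)*)\s+(.+)$")


def classify_heading(text):
    """Return heading details, or None if the text is not a numbered heading."""
    m = HEADING.match(text.strip())
    if not m:
        return None
    chapter, minor, rest, title = m.groups()
    # "N.0" with no deeper level opens a regular chapter; anything else is a section.
    kind = "chapter" if minor == "0" and not rest else "section"
    return {
        "kind": kind,
        "chapter": int(chapter),
        "number": f"{chapter}.{minor}{rest}",
        "title": title,
    }


def group_chapters(headings):
    """Group headings by chapter number; appendix-style chapters have no N.0 title."""
    chapters = {}
    for text in headings:
        info = classify_heading(text)
        if info is None:
            continue
        entry = chapters.setdefault(info["chapter"], {"title": None, "sections": []})
        if info["kind"] == "chapter":
            entry["title"] = info["title"]
        else:
            entry["sections"].append(info["number"])
    return chapters


chapters = group_chapters(
    ["1.0 Introduction", "1.1 Scope", "24.1 First Section", "Plain text"]
)
print(chapters)
```

In this sketch an appendix-style chapter simply ends up with a title of None, since its first heading is already a section.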

TOC Validation System

Extracts entire Table of Contents
Excludes TOC paragraphs from actual content
Cross-validates TOC against actual content
Generates detailed discrepancy report
Uses actual content titles as source of truth
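
The cross-validation step can be pictured as a comparison of two mappings from section number to title. This is a sketch under that assumption only; `validate_toc` and the report keys are hypothetical and do not describe the project's actual report format.

```python
def validate_toc(toc, content):
    """Compare TOC entries with headings found in the document body.

    Both arguments map a section number (e.g. "3.1") to a title.
    Returns a discrepancy report and the resolved titles, where the
    body's titles win over the TOC's.
    """
    report = {"missing_in_content": [], "missing_in_toc": [], "title_mismatch": []}
    for number, toc_title in toc.items():
        if number not in content:
            report["missing_in_content"].append(number)
        elif content[number].strip().lower() != toc_title.strip().lower():
            report["title_mismatch"].append((number, toc_title, content[number]))
    report["missing_in_toc"] = [n for n in content if n not in toc]
    # Actual content titles are treated as the source of truth.
    resolved = dict(content)
    return report, resolved


toc = {"1.0": "Introduction", "1.1": "Scope", "1.2": "Terms"}
content = {"1.0": "Introduction", "1.1": "Scope and Purpose", "1.3": "Notes"}
report, resolved = validate_toc(toc, content)
print(report)
```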

WMF Image Conversion

Automatically converts Windows Metafile images using the conversion chain:

WMF → LibreOffice → PDF → Ghostscript → PNG

This ensures all images are properly displayed in modern web browsers.
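
For illustration, the chain can be written as two command invocations. This sketch calls LibreOffice and Ghostscript directly, which is an assumption: the project may drive them through ImageMagick instead. `build_commands`, `convert`, the 150 dpi default, and the png16m output device are choices made for this example.

```python
import subprocess
from pathlib import Path


def build_commands(wmf_path, out_dir, dpi=150):
    """Return the two command lines for WMF -> PDF -> PNG (hypothetical helper)."""
    wmf = Path(wmf_path)
    out = Path(out_dir)
    pdf = out / (wmf.stem + ".pdf")
    png = out / (wmf.stem + ".png")
    # Step 1: LibreOffice in headless mode renders the WMF to a PDF in out_dir.
    to_pdf = [
        "libreoffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out), str(wmf),
    ]
    # Step 2: Ghostscript rasterizes the PDF to a 24-bit PNG.
    to_png = [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER", "-sDEVICE=png16m",
        f"-r{dpi}", f"-sOutputFile={png}", str(pdf),
    ]
    return to_pdf, to_png


def convert(wmf_path, out_dir):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in build_commands(wmf_path, out_dir):
        subprocess.run(cmd, check=True)


to_pdf, to_png = build_commands("pictures/image1.wmf", "out")
print(" ".join(to_pdf))
print(" ".join(to_png))
```

Keeping command construction separate from execution makes the chain easy to inspect without the tools installed.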

Table Cell Header Support

The system detects and processes section headers that appear inside table cells, maintaining proper chapter/section hierarchy even when headers are embedded in complex table layouts.

Configuration

Edit build_book.py to customize:

INPUT_DOCX = "original-book.docx"
JSON_DIR = "chapter-viewer/book_content_json"
EXCEPTIONS_FILE = "conf/exceptions.conf"

Build Process

The build system provides real-time feedback showing:

Number of TOC entries extracted
Chapters and sections detected
Paragraphs and tables processed
Images extracted and converted
Build completion time

Troubleshooting

WMF Images Not Converting

# Check if LibreOffice is accessible
libreoffice --version

# If not found, configure it
make setup-libreoffice

# Rebuild
make rebuild-all

Images Not Loading in Viewer

# Check image integrity
make verify

# If issues found, rebuild
make rebuild-all
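
A manual spot check is also possible. The sketch below flags .png files that do not start with the standard PNG signature; it is an illustration and not a description of what verify_images.py does. `find_bad_pngs` is a hypothetical helper, and the temporary directory stands in for a chapter's pictures/ folder.

```python
import tempfile
from pathlib import Path

# Every valid PNG file begins with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_bad_pngs(pictures_dir):
    """Return names of .png files that lack the PNG signature (or are empty)."""
    bad = []
    for path in sorted(Path(pictures_dir).glob("*.png")):
        with path.open("rb") as handle:
            if handle.read(8) != PNG_SIGNATURE:
                bad.append(path.name)
    return bad


# Demonstrate on a throwaway directory with one plausible and one broken file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ok.png").write_bytes(PNG_SIGNATURE + b"rest-of-file")
    (Path(tmp) / "broken.png").write_bytes(b"")
    bad = find_bad_pngs(tmp)

print(bad)
```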

Build Fails with Missing Dependencies

# Check what's missing
make check-deps

# Install dependencies
make install-deps

Content Not Updating

# Clean and rebuild
make clean
make build

# Force browser refresh
# Chrome/Firefox: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)

Advanced Usage

TOC Extraction

The system automatically extracts the Table of Contents directly from the document during the build process:

# TOC is automatically extracted during build
make build
# TOC structure is extracted internally and used for validation

Features:

✅ Automatically extracts all TOC entries from document
✅ Handles extra spaces in numbering (e.g., "3. 1", "21. 2")
✅ Supports Unicode smart quotes in titles
✅ Filters out false positives (dosages, measurements)
✅ Validates section headers against TOC during parsing
✅ Handles headers inside table cells
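
Two of these behaviors, tolerating extra spaces in numbering and rejecting measurement-like false positives, can be sketched with regular expressions. `parse_toc_line` and the unit list are hypothetical; the project's actual filtering rules are not documented here.

```python
import re

# A dotted number that may contain stray spaces after dots, then a title.
ENTRY = re.compile(r"^(\d+(?:\.\s*\d+)+)\s+(.*)$")
# Illustrative unit list: a "title" starting with a unit is treated as a measurement.
UNIT = re.compile(r"^(?:mg|mcg|ml|g|kg|mm|cm)\b", re.IGNORECASE)


def parse_toc_line(line):
    """Return (number, title) for a TOC-like line, or None if it is not an entry."""
    m = ENTRY.match(line.strip())
    if not m:
        return None
    number = re.sub(r"\s+", "", m.group(1))  # "3. 1" -> "3.1"
    title = m.group(2).strip()
    if UNIT.match(title):
        return None  # e.g. "2.5 mg daily" is a dosage, not a section
    return number, title


for sample in ["3. 1 Overview", "21. 2 “Quoted” Title", "2.5 mg daily", "Introduction"]:
    print(repr(sample), "->", parse_toc_line(sample))
```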

Custom Document Processing

To process your own Word document:

Prepare your document - Convert automatic numbering to fixed text (see DOCUMENT_PREPARATION_GUIDE.md)
Place your .docx file in the project root
Name the file "original-book.docx", or update INPUT_DOCX in build_book.py to match your file name
Create conf/exceptions.conf if you have known numbering errors
Run make rebuild-all

Exception Handling

If your document has known numbering inconsistencies, create conf/exceptions.conf:

# Format: wrong_number = correct_number
10.7.7 = 10.7.5
10.7.8 = 10.7.6
21.4.3 = 21.2.3

The system will automatically correct these during parsing.
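
The file format is simple enough to parse by hand. The sketch below shows one way to read it and apply a correction; `parse_exceptions` and `correct_number` are hypothetical helpers, not the functions used by build_book.py.

```python
def parse_exceptions(text):
    """Parse "wrong_number = correct_number" lines, ignoring comments and blanks."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or "=" not in line:
            continue
        wrong, correct = (part.strip() for part in line.split("=", 1))
        if wrong and correct:
            mapping[wrong] = correct
    return mapping


def correct_number(number, mapping):
    """Return the corrected section number, or the original if no rule applies."""
    return mapping.get(number, number)


SAMPLE = """
# Format: wrong_number = correct_number
10.7.7 = 10.7.5
10.7.8 = 10.7.6
21.4.3 = 21.2.3
"""

rules = parse_exceptions(SAMPLE)
print(correct_number("10.7.7", rules), correct_number("1.1", rules))
```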

Accessing Build Reports

After a build, review the console output, which reports:

TOC extraction and validation statistics
The number of TOC entries extracted and numbered entries found

Documentation

DOCUMENT_PREPARATION_GUIDE.md - ⚠️ START HERE - Document preparation (convert automatic numbering)
WMF_CONVERSION_GUIDE.md - Image conversion guide
MARKDOWN_GENERATION.md - Markdown output feature guide
chapter-viewer/README.md - Web viewer documentation
CONTRIBUTING.md - Contribution guidelines

Main Scripts

build_book.py - Main build system with integrated TOC extraction
verify_images.py - Image verification tool

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Run make verify to check integrity
Submit a pull request

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

This means you can:

✅ Use commercially
✅ Modify the code
✅ Distribute
✅ Use privately

Under the conditions:

📋 Disclose source
📋 License and copyright notice
📋 Same license for derivatives
📋 State changes made

See LICENSE file for full details.

Acknowledgments

python-docx - Word document parsing
ImageMagick - Image processing
LibreOffice - Document conversion
React - Web viewer interface
Vite - Build tooling

Support

For issues, questions, or suggestions:

Check the troubleshooting section above
Review existing issues on GitHub
Create a new issue with:
- System information (OS, Python version, etc.)
- Output of make check-deps
- Error messages or unexpected behavior
- Steps to reproduce

Roadmap

Potential future enhancements:

Support for more document formats (PDF, EPUB input)
Full-text search in viewer
Export to EPUB/PDF from JSON
Image optimization options
Multi-language support
Cloud deployment guides
Docker containerization

Note: This repository does not include source Word documents or generated content. You'll need to provide your own document to process.
