🚀 Automated system to convert Microsoft Word documents into interactive web-based readers. Extracts chapters, sections, images, and tables into optimized JSON with a React viewer. Perfect for large documents like handbooks and manuals.

larsgson/docx2app

Document Conversion System

License: GPL v3 | Python 3.8+ | Node.js

A comprehensive automated system to convert Microsoft Word documents into an interactive web-based reader with JSON content backend. This system can be adapted for any large document with chapter/section structure.

🎯 Standalone Chapter Viewer

The chapter-viewer directory is a fully self-contained React application that can be used independently:

  • ✅ Extract and use separately - Copy the chapter-viewer folder to create your own book viewer
  • ✅ Reusable for any book - Just provide your own JSON content
  • ✅ No dependencies on parent project - All book data stored within the viewer directory
  • ✅ Ready to deploy - Complete standalone web application
  • ✅ Easy to customize - Modern React codebase with clear structure

Using the Viewer Standalone

# Copy the viewer to create your own project
cp -r chapter-viewer my-book-viewer
cd my-book-viewer

# Add your book content to book_content_json/
# (Follow the JSON format described in chapter-viewer/README.md)

# Install and run
pnpm install
pnpm dev

The viewer becomes a universal book reader - perfect for documentation, handbooks, manuals, or any structured content!

See chapter-viewer/README.md for detailed standalone usage instructions.

Distributing Your Book as a Standalone Viewer

After building your book, you can distribute the complete viewer:

# Build your book
make build

# The chapter-viewer directory is now self-contained!
# Package it for distribution:
tar -czf my-book-viewer.tar.gz chapter-viewer/

# Or just copy it anywhere:
cp -r chapter-viewer /path/to/my-book-viewer

# Recipients can use it immediately:
cd my-book-viewer
pnpm install
pnpm dev

The chapter-viewer directory contains:

  • ✅ All book content in book_content_json/
  • ✅ All images in book_content_json/chapter_XX/pictures/
  • ✅ Complete React application
  • ✅ Ready to run after pnpm install, with no files needed from outside the directory

This makes it perfect for:

  • 📦 Distributing documentation as a web app
  • 🌐 Hosting on GitHub Pages, Netlify, Vercel
  • 💿 Sharing as an offline viewer
  • 📚 Creating multiple book viewers from one codebase

Features

  • 🚀 One-command build - Single make build converts entire document
  • 📚 Smart chapter detection - Automatically identifies chapters and sections
  • 🖼️ Image extraction - Extracts all images including WMF conversion
  • 📊 Table processing - Preserves complex table structures including headers in cells
  • 🎨 Format preservation - Maintains bold, italic, fonts, alignment
  • ✅ TOC validation - Automatically extracts and validates table of contents
  • 📝 Dual output format - Generates both JSON and Markdown simultaneously
  • 🔍 Verification tools - Built-in integrity checking
  • 📱 React web viewer - Responsive mobile-friendly interface

⚠️ Document Preparation (CRITICAL FIRST STEP!)

You MUST convert automatic numbering to fixed text before processing.

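Automatic numbering in Word/LibreOffice stores section numbers (like "3.1", "4.2.3") in the document's internal list structure rather than in the paragraph text. Because the parser only sees the paragraph text, headings that rely on automatic numbering lose their numbers, which causes missing sections and failed TOC validation.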

Quick Fix:

  • LibreOffice: Select All → Format → Lists → No List → Save
  • Word: Select All → Ctrl+Shift+N → Numbering → None → Save

📖 See DOCUMENT_PREPARATION_GUIDE.md for detailed instructions, verification steps, and troubleshooting.
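
To check whether a document still contains automatic numbering before building, a quick heuristic is to count paragraphs that carry a list reference in the document XML. The sketch below is not part of this project: the function name is hypothetical, and it only inspects direct paragraph numbering (the w:numPr element in word/document.xml), so numbering inherited from a paragraph style would not be detected.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_auto_numbered_paragraphs(docx_path):
    """Return (numbered, total) paragraph counts for a .docx file.

    Hypothetical helper: a paragraph counts as "numbered" when its
    properties contain a <w:numPr> element, i.e. it belongs to a list.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    numbered = 0
    total = 0
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        total += 1
        if paragraph.find(f"{{{W_NS}}}pPr/{{{W_NS}}}numPr") is not None:
            numbered += 1
    return numbered, total
```

A result such as (0, N) suggests the direct list numbering has been removed; any non-zero first value is worth checking against the preparation guide.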


Quick Start

# 1. Install dependencies
make install-deps

# 2. Build the book
make build

# 3. Start the web viewer
make viewer

Open your browser to http://localhost:3000 to view the book.

System Requirements

Required Dependencies

  • Python 3.8+ with python-docx
  • ImageMagick 7+ - Image processing
  • Ghostscript - PDF to PNG conversion
  • LibreOffice - WMF to PDF conversion
  • Node.js 16+ - Web viewer

Installation

macOS:

brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps

Linux (Ubuntu/Debian):

sudo apt-get install imagemagick ghostscript libreoffice python3-pip nodejs npm
make install-deps

macOS LibreOffice Setup:

If LibreOffice was installed via DMG (not Homebrew), run:

make setup-libreoffice

This creates a symlink so ImageMagick can access LibreOffice.

What It Does

Build Pipeline

Word Document
    ↓
1. Extract TOC automatically
2. Extract chapters & sections with TOC validation
3. Parse text with formatting
4. Extract images (WMF → PNG)
5. Process tables (including headers in cells)
6. Build navigation index
    ↓
Interactive Web Viewer

Detailed Steps

  1. TOC Extraction - Automatically extracts Table of Contents from document
  2. Chapter Detection - Identifies chapters by N.0 headings (e.g., "1.0 Introduction")
  3. Section Parsing - Subdivides chapters into N.X sections (e.g., "1.1", "1.2")
  4. Content Extraction - Preserves formatting, images, tables, footnotes
  5. Table Cell Headers - Detects and processes section headers inside table cells
  6. WMF Conversion - Converts Windows Metafiles to PNG via LibreOffice → PDF → PNG
  7. Index Building - Creates navigation structure with statistics

Usage

Build Commands

make build           # Build complete book content
make rebuild-all     # Clean and rebuild from scratch
make clean           # Remove generated files

Development Commands

make dev             # Build and start viewer in one command
make viewer          # Start chapter-viewer dev server
make status          # Show current project status
make stats           # Display content statistics

Verification Commands

make check-deps      # Verify all dependencies installed
make verify          # Check image integrity and content

Project Structure

project-root/
├── build_book.py                    # Main build system (JSON output)
├── verify_images.py                 # Image verification tool
├── Makefile                         # Build automation
├── setup_libreoffice.sh             # LibreOffice configuration helper
├── requirements.txt                 # Python dependencies
├── LICENSE                          # GPL-3.0 license
│
├── original-book.docx               # Source document (not in repo)
│
├── markdown_chapters/               # Markdown export (optional, not in repo)
│   ├── README.md                    # Navigation index
│   └── chapter_XX/                  # Chapter directories
│       ├── section_X_X.md           # Section content
│       └── pictures/                # Extracted images
│
└── chapter-viewer/                  # STANDALONE React web application
    ├── book_content_json/           # Book data (self-contained!)
    │   ├── index.json               # Navigation index
    │   ├── toc_structure.json       # Table of contents
    │   └── chapter_XX/              # Chapter directories
    │       ├── chapter.json         # Chapter metadata
    │       ├── section_XX.json      # Section content
    │       └── pictures/            # Chapter images
    ├── src/                         # React source code
    ├── public/
    │   └── book_content_json/       # Symlink to ../book_content_json/
    ├── package.json
    └── README.md                    # Standalone usage guide
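
For scripting against the generated output, the directory layout above can be walked with the standard library. This is a minimal sketch, not project code: list_sections is a hypothetical helper, and it relies only on the chapter_XX/ and section_XX.json naming shown in the tree.

```python
from pathlib import Path


def list_sections(content_dir):
    """Map each chapter directory name to its sorted section file names.

    Hypothetical helper based on the chapter_XX/section_XX.json naming
    shown in the project structure; other files are ignored.
    """
    layout = {}
    for chapter_dir in sorted(Path(content_dir).glob("chapter_*")):
        if chapter_dir.is_dir():
            layout[chapter_dir.name] = sorted(
                path.name for path in chapter_dir.glob("section_*.json")
            )
    return layout
```

Calling list_sections("chapter-viewer/book_content_json") after a build gives a quick overview of which chapters and sections were written.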

Output Format

JSON Structure

Each section file contains:

{
  "chapter_number": 1,
  "chapter_title": "1.0 FIRST CHAPTER",
  "content": [
    {
      "type": "paragraph",
      "index": 0,
      "text": "Full paragraph text",
      "runs": [
        {"text": "Bold text", "bold": true, "font_size": 12.0}
      ],
      "alignment": "LEFT (0)"
    },
    {
      "type": "table",
      "rows": 3,
      "cols": 2,
      "cells": [...]
    }
  ],
  "statistics": {
    "paragraphs": 78,
    "tables": 1,
    "images": 10
  }
}
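
A section file in this shape can be consumed from Python as well as from the React viewer. The sketch below assumes only the fields shown above ("content" and each block's "type"); the helper names are hypothetical, and the structure of table "cells" is not assumed because it is elided in the example.

```python
import json
from collections import Counter


def load_section(path):
    """Read one section JSON file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_section(section):
    """Count content blocks by their "type" field.

    Blocks without a "type" are grouped under "unknown" (an assumption;
    the example above always includes a type).
    """
    counts = Counter(
        block.get("type", "unknown") for block in section.get("content", [])
    )
    return dict(counts)
```

For example, summarize_section(load_section(path)) on a section with two paragraphs and one table returns {"paragraph": 2, "table": 1}.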

Key Features

Smart Chapter Detection

Handles both standard chapters (N.0 format) and appendix-style chapters (starting with N.1):

  • Regular chapters: Start with N.0 heading (e.g., "1.0 Introduction")
  • Appendix chapters: Start with N.1 section (e.g., "24.1 First Section")
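
The two chapter styles can be illustrated with a small classifier. This is a simplified sketch and not the logic in build_book.py: the regular expression and function names are assumptions, and it treats the first numbered heading seen for a new leading number as the start of a chapter, which covers both the N.0 and the N.1 case.

```python
import re

# Dot-separated number followed by a title, e.g. "1.0 Introduction".
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)+)\s+(\S.*)$")


def classify_heading(text):
    """Return (kind, number, title), or None if text is not a numbered heading."""
    match = HEADING_RE.match(text.strip())
    if match is None:
        return None
    number, title = match.groups()
    parts = number.split(".")
    kind = "chapter" if len(parts) == 2 and parts[1] == "0" else "section"
    return kind, number, title


def find_chapter_starts(headings):
    """Return the headings that open a chapter.

    A chapter opens at the first heading whose leading number has not
    been seen yet: "1.0 ..." for regular chapters, "24.1 ..." for
    appendix-style chapters that have no N.0 heading.
    """
    starts = []
    seen = set()
    for text in headings:
        parsed = classify_heading(text)
        if parsed is None:
            continue
        major = parsed[1].split(".")[0]
        if major not in seen:
            seen.add(major)
            starts.append(text)
    return starts
```

Given the headings "1.0 Introduction", "1.1 Scope", "24.1 First Section" and "24.2 Second Section", find_chapter_starts returns the first and third entries.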

TOC Validation System

  • Extracts entire Table of Contents
  • Excludes TOC paragraphs from actual content
  • Cross-validates TOC against actual content
  • Generates detailed discrepancy report
  • Uses actual content titles as source of truth
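
The cross-validation step can be pictured as a comparison of two number-to-title mappings. The sketch below is an illustration under assumptions, not the project's implementation: the function names, the report keys, and the case-insensitive title comparison are all hypothetical.

```python
def number_key(number):
    """Sort key so that "2.1" comes before "10.1"."""
    return [int(part) for part in number.split(".")]


def compare_toc(toc_entries, content_headings):
    """Compare two {number: title} mappings and report discrepancies."""
    toc_numbers = set(toc_entries)
    content_numbers = set(content_headings)
    report = {
        "missing_in_content": sorted(toc_numbers - content_numbers, key=number_key),
        "missing_in_toc": sorted(content_numbers - toc_numbers, key=number_key),
        "title_mismatch": [],
    }
    for number in sorted(toc_numbers & content_numbers, key=number_key):
        toc_title = toc_entries[number]
        content_title = content_headings[number]
        if toc_title.strip().lower() != content_title.strip().lower():
            # Recorded for the report; the content title stays authoritative.
            report["title_mismatch"].append((number, toc_title, content_title))
    return report
```

An empty report (all three lists empty) means the TOC and the parsed content agree on every numbered heading.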

WMF Image Conversion

Automatically converts Windows Metafile images using the conversion chain:

WMF → LibreOffice → PDF → Ghostscript → PNG

This ensures all images are properly displayed in modern web browsers.
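
One way to reproduce this chain by hand is to call the two tools directly. The sketch below is an assumption about how such a chain could be driven, not the commands build_book.py actually runs (the build may go through ImageMagick instead). The function names and the 150 dpi default are hypothetical; the command-line options are standard LibreOffice and Ghostscript options, and on some installs the LibreOffice binary is named soffice rather than libreoffice.

```python
import subprocess
from pathlib import Path


def build_wmf_commands(wmf_path, out_dir, dpi=150):
    """Return the two command lists for WMF -> PDF -> PNG."""
    wmf_path = Path(wmf_path)
    out_dir = Path(out_dir)
    pdf_path = out_dir / (wmf_path.stem + ".pdf")
    png_path = out_dir / (wmf_path.stem + ".png")
    # Step 1: LibreOffice writes <stem>.pdf into out_dir.
    to_pdf = [
        "libreoffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(wmf_path),
    ]
    # Step 2: Ghostscript rasterizes the PDF to a 24-bit PNG.
    to_png = [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={png_path}", str(pdf_path),
    ]
    return to_pdf, to_png


def convert_wmf(wmf_path, out_dir, dpi=150):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for command in build_wmf_commands(wmf_path, out_dir, dpi):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it possible to print and inspect the commands when a conversion fails.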

Table Cell Header Support

The system detects and processes section headers that appear inside table cells, maintaining proper chapter/section hierarchy even when headers are embedded in complex table layouts.

Configuration

Edit build_book.py to customize:

INPUT_DOCX = "original-book.docx"
JSON_DIR = "chapter-viewer/book_content_json"
EXCEPTIONS_FILE = "conf/exceptions.conf"

Build Process

The build system provides real-time feedback showing:

  • Number of TOC entries extracted
  • Chapters and sections detected
  • Paragraphs and tables processed
  • Images extracted and converted
  • Build completion time

Troubleshooting

WMF Images Not Converting

# Check if LibreOffice is accessible
libreoffice --version

# If not found, configure it
make setup-libreoffice

# Rebuild
make rebuild-all

Images Not Loading in Viewer

# Check image integrity
make verify

# If issues found, rebuild
make rebuild-all

Build Fails with Missing Dependencies

# Check what's missing
make check-deps

# Install dependencies
make install-deps

Content Not Updating

# Clean and rebuild
make clean
make build

# Force browser refresh
# Chrome/Firefox: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)

Advanced Usage

TOC Extraction

The system automatically extracts the Table of Contents directly from the document during the build process:

# TOC is automatically extracted during build
make build
# TOC structure is extracted internally and used for validation

Features:

  • ✅ Automatically extracts all TOC entries from document
  • ✅ Handles extra spaces in numbering (e.g., "3. 1", "21. 2")
  • ✅ Supports Unicode smart quotes in titles
  • ✅ Filters out false positives (dosages, measurements)
  • ✅ Validates section headers against TOC during parsing
  • ✅ Handles headers inside table cells
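
Two of the behaviours listed above, tolerating extra spaces in numbering and rejecting dosage-like lines, can be sketched with regular expressions. This is an illustration only: the patterns, the unit list, and the function name are assumptions and may differ from what build_book.py does.

```python
import re

# A dotted number that may contain spaces after the dots, e.g. "3. 1".
SPACED_NUMBER_RE = re.compile(r"^(\d+(?:\.\s*\d+)+)")
# Hypothetical unit list; extend it for the document at hand.
UNIT_RE = re.compile(r"^(mg|ml|g|kg|cm|mm)\b")


def parse_toc_line(line):
    """Return (number, title) for a TOC-like line, or None.

    "3. 1 Overview" becomes ("3.1", "Overview"); a line such as
    "2.5 mg daily" is rejected because the text after the number
    starts with a measurement unit.
    """
    line = line.strip()
    match = SPACED_NUMBER_RE.match(line)
    if match is None:
        return None
    number = re.sub(r"\s+", "", match.group(1))
    title = line[match.end():].strip()
    if not title or UNIT_RE.match(title):
        return None
    return number, title
```

Titles are passed through unchanged, so Unicode smart quotes in a title survive as they are.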

Custom Document Processing

To process your own Word document:

  1. Prepare your document - Convert automatic numbering to fixed text (see DOCUMENT_PREPARATION_GUIDE.md)
  2. Place your .docx file in the project root
  3. Name the file original-book.docx, or update INPUT_DOCX in build_book.py to match your file name
  4. Create conf/exceptions.conf if you have known numbering errors
  5. Run make rebuild-all

Exception Handling

If your document has known numbering inconsistencies, create conf/exceptions.conf:

# Format: wrong_number = correct_number
10.7.7 = 10.7.5
10.7.8 = 10.7.6
21.4.3 = 21.2.3

The system will automatically correct these during parsing.
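
The file format is simple enough to read with a few lines of Python. The sketch below shows one possible parser under assumptions: the function names are hypothetical, and treating everything after a "#" as a comment is a guess based on the comment line in the example.

```python
def parse_exceptions(text):
    """Parse "wrong_number = correct_number" lines into a dict."""
    corrections = {}
    for raw_line in text.splitlines():
        # Drop comments and surrounding whitespace; skip empty lines.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        wrong, separator, correct = line.partition("=")
        wrong, correct = wrong.strip(), correct.strip()
        if not separator or not wrong or not correct:
            raise ValueError(f"Malformed exception line: {raw_line!r}")
        corrections[wrong] = correct
    return corrections


def apply_exception(number, corrections):
    """Return the corrected section number, or the number unchanged."""
    return corrections.get(number, number)
```

With the example file above, apply_exception("10.7.7", corrections) returns "10.7.5", while numbers without an entry pass through unchanged.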

Accessing Build Reports

After a build, check the console output, which reports:

  • TOC extraction and validation statistics
  • The number of TOC entries extracted and the number of numbered entries found

Documentation

Main Scripts

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run make verify to check integrity
  5. Submit a pull request

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

This means you can:

  • ✅ Use commercially
  • ✅ Modify the code
  • ✅ Distribute
  • ✅ Use privately

Under the conditions:

  • 📋 Disclose source
  • 📋 License and copyright notice
  • 📋 Same license for derivatives
  • 📋 State changes made

See LICENSE file for full details.

Acknowledgments

  • python-docx - Word document parsing
  • ImageMagick - Image processing
  • LibreOffice - Document conversion
  • React - Web viewer interface
  • Vite - Build tooling

Support

For issues, questions, or suggestions:

  1. Check the troubleshooting section above
  2. Review existing issues on GitHub
  3. Create a new issue with:
    • System information (OS, Python version, etc.)
    • Output of make check-deps
    • Error messages or unexpected behavior
    • Steps to reproduce

Roadmap

Potential future enhancements:

  • Support for more document formats (PDF, EPUB input)
  • Full-text search in viewer
  • Export to EPUB/PDF from JSON
  • Image optimization options
  • Multi-language support
  • Cloud deployment guides
  • Docker containerization

Note: This repository does not include source Word documents or generated content. You'll need to provide your own document to process.
