This project is licensed under PMPL-1.0-or-later (Palimpsest License).
The full licence text is in license/PMPL-1.0.txt. The canonical source is the palimpsest-license repository.
Docudactyl is a multi-format HPC document extraction engine designed for British Library scale (~170 million items). It processes PDFs, images, audio, video, EPUB, and geospatial data across hundreds of cluster nodes.
┌──────────────────────────────────────────────────────────────────┐
│                     Chapel HPC Orchestrator                      │
│             (64-512 locales, dynamic load balancing)             │
├──────────────────────────────────────────────────────────────────┤
│ Conduit    │ L1/L2 Cache │ Checkpoint │ Progress Reporter        │
│ (validate) │ (LMDB+DFly) │ (resume)   │ (ETA, rate)              │
├──────────────────────────────────────────────────────────────────┤
│                          Zig FFI Layer                           │
│             (51 C-exported functions, zero overhead)             │
├────────┬──────────┬──────────┬──────────┬──────────┬────────────┤
│Poppler │Tesseract │  FFmpeg  │ libxml2  │   GDAL   │  libvips   │
│ (PDF)  │  (OCR)   │(AV meta) │  (EPUB)  │  (Geo)   │  (Image)   │
├────────┴──────────┴──────────┴──────────┴──────────┴────────────┤
│ dlopen: ONNX Runtime (ML) │ PaddleOCR (GPU OCR) │ CUDA           │
├──────────────────────────────────────────────────────────────────┤
│   Idris2 ABI Proofs (14 types, 5 struct layouts, 51 FFI decls)   │
└──────────────────────────────────────────────────────────────────┘
Offline: OCaml docudactyl-scm (JSON/text → Scheme S-expressions)
Viewer: Ada TUI (interactive document inspection)
Legacy: Julia extraction scripts (replaced by Chapel pipeline)

# Verify dependencies
just deps-check
# Build Zig FFI + Chapel binary
just build-hpc
# Run all tests
just test-hpc
# Process a directory of documents
just generate-manifest /path/to/documents manifest.txt
bin/docudactyl-hpc --manifestPath=manifest.txt --outputDir=output/
# Or on an HPC cluster (64 nodes)
sbatch deploy/slurm-docudactyl.sh

The Chapel component distributes document processing across cluster nodes with dynamic load balancing.
Modules: Config, ContentType, FFIBridge, ManifestLoader, NdjsonManifest, FaultHandler, ProgressReporter, ShardedOutput, ResultAggregator, Checkpoint, DocudactylHPC.
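To make "dynamic load balancing" concrete, here is a minimal sketch in Python rather than Chapel. It is not the project's implementation: the `process` function, the worker count, and the sample manifest entries are hypothetical, and threads stand in for locales. The idea it illustrates is that idle workers pull the next document from a shared queue, so one slow document does not stall a fixed partition of the manifest.

```python
import queue
import threading


def process(path):
    """Hypothetical stand-in for document extraction; returns a fake cost."""
    return len(path)


def run_dynamic(manifest, num_workers=4):
    """Process every manifest entry, letting idle workers pull the next item."""
    work = queue.Queue()
    for path in manifest:
        work.put(path)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Take whatever is next; there is no fixed per-worker share.
                path = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            value = process(path)
            with lock:
                results[path] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_dynamic(["a.pdf", "bb.epub"], num_workers=2)` returns one result per manifest entry, regardless of which worker handled it.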
The Zig FFI layer consists of 10 submodules that provide a unified C ABI for 7 content types and 20 processing stages:
- Core: docudactyl_ffi.zig (init, free, parse, version; dispatches by content type)
- Stages: 20 analysis stages with Cap'n Proto output (language, readability, keywords, citations, OCR confidence, perceptual hash, TOC, NER, Whisper, image classify, layout, handwriting, etc.)
- Cache: L1 LMDB per-locale (zero-copy mmap) + L2 Dragonfly cross-locale
- Conduit: Magic-byte content detection (15 formats), SHA-256, validation
- GPU OCR: PaddleOCR CUDA > Tesseract CUDA > CPU (via dlopen)
- ML Inference: ONNX Runtime for NER, Whisper, ImageClassify, Layout, Handwriting (TensorRT > CUDA > OpenVINO > CPU)
- Hardware Crypto: SHA-NI, AVX2, AVX-512, AES-NI, ARM SHA2 acceleration
- I/O Prefetch: io_uring (Linux 5.6+) with posix_fadvise fallback
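The Conduit step (magic-byte detection, SHA-256, validation) can be pictured with the following Python sketch. It is an illustration only, not the Zig code: the signature table is a small subset of well-known file signatures rather than the project's 15 formats, and the function names, the result fields, and the validity rule are assumptions made for this example.

```python
import hashlib

# A few well-known file signatures (illustrative subset, not the project's list).
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),      # little-endian TIFF
    (b"MM\x00*", "tiff"),      # big-endian TIFF
    (b"PK\x03\x04", "zip"),    # EPUB files are ZIP containers
]


def detect_kind(header):
    """Return a format label based on the leading bytes, or "unknown"."""
    for magic, kind in MAGIC:
        if header.startswith(magic):
            return kind
    return "unknown"


def conduit(data):
    """Hypothetical pre-flight check: detect type, hash content, validate."""
    kind = detect_kind(data[:16])
    return {
        "kind": kind,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        # Assumed rule: a document is valid if it is non-empty and recognised.
        "valid": kind != "unknown" and len(data) > 0,
    }
```

Detecting the type from content rather than from the file extension means a mislabelled file is still routed to a suitable parser, and the digest gives each document a stable identity.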
The Idris2 ABI layer uses dependent types to prove struct layout, alignment, and enum correctness:
- 14 proven types (ContentKind, ParseStatus, MlStatus, MlStage, ExecProvider, Sha256Tier, etc.)
- 5 struct layout proofs (ParseResult 952B, MlResult 48B, CryptoCaps 16B, OcrResult 48B, ConduitResult 88B)
- 51 FFI declarations matching the C header 1:1
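The Idris2 proofs establish these layouts at compile time. As a loose runtime analogue, a consumer of the C ABI could assert the same sizes on its side. The Python sketch below uses `ctypes`; the sizes come from the list above, but the field names and types of `CryptoCapsSketch` are hypothetical, chosen only so that the struct occupies 16 bytes. The real fields are defined by the project's C header.

```python
import ctypes

# Struct sizes in bytes, as stated in this README.
EXPECTED_SIZES = {
    "ParseResult": 952,
    "MlResult": 48,
    "CryptoCaps": 16,
    "OcrResult": 48,
    "ConduitResult": 88,
}


class CryptoCapsSketch(ctypes.Structure):
    # Hypothetical layout for illustration; not the project's real fields.
    _fields_ = [
        ("flags", ctypes.c_uint64),
        ("tier", ctypes.c_uint32),
        ("reserved", ctypes.c_uint32),
    ]


def check_layout(struct, expected_size, expected_offsets):
    """Return a list of mismatch descriptions; an empty list means it matches."""
    problems = []
    actual_size = ctypes.sizeof(struct)
    if actual_size != expected_size:
        problems.append("size %d != %d" % (actual_size, expected_size))
    for name, offset in expected_offsets.items():
        actual = getattr(struct, name).offset
        if actual != offset:
            problems.append("%s at offset %d != %d" % (name, actual, offset))
    return problems
```

For example, `check_layout(CryptoCapsSketch, EXPECTED_SIZES["CryptoCaps"], {"flags": 0, "tier": 8, "reserved": 12})` returns an empty list when the binding agrees with the expected layout.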
The OCaml tool docudactyl-scm transforms extracted JSON/text into machine-readable Scheme S-expressions. It is not in the HPC hot path.
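For readers unfamiliar with the target format, the following Python sketch shows one plausible JSON-to-S-expression mapping. It is not the OCaml implementation, and the conventions are assumptions for this example: objects become association lists of dotted pairs, keys are emitted as bare symbols (so they are assumed to be simple identifiers), and `null` becomes the empty list. The actual output of docudactyl-scm may differ.

```python
import json


def to_sexpr(value):
    """Render a parsed JSON value as a Scheme S-expression string."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "#t" if value else "#f"
    if value is None:
        return "()"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # JSON string quoting covers the common escapes (\" \\ \n \t).
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, list):
        return "(" + " ".join(to_sexpr(item) for item in value) + ")"
    if isinstance(value, dict):
        pairs = ("(%s . %s)" % (key, to_sexpr(item)) for key, item in value.items())
        return "(" + " ".join(pairs) + ")"
    raise TypeError("unsupported JSON value: %r" % (value,))
```

With this mapping, `to_sexpr(json.loads('{"title": "Report", "pages": 3}'))` yields `((title . "Report") (pages . 3))`.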
docudactyl-scm document.pdf -o document.scm
docudactyl-scm extracted.json -o extracted.scm

# Build
just build-hpc # Zig FFI + Chapel binary
just build-ffi # Zig FFI only
just build-idris # Idris2 ABI proofs
just build-ocaml # OCaml transformer
just build-ada # Ada TUI
# Test
just test-hpc # All HPC tests (FFI + error paths)
just test-ffi # Zig integration tests (40+ tests)
just test-scale # Scale test (2105+ files)
just test-idris # Idris2 proofs compile
just test-ocaml # OCaml tests
just test-ada # Ada build check
# Deploy
just deps-check # Verify dependencies
just generate-manifest <dir> [output]
just generate-abi-header
just loc # Lines of code

docudactyl/
├── src/
│ ├── chapel/ # HPC engine (11 modules)
│ ├── Docudactyl/ABI/ # Idris2 ABI proofs (3 modules)
│ ├── ocaml/ # Offline Scheme transformer
│ ├── ada/ # Terminal UI
│ └── julia/ # Legacy extraction (replaced)
│
├── ffi/zig/ # Zig FFI layer (10 submodules)
│ ├── src/ # Source
│ └── test/ # Integration tests
│
├── generated/abi/ # Auto-generated C header
├── schema/ # Cap'n Proto schema
├── deploy/ # Containerfile + Slurm script
├── contractiles/ # K9 contractile configs
├── .machine_readable/ # SCM checkpoint files
├── Justfile # Task runner
└── docudactyl.ipkg # Idris2 package

- Chapel 2.3+ (HPC engine)
- Zig 0.15+ (FFI layer)
- Idris2 0.8+ (ABI proofs)
- C libraries: Poppler, Tesseract, FFmpeg, libxml2, GDAL, libvips, LMDB
- Optional: ONNX Runtime, PaddleOCR, CUDA (for ML/GPU features)
- OCaml 4.14+ (offline Scheme transformer)
- Ada GNAT/gprbuild (terminal UI)
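Because ONNX Runtime, PaddleOCR, and CUDA are optional, availability has to be decided at run time. The Python sketch below shows the general pattern of probing for libraries and falling back along the preference order this README gives for ML inference (TensorRT > CUDA > OpenVINO > CPU). It is not the project's dlopen logic: the library names in `PROBE_LIBS` are assumptions that vary by platform and packaging, and `probe` and `pick_provider` are hypothetical helpers.

```python
import ctypes.util

# Preference order as stated in this README for ML inference.
PREFERENCE = ("TensorRT", "CUDA", "OpenVINO", "CPU")

# Assumed shared-library names to probe for; real names depend on the
# platform and on how each runtime is packaged.
PROBE_LIBS = {
    "TensorRT": "nvinfer",
    "CUDA": "cudart",
    "OpenVINO": "openvino",
}


def probe(libs=PROBE_LIBS):
    """Return the providers whose library can be located, plus CPU."""
    found = {name for name, lib in libs.items() if ctypes.util.find_library(lib)}
    found.add("CPU")  # the CPU path needs no optional library
    return found


def pick_provider(available, preference=PREFERENCE):
    """Return the most preferred provider that is available."""
    for name in preference:
        if name in available:
            return name
    return "CPU"
```

On a machine with none of the optional libraries installed, `pick_provider(probe())` returns `"CPU"`, so the optional dependencies never become hard requirements.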
podman build -f deploy/Containerfile -t docudactyl-hpc .
podman run --rm -v /data/manifest.txt:/manifest.txt:ro \
  -v /data/output:/output \
  docudactyl-hpc --manifestPath=/manifest.txt

This tool is designed for:
- Document analysis and archival processing at national library scale
- Research and verification of redaction practices
- Accessibility improvements for PDF content
- Multi-format metadata extraction and cataloguing
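This is a Tier 1 RSR project. The hot path uses Chapel + Zig (systems languages). The other components serve offline or auxiliary roles: OCaml (Scheme transformer), Ada (terminal UI), and the legacy Julia scripts, which the Chapel pipeline has replaced.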