Extralit · JonnyTran · Nov 5, 2025 · Nov 4, 2025 · Nov 5, 2025
@@ -0,0 +1,348 @@
+# PDF Preprocessing Configuration Guide
+
+This guide explains how to configure PDF preprocessing in Extralit Server using the `PDFPreprocessingSettings` class for optimal results with different document types.
+
+## Overview
+
+Extralit Server uses [OCRmyPDF](https://github.com/ocrmypdf/ocrmypdf) for PDF preprocessing, which performs OCR (Optical Character Recognition), rotation correction, optimization, and cleanup. The preprocessing pipeline also includes PDF layout analysis to extract margin and structure information.
+
+All settings can be configured via environment variables with the `PREPROCESSING_` prefix.
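+
+As a rough illustration of the naming convention, the sketch below collects `PREPROCESSING_*` variables into a dictionary keyed by lowercase field name. The helper `read_preprocessing_env` is hypothetical and is not how `PDFPreprocessingSettings` loads its values; it only assumes that list and boolean values are written as JSON, as in the examples in this guide.
+
+```python
+import json
+import os
+
+PREFIX = "PREPROCESSING_"
+
+
+def read_preprocessing_env(environ=None):
+    """Collect PREPROCESSING_* variables into a dict keyed by lowercase field name."""
+    environ = os.environ if environ is None else environ
+    settings = {}
+    for name, raw in environ.items():
+        if not name.startswith(PREFIX):
+            continue  # ignore unrelated variables
+        field = name[len(PREFIX):].lower()
+        try:
+            # JSON decoding covers true/false, numbers, and lists such as ["eng"].
+            settings[field] = json.loads(raw)
+        except json.JSONDecodeError:
+            # Anything that is not valid JSON is kept as a plain string.
+            settings[field] = raw
+    return settings
+
+
+example = {
+    "PREPROCESSING_FORCE_OCR": "true",
+    "PREPROCESSING_TESSERACT_TIMEOUT": "180",
+    "PREPROCESSING_LANGUAGE": '["eng", "spa"]',
+    "PREPROCESSING_PDF_RENDERER": "hocr",
+    "HOME": "/root",
+}
+print(read_preprocessing_env(example))
+```
+
+So `PREPROCESSING_FORCE_OCR` corresponds to the `force_ocr` field, and variables without the prefix are ignored.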
+
+## Quick Start
+
+### Digital Research Papers (Born-Digital PDFs)
+
+For modern PDFs that already contain searchable text:
+
+```bash
+# Minimal processing - just analysis and optimization
+PREPROCESSING_ENABLED=true
+PREPROCESSING_ENABLE_ANALYSIS=true
+PREPROCESSING_SKIP_TEXT=true           # Skip OCR on text pages
+PREPROCESSING_FORCE_OCR=false
+PREPROCESSING_TESSERACT_TIMEOUT=0      # No timeout (not "skip OCR")
+PREPROCESSING_OPTIMIZE=1               # Lossless optimization
+PREPROCESSING_CLEAN=false              # No cleanup needed
+PREPROCESSING_DESKEW=false             # Usually not needed
+```
+
+**Performance:** ~0.5-2s per page (mostly analysis)
+
+### Scanned Research Papers (Image-Based PDFs)
+
+For scanned documents or image-only PDFs that need OCR:
+
+```bash
+# Full OCR processing
+PREPROCESSING_ENABLED=true
+PREPROCESSING_ENABLE_ANALYSIS=true
+PREPROCESSING_FORCE_OCR=true           # OCR all pages
+PREPROCESSING_SKIP_TEXT=false          # Process text layers
+PREPROCESSING_TESSERACT_TIMEOUT=180    # 3 minutes per page
+PREPROCESSING_LANGUAGE=["eng"]         # Add more as needed
+PREPROCESSING_ROTATE_PAGES=true        # Auto-rotate pages
+PREPROCESSING_DESKEW=true              # Fix skewed scans
+PREPROCESSING_CLEAN=true               # Remove scan artifacts
+PREPROCESSING_OPTIMIZE=2               # Lossy compression
+```
+
+**Performance:** ~2-5s per page for good quality scans
+
+### Mixed Document Collections
+
+For collections with both digital and scanned papers:
+
+```bash
+# Balanced approach
+PREPROCESSING_ENABLED=true
+PREPROCESSING_ENABLE_ANALYSIS=true
+PREPROCESSING_SKIP_TEXT=true           # Only OCR image pages
+PREPROCESSING_FORCE_OCR=false          # Detect existing text
+PREPROCESSING_REDO_OCR=false           # Don't re-OCR
+PREPROCESSING_TESSERACT_TIMEOUT=120    # 2 minutes timeout
+PREPROCESSING_ROTATE_PAGES=true
+PREPROCESSING_DESKEW=false
+PREPROCESSING_CLEAN=true
+PREPROCESSING_OPTIMIZE=1
+```
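+
+If you switch between these presets often, it may help to keep them in one place and render the `.env` lines from code. The following is a minimal sketch; `PRESETS` and `render_env` are hypothetical names, and the values are copied from the three presets above.
+
+```python
+import json
+
+# Values copied from the three Quick Start presets in this guide.
+COMMON = {"ENABLED": True, "ENABLE_ANALYSIS": True}
+PRESETS = {
+    "digital": {**COMMON, "SKIP_TEXT": True, "FORCE_OCR": False,
+                "TESSERACT_TIMEOUT": 0, "OPTIMIZE": 1,
+                "CLEAN": False, "DESKEW": False},
+    "scanned": {**COMMON, "FORCE_OCR": True, "SKIP_TEXT": False,
+                "TESSERACT_TIMEOUT": 180, "LANGUAGE": ["eng"],
+                "ROTATE_PAGES": True, "DESKEW": True,
+                "CLEAN": True, "OPTIMIZE": 2},
+    "mixed": {**COMMON, "SKIP_TEXT": True, "FORCE_OCR": False,
+              "REDO_OCR": False, "TESSERACT_TIMEOUT": 120,
+              "ROTATE_PAGES": True, "DESKEW": False,
+              "CLEAN": True, "OPTIMIZE": 1},
+}
+
+
+def render_env(preset):
+    """Render one preset as PREPROCESSING_* lines for a .env file."""
+    lines = []
+    for key, value in PRESETS[preset].items():
+        # json.dumps yields true/false for booleans and ["eng"] for lists.
+        lines.append(f"PREPROCESSING_{key}={json.dumps(value)}")
+    return "\n".join(lines)
+
+
+print(render_env("scanned"))
+```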
+
+## Configuration Reference
+
+### Core Settings
+
+#### `PREPROCESSING_ENABLED`
+- **Type**: `bool`
+- **Default**: `true`
+- **Description**: Master switch for PDF preprocessing. When `false`, only layout analysis runs (if `enable_analysis=true`).
+- **Use Case**: Set to `false` to disable all OCR processing while keeping layout analysis.
+
+#### `PREPROCESSING_ENABLE_ANALYSIS`
+- **Type**: `bool`
+- **Default**: `true`
+- **Description**: Enable PDF layout analysis and margin detection using `PDFAnalyzer`.
+- **Use Case**: Disable if you don't need structural metadata extraction.
+
+### OCR Settings
+
+#### `PREPROCESSING_LANGUAGE`
+- **Type**: `list[str]`
+- **Default**: `["eng"]`
+- **Options**: Tesseract language codes, which are three-letter codes (e.g., `["eng", "spa", "fra", "deu"]`); the matching Tesseract language packs must be installed
+- **Description**: Languages for OCR recognition. Multiple languages increase processing time.
+- **Use Case**:
+  - Single language papers: `["eng"]`
+  - Multilingual papers: `["eng", "spa"]`
+  - International collections: Add all expected languages
+
+#### `PREPROCESSING_TESSERACT_TIMEOUT`
+- **Type**: `int` (seconds)
+- **Default**: `0`
+- **Description**: Timeout for Tesseract OCR per page. **`0` means no timeout** (unlimited time), not "skip OCR". To skip OCR entirely, set `PREPROCESSING_ENABLED=false`.
+- **Use Case**:
+  - `0`: No timeout - best for accuracy (default)
+  - `60-120`: Standard scanned papers with time constraints
+  - `180-300`: Complex layouts, low-quality scans
+  - `600+`: Historical documents, very poor scan quality
+
+#### `PREPROCESSING_FORCE_OCR`
+- **Type**: `bool`
+- **Default**: `false`
+- **Description**: Force OCR on all pages, even those with existing text.
+- **Use Case**:
+  - `true`: Scanned documents, poor existing OCR
+  - `false`: Digital PDFs, mixed collections (recommended)
+
+#### `PREPROCESSING_SKIP_TEXT`
+- **Type**: `bool`
+- **Default**: `true`
+- **Description**: Skip OCR on pages that already have text. Only process image-only pages.
+- **Use Case**:
+  - `true`: Digital PDFs, mixed collections (recommended)
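+  - `false`: Set alongside `PREPROCESSING_FORCE_OCR=true` or `PREPROCESSING_REDO_OCR=true`; OCRmyPDF treats skip-text, force-OCR, and redo-OCR as mutually exclusive modes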
+
+#### `PREPROCESSING_REDO_OCR`
+- **Type**: `bool`
+- **Default**: `false`
+- **Description**: Redo OCR on pages that already have OCR text.
+- **Use Case**: Set to `true` only if existing OCR is poor quality.
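+
+OCRmyPDF itself refuses to run with more than one of `--force-ocr`, `--skip-text`, and `--redo-ocr`, so it can be worth checking the combination before a job starts. The helper below is a hypothetical sketch, not part of `PDFPreprocessingSettings`; its defaults mirror the defaults listed in this section.
+
+```python
+def check_ocr_mode(force_ocr=False, skip_text=True, redo_ocr=False):
+    """Return the single active OCR mode, or None if no mode is enabled.
+
+    Raises ValueError when more than one mode is enabled at once.
+    """
+    flags = (
+        ("force_ocr", force_ocr),
+        ("skip_text", skip_text),
+        ("redo_ocr", redo_ocr),
+    )
+    active = [name for name, enabled in flags if enabled]
+    if len(active) > 1:
+        raise ValueError(
+            "Choose only one OCR mode, got: " + ", ".join(active)
+        )
+    return active[0] if active else None
+
+
+# Scanned-document preset: force OCR and turn skip_text off explicitly.
+print(check_ocr_mode(force_ocr=True, skip_text=False))
+```
+
+Note that with the default `skip_text=True`, enabling `force_ocr` alone would trip this check, which is why the scanned-document preset sets `PREPROCESSING_SKIP_TEXT=false`.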
+
+### Page Processing
+
+#### `PREPROCESSING_ROTATE_PAGES`
+- **Type**: `bool`
+- **Default**: `true`
+- **Description**: Auto-rotate pages to the correct orientation based on the detected text direction.
+- **Use Case**: Keep `true` for scanned documents; safe for digital PDFs.
+
+#### `PREPROCESSING_ROTATE_PAGES_THRESHOLD`
+- **Type**: `float`
+- **Default**: `2.0`
+- **Description**: Confidence threshold for rotation (higher = more conservative).
+- **Use Case**: Lower (1.0-1.5) for aggressive rotation; higher (3.0+) to avoid false rotations.
+
+#### `PREPROCESSING_DESKEW`
+- **Type**: `bool`
+- **Default**: `false`
+- **Description**: Correct skewed/tilted text in scanned documents.
+- **Use Case**:
+  - `true`: Scanned documents with visible skew
+  - `false`: Digital PDFs (adds processing time)
+
+#### `PREPROCESSING_CLEAN`
+- **Type**: `bool`
+- **Default**: `true`
+- **Description**: Use `unpaper` to remove scan artifacts, borders, and noise.
+- **Use Case**:
+  - `true`: Scanned documents, photocopies
+  - `false`: Clean digital PDFs (saves processing time)
+
+### Output Optimization
+
+#### `PREPROCESSING_OPTIMIZE`
+- **Type**: `int`
+- **Default**: `1`
+- **Options**:
+  - `0`: No optimization (largest file size)
+  - `1`: Lossless optimization (recommended for digital PDFs)
+  - `2`: Lossy compression (good for scanned documents)
+  - `3`: Aggressive compression (smallest size, some quality loss)
+- **Use Case**:
+  - Digital PDFs: `1` (preserve quality)
+  - Scanned documents: `2` (balance size/quality)
+  - Large collections: `3` (minimize storage)
+
+#### `PREPROCESSING_PDF_RENDERER`
+- **Type**: `str`
+- **Default**: `"hocr"`
+- **Options**: `"auto"`, `"hocr"`, `"sandwich"`
+- **Description**:
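+  - `"hocr"`: Build the invisible text layer from Tesseract's hOCR output (best for most documents)
+  - `"sandwich"`: Use Tesseract's own text-only PDF output as the invisible text layer and merge it with the page image; like `"hocr"`, the page appearance is preserved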
+  - `"auto"`: Let OCRmyPDF choose
+- **Use Case**:
+  - Digital papers: `"hocr"` (smaller files)
+  - Scanned papers: `"hocr"` or `"sandwich"` (depending on preference)
+
+#### `PREPROCESSING_OUTPUT_TYPE`
+- **Type**: `str`
+- **Default**: `"pdf"`
+- **Options**: `"pdf"`, `"pdfa"`, `"pdfa-1"`, `"pdfa-2"`, `"pdfa-3"`
+- **Description**: Output PDF format. `"pdf"` skips PDF/A conversion.
+- **Use Case**: Use `"pdf"` for speed; PDF/A formats for long-term archival.
+
+#### `PREPROCESSING_FAST_WEB_VIEW`
+- **Type**: `int`
+- **Default**: `999999` (effectively disabled)
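+- **Description**: File-size threshold in MB above which OCRmyPDF linearizes the output for fast web viewing. With a very high value no file reaches the threshold, so linearization never runs.
+- **Use Case**: Set to `1` so that files larger than 1 MB are linearized for web serving; keep the default for processing pipelines.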
+
+### Performance Settings
+
+#### `PREPROCESSING_JOBS`
+- **Type**: `int`
+- **Default**: `1`
+- **Description**: Number of parallel worker processes for OCR.
+- **Use Case**:
+  - Docker/limited CPU: `1` (avoid oversubscription)
+  - Multi-core servers: `2-4` (balance speed/resources)
+  - High-memory systems: `4-8` (maximum parallelism)
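+
+One possible way to pick a value is to derive it from the number of available cores while keeping one core free for the server. This is only a heuristic sketch; `suggest_jobs` and its cap of 4 are assumptions, not something Extralit Server does on its own.
+
+```python
+import os
+
+
+def suggest_jobs(cpu_count=None, max_jobs=4):
+    """Suggest a PREPROCESSING_JOBS value: cores minus one, clamped to [1, max_jobs]."""
+    # os.cpu_count() can return None, and inside a container with a CPU quota
+    # it may report the host's cores, so pass cpu_count explicitly there.
+    cpu_count = cpu_count or os.cpu_count() or 1
+    return max(1, min(max_jobs, cpu_count - 1))
+
+
+print(f"PREPROCESSING_JOBS={suggest_jobs()}")
+```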
+
+#### `PREPROCESSING_SKIP_BIG`
+- **Type**: `float` (megapixels)
+- **Default**: `100.0`
+- **Description**: Skip OCR on pages whose image exceeds this size to avoid timeouts. OCRmyPDF measures this threshold in megapixels, not megabytes of file size; skipped pages still appear in the output.
+- **Use Case**:
+  - High-quality scans: `50-100`
+  - Standard documents: `100-200`
+  - Large format papers: `200+`
+
+#### `PREPROCESSING_PROGRESS_BAR`
+- **Type**: `bool`
+- **Default**: `false`
+- **Description**: Show progress bar during processing (useful for CLI, not for background jobs).
+- **Use Case**: `true` for interactive processing; `false` for production.
+
+## Troubleshooting
+
+### Issue: Timeout Errors
+
+**Symptoms**: `TesseractTimeout` errors in logs
+
+**Solutions**:
+1. Increase `PREPROCESSING_TESSERACT_TIMEOUT` (try 300-600)
+2. Lower `PREPROCESSING_SKIP_BIG` so that oversized, slow pages are skipped instead of timing out
+3. Reduce `PREPROCESSING_JOBS` to avoid resource contention
+4. Set `PREPROCESSING_CLEAN=false` to skip image preprocessing
+
+### Issue: Poor OCR Quality
+
+**Symptoms**: Garbled or missing text extraction
+
+**Solutions**:
+1. Enable `PREPROCESSING_DESKEW=true` for skewed scans
+2. Enable `PREPROCESSING_CLEAN=true` to remove artifacts
+3. Set `PREPROCESSING_FORCE_OCR=true` to redo existing OCR
+4. Add more languages to `PREPROCESSING_LANGUAGE`
+5. Adjust `PREPROCESSING_ROTATE_PAGES_THRESHOLD` if pages are incorrectly rotated
+
+### Issue: Processing Too Slow
+
+**Symptoms**: Long wait times for document processing
+
+**Solutions**:
+1. Set `PREPROCESSING_ENABLED=false` for digital PDFs (only run analysis)
+2. Reduce `PREPROCESSING_TESSERACT_TIMEOUT` (try 60-120)
+3. Ensure `PREPROCESSING_SKIP_TEXT=true` for hybrid documents
+4. Reduce `PREPROCESSING_OPTIMIZE` level
+5. Set `PREPROCESSING_CLEAN=false` and `PREPROCESSING_DESKEW=false`
+
+### Issue: High Memory Usage
+
+**Symptoms**: Out-of-memory errors, system slowdown
+
+**Solutions**:
+1. Set `PREPROCESSING_JOBS=1` (most important)
+2. Lower the `PREPROCESSING_SKIP_BIG` threshold (e.g., `50`) so that very large page images are not OCRed
+3. Set `PREPROCESSING_OPTIMIZE=3` to reduce output size
+4. Process documents in smaller batches
+
+## Integration Examples
+
+### Programmatic Configuration (Python)
+
+```python
+from extralit_server.contexts.document.preprocessing import (
+    PDFPreprocessor,
+    PDFPreprocessingSettings
+)
+
+# Custom settings for scanned documents
+settings = PDFPreprocessingSettings(
+    enabled=True,
+    enable_analysis=True,
+    force_ocr=True,
+    skip_text=False,  # as in the scanned preset; OCRmyPDF rejects force + skip together
+    tesseract_timeout=180,
+    language=["eng"],
+    deskew=True,
+    clean=True,
+    optimize=2
+)
+
+# Read the source PDF into memory
+with open("document.pdf", "rb") as f:
+    pdf_bytes = f.read()
+
+preprocessor = PDFPreprocessor(settings=settings)
+result = preprocessor.preprocess(pdf_bytes, "document.pdf")
+
+# Access processed data and metadata
+processed_pdf = result.processed_data
+metadata = result.metadata
+print(f"Processing time: {metadata.processing_time:.2f}s")
+print(f"Analysis results: {metadata.analysis_results}")
+```
+
+### Environment Variables (Docker/Production)
+
+Create a `.env` file:
+
+```bash
+# Extralit Server Configuration
+EXTRALIT_DATABASE_URL=postgresql://user:pass@localhost/extralit
+EXTRALIT_REDIS_URL=redis://localhost:6379/0
+
+# PDF Preprocessing for Scanned Documents
+PREPROCESSING_ENABLED=true
+PREPROCESSING_ENABLE_ANALYSIS=true
+PREPROCESSING_FORCE_OCR=true
+PREPROCESSING_SKIP_TEXT=false
+PREPROCESSING_TESSERACT_TIMEOUT=180
+PREPROCESSING_LANGUAGE=["eng", "spa"]
+PREPROCESSING_DESKEW=true
+PREPROCESSING_CLEAN=true
+PREPROCESSING_OPTIMIZE=2
+PREPROCESSING_JOBS=2
+```
+
+## Related Components
+
+| File | Purpose |
+|------|---------|
+| [`preprocessing.py`](../src/extralit_server/contexts/document/preprocessing.py) | Core preprocessing logic and settings |
+| [`margin.py`](../src/extralit_server/contexts/document/margin.py) | PDF layout analysis and margin detection |
+| [`api/schemas/v1/document/preprocessing.py`](../src/extralit_server/api/schemas/v1/document/preprocessing.py) | API metadata schema |
+
+## Important Notes
+
+1. **Environment Variables**: All settings can be overridden via `PREPROCESSING_*` env vars
+2. **OCRmyPDF Dependency**: Requires `ocrmypdf` and `tesseract` installed
+3. **Lazy Loading**: `ocrmypdf` is lazy-loaded to avoid import overhead
+4. **Error Handling**: Falls back to temp files if BytesIO approach fails
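+
+The fallback mentioned in the last note follows a common pattern: try in-memory streams first, then retry through files on disk. The sketch below shows that pattern in isolation. `run_with_fallback` and the `process` callable are hypothetical stand-ins, not the actual code in `preprocessing.py`.
+
+```python
+import io
+import os
+import tempfile
+
+
+def run_with_fallback(process, pdf_bytes):
+    """Run process(src, dst) on in-memory streams, retrying with temp files on failure."""
+    try:
+        out = io.BytesIO()
+        process(io.BytesIO(pdf_bytes), out)
+        return out.getvalue()
+    except Exception:
+        # Broad catch on purpose: any in-memory failure triggers the file-based retry.
+        with tempfile.TemporaryDirectory() as tmp:
+            src = os.path.join(tmp, "input.pdf")
+            dst = os.path.join(tmp, "output.pdf")
+            with open(src, "wb") as fh:
+                fh.write(pdf_bytes)
+            process(src, dst)
+            with open(dst, "rb") as fh:
+                return fh.read()
+
+
+def copy_streams(src, dst):
+    """Stand-in processor that copies its input unchanged and only supports streams."""
+    dst.write(src.read())
+
+
+print(run_with_fallback(copy_streams, b"%PDF-1.7 example"))
+```
+
+The temporary directory is removed as soon as the output bytes have been read, so no files are left behind after the retry.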
+
+## Further Reading
+
+- [OCRmyPDF Documentation](https://ocrmypdf.readthedocs.io/)
+- [Tesseract Language Data](https://github.com/tesseract-ocr/tessdata)
+- [PDF/A Archival Standards](https://en.wikipedia.org/wiki/PDF/A)
+- [Extralit Server Architecture](../README.md)
+
+## Related Repositories
+
+- [Extralit](https://github.com/Extralit/extralit)
+- [Extralit HF Space](https://github.com/Extralit/extralit-hf-space)
+- [Papers OCR Benchmarks](https://github.com/Extralit/papers-ocr-benchmarks)
@@ -208,13 +208,13 @@ nav:
           - Upgrading: admin_guide/upgrading.md
           # - File storage configuration: admin_guide/file_storage.md
       - Advanced:
+          - PDF Preprocessing Configuration: admin_guide/pdf_preprocessing_config.md
           - Custom fields with layout templates: admin_guide/custom_fields.md
           - Use webhooks to respond to server events:
               - admin_guide/webhooks.md
               - Webhooks internals: admin_guide/webhooks_internals.md
           - Use Markdown to format rich content: admin_guide/use_markdown_to_format_rich_content.md
           - Migrate users, workspaces and datasets to Extralit V2: admin_guide/migrate_from_legacy_datasets.md
-          - Custom fields with layout templates: admin_guide/custom_fields.md
   - Tutorials:
       - tutorials/index.md
       - Text classification: tutorials/text_classification.ipynb