diff --git a/docs/source/layout_detector.md b/docs/source/layout_detector.md new file mode 100644 index 0000000..94b270c --- /dev/null +++ b/docs/source/layout_detector.md @@ -0,0 +1,273 @@ +# Layout Detection + +The LayoutDetector is a powerful component in ScaleDP that uses PaddleOCR's layout analysis capabilities to detect and classify different regions within documents. This detector can identify various layout types such as text blocks, titles, lists, tables, and figures. + +## Overview + +Layout detection is essential for understanding document structure and extracting meaningful information from complex documents. The LayoutDetector provides: + +- **Multiple Layout Types**: Detects text, titles, lists, tables, and figures +- **Configurable Detection**: Customize which types to detect and confidence thresholds +- **GPU Acceleration**: Support for GPU processing to improve performance +- **Integration**: Seamless integration with ScaleDP pipeline +- **Error Handling**: Robust error handling for various edge cases + +## Installation + +The LayoutDetector requires PaddleOCR to be installed: + +```bash +pip install paddleocr +``` + +## Basic Usage + +### Initialize the LayoutDetector + +```python +from scaledp.models.detectors.LayoutDetector import LayoutDetector +from scaledp.enums import Device + +# Create a LayoutDetector instance +layout_detector = LayoutDetector( + inputCol="image", + outputCol="layout_boxes", + scoreThreshold=0.5, # Confidence threshold + device=Device.CPU, # Use CPU for inference + whiteList=["text", "title", "list", "table", "figure"], # Types to detect + model="PP-DocLayout_plus-L" # Model to use +) +``` + +### Process an Image + +```python +from scaledp.schemas.Image import Image +from PIL import Image as PILImage + +# Load and prepare image +pil_image = PILImage.open("document.png") +image = Image( + path="document.png", + data=pil_image, + exception="" +) + +# Run layout detection +result = layout_detector.transform_udf(image) + +# 
Access results +print(f"Detected {len(result.bboxes)} layout regions") +for box in result.bboxes: + print(f"- {box.text}: confidence {box.score:.3f}") +``` + +## Configuration Parameters + +### Core Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `inputCol` | str | "image" | Input column name containing images | +| `outputCol` | str | "layout_boxes" | Output column name for detection results | +| `scoreThreshold` | float | 0.5 | Minimum confidence score for detections | +| `device` | Device | Device.CPU | Processing device (CPU/GPU) | +| `whiteList` | List[str] | [] | Layout types to detect; an empty list detects all types | +| `model` | str | "PP-DocLayout_plus-L" | PaddleOCR layout detection model name | + +### Advanced Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `scaleFactor` | float | 1.0 | Image scaling factor | +| `keepInputData` | bool | False | Whether to keep input data in output | +| `partitionMap` | bool | False | Enable partitioned processing | +| `numPartitions` | int | 0 | Number of partitions for processing | +| `propagateError` | bool | False | Whether to propagate errors | + +## Available Models + +The LayoutDetector supports different PaddleOCR layout detection models: + +- **PP-DocLayout_plus-L**: Large model with high accuracy (default) +- **PP-DocLayout-M**: Medium model with balanced speed and accuracy + +## Layout Types + +The LayoutDetector can identify the following layout types: + +- **text**: General text content +- **title**: Document titles and headings +- **list**: Bulleted or numbered lists +- **table**: Tabular data structures +- **figure**: Images, charts, and diagrams + +## Examples + +### Custom Layout Type Detection + +```python +# Detect only text and tables +text_table_detector = LayoutDetector( + inputCol="image", + outputCol="text_table_boxes", + scoreThreshold=0.6, + whiteList=["text", 
"table"], + model="PP-DocLayout-M" # Use medium model for faster processing +) +``` + +### GPU Acceleration + +```python +# Use GPU for faster processing +gpu_detector = LayoutDetector( + inputCol="image", + outputCol="gpu_layout_boxes", + device=Device.CUDA, + scoreThreshold=0.5 +) +``` + +### Pipeline Integration + +```python +from pyspark.ml import PipelineModel +from scaledp.models.image.DataToImage import DataToImage + +pipeline = PipelineModel(stages=[ + DataToImage(inputCol="content", outputCol="image"), + LayoutDetector( + inputCol="image", + outputCol="layout_boxes", + scoreThreshold=0.5 + ) +]) + +result = pipeline.transform(df) +``` + +## Output Format + +The LayoutDetector returns a `DetectorOutput` object containing: + +- **path**: Image file path +- **type**: Detection type ("layout") +- **bboxes**: List of detected layout regions +- **exception**: Any error messages + +Each detected region includes: + +- **text**: Layout type (text, title, list, table, figure) +- **score**: Confidence score (0.0 to 1.0) +- **x, y**: Top-left coordinates +- **width, height**: Region dimensions +- **polygon**: Optional polygon coordinates for rotated regions + +## Performance Considerations + +### CPU vs GPU + +- **CPU**: Suitable for small batches and development +- **GPU**: Recommended for production and large-scale processing + +### Batch Processing + +For large datasets, consider using partitioned processing: + +```python +layout_detector = LayoutDetector( + inputCol="image", + outputCol="layout_boxes", + partitionMap=True, + numPartitions=4 +) +``` + +### Memory Management + +The detector automatically handles memory cleanup, but for very large images, consider: + +- Using `scaleFactor` to reduce image size +- Processing in smaller batches +- Monitoring memory usage + +## Error Handling + +The LayoutDetector includes robust error handling: + +- **Import Errors**: Graceful handling when PaddleOCR is not installed +- **Processing Errors**: Individual image errors 
don't stop batch processing +- **Configuration Errors**: Clear error messages for invalid parameters + +## Use Cases + +### Document Analysis + +```python +# Analyze document structure +result = layout_detector.transform_udf(document_image) + +# Extract titles +titles = [box for box in result.bboxes if box.text == "title"] + +# Extract tables +tables = [box for box in result.bboxes if box.text == "table"] +``` + +### Content Extraction + +```python +# Focus on specific content types +text_detector = LayoutDetector( + inputCol="image", + outputCol="text_regions", + whiteList=["text", "title"] +) +``` + +### Quality Control + +```python +# High confidence detection +high_confidence_detector = LayoutDetector( + inputCol="image", + outputCol="high_conf_boxes", + scoreThreshold=0.8 +) +``` + +## Troubleshooting + +### Common Issues + +1. **PaddleOCR not installed** + ``` + pip install paddleocr paddlepaddle + ``` + +2. **GPU not available** + - Check CUDA installation + - Verify PaddleOCR GPU support + - Fall back to CPU processing + +3. 
**Memory issues** + - Reduce `scaleFactor` + - Process smaller batches + - Monitor system resources + +### Performance Tips + +- Use GPU when available for faster processing +- Adjust `scoreThreshold` based on quality requirements +- Consider image preprocessing for better results +- Use appropriate batch sizes for your hardware + +## Integration with Other Components + +The LayoutDetector works well with other ScaleDP components: + +- **OCR**: Extract text from detected text regions +- **NER**: Apply named entity recognition to text regions +- **Visual Extractors**: Extract data from specific layout types +- **Image Processing**: Draw bounding boxes around detected regions diff --git a/poetry.lock b/poetry.lock index 7c8778a..4ac458f 100644 --- a/poetry.lock +++ b/poetry.lock @@ -968,14 +968,14 @@ files = [ [[package]] name = "flatbuffers" -version = "25.2.10" +version = "25.9.23" description = "The FlatBuffers serialization format for Python" optional = false python-versions = "*" groups = ["main"] files = [ - {file = "flatbuffers-25.2.10-py2.py3-none-any.whl", hash = "sha256:ebba5f4d5ea615af3f7fd70fc310636fbb2bbd1f566ac0a23d98dd412de50051"}, - {file = "flatbuffers-25.2.10.tar.gz", hash = "sha256:97e451377a41262f8d9bd4295cc836133415cc03d8cb966410a4af92eb00d26e"}, + {file = "flatbuffers-25.9.23-py2.py3-none-any.whl", hash = "sha256:255538574d6cb6d0a79a17ec8bc0d30985913b87513a01cce8bcdb6b4c44d0e2"}, + {file = "flatbuffers-25.9.23.tar.gz", hash = "sha256:676f9fa62750bb50cf531b42a0a2a118ad8f7f797a511eda12881c016f093b12"}, ] [[package]] @@ -2801,36 +2801,30 @@ files = [ [[package]] name = "onnxruntime" -version = "1.15.1" +version = "1.22.0" description = "ONNX Runtime is a runtime accelerator for Machine Learning models" optional = false -python-versions = "*" +python-versions = ">=3.10" groups = ["main"] files = [ - {file = "onnxruntime-1.15.1-cp310-cp310-macosx_10_15_x86_64.whl", hash = "sha256:baad59e6a763237fa39545325d29c16f98b8a45d2dfc524c67631e2e3ba44d16"}, - 
{file = "onnxruntime-1.15.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:568c2db848f619a0a93e843c028e9fb4879929d40b04bd60f9ba6eb8d2e93421"}, - {file = "onnxruntime-1.15.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:69088d7784bb04dedfd9e883e2c96e4adf8ae0451acdd0abb78d68f59ecc6d9d"}, - {file = "onnxruntime-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3cef43737b2cd886d5d718d100f56ec78c9c476c5db5f8f946e95024978fe754"}, - {file = "onnxruntime-1.15.1-cp310-cp310-win32.whl", hash = "sha256:79d7e65abb44a47c633ede8e53fe7b9756c272efaf169758c482c983cca98d7e"}, - {file = "onnxruntime-1.15.1-cp310-cp310-win_amd64.whl", hash = "sha256:8bc4c47682933a7a2c79808688aad5f12581305e182be552de50783b5438e6bd"}, - {file = "onnxruntime-1.15.1-cp311-cp311-macosx_10_15_x86_64.whl", hash = "sha256:652b2cb777f76446e3cc41072dd3d1585a6388aeff92b9de656724bc22e241e4"}, - {file = "onnxruntime-1.15.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:89b86dbed15740abc385055a29c9673a212600248d702737ce856515bdeddc88"}, - {file = "onnxruntime-1.15.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ed5cdd9ee748149a57f4cdfa67187a0d68f75240645a3c688299dcd08742cc98"}, - {file = "onnxruntime-1.15.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2f748cce6a70ed38c19658615c55f4eedb9192765a4e9c4bd2682adfe980698d"}, - {file = "onnxruntime-1.15.1-cp311-cp311-win32.whl", hash = "sha256:e0312046e814c40066e7823da58075992d51364cbe739eeeb2345ec440c3ac59"}, - {file = "onnxruntime-1.15.1-cp311-cp311-win_amd64.whl", hash = "sha256:f0980969689cb956c22bd1318b271e1be260060b37f3ddd82c7d63bd7f2d9a79"}, - {file = "onnxruntime-1.15.1-cp38-cp38-macosx_10_15_x86_64.whl", hash = "sha256:345986cfdbd6f4b20a89b6a6cd9abd3e2ced2926ae0b6e91fefa8149f95c0f09"}, - {file = "onnxruntime-1.15.1-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:a4d7b3ad75e040f1e95757f69826a11051737b31584938a26d466a0234c6de98"}, 
- {file = "onnxruntime-1.15.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3603d07b829bcc1c14963a76103e257aade8861eb208173b300cc26e118ec2f8"}, - {file = "onnxruntime-1.15.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d3df0625b9295daf1f7409ea55f72e1eeb38d54f5769add53372e79ddc3cf98d"}, - {file = "onnxruntime-1.15.1-cp38-cp38-win32.whl", hash = "sha256:f68b47fdf1a0406c0292f81ac993e2a2ae3e8b166b436d590eb221f64e8e187a"}, - {file = "onnxruntime-1.15.1-cp38-cp38-win_amd64.whl", hash = "sha256:52d762d297cc3f731f54fa65a3e329b813164970671547bef6414d0ed52765c9"}, - {file = "onnxruntime-1.15.1-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:99228f9f03dc1fc8af89a28c9f942e8bd3e97e894e263abe1a32e4ddb1f6363b"}, - {file = "onnxruntime-1.15.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:45db7f96febb0cf23e3af147f35c4f8de1a37dd252d1cef853c242c2780250cd"}, - {file = "onnxruntime-1.15.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2bafc112a36db25c821b90ab747644041cb4218f6575889775a2c12dd958b8c3"}, - {file = "onnxruntime-1.15.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:985693d18f2d46aa34fd44d7f65ff620660b2c8fa4b8ec365c2ca353f0fbdb27"}, - {file = "onnxruntime-1.15.1-cp39-cp39-win32.whl", hash = "sha256:708eb31b0c04724bf0f01c1309a9e69bbc09b85beb750e5662c8aed29f1ff9fd"}, - {file = "onnxruntime-1.15.1-cp39-cp39-win_amd64.whl", hash = "sha256:73d6de4c42dfde1e9dbea04773e6dc23346c8cda9c7e08c6554fafc97ac60138"}, + {file = "onnxruntime-1.22.0-cp310-cp310-macosx_13_0_universal2.whl", hash = "sha256:85d8826cc8054e4d6bf07f779dc742a363c39094015bdad6a08b3c18cfe0ba8c"}, + {file = "onnxruntime-1.22.0-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:468c9502a12f6f49ec335c2febd22fdceecc1e4cc96dfc27e419ba237dff5aff"}, + {file = "onnxruntime-1.22.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = 
"sha256:681fe356d853630a898ee05f01ddb95728c9a168c9460e8361d0a240c9b7cb97"}, + {file = "onnxruntime-1.22.0-cp310-cp310-win_amd64.whl", hash = "sha256:20bca6495d06925631e201f2b257cc37086752e8fe7b6c83a67c6509f4759bc9"}, + {file = "onnxruntime-1.22.0-cp311-cp311-macosx_13_0_universal2.whl", hash = "sha256:8d6725c5b9a681d8fe72f2960c191a96c256367887d076b08466f52b4e0991df"}, + {file = "onnxruntime-1.22.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fef17d665a917866d1f68f09edc98223b9a27e6cb167dec69da4c66484ad12fd"}, + {file = "onnxruntime-1.22.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b978aa63a9a22095479c38371a9b359d4c15173cbb164eaad5f2cd27d666aa65"}, + {file = "onnxruntime-1.22.0-cp311-cp311-win_amd64.whl", hash = "sha256:03d3ef7fb11adf154149d6e767e21057e0e577b947dd3f66190b212528e1db31"}, + {file = "onnxruntime-1.22.0-cp312-cp312-macosx_13_0_universal2.whl", hash = "sha256:f3c0380f53c1e72a41b3f4d6af2ccc01df2c17844072233442c3a7e74851ab97"}, + {file = "onnxruntime-1.22.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c8601128eaef79b636152aea76ae6981b7c9fc81a618f584c15d78d42b310f1c"}, + {file = "onnxruntime-1.22.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6964a975731afc19dc3418fad8d4e08c48920144ff590149429a5ebe0d15fb3c"}, + {file = "onnxruntime-1.22.0-cp312-cp312-win_amd64.whl", hash = "sha256:c0d534a43d1264d1273c2d4f00a5a588fa98d21117a3345b7104fa0bbcaadb9a"}, + {file = "onnxruntime-1.22.0-cp313-cp313-macosx_13_0_universal2.whl", hash = "sha256:fe7c051236aae16d8e2e9ffbfc1e115a0cc2450e873a9c4cb75c0cc96c1dae07"}, + {file = "onnxruntime-1.22.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6a6bbed10bc5e770c04d422893d3045b81acbbadc9fb759a2cd1ca00993da919"}, + {file = "onnxruntime-1.22.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = 
"sha256:9fe45ee3e756300fccfd8d61b91129a121d3d80e9d38e01f03ff1295badc32b8"}, + {file = "onnxruntime-1.22.0-cp313-cp313-win_amd64.whl", hash = "sha256:5a31d84ef82b4b05d794a4ce8ba37b0d9deb768fd580e36e17b39e0b4840253b"}, + {file = "onnxruntime-1.22.0-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a2ac5bd9205d831541db4e508e586e764a74f14efdd3f89af7fd20e1bf4a1ed"}, + {file = "onnxruntime-1.22.0-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:64845709f9e8a2809e8e009bc4c8f73b788cee9c6619b7d9930344eae4c9cd36"}, ] [package.dependencies] @@ -2897,9 +2891,10 @@ numpy = [ name = "opencv-python-headless" version = "4.8.1.78" description = "Wrapper package for OpenCV python bindings." -optional = false +optional = true python-versions = ">=3.6" groups = ["main"] +markers = "extra == \"ocr\"" files = [ {file = "opencv-python-headless-4.8.1.78.tar.gz", hash = "sha256:bc7197b42352f6f865c302a49140b889ec7cd957dd697e2d7fc016ad0d3f28f1"}, {file = "opencv_python_headless-4.8.1.78-cp37-abi3-macosx_10_16_x86_64.whl", hash = "sha256:f3a33f644249f9ce1c913eac580e4b3ef4ce7cab0a71900274708959c2feb5e3"}, @@ -3342,21 +3337,21 @@ wcwidth = "*" [[package]] name = "protobuf" -version = "6.31.1" +version = "6.32.1" description = "" optional = false python-versions = ">=3.9" groups = ["main"] files = [ - {file = "protobuf-6.31.1-cp310-abi3-win32.whl", hash = "sha256:7fa17d5a29c2e04b7d90e5e32388b8bfd0e7107cd8e616feef7ed3fa6bdab5c9"}, - {file = "protobuf-6.31.1-cp310-abi3-win_amd64.whl", hash = "sha256:426f59d2964864a1a366254fa703b8632dcec0790d8862d30034d8245e1cd447"}, - {file = "protobuf-6.31.1-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:6f1227473dc43d44ed644425268eb7c2e488ae245d51c6866d19fe158e207402"}, - {file = "protobuf-6.31.1-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:a40fc12b84c154884d7d4c4ebd675d5b3b5283e155f324049ae396b95ddebc39"}, - {file = "protobuf-6.31.1-cp39-abi3-manylinux2014_x86_64.whl", hash = 
"sha256:4ee898bf66f7a8b0bd21bce523814e6fbd8c6add948045ce958b73af7e8878c6"}, - {file = "protobuf-6.31.1-cp39-cp39-win32.whl", hash = "sha256:0414e3aa5a5f3ff423828e1e6a6e907d6c65c1d5b7e6e975793d5590bdeecc16"}, - {file = "protobuf-6.31.1-cp39-cp39-win_amd64.whl", hash = "sha256:8764cf4587791e7564051b35524b72844f845ad0bb011704c3736cce762d8fe9"}, - {file = "protobuf-6.31.1-py3-none-any.whl", hash = "sha256:720a6c7e6b77288b85063569baae8536671b39f15cc22037ec7045658d80489e"}, - {file = "protobuf-6.31.1.tar.gz", hash = "sha256:d8cac4c982f0b957a4dc73a80e2ea24fab08e679c0de9deb835f4a12d69aca9a"}, + {file = "protobuf-6.32.1-cp310-abi3-win32.whl", hash = "sha256:a8a32a84bc9f2aad712041b8b366190f71dde248926da517bde9e832e4412085"}, + {file = "protobuf-6.32.1-cp310-abi3-win_amd64.whl", hash = "sha256:b00a7d8c25fa471f16bc8153d0e53d6c9e827f0953f3c09aaa4331c718cae5e1"}, + {file = "protobuf-6.32.1-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:d8c7e6eb619ffdf105ee4ab76af5a68b60a9d0f66da3ea12d1640e6d8dab7281"}, + {file = "protobuf-6.32.1-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:2f5b80a49e1eb7b86d85fcd23fe92df154b9730a725c3b38c4e43b9d77018bf4"}, + {file = "protobuf-6.32.1-cp39-abi3-manylinux2014_x86_64.whl", hash = "sha256:b1864818300c297265c83a4982fd3169f97122c299f56a56e2445c3698d34710"}, + {file = "protobuf-6.32.1-cp39-cp39-win32.whl", hash = "sha256:68ff170bac18c8178f130d1ccb94700cf72852298e016a2443bdb9502279e5f1"}, + {file = "protobuf-6.32.1-cp39-cp39-win_amd64.whl", hash = "sha256:d0975d0b2f3e6957111aa3935d08a0eb7e006b1505d825f862a1fffc8348e122"}, + {file = "protobuf-6.32.1-py3-none-any.whl", hash = "sha256:2601b779fc7d32a866c6b4404f9d42a3f67c5b9f3f15b4db3cccabe06b95c346"}, + {file = "protobuf-6.32.1.tar.gz", hash = "sha256:ee2469e4a021474ab9baafea6cd070e5bf27c7d29433504ddea1a4ee5850f68d"}, ] [[package]] @@ -3559,21 +3554,6 @@ files = [ {file = "pycparser-2.22.tar.gz", hash = "sha256:491c8be9c040f5390f5bf44a5b07752bd07f56edf992381b05c701439eec10f6"}, ] -[[package]] 
-name = "pycrafter" -version = "0.0.7" -description = "Text extraction from images using ONNX runtime and CRAFT net" -optional = false -python-versions = ">=3.8" -groups = ["main"] -files = [ - {file = "pycrafter-0.0.7-py3-none-any.whl", hash = "sha256:3f11551ab195c96a6aff71190bbd9465e86a4bb8da218a37bdb180805291bc4b"}, -] - -[package.dependencies] -onnxruntime = ">=1.15.0,<1.16.0" -opencv-python-headless = ">=4.8.0.76,<4.9.0.0" - [[package]] name = "pydantic" version = "2.11.3" @@ -5932,8 +5912,9 @@ files = [ llm = [] ml = ["torch", "torchvision", "transformers"] ocr = ["easyocr", "python-doctr", "surya-ocr"] +paddle = [] [metadata] lock-version = "2.1" python-versions = "^3.10" -content-hash = "0beb4123bf5669db09026fe42e9427b2ebe6a571c768176a9b3fa78c32ccd29e" +content-hash = "dbc223fa004895653ea3ee28ab16deef00cc87824450ddc9149056d6bd549ff0" diff --git a/pyproject.toml b/pyproject.toml index 92acb8b..8a13f38 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [tool.poetry] name = "scaledp" -version = "0.2.3rc46" +version = "0.2.4rc5" description = "ScaleDP is a library for processing documents using Apache Spark and LLMs" authors = ["Mykola Melnyk "] repository = "https://github.com/StabRise/scaledp" @@ -39,15 +39,17 @@ tenacity = ">=8.2.3" openai = ">=1.58.0" sparkdantic = "^2.0.0" img2pdf = "^0.6.1" -pycrafter = "^0.0.7" shapely = "^2.1.1" pyclipper = "^1.3.0.post6" +onnxruntime = "1.22.0" + [tool.poetry.extras] ml = ["transformers", "torch", "torchvision"] ocr = ["easyocr", "python-doctr", "surya-ocr"] llm = ["dspy"] +paddle = ["paddleocr", "paddlepaddle",] [[tool.poetry.source]] name = "pytorch_cpu" diff --git a/scaledp/__init__.py b/scaledp/__init__.py index 4cbd9aa..74644e5 100644 --- a/scaledp/__init__.py +++ b/scaledp/__init__.py @@ -13,6 +13,8 @@ from scaledp.image.ImageCropBoxes import ImageCropBoxes from scaledp.image.ImageDrawBoxes import ImageDrawBoxes from scaledp.models.detectors.DocTRTextDetector import DocTRTextDetector +from 
scaledp.models.detectors.LayoutDetector import LayoutDetector +from scaledp.models.detectors.SignatureDetector import SignatureDetector from scaledp.models.detectors.YoloDetector import YoloDetector from scaledp.models.detectors.YoloOnnxDetector import YoloOnnxDetector from scaledp.models.extractors.DSPyExtractor import DSPyExtractor @@ -206,12 +208,14 @@ def ScaleDPSession( "TesseractOcr", "Ner", "TextToDocument", + "LayoutDetector", "PipelineModel", "SuryaOcr", "EasyOcr", "DocTROcr", "YoloDetector", "YoloOnnxDetector", + "SignatureDetector", "ImageCropBoxes", "DSPyExtractor", "TesseractRecognizer", diff --git a/scaledp/image/ImageDrawBoxes.py b/scaledp/image/ImageDrawBoxes.py index 2843a0b..f7ec526 100644 --- a/scaledp/image/ImageDrawBoxes.py +++ b/scaledp/image/ImageDrawBoxes.py @@ -162,11 +162,20 @@ def get_color(): return Image.from_pil(img, image.path, image.imageType, image.resolution) def draw_boxes(self, data, fill, img1): - color = "green" if self.getColor() is None else self.getColor() + colors = {} for b in data.bboxes: box = b if not isinstance(box, Box): box = Box(**box.asDict()) + + # Group by Box.text field for color consistency + text_key = box.text if hasattr(box, "text") and box.text else "default" + + if text_key not in colors: + colors[text_key] = "#{:06x}".format(random.randint(0, 0xFFFFFF)) + + color = colors[text_key] if self.getColor() is None else self.getColor() + self.draw_box(box, color, fill, img1) text = self.getDisplayText(box) if text: diff --git a/scaledp/models/detectors/BaseDetector.py b/scaledp/models/detectors/BaseDetector.py index 8fbfa95..e96eefe 100644 --- a/scaledp/models/detectors/BaseDetector.py +++ b/scaledp/models/detectors/BaseDetector.py @@ -122,6 +122,6 @@ def transform_udf(self, image, params=None): logging.info("Call detector on image") result = self.call_detector([(resized_image, image.path)], params) except Exception as e: exception = traceback.format_exc() exception = ( f"{self.uid}: Error in object 
detection: {exception}, {image.exception}" diff --git a/scaledp/models/detectors/DBNetOnnxDetector.py b/scaledp/models/detectors/DBNetOnnxDetector.py index cbeb905..65a86f6 100644 --- a/scaledp/models/detectors/DBNetOnnxDetector.py +++ b/scaledp/models/detectors/DBNetOnnxDetector.py @@ -88,6 +88,15 @@ def call_detector(cls, images, params): if params["onlyRotated"]: boxes = [box for box in boxes if box.is_rotated()] + + # Merge overlapping boxes before returning, only if on the same line and similar angle + boxes = Box.merge_overlapping_boxes( + boxes, + iou_threshold=0.02, # IoU threshold for merging + angle_thresh=10.0, # only merge if angle difference < 10 degrees + line_thresh=0.3, # only merge if centers are closer + # than half the box height + ) + results_final.append( DetectorOutput(path=image_path, type="DBNetOnnx", bboxes=boxes), ) diff --git a/scaledp/models/detectors/HasDetectLineOrientation.py b/scaledp/models/detectors/HasDetectLineOrientation.py new file mode 100644 index 0000000..7670e35 --- /dev/null +++ b/scaledp/models/detectors/HasDetectLineOrientation.py @@ -0,0 +1,91 @@ +from pathlib import Path +from typing import Any, ClassVar + +import cv2 +import numpy as np +import onnxruntime as ort +from huggingface_hub import hf_hub_download +from pyspark.ml.param import Param, Params, TypeConverters + + +class HasDetectLineOrientation(Params): + """ + Mixin providing the detectLineOrientation param + and the text line orientation detection logic. + """ + + detectLineOrientation = Param( + Params._dummy(), + "detectLineOrientation", + "Whether to detect line orientation.", + typeConverter=TypeConverters.toBoolean, + ) + + oriModel = Param( + Params._dummy(), + "oriModel", + "Text line Orientation Model.", + typeConverter=TypeConverters.toString, + ) + + def getOriModel(self) -> str: + """ + Gets the value of oriModel or its default value. 
+ """ + return self.getOrDefault(self.oriModel) + + def setOriModel(self, value: str) -> Any: + """ + Sets the value of :py:attr:`oriModel`. + """ + return self._set(oriModel=value) + + _orientation_session: ClassVar = None + _orientation_input_name: ClassVar = None + _orientation_label_list: ClassVar = ["0_degree", "180_degree"] + + def setDetectLineOrientation(self, value: bool): + return self._set(detectLineOrientation=value) + + def getDetectLineOrientation(self) -> bool: + return self.getOrDefault(self.detectLineOrientation) + + @classmethod + def _load_orientation_model(cls, params): + if cls._orientation_session is None: + model_path = params.get("oriModel") + if not Path(model_path).is_file(): + model_path = hf_hub_download(repo_id=model_path, filename="model.onnx") + cls._orientation_session = ort.InferenceSession(model_path) + cls._orientation_input_name = cls._orientation_session.get_inputs()[0].name + + @staticmethod + def _preprocess_for_orientation(pil_img): + img = np.array(pil_img) + img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR) + img = cv2.resize(img, (160, 80)) # width, height + img = img.astype(np.float32) / 255.0 + mean = np.array([0.485, 0.456, 0.406]) + std = np.array([0.229, 0.224, 0.225]) + img = (img - mean) / std + img = np.transpose(img, (2, 0, 1)) # HWC -> CHW + img = np.expand_dims(img, 0) # batch + return img.astype(np.float32) + + @classmethod + def detect_orientation(cls, pil_img, params): + """Detects orientation (0 or 180 degrees) of a PIL image.""" + cls._load_orientation_model(params) + inp = cls._preprocess_for_orientation(pil_img) + outputs = cls._orientation_session.run(None, {cls._orientation_input_name: inp}) + pred_idx = np.argmax(outputs[0], axis=1)[0] + pred_label = cls._orientation_label_list[pred_idx] + return pred_label + + @classmethod + def auto_orient_image(cls, pil_img, params): + """Rotates the image to 0 degrees if needed.""" + orientation = cls.detect_orientation(pil_img, params) + if orientation == "180_degree": + return 
pil_img.rotate(180, expand=True), orientation + return pil_img, orientation diff --git a/scaledp/models/detectors/LayoutDetector.py b/scaledp/models/detectors/LayoutDetector.py new file mode 100644 index 0000000..53d72b9 --- /dev/null +++ b/scaledp/models/detectors/LayoutDetector.py @@ -0,0 +1,150 @@ +import gc +import logging +from types import MappingProxyType +from typing import Any + +import numpy as np +from pyspark import keyword_only + +from scaledp.enums import Device +from scaledp.models.detectors.BaseDetector import BaseDetector +from scaledp.params import HasBatchSize, HasDevice, HasModel, HasWhiteList +from scaledp.schemas.Box import Box +from scaledp.schemas.DetectorOutput import DetectorOutput + + +class LayoutDetector(BaseDetector, HasDevice, HasBatchSize, HasWhiteList, HasModel): + _model = None + + defaultParams = MappingProxyType( + { + "inputCol": "image", + "outputCol": "layout_boxes", + "keepInputData": False, + "scaleFactor": 1.0, + "scoreThreshold": 0.5, + "device": Device.CPU, + "batchSize": 2, + "partitionMap": False, + "numPartitions": 0, + "pageCol": "page", + "pathCol": "path", + "propagateError": False, + "onlyRotated": False, + "model": "PP-DocLayout_plus-L", + "whiteList": [], + }, + ) + + @keyword_only + def __init__(self, **kwargs: Any) -> None: + super(LayoutDetector, self).__init__() + self._setDefault(**self.defaultParams) + self._set(**kwargs) + self.get_model({k.name: v for k, v in self.extractParamMap().items()}) + + @classmethod + def get_model(cls, params): + logging.info("Loading PaddleOCR LayoutDetection model...") + if cls._model: + return cls._model + + try: + from paddleocr import LayoutDetection + except ImportError as e: + raise ImportError( + "PaddleOCR is not installed. 
Please install it with: pip install paddleocr", + ) from e + + # Get model name from params or use default + model_name = params.get("model", "PP-DocLayout_plus-L") + + # Initialize LayoutDetection model + device = "gpu" if int(params["device"]) == Device.CUDA.value else "cpu" + cls._model = LayoutDetection( + model_name=model_name, + enable_hpi=False, + device=device, + ) + + logging.info( + f"PaddleOCR LayoutDetection model '{model_name}' loaded successfully", + ) + return cls._model + + @classmethod + def call_detector(cls, images, params): + logging.info("Running LayoutDetector") + + detector = cls.get_model(params) + layout_types = params.get("whiteList", []) + + logging.info("Process images for layout detection") + results_final = [] + + for image, image_path in images: + boxes = [] + + # Convert PIL to NumPy (RGB) + image_np = np.array(image) + + try: + # Run layout analysis using LayoutDetection + result = detector.predict(input=image_np) + + if result and len(result) > 0: + # LayoutDetection returns a list of layout regions + result = result[0] + if isinstance(result, dict) and "boxes" in result: + for layout_item in result["boxes"]: + bbox = layout_item["coordinate"] # Bounding box coordinates + layout_type = layout_item["label"] # Layout type + confidence = layout_item.get("score", 1.0) + + # Filter by layout type if specified + if layout_types and layout_type not in layout_types: + continue + + # Filter by confidence threshold + if confidence < params["scoreThreshold"]: + continue + + # Convert bbox to Box format + # LayoutDetection returns bbox as [x1, y1, x2, y2] + if len(bbox) == 4: + x = bbox[0] + y = bbox[1] + width = bbox[2] - bbox[0] + height = bbox[3] - bbox[1] + + # Create Box with layout type as text + box = Box( + text=layout_type, + score=confidence, + x=int(x), + y=int(y), + width=int(width), + height=int(height), + ) + + # Add polygon points if needed for rotated boxes + if len(bbox) == 4: + box.polygon = bbox + + boxes.append(box) + + 
except Exception as e: + logging.warning(f"Error in layout detection for {image_path}: {e!s}") + if params.get("propagateError", False): + raise e + + if params.get("onlyRotated", False): + boxes = [box for box in boxes if box.is_rotated()] + + results_final.append( + DetectorOutput(path=image_path, type="layout", bboxes=boxes), + ) + + gc.collect() + + return results_final diff --git a/scaledp/models/detectors/SignatureDetector.py b/scaledp/models/detectors/SignatureDetector.py new file mode 100644 index 0000000..8c8613e --- /dev/null +++ b/scaledp/models/detectors/SignatureDetector.py @@ -0,0 +1,5 @@ +from scaledp.models.detectors.YoloOnnxDetector import YoloOnnxDetector + + +class SignatureDetector(YoloOnnxDetector): + pass diff --git a/scaledp/models/detectors/YoloOnnxDetector.py b/scaledp/models/detectors/YoloOnnxDetector.py index aac0fcc..90ab708 100644 --- a/scaledp/models/detectors/YoloOnnxDetector.py +++ b/scaledp/models/detectors/YoloOnnxDetector.py @@ -2,7 +2,7 @@ import logging from pathlib import Path from types import MappingProxyType -from typing import Any +from typing import Any, ClassVar import numpy as np from huggingface_hub import hf_hub_download @@ -18,7 +18,7 @@ class YoloOnnxDetector(BaseDetector, HasDevice, HasBatchSize): - _model = None + _model: ClassVar = {} task = Param( Params._dummy(), @@ -56,20 +56,25 @@ def __init__(self, **kwargs: Any) -> None: @classmethod def get_model(cls, params): - logging.info("Loading model...") - if cls._model: - return cls._model + model_path = params["model"] - model = params["model"] - if not Path(model).is_file(): - model = hf_hub_download(repo_id=model, filename="model.onnx") + logging.info("Loading model...") + if cls._model and model_path in cls._model: + return cls._model.get(model_path) + + model_path_final = model_path + if not Path(model_path).is_file(): + model_path_final = hf_hub_download( + repo_id=model_path, + filename="model.onnx", + ) logging.info("Model downloaded") - detector = YOLO(model, params["scoreThreshold"]) + 
detector = YOLO(model_path_final, params["scoreThreshold"]) - cls._model = detector - return cls._model + cls._model[model_path] = detector + return cls._model[model_path] @classmethod def call_detector(cls, images, params): diff --git a/scaledp/models/detectors/paddle_onnx/predict_det.py b/scaledp/models/detectors/paddle_onnx/predict_det.py index 39af222..165ffc3 100644 --- a/scaledp/models/detectors/paddle_onnx/predict_det.py +++ b/scaledp/models/detectors/paddle_onnx/predict_det.py @@ -13,7 +13,7 @@ def __init__(self, det_model_dir, use_gpu): pre_process_list = [ { "DetResizeForTest": { - "image_shape": [960, 960], + "image_shape": [1280, 1280], # "limit_side_len": 960, # "resize_long": 960, "limit_type": "max", @@ -33,10 +33,10 @@ def __init__(self, det_model_dir, use_gpu): ] postprocess_params = {} postprocess_params["name"] = "DBPostProcess" - postprocess_params["thresh"] = 0.3 # args.det_db_thresh - postprocess_params["box_thresh"] = 0.4 # args.det_db_box_thresh + postprocess_params["thresh"] = 0.5 # args.det_db_thresh + postprocess_params["box_thresh"] = 0.3 # args.det_db_box_thresh postprocess_params["max_candidates"] = 1000 - postprocess_params["unclip_ratio"] = 1.7 # args.det_db_unclip_ratio + postprocess_params["unclip_ratio"] = 2.5 # 1.7 # args.det_db_unclip_ratio postprocess_params["use_dilation"] = False # args.use_dilation postprocess_params["score_mode"] = "fast" # args.det_db_score_mode postprocess_params["box_type"] = "quad" # args.det_box_type diff --git a/scaledp/models/detectors/yolo/yolo.py b/scaledp/models/detectors/yolo/yolo.py index 8eccabc..dcc2027 100644 --- a/scaledp/models/detectors/yolo/yolo.py +++ b/scaledp/models/detectors/yolo/yolo.py @@ -1,5 +1,6 @@ -from typing import Any +from typing import Any, Tuple +import logging import cv2 import numpy as np import onnxruntime @@ -13,6 +14,13 @@ def __init__(self, path, conf_thres=0.7, iou_thres=0.5) -> None: self.conf_threshold = conf_thres self.iou_threshold = iou_thres + # Store original 
image dimensions and scaling info + self.original_width = None + self.original_height = None + self.scale_factor = None + self.pad_x = None + self.pad_y = None + # Initialize model self.initialize_model(path) @@ -37,13 +45,101 @@ def detect_objects(self, image): return self.boxes, self.scores, self.class_ids + def rescale_image_with_padding( + self, image: np.ndarray, target_size: Tuple[int, int] + ) -> np.ndarray: + """ + Rescale image while keeping aspect ratio and pad with white background. + + Args: + image: Input image (H, W, C) + target_size: Target size (width, height) + + Returns: + Rescaled and padded image + """ + self.original_height, self.original_width = image.shape[:2] + target_width, target_height = target_size + + # Calculate scaling factor to maintain aspect ratio + scale_w = target_width / self.original_width + scale_h = target_height / self.original_height + self.scale_factor = min(scale_w, scale_h) + + # Calculate new dimensions + new_width = int(self.original_width * self.scale_factor) + new_height = int(self.original_height * self.scale_factor) + + # Resize image + resized_image = cv2.resize(image, (new_width, new_height)) + + # Calculate padding to center the image + self.pad_x = (target_width - new_width) // 2 + self.pad_y = (target_height - new_height) // 2 + + # Create padded image with white background + padded_image = np.full((target_height, target_width, 3), 255, dtype=np.uint8) + + # Calculate the actual placement bounds to avoid index errors + end_y = min(self.pad_y + new_height, target_height) + end_x = min(self.pad_x + new_width, target_width) + + # Adjust the resized image if it exceeds target bounds + actual_height = end_y - self.pad_y + actual_width = end_x - self.pad_x + + # Place the resized image in the center of the padded image + padded_image[self.pad_y : end_y, self.pad_x : end_x] = resized_image[ + :actual_height, :actual_width + ] + + return padded_image + + def restore_coordinates(self, boxes: np.ndarray) -> np.ndarray: 
+ """ + Restore bounding box coordinates to original image space. + + Args: + boxes: Bounding boxes in model input space (N, 4) [x1, y1, x2, y2] + + Returns: + Bounding boxes in original image space + """ + if len(boxes) == 0: + return boxes + + restored_boxes = boxes.copy() + + # Remove padding offset + restored_boxes[:, [0, 2]] -= self.pad_x # x coordinates + restored_boxes[:, [1, 3]] -= self.pad_y # y coordinates + + # Scale back to original size + restored_boxes = restored_boxes / self.scale_factor + + # Clip to original image bounds + restored_boxes[:, [0, 2]] = np.clip( + restored_boxes[:, [0, 2]], 0, self.original_width + ) + restored_boxes[:, [1, 3]] = np.clip( + restored_boxes[:, [1, 3]], 0, self.original_height + ) + + return restored_boxes + def prepare_input(self, image): + # Store original dimensions for coordinate restoration self.img_height, self.img_width = image.shape[:2] input_img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) + print(input_img.shape) + print(self.input_shape) + print(f"Input width: {self.input_width}, Input height: {self.input_height}") - # Resize input image - input_img = cv2.resize(input_img, (self.input_width, self.input_height)) + # Rescale image with padding instead of simple resize + input_img = self.rescale_image_with_padding( + input_img, (self.input_width, self.input_height) + ) # Scale input pixel values to 0 to 1 input_img = input_img / 255.0 @@ -73,16 +169,16 @@ def process_output(self, output): # Apply non-maxima suppression to suppress weak, overlapping bounding boxes indices = multiclass_nms(boxes, scores, class_ids, self.iou_threshold) - return boxes[indices], scores[indices], class_ids[indices] + # Restore coordinates to original image space + final_boxes = self.restore_coordinates(boxes[indices]) + + return final_boxes, scores[indices], class_ids[indices] def extract_boxes(self, predictions): # Extract boxes from predictions boxes = predictions[:, :4] - # Scale boxes to original image dimensions - boxes = 
self.rescale_boxes(boxes) - - # Convert boxes to xyxy format + # Convert boxes to xyxy format (no rescaling yet, done in restore_coordinates) return xywh2xyxy(boxes) def rescale_boxes(self, boxes): diff --git a/scaledp/models/extractors/GeminiVisualExtractor.py b/scaledp/models/extractors/GeminiVisualExtractor.py index 966ca06..640af0c 100644 --- a/scaledp/models/extractors/GeminiVisualExtractor.py +++ b/scaledp/models/extractors/GeminiVisualExtractor.py @@ -18,7 +18,7 @@ class GeminiVisualExtractor(BaseVisualExtractor, HasLLM, HasSchema, HasPrompt): "inputCol": "image", "outputCol": "data", "keepInputData": True, - "model": "gemini-1.5-flash", + "model": "gemini-2.5-flash", "apiBase": "", "apiKey": "", "numPartitions": 1, diff --git a/scaledp/models/extractors/LLMVisualExtractor.py b/scaledp/models/extractors/LLMVisualExtractor.py index 99a9681..9f7d819 100644 --- a/scaledp/models/extractors/LLMVisualExtractor.py +++ b/scaledp/models/extractors/LLMVisualExtractor.py @@ -24,7 +24,7 @@ class LLMVisualExtractor(BaseVisualExtractor, HasLLM, HasSchema, HasPrompt): "inputCol": "image", "outputCol": "data", "keepInputData": True, - "model": "gemini-1.5-flash", + "model": "gemini-2.5-flash", "apiBase": None, "apiKey": None, "numPartitions": 1, diff --git a/scaledp/models/ner/LLMNer.py b/scaledp/models/ner/LLMNer.py index b7276b0..c44717b 100644 --- a/scaledp/models/ner/LLMNer.py +++ b/scaledp/models/ner/LLMNer.py @@ -44,7 +44,7 @@ class LLMNer(BaseNer, HasLLM, HasPrompt, HasPropagateExc): "pathCol": "path", "systemPrompt": "You are excellent NER tag extractor.", "prompt": """Please extract text from the image.""", - "model": "gemini-1.5-flash-8b", + "model": "gemini-2.5-flash-lite", "apiBase": "", "apiKey": "", "delay": 30, diff --git a/scaledp/models/recognizers/LLMOcr.py b/scaledp/models/recognizers/LLMOcr.py index a1c7fb7..d4d5daa 100644 --- a/scaledp/models/recognizers/LLMOcr.py +++ b/scaledp/models/recognizers/LLMOcr.py @@ -36,7 +36,7 @@ class LLMOcr(BaseOcr, HasLLM, 
HasPrompt): "pathCol": "path", "systemPrompt": "You are ocr.", "prompt": """Please extract text from the image.""", - "model": "gemini-1.5-flash", + "model": "gemini-2.5-flash", "apiBase": None, "apiKey": None, "delay": 30, diff --git a/scaledp/models/recognizers/TesseractRecognizer.py b/scaledp/models/recognizers/TesseractRecognizer.py index a299a49..c2c6a99 100644 --- a/scaledp/models/recognizers/TesseractRecognizer.py +++ b/scaledp/models/recognizers/TesseractRecognizer.py @@ -2,10 +2,12 @@ from types import MappingProxyType from typing import Any +import cv2 import numpy as np from pyspark import keyword_only from pyspark.ml.param import Param, Params, TypeConverters +from scaledp.models.detectors.HasDetectLineOrientation import HasDetectLineOrientation from scaledp.params import CODE_TO_LANGUAGE, LANGUAGE_TO_TESSERACT_CODE from scaledp.schemas.Box import Box from scaledp.schemas.Document import Document @@ -14,7 +16,7 @@ from .BaseRecognizer import BaseRecognizer -class TesseractRecognizer(BaseRecognizer): +class TesseractRecognizer(BaseRecognizer, HasDetectLineOrientation): """ Run Tesseract text recognition on images. 
""" @@ -40,6 +42,13 @@ class TesseractRecognizer(BaseRecognizer): typeConverter=TypeConverters.toInt, ) + onlyRotated = Param( + Params._dummy(), + "onlyRotated", + "Return only rotated boxes.", + typeConverter=TypeConverters.toBoolean, + ) + defaultParams = MappingProxyType( { "inputCols": ["image", "boxes"], @@ -57,6 +66,9 @@ class TesseractRecognizer(BaseRecognizer): "numPartitions": 0, "pageCol": "page", "pathCol": "path", + "detectLineOrientation": True, + "onlyRotated": True, + "oriModel": "StabRise/line_orientation_detection_v0.1", }, ) @@ -77,17 +89,8 @@ def getLangTess(params): @classmethod def _prepare_box_for_ocr(cls, image_np, box, params): - import cv2 - - # Ensure box is Box instance - if isinstance(box, dict): - box = Box(**box) - elif not isinstance(box, Box): - box = Box(**box.asDict()) scaled_box = box.scale(params["scaleFactor"], padding=5) - if scaled_box.angle == 90: - scaled_box.angle = -90 center_tuple = ( scaled_box.x + scaled_box.width / 2, @@ -122,7 +125,6 @@ def _prepare_box_for_ocr(cls, image_np, box, params): @classmethod def _convert_to_pil(cls, cropped_np): - import cv2 from PIL import Image if cropped_np is None or cropped_np.size == 0: @@ -151,12 +153,30 @@ def _process_image_with_tesseract(cls, image, image_path, detected_boxes, params lang=lang, ) as api: api.SetVariable("debug_file", "ocr.log") - for box in detected_boxes.bboxes: + for box_raw in detected_boxes.bboxes: + # Ensure box is Box instance + if isinstance(box_raw, dict): + box = Box(**box_raw) + elif not isinstance(box_raw, Box): + box = Box(**box_raw.asDict()) + else: + box = box_raw cropped_np = cls._prepare_box_for_ocr(image_np, box, params) + # Auto-orient the image before OCR + pil_image = cls._convert_to_pil(cropped_np) + if params["detectLineOrientation"]: + pil_image, orientation = cls.auto_orient_image(pil_image, params) if pil_image is None: continue + if ( + params["onlyRotated"] + and not box.is_rotated() + and orientation != "180_degree" + ): + continue + 
api.SetImage(pil_image) api.Recognize(0) b = box @@ -164,7 +184,6 @@ def _process_image_with_tesseract(cls, image, image_path, detected_boxes, params b = Box(**b) b.text = api.GetUTF8Text() b.conf = api.MeanTextConf() - if b.score > params["scoreThreshold"]: boxes.append(b) texts.append(b.text) @@ -199,12 +218,9 @@ def call_tesserocr(cls, images, detected_boxes, params): # pragma: no cover results = [] lang = cls.getLangTess(params) - with PyTessBaseAPI( - path=params["tessDataPath"], - psm=PSM.SINGLE_WORD, - oem=params["oem"], - lang=lang, - ) as api: + with PyTessBaseAPI() as api: + api.Init(params["tessDataPath"], lang, oem=params["oem"]) + api.SetPageSegMode(PSM.SINGLE_WORD) api.SetVariable("debug_file", "ocr.log") for (image, image_path), detected_box in zip(images, detected_boxes): diff --git a/scaledp/pdf/__init__.py b/scaledp/pdf/__init__.py index e69de29..8b13789 100644 --- a/scaledp/pdf/__init__.py +++ b/scaledp/pdf/__init__.py @@ -0,0 +1 @@ + diff --git a/scaledp/pipeline/PandasPipeline.py b/scaledp/pipeline/PandasPipeline.py index 939d9f8..974abb1 100644 --- a/scaledp/pipeline/PandasPipeline.py +++ b/scaledp/pipeline/PandasPipeline.py @@ -2,6 +2,7 @@ import json import logging import time +from concurrent.futures import ThreadPoolExecutor from pathlib import Path from typing import Any, ClassVar, List @@ -38,8 +39,10 @@ def __init__( self.returnType = returnType def __call__(self, *cols: Any) -> Any: - cols = zip(*cols) - return [self.func(*i) for i in cols] + cols = list(zip(*cols)) + with ThreadPoolExecutor() as executor: + results = list(executor.map(lambda args: self.func(*args), cols)) + return results def _wrapped(self) -> Any: return self diff --git a/scaledp/schemas/Box.py b/scaledp/schemas/Box.py index 0e6285c..c3e1599 100644 --- a/scaledp/schemas/Box.py +++ b/scaledp/schemas/Box.py @@ -154,7 +154,113 @@ def from_polygon( ) def is_rotated(self) -> bool: - return abs(self.angle) >= 10 + return abs(self.angle) >= 3 + + @staticmethod +
def iou(box1: "Box", box2: "Box") -> float: + """Compute Intersection over Union (IoU) between two boxes.""" + x1 = max(box1.x, box2.x) + y1 = max(box1.y, box2.y) + x2 = min(box1.x + box1.width, box2.x + box2.width) + y2 = min(box1.y + box1.height, box2.y + box2.height) + inter_area = max(0, x2 - x1) * max(0, y2 - y1) + if inter_area == 0: + return 0.0 + box1_area = box1.width * box1.height + box2_area = box2.width * box2.height + union_area = box1_area + box2_area - inter_area + return inter_area / union_area + + @staticmethod + def merge(box1: "Box", box2: "Box") -> "Box": + """Merge two boxes into one by taking the minimal bounding rectangle.""" + x1 = min(box1.x, box2.x) + y1 = min(box1.y, box2.y) + x2 = max(box1.x + box1.width, box2.x + box2.width) + y2 = max(box1.y + box1.height, box2.y + box2.height) + return Box( + text=box1.text or box2.text, + score=max(box1.score, box2.score), + x=x1, + y=y1, + width=x2 - x1, + height=y2 - y1, + angle=0.0, + ) + + @staticmethod + def is_on_same_line( + box1: "Box", + box2: "Box", + angle_thresh: float = 10.0, + line_thresh: float = 0.5, + ) -> bool: + """Check if two boxes are on the same text line. 
+ + - angle_thresh: maximum allowed angle difference (degrees) + - line_thresh: maximum allowed normalized center difference + (as a fraction of height for horizontal text) + """ + # Check angle similarity + ret = None + if abs(box1.angle - box2.angle) > angle_thresh: + return False + # For horizontal text (angle near 0) + if abs(box1.angle) < angle_thresh: + # Check if vertical centers are close + y1 = box1.y + box1.height / 2 + y2 = box2.y + box2.height / 2 + avg_height = (box1.height + box2.height) / 2 + ret = abs(y1 - y2) < avg_height * line_thresh + else: + # For rotated text, project centers onto the perpendicular direction + import math + + theta = math.radians(box1.angle) + # Direction perpendicular to text line + dx = -math.sin(theta) + dy = math.cos(theta) + c1x = box1.x + box1.width / 2 + c1y = box1.y + box1.height / 2 + c2x = box2.x + box2.width / 2 + c2y = box2.y + box2.height / 2 + # Project difference onto perpendicular direction + perp_dist = abs((c2x - c1x) * dx + (c2y - c1y) * dy) + avg_dim = (box1.height + box2.height) / 2 + ret = perp_dist < avg_dim * line_thresh + return ret + + @staticmethod + def merge_overlapping_boxes( + boxes: list["Box"], + iou_threshold: float = 0.3, + angle_thresh: float = 10.0, + line_thresh: float = 0.5, + ) -> list["Box"]: + """ + Merge all overlapping boxes in a list using a greedy algorithm, + but only if they are on the same line and have similar angle. 
+ """ + merged = [] + used = [False] * len(boxes) + for i, box in enumerate(boxes): + if used[i]: + continue + curr = box + for j in range(i + 1, len(boxes)): + if used[j]: + continue + if Box.iou(curr, boxes[j]) > iou_threshold and Box.is_on_same_line( + curr, + boxes[j], + angle_thresh=angle_thresh, + line_thresh=line_thresh, + ): + curr = Box.merge(curr, boxes[j]) + used[j] = True + merged.append(curr) + used[i] = True + return merged register_type(Box, Box.get_schema) diff --git a/tests/conftest.py b/tests/conftest.py index df35f95..70440f3 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -23,6 +23,20 @@ def image_file(resource_path_root): return (resource_path_root / "images/Invoice.png").absolute().as_posix() +@pytest.fixture +def image_rotated_text_file(resource_path_root): + return (resource_path_root / "images/RotatedText1.png").absolute().as_posix() + + +@pytest.fixture +def image_rotated_text_df(spark_session, image_rotated_text_file): + df = spark_session.read.format("binaryFile").load( + image_rotated_text_file, + ) + bin_to_image = DataToImage().setImageType(ImageType.WEBP.value) + return bin_to_image.transform(df) + + @pytest.fixture def receipt_file(resource_path_root): return (resource_path_root / "images" / "receipt.jpg").absolute().as_posix() @@ -100,11 +114,28 @@ def image_pdf_df(spark_session, resource_path_root): ) +@pytest.fixture +def signatures_pdf_df(spark_session, resource_path_root): + return spark_session.read.format("binaryFile").load( + (resource_path_root / "pdfs" / "signatures.pdf").absolute().as_posix(), + ) + + +@pytest.fixture +def signatures_pdf_file(spark_session, resource_path_root): + return (resource_path_root / "pdfs" / "signatures.pdf").absolute().as_posix() + + @pytest.fixture def pdf_file(resource_path_root): return (resource_path_root / "pdfs/unipdf-medical-bill.pdf").absolute().as_posix() +@pytest.fixture +def pdf_report_file(resource_path_root): + return (resource_path_root / 
"pdfs/sample-report.pdf").absolute().as_posix() + + @pytest.fixture def image_df(spark_session, resource_path_root): df = spark_session.read.format("binaryFile").load( @@ -150,6 +181,15 @@ def image_qr_code_df(spark_session, resource_path_root): return bin_to_image.transform(df) +@pytest.fixture +def image_signature_df(spark_session, resource_path_root): + df = spark_session.read.format("binaryFile").load( + (resource_path_root / "images" / "signature.png").absolute().as_posix(), + ) + bin_to_image = DataToImage().setImageType(ImageType.WEBP.value) + return bin_to_image.transform(df) + + @pytest.fixture def receipt_json(receipt_json_path: Path) -> Path: return receipt_json_path.open("r").read() diff --git a/tests/models/detectors/test_dbnet_onnx_text_detector.py b/tests/models/detectors/test_dbnet_onnx_text_detector.py index 9ea91e5..1e1ea51 100644 --- a/tests/models/detectors/test_dbnet_onnx_text_detector.py +++ b/tests/models/detectors/test_dbnet_onnx_text_detector.py @@ -1,3 +1,4 @@ +import logging import tempfile from pyspark.ml import PipelineModel @@ -10,12 +11,12 @@ from scaledp.models.detectors.DBNetOnnxDetector import DBNetOnnxDetector -def test_dbnet_detector(image_df): +def test_dbnet_detector(image_rotated_text_df): detector = DBNetOnnxDetector( - model="StabRise/text_detection_dbnet_ml_v0.1", + model="StabRise/text_detection_dbnet_ml_v0.2", keepInputData=True, - onlyRotated=True, + onlyRotated=False, ) ocr = TesseractRecognizer( @@ -40,7 +41,7 @@ def test_dbnet_detector(image_df): ) # Transform the image dataframe through the OCR stage pipeline = PipelineModel(stages=[detector, ocr, draw]) - result = pipeline.transform(image_df) + result = pipeline.transform(image_rotated_text_df) data = result.collect() @@ -56,4 +57,4 @@ def test_dbnet_detector(image_df): temp.close() # Print the path to the temporary file - print("file://" + temp.name) + logging.info("file://" + temp.name) diff --git a/tests/models/detectors/test_layout_detector.py 
b/tests/models/detectors/test_layout_detector.py new file mode 100644 index 0000000..8b982d6 --- /dev/null +++ b/tests/models/detectors/test_layout_detector.py @@ -0,0 +1,128 @@ +import logging +import tempfile +import warnings + +import pytest +from pyspark.ml import PipelineModel + +from scaledp import ImageDrawBoxes +from scaledp.enums import Device +from scaledp.models.detectors.LayoutDetector import LayoutDetector + + +@pytest.fixture(autouse=True) +def suppress_warnings(): + """Suppress SWIG deprecation warnings.""" + with warnings.catch_warnings(): + warnings.filterwarnings( + "ignore", + category=DeprecationWarning, + module="importlib._bootstrap", + ) + yield + + +@pytest.fixture +def layout_detector(): + return LayoutDetector( + inputCol="image", + outputCol="layout_boxes", + scoreThreshold=0.5, + device=Device.CPU, + whiteList=[], + model="PP-DocLayout_plus-L", + propagateError=True, + keepInputData=True, + ) + + +def test_layout_detector_initialization(layout_detector): + """Test that LayoutDetector initializes correctly.""" + assert layout_detector.getInputCol() == "image" + assert layout_detector.getOutputCol() == "layout_boxes" + assert layout_detector.getScoreThreshold() == 0.5 + assert layout_detector.getDevice() == Device.CPU + assert layout_detector.getWhiteList() == [] + assert layout_detector.getModel() == "PP-DocLayout_plus-L" + + +def test_layout_detector_with_drawn_boxes(image_df): + """Test LayoutDetector with drawn boxes on the original image.""" + detector = LayoutDetector( + inputCol="image", + outputCol="layout_boxes", + scoreThreshold=0.5, + device=Device.CPU, + whiteList=["text", "title", "list", "table", "figure"], + model="PP-DocLayout_plus-L", + keepInputData=True, + ) + + # Create draw component to visualize detected boxes + draw = ImageDrawBoxes( + keepInputData=True, + inputCols=["image", "layout_boxes"], + filled=False, + color="blue", + lineWidth=4, + displayDataList=["text", "score"], + ) + + try: + + # Create a pipeline with 
detector and draw components + pipeline = PipelineModel(stages=[detector, draw]) + result = pipeline.transform(image_df) + + data = result.collect() + + # Verify the pipeline result + assert len(data) == 1, "Expected exactly one result" + + # Check that the output column exists and has the expected structure + assert hasattr(data[0], "layout_boxes"), "Expected layout_boxes column" + assert data[0].layout_boxes.type == "layout" + assert isinstance(data[0].layout_boxes.bboxes, list) + + # Check that the image with boxes was created + assert hasattr(data[0], "image_with_boxes"), "Expected image_with_boxes column" + + # Save the output image to a temporary file for verification + with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as temp: + temp.write(data[0].image_with_boxes.data) + temp.close() + + # Print the path to the temporary file + logging.info("file://" + temp.name) + + except ImportError: + pytest.skip("PaddleOCR not installed") + except Exception as e: + # Handle other exceptions that might occur during processing + assert "Error in object detection" in str(e) or "PaddleOCR" in str(e) + + +def test_layout_detector_with_custom_layout_types(): + """Test LayoutDetector with custom layout types.""" + detector = LayoutDetector( + inputCol="image", + outputCol="layout_boxes", + whiteList=["text", "table"], # Only detect text and table + model="PP-DocLayout-M", # Use different model + keepInputData=True, + ) + + assert detector.getWhiteList() == ["text", "table"] + assert detector.getModel() == "PP-DocLayout-M" + + +def test_layout_detector_output_schema(layout_detector): + """Test that the output schema is correct.""" + schema = layout_detector.outputSchema() + + # Check that the schema has the expected fields + field_names = [field.name for field in schema.fields] + expected_fields = ["path", "type", "bboxes", "exception"] + + for field in expected_fields: + assert field in field_names diff --git a/tests/models/detectors/test_signature_detector.py 
b/tests/models/detectors/test_signature_detector.py new file mode 100644 index 0000000..da3f966 --- /dev/null +++ b/tests/models/detectors/test_signature_detector.py @@ -0,0 +1,143 @@ +import tempfile + +import pyspark +from scaledp.pipeline.PandasPipeline import PandasPipeline, pathSparkFunctions +from pyspark.ml import PipelineModel + +from scaledp import ( + ImageDrawBoxes, + PdfDataToImage, + SignatureDetector, +) +from scaledp.enums import Device +from scaledp.pdf.PdfDataToSingleImage import PdfDataToSingleImage + + +def test_signature_detector(image_signature_df): + + detector = SignatureDetector( + device=Device.CPU, + keepInputData=True, + partitionMap=True, + numPartitions=0, + scoreThreshold=0.25, + task="detect", + model="StabRise/signature_detection", + ) + + draw = ImageDrawBoxes( + keepInputData=True, + inputCols=["image", "boxes"], + filled=False, + color="green", + lineWidth=5, + displayDataList=["score", "angle"], + ) + # Transform the image dataframe through the detection stage + pipeline = PipelineModel(stages=[detector, draw]) + result = pipeline.transform(image_signature_df) + + data = result.collect() + + # Verify the pipeline result + assert len(data) == 1, "Expected exactly one result" + + # Check that exception is empty + assert data[0].boxes.exception == "" + + # Save the output image to a temporary file for verification + with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as temp: + temp.write(data[0].image_with_boxes.data) + temp.close() + + # Print the path to the temporary file + print("file://" + temp.name) + + +def test_signature_pdf_detector(signatures_pdf_df): + + pdf = PdfDataToSingleImage(outputCol="image", keepInputData=True) + + detector = SignatureDetector( + device=Device.CPU, + keepInputData=True, + partitionMap=False, + numPartitions=0, + scoreThreshold=0.25, + task="detect", + 
model="/home/mykola/PycharmProjects/scaledp-models/detection/document/signature/detector_yolo_1cls.onnx", + ) + + draw = ImageDrawBoxes( + keepInputData=True, + inputCols=["image", "boxes"], + filled=False, + color="green", + lineWidth=5, + displayDataList=["score", "angle"], + ) + # Transform the image dataframe through the OCR stage + pipeline = PipelineModel(stages=[pdf, detector, draw]) + result = pipeline.transform(signatures_pdf_df) + + data = result.collect() + + # Verify the pipeline result + assert len(data) == 1, "Expected exactly one result" + + # # Check that exceptions is empty + assert data[0].boxes.exception == "" + + # Save the output image to a temporary file for verification + with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as temp: + temp.write(data[0].image_with_boxes.data) + temp.close() + + # Print the path to the temporary file + print("file://" + temp.name) + + +def test_signature_pdf_detector_pandas(signatures_pdf_file): + pathSparkFunctions(pyspark) + + # pdf = PdfDataToSingleImage(inputCol="content", outputCol="image", + # keepInputData=True) + + pdf = PdfDataToImage( + inputCol="content", + outputCol="image", + pageLimit=1, + ) + + detector = SignatureDetector( + device=Device.CPU, + keepInputData=True, + partitionMap=False, + numPartitions=0, + scoreThreshold=0.25, + task="detect", + model="StabRise/signature_detection", + ) + + draw = ImageDrawBoxes( + keepInputData=True, + inputCols=["image", "boxes"], + filled=False, + color="green", + lineWidth=5, + displayDataList=["score", "angle"], + ) + # Transform the image dataframe through the OCR stage + pipeline = PandasPipeline(stages=[pdf, detector, draw]) + data = pipeline.fromFile(signatures_pdf_file) + + # Verify the pipeline result + assert len(data) == 1, "Expected exactly one result" + + # Save the output image to a temporary file for verification + with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as temp: + temp.write(data["image_with_boxes"][0].data) + 
temp.close() + + # Print the path to the temporary file + print("file://" + temp.name) diff --git a/tests/models/extractors/test_llm_extractor.py b/tests/models/extractors/test_llm_extractor.py index 92d9e72..6f4fb87 100644 --- a/tests/models/extractors/test_llm_extractor.py +++ b/tests/models/extractors/test_llm_extractor.py @@ -37,7 +37,7 @@ def test_llm_extractor(image_receipt_df): ) # Initialize the NER stage with the specified model and device - extractor = LLMExtractor(model="gemini-1.5-flash", schema=ReceiptSchema) + extractor = LLMExtractor(model="gemini-2.5-flash", schema=ReceiptSchema) # Transform the image dataframe through the OCR and NER stages result_df = extractor.transform(ocr.transform(image_receipt_df)) diff --git a/tests/models/extractors/test_llm_visual_extractor.py b/tests/models/extractors/test_llm_visual_extractor.py index bba6ed4..a2986df 100644 --- a/tests/models/extractors/test_llm_visual_extractor.py +++ b/tests/models/extractors/test_llm_visual_extractor.py @@ -113,7 +113,7 @@ def test_llm_visual_extractor_pandas(receipt_file, receipt_json, receipt_json_pa pathSparkFunctions(pyspark) data_to_image = DataToImage() extractor = LLMVisualExtractor( - model="gemini-1.5-flash", + model="gemini-2.5-flash", schema=ReceiptSchema, propagateError=True, ) @@ -195,7 +195,7 @@ def test_llm_visual_extractor_prompt_schema( ): extractor = LLMVisualExtractor( - model="gemini-1.5-flash", + model="gemini-2.5-flash", schema=ReceiptSchema1, propagateError=False, schemaByPrompt=True, diff --git a/tests/pdf/test_pdf_assembler.py b/tests/pdf/test_pdf_assembler.py new file mode 100644 index 0000000..b34df1a --- /dev/null +++ b/tests/pdf/test_pdf_assembler.py @@ -0,0 +1,157 @@ +import tempfile + +from scaledp.models.detectors.DBNetOnnxDetector import DBNetOnnxDetector +from pyspark.ml import PipelineModel +from pyspark.sql import DataFrame + +from scaledp import ImageDrawBoxes, TesseractRecognizer, TessLib +from scaledp.models.recognizers.TesseractOcr import TesseractOcr
+from scaledp.pdf import PdfAddTextLayer, PdfAssembler, PdfDataToImage, SingleImageToPdf +from scaledp.pipeline.PandasPipeline import PandasPipeline + + +def test_pdf_assembler(pdf_df: DataFrame) -> None: + + # Initialize pipeline stages + pdf_data_to_image = PdfDataToImage( + inputCol="content", + outputCol="image", + pageLimit=2, + ) + ocr = TesseractOcr( + inputCol="image", + outputCol="text", + keepInputData=True, + tessLib=TessLib.TESSEROCR, + ) + + image_to_pdf = SingleImageToPdf( + inputCol="image", + outputCol="pdf", + ) + + pdf_text_layer = PdfAddTextLayer( + inputCols=["pdf", "text"], + outputCol="pdf_with_text_layer", + ) + + pdf_assembler = PdfAssembler( + inputCol="pdf_with_text_layer", + outputCol="assembled_pdf", + groupByCol="path", + ) + + # Create and configure the pipeline + pipeline = PipelineModel( + stages=[ + pdf_data_to_image, + ocr, + image_to_pdf, + pdf_text_layer, + pdf_assembler, + ], + ) + + result = pipeline.transform(pdf_df).collect() + + # Verify the pipeline result + assert len(result) == 1, "Expected exactly one result" + + assert hasattr(result[0], "assembled_pdf") + + # Verify that there is no exception in the OCR result + assert ( + result[0].assembled_pdf.exception == "" + ), "Expected no exception in the OCR result" + + # Create temporary file to store the PDF + with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp: + # Output the file location for user reference + print(f"PDF saved at: file://{temp.name}") + + # Write PDF data to temporary file + temp.write(result[0].assembled_pdf.data) + + +def test_pdf_local_pipeline(patch_spark, pdf_report_file: str) -> None: + """Test PDF processing using PandasPipeline with local file input.""" + + # Initialize pipeline stages + pdf_data_to_image = PdfDataToImage( + inputCol="content", + outputCol="image", + pageLimit=10, + ) + + text_detector = DBNetOnnxDetector( + model="StabRise/text_detection_dbnet_ml_v0.1", + keepInputData=True, + onlyRotated=False, + ) + + 
text_recognizer = TesseractRecognizer( + inputCols=["image", "boxes"], + outputCol="text", + keepFormatting=False, + keepInputData=True, + tessLib=TessLib.PYTESSERACT, + lang=["eng", "spa"], + scoreThreshold=0.2, + partitionMap=False, + numPartitions=1, + ) + + draw = ImageDrawBoxes( + inputCols=["image", "text"], + outputCol="image_with_boxes", + lineWidth=2, + textSize=20, + displayDataList=[], + keepInputData=True, + ) + + image_to_pdf = SingleImageToPdf( + inputCol="image_with_boxes", + outputCol="pdf", + ) + + pdf_text_layer = PdfAddTextLayer( + inputCols=["pdf", "text"], + outputCol="pdf_with_text_layer", + ) + + pdf_assembler = PdfAssembler( + inputCol="pdf_with_text_layer", + outputCol="assembled_pdf", + groupByCol="path", + ) + + # Create and configure the pipeline + pipeline = PandasPipeline( + stages=[ + pdf_data_to_image, + text_detector, + text_recognizer, + draw, + image_to_pdf, + pdf_text_layer, + pdf_assembler, + ], + ) + + # Process the PDF file + result = pipeline.fromFile(pdf_report_file) + + # Verify pipeline execution and results + assert result is not None, "Pipeline result should not be None" + assert "assembled_pdf" in result.columns, "Result should contain 'assembled_pdf' column" + assert "execution_time" in result.columns, "Result should contain execution timing" + + assert len(result) == 1 + + with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp: + # Output the file location for user reference + print(f"PDF saved at: file://{temp.name}") + + # Write PDF data to temporary file + temp.write(result["assembled_pdf"][0].data) diff --git a/tests/pytest.ini b/tests/pytest.ini index 3dd7ced..9e1fe95 100644 --- a/tests/pytest.ini +++ b/tests/pytest.ini @@ -2,3 +2,6 @@ spark_options = spark.app.name: scaledp-pytest-spark-tests spark.executor.instances: 1 +addopts = --log-cli-level=INFO -s +env = + PYARROW_IGNORE_TIMEZONE = 1 diff --git a/tests/testresources/images/RotatedText.png b/tests/testresources/images/RotatedText.png new file mode
100644 index 0000000..a62c248 Binary files /dev/null and b/tests/testresources/images/RotatedText.png differ diff --git a/tests/testresources/images/signature.png b/tests/testresources/images/signature.png new file mode 100644 index 0000000..98ae985 Binary files /dev/null and b/tests/testresources/images/signature.png differ diff --git a/tests/testresources/images/text_line_or.png b/tests/testresources/images/text_line_or.png new file mode 100644 index 0000000..8330486 Binary files /dev/null and b/tests/testresources/images/text_line_or.png differ diff --git a/tests/testresources/pdfs/sample-report.pdf b/tests/testresources/pdfs/sample-report.pdf new file mode 100644 index 0000000..31d2ebe Binary files /dev/null and b/tests/testresources/pdfs/sample-report.pdf differ diff --git a/tests/testresources/pdfs/signatures.pdf b/tests/testresources/pdfs/signatures.pdf new file mode 100644 index 0000000..f59c38e Binary files /dev/null and b/tests/testresources/pdfs/signatures.pdf differ
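The letterbox preprocessing and coordinate restoration patched into `scaledp/models/detectors/yolo/yolo.py` above round-trip exactly: one uniform scale factor (no stretching) plus centered padding, then the inverse mapping. The following standalone NumPy sketch mirrors `rescale_image_with_padding` and `restore_coordinates`; the helper names `letterbox` and `restore` are illustrative only and not part of the patch:

```python
import numpy as np

def letterbox(w, h, target_w, target_h):
    # Mirror of rescale_image_with_padding: a single aspect-preserving
    # scale factor and centered padding offsets.
    scale = min(target_w / w, target_h / h)
    new_w, new_h = int(w * scale), int(h * scale)
    pad_x = (target_w - new_w) // 2
    pad_y = (target_h - new_h) // 2
    return scale, pad_x, pad_y

def restore(boxes, scale, pad_x, pad_y, w, h):
    # Mirror of restore_coordinates: undo padding, undo scaling,
    # then clip to the original image bounds.
    out = boxes.astype(float).copy()
    out[:, [0, 2]] -= pad_x
    out[:, [1, 3]] -= pad_y
    out /= scale
    out[:, [0, 2]] = np.clip(out[:, [0, 2]], 0, w)
    out[:, [1, 3]] = np.clip(out[:, [1, 3]], 0, h)
    return out

# A 2000x1000 page letterboxed into a 640x640 model input.
scale, pad_x, pad_y = letterbox(2000, 1000, 640, 640)

# Forward-map a box into model-input space, then restore it.
box_orig = np.array([[100.0, 50.0, 300.0, 150.0]])  # x1, y1, x2, y2
box_model = box_orig * scale
box_model[:, [0, 2]] += pad_x
box_model[:, [1, 3]] += pad_y
box_back = restore(box_model, scale, pad_x, pad_y, 2000, 1000)
print(box_back)  # approximately [[100., 50., 300., 150.]]
```

Subtracting the pads before dividing by the scale factor is what makes the restoration exact; doing the operations in the opposite order would shift every box by `pad / scale` pixels.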
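The `Box.iou` and `Box.merge` helpers added in `scaledp/schemas/Box.py` reduce to a few lines of arithmetic on axis-aligned `(x, y, width, height)` rectangles. A standalone sketch of the same two operations on plain tuples (illustrative helpers, not the actual `Box` schema):

```python
def iou(a, b):
    # Intersection-over-union of two (x, y, width, height) rectangles,
    # mirroring Box.iou.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    if inter == 0:
        return 0.0
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def merge(a, b):
    # Minimal bounding rectangle of two boxes, mirroring Box.merge.
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)

# Two word boxes on the same line, overlapping by 20 px horizontally.
a = (0, 0, 100, 20)
b = (80, 0, 100, 20)
print(iou(a, b))    # 400 / 3600, roughly 0.111
print(merge(a, b))  # (0, 0, 180, 20)
```

Note that with the default `iou_threshold=0.3` in `merge_overlapping_boxes`, this pair (IoU about 0.11) would not be merged even though the boxes share a line, so both the overlap and same-line conditions matter.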