1 change: 1 addition & 0 deletions .gitignore
@@ -162,3 +162,4 @@ cython_debug/
.idea/
/.vscode/settings.json
/tests/testresources/pdfs/private/
/.run/*
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## [unreleased]
## 0.2.4 - 01.10.2025

### 🚀 Features

Binary file added docs/source/_static/ShowFaceBoxes.png
Binary file added docs/source/_static/ShowFaceCropped.png
Binary file added docs/source/_static/ShowImageInvoice.png
Binary file added docs/source/_static/ShowSignatureBoxes.png
23 changes: 19 additions & 4 deletions docs/source/conf.py
@@ -6,16 +6,25 @@
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

import os
import sys

sys.path.insert(0, os.path.abspath("../scaledp"))

project = "ScaleDP"
copyright = "2024, StabRise"
author = "StabRise"
release = "0.1.0"
author = "Mykola Melnyk"
release = "0.2.4"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ["sphinx.ext.autodoc", "myst_parser"]

source_suffix = {
".rst": "restructuredtext",
".md": "markdown",
}

templates_path = ["_templates"]
exclude_patterns = []

@@ -46,7 +55,13 @@
"icon": "https://img.shields.io/badge/by-StabRise-orange.svg?style=flat&colorA=E1523D&colorB=007D8A",
"type": "url",
},
]
],
"extra_footer": """
<p style="font-size:1em; color:#777;">
© Copyright 2025, <a href="https://stabrise.com"
target="_blank">StabRise</a>
</p>
""",
}

# -- Options for HTML output -------------------------------------------------
24 changes: 24 additions & 0 deletions docs/source/detectors.md
@@ -0,0 +1,24 @@
Detectors
=========

## Overview

This section gives an overview of the detectors available in ScaleDP for processing images and documents. Detectors identify specific features in images, such as text regions, objects (faces, signatures), and layout structures, and return their bounding boxes for downstream stages.

## Object Detection

* [**Face Detector**](#FaceDetector)
* [**Signature Detector**](#SignatureDetector)

## Text Detection

* [**CraftTextDetector**](#CraftTextDetector)
* [**DBNetOnnxDetector**](#DBNetOnnxDetector)
* **YoloOnnxTextDetector**
* **DocTRTextDetector**

## Base Detectors

* **BaseDetector**
* [**YoloOnnxDetector**](#YoloOnnxDetector)

49 changes: 49 additions & 0 deletions docs/source/image/data_to_image.md
@@ -0,0 +1,49 @@
(DataToImage)=
# DataToImage

## Overview

`DataToImage` is a PySpark ML transformer that converts binary content (such as bytes from files or streams) into image objects. It is designed for use in Spark pipelines, enabling scalable and distributed image processing workflows. The transformer supports various image types and handles errors gracefully.

## Usage Example

```python
from scaledp import DataToImage, PipelineModel

# files() is assumed here to resolve the path of a bundled example image
image_example = files('resources/images/Invoice.png')

# Assumes an active SparkSession (`spark`) with ScaleDP initialized
df = spark.read.format("binaryFile") \
    .load(image_example)

data_to_image = DataToImage(
inputCol="content", # Column with binary data
outputCol="image", # Output column for image objects
pathCol="path", # Optional: column with image paths
keepInputData=True, # Keep original data in output
propagateError=False, # Handle errors gracefully
)

pipeline = PipelineModel(stages=[data_to_image])
result = pipeline.transform(df) # df should have 'content' and optionally 'path' columns
result.show_image("image")
```

![ShowImageInvoice.png](../_static/ShowImageInvoice.png)

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------|
| inputCol | str | Input column with binary content | "content" |
| outputCol | str | Output column for image objects | "image" |
| pathCol | str | Path column for image metadata | "path" |
| keepInputData | bool | Keep input data in output | False |
| imageType | Enum | Type of image (e.g., FILE, PIL) | ImageType.FILE |
| propagateError | bool | Propagate errors | False |

## Notes
- Converts binary data to image objects using the specified image type.
- Handles errors gracefully; if `propagateError` is False, exceptions are logged and empty images are returned.
- Can be used as the first stage in image processing pipelines to ingest raw image data.
- Supports distributed processing with Spark.
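
The graceful error handling described above can be sketched in plain Python (a hypothetical, simplified illustration of the pattern only, not ScaleDP's actual implementation; the record fields and PNG-header check are assumptions for the example):

```python
def data_to_image(content: bytes, path: str, propagate_error: bool = False) -> dict:
    """Illustrative sketch of the DataToImage error-handling pattern."""
    try:
        # A real implementation would decode the bytes into an image here;
        # we only validate a PNG magic header for illustration.
        if not content.startswith(b"\x89PNG\r\n\x1a\n"):
            raise ValueError(f"Unsupported or corrupt image: {path}")
        return {"path": path, "data": content, "exception": None}
    except ValueError as e:
        if propagate_error:
            raise
        # Log the failure and return an empty image record
        # instead of failing the whole Spark job.
        return {"path": path, "data": b"", "exception": str(e)}

ok = data_to_image(b"\x89PNG\r\n\x1a\n" + b"...", "a.png")
bad = data_to_image(b"not-an-image", "b.png")
```

With `propagateError=False` the bad record flows through the pipeline as an empty image carrying the error message, so one corrupt file does not abort a distributed job.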

63 changes: 63 additions & 0 deletions docs/source/image/image_crop_boxes.md
@@ -0,0 +1,63 @@
(ImageCropBoxes)=
# ImageCropBoxes

## Overview

`ImageCropBoxes` is a PySpark ML transformer that crops images based on provided bounding boxes. It is designed to process images in Spark pipelines, supporting batch and distributed processing. The transformer can add padding to crops, limit the number of crops per image, and handle cases where no boxes are present.

## Usage Example

```python
from scaledp import FaceDetector, ImageCropBoxes, PipelineModel

# Step 1: Detect faces in images
detector = FaceDetector(
inputCol="image",
outputCol="boxes",
keepInputData=True,
scoreThreshold=0.25,
padding=20,
)

# Step 2: Crop images using detected face boxes
cropper = ImageCropBoxes(
inputCols=["image", "boxes"],
outputCol="cropped_image",
keepInputData=True,
padding=10,
limit=5,
noCrop=True,
    autoRotate=False,  # Do not rotate crops when box height > width
)

# Build and run the pipeline
pipeline = PipelineModel(stages=[detector, cropper])
result = pipeline.transform(image_df)
result.show_image("cropped_image")
```

![ShowFaceCropped.png](../_static/ShowFaceCropped.png)

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------|
| inputCols | list | Input columns: image and boxes | ["image", "boxes"] |
| outputCol         | str     | Output column for cropped images                 | "cropped_image" |
| keepInputData | bool | Keep input data in output | False |
| imageType | Enum | Type of image (e.g., FILE) | ImageType.FILE |
| numPartitions | int | Number of partitions for Spark | 0 |
| padding | int | Padding added to each crop | 0 |
| pageCol | str | Page column for repartitioning | "page" |
| propagateError | bool | Propagate errors | False |
| noCrop | bool | Raise error if no boxes to crop | True |
| limit | int | Limit number of crops per image | 0 (no limit) |
| autoRotate | bool | Auto rotate crop if box height > width | True |

## Notes
- Crops are performed using bounding boxes from the `boxes` column.
- If `noCrop` is True and no boxes are present, an error is raised.
- If `limit` is set, only the first N boxes are used for cropping.
- If `autoRotate` is True, crops are rotated if the bounding box height is greater than its width.
- Supports distributed processing with Spark.
- Errors can be propagated or handled gracefully based on `propagateError`.
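
The padding, clamping, and `autoRotate` rules in the notes above can be illustrated with plain box arithmetic (a hypothetical sketch assuming an `(x, y, width, height)` box format; ScaleDP's internal box representation may differ):

```python
def pad_and_clamp(box, padding, img_w, img_h):
    """Expand an (x, y, width, height) box by `padding`, clamped to the image."""
    x, y, w, h = box
    x0 = max(0, x - padding)
    y0 = max(0, y - padding)
    x1 = min(img_w, x + w + padding)
    y1 = min(img_h, y + h + padding)
    return (x0, y0, x1 - x0, y1 - y0)

def needs_rotation(box):
    """autoRotate rule: rotate the crop when the box is taller than wide."""
    _, _, w, h = box
    return h > w

# A box near the top-left corner: padding is absorbed by the image edge.
crop = pad_and_clamp((5, 5, 20, 10), padding=10, img_w=100, img_h=12)
```

Clamping keeps padded crops inside the image, so boxes near an edge simply receive less padding on that side.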
61 changes: 61 additions & 0 deletions docs/source/image/image_draw_boxes.md
@@ -0,0 +1,61 @@
(ImageDrawBoxes)=
# ImageDrawBoxes

## Overview

`ImageDrawBoxes` is a PySpark ML transformer that draws bounding boxes and/or NER entity boxes on images. It supports both standard bounding boxes and named entity recognition (NER) outputs, allowing for flexible visualization of detected objects or entities. The transformer can be integrated into Spark pipelines for scalable image annotation tasks.

## Usage Example

```python
from scaledp import FaceDetector, ImageDrawBoxes, PipelineModel

detector = FaceDetector(
inputCol="image",
outputCol="boxes",
keepInputData=True,
scoreThreshold=0.25,
padding=20,
)

draw = ImageDrawBoxes(
inputCols=["image", "boxes"],
outputCol="image_with_boxes",
keepInputData=True,
filled=False,
color="green",
lineWidth=5,
)

pipeline = PipelineModel(stages=[detector, draw])
result = pipeline.transform(image_df)
result.show_image("image_with_boxes")
```
![ShowFaceBoxes.png](../_static/ShowFaceBoxes.png)

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|---------------------|
| inputCols | list | Input columns: image and boxes/entities | ["image", "boxes"] |
| outputCol | str | Output column for annotated images | "image_with_boxes" |
| keepInputData | bool | Keep input data in output | False |
| imageType | Enum | Type of image (e.g., FILE) | ImageType.FILE |
| filled | bool | Fill rectangles | False |
| color | str | Box color (hex or name) | None (random) |
| lineWidth | int | Line width for boxes | 1 |
| textSize | int | Text size for labels | 12 |
| displayDataList | list | List of box/entity attributes to display as text | [] |
| numPartitions | int | Number of partitions for Spark | 0 |
| padding | int | Padding added to boxes | 0 |
| pageCol | str | Page column for repartitioning | "page" |
| whiteList | list | Only draw boxes/entities of these types | [] |
| blackList | list | Do not draw boxes/entities of these types | [] |

## Notes
- Supports drawing both standard bounding boxes and NER entity boxes.
- Colors can be set manually or randomly assigned per entity/class.
- Text labels can be displayed using `displayDataList`.
- Handles rotated boxes and fills/outline options.
- Can be used in Spark pipelines for distributed image annotation.
- Errors are handled gracefully and logged.
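
The "random color per entity/class" behavior can be sketched as a small color cache (hypothetical illustration; the real transformer's color assignment is internal to ScaleDP):

```python
import random

class ClassColors:
    """Assign each box/entity type a stable random hex color."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self._colors: dict[str, str] = {}

    def color_for(self, cls: str) -> str:
        if cls not in self._colors:
            # Draw once per class so every box of a type shares one color.
            self._colors[cls] = "#{:06x}".format(self._rng.randrange(0x1000000))
        return self._colors[cls]

palette = ClassColors(seed=42)
face_color = palette.color_for("face")
```

Caching per class keeps the annotation legible: all faces get one color, all signatures another, without the caller picking colors manually.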
13 changes: 13 additions & 0 deletions docs/source/image_processing.md
@@ -0,0 +1,13 @@
# Image Processing

This document provides an overview of the image processing transformers available in ScaleDP.


## Available Image Processing Transformers

* [**DataToImage**](#DataToImage): Converts raw binary data into image format for further processing.
* [**ImageCropBoxes**](#ImageCropBoxes): Crops specified regions from images based on bounding box coordinates.
* [**ImageDrawBoxes**](#ImageDrawBoxes): Draws bounding boxes on images to highlight detected objects or regions of interest.
7 changes: 7 additions & 0 deletions docs/source/index.rst
Expand Up @@ -42,4 +42,11 @@ Benefits of using ScaleDP

installation.md
quickstart.md
image_processing.md
pdf_processing.md
detectors.md
ocr.md
show_utils.md
release_notes.md


81 changes: 81 additions & 0 deletions docs/source/models/detectors/craft_text_detector.md
@@ -0,0 +1,81 @@
(CraftTextDetector)=
# CraftTextDetector

## Overview

`CraftTextDetector` is a PySpark ML transformer for text detection in images using the CRAFT (Character Region Awareness for Text detection) model. It supports distributed processing in Spark pipelines, batch inference, and an optional refiner network postprocessing step for improved accuracy. The detector outputs bounding boxes for detected text regions, with support for rotated boxes and tunable thresholds.

## Usage Example

```python
from scaledp.models.detectors import CraftTextDetector
from scaledp import TesseractRecognizer, ImageDrawBoxes, PipelineModel

detector = CraftTextDetector(
device="cpu",
keepInputData=True,
partitionMap=True,
numPartitions=1,
width=1600,
scoreThreshold=0.7,
textThreshold=0.4,
linkThreshold=0.4,
withRefiner=True,
)

ocr = TesseractRecognizer(
inputCols=["image", "boxes"],
keepFormatting=False,
keepInputData=True,
lang=["eng", "spa"],
scoreThreshold=0.2,
scaleFactor=2.0,
partitionMap=True,
numPartitions=1,
)

draw = ImageDrawBoxes(
keepInputData=True,
inputCols=["image", "text"],
filled=False,
color="green",
lineWidth=5,
displayDataList=["score", "text", "angle"],
)

pipeline = PipelineModel(stages=[detector, ocr, draw])
result = pipeline.transform(image_df)
result.show_image("image_with_boxes")
```

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------|
| inputCol | str | Input image column | "image" |
| outputCol | str | Output column for boxes | "boxes" |
| keepInputData | bool | Keep input data in output | False |
| scaleFactor | float | Image resize factor | 1.0 |
| scoreThreshold | float | Minimum confidence score | 0.7 |
| textThreshold | float | Threshold for text region score | 0.4 |
| linkThreshold | float | Threshold for link affinity score | 0.4 |
| sizeThreshold | int | Minimum height for detected regions | -1 |
| width | int | Width for image resizing | 1280 |
| withRefiner | bool | Enable refiner network postprocessing | False |
| device | Device | Inference device (CPU/GPU) | Device.CPU |
| batchSize | int | Batch size for inference | 2 |
| partitionMap | bool | Use partitioned mapping | False |
| numPartitions | int | Number of partitions | 0 |
| pageCol | str | Page column | "page" |
| pathCol | str | Path column | "path" |
| propagateError | bool | Propagate errors | False |
| onlyRotated | bool | Return only rotated boxes | False |

## Notes
- Supports optional refiner network for improved text box accuracy (`withRefiner`).
- Outputs bounding boxes for detected text regions; when `onlyRotated` is True, only rotated boxes are returned.
- Thresholds (`scoreThreshold`, `textThreshold`, `linkThreshold`) can be tuned for different document types.
- Can be integrated with OCR and visualization stages in Spark pipelines.
- Supports batch and distributed processing for scalable text detection.
- Errors are handled gracefully and can be propagated if desired.
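
The effect of `scoreThreshold` and `sizeThreshold` from the parameter table can be illustrated with a plain filtering sketch (a hypothetical box format chosen for the example; this is not CRAFT's actual postprocessing):

```python
def filter_detections(boxes, score_threshold=0.7, size_threshold=-1):
    """Keep boxes whose confidence and height pass the configured thresholds.

    Each box is a dict like {"score": float, "height": int}. A size_threshold
    of -1 disables the minimum-height check, mirroring the table's default.
    """
    kept = []
    for box in boxes:
        if box["score"] < score_threshold:
            continue  # below the minimum confidence score
        if size_threshold >= 0 and box["height"] < size_threshold:
            continue  # region too small to keep
        kept.append(box)
    return kept

detections = [
    {"score": 0.9, "height": 30},
    {"score": 0.5, "height": 40},  # dropped: below scoreThreshold
    {"score": 0.8, "height": 5},   # dropped only when a size threshold is set
]
kept = filter_detections(detections, score_threshold=0.7, size_threshold=10)
```

Lowering `scoreThreshold` recovers faint text at the cost of false positives; raising `sizeThreshold` suppresses tiny spurious regions, which is why these values are worth tuning per document type.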
