7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,10 @@
## [unreleased]

### 🚀 Features

- Added TextEmbeddings transformer for computing embeddings using SentenceTransformers


## [0.2.5] - 10.11.2025

### 🚀 Features
14 changes: 14 additions & 0 deletions docs/source/embeddings.md
@@ -0,0 +1,14 @@
Embeddings
==========

## Overview

This section provides an overview of the various embedding transformers available in ScaleDP for processing text and other data types. These transformers are designed to generate embeddings that can be used for tasks such as clustering, classification, and semantic similarity.
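
As an illustration of the semantic-similarity use case mentioned above, the following minimal sketch compares embedding vectors with cosine similarity. It does not use ScaleDP; the function name `cosine_similarity` and the toy three-dimensional vectors are assumptions made for this example, standing in for real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; treat it as 0.0 here.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors with made-up values, standing in for real embeddings.
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]        # same direction as `query`
orthogonal = [0.0, 1.0, 0.0]   # shares no direction with `query`

print(cosine_similarity(query, close))
print(cosine_similarity(query, orthogonal))
```

Vectors that point in the same direction score close to 1, while unrelated (orthogonal) vectors score 0, which is why cosine similarity is a common choice for ranking texts by meaning.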

## Text Embeddings

* [**TextEmbeddings**](models/embeddings/TextEmbeddings.md)

## Base Embeddings

* [**BaseEmbeddings**](models/embeddings/BaseEmbeddings.md)
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -46,6 +46,7 @@ Benefits of using ScaleDP
pdf_processing.md
detectors.md
ocr.md
embeddings.md
show_utils.md
release_notes.md

39 changes: 39 additions & 0 deletions docs/source/models/embeddings/BaseEmbeddings.md
@@ -0,0 +1,39 @@
(BaseEmbeddings)=
# BaseEmbeddings

## Overview

`BaseEmbeddings` is an abstract base class for embedding transformers in ScaleDP. It provides the foundational structure and common functionality for embedding models, enabling efficient and scalable embedding generation for various data types. Derived classes, such as `TextEmbeddings`, extend this base class to implement specific embedding logic.

## Key Features

- **Abstract Base Class**: Provides a common interface for embedding transformers.
- **PySpark Integration**: Designed to work seamlessly with PySpark for distributed data processing.
- **Customizable Parameters**: Supports a wide range of parameters for flexibility and customization.
- **Error Handling**: Includes validation for input columns and error propagation options.

## Usage Example

`BaseEmbeddings` is not intended to be used directly. Instead, it serves as a parent class for specific embedding transformers like `TextEmbeddings`.

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------------------|
| inputCol | str | Input column name | N/A |
| outputCol | str | Output column name | N/A |
| keepInputData | bool | Whether to retain input data in the output | True |
| device | Device | Device for computation (CPU/GPU) | Device.CPU |
| model | str | Pre-trained model identifier | N/A |
| batchSize | int | Batch size for processing | 1 |
| numPartitions | int | Number of partitions for distributed processing | 1 |
| partitionMap | bool | Use partitioned mapping | False |
| pageCol | str | Page column | "page" |
| pathCol | str | Path column | "path" |

## Notes

- `BaseEmbeddings` provides the `_transform` method, which handles the core logic for applying transformations to a dataset.
- Derived classes must implement the `transform_udf` and `transform_udf_pandas` methods to define the specific embedding logic.
- The class includes validation for input columns to ensure compatibility with the dataset.
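
The division of labor described in these notes can be sketched in plain Python. This is not the ScaleDP implementation: `SketchBaseEmbeddings` and `LengthEmbeddings` are hypothetical classes, the method signatures are assumptions, and plain dictionaries stand in for Spark rows. Only the method names `_transform`, `transform_udf`, and `transform_udf_pandas` come from the notes above.

```python
from abc import ABC, abstractmethod


class SketchBaseEmbeddings(ABC):
    """Hypothetical stand-in for BaseEmbeddings; not the ScaleDP class."""

    def __init__(self, inputCol, outputCol):
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, rows):
        # Shared logic lives in the base class: validate the input column,
        # then delegate the actual embedding to the subclass.
        for row in rows:
            if self.inputCol not in row:
                raise ValueError(f"Missing input column: {self.inputCol}")
        return [
            {**row, self.outputCol: self.transform_udf(row[self.inputCol])}
            for row in rows
        ]

    @abstractmethod
    def transform_udf(self, value):
        """Embed a single value (assumed signature)."""

    @abstractmethod
    def transform_udf_pandas(self, values):
        """Embed a batch of values (assumed signature)."""


class LengthEmbeddings(SketchBaseEmbeddings):
    """Toy subclass: 'embeds' text as [character count, word count]."""

    def transform_udf(self, value):
        return [float(len(value)), float(len(value.split()))]

    def transform_udf_pandas(self, values):
        return [self.transform_udf(v) for v in values]


rows = [{"text": "hello world"}, {"text": "ScaleDP"}]
result = LengthEmbeddings("text", "embeddings")._transform(rows)
print(result)
```

Because both embedding methods are abstract, instantiating the base class directly raises `TypeError`, which mirrors the note that `BaseEmbeddings` is not meant to be used on its own.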

46 changes: 46 additions & 0 deletions docs/source/models/embeddings/TextEmbeddings.md
@@ -0,0 +1,46 @@
(TextEmbeddings)=
# TextEmbeddings

## Overview

`TextEmbeddings` is a text embedding transformer based on the SentenceTransformer model. It is designed to efficiently generate embeddings for text data using a pre-trained model. The transformer is implemented as a PySpark ML transformer and can be integrated into Spark pipelines for scalable text embedding tasks.

## Usage Example

```python
from scaledp import TextEmbeddings, PipelineModel

text_embeddings = TextEmbeddings(
    inputCol="text",
    outputCol="embeddings",
    keepInputData=True,
    model="all-MiniLM-L6-v2",
    batchSize=1,
    device="cpu",
)

# `text_df` is assumed to be an existing Spark DataFrame with a "text" column.
# Transform the text DataFrame through the embedding stage.
pipeline = PipelineModel(stages=[text_embeddings])
result = pipeline.transform(text_df)
result.show()
```

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------------------|
| inputCol | str | Input text column | "text" |
| outputCol | str | Output column for embeddings | "embeddings" |
| keepInputData | bool | Keep input data in output | True |
| model | str | Pre-trained model identifier | "all-MiniLM-L6-v2" |
| batchSize | int | Batch size for inference | 1 |
| device | Device | Inference device (CPU/GPU) | Device.CPU |
| numPartitions | int | Number of partitions | 1 |
| partitionMap | bool | Use partitioned mapping | False |
| pageCol | str | Page column | "page" |
| pathCol | str | Path column | "path" |

## Notes
- The transformer uses the SentenceTransformer model for generating text embeddings.
- Supports batch processing and distributed inference with Spark.
- Additional parameters can be set using the corresponding setter methods.
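
To make the batch-processing note concrete, the sketch below shows one way a `batchSize` value could group texts before they reach the model. It assumes that `batchSize` is the number of texts per inference call; `iter_batches` is a hypothetical helper written for this example, not part of ScaleDP.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["a", "b", "c", "d", "e"]

# With a batch size of 2, five texts are split into three model calls.
batches = list(iter_batches(texts, 2))
print(batches)
```

With the default `batchSize` of 1, each text would be encoded in its own call; larger values trade memory for fewer calls.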
35 changes: 35 additions & 0 deletions docs/source/release_notes.md
@@ -4,6 +4,41 @@ Release Notes
This document outlines the release notes for the ScaleDP project. It includes information about new features, bug fixes, and other changes made in each version.


## [unreleased]

### 🚀 Features

- Added [TextEmbeddings](#TextEmbeddings) transformer for computing embeddings using SentenceTransformers


## [0.2.5] - 10.11.2025

### 🚀 Features

- Added param 'returnEmpty' to [ImageCropBoxes](#ImageCropBoxes) to avoid exceptions when no boxes are found
- Added labels param to the [YoloOnnxDetector](#YoloOnnxDetector)
- Improved label display in [ImageDrawBoxes](#ImageDrawBoxes)

### 🧰 Maintenance
- Updated versions of dependencies (Pandas, Numpy, OpenCV)

### 🐛 Bug Fixes

- Fixed color schema conversion in [YoloOnnxDetector](#YoloOnnxDetector)
- Fixed show utils on Google Colab
- Fixed imports of the DataFrame

### 📘 Jupyter Notebooks

- [YoloOnnxDetector.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/object-detection/1.YoloOnnxDetector.ipynb)
- [FaceDetection.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/object-detection/2.FaceDetection.ipynb)
- [SignatureDetection.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/object-detection/3.SignatureDetection.ipynb)

### 📝 Blog Posts

- [Running YOLO Models on Spark Using ScaleDP](https://stabrise.com/blog/running_yolo_on_spark_with_scaledp/)


## 0.2.4 - 02.11.2025

### 🚀 Features