StabRise · mykolamelnykml · Nov 12, 2025
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,10 @@
+## [unreleased]
+
+### 🚀 Features
+
+- Added TextEmbeddings transformer for computing embeddings with SentenceTransformers
+
+
 ## [0.2.5] - 10.11.2025
 
 ### 🚀 Features

diff --git a/docs/source/embeddings.md b/docs/source/embeddings.md
@@ -0,0 +1,14 @@
+Embeddings
+==========
+
+## Overview
+
+This section lists the embedding transformers available in ScaleDP. They generate embeddings that can be used for tasks such as clustering, classification, and semantic similarity.
+
+## Text Embeddings
+
+* [**TextEmbeddings**](models/embeddings/TextEmbeddings.md)
+
+## Base Embeddings
+
+* [**BaseEmbeddings**](models/embeddings/BaseEmbeddings.md)
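Note on `docs/source/embeddings.md`: the overview lists semantic similarity as one use of the generated embeddings. As a minimal sketch of that downstream step, assuming each embedding is available as a plain list of floats (the helper name `cosine_similarity` is illustrative and not part of ScaleDP), two vectors can be compared like this:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction. For large batches a vectorized NumPy version would likely be preferable.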
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -46,6 +46,7 @@ Benefits of using ScaleDP
    pdf_processing.md
    detectors.md
    ocr.md
+   embeddings.md
    show_utils.md
    release_notes.md
 

diff --git a/docs/source/models/embeddings/BaseEmbeddings.md b/docs/source/models/embeddings/BaseEmbeddings.md
@@ -0,0 +1,39 @@
+(BaseEmbeddings)=
+# BaseEmbeddings
+
+## Overview
+
+`BaseEmbeddings` is an abstract base class for embedding transformers in ScaleDP. It provides the foundational structure and common functionality for embedding models, enabling efficient and scalable embedding generation for various data types. Derived classes, such as `TextEmbeddings`, extend this base class to implement specific embedding logic.
+
+## Key Features
+
+- **Abstract Base Class**: Provides a common interface for embedding transformers.
+- **PySpark Integration**: Designed to work seamlessly with PySpark for distributed data processing.
+- **Customizable Parameters**: Exposes parameters for the model, device, batch size, and partitioning (see the table below).
+- **Error Handling**: Includes validation for input columns and error propagation options.
+
+## Usage Example
+
+`BaseEmbeddings` is not intended to be used directly. Instead, it serves as a parent class for specific embedding transformers like `TextEmbeddings`.
+
+## Parameters
+
+| Parameter         | Type    | Description                                      | Default                     |
+|-------------------|---------|--------------------------------------------------|-----------------------------|
+| inputCol          | str     | Input column name                                | N/A                         |
+| outputCol         | str     | Output column name                               | N/A                         |
+| keepInputData     | bool    | Whether to retain input data in the output       | True                        |
+| device            | Device  | Device for computation (CPU/GPU)                 | Device.CPU                  |
+| model             | str     | Pre-trained model identifier                     | N/A                         |
+| batchSize         | int     | Batch size for processing                        | 1                           |
+| numPartitions     | int     | Number of partitions for distributed processing  | 1                           |
+| partitionMap      | bool    | Use partitioned mapping                          | False                       |
+| pageCol           | str     | Page column                                      | "page"                      |
+| pathCol           | str     | Path column                                      | "path"                      |
+
+## Notes
+
+- `BaseEmbeddings` provides the `_transform` method, which handles the core logic for applying transformations to a dataset.
+- Derived classes must implement the `transform_udf` and `transform_udf_pandas` methods to define the specific embedding logic.
+- The class includes validation for input columns to ensure compatibility with the dataset.
+
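Note on `docs/source/models/embeddings/BaseEmbeddings.md`: the notes describe a base class that owns the shared transform logic while derived classes supply the embedding logic. The sketch below illustrates that split in plain Python, without Spark. All names here (`SketchBaseEmbeddings`, `embed_batch`, `LengthEmbeddings`) are hypothetical; the real signatures of `_transform`, `transform_udf`, and `transform_udf_pandas` are not shown in this diff.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class SketchBaseEmbeddings(ABC):
    """Hypothetical stand-in for the pattern, not ScaleDP's BaseEmbeddings."""

    def __init__(self, batch_size: int = 1) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        self.batch_size = batch_size

    @abstractmethod
    def embed_batch(self, texts: Sequence[str]) -> List[List[float]]:
        """Return one vector per input text; subclasses define how."""

    def transform(self, texts: Sequence[str]) -> List[List[float]]:
        """Shared logic: split the input into batches and delegate each one."""
        vectors: List[List[float]] = []
        for start in range(0, len(texts), self.batch_size):
            batch = texts[start:start + self.batch_size]
            vectors.extend(self.embed_batch(batch))
        return vectors


class LengthEmbeddings(SketchBaseEmbeddings):
    """Toy subclass: 'embeds' each text as [character count, word count]."""

    def embed_batch(self, texts: Sequence[str]) -> List[List[float]]:
        return [[float(len(t)), float(len(t.split()))] for t in texts]
```

Instantiating `SketchBaseEmbeddings` directly raises `TypeError` because `embed_batch` is abstract, which mirrors the statement that the base class is not meant to be used on its own.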
diff --git a/docs/source/models/embeddings/TextEmbeddings.md b/docs/source/models/embeddings/TextEmbeddings.md
@@ -0,0 +1,46 @@
+(TextEmbeddings)=
+# TextEmbeddings
+
+## Overview
+
+`TextEmbeddings` is a text embedding transformer based on the SentenceTransformer model. It is designed to efficiently generate embeddings for text data using a pre-trained model. The transformer is implemented as a PySpark ML transformer and can be integrated into Spark pipelines for scalable text embedding tasks.
+
+## Usage Example
+
+```python
+from scaledp import TextEmbeddings, PipelineModel
+
+text_embeddings = TextEmbeddings(
+    inputCol="text",
+    outputCol="embeddings",
+    keepInputData=True,
+    model="all-MiniLM-L6-v2",
+    batchSize=1,
+    device="cpu",
+)
+
+# Run text_df (assumed: an existing Spark DataFrame with a "text" column) through the embedding stage
+pipeline = PipelineModel(stages=[text_embeddings])
+result = pipeline.transform(text_df)
+result.show()
+```
+
+## Parameters
+
+| Parameter         | Type    | Description                                      | Default                     |
+|-------------------|---------|--------------------------------------------------|-----------------------------|
+| inputCol          | str     | Input text column                                | "text"                      |
+| outputCol         | str     | Output column for embeddings                     | "embeddings"                |
+| keepInputData     | bool    | Keep input data in output                        | True                        |
+| model             | str     | Pre-trained model identifier                     | "all-MiniLM-L6-v2"          |
+| batchSize         | int     | Batch size for inference                         | 1                           |
+| device            | Device  | Inference device (CPU/GPU)                       | Device.CPU                  |
+| numPartitions     | int     | Number of partitions                             | 1                           |
+| partitionMap      | bool    | Use partitioned mapping                          | False                       |
+| pageCol           | str     | Page column                                      | "page"                      |
+| pathCol           | str     | Path column                                      | "path"                      |
+
+## Notes
+- The transformer uses the SentenceTransformer model for generating text embeddings.
+- Supports batch processing and distributed inference with Spark.
+- Additional parameters can be set using the corresponding setter methods.
diff --git a/docs/source/release_notes.md b/docs/source/release_notes.md
@@ -4,6 +4,41 @@ Release Notes
 This document outlines the release notes for the ScaledP project. It includes information about new features, bug fixes, and other changes made in each version.
 
 
+## [unreleased]
+
+### 🚀 Features
+
+- Added [TextEmbeddings](#TextEmbeddings) transformer for computing embeddings with SentenceTransformers
+
+
+## [0.2.5] - 10.11.2025
+
+### 🚀 Features
+
+- Added param 'returnEmpty' to [ImageCropBoxes](#ImageCropBoxes) to avoid exceptions when no boxes are found
+- Added labels param to the [YoloOnnxDetector](#YoloOnnxDetector)
+- Improved label display in [ImageDrawBoxes](#ImageDrawBoxes)
+
+### 🧰 Maintenance
+- Updated versions of dependencies (Pandas, Numpy, OpenCV)
+
+### 🐛 Bug Fixes
+
+- Fixed color schema conversion in [YoloOnnxDetector](#YoloOnnxDetector)
+- Fixed show utils on Google Colab
+- Fixed imports of the DataFrame
+
+### 📘 Jupyter Notebooks
+
+- [YoloOnnxDetector.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/object-detection/1.YoloOnnxDetector.ipynb)
+- [FaceDetection.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/object-detection/2.FaceDetection.ipynb)
+- [SignatureDetection.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/object-detection/3.SignatureDetection.ipynb)
+
+### 📝 Blog Posts
+
+- [Running YOLO Models on Spark Using ScaleDP](https://stabrise.com/blog/running_yolo_on_spark_with_scaledp/)
+
+
 ## 0.2.4 - 02.11.2025
 
 ### 🚀 Features