Merged
1 change: 1 addition & 0 deletions cliff.toml
@@ -61,6 +61,7 @@ protect_breaking_commits = false
commit_parsers = [
{ message = "^feat", group = "<!-- 0 -->🚀 Features" },
{ message = "^fix", group = "<!-- 1 -->🐛 Bug Fixes" },
{ message = "^update", group = "<!-- 2 -->🔄 Updates" },
{ message = "^doc", group = "<!-- 3 -->📚 Documentation" },
{ message = "^perf", group = "<!-- 4 -->⚡ Performance" },
{ message = "^maint", group = "<!-- 4 -->🧰 Maintenance" },
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -47,6 +47,8 @@ Benefits of using ScaleDP
detectors.md
ocr.md
embeddings.md
splitters.md
schemas.md
show_utils.md
release_notes.md

150 changes: 150 additions & 0 deletions docs/source/models/splitters/base_splitter.md
@@ -0,0 +1,150 @@
(BaseSplitter)=
# BaseSplitter

## Overview

`BaseSplitter` is the abstract base class for all text splitter transformers in the ScaleDP library. It extends PySpark's `Transformer` class and provides common functionality for splitting documents into chunks. This class defines the interface and shared parameters for all splitter implementations.

## Inheritance

- Extends PySpark's `Transformer` for ML pipeline compatibility.
- Mixes in the following parameter mixins:
- `HasInputCol` - Input column containing documents
- `HasOutputCol` - Output column for results
- `HasKeepInputData` - Whether to preserve input data
- `HasChunkSize` - Maximum chunk size
- `HasChunkOverlap` - Overlap between chunks
- `HasNumPartitions` - Partition control
- `HasPartitionMap` - Enable distributed processing mode
- `HasWhiteList` - Whitelist filtering support

## Key Features

- **PySpark Integration**: Full compatibility with PySpark ML pipelines
- **Serialization**: Support for reading and writing model parameters
- **Flexible Configuration**: Extensive parameters for customization
- **Extensible Design**: Foundation for specialized splitter implementations
- **Batch Processing**: Support for both local and distributed processing modes

## Class Hierarchy

```
BaseSplitter
└── BaseTextSplitter
    └── TextSplitter (concrete implementation)
```

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------------------|
| inputCol | str | Input column name | varies by implementation |
| outputCol | str | Output column name | varies by implementation |
| keepInputData | bool | Keep input columns in output | True |
| chunkSize | int | Size of each chunk | 500 |
| chunkOverlap | int | Overlap between consecutive chunks | 0 |
| numPartitions | int | Number of partitions | 1 |
| partitionMap | bool | Use partitioned mapping (pandas_udf mode) | False |
| whiteList | list | Whitelist of allowed items | [] |
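
To make the interaction of `chunkSize` and `chunkOverlap` concrete, here is a minimal pure-Python sketch of the sliding-window arithmetic. This is not ScaleDP's implementation, only an illustration of how the two parameters relate:

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=0):
    """Return fixed-size windows of text; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With a 4-character window and a 2-character overlap, consecutive
# chunks share their boundary characters:
sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note that a larger overlap produces more chunks for the same text, so overlap trades storage and compute for cross-chunk context.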

## Abstract Methods

Subclasses must implement the following abstract methods:

### transform(dataset)
Transforms a Spark DataFrame by applying the splitter logic.

**Parameters:**
- `dataset` (pyspark.sql.DataFrame): Input DataFrame

**Returns:**
- (pyspark.sql.DataFrame): DataFrame with split results

## Usage Guidelines

`BaseSplitter` is an abstract class and should not be instantiated directly. Instead, use concrete implementations like:

- [`TextSplitter`](./text_splitter.md) - Semantic text splitting

```python
# Correct: Use concrete implementation
from scaledp.models.splitters.TextSplitter import TextSplitter

splitter = TextSplitter(chunk_size=500, chunk_overlap=50)
```

```python
# Incorrect: Do not instantiate BaseSplitter directly
from scaledp.models.splitters.BaseSplitter import BaseSplitter

# This will raise an error
splitter = BaseSplitter() # Error!
```
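
The abstract-base pattern described above can be modeled in plain Python. The classes below are simplified stand-ins (the real ones also mix in Spark ML machinery), shown only to illustrate why direct instantiation fails:

```python
from abc import ABC, abstractmethod

class BaseSplitterSketch(ABC):
    """Simplified stand-in for BaseSplitter's abstract contract."""

    @abstractmethod
    def split(self, text):
        """Subclasses must provide the actual splitting logic."""

class TextSplitterSketch(BaseSplitterSketch):
    """Concrete subclass: instantiation is allowed once split() exists."""

    def split(self, text):
        return text.split("\n\n")  # naive paragraph split, for illustration

TextSplitterSketch().split("a\n\nb")   # works: ['a', 'b']
# BaseSplitterSketch() would raise TypeError: can't instantiate an
# abstract class with an unimplemented abstract method.
```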

## Creating Custom Splitters

To create a custom splitter, inherit from `BaseSplitter` or `BaseTextSplitter`:

```python
from scaledp.models.splitters.BaseTextSplitter import BaseTextSplitter
from scaledp.schemas.Document import Document
from scaledp.schemas.TextChunks import TextChunks

class CustomSplitter(BaseTextSplitter):
    """Custom splitter implementation."""

    def split(self, document: Document) -> TextChunks:
        """Implement custom splitting logic."""
        # Your splitting algorithm here; _split_text stands in for
        # a helper method you define yourself.
        chunks = self._split_text(document.text)
        return TextChunks(
            path=document.path,
            chunks=chunks,
            exception="",
            processing_time=0.0,
        )
```

## Pipeline Integration

`BaseSplitter` and its subclasses are designed to work seamlessly with PySpark pipelines:

```python
from pyspark.ml import Pipeline
from scaledp.models.splitters.TextSplitter import TextSplitter

# Create pipeline stages
splitter = TextSplitter(chunk_size=500)

# Create and fit pipeline
pipeline = Pipeline(stages=[splitter])
model = pipeline.fit(training_data)

# Transform data
results = model.transform(test_data)
```

## Serialization

All splitters support PySpark's read/write functionality:

```python
# Save a model
splitter = TextSplitter(chunk_size=500)
splitter.write().overwrite().save("path/to/splitter")

# Load a model
loaded_splitter = TextSplitter.load("path/to/splitter")
```

## Related Classes

- [`BaseTextSplitter`](./base_text_splitter.md) - Abstract base for text splitters
- [`TextSplitter`](./text_splitter.md) - Concrete semantic text splitter implementation
- [`Document`](#Document) - Input document schema
- [`TextChunks`](#TextChunks) - Output text chunks schema

## See Also

- [PySpark ML Transformers](https://spark.apache.org/docs/latest/ml-pipeline.html)
- [Transformer API](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Transformer.html)
89 changes: 89 additions & 0 deletions docs/source/models/splitters/base_text_splitter.md
@@ -0,0 +1,89 @@
(BaseTextSplitter)=
# BaseTextSplitter

## Overview

`BaseTextSplitter` is an abstract base class for text splitting transformers in PySpark. It provides common functionality for splitting documents into chunks while preserving metadata like file paths and document types. It is designed for extensibility and serves as the foundation for concrete text splitting implementations like [`TextSplitter`](./text_splitter.md).

The splitter operates on **Document struct columns**, which contain structured data including text content, file path, document type, and bounding boxes.

## Inheritance

- Inherits from [`BaseSplitter`](./base_splitter.md), which provides core Spark ML transformer functionality and schema handling.
- Mixes in `HasColumnValidator` and `HasDefaultEnum` for validation and enumeration support.
- Extends `DefaultParamsReadable` and `DefaultParamsWritable` for serialization support.

## Key Features

- **Document-Centric**: Works with Document struct columns containing path, text, type, and bboxes
- **Metadata Preservation**: Maintains document metadata (path, document type) through the splitting process
- **Flexible Chunking**: Configurable chunk size and overlap for text splitting
- **Distributed Processing**: Supports both regular UDF and pandas_udf (partitionMap) modes for Spark batch processing
- **Error Handling**: Captures and reports processing exceptions in output

## Usage Example

```python
from scaledp.models.splitters.TextSplitter import TextSplitter
from scaledp.schemas.Document import Document

# Create a splitter with custom parameters
splitter = TextSplitter(
    inputCol="document",    # Column containing Document structs
    outputCol="chunks",     # Output column for TextChunks
    chunk_size=500,         # Characters per chunk
    chunk_overlap=50,       # Character overlap between chunks
)

# Use in a Spark pipeline
result_df = splitter.transform(input_df)
```

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------------------|
| inputCol | str | Input Document struct column | "document" |
| outputCol | str | Output column for TextChunks results | "chunks" |
| keepInputData | bool | Keep input document column in output | True |
| chunk_size | int | Size of each chunk in characters | 500 |
| chunk_overlap | int | Number of characters to overlap between chunks | 0 |
| numPartitions | int | Number of partitions for coalescing | 1 |
| partitionMap | bool | Use pandas_udf for distributed processing | False |

## Input Schema

The input column must contain a **Document struct**. For detailed schema information, see [Document Schema Documentation](../../schemas/document.md).

**Key Fields:**
- `path` - File path or document identifier
- `text` - Text content to split
- `type` - Document type (e.g., "text", "pdf")
- `bboxes` - Bounding boxes (empty for text documents)
- `exception` - Error message if any (optional)

## Output Schema

The output column contains a **TextChunks struct**. For detailed schema information, see [TextChunks Schema Documentation](../../schemas/text_chunks.md).

**Key Fields:**
- `path` - Original document path
- `chunks` - List of text chunks
- `exception` - Error message if splitting failed
- `processing_time` - Time taken to split document (seconds)
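
Downstream code typically flattens TextChunks results into one row per chunk. In Spark you would use `explode` on the `chunks` field; the pure-Python sketch below (with made-up example records) shows the shape of the data before and after:

```python
# Hypothetical TextChunks records mirroring the fields listed above.
results = [
    {"path": "a.txt", "chunks": ["first chunk", "second chunk"],
     "exception": "", "processing_time": 0.01},
    {"path": "b.txt", "chunks": [], "exception": "empty document",
     "processing_time": 0.0},
]

# Keep only successful documents, one (path, chunk) pair per chunk.
flat = [
    (r["path"], chunk)
    for r in results
    if r["exception"] == ""
    for chunk in r["chunks"]
]
# flat == [('a.txt', 'first chunk'), ('a.txt', 'second chunk')]
```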

## Notes

- The splitter is abstract and cannot be instantiated directly. Use concrete implementations like `TextSplitter`.
- Input documents must contain text and path information in the Document struct format.
- Chunk overlap helps preserve context across chunk boundaries, which benefits downstream semantic tasks.
- The `partitionMap` option enables pandas_udf mode for better performance on large datasets but requires careful configuration.
- All errors during splitting are captured and reported in the `exception` field of the output.
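
The error-capture behavior described in the last note can be sketched in plain Python. This is a simplified model, not the library's code: a failure is recorded in the result's `exception` field instead of being raised, so one bad document does not abort the whole job:

```python
import time

def safe_split(split_fn, path, text):
    """Run a splitting function, recording any exception in the result
    instead of propagating it (fields mirror the TextChunks schema)."""
    start = time.monotonic()
    try:
        chunks, exception = split_fn(text), ""
    except Exception as e:  # capture everything, as the docs describe
        chunks, exception = [], str(e)
    return {
        "path": path,
        "chunks": chunks,
        "exception": exception,
        "processing_time": time.monotonic() - start,
    }

ok = safe_split(lambda t: [t[:5], t[5:]], "a.txt", "hello world")
bad = safe_split(lambda t: 1 / 0, "b.txt", "hello")
# ok carries two chunks and an empty exception; bad carries no
# chunks and the error message in its exception field.
```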

## Related Classes

- [`TextSplitter`](./text_splitter.md) - Concrete implementation using semantic text splitting
- [`BaseSplitter`](./base_splitter.md) - Base transformer for all splitter implementations
- [`Document`](../../schemas/document.md) - Input schema class
- [`TextChunks`](../../schemas/text_chunks.md) - Output schema class
- [`Box`](../../schemas/box.md) - Bounding box schema class