Merged
1 change: 1 addition & 0 deletions cliff.toml
@@ -61,6 +61,7 @@ protect_breaking_commits = false
commit_parsers = [
{ message = "^feat", group = "<!-- 0 -->🚀 Features" },
{ message = "^fix", group = "<!-- 1 -->🐛 Bug Fixes" },
{ message = "^update", group = "<!-- 2 -->🔄 Updates" },
{ message = "^doc", group = "<!-- 3 -->📚 Documentation" },
{ message = "^perf", group = "<!-- 4 -->⚡ Performance" },
{ message = "^maint", group = "<!-- 4 -->🧰 Maintenance" },
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -47,6 +47,8 @@ Benefits of using ScaleDP
detectors.md
ocr.md
embeddings.md
splitters.md
schemas.md
show_utils.md
release_notes.md

150 changes: 150 additions & 0 deletions docs/source/models/splitters/base_splitter.md
@@ -0,0 +1,150 @@
(BaseSplitter)=
# BaseSplitter

## Overview

`BaseSplitter` is the abstract base class for all text splitter transformers in the ScaleDP library. It extends PySpark's `Transformer` class and provides common functionality for splitting documents into chunks. This class defines the interface and shared parameters for all splitter implementations.

## Inheritance

- Extends PySpark's `Transformer` for ML pipeline compatibility.
- Mixes in the following parameter mixins:
- `HasInputCol` - Input column containing documents
- `HasOutputCol` - Output column for results
- `HasKeepInputData` - Whether to preserve input data
- `HasChunkSize` - Maximum chunk size
- `HasChunkOverlap` - Overlap between chunks
- `HasNumPartitions` - Partition control
- `HasPartitionMap` - Enable distributed processing mode
- `HasWhiteList` - Whitelist filtering support

## Key Features

- **PySpark Integration**: Full compatibility with PySpark ML pipelines
- **Serialization**: Support for reading and writing model parameters
- **Flexible Configuration**: Extensive parameters for customization
- **Extensible Design**: Foundation for specialized splitter implementations
- **Batch Processing**: Support for both local and distributed processing modes

## Class Hierarchy

```
BaseSplitter
└── BaseTextSplitter
    └── TextSplitter (concrete implementation)
```

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------------------|
| inputCol | str | Input column name | varies by implementation |
| outputCol | str | Output column name | varies by implementation |
| keepInputData | bool | Keep input columns in output | True |
| chunkSize | int | Size of each chunk | 500 |
| chunkOverlap | int | Overlap between consecutive chunks | 0 |
| numPartitions | int | Number of partitions | 1 |
| partitionMap | bool | Use partitioned mapping (pandas_udf mode) | False |
| whiteList | list | Whitelist of allowed items | [] |
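
To make the interaction of `chunkSize` and `chunkOverlap` concrete, here is a minimal pure-Python sketch of the sliding-window arithmetic. This is not ScaleDP's implementation, only an illustration of how the two parameters relate:

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=0):
    """Return fixed-size windows of text; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With a 4-character window and a 2-character overlap, consecutive
# chunks share their boundary characters:
sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note that a larger overlap produces more chunks for the same text, so overlap trades storage and compute for cross-chunk context.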

## Abstract Methods

Subclasses must implement the following abstract methods:

### transform(dataset)
Transforms a Spark DataFrame by applying the splitter logic.

**Parameters:**
- `dataset` (pyspark.sql.DataFrame): Input DataFrame

**Returns:**
- (pyspark.sql.DataFrame): DataFrame with split results

## Usage Guidelines

`BaseSplitter` is an abstract class and should not be instantiated directly. Instead, use concrete implementations like:

- [`TextSplitter`](./text_splitter.md) - Semantic text splitting

```python
# Correct: Use concrete implementation
from scaledp.models.splitters.TextSplitter import TextSplitter

splitter = TextSplitter(chunk_size=500, chunk_overlap=50)
```

```python
# Incorrect: Do not instantiate BaseSplitter directly
from scaledp.models.splitters.BaseSplitter import BaseSplitter

# This will raise an error
splitter = BaseSplitter() # Error!
```
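
The abstract-base pattern described above can be modeled in plain Python. The classes below are simplified stand-ins (the real ones also mix in Spark ML machinery), shown only to illustrate why direct instantiation fails:

```python
from abc import ABC, abstractmethod

class BaseSplitterSketch(ABC):
    """Simplified stand-in for BaseSplitter's abstract contract."""

    @abstractmethod
    def split(self, text):
        """Subclasses must provide the actual splitting logic."""

class TextSplitterSketch(BaseSplitterSketch):
    """Concrete subclass: instantiation is allowed once split() exists."""

    def split(self, text):
        return text.split("\n\n")  # naive paragraph split, for illustration

TextSplitterSketch().split("a\n\nb")   # works: ['a', 'b']
# BaseSplitterSketch() would raise TypeError: can't instantiate an
# abstract class with an unimplemented abstract method.
```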

## Creating Custom Splitters

To create a custom splitter, inherit from `BaseSplitter` or `BaseTextSplitter`:

```python
from scaledp.models.splitters.BaseTextSplitter import BaseTextSplitter
from scaledp.schemas.Document import Document
from scaledp.schemas.TextChunks import TextChunks

class CustomSplitter(BaseTextSplitter):
    """Custom splitter implementation."""

    def split(self, document: Document) -> TextChunks:
        """Implement custom splitting logic."""
        # Your splitting algorithm here; _split_text stands in for
        # a helper method you define yourself.
        chunks = self._split_text(document.text)
        return TextChunks(
            path=document.path,
            chunks=chunks,
            exception="",
            processing_time=0.0,
        )
```

## Pipeline Integration

`BaseSplitter` and its subclasses are designed to work seamlessly with PySpark pipelines:

```python
from pyspark.ml import Pipeline
from scaledp.models.splitters.TextSplitter import TextSplitter

# Create pipeline stages
splitter = TextSplitter(chunk_size=500)

# Create and fit pipeline
pipeline = Pipeline(stages=[splitter])
model = pipeline.fit(training_data)

# Transform data
results = model.transform(test_data)
```

## Serialization

All splitters support PySpark's read/write functionality:

```python
# Save a model
splitter = TextSplitter(chunk_size=500)
splitter.write().overwrite().save("path/to/splitter")

# Load a model
loaded_splitter = TextSplitter.load("path/to/splitter")
```

## Related Classes

- [`BaseTextSplitter`](./base_text_splitter.md) - Abstract base for text splitters
- [`TextSplitter`](./text_splitter.md) - Concrete semantic text splitter implementation
- [`Document`](#Document) - Input document schema
- [`TextChunks`](#TextChunks) - Output text chunks schema

## See Also

- [PySpark ML Transformers](https://spark.apache.org/docs/latest/ml-pipeline.html)
- [Transformer API](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Transformer.html)
89 changes: 89 additions & 0 deletions docs/source/models/splitters/base_text_splitter.md
@@ -0,0 +1,89 @@
(BaseTextSplitter)=
# BaseTextSplitter

## Overview

`BaseTextSplitter` is an abstract base class for text splitting transformers in PySpark. It provides common functionality for splitting documents into chunks while preserving metadata like file paths and document types. It is designed for extensibility and serves as the foundation for concrete text splitting implementations like [`TextSplitter`](./text_splitter.md).

The splitter operates on **Document struct columns**, which contain structured data including text content, file path, document type, and bounding boxes.

## Inheritance

- Inherits from [`BaseSplitter`](./base_splitter.md), which provides core Spark ML transformer functionality and schema handling.
- Mixes in `HasColumnValidator` and `HasDefaultEnum` for validation and enumeration support.
- Extends `DefaultParamsReadable` and `DefaultParamsWritable` for serialization support.

## Key Features

- **Document-Centric**: Works with Document struct columns containing path, text, type, and bboxes
- **Metadata Preservation**: Maintains document metadata (path, document type) through the splitting process
- **Flexible Chunking**: Configurable chunk size and overlap for text splitting
- **Distributed Processing**: Supports both regular UDF and pandas_udf (partitionMap) modes for Spark batch processing
- **Error Handling**: Captures and reports processing exceptions in output

## Usage Example

```python
from scaledp.models.splitters.TextSplitter import TextSplitter
from scaledp.schemas.Document import Document

# Create a splitter with custom parameters
splitter = TextSplitter(
    inputCol="document",    # Column containing Document structs
    outputCol="chunks",     # Output column for TextChunks
    chunk_size=500,         # Characters per chunk
    chunk_overlap=50,       # Character overlap between chunks
)

# Use in a Spark pipeline
result_df = splitter.transform(input_df)
```

## Parameters

| Parameter | Type | Description | Default |
|-------------------|---------|--------------------------------------------------|-----------------------------|
| inputCol | str | Input Document struct column | "document" |
| outputCol | str | Output column for TextChunks results | "chunks" |
| keepInputData | bool | Keep input document column in output | True |
| chunk_size | int | Size of each chunk in characters | 500 |
| chunk_overlap | int | Number of characters to overlap between chunks | 0 |
| numPartitions | int | Number of partitions for coalescing | 1 |
| partitionMap | bool | Use pandas_udf for distributed processing | False |

## Input Schema

The input column must contain a **Document struct**. For detailed schema information, see [Document Schema Documentation](../../schemas/document.md).

**Key Fields:**
- `path` - File path or document identifier
- `text` - Text content to split
- `type` - Document type (e.g., "text", "pdf")
- `bboxes` - Bounding boxes (empty for text documents)
- `exception` - Error message if any (optional)

## Output Schema

The output column contains a **TextChunks struct**. For detailed schema information, see [TextChunks Schema Documentation](../../schemas/text_chunks.md).

**Key Fields:**
- `path` - Original document path
- `chunks` - List of text chunks
- `exception` - Error message if splitting failed
- `processing_time` - Time taken to split document (seconds)
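
Downstream code typically flattens TextChunks results into one row per chunk. In Spark you would use `explode` on the `chunks` field; the pure-Python sketch below (with made-up example records) shows the shape of the data before and after:

```python
# Hypothetical TextChunks records mirroring the fields listed above.
results = [
    {"path": "a.txt", "chunks": ["first chunk", "second chunk"],
     "exception": "", "processing_time": 0.01},
    {"path": "b.txt", "chunks": [], "exception": "empty document",
     "processing_time": 0.0},
]

# Keep only successful documents, one (path, chunk) pair per chunk.
flat = [
    (r["path"], chunk)
    for r in results
    if r["exception"] == ""
    for chunk in r["chunks"]
]
# flat == [('a.txt', 'first chunk'), ('a.txt', 'second chunk')]
```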

## Notes

- The splitter is abstract and cannot be instantiated directly. Use concrete implementations like `TextSplitter`.
- Input documents must contain text and path information in the Document struct format.
- Chunk overlap helps preserve context across chunk boundaries, which benefits downstream semantic tasks.
- The `partitionMap` option enables pandas_udf mode for better performance on large datasets but requires careful configuration.
- All errors during splitting are captured and reported in the `exception` field of the output.
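
The error-capture behavior described in the last note can be sketched in plain Python. This is a simplified model, not the library's code: a failure is recorded in the result's `exception` field instead of being raised, so one bad document does not abort the whole job:

```python
import time

def safe_split(split_fn, path, text):
    """Run a splitting function, recording any exception in the result
    instead of propagating it (fields mirror the TextChunks schema)."""
    start = time.monotonic()
    try:
        chunks, exception = split_fn(text), ""
    except Exception as e:  # capture everything, as the docs describe
        chunks, exception = [], str(e)
    return {
        "path": path,
        "chunks": chunks,
        "exception": exception,
        "processing_time": time.monotonic() - start,
    }

ok = safe_split(lambda t: [t[:5], t[5:]], "a.txt", "hello world")
bad = safe_split(lambda t: 1 / 0, "b.txt", "hello")
# ok carries two chunks and an empty exception; bad carries no
# chunks and the error message in its exception field.
```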

## Related Classes

- [`TextSplitter`](./text_splitter.md) - Concrete implementation using semantic text splitting
- [`BaseSplitter`](./base_splitter.md) - Base transformer for all splitter implementations
- [`Document`](../../schemas/document.md) - Input schema class
- [`TextChunks`](../../schemas/text_chunks.md) - Output schema class
- [`Box`](../../schemas/box.md) - Bounding box schema class