# ML Workflow Guides

Learn how to build common machine learning workflows using Argo Connectors with both YAML and the Hera Python SDK.

## Overview

These guides demonstrate real-world ML workflows using the available connectors in this repository. Each guide includes:

- **Architecture diagram** showing the workflow structure
- **Python (Hera) examples** for programmatic workflows
- **YAML examples** for direct Argo Workflows usage
- **Best practices** for production deployments
- **Real code** using actual Databricks notebooks or Spark jobs

## Available Guides

### [Data Ingestion](data-ingestion.md)
Build data ingestion pipelines to load, validate, and prepare raw data for processing.

**Connectors used**: Databricks

### [Feature Engineering](feature-engineering.md)
Create scalable feature engineering pipelines for ML model training.

**Connectors used**: Databricks

### [Model Training](model-training.md)
Train machine learning models at scale with distributed compute.

**Connectors used**: Databricks

**Coming soon**: PyTorch, TensorFlow, Ray connectors

### [Batch Inference](batch-inference.md)
Score large datasets with trained models in production.

**Connectors used**: Databricks

### [Multi-Step Pipelines](multi-step-pipelines.md)
Build end-to-end ML pipelines by chaining multiple connectors together.

**Connectors used**: Databricks, Spark

### [Passing Data Between Steps](passing-data-between-steps.md)
Learn how to pass data and parameters between workflow steps.

**Connectors used**: Databricks, Spark

## ML Workflow Patterns

### ETL Pattern
```mermaid
graph LR
A[Extract] --> B[Transform]
B --> C[Load]
style B fill:#e1f5ff
```

**Guides**: Data Ingestion, Feature Engineering

### Training Pattern
```mermaid
graph LR
A[Load Data] --> B[Preprocess]
B --> C[Train Model]
C --> D[Evaluate]
D --> E[Register Model]
style C fill:#e1f5ff
```

**Guides**: Model Training

### Inference Pattern
```mermaid
graph LR
A[Load Model] --> B[Load Data]
B --> C[Score]
C --> D[Save Predictions]
style C fill:#e1f5ff
```

**Guides**: Batch Inference

### End-to-End Pipeline
```mermaid
graph LR
A[Ingest] --> B[Features]
B --> C[Train]
C --> D[Evaluate]
D --> E[Deploy]
E --> F[Inference]
style C fill:#e1f5ff
style F fill:#e1f5ff
```

**Guides**: Multi-Step Pipelines
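
The end-to-end pattern above maps naturally onto an Argo `dag` template. A minimal YAML sketch (the template names `ingest`, `features`, `train`, and `evaluate` are illustrative placeholders; the Multi-Step Pipelines guide has working versions with real connector bodies):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: ingest
            template: ingest
          - name: features
            template: features
            dependencies: [ingest]
          - name: train
            template: train
            dependencies: [features]
          - name: evaluate
            template: evaluate
            dependencies: [train]
```

Each task's `template` would be one of the connector templates (Databricks or Spark); `dependencies` encodes the arrows in the diagram.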

## Choosing the Right Connector

### For Databricks Users
Use the **Databricks connector** when you:
- Already have notebooks in Databricks
- Need managed Spark clusters
- Want serverless compute
- Require Databricks-specific features (Unity Catalog, Delta Lake, etc.)

### For Kubernetes-Native Spark
Use the **Apache Spark connector** when you:
- Want to run Spark entirely on Kubernetes
- Don't need Databricks-specific features
- Have existing Spark JARs or Python scripts
- Prefer open-source tooling

### Combining Connectors
Many workflows use both:
- Databricks for interactive development and notebooks
- Spark for production batch jobs
- Different connectors for different stages of the pipeline

## Best Practices Across All Guides

### 1. Parameterize Everything
Use workflow parameters for configuration:
```python
from hera.workflows import Parameter

# Passed as Workflow(..., arguments=...) in Hera
arguments = [
    Parameter(name="environment", value="production"),
    Parameter(name="date", value="2024-01-01"),
]
```

### 2. Handle Failures
Add retry strategies and error handling to production workflows.
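
In YAML, retries are declared per template with `retryStrategy`. A sketch using standard Argo fields (the `train` template name is illustrative):

```yaml
templates:
  - name: train
    retryStrategy:
      limit: "3"
      retryPolicy: OnFailure   # retry failed pods, not workflow errors
      backoff:
        duration: "30s"
        factor: "2"
```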

### 3. Monitor Progress
Use workflow outputs to track job progress and results.
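
One common approach is to surface a job identifier or result as an output parameter that later steps (or external tooling) can read. A hedged sketch, with an illustrative image and file path:

```yaml
- name: submit-job
  container:
    image: my-runner:latest          # illustrative image
    command: [sh, -c]
    args: ["echo run-12345 > /tmp/run-id.txt"]
  outputs:
    parameters:
      - name: run-id
        valueFrom:
          path: /tmp/run-id.txt      # downstream: {{steps.submit-job.outputs.parameters.run-id}}
```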

### 4. Optimize Resources
Choose appropriate cluster sizes and scaling strategies for your workload.
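
For workflow-native steps, resource sizing can be declared directly on the container; connector-managed compute (e.g. Databricks clusters) is sized through the connector's own parameters instead. An illustrative sketch:

```yaml
- name: preprocess
  container:
    image: my-preprocessor:latest    # illustrative image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        memory: 8Gi
```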

### 5. Use Version Control
Store your workflow definitions in Git alongside your data/ML code.

## Next Steps

- Start with [Data Ingestion](data-ingestion.md) to build your first pipeline
- Learn about [Passing Data Between Steps](passing-data-between-steps.md) for complex workflows
- Explore [connector documentation](../connectors/README.md) for detailed parameter references