1 change: 1 addition & 0 deletions .gitignore
@@ -141,3 +141,4 @@ k8s_results/
rocprof_output/
slurm_output/
MagicMock/
.madengine_session_start
83 changes: 62 additions & 21 deletions README.md
@@ -31,12 +31,13 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
## ✨ Key Features

- **🚀 Modern CLI** - Rich terminal output with Typer and Rich
- **🎯 Simple Deployment** - Run locally or deploy to Kubernetes/SLURM via configuration
- **🎯 Flexible Deployment** - Run locally or on Kubernetes, SLURM, or Bare Metal VMs with guaranteed isolation
- **🔧 Distributed Launchers** - Full support for torchrun, DeepSpeed, Megatron-LM, TorchTitan, vLLM, SGLang
- **🐳 Container-Native** - Docker-based execution with GPU support (ROCm, CUDA)
- **🖥️ VM Isolation** - Bare metal execution with ephemeral VMs for complete environment cleanup
- **📊 Performance Tools** - Integrated profiling with rocprof, rocblas, MIOpen, RCCL tracing
- **🔍 Environment Validation** - TheRock ROCm detection and validation tools
- **⚙️ Intelligent Defaults** - Minimal K8s configs with automatic preset application
- **⚙️ Intelligent Defaults** - Minimal configs with automatic preset application

## 🚀 Quick Start

@@ -99,7 +100,8 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen
| [Installation](docs/installation.md) | Complete installation instructions |
| [Usage Guide](docs/usage.md) | Commands, workflows, and examples |
| **[CLI Reference](docs/cli-reference.md)** | **Detailed command options and examples** |
| [Deployment](docs/deployment.md) | Kubernetes and SLURM deployment |
| [Deployment](docs/deployment.md) | Kubernetes, SLURM, and Bare Metal VM deployment |
| [Bare Metal VM](docs/baremetal-vm.md) | VM-based execution with isolation and cleanup |
| [Configuration](docs/configuration.md) | Advanced configuration options |
| [Batch Build](docs/batch-build.md) | Selective builds for CI/CD |
| [Launchers](docs/launchers.md) | Distributed training frameworks |
@@ -137,14 +139,14 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen
│ • RunOrchestrator │ │
└────────┬───────────────┘ │
│ │
┌────────┼────────
│ │ │
┌────▼───┐ ┌─▼──────┐ ┌▼─────────┐
│ Local │ │ K8s │ │ SLURM │
│ Docker │ │ Jobs │ │ Jobs │
└────┬───┘ └─┬──────┘ └┬─────────┘
│ │ │
└───────┼─────────
┌────────┼───────────────┐
│ │ │
┌────▼───┐ ┌─▼──────┐ ┌▼─────────┐ ┌▼────────────┐
│ Local │ │ K8s │ │ SLURM │ │ Bare Metal │
│ Docker │ │ Jobs │ │ Jobs │ │ VM │
└────┬───┘ └─┬──────┘ └┬─────────┘ └┬────────────┘
│ │ │
└───────┼─────────┴────────────┘
│ │
┌───────┴─────────┐ │
│ Distributed │ │
@@ -185,7 +187,7 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen
1. **CLI Layer** - User interface with 5 commands (discover, build, run, report, database)
2. **Model Discovery** - Find and validate models from MAD package
3. **Orchestration** - BuildOrchestrator & RunOrchestrator manage workflows
4. **Execution Targets** - Local Docker, Kubernetes Jobs, or SLURM Jobs
4. **Execution Targets** - Local Docker, Kubernetes Jobs, SLURM Jobs, or Bare Metal VM
5. **Distributed Launchers** - Training (torchrun, DeepSpeed, TorchTitan, Megatron-LM) and Inference (vLLM, SGLang)
6. **Performance Output** - CSV/JSON results with metrics
7. **Post-Processing** - Report generation (HTML/Email) and database upload (MongoDB)
@@ -220,14 +222,16 @@ For detailed command options, see the **[CLI Command Reference](docs/cli-referen

### Infrastructure Capabilities

| Feature | Local | Kubernetes | SLURM |
|---------|-------|-----------|-------|
| **Execution** | Docker containers | K8s Jobs | SLURM jobs |
| **Multi-Node** | ❌ | ✅ Indexed Jobs | ✅ Job arrays |
| **Resource Mgmt** | Manual | Declarative (YAML) | Batch scheduler |
| **Monitoring** | Docker logs | kubectl/dashboard | squeue/scontrol |
| **Auto-scaling** | ❌ | ✅ | ❌ |
| **Network** | Host | CNI plugin | InfiniBand/Ethernet |
| Feature | Local | Kubernetes | SLURM | Bare Metal VM |
|---------|-------|-----------|-------|---------------|
| **Execution** | Docker containers | K8s Jobs | SLURM jobs | Ephemeral VMs |
| **Multi-Node** | ❌ | ✅ Indexed Jobs | ✅ Job arrays | ❌ (single-node) |
| **Resource Mgmt** | Manual | Declarative (YAML) | Batch scheduler | VM isolation |
| **Monitoring** | Docker logs | kubectl/dashboard | squeue/scontrol | VM + Docker logs |
| **Auto-scaling** | ❌ | ✅ | ❌ | ❌ |
| **Network** | Host | CNI plugin | InfiniBand/Ethernet | VM networking |
| **GPU Support** | Passthrough | Device plugin | Direct | SR-IOV/VFIO |
| **Cleanup** | Manual | Automatic | Manual | Guaranteed |
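A comparison like this is easy to query once it is written as data. The sketch below transcribes three rows of the table into a dictionary and filters targets by capability. The dictionary layout and the `targets_with` helper are illustrative assumptions, not part of madengine.

```python
# Values transcribed from the capabilities table above. The structure and
# helper name are hypothetical and exist only for illustration.
CAPABILITIES = {
    "Local": {"multi_node": False, "auto_scaling": False, "cleanup": "Manual"},
    "Kubernetes": {"multi_node": True, "auto_scaling": True, "cleanup": "Automatic"},
    "SLURM": {"multi_node": True, "auto_scaling": False, "cleanup": "Manual"},
    "Bare Metal VM": {"multi_node": False, "auto_scaling": False, "cleanup": "Guaranteed"},
}


def targets_with(**required) -> list[str]:
    """Return the execution targets whose capabilities match every keyword."""
    return [
        name
        for name, caps in CAPABILITIES.items()
        if all(caps.get(key) == value for key, value in required.items())
    ]


print(targets_with(multi_node=True))       # ['Kubernetes', 'SLURM']
print(targets_with(cleanup="Guaranteed"))  # ['Bare Metal VM']
```

A lookup of this kind makes the trade-off explicit: multi-node runs rule out the Bare Metal VM target, while a hard cleanup requirement rules out everything else.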

## 💻 Usage Examples

@@ -320,6 +324,42 @@ madengine run --manifest-file build_manifest.json \
}'
```

### Bare Metal VM Execution

```bash
# SSH to bare metal node
ssh admin@baremetal-gpu-node.example.com

# Create config with VM isolation
cat > baremetal-vm-config.json << 'EOF'
{
  "baremetal_vm": {
    "enabled": true,
    "base_image": "/var/lib/libvirt/images/ubuntu-22.04-rocm.qcow2",
    "vcpus": 32,
    "memory": "128G",
    "gpu_passthrough": {
      "mode": "sriov",
      "gpu_vendor": "AMD"
    }
  },
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU"
}
EOF

# Run with VM isolation (guaranteed cleanup)
madengine run --tags model \
  --additional-context-file baremetal-vm-config.json \
  --timeout 3600
```
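Catching a malformed config before a long run starts can save time. The following is a minimal sketch of a pre-flight check, assuming the key names shown in the example config. The validation rules (required keys, the `M`/`G` memory suffixes, positive `vcpus`) are assumptions for illustration and do not reflect madengine's actual schema.

```python
import json
import re

# Key names come from the example config above. Treating them as required,
# and the accepted memory suffixes, are assumptions made for this sketch.
REQUIRED_VM_KEYS = ("enabled", "base_image", "vcpus", "memory")
MEMORY_PATTERN = re.compile(r"^(\d+)([MG])$")


def parse_memory_mib(value: str) -> int:
    """Convert a size string such as '128G' or '512M' to MiB."""
    match = MEMORY_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognised memory size: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 1024 if unit == "G" else amount


def check_vm_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the config passed."""
    vm = config.get("baremetal_vm")
    if not isinstance(vm, dict):
        return ["missing 'baremetal_vm' section"]
    problems = [f"baremetal_vm.{key} is missing" for key in REQUIRED_VM_KEYS if key not in vm]
    if "memory" in vm:
        try:
            parse_memory_mib(str(vm["memory"]))
        except ValueError as exc:
            problems.append(str(exc))
    vcpus = vm.get("vcpus")
    # bool is a subclass of int in Python, so reject it explicitly.
    if "vcpus" in vm and (isinstance(vcpus, bool) or not isinstance(vcpus, int) or vcpus < 1):
        problems.append("baremetal_vm.vcpus must be a positive integer")
    return problems


example = json.loads("""
{
  "baremetal_vm": {
    "enabled": true,
    "base_image": "/var/lib/libvirt/images/ubuntu-22.04-rocm.qcow2",
    "vcpus": 32,
    "memory": "128G",
    "gpu_passthrough": {"mode": "sriov", "gpu_vendor": "AMD"}
  },
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU"
}
""")
print(check_vm_config(example))  # []
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.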

**Benefits:**
- ✅ Guaranteed clean state after each run
- ✅ Complete environment isolation
- ✅ Near-native GPU performance (95-98%)
- ✅ Works with existing Docker images
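
The "clean state after each run" idea can be illustrated with a scoped resource that is torn down whether or not the workload succeeds. The sketch below is hypothetical and is not madengine's implementation: a temporary directory stands in for the ephemeral VM's disk, and `ephemeral_workspace` is an invented name. A `try`/`finally` block covers normal exits and exceptions but not a hard kill of the process, so a real VM backend would presumably need additional cleanup on the host.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ephemeral_workspace(prefix: str = "vm-run-"):
    """Yield a scratch directory that is removed even if the run fails."""
    workdir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield workdir
    finally:
        # Runs on normal exit and when an exception propagates out of the block.
        shutil.rmtree(workdir, ignore_errors=True)


seen = []
try:
    with ephemeral_workspace() as workdir:
        seen.append(workdir)
        (workdir / "results.csv").write_text("model,throughput\n")
        raise RuntimeError("simulated benchmark failure")
except RuntimeError:
    pass

print(seen[0].exists())  # False: the workspace is gone despite the failure
```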

### Common Workflows

**Development → Testing → Production:**
@@ -597,7 +637,8 @@ MIT License - see [LICENSE](LICENSE) file for details.
### Documentation
- **[CLI Reference](docs/cli-reference.md)** - Complete command options
- **[Usage Guide](docs/usage.md)** - Workflows and examples
- **[Deployment Guide](docs/deployment.md)** - Kubernetes/SLURM deployment
- **[Deployment Guide](docs/deployment.md)** - Kubernetes/SLURM/Bare Metal VM deployment
- **[Bare Metal VM Guide](docs/baremetal-vm.md)** - VM-based execution with isolation
- **[Configuration Guide](docs/configuration.md)** - Advanced configuration
- **[All Docs](docs/)** - Complete documentation index

7 changes: 6 additions & 1 deletion docs/README.md
@@ -17,7 +17,8 @@ Complete documentation for madengine - AI model automation and distributed bench
|-------|-------------|
| [Configuration](configuration.md) | Advanced configuration options |
| [Batch Build](batch-build.md) | Selective builds with batch manifests |
| [Deployment](deployment.md) | Kubernetes and SLURM deployment |
| [Deployment](deployment.md) | Kubernetes, SLURM, and Bare Metal VM deployment |
| **[Bare Metal VM](baremetal-vm.md)** | **VM-based execution with isolation and guaranteed cleanup** |
| [Launchers](launchers.md) | Multi-node training frameworks |

### Advanced Topics
@@ -138,6 +139,9 @@ Complete documentation for madengine - AI model automation and distributed bench
**Deploy to SLURM**
→ [Configuration](configuration.md) → [Deployment](deployment.md)

**Run on bare metal with VM isolation**
→ [Bare Metal VM Guide](baremetal-vm.md)

**Build multiple models selectively (CI/CD)**
→ [Batch Build](batch-build.md)

@@ -174,6 +178,7 @@ madengine operates within the MAD (Model Automation and Dashboarding) ecosystem.
- **Local** - Docker containers on local machine
- **Kubernetes** - Cloud-native container orchestration
- **SLURM** - HPC cluster job scheduling
- **Bare Metal VM** - VM-based execution with isolation and cleanup

### Distributed Launchers
