Summary
Implement a wrapper layer to integrate all-smi with AAMI for unified AI accelerator monitoring. all-smi provides Prometheus-compatible metrics for various AI accelerators (NVIDIA, AMD, Intel Gaudi, Google TPU, Tenstorrent, Rebellions, Furiosa).
Background
- all-smi: Open-source tool that exposes Prometheus metrics for multiple AI accelerator vendors
- Approach: Thin wrapper instead of full internalization to minimize maintenance overhead
- Benefit: Single exporter for 7+ accelerator types with native Prometheus integration
Architecture
```text
┌─────────────────────────────────────────────────────────────────┐
│                          Config Server                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Target Registry │  │ Exporter Config │  │  SD Generator   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└───────────────────────────────────┬─────────────────────────────┘
                                    │ HTTP API
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Target Node                           │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │     all-smi     │  │   aami-agent    │  │  Node Exporter  │  │
│  │   (port 9401)   │  │    (wrapper)    │  │   (port 9100)   │  │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────┘  │
│           │                    │                                │
│           │   manage/monitor   │                                │
│           ◄────────────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼ scrape
┌─────────────────────────────────────────────────────────────────┐
│                           Prometheus                            │
└─────────────────────────────────────────────────────────────────┘
```
Implementation Phases
Phase 1: Foundation (1 week)
- Extend `ExporterType` domain model to include `all_smi`
- Add config-server API for exporter configuration per target
- Extend SD Generator to include all-smi endpoints
- Define default port (9401) for all-smi
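As a rough illustration of the SD Generator step, the sketch below builds Prometheus file-based service discovery groups for all-smi endpoints. The `Target` struct, the `AllSMIGroups` function, and the label names are hypothetical and do not reflect the existing `service_discovery.go`; only the `targets`/`labels` JSON shape and the proposed port 9401 come from Prometheus and this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// Target is a hypothetical, simplified view of a node in the Target Registry.
type Target struct {
	Hostname string
	Address  string
}

// SDGroup mirrors one entry of a Prometheus file_sd JSON document.
type SDGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// allSMIPort is the default port proposed in this issue.
const allSMIPort = 9401

// AllSMIGroups builds one file_sd group per registered target,
// pointing at that node's all-smi endpoint.
func AllSMIGroups(targets []Target) []SDGroup {
	groups := make([]SDGroup, 0, len(targets))
	for _, t := range targets {
		groups = append(groups, SDGroup{
			Targets: []string{net.JoinHostPort(t.Address, strconv.Itoa(allSMIPort))},
			Labels: map[string]string{
				"exporter": "all_smi", // assumed label name
				"hostname": t.Hostname,
			},
		})
	}
	return groups
}

func main() {
	groups := AllSMIGroups([]Target{{Hostname: "gpu-node-01", Address: "10.0.0.5"}})
	out, err := json.MarshalIndent(groups, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Using `net.JoinHostPort` rather than string concatenation keeps IPv6 addresses correctly bracketed.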
Phase 2: Installation Automation (3-4 days)
- Create `install-all-smi.sh` script (apt/brew/binary)
- Create `all-smi.service` systemd unit file
- Integrate all-smi installation option into bootstrap.sh
- Implement accelerator auto-detection (nvidia-smi, rocm-smi, etc.)
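One possible shape for the auto-detection step is to probe for each vendor's CLI on `PATH`. The sketch below is an assumption about how this could look, not the planned implementation: the `Probe` type and `DetectAccelerators` function are hypothetical, and only `nvidia-smi` and `rocm-smi` are taken from this issue. Finding a CLI suggests the vendor stack is installed, but it does not prove that hardware is present.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Probe pairs a vendor label with the CLI whose presence hints
// that the vendor's software stack is installed.
type Probe struct {
	Vendor string
	Binary string
}

// defaultProbes lists only the tools named in this issue.
var defaultProbes = []Probe{
	{Vendor: "nvidia", Binary: "nvidia-smi"},
	{Vendor: "amd", Binary: "rocm-smi"},
}

// onPath reports whether the named executable can be found on PATH.
func onPath(name string) bool {
	_, err := exec.LookPath(name)
	return err == nil
}

// DetectAccelerators returns the vendors whose CLI is reported present by has.
// The lookup is injected so the logic can be tested without real tools installed.
func DetectAccelerators(probes []Probe, has func(string) bool) []string {
	var found []string
	for _, p := range probes {
		if has(p.Binary) {
			found = append(found, p.Vendor)
		}
	}
	return found
}

func main() {
	fmt.Println(DetectAccelerators(defaultProbes, onPath))
}
```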
Phase 3: Health Check & Management (3-4 days)
- Create health check script for all-smi process and metrics endpoint
- Implement auto-restart via systemd watchdog
- Add version management in config-server
- Support automatic updates
Phase 4: Dashboard & Alerts (3-4 days)
- Create Grafana dashboard for all-smi metrics
- Define alert rules for accelerators (temperature, utilization, memory)
- Add accelerator alert templates to config-server
New Directories to Create
The following directories need to be created as part of this implementation:
| Directory | Purpose |
|---|---|
| `dashboards/` | Grafana dashboard JSON files |
| `scripts/systemd/` | systemd service unit files |
| `deploy/ansible/roles/all-smi/` | Ansible role for all-smi deployment |
| `deploy/kubernetes/all-smi/` | Kubernetes manifests for all-smi |
File Structure
```text
services/
├── config-server/
│   ├── internal/
│   │   ├── domain/
│   │   │   └── exporter.go              # Add all_smi to ExporterType
│   │   └── service/
│   │       └── service_discovery.go     # Generate all-smi SD targets
│   └── configs/
│       └── defaults/
│           └── alert-templates.yaml     # Add accelerator alert templates
scripts/
├── node/
│   ├── install-all-smi.sh               # Installation script
│   ├── all-smi-health-check.sh          # Health check
│   └── bootstrap.sh                     # Add all-smi option
└── systemd/                             # NEW DIRECTORY
    └── all-smi.service                  # systemd service
deploy/
├── ansible/
│   └── roles/
│       └── all-smi/                     # Ansible role (NEW)
└── kubernetes/
    └── all-smi/
        └── daemonset.yaml               # K8s DaemonSet (NEW)
dashboards/                              # NEW DIRECTORY
└── all-smi-overview.json                # Grafana dashboard
```
Code Changes
ExporterType Extension
```go
// internal/domain/exporter.go
type ExporterType string

const (
	ExporterTypeNodeExporter ExporterType = "node_exporter"
	ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
	ExporterTypeAllSMI       ExporterType = "all_smi" // New
	ExporterTypeCustom       ExporterType = "custom"
)
```
Default Ports
| Exporter | Default Port |
|---|---|
| node_exporter | 9100 |
| dcgm_exporter | 9400 |
| all_smi | 9401 |
| custom | 9090 |
Note: all-smi is assigned port 9401 here to avoid a conflict with dcgm_exporter (9400).
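The port table could be encoded next to the `ExporterType` constants roughly as follows. This is a minimal sketch: the `defaultPorts` map and the `DefaultPort` helper are hypothetical names, while the type, constants, and port numbers are the ones listed in this issue.

```go
package main

import "fmt"

type ExporterType string

const (
	ExporterTypeNodeExporter ExporterType = "node_exporter"
	ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
	ExporterTypeAllSMI       ExporterType = "all_smi"
	ExporterTypeCustom       ExporterType = "custom"
)

// defaultPorts mirrors the default port table above.
var defaultPorts = map[ExporterType]int{
	ExporterTypeNodeExporter: 9100,
	ExporterTypeDCGMExporter: 9400,
	ExporterTypeAllSMI:       9401,
	ExporterTypeCustom:       9090,
}

// DefaultPort returns the default port for t, and false for an unknown type.
func DefaultPort(t ExporterType) (int, bool) {
	port, ok := defaultPorts[t]
	return port, ok
}

func main() {
	if port, ok := DefaultPort(ExporterTypeAllSMI); ok {
		fmt.Println(port)
	}
}
```

Returning a second `bool` forces callers to handle unknown exporter types explicitly instead of silently receiving port 0.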
all-smi vs DCGM Exporter
Relationship
- DCGM Exporter: NVIDIA-specific, uses `dcgm_*` metric prefix (e.g., `dcgm_gpu_temp`, `dcgm_fb_used`)
- all-smi: Multi-vendor support, uses `allsmi_*` metric prefix
Recommendation
| Scenario | Recommended Exporter |
|---|---|
| NVIDIA GPU only, need deep DCGM metrics | dcgm_exporter |
| Multi-vendor accelerators | all_smi |
| Mixed environment | Both (different ports) |
Alert Template Migration
Current alert templates use DCGM metrics. New all-smi templates will use:
```yaml
# Example: all-smi GPU temperature alert
- id: allsmi_high_gpu_temperature
  name: High GPU Temperature (all-smi)
  query_template: |
    allsmi_gpu_temperature > {{ .threshold }}
```
Note: Verify actual metric names from all-smi documentation before implementation.
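The `{{ .threshold }}` placeholder looks like Go template syntax. Assuming the alert template system renders `query_template` with Go's `text/template` (an assumption, not confirmed by this issue), the substitution could be sketched like this. `RenderQuery` is a hypothetical helper, and the metric name is the unverified one from the example above.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// RenderQuery fills a query_template with the given parameters.
// missingkey=error makes rendering fail when a referenced parameter is
// absent, instead of emitting "<no value>" into the query.
func RenderQuery(queryTemplate string, params map[string]any) (string, error) {
	tmpl, err := template.New("query").Option("missingkey=error").Parse(queryTemplate)
	if err != nil {
		return "", fmt.Errorf("parse query template: %w", err)
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, params); err != nil {
		return "", fmt.Errorf("render query template: %w", err)
	}
	// YAML block scalars end with a newline; trim it from the final query.
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	query, err := RenderQuery(
		"allsmi_gpu_temperature > {{ .threshold }}\n",
		map[string]any{"threshold": 85},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
}
```

Failing on a missing key is a deliberate choice here, since a silently malformed PromQL expression would otherwise surface only when Prometheus loads the rule.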
Supported Accelerators (via all-smi)
- NVIDIA GPUs (CUDA)
- AMD GPUs (ROCm)
- NVIDIA Jetson
- Apple Silicon GPUs
- Intel Gaudi NPUs
- Google Cloud TPUs
- Tenstorrent NPUs
- Rebellions NPUs
- Furiosa NPUs
Dependencies
| Item | Status |
|---|---|
| config-server Exporter model | ✅ Exists (domain/exporter.go) |
| Service Discovery | ✅ Implemented (service/service_discovery.go) |
| Bootstrap script | ✅ Exists (scripts/node/bootstrap.sh) |
| Alert Template system | ✅ Implemented |
| dashboards/ directory | To be created |
| scripts/systemd/ directory | To be created |
Estimated Timeline
| Phase | Duration | Deliverables |
|---|---|---|
| Phase 1 | 1 week | config-server extension, API |
| Phase 2 | 3-4 days | Installation automation scripts |
| Phase 3 | 3-4 days | Health check, management |
| Phase 4 | 3-4 days | Dashboard, alert rules |
| Total | ~2.5 weeks | |