feat: Implement all-smi wrapper for unified AI accelerator monitoring #2

@fregataa

Description

Summary

Implement a wrapper layer to integrate all-smi with AAMI for unified AI accelerator monitoring. all-smi provides Prometheus-compatible metrics for various AI accelerators (NVIDIA, AMD, Intel Gaudi, Google TPU, Tenstorrent, Rebellions, Furiosa).

Background

  • all-smi: Open-source tool that exposes Prometheus metrics for multiple AI accelerator vendors
  • Approach: Thin wrapper instead of full internalization to minimize maintenance overhead
  • Benefit: Single exporter for 7+ accelerator types with native Prometheus integration

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Config Server                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │ Target Registry │  │ Exporter Config │  │  SD Generator   │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└───────────────────────────────────┬─────────────────────────────┘
                                    │ HTTP API
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Target Node                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │   all-smi       │  │  aami-agent     │  │ Node Exporter   │ │
│  │   (port 9401)   │  │  (wrapper)      │  │ (port 9100)     │ │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────┘ │
│           │                    │                                 │
│           │  manage/monitor    │                                 │
│           ◄────────────────────┘                                 │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼ scrape
┌─────────────────────────────────────────────────────────────────┐
│                       Prometheus                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation Phases

Phase 1: Foundation (1 week)

  • Extend ExporterType domain model to include all_smi
  • Add config-server API for exporter configuration per target
  • Extend SD Generator to include all-smi endpoints
  • Define default port (9401) for all-smi
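
The SD Generator extension in Phase 1 amounts to emitting one more target group per node. A minimal sketch, assuming the Prometheus file_sd/http_sd target-group shape; `sdTarget` and `allSMITargets` are illustrative names, not the actual AAMI types in service_discovery.go:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sdTarget mirrors the Prometheus file_sd / http_sd target-group shape.
type sdTarget struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// allSMITargets builds one SD entry per host on the all-smi default
// port (9401). A real implementation would read the port from the
// per-target exporter configuration instead of hard-coding it.
func allSMITargets(hosts []string) []sdTarget {
	groups := make([]sdTarget, 0, len(hosts))
	for _, h := range hosts {
		groups = append(groups, sdTarget{
			Targets: []string{fmt.Sprintf("%s:9401", h)},
			Labels:  map[string]string{"exporter": "all_smi"},
		})
	}
	return groups
}

func main() {
	out, _ := json.Marshal(allSMITargets([]string{"node-a", "node-b"}))
	fmt.Println(string(out))
}
```

Prometheus can then pick these groups up through the same SD mechanism already used for node_exporter targets.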

Phase 2: Installation Automation (3-4 days)

  • Create install-all-smi.sh script (apt/brew/binary)
  • Create all-smi.service systemd unit file
  • Integrate all-smi installation option into bootstrap.sh
  • Implement accelerator auto-detection (nvidia-smi, rocm-smi, etc.)
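
Auto-detection can be as simple as probing PATH for the vendor CLIs. A sketch of the idea; the tool-to-vendor mapping below is an assumption and would need to grow with the vendors bootstrap.sh supports:

```go
package main

import (
	"fmt"
	"os/exec"
)

// detectAccelerators probes PATH for vendor CLI tools as a cheap proxy
// for an installed accelerator stack. The probe list is illustrative.
func detectAccelerators() []string {
	probes := []struct{ tool, vendor string }{
		{"nvidia-smi", "nvidia"},
		{"rocm-smi", "amd"},
		{"hl-smi", "intel_gaudi"},
		{"furiosa-smi", "furiosa"},
	}
	var found []string
	for _, p := range probes {
		// exec.LookPath succeeds only if the binary is on PATH.
		if _, err := exec.LookPath(p.tool); err == nil {
			found = append(found, p.vendor)
		}
	}
	return found
}

func main() {
	fmt.Println(detectAccelerators())
}
```

bootstrap.sh would run the equivalent check and skip the all-smi install when no accelerator is found.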

Phase 3: Health Check & Management (3-4 days)

  • Create health check script for all-smi process and metrics endpoint
  • Implement auto-restart via systemd watchdog
  • Add version management in config-server
  • Support automatic updates

Phase 4: Dashboard & Alerts (3-4 days)

  • Create Grafana dashboard for all-smi metrics
  • Define alert rules for accelerators (temperature, utilization, memory)
  • Add accelerator alert templates to config-server

New Directories to Create

The following directories need to be created as part of this implementation:

Directory                       Purpose
dashboards/                     Grafana dashboard JSON files
scripts/systemd/                systemd service unit files
deploy/ansible/roles/all-smi/   Ansible role for all-smi deployment
deploy/kubernetes/all-smi/      Kubernetes manifests for all-smi

File Structure

services/
├── config-server/
│   ├── internal/
│   │   ├── domain/
│   │   │   └── exporter.go          # Add all_smi to ExporterType
│   │   └── service/
│   │       └── service_discovery.go # Generate all-smi SD targets
│   └── configs/
│       └── defaults/
│           └── alert-templates.yaml # Add accelerator alert templates

scripts/
├── node/
│   ├── install-all-smi.sh          # Installation script
│   ├── all-smi-health-check.sh     # Health check
│   └── bootstrap.sh                # Add all-smi option
└── systemd/                        # NEW DIRECTORY
    └── all-smi.service             # systemd service

deploy/
├── ansible/
│   └── roles/
│       └── all-smi/                # Ansible role (NEW)
└── kubernetes/
    └── all-smi/
        └── daemonset.yaml          # K8s DaemonSet (NEW)

dashboards/                         # NEW DIRECTORY
└── all-smi-overview.json           # Grafana dashboard

Code Changes

ExporterType Extension

// internal/domain/exporter.go
type ExporterType string

const (
    ExporterTypeNodeExporter ExporterType = "node_exporter"
    ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
    ExporterTypeAllSMI       ExporterType = "all_smi"      // New
    ExporterTypeCustom       ExporterType = "custom"
)

Default Ports

Exporter        Default Port
node_exporter   9100
dcgm_exporter   9400
all_smi         9401
custom          9090

Note: all-smi uses port 9401 to avoid conflict with dcgm_exporter (9400)
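
The port table can live next to the ExporterType constants so defaults stay in one place. A sketch; `defaultPorts` is an illustrative name, not an existing variable in exporter.go:

```go
package main

import "fmt"

type ExporterType string

const (
	ExporterTypeNodeExporter ExporterType = "node_exporter"
	ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
	ExporterTypeAllSMI       ExporterType = "all_smi"
	ExporterTypeCustom       ExporterType = "custom"
)

// defaultPorts encodes the table above; 9401 keeps all_smi clear of
// dcgm_exporter's 9400 on nodes running both.
var defaultPorts = map[ExporterType]int{
	ExporterTypeNodeExporter: 9100,
	ExporterTypeDCGMExporter: 9400,
	ExporterTypeAllSMI:       9401,
	ExporterTypeCustom:       9090,
}

func main() {
	fmt.Println(defaultPorts[ExporterTypeAllSMI]) // 9401
}
```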

all-smi vs DCGM Exporter

Relationship

  • DCGM Exporter: NVIDIA-specific, uses dcgm_* metric prefix (e.g., dcgm_gpu_temp, dcgm_fb_used)
  • all-smi: Multi-vendor support, uses allsmi_* metric prefix

Recommendation

Scenario                                  Recommended Exporter
NVIDIA GPU only, need deep DCGM metrics   dcgm_exporter
Multi-vendor accelerators                 all_smi
Mixed environment                         Both (different ports)

Alert Template Migration

Current alert templates use DCGM metrics. New all-smi templates will use:

# Example: all-smi GPU temperature alert
- id: allsmi_high_gpu_temperature
  name: High GPU Temperature (all-smi)
  query_template: |
    allsmi_gpu_temperature > {{ .threshold }}

Note: Verify actual metric names from all-smi documentation before implementation.
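
The `{{ .threshold }}` placeholder matches Go text/template syntax, so rendering a query_template into a concrete PromQL expression is a one-liner. A sketch, assuming config-server expands templates roughly this way (`renderQuery` is a hypothetical helper, not an existing function):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderQuery expands an alert query_template with per-target parameters.
func renderQuery(tmpl string, params map[string]any) (string, error) {
	t, err := template.New("query").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, params); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	q, _ := renderQuery("allsmi_gpu_temperature > {{ .threshold }}",
		map[string]any{"threshold": 85})
	fmt.Println(q) // allsmi_gpu_temperature > 85
}
```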

Supported Accelerators (via all-smi)

  • NVIDIA GPUs (CUDA)
  • AMD GPUs (ROCm)
  • NVIDIA Jetson
  • Apple Silicon GPUs
  • Intel Gaudi NPUs
  • Google Cloud TPUs
  • Tenstorrent NPUs
  • Rebellions NPUs
  • Furiosa NPUs

Dependencies

Item                           Status
config-server Exporter model   ✅ Exists (domain/exporter.go)
Service Discovery              ✅ Implemented (service/service_discovery.go)
Bootstrap script               ✅ Exists (scripts/node/bootstrap.sh)
Alert Template system          ✅ Implemented
dashboards/ directory          ⚠️ To be created
scripts/systemd/ directory     ⚠️ To be created

Estimated Timeline

Phase     Duration     Deliverables
Phase 1   1 week       config-server extension, API
Phase 2   3-4 days     Installation automation scripts
Phase 3   3-4 days     Health check, management
Phase 4   3-4 days     Dashboard, alert rules
Total     ~2.5 weeks
