feat: Implement all-smi wrapper for unified AI accelerator monitoring #2

@fregataa

Description

Summary

Implement a wrapper layer to integrate all-smi with AAMI for unified AI accelerator monitoring. all-smi provides Prometheus-compatible metrics for various AI accelerators (NVIDIA, AMD, Intel Gaudi, Google TPU, Tenstorrent, Rebellions, Furiosa).

Background

  • all-smi: Open-source tool that exposes Prometheus metrics for multiple AI accelerator vendors
  • Approach: Thin wrapper instead of full internalization to minimize maintenance overhead
  • Benefit: Single exporter for 7+ accelerator types with native Prometheus integration

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Config Server                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │ Target Registry │  │ Exporter Config │  │  SD Generator   │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└───────────────────────────────────┬─────────────────────────────┘
                                    │ HTTP API
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Target Node                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │   all-smi       │  │  aami-agent     │  │ Node Exporter   │ │
│  │   (port 9401)   │  │  (wrapper)      │  │ (port 9100)     │ │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────┘ │
│           │                    │                                 │
│           │  manage/monitor    │                                 │
│           ◄────────────────────┘                                 │
└─────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼ scrape
┌─────────────────────────────────────────────────────────────────┐
│                       Prometheus                                 │
└─────────────────────────────────────────────────────────────────┘

Implementation Phases

Phase 1: Foundation (1 week)

  • Extend ExporterType domain model to include all_smi
  • Add config-server API for exporter configuration per target
  • Extend SD Generator to include all-smi endpoints
  • Define default port (9401) for all-smi
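
The SD Generator extension in Phase 1 amounts to emitting one more target group per node. A minimal sketch, assuming the Prometheus file_sd/http_sd target-group shape; `sdTarget` and `allSMITargets` are illustrative names, not the actual AAMI types in service_discovery.go:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sdTarget mirrors the Prometheus file_sd / http_sd target-group shape.
type sdTarget struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// allSMITargets builds one SD entry per host on the all-smi default
// port (9401). A real implementation would read the port from the
// per-target exporter configuration instead of hard-coding it.
func allSMITargets(hosts []string) []sdTarget {
	groups := make([]sdTarget, 0, len(hosts))
	for _, h := range hosts {
		groups = append(groups, sdTarget{
			Targets: []string{fmt.Sprintf("%s:9401", h)},
			Labels:  map[string]string{"exporter": "all_smi"},
		})
	}
	return groups
}

func main() {
	out, _ := json.Marshal(allSMITargets([]string{"node-a", "node-b"}))
	fmt.Println(string(out))
}
```

Prometheus can then pick these groups up through the same SD mechanism already used for node_exporter targets.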

Phase 2: Installation Automation (3-4 days)

  • Create install-all-smi.sh script (apt/brew/binary)
  • Create all-smi.service systemd unit file
  • Integrate all-smi installation option into bootstrap.sh
  • Implement accelerator auto-detection (nvidia-smi, rocm-smi, etc.)
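
Auto-detection can be as simple as probing PATH for the vendor CLIs. A sketch of the idea; the tool-to-vendor mapping below is an assumption and would need to grow with the vendors bootstrap.sh supports:

```go
package main

import (
	"fmt"
	"os/exec"
)

// detectAccelerators probes PATH for vendor CLI tools as a cheap proxy
// for an installed accelerator stack. The probe list is illustrative.
func detectAccelerators() []string {
	probes := []struct{ tool, vendor string }{
		{"nvidia-smi", "nvidia"},
		{"rocm-smi", "amd"},
		{"hl-smi", "intel_gaudi"},
		{"furiosa-smi", "furiosa"},
	}
	var found []string
	for _, p := range probes {
		// exec.LookPath succeeds only if the binary is on PATH.
		if _, err := exec.LookPath(p.tool); err == nil {
			found = append(found, p.vendor)
		}
	}
	return found
}

func main() {
	fmt.Println(detectAccelerators())
}
```

bootstrap.sh would run the equivalent check and skip the all-smi install when no accelerator is found.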

Phase 3: Health Check & Management (3-4 days)

  • Create health check script for all-smi process and metrics endpoint
  • Implement auto-restart via systemd watchdog
  • Add version management in config-server
  • Support automatic updates

Phase 4: Dashboard & Alerts (3-4 days)

  • Create Grafana dashboard for all-smi metrics
  • Define alert rules for accelerators (temperature, utilization, memory)
  • Add accelerator alert templates to config-server

New Directories to Create

The following directories need to be created as part of this implementation:

Directory                       Purpose
dashboards/                     Grafana dashboard JSON files
scripts/systemd/                systemd service unit files
deploy/ansible/roles/all-smi/   Ansible role for all-smi deployment
deploy/kubernetes/all-smi/      Kubernetes manifests for all-smi

File Structure

services/
├── config-server/
│   ├── internal/
│   │   ├── domain/
│   │   │   └── exporter.go          # Add all_smi to ExporterType
│   │   └── service/
│   │       └── service_discovery.go # Generate all-smi SD targets
│   └── configs/
│       └── defaults/
│           └── alert-templates.yaml # Add accelerator alert templates

scripts/
├── node/
│   ├── install-all-smi.sh          # Installation script
│   ├── all-smi-health-check.sh     # Health check
│   └── bootstrap.sh                # Add all-smi option
└── systemd/                        # NEW DIRECTORY
    └── all-smi.service             # systemd service

deploy/
├── ansible/
│   └── roles/
│       └── all-smi/                # Ansible role (NEW)
└── kubernetes/
    └── all-smi/
        └── daemonset.yaml          # K8s DaemonSet (NEW)

dashboards/                         # NEW DIRECTORY
└── all-smi-overview.json           # Grafana dashboard

Code Changes

ExporterType Extension

// internal/domain/exporter.go
type ExporterType string

const (
    ExporterTypeNodeExporter ExporterType = "node_exporter"
    ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
    ExporterTypeAllSMI       ExporterType = "all_smi"      // New
    ExporterTypeCustom       ExporterType = "custom"
)

Default Ports

Exporter        Default Port
node_exporter   9100
dcgm_exporter   9400
all_smi         9401
custom          9090

Note: all-smi uses port 9401 to avoid conflict with dcgm_exporter (9400)
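
The port table can live next to the ExporterType constants so defaults stay in one place. A sketch; `defaultPorts` is an illustrative name, not an existing variable in exporter.go:

```go
package main

import "fmt"

type ExporterType string

const (
	ExporterTypeNodeExporter ExporterType = "node_exporter"
	ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
	ExporterTypeAllSMI       ExporterType = "all_smi"
	ExporterTypeCustom       ExporterType = "custom"
)

// defaultPorts encodes the table above; 9401 keeps all_smi clear of
// dcgm_exporter's 9400 on nodes running both.
var defaultPorts = map[ExporterType]int{
	ExporterTypeNodeExporter: 9100,
	ExporterTypeDCGMExporter: 9400,
	ExporterTypeAllSMI:       9401,
	ExporterTypeCustom:       9090,
}

func main() {
	fmt.Println(defaultPorts[ExporterTypeAllSMI]) // 9401
}
```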

all-smi vs DCGM Exporter

Relationship

  • DCGM Exporter: NVIDIA-specific, uses dcgm_* metric prefix (e.g., dcgm_gpu_temp, dcgm_fb_used)
  • all-smi: Multi-vendor support, uses allsmi_* metric prefix

Recommendation

Scenario                                  Recommended Exporter
NVIDIA GPU only, need deep DCGM metrics   dcgm_exporter
Multi-vendor accelerators                 all_smi
Mixed environment                         Both (different ports)

Alert Template Migration

Current alert templates use DCGM metrics. New all-smi templates will use:

# Example: all-smi GPU temperature alert
- id: allsmi_high_gpu_temperature
  name: High GPU Temperature (all-smi)
  query_template: |
    allsmi_gpu_temperature > {{ .threshold }}

Note: Verify actual metric names from all-smi documentation before implementation.
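
The `{{ .threshold }}` placeholder matches Go text/template syntax, so rendering a query_template into a concrete PromQL expression is a one-liner. A sketch, assuming config-server expands templates roughly this way (`renderQuery` is a hypothetical helper, not an existing function):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderQuery expands an alert query_template with per-target parameters.
func renderQuery(tmpl string, params map[string]any) (string, error) {
	t, err := template.New("query").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, params); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	q, _ := renderQuery("allsmi_gpu_temperature > {{ .threshold }}",
		map[string]any{"threshold": 85})
	fmt.Println(q) // allsmi_gpu_temperature > 85
}
```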

Supported Accelerators (via all-smi)

  • NVIDIA GPUs (CUDA)
  • AMD GPUs (ROCm)
  • NVIDIA Jetson
  • Apple Silicon GPUs
  • Intel Gaudi NPUs
  • Google Cloud TPUs
  • Tenstorrent NPUs
  • Rebellions NPUs
  • Furiosa NPUs

Dependencies

Item                           Status
config-server Exporter model   ✅ Exists (domain/exporter.go)
Service Discovery              ✅ Implemented (service/service_discovery.go)
Bootstrap script               ✅ Exists (scripts/node/bootstrap.sh)
Alert Template system          ✅ Implemented
dashboards/ directory          ⚠️ To be created
scripts/systemd/ directory     ⚠️ To be created

Estimated Timeline

Phase     Duration     Deliverables
Phase 1   1 week       config-server extension, API
Phase 2   3-4 days     Installation automation scripts
Phase 3   3-4 days     Health check, management
Phase 4   3-4 days     Dashboard, alert rules
Total     ~2.5 weeks
