feat(nodes): GPU detection and K8s node hardware info #166

Merged
thestumonkey merged 3 commits into dev from node-deploy-blue/node-info-dev
Feb 27, 2026

@thestumonkey (Member)

Summary

  • Worker GPU detection: _collect_gpu_info() queries nvidia-smi and rocm-smi to collect GPU model, VRAM, CUDA toolkit version, and ROCm version per device. Falls back to driver version if nvcc isn't available. Results surfaced in both heartbeats and the discovery /info endpoint.
  • UNode model: Added GPUDevice Pydantic model + extended UNodeCapabilities with gpu_count, gpu_devices, gpu_model, gpu_vram_mb — all optional with safe defaults (fully backward-compatible with older workers).
  • K8s model: KubernetesNode now parses nvidia.com/gpu and amd.com/gpu extended resources from node capacity into gpu_capacity_nvidia / gpu_capacity_amd.
  • Frontend: New ClusterNodeList component (lazy-loaded, expandable) embedded in each K8s cluster card showing node status, roles, CPU/mem, kubelet version, OS, and GPU badges. UNode cards on ClusterPage now display GPU model, VRAM, count, and CUDA/ROCm version when a GPU is present.
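The worker-side detection described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_collect_gpu_info()`: the exact `nvidia-smi` arguments the worker passes are not shown in this conversation, and the dictionary keys (`index`, `model`, `vram_mb`, `driver_version`) and function names are hypothetical. It covers only the NVIDIA path and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess
from typing import Dict, List

# Assumed query; the PR does not show which fields it actually requests.
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi_csv(output: str) -> List[Dict[str, object]]:
    """Parse 'name, memory.total, driver_version' CSV rows into dicts."""
    devices: List[Dict[str, object]] = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the model name cannot shift columns.
        parts = [p.strip() for p in line.rsplit(",", 2)]
        if len(parts) != 3:
            continue  # skip rows we do not understand instead of failing
        name, vram, driver = parts
        devices.append(
            {
                "index": len(devices),  # hypothetical key names
                "model": name,
                "vram_mb": int(vram) if vram.isdigit() else None,
                "driver_version": driver,
            }
        )
    return devices


def collect_nvidia_gpus() -> List[Dict[str, object]]:
    """Return detected NVIDIA GPUs, or an empty list if detection fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            NVIDIA_SMI_QUERY,
            capture_output=True,
            text=True,
            timeout=5,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi_csv(result.stdout)
```

Keeping the parser separate from the subprocess call means the parsing logic can be exercised with canned output on machines that have no GPU.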

Test plan

  • Worker with NVIDIA GPU: check heartbeat stores gpu_devices in MongoDB with model + VRAM
  • Worker with AMD GPU: check ROCm version parsed correctly
  • GET /api/kubernetes/{id}/nodes on GPU cluster → gpu_capacity_nvidia populated
  • ClusterPage: GPU node card shows model, VRAM, CUDA version
  • KubernetesClustersPage: "Nodes" toggle expands with GPU badge on GPU nodes
  • Non-GPU nodes: no GPU UI shown (conditional rendering verified)
  • grep -r "data-testid" ushadow/frontend/src/components/kubernetes/ClusterNodeList.tsx passes
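The `gpu_capacity_nvidia` check above depends on reading extended resources from a node's capacity map. A minimal sketch of that parsing, assuming the capacity arrives as a string-valued mapping (as the Kubernetes API reports quantities) and assuming a missing or malformed value maps to 0; the PR may well use a different default such as `None`, and the helper name `parse_gpu_capacity` is hypothetical:

```python
from typing import Dict, Mapping, Optional


def _to_int(value: Optional[str]) -> int:
    """Convert a capacity quantity to int, treating missing or bad values as 0."""
    try:
        return int(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return 0  # assumed default; not confirmed by the PR


def parse_gpu_capacity(capacity: Mapping[str, str]) -> Dict[str, int]:
    """Extract GPU counts from a node's status.capacity mapping."""
    return {
        "gpu_capacity_nvidia": _to_int(capacity.get("nvidia.com/gpu")),
        "gpu_capacity_amd": _to_int(capacity.get("amd.com/gpu")),
    }


# Example: a node advertising two NVIDIA GPUs and no AMD GPUs.
node_capacity = {"cpu": "16", "memory": "65843164Ki", "nvidia.com/gpu": "2"}
print(parse_gpu_capacity(node_capacity))
```

A node without a GPU device plugin simply has no `nvidia.com/gpu` or `amd.com/gpu` key, which is the case the "Non-GPU nodes" check exercises.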

🤖 Generated with Claude Code

thestumonkey and others added 3 commits February 27, 2026 09:37
- Worker: _collect_gpu_info() queries nvidia-smi/rocm-smi for model,
  VRAM, CUDA/ROCm version. get_capabilities() includes gpu_count,
  gpu_devices, gpu_model, gpu_vram_mb. handle_info() now exposes
  capabilities for discovery.
- Backend: GPUDevice model + UNodeCapabilities extended with GPU fields
  (all optional with defaults for backward compat).
- K8s model: KubernetesNode gains gpu_capacity_nvidia/amd from
  nvidia.com/gpu and amd.com/gpu extended resources.
- Frontend: KubernetesNode TS interface + kubernetesApi.listNodes().
  ClusterNodeList component (lazy-loaded, expandable, GPU badges).
  UNode cards show GPU model, VRAM, count, CUDA/ROCm version.
  KubernetesClustersPage integrates ClusterNodeList per cluster card.

All interactive elements have data-testid attributes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
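The backward-compatibility claim in the commit message (GPU fields optional with defaults) can be illustrated with a small stand-in. This sketch uses a standard-library dataclass rather than the Pydantic `UNodeCapabilities` model the PR actually extends, and the default values and the `capabilities_from_heartbeat` helper are assumptions made for illustration only:

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class Capabilities:
    """Stand-in for the GPU part of UNodeCapabilities; defaults are assumed."""

    gpu_count: int = 0
    gpu_devices: List[Dict[str, Any]] = field(default_factory=list)
    gpu_model: Optional[str] = None
    gpu_vram_mb: Optional[int] = None


def capabilities_from_heartbeat(payload: Dict[str, Any]) -> Capabilities:
    """Build capabilities from a heartbeat, ignoring keys this sketch does not model."""
    known = {f.name for f in fields(Capabilities)}
    return Capabilities(**{k: v for k, v in payload.items() if k in known})


# An older worker sends no GPU fields; every GPU attribute falls back to its default.
old_worker = capabilities_from_heartbeat({"can_run_docker": True})

# A newer worker reports one GPU.
new_worker = capabilities_from_heartbeat(
    {
        "gpu_count": 1,
        "gpu_model": "Tesla T4",
        "gpu_vram_mb": 15360,
        "gpu_devices": [{"model": "Tesla T4", "vram_mb": 15360}],
    }
)
```

Because every GPU field has a default, a heartbeat from a worker that predates this change still validates, and the frontend can hide GPU UI whenever `gpu_count` is 0.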
@thestumonkey thestumonkey merged commit f069b29 into dev Feb 27, 2026
1 of 2 checks passed
thestumonkey added a commit that referenced this pull request Feb 27, 2026