1. Quick Debug Information
- OS/Version: master node and non-GPU worker node on Rocky Linux 8.8; GPU worker node on RHEL 8.8
- Kernel Version: 4.18.0-477.15.1.el8_8.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s (v1.24.12)
- GPU Operator Version: 23.3.2
2. Issue or feature description
We have a Kubernetes cluster in an air-gapped environment, with the master node and a non-GPU worker node running Rocky Linux 8.8 and a GPU worker node running RHEL 8.8.
The custom repo ConfigMap injected into the driver DaemonSet is reported as not supported. The following is an image from the gpu-operator pod logs.
Because of this, only the gpu-operator and gpu-node-feature-discovery pods come up; the rest of the pods (driver, container-toolkit, dcgm-exporter, etc.) are missing, and their DaemonSets are not created either.
3. Steps to reproduce the issue
- Create a Kubernetes cluster with 1 master node (Rocky Linux 8.8) and 2 worker nodes (one with Rocky Linux 8.8 and one GPU node with RHEL 8.8).
- Install the GPU Operator version 23.3.2 through Helm, passing a custom ConfigMap for the driver in values.yaml:
driver:
  repoConfig:
    configMapName: "repo-config"
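For reference, a repo ConfigMap for an air-gapped setup looks roughly like the sketch below. The repo name, file name, and baseurl are illustrative placeholders for a local package mirror, not our actual values:

```yaml
# Hypothetical repo-config ConfigMap for an air-gapped RHEL 8 mirror.
# The key (custom-repo.repo) and baseurl are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-config
  namespace: gpu-operator
data:
  custom-repo.repo: |
    [custom-repo]
    name=Local package mirror
    baseurl=http://local-mirror.example.com/rhel8/
    enabled=1
    gpgcheck=0
```

The operator mounts this ConfigMap into the driver container so that driver dependencies can be installed without internet access.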
4. Information to attach (optional if deemed irrelevant)
- kubernetes all resource status:
kubectl get all -n gpu-operator
5. Our debug analysis
- We found that the gpu-operator pod is always scheduled on the master node because of nodeAffinity.
- We edited the gpu-operator Deployment and set nodeName so the pod would be scheduled on the GPU node. Once the gpu-operator pod started running on the GPU node, all the pods (driver DaemonSet, container-toolkit, dcgm-exporter, etc.) came up.
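The workaround described above can be sketched as a strategic-merge patch against the Deployment. The node name here is a placeholder, and the namespace/deployment name assume the Helm chart defaults:

```yaml
# Hypothetical patch pinning the gpu-operator Deployment to the GPU node.
# "gpu-node-1" is a placeholder; replace it with the actual node name.
# Note: setting nodeName bypasses the scheduler (and the chart's nodeAffinity).
spec:
  template:
    spec:
      nodeName: gpu-node-1
```

Applied, for example, with `kubectl -n gpu-operator patch deployment gpu-operator --patch-file pin-node.yaml`. This is only a debugging aid, since hard-coding nodeName defeats normal scheduling.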
- If heterogeneous clusters are not supported, why does everything work when we do not pass the custom ConfigMap (non-air-gapped scenario)? In that case there is no error in the gpu-operator pod logs and all the pods (driver, container-toolkit, dcgm-exporter, etc.) come up.
- We want to understand why a custom ConfigMap check was added and, given that it exists, why the distribution is reported as not supported for a heterogeneous cluster in an air-gapped environment.