[feat] : validate additional enabled drivers #2014
base: main
Conversation
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
/ok to test 4457fca
		return driverInfo{}, err
	}

	err = validateAdditionalDriverComponents(d.ctx)
Question -- now that the driver-validation validates all of the additional driver components, does this mean we can potentially remove the additional nvidia-fs-validation and gdrcopy-validation containers from the validator daemonset?
gpu-operator/assets/state-operator-validation/0500_daemonset.yaml
Lines 59 to 100 in b7a7b87
- name: nvidia-fs-validation
  image: "FILLED BY THE OPERATOR"
  command: ['sh', '-c']
  args: ["nvidia-validator"]
  env:
    - name: WITH_WAIT
      value: "true"
    - name: COMPONENT
      value: nvidia-fs
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  securityContext:
    privileged: true
    seLinuxOptions:
      level: "s0"
  volumeMounts:
    - name: run-nvidia-validations
      mountPath: /run/nvidia/validations
      mountPropagation: Bidirectional
- name: gdrcopy-validation
  image: "FILLED BY THE OPERATOR"
  command: [ 'sh', '-c' ]
  args: [ "nvidia-validator" ]
  env:
    - name: WITH_WAIT
      value: "true"
    - name: COMPONENT
      value: gdrcopy
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  securityContext:
    privileged: true
    seLinuxOptions:
      level: "s0"
  volumeMounts:
    - name: run-nvidia-validations
      mountPath: /run/nvidia/validations
      mountPropagation: Bidirectional
Good point, wasn't aware of these. Let me take a look and see.
Not critical for this PR, but just wanted to highlight the potential for some cleanup here.
…ver container Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
if ! nvidia-smi >/dev/null 2>&1; then
  echo "nvidia-smi failed"
  exit 1
fi
Question -- should we really redirect stdout and stderr to /dev/null? If nvidia-smi fails, having output from stdout / stderr may be helpful for debugging purposes.
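One hedged possibility, as a sketch only (not necessarily how the PR will address it; the variable name is illustrative): capture stdout and stderr and print them when the check fails, while still relying on the exit status.

# Capture both stdout and stderr so they can be surfaced on failure.
if ! smi_output=$(nvidia-smi 2>&1); then
  echo "nvidia-smi failed:"
  echo "${smi_output}"
  exit 1
fi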
Dependencies
Depends on: NVIDIA/k8s-device-plugin#1550
Description
Problem
GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NvidiaDriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture introduces a limitation when different NvidiaDriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to account for feature differences across driver instances. As a result, nodes running drivers with different enabled features cannot be correctly or independently supported.
Proposed solution
During reconciliation in the GPU Operator, we will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the host filesystem of the node it runs on.
With this mechanism, each node will record a node-local view of enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.
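As a rough sketch of the persistence step (the directory, marker-file names, and environment variable names below are illustrative assumptions, not the PR's actual implementation), the driver container could translate the injected enablement variables into marker files on a host-mounted path:

# Hypothetical sketch: record which additional driver features are enabled
# on this node, based on env vars injected by the GPU Operator.
STATUS_DIR="/run/nvidia/validations"   # assumed host-mounted directory
mkdir -p "${STATUS_DIR}"

if [ "${GDS_ENABLED:-false}" = "true" ]; then
  touch "${STATUS_DIR}/.driver-ctr-gds-enabled"       # illustrative marker file
fi
if [ "${GDRCOPY_ENABLED:-false}" = "true" ]; then
  touch "${STATUS_DIR}/.driver-ctr-gdrcopy-enabled"   # illustrative marker file
fi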
We are also updating the GPU Operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding.
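A minimal sketch of that waiting behavior, continuing the illustrative marker-file convention above (the real check lives in the validator; the component list and "ready" marker names are assumptions):

# Hypothetical sketch: block until every enabled additional driver component
# has been installed before validation proceeds.
STATUS_DIR="/run/nvidia/validations"
for component in gds gdrcopy; do                          # illustrative component list
  if [ -f "${STATUS_DIR}/.driver-ctr-${component}-enabled" ]; then
    until [ -f "${STATUS_DIR}/${component}-ready" ]; do   # illustrative ready marker
      echo "waiting for ${component} driver to be installed..."
      sleep 5
    done
  fi
done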
The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash if a particular device is not present on a node. We are now updating the device plugin to always attempt discovery of all supported devices and driver features.
Checklist
- make lint
- make validate-generated-assets
- make validate-modules

Testing
- make coverage

Test details:
Manual testing was done to validate the changes.
To test with ClusterPolicy, the following values.yaml was used:
Pods after install:
Testing with nvidiadriver CR:
values.yaml file:
nvidiadriver CRD installed using:
Status after install:
CDI was enabled and disabled in both tests to make sure the changes work with and without CDI.