[feat] : validate additional enabled drivers #2014
base: main
Conversation
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
/ok to test 4457fca
		return driverInfo{}, err
	}

	err = validateAdditionalDriverComponents(d.ctx)
Question -- now that the driver-validation validates all of the additional driver components, does this mean we can potentially remove the additional nvidia-fs-validation and gdrcopy-validation containers from the validator daemonset?
gpu-operator/assets/state-operator-validation/0500_daemonset.yaml
Lines 59 to 100 in b7a7b87
- name: nvidia-fs-validation
  image: "FILLED BY THE OPERATOR"
  command: ['sh', '-c']
  args: ["nvidia-validator"]
  env:
    - name: WITH_WAIT
      value: "true"
    - name: COMPONENT
      value: nvidia-fs
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  securityContext:
    privileged: true
    seLinuxOptions:
      level: "s0"
  volumeMounts:
    - name: run-nvidia-validations
      mountPath: /run/nvidia/validations
      mountPropagation: Bidirectional
- name: gdrcopy-validation
  image: "FILLED BY THE OPERATOR"
  command: [ 'sh', '-c' ]
  args: [ "nvidia-validator" ]
  env:
    - name: WITH_WAIT
      value: "true"
    - name: COMPONENT
      value: gdrcopy
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  securityContext:
    privileged: true
    seLinuxOptions:
      level: "s0"
  volumeMounts:
    - name: run-nvidia-validations
      mountPath: /run/nvidia/validations
      mountPropagation: Bidirectional
Good point, wasn't aware of these. Let me take a look and see.
Not critical for this PR, but just wanted to highlight the potential for some cleanup here.
…ver container Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
if ! nvidia-smi >/dev/null 2>&1; then
  echo "nvidia-smi failed"
  exit 1
fi
Question -- should we really redirect stdout and stderr to /dev/null? If nvidia-smi fails, having output from stdout / stderr may be helpful for debugging purposes.
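One hedged possibility, as a sketch only (not necessarily how the PR will address it; the variable name is illustrative): capture stdout and stderr and print them when the check fails, while still relying on the exit status.

# Capture both stdout and stderr so they can be surfaced on failure.
if ! smi_output=$(nvidia-smi 2>&1); then
  echo "nvidia-smi failed:"
  echo "${smi_output}"
  exit 1
fi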
Dependencies
Depends on: NVIDIA/k8s-device-plugin#1550
Description
Problem
GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NvidiaDriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.
This architecture introduces a limitation when different NvidiaDriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to account for feature differences across driver instances. As a result, nodes running drivers with different enabled features cannot be correctly or independently supported.
Proposed solution
During reconciliation in the GPU Operator, we will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NvidiaDriver CR selected for the node. The driver container will then persist these variables to the host filesystem of the node it runs on.
With this mechanism, each node will record a node-local view of enabled additional drivers, accurately reflecting the features configured for that node via its ClusterPolicy or NvidiaDriver CR.
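As a rough sketch of the persistence step (the directory, marker-file names, and environment variable names below are illustrative assumptions, not the PR's actual implementation), the driver container could translate the injected enablement variables into marker files on a host-mounted path:

# Hypothetical sketch: record which additional driver features are enabled
# on this node, based on env vars injected by the GPU Operator.
STATUS_DIR="/run/nvidia/validations"   # assumed host-mounted directory
mkdir -p "${STATUS_DIR}"

if [ "${GDS_ENABLED:-false}" = "true" ]; then
  touch "${STATUS_DIR}/.driver-ctr-gds-enabled"       # illustrative marker file
fi
if [ "${GDRCOPY_ENABLED:-false}" = "true" ]; then
  touch "${STATUS_DIR}/.driver-ctr-gdrcopy-enabled"   # illustrative marker file
fi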
We are also updating the GPU Operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding.
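A minimal sketch of that waiting behavior, continuing the illustrative marker-file convention above (the real check lives in the validator; the component list and "ready" marker names are assumptions):

# Hypothetical sketch: block until every enabled additional driver component
# has been installed before validation proceeds.
STATUS_DIR="/run/nvidia/validations"
for component in gds gdrcopy; do                          # illustrative component list
  if [ -f "${STATUS_DIR}/.driver-ctr-${component}-enabled" ]; then
    until [ -f "${STATUS_DIR}/${component}-ready" ]; do   # illustrative ready marker
      echo "waiting for ${component} driver to be installed..."
      sleep 5
    done
  fi
done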
The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash if a particular device is not present on a node. We are now updating the device plugin to always attempt discovery of all supported devices and driver features.
Checklist
- make lint
- make validate-generated-assets
- make validate-modules

Testing
- make coverage

Test details:
Manual testing was done to validate the changes.
To test with ClusterPolicy, the following values.yaml was used:
Pods after install:
Testing with nvidiadriver CR:
values.yaml file:
nvidiadriver CRD installed using:
Status after install:
CDI was enabled and disabled in both tests to make sure the changes work with and without CDI.