Summary
tools/cleanup (and helm uninstall gpu-operator) on a driver-managed cluster — e.g. EKS GB200, where gpu-operator installs the NVIDIA driver with driver.enabled=true — removes the operator but does not cleanly unload the NVIDIA kernel modules or drain/reboot the GPU nodes. This leaves nvidia_uvm wedged mid-unload, which breaks the next AICR install on the same nodes.
Environment
- Cluster: EKS GB200 (p6e-gb200.36xlarge), Ubuntu 24.04, kernel 6.17.0-1017-aws
- Driver: NVIDIA Open Kernel Module 580.126.20 (gpu-operator-managed, driver.enabled=true)
- AICR: v0.16.0-rc1
Reproduction
- Deploy a driver-managed recipe (gb200-eks-ubuntu-training-kubeflow); the GPU stack is healthy.
- Run tools/cleanup (uninstalls gpu-operator, deletes namespaces + CRDs).
- Re-deploy the same recipe.
nvidia-container-toolkit-daemonset init container (driver-validation) goes Init:CrashLoopBackOff; nvidia-device-plugin and nvidia-operator-validator stay stuck Init. The driver daemonset itself is Running 2/2.
Root cause
dmesg on the GPU node shows nvidia_uvm stuck mid-unload:
Modules linked in: ... nvidia_uvm(OE-) nvidia(OE) ...
uvm_pmm_devmem_exit+0x88/0x108 [nvidia_uvm]
uvm_global_exit+0x68/0xe8 [nvidia_uvm]
uvm_exit_entry+0x178/0x1238 [nvidia_uvm]
[last unloaded: nvidia_modeset(OE)]
The trailing - in nvidia_uvm(OE-) means the module is marked for removal but wedged in its kernel exit path. The freshly-deployed driver's nvidia module loads, but nvidia_uvm can neither finish unloading nor reload (zombie present). modprobe/insmod cannot fix it. The toolkit driver-validation then fails with:
error validating driver installation: failed to create device node nvidia-uvm:
failed to determine major: invalid device node
This surfaces as the gpu-operator #430 symlink symptom, but the underlying cause is the wedged module.
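The flag reading described above can be checked mechanically. The following is a minimal Python sketch, not part of AICR or gpu-operator; the function name `modules_being_unloaded` is hypothetical. It scans any `Modules linked in:` line in dmesg output and returns the modules whose flag string ends in `-`, which is how the kernel marks a module that is on its way out.

```python
import re

# Matches entries such as "nvidia_uvm(OE-)": a module name followed by flags in parentheses.
_MODULE_WITH_FLAGS = re.compile(r"([A-Za-z0-9_]+)\(([^)]*)\)")


def modules_being_unloaded(dmesg_text: str) -> list[str]:
    """Return modules on 'Modules linked in:' lines whose flags end in '-'."""
    found = []
    for line in dmesg_text.splitlines():
        if "Modules linked in:" not in line:
            continue  # skips e.g. the "[last unloaded: ...]" line
        for name, flags in _MODULE_WITH_FLAGS.findall(line):
            if flags.endswith("-") and name not in found:
                found.append(name)
    return found


# Sample taken from the dmesg excerpt in this report.
sample = (
    "Modules linked in: ... nvidia_uvm(OE-) nvidia(OE) ...\n"
    "[last unloaded: nvidia_modeset(OE)]\n"
)
print(modules_being_unloaded(sample))  # ['nvidia_uvm']
```

A non-empty result on a GPU node after cleanup suggests the node needs a reboot before the next install.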
Why GKE doesn't hit it
GKE COS host-manages the driver (gpu-operator runs driver.enabled=false), so tools/cleanup never touches kernel modules. Only driver-managed clusters (EKS / BCM / etc. with driver.enabled=true) are affected.
Proposed fix
On driver-managed clusters, tools/cleanup should cordon + drain + reboot the GPU nodes (or cleanly uninstall the NVIDIA driver and unload its modules) before/after removing gpu-operator. At minimum, document the required GPU-node reboot in the cleanup tool output and the redeploy docs.
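As an illustration of the "at minimum, document it" option, here is a small Python sketch of the decision tools/cleanup could make before printing its final output. It is an assumption about how such a check might look, not the tool's actual code: `cleanup_needs_reboot` and `cleanup_notice` are hypothetical names, and the input is assumed to be the gpu-operator Helm values parsed from JSON into a dict with real booleans.

```python
def cleanup_needs_reboot(helm_values: dict) -> bool:
    """True when gpu-operator manages the driver (driver.enabled=true)."""
    driver = helm_values.get("driver") or {}
    # The gpu-operator chart enables the driver by default,
    # so a missing key is treated as enabled.
    return bool(driver.get("enabled", True))


def cleanup_notice(helm_values: dict) -> str:
    """Return the message a cleanup tool could print after uninstalling."""
    if cleanup_needs_reboot(helm_values):
        return (
            "driver.enabled=true: cordon, drain, and reboot the GPU nodes "
            "before redeploying."
        )
    return "driver.enabled=false: driver is host-managed; no GPU-node reboot needed."


print(cleanup_notice({"driver": {"enabled": True}}))
print(cleanup_notice({"driver": {"enabled": False}}))
```

The values would have to be captured before the release is removed, for example with `helm get values <release> --all -o json`; the release name and namespace depend on the deployment.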
Workaround
After cleanup, reboot the GPU nodes (kubectl cordon → aws ec2 reboot-instances → wait Ready → kubectl uncordon); gpu-operator then reloads the driver + nvidia_uvm cleanly and the operand chain goes green.
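The workaround sequence can be sketched in Python as follows. This is a hedged illustration, not a tested runbook: the helper names (`instance_id_from_provider_id`, `reboot_plan`, `run_plan`), the 20-minute timeout, and the example node name are all assumptions. It relies on EKS nodes reporting `spec.providerID` in the form `aws:///<az>/<instance-id>`, and it expects `kubectl` and `aws` to be on the PATH with working credentials.

```python
import subprocess


def instance_id_from_provider_id(provider_id: str) -> str:
    """Extract the EC2 instance ID from a node's spec.providerID."""
    instance_id = provider_id.rsplit("/", 1)[-1]
    if not instance_id.startswith("i-"):
        raise ValueError(f"not an EC2 provider ID: {provider_id!r}")
    return instance_id


def reboot_plan(node: str, provider_id: str, timeout: str = "20m") -> list[list[str]]:
    """Build the cordon -> reboot -> wait Ready -> uncordon command sequence."""
    instance_id = instance_id_from_provider_id(provider_id)
    return [
        ["kubectl", "cordon", node],
        ["aws", "ec2", "reboot-instances", "--instance-ids", instance_id],
        # Caveat: right after the reboot request the node may still report Ready,
        # so this wait can return before the reboot has actually happened.
        ["kubectl", "wait", "--for=condition=Ready", f"node/{node}", f"--timeout={timeout}"],
        ["kubectl", "uncordon", node],
    ]


def run_plan(plan: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)


# Hypothetical node name and provider ID, for illustration only.
plan = reboot_plan("ip-10-0-0-1.ec2.internal", "aws:///us-east-1a/i-0123456789abcdef0")
for argv in plan:
    print(" ".join(argv))
```

Printing the plan first makes it easy to review before calling `run_plan(plan)`. A more robust version would confirm that the node really restarted, for example by comparing `.status.nodeInfo.bootID` before and after, before uncordoning.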