tools/cleanup leaves nvidia_uvm wedged on driver-managed GPU nodes (EKS) #1553

@yuanchen8911

Summary

tools/cleanup (and helm uninstall gpu-operator) on a driver-managed cluster — e.g. EKS GB200, where gpu-operator installs the NVIDIA driver with driver.enabled=true — removes the operator but does not cleanly unload the NVIDIA kernel modules or drain/reboot the GPU nodes. This leaves nvidia_uvm wedged mid-unload, which breaks the next AICR install on the same nodes.

Environment

  • Cluster: EKS GB200 (p6e-gb200.36xlarge), Ubuntu 24.04, kernel 6.17.0-1017-aws
  • Driver: NVIDIA Open Kernel Module 580.126.20 (gpu-operator-managed, driver.enabled=true)
  • AICR: v0.16.0-rc1

Reproduction

  1. Deploy a driver-managed recipe (gb200-eks-ubuntu-training-kubeflow) — GPU stack healthy.
  2. Run tools/cleanup (uninstalls gpu-operator, deletes namespaces + CRDs).
  3. Re-deploy the same recipe.
  4. Observe: the nvidia-container-toolkit-daemonset init container (driver-validation) goes into Init:CrashLoopBackOff; nvidia-device-plugin and nvidia-operator-validator stay stuck in Init. The driver daemonset itself is Running 2/2.

Root cause

dmesg on the GPU node shows nvidia_uvm stuck mid-unload:

Modules linked in: ... nvidia_uvm(OE-) nvidia(OE) ...
  uvm_pmm_devmem_exit+0x88/0x108 [nvidia_uvm]
  uvm_global_exit+0x68/0xe8 [nvidia_uvm]
  uvm_exit_entry+0x178/0x1238 [nvidia_uvm]
[last unloaded: nvidia_modeset(OE)]

The trailing - in nvidia_uvm(OE-) means the module is marked for removal but wedged in its kernel exit path. The freshly-deployed driver's nvidia module loads, but nvidia_uvm can neither finish unloading nor reload (zombie present). modprobe/insmod cannot fix it. The toolkit driver-validation then fails with:

error validating driver installation: failed to create device node nvidia-uvm:
failed to determine major: invalid device node

(This surfaces as the gpu-operator #430 symlink symptom, but the underlying cause is the wedged module.)
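
To check a node for this state without reading the trace by hand, something like the following could scan dmesg output for modules flagged with a trailing -. This is only a sketch: the function name modules_being_unloaded is hypothetical, and it assumes the "Modules linked in:" line has the same shape as the excerpt above.

```python
import re

# One entry of a "Modules linked in:" line, e.g. "nvidia_uvm(OE-)" or "nvidia".
_MODULE_RE = re.compile(r"([A-Za-z0-9_]+)(?:\(([^)]*)\))?")


def modules_being_unloaded(dmesg_text):
    """Return names of modules whose flags contain '-' (unload in progress).

    Hypothetical helper, not part of tools/cleanup.
    """
    going = []
    for line in dmesg_text.splitlines():
        _, sep, rest = line.partition("Modules linked in:")
        if not sep:
            continue  # not a module-list line
        for token in rest.split():
            m = _MODULE_RE.fullmatch(token)
            # group(2) holds the flags inside the parentheses, if any
            if m and m.group(2) and "-" in m.group(2):
                going.append(m.group(1))
    return going


sample = "Modules linked in: ... nvidia_uvm(OE-) nvidia(OE) ..."
print(modules_being_unloaded(sample))
```

A non-empty result that includes nvidia_uvm after cleanup would suggest the node needs a reboot before the next install.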

Why GKE doesn't hit it

GKE COS host-manages the driver (gpu-operator runs driver.enabled=false), so tools/cleanup never touches kernel modules. Only driver-managed clusters (EKS / BCM / etc. with driver.enabled=true) are affected.

Proposed fix

On driver-managed clusters, tools/cleanup should cordon + drain + reboot the GPU nodes (or cleanly uninstall the NVIDIA driver and unload its modules) before/after removing gpu-operator. At minimum, document the required GPU-node reboot in the cleanup tool output and the redeploy docs.

Workaround

After cleanup, reboot the GPU nodes (kubectl cordon → aws ec2 reboot-instances → wait Ready → kubectl uncordon); gpu-operator then reloads the driver + nvidia_uvm cleanly and the operand chain goes green.
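
The workaround sequence could be scripted roughly as below. This is a sketch, not what tools/cleanup does today: the function names, the 15m timeout, and the node name and instance ID in the usage line are placeholders, and the caller is assumed to already know which EC2 instance ID backs each node.

```python
import subprocess


def reboot_commands(node_name, instance_id, timeout="15m"):
    """Build the cordon -> reboot -> wait Ready -> uncordon command list."""
    return [
        ["kubectl", "cordon", node_name],
        ["aws", "ec2", "reboot-instances", "--instance-ids", instance_id],
        ["kubectl", "wait", "--for=condition=Ready",
         f"node/{node_name}", f"--timeout={timeout}"],
        ["kubectl", "uncordon", node_name],
    ]


def reboot_gpu_node(node_name, instance_id, dry_run=True):
    """Print the commands (dry_run=True) or run them in order, stopping on failure."""
    for cmd in reboot_commands(node_name, instance_id):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Placeholder node name and instance ID; dry run only prints the commands.
reboot_gpu_node("gpu-node-1", "i-0123456789abcdef0")
```

One caveat: kubectl wait may return immediately if the node has not yet been marked NotReady after the reboot request, so a real script would likely need to wait for the node to go NotReady first, or confirm afterwards that the node's boot time changed.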
