tools/cleanup leaves nvidia_uvm wedged on driver-managed GPU nodes (EKS) #1553

## Summary

`tools/cleanup` (and `helm uninstall gpu-operator`) on a **driver-managed** cluster — e.g. EKS GB200, where gpu-operator installs the NVIDIA driver with `driver.enabled=true` — removes the operator but does **not** cleanly unload the NVIDIA kernel modules or drain/reboot the GPU nodes. This leaves `nvidia_uvm` wedged mid-unload, which breaks the **next** AICR install on the same nodes.

## Environment

- Cluster: EKS GB200 (`p6e-gb200.36xlarge`), Ubuntu 24.04, kernel `6.17.0-1017-aws`
- Driver: NVIDIA Open Kernel Module `580.126.20` (gpu-operator-managed, `driver.enabled=true`)
- AICR: `v0.16.0-rc1`

## Reproduction

1. Deploy a driver-managed recipe (`gb200-eks-ubuntu-training-kubeflow`) — GPU stack healthy.
2. Run `tools/cleanup` (uninstalls gpu-operator, deletes namespaces + CRDs).
3. Re-deploy the same recipe.
4. The `driver-validation` init container of `nvidia-container-toolkit-daemonset` goes `Init:CrashLoopBackOff`; `nvidia-device-plugin` and `nvidia-operator-validator` stay stuck in `Init`. The driver daemonset itself is `Running 2/2`.

## Root cause

`dmesg` on the GPU node shows `nvidia_uvm` stuck mid-unload:

```
Modules linked in: ... nvidia_uvm(OE-) nvidia(OE) ...
  uvm_pmm_devmem_exit+0x88/0x108 [nvidia_uvm]
  uvm_global_exit+0x68/0xe8 [nvidia_uvm]
  uvm_exit_entry+0x178/0x1238 [nvidia_uvm]
[last unloaded: nvidia_modeset(OE)]
```

The trailing `-` in `nvidia_uvm(OE-)` means the module is marked for removal but **wedged in its kernel exit path**. The freshly deployed driver's `nvidia` module loads, but `nvidia_uvm` can neither finish unloading nor be loaded again while the half-removed (zombie) instance is still present. `modprobe`/`insmod` cannot fix it. The toolkit `driver-validation` then fails with:

```
error validating driver installation: failed to create device node nvidia-uvm:
failed to determine major: invalid device node
```

This surfaces as the gpu-operator #430 symlink symptom, but the underlying cause is the wedged module.
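
The `(OE-)` check described above can be done mechanically. Below is a minimal sketch, assuming the `Modules linked in:` line has already been captured from `dmesg` as a string; the function name `unloading_modules` is hypothetical and not part of `tools/cleanup`.

```python
import re

# Matches one entry of a "Modules linked in:" line, e.g. "nvidia_uvm(OE-)".
# The flag group in parentheses is optional; untainted modules have none.
_MODULE_RE = re.compile(r"(?P<name>[A-Za-z0-9_]+)(?:\((?P<flags>[^)]*)\))?")


def unloading_modules(modules_line: str) -> list[str]:
    """Return names of modules whose flag group contains '-' (unload in progress)."""
    # Everything after the marker is a whitespace-separated module list.
    _, _, tail = modules_line.partition("Modules linked in:")
    wedged = []
    for token in tail.split():
        match = _MODULE_RE.fullmatch(token)
        if match and "-" in (match.group("flags") or ""):
            wedged.append(match.group("name"))
    return wedged


line = "Modules linked in: ... nvidia_uvm(OE-) nvidia(OE) ..."
print(unloading_modules(line))
```

A non-empty result for a node is a hint that `modprobe` will not help and the node likely needs the reboot described under Workaround.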

## Why GKE doesn't hit it

GKE COS host-manages the driver (gpu-operator runs `driver.enabled=false`), so `tools/cleanup` never touches kernel modules. Only driver-managed clusters (EKS / BCM / etc. with `driver.enabled=true`) are affected.

## Proposed fix

On driver-managed clusters, `tools/cleanup` should cordon + drain + reboot the GPU nodes (or cleanly uninstall the NVIDIA driver and unload its modules) before/after removing gpu-operator. At minimum, document the required GPU-node reboot in the cleanup tool output and the redeploy docs.
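
A rough sketch of how that branching might look, assuming the cleanup tool can learn whether gpu-operator was installed with `driver.enabled=true`. The function name, the step labels, and the exact ordering (cordon and drain first, reboot last) are illustrative assumptions, not the current behaviour of `tools/cleanup`.

```python
def cleanup_steps(driver_enabled: bool) -> list[str]:
    """Return an ordered cleanup plan; node steps are added only when gpu-operator manages the driver."""
    # Steps the issue says tools/cleanup already performs.
    steps = ["uninstall gpu-operator", "delete namespaces and CRDs"]
    if driver_enabled:
        # Assumed ordering: quiesce the GPU nodes first, reboot once the operator is gone.
        steps = (
            ["cordon GPU nodes", "drain GPU nodes"]
            + steps
            + ["reboot GPU nodes", "wait for nodes Ready", "uncordon GPU nodes"]
        )
    return steps


for step in cleanup_steps(driver_enabled=True):
    print(step)
```

With `driver_enabled=False` (the host-managed GKE case) the plan is unchanged from today, so clusters that are not affected would pay no extra cost.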

## Workaround

After cleanup, reboot the GPU nodes (`kubectl cordon` → `aws ec2 reboot-instances` → wait Ready → `kubectl uncordon`); gpu-operator then reloads the driver + `nvidia_uvm` cleanly and the operand chain goes green.
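
The workaround sequence can be sketched as a small helper that builds the commands for one node. This is only an illustration: the node name and instance ID below are placeholders, the 15-minute timeout is an assumption, and `reboot_commands` is a hypothetical helper. It prints the commands rather than running them.

```python
import shlex


def reboot_commands(node: str, instance_id: str, timeout: str = "15m") -> list[list[str]]:
    """Build the cordon / reboot / wait / uncordon commands for a single GPU node."""
    return [
        ["kubectl", "cordon", node],
        ["aws", "ec2", "reboot-instances", "--instance-ids", instance_id],
        # Caveat: this can return immediately if the node has not yet gone NotReady.
        ["kubectl", "wait", "--for=condition=Ready", f"node/{node}", f"--timeout={timeout}"],
        ["kubectl", "uncordon", node],
    ]


# Placeholder node name and instance ID, for illustration only.
for cmd in reboot_commands("gpu-node-1", "i-0123456789abcdef0"):
    print(shlex.join(cmd))
```

Because the reboot is asynchronous, it may be worth confirming the node actually went `NotReady` before trusting the `kubectl wait` result.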

