
Restart driver pods in place when driver config is unchanged #2527

Merged
rajathagasthya merged 1 commit into NVIDIA:main from rajathagasthya:worktree-minor-version-driver-no-upgrade
Jun 30, 2026

Conversation

@rajathagasthya rajathagasthya commented Jun 9, 2026

Contributor

Description

A patch chart upgrade can change only cosmetic pod-template metadata (e.g. the helm.sh/chart label) without changing the driver itself. The upgrade controller keys on the DaemonSet's controller revision hash, so such a change still evicts running GPU workloads and drains the node, disrupting them for no driver benefit.

Register a RestartOnlyPredicate on the upgrade state manager (from the UpgradeReconciler) that compares DRIVER_CONFIG_DIGEST — a hash of the install-relevant driver config, already set on the driver pod template — between the running pod and the desired DaemonSet. When the digests match, the node is cordoned and the driver pod restarted in place, with no workload eviction or drain; the driver fast-path keeps the kernel modules loaded across the restart, so running GPU workloads are not disrupted. Cordoning keeps the node unschedulable if the restart fails (same as the full flow), and the node is uncordoned on success. A missing or differing digest falls back to the full upgrade flow.
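To illustrate why a digest of the install-relevant config is a better key than a hash of the whole pod template, here is a minimal sketch. Everything in it is an assumption made for illustration: `podTemplate`, `revisionHash`, and `configDigest` are hypothetical names, the field set is invented, and SHA-256 over JSON stands in for whatever hashing Kubernetes and the operator actually use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// podTemplate is a hypothetical, trimmed-down stand-in for a driver pod template.
type podTemplate struct {
	Labels        map[string]string
	DriverImage   string
	DriverVersion string
}

// hashOf returns the hex SHA-256 of the JSON encoding of v.
// encoding/json sorts map keys, so the result is deterministic.
func hashOf(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err) // not expected for the plain types used here
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// revisionHash covers the whole template, labels included
// (a stand-in for the controller revision hash).
func revisionHash(t podTemplate) string { return hashOf(t) }

// configDigest covers only the fields assumed to affect the driver install.
func configDigest(t podTemplate) string {
	return hashOf([]string{t.DriverImage, t.DriverVersion})
}

func main() {
	before := podTemplate{
		Labels:        map[string]string{"helm.sh/chart": "chart-1.0.0"}, // placeholder values
		DriverImage:   "example.com/driver",
		DriverVersion: "1.2.3",
	}
	after := before
	after.Labels = map[string]string{"helm.sh/chart": "chart-1.0.1"} // label-only change

	fmt.Println("revision hash changed:", revisionHash(before) != revisionHash(after)) // true
	fmt.Println("config digest changed:", configDigest(before) != configDigest(after)) // false
}
```

A label-only change flips the template-wide hash but leaves the digest untouched, which is the case the restart-only path is meant to catch.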

If the predicate returns an error or the cordon fails, the node stays in upgrade-required and is retried on a later reconcile (with a Warning event), rather than being routed to the disruptive flow on an unknown answer.

Known limitation: the first upgrade from a release without restart-only support is still disruptive, because the old operator holds the leader-election lease and routes the upgrade before the new operator becomes leader. Once both the outgoing and incoming operator versions include this change, upgrades that leave the driver config unchanged are non-disruptive.

Related to NVIDIA/k8s-operator-libs#145

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Unit tests:

  • internal/config: TestDriverConfigDigestFromPodSpec — digest reader, incl. nil/empty/container-precedence cases.
  • controllers: TestDriverPodRestartOnly — predicate routing, incl. nil pod/DS and missing/equal/differing digests.
  • NVIDIA/k8s-operator-libs#145 (Add optional restart-only predicate to inplace upgrade flow): Ginkgo specs for restart-only routing (cordon + pod-restart), retry-on-error for predicate and cordon failures, orphaned/upgrade-requested/safe-driver-load skips, and the maxParallelUpgrades throttle.
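The cases these unit tests cover can be pictured with a small sketch of a digest reader and comparison. The types below are local stand-ins rather than the Kubernetes API types, `digestFromPodSpec` and `restartOnly` are hypothetical names, and the precedence rule (first container with a non-empty value wins) is an assumption, since the PR's actual rule is not shown here.

```go
package main

import "fmt"

// Env var name taken from the PR description.
const driverConfigDigestEnv = "DRIVER_CONFIG_DIGEST"

// Minimal local stand-ins for the pod spec types (not the real API types).
type envVar struct{ Name, Value string }

type container struct {
	Name string
	Env  []envVar
}

type podSpec struct{ Containers []container }

// digestFromPodSpec returns the first non-empty digest found while scanning
// containers in order, or "" when spec is nil or no container sets one.
func digestFromPodSpec(spec *podSpec) string {
	if spec == nil {
		return ""
	}
	for _, c := range spec.Containers {
		for _, e := range c.Env {
			if e.Name == driverConfigDigestEnv && e.Value != "" {
				return e.Value
			}
		}
	}
	return ""
}

// restartOnly is true only when both sides carry the same non-empty digest.
// A missing digest on either side means the full upgrade flow.
func restartOnly(running, desired *podSpec) bool {
	got, want := digestFromPodSpec(running), digestFromPodSpec(desired)
	return got != "" && got == want
}

func main() {
	withDigest := func(d string) *podSpec {
		return &podSpec{Containers: []container{{
			Name: "driver", // placeholder container name
			Env:  []envVar{{Name: driverConfigDigestEnv, Value: d}},
		}}}
	}

	fmt.Println(restartOnly(withDigest("abc"), withDigest("abc"))) // true
	fmt.Println(restartOnly(withDigest("abc"), withDigest("def"))) // false
	fmt.Println(restartOnly(&podSpec{}, withDigest("abc")))        // false
	fmt.Println(restartOnly(nil, withDigest("abc")))               // false
}
```

Treating an empty digest as "no answer" rather than as a value that can match keeps two unlabeled pod specs from being mistaken for identical driver configs.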

Manual testing (single-node cluster, GPU workload running throughout):

  1. Without Helm: deploy an operator image with this change, then patch the driver DaemonSet pod template with a label-only change. The node goes upgrade-required → pod-restart-required (cordoned, never entering cordon-required), the driver pod restarts via the fast path, and the GPU workload is not evicted.
  2. Helm, first adoption: install v26.3.2, then upgrade to a chart built with this change. Full upgrade flow, as expected: the driver version also changed, and the first upgrade is routed by the old operator (see known limitation above).
  3. Helm, patch upgrade: install a chart built with this change, then upgrade to another chart also built with this change, same driver version. Restart-only flow — the GPU workload is not evicted.
  4. Helm, real driver change: install a chart built with this change, then upgrade to another chart also built with this change but a different driver version. Full upgrade flow (digest differs).

Comment thread vendor/github.com/NVIDIA/k8s-operator-libs/pkg/upgrade/common_manager.go Outdated
Comment thread controllers/upgrade_controller.go Outdated
Comment thread controllers/upgrade_controller.go Outdated
@rajathagasthya rajathagasthya force-pushed the worktree-minor-version-driver-no-upgrade branch 3 times, most recently from 3feaaf8 to 40151ff Compare June 29, 2026 20:31
@rajathagasthya rajathagasthya marked this pull request as ready for review June 29, 2026 20:38
@rajathagasthya rajathagasthya force-pushed the worktree-minor-version-driver-no-upgrade branch from 40151ff to 9b0cd08 Compare June 29, 2026 21:38
Comment thread cmd/gpu-operator/main.go Outdated
Comment thread controllers/object_controls.go
@rajathagasthya rajathagasthya force-pushed the worktree-minor-version-driver-no-upgrade branch from 9b0cd08 to d743da5 Compare June 30, 2026 02:40
@cdesiniotis

Contributor

LGTM, thanks @rajathagasthya. Let's get another approval on this.

Comment thread internal/config/driver_config_digest.go Outdated
Comment thread internal/config/driver_config_digest.go Outdated
Comment thread controllers/object_controls.go
A patch chart upgrade can change only cosmetic pod-template metadata
(e.g. the helm.sh/chart label) without changing the driver itself. The
upgrade controller keys on the DaemonSet's controller revision hash, so
such a change still evicts running GPU workloads and drains the node --
for no driver benefit.

Register a RestartOnlyPredicate on the upgrade state manager that
compares DRIVER_CONFIG_DIGEST -- a hash of the install-relevant driver
config, already set on the driver pod template -- between the running
pod and the desired DaemonSet. When the digests match, the node is
cordoned and the driver pod restarted in place, with no workload
eviction or drain; the driver fast-path keeps the kernel modules loaded
across the restart, so running GPU workloads are not disrupted.
Cordoning keeps the node unschedulable if the restart fails, and the
node is uncordoned on success. A missing or differing digest falls back
to the full upgrade flow.

The digest env name and a reader for it live in internal/config beside
the digest definition; the restart-only routing decision lives in
internal/predicates and is registered on the upgrade state manager in
main.go. The RestartOnlyPredicate hook it relies on is provided by
k8s-operator-libs, vendored here at the merged version.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
@rajathagasthya rajathagasthya force-pushed the worktree-minor-version-driver-no-upgrade branch from d743da5 to 0bbf881 Compare June 30, 2026 17:28
@rajathagasthya rajathagasthya enabled auto-merge June 30, 2026 17:33
@rajathagasthya rajathagasthya merged commit 7b38b13 into NVIDIA:main Jun 30, 2026
20 checks passed
@rajathagasthya rajathagasthya deleted the worktree-minor-version-driver-no-upgrade branch June 30, 2026 19:01