[ocp4_workload_nvidia_gpu_operator] Fix ClusterPolicy CRD race condition #21

Merged: wkulhanek merged 1 commit into main from nvidia-crd-10-march on Mar 10, 2026

@stencell
Contributor

Summary

  • Fix race condition where Setup NVIDIA GPU Cluster Policy task fails because the clusterpolicies.nvidia.com CRD isn't registered yet when the task runs
  • Add a dedicated wait task for clusterpolicies.nvidia.com CRD availability before applying the ClusterPolicy resource
  • Resolves Failed to find exact match for nvidia.com/v1.ClusterPolicy errors in deployments where operator initialization is slow

Root Cause

The install_operator role waits for the CSV to reach the Succeeded state. However, the NVIDIA GPU Operator registers its ClusterPolicy CRD only after the CSV transitions to Succeeded, observed to be ~2 minutes later. When install_operator_install_csv_ignore_error is true and the CSV wait times out, the role continues before the CSV has succeeded, so the gap is even larger.

This left the Setup NVIDIA GPU Cluster Policy task attempting to apply a resource whose CRD didn't exist in the API server yet.

Changes

roles/ocp4_workload_nvidia_gpu_operator/tasks/workload.yml

  • Added a Wait for NVIDIA GPU Operator ClusterPolicy CRD to be available task between the operator install and ClusterPolicy creation
  • Polls apiextensions.k8s.io/v1/CustomResourceDefinition/clusterpolicies.nvidia.com with 30 retries × 10s delay (5 min total)
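
For readers without the diff at hand, the wait task described above might look roughly like the following. This is a minimal sketch, not the merged code: it assumes the role uses the kubernetes.core.k8s_info module, and the register variable name r_clusterpolicy_crd is hypothetical. Only the task name, the CRD name, and the 30 × 10s retry budget come from this description.

```yaml
# Sketch only: module choice and variable name are assumptions, not taken from the PR diff.
- name: Wait for NVIDIA GPU Operator ClusterPolicy CRD to be available
  kubernetes.core.k8s_info:
    api_version: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    name: clusterpolicies.nvidia.com
  register: r_clusterpolicy_crd   # hypothetical variable name
  retries: 30                     # 30 retries x 10s delay = 5 min total
  delay: 10
  # k8s_info returns an empty resources list until the CRD is registered
  until: r_clusterpolicy_crd.resources | length > 0
```

Placed between the operator install and the Setup NVIDIA GPU Cluster Policy task, a loop of this shape fails the play after about 5 minutes if the CRD never appears, instead of failing immediately on the ClusterPolicy apply.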

Test plan

  • Deploy openshift-ai-v3-aws CI and verify the Setup NVIDIA GPU Cluster Policy task succeeds without the CRD resolution error

…applying

The install_operator role considers installation complete when the CSV
reaches Succeeded state. However, the NVIDIA GPU Operator registers its
ClusterPolicy CRD after the CSV succeeds, causing a race condition where
the ClusterPolicy task fails to resolve nvidia.com/v1.ClusterPolicy.

Add an explicit wait for clusterpolicies.nvidia.com CRD availability
(30 retries x 10s = 5 min) before applying the ClusterPolicy resource.
@wkulhanek merged commit 3f41bff into main on Mar 10, 2026
1 check passed
@wkulhanek deleted the nvidia-crd-10-march branch on March 10, 2026 13:21