Summary
The GB200/EKS deploy hangs in deploy.sh at 009-gpu-operator because the chart's --wait blocks on ClusterPolicy reaching Ready, but ClusterPolicy reconciliation requires the nvidia-kernel-module-params ConfigMap that ships in the recipe's manifestFiles. #706 (cf3cd33) moved manifestFiles from a pre-install pass to a -post folder applied after the chart, so the ConfigMap never lands before reconcile.
Reproduction
Clean GB200/EKS cluster:
aicr recipe --service eks --accelerator gb200 --os ubuntu --intent training -o recipe.yaml
aicr bundle --recipe recipe.yaml -o ./bundles
cd ./bundles && ./deploy.sh
009-gpu-operator retries 5×10m and fails with:
"error":"ERROR: could not get ConfigMap nvidia-kernel-module-params from client:
ConfigMap \"nvidia-kernel-module-params\" not found"
Root cause
pkg/bundler/deployer/localformat/doc.go:51-64 classifies any manifestFiles entry as post-install (intended for CR-shaped manifests that need the chart's CRDs to exist first). But nvidia-kernel-module-params is prereq-shaped — the chart's controller reads it during reconcile, so it must apply before helm install --wait. There's currently no way to express that direction.
Scope
manifestFiles is used by multiple recipes; some entries are post-shaped, some are prereq-shaped. Known prereqs that break the same way:
gb200-eks-training.yaml, gb200-eks-inference.yaml → gpu-operator/manifests/kernel-module-params.yaml
Other manifestFiles callers (h100-eks-*, h100-gke-cos-*, mixins/platform-inference.yaml, aks.yaml, gke-cos.yaml, base.yaml) need audit; any that the chart's controller reads at reconcile time has the same latent deadlock.
Proposed fix
Add a direction tag to manifestFiles:
manifestFiles:
- path: components/gpu-operator/manifests/kernel-module-params.yaml
phase: pre # default remains post
In pkg/bundler/deployer/localformat/, emit NNN-<name>-pre/ before the upstream chart for phase: pre and keep -post for everything else. Update pkg/recipe/metadata.go:101-104 and the registry/overlay schemas. Migrate the GB200 entries to phase: pre and audit the rest in the same PR.
Related
Severity
Reproducible on every fresh GB200/EKS cluster; ~52-minute hang before terminal failure.
Summary
The GB200/EKS deploy hangs in
deploy.shat009-gpu-operatorbecause the chart's--waitblocks onClusterPolicyreaching Ready, butClusterPolicyreconciliation requires thenvidia-kernel-module-paramsConfigMap that ships in the recipe'smanifestFiles. #706 (cf3cd33) movedmanifestFilesfrom a pre-install pass to a-postfolder applied after the chart, so the ConfigMap never lands before reconcile.Reproduction
Clean GB200/EKS cluster:
009-gpu-operatorretries 5×10m and fails with:Root cause
pkg/bundler/deployer/localformat/doc.go:51-64classifies anymanifestFilesentry as post-install (intended for CR-shaped manifests that need the chart's CRDs to exist first). Butnvidia-kernel-module-paramsis prereq-shaped — the chart's controller reads it during reconcile, so it must apply beforehelm install --wait. There's currently no way to express that direction.Scope
manifestFilesis used by multiple recipes; some entries are post-shaped, some are prereq-shaped. Known prereqs that break the same way:gb200-eks-training.yaml,gb200-eks-inference.yaml→gpu-operator/manifests/kernel-module-params.yamlOther
manifestFilescallers (h100-eks-*,h100-gke-cos-*,mixins/platform-inference.yaml,aks.yaml,gke-cos.yaml,base.yaml) need audit; any that the chart's controller reads at reconcile time has the same latent deadlock.Proposed fix
Add a direction tag to
manifestFiles:In
pkg/bundler/deployer/localformat/, emitNNN-<name>-pre/before the upstream chart forphase: preand keep-postfor everything else. Updatepkg/recipe/metadata.go:101-104and the registry/overlay schemas. Migrate the GB200 entries tophase: preand audit the rest in the same PR.Related
manifestFilesentrypkg/bundler/deployer/localformat/doc.go:40-64pkg/recipe/metadata.go:101-104Severity
Reproducible on every fresh GB200/EKS cluster; ~52-minute hang before terminal failure.