bundler: manifestFiles always emitted as -post, breaking prereq ConfigMaps (GB200/EKS deploy hangs) #859

Description

@njhensley

Summary

The GB200/EKS deploy hangs in deploy.sh at 009-gpu-operator because the chart's --wait blocks on ClusterPolicy reaching Ready, but ClusterPolicy reconciliation requires the nvidia-kernel-module-params ConfigMap that ships in the recipe's manifestFiles. #706 (cf3cd33) moved manifestFiles from a pre-install pass to a -post folder applied after the chart, so the ConfigMap never lands before reconcile.

Reproduction

Clean GB200/EKS cluster:

aicr recipe --service eks --accelerator gb200 --os ubuntu --intent training -o recipe.yaml
aicr bundle --recipe recipe.yaml -o ./bundles
cd ./bundles && ./deploy.sh

009-gpu-operator retries 5×10m and fails with:

"error":"ERROR: could not get ConfigMap nvidia-kernel-module-params from client:
ConfigMap \"nvidia-kernel-module-params\" not found"

Root cause

pkg/bundler/deployer/localformat/doc.go:51-64 classifies any manifestFiles entry as post-install (intended for CR-shaped manifests that need the chart's CRDs to exist first). But nvidia-kernel-module-params is prereq-shaped — the chart's controller reads it during reconcile, so it must apply before helm install --wait. There's currently no way to express that direction.

Scope

manifestFiles is used by multiple recipes; some entries are post-shaped, some are prereq-shaped. Known prereqs that break the same way:

  • gb200-eks-training.yaml, gb200-eks-inference.yaml → gpu-operator/manifests/kernel-module-params.yaml

Other manifestFiles callers (h100-eks-*, h100-gke-cos-*, mixins/platform-inference.yaml, aks.yaml, gke-cos.yaml, base.yaml) need an audit; any entry that the chart's controller reads at reconcile time has the same latent deadlock.
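
As a starting point for that audit, a rough triage sketch: list the top-level kinds in each manifest and flag built-in kinds (ConfigMap, Secret) as prereq candidates, since they need none of the chart's CRDs and could be read by its controller. The function names and the sample manifest are hypothetical, and this is a line scan rather than a YAML parser, so it only narrows down what to check by hand.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// topLevelKinds returns the value of every unindented "kind:" line in a
// (possibly multi-document) manifest. Heuristic line scan, not a YAML parser.
func topLevelKinds(manifest string) []string {
	var kinds []string
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "kind:") {
			kinds = append(kinds, strings.TrimSpace(strings.TrimPrefix(line, "kind:")))
		}
	}
	return kinds
}

// prereqCandidate reports whether a kind is built into Kubernetes and so can
// be applied before the chart installs its CRDs. Assumption: only these two
// kinds are worth flagging; whether the controller actually reads the object
// at reconcile time still has to be confirmed manually.
func prereqCandidate(kind string) bool {
	switch kind {
	case "ConfigMap", "Secret":
		return true
	}
	return false
}

func main() {
	// Illustrative manifest only; not the contents of any real recipe file.
	sample := `apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-kernel-module-params
---
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: example
`
	for _, k := range topLevelKinds(sample) {
		fmt.Printf("%s prereq-candidate=%v\n", k, prereqCandidate(k))
	}
}
```

A flagged entry is not automatically a prereq; it only marks manifests worth checking against the chart's reconcile path.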

Proposed fix

Add a direction tag to manifestFiles:

manifestFiles:
  - path: components/gpu-operator/manifests/kernel-module-params.yaml
    phase: pre   # default remains post

In pkg/bundler/deployer/localformat/, emit NNN-<name>-pre/ before the upstream chart for phase: pre and keep -post for everything else. Update pkg/recipe/metadata.go:101-104 and the registry/overlay schemas. Migrate the GB200 entries to phase: pre and audit the rest in the same PR.
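
A minimal sketch of how the phase split could look, not the actual localformat code. The ManifestFile and Phase types, the function names, and the choice to reuse the same NNN index for the -pre folder are all assumptions made for illustration; an empty phase falls back to post so existing recipes keep today's behavior.

```go
package main

import "fmt"

// Phase is a hypothetical direction tag for a manifestFiles entry.
type Phase string

const (
	PhasePre  Phase = "pre"
	PhasePost Phase = "post"
)

// ManifestFile is a hypothetical shape for one manifestFiles entry.
type ManifestFile struct {
	Path  string
	Phase Phase // empty means post, matching the current default
}

// splitByPhase partitions entries into pre and post lists and rejects
// unknown phase values instead of silently treating them as post.
func splitByPhase(files []ManifestFile) ([]ManifestFile, []ManifestFile, error) {
	var pre, post []ManifestFile
	for _, m := range files {
		switch m.Phase {
		case PhasePre:
			pre = append(pre, m)
		case "", PhasePost:
			post = append(post, m)
		default:
			return nil, nil, fmt.Errorf("manifest %q: unknown phase %q", m.Path, m.Phase)
		}
	}
	return pre, post, nil
}

// folderOrder returns the folders for one component in apply order:
// -pre (if any), then the chart, then -post (if any).
func folderOrder(index int, name string, pre, post []ManifestFile) []string {
	base := fmt.Sprintf("%03d-%s", index, name)
	var dirs []string
	if len(pre) > 0 {
		dirs = append(dirs, base+"-pre")
	}
	dirs = append(dirs, base)
	if len(post) > 0 {
		dirs = append(dirs, base+"-post")
	}
	return dirs
}

func main() {
	files := []ManifestFile{
		{Path: "components/gpu-operator/manifests/kernel-module-params.yaml", Phase: PhasePre},
		{Path: "components/gpu-operator/manifests/example-cr.yaml"}, // hypothetical post entry
	}
	pre, post, err := splitByPhase(files)
	if err != nil {
		panic(err)
	}
	fmt.Println(folderOrder(9, "gpu-operator", pre, post))
}
```

The order is returned explicitly because a plain lexical sort of these names would place 009-gpu-operator-pre after 009-gpu-operator-post. If deploy.sh derives order from sorted folder names (not confirmed here), the -pre folder would need a different naming or numbering scheme.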

Severity

Reproducible on every fresh GB200/EKS cluster; ~52-minute hang before terminal failure.
