fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/kubeflow-trainer-v2.2-durable
Apr 30, 2026

yuanchen8911 commented Apr 30, 2026

Summary

Bake the cluster-aware nodeSelector + tolerations into the torch-distributed ClusterTrainingRuntime, using AICR's existing nodeScheduling.accelerated bundler injection. Demos go back to bare-bones TrainJobs (no podTemplateOverrides, no runtimePatches).

Motivation / Context

The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md currently carry per-cluster scheduling boilerplate so the resulting pods land on AICR's tainted GPU nodes. Every TrainJob author has to repeat it, every demo has to be edited to match each cluster's scheduling vocabulary, and the override mechanism keeps changing upstream (PodTemplateOverrides deprecated in v2.1 → replaced by RuntimePatches in v2.2; see kubeflow/trainer#3309).

This PR moves the per-cluster scheduling into the runtime itself. The bundler already supports this via nodeScheduling.accelerated paths declared in recipes/registry.yaml — already used by gpu-operator, nfd, nodewright-customizations, and kgateway. kubeflow-trainer was the only manifestFiles-using component without an accelerated: block. This PR adds it.
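For illustration, the added block might look roughly like the fragment below. The `nodeScheduling.accelerated`, `nodeSelectorPaths`, and `tolerationPaths` keys and their values come from the Implementation Notes in this description; the list layout and the `name` key are assumptions about the registry schema, not copied from `recipes/registry.yaml`.

```yaml
# Sketch only: the entry layout and the `name` key are assumed, not taken from the file.
- name: kubeflow-trainer
  nodeScheduling:
    accelerated:
      # Values paths where the bundler injects the accelerated nodeSelector.
      nodeSelectorPaths:
        - acceleratedNodeSelector
      # Values paths where the bundler injects the accelerated tolerations.
      tolerationPaths:
        - acceleratedTolerations
```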

End state for users: same --accelerated-node-selector / --accelerated-node-toleration CLI flags at bundle time. Different cluster, different vocabulary, same demo TrainJob YAML.

API-version-agnostic. Works on kubeflow-trainer v2.1 (PodTemplateOverrides era) and v2.2+ (RuntimePatches era) identically, because the TrainJob no longer overrides anything — the runtime carries the scheduling.

Fixes: N/A
Related: kubeflow/trainer#3309

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (`pkg/recipe`)

Implementation Notes

Three coordinated changes (4 files, +34/-12 net):

  1. `recipes/registry.yaml` — add `nodeScheduling.accelerated` block to the `kubeflow-trainer` component entry. `nodeSelectorPaths: [acceleratedNodeSelector]` and `tolerationPaths: [acceleratedTolerations]` (top-level keys). Identical pattern to `gpu-operator` (`daemonsets.nodeSelector` / `daemonsets.tolerations`); just chose top-level for readability.

  2. `recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml` — replace the static pod-spec scheduling region with Helm template directives:

```yaml
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
```

`index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). Same access pattern as `nodewright-customizations/manifests/tuning.yaml`.

The bundler renders this template at bundle time, so the `bundle/-kubeflow-trainer-post/templates/` artifact is plain YAML with concrete values substituted — Helm at install time just applies it as-is.
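As a rough sketch of what that rendering amounts to, take the selector and tolerations from the EKS run under Testing (`nodeGroup=gpu-worker`, `dedicated=worker-workload` with `NoSchedule` and `NoExecute`). The first document shows the values shape the template reads, the second the scheduling region it would emit. The `operator: Equal` field and the flattened indentation are assumptions for readability; in the real artifact the block sits at the depth set by `nindent 20`.

```yaml
# Values as seen by the template (nested under the component name).
kubeflow-trainer:
  acceleratedNodeSelector:
    nodeGroup: gpu-worker
  acceleratedTolerations:
    - key: dedicated
      operator: Equal        # assumed; not stated in this description
      value: worker-workload
      effect: NoSchedule
    - key: dedicated
      operator: Equal        # assumed
      value: worker-workload
      effect: NoExecute
---
# Rendered scheduling region of the pod spec (indentation flattened here).
nodeSelector:
  nodeGroup: gpu-worker
tolerations:
  - key: dedicated
    operator: Equal
    value: worker-workload
    effect: NoSchedule
  - key: dedicated
    operator: Equal
    value: worker-workload
    effect: NoExecute
```

If either value is unset, the corresponding `with` block renders nothing, so the runtime simply carries no `nodeSelector` or `tolerations` key.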

  3. `demos/cuj1-eks.md` and `demos/cuj1-gke.md` — drop the entire `podTemplateOverrides` block. Demo TrainJob is just `trainer:` + `runtimeRef:`.
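A bare-bones TrainJob of that shape could look something like the sketch below. It is not copied from the demos: the `apiVersion` is assumed to be the Kubeflow Trainer v2 group, the runtime name `torch-distributed` is inferred from the ClusterTrainingRuntime named in the Summary, and the image and node count are hypothetical placeholders.

```yaml
# Illustrative only; not the literal YAML from demos/cuj1-eks.md.
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API group/version
kind: TrainJob
metadata:
  name: pytorch-mnist
spec:
  runtimeRef:
    name: torch-distributed                 # runtime that now carries the scheduling
  trainer:
    numNodes: 1                             # hypothetical
    image: example.com/pytorch-mnist:latest # hypothetical image
  # No podTemplateOverrides and no runtimePatches: nodeSelector and
  # tolerations are inherited from the ClusterTrainingRuntime.
```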

What stays the same:

  • `helm.sh/hook` annotations (still required by `pkg/recipe.TestManifestHelmHooksRequired`).
  • Bundler CLI flags (`--accelerated-node-selector`, `--accelerated-node-toleration`).
  • No bundler Go changes; no new patterns; no precedent broken.

Testing

```bash
yamllint -c .yamllint.yaml \
  recipes/registry.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

go test ./pkg/recipe/
# ok github.com/NVIDIA/aicr/pkg/recipe 0.845s
```

End-to-end on a real EKS H100 cluster (kubeflow-trainer v2.2.0):

  1. `helm upgrade kubeflow-trainer-post` from this branch's bundle -> CTR live with baked tolerations + nodeSelector.
  2. Apply the bare-bones TrainJob from `demos/cuj1-eks.md` literally (no `podTemplateOverrides`, no `runtimePatches`). Admission accepts; pod scheduled to GPU node with `dedicated=worker-workload:NoSchedule|NoExecute` tolerations and `nodeGroup=gpu-worker` nodeSelector inherited from the runtime; `pytorch-mnist` runs to completion in 21s with `accuracy=0.7424`.

Risk Assessment

  • Low — Isolated change, validated end-to-end, easy to revert.

Rollout notes: Existing clusters re-bundling get the new templated CTR on the next `helm upgrade kubeflow-trainer-post`. Backwards-compatible: TrainJobs that still use `podTemplateOverrides` (v2.1) or `runtimePatches` (v2.2) continue to work — those override mechanisms are additive, this PR just removes the need for them in the AICR-standard demo flow.

Checklist

  • Tests pass locally (`go test ./pkg/recipe/`, `yamllint`)
  • Linter passes (`yamllint`)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — uses existing `nodeScheduling.accelerated` injection paths covered by existing bundler tests)
  • I updated docs if user-facing behavior changed (`demos/cuj1-{eks,gke}.md` updated)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (`git commit -S`)
