fix(recipes): handle kubeflow-trainer v2.2.0 API changes by yuanchen8911 · Pull Request #724 · NVIDIA/aicr

yuanchen8911 · 2026-04-30T22:19:49Z

Summary

Bake the cluster-aware nodeSelector + tolerations into the torch-distributed ClusterTrainingRuntime, using AICR's existing nodeScheduling.accelerated bundler injection. Demos go back to bare-bones TrainJobs (no podTemplateOverrides, no runtimePatches).

Motivation / Context

The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md currently carry per-cluster scheduling boilerplate so the resulting pods land on AICR's tainted GPU nodes. Each TrainJob author has to repeat it, each demo has to be edited to match the target cluster's vocabulary, and the override mechanism keeps changing upstream (PodTemplateOverrides deprecated in v2.1 → replaced by RuntimePatches in v2.2; see kubeflow/trainer#3309).

This PR moves the per-cluster scheduling into the runtime itself. The bundler already supports this via nodeScheduling.accelerated paths declared in recipes/registry.yaml — already used by gpu-operator, nfd, nodewright-customizations, and kgateway. kubeflow-trainer was the only manifestFiles-using component without an accelerated: block. This PR adds it.
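As a rough illustration of the block this PR adds, the registry entry might look like the excerpt below. The `nodeScheduling.accelerated` block and the two path lists come from the Implementation Notes; the shape of the surrounding entry (a list item with a `name` field) is an assumption and may not match the real `recipes/registry.yaml`.

```yaml
# recipes/registry.yaml (excerpt, sketch only)
# Assumption: components are list items keyed by `name`; other fields are omitted.
- name: kubeflow-trainer
  nodeScheduling:
    accelerated:
      # Values keys into which the bundler injects --accelerated-node-selector
      nodeSelectorPaths:
        - acceleratedNodeSelector
      # Values keys into which the bundler injects --accelerated-node-toleration
      tolerationPaths:
        - acceleratedTolerations
```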

End state for users: same --accelerated-node-selector / --accelerated-node-toleration CLI flags at bundle time. Different cluster, different vocabulary, same demo TrainJob YAML.

API-version-agnostic. Works on kubeflow-trainer v2.1 (PodTemplateOverrides era) and v2.2+ (RuntimePatches era) identically, because the TrainJob no longer overrides anything — the runtime carries the scheduling.

Fixes: N/A
Related: kubeflow/trainer#3309

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Recipe engine / data (`pkg/recipe`)

Implementation Notes

Three coordinated changes (4 files, +34/-12 net):

`recipes/registry.yaml`: add a `nodeScheduling.accelerated` block to the `kubeflow-trainer` component entry, with `nodeSelectorPaths: [acceleratedNodeSelector]` and `tolerationPaths: [acceleratedTolerations]` (top-level keys). This is the same pattern `gpu-operator` uses (`daemonsets.nodeSelector` / `daemonsets.tolerations`); top-level keys were chosen here for readability.
`recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml` — replace the static pod-spec scheduling region with Helm template directives:

```yaml
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
```

`index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). Same access pattern as `nodewright-customizations/manifests/tuning.yaml`.

The bundler renders this template at bundle time, so the `bundle/-kubeflow-trainer-post/templates/` artifact is plain YAML with concrete values substituted — Helm at install time just applies it as-is.

`demos/cuj1-eks.md` and `demos/cuj1-gke.md` — drop the entire `podTemplateOverrides` block. Demo TrainJob is just `trainer:` + `runtimeRef:`.
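For orientation, a bare-bones TrainJob of the kind the demos now use might look like the sketch below. Only `trainer:`, `runtimeRef:`, the `torch-distributed` runtime name, and the `pytorch-mnist` job name are taken from this PR; the API version, image, and resource request are assumptions for illustration and are not copied from `demos/cuj1-eks.md`.

```yaml
# Sketch of a TrainJob with no scheduling overrides.
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainJob
metadata:
  name: pytorch-mnist
spec:
  runtimeRef:
    name: torch-distributed                 # the ClusterTrainingRuntime templated by this PR
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
  trainer:
    numNodes: 1                             # assumed
    image: example.com/pytorch-mnist:latest # hypothetical image
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1                   # assumed GPU request
```

Note that the manifest carries no `nodeSelector`, `tolerations`, `podTemplateOverrides`, or `runtimePatches`; placement is expected to come from the runtime.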

What stays the same:

- `helm.sh/hook` annotations (still required by `pkg/recipe.TestManifestHelmHooksRequired`).
- Bundler CLI flags (`--accelerated-node-selector`, `--accelerated-node-toleration`).
- No bundler Go changes; no new patterns; no precedent broken.

Testing

```bash
yamllint -c .yamllint.yaml \
  recipes/registry.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

go test ./pkg/recipe/
# ok github.com/NVIDIA/aicr/pkg/recipe 0.845s
```

End-to-end on a real EKS H100 cluster (kubeflow-trainer v2.2.0):

- Ran `helm upgrade kubeflow-trainer-post` from this branch's bundle: the CTR is live with the baked-in tolerations and nodeSelector.
- Applied the bare-bones TrainJob from `demos/cuj1-eks.md` verbatim (no `podTemplateOverrides`, no `runtimePatches`). Admission accepted it; the pod was scheduled to a GPU node with the `dedicated=worker-workload:NoSchedule|NoExecute` tolerations and the `nodeGroup=gpu-worker` nodeSelector inherited from the runtime; `pytorch-mnist` ran to completion in 21s with `accuracy=0.7424`.
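To make the "baked" result concrete, the scheduling region of the rendered runtime for this EKS run would presumably resemble the fragment below. The selector and toleration values are the ones reported in the test above; the `operator: Equal` field and the key order are assumptions, and indentation is flattened here (the real template indents with `nindent 20`).

```yaml
# Sketch of the rendered pod-spec scheduling region (indentation reduced for readability).
nodeSelector:
  nodeGroup: gpu-worker
tolerations:
  - key: dedicated
    operator: Equal        # assumed; not stated in the test notes
    value: worker-workload
    effect: NoSchedule
  - key: dedicated
    operator: Equal        # assumed
    value: worker-workload
    effect: NoExecute
```

On a cluster with different labels and taints, only the values passed at bundle time would change; the template and the demo TrainJob stay the same.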

Risk Assessment

Low — Isolated change, validated end-to-end, easy to revert.

Rollout notes: Existing clusters that re-bundle get the new templated CTR on the next `helm upgrade kubeflow-trainer-post`. Backwards-compatible: TrainJobs that still use `podTemplateOverrides` (v2.1) or `runtimePatches` (v2.2) continue to work. Those override mechanisms are additive; this PR only removes the need for them in the AICR-standard demo flow.

Checklist

- Tests pass locally (`go test ./pkg/recipe/`, `yamllint`)
- Linter passes (`yamllint`)
- I did not skip/disable tests to make CI green
- I added/updated tests for new functionality (N/A: uses existing `nodeScheduling.accelerated` injection paths covered by existing bundler tests)
- I updated docs if user-facing behavior changed (`demos/cuj1-{eks,gke}.md` updated)
- Changes follow existing patterns in the codebase
- Commits are cryptographically signed (`git commit -S`)

yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/kubeflow-trainer-v2.2-durable
