fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724
Merged
yuanchen8911 merged 1 commit on Apr 30, 2026
Summary
Bake the cluster-aware `nodeSelector` + `tolerations` into the `torch-distributed` ClusterTrainingRuntime, using AICR's existing `nodeScheduling.accelerated` bundler injection. Demos go back to bare-bones TrainJobs (no `podTemplateOverrides`, no `runtimePatches`).

Motivation / Context
The pytorch demo TrainJobs in `demos/cuj1-{eks,gke}.md` currently carry per-cluster scheduling boilerplate so the resulting pods land on AICR's tainted GPU nodes. Each TrainJob author has to repeat it; each demo has to be edited to match each cluster's vocabulary; and the override mechanism keeps changing upstream (`PodTemplateOverrides` deprecated in v2.1 → replaced by `RuntimePatches` in v2.2; see kubeflow/trainer#3309).

This PR moves the per-cluster scheduling into the runtime itself. The bundler already supports this via `nodeScheduling.accelerated` paths declared in `recipes/registry.yaml`, which are already used by `gpu-operator`, `nfd`, `nodewright-customizations`, and `kgateway`. `kubeflow-trainer` was the only manifestFiles-using component without an `accelerated:` block. This PR adds it.

End state for users: same `--accelerated-node-selector` / `--accelerated-node-toleration` CLI flags at bundle time. Different cluster, different vocabulary, same demo TrainJob YAML.

API-version-agnostic. Works on kubeflow-trainer v2.1 (`PodTemplateOverrides` era) and v2.2+ (`RuntimePatches` era) identically, because the TrainJob no longer overrides anything; the runtime carries the scheduling.

Fixes: N/A
Related: kubeflow/trainer#3309
Type of Change
Component(s) Affected
Implementation Notes
Three coordinated changes (4 files, +34/-12 net):
`recipes/registry.yaml`: add a `nodeScheduling.accelerated` block to the `kubeflow-trainer` component entry, with `nodeSelectorPaths: [acceleratedNodeSelector]` and `tolerationPaths: [acceleratedTolerations]` (top-level keys). This is the same pattern as `gpu-operator` (`daemonsets.nodeSelector` / `daemonsets.tolerations`); top-level keys were chosen here for readability.
`recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml` — replace the static pod-spec scheduling region with Helm template directives:
```yaml
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
```
`index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). Same access pattern as `nodewright-customizations/manifests/tuning.yaml`.
The bundler renders this template at bundle time, so the `bundle/-kubeflow-trainer-post/templates/` artifact is plain YAML with concrete values substituted — Helm at install time just applies it as-is.
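To make the two changes above concrete, here is a rough sketch of what the registry entry might look like. Only the `nodeScheduling.accelerated` block and its two path lists come from this description; the `name:` key and the list layout around it are assumptions about the registry schema, and all other fields of the real entry are omitted.

```yaml
# Sketch only: the surrounding schema (list item, `name:` key) is assumed.
- name: kubeflow-trainer
  nodeScheduling:
    accelerated:
      nodeSelectorPaths:
        - acceleratedNodeSelector
      tolerationPaths:
        - acceleratedTolerations
```

And a sketch of what the template could render to at bundle time. The label and taint below are hypothetical example values, not the ones AICR uses; the 18-space indentation of `nodeSelector:` / `tolerations:` is likewise assumed from the `nindent 20` in the template. `toYaml` emits map keys in sorted order.

```yaml
# Hypothetical input values, nested under the component name:
#
#   kubeflow-trainer:
#     acceleratedNodeSelector:
#       nvidia.com/gpu.present: "true"
#     acceleratedTolerations:
#       - key: nvidia.com/gpu
#         operator: Exists
#         effect: NoSchedule
#
# Rendered pod-spec fragment (plain YAML, no template directives left):
                  nodeSelector:
                    nvidia.com/gpu.present: "true"
                  tolerations:
                    - effect: NoSchedule
                      key: nvidia.com/gpu
                      operator: Exists
```

If either value is unset, the corresponding `with` block renders nothing, so the runtime falls back to carrying no scheduling constraint for that field.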
What stays the same:
Testing
```bash
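# Lint the two touched YAML files (line continuations added so the command runs as one invocation).
yamllint -c .yamllint.yaml \
  recipes/registry.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

# Run the recipe package tests.
go test ./pkg/recipe/
# ok github.com/NVIDIA/aicr/pkg/recipe 0.845s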
```
End-to-end on a real EKS H100 cluster (kubeflow-trainer v2.2.0):
Risk Assessment
Rollout notes: Clusters that re-bundle get the new templated CTR (ClusterTrainingRuntime) on the next `helm upgrade kubeflow-trainer-post`. Backwards-compatible: TrainJobs that still use `podTemplateOverrides` (v2.1) or `runtimePatches` (v2.2) continue to work. Those override mechanisms are additive; this PR only removes the need for them in the AICR-standard demo flow.
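For illustration, a bare-bones TrainJob in the demo flow might look roughly like the sketch below. The job name, image, and command are hypothetical, and the `trainer.kubeflow.org/v1alpha1` API version is an assumption that should be checked against the installed kubeflow-trainer release. The point is only that no scheduling fields appear: the referenced runtime supplies `nodeSelector` and `tolerations`.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed; verify against your install
kind: TrainJob
metadata:
  name: pytorch-demo                        # hypothetical name
spec:
  runtimeRef:
    name: torch-distributed                 # the ClusterTrainingRuntime templated by this PR
  trainer:
    numNodes: 1
    image: example.com/pytorch-train:latest # hypothetical image
    command: ["torchrun", "train.py"]       # hypothetical entrypoint
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
  # No podTemplateOverrides / runtimePatches: scheduling comes from the runtime.
```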
Checklist