Summary
The performance validator's self-installed Kubeflow Trainer version is hardcoded to v2.2.0 and cannot be overridden. Make the Trainer version (and its coupled parameters) configurable, ideally derived from the recipe's kubeflow-trainer component pin when present, with env-var override and the current value as the compiled default.
Background
When the nccl-all-reduce-bw / -net / -nvls performance checks run, the validator launches its NCCL job as a Kubeflow TrainJob + TrainingRuntime. If Kubeflow Trainer is not already present in the cluster, the validator self-installs it transiently (installTrainer), then removes only what it installed on cleanup (deleteTrainer). If Trainer is already present (e.g. deployed by the runtime bundle's kubeflow platform), the validator detects the CRD and skips the install, leaving the existing deployment untouched, so this issue affects only the self-install path.
The version is a compile-time constant in validators/performance/trainer_lifecycle.go:
const (
	// trainerArchiveURL is the GitHub tar.gz archive for Kubeflow Trainer v2.2.0.
	trainerArchiveURL = "https://github.com/kubeflow/trainer/archive/refs/tags/v2.2.0.tar.gz"
	trainerKustomizePath = "manifests/overlays/manager"
)
It is not just one string; it is coupled to:
- the kustomize overlay path within the archive (manifests/overlays/manager),
- the JobSet staging→promoted image rewrite (jobSetStagingImageRepo → jobSetPromotedImageRepo at v0.11.0), which assumes the JobSet version shipped by Trainer v2.2.0.
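One way to keep these coupled values from drifting apart is to group them in a single value and derive the archive URL from the version. The sketch below is illustrative only: trainerPin, archiveURL, and defaultTrainerPin are hypothetical names that do not exist in the validator today, and it assumes the JobSet tag used by the rewrite can be carried as a plain string. Only the URL shape, the overlay path, and the version numbers come from the code quoted above.

```go
package main

import "fmt"

// trainerPin groups the values that have to change together.
// The type and its field names are hypothetical, not part of the validator.
type trainerPin struct {
	Version       string // Trainer source tag, e.g. "v2.2.0"
	KustomizePath string // kustomize overlay path inside the archive
	JobSetVersion string // JobSet tag assumed by the staging-to-promoted image rewrite
}

// archiveURL builds the GitHub tar.gz URL for the pinned tag, using the same
// URL shape as the current trainerArchiveURL constant.
func (p trainerPin) archiveURL() string {
	return fmt.Sprintf("https://github.com/kubeflow/trainer/archive/refs/tags/%s.tar.gz", p.Version)
}

// defaultTrainerPin mirrors the values that are compiled in today.
var defaultTrainerPin = trainerPin{
	Version:       "v2.2.0",
	KustomizePath: "manifests/overlays/manager",
	JobSetVersion: "v0.11.0",
}

func main() {
	fmt.Println(defaultTrainerPin.archiveURL())
	fmt.Println(defaultTrainerPin.KustomizePath, defaultTrainerPin.JobSetVersion)
}
```

With this shape, selecting a different Trainer version means supplying a whole pin rather than editing one string, which makes a missing JobSet tag harder to overlook.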
Meanwhile the recipe already pins a Trainer version declaratively in recipes/registry.yaml:
- name: kubeflow-trainer
  helm:
    defaultChart: kubeflow-trainer
    defaultVersion: 2.2.0
These two pins can drift independently, and changing the validator's version today requires a code edit + rebuild/republish of the aicr-validators/performance image.
Proposed change
Make the self-installed Trainer version (and coupled JobSet image pin) configurable, with this resolution precedence (mirroring the existing inference-perf knob pattern in recipes/validators/catalog.yaml):
- Recipe-derived: read the kubeflow-trainer component version from the recipe/ValidationInput when defined, and install that version on the self-install path.
- Env override: e.g. AICR_NCCL_TRAINER_VERSION (and a JobSet image/version knob if the rewrite must stay in sync), settable via the catalog env block without rebuilding the image.
- Compiled default: keep v2.2.0 as today.
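The resolution order could be sketched roughly as follows. This is a minimal illustration, not the validator's code: resolveTrainerVersion and the constant names are hypothetical, and the sketch assumes the list above is ordered highest precedence first (recipe, then env, then compiled default). If the env knob is meant to win over the recipe, the first two checks would simply swap. The env lookup is passed in as a function so the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	// defaultTrainerVersion is the value compiled in today.
	defaultTrainerVersion = "v2.2.0"
	// trainerVersionEnv is the example knob name suggested in this issue.
	trainerVersionEnv = "AICR_NCCL_TRAINER_VERSION"
)

// resolveTrainerVersion returns the first non-empty candidate in the assumed
// order (recipe pin, env override, compiled default) together with a label
// describing where the value came from.
func resolveTrainerVersion(recipeVersion string, getenv func(string) string) (version, source string) {
	if v := strings.TrimSpace(recipeVersion); v != "" {
		return v, "recipe"
	}
	if v := strings.TrimSpace(getenv(trainerVersionEnv)); v != "" {
		return v, "env"
	}
	return defaultTrainerVersion, "default"
}

func main() {
	// No recipe pin here, so the result depends on the process environment.
	v, src := resolveTrainerVersion("", os.Getenv)
	fmt.Printf("trainer version %s (from %s)\n", v, src)
}
```

Returning the source alongside the version makes it straightforward to log which pin was used when a run installs an unexpected Trainer.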
Considerations / open questions
- The recipe pins a Helm chart version (oci://ghcr.io/kubeflow/charts), while the validator installs from a GitHub source tarball via kustomize. These version schemes need to be reconciled (map chart version → source tag, or switch the self-installer to the Helm chart).
- The NCCL TrainingRuntime/TrainJob testdata templates target trainer.kubeflow.org/v1alpha1; a configurable version must still be API-compatible with those templates (or the templates need to be version-aware). Validate/guard against an incompatible version rather than silently failing.
- Keep the JobSet image rewrite in sync with whatever Trainer version is selected.
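A guard along these lines could cover the first two points. It is a sketch under stated assumptions: sourceTagFor and compatibleTags are hypothetical names, the mapping assumes a chart version such as 2.2.0 corresponds to the source tag v2.2.0 (which this issue leaves open), and the allowlist contains only v2.2.0 because that is the only version the issue names as working with the v1alpha1 templates.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagPattern accepts tags of the form vMAJOR.MINOR.PATCH only.
var tagPattern = regexp.MustCompile(`^v\d+\.\d+\.\d+$`)

// compatibleTags is a hypothetical allowlist of source tags assumed to serve
// trainer.kubeflow.org/v1alpha1. Extend it only after verifying a version.
var compatibleTags = map[string]bool{"v2.2.0": true}

// sourceTagFor maps a chart-style version ("2.2.0") or a tag ("v2.2.0") to a
// source tag, and rejects anything malformed or not on the allowlist instead
// of letting the install proceed and fail later.
func sourceTagFor(version string) (string, error) {
	tag := strings.TrimSpace(version)
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag
	}
	if !tagPattern.MatchString(tag) {
		return "", fmt.Errorf("trainer version %q is not of the form vMAJOR.MINOR.PATCH", version)
	}
	if !compatibleTags[tag] {
		return "", fmt.Errorf("trainer version %q is not known to be compatible with the v1alpha1 templates", tag)
	}
	return tag, nil
}

func main() {
	for _, v := range []string{"2.2.0", "v2.2.0", "3.0.0", "latest"} {
		tag, err := sourceTagFor(v)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", tag)
	}
}
```

An allowlist is the most conservative choice; a version range check would be less maintenance but depends on knowing when the Trainer API version changes.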
Acceptance criteria
- [...] ErrCodeInvalidRequest), not a silent skip or a broken TrainJob.
- [...] recipes/validators/catalog.yaml if a new env knob is added.
References
validators/performance/trainer_lifecycle.go (hardcoded const, install/delete logic)
validators/performance/nccl_all_reduce_bw_constraint.go (isTrainerInstalled skip-if-present at the install call site)
recipes/registry.yaml (kubeflow-trainer defaultVersion: 2.2.0)
recipes/validators/catalog.yaml (inference-perf env-knob precedence pattern to mirror)