Make self-installed Kubeflow Trainer version configurable in performance validator #1564

Description

@yuanchen8911

Summary

The performance validator's self-installed Kubeflow Trainer version is hardcoded to v2.2.0 and cannot be overridden. Make the Trainer version (and its coupled parameters) configurable, ideally derived from the recipe's kubeflow-trainer component pin when present, with env-var override and the current value as the compiled default.

Background

When the nccl-all-reduce-bw / -net / -nvls performance checks run, the validator launches its NCCL job as a Kubeflow TrainJob + TrainingRuntime. If Kubeflow Trainer is not already present in the cluster, the validator self-installs it transiently (installTrainer), then removes only what it installed on cleanup (deleteTrainer). If Trainer is already present (e.g. deployed by the runtime bundle's kubeflow platform), the validator detects the CRD and skips install, leaving the existing install untouched — so this only affects the install path.

The version is a compile-time constant in validators/performance/trainer_lifecycle.go:

// trainerArchiveURL is the GitHub tar.gz archive for Kubeflow Trainer v2.2.0.
trainerArchiveURL   = "https://github.com/kubeflow/trainer/archive/refs/tags/v2.2.0.tar.gz"
trainerKustomizePath = "manifests/overlays/manager"

It is not just one string — it is coupled to:

  • the kustomize overlay path within the archive (manifests/overlays/manager),
  • the JobSet staging→promoted image rewrite (jobSetStagingImageRepo → jobSetPromotedImageRepo at v0.11.0), which assumes the JobSet version shipped by Trainer v2.2.0.
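
One way to make this coupling explicit is to group the version-dependent values in a single struct and derive the archive URL from the tag. The sketch below is illustrative only: the type trainerRelease, its fields, and defaultTrainerRelease are hypothetical names, not identifiers from trainer_lifecycle.go. The values are the ones quoted above.

```go
package main

import "fmt"

// trainerRelease groups the parameters that are coupled to one Kubeflow
// Trainer version. Hypothetical type, not present in the validator today.
type trainerRelease struct {
	Version       string // Git tag of kubeflow/trainer, e.g. "v2.2.0"
	KustomizePath string // overlay path inside the extracted archive
	JobSetVersion string // JobSet version the image rewrite assumes
}

// ArchiveURL builds the GitHub tar.gz archive URL for the release tag,
// following the same URL shape as the current trainerArchiveURL constant.
func (r trainerRelease) ArchiveURL() string {
	return fmt.Sprintf("https://github.com/kubeflow/trainer/archive/refs/tags/%s.tar.gz", r.Version)
}

// defaultTrainerRelease mirrors the values that are hardcoded today.
var defaultTrainerRelease = trainerRelease{
	Version:       "v2.2.0",
	KustomizePath: "manifests/overlays/manager",
	JobSetVersion: "v0.11.0",
}
```

With this shape, selecting a different Trainer version means selecting a whole trainerRelease value, so the overlay path and JobSet pin cannot be left behind by accident.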

Meanwhile the recipe already pins a Trainer version declaratively in recipes/registry.yaml:

- name: kubeflow-trainer
  helm:
    defaultChart: kubeflow-trainer
    defaultVersion: 2.2.0

These two pins can drift independently, and changing the validator's version today requires a code edit + rebuild/republish of the aicr-validators/performance image.

Proposed change

Make the self-installed Trainer version (and coupled JobSet image pin) configurable, with this resolution precedence (mirroring the existing inference-perf knob pattern in recipes/validators/catalog.yaml):

  1. Recipe-derived — read the kubeflow-trainer component version from the recipe/ValidationInput when defined, and install that version on the self-install path.
  2. Env override — e.g. AICR_NCCL_TRAINER_VERSION (and a JobSet image/version knob if the rewrite must stay in sync), settable via the catalog env block without rebuilding the image.
  3. Compiled default — keep v2.2.0 as today.
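
The precedence above could be sketched roughly as follows. This is a minimal illustration, not the validator's code: resolveTrainerVersion, versionSource, and the way the recipe version reaches the function are assumptions, and AICR_NCCL_TRAINER_VERSION is only the example name suggested in this issue. The environment lookup is passed in as a function so the precedence can be unit-tested without touching the process environment.

```go
package main

import "strings"

const (
	// defaultTrainerVersion is the compiled default (today's hardcoded value).
	defaultTrainerVersion = "v2.2.0"
	// trainerVersionEnv is the example env knob name proposed in this issue.
	trainerVersionEnv = "AICR_NCCL_TRAINER_VERSION"
)

// versionSource records where the resolved version came from (hypothetical).
type versionSource string

const (
	sourceRecipe  versionSource = "recipe"
	sourceEnv     versionSource = "env"
	sourceDefault versionSource = "default"
)

// resolveTrainerVersion applies the proposed precedence:
// recipe component pin, then env override, then compiled default.
// recipeVersion is empty when the recipe defines no kubeflow-trainer pin.
// The returned string is not normalized or validated here.
func resolveTrainerVersion(recipeVersion string, getenv func(string) string) (string, versionSource) {
	if v := strings.TrimSpace(recipeVersion); v != "" {
		return v, sourceRecipe
	}
	if v := strings.TrimSpace(getenv(trainerVersionEnv)); v != "" {
		return v, sourceEnv
	}
	return defaultTrainerVersion, sourceDefault
}
```

In production the caller would pass os.Getenv. Returning the source alongside the version makes it easy to log which pin won.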

Considerations / open questions

  • The recipe pins a Helm chart version (oci://ghcr.io/kubeflow/charts), while the validator installs from a GitHub source tarball via kustomize. These version schemes need to be reconciled (map chart version → source tag, or switch the self-installer to the Helm chart).
  • The NCCL TrainingRuntime/TrainJob testdata templates target trainer.kubeflow.org/v1alpha1; a configurable version must still be API-compatible with those templates (or the templates need to be version-aware). Validate/guard against an incompatible version rather than silently failing.
  • Keep the JobSet image rewrite in sync with whatever Trainer version is selected.
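
A possible shape for the reconciliation and guard is a small allowlist keyed by source tag. The sketch below rests on two loud assumptions: that a chart version maps to a source tag simply by adding a leading "v" (this is exactly the open question above and may not hold), and that only versions verified against the v1alpha1 templates are listed. The names jobSetByTrainerTag, sourceTagForChartVersion, and checkTrainerVersion are hypothetical, and a plain error stands in for the project's ErrCodeInvalidRequest error.

```go
package main

import (
	"fmt"
	"strings"
)

// jobSetByTrainerTag lists Trainer source tags that have been verified
// against the NCCL templates, with the JobSet version each one ships.
// Hypothetical table; only the v2.2.0 / v0.11.0 pair comes from this issue.
var jobSetByTrainerTag = map[string]string{
	"v2.2.0": "v0.11.0",
}

// sourceTagForChartVersion maps a Helm chart version such as "2.2.0" to a
// source tag such as "v2.2.0". ASSUMPTION: chart and source versions differ
// only by the leading "v".
func sourceTagForChartVersion(v string) string {
	v = strings.TrimSpace(v)
	if v == "" || strings.HasPrefix(v, "v") {
		return v
	}
	return "v" + v
}

// checkTrainerVersion fails fast for versions that are not in the verified
// table, instead of installing them and producing a broken TrainJob.
func checkTrainerVersion(v string) (tag string, jobSetVersion string, err error) {
	tag = sourceTagForChartVersion(v)
	jobSetVersion, ok := jobSetByTrainerTag[tag]
	if !ok {
		return "", "", fmt.Errorf("unsupported Kubeflow Trainer version %q: no verified JobSet pin for tag %q", v, tag)
	}
	return tag, jobSetVersion, nil
}
```

Keeping the JobSet pin in the same table as the compatibility check means a new Trainer version cannot be enabled without also stating which JobSet image it expects.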

Acceptance criteria

  • Trainer version resolvable from recipe component pin when present, else env, else compiled default.
  • JobSet image pin tracks the selected Trainer version (no ImagePullBackOff regression).
  • Incompatible/unknown version fails fast with a clear error (ErrCodeInvalidRequest), not a silent skip or a broken TrainJob.
  • Unit tests cover the resolution precedence; existing self-install + skip-if-present behavior preserved.
  • Doc/comment update in recipes/validators/catalog.yaml if a new env knob is added.

References

  • validators/performance/trainer_lifecycle.go (hardcoded const, install/delete logic)
  • validators/performance/nccl_all_reduce_bw_constraint.go (isTrainerInstalled skip-if-present at the install call site)
  • recipes/registry.yaml (kubeflow-trainer defaultVersion: 2.2.0)
  • recipes/validators/catalog.yaml (inference-perf env-knob precedence pattern to mirror)

Metadata

Labels

  • area/validator
  • theme/validation (Constraint evaluation, health checks, and conformance evidence)
