From ba34b191b973831c51496c315c6b29d45f2ef87f Mon Sep 17 00:00:00 2001
From: Rob Bell
Date: Mon, 13 Apr 2026 21:25:18 +0100
Subject: [PATCH 1/5] feat(docs): add KEP-2599 for mutable runtimes

Proposes allowing TrainingRuntimes and ClusterTrainingRuntimes to be mutable by introducing TrainingRuntimeSnapshot resources. TrainJobs snapshot their referenced runtime configuration on first reconciliation, decoupling job execution from runtime changes.

Key changes:
- New TrainingRuntimeSnapshot CRD to store point-in-time runtime config
- Remove finalizers from runtimes (no longer needed)
- Remove TrainJobWatcher interface and boilerplate
- Automatic migration for existing TrainJobs on upgrade

Addresses issue #2599

Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Rob Bell
---
 .../proposals/2599-mutable-runtimes/README.md | 155 ++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 docs/proposals/2599-mutable-runtimes/README.md

diff --git a/docs/proposals/2599-mutable-runtimes/README.md b/docs/proposals/2599-mutable-runtimes/README.md
new file mode 100644
index 0000000000..08876e0404
--- /dev/null
+++ b/docs/proposals/2599-mutable-runtimes/README.md
@@ -0,0 +1,155 @@
+# KEP-2599: Mutable Runtimes
+
+## Authors
+- Rob Bell (Red Hat)
+
+Assisted by Claude Code (Sonnet 4.5)
+
+## Summary
+
+This document proposes a new Custom Resource to allow Cluster Training Runtimes and Training Runtimes to be fully mutable.
+
+## Motivation
+
+The `TrainingRuntime` and `ClusterTrainingRuntime` APIs were designed as blueprints for model training. A `TrainJob` references a runtime, and uses its configuration during reconciliation.
+
+Currently, runtimes are protected by finalizers to prevent deletion while in use by TrainJobs. This design creates operational friction:
+
+- **Inability to Update Runtimes:** Platform admins cannot update runtimes (e.g., newer PyTorch versions, security patches) without either risking breaking existing TrainJobs, or creating new runtime objects leading to proliferation of similar runtimes that differ only in minor version details.
+- **Finalizers block deletion:** admins must ensure that all TrainJobs are removed before the control plane is uninstalled to ensure runtime finalizers are removed. Failure to do this can cause operational problems, e.g. unable to re-install the operator; namespaces stuck in "deleting" status.
+
+While the original [Trainer v2 design](../2170-kubeflow-trainer-v2#the-training-runtime-api) anticipated runtimes being immutable with version control, a versioning mechanism has not yet been implemented. Instead, the current finalizer-based approach prevents deletion of runtimes that are referenced but does not provide the intended benefits of versioning.
+
+### Goals
+
+* **Mutable Runtimes**: users and platform admins are able to update, add or remove `TrainingRuntimes` and `ClusterTrainingRuntimes` without impacting existing running or paused Training Job.
+* **Self-contained TrainJob**: once a TrainJob is created, its configuration is entirely self-contained. It only depends on itself or on resources it has created and owns. It does not depend on any external resources.
+* **Remove finalizer on runtimes**: `TrainingRuntimes` and `ClusterTrainingRuntimes` should no longer need a finalizer.
+
+### Non-Goals
+
+* **Mutable TrainJobs**: we are not proposing any changes to the existing immutable fields of TrainJobs. These fields will remain immutable.
+
+## Proposal
+
+### User Stories
+
+### Story 1
+
+As a platform engineer, I want to be able to update or delete a training runtime without breaking any existing running or paused training jobs.
+
+### Story 2
+
+As a maintainer of Kubeflow Trainer, I want to be able to update or delete the default training runtimes included in a Kubeflow Trainer release without introducing breaking changes for users.
+
+## Design details
+
+We propose making the TrainJob only lookup the runtime configuration at creation and instead store a "snapshot" of the runtime configuration in a separate object:
+
+* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
+* when a train job is reconciled, the controller first tries to fetch the `TrainingRuntimeSnapshot` for the job. If the snapshot does not exist, it looks up the `(Cluster)TrainingRuntime` referenced by the train job and creates a new `TrainingRuntimeSnapshot` resource. The snapshot resource has the same name and namespace as the `TrainJob`, and the same spec copied from the referenced `(Cluster)TrainingRuntime`.
+* the `TrainJob` reconciliation logic gets the runtime configuration from the snapshot rather than the `(Cluster)TrainingRuntime`.
+* the reconciliation is otherwise unchanged.
+* the `TrainingRuntimeSnapshot` is automatically deleted when the train job is deleted using an `ownerReference` on the snapshot.
+
+Additional changes:
+* Remove the finalizer on the runtimes. It is no longer necessary as TrainJobs only need to reference runtimes on creation.
+* Remove the `TrainJobWatcher` interface and associated implementations and boilerplate.
+
+The `TrainingRuntimeSnapshot` will have the following API:
+
+```go
+// TrainingRuntimeSnapshot contains a point-in-time snapshot of a TrainingRuntime or ClusterTrainingRuntime as it was
+// observed when a TrainJob was first reconciled.
+type TrainingRuntimeSnapshot struct {
+	metav1.TypeMeta `json:",inline"`
+
+	// metadata of the TrainingRuntimeSnapshot.
+	// +optional
+	metav1.ObjectMeta `json:"metadata,omitempty"`
+
+	// spec of the TrainingRuntimeSnapshot.
+	// +optional
+	Spec TrainingRuntimeSpec `json:"spec,omitempty,omitzero"`
+}
+```
+
+### Migrations for existing resources
+
+**TrainJobs**: existing TrainJobs are automatically migrated on first reconciliation after upgrade. The controller creates a `TrainingRuntimeSnapshot` for each non-finished TrainJob by copying the current state of its referenced runtime.
+
+If a runtime was modified after a TrainJob was created but before the upgrade, the snapshot will capture the modified state not the original configuration the job used. This is acceptable, however, because any incompatible changes would have already caused the TrainJob reconciliation to fail before the upgrade.
+
+**ClusterTrainingRuntimes** and **TrainingRuntimes**: all runtime finalizers need removing which can be done using the runtime controllers. This finalizer removal logic can be removed in a future version after sufficient time has passed for all clusters to migrate. Release notes should document this and warn users they must upgrade through the migration release(s) and cannot skip directly to later versions.
+
+**Rollback:** No explicit rollback logic is required. On rollback, the controller will reconcile using the runtime; any `TrainingRuntimeSnapshot` objects will be ignored and removed once the TrainJob is deleted.
+
+### Test plan
+
+#### E2E tests
+
+* `test/e2e/e2e_test.go`
+  * test updating TrainingRuntime does not affect a paused TrainJob: create runtime + train job, allow train job to start, pause train job, update the runtime, restart the train job. TrainJob should use the original configuration.
+  * ensure existing tests still pass
+
+#### Integration tests
+
+* `test/integration/controller/clustertrainingruntime_controller_test.go`
+  * remove test file. Only contains tests relating to the finalizer which is being removed.
+* `test/integration/controller/trainingruntime_controller_test.go`
+  * remove test file. Only contains tests relating to the finalizer which is being removed.
+* `test/integration/controller/trainjob_controller_test.go`
+  * test snapshot resource is created
+  * test migration scenario: TrainJob exists without a snapshot.
+
+#### Unit tests
+
+* `pkg/controller/clustertrainingruntime_controller_test.go`
+  * test finalizer is always removed. Updates existing
+  * remove existing tests
+* `pkg/controller/trainingruntime_controller_test.go`
+  * test finalizer is always removed.
+  * remove existing tests. Not required.
+
+## Open Questions
+
+* **Are there use-cases where an update to a runtime should propagate to a TrainJob (e.g. updating a training image to address CVEs)?** Given this is currently unsupported, this could be considered out of scope.
+* **Should `TrainingRuntimeSnapshot` be immutable?** Given the resource is internal and should only be edited by the controller, this may be unnecessary.
+* **Should `TrainingRuntimeSnapshot` have a finalizer to prevent deletion while the TrainJob exists?** Similarly, given the resource is internal, this may be unnecessary.
+
+## Alternatives considered
+
+### Alternative 1: store runtime configuration in the train job status
+
+Store the runtime configuration snapshot in a new field on the train job status, e.g. `status.runtimeConfiguration`. This pattern is used in other projects, e.g. [Tekton PipelineRuns](https://github.com/tektoncd/pipeline/blob/v1.11.0/pkg/apis/pipeline/v1/pipelinerun_types.go#L535-L539).
+
+**Pros**
+- Avoids introducing new custom resource API
+- Avoids creating an additional resource per TrainJob.
+
+**Cons**
+- Adds bloat to the TrainJob status
+- Makes TrainJob less readable: mixes observed state with configuration snapshots
+
+### Alternative 2: store multiple versions of runtimes
+
+Introduce a version control mechanism for runtimes. Runtime versions are immutable, and changes to a runtime trigger a new runtime "version" to be created. TrainJobs keep track of which runtime version they use, e.g. through a status field. TrainJob reconciliation uses the configuration of the version. The control plane is responsible for garbage collecting runtime versions that are no longer referenced by a TrainJob.
+
+**Pros**
+- Avoids creating an additional resource per TrainJob.
+- Adds a minimal additional config to the TrainJob status or to etcd.
+
+**Cons**
+- Significant extra complexity to correctly manage the runtime version lifecycle.
+
+
+### Alternative 3: immutable runtimes
+
+Make runtimes immutable via webhook and CRD annotations.
+
+**Pros**
+- Prevents incompatible changes to runtimes.
+
+**Cons**
+- Adds friction to platform admins for maintaining and updating runtimes.
+- Proliferation of similar runtimes that differ only in minor version details.

From 52ca16c21c7493eb19144d5ae14364272db7ddd0 Mon Sep 17 00:00:00 2001
From: Rob Bell
Date: Tue, 14 Apr 2026 08:49:20 +0100
Subject: [PATCH 2/5] feat(docs): improve KEP-2599 clarity and completeness

Improvements to the mutable runtimes KEP:
- Clarify operational problems with finalizers (orphaned finalizers, namespace deletion issues)
- Better explain why runtime updates are risky (fetched on every reconciliation, no design-level guarantees)
- Add concrete examples of runtime proliferation (pytorch-2.0, pytorch-2.1)
- Add RBAC section showing required permissions
- Improve summary to clearly state API change (new CRD)
- Clarify test cases for migration scenarios

Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Rob Bell
---
 .../proposals/2599-mutable-runtimes/README.md | 33 +++++++++++++++----
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/docs/proposals/2599-mutable-runtimes/README.md b/docs/proposals/2599-mutable-runtimes/README.md
index 08876e0404..420ef057e7 100644
--- a/docs/proposals/2599-mutable-runtimes/README.md
+++ b/docs/proposals/2599-mutable-runtimes/README.md
@@ -7,7 +7,9 @@ Assisted by Claude Code (Sonnet 4.5)

 ## Summary

-This document proposes a new Custom Resource to allow Cluster Training Runtimes and Training Runtimes to be fully mutable.
+This document proposes a design to allow Cluster Training Runtimes and Training Runtimes to be fully mutable.
+
+This KEP introduces a new `TrainingRuntimeSnapshot` CRD for containing a point-in-time snapshot of the runtime configuration. TrainJobs create a snapshot of their runtime configuration on first reconciliation, decoupling job execution from runtime changes.

 ## Motivation

@@ -15,8 +17,8 @@ The `TrainingRuntime` and `ClusterTrainingRuntime` APIs were designed as bluepri

 Currently, runtimes are protected by finalizers to prevent deletion while in use by TrainJobs. This design creates operational friction:

-- **Inability to Update Runtimes:** Platform admins cannot update runtimes (e.g., newer PyTorch versions, security patches) without either risking breaking existing TrainJobs, or creating new runtime objects leading to proliferation of similar runtimes that differ only in minor version details.
-- **Finalizers block deletion:** admins must ensure that all TrainJobs are removed before the control plane is uninstalled to ensure runtime finalizers are removed. Failure to do this can cause operational problems, e.g. unable to re-install the operator; namespaces stuck in "deleting" status.
+- **No safe way to update runtimes:** Updating a runtime affects all TrainJobs referencing it because runtimes are fetched during each reconciliation. While implementation-level protections exist (e.g., existing JobSets aren't modified), there are no design-level immutability guarantees. The safest practice becomes creating new runtime objects for each update (e.g., `pytorch-2.0`, `pytorch-2.1`, `pytorch-2.1.1`), leading to a proliferation of nearly-identical runtimes that confuses users.
+- **Finalizers block uninstallation:** If the Trainer controller is uninstalled before all TrainJobs are removed, runtime finalizers become orphaned and cannot be removed. This prevents namespace deletion (stuck in "Terminating") and complicates controller reinstallation.

 While the original [Trainer v2 design](../2170-kubeflow-trainer-v2#the-training-runtime-api) anticipated runtimes being immutable with version control, a versioning mechanism has not yet been implemented. Instead, the current finalizer-based approach prevents deletion of runtimes that are referenced but does not provide the intended benefits of versioning.

@@ -44,7 +46,7 @@ As a maintainer of Kubeflow Trainer, I want to be able to update or delete the d

 ## Design details

-We propose making the TrainJob only lookup the runtime configuration at creation and instead store a "snapshot" of the runtime configuration in a separate object:
+We propose making the TrainJob only lookup the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:

 * create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
 * when a train job is reconciled, the controller first tries to fetch the `TrainingRuntimeSnapshot` for the job. If the snapshot does not exist, it looks up the `(Cluster)TrainingRuntime` referenced by the train job and creates a new `TrainingRuntimeSnapshot` resource. The snapshot resource has the same name and namespace as the `TrainJob`, and the same spec copied from the referenced `(Cluster)TrainingRuntime`.
@@ -53,9 +55,11 @@ We propose making the TrainJob only lookup the runtime configuration at creation
 * the `TrainingRuntimeSnapshot` is automatically deleted when the train job is deleted using an `ownerReference` on the snapshot.

 Additional changes:
-* Remove the finalizer on the runtimes. It is no longer necessary as TrainJobs only need to reference runtimes on creation.
+* Remove the finalizer on the runtimes. It is no longer necessary as TrainJobs only need to reference runtimes on first reconciliation.
 * Remove the `TrainJobWatcher` interface and associated implementations and boilerplate.

+### API and RBAC changes
+
 The `TrainingRuntimeSnapshot` will have the following API:

 ```go
@@ -74,6 +78,23 @@ type TrainingRuntimeSnapshot struct {
 }
 ```

+The following additional RBAC permissions will be granted:
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+rules:
+- apiGroups:
+  - trainer.kubeflow.org
+  resources:
+  - trainingruntimesnapshots
+  verbs:
+  - get
+  - list
+  - patch
+  - update
+  - watch
+```
+
 ### Migrations for existing resources

 **TrainJobs**: existing TrainJobs are automatically migrated on first reconciliation after upgrade. The controller creates a `TrainingRuntimeSnapshot` for each non-finished TrainJob by copying the current state of its referenced runtime.
@@ -100,7 +121,7 @@ If a runtime was modified after a TrainJob was created but before the upgrade, t
   * remove test file. Only contains tests relating to the finalizer which is being removed.
 * `test/integration/controller/trainjob_controller_test.go`
   * test snapshot resource is created
-  * test migration scenario: TrainJob exists without a snapshot.
+  * test migration scenario: existing TrainJob without snapshot migrates successfully (snapshot created and used for reconciliation)

 #### Unit tests


From 04003aba1faeb08257c9d3d24be0b3d9d2ba6a37 Mon Sep 17 00:00:00 2001
From: Rob Bell
Date: Tue, 14 Apr 2026 10:18:54 +0100
Subject: [PATCH 3/5] docs: address Copilot review comments on KEP-2599

- Fix relative link to include README.md in Trainer v2 design reference
- Remove extra space in Goals heading
- Change 'Training Job' to 'TrainJobs' for consistency
- Fix 'lookup' to 'look up' (correct verb form)
- Add 'create' verb to RBAC permissions for TrainingRuntimeSnapshot

Signed-off-by: Rob Bell
---
 docs/proposals/2599-mutable-runtimes/README.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/proposals/2599-mutable-runtimes/README.md b/docs/proposals/2599-mutable-runtimes/README.md
index 420ef057e7..29691626f1 100644
--- a/docs/proposals/2599-mutable-runtimes/README.md
+++ b/docs/proposals/2599-mutable-runtimes/README.md
@@ -20,11 +20,11 @@ Currently, runtimes are protected by finalizers to prevent deletion while in use
 - **No safe way to update runtimes:** Updating a runtime affects all TrainJobs referencing it because runtimes are fetched during each reconciliation. While implementation-level protections exist (e.g., existing JobSets aren't modified), there are no design-level immutability guarantees. The safest practice becomes creating new runtime objects for each update (e.g., `pytorch-2.0`, `pytorch-2.1`, `pytorch-2.1.1`), leading to a proliferation of nearly-identical runtimes that confuses users.
 - **Finalizers block uninstallation:** If the Trainer controller is uninstalled before all TrainJobs are removed, runtime finalizers become orphaned and cannot be removed. This prevents namespace deletion (stuck in "Terminating") and complicates controller reinstallation.

-While the original [Trainer v2 design](../2170-kubeflow-trainer-v2#the-training-runtime-api) anticipated runtimes being immutable with version control, a versioning mechanism has not yet been implemented. Instead, the current finalizer-based approach prevents deletion of runtimes that are referenced but does not provide the intended benefits of versioning.
+While the original [Trainer v2 design](../2170-kubeflow-trainer-v2/README.md#the-training-runtime-api) anticipated runtimes being immutable with version control, a versioning mechanism has not yet been implemented. Instead, the current finalizer-based approach prevents deletion of runtimes that are referenced but does not provide the intended benefits of versioning.

-### Goals
+### Goals

-* **Mutable Runtimes**: users and platform admins are able to update, add or remove `TrainingRuntimes` and `ClusterTrainingRuntimes` without impacting existing running or paused Training Job.
+* **Mutable Runtimes**: users and platform admins are able to update, add or remove `TrainingRuntimes` and `ClusterTrainingRuntimes` without impacting existing running or paused `TrainJobs`.
 * **Self-contained TrainJob**: once a TrainJob is created, its configuration is entirely self-contained. It only depends on itself or on resources it has created and owns. It does not depend on any external resources.
 * **Remove finalizer on runtimes**: `TrainingRuntimes` and `ClusterTrainingRuntimes` should no longer need a finalizer.

@@ -46,7 +46,7 @@ As a maintainer of Kubeflow Trainer, I want to be able to update or delete the d

 ## Design details

-We propose making the TrainJob only lookup the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:
+We propose making the TrainJob only look up the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:

 * create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
 * when a train job is reconciled, the controller first tries to fetch the `TrainingRuntimeSnapshot` for the job. If the snapshot does not exist, it looks up the `(Cluster)TrainingRuntime` referenced by the train job and creates a new `TrainingRuntimeSnapshot` resource. The snapshot resource has the same name and namespace as the `TrainJob`, and the same spec copied from the referenced `(Cluster)TrainingRuntime`.
@@ -88,6 +88,7 @@ rules:
   resources:
   - trainingruntimesnapshots
   verbs:
+  - create
   - get
   - list
   - patch

From 43db85b5da8490b7cb640f0132cc2930a9596ce7 Mon Sep 17 00:00:00 2001
From: Rob Bell
Date: Wed, 15 Apr 2026 13:08:29 +0100
Subject: [PATCH 4/5] docs: clarify KEP-2599 divergence from original design

Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Rob Bell
---
 docs/proposals/2599-mutable-runtimes/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/proposals/2599-mutable-runtimes/README.md b/docs/proposals/2599-mutable-runtimes/README.md
index 29691626f1..9cf4e4ae25 100644
--- a/docs/proposals/2599-mutable-runtimes/README.md
+++ b/docs/proposals/2599-mutable-runtimes/README.md
@@ -7,9 +7,9 @@ Assisted by Claude Code (Sonnet 4.5)

 ## Summary

-This document proposes a design to allow Cluster Training Runtimes and Training Runtimes to be fully mutable.
+This KEP proposes making Training Runtimes and Cluster Training Runtimes fully mutable by introducing a `TrainingRuntimeSnapshot` CRD. TrainJobs create a snapshot of their runtime configuration on first reconciliation, ensuring their behaviour remains unchanged regardless of subsequent runtime modifications.

-This KEP introduces a new `TrainingRuntimeSnapshot` CRD for containing a point-in-time snapshot of the runtime configuration. TrainJobs create a snapshot of their runtime configuration on first reconciliation, decoupling job execution from runtime changes.
+**Note:** This diverges from the [Trainer v2 design](../2170-kubeflow-trainer-v2/README.md#the-training-runtime-api), which originally proposed making runtimes immutable with version control (see also [#2599](https://github.com/kubeflow/trainer/issues/2599)). Based on operational experience, this KEP takes an alternative approach that eliminates the friction enforced immutability creates for platform administrators.

 ## Motivation


From caae12a65d605d22b3fcfadf6ae03c6861d936e7 Mon Sep 17 00:00:00 2001
From: Rob Bell
Date: Tue, 21 Apr 2026 16:39:46 +0100
Subject: [PATCH 5/5] docs: add source runtime annotation to KEP-2599

Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Rob Bell
---
 docs/proposals/2599-mutable-runtimes/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/proposals/2599-mutable-runtimes/README.md b/docs/proposals/2599-mutable-runtimes/README.md
index 9cf4e4ae25..327301ef43 100644
--- a/docs/proposals/2599-mutable-runtimes/README.md
+++ b/docs/proposals/2599-mutable-runtimes/README.md
@@ -48,7 +48,7 @@ As a maintainer of Kubeflow Trainer, I want to be able to update or delete the d

 We propose making the TrainJob only look up the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:

-* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
+* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`. The `TrainingRuntimeSnapshot` would be annotated for the source runtime it was copied from to aid debugging, e.g. `trainer.kubeflow.org/source-runtime: ClusterTrainingRuntime/` or `trainer.kubeflow.org/source-runtime: TrainingRuntime//`.
 * when a train job is reconciled, the controller first tries to fetch the `TrainingRuntimeSnapshot` for the job. If the snapshot does not exist, it looks up the `(Cluster)TrainingRuntime` referenced by the train job and creates a new `TrainingRuntimeSnapshot` resource. The snapshot resource has the same name and namespace as the `TrainJob`, and the same spec copied from the referenced `(Cluster)TrainingRuntime`.
 * the `TrainJob` reconciliation logic gets the runtime configuration from the snapshot rather than the `(Cluster)TrainingRuntime`.
 * the reconciliation is otherwise unchanged.
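
Reviewer note: the get-or-create flow in the "Design details" bullets can be sketched as below. This is a minimal illustration and not the proposed controller code. It uses only the Go standard library, and `Store`, `RuntimeSpec`, `TrainJob`, `Snapshot`, `SnapshotFor`, and `runtimeKey` are hypothetical stand-ins for the API server, `TrainingRuntimeSpec`, and the real resources. The source-runtime annotation is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// RuntimeSpec is a hypothetical stand-in for TrainingRuntimeSpec, reduced to
// one field so the example stays short.
type RuntimeSpec struct {
	Image string
}

// TrainJob carries only the fields this sketch needs.
type TrainJob struct {
	Name, Namespace string
	RuntimeKind     string // "ClusterTrainingRuntime" or "TrainingRuntime"
	RuntimeName     string
}

// Snapshot models a TrainingRuntimeSnapshot: same name and namespace as its
// TrainJob, plus a copy of the runtime spec taken at first reconciliation.
type Snapshot struct {
	Name, Namespace string
	OwnerTrainJob   string // stands in for the ownerReference to the TrainJob
	Spec            RuntimeSpec
}

// Store is an in-memory stand-in for the API server (an assumption of this sketch).
type Store struct {
	Runtimes  map[string]RuntimeSpec
	Snapshots map[string]*Snapshot
}

var ErrRuntimeNotFound = errors.New("referenced runtime not found")

// runtimeKey builds the lookup key; cluster-scoped runtimes have no namespace.
func runtimeKey(job TrainJob) string {
	if job.RuntimeKind == "ClusterTrainingRuntime" {
		return job.RuntimeKind + "/" + job.RuntimeName
	}
	return job.RuntimeKind + "/" + job.Namespace + "/" + job.RuntimeName
}

// SnapshotFor returns the job's snapshot, creating it from the referenced
// runtime only when no snapshot exists yet.
func (s *Store) SnapshotFor(job TrainJob) (*Snapshot, error) {
	key := job.Namespace + "/" + job.Name
	if snap, ok := s.Snapshots[key]; ok {
		return snap, nil // the runtime is not consulted once a snapshot exists
	}
	spec, ok := s.Runtimes[runtimeKey(job)]
	if !ok {
		return nil, fmt.Errorf("%w: %s", ErrRuntimeNotFound, runtimeKey(job))
	}
	snap := &Snapshot{
		Name:          job.Name,
		Namespace:     job.Namespace,
		OwnerTrainJob: job.Name,
		Spec:          spec, // value copy here; a real spec would need a deep copy
	}
	s.Snapshots[key] = snap
	return snap, nil
}

func main() {
	job := TrainJob{Name: "job-a", Namespace: "team-a", RuntimeKind: "ClusterTrainingRuntime", RuntimeName: "torch"}
	store := &Store{
		Runtimes:  map[string]RuntimeSpec{runtimeKey(job): {Image: "example/torch:v1"}},
		Snapshots: map[string]*Snapshot{},
	}

	first, _ := store.SnapshotFor(job) // first reconciliation creates the snapshot
	// A platform admin updates the runtime after the job has been reconciled.
	store.Runtimes[runtimeKey(job)] = RuntimeSpec{Image: "example/torch:v2"}
	second, _ := store.SnapshotFor(job) // later reconciliation reuses the snapshot

	fmt.Println(first.Spec.Image, second.Spec.Image)
}
```

In this sketch the second call still returns the `v1` image, because the runtime is read only when no snapshot exists. The same path covers the migration case in the KEP, where an existing TrainJob without a snapshot gets one on its first reconciliation after upgrade.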