# KEP-2599: Mutable Runtimes

## Authors
- Rob Bell (Red Hat)

Assisted by Claude Code (Sonnet 4.5)

## Summary

This KEP proposes making Training Runtimes and Cluster Training Runtimes fully mutable by introducing a `TrainingRuntimeSnapshot` CRD. TrainJobs create a snapshot of their runtime configuration on first reconciliation, ensuring their behaviour remains unchanged regardless of subsequent runtime modifications.

**Note:** This diverges from the [Trainer v2 design](../2170-kubeflow-trainer-v2/README.md#the-training-runtime-api), which originally proposed making runtimes immutable with version control (see also [#2599](https://github.com/kubeflow/trainer/issues/2599)). Based on operational experience, this KEP takes an alternative approach that eliminates the friction that enforced immutability creates for platform administrators.

## Motivation

The `TrainingRuntime` and `ClusterTrainingRuntime` APIs were designed as blueprints for model training. A `TrainJob` references a runtime and uses its configuration during reconciliation.

Currently, runtimes are protected by finalizers to prevent deletion while in use by TrainJobs. This design creates operational friction:

- **No safe way to update runtimes:** Updating a runtime affects all TrainJobs referencing it because runtimes are fetched during each reconciliation. While implementation-level protections exist (e.g., existing JobSets aren't modified), there are no design-level immutability guarantees. The safest practice becomes creating new runtime objects for each update (e.g., `pytorch-2.0`, `pytorch-2.1`, `pytorch-2.1.1`), leading to a proliferation of nearly-identical runtimes that confuses users.
- **Finalizers block uninstallation:** If the Trainer controller is uninstalled before all TrainJobs are removed, runtime finalizers become orphaned and cannot be removed. This prevents namespace deletion (stuck in "Terminating") and complicates controller reinstallation.

While the original [Trainer v2 design](../2170-kubeflow-trainer-v2/README.md#the-training-runtime-api) anticipated runtimes being immutable with version control, a versioning mechanism has not yet been implemented. Instead, the current finalizer-based approach prevents deletion of runtimes that are referenced but does not provide the intended benefits of versioning.

### Goals

* **Mutable Runtimes**: users and platform admins can update, add, or remove `TrainingRuntimes` and `ClusterTrainingRuntimes` without impacting existing running or paused `TrainJobs`.
* **Self-contained TrainJob**: once a TrainJob is created, its configuration is entirely self-contained: it depends only on itself and on resources it has created and owns, never on external resources.
* **Remove finalizer on runtimes**: `TrainingRuntimes` and `ClusterTrainingRuntimes` should no longer need a finalizer.

### Non-Goals

* **Mutable TrainJobs**: we are not proposing any changes to the existing immutable fields of TrainJobs. These fields will remain immutable.

## Proposal

### User Stories

#### Story 1

As a platform engineer, I want to be able to update or delete a training runtime without breaking any existing running or paused training jobs.

#### Story 2

As a maintainer of Kubeflow Trainer, I want to be able to update or delete the default training runtimes included in a Kubeflow Trainer release without introducing breaking changes for users.

## Design details

We propose that a `TrainJob` look up the runtime configuration only on its first reconciliation and store a "snapshot" of that configuration in a separate object:

* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`. The `TrainingRuntimeSnapshot` would be annotated with the source runtime it was copied from to aid debugging, e.g. `trainer.kubeflow.org/source-runtime: ClusterTrainingRuntime/<runtime-name>` or `trainer.kubeflow.org/source-runtime: TrainingRuntime/<runtime-namespace>/<runtime-name>`.
* when a `TrainJob` is reconciled, the controller first tries to fetch the `TrainingRuntimeSnapshot` for the job. If the snapshot does not exist, it looks up the `(Cluster)TrainingRuntime` referenced by the `TrainJob` and creates a new `TrainingRuntimeSnapshot` resource with the same name and namespace as the `TrainJob` and the spec copied from the referenced `(Cluster)TrainingRuntime` (see the sketch after this list).
* the `TrainJob` reconciliation logic gets the runtime configuration from the snapshot rather than the `(Cluster)TrainingRuntime`.
* the reconciliation is otherwise unchanged.
* the `TrainingRuntimeSnapshot` carries an `ownerReference` to its `TrainJob`, so it is automatically garbage-collected when the `TrainJob` is deleted.
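
A minimal sketch of the lookup-or-create flow described above, assuming a controller-runtime reconciler. The `TrainJobReconciler` type, the embedded client and `Scheme`, the `v1alpha1` package alias, and the `resolveRuntimeSpec` helper are illustrative assumptions, not the final implementation:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// getOrCreateSnapshot returns the TrainingRuntimeSnapshot for a TrainJob,
// creating it from the referenced runtime on first reconciliation only.
func (r *TrainJobReconciler) getOrCreateSnapshot(ctx context.Context, job *v1alpha1.TrainJob) (*v1alpha1.TrainingRuntimeSnapshot, error) {
	snapshot := &v1alpha1.TrainingRuntimeSnapshot{}
	key := ctrlclient.ObjectKey{Namespace: job.Namespace, Name: job.Name}
	if err := r.Get(ctx, key, snapshot); err == nil {
		return snapshot, nil // snapshot exists; the runtime is never consulted again
	} else if !apierrors.IsNotFound(err) {
		return nil, err
	}

	// First reconciliation: copy the spec from the referenced (Cluster)TrainingRuntime.
	// resolveRuntimeSpec is a hypothetical helper that fetches the runtime and renders
	// the source string, e.g. "ClusterTrainingRuntime/pytorch-default".
	spec, source, err := r.resolveRuntimeSpec(ctx, job)
	if err != nil {
		return nil, err
	}
	snapshot = &v1alpha1.TrainingRuntimeSnapshot{
		ObjectMeta: metav1.ObjectMeta{
			Namespace:   job.Namespace,
			Name:        job.Name,
			Annotations: map[string]string{"trainer.kubeflow.org/source-runtime": source},
		},
		Spec: *spec,
	}
	// The controller ownerReference makes the API server garbage-collect the
	// snapshot when its TrainJob is deleted.
	if err := controllerutil.SetControllerReference(job, snapshot, r.Scheme); err != nil {
		return nil, err
	}
	if err := r.Create(ctx, snapshot); err != nil {
		return nil, err
	}
	return snapshot, nil
}
```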

Additional changes:
* Remove the finalizer on the runtimes. It is no longer necessary as TrainJobs only need to reference runtimes on first reconciliation.
* Remove the `TrainJobWatcher` interface and associated implementations and boilerplate.

### API and RBAC changes

The `TrainingRuntimeSnapshot` will have the following API:
```go
// TrainingRuntimeSnapshot contains a point-in-time snapshot of a TrainingRuntime or ClusterTrainingRuntime as it was
// observed when a TrainJob was first reconciled.
type TrainingRuntimeSnapshot struct {
	metav1.TypeMeta `json:",inline"`

	// metadata of the TrainingRuntimeSnapshot.
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// spec of the TrainingRuntimeSnapshot.
	// +optional
	Spec TrainingRuntimeSpec `json:"spec,omitempty,omitzero"`
}
```

The following additional RBAC permissions will be granted:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
rules:
- apiGroups:
  - trainer.kubeflow.org
  resources:
  - trainingruntimesnapshots
  verbs:
  - create
  - get
  - list
  - patch
  - update
  - watch
```
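
Note that `delete` is not required: snapshots are removed by the API server's garbage collector via their `ownerReference`, rather than by the controller.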

### Migrations for existing resources

**TrainJobs**: existing TrainJobs are automatically migrated on first reconciliation after upgrade. The controller creates a `TrainingRuntimeSnapshot` for each non-finished TrainJob by copying the current state of its referenced runtime.

If a runtime was modified after a TrainJob was created but before the upgrade, the snapshot will capture the modified state, not the original configuration the job used. This is acceptable, however, because any incompatible changes would already have caused TrainJob reconciliation to fail before the upgrade.

**ClusterTrainingRuntimes** and **TrainingRuntimes**: all existing runtime finalizers must be removed, which the runtime controllers can do during reconciliation (see the sketch below). This finalizer-removal logic can itself be dropped in a future version, once sufficient time has passed for all clusters to migrate. Release notes should document this and warn users that they must upgrade through the migration release(s) and cannot skip directly to later versions.
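
A sketch of this transitional reconcile logic, shown for `TrainingRuntime` (the `ClusterTrainingRuntime` controller would mirror it), reusing the imports from the earlier sketch plus `ctrl "sigs.k8s.io/controller-runtime"`. The finalizer name here is a placeholder, not the actual constant:

```go
// Reconcile strips the legacy in-use finalizer from every runtime so that
// pre-upgrade objects can be deleted. Transitional logic only.
func (r *TrainingRuntimeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	runtime := &v1alpha1.TrainingRuntime{}
	if err := r.Get(ctx, req.NamespacedName, runtime); err != nil {
		return ctrl.Result{}, ctrlclient.IgnoreNotFound(err)
	}
	// RemoveFinalizer reports whether it changed the object; only then update.
	// "trainer.kubeflow.org/resource-in-use" stands in for the real finalizer name.
	if controllerutil.RemoveFinalizer(runtime, "trainer.kubeflow.org/resource-in-use") {
		if err := r.Update(ctx, runtime); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```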

**Rollback:** No explicit rollback logic is required. On rollback, the controller will reconcile using the runtime; any `TrainingRuntimeSnapshot` objects will be ignored and removed once the TrainJob is deleted.

### Test plan

#### E2E tests

* `test/e2e/e2e_test.go`
  * test that updating a `TrainingRuntime` does not affect a paused `TrainJob`: create a runtime and a `TrainJob`, allow the job to start, pause it, update the runtime, then restart the job. The `TrainJob` should continue to use the original configuration.
  * ensure existing tests still pass

#### Integration tests

* `test/integration/controller/clustertrainingruntime_controller_test.go`
  * remove the test file; it only contains tests relating to the finalizer, which is being removed.
* `test/integration/controller/trainingruntime_controller_test.go`
  * remove the test file; it only contains tests relating to the finalizer, which is being removed.
* `test/integration/controller/trainjob_controller_test.go`
  * test that the snapshot resource is created (see the sketch below)
  * test the migration scenario: an existing TrainJob without a snapshot migrates successfully (a snapshot is created and used for reconciliation)
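
A hedged sketch of the snapshot-creation assertion, in the Ginkgo/Gomega style of the existing integration tests; `ctx`, `k8sClient`, and the previously created `trainJob` and `clusterRuntime` fixtures are assumptions about the surrounding suite:

```go
ginkgo.It("creates a TrainingRuntimeSnapshot named after the TrainJob", func() {
	snapshot := &v1alpha1.TrainingRuntimeSnapshot{}
	key := ctrlclient.ObjectKey{Namespace: trainJob.Namespace, Name: trainJob.Name}
	gomega.Eventually(func(g gomega.Gomega) {
		g.Expect(k8sClient.Get(ctx, key, snapshot)).To(gomega.Succeed())
		// The spec is a copy of the referenced runtime's spec...
		g.Expect(snapshot.Spec).To(gomega.BeComparableTo(clusterRuntime.Spec))
		// ...and the source annotation records where it was copied from.
		g.Expect(snapshot.Annotations).To(gomega.HaveKeyWithValue(
			"trainer.kubeflow.org/source-runtime", "ClusterTrainingRuntime/"+clusterRuntime.Name))
	}).Should(gomega.Succeed())
})
```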

#### Unit tests

* `pkg/controller/clustertrainingruntime_controller_test.go`
  * test that the finalizer is always removed (updates the existing finalizer tests)
  * remove the remaining existing tests
* `pkg/controller/trainingruntime_controller_test.go`
  * test that the finalizer is always removed
  * remove the remaining existing tests; they are no longer required

## Open Questions

* **Are there use-cases where an update to a runtime should propagate to a TrainJob (e.g. updating a training image to address CVEs)?** Given this is currently unsupported, this could be considered out of scope.
* **Should `TrainingRuntimeSnapshot` be immutable?** Given the resource is internal and should only be edited by the controller, this may be unnecessary.
* **Should `TrainingRuntimeSnapshot` have a finalizer to prevent deletion while the TrainJob exists?** Similarly, given the resource is internal, this may be unnecessary.
* **Should the snapshot be taken on first reconciliation or only when `spec.suspend` transitions from true to false?** Snapshotting on unsuspend would give external controllers (e.g. the Kueue integration) a window to patch the TrainJob while it is suspended; this interacts with the broader questions of mutating `runtimeRef` and patching TrainJob fields, and is likely follow-up work.

## Alternatives considered

### Alternative 1: store runtime configuration in the train job status

Store the runtime configuration snapshot in a new field on the train job status, e.g. `status.runtimeConfiguration`. This pattern is used in other projects, e.g. [Tekton PipelineRuns](https://github.com/tektoncd/pipeline/blob/v1.11.0/pkg/apis/pipeline/v1/pipelinerun_types.go#L535-L539).
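
For comparison, this variant would add roughly the following field to the existing status type; the field name is illustrative and this is not what the KEP proposes:

```go
type TrainJobStatus struct {
	// ... existing status fields ...

	// runtimeConfiguration would hold a point-in-time copy of the referenced
	// runtime's spec, captured on first reconciliation. Illustrative only.
	// +optional
	RuntimeConfiguration *TrainingRuntimeSpec `json:"runtimeConfiguration,omitempty"`
}
```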

**Pros**
- Avoids introducing a new custom resource API.
- Avoids creating an additional resource per TrainJob.

**Cons**
- Adds bloat to the TrainJob status
- Makes TrainJob less readable: mixes observed state with configuration snapshots

### Alternative 2: store multiple versions of runtimes

Introduce a version control mechanism for runtimes. Runtime versions are immutable, and changes to a runtime trigger a new runtime "version" to be created. TrainJobs keep track of which runtime version they use, e.g. through a status field. TrainJob reconciliation uses the configuration of the version. The control plane is responsible for garbage collecting runtime versions that are no longer referenced by a TrainJob.

**Pros**
- Avoids creating an additional resource per TrainJob.
- Adds only minimal additional data to the TrainJob status or to etcd.

**Cons**
- Significant extra complexity to correctly manage the runtime version lifecycle.


### Alternative 3: immutable runtimes

Make runtimes immutable via webhook and CRD annotations.
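
For reference, one common way to enforce this is a CEL transition rule declared via a kubebuilder marker (a sketch only; large specs may also run into CEL validation cost limits):

```go
type TrainingRuntime struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// The transition rule rejects any update whose spec differs from the stored spec.
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="runtime spec is immutable"
	Spec TrainingRuntimeSpec `json:"spec,omitempty"`
}
```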

**Pros**
- Prevents incompatible changes to runtimes.

**Cons**
- Adds friction for platform admins maintaining and updating runtimes.
- Leads to a proliferation of similar runtimes that differ only in minor version details.