Skip to content

[release-2.2] fix(api): allow atomic RuntimePatches update on unsuspend#3489

Merged
google-oss-prow[bot] merged 11 commits intokubeflow:release-2.2from
tenzen-y:cherry-pick-3469-to-release-2.2
May 5, 2026
Merged

[release-2.2] fix(api): allow atomic RuntimePatches update on unsuspend#3489
google-oss-prow[bot] merged 11 commits intokubeflow:release-2.2from
tenzen-y:cherry-pick-3469-to-release-2.2

Conversation

@tenzen-y
Copy link
Copy Markdown
Member

@tenzen-y tenzen-y commented May 5, 2026

This is an automated cherry-pick of #3469

/assign tenzen-y

TrainJob webhook validation now permits updating `spec.runtimePatches` while transitioning from `suspend=true` to `suspend=false` in a single API request. The previously required two-step update (modify runtimePatches, then unsuspend) still works.

NarayanaSabari and others added 11 commits May 5, 2026 04:53
Change checkRuntimePatchesImmutability to check the *old* TrainJob's
suspend state instead of the new one. This lets external controllers
(notably Kueue, see kubernetes-sigs/kueue#8296) modify spec.runtimePatches
and set spec.suspend=false in a single API request, eliminating the
two-step update workaround and the race window between the calls.

Mirrors the Kubernetes core Job validation pattern:
https://github.com/kubernetes/kubernetes/blob/86b66f6f333a/pkg/registry/batch/job/strategy.go#L191

The downstream JobSet activity check (no active replicatedJobs) still
runs, so updates while pods are running remain forbidden.

Refs: kubeflow#3043
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Add a unit test that exercises the new atomic-update path: a single
TrainJob update that both clears spec.suspend and modifies
spec.runtimePatches must succeed when the previous state was suspended.

Refs: kubeflow#3043
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Extend the trainjob webhook update table with an entry that creates a
suspended TrainJob with RuntimePatches and then, in a single Update,
both unsuspends it and modifies the runtimePatches NodeSelector. This
exercises the relaxed validation against a real apiserver and protects
the suspended -> unsuspended transition path used by Kueue.

Refs: kubeflow#3043
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
…pended

Per @mimowo's review, the previous condition (!oldSuspended) over-blocked
the (oldSuspended=false, newSuspended=true) transition — i.e. suspending
a running TrainJob while modifying spec.runtimePatches in the same call,
which is safe because the JobSet activity check still guards against
concurrent pod runs.

Change the predicate to !oldSuspended && !newSuspended so the webhook
only forbids RuntimePatches mutations when the TrainJob is staying fully
unsuspended. Reword the error message accordingly so callers see
explicitly that suspension at either side of the update is acceptable.

Refs: kubeflow#3043, kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @mimowo's request, complete the transition table coverage:

- (false, false) -> blocked   (existing test)
- (true,  true)  -> allowed   (existing test)
- (true,  false) -> allowed   (added previously)
- (false, true)  -> allowed   (added here)

Mirrors the existing atomic-unsuspend cases — one unit case in
jobset_test.go and one webhook integration entry in trainjob_test.go.

Refs: kubeflow#3043, kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Trim the comment block above checkRuntimePatchesImmutability to a 4-line
summary, per @kaisoz's review feedback (the comment was longer than the
change itself). Drops the kueue#8296 backreference and the k/k Job
validation link — both are already in the PR description and commit
history.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @tenzen-y's review nit, swap the ptr.Equal(x, ptr.To(true)) idiom
for an inline nil-check + dereference. Equivalent semantics, no extra
generic-comparison hop, and stays in the stdlib for these two lines.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @tenzen-y's request, add a unit case where the old TrainJob is
unsuspended, the new one is being suspended with a runtimePatches
change, and the underlying JobSet still has an active replicatedJob.
The relaxed suspend predicate alone would let this through; the
JobSet activity check should still reject it.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Per @tenzen-y's review, the two atomic-transition ginkgo entries
(unsuspend + RuntimePatches, suspend + RuntimePatches) duplicate the
unit cases in jobset_test.go that already exercise the same validator
path. The pre-existing "Should succeed to update runtimePatches when
suspend is true" entry stays as the integration smoke test for the
webhook chain.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
@tenzen-y is right that the value-form `new()` builtin is valid Go 1.26,
which the repo already targets (`go 1.26.0` in go.mod). Apply his
original suggestion verbatim — keeps `ptr.Equal` for the two-pointer
comparison and replaces only `ptr.To(true)`, which is the smaller and
more idiomatic delta vs the inline nil-check + deref pattern I tried
in the previous commit.

Refs: kubeflow#3469 (review feedback)
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@google-oss-prow google-oss-prow Bot added this to the v2.2 milestone May 5, 2026
@tenzen-y
Copy link
Copy Markdown
Member Author

tenzen-y commented May 5, 2026

I manually opened this cherry pick PR due to #3488 (comment)

/assign @andreyvelich @astefanutti

Copy link
Copy Markdown
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tenzen-y!

/lgtm
/approve

@google-oss-prow google-oss-prow Bot added the lgtm label May 5, 2026
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit da938b7 into kubeflow:release-2.2 May 5, 2026
43 of 48 checks passed
@tenzen-y tenzen-y deleted the cherry-pick-3469-to-release-2.2 branch May 6, 2026 03:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants