Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6)#614
Open
danbar2 wants to merge 6 commits into
Open
Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6)#614danbar2 wants to merge 6 commits into
danbar2 wants to merge 6 commits into
Conversation
df46feb to
a437f75
Compare
Contributor
|
@danbar2 I'd like to know if the new decision of breaching even when |
a437f75 to
c034d8d
Compare
c034d8d to
28c1491
Compare
…rmula Drops the scheduled==0 suppression from both PCLQ and PCSG status reconcilers. The previous suppression (added with the original gang- termination fix to avoid a churn loop) made it impossible to recover when an entire PodClique lost all its scheduled pods — gang termination would never fire even after the workload had been healthy. The suppression is replaced by relying on the existing TerminationDelay (default 4h) as the natural grace window: a workload that stays below MinAvailable past TerminationDelay is genuinely stuck and recycling it gives the scheduler a fresh PodGang against the current cluster state. Reworks the PCSG-level MinAvailableBreached formula to compute breach as `existingReplicas - breachedReplicas < MinAvailable` rather than `scheduledReplicas - breachedReplicas < MinAvailable`. This matches the definition the spec implies: a PCSG replica is in breach when at least one of its PodCliques is in breach, and the PCSG itself is in breach only when the count of replicas not-in-breach falls below MinAvailable. The previous formula double-counted: a PCSG replica that had failed to schedule was excluded from scheduledReplicas AND counted again in breachedReplicas, flipping the PCSG into breach unnecessarily and causing PCS-level gang termination to fire when only PCSG-replica- scoped restart was warranted. Unit tests cover the four corner cases: all healthy, partial breach within MinAvailable, all replicas breached, and the update-in-progress precedence over the breach branch. Test helpers are tightened so WithPCLQNotScheduled also sets MinAvailableBreached=True, matching the always-breach reality. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the four scenarios from the gang-termination test plan, all
passing locally against the operator with the always-breach + per-
replica PCSG formula:
GT-1 WL1 full-replicas, PCS-owned PCLQ breached pass
GT-2 WL1 full-replicas, PCSG-owned PCLQ breached pass
GT-3 WL2 min-replicas, PCS-owned PCLQ (incremental) pass
GT-4 WL2 min-replicas, PCSG-owned PCLQ (incremental) pass
Each test cordons the node hosting a target ready pod and deletes that
pod, waits for terminationDelay + grace, then asserts the expected
outcome via pod-UID snapshots. GT-3 step 5 and GT-4 step 6 both kill
the last remaining ready pod of a PCLQ — these exercise the
always-breach path that the previous suppression had blocked. GT-4
step 4 is the PCSG-replica-scoped restart that the previous PCSG
formula was over-eagerly escalating to PCS-level.
workload1-gt.yaml and workload2-gt.yaml are dedicated copies of the
existing workload{1,2}.yaml with terminationDelay set to 10s so the
gang-termination flow has a chance to fire within a reasonable test
window, and with -gt suffixed names so the existing gang-scheduling
tests are unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
28c1491 to
45e134a
Compare
df0eaa9 to
6111aaa
Compare
shayasoolin
reviewed
May 28, 2026
shayasoolin
reviewed
May 28, 2026
GT-5 (Individual PCSG replica termination): WL5 (WL2 shape but pc-c has min=3, no tolerance). Killing all 3 pods of one PCSG replica's pc-c breaches that replica's MinAvailable; under the per-replica PCSG formula, the PCSG itself stays healthy (one replica still not in breach satisfies MinAvailable=1) so only the PCSG-scoped restart fires and the rest of the workload keeps its UIDs. GT-6 (Scaled-PodGang pod deletion): WL2 (pc-c min=1 replicas=3 has 2 pods of tolerance). Picks a pc-c pod that carries the grove.io/base-podgang label (i.e. lives outside the base PodGang), cordons its node and deletes it. MinAvailable is still satisfied across pc-c so no breach fires; the standalone pc-a PCLQ's UIDs must be intact (no PCS-level gang termination). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the gang_scheduling entry in both the e2e matrix (real run against prod-grove-e2e-v1) and the e2e-skip matrix (passing stub for PRs that don't touch operator/ or .github/). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MinAvailableBreached condition correctly stays True whenever
scheduledReplicas < MinAvailable, including the initial-startup state
where pods have never been admitted. Without this gate, that condition
would feed straight into gangterminate.go, which would delete and
recreate the PCS replica every TerminationDelay — a 4-hour churn loop
on a workload the cluster simply cannot schedule. Grove should be
silent in that state and let the scheduler keep polling its pending
pods; recycling makes sense only after a workload has been healthy and
then regressed.
Adds a small wasScheduled signal derived from the existing conditions:
WasPCLQEverScheduled reads PodCliqueScheduled — True if currently
True, or False with LastTransitionTime
sufficiently after CreationTimestamp (the
condition must have flipped via True since
creation).
WasPCSGEverHealthy same shape, reading the PCSG's own
MinAvailableBreached (False == healthy at
some point).
InitialScheduleGrace (5s) absorbs the gap between the apiserver
setting CreationTimestamp and the first reconcile flipping the
condition False.
The gate lives in the helpers GetMinAvailableBreachedPCLQInfo
(consumed by both gangterminate.go and the PCSG-replica-scoped
restart in podcliquescalinggroup/components/podclique/sync.go) and
getMinAvailableBreachedPCSGInfo. The condition itself is unchanged —
operators still see MinAvailableBreached=True for never-healthy
workloads; only the deletion action is suppressed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6111aaa to
d97fd16
Compare
The wasHealthy gate (added in 45e134a) prevents gang termination on never-healthy workloads, but a recycled PCSG keeps MinAvailableBreached=True across the PCS-replica DeleteAllOf because the condition lives on the PCSG, not on the deleted-and-recreated PCLQs. WasPCSGEverHealthy returns true forever after the first regression, so the next TerminationDelay fires again, and so on — a 4-hour churn loop on a workload the cluster can't re-stabilize. Adds a GangTerminationInProgress=True / Reason=RecycleInFlight condition that is set on every PCSG in a PCS replica immediately after createPCSReplicaDeleteTask's DeleteAllOf succeeds, and cleared by the PCSG status reconciler on the next MinAvailableBreached True→False transition (the natural "workload recovered" signal). While the flag is True, getMinAvailableBreachedPCSGInfo skips the PCSG, suppressing further fires until the recycle resolves. Ordering is action-first / flag-second: if the controller crashes between DeleteAllOf and the flag write, the next reconcile sees the breach still True (new PCLQs Pending) with no flag set, fires once more (one extra churn), and converges. The two gates compose: - WasPCSGEverHealthy suppresses fires before the workload has ever stabilized (initial scheduling latency). - GangTerminationInProgress suppresses fires while a recycle is still in flight (post-fire convergence). E2E: GT-1..GT-6 all pass (32–120s each, 347s total — comparable to the pre-flag baseline of 353s; faster than 485s baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d97fd16 to
f004a2e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #561 (which fixed the initial cordon-then-kill regression) — picks up the open
questions that PR left behind. Refs #277.
Summary
Two related changes, one commit each:
c8820c9Rework MinAvailableBreached semantics. Drops thescheduledReplicas == 0suppression that Fix gang termination when scheduled replicas regress below MinAvailable #561 introduced as a churn-loop guard, and rewrites the PCSG-level
breach formula to match the spec-implied definition.
7f151af+294882bAdd gang-termination E2E scenarios GT-1 through GT-6.Operator changes (commit
c8820c9)Drop the scheduled==0 suppression
The previous suppression made it impossible to recover when a PodClique lost every
scheduled pod — gang termination never fired even after the workload had once been
healthy. The new rule:
scheduledReplicas < MinAvailablealways breaches.The natural debounce is
TerminationDelay(default 4h). During a normal startup thebreach flickers True briefly and resolves; a workload still below MinAvailable past
TerminationDelayis genuinely stuck and recycling it gives the scheduler a freshPodGang against the current cluster state.
Rework PCSG breach formula
Replaces `availableReplicas = scheduledReplicas - breachedReplicas` with
`notInBreachReplicas = existingReplicas - breachedReplicas`. The previous formula
double-counted: a PCSG replica that failed to schedule was excluded from
`scheduledReplicas` AND counted again in `breachedReplicas`, flipping the PCSG into
breach unnecessarily and causing PCS-level gang termination to fire when only the
PCSG-replica-scoped restart was warranted.
Matches the canonical definition:
replica's PCLQs).
Unit tests rewritten around the new formula; sanity-pin cases for "1 of 2 replicas
breached with MinAvailable=1 — not in breach (PCSG-scoped restart only)" and "both
replicas breached — escalates to PCS-level" capture the GT-4 / GT-5 expectations.
E2E scenarios (commits
7f151af+294882b)All six pass locally against an operator running with this PR's changes.
Test plan
Out of scope (for a follow-up)
The original test plan also lists GT-7/9/10/11 (crash-loop variants), GT-12 (readiness
probe failure), GT-13 (no gang term during rolling update), and GT-14 (recovery before
TerminationDelay). These are not in this PR — each needs new test infrastructure that
doesn't exist in `operator/e2e` today:
workload variants with a toggleable liveness/readiness probe keyed on a file in an
emptyDir volume.
update mid-run.
different enough from GT-1..6 that I'd rather pull it in once the infra above lands.
🤖 Generated with Claude Code