Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6) by danbar2 · Pull Request #614 · ai-dynamo/grove

danbar2 · 2026-05-14T15:21:10Z

Follow-up to #561 (which fixed the initial cordon-then-kill regression) — picks up the open
questions that PR left behind. Refs #277.

Summary

Two related changes, one commit each:

c8820c9 Rework MinAvailableBreached semantics. Drops the scheduledReplicas == 0
suppression that Fix gang termination when scheduled replicas regress below MinAvailable #561 introduced as a churn-loop guard, and rewrites the PCSG-level
breach formula to match the spec-implied definition.
7f151af + 294882b Add gang-termination E2E scenarios GT-1 through GT-6.

Operator changes (commit `c8820c9`)

Drop the scheduled==0 suppression

The previous suppression made it impossible to recover when a PodClique lost every
scheduled pod — gang termination never fired even after the workload had once been
healthy. The new rule: scheduledReplicas < MinAvailable always breaches.

The natural debounce is TerminationDelay (default 4h). During a normal startup the
breach flickers True briefly and resolves; a workload still below MinAvailable past
TerminationDelay is genuinely stuck and recycling it gives the scheduler a fresh
PodGang against the current cluster state.

Rework PCSG breach formula

Replaces `availableReplicas = scheduledReplicas - breachedReplicas` with
`notInBreachReplicas = existingReplicas - breachedReplicas`. The previous formula
double-counted: a PCSG replica that failed to schedule was excluded from
`scheduledReplicas` AND counted again in `breachedReplicas`, flipping the PCSG into
breach unnecessarily and causing PCS-level gang termination to fire when only the
PCSG-replica-scoped restart was warranted.

Matches the canonical definition:

A PCSG replica is in breach if at least one of its PodCliques is in breach.
A PCSG replica in breach > TerminationDelay → PCSG-scoped restart (just that
replica's PCLQs).
The PCSG itself is in breach when (replicas not in breach) < MinAvailable.
Standalone PCLQ in breach → PCS-level restart.
PCSG-level in breach → PCS-level restart.

Unit tests rewritten around the new formula; sanity-pin cases for "1 of 2 replicas
breached with MinAvailable=1 — not in breach (PCSG-scoped restart only)" and "both
replicas breached — escalates to PCS-level" capture the GT-4 / GT-5 expectations.

E2E scenarios (commits `7f151af` + `294882b`)

Test	Workload	What it exercises
GT-1	WL1 (full replicas)	Cordon + kill a PCS-owned pod → full-workload gang term
GT-2	WL1 (full replicas)	Cordon + kill a PCSG-owned pod → full-workload gang term
GT-3	WL2 (min replicas)	Incremental kill of pc-a; first kill no-op, second triggers term
GT-4	WL2 (min replicas)	Multi-stage PCSG-owned: PCSG-scoped restart then full term
GT-5	WL5 (pc-c min=3)	Kill all of one PCSG replica's pc-c → that PCSG replica only
GT-6	WL2	Kill a scaled-PodGang pod → no PCS-level term (pc-a survives)

All six pass locally against an operator running with this PR's changes.

Test plan

`go test ./...` clean (operator + scheduler/api + api modules)
`go vet ./...` clean
gofmt clean
E2E: `make run-e2e TEST_PATTERN=Test_GT` — 6 PASS in ~9 minutes against k3d/Kai
Recommend a CI run on `prod-grove-e2e-v1` before merge

Out of scope (for a follow-up)

The original test plan also lists GT-7/9/10/11 (crash-loop variants), GT-12 (readiness
probe failure), GT-13 (no gang term during rolling update), and GT-14 (recovery before
TerminationDelay). These are not in this PR — each needs new test infrastructure that
doesn't exist in `operator/e2e` today:

GT-7/9/10/11 + GT-12: pod-exec from Go (`client-go/tools/remotecommand` + SPDY) plus
workload variants with a toggleable liveness/readiness probe keyed on a file in an
emptyDir volume.
GT-13: a `PatchPCSTemplate` helper in `testctx` so a test can trigger a rolling
update mid-run.
GT-14 actually works against this PR's operator changes, but its assertion shape is
different enough from GT-1..6 that I'd rather pull it in once the infra above lands.

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-14T15:21:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shayasoolin · 2026-05-18T09:21:05Z

@danbar2 I'd like to know if the new decision of breaching even when scheduled == 0 only means setting the condition, or actually recreating resources like Pods and PogGangs.
Considering the simplest case where gang-scheduling doesn't take place due to insufficient resources, I'm okay with having the breach condition set, but reluctant to recreating resources indefinitely every terminationDelay seconds. Grove should be silent in such cases, only the scheduler should do this kind of polling after its pending resources.

…rmula Drops the scheduled==0 suppression from both PCLQ and PCSG status reconcilers. The previous suppression (added with the original gang- termination fix to avoid a churn loop) made it impossible to recover when an entire PodClique lost all its scheduled pods — gang termination would never fire even after the workload had been healthy. The suppression is replaced by relying on the existing TerminationDelay (default 4h) as the natural grace window: a workload that stays below MinAvailable past TerminationDelay is genuinely stuck and recycling it gives the scheduler a fresh PodGang against the current cluster state. Reworks the PCSG-level MinAvailableBreached formula to compute breach as `existingReplicas - breachedReplicas < MinAvailable` rather than `scheduledReplicas - breachedReplicas < MinAvailable`. This matches the definition the spec implies: a PCSG replica is in breach when at least one of its PodCliques is in breach, and the PCSG itself is in breach only when the count of replicas not-in-breach falls below MinAvailable. The previous formula double-counted: a PCSG replica that had failed to schedule was excluded from scheduledReplicas AND counted again in breachedReplicas, flipping the PCSG into breach unnecessarily and causing PCS-level gang termination to fire when only PCSG-replica- scoped restart was warranted. Unit tests cover the four corner cases: all healthy, partial breach within MinAvailable, all replicas breached, and the update-in-progress precedence over the breach branch. Test helpers are tightened so WithPCLQNotScheduled also sets MinAvailableBreached=True, matching the always-breach reality. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the four scenarios from the gang-termination test plan, all passing locally against the operator with the always-breach + per- replica PCSG formula: GT-1 WL1 full-replicas, PCS-owned PCLQ breached pass GT-2 WL1 full-replicas, PCSG-owned PCLQ breached pass GT-3 WL2 min-replicas, PCS-owned PCLQ (incremental) pass GT-4 WL2 min-replicas, PCSG-owned PCLQ (incremental) pass Each test cordons the node hosting a target ready pod and deletes that pod, waits for terminationDelay + grace, then asserts the expected outcome via pod-UID snapshots. GT-3 step 5 and GT-4 step 6 both kill the last remaining ready pod of a PCLQ — these exercise the always-breach path that the previous suppression had blocked. GT-4 step 4 is the PCSG-replica-scoped restart that the previous PCSG formula was over-eagerly escalating to PCS-level. workload1-gt.yaml and workload2-gt.yaml are dedicated copies of the existing workload{1,2}.yaml with terminationDelay set to 10s so the gang-termination flow has a chance to fire within a reasonable test window, and with -gt suffixed names so the existing gang-scheduling tests are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GT-5 (Individual PCSG replica termination): WL5 (WL2 shape but pc-c has min=3, no tolerance). Killing all 3 pods of one PCSG replica's pc-c breaches that replica's MinAvailable; under the per-replica PCSG formula, the PCSG itself stays healthy (one replica still not in breach satisfies MinAvailable=1) so only the PCSG-scoped restart fires and the rest of the workload keeps its UIDs. GT-6 (Scaled-PodGang pod deletion): WL2 (pc-c min=1 replicas=3 has 2 pods of tolerance). Picks a pc-c pod that carries the grove.io/base-podgang label (i.e. lives outside the base PodGang), cordons its node and deletes it. MinAvailable is still satisfied across pc-c so no breach fires; the standalone pc-a PCLQ's UIDs must be intact (no PCS-level gang termination). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the gang_scheduling entry in both the e2e matrix (real run against prod-grove-e2e-v1) and the e2e-skip matrix (passing stub for PRs that don't touch operator/ or .github/). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The MinAvailableBreached condition correctly stays True whenever scheduledReplicas < MinAvailable, including the initial-startup state where pods have never been admitted. Without this gate, that condition would feed straight into gangterminate.go, which would delete and recreate the PCS replica every TerminationDelay — a 4-hour churn loop on a workload the cluster simply cannot schedule. Grove should be silent in that state and let the scheduler keep polling its pending pods; recycling makes sense only after a workload has been healthy and then regressed. Adds a small wasScheduled signal derived from the existing conditions: WasPCLQEverScheduled reads PodCliqueScheduled — True if currently True, or False with LastTransitionTime sufficiently after CreationTimestamp (the condition must have flipped via True since creation). WasPCSGEverHealthy same shape, reading the PCSG's own MinAvailableBreached (False == healthy at some point). InitialScheduleGrace (5s) absorbs the gap between the apiserver setting CreationTimestamp and the first reconcile flipping the condition False. The gate lives in the helpers GetMinAvailableBreachedPCLQInfo (consumed by both gangterminate.go and the PCSG-replica-scoped restart in podcliquescalinggroup/components/podclique/sync.go) and getMinAvailableBreachedPCSGInfo. The condition itself is unchanged — operators still see MinAvailableBreached=True for never-healthy workloads; only the deletion action is suppressed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The wasHealthy gate (added in 45e134a) prevents gang termination on never-healthy workloads, but a recycled PCSG keeps MinAvailableBreached=True across the PCS-replica DeleteAllOf because the condition lives on the PCSG, not on the deleted-and-recreated PCLQs. WasPCSGEverHealthy returns true forever after the first regression, so the next TerminationDelay fires again, and so on — a 4-hour churn loop on a workload the cluster can't re-stabilize. Adds a GangTerminationInProgress=True / Reason=RecycleInFlight condition that is set on every PCSG in a PCS replica immediately after createPCSReplicaDeleteTask's DeleteAllOf succeeds, and cleared by the PCSG status reconciler on the next MinAvailableBreached True→False transition (the natural "workload recovered" signal). While the flag is True, getMinAvailableBreachedPCSGInfo skips the PCSG, suppressing further fires until the recycle resolves. Ordering is action-first / flag-second: if the controller crashes between DeleteAllOf and the flag write, the next reconcile sees the breach still True (new PCLQs Pending) with no flag set, fires once more (one extra churn), and converges. The two gates compose: - WasPCSGEverHealthy suppresses fires before the workload has ever stabilized (initial scheduling latency). - GangTerminationInProgress suppresses fires while a recycle is still in flight (post-fire convergence). E2E: GT-1..GT-6 all pass (32–120s each, 347s total — comparable to the pre-flag baseline of 353s; faster than 485s baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danbar2 requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners May 14, 2026 15:21

danbar2 force-pushed the gang-termination-e2e branch 2 times, most recently from df46feb to a437f75 Compare May 14, 2026 15:25

danbar2 force-pushed the gang-termination-e2e branch from a437f75 to c034d8d Compare May 19, 2026 06:43

danbar2 force-pushed the gang-termination-e2e branch from c034d8d to 28c1491 Compare May 27, 2026 05:59

danbar2 and others added 2 commits May 28, 2026 09:24

danbar2 force-pushed the gang-termination-e2e branch from 28c1491 to 45e134a Compare May 28, 2026 06:24

danbar2 marked this pull request as draft May 28, 2026 10:51

danbar2 added the run-e2e label May 28, 2026

danbar2 force-pushed the gang-termination-e2e branch from df0eaa9 to 6111aaa Compare May 28, 2026 11:31

danbar2 marked this pull request as ready for review May 28, 2026 11:33

shayasoolin reviewed May 28, 2026

View reviewed changes

Comment thread operator/e2e/yaml/workload5-gt.yaml Outdated

shayasoolin reviewed May 28, 2026

View reviewed changes

Comment thread operator/api/common/constants/constants.go Outdated

danbar2 and others added 3 commits May 28, 2026 18:32

Add gang-termination E2E job to CI

efe1c0d

Mirrors the gang_scheduling entry in both the e2e matrix (real run against prod-grove-e2e-v1) and the e2e-skip matrix (passing stub for PRs that don't touch operator/ or .github/). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danbar2 force-pushed the gang-termination-e2e branch from 6111aaa to d97fd16 Compare May 28, 2026 15:33

danbar2 force-pushed the gang-termination-e2e branch from d97fd16 to f004a2e Compare May 31, 2026 06:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6)#614

Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6)#614
danbar2 wants to merge 6 commits into
ai-dynamo:mainfrom
danbar2:gang-termination-e2e

danbar2 commented May 14, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 14, 2026

Uh oh!

shayasoolin commented May 18, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

danbar2 commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Operator changes (commit c8820c9)

Drop the scheduled==0 suppression

Rework PCSG breach formula

E2E scenarios (commits 7f151af + 294882b)

Test plan

Out of scope (for a follow-up)

Uh oh!

copy-pr-bot Bot commented May 14, 2026

Uh oh!

shayasoolin commented May 18, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danbar2 commented May 14, 2026 •

edited

Loading

Operator changes (commit `c8820c9`)

E2E scenarios (commits `7f151af` + `294882b`)