Skip to content

Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6)#614

Open
danbar2 wants to merge 6 commits into
ai-dynamo:mainfrom
danbar2:gang-termination-e2e
Open

Gang-termination breach formula rework + E2E scenarios (GT-1 through GT-6)#614
danbar2 wants to merge 6 commits into
ai-dynamo:mainfrom
danbar2:gang-termination-e2e

Conversation

@danbar2
Copy link
Copy Markdown
Contributor

@danbar2 danbar2 commented May 14, 2026

Follow-up to #561 (which fixed the initial cordon-then-kill regression) — picks up the open
questions that PR left behind. Refs #277.

Summary

Two related changes, one commit each:

  1. c8820c9 Rework MinAvailableBreached semantics. Drops the scheduledReplicas == 0
    suppression that Fix gang termination when scheduled replicas regress below MinAvailable #561 introduced as a churn-loop guard, and rewrites the PCSG-level
    breach formula to match the spec-implied definition.
  2. 7f151af + 294882b Add gang-termination E2E scenarios GT-1 through GT-6.

Operator changes (commit c8820c9)

Drop the scheduled==0 suppression

The previous suppression made it impossible to recover when a PodClique lost every
scheduled pod — gang termination never fired even after the workload had once been
healthy. The new rule: scheduledReplicas < MinAvailable always breaches.

The natural debounce is TerminationDelay (default 4h). During a normal startup the
breach flickers True briefly and resolves; a workload still below MinAvailable past
TerminationDelay is genuinely stuck and recycling it gives the scheduler a fresh
PodGang against the current cluster state.

Rework PCSG breach formula

Replaces `availableReplicas = scheduledReplicas - breachedReplicas` with
`notInBreachReplicas = existingReplicas - breachedReplicas`. The previous formula
double-counted: a PCSG replica that failed to schedule was excluded from
`scheduledReplicas` AND counted again in `breachedReplicas`, flipping the PCSG into
breach unnecessarily and causing PCS-level gang termination to fire when only the
PCSG-replica-scoped restart was warranted.

Matches the canonical definition:

  • A PCSG replica is in breach if at least one of its PodCliques is in breach.
  • A PCSG replica in breach > TerminationDelay → PCSG-scoped restart (just that
    replica's PCLQs).
  • The PCSG itself is in breach when (replicas not in breach) < MinAvailable.
  • Standalone PCLQ in breach → PCS-level restart.
  • PCSG-level in breach → PCS-level restart.

Unit tests rewritten around the new formula; sanity-pin cases for "1 of 2 replicas
breached with MinAvailable=1 — not in breach (PCSG-scoped restart only)" and "both
replicas breached — escalates to PCS-level" capture the GT-4 / GT-5 expectations.

E2E scenarios (commits 7f151af + 294882b)

Test Workload What it exercises
GT-1 WL1 (full replicas) Cordon + kill a PCS-owned pod → full-workload gang term
GT-2 WL1 (full replicas) Cordon + kill a PCSG-owned pod → full-workload gang term
GT-3 WL2 (min replicas) Incremental kill of pc-a; first kill no-op, second triggers term
GT-4 WL2 (min replicas) Multi-stage PCSG-owned: PCSG-scoped restart then full term
GT-5 WL5 (pc-c min=3) Kill all of one PCSG replica's pc-c → that PCSG replica only
GT-6 WL2 Kill a scaled-PodGang pod → no PCS-level term (pc-a survives)

All six pass locally against an operator running with this PR's changes.

Test plan

  • `go test ./...` clean (operator + scheduler/api + api modules)
  • `go vet ./...` clean
  • gofmt clean
  • E2E: `make run-e2e TEST_PATTERN=Test_GT` — 6 PASS in ~9 minutes against k3d/Kai
  • Recommend a CI run on `prod-grove-e2e-v1` before merge

Out of scope (for a follow-up)

The original test plan also lists GT-7/9/10/11 (crash-loop variants), GT-12 (readiness
probe failure), GT-13 (no gang term during rolling update), and GT-14 (recovery before
TerminationDelay). These are not in this PR — each needs new test infrastructure that
doesn't exist in `operator/e2e` today:

  • GT-7/9/10/11 + GT-12: pod-exec from Go (`client-go/tools/remotecommand` + SPDY) plus
    workload variants with a toggleable liveness/readiness probe keyed on a file in an
    emptyDir volume.
  • GT-13: a `PatchPCSTemplate` helper in `testctx` so a test can trigger a rolling
    update mid-run.
  • GT-14 actually works against this PR's operator changes, but its assertion shape is
    different enough from GT-1..6 that I'd rather pull it in once the infra above lands.

🤖 Generated with Claude Code

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danbar2 danbar2 force-pushed the gang-termination-e2e branch 2 times, most recently from df46feb to a437f75 Compare May 14, 2026 15:25
@shayasoolin
Copy link
Copy Markdown
Contributor

@danbar2 I'd like to know if the new decision of breaching even when scheduled == 0 only means setting the condition, or actually recreating resources like Pods and PogGangs.
Considering the simplest case where gang-scheduling doesn't take place due to insufficient resources, I'm okay with having the breach condition set, but reluctant to recreating resources indefinitely every terminationDelay seconds. Grove should be silent in such cases, only the scheduler should do this kind of polling after its pending resources.

@danbar2 danbar2 force-pushed the gang-termination-e2e branch from a437f75 to c034d8d Compare May 19, 2026 06:43
@danbar2 danbar2 force-pushed the gang-termination-e2e branch from c034d8d to 28c1491 Compare May 27, 2026 05:59
danbar2 and others added 2 commits May 28, 2026 09:24
…rmula

Drops the scheduled==0 suppression from both PCLQ and PCSG status
reconcilers. The previous suppression (added with the original gang-
termination fix to avoid a churn loop) made it impossible to recover
when an entire PodClique lost all its scheduled pods — gang termination
would never fire even after the workload had been healthy. The
suppression is replaced by relying on the existing TerminationDelay
(default 4h) as the natural grace window: a workload that stays below
MinAvailable past TerminationDelay is genuinely stuck and recycling it
gives the scheduler a fresh PodGang against the current cluster state.

Reworks the PCSG-level MinAvailableBreached formula to compute breach
as `existingReplicas - breachedReplicas < MinAvailable` rather than
`scheduledReplicas - breachedReplicas < MinAvailable`. This matches the
definition the spec implies: a PCSG replica is in breach when at least
one of its PodCliques is in breach, and the PCSG itself is in breach
only when the count of replicas not-in-breach falls below MinAvailable.
The previous formula double-counted: a PCSG replica that had failed to
schedule was excluded from scheduledReplicas AND counted again in
breachedReplicas, flipping the PCSG into breach unnecessarily and
causing PCS-level gang termination to fire when only PCSG-replica-
scoped restart was warranted.

Unit tests cover the four corner cases: all healthy, partial breach
within MinAvailable, all replicas breached, and the update-in-progress
precedence over the breach branch. Test helpers are tightened so
WithPCLQNotScheduled also sets MinAvailableBreached=True, matching the
always-breach reality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the four scenarios from the gang-termination test plan, all
passing locally against the operator with the always-breach + per-
replica PCSG formula:

  GT-1  WL1 full-replicas, PCS-owned PCLQ breached         pass
  GT-2  WL1 full-replicas, PCSG-owned PCLQ breached        pass
  GT-3  WL2 min-replicas, PCS-owned PCLQ (incremental)     pass
  GT-4  WL2 min-replicas, PCSG-owned PCLQ (incremental)    pass

Each test cordons the node hosting a target ready pod and deletes that
pod, waits for terminationDelay + grace, then asserts the expected
outcome via pod-UID snapshots. GT-3 step 5 and GT-4 step 6 both kill
the last remaining ready pod of a PCLQ — these exercise the
always-breach path that the previous suppression had blocked. GT-4
step 4 is the PCSG-replica-scoped restart that the previous PCSG
formula was over-eagerly escalating to PCS-level.

workload1-gt.yaml and workload2-gt.yaml are dedicated copies of the
existing workload{1,2}.yaml with terminationDelay set to 10s so the
gang-termination flow has a chance to fire within a reasonable test
window, and with -gt suffixed names so the existing gang-scheduling
tests are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danbar2 danbar2 force-pushed the gang-termination-e2e branch from 28c1491 to 45e134a Compare May 28, 2026 06:24
@danbar2 danbar2 marked this pull request as draft May 28, 2026 10:51
@danbar2 danbar2 force-pushed the gang-termination-e2e branch from df0eaa9 to 6111aaa Compare May 28, 2026 11:31
@danbar2 danbar2 marked this pull request as ready for review May 28, 2026 11:33
Comment thread operator/e2e/yaml/workload5-gt.yaml Outdated
Comment thread operator/api/common/constants/constants.go Outdated
danbar2 and others added 3 commits May 28, 2026 18:32
GT-5 (Individual PCSG replica termination): WL5 (WL2 shape but pc-c
has min=3, no tolerance). Killing all 3 pods of one PCSG replica's
pc-c breaches that replica's MinAvailable; under the per-replica PCSG
formula, the PCSG itself stays healthy (one replica still not in
breach satisfies MinAvailable=1) so only the PCSG-scoped restart
fires and the rest of the workload keeps its UIDs.

GT-6 (Scaled-PodGang pod deletion): WL2 (pc-c min=1 replicas=3 has
2 pods of tolerance). Picks a pc-c pod that carries the
grove.io/base-podgang label (i.e. lives outside the base PodGang),
cordons its node and deletes it. MinAvailable is still satisfied
across pc-c so no breach fires; the standalone pc-a PCLQ's UIDs
must be intact (no PCS-level gang termination).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the gang_scheduling entry in both the e2e matrix (real run
against prod-grove-e2e-v1) and the e2e-skip matrix (passing stub for
PRs that don't touch operator/ or .github/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MinAvailableBreached condition correctly stays True whenever
scheduledReplicas < MinAvailable, including the initial-startup state
where pods have never been admitted. Without this gate, that condition
would feed straight into gangterminate.go, which would delete and
recreate the PCS replica every TerminationDelay — a 4-hour churn loop
on a workload the cluster simply cannot schedule. Grove should be
silent in that state and let the scheduler keep polling its pending
pods; recycling makes sense only after a workload has been healthy and
then regressed.

Adds a small wasScheduled signal derived from the existing conditions:

  WasPCLQEverScheduled  reads PodCliqueScheduled — True if currently
                        True, or False with LastTransitionTime
                        sufficiently after CreationTimestamp (the
                        condition must have flipped via True since
                        creation).
  WasPCSGEverHealthy    same shape, reading the PCSG's own
                        MinAvailableBreached (False == healthy at
                        some point).

InitialScheduleGrace (5s) absorbs the gap between the apiserver
setting CreationTimestamp and the first reconcile flipping the
condition False.

The gate lives in the helpers GetMinAvailableBreachedPCLQInfo
(consumed by both gangterminate.go and the PCSG-replica-scoped
restart in podcliquescalinggroup/components/podclique/sync.go) and
getMinAvailableBreachedPCSGInfo. The condition itself is unchanged —
operators still see MinAvailableBreached=True for never-healthy
workloads; only the deletion action is suppressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danbar2 danbar2 force-pushed the gang-termination-e2e branch from 6111aaa to d97fd16 Compare May 28, 2026 15:33
The wasHealthy gate (added in 45e134a) prevents gang termination on
never-healthy workloads, but a recycled PCSG keeps MinAvailableBreached=True
across the PCS-replica DeleteAllOf because the condition lives on the PCSG,
not on the deleted-and-recreated PCLQs. WasPCSGEverHealthy returns true
forever after the first regression, so the next TerminationDelay fires
again, and so on — a 4-hour churn loop on a workload the cluster can't
re-stabilize.

Adds a GangTerminationInProgress=True / Reason=RecycleInFlight condition
that is set on every PCSG in a PCS replica immediately after
createPCSReplicaDeleteTask's DeleteAllOf succeeds, and cleared by the
PCSG status reconciler on the next MinAvailableBreached True→False
transition (the natural "workload recovered" signal). While the flag is
True, getMinAvailableBreachedPCSGInfo skips the PCSG, suppressing further
fires until the recycle resolves.

Ordering is action-first / flag-second: if the controller crashes between
DeleteAllOf and the flag write, the next reconcile sees the breach still
True (new PCLQs Pending) with no flag set, fires once more (one extra
churn), and converges.

The two gates compose:
  - WasPCSGEverHealthy   suppresses fires before the workload has ever
                         stabilized (initial scheduling latency).
  - GangTerminationInProgress  suppresses fires while a recycle is still
                         in flight (post-fire convergence).

E2E: GT-1..GT-6 all pass (32–120s each, 347s total — comparable to the
pre-flag baseline of 353s; faster than 485s baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danbar2 danbar2 force-pushed the gang-termination-e2e branch from d97fd16 to f004a2e Compare May 31, 2026 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants