Introduce GREP-393: Coherent updates by unmarshall · Pull Request #640 · ai-dynamo/grove

unmarshall · 2026-06-01T07:02:02Z

Introduces the GREP-393 proposal document for the Coherent rolling update strategy for PodCliqueSet. The proposal covers motivation, goals, design decisions, API changes, and the implementation approach for coordinating multi-version rolling updates across PodCliques.

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #639

Special notes for your reviewer:

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

Introduce GREP-393: Coherent updates

copy-pr-bot · 2026-06-01T07:02:06Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adds the GREP-393 proposal document for the Coherent rolling update strategy for PodCliqueSet. The proposal covers motivation, goals, design decisions, API changes, and the implementation approach for coordinating multi-version rolling updates across PodCliques.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>

Ronkahn21 · 2026-06-04T10:49:55Z

+
+### Non-Goals
+
+* Re-use of topology optimized resources during rolling update using resource reservations. Will be handled in future as a separate feature.


What does "topology optimized resources" mean?

Ronkahn21 · 2026-06-04T10:51:41Z

+* Re-use of topology optimized resources during rolling update using resource reservations. Will be handled in future as a separate feature.
+* Explicit support for `maxSurge` and `maxUnavailable` API. However, similar concurrency controls and functionality will be supported in future.
+* User-configurable concurrency control during a coherent update — neither the number of `PodCliqueSet` replicas updated simultaneously nor the number of MVU iterations in flight per replica is configurable in the current iteration. Both default to one. Configurable knobs will be supported in future.
+* `scale-out` and `scale-in` of scale sub-resources (`PodClique`, `PodCliqueScalingGroup`, `PodCliqueSet`) during a coherent update. The current iteration blocks these operations for the duration of a coherent update. How scale operations should compose with an in-flight coherent update will be supported in future.


At what level is this blocked? Does it apply only to a PCSG/PCLQ that is part of the update, to any resource in any PCS replica, or to something in between?

Ronkahn21 · 2026-06-04T11:03:36Z

+
+`PodGangMap` (PGM) is a new namespaced custom resource that captures the **desired-state mapping between PodGangs and their constituent PodClique pod counts and PodCliqueScalingGroup replica indices** for a single `PodCliqueSet` replica. One `PodGangMap` exists per PCS replica, named `<pcs-name>-<pcs-replica-index>`.
+
+`PodGangMap` has no `Status` subresource as it only captures the desired-state. Mappings captured in this resource are read by the `PodGang`, `PodClique` and `PodCliqueScalingGroup` reconcilers.


Do we have a new PodGang reconciler? Today the PCS is the one that handles the PodGang.
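For readers following the thread, the mapping described in the quoted hunk can be pictured roughly as below. This is only a sketch: the type and field names (`PodGangMapSketch`, `PodGangEntry`, `PodCliquePodCounts`, `PCSGReplicaIndices`), the helper `podGangMapName`, and the example resource names are all hypothetical and are not taken from the proposal. Only the `<pcs-name>-<pcs-replica-index>` naming rule and the idea of mapping each PodGang to PodClique pod counts and PodCliqueScalingGroup replica indices come from the quoted text.

```go
package main

import "fmt"

// PodGangEntry is a hypothetical shape for one PodGang's desired contents.
type PodGangEntry struct {
	// PodCliquePodCounts maps a standalone PodClique name to its pod count.
	PodCliquePodCounts map[string]int
	// PCSGReplicaIndices maps a PodCliqueScalingGroup name to replica indices.
	PCSGReplicaIndices map[string][]int
}

// PodGangMapSketch is a hypothetical, desired-state-only view of a PodGangMap
// for a single PodCliqueSet replica. It has no status, matching the quoted text.
type PodGangMapSketch struct {
	Name     string
	PodGangs map[string]PodGangEntry // keyed by PodGang name
}

// podGangMapName builds the name <pcs-name>-<pcs-replica-index>.
func podGangMapName(pcsName string, pcsReplicaIndex int) string {
	return fmt.Sprintf("%s-%d", pcsName, pcsReplicaIndex)
}

func main() {
	pgm := PodGangMapSketch{
		Name: podGangMapName("inference", 0),
		PodGangs: map[string]PodGangEntry{
			// The PodGang name below is illustrative only.
			"inference-0-mpg-0": {
				PodCliquePodCounts: map[string]int{"router": 2},
				PCSGReplicaIndices: map[string][]int{"prefill": {0}, "decode": {0}},
			},
		},
	}
	fmt.Println(pgm.Name, len(pgm.PodGangs))
}
```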

Ronkahn21 · 2026-06-04T11:18:43Z

+    // for this replica. The orchestrator waits for all of them to become Available before
+    // advancing to the next iteration.
+    // +optional
+    InFlightPodGangs []string `json:"inFlightPodGangs,omitempty"`


Is this used by every rolling update type?

Ronkahn21 · 2026-06-04T11:20:57Z

+    // ... existing fields ...
+
+    // UpdatedStandalonePodCliques captures the names of standalone PodCliques whose pod template
+    // was detected as out-of-date when this coherent update started. The set is frozen for the


I am not sure I understand this sentence. Will it include all the standalone PCLQs that are part of this update,
or the standalone PCLQs that are not part of this MVU update but whose current pod template does not match the desired one?

Ronkahn21 · 2026-06-04T11:23:40Z

+
+`InFlightPodGangs` is the orchestrator's hand-off to the PodGangMap component and back. In addition it also provide visibility into what PodGang(s) are currently getting updated. A coherent update for a PodCliqueSet replica proceeds in two phases: 
+
+* The **MPG phase**: In this phase MPGs are taken up one at a time — each MPG is rolled to the new hash, and only once it reports `PodGangConditionTypeAvailable=True` does the orchestrator move on to the next MPG 


Where is the step in which we calculate the gang tree, i.e. what belongs to each PodGang?

Ronkahn21 · 2026-06-04T11:26:18Z

+`InFlightPodGangs` is the orchestrator's hand-off to the PodGangMap component and back. In addition it also provide visibility into what PodGang(s) are currently getting updated. A coherent update for a PodCliqueSet replica proceeds in two phases: 
+
+* The **MPG phase**: In this phase MPGs are taken up one at a time — each MPG is rolled to the new hash, and only once it reports `PodGangConditionTypeAvailable=True` does the orchestrator move on to the next MPG 
+* The **TPG phase**: Once every MPG is at the new hash, all remaining TPGs are rolled together in a single iteration. 


What does "once every MPG is at the new hash" mean? Does it mean the PodGang is ready?
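As a reading aid for the two phases quoted above, the ordering can be sketched as a list of iterations, where each iteration holds the PodGangs that would be in flight together. This is an assumption-laden sketch rather than the proposal's orchestrator: `iteration` and `planIterations` are hypothetical names, the PodGang names are made up, and the wait on `PodGangConditionTypeAvailable=True` between iterations is not modelled.

```go
package main

import "fmt"

// iteration is a hypothetical record of the PodGangs rolled together in one step.
type iteration struct {
	Phase    string
	PodGangs []string
}

// planIterations orders the update as the quoted text describes:
// every MPG gets its own iteration, then all TPGs share one final iteration.
func planIterations(mpgs, tpgs []string) []iteration {
	plan := make([]iteration, 0, len(mpgs)+1)
	for _, mpg := range mpgs {
		plan = append(plan, iteration{Phase: "MPG", PodGangs: []string{mpg}})
	}
	if len(tpgs) > 0 {
		// Copy the slice so the plan does not alias the caller's data.
		plan = append(plan, iteration{Phase: "TPG", PodGangs: append([]string(nil), tpgs...)})
	}
	return plan
}

func main() {
	plan := planIterations([]string{"mpg-0", "mpg-1"}, []string{"tpg-0", "tpg-1", "tpg-2"})
	for i, it := range plan {
		fmt.Printf("iteration %d (%s phase): in-flight %v\n", i, it.Phase, it.PodGangs)
	}
}
```

Under this reading, the `PodGangs` of the current iteration correspond to what `InFlightPodGangs` would list at that moment.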

Ronkahn21 · 2026-06-04T14:02:39Z

+- **One MPG (Minimum-Viable PodGang)** carrying MinAvailable replicas of every standalone PCLQ and every PCSG.
+- **Zero or more TPGs**, one per tranche of PCSG replicas above MinAvailable. TPGs depend on the MPG via `DependsOn`, so the scheduler places them only after the MPG is up. 
+- **Excess standalone-PCLQ replicas** (when a standalone PCLQ has more than MinAvailable) roll into the highest index MPG entry as additional pods — no separate PodGang is created for them.


At startup, does this act basically the same as BPG and SPG?
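To make the MPG/TPG split in the quoted hunk concrete, here is a minimal sketch for a single PodCliqueScalingGroup. It assumes replica indices start at 0, that the first `minAvailable` indices go to the MPG, and that the remainder is cut into fixed-size tranches. The function `splitReplicas` and the `trancheSize` parameter are hypothetical. The quoted text does not say how large a tranche is.

```go
package main

import "fmt"

// splitReplicas divides PCSG replica indices [0, replicas) into the indices
// carried by the MPG and the tranches that would each become one TPG.
// trancheSize is an assumed parameter, not something the quoted text defines.
func splitReplicas(replicas, minAvailable, trancheSize int) (mpg []int, tpgs [][]int) {
	if trancheSize < 1 {
		trancheSize = 1
	}
	base := minAvailable
	if base < 0 {
		base = 0
	}
	if base > replicas {
		base = replicas
	}
	// The first MinAvailable replicas belong to the MPG.
	for i := 0; i < base; i++ {
		mpg = append(mpg, i)
	}
	// Everything above MinAvailable is grouped into tranches, one TPG each.
	for start := base; start < replicas; start += trancheSize {
		end := start + trancheSize
		if end > replicas {
			end = replicas
		}
		tranche := make([]int, 0, end-start)
		for i := start; i < end; i++ {
			tranche = append(tranche, i)
		}
		tpgs = append(tpgs, tranche)
	}
	return mpg, tpgs
}

func main() {
	mpg, tpgs := splitReplicas(5, 2, 1)
	fmt.Println("MPG replica indices:", mpg)
	fmt.Println("TPG tranches:", tpgs)
}
```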

Ronkahn21 · 2026-06-04T14:43:04Z

+
+As an ML infrastructure team member deploying a disaggregated inference system where the prefill tier and decode tier are updated on different release cadences, I need to independently update only the decode `PodClique` (e.g., to pick up a memory-efficiency fix) without touching the prefill `PodClique`. The system should recognise that this is a backward compatible, single-component update and updates decode pods incrementally (up to a configurable concurrency limit), and leave prefill pods untouched — all without requiring a full MVU replacement.
+
+### Limitations/Risks & Mitigations


Wanted to flag something I don't think is covered in the limitations section yet.

Before a coherent update, excess PCSG replicas sit in isolated single-component SPGs — {prefill[1]} and {decode[1]} are separate gangs. So if the scheduler needs to preempt due to resource pressure, only decode[1] gets preempted. Prefill[1] keeps running. After the update those two replicas are gang-scheduled together in MPG-NEW2: {prefill[1]v1, decode[1]v1} — the scheduler will preempt the entire gang, taking down both prefill and decode together. This persists permanently in steady state, not just during the update window. Worth calling out in the limitations section.

Separate question — after the first MPG is up and available, why do we continue creating additional MPGs for the remaining replicas rather than keeping them as TPGs that depend on the new MPG, same as the initial setup? The first MPG already satisfies minAvailable, so the remaining replicas are tail capacity. Keeping them as TPGs would preserve the pre-update isolation and avoid the preemption blast radius problem entirely. If the right behavior differs across use cases, we might need to expose this as a user-controlled knob on the update strategy API.

Ronkahn21 · 2026-06-04T14:53:13Z

+
+- **PodClique (standalone).** When `PodClique.Spec.Replicas` shrinks, the PodClique reconciler's pod component picks the pods to remove using a deletion sorter that considers pod-template-hash mismatch, readiness, age, and other heuristics over the live pod set. The chosen pods can come from any of the existing PodGangs and the pick is **non-deterministic from outside** the pod component — it depends on which pods exist at that instant. The pod component records the per-PodGang decrement in `PodCliqueStatus.PodGangMapping` during a scale-in, since only it knows which PodGang each removed pod was associated with.
+- **PodCliqueScalingGroup.** When `PodCliqueScalingGroup.Spec.Replicas` shrinks, the PodCliqueScalingGroup reconciler's PodClique component runs a deterministic tier walk over the existing PodGang names — legacy SPG entries first, then unified-naming entries — sorted by trailing PodGang-name suffix descending, and pops the highest replica index from each entry until the scale-in count is satisfied. So although the PCSG-side pick is fully deterministic, the work — selecting which entry to drain and which replica index to pop from it — still happens inside the PodCliqueScalingGroup reconciler. The PodCliqueScalingGroup reconciler records the resulting decrement in `PodCliqueScalingGroupStatus.PodGangMapping` so the PodGangMap component can rebuild PGM from it on the next reconcile.
+


What happens here during a post-update scale-in when a PCSG replica is part of an MPG with another component? After a coherent update, replicas that were previously isolated in single-component SPGs are now gang-scheduled together — e.g. MPG-NEW3: {prefill[2]v1, decode[2]v1}. If the user scales in decode from 3 → 2, the tier walk hits MPG-NEW3 and removes decode[2], leaving prefill[2] orphaned in a half-empty MPG with no decode counterpart. The two PCSG reconcilers make their scale-in decisions independently with no cross-component coordination — prefill has no visibility that its gang partner was removed. Is this expected? Does the doc need to address what happens to the orphaned component?
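The tier walk from the quoted `PodCliqueScalingGroup` bullet can be sketched as follows, which may help when reasoning about the orphaning question above. Several points are assumptions: the `entry` type and `scaleIn` function are hypothetical, the PodGang names are invented, and the walk is assumed to drain one entry (highest replica index first) before moving to the next, which is one possible reading of "pops the highest replica index from each entry". The sketch covers a single PCSG only, so it shows no cross-component coordination.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// entry is a hypothetical view of one PodGang and the replica indices of a
// single PodCliqueScalingGroup that it currently holds.
type entry struct {
	PodGangName    string
	Legacy         bool // true for legacy SPG entries
	ReplicaIndices []int
}

// trailingSuffix returns the integer after the last '-' in name, or -1 if absent.
func trailingSuffix(name string) int {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return -1
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// scaleIn walks legacy entries first, then unified-naming entries, each tier
// sorted by trailing suffix descending, popping the highest replica index
// until count replicas are removed. It returns the removed indices in pop
// order and the per-PodGang decrement.
func scaleIn(entries []entry, count int) ([]int, map[string]int) {
	ordered := make([]entry, len(entries))
	for i, e := range entries {
		idx := append([]int(nil), e.ReplicaIndices...) // do not mutate the input
		sort.Ints(idx)
		ordered[i] = entry{PodGangName: e.PodGangName, Legacy: e.Legacy, ReplicaIndices: idx}
	}
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].Legacy != ordered[j].Legacy {
			return ordered[i].Legacy
		}
		return trailingSuffix(ordered[i].PodGangName) > trailingSuffix(ordered[j].PodGangName)
	})
	var removed []int
	decrement := map[string]int{}
	for i := range ordered {
		e := &ordered[i]
		for count > 0 && len(e.ReplicaIndices) > 0 {
			last := len(e.ReplicaIndices) - 1
			removed = append(removed, e.ReplicaIndices[last])
			e.ReplicaIndices = e.ReplicaIndices[:last]
			decrement[e.PodGangName]++
			count--
		}
	}
	return removed, decrement
}

func main() {
	entries := []entry{
		{PodGangName: "pcs-0-3", ReplicaIndices: []int{2}},
		{PodGangName: "pcs-0-spg-1", Legacy: true, ReplicaIndices: []int{4, 3}},
		{PodGangName: "pcs-0-2", ReplicaIndices: []int{1}},
	}
	removed, decrement := scaleIn(entries, 3)
	fmt.Println("removed replica indices:", removed)
	fmt.Println("per-PodGang decrement:", decrement)
}
```

In this sketch the returned decrement map plays the role that the quoted text assigns to `PodCliqueScalingGroupStatus.PodGangMapping`.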

unmarshall requested review from Ronkahn21, danbar2, gflarity, sanjaychatterjee and shayasoolin as code owners June 1, 2026 07:02

unmarshall force-pushed the coherent-update-grep branch from bd766f4 to cb146ce Compare June 1, 2026 07:04

Ronkahn21 reviewed Jun 4, 2026

unmarshall wants to merge 1 commit into ai-dynamo:main from unmarshall:coherent-update-grep



