Introduce GREP-393: Coherent updates #640

Open
unmarshall wants to merge 1 commit into ai-dynamo:main from unmarshall:coherent-update-grep

@unmarshall
Collaborator

Introduces the GREP-393 proposal document for the Coherent rolling update strategy for PodCliqueSet. The proposal covers motivation, goals, design decisions, API changes, and the implementation approach for coordinating multi-version rolling updates across PodCliques.

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #639

Special notes for your reviewer:

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

Introduce GREP-393: Coherent updates

@copy-pr-bot

copy-pr-bot Bot commented Jun 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adds the GREP-393 proposal document for the Coherent rolling update
strategy for PodCliqueSet. The proposal covers motivation, goals,
design decisions, API changes, and the implementation approach for
coordinating multi-version rolling updates across PodCliques.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
@unmarshall force-pushed the coherent-update-grep branch from bd766f4 to cb146ce (June 1, 2026 07:04)

### Non-Goals

* Re-use of topology-optimized resources during a rolling update using resource reservations. This will be handled in the future as a separate feature.
Contributor

What does "topology optimized resources" mean?

* Explicit support for the `maxSurge` and `maxUnavailable` API. However, similar concurrency controls and functionality will be supported in the future.
* User-configurable concurrency control during a coherent update. Neither the number of `PodCliqueSet` replicas updated simultaneously nor the number of MVU iterations in flight per replica is configurable in the current iteration; both default to one. Configurable knobs will be supported in the future.
* `scale-out` and `scale-in` of scale sub-resources (`PodClique`, `PodCliqueScalingGroup`, `PodCliqueSet`) during a coherent update. The current iteration blocks these operations for the duration of a coherent update. How scale operations should compose with an in-flight coherent update will be addressed in the future.
Contributor

At what level? Does this mean any PCSG/PCLQ that is part of the update, any resource in any PCS replica, or anything in between?


`PodGangMap` (PGM) is a new namespaced custom resource that captures the **desired-state mapping between PodGangs and their constituent PodClique pod counts and PodCliqueScalingGroup replica indices** for a single `PodCliqueSet` replica. One `PodGangMap` exists per PCS replica, named `<pcs-name>-<pcs-replica-index>`.

`PodGangMap` has no `Status` subresource because it only captures the desired state. Mappings captured in this resource are read by the `PodGang`, `PodClique` and `PodCliqueScalingGroup` reconcilers.
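
The naming convention above can be illustrated with a minimal sketch. The helper names `PodGangMapName` and `PodGangMapNames` are hypothetical and not part of the proposal, and the sketch assumes PCS replica indices start at 0.

```go
package main

import "fmt"

// PodGangMapName returns the name of the PodGangMap for one PodCliqueSet
// replica, following the "<pcs-name>-<pcs-replica-index>" convention.
// Hypothetical helper, for illustration only.
func PodGangMapName(pcsName string, pcsReplicaIndex int) string {
	return fmt.Sprintf("%s-%d", pcsName, pcsReplicaIndex)
}

// PodGangMapNames lists the expected PodGangMap names for a PodCliqueSet:
// one PodGangMap per PCS replica. Assumes replica indices start at 0.
func PodGangMapNames(pcsName string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, PodGangMapName(pcsName, i))
	}
	return names
}

func main() {
	// "inference" is a made-up PodCliqueSet name.
	for _, name := range PodGangMapNames("inference", 3) {
		fmt.Println(name)
	}
}
```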
Contributor

Do we have a new PodGang reconciler? Today the PCS is the one that handles the PodGang.

// for this replica. The orchestrator waits for all of them to become Available before
// advancing to the next iteration.
// +optional
InFlightPodGangs []string `json:"inFlightPodGangs,omitempty"`
Contributor

It is used by any rolling update type.

// ... existing fields ...

// UpdatedStandalonePodCliques captures the names of standalone PodCliques whose pod template
// was detected as out-of-date when this coherent update started. The set is frozen for the
Contributor

I am not sure I understand the sentence. Will it include all the standalone PCLQs that are part of this update,
or standalone PCLQs that are not part of this MVU update but whose current pod template does not match the desired one?


`InFlightPodGangs` is the orchestrator's hand-off to the PodGangMap component and back. It also provides visibility into which PodGang(s) are currently being updated. A coherent update for a PodCliqueSet replica proceeds in two phases:

* The **MPG phase**: In this phase MPGs are taken up one at a time. Each MPG is rolled to the new hash, and only once it reports `PodGangConditionTypeAvailable=True` does the orchestrator move on to the next MPG.
Contributor

Where is the step in which we calculate the gang tree, i.e. what belongs to each PodGang?

* The **TPG phase**: Once every MPG is at the new hash, all remaining TPGs are rolled together in a single iteration.
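
The two-phase ordering can be sketched as follows. This is only an illustration of the sequencing described in the proposal text, not the proposed implementation: `PlanIterations` and `CanAdvance` are hypothetical names, and availability is modelled as a plain map rather than the `PodGangConditionTypeAvailable` condition on real objects.

```go
package main

import "fmt"

// PlanIterations returns the ordered update iterations for one PodCliqueSet
// replica. Each inner slice is the set of PodGangs in flight during that
// iteration: one MPG per iteration first, then all TPGs together.
func PlanIterations(mpgs, tpgs []string) [][]string {
	iterations := make([][]string, 0, len(mpgs)+1)
	for _, mpg := range mpgs {
		iterations = append(iterations, []string{mpg})
	}
	if len(tpgs) > 0 {
		// Copy so the caller's slice is not aliased.
		iterations = append(iterations, append([]string(nil), tpgs...))
	}
	return iterations
}

// CanAdvance reports whether every in-flight PodGang is Available, which is
// the gate for moving on to the next iteration.
func CanAdvance(inFlight []string, available map[string]bool) bool {
	for _, name := range inFlight {
		if !available[name] {
			return false
		}
	}
	return true
}

func main() {
	// PodGang names here are made up for the example.
	plan := PlanIterations([]string{"mpg-0", "mpg-1"}, []string{"tpg-0", "tpg-1"})
	for i, inFlight := range plan {
		fmt.Printf("iteration %d: %v\n", i, inFlight)
	}
}
```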
Contributor

What does "Once every MPG is at the new hash" mean? Does it mean the PodGang is ready?

Comment on lines +340 to +342
- **One MPG (Minimum-Viable PodGang)** carrying MinAvailable replicas of every standalone PCLQ and every PCSG.
- **Zero or more TPGs**, one per tranche of PCSG replicas above MinAvailable. TPGs depend on the MPG via `DependsOn`, so the scheduler places them only after the MPG is up.
- **Excess standalone-PCLQ replicas** (when a standalone PCLQ has more than MinAvailable) roll into the highest index MPG entry as additional pods — no separate PodGang is created for them.
Contributor

At startup, does it act basically the same as BPG and SPG?


As an ML infrastructure team member deploying a disaggregated inference system where the prefill tier and decode tier are updated on different release cadences, I need to independently update only the decode `PodClique` (e.g., to pick up a memory-efficiency fix) without touching the prefill `PodClique`. The system should recognise that this is a backward-compatible, single-component update, update decode pods incrementally (up to a configurable concurrency limit), and leave prefill pods untouched, all without requiring a full MVU replacement.

### Limitations/Risks & Mitigations
Contributor

Wanted to flag something I don't think is covered in the limitations section yet.

Before a coherent update, excess PCSG replicas sit in isolated single-component SPGs — {prefill[1]} and {decode[1]} are separate gangs. So if the scheduler needs to preempt due to resource pressure, only decode[1] gets preempted. Prefill[1] keeps running. After the update those two replicas are gang-scheduled together in MPG-NEW2: {prefill[1]v1, decode[1]v1} — the scheduler will preempt the entire gang, taking down both prefill and decode together. This persists permanently in steady state, not just during the update window. Worth calling out in the limitations section.

Separate question — after the first MPG is up and available, why do we continue creating additional MPGs for the remaining replicas rather than keeping them as TPGs that depend on the new MPG, same as the initial setup? The first MPG already satisfies minAvailable, so the remaining replicas are tail capacity. Keeping them as TPGs would preserve the pre-update isolation and avoid the preemption blast radius problem entirely. If the right behavior differs across use cases, we might need to expose this as a user-controlled knob on the update strategy API.


- **PodClique (standalone).** When `PodClique.Spec.Replicas` shrinks, the PodClique reconciler's pod component picks the pods to remove using a deletion sorter that considers pod-template-hash mismatch, readiness, age, and other heuristics over the live pod set. The chosen pods can come from any of the existing PodGangs and the pick is **non-deterministic from outside** the pod component — it depends on which pods exist at that instant. The pod component records the per-PodGang decrement in `PodCliqueStatus.PodGangMapping` during a scale-in, since only it knows which PodGang each removed pod was associated with.
- **PodCliqueScalingGroup.** When `PodCliqueScalingGroup.Spec.Replicas` shrinks, the PodCliqueScalingGroup reconciler's PodClique component runs a deterministic tier walk over the existing PodGang names — legacy SPG entries first, then unified-naming entries — sorted by trailing PodGang-name suffix descending, and pops the highest replica index from each entry until the scale-in count is satisfied. So although the PCSG-side pick is fully deterministic, the work — selecting which entry to drain and which replica index to pop from it — still happens inside the PodCliqueScalingGroup reconciler. The PodCliqueScalingGroup reconciler records the resulting decrement in `PodCliqueScalingGroupStatus.PodGangMapping` so the PodGangMap component can rebuild PGM from it on the next reconcile.
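
A minimal sketch of the PCSG tier walk described above, under stated assumptions: the `Entry` type, the `Legacy` flag, and `PickForScaleIn` are hypothetical stand-ins rather than the proposal's actual types, PodGang names are assumed to end in a `-<number>` suffix, and each entry is assumed to be drained from its highest replica index downward before the walk moves to the next entry.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Entry is a hypothetical stand-in for one PodGang entry in the mapping.
type Entry struct {
	PodGangName    string
	Legacy         bool  // true for legacy SPG entries, false for unified naming
	ReplicaIndices []int // PCSG replica indices held by this PodGang
}

// trailingSuffix parses the number after the last '-' in a PodGang name.
// It returns -1 when there is no numeric suffix.
func trailingSuffix(name string) int {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return -1
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return -1
	}
	return n
}

// PickForScaleIn walks legacy entries first, then unified-naming entries,
// each tier sorted by trailing suffix descending, and pops the highest
// replica indices until count replicas have been selected. The result maps
// PodGang name to the replica indices removed from it.
func PickForScaleIn(entries []Entry, count int) map[string][]int {
	ordered := append([]Entry(nil), entries...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].Legacy != ordered[j].Legacy {
			return ordered[i].Legacy
		}
		return trailingSuffix(ordered[i].PodGangName) > trailingSuffix(ordered[j].PodGangName)
	})
	removed := map[string][]int{}
	for _, e := range ordered {
		indices := append([]int(nil), e.ReplicaIndices...)
		sort.Sort(sort.Reverse(sort.IntSlice(indices)))
		for _, r := range indices {
			if count == 0 {
				return removed
			}
			removed[e.PodGangName] = append(removed[e.PodGangName], r)
			count--
		}
	}
	return removed
}

func main() {
	// Names and indices are made up for the example.
	entries := []Entry{
		{PodGangName: "pcs-0-2", Legacy: false, ReplicaIndices: []int{3, 4}},
		{PodGangName: "pcs-0-sg-0", Legacy: true, ReplicaIndices: []int{1}},
		{PodGangName: "pcs-0-sg-1", Legacy: true, ReplicaIndices: []int{2}},
	}
	fmt.Println(PickForScaleIn(entries, 3))
}
```

The returned per-PodGang decrement is the kind of information the text says is recorded in `PodCliqueScalingGroupStatus.PodGangMapping`; how it is actually stored is not shown here.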

Contributor

What happens here during a post-update scale-in when a PCSG replica is part of an MPG with another component? After a coherent update, replicas that were previously isolated in single-component SPGs are now gang-scheduled together — e.g. MPG-NEW3: {prefill[2]v1, decode[2]v1}. If the user scales in decode from 3 → 2, the tier walk hits MPG-NEW3 and removes decode[2], leaving prefill[2] orphaned in a half-empty MPG with no decode counterpart. The two PCSG reconcilers make their scale-in decisions independently with no cross-component coordination — prefill has no visibility that its gang partner was removed. Is this expected? Does the doc need to address what happens to the orphaned component?
