GREP-537 add Koordinator scheduler backend#562
Conversation
| - **`SyncPodGang` translation**: For each `PodGroup` in a `PodGang`, create one Koordinator `PodGroup` CR (named `{podgang}-{podgroup}`), linked into a GangGroup via the `gang.scheduling.koordinator.sh/groups` annotation. | ||
| - **`PreparePod` injection**: Set `Pod.Spec.SchedulerName`, inject the `pod-group.scheduling.sigs.k8s.io` label, and optionally inject the `koordinator.sh/qosClass` label when configured. | ||
| - **Single-layer topology-aware placement**: Translate Grove topology intent into the `network-topology-spec` annotation on the generated PodGroup CRs. The user-facing `PodCliqueSet` API expresses this as `topologyConstraint.packDomain`; the PodCliqueSet controller resolves it through `ClusterTopology` into an internal `PodGang.Spec.TopologyConstraint.PackConstraint.Required` topology key. | ||
| - **Admission rejection for incompatible features**: Reject PodCliqueSets that request features this backend fundamentally cannot honour (MNNVL and PCSG-level topology constraints) via `ValidatePodCliqueSet`, producing a clear error at `kubectl apply` time rather than silently mis-scheduling the workload. |
There was a problem hiding this comment.
What would happen to existing pod groups that violate this and were created when the scheduling backend was different ?
In the future we want to allow using different scheduling backends together - we need to create a mechanism for the scheduling backend to have a hook the admission can use for that only when it needs to call that scheduling backend.
There was a problem hiding this comment.
We should align with the design decisions that they agree on the other GREPs (#560 (comment))
There was a problem hiding this comment.
Sorry for the late reply due to the holiday. This is a very good point. In the future, multiple schedulers may coexist, and this wasn’t fully considered before. I think during implementation we can validate only PodCliqueSets that explicitly configure the scheduler backend as 'koord-scheduler'. I’ll discuss the design with Kang and then update the document. Thanks!
There was a problem hiding this comment.
@shinedays, for some capability that some backend may not support yet, we could reject now for simple just as we did in volcano. We could leverage pcs validation hook to do this or ValidatePodCliqueSet in the developmenet
|
Change the PR description issue relation from Fixes to Refers so that it will not close the issue. We need the issue for the implementation as well. |
@shinedays Thanks for the contribution! Once this GREP PR is merged, do you plan to proceed with the implementation as well? |
enoodle
left a comment
There was a problem hiding this comment.
Overall looks great! I think we just need to align all the validation decisions across all scheduling backends, but maybe it shouldn't postpone this one.
We need to fix the PR comment so that it will not close the issue as we will need the issue for the implementation.
Yes, I’ve already implemented an initial demo. Once the GREP review is approved, I’ll refine and submit it. |
|
|
||
| ### Scope and Limitations | ||
|
|
||
| - **MNNVL / ComputeDomain is not supported**: Multi-Node NVLink depends on NVIDIA DRA `ComputeDomain` and ResourceClaims, which are incompatible with Koordinator's DeviceShare model. `ValidatePodCliqueSet` rejects workloads annotated with `grove.io/auto-mnnvl: enabled`. |
There was a problem hiding this comment.
Why DRA not supported? What is DeviceShare model
| - **MNNVL / ComputeDomain is not supported**: Multi-Node NVLink depends on NVIDIA DRA `ComputeDomain` and ResourceClaims, which are incompatible with Koordinator's DeviceShare model. `ValidatePodCliqueSet` rejects workloads annotated with `grove.io/auto-mnnvl: enabled`. | ||
| - **Topology support is intentionally narrow**: The user-facing API supports one `topologyConstraint.packDomain` per scope. The PodCliqueSet controller resolves that into an internal Required topology key, and this backend maps that key to one Koordinator gather rule. Multi-layer rules, `PodCountMultiple`, pod-index alignment, and user-authored Preferred topology need a follow-up Grove API change. | ||
| - **PCSG-level topology is not representable**: One Grove PodGang becomes one Koordinator GangGroup. Koordinator cannot apply separate topology constraints to disjoint subsets inside that GangGroup, so `PodCliqueScalingGroupConfig.TopologyConstraint` is rejected at admission and internal `TopologyConstraintGroupConfigs` fail closed at reconcile time. | ||
| - **Only three Koordinator topology layers are configured in this version**: topology keys can map to `hostLayer`, `rackLayer`, or `blockLayer`. Arbitrary Koordinator layers such as `acceleratorLayer`, `spineLayer`, or `datacenterLayer` require widening the backend configuration and validation. |
There was a problem hiding this comment.
shoule be the same with Topology support is intentionally narrow?
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Add Koordinator scheduler backend
Which issue(s) this PR fixes:
Refers #537
Special notes for your reviewer:
Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: