diff --git a/docs/proposals/537-koordinator-scheduler-backend/README.md b/docs/proposals/537-koordinator-scheduler-backend/README.md new file mode 100644 index 000000000..ea086cc2c --- /dev/null +++ b/docs/proposals/537-koordinator-scheduler-backend/README.md @@ -0,0 +1,359 @@ +# GREP-537: Koordinator Scheduler Backend + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Scope and Limitations](#scope-and-limitations) +- [Proposal](#proposal) +- [Design Details](#design-details) + - [Architecture](#architecture) + - [PodGang → Koordinator PodGroup Translation](#podgang--koordinator-podgroup-translation) + - [Naming convention](#naming-convention) + - [GangGroup aggregation](#ganggroup-aggregation) + - [PodGroup spec fields](#podgroup-spec-fields) + - [Base vs Scaled PodGang mapping](#base-vs-scaled-podgang-mapping) + - [Pod Preparation](#pod-preparation) + - [Topology Constraint Translation](#topology-constraint-translation) + - [Required vs Preferred policy](#required-vs-preferred-policy) + - [Key-to-layer mapping](#key-to-layer-mapping) + - [Admission Validation](#admission-validation) + - [OperatorConfiguration Extension](#operatorconfiguration-extension) + - [Lifecycle and Ownership](#lifecycle-and-ownership) + - [RBAC](#rbac) + - [Test Plan](#test-plan) + - [Unit Tests](#unit-tests) + - [E2E Tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) +- [Alternatives](#alternatives) + - [Users write Koordinator annotations directly](#users-write-koordinator-annotations-directly) + - [Map PodGang to a single PodGroup](#map-podgang-to-a-single-podgroup) + - [Wait for Koordinator to natively support PodGang](#wait-for-koordinator-to-natively-support-podgang) +- [Appendix](#appendix) + - [Reference: Koordinator annotations used](#reference-koordinator-annotations-used) + - [Reference: Example translation](#reference-example-translation) + + +## Summary + +This GREP, tracked by 
[issue #537](https://github.com/ai-dynamo/grove/issues/537), proposes `koord-scheduler` as a first-class Grove scheduler backend, built on top of the Scheduler Backend Framework defined in [GREP-375](../375-scheduler-backend-framework/README.md). The backend lets Grove workloads run on clusters that use [Koordinator](https://koordinator.sh/) as the scheduler, obtaining gang scheduling, single-layer topology-aware placement, and optional QoS-class injection without requiring users to hand-author any Koordinator-specific YAML. + +The backend translates Grove `PodGang` resources into a GangGroup of sig-scheduler-plugins `PodGroup` CRs (the format Koordinator's coscheduling plugin consumes), sets `Pod.Spec.SchedulerName = "koord-scheduler"` on each constituent Pod, injects the `pod-group.scheduling.sigs.k8s.io` association label, optionally translates a single topology constraint into Koordinator's `network-topology-spec` annotation, and rejects incompatible configurations (notably MNNVL and PCSG-level topology constraints) at admission. + +## Motivation + +Koordinator is a widely deployed enhanced scheduler with a mature gang-scheduling and network-topology-aware feature set. Many platform teams already run Koordinator for non-AI workloads and want to onboard AI workloads onto the same scheduler rather than deploying a second advanced scheduler in parallel. Grove today supports `kai-scheduler` as its advanced-scheduler backend; it does not support Koordinator. + +Without a dedicated backend, the only way to run a Grove workload on a Koordinator cluster is to fall back to `default-scheduler`, which loses gang guarantees and topology-aware placement — both of which are essential correctness properties for distributed training and inference workloads. + +With GREP-375 landed, Grove now has a clean extension surface for scheduler backends: `SyncPodGang`, `PreparePod`, `ValidatePodCliqueSet`, and `Init`. This GREP defines how Grove should use that surface for Koordinator. 
+ +### Goals + +- **Gang scheduling on Koordinator**: Every Grove `PodGang` is realised as a Koordinator GangGroup, so all constituent pods are scheduled together or not at all, subject to Koordinator's `Strict`/`NonStrict` modes. +- **`SyncPodGang` translation**: For each `PodGroup` in a `PodGang`, create one Koordinator `PodGroup` CR (named `{podgang}-{podgroup}`), linked into a GangGroup via the `gang.scheduling.koordinator.sh/groups` annotation. +- **`PreparePod` injection**: Set `Pod.Spec.SchedulerName`, inject the `pod-group.scheduling.sigs.k8s.io` label, and optionally inject the `koordinator.sh/qosClass` label when configured. +- **Single-layer topology-aware placement**: Translate Grove topology intent into the `network-topology-spec` annotation on the generated PodGroup CRs. The user-facing `PodCliqueSet` API expresses this as `topologyConstraint.packDomain`; the PodCliqueSet controller resolves it through `ClusterTopology` into an internal `PodGang.Spec.TopologyConstraint.PackConstraint.Required` topology key. +- **Admission rejection for incompatible features**: Reject PodCliqueSets that request features this backend fundamentally cannot honour (MNNVL and PCSG-level topology constraints) via `ValidatePodCliqueSet`, producing a clear error at `kubectl apply` time rather than silently mis-scheduling the workload. +- **Configurable backend knobs**: Expose a `KoordinatorSchedulerConfiguration` type under `SchedulerProfile.Config` covering gang mode, match policy, schedule timeout, default QoS class, and user-supplied topology-key-to-layer mappings. +- **Cleanup via owner references**: Koordinator PodGroup CRs are owned by the PodGang they are generated from, so deleting the PodGang garbage-collects the PodGroups — `OnPodGangDelete` is a no-op. + +### Scope and Limitations + +- **MNNVL / ComputeDomain is not supported**: Multi-Node NVLink depends on NVIDIA DRA `ComputeDomain` and ResourceClaims, which are incompatible with Koordinator's DeviceShare model. 
`ValidatePodCliqueSet` rejects workloads annotated with `grove.io/auto-mnnvl: enabled`. +- **Topology support is intentionally narrow**: The user-facing API supports one `topologyConstraint.packDomain` per scope. The PodCliqueSet controller resolves that into an internal Required topology key, and this backend maps that key to one Koordinator gather rule. Multi-layer rules, `PodCountMultiple`, pod-index alignment, and user-authored Preferred topology need a follow-up Grove API change. +- **PCSG-level topology is not representable**: One Grove PodGang becomes one Koordinator GangGroup. Koordinator cannot apply separate topology constraints to disjoint subsets inside that GangGroup, so `PodCliqueScalingGroupConfig.TopologyConstraint` is rejected at admission and internal `TopologyConstraintGroupConfigs` fail closed at reconcile time. +- **Only three Koordinator topology layers are configured in this version**: topology keys can map to `hostLayer`, `rackLayer`, or `blockLayer`. Arbitrary Koordinator layers such as `acceleratorLayer`, `spineLayer`, or `datacenterLayer` require widening the backend configuration and validation. +- **Fine-grained GPU sharing is out of scope**: Whole-card `nvidia.com/gpu` requests are left unchanged because Koordinator already recognises them. The backend does not translate Grove workloads to `koordinator.sh/gpu-core` or `koordinator.sh/gpu-memory-ratio`. +- **ClusterNetworkTopology lifecycle is external**: Koordinator's network-topology plugin reads cluster-scoped topology CRDs managed by Koordinator/operator tooling. Grove does not create or update them and does not implement `TopologyAwareSchedBackend`. +- **Other Koordinator subsystems are out of scope**: Reservation, ElasticQuota, NUMA scheduling, and Pod migration are not part of the scheduler backend contract. `ReuseReservationRef` is logged and skipped in the first version. 
+ +## Proposal + +The Koordinator backend will be implemented in `operator/internal/scheduler/koordinator/` as a `scheduler.Backend` from GREP-375. It will be registered in the scheduler manager next to `kai-scheduler` and `default-scheduler`, enabled per cluster via `OperatorConfiguration.scheduler.profiles`. + +No changes to Grove's user-facing `PodCliqueSet` API are required: the user writes the same PodCliqueSet YAML as for any other backend, and selects the backend either by listing `koord-scheduler` as the default profile or by setting `schedulerName: koord-scheduler` on individual PodClique templates. + +## Design Details + +### Architecture + +``` + PodCliqueSet (user-facing) + │ + ▼ + PodCliqueSet Controller ──▶ PodGang (scheduler.grove.io) + │ + ▼ + Backend Controller + │ + ▼ + koord-scheduler backend (this GREP) + │ + ┌────────────────┼────────────────┐ + ▼ ▼ ▼ + PodGroup CR PodGroup CR PodGroup CR + (scheduling.sigs.k8s.io/v1alpha1, linked by GangGroup annotation) + │ │ │ + └────────────────┴────────────────┘ + │ + ▼ + koord-scheduler +``` + +The Grove PodCliqueSet → PodGang translation is unchanged from GREP-375. This GREP only owns the `PodGang → Koordinator PodGroup CR` step and the per-Pod preparation. + +### PodGang → Koordinator PodGroup Translation + +#### Naming convention + +For a PodGang named `{podgang}` containing a `PodGroup` named `{podgroup}` (where the Grove PodGroup maps to a PodClique), the Koordinator `PodGroup` CR is created in the same namespace with name: + +``` +{podgang}-{podgroup} +``` + +This convention is symmetric: `PreparePod` reads the pod's `grove.io/podgang` and `grove.io/podclique` labels and computes the same name to inject into the `pod-group.scheduling.sigs.k8s.io` Pod label. No separate registry or index is needed. 
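The symmetry of the naming convention can be illustrated with a minimal Go sketch. The helper names `podGroupName` and `injectPodGroupLabel` are hypothetical and are not taken from the Grove code base; only the three label keys come from the text above. The sketch assumes the value of `grove.io/podclique` is exactly the `{podgroup}` part of the name.

```go
package main

import (
	"errors"
	"fmt"
)

// Label keys quoted in the naming-convention text.
const (
	podGangLabel   = "grove.io/podgang"
	podCliqueLabel = "grove.io/podclique"
	podGroupLabel  = "pod-group.scheduling.sigs.k8s.io"
)

// podGroupName applies the {podgang}-{podgroup} rule.
// Hypothetical helper, shared by both sides of the translation.
func podGroupName(podGang, podGroup string) string {
	return podGang + "-" + podGroup
}

// injectPodGroupLabel recomputes the PodGroup CR name from the Pod's Grove
// labels and writes it to the association label, mutating labels in place.
func injectPodGroupLabel(labels map[string]string) error {
	podGang, podClique := labels[podGangLabel], labels[podCliqueLabel]
	if podGang == "" || podClique == "" {
		return errors.New("pod is missing the grove.io/podgang or grove.io/podclique label")
	}
	labels[podGroupLabel] = podGroupName(podGang, podClique)
	return nil
}

func main() {
	labels := map[string]string{podGangLabel: "demo-0", podCliqueLabel: "pworker"}
	if err := injectPodGroupLabel(labels); err != nil {
		fmt.Println("error:", err)
		return
	}
	// The Pod label matches the name the PodGroup CR would be created with.
	fmt.Println(labels[podGroupLabel] == podGroupName("demo-0", "pworker"))
}
```

Because both sides call the same pure function, the Pod label and the PodGroup CR name cannot drift apart without a code change that affects both.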
+ +#### GangGroup aggregation + +Koordinator's coscheduling plugin treats a set of PodGroups as a single atomic gang when each of them carries the `gang.scheduling.koordinator.sh/groups` annotation with a JSON array listing the full membership. + +For each PodGroup CR generated from a PodGang, the backend stamps: + +```yaml +gang.scheduling.koordinator.sh/groups: '["{namespace}/{podgang}-{podgroup-1}","{namespace}/{podgang}-{podgroup-2}", ...]' +gang.scheduling.koordinator.sh/mode: Strict # configurable +gang.scheduling.koordinator.sh/match-policy: once-satisfied # configurable +gang.scheduling.koordinator.sh/total-number: "{pod-count}" # this PodGroup's total child pod count, when known +``` + +Koordinator defines `gang.scheduling.koordinator.sh/total-number` as the total number of child pods of an individual gang, not the number of PodGroups in a GangGroup. Grove therefore derives it from `len(PodGroup.PodReferences)` when that value is known and is not less than `PodGroup.MinReplicas`; otherwise the annotation is omitted rather than guessed from the GangGroup size. + +#### PodGroup spec fields + +```yaml +apiVersion: scheduling.sigs.k8s.io/v1alpha1 +kind: PodGroup +metadata: + name: {podgang}-{podgroup} + namespace: {namespace} + labels: + grove.io/podgang: {podgang} # used by prune to narrow the List scope + annotations: # the GangGroup annotations listed above + ownerReferences: + - controller: true # owned by the PodGang +spec: + minMember: {podgroup-min-replicas} + scheduleTimeoutSeconds: 30 # configurable + priorityClassName: {priority-class-name} +``` + +Note: Koordinator's coscheduling plugin reads `MinMember` to decide how many pods must arrive before gang scheduling is attempted. It does **not** consume `MinResources`, so the backend does not populate it. The Grove PodGang itself has no schedule-timeout field, so the backend uses a configurable backend-wide default. 
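The two subsections above can be sketched together in Go: every generated PodGroup carries the same membership JSON, while `total-number` is decided per PodGroup. The types `groupInput` and `desiredPodGroup` and the function `buildGangGroup` are hypothetical simplifications, not Grove or Koordinator API types, and the sketch assumes that a pod-reference count of zero means "not yet known".

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

const annPrefix = "gang.scheduling.koordinator.sh/"

// groupInput is a hypothetical, trimmed-down view of one Grove PodGroup.
type groupInput struct {
	Name        string
	MinReplicas int32
	PodRefs     int // len(PodGroup.PodReferences); 0 is treated as unknown (assumption)
}

// desiredPodGroup is a hypothetical summary of one generated PodGroup CR.
type desiredPodGroup struct {
	Name        string
	MinMember   int32
	Annotations map[string]string
}

// buildGangGroup emits one desiredPodGroup per Grove PodGroup, all sharing
// the same GangGroup membership annotation.
func buildGangGroup(namespace, podGang, mode, matchPolicy string, groups []groupInput) ([]desiredPodGroup, error) {
	members := make([]string, 0, len(groups))
	for _, g := range groups {
		members = append(members, namespace+"/"+podGang+"-"+g.Name)
	}
	membership, err := json.Marshal(members)
	if err != nil {
		return nil, err
	}

	out := make([]desiredPodGroup, 0, len(groups))
	for _, g := range groups {
		ann := map[string]string{
			annPrefix + "groups":       string(membership),
			annPrefix + "mode":         mode,
			annPrefix + "match-policy": matchPolicy,
		}
		// total-number is per PodGroup and is omitted when the count is
		// unknown or smaller than MinReplicas.
		if g.PodRefs > 0 && int32(g.PodRefs) >= g.MinReplicas {
			ann[annPrefix+"total-number"] = strconv.Itoa(g.PodRefs)
		}
		out = append(out, desiredPodGroup{
			Name:        podGang + "-" + g.Name,
			MinMember:   g.MinReplicas,
			Annotations: ann,
		})
	}
	return out, nil
}

func main() {
	pgs, err := buildGangGroup("ml", "demo-0", "Strict", "once-satisfied", []groupInput{
		{Name: "pleader", MinReplicas: 1, PodRefs: 1},
		{Name: "pworker", MinReplicas: 4, PodRefs: 4},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, pg := range pgs {
		fmt.Println(pg.Name, pg.MinMember, pg.Annotations[annPrefix+"groups"])
	}
}
```

Computing the membership list once and reusing it for every PodGroup keeps the annotation byte-identical across the GangGroup, which is what the "full membership" requirement relies on.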
+ +#### Base vs Scaled PodGang mapping + +Grove's PodCliqueSet controller splits PodCliqueScalingGroup (PCSG) replicas into: + +- one **Base PodGang** per PodCliqueSet replica, containing the standalone cliques plus each PCSG's indices `[0, MinAvailable-1]`; +- one **Scaled PodGang** per PCSG replica beyond `MinAvailable-1` (one PodGang per replica). + +This split is entirely upstream of the backend. The backend sees one PodGang at a time and emits one GangGroup per PodGang. Every PodGang therefore becomes one Koordinator GangGroup; scaled replicas get their own GangGroup. + +### Pod Preparation + +`PreparePod` is called by the PodClique controller at Pod creation time. It performs three actions: + +1. Set `Pod.Spec.SchedulerName = "koord-scheduler"` so the pod is picked up by koord-scheduler. +2. Inject `pod-group.scheduling.sigs.k8s.io: {podgang}-{podclique}` into Pod labels. This is the label the sig-scheduler-plugins coscheduling plugin (and Koordinator, which extends it) uses to associate a Pod with its PodGroup CR. +3. Optionally inject `koordinator.sh/qosClass: {defaultQoSClass}` when `KoordinatorSchedulerConfiguration.DefaultQoSClass` is configured. + +### Topology Constraint Translation + +`buildTopologyAnnotation` is invoked at two scopes per reconcile: + +1. once for the PodGang-level `TopologyConstraint` (the global annotation, applied to every PodGroup unless overridden); +2. once per PodGroup for the per-PodGroup `TopologyConstraint` (takes precedence over the global one). + +The result is a JSON value assigned to the `gang.scheduling.koordinator.sh/network-topology-spec` annotation on the PodGroup CR. + +#### Required vs Preferred policy + +- `PackConstraint.Required` with a key that has no Koordinator layer equivalent is a **hard error**: `SyncPodGang` returns an error and the reconciler retries. Silently dropping a mandatory constraint would schedule the workload in a topology the user explicitly prohibited. 
+- `PackConstraint.Preferred` with an unmappable key is **logged at Info level and skipped** — it is advisory, not binding. + +#### Key-to-layer mapping + +Built-in (exact-match) mappings: + +| Grove topology key | Koordinator layer | +|---|---| +| `kubernetes.io/hostname` | `hostLayer` | +| `topology.kubernetes.io/rack` | `rackLayer` | +| `topology.kubernetes.io/block` | `blockLayer` | + +Substring matching is intentionally **not** used — `Contains("host")` would silently mis-map unrelated keys such as `nfs-hostpath`. + +Cluster operators can add key aliases via `KoordinatorSchedulerConfiguration.TopologyKeyMappings` (a `map[string]string` from Grove topology key to Koordinator layer name). User mappings override built-ins, but this implementation intentionally validates mapped values against `{hostLayer, rackLayer, blockLayer}`. It therefore does not expose arbitrary Koordinator topology layers such as `acceleratorLayer`, `spineLayer`, or `datacenterLayer`; supporting those layers would require widening the backend configuration and validation. + +### Admission Validation + +`ValidatePodCliqueSet` runs at the `PodCliqueSet` validating webhook: + +- **MNNVL rejection**: If the PodCliqueSet carries `grove.io/auto-mnnvl: enabled`, the webhook returns an error instructing the user to either remove the annotation or switch to `kai-scheduler`. +- **PCSG topology rejection**: If any `PodCliqueScalingGroupConfig` sets `topologyConstraint`, the webhook returns an error because this backend cannot represent per-scaling-group topology constraints within one Koordinator GangGroup. +- **Unknown Required topology key**: This check is done in `SyncPodGang` (reconcile time) rather than admission, because the backend receives resolved PodGang topology keys and applies its own key-to-layer mapping after PodGang creation. 
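The reconcile-time check in the last bullet, combined with the key-to-layer table and the Required vs Preferred policy, can be sketched as follows. The function names `lookupLayer` and `resolveLayer` and the sentinel error are hypothetical; the built-in table and the override order (user mappings before built-ins, exact match only) are taken from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// builtinLayers mirrors the built-in exact-match table.
var builtinLayers = map[string]string{
	"kubernetes.io/hostname":       "hostLayer",
	"topology.kubernetes.io/rack":  "rackLayer",
	"topology.kubernetes.io/block": "blockLayer",
}

// errUnmappedRequiredKey is a hypothetical sentinel for the hard-error case.
var errUnmappedRequiredKey = errors.New("required topology key has no Koordinator layer mapping")

// lookupLayer does an exact-match lookup; user mappings override built-ins.
func lookupLayer(key string, userMappings map[string]string) (string, bool) {
	if layer, ok := userMappings[key]; ok {
		return layer, true
	}
	layer, ok := builtinLayers[key]
	return layer, ok
}

// resolveLayer applies the policy: an unmappable Required key is an error,
// an unmappable Preferred key is skipped (empty layer, nil error).
func resolveLayer(key string, required bool, userMappings map[string]string) (string, error) {
	layer, ok := lookupLayer(key, userMappings)
	switch {
	case ok:
		return layer, nil
	case required:
		return "", fmt.Errorf("%w: %q", errUnmappedRequiredKey, key)
	default:
		// A real backend would log the skipped key at Info level here.
		return "", nil
	}
}

func main() {
	user := map[string]string{"topology.example.com/zone": "blockLayer"}
	for _, key := range []string{"kubernetes.io/hostname", "topology.example.com/zone", "nfs-hostpath"} {
		layer, err := resolveLayer(key, true, user)
		fmt.Printf("key=%s layer=%q err=%v\n", key, layer, err)
	}
}
```

Returning an error for the Required case lets `SyncPodGang` fail the reconcile and retry, whereas the Preferred case yields no annotation at all.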
+ +### OperatorConfiguration Extension + +A new scheduler name is registered next to `kai-scheduler` and `default-scheduler`: + +```go +const SchedulerNameKoordinator SchedulerName = "koord-scheduler" +``` + +A new typed config: + +```go +// KoordinatorSchedulerConfiguration is the Config payload for scheduler profile "koord-scheduler". +type KoordinatorSchedulerConfiguration struct { + // GangMode applied to all generated PodGroups. + // +kubebuilder:validation:Enum=Strict;NonStrict + GangMode string `json:"gangMode,omitempty"` + + // MatchPolicy for the GangGroup. + // +kubebuilder:validation:Enum=once-satisfied;only-waiting;waiting-and-running + MatchPolicy string `json:"matchPolicy,omitempty"` + + // ScheduleTimeoutSeconds applied to all generated PodGroups. + // +kubebuilder:validation:Minimum=1 + ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"` + + // DefaultQoSClass injects koordinator.sh/qosClass on every Pod when non-empty. + // +kubebuilder:validation:Enum=LSE;LSR;LS;BE + DefaultQoSClass string `json:"defaultQoSClass,omitempty"` + + // TopologyKeyMappings extends the built-in topology key → Koordinator layer table. + // Value must be one of: hostLayer, rackLayer, blockLayer. + // Arbitrary Koordinator layer names are not supported in this version. + TopologyKeyMappings map[string]string `json:"topologyKeyMappings,omitempty"` +} +``` + +Example OperatorConfiguration: + +```yaml +scheduler: + defaultProfileName: koord-scheduler + profiles: + - name: koord-scheduler + config: + gangMode: Strict + matchPolicy: once-satisfied + scheduleTimeoutSeconds: 60 + defaultQoSClass: LS + topologyKeyMappings: + topology.example.com/zone: blockLayer +``` + +Invalid field values (e.g. unknown `GangMode`) cause `Init()` to return an error and the operator fails to start, rather than silently running with a partial/default configuration. 
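A minimal sketch of the fail-fast validation that `Init()` is described as performing. The function `validateConfig` and the helper `oneOf` are hypothetical, the struct is a trimmed copy of the type shown above without JSON tags, and the sketch assumes that empty string fields mean "use the backend default" rather than being invalid.

```go
package main

import "fmt"

// KoordinatorSchedulerConfiguration is a trimmed copy of the type above.
type KoordinatorSchedulerConfiguration struct {
	GangMode               string
	MatchPolicy            string
	ScheduleTimeoutSeconds *int32
	DefaultQoSClass        string
	TopologyKeyMappings    map[string]string
}

// oneOf reports whether value equals one of the allowed strings.
func oneOf(value string, allowed ...string) bool {
	for _, a := range allowed {
		if value == a {
			return true
		}
	}
	return false
}

// validateConfig returns the first invalid field it finds. Empty string
// fields are accepted and left to defaulting (assumption).
func validateConfig(cfg KoordinatorSchedulerConfiguration) error {
	if cfg.GangMode != "" && !oneOf(cfg.GangMode, "Strict", "NonStrict") {
		return fmt.Errorf("invalid gangMode %q", cfg.GangMode)
	}
	if cfg.MatchPolicy != "" && !oneOf(cfg.MatchPolicy, "once-satisfied", "only-waiting", "waiting-and-running") {
		return fmt.Errorf("invalid matchPolicy %q", cfg.MatchPolicy)
	}
	if cfg.ScheduleTimeoutSeconds != nil && *cfg.ScheduleTimeoutSeconds < 1 {
		return fmt.Errorf("scheduleTimeoutSeconds must be >= 1, got %d", *cfg.ScheduleTimeoutSeconds)
	}
	if cfg.DefaultQoSClass != "" && !oneOf(cfg.DefaultQoSClass, "LSE", "LSR", "LS", "BE") {
		return fmt.Errorf("invalid defaultQoSClass %q", cfg.DefaultQoSClass)
	}
	for key, layer := range cfg.TopologyKeyMappings {
		if !oneOf(layer, "hostLayer", "rackLayer", "blockLayer") {
			return fmt.Errorf("topologyKeyMappings[%q]: unsupported layer %q", key, layer)
		}
	}
	return nil
}

func main() {
	timeout := int32(60)
	cfg := KoordinatorSchedulerConfiguration{
		GangMode:               "Strict",
		MatchPolicy:            "once-satisfied",
		ScheduleTimeoutSeconds: &timeout,
		DefaultQoSClass:        "LS",
		TopologyKeyMappings:    map[string]string{"topology.example.com/zone": "blockLayer"},
	}
	fmt.Println("valid:", validateConfig(cfg) == nil)

	cfg.TopologyKeyMappings["topology.example.com/spine"] = "spineLayer"
	fmt.Println("error:", validateConfig(cfg))
}
```

Running the same checks in `Init()` in addition to the kubebuilder markers covers configurations that reach the operator without passing through CRD schema validation.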
+ +### Lifecycle and Ownership + +- **PodGang create**: The Backend Controller reconciles the new PodGang and calls `SyncPodGang`. The backend creates one Koordinator PodGroup CR per Grove PodGroup, each with an OwnerReference to the PodGang. +- **PodGang update** (PodGroups added or removed between reconciles): `SyncPodGang` first creates/updates every desired PodGroup, then runs `pruneOrphanedPodGroups` to delete any stale PodGroups owned by this PodGang. The create-then-prune ordering ensures a failure mid-loop never deletes a still-valid PodGroup. +- **PodGang delete**: OwnerReference drives Kubernetes garbage collection; `OnPodGangDelete` is a no-op. +- **Ownership safety**: `createOrUpdatePodGroup` refuses to overwrite a PodGroup whose controller OwnerReference UID differs from the desired one, preventing one PodGang from clobbering another workload's PodGroup. +- **Metadata preservation**: Updates preserve existing labels, annotations, and finalizers that are not owned by Grove, then overlay the desired Grove-managed labels and annotations. + +### RBAC + +The backend needs `get`, `list`, `watch`, `create`, `update`, `delete` on `podgroups.scheduling.sigs.k8s.io`. The Helm chart grants these verbs unconditionally in the operator `ClusterRole`; Kubernetes accepts RBAC rules for resources whose CRDs are not installed, so the rule is harmless on clusters that do not use Koordinator. + +### Test Plan + +#### Unit Tests + +Unit tests live under `operator/internal/scheduler/koordinator/`: + +- `backend_test.go`: interface wiring (Name, Init error surfacing). +- `config_test.go`: parse and defaulting behaviour for every field in `KoordinatorSchedulerConfiguration`, including invalid-value error paths. +- `pod_test.go`: `PreparePod` injects `SchedulerName`, `pod-group` label, and (when configured) the QoS label. 
+- `podgroup_test.go`: single-group, multi-group, per-PodGroup total child count annotation, topology (global vs per-PodGroup override, Required hard-fail, Preferred soft-skip), GangGroup membership JSON, metadata-preserving update, prune (shrink and orphan-with-different-UID), `TopologyConstraintGroupConfigs` rejection, `ReuseReservationRef` warning event. +- `validation_test.go`: MNNVL rejection and PCSG-level topology rejection. + +#### E2E Tests + +E2E tests live under `operator/e2e/tests/koordinator/`: + +- `gang_scheduling_test.go`: end-to-end gang scheduling of a multi-clique workload on a cluster running Koordinator, asserting PodGroup creation, GangGroup annotations, per-PodGroup child-count annotations, gang blocking while no worker nodes are schedulable, and MNNVL admission rejection. +- These tests assume an externally prepared `koordinator-grove` cluster and Grove operator deployment with the `koord-scheduler` profile enabled. They connect to the existing cluster instead of creating the cluster, installing Koordinator, or deploying Grove from the test binary. + +Reusable sample workloads live at `operator/e2e/yaml/workload-koord.yaml`. + +### Graduation Criteria + +#### Alpha + +- `scheduler.Backend` implementation lands with gang scheduling, single-layer topology, QoS label injection, MNNVL and PCSG topology rejection, owner-reference cleanup, and unit tests covering all branches. +- Registered as a valid scheduler name in OperatorConfiguration. +- Sample workload and e2e smoke test running against a Koordinator cluster. + +#### Beta + +- E2E suite covers multi-clique, multi-PCSG, topology Required + Preferred, and backend-switching scenarios. +- Documented operator playbook for deploying Grove on an existing Koordinator cluster. +- Ownership and prune semantics battle-tested against at least one production-style workload. 
+ +#### GA + +- Grove `TopologyConstraint` API is evolved (in a follow-up GREP) to a form that can express multi-layer rules and pod-index alignment, and the Koordinator backend is updated to use it. +- Stable for two consecutive Grove releases with no breaking backend-interface changes. + +## Alternatives + +### Users write Koordinator annotations directly + +**Rejected.** Requiring users to hand-author `gang.scheduling.koordinator.sh/*` annotations on every PodClique template defeats the purpose of Grove's unified API. It also leaks scheduler-specific syntax into the workload manifest, making workloads non-portable across backends. + +### Map PodGang to a single PodGroup + +**Rejected.** A single PodGroup would flatten the per-PodClique `MinReplicas`, losing the ability for Koordinator's coscheduling plugin to compute per-role gang satisfaction. Producing N PodGroups and tying them into a GangGroup preserves per-role semantics while still gating the overall gang. + +### Wait for Koordinator to natively support PodGang + +**Rejected.** There is no upstream proposal for Koordinator to consume Grove's PodGang CRD, and even if one existed, users need a solution today. GREP-375's backend framework is designed precisely for translation layers like this. 
+ +## Appendix + +### Reference: Koordinator annotations used + +| Annotation | Purpose | +|---|---| +| `gang.scheduling.koordinator.sh/groups` | JSON array of `"ns/podgroup"` strings declaring GangGroup membership | +| `gang.scheduling.koordinator.sh/mode` | `Strict` or `NonStrict` gang failure handling | +| `gang.scheduling.koordinator.sh/match-policy` | When the GangGroup is considered satisfied | +| `gang.scheduling.koordinator.sh/total-number` | Total children number for the individual Koordinator gang, derived from that Grove PodGroup's pod references when known | +| `gang.scheduling.koordinator.sh/network-topology-spec` | JSON-serialised single-layer topology constraint | +| `pod-group.scheduling.sigs.k8s.io` (Pod label) | Associates a Pod with its PodGroup CR | +| `koordinator.sh/qosClass` (Pod label) | Koordinator QoS class | + +### Reference: Example translation + +Given the sample `multi-node-disaggregated` PodCliqueSet (4 cliques: `pleader`, `pworker`, `dleader`, `dworker`; 2 PCSGs: `prefill`, `decode`), the PodCliqueSet controller produces one Base PodGang named e.g. `demo-0` and (as PCSG replicas scale) additional Scaled PodGangs such as `demo-0-prefill-1`. + +The Koordinator backend then produces, for the Base PodGang: + +``` +demo-0-pleader (MinMember: 1) ┐ +demo-0-pworker (MinMember: 4) │ all four PodGroups carry the same +demo-0-dleader (MinMember: 1) │ gang.scheduling.koordinator.sh/groups +demo-0-dworker (MinMember: 3) ┘ listing all four members +``` + +Each Scaled PodGang produces its own disjoint GangGroup containing just that replica's PodGroups. + +> NOTE: This GREP template has been inspired by [KEP Template](https://github.com/kubernetes/enhancements/blob/f90055d254c356b2c038a1bdf4610bf4acd8d7be/keps/NNNN-kep-template/README.md).