test(e2e): support multiple scheduler backends by brluobt · Pull Request #595 · ai-dynamo/grove

brluobt · 2026-05-09T07:11:30Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Restructures the E2E suite from KAI-only to support multiple scheduler backends — KAI as primary, default-scheduler as the first additional backend. Continuation of @kangclzjc's prototype in #584 (closed) with a small polish commit on top.

Three-tier test classification (per RequireCapability runtime gating):

- Agnostic (CM, CRD): primary backend only, since these tests have no scheduler-specific code path.
- Sensitive (RU, OD, SO): every enabled backend, since behavior may diverge subtly across implementations.
- Capability-gated (GS, TAS, AutoMNNVL): only backends that declare the capability; RequireCapability(t, ...) auto-skips otherwise.
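To make the capability-gated tier concrete, here is a minimal sketch of how such a gate could behave. It is not the code from this PR: the `backendCapabilities` table, the `skipper` interface, the `recorder` fake, and the extra `backend` parameter are all hypothetical, introduced only so the example runs without a live cluster. The PR's real helper is called as `RequireCapability(t, GangScheduling)` and resolves the active backend itself.

```go
package main

import "fmt"

// Capability names a scheduler feature that a test may depend on.
type Capability string

const (
	GangScheduling          Capability = "GangScheduling"
	TopologyAwareScheduling Capability = "TopologyAwareScheduling"
)

// backendCapabilities is a hypothetical stand-in for the hardcoded
// backend-to-capability table described in this PR.
var backendCapabilities = map[string]map[Capability]bool{
	"kai-scheduler":     {GangScheduling: true, TopologyAwareScheduling: true},
	"default-scheduler": {},
}

// skipper is the subset of *testing.T that the gate needs (assumption:
// the real helper takes a *testing.T directly).
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// requireCapability skips the calling test when the backend does not
// declare the capability. The backend is passed explicitly here; the
// PR's helper discovers it from the live OperatorConfiguration.
func requireCapability(t skipper, backend string, c Capability) {
	t.Helper()
	if !backendCapabilities[backend][c] {
		t.Skipf("backend %q does not provide %s", backend, c)
	}
}

// recorder is a fake skipper that records the skip instead of
// stopping the test, so the gate can be exercised from main.
type recorder struct {
	skipped bool
	reason  string
}

func (r *recorder) Helper() {}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = true
	r.reason = fmt.Sprintf(format, args...)
}

func main() {
	for _, backend := range []string{"kai-scheduler", "default-scheduler"} {
		r := &recorder{}
		requireCapability(r, backend, GangScheduling)
		fmt.Printf("%s: skipped=%v %s\n", backend, r.skipped, r.reason)
	}
}
```

With a gate of this shape, agnostic and sensitive tests need no gate at all; only capability-gated suites call it at the top of each test.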

Six configuration layers touched (purely additive — no Makefile/test-runner/Helm-rendering changes):

- Workload YAMLs (×22): drop schedulerName: kai-scheduler; the operator's PreparePod() injects it from defaultProfileName. The same workload YAML now runs unmodified on every backend.
- Skaffold profiles: rename topology-test → e2e-kai; add e2e-default-scheduler.
- infra-manager presets: new hack/e2e-default-scheduler.yaml overlay; KAI image prepull gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.
- Capability discovery: operator/e2e/tests/{capabilities.go, capability_discovery.go}; a cross-check unit test in capabilities_test.go fails the build if the hardcoded backend→capability table drifts from the actual Go interface assertions.
- Test gates: RequireCapability(t, GangScheduling) etc. added to the GS and TAS suites.
- CI matrix: new create_flags field threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS. New default-scheduler rows: rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler. The e2e-skip matrix mirrors them for branch protection.
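The cross-check idea in the capability discovery item can be sketched as follows. This is an illustration under stated assumptions, not the contents of capabilities_test.go: the `Backend`, `GangBackend`, and `TopologyAwareBackend` interfaces here are simplified local stand-ins (the real ones live in the operator's scheduler package), and `caps`, `observed`, and `drift` are hypothetical names.

```go
package main

import "fmt"

// Backend is a simplified stand-in for a registered scheduler backend.
type Backend interface{ Name() string }

// GangBackend and TopologyAwareBackend are hypothetical marker
// interfaces; a backend opts in to a capability by implementing one.
type GangBackend interface {
	Backend
	SupportsGang()
}

type TopologyAwareBackend interface {
	Backend
	SupportsTopology()
}

type kaiBackend struct{}

func (kaiBackend) Name() string      { return "kai-scheduler" }
func (kaiBackend) SupportsGang()     {}
func (kaiBackend) SupportsTopology() {}

type kubeBackend struct{}

func (kubeBackend) Name() string { return "default-scheduler" }

// caps is one row of the hardcoded capability table.
type caps struct{ Gang, Topology bool }

var table = map[string]caps{
	"kai-scheduler":     {Gang: true, Topology: true},
	"default-scheduler": {},
}

// observed derives capabilities from Go interface assertions.
func observed(b Backend) caps {
	_, gang := b.(GangBackend)
	_, topo := b.(TopologyAwareBackend)
	return caps{Gang: gang, Topology: topo}
}

// drift reports every backend whose table row disagrees with what
// its type actually implements, or that has no row at all.
func drift(backends []Backend, table map[string]caps) []string {
	var out []string
	for _, b := range backends {
		want, ok := table[b.Name()]
		if !ok {
			out = append(out, b.Name()+": missing from table")
			continue
		}
		if got := observed(b); got != want {
			out = append(out, fmt.Sprintf("%s: table says %+v, interfaces say %+v", b.Name(), want, got))
		}
	}
	return out
}

func main() {
	fmt.Println(drift([]Backend{kaiBackend{}, kubeBackend{}}, table))
}
```

A unit test built on a check like `drift` fails as soon as someone adds an interface to a backend without updating the table, or the other way around, which is the drift the PR description says it guards against.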

Which issue(s) this PR fixes:

Refs #594

(This PR does not resolve #594. The issue tracks ongoing multi-backend E2E work including path-filtered PR matrix and nightly runs as follow-ups.)

Special notes for your reviewer:

Two commits in this PR:

1. `test(e2e): support multiple scheduler backends`, cherry-picked from @kangclzjc's branch, with the message prefix updated from GREP: to test(e2e):. Author preserved as Kang.
2. `test(e2e): polish multi-backend matrix naming and infra preset`, two cosmetic items:
   - CI matrix: rename _default → _default-scheduler (full backend name in the GHA UI; matches the configv1alpha1.SchedulerNameKube value).
   - hack/e2e-default-scheduler.yaml: replace a dangling reference to a non-existent design doc with a concrete activation pointer.

Merge strategy: please prefer rebase-merge over squash to preserve @kangclzjc's authorship on commit 1. If squash is required, the trailer block should retain Co-authored-by: Kang Zhang <kangz@nvidia.com> and both Signed-off-by lines.

Out of scope for this PR (deferred to follow-ups, tracked by #594):

- Path-aware PR matrix: today the matrix runs all configured rows when operator/** or .github/** changes. Refining to backend-aware filters (scheduler/kai/** → KAI rows only, etc.) is a separate PR.
- Nightly workflow: exhaustive (suite × capable backend) matrix on a schedule.
- Scheduler version coverage: the capability table currently assumes a single pinned version per backend; multi-version coverage is a known gap to address later.

L20 validation ✅ — lightweight validation on a single-server k3d (l20-6, 30 KWOK workers) completed:

Run 1 — KAI baseline (cert_management + crd_installer): PASS
Run 2 — default-scheduler new path (rolling_updates): PASS

Full evidence (per-test results, what each run proves, environment notes) is in the validation summary comment.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

copy-pr-bot · 2026-05-09T07:11:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

brluobt · 2026-05-09T08:47:05Z

L20 validation: ✅ all PASS

Lightweight validation on a single-server k3d cluster (l20-6, 30 KWOK worker nodes).

Results

| Run | Backend | Tests | Result |
| --- | --- | --- | --- |
| 1 | KAI (primary) | `Test_CM1_CertManagementRoundTrip`, `Test_CRD_Installer_AllCRDsExist`, `Test_CRD_Installer_InitContainerCompleted`, `Test_CRD_Installer_Idempotent` | ✅ PASS |
| 2 | default-scheduler (new framework path) | All 20 `Test_RU*` (rolling_updates) | ✅ PASS |

What each run proves

Run 1 (KAI agnostic): removing hardcoded schedulerName: kai-scheduler from the 22 workload YAMLs and relying on PreparePod() to inject from defaultProfileName does not regress the KAI baseline. cert_management exercises the webhook TLS round-trip with workload deployment; crd_installer exercises the init-container path. Both pass end-to-end on KAI.

Run 2 (default-scheduler): the new framework end-to-end:

- Skaffold profile e2e-default-scheduler deploys the operator with defaultProfileName: default-scheduler.
- infra-manager preset hack/e2e-default-scheduler.yaml skips KAI installation; KAI image prepull is correctly gated off via cfg.scheduler.kai.enabled.
- DiscoverCapabilities() reads the live OperatorConfiguration and resolves the active backend.
- PreparePod() injects default-scheduler into pod specs from the unmodified workload YAMLs.
- All 20 rolling_updates tests pass against default-scheduler, confirming that sensitive-tier behavior is sound on the new backend.
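The injection step in the list above can be pictured with a small sketch. Everything here is hypothetical: `PodSpec` is a local stand-in for the Kubernetes type, `injectScheduler` is an illustrative name, and the rule that an explicitly set scheduler name is left alone is an assumption; the thread only states that PreparePod() injects the name from defaultProfileName.

```go
package main

import "fmt"

// PodSpec is a local stand-in for the Kubernetes pod spec, reduced
// to the one field this example cares about.
type PodSpec struct {
	SchedulerName string
}

// injectScheduler fills in the scheduler name from the operator's
// default profile when the workload YAML left it empty.
// Assumption: an explicitly set name is preserved.
func injectScheduler(spec *PodSpec, defaultProfileName string) {
	if spec.SchedulerName == "" {
		spec.SchedulerName = defaultProfileName
	}
}

func main() {
	// The same "workload" (no schedulerName) lands on whichever
	// backend the operator was deployed with.
	for _, profile := range []string{"kai-scheduler", "default-scheduler"} {
		spec := PodSpec{}
		injectScheduler(&spec, profile)
		fmt.Printf("defaultProfileName=%s -> schedulerName=%s\n", profile, spec.SchedulerName)
	}
}
```

This is why removing the hardcoded schedulerName from the workload YAMLs lets one file serve every backend: the backend choice moves from the workload to the operator configuration.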

Notes

- The first attempt at Run 1 failed at make run-e2e with /bin/sh: syntax error near unexpected token '('. The root cause was unrelated to the framework: grove's run-e2e-full Makefile target expands $(TEST_PATTERN) unquoted, so the original pattern ^(Test_CM|Test_CRD_Installer) was reparsed by sh as a subshell. Re-ran with the equivalent prefix-only pattern ^Test_C (only Test_CM* and Test_CRD_Installer* start with Test_C in operator/e2e/tests/); all 4 tests passed.
- Both runs created and tore down their own k3d cluster (shared-e2e-test-cluster), leaving no leftover state.
- This was lightweight validation by design. It does not exercise gang_scheduling, topology_aware_scheduling, auto_mnnvl, or resource_sharing (capability-gated, expected to skip on default-scheduler via RequireCapability). The CI matrix in this PR will exercise those on KAI.
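The pattern substitution in the first note can be sanity-checked with a short sketch using Go's regexp package, which is the engine `go test -run` uses. The four `Test_C*` names come from the results table above; the `Test_RU1_Example` and `Test_GS1_Example` names are hypothetical placeholders for tests outside the selection, and `selectTests` is an illustrative helper. The sample cannot prove that no other test in operator/e2e/tests/ starts with Test_C; that claim rests on the note itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// selectTests returns the names matched by the pattern, in input order.
func selectTests(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{
		"Test_CM1_CertManagementRoundTrip",
		"Test_CRD_Installer_AllCRDsExist",
		"Test_CRD_Installer_InitContainerCompleted",
		"Test_CRD_Installer_Idempotent",
		"Test_RU1_Example", // hypothetical name, must not match
		"Test_GS1_Example", // hypothetical name, must not match
	}
	original := selectTests(`^(Test_CM|Test_CRD_Installer)`, names)
	prefixOnly := selectTests(`^Test_C`, names)
	fmt.Println(len(original), len(prefixOnly))
}
```

The prefix-only form avoids parentheses and the pipe character entirely, so it survives the unquoted Makefile expansion without the shell reinterpreting it.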

kangclzjc · 2026-05-18T01:14:40Z

+          # --- kai-scheduler (primary backend: agnostic + sensitive + KAI-supported capabilities) ---
          - test_name: gang_scheduling
            test_pattern: "^Test_GS"
+            create_flags: ""


If this is empty, can we just not set this parameter?

Good catch — technically yes, GitHub Actions silently expands an undeclared matrix field to an empty string, so we could drop these 8 lines and the workflow would still run.

I left them in for two reasons:

1. Symmetry with the default-scheduler block right below, which uses this same field with a non-empty value (-f hack/e2e-default-scheduler.yaml). Keeping create_flags explicit on every row makes the matrix read as "two backends along the same axis" rather than making default-scheduler look like a special case.

2. Forward compatibility: when a third backend lands (e.g. the Volcano backend from GREP-0376, "feat: Add GREP-0376 for Volcano scheduler backend" #560), it will just be another create_flags value, with no schema change needed for the include block.

Happy to remove them if you'd rather optimize for line count over the explicit-axis read — let me know which way you prefer.

(Also: pushed a rebase onto current main just now to clear the conflict from #601 / #616.)
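For readers unfamiliar with why an empty create_flags is harmless, here is a minimal sketch of how an empty versus non-empty flag string turns into arguments. It is an illustration only: `setupArgs` is a hypothetical helper, and the actual plumbing from the matrix field through E2E_CREATE_FLAGS to infra-manager.py setup lives in the workflow and Makefile, not in Go.

```go
package main

import (
	"fmt"
	"strings"
)

// setupArgs builds the argument list for a hypothetical "setup"
// invocation. An empty flag string contributes no arguments, so a
// row with create_flags: "" behaves like a row that omits the field.
func setupArgs(createFlags string) []string {
	args := []string{"setup"}
	return append(args, strings.Fields(createFlags)...)
}

func main() {
	fmt.Println(setupArgs(""))
	fmt.Println(setupArgs("-f hack/e2e-default-scheduler.yaml"))
}
```

Under this model the explicit empty value is purely documentary, which matches the trade-off described in the reply: it costs lines but keeps every row on the same axis.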

kangclzjc · 2026-05-18T01:16:32Z

+          # of whether E2E ran or was skipped.
          - test_name: gang_scheduling
          - test_name: rolling_updates
+          - test_name: ondelete_updates


Is this because we didn't run the ondelete_updates test before?

Not new — the Test_OD* (ondelete updates) tests have existed in operator/e2e/tests/ for a while and are already part of the e2e job's matrix. The reason ondelete_updates is appearing in the e2e-skip matrix in this PR is that one of the goals here is to make the e2e-skip job's test_name list a 1:1 mirror of the e2e job's, so the same set of branch-protection check names resolve regardless of whether E2E ran or was skipped.

I just verified the two matrices are aligned now — both list the same 12 entries (including ondelete_updates and the three *_default-scheduler variants). The comment block at the top of e2e-skip flags this contract so future contributors know to update both blocks together when adding a backend or test.
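The 1:1 mirror contract between the e2e and e2e-skip matrices lends itself to a mechanical check. The sketch below is one possible shape, not something this PR adds: `mirrorDiff` is a hypothetical helper, and the entry names are copied from the Task column of the results table later in this thread on the assumption that they match the test_name values.

```go
package main

import (
	"fmt"
	"sort"
)

// mirrorDiff compares the test_name lists of the two matrices and
// returns the names present in only one of them, sorted.
func mirrorDiff(e2e, skip []string) (missingFromSkip, extraInSkip []string) {
	inE2E := make(map[string]bool, len(e2e))
	for _, n := range e2e {
		inE2E[n] = true
	}
	inSkip := make(map[string]bool, len(skip))
	for _, n := range skip {
		inSkip[n] = true
	}
	for n := range inE2E {
		if !inSkip[n] {
			missingFromSkip = append(missingFromSkip, n)
		}
	}
	for n := range inSkip {
		if !inE2E[n] {
			extraInSkip = append(extraInSkip, n)
		}
	}
	sort.Strings(missingFromSkip)
	sort.Strings(extraInSkip)
	return missingFromSkip, extraInSkip
}

func main() {
	e2e := []string{
		"gang_scheduling", "rolling_updates", "ondelete_updates",
		"startup_ordering", "Topology_Aware_Scheduling", "cert_management",
		"auto_mnnvl", "crd_installer", "resource_sharing",
		"rolling_updates_default-scheduler",
		"ondelete_updates_default-scheduler",
		"startup_ordering_default-scheduler",
	}
	// Simulate a contributor forgetting to mirror the last row.
	skip := append([]string(nil), e2e[:len(e2e)-1]...)
	missing, extra := mirrorDiff(e2e, skip)
	fmt.Println("missing from e2e-skip:", missing)
	fmt.Println("extra in e2e-skip:", extra)
}
```

A check like this could back up the comment block at the top of e2e-skip, so that a forgotten row fails fast instead of leaving a branch-protection check name unresolved.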

brluobt · 2026-05-26T11:49:32Z

Marking ready for review after a full local E2E pass on the rebased branch.

Rebase summary

- 20b381c: rebased onto current main. One merge conflict in operator/e2e/tests/topology_test.go was resolved by keeping the upstream rename (Test_TAS21_TopologyValidationWebhooks, from "change TopologyConstraint.TopologyName to optional" #601 / "Rename ClusterTopology CRD to ClusterTopologyBinding" #616) and merging in the RequireCapability(t, TopologyAwareScheduling) gate from this PR.
- 3d2a85a: small follow-up build fix. Upstream also renamed scheduler.TopologyAwareSchedBackend → scheduler.TopologyAwareBackend (from the scheduler-registry refactor in "Replace scheduler backend global mutable state with well defined interface and dependency injection" #512). The topology_test.go resolution caught one site; this commit fixes the matching type assertion in capabilities_test.go:91 plus two doc-comment references. No behavior change.

E2E results (12/12 PASS)

Ran the full CI matrix locally on a 128-core / 780 GB node, total 81 min. All 12 entries passed:

| # | Task | Result | Time |
| --- | --- | --- | --- |
| 1 | gang_scheduling | PASS | 984s |
| 2 | rolling_updates | PASS | 468s |
| 3 | ondelete_updates | PASS | 427s |
| 4 | startup_ordering | PASS | 312s |
| 5 | Topology_Aware_Scheduling | PASS | 500s |
| 6 | cert_management | PASS | 170s |
| 7 | auto_mnnvl | PASS | 597s |
| 8 | crd_installer | PASS | 137s |
| 9 | resource_sharing | PASS | 148s |
| 10 | rolling_updates_default-scheduler | PASS | 436s |
| 11 | ondelete_updates_default-scheduler | PASS | 398s |
| 12 | startup_ordering_default-scheduler | PASS | 282s |

Capability gate spot-check

Verified the RequireCapability gate behaves correctly in the positive path on both backends:

- The log lines "Active backend: kai-scheduler" and "Active backend: default-scheduler" show that DiscoverCapabilities resolves correctly per backend.
- KAI tasks: RequireCapability(t, GangScheduling) is hit 12× and RequireCapability(t, TopologyAwareScheduling) is hit 21×, all PASS, zero unexpected SKIP. In other words, the capability table reports kai-scheduler → {GS=true, TAS=true} correctly.
- default-scheduler tasks: zero unexpected SKIPs across the three sensitive-tier patterns (^Test_RU / ^Test_OD / ^Test_SO). This is by design: these tests carry no capability gate and run against every enabled backend; capability gates are scoped to gang_scheduling_test.go and topology_test.go.
- The unsupported-capability path (e.g. default-scheduler + ^Test_GS) is not exercised by the current matrix, by design.

Per-task logs retained locally; happy to dig into specifics if any reviewer wants to spot-check.

Restructures the E2E suite from KAI-only to N-backend ready, landing KAI (primary) and default-scheduler. KAI rows are functionally unchanged; default-scheduler rows light up new sensitive coverage (RU/OD/SO).

Skaffold adds an e2e-default-scheduler profile alongside e2e-kai. Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files); the operator's PreparePod() assigns from defaultProfileName. The new hack/e2e-default-scheduler.yaml infra-manager preset disables KAI install/queues, and _run_prepull's KAI image list is now gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.

A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go auto-skips capability-gated tests when the active backend does not provide the required capability. DiscoverCapabilities reads the live OperatorConfiguration via grove/config and joins it with a hardcoded backend->interface table; a unit test cross-checks the table against actual Go interface assertions for every registered backend.

The CI matrix grows three default-scheduler rows (rolling_updates, ondelete_updates, startup_ordering) via a new create_flags field that threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs in lockstep.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>

Two small polish items on top of the previous commit:

1. CI matrix: rename the three default-scheduler rows from `*_default` to `*_default-scheduler` (rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler) in both the e2e and e2e-skip matrices. The longer suffix matches the actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler") and removes ambiguity in the GitHub Actions UI when scanning a long matrix at a glance.
2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a non-existent design proposal with a concrete activation pointer (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).

Signed-off-by: Bruce Luo <brluobt@gmail.com>

…face

Upstream main renamed scheduler.TopologyAwareSchedBackend to scheduler.TopologyAwareBackend. The rebase of this branch resolved topology_test.go but missed the same reference in capabilities_test.go, leaving the e2e/tests package un-buildable on top of current main.

Updates the one code site (capabilities_test.go:91 type assertion) and two doc-comment references for consistency. No behavior change.

This was referenced May 9, 2026

[E2E] Multi-backend E2E test framework #594

Open

feat(e2e): path-aware PR matrix selector + nightly workflow (draft) #600

Draft

brluobt force-pushed the e2e-multi-scheduler-backend branch from e49cea1 to 6aababc Compare May 12, 2026 07:47

kangclzjc reviewed May 18, 2026


brluobt force-pushed the e2e-multi-scheduler-backend branch from 6aababc to a40d1a7 Compare May 26, 2026 06:13

brluobt marked this pull request as ready for review May 26, 2026 11:49

brluobt requested review from Ronkahn21, danbar2, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners May 26, 2026 11:49

brluobt changed the title ~~test(e2e): support multiple scheduler backends (draft)~~ test(e2e): support multiple scheduler backends Jun 1, 2026

kangclzjc and others added 3 commits June 1, 2026 06:37

brluobt force-pushed the e2e-multi-scheduler-backend branch from 3d2a85a to 00693ae Compare June 1, 2026 06:38

brluobt wants to merge 3 commits into ai-dynamo:main from brluobt:e2e-multi-scheduler-backend
