
test(e2e): support multiple scheduler backends#595

Open
brluobt wants to merge 3 commits into ai-dynamo:main from brluobt:e2e-multi-scheduler-backend


@brluobt brluobt commented May 9, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Restructures the E2E suite from KAI-only to support multiple scheduler backends — KAI as primary, default-scheduler as the first additional backend. Continuation of @kangclzjc's prototype in #584 (closed) with a small polish commit on top.

Three-tier test classification (per RequireCapability runtime gating):

  • Agnostic (CM, CRD): primary backend only — no scheduler-specific code path.
  • Sensitive (RU, OD, SO): every enabled backend — behavior may diverge subtly across implementations.
  • Capability-gated (GS, TAS, AutoMNNVL): backends that declare the capability — RequireCapability(t, ...) auto-skips otherwise.
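As a rough illustration of the three tiers above, the following Go sketch models which backends a suite would run on. The `tier` type, the `backend` struct, and `backendsFor` are hypothetical names chosen for this example and are not code from this PR; the capability sets assigned to each backend are an assumption based on the description, not a copy of the real table.

```go
package main

import "fmt"

// tier is a hypothetical label for the three test classes described above.
type tier int

const (
	agnostic        tier = iota // runs on the primary backend only
	sensitive                   // runs on every enabled backend
	capabilityGated             // runs only where the capability is declared
)

// backend is an illustrative stand-in for one enabled scheduler backend.
type backend struct {
	name    string
	primary bool
	caps    map[string]bool
}

// backendsFor returns the backends a suite of the given tier would run on.
// requiredCap is only consulted for capability-gated suites.
func backendsFor(t tier, requiredCap string, enabled []backend) []string {
	var out []string
	for _, b := range enabled {
		switch t {
		case agnostic:
			if b.primary {
				out = append(out, b.name)
			}
		case sensitive:
			out = append(out, b.name)
		case capabilityGated:
			// Reading a nil map is safe in Go and yields false.
			if b.caps[requiredCap] {
				out = append(out, b.name)
			}
		}
	}
	return out
}

func main() {
	// Assumed capability sets, for illustration only.
	enabled := []backend{
		{name: "kai-scheduler", primary: true, caps: map[string]bool{
			"GangScheduling": true, "TopologyAwareScheduling": true}},
		{name: "default-scheduler"},
	}
	fmt.Println("CM (agnostic):", backendsFor(agnostic, "", enabled))
	fmt.Println("RU (sensitive):", backendsFor(sensitive, "", enabled))
	fmt.Println("GS (gated):", backendsFor(capabilityGated, "GangScheduling", enabled))
}
```

In the PR itself the gated tier is decided at runtime by RequireCapability inside each test rather than by a planning function like this one; the sketch only shows the intended outcome.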

Six configuration layers touched (purely additive — no Makefile/test-runner/Helm-rendering changes):

  1. Workload YAMLs (×22): drop schedulerName: kai-scheduler; operator's PreparePod() injects from defaultProfileName. Same workload YAML now runs unmodified on every backend.
  2. Skaffold profiles: rename topology-test to e2e-kai; add e2e-default-scheduler.
  3. infra-manager presets: new hack/e2e-default-scheduler.yaml overlay; KAI image prepull gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull.
  4. Capability discovery: operator/e2e/tests/{capabilities.go, capability_discovery.go}; cross-check unit test in capabilities_test.go fails the build if the hardcoded backend→capability table drifts from actual Go interface assertions.
  5. Test gates: RequireCapability(t, GangScheduling) etc. added to GS and TAS suites.
  6. CI matrix: new create_flags field threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS. default-scheduler rows: rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler. e2e-skip mirrors for branch protection.
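The cross-check in item 4 can be pictured roughly as follows. This is a minimal sketch, not the contents of capabilities_test.go: the interfaces `gangBackend` and `topologyAwareBackend`, their methods, the two backend structs, and the `drift` helper are all hypothetical stand-ins. The idea it illustrates is comparing a hardcoded backend-to-capability table against what Go type assertions report.

```go
package main

import "fmt"

// Capability names follow the PR description; the type itself is illustrative.
type Capability string

const (
	GangScheduling          Capability = "GangScheduling"
	TopologyAwareScheduling Capability = "TopologyAwareScheduling"
)

// Hypothetical capability interfaces (the real ones live in the scheduler package).
type gangBackend interface{ ScheduleGang() }
type topologyAwareBackend interface{ PlaceByTopology() }

// Hypothetical backend implementations.
type kaiBackend struct{}

func (kaiBackend) ScheduleGang()    {}
func (kaiBackend) PlaceByTopology() {}

type defaultBackend struct{}

// implemented derives capabilities from type assertions on the backend value.
func implemented(b any) map[Capability]bool {
	_, gs := b.(gangBackend)
	_, tas := b.(topologyAwareBackend)
	return map[Capability]bool{GangScheduling: gs, TopologyAwareScheduling: tas}
}

// drift lists every (backend, capability) pair where the hardcoded table
// disagrees with the interfaces the backend actually implements.
func drift(table map[string]map[Capability]bool, registered map[string]any) []string {
	var out []string
	for name, b := range registered {
		for c, has := range implemented(b) {
			if table[name][c] != has {
				out = append(out, fmt.Sprintf("%s/%s: table=%v code=%v", name, c, table[name][c], has))
			}
		}
	}
	return out
}

func main() {
	registered := map[string]any{
		"kai-scheduler":     kaiBackend{},
		"default-scheduler": defaultBackend{},
	}
	table := map[string]map[Capability]bool{
		"kai-scheduler":     {GangScheduling: true, TopologyAwareScheduling: true},
		"default-scheduler": {},
	}
	fmt.Println("mismatches:", len(drift(table, registered)))
}
```

A unit test built on this pattern would fail whenever `drift` returns a non-empty slice, which is what keeps the table honest as backends gain or lose interfaces.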

Which issue(s) this PR fixes:

Refs #594

(This PR does not resolve #594. The issue tracks ongoing multi-backend E2E work including path-filtered PR matrix and nightly runs as follow-ups.)

Special notes for your reviewer:

Two commits in this PR:

  1. test(e2e): support multiple scheduler backends — cherry-picked from @kangclzjc's branch, message prefix updated from GREP: to test(e2e):. Author preserved as Kang.
  2. test(e2e): polish multi-backend matrix naming and infra preset — two cosmetic items:
    • CI matrix: rename *_default to *_default-scheduler (full backend name in the GHA UI; matches configv1alpha1.SchedulerNameKube value).
    • hack/e2e-default-scheduler.yaml: replace dangling reference to a non-existent design doc with a concrete activation pointer.

Merge strategy: please prefer rebase-merge over squash to preserve @kangclzjc's authorship on commit 1. If squash is required, the trailer block should retain Co-authored-by: Kang Zhang <kangz@nvidia.com> and both Signed-off-by lines.

Out of scope for this PR (deferred to follow-ups, tracked by #594):

  • Path-aware PR matrix: today the matrix runs all configured rows when operator/** or .github/** changes. Refining to backend-aware filters (scheduler/kai/** → KAI rows only, etc.) is a separate PR.
  • Nightly workflow: exhaustive (suite × capable backend) matrix on schedule.
  • Scheduler version coverage: the capability table currently assumes a single pinned version per backend; multi-version coverage is a known gap to address later.

L20 validation ✅ — lightweight validation on a single-server k3d (l20-6, 30 KWOK workers) completed:

  • Run 1 — KAI baseline (cert_management + crd_installer): PASS
  • Run 2 — default-scheduler new path (rolling_updates): PASS

Full evidence (per-test results, what each run proves, environment notes) is in the validation summary comment.

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE


copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


brluobt commented May 9, 2026

L20 validation: ✅ all PASS

Lightweight validation on a single-server k3d cluster (l20-6, 30 KWOK worker nodes).

Results

| Run | Backend | Tests | Result |
|---|---|---|---|
| 1 | KAI (primary) | Test_CM1_CertManagementRoundTrip, Test_CRD_Installer_AllCRDsExist, Test_CRD_Installer_InitContainerCompleted, Test_CRD_Installer_Idempotent | ✅ PASS |
| 2 | default-scheduler (new framework path) | All 20 Test_RU* (rolling_updates) | ✅ PASS |

What each run proves

Run 1 (KAI agnostic): removing hardcoded schedulerName: kai-scheduler from the 22 workload YAMLs and relying on PreparePod() to inject from defaultProfileName does not regress the KAI baseline. cert_management exercises the webhook TLS round-trip with workload deployment; crd_installer exercises the init-container path. Both pass end-to-end on KAI.

Run 2 (default-scheduler): the new framework end-to-end:

  • Skaffold profile e2e-default-scheduler deploys the operator with defaultProfileName: default-scheduler.
  • infra-manager preset hack/e2e-default-scheduler.yaml skips KAI installation; KAI image prepull is correctly gated off via cfg.scheduler.kai.enabled.
  • DiscoverCapabilities() reads the live OperatorConfiguration and resolves the active backend.
  • PreparePod() injects default-scheduler into pod specs from the unmodified workload YAMLs.
  • All 20 rolling_updates tests pass against default-scheduler — confirming sensitive-tier behavior is sound on the new backend.
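The injection step in Run 2 is what lets the same workload YAML run on either backend. The sketch below is an assumption about the general shape of that behavior, not the operator's actual PreparePod() code: `podSpec` is a local stand-in for the Kubernetes PodSpec, `injectSchedulerName` is a hypothetical helper, and whether the real implementation only fills an empty field (as here) or always overwrites it is not stated in this PR.

```go
package main

import "fmt"

// podSpec is a minimal stand-in for the Kubernetes PodSpec; only the field
// relevant to this example is modelled.
type podSpec struct {
	SchedulerName string
}

// injectSchedulerName is a hypothetical helper. It fills schedulerName from
// the operator's default profile when the workload left it unset (assumed
// behavior, see the note above).
func injectSchedulerName(spec *podSpec, defaultProfileName string) {
	if spec.SchedulerName == "" {
		spec.SchedulerName = defaultProfileName
	}
}

func main() {
	// The same empty spec resolves differently depending on the profile.
	for _, profile := range []string{"kai-scheduler", "default-scheduler"} {
		spec := podSpec{}
		injectSchedulerName(&spec, profile)
		fmt.Printf("defaultProfileName=%s -> schedulerName=%s\n", profile, spec.SchedulerName)
	}
}
```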

Notes

  • Run 1 first attempt failed at make run-e2e with /bin/sh: syntax error near unexpected token '('. Root cause was unrelated to the framework: grove's run-e2e-full Makefile target expands $(TEST_PATTERN) unquoted, so the original pattern ^(Test_CM|Test_CRD_Installer) was reparsed by sh as a subshell. Re-ran with the equivalent prefix-only pattern ^Test_C (only Test_CM* and Test_CRD_Installer* start with Test_C in operator/e2e/tests/); all 4 tests passed.
  • Both runs created and tore down their own k3d cluster (shared-e2e-test-cluster), no leftover state.
  • This was lightweight validation by design — it does not exercise gang_scheduling, topology_aware_scheduling, auto_mnnvl, or resource_sharing (capability-gated, expected to skip on default-scheduler via RequireCapability). The CI matrix in this PR will exercise those on KAI.
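The pattern substitution in the first note can be checked with a short Go program. It is a sketch under one assumption: the `names` list below mixes the four test names from the results table with a few made-up names for other suites (the `Test_RU1`, `Test_GS1`, `Test_OD1`, and `Test_SO1` entries are hypothetical), so it demonstrates the equivalence on this sample only, not on the full contents of operator/e2e/tests/.

```go
package main

import (
	"fmt"
	"regexp"
)

// names holds the four tests from Run 1 plus hypothetical tests from other suites.
var names = []string{
	"Test_CM1_CertManagementRoundTrip",
	"Test_CRD_Installer_AllCRDsExist",
	"Test_CRD_Installer_InitContainerCompleted",
	"Test_CRD_Installer_Idempotent",
	"Test_RU1", // hypothetical
	"Test_GS1", // hypothetical
	"Test_OD1", // hypothetical
	"Test_SO1", // hypothetical
}

// matching returns the names selected by pattern, in input order.
func matching(pattern string, all []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range all {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	original := matching(`^(Test_CM|Test_CRD_Installer)`, names)
	fallback := matching(`^Test_C`, names)
	fmt.Println("original pattern selects:", len(original))
	fmt.Println("prefix-only pattern selects:", len(fallback))
}
```

The prefix-only form avoids parentheses entirely, which is why it survives the unquoted $(TEST_PATTERN) expansion that tripped up sh.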

# --- kai-scheduler (primary backend: agnostic + sensitive + KAI-supported capabilities) ---
- test_name: gang_scheduling
  test_pattern: "^Test_GS"
  create_flags: ""
Contributor


If this is empty, can we just not set this parameter?

Author


Good catch — technically yes, GitHub Actions silently expands an undeclared matrix field to an empty string, so we could drop these 8 lines and the workflow would still run.

I left them in for two reasons:

  1. Symmetry with the default-scheduler block right below, which uses this same field with a non-empty value (-f hack/e2e-default-scheduler.yaml). Keeping create_flags explicit on every row makes the matrix read as "two backends along the same axis" rather than making default-scheduler look like a special case.

  2. Forward compatibility: when a third backend lands (e.g. the Volcano backend from GREP-0376, #560), it'll just be another create_flags value, no schema change needed for the include block.

Happy to remove them if you'd rather optimize for line count over the explicit-axis read — let me know which way you prefer.
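To make the "empty value is harmless" point concrete, here is a small Go model of how a create_flags string could be turned into arguments for the setup command. It is only a sketch: the real workflow threads the value through E2E_CREATE_FLAGS in shell and Make, and `createArgs` is a hypothetical function that does not exist in the repository. It relies on `strings.Fields`, which returns no elements for an empty string.

```go
package main

import (
	"fmt"
	"strings"
)

// createArgs models the argument list for the infra-manager setup step.
// An empty createFlags value contributes no extra arguments.
func createArgs(createFlags string) []string {
	args := []string{"setup"}
	return append(args, strings.Fields(createFlags)...)
}

func main() {
	fmt.Printf("%q\n", createArgs(""))
	fmt.Printf("%q\n", createArgs("-f hack/e2e-default-scheduler.yaml"))
}
```

Under this model an explicit empty string and an omitted field behave identically, so keeping the field on every row is purely a readability choice.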

(Also: pushed a rebase onto current main just now to clear the conflict from #601 / #616.)

# of whether E2E ran or was skipped.
- test_name: gang_scheduling
- test_name: rolling_updates
- test_name: ondelete_updates
Contributor


Is this because we didn't run the ondelete_updates tests before?

Author


Not new — the Test_OD* (ondelete updates) tests have existed in operator/e2e/tests/ for a while and are already part of the e2e job's matrix. The reason ondelete_updates is appearing in the e2e-skip matrix in this PR is that one of the goals here is to make the e2e-skip job's test_name list a 1:1 mirror of the e2e job's, so the same set of branch-protection check names resolve regardless of whether E2E ran or was skipped.

I just verified the two matrices are aligned now — both list the same 12 entries (including ondelete_updates and the three *_default-scheduler variants). The comment block at the top of e2e-skip flags this contract so future contributors know to update both blocks together when adding a backend or test.
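The 1:1 mirror contract between the two matrices could also be checked mechanically. The following Go sketch shows one way to do that; `missingFrom` is a hypothetical helper, and the row names are taken from the local results table in this PR rather than parsed from the workflow file, so treat the list as illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// e2eRows lists the test_name values of the e2e matrix (from the results table).
var e2eRows = []string{
	"gang_scheduling", "rolling_updates", "ondelete_updates", "startup_ordering",
	"Topology_Aware_Scheduling", "cert_management", "auto_mnnvl", "crd_installer",
	"resource_sharing", "rolling_updates_default-scheduler",
	"ondelete_updates_default-scheduler", "startup_ordering_default-scheduler",
}

// missingFrom returns, sorted, the names in want that do not appear in got.
func missingFrom(want, got []string) []string {
	seen := make(map[string]bool, len(got))
	for _, n := range got {
		seen[n] = true
	}
	var missing []string
	for _, n := range want {
		if !seen[n] {
			missing = append(missing, n)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Simulate an e2e-skip matrix that forgot one row.
	var skipRows []string
	for _, n := range e2eRows {
		if n != "ondelete_updates" {
			skipRows = append(skipRows, n)
		}
	}
	fmt.Println("missing from e2e-skip:", missingFrom(e2eRows, skipRows))
	fmt.Println("missing from e2e:", missingFrom(skipRows, e2eRows))
}
```

Running the comparison in both directions catches a row that exists in only one of the two blocks, which is the failure mode that would leave a branch-protection check name unresolved.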

@brluobt brluobt force-pushed the e2e-multi-scheduler-backend branch from 6aababc to a40d1a7 Compare May 26, 2026 06:13
@brluobt brluobt marked this pull request as ready for review May 26, 2026 11:49

brluobt commented May 26, 2026

Marking ready for review after a full local E2E pass on the rebased branch.

Rebase summary

E2E results (12/12 PASS)

Ran the full CI matrix locally on a 128-core / 780 GB node, total 81 min. All 12 entries passed:

| # | Task | Result | Time |
|---|---|---|---|
| 1 | gang_scheduling | PASS | 984s |
| 2 | rolling_updates | PASS | 468s |
| 3 | ondelete_updates | PASS | 427s |
| 4 | startup_ordering | PASS | 312s |
| 5 | Topology_Aware_Scheduling | PASS | 500s |
| 6 | cert_management | PASS | 170s |
| 7 | auto_mnnvl | PASS | 597s |
| 8 | crd_installer | PASS | 137s |
| 9 | resource_sharing | PASS | 148s |
| 10 | rolling_updates_default-scheduler | PASS | 436s |
| 11 | ondelete_updates_default-scheduler | PASS | 398s |
| 12 | startup_ordering_default-scheduler | PASS | 282s |
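As a quick sanity check on the quoted 81-minute total, the per-task times in the table can be summed. The snippet below is just that arithmetic in Go, assuming the tasks ran sequentially with no extra time between them.

```go
package main

import "fmt"

// durations holds the per-task times from the table, in seconds, in row order.
var durations = []int{984, 468, 427, 312, 500, 170, 597, 137, 148, 436, 398, 282}

// total returns the sum of all task durations in seconds.
func total(secs []int) int {
	sum := 0
	for _, s := range secs {
		sum += s
	}
	return sum
}

func main() {
	t := total(durations)
	fmt.Printf("%ds (%.1f min)\n", t, float64(t)/60) // prints "4859s (81.0 min)"
}
```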

Capability gate spot-check

Verified the RequireCapability gate behaves correctly in the positive path on both backends:

  • Active backend: kai-scheduler and Active backend: default-scheduler log lines show DiscoverCapabilities resolves correctly per backend.
  • KAI tasks: RequireCapability(t, GangScheduling) is hit 12× and RequireCapability(t, TopologyAwareScheduling) is hit 21×, all PASS, zero unexpected SKIP — i.e. the capability table is reporting kai-scheduler → {GS=true, TAS=true} correctly.
  • default-scheduler tasks: zero unexpected SKIPs across the three sensitive-tier patterns (^Test_RU / ^Test_OD / ^Test_SO), which is by design (these tests carry no capability gate and run against every enabled backend; capability gates are scoped to gang_scheduling_test.go and topology_test.go).
  • The unsupported-capability path (e.g. default-scheduler + ^Test_GS) is not exercised by the current matrix by design.
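Since the unsupported-capability path is not exercised by the matrix, a sketch of the gate's expected behavior on both paths may help reviewers. This is not the code in capabilities.go: the PR's RequireCapability takes `(t, capability)` and consults the discovered backend, whereas the hypothetical `requireCapability` below takes the active capability set as an explicit argument so it can run standalone. The `skipper` interface and `recorder` fake are likewise invented for this example.

```go
package main

import "fmt"

// Capability mirrors the naming used in the PR; the type is illustrative.
type Capability string

const GangScheduling Capability = "GangScheduling"

// skipper is the subset of testing.TB this sketch needs; *testing.T satisfies it.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// capSet is a hypothetical stand-in for what capability discovery returns.
type capSet struct {
	backend string
	caps    map[Capability]bool
}

// requireCapability skips the calling test when the active backend does not
// provide the requested capability.
func requireCapability(t skipper, active capSet, c Capability) {
	t.Helper()
	if !active.caps[c] {
		t.Skipf("backend %q does not provide %s", active.backend, c)
	}
}

// recorder is a fake skipper that records skip messages. Unlike *testing.T,
// its Skipf does not stop the calling goroutine.
type recorder struct{ skipped []string }

func (r *recorder) Helper() {}
func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = append(r.skipped, fmt.Sprintf(format, args...))
}

func main() {
	kai := capSet{backend: "kai-scheduler", caps: map[Capability]bool{GangScheduling: true}}
	def := capSet{backend: "default-scheduler"}
	for _, active := range []capSet{kai, def} {
		r := &recorder{}
		requireCapability(r, active, GangScheduling)
		fmt.Printf("%s: skipped=%v\n", active.backend, len(r.skipped) > 0)
	}
}
```

With a real *testing.T, Skipf marks the test as skipped and stops its execution, so nothing after the gate runs on a backend that lacks the capability.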

Per-task logs retained locally; happy to dig into specifics if any reviewer wants to spot-check.

@brluobt brluobt changed the title from "test(e2e): support multiple scheduler backends (draft)" to "test(e2e): support multiple scheduler backends" Jun 1, 2026
kangclzjc and others added 3 commits June 1, 2026 06:37
Restructures the E2E suite from KAI-only to N-backend ready, landing
KAI (primary) and default-scheduler. KAI rows are functionally
unchanged; default-scheduler rows light up new sensitive coverage
(RU/OD/SO).

Skaffold adds an e2e-default-scheduler profile alongside e2e-kai.
Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files);
the operator's PreparePod() assigns from defaultProfileName. The new
hack/e2e-default-scheduler.yaml infra-manager preset disables KAI
install/queues, and _run_prepull's KAI image list is now gated on
cfg.scheduler.kai.enabled so default-scheduler rows skip the unused
pull.

A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go
auto-skips capability-gated tests when the active backend does not
provide the required capability. DiscoverCapabilities reads the live
OperatorConfiguration via grove/config and joins it with a hardcoded
backend->interface table; a unit test cross-checks the table against
actual Go interface assertions for every registered backend.

The CI matrix grows three default-scheduler rows (rolling_updates,
ondelete_updates, startup_ordering) via a new create_flags field that
threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs
in lockstep.

Signed-off-by: Kang Zhang <kangz@nvidia.com>
Signed-off-by: Bruce Luo <brluobt@gmail.com>

Two small polish items on top of the previous commit:

1. CI matrix: rename the three default-scheduler rows from `*_default`
   to `*_default-scheduler` (rolling_updates_default-scheduler,
   ondelete_updates_default-scheduler, startup_ordering_default-scheduler)
   in both the e2e and e2e-skip matrices. The longer suffix matches the
   actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler")
   and removes ambiguity in the GitHub Actions UI when scanning a long
   matrix at a glance.

2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a
   non-existent design proposal with a concrete activation pointer
   (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS).

Signed-off-by: Bruce Luo <brluobt@gmail.com>

…face

Upstream main renamed scheduler.TopologyAwareSchedBackend to
scheduler.TopologyAwareBackend. The rebase of this branch resolved
topology_test.go but missed the same reference in capabilities_test.go,
leaving the e2e/tests package un-buildable on top of current main.

Updates the one code site (capabilities_test.go:91 type assertion) and
two doc-comment references for consistency. No behavior change.
@brluobt brluobt force-pushed the e2e-multi-scheduler-backend branch from 3d2a85a to 00693ae Compare June 1, 2026 06:38


Development

Successfully merging this pull request may close these issues.

[E2E] Multi-backend E2E test framework

2 participants