test(e2e): support multiple scheduler backends#595
Conversation
L20 validation: ✅ all PASSLightweight validation on a single-server k3d cluster (l20-6, 30 KWOK worker nodes). Results
What each run provesRun 1 (KAI agnostic): removing hardcoded Run 2 (default-scheduler): the new framework end-to-end:
Notes
|
e49cea1 to
6aababc
Compare
| # --- kai-scheduler (primary backend: agnostic + sensitive + KAI-supported capabilities) --- | ||
| - test_name: gang_scheduling | ||
| test_pattern: "^Test_GS" | ||
| create_flags: "" |
There was a problem hiding this comment.
if this is empty, can we just don't set this parameters?
There was a problem hiding this comment.
Good catch — technically yes, GitHub Actions silently expands an undeclared matrix field to an empty string, so we could drop these 8 lines and the workflow would still run.
I left them in for two reasons:
-
Symmetry with the default-scheduler block right below, which uses this same field with a non-empty value (
-f hack/e2e-default-scheduler.yaml). Keepingcreate_flagsexplicit on every row makes the matrix read as "two backends along the same axis" rather than making default-scheduler look like a special case. -
Forward compatibility: when a third backend lands (e.g. the Volcano backend from GREP-0376 / feat: Add GREP-0376 for Volcano scheduler backend #560), it'll just be another
create_flagsvalue, no schema change needed for the include block.
Happy to remove them if you'd rather optimize for line count over the explicit-axis read — let me know which way you prefer.
(Also: pushed a rebase onto current main just now to clear the conflict from #601 / #616.)
| # of whether E2E ran or was skipped. | ||
| - test_name: gang_scheduling | ||
| - test_name: rolling_updates | ||
| - test_name: ondelete_updates |
There was a problem hiding this comment.
Is this because we don't do ondelete_updates test before?
There was a problem hiding this comment.
Not new — the Test_OD* (ondelete updates) tests have existed in operator/e2e/tests/ for a while and are already part of the e2e job's matrix. The reason ondelete_updates is appearing in the e2e-skip matrix in this PR is that one of the goals here is to make the e2e-skip job's test_name list a 1:1 mirror of the e2e job's, so the same set of branch-protection check names resolve regardless of whether E2E ran or was skipped.
I just verified the two matrices are aligned now — both list the same 12 entries (including ondelete_updates and the three *_default-scheduler variants). The comment block at the top of e2e-skip flags this contract so future contributors know to update both blocks together when adding a backend or test.
6aababc to
a40d1a7
Compare
|
Marking ready for review after a full local E2E pass on the rebased branch. Rebase summary
E2E results (12/12 PASS)Ran the full CI matrix locally on a 128-core / 780 GB node, total 81 min. All 12 entries passed:
Capability gate spot-checkVerified the
Per-task logs retained locally; happy to dig into specifics if any reviewer wants to spot-check. |
Restructures the E2E suite from KAI-only to N-backend ready, landing KAI (primary) and default-scheduler. KAI rows are functionally unchanged; default-scheduler rows light up new sensitive coverage (RU/OD/SO). Skaffold adds an e2e-default-scheduler profile alongside e2e-kai. Workload YAMLs lose hardcoded schedulerName: kai-scheduler (22 files); the operator's PreparePod() assigns from defaultProfileName. The new hack/e2e-default-scheduler.yaml infra-manager preset disables KAI install/queues, and _run_prepull's KAI image list is now gated on cfg.scheduler.kai.enabled so default-scheduler rows skip the unused pull. A new RequireCapability(t, ...) gate in operator/e2e/tests/capabilities.go auto-skips capability-gated tests when the active backend does not provide the required capability. DiscoverCapabilities reads the live OperatorConfiguration via grove/config and joins it with a hardcoded backend->interface table; a unit test cross-checks the table against actual Go interface assertions for every registered backend. The CI matrix grows three default-scheduler rows (rolling_updates, ondelete_updates, startup_ordering) via a new create_flags field that threads -f hack/<preset>.yaml through E2E_CREATE_FLAGS; e2e-skip syncs in lockstep. Signed-off-by: Kang Zhang <kangz@nvidia.com> Signed-off-by: Bruce Luo <brluobt@gmail.com>
Two small polish items on top of the previous commit: 1. CI matrix: rename the three default-scheduler rows from `*_default` to `*_default-scheduler` (rolling_updates_default-scheduler, ondelete_updates_default-scheduler, startup_ordering_default-scheduler) in both the e2e and e2e-skip matrices. The longer suffix matches the actual backend name (configv1alpha1.SchedulerNameKube = "default-scheduler") and removes ambiguity in the GitHub Actions UI when scanning a long matrix at a glance. 2. hack/e2e-default-scheduler.yaml: replace the dangling reference to a non-existent design proposal with a concrete activation pointer (infra-manager.py setup -f ..., threaded through E2E_CREATE_FLAGS). Signed-off-by: Bruce Luo <brluobt@gmail.com>
…face Upstream main renamed scheduler.TopologyAwareSchedBackend to scheduler.TopologyAwareBackend. The rebase of this branch resolved topology_test.go but missed the same reference in capabilities_test.go, leaving the e2e/tests package un-buildable on top of current main. Updates the one code site (capabilities_test.go:91 type assertion) and two doc-comment references for consistency. No behavior change.
3d2a85a to
00693ae
Compare
What type of PR is this?
/kind feature
What this PR does / why we need it:
Restructures the E2E suite from KAI-only to support multiple scheduler backends — KAI as primary, default-scheduler as the first additional backend. Continuation of @kangclzjc's prototype in #584 (closed) with a small polish commit on top.
Three-tier test classification (per
RequireCapabilityruntime gating):RequireCapability(t, ...)auto-skips otherwise.Six configuration layers touched (purely additive — no Makefile/test-runner/Helm-rendering changes):
schedulerName: kai-scheduler; operator'sPreparePod()injects fromdefaultProfileName. Same workload YAML now runs unmodified on every backend.topology-test→e2e-kai; adde2e-default-scheduler.hack/e2e-default-scheduler.yamloverlay; KAI image prepull gated oncfg.scheduler.kai.enabledso default-scheduler rows skip the unused pull.operator/e2e/tests/{capabilities.go, capability_discovery.go}; cross-check unit test incapabilities_test.gofails the build if the hardcoded backend→capability table drifts from actual Go interface assertions.RequireCapability(t, GangScheduling)etc. added to GS and TAS suites.create_flagsfield threads-f hack/<preset>.yamlthroughE2E_CREATE_FLAGS. default-scheduler rows:rolling_updates_default-scheduler,ondelete_updates_default-scheduler,startup_ordering_default-scheduler. e2e-skip mirrors for branch protection.Which issue(s) this PR fixes:
Refs #594
(This PR does not resolve #594. The issue tracks ongoing multi-backend E2E work including path-filtered PR matrix and nightly runs as follow-ups.)
Special notes for your reviewer:
Two commits in this PR:
test(e2e): support multiple scheduler backends— cherry-picked from @kangclzjc's branch, message prefix updated fromGREP:totest(e2e):. Author preserved as Kang.test(e2e): polish multi-backend matrix naming and infra preset— two cosmetic items:_default→_default-scheduler(full backend name in the GHA UI; matchesconfigv1alpha1.SchedulerNameKubevalue).hack/e2e-default-scheduler.yaml: replace dangling reference to a non-existent design doc with a concrete activation pointer.Merge strategy: please prefer rebase-merge over squash to preserve @kangclzjc's authorship on commit 1. If squash is required, the trailer block should retain
Co-authored-by: Kang Zhang <kangz@nvidia.com>and bothSigned-off-bylines.Out of scope for this PR (deferred to follow-ups, tracked by #594):
operator/**or.github/**changes. Refining to backend-aware filters (scheduler/kai/**→ KAI rows only, etc.) is a separate PR.L20 validation ✅ — lightweight validation on a single-server k3d (l20-6, 30 KWOK workers) completed:
cert_management+crd_installer): PASSrolling_updates): PASSFull evidence (per-test results, what each run proves, environment notes) is in the validation summary comment.
Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: