feat(slinky-slurm): add GB200 slurm recipe, IMEX compute domain and channel conformance#1539
feat(slinky-slurm): add GB200 slurm recipe, IMEX compute domain and channel conformance#1539kaynetu wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds GB200 Slurm IMEX support across the recipe, manifest, validator, tests, docs, and demo deck. It introduces a new ComputeDomain manifest and GB200 Slurm recipe overlay, adds an IMEX channel conformance check and catalog wiring, updates related recipe and component documentation, extends tests for recipe wiring and validation, adjusts deployment-order expectations, and adds a standalone Slinky Slurm demo HTML slide deck. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@demos/slinky-slurm-demo.html`:
- Around line 263-297: The Slinky Slurm e2e flow diagram still describes the old
H100 training path instead of the GB200 IMEX workflow added by this PR. Update
the relevant demo sections and labels in the flow/expectation content to
reference the new GB200 IMEX symbols, including slurm-slurm-imex-compute-domain
and slurm-slurm-imex-channel, and remove or replace the old H100-specific terms
like h100-eks-ubuntu-training-slurm, Gres=gpu:h100:8, and slinky-slurm-health.
Keep the diagram and narrative consistent with the new deployment and
conformance steps so the demo matches the feature being introduced.
- Line 309: Add type="button" to the buttons in the deck UI so they do not
default to submit behavior when reused inside a form. Update the button elements
in the bar markup, including the copy control in the slinky-slurm-demo HTML, and
apply the same fix to any similar button elements in that section so their
behavior is explicitly non-submitting.
- Around line 616-623: The copy-button handler in the
document.querySelectorAll('.code .copy') block does not handle failures from
navigator.clipboard.writeText, so clipboard permission or insecure-context
errors are ignored. Update the click handler to add a rejection path after
writeText succeeds or fails, using the existing btn/pre logic to show a fallback
UI or error state when copying is not available. Keep the change localized to
the copy button event listener so the behavior remains consistent with the
current clipboard workflow.
In `@recipes/checks/slinky-slurm-imex-compute-domain/health-check.yaml`:
- Around line 46-50: Update the ResourceClaimTemplate manifest in the
health-check recipe to use resource.k8s.io/v1beta1 instead of resource.k8s.io/v1
so the assert matches the Kubernetes floor supported by this check. Locate the
ResourceClaimTemplate entry in the health-check YAML for the
slinky-slurm-imex-channels resource and change only the apiVersion to the served
beta version.
In
`@recipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yaml`:
- Around line 15-16: The conditional in compute-domain.yaml is ignoring an
explicit enabled: false because the default true is applied before checking the
value, so the ComputeDomain still renders. Update the guard around the
$imex.enabled check to preserve an explicit false by first checking whether the
key exists in .Values["slinky-slurm-imex-compute-domain"] and only falling back
to the default when it is unset. Keep the fix localized to the template logic
that gates rendering of the ComputeDomain resource.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 80d336d8-186e-4fac-afef-0f88c5a21015
📒 Files selected for processing (15)
demos/slinky-slurm-demo.htmldocs/user/component-catalog.mddocs/user/container-images.mddocs/user/recipe-health.mdpkg/recipe/deployment_order_guard_test.gopkg/recipe/metadata_store_test.gorecipes/checks/slinky-slurm-imex-compute-domain/health-check.yamlrecipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yamlrecipes/overlays/gb200-eks-ubuntu-training-slurm.yamlrecipes/registry.yamlrecipes/validators/README.mdrecipes/validators/catalog.yamlvalidators/conformance/main.govalidators/conformance/slinky_slurm_imex_channel_check.govalidators/conformance/slinky_slurm_imex_channel_check_test.go
|
@kaynetu this PR now has merge conflicts with |
njhensley
left a comment
There was a problem hiding this comment.
Multi-Persona Review — GB200 IMEX compute domain + channel conformance
Method: 4 independent persona reviewers (Correctness, Domain/Architecture, Test-coverage, Operability/CI-DX), with every finding re-derived from the resolved code at the head commit and the highest-tier claims independently reproduced. Two personas raised blockers; one survived verification, one was downgraded, and both Helm-hook "Major" claims were refuted against the repo's established idiom.
Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick
Overall assessment
A well-scoped, well-tested feature that closely mirrors the existing Slinky-Slurm patterns (shared check helpers, manifest-only-via-Helm component idiom, deployment-order guard tests, BOM/recipe-health regen all internally consistent). The conformance check's parse logic is sound and robustly gated. Two small, real defects are worth fixing before merge — both are quick.
Confirmed non-issues (examined, refuted)
- Post-install hook creates "hidden ordering" risk — Refuted. The hook block is byte-identical to the universal repo idiom (
gpu-operator-ocp,nfd-ocp,agentgateway,nodewright-*), andlocalformat'sinjectPostInstallHooksapplies post-install hooks to all recipe-side manifests by design. Cross-component ordering (DRA driver → compute-domain → slinky-slurm) is enforced by the dependency graph and verified indeployment_order_guard_test.go, not by the hook. before-hook-creationdelete-policy churn on upgrade — Refuted. Identical to every sibling manifest component; speculative and repo-wide if real.resourceClaimsname must stay in sync — bothimex-channelsreferences sit a few lines apart in the same overlay; trivially co-located.type: Helm+ empty repo for a manifest-only component — established idiom.:latestvalidator image tag — consistent with all 13 conformance catalog entries; no new drift.- BOM / recipe-health / component-catalog regen — internally consistent (34 components, 38 recipes, new gb200-slurm row C:9); no hand-edit signatures.
Summary
🔴 Blocker 1 · 🟠 Major 1 · 🟡 Minor 1 · 🔵 Nitpick 3 — both must-fixes are small.
Review assembled with a multi-persona + adversarial meta-review workflow.
f95f180 to
2d0941e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml`:
- Around line 78-82: The `slinky-slurm-imex-compute-domain` overlay entry is
incorrectly changing a manifest/direct component to `type: Helm`, which the
overlay model does not support. Update the `slinky-slurm-imex-compute-domain`
reference in the overlay to keep the same manifest/direct type as the
registry/documented component, and leave it pointing at its manifest file
instead of declaring it as Helm.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0d08d075-e307-43e4-b68e-48ab8e770681
📒 Files selected for processing (16)
demos/slinky-slurm-demo.htmldocs/user/component-catalog.mddocs/user/container-images.mddocs/user/recipe-health.mdpkg/collector/k8s/server_test.gopkg/recipe/deployment_order_guard_test.gopkg/recipe/metadata_store_test.gorecipes/checks/slinky-slurm-imex-compute-domain/health-check.yamlrecipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yamlrecipes/overlays/gb200-eks-ubuntu-training-slurm.yamlrecipes/registry.yamlrecipes/validators/README.mdrecipes/validators/catalog.yamlvalidators/conformance/main.govalidators/conformance/slinky_slurm_imex_channel_check.govalidators/conformance/slinky_slurm_imex_channel_check_test.go
yuanchen8911
left a comment
There was a problem hiding this comment.
A few items remain after the latest push. Inline comments below.
Separately: this branch currently has a merge commit ("Merge branch 'main' into feat/…") — please rebase onto origin/main and squash to a single signed commit per repo policy before merge.
782f0cb to
f1180cb
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/conformance/slinky_slurm_imex_channel_check.go`:
- Around line 39-73: The Slinky Slurm IMEX shell check uses a wall-time limit
that is shorter than the script’s 40-second hold, so the overlap check cannot
complete as intended. Update the `slinkySlurmIMEXChannelShell` command to use a
`--time` value longer than the `sleep 40` window, and then adjust the expected
command string in `TestSlinkySlurmIMEXCommandIsBoundedAndResourceScoped` to
match the new timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 55b378c6-532f-4f2c-aa0e-dac1e444e9b1
📒 Files selected for processing (13)
demos/slinky-slurm-demo.htmldocs/user/component-catalog.mddocs/user/recipe-health.mdpkg/collector/k8s/server_test.gopkg/recipe/deployment_order_guard_test.gopkg/recipe/metadata_store_test.gorecipes/components/slinky-slurm/manifests/compute-domain.yamlrecipes/overlays/gb200-eks-ubuntu-training-slurm.yamlrecipes/validators/README.mdrecipes/validators/catalog.yamlvalidators/conformance/main.govalidators/conformance/slinky_slurm_imex_channel_check.govalidators/conformance/slinky_slurm_imex_channel_check_test.go
f1180cb to
3fe2ebd
Compare
|
🌿 Preview your docs: https://nvidia-preview-feat-gb200-slinky-slurm-imex.docs.buildwithfern.com/aicr |
Recipe evidence checkOther affected recipes without evidence yet: 1These recipes are affected by this PR but carry no committed evidence pointer, so there is
This gate is warning-only and never blocks merge. See ADR-007 for the trust model. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@demos/slinky-slurm-demo.html`:
- Around line 605-614: The global keydown handler in the document listener
should skip shortcuts when the event target is an interactive control. Update
the shortcut guard around the handler that processes ArrowRight, Space,
PageDown, n, and the other navigation keys so it ignores events from inputs,
buttons, selects, textareas, and contenteditable elements. Keep the existing
navigation logic in the same document.addEventListener('keydown', ...) block,
but only call next(), prev(), show(), or fullscreen actions when the focused
target is not interactive.
In `@pkg/recipe/metadata_store_test.go`:
- Around line 787-799: The wiring test is missing an assertion for the pod-level
claim name, so add a check in the same test around podClaims in
metadata_store_test.go that verifies
nodesets.slinky.podSpec.resourceClaims[0].name is "imex-channels" in addition to
the existing resourceClaimTemplateName and container claim assertions. Use the
existing podClaims and assertSingleNameField helpers (and the same
nodesets.slinky path) so the test explicitly ties the pod-level claim name to
the slurmd.resources.claims reference.
In `@recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml`:
- Around line 21-25: The overlay comment uses an incorrect component key in the
`aicr bundle --set` example; update the example to match the actual
`componentRef` name used by the overlay, `slinky-slurm`, so users copy a valid
override key. Check the inline explanatory comment near the GB200/EKS/Ubuntu
training overlay and keep the example aligned with the `componentRef`
declaration below.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: dc43b7e6-5116-460a-b489-410082ec9e42
📒 Files selected for processing (13)
demos/slinky-slurm-demo.htmldocs/user/component-catalog.mddocs/user/recipe-health.mdpkg/collector/k8s/server_test.gopkg/recipe/deployment_order_guard_test.gopkg/recipe/metadata_store_test.gorecipes/components/slinky-slurm/manifests/compute-domain.yamlrecipes/overlays/gb200-eks-ubuntu-training-slurm.yamlrecipes/validators/README.mdrecipes/validators/catalog.yamlvalidators/conformance/main.govalidators/conformance/slinky_slurm_imex_channel_check.govalidators/conformance/slinky_slurm_imex_channel_check_test.go
735f82f to
b552459
Compare
njhensley
left a comment
There was a problem hiding this comment.
Multi-Persona Re-Review (v2)
Re-reviewed at head b55245969 (4 persona reviewers re-run + a senior adversarial meta-review). Follow-up to my earlier REQUEST_CHANGES.
Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick
Both prior blockers resolved ✅
- apiVersion — overlay is now
aicr.run/v1alpha2. - shell fail-open — tail is now
test "$first_rc" -eq 0 && test "$second_rc" -eq 0, with a test asserting it.
Prior minors also closed: malformed-path branch (TestParseSlinkySlurmIMEXChannelsRejectsMalformedPath) and nil-context guards (TestCheckSlinkySlurmIMEXChannelRequiresContext).
Beyond the patch, the design was restructured: the standalone slinky-slurm-imex-compute-domain component was removed, the ComputeDomain moved to a pre-manifest under slinky-slurm with a hardcoded (immutable) identity, and the check rewritten to gate on live cluster resources (fail-closed). Check coverage is strong (88.9% / 94.1% / 100% / 100% / 100%).
Overall assessment
Clean, well-tested, and materially improved. No blocker or major issues survive the meta-review — the two inline 🟡s and one 🔵 below are all non-blocking. Recommendation: approve (with optional follow-ups).
Confirmed non-issues (examined, refuted)
- Async RCT reconciliation race — pre-manifest + dependencyRefs ordering; pods Pending until the RCT reconciles is normal self-healing DRA behavior, not a defect.
- NotFound error-message ambiguity — message already includes kind + namespace + name.
- errgroup for 2 reads — idiomatic fan-out with correct ctx cancellation.
getDynamicClientnil path untested — unreachable given the entry guards (normal hot path).- Helper error paths untested in the IMEX test — covered in
slinky_slurm_health_check_test.go. - Docs / BOM / demo — regenerated and consistent; BOM correctly needs no regen; no stale references to the removed component.
pkg/collector/k8s/server_test.go— small KUBECONFIG/HOME test-isolation hardening bundled in; benign.
Summary
🔴 Blocker 0 · 🟠 Major 0 · 🟡 Minor 2 · 🔵 Nitpick 1 — nothing blocking.
Review assembled with a multi-persona + adversarial meta-review workflow.
…ance validation Signed-off-by: Kayne Tu <kaynet@nvidia.com>
b552459 to
794a1bf
Compare
njhensley
left a comment
There was a problem hiding this comment.
All prior review findings resolved at 794a1bf (verified + conformance/recipe tests pass):
- apiVersion migrated to aicr.run/v1alpha2 (blocker)
- Shell now asserts both jobs' exit status (
first_rc/second_rc) — fail-open closed (major) - requireSlinkySlurmIMEXResources reordered after discover + KWOK skip, with test
- Malformed-path, nil-context, and KWOK-skip branches now covered by tests
- Auto-skip documented in validation block; NodeSet claim ref flagged as non-overridable internal integration value
Clearing my CHANGES_REQUESTED. Approving.
njhensley
left a comment
There was a problem hiding this comment.
All prior review findings resolved at 794a1bf (verified; conformance + recipe tests pass): apiVersion migrated to aicr.run/v1alpha2 (blocker); shell asserts both jobs' exit status, closing the fail-open (major); requireSlinkySlurmIMEXResources reordered after discover + KWOK skip; malformed-path, nil-context, and KWOK-skip branches now tested; auto-skip documented and NodeSet claim ref flagged as non-overridable. Clearing my CHANGES_REQUESTED.
njhensley
left a comment
There was a problem hiding this comment.
All prior review findings resolved at 794a1bf (verified; conformance + recipe tests pass): apiVersion migrated to aicr.run/v1alpha2 (blocker); shell asserts both jobs' exit status, closing the fail-open (major); requireSlinkySlurmIMEXResources reordered after discover + KWOK skip; malformed-path, nil-context, and KWOK-skip branches now tested; auto-skip documented and NodeSet claim ref flagged as non-overridable. Clearing my CHANGES_REQUESTED.
Summary
Adds a GB200 EKS Ubuntu training Slurm recipe with NVIDIA DRA IMEX support, including the
ComputeDomainmanifest component and Slurm resource-claim wiring.Adds
slinky-slurm-imex-channelconformance validation, which proves two concurrent bounded Slurm jobs receive different IMEX channel device paths.Motivation / Context
GB200 multi-node NVLink workloads need an IMEX ComputeDomain and DRA-provisioned channel claims for Slurm job steps. Existing Slinky Slurm support did not provide a GB200 IMEX-capable leaf or an end-to-end conformance signal for per-job channel isolation.
Fixes: N/A
Related: #1394
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/,validators/conformance/Implementation Notes
slinky-slurm-imex-compute-domainmanifest-only component. Its health check verifies the NVIDIA DRAComputeDomainand reconciledResourceClaimTemplate.gb200-eks-ubuntu-training-slurm, which configures the Slinky NodeSet to consume the IMEX claim template and configuresSwitchType=switch/nvidia_imex.slinky-slurm-imex-channelconformance check. It launches two overlapping, bounded one-GPUsrunjobs, records their/dev/nvidia-caps-imex-channels/channel*device paths, and fails if the paths are missing, malformed, or identical.Testing
unset GITLAB_TOKEN make qualifymake qualifypassed locally.Local deployment and conformance: deployed the GB200 Slinky Slurm recipe locally, then ran the conformance phase. The resulting CTRF report recorded 2/2 passed checks:
The Slurm health check verified
slurmctldwas up and that ansrunjob completed:The IMEX check launched two overlapping bounded one-GPU Slurm jobs and confirmed distinct channel devices:
Risk Assessment
Rollout notes: Deploy the new recipe with a DRA-capable NVIDIA driver and IMEX-enabled GB200 environment. The conformance validator image must include this change before deployment validation can execute the new check.
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info