Skip to content

feat(slinky-slurm): add GB200 slurm recipe, IMEX compute domain and channel conformance#1539

Open
kaynetu wants to merge 1 commit into
NVIDIA:mainfrom
kaynetu:feat/gb200-slinky-slurm-imex
Open

feat(slinky-slurm): add GB200 slurm recipe, IMEX compute domain and channel conformance#1539
kaynetu wants to merge 1 commit into
NVIDIA:mainfrom
kaynetu:feat/gb200-slinky-slurm-imex

Conversation

@kaynetu

@kaynetu kaynetu commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a GB200 EKS Ubuntu training Slurm recipe with NVIDIA DRA IMEX support, including the ComputeDomain manifest component and Slurm resource-claim wiring.

Adds slinky-slurm-imex-channel conformance validation, which proves two concurrent bounded Slurm jobs receive different IMEX channel device paths.

Motivation / Context

GB200 multi-node NVLink workloads need an IMEX ComputeDomain and DRA-provisioned channel claims for Slurm job steps. Existing Slinky Slurm support did not provide a GB200 IMEX-capable leaf or an end-to-end conformance signal for per-job channel isolation.

Fixes: N/A
Related: #1394

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/, validators/conformance/

Implementation Notes

  • Adds the slinky-slurm-imex-compute-domain manifest-only component. Its health check verifies the NVIDIA DRA ComputeDomain and reconciled ResourceClaimTemplate.
  • Adds gb200-eks-ubuntu-training-slurm, which configures the Slinky NodeSet to consume the IMEX claim template and configures SwitchType=switch/nvidia_imex.
  • Adds the slinky-slurm-imex-channel conformance check. It launches two overlapping, bounded one-GPU srun jobs, records their /dev/nvidia-caps-imex-channels/channel* device paths, and fails if the paths are missing, malformed, or identical.
  • Registers the check in the validator catalog and scopes it to recipes that include the IMEX ComputeDomain component.
  • Updates recipe coverage, component catalog/BOM content, recipe-health documentation, and the Slinky demo.

Testing

unset GITLAB_TOKEN
make qualify

make qualify passed locally.

Local deployment and conformance: deployed the GB200 Slinky Slurm recipe locally, then ran the conformance phase. The resulting CTRF report recorded 2/2 passed checks:

tests: 2
passed: 2
failed: 0

conformance  slinky-slurm-health        passed
conformance  slinky-slurm-imex-channel  passed

The Slurm health check verified slurmctld was up and that an srun job completed:

Slurmctld(primary) at slinky-slurm-controller-0 is UP
srun --immediate=5 --time=0:03 hostname
ExitCode: 0
stdout: slinky-1

The IMEX check launched two overlapping bounded one-GPU Slurm jobs and confirmed distinct channel devices:

ExitCode: 0
IMEX_CHANNEL=/dev/nvidia-caps-imex-channels/channel1
IMEX_CHANNEL=/dev/nvidia-caps-imex-channels/channel2

first=/dev/nvidia-caps-imex-channels/channel1
second=/dev/nvidia-caps-imex-channels/channel2

Risk Assessment

  • Low — The new behavior is scoped to the new GB200 Slurm leaf and recipes that explicitly include the IMEX ComputeDomain component. Existing non-IMEX recipes do not run the new check.
  • Medium
  • High

Rollout notes: Deploy the new recipe with a DRA-capable NVIDIA driver and IMEX-enabled GB200 environment. The conformance validator image must include this change before deployment validation can execute the new check.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@kaynetu kaynetu requested review from a team as code owners June 29, 2026 20:49
@copy-pr-bot

copy-pr-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds GB200 Slurm IMEX support across the recipe, manifest, validator, tests, docs, and demo deck. It introduces a new ComputeDomain manifest and GB200 Slurm recipe overlay, adds an IMEX channel conformance check and catalog wiring, updates related recipe and component documentation, extends tests for recipe wiring and validation, adjusts deployment-order expectations, and adds a standalone Slinky Slurm demo HTML slide deck.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main additions: GB200 Slurm recipe support, IMEX compute-domain wiring, and the new conformance check.
Description check ✅ Passed The description matches the changeset and accurately describes the new recipe, IMEX compute-domain support, conformance validator, docs, and testing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demos/slinky-slurm-demo.html`:
- Around line 263-297: The Slinky Slurm e2e flow diagram still describes the old
H100 training path instead of the GB200 IMEX workflow added by this PR. Update
the relevant demo sections and labels in the flow/expectation content to
reference the new GB200 IMEX symbols, including slurm-slurm-imex-compute-domain
and slurm-slurm-imex-channel, and remove or replace the old H100-specific terms
like h100-eks-ubuntu-training-slurm, Gres=gpu:h100:8, and slinky-slurm-health.
Keep the diagram and narrative consistent with the new deployment and
conformance steps so the demo matches the feature being introduced.
- Line 309: Add type="button" to the buttons in the deck UI so they do not
default to submit behavior when reused inside a form. Update the button elements
in the bar markup, including the copy control in the slinky-slurm-demo HTML, and
apply the same fix to any similar button elements in that section so their
behavior is explicitly non-submitting.
- Around line 616-623: The copy-button handler in the
document.querySelectorAll('.code .copy') block does not handle failures from
navigator.clipboard.writeText, so clipboard permission or insecure-context
errors are ignored. Update the click handler to add a rejection path after
writeText succeeds or fails, using the existing btn/pre logic to show a fallback
UI or error state when copying is not available. Keep the change localized to
the copy button event listener so the behavior remains consistent with the
current clipboard workflow.

In `@recipes/checks/slinky-slurm-imex-compute-domain/health-check.yaml`:
- Around line 46-50: Update the ResourceClaimTemplate manifest in the
health-check recipe to use resource.k8s.io/v1beta1 instead of resource.k8s.io/v1
so the assert matches the Kubernetes floor supported by this check. Locate the
ResourceClaimTemplate entry in the health-check YAML for the
slinky-slurm-imex-channels resource and change only the apiVersion to the served
beta version.

In
`@recipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yaml`:
- Around line 15-16: The conditional in compute-domain.yaml is ignoring an
explicit enabled: false because the default true is applied before checking the
value, so the ComputeDomain still renders. Update the guard around the
$imex.enabled check to preserve an explicit false by first checking whether the
key exists in .Values["slinky-slurm-imex-compute-domain"] and only falling back
to the default when it is unset. Keep the fix localized to the template logic
that gates rendering of the ComputeDomain resource.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 80d336d8-186e-4fac-afef-0f88c5a21015

📥 Commits

Reviewing files that changed from the base of the PR and between 003fe8a and f95f180.

📒 Files selected for processing (15)
  • demos/slinky-slurm-demo.html
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • docs/user/recipe-health.md
  • pkg/recipe/deployment_order_guard_test.go
  • pkg/recipe/metadata_store_test.go
  • recipes/checks/slinky-slurm-imex-compute-domain/health-check.yaml
  • recipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yaml
  • recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml
  • recipes/registry.yaml
  • recipes/validators/README.md
  • recipes/validators/catalog.yaml
  • validators/conformance/main.go
  • validators/conformance/slinky_slurm_imex_channel_check.go
  • validators/conformance/slinky_slurm_imex_channel_check_test.go

Comment thread demos/slinky-slurm-demo.html
Comment thread demos/slinky-slurm-demo.html
Comment thread demos/slinky-slurm-demo.html
Comment thread recipes/checks/slinky-slurm-imex-compute-domain/health-check.yaml Outdated
Comment thread recipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yaml Outdated
@github-actions

Copy link
Copy Markdown
Contributor

@kaynetu this PR now has merge conflicts with main. Please rebase to resolve them.

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Persona Review — GB200 IMEX compute domain + channel conformance

Method: 4 independent persona reviewers (Correctness, Domain/Architecture, Test-coverage, Operability/CI-DX), with every finding re-derived from the resolved code at the head commit and the highest-tier claims independently reproduced. Two personas raised blockers; one survived verification, one was downgraded, and both Helm-hook "Major" claims were refuted against the repo's established idiom.

Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick

Overall assessment

A well-scoped, well-tested feature that closely mirrors the existing Slinky-Slurm patterns (shared check helpers, manifest-only-via-Helm component idiom, deployment-order guard tests, BOM/recipe-health regen all internally consistent). The conformance check's parse logic is sound and robustly gated. Two small, real defects are worth fixing before merge — both are quick.

Confirmed non-issues (examined, refuted)

  • Post-install hook creates "hidden ordering" riskRefuted. The hook block is byte-identical to the universal repo idiom (gpu-operator-ocp, nfd-ocp, agentgateway, nodewright-*), and localformat's injectPostInstallHooks applies post-install hooks to all recipe-side manifests by design. Cross-component ordering (DRA driver → compute-domain → slinky-slurm) is enforced by the dependency graph and verified in deployment_order_guard_test.go, not by the hook.
  • before-hook-creation delete-policy churn on upgradeRefuted. Identical to every sibling manifest component; speculative and repo-wide if real.
  • resourceClaims name must stay in sync — both imex-channels references sit a few lines apart in the same overlay; trivially co-located.
  • type: Helm + empty repo for a manifest-only component — established idiom.
  • :latest validator image tag — consistent with all 13 conformance catalog entries; no new drift.
  • BOM / recipe-health / component-catalog regen — internally consistent (34 components, 38 recipes, new gb200-slurm row C:9); no hand-edit signatures.

Summary

🔴 Blocker 1 · 🟠 Major 1 · 🟡 Minor 1 · 🔵 Nitpick 3 — both must-fixes are small.

Review assembled with a multi-persona + adversarial meta-review workflow.

Comment thread recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml Outdated
Comment thread validators/conformance/slinky_slurm_imex_channel_check.go Outdated
Comment thread validators/conformance/slinky_slurm_imex_channel_check_test.go
Comment thread recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml
Comment thread validators/conformance/slinky_slurm_imex_channel_check_test.go
Comment thread pkg/recipe/metadata_store_test.go
@kaynetu kaynetu force-pushed the feat/gb200-slinky-slurm-imex branch from f95f180 to 2d0941e Compare June 29, 2026 21:56

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml`:
- Around line 78-82: The `slinky-slurm-imex-compute-domain` overlay entry is
incorrectly changing a manifest/direct component to `type: Helm`, which the
overlay model does not support. Update the `slinky-slurm-imex-compute-domain`
reference in the overlay to keep the same manifest/direct type as the
registry/documented component, and leave it pointing at its manifest file
instead of declaring it as Helm.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0d08d075-e307-43e4-b68e-48ab8e770681

📥 Commits

Reviewing files that changed from the base of the PR and between f95f180 and 2d0941e.

📒 Files selected for processing (16)
  • demos/slinky-slurm-demo.html
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • docs/user/recipe-health.md
  • pkg/collector/k8s/server_test.go
  • pkg/recipe/deployment_order_guard_test.go
  • pkg/recipe/metadata_store_test.go
  • recipes/checks/slinky-slurm-imex-compute-domain/health-check.yaml
  • recipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yaml
  • recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml
  • recipes/registry.yaml
  • recipes/validators/README.md
  • recipes/validators/catalog.yaml
  • validators/conformance/main.go
  • validators/conformance/slinky_slurm_imex_channel_check.go
  • validators/conformance/slinky_slurm_imex_channel_check_test.go

Comment thread recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml Outdated

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few items remain after the latest push. Inline comments below.

Separately: this branch currently has a merge commit ("Merge branch 'main' into feat/…") — please rebase onto origin/main and squash to a single signed commit per repo policy before merge.

Comment thread validators/conformance/slinky_slurm_imex_channel_check.go Outdated
Comment thread validators/conformance/slinky_slurm_imex_channel_check.go
Comment thread recipes/components/slinky-slurm-imex-compute-domain/manifests/compute-domain.yaml Outdated
Comment thread recipes/checks/slinky-slurm-imex-compute-domain/health-check.yaml Outdated
Comment thread docs/user/recipe-health.md Outdated
@kaynetu kaynetu force-pushed the feat/gb200-slinky-slurm-imex branch from 782f0cb to f1180cb Compare June 30, 2026 21:36
@kaynetu kaynetu changed the title feat(slinky-slurm): add GB200 IMEX compute domain and channel conformance feat(slinky-slurm): add GB200 slur吗IMEX compute domain and channel conformance Jun 30, 2026
@kaynetu kaynetu changed the title feat(slinky-slurm): add GB200 slur吗IMEX compute domain and channel conformance feat(slinky-slurm): add GB200 slurm recipe, IMEX compute domain and channel conformance Jun 30, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/slinky_slurm_imex_channel_check.go`:
- Around line 39-73: The Slinky Slurm IMEX shell check uses a wall-time limit
that is shorter than the script’s 40-second hold, so the overlap check cannot
complete as intended. Update the `slinkySlurmIMEXChannelShell` command to use a
`--time` value longer than the `sleep 40` window, and then adjust the expected
command string in `TestSlinkySlurmIMEXCommandIsBoundedAndResourceScoped` to
match the new timeout.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 55b378c6-532f-4f2c-aa0e-dac1e444e9b1

📥 Commits

Reviewing files that changed from the base of the PR and between 2d0941e and f1180cb.

📒 Files selected for processing (13)
  • demos/slinky-slurm-demo.html
  • docs/user/component-catalog.md
  • docs/user/recipe-health.md
  • pkg/collector/k8s/server_test.go
  • pkg/recipe/deployment_order_guard_test.go
  • pkg/recipe/metadata_store_test.go
  • recipes/components/slinky-slurm/manifests/compute-domain.yaml
  • recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml
  • recipes/validators/README.md
  • recipes/validators/catalog.yaml
  • validators/conformance/main.go
  • validators/conformance/slinky_slurm_imex_channel_check.go
  • validators/conformance/slinky_slurm_imex_channel_check_test.go

Comment thread validators/conformance/slinky_slurm_imex_channel_check.go
@kaynetu kaynetu force-pushed the feat/gb200-slinky-slurm-imex branch from f1180cb to 3fe2ebd Compare June 30, 2026 22:00
@github-actions

Copy link
Copy Markdown
Contributor

@github-actions

Copy link
Copy Markdown
Contributor

Recipe evidence check

Other affected recipes without evidence yet: 1

These recipes are affected by this PR but carry no committed evidence pointer, so there is
nothing to verify. This is expected — evidence is hardware-gated and added over time.

  • gb200-eks-ubuntu-training-slurm

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@demos/slinky-slurm-demo.html`:
- Around line 605-614: The global keydown handler in the document listener
should skip shortcuts when the event target is an interactive control. Update
the shortcut guard around the handler that processes ArrowRight, Space,
PageDown, n, and the other navigation keys so it ignores events from inputs,
buttons, selects, textareas, and contenteditable elements. Keep the existing
navigation logic in the same document.addEventListener('keydown', ...) block,
but only call next(), prev(), show(), or fullscreen actions when the focused
target is not interactive.

In `@pkg/recipe/metadata_store_test.go`:
- Around line 787-799: The wiring test is missing an assertion for the pod-level
claim name, so add a check in the same test around podClaims in
metadata_store_test.go that verifies
nodesets.slinky.podSpec.resourceClaims[0].name is "imex-channels" in addition to
the existing resourceClaimTemplateName and container claim assertions. Use the
existing podClaims and assertSingleNameField helpers (and the same
nodesets.slinky path) so the test explicitly ties the pod-level claim name to
the slurmd.resources.claims reference.

In `@recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml`:
- Around line 21-25: The overlay comment uses an incorrect component key in the
`aicr bundle --set` example; update the example to match the actual
`componentRef` name used by the overlay, `slinky-slurm`, so users copy a valid
override key. Check the inline explanatory comment near the GB200/EKS/Ubuntu
training overlay and keep the example aligned with the `componentRef`
declaration below.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: dc43b7e6-5116-460a-b489-410082ec9e42

📥 Commits

Reviewing files that changed from the base of the PR and between f1180cb and 3fe2ebd.

📒 Files selected for processing (13)
  • demos/slinky-slurm-demo.html
  • docs/user/component-catalog.md
  • docs/user/recipe-health.md
  • pkg/collector/k8s/server_test.go
  • pkg/recipe/deployment_order_guard_test.go
  • pkg/recipe/metadata_store_test.go
  • recipes/components/slinky-slurm/manifests/compute-domain.yaml
  • recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml
  • recipes/validators/README.md
  • recipes/validators/catalog.yaml
  • validators/conformance/main.go
  • validators/conformance/slinky_slurm_imex_channel_check.go
  • validators/conformance/slinky_slurm_imex_channel_check_test.go

Comment thread demos/slinky-slurm-demo.html
Comment thread pkg/recipe/metadata_store_test.go
Comment thread recipes/overlays/gb200-eks-ubuntu-training-slurm.yaml
@kaynetu kaynetu force-pushed the feat/gb200-slinky-slurm-imex branch from 735f82f to b552459 Compare June 30, 2026 23:14

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Persona Re-Review (v2)

Re-reviewed at head b55245969 (4 persona reviewers re-run + a senior adversarial meta-review). Follow-up to my earlier REQUEST_CHANGES.

Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick

Both prior blockers resolved ✅

  • apiVersion — overlay is now aicr.run/v1alpha2.
  • shell fail-open — tail is now test "$first_rc" -eq 0 && test "$second_rc" -eq 0, with a test asserting it.

Prior minors also closed: malformed-path branch (TestParseSlinkySlurmIMEXChannelsRejectsMalformedPath) and nil-context guards (TestCheckSlinkySlurmIMEXChannelRequiresContext).

Beyond the patch, the design was restructured: the standalone slinky-slurm-imex-compute-domain component was removed, the ComputeDomain moved to a pre-manifest under slinky-slurm with a hardcoded (immutable) identity, and the check rewritten to gate on live cluster resources (fail-closed). Check coverage is strong (88.9% / 94.1% / 100% / 100% / 100%).

Overall assessment

Clean, well-tested, and materially improved. No blocker or major issues survive the meta-review — the two inline 🟡s and one 🔵 below are all non-blocking. Recommendation: approve (with optional follow-ups).

Confirmed non-issues (examined, refuted)

  • Async RCT reconciliation race — pre-manifest + dependencyRefs ordering; pods Pending until the RCT reconciles is normal self-healing DRA behavior, not a defect.
  • NotFound error-message ambiguity — message already includes kind + namespace + name.
  • errgroup for 2 reads — idiomatic fan-out with correct ctx cancellation.
  • getDynamicClient nil path untested — unreachable given the entry guards (normal hot path).
  • Helper error paths untested in the IMEX test — covered in slinky_slurm_health_check_test.go.
  • Docs / BOM / demo — regenerated and consistent; BOM correctly needs no regen; no stale references to the removed component.
  • pkg/collector/k8s/server_test.go — small KUBECONFIG/HOME test-isolation hardening bundled in; benign.

Summary

🔴 Blocker 0 · 🟠 Major 0 · 🟡 Minor 2 · 🔵 Nitpick 1 — nothing blocking.

Review assembled with a multi-persona + adversarial meta-review workflow.

Comment thread recipes/components/slinky-slurm/manifests/compute-domain.yaml
Comment thread validators/conformance/slinky_slurm_imex_channel_check.go Outdated
Comment thread validators/conformance/slinky_slurm_imex_channel_check_test.go
…ance validation

Signed-off-by: Kayne Tu <kaynet@nvidia.com>
@kaynetu kaynetu force-pushed the feat/gb200-slinky-slurm-imex branch from b552459 to 794a1bf Compare July 2, 2026 16:10
@kaynetu kaynetu requested a review from njhensley July 2, 2026 16:45

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All prior review findings resolved at 794a1bf (verified + conformance/recipe tests pass):

  • apiVersion migrated to aicr.run/v1alpha2 (blocker)
  • Shell now asserts both jobs' exit status (first_rc/second_rc) — fail-open closed (major)
  • requireSlinkySlurmIMEXResources reordered after discover + KWOK skip, with test
  • Malformed-path, nil-context, and KWOK-skip branches now covered by tests
  • Auto-skip documented in validation block; NodeSet claim ref flagged as non-overridable internal integration value

Clearing my CHANGES_REQUESTED. Approving.

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All prior review findings resolved at 794a1bf (verified; conformance + recipe tests pass): apiVersion migrated to aicr.run/v1alpha2 (blocker); shell asserts both jobs' exit status, closing the fail-open (major); requireSlinkySlurmIMEXResources reordered after discover + KWOK skip; malformed-path, nil-context, and KWOK-skip branches now tested; auto-skip documented and NodeSet claim ref flagged as non-overridable. Clearing my CHANGES_REQUESTED.

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All prior review findings resolved at 794a1bf (verified; conformance + recipe tests pass): apiVersion migrated to aicr.run/v1alpha2 (blocker); shell asserts both jobs' exit status, closing the fail-open (major); requireSlinkySlurmIMEXResources reordered after discover + KWOK skip; malformed-path, nil-context, and KWOK-skip branches now tested; auto-skip documented and NodeSet claim ref flagged as non-overridable. Clearing my CHANGES_REQUESTED.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants