feat(slinky-topograph): topograph component with slinky engine #1554

Open
ravisoundar wants to merge 2 commits into NVIDIA:main from ravisoundar:rs-topograph

Conversation

@ravisoundar

Summary

Registers the slinky-topograph component and adds health checks for it.
Updates the recipes h100-kind-training-slurm and h100-gke-cos-training-slurm to include topograph.

Motivation / Context

Fixes:
Related: #1496

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/registry.yaml, recipes/components/, recipes/checks/, recipes/overlays/

Implementation Notes

New slinky-topograph component: adds a Slinky/Slurm-scoped instance of Topograph, the topology utility that generates the Slurm topology.conf by querying cloud provider topology APIs (GCP, AWS, OCI …). It feeds topology data to the Slinky-managed Slurm scheduler so that the scheduler can make topology-aware placement decisions.
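
For orientation, the sketch below shows roughly what the generated output could look like if it is delivered as a ConfigMap key, which is how a later review comment in this thread describes it (slurm/slinky-slurm-config, key topology.conf). The switch and node names are made up for illustration, and the tree-style SwitchName lines are standard Slurm topology.conf syntax, not output captured from this PR.

```yaml
# Illustrative only: switch and node names are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: slinky-slurm-config   # name and namespace as referenced in the review thread
  namespace: slurm
data:
  topology.conf: |
    # Leaf switches list the nodes attached to them.
    SwitchName=leaf0 Nodes=node[0-3]
    SwitchName=leaf1 Nodes=node[4-7]
    # The spine switch connects the leaf switches.
    SwitchName=spine0 Switches=leaf[0-1]
```

With a layout like this, the scheduler can prefer placing a job's nodes under a single leaf switch before spanning the spine.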

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify
| Action | Result |
|---|---|
| test coverage | PASS: 77.3% coverage (threshold 75%) |
| lint | PASS: 0 issues |
| e2e chainsaw tests | PASS: 23/23 |

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Welcome to AICR, @ravisoundar! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions

Recipe evidence check

Registry change: scoped to recipes that reference a changed component
entry in recipes/registry.yaml (not every leaf).

Other affected recipes without evidence yet: 2

These recipes are affected by this PR but carry no committed evidence pointer, so there is
nothing to verify. This is expected — evidence is hardware-gated and added over time.

  • h100-gke-cos-training-slurm
  • h100-kind-training-slurm

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

A new optional component, slinky-topograph, is added to generate Slurm topology.conf from cloud provider topology APIs. The change includes a registry entry in recipes/registry.yaml, default Helm values in recipes/components/slinky-topograph/values.yaml, wiring into the GKE and kind Slurm recipe overlays with slinky-slurm depending on slinky-topograph, a Chainsaw health-check manifest for the topograph namespace, and updates to the component catalog, recipe development guide, and container image BOM.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly names the new slinky-topograph component and its Slinky engine backing. |
| Description check | ✅ Passed | The description matches the change set by summarizing the new component, health checks, recipe updates, and docs. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/component-catalog.md`:
- Line 58: The description of the `slinky-topograph` optional component includes
an unsubstantiated “queue depth” responsibility that is not supported by the
current docs or PR changes. Update the wording in the component-catalog entry
for `slinky-topograph` to describe only the topology-related data it provides,
and remove the queue depth mention while keeping the rest of the `Topology-aware
optional components` explanation intact.

In `@recipes/components/slinky-topograph/values.yaml`:
- Around line 23-26: The supported-provider documentation in values.yaml is
inconsistent with the overlays because the h100-kind-training-slurm overlay uses
global.provider.name: test while the list omits it. Update the provider list in
the values.yaml comment to include test if it is truly supported, or change the
overlay to use one of the documented providers; keep the guidance aligned with
global.provider.name and the overlay values so the documented set matches actual
usage.
- Around line 78-81: The comment for the node-data-broker sub-chart is
misleading because it says the dependency is always deployed while the
node-data-broker block in values.yaml disables it by default. Update the comment
near the node-data-broker configuration to accurately describe the current
behavior, using the node-data-broker key and its enabled setting as the
reference point.
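
A minimal sketch of how the comment could be aligned with the default, assuming the sub-chart is toggled through an `enabled` key under `node-data-broker` as the finding describes. The comment wording is a suggestion, not the text committed in this PR.

```yaml
# node-data-broker sub-chart: disabled by default.
# Set enabled: true only when the node-data-broker DaemonSet should be deployed.
node-data-broker:
  enabled: false
```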
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 520f85d3-6eeb-4eba-bc24-25544f0fd1a1

📥 Commits

Reviewing files that changed from the base of the PR and between 702f9ee and 1d486a0.

📒 Files selected for processing (8)
  • docs/integrator/recipe-development.md
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • recipes/checks/slinky-topograph/health-check.yaml
  • recipes/components/slinky-topograph/values.yaml
  • recipes/overlays/h100-gke-cos-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/registry.yaml

Comment thread docs/user/component-catalog.md Outdated
Comment thread recipes/components/slinky-topograph/values.yaml
Comment thread recipes/components/slinky-topograph/values.yaml Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/checks/slinky-topograph/health-check.yaml`:
- Around line 33-111: The health-check in
validate-topograph-deployment/validate-node-observer-deployment/validate-all-pods-healthy
only verifies pod readiness and unhealthy pod states, so it can miss failures in
the topology generation pipeline. Add a Chainsaw assert in this same check that
validates the expected slurm/slinky-slurm-config ConfigMap and its topology.conf
key after the Slurm release exists, using the existing validate-* steps as the
place to locate and extend the coverage.
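
One possible shape for such an assertion is sketched below. The test and step names are hypothetical, the ConfigMap name and namespace are taken from the finding above, and the key check uses a JMESPath expression in Chainsaw's parenthesized assertion syntax. It only confirms that the topology.conf key exists, not that its contents are valid.

```yaml
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: slinky-topograph-topology-conf   # hypothetical test name
spec:
  steps:
    - name: validate-topology-conf       # hypothetical step name
      try:
        - assert:
            resource:
              apiVersion: v1
              kind: ConfigMap
              metadata:
                name: slinky-slurm-config
                namespace: slurm
              # Passes only when the ConfigMap data contains a topology.conf key.
              (contains(keys(data), 'topology.conf')): true
```

Because the ConfigMap belongs to the Slurm release, this step would need to run after that release exists, as the finding notes.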
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 743af252-7d78-426a-b9f5-ed4239da9d3b

📥 Commits

Reviewing files that changed from the base of the PR and between c2cba00 and 794ac3b.

📒 Files selected for processing (8)
  • docs/integrator/recipe-development.md
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • recipes/checks/slinky-topograph/health-check.yaml
  • recipes/components/slinky-topograph/values.yaml
  • recipes/overlays/h100-gke-cos-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/registry.yaml

Comment thread recipes/checks/slinky-topograph/health-check.yaml
@ravisoundar ravisoundar force-pushed the rs-topograph branch 2 times, most recently from 0e9fc83 to 35b7b79, June 30, 2026 19:39

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/component-catalog.md`:
- Line 42: Update the `slinky-topograph` documentation entry to make it clear
that opt-in requires both adding the component via `componentRef` and declaring
any `dependencyRefs`; the current wording in the component catalog and the
related recipe text should be aligned so authors understand `dependencyRefs`
alone is not sufficient. Use the `slinky-topograph` table entry and the example
recipe section as the places to revise the language.
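
As a rough illustration of the distinction, a recipe overlay that opts in might look like the sketch below. The `componentRefs` list and the `name` field are assumptions about the recipe schema made for illustration. Only `dependencyRefs` and the two component names come from this thread.

```yaml
# Hypothetical overlay fragment; field layout is assumed, not copied from the PR.
componentRefs:
  # 1. Opt in by listing the component itself.
  - name: slinky-topograph
  # 2. Declare the ordering: slinky-slurm depends on slinky-topograph.
  - name: slinky-slurm
    dependencyRefs:
      - slinky-topograph
```

Listing slinky-topograph only under `dependencyRefs`, without its own entry, would not add the component to the recipe.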
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f2abe02a-1e22-48f5-8636-ce137a6f088c

📥 Commits

Reviewing files that changed from the base of the PR and between 794ac3b and 35b7b79.

📒 Files selected for processing (8)
  • docs/integrator/recipe-development.md
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • recipes/checks/slinky-topograph/health-check.yaml
  • recipes/components/slinky-topograph/values.yaml
  • recipes/overlays/h100-gke-cos-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/registry.yaml

Comment thread docs/user/component-catalog.md Outdated
Comment thread recipes/overlays/h100-gke-cos-training-slurm.yaml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
recipes/overlays/h100-gke-cos-training-slurm.yaml (1)

82-90: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Hardcoded GCP service-account email reduces portability.

topograph-compute@eidosx.iam.gserviceaccount.com is baked directly into the shared recipe overlay. Any user/org deploying this recipe on their own GCP project will need to discover and override this value manually (or the Workload Identity binding will simply fail). Consider parameterizing this via a valuesFile/--set override pattern instead of an inline literal, or at least add a comment noting this is an example/test value that must be replaced for other GCP projects.
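
If the literal has to stay inline, a commented placeholder along these lines would at least flag it. The nesting under `serviceAccount.annotations` is an assumption about the overlay's layout, and `<GSA_NAME>` and `<PROJECT_ID>` are placeholders, not values from this PR.

```yaml
serviceAccount:
  annotations:
    # Example value only: replace with a Google service account in your own
    # GCP project, otherwise the Workload Identity binding will fail.
    iam.gke.io/gcp-service-account: "<GSA_NAME>@<PROJECT_ID>.iam.gserviceaccount.com"
```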

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@recipes/overlays/h100-gke-cos-training-slurm.yaml` around lines 82 - 90, The
serviceAccount annotation in the overlay is hardcoded to a specific GCP
service-account email, which makes the recipe non-portable across projects.
Update the overlay to use a configurable value for the
iam.gke.io/gcp-service-account annotation, ideally wired through the existing
recipe/values override path used by this Slurm overlay, and if keeping a literal
is unavoidable, add a clear note in the overlay near serviceAccount that it is
only an example and must be replaced for each GCP project.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/checks/slinky-topograph/health-check.yaml`:
- Around line 57-68: The validate-node-data-broker-daemonset check is asserting
a DaemonSet that is disabled by default, so it will fail unless the component is
explicitly enabled. Update the health-check logic to gate this assertion behind
the same enable flag used for node-data-broker, or adjust the Topograph values
to enable the subchart before running this check, so the DaemonSet assertion
only runs when the resource is expected to exist.

---

Outside diff comments:
In `@recipes/overlays/h100-gke-cos-training-slurm.yaml`:
- Around line 82-90: The serviceAccount annotation in the overlay is hardcoded
to a specific GCP service-account email, which makes the recipe non-portable
across projects. Update the overlay to use a configurable value for the
iam.gke.io/gcp-service-account annotation, ideally wired through the existing
recipe/values override path used by this Slurm overlay, and if keeping a literal
is unavoidable, add a clear note in the overlay near serviceAccount that it is
only an example and must be replaced for each GCP project.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 400050d7-4ab5-43ae-9d1f-0b2461c4971b

📥 Commits

Reviewing files that changed from the base of the PR and between 35b7b79 and 288d26d.

📒 Files selected for processing (7)
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • recipes/checks/slinky-topograph/health-check.yaml
  • recipes/components/slinky-topograph/values.yaml
  • recipes/overlays/h100-gke-cos-training-slurm.yaml
  • recipes/overlays/h100-kind-training-slurm.yaml
  • recipes/registry.yaml

Comment thread recipes/checks/slinky-topograph/health-check.yaml
@ravisoundar ravisoundar force-pushed the rs-topograph branch 3 times, most recently from 86fc341 to 1f2c2e6, June 30, 2026 23:14
@ravisoundar ravisoundar marked this pull request as ready for review June 30, 2026 23:28
@ravisoundar ravisoundar requested review from a team as code owners June 30, 2026 23:28
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Signed-off-by: Ravi Shankar <ravish@nvidia.com>