feat(slinky-topograph): topograph component with slinky engine#1554
feat(slinky-topograph): topograph component with slinky engine#1554ravisoundar wants to merge 2 commits into
Conversation
|
Welcome to AICR, @ravisoundar! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
|
🌿 Preview your docs: https://nvidia-preview-rs-topograph.docs.buildwithfern.com/aicr |
1d486a0 to
c2cba00
Compare
Recipe evidence check
Other affected recipes without evidence yet: 2These recipes are affected by this PR but carry no committed evidence pointer, so there is
This gate is warning-only and never blocks merge. See ADR-007 for the trust model. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughA new optional component, Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/user/component-catalog.md`:
- Line 58: The description of the `slinky-topograph` optional component includes
an unsubstantiated “queue depth” responsibility that is not supported by the
current docs or PR changes. Update the wording in the component-catalog entry
for `slinky-topograph` to describe only the topology-related data it provides,
and remove the queue depth mention while keeping the rest of the `Topology-aware
optional components` explanation intact.
In `@recipes/components/slinky-topograph/values.yaml`:
- Around line 23-26: The supported-provider documentation in values.yaml is
inconsistent with the overlays because the h100-kind-training-slurm overlay uses
global.provider.name: test while the list omits it. Update the provider list in
the values.yaml comment to include test if it is truly supported, or change the
overlay to use one of the documented providers; keep the guidance aligned with
global.provider.name and the overlay values so the documented set matches actual
usage.
- Around line 78-81: The comment for the node-data-broker sub-chart is
misleading because it says the dependency is always deployed while the
node-data-broker block in values.yaml disables it by default. Update the comment
near the node-data-broker configuration to accurately describe the current
behavior, using the node-data-broker key and its enabled setting as the
reference point.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 520f85d3-6eeb-4eba-bc24-25544f0fd1a1
📒 Files selected for processing (8)
docs/integrator/recipe-development.mddocs/user/component-catalog.mddocs/user/container-images.mdrecipes/checks/slinky-topograph/health-check.yamlrecipes/components/slinky-topograph/values.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
c2cba00 to
794ac3b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/checks/slinky-topograph/health-check.yaml`:
- Around line 33-111: The health-check in
validate-topograph-deployment/validate-node-observer-deployment/validate-all-pods-healthy
only verifies pod readiness and unhealthy pod states, so it can miss failures in
the topology generation pipeline. Add a Chainsaw assert in this same check that
validates the expected slurm/slinky-slurm-config ConfigMap and its topology.conf
key after the Slurm release exists, using the existing validate-* steps as the
place to locate and extend the coverage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 743af252-7d78-426a-b9f5-ed4239da9d3b
📒 Files selected for processing (8)
docs/integrator/recipe-development.mddocs/user/component-catalog.mddocs/user/container-images.mdrecipes/checks/slinky-topograph/health-check.yamlrecipes/components/slinky-topograph/values.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
0e9fc83 to
35b7b79
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/user/component-catalog.md`:
- Line 42: Update the `slinky-topograph` documentation entry to make it clear
that opt-in requires both adding the component via `componentRef` and declaring
any `dependencyRefs`; the current wording in the component catalog and the
related recipe text should be aligned so authors understand `dependencyRefs`
alone is not sufficient. Use the `slinky-topograph` table entry and the example
recipe section as the places to revise the language.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: f2abe02a-1e22-48f5-8636-ce137a6f088c
📒 Files selected for processing (8)
docs/integrator/recipe-development.mddocs/user/component-catalog.mddocs/user/container-images.mdrecipes/checks/slinky-topograph/health-check.yamlrecipes/components/slinky-topograph/values.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
35b7b79 to
288d26d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
recipes/overlays/h100-gke-cos-training-slurm.yaml (1)
82-90: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueHardcoded GCP service-account email reduces portability.
topograph-compute@eidosx.iam.gserviceaccount.comis baked directly into the shared recipe overlay. Any user/org deploying this recipe on their own GCP project will need to discover and override this value manually (or the Workload Identity binding will simply fail). Consider parameterizing this via avaluesFile/--setoverride pattern instead of an inline literal, or at least add a comment noting this is an example/test value that must be replaced for other GCP projects.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@recipes/overlays/h100-gke-cos-training-slurm.yaml` around lines 82 - 90, The serviceAccount annotation in the overlay is hardcoded to a specific GCP service-account email, which makes the recipe non-portable across projects. Update the overlay to use a configurable value for the iam.gke.io/gcp-service-account annotation, ideally wired through the existing recipe/values override path used by this Slurm overlay, and if keeping a literal is unavoidable, add a clear note in the overlay near serviceAccount that it is only an example and must be replaced for each GCP project.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/checks/slinky-topograph/health-check.yaml`:
- Around line 57-68: The validate-node-data-broker-daemonset check is asserting
a DaemonSet that is disabled by default, so it will fail unless the component is
explicitly enabled. Update the health-check logic to gate this assertion behind
the same enable flag used for node-data-broker, or adjust the Topograph values
to enable the subchart before running this check, so the DaemonSet assertion
only runs when the resource is expected to exist.
---
Outside diff comments:
In `@recipes/overlays/h100-gke-cos-training-slurm.yaml`:
- Around line 82-90: The serviceAccount annotation in the overlay is hardcoded
to a specific GCP service-account email, which makes the recipe non-portable
across projects. Update the overlay to use a configurable value for the
iam.gke.io/gcp-service-account annotation, ideally wired through the existing
recipe/values override path used by this Slurm overlay, and if keeping a literal
is unavoidable, add a clear note in the overlay near serviceAccount that it is
only an example and must be replaced for each GCP project.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 400050d7-4ab5-43ae-9d1f-0b2461c4971b
📒 Files selected for processing (7)
docs/user/component-catalog.mddocs/user/container-images.mdrecipes/checks/slinky-topograph/health-check.yamlrecipes/components/slinky-topograph/values.yamlrecipes/overlays/h100-gke-cos-training-slurm.yamlrecipes/overlays/h100-kind-training-slurm.yamlrecipes/registry.yaml
86fc341 to
1f2c2e6
Compare
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Summary
Registered
slinky-topographcomponent, and added health checks.Updated the recipes
h100-kind-training-slurmandh100-gke-cos-training-slurmto include topograph.Motivation / Context
Fixes:
Related: #1496
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)recipes/registry.yaml,recipes/components/,recipes/checks/,recipes/overlays/Implementation Notes
New slingy-topograph component Added a Slinky/Slurm-scoped instance of Topograph — the topology utility that generates Slurm topology.conf by querying cloud provider topology APIs (GCP, AWS, OCI …). Feeds topology data to the Slinky-managed Slurm scheduler so it can make topology-aware placement decisions.
Testing
# Commands run (prefer `make qualify` for non-trivial changes) make qualifyRisk Assessment
Rollout notes:
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info