693 changes: 693 additions & 0 deletions demos/slinky-slurm-demo.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/user/component-catalog.md
@@ -38,7 +38,7 @@ The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/
| **kubeflow-trainer** | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | [Kubeflow Trainer](https://github.com/kubeflow/trainer) |
| **slinky-slurm-operator-crds** | Custom Resource Definitions for the SchedMD Slinky Slurm operator. Installs the `slinky.slurm.net` CRDs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). Installed separately to support CRD lifecycle management. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm-operator** | SchedMD Slinky Slurm operator and admission webhook. Manages the lifecycle of Slurm clusters declared via Slinky CRs (Controller, NodeSet, LoginSet, Accounting, RestApi, Token). **Known limitation:** chart v1.1.0 silently ignores `operator.nodeSelector` and `webhook.nodeSelector` (current chart behavior, not a planned feature); tracking [SlinkyProject/slurm-operator#187](https://github.com/SlinkyProject/slurm-operator/pull/187) for the upstream fix. | [Slinky Slurm Operator](https://github.com/SlinkyProject/slurm-operator) |
| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Declared inline per slurm leaf overlay alongside `slinky-slurm-operator-crds` and `slinky-slurm-operator` (matching the dynamo-platform pattern) so each leaf can carry its own GPU/GRES tuning. Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) |
| **slinky-slurm** | Slinky-managed Slurm cluster instance: Controller (slurmctld) + LoginSet (sackd/sshd) + NodeSet (slurmd) + RestApi (slurmrestd). Reconciled by `slinky-slurm-operator`. Declared inline per slurm leaf overlay alongside `slinky-slurm-operator-crds` and `slinky-slurm-operator` (matching the dynamo-platform pattern) so each leaf can carry its own GPU/GRES tuning. IMEX-capable leaves attach a fixed NVIDIA DRA `ComputeDomain` as a pre-manifest before the Slurm chart; the DRA driver reconciles it asynchronously into the `ResourceClaimTemplate` consumed by the NodeSet. Accounting (slurmdbd) requires an external MariaDB and is disabled in defaults — see `recipes/components/slinky-slurm/values.yaml`. | [Slinky Slurm Cluster Chart](https://github.com/SlinkyProject/slurm-operator/tree/main/helm/slurm) |
| **nfd-ocp-olm** | OLM installer for Node Feature Discovery on OpenShift. Creates the OperatorGroup and Subscription resources that install NFD via the Operator Lifecycle Manager. Paired with `nfd-ocp`. OCP-specific. | [Node Feature Discovery (Certified)](https://catalog.redhat.com/software/container-stacks/detail/5ec53e8c110f56bd24f5f8db) |
| **nfd-ocp** | Node Feature Discovery CR for OpenShift. Configures NFD's operand (worker, topology updater) via a NodeFeatureDiscovery custom resource. Deployed after `nfd-ocp-olm`. OCP-specific. | [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) |
| **gpu-operator-ocp-olm** | OLM installer for the GPU Operator on OpenShift. Creates the OperatorGroup and Subscription resources that install the certified GPU Operator via the Operator Lifecycle Manager. Paired with `gpu-operator-ocp`. OCP-specific. | [NVIDIA GPU Operator (Certified)](https://catalog.redhat.com/software/container-stacks/detail/5e7b210b8a3c1e00013d636d) |
@@ -53,7 +53,7 @@ Not every component appears in every recipe. The recipe engine selects component
- **Base components** (cert-manager, kube-prometheus-stack) appear in most recipes.
- **Cloud-specific components** (aws-efa, aws-ebs-csi-driver) are added when the service matches. OCP recipes replace base components (gpu-operator, nfd, network-operator) with OLM+CR pairs (e.g., `gpu-operator-ocp-olm` + `gpu-operator-ocp`).
- **Intent-specific components** (agentgateway, agentgateway-crds) are added based on workload intent (e.g., inference recipes include the inference gateway).
- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. Leaves that want the operator only inline the CRDs + operator and omit the `slinky-slurm` componentRef. For an end-to-end walkthrough (recipe → bundle → install → validate → `srun` smoke job on EKS, GKE, or Kind), see [`demos/cuj1-slinky-slurm.md`](https://github.com/NVIDIA/aicr/blob/main/demos/cuj1-slinky-slurm.md).
- **Platform-specific components** (slinky-slurm-operator, slinky-slurm, kubeflow-trainer, dynamo-platform) are added when the recipe selects a matching `--platform`. For `--platform slurm`, all three core Slinky pieces (`slinky-slurm-operator-crds`, `slinky-slurm-operator`, `slinky-slurm`) are declared inline per slurm leaf overlay — the same shape `dynamo-platform` uses across `*-inference-dynamo` leaves. IMEX-capable Slurm leaves attach a fixed ComputeDomain through `slinky-slurm.preManifestFiles` so slurmd pods can consume DRA-provisioned IMEX channels. Leaves that want the operator only inline the CRDs + operator and omit the `slinky-slurm` componentRef. For an end-to-end walkthrough (recipe → bundle → install → validate → `srun` smoke job on EKS, GKE, or Kind), see [`demos/cuj1-slinky-slurm.md`](https://github.com/NVIDIA/aicr/blob/main/demos/cuj1-slinky-slurm.md).
- **Accelerator/OS-specific tuning** (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.

### NFD Topology Updater
8 changes: 6 additions & 2 deletions docs/user/recipe-health.md
@@ -34,8 +34,8 @@ The matrix is computed **hermetically and offline**: every signal is a pure read
{/* BEGIN AICR-HEALTH */}
## Summary

- Recipes: **39**
- Pass: **39** · Warn: **0** · Fail: **0** · Unknown: **0**
- Recipes: **43**
- Pass: **43** · Warn: **0** · Fail: **0** · Unknown: **0**

## Recipes

@@ -46,6 +46,7 @@ The matrix is computed **hermetically and offline**: every signal is a pure read
| gb200-any | — | gb200 | — | — | — | pass | R:0 D:4 P:0 C:0 | pending |
| h100-any | — | h100 | — | — | — | pass | R:0 D:4 P:0 C:0 | pending |
| h200-any | — | h200 | — | — | — | pass | R:0 D:4 P:0 C:0 | pending |
| l40s-any | — | l40s | — | — | — | pass | R:0 D:4 P:0 C:0 | pending |
| rtx-pro-6000-any | — | rtx-pro-6000 | — | — | — | pass | R:0 D:4 P:0 C:0 | pending |
| monitoring-hpa | — | — | — | — | — | pass | R:0 D:0 P:0 C:0 | pending |
| a100-aks-ubuntu-training-kubeflow | aks | a100 | ubuntu | training | kubeflow | pass | R:0 D:4 P:0 C:10 | pending |
@@ -56,6 +57,7 @@ The matrix is computed **hermetically and offline**: every signal is a pure read
| a100-eks-ubuntu-training-kubeflow | eks | a100 | ubuntu | training | kubeflow | pass | R:0 D:4 P:0 C:10 | pending |
| gb200-eks-ubuntu-inference-dynamo | eks | gb200 | ubuntu | inference | dynamo | pass | R:0 D:4 P:1 C:10 | pending |
| gb200-eks-ubuntu-training-kubeflow | eks | gb200 | ubuntu | training | kubeflow | pass | R:0 D:4 P:2 C:8 | pending |
| gb200-eks-ubuntu-training-slurm | eks | gb200 | ubuntu | training | slurm | pass | R:0 D:4 P:0 C:10 | pending |
| h100-eks-ubuntu-inference-dynamo | eks | h100 | ubuntu | inference | dynamo | pass | R:0 D:4 P:1 C:11 | pending |
| h100-eks-ubuntu-inference-nim | eks | h100 | ubuntu | inference | nim | pass | R:0 D:4 P:0 C:11 | pending |
| h100-eks-ubuntu-training-kubeflow | eks | h100 | ubuntu | training | kubeflow | pass | R:0 D:4 P:1 C:10 | pending |
@@ -80,5 +82,7 @@ The matrix is computed **hermetically and offline**: every signal is a pure read
| a100-oke-ubuntu-training-kubeflow | oke | a100 | ubuntu | training | kubeflow | pass | R:0 D:4 P:0 C:8 | pending |
| gb200-oke-ubuntu-inference-dynamo | oke | gb200 | ubuntu | inference | dynamo | pass | R:0 D:4 P:1 C:10 | pending |
| gb200-oke-ubuntu-training-kubeflow | oke | gb200 | ubuntu | training | kubeflow | pass | R:0 D:4 P:1 C:8 | pending |
| l40s-oke-inference | oke | l40s | ol | inference | — | pass | R:0 D:4 P:0 C:8 | pending |
| l40s-oke-training | oke | l40s | ol | training | — | pass | R:0 D:4 P:0 C:8 | pending |

{/* END AICR-HEALTH */}
11 changes: 11 additions & 0 deletions pkg/collector/k8s/server_test.go
@@ -16,6 +16,7 @@ package k8s

import (
"context"
"os"
"testing"
"time"

@@ -92,6 +93,16 @@ func TestKubernetesCollector_CollectWithTimeout(t *testing.T) {
}

func TestKubernetesCollector_ErrorRecovery_NilClient(t *testing.T) {
// Match the client package's discovery-isolation pattern so this test
// cannot select a real workstation kubeconfig.
t.Setenv("KUBECONFIG", os.Getenv("KUBECONFIG"))
if err := os.Unsetenv("KUBECONFIG"); err != nil {
t.Fatalf("unset KUBECONFIG: %v", err)
}
home := t.TempDir()
t.Setenv("HOME", home)
t.Setenv("USERPROFILE", home)

ctx := context.TODO()

// Create collector without a valid client
24 changes: 24 additions & 0 deletions pkg/recipe/deployment_order_guard_test.go
@@ -175,6 +175,30 @@ func TestDeploymentOrderGuards(t *testing.T) {
{"gpu-operator", "nvsentinel"},
},
},
{
name: "gb200-eks-ubuntu-training-slurm",
criteria: func() *Criteria {
c := NewCriteria()
c.Service = CriteriaServiceEKS
c.Accelerator = CriteriaAcceleratorGB200
c.OS = CriteriaOSUbuntu
c.Intent = CriteriaIntentTraining
c.Platform = CriteriaPlatformSlurm
return c
},
requiredDeps: map[string][]string{
"slinky-slurm-operator": {"cert-manager", "slinky-slurm-operator-crds"},
"slinky-slurm": {"nvidia-dra-driver-gpu", "slinky-slurm-operator", "slinky-slurm-operator-crds"},
},
requiredOrdering: [][2]string{
{"nvidia-dra-driver-gpu", "slinky-slurm"},
{"cert-manager", "slinky-slurm-operator"},
{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
{"slinky-slurm-operator", "slinky-slurm"},
{"slinky-slurm-operator-crds", "slinky-slurm"},
{"gpu-operator", "nvsentinel"},
},
},
{
name: "h100-eks-ubuntu-training-slurm",
criteria: func() *Criteria {
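Each `requiredOrdering` entry is a `{before, after}` pair. One way such pairs could be checked against a resolved deployment order is sketched below. `orderingViolations` and the sample order are assumptions made for illustration; the guard test's actual assertion logic is not shown in this hunk and may differ.

```go
package main

import "fmt"

// orderingViolations returns one message per pair whose first component does
// not come strictly before its second component in order. Components missing
// from order are reported as violations as well.
// Hypothetical helper; not the repository's implementation.
func orderingViolations(order []string, pairs [][2]string) []string {
	index := make(map[string]int, len(order))
	for i, name := range order {
		index[name] = i
	}

	var violations []string
	for _, p := range pairs {
		before, okBefore := index[p[0]]
		after, okAfter := index[p[1]]
		switch {
		case !okBefore || !okAfter:
			violations = append(violations,
				fmt.Sprintf("%s or %s missing from deployment order", p[0], p[1]))
		case before >= after:
			violations = append(violations,
				fmt.Sprintf("%s must deploy before %s", p[0], p[1]))
		}
	}
	return violations
}

func main() {
	// Assumed order for illustration only.
	order := []string{
		"cert-manager",
		"nvidia-dra-driver-gpu",
		"slinky-slurm-operator-crds",
		"slinky-slurm-operator",
		"slinky-slurm",
	}
	pairs := [][2]string{
		{"nvidia-dra-driver-gpu", "slinky-slurm"},
		{"cert-manager", "slinky-slurm-operator"},
		{"slinky-slurm-operator-crds", "slinky-slurm-operator"},
		{"slinky-slurm-operator", "slinky-slurm"},
		{"slinky-slurm-operator-crds", "slinky-slurm"},
	}
	for _, v := range orderingViolations(order, pairs) {
		fmt.Println(v)
	}
	fmt.Println("violations:", len(orderingViolations(order, pairs)))
}
```

Using an index map keeps the check linear in the number of components plus pairs, and reporting every violation rather than stopping at the first makes a failing guard easier to read.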
213 changes: 213 additions & 0 deletions pkg/recipe/metadata_store_test.go
@@ -27,6 +27,7 @@ import (
"testing"

aicrerrors "github.com/NVIDIA/aicr/pkg/errors"
"github.com/NVIDIA/aicr/pkg/manifest"
"golang.org/x/sync/errgroup"
"gopkg.in/yaml.v3"
)
@@ -631,6 +632,7 @@ func TestSlurmLeavesClearInheritedPerformancePhase(t *testing.T) {
}

for _, name := range []string{
"gb200-eks-ubuntu-training-slurm",
"h100-eks-ubuntu-training-slurm",
"h100-gke-cos-training-slurm",
} {
@@ -676,6 +678,18 @@ func TestSlurmLeavesAppendConformanceHealthCheck(t *testing.T) {
"secure-accelerator-access",
"slinky-slurm-health",
}
gb200ConformanceChecks := []string{
"platform-health",
"gpu-operator-health",
"dra-support",
"accelerator-metrics",
"ai-service-metrics",
"gang-scheduling",
"pod-autoscaling",
"cluster-autoscaling",
"slinky-slurm-health",
"slinky-slurm-imex-channel",
}
kindConformanceChecks := []string{
"platform-health",
"gpu-operator-health",
@@ -693,6 +707,7 @@ func TestSlurmLeavesAppendConformanceHealthCheck(t *testing.T) {
name string
want []string
}{
{name: "gb200-eks-ubuntu-training-slurm", want: gb200ConformanceChecks},
{name: "h100-eks-ubuntu-training-slurm", want: conformanceChecks},
{name: "h100-gke-cos-training-slurm", want: conformanceChecks},
{name: "h100-kind-training-slurm", want: kindConformanceChecks},
@@ -718,6 +733,204 @@ func TestSlurmLeavesAppendConformanceHealthCheck(t *testing.T) {
}
}

func TestGB200EKSSlurmWiresIMEXComputeDomain(t *testing.T) {
ctx := context.Background()
store, err := loadMetadataStore(ctx)
if err != nil {
t.Fatalf("failed to load metadata store: %v", err)
}

leaf, ok := store.GetRecipeByName("gb200-eks-ubuntu-training-slurm")
if !ok {
t.Fatal("overlay gb200-eks-ubuntu-training-slurm not found in store")
}
result, err := store.BuildRecipeResult(ctx, leaf.Spec.Criteria)
if err != nil {
t.Fatalf("BuildRecipeResult failed: %v", err)
}
if !slices.ContainsFunc(
result.Constraints,
func(c Constraint) bool {
return c.Name == "K8s.server.version" && c.Value == ">= 1.34"
},
) {
t.Errorf("constraints = %v, want K8s.server.version >= 1.34 for DRA v1", result.Constraints)
}

if computeDomain := result.GetComponentRef("slinky-slurm-imex-compute-domain"); computeDomain != nil {
t.Errorf("standalone IMEX ComputeDomain component = %+v, want absent", computeDomain)
}
slurm := result.GetComponentRef("slinky-slurm")
if slurm == nil {
t.Fatal("slinky-slurm component missing")
}
const manifestPath = "components/slinky-slurm/manifests/compute-domain.yaml"
if !slices.Contains(slurm.PreManifestFiles, manifestPath) {
t.Errorf("slinky-slurm preManifestFiles = %v, want %q", slurm.PreManifestFiles, manifestPath)
}
if !slices.Contains(slurm.DependencyRefs, "nvidia-dra-driver-gpu") {
t.Errorf("slinky-slurm dependencyRefs = %v, want nvidia-dra-driver-gpu", slurm.DependencyRefs)
}

values, err := result.GetValuesForComponent("slinky-slurm")
if err != nil {
t.Fatalf("GetValuesForComponent(slinky-slurm) failed: %v", err)
}
if got := valueAtPath[string](t, values, "controller", "extraConfMap", "SwitchType"); got != "switch/nvidia_imex" {
t.Errorf("controller.extraConfMap.SwitchType = %q, want switch/nvidia_imex", got)
}
if got := valueAtPath[string](t, values, "nodesets", "slinky", "extraConfMap", "Gres"); got != "gpu:gb200:4" {
t.Errorf("nodesets.slinky.extraConfMap.Gres = %q, want gpu:gb200:4", got)
}

podClaims := valueAtPath[[]any](t, values, "nodesets", "slinky", "podSpec", "resourceClaims")
assertSingleNameField(t, podClaims, "name", "imex-channels")
assertSingleNameField(t, podClaims, "resourceClaimTemplateName", "slinky-slurm-imex-channels")
nodeSetClaim, ok := podClaims[0].(map[string]any)
if !ok {
t.Fatalf("podClaims[0] = %T, want map[string]any", podClaims[0])
}
nodeSetRCTName, ok := nodeSetClaim["resourceClaimTemplateName"].(string)
if !ok {
t.Fatalf("podClaims[0].resourceClaimTemplateName = %T, want string", nodeSetClaim["resourceClaimTemplateName"])
}
containerClaims := valueAtPath[[]any](t, values, "nodesets", "slinky", "slurmd", "resources", "claims")
assertSingleNameField(t, containerClaims, "name", "imex-channels")
slurmd := valueAtPath[map[string]any](t, values, "nodesets", "slinky", "slurmd")
if got, ok := slurmd["securityContext"]; ok {
t.Errorf("nodesets.slinky.slurmd.securityContext = %v, want omitted to use chart default", got)
}

content, err := GetManifestContentWithContext(ctx, result.DataProvider(), manifestPath)
if err != nil {
t.Fatalf("GetManifestContentWithContext(%q) failed: %v", manifestPath, err)
}
rendered, err := manifest.Render(content, manifest.RenderInput{
ComponentName: slurm.Name,
Namespace: slurm.Namespace,
ChartName: slurm.Chart,
ChartVersion: slurm.Version,
Values: values,
})
if err != nil {
t.Fatalf("render ComputeDomain manifest: %v", err)
}
var computeDomain map[string]any
if err := yaml.Unmarshal(rendered, &computeDomain); err != nil {
t.Fatalf("unmarshal rendered ComputeDomain: %v", err)
}
computeDomainRCTName := valueAtPath[string](t, computeDomain, "spec", "channel", "resourceClaimTemplate", "name")
if computeDomainRCTName != nodeSetRCTName {
t.Errorf("ComputeDomain RCT name = %q, NodeSet RCT name = %q", computeDomainRCTName, nodeSetRCTName)
}
}

func TestSlinkySlurmIMEXComputeDomainFixedIdentityCannotBeOverridden(t *testing.T) {
ctx := context.Background()
const manifestPath = "components/slinky-slurm/manifests/compute-domain.yaml"
content, err := GetManifestContentWithContext(ctx, nil, manifestPath)
if err != nil {
t.Fatalf("GetManifestContentWithContext(%q) failed: %v", manifestPath, err)
}

// Scalar --set and typed --set-json/--set-file converge on this final
// values map before local manifests are rendered. None may change the
// immutable ComputeDomain integration contract.
tests := []struct {
name string
values map[string]any
}{
{
name: "scalar --set",
values: map[string]any{
"name": "from-set",
"allocationMode": "Immediate",
"resourceClaimTemplateName": "from-set",
},
},
{
name: "typed --set-json or --set-file",
values: map[string]any{
"name": "from-typed-set",
"allocationMode": "Immediate",
"resourceClaimTemplateName": "from-typed-set",
},
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
rendered, renderErr := manifest.Render(content, manifest.RenderInput{
ComponentName: "slinky-slurm",
Namespace: "slurm",
ChartName: "slurm",
ChartVersion: "1.1.0",
Values: tt.values,
})
if renderErr != nil {
t.Fatalf("render ComputeDomain manifest: %v", renderErr)
}

for _, want := range []string{
"name: slinky-slurm-imex",
"allocationMode: All",
"name: slinky-slurm-imex-channels",
} {
if !strings.Contains(string(rendered), want) {
t.Errorf("rendered ComputeDomain manifest missing fixed value %q:\n%s", want, rendered)
}
}
for _, unwanted := range []string{"from-set", "from-typed-set", "allocationMode: Immediate"} {
if strings.Contains(string(rendered), unwanted) {
t.Errorf("rendered ComputeDomain manifest contains override value %q:\n%s", unwanted, rendered)
}
}
})
}
}

func valueAtPath[T any](t *testing.T, root map[string]any, path ...string) T {
t.Helper()

if len(path) == 0 {
t.Fatal("value path must not be empty")
}

var current any = root
for _, key := range path {
m, ok := current.(map[string]any)
if !ok {
t.Fatalf("%q parent is %T, want map[string]any", key, current)
}
current, ok = m[key]
if !ok {
t.Fatalf("missing nested key path %v", path)
}
}
value, ok := current.(T)
if !ok {
var expected T
t.Fatalf("nested path %v = %T, want %T", path, current, expected)
}
return value
}

func assertSingleNameField(t *testing.T, items []any, field, want string) {
t.Helper()

if len(items) != 1 {
t.Fatalf("items length = %d, want 1: %v", len(items), items)
}
item, ok := items[0].(map[string]any)
if !ok {
t.Fatalf("items[0] = %T, want map[string]any", items[0])
}
if got, ok := item[field].(string); !ok || got != want {
t.Fatalf("items[0].%s = %v, want %q", field, item[field], want)
}
}
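`valueAtPath` walks a nested `map[string]any` and fails the test on the first missing key or type mismatch. Outside of tests the same walk can return an error instead. The following is a minimal sketch under that assumption; `lookup` is a hypothetical name and does not exist in the repository.

```go
package main

import "fmt"

// lookup walks root along path and returns the final value asserted to T.
// It mirrors the traversal in valueAtPath but reports problems as errors.
// Hypothetical helper for illustration.
func lookup[T any](root map[string]any, path ...string) (T, error) {
	var zero T
	if len(path) == 0 {
		return zero, fmt.Errorf("value path must not be empty")
	}

	var current any = root
	for _, key := range path {
		m, ok := current.(map[string]any)
		if !ok {
			return zero, fmt.Errorf("%q parent is %T, want map[string]any", key, current)
		}
		current, ok = m[key]
		if !ok {
			return zero, fmt.Errorf("missing key %q in path %v", key, path)
		}
	}

	value, ok := current.(T)
	if !ok {
		return zero, fmt.Errorf("path %v = %T, want %T", path, current, zero)
	}
	return value, nil
}

func main() {
	values := map[string]any{
		"controller": map[string]any{
			"extraConfMap": map[string]any{
				"SwitchType": "switch/nvidia_imex",
			},
		},
	}

	got, err := lookup[string](values, "controller", "extraConfMap", "SwitchType")
	fmt.Println(got, err)

	_, err = lookup[int](values, "controller", "extraConfMap", "SwitchType")
	fmt.Println(err != nil)
}
```

The type parameter lets the caller state the expected leaf type at the call site, so a wrong type surfaces as a descriptive error rather than a panic from an unchecked assertion.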

// TestEvaluatorFailingLeafExcludesCandidate verifies that when a leaf overlay's
// constraints fail evaluation, no ancestor overlay is used as a fallback
// candidate. With maximal leaf selection, ancestors are not independent