Merged
14 changes: 13 additions & 1 deletion docs/user/validation.md
@@ -48,9 +48,21 @@ ones) that match the target fabric:
| Check | Transport | When it's selected |
|---|---|---|
| `nccl-all-reduce-bw` | Auto-detect (whatever NCCL picks) | H100/H200 on EKS, H100 on GKE, and B200/GB200 on self-managed clusters (`service=any`). Preserves the pre-variant behavior. |
- | `nccl-all-reduce-bw-net` | NET (EFA on EKS) | GB200 + EKS. Asserts EFA actually carried traffic; catches silent fallback to Socket when the NVIDIA driver is missing `NVreg_GrdmaPciTopoCheckOverride=1`. |
+ | `nccl-all-reduce-bw-net` | NET (EFA on EKS by default; ConnectX RoCE via `AICR_NCCL_FABRIC=roce`) | GB200 + EKS. Asserts EFA actually carried traffic; catches silent fallback to Socket when the NVIDIA driver is missing `NVreg_GrdmaPciTopoCheckOverride=1`. |
| `nccl-all-reduce-bw-nvls` | NVLS (MNNVL across an NVL72 IMEX domain) | GB200 + EKS, and GB200 + OKE. Asserts the NVLS communicator actually initialized — catches silent fallback to EFA (EKS) or Socket (OKE) when the IMEX domain is misconfigured. |

The `-net` check defaults to the AWS EFA fabric. On a ConnectX **RoCE** cluster
(e.g. DGXC GB300 `p6e-gb300r`), set `AICR_NCCL_FABRIC=roce` in the environment
of the `aicr validate` process to run the NET test over NCCL's built-in
IB/verbs transport across `roce.networking.k8s.aws` DRA devices instead. The
variable is forwarded to the `-net` check only. Leaving it unset (or setting
`efa`) keeps every existing recipe on the EFA path, and any other value is
rejected by the validator. The RoCE runtime image installs `openssh-server` at
startup, so the GPU nodes need apt egress; on an air-gapped cluster the RoCE
NET test cannot bootstrap. This env override is interim: snapshot-based fabric
auto-detection, and removal of the runtime package install once a CUDA-13
image ships sshd, are tracked in
[NVIDIA/aicr#1413](https://github.com/NVIDIA/aicr/issues/1413).
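
For reviewers, the value handling described in the paragraph above (unset or `efa` selects EFA, `roce` selects RoCE, anything else is rejected) might look roughly like the following on the validator side. This is a minimal sketch, not the PR's code: the diff does not include the pod-side `ncclFabric()` body, so `parseFabric` and the `fabricEFA`/`fabricRoCE` constants are hypothetical names, and exact case-sensitive matching is an assumption.

```go
package main

import "fmt"

// Hypothetical constants; the PR only shows the env name, not these values' Go names.
const (
	fabricEFA  = "efa"
	fabricRoCE = "roce"
)

// parseFabric maps a raw AICR_NCCL_FABRIC value to a fabric name.
// An empty string stands for "not forwarded" and selects the EFA default.
// Assumption: values are matched exactly, with no trimming or case folding.
func parseFabric(raw string) (string, error) {
	switch raw {
	case "", fabricEFA:
		return fabricEFA, nil
	case fabricRoCE:
		return fabricRoCE, nil
	default:
		return "", fmt.Errorf("unsupported AICR_NCCL_FABRIC value %q (want %q or %q)",
			raw, fabricEFA, fabricRoCE)
	}
}

func main() {
	// "ib" is an arbitrary example of a value outside the accepted set.
	for _, raw := range []string{"", "efa", "roce", "ib"} {
		fabric, err := parseFabric(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%q selects %s\n", raw, fabric)
	}
}
```

Rejecting unknown values instead of falling back to EFA keeps a typo such as `AICR_NCCL_FABRIC=rcoe` from silently running the wrong fabric.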

GB200/EKS recipes (both `training` and `inference` intents) enable `-net` and
`-nvls` together rather than the auto-detect variant, because those nodes
expose two inter-node fabrics simultaneously and a single auto-detect test
22 changes: 22 additions & 0 deletions pkg/validator/catalog/catalog_test.go
@@ -1140,6 +1140,28 @@ func TestEmbeddedCatalog_InferenceGatewayEntryExists(t *testing.T) {
t.Fatalf("no embedded catalog entry named %q (AICR_REQUIRE_SCOPED_INFERENCE_GATEWAY forwarding would silently no-op)", v1.InferenceGatewayCheckName)
}

// TestEmbeddedCatalog_NCCLAllReduceBWNetEntryExists locks the embedded catalog
// entry name to v1.NCCLAllReduceBWNetCheckName, which scopes AICR_NCCL_FABRIC
// forwarding (see buildEnv in pkg/validator/v1). Without this, renaming the
// "nccl-all-reduce-bw-net" catalog entry would silently disable RoCE-fabric
// forwarding — the in-Job validator would never see the env and default to EFA
// — with no other test failing.
func TestEmbeddedCatalog_NCCLAllReduceBWNetEntryExists(t *testing.T) {
	cat, err := LoadWithDataProvider(context.Background(), nil, "v0.0.0-next", "")
	if err != nil {
		t.Fatalf("Load failed: %v", err)
	}
	for _, v := range cat.Validators {
		if v.Name == v1.NCCLAllReduceBWNetCheckName {
			if v.Phase != "performance" {
				t.Errorf("%q phase = %q, want performance", v1.NCCLAllReduceBWNetCheckName, v.Phase)
			}
			return
		}
	}
	t.Fatalf("no embedded catalog entry named %q (AICR_NCCL_FABRIC forwarding would silently no-op)", v1.NCCLAllReduceBWNetCheckName)
}

func TestCatalogEmbedding(t *testing.T) {
// Simulate embedding in a CR spec
type ValidatorCatalogSpec struct {
27 changes: 26 additions & 1 deletion pkg/validator/v1/job_plan_internal.go
@@ -52,6 +52,23 @@ const (

requireScopedInferenceGatewayEnv = "AICR_REQUIRE_SCOPED_INFERENCE_GATEWAY"

// NCCLAllReduceBWNetCheckName is the catalog name of the NCCL all-reduce NET
// check. Used to scope AICR_NCCL_FABRIC forwarding to that validator only.
// Exported (like InferencePerfCheckName / InferenceGatewayCheckName) so the
// catalog package can lock the embedded entry name against it — a rename
// would otherwise silently no-op RoCE forwarding with no test failing.
NCCLAllReduceBWNetCheckName = "nccl-all-reduce-bw-net"

// ncclFabricEnv selects the NET fabric (efa default | roce). Forwarded to
// the NET check pod so the in-Job validator can observe it. This is the
// orchestrator (forwarding) end; the validator-pod (reading) end defines the
// same literal as ncclFabricEnv in
// validators/performance/nccl_all_reduce_bw_constraint.go — keep the two in
// sync. The split mirrors the other forwarded validator envs (HF_TOKEN,
// AICR_REQUIRE_SCOPED_INFERENCE_GATEWAY, AICR_INFERENCE_PERF_NO_CLEANUP): the
// pod binary is a separate package that does not import this one.
ncclFabricEnv = "AICR_NCCL_FABRIC"

// inferencePerfNoCleanupEnv, when truthy, makes the inference-perf validator
// leave its namespace/DGD/workers/frontend/AIPerf Job in place after the run
// for post-mortem inspection. Forwarded only to the inference-perf pod.
@@ -155,6 +172,14 @@ func buildEnv(
env = append(env, corev1.EnvVar{Name: requireScopedInferenceGatewayEnv, Value: v})
}

	// Forward the NCCL fabric selector to the nccl-all-reduce-bw-net check pod.
	// The NET test runs inside the Job, so it can't observe the CLI environment
	// unless forwarded here. Unset (default) leaves the check on EFA; scoped to
	// the NET check so unrelated validator pods don't carry it.
	if v, ok := os.LookupEnv(ncclFabricEnv); ok && v != "" && entry.Name == NCCLAllReduceBWNetCheckName {
		env = append(env, corev1.EnvVar{Name: ncclFabricEnv, Value: v})
	}

// Forward the inference-perf no-cleanup debug toggle into that validator pod.
// Cleanup runs inside the Job, so it can't see the CLI process environment
// unless the orchestrator carries the value across. Scoped to the
@@ -175,7 +200,7 @@ func buildEnv(
// forwarded value (k8s takes the last duplicate), breaking that trust
// boundary.
for _, e := range entry.Env {
- if e.Name == hfTokenEnvVar || e.Name == requireScopedInferenceGatewayEnv || e.Name == inferencePerfNoCleanupEnv {
+ if e.Name == hfTokenEnvVar || e.Name == requireScopedInferenceGatewayEnv || e.Name == inferencePerfNoCleanupEnv || e.Name == ncclFabricEnv {
continue
}
env = append(env, corev1.EnvVar{Name: e.Name, Value: e.Value})
102 changes: 102 additions & 0 deletions pkg/validator/v1/job_plan_test.go
@@ -16,6 +16,7 @@ package v1

import (
stderrors "errors"
"os"
"strings"
"testing"
"time"
@@ -522,6 +523,107 @@ func TestBuildJobPlan_ForwardsInferencePerfNoCleanupEnv(t *testing.T) {
})
}

// TestBuildJobPlan_ForwardsNCCLFabricEnv verifies the NET fabric selector is
// carried from the CLI process into the nccl-all-reduce-bw-net validator Job
// (where ncclFabric() reads it), only that validator, and that a catalog-pinned
// value can never shadow or substitute for the forwarded one. Unlike the
// no-cleanup toggle, the value is forwarded verbatim (the validator validates it).
func TestBuildJobPlan_ForwardsNCCLFabricEnv(t *testing.T) {
	build := func(entry ValidatorEntry) map[string]string {
		plan, err := BuildJobPlan(entry, "run-1", "ns", "1.0.0", "abc123", "sa", nil, nil, nil, "", "", nil)
		if err != nil {
			t.Fatalf("BuildJobPlan error: %v", err)
		}
		m := make(map[string]string)
		for _, e := range plan.Env {
			m[e.Name] = e.Value
		}
		return m
	}

	netEntry := ValidatorEntry{Name: NCCLAllReduceBWNetCheckName, Phase: "performance", Image: "img:v1", Timeout: time.Minute}

	t.Run("forwarded verbatim to nccl-all-reduce-bw-net", func(t *testing.T) {
		t.Setenv(ncclFabricEnv, "roce")
		if got := build(netEntry)[ncclFabricEnv]; got != "roce" {
			t.Errorf("%s env = %q, want roce", ncclFabricEnv, got)
		}
	})
	t.Run("empty value omitted", func(t *testing.T) {
		t.Setenv(ncclFabricEnv, "")
		if _, present := build(netEntry)[ncclFabricEnv]; present {
			t.Errorf("%s should not be in Job env when empty", ncclFabricEnv)
		}
	})
	t.Run("unset omitted", func(t *testing.T) {
		// Exercise the LookupEnv ok=false branch (os.Unsetenv), distinct from the
		// empty-string ok=true case above. t.Setenv registers cleanup so the
		// unset is restored after the test.
		t.Setenv(ncclFabricEnv, "")
		if err := os.Unsetenv(ncclFabricEnv); err != nil {
			t.Fatalf("unsetenv: %v", err)
		}
		if _, present := build(netEntry)[ncclFabricEnv]; present {
			t.Errorf("%s should not be in Job env when unset", ncclFabricEnv)
		}
	})
	t.Run("not forwarded to other validators", func(t *testing.T) {
		t.Setenv(ncclFabricEnv, "roce")
		other := ValidatorEntry{Name: InferencePerfCheckName, Phase: "performance", Image: "img:v1", Timeout: time.Minute}
		if _, present := build(other)[ncclFabricEnv]; present {
			t.Errorf("%s must not be forwarded to a non-NET validator", ncclFabricEnv)
		}
	})
	t.Run("env-name literal locked", func(t *testing.T) {
		// Pin the orchestrator (forwarding) end of the env name. The validator-pod
		// (reading) end defines the same literal independently in
		// validators/performance/nccl_all_reduce_bw_constraint.go; a fat-finger in
		// either redeclaration would silently no-op RoCE forwarding. Both ends
		// pin to this canonical string so a typo fails its own package's test.
		if ncclFabricEnv != "AICR_NCCL_FABRIC" {
			t.Errorf("ncclFabricEnv = %q, want AICR_NCCL_FABRIC (keep in sync with the pod-side const)", ncclFabricEnv)
		}
	})

	// values collects every occurrence of the env var (not just the last) so we
	// can assert the catalog value is dropped, not merely shadowed.
	values := func(entry ValidatorEntry) []string {
		plan, err := BuildJobPlan(entry, "run-1", "ns", "1.0.0", "abc123", "sa", nil, nil, nil, "", "", nil)
		if err != nil {
			t.Fatalf("BuildJobPlan error: %v", err)
		}
		var got []string
		for _, e := range plan.Env {
			if e.Name == ncclFabricEnv {
				got = append(got, e.Value)
			}
		}
		return got
	}

	t.Run("catalog value cannot override forwarded value", func(t *testing.T) {
		t.Setenv(ncclFabricEnv, "roce")
		entry := ValidatorEntry{
			Name: NCCLAllReduceBWNetCheckName, Phase: "performance", Image: "img:v1", Timeout: time.Minute,
			Env: []EnvVar{{Name: ncclFabricEnv, Value: "efa"}},
		}
		if got := values(entry); len(got) != 1 || got[0] != "roce" {
			t.Errorf("%s env = %v, want exactly [roce] (catalog value must be dropped)", ncclFabricEnv, got)
		}
	})

	t.Run("catalog value alone cannot select fabric", func(t *testing.T) {
		t.Setenv(ncclFabricEnv, "")
		entry := ValidatorEntry{
			Name: NCCLAllReduceBWNetCheckName, Phase: "performance", Image: "img:v1", Timeout: time.Minute,
			Env: []EnvVar{{Name: ncclFabricEnv, Value: "roce"}},
		}
		if got := values(entry); len(got) != 0 {
			t.Errorf("%s env = %v, want none (catalog must not select fabric without shell env)", ncclFabricEnv, got)
		}
	})
}
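
The rules these subtests pin (forward only to the NET check, treat empty the same as unset, and drop any catalog-pinned copy of the variable) can be exercised in isolation. The following is a hedged, dependency-free reduction rather than the real `buildEnv`: `buildFabricEnv`, the local `envVar` type, and the injected `lookup` function are hypothetical stand-ins for `buildEnv`, `corev1.EnvVar`, and `os.LookupEnv`, and `NCCL_DEBUG` is only an illustrative catalog entry.

```go
package main

import "fmt"

// envVar stands in for corev1.EnvVar so the sketch needs no Kubernetes imports.
type envVar struct{ Name, Value string }

const (
	netCheckName = "nccl-all-reduce-bw-net" // same literal as NCCLAllReduceBWNetCheckName
	fabricEnv    = "AICR_NCCL_FABRIC"       // same literal as ncclFabricEnv
)

// buildFabricEnv is a hypothetical reduction of buildEnv. lookup is injected
// (os.LookupEnv in real use) so the rules can be checked without touching the
// process environment.
func buildFabricEnv(checkName string, catalogEnv []envVar, lookup func(string) (string, bool)) []envVar {
	var env []envVar
	// Forward only a non-empty shell value, and only to the NET check.
	if v, ok := lookup(fabricEnv); ok && v != "" && checkName == netCheckName {
		env = append(env, envVar{Name: fabricEnv, Value: v})
	}
	for _, e := range catalogEnv {
		if e.Name == fabricEnv {
			continue // reserved name: a catalog entry may neither set nor override it
		}
		env = append(env, e)
	}
	return env
}

func main() {
	shell := map[string]string{fabricEnv: "roce"}
	lookup := func(k string) (string, bool) { v, ok := shell[k]; return v, ok }
	catalog := []envVar{
		{Name: fabricEnv, Value: "efa"},     // dropped: catalog must not pin the fabric
		{Name: "NCCL_DEBUG", Value: "INFO"}, // illustrative catalog entry, passed through
	}
	fmt.Println(buildFabricEnv(netCheckName, catalog, lookup))
	fmt.Println(buildFabricEnv("some-other-check", catalog, lookup)) // hypothetical check name
}
```

Dropping the catalog copy, instead of relying on Kubernetes taking the last duplicate, is what lets the test assert on every occurrence of the variable and not only the final one.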

func TestBuildJobPlanWithDefaults(t *testing.T) {
// Test with minimal entry (no custom resources, no tolerations, no node selector)
entry := ValidatorEntry{
2 changes: 1 addition & 1 deletion recipes/validators/README.md
@@ -47,7 +47,7 @@ Applied by `catalog.Load` (`pkg/validator/catalog/catalog.go`) in order:
| Name | Description | Timeout |
|------|-------------|---------|
| `nccl-all-reduce-bw` | Verify NCCL All Reduce Bus Bandwidth meets threshold | 30m |
- | `nccl-all-reduce-bw-net` | Verify NCCL All Reduce Bus Bandwidth on the NET transport (EFA on EKS) | 30m |
+ | `nccl-all-reduce-bw-net` | Verify NCCL All Reduce Bus Bandwidth on the NET transport (EFA on EKS; ConnectX RoCE via `AICR_NCCL_FABRIC=roce`) | 30m |
| `nccl-all-reduce-bw-nvls` | Verify NCCL All Reduce Bus Bandwidth on the NVLS transport (MNNVL across an NVL72 IMEX domain) | 30m |

### Conformance Phase
2 changes: 1 addition & 1 deletion recipes/validators/catalog.yaml
@@ -231,7 +231,7 @@ validators:
env: []
- name: nccl-all-reduce-bw-net
phase: performance
- description: "Verify NCCL All Reduce Bus Bandwidth on the NET transport (EFA on EKS)"
+ description: "Verify NCCL All Reduce Bus Bandwidth on the NET transport (EFA on EKS; ConnectX RoCE via AICR_NCCL_FABRIC=roce)"
image: ghcr.io/nvidia/aicr-validators/performance:latest
timeout: 30m
args: ["nccl-all-reduce-bw-net"]