AWS provider's accelerator label has reservation-scoped semantics, diverging from other MNNVL-aware providers
Context
Topograph's network.topology.nvidia.com/accelerator label is defined as an accelerated interconnect domain identifier — in practice, operators treat it as "same-value → same NVLink fabric". Four of topograph's five providers that emit this label honor that contract by deriving the value from a Fabric-Manager-derived NVLink clique ID:

| provider | accelerator value |
| --- | --- |
| dra | nvidia.com/gpu.clique label from the NVIDIA GPU Operator (pkg/providers/dra/provider.go) |
| infiniband-bm | ClusterUUID.CliqueId via nvidia-smi (pkg/providers/infiniband/bm.go) |
| infiniband-k8s | ClusterUUID.CliqueId from the device plugin's annotations (pkg/providers/infiniband/k8s.go) |
| lambdai | NVLink.DomainID.CliqueID from the Lambda AI API (pkg/providers/lambdai/provider.go) |

The AWS provider is the exception — it derives accelerator from AWS's CapacityBlockId attribute:
```go
// pkg/providers/aws/instance_topology.go:110-111
if inst.CapacityBlockId != nil {
    topo.AcceleratorID = *inst.CapacityBlockId
}
```
Per the AWS EC2 API reference for InstanceTopology, on UltraServer instances CapacityBlockId "identifies instances within the UltraServer domain" — it is a reservation-scoped identifier, not an NVLink-partition identifier. AWS's explicit "same NVLink domain" label is topology.k8s.aws/ultraserver-id.
Why this matters
Per the Run:ai canonical definition, a clique is "a logical split of the MNNVL into smaller domains". An UltraServer can contain multiple cliques (e.g., an x72 split into two x36 halves), and a clique can be absent while NVIDIA Fabric Manager has not completed initialization (the clique label is written only after NVML reports NVML_GPU_FABRIC_STATE_COMPLETED).
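That gating is visible directly through NVML. A minimal sketch, assuming recent NVIDIA go-nvml bindings (github.com/NVIDIA/go-nvml), which expose the fabric state and clique ID via GetGpuFabricInfo:

```go
// Sketch: read GPU 0's NVLink fabric info and treat the clique ID as
// unknown until Fabric Manager reports COMPLETED, mirroring the
// condition under which the clique label is written.
package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	dev, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	info, ret := dev.GetGpuFabricInfo()
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	if info.State != nvml.GPU_FABRIC_STATE_COMPLETED {
		fmt.Println("fabric init incomplete; no clique ID yet")
		return
	}
	fmt.Printf("clique id: %d\n", info.CliqueId)
}
```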
The practical consequence is that two nodes with the same network.topology.nvidia.com/accelerator value can mean:
- On DRA / InfiniBand / Lambda AI providers: same NVLink fabric (operator's likely mental model)
- On the AWS provider: same UltraServer reservation, which is co-extensive with the UltraServer-level MNNVL domain on P6e-GB200 but may contain multiple cliques — so "same accelerator" is coarser than "same NVLink partition"
Empirical data from an NVIDIA-internal cluster confirms the N-cliques-per-CapacityBlock case in production: multiple distinct nvidia.com/gpu.clique values were observed within a single topology.k8s.aws/capacity-block-id, with some Capacity Blocks having no clique ID at all.
A downstream scheduler's CEL rule or podAffinity expression that assumes accelerator equality implies NVLink reachability will therefore be accurate on DRA/IB/Lambda AI but can over-colocate on AWS.
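To make the over-colocation concrete, here is a hypothetical podAffinity term, expressed with the Kubernetes Go API types, that keys on the accelerator label (the app: trainer selector is made up for illustration):

```go
// Hypothetical sketch: colocate trainer pods by accelerator-label
// equality. On DRA/IB/Lambda AI this means "same NVLink fabric"; on the
// AWS provider it means "same Capacity Block", which may span multiple
// NVLink cliques.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	affinity := &corev1.Affinity{
		PodAffinity: &corev1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "trainer"}, // hypothetical selector
				},
				// Equality on this key implies NVLink reachability only
				// on the non-AWS providers.
				TopologyKey: "network.topology.nvidia.com/accelerator",
			}},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```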
Options
- Keep AWS as-is, document. The docs in #289 already clarify the semantic difference. Schedulers that need true NVLink-partition granularity can consume nvidia.com/gpu.clique or topology.k8s.aws/ultraserver-id directly on AWS nodes.
- AWS provider prefers topology.k8s.aws/ultraserver-id when present, falls back to CapacityBlockId. This is the cleanest semantic fix for the UltraServer case — ultraserver-id is AWS's documented "same NVLink domain" identifier. Still doesn't capture within-UltraServer cliques on partitioned hardware.
- AWS provider prefers nvidia.com/gpu.clique (from the GPU Operator on MNNVL nodes) when present, falls back to CapacityBlockId. Closest alignment with other providers' semantics. Requires the AWS provider to read node labels (it already runs as a Kubernetes-aware workload) and to reason about the non-MNNVL case where no clique label exists. A sketch of this preference chain follows the list.
- Emit two distinct labels, e.g., accelerator (broad — UltraServer / Capacity Block level) and accelerator-partition (fine-grained — NVLink clique level). Breaking change for current consumers.
- Change the default on AWS to ultraserver-id, keep CapacityBlockId behind an engineParams flag for operators who still want reservation-level grouping.
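A minimal sketch of that preference chain (the acceleratorID helper, and the assumption that the provider has the node's Kubernetes labels in hand alongside the EC2 InstanceTopology response, are illustrative, not topograph's current API):

```go
// Sketch only: pick the accelerator value by preferring the NVLink
// clique label, then AWS's UltraServer ID, then the Capacity Block ID.
func acceleratorID(nodeLabels map[string]string, capacityBlockID *string) string {
	if clique := nodeLabels["nvidia.com/gpu.clique"]; clique != "" {
		return clique // finest granularity: NVLink partition
	}
	if us := nodeLabels["topology.k8s.aws/ultraserver-id"]; us != "" {
		return us // AWS's documented "same NVLink domain" identifier
	}
	if capacityBlockID != nil {
		return *capacityBlockID // current behavior: reservation-scoped
	}
	return "" // non-MNNVL case: no clique label, no reservation
}
```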
Ask
Flagging this for @dmitsh and @ravisoundar to assess.
Related