AWS provider's accelerator label has reservation-scoped semantics, diverging from other MNNVL-aware providers #293

@resker

Context

Topograph's network.topology.nvidia.com/accelerator label is defined as an accelerated interconnect domain identifier. In practice, operators read it as "same value → same NVLink fabric". Of the five topograph providers that emit this label, four honor that contract by taking the value from an NVLink clique ID that originates in Fabric Manager:

| Provider | Source of accelerator value |
| --- | --- |
| dra | nvidia.com/gpu.clique label from the NVIDIA GPU Operator (pkg/providers/dra/provider.go) |
| infiniband-bm | ClusterUUID.CliqueId via nvidia-smi (pkg/providers/infiniband/bm.go) |
| infiniband-k8s | ClusterUUID.CliqueId from the device plugin's annotations (pkg/providers/infiniband/k8s.go) |
| lambdai | NVLink.DomainID.CliqueID from the Lambda AI API (pkg/providers/lambdai/provider.go) |

The AWS provider is the exception — it derives accelerator from AWS's CapacityBlockId attribute:

// pkg/providers/aws/instance_topology.go:110-111
if inst.CapacityBlockId != nil {
    topo.AcceleratorID = *inst.CapacityBlockId
}

Per the AWS EC2 API reference for InstanceTopology, on UltraServer instances CapacityBlockId "identifies instances within the UltraServer domain" — it is a reservation-scoped identifier, not an NVLink-partition identifier. AWS's explicit "same NVLink domain" label is topology.k8s.aws/ultraserver-id.

Why this matters

Per the Run:ai canonical definition, a clique is "a logical split of the MNNVL into smaller domains". An UltraServer can contain multiple cliques (e.g., an x72 split into two x36 halves), and a clique can be absent when NVIDIA Fabric Manager has not completed initialization (NVML must report NVML_GPU_FABRIC_STATE_COMPLETED before the label is written).

The practical consequence is that the same network.topology.nvidia.com/accelerator value on two nodes means different things depending on the provider:

  • On DRA / InfiniBand / Lambda AI providers: same NVLink fabric (operator's likely mental model)
  • On the AWS provider: same UltraServer reservation, which is co-extensive with the UltraServer-level MNNVL domain on P6e-GB200 but may contain multiple cliques — so "same accelerator" is coarser than "same NVLink partition"

Empirical data from an NVIDIA-internal cluster confirms the N-cliques-per-CapacityBlock case in production: multiple distinct nvidia.com/gpu.clique values were observed within a single topology.k8s.aws/capacity-block-id, with some Capacity Blocks having no clique ID at all.

A downstream scheduler's CEL rule or podAffinity expression that assumes accelerator equality implies NVLink reachability will therefore be accurate on DRA/IB/Lambda AI but can over-colocate on AWS.
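One way to check whether a given cluster is affected is to group nodes by their accelerator value and count the distinct clique values inside each group. The following is a minimal sketch of that audit, not part of topograph: the function names, node names, and label values are hypothetical, and node labels are modeled as plain maps rather than read from the Kubernetes API.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	// Label keys as named in this issue.
	acceleratorLabel = "network.topology.nvidia.com/accelerator"
	cliqueLabel      = "nvidia.com/gpu.clique"
)

// cliquesByAccelerator groups nodes by accelerator value and returns the
// sorted distinct clique values seen in each group. A node without a clique
// label contributes the empty string.
func cliquesByAccelerator(nodes map[string]map[string]string) map[string][]string {
	seen := make(map[string]map[string]bool)
	for _, labels := range nodes {
		acc, ok := labels[acceleratorLabel]
		if !ok {
			continue // node carries no accelerator label
		}
		if seen[acc] == nil {
			seen[acc] = make(map[string]bool)
		}
		seen[acc][labels[cliqueLabel]] = true
	}
	out := make(map[string][]string, len(seen))
	for acc, set := range seen {
		cliques := make([]string, 0, len(set))
		for c := range set {
			cliques = append(cliques, c)
		}
		sort.Strings(cliques)
		out[acc] = cliques
	}
	return out
}

// ambiguousAccelerators returns, in sorted order, the accelerator values that
// do not map to exactly one non-empty clique value.
func ambiguousAccelerators(nodes map[string]map[string]string) []string {
	var out []string
	for acc, cliques := range cliquesByAccelerator(nodes) {
		if len(cliques) != 1 || cliques[0] == "" {
			out = append(out, acc)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical node labels; all values are placeholders.
	nodes := map[string]map[string]string{
		"node-a": {acceleratorLabel: "cb-1", cliqueLabel: "uuid-1.0"},
		"node-b": {acceleratorLabel: "cb-1", cliqueLabel: "uuid-1.1"},
		"node-c": {acceleratorLabel: "cb-2", cliqueLabel: "uuid-2.0"},
		"node-d": {acceleratorLabel: "cb-2", cliqueLabel: "uuid-2.0"},
		"node-e": {acceleratorLabel: "cb-3"},
	}
	for _, acc := range ambiguousAccelerators(nodes) {
		fmt.Printf("accelerator %q does not map to a single clique\n", acc)
	}
}
```

With the sample data, cb-1 is flagged because it spans two cliques and cb-3 because its only node has no clique label, while cb-2 passes. Treating a missing clique as ambiguous is a design choice of this sketch; a non-MNNVL cluster would want to skip that case.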

Options

  1. Keep AWS as-is, document. The docs in #289 already clarify the semantic difference. Schedulers that need true NVLink-partition granularity can consume nvidia.com/gpu.clique or topology.k8s.aws/ultraserver-id directly on AWS nodes.
  2. AWS provider prefers topology.k8s.aws/ultraserver-id when present, falls back to CapacityBlockId. This is the cleanest semantic fix for the UltraServer case — ultraserver-id is AWS's documented "same NVLink domain" identifier. Still doesn't capture within-UltraServer cliques on partitioned hardware.
  3. AWS provider prefers nvidia.com/gpu.clique (from the GPU Operator on MNNVL nodes) when present, falls back to CapacityBlockId. Closest alignment with other providers' semantics. Requires the AWS provider to read node labels (it already runs as a Kubernetes-aware workload), and has to reason about the non-MNNVL case where no clique label exists.
  4. Emit two distinct labels, e.g., accelerator (broad — UltraServer / Capacity Block level) and accelerator-partition (fine-grained — NVLink clique level). Breaking change for current consumers.
  5. Change the default on AWS to ultraserver-id, keep CapacityBlockId behind an engineParams flag for operators who still want reservation-level grouping.
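To make the fallback order in options 2 and 3 concrete, here is a rough sketch of a resolver that prefers the finest-grained identifier available. It is an illustration under assumptions, not the provider's code: resolveAccelerator, the Source type, and the sample values are hypothetical, and it assumes the node's labels are already available to the AWS provider as a map.

```go
package main

import "fmt"

const (
	// Label keys as named in this issue.
	cliqueLabel      = "nvidia.com/gpu.clique"
	ultraServerLabel = "topology.k8s.aws/ultraserver-id"
)

// Source records which identifier the accelerator value was taken from.
// The type and its values are hypothetical.
type Source string

const (
	SourceClique        Source = "clique"
	SourceUltraServer   Source = "ultraserver-id"
	SourceCapacityBlock Source = "capacity-block-id"
	SourceNone          Source = "none"
)

// resolveAccelerator picks an accelerator value in this order:
// NVLink clique label (option 3), UltraServer ID label (option 2),
// then CapacityBlockId, which is the current behavior.
func resolveAccelerator(labels map[string]string, capacityBlockID *string) (string, Source) {
	if v := labels[cliqueLabel]; v != "" {
		return v, SourceClique
	}
	if v := labels[ultraServerLabel]; v != "" {
		return v, SourceUltraServer
	}
	if capacityBlockID != nil && *capacityBlockID != "" {
		return *capacityBlockID, SourceCapacityBlock
	}
	return "", SourceNone
}

func main() {
	cb := "cb-1" // placeholder Capacity Block ID

	// MNNVL node where the GPU Operator has written a clique label.
	v, src := resolveAccelerator(map[string]string{
		cliqueLabel:      "uuid-1.0",
		ultraServerLabel: "us-1",
	}, &cb)
	fmt.Println(v, src)

	// Node with no clique label yet: fall back to the UltraServer ID.
	v, src = resolveAccelerator(map[string]string{ultraServerLabel: "us-1"}, &cb)
	fmt.Println(v, src)

	// Node with neither label: keep the reservation-scoped value.
	v, src = resolveAccelerator(nil, &cb)
	fmt.Println(v, src)
}
```

Returning the source alongside the value is one way to let callers tell a clique-level value from a reservation-level one, which is the distinction option 4 would express with a second label.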

Ask

Requesting that @dmitsh and @ravisoundar assess these options.
