This issue is intended to discuss and align Topograph's direction, and to highlight areas where contributions are especially welcome.
Direction
Topograph's strategic positioning is to be the standard topology-discovery layer for AI cluster scheduling — a unified substrate that abstracts over diverse infrastructure sources (cloud APIs, InfiniBand fabric, NVIDIA-managed NVLink/Ethernet) and translates topology into the formats each scheduler can consume (Slurm topology.conf, Kubernetes node labels, Slinky ConfigMaps). The broader the provider and engine coverage, the more useful the standardized topology layer becomes for downstream schedulers (KAI Scheduler, Kueue, Grove, native pod affinity), operational tooling (NVSentinel), and the ecosystem of observability / placement tools that depend on accurate, real-time topology data.
We actively welcome community contributions. The provider interface is well-defined, the engine boundary is stable, and the project is designed to accept new integrations without touching shared internals.
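To make the translation step described above concrete, here is a minimal sketch of rendering a discovered switch tree as Slurm `topology.conf` lines in the tree format (`SwitchName=... Switches=...` and `SwitchName=... Nodes=...`). The `Switch` type and `RenderTopologyConf` function are hypothetical names chosen for illustration; they are not Topograph's actual data model or engine code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Switch is a hypothetical in-memory topology vertex. A switch either
// aggregates child switches or has compute nodes attached directly.
type Switch struct {
	Name     string
	Switches []*Switch
	Nodes    []string
}

// RenderTopologyConf walks the tree depth-first and emits one line per
// switch in Slurm's tree topology.conf syntax.
func RenderTopologyConf(root *Switch) string {
	var lines []string
	var walk func(s *Switch)
	walk = func(s *Switch) {
		if len(s.Switches) > 0 {
			names := make([]string, 0, len(s.Switches))
			for _, child := range s.Switches {
				names = append(names, child.Name)
			}
			sort.Strings(names)
			lines = append(lines, fmt.Sprintf("SwitchName=%s Switches=%s",
				s.Name, strings.Join(names, ",")))
			for _, child := range s.Switches {
				walk(child)
			}
			return
		}
		// Copy before sorting so the caller's slice is left untouched.
		nodes := append([]string(nil), s.Nodes...)
		sort.Strings(nodes)
		lines = append(lines, fmt.Sprintf("SwitchName=%s Nodes=%s",
			s.Name, strings.Join(nodes, ",")))
	}
	walk(root)
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	root := &Switch{Name: "spine", Switches: []*Switch{
		{Name: "leaf1", Nodes: []string{"node1", "node2"}},
		{Name: "leaf2", Nodes: []string{"node3", "node4"}},
	}}
	fmt.Print(RenderTopologyConf(root))
}
```

The same tree could equally be flattened into per-node labels for Kubernetes; only the rendering step changes, which is the point of keeping discovery and output formats separate.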
Focus Areas
1. Additional cloud and colocation providers
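If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider. For reference material, see AGENTS.md and the existing providers under pkg/providers/lambdai/ and pkg/providers/cw/.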
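As a rough picture of what "implementing a new provider" involves, the sketch below shows the general shape: take a list of node names, ask the infrastructure where each one sits, and return node-to-switch links. Everything here is an assumption made for illustration. The `Provider` interface, `Link` type, and `staticProvider` are hypothetical and do not reproduce Topograph's real provider interface; the in-memory map stands in for a call to a vendor topology API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrUnknownNode is returned when the backing API has no data for a node.
var ErrUnknownNode = errors.New("no topology data")

// Link records that a compute node is attached to a switch (hypothetical type).
type Link struct {
	Node   string
	Switch string
}

// Provider is a hypothetical discovery contract, not Topograph's actual one.
type Provider interface {
	Name() string
	Discover(ctx context.Context, nodes []string) ([]Link, error)
}

// staticProvider answers from a fixed map; a real provider would query
// its cloud or fabric API here instead.
type staticProvider struct {
	placement map[string]string // node name -> switch name
}

func (p staticProvider) Name() string { return "example" }

func (p staticProvider) Discover(ctx context.Context, nodes []string) ([]Link, error) {
	links := make([]Link, 0, len(nodes))
	for _, n := range nodes {
		// Stop early if the caller cancelled the request.
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		sw, ok := p.placement[n]
		if !ok {
			return nil, fmt.Errorf("node %q: %w", n, ErrUnknownNode)
		}
		links = append(links, Link{Node: n, Switch: sw})
	}
	return links, nil
}

// discoverAll is a small convenience wrapper using a background context.
func discoverAll(p Provider, nodes []string) ([]Link, error) {
	return p.Discover(context.Background(), nodes)
}

func main() {
	p := staticProvider{placement: map[string]string{"node1": "leaf1", "node2": "leaf1"}}
	links, err := discoverAll(p, []string{"node1", "node2"})
	if err != nil {
		fmt.Println("discovery failed:", err)
		return
	}
	for _, l := range links {
		fmt.Printf("%s -> %s (provider %s)\n", l.Node, l.Switch, p.Name())
	}
}
```

The actual interface and registration mechanics should be taken from the repository itself rather than from this sketch.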
2. On-premises network fabrics
Expanding on-premises coverage requires collaboration with network switch vendors and operators. If you run InfiniBand, Ethernet, or other high-performance fabrics on-premises and have ideas about how node-to-switch relationships should be exposed reliably in your environment, please comment on this issue or open a new one describing the mechanism.
3. Kubernetes Workload API and KEP-5732
The Workload API (alpha in Kubernetes 1.35) brings gang scheduling into the core scheduler. KEP-5732 extends this with a TopologyConstraint on PodGroupSpec — a node label key ensuring all pods in a gang land in the same topology domain — with DRA integration planned to follow. Both are upstream proposals under active development; the topology labels Topograph applies are the natural input as these features mature.
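The "same topology domain" condition can be illustrated with a small check: given the labels on each node and the nodes a gang was placed on, all of them must carry the same value for the chosen label key. This is a sketch of the idea only, not the scheduler's or the KEP's implementation. `SameTopologyDomain` is a hypothetical helper, and the `leaf` suffix on the label key is an illustrative assumption.

```go
package main

import "fmt"

// SameTopologyDomain reports whether every node in gangNodes carries the
// label key and all of them share one value. It returns that shared value
// when the check passes. An empty gang never passes.
func SameTopologyDomain(nodeLabels map[string]map[string]string, gangNodes []string, key string) (string, bool) {
	domain := ""
	for i, n := range gangNodes {
		// Indexing a missing node yields a nil map, which reads as "no label".
		v, ok := nodeLabels[n][key]
		if !ok || v == "" {
			return "", false
		}
		if i == 0 {
			domain = v
			continue
		}
		if v != domain {
			return "", false
		}
	}
	return domain, len(gangNodes) > 0
}

func main() {
	const key = "network.topology.nvidia.com/leaf" // suffix is illustrative
	labels := map[string]map[string]string{
		"node1": {key: "sw-a"},
		"node2": {key: "sw-a"},
		"node3": {key: "sw-b"},
	}
	domain, ok := SameTopologyDomain(labels, []string{"node1", "node2"}, key)
	fmt.Println(domain, ok)
	_, ok = SameTopologyDomain(labels, []string{"node1", "node3"}, key)
	fmt.Println(ok)
}
```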
4. KEP-4962 — upstream topology label standardization
KEP-4962 proposes a standard network.topology.kubernetes.io/ label schema that closely mirrors Topograph's existing network.topology.nvidia.com/ design. Contributions that keep Topograph aligned with the KEP as it evolves — and contributions to the KEP itself — help shape how topology awareness lands in the Kubernetes ecosystem.
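Because the two schemas share a structure, staying aligned could be as simple as mirroring each label under the second prefix. The sketch below assumes a one-to-one mapping of label suffixes between the two prefixes, which may not match the KEP as it evolves; `MirrorTopologyLabels` is a hypothetical helper and the `leaf` suffix is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	nvidiaPrefix   = "network.topology.nvidia.com/"
	upstreamPrefix = "network.topology.kubernetes.io/"
)

// MirrorTopologyLabels returns a copy of labels in which every
// NVIDIA-prefixed topology label also appears under the upstream prefix.
// Upstream labels already present in the input are left unchanged.
func MirrorTopologyLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
		if !strings.HasPrefix(k, nvidiaPrefix) {
			continue
		}
		upstream := upstreamPrefix + strings.TrimPrefix(k, nvidiaPrefix)
		if _, exists := labels[upstream]; !exists {
			out[upstream] = v
		}
	}
	return out
}

func main() {
	in := map[string]string{
		"network.topology.nvidia.com/leaf": "sw-1", // suffix is illustrative
		"kubernetes.io/hostname":           "node1",
	}
	out := MirrorTopologyLabels(in)
	fmt.Println(out["network.topology.kubernetes.io/leaf"])
}
```

Writing both prefixes side by side during a transition lets existing consumers keep working while new ones adopt the upstream keys.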
5. Grove ClusterTopology integration
The Grove project's roadmap includes automatic topology detection — generating ClusterTopology resources directly from the cluster's physical layout. Topograph is the natural source for that data. A native integration between Topograph and Grove's ClusterTopology API would close the loop between topology discovery and scheduling configuration automatically.
6. AICR integration
NVIDIA/aicr is a sibling project that curates AI-cluster configuration recipes. Adding Topograph as a first-class AICR recipe component (alongside KAI Scheduler and the DRA driver) would let AICR-managed clusters pick up topology labels out of the box. This is a cross-repo change that would land on the AICR side.
How to engage
- Comment on this issue with questions, offers to pick up an area, or new focus-area proposals
- Open a new issue for a specific feature, bug, or integration; link back here if it maps to one of the focus areas above
- Submit a pull request; see CONTRIBUTING.md for mechanics (DCO sign-off, conventional commits, pre-push checks) and AGENTS.md for structural conventions
Maintainer note
This issue is intended to be pinned to the repository's issue tracker for discoverability. Please maintain it as the authoritative list of direction-setting focus areas; update it as items are completed, added, or reframed.