Roadmap & Focus Areas #265

@resker

This issue is a place to discuss and align on Topograph's direction, and to highlight the areas where contributions are especially welcome.

Direction

Topograph's strategic positioning is to be the standard topology-discovery layer for AI cluster scheduling — a unified substrate that abstracts over diverse infrastructure sources (cloud APIs, InfiniBand fabric, NVIDIA-managed NVLink/Ethernet) and translates topology into the formats each scheduler can consume (Slurm topology.conf, Kubernetes node labels, Slinky ConfigMaps). The broader the provider and engine coverage, the more useful the standardized topology layer becomes for downstream schedulers (KAI Scheduler, Kueue, Grove, native pod affinity), operational tooling (NVSentinel), and the ecosystem of observability / placement tools that depend on accurate, real-time topology data.
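To make the "translate topology into scheduler formats" idea concrete, here is a minimal sketch that renders a small switch tree as Slurm `topology.conf` lines. The `Switch` dataclass and `to_topology_conf` function are hypothetical names invented for this illustration and are not Topograph's internal model or API; only the `SwitchName=`, `Switches=`, and `Nodes=` keywords come from Slurm's tree topology format.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical in-memory model of one switch in a tree fabric."""
    name: str
    nodes: list[str] = field(default_factory=list)         # hosts attached to a leaf switch
    children: list["Switch"] = field(default_factory=list)  # downstream switches


def to_topology_conf(root: Switch) -> str:
    """Render the tree as Slurm topology.conf lines, parents before children."""
    lines: list[str] = []

    def visit(sw: Switch) -> None:
        if sw.children:
            child_names = ",".join(child.name for child in sw.children)
            lines.append(f"SwitchName={sw.name} Switches={child_names}")
            for child in sw.children:
                visit(child)
        else:
            lines.append(f"SwitchName={sw.name} Nodes={','.join(sw.nodes)}")

    visit(root)
    return "\n".join(lines)


# Illustrative two-leaf fabric; the switch and node names are made up.
fabric = Switch(
    "spine1",
    children=[
        Switch("leaf1", nodes=["node1", "node2"]),
        Switch("leaf2", nodes=["node3", "node4"]),
    ],
)
conf = to_topology_conf(fabric)
print(conf)
```

The same tree could just as well be walked to emit per-node Kubernetes labels; the point of the sketch is only that one discovered topology can feed several output formats.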

We actively welcome community contributions. The provider interface is well-defined, the engine boundary is stable, and the project is designed to accept new integrations without touching shared internals.

Focus Areas

1. Additional cloud and colocation providers

If your infrastructure isn't covered by the existing AWS, GCP, OCI, Nebius, Lambda AI, and CoreWeave providers, adding support means integrating with your provider's topology API and implementing a new Topograph provider.

2. On-premises network fabrics

Expanding on-premises coverage requires collaboration with network switch vendors and operators. If you run InfiniBand, Ethernet, or other high-performance fabrics on-premises and have ideas about how node-to-switch relationships should be exposed reliably in your environment, please comment on this issue or open a new one describing the mechanism.

3. Kubernetes Workload API and KEP-5732

The Workload API (alpha in Kubernetes 1.35) brings gang scheduling into the core scheduler. KEP-5732 extends this with a TopologyConstraint on PodGroupSpec — a node label key ensuring all pods in a gang land in the same topology domain — with DRA integration planned to follow. Both are upstream proposals under active development; the topology labels Topograph applies are the natural input as these features mature.
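As a rough illustration of what a label-key topology constraint means for a gang, the sketch below checks whether every pod in a group sits on a node sharing the same value for a given label key. This is not the scheduler's implementation and does not use the Workload API; `gang_topology_domain` and the sample data are hypothetical, and the `leaf` label suffix is an assumption made for the example.

```python
from typing import Optional


def gang_topology_domain(
    pod_to_node: dict[str, str],
    node_labels: dict[str, dict[str, str]],
    topology_key: str,
) -> Optional[str]:
    """Return the single topology domain shared by all pods, or None.

    None means the constraint is violated: either some node lacks the
    label, or the pods span more than one domain.
    """
    domains: set[str] = set()
    for node in pod_to_node.values():
        value = node_labels.get(node, {}).get(topology_key)
        if value is None:
            return None
        domains.add(value)
    return domains.pop() if len(domains) == 1 else None


# Assumed label key for the example; the suffix is illustrative.
KEY = "network.topology.nvidia.com/leaf"

labels = {
    "node1": {KEY: "leaf1"},
    "node2": {KEY: "leaf1"},
    "node3": {KEY: "leaf2"},
}

same_leaf = gang_topology_domain({"pod-a": "node1", "pod-b": "node2"}, labels, KEY)
split_gang = gang_topology_domain({"pod-a": "node1", "pod-b": "node3"}, labels, KEY)
```

Here `same_leaf` is `"leaf1"` and `split_gang` is `None`, which is why accurate node labels matter: the constraint can only be as good as the labels it reads.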

4. KEP-4962 — upstream topology label standardization

KEP-4962 proposes a standard network.topology.kubernetes.io/ label schema that closely mirrors Topograph's existing network.topology.nvidia.com/ design. Contributions that keep Topograph aligned with the KEP as it evolves — and contributions to the KEP itself — help shape how topology awareness lands in the Kubernetes ecosystem.
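Because the two schemas differ mainly in their prefix, one possible alignment strategy is to mirror existing labels under the upstream prefix. The sketch below assumes label suffixes carry over unchanged, which the KEP does not guarantee while it is still evolving; `mirror_labels` is a hypothetical helper, and the `leaf` and `spine` suffixes are illustrative.

```python
NVIDIA_PREFIX = "network.topology.nvidia.com/"
UPSTREAM_PREFIX = "network.topology.kubernetes.io/"


def mirror_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of labels with each NVIDIA-prefixed topology label
    duplicated under the upstream prefix. Other labels are left untouched.

    Assumption: the suffix after the prefix is identical in both schemas.
    """
    mirrored = dict(labels)
    for key, value in labels.items():
        if key.startswith(NVIDIA_PREFIX):
            suffix = key[len(NVIDIA_PREFIX):]
            mirrored[UPSTREAM_PREFIX + suffix] = value
    return mirrored


node_labels = {
    "network.topology.nvidia.com/leaf": "leaf1",
    "network.topology.nvidia.com/spine": "spine1",
    "kubernetes.io/hostname": "node1",
}
result = mirror_labels(node_labels)
```

Keeping both label sets side by side, rather than renaming, would let existing consumers keep working during a transition; whether that is the right approach depends on where the KEP lands.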

5. Grove ClusterTopology integration

The Grove project's roadmap includes automatic topology detection — generating ClusterTopology resources directly from the cluster's physical layout. Topograph is the natural source for that data. A native integration between Topograph and Grove's ClusterTopology API would close the loop between topology discovery and scheduling configuration automatically.

6. AICR integration

NVIDIA/aicr is a sibling project that curates AI-cluster configuration recipes. Adding Topograph as a first-class AICR recipe component (alongside KAI Scheduler and the DRA driver) would let AICR-managed clusters pick up topology labels out of the box. This is a cross-repo change that would land on the AICR side.

How to engage

  • Comment on this issue with questions, offers to pick up an area, or new focus-area proposals
  • Open a new issue for a specific feature, bug, or integration; link back here if it maps to one of the focus areas above
  • Submit a pull request — see CONTRIBUTING.md for mechanics (DCO sign-off, conventional commits, pre-push checks) and AGENTS.md for structural conventions

Maintainer note

This issue is intended to be pinned to the repository's issue tracker for discoverability. Please maintain it as the authoritative list of direction-setting focus areas; update it as items are completed, added, or reframed.
