Add Azure cross-region cloudprovider (modeled on nebius) #63

@chokevin

Summary

Add a Karpenter cloudprovider that provisions external nodes in other Azure regions, modeled on the existing karpenter/pkg/cloudproviders/nebius/ pattern. Goal: let an AKS cluster (e.g. westeurope) auto-provision capacity for SKUs only available in another region (e.g. Standard_ND96isr_H200_v5 in eastus2) via a standard Karpenter NodePool.

Motivation

The merged PR #61 + open PR #62 unblocked manual cross-region node join. We have 2 H200 nodes in eastus2 joined to an AKS control plane in westeurope — proven working end-to-end with aks-flex-node v0.0.18.

What's missing is lifecycle automation: today operators must run gen_userdata.py + az vm create for every node and clean up by hand. A Karpenter cloudprovider closes the loop so a researcher's pending Pod with nodeSelector: gpu=h200 triggers VM creation in eastus2 automatically.

The existing upstream karpenter-provider-azure provisions only into the AKS cluster's own region (VMSS-based). Cross-region requires a separate provider.

Proposed Phase 1 Scope

One region per NodeClass, BYO network, on-demand only, static SKU allowlist.

Mirror nebius:

karpenter/pkg/apis/v1alpha1/azureflex.go         # AzureFlexNodeClass CRD
karpenter/pkg/cloudproviders/azure/              # CloudProvider impl
karpenter/pkg/controllers/azure/                 # NodeClass status + termination
karpenter/examples/azure/                        # NodePool + NodeClass YAML

AzureFlexNodeClassSpec:

  • subscriptionID (required)
  • location (required, e.g. "eastus2")
  • resourceGroup (required)
  • subnetID (required, full ARM resource ID — assumes operator pre-provisioned VNet/peering/NSG)
  • imageReference (publisher/offer/sku/version) or imageID (SIG/community gallery)
  • securityType (default Standard)
  • osDiskSizeGB (default 128)
  • sshPublicKeys
  • allocateNodePublicIP (default false)
  • maxPodsPerNode (default 110)
  • tags (map)
  • (deferred: zones, identity/UAMI, PPG, capacity reservation, spot)
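For discussion, the field list above could map onto a Go type roughly like this. The field names come from the list; the Go types, JSON tags, the `SetDefaults`/`Validate` helpers, and the "exactly one of imageReference or imageID" rule are assumptions, not settled API.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ImageReference names a marketplace image by its four coordinates.
type ImageReference struct {
	Publisher string `json:"publisher"`
	Offer     string `json:"offer"`
	SKU       string `json:"sku"`
	Version   string `json:"version"`
}

// AzureFlexNodeClassSpec mirrors the proposed field list.
// Go types and JSON tags are assumptions made for this sketch.
type AzureFlexNodeClassSpec struct {
	SubscriptionID       string            `json:"subscriptionID"`
	Location             string            `json:"location"`
	ResourceGroup        string            `json:"resourceGroup"`
	SubnetID             string            `json:"subnetID"`
	ImageReference       *ImageReference   `json:"imageReference,omitempty"`
	ImageID              string            `json:"imageID,omitempty"`
	SecurityType         string            `json:"securityType,omitempty"`
	OSDiskSizeGB         int32             `json:"osDiskSizeGB,omitempty"`
	SSHPublicKeys        []string          `json:"sshPublicKeys,omitempty"`
	AllocateNodePublicIP bool              `json:"allocateNodePublicIP,omitempty"`
	MaxPodsPerNode       int32             `json:"maxPodsPerNode,omitempty"`
	Tags                 map[string]string `json:"tags,omitempty"`
}

// SetDefaults applies the defaults listed in the proposal.
func (s *AzureFlexNodeClassSpec) SetDefaults() {
	if s.SecurityType == "" {
		s.SecurityType = "Standard"
	}
	if s.OSDiskSizeGB == 0 {
		s.OSDiskSizeGB = 128
	}
	if s.MaxPodsPerNode == 0 {
		s.MaxPodsPerNode = 110
	}
}

// Validate checks the required fields. Requiring exactly one image source
// and the light subnet-ID prefix check are assumptions of this sketch.
func (s *AzureFlexNodeClassSpec) Validate() error {
	switch {
	case s.SubscriptionID == "":
		return errors.New("subscriptionID is required")
	case s.Location == "":
		return errors.New("location is required")
	case s.ResourceGroup == "":
		return errors.New("resourceGroup is required")
	case !strings.HasPrefix(strings.ToLower(s.SubnetID), "/subscriptions/"):
		return errors.New("subnetID must be a full ARM resource ID")
	case (s.ImageReference == nil) == (s.ImageID == ""):
		return errors.New("set exactly one of imageReference or imageID")
	}
	return nil
}

func main() {
	spec := AzureFlexNodeClassSpec{
		SubscriptionID: "<sub>",
		Location:       "eastus2",
		ResourceGroup:  "<rg>",
		SubnetID:       "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>",
		ImageID:        "<gallery-image-id>",
	}
	spec.SetDefaults()
	fmt.Println(spec.SecurityType, spec.OSDiskSizeGB, spec.MaxPodsPerNode, spec.Validate())
}
```

In the real CRD the defaults and required-field checks would more likely live in kubebuilder markers or CEL rules; the helpers here only make the intended behavior concrete.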

ProviderID: azure-flex:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<name> (full canonical ARM ID).
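A minimal sketch of the providerID round-trip, assuming the format above. `VMRef`, `ProviderID`, and `ParseProviderID` are hypothetical names for illustration; segment names are compared case-insensitively because ARM IDs are case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// providerIDPrefix plus the leading "/" of the ARM ID yields "azure-flex:///...".
const providerIDPrefix = "azure-flex://"

// VMRef holds the parts of the ARM ID that identify one VM.
type VMRef struct {
	SubscriptionID string
	ResourceGroup  string
	Name           string
}

// ProviderID renders the reference in the proposed providerID format.
func (r VMRef) ProviderID() string {
	return fmt.Sprintf(
		"%s/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s",
		providerIDPrefix, r.SubscriptionID, r.ResourceGroup, r.Name)
}

// ParseProviderID is the inverse of ProviderID and rejects anything that
// is not a virtual machine ARM ID under the azure-flex scheme.
func ParseProviderID(id string) (VMRef, error) {
	if !strings.HasPrefix(id, providerIDPrefix) {
		return VMRef{}, fmt.Errorf("providerID %q: missing %q prefix", id, providerIDPrefix)
	}
	parts := strings.Split(strings.TrimPrefix(id, providerIDPrefix), "/")
	// Expected: "", subscriptions, <sub>, resourceGroups, <rg>,
	// providers, Microsoft.Compute, virtualMachines, <name>
	if len(parts) != 9 || parts[0] != "" ||
		!strings.EqualFold(parts[1], "subscriptions") ||
		!strings.EqualFold(parts[3], "resourceGroups") ||
		!strings.EqualFold(parts[5], "providers") ||
		!strings.EqualFold(parts[6], "Microsoft.Compute") ||
		!strings.EqualFold(parts[7], "virtualMachines") {
		return VMRef{}, fmt.Errorf("providerID %q: not a virtual machine ARM ID", id)
	}
	if parts[2] == "" || parts[4] == "" || parts[8] == "" {
		return VMRef{}, fmt.Errorf("providerID %q: empty subscription, resource group, or name", id)
	}
	return VMRef{SubscriptionID: parts[2], ResourceGroup: parts[4], Name: parts[8]}, nil
}

func main() {
	ref := VMRef{SubscriptionID: "<sub>", ResourceGroup: "<rg>", Name: "<name>"}
	id := ref.ProviderID()
	back, err := ParseProviderID(id)
	fmt.Println(id)
	fmt.Println(back == ref, err)
}
```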

VM lifecycle:

  • Deterministic VM name from NodeClaim.Name
  • DeleteOption=Delete on NIC + OS disk so VM delete cascades
  • Idempotent on retries (handle 404 as success)
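To make the naming and delete semantics concrete, here is a rough sketch. It assumes a 64-character limit on Linux VM names and uses a local `StatusError` and `VMDeleter` as stand-ins for the Azure SDK error and client types; with the real SDK the check would presumably look at the HTTP status on the SDK's response error instead. All names are hypothetical.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"net/http"
)

// maxVMNameLen is an assumed limit for Linux VM names.
const maxVMNameLen = 64

// VMName derives a stable VM name from a NodeClaim name. Short names pass
// through unchanged; long ones are truncated and get a hash suffix so that
// two long names sharing a prefix do not collide.
func VMName(nodeClaimName string) string {
	if len(nodeClaimName) <= maxVMNameLen {
		return nodeClaimName
	}
	sum := sha256.Sum256([]byte(nodeClaimName))
	suffix := hex.EncodeToString(sum[:])[:8]
	return nodeClaimName[:maxVMNameLen-len(suffix)-1] + "-" + suffix
}

// StatusError is a local stand-in for an SDK error carrying an HTTP status.
type StatusError struct {
	StatusCode int
	Message    string
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("status %d: %s", e.StatusCode, e.Message)
}

// VMDeleter is the hypothetical slice of the compute client needed here.
type VMDeleter interface {
	DeleteVM(ctx context.Context, resourceGroup, name string) error
}

// DeleteIdempotent treats "already gone" (404) as success so retries are safe.
func DeleteIdempotent(ctx context.Context, d VMDeleter, resourceGroup, name string) error {
	err := d.DeleteVM(ctx, resourceGroup, name)
	var se *StatusError
	if errors.As(err, &se) && se.StatusCode == http.StatusNotFound {
		return nil
	}
	return err
}

// fakeDeleter is an in-memory VMDeleter used only to exercise the sketch.
type fakeDeleter struct{ vms map[string]bool }

func (f *fakeDeleter) DeleteVM(_ context.Context, resourceGroup, name string) error {
	key := resourceGroup + "/" + name
	if !f.vms[key] {
		return &StatusError{StatusCode: http.StatusNotFound, Message: "vm not found"}
	}
	delete(f.vms, key)
	return nil
}

func main() {
	name := VMName("h200-pool-abcde")
	d := &fakeDeleter{vms: map[string]bool{"<rg>/" + name: true}}
	fmt.Println(DeleteIdempotent(context.Background(), d, "<rg>", name)) // first delete
	fmt.Println(DeleteIdempotent(context.Background(), d, "<rg>", name)) // retry after success
}
```

NIC and OS disk cleanup is not shown because it is expected to cascade from the VM delete via DeleteOption=Delete.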

Userdata: reuse existing plugin/pkg/util/kubeadm/azure.go:FromAKS() + plugin/pkg/services/agentpools/userdata/flex/. Cache the FromAKS result per controller process to avoid hammering the bootstrap secret on every Create.
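The per-process cache could be as small as the sketch below. The signature of FromAKS() is not reproduced here; `fakeLoader` is a hypothetical stand-in for it, and `Cached`/`NewCached` are illustrative names. Only successful results are cached, so a transient failure is retried on the next Create.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Cached memoizes the first successful result of load for the process lifetime.
type Cached[T any] struct {
	mu     sync.Mutex
	load   func(context.Context) (T, error)
	value  T
	loaded bool
}

// NewCached wraps a loader function in a cache.
func NewCached[T any](load func(context.Context) (T, error)) *Cached[T] {
	return &Cached[T]{load: load}
}

// Get returns the cached value, calling load only until it first succeeds.
// The mutex is held during load, so concurrent callers trigger one lookup.
func (c *Cached[T]) Get(ctx context.Context) (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.loaded {
		return c.value, nil
	}
	v, err := c.load(ctx)
	if err != nil {
		var zero T
		return zero, err
	}
	c.value, c.loaded = v, true
	return v, nil
}

// fakeLoader stands in for the real bootstrap lookup and fails the first
// `failures` calls to show that errors are not cached.
type fakeLoader struct {
	calls    int
	failures int
}

func (l *fakeLoader) Load(context.Context) (string, error) {
	l.calls++
	if l.calls <= l.failures {
		return "", errors.New("transient failure")
	}
	return "bootstrap-config", nil
}

func main() {
	l := &fakeLoader{failures: 1}
	c := NewCached(l.Load)
	for i := 0; i < 4; i++ {
		v, err := c.Get(context.Background())
		fmt.Printf("value=%q err=%v\n", v, err)
	}
	fmt.Println("loader calls:", l.calls)
}
```

Whether the cached value ever needs refreshing (for example if the bootstrap credentials rotate) is left open; a TTL could be added to `Cached` if so.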

Drift: hash-based, mirroring the nebius pattern. Triggers: a change to the image, subnet, location, resource group, or subscription.
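One way the drift hash could be computed, as a sketch only: the exact nebius hashing scheme is not reproduced here, and `DriftInputs`, `DriftHash`, and `Drifted` are hypothetical names. The hash covers just the trigger fields, and a struct is marshaled (rather than a map) so the field order, and therefore the hash, is deterministic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// DriftInputs collects the NodeClass fields whose change should mark a
// node as drifted. Image is whichever image identifier the NodeClass uses.
type DriftInputs struct {
	SubscriptionID string `json:"subscriptionID"`
	ResourceGroup  string `json:"resourceGroup"`
	Location       string `json:"location"`
	SubnetID       string `json:"subnetID"`
	Image          string `json:"image"`
}

// DriftHash returns a short, stable digest of the inputs. encoding/json
// emits struct fields in declaration order, so equal inputs always hash equal.
func DriftHash(in DriftInputs) string {
	b, err := json.Marshal(in)
	if err != nil {
		// Unreachable for a struct of plain strings.
		panic(err)
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8])
}

// Drifted compares the hash recorded at creation time with the current one.
func Drifted(recordedHash string, current DriftInputs) bool {
	return recordedHash != DriftHash(current)
}

func main() {
	in := DriftInputs{
		SubscriptionID: "<sub>",
		ResourceGroup:  "<rg>",
		Location:       "eastus2",
		SubnetID:       "<subnet-id>",
		Image:          "<image>",
	}
	recorded := DriftHash(in)
	fmt.Println("unchanged drifted:", Drifted(recorded, in))

	in.Image = "<new-image>"
	fmt.Println("image changed drifted:", Drifted(recorded, in))
}
```

Fields such as tags or sshPublicKeys are deliberately left out of the inputs, so changing them would not replace nodes.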

SKU catalog: hardcoded allowlist of 3-5 GPU SKUs we currently care about (ND96isr_H200_v5, ND96amsr_A100_v4, NCadsH100v5) with explicit availableLocations. The catalog sits behind a small Go interface so the Azure SKUs API can be plugged in later.
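A sketch of what the catalog boundary might look like. `SKUCatalog` and `StaticCatalog` are hypothetical names. Only the H200 entry is filled in, since eastus2 is the one SKU/region pairing stated in this issue; the other allowlisted SKUs would be added with their real locations.

```go
package main

import (
	"fmt"
	"strings"
)

// SKUCatalog is the seam behind which a later implementation could query
// the Azure SKUs API instead of a hardcoded list.
type SKUCatalog interface {
	// Locations returns the regions where the SKU may be provisioned.
	Locations(sku string) []string
	// IsAvailable reports whether the SKU is allowlisted for the region.
	IsAvailable(sku, location string) bool
}

// StaticCatalog is the Phase 1 allowlist: SKU name -> available locations.
type StaticCatalog map[string][]string

func (c StaticCatalog) Locations(sku string) []string { return c[sku] }

func (c StaticCatalog) IsAvailable(sku, location string) bool {
	for _, l := range c[sku] {
		if strings.EqualFold(l, location) {
			return true
		}
	}
	return false
}

// Compile-time check that StaticCatalog satisfies the interface.
var _ SKUCatalog = StaticCatalog{}

// defaultCatalog holds only the pairing confirmed in this issue.
var defaultCatalog = StaticCatalog{
	"Standard_ND96isr_H200_v5": {"eastus2"},
}

func main() {
	var catalog SKUCatalog = defaultCatalog
	fmt.Println(catalog.IsAvailable("Standard_ND96isr_H200_v5", "eastus2"))
	fmt.Println(catalog.IsAvailable("Standard_ND96isr_H200_v5", "westeurope"))
	fmt.Println(catalog.Locations("Standard_ND96isr_H200_v5"))
}
```

The provider would reject a NodeClaim up front when `IsAvailable` is false for the NodeClass location, which is what the "unsupported SKU/region rejection" unit test would exercise.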

Explicit Non-Goals (Phase 1)

  • Network bring-up (VNet, peering, NSG) — operator-managed, NodeClass references existing subnet
  • Multi-region per NodeClass (use multiple NodePools instead)
  • Spot pricing / capacity-type selection
  • Quota preflight (let ARM fail, classify as InsufficientCapacity)
  • Full Azure SKU catalog (allowlist only)
  • Cross-subscription identity management (assume controller MI has rights in target sub)
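For the "let ARM fail, classify as InsufficientCapacity" item, the classification might look roughly like this. `ARMError` and `InsufficientCapacityError` are local stand-ins for the SDK error and for whatever insufficient-capacity error Karpenter core expects, and the set of error codes is an illustrative assumption, not an exhaustive or verified list.

```go
package main

import (
	"errors"
	"fmt"
)

// ARMError is a local stand-in for an ARM failure carrying an error code.
type ARMError struct {
	Code    string
	Message string
}

func (e *ARMError) Error() string { return e.Code + ": " + e.Message }

// InsufficientCapacityError marks failures where retrying the same
// SKU/region is pointless and the scheduler should look elsewhere.
type InsufficientCapacityError struct{ Err error }

func (e *InsufficientCapacityError) Error() string {
	return "insufficient capacity: " + e.Err.Error()
}

func (e *InsufficientCapacityError) Unwrap() error { return e.Err }

// capacityCodes lists ARM error codes treated as capacity or quota failures.
// The exact set is an assumption for this sketch.
var capacityCodes = map[string]bool{
	"SkuNotAvailable":                  true,
	"AllocationFailed":                 true,
	"ZonalAllocationFailed":            true,
	"OverconstrainedAllocationRequest": true,
	"QuotaExceeded":                    true,
}

// Classify wraps capacity-type ARM errors and passes everything else through.
func Classify(err error) error {
	var ae *ARMError
	if errors.As(err, &ae) && capacityCodes[ae.Code] {
		return &InsufficientCapacityError{Err: err}
	}
	return err
}

func main() {
	for _, err := range []error{
		&ARMError{Code: "SkuNotAvailable", Message: "SKU not available in region"},
		&ARMError{Code: "AuthorizationFailed", Message: "caller lacks permission"},
	} {
		var ice *InsufficientCapacityError
		fmt.Printf("%v -> capacity=%v\n", err, errors.As(Classify(err), &ice))
	}
}
```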

Open Questions for Maintainers

  1. API group: karpenter.flex.aks.azure.com to match the project? Or stay under flex.aks.azure.com like nebius does (nebiusnodeclasses.flex.aks.azure.com)?
  2. Reuse vs reimplement: should the provider take a Go-module dependency on Azure/karpenter-provider-azure for VM lifecycle helpers, or stay self-contained like nebius does for Nebius?
  3. Identity model: Phase 1 assumes the controller's MI/SP has Contributor on the target subscription/RG/subnet. Worth surfacing as a NodeClass field now (escape hatch) or keep as deployment-level concern?
  4. SKU catalog: hardcoded allowlist OK for v1, or block on Azure SKUs API integration?
  5. aks-flex-node version: plugin/pkg/services/agentpools/userdata/flex/flex.go pins v0.0.17. Should this PR also bump it to v0.0.18 (the version proven to work for our H200 case), or should that be a separate PR?

Validation Plan

Real-world e2e target: voice-agent-flex (AKS westeurope) provisioning into voice-agent-flex-h200-rg (eastus2). 2 working H200 nodes already there to compare against.

Unit tests will cover: providerID round-trip, idempotent delete, partial-failure cleanup, NodeClass deletion blocked by live NodeClaims, bad-subnet validation, unsupported SKU/region rejection, drift hash determinism.

Happy to scope this down further if Phase 1 is too big a chunk. Filing as a draft for direction-setting before opening a PR.
