Harden Azure H200 Karpenter provisioning #64

Open
chokevin wants to merge 20 commits into Azure:main from chokevin:chokevin/azure-cross-region-cloudprovider

Conversation

@chokevin
Contributor

@chokevin chokevin commented Apr 22, 2026

What

Hardens AzureFlex H200 provisioning through Karpenter by persisting incomplete agent pools after NIC creation, filtering mixed protobuf Any list responses by concrete AgentPool type, and delaying List-driven cleanup of fresh incomplete AzureFlex records.

This also updates the Karpenter integration to the compatible Azure provider/Karpenter versions and adds H200 eastus2/eastus2euap examples so both regions can be capacity-hunted independently.

Why

Azure can create or reserve NIC/disk resources before VM creation fails for H200 quota or physical capacity. Without durable incomplete records and safe cleanup, those failed attempts can leak NICs; without type-filtered list handling, mixed provider records can break Karpenter garbage collection.
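
The type-filtered list handling can be pictured with a small sketch. It is only an illustration under stated assumptions: `anyRecord` is a stdlib stand-in for a protobuf `Any` (a type URL plus opaque bytes), and the message name `example.flexvm.AgentPool` is hypothetical. Real code would more likely check the type through the protobuf library than through string handling.

```go
package main

import (
	"fmt"
	"strings"
)

// anyRecord is a stand-in for a protobuf Any: a type URL plus opaque bytes.
// It is NOT the real anypb.Any type; it only mirrors its shape.
type anyRecord struct {
	TypeURL string
	Value   []byte
}

// flexVMAgentPoolName is a hypothetical fully qualified message name.
const flexVMAgentPoolName = "example.flexvm.AgentPool"

// messageName returns the part of a type URL after the last "/".
func messageName(typeURL string) string {
	return typeURL[strings.LastIndex(typeURL, "/")+1:]
}

// filterByMessageName keeps only the records whose concrete type matches
// want, so records written by other providers are skipped instead of
// causing an unmarshal error for the whole list.
func filterByMessageName(records []anyRecord, want string) []anyRecord {
	out := make([]anyRecord, 0, len(records))
	for _, r := range records {
		if messageName(r.TypeURL) == want {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	mixed := []anyRecord{
		{TypeURL: "type.googleapis.com/example.flexvm.AgentPool"},
		{TypeURL: "type.googleapis.com/example.other.AgentPool"},
	}
	fmt.Println(len(filterByMessageName(mixed, flexVMAgentPoolName)))
}
```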

Non-goals

  • Does not guarantee Azure H200 physical capacity is available.
  • Does not change the two-node-per-region quota target used in the live cluster.
  • Does not remove the keepalive/capacity-hunting behavior.

Testing

  • cd plugin && go test -count=1 ./...
  • cd karpenter && go test -count=1 ./...
  • cd cli && go test ./...
  • helm template karpenter karpenter/charts/karpenter --namespace karpenter
  • kubectl apply --dry-run=server -f karpenter/examples/azure/nodepool-h200.yaml
  • kubectl apply --dry-run=server -f karpenter/examples/azure/h200_deployment.yaml
  • kubectl apply --dry-run=server -f karpenter/examples/azure/azureflexnodeclass-h200-eastus2euap.yaml
  • kubectl apply --dry-run=server -f karpenter/examples/azure/nodepool-h200-eastus2euap.yaml

Risk

Main risk is AzureFlex cleanup being too aggressive or too conservative. The mitigation is a 30-minute grace period before List-driven cleanup of incomplete records, while Create failure cleanup still targets known failed NodeClaims. Rollback is reverting the Karpenter/plugin changes and redeploying the previous controller image.

chokevin and others added 10 commits April 22, 2026 11:28
Adds the AzureFlexNodeClass CRD that lets a Karpenter NodePool in an AKS
cluster auto-provision single Azure VMs in a (potentially different)
Azure region. Spec covers subscription/RG/subnet, image (marketplace ref
or SIG ID), security type (Standard only), OS disk size, SSH keys,
public IP toggle, max pods, and tags. Implements status.Object.

Regenerates deepcopy and CRD manifests. Mirrors the new CRD into the
helm chart crds/ directory.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a new gRPC-served agent pool implementation that creates a single
Azure VM (not VMSS) per agent pool. VM and NIC names are deterministic
from the agent pool ID so retries are idempotent. Both the NIC and OS
disk are tagged DeleteOption=Delete so a single VM delete cascades the
whole resource trio.
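
One way to get deterministic names is to hash the agent pool ID. The commit message does not say how the names are actually derived, so the scheme below (SHA-256, 12 hex characters, a `flexvm-` prefix) is purely an assumption for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// resourceNames derives a VM name and a NIC name from the agent pool ID.
// The same ID always yields the same names, so a retried Create targets
// the same Azure resources instead of creating new ones.
// The prefix and suffix length are hypothetical choices.
func resourceNames(agentPoolID string) (vmName, nicName string) {
	sum := sha256.Sum256([]byte(agentPoolID))
	suffix := hex.EncodeToString(sum[:])[:12]
	vmName = "flexvm-" + suffix
	return vmName, vmName + "-nic"
}

func main() {
	vm, nic := resourceNames("agentpools/h200-eastus2-abc")
	fmt.Println(vm, nic)
}
```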

Phase 1 scope: Standard security only (TrustedLaunch breaks the DSVM
image), default DSVM image, gzip+base64 cloud-init via flex.UserData,
no public IP by default. Auth uses the plugin process's default Azure
credential — the plugin MI is expected to have Contributor on the
target subscription/RG/subnet.

Wires the new service alongside ubuntu2404vmss in the agentpools and
instances service registries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements a Karpenter cloudprovider ('azure-flex') that backs the new
AzureFlexNodeClass by talking to the plugin's flexvm service. Mirrors
the nebius layout: consts/log/api stubs, an instancetype subpackage
with a hardcoded Phase 1 SKU catalog (ND96isr_H200_v5, ND96amsr_A100_v4,
NC40ads_H100_v5, NC24ads_A100_v4, D8s_v5), nodeclaim conversions, and
the cloudprovider.go top-level CRUD.

ProviderID format is azure-flex:///<full-arm-id> (three slashes); the
round-trip via providerIDToARMID is lossless and the parse rejects URLs
that put anything in the host position. Drift is computed as a SHA-256
over the AzureFlexNodeClass fields that affect VM identity, with sorted
tag keys for determinism.
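
A sketch of the provider ID round-trip using `net/url`. The function names match the ones mentioned in these commits, but the bodies are an assumption: here the ARM ID's leading slash is folded into the third slash of the prefix, which may differ from the real helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const (
	providerIDScheme = "azure-flex"
	providerIDPrefix = providerIDScheme + ":///"
)

// armIDToProviderID wraps an ARM resource ID in the azure-flex scheme.
// An empty ARM ID yields "" rather than the bare prefix.
func armIDToProviderID(armID string) string {
	if armID == "" {
		return ""
	}
	return providerIDPrefix + strings.TrimPrefix(armID, "/")
}

// providerIDToARMID reverses armIDToProviderID and rejects IDs that carry
// anything in the URL host position (two slashes instead of three).
func providerIDToARMID(providerID string) (string, error) {
	u, err := url.Parse(providerID)
	if err != nil {
		return "", fmt.Errorf("parsing provider ID %q: %w", providerID, err)
	}
	if u.Scheme != providerIDScheme {
		return "", fmt.Errorf("provider ID %q: unexpected scheme %q", providerID, u.Scheme)
	}
	if u.Host != "" {
		return "", fmt.Errorf("provider ID %q: host %q must be empty", providerID, u.Host)
	}
	if u.Path == "" || u.Path == "/" {
		return "", fmt.Errorf("provider ID %q: empty ARM ID", providerID)
	}
	return u.Path, nil
}

func main() {
	armID := "/subscriptions/0000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm-1"
	pid := armIDToProviderID(armID)
	back, err := providerIDToARMID(pid)
	fmt.Println(pid, back == armID, err)
}
```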

Quota errors from the plugin surface as InsufficientCapacityError so
Karpenter stops thrashing on a NodePool whose SKU isn't available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the two AzureFlexNodeClass reconcilers (status and termination),
mirrored almost line-for-line from the nebius equivalents. Status
controller adds a finalizer and sets the ValidationSucceeded condition
based on cheap shape checks (required fields, subnet ARM-ID prefix,
imageReference vs imageID mutual exclusion). Termination controller
blocks NodeClass deletion until all owning NodeClaims are gone,
re-emitting a WaitingOnNodeClaimTermination event every 10m.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Registers the new azure cloudprovider in the cloudproviders hub
alongside aks/nebius/kaito, adds AzureFlexNodeClass to the WaitForCRDs
list, and registers the two AzureFlex controllers in
flexcontrollers.NewControllers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a worked example for an H200 NodePool in eastus2 backed by an
AzureFlexNodeClass, with a 64-GPU limit and consolidation enabled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- nodeclass_status: capture base before AddFinalizer; client.MergeFrom on
  the mutated object produces an empty patch and silently drops the
  finalizer, allowing nodeclass deletion to bypass termination cleanup.
- flexvm/agentpools Create: clean up NIC on PollUntilDone failure (NIC
  was created but VM never reached a state where DeleteOption cascades),
  and guard against nil vmResp.ID before deref.
- flexvm/agentpools Delete: NIC cleanup uses fresh background context so
  it still runs if caller cancels mid-VM-delete.
- nodeclaim: armIDToProviderID("") returns empty rather than the invalid
  'azure-flex:///' URL when status is not yet populated.
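
The finalizer bug in the first bullet comes down to aliasing: if the patch base is the same object that gets mutated, the diff is empty. The sketch below reproduces that with a toy diff using only the standard library. `toyPatch` and `nodeClass` are hypothetical stand-ins and do not use controller-runtime.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// nodeClass is a hypothetical object that only carries finalizers.
type nodeClass struct {
	Finalizers []string `json:"finalizers,omitempty"`
}

// deepCopy returns an independent copy, like a generated DeepCopy method.
func (n *nodeClass) deepCopy() *nodeClass {
	return &nodeClass{Finalizers: append([]string(nil), n.Finalizers...)}
}

// toyPatch mimics a merge patch: "{}" when base and modified serialize
// identically, otherwise the modified document.
func toyPatch(base, modified *nodeClass) string {
	b, _ := json.Marshal(base)
	m, _ := json.Marshal(modified)
	if bytes.Equal(b, m) {
		return "{}"
	}
	return string(m)
}

func main() {
	const finalizer = "example.com/termination" // hypothetical name

	// Buggy order: the base aliases the object that is then mutated.
	obj := &nodeClass{}
	base := obj
	obj.Finalizers = append(obj.Finalizers, finalizer)
	fmt.Println(toyPatch(base, obj)) // {}

	// Fixed order: capture a copy before mutating.
	obj2 := &nodeClass{}
	base2 := obj2.deepCopy()
	obj2.Finalizers = append(obj2.Finalizers, finalizer)
	fmt.Println(toyPatch(base2, obj2)) // {"finalizers":["example.com/termination"]}
}
```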

The new CRD shares the flex.aks.azure.com api group with nebius, but the
controller needs explicit verb grants on the resource name. Without this
the nodeclass controllers would 403 on every reconcile in a real
deployment.

azcore is a direct import in the new azure cloudprovider.
@chokevin chokevin marked this pull request as ready for review April 22, 2026 20:01
Copilot AI review requested due to automatic review settings April 22, 2026 20:01
Contributor

Copilot AI left a comment

Pull request overview

Adds Phase 1 Azure cross-region (“azure-flex”) support to the repo by introducing an AzureFlexNodeClass CRD plus a Karpenter cloudprovider that provisions individual Azure VMs via a colocated plugin service, enabling nodes to join an AKS control plane from any Azure region (unblocking cross-region GPU capacity workflows).

Changes:

  • Introduces AzureFlexNodeClass (CRD + deepcopy/registration) and wires AzureFlex controllers (status + termination).
  • Adds a new Karpenter cloudprovider (karpenter/pkg/cloudproviders/azure) including providerID helpers and a hardcoded Phase 1 instance type catalog.
  • Adds a new plugin agentpool service (plugin/.../azure/flexvm) for per-VM provisioning (NIC + VM + userdata) and registers it in the plugin servers; bumps flex node userdata version to v0.0.18.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| plugin/pkg/services/agentpools/userdata/flex/flex.go | Bumps flexNodeVersion to v0.0.18 for generated userdata. |
| plugin/pkg/services/agentpools/instances.go | Registers the new Azure flexvm Instances service. |
| plugin/pkg/services/agentpools/azure/flexvm/redact.go | Adds Redact hooks for the new flexvm proto objects. |
| plugin/pkg/services/agentpools/azure/flexvm/instances.proto | Defines Instances messages for flexvm (1 VM per AgentPool). |
| plugin/pkg/services/agentpools/azure/flexvm/instances.pb.go | Generated protobuf code for flexvm instances messages. |
| plugin/pkg/services/agentpools/azure/flexvm/instances.go | Implements minimal List/Get for Instances backed by AgentPool storage. |
| plugin/pkg/services/agentpools/azure/flexvm/agentpools.proto | Defines flexvm AgentPool spec/status (subscription/RG/subnet/image/kubeadm/etc). |
| plugin/pkg/services/agentpools/azure/flexvm/agentpools.pb.go | Generated protobuf code for flexvm agentpool messages. |
| plugin/pkg/services/agentpools/azure/flexvm/agentpools.go | Implements Azure ARM NIC+VM provisioning and idempotent deletion for flexvm. |
| plugin/pkg/services/agentpools/agentpools.go | Registers the new Azure flexvm AgentPools service. |
| karpenter/pkg/controllers/controllers.go | Wires AzureFlex NodeClass status + termination controllers. |
| karpenter/pkg/controllers/azure/nodeclass_termination.go | Adds termination finalizer removal flow blocked by existing NodeClaims. |
| karpenter/pkg/controllers/azure/nodeclass_status_test.go | Adds unit tests for AzureFlex NodeClass spec validation. |
| karpenter/pkg/controllers/azure/nodeclass_status.go | Adds status controller that ensures finalizer + publishes validation condition. |
| karpenter/pkg/cloudproviders/azure/nodeclaim_test.go | Adds unit tests for providerID parsing and drift hash determinism. |
| karpenter/pkg/cloudproviders/azure/nodeclaim.go | Adds providerID helpers, NodeClaim↔AgentPool translation, and drift hashing. |
| karpenter/pkg/cloudproviders/azure/log.go | Adds provider-scoped logger helper. |
| karpenter/pkg/cloudproviders/azure/instancetype/provider.go | Hardcoded Phase 1 instance type provider + requirement resolution. |
| karpenter/pkg/cloudproviders/azure/instancetype/offerings.go | Defines Phase 1 offerings (on-demand only, empty zone requirement). |
| karpenter/pkg/cloudproviders/azure/instancetype/instancetype.go | Builds Karpenter InstanceTypes (requirements/capacity/overhead) from catalog. |
| karpenter/pkg/cloudproviders/azure/instancetype/catalog_test.go | Tests catalog presence and provider listing behavior. |
| karpenter/pkg/cloudproviders/azure/instancetype/catalog.go | Adds hardcoded SKU allowlist (H200/A100/H100 + baseline D-series). |
| karpenter/pkg/cloudproviders/azure/consts.go | Introduces ProviderIDScheme and GroupKind for AzureFlex. |
| karpenter/pkg/cloudproviders/azure/cloudprovider.go | Implements Karpenter CloudProvider CRUD + instancetype listing via plugin gRPC. |
| karpenter/pkg/cloudproviders/azure/api.go | Adds helper classifiers (NotFound/quota) for plugin/Azure error surfaces. |
| karpenter/pkg/apis/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy support for new AzureFlex API types. |
| karpenter/pkg/apis/v1alpha1/labels.go | Adds AzureFlexNodeClassHashAnnotation constant for drift tracking. |
| karpenter/pkg/apis/v1alpha1/doc.go | Registers AzureFlex types into the scheme. |
| karpenter/pkg/apis/v1alpha1/azureflex.go | Defines AzureFlexNodeClass CRD types and spec fields. |
| karpenter/pkg/apis/crds/flex.aks.azure.com_nebiusnodeclasses.yaml | Reorders/adds maxPodsPerNode schema placement (no functional change). |
| karpenter/pkg/apis/crds/flex.aks.azure.com_azureflexnodeclasses.yaml | Adds generated CRD YAML for AzureFlexNodeClass. |
| karpenter/go.mod | Promotes azcore to a direct dependency (needed by new code). |
| karpenter/examples/azure/nodepool-h200.yaml | Adds example NodePool targeting the H200 SKU via AzureFlexNodeClass. |
| karpenter/examples/azure/azureflexnodeclass-h200-eastus2.yaml | Adds example AzureFlexNodeClass for eastus2 H200 provisioning. |
| karpenter/cmd/controller/main.go | Waits for AzureFlex CRD and registers the azure-flex cloudprovider. |
| karpenter/charts/karpenter/templates/clusterrole.yaml | Grants RBAC for azureflexnodeclasses get/list/watch and status patch/update. |
| karpenter/charts/karpenter/crds/flex.aks.azure.com_azureflexnodeclasses.yaml | Adds AzureFlexNodeClass CRD to the Helm chart CRDs. |

Comment on lines +139 to +142
// Phase 1: skip explicit PIP creation — leave a TODO. Per-NodeClass
// public IP is deferred; documented in CRD.
// (Falls through to private-only NIC.)
_ = nicParams // satisfy linter

Copilot AI Apr 22, 2026

allocate_public_ip is accepted but currently ignored (the NIC is still created private-only) and the code just assigns _ = nicParams to satisfy the linter. This is a silent contract violation for callers. Either implement Public IP creation/attachment here, or reject allocate_public_ip=true in validateSpec until it's supported.

Suggested change
- // Phase 1: skip explicit PIP creation — leave a TODO. Per-NodeClass
- // public IP is deferred; documented in CRD.
- // (Falls through to private-only NIC.)
- _ = nicParams // satisfy linter
+ pipsClient, err := armnetwork.NewPublicIPAddressesClient(spec.GetSubscriptionId(), srv.credentials, nil)
+ if err != nil {
+ 	return nil, fmt.Errorf("creating public IP client: %w", err)
+ }
+ pipName := nicName + "-pip"
+ pipParams := armnetwork.PublicIPAddress{
+ 	Location: to.Ptr(spec.GetLocation()),
+ 	Tags:     toARMTags(spec.GetTags()),
+ 	SKU: &armnetwork.PublicIPAddressSKU{
+ 		Name: to.Ptr(armnetwork.PublicIPAddressSKUNameStandard),
+ 	},
+ 	Properties: &armnetwork.PublicIPAddressPropertiesFormat{
+ 		PublicIPAllocationMethod: to.Ptr(armnetwork.IPAllocationMethodStatic),
+ 	},
+ }
+ pipPoller, err := pipsClient.BeginCreateOrUpdate(ctx, spec.GetResourceGroup(), pipName, pipParams, nil)
+ if err != nil {
+ 	return nil, fmt.Errorf("creating public IP %q: %w", pipName, err)
+ }
+ pipResp, err := pipPoller.PollUntilDone(ctx, nil)
+ if err != nil {
+ 	return nil, fmt.Errorf("polling public IP creation %q: %w", pipName, err)
+ }
+ if pipResp.ID == nil {
+ 	return nil, fmt.Errorf("public IP %q created without resource ID", pipName)
+ }
+ nicParams.Properties.IPConfigurations[0].Properties.PublicIPAddress = &armnetwork.PublicIPAddress{
+ 	ID: pipResp.ID,
+ }

return nil, fmt.Errorf("creating azure-flex agent pool: %w", err)
}

return agentPoolToNodeClaim(created, it), nil

Copilot AI Apr 22, 2026

IsDrifted compares the current NodeClass hash to nodeClaim.Annotations[AzureFlexNodeClassHashAnnotation], but this annotation is never set when creating/updating a NodeClaim. As written, drift detection will never trigger. Consider setting the annotation (e.g., to driftHash(nodeClass.Spec)) on the NodeClaim you return from Create so it persists for later drift checks.

Suggested change
- return agentPoolToNodeClaim(created, it), nil
+ createdNodeClaim := agentPoolToNodeClaim(created, it)
+ if createdNodeClaim.Annotations == nil {
+ 	createdNodeClaim.Annotations = map[string]string{}
+ }
+ createdNodeClaim.Annotations[AzureFlexNodeClassHashAnnotation] = driftHash(nodeClass.Spec)
+ return createdNodeClaim, nil
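
For context, a drift hash of the kind this comment relies on can be sketched as follows. The spec struct and its fields are a hypothetical subset; the only properties taken from the PR are SHA-256 and sorted tag keys for determinism.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// nodeClassSpec is a hypothetical subset of the fields that affect VM identity.
type nodeClassSpec struct {
	Location string
	VMSize   string
	SubnetID string
	Tags     map[string]string
}

// driftHash hashes the identity-affecting fields. Tag keys are sorted first
// because Go map iteration order is random, and every key/value is
// length-prefixed so adjacent fields cannot be confused with one another.
func driftHash(spec nodeClassSpec) string {
	h := sha256.New()
	write := func(k, v string) {
		fmt.Fprintf(h, "%d:%s=%d:%s\n", len(k), k, len(v), v)
	}
	write("location", spec.Location)
	write("vmSize", spec.VMSize)
	write("subnetID", spec.SubnetID)

	keys := make([]string, 0, len(spec.Tags))
	for k := range spec.Tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		write("tag/"+k, spec.Tags[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	spec := nodeClassSpec{
		Location: "eastus2",
		VMSize:   "ND96isr_H200_v5",
		Tags:     map[string]string{"team": "ml", "env": "dev"},
	}
	fmt.Println(driftHash(spec))
}
```

The returned string is what would be stored in the NodeClaim annotation and compared on later drift checks.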

Comment on lines +105 to +107
if !strings.HasPrefix(spec.SubnetID, "/subscriptions/") {
	return fmt.Errorf("subnetID %q must be a full ARM resource ID", spec.SubnetID)
}

Copilot AI Apr 22, 2026

SubnetID is documented/treated as a full ARM resource ID, but validation only checks for the /subscriptions/ prefix. This will allow many malformed IDs through and defer failures to VM create. Consider using arm.ParseResourceID (as the plugin side does) to validate the full shape and return a clearer InvalidSpec condition.
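
To illustrate what a full-shape check catches that the prefix check misses, here is a standard-library sketch that validates the segments of a subnet ID. It is a hand-rolled stand-in, not the SDK parser recommended above, and `validateSubnetID` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// subnetIDShape lists the expected path segments of a subnet ARM ID.
// "*" marks a caller-supplied name that only needs to be non-empty.
var subnetIDShape = []string{
	"", "subscriptions", "*", "resourceGroups", "*",
	"providers", "Microsoft.Network", "virtualNetworks", "*", "subnets", "*",
}

// validateSubnetID checks the segment count and every fixed segment
// (case-insensitively) instead of only the "/subscriptions/" prefix.
func validateSubnetID(id string) error {
	parts := strings.Split(id, "/")
	if len(parts) != len(subnetIDShape) {
		return fmt.Errorf("subnetID %q: expected %d path segments, got %d",
			id, len(subnetIDShape)-1, len(parts)-1)
	}
	for i, want := range subnetIDShape {
		switch {
		case want == "*":
			if parts[i] == "" {
				return fmt.Errorf("subnetID %q: segment %d is empty", id, i)
			}
		case !strings.EqualFold(parts[i], want):
			return fmt.Errorf("subnetID %q: segment %d is %q, want %q", id, i, parts[i], want)
		}
	}
	return nil
}

func main() {
	good := "/subscriptions/0000/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/vnet/subnets/default"
	fmt.Println(validateSubnetID(good))
	fmt.Println(validateSubnetID("/subscriptions/0000"))
}
```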

Comment on lines +96 to +104
// Annotate kubeadm node labels with cross-region topology hints. Note:
// region is the *target* region (the VM's region), which may differ
// from the AKS control-plane region — that's the whole point of flexvm.
kubeadmConfig := spec.GetKubeadm()
kubeadmConfig.AddNodeLabels(map[string]string{
	topology.NodeLabelKeyCloud:        "azure",
	topology.NodeLabelKeyRegion:       strings.ToLower(spec.GetLocation()),
	topology.NodeLabelKeyInstanceType: strings.ToLower(spec.GetVmSize()),
})

Copilot AI Apr 22, 2026

spec.GetKubeadm() can be nil (proto field is optional), but the code immediately calls AddNodeLabels on it. This will panic on nil input. Consider failing fast in validateSpec (or here) with an InvalidArgument error when kubeadm config is missing.

Kevin Cho and others added 9 commits April 22, 2026 13:20
AKSFlexNode v0.0.18 ships a containerd v2 binary but its template writes a
v1-schema config (only 'imports' and 'oom_score'), leaving CRI non-functional
and kubeadm join hanging at step 7/7. Manual nodes were repaired by hand;
Karpenter-provisioned nodes hit registration timeout and churn.

Regenerate /etc/containerd/config.toml via 'containerd config default' (which
produces the v3 schema that containerd v2 expects) and restart containerd before
invoking aks-flex-node apply. Mirrors the workaround applied to manual nodes.

A previous attempt regenerated the containerd config BEFORE aks-flex-node apply,
but aks-flex-node clobbered it during apply. Result: a v1-schema CNI config
(io.containerd.grpc.v1.cri.cni), which containerd 2.x silently ignores in
favor of the v3 schema (io.containerd.cri.v1.runtime.cni). bin_dir/conf_dir
end up empty and every Pod fails: 'failed to find plugin cilium-cni in path []'.

Write the canonical v3-schema config (mirrored from a working manual node)
after apply, then restart containerd. This is the same hand-fix that was
applied to manual nodes and unblocks Pod sandbox creation.

gpu-operator drops /etc/containerd/conf.d/99-nvidia.toml with
bin_dir = "" and bin_dirs = ["/opt/cni/bin"]. containerd 2.0.4
only honors bin_dir, so the empty string blanks our main-config
bin_dir and CNI plugin lookup fails with 'failed to find plugin
cilium-cni in path []'. sed-rewrite the import after aks-flex-node
apply.

Each Karpenter VM-create failure (commonly 409 quota) was leaving the
NIC behind, exhausting the subnet (~250 orphans / 8h of churn).

Root cause: the inline best-effort cleanup at line 220 reused the gRPC
ctx, which Karpenter has often already cancelled by the time the VM
PUT returns. A cancelled context causes BeginDelete to return
immediately without sending the HTTP DELETE.

Naive fix (cleanup with a fresh context inside the gRPC handler) does
not work either: ARM reserves the NIC for its target VM for 180s after
*any* CreateOrUpdate attempt — successful or not. Synchronous delete
during that window returns 400 NicReservedForAnotherVm. Blocking the
gRPC handler for 3+ minutes is also unacceptable since Karpenter
expects fast failure to back off.

Fix: spawn a detached goroutine on the failure path that sleeps out
the 180s ARM reservation window plus slack, then deletes the NIC with
retries on its own background context. The gRPC handler returns
immediately to Karpenter so its retry loop is unaffected.

Caveats documented in code:
 - Cleanup is best-effort. Pod restart abandons in-flight goroutines;
   a periodic reconciler is left as future work.
 - Sleeping-goroutine count is bounded by retry rate * cleanup window
   (observed ~7/min * 4min = ~30 max).

Validated on voice-agent-flex Sweden cluster: forced 50+ quota-blocked
retries, all NICs reaped automatically (43 cleanup-success log lines,
zero failures) and orphan count returned to zero.

- flexvm: reject allocate_public_ip=true in validateSpec (was silently
  ignored; previously fell through to a private-only NIC)
- flexvm: reject nil kubeadm spec in validateSpec (would nil-panic in
  CreateOrUpdate when calling AddNodeLabels)
- karpenter cloudprovider: stamp AzureFlexNodeClassHashAnnotation on
  NodeClaim in Create(); without this IsDrifted never triggered
- karpenter nodeclass_status: validate subnetID with arm.ParseResourceID
  instead of a prefix check (was letting malformed IDs through to VM
  create)

Persist incomplete AzureFlex agent pools after NIC creation, add typed list filtering for mixed agent pool records, and wire H200 NodeClaims through Karpenter with region-specific examples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@chokevin chokevin changed the title from "feat(karpenter): Phase 1 Azure cross-region cloudprovider (#63)" to "Harden Azure H200 Karpenter provisioning" on May 16, 2026
The updated Karpenter/Azure provider dependencies already include these patched changes, so keeping the old patch files breaks make vendor-patch in CI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2 participants