Harden Azure H200 Karpenter provisioning #64
Adds the AzureFlexNodeClass CRD that lets a Karpenter NodePool in an AKS cluster auto-provision single Azure VMs in a (potentially different) Azure region. Spec covers subscription/RG/subnet, image (marketplace ref or SIG ID), security type (Standard only), OS disk size, SSH keys, public IP toggle, max pods, and tags. Implements status.Object. Regenerates deepcopy and CRD manifests. Mirrors the new CRD into the helm chart crds/ directory. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a new gRPC-served agent pool implementation that creates a single Azure VM (not VMSS) per agent pool. VM and NIC names are deterministic from the agent pool ID so retries are idempotent. Both the NIC and OS disk are tagged DeleteOption=Delete so a single VM delete cascades the whole resource trio. Phase 1 scope: Standard security only (TrustedLaunch breaks the DSVM image), default DSVM image, gzip+base64 cloud-init via flex.UserData, no public IP by default. Auth uses the plugin process's default Azure credential — the plugin MI is expected to have Contributor on the target subscription/RG/subnet. Wires the new service alongside ubuntu2404vmss in the agentpools and instances service registries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
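The commit message above says VM and NIC names are deterministic from the agent pool ID so that retries are idempotent. A minimal sketch of that idea follows; the `flex-` prefix, the 12-character digest, the `-nic` suffix, and the `resourceNames` helper are illustrative assumptions, not the plugin's actual naming scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// resourceNames derives VM and NIC names from an agent pool ID.
// Hypothetical scheme: a short SHA-256 digest keeps names stable across
// retries and well inside Azure's resource-name length limits.
func resourceNames(agentPoolID string) (vmName, nicName string) {
	sum := sha256.Sum256([]byte(agentPoolID))
	vmName = "flex-" + hex.EncodeToString(sum[:])[:12]
	return vmName, vmName + "-nic"
}

func main() {
	// A retried Create for the same agent pool resolves to the same names,
	// so a CreateOrUpdate call targets the existing resources.
	vm1, nic1 := resourceNames("agentpools/h200-eastus2-a")
	vm2, nic2 := resourceNames("agentpools/h200-eastus2-a")
	fmt.Println(vm1 == vm2, nic1 == nic2)
	fmt.Println(vm1, nic1)
}
```

Because the names depend only on the agent pool ID, a retry after a partial failure addresses the same NIC and VM rather than creating a second pair.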
Implements a Karpenter cloudprovider ('azure-flex') that backs the new
AzureFlexNodeClass by talking to the plugin's flexvm service. Mirrors
the nebius layout: consts/log/api stubs, an instancetype subpackage
with a hardcoded Phase 1 SKU catalog (ND96isr_H200_v5, ND96amsr_A100_v4,
NC40ads_H100_v5, NC24ads_A100_v4, D8s_v5), nodeclaim conversions, and
the cloudprovider.go top-level CRUD.
ProviderID format is azure-flex:///<full-arm-id> (three slashes); the
round-trip via providerIDToARMID is lossless and the parse rejects URLs
that put anything in the host position. Drift is computed as a SHA-256
over the AzureFlexNodeClass fields that affect VM identity, with sorted
tag keys for determinism.
Quota errors from the plugin surface as InsufficientCapacityError so
Karpenter stops thrashing on a NodePool whose SKU isn't available.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
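The ProviderID round-trip described in this commit can be sketched with the standard library alone. The function names `armIDToProviderID` and `providerIDToARMID` come from the commit messages; their bodies here are an assumption about how such helpers might look, not the code in the PR.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// providerIDScheme matches the "azure-flex" scheme named in the commit message.
const providerIDScheme = "azure-flex"

// armIDToProviderID prefixes a full ARM ID (which starts with "/") so the
// result has three slashes. An empty ARM ID yields an empty provider ID.
func armIDToProviderID(armID string) string {
	if armID == "" {
		return ""
	}
	return providerIDScheme + "://" + armID
}

// providerIDToARMID reverses the conversion and rejects anything that puts
// content in the URL host position (for example only two slashes).
func providerIDToARMID(providerID string) (string, error) {
	u, err := url.Parse(providerID)
	if err != nil {
		return "", fmt.Errorf("parsing provider ID %q: %w", providerID, err)
	}
	if u.Scheme != providerIDScheme {
		return "", fmt.Errorf("provider ID %q: unexpected scheme %q", providerID, u.Scheme)
	}
	if u.Host != "" {
		return "", fmt.Errorf("provider ID %q: host %q must be empty", providerID, u.Host)
	}
	if !strings.HasPrefix(u.Path, "/subscriptions/") {
		return "", fmt.Errorf("provider ID %q: path is not an ARM ID", providerID)
	}
	return u.Path, nil
}

func main() {
	armID := "/subscriptions/0000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm-1"
	pid := armIDToProviderID(armID)
	back, err := providerIDToARMID(pid)
	fmt.Println(pid)
	fmt.Println(back == armID, err)

	// Two slashes put "subscriptions" in the host position and are rejected.
	_, err = providerIDToARMID("azure-flex://subscriptions/0000")
	fmt.Println(err != nil)
}
```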
Adds the two AzureFlexNodeClass reconcilers (status and termination), mirrored almost line-for-line from the nebius equivalents. Status controller adds a finalizer and sets the ValidationSucceeded condition based on cheap shape checks (required fields, subnet ARM-ID prefix, imageReference vs imageID mutual exclusion). Termination controller blocks NodeClass deletion until all owning NodeClaims are gone, re-emitting a WaitingOnNodeClaimTermination event every 10m. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Registers the new azure cloudprovider in the cloudproviders hub alongside aks/nebius/kaito, adds AzureFlexNodeClass to the WaitForCRDs list, and registers the two AzureFlex controllers in flexcontrollers.NewControllers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a worked example for an H200 NodePool in eastus2 backed by an AzureFlexNodeClass, with a 64-GPU limit and consolidation enabled. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- nodeclass_status: capture base before AddFinalizer; client.MergeFrom on
the mutated object produces an empty patch and silently drops the
finalizer, allowing nodeclass deletion to bypass termination cleanup.
- flexvm/agentpools Create: clean up NIC on PollUntilDone failure (NIC
was created but VM never reached a state where DeleteOption cascades),
and guard against nil vmResp.ID before deref.
- flexvm/agentpools Delete: NIC cleanup uses fresh background context so
it still runs if caller cancels mid-VM-delete.
- nodeclaim: armIDToProviderID("") returns empty rather than the invalid
'azure-flex:///' URL when status is not yet populated.
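The finalizer bug in the first bullet comes down to when the patch base is captured. The sketch below is a toy stand-in written with the standard library only: `object`, `mergePatch`, `buggyPatch`, and `fixedPatch` are hypothetical and merely imitate the shape of a merge-patch computation; they are not controller-runtime's `client.MergeFrom`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// object is a hypothetical stand-in for a Kubernetes object's metadata.
type object struct {
	Finalizers []string `json:"finalizers,omitempty"`
}

// deepCopy returns an independent snapshot of the object.
func (o object) deepCopy() object {
	return object{Finalizers: append([]string(nil), o.Finalizers...)}
}

// mergePatch is a toy diff: it emits the finalizers only if they changed
// relative to the base snapshot.
func mergePatch(base, modified object) string {
	b, _ := json.Marshal(base.Finalizers)
	m, _ := json.Marshal(modified.Finalizers)
	if string(b) == string(m) {
		return "{}"
	}
	return fmt.Sprintf(`{"finalizers":%s}`, m)
}

// buggyPatch snapshots the base after mutating, so the diff is empty.
func buggyPatch(obj *object, finalizer string) string {
	obj.Finalizers = append(obj.Finalizers, finalizer)
	base := obj.deepCopy()
	return mergePatch(base, *obj)
}

// fixedPatch snapshots the base before mutating, so the diff carries the finalizer.
func fixedPatch(obj *object, finalizer string) string {
	base := obj.deepCopy()
	obj.Finalizers = append(obj.Finalizers, finalizer)
	return mergePatch(base, *obj)
}

func main() {
	fmt.Println("buggy:", buggyPatch(&object{}, "example.invalid/termination"))
	fmt.Println("fixed:", fixedPatch(&object{}, "example.invalid/termination"))
}
```

In the buggy ordering the patch sent to the API server contains no change, which is why the finalizer was silently dropped.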
The new CRD shares the flex.aks.azure.com api group with nebius, but the controller needs explicit verb grants on the resource name. Without this the nodeclass controllers would 403 on every reconcile in a real deployment.
azcore is a direct import in the new azure cloudprovider.
Pull request overview
Adds Phase 1 Azure cross-region (“azure-flex”) support to the repo by introducing an AzureFlexNodeClass CRD plus a Karpenter cloudprovider that provisions individual Azure VMs via a colocated plugin service, enabling nodes to join an AKS control plane from any Azure region (unblocking cross-region GPU capacity workflows).
Changes:
- Introduces AzureFlexNodeClass (CRD + deepcopy/registration) and wires AzureFlex controllers (status + termination).
- Adds a new Karpenter cloudprovider (karpenter/pkg/cloudproviders/azure) including providerID helpers and a hardcoded Phase 1 instance type catalog.
- Adds a new plugin agentpool service (plugin/.../azure/flexvm) for per-VM provisioning (NIC + VM + userdata) and registers it in the plugin servers; bumps flex node userdata version to v0.0.18.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| plugin/pkg/services/agentpools/userdata/flex/flex.go | Bumps flexNodeVersion to v0.0.18 for generated userdata. |
| plugin/pkg/services/agentpools/instances.go | Registers the new Azure flexvm Instances service. |
| plugin/pkg/services/agentpools/azure/flexvm/redact.go | Adds Redact hooks for the new flexvm proto objects. |
| plugin/pkg/services/agentpools/azure/flexvm/instances.proto | Defines Instances messages for flexvm (1 VM per AgentPool). |
| plugin/pkg/services/agentpools/azure/flexvm/instances.pb.go | Generated protobuf code for flexvm instances messages. |
| plugin/pkg/services/agentpools/azure/flexvm/instances.go | Implements minimal List/Get for Instances backed by AgentPool storage. |
| plugin/pkg/services/agentpools/azure/flexvm/agentpools.proto | Defines flexvm AgentPool spec/status (subscription/RG/subnet/image/kubeadm/etc). |
| plugin/pkg/services/agentpools/azure/flexvm/agentpools.pb.go | Generated protobuf code for flexvm agentpool messages. |
| plugin/pkg/services/agentpools/azure/flexvm/agentpools.go | Implements Azure ARM NIC+VM provisioning and idempotent deletion for flexvm. |
| plugin/pkg/services/agentpools/agentpools.go | Registers the new Azure flexvm AgentPools service. |
| karpenter/pkg/controllers/controllers.go | Wires AzureFlex NodeClass status + termination controllers. |
| karpenter/pkg/controllers/azure/nodeclass_termination.go | Adds termination finalizer removal flow blocked by existing NodeClaims. |
| karpenter/pkg/controllers/azure/nodeclass_status_test.go | Adds unit tests for AzureFlex NodeClass spec validation. |
| karpenter/pkg/controllers/azure/nodeclass_status.go | Adds status controller that ensures finalizer + publishes validation condition. |
| karpenter/pkg/cloudproviders/azure/nodeclaim_test.go | Adds unit tests for providerID parsing and drift hash determinism. |
| karpenter/pkg/cloudproviders/azure/nodeclaim.go | Adds providerID helpers, NodeClaim↔AgentPool translation, and drift hashing. |
| karpenter/pkg/cloudproviders/azure/log.go | Adds provider-scoped logger helper. |
| karpenter/pkg/cloudproviders/azure/instancetype/provider.go | Hardcoded Phase 1 instance type provider + requirement resolution. |
| karpenter/pkg/cloudproviders/azure/instancetype/offerings.go | Defines Phase 1 offerings (on-demand only, empty zone requirement). |
| karpenter/pkg/cloudproviders/azure/instancetype/instancetype.go | Builds Karpenter InstanceTypes (requirements/capacity/overhead) from catalog. |
| karpenter/pkg/cloudproviders/azure/instancetype/catalog_test.go | Tests catalog presence and provider listing behavior. |
| karpenter/pkg/cloudproviders/azure/instancetype/catalog.go | Adds hardcoded SKU allowlist (H200/A100/H100 + baseline D-series). |
| karpenter/pkg/cloudproviders/azure/consts.go | Introduces ProviderIDScheme and GroupKind for AzureFlex. |
| karpenter/pkg/cloudproviders/azure/cloudprovider.go | Implements Karpenter CloudProvider CRUD + instancetype listing via plugin gRPC. |
| karpenter/pkg/cloudproviders/azure/api.go | Adds helper classifiers (NotFound/quota) for plugin/Azure error surfaces. |
| karpenter/pkg/apis/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy support for new AzureFlex API types. |
| karpenter/pkg/apis/v1alpha1/labels.go | Adds AzureFlexNodeClassHashAnnotation constant for drift tracking. |
| karpenter/pkg/apis/v1alpha1/doc.go | Registers AzureFlex types into the scheme. |
| karpenter/pkg/apis/v1alpha1/azureflex.go | Defines AzureFlexNodeClass CRD types and spec fields. |
| karpenter/pkg/apis/crds/flex.aks.azure.com_nebiusnodeclasses.yaml | Reorders/adds maxPodsPerNode schema placement (no functional change). |
| karpenter/pkg/apis/crds/flex.aks.azure.com_azureflexnodeclasses.yaml | Adds generated CRD YAML for AzureFlexNodeClass. |
| karpenter/go.mod | Promotes azcore to a direct dependency (needed by new code). |
| karpenter/examples/azure/nodepool-h200.yaml | Adds example NodePool targeting the H200 SKU via AzureFlexNodeClass. |
| karpenter/examples/azure/azureflexnodeclass-h200-eastus2.yaml | Adds example AzureFlexNodeClass for eastus2 H200 provisioning. |
| karpenter/cmd/controller/main.go | Waits for AzureFlex CRD and registers the azure-flex cloudprovider. |
| karpenter/charts/karpenter/templates/clusterrole.yaml | Grants RBAC for azureflexnodeclasses get/list/watch and status patch/update. |
| karpenter/charts/karpenter/crds/flex.aks.azure.com_azureflexnodeclasses.yaml | Adds AzureFlexNodeClass CRD to the Helm chart CRDs. |
    // Phase 1: skip explicit PIP creation; leave a TODO. Per-NodeClass
    // public IP is deferred; documented in CRD.
    // (Falls through to private-only NIC.)
    _ = nicParams // satisfy linter
allocate_public_ip is accepted but currently ignored (the NIC is still created private-only) and the code just assigns _ = nicParams to satisfy the linter. This is a silent contract violation for callers. Either implement Public IP creation/attachment here, or reject allocate_public_ip=true in validateSpec until it's supported.
    - // Phase 1: skip explicit PIP creation; leave a TODO. Per-NodeClass
    - // public IP is deferred; documented in CRD.
    - // (Falls through to private-only NIC.)
    - _ = nicParams // satisfy linter
    + pipsClient, err := armnetwork.NewPublicIPAddressesClient(spec.GetSubscriptionId(), srv.credentials, nil)
    + if err != nil {
    +     return nil, fmt.Errorf("creating public IP client: %w", err)
    + }
    + pipName := nicName + "-pip"
    + pipParams := armnetwork.PublicIPAddress{
    +     Location: to.Ptr(spec.GetLocation()),
    +     Tags:     toARMTags(spec.GetTags()),
    +     SKU: &armnetwork.PublicIPAddressSKU{
    +         Name: to.Ptr(armnetwork.PublicIPAddressSKUNameStandard),
    +     },
    +     Properties: &armnetwork.PublicIPAddressPropertiesFormat{
    +         PublicIPAllocationMethod: to.Ptr(armnetwork.IPAllocationMethodStatic),
    +     },
    + }
    + pipPoller, err := pipsClient.BeginCreateOrUpdate(ctx, spec.GetResourceGroup(), pipName, pipParams, nil)
    + if err != nil {
    +     return nil, fmt.Errorf("creating public IP %q: %w", pipName, err)
    + }
    + pipResp, err := pipPoller.PollUntilDone(ctx, nil)
    + if err != nil {
    +     return nil, fmt.Errorf("polling public IP creation %q: %w", pipName, err)
    + }
    + if pipResp.ID == nil {
    +     return nil, fmt.Errorf("public IP %q created without resource ID", pipName)
    + }
    + nicParams.Properties.IPConfigurations[0].Properties.PublicIPAddress = &armnetwork.PublicIPAddress{
    +     ID: pipResp.ID,
    + }
        return nil, fmt.Errorf("creating azure-flex agent pool: %w", err)
    }

    return agentPoolToNodeClaim(created, it), nil
IsDrifted compares the current NodeClass hash to nodeClaim.Annotations[AzureFlexNodeClassHashAnnotation], but this annotation is never set when creating/updating a NodeClaim. As written, drift detection will never trigger. Consider setting the annotation (e.g., to driftHash(nodeClass.Spec)) on the NodeClaim you return from Create so it persists for later drift checks.
    - return agentPoolToNodeClaim(created, it), nil
    + createdNodeClaim := agentPoolToNodeClaim(created, it)
    + if createdNodeClaim.Annotations == nil {
    +     createdNodeClaim.Annotations = map[string]string{}
    + }
    + createdNodeClaim.Annotations[AzureFlexNodeClassHashAnnotation] = driftHash(nodeClass.Spec)
    + return createdNodeClaim, nil
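To make the drift discussion concrete, here is a hedged sketch of a SHA-256 drift hash with sorted tag keys, plus a check that only fires when the annotation was stamped. The `nodeClassSpec` fields, the `hashAnnotation` key, and `isDrifted` are hypothetical stand-ins; only the name `driftHash` and the sorted-keys approach come from the PR text.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// nodeClassSpec holds an assumed subset of fields that affect VM identity.
type nodeClassSpec struct {
	SubscriptionID string
	ResourceGroup  string
	SubnetID       string
	ImageID        string
	OSDiskSizeGB   int32
	Tags           map[string]string
}

// hashAnnotation is a placeholder key, not the real annotation constant.
const hashAnnotation = "example.invalid/azureflexnodeclass-hash"

// driftHash hashes the identity fields; tag keys are sorted so that Go's
// random map iteration order cannot change the digest.
func driftHash(spec nodeClassSpec) string {
	h := sha256.New()
	fmt.Fprintf(h, "subscription=%q\nresourceGroup=%q\nsubnet=%q\nimage=%q\nosDisk=%d\n",
		spec.SubscriptionID, spec.ResourceGroup, spec.SubnetID, spec.ImageID, spec.OSDiskSizeGB)
	keys := make([]string, 0, len(spec.Tags))
	for k := range spec.Tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(h, "tag %q=%q\n", k, spec.Tags[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// isDrifted reports drift only when a hash was stamped at create time.
// Without the annotation it returns false, which is the failure mode
// described in the review comment.
func isDrifted(annotations map[string]string, spec nodeClassSpec) bool {
	stamped, ok := annotations[hashAnnotation]
	return ok && stamped != driftHash(spec)
}

func main() {
	spec := nodeClassSpec{SubnetID: "subnet-a", OSDiskSizeGB: 128, Tags: map[string]string{"team": "ml", "env": "prod"}}
	stamped := map[string]string{hashAnnotation: driftHash(spec)}

	spec.OSDiskSizeGB = 256 // the NodeClass changes after the NodeClaim was created
	fmt.Println("with annotation:", isDrifted(stamped, spec))
	fmt.Println("without annotation:", isDrifted(map[string]string{}, spec))
}
```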
    if !strings.HasPrefix(spec.SubnetID, "/subscriptions/") {
        return fmt.Errorf("subnetID %q must be a full ARM resource ID", spec.SubnetID)
    }
SubnetID is documented/treated as a full ARM resource ID, but validation only checks for the /subscriptions/ prefix. This will allow many malformed IDs through and defer failures to VM create. Consider using arm.ParseResourceID (as the plugin side does) to validate the full shape and return a clearer InvalidSpec condition.
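As an illustration of why the prefix check is too weak, the sketch below validates the full segment shape of a subnet ARM ID using only the standard library. It is a stand-in for the `arm.ParseResourceID` call the comment recommends, not a replacement for it; `validateSubnetID` and `subnetIDShape` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// subnetIDShape lists the expected fixed segments of a subnet ARM ID;
// empty entries mark positions that hold a caller-supplied name.
var subnetIDShape = []string{
	"subscriptions", "", "resourceGroups", "", "providers",
	"Microsoft.Network", "virtualNetworks", "", "subnets", "",
}

// validateSubnetID checks segment count, fixed segments (case-insensitively,
// since ARM IDs are case-insensitive), and that every name is non-empty.
func validateSubnetID(id string) error {
	if !strings.HasPrefix(id, "/") {
		return fmt.Errorf("subnetID %q must start with '/'", id)
	}
	segments := strings.Split(id[1:], "/")
	if len(segments) != len(subnetIDShape) {
		return fmt.Errorf("subnetID %q has %d segments, want %d", id, len(segments), len(subnetIDShape))
	}
	for i, want := range subnetIDShape {
		switch {
		case want == "" && segments[i] == "":
			return fmt.Errorf("subnetID %q has an empty name at segment %d", id, i)
		case want != "" && !strings.EqualFold(segments[i], want):
			return fmt.Errorf("subnetID %q: segment %d is %q, want %q", id, i, segments[i], want)
		}
	}
	return nil
}

func main() {
	good := "/subscriptions/0000/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/vnet/subnets/gpu"
	fmt.Println(validateSubnetID(good))
	// Passes a bare prefix check but is not a subnet ID.
	fmt.Println(validateSubnetID("/subscriptions/0000/resourceGroups/rg"))
}
```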
    // Annotate kubeadm node labels with cross-region topology hints. Note:
    // region is the *target* region (the VM's region), which may differ
    // from the AKS control-plane region; that's the whole point of flexvm.
    kubeadmConfig := spec.GetKubeadm()
    kubeadmConfig.AddNodeLabels(map[string]string{
        topology.NodeLabelKeyCloud:        "azure",
        topology.NodeLabelKeyRegion:       strings.ToLower(spec.GetLocation()),
        topology.NodeLabelKeyInstanceType: strings.ToLower(spec.GetVmSize()),
    })
spec.GetKubeadm() can be nil (proto field is optional), but the code immediately calls AddNodeLabels on it. This will panic on nil input. Consider failing fast in validateSpec (or here) with an InvalidArgument error when kubeadm config is missing.
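A fail-fast guard of the kind suggested here (and for the unsupported allocate_public_ip flag raised in the earlier comment) might look like the sketch below. The `agentPoolSpec` and `kubeadmConfig` types and the `errInvalidArgument` sentinel are hypothetical stand-ins for the generated proto types and a gRPC InvalidArgument status.

```go
package main

import (
	"errors"
	"fmt"
)

// kubeadmConfig and agentPoolSpec are simplified stand-ins for the proto messages.
type kubeadmConfig struct {
	NodeLabels map[string]string
}

type agentPoolSpec struct {
	AllocatePublicIP bool
	Kubeadm          *kubeadmConfig
}

// errInvalidArgument stands in for a gRPC InvalidArgument status.
var errInvalidArgument = errors.New("invalid argument")

// validateSpec rejects inputs the provisioning path cannot honor, so that
// callers get an explicit error instead of a panic or a silently ignored flag.
func validateSpec(spec agentPoolSpec) error {
	if spec.AllocatePublicIP {
		return fmt.Errorf("%w: allocate_public_ip is not supported yet", errInvalidArgument)
	}
	if spec.Kubeadm == nil {
		return fmt.Errorf("%w: kubeadm config is required", errInvalidArgument)
	}
	return nil
}

func main() {
	fmt.Println(validateSpec(agentPoolSpec{}))
	fmt.Println(validateSpec(agentPoolSpec{AllocatePublicIP: true, Kubeadm: &kubeadmConfig{}}))
	fmt.Println(validateSpec(agentPoolSpec{Kubeadm: &kubeadmConfig{}}))
}
```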
AKSFlexNode v0.0.18 ships a containerd v2 binary but its template writes a v1-schema config (only 'imports' and 'oom_score'), leaving CRI non-functional and kubeadm join hanging at step 7/7. Manual nodes were repaired by hand; Karpenter-provisioned nodes hit registration timeout and churn. Regenerate /etc/containerd/config.toml via 'containerd config default' (which produces the v3 schema v2 expects) and restart containerd before invoking aks-flex-node apply. Mirrors the workaround applied to manual nodes.
Previous attempt regenerated containerd config BEFORE aks-flex-node, but aks-flex-node clobbered it during apply. Result: a v1-schema CNI config (io.containerd.grpc.v1.cri.cni) which containerd 2.x silently ignores in favor of the v3 schema (io.containerd.cri.v1.runtime.cni). bin_dir/conf_dir end up empty, every Pod fails: 'failed to find plugin cilium-cni in path []'. Write the canonical v3-schema config (mirrored from a working manual node) after apply, then restart containerd. This is the same hand-fix that was applied to manual nodes and unblocks Pod sandbox creation.
gpu-operator drops /etc/containerd/conf.d/99-nvidia.toml with bin_dir = "" and bin_dirs = ["/opt/cni/bin"]. containerd 2.0.4 only honors bin_dir, so the empty string blanks our main-config bin_dir and CNI plugin lookup fails with 'failed to find plugin cilium-cni in path []'. sed-rewrite the import after aks-flex-node apply.
Each Karpenter VM-create failure (commonly 409 quota) was leaving the NIC behind, exhausting the subnet (~250 orphans / 8h of churn).

Root cause: the inline best-effort cleanup at line 220 reused the gRPC ctx, which Karpenter has often already cancelled by the time the VM PUT returns. A cancelled context causes BeginDelete to return immediately without sending the HTTP DELETE.

Naive fix (cleanup with a fresh context inside the gRPC handler) does not work either: ARM reserves the NIC for its target VM for 180s after *any* CreateOrUpdate attempt, successful or not. Synchronous delete during that window returns 400 NicReservedForAnotherVm. Blocking the gRPC handler for 3+ minutes is also unacceptable since Karpenter expects fast failure to back off.

Fix: spawn a detached goroutine on the failure path that sleeps out the 180s ARM reservation window plus slack, then deletes the NIC with retries on its own background context. The gRPC handler returns immediately to Karpenter so its retry loop is unaffected.

Caveats documented in code:
- Cleanup is best-effort. Pod restart abandons in-flight goroutines; a periodic reconciler is left as future work.
- Sleeping-goroutine count is bounded by retry rate * cleanup window (observed ~7/min * 4min = ~30 max).

Validated on voice-agent-flex Sweden cluster: forced 50+ quota-blocked retries, all NICs reaped automatically (43 cleanup-success log lines, zero failures) and orphan count returned to zero.
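The detached-cleanup pattern described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions: `nicDeleter`, `cleanupConfig`, `scheduleNICCleanup`, and `flakyDeleter` are hypothetical names, the ARM delete call is abstracted behind a function value, and the durations in `main` are shortened so the example finishes quickly (the commit uses the 180s reservation window plus slack).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nicDeleter abstracts the ARM NIC delete call (hypothetical).
type nicDeleter func(ctx context.Context, nicName string) error

// cleanupConfig holds the timing knobs for the background cleanup.
type cleanupConfig struct {
	ReservationWindow time.Duration // wait out the ARM reservation first
	RetryInterval     time.Duration
	AttemptTimeout    time.Duration
	MaxAttempts       int
}

// scheduleNICCleanup returns immediately; the delete runs in a detached
// goroutine on its own background context, so a cancelled request context
// cannot abort it. The returned channel reports the final outcome.
func scheduleNICCleanup(nicName string, del nicDeleter, cfg cleanupConfig) <-chan error {
	done := make(chan error, 1)
	go func() {
		time.Sleep(cfg.ReservationWindow)
		var lastErr error
		for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
			ctx, cancel := context.WithTimeout(context.Background(), cfg.AttemptTimeout)
			lastErr = del(ctx, nicName)
			cancel()
			if lastErr == nil {
				done <- nil
				return
			}
			if attempt < cfg.MaxAttempts {
				time.Sleep(cfg.RetryInterval)
			}
		}
		done <- fmt.Errorf("deleting NIC %q after %d attempts: %w", nicName, cfg.MaxAttempts, lastErr)
	}()
	return done
}

// flakyDeleter fails the first `failures` calls, imitating a NIC that is
// still reserved, and counts how often it was invoked.
func flakyDeleter(failures int, calls *int) nicDeleter {
	return func(ctx context.Context, nicName string) error {
		*calls++
		if *calls <= failures {
			return errors.New("NicReservedForAnotherVm")
		}
		return ctx.Err()
	}
}

func main() {
	calls := 0
	cfg := cleanupConfig{
		ReservationWindow: 20 * time.Millisecond,
		RetryInterval:     5 * time.Millisecond,
		AttemptTimeout:    time.Second,
		MaxAttempts:       5,
	}
	done := scheduleNICCleanup("flex-example-nic", flakyDeleter(2, &calls), cfg)
	fmt.Println("handler returned; cleanup continues in the background")
	fmt.Println("cleanup error:", <-done, "attempts:", calls)
}
```

As the caveats note, goroutines like this do not survive a process restart, which is why the commit leaves a periodic reconciler as future work.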
- flexvm: reject allocate_public_ip=true in validateSpec (was silently ignored; previously fell through to a private-only NIC)
- flexvm: reject nil kubeadm spec in validateSpec (would nil-panic in CreateOrUpdate when calling AddNodeLabels)
- karpenter cloudprovider: stamp AzureFlexNodeClassHashAnnotation on NodeClaim in Create(); without this IsDrifted never triggered
- karpenter nodeclass_status: validate subnetID with arm.ParseResourceID instead of a prefix check (was letting malformed IDs through to VM create)
Persist incomplete AzureFlex agent pools after NIC creation, add typed list filtering for mixed agent pool records, and wire H200 NodeClaims through Karpenter with region-specific examples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The updated Karpenter/Azure provider dependencies already include these patched changes, so keeping the old patch files breaks make vendor-patch in CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
Hardens AzureFlex H200 provisioning through Karpenter by persisting incomplete agent pools after NIC creation, filtering mixed protobuf Any list responses by concrete AgentPool type, and delaying List-driven cleanup of fresh incomplete AzureFlex records.
This also updates the Karpenter integration to the compatible Azure provider/Karpenter versions and adds H200 eastus2/eastus2euap examples so both regions can be capacity-hunted independently.
Why
Azure can create or reserve NIC/disk resources before VM creation fails for H200 quota or physical capacity. Without durable incomplete records and safe cleanup, those failed attempts can leak NICs; without type-filtered list handling, mixed provider records can break Karpenter garbage collection.
Non-goals
Testing
- cd plugin && go test -count=1 ./...
- cd karpenter && go test -count=1 ./...
- cd cli && go test ./...
- helm template karpenter karpenter/charts/karpenter --namespace karpenter
- kubectl apply --dry-run=server -f karpenter/examples/azure/nodepool-h200.yaml
- kubectl apply --dry-run=server -f karpenter/examples/azure/h200_deployment.yaml
- kubectl apply --dry-run=server -f karpenter/examples/azure/azureflexnodeclass-h200-eastus2euap.yaml
- kubectl apply --dry-run=server -f karpenter/examples/azure/nodepool-h200-eastus2euap.yaml

Risk
Main risk is AzureFlex cleanup being too aggressive or too conservative. The mitigation is a 30-minute grace period before List-driven cleanup of incomplete records, while Create failure cleanup still targets known failed NodeClaims. Rollback is reverting the Karpenter/plugin changes and redeploying the previous controller image.
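The 30-minute grace period can be expressed as a small predicate. The sketch below assumes a record shape (`agentPoolRecord` with `CreatedAt` and `Complete`) and a helper name (`shouldReap`) that are not taken from the PR; only the 30-minute window is.

```go
package main

import (
	"fmt"
	"time"
)

// incompleteGracePeriod is the 30-minute window named in the risk section.
const incompleteGracePeriod = 30 * time.Minute

// agentPoolRecord is a hypothetical stored record for an agent pool.
type agentPoolRecord struct {
	ID        string
	CreatedAt time.Time
	Complete  bool // false while only the NIC (not the VM) exists
}

// shouldReap reports whether List-driven cleanup may delete the record:
// complete records are never reaped here, and incomplete ones only once
// they are older than the grace period.
func shouldReap(rec agentPoolRecord, now time.Time) bool {
	if rec.Complete {
		return false
	}
	return now.Sub(rec.CreatedAt) >= incompleteGracePeriod
}

func main() {
	now := time.Now()
	fresh := agentPoolRecord{ID: "fresh", CreatedAt: now.Add(-5 * time.Minute)}
	stale := agentPoolRecord{ID: "stale", CreatedAt: now.Add(-45 * time.Minute)}
	fmt.Println(fresh.ID, shouldReap(fresh, now))
	fmt.Println(stale.ID, shouldReap(stale, now))
}
```

Keeping the decision in one pure function makes the trade-off between "too aggressive" and "too conservative" a single constant to tune.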