feat(mgmt-pipeline): self-heal NRP-KVS-corrupted system pool (AROSLSRE-951)#5397
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds an automated, detection-gated remediation step to the management EV2 pipeline to recover AKS management clusters whose system nodepool VMSS updates are continuously failing due to NRP key-value-store corruption, by recreating the system pool ahead of the main cluster ARM deployment.
Changes:
- Introduces a new Go-based
recreate-system-poolutility that evaluates safety guards and, when triggered, drains and recreates thesystempool via ARM/SDK operations plus targetedkubectlactions. - Adds extensive unit tests for the utility’s pure-logic components (env parsing, guard evaluation, snapshot sanitization, activity-log parsing, kubeconfig/token handling).
- Updates
dev-infrastructure/mgmt-pipeline.yamlto build/package the binary and run it as a Shell step before the cluster ARM step, and wires the module intogo.work.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
go.work |
Adds the new recreate-system-pool Go module to the workspace. |
dev-infrastructure/scripts/recreate-system-pool/main.go |
Implements the detection-gated remediation flow: guard checks, snapshot/sanitize, temporary pool creation, drain/delete/recreate, and tag reconcile. |
dev-infrastructure/scripts/recreate-system-pool/main_test.go |
Adds unit tests for config parsing, guards, sanitization, activity-log parsing, kube helpers, and error classification. |
dev-infrastructure/scripts/recreate-system-pool/go.mod |
Defines the module and dependencies (Azure SDK + Kubernetes client libraries). |
dev-infrastructure/scripts/recreate-system-pool/go.sum |
Captures dependency checksums for the new module. |
dev-infrastructure/mgmt-pipeline.yaml |
Builds the binary in the pipeline buildStep and adds a pre-cluster Shell step to execute it (no-op unless guards fire). |
ce8567a to
bf07f1b
Compare
bf07f1b to
6454335
Compare
6454335 to
96588d0
Compare
96588d0 to
d9edc55
Compare
|
@copilot no action needed. Addressed all 10 points in the force-push (
Not applied (with reason):
Other tightening in the same push:
|
…E-924)
Adds a detection-gated EV2 Shell step that recreates the AKS system
pool when the NRP key-value-store entity for its VMSS gets corrupted.
The same recipe was applied manually at INT on 2026-05-24
(AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.
## Background
A corrupted NRP KVS entry for the system pool's VMSS causes every
Microsoft.Compute/virtualMachineScaleSets/write to fail with
NetworkingInternalOperationError on a continuous retry chain. Fresh
VM instances come up but never get a Swift NIC, kubelet never
registers, the pool stops scaling, and the cluster's upgrade LRO
retries forever - the AROSLSRE-880 / INT (2026-05-16..18) incident
left the cluster stuck in Updating for days because of this.
The corruption is bound to the VMSS ARM resource ID; per-instance
delete does not help. Deleting and re-creating the pool yields a
fresh VMSS name and a clean KVS entity. NRP-side fix is tracked
in ICM 798003653; once it ships, this binary's detection guards
never fire and the step becomes a no-op.
## Placement
The step runs BEFORE the cluster ARM step (depending on the same
cert-issuer prereqs), so when guards fire the recreate happens
before the next cluster PUT, preventing the pipeline from getting
stuck. The cluster ARM step now depends on this step.
The binary tolerates a not-yet-created cluster (greenfield rollout):
it does an ARM Get; if 404, logs and exits 0 with no action. No
`aksCluster:` directive is used (which would fail on greenfield);
instead the binary bootstraps its own kubeconfig from
ListClusterUserCredentials + a bearer token issued by the MSI
scoped to the AKS AAD server app, and exports KUBECONFIG so child
kubectl invocations work too. No `kubelogin` dependency.
## Detection (ALL guards must pass; otherwise exit 0 no-op)
1. system pool Ready k8s nodes < minCount
2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on
aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
3. cluster provisioningState is recoverable: Succeeded, Canceled,
Failed (settled) OR Updating, Upgrading (mid-LRO - the wedge
signature itself). Rejected: Creating, Deleting, unknown.
4. every non-system pool has count > 0
5. system pool provisioningState == "Failed" - positive
confirmation that this specific pool is wedged
## Action (once guards pass)
1. Snapshot system pool ARM JSON
2. Abort cluster LRO ONLY if active LRO is >= 30 min old (the
NRP-KVS retry storm signature). If younger, no-op exit to
avoid racing a healthy in-progress operation. AROSLSRE-924
manual recipe required this step to move the cluster from
stuck-Updating to Canceled.
3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted,
same VMSize/subnets as live system)
4. Cordon + drain existing system nodes
5. Delete the broken system pool
6. Re-create system via SDK CreateOrUpdate from the sanitized
snapshot (strips read-only fields and aks-managed-* tags,
pins orchestratorVersion to the live CP version)
7. Drain + delete systmp
8. No-op tag PATCH to flip cluster Canceled -> Succeeded
## Safety
- DefaultAzureCredential chain (MSI in EV2, az CLI locally).
- sanitizeForRecreate deep-copies via JSON round-trip; never
mutates the snapshot.
- snapshotSystem refuses to act if VMSize or VnetSubnetID are
missing from the live pool.
- maybeAbortLRO returns (proceed=false, no err) when LRO is
younger than 30 min, so the binary exits 0 (not an error)
rather than racing a potentially-healthy operation.
- preflightChecks refuses to act if a leftover 'systmp' exists.
- Guard 2 fails closed if the activity-log query errors (so a
missing Reader role on the node RG yields a no-op, not a
runaway recreate).
- Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
- Overall 60-min context timeout.
- DRY_RUN=true lets operators verify guard behaviour without
making any writes.
## Testing
- 100+ unit test cases covering all pure-logic functions:
env parsing, guard primitives (evalGuard1..5 - including
guard 3's acceptance of stuck-Updating clusters and guard 5
for system pool Failed state), sanitizeForRecreate
(no-mutation, tag stripping, version pin),
buildSystmpAgentPool (defensive nil checks),
activity-log parsing (dedup, case insensitivity, prefix
filter), isNodeReady (nil/missing/false conditions),
isNotFoundErr (404 vs other status codes, wrapped errors),
extractAPIServerAndCA (happy path, empty input, malformed
yaml, missing fields), kubeconfigWithBearerToken (no exec
plugin or auth provider required).
- go vet, gofmt clean.
- validate-changed-config-pipelines passes.
## References
- PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
- Pipeline step pattern: Azure#4790
- Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev
envs on macOS-arm build natively too)
- Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask),
AROSLSRE-924 (INT manual mitigation), AROSLSRE-880 (parent
incident bug)
- ICM: 798003653
d9edc55 to
6de8d8b
Compare
…2 (AROSLSRE-924) Address Copilot review (PR Azure#5397, batch 3): * Guard 2 (countNRPFailures + nrpResourceIDs): require the activity-log Failed VMSS-write event to carry the NetworkingInternalOperationError inner error code in properties.statusMessage before counting toward the threshold. Other failure modes (quota, capacity, policy, image pull) on the same aks-system-* VMSS no longer satisfy guard 2 and cannot trigger a destructive pool recreation that would not address their actual root cause. * Guard 5 PASS log: print the observed provisioningState (Failed, Canceled, Updating, Upgrading) instead of hard-coding "is Failed". Operators reading the log now see which accepted state matched. Implementation: * Add nrpKVSErrorCode const with the ARM inner error code. * Extend activityEvent with Properties.StatusMessage (the inner ARM error body as an embedded JSON string). * Add hasNRPKVSSignature(e) helper that parses statusMessage and returns true iff error.code == NetworkingInternalOperationError. Fails closed on any parse error / missing field. * Apply the signature filter inside both countNRPFailures and nrpResourceIDs so the diagnostic list matches the count. * Update the guard-2 progress log line and the top-of-file guard-2 doc comment to call out the NRP-KVS signature. Tests: * mkActivityEvent now emits a realistic properties.statusMessage with the NRP-KVS code by default so all pre-existing tests stay green; callers can pass a different code to simulate other failure modes. * New tests: TestCountNRPFailures_RequiresNRPKVSSignature, TestCountNRPFailures_MissingPropertiesNotCounted, TestCountNRPFailures_MalformedStatusMessageNotCounted, TestNRPResourceIDs_RequiresNRPKVSSignature. * go vet + go test all green in dev-infrastructure/scripts/recreate-system-pool.
…2 (AROSLSRE-924) Address Copilot review (PR Azure#5397, batch 3): * Guard 2 (countNRPFailures + nrpResourceIDs): require the activity-log Failed VMSS-write event to carry the NetworkingInternalOperationError inner error code in properties.statusMessage before counting toward the threshold. Other failure modes (quota, capacity, policy, image pull) on the same aks-system-* VMSS no longer satisfy guard 2 and cannot trigger a destructive pool recreation that would not address their actual root cause. * Guard 5 PASS log: print the observed provisioningState (Failed, Canceled, Updating, Upgrading) instead of hard-coding "is Failed". Operators reading the log now see which accepted state matched. Implementation: * Add nrpKVSErrorCode const with the ARM inner error code. * Extend activityEvent with Properties.StatusMessage (the inner ARM error body as an embedded JSON string). * Add hasNRPKVSSignature(e) helper that parses statusMessage and returns true iff error.code == NetworkingInternalOperationError. Fails closed on any parse error / missing field. * Apply the signature filter inside both countNRPFailures and nrpResourceIDs so the diagnostic list matches the count. * Update the guard-2 progress log line and the top-of-file guard-2 doc comment to call out the NRP-KVS signature. Tests: * mkActivityEvent now emits a realistic properties.statusMessage with the NRP-KVS code by default so all pre-existing tests stay green; callers can pass a different code to simulate other failure modes. * New tests: TestCountNRPFailures_RequiresNRPKVSSignature, TestCountNRPFailures_MissingPropertiesNotCounted, TestCountNRPFailures_MalformedStatusMessageNotCounted, TestNRPResourceIDs_RequiresNRPKVSSignature. * go vet + go test all green in dev-infrastructure/scripts/recreate-system-pool.
ad64d13 to
89e3196
Compare
…2 (AROSLSRE-924) Address Copilot review (PR Azure#5397, batch 3): * Guard 2 (countNRPFailures + nrpResourceIDs): require the activity-log Failed VMSS-write event to carry the NetworkingInternalOperationError inner error code in properties.statusMessage before counting toward the threshold. Other failure modes (quota, capacity, policy, image pull) on the same aks-system-* VMSS no longer satisfy guard 2 and cannot trigger a destructive pool recreation that would not address their actual root cause. * Guard 5 PASS log: print the observed provisioningState (Failed, Canceled, Updating, Upgrading) instead of hard-coding "is Failed". Operators reading the log now see which accepted state matched. Implementation: * Add nrpKVSErrorCode const with the ARM inner error code. * Extend activityEvent with Properties.StatusMessage (the inner ARM error body as an embedded JSON string). * Add hasNRPKVSSignature(e) helper that parses statusMessage and returns true iff error.code == NetworkingInternalOperationError. Fails closed on any parse error / missing field. * Apply the signature filter inside both countNRPFailures and nrpResourceIDs so the diagnostic list matches the count. * Update the guard-2 progress log line and the top-of-file guard-2 doc comment to call out the NRP-KVS signature. Tests: * mkActivityEvent now emits a realistic properties.statusMessage with the NRP-KVS code by default so all pre-existing tests stay green; callers can pass a different code to simulate other failure modes. * New tests: TestCountNRPFailures_RequiresNRPKVSSignature, TestCountNRPFailures_MissingPropertiesNotCounted, TestCountNRPFailures_MalformedStatusMessageNotCounted, TestNRPResourceIDs_RequiresNRPKVSSignature. * go vet + go test all green in dev-infrastructure/scripts/recreate-system-pool.
89e3196 to
8d19fc0
Compare
|
Update after re-checking INT behavior: Ready system nodes < minCount is no longer a hard guard. That condition only appeared in INT after we manually triggered extra scale-up/surge attempts; the normal stuck-upgrade wedge can still have Ready nodes at minCount while the system pool/cluster LRO is stuck Updating. The binary now logs ready/registered system-node counts as diagnostics only and gates on the four stronger signals: NRP-KVS activity-log signature, recoverable cluster state, non-system pools count > 0, and system pool state in Failed/Canceled/Updating/Upgrading. |
…-924)
Add SKIP_GUARDS env var that bypasses the 4 detection guards,
allowing the recreation flow to be tested on healthy clusters.
Combined with DRY_RUN=true it logs what would happen; without
DRY_RUN it performs the actual pool recreation.
Validated on personal dev env (pers-usw3rael-mgmt-1):
SKIP_GUARDS=true DRY_RUN=true → full plumbing test, exit 0
SKIP_GUARDS=true → full recreation in ~10 min,
system VMSS 20157742 → 30104029, cluster healthy after.
6b2ce0c to
d7a0ff1
Compare
|
/retest |
Testing run 2: full recreation with latest commits (fdfe771)Re-tested after the 3 new fixes (narrow activity guard, ignore cordoned nodes, retry activity log auth). Binary built from Timeline (~9 min total)
Azure Activity Log(systmp delete completed at 13:58:20 per binary logs) Before / After
Reconcile tagImprovements observed vs run 1
Run 1 vs Run 2 VMSS progressionEach recreation yields a fresh VMSS name, confirming the KVS entity rotation works correctly. |
|
How will you put this into ev2 artifact? |
|
@janboll the binary is built into the EV2 artifact by the That step runs during artifact/image build: CGO_ENABLED=0 go build -o "${tmp2}" ./scripts/recreate-system-pool
chmod 0755 "${tmp2}"
mv "${tmp2}" ./scripts/recreate-system-pool/recreate-system-poolThen the EV2 Shell step executes it from the rollout artifact root: command: ./scripts/recreate-system-pool/recreate-system-pool
workingDir: .So we do not commit the compiled binary; the mgmt pipeline artifact build compiles it and places it under |
|
accepted under the premise that we will salvage the idea of this tool into something sustainably maintainable asap once it did its duty |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: geoberle, raelga The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Testing run 3: with diagnostic shellouts removed (77d8ce4)Re-tested after Timeline (~10 min total)
Azure Activity LogBefore / After
Reconcile tagImprovements in this run
VMSS progression across all runs |
bennerv
left a comment
There was a problem hiding this comment.
Couple comments.
I have more feedback but these comments are more blocking or attention. Long-term, this implementation needs some refactoring if we decide to keep it, but knowing this resolves an ongoing incident I'm okay with moving forward to get the issue resolved first to mitigate the issue.
|
/override ci/prow/e2e-parallel E2E provision step passed — |
|
@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/override ci/prow/e2e-parallel E2E provision step passed — recreate-system-pool-if-broken ran successfully in 1s (greenfield no-op, cluster not yet created at that pipeline stage). E2E tests are unaffected by this PR since the binary only acts when NRP-KVS corruption is detected, and the new pipeline step runs before the cluster ARM deployment. |
|
@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/override ci/prow/e2e-parallel E2E provision step passed — recreate-system-pool-if-broken ran successfully in 1s (greenfield no-op, cluster not yet created at that pipeline stage). E2E tests are unaffected by this PR since the binary only acts when NRP-KVS corruption is detected, and the new pipeline step runs before the cluster ARM deployment. |
|
@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
AROSLSRE-951 (story) · AROSLSRE-952 (subtask) · AROSLSRE-880 (parent incident bug) · ICM 798003653
What
Adds a detection-gated EV2 Shell step (Go binary) that runs before the
clusterARM step on every Management Cluster Rollout. When the AKSsystempool's VMSS is wedged by NRP-KVS corruption, it aborts the long-running cluster LRO when safe, creates a temporarysystmpsystem pool, deletes/recreates thesystempool to get a fresh VMSS/KVS entity, drains/deletessystmp, and reconciles the cluster tags. It exits0no-op when the cluster is healthy or does not exist yet (greenfield).Why
The AROSLSRE-880 INT incident (2026-05-16..18) left the mgmt cluster stuck in
Updatingfor days because everyvirtualMachineScaleSets/writeon the system pool's VMSS failed withNetworkingInternalOperationErroron a continuous retry chain. The corruption is bound to the VMSS ARM resource ID, so per-instance delete is useless — only recreating the pool (and thus the VMSS) clears it. The recipe was applied manually at INT under AROSLSRE-924; this PR automates it for stg/prod so the daily mgmt-pipeline self-heals instead of paging on-call.Four guards must ALL fire for the binary to act (else exit 0 no-op):
The number of Ready system nodes is logged as a diagnostic, but it is not a hard guard. In INT,
Ready < minCountonly appeared after we manually triggered extra scale-up/surge attempts; the normal stuck-upgrade form can have Ready nodes still atminCountwhile the system pool/cluster LRO is wedged.Testing
systmpclone/post-processing, activity-log parsing including NRP-KVS signature filtering, LRO-age activity-log parsing, Ready-node waiting, and no-op/execute remediation flow with fake clients.SKIP_GUARDScoverage exercises the full remediation path without fabricating an NRP failure storm; the mgmt pipeline does not set this env var.make verify,make lint,make validate-changed-config-pipelines— all green locally.az account showdependency removed;SUBSCRIPTION_IDnow comes from an ARMsubscription-outputstep.kubectldependency removed for cordon/drain; drain uses the client-go drain helper with the sessiongate dynamic AKS REST config.Special notes for your reviewer
sessiongate/pkg/mc.GetAKSRESTConfig, so client-go requests use the shared dynamic Azure-token transport with token refresh.kubectlshellouts remain. Activity Log detection usesarmmonitor.ActivityLogsClient, LRO abort uses SDKManagedClustersClient.BeginAbortLatestOperation, tag reconcile usesarmresources.TagsClient.UpdateAtScope, and pre/post-flight diagnostics use SDK/client-go logs.Microsoft.Compute/virtualMachineScaleSets/writeevents with the NRP-KVS signature; VMSS deletes and child-resource writes do not satisfy the destructive remediation gate.AuthorizationFailed/LinkedAuthorizationFailedresponses retry with bounded backoff inside the Go binary to tolerate RBAC propagation after the Reader assignment; other query failures still fail closed.systemnodes cannot satisfy the recreated-pool wait.aks-previewdependency: stuck-LRO age comes from Activity LogStartedmanagedClusters/writeevents.systmp, so it exits no-op if the wedge recovered.systmpis built from the same JSON-roundtrip sanitized live-pool clone assystemrecreation, then post-processed only for intentional temporary-pool differences (Count=1, autoscaler fields cleared, purpose tag); taints are inherited from the live system pool clone.k8s.io/kubectl/pkg/drainin-process. Cordon failure is fatal;Force=trueis enabled to match the later authoritative nodepool deletion path.log/slogJSON to stderr (component=recreate-system-pool,phase=STEP Non banners), same shape asfrontend/backend/admin-serverso Geneva ships it to Kusto without extra wiring.GOOS/GOARCHso dev envs on macOS-arm work too).globalMSIIdreceives subscription Reader viapipeline-msi-reader-permissionsso guard 1 can read AKS node RG activity logs.buildStepcompiles./scripts/recreate-system-poolinto./scripts/recreate-system-pool/recreate-system-poolduring artifact build; the Shell step runs that generated binary from the rollout artifact root.PR Checklist