feat(mgmt-pipeline): self-heal NRP-KVS-corrupted system pool (AROSLSRE-951) by raelga · Pull Request #5397 · Azure/ARO-HCP

raelga · 2026-05-26T20:30:34Z

AROSLSRE-951 (story) · AROSLSRE-952 (subtask) · AROSLSRE-880 (parent incident bug) · ICM 798003653

What

Adds a detection-gated EV2 Shell step (Go binary) that runs before the cluster ARM step on every Management Cluster Rollout. When the AKS system pool's VMSS is wedged by NRP-KVS corruption, the binary aborts the stuck cluster LRO when it is safe to do so, creates a temporary systmp system pool, deletes and recreates the system pool to get a fresh VMSS/KVS entity, drains and deletes systmp, and reconciles the cluster tags. It exits 0 as a no-op when the cluster is healthy or does not exist yet (greenfield).

Why

The AROSLSRE-880 INT incident (2026-05-16..18) left the mgmt cluster stuck in Updating for days because every virtualMachineScaleSets/write on the system pool's VMSS failed with NetworkingInternalOperationError on a continuous retry chain. The corruption is bound to the VMSS ARM resource ID, so per-instance delete is useless — only recreating the pool (and thus the VMSS) clears it. The recipe was applied manually at INT under AROSLSRE-924; this PR automates it for stg/prod so the daily mgmt-pipeline self-heals instead of paging on-call.

Four guards must ALL fire for the binary to act (else exit 0 no-op):

1. >= NRP_FAIL_THRESHOLD VMSS-write Failed events with NetworkingInternalOperationError
2. cluster provisioningState is recoverable (settled OR stuck mid-LRO, not Creating/Deleting)
3. all non-system pools have count > 0
4. system pool provisioningState in {Failed, Canceled, Updating, Upgrading}

The number of Ready system nodes is logged as a diagnostic, but it is not a hard guard. In INT, Ready < minCount only appeared after we manually triggered extra scale-up/surge attempts; the normal stuck-upgrade form can have Ready nodes still at minCount while the system pool/cluster LRO is wedged.
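A minimal sketch of how the four guards above could combine into a single act/no-op decision. The `poolSignals` type and `shouldRemediate` function are hypothetical names for illustration, not the binary's actual code; the accepted state sets follow the guard list and the review comments in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// poolSignals is a hypothetical, simplified view of the inputs the four
// guards read; the real binary gathers these from ARM and the Activity Log.
type poolSignals struct {
	NRPFailures     int            // Failed VMSS-write events with NetworkingInternalOperationError
	ClusterState    string         // cluster provisioningState
	SystemPoolState string         // system pool provisioningState
	NonSystemCounts map[string]int // non-system pool name -> node count
}

// shouldRemediate reports whether all four guards fire. The second return
// value lists the guards that did not, so a no-op exit can explain itself.
func shouldRemediate(s poolSignals, threshold int) (bool, []string) {
	var blocked []string

	// Guard 1: enough VMSS-write failures with the NRP-KVS signature.
	if s.NRPFailures < threshold {
		blocked = append(blocked, fmt.Sprintf("guard 1: %d failures < threshold %d", s.NRPFailures, threshold))
	}

	// Guard 2: cluster is settled or stuck mid-LRO; Creating, Deleting and unknown states are rejected.
	switch s.ClusterState {
	case "Succeeded", "Canceled", "Failed", "Updating", "Upgrading":
	default:
		blocked = append(blocked, "guard 2: cluster state "+s.ClusterState+" is not recoverable")
	}

	// Guard 3: every non-system pool has count > 0 (sorted for stable output).
	names := make([]string, 0, len(s.NonSystemCounts))
	for name := range s.NonSystemCounts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if s.NonSystemCounts[name] <= 0 {
			blocked = append(blocked, "guard 3: pool "+name+" has no nodes")
		}
	}

	// Guard 4: the system pool itself looks wedged.
	switch s.SystemPoolState {
	case "Failed", "Canceled", "Updating", "Upgrading":
	default:
		blocked = append(blocked, "guard 4: system pool state "+s.SystemPoolState+" is not a wedge state")
	}

	return len(blocked) == 0, blocked
}

func main() {
	wedged := poolSignals{
		NRPFailures:     12,
		ClusterState:    "Updating",
		SystemPoolState: "Updating",
		NonSystemCounts: map[string]int{"user": 3},
	}
	act, blocked := shouldRemediate(wedged, 10)
	fmt.Println("act:", act, "blocked:", blocked)
}
```

Returning the list of guards that did not fire keeps the no-op path self-explanatory in the logs, which matters when the step runs on every rollout.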

Testing

65 focused Go test funcs covering pure-logic helpers and orchestration paths: env parsing, all 4 guard primitives, sanitize-no-mutation regression, systmp clone/post-processing, activity-log parsing including NRP-KVS signature filtering, LRO-age activity-log parsing, Ready-node waiting, and no-op/execute remediation flow with fake clients.
Test-only SKIP_GUARDS coverage exercises the full remediation path without fabricating an NRP failure storm; the mgmt pipeline does not set this env var.
make verify, make lint, make validate-changed-config-pipelines — all green locally.
E2E failures were root-caused and addressed:
- startup az account show dependency removed; SUBSCRIPTION_ID now comes from an ARM subscription-output step.
- external kubectl dependency removed for cordon/drain; drain uses the client-go drain helper with the sessiongate dynamic AKS REST config.

Special notes for your reviewer

Kubernetes access uses sessiongate/pkg/mc.GetAKSRESTConfig, so client-go requests use the shared dynamic Azure-token transport with token refresh.
No Azure CLI or external kubectl shellouts remain. Activity Log detection uses armmonitor.ActivityLogsClient, LRO abort uses SDK ManagedClustersClient.BeginAbortLatestOperation, tag reconcile uses armresources.TagsClient.UpdateAtScope, and pre/post-flight diagnostics use SDK/client-go logs.
Guard 1 now counts only exact Microsoft.Compute/virtualMachineScaleSets/write events with the NRP-KVS signature; VMSS deletes and child-resource writes do not satisfy the destructive remediation gate.
Activity Log AuthorizationFailed / LinkedAuthorizationFailed responses retry with bounded backoff inside the Go binary to tolerate RBAC propagation after the Reader assignment; other query failures still fail closed.
Ready-node waits count only Ready, schedulable, non-deleting nodes so old cordoned/deleting system nodes cannot satisfy the recreated-pool wait.
No aks-preview dependency: stuck-LRO age comes from Activity Log Started managedClusters/write events.
After LRO handling, the script re-runs all detection guards and refreshes the system pool snapshot before creating systmp, so it exits no-op if the wedge recovered.
systmp is built from the same JSON-roundtrip sanitized live-pool clone as system recreation, then post-processed only for intentional temporary-pool differences (Count=1, autoscaler fields cleared, purpose tag); taints are inherited from the live system pool clone.
Drain uses k8s.io/kubectl/pkg/drain in-process. Cordon failure is fatal; Force=true is enabled to match the later authoritative nodepool deletion path.
Logging uses log/slog JSON to stderr (component=recreate-system-pool, phase=STEP N on banners), same shape as frontend / backend / admin-server so Geneva ships it to Kusto without extra wiring.
Build aligned with fix: build cleanup-pko-resources based on machine architecture #5394 (drops GOOS/GOARCH so dev envs on macOS-arm work too).
globalMSIId receives subscription Reader via pipeline-msi-reader-permissions so guard 1 can read AKS node RG activity logs.
The buildStep compiles ./scripts/recreate-system-pool into ./scripts/recreate-system-pool/recreate-system-pool during artifact build; the Shell step runs that generated binary from the rollout artifact root.
Deeper technical detail lives in AROSLSRE-952.
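The bounded-backoff retry for Activity Log authorization errors mentioned in the notes above could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the attempt count, the doubling delay, and classifying errors by matching the error text (the real binary may inspect a structured SDK error instead).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// errThrottled is a hypothetical non-authorization failure used in the demo.
var errThrottled = errors.New("throttled")

// isAuthPropagationErr is a hypothetical classifier. Matching on the error
// text is an assumption made to keep the sketch dependency-free.
func isAuthPropagationErr(err error) bool {
	// "LinkedAuthorizationFailed" contains "AuthorizationFailed", so one check covers both.
	return err != nil && strings.Contains(err.Error(), "AuthorizationFailed")
}

// retryOnAuthPropagation retries query only for authorization errors, doubling
// the delay between attempts. Any other error is returned immediately so the
// caller can fail closed.
func retryOnAuthPropagation(ctx context.Context, attempts int, base time.Duration, query func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := base
	var err error
	for i := 1; i <= attempts; i++ {
		err = query(ctx)
		if err == nil || !isAuthPropagationErr(err) {
			return err
		}
		if i == attempts {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return fmt.Errorf("authorization not propagated after %d attempts: %w", attempts, err)
}

// simulate runs the retry loop against a fake query that returns authFailures
// authorization errors before returning final. It reports the number of calls.
func simulate(authFailures int, final error) (calls int, err error) {
	query := func(context.Context) error {
		calls++
		if calls <= authFailures {
			return errors.New("AuthorizationFailed: client lacks read access")
		}
		return final
	}
	err = retryOnAuthPropagation(context.Background(), 4, time.Millisecond, query)
	return calls, err
}

func main() {
	calls, err := simulate(2, nil)
	fmt.Println("calls:", calls, "err:", err)
	calls, err = simulate(0, errThrottled)
	fmt.Println("calls:", calls, "err:", err)
}
```

Keeping the retry narrow (authorization errors only) preserves the fail-closed behaviour for every other query failure.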

Copilot

Pull request overview

This PR adds an automated, detection-gated remediation step to the management EV2 pipeline to recover AKS management clusters whose system nodepool VMSS updates are continuously failing due to NRP key-value-store corruption, by recreating the system pool ahead of the main cluster ARM deployment.

Changes:

Introduces a new Go-based recreate-system-pool utility that evaluates safety guards and, when triggered, drains and recreates the system pool via ARM/SDK operations plus targeted kubectl actions.
Adds extensive unit tests for the utility’s pure-logic components (env parsing, guard evaluation, snapshot sanitization, activity-log parsing, kubeconfig/token handling).
Updates dev-infrastructure/mgmt-pipeline.yaml to build/package the binary and run it as a Shell step before the cluster ARM step, and wires the module into go.work.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

File	Description
`go.work`	Adds the new `recreate-system-pool` Go module to the workspace.
`dev-infrastructure/scripts/recreate-system-pool/main.go`	Implements the detection-gated remediation flow: guard checks, snapshot/sanitize, temporary pool creation, drain/delete/recreate, and tag reconcile.
`dev-infrastructure/scripts/recreate-system-pool/main_test.go`	Adds unit tests for config parsing, guards, sanitization, activity-log parsing, kube helpers, and error classification.
`dev-infrastructure/scripts/recreate-system-pool/go.mod`	Defines the module and dependencies (Azure SDK + Kubernetes client libraries).
`dev-infrastructure/scripts/recreate-system-pool/go.sum`	Captures dependency checksums for the new module.
`dev-infrastructure/mgmt-pipeline.yaml`	Builds the binary in the pipeline buildStep and adds a pre-cluster Shell step to execute it (no-op unless guards fire).

Copilot

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.

raelga · 2026-05-26T20:52:36Z

@copilot no action needed. Addressed all 10 points in the force-push (96588d036 → d9edc5531). Also widened Guard 5 to accept Updating/Upgrading per offline review (the AROSLSRE-880 wedge leaves the system pool in Updating while its parent cluster LRO retries forever, not just in Failed).

#	Suggestion	Fix
1	`preflightChecks` fails open on non-404 errors	Now uses `isNotFoundErr`; only HTTP 404 is treated as "pool does not exist", all other errors surface as a hard fail.
2	`azShellTSV` uses `exec.Command` (no ctx)	Switched to `exec.CommandContext(ctx, …)` and now includes `ExitError.Stderr` in returned errors. `loadConfig` threads `ctx` down.
3	`buildSystmpAgentPool` hard-codes `OSDiskSizeGB=128`	Inherits `OSDiskSizeGB`, `OSType`, `OSSKU`, `OSDiskType` from the live snapshot. Refuses to act when `OSDiskSizeGB` is missing or zero. Added two new tests.
4	`newAzureClients` uses bare `DefaultAzureCredential`	Now passes `&azidentity.DefaultAzureCredentialOptions{RequireAzureTokenCredentials: true}` to match backend / sessiongate / admin-client convention.
5	YAML comment says "four guards"	Updated to "five".
6	Top doc comment for guard 3 stale	Rewritten to match the widened acceptance (`Succeeded`/`Canceled`/`Failed` settled, `Updating`/`Upgrading` mid-LRO; reject `Creating`/`Deleting`/unknown).
7	Guard 3 log message stale ("must be Succeeded or Canceled")	Now logs the full accept/reject set.
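Row 1 of the table above (treat only HTTP 404 as "pool does not exist") can be sketched as follows. The `statusError` type is a hypothetical stand-in for the SDK's response error so the example stays self-contained; `poolExists` is likewise an illustrative name, not a function from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for an SDK response error that
// carries the HTTP status code of a failed ARM call.
type statusError struct {
	StatusCode int
}

func (e *statusError) Error() string {
	return fmt.Sprintf("unexpected HTTP status %d", e.StatusCode)
}

// isNotFoundErr reports whether err, or any error it wraps, is an HTTP 404.
func isNotFoundErr(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.StatusCode == http.StatusNotFound
}

// poolExists interprets the error from a pool Get call. Only a 404 means
// "absent"; every other error is surfaced so the caller fails closed.
func poolExists(getErr error) (bool, error) {
	switch {
	case getErr == nil:
		return true, nil
	case isNotFoundErr(getErr):
		return false, nil
	default:
		return false, fmt.Errorf("checking pool: %w", getErr)
	}
}

func main() {
	wrapped404 := fmt.Errorf("get systmp: %w", &statusError{StatusCode: http.StatusNotFound})
	fmt.Println(poolExists(wrapped404))
	fmt.Println(poolExists(&statusError{StatusCode: http.StatusForbidden}))
}
```

Using `errors.As` rather than a type assertion keeps the check working when the error has been wrapped with additional context on the way up.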

Not applied (with reason):

GOOS/GOARCH in buildStep — intentionally aligned with fix: build cleanup-pko-resources based on machine architecture #5394 (Tony's in-flight PR drops these from cleanup-pko-resources too, so dev envs on macOS-arm build natively). I mirrored that change pre-emptively; if fix: build cleanup-pko-resources based on machine architecture #5394 is reverted I'll re-pin.

Other tightening in the same push:

Guard 5 widened to accept Updating and Upgrading too (not just Failed/Canceled). The AROSLSRE-880 NRP-KVS wedge typically leaves the system pool in Updating for hours/days while the parent cluster LRO retries forever — rejecting that state would have made the binary unable to fix the exact scenario it was built for. Still rejects Succeeded (clearly healthy), Creating, Deleting, and unknown future states.
Test count is now 61 top-level test funcs / ~110 sub-cases. go vet, gofmt, make validate-changed-config-pipelines all clean.

…E-924)

Adds a detection-gated EV2 Shell step that recreates the AKS system pool when the NRP key-value-store entity for its VMSS gets corrupted. The same recipe was applied manually at INT on 2026-05-24 (AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.

## Background

A corrupted NRP KVS entry for the system pool's VMSS causes every Microsoft.Compute/virtualMachineScaleSets/write to fail with NetworkingInternalOperationError on a continuous retry chain. Fresh VM instances come up but never get a Swift NIC, kubelet never registers, the pool stops scaling, and the cluster's upgrade LRO retries forever - the AROSLSRE-880 / INT (2026-05-16..18) incident left the cluster stuck in Updating for days because of this.

The corruption is bound to the VMSS ARM resource ID; per-instance delete does not help. Deleting and re-creating the pool yields a fresh VMSS name and a clean KVS entity. NRP-side fix is tracked in ICM 798003653; once it ships, this binary's detection guards never fire and the step becomes a no-op.

## Placement

The step runs BEFORE the cluster ARM step (depending on the same cert-issuer prereqs), so when guards fire the recreate happens before the next cluster PUT, preventing the pipeline from getting stuck. The cluster ARM step now depends on this step.

The binary tolerates a not-yet-created cluster (greenfield rollout): it does an ARM Get; if 404, logs and exits 0 with no action. No `aksCluster:` directive is used (which would fail on greenfield); instead the binary bootstraps its own kubeconfig from ListClusterUserCredentials + a bearer token issued by the MSI scoped to the AKS AAD server app, and exports KUBECONFIG so child kubectl invocations work too. No `kubelogin` dependency.

## Detection (ALL guards must pass; otherwise exit 0 no-op)

1. system pool Ready k8s nodes < minCount
2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
3. cluster provisioningState is recoverable: Succeeded, Canceled, Failed (settled) OR Updating, Upgrading (mid-LRO - the wedge signature itself). Rejected: Creating, Deleting, unknown.
4. every non-system pool has count > 0
5. system pool provisioningState == "Failed" - positive confirmation that this specific pool is wedged

## Action (once guards pass)

1. Snapshot system pool ARM JSON
2. Abort cluster LRO ONLY if active LRO is >= 30 min old (the NRP-KVS retry storm signature). If younger, no-op exit to avoid racing a healthy in-progress operation. AROSLSRE-924 manual recipe required this step to move the cluster from stuck-Updating to Canceled.
3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted, same VMSize/subnets as live system)
4. Cordon + drain existing system nodes
5. Delete the broken system pool
6. Re-create system via SDK CreateOrUpdate from the sanitized snapshot (strips read-only fields and aks-managed-* tags, pins orchestratorVersion to the live CP version)
7. Drain + delete systmp
8. No-op tag PATCH to flip cluster Canceled -> Succeeded

## Safety

- DefaultAzureCredential chain (MSI in EV2, az CLI locally).
- sanitizeForRecreate deep-copies via JSON round-trip; never mutates the snapshot.
- snapshotSystem refuses to act if VMSize or VnetSubnetID are missing from the live pool.
- maybeAbortLRO returns (proceed=false, no err) when LRO is younger than 30 min, so the binary exits 0 (not an error) rather than racing a potentially-healthy operation.
- preflightChecks refuses to act if a leftover 'systmp' exists.
- Guard 2 fails closed if the activity-log query errors (so a missing Reader role on the node RG yields a no-op, not a runaway recreate).
- Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
- Overall 60-min context timeout.
- DRY_RUN=true lets operators verify guard behaviour without making any writes.

## Testing

- 100+ unit test cases covering all pure-logic functions: env parsing, guard primitives (evalGuard1..5 - including guard 3's acceptance of stuck-Updating clusters and guard 5 for system pool Failed state), sanitizeForRecreate (no-mutation, tag stripping, version pin), buildSystmpAgentPool (defensive nil checks), activity-log parsing (dedup, case insensitivity, prefix filter), isNodeReady (nil/missing/false conditions), isNotFoundErr (404 vs other status codes, wrapped errors), extractAPIServerAndCA (happy path, empty input, malformed yaml, missing fields), kubeconfigWithBearerToken (no exec plugin or auth provider required).
- go vet, gofmt clean.
- validate-changed-config-pipelines passes.

## References

- PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
- Pipeline step pattern: Azure#4790
- Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev envs on macOS-arm build natively too)
- Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask), AROSLSRE-924 (INT manual mitigation), AROSLSRE-880 (parent incident bug)
- ICM: 798003653
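The LRO-age rule from the commit message above (abort only when the active LRO is at least 30 minutes old, otherwise exit 0 without acting) reduces to a small pure decision. The sketch below is illustrative only: `decideLRO` and `lroAge` are hypothetical names, and the real maybeAbortLRO also performs the abort call itself.

```go
package main

import (
	"fmt"
	"time"
)

// minLROAge mirrors the 30-minute threshold described in the commit message.
const minLROAge = 30 * time.Minute

// decideLRO returns whether remediation may proceed and whether the active
// LRO should be aborted first.
//   - no active LRO:            proceed, nothing to abort
//   - LRO at least 30 min old:  abort it, then proceed
//   - younger LRO:              do not proceed (caller exits 0 as a no-op)
func decideLRO(active bool, age time.Duration) (proceed, abort bool) {
	switch {
	case !active:
		return true, false
	case age >= minLROAge:
		return true, true
	default:
		return false, false
	}
}

// lroAge derives the age from the timestamp of the operation's Started event.
func lroAge(started, now time.Time) time.Duration {
	return now.Sub(started)
}

func main() {
	now := time.Date(2026, 5, 27, 12, 0, 0, 0, time.UTC)

	proceed, abort := decideLRO(true, lroAge(now.Add(-45*time.Minute), now))
	fmt.Println("45m old LRO: proceed =", proceed, "abort =", abort)

	proceed, abort = decideLRO(true, lroAge(now.Add(-5*time.Minute), now))
	fmt.Println("5m old LRO:  proceed =", proceed, "abort =", abort)
}
```

Separating the decision from the abort call makes the "too young, exit 0" branch easy to unit-test without a fake ARM client.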

Copilot

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

…2 (AROSLSRE-924)

Address Copilot review (PR Azure#5397, batch 3):

* Guard 2 (countNRPFailures + nrpResourceIDs): require the activity-log Failed VMSS-write event to carry the NetworkingInternalOperationError inner error code in properties.statusMessage before counting toward the threshold. Other failure modes (quota, capacity, policy, image pull) on the same aks-system-* VMSS no longer satisfy guard 2 and cannot trigger a destructive pool recreation that would not address their actual root cause.
* Guard 5 PASS log: print the observed provisioningState (Failed, Canceled, Updating, Upgrading) instead of hard-coding "is Failed". Operators reading the log now see which accepted state matched.

Implementation:

* Add nrpKVSErrorCode const with the ARM inner error code.
* Extend activityEvent with Properties.StatusMessage (the inner ARM error body as an embedded JSON string).
* Add hasNRPKVSSignature(e) helper that parses statusMessage and returns true iff error.code == NetworkingInternalOperationError. Fails closed on any parse error / missing field.
* Apply the signature filter inside both countNRPFailures and nrpResourceIDs so the diagnostic list matches the count.
* Update the guard-2 progress log line and the top-of-file guard-2 doc comment to call out the NRP-KVS signature.

Tests:

* mkActivityEvent now emits a realistic properties.statusMessage with the NRP-KVS code by default so all pre-existing tests stay green; callers can pass a different code to simulate other failure modes.
* New tests: TestCountNRPFailures_RequiresNRPKVSSignature, TestCountNRPFailures_MissingPropertiesNotCounted, TestCountNRPFailures_MalformedStatusMessageNotCounted, TestNRPResourceIDs_RequiresNRPKVSSignature.
* go vet + go test all green in dev-infrastructure/scripts/recreate-system-pool.
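A rough sketch of the signature filter this commit describes. The identifier names nrpKVSErrorCode, activityEvent, hasNRPKVSSignature and countNRPFailures come from the commit message, but the bodies below are a reconstruction under assumptions, not the PR's code. In particular, the JSON shape of statusMessage (`{"error":{"code":...}}`) is inferred from the "error.code" wording, and `eventWithCode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nrpKVSErrorCode is the ARM inner error code that marks the NRP-KVS wedge.
const nrpKVSErrorCode = "NetworkingInternalOperationError"

// eventProperties and activityEvent model only the fields this sketch needs.
type eventProperties struct {
	StatusMessage string `json:"statusMessage"` // inner ARM error body as an embedded JSON string
}

type activityEvent struct {
	Properties *eventProperties `json:"properties"`
}

// hasNRPKVSSignature returns true only when statusMessage parses as JSON and
// its error.code equals nrpKVSErrorCode. Missing properties, an empty or
// malformed message, and any other code all return false (fail closed).
func hasNRPKVSSignature(e activityEvent) bool {
	if e.Properties == nil || e.Properties.StatusMessage == "" {
		return false
	}
	var body struct {
		Error struct {
			Code string `json:"code"`
		} `json:"error"`
	}
	if err := json.Unmarshal([]byte(e.Properties.StatusMessage), &body); err != nil {
		return false
	}
	return body.Error.Code == nrpKVSErrorCode
}

// countNRPFailures counts only events that carry the NRP-KVS signature.
func countNRPFailures(events []activityEvent) int {
	n := 0
	for _, e := range events {
		if hasNRPKVSSignature(e) {
			n++
		}
	}
	return n
}

// eventWithCode is a hypothetical helper that builds an event whose
// statusMessage carries the given inner error code.
func eventWithCode(code string) activityEvent {
	body, _ := json.Marshal(map[string]map[string]string{"error": {"code": code}})
	return activityEvent{Properties: &eventProperties{StatusMessage: string(body)}}
}

func main() {
	events := []activityEvent{
		eventWithCode(nrpKVSErrorCode),
		eventWithCode("QuotaExceeded"),
		{}, // no properties at all
		{Properties: &eventProperties{StatusMessage: "not json"}},
	}
	fmt.Println("NRP-KVS failures:", countNRPFailures(events))
}
```

Because the filter fails closed, an unexpected Activity Log payload can only make the destructive path less likely to trigger, never more.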

Copilot

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

raelga · 2026-05-26T22:22:04Z

Update after re-checking INT behavior: Ready system nodes < minCount is no longer a hard guard. That condition only appeared in INT after we manually triggered extra scale-up/surge attempts; the normal stuck-upgrade wedge can still have Ready nodes at minCount while the system pool/cluster LRO is stuck Updating. The binary now logs ready/registered system-node counts as diagnostics only and gates on the four stronger signals: NRP-KVS activity-log signature, recoverable cluster state, non-system pools count > 0, and system pool state in Failed/Canceled/Updating/Upgrading.

…-924)

Add SKIP_GUARDS env var that bypasses the 4 detection guards, allowing the recreation flow to be tested on healthy clusters. Combined with DRY_RUN=true it logs what would happen; without DRY_RUN it performs the actual pool recreation.

Validated on personal dev env (pers-usw3rael-mgmt-1):

- SKIP_GUARDS=true DRY_RUN=true → full plumbing test, exit 0
- SKIP_GUARDS=true → full recreation in ~10 min, system VMSS 20157742 → 30104029, cluster healthy after.
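The interaction between SKIP_GUARDS, DRY_RUN and the guard result described above can be summarised in a few lines. This is a sketch under assumptions: `boolEnv`, `chooseAction` and the `action` values are hypothetical, and parsing with strconv.ParseBool (rejecting unparseable values) is a choice made here, not necessarily what the binary does.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// boolEnv reads a boolean environment variable through lookup (normally
// os.LookupEnv). Unset or empty means false; an unparseable value is an
// error rather than a silent default.
func boolEnv(lookup func(string) (string, bool), key string) (bool, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return false, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("%s=%q is not a boolean: %w", key, raw, err)
	}
	return v, nil
}

type action string

const (
	actionNoOp     action = "no-op"    // guards did not fire and were not skipped
	actionDryRun   action = "dry-run"  // log what would happen, make no writes
	actionRecreate action = "recreate" // perform the pool recreation
)

// chooseAction combines the guard result with the two switches.
func chooseAction(guardsFired, skipGuards, dryRun bool) action {
	if !guardsFired && !skipGuards {
		return actionNoOp
	}
	if dryRun {
		return actionDryRun
	}
	return actionRecreate
}

func main() {
	skip, err := boolEnv(os.LookupEnv, "SKIP_GUARDS")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	dry, err := boolEnv(os.LookupEnv, "DRY_RUN")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Example: a healthy cluster, so the guards did not fire.
	fmt.Println(chooseAction(false, skip, dry))
}
```

Since the mgmt pipeline never sets SKIP_GUARDS, the first branch is the one taken on every healthy rollout.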

Copilot

Pull request overview

Copilot reviewed 11 out of 16 changed files in this pull request and generated 1 comment.

raelga · 2026-05-27T12:53:45Z

/retest

Copilot

Pull request overview

Copilot reviewed 11 out of 16 changed files in this pull request and generated no new comments.

raelga · 2026-05-27T13:59:25Z

Testing run 2: full recreation with latest commits (`fdfe771`)

Re-tested after the 3 new fixes (narrow activity guard, ignore cordoned nodes, retry activity log auth). Binary built from fdfe77181.

Timeline (~9 min total)

Time (UTC)	Step	Action
13:49:01	STARTUP	`SKIP_GUARDS=true`, `DRY_RUN=false`
13:49:03	CLUSTER CHECK	Found cluster k8s 1.35.1, `Succeeded`
13:49:08	GUARDS	Guard 4 rejected (healthy), `SKIP_GUARDS` overrode
13:49:09	BOOTSTRAP	Kube client via MSI token
13:49:17	STEP 3	Created `systmp` (Standard_D4s_v3, inherited taints)
13:52:01	STEP 4	Cordoned + drained `aks-system-30104029-vmss000000`
13:52:14	STEP 5	Deleted `system` pool
13:53:17	STEP 6	Recreated `system` pool (new VMSS `18152043`)
13:57:02	STEP 7	Drained + deleted `systmp`
13:58:20	STEP 8	Tag reconcile
13:58:24	DONE	Post-flight: 0 NRP failures in last 10m

Azure Activity Log

Op                           Time                          Status
---------------------------  ----------------------------  ---------
Create or Update Agent Pool  2026-05-27T13:51:44.830Z      Succeeded   ← systmp
Delete Agent Pool            2026-05-27T13:53:17.108Z      Succeeded   ← system deleted
Create or Update Agent Pool  2026-05-27T13:56:39.273Z      Succeeded   ← system recreated

(systmp delete completed at 13:58:20 per binary logs)

Before / After

	Before	After
System node	`aks-system-30104029-vmss000000` (86m)	`aks-system-18152043-vmss000000` (3m)
VMSS	`30104029`	`18152043` (fresh KVS entity)
systmp	—	cleaned up
All 6 pools	Succeeded	Succeeded
All 6 nodes	Ready	Ready

Reconcile tag

"aroslsre-924-recreate": "2026-05-27T13:58:20.13764Z"

Improvements observed vs run 1

Drain completed without PDB retry loops (pods evicted cleanly on first attempt)
systmp drain had no repeated eviction retries
Overall ~1 min faster (9 min vs 10 min)

Run 1 vs Run 2 VMSS progression

run 0 (initial cluster):  aks-system-20157742-vmss000000
run 1 (first recreation): aks-system-30104029-vmss000000
run 2 (second recreation): aks-system-18152043-vmss000000

Each recreation yields a fresh VMSS name, confirming the KVS entity rotation works correctly.

janboll · 2026-05-27T14:12:28Z

How will you get this into the EV2 artifact?

raelga · 2026-05-27T14:18:34Z

@janboll the binary is built into the EV2 artifact by the buildStep in dev-infrastructure/mgmt-pipeline.yaml, similar to the existing cleanup-pko-resources helper.

That step runs during artifact/image build:

CGO_ENABLED=0 go build -o "${tmp2}" ./scripts/recreate-system-pool
chmod 0755 "${tmp2}"
mv "${tmp2}" ./scripts/recreate-system-pool/recreate-system-pool

Then the EV2 Shell step executes it from the rollout artifact root:

command: ./scripts/recreate-system-pool/recreate-system-pool
workingDir: .

So we do not commit the compiled binary; the mgmt pipeline artifact build compiles it and places it under scripts/recreate-system-pool/ before the Shell step runs.

geoberle · 2026-05-27T15:13:03Z

Accepted on the premise that, once this tool has done its duty, we rework the idea behind it into something sustainably maintainable as soon as possible.

geoberle · 2026-05-27T15:14:32Z

/lgtm

openshift-ci · 2026-05-27T15:14:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: geoberle, raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [geoberle,raelga]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

raelga · 2026-05-27T15:24:23Z

Testing run 3: with diagnostic shellouts removed (`77d8ce4`)

Re-tested after refactor(recreate-system-pool): remove diagnostic shellouts. Binary now uses native SDK + client-go for all pre-flight/post-flight diagnostics — no az or kubectl subprocess calls.

Timeline (~10 min total)

Time (UTC)	Step	Action
15:13:05	STARTUP	`SKIP_GUARDS=true`, binary `77d8ce4ea`
15:13:07	CLUSTER CHECK	Found cluster k8s 1.35.1, `Succeeded`
15:13:11	PRE-FLIGHT	Native SDK nodepool list + cluster show; client-go node list (kube not bootstrapped yet → WARN, expected)
15:13:14	GUARDS	Guard 4 rejected (healthy), `SKIP_GUARDS` overrode
15:13:15	BOOTSTRAP	Kube client via MSI token
15:13:19	PRE-ACTION	Full native diagnostics: 6 nodepools, 6 k8s nodes (all Ready, schedulable)
15:13:24	STEP 3	Created `systmp` (3m14s)
15:16:40	STEP 4	Cordoned + drained `aks-system-18152043-vmss000000` — clean, no PDB retries
15:16:55	STEP 5	Deleted `system` pool (1m4s)
15:17:59	STEP 6	Recreated `system` pool (3m13s), waited for 1 Ready node (30s poll)
15:21:44	STEP 7	Drained + deleted `systmp` (1m46s)
15:23:31	STEP 8	Tag reconcile
15:23:35	DONE	Post-flight: 6 pools Succeeded, 6 nodes Ready, 0 NRP failures

Azure Activity Log

Op                           Time                          Status
---------------------------  ----------------------------  ---------
Create or Update Agent Pool  2026-05-27T15:16:25.557Z      Succeeded   ← systmp
Delete Agent Pool            2026-05-27T15:17:58.412Z      Succeeded   ← system deleted
Create or Update Agent Pool  2026-05-27T15:21:00.119Z      Succeeded   ← system recreated

Before / After

	Before	After
System node	`aks-system-18152043-vmss000000` (77m)	`aks-system-23419720-vmss000000` (3m)
VMSS	`18152043`	`23419720` (fresh KVS entity)
systmp	—	cleaned up
All 6 pools	Succeeded	Succeeded
All 6 nodes	Ready, schedulable	Ready, schedulable

Reconcile tag

"aroslsre-924-recreate": "2026-05-27T15:23:30.722623Z"

Improvements in this run

Pre-flight and post-flight diagnostics now use native SDK + client-go (no az/kubectl subprocesses)
Structured node diagnostics: ready=true schedulableReady=true unschedulable=false deleting=false
Post-flight includes full final state dump (nodepools, cluster, k8s nodes) — all verified healthy by the binary itself
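The structured node diagnostics above (ready, schedulableReady, unschedulable, deleting) line up with the readiness rule used when waiting for the recreated pool: a node counts only if it is Ready, schedulable and not being deleted. A minimal sketch, assuming a simplified `nodeStatus` type in place of the client-go Node object and selection by node-name prefix (both are assumptions for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// nodeStatus is a hypothetical flattened view of a Kubernetes node.
type nodeStatus struct {
	Name          string
	Ready         bool // Ready condition is True
	Unschedulable bool // node is cordoned
	Deleting      bool // node has a deletion timestamp
}

// schedulableReady reports whether the node can count toward the wait.
func (n nodeStatus) schedulableReady() bool {
	return n.Ready && !n.Unschedulable && !n.Deleting
}

// countSchedulableReady counts nodes with the given name prefix that are
// Ready, schedulable and not being deleted.
func countSchedulableReady(nodes []nodeStatus, namePrefix string) int {
	n := 0
	for _, node := range nodes {
		if strings.HasPrefix(node.Name, namePrefix) && node.schedulableReady() {
			n++
		}
	}
	return n
}

func main() {
	nodes := []nodeStatus{
		// Old system node: still Ready but cordoned, so it must not satisfy the wait.
		{Name: "aks-system-18152043-vmss000000", Ready: true, Unschedulable: true},
		// Node from the recreated pool.
		{Name: "aks-system-23419720-vmss000000", Ready: true},
	}
	fmt.Println("schedulable Ready system nodes:", countSchedulableReady(nodes, "aks-system-"))
}
```

Without the cordon and deletion checks, the old node in this example would satisfy a "1 Ready node" wait before the recreated pool had produced anything usable.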

VMSS progression across all runs

run 0 (initial):   20157742
run 1 (recreate):  30104029
run 2 (recreate):  18152043
run 3 (recreate):  23419720

bennerv

A couple of comments.

I have more feedback, but these comments are the blocking ones or the ones that need attention. Long-term, this implementation needs some refactoring if we decide to keep it, but since it resolves an ongoing incident I'm okay with moving forward and mitigating the issue first.

raelga · 2026-05-27T16:18:18Z

/override ci/prow/e2e-parallel

E2E provision step passed — recreate-system-pool-if-broken ran successfully in 1s (greenfield no-op, cluster not yet created at that pipeline stage). E2E tests are unaffected by this PR since the binary only acts when NRP-KVS corruption is detected, and the new pipeline step runs before the cluster ARM deployment.

openshift-ci · 2026-05-27T16:18:24Z

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

raelga · 2026-05-27T19:36:14Z

/override ci/prow/e2e-parallel

E2E provision step passed — recreate-system-pool-if-broken ran successfully in 1s (greenfield no-op, cluster not yet created at that pipeline stage). E2E tests are unaffected by this PR since the binary only acts when NRP-KVS corruption is detected, and the new pipeline step runs before the cluster ARM deployment.

openshift-ci · 2026-05-27T19:36:19Z

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

raelga · 2026-05-27T22:21:06Z

/override ci/prow/e2e-parallel

E2E provision step passed — recreate-system-pool-if-broken ran successfully in 1s (greenfield no-op, cluster not yet created at that pipeline stage). E2E tests are unaffected by this PR since the binary only acts when NRP-KVS corruption is detected, and the new pipeline step runs before the cluster ARM deployment.

openshift-ci · 2026-05-27T22:21:10Z

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

Copilot AI review requested due to automatic review settings May 26, 2026 20:30

openshift-ci Bot requested review from mmazur and roivaz May 26, 2026 20:30

openshift-ci Bot added the approved label May 26, 2026

Copilot started reviewing on behalf of raelga May 26, 2026 20:30

Copilot AI reviewed May 26, 2026

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from ce8567a to bf07f1b Compare May 26, 2026 20:36

Copilot AI review requested due to automatic review settings May 26, 2026 20:43

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from bf07f1b to 6454335 Compare May 26, 2026 20:43

Copilot started reviewing on behalf of raelga May 26, 2026 20:43

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 6454335 to 96588d0 Compare May 26, 2026 20:45

Copilot AI reviewed May 26, 2026

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 96588d0 to d9edc55 Compare May 26, 2026 20:52

Copilot AI review requested due to automatic review settings May 26, 2026 21:20

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from d9edc55 to 6de8d8b Compare May 26, 2026 21:20

Copilot started reviewing on behalf of raelga May 26, 2026 21:20

Copilot AI reviewed May 26, 2026

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated

raelga requested a review from Copilot May 26, 2026 21:55

Copilot started reviewing on behalf of raelga May 26, 2026 21:55

Copilot AI reviewed May 26, 2026

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from ad64d13 to 89e3196 Compare May 26, 2026 22:05

Copilot AI review requested due to automatic review settings May 26, 2026 22:21

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 89e3196 to 8d19fc0 Compare May 26, 2026 22:21

Copilot started reviewing on behalf of raelga May 26, 2026 22:21

raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 6b2ce0c to d7a0ff1 Compare May 27, 2026 12:39

Copilot AI reviewed May 27, 2026

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go

raelga added 3 commits May 27, 2026 12:59

fix(recreate-system-pool): narrow activity guard to VMSS writes (AROS…

73e8472

…LSRE-924)

fix(recreate-system-pool): ignore cordoned nodes during readiness wai…

f56eb0c

…t (AROSLSRE-924)

fix(recreate-system-pool): retry activity log auth propagation (AROSL…

fdfe771

…SRE-924)

raelga requested a review from Copilot May 27, 2026 13:49

Copilot started reviewing on behalf of raelga May 27, 2026 13:49

Copilot AI reviewed May 27, 2026

geoberle reviewed May 27, 2026

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated

refactor(recreate-system-pool): remove diagnostic shellouts (AROSLSRE…

77d8ce4

…-924)

openshift-ci Bot assigned geoberle May 27, 2026

openshift-ci Bot added the lgtm label May 27, 2026

bennerv reviewed May 27, 2026

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go

openshift-merge-bot Bot merged commit 0cbdd05 into Azure:main May 27, 2026
16 checks passed

raelga mentioned this pull request May 28, 2026

fix(recreate-system-pool): widen NRP window and add forced-evidence trigger (AROSLSRE-924) #5420

Merged

12 tasks
