
feat(mgmt-pipeline): self-heal NRP-KVS-corrupted system pool (AROSLSRE-951)#5397

Merged
openshift-merge-bot[bot] merged 19 commits into Azure:main from raelga:raelga/aroslsre-924-pipeline-recreate-system-pool on May 27, 2026

Conversation

@raelga
Collaborator

@raelga raelga commented May 26, 2026

AROSLSRE-951 (story) · AROSLSRE-952 (subtask) · AROSLSRE-880 (parent incident bug) · ICM 798003653

What

Adds a detection-gated EV2 Shell step (Go binary) that runs before the cluster ARM step on every Management Cluster Rollout. When the AKS system pool's VMSS is wedged by NRP-KVS corruption, it aborts the long-running cluster LRO when safe, creates a temporary systmp system pool, deletes/recreates the system pool to get a fresh VMSS/KVS entity, drains/deletes systmp, and reconciles the cluster tags. It exits 0 no-op when the cluster is healthy or does not exist yet (greenfield).

Why

The AROSLSRE-880 INT incident (2026-05-16..18) left the mgmt cluster stuck in Updating for days because every virtualMachineScaleSets/write on the system pool's VMSS failed with NetworkingInternalOperationError on a continuous retry chain. The corruption is bound to the VMSS ARM resource ID, so per-instance delete is useless — only recreating the pool (and thus the VMSS) clears it. The recipe was applied manually at INT under AROSLSRE-924; this PR automates it for stg/prod so the daily mgmt-pipeline self-heals instead of paging on-call.

Four guards must ALL fire for the binary to act (else exit 0 no-op):

1. >= NRP_FAIL_THRESHOLD VMSS-write Failed events with NetworkingInternalOperationError
2. cluster provisioningState is recoverable (settled OR stuck mid-LRO, not Creating/Deleting)
3. all non-system pools have count > 0
4. system pool provisioningState in {Failed, Canceled, Updating, Upgrading}
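The gate above can be read as a pure function over observed state. The following is a minimal sketch under stated assumptions, not the binary's code: the `observed` type, its field names, and `shouldRemediate` are hypothetical; only the four conditions and the accepted state names come from this PR.

```go
package main

import "fmt"

// observed is a hypothetical snapshot of the inputs the four guards inspect.
type observed struct {
	NRPFailures        int            // Failed VMSS-write events with NetworkingInternalOperationError
	ClusterState       string         // cluster provisioningState
	SystemPoolState    string         // system pool provisioningState
	NonSystemPoolCount map[string]int // node count per non-system pool
}

// shouldRemediate returns true only when all four guards fire. When it
// returns false, the string names one guard that did not fire.
func shouldRemediate(o observed, threshold int) (bool, string) {
	if o.NRPFailures < threshold {
		return false, "guard 1: too few NRP-KVS failures"
	}
	switch o.ClusterState {
	case "Succeeded", "Canceled", "Failed", "Updating", "Upgrading":
		// settled or stuck mid-LRO: recoverable
	default: // Creating, Deleting, or an unknown state
		return false, "guard 2: cluster state not recoverable"
	}
	for name, n := range o.NonSystemPoolCount {
		if n <= 0 {
			return false, "guard 3: pool " + name + " has no nodes"
		}
	}
	switch o.SystemPoolState {
	case "Failed", "Canceled", "Updating", "Upgrading":
	default:
		return false, "guard 4: system pool not wedged"
	}
	return true, ""
}

func main() {
	healthy := observed{ClusterState: "Succeeded", SystemPoolState: "Succeeded"}
	ok, why := shouldRemediate(healthy, 10)
	fmt.Println(ok, why)
}
```

Returning the reason alongside the boolean keeps the no-op path explainable in logs, which matters for a step that runs on every rollout.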

The number of Ready system nodes is logged as a diagnostic, but it is not a hard guard. In INT, Ready < minCount only appeared after we manually triggered extra scale-up/surge attempts; the normal stuck-upgrade form can have Ready nodes still at minCount while the system pool/cluster LRO is wedged.

Testing

  • 65 focused Go test funcs covering pure-logic helpers and orchestration paths: env parsing, all 4 guard primitives, sanitize-no-mutation regression, systmp clone/post-processing, activity-log parsing including NRP-KVS signature filtering, LRO-age activity-log parsing, Ready-node waiting, and no-op/execute remediation flow with fake clients.
  • Test-only SKIP_GUARDS coverage exercises the full remediation path without fabricating an NRP failure storm; the mgmt pipeline does not set this env var.
  • make verify, make lint, make validate-changed-config-pipelines — all green locally.
  • E2E failures were root-caused and addressed:
    • startup az account show dependency removed; SUBSCRIPTION_ID now comes from an ARM subscription-output step.
    • external kubectl dependency removed for cordon/drain; drain uses the client-go drain helper with the sessiongate dynamic AKS REST config.

Special notes for your reviewer

  • Kubernetes access uses sessiongate/pkg/mc.GetAKSRESTConfig, so client-go requests use the shared dynamic Azure-token transport with token refresh.
  • No Azure CLI or external kubectl shellouts remain. Activity Log detection uses armmonitor.ActivityLogsClient, LRO abort uses SDK ManagedClustersClient.BeginAbortLatestOperation, tag reconcile uses armresources.TagsClient.UpdateAtScope, and pre/post-flight diagnostics use SDK/client-go logs.
  • Guard 1 now counts only exact Microsoft.Compute/virtualMachineScaleSets/write events with the NRP-KVS signature; VMSS deletes and child-resource writes do not satisfy the destructive remediation gate.
  • Activity Log AuthorizationFailed / LinkedAuthorizationFailed responses retry with bounded backoff inside the Go binary to tolerate RBAC propagation after the Reader assignment; other query failures still fail closed.
  • Ready-node waits count only Ready, schedulable, non-deleting nodes so old cordoned/deleting system nodes cannot satisfy the recreated-pool wait.
  • No aks-preview dependency: stuck-LRO age comes from Activity Log Started managedClusters/write events.
  • After LRO handling, the script re-runs all detection guards and refreshes the system pool snapshot before creating systmp, so it exits no-op if the wedge recovered.
  • systmp is built from the same JSON-roundtrip sanitized live-pool clone as system recreation, then post-processed only for intentional temporary-pool differences (Count=1, autoscaler fields cleared, purpose tag); taints are inherited from the live system pool clone.
  • Drain uses k8s.io/kubectl/pkg/drain in-process. Cordon failure is fatal; Force=true is enabled to match the later authoritative nodepool deletion path.
  • Logging uses log/slog JSON to stderr (component=recreate-system-pool, phase=STEP N on banners), same shape as frontend / backend / admin-server so Geneva ships it to Kusto without extra wiring.
  • Build aligned with fix: build cleanup-pko-resources based on machine architecture #5394 (drops GOOS/GOARCH so dev envs on macOS-arm work too).
  • globalMSIId receives subscription Reader via pipeline-msi-reader-permissions so guard 1 can read AKS node RG activity logs.
  • The buildStep compiles ./scripts/recreate-system-pool into ./scripts/recreate-system-pool/recreate-system-pool during artifact build; the Shell step runs that generated binary from the rollout artifact root.
  • Deeper technical detail lives in AROSLSRE-952.

PR Checklist

  • PR is scoped to a single task (no mixed concerns)
  • Title follows Conventional Commits format
  • Summary explains the "Why" behind the change
  • Linked to relevant ticket/issue
  • Screenshots included (if graph/UI/metrics changes) — n/a
  • Self-reviewed the diff
  • CI/CD checks are passing (ignore Tide)
  • Draft PR used for WIP (if applicable) — n/a
  • Commit history is clean (rebased/squashed)
  • Tricky code blocks are commented
  • Specific reviewers tagged
  • All comment threads resolved before merge

Copilot AI review requested due to automatic review settings May 26, 2026 20:30
@openshift-ci openshift-ci Bot requested review from mmazur and roivaz May 26, 2026 20:30
Contributor

Copilot AI left a comment

Pull request overview

This PR adds an automated, detection-gated remediation step to the management EV2 pipeline to recover AKS management clusters whose system nodepool VMSS updates are continuously failing due to NRP key-value-store corruption, by recreating the system pool ahead of the main cluster ARM deployment.

Changes:

  • Introduces a new Go-based recreate-system-pool utility that evaluates safety guards and, when triggered, drains and recreates the system pool via ARM/SDK operations plus targeted kubectl actions.
  • Adds extensive unit tests for the utility’s pure-logic components (env parsing, guard evaluation, snapshot sanitization, activity-log parsing, kubeconfig/token handling).
  • Updates dev-infrastructure/mgmt-pipeline.yaml to build/package the binary and run it as a Shell step before the cluster ARM step, and wires the module into go.work.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| go.work | Adds the new recreate-system-pool Go module to the workspace. |
| dev-infrastructure/scripts/recreate-system-pool/main.go | Implements the detection-gated remediation flow: guard checks, snapshot/sanitize, temporary pool creation, drain/delete/recreate, and tag reconcile. |
| dev-infrastructure/scripts/recreate-system-pool/main_test.go | Adds unit tests for config parsing, guards, sanitization, activity-log parsing, kube helpers, and error classification. |
| dev-infrastructure/scripts/recreate-system-pool/go.mod | Defines the module and dependencies (Azure SDK + Kubernetes client libraries). |
| dev-infrastructure/scripts/recreate-system-pool/go.sum | Captures dependency checksums for the new module. |
| dev-infrastructure/mgmt-pipeline.yaml | Builds the binary in the pipeline buildStep and adds a pre-cluster Shell step to execute it (no-op unless guards fire). |

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
Comment thread dev-infrastructure/mgmt-pipeline.yaml
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from ce8567a to bf07f1b Compare May 26, 2026 20:36
Copilot AI review requested due to automatic review settings May 26, 2026 20:43
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from bf07f1b to 6454335 Compare May 26, 2026 20:43
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 6454335 to 96588d0 Compare May 26, 2026 20:45
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.

Comment thread dev-infrastructure/mgmt-pipeline.yaml
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 96588d0 to d9edc55 Compare May 26, 2026 20:52
@raelga
Collaborator Author

raelga commented May 26, 2026

@copilot no action needed. Addressed all 10 points in the force-push (96588d036 → d9edc5531). Also widened Guard 5 to accept Updating/Upgrading per offline review (the AROSLSRE-880 wedge leaves the system pool in Updating while its parent cluster LRO retries forever, not just in Failed).

| # | Suggestion | Fix |
| --- | --- | --- |
| 1 | preflightChecks fails open on non-404 errors | Now uses isNotFoundErr; only HTTP 404 is treated as "pool does not exist", all other errors surface as a hard fail. |
| 2 | azShellTSV uses exec.Command (no ctx) | Switched to exec.CommandContext(ctx, …) and now includes ExitError.Stderr in returned errors. loadConfig threads ctx down. |
| 3 | buildSystmpAgentPool hard-codes OSDiskSizeGB=128 | Inherits OSDiskSizeGB, OSType, OSSKU, OSDiskType from the live snapshot. Refuses to act when OSDiskSizeGB is missing or zero. Added two new tests. |
| 4 | newAzureClients uses bare DefaultAzureCredential | Now passes &azidentity.DefaultAzureCredentialOptions{RequireAzureTokenCredentials: true} to match backend / sessiongate / admin-client convention. |
| 5 | YAML comment says "four guards" | Updated to "five". |
| 6 | Top doc comment for guard 3 stale | Rewritten to match the widened acceptance (Succeeded/Canceled/Failed settled, Updating/Upgrading mid-LRO; reject Creating/Deleting/unknown). |
| 7 | Guard 3 log message stale ("must be Succeeded or Canceled") | Now logs the full accept/reject set. |

Not applied (with reason):

Other tightening in the same push:

  • Guard 5 widened to accept Updating and Upgrading too (not just Failed/Canceled). The AROSLSRE-880 NRP-KVS wedge typically leaves the system pool in Updating for hours/days while the parent cluster LRO retries forever — rejecting that state would have made the binary unable to fix the exact scenario it was built for. Still rejects Succeeded (clearly healthy), Creating, Deleting, and unknown future states.
  • Test count is now 61 top-level test funcs / ~110 sub-cases. go vet, gofmt, make validate-changed-config-pipelines all clean.

…E-924)

Adds a detection-gated EV2 Shell step that recreates the AKS system
pool when the NRP key-value-store entity for its VMSS gets corrupted.
The same recipe was applied manually at INT on 2026-05-24
(AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.

## Background

A corrupted NRP KVS entry for the system pool's VMSS causes every
Microsoft.Compute/virtualMachineScaleSets/write to fail with
NetworkingInternalOperationError on a continuous retry chain. Fresh
VM instances come up but never get a Swift NIC, kubelet never
registers, the pool stops scaling, and the cluster's upgrade LRO
retries forever - the AROSLSRE-880 / INT (2026-05-16..18) incident
left the cluster stuck in Updating for days because of this.

The corruption is bound to the VMSS ARM resource ID; per-instance
delete does not help. Deleting and re-creating the pool yields a
fresh VMSS name and a clean KVS entity. NRP-side fix is tracked
in ICM 798003653; once it ships, this binary's detection guards
never fire and the step becomes a no-op.

## Placement

The step runs BEFORE the cluster ARM step (depending on the same
cert-issuer prereqs), so when guards fire the recreate happens
before the next cluster PUT, preventing the pipeline from getting
stuck. The cluster ARM step now depends on this step.

The binary tolerates a not-yet-created cluster (greenfield rollout):
it does an ARM Get; if 404, logs and exits 0 with no action. No
`aksCluster:` directive is used (which would fail on greenfield);
instead the binary bootstraps its own kubeconfig from
ListClusterUserCredentials + a bearer token issued by the MSI
scoped to the AKS AAD server app, and exports KUBECONFIG so child
kubectl invocations work too. No `kubelogin` dependency.

## Detection (ALL guards must pass; otherwise exit 0 no-op)

  1. system pool Ready k8s nodes < minCount
  2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on
     aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
  3. cluster provisioningState is recoverable: Succeeded, Canceled,
     Failed (settled) OR Updating, Upgrading (mid-LRO - the wedge
     signature itself). Rejected: Creating, Deleting, unknown.
  4. every non-system pool has count > 0
  5. system pool provisioningState == "Failed" - positive
     confirmation that this specific pool is wedged

## Action (once guards pass)

  1. Snapshot system pool ARM JSON
  2. Abort cluster LRO ONLY if active LRO is >= 30 min old (the
     NRP-KVS retry storm signature). If younger, no-op exit to
     avoid racing a healthy in-progress operation. AROSLSRE-924
     manual recipe required this step to move the cluster from
     stuck-Updating to Canceled.
  3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted,
     same VMSize/subnets as live system)
  4. Cordon + drain existing system nodes
  5. Delete the broken system pool
  6. Re-create system via SDK CreateOrUpdate from the sanitized
     snapshot (strips read-only fields and aks-managed-* tags,
     pins orchestratorVersion to the live CP version)
  7. Drain + delete systmp
  8. No-op tag PATCH to flip cluster Canceled -> Succeeded

## Safety

  - DefaultAzureCredential chain (MSI in EV2, az CLI locally).
  - sanitizeForRecreate deep-copies via JSON round-trip; never
    mutates the snapshot.
  - snapshotSystem refuses to act if VMSize or VnetSubnetID are
    missing from the live pool.
  - maybeAbortLRO returns (proceed=false, no err) when LRO is
    younger than 30 min, so the binary exits 0 (not an error)
    rather than racing a potentially-healthy operation.
  - preflightChecks refuses to act if a leftover 'systmp' exists.
  - Guard 2 fails closed if the activity-log query errors (so a
    missing Reader role on the node RG yields a no-op, not a
    runaway recreate).
  - Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
  - Overall 60-min context timeout.
  - DRY_RUN=true lets operators verify guard behaviour without
    making any writes.
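The no-mutation property of the JSON round-trip can be sketched like this. The `agentPool` struct is a hypothetical, heavily trimmed stand-in for the SDK agent pool type, and `sanitizeCopy` only illustrates the idea behind sanitizeForRecreate; it is not that function.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// agentPool is a hypothetical, trimmed-down pool model.
type agentPool struct {
	Name              string             `json:"name"`
	VMSize            string             `json:"vmSize"`
	ProvisioningState *string            `json:"provisioningState,omitempty"`
	Tags              map[string]*string `json:"tags,omitempty"`
}

// sanitizeCopy deep-copies the pool through JSON, then strips a read-only
// field and aks-managed-* tags on the copy only.
func sanitizeCopy(in *agentPool) (*agentPool, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return nil, fmt.Errorf("marshal snapshot: %w", err)
	}
	var out agentPool
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, fmt.Errorf("unmarshal snapshot: %w", err)
	}
	out.ProvisioningState = nil
	for k := range out.Tags {
		if strings.HasPrefix(k, "aks-managed-") {
			delete(out.Tags, k)
		}
	}
	return &out, nil
}

func main() {
	state, val := "Failed", "v"
	live := &agentPool{
		Name:              "system",
		VMSize:            "Standard_D4s_v3",
		ProvisioningState: &state,
		Tags:              map[string]*string{"aks-managed-example": &val, "purpose": &val},
	}
	clean, err := sanitizeCopy(live)
	if err != nil {
		panic(err)
	}
	fmt.Println("copy tags:", len(clean.Tags), "live tags:", len(live.Tags))
}
```

Because the copy owns its own map and pointers, deleting tags or clearing fields on it cannot leak back into the snapshot that later steps still read.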

## Testing

  - 100+ unit test cases covering all pure-logic functions:
    env parsing, guard primitives (evalGuard1..5 - including
    guard 3's acceptance of stuck-Updating clusters and guard 5
    for system pool Failed state), sanitizeForRecreate
    (no-mutation, tag stripping, version pin),
    buildSystmpAgentPool (defensive nil checks),
    activity-log parsing (dedup, case insensitivity, prefix
    filter), isNodeReady (nil/missing/false conditions),
    isNotFoundErr (404 vs other status codes, wrapped errors),
    extractAPIServerAndCA (happy path, empty input, malformed
    yaml, missing fields), kubeconfigWithBearerToken (no exec
    plugin or auth provider required).
  - go vet, gofmt clean.
  - validate-changed-config-pipelines passes.

## References

  - PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
  - Pipeline step pattern: Azure#4790
  - Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev
    envs on macOS-arm build natively too)
  - Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask),
    AROSLSRE-924 (INT manual mitigation), AROSLSRE-880 (parent
    incident bug)
  - ICM: 798003653
Copilot AI review requested due to automatic review settings May 26, 2026 21:20
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from d9edc55 to 6de8d8b Compare May 26, 2026 21:20
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
raelga added a commit to raelga/ARO-HCP that referenced this pull request May 26, 2026
…2 (AROSLSRE-924)

Address Copilot review (PR Azure#5397, batch 3):

* Guard 2 (countNRPFailures + nrpResourceIDs): require the activity-log
  Failed VMSS-write event to carry the NetworkingInternalOperationError
  inner error code in properties.statusMessage before counting toward
  the threshold. Other failure modes (quota, capacity, policy, image
  pull) on the same aks-system-* VMSS no longer satisfy guard 2 and
  cannot trigger a destructive pool recreation that would not address
  their actual root cause.

* Guard 5 PASS log: print the observed provisioningState (Failed,
  Canceled, Updating, Upgrading) instead of hard-coding "is Failed".
  Operators reading the log now see which accepted state matched.

Implementation:

* Add nrpKVSErrorCode const with the ARM inner error code.
* Extend activityEvent with Properties.StatusMessage (the inner ARM
  error body as an embedded JSON string).
* Add hasNRPKVSSignature(e) helper that parses statusMessage and
  returns true iff error.code == NetworkingInternalOperationError.
  Fails closed on any parse error / missing field.
* Apply the signature filter inside both countNRPFailures and
  nrpResourceIDs so the diagnostic list matches the count.
* Update the guard-2 progress log line and the top-of-file guard-2
  doc comment to call out the NRP-KVS signature.
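A signature check of this kind might look like the sketch below. It assumes statusMessage is a JSON document with an `error.code` field, as the commit message describes; `hasSignature` is a hypothetical name and the sample payloads are invented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const nrpKVSErrorCode = "NetworkingInternalOperationError"

// hasSignature parses the embedded JSON error body and reports whether its
// error.code matches the NRP-KVS code. Any parse failure or missing field
// yields false, so unrecognized events never count toward the threshold.
func hasSignature(statusMessage string) bool {
	var body struct {
		Error struct {
			Code string `json:"code"`
		} `json:"error"`
	}
	if err := json.Unmarshal([]byte(statusMessage), &body); err != nil {
		return false
	}
	return body.Error.Code == nrpKVSErrorCode
}

func main() {
	samples := []string{
		`{"error":{"code":"NetworkingInternalOperationError","message":"example"}}`,
		`{"error":{"code":"QuotaExceeded"}}`,
		`not json`,
	}
	for _, s := range samples {
		fmt.Println(hasSignature(s))
	}
}
```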

Tests:

* mkActivityEvent now emits a realistic properties.statusMessage with
  the NRP-KVS code by default so all pre-existing tests stay green;
  callers can pass a different code to simulate other failure modes.
* New tests: TestCountNRPFailures_RequiresNRPKVSSignature,
  TestCountNRPFailures_MissingPropertiesNotCounted,
  TestCountNRPFailures_MalformedStatusMessageNotCounted,
  TestNRPResourceIDs_RequiresNRPKVSSignature.
* go vet + go test all green in
  dev-infrastructure/scripts/recreate-system-pool.
@raelga raelga requested a review from Copilot May 26, 2026 21:55
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from ad64d13 to 89e3196 Compare May 26, 2026 22:05
Copilot AI review requested due to automatic review settings May 26, 2026 22:21
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 89e3196 to 8d19fc0 Compare May 26, 2026 22:21
@raelga
Collaborator Author

raelga commented May 26, 2026

Update after re-checking INT behavior: Ready system nodes < minCount is no longer a hard guard. That condition only appeared in INT after we manually triggered extra scale-up/surge attempts; the normal stuck-upgrade wedge can still have Ready nodes at minCount while the system pool/cluster LRO is stuck Updating. The binary now logs ready/registered system-node counts as diagnostics only and gates on the four stronger signals: NRP-KVS activity-log signature, recoverable cluster state, non-system pools count > 0, and system pool state in Failed/Canceled/Updating/Upgrading.

…-924)

Add SKIP_GUARDS env var that bypasses the 4 detection guards,
allowing the recreation flow to be tested on healthy clusters.
Combined with DRY_RUN=true it logs what would happen; without
DRY_RUN it performs the actual pool recreation.

Validated on personal dev env (pers-usw3rael-mgmt-1):
  SKIP_GUARDS=true DRY_RUN=true  → full plumbing test, exit 0
  SKIP_GUARDS=true               → full recreation in ~10 min,
    system VMSS 20157742 → 30104029, cluster healthy after.
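The four combinations of the two switches described in this commit message can be sketched as follows. `boolEnv` and `describeMode` are hypothetical helpers, and treating an unset or empty variable as false is an assumption rather than confirmed behaviour of the binary.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// boolEnv reads a boolean from the environment via lookup (for example
// os.LookupEnv). Unset or empty is assumed to mean false.
func boolEnv(lookup func(string) (string, bool), key string) (bool, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return false, nil
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("%s=%q is not a boolean: %w", key, v, err)
	}
	return b, nil
}

// describeMode names the behaviour for each SKIP_GUARDS/DRY_RUN combination.
func describeMode(skipGuards, dryRun bool) string {
	switch {
	case skipGuards && dryRun:
		return "guards bypassed, log only"
	case skipGuards:
		return "guards bypassed, real recreation"
	case dryRun:
		return "guards evaluated, no writes"
	default:
		return "guards evaluated, writes only if all fire"
	}
}

func main() {
	skip, err := boolEnv(os.LookupEnv, "SKIP_GUARDS")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	dry, err := boolEnv(os.LookupEnv, "DRY_RUN")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(describeMode(skip, dry))
}
```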
@raelga raelga force-pushed the raelga/aroslsre-924-pipeline-recreate-system-pool branch from 6b2ce0c to d7a0ff1 Compare May 27, 2026 12:39
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 16 changed files in this pull request and generated 1 comment.

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go
@raelga
Collaborator Author

raelga commented May 27, 2026

/retest

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 16 changed files in this pull request and generated no new comments.

@raelga
Collaborator Author

raelga commented May 27, 2026

Testing run 2: full recreation with latest commits (fdfe771)

Re-tested after the 3 new fixes (narrow activity guard, ignore cordoned nodes, retry activity log auth). Binary built from fdfe77181.

Timeline (~9 min total)

| Time (UTC) | Step | Action |
| --- | --- | --- |
| 13:49:01 | STARTUP | SKIP_GUARDS=true, DRY_RUN=false |
| 13:49:03 | CLUSTER CHECK | Found cluster k8s 1.35.1, Succeeded |
| 13:49:08 | GUARDS | Guard 4 rejected (healthy), SKIP_GUARDS overrode |
| 13:49:09 | BOOTSTRAP | Kube client via MSI token |
| 13:49:17 | STEP 3 | Created systmp (Standard_D4s_v3, inherited taints) |
| 13:52:01 | STEP 4 | Cordoned + drained aks-system-30104029-vmss000000 |
| 13:52:14 | STEP 5 | Deleted system pool |
| 13:53:17 | STEP 6 | Recreated system pool (new VMSS 18152043) |
| 13:57:02 | STEP 7 | Drained + deleted systmp |
| 13:58:20 | STEP 8 | Tag reconcile |
| 13:58:24 | DONE | Post-flight: 0 NRP failures in last 10m |

Azure Activity Log

Op                           Time                          Status
---------------------------  ----------------------------  ---------
Create or Update Agent Pool  2026-05-27T13:51:44.830Z      Succeeded   ← systmp
Delete Agent Pool            2026-05-27T13:53:17.108Z      Succeeded   ← system deleted
Create or Update Agent Pool  2026-05-27T13:56:39.273Z      Succeeded   ← system recreated

(systmp delete completed at 13:58:20 per binary logs)

Before / After

| | Before | After |
| --- | --- | --- |
| System node | aks-system-30104029-vmss000000 (86m) | aks-system-18152043-vmss000000 (3m) |
| VMSS | 30104029 | 18152043 (fresh KVS entity) |
| systmp | | cleaned up |
| All 6 pools | Succeeded | Succeeded |
| All 6 nodes | Ready | Ready |

Reconcile tag

"aroslsre-924-recreate": "2026-05-27T13:58:20.13764Z"

Improvements observed vs run 1

  • Drain completed without PDB retry loops (pods evicted cleanly on first attempt)
  • systmp drain had no repeated eviction retries
  • Overall ~1 min faster (9 min vs 10 min)

Run 1 vs Run 2 VMSS progression

run 0 (initial cluster):  aks-system-20157742-vmss000000
run 1 (first recreation): aks-system-30104029-vmss000000
run 2 (second recreation): aks-system-18152043-vmss000000

Each recreation yields a fresh VMSS name, confirming the KVS entity rotation works correctly.

@janboll
Collaborator

janboll commented May 27, 2026

How will you put this into ev2 artifact?

@raelga
Collaborator Author

raelga commented May 27, 2026

@janboll the binary is built into the EV2 artifact by the buildStep in dev-infrastructure/mgmt-pipeline.yaml, similar to the existing cleanup-pko-resources helper.

That step runs during artifact/image build:

CGO_ENABLED=0 go build -o "${tmp2}" ./scripts/recreate-system-pool
chmod 0755 "${tmp2}"
mv "${tmp2}" ./scripts/recreate-system-pool/recreate-system-pool

Then the EV2 Shell step executes it from the rollout artifact root:

command: ./scripts/recreate-system-pool/recreate-system-pool
workingDir: .

So we do not commit the compiled binary; the mgmt pipeline artifact build compiles it and places it under scripts/recreate-system-pool/ before the Shell step runs.

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go Outdated
@geoberle
Collaborator

Accepted under the premise that we will rework the idea of this tool into something sustainably maintainable ASAP once it has done its duty.

@geoberle
Collaborator

/lgtm

@openshift-ci

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: geoberle, raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@raelga
Collaborator Author

raelga commented May 27, 2026

Testing run 3: with diagnostic shellouts removed (77d8ce4)

Re-tested after refactor(recreate-system-pool): remove diagnostic shellouts. Binary now uses native SDK + client-go for all pre-flight/post-flight diagnostics — no az or kubectl subprocess calls.

Timeline (~10 min total)

| Time (UTC) | Step | Action |
| --- | --- | --- |
| 15:13:05 | STARTUP | SKIP_GUARDS=true, binary 77d8ce4ea |
| 15:13:07 | CLUSTER CHECK | Found cluster k8s 1.35.1, Succeeded |
| 15:13:11 | PRE-FLIGHT | Native SDK nodepool list + cluster show; client-go node list (kube not bootstrapped yet → WARN, expected) |
| 15:13:14 | GUARDS | Guard 4 rejected (healthy), SKIP_GUARDS overrode |
| 15:13:15 | BOOTSTRAP | Kube client via MSI token |
| 15:13:19 | PRE-ACTION | Full native diagnostics: 6 nodepools, 6 k8s nodes (all Ready, schedulable) |
| 15:13:24 | STEP 3 | Created systmp (3m14s) |
| 15:16:40 | STEP 4 | Cordoned + drained aks-system-18152043-vmss000000; clean, no PDB retries |
| 15:16:55 | STEP 5 | Deleted system pool (1m4s) |
| 15:17:59 | STEP 6 | Recreated system pool (3m13s), waited for 1 Ready node (30s poll) |
| 15:21:44 | STEP 7 | Drained + deleted systmp (1m46s) |
| 15:23:31 | STEP 8 | Tag reconcile |
| 15:23:35 | DONE | Post-flight: 6 pools Succeeded, 6 nodes Ready, 0 NRP failures |

Azure Activity Log

Op                           Time                          Status
---------------------------  ----------------------------  ---------
Create or Update Agent Pool  2026-05-27T15:16:25.557Z      Succeeded   ← systmp
Delete Agent Pool            2026-05-27T15:17:58.412Z      Succeeded   ← system deleted
Create or Update Agent Pool  2026-05-27T15:21:00.119Z      Succeeded   ← system recreated

Before / After

| | Before | After |
|---|---|---|
| System node | aks-system-18152043-vmss000000 (77m) | aks-system-23419720-vmss000000 (3m) |
| VMSS | 18152043 | 23419720 (fresh KVS entity) |
| systmp | | cleaned up |
| All 6 pools | Succeeded | Succeeded |
| All 6 nodes | Ready, schedulable | Ready, schedulable |

Reconcile tag

"aroslsre-924-recreate": "2026-05-27T15:23:30.722623Z"

Improvements in this run

  • Pre-flight and post-flight diagnostics now use native SDK + client-go (no az/kubectl subprocesses)
  • Structured node diagnostics: ready=true schedulableReady=true unschedulable=false deleting=false
  • Post-flight includes full final state dump (nodepools, cluster, k8s nodes) — all verified healthy by the binary itself
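
The structured node diagnostics line above reports four booleans per node. Here is a rough sketch of how they could be derived. The `node` struct is a hypothetical, trimmed-down stand-in for a Kubernetes Node object. The definition of `schedulableReady` (Ready, not cordoned, not being deleted) is an assumption, not confirmed by the PR.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the Node fields the diagnostics report on.
type node struct {
	Name          string
	ReadyStatus   string // status of the "Ready" condition: "True", "False" or "Unknown"
	Unschedulable bool   // mirrors spec.unschedulable, which cordon sets
	Deleting      bool   // true when metadata.deletionTimestamp is set
}

// nodeDiag holds the four flags printed in the diagnostics line.
type nodeDiag struct {
	Ready, SchedulableReady, Unschedulable, Deleting bool
}

// diagnose derives the flags for a single node.
func diagnose(n node) nodeDiag {
	ready := n.ReadyStatus == "True"
	return nodeDiag{
		Ready: ready,
		// Assumption: schedulableReady means Ready, not cordoned, not being deleted.
		SchedulableReady: ready && !n.Unschedulable && !n.Deleting,
		Unschedulable:    n.Unschedulable,
		Deleting:         n.Deleting,
	}
}

// String renders the flags in the same key=value layout as the bullet above.
func (d nodeDiag) String() string {
	return fmt.Sprintf("ready=%t schedulableReady=%t unschedulable=%t deleting=%t",
		d.Ready, d.SchedulableReady, d.Unschedulable, d.Deleting)
}

func main() {
	healthy := node{Name: "aks-system-23419720-vmss000000", ReadyStatus: "True"}
	cordoned := node{Name: "aks-system-18152043-vmss000000", ReadyStatus: "True", Unschedulable: true}
	fmt.Println(healthy.Name, diagnose(healthy))
	fmt.Println(cordoned.Name, diagnose(cordoned))
}
```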

VMSS progression across all runs

run 0 (initial):   20157742
run 1 (recreate):  30104029
run 2 (recreate):  18152043
run 3 (recreate):  23419720
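
The progression above can also be checked mechanically. The sketch below assumes that the numeric segment before `-vmss` in the node name identifies the backing VMSS, as the Before / After comparison suggests. `vmssID` and `allDistinct` are hypothetical helpers, not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// nodeNameRE matches node names of the form seen in this thread,
// e.g. "aks-system-18152043-vmss000000". Assumption: the digits before
// "-vmss" identify the VMSS that backs the pool.
var nodeNameRE = regexp.MustCompile(`^aks-([a-z0-9]+)-(\d+)-vmss[0-9a-z]+$`)

// vmssID extracts the pool name and VMSS identifier from a node name.
func vmssID(nodeName string) (pool, id string, err error) {
	m := nodeNameRE.FindStringSubmatch(nodeName)
	if m == nil {
		return "", "", fmt.Errorf("unexpected node name %q", nodeName)
	}
	return m[1], m[2], nil
}

// allDistinct reports whether every ID appears exactly once, which is what
// we expect if each recreate produced a fresh VMSS.
func allDistinct(ids []string) bool {
	seen := make(map[string]bool, len(ids))
	for _, id := range ids {
		if seen[id] {
			return false
		}
		seen[id] = true
	}
	return true
}

func main() {
	_, before, _ := vmssID("aks-system-18152043-vmss000000")
	_, after, _ := vmssID("aks-system-23419720-vmss000000")
	fmt.Println("VMSS changed:", before != after)
	fmt.Println("all runs distinct:", allDistinct([]string{"20157742", "30104029", "18152043", "23419720"}))
}
```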

Member

@bennerv bennerv left a comment

Couple comments.

I have more feedback, but these comments are the ones that are blocking or need attention. Long-term, this implementation needs some refactoring if we decide to keep it, but since it resolves an ongoing incident I'm okay with moving forward to mitigate the issue first.

Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go
Comment thread dev-infrastructure/scripts/recreate-system-pool/main.go
@raelga
Collaborator Author

raelga commented May 27, 2026

/override ci/prow/e2e-parallel

E2E provision step passed — recreate-system-pool-if-broken ran successfully in 1s (greenfield no-op, cluster not yet created at that pipeline stage). E2E tests are unaffected by this PR since the binary only acts when NRP-KVS corruption is detected, and the new pipeline step runs before the cluster ARM deployment.

@openshift-ci

openshift-ci Bot commented May 27, 2026

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit 0cbdd05 into Azure:main May 27, 2026
16 checks passed