fix: build cleanup-pko-resources based on machine architecture by tony-schndr · Pull Request #5394 · Azure/ARO-HCP

tony-schndr · 2026-05-26T17:01:12Z

What

Builds cleanup-pko-resources based on the current machine architecture.

Why

So that the cleanup-pko-resources binary is compatible with macOS on ARM.

Deployment error:

  "err": "errors occurred during execution: [error running Shell Step, failed to execute shell command: /bin/bash: line 1: ./scripts/cleanup-pko-resources/cleanup-pko-resources: cannot execute binary file
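One way to see why the shell reports "cannot execute binary file" is to compare the executable's container format with what the host expects. The following is a minimal sketch, not part of this PR: `binaryFormat` is a hypothetical helper that inspects the first four bytes of a file, assuming a Linux build produces an ELF file while macOS on ARM expects a 64-bit Mach-O file.

```go
package main

import (
	"bytes"
	"fmt"
)

// binaryFormat guesses an executable's container format from its first
// four bytes. It is a hypothetical helper for illustration only.
func binaryFormat(header []byte) string {
	if len(header) < 4 {
		return "unknown"
	}
	magic := header[:4]
	switch {
	case bytes.Equal(magic, []byte{0x7f, 'E', 'L', 'F'}):
		return "ELF" // format used by Linux executables
	case bytes.Equal(magic, []byte{0xcf, 0xfa, 0xed, 0xfe}):
		return "Mach-O" // 64-bit little-endian Mach-O, used by macOS
	default:
		return "unknown"
	}
}

func main() {
	// Example header bytes; a real check would read them from the built file.
	linuxBuild := []byte{0x7f, 'E', 'L', 'F', 2, 1, 1, 0}
	fmt.Println("format of a Linux build:", binaryFormat(linuxBuild))
}
```

A file that reports ELF here cannot be executed directly by macOS, which matches the failure mode in the deployment error.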

Testing

I was able to create a personal dev environment after this change.

Special notes for your reviewer

PR Checklist

Copilot

Pull request overview

This PR updates the dev-infrastructure management pipeline to build the cleanup-pko-resources helper binary for the current machine architecture/OS, so it can run on macOS arm64 (instead of always producing a linux/amd64 binary).

Changes:

Build scripts/cleanup-pko-resources using host OS/architecture detection in the pipeline build step.
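As a rough illustration of the difference between a pinned target and the host platform, the sketch below compares the two. The `platform` type and the `hostPlatform` and `runsNatively` helpers are hypothetical names, not taken from the pipeline; the only real API used is the Go standard library's `runtime.GOOS` and `runtime.GOARCH`.

```go
package main

import (
	"fmt"
	"runtime"
)

// platform is a hypothetical OS/architecture pair, named the way the Go
// toolchain names build targets (for example linux/amd64).
type platform struct{ OS, Arch string }

func (p platform) String() string { return p.OS + "/" + p.Arch }

// hostPlatform returns the OS and architecture this program was compiled
// for; for a natively built binary that is the machine it runs on.
func hostPlatform() platform {
	return platform{OS: runtime.GOOS, Arch: runtime.GOARCH}
}

// runsNatively reports whether a binary built for target executes directly
// on host. It deliberately ignores emulation layers.
func runsNatively(target, host platform) bool { return target == host }

func main() {
	host := hostPlatform()
	pinned := platform{OS: "linux", Arch: "amd64"}
	fmt.Printf("host=%s pinned=%s runs natively: %t\n",
		host, pinned, runsNatively(pinned, host))
}
```

On a macOS arm64 machine the pinned linux/amd64 target would not run natively, which is the situation a host-based build avoids.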

gmfrasca

/lgtm

raelga

/lgtm
/approve

openshift-ci · 2026-05-26T20:27:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gmfrasca, raelga, tony-schndr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~dev-infrastructure/OWNERS~~ [raelga]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…E-924)

Adds a detection-gated EV2 Shell step that recreates the AKS system pool when the NRP key-value-store entity for its VMSS gets corrupted. The same recipe was applied manually at INT on 2026-05-24 (AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.

## Background

A corrupted NRP KVS entry for the system pool's VMSS causes every Microsoft.Compute/virtualMachineScaleSets/write to fail with NetworkingInternalOperationError on a continuous retry chain. Fresh VM instances come up but never get a Swift NIC, kubelet never registers, the pool stops scaling, and the next mgmt-pipeline run gets stuck on the cluster ARM step because the underlying VMSS update never converges. The system pool's ARM resource ends up in Failed once the parent LRO finally times out.

The corruption is bound to the VMSS ARM resource ID; per-instance delete does not help. Deleting and re-creating the pool yields a fresh VMSS name and a clean KVS entity. NRP-side fix is tracked in ICM 798003653; once it ships, this binary's detection guards never fire and the step becomes a no-op.

## Placement

The step runs BEFORE the cluster ARM step (depending on the same cert-issuer prereqs), so when guards fire the recreate happens before the next cluster PUT, preventing the pipeline from getting stuck. The cluster ARM step now depends on this step.

The binary tolerates a not-yet-created cluster (greenfield rollout): it does an ARM Get; if 404, logs and exits 0 with no action. No `aksCluster:` directive is used (which would fail on greenfield); instead the binary bootstraps its own kubeconfig from ListClusterUserCredentials + a bearer token issued by the MSI scoped to the AKS AAD server app, and exports KUBECONFIG so child kubectl invocations work too. No `kubelogin` dependency.

## Detection (ALL guards must pass; otherwise exit 0 no-op)

1. system pool Ready k8s nodes < minCount
2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
3. cluster provisioningState NOT in {Updating, Creating, Deleting, Upgrading} - i.e. no active LRO we would be racing. Failed, Succeeded and Canceled all qualify as "settled".
4. every non-system pool has count > 0
5. system pool provisioningState == "Failed" - positive confirmation that this specific pool is wedged (and not, say, a transient cluster-wide blip caught between the other guards)

## Action (once guards pass)

1. Snapshot system pool ARM JSON
2. Abort cluster LRO ONLY if >30 min old (refuses to fight a healthy in-progress operation)
3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted, same VMSize/subnets as live system)
4. Cordon + drain existing system nodes
5. Delete the broken system pool
6. Re-create system via SDK CreateOrUpdate from the sanitized snapshot (strips read-only fields and aks-managed-* tags, pins orchestratorVersion to the live CP version)
7. Drain + delete systmp
8. No-op tag PATCH to flip cluster Canceled -> Succeeded

## Safety

- DefaultAzureCredential chain (MSI in EV2, az CLI locally).
- sanitizeForRecreate deep-copies via JSON round-trip; never mutates the snapshot.
- snapshotSystem refuses to act if VMSize or VnetSubnetID are missing from the live pool.
- maybeAbortLRO refuses to abort an LRO younger than 30 min.
- preflightChecks refuses to act if a leftover 'systmp' exists.
- Guard 2 fails closed if the activity-log query errors (so a missing Reader role on the node RG yields a no-op, not a runaway recreate).
- Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
- Overall 60-min context timeout.
- DRY_RUN=true lets operators verify guard behaviour without making any writes.
- Kubeconfig written to $TMPDIR with random per-pid name and removed on exit; bearer token has the MSI's normal AAD TTL.

## Testing

- 100+ unit test cases covering all pure-logic functions: env parsing, guard primitives (evalGuard1..5 including the revised guard 3 acceptance of Failed/Succeeded/Canceled and the new guard 5 for system pool Failed state), sanitizeForRecreate (no-mutation, tag stripping, version pin), buildSystmpAgentPool (defensive nil checks), activity-log parsing (dedup, case insensitivity, prefix filter), isNodeReady (nil/missing/false conditions), isNotFoundErr (404 vs other status codes, wrapped errors), extractAPIServerAndCA (happy path, empty input, malformed yaml, missing fields), kubeconfigWithBearerToken (no exec plugin or auth provider required).
- go vet, gofmt clean.
- validate-changed-config-pipelines passes.

## References

- PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
- Pipeline step pattern: Azure#4790
- Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev envs on macOS-arm build natively too)
- Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask), AROSLSRE-924 (INT manual mitigation)
- ICM: 798003653
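The Safety notes above say that sanitizeForRecreate deep-copies via a JSON round-trip and never mutates the snapshot. The sketch below shows that general technique only. `deepCopy` and `sanitizeCopy` are hypothetical names, the pool is modelled as a generic map rather than the SDK type the real binary presumably uses, and the field and tag names in `main` are made-up sample data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// deepCopy clones a JSON-like map by marshalling and unmarshalling it, so
// nested maps in the result share no memory with the input.
func deepCopy(in map[string]any) (map[string]any, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return nil, err
	}
	var out map[string]any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out, nil
}

// sanitizeCopy (hypothetical) returns a copy of pool with the listed
// read-only fields removed and tags prefixed "aks-managed-" dropped.
// The input map is left untouched.
func sanitizeCopy(pool map[string]any, readOnly []string) (map[string]any, error) {
	out, err := deepCopy(pool)
	if err != nil {
		return nil, err
	}
	for _, field := range readOnly {
		delete(out, field)
	}
	if tags, ok := out["tags"].(map[string]any); ok {
		for key := range tags {
			if strings.HasPrefix(key, "aks-managed-") {
				delete(tags, key)
			}
		}
	}
	return out, nil
}

func main() {
	// Sample data for illustration; not taken from a real agent pool.
	snapshot := map[string]any{
		"provisioningState": "Failed",
		"tags":              map[string]any{"aks-managed-example": "x", "owner": "sre"},
	}
	clean, err := sanitizeCopy(snapshot, []string{"provisioningState"})
	if err != nil {
		panic(err)
	}
	fmt.Println("fields in copy:", len(clean), "fields in snapshot:", len(snapshot))
}
```

Because the copy is built from serialized bytes, deleting keys from it cannot affect the original snapshot, which is the property the no-mutation tests would check.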

…E-924)

Adds a detection-gated EV2 Shell step that recreates the AKS system pool when the NRP key-value-store entity for its VMSS gets corrupted. The same recipe was applied manually at INT on 2026-05-24 (AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.

## Background

A corrupted NRP KVS entry for the system pool's VMSS causes every Microsoft.Compute/virtualMachineScaleSets/write to fail with NetworkingInternalOperationError on a continuous retry chain. Fresh VM instances come up but never get a Swift NIC, kubelet never registers, the pool stops scaling, and the cluster's upgrade LRO retries forever - the AROSLSRE-880 / INT (2026-05-16..18) incident left the cluster stuck in Updating for days because of this.

The corruption is bound to the VMSS ARM resource ID; per-instance delete does not help. Deleting and re-creating the pool yields a fresh VMSS name and a clean KVS entity. NRP-side fix is tracked in ICM 798003653; once it ships, this binary's detection guards never fire and the step becomes a no-op.

## Placement

The step runs BEFORE the cluster ARM step (depending on the same cert-issuer prereqs), so when guards fire the recreate happens before the next cluster PUT, preventing the pipeline from getting stuck. The cluster ARM step now depends on this step.

The binary tolerates a not-yet-created cluster (greenfield rollout): it does an ARM Get; if 404, logs and exits 0 with no action. No `aksCluster:` directive is used (which would fail on greenfield); instead the binary bootstraps its own kubeconfig from ListClusterUserCredentials + a bearer token issued by the MSI scoped to the AKS AAD server app, and exports KUBECONFIG so child kubectl invocations work too. No `kubelogin` dependency.

## Detection (ALL guards must pass; otherwise exit 0 no-op)

1. system pool Ready k8s nodes < minCount
2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
3. cluster provisioningState is recoverable: Succeeded, Canceled, Failed (settled) OR Updating, Upgrading (mid-LRO - the wedge signature itself). Rejected: Creating, Deleting, unknown.
4. every non-system pool has count > 0
5. system pool provisioningState == "Failed" - positive confirmation that this specific pool is wedged

## Action (once guards pass)

1. Snapshot system pool ARM JSON
2. Abort cluster LRO ONLY if active LRO is >= 30 min old (the NRP-KVS retry storm signature). If younger, no-op exit to avoid racing a healthy in-progress operation. AROSLSRE-924 manual recipe required this step to move the cluster from stuck-Updating to Canceled.
3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted, same VMSize/subnets as live system)
4. Cordon + drain existing system nodes
5. Delete the broken system pool
6. Re-create system via SDK CreateOrUpdate from the sanitized snapshot (strips read-only fields and aks-managed-* tags, pins orchestratorVersion to the live CP version)
7. Drain + delete systmp
8. No-op tag PATCH to flip cluster Canceled -> Succeeded

## Safety

- DefaultAzureCredential chain (MSI in EV2, az CLI locally).
- sanitizeForRecreate deep-copies via JSON round-trip; never mutates the snapshot.
- snapshotSystem refuses to act if VMSize or VnetSubnetID are missing from the live pool.
- maybeAbortLRO returns (proceed=false, no err) when LRO is younger than 30 min, so the binary exits 0 (not an error) rather than racing a potentially-healthy operation.
- preflightChecks refuses to act if a leftover 'systmp' exists.
- Guard 2 fails closed if the activity-log query errors (so a missing Reader role on the node RG yields a no-op, not a runaway recreate).
- Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
- Overall 60-min context timeout.
- DRY_RUN=true lets operators verify guard behaviour without making any writes.

## Testing

- 100+ unit test cases covering all pure-logic functions: env parsing, guard primitives (evalGuard1..5 - including guard 3's acceptance of stuck-Updating clusters and guard 5 for system pool Failed state), sanitizeForRecreate (no-mutation, tag stripping, version pin), buildSystmpAgentPool (defensive nil checks), activity-log parsing (dedup, case insensitivity, prefix filter), isNodeReady (nil/missing/false conditions), isNotFoundErr (404 vs other status codes, wrapped errors), extractAPIServerAndCA (happy path, empty input, malformed yaml, missing fields), kubeconfigWithBearerToken (no exec plugin or auth provider required).
- go vet, gofmt clean.
- validate-changed-config-pipelines passes.

## References

- PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
- Pipeline step pattern: Azure#4790
- Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev envs on macOS-arm build natively too)
- Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask), AROSLSRE-924 (INT manual mitigation), AROSLSRE-880 (parent incident bug)
- ICM: 798003653

openshift-merge-bot · 2026-05-26T20:58:50Z

/retest-required

Remaining retests: 0 against base HEAD 184cebd and 2 for PR HEAD 0f84895 in total

raelga · 2026-05-27T07:09:01Z

/override ci/prow/e2e-parallel

This PR only affects the PKO cleanup binary build (machine architecture fix) — it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872

openshift-ci · 2026-05-27T07:09:07Z

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

Details

In response to this:

/override ci/prow/e2e-parallel

This PR only affects the PKO cleanup binary build (machine architecture fix) — it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

raelga · 2026-05-27T07:19:23Z

/override ci/prow/e2e-parallel

This PR only affects the PKO cleanup binary build (machine architecture fix) — it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872

openshift-ci · 2026-05-27T07:19:28Z

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

Details

In response to this:

/override ci/prow/e2e-parallel

This PR only affects the PKO cleanup binary build (machine architecture fix) — it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot AI review requested due to automatic review settings May 26, 2026 17:01

openshift-ci Bot requested review from janboll and weherdh May 26, 2026 17:01

Copilot started reviewing on behalf of tony-schndr May 26, 2026 17:01

Copilot AI reviewed May 26, 2026

Comment thread dev-infrastructure/mgmt-pipeline.yaml Outdated

build cleanup-pko-resources based on machine architecture

0f84895

tony-schndr force-pushed the fix-mgmt-build-step branch from a02a8f6 to 0f84895 Compare May 26, 2026 17:15

gmfrasca reviewed May 26, 2026

openshift-ci Bot assigned gmfrasca May 26, 2026

openshift-ci Bot added the lgtm label May 26, 2026

raelga approved these changes May 26, 2026

openshift-ci Bot assigned raelga May 26, 2026

openshift-ci Bot added the approved label May 26, 2026

raelga mentioned this pull request May 26, 2026

feat(mgmt-pipeline): self-heal NRP-KVS-corrupted system pool (AROSLSRE-951) #5397

Merged

12 tasks

tony-schndr changed the title ~~build cleanup-pko-resources based on machine architecture~~ fix: build cleanup-pko-resources based on machine architecture May 27, 2026

openshift-merge-bot Bot merged commit 4f3332c into Azure:main May 27, 2026
16 checks passed

openshift-merge-bot[bot] merged 1 commit into Azure:main from tony-schndr:fix-mgmt-build-step


4 participants
