fix: build cleanup-pko-resources based on machine architecture #5394

Merged: openshift-merge-bot[bot] merged 1 commit into Azure:main from tony-schndr:fix-mgmt-build-step on May 27, 2026
Conversation

tony-schndr (Collaborator) commented May 26, 2026

What

Builds cleanup-pko-resources based on the current machine architecture.

Why

So that the cleanup-pko-resources binary can run on macOS arm64.

Deployment error:

  "err": "errors occurred during execution: [error running Shell Step, failed to execute shell command: /bin/bash: line 1: ./scripts/cleanup-pko-resources/cleanup-pko-resources: cannot execute binary file
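
A "cannot execute binary file" error like the one above is typical when a binary built for one platform is run on another. As a rough sketch of the idea, assuming the build shells out to `go build`: when neither GOOS nor GOARCH is set, the Go toolchain targets the host platform. The helper name `hostBuildEnv` is hypothetical and is not taken from this PR, and the sketch assumes the variables are not also pinned with `go env -w`.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

// hostBuildEnv returns a copy of env without GOOS and GOARCH entries.
// With neither variable set, `go build` targets the host platform.
// Hypothetical helper for illustration only.
func hostBuildEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "GOOS=") || strings.HasPrefix(kv, "GOARCH=") {
			continue // drop any pinned cross-compilation target
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := hostBuildEnv(os.Environ())
	// runtime.GOOS/GOARCH describe the platform this program was built for,
	// which is the host when the program itself was built natively.
	fmt.Printf("platform: %s/%s, %d env vars kept\n", runtime.GOOS, runtime.GOARCH, len(env))
}
```

The filtered slice could be assigned to the `Env` field of an `exec.Cmd` that runs `go build`, so the child process inherits everything except the pinned target.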

Testing

I was able to create a personal dev environment after this change.

Special notes for your reviewer

PR Checklist

  • PR is scoped to a single task (no mixed concerns)
  • Title follows Conventional Commits format
  • Summary explains the "Why" behind the change
  • Linked to relevant ticket/issue
  • Screenshots included (if graph/UI/metrics changes)
  • Self-reviewed the diff
  • CI/CD checks are passing (ignore Tide)
  • Draft PR used for WIP (if applicable)
  • Commit history is clean (rebased/squashed)
  • Tricky code blocks are commented
  • Specific reviewers tagged
  • All comment threads resolved before merge

Copilot AI review requested due to automatic review settings May 26, 2026 17:01
@openshift-ci openshift-ci Bot requested review from janboll and weherdh May 26, 2026 17:01
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the dev-infrastructure management pipeline to build the cleanup-pko-resources helper binary for the current machine architecture/OS, so it can run on macOS arm64 (instead of always producing a linux/amd64 binary).

Changes:

  • Build scripts/cleanup-pko-resources using host OS/architecture detection in the pipeline build step.

Comment thread dev-infrastructure/mgmt-pipeline.yaml Outdated
tony-schndr force-pushed the fix-mgmt-build-step branch from a02a8f6 to 0f84895 on May 26, 2026 17:15
gmfrasca (Collaborator) left a comment

/lgtm

raelga (Collaborator) left a comment

/lgtm
/approve

openshift-ci Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gmfrasca, raelga, tony-schndr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

raelga added a commit to raelga/ARO-HCP that referenced this pull request May 26, 2026
…E-924)

Adds a detection-gated EV2 Shell step that recreates the AKS system
pool when the NRP key-value-store entity for its VMSS gets corrupted.
The same recipe was applied manually at INT on 2026-05-24
(AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.

## Background

A corrupted NRP KVS entry for the system pool's VMSS causes every
Microsoft.Compute/virtualMachineScaleSets/write to fail with
NetworkingInternalOperationError on a continuous retry chain. Fresh
VM instances come up but never get a Swift NIC, kubelet never
registers, the pool stops scaling, and the next mgmt-pipeline run
gets stuck on the cluster ARM step because the underlying VMSS
update never converges. The system pool's ARM resource ends up in
Failed once the parent LRO finally times out.

The corruption is bound to the VMSS ARM resource ID; per-instance
delete does not help. Deleting and re-creating the pool yields a
fresh VMSS name and a clean KVS entity. NRP-side fix is tracked
in ICM 798003653; once it ships, this binary's detection guards
never fire and the step becomes a no-op.

## Placement

The step runs BEFORE the cluster ARM step (depending on the same
cert-issuer prereqs), so when guards fire the recreate happens
before the next cluster PUT, preventing the pipeline from getting
stuck. The cluster ARM step now depends on this step.

The binary tolerates a not-yet-created cluster (greenfield rollout):
it does an ARM Get; if 404, logs and exits 0 with no action. No
`aksCluster:` directive is used (which would fail on greenfield);
instead the binary bootstraps its own kubeconfig from
ListClusterUserCredentials + a bearer token issued by the MSI
scoped to the AKS AAD server app, and exports KUBECONFIG so child
kubectl invocations work too. No `kubelogin` dependency.
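
The greenfield handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the binary's code: `responseError` is a local stand-in for an SDK error that carries an HTTP status, `wrapStatus` and `planAfterGet` are hypothetical names, and the non-zero exit for other errors is assumed rather than stated in this message.

```go
package main

import (
	"errors"
	"fmt"
)

// responseError is a local stand-in for an SDK error carrying an HTTP status.
type responseError struct{ StatusCode int }

func (e *responseError) Error() string { return fmt.Sprintf("HTTP %d", e.StatusCode) }

// isNotFound reports whether err, or any error it wraps, is a 404 response.
func isNotFound(err error) bool {
	var re *responseError
	return errors.As(err, &re) && re.StatusCode == 404
}

// wrapStatus builds a wrapped error with the given status, for demonstration.
func wrapStatus(code int) error {
	return fmt.Errorf("get cluster: %w", &responseError{StatusCode: code})
}

// planAfterGet maps the result of the cluster Get to an exit code and a flag
// saying whether the remaining steps should run.
func planAfterGet(err error) (exitCode int, proceed bool) {
	switch {
	case err == nil:
		return 0, true // cluster exists: continue to the guards
	case isNotFound(err):
		return 0, false // greenfield: nothing to repair, exit cleanly
	default:
		return 1, false // assumed: any other error is a real failure
	}
}

func main() {
	for _, err := range []error{nil, wrapStatus(404), wrapStatus(500)} {
		code, proceed := planAfterGet(err)
		fmt.Printf("err=%v exit=%d proceed=%t\n", err, code, proceed)
	}
}
```

Using `errors.As` rather than a direct type assertion keeps the check working when the 404 is wrapped by intermediate layers.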

## Detection (ALL guards must pass; otherwise exit 0 no-op)

  1. system pool Ready k8s nodes < minCount
  2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on
     aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
  3. cluster provisioningState NOT in {Updating, Creating, Deleting,
     Upgrading} - i.e. no active LRO we would be racing. Failed,
     Succeeded and Canceled all qualify as "settled".
  4. every non-system pool has count > 0
  5. system pool provisioningState == "Failed" - positive
     confirmation that this specific pool is wedged (and not, say,
     a transient cluster-wide blip caught between the other guards)
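
Guard 2 above amounts to a windowed count that fails closed. The sketch below is one hedged way to express it. The `failedWrite` record, the `repeatEvents` helper, `errQuery`, and `demoNow` are hypothetical and exist only to make the example deterministic, and the event deduplication mentioned under Testing is not modeled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// failedWrite is a hypothetical, simplified record of one Failed VMSS-write event.
type failedWrite struct {
	VMSS string
	At   time.Time
}

// errQuery stands in for an activity-log query failure.
var errQuery = errors.New("activity-log query failed")

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2026, 5, 26, 12, 0, 0, 0, time.UTC)

// repeatEvents returns n events on vmss, each ageMin minutes before demoNow.
func repeatEvents(vmss string, n, ageMin int) []failedWrite {
	out := make([]failedWrite, n)
	for i := range out {
		out[i] = failedWrite{VMSS: vmss, At: demoNow.Add(-time.Duration(ageMin) * time.Minute)}
	}
	return out
}

// guard2 passes only when the query succeeded and at least threshold failed
// writes hit an aks-system-* scale set within the last windowMin minutes.
func guard2(events []failedWrite, queryErr error, now time.Time, windowMin, threshold int) bool {
	if queryErr != nil {
		return false // fail closed: no evidence means no action
	}
	cutoff := now.Add(-time.Duration(windowMin) * time.Minute)
	n := 0
	for _, e := range events {
		inWindow := !e.At.Before(cutoff) && !e.At.After(now)
		if inWindow && strings.HasPrefix(strings.ToLower(e.VMSS), "aks-system-") {
			n++
		}
	}
	return n >= threshold
}

func main() {
	recent := repeatEvents("aks-system-12345678-vmss", 10, 5)
	fmt.Println(guard2(recent, nil, demoNow, 15, 10))      // ten recent failures, defaults
	fmt.Println(guard2(recent, errQuery, demoNow, 15, 10)) // query error: fail closed
}
```

The 15 and 10 passed in `main` mirror the documented defaults for NRP_FAIL_WINDOW_MIN and NRP_FAIL_THRESHOLD.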

## Action (once guards pass)

  1. Snapshot system pool ARM JSON
  2. Abort cluster LRO ONLY if >30 min old (refuses to fight a
     healthy in-progress operation)
  3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted,
     same VMSize/subnets as live system)
  4. Cordon + drain existing system nodes
  5. Delete the broken system pool
  6. Re-create system via SDK CreateOrUpdate from the sanitized
     snapshot (strips read-only fields and aks-managed-* tags,
     pins orchestratorVersion to the live CP version)
  7. Drain + delete systmp
  8. No-op tag PATCH to flip cluster Canceled -> Succeeded

## Safety

  - DefaultAzureCredential chain (MSI in EV2, az CLI locally).
  - sanitizeForRecreate deep-copies via JSON round-trip; never
    mutates the snapshot.
  - snapshotSystem refuses to act if VMSize or VnetSubnetID are
    missing from the live pool.
  - maybeAbortLRO refuses to abort an LRO younger than 30 min.
  - preflightChecks refuses to act if a leftover 'systmp' exists.
  - Guard 2 fails closed if the activity-log query errors (so a
    missing Reader role on the node RG yields a no-op, not a
    runaway recreate).
  - Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
  - Overall 60-min context timeout.
  - DRY_RUN=true lets operators verify guard behaviour without
    making any writes.
  - Kubeconfig written to $TMPDIR with random per-pid name and
    removed on exit; bearer token has the MSI's normal AAD TTL.
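
The no-mutation property of sanitizeForRecreate can be illustrated with a small sketch of a JSON round-trip deep copy. The function names `deepCopy`, `sanitize`, and `sampleSnapshot`, the map-based representation, the field names, and the version string are all assumptions made for the example; the real code presumably works on SDK types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// deepCopy clones a JSON-shaped map by marshalling and unmarshalling it.
func deepCopy(in map[string]any) (map[string]any, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return nil, err
	}
	var out map[string]any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out, nil
}

// sanitize returns a cleaned copy of snapshot; the input is left untouched.
// Field names are illustrative, not the real agent-pool schema.
func sanitize(snapshot map[string]any, liveVersion string) (map[string]any, error) {
	out, err := deepCopy(snapshot)
	if err != nil {
		return nil, err
	}
	if out == nil {
		out = map[string]any{} // a nil snapshot round-trips to nil
	}
	delete(out, "provisioningState") // example of a read-only field
	if tags, ok := out["tags"].(map[string]any); ok {
		for k := range tags {
			if strings.HasPrefix(k, "aks-managed-") {
				delete(tags, k)
			}
		}
	}
	out["orchestratorVersion"] = liveVersion // pin to the live control-plane version
	return out, nil
}

// sampleSnapshot builds a small made-up snapshot for the demonstration.
func sampleSnapshot() map[string]any {
	return map[string]any{
		"provisioningState":   "Failed",
		"orchestratorVersion": "1.30.0",
		"tags":                map[string]any{"aks-managed-foo": "x", "team": "sre"},
	}
}

func main() {
	snap := sampleSnapshot()
	clean, err := sanitize(snap, "1.31.1")
	if err != nil {
		fmt.Println("sanitize failed:", err)
		return
	}
	fmt.Println("copy:", clean)
	fmt.Println("original still has provisioningState:", snap["provisioningState"] != nil)
}
```

Because the nested `tags` map in the copy is a fresh object, deleting keys from it cannot reach back into the snapshot, which is the point of the round-trip.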

## Testing

  - 100+ unit test cases covering all pure-logic functions:
    env parsing, guard primitives (evalGuard1..5 including the
    revised guard 3 acceptance of Failed/Succeeded/Canceled and
    the new guard 5 for system pool Failed state),
    sanitizeForRecreate (no-mutation, tag stripping, version
    pin), buildSystmpAgentPool (defensive nil checks),
    activity-log parsing (dedup, case insensitivity, prefix
    filter), isNodeReady (nil/missing/false conditions),
    isNotFoundErr (404 vs other status codes, wrapped errors),
    extractAPIServerAndCA (happy path, empty input, malformed
    yaml, missing fields), kubeconfigWithBearerToken (no exec
    plugin or auth provider required).
  - go vet, gofmt clean.
  - validate-changed-config-pipelines passes.

## References

  - PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
  - Pipeline step pattern: Azure#4790
  - Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev
    envs on macOS-arm build natively too)
  - Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask),
    AROSLSRE-924 (INT manual mitigation)
  - ICM: 798003653
raelga added a commit to raelga/ARO-HCP that referenced this pull request May 26, 2026
…E-924)

Adds a detection-gated EV2 Shell step that recreates the AKS system
pool when the NRP key-value-store entity for its VMSS gets corrupted.
The same recipe was applied manually at INT on 2026-05-24
(AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.

## Background

A corrupted NRP KVS entry for the system pool's VMSS causes every
Microsoft.Compute/virtualMachineScaleSets/write to fail with
NetworkingInternalOperationError on a continuous retry chain. Fresh
VM instances come up but never get a Swift NIC, kubelet never
registers, the pool stops scaling, and the cluster's upgrade LRO
retries forever - the AROSLSRE-880 / INT (2026-05-16..18) incident
left the cluster stuck in Updating for days because of this.

The corruption is bound to the VMSS ARM resource ID; per-instance
delete does not help. Deleting and re-creating the pool yields a
fresh VMSS name and a clean KVS entity. NRP-side fix is tracked
in ICM 798003653; once it ships, this binary's detection guards
never fire and the step becomes a no-op.

## Placement

The step runs BEFORE the cluster ARM step (depending on the same
cert-issuer prereqs), so when guards fire the recreate happens
before the next cluster PUT, preventing the pipeline from getting
stuck. The cluster ARM step now depends on this step.

The binary tolerates a not-yet-created cluster (greenfield rollout):
it does an ARM Get; if 404, logs and exits 0 with no action. No
`aksCluster:` directive is used (which would fail on greenfield);
instead the binary bootstraps its own kubeconfig from
ListClusterUserCredentials + a bearer token issued by the MSI
scoped to the AKS AAD server app, and exports KUBECONFIG so child
kubectl invocations work too. No `kubelogin` dependency.

## Detection (ALL guards must pass; otherwise exit 0 no-op)

  1. system pool Ready k8s nodes < minCount
  2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on
     aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
  3. cluster provisioningState is recoverable: Succeeded, Canceled,
     Failed (settled) OR Updating, Upgrading (mid-LRO - the wedge
     signature itself). Rejected: Creating, Deleting, unknown.
  4. every non-system pool has count > 0
  5. system pool provisioningState == "Failed" - positive
     confirmation that this specific pool is wedged
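
As a hedged illustration of the revised guard 3 and of the "all guards must pass" rule, the state check could look like the sketch below. The function names `guard3` and `allPass` are hypothetical, and exact, case-sensitive matching of the state strings is an assumption.

```go
package main

import "fmt"

// guard3 reports whether the cluster provisioningState is one the recreate
// may start from, following the list in the Detection section.
func guard3(state string) bool {
	switch state {
	case "Succeeded", "Canceled", "Failed": // settled
		return true
	case "Updating", "Upgrading": // mid-LRO, the wedge signature itself
		return true
	default: // Creating, Deleting, or anything unknown
		return false
	}
}

// allPass returns true only when at least one guard result was supplied and
// every result is true; an empty set is treated as "do nothing".
func allPass(results ...bool) bool {
	if len(results) == 0 {
		return false
	}
	for _, ok := range results {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"Succeeded", "Updating", "Creating", "SomethingNew"} {
		fmt.Printf("%s -> %t\n", s, guard3(s))
	}
	fmt.Println("act:", allPass(true, true, guard3("Updating"), true, true))
}
```

Rejecting unknown states by default keeps the step a no-op if the service ever reports a state this list does not anticipate.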

## Action (once guards pass)

  1. Snapshot system pool ARM JSON
  2. Abort cluster LRO ONLY if active LRO is >= 30 min old (the
     NRP-KVS retry storm signature). If younger, no-op exit to
     avoid racing a healthy in-progress operation. AROSLSRE-924
     manual recipe required this step to move the cluster from
     stuck-Updating to Canceled.
  3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted,
     same VMSize/subnets as live system)
  4. Cordon + drain existing system nodes
  5. Delete the broken system pool
  6. Re-create system via SDK CreateOrUpdate from the sanitized
     snapshot (strips read-only fields and aks-managed-* tags,
     pins orchestratorVersion to the live CP version)
  7. Drain + delete systmp
  8. No-op tag PATCH to flip cluster Canceled -> Succeeded

## Safety

  - DefaultAzureCredential chain (MSI in EV2, az CLI locally).
  - sanitizeForRecreate deep-copies via JSON round-trip; never
    mutates the snapshot.
  - snapshotSystem refuses to act if VMSize or VnetSubnetID are
    missing from the live pool.
  - maybeAbortLRO returns (proceed=false, no err) when LRO is
    younger than 30 min, so the binary exits 0 (not an error)
    rather than racing a potentially-healthy operation.
  - preflightChecks refuses to act if a leftover 'systmp' exists.
  - Guard 2 fails closed if the activity-log query errors (so a
    missing Reader role on the node RG yields a no-op, not a
    runaway recreate).
  - Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
  - Overall 60-min context timeout.
  - DRY_RUN=true lets operators verify guard behaviour without
    making any writes.
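
The maybeAbortLRO behaviour described above can be reduced to a small decision function. This is a sketch under assumptions: `lroDecision` and `minLROAge` are hypothetical names, and proceeding without an abort when no LRO is active is inferred from guard 3 accepting settled states rather than stated outright.

```go
package main

import (
	"fmt"
	"time"
)

// minLROAge is the age at which an active LRO is treated as stuck.
const minLROAge = 30 * time.Minute

// lroDecision returns whether to abort the active LRO and whether the
// recreate should proceed at all.
func lroDecision(active bool, age time.Duration) (abort, proceed bool) {
	if !active {
		return false, true // assumed: nothing to abort, continue
	}
	if age < minLROAge {
		return false, false // possibly healthy operation: no-op exit, not an error
	}
	return true, true // old enough to match the retry-storm signature
}

func main() {
	cases := []struct {
		active bool
		age    time.Duration
	}{
		{false, 0},
		{true, 10 * time.Minute},
		{true, 45 * time.Minute},
	}
	for _, c := range cases {
		abort, proceed := lroDecision(c.active, c.age)
		fmt.Printf("active=%t age=%s abort=%t proceed=%t\n", c.active, c.age, abort, proceed)
	}
}
```

Returning two booleans instead of an error matches the description that a young LRO leads to exit 0 rather than a failed step.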

## Testing

  - 100+ unit test cases covering all pure-logic functions:
    env parsing, guard primitives (evalGuard1..5 - including
    guard 3's acceptance of stuck-Updating clusters and guard 5
    for system pool Failed state), sanitizeForRecreate
    (no-mutation, tag stripping, version pin),
    buildSystmpAgentPool (defensive nil checks),
    activity-log parsing (dedup, case insensitivity, prefix
    filter), isNodeReady (nil/missing/false conditions),
    isNotFoundErr (404 vs other status codes, wrapped errors),
    extractAPIServerAndCA (happy path, empty input, malformed
    yaml, missing fields), kubeconfigWithBearerToken (no exec
    plugin or auth provider required).
  - go vet, gofmt clean.
  - validate-changed-config-pipelines passes.

## References

  - PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
  - Pipeline step pattern: Azure#4790
  - Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev
    envs on macOS-arm build natively too)
  - Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask),
    AROSLSRE-924 (INT manual mitigation), AROSLSRE-880 (parent
    incident bug)
  - ICM: 798003653
openshift-merge-bot (Contributor) commented

/retest-required

Remaining retests: 0 against base HEAD 184cebd and 2 for PR HEAD 0f84895 in total

tony-schndr changed the title from "build cleanup-pko-resources based on machine architecture" to "fix: build cleanup-pko-resources based on machine architecture" on May 27, 2026
raelga (Collaborator) commented May 27, 2026

/override ci/prow/e2e-parallel

This PR only affects the PKO cleanup binary build (machine architecture fix) — it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872

openshift-ci Bot commented May 27, 2026

@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel

Details

In response to this:

/override ci/prow/e2e-parallel

This PR only affects the PKO cleanup binary build (machine architecture fix) — it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit 4f3332c into Azure:main May 27, 2026
16 checks passed

4 participants