fix: build cleanup-pko-resources based on machine architecture #5394
Pull request overview
This PR updates the dev-infrastructure management pipeline to build the cleanup-pko-resources helper binary for the current machine architecture/OS, so it can run on macOS arm64 (instead of always producing a linux/amd64 binary).
Changes:
- Build `scripts/cleanup-pko-resources` using host OS/architecture detection in the pipeline build step.
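To illustrate why dropping the target overrides is enough, here is a minimal Go sketch. It assumes the build is driven by a child `go build` process; the helper name `withoutCrossCompileVars` is hypothetical and does not come from this PR. With `GOOS` and `GOARCH` absent from the environment, the Go toolchain falls back to the host platform.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// withoutCrossCompileVars returns a copy of env with any GOOS= or GOARCH=
// entries removed, so a child `go build` would target the host platform.
// The name is hypothetical and only used for illustration.
func withoutCrossCompileVars(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "GOOS=") || strings.HasPrefix(kv, "GOARCH=") {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "GOOS=linux", "GOARCH=amd64"}
	fmt.Println(withoutCrossCompileVars(env))
	// For a natively built binary, runtime reports the host platform.
	fmt.Printf("host target: %s/%s\n", runtime.GOOS, runtime.GOARCH)
}
```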
a02a8f6 to 0f84895
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gmfrasca, raelga, tony-schndr
…E-924)
Adds a detection-gated EV2 Shell step that recreates the AKS system
pool when the NRP key-value-store entity for its VMSS gets corrupted.
The same recipe was applied manually at INT on 2026-05-24
(AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.
## Background
A corrupted NRP KVS entry for the system pool's VMSS causes every
Microsoft.Compute/virtualMachineScaleSets/write to fail with
NetworkingInternalOperationError on a continuous retry chain. Fresh
VM instances come up but never get a Swift NIC, kubelet never
registers, the pool stops scaling, and the next mgmt-pipeline run
gets stuck on the cluster ARM step because the underlying VMSS
update never converges. The system pool's ARM resource ends up in
Failed once the parent LRO finally times out.
The corruption is bound to the VMSS ARM resource ID; per-instance
delete does not help. Deleting and re-creating the pool yields a
fresh VMSS name and a clean KVS entity. NRP-side fix is tracked
in ICM 798003653; once it ships, this binary's detection guards
never fire and the step becomes a no-op.
## Placement
The step runs BEFORE the cluster ARM step (depending on the same
cert-issuer prereqs), so when guards fire the recreate happens
before the next cluster PUT, preventing the pipeline from getting
stuck. The cluster ARM step now depends on this step.
The binary tolerates a not-yet-created cluster (greenfield rollout):
it does an ARM Get; if 404, logs and exits 0 with no action. No
`aksCluster:` directive is used (which would fail on greenfield);
instead the binary bootstraps its own kubeconfig from
ListClusterUserCredentials + a bearer token issued by the MSI
scoped to the AKS AAD server app, and exports KUBECONFIG so child
kubectl invocations work too. No `kubelogin` dependency.
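The greenfield check described above can be sketched as follows. This is a simplified illustration: `statusError` is a hypothetical stand-in for whatever SDK error type carries the HTTP status code, and the real `isNotFoundErr` may be implemented differently.

```go
package main

import (
	"errors"
	"fmt"
)

// statusError is a hypothetical stand-in for an SDK error that exposes
// the HTTP status code of a failed ARM call.
type statusError struct{ StatusCode int }

func (e *statusError) Error() string {
	return fmt.Sprintf("request failed with status %d", e.StatusCode)
}

// isNotFoundErr reports whether err, or any error it wraps, is a 404.
func isNotFoundErr(err error) bool {
	var se *statusError
	return errors.As(err, &se) && se.StatusCode == 404
}

func main() {
	// Simulate a wrapped 404 from the cluster Get on a greenfield rollout.
	err := fmt.Errorf("get cluster: %w", &statusError{StatusCode: 404})
	if isNotFoundErr(err) {
		fmt.Println("cluster not found, nothing to do")
		return // the real step would exit 0 here
	}
	fmt.Println("cluster exists or lookup failed:", err)
}
```

Using `errors.As` rather than a direct type assertion keeps the check working when callers wrap the error with extra context.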
## Detection (ALL guards must pass; otherwise exit 0 no-op)
1. system pool Ready k8s nodes < minCount
2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on
aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
3. cluster provisioningState NOT in {Updating, Creating, Deleting,
Upgrading} - i.e. no active LRO we would be racing. Failed,
Succeeded and Canceled all qualify as "settled".
4. every non-system pool has count > 0
5. system pool provisioningState == "Failed" - positive
confirmation that this specific pool is wedged (and not, say,
a transient cluster-wide blip caught between the other guards)
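The "all guards must pass, otherwise no-op" rule can be sketched like this. The `guard` type and `firstFailing` helper are hypothetical names for illustration, and the sample values stand in for data the real binary would read from Kubernetes and the activity log.

```go
package main

import "fmt"

// guard is a hypothetical named check; Eval reports whether it passes.
type guard struct {
	Name string
	Eval func() bool
}

// firstFailing evaluates guards in order and returns the name of the first
// one that does not pass. ok is true only when every guard passes.
func firstFailing(guards []guard) (name string, ok bool) {
	for _, g := range guards {
		if !g.Eval() {
			return g.Name, false
		}
	}
	return "", true
}

func main() {
	// Sample values; the real inputs come from the cluster and Azure APIs.
	readyNodes, minCount := 1, 3
	failedWrites, threshold := 12, 10

	guards := []guard{
		{"guard 1: Ready nodes < minCount", func() bool { return readyNodes < minCount }},
		{"guard 2: Failed VMSS writes >= threshold", func() bool { return failedWrites >= threshold }},
	}
	if name, ok := firstFailing(guards); !ok {
		fmt.Println("no-op, guard did not pass:", name)
		return // the real step would exit 0 here
	}
	fmt.Println("all guards passed, recreate would proceed")
}
```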
## Action (once guards pass)
1. Snapshot system pool ARM JSON
2. Abort cluster LRO ONLY if >30 min old (refuses to fight a
healthy in-progress operation)
3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted,
same VMSize/subnets as live system)
4. Cordon + drain existing system nodes
5. Delete the broken system pool
6. Re-create system via SDK CreateOrUpdate from the sanitized
snapshot (strips read-only fields and aks-managed-* tags,
pins orchestratorVersion to the live CP version)
7. Drain + delete systmp
8. No-op tag PATCH to flip cluster Canceled -> Succeeded
## Safety
- DefaultAzureCredential chain (MSI in EV2, az CLI locally).
- sanitizeForRecreate deep-copies via JSON round-trip; never
mutates the snapshot.
- snapshotSystem refuses to act if VMSize or VnetSubnetID are
missing from the live pool.
- maybeAbortLRO refuses to abort an LRO younger than 30 min.
- preflightChecks refuses to act if a leftover 'systmp' exists.
- Guard 2 fails closed if the activity-log query errors (so a
missing Reader role on the node RG yields a no-op, not a
runaway recreate).
- Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
- Overall 60-min context timeout.
- DRY_RUN=true lets operators verify guard behaviour without
making any writes.
- Kubeconfig written to $TMPDIR with random per-pid name and
removed on exit; bearer token has the MSI's normal AAD TTL.
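As an illustration of the no-mutation property, here is a minimal sketch of a JSON round-trip deep copy followed by sanitization. It works on a generic map rather than the SDK's agent pool type, and the flat field names (`provisioningState`, `tags`, `orchestratorVersion`) are simplifying assumptions about the snapshot layout.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sanitizeForRecreate returns a sanitized deep copy of snapshot and never
// modifies snapshot itself. The field layout is a simplified assumption.
func sanitizeForRecreate(snapshot map[string]any, cpVersion string) (map[string]any, error) {
	// Deep copy via JSON round-trip so nested maps are not shared.
	raw, err := json.Marshal(snapshot)
	if err != nil {
		return nil, err
	}
	var out map[string]any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	// Drop an example read-only field.
	delete(out, "provisioningState")
	// Strip aks-managed-* tags from the copy only.
	if tags, ok := out["tags"].(map[string]any); ok {
		for k := range tags {
			if strings.HasPrefix(k, "aks-managed-") {
				delete(tags, k)
			}
		}
	}
	// Pin the pool to the live control-plane version.
	out["orchestratorVersion"] = cpVersion
	return out, nil
}

func main() {
	snap := map[string]any{
		"provisioningState":   "Failed",
		"orchestratorVersion": "1.30.0",
		"tags":                map[string]any{"aks-managed-example": "1", "team": "sre"},
	}
	out, err := sanitizeForRecreate(snap, "1.31.2")
	if err != nil {
		panic(err)
	}
	fmt.Println("sanitized:", out)
	fmt.Println("original: ", snap)
}
```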
## Testing
- 100+ unit test cases covering all pure-logic functions:
env parsing, guard primitives (evalGuard1..5 including the
revised guard 3 acceptance of Failed/Succeeded/Canceled and
the new guard 5 for system pool Failed state),
sanitizeForRecreate (no-mutation, tag stripping, version
pin), buildSystmpAgentPool (defensive nil checks),
activity-log parsing (dedup, case insensitivity, prefix
filter), isNodeReady (nil/missing/false conditions),
isNotFoundErr (404 vs other status codes, wrapped errors),
extractAPIServerAndCA (happy path, empty input, malformed
yaml, missing fields), kubeconfigWithBearerToken (no exec
plugin or auth provider required).
- go vet, gofmt clean.
- validate-changed-config-pipelines passes.
## References
- PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
- Pipeline step pattern: Azure#4790
- Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev
envs on macOS-arm build natively too)
- Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask),
AROSLSRE-924 (INT manual mitigation)
- ICM: 798003653
…E-924)
Adds a detection-gated EV2 Shell step that recreates the AKS system
pool when the NRP key-value-store entity for its VMSS gets corrupted.
The same recipe was applied manually at INT on 2026-05-24
(AROSLSRE-924 / AROSLSRE-925); this binary automates it for stg/prod.
## Background
A corrupted NRP KVS entry for the system pool's VMSS causes every
Microsoft.Compute/virtualMachineScaleSets/write to fail with
NetworkingInternalOperationError on a continuous retry chain. Fresh
VM instances come up but never get a Swift NIC, kubelet never
registers, the pool stops scaling, and the cluster's upgrade LRO
retries forever - the AROSLSRE-880 / INT (2026-05-16..18) incident
left the cluster stuck in Updating for days because of this.
The corruption is bound to the VMSS ARM resource ID; per-instance
delete does not help. Deleting and re-creating the pool yields a
fresh VMSS name and a clean KVS entity. NRP-side fix is tracked
in ICM 798003653; once it ships, this binary's detection guards
never fire and the step becomes a no-op.
## Placement
The step runs BEFORE the cluster ARM step (depending on the same
cert-issuer prereqs), so when guards fire the recreate happens
before the next cluster PUT, preventing the pipeline from getting
stuck. The cluster ARM step now depends on this step.
The binary tolerates a not-yet-created cluster (greenfield rollout):
it does an ARM Get; if 404, logs and exits 0 with no action. No
`aksCluster:` directive is used (which would fail on greenfield);
instead the binary bootstraps its own kubeconfig from
ListClusterUserCredentials + a bearer token issued by the MSI
scoped to the AKS AAD server app, and exports KUBECONFIG so child
kubectl invocations work too. No `kubelogin` dependency.
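A kubeconfig that carries a static bearer token, with no exec plugin or auth provider, might look like the output of the sketch below. The function signature, the entry names `aks` and `msi`, and the sample inputs are assumptions for illustration; the values are interpolated as plain YAML scalars, which is only safe for inputs such as URLs, base64 data, and JWT-style tokens.

```go
package main

import "fmt"

// kubeconfigTemplate is a minimal v1 Config with a single cluster, context,
// and user. The user authenticates with a static bearer token.
const kubeconfigTemplate = `apiVersion: v1
kind: Config
clusters:
- name: aks
  cluster:
    server: %s
    certificate-authority-data: %s
contexts:
- name: aks
  context:
    cluster: aks
    user: msi
current-context: aks
users:
- name: msi
  user:
    token: %s
`

// kubeconfigWithBearerToken renders the template. The signature is a
// hypothetical simplification, not the one used by the binary.
func kubeconfigWithBearerToken(server, caDataB64, token string) string {
	return fmt.Sprintf(kubeconfigTemplate, server, caDataB64, token)
}

func main() {
	fmt.Print(kubeconfigWithBearerToken("https://example.invalid:443", "Q0E=", "example-token"))
}
```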
## Detection (ALL guards must pass; otherwise exit 0 no-op)
1. system pool Ready k8s nodes < minCount
2. >= NRP_FAIL_THRESHOLD (default 10) Failed VMSS-write events on
aks-system-* VMSS in last NRP_FAIL_WINDOW_MIN (default 15)
3. cluster provisioningState is recoverable: Succeeded, Canceled,
Failed (settled) OR Updating, Upgrading (mid-LRO - the wedge
signature itself). Rejected: Creating, Deleting, unknown.
4. every non-system pool has count > 0
5. system pool provisioningState == "Failed" - positive
confirmation that this specific pool is wedged
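The revised guard 3 amounts to a small state classification. A minimal sketch, assuming exact, case-sensitive state strings (the real `evalGuard3` may normalize its input or take different arguments):

```go
package main

import "fmt"

// evalGuard3 reports whether the cluster provisioningState is one the
// recreate is allowed to act on.
func evalGuard3(state string) bool {
	switch state {
	case "Succeeded", "Canceled", "Failed":
		return true // settled
	case "Updating", "Upgrading":
		return true // mid-LRO: the wedge signature itself
	default:
		return false // Creating, Deleting, or anything unknown
	}
}

func main() {
	for _, s := range []string{"Succeeded", "Updating", "Creating", "Deleting", ""} {
		fmt.Printf("%-10q recoverable=%v\n", s, evalGuard3(s))
	}
}
```

Rejecting unknown states by default keeps the guard conservative if a new provisioningState value ever appears.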
## Action (once guards pass)
1. Snapshot system pool ARM JSON
2. Abort cluster LRO ONLY if active LRO is >= 30 min old (the
NRP-KVS retry storm signature). If younger, no-op exit to
avoid racing a healthy in-progress operation. AROSLSRE-924
manual recipe required this step to move the cluster from
stuck-Updating to Canceled.
3. Add throwaway 'systmp' System pool (CriticalAddonsOnly tainted,
same VMSize/subnets as live system)
4. Cordon + drain existing system nodes
5. Delete the broken system pool
6. Re-create system via SDK CreateOrUpdate from the sanitized
snapshot (strips read-only fields and aks-managed-* tags,
pins orchestratorVersion to the live CP version)
7. Drain + delete systmp
8. No-op tag PATCH to flip cluster Canceled -> Succeeded
## Safety
- DefaultAzureCredential chain (MSI in EV2, az CLI locally).
- sanitizeForRecreate deep-copies via JSON round-trip; never
mutates the snapshot.
- snapshotSystem refuses to act if VMSize or VnetSubnetID are
missing from the live pool.
- maybeAbortLRO returns (proceed=false, no err) when LRO is
younger than 30 min, so the binary exits 0 (not an error)
rather than racing a potentially-healthy operation.
- preflightChecks refuses to act if a leftover 'systmp' exists.
- Guard 2 fails closed if the activity-log query errors (so a
missing Reader role on the node RG yields a no-op, not a
runaway recreate).
- Greenfield safe: ARM 404 on cluster Get -> exit 0 no-op.
- Overall 60-min context timeout.
- DRY_RUN=true lets operators verify guard behaviour without
making any writes.
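The (proceed, err) contract of maybeAbortLRO can be sketched as below. The parameters are assumptions: the sketch takes the LRO age and an abort callback directly, whereas the real function presumably derives the age from Azure and calls the SDK itself.

```go
package main

import (
	"fmt"
	"time"
)

// minLROAge is the 30-minute threshold below which an LRO is left alone.
const minLROAge = 30 * time.Minute

// maybeAbortLRO returns proceed=false with a nil error when the LRO is too
// young to abort, so the caller can exit 0 instead of treating it as a
// failure. abort is a hypothetical callback that cancels the operation.
func maybeAbortLRO(age time.Duration, abort func() error) (proceed bool, err error) {
	if age < minLROAge {
		return false, nil
	}
	if err := abort(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	abort := func() error {
		fmt.Println("aborting stuck LRO")
		return nil
	}
	for _, age := range []time.Duration{10 * time.Minute, 45 * time.Minute} {
		proceed, err := maybeAbortLRO(age, abort)
		fmt.Printf("age=%v proceed=%v err=%v\n", age, proceed, err)
	}
}
```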
## Testing
- 100+ unit test cases covering all pure-logic functions:
env parsing, guard primitives (evalGuard1..5 - including
guard 3's acceptance of stuck-Updating clusters and guard 5
for system pool Failed state), sanitizeForRecreate
(no-mutation, tag stripping, version pin),
buildSystmpAgentPool (defensive nil checks),
activity-log parsing (dedup, case insensitivity, prefix
filter), isNodeReady (nil/missing/false conditions),
isNotFoundErr (404 vs other status codes, wrapped errors),
extractAPIServerAndCA (happy path, empty input, malformed
yaml, missing fields), kubeconfigWithBearerToken (no exec
plugin or auth provider required).
- go vet, gofmt clean.
- validate-changed-config-pipelines passes.
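As an example of the kind of pure-logic function these tests target, here is a sketch of isNodeReady covering the nil, missing, and false condition cases. The `node` and `condition` types are hypothetical stand-ins for the Kubernetes API types.

```go
package main

import "fmt"

// condition and node are hypothetical, trimmed-down stand-ins for the
// Kubernetes node types.
type condition struct{ Type, Status string }

type node struct{ Conditions []condition }

// isNodeReady reports whether the node has a Ready condition with status
// "True". A nil node or a missing Ready condition counts as not ready.
func isNodeReady(n *node) bool {
	if n == nil {
		return false
	}
	for _, c := range n.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	cases := map[string]*node{
		"nil node":          nil,
		"missing condition": {},
		"ready false":       {Conditions: []condition{{"Ready", "False"}}},
		"ready true":        {Conditions: []condition{{"Ready", "True"}}},
	}
	for name, n := range cases {
		fmt.Printf("%s: %v\n", name, isNodeReady(n))
	}
}
```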
## References
- PR pattern: Azure#5149 (cleanup-pko-resources), Azure#5366 (Go bumper)
- Pipeline step pattern: Azure#4790
- Build pattern: aligned with Azure#5394 (drops GOOS/GOARCH so dev
envs on macOS-arm build natively too)
- Jira: AROSLSRE-951 (story), AROSLSRE-952 (subtask),
AROSLSRE-924 (INT manual mitigation), AROSLSRE-880 (parent
incident bug)
- ICM: 798003653
/override ci/prow/e2e-parallel
This PR only affects the PKO cleanup binary build (machine architecture fix); it impacts the provision step, not e2e test logic. The e2e-parallel suite already passed for this PR previously and also passed in the batch run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/batch/pull-ci-Azure-ARO-HCP-main-e2e-parallel/2059505962137423872
@raelga: Overrode contexts on behalf of raelga: ci/prow/e2e-parallel
What
Builds cleanup-pko-resources based on the current machine architecture.
Why
So that the cleanup-pko-resources binary runs on macOS arm64.
Deployment error:
Testing
I was able to create a personal dev environment after this change.