From bb17d101b06c9ad86a63e5d45b2f217ebef4c4b3 Mon Sep 17 00:00:00 2001
From: Thanh Ha
Date: Tue, 9 Jun 2026 11:38:37 -0400
Subject: [PATCH] Document ARC stale scale set recovery runbook

Add a step-by-step recovery procedure to operations.md for when
AutoscalingListener pods crash-loop after config changes that alter
the GitHub-side scale set registration (github_config_url or
runner_group changes).

Co-Authored-By: Claude Sonnet 4.6
Signed-off-by: Thanh Ha
---
 osdc/docs/operations.md | 54 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/osdc/docs/operations.md b/osdc/docs/operations.md
index e2229ca5..8e58d9d9 100644
--- a/osdc/docs/operations.md
+++ b/osdc/docs/operations.md
@@ -126,6 +126,60 @@ just resume-runners arc-staging # restore maxRunners from def files, remove
 
 `recycle-nodes` deletes all Karpenter `NodeClaims`, forcing Karpenter to reprovision on demand. Use after AMI/userData changes when you do not need to preserve in-flight jobs. Staging runs this automatically at the end of `just deploy` (driven by `recycle_karpenter_nodes: true` in `clusters.yaml`).
 
+### Recovering from stale ARC scale sets
+
+Certain `arc-runners` config changes (e.g. changing `github_config_url` from
+repo-scoped to org-scoped, or changing `runner_group`) cause the ARC controller to
+register new GitHub-side scale sets while old `AutoscalingListener` objects remain
+in the cluster pointing at the now-invalid scale set IDs. Symptoms: listener pods
+crash-loop with `RunnerScaleSetNotFoundException` (404 from GitHub API), or two
+listener pods exist for the same runner set with different spec hashes.
+
+**Step 1 — Check for duplicate or erroring listeners:**
+
+```bash
+kubectl get autoscalinglisteners -n arc-systems
+kubectl get pods -n arc-systems | grep -v Running
+```
+
+If you see two listeners for the same runner set (different hash suffixes), or pods
+cycling through `Error` / `ContainerCreating`, proceed with the cleanup.
+
+**Step 2 — Delete all `AutoscalingRunnerSet` objects to force re-registration:**
+
+```bash
+kubectl delete autoscalingrunnersets --all -n arc-runners
+```
+
+This removes the in-cluster ARS objects. The ARC Helm releases are untouched — the
+controller recreates ARS objects (and fresh GitHub scale set IDs) on the next deploy.
+
+**Step 3 — Force-redeploy arc-runners:**
+
+```bash
+HELM_FORCE_UPGRADE=1 just deploy-module arc-runners
+```
+
+`HELM_FORCE_UPGRADE=1` bypasses the skip-if-no-diff logic so the upgrade runs even
+when the rendered templates haven't changed.
+
+**Step 4 — Clean up any remaining stale listeners:**
+
+After the redeploy, new listeners get a fresh spec hash. Old listeners (prior hash)
+may linger briefly. Delete them explicitly:
+
+```bash
+OLD_HASH= # e.g. 58d9767d — visible in the listener name
+kubectl delete autoscalinglistener -n arc-systems \
+  $(kubectl get autoscalinglisteners -n arc-systems --no-headers | grep "$OLD_HASH" | awk '{print $1}')
+```
+
+All listener pods should reach `Running` within ~30 seconds.
+
+> **Note:** For `runner_group`-only changes (no URL change), you typically do **not**
+> need to delete the ARS objects — a `HELM_FORCE_UPGRADE=1` redeploy is sufficient.
+> Only stale listeners need manual cleanup in that case.
+
 ### Read-only debugging
 
 ```bash