Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 54 additions & 0 deletions osdc/docs/operations.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,60 @@ just resume-runners arc-staging # restore maxRunners from def files, remove

`recycle-nodes` deletes all Karpenter `NodeClaims`, forcing Karpenter to reprovision on demand. Use after AMI/userData changes when you do not need to preserve in-flight jobs. Staging runs this automatically at the end of `just deploy` (driven by `recycle_karpenter_nodes: true` in `clusters.yaml`).

### Recovering from stale ARC scale sets

Certain `arc-runners` config changes (e.g. changing `github_config_url` from
repo-scoped to org-scoped, or changing `runner_group`) cause the ARC controller to
register new GitHub-side scale sets while old `AutoscalingListener` objects remain
in the cluster pointing at the now-invalid scale set IDs. Symptoms: listener pods
crash-loop with `RunnerScaleSetNotFoundException` (404 from GitHub API), or two
listener pods exist for the same runner set with different spec hashes.

**Step 1 — Check for duplicate or erroring listeners:**

```bash
kubectl get autoscalinglisteners -n arc-systems
kubectl get pods -n arc-systems | grep -v Running
```

If you see two listeners for the same runner set (different hash suffixes), or pods
cycling through `Error` / `ContainerCreating`, proceed with the cleanup.

**Step 2 — Delete all `AutoscalingRunnerSet` objects to force re-registration:**

```bash
kubectl delete autoscalingrunnersets --all -n arc-runners
```

This removes the in-cluster ARS objects. The ARC Helm releases are untouched — the
controller recreates ARS objects (and fresh GitHub scale set IDs) on the next deploy.

**Step 3 — Force-redeploy arc-runners:**

```bash
HELM_FORCE_UPGRADE=1 just deploy-module <cluster> arc-runners
```

`HELM_FORCE_UPGRADE=1` bypasses the skip-if-no-diff logic so the upgrade runs even
when the rendered templates haven't changed.

**Step 4 — Clean up any remaining stale listeners:**

After the redeploy, new listeners get a fresh spec hash. Old listeners (prior hash)
may linger briefly. Delete them explicitly:

```bash
OLD_HASH=<old-hash> # e.g. 58d9767d — visible in the listener name
kubectl delete autoscalinglistener -n arc-systems \
$(kubectl get autoscalinglisteners -n arc-systems --no-headers | grep "$OLD_HASH" | awk '{print $1}')
```

All listener pods should reach `Running` within ~30 seconds.

> **Note:** For `runner_group`-only changes (no URL change), you typically do **not**
> need to delete the ARS objects — a `HELM_FORCE_UPGRADE=1` redeploy is sufficient.
> Only stale listeners need manual cleanup in that case.

### Read-only debugging

```bash
Expand Down
Loading