BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline #723
Merged
8 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| # BuildKit module | ||
| | ||
| Remote BuildKit build service: per-arch `buildkitd` Deployments behind an HAProxy | ||
| LB, on dedicated Karpenter NodePools. Clients build with | ||
| `buildctl --addr tcp://buildkitd-<arch>.buildkit:1234`. | ||
| | ||
| Sizing is per-arch in `clusters.yaml` (`buildkit.{amd64,arm64}_*` instance type, | ||
| pods-per-node, and autoscaling `*_min` / `*_max`); pod CPU/mem is computed by | ||
| `scripts/python/generate_buildkit.py`. | ||
| | ||
| ## Autoscaling (optional, `buildkit.autoscaling.enabled`) | ||
| | ||
| The goal is to absorb bursts of concurrent builds without overloading existing | ||
| pods, and to scale back to a small warm baseline when idle. | ||
| | ||
| - **One build per pod** — HAProxy `server maxconn 1` (matches buildkitd | ||
| `max-parallelism = 1`) so a build never stacks on a busy pod. When every pod is | ||
| busy the LB has no slot, so the client must **retry the connect** (see below) | ||
| until a pod frees or the pool scales up. | ||
| - **In-cluster scale signal** — KEDA `ScaledObject` per arch, `metrics-api` | ||
| scraping the LB's own metrics (`haproxy_backend_current_sessions`) — no external | ||
| metrics backend. If KEDA can't read the metric, a `fallback` (`*_fallback`, | ||
| e.g. 32/8 on prod) holds the proven fixed pool instead of freezing the count. | ||
| - **Warm baseline** — `amd64_min` / `arm64_min` keep ≥1 node per arch up so the | ||
| common case gets a free warm pod immediately. `*_max` caps the burst; NodePool | ||
| limits are sized to `*_max`. | ||
| - **No-flap scale-down** — KEDA holds a pod ~20 min after it goes idle | ||
| (`stabilizationWindowSeconds: 1200`), then sheds at most `max(10 pods, 20%)` | ||
| per 20 min, so a follow-up build reuses the pod's warm decompressed NVMe layer | ||
| cache. Node churn is left to Karpenter. | ||
| - **Safe scale-down** — `preStop` drain (waits until the pod's `:1234` is idle) | ||
| plus a long `terminationGracePeriodSeconds` and a PDB, so a build is never killed | ||
| mid-flight. Scale-down removes an arbitrary pod, which may be mid-build; the | ||
| drain holds termination until that build finishes, but | ||
| `terminationGracePeriodSeconds` is a hard SIGKILL cap, so it must outlast the | ||
| longest possible build. It's set to **8100s (135m)**: **120m** (the max time a | ||
| docker build may run, matching HAProxy `timeout server`) **+ 15m** of | ||
| headroom for the drain's idle-detection polling. A build that starts just | ||
| before drain still completes; the cap only fires as a backstop if a pod never | ||
| drains. | ||
| The **PDB** (`maxUnavailable: 1` per arch) bounds *voluntary* disruptions — | ||
| node consolidation and manual `kubectl drain` — to one builder per arch at a | ||
| time, so those go through the preStop drain one pod at a time instead of | ||
| evicting several in-flight builds at once. (KEDA scale-down deletes pods | ||
| directly rather than via the eviction API, so it isn't PDB-gated — the drain + | ||
| grace cap above is what protects that path.) | ||
| | ||
| ## Clients must retry the connect | ||
| | ||
| Build clients (both `docker buildx` and `buildctl`) use the `moby/buildkit` Go | ||
| client, which dials with gRPC's default **~20s `MinConnectTimeout`** and | ||
| **fail-fast** RPCs — there is no client-side flag to make it wait longer. During | ||
| a burst, a build whose connection finds no free pod (`maxconn 1`) is dropped by | ||
| the client after ~20s, well before KEDA/Karpenter can add a pod (minutes). An | ||
| HAProxy-side `timeout queue` does **not** help: the client gives up at 20s | ||
| regardless, so queueing on the LB is pointless (and was removed). | ||
| | ||
| So the **client must retry the build** on connection failures until a pod is | ||
| free or the pool has scaled up; the repeated attempts also keep the autoscaler's | ||
| load signal alive. PyTorch's `.ci/docker/build.sh` does this when | ||
| `REMOTE_BUILDKIT` is set, and the workflow creates the remote builder *without* | ||
| `--bootstrap` (the `docker buildx inspect --bootstrap` health check hits the same | ||
| 20s gate at setup). This was confirmed on the staging cluster. | ||
| | ||
| ## HAProxy config changes roll the LB | ||
| | ||
| HAProxy renders its config only at container start, and nothing else restarts | ||
| the `buildkitd-lb` pod, so a bare ConfigMap update (`maxconn`, timeouts, | ||
| backends) would silently not take effect. `deploy.sh` stamps the LB pod template | ||
| with a `checksum/config` annotation = a hash of `haproxy.yaml`; when the config | ||
| changes the hash changes, which rolls the Deployment so the new pod picks up the | ||
| new config. An unchanged config keeps the same hash, so routine deploys don't | ||
| churn the LB. (The buildkitd worker pods do **not** yet have this, so a | ||
| `buildkitd.toml` / `drain.sh` change needs a manual rollout to take effect.) | ||
| | ||
| Requires the `keda` module deployed before `buildkit` (provides the CRDs). The | ||
| `monitoring` module scrapes the KEDA operator's metrics and ships | ||
| buildkit-autoscaling alerts (scaler / fallback errors, queue backlog). |
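The client-side retry that the README asks for could be sketched roughly as follows. This is a minimal illustration, not the logic in PyTorch's `.ci/docker/build.sh`: the function name `retry_connect`, the attempt count, and the delay are all assumptions, and for simplicity it retries on any non-zero exit rather than only on connection failures.

```shell
#!/bin/sh
# Hypothetical retry wrapper; the name and arguments are assumptions.
# Usage: retry_connect MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Simplification: retries on ANY non-zero exit, not only connect failures.
retry_connect() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the wrapped build command; success ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Stop once the attempt budget is used up.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

A caller would wrap its existing `buildctl --addr tcp://buildkitd-<arch>.buildkit:1234` invocation with this function. The attempt count and delay are placeholders; they would need to be chosen so the total retry window comfortably exceeds the minutes that KEDA/Karpenter take to add a pod.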
21 changes: 21 additions & 0 deletions
osdc/modules/buildkit/kubernetes/base/drain-configmap.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| apiVersion: v1 | ||
| kind: ConfigMap | ||
| metadata: | ||
| name: buildkitd-drain | ||
| namespace: buildkit | ||
| data: | ||
| # preStop drain: block termination until no in-flight build remains. A build | ||
| # keeps an ESTABLISHED inbound connection on :1234 for its whole duration; | ||
| # require two consecutive idle polls so a transient health check can't be | ||
| # mistaken for "done". terminationGracePeriodSeconds caps the total wait. | ||
| drain.sh: | | ||
| #!/bin/sh | ||
| idle=0 | ||
| while [ "$idle" -lt 2 ]; do | ||
| if netstat -tn 2>/dev/null | awk '$NF=="ESTABLISHED" && $4 ~ /:1234$/{f=1} END{exit !f}'; then | ||
| idle=0 | ||
| else | ||
| idle=$((idle + 1)) | ||
| fi | ||
| sleep 15 | ||
| done |
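The idle check in `drain.sh` hinges on the `awk` filter applied to `netstat -tn` output. As a small sketch, the same filter can be pulled into a function that reads from stdin, which makes it easy to exercise with canned input. The function name `has_inflight_build` and the sample connection line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption): same awk filter as drain.sh,
# but reading netstat-style lines from stdin instead of calling netstat.
# Exit status 0: at least one ESTABLISHED connection whose local address
#                (field 4) ends in :1234, i.e. a build is still attached.
# Exit status 1: no such connection, i.e. this poll counts as idle.
has_inflight_build() {
  awk '$NF == "ESTABLISHED" && $4 ~ /:1234$/ { found = 1 } END { exit !found }'
}

# Made-up sample line in `netstat -tn` column order:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State
sample='tcp 0 0 10.0.1.5:1234 10.0.2.9:40000 ESTABLISHED'
if printf '%s\n' "$sample" | has_inflight_build; then
  echo "busy"
else
  echo "idle"
fi
```

With this sample the script prints `busy`. Connections in other states such as `TIME_WAIT`, and outbound connections where `:1234` appears only in the foreign address (field 5), do not match the filter and so count as idle.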
25 changes: 25 additions & 0 deletions
osdc/modules/buildkit/kubernetes/base/poddisruptionbudget.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Cap voluntary disruptions (node consolidation, drains) to one builder per arch | ||
| # at a time so evictions go through the preStop drain instead of killing builds. | ||
| apiVersion: policy/v1 | ||
| kind: PodDisruptionBudget | ||
| metadata: | ||
| name: buildkitd-arm64 | ||
| namespace: buildkit | ||
| spec: | ||
| maxUnavailable: 1 | ||
| selector: | ||
| matchLabels: | ||
| app: buildkitd | ||
| arch: arm64 | ||
| --- | ||
| apiVersion: policy/v1 | ||
| kind: PodDisruptionBudget | ||
| metadata: | ||
| name: buildkitd-amd64 | ||
| namespace: buildkit | ||
| spec: | ||
| maxUnavailable: 1 | ||
| selector: | ||
| matchLabels: | ||
| app: buildkitd | ||
| arch: amd64 |