BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline #723
tofu plan: arc-cbr-production ✅ Plan succeeded
tofu plan: arc-cbr-production-uw1 ✅ Plan succeeded
Integration-test validation (updated)
Correction to the earlier note below: the
Latest run 27303571628, deliberately over-subscribed: per arch, 1 connectivity build + an 8-wide scale burst = 9 builds against a max-8 pool. 18/18 jobs succeeded.
Validates:
(Earlier run 27247502866 also covered kill-free scale-down to the 2/4 baseline +
Original note (superseded: "queue worked / no connect timeouts" was wrong)
Staging validation run
Drove a balanced burst of 8 amd64 + 8 arm64 builds against the staging pool (each held a
BuildKit nodes / pods during the run:
Observations:
tofu plan: meta-prod-aws-ue1 ✅ Plan succeeded
Full stack testing on staging: https://github.com/pytorch/ci-infra/actions/runs/27302743769
Claude finished @jeanschmidt's task in 8m 51s
PR review: BuildKit autoscaling (KEDA + HAProxy + connect-retry)
Focused on the five concerns you raised. Overall this is a well-thought-out design and the staging validation run (18/18 over-subscribed, kill-free scale-down) exercises the happy path convincingly. The findings below are mostly about observability gaps and failure-mode robustness rather than correctness of the validated path.
1. Race conditions in the scaling loop
Scaling math is sound. KEDA registers
The load signal is only as alive as the retries (
2. Risk of not scaling / failing to deploy
3. Edge cases
4. Monitoring — HAProxy ✅, KEDA ❌ (the notable gap)
5. Karpenter disruption wiring — this is the strongest part
Summary
None of these block the staging-only, opt-in rollout. The two I'd most want before widening beyond staging are the KEDA ServiceMonitor and the KEDA |
Following up on #723 (comment) and #723 (comment): are there anti-flap mechanisms that prevent fast scale-down of pods/nodes (keeping them warm for a while before tearing them down)? If so, explain them. Can you look into this PR and tell me whether pytorch/pytorch#186955 can safely handle a failure to schedule up and wait for 2-4 average build jobs to finish? (You can look for build jobs on pytorch/pytorch for x86-64 and aarch64.)
Claude finished @jeanschmidt's task in 5m 1s
Anti-flap mechanisms & connect-retry safety
1. Anti-flap: what keeps pods/nodes warm before teardown
There are four distinct anti-flap layers, two at the pod level (KEDA/HPA) and two at the node level (Karpenter), plus the warm floor. From
Pod scale-down (KEDA
cooldownPeriod: 600
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600  # 10 min: HPA takes the *highest*
                                         # recommendation over the window,
                                         # so a transient dip can't trigger
                                         # teardown
        policies:
        - type: Pods
          value: 1
          periodSeconds: 120             # at most 1 pod removed / 2 min
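To make the combined effect of the stabilization window and the rate-limit policy concrete, here is a rough Python sketch that replays a drop in demand against the values quoted above. It is a simplified model, not the HPA controller's code: the 15-second tick, the function name `simulate_scale_down`, and the 8-to-2 example are assumptions chosen for illustration. It also ignores `cooldownPeriod`, which, as far as I understand KEDA's documentation, governs only the final step to zero replicas and so should not matter with a non-zero warm floor.

```python
def simulate_scale_down(start, floor, stabilization_s=600, pods_per_period=1,
                        period_s=120, tick_s=15, horizon_s=3600):
    """Return [(t_seconds, replicas)] for a pool whose demand drops to `floor` right after t=0.

    Simplified model (assumption): the last "busy" recommendation is recorded at t=0,
    and every later tick recommends `floor`.
    """
    recs = [(0, start)]   # (timestamp, recommended replicas)
    removals = []         # timestamps of individual pod removals
    replicas = start
    timeline = [(0, replicas)]
    for t in range(tick_s, horizon_s + 1, tick_s):
        recs.append((t, floor))
        # Stabilization window: act on the highest recommendation seen in the window.
        stabilized = max(r for ts, r in recs if ts > t - stabilization_s)
        # Rate limit: at most `pods_per_period` removals in any `period_s` span.
        recent = sum(1 for ts in removals if ts > t - period_s)
        allowed = max(0, pods_per_period - recent)
        step = min(allowed, max(0, replicas - stabilized))
        if step:
            removals.extend([t] * step)
            replicas -= step
            timeline.append((t, replicas))
    return timeline


if __name__ == "__main__":
    for t, n in simulate_scale_down(start=8, floor=2):
        print(f"t={t:>5}s replicas={n}")
```

Under these assumptions the first pod goes away 10 minutes after the last busy sample, and an 8-pod pool reaches a floor of 2 after roughly 20 minutes.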
Node scale-down (Karpenter
disruption:
  consolidationPolicy: WhenEmpty  # only consolidate a node with zero build pods
  consolidateAfter: 5m            # …and only after it's been empty 5 min
  budgets:
  - nodes: "1"                    # disrupt at most 1 node at a time
Warm floor:
(Separately, the kill-free teardown path
2. Can pytorch#186955's connect-retry safely "fail to schedule up and wait for 2–4 average builds to finish"?
The answer splits by whether the pool is below
Case A: demand ≤
Case B: demand >
So your specific framing — "wait for 2–4 average build jobs to finish" — is the worst case of Case B, and it is the one place this design can drop a build:
The two numbers to confirm (I couldn't fetch them here):
Recommendation: if you genuinely need to ride out 2–4 sequential builds at saturation (rather than rely on scale-up), the robust lever is raise
Other things I noticed while tracing this
Adding a KEDA fallback is an interesting point; let me see what we can do there. We will probably keep the current fixed numbers of 32 x86 and 8 aarch64 BuildKit pods, which have been working OK since last Friday. For KEDA metrics, let's do that in a separate PR; there should be some we want to collect.
The anti-flap protections are too weak for a real use case, IMO. We should try to keep pods alive for at least 20 minutes of idling, so instead of removing one pod every 2 minutes, it makes much more sense to remove max(10, 20%) every 20 minutes, and only pods that have been idle for 20+ minutes. Or something similar. On the other hand, Karpenter can be quite aggressive about scaling down nodes (as it is now). We should gate the anti-flap mechanism on pods: yes, they are less expensive to start, but losing them is more costly for jobs (loss of local decompressed layer caches). Also, monitoring seems like a requirement :)
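As a quick check on what the max(10, 20%) rule would allow, the arithmetic can be sketched in Python. The function name `removable_this_period`, the choice to take 20% of the current pod count, and rounding that percentage up are all assumptions; the comment does not pin them down, and this is not how any particular autoscaler computes its limit.

```python
import math


def removable_this_period(total_pods, idle_20min_pods, floor, min_step=10, percent=20):
    """Pods that could be removed in one 20-minute period under max(10, 20%).

    Assumptions: the percentage applies to the current pod count and is rounded up;
    only pods idle for 20+ minutes are candidates; the warm floor is never crossed.
    """
    cap = max(min_step, math.ceil(total_pods * percent / 100))
    return max(0, min(cap, idle_20min_pods, total_pods - floor))


if __name__ == "__main__":
    print(removable_this_period(total_pods=8, idle_20min_pods=6, floor=2))     # small staging-sized pool
    print(removable_this_period(total_pods=32, idle_20min_pods=30, floor=2))   # 32-pod pool
    print(removable_this_period(total_pods=100, idle_20min_pods=50, floor=4))  # percentage term dominates
```

One consequence of the rule as written: the fixed term of 10 dominates until the pool exceeds 50 pods, so for a pool capped at 8 the whole idle surplus could leave in a single step once the 20-minute idle hold has elapsed.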
Yeah, a minimum of 20 minutes of idling can be set easily for pods by tweaking the |
@jeanschmidt Done — scale-down now holds a pod ~20 min after it goes idle ( |
…seline
**Impact:** OSDC arc-staging buildkit only (autoscaling is opt-in; other clusters unchanged).
**Risk:** low
Absorb ciflow/docker bursts without overloading existing pods, and scale back to a small warm per-arch baseline when idle.
- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds queue and flow onto new pods as they register, instead of stacking on busy pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics (haproxy_backend_current_sessions); no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free scale-down.
staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node (min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build tolerate waiting for a pod from a cold/queued pool.
Testing: just lint 13/13, just test pass (generate_buildkit.py 98%).
Signed-off-by: Huy Do <huydo@meta.com>
ghstack-source-id: e6bb458
Pull-Request: pytorch#723
Stack from ghstack (oldest at bottom):
Impact: OSDC arc-staging buildkit only (autoscaling is opt-in; other
clusters unchanged).
Risk: low
Absorb ciflow/docker bursts without overloading existing pods, and scale back to
a small warm per-arch baseline when idle.
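- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds queue and flow onto new pods as they register, instead of stacking on busy pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics (haproxy_backend_current_sessions); no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free scale-down.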
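For readers unfamiliar with how a per-pod target turns a load metric into a replica count, the usual average-value calculation can be sketched as below. This is an illustration only: the target of one unit of load per pod is an assumption that mirrors `maxconn 1`, the default min/max of 2 and 8 are taken from the amd64 numbers mentioned in this thread, and the HPA's tolerance band is left out. The exact query and target used by the ScaledObject are not shown in this description.

```python
import math


def desired_replicas(load, target_per_pod=1.0, min_replicas=2, max_replicas=8):
    """Replica count for a given load signal, clamped to the warm floor and the burst cap.

    `load` stands for whatever the scaler reads from the LB; `target_per_pod` is a
    hypothetical target value (assumption), not a setting confirmed by the PR.
    """
    raw = math.ceil(load / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for load in (0, 1, 5, 8, 9):
        print(f"load={load} -> replicas={desired_replicas(load)}")
```

With these numbers an idle pool stays at the floor of 2, and a load of 9 is clamped to 8 pods, which matches the over-subscribed validation run where the ninth build per arch had to wait in the LB queue.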
staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node
(min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build
tolerate waiting for a pod from a cold/queued pool.
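The runner-side retry lives in the separate pytorch/pytorch change and is not shown here. As a rough idea of the shape such a loop can take, here is a minimal Python sketch of a deadline-bounded retry. Everything in it is hypothetical: the name `retry_until_deadline`, the `probe` callable (whatever connectivity check the runner performs), and the 30-minute and 15-second defaults are illustrative assumptions, not values from either PR.

```python
import time


def retry_until_deadline(probe, deadline_s=1800.0, interval_s=15.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it stops raising OSError, or give up once the deadline would pass.

    Returns the number of attempts made. Raises TimeoutError if another wait of
    `interval_s` would exceed `deadline_s`.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            probe()
            return attempts
        except OSError as exc:
            elapsed = clock() - start
            if elapsed + interval_s > deadline_s:
                raise TimeoutError(
                    f"gave up after {attempts} attempts in {elapsed:.0f}s"
                ) from exc
            sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. The point that matters for sizing is that the deadline has to exceed the longest expected wait: riding out N back-to-back builds of average duration D needs a budget of more than N × D.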
Testing: Validation https://github.com/pytorch/ci-infra/actions/runs/27302743769 on pytorch-canary https://github.com/pytorch/pytorch-canary/actions/runs/27303571628 (comment below)
Signed-off-by: Huy Do <huydo@meta.com>