Support per-arch BuildKit replica counts and resize prod fleet #702

Merged
huydhn merged 5 commits into pytorch:main from huydhn:buildkit-per-arch-replicas on Jun 6, 2026

@huydhn huydhn commented Jun 5, 2026

Adds optional per-arch `buildkit.{amd64,arm64}_replicas` and `buildkit.{amd64,arm64}_pods_per_node` overrides, which fall back to the shared `replicas_per_arch`/`pods_per_node` so other clusters render unchanged, then resizes the prod fleet:

  • amd64: m6id.24xlarge, 2/node, 42 vCPU/155 GiB — 32 replicas (was 12)
  • arm64: m7gd.16xlarge, 4/node, ~14 vCPU/51 GiB — 8 replicas (smaller pods, more of them; ≈ the pre-OSDC m7g.4xlarge build runner)

NodePool limits scale per-arch automatically. `just test` passes (97% coverage); `just lint` passes 13/13.

Independent of the autoscaling work on #701.

**Impact:** OSDC `arc-cbr-production` BuildKit fleet. Other clusters unchanged
(per-arch keys are optional and fall back to the shared values).
**Risk:** low

Adds optional `buildkit.{amd64,arm64}_replicas` and `buildkit.{amd64,arm64}_pods_per_node`
overrides (generator + deploy.sh), defaulting to the existing `replicas_per_arch` /
`pods_per_node` when unset, so all other clusters render identically.
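
The fallback described above can be sketched as follows. This is a minimal illustration, not the code in `generate_buildkit.py`: the function name `resolve_buildkit_sizing`, the helper `_pick`, and the shared values in the example dicts are hypothetical, and only the key names and the prod override values come from this PR.

```python
from typing import Any, Mapping


def _pick(buildkit: Mapping[str, Any], override_key: str, shared_key: str) -> int:
    """Return the per-arch override if set, otherwise the shared value."""
    value = buildkit.get(override_key)
    if value is None:
        value = buildkit[shared_key]
    return int(value)


def resolve_buildkit_sizing(buildkit: Mapping[str, Any], arch: str) -> dict[str, int]:
    """Resolve replicas and pods-per-node for one arch ("amd64" or "arm64")."""
    return {
        "replicas": _pick(buildkit, f"{arch}_replicas", "replicas_per_arch"),
        "pods_per_node": _pick(buildkit, f"{arch}_pods_per_node", "pods_per_node"),
    }


# Prod-like config: per-arch overrides from this PR; the shared values
# below are illustrative placeholders.
prod = {
    "replicas_per_arch": 4,
    "pods_per_node": 2,
    "amd64_replicas": 32,
    "amd64_pods_per_node": 2,
    "arm64_replicas": 8,
    "arm64_pods_per_node": 4,
}

# A cluster that sets no per-arch keys resolves to the shared values.
other = {"replicas_per_arch": 4, "pods_per_node": 2}

print(resolve_buildkit_sizing(prod, "arm64"))   # {'replicas': 8, 'pods_per_node': 4}
print(resolve_buildkit_sizing(other, "arm64"))  # {'replicas': 4, 'pods_per_node': 2}
```

Because an unset override resolves to the shared key, a cluster without per-arch keys gets the same sizing for both architectures, which is why it renders identically.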

Prod fleet is resized to:
- amd64: m6id.24xlarge, 2/node, 42 vCPU/155 GiB — 32 replicas (was 12)
- arm64: m7gd.16xlarge, 4/node, ~14 vCPU/51 GiB — 8 replicas

arm64 moves to smaller pods (m7gd.16xlarge packed 4/node ≈ the pre-OSDC
m7g.4xlarge build runner: same Graviton3 family, ~16 vCPU/64 GiB) so we run more,
smaller arm64 builders. NodePool limits scale per-arch automatically.
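
As a rough check of the sizing above, the arithmetic can be sketched like this. It assumes nodes are fully packed, so the node count is `ceil(replicas / pods_per_node)`; how the generator actually derives NodePool limits is not shown in this PR. The function names are hypothetical, and the instance sizes are the public EC2 specs for these types.

```python
import math

# Public EC2 instance sizes as (vCPU, memory in GiB).
INSTANCE_SPECS = {
    "m6id.24xlarge": (96, 384),
    "m7gd.16xlarge": (64, 256),
}


def nodes_needed(replicas: int, pods_per_node: int) -> int:
    """Nodes required if every node is packed with pods_per_node pods."""
    return math.ceil(replicas / pods_per_node)


def raw_share_per_pod(instance_type: str, pods_per_node: int) -> tuple[float, float]:
    """Even split of a node's vCPU and memory, before any system reservation."""
    vcpu, mem_gib = INSTANCE_SPECS[instance_type]
    return vcpu / pods_per_node, mem_gib / pods_per_node


print(nodes_needed(32, 2))                     # 16 amd64 nodes
print(nodes_needed(8, 4))                      # 2 arm64 nodes
print(raw_share_per_pod("m6id.24xlarge", 2))   # (48.0, 192.0)
print(raw_share_per_pod("m7gd.16xlarge", 4))   # (16.0, 64.0)
```

The raw arm64 share of 16 vCPU/64 GiB matches the m7g.4xlarge comparison above. The pod sizes quoted in this PR (42 vCPU/155 GiB and ~14 vCPU/51 GiB) are smaller than the raw shares, presumably to leave headroom on each node.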

Testing: `just test` passes (generate_buildkit.py at 97% coverage), `just lint` passes 13/13.
Signed-off-by: Huy Do <huydo@meta.com>
@huydhn huydhn requested a review from jeanschmidt as a code owner June 5, 2026 18:28
Comment thread osdc/clusters.yaml
Comment thread osdc/clusters.yaml
@huydhn huydhn requested a review from malfet June 5, 2026 18:35
@jeanschmidt

Did you test your changes in arc-staging? I suspect you would need to change smoke tests as well, don't you?

@jeanschmidt jeanschmidt left a comment

check arc-staging

huydhn commented Jun 6, 2026

> Did you test your changes in arc-staging? I suspect you would need to change smoke tests as well, don't you?

I made this on a flight, so yes, I still need to test it on staging. That's also why I haven't merged it yet :)

Comment thread osdc/clusters.yaml Outdated

huydhn commented Jun 6, 2026

(Claude here, posting on behalf of @huydhn.)

Checked: deployed to arc-staging and verified healthy. amd64 is 4/4 Running and arm64 is 2/2 Running with the correct specs (42 vCPU/155 GiB on m6id.24xlarge; 28 vCPU/102 GiB on m7gd.16xlarge), and HAProxy re-resolved all 6 backends.

Reverting the staging test config from this PR now so it only carries the per-arch mechanism + the prod change.

@huydhn huydhn requested a review from jeanschmidt June 6, 2026 01:13
@huydhn huydhn enabled auto-merge June 6, 2026 01:14
@huydhn huydhn dismissed jeanschmidt’s stale review June 6, 2026 01:15

Confirmed to work on staging

@huydhn huydhn added this pull request to the merge queue Jun 6, 2026
Merged via the queue into pytorch:main with commit ed196f0 Jun 6, 2026
11 checks passed
@huydhn huydhn deleted the buildkit-per-arch-replicas branch June 6, 2026 01:16