buildkit: deploy on meta-prod-aws-ue1 (us-east-1) #728
tofu plan: meta-prod-aws-ue1 ✅ Plan succeeded · commit Plan output
Same buildkit config as arc-cbr-production (ue2): m6id.24xlarge amd64 (2/node) + m7gd.16xlarge arm64 (4/node), autoscaling min 2/4, max 360/30, fallback 32/8.

Adds keda + buildkit to the cluster's module list (keda provides the CRDs the autoscaling needs).

ghstack-source-id: f6bf677
Pull-Request: #728
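For a rough sense of what these numbers mean in nodes, here is a minimal sketch. It assumes that "N/node" is the number of buildkit pods packed on each node and that the min/max/fallback pairs are pod counts in amd64/arm64 order; neither reading is confirmed by the config itself. The `POOLS` dict and `nodes_needed` helper are hypothetical names used only for illustration.

```python
import math

# Assumed reading of the commit message: pods per node, then pod counts
# for min / max / fallback, listed as amd64 / arm64.
POOLS = {
    "amd64": {"instance": "m6id.24xlarge", "per_node": 2, "min": 2, "max": 360, "fallback": 32},
    "arm64": {"instance": "m7gd.16xlarge", "per_node": 4, "min": 4, "max": 30, "fallback": 8},
}


def nodes_needed(pods: int, per_node: int) -> int:
    """Nodes required to host `pods` pods when `per_node` pods fit on one node."""
    return math.ceil(pods / per_node)


for arch, pool in POOLS.items():
    counts = {k: nodes_needed(pool[k], pool["per_node"]) for k in ("min", "max", "fallback")}
    print(arch, pool["instance"], counts)
```

Under these assumptions the amd64 pool tops out at 180 nodes and the arm64 pool at 8 (30 pods do not divide evenly by 4, so the last node is half full).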
tofu plan: arc-cbr-production ✅ Plan succeeded · commit Plan output
I suspect there is more to this than meets the eye: not the autoscaling part, but running Buildkit in multiple regions. @jeanschmidt @zxiiro, if you have successfully deployed Buildkit in multiple regions, let me know. I'm referring to these lines specifically: https://github.com/pytorch/ci-infra/blob/main/osdc/modules/arc-runners/defs/rel-l-x86iavx512-8-64.yaml#L11-L12
Same buildkit config as arc-cbr-production (ue2): m6id.24xlarge amd64 (2/node) + m7gd.16xlarge arm64 (4/node), autoscaling min 2/4, max 360/30, fallback 32/8.

Adds keda + buildkit to the cluster's module list (keda provides the CRDs the autoscaling needs).

ghstack-source-id: 0170c40
Pull-Request: #728
tofu plan: arc-cbr-production-uw1 ✅ Plan succeeded · commit Plan output
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):
* pytorch#727
* pytorch#728
* pytorch#726
* pytorch#725
* __->__ pytorch#724

Same min per arch as staging (amd64 2 / arm64 4). Max sized from 14-day docker-build concurrency: amd64 128 (peak 105 + headroom), arm64 16 (peak 8, likely capped by the old fixed pool).
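The sizing above can be expressed as a ratio of the chosen max to the observed peak. This is only an illustrative sketch: `headroom_ratio` is a hypothetical helper, and the commit message does not say that the maxima were derived from a fixed multiplier.

```python
def headroom_ratio(max_replicas: int, observed_peak: int) -> float:
    """Chosen max divided by the observed 14-day peak concurrency."""
    return max_replicas / observed_peak


# Figures taken from the commit message above.
print(round(headroom_ratio(128, 105), 2))  # amd64 -> 1.22
print(round(headroom_ratio(16, 8), 2))     # arm64 -> 2.0
```

The arm64 ratio is larger because the observed peak of 8 was probably limited by the old fixed pool, so the real demand may be higher than the measurement suggests.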
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):
* pytorch#727
* pytorch#728
* pytorch#726
* __->__ pytorch#725

Burst 8 parallel buildctl builds per arch (each holds a maxconn=1 slot ~10m via sleep). With amd64_min=2 they serialize ~43m > timeout 30m and fail unless KEDA scales the pool up; one wave ~18m when it does.
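The serialization estimate can be sanity-checked with a back-of-the-envelope calculation. This is a rough sketch, assuming each build holds its slot for exactly 10 minutes and ignoring scheduling, image pull, and scale-up overhead, which is why it gives lower figures than the ~43m and ~18m quoted above. `burst_duration_minutes` is a hypothetical helper, not part of the test itself.

```python
import math

TIMEOUT_MINUTES = 30


def burst_duration_minutes(builds: int, slots: int, hold_minutes: int) -> int:
    """Minutes to drain `builds` jobs over `slots` single-connection slots, in waves."""
    waves = math.ceil(builds / slots)
    return waves * hold_minutes


# 8 builds, maxconn=1 per pod, so slots == pods.
without_scaling = burst_duration_minutes(builds=8, slots=2, hold_minutes=10)
with_scaling = burst_duration_minutes(builds=8, slots=8, hold_minutes=10)

print(without_scaling, without_scaling > TIMEOUT_MINUTES)  # 40 True
print(with_scaling, with_scaling > TIMEOUT_MINUTES)        # 10 False
```

With only the two minimum pods the burst needs four waves and exceeds the 30m timeout even before overhead; with one slot per build it fits in a single wave.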
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.15.0) (oldest at bottom):
* pytorch#727
* pytorch#728
* __->__ pytorch#726
* pytorch#725

Enable the KEDA operator's Prometheus endpoint (prometheus.operator.enabled) and add a ServiceMonitor scraping keda_* (scaler/scaledobject values, errors, activity, latency) at 60s. Lets us see what KEDA reads and when it errors / falls back.
This is ready to land now, given the multi-cluster Buildkit test on staging (#667), but I will hold this PR a bit longer, until after the next prod deployment.
tofu plan: lf-prod-aws-ue1 ✅ Plan succeeded · commit Plan output
tofu plan: lf-prod-aws-ue2 ✅ Plan succeeded · commit Plan output
Stack from ghstack (oldest at bottom):
Same buildkit config as arc-cbr-production (ue2): m6id.24xlarge amd64 (2/node) + m7gd.16xlarge arm64 (4/node), autoscaling min 2/4, max 360/30, fallback 32/8.

Adds keda + buildkit to the cluster's module list (keda provides the CRDs the autoscaling needs).