BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline by huydhn · Pull Request #723 · pytorch/ci-infra

huydhn · 2026-06-10T08:52:17Z

Stack from ghstack (oldest at bottom):

Impact: OSDC arc-staging buildkit only (autoscaling is opt-in; other
clusters unchanged).
Risk: low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to
a small warm per-arch baseline when idle.

- HAProxy server maxconn 1 + timeout queue: one build per pod; excess builds
  queue and flow onto new pods as they register, instead of stacking on busy
  pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via metrics-api scraping the LB's own metrics
  (haproxy_backend_current_sessions), with no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps
  the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free
  scale-down.
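
To make the scaling rule concrete, here is a minimal sketch of the replica count an HPA-style autoscaler would ask for from a session metric. It assumes a target of one session per pod (chosen here to mirror maxconn 1; the PR's actual target value is not shown), uses the max of 8 reported in the validation comment, and ignores HPA tolerance and stabilization windows. The function and constant names are hypothetical.

```python
import math


def desired_replicas(sessions: int, target_per_pod: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Average-value scaling: ceil(metric / target), clamped to [min, max]."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(sessions / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


# Warm baselines come from the description; max_replicas=8 is an assumption
# taken from the "max-8 pool" mentioned in the validation comment.
AMD64 = {"min_replicas": 2, "max_replicas": 8}
ARM64 = {"min_replicas": 4, "max_replicas": 8}

for sessions in (0, 3, 9):
    print(sessions,
          desired_replicas(sessions, 1, **AMD64),
          desired_replicas(sessions, 1, **ARM64))
```

With an idle pool the result stays at the warm baseline, and a burst larger than the cap is held at the cap, which is what leaves the extra builds waiting.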

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node
(min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build
tolerate waiting for a pod from a cold/queued pool.
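
The pods-per-node figures translate into node counts by simple ceiling division. The sketch below works that out for the warm baseline and for a peak of 8 pods per arch (the peak is the figure reported in the validation comment, used here as an assumption); the names are illustrative only.

```python
import math

# Pods that fit on one node, per the staging description above.
PODS_PER_NODE = {"amd64": 2, "arm64": 4}


def nodes_needed(pods: dict[str, int]) -> dict[str, int]:
    """Nodes required per arch to place the given number of pods."""
    return {arch: math.ceil(n / PODS_PER_NODE[arch]) for arch, n in pods.items()}


baseline = nodes_needed({"amd64": 2, "arm64": 4})
peak = nodes_needed({"amd64": 8, "arm64": 8})
print(baseline, sum(baseline.values()))
print(peak, sum(peak.values()))
```

This gives one node per arch at baseline and 4 amd64 + 2 arm64 nodes at peak.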

Testing: Validation https://github.com/pytorch/ci-infra/actions/runs/27302743769 on pytorch-canary https://github.com/pytorch/pytorch-canary/actions/runs/27303571628 (comment below)

Signed-off-by: Huy Do <huydo@meta.com>


github-actions · 2026-06-10T08:53:32Z

tofu plan — arc-cbr-production

✅ Plan succeeded · commit ef31ab27 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.data.aws_caller_identity.current: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0e712dc7e743bbcf7]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
data.aws_availability_zones.available: Reading...
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=527854a4-e335-4f95-bc89-1321cff7a478]
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-2]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNOLQFN6MU]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-node-role:pytorch-arc-cbr-production-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-009f1fe7d56695348]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-05e96ee7cb818e5c0]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-032d4401e63f0c9b9]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0fddf2f74e7e978c7]
module.vpc.aws_eip.nat_secondary["us-east-2c-1"]: Refreshing state... [id=eipalloc-06a980076e99cda81]
module.vpc.aws_eip.nat_secondary["us-east-2b-1"]: Refreshing state... [id=eipalloc-0e67c0a8cd8c990da]
module.vpc.aws_eip.nat_secondary["us-east-2b-3"]: Refreshing state... [id=eipalloc-021ee6c9f1d20b71a]
module.vpc.aws_eip.nat_secondary["us-east-2c-3"]: Refreshing state... [id=eipalloc-0d3a71569b2f687be]
module.vpc.aws_eip.nat_secondary["us-east-2a-1"]: Refreshing state... [id=eipalloc-0f2b00a9ac31df215]
module.vpc.aws_eip.nat_secondary["us-east-2a-2"]: Refreshing state... [id=eipalloc-09b15a770e0c6d552]
module.vpc.aws_eip.nat_secondary["us-east-2b-0"]: Refreshing state... [id=eipalloc-0cead990d60ce181e]
module.vpc.aws_eip.nat_secondary["us-east-2b-4"]: Refreshing state... [id=eipalloc-0de33181548ac2e5a]
module.vpc.aws_eip.nat_secondary["us-east-2a-6"]: Refreshing state... [id=eipalloc-0113c95dbdec2f879]
module.vpc.aws_eip.nat_secondary["us-east-2b-6"]: Refreshing state... [id=eipalloc-06b7b88826199a232]
module.vpc.aws_eip.nat_secondary["us-east-2c-2"]: Refreshing state... [id=eipalloc-07cfdb2fd5dc07459]
module.vpc.aws_eip.nat_secondary["us-east-2c-0"]: Refreshing state... [id=eipalloc-03542e74755fc105b]
module.vpc.aws_eip.nat_secondary["us-east-2a-3"]: Refreshing state... [id=eipalloc-034d5e1f5a2fcb795]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-01e479dcb5aedf696]
module.vpc.aws_eip.nat_secondary["us-east-2a-0"]: Refreshing state... [id=eipalloc-086a011b3c26c0dd7]
module.vpc.aws_eip.nat_secondary["us-east-2b-5"]: Refreshing state... [id=eipalloc-0cde9a6463901f1e1]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-0a583bbbcac436ebd]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-01187bfaa68514400]
module.vpc.aws_eip.nat_secondary["us-east-2b-2"]: Refreshing state... [id=eipalloc-063bee447616351f9]
module.vpc.aws_eip.nat_secondary["us-east-2a-5"]: Refreshing state... [id=eipalloc-0bd9bf54bd6010323]
module.vpc.aws_eip.nat_secondary["us-east-2a-4"]: Refreshing state... [id=eipalloc-067d535102a61d1a8]
module.vpc.aws_eip.nat_secondary["us-east-2c-6"]: Refreshing state... [id=eipalloc-0aede78edc69cf695]
module.vpc.aws_eip.nat_secondary["us-east-2c-5"]: Refreshing state... [id=eipalloc-02825435a2786b3d8]
module.vpc.aws_eip.nat_secondary["us-east-2c-4"]: Refreshing state... [id=eipalloc-0cc3dadec18bbb3f3]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0992f582e9bf2836e]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0709abbcafa23aec0]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0577a02acde719bff]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0d34063a19f4b07b4]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0d26e280575e8aaf4]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ab11fcdb8d4ea113]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0ce4fba002d90e7d5]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-084975a7f7af2696e]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-07d5cd4c479c827ab]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-0b820cd15307b6d57]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/0A621339248958D6D5F2FF084BD185B5]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2879363015]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-0f7b8f4473e5790df]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-08e264cbbd47be1ee]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0ad75b2f5282877db]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01d38d41a7ca82a08]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0c7ecd4166a01e5f0]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0cb3785c433ed7718]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0beb143017359bda1]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-0b6e08b4b0dc968c0]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-097abe4676c74f71b]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0992f582e9bf2836e"]: Refreshing state... [id=subnet-0992f582e9bf2836e,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-01ec5f742ae028981,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0577a02acde719bff"]: Refreshing state... [id=subnet-0577a02acde719bff,karpenter.sh/discovery]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0709abbcafa23aec0"]: Refreshing state... [id=subnet-0709abbcafa23aec0,karpenter.sh/discovery]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260518021844404100000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0deb818bbf18764de]
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-0979eb5e3d9d3db9f]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260518023249955700000005]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260518023249903900000003]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260518023249929400000004]
aws_efs_mount_target.pypi_cache["subnet-0709abbcafa23aec0"]: Refreshing state... [id=fsmt-08cd5108febbacef9]
aws_efs_mount_target.pypi_cache["subnet-0992f582e9bf2836e"]: Refreshing state... [id=fsmt-03523586bb4ff0c46]
aws_efs_mount_target.pypi_cache["subnet-0577a02acde719bff"]: Refreshing state... [id=fsmt-07d7b111b9cd6684e]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

github-actions · 2026-06-10T08:54:45Z

tofu plan — arc-cbr-production-uw1

✅ Plan succeeded · commit ef31ab27 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod-uw1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production-uw1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_caller_identity.current: Reading...
data.aws_availability_zones.available: Reading...
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-s3]
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0121d1038d393182a]
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=1fb5d763-c5cd-4de5-bf40-712df992288c]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNFWBLKNFS]
module.eks.data.aws_caller_identity.current: Read complete after 1s [id=308535385114]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role:pytorch-arc-cbr-production-uw1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
data.aws_availability_zones.available: Read complete after 1s [id=us-west-1]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-uw1-eks-secrets]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-07fd8394a1d58b614]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-07b06397ce403fa53]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-0b3b22b995e71d8d9]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-05f5edbf2c6678c03]
module.vpc.aws_eip.nat_secondary["us-west-1a-4"]: Refreshing state... [id=eipalloc-0dfae88698dce850e]
module.vpc.aws_eip.nat_secondary["us-west-1c-1"]: Refreshing state... [id=eipalloc-0bd09c7f2dcaa0a46]
module.vpc.aws_eip.nat_secondary["us-west-1a-3"]: Refreshing state... [id=eipalloc-05a2bad636af56f4d]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0a13e7b49c841e497]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-06d137da3460167c4]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0a8410ffa0f0014a7]
module.vpc.aws_eip.nat_secondary["us-west-1c-2"]: Refreshing state... [id=eipalloc-0f2e15b6a36b52fac]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-08861bee27120b994]
module.vpc.aws_eip.nat_secondary["us-west-1a-6"]: Refreshing state... [id=eipalloc-08763a35db0a26caa]
module.vpc.aws_eip.nat_secondary["us-west-1a-0"]: Refreshing state... [id=eipalloc-0e3ca79e34012a238]
module.vpc.aws_eip.nat_secondary["us-west-1c-0"]: Refreshing state... [id=eipalloc-0d565f5bf077b05cf]
module.vpc.aws_eip.nat_secondary["us-west-1c-5"]: Refreshing state... [id=eipalloc-0635efedc10ee5f66]
module.vpc.aws_eip.nat_secondary["us-west-1c-6"]: Refreshing state... [id=eipalloc-0cf91a032d10f4ec5]
module.vpc.aws_eip.nat_secondary["us-west-1c-4"]: Refreshing state... [id=eipalloc-0dfaa16c61333ceb3]
module.vpc.aws_eip.nat_secondary["us-west-1a-2"]: Refreshing state... [id=eipalloc-0647e169131be5893]
module.vpc.aws_eip.nat_secondary["us-west-1c-3"]: Refreshing state... [id=eipalloc-09f89978685e7f3c7]
module.vpc.aws_eip.nat_secondary["us-west-1a-5"]: Refreshing state... [id=eipalloc-059986f686b188dc2]
module.vpc.aws_eip.nat_secondary["us-west-1a-1"]: Refreshing state... [id=eipalloc-012ac413772344fea]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0bd275a35f8e7ef65]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ce35bb011df0cfdb]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0f79a2ac72857a304]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-00184fa8d73e575c9]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-s3-20260519191031756900000001]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production-uw1]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production-uw1:vpc-cni]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production-uw1:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production-uw1:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_launch_template.base: Refreshing state... [id=lt-066ae5f473a2b07c0]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production-uw1:pytorch-arc-cbr-production-uw1-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=ab5db6c82031e2d229412c67921160a3b3af073b]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-west-1.amazonaws.com/id/ED52EC64FF5CFAB4151C6E4B5DE279BD]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production-uw1#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=3969145930]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-uw1-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production-uw1:coredns]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-uw1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1:aws-ebs-csi-driver]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-01ec520e3931f5f6a]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0c336634317cc9f35]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-06e17b37b87d890f2]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01165f36472c0a780]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0cc835aef3e3bcc21]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-02e4c54e5fa3b4f8a]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production-uw1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-scheduled-change]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-instance-state-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-west-1.amazonaws.com/308535385114/pytorch-arc-cbr-production-uw1-karpenter]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-spot-interruption]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-west-1.amazonaws.com/308535385114/pytorch-arc-cbr-production-uw1-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-scheduled-change-KarpenterScheduledChange]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-spot-interruption-KarpenterSpotInterruption]
data.terraform_remote_state.base: Read complete after 2s
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-058909cc1cdc63fad,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-08861bee27120b994"]: Refreshing state... [id=subnet-08861bee27120b994,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0a13e7b49c841e497"]: Refreshing state... [id=subnet-0a13e7b49c841e497,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-controller-20260519195229107000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production-uw1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-s3]
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-pypi-wants-collector-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0da5eaf2022d80aa0]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-01c1f3fa51705db76]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-role]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wants-collector-role-20260519200350781900000004]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-role-20260519200350777100000003]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1-efs-csi-driver-role-20260519200350826400000005]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1:aws-efs-csi-driver]
aws_efs_mount_target.pypi_cache["subnet-08861bee27120b994"]: Refreshing state... [id=fsmt-00708cc923d4d2055]
aws_efs_mount_target.pypi_cache["subnet-0a13e7b49c841e497"]: Refreshing state... [id=fsmt-089fd42858a5a85ab]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

huydhn · 2026-06-10T08:55:07Z

Integration-test validation (updated)

Correction to the earlier note below: the maxconn 1 + HAProxy timeout queue design does not let a client wait for a pod. Both docker buildx and buildctl use the moby/buildkit gRPC client, which dials with the default ~20 s MinConnectTimeout and fail-fast RPCs; neither is configurable. A connection sitting in HAProxy's queue never receives the HTTP/2 server preface, so the client gives up at ~20 s regardless of the timeout queue value. timeout queue was therefore removed, and clients now retry the connect until a pod frees up or the pool scales up. (The connectivity test uses buildctl; the scale test and prod .ci/docker/build.sh use docker buildx: same client, same retry.)
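
The client-side retry can be pictured as a bounded loop around a single connect attempt. This is a minimal sketch, not the actual runner script: `try_connect` is a hypothetical stand-in for one buildx/buildctl connect attempt that returns True only once the gRPC handshake completes (a bare TCP probe would not do, since HAProxy accepts the socket before a pod is free), and the attempt count and delay are illustrative defaults.

```python
import time
from typing import Callable


def connect_with_retry(
    try_connect: Callable[[], bool],
    max_attempts: int = 30,      # illustrative budget, not taken from the PR
    delay_s: float = 1.0,        # pause between attempts; each attempt may itself block
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call try_connect until it succeeds.

    Returns the 1-based attempt number that succeeded and raises
    TimeoutError once the attempt budget is spent.
    """
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        if attempt < max_attempts:
            sleep(delay_s)
    raise TimeoutError(f"no pod accepted the build after {max_attempts} attempts")
```

Injecting `sleep` keeps the loop testable without real waiting; a caller would pass a function that shells out to the build client and reports whether it connected.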

Latest run — 27303571628, deliberately over-subscribed: per arch, 1 connectivity build + an 8-wide scale burst = 9 builds against a max-8 pool. 18/18 jobs succeeded.

| job | result | notes |
| --- | --- | --- |
| amd64 `build` (connectivity) | ✅ 3.0 min | won the race for a pod |
| arm64 `build` (connectivity) | ✅ 11.1 min | odd-one-out: retried 30× (~10 min) until a scale peer freed a pod, then built on attempt 30 |
| amd64 `scale (1..8)` | ✅ ~11–13 min | connect (after scale-up) + ~10 min hold |
| arm64 `scale (1..8)` | ✅ ~11–13 min | same |

Validates:

Connect-retry rides out scale-up and over-subscription — the 9th build waited ~10 min for a peer to finish, reconnected, and succeeded; no false failures.
KEDA scale-up brought both arches to max off the in-cluster HAProxy session metric (the scale builds landed once pods were up).
The gate still holds: with no scale-up, the back of the burst exhausts its retry budget / the 30-min job timeout and fails.
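
The effect of over-subscription can be estimated with a small queueing sketch. It assumes all builds arrive together, each pod runs one build at a time, and every build holds its pod for the same ~10 minutes; real scale-up is gradual, so this is a simplification. The 30-minute cutoff is the job timeout mentioned above, and the pool of 2 stands for an amd64 pool stuck at its warm baseline.

```python
import heapq


def finish_times(num_builds: int, pool_size: int, hold_min: float) -> list[float]:
    """Minutes after the burst starts at which each build finishes."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    free_at = [0.0] * pool_size   # min-heap of times at which each pod is next free
    heapq.heapify(free_at)
    finished = []
    for _ in range(num_builds):
        start = heapq.heappop(free_at)   # earliest pod to become available
        end = start + hold_min
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


def timed_out(num_builds: int, pool_size: int, hold_min: float, timeout_min: float) -> int:
    """Number of builds that would still be running past the job timeout."""
    return sum(1 for t in finish_times(num_builds, pool_size, hold_min) if t > timeout_min)


print(finish_times(9, 8, 10.0), timed_out(9, 8, 10.0, 30.0))
print(finish_times(9, 2, 10.0), timed_out(9, 2, 10.0, 30.0))
```

Under these assumptions, 9 builds on 8 pods all finish within the timeout (the ninth waits one hold period), while the same burst on 2 pods pushes the last three builds past 30 minutes.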

(Earlier run 27247502866 also covered kill-free scale-down to the 2/4 baseline + WhenEmpty node-consolidation lag — still applies.)

Original note (superseded — "queue worked / no connect timeouts" was wrong)

Staging validation run

Drove a balanced burst of 8 amd64 + 8 arm64 builds against the staging pool (each held a maxconn=1 slot ~5m), the same shape as the new integration-test scale test: https://github.com/pytorch/pytorch-canary/actions/runs/27247502866 — 16/16 builds succeeded.

BuildKit nodes / pods during the run:

| Phase | amd64 pods | arm64 pods | buildkit nodes |
| --- | --- | --- | --- |
| Baseline (before) | 2 | 4 | 2 (1 amd64 + 1 arm64) |
| Mid-burst (~T+9m) | scaling 2→8 | scaling 4→8 | climbing; 1 pod already draining |
| Peak | 8 | 8 | 6 (4× `m6id.24xlarge` + 2× `m7gd.16xlarge`) |
| After (~T+18m) | back to 2 | back to 4 | trailing down |

Observations:

Queue worked as intended — every queued buildctl connected and rode the queue onto new pods as they registered; no connect timeouts, so no runner-side wait was needed for this burst.
Scale-up — KEDA brought both arches to max (8/8) off the in-cluster HAProxy session metric.
Kill-free scale-down — pods drained (preStop waited for :1234 to go idle) and returned to the 2/4 baseline with zero failed builds.
Node consolidation lag (expected) — with consolidationPolicy: WhenEmpty, survivor pods left some nodes half-full, so the node count trails the pod count back down rather than dropping immediately.

github-actions · 2026-06-10T08:55:53Z

tofu plan — meta-prod-aws-ue1

✅ Plan succeeded · commit ef31ab27 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod-ue1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (meta-prod-aws-ue1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.vpc.aws_vpc.this: Refreshing state... [id=vpc-046818728dce02486]
data.aws_availability_zones.available: Reading...
module.eks.data.aws_caller_identity.current: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=meta-prod-aws-ue1-harbor-s3]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.eks.aws_iam_role.cluster: Refreshing state... [id=meta-prod-aws-ue1-cluster-role]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=9274017b-776a-41bd-9f11-d118a1174159]
module.eks.aws_iam_role.node: Refreshing state... [id=meta-prod-aws-ue1-node-role]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNGRUDTXPT]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-1]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/meta-prod-aws-ue1-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=meta-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=meta-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=meta-prod-aws-ue1-node-role:meta-prod-aws-ue1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-0dafeb02304897431]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-0cf3d9cf37ee998b6]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-0ce44cb6446f3c1b6]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0beb5fc44f0ee165f]
module.vpc.aws_eip.nat_secondary["us-east-1b-1"]: Refreshing state... [id=eipalloc-0d095305019486ae6]
module.vpc.aws_eip.nat_secondary["us-east-1b-6"]: Refreshing state... [id=eipalloc-0f922f499d32f1368]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0f922406e02ecba1d]
module.vpc.aws_eip.nat_secondary["us-east-1c-0"]: Refreshing state... [id=eipalloc-05844040c7248f44f]
module.vpc.aws_eip.nat_secondary["us-east-1a-3"]: Refreshing state... [id=eipalloc-0bda13d7b70c00c00]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-078f44b58c8b48ade]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-07bfd0f170c3b3406]
module.vpc.aws_eip.nat_secondary["us-east-1b-4"]: Refreshing state... [id=eipalloc-0aba12aa23c11d20c]
module.vpc.aws_eip.nat_secondary["us-east-1b-5"]: Refreshing state... [id=eipalloc-0d078dc6f07628714]
module.vpc.aws_eip.nat_secondary["us-east-1c-3"]: Refreshing state... [id=eipalloc-0af54aa2e5f40dfa4]
module.vpc.aws_eip.nat_secondary["us-east-1a-4"]: Refreshing state... [id=eipalloc-09fa171393c3a7cfb]
module.vpc.aws_eip.nat_secondary["us-east-1c-2"]: Refreshing state... [id=eipalloc-025ef0e1813277c67]
module.vpc.aws_eip.nat_secondary["us-east-1b-0"]: Refreshing state... [id=eipalloc-0bcfe1f98793e1b12]
module.vpc.aws_eip.nat_secondary["us-east-1b-3"]: Refreshing state... [id=eipalloc-0c8291ee817240e1f]
module.vpc.aws_eip.nat_secondary["us-east-1c-1"]: Refreshing state... [id=eipalloc-0cb5208c5f775baf6]
module.vpc.aws_eip.nat_secondary["us-east-1c-5"]: Refreshing state... [id=eipalloc-04fe645562f597aaa]
module.vpc.aws_eip.nat_secondary["us-east-1b-2"]: Refreshing state... [id=eipalloc-0f0b720f4cca62ec7]
module.vpc.aws_eip.nat_secondary["us-east-1a-1"]: Refreshing state... [id=eipalloc-08c7bd3306cf687ca]
module.vpc.aws_eip.nat_secondary["us-east-1a-2"]: Refreshing state... [id=eipalloc-080ec4e265ebdc5ad]
module.vpc.aws_eip.nat_secondary["us-east-1c-6"]: Refreshing state... [id=eipalloc-0d22d3aa0667a1070]
module.vpc.aws_eip.nat_secondary["us-east-1c-4"]: Refreshing state... [id=eipalloc-00c5df9f3b60f353d]
module.vpc.aws_eip.nat_secondary["us-east-1a-5"]: Refreshing state... [id=eipalloc-01f89a7c130d2a810]
module.vpc.aws_eip.nat_secondary["us-east-1a-0"]: Refreshing state... [id=eipalloc-0c8a6faed0a97479d]
module.vpc.aws_eip.nat_secondary["us-east-1a-6"]: Refreshing state... [id=eipalloc-02e84a51a14c9cbda]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0eafd792589fbb363]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-00c2e2605c4dea199]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-033772b4490df1b41]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0d65ec2dd49f0d87c]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0348c5058db524cd2]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-02ce11d6646870431]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=meta-prod-aws-ue1-harbor-s3/arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-05e7e66e960593972]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-05da47c4ed26ae390]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0616491b7baeab47f]
module.eks.aws_eks_cluster.this: Refreshing state... [id=meta-prod-aws-ue1]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0cff785d8001fc914]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-025de56c0aac8d3f0]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-09414719983019b49]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-05d5b7a41aa6323ed]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-09287d705ce4a88bc]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-0c665948be8d0282e]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-09dca398d838d4247]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-02a8683fa7258f295]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-0306281246323bd27]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=meta-prod-aws-ue1:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=meta-prod-aws-ue1:kube-proxy]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=meta-prod-aws-ue1:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-043779597e3b5a7fd]
module.eks.aws_eks_node_group.base: Refreshing state... [id=meta-prod-aws-ue1:meta-prod-aws-ue1-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=b1b539daa206035ae3c3e28288b0681fa1b462f3]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/6C84A48E1BF23A027C1E78912A368743]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=meta-prod-aws-ue1#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=3022997555]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=meta-prod-aws-ue1-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=meta-prod-aws-ue1:coredns]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=meta-prod-aws-ue1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry/arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (meta-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=meta-prod-aws-ue1-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-scheduled-change]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-instance-state-change]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=meta-prod-aws-ue1-karpenter-spot-interruption]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/308535385114/meta-prod-aws-ue1-karpenter]
data.terraform_remote_state.base: Read complete after 0s
aws_ec2_tag.subnet_karpenter_discovery["subnet-02ce11d6646870431"]: Refreshing state... [id=subnet-02ce11d6646870431,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0d65ec2dd49f0d87c"]: Refreshing state... [id=subnet-0d65ec2dd49f0d87c,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=meta-prod-aws-ue1-karpenter-controller]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-016f4a0d209f3e4a9,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0348c5058db524cd2"]: Refreshing state... [id=subnet-0348c5058db524cd2,karpenter.sh/discovery]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/308535385114/meta-prod-aws-ue1-karpenter]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-karpenter-controller]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=meta-prod-aws-ue1-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-scheduled-change-KarpenterScheduledChange]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=meta-prod-aws-ue1-karpenter-rebalance-KarpenterRebalance]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=meta-prod-aws-ue1-karpenter-controller-20260528200455768400000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (meta-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-023e57b36ec1cd426]
data.terraform_remote_state.base: Read complete after 1s
aws_security_group.efs: Refreshing state... [id=sg-0bc06caa62214c9b7]
aws_iam_role.wants_collector: Refreshing state... [id=meta-prod-aws-ue1-pypi-wants-collector-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=meta-prod-aws-ue1-pypi-wheel-syncer-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1-efs-csi-driver-role]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=meta-prod-aws-ue1-pypi-wants-collector-role-20260528201106192600000004]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=meta-prod-aws-ue1-pypi-wheel-syncer-role-20260528201106257700000005]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1-efs-csi-driver-role-20260528201106116400000003]
aws_efs_mount_target.pypi_cache["subnet-0348c5058db524cd2"]: Refreshing state... [id=fsmt-0500c573cafe66133]
aws_efs_mount_target.pypi_cache["subnet-0d65ec2dd49f0d87c"]: Refreshing state... [id=fsmt-0ffaedc58eceb7749]
aws_efs_mount_target.pypi_cache["subnet-02ce11d6646870431"]: Refreshing state... [id=fsmt-06a05c001541338d2]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1:aws-efs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

[ghstack-poisoned]

huydhn · 2026-06-10T09:56:06Z

Full stack testing on staging https://github.com/pytorch/ci-infra/actions/runs/27302743769

[ghstack-poisoned]

claude · 2026-06-10T20:50:10Z

Claude finished @jeanschmidt's task in 8m 51s

PR review — BuildKit autoscaling (KEDA + HAProxy + connect-retry)

Focused on the five concerns you raised. Overall this is a well-thought-out design and the staging validation run (18/18 over-subscribed, kill-free scale-down) exercises the happy path convincingly. The findings below are mostly about observability gaps and failure-mode robustness rather than correctness of the validated path.

1. Race conditions in the scaling loop

Scaling math is sound. KEDA registers haproxy_backend_current_sessions{proxy="bk_<arch>"} as an external AverageValue metric with targetValue: 1, so desiredReplicas = ceil(total_backend_sessions / 1) → one pod per in-flight build, clamped to [min,max]. HAProxy's scur counts queued sessions too, so over-subscription pushes the metric above the live pod count and pins to max — this is what your run confirmed. Good.
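The replica math in the paragraph above can be sketched in a few lines of Python. This is a simplified model, not the HPA's actual code: it ignores the HPA tolerance band and stabilization windows, and the function and parameter names are made up for illustration.

```python
import math


def desired_replicas(
    total_sessions: int,
    target_per_pod: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """AverageValue-style target: ceil(total / target), clamped to [min, max]."""
    raw = math.ceil(total_sessions / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 9 sessions (8 running plus 1 queued) against a target of 1 and a max of 8 yields 8, which is the "pins to max" behaviour seen in the over-subscribed run; 0 sessions yields the warm minimum.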

The load signal is only as alive as the retries (generate_buildkit.py:444). Because timeout queue was removed and clients give up at ~20s, a session only contributes to scur while a client is actively connected. KEDA's metrics-api poll defaults to 30s (no pollingInterval set on the ScaledObject), and the metric is an instantaneous gauge. So a scrape can land in the gap between a client's give-up and its next retry and under-count demand. It self-corrects on the next poll and the 600s scale-down stabilization prevents flapping, so this is a latency wrinkle, not a correctness bug — but it's worth an explicit pollingInterval (e.g. 15s) to tighten the burst-detection window.

replicas omission is the right call (generate_buildkit.py:374) — leaving it out of the manifest means kubectl apply won't fight the HPA on every redeploy. ✅

2. Risk of not scaling / failing to deploy

No KEDA fallback. If the metrics endpoint is unreachable (LB pod rolling, which this PR makes more frequent via the new checksum/config roll, or a transient scrape error), the HPA goes ScalingActive=False and freezes at the current replica count: no burst scale-up exactly when you might be mid-roll. A fallback: block (e.g. failureThreshold: 3, replicas: <max>) would fail open to capacity instead of freezing. Recommend adding it.
Deploy ordering is correct: keda is listed before buildkit in clusters.yaml:184, and buildkit/deploy.sh only applies autoscaling.yaml when autoscaling.enabled (so the ScaledObject CRDs exist first). The --autoscaling requires --amd64-max/--arm64-max guard (generate_buildkit.py:498) is good defense.
Single LB is a SPOF for the whole scaling signal (haproxy.yaml:95, replicas: 1). If the LB pod is down, both data path and scale signal are down. Acceptable for staging, but note it scales the blast radius of a bad haproxy.yaml config push now that config changes roll the pod.
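To make the "fail open instead of freezing" suggestion concrete, here is a small Python model of the decision a fallback block would change. It is a sketch of the intended behaviour only: the names are hypothetical, and whether KEDA compares the failure count with `>` or `>=` is an assumption this model does not settle.

```python
from typing import Optional


def replicas_with_fallback(
    metric_replicas: Optional[int],  # None means the scrape failed
    consecutive_failures: int,
    current_replicas: int,
    failure_threshold: int,
    fallback_replicas: int,
) -> int:
    """Pick the replica count for one evaluation cycle."""
    if metric_replicas is not None:
        return metric_replicas           # healthy scrape: follow the metric
    if consecutive_failures >= failure_threshold:  # assumed comparison
        return fallback_replicas         # sustained outage: fail open to capacity
    return current_replicas              # transient error: hold steady
```

Without the fallback branch, every failed scrape falls through to "hold steady", which is the frozen state described above.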

3. Edge cases

drain.sh depends on netstat (drain-configmap.yaml:17). If netstat is absent or connections aren't where the script expects, the awk finds no match → idle increments → the pod drains immediately, silently defeating the kill-free guarantee. Your run proves netstat is present in moby/buildkit:v0.29.0 today, but this is a silent-failure dependency pinned only implicitly to the image. Worth a one-line comment noting the dependency, and ideally a fail-safe (treat "netstat missing/error" as busy, not idle).
The bk_all combined backend is not autoscaled. ScaledObjects watch bk_amd64/bk_arm64 only. Any client using the combined buildkitd.buildkit:1234 service (service.yaml:49) lands on bk_all and generates zero scale signal — and because a pod sits in both the per-arch headless service and buildkitd-pods, HAProxy accounts maxconn 1 separately per backend, so a pod could take one build via bk_amd64 and one via bk_all simultaneously, exceeding buildkitd's max-parallelism=1. Prod uses the per-arch endpoints so this is latent, but worth a note/guard since the combined endpoint is still advertised.
Whole design hinges on every client retrying. With timeout queue gone there is no server-side safety net; a consumer that doesn't implement connect-retry (anything other than the patched .ci/docker/build.sh) will hard-fail under burst. This is documented in the README, but it's a contract worth stating loudly since it's load-bearing.
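The fail-safe suggested for drain.sh comes down to one rule: "no evidence" must count as busy. The sketch below shows that decision in Python purely for readability; the real script is shell, its exact netstat invocation is not shown in this thread, and the six-column row layout assumed here (proto, recv-q, send-q, local, foreign, state) is an assumption.

```python
from typing import Optional


def pod_is_idle(netstat_output: Optional[str], port: int = 1234) -> bool:
    """Return True only when netstat ran and shows no ESTABLISHED
    connection on the local buildkitd port."""
    if netstat_output is None:
        # netstat missing or exited non-zero: fail safe, treat the pod as busy
        return False
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: proto recv-q send-q local foreign state
        if len(fields) >= 6 and fields[3].endswith(suffix) and fields[5] == "ESTABLISHED":
            return False
    return True
```

The caller would pass `None` whenever the command could not be run, so a missing binary keeps the pod alive until the grace period rather than draining it immediately.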

4. Monitoring — HAProxy ✅, KEDA ❌ (the notable gap)

HAProxy is well covered: the buildkit-haproxy ServiceMonitor already keeps haproxy_backend_current_sessions, haproxy_server_current_sessions, haproxy_server_status, haproxy_server_connection_errors_total — so the scaling input metric itself is visible in Grafana. Good.
There is no monitoring for KEDA itself. There is no ServiceMonitor for the KEDA operator / metrics-apiserver, so the answer to "do we have metrics to troubleshoot KEDA misbehaving" is currently no. The signals you'd want (keda_scaler_errors_total, keda_scaler_metrics_value, keda_scaled_object_errors) aren't scraped; troubleshooting is limited to kubectl describe scaledobject / HPA events. Given this is opt-in on one cluster that's tolerable for now, but I'd recommend adding a KEDA ServiceMonitor before widening rollout.
No alerting/PrometheusRule for "all builders pinned at max for a sustained window" (under-provisioning signal) or "ScaledObject in error". Optional, but cheap insurance.

5. Karpenter disruption wiring — this is the strongest part

On-demand only (generate_buildkit.py:385-387) removes spot-interruption disruption entirely for 120m builds. ✅
consolidationPolicy: WhenEmpty (generate_buildkit.py:366) means consolidation only removes empty nodes, so it never races a running build — at the cost of the node-count lag you already documented. ✅
PDB maxUnavailable: 1 per arch (poddisruptionbudget.yaml) correctly gates voluntary disruptions (consolidation, manual drain) through the eviction API one pod at a time, each going through the preStop drain. Selector labels (app: buildkitd, arch) match the Deployment template. ✅
The one wiring subtlety: Karpenter honors pod terminationGracePeriodSeconds (135m here) only because the NodePool does not set spec.template.spec.terminationGracePeriod. That's correct today, but it's an implicit dependency — if anyone later sets a node-level terminationGracePeriod it would cap the drain and start killing builds. Worth a comment on the NodePool noting why it's intentionally unset.
KEDA scale-down deletes pods directly (not via eviction) so it bypasses the PDB — correctly called out in the README, and the drain + grace cap is the right protection for that path. ✅

Summary

| Concern | Verdict |
| --- | --- |
| Race-condition free | Math is sound & validated; signal is retry-dependent (latency, not correctness). Add `pollingInterval`. |
| Scaling/deploy risk | Ordering correct. Add KEDA `fallback` so a metrics outage fails open, not frozen. |
| Edge cases | `drain.sh`→netstat silent-failure mode; `bk_all` unscaled + double-booking; client-retry is a hard contract. |
| Monitoring | HAProxy ✅. KEDA has no metrics scraped; main gap vs. your explicit ask. |
| Karpenter disruption | Well wired (on-demand + WhenEmpty + PDB + drain/grace). Document the implicit "node grace unset" dependency. |

None of these block the staging-only, opt-in rollout. The two I'd most want before widening beyond staging are the KEDA ServiceMonitor and the KEDA fallback block. (I did not touch the unrelated clusters.yaml:170 min-replica asymmetry @jeanschmidt raised separately.)
· branch gh/huydhn/33/head

jeanschmidt · 2026-06-10T20:58:06Z

@claude

following up on #723 (comment) and #723 (comment)

Are there anti-flap mechanisms that prevent the fast scale-down of pods/nodes (keeping them warm for a while before tearing them down)? Please explain them.

Can you look into this PR and tell me if pytorch/pytorch#186955 can safely handle a failure to schedule up and wait for 2-4 average build jobs to finish? (You can look at build jobs on pytorch/pytorch for x86-64 and aarch64.)

claude · 2026-06-10T20:58:26Z

Claude finished @jeanschmidt's task in 5m 1s

Anti-flap mechanisms & connect-retry safety

Read the full diff & changed files
Identify anti-flap mechanisms (pods & nodes)
Assess whether pytorch#186955 connect-retry can ride out scale-up + saturation
Look at build-duration data
Post findings

⚠️ Note on sources: outbound network (WebFetch / gh to pytorch/pytorch) is blocked in this sandbox, so I could not pull the 186955 diff or live ClickHouse build-duration data. The retry-budget figures below are inferred from this PR's manifests + the validation comments on this PR (huydhn's run 27303571628). I flag the two numbers you should confirm against the actual PR.

1. Anti-flap: what keeps pods/nodes warm before teardown

There are four distinct anti-flap layers, two at the pod level (KEDA/HPA) and two at the node level (Karpenter), plus the warm floor. From generate_buildkit.py:528-537 and :365-369:

Pod scale-down (KEDA ScaledObject):

cooldownPeriod: 600
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600     # 10 min: HPA takes the *highest*
        policies:                            #   recommendation over the window,
          - type: Pods                       #   so a transient dip can't trigger
            value: 1                         #   teardown
            periodSeconds: 120               # at most 1 pod removed / 2 min

stabilizationWindowSeconds: 600 — the real anti-flap. For 10 min the HPA only ever acts on the maximum desired-replica seen in the window, so a momentary drop in haproxy_backend_current_sessions never tears a pod down.
Pods: 1 / 120s — even once it does scale down, it removes one pod every 2 minutes. Draining 8→4 (arm64 baseline) is ≥ 4×120s = 8 min, after the 10-min window — so ~18 min from idle to baseline.
Scale-up uses the HPA default (fast: up to +100%/15s, no stabilization). Correct asymmetry: scale up fast, scale down slow.
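The "~18 min from idle to baseline" figure above is simple arithmetic, reproduced here as a Python sketch. It counts one full period per removal after the stabilization window, as the estimate does; the HPA may issue the first removal as soon as the window expires, so the real figure could be about one period shorter. The function name is hypothetical.

```python
import math


def idle_to_baseline_seconds(
    current: int,
    floor: int,
    window_s: int,
    pods_per_period: int,
    period_s: int,
) -> int:
    """Upper-end estimate: stabilization window, then one period per batch removed."""
    to_remove = max(current - floor, 0)
    if to_remove == 0:
        return 0  # already at the warm floor, nothing to wait for
    periods = math.ceil(to_remove / pods_per_period)
    return window_s + periods * period_s
```

With the values quoted above, arm64 (8→4) gives 600 + 4×120 = 1080 s (18 min) and amd64 (8→2) gives 600 + 6×120 = 1320 s (22 min).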

Node scale-down (Karpenter NodePool):

disruption:
  consolidationPolicy: WhenEmpty   # only consolidate a node with zero build pods
  consolidateAfter: 5m             # …and only after it's been empty 5 min
  budgets:
    - nodes: "1"                    # disrupt at most 1 node at a time

WhenEmpty never bin-packs-and-evicts a running build to free a node — a node is only a candidate once its builds are gone.
consolidateAfter: 5m is the node-level anti-flap: a freshly-emptied node sits warm 5 min, absorbing a quick re-burst before EC2 termination.

Warm floor: minReplicaCount = amd64_min: 2 / arm64_min: 4 (clusters.yaml:172-176) keeps ≥1 physical node per arch permanently up, so the common case gets a free warm pod with no scale-up at all.

🔎 One real finding: cooldownPeriod: 600 is a no-op here. KEDA's cooldownPeriod only governs the scale-to-zero transition (active→0). With minReplicaCount of 2/4 the deployment never reaches 0, so this line does nothing. Harmless (the stabilizationWindow + Pods:1/120s policy do the real work), but it reads as if it contributes to scale-down hysteresis when it doesn't. Worth a comment or removal to avoid future confusion.

(Separately, the kill-free teardown path — preStop drain + terminationGracePeriodSeconds: 8100 + PDB maxUnavailable: 1 — is about not killing an in-flight build during a scale-down/consolidation, not anti-flap. It's correctly wired.)

2. Can pytorch#186955's connect-retry safely "fail to schedule up and wait for 2–4 average builds to finish"?

The answer splits by whether the pool is below max or already at max — these are very different waits:

Case A — demand ≤ max (8): wait is for scale-up, not for builds to finish. ✅ Validated.
A new build that finds every warm pod busy retries the connect while KEDA + Karpenter bring up a fresh pod (node provision + pull + ready = a few minutes). huydhn's run shows this: the 8-wide scale burst per arch connected after scale-up ("the scale builds landed once pods were up") and 18/18 jobs succeeded. So the retry comfortably rides out a cold/scaling pool. (The 9th build in that run, which "retried 30× (~10 min) until a scale peer freed a pod, then built on attempt 30", was over max and is a Case B data point.)

Case B — demand > max (8): the queue protection is gone; you're betting retry-budget vs build-duration.
Once at maxReplicaCount: 8, KEDA cannot add pods. The (9th, 10th, …) build can only get a maxconn: 1 slot when an in-flight build finishes. Whether the client survives that is purely retry budget ≥ time-to-free-a-slot:

Observed retry budget ≈ 10 min (30 attempts × the ~20s gRPC MinConnectTimeout), hard-capped by the 30-min GHA job timeout. huydhn's note confirms the gate: "with no scale-up, the back of the burst exhausts its retry budget / the 30-min job timeout and fails."
Time to free a slot at saturation = time until the soonest of the 8 in-flight builds completes. If the 8 started staggered, this is short; if a ciflow/docker burst started them near-simultaneously, it approaches one full average build duration.

So your specific framing — "wait for 2–4 average build jobs to finish" — is the worst case of Case B, and it is the one place this design can drop a build:

If a burst pushes demand past 8 and the average docker-image build runs longer than the ~10-min / 30-attempt retry budget, the builds at the back of the burst exhaust retries before a slot frees and fail. This is by design (it's the documented gate), not a bug — but it means the system is safe for "wait for scale-up" and "wait for one soon-finishing peer," and is not guaranteed safe for "wait for 2–4 sequential average builds" unless the retry budget is widened.
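The Case B condition above is a comparison between two durations, sketched here in Python. The defaults (30 attempts, ~20 s each, 30-min job timeout) are the values inferred in this comment, not confirmed against pytorch#186955, and the function names are hypothetical.

```python
def retry_budget_seconds(attempts: int, per_attempt_s: float, job_timeout_s: float) -> float:
    """Time a client can keep retrying: attempts x timeout, capped by the job timeout."""
    return min(attempts * per_attempt_s, job_timeout_s)


def survives_wait(
    wait_s: float,
    attempts: int = 30,             # inferred from the validation run
    per_attempt_s: float = 20.0,    # approximate gRPC connect give-up
    job_timeout_s: float = 1800.0,  # 30-min job timeout
) -> bool:
    """True if a slot frees (after wait_s) before the retry budget runs out."""
    return wait_s <= retry_budget_seconds(attempts, per_attempt_s, job_timeout_s)
```

Under these assumed numbers the budget is 600 s, so a 5-minute wait for a peer succeeds and a 15-minute wait fails; raising the attempt count helps only until the 1800 s job-timeout cap takes over.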

The two numbers to confirm (I couldn't fetch them here):

Actual retry budget in pytorch#186955 — attempts × per-attempt timeout, and whether it's capped by the job timeout or a fixed attempt count. The validation implies ~30 attempts / ~10 min.
Average .ci/docker/build.sh duration for x86-64 and aarch64 with warm BuildKit + git/registry cache. (terminationGracePeriodSeconds: 8100 is sized to the 120-min worst case, not the average — don't read 120 min as typical.)

Recommendation: if you genuinely need to ride out 2–4 sequential builds at saturation (rather than rely on scale-up), the robust lever is raise *_max so the burst gets its own pods instead of queueing behind finishers — the NodePool limits already auto-size to *_max (generate_buildkit.py:617-629). Bumping the retry budget on the client helps only up to the 30-min job timeout, after which the job dies regardless. Given staging demand and the validated 18/18 at 9-vs-8 over-subscription, the current max: 8 looks adequate for now; the failure only appears if real bursts exceed 8 concurrent per arch and builds outlast retries.

Other things I noticed while tracing this

buildkitd-lb is replicas: 1 (haproxy.yaml:95) — it's a SPOF for both build routing and the scale metric (haproxy_backend_current_sessions is scraped from this one pod). When a haproxy.yaml change rolls it (the checksum/config mechanism), KEDA briefly loses its metric source and all in-flight routing blips. Acceptable for staging/opt-in, but flagging for prod.
Scale-down picks an arbitrary pod, which may be mid-build; the drain keeps it kill-free but that "scaled-down" pod can hold its node for up to 135 min (terminationGracePeriodSeconds). So node cost trails pod count down further than the WhenEmpty/consolidateAfter lag alone — expected, but compounds the "warm" period.
Deploy ordering is correct: keda precedes buildkit in clusters.yaml:181-184 (CRDs first), and deploy.sh:123-126 applies the ScaledObjects only after the rollout. Karpenter disruption wiring (SQS/spot-interruption events) is unchanged by this PR and lives in the karpenter module.

· branch gh/huydhn/33/head

huydhn · 2026-06-10T21:06:30Z

Adding a KEDA fallback is an interesting point. Let me see what we can do there; it would probably fall back to the current fixed count of 32 x86 and 8 aarch64 BuildKit pods, which has been working OK since last Friday.

For KEDA metrics, let's do that in a separate PR. KEDA should expose some that we want to collect.

jeanschmidt · 2026-06-10T21:11:33Z

The anti-flap protections are too weak for a real use case, IMO.

We should try to keep pods alive for at least 20 minutes of idling. So instead of removing one every 2 minutes, it makes much more sense to remove max(10, 20%) every 20 minutes, and only pods that have been idle for 20+ minutes. Or something similar.

On the other hand, Karpenter can be quite aggressive about scaling down nodes (as it is now). We should gate the anti-flap mechanism on pods: yes, they are less expensive to start, but losing them is more costly for jobs (loss of local decompressed layer caches).

also monitoring seems a requirement :)

huydhn · 2026-06-10T21:20:05Z

Yeah, a minimum of 20 minutes of idling can be set easily for pods by tweaking the drain.sh script. Also, we only evict pods one at a time, so with a 20-minute idle wait it would take 10 x 20 = 200 minutes to tear down 10 pods, for example, which is more than enough given how infrequently Docker images are built.
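The arithmetic above can be written as a small helper. This is only a sketch of the estimate; the function name is hypothetical, and it assumes pods are evicted strictly one at a time, each after its own full idle wait.

```python
def serial_teardown_minutes(pods_to_remove: int, idle_wait_min: int) -> int:
    """Total minutes to remove pods one at a time, each after an idle wait."""
    if pods_to_remove < 0 or idle_wait_min < 0:
        raise ValueError("counts and waits must be non-negative")
    return pods_to_remove * idle_wait_min


# The example from the comment: 10 pods with a 20-minute idle wait each.
print(serial_teardown_minutes(10, 20))  # 200
```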

[ghstack-poisoned]

huydhn · 2026-06-10T23:17:20Z

@jeanschmidt Done — scale-down now holds a pod ~20 min after it goes idle (stabilizationWindowSeconds: 1200), then sheds at most max(10 pods, 20%) per 20 min, as you suggested. Gated on pods; node churn stays with Karpenter. Monitoring is split out: KEDA operator metrics in #726, alerts in #727.
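As a rough illustration of that shedding rate, the sketch below steps a replica count down to its floor, removing at most max(10 pods, 20%) per period. The function name is hypothetical, and rounding the percentage up is an assumption about how the autoscaler applies it; this is not code from the PR.

```python
def scale_down_steps(current: int, floor: int,
                     pods_per_period: int = 10,
                     percent_per_period: int = 20) -> list[int]:
    """Replica counts after each scale-down period until the floor is hit.

    Each period may remove the larger of `pods_per_period` pods or
    `percent_per_period` percent of the replicas present at the start of
    the period (percentage rounded up, which is an assumption).
    """
    if current < floor:
        raise ValueError("current replicas must be >= floor")
    steps = [current]
    while current > floor:
        # Integer ceiling of current * percent / 100, avoiding float error.
        by_percent = (current * percent_per_period + 99) // 100
        allowed = max(pods_per_period, by_percent)
        current = max(floor, current - allowed)
        steps.append(current)
    return steps


# Staging amd64 numbers from this PR: max 8, warm baseline 2.
print(scale_down_steps(8, 2))
# A larger, purely hypothetical pool to show the percentage policy at work.
print(scale_down_steps(100, 2))
```

With the staging max of 8 and baseline of 2, the 10-pod allowance already covers the whole burst, so under these assumptions the 20-minute idle hold, rather than the per-period cap, is what paces scale-down.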

[ghstack-poisoned]

…seline

**Impact:** OSDC arc-staging buildkit only (autoscaling is opt-in; other clusters unchanged).
**Risk:** low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to a small warm per-arch baseline when idle.

- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds queue and flow onto new pods as they register, instead of stacking on busy pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics (haproxy_backend_current_sessions); no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free scale-down.

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node (min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build tolerate waiting for a pod from a cold/queued pool.

Testing: just lint 13/13, just test pass (generate_buildkit.py 98%).

Signed-off-by: Huy Do <huydo@meta.com>
ghstack-source-id: e6bb458
Pull-Request: pytorch#723

Update

43a6b39

[ghstack-poisoned]

huydhn requested a review from jeanschmidt as a code owner June 10, 2026 08:52

huydhn temporarily deployed to osdc-staging June 10, 2026 08:52 — with GitHub Actions Inactive

This was referenced Jun 10, 2026

buildkit: enable autoscaling on prod (arc-cbr-production) #724

Merged

integration-tests: add buildkit autoscaling scale test #725

Merged

huydhn temporarily deployed to osdc-staging June 10, 2026 08:53 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 08:54 — with GitHub Actions Inactive

Update

86bcea3

[ghstack-poisoned]

huydhn temporarily deployed to osdc-staging June 10, 2026 09:07 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 09:08 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 09:10 — with GitHub Actions Inactive

Update

9dc83fc

[ghstack-poisoned]

huydhn had a problem deploying to osdc-staging June 10, 2026 09:52 — with GitHub Actions Error

huydhn temporarily deployed to osdc-staging June 10, 2026 09:52 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 09:53 — with GitHub Actions Inactive

Update

16e8520

[ghstack-poisoned]

huydhn temporarily deployed to osdc-staging June 10, 2026 10:52 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 10:54 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 10:56 — with GitHub Actions Inactive

Update

3caba61

[ghstack-poisoned]

huydhn had a problem deploying to osdc-staging June 10, 2026 17:29 — with GitHub Actions Error

huydhn temporarily deployed to osdc-staging June 10, 2026 17:29 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 10, 2026 17:30 — with GitHub Actions Inactive

Update

229998c

[ghstack-poisoned]

huydhn temporarily deployed to osdc-staging June 10, 2026 18:44 — with GitHub Actions Inactive

Update

14abcdd

[ghstack-poisoned]

huydhn had a problem deploying to osdc-staging June 10, 2026 23:15 — with GitHub Actions Error

huydhn mentioned this pull request Jun 10, 2026

monitoring: scrape KEDA operator metrics #726

Merged

huydhn had a problem deploying to osdc-staging June 10, 2026 23:15 — with GitHub Actions Error

This was referenced Jun 10, 2026

monitoring: add buildkit autoscaling alerts #727

Open

buildkit: deploy on meta-prod-aws-ue1 (us-east-1) #728

Merged

Update

d6e63b6

[ghstack-poisoned]

huydhn had a problem deploying to osdc-staging June 10, 2026 23:23 — with GitHub Actions Failure

huydhn temporarily deployed to osdc-staging June 10, 2026 23:25 — with GitHub Actions Inactive

huydhn requested a review from jeanschmidt June 10, 2026 23:26

huydhn temporarily deployed to osdc-staging June 10, 2026 23:27 — with GitHub Actions Inactive

jeanschmidt approved these changes Jun 11, 2026

huydhn temporarily deployed to osdc-staging June 11, 2026 18:30 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 11, 2026 18:31 — with GitHub Actions Inactive

huydhn temporarily deployed to osdc-staging June 11, 2026 18:32 — with GitHub Actions Inactive

huydhn changed the base branch from gh/huydhn/33/base to main June 11, 2026 21:45

huydhn enabled auto-merge June 11, 2026 21:46

huydhn changed the base branch from main to gh/huydhn/33/base June 11, 2026 21:46

huydhn merged commit 323f752 into gh/huydhn/33/base Jun 11, 2026
12 of 13 checks passed

huydhn deleted the gh/huydhn/33/head branch June 11, 2026 21:46

huydhn restored the gh/huydhn/33/head branch June 11, 2026 21:59
