BuildKit autoscaling on staging: in-cluster KEDA + LB queue + warm baseline #723

Merged: huydhn merged 8 commits into gh/huydhn/33/base from gh/huydhn/33/head on Jun 11, 2026
huydhn commented Jun 10, 2026

Contributor

Stack from ghstack (oldest at bottom):

Impact: OSDC arc-staging buildkit only (autoscaling is opt-in; other
clusters unchanged).
Risk: low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to
a small warm per-arch baseline when idle.

  • HAProxy server maxconn 1 + timeout queue: one build per pod; excess builds
    queue and flow onto new pods as they register, instead of stacking on busy
    pods (so scaled-up pods don't sit idle).
  • KEDA ScaledObject per arch via metrics-api scraping the LB's own metrics
    (haproxy_backend_current_sessions) — no Grafana / external metrics backend.
  • Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps
    the burst and sizes the NodePool limits.
  • preStop drain + PDB + long terminationGracePeriodSeconds for kill-free
    scale-down.
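
The scaling rule behind the ScaledObject bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's configuration: it assumes the autoscaler applies the usual ceil(metric / target) rule with a target of one session per pod, and it borrows a max of 8 from the max-8 pool mentioned in the validation comment. The function name and the example values are hypothetical.

```python
import math


def desired_replicas(current_sessions: int, target_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Clamp ceil(sessions / target) into [min_replicas, max_replicas]."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(current_sessions / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


# Assumed values: one build per pod, amd64 warm baseline of 2, cap of 8.
print(desired_replicas(0, 1, 2, 8))  # idle: stays at the warm baseline -> 2
print(desired_replicas(5, 1, 2, 8))  # mid-burst -> 5
print(desired_replicas(9, 1, 2, 8))  # over-subscribed: capped -> 8
```

The clamp is what keeps the warm baseline alive when the metric reads zero and what stops a burst from growing past the NodePool limit.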

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node
(min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build
tolerate waiting for a pod from a cold/queued pool.
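
The node arithmetic implied by the per-node densities above can be checked with a short sketch. The densities (2 amd64 pods per node, 4 arm64 pods per node) and the baselines come from this description; the peak of 8 pods per arch is taken from the validation comment, and the helper name is hypothetical.

```python
import math

# Pods per node, as stated for staging in this PR.
PODS_PER_NODE = {"amd64": 2, "arm64": 4}


def nodes_needed(arch: str, pods: int) -> int:
    """Nodes required to host `pods` BuildKit pods of one architecture."""
    return math.ceil(pods / PODS_PER_NODE[arch])


for arch, baseline in (("amd64", 2), ("arm64", 4)):
    print(arch,
          "baseline nodes:", nodes_needed(arch, baseline),
          "nodes at 8 pods:", nodes_needed(arch, 8))
```

Under these assumptions each baseline fits on one node, and a full burst needs 4 amd64 nodes plus 2 arm64 nodes.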

Testing: validation run https://github.com/pytorch/ci-infra/actions/runs/27302743769, exercised on pytorch-canary in https://github.com/pytorch/pytorch-canary/actions/runs/27303571628 (see the comment below).

Signed-off-by: Huy Do <huydo@meta.com>

[ghstack-poisoned]

github-actions Bot commented Jun 10, 2026

tofu plan — arc-cbr-production

✅ Plan succeeded · commit ef31ab27 · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.data.aws_caller_identity.current: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0e712dc7e743bbcf7]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
data.aws_availability_zones.available: Reading...
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=527854a4-e335-4f95-bc89-1321cff7a478]
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-2]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNOLQFN6MU]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-node-role:pytorch-arc-cbr-production-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-009f1fe7d56695348]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-05e96ee7cb818e5c0]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-032d4401e63f0c9b9]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0fddf2f74e7e978c7]
module.vpc.aws_eip.nat_secondary["us-east-2c-1"]: Refreshing state... [id=eipalloc-06a980076e99cda81]
module.vpc.aws_eip.nat_secondary["us-east-2b-1"]: Refreshing state... [id=eipalloc-0e67c0a8cd8c990da]
module.vpc.aws_eip.nat_secondary["us-east-2b-3"]: Refreshing state... [id=eipalloc-021ee6c9f1d20b71a]
module.vpc.aws_eip.nat_secondary["us-east-2c-3"]: Refreshing state... [id=eipalloc-0d3a71569b2f687be]
module.vpc.aws_eip.nat_secondary["us-east-2a-1"]: Refreshing state... [id=eipalloc-0f2b00a9ac31df215]
module.vpc.aws_eip.nat_secondary["us-east-2a-2"]: Refreshing state... [id=eipalloc-09b15a770e0c6d552]
module.vpc.aws_eip.nat_secondary["us-east-2b-0"]: Refreshing state... [id=eipalloc-0cead990d60ce181e]
module.vpc.aws_eip.nat_secondary["us-east-2b-4"]: Refreshing state... [id=eipalloc-0de33181548ac2e5a]
module.vpc.aws_eip.nat_secondary["us-east-2a-6"]: Refreshing state... [id=eipalloc-0113c95dbdec2f879]
module.vpc.aws_eip.nat_secondary["us-east-2b-6"]: Refreshing state... [id=eipalloc-06b7b88826199a232]
module.vpc.aws_eip.nat_secondary["us-east-2c-2"]: Refreshing state... [id=eipalloc-07cfdb2fd5dc07459]
module.vpc.aws_eip.nat_secondary["us-east-2c-0"]: Refreshing state... [id=eipalloc-03542e74755fc105b]
module.vpc.aws_eip.nat_secondary["us-east-2a-3"]: Refreshing state... [id=eipalloc-034d5e1f5a2fcb795]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-01e479dcb5aedf696]
module.vpc.aws_eip.nat_secondary["us-east-2a-0"]: Refreshing state... [id=eipalloc-086a011b3c26c0dd7]
module.vpc.aws_eip.nat_secondary["us-east-2b-5"]: Refreshing state... [id=eipalloc-0cde9a6463901f1e1]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-0a583bbbcac436ebd]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-01187bfaa68514400]
module.vpc.aws_eip.nat_secondary["us-east-2b-2"]: Refreshing state... [id=eipalloc-063bee447616351f9]
module.vpc.aws_eip.nat_secondary["us-east-2a-5"]: Refreshing state... [id=eipalloc-0bd9bf54bd6010323]
module.vpc.aws_eip.nat_secondary["us-east-2a-4"]: Refreshing state... [id=eipalloc-067d535102a61d1a8]
module.vpc.aws_eip.nat_secondary["us-east-2c-6"]: Refreshing state... [id=eipalloc-0aede78edc69cf695]
module.vpc.aws_eip.nat_secondary["us-east-2c-5"]: Refreshing state... [id=eipalloc-02825435a2786b3d8]
module.vpc.aws_eip.nat_secondary["us-east-2c-4"]: Refreshing state... [id=eipalloc-0cc3dadec18bbb3f3]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0992f582e9bf2836e]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0709abbcafa23aec0]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0577a02acde719bff]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0d34063a19f4b07b4]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0d26e280575e8aaf4]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ab11fcdb8d4ea113]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0ce4fba002d90e7d5]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-084975a7f7af2696e]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-07d5cd4c479c827ab]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-0b820cd15307b6d57]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/0A621339248958D6D5F2FF084BD185B5]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2879363015]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-0f7b8f4473e5790df]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-08e264cbbd47be1ee]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0ad75b2f5282877db]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01d38d41a7ca82a08]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0c7ecd4166a01e5f0]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0cb3785c433ed7718]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0beb143017359bda1]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-0b6e08b4b0dc968c0]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-097abe4676c74f71b]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0992f582e9bf2836e"]: Refreshing state... [id=subnet-0992f582e9bf2836e,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-01ec5f742ae028981,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0577a02acde719bff"]: Refreshing state... [id=subnet-0577a02acde719bff,karpenter.sh/discovery]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0709abbcafa23aec0"]: Refreshing state... [id=subnet-0709abbcafa23aec0,karpenter.sh/discovery]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260518021844404100000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0deb818bbf18764de]
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-0979eb5e3d9d3db9f]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260518023249955700000005]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260518023249903900000003]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260518023249929400000004]
aws_efs_mount_target.pypi_cache["subnet-0709abbcafa23aec0"]: Refreshing state... [id=fsmt-08cd5108febbacef9]
aws_efs_mount_target.pypi_cache["subnet-0992f582e9bf2836e"]: Refreshing state... [id=fsmt-03523586bb4ff0c46]
aws_efs_mount_target.pypi_cache["subnet-0577a02acde719bff"]: Refreshing state... [id=fsmt-07d7b111b9cd6684e]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

github-actions Bot commented Jun 10, 2026

tofu plan — arc-cbr-production-uw1

✅ Plan succeeded · commit ef31ab27 · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod-uw1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production-uw1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_caller_identity.current: Reading...
data.aws_availability_zones.available: Reading...
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-s3]
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0121d1038d393182a]
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=1fb5d763-c5cd-4de5-bf40-712df992288c]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNFWBLKNFS]
module.eks.data.aws_caller_identity.current: Read complete after 1s [id=308535385114]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role:pytorch-arc-cbr-production-uw1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
data.aws_availability_zones.available: Read complete after 1s [id=us-west-1]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-uw1-eks-secrets]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-07fd8394a1d58b614]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-07b06397ce403fa53]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-0b3b22b995e71d8d9]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-05f5edbf2c6678c03]
module.vpc.aws_eip.nat_secondary["us-west-1a-4"]: Refreshing state... [id=eipalloc-0dfae88698dce850e]
module.vpc.aws_eip.nat_secondary["us-west-1c-1"]: Refreshing state... [id=eipalloc-0bd09c7f2dcaa0a46]
module.vpc.aws_eip.nat_secondary["us-west-1a-3"]: Refreshing state... [id=eipalloc-05a2bad636af56f4d]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0a13e7b49c841e497]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-06d137da3460167c4]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0a8410ffa0f0014a7]
module.vpc.aws_eip.nat_secondary["us-west-1c-2"]: Refreshing state... [id=eipalloc-0f2e15b6a36b52fac]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-08861bee27120b994]
module.vpc.aws_eip.nat_secondary["us-west-1a-6"]: Refreshing state... [id=eipalloc-08763a35db0a26caa]
module.vpc.aws_eip.nat_secondary["us-west-1a-0"]: Refreshing state... [id=eipalloc-0e3ca79e34012a238]
module.vpc.aws_eip.nat_secondary["us-west-1c-0"]: Refreshing state... [id=eipalloc-0d565f5bf077b05cf]
module.vpc.aws_eip.nat_secondary["us-west-1c-5"]: Refreshing state... [id=eipalloc-0635efedc10ee5f66]
module.vpc.aws_eip.nat_secondary["us-west-1c-6"]: Refreshing state... [id=eipalloc-0cf91a032d10f4ec5]
module.vpc.aws_eip.nat_secondary["us-west-1c-4"]: Refreshing state... [id=eipalloc-0dfaa16c61333ceb3]
module.vpc.aws_eip.nat_secondary["us-west-1a-2"]: Refreshing state... [id=eipalloc-0647e169131be5893]
module.vpc.aws_eip.nat_secondary["us-west-1c-3"]: Refreshing state... [id=eipalloc-09f89978685e7f3c7]
module.vpc.aws_eip.nat_secondary["us-west-1a-5"]: Refreshing state... [id=eipalloc-059986f686b188dc2]
module.vpc.aws_eip.nat_secondary["us-west-1a-1"]: Refreshing state... [id=eipalloc-012ac413772344fea]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0bd275a35f8e7ef65]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ce35bb011df0cfdb]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0f79a2ac72857a304]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-00184fa8d73e575c9]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-s3-20260519191031756900000001]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production-uw1]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production-uw1:vpc-cni]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production-uw1:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production-uw1:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_launch_template.base: Refreshing state... [id=lt-066ae5f473a2b07c0]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production-uw1:pytorch-arc-cbr-production-uw1-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=ab5db6c82031e2d229412c67921160a3b3af073b]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-west-1.amazonaws.com/id/ED52EC64FF5CFAB4151C6E4B5DE279BD]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production-uw1#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=3969145930]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-uw1-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production-uw1:coredns]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-uw1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1:aws-ebs-csi-driver]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-01ec520e3931f5f6a]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0c336634317cc9f35]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-06e17b37b87d890f2]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01165f36472c0a780]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0cc835aef3e3bcc21]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-02e4c54e5fa3b4f8a]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production-uw1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-scheduled-change]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-instance-state-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-west-1.amazonaws.com/308535385114/pytorch-arc-cbr-production-uw1-karpenter]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-spot-interruption]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-west-1.amazonaws.com/308535385114/pytorch-arc-cbr-production-uw1-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-scheduled-change-KarpenterScheduledChange]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-spot-interruption-KarpenterSpotInterruption]
data.terraform_remote_state.base: Read complete after 2s
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-058909cc1cdc63fad,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-08861bee27120b994"]: Refreshing state... [id=subnet-08861bee27120b994,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0a13e7b49c841e497"]: Refreshing state... [id=subnet-0a13e7b49c841e497,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-controller-20260519195229107000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production-uw1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-s3]
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-pypi-wants-collector-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0da5eaf2022d80aa0]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-01c1f3fa51705db76]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-role]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wants-collector-role-20260519200350781900000004]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-role-20260519200350777100000003]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1-efs-csi-driver-role-20260519200350826400000005]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1:aws-efs-csi-driver]
aws_efs_mount_target.pypi_cache["subnet-08861bee27120b994"]: Refreshing state... [id=fsmt-00708cc923d4d2055]
aws_efs_mount_target.pypi_cache["subnet-0a13e7b49c841e497"]: Refreshing state... [id=fsmt-089fd42858a5a85ab]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

huydhn commented Jun 10, 2026

Contributor Author

Integration-test validation (updated)

Correction to the earlier note below: the maxconn 1 + HAProxy timeout queue design does not let a client wait for a pod. Both docker buildx and buildctl use the moby/buildkit gRPC client, which dials with the default ~20 s MinConnectTimeout and fail-fast RPCs; neither is configurable. A connection sitting in HAProxy's queue never receives the HTTP/2 server preface, so the client gives up at ~20 s regardless of timeout queue. timeout queue was therefore removed, and clients now retry the connect until a pod frees up or the pool scales up. (The connectivity test uses buildctl; the scale test and prod .ci/docker/build.sh use docker buildx. Both share the same client and the same retry.)
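
The retry-until-a-pod-frees behaviour can be sketched as below. This is an illustration only, not the runner-side change itself: the function name, the attempt budget, and the injectable `connect` and `sleep` callables are all hypothetical, chosen so the loop can be exercised without a real BuildKit endpoint.

```python
import time
from typing import Callable


def connect_with_retry(connect: Callable[[], bool], max_attempts: int,
                       delay_s: float,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `connect` until it returns True; return the 1-based attempt number.

    Raises TimeoutError once the retry budget is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        if attempt < max_attempts:
            sleep(delay_s)  # wait before the next attempt
    raise TimeoutError(f"no pod available after {max_attempts} attempts")


# Demo with a fake pool that frees a pod on the third attempt (no real sleeping).
outcomes = iter([False, False, True])
print(connect_with_retry(lambda: next(outcomes), max_attempts=5,
                         delay_s=20.0, sleep=lambda _s: None))  # prints 3
```

A bounded budget matters here: without it, a build that can never get a pod would hang until the job timeout instead of failing with a clear error.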

Latest run 27303571628, deliberately over-subscribed: per arch, 1 connectivity build + an 8-wide scale burst = 9 builds against a max-8 pool. 18/18 jobs succeeded.

| job | result | notes |
|---|---|---|
| amd64 build (connectivity) | ✅ 3.0 min | won the race for a pod |
| arm64 build (connectivity) | ✅ 11.1 min | odd-one-out: retried 30× (~10 min) until a scale peer freed a pod, then built on attempt 30 |
| amd64 scale (1..8) | ✅ ~11–13 min | connect (after scale-up) + ~10 min hold |
| arm64 scale (1..8) | ✅ ~11–13 min | same |

Validates:

  • Connect-retry rides out scale-up and over-subscription — the 9th build waited ~10 min for a peer to finish, reconnected, and succeeded; no false failures.
  • KEDA scale-up brought both arches to max off the in-cluster HAProxy session metric (the scale builds landed once pods were up).
  • The gate still holds: with no scale-up, the back of the burst exhausts its retry budget / the 30-min job timeout and fails.
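
The over-subscription case (9 builds against 8 one-build-per-pod slots) can be modelled with a small simulation. It is a simplification under stated assumptions: all pods are already up at time zero, every build holds its pod for the same duration, and scale-up latency is ignored. The function name is hypothetical.

```python
import heapq


def start_times(num_builds: int, pods: int, hold_min: float) -> list[float]:
    """Start time (minutes) of each build when every pod runs one build at a time."""
    free_at = [0.0] * pods           # min-heap: when each pod next becomes free
    heapq.heapify(free_at)
    starts = []
    for _ in range(num_builds):
        t = heapq.heappop(free_at)   # earliest pod to become free
        starts.append(t)
        heapq.heappush(free_at, t + hold_min)
    return starts


# 9 builds, 8 pods, ~10 min hold: the ninth build starts only after a peer finishes.
print(start_times(9, 8, 10.0))
```

In this model the first eight builds start immediately and the ninth starts at minute 10, which is consistent with the connectivity build that waited about 10 minutes for a peer to free a pod.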

(Earlier run 27247502866 also covered kill-free scale-down to the 2/4 baseline + WhenEmpty node-consolidation lag — still applies.)


Original note (superseded — "queue worked / no connect timeouts" was wrong)

Staging validation run

Drove a balanced burst of 8 amd64 + 8 arm64 builds against the staging pool (each held a maxconn=1 slot ~5m), the same shape as the new integration-test scale test: https://github.com/pytorch/pytorch-canary/actions/runs/27247502866. 16/16 builds succeeded.

BuildKit nodes / pods during the run:

| Phase | amd64 pods | arm64 pods | buildkit nodes |
| --- | --- | --- | --- |
| Baseline (before) | 2 | 4 | 2 (1 amd64 + 1 arm64) |
| Mid-burst (~T+9m) | scaling 2→8 | scaling 4→8 | climbing; 1 pod already draining |
| Peak | 8 | 8 | 6 (4× m6id.24xlarge + 2× m7gd.16xlarge) |
| After (~T+18m) | back to 2 | back to 4 | trailing down |

Observations:

  • Queue worked as intended — every queued buildctl connected and rode the queue onto new pods as they registered; no connect timeouts, so no runner-side wait was needed for this burst.
  • Scale-up — KEDA brought both arches to max (8/8) off the in-cluster HAProxy session metric.
  • Kill-free scale-down — pods drained (preStop waited for :1234 to go idle) and returned to the 2/4 baseline with zero failed builds.
  • Node consolidation lag (expected) — with consolidationPolicy: WhenEmpty, survivor pods left some nodes half-full, so the node count trails the pod count back down rather than dropping immediately.

@github-actions

github-actions Bot commented Jun 10, 2026


tofu plan — meta-prod-aws-ue1

✅ Plan succeeded · commit ef31ab27 · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod-ue1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (meta-prod-aws-ue1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.vpc.aws_vpc.this: Refreshing state... [id=vpc-046818728dce02486]
data.aws_availability_zones.available: Reading...
module.eks.data.aws_caller_identity.current: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=meta-prod-aws-ue1-harbor-s3]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.eks.aws_iam_role.cluster: Refreshing state... [id=meta-prod-aws-ue1-cluster-role]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=9274017b-776a-41bd-9f11-d118a1174159]
module.eks.aws_iam_role.node: Refreshing state... [id=meta-prod-aws-ue1-node-role]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNGRUDTXPT]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-1]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/meta-prod-aws-ue1-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=meta-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=meta-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=meta-prod-aws-ue1-node-role:meta-prod-aws-ue1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-0dafeb02304897431]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-0cf3d9cf37ee998b6]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-0ce44cb6446f3c1b6]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0beb5fc44f0ee165f]
module.vpc.aws_eip.nat_secondary["us-east-1b-1"]: Refreshing state... [id=eipalloc-0d095305019486ae6]
module.vpc.aws_eip.nat_secondary["us-east-1b-6"]: Refreshing state... [id=eipalloc-0f922f499d32f1368]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0f922406e02ecba1d]
module.vpc.aws_eip.nat_secondary["us-east-1c-0"]: Refreshing state... [id=eipalloc-05844040c7248f44f]
module.vpc.aws_eip.nat_secondary["us-east-1a-3"]: Refreshing state... [id=eipalloc-0bda13d7b70c00c00]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-078f44b58c8b48ade]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-07bfd0f170c3b3406]
module.vpc.aws_eip.nat_secondary["us-east-1b-4"]: Refreshing state... [id=eipalloc-0aba12aa23c11d20c]
module.vpc.aws_eip.nat_secondary["us-east-1b-5"]: Refreshing state... [id=eipalloc-0d078dc6f07628714]
module.vpc.aws_eip.nat_secondary["us-east-1c-3"]: Refreshing state... [id=eipalloc-0af54aa2e5f40dfa4]
module.vpc.aws_eip.nat_secondary["us-east-1a-4"]: Refreshing state... [id=eipalloc-09fa171393c3a7cfb]
module.vpc.aws_eip.nat_secondary["us-east-1c-2"]: Refreshing state... [id=eipalloc-025ef0e1813277c67]
module.vpc.aws_eip.nat_secondary["us-east-1b-0"]: Refreshing state... [id=eipalloc-0bcfe1f98793e1b12]
module.vpc.aws_eip.nat_secondary["us-east-1b-3"]: Refreshing state... [id=eipalloc-0c8291ee817240e1f]
module.vpc.aws_eip.nat_secondary["us-east-1c-1"]: Refreshing state... [id=eipalloc-0cb5208c5f775baf6]
module.vpc.aws_eip.nat_secondary["us-east-1c-5"]: Refreshing state... [id=eipalloc-04fe645562f597aaa]
module.vpc.aws_eip.nat_secondary["us-east-1b-2"]: Refreshing state... [id=eipalloc-0f0b720f4cca62ec7]
module.vpc.aws_eip.nat_secondary["us-east-1a-1"]: Refreshing state... [id=eipalloc-08c7bd3306cf687ca]
module.vpc.aws_eip.nat_secondary["us-east-1a-2"]: Refreshing state... [id=eipalloc-080ec4e265ebdc5ad]
module.vpc.aws_eip.nat_secondary["us-east-1c-6"]: Refreshing state... [id=eipalloc-0d22d3aa0667a1070]
module.vpc.aws_eip.nat_secondary["us-east-1c-4"]: Refreshing state... [id=eipalloc-00c5df9f3b60f353d]
module.vpc.aws_eip.nat_secondary["us-east-1a-5"]: Refreshing state... [id=eipalloc-01f89a7c130d2a810]
module.vpc.aws_eip.nat_secondary["us-east-1a-0"]: Refreshing state... [id=eipalloc-0c8a6faed0a97479d]
module.vpc.aws_eip.nat_secondary["us-east-1a-6"]: Refreshing state... [id=eipalloc-02e84a51a14c9cbda]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0eafd792589fbb363]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-00c2e2605c4dea199]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-033772b4490df1b41]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0d65ec2dd49f0d87c]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0348c5058db524cd2]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-02ce11d6646870431]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=meta-prod-aws-ue1-harbor-s3/arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-05e7e66e960593972]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-05da47c4ed26ae390]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0616491b7baeab47f]
module.eks.aws_eks_cluster.this: Refreshing state... [id=meta-prod-aws-ue1]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0cff785d8001fc914]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-025de56c0aac8d3f0]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-09414719983019b49]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-05d5b7a41aa6323ed]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-09287d705ce4a88bc]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-0c665948be8d0282e]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-09dca398d838d4247]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-02a8683fa7258f295]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-0306281246323bd27]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=meta-prod-aws-ue1:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=meta-prod-aws-ue1:kube-proxy]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=meta-prod-aws-ue1:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-043779597e3b5a7fd]
module.eks.aws_eks_node_group.base: Refreshing state... [id=meta-prod-aws-ue1:meta-prod-aws-ue1-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=b1b539daa206035ae3c3e28288b0681fa1b462f3]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/6C84A48E1BF23A027C1E78912A368743]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=meta-prod-aws-ue1#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=3022997555]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=meta-prod-aws-ue1-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=meta-prod-aws-ue1:coredns]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=meta-prod-aws-ue1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry/arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (meta-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=meta-prod-aws-ue1-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-scheduled-change]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-instance-state-change]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=meta-prod-aws-ue1-karpenter-spot-interruption]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/308535385114/meta-prod-aws-ue1-karpenter]
data.terraform_remote_state.base: Read complete after 0s
aws_ec2_tag.subnet_karpenter_discovery["subnet-02ce11d6646870431"]: Refreshing state... [id=subnet-02ce11d6646870431,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0d65ec2dd49f0d87c"]: Refreshing state... [id=subnet-0d65ec2dd49f0d87c,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=meta-prod-aws-ue1-karpenter-controller]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-016f4a0d209f3e4a9,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0348c5058db524cd2"]: Refreshing state... [id=subnet-0348c5058db524cd2,karpenter.sh/discovery]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/308535385114/meta-prod-aws-ue1-karpenter]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-karpenter-controller]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=meta-prod-aws-ue1-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-scheduled-change-KarpenterScheduledChange]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=meta-prod-aws-ue1-karpenter-rebalance-KarpenterRebalance]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=meta-prod-aws-ue1-karpenter-controller-20260528200455768400000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (meta-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-023e57b36ec1cd426]
data.terraform_remote_state.base: Read complete after 1s
aws_security_group.efs: Refreshing state... [id=sg-0bc06caa62214c9b7]
aws_iam_role.wants_collector: Refreshing state... [id=meta-prod-aws-ue1-pypi-wants-collector-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=meta-prod-aws-ue1-pypi-wheel-syncer-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1-efs-csi-driver-role]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=meta-prod-aws-ue1-pypi-wants-collector-role-20260528201106192600000004]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=meta-prod-aws-ue1-pypi-wheel-syncer-role-20260528201106257700000005]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1-efs-csi-driver-role-20260528201106116400000003]
aws_efs_mount_target.pypi_cache["subnet-0348c5058db524cd2"]: Refreshing state... [id=fsmt-0500c573cafe66133]
aws_efs_mount_target.pypi_cache["subnet-0d65ec2dd49f0d87c"]: Refreshing state... [id=fsmt-0ffaedc58eceb7749]
aws_efs_mount_target.pypi_cache["subnet-02ce11d6646870431"]: Refreshing state... [id=fsmt-06a05c001541338d2]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1:aws-efs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

[ghstack-poisoned]
[ghstack-poisoned]
@huydhn

huydhn commented Jun 10, 2026

Contributor Author

Full stack testing on staging https://github.com/pytorch/ci-infra/actions/runs/27302743769

[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@claude

claude Bot commented Jun 10, 2026


Claude finished @jeanschmidt's task in 8m 51s


PR review — BuildKit autoscaling (KEDA + HAProxy + connect-retry)

Focused on the five concerns you raised. Overall this is a well-thought-out design and the staging validation run (18/18 over-subscribed, kill-free scale-down) exercises the happy path convincingly. The findings below are mostly about observability gaps and failure-mode robustness rather than correctness of the validated path.


1. Race conditions in the scaling loop

Scaling math is sound. KEDA registers haproxy_backend_current_sessions{proxy="bk_<arch>"} as an external AverageValue metric with targetValue: 1, so desiredReplicas = ceil(total_backend_sessions / 1) → one pod per in-flight build, clamped to [min,max]. HAProxy's scur counts queued sessions too, so over-subscription pushes the metric above the live pod count and pins to max — this is what your run confirmed. Good.
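The scaling rule in the paragraph above can be written out directly. This is a sketch of the arithmetic only, not KEDA's or the HPA's actual code; the function and parameter names are chosen for this example.

```python
import math


def desired_replicas(
    total_sessions: int,
    target_per_pod: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """AverageValue-style target: ceil(total / target), clamped to [min, max]."""
    raw = math.ceil(total_sessions / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))


# 9 sessions (8 running + 1 queued) against a max-8 pool pins the pool at max.
print(desired_replicas(total_sessions=9, target_per_pod=1, min_replicas=2, max_replicas=8))
# An idle arm64 pool stays at its warm floor of 4.
print(desired_replicas(total_sessions=0, target_per_pod=1, min_replicas=4, max_replicas=8))
```

Because queued sessions count toward the total, any over-subscription pushes the raw value above the live pod count, which is why the clamp to the maximum is what the validation run observed.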

The load signal is only as alive as the retries (generate_buildkit.py:444). Because timeout queue was removed and clients give up at ~20s, a session only contributes to scur while a client is actively connected. KEDA's metrics-api poll defaults to 30s (no pollingInterval set on the ScaledObject), and the metric is an instantaneous gauge. So a scrape can land in the gap between a client's give-up and its next retry and under-count demand. It self-corrects on the next poll and the 600s scale-down stabilization prevents flapping, so this is a latency wrinkle, not a correctness bug — but it's worth an explicit pollingInterval (e.g. 15s) to tighten the burst-detection window.

replicas omission is the right call (generate_buildkit.py:374) — leaving it out of the manifest means kubectl apply won't fight the HPA on every redeploy. ✅


2. Risk of not scaling / failing to deploy

  • No KEDA fallback. If the metrics endpoint is unreachable (LB pod rolling, which this PR makes more frequent via the new checksum/config roll, or a transient scrape error), the HPA goes ScalingActive=False and freezes at the current replica count: no burst scale-up exactly when you might be mid-roll. A fallback: block (e.g. failureThreshold: 3, replicas: <max>) would fail open to capacity instead of freezing. Recommend adding it.
  • Deploy ordering is correct: keda is listed before buildkit in clusters.yaml:184, and buildkit/deploy.sh only applies autoscaling.yaml when autoscaling.enabled (so the ScaledObject CRDs exist first). The --autoscaling requires --amd64-max/--arm64-max guard (generate_buildkit.py:498) is good defense.
  • Single LB is a SPOF for the whole scaling signal (haproxy.yaml:95, replicas: 1). If the LB pod is down, both data path and scale signal are down. Acceptable for staging, but note it scales the blast radius of a bad haproxy.yaml config push now that config changes roll the pod.
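As a rough illustration of the fallback recommendation in the first bullet above, the block could be attached to a ScaledObject manifest built as a Python dict. The helper name `with_fallback` and the sample object name `buildkitd-amd64` are hypothetical and are not taken from generate_buildkit.py; `failureThreshold` and `replicas` are the two fields named in the review.

```python
import copy


def with_fallback(scaled_object: dict, replicas: int, failure_threshold: int = 3) -> dict:
    """Return a copy of the manifest with a spec.fallback block added.

    After failure_threshold consecutive failed metric reads, the
    autoscaler holds the target at `replicas` instead of freezing.
    """
    patched = copy.deepcopy(scaled_object)
    patched.setdefault("spec", {})["fallback"] = {
        "failureThreshold": failure_threshold,
        "replicas": replicas,
    }
    return patched


# Hypothetical minimal manifest, for illustration only.
amd64 = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "buildkitd-amd64"},
    "spec": {"minReplicaCount": 2, "maxReplicaCount": 8},
}
patched = with_fallback(amd64, replicas=amd64["spec"]["maxReplicaCount"])
print(patched["spec"]["fallback"])
```

Failing open to the maximum trades some idle cost during a metrics outage for the guarantee that a burst arriving mid-roll still finds capacity.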

3. Edge cases

  • drain.sh depends on netstat (drain-configmap.yaml:17). If netstat is absent or connections aren't where the script expects, the awk finds no match → idle increments → the pod drains immediately, silently defeating the kill-free guarantee. Your run proves netstat is present in moby/buildkit:v0.29.0 today, but this is a silent-failure dependency pinned only implicitly to the image. Worth a one-line comment noting the dependency, and ideally a fail-safe (treat "netstat missing/error" as busy, not idle).
  • The bk_all combined backend is not autoscaled. ScaledObjects watch bk_amd64/bk_arm64 only. Any client using the combined buildkitd.buildkit:1234 service (service.yaml:49) lands on bk_all and generates zero scale signal — and because a pod sits in both the per-arch headless service and buildkitd-pods, HAProxy accounts maxconn 1 separately per backend, so a pod could take one build via bk_amd64 and one via bk_all simultaneously, exceeding buildkitd's max-parallelism=1. Prod uses the per-arch endpoints so this is latent, but worth a note/guard since the combined endpoint is still advertised.
  • Whole design hinges on every client retrying. With timeout queue gone there is no server-side safety net; a consumer that doesn't implement connect-retry (anything other than the patched .ci/docker/build.sh) will hard-fail under burst. This is documented in the README, but it's a contract worth stating loudly since it's load-bearing.
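The fail-safe suggested for drain.sh in the first bullet above (treat a missing or failing netstat as busy) can be sketched as follows. The real script is shell; this Python version, the function name `pod_is_busy`, and the assumed `netstat -tn` column layout are illustrative assumptions only.

```python
from typing import Optional


def pod_is_busy(netstat_output: Optional[str], port: int = 1234) -> bool:
    """Return True when the pod must not be drained yet.

    netstat_output is None when the tool is missing or exited with an
    error; that case is treated as busy so a broken probe can never
    let a running build be killed.
    """
    if netstat_output is None:
        return True
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            if fields[3].endswith(suffix) and fields[5] == "ESTABLISHED":
                return True
    return False
```

Only a successful probe that shows no established connection on the BuildKit port counts as idle; every other outcome keeps the pod alive.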

4. Monitoring — HAProxy ✅, KEDA ❌ (the notable gap)

  • HAProxy is well covered: the buildkit-haproxy ServiceMonitor already keeps haproxy_backend_current_sessions, haproxy_server_current_sessions, haproxy_server_status, haproxy_server_connection_errors_total — so the scaling input metric itself is visible in Grafana. Good.
  • There is no monitoring for KEDA itself. No ServiceMonitor for the KEDA operator / metrics-apiserver, so when you ask "do we have metrics to troubleshoot KEDA misbehaving", the answer is currently no. The signals you'd want (keda_scaler_errors_total, keda_scaler_metrics_value, keda_scaled_object_errors) aren't scraped; troubleshooting is limited to kubectl describe scaledobject / HPA events. Given this is opt-in on one cluster that's tolerable for now, but I'd recommend adding a KEDA ServiceMonitor before widening rollout.
  • No alerting/PrometheusRule for "all builders pinned at max for a sustained window" (under-provisioning signal) or "ScaledObject in error". Optional, but cheap insurance.

5. Karpenter disruption wiring — this is the strongest part

  • On-demand only (generate_buildkit.py:385-387) removes spot-interruption disruption entirely for 120m builds. ✅
  • consolidationPolicy: WhenEmpty (generate_buildkit.py:366) means consolidation only removes empty nodes, so it never races a running build — at the cost of the node-count lag you already documented. ✅
  • PDB maxUnavailable: 1 per arch (poddisruptionbudget.yaml) correctly gates voluntary disruptions (consolidation, manual drain) through the eviction API one pod at a time, each going through the preStop drain. Selector labels (app: buildkitd, arch) match the Deployment template. ✅
  • The one wiring subtlety: Karpenter honors pod terminationGracePeriodSeconds (135m here) only because the NodePool does not set spec.template.spec.terminationGracePeriod. That's correct today, but it's an implicit dependency — if anyone later sets a node-level terminationGracePeriod it would cap the drain and start killing builds. Worth a comment on the NodePool noting why it's intentionally unset.
  • KEDA scale-down deletes pods directly (not via eviction) so it bypasses the PDB — correctly called out in the README, and the drain + grace cap is the right protection for that path. ✅

Summary

| Concern | Verdict |
| --- | --- |
| Race-condition free | Math is sound & validated; signal is retry-dependent (latency, not correctness). Add pollingInterval. |
| Scaling/deploy risk | Ordering correct. Add KEDA fallback so a metrics outage fails open, not frozen. |
| Edge cases | drain.sh→netstat silent-failure mode; bk_all unscaled + double-booking; client-retry is a hard contract. |
| Monitoring | HAProxy ✅. KEDA has no metrics scraped, the main gap vs. your explicit ask. |
| Karpenter disruption | Well wired (on-demand + WhenEmpty + PDB + drain/grace). Document the implicit "node grace unset" dependency. |

None of these block the staging-only, opt-in rollout. The two I'd most want before widening beyond staging are the KEDA ServiceMonitor and the KEDA fallback block. (I did not touch the unrelated clusters.yaml:170 min-replica asymmetry @jeanschmidt raised separately.)
· branch gh/huydhn/33/head

@jeanschmidt

Contributor

@claude

following up on #723 (comment) and #723 (comment)

Are there anti-flap mechanisms that prevent the fast scale-down of pods/nodes (keeping them warm for a while before tearing them down)? If so, explain them.

Can you look into this PR and tell me if pytorch/pytorch#186955 can safely handle failure to schedule up and wait for 2-4 average build jobs to finish? (You can look for build jobs on pytorch/pytorch for x86-64 and aarch64.)

@claude

claude Bot commented Jun 10, 2026


Claude finished @jeanschmidt's task in 5m 1s


Anti-flap mechanisms & connect-retry safety

  • Read the full diff & changed files
  • Identify anti-flap mechanisms (pods & nodes)
  • Assess whether pytorch#186955 connect-retry can ride out scale-up + saturation
  • Look at build-duration data
  • Post findings

⚠️ Note on sources: outbound network (WebFetch / gh to pytorch/pytorch) is blocked in this sandbox, so I could not pull the 186955 diff or live ClickHouse build-duration data. The retry-budget figures below are inferred from this PR's manifests + the validation comments on this PR (huydhn's run 27303571628). I flag the two numbers you should confirm against the actual PR.


1. Anti-flap: what keeps pods/nodes warm before teardown

There are four distinct anti-flap layers, two at the pod level (KEDA/HPA) and two at the node level (Karpenter), plus the warm floor. From generate_buildkit.py:528-537 and :365-369:

Pod scale-down (KEDA ScaledObject):

cooldownPeriod: 600
advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 600     # 10 min: HPA takes the *highest*
        policies:                            #   recommendation over the window,
          - type: Pods                       #   so a transient dip can't trigger
            value: 1                         #   teardown
            periodSeconds: 120               # at most 1 pod removed / 2 min
  • stabilizationWindowSeconds: 600 — the real anti-flap. For 10 min the HPA only ever acts on the maximum desired-replica seen in the window, so a momentary drop in haproxy_backend_current_sessions never tears a pod down.
  • Pods: 1 / 120s — even once it does scale down, it removes one pod every 2 minutes. Draining 8→4 (arm64 baseline) is ≥ 4×120s = 8 min, after the 10-min window — so ~18 min from idle to baseline.
  • Scale-up uses the HPA default (fast: up to +100%/15s, no stabilization). Correct asymmetry: scale up fast, scale down slow.
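The idle-to-baseline estimate in the bullets above (stabilization window plus one removal period per surplus pod) can be reproduced with a small helper. It mirrors the rough estimate given in this comment rather than the HPA's exact timing, which also depends on controller sync intervals; the function name is chosen for this example.

```python
import math


def seconds_to_baseline(
    current: int,
    baseline: int,
    stabilization_s: int = 600,
    pods_per_period: int = 1,
    period_s: int = 120,
) -> int:
    """Rough time from 'pool goes idle' to 'pool back at baseline'."""
    surplus = max(0, current - baseline)
    if surplus == 0:
        return 0  # already at the warm floor, nothing to tear down
    periods = math.ceil(surplus / pods_per_period)
    return stabilization_s + periods * period_s


# arm64: 8 -> 4 pods with a 10-min window and 1 pod per 2 min.
print(seconds_to_baseline(current=8, baseline=4) / 60, "minutes")
```

Changing `stabilization_s`, `pods_per_period`, or `period_s` shows how each knob stretches or shortens the warm period after a burst.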

Node scale-down (Karpenter NodePool):

disruption:
  consolidationPolicy: WhenEmpty   # only consolidate a node with zero build pods
  consolidateAfter: 5m             # …and only after it's been empty 5 min
  budgets:
    - nodes: "1"                    # disrupt at most 1 node at a time
  • WhenEmpty never bin-packs-and-evicts a running build to free a node — a node is only a candidate once its builds are gone.
  • consolidateAfter: 5m is the node-level anti-flap: a freshly-emptied node sits warm 5 min, absorbing a quick re-burst before EC2 termination.

Warm floor: minReplicaCount = amd64_min: 2 / arm64_min: 4 (clusters.yaml:172-176) keeps ≥1 physical node per arch permanently up, so the common case gets a free warm pod with no scale-up at all.

🔎 One real finding: cooldownPeriod: 600 is a no-op here. KEDA's cooldownPeriod only governs the scale-to-zero transition (active→0). With minReplicaCount of 2/4 the deployment never reaches 0, so this line does nothing. Harmless (the stabilizationWindow + Pods:1/120s policy do the real work), but it reads as if it contributes to scale-down hysteresis when it doesn't. Worth a comment or removal to avoid future confusion.

(Separately, the kill-free teardown path — preStop drain + terminationGracePeriodSeconds: 8100 + PDB maxUnavailable: 1 — is about not killing an in-flight build during a scale-down/consolidation, not anti-flap. It's correctly wired.)


2. Can pytorch#186955's connect-retry safely "fail to schedule up and wait for 2–4 average builds to finish"?

The answer splits by whether the pool is below max or already at max — these are very different waits:

Case A — demand ≤ max (8): wait is for scale-up, not for builds to finish. ✅ Validated.
A new build that finds every warm pod busy retries the connect while KEDA + Karpenter bring up a fresh pod (node provision + pull + ready = a few minutes). huydhn's run proves this: the 9th build (1 connectivity + 8-wide burst against a max-8 pool) "retried 30× (~10 min) until a scale peer freed a pod, then built on attempt 30", and 18/18 succeeded. So the retry comfortably rides out a cold/scaling pool.

Case B — demand > max (8): the queue protection is gone; you're betting retry-budget vs build-duration.
Once at maxReplicaCount: 8, KEDA cannot add pods. The (9th, 10th, …) build can only get a maxconn: 1 slot when an in-flight build finishes. Whether the client survives that is purely retry budget ≥ time-to-free-a-slot:

  • Observed retry budget ≈ 10 min (30 attempts × the ~20s gRPC MinConnectTimeout), hard-capped by the 30-min GHA job timeout. huydhn's note confirms the gate: "with no scale-up, the back of the burst exhausts its retry budget / the 30-min job timeout and fails."
  • Time to free a slot at saturation = time until the soonest of the 8 in-flight builds completes. If the 8 started staggered, this is short; if a ciflow/docker burst started them near-simultaneously, it approaches one full average build duration.

So your specific framing — "wait for 2–4 average build jobs to finish" — is the worst case of Case B, and it is the one place this design can drop a build:

If a burst pushes demand past 8 and the average docker-image build runs longer than the ~10-min / 30-attempt retry budget, the builds at the back of the burst exhaust retries before a slot frees and fail. This is by design (it's the documented gate), not a bug — but it means the system is safe for "wait for scale-up" and "wait for one soon-finishing peer," and is not guaranteed safe for "wait for 2–4 sequential average builds" unless the retry budget is widened.
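The Case B condition (retry budget versus time until a slot frees) reduces to one comparison. The defaults below are the inferred figures from this comment (about 30 attempts of about 20 s each, under a 30-min job timeout), not values confirmed against pytorch#186955; the function names are chosen for this example.

```python
def retry_budget_s(attempts: int, per_attempt_s: int, job_timeout_s: int) -> int:
    """Total time a client keeps retrying, capped by the job timeout."""
    return min(attempts * per_attempt_s, job_timeout_s)


def survives_saturation(
    wait_for_slot_s: int,
    attempts: int = 30,
    per_attempt_s: int = 20,
    job_timeout_s: int = 30 * 60,
) -> bool:
    """True if a slot frees before the client runs out of retries."""
    return wait_for_slot_s <= retry_budget_s(attempts, per_attempt_s, job_timeout_s)


print(retry_budget_s(30, 20, 30 * 60))   # inferred budget in seconds
print(survives_saturation(5 * 60))       # a peer finishes within 5 min
print(survives_saturation(15 * 60))      # a peer needs 15 min
```

Raising `attempts` widens the budget only until `job_timeout_s` takes over, which is the ceiling the recommendation below refers to.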

The two numbers to confirm (I couldn't fetch them here):

  1. Actual retry budget in pytorch#186955 — attempts × per-attempt timeout, and whether it's capped by the job timeout or a fixed attempt count. The validation implies ~30 attempts / ~10 min.
  2. Average .ci/docker/build.sh duration for x86-64 and aarch64 with warm BuildKit + git/registry cache. (terminationGracePeriodSeconds: 8100 is sized to the 120-min worst case, not the average — don't read 120 min as typical.)

Recommendation: if you genuinely need to ride out 2–4 sequential builds at saturation (rather than rely on scale-up), the robust lever is to raise *_max so the burst gets its own pods instead of queueing behind finishers; the NodePool limits already auto-size to *_max (generate_buildkit.py:617-629). Bumping the retry budget on the client helps only up to the 30-min job timeout, after which the job dies regardless. Given staging demand and the validated 18/18 at 9-vs-8 over-subscription, the current max: 8 looks adequate for now; the failure only appears if real bursts exceed 8 concurrent per arch and builds outlast retries.


Other things I noticed while tracing this

  • buildkitd-lb is replicas: 1 (haproxy.yaml:95) — it's a SPOF for both build routing and the scale metric (haproxy_backend_current_sessions is scraped from this one pod). When a haproxy.yaml change rolls it (the checksum/config mechanism), KEDA briefly loses its metric source and all in-flight routing blips. Acceptable for staging/opt-in, but flagging for prod.
  • Scale-down picks an arbitrary pod, which may be mid-build; the drain keeps it kill-free but that "scaled-down" pod can hold its node for up to 135 min (terminationGracePeriodSeconds). So node cost trails pod count down further than the WhenEmpty/consolidateAfter lag alone — expected, but compounds the "warm" period.
  • Deploy ordering is correct: keda precedes buildkit in clusters.yaml:181-184 (CRDs first), and deploy.sh:123-126 applies the ScaledObjects only after the rollout. Karpenter disruption wiring (SQS/spot-interruption events) is unchanged by this PR and lives in the karpenter module.

· branch gh/huydhn/33/head

@huydhn

huydhn commented Jun 10, 2026

Contributor Author

Adding a KEDA fallback is an interesting point. Let me see what we can do there; we would probably fall back to the current fixed count of 32 x86 and 8 aarch64 BuildKit pods, which has been working fine since last Friday.

For KEDA metrics, let's do that in a separate PR. There should be some we want to collect.

@jeanschmidt

Contributor

The no-flap protections are too weak for a real use case, IMO.

We should try to keep pods alive for at least 20 minutes of idling. So instead of removing one pod every 2 minutes, it makes much more sense to remove max(10, 20%) every 20 minutes, and only pods that have been idle for 20+ minutes. Or something similar.

On the other hand, Karpenter can be quite aggressive about scaling down nodes (as it is now). We should gate the anti-flap mechanism on pods: yes, they are less expensive to start, but losing them is more costly for jobs (loss of local decompressed layer caches).

Also, monitoring seems to be a requirement :)

@huydhn

huydhn commented Jun 10, 2026

Contributor Author

Yeah, a minimum of 20 minutes of idling can be set easily for pods by tweaking the drain.sh script. Also, we only evict pods one at a time, so with a 20-minute idle wait it would take 10 × 20 = 200 minutes to tear down 10 pods, for example, which is more than enough given how infrequently Docker images are built.

[ghstack-poisoned]
@huydhn

huydhn commented Jun 10, 2026

Contributor Author

@jeanschmidt Done — scale-down now holds a pod ~20 min after it goes idle (stabilizationWindowSeconds: 1200), then sheds at most max(10 pods, 20%) per 20 min, as you suggested. Gated on pods; node churn stays with Karpenter. Monitoring is split out: KEDA operator metrics in #726, alerts in #727.
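
For readers who want to see what the policy described above implies, here is a rough sketch of the scale-down schedule. It is a simplified model, not the code in this PR: the function and parameter names are hypothetical, and the real KEDA/HPA controller rounds and times its steps differently. It assumes the first step happens once the 20-min stabilization window has elapsed, and that each later step removes max(10 pods, 20% of current replicas) per 20-min period.

```python
import math


def scale_down_schedule(start, floor, min_pods=10, percent=20,
                        period_min=20, stabilization_min=20):
    """Return [(minutes_since_idle, replicas)] for a simplified scale-down.

    Assumption: each period sheds max(min_pods, ceil(percent% of current)),
    never going below `floor` (the warm baseline).
    """
    schedule = [(0, start)]
    replicas = start
    t = stabilization_min  # nothing is removed before the window elapses
    while replicas > floor:
        step = max(min_pods, math.ceil(replicas * percent / 100))
        replicas = max(floor, replicas - step)
        schedule.append((t, replicas))
        t += period_min
    return schedule


# Staging amd64: max 8, warm baseline 2.
print(scale_down_schedule(8, 2))

# A larger hypothetical pool of 32 pods with the same baseline.
print(scale_down_schedule(32, 2))
```

In this model, with a cap of 8 pods the "10 pods" term always dominates, so the whole burst above the baseline is released in a single step once the 20-min window has passed. The per-period limit only starts to matter for pools larger than roughly 10 pods above the baseline.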

[ghstack-poisoned]
@huydhn huydhn changed the base branch from gh/huydhn/33/base to main June 11, 2026 21:45
@huydhn huydhn enabled auto-merge June 11, 2026 21:46
@huydhn huydhn changed the base branch from main to gh/huydhn/33/base June 11, 2026 21:46
@huydhn huydhn merged commit 323f752 into gh/huydhn/33/base Jun 11, 2026
12 of 13 checks passed
@huydhn huydhn deleted the gh/huydhn/33/head branch June 11, 2026 21:46
@huydhn huydhn restored the gh/huydhn/33/head branch June 11, 2026 21:59
huydhn added a commit to huydhn/pytorch-ci-infra that referenced this pull request Jun 11, 2026
…seline

**Impact:** OSDC arc-staging buildkit only (autoscaling is opt-in; other
clusters unchanged).
**Risk:** low

Absorb ciflow/docker bursts without overloading existing pods, and scale back to
a small warm per-arch baseline when idle.

- HAProxy `server maxconn 1` + `timeout queue`: one build per pod; excess builds
  queue and flow onto new pods as they register, instead of stacking on busy
  pods (so scaled-up pods don't sit idle).
- KEDA ScaledObject per arch via `metrics-api` scraping the LB's own metrics
  (haproxy_backend_current_sessions) — no Grafana / external metrics backend.
- Warm baseline: amd64_min=2 / arm64_min=4 (1 physical node each); *_max caps
  the burst and sizes the NodePool limits.
- preStop drain + PDB + long terminationGracePeriodSeconds for kill-free
  scale-down.

staging: amd64 m6id.24xlarge @ 2/node (min 2), arm64 m7gd.16xlarge @ 4/node
(min 4). Runner-side connect retry (separate pytorch/pytorch change) lets a build
tolerate waiting for a pod from a cold/queued pool.

Testing: just lint 13/13, just test pass (generate_buildkit.py 98%).
Signed-off-by: Huy Do <huydo@meta.com>
ghstack-source-id: e6bb458
Pull-Request: pytorch#723