Skip to content

Add NFD topology-updater and numa-scheduler modules (#696)#716

Draft
georgehong wants to merge 5 commits into
gh/georgehong/3/basefrom
gh/georgehong/3/head
Draft

Add NFD topology-updater and numa-scheduler modules (#696)#716
georgehong wants to merge 5 commits into
gh/georgehong/3/basefrom
gh/georgehong/3/head

Conversation

@georgehong

@georgehong georgehong commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Adds two new OSDC modules for NUMA-aware GPU scheduling:

  1. nfd — deploys NFD topology-updater as a DaemonSet on GPU nodes.
    Publishes NodeResourceTopology CRDs showing per-NUMA-zone resource
    availability (GPU/CPU/memory per socket). Polls every 15s to stay
    ahead of the ARC capacity provisioner's 30s interval.

  2. numa-scheduler — deploys a secondary kube-scheduler with the
    NodeResourceTopologyMatch plugin. Reads NRT objects and only places
    pods on nodes where a single NUMA zone can satisfy the full resource
    request. Uses MostAllocated scoring to pack small pods together and
    keep free sockets available for large (4-GPU) jobs.

The scheduler-plugins chart is downloaded via curl from the GitHub
release (kubernetes-sigs/scheduler-plugins) at deploy time — the
project does not publish to a Helm registry or OCI.

Both modules are inert until added to a cluster's modules: list in
clusters.yaml (matching the standard module enablement pattern — no
separate enabled flag). Deploying them has no effect on existing
scheduling until runner definitions set scheduler_name: numa-scheduler
on workflow pods.

Includes smoke tests for both modules (namespace, Helm release,
DaemonSet/Deployment health, NRT CRD existence, scheduler pod
placement on base-infrastructure nodes).

Part of #696

Dry-Run both targets for helm upgrade:

# pointed at arc-staging
helm upgrade --install nfd nfd/node-feature-discovery --version 0.17.1 --namespace nfd --create-namespace -f modules/nfd/helm/values.yaml --dry-run=server
...
STATUS: pending-install
...
curl -fsSL "https://github.com/kubernetes-sigs/scheduler-plugins/releases/download/v0.34.7/scheduler-plugins-0.34.7.tgz" -o /tmp/scheduler-plugins-0.34.7.tgz

helm upgrade --install numa-scheduler /tmp/scheduler-plugins-0.34.7.tgz --namespace numa-scheduler --create-namespace -f modules/numa-scheduler/helm/values.yaml --set scheduler.replicaCount=2 --dry-run=server
...
STATUS: pending-install
...

Follow-Ups

  1. ARC fork (jeanschmidt/actions-runner-controller): Add SchedulerName field to capacity config + placeholder pod builder. New env var CAPACITY_AWARE_SCHEDULER_NAME, defaults to empty (no behavior change). Publish as new chart version.

  2. Deploy updated ARC controller:

    • Update arc.chart_version in clusters.yaml
    • just deploy-module arc
  3. Wire schedulerName into runner template (runner.yaml.tpl):

    • Add CAPACITY_AWARE_SCHEDULER_NAME to listener env
    • Add schedulerName: {{SCHEDULER_NAME}} to job-pod.yaml
  4. Activate for H100 runners.

[ghstack-poisoned]
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

tofu plan — arc-cbr-production

✅ Plan succeeded · commit bb98a4d4 · run log

Plan output
Installed 1 package in 1ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_caller_identity.current: Reading...
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=527854a4-e335-4f95-bc89-1321cff7a478]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0e712dc7e743bbcf7]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNOLQFN6MU]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-2]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-node-role:pytorch-arc-cbr-production-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 0s [id=ami-009f1fe7d56695348]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-05e96ee7cb818e5c0]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-032d4401e63f0c9b9]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-0a583bbbcac436ebd]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-01e479dcb5aedf696]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0992f582e9bf2836e]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0fddf2f74e7e978c7]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-01187bfaa68514400]
module.vpc.aws_eip.nat_secondary["us-east-2c-1"]: Refreshing state... [id=eipalloc-06a980076e99cda81]
module.vpc.aws_eip.nat_secondary["us-east-2c-0"]: Refreshing state... [id=eipalloc-03542e74755fc105b]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0577a02acde719bff]
module.vpc.aws_eip.nat_secondary["us-east-2a-0"]: Refreshing state... [id=eipalloc-086a011b3c26c0dd7]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0709abbcafa23aec0]
module.vpc.aws_eip.nat_secondary["us-east-2c-2"]: Refreshing state... [id=eipalloc-07cfdb2fd5dc07459]
module.vpc.aws_eip.nat_secondary["us-east-2c-3"]: Refreshing state... [id=eipalloc-0d3a71569b2f687be]
module.vpc.aws_eip.nat_secondary["us-east-2c-5"]: Refreshing state... [id=eipalloc-02825435a2786b3d8]
module.vpc.aws_eip.nat_secondary["us-east-2b-1"]: Refreshing state... [id=eipalloc-0e67c0a8cd8c990da]
module.vpc.aws_eip.nat_secondary["us-east-2a-6"]: Refreshing state... [id=eipalloc-0113c95dbdec2f879]
module.vpc.aws_eip.nat_secondary["us-east-2a-1"]: Refreshing state... [id=eipalloc-0f2b00a9ac31df215]
module.vpc.aws_eip.nat_secondary["us-east-2b-3"]: Refreshing state... [id=eipalloc-021ee6c9f1d20b71a]
module.vpc.aws_eip.nat_secondary["us-east-2b-2"]: Refreshing state... [id=eipalloc-063bee447616351f9]
module.vpc.aws_eip.nat_secondary["us-east-2c-6"]: Refreshing state... [id=eipalloc-0aede78edc69cf695]
module.vpc.aws_eip.nat_secondary["us-east-2a-4"]: Refreshing state... [id=eipalloc-067d535102a61d1a8]
module.vpc.aws_eip.nat_secondary["us-east-2c-4"]: Refreshing state... [id=eipalloc-0cc3dadec18bbb3f3]
module.vpc.aws_eip.nat_secondary["us-east-2a-2"]: Refreshing state... [id=eipalloc-09b15a770e0c6d552]
module.vpc.aws_eip.nat_secondary["us-east-2a-3"]: Refreshing state... [id=eipalloc-034d5e1f5a2fcb795]
module.vpc.aws_eip.nat_secondary["us-east-2a-5"]: Refreshing state... [id=eipalloc-0bd9bf54bd6010323]
module.vpc.aws_eip.nat_secondary["us-east-2b-4"]: Refreshing state... [id=eipalloc-0de33181548ac2e5a]
module.vpc.aws_eip.nat_secondary["us-east-2b-6"]: Refreshing state... [id=eipalloc-06b7b88826199a232]
module.vpc.aws_eip.nat_secondary["us-east-2b-5"]: Refreshing state... [id=eipalloc-0cde9a6463901f1e1]
module.vpc.aws_eip.nat_secondary["us-east-2b-0"]: Refreshing state... [id=eipalloc-0cead990d60ce181e]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ab11fcdb8d4ea113]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0d34063a19f4b07b4]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0d26e280575e8aaf4]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-084975a7f7af2696e]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0ce4fba002d90e7d5]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-07d5cd4c479c827ab]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-08e264cbbd47be1ee]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-0f7b8f4473e5790df]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0ad75b2f5282877db]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0cb3785c433ed7718]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0c7ecd4166a01e5f0]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01d38d41a7ca82a08]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_launch_template.base: Refreshing state... [id=lt-0b820cd15307b6d57]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-0b6e08b4b0dc968c0]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-097abe4676c74f71b]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0beb143017359bda1]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/0A621339248958D6D5F2FF084BD185B5]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2879363015]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0992f582e9bf2836e"]: Refreshing state... [id=subnet-0992f582e9bf2836e,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-01ec5f742ae028981,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0709abbcafa23aec0"]: Refreshing state... [id=subnet-0709abbcafa23aec0,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0577a02acde719bff"]: Refreshing state... [id=subnet-0577a02acde719bff,karpenter.sh/discovery]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260518021844404100000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0deb818bbf18764de]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_security_group.efs: Refreshing state... [id=sg-0979eb5e3d9d3db9f]
aws_efs_mount_target.pypi_cache["subnet-0709abbcafa23aec0"]: Refreshing state... [id=fsmt-08cd5108febbacef9]
aws_efs_mount_target.pypi_cache["subnet-0577a02acde719bff"]: Refreshing state... [id=fsmt-07d7b111b9cd6684e]
aws_efs_mount_target.pypi_cache["subnet-0992f582e9bf2836e"]: Refreshing state... [id=fsmt-03523586bb4ff0c46]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260518023249929400000004]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260518023249903900000003]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260518023249955700000005]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

tofu plan — arc-cbr-production-uw1

✅ Plan succeeded · commit bb98a4d4 · run log

Plan output
Installed 1 package in 1ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod-uw1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production-uw1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0121d1038d393182a]
module.eks.data.aws_caller_identity.current: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-s3]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=1fb5d763-c5cd-4de5-bf40-712df992288c]
module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNFWBLKNFS]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
data.aws_availability_zones.available: Read complete after 0s [id=us-west-1]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role:pytorch-arc-cbr-production-uw1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-uw1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-uw1-eks-secrets]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-07fd8394a1d58b614]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-0b3b22b995e71d8d9]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-07b06397ce403fa53]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-05f5edbf2c6678c03]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0a8410ffa0f0014a7]
module.vpc.aws_eip.nat_secondary["us-west-1c-2"]: Refreshing state... [id=eipalloc-0f2e15b6a36b52fac]
module.vpc.aws_eip.nat_secondary["us-west-1a-2"]: Refreshing state... [id=eipalloc-0647e169131be5893]
module.vpc.aws_eip.nat_secondary["us-west-1c-5"]: Refreshing state... [id=eipalloc-0635efedc10ee5f66]
module.vpc.aws_eip.nat_secondary["us-west-1c-1"]: Refreshing state... [id=eipalloc-0bd09c7f2dcaa0a46]
module.vpc.aws_eip.nat_secondary["us-west-1a-1"]: Refreshing state... [id=eipalloc-012ac413772344fea]
module.vpc.aws_eip.nat_secondary["us-west-1c-6"]: Refreshing state... [id=eipalloc-0cf91a032d10f4ec5]
module.vpc.aws_eip.nat_secondary["us-west-1a-3"]: Refreshing state... [id=eipalloc-05a2bad636af56f4d]
module.vpc.aws_eip.nat_secondary["us-west-1a-4"]: Refreshing state... [id=eipalloc-0dfae88698dce850e]
module.vpc.aws_eip.nat_secondary["us-west-1c-3"]: Refreshing state... [id=eipalloc-09f89978685e7f3c7]
module.vpc.aws_eip.nat_secondary["us-west-1c-0"]: Refreshing state... [id=eipalloc-0d565f5bf077b05cf]
module.vpc.aws_eip.nat_secondary["us-west-1a-0"]: Refreshing state... [id=eipalloc-0e3ca79e34012a238]
module.vpc.aws_eip.nat_secondary["us-west-1a-5"]: Refreshing state... [id=eipalloc-059986f686b188dc2]
module.vpc.aws_eip.nat_secondary["us-west-1a-6"]: Refreshing state... [id=eipalloc-08763a35db0a26caa]
module.vpc.aws_eip.nat_secondary["us-west-1c-4"]: Refreshing state... [id=eipalloc-0dfaa16c61333ceb3]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-06d137da3460167c4]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0bd275a35f8e7ef65]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ce35bb011df0cfdb]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-08861bee27120b994]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0a13e7b49c841e497]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-harbor-registry]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-00184fa8d73e575c9]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0f79a2ac72857a304]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-s3-20260519191031756900000001]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production-uw1]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production-uw1:vpc-cni]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production-uw1:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production-uw1:kube-proxy]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-066ae5f473a2b07c0]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production-uw1:pytorch-arc-cbr-production-uw1-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=ab5db6c82031e2d229412c67921160a3b3af073b]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-west-1.amazonaws.com/id/ED52EC64FF5CFAB4151C6E4B5DE279BD]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production-uw1#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production-uw1:coredns]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0c336634317cc9f35]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-01ec520e3931f5f6a]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=3969145930]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-uw1-ebs-csi-driver-role]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01165f36472c0a780]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-06e17b37b87d890f2]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-uw1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-02e4c54e5fa3b4f8a]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0cc835aef3e3bcc21]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-uw1-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production-uw1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-spot-interruption]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-instance-state-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-scheduled-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-west-1.amazonaws.com/308535385114/pytorch-arc-cbr-production-uw1-karpenter]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-west-1.amazonaws.com/308535385114/pytorch-arc-cbr-production-uw1-karpenter]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-scheduled-change-KarpenterScheduledChange]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-spot-interruption-KarpenterSpotInterruption]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0a13e7b49c841e497"]: Refreshing state... [id=subnet-0a13e7b49c841e497,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-058909cc1cdc63fad,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-08861bee27120b994"]: Refreshing state... [id=subnet-08861bee27120b994,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-uw1-karpenter-controller-20260519195229107000000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production-uw1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0da5eaf2022d80aa0]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1-efs-csi-driver-role]
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-01c1f3fa51705db76]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wheel-syncer-role-20260519200350777100000003]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1-efs-csi-driver-role-20260519200350826400000005]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-uw1-pypi-wants-collector-role-20260519200350781900000004]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-uw1:aws-efs-csi-driver]
aws_efs_mount_target.pypi_cache["subnet-0a13e7b49c841e497"]: Refreshing state... [id=fsmt-089fd42858a5a85ab]
aws_efs_mount_target.pypi_cache["subnet-08861bee27120b994"]: Refreshing state... [id=fsmt-00708cc923d4d2055]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

tofu plan — meta-prod-aws-ue1

✅ Plan succeeded · commit bb98a4d4 · run log

Plan output
Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod-ue1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (meta-prod-aws-ue1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=meta-prod-aws-ue1-harbor-s3]
module.eks.data.aws_caller_identity.current: Reading...
module.eks.aws_iam_role.cluster: Refreshing state... [id=meta-prod-aws-ue1-cluster-role]
module.eks.aws_iam_role.node: Refreshing state... [id=meta-prod-aws-ue1-node-role]
data.aws_availability_zones.available: Reading...
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-046818728dce02486]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=9274017b-776a-41bd-9f11-d118a1174159]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-1]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNGRUDTXPT]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/meta-prod-aws-ue1-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=meta-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=meta-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=meta-prod-aws-ue1-node-role:meta-prod-aws-ue1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=meta-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=meta-prod-aws-ue1-harbor-s3/arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-0dafeb02304897431]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-0cf3d9cf37ee998b6]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-0ce44cb6446f3c1b6]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0beb5fc44f0ee165f]
module.vpc.aws_eip.nat_secondary["us-east-1a-3"]: Refreshing state... [id=eipalloc-0bda13d7b70c00c00]
module.vpc.aws_eip.nat_secondary["us-east-1a-1"]: Refreshing state... [id=eipalloc-08c7bd3306cf687ca]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0348c5058db524cd2]
module.vpc.aws_eip.nat_secondary["us-east-1b-3"]: Refreshing state... [id=eipalloc-0c8291ee817240e1f]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0d65ec2dd49f0d87c]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-02ce11d6646870431]
module.vpc.aws_eip.nat_secondary["us-east-1c-2"]: Refreshing state... [id=eipalloc-025ef0e1813277c67]
module.vpc.aws_eip.nat_secondary["us-east-1a-2"]: Refreshing state... [id=eipalloc-080ec4e265ebdc5ad]
module.vpc.aws_eip.nat_secondary["us-east-1c-0"]: Refreshing state... [id=eipalloc-05844040c7248f44f]
module.vpc.aws_eip.nat_secondary["us-east-1c-6"]: Refreshing state... [id=eipalloc-0d22d3aa0667a1070]
module.vpc.aws_eip.nat_secondary["us-east-1a-4"]: Refreshing state... [id=eipalloc-09fa171393c3a7cfb]
module.vpc.aws_eip.nat_secondary["us-east-1c-4"]: Refreshing state... [id=eipalloc-00c5df9f3b60f353d]
module.vpc.aws_eip.nat_secondary["us-east-1c-5"]: Refreshing state... [id=eipalloc-04fe645562f597aaa]
module.vpc.aws_eip.nat_secondary["us-east-1a-0"]: Refreshing state... [id=eipalloc-0c8a6faed0a97479d]
module.vpc.aws_eip.nat_secondary["us-east-1b-4"]: Refreshing state... [id=eipalloc-0aba12aa23c11d20c]
module.vpc.aws_eip.nat_secondary["us-east-1c-1"]: Refreshing state... [id=eipalloc-0cb5208c5f775baf6]
module.vpc.aws_eip.nat_secondary["us-east-1b-1"]: Refreshing state... [id=eipalloc-0d095305019486ae6]
module.vpc.aws_eip.nat_secondary["us-east-1b-6"]: Refreshing state... [id=eipalloc-0f922f499d32f1368]
module.vpc.aws_eip.nat_secondary["us-east-1a-6"]: Refreshing state... [id=eipalloc-02e84a51a14c9cbda]
module.vpc.aws_eip.nat_secondary["us-east-1b-2"]: Refreshing state... [id=eipalloc-0f0b720f4cca62ec7]
module.vpc.aws_eip.nat_secondary["us-east-1b-0"]: Refreshing state... [id=eipalloc-0bcfe1f98793e1b12]
module.vpc.aws_eip.nat_secondary["us-east-1b-5"]: Refreshing state... [id=eipalloc-0d078dc6f07628714]
module.vpc.aws_eip.nat_secondary["us-east-1c-3"]: Refreshing state... [id=eipalloc-0af54aa2e5f40dfa4]
module.vpc.aws_eip.nat_secondary["us-east-1a-5"]: Refreshing state... [id=eipalloc-01f89a7c130d2a810]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0eafd792589fbb363]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-00c2e2605c4dea199]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-033772b4490df1b41]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0f922406e02ecba1d]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-078f44b58c8b48ade]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-07bfd0f170c3b3406]
module.eks.aws_eks_cluster.this: Refreshing state... [id=meta-prod-aws-ue1]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=meta-prod-aws-ue1:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=meta-prod-aws-ue1:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=meta-prod-aws-ue1:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-043779597e3b5a7fd]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=b1b539daa206035ae3c3e28288b0681fa1b462f3]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/6C84A48E1BF23A027C1E78912A368743]
module.eks.aws_eks_node_group.base: Refreshing state... [id=meta-prod-aws-ue1:meta-prod-aws-ue1-base-nodes]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=3022997555]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=meta-prod-aws-ue1-ebs-csi-driver-role]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-05e7e66e960593972]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-05da47c4ed26ae390]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-09414719983019b49]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0616491b7baeab47f]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-025de56c0aac8d3f0]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0cff785d8001fc914]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=meta-prod-aws-ue1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1:aws-ebs-csi-driver]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=meta-prod-aws-ue1:coredns]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-05d5b7a41aa6323ed]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-0c665948be8d0282e]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-09287d705ce4a88bc]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=meta-prod-aws-ue1-harbor-registry/arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-harbor-registry]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=meta-prod-aws-ue1#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-02a8683fa7258f295]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-0306281246323bd27]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-09dca398d838d4247]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (meta-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=meta-prod-aws-ue1-karpenter-spot-interruption]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-instance-state-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/308535385114/meta-prod-aws-ue1-karpenter]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=meta-prod-aws-ue1-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-scheduled-change]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=meta-prod-aws-ue1-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/308535385114/meta-prod-aws-ue1-karpenter]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-scheduled-change-KarpenterScheduledChange]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=meta-prod-aws-ue1-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=meta-prod-aws-ue1-karpenter-instance-state-change-KarpenterInstanceStateChange]
data.terraform_remote_state.base: Read complete after 2s
aws_ec2_tag.subnet_karpenter_discovery["subnet-0d65ec2dd49f0d87c"]: Refreshing state... [id=subnet-0d65ec2dd49f0d87c,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-016f4a0d209f3e4a9,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0348c5058db524cd2"]: Refreshing state... [id=subnet-0348c5058db524cd2,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=meta-prod-aws-ue1-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-02ce11d6646870431"]: Refreshing state... [id=subnet-02ce11d6646870431,karpenter.sh/discovery]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=meta-prod-aws-ue1-karpenter-controller-20260528200455768400000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (meta-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/meta-prod-aws-ue1-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-023e57b36ec1cd426]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=meta-prod-aws-ue1-pypi-wheel-syncer-role]
aws_iam_role.wants_collector: Refreshing state... [id=meta-prod-aws-ue1-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-0bc06caa62214c9b7]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=meta-prod-aws-ue1-pypi-wheel-syncer-role-20260528201106257700000005]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=meta-prod-aws-ue1-pypi-wants-collector-role-20260528201106192600000004]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1-efs-csi-driver-role-20260528201106116400000003]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=meta-prod-aws-ue1:aws-efs-csi-driver]
aws_efs_mount_target.pypi_cache["subnet-0348c5058db524cd2"]: Refreshing state... [id=fsmt-0500c573cafe66133]
aws_efs_mount_target.pypi_cache["subnet-0d65ec2dd49f0d87c"]: Refreshing state... [id=fsmt-0ffaedc58eceb7749]
aws_efs_mount_target.pypi_cache["subnet-02ce11d6646870431"]: Refreshing state... [id=fsmt-06a05c001541338d2]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

georgehong added a commit that referenced this pull request Jun 9, 2026
Adds two new OSDC modules for NUMA-aware GPU scheduling:

1. nfd — deploys NFD topology-updater as a DaemonSet on GPU nodes.
   Publishes NodeResourceTopology CRDs showing per-NUMA-zone resource
   availability (GPU/CPU/memory per socket). Polls every 15s to stay
   ahead of the ARC capacity provisioner's 30s interval.

2. numa-scheduler — deploys a secondary kube-scheduler with the
   NodeResourceTopologyMatch plugin. Reads NRT objects and only places
   pods on nodes where a single NUMA zone can satisfy the full resource
   request. Uses MostAllocated scoring to pack small pods together and
   keep free sockets available for large (4-GPU) jobs.

The scheduler-plugins chart is downloaded from the GitHub release
(kubernetes-sigs/scheduler-plugins) at deploy time — the project
does not publish to a Helm registry.

Both modules default to enabled=false in clusters.yaml and are inert
until runner definitions set scheduler_name: numa-scheduler on
workflow pods. Deploying them has no effect on existing scheduling.

Part of #696

ghstack-source-id: b9b30d1
Pull-Request: #716
[ghstack-poisoned]
georgehong added a commit that referenced this pull request Jun 9, 2026
Adds two new OSDC modules for NUMA-aware GPU scheduling:

1. nfd — deploys NFD topology-updater as a DaemonSet on GPU nodes.
   Publishes NodeResourceTopology CRDs showing per-NUMA-zone resource
   availability (GPU/CPU/memory per socket). Polls every 15s to stay
   ahead of the ARC capacity provisioner's 30s interval.

2. numa-scheduler — deploys a secondary kube-scheduler with the
   NodeResourceTopologyMatch plugin. Reads NRT objects and only places
   pods on nodes where a single NUMA zone can satisfy the full resource
   request. Uses MostAllocated scoring to pack small pods together and
   keep free sockets available for large (4-GPU) jobs.

The scheduler-plugins chart is downloaded from the GitHub release
(kubernetes-sigs/scheduler-plugins) at deploy time — the project
does not publish to a Helm registry.

Both modules default to enabled=false in clusters.yaml and are inert
until runner definitions set scheduler_name: numa-scheduler on
workflow pods. Deploying them has no effect on existing scheduling.

Part of #696

ghstack-source-id: 2f24bba
Pull-Request: #716
@georgehong georgehong marked this pull request as ready for review June 9, 2026 01:25
@georgehong

georgehong commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

planning some additional changes to remove the gh dependency to extract the release (i.e. use curl instead), and add some additional module tests.

@georgehong georgehong marked this pull request as draft June 10, 2026 21:16
[ghstack-poisoned]
georgehong added a commit that referenced this pull request Jun 10, 2026
Adds two new OSDC modules for NUMA-aware GPU scheduling:

1. nfd — deploys NFD topology-updater as a DaemonSet on GPU nodes.
   Publishes NodeResourceTopology CRDs showing per-NUMA-zone resource
   availability (GPU/CPU/memory per socket). Polls every 15s to stay
   ahead of the ARC capacity provisioner's 30s interval.

2. numa-scheduler — deploys a secondary kube-scheduler with the
   NodeResourceTopologyMatch plugin. Reads NRT objects and only places
   pods on nodes where a single NUMA zone can satisfy the full resource
   request. Uses MostAllocated scoring to pack small pods together and
   keep free sockets available for large (4-GPU) jobs.

The scheduler-plugins chart is downloaded from the GitHub release
(kubernetes-sigs/scheduler-plugins) at deploy time — the project
does not publish to a Helm registry.

Both modules default to enabled=false in clusters.yaml and are inert
until runner definitions set scheduler_name: numa-scheduler on
workflow pods. Deploying them has no effect on existing scheduling.

Part of #696

ghstack-source-id: c8ce9d0
Pull-Request: #716
[ghstack-poisoned]
georgehong added a commit that referenced this pull request Jun 10, 2026
Adds two new OSDC modules for NUMA-aware GPU scheduling:

1. nfd — deploys NFD topology-updater as a DaemonSet on GPU nodes.
   Publishes NodeResourceTopology CRDs showing per-NUMA-zone resource
   availability (GPU/CPU/memory per socket). Polls every 15s to stay
   ahead of the ARC capacity provisioner's 30s interval.

2. numa-scheduler — deploys a secondary kube-scheduler with the
   NodeResourceTopologyMatch plugin. Reads NRT objects and only places
   pods on nodes where a single NUMA zone can satisfy the full resource
   request. Uses MostAllocated scoring to pack small pods together and
   keep free sockets available for large (4-GPU) jobs.

The scheduler-plugins chart is downloaded from the GitHub release
(kubernetes-sigs/scheduler-plugins) at deploy time — the project
does not publish to a Helm registry.

Both modules default to enabled=false in clusters.yaml and are inert
until runner definitions set scheduler_name: numa-scheduler on
workflow pods. Deploying them has no effect on existing scheduling.

Part of #696

ghstack-source-id: a45fb4b
Pull-Request: #716
@georgehong georgehong marked this pull request as ready for review June 10, 2026 22:43
@georgehong

Copy link
Copy Markdown
Contributor Author

Add additional smoke tests, and similar to the cluster deployment, if this is validated, I can try running this on arc-staging to ensure deployment works as expected before continuing.

Comment thread osdc/modules/nfd/helm/values.yaml Outdated
- key: instance-type
operator: Exists
effect: NoSchedule
- key: git-cache-not-ready

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been removed, so you will need to do a rebase. See this here seems to indicate that you want to add this into this new list too https://github.com/pytorch/ci-infra/blob/main/osdc/modules/nodepools/scripts/python/generate_nodepools.py#L55

@huydhn

huydhn commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@claude This PR introduces a new osdc/modules/numa-scheduler/helm/values.yaml, how does that work? Is it always enforced or can it be optional?

A couple more things that I have in mind:

  • 8xH100 or 8xB200 jobs use all the NUMA cores from the server (2). Do they still work with the new schedule?
  • Some CI jobs doesn't care about NUMA core, only the number of GPUs. So, they are ok if CPU are scattered as long as the requested number of GPUs are there. It's better to run them right away instead of waiting, or worst failing the jobs

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude finished @huydhn's task in 2m 0s —— View job


How numa-scheduler works — and answers to your questions

Short version: it's opt-in per pod, never global. The values.yaml configures a secondary kube-scheduler. Kubernetes only routes a pod through it when that pod sets spec.schedulerName: numa-scheduler. Every other pod keeps using the default kube-scheduler and is completely unaffected. Deploying this module is a no-op until a runner definition opts in (the PR description's follow-ups #3/#4 are what actually wire it up).

So nothing is "always enforced." The NUMA constraint only applies to pods you explicitly route through it.


1. Is it always enforced or optional?

Optional, and opt-in at the pod level. Two layers gate it:

  • schedulerName — only pods with schedulerName: numa-scheduler are touched (values.yaml:5-12). The NodeResourceTopologyMatch plugin (values.yaml:45-57) runs only inside this scheduler.
  • Node topology policy — the plugin's Filter behavior depends on each node's kubelet topologyManagerPolicy, which it reads from the NRT object NFD publishes. Per the repo's own design (CLAUDE.md → "GPU Fleet Unification & NUMA Topology"), default is best-effort, and only multi-NUMA instances (p4d, p5, p6-b200) override to single-numa-node.

This matters a lot for your next two questions, because NodeResourceTopologyMatch is strict only on single-numa-node nodes — it filters out any node that can't align the pod's full request to one NUMA zone. On best-effort/none nodes its Filter is permissive.

2. Do 8×H100 / 8×B200 jobs still work?

This is the real risk and needs explicit handling. An 8-GPU p5/p6-b200 is 2 NUMA sockets × 4 GPUs. A pod requesting all 8 GPUs cannot fit in a single NUMA zone, so on a single-numa-node node the NodeResourceTopologyMatch Filter would reject every node → the pod sits Pending (it doesn't hard-fail, but it never schedules, and ARC will eventually time it out).

Recommendation: do not set schedulerName: numa-scheduler on full-server (8-GPU) runner definitions. Whole-server jobs already own both sockets, so NUMA-aware packing buys nothing — the scheduler is there to pack small (1–4 GPU) jobs onto single sockets and keep the other socket free for large jobs. Since opt-in is per-runner-def, the clean answer is to enable it only for the sub-socket GPU tiers (1/2/4-GPU H100/B200) and leave 8-GPU on the default scheduler.

It would be worth stating this explicitly in the follow-up that wires SCHEDULER_NAME into runner.yaml.tpl, so 8-GPU defs don't accidentally inherit it.

3. Jobs that only care about GPU count (CPU may be scattered)

Same mechanism solves it: don't route those jobs through numa-scheduler. Leave schedulerName unset and they go through the default scheduler — placed immediately on any node with enough free GPUs, CPUs allowed to scatter across sockets. No waiting, no NUMA filtering.

The only failure mode to avoid is opting a NUMA-agnostic job into this scheduler on a single-numa-node node — then a request that spans sockets would stay Pending. As long as opt-in stays selective (only the latency-sensitive single-socket jobs that actually benefit from socket-local GPU+CPU+memory), the "scatter is fine, run now" jobs are untouched.


Suggestions for this PR

  1. Document the opt-in contract in numa-scheduler/helm/values.yaml or a module README — specifically the rule "8-GPU / full-server and NUMA-agnostic jobs must NOT set schedulerName: numa-scheduler", since the consequence (Pending → ARC timeout) isn't obvious.
  2. Pending-vs-fail behavior is the thing to watch in the rollout: a mis-targeted pod won't fail fast, it'll hang until the listener times out. Worth a note in the follow-up PR that sets schedulerName, and ideally an alert on long-Pending pods in the numa-scheduler path.
  3. Unrelated to your questions but flagged by @huydhn's other review comment: the nfd/helm/values.yaml:42 toleration list overlaps with generate_nodepools.py taints and needs the rebase + sync mentioned there.

Happy to draft the values.yaml/README documentation note or a guard in the runner template if you'd like — just say the word.
· branch gh/georgehong/3/head

@huydhn

huydhn commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@claude

  1. Jobs that only care about GPU count (CPU may be scattered)
    Same mechanism solves it: don't route those jobs through numa-scheduler. Leave schedulerName unset and they go through the default scheduler — placed immediately on any node with enough free GPUs, CPUs allowed to scatter across sockets. No waiting, no NUMA filtering.
    The only failure mode to avoid is opting a NUMA-agnostic job into this scheduler on a single-numa-node node — then a request that spans sockets would stay Pending. As long as opt-in stays selective (only the latency-sensitive single-socket jobs that actually benefit from socket-local GPU+CPU+memory), the "scatter is fine, run now" jobs are untouched.

For this point, what is the impact of enabling numa scheduler on https://github.com/pytorch/ci-infra/blob/main/osdc/modules/arc-runners-h100/defs/l-x86iamx-88-900-h100-4.yaml scale set that is using https://github.com/pytorch/ci-infra/blob/main/osdc/modules/nodepools-h100/defs/p5.yaml nodepool. The nodepool has topology_manager_policy: single-numa-node

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude finished @huydhn's task in 2m 36s —— View job


Impact of enabling numa-scheduler on l-x86iamx-88-900-h100-4 (p5.48xlarge / single-numa-node)

Short version: this is the good case — the canonical one numa-scheduler exists for. It's safe and beneficial here, with one sizing caveat worth checking before you flip it on.

Why this def is the ideal candidate

A p5.48xlarge is a 2-NUMA-node box: 4 H100 + 88 usable vCPU + 900 GiB per socket (the 8-GPU def confirms the totals: 176 vCPU / 1800 GiB / 8 GPU → ÷2 = one socket). The 4-GPU def requests:

gpu: 4   vcpu: 88   memory: 900Gi

That is exactly one NUMA socket. So a single zone can satisfy the full request → NodeResourceTopologyMatch's Filter passes, and the pod packs cleanly one-per-socket (2 runners/node, matching max_runners.default: 2). This is the opposite of the 8-GPU case I flagged earlier — that one spans both sockets and can never fit, which is why it lives in its own p5-large pool with best-effort (nodepools-h100/defs/p5-large.yaml:5-12).

What enabling it actually fixes

The node already runs topology_manager_policy: single-numa-node (p5.yaml:12) — so the kubelet already enforces single-zone alignment today, regardless of which scheduler placed the pod. The gap is that the default scheduler is NUMA-blind: it counts only total free GPUs on a node. With mixed-size H100 jobs (1/2/4-GPU) fragmenting the two sockets, it can place a 4-GPU pod on a node that has 4 GPUs free but split 2+2 across sockets. The kubelet then can't align it → TopologyAffinityError, the runner+placeholder pod fails admission and hangs, pinning a reserved node idle.

numa-scheduler closes exactly that gap: it reads the NRT objects and only places the pod where one socket has the full 4-GPU+88-vCPU+900-GiB free. So for this def, enabling it prevents TopologyAffinityErrors rather than causing Pending pods.

The one caveat — the request is sized to the full socket, with zero headroom ⚠️

This is the thing to verify before enabling. NodeResourceTopologyMatch compares the pod request against each zone's allocatable in the NRT object — and allocatable is net of kubelet system-reserved / kube-reserved. That reserved CPU/memory is typically charged against a NUMA zone, so a socket's reported allocatable is usually slightly under the nominal 88 vCPU / 900 GiB.

A pod requesting exactly 88 / 900 may then fail to fit in any single zone → it sits Pending indefinitely (then ARC times it out). GPUs (4) are always integral to a socket and fine; CPU and memory are the tight dimensions.

Before activating for this scale set, I'd recommend one of:

  1. Verify against a live NRT object — read the per-zone allocatable on a real p5.48xlarge and confirm it's ≥ 88 vCPU and ≥ 900Gi:
    kubectl get noderesourcetopologies <p5-node-name> -o yaml
    
  2. Trim the def slightly (e.g. vcpu: 84, memory: 880Gi) to leave per-zone reserved headroom, so the request comfortably fits one socket's allocatable, not just its capacity.

Summary

1/2/4-GPU defs (p5.48xlarge, single-numa-node) 8-GPU def (p5-large, best-effort)
Fits one NUMA zone? ✅ yes (4-GPU = exactly one socket) ❌ no (spans both)
Enable numa-scheduler? intended use — prevents TopologyAffinityError ❌ never (would hang Pending)
Watch out for per-zone allocatable < full-socket request → Pending n/a (don't opt in)

So: yes, enable it here — just confirm the 4-GPU request fits the socket's allocatable (not just capacity) first, or shave a little CPU/memory headroom off the def.
· branch gh/georgehong/3/head

[ghstack-poisoned]
@georgehong

Copy link
Copy Markdown
Contributor Author

Updating to break down changes into components in stack available here: #738.

@georgehong georgehong marked this pull request as draft June 15, 2026 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants