[TEST-ONLY] Validate NUMA scheduling on A100 (p4d) in ue1 staging by georgehong · Pull Request #778 · pytorch/ci-infra

georgehong · 2026-06-16T22:11:32Z

Temporary test configuration. DO NOT MERGE. This is a self-contained sibling of the g4dn.metal
[TEST-ONLY] commit (now archived on numa-aware-scheduling-g4dn): it runs the same NUMA pipeline
on real A100 hardware (p4d.24xlarge) in the new meta-staging-aws-ue1 (us-east-1) staging
cluster, the production-class target. It reuses the existing p4d fleet and A100 runner defs
(the p4d fleet is already single-numa-node), so no new fleet or runner defs are needed.

Prerequisite this unblocks: confirm whether A100/p4d actually PUBLISHES per-GPU NUMA
topology in the NodeResourceTopology. On g4dn.metal it did NOT (zones exposed CPU only,
so the numa-scheduler could never align a GPU pod). After confirming, vary the scheduler
and over-request beyond one NUMA zone (edit the 1-GPU def), mirroring the g4dn A/B.

- Add nfd + numa-scheduler to meta-staging-aws-ue1; remove them from the two prod clusters
  so the test pipeline is isolated to staging (mirrors the g4dn commit).
- Repoint the NFD topology-updater, the taint-remover, and the nfd-topology startup-taint gate
  from p5 to p4d. Only ue1 runs nfd now, so there is a single fleet target and no affinity is needed.
- Add scheduler_name: numa-scheduler to the existing A100 1-GPU and 4-GPU runner defs
  (the 4-GPU is the real scenario, parallel to the H100 4-GPU; the 1-GPU is the A/B knob).

p4d.24xlarge = 2 sockets x 4 A100 40GB (2 NUMA x 4 GPU). cpuManagerPolicy=static and
Guaranteed-QoS runner pods already apply, so CPU+GPU NUMA alignment needs no workload
changes. ue1 runners carry the c-mt- staging prefix + meta-staging-aws-ue1 runner group,
so there is no overlap with prod (mt-).
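
To make the single-zone constraint concrete, here is a minimal sketch of the fit check that NUMA alignment implies. The zone sizes are an assumption: the 96 vCPUs and 8 GPUs of a p4d.24xlarge are split evenly across the two NUMA zones, and kubelet and DaemonSet reservations are ignored. `Zone`, `Request`, and `fits_single_zone` are hypothetical names for illustration, not part of the numa-scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    cpu_m: int  # free CPU in millicores
    gpus: int   # free GPUs


@dataclass(frozen=True)
class Request:
    cpu_m: int
    gpus: int


def fits_single_zone(req, zones):
    """Return the names of zones that can hold the whole request on their own."""
    return [z.name for z in zones if req.cpu_m <= z.cpu_m and req.gpus <= z.gpus]


# Assumed even split of a p4d.24xlarge (96 vCPU, 8 GPU) across 2 NUMA zones.
zones = [Zone("node-0", 48_000, 4), Zone("node-1", 48_000, 4)]

four_gpu = Request(cpu_m=44_320, gpus=4)   # A100 4-GPU runner request
eight_gpu = Request(cpu_m=88_320, gpus=8)  # A100 8-GPU runner request

print(fits_single_zone(four_gpu, zones))   # ['node-0', 'node-1']
print(fits_single_zone(eight_gpu, zones))  # []
```

Under these assumptions the 4-GPU runner fits in either zone, while any request above 4 GPUs (the over-request case) fits in none.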

Deploy (ue1 only):
just deploy-module meta-staging-aws-ue1 nfd
just deploy-module meta-staging-aws-ue1 numa-scheduler
just deploy-module meta-staging-aws-ue1 nodepools
just deploy-module meta-staging-aws-ue1 arc-runners
Then queue a canary job and inspect the NRT for per-zone nvidia.com/gpu BEFORE varying the scheduler or over-requesting.
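
The NRT inspection can be scripted. The sketch below assumes the NodeResourceTopology object has already been loaded as a dict (for example from the JSON output of `kubectl get noderesourcetopologies <node> -o json`) and follows the `zones[].resources[]` layout of the v1alpha2 API. The sample objects and their quantities are made up for illustration and are not real cluster output; `gpu_by_zone` is a hypothetical helper name.

```python
def gpu_by_zone(nrt, resource="nvidia.com/gpu"):
    """Map zone name -> allocatable quantity for every zone that publishes `resource`."""
    found = {}
    for zone in nrt.get("zones", []):
        for res in zone.get("resources", []):
            if res.get("name") == resource:
                found[zone["name"]] = res.get("allocatable")
    return found


# Illustrative shape of an NRT that publishes GPUs per zone (values are invented).
with_gpus = {"zones": [
    {"name": "node-0", "type": "Node", "resources": [
        {"name": "cpu", "allocatable": "48"},
        {"name": "nvidia.com/gpu", "allocatable": "4"}]},
    {"name": "node-1", "type": "Node", "resources": [
        {"name": "cpu", "allocatable": "48"},
        {"name": "nvidia.com/gpu", "allocatable": "4"}]},
]}

# Illustrative shape of the CPU-only case described for g4dn.metal.
cpu_only = {"zones": [
    {"name": "node-0", "type": "Node", "resources": [{"name": "cpu", "allocatable": "48"}]},
    {"name": "node-1", "type": "Node", "resources": [{"name": "cpu", "allocatable": "48"}]},
]}

print(gpu_by_zone(with_gpus))  # {'node-0': '4', 'node-1': '4'}
print(gpu_by_zone(cpu_only))   # {}
```

An empty result means the zones expose no GPU resource, which is the condition under which the scheduler has nothing to align a GPU pod against.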

Cleanup: drop this commit (git reset --hard HEAD~1) + teardown nfd/numa-scheduler on ue1.

github-actions · 2026-06-16T22:12:16Z

Capacity report

commit 27ebdfd0 · run log

✅ simulate-cluster

Monte Carlo Cluster Simulation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Seed: 42  |  MAPE threshold: 15%  |  Runners: 43  |  DaemonSets: 17
Peak target runner types: 30 (mapped from 38 old labels)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cluster Simulation Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skipped labels (1):
  l-arm64g2-6-32: no runner def

Nodes by instance type:

  Instance Type          Nodes  vCPU Used vCPU Total   Mem Used  Mem Total   GPU
  ──────────────────────────────────────────────────────────────────────────────
  c7a.48xlarge             261   44794.2c   49871.9c  87800.8Gi  90312.1Gi     -
  c7i.metal-24xl            37    3415.8c    3526.8c   6197.9Gi   6231.7Gi     -
  g4dn.12xlarge            162    7341.8c    7661.0c  27946.6Gi  28109.4Gi 648/648
  g4dn.8xlarge              89    2609.5c    2788.4c  10280.4Gi  10342.3Gi 89/89
  g4dn.metal                87    8205.8c    8279.8c  29972.3Gi  30076.9Gi 696/696
  g5.12xlarge               49    2220.7c    2317.2c   8208.0Gi   8237.5Gi 196/196
  g5.48xlarge               41    7762.1c    7828.1c  28884.9Gi  28910.0Gi 328/328
  g5.8xlarge               603   17680.0c   18892.0c  68446.4Gi  68931.6Gi 603/603
  g6.12xlarge               24    1087.7c    1135.0c   4140.2Gi   4164.4Gi 96/96
  g6.8xlarge               377   11053.6c   11811.4c  42793.2Gi  43096.5Gi 377/377
  m6i.32xlarge              26    3258.3c    3308.2c  12051.3Gi  12077.9Gi     -
  m7g.8xlarge               61     995.5c    1920.3c   3813.1Gi   6992.2Gi     -
  m7g.metal                 30    1869.6c    1902.0c   6795.3Gi   6830.4Gi     -
  m7i.48xlarge              48    8192.0c    9171.8c  32363.6Gi  33658.7Gi     -
  m8g.48xlarge               7    1093.4c    1337.6c   4188.2Gi   4908.6Gi     -
  r7a.48xlarge             137   21506.4c   26178.0c 170772.3Gi 193392.5Gi     -
  r7g.16xlarge             122    7481.0c    7734.8c  56548.2Gi  56673.4Gi     -

Deployment accuracy:

  Total deployed: 6208 / 7294 target
  Weighted MAPE: 15.0%

  Runner                              Deployed   Target     Diff
  ───────────────────────────────────────────────────────────────
  l-arm64g3-16-62                           61       76      -15
  l-arm64g3-61-463                         122      153      -31
  l-arm64g4-16-62                           67       76       -9
  l-barm64g3-62-226                         30       39       -9
  l-bx86iamx-92-167                         37       45       -8
  l-bx86iavx512-94-344-t4-8                 87       91       -4
  l-x86aavx2-189-704-a10g-8                 41       42       -1
  l-x86aavx2-29-113-a10g                   603      695      -92
  l-x86aavx2-29-113-l4                     377      422      -45
  l-x86aavx2-45-167-a10g-4                  49       80      -31
  l-x86aavx2-45-172-l4-4                    24       29       -5
  l-x86aavx512-125-463                      26       24       +2
  l-x86iamx-32-128                         130      174      -44
  l-x86iamx-8-32                           354      384      -30
  l-x86iavx2-40-160                         22       30       -8
  l-x86iavx2-8-32                           19       18       +1
  l-x86iavx512-16-128                       68       89      -21
  l-x86iavx512-16-32                      1146     1384     -238
  l-x86iavx512-2-4                          12       15       -3
  l-x86iavx512-29-115-t4                    89      104      -15
  l-x86iavx512-32-256                       13       12       +1
  l-x86iavx512-37-68                        48       65      -17
  l-x86iavx512-45-172-t4-4                 162      183      -21
  l-x86iavx512-46-85                       151      189      -38
  l-x86iavx512-48-384                      366      417      -51
  l-x86iavx512-8-16                       2054     2400     -346
  l-x86iavx512-8-64                         26       28       -2
  l-x86iavx512-94-192                        2        2       +0
  l-x86iavx512-94-768                       22       28       -6
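
The 15.0% figure can be reproduced from the table. The sketch below assumes "weighted MAPE" means the sum of absolute per-runner errors divided by the sum of targets; the simulator's own formula is not shown in this report, so treat this as one definition that happens to match the printed totals. `weighted_mape` is a hypothetical helper name.

```python
def weighted_mape(rows):
    """Sum of absolute deployment errors divided by the sum of targets."""
    total_target = sum(target for _, target in rows)
    return sum(abs(deployed - target) for deployed, target in rows) / total_target


# (deployed, target) pairs copied from the Deployment accuracy table, top to bottom.
rows = [
    (61, 76), (122, 153), (67, 76), (30, 39), (37, 45), (87, 91),
    (41, 42), (603, 695), (377, 422), (49, 80), (24, 29), (26, 24),
    (130, 174), (354, 384), (22, 30), (19, 18), (68, 89), (1146, 1384),
    (12, 15), (89, 104), (13, 12), (48, 65), (162, 183), (151, 189),
    (366, 417), (2054, 2400), (26, 28), (2, 2), (22, 28),
]

print(sum(d for d, _ in rows), sum(t for _, t in rows))  # 6208 7294
print(f"{weighted_mape(rows):.1%}")                      # 15.0%
```

Under this definition the unrounded value is 1094 / 7294, which sits just below the 15% MAPE threshold.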

Cluster-wide utilization:

  vCPU:    90.9%  (150568 / 165664 cores)
  Memory:  95.0%  (601203 / 632946 GiB)
  GPU:    100.0%  (3033 / 3033 GPUs across 1432 nodes)

  Total nodes: 2161
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ analyze-utilization

Node Utilization Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Runner def dirs: /home/runner/work/ci-infra/ci-infra/osdc/modules/arc-runners/defs, /home/runner/work/ci-infra/ci-infra/osdc/modules/arc-runners-b200/defs, /home/runner/work/ci-infra/ci-infra/osdc/modules/arc-runners-h100/defs
NodePool def dirs: /home/runner/work/ci-infra/ci-infra/osdc/modules/nodepools/defs, /home/runner/work/ci-infra/ci-infra/osdc/modules/nodepools-b200/defs, /home/runner/work/ci-infra/ci-infra/osdc/modules/nodepools-h100/defs
Utilization threshold: 90.0%

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: c7a.48xlarge
  Total: 192 vCPU, 384Gi advertised (355.2Gi actual)
  Kubelet reserved: 550m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 191080m CPU (191.1 cores), 346.0Gi RAM

  Runners targeting this node:
    - l-x86iavx512-16-32: 16320m CPU, 32.5Gi RAM (job: 16c+32.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-2-4: 2320m CPU, 4.5Gi RAM (job: 2c+4.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-37-68: 37320m CPU, 68.5Gi RAM (job: 37c+68.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-46-85: 46320m CPU, 85.5Gi RAM (job: 46c+85.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-8-16: 8320m CPU, 16.5Gi RAM (job: 8c+16.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-94-192: 94320m CPU, 189.5Gi RAM (job: 94c+189.0Gi, hooks: 320m+522Mi)
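
The allocatable and per-runner figures above follow from simple subtraction and addition. This is a minimal sketch of that arithmetic using the c7a.48xlarge numbers as printed (so the inputs are already rounded); `allocatable` and `runner_request` are hypothetical helper names, not functions of the analysis tool.

```python
MI_PER_GI = 1024


def allocatable(total_vcpu, actual_mem_gi, reserved_cpu_m, reserved_mem_gi, ds_cpu_m, ds_mem_mi):
    """Node capacity left for runners after kubelet reservation and DaemonSet overhead."""
    cpu_m = total_vcpu * 1000 - reserved_cpu_m - ds_cpu_m
    mem_gi = actual_mem_gi - reserved_mem_gi - ds_mem_mi / MI_PER_GI
    return cpu_m, mem_gi


def runner_request(job_cores, job_mem_gi, hook_cpu_m=320, hook_mem_mi=522):
    """Per-pod request: the job container plus the fixed hooks overhead."""
    return job_cores * 1000 + hook_cpu_m, job_mem_gi + hook_mem_mi / MI_PER_GI


cpu_m, mem_gi = allocatable(192, 355.2, 550, 8.3, 370, 934)
print(cpu_m, round(mem_gi, 1))  # 191080 346.0

req_cpu_m, req_mem_gi = runner_request(16, 32.0)  # l-x86iavx512-16-32
print(req_cpu_m, round(req_mem_gi, 1))  # 16320 32.5
```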

  Homogeneous packing (single runner type fills the node):
    l-x86iavx512-16-32: 10 pods
      CPU:  85.4% (163200m / 191080m) waste: 27880m (27.9 cores)
      MEM:  94.0% (325.1Gi / 346.0Gi) waste: 20.9Gi
      Bottleneck: MEM
    l-x86iavx512-2-4: 76 pods
      CPU:  92.3% (176320m / 191080m) waste: 14760m (14.8 cores)
      MEM:  99.1% (342.7Gi / 346.0Gi) waste: 3.3Gi
      Bottleneck: MEM
    l-x86iavx512-37-68: 5 pods
      CPU:  97.7% (186600m / 191080m) waste: 4480m (4.5 cores)
      MEM:  99.0% (342.5Gi / 346.0Gi) waste: 3.5Gi
      Bottleneck: CPU
    l-x86iavx512-46-85: 4 pods
      CPU:  97.0% (185280m / 191080m) waste: 5800m (5.8 cores)
      MEM:  98.8% (342.0Gi / 346.0Gi) waste: 4.0Gi
      Bottleneck: CPU
    l-x86iavx512-8-16: 20 pods
      CPU:  87.1% (166400m / 191080m) waste: 24680m (24.7 cores)
      MEM:  95.4% (330.2Gi / 346.0Gi) waste: 15.8Gi
      Bottleneck: MEM
    l-x86iavx512-94-192: 1 pods
      CPU:  49.4% (94320m / 191080m) waste: 96760m (96.8 cores)
      MEM:  54.8% (189.5Gi / 346.0Gi) waste: 156.5Gi
      Bottleneck: MEM
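
The homogeneous packing numbers can be sketched as a floor division per resource. The version below is an assumption about how the report derives them: the pod count is the smaller of the CPU-limited and memory-limited counts, and a tie is reported as CPU, which matches the rows above. `pack` is a hypothetical helper name, and the allocatable memory is the rounded 346.0Gi from the report.

```python
def pack(alloc_cpu_m, alloc_mem_gi, req_cpu_m, req_mem_gi):
    """Fill a node with one runner type; return pods, bottleneck, CPU and MEM utilization."""
    by_cpu = alloc_cpu_m // req_cpu_m
    by_mem = int(alloc_mem_gi // req_mem_gi)
    pods = min(by_cpu, by_mem)
    bottleneck = "MEM" if by_mem < by_cpu else "CPU"  # assumed: ties are reported as CPU
    cpu_util = pods * req_cpu_m / alloc_cpu_m
    mem_util = pods * req_mem_gi / alloc_mem_gi
    return pods, bottleneck, cpu_util, mem_util


HOOK_MEM_GI = 522 / 1024  # hooks overhead of 522Mi expressed in Gi

# l-x86iavx512-16-32 on c7a.48xlarge
pods, bottleneck, cpu_util, mem_util = pack(191_080, 346.0, 16_320, 32.0 + HOOK_MEM_GI)
print(pods, bottleneck, f"{cpu_util:.1%}", f"{mem_util:.1%}")  # 10 MEM 85.4% 94.0%

# l-x86iavx512-94-192 on c7a.48xlarge: a second pod fits by CPU but not by memory
big = pack(191_080, 346.0, 94_320, 189.0 + HOOK_MEM_GI)
print(big[0], big[1], f"{big[2]:.1%}", f"{big[3]:.1%}")  # 1 MEM 49.4% 54.8%
```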

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 236

    Top 5 most efficient:
      #1 [5xl-x86iavx512-37-68]
         CPU:  97.7%  MEM:  99.0%  waste: 4.5c + 3.5Gi
      #2 [1xl-x86iavx512-2-4, 2xl-x86iavx512-37-68, 2xl-x86iavx512-46-85, 2xl-x86iavx512-8-16]
         CPU:  97.5%  MEM:  99.9%  waste: 4.8c + 466Mi
      #3 [12xl-x86iavx512-2-4, 3xl-x86iavx512-37-68, 1xl-x86iavx512-46-85]
         CPU:  97.4%  MEM:  99.7%  waste: 5.0c + 888Mi
      #4 [1xl-x86iavx512-16-32, 1xl-x86iavx512-2-4, 2xl-x86iavx512-37-68, 2xl-x86iavx512-46-85]
         CPU:  97.3%  MEM:  99.7%  waste: 5.2c + 988Mi
      #5 [8xl-x86iavx512-2-4, 2xl-x86iavx512-37-68, 2xl-x86iavx512-46-85]
         CPU:  97.3%  MEM:  99.4%  waste: 5.2c + 1.9Gi

    Bottom 5 least efficient (money on the table):
      #1 [1xl-x86iavx512-2-4, 9xl-x86iavx512-8-16, 1xl-x86iavx512-94-192]
         CPU:  89.8%  MEM:  99.0%  waste: 19.6c + 3.4Gi
      #2 [2xl-x86iavx512-16-32, 12xl-x86iavx512-2-4, 2xl-x86iavx512-8-16, 1xl-x86iavx512-94-192]
         CPU:  89.7%  MEM:  98.7%  waste: 19.6c + 4.4Gi
      #3 [4xl-x86iavx512-16-32, 5xl-x86iavx512-2-4, 1xl-x86iavx512-94-192]
         CPU:  89.6%  MEM:  98.9%  waste: 19.9c + 3.9Gi
      #4 [1xl-x86iavx512-16-32, 1xl-x86iavx512-2-4, 7xl-x86iavx512-8-16, 1xl-x86iavx512-94-192]
         CPU:  89.6%  MEM:  98.9%  waste: 19.9c + 3.9Gi
      #5 [2xl-x86iavx512-16-32, 1xl-x86iavx512-2-4, 5xl-x86iavx512-8-16, 1xl-x86iavx512-94-192]
         CPU:  89.4%  MEM:  98.7%  waste: 20.2c + 4.4Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: c7i.12xlarge
  Total: 48 vCPU, 96Gi advertised (88.8Gi actual)
  Kubelet reserved: 190m CPU, 2.9Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 47440m CPU (47.4 cores), 85.0Gi RAM

  Runners targeting this node:
    - l-x86iamx-14-27: 14320m CPU, 27.5Gi RAM (job: 14c+27.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-22-41: 22320m CPU, 41.5Gi RAM (job: 22c+41.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-46-84: 46320m CPU, 84.5Gi RAM (job: 46c+84.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-8-16: 8320m CPU, 16.5Gi RAM (job: 8c+16.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86iamx-14-27: 3 pods
      CPU:  90.6% (42960m / 47440m) waste: 4480m (4.5 cores)
      MEM:  97.1% (82.5Gi / 85.0Gi) waste: 2.5Gi
      Bottleneck: CPU
    l-x86iamx-22-41: 2 pods
      CPU:  94.1% (44640m / 47440m) waste: 2800m (2.8 cores)
      MEM:  97.6% (83.0Gi / 85.0Gi) waste: 2.0Gi
      Bottleneck: CPU
    l-x86iamx-46-84: 1 pods
      CPU:  97.6% (46320m / 47440m) waste: 1120m (1.1 cores)
      MEM:  99.4% (84.5Gi / 85.0Gi) waste: 530Mi
      Bottleneck: CPU
    l-x86iamx-8-16: 5 pods
      CPU:  87.7% (41600m / 47440m) waste: 5840m (5.8 cores)
      MEM:  97.1% (82.5Gi / 85.0Gi) waste: 2.5Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 8

    Top 5 most efficient:
      #1 [1xl-x86iamx-46-84]
         CPU:  97.6%  MEM:  99.4%  waste: 1.1c + 530Mi
      #2 [2xl-x86iamx-22-41]
         CPU:  94.1%  MEM:  97.6%  waste: 2.8c + 2.0Gi
      #3 [3xl-x86iamx-14-27]
         CPU:  90.6%  MEM:  97.1%  waste: 4.5c + 2.5Gi
      #4 [5xl-x86iamx-8-16]
         CPU:  87.7%  MEM:  97.1%  waste: 5.8c + 2.5Gi
      #5 [1xl-x86iamx-14-27, 3xl-x86iamx-8-16]
         CPU:  82.8%  MEM:  90.6%  waste: 8.2c + 8.0Gi

    Bottom 3 least efficient (money on the table):
      #1 [1xl-x86iamx-22-41, 2xl-x86iamx-8-16]
         CPU:  82.1%  MEM:  87.7%  waste: 8.5c + 10.5Gi
      #2 [2xl-x86iamx-14-27, 1xl-x86iamx-8-16]
         CPU:  77.9%  MEM:  84.1%  waste: 10.5c + 13.5Gi
      #3 [1xl-x86iamx-14-27, 1xl-x86iamx-22-41]
         CPU:  77.2%  MEM:  81.2%  waste: 10.8c + 16.0Gi
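
A "maximal" combo is a mix of pods that fits on the node and leaves no room for one more pod of any type. The brute-force sketch below illustrates that definition on the c7i.12xlarge inputs; it is an assumption about the approach, not the analysis tool's code. Memory is handled in MiB, using the rounded 85.0Gi allocatable figure, and `maximal_combos` is a hypothetical name.

```python
from itertools import product

GI = 1024  # MiB per GiB


def maximal_combos(cap_cpu_m, cap_mem_mi, runners):
    """Enumerate pod-count mixes that fit the node and leave no room for one more pod.

    `runners` maps a runner name to (cpu_millicores, mem_mib).
    """
    names = list(runners)
    # Upper bound per runner type: how many fit when the node holds nothing else.
    limits = [min(cap_cpu_m // c, cap_mem_mi // m) for c, m in runners.values()]
    combos = []
    for counts in product(*(range(limit + 1) for limit in limits)):
        cpu = sum(n * runners[name][0] for n, name in zip(counts, names))
        mem = sum(n * runners[name][1] for n, name in zip(counts, names))
        if cpu > cap_cpu_m or mem > cap_mem_mi:
            continue  # the mix does not fit on the node
        if any(cpu + c <= cap_cpu_m and mem + m <= cap_mem_mi for c, m in runners.values()):
            continue  # another pod still fits, so the mix is not maximal
        combos.append({name: n for name, n in zip(names, counts) if n})
    return combos


# Requests are job size plus the 320m / 522Mi hooks overhead.
runners = {
    "l-x86iamx-14-27": (14_320, 27 * GI + 522),
    "l-x86iamx-22-41": (22_320, 41 * GI + 522),
    "l-x86iamx-46-84": (46_320, 84 * GI + 522),
    "l-x86iamx-8-16": (8_320, 16 * GI + 522),
}

combos = maximal_combos(47_440, 85 * GI, runners)
print(len(combos))  # 8
for combo in combos:
    print(combo)
```

With these inputs the enumeration yields the same count of 8 as the report. Exhaustive enumeration is only practical for a handful of runner types per node; a node such as c7a.48xlarge with 236 maximal combos already has a much larger search space.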

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: c7i.metal-24xl
  Total: 96 vCPU, 192Gi advertised (177.6Gi actual)
  Kubelet reserved: 310m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 95320m CPU (95.3 cores), 168.4Gi RAM

  Runners targeting this node:
    - l-bx86iamx-92-167: 92320m CPU, 167.5Gi RAM (job: 92c+167.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-bx86iamx-92-167: 1 pods
      CPU:  96.9% (92320m / 95320m) waste: 3000m (3.0 cores)
      MEM:  99.5% (167.5Gi / 168.4Gi) waste: 936Mi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-bx86iamx-92-167]
         CPU:  96.9%  MEM:  99.5%  waste: 3.0c + 936Mi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g4dn.12xlarge
  Total: 48 vCPU, 192Gi advertised (177.6Gi actual), 4 GPU
  Kubelet reserved: 190m CPU, 2.9Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 47290m CPU (47.3 cores), 173.5Gi RAM, 4 GPU

  Runners targeting this node:
    - l-x86iavx512-45-172-t4-4: 45320m CPU, 172.5Gi RAM, 4 GPU (job: 45c+172.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86iavx512-45-172-t4-4: 1 pods
      CPU:  95.8% (45320m / 47290m) waste: 1970m (2.0 cores)
      MEM:  99.4% (172.5Gi / 173.5Gi) waste: 1.0Gi
      GPU: 100.0% (4 / 4)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86iavx512-45-172-t4-4]
         CPU:  95.8%  MEM:  99.4%  GPU: 100.0%  waste: 2.0c + 1.0Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g4dn.8xlarge
  Total: 32 vCPU, 128Gi advertised (118.4Gi actual), 1 GPU
  Kubelet reserved: 150m CPU, 993Mi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 31330m CPU (31.3 cores), 116.2Gi RAM, 1 GPU

  Runners targeting this node:
    - l-x86iavx512-29-115-t4: 29320m CPU, 115.5Gi RAM, 1 GPU (job: 29c+115.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86iavx512-29-115-t4: 1 pods
      CPU:  93.6% (29320m / 31330m) waste: 2010m (2.0 cores)
      MEM:  99.4% (115.5Gi / 116.2Gi) waste: 712Mi
      GPU: 100.0% (1 / 1)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86iavx512-29-115-t4]
         CPU:  93.6%  MEM:  99.4%  GPU: 100.0%  waste: 2.0c + 712Mi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g4dn.metal
  Total: 96 vCPU, 384Gi advertised (355.2Gi actual), 8 GPU
  Kubelet reserved: 310m CPU, 8.3Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 95170m CPU (95.2 cores), 345.7Gi RAM, 8 GPU

  Runners targeting this node:
    - l-bx86iavx512-94-344-t4-8: 94320m CPU, 344.5Gi RAM, 8 GPU (job: 94c+344.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-bx86iavx512-94-344-t4-8: 1 pods
      CPU:  99.1% (94320m / 95170m) waste: 850m (0.8 cores)
      MEM:  99.7% (344.5Gi / 345.7Gi) waste: 1.2Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-bx86iavx512-94-344-t4-8]
         CPU:  99.1%  MEM:  99.7%  GPU: 100.0%  waste: 0.8c + 1.2Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g5.12xlarge
  Total: 48 vCPU, 192Gi advertised (177.6Gi actual), 4 GPU
  Kubelet reserved: 190m CPU, 8.3Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 47290m CPU (47.3 cores), 168.1Gi RAM, 4 GPU

  Runners targeting this node:
    - l-x86aavx2-45-167-a10g-4: 45320m CPU, 167.5Gi RAM, 4 GPU (job: 45c+167.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86aavx2-45-167-a10g-4: 1 pods
      CPU:  95.8% (45320m / 47290m) waste: 1970m (2.0 cores)
      MEM:  99.6% (167.5Gi / 168.1Gi) waste: 616Mi
      GPU: 100.0% (4 / 4)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86aavx2-45-167-a10g-4]
         CPU:  95.8%  MEM:  99.6%  GPU: 100.0%  waste: 2.0c + 616Mi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g5.48xlarge
  Total: 192 vCPU, 768Gi advertised (710.4Gi actual), 8 GPU
  Kubelet reserved: 550m CPU, 4.1Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 190930m CPU (190.9 cores), 705.1Gi RAM, 8 GPU

  Runners targeting this node:
    - l-x86aavx2-189-704-a10g-8: 189320m CPU, 704.5Gi RAM, 8 GPU (job: 189c+704.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86aavx2-189-704-a10g-8: 1 pods
      CPU:  99.2% (189320m / 190930m) waste: 1610m (1.6 cores)
      MEM:  99.9% (704.5Gi / 705.1Gi) waste: 627Mi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86aavx2-189-704-a10g-8]
         CPU:  99.2%  MEM:  99.9%  GPU: 100.0%  waste: 1.6c + 627Mi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g5.8xlarge
  Total: 32 vCPU, 128Gi advertised (118.4Gi actual), 1 GPU
  Kubelet reserved: 150m CPU, 2.9Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 31330m CPU (31.3 cores), 114.3Gi RAM, 1 GPU

  Runners targeting this node:
    - l-x86aavx2-29-113-a10g: 29320m CPU, 113.5Gi RAM, 1 GPU (job: 29c+113.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86aavx2-29-113-a10g: 1 pods
      CPU:  93.6% (29320m / 31330m) waste: 2010m (2.0 cores)
      MEM:  99.3% (113.5Gi / 114.3Gi) waste: 824Mi
      GPU: 100.0% (1 / 1)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86aavx2-29-113-a10g]
         CPU:  93.6%  MEM:  99.3%  GPU: 100.0%  waste: 2.0c + 824Mi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g6.12xlarge
  Total: 48 vCPU, 192Gi advertised (177.6Gi actual), 4 GPU
  Kubelet reserved: 190m CPU, 2.9Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 47290m CPU (47.3 cores), 173.5Gi RAM, 4 GPU

  Runners targeting this node:
    - l-x86aavx2-45-172-l4-4: 45320m CPU, 172.5Gi RAM, 4 GPU (job: 45c+172.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86aavx2-45-172-l4-4: 1 pods
      CPU:  95.8% (45320m / 47290m) waste: 1970m (2.0 cores)
      MEM:  99.4% (172.5Gi / 173.5Gi) waste: 1.0Gi
      GPU: 100.0% (4 / 4)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86aavx2-45-172-l4-4]
         CPU:  95.8%  MEM:  99.4%  GPU: 100.0%  waste: 2.0c + 1.0Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: g6.8xlarge
  Total: 32 vCPU, 128Gi advertised (118.4Gi actual), 1 GPU
  Kubelet reserved: 150m CPU, 2.9Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 31330m CPU (31.3 cores), 114.3Gi RAM, 1 GPU

  Runners targeting this node:
    - l-x86aavx2-29-113-l4: 29320m CPU, 113.5Gi RAM, 1 GPU (job: 29c+113.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86aavx2-29-113-l4: 1 pods
      CPU:  93.6% (29320m / 31330m) waste: 2010m (2.0 cores)
      MEM:  99.3% (113.5Gi / 114.3Gi) waste: 824Mi
      GPU: 100.0% (1 / 1)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86aavx2-29-113-l4]
         CPU:  93.6%  MEM:  99.3%  GPU: 100.0%  waste: 2.0c + 824Mi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: m6i.32xlarge
  Total: 128 vCPU, 512Gi advertised (473.7Gi actual)
  Kubelet reserved: 390m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 127240m CPU (127.2 cores), 464.5Gi RAM

  Runners targeting this node:
    - l-x86aavx512-125-463: 125320m CPU, 463.5Gi RAM (job: 125c+463.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86aavx512-125-463: 1 pods
      CPU:  98.5% (125320m / 127240m) waste: 1920m (1.9 cores)
      MEM:  99.8% (463.5Gi / 464.5Gi) waste: 1.0Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-x86aavx512-125-463]
         CPU:  98.5%  MEM:  99.8%  waste: 1.9c + 1.0Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: m7g.8xlarge
  Total: 32 vCPU, 128Gi advertised (118.4Gi actual)
  Kubelet reserved: 150m CPU, 2.9Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 31480m CPU (31.5 cores), 114.6Gi RAM

  Runners targeting this node:
    - l-arm64g3-16-62: 16320m CPU, 62.5Gi RAM (job: 16c+62.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-arm64g3-16-62: 1 pods
      CPU:  51.8% (16320m / 31480m) waste: 15160m (15.2 cores)
      MEM:  54.5% (62.5Gi / 114.6Gi) waste: 52.1Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-arm64g3-16-62]
         CPU:  51.8%  MEM:  54.5%  waste: 15.2c + 52.1Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: m7g.metal
  Total: 64 vCPU, 256Gi advertised (236.9Gi actual)
  Kubelet reserved: 230m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 63400m CPU (63.4 cores), 227.7Gi RAM

  Runners targeting this node:
    - l-barm64g3-62-226: 62320m CPU, 226.5Gi RAM (job: 62c+226.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-barm64g3-62-226: 1 pods
      CPU:  98.3% (62320m / 63400m) waste: 1080m (1.1 cores)
      MEM:  99.5% (226.5Gi / 227.7Gi) waste: 1.2Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-barm64g3-62-226]
         CPU:  98.3%  MEM:  99.5%  waste: 1.1c + 1.2Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: m7i.48xlarge
  Total: 192 vCPU, 768Gi advertised (710.4Gi actual)
  Kubelet reserved: 550m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 191080m CPU (191.1 cores), 701.2Gi RAM

  Runners targeting this node:
    - l-x86iamx-32-128: 32320m CPU, 128.5Gi RAM (job: 32c+128.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-8-32: 8320m CPU, 32.5Gi RAM (job: 8c+32.0Gi, hooks: 320m+522Mi)
    - l-x86iavx2-40-160: 40320m CPU, 160.5Gi RAM (job: 40c+160.0Gi, hooks: 320m+522Mi)
    - l-x86iavx2-8-32: 8320m CPU, 32.5Gi RAM (job: 8c+32.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86iamx-32-128: 5 pods
      CPU:  84.6% (161600m / 191080m) waste: 29480m (29.5 cores)
      MEM:  91.6% (642.5Gi / 701.2Gi) waste: 58.7Gi
      Bottleneck: CPU
    l-x86iamx-8-32: 21 pods
      CPU:  91.4% (174720m / 191080m) waste: 16360m (16.4 cores)
      MEM:  97.4% (682.7Gi / 701.2Gi) waste: 18.5Gi
      Bottleneck: MEM
    l-x86iavx2-40-160: 4 pods
      CPU:  84.4% (161280m / 191080m) waste: 29800m (29.8 cores)
      MEM:  91.6% (642.0Gi / 701.2Gi) waste: 59.2Gi
      Bottleneck: CPU
    l-x86iavx2-8-32: 21 pods
      CPU:  91.4% (174720m / 191080m) waste: 16360m (16.4 cores)
      MEM:  97.4% (682.7Gi / 701.2Gi) waste: 18.5Gi
      Bottleneck: MEM

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 131

    Top 5 most efficient:
      #1 [1xl-x86iamx-32-128, 17xl-x86iamx-8-32]
         CPU:  90.9%  MEM:  97.1%  waste: 17.3c + 20.0Gi
      #2 [1xl-x86iamx-32-128, 16xl-x86iamx-8-32, 1xl-x86iavx2-8-32]
         CPU:  90.9%  MEM:  97.1%  waste: 17.3c + 20.0Gi
      #3 [1xl-x86iamx-32-128, 15xl-x86iamx-8-32, 2xl-x86iavx2-8-32]
         CPU:  90.9%  MEM:  97.1%  waste: 17.3c + 20.0Gi
      #4 [1xl-x86iamx-32-128, 14xl-x86iamx-8-32, 3xl-x86iavx2-8-32]
         CPU:  90.9%  MEM:  97.1%  waste: 17.3c + 20.0Gi
      #5 [1xl-x86iamx-32-128, 13xl-x86iamx-8-32, 4xl-x86iavx2-8-32]
         CPU:  90.9%  MEM:  97.1%  waste: 17.3c + 20.0Gi

    Bottom 5 least efficient (money on the table):
      #1 [1xl-x86iamx-32-128, 1xl-x86iamx-8-32, 3xl-x86iavx2-40-160, 1xl-x86iavx2-8-32]
         CPU:  88.9%  MEM:  96.3%  waste: 21.2c + 26.2Gi
      #2 [1xl-x86iamx-32-128, 3xl-x86iavx2-40-160, 2xl-x86iavx2-8-32]
         CPU:  88.9%  MEM:  96.3%  waste: 21.2c + 26.2Gi
      #3 [4xl-x86iamx-32-128, 1xl-x86iavx2-40-160]
         CPU:  88.8%  MEM:  96.2%  waste: 21.5c + 26.7Gi
      #4 [1xl-x86iamx-8-32, 4xl-x86iavx2-40-160]
         CPU:  88.8%  MEM:  96.2%  waste: 21.5c + 26.7Gi
      #5 [4xl-x86iavx2-40-160, 1xl-x86iavx2-8-32]
         CPU:  88.8%  MEM:  96.2%  waste: 21.5c + 26.7Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: m8g.16xlarge
  Total: 64 vCPU, 256Gi advertised (236.9Gi actual)
  Kubelet reserved: 230m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 63400m CPU (63.4 cores), 227.7Gi RAM

  Runners targeting this node:
    - l-barm64g4-62-226: 62320m CPU, 226.5Gi RAM (job: 62c+226.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-barm64g4-62-226: 1 pods
      CPU:  98.3% (62320m / 63400m) waste: 1080m (1.1 cores)
      MEM:  99.5% (226.5Gi / 227.7Gi) waste: 1.2Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-barm64g4-62-226]
         CPU:  98.3%  MEM:  99.5%  waste: 1.1c + 1.2Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: m8g.48xlarge
  Total: 192 vCPU, 768Gi advertised (710.4Gi actual)
  Kubelet reserved: 550m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 191080m CPU (191.1 cores), 701.2Gi RAM

  Runners targeting this node:
    - l-arm64g4-16-62: 16320m CPU, 62.5Gi RAM (job: 16c+62.0Gi, hooks: 320m+522Mi)
    - rel-l-arm64g4-16-62: 16320m CPU, 62.5Gi RAM (job: 16c+62.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-arm64g4-16-62: 11 pods
      CPU:  94.0% (179520m / 191080m) waste: 11560m (11.6 cores)
      MEM:  98.1% (687.6Gi / 701.2Gi) waste: 13.6Gi
      Bottleneck: CPU
    rel-l-arm64g4-16-62: 11 pods
      CPU:  94.0% (179520m / 191080m) waste: 11560m (11.6 cores)
      MEM:  98.1% (687.6Gi / 701.2Gi) waste: 13.6Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 12

    Top 5 most efficient:
      #1 [11xl-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #2 [10xl-arm64g4-16-62, 1xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #3 [9xl-arm64g4-16-62, 2xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #4 [8xl-arm64g4-16-62, 3xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #5 [7xl-arm64g4-16-62, 4xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi

    Bottom 5 least efficient (money on the table):
      #1 [4xl-arm64g4-16-62, 7xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #2 [3xl-arm64g4-16-62, 8xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #3 [2xl-arm64g4-16-62, 9xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #4 [1xl-arm64g4-16-62, 10xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi
      #5 [11xrel-l-arm64g4-16-62]
         CPU:  94.0%  MEM:  98.1%  waste: 11.6c + 13.6Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: p4d.24xlarge
  Total: 96 vCPU, 1152Gi advertised (1065.2Gi actual), 8 GPU
  Kubelet reserved: 310m CPU, 3.0Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 95170m CPU (95.2 cores), 1060.9Gi RAM, 8 GPU

  Runners targeting this node:
    - l-bx86iavx512-88-1000-a100-8: 88320m CPU, 1000.5Gi RAM, 8 GPU (job: 88c+1000.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-11-125-a100: 11320m CPU, 125.5Gi RAM, 1 GPU (job: 11c+125.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-22-250-a100-2: 22320m CPU, 250.5Gi RAM, 2 GPU (job: 22c+250.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-44-500-a100-4: 44320m CPU, 500.5Gi RAM, 4 GPU (job: 44c+500.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-bx86iavx512-88-1000-a100-8: 1 pods
      CPU:  92.8% (88320m / 95170m) waste: 6850m (6.8 cores)
      MEM:  94.3% (1000.5Gi / 1060.9Gi) waste: 60.4Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iavx512-11-125-a100: 8 pods
      CPU:  95.2% (90560m / 95170m) waste: 4610m (4.6 cores)
      MEM:  94.6% (1004.1Gi / 1060.9Gi) waste: 56.9Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iavx512-22-250-a100-2: 4 pods
      CPU:  93.8% (89280m / 95170m) waste: 5890m (5.9 cores)
      MEM:  94.4% (1002.0Gi / 1060.9Gi) waste: 58.9Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iavx512-44-500-a100-4: 2 pods
      CPU:  93.1% (88640m / 95170m) waste: 6530m (6.5 cores)
      MEM:  94.4% (1001.0Gi / 1060.9Gi) waste: 59.9Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 10

    Top 5 most efficient:
      #1 [8xl-x86iavx512-11-125-a100]
         CPU:  95.2%  MEM:  94.6%  GPU: 100.0%  waste: 4.6c + 56.9Gi
      #2 [6xl-x86iavx512-11-125-a100, 1xl-x86iavx512-22-250-a100-2]
         CPU:  94.8%  MEM:  94.6%  GPU: 100.0%  waste: 4.9c + 57.4Gi
      #3 [4xl-x86iavx512-11-125-a100, 2xl-x86iavx512-22-250-a100-2]
         CPU:  94.5%  MEM:  94.5%  GPU: 100.0%  waste: 5.2c + 57.9Gi
      #4 [4xl-x86iavx512-11-125-a100, 1xl-x86iavx512-44-500-a100-4]
         CPU:  94.1%  MEM:  94.5%  GPU: 100.0%  waste: 5.6c + 58.4Gi
      #5 [2xl-x86iavx512-11-125-a100, 3xl-x86iavx512-22-250-a100-2]
         CPU:  94.1%  MEM:  94.5%  GPU: 100.0%  waste: 5.6c + 58.4Gi

    Bottom 5 least efficient (money on the table):
      #1 [2xl-x86iavx512-11-125-a100, 1xl-x86iavx512-22-250-a100-2, 1xl-x86iavx512-44-500-a100-4]
         CPU:  93.8%  MEM:  94.4%  GPU: 100.0%  waste: 5.9c + 58.9Gi
      #2 [4xl-x86iavx512-22-250-a100-2]
         CPU:  93.8%  MEM:  94.4%  GPU: 100.0%  waste: 5.9c + 58.9Gi
      #3 [2xl-x86iavx512-22-250-a100-2, 1xl-x86iavx512-44-500-a100-4]
         CPU:  93.5%  MEM:  94.4%  GPU: 100.0%  waste: 6.2c + 59.4Gi
      #4 [2xl-x86iavx512-44-500-a100-4]
         CPU:  93.1%  MEM:  94.4%  GPU: 100.0%  waste: 6.5c + 59.9Gi
      #5 [1xl-bx86iavx512-88-1000-a100-8]
         CPU:  92.8%  MEM:  94.3%  GPU: 100.0%  waste: 6.8c + 60.4Gi
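
The total of 10 maximal combos for p4d.24xlarge can be sanity-checked by brute force. The sketch below is not the analyzer's source. It assumes a combo counts as "maximal" when it fits within allocatable CPU, memory, and GPU and no further pod of any runner type still fits, and it assumes per-pod memory is job memory plus the 522Mi hooks overhead. All figures are copied from the report above; the names `ALLOC`, `RUNNERS`, `fits`, and `maximal_combos` are illustrative.

```python
from itertools import product

# Allocatable on p4d.24xlarge as reported: (CPU millicores, memory MiB, GPUs).
ALLOC = (95170, 1060.9 * 1024, 8)

# Per-pod requests as reported: job + hooks (320m CPU, 522Mi RAM).
RUNNERS = {
    "l-bx86iavx512-88-1000-a100-8": (88320, 1000 * 1024 + 522, 8),
    "l-x86iavx512-11-125-a100": (11320, 125 * 1024 + 522, 1),
    "l-x86iavx512-22-250-a100-2": (22320, 250 * 1024 + 522, 2),
    "l-x86iavx512-44-500-a100-4": (44320, 500 * 1024 + 522, 4),
}
NAMES = list(RUNNERS)


def fits(counts):
    """True if the pod counts stay within every allocatable resource."""
    return all(
        sum(n * RUNNERS[name][i] for name, n in zip(NAMES, counts)) <= ALLOC[i]
        for i in range(3)
    )


def maximal_combos():
    """Enumerate non-empty combos that fit and leave no room for one more pod."""
    # Upper bound per runner type: how many fit when that type is alone on the node.
    caps = [min(int(ALLOC[i] // RUNNERS[name][i]) for i in range(3)) for name in NAMES]
    found = []
    for counts in product(*(range(cap + 1) for cap in caps)):
        if not any(counts) or not fits(counts):
            continue
        # Try adding one more pod of each runner type in turn.
        grown = (
            counts[:k] + (counts[k] + 1,) + counts[k + 1:]
            for k in range(len(counts))
        )
        if not any(fits(g) for g in grown):
            found.append(dict(zip(NAMES, counts)))
    return found


if __name__ == "__main__":
    combos = maximal_combos()
    print(len(combos))
    for combo in combos:
        print({name: n for name, n in combo.items() if n})
```

With these inputs the enumeration yields 10 combos, one for each way of splitting the 8 GPUs into runner sizes of 1, 2, 4, and 8, which matches the reported total. GPU count is what closes every combo here, since CPU and memory never run out before the eighth GPU is assigned.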

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: p5.48xlarge
  Total: 192 vCPU, 2048Gi advertised (1894.4Gi actual), 8 GPU
  Kubelet reserved: 550m CPU, 2.5Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 190930m CPU (190.9 cores), 1890.7Gi RAM, 8 GPU

  Runners targeting this node:
    - l-bx86iamx-176-1800-h100-8: 176320m CPU, 1800.5Gi RAM, 8 GPU (job: 176c+1800.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-22-225-h100: 22320m CPU, 225.5Gi RAM, 1 GPU (job: 22c+225.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-44-450-h100-2: 44320m CPU, 450.5Gi RAM, 2 GPU (job: 44c+450.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-88-900-h100-4: 88320m CPU, 900.5Gi RAM, 4 GPU (job: 88c+900.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-bx86iamx-176-1800-h100-8: 1 pods
      CPU:  92.3% (176320m / 190930m) waste: 14610m (14.6 cores)
      MEM:  95.2% (1800.5Gi / 1890.7Gi) waste: 90.2Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iamx-22-225-h100: 8 pods
      CPU:  93.5% (178560m / 190930m) waste: 12370m (12.4 cores)
      MEM:  95.4% (1804.1Gi / 1890.7Gi) waste: 86.6Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iamx-44-450-h100-2: 4 pods
      CPU:  92.9% (177280m / 190930m) waste: 13650m (13.7 cores)
      MEM:  95.3% (1802.0Gi / 1890.7Gi) waste: 88.7Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iamx-88-900-h100-4: 2 pods
      CPU:  92.5% (176640m / 190930m) waste: 14290m (14.3 cores)
      MEM:  95.3% (1801.0Gi / 1890.7Gi) waste: 89.7Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 10

    Top 5 most efficient:
      #1 [8xl-x86iamx-22-225-h100]
         CPU:  93.5%  MEM:  95.4%  GPU: 100.0%  waste: 12.4c + 86.6Gi
      #2 [6xl-x86iamx-22-225-h100, 1xl-x86iamx-44-450-h100-2]
         CPU:  93.4%  MEM:  95.4%  GPU: 100.0%  waste: 12.7c + 87.1Gi
      #3 [4xl-x86iamx-22-225-h100, 2xl-x86iamx-44-450-h100-2]
         CPU:  93.2%  MEM:  95.4%  GPU: 100.0%  waste: 13.0c + 87.6Gi
      #4 [4xl-x86iamx-22-225-h100, 1xl-x86iamx-88-900-h100-4]
         CPU:  93.0%  MEM:  95.3%  GPU: 100.0%  waste: 13.3c + 88.2Gi
      #5 [2xl-x86iamx-22-225-h100, 3xl-x86iamx-44-450-h100-2]
         CPU:  93.0%  MEM:  95.3%  GPU: 100.0%  waste: 13.3c + 88.2Gi

    Bottom 5 least efficient (money on the table):
      #1 [2xl-x86iamx-22-225-h100, 1xl-x86iamx-44-450-h100-2, 1xl-x86iamx-88-900-h100-4]
         CPU:  92.9%  MEM:  95.3%  GPU: 100.0%  waste: 13.7c + 88.7Gi
      #2 [4xl-x86iamx-44-450-h100-2]
         CPU:  92.9%  MEM:  95.3%  GPU: 100.0%  waste: 13.7c + 88.7Gi
      #3 [2xl-x86iamx-44-450-h100-2, 1xl-x86iamx-88-900-h100-4]
         CPU:  92.7%  MEM:  95.3%  GPU: 100.0%  waste: 14.0c + 89.2Gi
      #4 [2xl-x86iamx-88-900-h100-4]
         CPU:  92.5%  MEM:  95.3%  GPU: 100.0%  waste: 14.3c + 89.7Gi
      #5 [1xl-bx86iamx-176-1800-h100-8]
         CPU:  92.3%  MEM:  95.2%  GPU: 100.0%  waste: 14.6c + 90.2Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: p6-b200.48xlarge
  Total: 192 vCPU, 2048Gi advertised (1894.4Gi actual), 8 GPU
  Kubelet reserved: 550m CPU, 2.5Gi RAM
  DaemonSet overhead: 520m CPU, 1.2Gi RAM
  Allocatable for runners: 190930m CPU (190.9 cores), 1890.7Gi RAM, 8 GPU

  Runners targeting this node:
    - l-bx86iamx-176-1800-b200-8: 176320m CPU, 1800.5Gi RAM, 8 GPU (job: 176c+1800.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-22-225-b200: 22320m CPU, 225.5Gi RAM, 1 GPU (job: 22c+225.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-44-450-b200-2: 44320m CPU, 450.5Gi RAM, 2 GPU (job: 44c+450.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-88-900-b200-4: 88320m CPU, 900.5Gi RAM, 4 GPU (job: 88c+900.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-bx86iamx-176-1800-b200-8: 1 pods
      CPU:  92.3% (176320m / 190930m) waste: 14610m (14.6 cores)
      MEM:  95.2% (1800.5Gi / 1890.7Gi) waste: 90.2Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iamx-22-225-b200: 8 pods
      CPU:  93.5% (178560m / 190930m) waste: 12370m (12.4 cores)
      MEM:  95.4% (1804.1Gi / 1890.7Gi) waste: 86.6Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iamx-44-450-b200-2: 4 pods
      CPU:  92.9% (177280m / 190930m) waste: 13650m (13.7 cores)
      MEM:  95.3% (1802.0Gi / 1890.7Gi) waste: 88.7Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU
    l-x86iamx-88-900-b200-4: 2 pods
      CPU:  92.5% (176640m / 190930m) waste: 14290m (14.3 cores)
      MEM:  95.3% (1801.0Gi / 1890.7Gi) waste: 89.7Gi
      GPU: 100.0% (8 / 8)
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 10

    Top 5 most efficient:
      #1 [8xl-x86iamx-22-225-b200]
         CPU:  93.5%  MEM:  95.4%  GPU: 100.0%  waste: 12.4c + 86.6Gi
      #2 [6xl-x86iamx-22-225-b200, 1xl-x86iamx-44-450-b200-2]
         CPU:  93.4%  MEM:  95.4%  GPU: 100.0%  waste: 12.7c + 87.1Gi
      #3 [4xl-x86iamx-22-225-b200, 2xl-x86iamx-44-450-b200-2]
         CPU:  93.2%  MEM:  95.4%  GPU: 100.0%  waste: 13.0c + 87.6Gi
      #4 [4xl-x86iamx-22-225-b200, 1xl-x86iamx-88-900-b200-4]
         CPU:  93.0%  MEM:  95.3%  GPU: 100.0%  waste: 13.3c + 88.2Gi
      #5 [2xl-x86iamx-22-225-b200, 3xl-x86iamx-44-450-b200-2]
         CPU:  93.0%  MEM:  95.3%  GPU: 100.0%  waste: 13.3c + 88.2Gi

    Bottom 5 least efficient (money on the table):
      #1 [2xl-x86iamx-22-225-b200, 1xl-x86iamx-44-450-b200-2, 1xl-x86iamx-88-900-b200-4]
         CPU:  92.9%  MEM:  95.3%  GPU: 100.0%  waste: 13.7c + 88.7Gi
      #2 [4xl-x86iamx-44-450-b200-2]
         CPU:  92.9%  MEM:  95.3%  GPU: 100.0%  waste: 13.7c + 88.7Gi
      #3 [2xl-x86iamx-44-450-b200-2, 1xl-x86iamx-88-900-b200-4]
         CPU:  92.7%  MEM:  95.3%  GPU: 100.0%  waste: 14.0c + 89.2Gi
      #4 [2xl-x86iamx-88-900-b200-4]
         CPU:  92.5%  MEM:  95.3%  GPU: 100.0%  waste: 14.3c + 89.7Gi
      #5 [1xl-bx86iamx-176-1800-b200-8]
         CPU:  92.3%  MEM:  95.2%  GPU: 100.0%  waste: 14.6c + 90.2Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: r7a.48xlarge
  Total: 192 vCPU, 1536Gi advertised (1420.8Gi actual)
  Kubelet reserved: 550m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 191080m CPU (191.1 cores), 1411.6Gi RAM

  Runners targeting this node:
    - l-x86iavx512-16-128: 16320m CPU, 128.5Gi RAM (job: 16c+128.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-32-256: 32320m CPU, 256.5Gi RAM (job: 32c+256.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-48-384: 48320m CPU, 384.5Gi RAM (job: 48c+384.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-8-64: 8320m CPU, 64.5Gi RAM (job: 8c+64.0Gi, hooks: 320m+522Mi)
    - l-x86iavx512-94-768: 94320m CPU, 740.5Gi RAM (job: 94c+740.0Gi, hooks: 320m+522Mi)
    - rel-l-x86iavx512-8-64: 8320m CPU, 64.5Gi RAM (job: 8c+64.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86iavx512-16-128: 10 pods
      CPU:  85.4% (163200m / 191080m) waste: 27880m (27.9 cores)
      MEM:  91.0% (1285.1Gi / 1411.6Gi) waste: 126.5Gi
      Bottleneck: MEM
    l-x86iavx512-32-256: 5 pods
      CPU:  84.6% (161600m / 191080m) waste: 29480m (29.5 cores)
      MEM:  90.9% (1282.5Gi / 1411.6Gi) waste: 129.1Gi
      Bottleneck: CPU
    l-x86iavx512-48-384: 3 pods
      CPU:  75.9% (144960m / 191080m) waste: 46120m (46.1 cores)
      MEM:  81.7% (1153.5Gi / 1411.6Gi) waste: 258.1Gi
      Bottleneck: CPU
    l-x86iavx512-8-64: 21 pods
      CPU:  91.4% (174720m / 191080m) waste: 16360m (16.4 cores)
      MEM:  96.0% (1354.7Gi / 1411.6Gi) waste: 56.9Gi
      Bottleneck: MEM
    l-x86iavx512-94-768: 1 pods
      CPU:  49.4% (94320m / 191080m) waste: 96760m (96.8 cores)
      MEM:  52.5% (740.5Gi / 1411.6Gi) waste: 671.1Gi
      Bottleneck: MEM
    rel-l-x86iavx512-8-64: 21 pods
      CPU:  91.4% (174720m / 191080m) waste: 16360m (16.4 cores)
      MEM:  96.0% (1354.7Gi / 1411.6Gi) waste: 56.9Gi
      Bottleneck: MEM

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 572

    Top 5 most efficient:
      #1 [5xl-x86iavx512-16-128, 2xl-x86iavx512-48-384]
         CPU:  93.3%  MEM: 100.0%  waste: 12.8c + 57Mi
      #2 [4xl-x86iavx512-16-128, 2xl-x86iavx512-32-256, 1xl-x86iavx512-48-384]
         CPU:  93.3%  MEM: 100.0%  waste: 12.8c + 57Mi
      #3 [3xl-x86iavx512-16-128, 4xl-x86iavx512-32-256]
         CPU:  93.3%  MEM: 100.0%  waste: 12.8c + 57Mi
      #4 [2xl-x86iavx512-16-128, 1xl-x86iavx512-32-256, 2xl-x86iavx512-48-384, 2xl-x86iavx512-8-64]
         CPU:  93.3%  MEM: 100.0%  waste: 12.8c + 57Mi
      #5 [2xl-x86iavx512-16-128, 1xl-x86iavx512-32-256, 2xl-x86iavx512-48-384, 1xl-x86iavx512-8-64, 1xrel-l-x86iavx512-8-64]
         CPU:  93.3%  MEM: 100.0%  waste: 12.8c + 57Mi

    Bottom 5 least efficient (money on the table):
      #1 [1xl-x86iavx512-16-128, 1xl-x86iavx512-32-256, 2xl-x86iavx512-48-384, 3xrel-l-x86iavx512-8-64]
         CPU:  89.1%  MEM:  95.5%  waste: 20.8c + 64.1Gi
      #2 [3xl-x86iavx512-32-256, 1xl-x86iavx512-48-384, 3xl-x86iavx512-8-64]
         CPU:  89.1%  MEM:  95.5%  waste: 20.8c + 64.1Gi
      #3 [3xl-x86iavx512-32-256, 1xl-x86iavx512-48-384, 2xl-x86iavx512-8-64, 1xrel-l-x86iavx512-8-64]
         CPU:  89.1%  MEM:  95.5%  waste: 20.8c + 64.1Gi
      #4 [3xl-x86iavx512-32-256, 1xl-x86iavx512-48-384, 1xl-x86iavx512-8-64, 2xrel-l-x86iavx512-8-64]
         CPU:  89.1%  MEM:  95.5%  waste: 20.8c + 64.1Gi
      #5 [3xl-x86iavx512-32-256, 1xl-x86iavx512-48-384, 3xrel-l-x86iavx512-8-64]
         CPU:  89.1%  MEM:  95.5%  waste: 20.8c + 64.1Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: r7g.16xlarge
  Total: 64 vCPU, 512Gi advertised (473.7Gi actual)
  Kubelet reserved: 230m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 63400m CPU (63.4 cores), 464.5Gi RAM

  Runners targeting this node:
    - l-arm64g3-61-463: 61320m CPU, 463.5Gi RAM (job: 61c+463.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-arm64g3-61-463: 1 pods
      CPU:  96.7% (61320m / 63400m) waste: 2080m (2.1 cores)
      MEM:  99.8% (463.5Gi / 464.5Gi) waste: 1.0Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-arm64g3-61-463]
         CPU:  96.7%  MEM:  99.8%  waste: 2.1c + 1.0Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: r7i.48xlarge
  Total: 192 vCPU, 1536Gi advertised (1420.8Gi actual)
  Kubelet reserved: 550m CPU, 8.3Gi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 191080m CPU (191.1 cores), 1411.6Gi RAM

  Runners targeting this node:
    - l-x86iamx-16-128: 16320m CPU, 128.5Gi RAM (job: 16c+128.0Gi, hooks: 320m+522Mi)
    - l-x86iamx-8-64: 8320m CPU, 64.5Gi RAM (job: 8c+64.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-x86iamx-16-128: 10 pods
      CPU:  85.4% (163200m / 191080m) waste: 27880m (27.9 cores)
      MEM:  91.0% (1285.1Gi / 1411.6Gi) waste: 126.5Gi
      Bottleneck: MEM
    l-x86iamx-8-64: 21 pods
      CPU:  91.4% (174720m / 191080m) waste: 16360m (16.4 cores)
      MEM:  96.0% (1354.7Gi / 1411.6Gi) waste: 56.9Gi
      Bottleneck: MEM

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 10

    Top 5 most efficient:
      #1 [1xl-x86iamx-16-128, 19xl-x86iamx-8-64]
         CPU:  91.3%  MEM:  95.9%  waste: 16.7c + 57.4Gi
      #2 [2xl-x86iamx-16-128, 17xl-x86iamx-8-64]
         CPU:  91.1%  MEM:  95.9%  waste: 17.0c + 57.9Gi
      #3 [3xl-x86iamx-16-128, 15xl-x86iamx-8-64]
         CPU:  90.9%  MEM:  95.9%  waste: 17.3c + 58.4Gi
      #4 [4xl-x86iamx-16-128, 13xl-x86iamx-8-64]
         CPU:  90.8%  MEM:  95.8%  waste: 17.6c + 59.0Gi
      #5 [5xl-x86iamx-16-128, 11xl-x86iamx-8-64]
         CPU:  90.6%  MEM:  95.8%  waste: 18.0c + 59.5Gi

    Bottom 5 least efficient (money on the table):
      #1 [6xl-x86iamx-16-128, 9xl-x86iamx-8-64]
         CPU:  90.4%  MEM:  95.8%  waste: 18.3c + 60.0Gi
      #2 [7xl-x86iamx-16-128, 7xl-x86iamx-8-64]
         CPU:  90.3%  MEM:  95.7%  waste: 18.6c + 60.5Gi
      #3 [8xl-x86iamx-16-128, 5xl-x86iamx-8-64]
         CPU:  90.1%  MEM:  95.7%  waste: 18.9c + 61.0Gi
      #4 [9xl-x86iamx-16-128, 3xl-x86iamx-8-64]
         CPU:  89.9%  MEM:  95.6%  waste: 19.2c + 61.5Gi
      #5 [10xl-x86iamx-16-128, 1xl-x86iamx-8-64]
         CPU:  89.8%  MEM:  95.6%  waste: 19.6c + 62.0Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Node Type: t4g.2xlarge
  Total: 8 vCPU, 32Gi advertised (29.6Gi actual)
  Kubelet reserved: 90m CPU, 993Mi RAM
  DaemonSet overhead: 370m CPU, 934Mi RAM
  Allocatable for runners: 7540m CPU (7.5 cores), 27.7Gi RAM

  Runners targeting this node:
    - l-arm64g2-6-25: 6320m CPU, 25.5Gi RAM (job: 6c+25.0Gi, hooks: 320m+522Mi)

  Homogeneous packing (single runner type fills the node):
    l-arm64g2-6-25: 1 pods
      CPU:  83.8% (6320m / 7540m) waste: 1220m (1.2 cores)
      MEM:  92.0% (25.5Gi / 27.7Gi) waste: 2.2Gi
      Bottleneck: CPU

  Maximal mixed combos (node fully packed, no room for another pod):
    Total maximal combos: 1

    Top 1 most efficient:
      #1 [1xl-arm64g2-6-25]
         CPU:  83.8%  MEM:  92.0%  waste: 1.2c + 2.2Gi

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Found 13 runner type(s) with homogeneous utilization below 90.0%

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Unused resource headroom per node (homogeneous packing only):

  Node Type                 Min CPU    Max CPU    Min MEM    Max MEM
  ────────────────────────────────────────────────────────────────
  c7a.48xlarge              4480m     96760m      3.3Gi    156.5Gi
  c7i.12xlarge              1120m      5840m      530Mi      2.5Gi
  c7i.metal-24xl            3000m      3000m      936Mi      936Mi
  g4dn.12xlarge             1970m      1970m      1.0Gi      1.0Gi
  g4dn.8xlarge              2010m      2010m      712Mi      712Mi
  g4dn.metal                 850m       850m      1.2Gi      1.2Gi
  g5.12xlarge               1970m      1970m      616Mi      616Mi
  g5.48xlarge               1610m      1610m      627Mi      627Mi
  g5.8xlarge                2010m      2010m      824Mi      824Mi
  g6.12xlarge               1970m      1970m      1.0Gi      1.0Gi
  g6.8xlarge                2010m      2010m      824Mi      824Mi
  m6i.32xlarge              1920m      1920m      1.0Gi      1.0Gi
  m7g.8xlarge              15160m     15160m     52.1Gi     52.1Gi
  m7g.metal                 1080m      1080m      1.2Gi      1.2Gi
  m7i.48xlarge             16360m     29800m     18.5Gi     59.2Gi
  m8g.16xlarge              1080m      1080m      1.2Gi      1.2Gi
  m8g.48xlarge             11560m     11560m     13.6Gi     13.6Gi
  p4d.24xlarge              4610m      6850m     56.9Gi     60.4Gi
  p5.48xlarge              12370m     14610m     86.6Gi     90.2Gi
  p6-b200.48xlarge         12370m     14610m     86.6Gi     90.2Gi
  r7a.48xlarge             16360m     96760m     56.9Gi    671.1Gi
  r7g.16xlarge              2080m      2080m      1.0Gi      1.0Gi
  r7i.48xlarge             16360m     27880m     56.9Gi    126.5Gi
  t4g.2xlarge               1220m      1220m      2.2Gi      2.2Gi
  ────────────────────────────────────────────────────────────────
  WORST CASE                 850m     96760m      530Mi    671.1Gi

  The tightest node has only 850m CPU and 530Mi RAM free.
  Any new DaemonSet must fit within these limits or runners will fail to schedule.
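
The per-node figures above follow from simple arithmetic, sketched here for p4d.24xlarge. This is not the analyzer's source. It assumes allocatable CPU is total vCPU minus kubelet reserved minus DaemonSet overhead, that the pod count is the smallest per-resource quotient, and that ties in the bottleneck resolve to CPU first; those assumptions reproduce the reported numbers. `Resources`, `homogeneous_pack`, and `cpu_headroom` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_m: int      # CPU in millicores
    mem_mi: float   # memory in MiB
    gpu: int = 0    # whole GPUs


def homogeneous_pack(alloc, pod):
    """Return (pods, bottleneck, cpu_util_pct, cpu_waste_m) for one runner type."""
    limits = {
        "CPU": alloc.cpu_m // pod.cpu_m,
        "MEM": int(alloc.mem_mi // pod.mem_mi),
    }
    if pod.gpu:
        limits["GPU"] = alloc.gpu // pod.gpu
    pods = min(limits.values())
    # Assumption: on a tie the first resource listed wins (CPU, then MEM, then GPU).
    bottleneck = min(limits, key=limits.get)
    used_m = pods * pod.cpu_m
    return pods, bottleneck, round(100 * used_m / alloc.cpu_m, 1), alloc.cpu_m - used_m


def cpu_headroom(alloc, runners):
    """Min and max unused CPU (millicores) across homogeneous packings."""
    wastes = [homogeneous_pack(alloc, pod)[3] for pod in runners.values()]
    return min(wastes), max(wastes)


# 96 vCPU minus kubelet reserved (310m) and DaemonSet overhead (520m) = 95170m.
# Memory uses the reported allocatable value of 1060.9Gi.
P4D = Resources(cpu_m=96_000 - 310 - 520, mem_mi=1060.9 * 1024, gpu=8)

# Requests are job + hooks (320m CPU, 522Mi RAM), as listed in the report.
P4D_RUNNERS = {
    "l-bx86iavx512-88-1000-a100-8": Resources(88320, 1000 * 1024 + 522, 8),
    "l-x86iavx512-11-125-a100": Resources(11320, 125 * 1024 + 522, 1),
    "l-x86iavx512-22-250-a100-2": Resources(22320, 250 * 1024 + 522, 2),
    "l-x86iavx512-44-500-a100-4": Resources(44320, 500 * 1024 + 522, 4),
}

if __name__ == "__main__":
    for name, pod in P4D_RUNNERS.items():
        print(name, homogeneous_pack(P4D, pod))
    print("CPU headroom (min, max):", cpu_headroom(P4D, P4D_RUNNERS))
```

For the 1-GPU runner this gives 8 pods at 95.2% CPU with 4610m unused, and the min/max across the four runner types is 4610m and 6850m, matching the p4d.24xlarge row of the headroom table. Any new DaemonSet request on p4d (for example the NFD topology-updater) would come out of that margin.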

github-actions · 2026-06-16T22:12:52Z

tofu plan — arc-cbr-production

✅ Plan succeeded · commit 27ebdfd0 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::ciforge-tfstate-arc-cbr-prod",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (arc-cbr-production) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.aws_iam_role.cluster: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3]
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.eks.data.aws_caller_identity.current: Reading...
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=527854a4-e335-4f95-bc89-1321cff7a478]
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0e712dc7e743bbcf7]
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.node: Refreshing state... [id=pytorch-arc-cbr-production-node-role]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=308535385114]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAUPVRELQNOLQFN6MU]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=pytorch-arc-cbr-production-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=pytorch-arc-cbr-production-node-role:pytorch-arc-cbr-production-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=pytorch-arc-cbr-production-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/pytorch-arc-cbr-production-eks-secrets]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-2]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=pytorch-arc-cbr-production-harbor-s3/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-05e96ee7cb818e5c0]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-032d4401e63f0c9b9]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-009f1fe7d56695348]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0ab11fcdb8d4ea113]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0d26e280575e8aaf4]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-0a583bbbcac436ebd]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0d34063a19f4b07b4]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-01e479dcb5aedf696]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0fddf2f74e7e978c7]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-01187bfaa68514400]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0709abbcafa23aec0]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0577a02acde719bff]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0992f582e9bf2836e]
module.vpc.aws_eip.nat_secondary["us-east-2b-6"]: Refreshing state... [id=eipalloc-06b7b88826199a232]
module.vpc.aws_eip.nat_secondary["us-east-2c-4"]: Refreshing state... [id=eipalloc-0cc3dadec18bbb3f3]
module.vpc.aws_eip.nat_secondary["us-east-2c-5"]: Refreshing state... [id=eipalloc-02825435a2786b3d8]
module.vpc.aws_eip.nat_secondary["us-east-2c-6"]: Refreshing state... [id=eipalloc-0aede78edc69cf695]
module.vpc.aws_eip.nat_secondary["us-east-2c-3"]: Refreshing state... [id=eipalloc-0d3a71569b2f687be]
module.vpc.aws_eip.nat_secondary["us-east-2a-0"]: Refreshing state... [id=eipalloc-086a011b3c26c0dd7]
module.vpc.aws_eip.nat_secondary["us-east-2a-5"]: Refreshing state... [id=eipalloc-0bd9bf54bd6010323]
module.vpc.aws_eip.nat_secondary["us-east-2b-1"]: Refreshing state... [id=eipalloc-0e67c0a8cd8c990da]
module.vpc.aws_eip.nat_secondary["us-east-2c-1"]: Refreshing state... [id=eipalloc-06a980076e99cda81]
module.vpc.aws_eip.nat_secondary["us-east-2b-5"]: Refreshing state... [id=eipalloc-0cde9a6463901f1e1]
module.vpc.aws_eip.nat_secondary["us-east-2c-0"]: Refreshing state... [id=eipalloc-03542e74755fc105b]
module.vpc.aws_eip.nat_secondary["us-east-2a-2"]: Refreshing state... [id=eipalloc-09b15a770e0c6d552]
module.vpc.aws_eip.nat_secondary["us-east-2b-0"]: Refreshing state... [id=eipalloc-0cead990d60ce181e]
module.vpc.aws_eip.nat_secondary["us-east-2a-6"]: Refreshing state... [id=eipalloc-0113c95dbdec2f879]
module.vpc.aws_eip.nat_secondary["us-east-2a-1"]: Refreshing state... [id=eipalloc-0f2b00a9ac31df215]
module.vpc.aws_eip.nat_secondary["us-east-2c-2"]: Refreshing state... [id=eipalloc-07cfdb2fd5dc07459]
module.vpc.aws_eip.nat_secondary["us-east-2b-3"]: Refreshing state... [id=eipalloc-021ee6c9f1d20b71a]
module.vpc.aws_eip.nat_secondary["us-east-2b-4"]: Refreshing state... [id=eipalloc-0de33181548ac2e5a]
module.vpc.aws_eip.nat_secondary["us-east-2b-2"]: Refreshing state... [id=eipalloc-063bee447616351f9]
module.vpc.aws_eip.nat_secondary["us-east-2a-4"]: Refreshing state... [id=eipalloc-067d535102a61d1a8]
module.vpc.aws_eip.nat_secondary["us-east-2a-3"]: Refreshing state... [id=eipalloc-034d5e1f5a2fcb795]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-084975a7f7af2696e]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-07d5cd4c479c827ab]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-0ce4fba002d90e7d5]
module.eks.aws_eks_cluster.this: Refreshing state... [id=pytorch-arc-cbr-production]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0ad75b2f5282877db]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-08e264cbbd47be1ee]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-0f7b8f4473e5790df]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0cb3785c433ed7718]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0c7ecd4166a01e5f0]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-01d38d41a7ca82a08]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=pytorch-arc-cbr-production:vpc-cni]
module.eks.aws_eks_access_entry.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production:arn:aws:iam::308535385114:role/osdc_gha_prod]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=pytorch-arc-cbr-production:kube-proxy]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-0b820cd15307b6d57]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-0b6e08b4b0dc968c0]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-097abe4676c74f71b]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0beb143017359bda1]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::308535385114:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/0A621339248958D6D5F2FF084BD185B5]
module.eks.aws_eks_node_group.base: Refreshing state... [id=pytorch-arc-cbr-production:pytorch-arc-cbr-production-base-nodes]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=2879363015]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=pytorch-arc-cbr-production-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=pytorch-arc-cbr-production-harbor-registry/arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-harbor-registry]
module.eks.aws_eks_access_policy_association.cluster_admin["osdc_gha_prod"]: Refreshing state... [id=pytorch-arc-cbr-production#arn:aws:iam::308535385114:role/osdc_gha_prod#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=pytorch-arc-cbr-production:coredns]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-ebs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/308535385114/pytorch-arc-cbr-production-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-01ec5f742ae028981,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0709abbcafa23aec0"]: Refreshing state... [id=subnet-0709abbcafa23aec0,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0992f582e9bf2836e"]: Refreshing state... [id=subnet-0992f582e9bf2836e,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0577a02acde719bff"]: Refreshing state... [id=subnet-0577a02acde719bff,karpenter.sh/discovery]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=pytorch-arc-cbr-production-karpenter-controller-20260518021844404100000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (arc-cbr-production) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wheel-syncer-s3]
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::308535385114:policy/pytorch-arc-cbr-production-pypi-wants-collector-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-0deb818bbf18764de]
data.terraform_remote_state.base: Read complete after 2s
aws_iam_role.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role]
aws_iam_role.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role]
aws_security_group.efs: Refreshing state... [id=sg-0979eb5e3d9d3db9f]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production-efs-csi-driver-role-20260518023249955700000005]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wheel-syncer-role-20260518023249929400000004]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=pytorch-arc-cbr-production-pypi-wants-collector-role-20260518023249903900000003]
aws_efs_mount_target.pypi_cache["subnet-0577a02acde719bff"]: Refreshing state... [id=fsmt-07d7b111b9cd6684e]
aws_efs_mount_target.pypi_cache["subnet-0709abbcafa23aec0"]: Refreshing state... [id=fsmt-08cd5108febbacef9]
aws_efs_mount_target.pypi_cache["subnet-0992f582e9bf2836e"]: Refreshing state... [id=fsmt-03523586bb4ff0c46]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=pytorch-arc-cbr-production:aws-efs-csi-driver]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

Temporary test configuration: DO NOT MERGE. Self-contained sibling of the g4dn.metal [TEST-ONLY] commit (now archived on numa-aware-scheduling-g4dn): the same NUMA pipeline on REAL A100 hardware (p4d.24xlarge) in the new meta-staging-aws-ue1 (us-east-1) staging cluster, the production-class target. Reuses the existing p4d fleet + A100 runner defs (the p4d fleet is already single-numa-node), so no new fleet or runner defs are needed.

Prerequisite this unblocks: confirm whether A100/p4d actually PUBLISHES per-GPU NUMA topology in the NodeResourceTopology. On g4dn.metal it did NOT (zones exposed CPU only, so the numa-scheduler could never align a GPU pod). After confirming, vary the scheduler and over-request beyond one NUMA zone (edit the 1-GPU def), mirroring the g4dn A/B.

- Add nfd + numa-scheduler to meta-staging-aws-ue1; remove from the two prod clusters so the test pipeline is isolated to staging (mirrors the g4dn commit).
- Repoint NFD topology-updater + taint-remover + the nfd-topology startup-taint gate from p5 to p4d. Only ue1 runs nfd now, so a single fleet target; no affinity needed.
- Add scheduler_name: numa-scheduler to the existing A100 1-GPU and 4-GPU runner defs (the 4-GPU is the real scenario, parallel to the H100 4-GPU; the 1-GPU is the A/B knob).

p4d.24xlarge = 2 sockets x 4 A100 40GB (2 NUMA x 4 GPU). cpuManagerPolicy=static and Guaranteed-QoS runner pods already apply, so CPU+GPU NUMA alignment needs no workload changes. ue1 runners carry the c-mt- staging prefix + meta-staging-aws-ue1 runner group, so there is no overlap with prod (mt-).

Deploy (ue1 only):
just deploy-module meta-staging-aws-ue1 nfd
just deploy-module meta-staging-aws-ue1 numa-scheduler
just deploy-module meta-staging-aws-ue1 nodepools
just deploy-module meta-staging-aws-ue1 arc-runners

Then queue a canary job and inspect the NRT for per-zone nvidia.com/gpu BEFORE varying.

Cleanup: drop this commit (git reset --hard HEAD~1) + teardown nfd/numa-scheduler on ue1.

ghstack-source-id: 1473ca5
Pull-Request: #778
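
The "inspect the NRT for per-zone nvidia.com/gpu" step could be scripted roughly as below. This is a minimal sketch, not part of this PR: it assumes the NodeResourceTopology object has already been dumped as JSON and follows the `zones[].resources[]` layout (`name`, `allocatable`) of the topology.node.k8s.io CRD. The helper names and both sample objects are hypothetical; the samples are illustrative shapes only, not output from any cluster.

```python
from typing import Any, Dict

GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_zone(nrt: Dict[str, Any], resource: str = GPU_RESOURCE) -> Dict[str, int]:
    """Map each NUMA zone name to its allocatable count of `resource` (0 if absent)."""
    counts: Dict[str, int] = {}
    for zone in nrt.get("zones", []):
        amount = 0
        for res in zone.get("resources", []):
            if res.get("name") == resource:
                # Quantities are serialized as strings such as "4".
                amount = int(res.get("allocatable", 0))
        counts[zone.get("name", "?")] = amount
    return counts


def publishes_gpu_topology(nrt: Dict[str, Any]) -> bool:
    """True only if every zone in the NRT reports at least one GPU."""
    counts = gpus_per_zone(nrt)
    return bool(counts) and all(n > 0 for n in counts.values())


def fits_one_zone(nrt: Dict[str, Any], requested: int) -> bool:
    """True if a single zone can satisfy the whole GPU request."""
    return any(n >= requested for n in gpus_per_zone(nrt).values())


# Hypothetical sample: the shape hoped for on p4d (2 NUMA zones x 4 GPUs).
SAMPLE_GPU_ZONES = {
    "zones": [
        {"name": "node-0", "type": "Node", "resources": [
            {"name": "cpu", "allocatable": "46"},
            {"name": "nvidia.com/gpu", "allocatable": "4"},
        ]},
        {"name": "node-1", "type": "Node", "resources": [
            {"name": "cpu", "allocatable": "48"},
            {"name": "nvidia.com/gpu", "allocatable": "4"},
        ]},
    ]
}

# Hypothetical sample: CPU-only zones, the failure mode described for g4dn.metal.
SAMPLE_CPU_ONLY = {
    "zones": [
        {"name": "node-0", "type": "Node", "resources": [{"name": "cpu", "allocatable": "46"}]},
        {"name": "node-1", "type": "Node", "resources": [{"name": "cpu", "allocatable": "48"}]},
    ]
}

if __name__ == "__main__":
    print(publishes_gpu_topology(SAMPLE_GPU_ZONES))  # True
    print(publishes_gpu_topology(SAMPLE_CPU_ONLY))   # False
    print(fits_one_zone(SAMPLE_GPU_ZONES, 4))        # True
    print(fits_one_zone(SAMPLE_GPU_ZONES, 5))        # False: the over-request case
```

With zones shaped like the first sample, a 4-GPU request fits in one zone while a 5-GPU request does not, which is the over-request condition the A/B step is meant to exercise.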

github-actions · 2026-06-16T22:16:16Z

tofu plan — lf-prod-aws-ue1

✅ Plan succeeded · commit 27ebdfd0 · run log

Plan output

Installed 1 package in 1ms
{
    "BucketArn": "arn:aws:s3:::lf-osdc-tfstate-prod-ue1",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (lf-prod-aws-ue1) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.data.aws_ami.eks_optimized_al2023: Reading...
data.aws_availability_zones.available: Reading...
module.eks.data.aws_caller_identity.current: Reading...
module.eks.aws_iam_role.node: Refreshing state... [id=lf-prod-aws-ue1-node-role]
module.eks.aws_iam_role.cluster: Refreshing state... [id=lf-prod-aws-ue1-cluster-role]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=lf-prod-aws-ue1-harbor-s3]
module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=e5e45db6-94ad-4dfd-8a1a-213730256a9c]
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-06f350eae88f37700]
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=lf-prod-aws-ue1-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=391835788720]
data.aws_availability_zones.available: Read complete after 0s [id=us-east-1]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAVWOZ3UWYJZNKMI7G]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/lf-prod-aws-ue1-eks-secrets]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=lf-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=lf-prod-aws-ue1-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=lf-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=lf-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=lf-prod-aws-ue1-node-role:lf-prod-aws-ue1-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=lf-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=lf-prod-aws-ue1-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 0s [id=ami-0dafeb02304897431]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=lf-prod-aws-ue1-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=lf-prod-aws-ue1-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue1-harbor-registry]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=lf-prod-aws-ue1-harbor-s3/arn:aws:iam::391835788720:policy/lf-prod-aws-ue1-harbor-registry]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-03548aa6de237de4c]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-089c5123e6da8d43c]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-051217c40b1d02b3a]
module.vpc.aws_eip.nat_secondary["us-east-1c-5"]: Refreshing state... [id=eipalloc-07612f4e715e508ae]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-01ca3df6137b445c0]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-07234379e7833a398]
module.vpc.aws_eip.nat_secondary["us-east-1b-0"]: Refreshing state... [id=eipalloc-0533e7098b8548fda]
module.vpc.aws_eip.nat_secondary["us-east-1a-6"]: Refreshing state... [id=eipalloc-046fa83874bde66b5]
module.vpc.aws_eip.nat_secondary["us-east-1a-2"]: Refreshing state... [id=eipalloc-0625b097098b1ac2a]
module.vpc.aws_eip.nat_secondary["us-east-1c-2"]: Refreshing state... [id=eipalloc-0a4874208e55dfb7b]
module.vpc.aws_eip.nat_secondary["us-east-1c-1"]: Refreshing state... [id=eipalloc-01b2275a4f494fe58]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0fa332056910f46b2]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-0f16b1a5ecd405405]
module.vpc.aws_eip.nat_secondary["us-east-1b-3"]: Refreshing state... [id=eipalloc-042e254711d0d3dda]
module.vpc.aws_eip.nat_secondary["us-east-1b-6"]: Refreshing state... [id=eipalloc-085e43aacdd3b5c5f]
module.vpc.aws_eip.nat_secondary["us-east-1a-4"]: Refreshing state... [id=eipalloc-0581e7f2f8194266e]
module.vpc.aws_eip.nat_secondary["us-east-1b-2"]: Refreshing state... [id=eipalloc-0c66936151cceca74]
module.vpc.aws_eip.nat_secondary["us-east-1c-4"]: Refreshing state... [id=eipalloc-0db3b4c44cbf47d5a]
module.vpc.aws_eip.nat_secondary["us-east-1c-6"]: Refreshing state... [id=eipalloc-088384cdd02d04bce]
module.vpc.aws_eip.nat_secondary["us-east-1b-5"]: Refreshing state... [id=eipalloc-00cd91c376a1f197d]
module.vpc.aws_eip.nat_secondary["us-east-1b-4"]: Refreshing state... [id=eipalloc-0bd28bd3991297f4a]
module.vpc.aws_eip.nat_secondary["us-east-1a-1"]: Refreshing state... [id=eipalloc-034348675ffacd849]
module.vpc.aws_eip.nat_secondary["us-east-1a-3"]: Refreshing state... [id=eipalloc-01a2c9bf10e45099b]
module.vpc.aws_eip.nat_secondary["us-east-1a-5"]: Refreshing state... [id=eipalloc-0936080d9155a0306]
module.vpc.aws_eip.nat_secondary["us-east-1c-0"]: Refreshing state... [id=eipalloc-051a8792e8dad6c5b]
module.vpc.aws_eip.nat_secondary["us-east-1b-1"]: Refreshing state... [id=eipalloc-0f13b0cd68a133531]
module.vpc.aws_eip.nat_secondary["us-east-1c-3"]: Refreshing state... [id=eipalloc-03ec1105bba33668b]
module.vpc.aws_eip.nat_secondary["us-east-1a-0"]: Refreshing state... [id=eipalloc-03131d115b478f7c3]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-0fe9cf01a4661b360]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-087f338c446cffe5d]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0afe958a38da9f46c]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-06e680510bc45584b]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0f7184dc74425b3ca]
module.eks.aws_eks_cluster.this: Refreshing state... [id=lf-prod-aws-ue1]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-025956fd021d43094]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-025d5eab2da94f8e6]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0a6049a78a7428383]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-0cf871626838e4133]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-0d0e9d964d1cd8a9e]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-09c6f66e50dce1835]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-02536bbe724eaaa2f]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-0264db606b7f24bb6]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-027811d9ba750a284]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-0c6e77db648d5279c]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-009170e5cc902aa3e]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-00cd583d71292870b]
module.eks.aws_eks_access_entry.cluster_admin["lf_osdc_admin"]: Refreshing state... [id=lf-prod-aws-ue1:arn:aws:iam::391835788720:role/lf_osdc_admin]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=lf-prod-aws-ue1:vpc-cni]
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=lf-prod-aws-ue1:kube-proxy]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_launch_template.base: Refreshing state... [id=lt-0e3be05a985acc61c]
module.eks.aws_eks_node_group.base: Refreshing state... [id=lf-prod-aws-ue1:lf-prod-aws-ue1-base-nodes]
module.eks.data.tls_certificate.cluster[0]: Read complete after 1s [id=b1b539daa206035ae3c3e28288b0681fa1b462f3]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::391835788720:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/E8EF4A6C55DB9699E53A54DA444C21A3]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=lf-prod-aws-ue1-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=717515682]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=lf-prod-aws-ue1-ebs-csi-driver-role]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=lf-prod-aws-ue1:coredns]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=lf-prod-aws-ue1-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=lf-prod-aws-ue1-harbor-registry/arn:aws:iam::391835788720:policy/lf-prod-aws-ue1-harbor-registry]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=lf-prod-aws-ue1:aws-ebs-csi-driver]
module.eks.aws_eks_access_policy_association.cluster_admin["lf_osdc_admin"]: Refreshing state... [id=lf-prod-aws-ue1#arn:aws:iam::391835788720:role/lf_osdc_admin#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (lf-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=lf-prod-aws-ue1-karpenter-spot-interruption]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=lf-prod-aws-ue1-karpenter-scheduled-change]
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=lf-prod-aws-ue1-karpenter-instance-state-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=lf-prod-aws-ue1-karpenter-rebalance]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/391835788720/lf-prod-aws-ue1-karpenter]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/391835788720/lf-prod-aws-ue1-karpenter]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=lf-prod-aws-ue1-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=lf-prod-aws-ue1-karpenter-instance-state-change-KarpenterInstanceStateChange]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=lf-prod-aws-ue1-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=lf-prod-aws-ue1-karpenter-scheduled-change-KarpenterScheduledChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-07a769a5e8a93a444,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=lf-prod-aws-ue1-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue1-karpenter-controller]
aws_ec2_tag.subnet_karpenter_discovery["subnet-07234379e7833a398"]: Refreshing state... [id=subnet-07234379e7833a398,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0fa332056910f46b2"]: Refreshing state... [id=subnet-0fa332056910f46b2,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-01ca3df6137b445c0"]: Refreshing state... [id=subnet-01ca3df6137b445c0,karpenter.sh/discovery]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=lf-prod-aws-ue1-karpenter-controller-20260605165913470400000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (lf-prod-aws-ue1) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue1-pypi-wants-collector-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-080ed1b100680046a]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue1-pypi-wheel-syncer-s3]
data.terraform_remote_state.base: Read complete after 1s
aws_iam_role.wheel_syncer: Refreshing state... [id=lf-prod-aws-ue1-pypi-wheel-syncer-role]
aws_iam_role.efs_csi_driver: Refreshing state... [id=lf-prod-aws-ue1-efs-csi-driver-role]
aws_iam_role.wants_collector: Refreshing state... [id=lf-prod-aws-ue1-pypi-wants-collector-role]
aws_security_group.efs: Refreshing state... [id=sg-043a534e51b4cf754]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=lf-prod-aws-ue1-efs-csi-driver-role-20260605174406081600000005]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=lf-prod-aws-ue1-pypi-wheel-syncer-role-20260605174406029100000004]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=lf-prod-aws-ue1:aws-efs-csi-driver]
aws_efs_mount_target.pypi_cache["subnet-0fa332056910f46b2"]: Refreshing state... [id=fsmt-0855738bbe8ec9699]
aws_efs_mount_target.pypi_cache["subnet-01ca3df6137b445c0"]: Refreshing state... [id=fsmt-015ba05b2befeb7f3]
aws_efs_mount_target.pypi_cache["subnet-07234379e7833a398"]: Refreshing state... [id=fsmt-0a587f0a4f8a480eb]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=lf-prod-aws-ue1-pypi-wants-collector-role-20260605174405984800000003]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

github-actions · 2026-06-16T22:17:39Z

tofu plan — lf-prod-aws-ue2

✅ Plan succeeded · commit 27ebdfd0 · run log

Plan output

Installed 1 package in 2ms
{
    "BucketArn": "arn:aws:s3:::lf-osdc-tfstate-prod-ue2",
    "BucketRegion": "us-west-2",
    "AccessPointAlias": false
}
━━━ PLAN: Base (lf-prod-aws-ue2) ━━━
There are some problems with the CLI configuration:
╷
│ Error: The specified plugin cache dir /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache cannot be opened: stat /home/runner/work/ci-infra/ci-infra/osdc/.terraform.d/plugin-cache: no such file or directory
│
╵

As a result of the above problems, OpenTofu may not behave as intended.


module.eks.aws_kms_key.eks_secrets[0]: Refreshing state... [id=27a9b8e9-2509-43ce-ac8e-cfc320b65fe2]
data.aws_availability_zones.available: Reading...
module.eks.aws_iam_role.node: Refreshing state... [id=lf-prod-aws-ue2-node-role]
module.harbor.aws_iam_user.harbor_s3: Refreshing state... [id=lf-prod-aws-ue2-harbor-s3]
module.eks.aws_iam_role.cluster: Refreshing state... [id=lf-prod-aws-ue2-cluster-role]
module.vpc.aws_vpc.this: Refreshing state... [id=vpc-0f7d54e3accfbe3e4]
module.eks.data.aws_caller_identity.current: Reading...
module.eks.data.aws_ami.eks_optimized_al2023: Reading...
module.harbor.aws_s3_bucket.harbor_registry: Refreshing state... [id=lf-prod-aws-ue2-harbor-registry]
module.eks.data.aws_caller_identity.current: Read complete after 0s [id=391835788720]
module.harbor.aws_iam_access_key.harbor_s3: Refreshing state... [id=AKIAVWOZ3UWYMGG4LIHB]
module.eks.aws_iam_role_policy_attachment.vpc_resource_controller: Refreshing state... [id=lf-prod-aws-ue2-cluster-role/arn:aws:iam::aws:policy/AmazonEKSVPCResourceController]
module.eks.aws_iam_role_policy_attachment.cluster_policy: Refreshing state... [id=lf-prod-aws-ue2-cluster-role/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
data.aws_availability_zones.available: Read complete after 1s [id=us-east-2]
module.eks.aws_iam_role_policy_attachment.cni_policy: Refreshing state... [id=lf-prod-aws-ue2-node-role/arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy]
module.eks.aws_iam_role_policy_attachment.ssm_policy: Refreshing state... [id=lf-prod-aws-ue2-node-role/arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore]
module.eks.aws_iam_role_policy.node_cni_ipv6: Refreshing state... [id=lf-prod-aws-ue2-node-role:lf-prod-aws-ue2-node-cni-ipv6]
module.eks.aws_iam_role_policy_attachment.node_policy: Refreshing state... [id=lf-prod-aws-ue2-node-role/arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy]
module.eks.aws_iam_role_policy_attachment.ecr_policy: Refreshing state... [id=lf-prod-aws-ue2-node-role/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly]
module.eks.aws_kms_alias.eks_secrets[0]: Refreshing state... [id=alias/lf-prod-aws-ue2-eks-secrets]
module.eks.data.aws_ami.eks_optimized_al2023: Read complete after 1s [id=ami-009f1fe7d56695348]
module.harbor.aws_s3_bucket_public_access_block.harbor_registry: Refreshing state... [id=lf-prod-aws-ue2-harbor-registry]
module.harbor.aws_s3_bucket_server_side_encryption_configuration.harbor_registry: Refreshing state... [id=lf-prod-aws-ue2-harbor-registry]
module.harbor.aws_iam_policy.harbor_registry: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue2-harbor-registry]
module.vpc.aws_egress_only_internet_gateway.this: Refreshing state... [id=eigw-061f8f7ac8b40d720]
module.vpc.aws_internet_gateway.this: Refreshing state... [id=igw-042c4d31ed557eaa4]
module.harbor.aws_iam_user_policy_attachment.harbor_s3: Refreshing state... [id=lf-prod-aws-ue2-harbor-s3/arn:aws:iam::391835788720:policy/lf-prod-aws-ue2-harbor-registry]
module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0e53846501278171e]
module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-016d460df617c0e2c]
module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-080bfdf02da937445]
module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-0ae8d251d3a0336ca]
module.vpc.aws_eip.nat_secondary["us-east-2c-3"]: Refreshing state... [id=eipalloc-04d97b3aec8f5fb8a]
module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0515848329e5dc53a]
module.vpc.aws_eip.nat_secondary["us-east-2a-5"]: Refreshing state... [id=eipalloc-095865342b4c692ac]
module.vpc.aws_eip.nat_secondary["us-east-2a-1"]: Refreshing state... [id=eipalloc-0a90e8e5b75a3fe45]
module.vpc.aws_eip.nat_secondary["us-east-2c-5"]: Refreshing state... [id=eipalloc-06c020042f283554a]
module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-06a9b2e4ea40968b6]
module.vpc.aws_eip.nat_secondary["us-east-2a-3"]: Refreshing state... [id=eipalloc-0737f1fdf35a0f975]
module.vpc.aws_eip.nat_secondary["us-east-2c-2"]: Refreshing state... [id=eipalloc-08e66df79eddc18b5]
module.vpc.aws_eip.nat_secondary["us-east-2c-1"]: Refreshing state... [id=eipalloc-0241d507f34cdb0b5]
module.vpc.aws_eip.nat_secondary["us-east-2b-2"]: Refreshing state... [id=eipalloc-0403ed9359182b72c]
module.vpc.aws_eip.nat_secondary["us-east-2c-6"]: Refreshing state... [id=eipalloc-057df768d859ed17e]
module.vpc.aws_eip.nat_secondary["us-east-2c-4"]: Refreshing state... [id=eipalloc-005e21ac878c4db34]
module.vpc.aws_eip.nat_secondary["us-east-2a-6"]: Refreshing state... [id=eipalloc-077cbc910a56d08fd]
module.vpc.aws_eip.nat_secondary["us-east-2a-0"]: Refreshing state... [id=eipalloc-0a9078e90b80cc1de]
module.vpc.aws_eip.nat_secondary["us-east-2b-0"]: Refreshing state... [id=eipalloc-055182abe5c634ddc]
module.vpc.aws_eip.nat_secondary["us-east-2b-4"]: Refreshing state... [id=eipalloc-08683a31d5967bff6]
module.vpc.aws_eip.nat_secondary["us-east-2a-2"]: Refreshing state... [id=eipalloc-0e53d306d25151b0e]
module.vpc.aws_eip.nat_secondary["us-east-2b-3"]: Refreshing state... [id=eipalloc-0b95441aa4e161db2]
module.vpc.aws_eip.nat_secondary["us-east-2b-6"]: Refreshing state... [id=eipalloc-09c38605941dbbaac]
module.vpc.aws_eip.nat_secondary["us-east-2c-0"]: Refreshing state... [id=eipalloc-09bd4b74b1a8ca6ac]
module.vpc.aws_eip.nat_secondary["us-east-2b-5"]: Refreshing state... [id=eipalloc-06f0755f7542d77fa]
module.vpc.aws_eip.nat_secondary["us-east-2a-4"]: Refreshing state... [id=eipalloc-0bd8a5e170892bb0b]
module.vpc.aws_route_table.public: Refreshing state... [id=rtb-0508ab6e3db7ccf08]
module.vpc.aws_eip.nat_secondary["us-east-2b-1"]: Refreshing state... [id=eipalloc-0c8d74e3dcfb2dad0]
module.vpc.aws_eip.nat[2]: Refreshing state... [id=eipalloc-079ed57d9de06fd9b]
module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-0e0efd2a8ef20d72e]
module.vpc.aws_eip.nat[1]: Refreshing state... [id=eipalloc-08c041a7cb9147705]
module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-005f847cdca1f2143]
module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-0d0f31615161dab0f]
module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-028a6f03785f6bca2]
module.eks.aws_eks_cluster.this: Refreshing state... [id=lf-prod-aws-ue2]
module.eks.data.tls_certificate.cluster[0]: Reading...
module.eks.aws_eks_addon.kube_proxy: Refreshing state... [id=lf-prod-aws-ue2:kube-proxy]
module.eks.aws_eks_access_entry.cluster_admin["lf_osdc_admin"]: Refreshing state... [id=lf-prod-aws-ue2:arn:aws:iam::391835788720:role/lf_osdc_admin]
module.eks.aws_eks_addon.vpc_cni: Refreshing state... [id=lf-prod-aws-ue2:vpc-cni]
module.eks.aws_launch_template.base: Refreshing state... [id=lt-062d0b42e1b1ca1af]
module.eks.data.tls_certificate.cluster[0]: Read complete after 0s [id=033a163afb2babc26f7883e642621ac361c93d61]
module.eks.aws_iam_openid_connect_provider.cluster[0]: Refreshing state... [id=arn:aws:iam::391835788720:oidc-provider/oidc.eks.us-east-2.amazonaws.com/id/43EEAC690CC76E15781134A4FC06EDCE]
module.eks.aws_eks_node_group.base: Refreshing state... [id=lf-prod-aws-ue2:lf-prod-aws-ue2-base-nodes]
module.harbor.aws_iam_role.harbor_registry: Refreshing state... [id=lf-prod-aws-ue2-harbor-registry]
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Reading...
module.eks.data.aws_iam_policy_document.ebs_csi_assume_role[0]: Read complete after 0s [id=796338164]
module.eks.aws_iam_role.ebs_csi_driver[0]: Refreshing state... [id=lf-prod-aws-ue2-ebs-csi-driver-role]
module.eks.aws_iam_role_policy_attachment.ebs_csi_driver[0]: Refreshing state... [id=lf-prod-aws-ue2-ebs-csi-driver-role/arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy]
module.harbor.aws_iam_role_policy_attachment.harbor_registry: Refreshing state... [id=lf-prod-aws-ue2-harbor-registry/arn:aws:iam::391835788720:policy/lf-prod-aws-ue2-harbor-registry]
module.eks.aws_eks_addon.coredns: Refreshing state... [id=lf-prod-aws-ue2:coredns]
module.eks.aws_eks_addon.ebs_csi_driver: Refreshing state... [id=lf-prod-aws-ue2:aws-ebs-csi-driver]
module.eks.aws_eks_access_policy_association.cluster_admin["lf_osdc_admin"]: Refreshing state... [id=lf-prod-aws-ue2#arn:aws:iam::391835788720:role/lf_osdc_admin#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.vpc.aws_nat_gateway.this[2]: Refreshing state... [id=nat-0caff297f1b93f0c7]
module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-095cd56cd812b4931]
module.vpc.aws_nat_gateway.this[1]: Refreshing state... [id=nat-0288194135c91a55d]
module.vpc.aws_route_table.private[2]: Refreshing state... [id=rtb-0d7230758d05b4f20]
module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0ce64842bfadf32b0]
module.vpc.aws_route_table.private[1]: Refreshing state... [id=rtb-0d0497dd1d2a111f5]
module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-0feb6707491379e22]
module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-05de05c204a439484]
module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0cbaf74e1bd57a865]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module karpenter (lf-prod-aws-ue2) ━━━
data.terraform_remote_state.base: Reading...
aws_cloudwatch_event_rule.instance_state_change: Refreshing state... [id=lf-prod-aws-ue2-karpenter-instance-state-change]
aws_cloudwatch_event_rule.scheduled_change: Refreshing state... [id=lf-prod-aws-ue2-karpenter-scheduled-change]
aws_cloudwatch_event_rule.rebalance: Refreshing state... [id=lf-prod-aws-ue2-karpenter-rebalance]
aws_cloudwatch_event_rule.spot_interruption: Refreshing state... [id=lf-prod-aws-ue2-karpenter-spot-interruption]
aws_sqs_queue.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/391835788720/lf-prod-aws-ue2-karpenter]
aws_cloudwatch_event_target.spot_interruption: Refreshing state... [id=lf-prod-aws-ue2-karpenter-spot-interruption-KarpenterSpotInterruption]
aws_cloudwatch_event_target.rebalance: Refreshing state... [id=lf-prod-aws-ue2-karpenter-rebalance-KarpenterRebalance]
aws_cloudwatch_event_target.scheduled_change: Refreshing state... [id=lf-prod-aws-ue2-karpenter-scheduled-change-KarpenterScheduledChange]
aws_sqs_queue_policy.karpenter: Refreshing state... [id=https://sqs.us-east-2.amazonaws.com/391835788720/lf-prod-aws-ue2-karpenter]
aws_cloudwatch_event_target.instance_state_change: Refreshing state... [id=lf-prod-aws-ue2-karpenter-instance-state-change-KarpenterInstanceStateChange]
data.terraform_remote_state.base: Read complete after 1s
aws_ec2_tag.subnet_karpenter_discovery["subnet-06a9b2e4ea40968b6"]: Refreshing state... [id=subnet-06a9b2e4ea40968b6,karpenter.sh/discovery]
aws_ec2_tag.cluster_sg_karpenter: Refreshing state... [id=sg-06c1f2ed8ffb1ddfa,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0ae8d251d3a0336ca"]: Refreshing state... [id=subnet-0ae8d251d3a0336ca,karpenter.sh/discovery]
aws_ec2_tag.subnet_karpenter_discovery["subnet-0515848329e5dc53a"]: Refreshing state... [id=subnet-0515848329e5dc53a,karpenter.sh/discovery]
aws_iam_role.karpenter_controller: Refreshing state... [id=lf-prod-aws-ue2-karpenter-controller]
aws_iam_policy.karpenter_controller: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue2-karpenter-controller]
aws_iam_role_policy_attachment.karpenter_controller: Refreshing state... [id=lf-prod-aws-ue2-karpenter-controller-20260608235145776400000001]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

━━━ PLAN: Module pypi-cache (lf-prod-aws-ue2) ━━━
data.terraform_remote_state.base: Reading...
aws_iam_policy.wants_collector: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue2-pypi-wants-collector-s3]
aws_iam_policy.wheel_syncer: Refreshing state... [id=arn:aws:iam::391835788720:policy/lf-prod-aws-ue2-pypi-wheel-syncer-s3]
aws_efs_file_system.pypi_cache: Refreshing state... [id=fs-066c520a3fe657aba]
data.terraform_remote_state.base: Read complete after 2s
aws_security_group.efs: Refreshing state... [id=sg-02463b5dfca9d70f7]
aws_iam_role.efs_csi_driver: Refreshing state... [id=lf-prod-aws-ue2-efs-csi-driver-role]
aws_iam_role.wheel_syncer: Refreshing state... [id=lf-prod-aws-ue2-pypi-wheel-syncer-role]
aws_iam_role.wants_collector: Refreshing state... [id=lf-prod-aws-ue2-pypi-wants-collector-role]
aws_iam_role_policy_attachment.wants_collector: Refreshing state... [id=lf-prod-aws-ue2-pypi-wants-collector-role-20260609154828857700000003]
aws_iam_role_policy_attachment.wheel_syncer: Refreshing state... [id=lf-prod-aws-ue2-pypi-wheel-syncer-role-20260609154828857900000004]
aws_iam_role_policy_attachment.efs_csi_driver: Refreshing state... [id=lf-prod-aws-ue2-efs-csi-driver-role-20260609154828863300000005]
aws_eks_addon.efs_csi_driver: Refreshing state... [id=lf-prod-aws-ue2:aws-efs-csi-driver]
aws_efs_mount_target.pypi_cache["subnet-0ae8d251d3a0336ca"]: Refreshing state... [id=fsmt-0d69f035478e5eb52]
aws_efs_mount_target.pypi_cache["subnet-0515848329e5dc53a"]: Refreshing state... [id=fsmt-033866be4ab382a70]
aws_efs_mount_target.pypi_cache["subnet-06a9b2e4ea40968b6"]: Refreshing state... [id=fsmt-0e394b1aa15541bde]

No changes. Your infrastructure matches the configuration.

OpenTofu has compared your real infrastructure against your configuration and
found no differences, so no changes are needed.

Update

6d12bce

[ghstack-poisoned]

georgehong temporarily deployed to osdc-staging June 16, 2026 22:11 — with GitHub Actions Inactive

This was referenced Jun 16, 2026

Add NFD topology-updater module with startup taint remover (#696) #738

Draft

Add numa-scheduler module (#696) #739

Draft

arc-runners: support per-def workflow schedulerName (#696) #759

Draft

Enable NUMA-aware scheduling for H100 4-GPU runner (#696) #740

Draft

georgehong had a problem deploying to osdc-staging June 16, 2026 22:12 — with GitHub Actions Error

georgehong had a problem deploying to osdc-staging June 16, 2026 22:13 — with GitHub Actions Error

georgehong temporarily deployed to osdc-staging June 16, 2026 22:14 — with GitHub Actions Inactive

georgehong temporarily deployed to osdc-staging June 16, 2026 22:16 — with GitHub Actions Inactive

georgehong wants to merge 1 commit into gh/georgehong/13/base from gh/georgehong/13/head


