Summary
H100 (4-GPU) jobs on the arc-cbr-production-uw1 cluster can be accepted by GitHub and then fail immediately with TopologyAffinityError because the job-acceptance path is NUMA-blind. We should gate acceptance on NUMA feasibility so these jobs queue instead of failing.
Incident
- Job: pytorch/helion run 26856066015, runner mt-l-x86iamx-88-900-h100-4-2pfzt-runner-26mj5, cluster pytorch-arc-cbr-production-uw1, node ip-10-8-98-255 (p5.48xlarge).
- The runner pod started fine; the -workflow pod (the 4-GPU job container) was rejected by the kubelet: TopologyAffinityError: Resources cannot be allocated with Topology locality.
- Not a one-off: the same node rejected 4 consecutive 4-GPU workflow pods in ~90s (including a capacity placeholder) — a fragmentation livelock.
Root cause
The p5-48xlarge pool pins topology_manager_policy: single-numa-node (scope: pod). A p5.48xlarge has 2 NUMA sockets x 4 H100 each, so a 4-GPU pod needs an entirely empty socket. The pool is shared by 1/2/4-GPU runners; the NUMA-blind scheduler sprinkles small runners across both sockets, and once >=1 GPU is used on each socket, no 4-GPU job can be admitted even though 4 GPUs are free in aggregate.
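The gap between the two views can be sketched in a few lines. This is a simplified model, not the kubelet's Topology Manager: it tracks GPUs only (real admission also aligns CPU and memory), and the function names and the `used_per_socket` representation are hypothetical.

```python
from typing import Sequence

GPUS_PER_SOCKET = 4  # p5.48xlarge: 2 NUMA sockets x 4 H100 each


def scalar_fits(used_per_socket: Sequence[int], request: int) -> bool:
    """NUMA-blind view: compare the request to total free GPUs on the node."""
    free = sum(GPUS_PER_SOCKET - used for used in used_per_socket)
    return free >= request


def single_numa_fits(used_per_socket: Sequence[int], request: int) -> bool:
    """single-numa-node view: some single socket must hold the whole request."""
    return any(GPUS_PER_SOCKET - used >= request for used in used_per_socket)


# Small runners spread over both sockets: 4 GPUs free in aggregate, no empty socket.
fragmented = [2, 2]
print(scalar_fits(fragmented, 4), single_numa_fits(fragmented, 4))

# Same GPU usage packed onto one socket: the other socket is still empty.
packed = [4, 0]
print(scalar_fits(packed, 4), single_numa_fits(packed, 4))
```

The fragmented node passes the scalar check and fails the single-socket check, which is the state in which a 4-GPU workflow pod is scheduled and then rejected.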
Why current mechanisms don't catch it
- GitHub assignment is label-based, and ARC scaling and the capacity-aware placeholder gate both reason in scalar nvidia.com/gpu counts + pod phase. None of them is NUMA-aware.
- The capacity gate (runningRunners + runningPairs) is correct in aggregate, but a fragmented node is a trap: scalar-free but NUMA-split. The NUMA-blind kube-scheduler prefers it (no preemption needed) and routes the real workflow pod there, bypassing the validated placeholder slot. The kubelet, the only NUMA-aware actor, rejects it last, after the job is already accepted.
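At fleet level, the same blind spot shows up as over-advertised capacity. The sketch below is illustrative only and is not the fork's gate logic: the fleet layout, node names, and helper functions are all hypothetical, and only GPUs are modelled.

```python
from typing import Dict, List

GPUS_PER_SOCKET = 4  # assumed: 2 NUMA sockets x 4 H100 per p5.48xlarge


def scalar_slots(fleet: Dict[str, List[int]], request: int) -> int:
    """Slots implied by per-node free GPU totals (NUMA-blind)."""
    return sum(
        sum(GPUS_PER_SOCKET - used for used in sockets) // request
        for sockets in fleet.values()
    )


def numa_slots(fleet: Dict[str, List[int]], request: int) -> int:
    """Slots that fit entirely inside one socket."""
    return sum(
        (GPUS_PER_SOCKET - used) // request
        for sockets in fleet.values()
        for used in sockets
    )


def trap_nodes(fleet: Dict[str, List[int]], request: int) -> List[str]:
    """Nodes with enough free GPUs in aggregate but no feasible socket."""
    return [
        name
        for name, sockets in fleet.items()
        if sum(GPUS_PER_SOCKET - u for u in sockets) >= request
        and all(GPUS_PER_SOCKET - u < request for u in sockets)
    ]


# GPUs in use per socket on three hypothetical nodes.
fleet = {"node-a": [1, 1], "node-b": [4, 0], "node-c": [0, 0]}
print(scalar_slots(fleet, 4), numa_slots(fleet, 4), trap_nodes(fleet, 4))
```

In this made-up fleet the scalar count advertises four 4-GPU slots while only three are placeable; the difference is the single trap node, node-a.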
Options to discuss
1. NUMA-aware scheduling (preferred). Deploy NFD topology-updater + the NodeResourceTopologyMatch scheduler plugin (kubernetes-sigs/scheduler-plugins); set schedulerName only on GPU workflow pods + GPU workflow placeholders (low blast radius). Effect: placeholders only go Running on feasible sockets -> advertised capacity becomes NUMA-accurate -> excess jobs stay queued instead of accepted-then-failed; real pods are never bound to a trap node (wait instead of fail). Keeps the shared pool (no capacity siloing). Cost: new infra (NRT CRD + second scheduler); must validate NRT freshness vs. placeholder churn (~20-30s).
2. Dedicated node_fleet for 4-GPU runners (mirrors the existing p5-large pattern for the 8-GPU runner). Eliminates mixed-size packing -> no fragmentation. Simple, no new components, but partitions the 6-node reserved fleet and strands capacity when 4-GPU demand is low.
3. Relax to best-effort on the shared pool. No more denials, but 4-GPU jobs may straddle sockets (cross-NUMA NVLink/NCCL perf hit), which reverses the deliberate locality choice.
4. Interim: trap-node cordoner. Watch for TopologyAffinityError and cordon/recycle the node (extend zombie-cleanup/node_compactor). Doesn't prevent the first failure but stops the livelock. Good stopgap while (1) is built.
Proposed path: land (4) for immediate relief, then (1) as the durable acceptance gate.
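A minimal sketch of the decision logic for the interim trap-node cordoner, assuming rejections arrive as (node, reason, timestamp) tuples in time order. The class name, threshold, and window are hypothetical defaults, and the actual cordon/recycle call against the cluster is deliberately left out.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class TrapNodeDetector:
    """Flags a node after repeated TopologyAffinityError rejections in a window."""

    def __init__(self, threshold: int = 2, window_s: float = 120.0) -> None:
        self.threshold = threshold  # rejections needed before cordoning
        self.window_s = window_s    # sliding window length in seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, node: str, reason: str, ts: float) -> bool:
        """Record one pod rejection; return True if the node should be cordoned."""
        if reason != "TopologyAffinityError":
            return False
        events = self._events[node]
        events.append(ts)
        # Drop rejections that have aged out of the window.
        while events and ts - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.threshold


detector = TrapNodeDetector(threshold=2, window_s=120.0)
for ts in (0.0, 30.0, 60.0, 90.0):  # roughly the incident: 4 rejections in ~90s
    if detector.record("ip-10-8-98-255", "TopologyAffinityError", ts):
        print(f"cordon ip-10-8-98-255 at t={ts:.0f}s")
        break
```

With a threshold of 2 the node is flagged on the second rejection, so the first failure still happens but the repeat rejections stop. A threshold of 1 would cordon on the first error at the cost of reacting to one-off rejections.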
References
- osdc/modules/nodepools-h100/defs/p5.yaml (single-numa-node), p5-large.yaml (dedicated-pool precedent)
- osdc/modules/arc-runners/templates/runner.yaml.tpl (two-pod model; workflow pod carries the GPU request)
- Fork capacity logic: cmd/ghalistener/capacity/{placeholder,monitor}.go in jeanschmidt/actions-runner-controller