H100 4-GPU jobs accepted then fail with TopologyAffinityError (NUMA-blind acceptance path) #696

Description

@huydhn

Summary

H100 (4-GPU) jobs on the arc-cbr-production-uw1 cluster can be accepted by GitHub and then fail immediately with TopologyAffinityError because the job-acceptance path is NUMA-blind. We should gate acceptance on NUMA feasibility so these jobs queue instead of failing.

Incident

  • Job: pytorch/helion run 26856066015, runner mt-l-x86iamx-88-900-h100-4-2pfzt-runner-26mj5, cluster pytorch-arc-cbr-production-uw1, node ip-10-8-98-255 (p5.48xlarge).
  • The runner pod started fine; the -workflow pod (the 4-GPU job container) was rejected by the kubelet: TopologyAffinityError: Resources cannot be allocated with Topology locality.
  • Not a one-off: the same node rejected 4 consecutive 4-GPU workflow pods in ~90s (including a capacity placeholder) — a fragmentation livelock.

Root cause

The p5-48xlarge pool pins topology_manager_policy: single-numa-node (scope: pod). A p5.48xlarge has 2 NUMA sockets with 4 H100s each, so a 4-GPU pod needs one socket with all 4 of its GPUs free. The pool is shared by 1/2/4-GPU runners, and the NUMA-blind scheduler sprinkles small runners across both sockets. Once at least 1 GPU is in use on each socket, no 4-GPU job can be admitted on that node, even though 4 or more GPUs may still be free in aggregate.
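To make the gap concrete, here is a minimal sketch of the two feasibility checks. It is illustrative only: it models GPUs alone (the real Topology Manager also aligns CPUs and other resources), and the function names and the per-socket free-GPU lists are hypothetical, not taken from the kubelet or from ARC.

```python
from typing import Sequence


def fits_scalar(free_per_numa: Sequence[int], request: int) -> bool:
    """Scalar view: only the node-wide total of free GPUs matters."""
    return sum(free_per_numa) >= request


def fits_single_numa(free_per_numa: Sequence[int], request: int) -> bool:
    """single-numa-node view: one NUMA node must cover the whole request."""
    return any(free >= request for free in free_per_numa)


# Hypothetical p5.48xlarge states, as free GPUs per socket (2 sockets x 4 GPUs).
fragmented = [2, 2]  # small runners landed on both sockets
packed = [0, 4]      # same number of GPUs in use, but all on one socket

for name, state in (("fragmented", fragmented), ("packed", packed)):
    print(name, fits_scalar(state, 4), fits_single_numa(state, 4))
```

Both states have 4 GPUs free, so the scalar check passes for both; only the packed state passes the single-NUMA check, which is the case the kubelet would admit.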

Why current mechanisms don't catch it

  • GitHub assignment is label-based, and ARC scaling and the capacity-aware placeholder gate both reason in scalar nvidia.com/gpu counts + pod phase. None of them is NUMA-aware.
  • The capacity gate (runningRunners + runningPairs) is correct in aggregate, but a fragmented node is a trap: scalar-free but NUMA-split. The NUMA-blind kube-scheduler prefers it (no preemption needed) and routes the real workflow pod there, bypassing the validated placeholder slot. The kubelet — the only NUMA-aware actor — rejects it last, after the job is already accepted.
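The difference between scalar and NUMA-accurate capacity can be sketched as below. This is not the fork's capacity gate (which counts runningRunners + runningPairs); it is a hypothetical model that only counts how many 4-GPU slots a set of nodes appears to offer under each view, with made-up node states.

```python
from typing import Sequence


def scalar_slots(nodes: Sequence[Sequence[int]], request: int) -> int:
    """Slots a scalar gate would advertise: node-wide free GPUs // request."""
    return sum(sum(free_per_numa) // request for free_per_numa in nodes)


def numa_slots(nodes: Sequence[Sequence[int]], request: int) -> int:
    """Slots admissible under single-numa-node: counted per NUMA node."""
    return sum(free // request for free_per_numa in nodes for free in free_per_numa)


# Hypothetical fleet, each entry is free GPUs per socket on one node.
fleet = [
    [2, 2],  # trap node: scalar-free but NUMA-split
    [4, 4],  # empty node
    [3, 1],  # trap node
]

print(scalar_slots(fleet, 4), numa_slots(fleet, 4))
```

In this example the scalar view counts 4 slots while only 2 are admissible, so up to 2 accepted jobs could be routed to a trap node and rejected by the kubelet.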

Options to discuss

  1. NUMA-aware scheduling (preferred). Deploy NFD topology-updater + the NodeResourceTopologyMatch scheduler plugin (kubernetes-sigs/scheduler-plugins); set schedulerName only on GPU workflow pods + GPU workflow placeholders (low blast radius). Effect: placeholders only go Running on feasible sockets -> advertised capacity becomes NUMA-accurate -> excess jobs stay queued instead of accepted-then-failed; real pods are never bound to a trap node (wait instead of fail). Keeps the shared pool (no capacity siloing). Cost: new infra (NRT CRD + second scheduler); must validate NRT freshness vs. placeholder churn (~20-30s).
  2. Dedicated node_fleet for 4-GPU runners (mirrors the existing p5-large pattern for the 8-GPU runner). Eliminates mixed-size packing -> no fragmentation. Simple, no new components, but partitions the 6-node reserved fleet and strands capacity when 4-GPU demand is low.
  3. Relax to best-effort on the shared pool. No more denials, but 4-GPU jobs may straddle sockets (cross-NUMA NVLink/NCCL perf hit) — reverses the deliberate locality choice.
  4. Interim: trap-node cordoner. Watch for TopologyAffinityError and cordon/recycle the node (extend zombie-cleanup/node_compactor). Doesn't prevent the first failure but stops the livelock. Good stopgap while (1) is built.

Proposed path: land (4) for immediate relief, then (1) as the durable acceptance gate.
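For option (4), the detection half could look roughly like the sketch below: count TopologyAffinityError rejections per node in a sliding window and flag the node once a threshold is reached. The class name, threshold, and window are assumptions for illustration (the incident saw 4 rejections in ~90s); how events are sourced and how the node is actually cordoned or recycled is left out.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class TrapNodeDetector:
    """Flags a node after repeated TopologyAffinityError rejections."""

    def __init__(self, threshold: int = 3, window_s: float = 120.0) -> None:
        self.threshold = threshold  # hypothetical default
        self.window_s = window_s    # hypothetical default
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, node: str, reason: str, ts: float) -> bool:
        """Record one pod rejection; return True if the node should be cordoned."""
        if reason != "TopologyAffinityError":
            return False
        events = self._events[node]
        events.append(ts)
        # Drop rejections that fell out of the sliding window.
        while events and ts - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.threshold


detector = TrapNodeDetector(threshold=3, window_s=120.0)
decisions = [
    detector.record("ip-10-8-98-255", "TopologyAffinityError", ts)
    for ts in (0.0, 30.0, 60.0)
]
print(decisions)
```

A low threshold stops the livelock sooner but risks cordoning a node over a transient rejection, so the values would need tuning against real event data.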

References

  • osdc/modules/nodepools-h100/defs/p5.yaml (single-numa-node), p5-large.yaml (dedicated-pool precedent)
  • osdc/modules/arc-runners/templates/runner.yaml.tpl (two-pod model; workflow pod carries the GPU request)
  • Fork capacity logic: cmd/ghalistener/capacity/{placeholder,monitor}.go in jeanschmidt/actions-runner-controller
