H100 4-GPU jobs accepted then fail with TopologyAffinityError (NUMA-blind acceptance path)

## Summary

H100 (4-GPU) jobs on the `arc-cbr-production-uw1` cluster can be **accepted by GitHub and then fail immediately** with `TopologyAffinityError` because the job-acceptance path is NUMA-blind. We should gate acceptance on NUMA feasibility so these jobs queue instead of failing.

## Incident

- Job: `pytorch/helion` run `26856066015`, runner `mt-l-x86iamx-88-900-h100-4-2pfzt-runner-26mj5`, cluster `pytorch-arc-cbr-production-uw1`, node `ip-10-8-98-255` (p5.48xlarge).
- The runner pod started fine; the `-workflow` pod (the 4-GPU job container) was rejected by the kubelet: `TopologyAffinityError: Resources cannot be allocated with Topology locality`.
- Not a one-off: the same node rejected **4 consecutive 4-GPU workflow pods in ~90s** (including a capacity placeholder) — a fragmentation livelock.

## Root cause

The `p5-48xlarge` pool pins `topology_manager_policy: single-numa-node` (`scope: pod`). A p5.48xlarge has **2 NUMA sockets x 4 H100 each**, so a 4-GPU pod needs an **entirely empty socket**. The pool is shared by 1/2/4-GPU runners; the NUMA-blind scheduler spreads small runners across both sockets, and once at least one GPU is in use on each socket, **no 4-GPU job can be admitted** even though 4 or more GPUs are free in aggregate.
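The gap between the two views can be sketched in a few lines. This is a simplified model, not the kubelet's Topology Manager: it only tracks free GPUs per socket (the real manager also aligns CPU and memory hints), and the function names are made up for illustration.

```python
from typing import Sequence


def fits_scalar(free_per_socket: Sequence[int], request: int) -> bool:
    """Aggregate check: roughly what a NUMA-blind component sees."""
    return sum(free_per_socket) >= request


def fits_single_numa(free_per_socket: Sequence[int], request: int) -> bool:
    """single-numa-node style check: one socket must cover the whole request."""
    return any(free >= request for free in free_per_socket)


# p5.48xlarge modeled as 2 sockets x 4 GPUs.
fragmented = [3, 3]  # one 1-GPU runner landed on each socket
packed = [2, 4]      # the same two runners packed onto socket 0

for name, free in (("fragmented", fragmented), ("packed", packed)):
    print(name, fits_scalar(free, 4), fits_single_numa(free, 4))
```

Both layouts have 6 free GPUs, so the scalar check passes for each, but only the packed layout leaves a socket that can hold a 4-GPU pod.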

## Why current mechanisms don't catch it

- **GitHub assignment** is label-based, and both **ARC scaling** and the **capacity-aware placeholder gate** reason in *scalar* `nvidia.com/gpu` counts plus pod phase. None of the three is NUMA-aware.
- The capacity gate (`runningRunners + runningPairs`) is correct in aggregate, but a fragmented node is a **trap**: scalar-free but NUMA-split. The NUMA-blind kube-scheduler *prefers* it (no preemption needed) and routes the real workflow pod there, bypassing the validated placeholder slot. The kubelet, the only NUMA-aware actor, is also the last to act: it rejects the pod only after GitHub has already accepted the job.
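A toy model of the trap, assuming a two-node fleet where one node is fragmented and the other holds a Running 4-GPU placeholder. The `Node` type and the helper names are hypothetical and do not reflect the actual kube-scheduler or fork code; the point is only which node each view considers feasible.

```python
from dataclasses import dataclass
from typing import List, Optional

REQUEST = 4  # GPUs requested by the workflow pod


@dataclass
class Node:
    name: str
    free_per_socket: List[int]                # GPUs free right now, per NUMA socket
    placeholder_socket: Optional[int] = None  # socket held by a 4-GPU placeholder


def scalar_candidates(nodes: List[Node]) -> List[str]:
    """Nodes that look feasible without preemption to a NUMA-blind view."""
    return [n.name for n in nodes if sum(n.free_per_socket) >= REQUEST]


def numa_candidates(nodes: List[Node]) -> List[str]:
    """Nodes where a single socket can hold the request without preemption."""
    return [n.name for n in nodes if any(f >= REQUEST for f in n.free_per_socket)]


def preemption_candidates(nodes: List[Node]) -> List[str]:
    """Nodes where evicting a placeholder frees a validated socket."""
    return [n.name for n in nodes if n.placeholder_socket is not None]


nodes = [
    Node("trap", [2, 2]),                             # scalar-free, NUMA-split
    Node("validated", [0, 0], placeholder_socket=0),  # full, but holds the placeholder
]

print(scalar_candidates(nodes))      # what the NUMA-blind view picks
print(numa_candidates(nodes))        # what is actually admissible as-is
print(preemption_candidates(nodes))  # where the validated slot lives
```

In this model the scalar view offers only the trap node, the NUMA-aware view offers nothing without preemption, and the only admissible placement is the placeholder's slot.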

## Options to discuss

1. **NUMA-aware scheduling (preferred).** Deploy NFD `topology-updater` + the `NodeResourceTopologyMatch` scheduler plugin (kubernetes-sigs/scheduler-plugins); set `schedulerName` only on GPU workflow pods + GPU workflow placeholders (low blast radius). Effect: placeholders only go Running on feasible sockets -> advertised capacity becomes NUMA-accurate -> excess jobs stay queued instead of accepted-then-failed; real pods are never bound to a trap node (wait instead of fail). Keeps the shared pool (no capacity siloing). Cost: new infra (NRT CRD + second scheduler); must validate NRT freshness vs. placeholder churn (~20-30s).
2. **Dedicated `node_fleet` for 4-GPU runners** (mirrors the existing `p5-large` pattern for the 8-GPU runner). Eliminates mixed-size packing -> no fragmentation. Simple, no new components, but **partitions** the 6-node reserved fleet and strands capacity when 4-GPU demand is low.
3. **Relax to `best-effort`** on the shared pool. No more denials, but 4-GPU jobs may straddle sockets (cross-NUMA NVLink/NCCL perf hit) — reverses the deliberate locality choice.
4. **Interim: trap-node cordoner.** Watch for `TopologyAffinityError` and cordon/recycle the node (extend `zombie-cleanup`/`node_compactor`). Doesn't prevent the first failure but stops the livelock. Good stopgap while (1) is built.

**Proposed path:** land (4) for immediate relief, then (1) as the durable acceptance gate.
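For option (4), the detection half could look roughly like the sliding-window counter below. This is a sketch under assumptions: the threshold and window values are placeholders, events are assumed to arrive in timestamp order, and the class name is hypothetical. Wiring it to a pod or event watch and performing the actual cordon (for example via `kubectl cordon`) is left out.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class TrapNodeDetector:
    """Flags a node after repeated TopologyAffinityError rejections in a window."""

    def __init__(self, threshold: int = 3, window_s: float = 120.0) -> None:
        self.threshold = threshold  # assumed value, tune against real incidents
        self.window_s = window_s    # assumed value, tune against real incidents
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, node: str, reason: str, ts: float) -> bool:
        """Record one pod rejection; return True if the node should be cordoned."""
        if reason != "TopologyAffinityError":
            return False
        hits = self._hits[node]
        hits.append(ts)
        # Drop rejections that have aged out of the window.
        while hits and ts - hits[0] > self.window_s:
            hits.popleft()
        return len(hits) >= self.threshold


detector = TrapNodeDetector()
# Replays the incident shape: 4 rejections on one node within ~90s.
decisions = [
    detector.observe("ip-10-8-98-255", "TopologyAffinityError", t)
    for t in (0.0, 30.0, 60.0, 90.0)
]
print(decisions)
```

With these defaults the node is flagged on the third rejection, so the first failures still reach users but the livelock is cut short.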

## References

- `osdc/modules/nodepools-h100/defs/p5.yaml` (single-numa-node), `p5-large.yaml` (dedicated-pool precedent)
- `osdc/modules/arc-runners/templates/runner.yaml.tpl` (two-pod model; workflow pod carries the GPU request)
- Fork capacity logic: `cmd/ghalistener/capacity/{placeholder,monitor}.go` in jeanschmidt/actions-runner-controller

