[TEST-ONLY] Validate NUMA scheduling on A100 (p4d) in ue1 staging #778
Draft
georgehong wants to merge 1 commit into
This was referenced Jun 16, 2026
Capacity report · commit ✅ simulate-cluster ✅ analyze-utilization
tofu plan: arc-cbr-production ✅ Plan succeeded
georgehong added a commit that referenced this pull request on Jun 16, 2026
Temporary test configuration. DO NOT MERGE.

Self-contained sibling of the g4dn.metal [TEST-ONLY] commit (now archived on
numa-aware-scheduling-g4dn): the same NUMA pipeline on REAL A100 hardware
(p4d.24xlarge) in the new meta-staging-aws-ue1 (us-east-1) staging cluster, the
production-class target. Reuses the existing p4d fleet + A100 runner defs (the p4d
fleet is already single-numa-node), so no new fleet or runner defs are needed.

Prerequisite this unblocks: confirm whether A100/p4d actually PUBLISHES per-GPU NUMA
topology in the NodeResourceTopology. On g4dn.metal it did NOT (zones exposed CPU only,
so the numa-scheduler could never align a GPU pod). After confirming, vary the scheduler
and over-request beyond one NUMA zone (edit the 1-GPU def), mirroring the g4dn A/B.

- Add nfd + numa-scheduler to meta-staging-aws-ue1; remove from the two prod clusters
  so the test pipeline is isolated to staging (mirrors the g4dn commit).
- Repoint NFD topology-updater + taint-remover + the nfd-topology startup-taint gate
  from p5 to p4d. Only ue1 runs nfd now, so a single fleet target; no affinity needed.
- Add scheduler_name: numa-scheduler to the existing A100 1-GPU and 4-GPU runner defs
  (the 4-GPU is the real scenario, parallel to the H100 4-GPU; the 1-GPU is the A/B knob).

p4d.24xlarge = 2 sockets x 4 A100 40GB (2 NUMA x 4 GPU). cpuManagerPolicy=static and
Guaranteed-QoS runner pods already apply, so CPU+GPU NUMA alignment needs no workload
changes. ue1 runners carry the c-mt- staging prefix + meta-staging-aws-ue1 runner group,
so there is no overlap with prod (mt-).

Deploy (ue1 only):

    just deploy-module meta-staging-aws-ue1 nfd
    just deploy-module meta-staging-aws-ue1 numa-scheduler
    just deploy-module meta-staging-aws-ue1 nodepools
    just deploy-module meta-staging-aws-ue1 arc-runners

Then queue a canary job and inspect the NRT for per-zone nvidia.com/gpu BEFORE varying.

Cleanup: drop this commit (git reset --hard HEAD~1) + teardown nfd/numa-scheduler on ue1.
ghstack-source-id: 1473ca5
Pull-Request: #778
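
The "inspect the NRT for per-zone nvidia.com/gpu" step in the commit message could be scripted roughly as follows. This is a minimal sketch, not the check used in this PR. It assumes the NodeResourceTopology object has already been exported as JSON (for example via kubectl) and that it follows the `zones[].resources[]` layout with `name` and `allocatable` fields. The helper names and both sample objects are hypothetical, made up only to illustrate the p4d (2 NUMA x 4 GPU) and g4dn.metal (CPU-only zones) cases described above.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_zone(nrt, resource=GPU_RESOURCE):
    """Map each NUMA zone name to its allocatable count of `resource`.

    A zone that does not list the resource at all is reported as 0.
    """
    counts = {}
    for zone in nrt.get("zones", []):
        counts[zone["name"]] = 0
        for res in zone.get("resources", []):
            if res.get("name") == resource:
                # Quantities are often serialized as strings ("4"); int() accepts both.
                counts[zone["name"]] = int(res.get("allocatable", 0))
    return counts


def publishes_gpu_topology(nrt):
    """True if at least one zone exposes GPUs, i.e. the prerequisite holds."""
    return any(n > 0 for n in gpus_per_zone(nrt).values())


def fits_one_zone(nrt, requested_gpus):
    """Simplified stand-in for single-numa-node admission: does any one zone
    hold the whole GPU request? (Not the scheduler's actual algorithm.)"""
    return any(n >= requested_gpus for n in gpus_per_zone(nrt).values())


# Hypothetical sample: what a p4d.24xlarge NRT might look like if GPUs are published.
P4D_LIKE = {
    "zones": [
        {"name": "node-0", "type": "Node", "resources": [
            {"name": "cpu", "allocatable": "46"},
            {"name": GPU_RESOURCE, "allocatable": "4"},
        ]},
        {"name": "node-1", "type": "Node", "resources": [
            {"name": "cpu", "allocatable": "46"},
            {"name": GPU_RESOURCE, "allocatable": "4"},
        ]},
    ]
}

# Hypothetical sample: the g4dn.metal outcome, where zones exposed CPU only.
G4DN_LIKE = {
    "zones": [
        {"name": "node-0", "type": "Node", "resources": [{"name": "cpu", "allocatable": "46"}]},
        {"name": "node-1", "type": "Node", "resources": [{"name": "cpu", "allocatable": "46"}]},
    ]
}

if __name__ == "__main__":
    for label, nrt in (("p4d-like", P4D_LIKE), ("g4dn-like", G4DN_LIKE)):
        print(label, gpus_per_zone(nrt), "publishes GPUs:", publishes_gpu_topology(nrt))
    # Over-requesting beyond one NUMA zone: 5 GPUs cannot fit in a 4-GPU zone.
    print("4 GPUs fit one zone:", fits_one_zone(P4D_LIKE, 4))
    print("5 GPUs fit one zone:", fits_one_zone(P4D_LIKE, 5))
```

Under these assumptions, a request of more than 4 GPUs on a p4d-like node is the "over-request beyond one NUMA zone" case, and a CPU-only NRT reproduces the g4dn.metal result where no GPU pod could be aligned.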
tofu plan: lf-prod-aws-ue1 ✅ Plan succeeded
tofu plan: lf-prod-aws-ue2 ✅ Plan succeeded