[TEST-ONLY] Enable NUMA modules on arc-staging with g4dn.metal (T4)#748
Draft
georgehong wants to merge 5 commits into
This was referenced Jun 12, 2026
Capacity report: ✅ simulate-cluster, ✅ analyze-utilization
tofu plan (arc-cbr-production): ✅ Plan succeeded
tofu plan (arc-cbr-production-uw1): ✅ Plan succeeded
tofu plan (meta-prod-aws-ue1): ✅ Plan succeeded
georgehong added a commit that referenced this pull request on Jun 15, 2026:

Temporary test configuration (DO NOT MERGE). Parallel to the A100 [TEST-ONLY] commit: the same NUMA pipeline, validated on g4dn.metal (T4) because AWS does not offer A100/p4d in us-west-1 (arc-staging's region), whereas g4dn.metal is available on-demand there.

- Add nfd + numa-scheduler to arc-staging modules; remove from prod clusters
- Point NFD topology-updater + taint-remover nodeSelector at g4dn-metal-numa
- Gate the nfd-topology startup taint on the g4dn-metal-numa fleet
- Add g4dn-metal-numa nodepool: single-numa-node, capped to ONE node (limits.nvidia.com/gpu: 8) so 1-GPU + 4-GPU runners pack one 2-NUMA box
- Add a nodepool `limits` passthrough to generate_nodepools.py
- Add 1-GPU and 4-GPU T4 runner defs (4-GPU uses scheduler_name: numa-scheduler)
- Add cleanup-arc-staging.sh for teardown

g4dn.metal = 2 sockets x 4 T4 (2 NUMA x 4 GPU), topologically identical to p4d/p5. cpuManagerPolicy=static and Guaranteed-QoS runner pods are already in place, so CPU+GPU NUMA alignment applies without workload changes.

Cleanup: bash modules/nfd/scripts/cleanup-arc-staging.sh
Then drop this commit: git checkout numa-aware-scheduling (or git reset --hard HEAD~1)

ghstack-source-id: a9e35b6
Pull-Request: #748
georgehong added three more commits that referenced this pull request on Jun 15, 2026, each with the same commit message and a new ghstack-source-id: 7af6767, 7fa073e, c63197b.
Stack from ghstack (oldest at bottom):
Temporary test configuration — DO NOT MERGE. Parallel to the A100 [TEST-ONLY]
commit: the same NUMA pipeline, validated on g4dn.metal (T4) because AWS does
not offer A100/p4d in us-west-1 (arc-staging's region), whereas g4dn.metal is
available on-demand there.
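
- Add nfd + numa-scheduler to arc-staging modules; remove from prod clusters
- Point NFD topology-updater + taint-remover nodeSelector at g4dn-metal-numa
- Gate the nfd-topology startup taint on the g4dn-metal-numa fleet
- Add g4dn-metal-numa nodepool: single-numa-node, capped to ONE node
  (limits.nvidia.com/gpu: 8) so 1-GPU + 4-GPU runners pack one 2-NUMA box
- Add a nodepool `limits` passthrough to generate_nodepools.py
- Add 1-GPU and 4-GPU T4 runner defs (4-GPU uses scheduler_name: numa-scheduler)
- Add cleanup-arc-staging.sh for teardown

g4dn.metal = 2 sockets x 4 T4 (2 NUMA x 4 GPU), topologically identical to
p4d/p5. cpuManagerPolicy=static and Guaranteed-QoS runner pods are already in
place, so CPU+GPU NUMA alignment applies without workload changes.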
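
To make the packing claim concrete (1-GPU and 4-GPU runners sharing one 2-NUMA box under single-numa-node), here is a minimal sketch. It is not part of this PR and does not model the actual numa-scheduler or kubelet Topology Manager; `pack_single_numa` is a hypothetical helper that uses a first-fit-decreasing heuristic and assumes the 2 NUMA x 4 GPU layout described above.

```python
# Assumed topology from the description: 2 NUMA nodes with 4 T4 GPUs each.
NUMA_NODES = 2
GPUS_PER_NUMA = 4


def pack_single_numa(requests, numa_nodes=NUMA_NODES, gpus_per_numa=GPUS_PER_NUMA):
    """Place every GPU request entirely inside one NUMA node.

    Uses first-fit decreasing, which is a heuristic: a None result means this
    heuristic found no placement, not that the real scheduler would fail.
    Returns a list with one entry per NUMA node holding the requests placed there.
    """
    free = [gpus_per_numa] * numa_nodes
    placement = [[] for _ in range(numa_nodes)]
    for req in sorted(requests, reverse=True):
        for node in range(numa_nodes):
            if free[node] >= req:
                free[node] -= req
                placement[node].append(req)
                break
        else:
            # No single NUMA node has enough free GPUs for this request.
            return None
    return placement


if __name__ == "__main__":
    # One 4-GPU runner fills a NUMA node; four 1-GPU runners share the other.
    print(pack_single_numa([4, 1, 1, 1, 1]))
    # Two 4-GPU runners leave no room for an extra 1-GPU runner.
    print(pack_single_numa([4, 4, 1]))
```

With the 8-GPU limit on the nodepool, the first case is the fully packed box: one 4-GPU runner on one NUMA node and up to four 1-GPU runners on the other.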
Cleanup: bash modules/nfd/scripts/cleanup-arc-staging.sh
Then drop this commit: git checkout numa-aware-scheduling (or git reset --hard HEAD~1)
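
For reviewers who have not opened the diff, the `limits` passthrough in generate_nodepools.py could look roughly like the sketch below. This is an illustration only, not the code in this PR: `build_nodepool`, the definition keys (`name`, `instance_types`, `limits`), and the Karpenter-style NodePool shape are all assumptions.

```python
import copy


def build_nodepool(defn):
    """Render a nodepool manifest dict from a definition dict.

    Hypothetical schema: `name`, `instance_types`, and an optional `limits`
    mapping that is copied through unchanged to spec.limits.
    """
    manifest = {
        "kind": "NodePool",  # assumed Karpenter-style resource
        "metadata": {"name": defn["name"]},
        "spec": {
            "template": {
                "spec": {
                    "requirements": [
                        {
                            "key": "node.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": list(defn["instance_types"]),
                        }
                    ]
                }
            }
        },
    }
    limits = defn.get("limits")
    if limits:
        # Passthrough: emit limits only when the definition sets them.
        manifest["spec"]["limits"] = copy.deepcopy(limits)
    return manifest


if __name__ == "__main__":
    pool = build_nodepool(
        {
            "name": "g4dn-metal-numa",
            "instance_types": ["g4dn.metal"],
            # 8 GPUs is one g4dn.metal, so the pool is capped at one node.
            "limits": {"nvidia.com/gpu": 8},
        }
    )
    print(pool["spec"]["limits"])
```

Emitting `limits` only when present keeps the generated output for existing nodepools unchanged, which matters for a test-only change that is meant to be reverted.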