Add NFD topology-updater and numa-scheduler modules (#696) #716
georgehong wants to merge 5 commits into
tofu plan (arc-cbr-production): ✅ Plan succeeded
tofu plan (arc-cbr-production-uw1): ✅ Plan succeeded
tofu plan (meta-prod-aws-ue1): ✅ Plan succeeded
Adds two new OSDC modules for NUMA-aware GPU scheduling:
1. nfd: deploys NFD topology-updater as a DaemonSet on GPU nodes. Publishes NodeResourceTopology (NRT) objects showing per-NUMA-zone resource availability (GPU/CPU/memory per socket). Polls every 15s to stay ahead of the ARC capacity provisioner's 30s interval.
2. numa-scheduler: deploys a secondary kube-scheduler with the NodeResourceTopologyMatch plugin. Reads NRT objects and only places pods on nodes where a single NUMA zone can satisfy the full resource request. Uses MostAllocated scoring to pack small pods together and keep free sockets available for large (4-GPU) jobs.
The scheduler-plugins chart is downloaded from the GitHub release (kubernetes-sigs/scheduler-plugins) at deploy time; the project does not publish to a Helm registry.
Both modules default to enabled=false in clusters.yaml and are inert until runner definitions set scheduler_name: numa-scheduler on workflow pods. Deploying them has no effect on existing scheduling.
Part of #696
ghstack-source-id: b9b30d1
Pull-Request: #716
Planning some additional changes: remove the gh dependency for extracting the release (i.e. use curl instead) and add more module tests.
This update adds more smoke tests. As with the cluster deployment, once this is validated I can try running it on arc-staging to confirm that the deployment works as expected before continuing.
- key: instance-type
  operator: Exists
  effect: NoSchedule
- key: git-cache-not-ready
This has been removed, so you will need to rebase. The code here seems to indicate that you will want to add this to the new list too: https://github.com/pytorch/ci-infra/blob/main/osdc/modules/nodepools/scripts/python/generate_nodepools.py#L55
@claude This PR introduces a new osdc/modules/numa-scheduler/helm/values.yaml, how does that work? Is it always enforced or can it be optional? A couple more things that I have in mind:
Claude finished @huydhn's task in 2m 0s
How
For this point, what is the impact of enabling the numa scheduler on the https://github.com/pytorch/ci-infra/blob/main/osdc/modules/arc-runners-h100/defs/l-x86iamx-88-900-h100-4.yaml scale set, which uses the https://github.com/pytorch/ci-infra/blob/main/osdc/modules/nodepools-h100/defs/p5.yaml nodepool? The nodepool has
Claude finished @huydhn's task in 2m 36s
Impact of enabling
| | 1/2/4-GPU defs (p5.48xlarge, single-numa-node) | 8-GPU def (p5-large, best-effort) |
|---|---|---|
| Fits one NUMA zone? | ✅ yes (4-GPU = exactly one socket) | ❌ no (spans both) |
| Enable numa-scheduler? | ✅ intended use (prevents TopologyAffinityError) | ❌ never (would hang Pending) |
| Watch out for | per-zone allocatable < full-socket request → Pending | n/a (don't opt in) |
So: yes, enable it here — just confirm the 4-GPU request fits the socket's allocatable (not just capacity) first, or shave a little CPU/memory headroom off the def.
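The allocatable check recommended here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `zone_shortfall` and all numbers are hypothetical and not taken from the p5 nodepool or the runner def. Real per-zone values would come from the node's NodeResourceTopology object.

```python
from typing import Dict


def zone_shortfall(request: Dict[str, float], allocatable: Dict[str, float]) -> Dict[str, float]:
    """Return, per resource, how far the request exceeds one zone's allocatable.

    An empty result means the request fits inside the zone.
    """
    shortfall = {}
    for resource, wanted in request.items():
        available = allocatable.get(resource, 0.0)
        if wanted > available:
            shortfall[resource] = wanted - available
    return shortfall


# Hypothetical numbers for illustration only.
zone_allocatable = {"nvidia.com/gpu": 4, "cpu": 94.0, "memory_gib": 980.0}
full_socket_request = {"nvidia.com/gpu": 4, "cpu": 96.0, "memory_gib": 900.0}

# A non-empty result lists what to shave off the def before opting in.
print(zone_shortfall(full_socket_request, zone_allocatable))
```

With these made-up values the request asks for more CPU than the zone can allocate, so the pod would stay Pending under numa-scheduler until the def's CPU request is reduced.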
Branch: gh/georgehong/3/head
Update: the changes are being broken down into components in a stack, available here: #738.
Stack from ghstack (oldest at bottom):
Adds two new OSDC modules for NUMA-aware GPU scheduling:
1. nfd: deploys NFD topology-updater as a DaemonSet on GPU nodes.
   Publishes NodeResourceTopology (NRT) objects showing per-NUMA-zone
   resource availability (GPU/CPU/memory per socket). Polls every 15s
   to stay ahead of the ARC capacity provisioner's 30s interval.
2. numa-scheduler: deploys a secondary kube-scheduler with the
   NodeResourceTopologyMatch plugin. Reads NRT objects and only places
   pods on nodes where a single NUMA zone can satisfy the full resource
   request. Uses MostAllocated scoring to pack small pods together and
   keep free sockets available for large (4-GPU) jobs.
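The filter-then-pack behavior described in the list above can be illustrated with a small Python model. This is a simplified sketch, not the NodeResourceTopologyMatch implementation: the function names, the averaging formula, and the node data are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

Resources = Dict[str, float]
Zone = Tuple[Resources, Resources]  # (free, capacity) for one NUMA zone


def zone_fits(request: Resources, free: Resources) -> bool:
    """Filter step: the whole request must fit inside a single zone."""
    return all(free.get(name, 0.0) >= amount for name, amount in request.items())


def most_allocated_score(request: Resources, zone: Zone) -> float:
    """Score step: fraction of the zone in use after placement (higher = fuller).

    Assumes every requested resource has a positive capacity in the zone.
    """
    free, capacity = zone
    used_after = [
        (capacity[name] - free[name] + amount) / capacity[name]
        for name, amount in request.items()
    ]
    return sum(used_after) / len(used_after)


def pick_node(request: Resources, nodes: Dict[str, List[Zone]]) -> Optional[str]:
    """Return the node whose best-fitting zone ends up fullest, or None."""
    best_name, best_score = None, -1.0
    for name, zones in nodes.items():
        scores = [most_allocated_score(request, z) for z in zones if zone_fits(request, z[0])]
        if not scores:
            continue  # no single zone can hold the request: node is filtered out
        if max(scores) > best_score:
            best_name, best_score = name, max(scores)
    return best_name


GPU = "nvidia.com/gpu"
# Hypothetical cluster: node-a is partly used, node-b has two free sockets.
nodes = {
    "node-a": [({GPU: 2}, {GPU: 4}), ({GPU: 3}, {GPU: 4})],
    "node-b": [({GPU: 4}, {GPU: 4}), ({GPU: 4}, {GPU: 4})],
}

print(pick_node({GPU: 1}, nodes))  # small pod is packed onto the busier node
print(pick_node({GPU: 4}, nodes))  # full-socket job needs a completely free zone
print(pick_node({GPU: 8}, nodes))  # spans two sockets: no node qualifies
```

In this toy cluster the 1-GPU pod lands on the partly used node, which leaves both sockets of the other node free for a 4-GPU job. An 8-GPU request matches no single zone and would stay Pending.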
The scheduler-plugins chart is downloaded via curl from the GitHub
release (kubernetes-sigs/scheduler-plugins) at deploy time, because the
project does not publish to a Helm repository or an OCI registry.
Both modules are inert until added to a cluster's modules: list in
clusters.yaml (matching the standard module enablement pattern, with no
separate enabled flag). Deploying them has no effect on existing
scheduling until runner definitions set scheduler_name: numa-scheduler
on workflow pods.
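The two-step opt-in described above can be modeled as a pair of checks. The sketch below is illustrative only: the dictionary shapes stand in for parsed clusters.yaml and runner definition entries, and their exact layout is an assumption rather than the repository's schema.

```python
from typing import Any, Dict


def module_deployed(cluster: Dict[str, Any], module: str) -> bool:
    """A module is deployed only when it appears in the cluster's modules list."""
    return module in cluster.get("modules", [])


def numa_scheduling_active(cluster: Dict[str, Any], runner_def: Dict[str, Any]) -> bool:
    """NUMA-aware placement applies only when both opt-ins are present."""
    opted_in = runner_def.get("scheduler_name") == "numa-scheduler"
    return opted_in and module_deployed(cluster, "numa-scheduler")


# Hypothetical parsed config entries, for illustration only.
cluster = {"modules": ["nfd", "numa-scheduler"]}
default_runner = {"name": "cpu-runner"}
h100_runner = {"name": "h100-runner", "scheduler_name": "numa-scheduler"}

print(numa_scheduling_active(cluster, default_runner))  # module deployed, runner not opted in
print(numa_scheduling_active(cluster, h100_runner))     # both opt-ins present
```

A runner definition that names numa-scheduler on a cluster without the module would leave its pods Pending, since Kubernetes does not fall back to the default scheduler when the named one is absent.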
Includes smoke tests for both modules (namespace, Helm release,
DaemonSet/Deployment health, NRT CRD existence, scheduler pod
placement on base-infrastructure nodes).
Part of #696
Dry-run of both targets for helm upgrade:
Follow-Ups
1. ARC fork (jeanschmidt/actions-runner-controller): Add SchedulerName field to capacity config + placeholder pod builder. New env var CAPACITY_AWARE_SCHEDULER_NAME, defaults to empty (no behavior change). Publish as new chart version.
2. Deploy updated ARC controller:
3. Wire schedulerName into runner template (runner.yaml.tpl):
4. Activate for H100 runners.
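The "defaults to empty (no behavior change)" contract for CAPACITY_AWARE_SCHEDULER_NAME in the first follow-up can be sketched as below. The ARC fork itself is written in Go, so this Python version only models the intended behavior; the function name, container name, and image are hypothetical placeholders.

```python
import os
from typing import Any, Dict, Mapping


def build_placeholder_pod_spec(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build a minimal pod spec, adding schedulerName only when the env var is set."""
    spec: Dict[str, Any] = {
        # Hypothetical placeholder container, for illustration only.
        "containers": [{"name": "placeholder", "image": "registry.k8s.io/pause:3.9"}],
    }
    scheduler_name = env.get("CAPACITY_AWARE_SCHEDULER_NAME", "")
    if scheduler_name:
        # Omitting the field entirely keeps the default scheduler in charge.
        spec["schedulerName"] = scheduler_name
    return spec


print(build_placeholder_pod_spec({}))
print(build_placeholder_pod_spec({"CAPACITY_AWARE_SCHEDULER_NAME": "numa-scheduler"}))
```

Leaving the field out when the variable is empty, rather than writing an empty string, means existing deployments produce the same pod spec as before the change.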