[RFC] surface memory control knobs for nodepools, needed for static memory manager by georgehong · Pull Request #791 · pytorch/ci-infra

georgehong · 2026-06-17T22:36:48Z

(WIP) RFC

Context

Surfaces the ability for nodepools to enable memoryManagerPolicy: Static, which NUMA-aware scheduling needs. Every resource a runner requests (GPU, CPU, and memory) needs to be enumerated, and CPU is already treated this way in the node config. Without this, the scheduler returns "cannot align pod" and the pod stays in the Pending state.

Enabling the Memory Manager also requires specifying reservedMemory, which has to be precise to the byte because it must account for the memory the kubelet reserves. The syntax is as follows:

reservedMemory:
  - numaNode: 0 # NUMA node index
    limits:
      memory: "1Gi" # byte quantity
  - numaNode: 1
    limits:
      memory: "2Gi" # byte quantity

For our cases, memory limits would likely be symmetrical.
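As a rough illustration of the symmetrical case, the sketch below splits a total reservation evenly across NUMA nodes and returns entries in the reservedMemory shape shown above. The function name `symmetric_reserved_memory` and the 2048Mi total are hypothetical and chosen only for the example; they are not part of this PR.

```python
def symmetric_reserved_memory(total_mib: int, numa_nodes: int) -> list[dict]:
    """Split total_mib evenly across NUMA nodes (hypothetical helper, not from the PR)."""
    if numa_nodes <= 0:
        raise ValueError("numa_nodes must be positive")
    if total_mib % numa_nodes != 0:
        # An uneven split would need a rounding policy; refuse rather than guess.
        raise ValueError(
            f"{total_mib}Mi does not divide evenly across {numa_nodes} NUMA nodes"
        )
    per_node = total_mib // numa_nodes
    return [
        {"numaNode": node, "limits": {"memory": f"{per_node}Mi"}}
        for node in range(numa_nodes)
    ]


# Example with an illustrative total of 2048Mi over two NUMA nodes.
entries = symmetric_reserved_memory(2048, 2)
for entry in entries:
    print(entry)
```

Refusing uneven totals is a deliberate choice in this sketch: since the sum has to match to the byte, silently rounding per-node values could break the total.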

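Formula provided at https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/: the sum of reservedMemory across all NUMA nodes must equal kubeReserved memory + systemReserved memory + the evictionHard memory.available threshold.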

This can be worked out for AWS nodes. In the p4d/A100 example, the kubeReserved value shows up in multiple sources, including:

nodeadm/internal/kubelet/config.go::getMemoryMebibytesToReserve, which with 737 max pods gives 8362Mi (11Mi * 737 + 255Mi). We can add a bit of headroom by setting 8500Mi, and the remaining memory settings can also be constrained to look something like the following:

+++ b/osdc/modules/nodepools/defs/p4d.yaml
+      kube_reserved_memory: 8500Mi
+      system_reserved_memory: "0"
+      eviction_hard_memory_available: 100Mi
+      reserved_memory:
+        - numa_node: 0
+          memory: 4300Mi
+        - numa_node: 1
+          memory: 4300Mi
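The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch: the formula (11Mi per pod plus 255Mi) is the one quoted in this PR for the p4d case, and the function name `eks_kube_reserved_floor_mib` is hypothetical rather than anything exported by nodeadm.

```python
def eks_kube_reserved_floor_mib(max_pods: int) -> int:
    # Formula as quoted in this PR: 11 MiB per pod plus a 255 MiB base.
    return 11 * max_pods + 255


floor_mib = eks_kube_reserved_floor_mib(737)

# Pinned values from the p4d example above, in MiB.
kube_reserved_mib = 8500
system_reserved_mib = 0
eviction_hard_mib = 100

headroom_mib = kube_reserved_mib - floor_mib
reserved_total_mib = kube_reserved_mib + system_reserved_mib + eviction_hard_mib

print(f"floor={floor_mib}Mi headroom={headroom_mib}Mi total={reserved_total_mib}Mi")
```

The total is what the two reserved_memory entries have to add up to, which is where 4300Mi per NUMA node comes from.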

Max pods can also be deduced from running nodes and other configs, but it's important to avoid drift if the AMI changes or if nodeadm changes the merge behavior of some of these node config parameters. Guardrails can be added in the form of alerts, for example if the image suddenly changes and a node fails to start after X minutes.

Risks

Testing

RFC for team discussion: adds the MECHANISM only; no fleet is enabled. The p4d enablement (defs/p4d.yaml) is intentionally NOT included here so we can agree on the approach before flipping a scarce-capacity fleet.

## Why

Under topologyManagerPolicy=single-numa-node + scope=pod, scheduler-plugins NodeResourceTopologyMatch ANDs the per-NUMA bitmask of every requested native resource. Today GPU nodes run memoryManagerPolicy=None, so the kubelet never publishes per-NUMA memory into the NodeResourceTopology (NRT). A Guaranteed GPU pod that requests memory therefore ANDs an empty bitmask -> `cannot align pod`, permanently Pending. (This is "Bug #2" of the NUMA E2E; "Bug #1" = the NFD topology-updater snapshot race, tracked separately.)

Fix: enable kubelet Memory Manager Static on single-numa-node GPU fleets so memory becomes a NUMA-tracked resource the scheduler can align.

## What this commit adds (generate_nodepools.py)

- `_parse_mem_quantity()`: exact byte parsing of K8s quantities (Ki/Mi/Gi/Ti).
- `_memory_manager_block()`: emits the kubelet.config Memory Manager block into EC2NodeClass userData, ONLY when a def opts in, GATED to topology_manager_policy == single-numa-node (raises otherwise), and VALIDATES the kubelet boot-gate invariant at generation time:

      sum(reservedMemory) == kubeReserved.memory + systemReserved.memory + evictionHard.memory.available

  A mismatch fails `just test` / generation instead of bricking a node's boot (kubelet refuses to start if the sum is wrong -> fresh node never joins).
- Pass-through of the new per-def keys for fleet-format defs.
- TestMemoryManager (9 cases): gated emit, boot-gate validation, mixed units, misconfig rejection, fleet pass-through. Uses synthetic defs, so it needs no real-def change; the existing real-def round-trip still passes against the unmodified p4d.yaml.

No generated output changes (0 of 75 fleets opt in) until a def is enabled.

## What enablement WOULD look like (NOT in this commit; for discussion)

defs/p4d.yaml, under the p4d.24xlarge instance (fully-pinned variant so the boot-gate sum is immune to EKS maxPods/AMI formula drift):

    memory_manager_policy: Static
    kube_reserved_memory: 8500Mi   # EKS floor 8362 (=11Mi*737+255) + headroom
    system_reserved_memory: "0"
    eviction_hard_memory_available: 100Mi
    reserved_memory:               # must total 8500 + 0 + 100 = 8600Mi
      - { numa_node: 0, memory: 4300Mi }
      - { numa_node: 1, memory: 4300Mi }

Open questions for the team:

- Pinned (above) vs exact-match (mirror EKS's formula-derived kubeReserved and leave it owned by EKS)? Pinned is drift-proof but we own the number.
- Roll mechanics: memoryManagerPolicy is boot-only; GPU fleets have disruption_budget=0, so enabling requires a manual node roll per fleet.
- Per-fleet numbers differ (p5/p6 have different maxPods -> different kubeReserved).

## Validation (done on staging, fully-pinned p4d)

Rolled one fresh p4d in meta-staging-aws-ue1 and validated E2E with a real pytorch-canary GPU job: node reached Ready (boot gate correct), nodeadm deep-merge preserved EKS sibling reservations, NRT gained a memory zone, and the GPU workflow pod aligned cpu+memory+gpu via numa-scheduler with no `cannot align`. Negative tests (over-one-zone memory / GPU) were correctly refused.

Still TODO before any enablement ships: the Bug #1 fix (topology-updater ordering + GPU-aware wait-for-nrt.py) must land alongside it.
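For discussion, here is a minimal sketch of what byte-exact quantity parsing and the boot-gate check could look like. It is not the code in generate_nodepools.py: the names `parse_mem_quantity` and `check_boot_gate` are hypothetical, and the sketch assumes only plain integers and the Ki/Mi/Gi/Ti suffixes need to be supported.

```python
import re

# Binary suffixes only; a bare integer is treated as bytes (assumption).
_UNITS = {"": 1, "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_QUANTITY = re.compile(r"(\d+)(Ki|Mi|Gi|Ti)?")


def parse_mem_quantity(value) -> int:
    """Return the exact byte count for a quantity such as '8500Mi' or '0'."""
    text = str(value).strip()
    match = _QUANTITY.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or ""]


def check_boot_gate(kube_reserved, system_reserved, eviction_hard, reserved_memory) -> int:
    """Raise if sum(reserved_memory) differs from the three reservations combined."""
    expected = (
        parse_mem_quantity(kube_reserved)
        + parse_mem_quantity(system_reserved)
        + parse_mem_quantity(eviction_hard)
    )
    actual = sum(parse_mem_quantity(entry["memory"]) for entry in reserved_memory)
    if actual != expected:
        raise ValueError(
            f"reserved_memory totals {actual} bytes, expected {expected} bytes"
        )
    return actual


# The fully-pinned p4d values from the enablement example.
total_bytes = check_boot_gate(
    "8500Mi",
    "0",
    "100Mi",
    [{"numa_node": 0, "memory": "4300Mi"}, {"numa_node": 1, "memory": "4300Mi"}],
)
print(total_bytes)
```

Comparing in bytes rather than in a display unit keeps mixed-unit defs (for example 1Gi alongside 1024Mi) from tripping the check on formatting alone.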

github-actions · 2026-06-17T22:37:49Z