[RFC] surface memory control knobs for nodepools, needed for static memory manager#791
Draft
georgehong wants to merge 6 commits into
Draft
[RFC] surface memory control knobs for nodepools, needed for static memory manager#791georgehong wants to merge 6 commits into
georgehong wants to merge 6 commits into
Conversation
RFC for team discussion — adds the MECHANISM only; no fleet is enabled. The p4d enablement (defs/p4d.yaml) is intentionally NOT included here so we can agree on the approach before flipping a scarce-capacity fleet. ## Why Under topologyManagerPolicy=single-numa-node + scope=pod, scheduler-plugins NodeResourceTopologyMatch ANDs the per-NUMA bitmask of every requested native resource. Today GPU nodes run memoryManagerPolicy=None, so the kubelet never publishes per-NUMA memory into the NodeResourceTopology (NRT). A Guaranteed GPU pod that requests memory therefore ANDs an empty bitmask -> `cannot align pod`, permanently Pending. (This is "Bug #2" of the NUMA E2E; "Bug #1" = the NFD topology-updater snapshot race, tracked separately.) Fix: enable kubelet Memory Manager Static on single-numa-node GPU fleets so memory becomes a NUMA-tracked resource the scheduler can align. ## What this commit adds (generate_nodepools.py) - `_parse_mem_quantity()` — exact byte parsing of K8s quantities (Ki/Mi/Gi/Ti). - `_memory_manager_block()` — emits the kubelet.config Memory Manager block into EC2NodeClass userData, ONLY when a def opts in, GATED to topology_manager_policy == single-numa-node (raises otherwise), and VALIDATES the kubelet boot-gate invariant at generation time: sum(reservedMemory) == kubeReserved.memory + systemReserved.memory + evictionHard.memory.available A mismatch fails `just test` / generation instead of bricking a node's boot (kubelet refuses to start if the sum is wrong -> fresh node never joins). - Pass-through of the new per-def keys for fleet-format defs. - TestMemoryManager (9 cases): gated emit, boot-gate validation, mixed units, misconfig rejection, fleet pass-through. Uses synthetic defs, so it needs no real-def change; the existing real-def round-trip still passes against the unmodified p4d.yaml. No generated output changes (0 of 75 fleets opt in) until a def is enabled. ## What enablement WOULD look like (NOT in this commit — for discussion) defs/p4d.yaml, under the p4d.24xlarge instance (fully-pinned variant so the boot-gate sum is immune to EKS maxPods/AMI formula drift): memory_manager_policy: Static kube_reserved_memory: 8500Mi # EKS floor 8362 (=11Mi*737+255) + headroom system_reserved_memory: "0" eviction_hard_memory_available: 100Mi reserved_memory: # must total 8500 + 0 + 100 = 8600Mi - { numa_node: 0, memory: 4300Mi } - { numa_node: 1, memory: 4300Mi } Open questions for the team: - Pinned (above) vs exact-match (mirror EKS's formula-derived kubeReserved and leave it owned by EKS)? Pinned is drift-proof but we own the number. - Roll mechanics: memoryManagerPolicy is boot-only; GPU fleets have disruption_budget=0, so enabling requires a manual node roll per fleet. - Per-fleet numbers differ (p5/p6 have different maxPods -> different kubeReserved). ## Validation (done on staging, fully-pinned p4d) Rolled one fresh p4d in meta-staging-aws-ue1 and validated E2E with a real pytorch-canary GPU job: node reached Ready (boot gate correct), nodeadm deep-merge preserved EKS sibling reservations, NRT gained a memory zone, and the GPU workflow pod aligned cpu+memory+gpu via numa-scheduler with no `cannot align`. Negative tests (over-one-zone memory / GPU) were correctly refused. Still TODO before any enablement ships: the Bug #1 fix (topology-updater ordering + GPU-aware wait-for-nrt.py) must land alongside it.
tofu plan — arc-cbr-production✅ Plan succeeded · commit Plan output |
tofu plan — arc-cbr-production-uw1✅ Plan succeeded · commit Plan output |
tofu plan — meta-prod-aws-ue1✅ Plan succeeded · commit Plan output |
tofu plan — lf-prod-aws-ue1✅ Plan succeeded · commit Plan output |
tofu plan — lf-prod-aws-ue2✅ Plan succeeded · commit Plan output |
…ntities.py Addresses RFC review: the boot-gate validator added a 6th private copy of a K8s memory-quantity parser (`_parse_mem_quantity`) that was also untested and less accurate than the existing one (it rejected valid decimal-SI suffixes like `8G`/`500M`). Hoist the better implementation (arc-runners' `parse_memory_bytes`, which already handled binary + decimal SI) into the shared `scripts/python/` — where `generate_nodepools.py` already imports `instance_specs`/`nodepool_defs` from, so no cross-module dependency — and give it direct unit tests (`test_quantities.py`). Both `generate_nodepools.py` and `generate_runners.py` now import the single shared `parse_memory_bytes`; the local copies are gone. Net: one tested, decimal-SI-correct parser instead of two (and a path to retire the remaining copies in daemonset_overhead/analyze_node_utilization/ node-compactor as follow-up). The gate validator inherits real coverage.
…ession test Addresses RFC review: memoryManagerPolicy and topologyManagerPolicy are independent kubelet knobs, and the kubelet accepts Static under any topology policy. The previous hard requirement (Static => single-numa-node, else raise) conflated two settings the API keeps separate. - Static no longer requires single-numa-node. It still ALWAYS validates the reservedMemory boot gate (a real kubelet requirement, independent of topology). When enabled without single-numa-node it now WARNS (stderr) rather than raising — that combo is a likely misconfig (memory is NUMA-reserved but the scheduler won't align it) but is valid and the operator's call. - Add test_no_keys_leaves_config_untouched: a def WITHOUT any Memory Manager keys emits the exact pre-feature kubelet.config — exact key set, and no blank-line artifact at the empty-block injection point. Guards every existing nodepool def against drift from this feature. - Replace the hard-raise test with test_static_without_single_numa_warns_but_emits.
Addresses RFC review: kubeReserved / systemReserved / evictionHard are GENERAL
kubelet settings (they change node-allocatable memory and the eviction
threshold, and EKS already sets them by default) — they are NOT specific to
Memory Manager Static. The previous code only plumbed them through the Static
block and rejected them otherwise, conflating independent knobs.
Rework `_memory_manager_block` -> `_kubelet_memory_block`: each knob is emitted
into kubelet.config ONLY when the def specifies it, and propagates on its own:
- memory_manager_policy -> memoryManagerPolicy
- kube_reserved_memory -> kubeReserved.memory
- system_reserved_memory -> systemReserved.memory
- eviction_hard_memory_available -> evictionHard.memory.available
- reserved_memory -> reservedMemory (still Static-only — it is
consumed solely by the Memory Manager)
Boot gate: validated only when all three reservation terms are pinned alongside
Static (self-contained); if any is left to the EKS default — unknown at
generation — warn that the gate is unvalidated rather than guess. Static still
requires reserved_memory; reserved_memory without Static still errors.
Tests: independent-knob emission (kubeReserved/evictionHard/systemReserved
alone), "only specified blocks included", unpinned-terms warn-but-emit, Static-
without-reserved raises. Fully-pinned p4d output is byte-identical to before.
Addresses RFC review (function too large, mixed concerns). Matches the file's convention of dedicated validators (_validate_fleet, _validate_startup_taints_registry) kept separate from the thin YAML emitters. - _validate_kubelet_memory(nodepool_def, topology_policy): all the checks + warnings + the boot-gate sum (uses early-return to flatten nesting). - _kubelet_memory_block(nodepool_def): pure 1:1 translation — one block per key present, no validation. The three same-shaped knobs (kubeReserved/systemReserved/evictionHard) are driven by a small table (_KUBELET_MEMORY_RESERVATIONS); memoryManagerPolicy and reservedMemory are the two special shapes. generate_nodepool_yaml now calls validate-then-translate. No behavior change: generated output is byte-identical (verified on the fully-pinned p4d block) and all tests pass unchanged.
Add `just render-nodepool <def> [modules]` — renders a single nodepool def to stdout (generation only, no cluster required) so operators can preview the generated userData / kubelet config before deploying. Document the nodeadm deep-merge assumption in generate_nodepools.py: setting e.g. kubeReserved.memory relies on nodeadm preserving the EKS-default sibling keys (cpu, ephemeral-storage, nodefs.*). This was empirically validated on a real p4d (2026-06-17) but is an implementation detail, not a contractual guarantee. The comment describes the re-verification procedure and fallback mitigation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
(WIP) RFC
Context
Surfaces the ability for nodepools to enable
memoryManagerPolicy: Staticwhich NUMA-aware scheduling needs. Requested resources for a runner (GPU, CPU, and Memory) need to be enumerated, and CPU is already treated this way in the node config. Not having this means scheduler will returncannot align pod/pending state.Memory management is also coupled with needing the specification of reservedMemory, which has to be precise to the byte to account for pod overhead for kubelet. Syntax is given by the following source:
For our cases, memory limits would likely be symmetrical.
Formula provided at https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/:

This can be worked out for AWS nodes. In the p4d/A100 example, we can observe there are multiple sources where this is present:
max pods can also be deduced from running nodes and other configs, but it's important to avoid drift if AMI images change or ADM changes merge behavior of some of these node config parameters. Guardrails can be added in the form of alerts, if the image suddenly changes and node fails to start after X minutes.
Contents
reserved_memoryblock and associated parameters.justcommand for materializing the node config changes.Risks
Testing