From 6c183d57e735210170bf7001f27629b4310c9b61 Mon Sep 17 00:00:00 2001 From: George Hong Date: Wed, 17 Jun 2026 15:34:58 -0700 Subject: [PATCH 1/6] [RFC] nodepools: add kubelet Memory Manager Static support (#696) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit RFC for team discussion — adds the MECHANISM only; no fleet is enabled. The p4d enablement (defs/p4d.yaml) is intentionally NOT included here so we can agree on the approach before flipping a scarce-capacity fleet. ## Why Under topologyManagerPolicy=single-numa-node + scope=pod, scheduler-plugins NodeResourceTopologyMatch ANDs the per-NUMA bitmask of every requested native resource. Today GPU nodes run memoryManagerPolicy=None, so the kubelet never publishes per-NUMA memory into the NodeResourceTopology (NRT). A Guaranteed GPU pod that requests memory therefore ANDs an empty bitmask -> `cannot align pod`, permanently Pending. (This is "Bug #2" of the NUMA E2E; "Bug #1" = the NFD topology-updater snapshot race, tracked separately.) Fix: enable kubelet Memory Manager Static on single-numa-node GPU fleets so memory becomes a NUMA-tracked resource the scheduler can align. ## What this commit adds (generate_nodepools.py) - `_parse_mem_quantity()` — exact byte parsing of K8s quantities (Ki/Mi/Gi/Ti). - `_memory_manager_block()` — emits the kubelet.config Memory Manager block into EC2NodeClass userData, ONLY when a def opts in, GATED to topology_manager_policy == single-numa-node (raises otherwise), and VALIDATES the kubelet boot-gate invariant at generation time: sum(reservedMemory) == kubeReserved.memory + systemReserved.memory + evictionHard.memory.available A mismatch fails `just test` / generation instead of bricking a node's boot (kubelet refuses to start if the sum is wrong -> fresh node never joins). - Pass-through of the new per-def keys for fleet-format defs. - TestMemoryManager (9 cases): gated emit, boot-gate validation, mixed units, misconfig rejection, fleet pass-through. Uses synthetic defs, so it needs no real-def change; the existing real-def round-trip still passes against the unmodified p4d.yaml. No generated output changes (0 of 75 fleets opt in) until a def is enabled. ## What enablement WOULD look like (NOT in this commit — for discussion) defs/p4d.yaml, under the p4d.24xlarge instance (fully-pinned variant so the boot-gate sum is immune to EKS maxPods/AMI formula drift): memory_manager_policy: Static kube_reserved_memory: 8500Mi # EKS floor 8362 (=11Mi*737+255) + headroom system_reserved_memory: "0" eviction_hard_memory_available: 100Mi reserved_memory: # must total 8500 + 0 + 100 = 8600Mi - { numa_node: 0, memory: 4300Mi } - { numa_node: 1, memory: 4300Mi } Open questions for the team: - Pinned (above) vs exact-match (mirror EKS's formula-derived kubeReserved and leave it owned by EKS)? Pinned is drift-proof but we own the number. - Roll mechanics: memoryManagerPolicy is boot-only; GPU fleets have disruption_budget=0, so enabling requires a manual node roll per fleet. - Per-fleet numbers differ (p5/p6 have different maxPods -> different kubeReserved). ## Validation (done on staging, fully-pinned p4d) Rolled one fresh p4d in meta-staging-aws-ue1 and validated E2E with a real pytorch-canary GPU job: node reached Ready (boot gate correct), nodeadm deep-merge preserved EKS sibling reservations, NRT gained a memory zone, and the GPU workflow pod aligned cpu+memory+gpu via numa-scheduler with no `cannot align`. Negative tests (over-one-zone memory / GPU) were correctly refused. Still TODO before any enablement ships: the Bug #1 fix (topology-updater ordering + GPU-aware wait-for-nrt.py) must land alongside it. --- .../scripts/python/generate_nodepools.py | 121 +++++++++++++++- .../scripts/python/test_generate_nodepools.py | 133 ++++++++++++++++++ 2 files changed, 253 insertions(+), 1 deletion(-) diff --git a/osdc/modules/nodepools/scripts/python/generate_nodepools.py b/osdc/modules/nodepools/scripts/python/generate_nodepools.py index e3208bf0..29e6b1d7 100755 --- a/osdc/modules/nodepools/scripts/python/generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/generate_nodepools.py @@ -194,6 +194,110 @@ def _user_data_script_mime_part(indented_script): """ +def _parse_mem_quantity(value): + """Parse a Kubernetes memory quantity into an exact integer byte count. + + Supports binary suffixes (Ki/Mi/Gi/Ti) and a bare integer (bytes). The + mantissa must be a whole number — fractional quantities (e.g. "4.5Gi") are + rejected so reservation math stays exact. Used to validate the Memory + Manager Static boot gate at generation time, in bytes, so a typo in the + per-NUMA split can never reach a node and stop the kubelet from starting. + """ + s = str(value).strip() + for suffix, mult in (("Ki", 1024), ("Mi", 1024**2), ("Gi", 1024**3), ("Ti", 1024**4)): + if s.endswith(suffix): + return int(s[: -len(suffix)]) * mult + return int(s) # bare integer = bytes + + +def _memory_manager_block(nodepool_def, topology_policy): + """Build the kubelet Memory Manager (Static) config block, or "". + + Returns the indented YAML lines (10-space base, for the ``kubelet.config`` + map) that pin ``memoryManagerPolicy`` plus every reservation that feeds the + Memory Manager boot gate, or an empty string when the def doesn't opt in. + + Memory Manager Static surfaces per-NUMA memory into the NodeResourceTopology + so single-numa-node alignment can match native ``memory`` requests (without + it, the scheduler ANDs an empty bitmask and returns "cannot align pod"). The + kubelet enforces a hard invariant at boot: + + sum(reservedMemory) == kubeReserved + systemReserved + evictionHard + + or it refuses to start. We therefore pin all three right-hand terms in the + def so the sum is a self-contained constant immune to EKS AMI / maxPods + formula drift, and validate the equality here — a mismatch fails generation + instead of bricking a fresh node's boot. + """ + name = nodepool_def["name"] + policy = nodepool_def.get("memory_manager_policy") + reservation_keys = ( + "kube_reserved_memory", + "system_reserved_memory", + "eviction_hard_memory_available", + "reserved_memory", + ) + + if not policy: + # Reservation overrides only make sense alongside Static — flag stragglers. + if any(nodepool_def.get(k) is not None for k in reservation_keys): + raise ValueError( + f"{name}: {', '.join(reservation_keys)} are only valid when memory_manager_policy: Static is set." + ) + return "" + + if policy != "Static": + raise ValueError( + f"{name}: memory_manager_policy must be 'Static' (got '{policy}'); " + f"None is the kubelet default and needs no override." + ) + if topology_policy != "single-numa-node": + raise ValueError( + f"{name}: memory_manager_policy: Static requires topology_manager_policy: " + f"single-numa-node (got '{topology_policy}'). Memory Manager Static is " + f"only needed to surface per-NUMA memory for single-numa-node alignment." + ) + + kube_reserved = nodepool_def.get("kube_reserved_memory") + system_reserved = nodepool_def.get("system_reserved_memory", "0") + eviction_hard = nodepool_def.get("eviction_hard_memory_available") + reserved_memory = nodepool_def.get("reserved_memory") + if not kube_reserved or not eviction_hard or not reserved_memory: + raise ValueError( + f"{name}: memory_manager_policy: Static requires kube_reserved_memory, " + f"eviction_hard_memory_available, and reserved_memory to be set " + f"(the Memory Manager boot gate sums them)." + ) + + gate = ( + _parse_mem_quantity(kube_reserved) + _parse_mem_quantity(system_reserved) + _parse_mem_quantity(eviction_hard) + ) + reserved_total = sum(_parse_mem_quantity(z["memory"]) for z in reserved_memory) + if reserved_total != gate: + raise ValueError( + f"{name}: reserved_memory sum ({reserved_total} bytes) must equal " + f"kube_reserved_memory + system_reserved_memory + " + f"eviction_hard_memory_available ({gate} bytes), or the kubelet will " + f"refuse to start. Adjust the per-NUMA split to total the reservation sum." + ) + + zones = "".join( + f" - numaNode: {z['numa_node']}\n limits:\n memory: {z['memory']}\n" + for z in reserved_memory + ) + return ( + f" memoryManagerPolicy: {policy}\n" + f" kubeReserved:\n" + f" memory: {kube_reserved}\n" + f" systemReserved:\n" + f' memory: "{system_reserved}"\n' + f" evictionHard:\n" + f" memory.available: {eviction_hard}\n" + f" reservedMemory:\n" + f"{zones}" + ) + + def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): """Generate a combined NodePool + EC2NodeClass YAML string.""" name = nodepool_def["name"] @@ -211,6 +315,10 @@ def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): topology_policy = nodepool_def.get("topology_manager_policy", "best-effort") topology_scope = nodepool_def.get("topology_manager_scope", "container") + # Per-def kubelet Memory Manager (Static) — gated to single-numa-node fleets. + # Validates the boot-gate reservation invariant; raises on misconfiguration. + mem_manager_block = _memory_manager_block(nodepool_def, topology_policy) + # Read optional user data script for embedding as a MIME part indented_userdata = _read_user_data_script(user_data_script_path, defs_dir) if defs_dir else None @@ -441,6 +549,7 @@ def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): topologyManagerPolicy: {topology_policy} topologyManagerScope: {topology_scope} {" topologyManagerPolicyOptions:" + chr(10) + ' prefer-closest-numa-nodes: "true"' + chr(10) if topology_policy in ("restricted", "best-effort") else ""}\ +{mem_manager_block}\ containerLogMaxSize: 50Mi containerLogMaxFiles: 5 {_user_data_script_mime_part(indented_userdata)} @@ -552,7 +661,17 @@ def _build_fleet_nodepool_def(fleet_data, inst, name_suffix="", extra_labels=Non # Only set optional keys when explicitly provided — leaving them absent # lets generate_nodepool_yaml() fall through to its own defaults. - for key in ("node_compactor", "topology_manager_policy", "topology_manager_scope", "user_data_script"): + for key in ( + "node_compactor", + "topology_manager_policy", + "topology_manager_scope", + "user_data_script", + "memory_manager_policy", + "kube_reserved_memory", + "system_reserved_memory", + "eviction_hard_memory_available", + "reserved_memory", + ): val = inst.get(key) if val is not None: nodepool_def[key] = val diff --git a/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py b/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py index e5e8700e..07e64b16 100644 --- a/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py @@ -28,6 +28,22 @@ def parse_all_yaml(text: str) -> list[dict]: return [doc for doc in yaml.safe_load_all(text) if doc is not None] +def _extract_node_config(userdata: str) -> dict: + """Pull the embedded EKS NodeConfig YAML out of the userData MIME blob. + + Lets tests assert the kubelet config as a parsed structure (so YAML + indentation, not just substring presence, is validated). + """ + lines = userdata.splitlines() + start = next(i for i, ln in enumerate(lines) if ln.strip().startswith("apiVersion: node.eks.aws")) + body = [] + for ln in lines[start:]: + if ln.strip().startswith("--==BOUNDARY=="): + break + body.append(ln) + return yaml.safe_load("\n".join(body)) + + def _make_nodepool_def(**overrides) -> dict: """Build a minimal nodepool def dict with sensible defaults.""" base = { @@ -782,6 +798,123 @@ def test_optional_fields_default_correctly(self): assert "topology_manager_scope" not in result assert "user_data_script" not in result assert "node_compactor" not in result + assert "memory_manager_policy" not in result + assert "reserved_memory" not in result + + +class TestMemoryManager: + """kubelet Memory Manager (Static) — the #696 Bug #2 fix for single-numa-node. + + Static surfaces per-NUMA memory into the NRT so the scheduler can align + native ``memory`` requests; it also imposes a hard boot gate + (sum(reservedMemory) == kubeReserved + systemReserved + evictionHard) that + we validate at generation time. + """ + + def _static_def(self, **overrides) -> dict: + base = { + "topology_manager_policy": "single-numa-node", + "topology_manager_scope": "pod", + "memory_manager_policy": "Static", + "kube_reserved_memory": "8500Mi", + "system_reserved_memory": "0", + "eviction_hard_memory_available": "100Mi", + "reserved_memory": [ + {"numa_node": 0, "memory": "4300Mi"}, + {"numa_node": 1, "memory": "4300Mi"}, + ], + } + base.update(overrides) + return _make_nodepool_def(**base) + + def test_absent_by_default(self): + """No memory_manager_policy → no Memory Manager keys in userData.""" + output = generate_nodepool_yaml(_make_nodepool_def(), "nodepools") + userdata = parse_all_yaml(output)[1]["spec"]["userData"] + assert "memoryManagerPolicy" not in userdata + assert "reservedMemory" not in userdata + assert "kubeReserved" not in userdata + + def test_static_emits_valid_config(self): + """Static emits a well-formed kubelet.config with all pinned reservations.""" + output = generate_nodepool_yaml(self._static_def(), "nodepools") + userdata = parse_all_yaml(output)[1]["spec"]["userData"] + cfg = _extract_node_config(userdata)["spec"]["kubelet"]["config"] + assert cfg["memoryManagerPolicy"] == "Static" + assert cfg["kubeReserved"]["memory"] == "8500Mi" + assert cfg["systemReserved"]["memory"] == "0" + assert cfg["evictionHard"]["memory.available"] == "100Mi" + assert cfg["reservedMemory"] == [ + {"numaNode": 0, "limits": {"memory": "4300Mi"}}, + {"numaNode": 1, "limits": {"memory": "4300Mi"}}, + ] + # topology knobs and log rotation must still be present and parseable + assert cfg["topologyManagerPolicy"] == "single-numa-node" + assert cfg["containerLogMaxSize"] == "50Mi" + + def test_requires_single_numa(self): + """Static on a non-single-numa-node def is rejected.""" + bad = self._static_def(topology_manager_policy="best-effort", topology_manager_scope="container") + with pytest.raises(ValueError, match="single-numa-node"): + generate_nodepool_yaml(bad, "nodepools") + + def test_gate_sum_mismatch_raises(self): + """reserved_memory that doesn't total the reservation sum fails generation.""" + bad = self._static_def( + reserved_memory=[ + {"numa_node": 0, "memory": "4300Mi"}, + {"numa_node": 1, "memory": "4000Mi"}, # 8300 != 8600 + ] + ) + with pytest.raises(ValueError, match="must equal"): + generate_nodepool_yaml(bad, "nodepools") + + def test_gate_accepts_mixed_units(self): + """Gate math is exact across unit suffixes (8500Mi+0+100Mi == 8600Mi == ~8.39Gi+...).""" + # 8600Mi expressed as a single zone in Mi must validate against Mi reservations. + ok = self._static_def( + reserved_memory=[ + {"numa_node": 0, "memory": "4Gi"}, # 4096Mi + {"numa_node": 1, "memory": "4504Mi"}, # 4096 + 4504 = 8600Mi + ] + ) + output = generate_nodepool_yaml(ok, "nodepools") # must not raise + assert "memoryManagerPolicy: Static" in parse_all_yaml(output)[1]["spec"]["userData"] + + def test_invalid_policy_value_raises(self): + """memory_manager_policy must be exactly 'Static'.""" + bad = self._static_def(memory_manager_policy="None") + with pytest.raises(ValueError, match="must be 'Static'"): + generate_nodepool_yaml(bad, "nodepools") + + def test_reservations_without_policy_raise(self): + """Reservation keys without memory_manager_policy are a misconfig.""" + bad = _make_nodepool_def(reserved_memory=[{"numa_node": 0, "memory": "100Mi"}]) + with pytest.raises(ValueError, match="only valid when"): + generate_nodepool_yaml(bad, "nodepools") + + def test_fleet_passthrough(self): + """_build_fleet_nodepool_def propagates the Memory Manager keys.""" + fleet = {"name": "p4d", "arch": "amd64", "gpu": True} + inst = { + "type": "p4d.24xlarge", + "weight": 100, + "node_disk_size": 1000, + "topology_manager_policy": "single-numa-node", + "topology_manager_scope": "pod", + "memory_manager_policy": "Static", + "kube_reserved_memory": "8500Mi", + "system_reserved_memory": "0", + "eviction_hard_memory_available": "100Mi", + "reserved_memory": [ + {"numa_node": 0, "memory": "4300Mi"}, + {"numa_node": 1, "memory": "4300Mi"}, + ], + } + result = _build_fleet_nodepool_def(fleet, inst) + assert result["memory_manager_policy"] == "Static" + assert result["kube_reserved_memory"] == "8500Mi" + assert result["reserved_memory"][1]["numa_node"] == 1 # ============================================================================ From 2b7f14dd0f18151148528aa91b59bdac2f607126 Mon Sep 17 00:00:00 2001 From: George Hong Date: Wed, 17 Jun 2026 16:06:00 -0700 Subject: [PATCH 2/6] nodepools/arc-runners: consolidate K8s memory parsing into shared quantities.py MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses RFC review: the boot-gate validator added a 6th private copy of a K8s memory-quantity parser (`_parse_mem_quantity`) that was also untested and less accurate than the existing one (it rejected valid decimal-SI suffixes like `8G`/`500M`). Hoist the better implementation (arc-runners' `parse_memory_bytes`, which already handled binary + decimal SI) into the shared `scripts/python/` — where `generate_nodepools.py` already imports `instance_specs`/`nodepool_defs` from, so no cross-module dependency — and give it direct unit tests (`test_quantities.py`). Both `generate_nodepools.py` and `generate_runners.py` now import the single shared `parse_memory_bytes`; the local copies are gone. Net: one tested, decimal-SI-correct parser instead of two (and a path to retire the remaining copies in daemonset_overhead/analyze_node_utilization/ node-compactor as follow-up). The gate validator inherits real coverage. --- .../scripts/python/generate_runners.py | 37 +------------ .../scripts/python/generate_nodepools.py | 23 ++------ osdc/scripts/python/quantities.py | 53 +++++++++++++++++++ osdc/scripts/python/test_quantities.py | 46 ++++++++++++++++ 4 files changed, 103 insertions(+), 56 deletions(-) create mode 100644 osdc/scripts/python/quantities.py create mode 100644 osdc/scripts/python/test_quantities.py diff --git a/osdc/modules/arc-runners/scripts/python/generate_runners.py b/osdc/modules/arc-runners/scripts/python/generate_runners.py index 85b31c61..5914389f 100755 --- a/osdc/modules/arc-runners/scripts/python/generate_runners.py +++ b/osdc/modules/arc-runners/scripts/python/generate_runners.py @@ -33,6 +33,7 @@ from conditional_blocks import strip_conditional_block # noqa: E402 from fleet_naming import derive_fleet_name # noqa: E402 from nodepool_defs import load_excluded_instance_types # noqa: E402 +from quantities import parse_memory_bytes # noqa: E402 from runner_fleet_validator import validate_cluster_runner_fleets # noqa: E402 # ANSI colors @@ -59,42 +60,6 @@ def normalize_name(name): return name.replace(".", "-").replace("_", "-") -# Kubernetes resource quantity suffixes → multiplier (bytes) -_K8S_MEMORY_SUFFIXES = { - "Ki": 1024, - "Mi": 1024**2, - "Gi": 1024**3, - "Ti": 1024**4, - "K": 1000, - "M": 1000**2, - "G": 1000**3, - "T": 1000**4, -} - - -def parse_memory_bytes(memory_str): - """Convert a Kubernetes memory quantity string to bytes. - - Supports binary (Ki, Mi, Gi, Ti) and decimal (K, M, G, T) suffixes, - as well as plain integer strings (already in bytes). - - >>> parse_memory_bytes("115Gi") - 123480309760 - >>> parse_memory_bytes("512Mi") - 536870912 - >>> parse_memory_bytes("1024") - 1024 - """ - s = str(memory_str) - # Try two-char suffix first (Ki, Mi, Gi, Ti), then one-char (K, M, G, T) - for suffix_len in (2, 1): - if len(s) > suffix_len: - suffix = s[-suffix_len:] - if suffix in _K8S_MEMORY_SUFFIXES: - return int(s[:-suffix_len]) * _K8S_MEMORY_SUFFIXES[suffix] - return int(s) - - def load_clusters_yaml(repo_root): """Load clusters.yaml from the repository root.""" config_path = repo_root / "clusters.yaml" diff --git a/osdc/modules/nodepools/scripts/python/generate_nodepools.py b/osdc/modules/nodepools/scripts/python/generate_nodepools.py index 29e6b1d7..5f25157f 100755 --- a/osdc/modules/nodepools/scripts/python/generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/generate_nodepools.py @@ -34,6 +34,7 @@ from instance_specs import INSTANCE_SPECS # noqa: E402 from nodepool_defs import is_excluded_for_region as _is_excluded_for_region # noqa: E402 +from quantities import parse_memory_bytes # noqa: E402 # List of startup taint entries. Each entry is a dict with: # - ``key``, ``value``, ``effect`` (all str) — the Karpenter startupTaint to emit @@ -194,22 +195,6 @@ def _user_data_script_mime_part(indented_script): """ -def _parse_mem_quantity(value): - """Parse a Kubernetes memory quantity into an exact integer byte count. - - Supports binary suffixes (Ki/Mi/Gi/Ti) and a bare integer (bytes). The - mantissa must be a whole number — fractional quantities (e.g. "4.5Gi") are - rejected so reservation math stays exact. Used to validate the Memory - Manager Static boot gate at generation time, in bytes, so a typo in the - per-NUMA split can never reach a node and stop the kubelet from starting. - """ - s = str(value).strip() - for suffix, mult in (("Ki", 1024), ("Mi", 1024**2), ("Gi", 1024**3), ("Ti", 1024**4)): - if s.endswith(suffix): - return int(s[: -len(suffix)]) * mult - return int(s) # bare integer = bytes - - def _memory_manager_block(nodepool_def, topology_policy): """Build the kubelet Memory Manager (Static) config block, or "". @@ -269,10 +254,8 @@ def so the sum is a self-contained constant immune to EKS AMI / maxPods f"(the Memory Manager boot gate sums them)." ) - gate = ( - _parse_mem_quantity(kube_reserved) + _parse_mem_quantity(system_reserved) + _parse_mem_quantity(eviction_hard) - ) - reserved_total = sum(_parse_mem_quantity(z["memory"]) for z in reserved_memory) + gate = parse_memory_bytes(kube_reserved) + parse_memory_bytes(system_reserved) + parse_memory_bytes(eviction_hard) + reserved_total = sum(parse_memory_bytes(z["memory"]) for z in reserved_memory) if reserved_total != gate: raise ValueError( f"{name}: reserved_memory sum ({reserved_total} bytes) must equal " diff --git a/osdc/scripts/python/quantities.py b/osdc/scripts/python/quantities.py new file mode 100644 index 00000000..25674f62 --- /dev/null +++ b/osdc/scripts/python/quantities.py @@ -0,0 +1,53 @@ +"""Shared parsing of Kubernetes resource quantities. + +Canonical home for converting Kubernetes quantity strings to numbers, so the +several generators that need it (nodepools, arc-runners, ...) share one tested +implementation instead of each carrying a copy. + +Currently provides memory parsing; CPU/other quantity helpers can join here. +""" + +from __future__ import annotations + +# Kubernetes resource quantity suffixes → multiplier (bytes). +# Binary (Ki/Mi/Gi/Ti = powers of 1024) and decimal SI (K/M/G/T = powers of +# 1000) are both valid Kubernetes quantity forms. +_K8S_MEMORY_SUFFIXES = { + "Ki": 1024, + "Mi": 1024**2, + "Gi": 1024**3, + "Ti": 1024**4, + "K": 1000, + "M": 1000**2, + "G": 1000**3, + "T": 1000**4, +} + + +def parse_memory_bytes(memory_str) -> int: + """Convert a Kubernetes memory quantity to an exact integer byte count. + + Supports binary (Ki, Mi, Gi, Ti) and decimal (K, M, G, T) suffixes, plus a + plain integer (already bytes). The mantissa must be a whole number — + fractional quantities (e.g. "4.5Gi") raise ValueError so reservation math + stays exact rather than silently truncating. + + >>> parse_memory_bytes("115Gi") + 123480309760 + >>> parse_memory_bytes("512Mi") + 536870912 + >>> parse_memory_bytes("500M") + 500000000 + >>> parse_memory_bytes("1024") + 1024 + >>> parse_memory_bytes(0) + 0 + """ + s = str(memory_str).strip() + # Try two-char suffix first (Ki, Mi, Gi, Ti), then one-char (K, M, G, T). + for suffix_len in (2, 1): + if len(s) > suffix_len: + suffix = s[-suffix_len:] + if suffix in _K8S_MEMORY_SUFFIXES: + return int(s[:-suffix_len]) * _K8S_MEMORY_SUFFIXES[suffix] + return int(s) # bare integer = bytes diff --git a/osdc/scripts/python/test_quantities.py b/osdc/scripts/python/test_quantities.py new file mode 100644 index 00000000..c4382fa1 --- /dev/null +++ b/osdc/scripts/python/test_quantities.py @@ -0,0 +1,46 @@ +"""Unit tests for quantities.py — Kubernetes quantity parsing.""" + +import pytest +from quantities import parse_memory_bytes + + +class TestParseMemoryBytes: + def test_binary_suffixes(self): + assert parse_memory_bytes("256Ki") == 256 * 1024 + assert parse_memory_bytes("512Mi") == 512 * 1024**2 + assert parse_memory_bytes("115Gi") == 115 * 1024**3 + assert parse_memory_bytes("1Ti") == 1 * 1024**4 + + def test_decimal_si_suffixes(self): + assert parse_memory_bytes("500K") == 500 * 1000 + assert parse_memory_bytes("500M") == 500 * 1000**2 + assert parse_memory_bytes("10G") == 10 * 1000**3 + assert parse_memory_bytes("2T") == 2 * 1000**4 + + def test_bare_integer_is_bytes(self): + assert parse_memory_bytes("1024") == 1024 + assert parse_memory_bytes(1024) == 1024 + + def test_zero(self): + assert parse_memory_bytes("0") == 0 + assert parse_memory_bytes(0) == 0 + + def test_whitespace_is_tolerated(self): + assert parse_memory_bytes(" 8500Mi ") == 8500 * 1024**2 + + def test_boot_gate_values_round_trip(self): + # The exact p4d boot-gate numbers: 8500Mi + 0 + 100Mi == 4300Mi + 4300Mi + lhs = parse_memory_bytes("8500Mi") + parse_memory_bytes("0") + parse_memory_bytes("100Mi") + rhs = parse_memory_bytes("4300Mi") + parse_memory_bytes("4300Mi") + assert lhs == rhs == 8600 * 1024**2 + + def test_mixed_units_compare_exactly(self): + # 4Gi + 4504Mi == 8600Mi (binary), and Gi != Mi is not conflated + assert parse_memory_bytes("4Gi") + parse_memory_bytes("4504Mi") == parse_memory_bytes("8600Mi") + + @pytest.mark.parametrize("bad", ["4.5Gi", "1.5Mi", "1e3Mi", "abc", "Mi", ""]) + def test_fractional_or_garbage_raises(self, bad): + # Fractional mantissa and non-numeric input must fail loudly, never + # silently truncate — reservation math depends on exactness. + with pytest.raises(ValueError, match="invalid literal"): + parse_memory_bytes(bad) From 5597ceac9b68a70057656703e436833f110497b0 Mon Sep 17 00:00:00 2001 From: George Hong Date: Wed, 17 Jun 2026 16:37:19 -0700 Subject: [PATCH 3/6] nodepools: decouple memory_manager_policy from topology policy + regression test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses RFC review: memoryManagerPolicy and topologyManagerPolicy are independent kubelet knobs, and the kubelet accepts Static under any topology policy. The previous hard requirement (Static => single-numa-node, else raise) conflated two settings the API keeps separate. - Static no longer requires single-numa-node. It still ALWAYS validates the reservedMemory boot gate (a real kubelet requirement, independent of topology). When enabled without single-numa-node it now WARNS (stderr) rather than raising — that combo is a likely misconfig (memory is NUMA-reserved but the scheduler won't align it) but is valid and the operator's call. - Add test_no_keys_leaves_config_untouched: a def WITHOUT any Memory Manager keys emits the exact pre-feature kubelet.config — exact key set, and no blank-line artifact at the empty-block injection point. Guards every existing nodepool def against drift from this feature. - Replace the hard-raise test with test_static_without_single_numa_warns_but_emits. --- .../scripts/python/generate_nodepools.py | 24 +++++++++---- .../scripts/python/test_generate_nodepools.py | 36 ++++++++++++++++--- 2 files changed, 49 insertions(+), 11 deletions(-) diff --git a/osdc/modules/nodepools/scripts/python/generate_nodepools.py b/osdc/modules/nodepools/scripts/python/generate_nodepools.py index 5f25157f..f36854a6 100755 --- a/osdc/modules/nodepools/scripts/python/generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/generate_nodepools.py @@ -119,6 +119,7 @@ def _validate_startup_taints_registry(modules_root: Path) -> None: # ANSI colors GREEN = "\033[0;32m" +YELLOW = "\033[0;33m" RED = "\033[0;31m" NC = "\033[0m" @@ -127,6 +128,10 @@ def log_info(msg): print(f"{GREEN}\u2192{NC} {msg}") +def log_warning(msg): + print(f"{YELLOW}\u26a0{NC} {msg}", file=sys.stderr) + + def log_error(msg): print(f"{RED}\u2717{NC} {msg}") @@ -236,11 +241,17 @@ def so the sum is a self-contained constant immune to EKS AMI / maxPods f"{name}: memory_manager_policy must be 'Static' (got '{policy}'); " f"None is the kubelet default and needs no override." ) + # memoryManagerPolicy is a kubelet knob independent of topologyManagerPolicy — + # the kubelet accepts Static under any topology policy, and the reservedMemory + # boot gate below applies regardless. But Static only yields scheduling benefit + # when Topology Manager actually aligns resources (single-numa-node), so warn + # (don't block) if it's enabled elsewhere — likely a misconfiguration that + # takes on boot-gate risk for no gain. if topology_policy != "single-numa-node": - raise ValueError( - f"{name}: memory_manager_policy: Static requires topology_manager_policy: " - f"single-numa-node (got '{topology_policy}'). Memory Manager Static is " - f"only needed to surface per-NUMA memory for single-numa-node alignment." + log_warning( + f"{name}: memory_manager_policy: Static with topology_manager_policy=" + f"'{topology_policy}' (not single-numa-node) — memory will be NUMA-reserved " + f"but the scheduler won't align it; Static only helps under single-numa-node." ) kube_reserved = nodepool_def.get("kube_reserved_memory") @@ -298,8 +309,9 @@ def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): topology_policy = nodepool_def.get("topology_manager_policy", "best-effort") topology_scope = nodepool_def.get("topology_manager_scope", "container") - # Per-def kubelet Memory Manager (Static) — gated to single-numa-node fleets. - # Validates the boot-gate reservation invariant; raises on misconfiguration. + # Per-def kubelet Memory Manager (Static) — independent of the topology policy. + # Validates the boot-gate reservation invariant (raises on a bad sum); warns, + # but does not block, if enabled without single-numa-node. mem_manager_block = _memory_manager_block(nodepool_def, topology_policy) # Read optional user data script for embedding as a MIME part diff --git a/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py b/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py index 07e64b16..304670c0 100644 --- a/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py @@ -835,6 +835,28 @@ def test_absent_by_default(self): assert "reservedMemory" not in userdata assert "kubeReserved" not in userdata + def test_no_keys_leaves_config_untouched(self): + """A def without Memory Manager keys emits the exact pre-feature + kubelet.config — the empty block must inject NOTHING: no stray keys and + no blank-line artifact at the injection point. This guards every existing + nodepool def against drift from this feature.""" + d = _make_nodepool_def(topology_manager_policy="single-numa-node", topology_manager_scope="pod") + userdata = parse_all_yaml(generate_nodepool_yaml(d, "nodepools"))[1]["spec"]["userData"] + cfg = _extract_node_config(userdata)["spec"]["kubelet"]["config"] + # Exactly the pre-feature key set — nothing leaked from the Memory Manager path. + assert set(cfg) == { + "cpuManagerPolicy", + "topologyManagerPolicy", + "topologyManagerScope", + "containerLogMaxSize", + "containerLogMaxFiles", + } + # The empty block must not introduce a blank line where it would be injected + # (single-numa-node emits no topologyManagerPolicyOptions, so scope is + # immediately followed by the log-rotation keys). The parsed userData block + # scalar is dedented by 4 spaces, so config keys sit at 6-space indent. + assert "topologyManagerScope: pod\n containerLogMaxSize: 50Mi" in userdata + def test_static_emits_valid_config(self): """Static emits a well-formed kubelet.config with all pinned reservations.""" output = generate_nodepool_yaml(self._static_def(), "nodepools") @@ -852,11 +874,15 @@ def test_static_emits_valid_config(self): assert cfg["topologyManagerPolicy"] == "single-numa-node" assert cfg["containerLogMaxSize"] == "50Mi" - def test_requires_single_numa(self): - """Static on a non-single-numa-node def is rejected.""" - bad = self._static_def(topology_manager_policy="best-effort", topology_manager_scope="container") - with pytest.raises(ValueError, match="single-numa-node"): - generate_nodepool_yaml(bad, "nodepools") + def test_static_without_single_numa_warns_but_emits(self, capsys): + """Static is independent of the topology policy: it still emits and still + validates the boot gate, but warns (does not block) when not paired with + single-numa-node, since alignment only happens under single-numa-node.""" + d = self._static_def(topology_manager_policy="best-effort", topology_manager_scope="container") + output = generate_nodepool_yaml(d, "nodepools") # must NOT raise + userdata = parse_all_yaml(output)[1]["spec"]["userData"] + assert "memoryManagerPolicy: Static" in userdata + assert "single-numa-node" in capsys.readouterr().err # warning emitted to stderr def test_gate_sum_mismatch_raises(self): """reserved_memory that doesn't total the reservation sum fails generation.""" From 6dec70078c29ce91dabf50f53869ec0e14bf90b5 Mon Sep 17 00:00:00 2001 From: George Hong Date: Wed, 17 Jun 2026 16:52:33 -0700 Subject: [PATCH 4/6] nodepools: make kubelet memory knobs independently settable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses RFC review: kubeReserved / systemReserved / evictionHard are GENERAL kubelet settings (they change node-allocatable memory and the eviction threshold, and EKS already sets them by default) — they are NOT specific to Memory Manager Static. The previous code only plumbed them through the Static block and rejected them otherwise, conflating independent knobs. Rework `_memory_manager_block` -> `_kubelet_memory_block`: each knob is emitted into kubelet.config ONLY when the def specifies it, and propagates on its own: - memory_manager_policy -> memoryManagerPolicy - kube_reserved_memory -> kubeReserved.memory - system_reserved_memory -> systemReserved.memory - eviction_hard_memory_available -> evictionHard.memory.available - reserved_memory -> reservedMemory (still Static-only — it is consumed solely by the Memory Manager) Boot gate: validated only when all three reservation terms are pinned alongside Static (self-contained); if any is left to the EKS default — unknown at generation — warn that the gate is unvalidated rather than guess. Static still requires reserved_memory; reserved_memory without Static still errors. Tests: independent-knob emission (kubeReserved/evictionHard/systemReserved alone), "only specified blocks included", unpinned-terms warn-but-emit, Static- without-reserved raises. Fully-pinned p4d output is byte-identical to before. --- .../scripts/python/generate_nodepools.py | 178 ++++++++++-------- .../scripts/python/test_generate_nodepools.py | 94 ++++++++- 2 files changed, 181 insertions(+), 91 deletions(-) diff --git a/osdc/modules/nodepools/scripts/python/generate_nodepools.py b/osdc/modules/nodepools/scripts/python/generate_nodepools.py index f36854a6..a38c394c 100755 --- a/osdc/modules/nodepools/scripts/python/generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/generate_nodepools.py @@ -200,96 +200,108 @@ def _user_data_script_mime_part(indented_script): """ -def _memory_manager_block(nodepool_def, topology_policy): - """Build the kubelet Memory Manager (Static) config block, or "". - - Returns the indented YAML lines (10-space base, for the ``kubelet.config`` - map) that pin ``memoryManagerPolicy`` plus every reservation that feeds the - Memory Manager boot gate, or an empty string when the def doesn't opt in. - - Memory Manager Static surfaces per-NUMA memory into the NodeResourceTopology - so single-numa-node alignment can match native ``memory`` requests (without - it, the scheduler ANDs an empty bitmask and returns "cannot align pod"). The - kubelet enforces a hard invariant at boot: - - sum(reservedMemory) == kubeReserved + systemReserved + evictionHard - - or it refuses to start. We therefore pin all three right-hand terms in the - def so the sum is a self-contained constant immune to EKS AMI / maxPods - formula drift, and validate the equality here — a mismatch fails generation - instead of bricking a fresh node's boot. +def _kubelet_memory_block(nodepool_def, topology_policy): + """Build the kubelet memory-reservation config block from per-def knobs, or "". + + Each knob is independent and is emitted into ``kubelet.config`` ONLY when the + def specifies it: + - ``memory_manager_policy`` -> memoryManagerPolicy + - ``kube_reserved_memory`` -> kubeReserved.memory + - ``system_reserved_memory`` -> systemReserved.memory + - ``eviction_hard_memory_available`` -> evictionHard.memory.available + - ``reserved_memory`` -> reservedMemory (per-NUMA list) + + kubeReserved / systemReserved / evictionHard are GENERAL kubelet settings + that take effect on their own (they change node-allocatable memory and the + eviction threshold); nodeadm deep-merges each onto the EKS defaults, + preserving sibling keys (cpu, ephemeral-storage, nodefs.*). reservedMemory + and the boot-gate invariant are specific to Memory Manager Static: + + sum(reservedMemory) == kubeReserved.memory + systemReserved.memory + + evictionHard.memory.available + + or the kubelet refuses to start. We validate that equality here ONLY when all + three right-hand terms are pinned in the def (so it's self-contained); if any + is left to the EKS default — unknown at generation time — we warn that the + gate is unvalidated rather than guess. """ name = nodepool_def["name"] policy = nodepool_def.get("memory_manager_policy") - reservation_keys = ( - "kube_reserved_memory", - "system_reserved_memory", - "eviction_hard_memory_available", - "reserved_memory", - ) - - if not policy: - # Reservation overrides only make sense alongside Static — flag stragglers. - if any(nodepool_def.get(k) is not None for k in reservation_keys): - raise ValueError( - f"{name}: {', '.join(reservation_keys)} are only valid when memory_manager_policy: Static is set." - ) - return "" + kube_reserved = nodepool_def.get("kube_reserved_memory") + system_reserved = nodepool_def.get("system_reserved_memory") + eviction_hard = nodepool_def.get("eviction_hard_memory_available") + reserved_memory = nodepool_def.get("reserved_memory") - if policy != "Static": + # ----- validation ----- + if policy is not None and policy != "Static": raise ValueError( f"{name}: memory_manager_policy must be 'Static' (got '{policy}'); " f"None is the kubelet default and needs no override." ) - # memoryManagerPolicy is a kubelet knob independent of topologyManagerPolicy — - # the kubelet accepts Static under any topology policy, and the reservedMemory - # boot gate below applies regardless. But Static only yields scheduling benefit - # when Topology Manager actually aligns resources (single-numa-node), so warn - # (don't block) if it's enabled elsewhere — likely a misconfiguration that - # takes on boot-gate risk for no gain. - if topology_policy != "single-numa-node": - log_warning( - f"{name}: memory_manager_policy: Static with topology_manager_policy=" - f"'{topology_policy}' (not single-numa-node) — memory will be NUMA-reserved " - f"but the scheduler won't align it; Static only helps under single-numa-node." - ) - - kube_reserved = nodepool_def.get("kube_reserved_memory") - system_reserved = nodepool_def.get("system_reserved_memory", "0") - eviction_hard = nodepool_def.get("eviction_hard_memory_available") - reserved_memory = nodepool_def.get("reserved_memory") - if not kube_reserved or not eviction_hard or not reserved_memory: + # reservedMemory is consumed only by the Memory Manager — meaningless without Static. + if reserved_memory is not None and policy != "Static": raise ValueError( - f"{name}: memory_manager_policy: Static requires kube_reserved_memory, " - f"eviction_hard_memory_available, and reserved_memory to be set " - f"(the Memory Manager boot gate sums them)." + f"{name}: reserved_memory requires memory_manager_policy: Static " + f"(reservedMemory is consumed only by the kubelet Memory Manager)." ) - gate = parse_memory_bytes(kube_reserved) + parse_memory_bytes(system_reserved) + parse_memory_bytes(eviction_hard) - reserved_total = sum(parse_memory_bytes(z["memory"]) for z in reserved_memory) - if reserved_total != gate: - raise ValueError( - f"{name}: reserved_memory sum ({reserved_total} bytes) must equal " - f"kube_reserved_memory + system_reserved_memory + " - f"eviction_hard_memory_available ({gate} bytes), or the kubelet will " - f"refuse to start. Adjust the per-NUMA split to total the reservation sum." - ) + if policy == "Static": + if not reserved_memory: + raise ValueError( + f"{name}: memory_manager_policy: Static requires reserved_memory " + f"(the per-NUMA reservations the Memory Manager pins)." + ) + # Static is independent of the topology policy, but only yields scheduling + # benefit under single-numa-node — warn (don't block) otherwise. + if topology_policy != "single-numa-node": + log_warning( + f"{name}: memory_manager_policy: Static with topology_manager_policy=" + f"'{topology_policy}' (not single-numa-node) — memory will be NUMA-reserved " + f"but the scheduler won't align it; Static only helps under single-numa-node." + ) + # Boot gate: validate the sum only when all three terms are pinned, so the + # equality is self-contained. Otherwise the unset term(s) fall back to EKS + # defaults we can't see at generation time. + if kube_reserved is not None and system_reserved is not None and eviction_hard is not None: + gate = ( + parse_memory_bytes(kube_reserved) + + parse_memory_bytes(system_reserved) + + parse_memory_bytes(eviction_hard) + ) + reserved_total = sum(parse_memory_bytes(z["memory"]) for z in reserved_memory) + if reserved_total != gate: + raise ValueError( + f"{name}: reserved_memory sum ({reserved_total} bytes) must equal " + f"kube_reserved_memory + system_reserved_memory + " + f"eviction_hard_memory_available ({gate} bytes), or the kubelet will " + f"refuse to start. Adjust the per-NUMA split to total the reservation sum." + ) + else: + log_warning( + f"{name}: memory_manager_policy: Static — boot gate NOT validated at generation " + f"because kube_reserved_memory/system_reserved_memory/eviction_hard_memory_available " + f"are not all pinned. The kubelet requires sum(reservedMemory) == kubeReserved + " + f"systemReserved + evictionHard (EKS defaults apply where unset) or it will not " + f"boot — pin all three to validate here." + ) - zones = "".join( - f" - numaNode: {z['numa_node']}\n limits:\n memory: {z['memory']}\n" - for z in reserved_memory - ) - return ( - f" memoryManagerPolicy: {policy}\n" - f" kubeReserved:\n" - f" memory: {kube_reserved}\n" - f" systemReserved:\n" - f' memory: "{system_reserved}"\n' - f" evictionHard:\n" - f" memory.available: {eviction_hard}\n" - f" reservedMemory:\n" - f"{zones}" - ) + # ----- emission: include each block only when its key is specified ----- + parts = [] + if policy is not None: + parts.append(f" memoryManagerPolicy: {policy}\n") + if kube_reserved is not None: + parts.append(f" kubeReserved:\n memory: {kube_reserved}\n") + if system_reserved is not None: + parts.append(f' systemReserved:\n memory: "{system_reserved}"\n') + if eviction_hard is not None: + parts.append(f" evictionHard:\n memory.available: {eviction_hard}\n") + if reserved_memory is not None: + zones = "".join( + f" - numaNode: {z['numa_node']}\n limits:\n memory: {z['memory']}\n" + for z in reserved_memory + ) + parts.append(f" reservedMemory:\n{zones}") + return "".join(parts) def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): @@ -309,10 +321,10 @@ def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): topology_policy = nodepool_def.get("topology_manager_policy", "best-effort") topology_scope = nodepool_def.get("topology_manager_scope", "container") - # Per-def kubelet Memory Manager (Static) — independent of the topology policy. - # Validates the boot-gate reservation invariant (raises on a bad sum); warns, - # but does not block, if enabled without single-numa-node. - mem_manager_block = _memory_manager_block(nodepool_def, topology_policy) + # Per-def kubelet memory knobs (kubeReserved / systemReserved / evictionHard / + # memoryManagerPolicy / reservedMemory). Each is emitted only when specified; + # validates the Memory Manager Static boot gate when all terms are pinned. + kubelet_memory_block = _kubelet_memory_block(nodepool_def, topology_policy) # Read optional user data script for embedding as a MIME part indented_userdata = _read_user_data_script(user_data_script_path, defs_dir) if defs_dir else None @@ -544,7 +556,7 @@ def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): topologyManagerPolicy: {topology_policy} topologyManagerScope: {topology_scope} {" topologyManagerPolicyOptions:" + chr(10) + ' prefer-closest-numa-nodes: "true"' + chr(10) if topology_policy in ("restricted", "best-effort") else ""}\ -{mem_manager_block}\ +{kubelet_memory_block}\ containerLogMaxSize: 50Mi containerLogMaxFiles: 5 {_user_data_script_mime_part(indented_userdata)} diff --git a/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py b/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py index 304670c0..4c330b85 100644 --- a/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/test_generate_nodepools.py @@ -803,12 +803,13 @@ def test_optional_fields_default_correctly(self): class TestMemoryManager: - """kubelet Memory Manager (Static) — the #696 Bug #2 fix for single-numa-node. + """kubelet memory knobs (#696 Bug #2) — independently settable per def. - Static surfaces per-NUMA memory into the NRT so the scheduler can align - native ``memory`` requests; it also imposes a hard boot gate - (sum(reservedMemory) == kubeReserved + systemReserved + evictionHard) that - we validate at generation time. + kubeReserved / systemReserved / evictionHard are general kubelet settings + emitted on their own; memoryManagerPolicy: Static + reservedMemory are the + Memory Manager pieces. When all three reservation terms are pinned alongside + Static, the boot gate (sum(reservedMemory) == kubeReserved + systemReserved + + evictionHard) is validated at generation time. """ def _static_def(self, **overrides) -> dict: @@ -913,10 +914,87 @@ def test_invalid_policy_value_raises(self): with pytest.raises(ValueError, match="must be 'Static'"): generate_nodepool_yaml(bad, "nodepools") - def test_reservations_without_policy_raise(self): - """Reservation keys without memory_manager_policy are a misconfig.""" + def test_reserved_memory_without_policy_raises(self): + """reservedMemory is Memory-Manager-only — invalid without Static.""" bad = _make_nodepool_def(reserved_memory=[{"numa_node": 0, "memory": "100Mi"}]) - with pytest.raises(ValueError, match="only valid when"): + with pytest.raises(ValueError, match="requires memory_manager_policy: Static"): + generate_nodepool_yaml(bad, "nodepools") + + def test_kube_reserved_emits_independently(self): + """kubeReserved is a general kubelet knob — settable on its own, no Static.""" + d = _make_nodepool_def(kube_reserved_memory="9Gi") + cfg = _extract_node_config(parse_all_yaml(generate_nodepool_yaml(d, "nodepools"))[1]["spec"]["userData"])[ + "spec" + ]["kubelet"]["config"] + assert cfg["kubeReserved"]["memory"] == "9Gi" + # nothing else from the memory path leaked + for k in ("memoryManagerPolicy", "reservedMemory", "systemReserved", "evictionHard"): + assert k not in cfg + + def test_eviction_hard_emits_independently(self): + """evictionHard.memory.available is settable on its own, no Static.""" + d = _make_nodepool_def(eviction_hard_memory_available="250Mi") + cfg = _extract_node_config(parse_all_yaml(generate_nodepool_yaml(d, "nodepools"))[1]["spec"]["userData"])[ + "spec" + ]["kubelet"]["config"] + assert cfg["evictionHard"]["memory.available"] == "250Mi" + for k in ("memoryManagerPolicy", "reservedMemory", "kubeReserved", "systemReserved"): + assert k not in cfg + + def test_only_specified_blocks_included(self): + """Each knob is emitted only when present — a lone systemReserved override + adds exactly that block and nothing else.""" + d = _make_nodepool_def( + topology_manager_policy="single-numa-node", + topology_manager_scope="pod", + system_reserved_memory="512Mi", + ) + cfg = _extract_node_config(parse_all_yaml(generate_nodepool_yaml(d, "nodepools"))[1]["spec"]["userData"])[ + "spec" + ]["kubelet"]["config"] + assert cfg["systemReserved"]["memory"] == "512Mi" + assert set(cfg) == { + "cpuManagerPolicy", + "topologyManagerPolicy", + "topologyManagerScope", + "systemReserved", + "containerLogMaxSize", + "containerLogMaxFiles", + } + + def test_static_unpinned_terms_warn_but_emit(self, capsys): + """Static + reserved_memory but reservation terms NOT all pinned: emits + (reservedMemory only), and warns the boot gate can't be validated at + generation (the unset terms fall back to EKS defaults).""" + d = _make_nodepool_def( + topology_manager_policy="single-numa-node", + topology_manager_scope="pod", + memory_manager_policy="Static", + reserved_memory=[ + {"numa_node": 0, "memory": "4Gi"}, + {"numa_node": 1, "memory": "4Gi"}, + ], + ) + cfg = _extract_node_config(parse_all_yaml(generate_nodepool_yaml(d, "nodepools"))[1]["spec"]["userData"])[ + "spec" + ]["kubelet"]["config"] + assert cfg["memoryManagerPolicy"] == "Static" + assert cfg["reservedMemory"] == [ + {"numaNode": 0, "limits": {"memory": "4Gi"}}, + {"numaNode": 1, "limits": {"memory": "4Gi"}}, + ] + # unpinned terms are NOT emitted (left to EKS defaults) + assert "kubeReserved" not in cfg + assert "boot gate" in capsys.readouterr().err.lower() + + def test_static_without_reserved_memory_raises(self): + """Static requires reserved_memory (the Memory Manager needs per-NUMA pins).""" + bad = _make_nodepool_def( + topology_manager_policy="single-numa-node", + topology_manager_scope="pod", + memory_manager_policy="Static", + ) + with pytest.raises(ValueError, match="requires reserved_memory"): generate_nodepool_yaml(bad, "nodepools") def test_fleet_passthrough(self): From 0d4def1051a727163986f0b1ac4be76d0b915ee8 Mon Sep 17 00:00:00 2001 From: George Hong Date: Wed, 17 Jun 2026 17:11:50 -0700 Subject: [PATCH 5/6] nodepools: split kubelet memory validation from the 1:1 emitter MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses RFC review (function too large, mixed concerns). Matches the file's convention of dedicated validators (_validate_fleet, _validate_startup_taints_registry) kept separate from the thin YAML emitters. - _validate_kubelet_memory(nodepool_def, topology_policy): all the checks + warnings + the boot-gate sum (uses early-return to flatten nesting). - _kubelet_memory_block(nodepool_def): pure 1:1 translation — one block per key present, no validation. The three same-shaped knobs (kubeReserved/systemReserved/evictionHard) are driven by a small table (_KUBELET_MEMORY_RESERVATIONS); memoryManagerPolicy and reservedMemory are the two special shapes. generate_nodepool_yaml now calls validate-then-translate. No behavior change: generated output is byte-identical (verified on the fully-pinned p4d block) and all tests pass unchanged. --- .../scripts/python/generate_nodepools.py | 156 +++++++++--------- 1 file changed, 80 insertions(+), 76 deletions(-) diff --git a/osdc/modules/nodepools/scripts/python/generate_nodepools.py b/osdc/modules/nodepools/scripts/python/generate_nodepools.py index a38c394c..21633217 100755 --- a/osdc/modules/nodepools/scripts/python/generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/generate_nodepools.py @@ -200,101 +200,104 @@ def _user_data_script_mime_part(indented_script): """ -def _kubelet_memory_block(nodepool_def, topology_policy): - """Build the kubelet memory-reservation config block from per-def knobs, or "". - - Each knob is independent and is emitted into ``kubelet.config`` ONLY when the - def specifies it: - - ``memory_manager_policy`` -> memoryManagerPolicy - - ``kube_reserved_memory`` -> kubeReserved.memory - - ``system_reserved_memory`` -> systemReserved.memory - - ``eviction_hard_memory_available`` -> evictionHard.memory.available - - ``reserved_memory`` -> reservedMemory (per-NUMA list) - - kubeReserved / systemReserved / evictionHard are GENERAL kubelet settings - that take effect on their own (they change node-allocatable memory and the - eviction threshold); nodeadm deep-merges each onto the EKS defaults, - preserving sibling keys (cpu, ephemeral-storage, nodefs.*). reservedMemory - and the boot-gate invariant are specific to Memory Manager Static: - - sum(reservedMemory) == kubeReserved.memory + systemReserved.memory - + evictionHard.memory.available - - or the kubelet refuses to start. We validate that equality here ONLY when all - three right-hand terms are pinned in the def (so it's self-contained); if any - is left to the EKS default — unknown at generation time — we warn that the - gate is unvalidated rather than guess. +# Per-def kubelet memory knobs → (kubelet.config field, sub-key, quote-value?). +# These three share the shape ``:\n : `` and are GENERAL +# kubelet settings (they change node-allocatable memory / the eviction threshold, +# and take effect on their own). nodeadm deep-merges each onto the EKS defaults, +# preserving sibling keys (cpu, ephemeral-storage, nodefs.*). +_KUBELET_MEMORY_RESERVATIONS = ( + ("kube_reserved_memory", "kubeReserved", "memory", False), + ("system_reserved_memory", "systemReserved", "memory", True), # quote so "0" stays a string + ("eviction_hard_memory_available", "evictionHard", "memory.available", False), +) + + +def _validate_kubelet_memory(nodepool_def, topology_policy): + """Validate the per-def kubelet memory knobs (see ``_kubelet_memory_block``). + + The general knobs (kubeReserved/systemReserved/evictionHard) need no + cross-checks — they stand alone. The Memory Manager pieces do: + - ``reserved_memory`` is consumed only by the Memory Manager → requires Static; + - Static requires ``reserved_memory``; + - with Static AND all three reservation terms pinned, the kubelet boot gate + ``sum(reservedMemory) == kubeReserved + systemReserved + evictionHard`` is + checked here (a bad sum stops the kubelet from starting). If a term is left + to the EKS default — unknown at generation time — warn instead of guess. + Also warns when Static is used without single-numa-node (no scheduling gain). """ name = nodepool_def["name"] policy = nodepool_def.get("memory_manager_policy") - kube_reserved = nodepool_def.get("kube_reserved_memory") - system_reserved = nodepool_def.get("system_reserved_memory") - eviction_hard = nodepool_def.get("eviction_hard_memory_available") reserved_memory = nodepool_def.get("reserved_memory") - # ----- validation ----- if policy is not None and policy != "Static": raise ValueError( f"{name}: memory_manager_policy must be 'Static' (got '{policy}'); " f"None is the kubelet default and needs no override." ) - # reservedMemory is consumed only by the Memory Manager — meaningless without Static. if reserved_memory is not None and policy != "Static": raise ValueError( f"{name}: reserved_memory requires memory_manager_policy: Static " f"(reservedMemory is consumed only by the kubelet Memory Manager)." ) + if policy != "Static": + return # general knobs (if any) stand on their own — nothing more to check - if policy == "Static": - if not reserved_memory: + if not reserved_memory: + raise ValueError( + f"{name}: memory_manager_policy: Static requires reserved_memory " + f"(the per-NUMA reservations the Memory Manager pins)." + ) + if topology_policy != "single-numa-node": + log_warning( + f"{name}: memory_manager_policy: Static with topology_manager_policy=" + f"'{topology_policy}' (not single-numa-node) — memory will be NUMA-reserved " + f"but the scheduler won't align it; Static only helps under single-numa-node." + ) + + terms = [nodepool_def.get(key) for key, *_ in _KUBELET_MEMORY_RESERVATIONS] + if all(t is not None for t in terms): + gate = sum(parse_memory_bytes(t) for t in terms) + reserved_total = sum(parse_memory_bytes(z["memory"]) for z in reserved_memory) + if reserved_total != gate: raise ValueError( - f"{name}: memory_manager_policy: Static requires reserved_memory " - f"(the per-NUMA reservations the Memory Manager pins)." - ) - # Static is independent of the topology policy, but only yields scheduling - # benefit under single-numa-node — warn (don't block) otherwise. - if topology_policy != "single-numa-node": - log_warning( - f"{name}: memory_manager_policy: Static with topology_manager_policy=" - f"'{topology_policy}' (not single-numa-node) — memory will be NUMA-reserved " - f"but the scheduler won't align it; Static only helps under single-numa-node." - ) - # Boot gate: validate the sum only when all three terms are pinned, so the - # equality is self-contained. Otherwise the unset term(s) fall back to EKS - # defaults we can't see at generation time. - if kube_reserved is not None and system_reserved is not None and eviction_hard is not None: - gate = ( - parse_memory_bytes(kube_reserved) - + parse_memory_bytes(system_reserved) - + parse_memory_bytes(eviction_hard) - ) - reserved_total = sum(parse_memory_bytes(z["memory"]) for z in reserved_memory) - if reserved_total != gate: - raise ValueError( - f"{name}: reserved_memory sum ({reserved_total} bytes) must equal " - f"kube_reserved_memory + system_reserved_memory + " - f"eviction_hard_memory_available ({gate} bytes), or the kubelet will " - f"refuse to start. Adjust the per-NUMA split to total the reservation sum." - ) - else: - log_warning( - f"{name}: memory_manager_policy: Static — boot gate NOT validated at generation " - f"because kube_reserved_memory/system_reserved_memory/eviction_hard_memory_available " - f"are not all pinned. The kubelet requires sum(reservedMemory) == kubeReserved + " - f"systemReserved + evictionHard (EKS defaults apply where unset) or it will not " - f"boot — pin all three to validate here." + f"{name}: reserved_memory sum ({reserved_total} bytes) must equal " + f"kube_reserved_memory + system_reserved_memory + " + f"eviction_hard_memory_available ({gate} bytes), or the kubelet will " + f"refuse to start. Adjust the per-NUMA split to total the reservation sum." ) + else: + log_warning( + f"{name}: memory_manager_policy: Static — boot gate NOT validated at generation " + f"because kube_reserved_memory/system_reserved_memory/eviction_hard_memory_available " + f"are not all pinned. The kubelet requires sum(reservedMemory) == kubeReserved + " + f"systemReserved + evictionHard (EKS defaults apply where unset) or it will not " + f"boot — pin all three to validate here." + ) - # ----- emission: include each block only when its key is specified ----- + +def _kubelet_memory_block(nodepool_def): + """Translate the per-def kubelet memory knobs to ``kubelet.config`` YAML lines + (10-space base), one block per key present, or "" when none are set. + + Pure translation — call ``_validate_kubelet_memory`` first for the constraints. + - ``memory_manager_policy`` -> memoryManagerPolicy + - ``kube_reserved_memory`` -> kubeReserved.memory + - ``system_reserved_memory`` -> systemReserved.memory + - ``eviction_hard_memory_available`` -> evictionHard.memory.available + - ``reserved_memory`` -> reservedMemory (per-NUMA list) + """ parts = [] + policy = nodepool_def.get("memory_manager_policy") if policy is not None: parts.append(f" memoryManagerPolicy: {policy}\n") - if kube_reserved is not None: - parts.append(f" kubeReserved:\n memory: {kube_reserved}\n") - if system_reserved is not None: - parts.append(f' systemReserved:\n memory: "{system_reserved}"\n') - if eviction_hard is not None: - parts.append(f" evictionHard:\n memory.available: {eviction_hard}\n") + + for key, field, subkey, quote in _KUBELET_MEMORY_RESERVATIONS: + value = nodepool_def.get(key) + if value is not None: + rendered = f'"{value}"' if quote else value + parts.append(f" {field}:\n {subkey}: {rendered}\n") + + reserved_memory = nodepool_def.get("reserved_memory") if reserved_memory is not None: zones = "".join( f" - numaNode: {z['numa_node']}\n limits:\n memory: {z['memory']}\n" @@ -322,9 +325,10 @@ def generate_nodepool_yaml(nodepool_def, module_name, defs_dir=None): topology_scope = nodepool_def.get("topology_manager_scope", "container") # Per-def kubelet memory knobs (kubeReserved / systemReserved / evictionHard / - # memoryManagerPolicy / reservedMemory). Each is emitted only when specified; - # validates the Memory Manager Static boot gate when all terms are pinned. - kubelet_memory_block = _kubelet_memory_block(nodepool_def, topology_policy) + # memoryManagerPolicy / reservedMemory): validate the constraints, then translate + # each set key 1:1 into kubelet.config. Each block is emitted only when present. + _validate_kubelet_memory(nodepool_def, topology_policy) + kubelet_memory_block = _kubelet_memory_block(nodepool_def) # Read optional user data script for embedding as a MIME part indented_userdata = _read_user_data_script(user_data_script_path, defs_dir) if defs_dir else None From 9dd6eec42ef59e1403486d9628c57758a7b10cf3 Mon Sep 17 00:00:00 2001 From: George Hong Date: Wed, 17 Jun 2026 17:30:45 -0700 Subject: [PATCH 6/6] nodepools: add render-nodepool recipe + document nodeadm deep-merge risk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add `just render-nodepool [modules]` — renders a single nodepool def to stdout (generation only, no cluster required) so operators can preview the generated userData / kubelet config before deploying. Document the nodeadm deep-merge assumption in generate_nodepools.py: setting e.g. kubeReserved.memory relies on nodeadm preserving the EKS-default sibling keys (cpu, ephemeral-storage, nodefs.*). This was empirically validated on a real p4d (2026-06-17) but is an implementation detail, not a contractual guarantee. The comment describes the re-verification procedure and fallback mitigation. --- osdc/justfile | 32 +++++++++++++++++++ .../scripts/python/generate_nodepools.py | 12 +++++++ 2 files changed, 44 insertions(+) diff --git a/osdc/justfile b/osdc/justfile index e2ba0963..2ef2ae8b 100644 --- a/osdc/justfile +++ b/osdc/justfile @@ -2262,6 +2262,38 @@ generate-arc-runners cluster: fi echo "Generated ARC runner configs in ${OUTPUT_DIR} (${elapsed})" +# Render a single nodepool def to stdout (generation only, no cluster required). +# Useful for previewing the generated userData / kubelet config before deploying. +# just render-nodepool p4d # renders defs/p4d.yaml +# just render-nodepool p4d nfd # with nfd startup taint enabled +render-nodepool def *modules: + #!/usr/bin/env bash + set -euo pipefail + source "{{UPSTREAM}}/scripts/mise-activate.sh" + DEF="{{def}}" + MODULES="{{modules}}" + MODULE_DIR="{{UPSTREAM}}/modules/nodepools" + DEFS_DIR="$MODULE_DIR/defs" + DEF_FILE="$DEFS_DIR/${DEF}.yaml" + if [[ ! -f "$DEF_FILE" ]]; then + echo "ERROR: $DEF_FILE not found" >&2 + echo "Available defs:" >&2 + ls "$DEFS_DIR"/*.yaml 2>/dev/null | xargs -n1 basename | sed 's/\.yaml$//' | sed 's/^/ /' >&2 + exit 1 + fi + TMPDEF=$(mktemp -d) + TMPOUT=$(mktemp -d) + trap 'rm -rf "$TMPDEF" "$TMPOUT"' EXIT + cp "$DEF_FILE" "$TMPDEF/" + NODEPOOLS_DEFS_DIR="$TMPDEF" \ + NODEPOOLS_OUTPUT_DIR="$TMPOUT" \ + NODEPOOLS_ENABLED_MODULES="$MODULES" \ + uv run "$MODULE_DIR/scripts/python/generate_nodepools.py" >&2 + for f in "$TMPOUT"/*.yaml; do + echo "--- # $(basename "$f")" + cat "$f" + done + # Monte Carlo cluster simulation for PyTorch CI load simulate-cluster *args: @export OSDC_ROOT="{{ROOT}}"; \ diff --git a/osdc/modules/nodepools/scripts/python/generate_nodepools.py b/osdc/modules/nodepools/scripts/python/generate_nodepools.py index 21633217..9b1b6107 100755 --- a/osdc/modules/nodepools/scripts/python/generate_nodepools.py +++ b/osdc/modules/nodepools/scripts/python/generate_nodepools.py @@ -205,6 +205,18 @@ def _user_data_script_mime_part(indented_script): # kubelet settings (they change node-allocatable memory / the eviction threshold, # and take effect on their own). nodeadm deep-merges each onto the EKS defaults, # preserving sibling keys (cpu, ephemeral-storage, nodefs.*). +# +# NODEADM DEEP-MERGE ASSUMPTION: setting e.g. kubeReserved.memory preserves the +# EKS-default kubeReserved.cpu and kubeReserved.ephemeral-storage (likewise +# evictionHard.memory.available preserves nodefs.available and nodefs.inodesFree). +# Validated empirically on EKS AL2023 AMI (2026-06-17, p4d ip-10-20-26-82). +# This is a nodeadm implementation detail, NOT a contractual API guarantee — a +# future AMI/nodeadm version could change to replace-merge. Re-verify sibling +# preservation on every AMI update by checking ``kubectl get --raw +# /api/v1/nodes//proxy/configz`` on the first rolled node. If siblings +# are ever dropped, the mitigation is to emit all sibling keys explicitly here +# (pin kubeReserved.cpu / ephemeral-storage / evictionHard.nodefs.* alongside +# the memory keys). _KUBELET_MEMORY_RESERVATIONS = ( ("kube_reserved_memory", "kubeReserved", "memory", False), ("system_reserved_memory", "systemReserved", "memory", True), # quote so "0" stays a string