docs: design document for multi-node consolidation#10
docs: design document for multi-node consolidation#10jamesmt-aws wants to merge 5 commits intomainfrom
Conversation
Document the intent and behavioral contract of multi-node consolidation as it exists today. Covers candidate selection, ordering, move evaluation, cost comparison, budget interaction, and known limitations. This is a baseline design doc. Future changes to multi-node consolidation behavior should start as revisions to this document.
Address feedback: add terminology section, explain eligibility criteria rationale, clarify cost model (offering prices, not historical costs), investigate ordering intent (multi-node before single-node), note combinatorial complexity of subset enumeration (Bell numbers), apply Orwell's rules throughout.
Reference kubernetes-sigs/karpenter issues kubernetes-sigs#1202, kubernetes-sigs#1442, kubernetes-sigs#1653, kubernetes-sigs#1733, kubernetes-sigs#1970, kubernetes-sigs#2434, kubernetes-sigs#2814, kubernetes-sigs#2819 and aws/karpenter-provider-aws#5944 in the known limitations section.
jamesmt-aws
left a comment
There was a problem hiding this comment.
Design Doc Review
Tenet Compliance
The doc is generally strong on intent over mechanism, but has several tenet violations:
-
Function/field name references violate "Designs capture intent, not mechanism":
consolidation.ShouldDisrupt()(Candidate Selection section) -- describe the eligibility check without naming the function.markConsolidated()andIsConsolidated()(The Consolidated Mark section) -- these are implementation details; describe the caching behavior without naming the methods.lastConsolidationStatefield name (The Consolidated Mark section) -- describe the in-memory timestamp without naming the field.nodeclaim/disruption/consolidation.gofile path (Candidate Selection, third bullet) -- remove the file path reference; describe the behavior.
-
Commit hash reference: "The original commit introducing multi-node consolidation (978daf3, Nov 2022)" in the Interaction with Single-Node Consolidation section. Commit hashes are mechanism, not intent. Rephrase as historical context without the hash if needed at all.
-
O(log N) complexity reference in the Prefix-Only Evaluation limitation. This is implementation detail -- the design should state that binary search makes it tractable, not quote the complexity class. (Minor violation since it appears in the Limitations section as justification.)
No violations of "Reconciliation over perfection" or "Designs as legible contracts" -- the doc is honest about gaps and an operator can predict behavior from what is written.
Accuracy
The doc is accurate against the codebase with the following notes:
-
Omission: Validation step. After finding a valid move, multi-node consolidation waits 15 seconds (
consolidationTTL) and then re-validates the command by re-checking candidate eligibility and re-running the scheduling simulation. If validation fails, the move is rejected and an event is emitted. This is significant operator-visible behavior -- a consolidation move that was found valid can be silently dropped if the cluster changes during the validation window. The doc does not mention this at all. -
Omission: Budget reduction to one candidate. The doc says "Multi-node consolidation requires at least two candidates" but does not explicitly state what happens when budget filtering reduces the candidate set to one or zero. The code returns a no-op in this case. An operator setting tight budgets should understand that multi-node consolidation may be silently disabled as a result.
-
Minor inaccuracy: "cheapest compatible offering" for replacement pricing. In the Cost Comparison section, the doc says "The replacement's cost is the price of the cheapest offering among its eligible instance types." This is correct for the cost comparison filter, but the actual replacement that launches may not get the cheapest offering (especially for spot). The doc could clarify that the cost comparison uses cheapest-offering pricing as a lower bound, while the actual launch price may differ.
-
Accuracy confirmed on all other points:
- Disruption method ordering (Emptiness, StaticDrift, Drift, Multi, Single) matches
NewMethods(). - Binary search with min=1, max=len-1, 100-cap, 1-minute timeout -- all match code.
- Empty node filtering rationale matches the code's
len(candidate.reschedulablePods) == 0check. - Budget-constrained skip preventing consolidated mark matches
if !constrainedByBudgets { m.markConsolidated() }. - Same-type filtering logic matches
filterOutSameInstanceType. - 5-minute consolidation state reset confirmed in
cluster.go:550-563. - Spot-to-spot multi-node behavior (no 15-type minimum) confirmed.
- Pair search fallback is correctly absent (not upstream).
- Disruption method ordering (Emptiness, StaticDrift, Drift, Multi, Single) matches
Completeness
The doc covers all expected sections thoroughly. Two gaps:
-
Missing: Validation/re-check step. As noted under Accuracy, the 15-second validation window is a distinct behavioral phase that the doc omits entirely. This should be its own subsection or added to Move Evaluation. An operator debugging "why did consolidation find a move but not execute it?" needs to know about this.
-
The
consolidateAfterinteraction could be clearer. The doc correctly states thatconsolidateAftercontrols when nodes become candidates via theConsolidatablestatus condition. But it would help to explicitly state: "A node that recently had pods scheduled to it will not be eligible for multi-node consolidation untilconsolidateAfterelapses, even if it is underutilized." This is the most common operator surprise with this setting.
The Known Limitations section is honest and complete:
- Prefix-only limitation: covered with good explanation of why arbitrary subsets are infeasible.
- Single-replacement constraint: covered with issue links.
- Candidate cap and timeout: covered with issue links and metric names.
- Spot-to-spot without minimum flexibility: covered.
Suggestions
-
Remove all function names, field names, and file paths. Replace with behavioral descriptions. For example, instead of "defined in
consolidation.ShouldDisrupt()", write "defined by the consolidation eligibility checks". Instead of "lastConsolidationStatetimestamp", write "an in-memory timestamp". -
Add a Validation section describing the 15-second re-check after a move is found. This is important for operators to understand why a valid-looking consolidation may not execute.
-
Remove the commit hash from the single-node ordering rationale. The historical context is useful; the hash is not.
-
Consider adding a brief "Observability" subsection to Steady-State Behavior. The doc already mentions the timeout metric and evaluation duration histogram in Known Limitations, but operators would benefit from a consolidated list of signals they can use to understand multi-node consolidation behavior (metrics, events, log lines). This is optional -- the metric names in Known Limitations may be sufficient.
-
Explicitly state the budget edge case where tight budgets reduce multi-node candidates below 2 and effectively disable the method. This is predictable from the text but worth calling out for operators managing large clusters with restrictive budgets.
Verdict
REQUEST_CHANGES
The doc is well-written, accurate, and thorough. The request for changes is based on two issues:
- Tenet violations: Function names, field names, file paths, and a commit hash should be removed to comply with "Designs capture intent, not mechanism."
- Missing validation step: The 15-second re-validation window is operator-visible behavior that belongs in the design. Without it, the doc is incomplete as a behavioral contract.
Neither issue is a major rewrite. The doc's structure, accuracy, and coverage of limitations are strong.
Remove implementation details (function names, field names, file paths, commit hash). Add validation step description. Call out budget edge case where tight budgets silently disable multi-node. Clarify consolidateAfter interaction. Apply Orwell's rules.
jamesmt-aws
left a comment
There was a problem hiding this comment.
Design Doc Review
Tenet Compliance
The document largely follows the tenets well. A few observations:
-
Designs capture intent, not mechanism -- Generally well followed. The description of binary search behavior, the
filterOutSameInstanceTypealgorithm with specific instance type examples (t3a.2xlarge, t3a.small, t3a.nano), and the clamped per-pod cost range[-10.0, 10.0]border on implementation detail, but they serve to make the behavior predictable to operators rather than describing code structure. Acceptable. -
Implementations are derived and replaceable -- The doc avoids function names, line numbers, and pseudocode. It describes what happens, not how the code is organized. Passes.
-
Reconciliation over perfection -- The Known Limitations section is honest and thorough, with links to upstream issues. The "Interaction with Single-Node Consolidation" section openly discusses trade-offs and gaps in the original design rationale. Strong.
-
Designs as legible contracts -- An operator reading this document could predict multi-node consolidation behavior in most scenarios. The budget interaction, ordering, and validation sections give enough detail to reason about edge cases.
No tenet violations found.
Accuracy
-
consolidateAfterdescription (Candidate Selection section): The doc says "An unsetconsolidateAfterdisables consolidation for the NodePool." In practice,consolidateAfterhas a default value of"0s"and is a required field. It is disabled by setting it to"Never"(which results in a nil Duration internally), not by leaving it unset. An operator reading "unset" might expect to omit the field entirely. Consider rewording to: "SettingconsolidateAftertoNeverdisables consolidation for the NodePool." -
consolidateAftergating description: The doc saysconsolidateAftercontrols "how long after the last pod scheduling event a node becomes eligible." The actual implementation usesLastPodEventTimeon the NodeClaim status, which tracks the last pod event (not only scheduling events). If no pod events have occurred, it falls back to the NodeClaim's initialization time. This is a minor imprecision -- "last pod event" would be more accurate than "last pod scheduling event." -
Disruption cost formula: The doc says
disruption_cost = rescheduling_cost(pods) * lifetime_remaining(node). Looking at the code,ReschedulingCostis computed over all pods on the node (not just reschedulable pods), which the doc correctly implies by saying "sums the eviction cost of each pod on the node." Accurate. -
LifetimeRemaining source: The doc says lifetime remaining uses
expireAfterfrom the node. In the code,LifetimeRemaining()readsnodeClaim.Spec.ExpireAfter, not a NodePool-level field. The doc says "the node'sexpireAfterduration" which is correct since NodeClaims inherit this from their NodePool. Accurate. -
Validation re-check period: The doc says "waits for a brief re-check period (on the order of seconds)." The actual TTL is 15 seconds (
consolidationTTL). The vagueness is appropriate for a design doc (avoids over-specification), but an operator might benefit from knowing the approximate magnitude. Current wording is acceptable. -
Cost comparison -- cheapest compatible offering: The doc says "Each candidate's cost is the price of its instance type's cheapest compatible offering given the node's current labels (zone, capacity type)." This matches the code in
getCandidatePrices(), which filters offerings by the node's label requirements and takes the cheapest. Accurate. -
Binary search bounds: The doc says the binary search finds "the largest prefix." The code initializes
min = 1andmax = len(candidates) - 1(or the cap), meaning the smallest prefix evaluated is 2 candidates (indices[0:2]). The doc correctly states "multi-node consolidation requires at least two candidates." Accurate. -
Same-type filtering: The doc's example matches the
filterOutSameInstanceTypefunction behavior precisely. Accurate. -
Empty node filtering rationale: The doc gives two reasons for filtering empty nodes. The code comment says: "If there was an empty node that wasn't consolidated before this, we should assume that it was due to budgets." The doc's explanation is more thorough and accurate than the code comment itself. Good.
-
Spot-to-spot behavior: The doc correctly states that multi-node does not enforce the 15 instance type minimum for spot-to-spot. The code at
computeSpotToSpotConsolidationline 269 confirms:if len(candidates) > 1returns early without the minimum check. Accurate.
Completeness
Required sections covered:
- Terminology: Yes
- Purpose: Yes
- When It Runs: Yes
- Candidate Selection: Yes
- Candidate Ordering: Yes
- Move Evaluation: Yes
- Cost Comparison: Yes
- Interaction with Single-Node Consolidation: Yes
- Interaction with Disruption Budgets: Yes
- Known Limitations: Yes
- Steady-State Behavior: Yes (bonus section, valuable)
Prefix-only limitation in Known Limitations: Yes, thoroughly covered with the Bell number complexity argument.
Budget interaction with 1 candidate: Yes, covered in "Interaction with Disruption Budgets" -- "If the budget reduces the candidate count below two, the method returns no action without evaluating."
Ordering with single-node consolidation: Yes, covered in "Interaction with Single-Node Consolidation" with nuanced discussion of trade-offs.
Consolidated cache: Yes, covered in both "When It Runs" and "The Consolidated Mark" subsection.
Pair search fallback: Correctly absent -- no mention of pair search fallback, which is not upstream.
One potential omission: The document does not mention what happens when getCandidatePrices encounters a candidate with no compatible offerings. In the code, this returns 0.0 for the entire candidate set (modeling it as free), which causes the cost comparison to reject the move. This edge case affects operator-visible behavior (consolidation silently skips nodes whose offerings have disappeared) but is arguably a detail that drift handles.
Suggestions
-
Clarify
consolidateAfterdisabling: Change "An unsetconsolidateAfter" to "AconsolidateAftervalue ofNever" to match the API surface operators interact with. -
Pod event vs. pod scheduling event: In the Candidate Selection section, change "how long after the last pod scheduling event" to "how long after the last pod event" to match the
LastPodEventTimefield name. -
Validation period specificity: Consider noting that the re-check period is "approximately 15 seconds" rather than "on the order of seconds." This helps operators reason about how quickly cluster changes can invalidate a found move.
-
Budget interaction -- consider a concrete example: The "5% of 20 nodes = 1 candidate" example is excellent. Consider also noting that multi-NodePool clusters can still have multi-node consolidation work across NodePools even when individual NodePool budgets are tight, since candidates from different NodePools are merged into the same sorted list.
Verdict
APPROVE
jamesmt-aws
left a comment
There was a problem hiding this comment.
Review: Multi-Node Consolidation Design
This is a strong document. It reads well, covers the required ground substantively, and stays at the right level of abstraction.
Tenet Compliance
- Intent over mechanism: The document describes what multi-node consolidation does and why, without dropping into code-level detail. The disruption cost formula uses readable pseudocode to convey the concept. Pass.
- Implementations are derived and replaceable: Nothing in the document ties the design to a specific code structure. A reimplementation could follow this spec without reference to existing code. Pass.
- Reconciliation over perfection: The design explicitly favors incremental convergence. Timeouts return the best move found so far. The prefix-only search trades optimality for tractability. The steady-state caching avoids wasted work. Pass.
- Legible contracts: A reader unfamiliar with the codebase can follow every section. Terminology is front-loaded and referenced consistently. Pass.
- No implementation details: No function names, file paths, variable names, struct names, or package references. API-level fields like
consolidateAfterandWhenEmptyOrUnderutilizedare user-facing configuration, not implementation. Pass.
Writing Quality
Short sentences throughout. Active voice. Concrete specifics (the t3a example in Same-Type Filtering is effective). No em-dashes. No forward references. No business-school phrasing or self-congratulatory framing. Economy is good: sections say what they need to and stop.
Completeness
All ten required sections are present with substantive content:
- Purpose: clearly states the gap.
- When It Runs: ordering within the disruption loop, skip logic.
- Candidate Selection: six eligibility criteria, each with rationale. Empty-node filtering explained.
- Candidate Ordering: sort formula, significance of ordering for prefix search.
- Move Evaluation: simulation approach, deletion vs replacement, timeout, validation.
- Cost Comparison: current pricing rationale, same-type filtering with worked example.
- Interaction with Single-Node: tradeoffs of batch vs sequential, cases single-node catches.
- Interaction with Disruption Budgets: filtering order, per-NodePool caps, consolidated-mark interaction.
- Known Limitations: prefix-only, single-replacement constraint, cap/timeout, spot-to-spot flexibility.
- Steady-State Behavior: caching via consolidated mark.
Readability
The document flows logically from purpose through mechanism to interactions and limitations. Terminology is defined up front and used consistently.
Minor nits (not blocking):
- "Reschedulable pods" is used in the Empty Node Filtering section but not defined in Terminology. A one-line definition would help a new reader.
- "Consolidation loops" in the Spot-to-Spot limitation could use a brief parenthetical explanation for readers who have not encountered the concept.
- The Steady-State Behavior section largely restates content from "The Consolidated Mark" and "When It Runs." Consider folding it into one of those sections or adding unique content to justify keeping it separate.
Verdict
APPROVE. The document is clear, complete, well-structured, and follows all design tenets. The nits above are minor and can be addressed in a follow-up if desired.
Summary
Design document for multi-node consolidation describing its current
behavioral contract, eligibility criteria, move evaluation, cost model,
and known limitations.
Round 2: revised based on detailed feedback. Key changes:
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.