docs: design document for multi-node consolidation by jamesmt-aws · Pull Request #10 · jamesmt-aws/karpenter

jamesmt-aws · 2026-03-28T19:40:41Z

Summary

Design document for multi-node consolidation describing its current
behavioral contract, eligibility criteria, move evaluation, cost model,
and known limitations.

Round 2: revised based on detailed feedback. Key changes:

Added terminology section
Explained WHY each eligibility criterion exists
Clarified cost model (offering prices, not historical costs)
Investigated ordering intent (multi-node before single-node)
Noted combinatorial complexity of subset enumeration
Applied Orwell's rules throughout

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

Document the intent and behavioral contract of multi-node consolidation as it exists today. Covers candidate selection, ordering, move evaluation, cost comparison, budget interaction, and known limitations. This is a baseline design doc. Future changes to multi-node consolidation behavior should start as revisions to this document.

Address feedback: add terminology section, explain eligibility criteria rationale, clarify cost model (offering prices, not historical costs), investigate ordering intent (multi-node before single-node), note combinatorial complexity of subset enumeration (Bell numbers), apply Orwell's rules throughout.

Reference kubernetes-sigs/karpenter issues kubernetes-sigs#1202, kubernetes-sigs#1442, kubernetes-sigs#1653, kubernetes-sigs#1733, kubernetes-sigs#1970, kubernetes-sigs#2434, kubernetes-sigs#2814, kubernetes-sigs#2819 and aws/karpenter-provider-aws#5944 in the known limitations section.

jamesmt-aws

Design Doc Review

Tenet Compliance

The doc is generally strong on intent over mechanism, but has several tenet violations:

Function/field name references violate "Designs capture intent, not mechanism":
- consolidation.ShouldDisrupt() (Candidate Selection section) -- describe the eligibility check without naming the function.
- markConsolidated() and IsConsolidated() (The Consolidated Mark section) -- these are implementation details; describe the caching behavior without naming the methods.
- lastConsolidationState field name (The Consolidated Mark section) -- describe the in-memory timestamp without naming the field.
- nodeclaim/disruption/consolidation.go file path (Candidate Selection, third bullet) -- remove the file path reference; describe the behavior.
Commit hash reference: "The original commit introducing multi-node consolidation (978daf3, Nov 2022)" in the Interaction with Single-Node Consolidation section. Commit hashes are mechanism, not intent. Rephrase as historical context without the hash if needed at all.
O(log N) complexity reference in the Prefix-Only Evaluation limitation. This is implementation detail -- the design should state that binary search makes it tractable, not quote the complexity class. (Minor violation since it appears in the Limitations section as justification.)

No violations of "Reconciliation over perfection" or "Designs as legible contracts" -- the doc is honest about gaps and an operator can predict behavior from what is written.

Accuracy

The doc is accurate against the codebase with the following notes:

Omission: Validation step. After finding a valid move, multi-node consolidation waits 15 seconds (consolidationTTL) and then re-validates the command by re-checking candidate eligibility and re-running the scheduling simulation. If validation fails, the move is rejected and an event is emitted. This is significant operator-visible behavior -- a consolidation move that was found valid can be silently dropped if the cluster changes during the validation window. The doc does not mention this at all.
Omission: Budget reduction to one candidate. The doc says "Multi-node consolidation requires at least two candidates" but does not explicitly state what happens when budget filtering reduces the candidate set to one or zero. The code returns a no-op in this case. An operator setting tight budgets should understand that multi-node consolidation may be silently disabled as a result.
Minor inaccuracy: "cheapest compatible offering" for replacement pricing. In the Cost Comparison section, the doc says "The replacement's cost is the price of the cheapest offering among its eligible instance types." This is correct for the cost comparison filter, but the actual replacement that launches may not get the cheapest offering (especially for spot). The doc could clarify that the cost comparison uses cheapest-offering pricing as a lower bound, while the actual launch price may differ.
Accuracy confirmed on all other points:
- Disruption method ordering (Emptiness, StaticDrift, Drift, Multi, Single) matches NewMethods().
- Binary search with min=1, max=len-1, 100-cap, 1-minute timeout -- all match code.
- Empty node filtering rationale matches the code's len(candidate.reschedulablePods) == 0 check.
- Budget-constrained skip preventing consolidated mark matches if !constrainedByBudgets { m.markConsolidated() }.
- Same-type filtering logic matches filterOutSameInstanceType.
- 5-minute consolidation state reset confirmed in cluster.go:550-563.
- Spot-to-spot multi-node behavior (no 15-type minimum) confirmed.
- Pair search fallback is correctly absent (not upstream).

Completeness

The doc covers all expected sections thoroughly. Two gaps:

Missing: Validation/re-check step. As noted under Accuracy, the 15-second validation window is a distinct behavioral phase that the doc omits entirely. This should be its own subsection or added to Move Evaluation. An operator debugging "why did consolidation find a move but not execute it?" needs to know about this.
The consolidateAfter interaction could be clearer. The doc correctly states that consolidateAfter controls when nodes become candidates via the Consolidatable status condition. But it would help to explicitly state: "A node that recently had pods scheduled to it will not be eligible for multi-node consolidation until consolidateAfter elapses, even if it is underutilized." This is the most common operator surprise with this setting.

The Known Limitations section is honest and complete:

Prefix-only limitation: covered with good explanation of why arbitrary subsets are infeasible.
Single-replacement constraint: covered with issue links.
Candidate cap and timeout: covered with issue links and metric names.
Spot-to-spot without minimum flexibility: covered.

Suggestions

Remove all function names, field names, and file paths. Replace with behavioral descriptions. For example, instead of "defined in consolidation.ShouldDisrupt()", write "defined by the consolidation eligibility checks". Instead of "lastConsolidationState timestamp", write "an in-memory timestamp".
Add a Validation section describing the 15-second re-check after a move is found. This is important for operators to understand why a valid-looking consolidation may not execute.
Remove the commit hash from the single-node ordering rationale. The historical context is useful; the hash is not.
Consider adding a brief "Observability" subsection to Steady-State Behavior. The doc already mentions the timeout metric and evaluation duration histogram in Known Limitations, but operators would benefit from a consolidated list of signals they can use to understand multi-node consolidation behavior (metrics, events, log lines). This is optional -- the metric names in Known Limitations may be sufficient.
Explicitly state the budget edge case where tight budgets reduce multi-node candidates below 2 and effectively disable the method. This is predictable from the text but worth calling out for operators managing large clusters with restrictive budgets.

Verdict

REQUEST_CHANGES

The doc is well-written, accurate, and thorough. The request for changes is based on two issues:

Tenet violations: Function names, field names, file paths, and a commit hash should be removed to comply with "Designs capture intent, not mechanism."
Missing validation step: The 15-second re-validation window is operator-visible behavior that belongs in the design. Without it, the doc is incomplete as a behavioral contract.

Neither issue is a major rewrite. The doc's structure, accuracy, and coverage of limitations are strong.

Remove implementation details (function names, field names, file paths, commit hash). Add validation step description. Call out budget edge case where tight budgets silently disable multi-node. Clarify consolidateAfter interaction. Apply Orwell's rules.

jamesmt-aws

Design Doc Review

Tenet Compliance

The document largely follows the tenets well. A few observations:

Designs capture intent, not mechanism -- Generally well followed. The description of binary search behavior, the filterOutSameInstanceType algorithm with specific instance type examples (t3a.2xlarge, t3a.small, t3a.nano), and the clamped per-pod cost range [-10.0, 10.0] border on implementation detail, but they serve to make the behavior predictable to operators rather than describing code structure. Acceptable.
Implementations are derived and replaceable -- The doc avoids function names, line numbers, and pseudocode. It describes what happens, not how the code is organized. Passes.
Reconciliation over perfection -- The Known Limitations section is honest and thorough, with links to upstream issues. The "Interaction with Single-Node Consolidation" section openly discusses trade-offs and gaps in the original design rationale. Strong.
Designs as legible contracts -- An operator reading this document could predict multi-node consolidation behavior in most scenarios. The budget interaction, ordering, and validation sections give enough detail to reason about edge cases.

No tenet violations found.

Accuracy

consolidateAfter description (Candidate Selection section): The doc says "An unset consolidateAfter disables consolidation for the NodePool." In practice, consolidateAfter has a default value of "0s" and is a required field. It is disabled by setting it to "Never" (which results in a nil Duration internally), not by leaving it unset. An operator reading "unset" might expect to omit the field entirely. Consider rewording to: "Setting consolidateAfter to Never disables consolidation for the NodePool."
consolidateAfter gating description: The doc says consolidateAfter controls "how long after the last pod scheduling event a node becomes eligible." The actual implementation uses LastPodEventTime on the NodeClaim status, which tracks the last pod event (not only scheduling events). If no pod events have occurred, it falls back to the NodeClaim's initialization time. This is a minor imprecision -- "last pod event" would be more accurate than "last pod scheduling event."
Disruption cost formula: The doc says disruption_cost = rescheduling_cost(pods) * lifetime_remaining(node). Looking at the code, ReschedulingCost is computed over all pods on the node (not just reschedulable pods), which the doc correctly implies by saying "sums the eviction cost of each pod on the node." Accurate.
LifetimeRemaining source: The doc says lifetime remaining uses expireAfter from the node. In the code, LifetimeRemaining() reads nodeClaim.Spec.ExpireAfter, not a NodePool-level field. The doc says "the node's expireAfter duration" which is correct since NodeClaims inherit this from their NodePool. Accurate.
Validation re-check period: The doc says "waits for a brief re-check period (on the order of seconds)." The actual TTL is 15 seconds (consolidationTTL). The vagueness is appropriate for a design doc (avoids over-specification), but an operator might benefit from knowing the approximate magnitude. Current wording is acceptable.
Cost comparison -- cheapest compatible offering: The doc says "Each candidate's cost is the price of its instance type's cheapest compatible offering given the node's current labels (zone, capacity type)." This matches the code in getCandidatePrices(), which filters offerings by the node's label requirements and takes the cheapest. Accurate.
Binary search bounds: The doc says the binary search finds "the largest prefix." The code initializes min = 1 and max = len(candidates) - 1 (or the cap), meaning the smallest prefix evaluated is 2 candidates (indices [0:2]). The doc correctly states "multi-node consolidation requires at least two candidates." Accurate.
Same-type filtering: The doc's example matches the filterOutSameInstanceType function behavior precisely. Accurate.
Empty node filtering rationale: The doc gives two reasons for filtering empty nodes. The code comment says: "If there was an empty node that wasn't consolidated before this, we should assume that it was due to budgets." The doc's explanation is more thorough and accurate than the code comment itself. Good.
Spot-to-spot behavior: The doc correctly states that multi-node does not enforce the 15 instance type minimum for spot-to-spot. The code at computeSpotToSpotConsolidation line 269 confirms: if len(candidates) > 1 returns early without the minimum check. Accurate.

Completeness

Required sections covered:

Terminology: Yes
Purpose: Yes
When It Runs: Yes
Candidate Selection: Yes
Candidate Ordering: Yes
Move Evaluation: Yes
Cost Comparison: Yes
Interaction with Single-Node Consolidation: Yes
Interaction with Disruption Budgets: Yes
Known Limitations: Yes
Steady-State Behavior: Yes (bonus section, valuable)

Prefix-only limitation in Known Limitations: Yes, thoroughly covered with the Bell number complexity argument.

Budget interaction with 1 candidate: Yes, covered in "Interaction with Disruption Budgets" -- "If the budget reduces the candidate count below two, the method returns no action without evaluating."

Ordering with single-node consolidation: Yes, covered in "Interaction with Single-Node Consolidation" with nuanced discussion of trade-offs.

Consolidated cache: Yes, covered in both "When It Runs" and "The Consolidated Mark" subsection.

Pair search fallback: Correctly absent -- no mention of pair search fallback, which is not upstream.

One potential omission: The document does not mention what happens when getCandidatePrices encounters a candidate with no compatible offerings. In the code, this returns 0.0 for the entire candidate set (modeling it as free), which causes the cost comparison to reject the move. This edge case affects operator-visible behavior (consolidation silently skips nodes whose offerings have disappeared) but is arguably a detail that drift handles.

Suggestions

Clarify consolidateAfter disabling: Change "An unset consolidateAfter" to "A consolidateAfter value of Never" to match the API surface operators interact with.
Pod event vs. pod scheduling event: In the Candidate Selection section, change "how long after the last pod scheduling event" to "how long after the last pod event" to match the LastPodEventTime field name.
Validation period specificity: Consider noting that the re-check period is "approximately 15 seconds" rather than "on the order of seconds." This helps operators reason about how quickly cluster changes can invalidate a found move.
Budget interaction -- consider a concrete example: The "5% of 20 nodes = 1 candidate" example is excellent. Consider also noting that multi-NodePool clusters can still have multi-node consolidation work across NodePools even when individual NodePool budgets are tight, since candidates from different NodePools are merged into the same sorted list.

Verdict

APPROVE

jamesmt-aws

Review: Multi-Node Consolidation Design

This is a strong document. It reads well, covers the required ground substantively, and stays at the right level of abstraction.

Tenet Compliance

Intent over mechanism: The document describes what multi-node consolidation does and why, without dropping into code-level detail. The disruption cost formula uses readable pseudocode to convey the concept. Pass.
Implementations are derived and replaceable: Nothing in the document ties the design to a specific code structure. A reimplementation could follow this spec without reference to existing code. Pass.
Reconciliation over perfection: The design explicitly favors incremental convergence. Timeouts return the best move found so far. The prefix-only search trades optimality for tractability. The steady-state caching avoids wasted work. Pass.
Legible contracts: A reader unfamiliar with the codebase can follow every section. Terminology is front-loaded and referenced consistently. Pass.
No implementation details: No function names, file paths, variable names, struct names, or package references. API-level fields like consolidateAfter and WhenEmptyOrUnderutilized are user-facing configuration, not implementation. Pass.

Writing Quality

Short sentences throughout. Active voice. Concrete specifics (the t3a example in Same-Type Filtering is effective). No em-dashes. No forward references. No business-school phrasing or self-congratulatory framing. Economy is good: sections say what they need to and stop.

Completeness

All ten required sections are present with substantive content:

Purpose: clearly states the gap.
When It Runs: ordering within the disruption loop, skip logic.
Candidate Selection: six eligibility criteria, each with rationale. Empty-node filtering explained.
Candidate Ordering: sort formula, significance of ordering for prefix search.
Move Evaluation: simulation approach, deletion vs replacement, timeout, validation.
Cost Comparison: current pricing rationale, same-type filtering with worked example.
Interaction with Single-Node: tradeoffs of batch vs sequential, cases single-node catches.
Interaction with Disruption Budgets: filtering order, per-NodePool caps, consolidated-mark interaction.
Known Limitations: prefix-only, single-replacement constraint, cap/timeout, spot-to-spot flexibility.
Steady-State Behavior: caching via consolidated mark.

Readability

The document flows logically from purpose through mechanism to interactions and limitations. Terminology is defined up front and used consistently.

Minor nits (not blocking):

"Reschedulable pods" is used in the Empty Node Filtering section but not defined in Terminology. A one-line definition would help a new reader.
"Consolidation loops" in the Spot-to-Spot limitation could use a brief parenthetical explanation for readers who have not encountered the concept.
The Steady-State Behavior section largely restates content from "The Consolidated Mark" and "When It Runs." Consider folding it into one of those sections or adding unique content to justify keeping it separate.

Verdict

APPROVE. The document is clear, complete, well-structured, and follows all design tenets. The nits above are minor and can be addressed in a follow-up if desired.

james-michael-thompson added 3 commits March 27, 2026 18:26

jamesmt-aws commented Mar 28, 2026

View reviewed changes

docs: revise multi-node consolidation design

ef7f1d1

jamesmt-aws commented Mar 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: design document for multi-node consolidation#10

docs: design document for multi-node consolidation#10
jamesmt-aws wants to merge 5 commits intomainfrom
multinode-consolidation-design

jamesmt-aws commented Mar 28, 2026

Uh oh!

jamesmt-aws left a comment

Uh oh!

jamesmt-aws left a comment

Uh oh!

jamesmt-aws left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamesmt-aws commented Mar 28, 2026

Summary

Uh oh!

jamesmt-aws left a comment

Choose a reason for hiding this comment

Design Doc Review

Tenet Compliance

Accuracy

Completeness

Suggestions

Verdict

Uh oh!

jamesmt-aws left a comment

Choose a reason for hiding this comment

Design Doc Review

Tenet Compliance

Accuracy

Completeness

Suggestions

Verdict

Uh oh!

jamesmt-aws left a comment

Choose a reason for hiding this comment

Review: Multi-Node Consolidation Design

Tenet Compliance

Writing Quality

Completeness

Readability

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants