Skip to content

Make node-drain behavior configurable (mirror kubectl drain options) #259

Description

@lockwobr

Summary

When a package has an interrupt, the operator cordons and drains the node before running the interrupt. The drain policy is currently hardcoded with no way to tune it. A user has asked to expose the equivalent of the common kubectl drain knobs:

  • --delete-emptydir-data
  • --disable-eviction
  • --force
  • --ignore-daemonsets
  • --timeout
  • --grace-period — sets the termination grace period on the eviction/delete call. The current eviction passes policyv1.Eviction{} with no DeleteOptions, so pods get their own terminationGracePeriodSeconds; a drain-level override is a one-line addition and pairs naturally with --timeout. (Distinct from the existing per-package GracefulShutdown, which sets the grace period on the package pod, not the workloads being drained.) Not in the original ticket, but cheap and a natural fit.

Note up front: the operator does not invoke kubectl drain. It reimplements drain as an in-process eviction loop (DrainNodeIsDrained / ShouldEvict in operator/internal/controller/skyhook_controller.go). So this is not "pass flags through to kubectl" — it's "add configuration to our eviction loop that mirrors the behavior those flags describe." The issue title borrows kubectl's vocabulary because that's what users know.

Current behavior

operator/internal/controller/skyhook_controller.go:

  • DrainNode (:1678) lists pods on the node and, for each pod ShouldEvict returns true for, creates a policyv1.Eviction subresource (:1710-1711). Eviction respects PodDisruptionBudgets.
  • ShouldEvict (:1407) hardcodes the policy:
    • Only Running / Pending pods are candidates.
    • Pods tolerating node.kubernetes.io/unschedulable are skipped.
    • DaemonSet-owned pods are skipped (:1418-1424).
    • kube-system pods are skipped (:1426).
    • Everything else is evicted.
  • The drain "loop" is the reconcile requeue: DrainNode returns (false, nil) until IsDrained reports no evictable pods remain. There is no overall deadline.

Mapping the requested flags onto this:

kubectl flag Today Gap
--ignore-daemonsets Effectively hardcoded on — DaemonSet pods always skipped. No way to turn off (rarely wanted, but currently not expressible).
--disable-eviction Unsupported. Always uses the eviction API, so PDBs always apply. No direct-DELETE path to bypass PDBs when an operator explicitly wants force-through.
--force Bare / unmanaged pods are evicted anyway (no orphan guard). kubectl --force deletes unmanaged pods; we evict them. Mostly equivalent, but no explicit opt-in/refuse-without-flag behavior.
--delete-emptydir-data Not honored. emptyDir-backed pods are evicted with no guard. Operator is more aggressive than kubectl's default (kubectl refuses without this flag). To mirror kubectl we'd add a guard that refuses emptyDir pods unless opted in — a behavior change for existing users.
--timeout None. Drain requeues until drained, forever. A stuck pod (e.g. PDB at zero, finalizer wedged) blocks the interrupt indefinitely with no surfaced failure.

Proposed change

Add an optional drain-options struct to SkyhookSpec (SCR-level, next to InterruptionBudget and AdditionalTolerations in operator/api/v1alpha1/skyhook_types.go), and plumb it through the wrapper layer into DrainNode / ShouldEvict.

Sketch:

// SkyhookSpec
// DrainConfig tunes how nodes are drained before an interrupt. All fields optional;
// zero value preserves today's behavior.
DrainConfig *DrainConfig `json:"drainConfig,omitempty"`

type DrainConfig struct {
    // DisableEviction bypasses the eviction API (and therefore PodDisruptionBudgets)
    // and deletes pods directly. Default false (use eviction API).
    DisableEviction bool `json:"disableEviction,omitempty"`

    // DeleteEmptyDirData allows draining pods backed by emptyDir volumes.
    // Default TRUE (preserves today's behavior — we evict emptyDir pods today).
    // NOTE: this deliberately inverts kubectl drain's default (false); see "Compatibility".
    DeleteEmptyDirData *bool `json:"deleteEmptyDirData,omitempty"`

    // Force evicts/deletes pods not managed by a controller (bare pods).
    // Default TRUE (preserves today's behavior — we evict bare pods today).
    // NOTE: this deliberately inverts kubectl drain's default (false); see "Compatibility".
    Force *bool `json:"force,omitempty"`

    // IgnoreDaemonSets skips DaemonSet-managed pods. Default true (current behavior).
    IgnoreDaemonSets *bool `json:"ignoreDaemonSets,omitempty"`

    // Timeout bounds how long the operator will wait for a node to drain before
    // marking the node erroring. Zero means no timeout (current behavior).
    Timeout *metav1.Duration `json:"timeout,omitempty"`

    // GracePeriod overrides the per-pod termination grace period used when
    // evicting/deleting. Nil uses each pod's own terminationGracePeriodSeconds
    // (current behavior). Mirrors `kubectl drain --grace-period`.
    GracePeriod *metav1.Duration `json:"gracePeriod,omitempty"`
}

Wiring:

  1. ShouldEvict takes the resolved options instead of being a free function with hardcoded policy (or keep the signature and pass a small policy value).
  2. DrainNode chooses eviction vs. direct delete based on DisableEviction.
  3. --timeout needs a start marker. Persist a drain-start timestamp on the node (annotation, consistent with how other per-node state is stored — see docs/operator-status-definitions.md on where node state lives) and compare against Timeout each reconcile; on expiry, surface the node as erroring with an event rather than requeuing forever.

Out of scope

Deliberately not included (niche / out of scope unless asked):

  • --pod-selector — drain only pods matching a label selector. Adds real complexity and isn't requested; defer.
  • --skip-wait-for-delete-timeout, --chunk-size — tuning knobs with little payoff at Skyhook's scale.
  • The hardcoded kube-system skip (:1426) is Skyhook-specific, not a kubectl flag. Leave it hardcoded; making it configurable invites foot-guns (evicting control-plane pods). Note it in docs rather than exposing it.

Compatibility — chosen defaults

The decision: preserve today's behavior by default, expose the knobs so users can tighten. Nodewright is run at node bring-up and inside maintenance windows, where aggressive eviction is the desired behavior, so the current aggressiveness is a feature, not an accident. Concretely:

  • force defaults true — keep evicting bare/unmanaged pods, as today.
  • deleteEmptyDirData defaults true — keep evicting emptyDir-backed pods, as today.
  • Both of these deliberately invert kubectl drain's defaults (kubectl ships both false). Document this prominently — a kubectl-literate user will expect false and be surprised. The names still read naturally; only the default differs.
  • ignoreDaemonSets defaults true — unchanged.
  • An unset / nil drainConfig reproduces today's behavior exactly, so existing SCRs see no change. No migration, no minor-version forcing on behavior grounds.

Skyhook-specific skips kept as permanent policy (not flags)

Two exclusions Skyhook has that kubectl drain does not — keep both, hardcoded, not configurable:

  • kube-system namespace — never evict (protects control-plane / system workloads). Exposing it invites foot-guns.
  • Pods tolerating node.kubernetes.io/unschedulable — already tolerate the cordon taint, so leave them.

"Match drain" for the two harmless gaps

Adopt kubectl's coverage for the two filters Skyhook lacks today, since both are strictly safer and change nothing about the aggressive intent:

  • Skip pods with a DeletionTimestamp (already terminating) — stop re-issuing evictions against them.
  • Skip mirror / static pods — they can't be meaningfully evicted (kubelet recreates them); today only the kube-system ones are skipped.

Implementation approach

Two structural moves, kept separate from the behavior change above.

1. Extract drain into its own unit

DrainNode / IsDrained / ShouldEvict are a self-contained cluster inside the 2500-line skyhook_controller.go. Move them to operator/internal/drain (a focused, unit-testable package mirroring internal/graph) — or at minimum internal/controller/drain.go. The feature only grows that surface, and the selection logic is exactly the kind of thing that wants table-driven unit tests independent of the reconcile harness.

2. Pod selection: reuse kubectl's library, or grow our own — leave to implementation

Deciding which pods are evictable (daemonset-managed, mirror/static pods, already-deleting pods, unmanaged/bare pods under --force, emptyDir pods under --delete-emptydir-data) is the bug-prone part. Two routes; this is an implementation detail, not a decision the issue needs to lock in up front.

First, note that today's ShouldEvict is not the same policy as kubectl's default filters — it's more aggressive in some places and more conservative in others, so neither route is behavior-preserving for free:

Pod kind Skyhook today kubectl defaults (Force/DeleteEmptyDirData off)
emptyDir-backed evicts refuses (needs --delete-emptydir-data)
bare / unmanaged evicts refuses (needs --force)
mirror / static pod evicts (unless kube-system) always skips
terminating (DeletionTimestamp) re-issues eviction skips after timeout
any pod in kube-system skips (blanket) evicts (no namespace rule)
tolerates unschedulable skips evicts (no such rule)

So adopting GetPodsForDeletion as-is would both start refusing emptyDir/bare pods we evict today and drop the kube-system + unschedulable-toleration skips (which would have to be re-added as AdditionalFilters). The "more aggressive on emptyDir/bare" rows are why we default force and deleteEmptyDirData to true (see Compatibility) — applies to either route. (Separately: the daemonset skip at :1418 is guarded by len(OwnerReferences) > 1, but DaemonSet pods usually have exactly one owner ref; they're caught by the earlier unschedulable-toleration skip instead, so probably not a live bug — but verify when rewriting this line.)

Things to weigh:

Route A — import k8s.io/kubectl/pkg/drain. It's the exact library kubectl drain uses. drain.Helper fields map 1:1 to this ticket (Force, IgnoreAllDaemonSets, DeleteEmptyDirData, DisableEviction, GracePeriodSeconds), and helper.GetPodsForDeletion(nodeName) returns the evictable set. The draw is real: well-tested selection logic for free, including edge cases we'd otherwise re-discover. Caveats to validate before committing:

  • The filter chain is internal. GetPodsForDeletion runs pods through an unexported makeFilters() (skipDeletedFilter, daemonSetFilter, mirrorPodFilter, localStorageFilter, unreplicatedFilter); the only customization points are the boolean Helper fields and an append-only AdditionalFilters. You take the chain as-is — some "magic" you don't control or see.
  • Do not call Helper.RunNodeDrain — it blocks, looping until evictions complete, which violates the non-blocking reconcile model and would stall the operator under the global-reconcile-key proposal. Use only the selection helper; keep Skyhook's requeue-driven eviction. Consequently --timeout stays Skyhook's own deadline annotation, not Helper.Timeout.
  • drain.Helper.Client is a client-go kubernetes.Interface; the controller only has the controller-runtime client via dal today (only the CLI builds a clientset, internal/cli/client/client.go:64). Adopting the library forces a clientset seam into the controller. Wrap behind dal or call it out per the "justify new patterns" rule.
  • Adds k8s.io/kubectl to the vendor tree (not yet vendored; would pin to v0.36.0 to match k8s.io/api / k8s.io/client-go). Sizeable transitive dependency.

Route B — grow our own. We already have ShouldEvict with the daemonset / unschedulable / kube-system checks. The remaining predicates (skipDeleted, mirrorPod, emptyDir guard, unmanaged-pod guard, grace-period + eviction-vs-delete) are each a few lines and well-documented. Hand-rolling keeps every exclusion explicit and unit-testable in internal/drain, adds no heavy dependency, and avoids forcing a clientset into the controller. Cost: we own correctness for ~4 small predicates.

Worth noting: we're already discarding the library's genuinely hard, battle-tested part — the driver loop (eviction-vs-delete, retries, wait-for-deletion). What's left to reuse is the selection, which is small. So "tested code for free" is a smaller win than it first looks. Net steer (not a mandate): prototype against the library; if GetPodsForDeletion's opacity + the clientset seam fight the codebase, fall back to a small, well-tested in-house internal/drain rather than bending the controller around the library.

3. Preserve the golden-workload barrier — do not fold it into drain selection

podNonInterruptLabels ("golden workloads" — pods that must never be interrupted) is a pre-drain barrier, not an eviction-exclusion. EnsureNodeIsReadyForInterrupt (:2605) calls HasNonInterruptWork (:1629) before DrainNode; if a matching pod is Running/Pending on the node, the interrupt waits and never drains. The semantics (see docs/interrupt_flow.md) are block the interrupt entirely until the golden pod leaves, not skip it and drain the rest.

This must stay as a separate gate that runs before drain. Do not express golden workloads as a drain.Helper pod filter — a drain filter would skip the pod and proceed to drain everything else and interrupt anyway, silently breaking the contract. GetPodsForDeletion only ever replaces ShouldEvict's hardcoded daemonset / kube-system / unschedulable logic; HasNonInterruptWork is untouched.

(Optionally, register podNonInterruptLabels as an AdditionalFilters entry on the Helper too, purely as a defensive backstop against a golden pod racing onto the node after the barrier check — but the barrier remains the load-bearing mechanism.)

4. Testing

Split coverage between unit and e2e by what each can actually exercise — and lean on the pattern the CLI already uses.

  • Unit (selection logic) — model it on the CLI tests. The CLI tests selection/mutation against k8s.io/client-go/kubernetes/fake (9 files, e.g. cmd/cli/app/node/node_reset_test.go): seed objects into a fake clientset, inject it, run the code, assert on the recorded actions. The same approach covers drain selection directly: seed a node's pods (daemonset-owned, mirror, emptyDir, bare/unmanaged, golden-labeled, ordinary), run the selector, and assert which pods get an eviction/delete action and with what DeleteOptions (grace period), plus that the golden barrier issues no eviction. The fake clientset records the eviction-subresource create, so this is a pure unit test — no envtest, no cluster. This works for either route: Route A's GetPodsForDeletion and Route B's hand-rolled selector both run against a fake clientset. Note the controller's normal harness is envtest (no kubelet, no real eviction), so envtest can assert that an eviction was issued but not its real-world effect.
  • E2e (real eviction semantics) — chainsaw, drain-focused. Anything the fake clientset can't model belongs in a k8s-tests/chainsaw/ suite against a real cluster: PDB actually blocking an eviction, disableEviction: true bypassing that PDB via direct delete, --timeout expiry surfacing the node as erroring, and --force / emptyDir behavior end-to-end. If unit coverage of selection turns out thin (especially under Route A, where the chain is opaque), add lib-focused chainsaw cases that assert pod-level outcomes rather than just controller status.
  • Before building, read the CLI test setup (cmd/cli/app/... + internal/cli/client) to see how the fake clientset is wired and injected; reuse that scaffolding rather than inventing a new one. If drain ends up on a clientset (Route A), the CLI's injection pattern is the precedent to follow.

Prior art — how comparable projects drain

The projects most like nodewright (controller-runtime reconcilers managing node lifecycle at scale) all hand-roll a non-blocking, requeue-driven eviction loop and deliberately avoid the blocking RunNodeDrain. This corroborates both Skyhook's existing approach and the steers above.

  • Karpenter (kubernetes-sigs/karpenter) — closest analogue. Finalizer-driven termination controller; evicts via the Eviction API directly (respects PDBs), requeues until drained; no kubectl drain library. Skips static/mirror pods, pods tolerating its own karpenter.sh/disrupted:NoSchedule taint, and completed pods — nearly identical to the skip set we settled on (unschedulable-toleration + mirror).
  • Cluster API (kubernetes-sigs/cluster-api) — Machine-controller drain is reconcile/requeue-driven (non-blocking), evicting by priority and requeuing; semantics "aligned to kubectl drain" but it vendored a third_party/kubernetes-drain copy rather than depending on the helper live. Default skips DaemonSet + static/mirror pods. It is adding a MachineDrainRule CRD for configurable per-pod behavior (Drain / Skip / WaitCompleted via label selectors, with a pod-annotation override) — direct prior art for DrainConfig, and its Skip/WaitCompleted is the same pattern as our podNonInterruptLabels golden-workload barrier.

By contrast, the projects that use the kubectl drain library including the blocking RunNodeDrain are daemons / single-node tools where blocking is acceptable (AWS node-termination-handler, medik8s node-maintenance-operator), or they shell out to the kubectl binary (kured, Rancher system-upgrade-controller). None are at-scale reconcilers. So the ecosystem consensus for our shape is "own the non-blocking eviction loop; use the library for selection at most," which is the steer in §2.

References: Karpenter termination design (kubernetes-sigs/karpenter/designs/), Cluster API MachineDrainRule proposal (docs/proposals/20240930-machine-drain-rules.md), k8s.io/kubectl/pkg/drain godoc.

Acceptance criteria

  • New optional drainConfig field on the Skyhook CRD; unset preserves today's behavior exactly (make manifests generate run; CRD YAML + deepcopy regenerated; chart/ CRD mirrored).
  • disableEviction: true deletes pods directly and drains through a zero-allowance PDB that would otherwise block — chainsaw e2e (real PDB enforcement isn't exercised by envtest/fake clientset).
  • timeout expiry surfaces the node as erroring with an event, instead of requeuing indefinitely — chainsaw e2e with a non-evictable pod.
  • ignoreDaemonSets: false makes DaemonSet pods eviction candidates — unit (fake-clientset selection test).
  • Golden-workload contract preserved: a pod matching podNonInterruptLabels that is Running/Pending blocks the interrupt (node waits, never drains) regardless of drainConfig — unit test (fake clientset) asserting no eviction action is issued while a golden pod is present.
  • force and deleteEmptyDirData default to true (unset drainConfig evicts bare + emptyDir pods exactly as today); setting either false makes the corresponding pods block the drain. Divergence from kubectl's defaults documented in docs/interrupt_flow.md and the changelog.
  • kube-system and unschedulable-tolerating pods are never evicted regardless of drainConfig; terminating (DeletionTimestamp) and mirror/static pods are skipped — covered by the fake-clientset selection test.
  • docs/interrupt_flow.md updated to document the new options and their defaults; changelog note for any behavior change.
  • CLI: surface drainConfig in kubectl skyhook output only if a status/describe path already renders spec fields; otherwise no CLI change needed (note it in the PR per the CLI-contract rule).

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    component/operatorSkyhook operator (controller-manager)
    No fields configured for Enhancement.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions