Make node-drain behavior configurable (mirror `kubectl drain` options)


## Summary

When a package has an interrupt, the operator cordons and drains the node before running the interrupt. The drain policy is currently hardcoded with no way to tune it. A user has asked to expose the equivalent of the common `kubectl drain` knobs:

- `--delete-emptydir-data`
- `--disable-eviction`
- `--force`
- `--ignore-daemonsets`
- `--timeout`
- `--grace-period` — sets the termination grace period on the eviction/delete call. The current eviction passes `policyv1.Eviction{}` with no `DeleteOptions`, so pods get their own `terminationGracePeriodSeconds`; a drain-level override is a one-line addition and pairs naturally with `--timeout`. (Distinct from the existing per-package `GracefulShutdown`, which sets the grace period on the *package* pod, not the workloads being drained.) Not in the original ticket, but cheap and a natural fit.

Note up front: the operator does **not** invoke `kubectl drain`. It reimplements drain as an in-process eviction loop (`DrainNode` → `IsDrained` / `ShouldEvict` in `operator/internal/controller/skyhook_controller.go`). So this is not "pass flags through to kubectl" — it's "add configuration to our eviction loop that mirrors the behavior those flags describe." The issue title borrows kubectl's vocabulary because that's what users know.

## Current behavior

`operator/internal/controller/skyhook_controller.go`:

- `DrainNode` (`:1678`) lists pods on the node and, for each pod `ShouldEvict` returns true for, creates a `policyv1.Eviction` subresource (`:1710-1711`). Eviction respects PodDisruptionBudgets.
- `ShouldEvict` (`:1407`) hardcodes the policy:
  - Only `Running` / `Pending` pods are candidates.
  - Pods tolerating `node.kubernetes.io/unschedulable` are skipped.
  - DaemonSet-owned pods are skipped (`:1418-1424`).
  - `kube-system` pods are skipped (`:1426`).
  - Everything else is evicted.
- The drain "loop" is the reconcile requeue: `DrainNode` returns `(false, nil)` until `IsDrained` reports no evictable pods remain. There is no overall deadline.

Mapping the requested flags onto this:

| kubectl flag | Today | Gap |
|---|---|---|
| `--ignore-daemonsets` | Effectively hardcoded **on** — DaemonSet pods always skipped. | No way to turn off (rarely wanted, but currently not expressible). |
| `--disable-eviction` | **Unsupported.** Always uses the eviction API, so PDBs always apply. | No direct-`DELETE` path to bypass PDBs when an operator explicitly wants force-through. |
| `--force` | Bare / unmanaged pods are evicted anyway (no orphan guard). | kubectl `--force` *deletes* unmanaged pods; we evict them. Mostly equivalent, but no explicit opt-in/refuse-without-flag behavior. |
| `--delete-emptydir-data` | **Not honored.** emptyDir-backed pods are evicted with no guard. | Operator is *more* aggressive than kubectl's default (kubectl refuses without this flag). To mirror kubectl we'd add a guard that refuses emptyDir pods unless opted in — a **behavior change** for existing users. |
| `--timeout` | **None.** Drain requeues until drained, forever. | A stuck pod (e.g. PDB at zero, finalizer wedged) blocks the interrupt indefinitely with no surfaced failure. |

## Proposed change

Add an optional drain-options struct to `SkyhookSpec` (SCR-level, next to `InterruptionBudget` and `AdditionalTolerations` in `operator/api/v1alpha1/skyhook_types.go`), and plumb it through the `wrapper` layer into `DrainNode` / `ShouldEvict`.

Sketch:

```go
// SkyhookSpec
// DrainConfig tunes how nodes are drained before an interrupt. All fields optional;
// zero value preserves today's behavior.
DrainConfig *DrainConfig `json:"drainConfig,omitempty"`

type DrainConfig struct {
    // DisableEviction bypasses the eviction API (and therefore PodDisruptionBudgets)
    // and deletes pods directly. Default false (use eviction API).
    DisableEviction bool `json:"disableEviction,omitempty"`

    // DeleteEmptyDirData allows draining pods backed by emptyDir volumes.
    // Default TRUE (preserves today's behavior — we evict emptyDir pods today).
    // NOTE: this deliberately inverts kubectl drain's default (false); see "Compatibility".
    DeleteEmptyDirData *bool `json:"deleteEmptyDirData,omitempty"`

    // Force evicts/deletes pods not managed by a controller (bare pods).
    // Default TRUE (preserves today's behavior — we evict bare pods today).
    // NOTE: this deliberately inverts kubectl drain's default (false); see "Compatibility".
    Force *bool `json:"force,omitempty"`

    // IgnoreDaemonSets skips DaemonSet-managed pods. Default true (current behavior).
    IgnoreDaemonSets *bool `json:"ignoreDaemonSets,omitempty"`

    // Timeout bounds how long the operator will wait for a node to drain before
    // marking the node erroring. Zero means no timeout (current behavior).
    Timeout *metav1.Duration `json:"timeout,omitempty"`

    // GracePeriod overrides the per-pod termination grace period used when
    // evicting/deleting. Nil uses each pod's own terminationGracePeriodSeconds
    // (current behavior). Mirrors `kubectl drain --grace-period`.
    GracePeriod *metav1.Duration `json:"gracePeriod,omitempty"`
}
```

Wiring:

1. `ShouldEvict` takes the resolved options instead of being a free function with hardcoded policy (or keep the signature and pass a small policy value).
2. `DrainNode` chooses eviction vs. direct delete based on `DisableEviction`.
3. `--timeout` needs a start marker. Persist a drain-start timestamp on the node (annotation, consistent with how other per-node state is stored — see `docs/operator-status-definitions.md` on where node state lives) and compare against `Timeout` each reconcile; on expiry, surface the node as `erroring` with an event rather than requeuing forever.

## Out of scope

Deliberately **not** included (niche / out of scope unless asked):

- `--pod-selector` — drain only pods matching a label selector. Adds real complexity and isn't requested; defer.
- `--skip-wait-for-delete-timeout`, `--chunk-size` — tuning knobs with little payoff at Skyhook's scale.
- The hardcoded `kube-system` skip (`:1426`) is Skyhook-specific, not a kubectl flag. Leave it hardcoded; making it configurable invites foot-guns (evicting control-plane pods). Note it in docs rather than exposing it.

## Compatibility — chosen defaults

The decision: **preserve today's behavior by default**, expose the knobs so users can *tighten*. Nodewright is run at node bring-up and inside maintenance windows, where aggressive eviction is the desired behavior, so the current aggressiveness is a feature, not an accident. Concretely:

- **`force` defaults `true`** — keep evicting bare/unmanaged pods, as today.
- **`deleteEmptyDirData` defaults `true`** — keep evicting emptyDir-backed pods, as today.
- Both of these **deliberately invert kubectl drain's defaults** (kubectl ships both `false`). Document this prominently — a kubectl-literate user will expect `false` and be surprised. The names still read naturally; only the default differs.
- **`ignoreDaemonSets` defaults `true`** — unchanged.
- An unset / nil `drainConfig` reproduces today's behavior exactly, so existing SCRs see **no** change. No migration, no minor-version forcing on behavior grounds.

### Skyhook-specific skips kept as permanent policy (not flags)

Two exclusions Skyhook has that kubectl drain does not — keep both, hardcoded, not configurable:

- **`kube-system` namespace** — never evict (protects control-plane / system workloads). Exposing it invites foot-guns.
- **Pods tolerating `node.kubernetes.io/unschedulable`** — already tolerate the cordon taint, so leave them.

### "Match drain" for the two harmless gaps

Adopt kubectl's coverage for the two filters Skyhook lacks today, since both are strictly safer and change nothing about the aggressive intent:

- **Skip pods with a `DeletionTimestamp`** (already terminating) — stop re-issuing evictions against them.
- **Skip mirror / static pods** — they can't be meaningfully evicted (kubelet recreates them); today only the `kube-system` ones are skipped.

## Implementation approach

Two structural moves, kept separate from the behavior change above.

### 1. Extract drain into its own unit

`DrainNode` / `IsDrained` / `ShouldEvict` are a self-contained cluster inside the 2500-line `skyhook_controller.go`. Move them to `operator/internal/drain` (a focused, unit-testable package mirroring `internal/graph`) — or at minimum `internal/controller/drain.go`. The feature only grows that surface, and the selection logic is exactly the kind of thing that wants table-driven unit tests independent of the reconcile harness.

### 2. Pod selection: reuse kubectl's library, or grow our own — leave to implementation

Deciding *which* pods are evictable (daemonset-managed, mirror/static pods, already-deleting pods, unmanaged/bare pods under `--force`, emptyDir pods under `--delete-emptydir-data`) is the bug-prone part. Two routes; this is an implementation detail, not a decision the issue needs to lock in up front.

First, note that today's `ShouldEvict` is **not** the same policy as kubectl's default filters — it's more aggressive in some places and more conservative in others, so neither route is behavior-preserving for free:

| Pod kind | Skyhook today | kubectl defaults (`Force`/`DeleteEmptyDirData` off) |
|---|---|---|
| emptyDir-backed | evicts | refuses (needs `--delete-emptydir-data`) |
| bare / unmanaged | evicts | refuses (needs `--force`) |
| mirror / static pod | evicts (unless `kube-system`) | always skips |
| terminating (`DeletionTimestamp`) | re-issues eviction | skips after timeout |
| any pod in `kube-system` | **skips** (blanket) | evicts (no namespace rule) |
| tolerates `unschedulable` | **skips** | evicts (no such rule) |

So adopting `GetPodsForDeletion` as-is would *both* start refusing emptyDir/bare pods we evict today *and* drop the `kube-system` + unschedulable-toleration skips (which would have to be re-added as `AdditionalFilters`). The "more aggressive on emptyDir/bare" rows are why we default `force` and `deleteEmptyDirData` to `true` (see [Compatibility](#compatibility--chosen-defaults)) — applies to either route. (Separately: the daemonset skip at `:1418` is guarded by `len(OwnerReferences) > 1`, but DaemonSet pods usually have exactly one owner ref; they're caught by the earlier unschedulable-toleration skip instead, so probably not a live bug — but verify when rewriting this line.)

Things to weigh:

**Route A — import `k8s.io/kubectl/pkg/drain`.** It's the exact library `kubectl drain` uses. `drain.Helper` fields map 1:1 to this ticket (`Force`, `IgnoreAllDaemonSets`, `DeleteEmptyDirData`, `DisableEviction`, `GracePeriodSeconds`), and `helper.GetPodsForDeletion(nodeName)` returns the evictable set. The draw is real: well-tested selection logic for free, including edge cases we'd otherwise re-discover. Caveats to validate before committing:

- The filter chain is internal. `GetPodsForDeletion` runs pods through an unexported `makeFilters()` (`skipDeletedFilter`, `daemonSetFilter`, `mirrorPodFilter`, `localStorageFilter`, `unreplicatedFilter`); the only customization points are the boolean `Helper` fields and an append-only `AdditionalFilters`. You take the chain as-is — some "magic" you don't control or see.
- **Do not call `Helper.RunNodeDrain`** — it blocks, looping until evictions complete, which violates the non-blocking reconcile model and would stall the operator under the global-reconcile-key proposal. Use only the selection helper; keep Skyhook's requeue-driven eviction. Consequently `--timeout` stays Skyhook's own deadline annotation, not `Helper.Timeout`.
- `drain.Helper.Client` is a client-go `kubernetes.Interface`; the controller only has the controller-runtime client via `dal` today (only the CLI builds a clientset, `internal/cli/client/client.go:64`). Adopting the library forces a clientset seam into the controller. Wrap behind `dal` or call it out per the "justify new patterns" rule.
- Adds `k8s.io/kubectl` to the vendor tree (not yet vendored; would pin to `v0.36.0` to match `k8s.io/api` / `k8s.io/client-go`). Sizeable transitive dependency.

**Route B — grow our own.** We already have `ShouldEvict` with the daemonset / unschedulable / kube-system checks. The remaining predicates (`skipDeleted`, `mirrorPod`, emptyDir guard, unmanaged-pod guard, grace-period + eviction-vs-delete) are each a few lines and well-documented. Hand-rolling keeps every exclusion explicit and unit-testable in `internal/drain`, adds no heavy dependency, and avoids forcing a clientset into the controller. Cost: we own correctness for ~4 small predicates.

Worth noting: we're already discarding the library's genuinely hard, battle-tested part — the driver loop (eviction-vs-delete, retries, wait-for-deletion). What's left to reuse is the selection, which is small. So "tested code for free" is a smaller win than it first looks. **Net steer (not a mandate):** prototype against the library; if `GetPodsForDeletion`'s opacity + the clientset seam fight the codebase, fall back to a small, well-tested in-house `internal/drain` rather than bending the controller around the library.

### 3. Preserve the golden-workload barrier — do not fold it into drain selection

`podNonInterruptLabels` ("golden workloads" — pods that must never be interrupted) is a **pre-drain barrier**, not an eviction-exclusion. `EnsureNodeIsReadyForInterrupt` (`:2605`) calls `HasNonInterruptWork` (`:1629`) *before* `DrainNode`; if a matching pod is Running/Pending on the node, the interrupt **waits and never drains**. The semantics (see `docs/interrupt_flow.md`) are *block the interrupt entirely until the golden pod leaves*, not *skip it and drain the rest*.

This must stay as a separate gate that runs before drain. **Do not** express golden workloads as a `drain.Helper` pod filter — a drain filter would skip the pod and proceed to drain everything else and interrupt anyway, silently breaking the contract. `GetPodsForDeletion` only ever replaces `ShouldEvict`'s hardcoded daemonset / kube-system / unschedulable logic; `HasNonInterruptWork` is untouched.

(Optionally, register `podNonInterruptLabels` as an `AdditionalFilters` entry on the Helper too, purely as a defensive backstop against a golden pod racing onto the node after the barrier check — but the barrier remains the load-bearing mechanism.)

### 4. Testing

Split coverage between unit and e2e by what each can actually exercise — and lean on the pattern the CLI already uses.

- **Unit (selection logic) — model it on the CLI tests.** The CLI tests selection/mutation against `k8s.io/client-go/kubernetes/fake` (9 files, e.g. `cmd/cli/app/node/node_reset_test.go`): seed objects into a fake clientset, inject it, run the code, assert on the recorded actions. The same approach covers drain selection directly: seed a node's pods (daemonset-owned, mirror, emptyDir, bare/unmanaged, golden-labeled, ordinary), run the selector, and assert which pods get an eviction/delete action and with what `DeleteOptions` (grace period), plus that the golden barrier issues **no** eviction. The fake clientset records the eviction-subresource create, so this is a pure unit test — no envtest, no cluster. This works for **either route**: Route A's `GetPodsForDeletion` and Route B's hand-rolled selector both run against a fake clientset. Note the controller's normal harness is envtest (no kubelet, no real eviction), so envtest can assert *that an eviction was issued* but not its real-world effect.
- **E2e (real eviction semantics) — chainsaw, drain-focused.** Anything the fake clientset can't model belongs in a `k8s-tests/chainsaw/` suite against a real cluster: PDB actually blocking an eviction, `disableEviction: true` bypassing that PDB via direct delete, `--timeout` expiry surfacing the node as `erroring`, and `--force` / emptyDir behavior end-to-end. If unit coverage of selection turns out thin (especially under Route A, where the chain is opaque), add lib-focused chainsaw cases that assert pod-level outcomes rather than just controller status.
- **Before building, read the CLI test setup** (`cmd/cli/app/...` + `internal/cli/client`) to see how the fake clientset is wired and injected; reuse that scaffolding rather than inventing a new one. If drain ends up on a clientset (Route A), the CLI's injection pattern is the precedent to follow.

## Prior art — how comparable projects drain

The projects most like nodewright (controller-runtime reconcilers managing node lifecycle at scale) all hand-roll a **non-blocking, requeue-driven eviction loop** and deliberately avoid the blocking `RunNodeDrain`. This corroborates both Skyhook's existing approach and the steers above.

- **Karpenter** (`kubernetes-sigs/karpenter`) — closest analogue. Finalizer-driven termination controller; evicts via the **Eviction API** directly (respects PDBs), requeues until drained; **no kubectl drain library**. Skips static/mirror pods, pods tolerating its own `karpenter.sh/disrupted:NoSchedule` taint, and completed pods — nearly identical to the skip set we settled on (unschedulable-toleration + mirror).
- **Cluster API** (`kubernetes-sigs/cluster-api`) — Machine-controller drain is **reconcile/requeue-driven (non-blocking)**, evicting by priority and requeuing; semantics "aligned to kubectl drain" but it vendored a `third_party/kubernetes-drain` copy rather than depending on the helper live. Default skips DaemonSet + static/mirror pods. It is adding a **`MachineDrainRule`** CRD for configurable per-pod behavior (`Drain` / `Skip` / `WaitCompleted` via label selectors, with a pod-annotation override) — direct prior art for `DrainConfig`, and its `Skip`/`WaitCompleted` is the same pattern as our `podNonInterruptLabels` golden-workload barrier.

By contrast, the projects that use the kubectl drain library *including* the blocking `RunNodeDrain` are daemons / single-node tools where blocking is acceptable (AWS node-termination-handler, medik8s node-maintenance-operator), or they shell out to the `kubectl` binary (kured, Rancher system-upgrade-controller). None are at-scale reconcilers. So the ecosystem consensus for our shape is "own the non-blocking eviction loop; use the library for selection at most," which is the steer in §2.

References: Karpenter termination design (`kubernetes-sigs/karpenter/designs/`), Cluster API MachineDrainRule proposal (`docs/proposals/20240930-machine-drain-rules.md`), `k8s.io/kubectl/pkg/drain` godoc.

## Acceptance criteria

- New optional `drainConfig` field on the Skyhook CRD; unset preserves today's behavior exactly (`make manifests generate` run; CRD YAML + deepcopy regenerated; `chart/` CRD mirrored).
- `disableEviction: true` deletes pods directly and drains through a zero-allowance PDB that would otherwise block — chainsaw e2e (real PDB enforcement isn't exercised by envtest/fake clientset).
- `timeout` expiry surfaces the node as `erroring` with an event, instead of requeuing indefinitely — chainsaw e2e with a non-evictable pod.
- `ignoreDaemonSets: false` makes DaemonSet pods eviction candidates — unit (fake-clientset selection test).
- Golden-workload contract preserved: a pod matching `podNonInterruptLabels` that is Running/Pending blocks the interrupt (node waits, never drains) regardless of `drainConfig` — unit test (fake clientset) asserting no eviction action is issued while a golden pod is present.
- `force` and `deleteEmptyDirData` default to `true` (unset `drainConfig` evicts bare + emptyDir pods exactly as today); setting either `false` makes the corresponding pods block the drain. Divergence from kubectl's defaults documented in `docs/interrupt_flow.md` and the changelog.
- `kube-system` and unschedulable-tolerating pods are never evicted regardless of `drainConfig`; terminating (`DeletionTimestamp`) and mirror/static pods are skipped — covered by the fake-clientset selection test.
- `docs/interrupt_flow.md` updated to document the new options and their defaults; changelog note for any behavior change.
- CLI: surface `drainConfig` in `kubectl skyhook` output only if a status/describe path already renders spec fields; otherwise no CLI change needed (note it in the PR per the CLI-contract rule).

## References

- `kubectl drain` flag reference: <https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain>
- Drain helper <https://pkg.go.dev/k8s.io/kubectl/pkg/drain>
- Eviction API (PDB-aware): <https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/>
- Current drain code: `operator/internal/controller/skyhook_controller.go:1381` (`IsDrained`), `:1407` (`ShouldEvict`), `:1678` (`DrainNode`)


kubectl flag	Today	Gap
`--ignore-daemonsets`	Effectively hardcoded on — DaemonSet pods always skipped.	No way to turn off (rarely wanted, but currently not expressible).
`--disable-eviction`	Unsupported. Always uses the eviction API, so PDBs always apply.	No direct-`DELETE` path to bypass PDBs when an operator explicitly wants force-through.
`--force`	Bare / unmanaged pods are evicted anyway (no orphan guard).	kubectl `--force` deletes unmanaged pods; we evict them. Mostly equivalent, but no explicit opt-in/refuse-without-flag behavior.
`--delete-emptydir-data`	Not honored. emptyDir-backed pods are evicted with no guard.	Operator is more aggressive than kubectl's default (kubectl refuses without this flag). To mirror kubectl we'd add a guard that refuses emptyDir pods unless opted in — a behavior change for existing users.
`--timeout`	None. Drain requeues until drained, forever.	A stuck pod (e.g. PDB at zero, finalizer wedged) blocks the interrupt indefinitely with no surfaced failure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make node-drain behavior configurable (mirror `kubectl drain` options) #259

Summary

Current behavior

Proposed change

Out of scope

Compatibility — chosen defaults

Skyhook-specific skips kept as permanent policy (not flags)

"Match drain" for the two harmless gaps

Implementation approach

1. Extract drain into its own unit

2. Pod selection: reuse kubectl's library, or grow our own — leave to implementation

3. Preserve the golden-workload barrier — do not fold it into drain selection

4. Testing

Prior art — how comparable projects drain

Acceptance criteria

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Pod kind	Skyhook today	kubectl defaults (`Force`/`DeleteEmptyDirData` off)
emptyDir-backed	evicts	refuses (needs `--delete-emptydir-data`)
bare / unmanaged	evicts	refuses (needs `--force`)
mirror / static pod	evicts (unless `kube-system`)	always skips
terminating (`DeletionTimestamp`)	re-issues eviction	skips after timeout
any pod in `kube-system`	skips (blanket)	evicts (no namespace rule)
tolerates `unschedulable`	skips	evicts (no such rule)

Uh oh!

Make node-drain behavior configurable (mirror kubectl drain options) #259

Description

Summary

Current behavior

Proposed change

Out of scope

Compatibility — chosen defaults

Skyhook-specific skips kept as permanent policy (not flags)

"Match drain" for the two harmless gaps

Implementation approach

1. Extract drain into its own unit

2. Pod selection: reuse kubectl's library, or grow our own — leave to implementation

3. Preserve the golden-workload barrier — do not fold it into drain selection

4. Testing

Prior art — how comparable projects drain

Acceptance criteria

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Make node-drain behavior configurable (mirror `kubectl drain` options) #259