Host drain-and-relocate workflow before kernel/OS patch reboot

## Problem

There is currently no way to take a backend host offline for a kernel/OS
patch without hard-stopping every tenant container running on it.

`containarium capacity withdraw --drain` gracefully stops workloads within a
bounded window, but it is scoped to BYOC-advertised spare capacity only, not
"every tenant on this host." `internal/cmd/pool_leave.go` already carries a
comment acknowledging that workload drain/migration for pool-leave is an
unbuilt follow-up. There is no container live-migration primitive in this
codebase.

Net effect: patching a host kernel that needs a reboot means either
hard-stopping every tenant on that host, or skipping the patch. Auto-upgrade
timers are deliberately disabled fleet-wide ("manual patching only"; see
`terraform/gce/scripts/startup-sentinel.sh`, `startup-spot.sh`), so this
trade-off isn't hypothetical: manual patching is the only path today.

## Proposal

A general-purpose "drain a host for maintenance" primitive, independent of
the BYOC capacity-advertisement feature:

1. Mark a backend `draining` — stop scheduling new containers onto it.
2. For each running container: attempt a graceful stop (respecting in-flight
   work within a bounded window, similar to the existing `--drain-window`
   knob), or relocate it via the existing cross-backend `move_container` path
   where the workload supports it.
3. Report per-container outcome (drained / relocated / force-stopped /
   failed) so an operator knows the blast radius before rebooting.
4. Only after the host reports empty does the maintenance/reboot proceed.

## Why this matters

This is the structural gap behind the "host-kernel LPE reaches every tenant"
caveat in `docs/security/SECURITY-FAQ.md` — today there's no way to respond
to a kernel CVE without either accepting downtime for every tenant on a host
or leaving the vulnerability unpatched.

## Related

- `docs/security/SECURITY-FAQ.md`
- `docs/security/KERNEL-PATCH-RUNBOOK.md` (companion runbook, references this
  issue as the blocking gap)
- `internal/cmd/capacity.go` (existing bounded-drain primitive to generalize)
- `internal/cmd/pool_leave.go` (prior TODO admitting the same gap)

Host drain-and-relocate workflow before kernel/OS patch reboot #889
