fix(api-rs): reap orphaned sandboxes and backstop lost idle pauses #524
Open
fredo wants to merge 1 commit into
Sandbox desired state lives in process memory and idle pauses are per-execution tokio timers, so a control-plane restart (or a create/register race) leaves running sandboxes that nothing will ever stop; they accumulate until their memory requests exhaust the node and every new sandbox times out waiting for readiness.

Add a periodic sandbox janitor to the session runtime:

- stop backend sandboxes with no sessions.sandbox_id or ready/claimed warm-pool reference, after two consecutive unreferenced passes so in-flight creations have a full interval to register; sandboxes still in Created are never touched
- pause sessions whose latest execution finished before a backstop TTL and have no active execution, reusing the record_idle_pause guards
- retry failed stops/pauses on the next pass instead of dropping them

Wired behind --session-sandbox-janitor-interval-secs (default 300, 0 disables) and --session-sandbox-janitor-idle-backstop-secs (default 21600, 0 disables the idle arm).
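As a rough illustration of the flag semantics and the idle-arm eligibility rule described above, here is a minimal sketch. It is not the code in this PR: `secs_to_opt`, `SessionActivity`, and `idle_pause_due` are hypothetical names, and the real path goes through the record_idle_pause guards rather than an in-memory struct.

```rust
use std::time::{Duration, SystemTime};

/// Both janitor flags treat 0 as "disabled"; this sketch models that as `None`.
/// (Hypothetical helper, not taken from the PR.)
fn secs_to_opt(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Hypothetical, simplified view of a session as seen by the idle arm.
struct SessionActivity {
    has_active_execution: bool,
    latest_execution_finished_at: Option<SystemTime>,
}

/// True when the session's latest execution finished more than `backstop` ago
/// and nothing is currently executing. Always false when the idle arm is disabled.
fn idle_pause_due(s: &SessionActivity, now: SystemTime, backstop: Option<Duration>) -> bool {
    let ttl = match backstop {
        Some(ttl) => ttl,
        None => return false, // idle arm disabled (flag set to 0)
    };
    if s.has_active_execution {
        return false; // never pause under a live turn
    }
    match s.latest_execution_finished_at {
        // A finish time in the future (clock skew) yields Err and is treated as not idle.
        Some(finished) => now
            .duration_since(finished)
            .map_or(false, |idle| idle > ttl),
        None => false, // no finished execution recorded yet
    }
}
```

With the default of 21600 seconds, a session whose last execution ended seven hours ago and that has no active execution would be selected; one that finished a minute ago would not.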
Problem
Sandbox desired state lives in process memory (InMemoryDesiredStateStore) and idle pauses are per-execution tokio::spawn(sleep(...)) timers. A control-plane restart (deploy, crash, OOM-kill) therefore permanently orphans every live sandbox: the pods keep running, nothing references them, and no code path will ever stop them. list_observed() exists on the backend trait but is only reachable through the manual drain admin route.

We hit the end state of this in production-like use on a single-node cluster: dozens of unreferenced 512Mi sandbox pods accumulated until memory requests reached ~99% of allocatable, after which every new agent_turn failed with "sandbox readiness timed out after 60s". The same failure class is reported in #172 (thread sandboxes never garbage-collect; it proposes exactly the periodic reconciler implemented here) and #171 (warm pods orphaned across restarts). #349 fixes the warm-pool slice of this on the legacy Python side; this PR covers all session sandboxes on the api-rs side.

Fix
Add a periodic sandbox janitor to centaur-session-runtime, spawned from main.rs beside the existing background tasks. Two independent arms per tick:

- Compare backend.list_observed() against the durable references in Postgres (sessions.sandbox_id ∪ ready/claimed rows in session_warm_sandboxes); stop() anything running/suspended that nothing references. Safety: a sandbox must be unreferenced on two consecutive passes before it is stopped (in-flight creations get a full interval to register), Created-status sandboxes are never touched (creation can legitimately outlast an interval while pulling images), and failed stops stay pending and retry next pass.
- Pause sessions whose latest execution finished before the backstop TTL and that have no active execution, going through record_idle_pause (which re-validates the latest execution and sandbox status, so racing a live turn is safe). This restores idle-pausing for sessions whose in-process timer died with a restart, the sandbox-pod analogue of what #486 (feat: adopt orphaned executions and unwedge render recovery) did for executions.

Configuration (both env-tunable, additive, no behavior change when disabled):

- --session-sandbox-janitor-interval-secs (default 300, 0 disables the janitor)
- --session-sandbox-janitor-idle-backstop-secs (default 21600, 0 disables the idle arm)

No schema migration; the two new store queries are read-only.
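For reviewers who want the reap-selection rules in one place, the following is a minimal sketch of how such a pure selection step could look. It is an illustration under assumptions, not the code in this PR: `Status`, `Pending`, `PASSES_BEFORE_REAP`, and `select_reapable` are hypothetical names, and only the Created status is named in the description; the other variants are stand-ins for running, suspended, and terminal states.

```rust
use std::collections::{HashMap, HashSet};

/// Backend-observed sandbox status (hypothetical variants for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Created,
    Running,
    Suspended,
    Terminated,
}

/// Sandbox id -> number of consecutive passes on which it was seen unreferenced.
type Pending = HashMap<String, u32>;

/// A sandbox must be unreferenced on this many consecutive passes before it is stopped.
const PASSES_BEFORE_REAP: u32 = 2;

/// One janitor pass. Returns the ids to stop now and replaces `pending`
/// with the state to carry into the next pass.
fn select_reapable(
    observed: &[(String, Status)],
    referenced: &HashSet<String>,
    pending: &mut Pending,
) -> Vec<String> {
    let mut next = Pending::new();
    let mut reap = Vec::new();
    for (id, status) in observed {
        // Only running/suspended sandboxes are candidates; Created and
        // terminal sandboxes are never touched.
        if !matches!(*status, Status::Running | Status::Suspended) {
            continue;
        }
        // A durable reference rescues the sandbox and resets its count.
        if referenced.contains(id) {
            continue;
        }
        let seen = pending.get(id).copied().unwrap_or(0).saturating_add(1);
        if seen >= PASSES_BEFORE_REAP {
            reap.push(id.clone());
        }
        // Keep the entry even when reaping: if the stop fails, the sandbox is
        // still observed next pass and is selected again (retry).
        next.insert(id.clone(), seen);
    }
    // Rebuilding from `observed` drops pending entries whose sandbox vanished.
    *pending = next;
    reap
}
```

Keeping the selection free of I/O means the pending-to-reap progression, the referenced rescue, the terminal/Created skip, and the vanished-pending cleanup can each be exercised with plain in-memory inputs.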
Verification
- cargo fmt; cargo clippy --all-targets -- -D warnings clean on the three touched crates
- cargo test -p centaur-session-runtime -p centaur-session-sqlx -p centaur-sandbox-manager green, including 7 new unit tests on the pure reap-selection logic (pending→reap progression, referenced rescue, terminal/Created skip, vanished-pending cleanup)

A Python-side equivalent of this fix (reconciler pod sweep + retry-on-failed-teardown in agent.py) has been running on our deployment since the incident; we can contribute that too if the legacy path's lifetime warrants it.

Refs #172, #171. Complements #349.