fix(api-rs): reap orphaned sandboxes and backstop lost idle pauses by fredo · Pull Request #524 · paradigmxyz/centaur

fredo · 2026-06-12T09:19:33Z

Problem

Sandbox desired state lives in process memory (InMemoryDesiredStateStore) and idle pauses are per-execution tokio::spawn(sleep(...)) timers. A control-plane restart (deploy, crash, OOM-kill) therefore permanently orphans every live sandbox: the pods keep running, nothing references them, and no code path will ever stop them. list_observed() exists on the backend trait but is only reachable through the manual drain admin route.

We hit the end state of this in production-like use on a single-node cluster: dozens of unreferenced 512Mi sandbox pods accumulated until memory requests reached ~99% of allocatable, after which every new agent_turn failed with sandbox readiness timed out after 60s. The same failure class is reported in #172 (thread sandboxes never garbage-collect — which proposes exactly the periodic reconciler implemented here) and #171 (warm pods orphaned across restarts). #349 fixes the warm-pool slice of this on the legacy Python side; this PR covers all session sandboxes on the api-rs side.

Fix

Add a periodic sandbox janitor to centaur-session-runtime, spawned from main.rs beside the existing background tasks. Two independent arms per tick:

Orphan sweep: diff backend.list_observed() against the durable references in Postgres (sessions.sandbox_id ∪ ready/claimed rows in session_warm_sandboxes); stop() any running or suspended sandbox that nothing references. Safety: a sandbox must be unreferenced on two consecutive passes before it is stopped (in-flight creations get a full interval to register), Created-status sandboxes are never touched (creation can legitimately outlast an interval while pulling images), and failed stops stay pending and retry on the next pass.
Idle backstop: pause sessions whose latest execution is terminal and older than a TTL and that have no active execution, via the existing record_idle_pause (which re-validates the latest execution and sandbox status, so racing a live turn is safe). This restores idle-pausing for sessions whose in-process timer died with a restart; it is the sandbox-pod analogue of what #486 (feat: adopt orphaned executions and unwedge render recovery) did for executions.
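The two-pass rule of the orphan sweep can be sketched as a pure function over the observed sandboxes, the referenced ids, and the set left pending by the previous pass. This is a minimal illustration, not the code in this PR: `Status`, `SweepPlan`, `plan_sweep`, and `next_pending` are hypothetical names, and the real backend and store types may look different.

```rust
use std::collections::{HashMap, HashSet};

/// Observed backend status of a sandbox (hypothetical enum for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Created,
    Running,
    Suspended,
    Terminated,
}

/// Outcome of one janitor pass (hypothetical type for this sketch).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct SweepPlan {
    /// Unreferenced on this pass and on the previous one: stop these now.
    pub reap: Vec<String>,
    /// Unreferenced for the first time on this pass: wait one more interval.
    pub pending: HashSet<String>,
}

/// Decide what to stop, given what the backend reports, what Postgres still
/// references, and what the previous pass left pending.
pub fn plan_sweep(
    observed: &HashMap<String, Status>,
    referenced: &HashSet<String>,
    previously_pending: &HashSet<String>,
) -> SweepPlan {
    let mut plan = SweepPlan::default();
    for (id, status) in observed {
        // Only running or suspended sandboxes are candidates. Created ones
        // may still be pulling images; terminal ones need no stop.
        if !matches!(status, Status::Running | Status::Suspended) {
            continue;
        }
        // A durable reference rescues the sandbox, even if it was pending.
        if referenced.contains(id) {
            continue;
        }
        if previously_pending.contains(id) {
            plan.reap.push(id.clone());
        } else {
            plan.pending.insert(id.clone());
        }
    }
    // Sort for deterministic output; HashMap iteration order is unspecified.
    plan.reap.sort();
    plan
}

/// Pending set carried into the next pass: first-time candidates plus any
/// sandbox whose stop failed, so the stop is retried instead of dropped.
pub fn next_pending(plan: &SweepPlan, failed_stops: &[String]) -> HashSet<String> {
    let mut next = plan.pending.clone();
    next.extend(failed_stops.iter().cloned());
    next
}
```

Because the pending set is rebuilt from the observed list on every pass, an id that was pending but has since vanished from the backend drops out without special handling.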

Configuration (both env-tunable, additive, no behavior change when disabled):

--session-sandbox-janitor-interval-secs (default 300, 0 disables the janitor)
--session-sandbox-janitor-idle-backstop-secs (default 21600, 0 disables the idle arm)
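One way to model the "0 disables" convention of these two flags is to map each value to an optional duration, so the disabled case is explicit in the type. The sketch below assumes nothing about the actual argument parsing in main.rs: `JanitorConfig`, `secs_to_option`, and `is_idle_candidate` are hypothetical names, and the strict "older than" comparison is an assumption.

```rust
use std::time::Duration;

/// Map a seconds flag to an optional duration; 0 means "disabled".
pub fn secs_to_option(secs: u64) -> Option<Duration> {
    if secs == 0 {
        None
    } else {
        Some(Duration::from_secs(secs))
    }
}

/// Janitor settings derived from the two flags (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct JanitorConfig {
    /// Time between janitor passes.
    pub interval: Duration,
    /// Idle backstop TTL; None disables only the idle arm.
    pub idle_backstop: Option<Duration>,
}

impl JanitorConfig {
    /// Returns None when the interval is 0, i.e. the whole janitor is off
    /// and no background task should be spawned.
    pub fn from_secs(interval_secs: u64, idle_backstop_secs: u64) -> Option<Self> {
        let interval = secs_to_option(interval_secs)?;
        Some(Self {
            interval,
            idle_backstop: secs_to_option(idle_backstop_secs),
        })
    }
}

/// Idle-arm eligibility as described in the PR text: the latest execution is
/// terminal, it finished longer ago than the TTL, and nothing is active.
pub fn is_idle_candidate(
    latest_is_terminal: bool,
    has_active_execution: bool,
    finished_ago: Duration,
    ttl: Duration,
) -> bool {
    latest_is_terminal && !has_active_execution && finished_ago > ttl
}
```

With the documented defaults, `JanitorConfig::from_secs(300, 21600)` yields a five-minute interval and a six-hour idle TTL, while `JanitorConfig::from_secs(0, 21600)` yields `None`.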

No schema migration; the two new store queries are read-only.

Verification

cargo fmt; cargo clippy --all-targets -- -D warnings clean on the three touched crates
cargo test -p centaur-session-runtime -p centaur-session-sqlx -p centaur-sandbox-manager green, including 7 new unit tests on the pure reap-selection logic (pending→reap progression, referenced rescue, terminal/Created skip, vanished-pending cleanup)
Known gap, stated openly: the two SQL queries have no DB-backed test — the crate currently has no Postgres test harness; happy to add one if you have a preferred shape.

A Python-side equivalent of this fix (reconciler pod sweep + retry-on-failed-teardown in agent.py) has been running on our deployment since the incident; we can contribute that too if the legacy path's lifetime warrants it.

Refs #172, #171. Complements #349.

Sandbox desired state lives in process memory and idle pauses are per-execution tokio timers, so a control-plane restart (or a create/register race) leaves running sandboxes that nothing will ever stop; they accumulate until their memory requests exhaust the node and every new sandbox times out waiting for readiness.

Add a periodic sandbox janitor to the session runtime:

- stop backend sandboxes with no sessions.sandbox_id or ready/claimed warm-pool reference, after two consecutive unreferenced passes so in-flight creations have a full interval to register; sandboxes still in Created are never touched
- pause sessions whose latest execution finished before a backstop TTL and have no active execution, reusing the record_idle_pause guards
- retry failed stops/pauses on the next pass instead of dropping them

Wired behind --session-sandbox-janitor-interval-secs (default 300, 0 disables) and --session-sandbox-janitor-idle-backstop-secs (default 21600, 0 disables the idle arm).

fredo force-pushed the api-rs-sandbox-janitor branch from ed150c6 to f3ab8e9 on June 12, 2026 09:26

fredo wants to merge 1 commit into paradigmxyz:main from fredo:api-rs-sandbox-janitor
