fix(api-rs): reap orphaned sandboxes and backstop lost idle pauses #524

Open
fredo wants to merge 1 commit into paradigmxyz:main from fredo:api-rs-sandbox-janitor
@fredo fredo commented Jun 12, 2026

Problem

Sandbox desired state lives in process memory (InMemoryDesiredStateStore) and idle pauses are per-execution tokio::spawn(sleep(...)) timers. A control-plane restart (deploy, crash, OOM-kill) therefore permanently orphans every live sandbox: the pods keep running, nothing references them, and no code path will ever stop them. list_observed() exists on the backend trait but is only reachable through the manual drain admin route.

We hit the end state of this in production-like use on a single-node cluster: dozens of unreferenced 512Mi sandbox pods accumulated until memory requests reached ~99% of allocatable, after which every new agent_turn failed with "sandbox readiness timed out after 60s". The same failure class is reported in #172 ("thread sandboxes never garbage-collect", which proposes exactly the periodic reconciler implemented here) and #171 (warm pods orphaned across restarts). #349 fixes the warm-pool slice of this on the legacy Python side; this PR covers all session sandboxes on the api-rs side.

Fix

Add a periodic sandbox janitor to centaur-session-runtime, spawned from main.rs beside the existing background tasks. Two independent arms per tick:

  • Orphan sweep: diff backend.list_observed() against the durable references in Postgres (sessions.sandbox_id plus ready/claimed rows in session_warm_sandboxes), then stop() anything running/suspended that nothing references. Safety: a sandbox must be unreferenced on two consecutive passes before it is stopped (in-flight creations get a full interval to register), Created-status sandboxes are never touched (creation can legitimately outlast an interval while pulling images), and failed stops stay pending and retry next pass.
  • Idle backstop: pause sessions whose latest execution is terminal and older than a TTL, and that have no active execution, via the existing record_idle_pause (which re-validates the latest execution and sandbox status, so racing a live turn is safe). This restores idle-pausing for sessions whose in-process timer died with a restart. It is the sandbox-pod analogue of what #486 (feat: adopt orphaned executions and unwedge render recovery) did for executions.
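
For reviewers, the orphan-sweep selection rule can be sketched as a pure function. This is a minimal illustration of the two-consecutive-pass rule described above, not the code in this PR: the names Status, Observed and select_reaps are hypothetical, and the status set is an assumption about what the backend reports.

```rust
use std::collections::HashSet;

// Hypothetical status set; the real backend enum may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Created,
    Running,
    Suspended,
    Stopped,
}

// Hypothetical shape of one entry returned by a list_observed()-style call.
#[derive(Clone, Debug)]
struct Observed {
    id: String,
    status: Status,
}

/// One janitor pass over the observed sandboxes.
///
/// `pending` holds the ids that were unreferenced on the previous pass.
/// Returns the ids to stop on this pass and replaces `pending` with the
/// set to carry into the next pass.
fn select_reaps(
    observed: &[Observed],
    referenced: &HashSet<String>,
    pending: &mut HashSet<String>,
) -> Vec<String> {
    let mut reap = Vec::new();
    let mut next = HashSet::new();
    for sb in observed {
        // Only running/suspended sandboxes are candidates; Created and
        // terminal sandboxes are skipped entirely.
        if !matches!(sb.status, Status::Running | Status::Suspended) {
            continue;
        }
        // A durable reference rescues the sandbox, even if it was pending.
        if referenced.contains(&sb.id) {
            continue;
        }
        // Second consecutive unreferenced pass: select it for stopping.
        if pending.contains(&sb.id) {
            reap.push(sb.id.clone());
        }
        // Stay pending either way, so a failed stop is retried next pass.
        next.insert(sb.id.clone());
    }
    // Ids that vanished from `observed` drop out of the pending set here.
    *pending = next;
    reap
}
```

Keeping a selected id in the pending set is what makes failed stops retry: if stop() fails, the sandbox is still observed on the next pass and is selected again, and if it succeeds, the id disappears from the observed list and is dropped.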

Configuration (both env-tunable, additive, no behavior change when disabled):

  • --session-sandbox-janitor-interval-secs (default 300, 0 disables the janitor)
  • --session-sandbox-janitor-idle-backstop-secs (default 21600, 0 disables the idle arm)
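
As a sketch of the "0 disables" semantics of the two flags above, the raw second counts could be mapped to optional durations as follows. JanitorConfig and janitor_config are hypothetical names used for illustration only; how this PR actually parses and stores the flags is not shown here.

```rust
use std::time::Duration;

// Hypothetical resolved configuration; field names are illustrative.
#[derive(Debug, PartialEq, Eq)]
struct JanitorConfig {
    /// Time between janitor passes.
    interval: Duration,
    /// Idle backstop TTL; None means the idle arm is disabled.
    idle_backstop: Option<Duration>,
}

/// Maps the raw flag values (in seconds) to a config.
/// Returns None when the interval is 0, i.e. the janitor is disabled.
fn janitor_config(interval_secs: u64, idle_backstop_secs: u64) -> Option<JanitorConfig> {
    if interval_secs == 0 {
        return None;
    }
    Some(JanitorConfig {
        interval: Duration::from_secs(interval_secs),
        // A backstop of 0 disables only the idle arm; the orphan sweep still runs.
        idle_backstop: (idle_backstop_secs > 0)
            .then(|| Duration::from_secs(idle_backstop_secs)),
    })
}
```

With the stated defaults, janitor_config(300, 21600) yields a pass every 5 minutes and a 6-hour idle backstop.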

No schema migration; the two new store queries are read-only.

Verification

  • cargo fmt; cargo clippy --all-targets -- -D warnings clean on the three touched crates
  • cargo test -p centaur-session-runtime -p centaur-session-sqlx -p centaur-sandbox-manager green, including 7 new unit tests on the pure reap-selection logic (pending→reap progression, referenced rescue, terminal/Created skip, vanished-pending cleanup)
  • Known gap, stated openly: the two SQL queries have no DB-backed test — the crate currently has no Postgres test harness; happy to add one if you have a preferred shape.

A Python-side equivalent of this fix (reconciler pod sweep + retry-on-failed-teardown in agent.py) has been running on our deployment since the incident; we can contribute that too if the legacy path's lifetime warrants it.

Refs #172, #171. Complements #349.

Sandbox desired state lives in process memory and idle pauses are
per-execution tokio timers, so a control-plane restart (or a
create/register race) leaves running sandboxes that nothing will ever
stop; they accumulate until their memory requests exhaust the node and
every new sandbox times out waiting for readiness.

Add a periodic sandbox janitor to the session runtime:

- stop backend sandboxes with no sessions.sandbox_id or ready/claimed
  warm-pool reference, after two consecutive unreferenced passes so
  in-flight creations have a full interval to register; sandboxes still
  in Created are never touched
- pause sessions whose latest execution finished before a backstop TTL
  and have no active execution, reusing the record_idle_pause guards
- retry failed stops/pauses on the next pass instead of dropping them

Wired behind --session-sandbox-janitor-interval-secs (default 300, 0
disables) and --session-sandbox-janitor-idle-backstop-secs (default
21600, 0 disables the idle arm).
@fredo fredo force-pushed the api-rs-sandbox-janitor branch from ed150c6 to f3ab8e9 Compare June 12, 2026 09:26