Skip to content

provider: short-circuit retries when modal sandbox is dead#206

Draft
evgunter wants to merge 1 commit into
mainfrom
mngr/ci-flakes
Draft

provider: short-circuit retries when modal sandbox is dead#206
evgunter wants to merge 1 commit into
mainfrom
mngr/ci-flakes

Conversation

@evgunter
Copy link
Copy Markdown

@evgunter evgunter commented May 1, 2026

a claude found a mngr ci issue where offload was waiting on a dead sandbox. this PR is not really my suggestion for a fix, but more just a way to unambiguously indicate what the issue is. (by which i mean that i have no reason to think the fix claude came up with here is actually good)


Summary

Fixes a CI-corrupting interaction between offload's retry loop and the Modal SDK's own gRPC retry loop. When a Modal sandbox dies (status='failure') before its junit.xml is written, DefaultSandbox::download returns ProviderError::DownloadFailed, which is_retryable() classifies as retryable. with_retry! therefore makes up to three rust-level attempts. Each attempt invokes modal_sandbox.py download, which opens the sandbox via sandbox.open(), which itself triggers the Modal SDK's internal gRPC retry against the dead container -- tens of seconds per attempt. The product is roughly four minutes of polling a corpse per dead sandbox, after which the "Not Run" tally reports phantom failures (e.g. "91 tests failed" from a single sandbox).

Observed in imbue-ai/mngr CI runs 25225190138 and 25223712168.

Fix

  • modal_sandbox.py download polls sandbox.poll() before any sandbox.open() and short-circuits with a OFFLOAD_SANDBOX_DEAD: <reason> stderr sentinel when the sandbox is already finished. The same sentinel is emitted if sandbox.open() later fails with Modal's Container ID ... finished ... status= wording (covers the race where a sandbox dies mid-download).
  • DefaultSandbox::download parses the sentinel out of stderr and maps it to a new ProviderError::SandboxDead(String) variant.
  • is_retryable deliberately excludes SandboxDead, so with_retry! bails on the first attempt instead of polling a corpse.

DownloadFailed semantics are unchanged for genuinely transient download errors. Both default and modal providers are covered because ModalProvider wraps DefaultSandbox.

Test plan

  • cargo build
  • cargo nextest run (217/217 pass, including new sandbox_dead_is_not_retryable and parse_sandbox_dead_marker_* cases)
  • cargo fmt --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • Re-run mngr CI on a flake-prone job to confirm dead sandboxes now fail in seconds rather than ~4 minutes and stop generating phantom failures.

🤖 Generated with Claude Code

When a Modal sandbox dies (status='failure') before its junit.xml is
written, the existing retry path in `TestRunner::try_download_results`
drives `DefaultSandbox::download` through `with_retry!`, which classifies
`ProviderError::DownloadFailed` as retryable. Each rust-level attempt
calls into `modal_sandbox.py download`, which opens the sandbox via
`sandbox.open()`. The Modal SDK then runs its own internal gRPC retry
loop (tens of seconds per attempt) against a container that is already
gone. Three rust attempts x Modal-internal retry add up to ~4 minutes
per dead sandbox -- long enough that downstream "Not Run" reporting
turns the dead container into a phantom multi-test failure (e.g. "91
tests failed" from a single sandbox), as observed in imbue-ai/mngr CI
runs 25225190138 and 25223712168.

Fix:
- `modal_sandbox.py download` now polls `sandbox.poll()` before any
  `sandbox.open()` call. If the sandbox has already finished, it emits
  an `OFFLOAD_SANDBOX_DEAD: <reason>` sentinel on stderr and exits
  before the SDK can start its retry loop. The same sentinel is emitted
  if `sandbox.open()` later fails with Modal's "finished ... status="
  wording (covers the race where a sandbox dies mid-download).
- `DefaultSandbox::download` parses the sentinel out of stderr and maps
  it to a new `ProviderError::SandboxDead(String)` variant.
- `is_retryable` deliberately excludes `SandboxDead`, so `with_retry!`
  bails on the first attempt instead of polling a corpse.

The fix keeps the existing `DownloadFailed` semantics for genuinely
transient download errors. Both `default` and `modal` providers are
covered because `ModalProvider` wraps `DefaultSandbox`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant