provider: short-circuit retries when modal sandbox is dead #206
Draft
evgunter wants to merge 1 commit into
When a Modal sandbox dies (status='failure') before its junit.xml is written, the existing retry path in `TestRunner::try_download_results` drives `DefaultSandbox::download` through `with_retry!`, which classifies `ProviderError::DownloadFailed` as retryable. Each rust-level attempt calls into `modal_sandbox.py download`, which opens the sandbox via `sandbox.open()`. The Modal SDK then runs its own internal gRPC retry loop (tens of seconds per attempt) against a container that is already gone. Three rust attempts x Modal-internal retry add up to ~4 minutes per dead sandbox -- long enough that downstream "Not Run" reporting turns the dead container into a phantom multi-test failure (e.g. "91 tests failed" from a single sandbox), as observed in imbue-ai/mngr CI runs 25225190138 and 25223712168.

Fix:
- `modal_sandbox.py download` now polls `sandbox.poll()` before any `sandbox.open()` call. If the sandbox has already finished, it emits an `OFFLOAD_SANDBOX_DEAD: <reason>` sentinel on stderr and exits before the SDK can start its retry loop. The same sentinel is emitted if `sandbox.open()` later fails with Modal's "finished ... status=" wording (covers the race where a sandbox dies mid-download).
- `DefaultSandbox::download` parses the sentinel out of stderr and maps it to a new `ProviderError::SandboxDead(String)` variant.
- `is_retryable` deliberately excludes `SandboxDead`, so `with_retry!` bails on the first attempt instead of polling a corpse.

The fix keeps the existing `DownloadFailed` semantics for genuinely transient download errors. Both `default` and `modal` providers are covered because `ModalProvider` wraps `DefaultSandbox`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
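The sentinel handling described in the commit message can be sketched roughly as follows. This is an illustration, not the code in this PR: the two-variant `ProviderError`, the `classify_download_failure` helper, and the exact signature of `parse_sandbox_dead_marker` are assumptions made for the example. Only the `OFFLOAD_SANDBOX_DEAD:` prefix, the `SandboxDead(String)` variant, and the retryable/non-retryable split come from the description.

```rust
/// Hypothetical stand-in for the provider error type; only the two
/// variants named in the description are modeled here.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    DownloadFailed(String),
    SandboxDead(String),
}

/// Prefix the download script is described as writing to stderr.
const SANDBOX_DEAD_MARKER: &str = "OFFLOAD_SANDBOX_DEAD:";

/// Returns the trimmed reason that follows the first marker found in
/// `stderr`, or `None` if no line contains the marker.
pub fn parse_sandbox_dead_marker(stderr: &str) -> Option<String> {
    stderr.lines().find_map(|line| {
        let idx = line.find(SANDBOX_DEAD_MARKER)?;
        let reason = line[idx + SANDBOX_DEAD_MARKER.len()..].trim();
        Some(reason.to_string())
    })
}

/// Assumed helper: turns the stderr of a failed download into an error,
/// preferring `SandboxDead` whenever the sentinel is present.
pub fn classify_download_failure(stderr: &str) -> ProviderError {
    match parse_sandbox_dead_marker(stderr) {
        Some(reason) => ProviderError::SandboxDead(reason),
        None => ProviderError::DownloadFailed(stderr.trim().to_string()),
    }
}

/// Transient download errors stay retryable; a dead sandbox does not.
pub fn is_retryable(err: &ProviderError) -> bool {
    match err {
        ProviderError::DownloadFailed(_) => true,
        ProviderError::SandboxDead(_) => false,
    }
}
```

Matching on the enum exhaustively, rather than using a wildcard arm, means that adding a further variant later forces an explicit decision about whether it should be retried.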
a claude found a mngr ci issue where offload was waiting on a dead sandbox. this PR is not really my suggestion for a fix, but more just a way to unambiguously indicate what the issue is. (by which i mean that i have no reason to think the fix claude came up with here is actually good)
Summary
Fixes a CI-corrupting interaction between offload's retry loop and the Modal SDK's own gRPC retry loop. When a Modal sandbox dies (`status='failure'`) before its `junit.xml` is written, `DefaultSandbox::download` returns `ProviderError::DownloadFailed`, which `is_retryable()` classifies as retryable. `with_retry!` therefore makes up to three rust-level attempts. Each attempt invokes `modal_sandbox.py download`, which opens the sandbox via `sandbox.open()`, which itself triggers the Modal SDK's internal gRPC retry against the dead container -- tens of seconds per attempt. The product is roughly four minutes of polling a corpse per dead sandbox, after which the "Not Run" tally reports phantom failures (e.g. "91 tests failed" from a single sandbox).

Observed in imbue-ai/mngr CI runs `25225190138` and `25223712168`.

Fix
- `modal_sandbox.py download` polls `sandbox.poll()` before any `sandbox.open()` and short-circuits with an `OFFLOAD_SANDBOX_DEAD: <reason>` stderr sentinel when the sandbox is already finished. The same sentinel is emitted if `sandbox.open()` later fails with Modal's `Container ID ... finished ... status=` wording (covers the race where a sandbox dies mid-download).
- `DefaultSandbox::download` parses the sentinel out of stderr and maps it to a new `ProviderError::SandboxDead(String)` variant.
- `is_retryable` deliberately excludes `SandboxDead`, so `with_retry!` bails on the first attempt instead of polling a corpse.

`DownloadFailed` semantics are unchanged for genuinely transient download errors. Both `default` and `modal` providers are covered because `ModalProvider` wraps `DefaultSandbox`.

Test plan
- `cargo build`
- `cargo nextest run` (217/217 pass, including new `sandbox_dead_is_not_retryable` and `parse_sandbox_dead_marker_*` cases)
- `cargo fmt --check`
- `cargo clippy --all-targets --all-features -- -D warnings`

🤖 Generated with Claude Code
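To make the effect on the retry loop concrete, here is a minimal sketch of a bounded retry that stops as soon as an error is classified as non-retryable. The real `with_retry!` is a macro whose implementation is not shown in this PR, so `download_with_retry`, its signature, and the returned attempt count are hypothetical stand-ins, as is the reduced two-variant `ProviderError`.

```rust
/// Hypothetical reduced error type with only the variants named above.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    DownloadFailed(String),
    SandboxDead(String),
}

/// Only transient download failures are worth another attempt.
pub fn is_retryable(err: &ProviderError) -> bool {
    matches!(err, ProviderError::DownloadFailed(_))
}

/// Assumed function-shaped stand-in for `with_retry!`. Runs `attempt` at
/// least once and at most `max_attempts` times, stopping early on success
/// or on the first non-retryable error. Returns the final result together
/// with the number of attempts actually made.
pub fn download_with_retry<F>(
    max_attempts: u32,
    mut attempt: F,
) -> (Result<(), ProviderError>, u32)
where
    F: FnMut() -> Result<(), ProviderError>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match attempt() {
            Ok(()) => return (Ok(()), attempts),
            // Retry only while budget remains and the error is transient.
            Err(err) if attempts < max_attempts && is_retryable(&err) => continue,
            // Budget exhausted or non-retryable error: give up now.
            Err(err) => return (Err(err), attempts),
        }
    }
}
```

With a budget of three attempts, a closure that always returns `SandboxDead` is invoked once, whereas one that always returns `DownloadFailed` is invoked three times. In terms of the scenario above, that is the difference between one call into the download script and three.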