fix: fail sandbox spawn when secrets cannot be loaded#562
Conversation
Previously, failures to load global or repo secrets were caught and swallowed with a warning, allowing the sandbox to proceed without user-configured secrets. This is no longer acceptable — missing secrets should prevent the sandbox from spawning. Remove the try/catch wrappers in getUserEnvVars() so errors propagate to the caller (lifecycle manager), which already handles spawn failures.
📝 WalkthroughWalkthrough
Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Terraform Validation Results
Pushed by: @ColeMurray, Action: |
There was a problem hiding this comment.
Summary
This PR changes getUserEnvVars() to fail fast when secret loading fails instead of silently continuing without secrets. The intent makes sense, but the current implementation also turns ensureRepoId() into a hard dependency for sandbox spawn, which introduces a regression for sessions that still start with repo_id = null.
PR Title and number: fix: fail sandbox spawn when secrets cannot be loaded (#562)
Author: @ColeMurray
Files changed: 1, additions/deletions: +6/-22
Critical Issues
- [Functionality]
packages/control-plane/src/session/durable-object.ts:1694-await this.ensureRepoId(session)is now inside the fail-fast path. User-created sessions can still be initialized withoutrepo_id, so sandbox spawn now depends on a live SCM lookup even when there are no repo-specific secrets to load. A transient GitHub/GitLab lookup failure will now fail sandbox creation for those sessions, which is broader than the PR goal of failing only when secret loading itself breaks. Suggested fix: either guaranteerepo_idis always populated before spawn, or only make the actual D1/decryption reads fail-fast once a repo ID is already known.
Suggestions
- [Testing]
packages/control-plane/src/session/durable-object.ts:1689- Add a regression test covering the new error path through the lifecycle manager (for example,getUserEnvVars()rejecting during spawn/restore) so this fail-fast behavior stays intentional and visible.
Nitpicks
- None.
Positive Feedback
- The change is small and easy to reason about.
- Letting real secret decryption failures propagate to the lifecycle manager is cleaner than silently degrading the sandbox environment.
- The existing spawn failure handling path already gives this change a natural place to surface errors to clients.
Questions
- None.
Verdict (for GitHub PRs)
Request Changes: Critical issue must be addressed before merging.
Validation: src/sandbox/lifecycle/manager.test.ts passes (47 tests).
Address review feedback: - Keep ensureRepoId() in the fail-fast path — if we can't resolve the repo ID, repo secrets would be silently dropped, defeating the purpose of this change. Failing here is the correct behavior. - Add regression test verifying that getUserEnvVars() rejection during spawn correctly fails the sandbox and resets isSpawning.
Terraform Validation Results
Pushed by: @ColeMurray, Action: |
|
Re: testing suggestion from the review — done in e96fbec. Added a |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/control-plane/src/sandbox/lifecycle/manager.test.ts (1)
682-707: Good regression coverage — consider also assertingsetLastSpawnError.The test correctly verifies the three key invariants: provider sandbox creation is skipped, sandbox status transitions to
failed, andisSpawning()is reset. Optionally, you could also assert thatsetLastSpawnErrorwas invoked with the underlying error message so a future change that swallows the message indoSpawn's catch block (and re-broadcasts something generic) would still be caught here:♻️ Optional additional assertion
expect(provider.createSandbox).not.toHaveBeenCalled(); expect(storage.calls).toContain("updateSandboxStatus:failed"); + expect(storage.calls).toContain("setLastSpawnError:D1 decryption failure"); expect(manager.isSpawning()).toBe(false);This would also pin down that the user-facing
last_spawn_errorcorrectly surfaces the secret-loading failure — which is the behavior this PR is explicitly trying to enable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/control-plane/src/sandbox/lifecycle/manager.test.ts` around lines 682 - 707, Add an assertion that setLastSpawnError was called with the underlying error message when getUserEnvVars rejects: after awaiting manager.spawnSandbox(), assert that storage.setLastSpawnError (from createMockStorage) was invoked and that its argument includes "D1 decryption failure" (the error thrown in the mocked getUserEnvVars), so the test checks setLastSpawnError alongside provider.createSandbox, updateSandboxStatus:failed, and manager.isSpawning().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/control-plane/src/sandbox/lifecycle/manager.test.ts`:
- Around line 682-707: Add an assertion that setLastSpawnError was called with
the underlying error message when getUserEnvVars rejects: after awaiting
manager.spawnSandbox(), assert that storage.setLastSpawnError (from
createMockStorage) was invoked and that its argument includes "D1 decryption
failure" (the error thrown in the mocked getUserEnvVars), so the test checks
setLastSpawnError alongside provider.createSandbox, updateSandboxStatus:failed,
and manager.isSpawning().
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 05426107-6914-40df-a347-07cc991b2810
📒 Files selected for processing (2)
packages/control-plane/src/sandbox/lifecycle/manager.test.tspackages/control-plane/src/session/durable-object.ts
Summary
getUserEnvVars()so failures propagate as errors instead of being silently swallowedOPENAI_API_KEY)Context
The original fallback behavior was added when D1 secrets were first introduced, as a safety net during rollout. Now that secrets are a core part of sandbox operation, silently dropping them can cause confusing downstream failures (agent can't authenticate to APIs, etc.). Failing fast gives clear signal.
Test plan
Summary by CodeRabbit
Bug Fixes
Tests