Detect crashed VMs in list/pause/resume/ssh; confirm cleanup#309
Conversation
`smolvm list` trusted the SQLite state DB blindly, so a VM whose QEMU process had died still showed up as `running`. Every follow-up command then failed for a different misleading reason: `ssh` got "connection refused", `resume` said "Cannot resume VM in state 'running'", `pause` timed out on the QMP socket. - `SmolVMManager.refresh_status(vm_info)` does a single `os.kill(pid, 0)` syscall and demotes a stale running/paused row to `ERROR`. `_run_list` runs every row through it before rendering and re-applies the status filter so demoted rows drop out of `smolvm list` (default RUNNING). - `pause` and `resume` route their state-guard and runtime-call errors through the same check. When the underlying process is gone, they raise a single-sentence "VM 'X' is not running — its process has exited. Run 'smolvm delete X' to clear it." instead of the misleading errors. - `smolvm ssh` prints the same hint after a non-zero ssh exit if the VM has crashed (without changing the ssh exit code). - `smolvm cleanup` now lists the targeted VMs and prompts before deleting. `--force` / `-f` skips the prompt. Without `--force`, non-TTY callers and `--json` mode refuse rather than silently deleting; declining at the prompt exits cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThe PR detects stale VM process states across the system and adds safety confirmations to destructive cleanup. It introduces lightweight liveness refresh to demote stale RUNNING/PAUSED states to ERROR, integrates this check into pause/resume operations with clearer error messages, applies the refresh in list and SSH commands, and gates cleanup deletion behind user confirmation unless ChangesVM crash detection and cleanup confirmation safety
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Docs PR opened: CelestoAI/mintlify-docs#82 Documented the new cleanup confirmation prompt and force flag, plus crashed VM detection in list, ssh, pause, and resume. Once this PR is merged, we'll do a second pass to capture any additional changes and update the docs PR. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/test_cleanup.py (1)
252-267: ⚡ Quick winThis test won't catch the
--jsonrefusal bug, so tighten it up.You check the exit code and that nothing got deleted, but you never assert the stdout is valid JSON. That's exactly the hole that lets the non-JSON refusal at Line 315-319 of
cleanup.pyslip through. Parsecapsysand assert a JSON object withokfalse.🧪 Suggested assertion
ret = run_cleanup(json_output=True) assert ret == 1 sdk.delete.assert_not_called() + payload = json.loads(capsys.readouterr().out) + assert payload["command"] == "cleanup" + assert payload["ok"] is False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_cleanup.py` around lines 252 - 267, The test test_run_cleanup_json_requires_force currently only checks the return code and that sdk.delete wasn't called; update it to also capture stdout via capsys after calling run_cleanup(json_output=True), parse the captured output with json.loads, assert the parsed value is a dict/object and that parsed["ok"] is False (i.e. JSON refusal), and keep the existing assertions (ret == 1 and sdk.delete.assert_not_called()); reference the test function name test_run_cleanup_json_requires_force and the call run_cleanup(json_output=True) when locating where to add the json parsing and assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/smolvm/vm.py`:
- Around line 214-219: The _crashed_message function currently returns two
sentences; change it to a single sentence that includes the vm_id and the
suggested command. Locate the _crashed_message(vm_id: str) function and return a
single consolidated sentence (e.g., "VM '{vm_id}' is not running — its process
has exited; run 'smolvm delete {vm_id}' to clear it.") so the CLI message
contract of one sentence is preserved.
- Around line 1200-1203: The SmolVMError raised when guarding VM state must
include an actionable recovery instruction: update the SmolVMError message
construction in vm.py (the raise using SmolVMError with variables action and
vm_info) to append a clear recovery sentence that names the exact CLI command
with the sandbox identifier interpolated (e.g. "To recover run: smolvm <action>
<sandbox-name>" using vm_info.sandbox_name or vm_info.vm_id), and keep the
existing context dict (vm_id/current_status) intact; ensure the final
user-facing string contains the actual sandbox name and the full recovery
command.
---
Nitpick comments:
In `@tests/test_cleanup.py`:
- Around line 252-267: The test test_run_cleanup_json_requires_force currently
only checks the return code and that sdk.delete wasn't called; update it to also
capture stdout via capsys after calling run_cleanup(json_output=True), parse the
captured output with json.loads, assert the parsed value is a dict/object and
that parsed["ok"] is False (i.e. JSON refusal), and keep the existing assertions
(ret == 1 and sdk.delete.assert_not_called()); reference the test function name
test_run_cleanup_json_requires_force and the call run_cleanup(json_output=True)
when locating where to add the json parsing and assertion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9589e75a-dcb6-45f1-9d88-3cb5afde6b27
📒 Files selected for processing (6)
src/smolvm/cli/cleanup.pysrc/smolvm/cli/main.pysrc/smolvm/vm.pytests/test_cleanup.pytests/test_cli.pytests/test_vm.py
- `_crashed_message` is now a single sentence joined with a semicolon.
- The "Cannot {action} VM in state 'X'" error appends a state-aware
recovery clause: STOPPED/CREATED → `smolvm start <id>`, ERROR →
`smolvm delete <id>`. PAUSED/RUNNING already-in-state cases need no
recovery — the status alone is the explanation.
- `smolvm cleanup --json` without `--force` now emits the standard JSON
envelope (`ok: false`, `exit_code: 1`, `error.type: refused`) instead
of a Rich panel to stderr, so machine callers can parse the refusal.
- `test_run_cleanup_json_requires_force` parses the captured stdout and
asserts on the envelope shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
smolvm listtrusted the SQLite state DB blindly, so a VM whose QEMU process had died still showed up asrunning. Every follow-up command then failed for a different misleading reason (sshgot "connection refused",resumesaid "Cannot resume VM in state 'running'",pausetimed out on the QMP socket). This PR detects the stale state and surfaces an actionable error, and adds a confirmation prompt tosmolvm cleanupsince it's irreversible.list— newSmolVMManager.refresh_status(vm_info)does a singleos.kill(pid, 0)syscall and demotes a stalerunning/pausedrow toERROR._run_listruns every row through it before rendering and re-applies the status filter so demoted rows drop out of the defaultrunningview.pause/resume/ssh— these check the PID only when something goes wrong (state-guard rejection, QMP timeout, non-zero ssh exit). When the underlying process is gone, the misleading error is replaced with: "VM 'X' is not running — its process has exited. Run 'smolvm delete X' to clear it." The ssh exit code is preserved.smolvm cleanupconfirmation +--force— lists the targeted VMs and prompts before deleting.--force/-fskips the prompt. Without--force, non-TTY callers and--jsonmode refuse to delete rather than silently destroying data; declining at the prompt exits cleanly (0).Test plan
uv run pytest tests/test_vm.py tests/test_cleanup.py tests/test_cli.py— 227 passeduv run pytest— full suite, 1064 passed / 13 skippeduv run ruff checkanduv run ruff format --checkon touched files (no new findings)smolvm list,smolvm ssh,smolvm pause,smolvm resume) and verify each surfaces the new crash messagesmolvm cleanup(interactive prompt),smolvm cleanup --force,smolvm cleanup --json(refuses without--force),smolvm cleanup --json --force(deletes)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
--forcetosmolvm cleanupto skip interactive confirmation; cleanup now refuses to run in JSON mode or when stdin is non‑TTY unless forced.smolvm listperforms a lightweight status refresh for accurate filtering.Bug Fixes
Tests