
Detect crashed VMs in list/pause/resume/ssh; confirm cleanup #309

Merged
aniketmaurya merged 2 commits into main from detect-crashed-vm-and-cleanup-confirm
May 29, 2026

Conversation

Collaborator

@aniketmaurya aniketmaurya commented May 28, 2026

Summary

smolvm list trusted the SQLite state DB blindly, so a VM whose QEMU process had died still showed up as running. Every follow-up command then failed for a different misleading reason (ssh got "connection refused", resume said "Cannot resume VM in state 'running'", pause timed out on the QMP socket). This PR detects the stale state and surfaces an actionable error, and adds a confirmation prompt to smolvm cleanup since it's irreversible.

  • Cheap PID check in list — new SmolVMManager.refresh_status(vm_info) does a single os.kill(pid, 0) syscall and demotes a stale running/paused row to ERROR. _run_list runs every row through it before rendering and re-applies the status filter so demoted rows drop out of the default running view.
  • Crash detection on pause / resume / ssh — these check the PID only when something goes wrong (state-guard rejection, QMP timeout, non-zero ssh exit). When the underlying process is gone, the misleading error is replaced with: "VM 'X' is not running — its process has exited. Run 'smolvm delete X' to clear it." The ssh exit code is preserved.
  • smolvm cleanup confirmation + --force — lists the targeted VMs and prompts before deleting. --force / -f skips the prompt. Without --force, non-TTY callers and --json mode refuse to delete rather than silently destroying data; declining at the prompt exits cleanly (0).
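The PID check described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the project's code: `Status`, `VMRow`, `pid_is_alive`, and `list_rows` are hypothetical names, and the real `SmolVMManager.refresh_status` may differ in signature and behaviour. It is POSIX-only, because `os.kill(pid, 0)` does not act as a harmless probe on Windows.

```python
import os
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Status(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    ERROR = "error"


@dataclass(frozen=True)
class VMRow:
    """Hypothetical stand-in for one row of the state DB."""
    vm_id: str
    status: Status
    pid: Optional[int]


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0: existence is checked, nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def refresh_status(row: VMRow) -> VMRow:
    """Demote a running/paused row to ERROR when its process is gone."""
    if row.status not in (Status.RUNNING, Status.PAUSED):
        return row
    if row.pid is None or pid_is_alive(row.pid):
        return row  # assumption: no recorded PID means nothing to verify
    return replace(row, status=Status.ERROR)


def list_rows(rows, wanted=Status.RUNNING):
    """Refresh every row first, then apply the status filter."""
    refreshed = [refresh_status(r) for r in rows]
    return [r for r in refreshed if r.status is wanted]
```

Filtering after the refresh is what makes demoted rows drop out of the default view. Note that signal 0 cannot distinguish a recycled PID from the original process, and an unreaped zombie still counts as alive, so this check is cheap rather than exhaustive.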

Test plan

  • uv run pytest tests/test_vm.py tests/test_cleanup.py tests/test_cli.py — 227 passed
  • uv run pytest — full suite, 1064 passed / 13 skipped
  • uv run ruff check and uv run ruff format --check on touched files (no new findings)
  • Manual: reproduce the original bug (kill QEMU PID, then run smolvm list, smolvm ssh, smolvm pause, smolvm resume) and verify each surfaces the new crash message
  • Manual: smolvm cleanup (interactive prompt), smolvm cleanup --force, smolvm cleanup --json (refuses without --force), smolvm cleanup --json --force (deletes)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added --force to smolvm cleanup to skip interactive confirmation; cleanup now refuses to run in JSON mode or when stdin is non‑TTY unless forced.
    • CLI forwards the force flag; smolvm list performs a lightweight status refresh for accurate filtering.
  • Bug Fixes

    • Improved crash detection with clearer VM "not running" errors for pause/resume and SSH failure cases.
  • Tests

    • Updated and added tests covering force behavior, JSON/non‑TTY refusal, status refresh, and crash handling.


`smolvm list` trusted the SQLite state DB blindly, so a VM whose QEMU
process had died still showed up as `running`. Every follow-up command
then failed for a different misleading reason: `ssh` got "connection
refused", `resume` said "Cannot resume VM in state 'running'", `pause`
timed out on the QMP socket.

- `SmolVMManager.refresh_status(vm_info)` does a single `os.kill(pid, 0)`
  syscall and demotes a stale running/paused row to `ERROR`. `_run_list`
  runs every row through it before rendering and re-applies the status
  filter so demoted rows drop out of `smolvm list` (default RUNNING).
- `pause` and `resume` route their state-guard and runtime-call errors
  through the same check. When the underlying process is gone, they raise
  a single-sentence "VM 'X' is not running — its process has exited. Run
  'smolvm delete X' to clear it." instead of the misleading errors.
- `smolvm ssh` prints the same hint after a non-zero ssh exit if the VM
  has crashed (without changing the ssh exit code).
- `smolvm cleanup` now lists the targeted VMs and prompts before
  deleting. `--force` / `-f` skips the prompt. Without `--force`,
  non-TTY callers and `--json` mode refuse rather than silently
  deleting; declining at the prompt exits cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
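The confirmation rules for `smolvm cleanup` in the commit message above can be expressed as a small decision function. The sketch below is an assumption-laden illustration: `Decision`, `confirm_cleanup`, its parameters, and the prompt wording are all hypothetical, and the project's `_confirm_cleanup()` may be structured differently.

```python
from enum import Enum
from typing import Callable, List


class Decision(Enum):
    PROCEED = "proceed"    # go ahead and delete
    REFUSED = "refused"    # prompting is not possible; report an error
    DECLINED = "declined"  # user answered no; exit cleanly


def confirm_cleanup(
    vm_ids: List[str],
    *,
    force: bool,
    json_output: bool,
    stdin_is_tty: bool,
    ask: Callable[[str], str] = input,
) -> Decision:
    """Decide whether a destructive cleanup may run."""
    if force:
        return Decision.PROCEED  # --force / -f skips the prompt entirely
    if json_output or not stdin_is_tty:
        return Decision.REFUSED  # never delete silently for machine callers
    # Hypothetical prompt text; the targeted VMs are listed before asking.
    answer = ask(f"Delete {len(vm_ids)} VM(s): {', '.join(vm_ids)}? [y/N] ")
    if answer.strip().lower() in ("y", "yes"):
        return Decision.PROCEED
    return Decision.DECLINED
```

Checking `force` first keeps scripted use possible, while the default answer of "no" means an accidental Enter at the prompt deletes nothing. In a real CLI, `stdin_is_tty` would typically come from `sys.stdin.isatty()`.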
Contributor

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e6689a98-f0de-421d-96dd-1e605be98694

📥 Commits

Reviewing files that changed from the base of the PR and between d9b1496 and 7a082bb.

📒 Files selected for processing (3)
  • src/smolvm/cli/cleanup.py
  • src/smolvm/vm.py
  • tests/test_cleanup.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/smolvm/cli/cleanup.py
  • tests/test_cleanup.py
  • src/smolvm/vm.py

📝 Walkthrough


The PR detects stale VM process states across the system and adds safety confirmations to destructive cleanup. It introduces lightweight liveness refresh to demote stale RUNNING/PAUSED states to ERROR, integrates this check into pause/resume operations with clearer error messages, applies the refresh in list and SSH commands, and gates cleanup deletion behind user confirmation unless --force is set.

Changes

VM crash detection and cleanup confirmation safety

| Layer / File(s) | Summary |
|---|---|
| VM status refresh core and crash message formatting / src/smolvm/vm.py, tests/test_vm.py | refresh_status() method detects stale RUNNING/PAUSED states with dead PIDs and demotes them to ERROR. Module-level _crashed_message() formats actionable recovery text for users. Tests cover liveness checks and state demotion logic. |
| Crash-aware error translation in pause and resume / src/smolvm/vm.py, tests/test_vm.py | pause() and resume() now refresh VM status and replace misleading state errors with crash-aware messages. _raise_if_crashed() helper converts stale process detection into actionable "VM is not running" errors. Tests verify crash detection during pause/resume operations. |
| List command status refresh integration / src/smolvm/cli/main.py, tests/test_cli.py | _run_list refreshes each VM's status before rendering to correct stale RUNNING/PAUSED rows. Status filter reapplied after refresh. Mock fixture updated to support refresh_status() passthrough. Test state adjusted from created to running for consistency. |
| SSH command crash detection and recovery hint / src/smolvm/cli/main.py | _hint_if_vm_crashed() refreshes VM on SSH command failure and emits recovery message when status becomes ERROR. Integrated into both direct and attach SSH command paths when exit code is non-zero. |
| Cleanup force flag and confirmation flow / src/smolvm/cli/cleanup.py, tests/test_cleanup.py | CLI --force/-f flag added. _confirm_cleanup() enforces rules: auto-allow with force, block --json without force, block non-TTY without force, otherwise prompt user. run_cleanup() gates execution on confirmation result. Tests cover JSON refusal, non-TTY refusal, TTY prompts, and CLI forwarding. |
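The SSH row above describes a hint that is emitted only after a failed command and that leaves the exit code untouched. A minimal sketch of that pattern follows, with the caveat that `run_ssh_with_crash_hint` and its parameters are hypothetical and that the output stream (stderr here) is an assumption; the actual `_hint_if_vm_crashed()` helper may work differently.

```python
import sys
from typing import Callable, TextIO


def run_ssh_with_crash_hint(
    run_ssh: Callable[[], int],
    vm_has_crashed: Callable[[], bool],
    hint: str,
    err: TextIO = sys.stderr,
) -> int:
    """Run ssh, then explain the failure if the VM process is gone."""
    exit_code = run_ssh()
    # The liveness probe runs only on failure, so the happy path pays nothing.
    if exit_code != 0 and vm_has_crashed():
        print(hint, file=err)
    # The original ssh exit code is returned unchanged for scripts to inspect.
    return exit_code
```

Because `and` short-circuits, `vm_has_crashed` is never called when ssh exits with 0, which matches the "check the PID only when something goes wrong" approach in the PR summary.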

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • CelestoAI/SmolVM#74: Unified cleanup CLI JSON plumbing and test envelope that this PR extends with force confirmation logic.

Suggested labels

QEMU


Listen: we find the dead process, mark it error, and ask before we burn it down.
If you're quick with --force, we do as you say—otherwise we wait, aye. ⚙️

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely captures the main changes: detecting crashed VMs across list/pause/resume/ssh commands and adding confirmation to cleanup operations. |
| Description check | ✅ Passed | The description covers all required template sections: a clear summary explaining the problem and solution, related issues (none), comprehensive testing plan with specific test runs, and all checklist items addressed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.05% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot added the QEMU label May 28, 2026

mintlify Bot commented May 28, 2026

Docs PR opened: CelestoAI/mintlify-docs#82

Documented the new cleanup confirmation prompt and force flag, plus crashed VM detection in list, ssh, pause, and resume.

Once this PR is merged, we'll do a second pass to capture any additional changes and update the docs PR.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/test_cleanup.py (1)

252-267: ⚡ Quick win

This test won't catch the --json refusal bug, so tighten it up.

You check the exit code and that nothing got deleted, but you never assert that stdout is valid JSON. That's exactly the hole that lets the non-JSON refusal at lines 315-319 of cleanup.py slip through. Parse the output captured by capsys and assert that it is a JSON object with ok set to false.

🧪 Suggested assertion
         ret = run_cleanup(json_output=True)

         assert ret == 1
         sdk.delete.assert_not_called()
+        payload = json.loads(capsys.readouterr().out)
+        assert payload["command"] == "cleanup"
+        assert payload["ok"] is False
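For context, a refusal that would satisfy these assertions might be built along the lines below. This is only a sketch: `refusal_envelope` is a hypothetical helper, and the `message` key is an assumption. The `command`, `ok`, `exit_code`, and `error.type` fields are the ones named in this conversation.

```python
import json


def refusal_envelope(command: str, message: str) -> str:
    """Serialize a machine-readable refusal for --json callers."""
    payload = {
        "command": command,
        "ok": False,
        "exit_code": 1,
        "error": {
            "type": "refused",
            "message": message,  # assumed key, not confirmed by the source
        },
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Written to stdout so that callers can parse the refusal.
    print(refusal_envelope("cleanup", "pass --force to delete in JSON mode"))
```

Emitting the refusal on stdout in the same shape as a success response lets a caller branch on `ok` without special-casing plain-text errors.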
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_cleanup.py` around lines 252 - 267, The test
test_run_cleanup_json_requires_force currently only checks the return code and
that sdk.delete wasn't called; update it to also capture stdout via capsys after
calling run_cleanup(json_output=True), parse the captured output with
json.loads, assert the parsed value is a dict/object and that parsed["ok"] is
False (i.e. JSON refusal), and keep the existing assertions (ret == 1 and
sdk.delete.assert_not_called()); reference the test function name
test_run_cleanup_json_requires_force and the call run_cleanup(json_output=True)
when locating where to add the json parsing and assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/smolvm/vm.py`:
- Around line 214-219: The _crashed_message function currently returns two
sentences; change it to a single sentence that includes the vm_id and the
suggested command. Locate the _crashed_message(vm_id: str) function and return a
single consolidated sentence (e.g., "VM '{vm_id}' is not running — its process
has exited; run 'smolvm delete {vm_id}' to clear it.") so the CLI message
contract of one sentence is preserved.
- Around line 1200-1203: The SmolVMError raised when guarding VM state must
include an actionable recovery instruction: update the SmolVMError message
construction in vm.py (the raise using SmolVMError with variables action and
vm_info) to append a clear recovery sentence that names the exact CLI command
with the sandbox identifier interpolated (e.g. "To recover run: smolvm <action>
<sandbox-name>" using vm_info.sandbox_name or vm_info.vm_id), and keep the
existing context dict (vm_id/current_status) intact; ensure the final
user-facing string contains the actual sandbox name and the full recovery
command.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9589e75a-dcb6-45f1-9d88-3cb5afde6b27

📥 Commits

Reviewing files that changed from the base of the PR and between 5edc886 and d9b1496.

📒 Files selected for processing (6)
  • src/smolvm/cli/cleanup.py
  • src/smolvm/cli/main.py
  • src/smolvm/vm.py
  • tests/test_cleanup.py
  • tests/test_cli.py
  • tests/test_vm.py

- `_crashed_message` is now a single sentence joined with a semicolon.
- The "Cannot {action} VM in state 'X'" error appends a state-aware
  recovery clause: STOPPED/CREATED → `smolvm start <id>`, ERROR →
  `smolvm delete <id>`. PAUSED/RUNNING already-in-state cases need no
  recovery — the status alone is the explanation.
- `smolvm cleanup --json` without `--force` now emits the standard JSON
  envelope (`ok: false`, `exit_code: 1`, `error.type: refused`) instead
  of a Rich panel to stderr, so machine callers can parse the refusal.
- `test_run_cleanup_json_requires_force` parses the captured stdout and
  asserts on the envelope shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
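The state-aware recovery clause described in the commit message above could look something like this. The function names, the lowercase status strings, and the exact wording of the appended clause are assumptions made for illustration; only the "Cannot {action} VM in state 'X'" prefix and the status-to-command mapping come from the commit message.

```python
from typing import Optional


def recovery_command(status: str, vm_id: str) -> Optional[str]:
    """Map a VM status to the CLI command that gets the user unstuck."""
    if status in ("stopped", "created"):
        return f"smolvm start {vm_id}"
    if status == "error":
        return f"smolvm delete {vm_id}"
    # paused/running: the status alone explains the rejection
    return None


def state_guard_message(action: str, status: str, vm_id: str) -> str:
    """Build the state-guard error, appending a recovery clause when one applies."""
    base = f"Cannot {action} VM in state '{status}'"
    command = recovery_command(status, vm_id)
    if command is None:
        return f"{base}."
    # Hypothetical phrasing; joined with a semicolon to stay one sentence.
    return f"{base}; run '{command}' to recover."


if __name__ == "__main__":
    print(state_guard_message("resume", "stopped", "vm1"))
    print(state_guard_message("resume", "running", "vm1"))
```

Keeping the mapping in its own function means the message stays a single sentence with the VM identifier interpolated, which is what the review comments asked for.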
@aniketmaurya aniketmaurya merged commit 796e959 into main May 29, 2026
11 checks passed
@aniketmaurya aniketmaurya deleted the detect-crashed-vm-and-cleanup-confirm branch May 29, 2026 12:43