Skip to content

feat: surface computer error state on terminal disconnect#6

Closed
aniketmaurya wants to merge 2 commits into
mainfrom
feat/spot-recovery-terminal-ux
Closed

feat: surface computer error state on terminal disconnect#6
aniketmaurya wants to merge 2 commits into
mainfrom
feat/spot-recovery-terminal-ux

Conversation

@aniketmaurya
Copy link
Copy Markdown
Collaborator

@aniketmaurya aniketmaurya commented Apr 13, 2026

Summary

  • When the terminal WebSocket drops unexpectedly, check computer status via the API
  • If computer is in ERROR state (host was spot-interrupted), show last_error and suggest celesto computer create
  • If computer is still RUNNING, suggest reconnecting with celesto computer ssh
  • No more silent "Disconnected" when the host dies

Companion PRs:

  • Backend: CelestoAI/celesto-backend (cleanup task + draining endpoint)
  • Agent: CelestoAI/celesto-agent#4 (spot watcher + graceful shutdown)

Test plan

  • Open terminal session, kill host agent mid-session — verify error message appears
  • Open terminal session, disconnect network briefly — verify "still running, reconnect" message
  • Normal Ctrl+] disconnect should still show clean "Disconnected"

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved connection-loss handling and messages. When a connection drops unexpectedly, the tool now checks the computer's status and provides tailored guidance for states like running, stopped, creating/starting, deleting/deleted, or error (including suggested next steps).
    • Enhanced error handling for status checks so network/auth failures or missing records produce concise, user-friendly notices instead of noisy errors.

When the terminal WebSocket drops unexpectedly, check the computer
status via the API. If the computer entered ERROR state (e.g. host
was spot-interrupted), show the error message and suggest creating
a new computer instead of a silent "Disconnected".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

When a websocket closes non‑normally in ssh_to_computer, the handler now reports “Connection lost” and then attempts an API check of the computer's status, printing contextual guidance based on the retrieved status or handling specific fetch errors.

Changes

Cohort / File(s) Summary
Connection Loss Status Handling
src/celesto/computer.py
Replace generic non‑normal websocket close message with “Connection lost”. After disconnect, open a new API client and fetch computers.get(resolved_id), branch on status to print targeted guidance (error, running, stopped/stopping, deleted/deleting, creating/starting, other). Add explicit catches for CelestoNotFoundError, CelestoAuthenticationError/CelestoNetworkError, and a broad Exception fallback around the status check.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant SSHClient as SSH Client
  participant WS as WebSocket
  participant API as API Client
  participant DB as Computers API

  User->>SSHClient: start ssh_to_computer()
  SSHClient->>WS: open websocket
  WS-->>SSHClient: normal data flow...
  alt non-normal close (rcvd.code != 1000)
    WS-->>SSHClient: Connection closed (non-normal)
    SSHClient->>SSHClient: print "Connection lost"
    SSHClient->>API: _get_client(api_key) and computers.get(resolved_id)
    API->>DB: fetch computer record
    DB-->>API: returns status (running/error/stopped/...)
    API-->>SSHClient: computer record / status
    SSHClient->>User: print guidance based on status
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

When the line goes quiet, we do what must be done,
Not panic — check the machine, see if the fire's still on.
A lost connection is no mystery; we fetch the truth and speak it plain,
Guide the hands that mend the gears, then try to connect again. 🔧

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main change: surfacing computer error state when terminal disconnects, which aligns with the PR's core objective of providing actionable status messages on WebSocket disconnect.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/spot-recovery-terminal-ux

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/celesto/computer.py (1)

266-271: Consider handling the other states, mate. There's more to this world than "error" and "running".

Your ComputerStatus contract in types.py defines nine possible values: "creating", "running", "stopping", "stopped", "starting", "deleting", "deleted", and "error". You're only handling two of them here.

If the computer ends up in "stopped", "stopping", "deleted", or any transitional state when the connection drops, the user gets no guidance whatsoever — just silence after the "Connection lost" message.

Now, I'm not saying you need to handle every bloody state individually, but an else branch with a sensible fallback would show the user you've thought this through properly.

♻️ Proposed enhancement: Add fallback for other states
                         if comp_status == "error":
                             last_err = info.get("last_error", "unknown")
                             console.print(f"[red]Computer error: {last_err}[/red]")
                             console.print("[dim]Create a new computer with: celesto computer create[/dim]")
                         elif comp_status == "running":
                             console.print(f"[yellow]Computer is still running. Reconnect with: celesto computer ssh {computer_id}[/yellow]")
+                        elif comp_status in ("stopped", "stopping", "deleted", "deleting"):
+                            console.print(f"[yellow]Computer is {comp_status}. Create a new one with: celesto computer create[/yellow]")
+                        else:
+                            console.print(f"[dim]Computer status: {comp_status}[/dim]")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/celesto/computer.py` around lines 266 - 271, The current comp_status
handling only covers "error" and "running" in the block that reads comp_status =
info.get("status") (the snippet uses comp_status, info and computer_id); add an
else branch that covers all other ComputerStatus values (e.g., "creating",
"starting", "stopping", "stopped", "deleting", "deleted") and prints a helpful
fallback message to the user (for example: indicate the current state and
suggested next steps such as waiting, trying "celesto computer
start/stop/delete" or recreating with "celesto computer create", and for
terminal states like "deleted" suggest recreating); ensure the new branch uses
the existing variables comp_status, info.get("last_error") if present, and
computer_id to provide context and avoid leaving the user with no guidance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/celesto/computer.py`:
- Around line 272-273: You have a bare except in the try/except around the SDK
get() call that swallows all errors (the block with "except Exception: pass");
replace it with targeted exception handling for CelestoAuthenticationError,
CelestoNotFoundError, CelestoNetworkError (and a generic Exception fallback) and
log a clear, user-friendly error message (including the exception details) via
the module/logger used in this file so failures are visible; for recoverable
cases keep the existing best-effort behavior after logging, and for
fatal/authentication/network errors either return a controlled error response or
re-raise after logging to avoid hiding problems.

---

Nitpick comments:
In `@src/celesto/computer.py`:
- Around line 266-271: The current comp_status handling only covers "error" and
"running" in the block that reads comp_status = info.get("status") (the snippet
uses comp_status, info and computer_id); add an else branch that covers all
other ComputerStatus values (e.g., "creating", "starting", "stopping",
"stopped", "deleting", "deleted") and prints a helpful fallback message to the
user (for example: indicate the current state and suggested next steps such as
waiting, trying "celesto computer start/stop/delete" or recreating with "celesto
computer create", and for terminal states like "deleted" suggest recreating);
ensure the new branch uses the existing variables comp_status,
info.get("last_error") if present, and computer_id to provide context and avoid
leaving the user with no guidance.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9748e5e-2b5a-4626-a40c-984489e98835

📥 Commits

Reviewing files that changed from the base of the PR and between af7fd8b and 56a0f94.

📒 Files selected for processing (1)
  • src/celesto/computer.py

Comment thread src/celesto/computer.py Outdated
…connect

- Replace bare `except Exception: pass` with specific handlers for
  CelestoNotFoundError, CelestoAuthenticationError, CelestoNetworkError
  so failures are visible instead of silently swallowed
- Add branches for all ComputerStatus values (stopped, deleted, creating,
  starting) with actionable guidance for each

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/celesto/computer.py`:
- Around line 260-287: The branch only runs the status probe when e.rcvd is
present, but websocket network failures raise ConnectionClosedError (or leave
e.rcvd == None) so we must treat e.rcvd is None as an abnormal disconnect:
update the except handling around the websocket recv to also catch
ConnectionClosedError (or check if e.rcvd is None) and in that path execute the
same status probe using _get_client(resolved_id) /
check_client.computers.get(resolved_id) and the existing status-handling logic
(last_error, "running", "stopped"/"stopping", "deleted"/"deleting",
"creating"/"starting"), and set a local flag (e.g., abnormal_disconnect) so the
later clean "Disconnected." footer is suppressed; keep existing
CelestoNotFoundError, CelestoAuthenticationError, CelestoNetworkError handling
but ensure they are used inside this new path so network/auth errors still print
the dim "Could not check computer status" message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d0676181-04a7-4018-b54c-c35f5f33ad97

📥 Commits

Reviewing files that changed from the base of the PR and between 56a0f94 and b448bf0.

📒 Files selected for processing (1)
  • src/celesto/computer.py

Comment thread src/celesto/computer.py
Comment on lines 260 to +287
if e.rcvd and e.rcvd.code != 1000:
console.print(f"\n[red]Connection closed: code={e.rcvd.code} reason={e.rcvd.reason}[/red]")
console.print(f"\n[red]Connection lost (code={e.rcvd.code}).[/red]")
# Check if computer entered error state (e.g. host spot-interrupted)
try:
with _get_client(api_key) as check_client:
info = check_client.computers.get(resolved_id)
comp_status = info.get("status")
if comp_status == "error":
last_err = info.get("last_error", "unknown")
console.print(f"[red]Computer error: {last_err}[/red]")
console.print("[dim]Create a new computer with: celesto computer create[/dim]")
elif comp_status == "running":
console.print(f"[yellow]Computer is still running. Reconnect with: celesto computer ssh {computer_id}[/yellow]")
elif comp_status in ("stopped", "stopping"):
console.print(f"[yellow]Computer is {comp_status}. Start it with: celesto computer start {computer_id}[/yellow]")
elif comp_status in ("deleted", "deleting"):
console.print("[red]Computer has been deleted.[/red]")
console.print("[dim]Create a new computer with: celesto computer create[/dim]")
elif comp_status in ("creating", "starting"):
console.print(f"[yellow]Computer is {comp_status}. Wait a moment and retry: celesto computer ssh {computer_id}[/yellow]")
else:
console.print(f"[yellow]Computer status: {comp_status}[/yellow]")
except (CelestoNotFoundError,):
console.print("[red]Computer not found — it may have been deleted.[/red]")
except (CelestoAuthenticationError, CelestoNetworkError) as exc:
console.print(f"[dim]Could not check computer status: {exc}[/dim]")
except Exception as exc:
console.print(f"[dim]Could not check computer status: {exc}[/dim]")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Abrupt drops can still slip past this recovery path.

Line 260 only runs the new status probe when e.rcvd exists, but websockets only sets rcvd if a close frame was actually received. For network failures / broken handshakes, recv() raises ConnectionClosedError and rcvd may be None, so this whole branch is skipped and the flow falls through to the clean Disconnected. footer on Line 334 — exactly the case this PR is trying to surface. Catch the error close explicitly (or treat e.rcvd is None as abnormal) and suppress the clean-disconnect footer for that path. (websockets.readthedocs.io)

🔧 One tidy way to close the gap
     old_settings = termios.tcgetattr(stdin_fd)
     done = threading.Event()
+    clean_disconnect = True

     def recv_loop():
+        nonlocal clean_disconnect
         try:
             while not done.is_set():
                 try:
                     msg = ws.recv(timeout=1)
                 except TimeoutError:
                     continue
                 if isinstance(msg, str):
                     os.write(sys.stdout.fileno(), msg.encode("utf-8"))
                 elif isinstance(msg, bytes):
                     os.write(sys.stdout.fileno(), msg)
-        except websockets.exceptions.ConnectionClosed as e:
-            if e.rcvd and e.rcvd.code != 1000:
-                console.print(f"\n[red]Connection lost (code={e.rcvd.code}).[/red]")
-                # Check if computer entered error state (e.g. host spot-interrupted)
-                try:
-                    with _get_client(api_key) as check_client:
-                        info = check_client.computers.get(resolved_id)
-                        comp_status = info.get("status")
-                        if comp_status == "error":
-                            last_err = info.get("last_error", "unknown")
-                            console.print(f"[red]Computer error: {last_err}[/red]")
-                            console.print("[dim]Create a new computer with: celesto computer create[/dim]")
-                        elif comp_status == "running":
-                            console.print(f"[yellow]Computer is still running. Reconnect with: celesto computer ssh {computer_id}[/yellow]")
-                        elif comp_status in ("stopped", "stopping"):
-                            console.print(f"[yellow]Computer is {comp_status}. Start it with: celesto computer start {computer_id}[/yellow]")
-                        elif comp_status in ("deleted", "deleting"):
-                            console.print("[red]Computer has been deleted.[/red]")
-                            console.print("[dim]Create a new computer with: celesto computer create[/dim]")
-                        elif comp_status in ("creating", "starting"):
-                            console.print(f"[yellow]Computer is {comp_status}. Wait a moment and retry: celesto computer ssh {computer_id}[/yellow]")
-                        else:
-                            console.print(f"[yellow]Computer status: {comp_status}[/yellow]")
+        except websockets.exceptions.ConnectionClosedError as e:
+            clean_disconnect = False
+            close_code = e.rcvd.code if e.rcvd else (e.sent.code if e.sent else "unknown")
+            console.print(f"\n[red]Connection lost (code={close_code}).[/red]")
+            try:
+                with _get_client(api_key) as check_client:
+                    info = check_client.computers.get(resolved_id)
+                    comp_status = info.get("status")
+                    if comp_status == "error":
+                        last_err = info.get("last_error", "unknown")
+                        console.print(f"[red]Computer error: {last_err}[/red]")
+                        console.print("[dim]Create a new computer with: celesto computer create[/dim]")
+                    elif comp_status == "running":
+                        console.print(f"[yellow]Computer is still running. Reconnect with: celesto computer ssh {computer_id}[/yellow]")
+                    elif comp_status in ("stopped", "stopping"):
+                        console.print(f"[yellow]Computer is {comp_status}. Start it with: celesto computer start {computer_id}[/yellow]")
+                    elif comp_status in ("deleted", "deleting"):
+                        console.print("[red]Computer has been deleted.[/red]")
+                        console.print("[dim]Create a new computer with: celesto computer create[/dim]")
+                    elif comp_status in ("creating", "starting"):
+                        console.print(f"[yellow]Computer is {comp_status}. Wait a moment and retry: celesto computer ssh {computer_id}[/yellow]")
+                    else:
+                        console.print(f"[yellow]Computer status: {comp_status}[/yellow]")
                 except (CelestoNotFoundError,):
                     console.print("[red]Computer not found — it may have been deleted.[/red]")
                 except (CelestoAuthenticationError, CelestoNetworkError) as exc:
                     console.print(f"[dim]Could not check computer status: {exc}[/dim]")
                 except Exception as exc:
                     console.print(f"[dim]Could not check computer status: {exc}[/dim]")
+        except websockets.exceptions.ConnectionClosedOK:
+            pass
         except OSError:
             pass
         finally:
             done.set()
@@
-        console.print("\n[dim]Disconnected.[/dim]")
+        if clean_disconnect:
+            console.print("\n[dim]Disconnected.[/dim]")
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 282-282: A length-one tuple literal is redundant in exception handlers

Replace with except CelestoNotFoundError

(B013)


[warning] 286-286: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/celesto/computer.py` around lines 260 - 287, The branch only runs the
status probe when e.rcvd is present, but websocket network failures raise
ConnectionClosedError (or leave e.rcvd == None) so we must treat e.rcvd is None
as an abnormal disconnect: update the except handling around the websocket recv
to also catch ConnectionClosedError (or check if e.rcvd is None) and in that
path execute the same status probe using _get_client(resolved_id) /
check_client.computers.get(resolved_id) and the existing status-handling logic
(last_error, "running", "stopped"/"stopping", "deleted"/"deleting",
"creating"/"starting"), and set a local flag (e.g., abnormal_disconnect) so the
later clean "Disconnected." footer is suppressed; keep existing
CelestoNotFoundError, CelestoAuthenticationError, CelestoNetworkError handling
but ensure they are used inside this new path so network/auth errors still print
the dim "Could not check computer status" message.

@aniketmaurya aniketmaurya deleted the feat/spot-recovery-terminal-ux branch April 15, 2026 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant