Skip to content

feat(viz): add RerunWebSocketServer for dimos-viewer remote mode#1643

Draft
jeff-hykin wants to merge 12 commits intodevfrom
jeff/fix/rconnect
Draft

feat(viz): add RerunWebSocketServer for dimos-viewer remote mode#1643
jeff-hykin wants to merge 12 commits intodevfrom
jeff/fix/rconnect

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented Mar 21, 2026

NOTE: this will become ready-for-review once this dimos-viewer PR is merged

Problem(s)

  1. dimos-viewer with --connect doesn't work for remote connections.
  2. there's no way to remap what dimos-viewer publishes on. It publishes to a hardcoded LCM path of cmd_vel which ends up making stream-renaming a pain since everything else must be renamed to know if the cmd_vel is coming from the viewer or from a planner/module.

Solution

Use websockets instead of LCM for the viewer.

This PR only changes one side, the other half lives in this dimos-viewer PR. This should only be merged after that one.

  • A dimos module starts a websocket server listening for clicked_point and tele_cmd_vel
  • The dimos-viewer tries to connect to a websocket and publishes clicked_point and tele_cmd_vel on there.

Breaking Changes

None

How to Test

You'll need the jeff/fix/connect branch of dimos-viewer compiled and python installed into dimos:

# Build and install dimos-viewer (requires pixi)
cd <path-to-dimos-viewer>
git checkout jeff/fix/connect
pixi run build
uv pip install target/wheels/dimos_viewer-*.whl --force-reinstall
# Unit + E2E tests
uv run pytest dimos/visualization/rerun/test_websocket_server.py dimos/visualization/rerun/test_viewer_ws_e2e.py -v --timeout=30 -k "not test_viewer_ws_client_connects"

Alternatively

# Terminal 1: start the websocket server
python -c "
from dimos.visualization.rerun.websocket_server import RerunWebSocketServer
import threading
server = RerunWebSocketServer(port=3030)
server.clicked_point.subscribe(lambda pt: print(f'[CLICK] {pt.x:.3f},{pt.y:.3f},{pt.z:.3f}'))
server.tele_cmd_vel.subscribe(lambda tw: print(f'[TWIST] {tw}'))
server.start()
threading.Event().wait()
"

# Terminal 2: start the viewer
dimos-viewer --ws-url ws://127.0.0.1:3030/ws

Click in the 3D viewport and use WASD keys — should see [CLICK] and [TWIST] in terminal 1.

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR adds RerunWebSocketServer, a new DimOS module that bridges the gap when dimos-viewer runs in --connect mode: it listens for WebSocket connections from the viewer and translates incoming JSON events (clicks, keyboard teleop, stops, heartbeats) into DimOS stream publishes on clicked_point (Out[PointStamped]) and tele_cmd_vel (Out[Twist]). The implementation is clean and well-tested, but two correctness issues in the server and one fragile test fixture need attention before merge.

  • Non-dict JSON crashes the client handler (websocket_server.py:154): json.loads() can return any JSON value; calling .get("type") on a list, integer, or null raises an unhandled AttributeError that silently terminates the connection handler for that client. A single isinstance(msg, dict) guard after the json.loads() call fixes this.
  • Race condition in stop() (websocket_server.py:96-104): _stop_event is initialised inside the _serve() coroutine on the background thread. If stop() is called before that line executes, _stop_event is None evaluates to True and the server loop is never signalled to exit, leaving the port bound.
  • Hard-coded environment in smoke test (test_viewer_ws_e2e.py:305-311): test_viewer_ws_client_connects passes a stripped PATH and HOME=/home/dimos, which will break on any CI runner or developer machine where dimos-viewer is not installed at that exact path.
  • No path filtering on WebSocket endpoint (websocket_server.py:124-128): The server accepts upgrade requests on any URL path, not just /ws. Given the default 0.0.0.0 binding, this means any host on the network can send teleop commands via a non-standard path.

Confidence Score: 3/5

  • Not safe to merge yet — two logic bugs in the core server need to be fixed first.
  • The feature is well-scoped and the test coverage is thorough, but the unhandled AttributeError on non-dict JSON is a P1 code correctness bug that will cause unexpected client disconnections, and the stop() race condition can leave the port bound — both affect the primary user path. Fixing these two issues would bring the PR to merge-ready.
  • dimos/visualization/rerun/websocket_server.py needs the non-dict JSON guard and the stop() race fix; dimos/visualization/rerun/test_viewer_ws_e2e.py needs the hard-coded environment corrected.

Important Files Changed

Filename Overview
dimos/visualization/rerun/websocket_server.py Core new module — introduces RerunWebSocketServer; has two logic issues: _dispatch crashes on non-dict JSON (unhandled AttributeError) and stop() has a race condition where _stop_event may still be None when called.
dimos/visualization/rerun/test_websocket_server.py Comprehensive unit test suite for RerunWebSocketServer covering click, twist, stop, heartbeat, invalid JSON, and reconnection; well-structured with MockViewerPublisher helper.
dimos/visualization/rerun/test_viewer_ws_e2e.py E2E / protocol-level tests; the binary smoke test (test_viewer_ws_client_connects) hard-codes /home/dimos and a stripped PATH that will fail on most developer and CI environments.

Sequence Diagram

sequenceDiagram
    participant Viewer as dimos-viewer<br/>(--connect mode)
    participant WS as RerunWebSocketServer<br/>(0.0.0.0:3030/ws)
    participant CP as clicked_point<br/>Out[PointStamped]
    participant TV as tele_cmd_vel<br/>Out[Twist]
    participant DS as Downstream<br/>(e.g. ReplanningAStarPlanner)

    Viewer->>WS: WebSocket connect (ws://host:3030/ws)
    WS-->>Viewer: connection accepted
    loop Every second
        Viewer->>WS: {"type":"heartbeat","timestamp_ms":...}
        Note over WS: logged, no publish
    end
    Viewer->>WS: {"type":"click","x":1.0,"y":2.0,"z":0.5,...}
    WS->>CP: publish PointStamped(x,y,z,ts,frame_id)
    CP->>DS: subscriber callback

    Viewer->>WS: {"type":"twist","linear_x":0.5,...,"angular_z":0.8}
    WS->>TV: publish Twist(linear, angular)

    Viewer->>WS: {"type":"stop"}
    WS->>TV: publish Twist.zero()

    Viewer--xWS: disconnect
    Note over WS: logs disconnect, ready for reconnect
Loading

Last reviewed commit: "fix: ruff formatting..."

Comment on lines +147 to +155
def _dispatch(self, raw: str | bytes) -> None:
try:
msg = json.loads(raw)
except json.JSONDecodeError:
logger.warning(f"RerunWebSocketServer: ignoring non-JSON message: {raw!r}")
return

msg_type = msg.get("type")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Non-dict JSON causes unhandled AttributeError

json.loads() can return any JSON value (list, int, null, bool, etc.). If it returns anything other than a dict, the call to msg.get("type") on line 154 raises AttributeError: 'list' object has no attribute 'get', which is not caught by the json.JSONDecodeError guard. This exception propagates through _handle_client (which only catches websockets.ConnectionClosed) and becomes an unhandled coroutine exception, abruptly terminating the connection handler for that client.

Suggested change
def _dispatch(self, raw: str | bytes) -> None:
try:
msg = json.loads(raw)
except json.JSONDecodeError:
logger.warning(f"RerunWebSocketServer: ignoring non-JSON message: {raw!r}")
return
msg_type = msg.get("type")
def _dispatch(self, raw: str | bytes) -> None:
try:
msg = json.loads(raw)
except json.JSONDecodeError:
logger.warning(f"RerunWebSocketServer: ignoring non-JSON message: {raw!r}")
return
if not isinstance(msg, dict):
logger.warning(f"RerunWebSocketServer: ignoring non-object JSON message: {raw!r}")
return
msg_type = msg.get("type")

Comment on lines +96 to +104
@rpc
def stop(self) -> None:
if (
self._ws_loop is not None
and not self._ws_loop.is_closed()
and self._stop_event is not None
):
self._ws_loop.call_soon_threadsafe(self._stop_event.set)
super().stop()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Race condition: stop() is a no-op if called before _serve() initialises _stop_event

_stop_event is assigned inside the _serve() coroutine (line 122), which runs on the background thread. There is a window between the thread starting (and _ws_loop being set) and the coroutine executing the self._stop_event = asyncio.Event() line. If stop() is called in that window, the guard self._stop_event is not None is False and the call to call_soon_threadsafe is skipped entirely — super().stop() still runs, but the asyncio server loop never receives the stop signal and keeps the port bound until the process exits.

In practice this is most likely to surface when a test fails before _wait_for_server() returns, leaving the port occupied for the next test run. Consider initialising the event before the thread is spawned (using asyncio.Event() on the server's own loop once it is created) or adding a proper join() with a short deadline so stop() is always synchronous.

Comment on lines +299 to +311
env={
"DISPLAY": "",
"HOME": "/home/dimos",
"PATH": "/home/dimos/.cargo/bin:/usr/bin:/bin",
},
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)

# Give the viewer up to 5 s to connect its WebSocket client to our server.
# We detect the connection by waiting for the server to accept a client.
deadline = time.monotonic() + 5.0
while time.monotonic() < deadline:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Hard-coded HOME and PATH will break on most CI/developer machines

The stripped environment passes "HOME": "/home/dimos" and a minimal PATH that only covers /home/dimos/.cargo/bin, /usr/bin, and /bin. This will fail to locate dimos-viewer on any system where it is installed in a virtual-environment bin (e.g. ~/.local/bin, $VIRTUAL_ENV/bin, or a pixi-managed path) or where the user's home directory is not /home/dimos.

Consider inheriting the caller's environment and only overriding DISPLAY, or using shutil.which("dimos-viewer") to assert the binary is reachable before the test runs:

import os, shutil
env = {**os.environ, "DISPLAY": ""}
proc = subprocess.Popen(
    ["dimos-viewer", "--connect", f"--ws-url=ws://127.0.0.1:{_E2E_PORT}/ws"],
    env=env,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

Comment on lines +119 to +132
async def _serve(self) -> None:
import websockets.asyncio.server as ws_server

self._stop_event = asyncio.Event()

async with ws_server.serve(
self._handle_client,
host=self.config.host,
port=self.config.port,
):
logger.info(
f"RerunWebSocketServer listening on ws://{self.config.host}:{self.config.port}/ws"
)
await self._stop_event.wait()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 WebSocket server accepts connections on any URL path, not just /ws

websockets.asyncio.server.serve() called with only a handler function accepts every incoming WebSocket upgrade request regardless of the requested path. Any client that connects to ws://host:3030/ or ws://host:3030/admin gets the same handler. Because the server binds to 0.0.0.0 by default, this means any machine on the network can send arbitrary click and teleop commands by connecting to any path.

Consider adding a path check inside _handle_client and closing the connection immediately if the path is not /ws:

async def _handle_client(self, websocket: Any) -> None:
    if websocket.request.path != "/ws":
        await websocket.close(1008, "Not Found")
        return
    ...

@jeff-hykin jeff-hykin marked this pull request as draft March 21, 2026 23:32
- Move _server_ready.set() inside ws_server.serve() context so stop()
  waits for the port to actually bind before sending shutdown signal
- Add /ws path filter to reject non-viewer WebSocket connections
- Add pytest.mark.skipif for dimos-viewer binary test in CI
- Fix import ordering in manipulation/blueprints.py
The default websockets ping_interval=20s + ping_timeout=20s was too
aggressive. Increase both to 30s to give the viewer more time to
respond, especially during brief network hiccups.
…lure

If _serve() throws (e.g. port in use), _server_ready was never set,
causing stop() to block for 5s. Now logs the exception and sets
_server_ready in finally block.

Revert: git revert HEAD
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant