feat(repl): add repl #1665

Open
paul-nechifor wants to merge 1 commit into dev from paul/feat/repl

Conversation

@paul-nechifor
Contributor

@paul-nechifor paul-nechifor commented Mar 25, 2026

Problem

  • We don't have a way to inspect a running system.
  • Agents don't have enough access with just CLI commands.

Closes DIM-743

Solution

  • Used rpyc to connect to all running modules and proxy objects.
  • We can now use dimos repl to start an IPython REPL and call methods on modules or get any property.

Breaking Changes

None

How to Test

uv run dimos run unitree-go2-agentic

and then:

●  uv run dimos repl
DimOS REPL
Connected to localhost:18861

  coordinator  ModuleCoordinator instance
  modules()    List deployed module names
  get(name)    Get module instance by class name


In [1]: get('WavefrontFrontierExplorer')
Out[1]: <dimos.navigation.frontier_exploration.wavefront_frontier_goal_selector.WavefrontFrontierExplorer object at 0x77dda4604680>

In [2]: _.begin_exploration()
Out[2]: 'Started exploration skill. The robot is now moving. Use end_exploration to stop. You also need to cancel before starting a new movement tool.'

In [3]: 

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds an interactive RPyC-based REPL (dimos repl) for inspecting live running DimOS systems. When dimos run starts, a coordinator-level RPyC server is launched on a fixed port (default 18861) and each worker process gets its own auto-assigned RPyC server. dimos repl then connects to the coordinator, resolves module locations, and proxies objects directly from worker processes into an IPython (or stdlib code) session. The approach is well-structured, fully tested, and fits naturally into the existing worker/coordinator architecture.
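The per-worker servers' auto-assigned ports rely on standard port-0 socket semantics (the same behaviour rpyc depends on when binding): the OS picks a free ephemeral port at bind() time, so the port is known before serving begins. A minimal stdlib sketch of that mechanism:

```python
import socket

# Port-0 auto-assignment: binding with port 0 makes the kernel pick a free
# ephemeral port immediately, so the chosen port can be read back right
# after bind(), before any server thread starts accepting connections.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
assigned_port = s.getsockname()[1]
s.close()
```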

Issues found:

  • Daemon PID regression (P1): RunEntry is now created before daemonize() is called, so pid=os.getpid() captures the parent process PID rather than the daemon's PID. After the double-fork in daemonize(), the daemon process has a different PID. entry.save() is called inside the daemon with this stale PID, causing dimos stop and dimos status to target an already-exited process and never successfully signal the daemon. The fix is to reassign entry.pid = os.getpid() immediately after daemonize(log_dir).
  • No version bound for rpyc (P2): The dependency is declared without a version specifier in pyproject.toml, leaving the project open to future breaking changes from rpyc major releases.
  • find_free_port TOCTOU (P2): The shared fixture closes the socket before returning the port, creating a small window where the port could be claimed by another process in parallel CI.
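The P1 fix can be sketched as follows; RunEntry and daemonize are simplified stand-ins for the real implementations, shown only to illustrate the ordering:

```python
import os

def daemonize_stub():
    """Stand-in for the real daemonize(): the production version double-forks
    and detaches, so the process that continues has a *different* PID than
    the one that entered the function. This stub only marks the boundary."""
    pass

class RunEntry:
    """Minimal stand-in for the run-registry entry described in the review."""
    def __init__(self, pid: int):
        self.pid = pid

# Buggy ordering (as flagged): PID captured before the fork boundary.
entry = RunEntry(pid=os.getpid())
daemonize_stub()                 # the real daemonize() changes the PID here
# Fix: refresh the PID *after* daemonize() so `dimos stop` / `dimos status`
# signal the daemon rather than the already-exited parent.
entry.pid = os.getpid()
```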

Confidence Score: 3/5

  • Not safe to merge until the daemon PID regression is fixed — dimos stop will silently fail for all daemonised instances.
  • The REPL feature itself (repl_server.py, repl.py, worker IPC, module_coordinator additions) is well-implemented and thoroughly tested. However, refactoring the RunEntry creation to a single pre-fork location introduced a regression: in daemon mode the stored PID is the parent's PID, not the daemon's. This breaks dimos stop and dimos status — core production CLI commands — in the primary daemon usage pattern. That meets the threshold for a 3/5 per the guidance (likely production reliability problem in normal usage).
  • dimos/robot/cli/dimos.py — the pid=os.getpid() call must move to after daemonize().

Important Files Changed

Filename Overview
dimos/robot/cli/dimos.py Adds --repl/--no-repl and --repl-port options to dimos run, and a new repl subcommand. Introduces a regression: RunEntry.pid is captured before daemonize(), so the stored PID is the parent's (already-exited) PID in daemon mode, breaking dimos stop / dimos status.
dimos/core/repl_server.py New file implementing ReplServer (coordinator-side RPyC server) and start_worker_repl_server (per-worker RPyC server). Both use ThreadedServer with allow_all_attrs/setattr/delattr — intentional for a debug REPL. Port 0 auto-assignment is handled correctly: rpyc binds the socket in __init__, so server.port is reliable before the thread starts.
dimos/robot/cli/repl.py New REPL client: auto-detects port from run registry, connects via rpyc, and starts either an IPython or stdlib code.interact session with coordinator, modules(), and get(name) pre-populated. Connection cleanup in finally is correct; sync_request_timeout=None is appropriate for an interactive REPL.
dimos/core/module_coordinator.py Adds list_modules, get_module, get_module_location, and start_repl_server methods, plus _module_locations tracking. Introduces a guarded client property to replace duplicated "not started" checks. stop() correctly tears down the REPL server before other resources.
dimos/core/worker.py Adds Worker.start_repl_server() (sends IPC message to worker process) and handles the start_repl_server message type in _worker_loop. Both are wrapped in the existing try/except Exception error-handling block, so failures surface cleanly as error responses rather than crashing the loop.
dimos/conftest.py Adds shared test fixtures (find_free_port, wait_until_rpyc_connectable, make_stub_coordinator) and the _StubCoordinator helper used across the new REPL tests. Minor TOCTOU in find_free_port (socket closed before caller binds), which is standard but could cause rare flakiness in parallel CI.
pyproject.toml Adds rpyc as a runtime dependency and suppresses mypy errors for rpyc/rpyc.*. No version bound is specified, leaving the project open to future breaking changes from rpyc major releases.
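On the missing version bound for rpyc, a bounded specifier is the usual remedy; the exact bounds are a maintainer choice, shown here only as illustration:

```toml
[project]
dependencies = [
    # Pin to the current major series so a future rpyc major release
    # cannot silently change the ThreadedServer / Connection behaviour.
    "rpyc>=6.0,<7",
]
```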

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as dimos CLI
    participant Coord as ModuleCoordinator
    participant ReplSrv as ReplServer (port 18861)
    participant Worker as Worker Process
    participant WorkerSrv as WorkerReplServer (port 0→N)

    Note over CLI,WorkerSrv: dimos run (startup)
    CLI->>Coord: build() + start()
    CLI->>Coord: start_repl_server(port=18861)
    Coord->>Worker: start_repl_server IPC message
    Worker->>WorkerSrv: start_worker_repl_server(instances)
    WorkerSrv-->>Worker: listening on port N
    Worker-->>Coord: port N
    Coord->>Coord: _module_locations["ModuleX"] = ("localhost", N)
    Coord->>ReplSrv: ReplServer(coordinator).start()

    Note over User,WorkerSrv: dimos repl (client session)
    User->>CLI: dimos repl
    CLI->>ReplSrv: rpyc.connect(localhost, 18861)
    CLI->>ReplSrv: root.get_coordinator()
    ReplSrv-->>CLI: coordinator proxy
    CLI->>User: IPython REPL (coordinator, modules(), get())

    User->>CLI: get("ModuleX")
    CLI->>ReplSrv: root.get_module_location("ModuleX")
    ReplSrv-->>CLI: ("localhost", N)
    CLI->>WorkerSrv: rpyc.connect(localhost, N)
    CLI->>WorkerSrv: root.get_instance_by_name("ModuleX")
    WorkerSrv-->>CLI: ModuleX proxy (allow_all_attrs)
    CLI-->>User: <ModuleX object>

Comments Outside Diff (2)

  1. pyproject.toml, line 1068 (link)

    P2 No version constraint for rpyc

    rpyc is added without a version specifier, which means any future major release (e.g., 7.x) could be resolved and potentially introduce breaking changes to the ThreadedServer API or Connection behaviour used in repl_server.py.

  2. dimos/conftest.py, line 27-31 (link)

    P2 TOCTOU race in find_free_port

    The socket is closed before the port number is returned to the caller. Between closing the socket and the test actually binding to that port there is a small window in which another OS process (or a parallel pytest worker) could claim the same ephemeral port, leading to an Address already in use error and a flaky test.

    The same pattern was previously present in test_unity_sim.py (which this PR correctly consolidates here). The standard mitigation is to keep the socket open and pass it directly to the server, or to use SO_REUSEADDR — but that is a broader test-infrastructure concern rather than a blocker for this PR.
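The keep-the-socket-open mitigation can be sketched as below; reserve_port is a hypothetical helper name (the real fixture is find_free_port):

```python
import socket

def reserve_port() -> tuple[socket.socket, int]:
    """Bind an ephemeral port and return the still-open socket alongside the
    port number, so the caller can hand the socket to the server directly
    instead of closing and re-binding (which opens the TOCTOU window)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", 0))
    return s, s.getsockname()[1]

sock, port = reserve_port()
# ... pass `sock` (not just `port`) to the code under test ...
sock.close()
```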

Reviews (1): Last reviewed commit: "feat(repl): add repl"

@leshy
Contributor

leshy commented Mar 25, 2026

I understand this is done for interacting with running dimos, I assume ideally we can use the same API if deploying a blueprint and playing with it as a part of an actual python file right?

in py:

bla = dimos.deploy(something)
bla.nav.goto(..)

vs (in terminal)

uv run dimos --daemon run ...
uv run dimos repl

then

coordinator.nav.goto(..)

would be nice to have parity here

or if I have a running dimos I (or agent!) should be able to write

test.py

bla = dimos.connect()
bla.nav.goto(..)

right? (ignore actual API I'm using, it's an example imaginary API)

would be great to have docs/ showing all 3 usecases if ready,
if not - do you agree with my thoughts, do want to do this in this PR or separate?

@paul-nechifor
Contributor Author

> I understand this is done for interacting with running dimos, I assume ideally we can use the same API if deploying a blueprint and playing with it as a part of an actual python file right?
>
> in py:
>
> bla = dimos.deploy(something)
> bla.nav.goto(..)
>
> vs (in terminal)
>
> uv run dimos --daemon run ...
> uv run dimos repl
>
> then
>
> coordinator.nav.goto(..)
>
> would be nice to have parity here
>
> or if I have a running dimos I (or agent!) should be able to write
>
> test.py
>
> bla = dimos.connect()
> bla.nav.goto(..)
>
> right? (ignore actual API I'm using, it's an example imaginary API)
>
> would be great to have docs/ showing all 3 usecases if ready, if not - do you agree with my thoughts, do want to do this in this PR or separate?

But this has always been possible; it's just hard to specify a system composition in a REPL. Also, we only give access to methods marked with @rpc, whereas rpyc gives you access to any methods/properties on the object.

Example blueprint start in REPL:

>>> from dimos.robot.unitree.go2.connection import GO2Connection
>>> b = GO2Connection.blueprint()
>>> mc = b.build()
23:14:51.426[inf][dimos/core/blueprints.py      ] Building the blueprint
23:14:51.431[inf][dimos/core/blueprints.py      ] Starting the modules
23:14:51.454[inf][dimos/core/worker_manager.py  ] Worker pool started. n_workers=2
23:15:02.099[inf][dimos/core/worker.py          ] Deployed module. module=GO2Connection module_id=0 worker_id=0
23:15:02.101[inf][dimos/core/blueprints.py      ] Transport module=GO2Connection name=pointcloud original_name=pointcloud topic=/pointcloud#sensor_msgs.PointCloud2 transport=LCMTransport type=dimos.msgs.sensor_msgs.PointCloud2.PointCloud2
23:15:02.101[inf][dimos/core/blueprints.py      ] Transport module=GO2Connection name=color_image original_name=color_image topic=/color_image#sensor_msgs.Image transport=LCMTransport type=dimos.msgs.sensor_msgs.Image.Image
23:15:02.101[inf][dimos/core/blueprints.py      ] Transport module=GO2Connection name=camera_info original_name=camera_info topic=/camera_info#sensor_msgs.CameraInfo transport=LCMTransport type=dimos.msgs.sensor_msgs.CameraInfo.CameraInfo
23:15:02.102[inf][dimos/core/blueprints.py      ] Transport module=GO2Connection name=cmd_vel original_name=cmd_vel topic=/cmd_vel#geometry_msgs.Twist transport=LCMTransport type=dimos.msgs.geometry_msgs.Twist.Twist
23:15:02.102[inf][dimos/core/blueprints.py      ] Transport module=GO2Connection name=odom original_name=odom topic=/odom#geometry_msgs.PoseStamped transport=LCMTransport type=dimos.msgs.geometry_msgs.PoseStamped.PoseStamped
23:15:02.102[inf][dimos/core/blueprints.py      ] Transport module=GO2Connection name=lidar original_name=lidar topic=/lidar#sensor_msgs.PointCloud2 transport=LCMTransport type=dimos.msgs.sensor_msgs.PointCloud2.PointCloud2
*************** EP Error ***************
EP Error /onnxruntime_src/onnxruntime/python/onnxruntime_pybind_state.cc:539 void onnxruntime::python::RegisterTensorRTPluginsAsCustomOps(PySessionOptions&, const onnxruntime::ProviderOptions&) Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
 when using ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
****************************************
23:15:14.060[inf][os/simulation/mujoco/policy.py] Loaded policy: /home/p/pro/dimensional/dimos/data/mujoco_sim/unitree_go1_policy.onnx with providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
23:15:14.160[inf][t/unitree/mujoco_connection.py] MuJoCo process started successfully
>>> go2 = mc.get_instance(GO2Connection)
>>> go2.
go2.actor_class(       go2.actor_instance     go2.remote_name        go2.rpc                go2.rpcs               go2.stop_rpc_client()  
>>> go2.liedown()
True
>>> 


@leshy leshy left a comment


some API change requests for easier access, idk if you want to merge first then iterate, seems isolated enough to iterate here?

```python
['GO2Connection', 'RerunBridge', 'McpServer', ...]

# Get a module instance and call methods on it
>>> wfe = get('WavefrontFrontierExplorer')
```

this feels slightly awkward, why can't I just have WavefrontFrontierExplorer already there in the namespace?

```python
# List all deployed modules
>>> modules()
['GO2Connection', 'RerunBridge', 'McpServer', ...]
```

isn't it nicer to just have a list of actual instances?

```python
"Started exploring."

# Access the coordinator directly
>>> coordinator.list_modules()
```

isn't it nicer to just have a list of actual instances?

or even better if I have a magic namespace I can do coordinator.modules.WavefrontFrontierExplorer

you have repl so coordinator.modules. TAB gives you autocomplete etc

(memory2 does this for streams)
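The "magic namespace" idea is straightforward to prototype; a hypothetical sketch (names invented, not the dimos API) of attribute access plus __dir__-driven tab completion:

```python
class ModuleNamespace:
    """Hypothetical sketch of the suggestion above: attribute access resolves
    deployed module instances, and __dir__ drives REPL tab completion."""

    def __init__(self, registry: dict):
        self._registry = registry  # module name -> instance (or rpyc proxy)

    def __getattr__(self, name: str):
        # Only called when normal attribute lookup fails, so _registry
        # itself is found the usual way.
        try:
            return self._registry[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # IPython calls dir() for completion, so `modules.<TAB>` lists
        # the deployed module names.
        return sorted(self._registry)

modules = ModuleNamespace({"WavefrontFrontierExplorer": object(),
                           "GO2Connection": object()})
```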

@leshy
Contributor

leshy commented Mar 26, 2026

> But this has always been possible it's just hard to specify a system composition in a REPL. Also, we only give access to call methods marked with @rpc. rpyc gives you access to any methods/properties on the object.

yes I know, just making sure that

  • the REPL API,
  • the "I write a small script to interact with live dimos" API,
  • the "I write a small script that deploys a blueprint then interacts with it" API,
  • and the "I run IPython directly and deploy a blueprint then interact with it" API

is all the same API, you just change how you interact with a deployed thing, and this is what I'd document, not REPL as a single entrypoint

@leshy
Contributor

leshy commented Mar 26, 2026

after chatting, update here, example of unified interaction with dimos - it doesn't have to be this exact same API, point is it's the same API in all cases

I use repl to talk to running dimos
dimos repl

> dimos.modules. <tab>
Go2Connection, VoxelMapper, WaveFrontExplorer, ...

> dimos.modules.WaveFrontExplorer.
<tab>
Start, Stop, start_exploration, some_skill...

> dimos.modules.WaveFrontExplorer.start_exploration()
...

> dimos.modules.WaveFrontExplorer.some_skill(`some_skill_arg`)
...

I run dimos (dimos run unitre...) then write a script
Script talks to dimos
bla.py

from dimos.core import connect

dimos = connect() #idk exact naming
dimos.modules.WaveFrontExplorer.start_exploration()
dimos.modules.WaveFrontExplorer.some_skill(`some_skill_arg`)

I run my blueprint by myself
bla.py (this can be a script but could also be an ipython interaction)

from dimos.robot.unitree.go2.connection import GO2Connection
dimos = GO2Connection.blueprint().build()

dimos.modules.WaveFrontExplorer.start_exploration()
dimos.modules.WaveFrontExplorer.some_skill(`some_skill_arg`)

above assumes we cannot deploy multiple blueprints in the same instance, I think we want this, so will iterate on that separately
