Conversation
Greptile Summary

This PR introduces the SmartNav navigation stack for the Unitree Go2: a Python PGO (pose graph optimization) module using GTSAM iSAM2 for loop-closure-corrected odometry, a WebSocket server that bridges dimos-viewer click/teleop events to DimOS streams, and a content-based file change detection utility (`dimos/utils/change_detect.py`). Key concerns: thread-unsafe access to `_SimplePGO`, a stop-signal race in the WebSocket server, `print()` in place of the structured logger, and per-keyframe KDTree rebuilds in loop search.
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant GO2 as GO2Connection
    participant PGO as PGO Module
    participant VM as VoxelGridMapper
    participant CM as CostMapper
    participant AStar as ReplanningAStarPlanner
    participant WFE as WavefrontFrontierExplorer
    participant WS as RerunWebSocketServer
    participant Viewer as dimos-viewer
    GO2->>PGO: raw_odom (PoseStamped)
    GO2->>PGO: registered_scan (PointCloud2)
    PGO->>PGO: keyframe detection + ICP loop closure
    PGO->>PGO: GTSAM iSAM2 graph optimization
    PGO-->>AStar: odom (PoseStamped, corrected)
    PGO-->>PGO: global_static_map [pgo_global_static_map] (viz only)
    GO2->>VM: registered_scan (PointCloud2)
    VM-->>CM: global_map (OccupancyGrid)
    CM-->>AStar: global_costmap (OccupancyGrid)
    WFE-->>AStar: frontier goal
    AStar-->>GO2: cmd_vel (Twist)
    Viewer->>WS: click / twist / stop (JSON over WebSocket)
    WS-->>AStar: clicked_point (PointStamped)
    WS-->>GO2: tele_cmd_vel (Twist)
```
Last reviewed commit: "CI code cleanup"
```python
def _on_scan(self, cloud: PointCloud2) -> None:
    points, _ = cloud.as_numpy()
    if len(points) == 0:
        return

    with self._lock:
        if not self._has_odom:
            return
        r_local = self._latest_r.copy()
        t_local = self._latest_t.copy()
        ts = self._latest_time

    pgo = self._pgo
    assert pgo is not None

    # Body-frame points
    if self.config.unregister_input:
        # registered_scan is world-frame, transform back to body-frame
        body_pts = (r_local.T @ (points[:, :3].T - t_local[:, None])).T
    else:
        body_pts = points[:, :3]

    added = pgo.add_key_pose(r_local, t_local, ts, body_pts)
    if added:
        pgo.search_for_loops()
        pgo.smooth_and_update()
        print(
            f"[PGO] Keyframe {pgo.num_key_poses} added "
            f"({t_local[0]:.1f}, {t_local[1]:.1f}, {t_local[2]:.1f})"
        )

    # Publish corrected odometry
    r_corr, t_corr = pgo.get_corrected_pose(r_local, t_local)
    self._publish_corrected_odom(r_corr, t_corr, ts)
```
_SimplePGO accessed from two threads without locking
_on_scan releases self._lock after copying the latest odom (line 447) and then calls pgo.add_key_pose, pgo.search_for_loops, and pgo.smooth_and_update (lines 459-462) entirely outside the lock. Concurrently, _publish_loop (see lines 509-519) reads pgo.num_key_poses and calls pgo.build_global_static_map in its own thread with no lock held.
_SimplePGO is not thread-safe: _key_poses is mutated by add_key_pose / smooth_and_update and iterated by build_global_static_map at the same time. This can cause index-out-of-range, incorrect frame selection during _get_submap, or a corrupted parts list in build_global_static_map.
The lock already exists — extend its scope to cover all _pgo accesses in both _on_scan and _publish_loop, or add a dedicated _pgo_lock that both methods acquire.
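A dedicated-lock version can be sketched as follows (the class shell and the fake `pgo` handle are illustrative stand-ins; only `_pgo_lock` and the call sequence mirror the fix described above):

```python
import threading


class PGOModule:
    """Sketch: serialize all _pgo access behind one dedicated lock."""

    def __init__(self, pgo):
        self._pgo = pgo
        self._pgo_lock = threading.Lock()

    def _on_scan(self, r_local, t_local, ts, body_pts):
        # Hold the lock across the whole mutation sequence so that
        # build_global_static_map never observes a half-updated graph.
        with self._pgo_lock:
            added = self._pgo.add_key_pose(r_local, t_local, ts, body_pts)
            if added:
                self._pgo.search_for_loops()
                self._pgo.smooth_and_update()
            return self._pgo.get_corrected_pose(r_local, t_local)

    def _publish_loop_tick(self):
        # The publish thread takes the same lock before reading the graph.
        with self._pgo_lock:
            return self._pgo.build_global_static_map()
```

Keeping loop search and smoothing inside the lock does lengthen the critical section; if that stalls the publish thread too long, the alternative from the comment above (snapshotting keyframes under the lock and building the map outside it) trades memory for latency.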
dimos/navigation/loop_closure/pgo.py
Outdated
```python
def start(self) -> None:
    self._pgo = _SimplePGO(self.config)
    self.raw_odom._transport.subscribe(self._on_raw_odom)
    self.registered_scan._transport.subscribe(self._on_scan)
    self._running = True
    self._thread = threading.Thread(target=self._publish_loop, daemon=True)
    self._thread.start()
    print("[PGO] Python PGO module started (gtsam iSAM2)")
```
Every other module in the codebase calls super().start() at the beginning of its start() override (e.g. RerunWebSocketServer.start() calls it on its first line). PGO.start() skips this entirely. The base Module.start() may set up transport bindings, health checks, or lifecycle hooks that are expected before _transport.subscribe() is called. Skipping it can lead to partially-initialised transports or missing signals to the coordinator.
Suggested change:

```diff
 def start(self) -> None:
+    super().start()
     self._pgo = _SimplePGO(self.config)
     self.raw_odom._transport.subscribe(self._on_raw_odom)
     self.registered_scan._transport.subscribe(self._on_scan)
     self._running = True
     self._thread = threading.Thread(target=self._publish_loop, daemon=True)
     self._thread.start()
     print("[PGO] Python PGO module started (gtsam iSAM2)")
```
```python
    )

@rpc
def stop(self) -> None:
    if (
        self._ws_loop is not None
        and not self._ws_loop.is_closed()
        and self._stop_event is not None
    ):
```
Race condition: stop() may miss the stop signal

_ws_loop is assigned inside _run_server(), which runs in the background thread (line 110). There is a window between server_thread.start() (line 91) and the first statement of _run_server() where stop() can be called. At that point self._ws_loop is None, so the stop event is never signalled and _serve() blocks forever on await self._stop_event.wait().

Additionally, stop() does not join _server_thread, so the port may still be bound briefly after stop() returns; this causes flaky failures in the test suite when a new server is created on the same port immediately after.

A simple mitigation is to wait for the loop to be ready before returning from start(), or use a threading event to signal readiness. At minimum, join the thread in stop():

```python
def stop(self) -> None:
    if self._ws_loop is not None and not self._ws_loop.is_closed() and self._stop_event is not None:
        self._ws_loop.call_soon_threadsafe(self._stop_event.set)
    if self._server_thread is not None:
        self._server_thread.join(timeout=5.0)
    super().stop()
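The readiness-event variant might look like this (class and method names are simplified stand-ins for the actual module, which also binds the WebSocket server inside the loop):

```python
import asyncio
import threading


class WSServer:
    """Sketch: start() blocks until the background loop exists,
    so stop() can always reach _ws_loop and _stop_event."""

    def __init__(self):
        self._ws_loop = None
        self._stop_event = None
        self._ready = threading.Event()
        self._server_thread = None

    def _run_server(self):
        self._ws_loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self._ws_loop)
        self._stop_event = asyncio.Event()
        self._ready.set()  # loop and stop event now exist
        self._ws_loop.run_until_complete(self._stop_event.wait())

    def start(self):
        self._server_thread = threading.Thread(target=self._run_server, daemon=True)
        self._server_thread.start()
        # Close the race window: don't return until _run_server has
        # published _ws_loop and _stop_event.
        self._ready.wait(timeout=5.0)

    def stop(self):
        if self._ws_loop is not None and not self._ws_loop.is_closed():
            self._ws_loop.call_soon_threadsafe(self._stop_event.set)
        if self._server_thread is not None:
            self._server_thread.join(timeout=5.0)
```

The join in stop() also guarantees the port is released before a new server is constructed, addressing the test flakiness noted above.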
dimos/navigation/loop_closure/pgo.py
Outdated
```python
self._running = True
self._thread = threading.Thread(target=self._publish_loop, daemon=True)
self._thread.start()
print("[PGO] Python PGO module started (gtsam iSAM2)")
```
print() used instead of structured logger
This file uses print() throughout (lines 412, 463-465, 515-518, 792) instead of the established setup_logger() pattern used by every other module in the codebase. This bypasses log-level filtering, structured fields, and the console formatter configured by the rest of the system.
Replace all print(...) calls with logger.info(...) / logger.debug(...) after adding logger = setup_logger() at module level. This also applies to the _SimplePGO.search_for_loops method (line 792) which prints directly.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
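As a rough sketch, assuming `setup_logger()` returns a standard `logging.Logger` (the stand-in below approximates it with stdlib `logging`; the project's real formatter and fields will differ):

```python
import logging


def setup_logger(name: str = "pgo") -> logging.Logger:
    # Stand-in for the codebase's setup_logger(); assumed to return a
    # standard logging.Logger wired to the project's console formatter.
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


logger = setup_logger()


def log_keyframe(n: int, t) -> None:
    # before: print(f"[PGO] Keyframe {n} added (...)")
    # after: level-filtered, formatter-aware, lazily interpolated
    logger.info("Keyframe %d added (%.1f, %.1f, %.1f)", n, t[0], t[1], t[2])
```

Using %-style lazy interpolation (rather than f-strings) also skips formatting entirely when the log level filters the record out.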
```python
def search_for_loops(self) -> None:
    if len(self._key_poses) < 10:
        return

    # Rate limit
    if self._history_pairs:
        cur_time = self._key_poses[-1].timestamp
        last_time = self._key_poses[self._history_pairs[-1][1]].timestamp
        if cur_time - last_time < self._cfg.min_loop_detect_duration:
            return

    cur_idx = len(self._key_poses) - 1
    cur_kp = self._key_poses[-1]

    # Build KD-tree of previous keyframe positions
```
KDTree rebuilt from scratch on every new keyframe
search_for_loops() re-creates a KDTree over all previous keyframe positions on every call. This is called once per keyframe (from _on_scan), so the cost scales as O(n log n) per keyframe and O(n² log n) over the lifetime of a session. For a long indoor mapping run (thousands of keyframes), this will become a real-time bottleneck in the scan callback, which currently runs on the transport thread.
Consider either:
- Maintaining an incremental index that is only rebuilt when `smooth_and_update` changes keyframe positions (loop closure), or
- Using a simpler, O(n) radius search with a spatial hash for the common no-loop-closure path.
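The second option can be sketched in a few lines of Python (this `SpatialHash` is illustrative, not code from the PR; the cell size and query semantics are assumptions):

```python
from collections import defaultdict

import numpy as np


class SpatialHash:
    """Sketch of an O(1)-insert radius search for keyframe positions.

    Cells are cubes of side `cell`; a radius query only visits the
    neighbouring cells within `r`, so the common no-loop-closure path
    avoids rebuilding a KDTree on every keyframe.
    """

    def __init__(self, cell: float = 2.0):
        self.cell = cell
        self.cells = defaultdict(list)  # (ix, iy, iz) -> [keyframe indices]
        self.points = []

    def _key(self, p):
        return tuple((np.asarray(p) // self.cell).astype(int))

    def insert(self, p) -> int:
        idx = len(self.points)
        self.points.append(np.asarray(p, dtype=float))
        self.cells[self._key(p)].append(idx)
        return idx

    def query_radius(self, p, r: float):
        cx, cy, cz = self._key(p)
        reach = int(np.ceil(r / self.cell))
        p = np.asarray(p, dtype=float)
        out = []
        for dx in range(-reach, reach + 1):
            for dy in range(-reach, reach + 1):
                for dz in range(-reach, reach + 1):
                    for idx in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                        if np.linalg.norm(self.points[idx] - p) <= r:
                            out.append(idx)
        return out
```

On a loop closure, when `smooth_and_update` moves keyframes, the hash would need to be rebuilt once, which is still far cheaper than a per-keyframe KDTree rebuild.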
Add a generic file change detection utility (dimos/utils/change_detect.py) that tracks content hashes via xxhash, and integrate it into NativeModule so it can automatically rebuild when watched source files change.

- `change_detect.did_change()` hashes file content, stores per-cache-name hash files in the venv, and returns True when files differ
- `NativeModuleConfig` gains `rebuild_on_change: list[str] | None`
- `NativeModule._maybe_build()` deletes stale executables when sources change
- 11 tests for change_detect, 3 integration tests for native rebuild
…avoid unlinking Nix store executables

- Add `cwd` parameter to `did_change()` and `_resolve_paths()` so relative glob patterns in `rebuild_on_change` are resolved against the module's working directory instead of the process cwd.
- Replace `exe.unlink()` with a `needs_rebuild` flag so executables that live in read-only locations (e.g. Nix store) are not deleted; instead the build command is re-run, which handles the output path itself.
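The core idea can be sketched as follows (a stand-in, not the code in `dimos/utils/change_detect.py`: the real utility uses xxhash and persists hashes in the venv, while this sketch uses `hashlib` and an in-memory dict):

```python
import hashlib
from pathlib import Path

_hash_store = {}  # stand-in for the per-cache-name hash files in the venv


def did_change(cache_name: str, patterns, cwd: Path) -> bool:
    """Return True when the combined content hash of the watched files
    differs from the last recorded hash for this cache name."""
    digest = hashlib.sha256()
    for pattern in patterns:
        # Relative globs resolve against the module's cwd, not the process cwd.
        for path in sorted(cwd.glob(pattern)):
            digest.update(path.read_bytes())
    new_hash = digest.hexdigest()
    changed = _hash_store.get(cache_name) != new_hash
    _hash_store[cache_name] = new_hash
    return changed
```

The first call for a cache name always reports a change (there is no prior hash), which matches the rebuild-on-first-run behaviour a native module wants.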
- VoxelMapper builds the navigation map with column carving so stale obstacles are cleared when the robot rescans an area
- PGO's accumulated global_static_map is kept for visualisation only (remapped to pgo_global_static_map)
- Add simulation blueprint entries and odom-adapter to all_blueprints
- Update e2e test to use LCMTransport subscribers instead of module internals (works with forked workers)
Was imported inline 3 times in hot paths (per-keyframe and per-odom). Hoisted to module-level import. Revert: git revert HEAD
assert is stripped in python -O mode. Use if/return for production safety guards. Revert: git revert HEAD
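A minimal illustration of the difference (both function names are hypothetical):

```python
def corrected_pose_unsafe(pgo, r, t):
    # Under `python -O` this assert is compiled out entirely, so a None
    # pgo falls through to an AttributeError instead of a clean failure.
    assert pgo is not None
    return pgo.get_corrected_pose(r, t)


def corrected_pose_safe(pgo, r, t):
    if pgo is None:  # explicit guard survives -O
        return None
    return pgo.get_corrected_pose(r, t)
```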
msg.ts=0.0 is a valid timestamp (epoch) but falsy. The old check 'if msg.ts' would incorrectly fall back to time.time(). Revert: git revert HEAD
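A minimal illustration of the corrected check (the helper name is hypothetical):

```python
import time


def stamp_from_msg(ts):
    """ts=0.0 (the epoch) is a valid timestamp; only None means missing.

    The buggy form `return ts if ts else time.time()` would treat 0.0
    as missing and silently substitute wall-clock time.
    """
    return ts if ts is not None else time.time()
```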
Replace O(N) Python loop (set membership per point) with vectorized np.isin. For 50k+ point clouds at 10Hz this was a bottleneck. Revert: git revert HEAD
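The before/after shapes look roughly like this (function names and array layout are illustrative, not the PR's code):

```python
import numpy as np


def filter_points_loop(points, keep_ids, ids):
    # O(N) Python loop: one interpreted set lookup per point, which
    # becomes a bottleneck at 50k+ points per cloud at 10 Hz.
    mask = np.array([i in keep_ids for i in ids])
    return points[mask]


def filter_points_vectorized(points, keep_ids, ids):
    # Vectorized replacement: np.isin performs the membership test in C.
    mask = np.isin(ids, list(keep_ids))
    return points[mask]
```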
Force-pushed from f0a9371 to 211ae03
Loop closure MVP: this grabs the PGO loop-closure (native module) out of RosNav and integrates it into the go2 nav stack. There are a few problems:
A hack (ScanCorrector) is used to work around the slowness and staticness: part of the static map is always cleared in favor of live lidar readings, using the PGO odom instead of the sensor odom. The clearing of the static map is temporary and uses z-column clearing; the closed-door problem still exists and it's really bad.

Problem
Closes DIM-XXX
Solution
Breaking Changes
How to Test
Contributor License Agreement