Skip to content

Barebones Loop Closure#1645

Draft
jeff-hykin wants to merge 27 commits intodevfrom
jeff/feat/nav1
Draft

Barebones Loop Closure#1645
jeff-hykin wants to merge 27 commits intodevfrom
jeff/feat/nav1

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented Mar 22, 2026

Loop closure MVP: this grabs the PGO loop-closure (native module) out of RosNav and integrates it into the go2 nav stack. There are a few problems:

  1. PGO's global map is slow to update
  2. PGO's map seems kinda bad
  3. PGO's map is static

A hack (ScanCorrector) is used to get around the slow-ness and static-ness: part of the static map is always cleared in favor of live lidar readings using the PGO odom instead of the sensor odom. The clearing of the static map is just temporary and uses z-column clearing; closed-door problem still exists and its real real bad.

image

Problem

Closes DIM-XXX

Solution

Breaking Changes

How to Test

Contributor License Agreement

  • I have read and approved the CLA.

@jeff-hykin jeff-hykin marked this pull request as draft March 22, 2026 01:14
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 22, 2026

Greptile Summary

This PR introduces the SmartNav navigation stack for the Unitree Go2: a Python PGO (pose graph optimization) module using GTSAM iSAM2 for loop-closure-corrected odometry, a WebSocket server that bridges dimos-viewer click/teleop events to DimOS streams, a content-based file change detection utility (change_detect.py / xxhash) used to trigger native module rebuilds and URDF cache invalidation, and a new unitree_go2_smartnav blueprint wiring them together with VoxelMapper, CostMapper, and ReplanningAStarPlanner.

Key concerns:

  • Thread safety bug in pgo.py: _on_scan (transport thread) and _publish_loop (dedicated thread) both access the shared _SimplePGO instance — mutating and iterating _key_poses, _r_offset, _cache_pairs, etc. — without any mutual exclusion. This can corrupt the pose graph mid-iteration during build_global_static_map.
  • Missing super().start() in PGO.start(): Every other module calls super().start() first; skipping it may bypass base-class transport setup or lifecycle hooks.
  • Race condition in RerunWebSocketServer.stop(): _ws_loop is assigned in the background thread; if stop() is called before the thread initialises the loop, the stop signal is silently dropped and the port stays bound. The server thread is also never joined, which causes port-reuse failures in tests that create a new server immediately after stopping.
  • pgo.py uses bare print() throughout instead of the project-wide setup_logger() pattern.
  • search_for_loops() rebuilds a fresh KDTree from all keyframe positions on every new keyframe, an O(n log n)-per-frame cost that will degrade over long mapping sessions.

Confidence Score: 2/5

  • Not safe to merge — the PGO thread-safety bug can corrupt the pose graph in normal operation, and the WebSocket server has a race condition that causes reliable test failures.
  • Two concrete, reproducible bugs (concurrent access to _SimplePGO without a lock; stop() race on _ws_loop) block the primary user path. The PGO thread-safety issue will manifest silently as corrupted pose estimates or crashes during any long mapping session. Resolving those two issues and the missing super().start() call would raise this to a 4.
  • dimos/navigation/loop_closure/pgo.py (thread safety, missing super().start()), dimos/visualization/rerun/websocket_server.py (stop() race condition, missing thread join)

Important Files Changed

Filename Overview
dimos/navigation/loop_closure/pgo.py New Python PGO module with GTSAM iSAM2 loop closure; has a thread-safety bug between _on_scan and _publish_loop accessing _SimplePGO concurrently, and missing super().start() call.
dimos/visualization/rerun/websocket_server.py New WebSocket server module bridging dimos-viewer events to DimOS streams; has a race condition where stop() can miss the stop signal if called before _run_server sets _ws_loop, and the server thread is never joined.
dimos/utils/change_detect.py New content-based file change detection utility using xxhash; well-structured with file locking, glob support, and clear semantics. Note: fcntl is Unix-only but acceptable for this robotics target.
dimos/core/native_module.py Integrates change_detect into NativeModule to trigger rebuilds when source files change; logic is correct — cache is seeded post-build and rebuild_on_change is excluded from CLI args.
dimos/robot/unitree/go2/blueprints/smart/unitree_go2_smartnav.py New blueprint composing PGO + VoxelMapper + CostMapper + ReplanningAStarPlanner + WavefrontFrontierExplorer with correct remappings; straightforward and well-documented.
dimos/manipulation/planning/utils/mesh_utils.py Switches URDF cache invalidation from mtime-in-key to did_change content hashing; processing_version bumped to v3 to invalidate stale caches. Change is correct and clean.
dimos/e2e_tests/test_smartnav_replay.py New end-to-end integration test replaying sensor data through PGO + VoxelMapper + CostMapper pipeline; uses LCMTransport for message collection. Marked @pytest.mark.slow appropriately.
pyproject.toml Adds xxhash as a core dependency, adds navigation extra (gtsam/gtsam-develop with aarch64 workaround), and adds uv override to fix gtsam-develop's incorrect pytest runtime dep.

Sequence Diagram

sequenceDiagram
    participant GO2 as GO2Connection
    participant PGO as PGO Module
    participant VM as VoxelGridMapper
    participant CM as CostMapper
    participant AStar as ReplanningAStarPlanner
    participant WFE as WavefrontFrontierExplorer
    participant WS as RerunWebSocketServer
    participant Viewer as dimos-viewer

    GO2->>PGO: raw_odom (PoseStamped)
    GO2->>PGO: registered_scan (PointCloud2)
    PGO->>PGO: keyframe detection + ICP loop closure
    PGO->>PGO: GTSAM iSAM2 graph optimization
    PGO-->>AStar: odom (PoseStamped, corrected)
    PGO-->>PGO: global_static_map [pgo_global_static_map] (viz only)

    GO2->>VM: registered_scan (PointCloud2)
    VM-->>CM: global_map (OccupancyGrid)
    CM-->>AStar: global_costmap (OccupancyGrid)

    WFE-->>AStar: frontier goal
    AStar-->>GO2: cmd_vel (Twist)

    Viewer->>WS: click / twist / stop (JSON over WebSocket)
    WS-->>AStar: clicked_point (PointStamped)
    WS-->>GO2: tele_cmd_vel (Twist)
Loading

Last reviewed commit: "CI code cleanup"

Comment on lines +437 to +470
def _on_scan(self, cloud: PointCloud2) -> None:
points, _ = cloud.as_numpy()
if len(points) == 0:
return

with self._lock:
if not self._has_odom:
return
r_local = self._latest_r.copy()
t_local = self._latest_t.copy()
ts = self._latest_time

pgo = self._pgo
assert pgo is not None

# Body-frame points
if self.config.unregister_input:
# registered_scan is world-frame, transform back to body-frame
body_pts = (r_local.T @ (points[:, :3].T - t_local[:, None])).T
else:
body_pts = points[:, :3]

added = pgo.add_key_pose(r_local, t_local, ts, body_pts)
if added:
pgo.search_for_loops()
pgo.smooth_and_update()
print(
f"[PGO] Keyframe {pgo.num_key_poses} added "
f"({t_local[0]:.1f}, {t_local[1]:.1f}, {t_local[2]:.1f})"
)

# Publish corrected odometry
r_corr, t_corr = pgo.get_corrected_pose(r_local, t_local)
self._publish_corrected_odom(r_corr, t_corr, ts)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _SimplePGO accessed from two threads without locking

_on_scan releases self._lock after copying the latest odom (line 447) and then calls pgo.add_key_pose, pgo.search_for_loops, and pgo.smooth_and_update (lines 459-462) entirely outside the lock. Concurrently, _publish_loop (see lines 509-519) reads pgo.num_key_poses and calls pgo.build_global_static_map in its own thread with no lock held.

_SimplePGO is not thread-safe: _key_poses is mutated by add_key_pose / smooth_and_update and iterated by build_global_static_map at the same time. This can cause index-out-of-range, incorrect frame selection during _get_submap, or a corrupted parts list in build_global_static_map.

The lock already exists — extend its scope to cover all _pgo accesses in both _on_scan and _publish_loop, or add a dedicated _pgo_lock that both methods acquire.

Comment on lines +405 to +412
def start(self) -> None:
self._pgo = _SimplePGO(self.config)
self.raw_odom._transport.subscribe(self._on_raw_odom)
self.registered_scan._transport.subscribe(self._on_scan)
self._running = True
self._thread = threading.Thread(target=self._publish_loop, daemon=True)
self._thread.start()
print("[PGO] Python PGO module started (gtsam iSAM2)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 super().start() not called

Every other module in the codebase calls super().start() at the beginning of its start() override (e.g. RerunWebSocketServer.start() calls it on its first line). PGO.start() skips this entirely. The base Module.start() may set up transport bindings, health checks, or lifecycle hooks that are expected before _transport.subscribe() is called. Skipping it can lead to partially-initialised transports or missing signals to the coordinator.

Suggested change
def start(self) -> None:
self._pgo = _SimplePGO(self.config)
self.raw_odom._transport.subscribe(self._on_raw_odom)
self.registered_scan._transport.subscribe(self._on_scan)
self._running = True
self._thread = threading.Thread(target=self._publish_loop, daemon=True)
self._thread.start()
print("[PGO] Python PGO module started (gtsam iSAM2)")
def start(self) -> None:
super().start()
self._pgo = _SimplePGO(self.config)
self.raw_odom._transport.subscribe(self._on_raw_odom)
self.registered_scan._transport.subscribe(self._on_scan)
self._running = True
self._thread = threading.Thread(target=self._publish_loop, daemon=True)
self._thread.start()
print("[PGO] Python PGO module started (gtsam iSAM2)")

Comment on lines +94 to +102
)

@rpc
def stop(self) -> None:
if (
self._ws_loop is not None
and not self._ws_loop.is_closed()
and self._stop_event is not None
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Race condition: stop() may miss the stop signal

_ws_loop is assigned inside _run_server(), which runs in the background thread (line 110). There is a window between server_thread.start() (line 91) and the first statement of _run_server() where stop() can be called. At that point self._ws_loop is None, so the stop event is never signalled and _serve() blocks forever on await self._stop_event.wait().

Additionally, stop() does not join _server_thread, so the port may still be bound briefly after stop() returns — this causes flaky failures in the test suite when a new server is created on the same port immediately after.

A simple mitigation is to wait for the loop to be ready before returning from start(), or use a threading event to signal readiness. At minimum, join the thread in stop():

def stop(self) -> None:
    if self._ws_loop is not None and not self._ws_loop.is_closed() and self._stop_event is not None:
        self._ws_loop.call_soon_threadsafe(self._stop_event.set)
    if self._server_thread is not None:
        self._server_thread.join(timeout=5.0)
    super().stop()

self._running = True
self._thread = threading.Thread(target=self._publish_loop, daemon=True)
self._thread.start()
print("[PGO] Python PGO module started (gtsam iSAM2)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 print() used instead of structured logger

This file uses print() throughout (lines 412, 463-465, 515-518, 792) instead of the established setup_logger() pattern used by every other module in the codebase. This bypasses log-level filtering, structured fields, and the console formatter configured by the rest of the system.

Replace all print(...) calls with logger.info(...) / logger.debug(...) after adding logger = setup_logger() at module level. This also applies to the _SimplePGO.search_for_loops method (line 792) which prints directly.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +234 to +249

def search_for_loops(self) -> None:
if len(self._key_poses) < 10:
return

# Rate limit
if self._history_pairs:
cur_time = self._key_poses[-1].timestamp
last_time = self._key_poses[self._history_pairs[-1][1]].timestamp
if cur_time - last_time < self._cfg.min_loop_detect_duration:
return

cur_idx = len(self._key_poses) - 1
cur_kp = self._key_poses[-1]

# Build KD-tree of previous keyframe positions
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 KDTree rebuilt from scratch on every new keyframe

search_for_loops() re-creates a KDTree over all previous keyframe positions on every call. This is called once per keyframe (from _on_scan), so the cost scales as O(n log n) per keyframe and O(n² log n) over the lifetime of a session. For a long indoor mapping run (thousands of keyframes), this will become a real-time bottleneck in the scan callback, which currently runs on the transport thread.

Consider either:

  • Maintaining an incremental list that is only rebuilt when smooth_and_update changes keyframe positions (loop closure), or
  • Using a simpler, O(n) radius search with a spatial hash for the common no-loop-closure path.

@jeff-hykin jeff-hykin changed the title Jeff/feat/nav1 Barebones Loop Closure Mar 22, 2026
jeff-hykin and others added 26 commits March 25, 2026 15:10
Add a generic file change detection utility (dimos/utils/change_detect.py)
that tracks content hashes via xxhash and integrate it into NativeModule so
it can automatically rebuild when watched source files change.

- change_detect.did_change() hashes file content, stores per-cache-name
  hash files in the venv, and returns True when files differ
- NativeModuleConfig gains rebuild_on_change: list[str] | None
- NativeModule._maybe_build() deletes stale executables when sources change
- 11 tests for change_detect, 3 integration tests for native rebuild
…avoid unlinking Nix store executables

- Add `cwd` parameter to `did_change()` and `_resolve_paths()` so relative
  glob patterns in `rebuild_on_change` are resolved against the module's
  working directory instead of the process cwd.
- Replace `exe.unlink()` with a `needs_rebuild` flag so executables that
  live in read-only locations (e.g. Nix store) are not deleted; instead
  the build command is re-run which handles the output path itself.
- VoxelMapper builds navigation map with column carving so stale
  obstacles are cleared when the robot rescans an area
- PGO's accumulated global_static_map kept for visualisation only
  (remapped to pgo_global_static_map)
- Add simulation blueprint entries and odom-adapter to all_blueprints
- Update e2e test to use LCMTransport subscribers instead of module
  internals (works with forked workers)
Was imported inline 3 times in hot paths (per-keyframe and per-odom).
Hoisted to module-level import.

Revert: git revert HEAD
assert is stripped in python -O mode. Use if/return for production
safety guards.

Revert: git revert HEAD
msg.ts=0.0 is a valid timestamp (epoch) but falsy. The old check
'if msg.ts' would incorrectly fall back to time.time().

Revert: git revert HEAD
Replace O(N) Python loop (set membership per point) with vectorized
np.isin. For 50k+ point clouds at 10Hz this was a bottleneck.

Revert: git revert HEAD
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant