Skip to content

feat(native): add rebuild-on-change for NativeModule#1599

Open
jeff-hykin wants to merge 13 commits intodevfrom
jeff/feat/native_rebuild
Open

feat(native): add rebuild-on-change for NativeModule#1599
jeff-hykin wants to merge 13 commits intodevfrom
jeff/feat/native_rebuild

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented Mar 17, 2026

[Jeff's Claw Bot] - do not review yet

Summary

Add a generic file change detection utility (dimos/utils/change_detect.py) and integrate it into NativeModule so native executables automatically rebuild when watched source files change.

Problem

NativeModules wrap C/C++ executables that need recompilation when source files change. Previously there was no mechanism to detect source changes — users had to manually clean and rebuild, or stale binaries would silently run outdated code.

Solution

  • dimos/utils/change_detect.py: Content-hash-based file change detection using xxhash. Stores per-cache-name hash files in the venv. did_change() returns True when file content differs from last check. Supports glob patterns and thread-safe locking (threading.Lock + fcntl.flock).
  • NativeModuleConfig.rebuild_on_change: Optional list[str] of file paths/globs to watch. When set, NativeModule._maybe_build() checks for changes and deletes stale executables before rebuilding.
  • Cache key includes class qualname to prevent collisions when multiple NativeModule subclasses share a source file.
  • Thread leak fix: Reordered stop() to call super().stop() (joins asyncio loop thread) before setting _process = None — prevents a race where the thread detector sees leaked threads on CI.
  • LCM test isolation: Tests in test_pattern_sub.py and test_lcmpubsub.py now use dedicated multicast addresses to prevent cross-test contamination via shared LCM multicast.
  • mypy fix: Added import-not-found to onnxruntime type-ignore comments.

Breaking Changes

None

How to Test

# Change detection tests
pytest dimos/utils/test_change_detect.py -v

# Native module rebuild tests
pytest dimos/core/test_native_rebuild.py -v

# Native module crash/thread leak test
pytest dimos/core/test_native_module.py::test_process_crash_triggers_stop -v

# LCM isolation tests
pytest dimos/protocol/pubsub/test_pattern_sub.py -v
pytest dimos/protocol/pubsub/impl/test_lcmpubsub.py -v

# Full fast suite
pytest -m 'not (tool or slow or mujoco)' dimos/

Contributor License Agreement

  • I have read and approved the CLA.

Add a generic file change detection utility (dimos/utils/change_detect.py)
that tracks content hashes via xxhash and integrate it into NativeModule so
it can automatically rebuild when watched source files change.

- change_detect.did_change() hashes file content, stores per-cache-name
  hash files in the venv, and returns True when files differ
- NativeModuleConfig gains rebuild_on_change: list[str] | None
- NativeModule._maybe_build() deletes stale executables when sources change
- 11 tests for change_detect, 3 integration tests for native rebuild
…avoid unlinking Nix store executables

- Add `cwd` parameter to `did_change()` and `_resolve_paths()` so relative
  glob patterns in `rebuild_on_change` are resolved against the module's
  working directory instead of the process cwd.
- Replace `exe.unlink()` with a `needs_rebuild` flag so executables that
  live in read-only locations (e.g. Nix store) are not deleted; instead
  the build command is re-run which handles the output path itself.
@jeff-hykin jeff-hykin marked this pull request as draft March 17, 2026 22:56
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR adds a content-hash-based file change detection utility (dimos/utils/change_detect.py) and integrates it into NativeModule so native executables automatically rebuild when watched source files change. It also fixes a thread-leak ordering bug in NativeModule.stop(), improves LCM test isolation with dedicated multicast addresses, and resolves the missing xxhash dependency.

Key changes and findings:

  • xxhash dependency added to pyproject.toml — resolves the previously-flagged missing dependency.
  • Thread-leak fixsuper().stop() (which joins the asyncio loop thread) is now correctly called before _process = None, preventing a race condition in CI.
  • LCM test isolationtest_lcmpubsub.py and test_pattern_sub.py now use dedicated multicast groups (239.255.76.98:7698 and 239.255.76.99:7699) to prevent cross-test contamination.
  • mesh_utils.py regressiondid_change is called without cwd using str(urdf_path). If urdf_path is relative (the function signature accepts Path | str with no absolute constraint), _resolve_paths raises ValueError. Previously the mtime-based approach handled relative paths transparently via the process CWD.
  • Two issues flagged in prior review rounds remain open: the pre-build did_change call writes the new hash before confirming build success (stale-build-on-failure scenario), and the post-build seeding call does not forward cwd for relative path sets.

Confidence Score: 3/5

  • Core feature is functional on the happy path but has known reliability gaps: a build failure leaves the cache in a state that permanently suppresses future rebuild attempts, and a new regression in mesh_utils.py breaks relative URDF paths.
  • Several improvements from prior rounds are addressed (xxhash dependency, thread-leak, LCM isolation), but the two central reliability bugs in the rebuild-on-change feature (cache written before build success, missing cwd in post-build seed call) remain unaddressed from earlier review rounds. The new relative-path regression in mesh_utils.py adds one more concrete fix required before merge.
  • dimos/core/native_module.py (pre-build hash write / missing cwd in post-build seed) and dimos/manipulation/planning/utils/mesh_utils.py (relative urdf_path raises ValueError)

Important Files Changed

Filename Overview
dimos/utils/change_detect.py New content-hash change detection utility using xxhash. Core logic is sound; thread + file locking implemented correctly. The known design trade-off (cache always written on call, not just on success) is the root cause of the previously-flagged build-failure regression, but is otherwise well-documented.
dimos/core/native_module.py Integrates rebuild-on-change into _maybe_build. Thread-leak fix (super().stop() before _process=None) is correct. Two previously-flagged issues remain open: the pre-build did_change call writes the hash before confirming build success, and the post-build seeding call (line 314) does not pass cwd, both already tracked in earlier review rounds.
dimos/manipulation/planning/utils/mesh_utils.py Switches URDF cache invalidation from mtime-in-cache-key to did_change. Introduces a behavioral regression: relative urdf_path values now raise ValueError because did_change is called without cwd, whereas the previous mtime approach worked with relative paths.
dimos/core/test_native_rebuild.py New test file covering happy-path rebuild-on-change scenarios. Uses autouse tmp_path cache redirect for proper isolation. All tests use try/finally to call mod.stop().
dimos/utils/test_change_detect.py Good unit coverage for change_detect.py. One test (test_nonexistent_path_warns) captures caplog but never asserts its contents, previously flagged in an earlier round.
dimos/protocol/pubsub/impl/test_lcmpubsub.py LCM test isolation fix using a dedicated multicast address — clean and correct improvement.
dimos/protocol/pubsub/test_pattern_sub.py Same LCM isolation fix as test_lcmpubsub.py, using a separate isolated multicast group.
pyproject.toml Adds xxhash>=3.0.0 to project dependencies, resolving the missing dependency flagged in a prior review round.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NativeModule._maybe_build] --> B{rebuild_on_change\nset AND exe exists?}
    B -- No --> C{exe exists?}
    B -- Yes --> D["did_change(cache_name, paths, cwd)\n⚠ writes hash to cache immediately"]
    D -- False\nno change --> C
    D -- True\nfiles changed --> E[needs_rebuild = True\nlog 'Source files changed']
    C -- Yes and not needs_rebuild --> F[return early\nno build]
    C -- No OR needs_rebuild --> G{build_command set?}
    G -- No --> H[raise FileNotFoundError]
    G -- Yes --> I[subprocess.Popen build_command]
    I --> J{returncode == 0?}
    J -- No --> K[raise RuntimeError\n⚠ cache already updated\nfuture rebuilds blocked]
    J -- Yes --> L{exe exists\nafter build?}
    L -- No --> M[raise FileNotFoundError]
    L -- Yes --> N["did_change(cache_name, paths, cwd)\nseed cache post-build\n⚠ cwd arg present here"]
    N --> O[build complete]
Loading

Reviews (2): Last reviewed commit: "fix(test): isolate LCM multicast in flak..." | Re-trigger Greptile

Comment on lines +254 to +258
if self.config.rebuild_on_change and exe.exists():
cache_name = f"native_{type(self).__name__}_build"
if did_change(cache_name, self.config.rebuild_on_change, cwd=self.config.cwd):
logger.info("Source files changed, triggering rebuild", executable=str(exe))
needs_rebuild = True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Cache written before build completes — failed rebuild silently blocks all future rebuilds

did_change() always writes the new hash to the cache file as a side effect before returning. So when it returns True here (change detected, rebuild needed), the cache is already updated to the new hash. If the build subsequently fails (e.g. RuntimeError is raised at line 288), the exception propagates up and the cache is left pointing at the new file state.

On the very next call to _maybe_build():

  1. exe.exists()True (old binary is still there)
  2. did_change(...)False (cache already reflects the new file state)
  3. needs_rebuild = False → the method returns early

The module silently continues running the stale executable and will never retry the build, even if you fix the build error and call start() again.

The root fix is to not update the cache during the "check" phase, and only commit the hash after a successful build. One approach is to split did_change into a read-only has_changed check and a separate update_cache write, calling them independently:

# Check-only (read the cache, do not write it)
if self.config.rebuild_on_change and exe.exists():
    cache_name = f"native_{type(self).__name__}_build"
    if has_changed(cache_name, self.config.rebuild_on_change, cwd=self.config.cwd):
        logger.info("Source files changed, triggering rebuild", executable=str(exe))
        needs_rebuild = True

and after a successful build:

if self.config.rebuild_on_change:
    update_cache(cache_name, self.config.rebuild_on_change, cwd=self.config.cwd)

Comment on lines +296 to +299
# Update the change cache so next check is clean
if self.config.rebuild_on_change:
cache_name = f"native_{type(self).__name__}_build"
did_change(cache_name, self.config.rebuild_on_change)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Missing cwd argument causes an infinite rebuild loop with relative paths

The post-build did_change() call at line 299 does not forward cwd=self.config.cwd. Compare with the earlier call at line 256 which correctly passes cwd:

# line 256 — correct
did_change(cache_name, self.config.rebuild_on_change, cwd=self.config.cwd)

# line 299 — missing cwd
did_change(cache_name, self.config.rebuild_on_change)   # ← bug

When rebuild_on_change contains relative paths (e.g. ["src"]) and cwd is set, _resolve_paths at line 299 expands them relative to the process's current working directory, not self.config.cwd. If those relative paths don't match anything from the process CWD, _resolve_paths returns [], _hash_files([]) returns the hash of empty input (call it H_empty), and H_empty is written to the cache.

On the next _maybe_build() invocation, the first call (line 256, with cwd) resolves the correct files and computes hash H1. Since H1 != H_empty, did_change returns True → rebuild is triggered → the cycle repeats indefinitely.

Fix: pass cwd consistently:

Suggested change
# Update the change cache so next check is clean
if self.config.rebuild_on_change:
cache_name = f"native_{type(self).__name__}_build"
did_change(cache_name, self.config.rebuild_on_change)
# Update the change cache so next check is clean
if self.config.rebuild_on_change:
cache_name = f"native_{type(self).__name__}_build"
did_change(cache_name, self.config.rebuild_on_change, cwd=self.config.cwd)

Note also that this second call is redundant as long as the first did_change() call already wrote the updated hash to the cache (see the other comment about separating the check from the write). Once the did_change API is refactored into read/write phases, this entire post-build block should be replaced with a dedicated update_cache call that passes cwd.

import os
from pathlib import Path

import xxhash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 xxhash not declared as a project dependency

xxhash is imported here but is absent from the [project.dependencies] list in pyproject.toml. Users who install dimos from PyPI (or any environment that doesn't happen to have xxhash transitively) will get an ImportError at import time of dimos.utils.change_detect, which now also propagates to dimos.core.native_module since it imports did_change at module level.

Please add xxhash to pyproject.toml:

# pyproject.toml — [project.dependencies]
"xxhash>=3.0",

Comment on lines +119 to +126
previous_hash = cache_file.read_text().strip()
changed = current_hash != previous_hash

# Always update the cache with the current hash
cache_dir.mkdir(parents=True, exist_ok=True)
cache_file.write_text(current_hash)

return changed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 did_change conflates check and update — makes caller-side error handling impossible

The function always writes the new hash to the cache file (line 126) regardless of what the caller does next. This design is the root cause of the build-failure bug in native_module.py: there is no way to "peek" at whether files changed without also committing the new state.

Consider splitting the public surface into two functions:

def has_changed(cache_name: str, paths: ..., cwd: ...) -> bool:
    """Read-only check; does NOT update the cache."""
    ...

def update_cache(cache_name: str, paths: ..., cwd: ...) -> None:
    """Write the current hash to the cache."""
    ...

did_change can then be kept as a convenience wrapper that calls both (preserving backwards-compatibility for callers that want the old atomic behaviour). NativeModule._maybe_build would call has_changed for the pre-build check, and update_cache only after a confirmed successful build.

Comment on lines +107 to +114
assert did_change("empty_test", []) is False


def test_nonexistent_path_warns(caplog: pytest.LogCaptureFixture) -> None:
"""A non-existent path logs a warning and doesn't crash."""
result = did_change("missing_test", ["/nonexistent/path/to/file.c"])
# First call with no resolvable files still returns True (no cache)
assert isinstance(result, bool)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Weak assertion in test_nonexistent_path_warns

The test only asserts isinstance(result, bool), which is trivially true for any bool return value. Based on the documented contract ("returns True if no previous cache exists"), the first call for a non-existent path should return True. The assertion should be:

Suggested change
assert did_change("empty_test", []) is False
def test_nonexistent_path_warns(caplog: pytest.LogCaptureFixture) -> None:
"""A non-existent path logs a warning and doesn't crash."""
result = did_change("missing_test", ["/nonexistent/path/to/file.c"])
# First call with no resolvable files still returns True (no cache)
assert isinstance(result, bool)
assert result is True

The existing assertion does not actually verify the warning was logged either (the test name says "warns" but caplog is not asserted against).

@jeff-hykin jeff-hykin force-pushed the jeff/feat/native_rebuild branch from bfab461 to e01688c Compare March 19, 2026 21:09
@jeff-hykin jeff-hykin force-pushed the jeff/feat/native_rebuild branch 2 times, most recently from fa3defd to cb5041b Compare March 20, 2026 00:40
fcntl.flock is per-file-description, not per-thread. Two threads in the
same process can both hold LOCK_EX simultaneously. Add a per-cache-name
threading.Lock to protect intra-process concurrent access.

Revert: git revert HEAD
Two NativeModule subclasses in the same file would share a cache key,
corrupting each other's rebuild state. Add qualname to disambiguate.

Revert: git revert HEAD
Move super().stop() before self._process = None so the asyncio loop
thread is joined before tests see the exit signal. Wrap crash test
in try/finally with mod.stop().
Use dedicated multicast addresses to prevent cross-test contamination
in test_pattern_sub.py and test_lcmpubsub.py.
@jeff-hykin jeff-hykin changed the title feat(native): add rebuild-on-change for NativeModule [Jeff's Claw Bot] feat(native): add rebuild-on-change for NativeModule Mar 23, 2026
@jeff-hykin jeff-hykin marked this pull request as ready for review March 23, 2026 22:38
@jeff-hykin jeff-hykin changed the title [Jeff's Claw Bot] feat(native): add rebuild-on-change for NativeModule feat(native): add rebuild-on-change for NativeModule Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant