Skip to content

feat(security): add security demo#1619

Open
paul-nechifor wants to merge 2 commits intodevfrom
paul/feat/security-demo
Open

feat(security): add security demo#1619
paul-nechifor wants to merge 2 commits intodevfrom
paul/feat/security-demo

Conversation

@paul-nechifor
Copy link
Copy Markdown
Contributor

@paul-nechifor paul-nechifor commented Mar 20, 2026

Problem

  • Agents aren't very good at coordinating skill calls.

Closes DIM-XXX

Solution

  • Add an additional security module which integrates everything.

Breaking Changes

None

How to Test

VERY IMPORTANT: YOU MUST EXPLORE FIRST

2026-03-28_02-57
  • Move around the environment you want to patrol so that the global map fills in.
  • Then start uv run dimos --dtop --robot-ip <IP> run --disable spatial-memory unitree-go2-security
  • Run humancli and tell it start the security patrol. Just call start_security_patrol. Do not ask me anything.

Contributor License Agreement

  • I have read and approved the CLA.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR introduces a self-contained SecurityModule that implements an IDLE → PATROLLING → FOLLOWING state machine on the Unitree Go2 robot, driven by YOLO person detection and visual servoing. It also adds thread-safety improvements to the audio output pipeline (node_output.py, node_openai.py), a new SpeakSkillSpec protocol, a reusable explore_office E2E fixture, and an end-to-end test for the security flow.

Key changes:

  • security_module.py: New SecurityModule with a background patrol/detect/follow loop, integrated into the unitree_go2_spatial blueprint.
  • node_output.py: Added threading.Lock around sounddevice stream access to prevent concurrent stop/write race conditions.
  • node_openai.py: Queue is now flushed before joining the processing thread on dispose(), and the join timeout is raised to 30 s.
  • conftest.py: Patrol route extracted into a shared explore_office fixture; used by both the existing patrol test and the new security test.

Issues found:

  • TOCTOU race in start_security_patrol: the alive-check for _main_thread is done under _lock, but the new thread is created and assigned outside the lock, allowing two callers to simultaneously start independent patrol threads with only one tracked.
  • CPU busy-wait in _follow_step: when latest_image is None, the function returns immediately with no sleep, causing the main loop to spin at 100% CPU. _patrol_step correctly uses _stop_event.wait(timeout=0.01) in the equivalent branch.
  • Zombie thread after stop: _stop_security_patrol_internal joins the thread with a 5-second timeout and then unconditionally clears _main_thread, even if the thread is still alive (e.g., blocked in a TTS API call). A subsequent start_security_patrol would then launch a second thread, resulting in two concurrent loops publishing conflicting cmd_vel commands.
  • _state read without lock: self._state is read in _main_loop's match statement without holding _lock, while _transition_to writes it under the lock.
  • Return type mismatch in explore_office fixture: annotated as Callable[[list[tuple[float, float]]], None] but the returned function takes no arguments (Callable[[], None]).

Confidence Score: 2/5

  • Not safe to merge — the security module has concurrency bugs that can cause duplicate untracked robot-control threads and CPU spin loops.
  • The audio thread-safety fixes and fixture refactor are solid, but the core SecurityModule has three logic-level concurrency issues: a TOCTOU race that can spawn untracked patrol threads, a CPU busy-wait when no image is available during FOLLOWING, and a zombie-thread risk after stop's 5-second join timeout. Any of these could result in conflicting cmd_vel commands being sent to the robot simultaneously, which is a real hardware safety concern.
  • dimos/experimental/security_demo/security_module.py requires the most attention due to the thread-safety issues described above.

Important Files Changed

Filename Overview
dimos/experimental/security_demo/security_module.py New security patrol state machine module with TOCTOU race in start_security_patrol, CPU busy-wait in _follow_step, zombie thread risk after join timeout, and unprotected state reads in _main_loop.
dimos/e2e_tests/conftest.py Extracts patrol points into a reusable explore_office fixture; return type annotation is incorrectly typed as Callable[[list[tuple[float, float]]], None] instead of Callable[[], None].
dimos/stream/audio/node_output.py Adds threading.Lock around sounddevice stream access to prevent concurrent stop/write races; the early-exit check for _running was correctly simplified.
dimos/stream/audio/tts/node_openai.py Increases join timeout from 5s to 30s and pre-clears the queue on dispose to prevent new TTS synthesis from starting during shutdown; sensible fix.
dimos/robot/unitree/go2/blueprints/smart/unitree_go2_spatial.py Wires SecurityModule into the unitree_go2_spatial blueprint with GO2Connection camera info; straightforward integration change.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant SecurityModule
    participant PatrolRouter
    participant Planner as ReplanningAStarPlanner
    participant NavStack as Navigation (goal_request / cmd_vel)
    participant YOLO as Yolo2DDetector
    participant TTS as SpeakSkill

    Agent->>SecurityModule: start_security_patrol()
    SecurityModule->>Planner: set_replanning_enabled(False)
    SecurityModule->>SecurityModule: spawn _main_loop thread
    SecurityModule-->>Agent: "Security patrol started"

    loop PATROLLING state
        SecurityModule->>PatrolRouter: next_goal()
        PatrolRouter-->>SecurityModule: PoseStamped goal
        SecurityModule->>NavStack: goal_request.publish(goal)
        SecurityModule->>YOLO: process_image(latest_image)
        alt person detected
            SecurityModule->>NavStack: goal_request.publish(current_pose) [cancel]
            SecurityModule->>TTS: speak("Intruder detected")
            SecurityModule->>SecurityModule: transition → FOLLOWING
        else no detection
            SecurityModule->>SecurityModule: wait for goal_reached event
        end
    end

    loop FOLLOWING state
        SecurityModule->>YOLO: process_image(latest_image)
        alt person visible
            SecurityModule->>NavStack: cmd_vel.publish(twist)
        else person lost
            SecurityModule->>NavStack: cmd_vel.publish(Twist.zero())
            SecurityModule->>TTS: speak("Lost sight of intruder, resuming patrol")
            SecurityModule->>PatrolRouter: reset()
            SecurityModule->>SecurityModule: transition → PATROLLING
        end
    end

    Agent->>SecurityModule: stop_security_patrol()
    SecurityModule->>SecurityModule: _stop_event.set()
    SecurityModule->>Planner: set_replanning_enabled(True)
    SecurityModule->>NavStack: goal_request.publish(current_pose) [cancel]
    SecurityModule->>SecurityModule: join _main_loop thread (timeout=5s)
    SecurityModule->>SecurityModule: transition → IDLE
    SecurityModule-->>Agent: "Security patrol stopped"
Loading

Last reviewed commit: "feat(security): add ..."

Comment on lines +234 to +250
def _follow_step(self) -> None:
"""One iteration of the follow loop (YOLO detect + servo + publish)."""
with self._lock:
latest_image = self._latest_image

if latest_image is None:
return

best = self._find_best_person(latest_image)

if best is None:
self.cmd_vel.publish(Twist.zero())
self._speak_skill.speak("Lost sight of intruder, resuming patrol")
self._router.reset()
self._has_active_goal = False
self._transition_to("PATROLLING")
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 CPU busy-wait when latest_image is None during FOLLOWING

When latest_image is None, _follow_step returns immediately with no sleep or yield. The _main_loop will immediately call it again, spinning at 100% CPU until an image arrives. Compare this with _patrol_step, which correctly uses self._stop_event.wait(timeout=0.01) in the analogous branch. A similar small wait should be added here:

        if latest_image is None:
            self._stop_event.wait(timeout=0.01)
            return

@paul-nechifor paul-nechifor force-pushed the paul/feat/security-demo branch 4 times, most recently from 85be8b4 to d9faea6 Compare March 26, 2026 00:43
return getattr(python_module, attr_name)() # type: ignore[no-any-return]
module_path, class_name = all_modules[name].rsplit(".", 1)
python_module = __import__(module_path, fromlist=[class_name])
return getattr(python_module, class_name).blueprint() # type: ignore[no-any-return]
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for breaking this out into that other PR. It does make it easier to understand.


if self.processing_thread and self.processing_thread.is_alive():
self.processing_thread.join(timeout=5.0)
self.processing_thread.join(timeout=30.0)
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why so big? this makes me worry about other timeouts

def _play_audio_event(self, audio_event) -> None: # type: ignore[no-untyped-def]
"""Play audio from an AudioEvent."""
if not self._running or not self._stream:
if not self._running:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does self._stream need a lock but self._running doesn't?

@paul-nechifor paul-nechifor force-pushed the paul/feat/security-demo branch from aa28910 to a3a2469 Compare March 27, 2026 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants