Skip to content

fix(search): bound glob/grep tree-walk with cooperative + watchdog timeout#201

Merged
yishuiliunian merged 1 commit into
mainfrom
fix/glob-traversal-hang
Jun 25, 2026
Merged

fix(search): bound glob/grep tree-walk with cooperative + watchdog timeout#201
yishuiliunian merged 1 commit into
mainfrom
fix/glob-traversal-hang

Conversation

@yishuiliunian

Copy link
Copy Markdown
Contributor

Summary

  • A Glob over a path containing a network mount (rclone/OSS NFS) hung the agent loop forever — the single-threaded walker exhaustively traversed the whole tree for a zero-match pattern with no deadline, blocking the spawn_blocking thread the loop awaits.
  • Fixed with a two-layer timeout: an inner cooperative walk_timeout (partial results) + an outer per-tool watchdog hard floor, applied at a single convergence point both execution paths route through.
  • Plus: parallelized the glob walker, follow_links(false) (ripgrep parity), truncated/timeout decoupling, deterministic glob ordering, and shared timeout-notice const.

Changes

  • loopal-backend/search: glob.rs rewritten to a parallel ignore walker with a cooperative Instant deadline; grep.rs deadline + truncated derived from the match-cap (not the timeout flag); walker.rs follow_links(true)→false; mod.rs reverts to plain spawn_blocking (outer bound moved to the watchdog); limits.rs adds walk_timeout (30s).
  • loopal-runtime/agent_loop: new execute_tool_watchdogged convergence helper in tool_exec.rs; streaming_tool_exec.rs (early-start path) and execute_approved_tools both route through it; tool_watchdog.rs covers all 7 fs-read tools at 60s with StaleReason::WatchdogTimeout.
  • loopal-tool-api: timed_out field on Glob/GrepSearchResult; shared SEARCH_TIMEOUT_NOTICE.
  • tools/glob,grep: surface the timeout notice; glob output gains a path sort tiebreak.
  • tests: parallel correctness, symlink-not-followed (cfg(unix)), truncation tolerance, timeout, determinism, watchdog coverage.
  • design/glob-traversal-hang/: RCA + P1 plan + post-review consolidation notes.

Test plan

  • CI passes (bazel build //..., affected tests, clippy, rustfmt)

…meout

A Glob over a path containing a network mount (rclone/OSS NFS) hung the
agent loop forever: the single-threaded walker exhaustively traversed the
whole tree for a zero-match pattern with no deadline, blocking the
spawn_blocking thread that the agent loop awaits.

Two-layer fix:
- Inner cooperative deadline (ResourceLimits::walk_timeout, 30s) checked
  between walk entries; returns partial results + timed_out so the tool can
  tell the LLM to narrow `path`.
- Outer hard floor via the runtime per-tool watchdog, now extended from
  Bash-only to all fs-read tools (Glob/Grep/Ls/Read/ReadPdf/ReadImage/
  ReadHtml) with a typed StaleReason.

The watchdog is applied at a single convergence point
(execute_tool_watchdogged) that BOTH the streaming early-start path and the
normal approval path route through, so a tool can never be bounded on one
path and unbounded on the other.

Also: parallelize the glob walker (mirrors grep's build_parallel), default
follow_links to false (ripgrep parity, avoids cross-mount escape/cycles),
decouple `truncated` from the timeout signal (no spurious overflow files),
add a deterministic path tiebreak to glob output, dedup the timeout notice
into loopal-tool-api, and clamp search `max` to >=1.

Design notes under design/glob-traversal-hang/.
@yishuiliunian yishuiliunian merged commit 304aa55 into main Jun 25, 2026
4 checks passed
@yishuiliunian yishuiliunian deleted the fix/glob-traversal-hang branch June 25, 2026 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant