Skip to content

fix: bound install-scan timeout + sweep/release correlation slots#16

Merged
olivrg merged 1 commit into
mainfrom
fix/governance-timeout-registry-eviction
Jun 18, 2026
Merged

fix: bound install-scan timeout + sweep/release correlation slots#16
olivrg merged 1 commit into
mainfrom
fix/governance-timeout-registry-eviction

Conversation

@olivrg

@olivrg olivrg commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Two hardening fixes from the pre-0.1.0 self-review, both on the governance hot path. TDD throughout.

#1install-scan had no timeout

/install-scan is a gating call like /evaluate, but it was unbounded while evaluate had an AbortController + evaluateTimeoutMs. A hung proxy would stall the install hook indefinitely — fail-closed (never proceeds) but not prompt, contradicting the spirit of the bounded-/evaluate non-negotiable.

  • Factored a bounded() helper that runs a gating call under an AbortController + evaluateTimeoutMs (covers fetch and body read), and applied it to both evaluate and installScan.

#2 — correlation slot leak + false ambiguity-block

  • No periodic sweep: eviction was lazy per-key, so a per-call-unique toolCallId key whose after_tool_call never arrives was never reclaimed → slow unbounded memory growth. Added a sweep (at most once per TTL) on reserve() — the only growth path — so memory stays bounded without a background timer. Exposed a size getter (telemetry + leak guard).
  • No release on terminal approval: when an approval resolves to deny/timeout/cancelled the tool doesn't run, so no after_tool_call claims the slot. Beyond the leak, on a runId/no-ID key this stale slot would wrongly ambiguity-block the next legitimate call until TTL. Now released in onResolution for those decisions; allow-once/allow-always keep the slot (the tool runs and after_tool_call claims it) — both directions covered by tests.

Verification

pnpm verify green: 196 tests (was 185; +11), 0 type errors, build ✓. New tests: install-scan timeout fail-closed; registry sweep evicts never-claimed unique-key slots; approval deny/timeout/cancelled release the slot while allow-once/allow-always keep it.

Self-review also raised minor nits (AdapterConfig readonly, baseUrl scheme, mapSession fallback docs) — deferred; not in this PR.

Self-review hardening of the governance hot path before 0.1.0.

#1 install-scan timeout: /install-scan is a gating call like /evaluate but was
unbounded, so a hung proxy would stall the install hook indefinitely (fail-closed
but never prompt). Factor a `bounded()` helper that runs a gating call under an
AbortController + evaluateTimeoutMs, and apply it to both evaluate and installScan.

#2 correlation slot leak / false ambiguity-block:
- Add a periodic sweep (at most once per TTL) on reserve(): lazy per-key eviction
  never reclaims a per-call-unique toolCallId key whose after_tool_call never
  arrives, so such slots leaked. Sweeping on the growth path bounds the map
  without a background timer. Expose a `size` getter (telemetry + leak guard).
- Release the reservation when an approval resolves to deny/timeout/cancelled:
  the tool won't run, so no after_tool_call claims the slot — freeing it prevents
  the leak and stops a stale slot from wrongly ambiguity-blocking the next call
  on a runId/no-ID key.
@olivrg olivrg merged commit 858af19 into main Jun 18, 2026
1 check passed
@olivrg olivrg deleted the fix/governance-timeout-registry-eviction branch June 18, 2026 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant