fix(memory): enforce wall-clock deadline on compact LLM path (#131) by tcconnally · Pull Request #162 · tcconnally/perseus

tcconnally · 2026-06-03T18:30:22Z

Summary

Closes #131 — fourth PR in v1.0.6 milestone. Second-most complex remaining item (concurrency + deadline + deterministic fallback).

Pre-1.0.6, perseus memory compact with an LLM provider configured could hang for hours. The root cause: _mneme_compact_llm() → run_llm() only enforced llm.timeout_s (default 30s) on the HTTP request itself. With streaming-token providers like Ollama serving large models, each token arrives within the timeout, so the timeout never fires while total wall time stays unbounded.

Fix

Wall-clock deadline + deterministic fallback

_memory_do_compact() now submits the LLM call to a ThreadPoolExecutor and waits on it with future.result(timeout=…). On timeout, the LLM future is abandoned and _deterministic_narrative() produces the narrative instead, so operators always get a usable result, plus a clear stderr signal:

> ⚠ Mnēmē compact: LLM provider 'ollama' exceeded
compact_total_timeout_s=180s; falling back to deterministic narrative.
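The deadline wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/perseus/agora.py: compact_with_deadline, deterministic_narrative, and slow_llm are hypothetical stand-ins, and the real _memory_do_compact() signature is not shown in this PR description.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def deterministic_narrative(entries):
    """Hypothetical stand-in for _deterministic_narrative()."""
    return f"{len(entries)} entries: " + "; ".join(entries)


def compact_with_deadline(llm_call, entries, total_timeout_s, provider="ollama"):
    """Return llm_call(entries), or a deterministic narrative on deadline."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(llm_call, entries)
    try:
        # Bounds the total wait, independent of any per-request timeout.
        return future.result(timeout=total_timeout_s)
    except FutureTimeout:
        print(
            f"⚠ Mnēmē compact: LLM provider '{provider}' exceeded "
            f"compact_total_timeout_s={total_timeout_s}s; "
            "falling back to deterministic narrative.",
            file=sys.stderr,
        )
        return deterministic_narrative(entries)
    finally:
        # Return without joining the worker (cancel_futures needs Python 3.9+).
        executor.shutdown(wait=False, cancel_futures=True)


def slow_llm(entries):
    """Mock provider that takes longer than the deadline."""
    time.sleep(2.0)
    return "LLM narrative"


if __name__ == "__main__":
    start = time.monotonic()
    body = compact_with_deadline(slow_llm, ["a", "b"], total_timeout_s=0.5)
    print(body, f"(returned after {time.monotonic() - start:.2f}s)")
```

The caller gets its result at the deadline, but the abandoned worker keeps running in the background until slow_llm returns on its own.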

Same fallback for any LLM exception

If the LLM call raises (provider unreachable, payload error, etc.), memory compact no longer propagates the failure. It builds the deterministic narrative and prints a stderr message, so operators always get a usable narrative.
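A rough sketch of the exception path, under the same caveats: narrative_or_fallback and deterministic_narrative are hypothetical names, and the exact stderr wording used by the PR for this case is not quoted here, so the message below is an assumption.

```python
import sys


def deterministic_narrative(entries):
    """Hypothetical stand-in for _deterministic_narrative()."""
    return f"{len(entries)} entries: " + "; ".join(entries)


def narrative_or_fallback(llm_call, entries, provider="ollama"):
    """Return the LLM narrative, or the deterministic one if the call raises."""
    try:
        return llm_call(entries)
    except Exception as exc:  # broad on purpose: any provider failure falls back
        # Assumed message format; the real wording may differ.
        print(
            f"Mnēmē compact: LLM provider '{provider}' failed ({exc!r}); "
            "falling back to deterministic narrative.",
            file=sys.stderr,
        )
        return deterministic_narrative(entries)


def unreachable_llm(entries):
    """Mock provider that fails the way an unreachable host might."""
    raise ConnectionError("provider unreachable")


if __name__ == "__main__":
    print(narrative_or_fallback(unreachable_llm, ["a", "b"]))
```

Catching Exception rather than BaseException leaves KeyboardInterrupt and SystemExit alone, so an operator can still interrupt the command.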

New config knob

memory:
  compact_total_timeout_s: 180   # 0 = pre-1.0.6 unbounded behavior
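One plausible way to turn the knob into a value usable with future.result(), sketched under assumptions: resolve_total_timeout is a hypothetical helper, and mapping 0 to None (which makes Future.result() wait indefinitely) is a guess at how the "0 = unbounded" escape hatch could be implemented.

```python
# Mirrors the default stated in this PR; the rest of DEFAULT_CONFIG is omitted.
DEFAULT_CONFIG = {"memory": {"compact_total_timeout_s": 180}}


def resolve_total_timeout(config):
    """Return the deadline in seconds, or None when the knob is 0 (unbounded)."""
    default = DEFAULT_CONFIG["memory"]["compact_total_timeout_s"]
    value = config.get("memory", {}).get("compact_total_timeout_s", default)
    # Future.result(timeout=None) blocks with no limit (pre-1.0.6 behavior).
    return None if value == 0 else float(value)


if __name__ == "__main__":
    print(resolve_total_timeout({}))
    print(resolve_total_timeout({"memory": {"compact_total_timeout_s": 0}}))
```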

Observability

New audit event memory_compact_timeout with provider, total_timeout_s, workspace_hash fields.
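For illustration, the event payload might look like the sketch below. Only the event name and the three field names come from this PR; build_compact_timeout_event, the "event" key, and deriving workspace_hash from a truncated SHA-256 of the workspace path are assumptions.

```python
import hashlib
import json


def build_compact_timeout_event(provider, total_timeout_s, workspace_path):
    """Build a memory_compact_timeout payload with the fields named in the PR."""
    # Assumed derivation: a short digest so the raw path is not logged.
    digest = hashlib.sha256(workspace_path.encode("utf-8")).hexdigest()
    return {
        "event": "memory_compact_timeout",
        "provider": provider,
        "total_timeout_s": total_timeout_s,
        "workspace_hash": digest[:12],
    }


if __name__ == "__main__":
    event = build_compact_timeout_event("ollama", 180, "/home/user/project")
    print(json.dumps(event, sort_keys=True))
```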

Limitation (documented)

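Python's ThreadPoolExecutor cannot kill a running thread. The in-flight HTTP request continues until urllib's per-request timeout fires. Worst-case wait is therefore compact_total_timeout_s + llm.timeout_s (180s + 30s = 210s with the defaults). The executor is shut down with wait=False and cancel_futures=True, so the compact call returns without waiting for the leaked thread. Note that these flags do not daemonize the worker: with the standard concurrent.futures executor, interpreter exit can still wait on the leaked thread until the request finishes.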

A true interruption would require switching from urllib.request to requests with a (connect, read) tuple timeout, OR running the LLM call in a child process. Both are larger refactors; deferred to v1.1+.
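The child-process option could look roughly like this. It is a sketch of the deferred idea, not planned code: run_with_hard_deadline is a hypothetical name, and passing the work as a Python snippet to a fresh interpreter is only one way to isolate the call. It relies on the documented behavior of subprocess.run(), which kills the child when the timeout expires.

```python
import subprocess
import sys
from typing import Optional


def run_with_hard_deadline(python_code: str, deadline_s: float) -> Optional[str]:
    """Run python_code in a child interpreter; return stdout, or None on deadline."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", python_code],
            capture_output=True,
            text=True,
            timeout=deadline_s,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # Unlike a thread, the child is gone at this point; nothing leaks.
        return None
    return completed.stdout


if __name__ == "__main__":
    print(run_with_hard_deadline("import time; time.sleep(30)", 0.5))
    print(run_with_hard_deadline("print('LLM narrative')", 30))
```

The trade-off is interpreter start-up cost and the need to serialize inputs and outputs across the process boundary, which is part of why this is a larger refactor.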

Files Changed (5)

src/perseus/agora.py — _memory_do_compact() deadline wrapper + fallback
src/perseus/config.py — memory.compact_total_timeout_s: 180 in DEFAULT_CONFIG
tests/test_memory.py — 4 regression tests
CHANGELOG.md — v1.0.6 entry
perseus.py — rebuilt artifact

Tests

4 new regression tests in tests/test_memory.py:

test_memory_compact_total_timeout_falls_back_to_deterministic — slow LLM mock (2.0s) exceeds 0.5s deadline; assert <1.5s return + deterministic body + stderr
test_memory_compact_succeeds_within_total_timeout — fast LLM mock under deadline; assert LLM body used
test_memory_compact_llm_exception_falls_back_to_deterministic — exception path; assert no propagation + deterministic body + stderr
test_memory_compact_default_timeout_is_180s — config sanity

Test results

All 4 new regression tests pass
All 47 tests in test_memory.py + test_mneme.py pass

Migration Notes

The new default compact_total_timeout_s: 180 is strictly safer than pre-1.0.6 behavior. Users who want the old (unbounded) behavior can set it to 0 — but this is not recommended.

Fourth of 12 PRs in the v1.0.6 milestone. Suggested next: #139 (MCP _call_tool subprocess leak — third-most complex remaining).

Pre-1.0.6, `perseus memory compact` with an LLM provider configured could hang for hours. The root cause: _mneme_compact_llm() → run_llm() only enforced llm.timeout_s (default 30s) on the HTTP request itself. With streaming-token providers like Ollama serving large models, individual tokens arrive within timeout but total wall time was unbounded.

This patch adds a true wall-clock deadline at the _memory_do_compact() level, with deterministic fallback so operators always get a usable narrative.

src/perseus/agora.py:
- _memory_do_compact() now submits the LLM call to a ThreadPoolExecutor and waits with future.result(timeout=total_timeout).
- New knob: memory.compact_total_timeout_s (default 180s).
- On timeout: stderr message + audit_event('memory_compact_timeout', ...) + fall back to _deterministic_narrative.
- On generic LLM exception (provider unreachable, payload error): same deterministic fallback path. memory compact never propagates LLM failures up to the operator.
- Executor is shutdown(wait=False, cancel_futures=True) so the call returns immediately on timeout. Note that these flags alone do not make the worker a daemon thread, so interpreter exit may still wait for the in-flight request.

src/perseus/config.py:
- Add memory.compact_total_timeout_s: 180 to DEFAULT_CONFIG with explanatory comment about pre-1.0.6 behavior and the 0=disabled escape hatch.

Limitation (documented in code + CHANGELOG): Python's ThreadPoolExecutor cannot truly kill a running thread. The in-flight HTTP request continues until urllib's per-request timeout fires. Worst-case wait is therefore compact_total_timeout_s + llm.timeout_s.

Tests (tests/test_memory.py):
- test_memory_compact_total_timeout_falls_back_to_deterministic: slow LLM mock exceeds 0.5s deadline; assert <1.5s return + deterministic body + stderr message
- test_memory_compact_succeeds_within_total_timeout: fast LLM mock under deadline; assert LLM body present
- test_memory_compact_llm_exception_falls_back_to_deterministic: exception in LLM path; assert no propagation + deterministic body + stderr message
- test_memory_compact_default_timeout_is_180s: config default

All 4 new regression tests pass. All 47 tests in test_memory.py and test_mneme.py pass.

Closes #131
Refs milestone v1.0.6

tcconnally added this to the v1.0.6 milestone Jun 3, 2026

tcconnally added the bug, mneme, and P1-high labels Jun 3, 2026

tcconnally wants to merge 1 commit into main from fix/131-memory-compact-total-timeout
