Service crashes with uncaught IIIInvocationError TIMEOUT on state::set (v0.9.3) #204

@bunke

Summary

The agentmemory service (@agentmemory/agentmemory@0.9.3) crashes intermittently with an uncaught IIIInvocationError: TIMEOUT: invocation timed out after 30000ms on function_id: 'state::set'. The rejection escapes through iii-sdk/dist/index.mjs:405 and terminates the Node process. Under sustained write load (passive observation capture from Claude Code (CC) hooks across multiple projects) we observed 5–15 crashes/hour. systemd auto-restart recovers in ~10s, but any in-flight BM25/vector index updates still waiting on the 5s IndexPersistence debounce are lost across the crash boundary, which breaks memory_smart_search recall for very recent saves.

Environment

  • Node.js v20.20.2
  • Linux 6.8.0-106-generic, x86_64
  • @agentmemory/agentmemory v0.9.3 installed via npm install -g
  • iii engine v0.11.0 native binary
  • Embedding provider: openai
  • AGENTMEMORY_AUTO_COMPRESS=true, CONSOLIDATION_ENABLED=true, GRAPH_EXTRACTION_ENABLED=true
  • Load profile: 5 Claude Code agents + ~1.7K observations/day (~75/hour avg) via plugin hooks

Reproduction

  1. Run service with the env vars above
  2. Drive sustained write load: ~1 observation/sec via POST /agentmemory/observe (or via the plugin hooks under active CC sessions)
  3. Within 1–10 minutes, the process exits with the trace below
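
For step 2, a minimal load-driver sketch, in case it helps with reproducing. Only the POST /agentmemory/observe path comes from our setup; driveLoad, baseUrl and makeBody are hypothetical names, and the request body is left to the caller because we are not asserting the exact payload schema here.

```javascript
// Hypothetical load driver: sends roughly one observation per `intervalMs`.
// `baseUrl` and `makeBody` are caller-supplied assumptions, not documented values.
async function driveLoad({ baseUrl, makeBody, count = 600, intervalMs = 1000, fetchImpl = fetch }) {
  const results = { ok: 0, failed: 0 };
  for (let i = 0; i < count; i++) {
    try {
      const res = await fetchImpl(`${baseUrl}/agentmemory/observe`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(makeBody(i)),
      });
      if (res.ok) results.ok++;
      else results.failed++;
    } catch {
      // A refused connection while the service restarts is counted as a failure.
      results.failed++;
    }
    // Pace the writes so the load stays near the target rate.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return results;
}
```

With the defaults this runs for about 10 minutes, which matches the window in step 3. A rising failed count alongside restarts in the journal is the signal to look for.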

Crash log

[agentmemory] Ready. Triple-stream (BM25+Vector+Graph) search active.
... (40–90 s of normal operation, observations captured/compressed) ...
file:///usr/lib/node_modules/@agentmemory/agentmemory/node_modules/iii-sdk/dist/index.mjs:405
                                                reject(new IIIInvocationError({
                                                       ^

IIIInvocationError: TIMEOUT: invocation timed out after 30000ms
    at Timeout._onTimeout (file:///.../iii-sdk/dist/index.mjs:405:14)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7) {
  code: 'TIMEOUT',
  function_id: 'state::set',
  stacktrace: undefined
}

Node.js v20.20.2
agentmemory.service: Main process exited, code=exited, status=1/FAILURE

The same crash repeats with the restart counter climbing (we see >40 starts/24h on a busy day).

What we expect

The state::set invocation timing out shouldn't crash the whole process. Either:

  • Catch and log the rejection (degrade gracefully — drop or queue the write), or
  • Surface a configurable longer timeout / retry policy for KV writes, or
  • Add process.on('unhandledRejection', …) at the entrypoint as a hard floor.
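
To make the first two options concrete, here is a rough sketch of a catch-and-retry wrapper. safeWrite and its options are hypothetical names, not part of agentmemory or iii-sdk; the only detail taken from the crash log is that the rejection carries code: 'TIMEOUT'. It assumes the write is an overwrite, so repeating it after a timeout is harmless.

```javascript
// Hypothetical helper, not an existing agentmemory or iii-sdk API.
// `write` is any function that returns a promise for a single KV write.
async function safeWrite(write, { retries = 2, baseDelayMs = 500, label = 'state::set' } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await write();
      return true; // persisted
    } catch (err) {
      const code = err && err.code;
      console.warn(
        `[agentmemory] ${label} failed (attempt ${attempt + 1}/${retries + 1}): ${code ?? err}`
      );
      // Only TIMEOUT is treated as transient; other errors, or the last attempt, drop the write.
      if (code !== 'TIMEOUT' || attempt === retries) return false;
      // Exponential backoff: baseDelayMs, then 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  return false;
}
```

A caller would pass the existing write as a closure, e.g. safeWrite(() => kv.set(KV.memories, …)), and decide what to do when it returns false (drop, or re-queue for the next debounce flush).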

Side effects observed

  1. Recent BM25 / vector index additions made since the last IndexPersistence.scheduleSave() debounce flush (a 5s window) are lost across the crash, since they live only in memory until persisted via state::set(KV.bm25Index, …).
  2. memory_smart_search doesn't return content saved within ~30s of the crash, even though kv.set(KV.memories, …) itself completed.
  3. Even after restart, the OTel WebSocket reconnect loop ("WebSocket error: Unexpected server response: 404") spams the logs, with exponential backoff up to ~30s.

Possibly related

  • Does the iii engine have an internal queue limit or write backpressure that surfaces as a 30s timeout under load? Happy to share a journalctl dump if useful.
  • 'state::set' is the only function_id we see crashing the process; state::get and others time out gracefully.

Suggested fixes (any one helps)

  • Wrap state::set calls in IndexPersistence.save() and kv.set(...) paths with a .catch() that logs + drops, instead of letting the rejection propagate.
  • Top-level process.on('unhandledRejection', …) handler in the service entrypoint so a single SDK timeout doesn't take down the whole memory mesh.
  • Document the recommended iii-engine tuning for sustained-write workloads (e.g. AGENTMEMORY_OBSERVE_QUEUE_LIMIT=…) — if such knobs exist.
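
For the top-level handler, something along these lines would be enough as a hard floor. installRejectionFloor is a hypothetical name and the log format is only a suggestion; the code and function_id fields are the ones visible on the error in the crash log above. On Node 15 and later an unhandled rejection terminates the process by default, and registering an 'unhandledRejection' listener is what prevents that.

```javascript
// Hypothetical entrypoint guard; not existing agentmemory code.
// Returns a function that removes the listener again (useful in tests).
function installRejectionFloor(log = console.error) {
  const handler = (reason) => {
    const code = reason && reason.code;
    const fn = reason && reason.function_id;
    // Log and keep running: one timed-out SDK call should not end the service.
    log(`[agentmemory] unhandled rejection (code=${code}, function_id=${fn})`, reason);
  };
  process.on('unhandledRejection', handler);
  return () => process.off('unhandledRejection', handler);
}
```

This only hides the symptom, so it would sit alongside the .catch() on the write paths rather than replace it.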

Happy to PR a .catch() wrapper if you point me at the preferred location.
