Service crashes with uncaught IIIInvocationError TIMEOUT on state::set (v0.9.3) #204

## Summary

The agentmemory service (`@agentmemory/agentmemory@0.9.3`) crashes intermittently with an uncaught `IIIInvocationError: TIMEOUT: invocation timed out after 30000ms` on `function_id: 'state::set'`. The rejection is raised at `iii-sdk/dist/index.mjs:405`, is never handled, and terminates the Node process. Under sustained write load (passive observation capture from Claude Code (CC) hooks across multiple projects) we observed 5–15 crashes/hour. systemd auto-restart recovers in ~10s, but BM25/vector index updates still waiting on the 5s `IndexPersistence` debounce are lost in the crash, which breaks `memory_smart_search` recall for very recent saves.

## Environment

- Node.js v20.20.2
- Linux 6.8.0-106-generic, x86_64
- `@agentmemory/agentmemory` v0.9.3 installed via `npm install -g`
- `iii` engine v0.11.0 native binary
- Embedding provider: `openai`
- `AGENTMEMORY_AUTO_COMPRESS=true`, `CONSOLIDATION_ENABLED=true`, `GRAPH_EXTRACTION_ENABLED=true`
- Load profile: 5 Claude Code agents + ~1.7K observations/day (~75/hour avg) via plugin hooks

## Reproduction

1. Run service with the env vars above
2. Drive sustained write load: ~1 observation/sec via `POST /agentmemory/observe` (or via the plugin hooks under active CC sessions)
3. Within 1–10 minutes, the process exits with the trace below
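
For anyone trying to reproduce step 2 without live CC sessions, a minimal load-driver sketch follows. Only the `POST /agentmemory/observe` path comes from this report; the helper names (`driveLoad`, `httpSender`), the payload shape, and the base URL are assumptions and will likely need adjusting to the real API.

```javascript
// Hypothetical load driver. `send` is injected so the loop can be exercised
// without a running service.
async function driveLoad({ send, count, intervalMs }) {
  const results = { ok: 0, failed: 0 };
  for (let i = 0; i < count; i++) {
    try {
      // Payload shape is an assumption, not the documented observe schema.
      await send({ seq: i, text: `synthetic observation ${i}` });
      results.ok++;
    } catch (err) {
      results.failed++;
    }
    // Pace the writes: intervalMs = 1000 gives roughly 1 observation/sec.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return results;
}

// Sender built on the global fetch available in Node 18+.
// baseUrl is a placeholder; use whatever host/port the service listens on.
function httpSender(baseUrl) {
  return async (payload) => {
    const res = await fetch(`${baseUrl}/agentmemory/observe`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(payload),
    });
    if (!res.ok) throw new Error(`observe failed: HTTP ${res.status}`);
  };
}

// Example (not executed here): ten minutes at ~1 write/sec.
// driveLoad({ send: httpSender("http://localhost:<port>"), count: 600, intervalMs: 1000 });
```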

## Crash log

```
[agentmemory] Ready. Triple-stream (BM25+Vector+Graph) search active.
... (40–90 s of normal operation, observations captured/compressed) ...
file:///usr/lib/node_modules/@agentmemory/agentmemory/node_modules/iii-sdk/dist/index.mjs:405
                                                reject(new IIIInvocationError({
                                                       ^

IIIInvocationError: TIMEOUT: invocation timed out after 30000ms
    at Timeout._onTimeout (file:///.../iii-sdk/dist/index.mjs:405:14)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7) {
  code: 'TIMEOUT',
  function_id: 'state::set',
  stacktrace: undefined
}

Node.js v20.20.2
agentmemory.service: Main process exited, code=exited, status=1/FAILURE
```

The same crash repeats, with the systemd restart counter climbing (we see >40 starts/24h on a busy day).

## What we expect

A timed-out `state::set` invocation shouldn't crash the whole process. Any one of the following would prevent that:

- Catch and log the rejection (degrade gracefully — drop or queue the write), or
- Surface a configurable longer timeout / retry policy for KV writes, or
- Add `process.on('unhandledRejection', …)` at the entrypoint as a hard floor.
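
To make the first two options concrete, here is a rough sketch of a retrying, non-fatal write wrapper. It assumes nothing about the iii-sdk API: `invokeSet` is a hypothetical stand-in for whatever call currently issues `state::set`, and `setWithRetry` and its options are illustrative names. The only detail taken from the crash log is that the error carries `code: 'TIMEOUT'`.

```javascript
// Retry a KV write on TIMEOUT, then give up without throwing.
// Returns true if the write landed, false if it was dropped.
async function setWithRetry(invokeSet, key, value, options = {}) {
  const { retries = 2, backoffMs = 500, log = console.warn } = options;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await invokeSet(key, value);
      return true;
    } catch (err) {
      // Only timeouts are degraded; anything else is still a real error.
      if (err?.code !== "TIMEOUT") throw err;
      log(`state::set timed out for ${key} (attempt ${attempt + 1}/${retries + 1})`);
      if (attempt < retries) {
        // Exponential backoff: backoffMs, 2x, 4x, ...
        await new Promise((resolve) => setTimeout(resolve, backoffMs * 2 ** attempt));
      }
    }
  }
  return false; // caller can drop the write or queue it for later
}
```

Returning a boolean instead of throwing leaves the drop-or-queue decision to the caller, which matches the "degrade gracefully" option above.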

## Side effects observed

1. BM25 / vector index additions made since the last `IndexPersistence.scheduleSave()` debounce flush (a 5s window) are lost across the crash, since they live only in memory until persisted via `state::set(KV.bm25Index, …)`.
2. `memory_smart_search` doesn't return content saved within ~30s of the crash, even though `kv.set(KV.memories, …)` itself completed.
3. Even after restart, the OTel WebSocket reconnect loop ("WebSocket error: Unexpected server response: 404") spams the logs, retrying with exponential backoff up to ~30s.
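
The loss window in item 1 can be illustrated with a generic debounced saver. This is not the actual `IndexPersistence` code, which we have not read in detail; `DebouncedSaver`, `pending`, and `flush()` are hypothetical names, and only `scheduleSave()` and the 5s delay come from the behaviour described above.

```javascript
// Generic debounce: each scheduleSave() restarts the timer, so data stays
// in memory only until the timer fires or flush() is called.
class DebouncedSaver {
  constructor(saveFn, delayMs = 5000) {
    this.saveFn = saveFn;
    this.delayMs = delayMs;
    this.timer = null;
  }

  scheduleSave() {
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.timer = null;
      // Catch here so a failed save cannot become an unhandled rejection.
      Promise.resolve()
        .then(() => this.saveFn())
        .catch((err) => console.warn("debounced save failed:", err));
    }, this.delayMs);
  }

  // True while a save is scheduled but has not run: this is the data at risk.
  get pending() {
    return this.timer !== null;
  }

  // Persist immediately, e.g. from a shutdown hook. Returns false if idle.
  async flush() {
    if (this.timer === null) return false;
    clearTimeout(this.timer);
    this.timer = null;
    await this.saveFn();
    return true;
  }
}
```

If the process dies while `pending` is true, whatever `saveFn` would have written is gone. A flush on shutdown narrows the window but would not help if the flush itself is the `state::set` call that times out.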

## Possibly related

- The iii engine may have an internal queue limit or write backpressure that surfaces as a 30s timeout under load; we have not confirmed this. Happy to share a journalctl dump if useful.
- `state::set` is the only `function_id` we have seen crash the process; timeouts on `state::get` and other functions are handled without a crash.

## Suggested fixes (any one helps)

- [ ] Wrap `state::set` calls in `IndexPersistence.save()` and `kv.set(...)` paths with a `.catch()` that logs + drops, instead of letting the rejection propagate.
- [ ] Top-level `process.on('unhandledRejection', …)` handler in the service entrypoint so a single SDK timeout doesn't take down the whole memory mesh.
- [ ] Document the recommended `iii-engine` tuning for sustained-write workloads (e.g. `AGENTMEMORY_OBSERVE_QUEUE_LIMIT=…`) — if such knobs exist.
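
As a starting point for the second checkbox, a minimal sketch of the top-level handler. `installRejectionFloor` is an illustrative name, and where it belongs in the service entrypoint is for the maintainers to decide. Since Node.js 15 the default `--unhandled-rejections=throw` mode turns an unhandled rejection into an uncaught exception unless an `unhandledRejection` listener is registered, so registering one is what keeps the process alive.

```javascript
// Log unhandled rejections instead of letting Node exit.
// Returns a function that removes the handler again.
function installRejectionFloor(log = console.error) {
  const handler = (reason) => {
    // `code` and `function_id` are the fields seen on IIIInvocationError
    // in the crash log; other errors fall back to placeholders.
    const code = reason?.code ?? "UNKNOWN";
    const fn = reason?.function_id ?? "n/a";
    log(`[agentmemory] unhandled rejection (code=${code}, function_id=${fn})`, reason);
  };
  process.on("unhandledRejection", handler);
  return () => process.off("unhandledRejection", handler);
}
```

This is a floor, not a fix: the write that timed out is still lost, so it works best alongside a `.catch()` at the call sites.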

Happy to PR a `.catch()` wrapper if you point me at the preferred location.

