
Web UI WS state machine: outbox + seq + ack + keepalive#172

Merged
seamus-brady merged 1 commit into main from feature/web-ui-state-machine
Apr 26, 2026

Conversation

@seamus-brady
Owner

Summary

Replaces the per-node monotonic seq (diagnostic-only) with a per-client outbox keyed on a stable client_id. Every server-pushed frame is appended to that client's outbox before being shipped; the client acks by seq, and the server prunes acked frames. On reconnect the client opens with ?since=N; the server then replays the gap from the outbox, or falls back to a full session_history rebuild when the gap exceeds the buffer.
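The append / ack / replay cycle described above can be sketched roughly as follows. This is a minimal Python illustration, not the Gleam code in this PR: the `Outbox` class, the `max_frames` cap, the string result tags, and the JSON layout produced by `splice_seq` are all assumptions made for the example, and age-based pruning is left out.

```python
import json
from collections import deque
from dataclasses import dataclass, field


def splice_seq(body: dict, seq: int) -> str:
    """Combine a stored seq-less body with its seq into one wire frame."""
    return json.dumps({"seq": seq, **body})


@dataclass
class Outbox:
    max_frames: int = 256  # hypothetical size cap, not taken from the PR
    next_seq: int = 1
    frames: deque = field(default_factory=deque)  # (seq, body), oldest first

    def append(self, body: dict) -> str:
        """Record the frame first, then return the text to ship."""
        seq = self.next_seq
        self.next_seq += 1
        self.frames.append((seq, body))
        while len(self.frames) > self.max_frames:
            self.frames.popleft()  # size cap evicts the oldest frame
        return splice_seq(body, seq)

    def ack(self, seq: int) -> None:
        """Prune every frame the client has confirmed."""
        while self.frames and self.frames[0][0] <= seq:
            self.frames.popleft()

    def replay_since(self, since: int):
        """Decide what a reconnecting client with cursor `since` needs."""
        last_seq = self.next_seq - 1
        if since >= last_seq:
            # Assumption: a cursor at or ahead of the server counts as current.
            return ("up_to_date", [])
        if not self.frames or self.frames[0][0] > since + 1:
            # The frame right after `since` is gone, so the gap has a hole.
            return ("too_old", [])
        return ("replay", [splice_seq(b, s) for s, b in self.frames if s > since])
```

With this shape, a cursor equal to `oldest_kept - 1` is still replayable, and anything older forces the full history rebuild.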

Adds WS keepalive — server emits Ping every 25s, client replies Pong. Idle proxies and OS-level NAT can no longer silently drop the connection during long agent cycles.
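As a rough illustration of the keepalive exchange, the sketch below uses Python's asyncio rather than the Gleam selector arm in this PR. The `keepalive` and `client_reply` names, the `ticks` parameter, and the exact `{"type": "pong"}` reply shape are assumptions for the example; only the 25s interval and the `{type:"ping"}` frame come from the description.

```python
import asyncio
import json

PING_INTERVAL_S = 25.0  # interval stated in the PR description


async def keepalive(send, interval: float = PING_INTERVAL_S, ticks=None) -> None:
    """Server side: wait one interval, emit a ping, then re-arm.

    `send` is any async callable that ships a text frame. `ticks` bounds
    the loop for demonstration; None means run until cancelled.
    """
    sent = 0
    while ticks is None or sent < ticks:
        await asyncio.sleep(interval)
        await send(json.dumps({"type": "ping"}))
        sent += 1


def client_reply(raw: str):
    """Client side: answer a ping with a pong, ignore everything else."""
    msg = json.loads(raw)
    if msg.get("type") == "ping":
        return json.dumps({"type": "pong"})  # assumed reply shape
    return None
```

Traffic in both directions at a fixed cadence is what keeps idle-timeout proxies and NAT tables from expiring the connection.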

Three reported symptoms this fixes

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Long-running cycle, no live notification, refresh shows the reply | WS dropped silently (no keepalive); reply fires into dead notify_subject | Keepalive prevents the dropout; if it happens anyway, reconnect-with-since replays the missed frames |
| Typed user message disappears | user_message_ack dropped from inFlightUserMessages BEFORE cog persisted; race with reconnect-mid-cycle wiping the DOM | Ack is purely a UI hint; in-flight tracking stays until renderSessionHistory observes the message in the server's authoritative view |
| No live agent_progress / thinking / tool updates during work | Same as #1 — notifications fire into a dead WS | Same fix — keepalive + outbox replay |
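The fix for the second symptom comes down to one rule: an ack never removes an in-flight entry, only the server's history does. A minimal Python sketch of that rule follows. The real logic lives in the page's JavaScript, and the `InFlightTracker` class and the `client_msg_id` matching key are hypothetical; the PR does not say how history entries are matched to in-flight messages.

```python
from dataclasses import dataclass, field


@dataclass
class InFlightTracker:
    pending: dict = field(default_factory=dict)  # client_msg_id -> text
    acked: set = field(default_factory=set)

    def on_send(self, client_msg_id: str, text: str) -> None:
        """Track a typed message from the moment it leaves the input box."""
        self.pending[client_msg_id] = text

    def on_ack(self, client_msg_id: str) -> None:
        """user_message_ack is a UI hint only: note it, keep the entry."""
        if client_msg_id in self.pending:
            self.acked.add(client_msg_id)

    def on_session_history(self, history: list) -> list:
        """Drop entries the authoritative history contains.

        Returns the texts still in flight so the caller can re-render
        their bubbles after the DOM has been rebuilt.
        """
        seen = {entry.get("client_msg_id") for entry in history}
        for msg_id in list(self.pending):
            if msg_id in seen:
                del self.pending[msg_id]
                self.acked.discard(msg_id)
        return list(self.pending.values())
```

If a reconnect rebuilds the view before the message has been persisted, the entry is still pending and its bubble can be drawn again.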

What's in the box

  • src/web/outbox.gleam — pure ring buffer with seq, ack, replay_since (UpToDate / Replay / TooOld with boundary handling), age-prune, size cap. 12 unit tests.
  • src/web/client_registry.gleam — OTP actor wrapping per-client_id Outboxes. Survives WS process death so reconnects under the same id replay correctly. Hourly janitor drops idle clients.
  • src/web/gui.gleam — WsState gains client_id, registry, keepalive_subject. New ws_send helper routes every message through client_registry.append. ~25 direct mist.send_text_frame call sites converted. ?since=N parsed from URL; ws_on_init either spawns the history query (fresh / TooOld) or fires ReplayReady (gap fits). New Ack(seq) and Pong client-message handlers. KeepaliveTick selector arm emits Ping and re-arms.
  • src/web/protocol.gleam — encode_server_message_with_seq takes explicit seq; encode_server_message_body returns the seq-less body for the outbox; splice_seq re-emits a stored body with its original seq during replay. Old global monotonic_seq FFI removed; encode_server_message kept as a back-compat alias that emits seq=0.
  • src/web/html.gleam — both ws_connect_js (chat + admin) and the mobile page's connect() track lastSeenSeq in sessionStorage, build wsUrl with &since=N, send periodic {type:"ack", seq:N} every 5s, reply pong on {type:"ping"}. user_message_ack handler no longer drops in-flight entries.
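The client-side bookkeeping in the last bullet can be sketched as below, in Python rather than the page's JavaScript. The `ClientCursor` class, the plain dict standing in for sessionStorage, and the `client_id` query parameter name are assumptions for illustration; the `since` parameter and the `{type:"ack", seq:N}` shape come from the description.

```python
import json
from urllib.parse import urlencode


class ClientCursor:
    """Tracks the highest seq seen so reconnects can ask for the gap."""

    def __init__(self, client_id: str, storage: dict):
        self.client_id = client_id
        self.storage = storage  # stand-in for sessionStorage (string values)

    @property
    def last_seen_seq(self) -> int:
        return int(self.storage.get("lastSeenSeq", "0"))

    def on_frame(self, raw: str) -> None:
        """Advance the cursor; seq=0 frames and stale frames never move it."""
        seq = json.loads(raw).get("seq", 0)
        if seq > self.last_seen_seq:
            self.storage["lastSeenSeq"] = str(seq)

    def ws_url(self, base: str) -> str:
        """Build the connect URL carrying the resume point."""
        query = urlencode({"client_id": self.client_id, "since": self.last_seen_seq})
        return f"{base}?{query}"

    def ack_frame(self) -> str:
        """Payload for the periodic ack (every 5s per the description)."""
        return json.dumps({"type": "ack", "seq": self.last_seen_seq})
```

Keeping the cursor in per-tab storage means a page reload resumes from the same seq instead of starting from zero.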

What is preserved

The full UI surface — uploads (POST /upload + attachment refs in user_message), all admin tabs (Narrative, Log, Scheduler, Cycles, Planner, D' Safety, D' Config, Comms, Affect, Skills, Memory, Documents), question/answer flow, history browsing, search, approve/reject — all unchanged. Every server message type retains its existing JSON shape; only the seq field's semantics changed (per-node → per-client).

Test plan

  • gleam build clean
  • gleam format clean
  • gleam test: 2190 passing (gained 12 outbox tests + 11 from concurrent main merges)
  • Rebased cleanly on latest origin/main
  • gleam run — confirm boot, observe [outbox] debug entries on first WS connect
  • Live exercise the three reported symptoms:
    • Send a slow query, watch agent_progress arrive live (no refresh needed)
    • Reconnect mid-cycle (kill the network briefly), verify response still lands
    • Type a message during a long cycle, verify the bubble persists across the eventual reply

🤖 Generated with Claude Code

Replaces the per-node monotonic seq (diagnostic-only) with a per-
client outbox keyed on stable client_id. Every server-pushed frame
is appended to that client's outbox before being shipped, the
client acks by seq, the server prunes acked frames. On reconnect
the client opens with `?since=N` and the server replays the gap
from the outbox; when the gap is too wide (frames pruned out) it
falls back to a full session_history rebuild.

Adds WS keepalive — server emits Ping every 25s; client replies
Pong. Idle proxies and OS-level NAT can no longer silently drop
the connection during long agent cycles.

Three operator-reported symptoms this fixes:

  1. Long-running cycle, no live notification, refresh shows the
     reply. Caused by silent WS dropouts (no keepalive); reply
     delivered into a dead notify_subject. Now: keepalive prevents
     the dropout, OR if it happens anyway, reconnect-with-since
     replays the missed frames.

  2. User-typed message disappears. Caused by user_message_ack
     being dropped from inFlightUserMessages BEFORE cog had
     persisted the message — race with reconnect-mid-cycle wiping
     the DOM. Now: ack is purely a UI hint; in-flight tracking
     stays until renderSessionHistory observes the message in the
     server's authoritative view.

  3. No live agent_progress / thinking / tool updates during
     work. Caused by the same dropout as #1 — notifications fire
     into a dead WS. Same fix: keepalive + outbox replay.

Implementation:

  * `src/web/outbox.gleam` — pure ring buffer with seq, ack,
    replay_since (UpToDate / Replay / TooOld), age-prune, size cap
  * `src/web/client_registry.gleam` — OTP actor wrapping per-
    client_id Outboxes; survives WS process death, hourly janitor
    drops idle clients
  * `src/web/gui.gleam` — `WsState` gains `client_id`, `registry`,
    `keepalive_subject`. New `ws_send` helper routes every message
    through `client_registry.append`; ~25 direct
    `mist.send_text_frame` callsites converted. `?since=N` parsed
    from URL; ws_on_init either spawns history query (fresh
    connect or TooOld) or fires ReplayReady (gap fits in outbox).
    New ClientMessage handlers for `Ack(seq)` and `Pong`. New
    KeepaliveTick selector arm sends `Ping` and re-arms.
  * `src/web/protocol.gleam` — `encode_server_message_with_seq`
    takes explicit seq; `encode_server_message_body` returns
    seq-less body for the outbox; `splice_seq` re-emits a stored
    body with its original seq during replay. Old global
    `monotonic_seq` FFI call removed.
  * `src/web/html.gleam` — both ws_connect_js (chat + admin
    pages) and the mobile page's connect() track lastSeenSeq in
    sessionStorage, build wsUrl with `&since=N`, send periodic
    `{type:"ack", seq:N}` every 5s, reply pong on `{type:"ping"}`.
    user_message_ack handler no longer drops in-flight entries.

Tests: 2167 passing. New: 12 outbox tests covering append, ack,
replay (UpToDate / Replay / TooOld with boundary at oldest_kept-1),
size cap, age prune, end-to-end reconnect flow. Updated
seq_increments test to confirm the new "explicit seq pass-through"
contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@seamus-brady seamus-brady merged commit f889fd3 into main Apr 26, 2026
1 check passed
@seamus-brady seamus-brady deleted the feature/web-ui-state-machine branch April 26, 2026 22:55