fix: reattach SSE on session-switch return + close leaked stream connections #2925
wirtsi wants to merge 14 commits into
Frontend: Add document.hidden guards to three polling loops that previously fired even when the tab was backgrounded: streaming session list (5s), background task status (3s→15s when hidden), and session time refresh (60s). This eliminates unnecessary fetch→DOM-update cycles that choked the UI when background agent jobs ran in other sessions.

Backend: Narrow CHAT_LOCK in _handle_chat_sync to wrap only credential/provider resolution, not the entire agent.run_conversation() call. Two synchronous chat requests in different sessions no longer serialize during LLM execution.
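A minimal sketch of the document.hidden guard described in the frontend note, assuming a wrapper-style helper. The function names here are hypothetical and not taken from the PR; only the interval values (3s visible, 15s hidden) come from the commit message.

```javascript
// Illustrative only: function names are hypothetical. The interval values
// (3s visible, 15s hidden) are the ones quoted in the commit message.
function backgroundTaskPollDelay(doc) {
  return doc.hidden ? 15000 : 3000;
}

// Wrap a polling callback so it does nothing while the tab is hidden.
// `doc` is anything with a boolean `hidden` property (in a browser,
// pass `document`).
function makeGuardedTick(doc, work) {
  return function tick() {
    if (doc.hidden) return false; // skip the fetch and the DOM update
    work();
    return true;
  };
}
```

In a browser this could be wired as `setInterval(makeGuardedTick(document, refresh), 60000)`, where `refresh` stands for whatever the loop already does. Taking `doc` as a parameter keeps the guard testable outside a browser.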
…a#2024)

- New api/agent_subprocess.py: stripped agent worker that runs in a separate multiprocessing.Process. All heavy hermes-agent imports happen inside the subprocess, keeping the main HTTP process free.
- api/streaming.py: add _run_agent_streaming_subprocess(), which:
  1. Creates a multiprocessing.Queue + Event for IPC
  2. Spawns a relay thread to forward events from the MP queue to STREAMS
  3. Starts the agent subprocess via _start_agent_subprocess()
  4. Waits for process exit, then captures the final result
  5. Merges messages back into the session and emits done/error/cancel
- api/routes.py: switch call sites from _run_agent_streaming to _run_agent_streaming_subprocess for /api/chat/start, /api/btw, and /api/background.
- cancel_stream() now also signals the MP cancel event so the subprocess exits early.

Trade-off: agent cache is lost per-turn (fresh AIAgent each time). Session state is still preserved because sessions are file-backed. The subprocess incurs ~1-2s cold-start on first use but keeps the HTTP server responsive during long agent runs.
…esquena#2024)" This reverts commit a611fcd.
time.sleep(0) in the put() callback releases the GIL after each token/tool event, giving HTTP handler threads a chance to serve /api/sessions and other endpoints during long agent runs.
Two intertwined bugs fixed:
1. LIVE_STREAMS was never written to — the EventSource created by
attachLiveStream() was stored in a closure variable but never tracked
in the LIVE_STREAMS dictionary. This meant closeOtherLiveStreams()
and closeLiveStream() were no-ops (iterating an empty object). Every
session switch leaked the old SSE connection, which kept pumping token
events into the orphaned closure, flooding the browser main thread and
causing the macOS beach ball during long agent runs.
Fix: store {streamId, source} in LIVE_STREAMS[activeSid] inside
_wireSSE() so closeOtherLiveStreams() actually closes the previous
session's EventSource when switching.
2. When switching away from a running chat and back, attachLiveStream()
with {reconnecting: true} started with empty assistantText and
reasoningText, losing all progress. The new SSE connection would
append new tokens to nothing — the already-rendered response vanished.
Fix: on reconnect, restore assistantText and reasoningText from
INFLIGHT[activeSid].messages (the _live assistant message) instead of
starting from empty strings.
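
The first fix can be sketched as follows. This is an assumption-laden illustration, not the PR's code: the names LIVE_STREAMS, closeLiveStream and closeOtherLiveStreams come from the commit message, but the function bodies, `trackLiveStream` (standing in for the write inside _wireSSE()) and `makeFakeSource` (standing in for a real EventSource) are hypothetical.

```javascript
// Sketch only: registry keyed by session id, as the commit message describes.
const LIVE_STREAMS = {};

// Stand-in for the write inside _wireSSE(): record the source so that the
// close helpers below can actually find it.
function trackLiveStream(sid, streamId, source) {
  LIVE_STREAMS[sid] = { streamId, source };
}

// Close and forget one session's stream. Returns false if nothing was tracked.
function closeLiveStream(sid) {
  const entry = LIVE_STREAMS[sid];
  if (!entry) return false;
  entry.source.close();
  delete LIVE_STREAMS[sid];
  return true;
}

// Close every tracked stream except the one for `keepSid`.
function closeOtherLiveStreams(keepSid) {
  for (const sid of Object.keys(LIVE_STREAMS)) {
    if (sid !== keepSid) closeLiveStream(sid);
  }
}

// Hypothetical fake with the one EventSource method the sketch needs.
function makeFakeSource() {
  return { closed: false, close() { this.closed = true; } };
}
```

If the registry is never written to, `Object.keys(LIVE_STREAMS)` is empty and both close helpers do nothing, which is the no-op behaviour the commit message describes.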
Also removes the time.sleep(0) GIL-yield in streaming.py — the stall was
browser-side (connection leak → event flood → main thread freeze), not
Python-side. ThreadingHTTPServer serves requests in separate threads and
run_conversation() runs in a daemon thread; the GIL is not the bottleneck.
The source-level assertions in test_streaming_race_fix and
test_regressions now accept both the original empty-string init
('' for first connect) and the conditional restore from INFLIGHT
(for reconnect).
…onnect

1. sessions.js: Call closeOtherLiveStreams() in loadSession() when switching away from a session. This ensures the old session's EventSource is closed, stopping token events from flooding the main thread. Previously only called inside attachLiveStream(), which is not invoked for idle sessions, leaving leaked SSE connections that froze the browser.
2. messages.js: Store EventSource in LIVE_STREAMS inside _wireSSE() so closeOtherLiveStreams() and closeLiveStream() actually work. LIVE_STREAMS was never written to, making both functions no-ops.
3. messages.js: Restore assistantText/reasoningText from INFLIGHT on reconnect so the already-rendered content survives the session switch. The StreamChannel replays buffered gap events which correctly append to the restored state.
4. tests: Update assertions to accept the new conditional init pattern for reconnection accumulator restoration.
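
The conditional accumulator init from item 3 might look roughly like this. The shape of INFLIGHT[sid].messages is an assumption: the source only says the in-progress assistant message is the `_live` one, so the `role`, `content` and `reasoning` field names, and the helper name `initialAccumulators`, are guesses for illustration.

```javascript
// Assumed message shape: { role, content, reasoning, _live }.
// Only the `_live` marker is mentioned in the source; the rest is hypothetical.
function initialAccumulators(inflight, sid, reconnecting) {
  const empty = { assistantText: '', reasoningText: '' };
  // First connect: start from empty strings, as before.
  if (!reconnecting) return empty;

  const entry = inflight[sid];
  const messages = entry && Array.isArray(entry.messages) ? entry.messages : [];
  const live = messages.find((m) => m && m._live && m.role === 'assistant');
  if (!live) return empty;

  // Reconnect: resume from what was already rendered.
  return {
    assistantText: typeof live.content === 'string' ? live.content : '',
    reasoningText: typeof live.reasoning === 'string' ? live.reasoning : '',
  };
}
```

Falling back to empty strings when no `_live` message is found keeps the reconnect path no worse than the original behaviour.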
The root cause of the browser beach ball: switching sessions didn't close the old session's SSE EventSource, which kept pumping token/reasoning events through its closure into the browser main thread.

Closing the EventSource triggers its 'error' handler, which auto-reconnects. Added a _isSessionActivelyViewed() guard to the error handler so it won't reconnect when the user has switched to a different session.

Also reverted the syncInflightAssistantMessage reordering: it needs to run even when backgrounded to keep INFLIGHT data up-to-date for reconnection.
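
A rough sketch of the reconnect guard, under stated assumptions: `state.activeSid` is a hypothetical stand-in for however the app tracks the viewed session, `isSessionActivelyViewed` only mimics the role of the PR's _isSessionActivelyViewed(), and `reconnect` is a caller-supplied callback rather than the real reconnection logic.

```javascript
// Hypothetical check: is `sid` the session the user is currently looking at?
function isSessionActivelyViewed(state, sid) {
  return state.activeSid === sid;
}

// Build an error handler for one session's stream. It reconnects only when
// that session is still the one on screen.
function makeStreamErrorHandler(sid, state, reconnect) {
  return function onError() {
    if (!isSessionActivelyViewed(state, sid)) return false; // user switched away
    reconnect(sid);
    return true;
  };
}
```

The handler reads `state` at the time the error fires, not when it was created, so a session switch that happens after wiring is still respected.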
Two related fixes:

1. messages.js: closeLiveStream() now flags INFLIGHT[sid].reattach=true after tearing down the EventSource. Previously this flag was only set by the storage-load path in sessions.js loadSession(), so an in-memory INFLIGHT entry stayed unflagged through the session switch. When the user returned to the still-streaming session, the reattach branch in loadSession() was skipped and the SSE was never reopened; the user saw no live tokens until the server-side run completed and a metadata refresh swapped in the final reply. Guarded by an existence check so the terminal-state teardown path (_clearOwnerInflightState() runs before _closeSource()) remains a safe no-op.
2. api/updates.py: _dirty_suffix() silently dropped the `-dirty` suffix on any dirty working tree. The previous implementation routed through _run_git(), which packaged a synthetic "git exited with status 1" diagnostic into stdout for non-zero exit codes. diff-index --quiet uses exit code 1 to *signal* dirty (not an error), so the `if not out` guard always saw a non-empty `out` and skipped the suffix. As a result the static-asset cache-busting query string (`?v=<WEBUI_VERSION>`) was identical for a clean and dirty checkout, and browsers kept serving the pre-edit JS during local development. Call subprocess directly and check for the `returncode == 1, no stdout/stderr` shape that diff-index --quiet uses.

Tests:

- 3 new regression tests in test_inflight_stream_reuse.py pin the closeLiveStream → reattach chain (fail on master, pass with fix).
- 2 tests in test_parallel_session_switch.py rewritten with a more resilient regex match so unrelated inserts in the same loadSession reset block (like the closeOtherLiveStreams call added earlier on this branch) don't break the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
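
The close-then-reattach chain from the first fix can be sketched like this. The gate expression `INFLIGHT[sid].reattach && activeStreamId` is quoted from the PR description; everything else (the function bodies and the `shouldReattach` helper name) is a hypothetical reconstruction.

```javascript
const LIVE_STREAMS = {};
const INFLIGHT = {};

// Tear down the tracked source, then mark the in-memory INFLIGHT entry so a
// later loadSession() knows the stream must be reopened.
function closeLiveStream(sid) {
  const entry = LIVE_STREAMS[sid];
  if (entry) {
    entry.source.close();
    delete LIVE_STREAMS[sid];
  }
  // Existence check: if terminal-state teardown already cleared INFLIGHT[sid],
  // this stays a no-op instead of recreating the entry.
  if (INFLIGHT[sid]) INFLIGHT[sid].reattach = true;
}

// Hypothetical helper wrapping the gate used when returning to a session.
function shouldReattach(sid, activeStreamId) {
  const entry = INFLIGHT[sid];
  return Boolean(entry && entry.reattach && activeStreamId);
}
```

Without the flag write in `closeLiveStream()`, an entry that was created in memory rather than loaded from storage never passes the gate, which matches the "no live tokens until the run completes" symptom described above.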
…connection-leak

# Conflicts:
#	api/routes.py
#	static/messages.js
The previous attempt bypassed _run_git() and called subprocess directly, which broke test_dirty_check_appends_suffix_when_fast (the test mocks _run_git, so a direct subprocess.run() escapes the mock).

Restore the _run_git() call path. The trick is that _run_git() packs a synthetic "git exited with status N" diagnostic into its return value when both stdout and stderr are empty, which is exactly the shape `diff-index --quiet` produces to *signal* a dirty tree (exit 1, no output). Treat that synthetic shape AND an empty `out` as the dirty signal; real errors (timeouts, missing git, repo-not-found) come with their own diagnostic and correctly suppress the suffix so the base version remains visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Reading the diff against … Code reference: the missing …
Holding for @nesquena review — this PR modifies 3 existing tests ( What needs to be answered before merge:
Flagging for @nesquena. @wirtsi, thank you; just needs a careful conflict-with-shipped-fix check.
Reproduced on iOS mobile (Safari PWA). When switching sessions or locking screen, SSE drops and messages won't appear until fully loaded. Makes the mobile experience unusable. Looking forward to this fix! 🙏
…connection-leak

# Conflicts:
#	static/messages.js
The c5da885 merge into fix/live-streams-connection-leak resolved a static/messages.js conflict by keeping our side, which silently dropped three upstream commits already on master:

- 47f6648 (one live SSE source per stream): closeLiveStream/_closeSource source parameter + same-source guard
- fe597c1 (chat upload attachment paths): uploadedPaths fall back to u.path before u.name/u.filename
- 85e13a6 (clarify dialog padding): _syncClarifyTranscriptSpace, _ensureClarifyResizeListener, and their call sites in toggleClarifyCardCollapsed/hideClarifyCard/showClarifyCard

Restores green CI on the branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I re-read this against current #2928 made That makes the If this PR is being held because the diff is broad, I would support narrowing it rather than dropping it: keep the reattach marking, reconnect accumulator restore, background reconnect guard, and the focused tests for the close/reattach/loadSession chain; split the unrelated |
Summary

… EventSources on session switch) with the missing reattach trigger on return, plus a `_dirty_suffix` correctness fix.

What was broken

- `LIVE_STREAMS` was never written to: the `EventSource` was kept in a closure variable but not tracked in the dictionary, so `closeOtherLiveStreams()` and `closeLiveStream()` were no-ops. Every session switch leaked a live `EventSource`. After a few switches, browser connection-pool exhaustion produced pending requests and the macOS beach-ball during long agent runs.
- … `closeOtherLiveStreams` actually closes prior streams), `loadSession()` returning to a still-streaming session needed to reopen the SSE. The reattach gate is `INFLIGHT[sid].reattach && activeStreamId`, but `reattach=true` was only set on the storage-load path. An in-memory `INFLIGHT` entry stayed unflagged, so no new `EventSource` was opened on return; the user saw nothing until the final response landed via metadata refresh.
- `_dirty_suffix` silently dropped `-dirty`. `_run_git()` substitutes a synthetic "git exited with status N" diagnostic when both stdout/stderr are empty (which is exactly what `diff-index --quiet` does to signal a dirty tree). The naïve `if not out` guard always saw a truthy `out` and dropped the suffix, defeating dev-build cache busting (`static/foo.js?v=…` stayed identical between clean and dirty checkouts, so browsers kept serving stale assets after a local edit).

What changed

- `static/messages.js`: `_wireSSE()` writes `LIVE_STREAMS[activeSid]`; `closeLiveStream()` now also sets `INFLIGHT[sid].reattach = true` (guarded) after closing, so `loadSession`'s reattach branch fires on return. Reconnect handler bails out via `_isSessionActivelyViewed()` so an SSE closed intentionally during session switch doesn't auto-reconnect in the background.
- `static/sessions.js`: `loadSession()` calls `closeOtherLiveStreams(sid)` before fetching session metadata, so the previous session's `EventSource` is torn down deterministically (instead of leaking until the next `attachLiveStream`).
- `api/updates.py`: `_dirty_suffix()` recognises both the empty-`out` and synthetic-diagnostic shapes as the dirty signal, keeping the `_run_git()` call path so the existing test mock still works.
- `api/routes.py`, `tests/test_regressions.py`, `tests/test_streaming_race_fix.py`: small edits that came along with the broader connection-leak hardening already on this branch.

Tests

- `tests/test_inflight_stream_reuse.py`: 3 new regression tests pin the chain: `closeLiveStream` marks reattach → `closeOtherLiveStreams` propagates the mark → `loadSession`'s INFLIGHT branch keeps the gate shape that the mark feeds into.
- `tests/test_parallel_session_switch.py`: two brittle substring tests rewritten with resilient regex matches so future inserts in the same `loadSession` reset block don't break the assertion.
- `tests/test_version_badge.py`: exercises the corrected `_dirty_suffix` via the existing `_run_git` mock.

Test plan

- … `static/messages.js?v=…` URL gains `-dirty` on local edits and busts the browser cache.

🤖 Generated with Claude Code