Skip to content

fix: reattach SSE on session-switch return + close leaked stream connections#2925

Open
wirtsi wants to merge 14 commits into
nesquena:masterfrom
wirtsi:fix/live-streams-connection-leak
Open

fix: reattach SSE on session-switch return + close leaked stream connections#2925
wirtsi wants to merge 14 commits into
nesquena:masterfrom
wirtsi:fix/live-streams-connection-leak

Conversation

@wirtsi
Copy link
Copy Markdown
Contributor

@wirtsi wirtsi commented May 25, 2026

Summary

What was broken

  1. Connection leak (browser beach-ball). LIVE_STREAMS was never written to — the EventSource was kept in a closure variable but not tracked in the dictionary, so closeOtherLiveStreams() and closeLiveStream() were no-ops. Every session switch leaked a live EventSource. After a few switches, browser connection-pool exhaustion produced pending requests and the macOS beach-ball during long agent runs.
  2. Reattach skipped on return. Once the leak fix landed (closeOtherLiveStreams actually closes prior streams), loadSession() returning to a still-streaming session needed to reopen the SSE. The reattach gate is INFLIGHT[sid].reattach && activeStreamId, but reattach=true was only set on the storage-load path. An in-memory INFLIGHT entry stayed unflagged, so no new EventSource was opened on return — the user saw nothing until the final response landed via metadata refresh.
  3. _dirty_suffix silently dropped -dirty. _run_git() substitutes a synthetic "git exited with status N" diagnostic when both stdout/stderr are empty (which is exactly what diff-index --quiet does to signal a dirty tree). The naïve if not out guard always saw a truthy out and dropped the suffix — defeating dev-build cache busting (static/foo.js?v=… stayed identical between clean and dirty checkouts, so browsers kept serving stale assets after a local edit).

What changed

  • static/messages.js_wireSSE() writes LIVE_STREAMS[activeSid]; closeLiveStream() now also sets INFLIGHT[sid].reattach = true (guarded) after closing, so loadSession's reattach branch fires on return. Reconnect handler bails out via _isSessionActivelyViewed() so an SSE closed intentionally during session switch doesn't auto-reconnect in the background.
  • static/sessions.jsloadSession() calls closeOtherLiveStreams(sid) before fetching session metadata, so the previous session's EventSource is torn down deterministically (instead of leaking until the next attachLiveStream).
  • api/updates.py_dirty_suffix() recognises both the empty-out and synthetic-diagnostic shapes as the dirty signal, keeping the _run_git() call path so the existing test mock still works.
  • api/routes.py, tests/test_regressions.py, tests/test_streaming_race_fix.py — small edits that came along with the broader connection-leak hardening already on this branch.

Tests

  • tests/test_inflight_stream_reuse.py — 3 new regression tests pin the chain: closeLiveStream marks reattach → closeOtherLiveStreams propagates the mark → loadSession's INFLIGHT branch keeps the gate shape that the mark feeds into.
  • tests/test_parallel_session_switch.py — two brittle substring tests rewritten with resilient regex matches so future inserts in the same loadSession reset block don't break the assertion.
  • tests/test_version_badge.py — exercises the corrected _dirty_suffix via the existing _run_git mock.
  • Full local suite: 6406 passed, 2 unrelated (gateway-sync network timeout in CI-less env).

Test plan

  • Start chat A with a long prompt; click +; send a brief message in B; click back to A; confirm tokens stream live.
  • Repeat with a model that has visible reasoning vs. plain token output to ensure both paths reattach.
  • Confirm static/messages.js?v=… URL gains -dirty on local edits and busts the browser cache.

🤖 Generated with Claude Code

Florian Krause and others added 12 commits May 21, 2026 21:10
Frontend: Add document.hidden guards to three polling loops that
previously fired even when the tab was backgrounded — streaming
session list (5s), background task status (3s→15s when hidden),
and session time refresh (60s). This eliminates unnecessary
fetch→DOM-update cycles that choked the UI when background agent
jobs ran in other sessions.

Backend: Narrow CHAT_LOCK in _handle_chat_sync to wrap only
credential/provider resolution, not the entire
agent.run_conversation() call. Two synchronous chat requests in
different sessions no longer serialize during LLM execution.
…a#2024)

- New api/agent_subprocess.py: stripped agent worker that runs in a separate
  multiprocessing.Process. All heavy hermes-agent imports happen inside the
  subprocess, keeping the main HTTP process free.

- api/streaming.py: add _run_agent_streaming_subprocess() which:
  1. Creates a multiprocessing.Queue + Event for IPC
  2. Spawns a relay thread to forward events from the MP queue to STREAMS
  3. Starts the agent subprocess via _start_agent_subprocess()
  4. Waits for process exit, then captures the final result
  5. Merges messages back into the session and emits done/error/cancel

- api/routes.py: switch call sites from _run_agent_streaming to
  _run_agent_streaming_subprocess for /api/chat/start, /api/btw, and
  /api/background.

- cancel_stream() now also signals the MP cancel event so the subprocess
  exits early.

Trade-off: agent cache is lost per-turn (fresh AIAgent each time). Session
state is still preserved because sessions are file-backed. The subprocess
incurs ~1-2s cold-start on first use but keeps the HTTP server responsive
during long agent runs.
time.sleep(0) in the put() callback releases the GIL after each token/tool
event, giving HTTP handler threads a chance to serve /api/sessions and other
endpoints during long agent runs.
Two intertwined bugs fixed:

1. LIVE_STREAMS was never written to — the EventSource created by
   attachLiveStream() was stored in a closure variable but never tracked
   in the LIVE_STREAMS dictionary. This meant closeOtherLiveStreams()
   and closeLiveStream() were no-ops (iterating an empty object). Every
   session switch leaked the old SSE connection, which kept pumping token
   events into the orphaned closure, flooding the browser main thread and
   causing the macOS beach ball during long agent runs.

   Fix: store {streamId, source} in LIVE_STREAMS[activeSid] inside
   _wireSSE() so closeOtherLiveStreams() actually closes the previous
   session's EventSource when switching.

2. When switching away from a running chat and back, attachLiveStream()
   with {reconnecting: true} started with empty assistantText and
   reasoningText, losing all progress. The new SSE connection would
   append new tokens to nothing — the already-rendered response vanished.

   Fix: on reconnect, restore assistantText and reasoningText from
   INFLIGHT[activeSid].messages (the _live assistant message) instead of
   starting from empty strings.

Also removes the time.sleep(0) GIL-yield in streaming.py — the stall was
browser-side (connection leak → event flood → main thread freeze), not
Python-side. ThreadingHTTPServer serves requests in separate threads and
run_conversation() runs in a daemon thread; the GIL is not the bottleneck.
The source-level assertions in test_streaming_race_fix and
test_regressions now accept both the original empty-string init
('' for first connect) and the conditional restore from INFLIGHT
(for reconnect).
…onnect

1. sessions.js: Call closeOtherLiveStreams() in loadSession() when
   switching away from a session. This ensures the old session's
   EventSource is closed, stopping token events from flooding the
   main thread. Previously only called inside attachLiveStream(),
   which is not invoked for idle sessions — leaving leaked SSE
   connections that froze the browser.

2. messages.js: Store EventSource in LIVE_STREAMS inside _wireSSE()
   so closeOtherLiveStreams() and closeLiveStream() actually work.
   LIVE_STREAMS was never written to, making both functions no-ops.

3. messages.js: Restore assistantText/reasoningText from INFLIGHT
   on reconnect so the already-rendered content survives the
   session switch. The StreamChannel replays buffered gap events
   which correctly append to the restored state.

4. tests: Update assertions to accept the new conditional init
   pattern for reconnection accumulator restoration.
The root cause of the browser beach ball: switching sessions didn't
close the old session's SSE EventSource, which kept pumping token/
reasoning events through its closure into the browser main thread.

Closing the EventSource triggers its 'error' handler which auto-
reconnects. Added _isSessionActivelyViewed() guard to the error handler
so it won't reconnect when the user has switched to a different session.

Also Reverted the syncInflightAssistantMessage reordering — it needs
to run even when backgrounded to keep INFLIGHT data up-to-date for
reconnection.
Two related fixes:

1. messages.js — closeLiveStream() now flags INFLIGHT[sid].reattach=true
   after tearing down the EventSource. Previously this flag was only set by
   the storage-load path in sessions.js loadSession(), so an in-memory
   INFLIGHT entry stayed unflagged through the session switch. When the user
   returned to the still-streaming session, the reattach branch in
   loadSession() was skipped and the SSE was never reopened — the user saw
   no live tokens until the server-side run completed and a metadata
   refresh swapped in the final reply.

   Guarded by an existence check so the terminal-state teardown path
   (_clearOwnerInflightState() runs before _closeSource()) remains a safe
   no-op.

2. api/updates.py — _dirty_suffix() silently dropped the `-dirty` suffix on
   any dirty working tree. The previous implementation routed through
   _run_git(), which packaged a synthetic "git exited with status 1"
   diagnostic into stdout for non-zero exit codes. diff-index --quiet
   uses exit code 1 to *signal* dirty (not an error), so the `if not out`
   guard always saw a non-empty `out` and skipped the suffix. As a result
   the static-asset cache-busting query string (`?v=<WEBUI_VERSION>`) was
   identical for a clean and dirty checkout — browsers kept serving the
   pre-edit JS during local development. Call subprocess directly and
   check for the `returncode == 1, no stdout/stderr` shape that
   diff-index --quiet uses.

Tests:
- 3 new regression tests in test_inflight_stream_reuse.py pin the
  closeLiveStream → reattach chain (fail on master, pass with fix).
- 2 tests in test_parallel_session_switch.py rewritten with a more
  resilient regex match so unrelated inserts in the same loadSession
  reset block (like the closeOtherLiveStreams call added earlier on
  this branch) don't break the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…connection-leak

# Conflicts:
#	api/routes.py
#	static/messages.js
The previous attempt bypassed _run_git() and called subprocess directly,
which broke test_dirty_check_appends_suffix_when_fast (the test mocks
_run_git, so a direct subprocess.run() escapes the mock).

Restore the _run_git() call path. The trick is that _run_git() packs a
synthetic "git exited with status N" diagnostic into its return value
when both stdout and stderr are empty — which is exactly what
`diff-index --quiet` does to *signal* a dirty tree (exit 1, no output).
Treat that synthetic shape AND an empty `out` as the dirty signal; real
errors (timeouts, missing git, repo-not-found) come with their own
diagnostic and correctly suppress the suffix so the base version
remains visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Summary

Reading the diff against origin/master and the actual chain in static/messages.js:610-628, the four-part fix (track the EventSource, mark for reattach on teardown, restore accumulators on reconnect, suppress reconnect for backgrounded sessions) lines up correctly. The non-obvious part — that the same patch had to land in four places to work end-to-end — is exactly the cluster of regressions #2924 reports, and the regression tests pin the chain so a future refactor can't half-fix it.

Code reference — the missing LIVE_STREAMS write was the headline bug

On origin/master, LIVE_STREAMS is declared but never written to. _wireSSE opens an EventSource and keeps the handle in a closure, but the dictionary stays empty, so closeOtherLiveStreams() and closeLiveStream() iterate over {} and silently no-op. Every session switch leaked a live stream. The branch adds the missing write at static/messages.js:1478:

function _wireSSE(source){
  // Track the EventSource in LIVE_STREAMS so closeOtherLiveStreams() /
  // closeLiveStream() can close it when switching sessions, and so
  // reconnect can detect an already-connected stream.
  LIVE_STREAMS[activeSid]={streamId,source};

Without this, the rest of the chain has nothing to operate on. With it, closeOtherLiveStreams(sid) from static/sessions.js:577 (added in this PR) actually closes the prior session's EventSource on switch.

Code reference — reattach gate

The follow-on bug is the part most likely to be missed in review. Once the leak is sealed, returning to a still-streaming session needs to reopen the SSE. The reattach branch in loadSession() is gated on INFLIGHT[sid].reattach, but that flag was only set on the storage-load path. The branch fixes this at static/messages.js:627:

if(INFLIGHT[sessionId]) INFLIGHT[sessionId].reattach=true;

closeLiveStream() is now the single chokepoint that marks reattach. The teardown call at _clearOwnerInflightState() runs before _closeSource() in the terminal-state path, so INFLIGHT[sessionId] is already gone there and the guarded write is a no-op — exactly the behaviour you want for the clean-finish path. The regression test at tests/test_inflight_stream_reuse.py:104-122 pins this precise shape.

Code reference — reconnect guard

static/messages.js:2113-2116 short-circuits the error-handler's reconnect when the user has navigated away:

if(!_isSessionActivelyViewed(activeSid)) return;

This is the right place to put it — source.close() runs first, so the EventSource is torn down whether or not we attempt reconnect, but the explicit _isSessionActivelyViewed check prevents _reconnectAttempted from re-firing a background stream the user just walked away from. Combined with LIVE_STREAMS[activeSid]={streamId,source} so closeOtherLiveStreams() can find it, no orphan streams remain after a switch.

_dirty_suffix correctness

_run_git() at api/updates.py:104 substitutes "git exited with status N" when both stdout and stderr are empty. git diff-index --quiet HEAD -- is designed to produce exactly that shape on a dirty tree (exit 1, no output). The previous return "-dirty" if not out else "" always saw truthy out and dropped the suffix. The branch correctly handles both:

if not out or out.startswith('git exited with status '):
    return "-dirty"
return ""

This restores cache busting on dirty dev checkouts — without it, static/messages.js?v=… was identical between clean and dirty builds, so browsers kept serving stale assets after a local edit. Real errors (timeout text, "git executable not found") carry distinct diagnostics and correctly fall through to "".

Note — the api/routes.py diff is whitespace-only

The api/routes.py hunk in the diff (84 lines changed) is the indentation undo from commit c0a683bd "fix: suppress hidden-tab polling and narrow CHAT_LOCK scope", where _handle_chat_sync got its CHAT_LOCK scope narrowed — that lives only on this branch, so against master the diff has to re-indent the AIAgent construction block back inside with CHAT_LOCK:. Worth flagging in the PR description so reviewers don't waste time looking for a behavioural change inside it. If you want a smaller blast radius for the SSE fix, splitting that earlier CHAT_LOCK narrowing into a separate PR would make the streaming PR more obviously contained.

Verification

CI is green (3.11/3.12/3.13 SUCCESS), and the repro steps in the PR description match the chain the tests pin — start A long, switch to B and send, return to A → tokens stream live because (1) the EventSource for A was actually closed on switch, (2) INFLIGHT.reattach was marked, (3) loadSession's reattach branch reopens via attachLiveStream with {reconnecting:true}, (4) the new closure restores assistantText from the last live INFLIGHT message so the prefix isn't doubled or dropped.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Holding for @nesquena review — this PR modifies 3 existing tests (test_inflight_stream_reuse, test_parallel_session_switch, test_regressions, test_streaming_race_fix), which signals an intentional contract change in the SSE reattach behavior. Compared to the SSE work we just shipped in #2928 (Release DH, v0.51.136), the surfaces overlap heavily.

What needs to be answered before merge:

  1. Does this PR's behavior agree with or contradict fix(chat): keep one live SSE source per stream #2928's "single live source per stream" guarantee?
  2. The test assertion changes — are they redefining what we considered a regression contract, or just adjusting to new internal helper names?
  3. Is the "close leaked stream connections" part actually fixing a separate leak, or duplicating what fix(chat): keep one live SSE source per stream #2928 already does in _wireSSE?

Flagging for @nesquena. @wirtsi, thank you — just needs a careful conflict-with-shipped-fix check.

@darickmunroec
Copy link
Copy Markdown

Reproduced on iOS mobile (Safari PWA). When switching sessions or locking screen, SSE drops and messages won't appear until fully loaded. Makes the mobile experience unusable. Looking forward to this fix! 🙏

Florian Krause and others added 2 commits May 26, 2026 08:36
…connection-leak

# Conflicts:
#	static/messages.js
The c5da885 merge into fix/live-streams-connection-leak resolved a
static/messages.js conflict by keeping our side, which silently dropped
three upstream commits already on master:

- 47f6648 (one live SSE source per stream): closeLiveStream/_closeSource
  source parameter + same-source guard
- fe597c1 (chat upload attachment paths): uploadedPaths fall back to
  u.path before u.name/u.filename
- 85e13a6 (clarify dialog padding): _syncClarifyTranscriptSpace,
  _ensureClarifyResizeListener, and their call sites in
  toggleClarifyCardCollapsed/hideClarifyCard/showClarifyCard

Restores green CI on the branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@franksong2702
Copy link
Copy Markdown
Contributor

franksong2702 commented May 26, 2026

I re-read this against current origin/master, #2347, and the already-merged #2928. My read is that the core of this PR does not contradict #2928 one live SSE source; it completes the missing half of that invariant.

#2928 made LIVE_STREAMS authoritative enough that old EventSources can actually be closed. That is correct, but it creates the state described in #2924: the original session can still have an in-memory INFLIGHT[sid] entry and an active server-side active_stream_id, while its browser EventSource has been closed. The existing loadSession() return path only reopens the SSE when INFLIGHT[sid].reattach && activeStreamId is true, and the in-memory path never sets that flag. So the stream is silently disconnected until final session refresh lands.

That makes the closeLiveStream() -> INFLIGHT[sessionId].reattach = true part the key missing invariant. It also restores the user-facing guarantee from #2347: switching away and back should preserve the live working scene, not just the last DOM snapshot before the stream was cut.

If this PR is being held because the diff is broad, I would support narrowing it rather than dropping it: keep the reattach marking, reconnect accumulator restore, background reconnect guard, and the focused tests for the close/reattach/loadSession chain; split the unrelated _dirty_suffix / cache-busting cleanup and any incidental route/test churn if needed. #2958 looks adjacent but separate: that is live timeline/interim-progress presentation, while #2924/#2925 is the transport reattach break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hold maintainer-review Maintainer fit-assessment needed — may not merge even with fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Live SSE stream does not reattach after session switch — user sees no live tokens until completion

4 participants