Skip to content

fix: forward task_started / task_notification between turns via a single-consumer session reader#713

Open
julianmesa-gitkraken wants to merge 3 commits into
agentclientprotocol:mainfrom
julianmesa-gitkraken:fix/336-background-drain-task-notifications
Open

fix: forward task_started / task_notification between turns via a single-consumer session reader#713
julianmesa-gitkraken wants to merge 3 commits into
agentclientprotocol:mainfrom
julianmesa-gitkraken:fix/336-background-drain-task-notifications

Conversation

@julianmesa-gitkraken

@julianmesa-gitkraken julianmesa-gitkraken commented May 26, 2026

Copy link
Copy Markdown

Fixes #336.

Summary

When Claude Code launches background tasks via the Task tool with run_in_background=true, task_notification (and task_started) system messages arriving between turns sit in the SDK's internal buffer until the user sends the next message — at which point the wrapper's prompt() switch hits the // Todo: process via status api no-op and silently discards them. The user observes this as "the agent doesn't react to my background task completing — I have to type something to wake it, and then it acts on a stale prompt".

The wrapper now runs a single-consumer session reader per session that is the sole consumer of session.query. It forwards task_started / task_notification to the ACP client the moment they arrive (in-turn or between turns), hands in-turn messages to the active prompt() via a queue, and forwards autonomous task-notification followups out-of-turn. prompt() no longer touches the SDK iterator directly.

An earlier revision of this PR implemented the issue's "Option B" sketch as a background drain that shared the SDK AsyncGenerator with prompt() via a cooperative lock (drainedBuffer / drainReadInFlight / onPromptIdle). An adversarial review found that sharing one AsyncGenerator between two consumers is fragile by construction (≈15 findings: orphaned Promise.race reads, stale buffered messages leaking into the next turn, client-latency coupling, missing abort on process-death, etc.). The drain was replaced with the single-consumer design described below.

Architecture

A new module src/session-reader.ts holds three unit-testable building blocks:

  • TurnQueue — single-producer / single-consumer async queue. The reader is the only producer; prompt() is the only consumer per turn. close() / error() propagate to the consumer; clear() drops leftovers for an abandoned turn.
  • classifyOffTurn — pure classifier: a message is either lifecycle (emit immediately) or a followup-candidate (feed to the collector).
  • OffTurnFollowupCollector — small state machine for messages seen while no turn is active. It buffers followup candidates until the closing result/idle, then forwards them (when result.origin.kind === "task-notification") or discards them as aftermath. Bounded by a 256-entry cap.

In src/acp-agent.ts:

  • #runSessionReader — the only caller of session.query.next(). Per message: optionally schedule a raw-SDK emit; if lifecycle, emit it; else if a turn is running, push to turnQueue; else feed offTurn. Exits cleanly on abort / iterator close / iterator throw, closing the queue so any consumer is unblocked.
  • #enqueueReaderSideEffect — serializes the reader's client-facing emits (raw SDK messages, lifecycle, followup) onto a per-session promise chain the reader never awaits, so a slow ACP client can't stall SDK consumption or block the next prompt.
  • #emitTaskLifecycleUpdate — renders task_started (using the SDK's description) and task_notification (using the SDK's real status) as an agent_message_chunk, routed to the reader's bound sessionId.
  • #emitFollowup / computeFollowupUsageUpdate — forward an autonomous followup's content using the same notification helpers prompt() uses, plus a usage_update derived from the followup's own snapshots, without ever touching accumulatedUsage, stopReason, or the user's pending prompts.
  • prompt() consumes session.turnQueue.take() instead of session.query.next(). Raw-SDK emit moved entirely to the reader (no double-emit). Error/cancel paths clear the queue.

Behaviour

  • task_started / task_notification produce a short bracketed agent_message_chunk[task <id>] started: <description> or [task <id>] <status>: <summary> plus output: <path> when present — whether the event arrives during a turn or between turns.
  • skip_transcript: true lifecycle messages produce no client-visible update.
  • Autonomous task-notification followups (an SDK-driven mini-turn that runs between user turns) are forwarded to the client out-of-turn: their assistant/tool content via the normal notification helpers, plus a usage_update carrying the followup's cost and _claude/origin meta. They never contaminate the next user turn's usage or stop reason.
  • Non-followup off-turn aftermath (e.g. the tail of a cancelled or errored turn) is discarded with a debug log, never replayed into the next prompt.
  • A followup's assembled text/thinking blocks are deduped against what actually streamed live (per message id), and un-streamed blocks are forwarded as a fallback, so a followup behind a non-streaming gateway still delivers its final answer — mirroring the in-turn handling from fix: Forward unstreamed assistant text blocks #757.

Lifecycle & robustness

  • Single consumer. Only the reader calls session.query.next(), eliminating the concurrent-next()-on-an-AsyncGenerator hazard.
  • Teardown. teardownSession and the process-died recovery abort(), then wait for the reader and drain its detached readerSideEffects chain (both bounded by a 2s timeout, via #drainReader) before deleting the session — so queued emits finish before the session disappears and a wedged SDK iterator (issue session/cancel doesn't abort in-flight TaskOutput block, pending session/prompt never resolves #680) can't hang teardown on an unbounded await readerDone.
  • Reader-death recovery. The reader is the sole consumer and is spawned once, so a TurnQueue left in a terminal state (iterator threw a non-process-death error, or returned done mid-session) means the reader is gone and can never feed the session again. prompt() detects this (turnQueue.isTerminal()) and tears the session down so the client starts fresh, rather than every future prompt failing on the latched error.
  • Cancellation. prompt() clears the turnQueue on both error and cancel, so messages the reader buffered for an abandoned turn don't leak into the next one. A force-cancel that races an iterator throw (the rejection winning the take() vs. wake-up race) still returns stopReason: "cancelled" rather than surfacing the trailing error.
  • Interrupt & send cancellation. cancel() now keeps a separate interrupted flag while a turn is winding down, so a queued follow-up prompt can reset cancelled without turning the interrupted turn's late is_error result into an Internal error or cancelling the queued prompt.
  • Raw-SDK ordering contract. Raw _claude/sdkMessage emits are FIFO among themselves but are explicitly not ordered relative to the derived session/update stream (the reader doesn't block on the client). Documented at the emit site.

Tests

npm run test:run: 419 passed | 15 skipped.

  • src/tests/session-reader.test.ts — unit tests for TurnQueue (push/take/close/error/clear, FIFO, concurrent-take guard), classifyOffTurn (every subtype), OffTurnFollowupCollector (all transitions, cap, emit-error swallowing).
  • src/tests/acp-agent.test.tsdescribe("session reader (issue #336)") — lifecycle (description not summary; real status; skip_transcript), raw-SDK emit (off-turn, no in-turn double-emit, FIFO order), followup forwarding (out-of-turn, no usage contamination, no authRequired from a stale Please run /login, lifecycle-mid-followup, real #emitFollowup end-to-end render), aftermath (orphan+idle, non-followup result, cancellation aftermath discarded), process-died propagation, teardown awaits reader + drains side-effects, single-consumer invariant.
  • computeFollowupUsageUpdate unit cases: buffered snapshot, fallback to result.usage, subagent-only fallback, model with no matching modelUsage key → inferred context window.
  • Regression coverage added during review/rebase: a non-process-death iterator error tears the session down (no permanent brick); a force-cancel racing an iterator throw returns cancelled; an interrupt-and-send late is_error result is treated as cancellation after a queued prompt resets cancelled; a followup's assembled text is forwarded when nothing streamed and deduped when it did (the fix: Forward unstreamed assistant text blocks #757 alignment for the off-turn path).

Verification

npm run build         # tsc, clean
npm run check         # eslint + prettier, clean
npm run test:run      # 419 passed | 15 skipped

Out of scope (deliberately not in this PR)

Two product enhancements were considered and intentionally left out:

  1. Live streaming of autonomous followups. Today a followup's content is buffered and forwarded as a block when its closing result arrives. Streaming it token-by-token is blocked by the SDK contract, not by effort: only the result message carries origin; the assistant / stream_event content messages do not. To stream we'd have to emit content before knowing whether the group is a real followup or the aftermath of a cancelled/errored turn — and emitting aftermath optimistically reintroduces exactly the contamination this PR fixes, with no way to retract it over ACP. A safe implementation needs either (a) the SDK tagging followup content messages with origin (not just the result), or (b) a retractable preview channel in the ACP protocol. Neither exists today.

  2. A dedicated session-update variant / _meta marker for background-task output. Followup and lifecycle updates currently piggyback on agent_message_chunk (the followup's usage_update already carries _claude/origin). A richer scheme — tagging every followup chunk with _meta or a dedicated task_* session-update so clients can render background output distinctly — is additive and backward-compatible, but only produces visible effect once an ACP client consumes it. Left for a follow-up coordinated with client-side rendering rather than shipped speculatively here.

@julianmesa-gitkraken julianmesa-gitkraken marked this pull request as draft May 26, 2026 17:04
@julianmesa-gitkraken julianmesa-gitkraken force-pushed the fix/336-background-drain-task-notifications branch 5 times, most recently from 6eab829 to 43b7e08 Compare May 27, 2026 09:34
@julianmesa-gitkraken julianmesa-gitkraken marked this pull request as ready for review May 27, 2026 09:35
@julianmesa-gitkraken julianmesa-gitkraken force-pushed the fix/336-background-drain-task-notifications branch 3 times, most recently from 44df1c5 to bb8dae9 Compare May 30, 2026 16:58
@julianmesa-gitkraken julianmesa-gitkraken force-pushed the fix/336-background-drain-task-notifications branch 2 times, most recently from f6cb5e9 to 4faab61 Compare June 8, 2026 21:23
@julianmesa-gitkraken

Copy link
Copy Markdown
Author

Rebased onto main (picked up the #680 force-cancel backstop) and folded a review pass into the #336 commit. Two of the findings were real correctness regressions that the single-consumer model introduced, so flagging them here for visibility:

1. A non-process-death iterator error bricked the session permanently. The reader is the sole consumer of session.query.next() and is spawned exactly once. If next() threw an error whose message didn't match the process-died substrings (e.g. "Unexpected event order, got ... before message_start", "stream has ended, this shouldn't happen"), the reader called turnQueue.error() and exited, but prompt() re-threw without tearing the session down. The TurnQueue latches hasError with no reset, so every subsequent prompt() rejected instantly with the same stale error and the dead reader never fed the queue again. On main this self-healed because each prompt re-entered query.next() directly. Fix: added TurnQueue.isTerminal(); the catch now treats a terminal queue (reader dead) the same as process-death and tears the session down so the client starts a fresh one.

2. A force-cancel racing an iterator throw was reported as an internal error. When the take() rejection won the Promise.race against the cancelled wake-up, the await threw before the line-level cancelController.signal.aborted check, landing in the catch and surfacing a cancel as RequestError.internalError instead of stopReason: "cancelled". Fix: re-check cancelController.signal.aborted || session.cancelled at the top of the catch and return cancelled.

Also folded in: route the reader's lifecycle check through classifyOffTurn (it was exported and unit-tested but the reader inlined the predicate, so the policy lived in two places), use a local toolUseCache in #emitFollowup instead of session.toolUseCache (the detached followup can overlap a live turn mutating the same cache), and extracted the duplicated teardown drain into #drainReader.

Added two regression tests for the bugs above; both verified to fail without their fix. npm run build + test:run + check all clean.

One scope note on "Fixes #336": this resolves the between-turns desync, which is the core complaint, but it does not change the long-running-bg-task behavior @dmeehan1968 hit. We close the turn on session_state_changed: idle, and the SDK only reports idle once background tasks settle, so a never-ending dev server still keeps the turn active. That's an SDK limitation (no origin on content messages, idle gated on task completion), not something this PR can fix without the upstream RFD.

@julianmesa-gitkraken

julianmesa-gitkraken commented Jun 9, 2026

Copy link
Copy Markdown
Author

Possible improvements after this merges

Grouping by what each one depends on, since that decides which are follow-ups we own and which are blocked on the SDK or the ACP protocol.

Edited after the rebase onto #757 ("Forward unstreamed assistant text blocks"): point 1 below changed. #757 reworked the in-turn text/thinking filter, and the rebase aligned #emitFollowup with it (otherwise a followup behind a non-streaming gateway would have lost its text). So half of point 1's original divergence is now fixed and the rest is restated below.

Wrapper-only (follow-ups we can do without waiting on anyone)

1. Unify the render/usage path between prompt() and #emitFollowup. The message-render logic and per-delta usage accumulation are duplicated across prompt(), #emitFollowup, and computeFollowupUsageUpdate, and the rebase onto #757 widened that duplication: the streamedTextIds / streamedThinkingIds tracking now lives in both prompt() and #emitFollowup. Two divergences worth noting: the text/thinking filter is now aligned across both (fixed in the rebase), but the in-turn handler in prompt() still skips user-echo messages and strips <local-command-stdout> / <local-command-stderr> markers while #emitFollowup does not, so an autonomous followup carrying one of those can forward noise a normal turn would have filtered. The root fix is extracting shared renderSdkMessageToClient(session, message) and trackUsageFromMessage(state, message) helpers used by both paths, which makes the marker filtering and the streamed-id dedup identical by construction instead of two copies that can drift. Kept out of this PR because it touches the preexisting, sensitive prompt() loop and mixes paying down duplication with fixing #336; that's a separate intent and deserves its own review focus.

2. Recover the reader instead of tearing the session down. The brick fix in this PR tears the session down when the reader dies on an iterator error. That's correct and safe, but for a transient stream-protocol hiccup (e.g. "Unexpected event order...") it's more aggressive than needed: the client has to start a fresh session. A follow-up could retry next() once or respawn the reader before declaring the session dead, reserving teardown for genuinely fatal errors. This is an enhancement on already-correct code with its own edge cases (which errors are transient vs fatal, retry count, avoiding a respawn loop), so it belongs in its own PR rather than expanding this one.

3. Cap the off-turn buffer per group, not per message. OFF_TURN_BUFFER_CAP currently shift()s the oldest message at 256, which can split the coherence of a single followup group. Dropping the whole group (or closing the followup) would be cleaner. Low priority; only matters against a misbehaving SDK that emits an unbounded group without a closing result.

Blocked by the SDK

4. Live streaming of followups. This PR forwards a followup as a block when its closing result arrives. Only the result message carries origin; the assistant / stream_event content messages do not, so we can't know whether a group is a real followup or the aftermath of a cancelled/errored turn until it closes. Emitting content optimistically would reintroduce exactly the contamination this PR fixes, with no way to retract it over ACP. This needs the SDK to tag followup content messages with origin, not just the result.

5. Unblock the turn for long-running background tasks (the symptom @dmeehan1968 hit). We close the user turn on session_state_changed: idle, and the SDK only reports idle once background tasks settle, so a never-ending dev server keeps the turn active. To close the user's turn independently we'd need the SDK to distinguish the user prompt's result from a background task's result (what @SteffenDE described upthread). The SDK doesn't expose that today, so this PR can't address it.

Blocked by the ACP protocol

6. A dedicated session-update variant for task output. Lifecycle and followups currently piggyback on plain-text agent_message_chunk ([task <id>] started: ...). The RFD @CSRessel linked (agentclientprotocol/agent-client-protocol#954) moves toward handling async / agent-initiated turns. Once that lands we could migrate to a dedicated task_* session-update so clients render background output distinctly, plus a retractable preview channel that, combined with #4, would finally allow streaming with the ability to withdraw content that turns out to be aftermath. Additive and backward-compatible, but only produces visible effect once a client consumes it, so it's coordinated with client-side rendering rather than shipped here.

Summary

Of these, only 1-3 are ours to schedule; 1 is the highest value since it also fixes the remaining marker divergence and collapses the duplication the #757 rebase widened. 4-6 aren't pending work on our side, they're external dependencies: the actionable part is following the RFD and, once the SDK adds origin to content messages, unblocking streaming.

@julianmesa-gitkraken

julianmesa-gitkraken commented Jun 9, 2026

Copy link
Copy Markdown
Author

@josevalim @SteffenDE @benbrandt can you take a look?

…nFollowupCollector

Building blocks for the single-consumer session reader (issue agentclientprotocol#336), in a
standalone module so each unit is testable in isolation:

- TurnQueue: single-producer / single-consumer async queue. The reader is
  the sole producer; prompt() is the sole consumer per turn. close() and
  error() propagate to the consumer; clear() drops leftovers for an
  abandoned turn; take() guards against concurrent consumers.
- classifyOffTurn: pure classification of an off-turn message into a
  task-lifecycle event (emit immediately) or a followup candidate.
- OffTurnFollowupCollector: state machine that buffers off-turn followup
  candidates until the closing result/idle, then forwards them (for
  origin.kind === "task-notification") or discards them as aftermath.
  Bounded by a 256-entry cap.
@julianmesa-gitkraken julianmesa-gitkraken force-pushed the fix/336-background-drain-task-notifications branch from 4faab61 to c2478a8 Compare June 9, 2026 09:05
@julianmesa-gitkraken julianmesa-gitkraken changed the title fix: forward task_started / task_notification between turns via a background drain fix: forward task_started / task_notification between turns via a single-consumer session reader Jun 9, 2026
…der (agentclientprotocol#336)

Background tasks launched with run_in_background=true emit task_started /
task_notification system messages between turns. Previously they sat in the
SDK's buffer until the next user prompt, where prompt()'s switch discarded
them as a no-op — the agent appeared unresponsive to background-task
completion and then acted on a stale prompt.

A per-session reader (#runSessionReader) is now the sole consumer of
session.query.next(). It forwards lifecycle events to the client
immediately (in-turn or between turns), hands in-turn messages to the active
prompt() via a TurnQueue, and forwards autonomous task-notification
followups out-of-turn. prompt() consumes the queue instead of touching the
iterator, so there is only ever one consumer of the AsyncGenerator.

Key pieces:
- #runSessionReader: single linear loop; lifecycle/raw/followup emits are
  scheduled on a detached, never-awaited readerSideEffects chain so a slow
  ACP client can't stall SDK consumption or block the next prompt.
- #emitTaskLifecycleUpdate: renders task_started (SDK description) and
  task_notification (real SDK status, no fallback) as an agent_message_chunk
  routed to the reader's bound sessionId.
- #emitFollowup + computeFollowupUsageUpdate: forward a followup's content
  via the same notification helpers prompt() uses, plus a usage_update
  derived from the followup's own snapshots, without touching
  accumulatedUsage / stopReason / pending prompts.
- prompt() consumes turnQueue.take(); raw-SDK emit moved entirely to the
  reader (no double-emit); the error/cancel finally clears the queue so an
  abandoned turn's buffered messages don't leak into the next one.
- teardownSession and process-died recovery abort, await readerDone, then
  drain readerSideEffects (bounded by a 2s timeout) before deleting the
  session, so queued emits finish first and a wedged client can't hang
  teardown.

Tests: session-reader describe block covers lifecycle (description vs
summary, real status, skip_transcript), raw-SDK emit (off-turn, no in-turn
double-emit, FIFO order), followup forwarding (out-of-turn, no usage
contamination, no authRequired from a stale login result, lifecycle
mid-followup, real #emitFollowup end-to-end), aftermath discard,
process-died propagation, teardown drain, and the single-consumer
invariant. computeFollowupUsageUpdate has unit coverage for buffered
snapshot, result.usage fallback, subagent-only fallback, and inferred
context window. acp-agent-settings mocks gain next/close stubs so the
reader parks cleanly.
@julianmesa-gitkraken julianmesa-gitkraken force-pushed the fix/336-background-drain-task-notifications branch from c2478a8 to 1ec26b7 Compare June 9, 2026 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Background task notifications (task_notification / task_started) are silently dropped causing desynced conversations

1 participant