feat(interactive-agents): Phase 1a/1b/2 (Phase 3 design only)#52

Open
scoropeza wants to merge 49 commits into aws-samples:main from scoropeza:feature/interactive-background-agents

Conversation

@scoropeza
Contributor

Summary

Introduces interactive background agents — users can now watch an agent work in real time and steer it mid-run. This PR ships three phases plus the locked design for Phase 3 (Cedar-based human-in-the-loop approval gates).

All three shipped phases are deployed to a dev stack and E2E-validated. Phase 3 is design-only in this PR; implementation will follow in a separate PR after merge.

47 commits, 97 files changed, +28,012 / −275.

What ships (deployed + tested)

Phase 1a — DDB task events + REST polling

  • ProgressWriter writes agent events to TaskEventsTable from the microVM (agent/src/progress_writer.py)
  • New REST endpoint GET /v1/tasks/{id}/events?after=<event_id> for catch-up polling
  • New bgagent watch <task-id> CLI command (polling mode)
  • DynamoDB Local harness for agent-side testing (agent/mise.toml tasks local:up / local:down / local:events)

Phase 1b — two-runtime split + SSE streaming (rev-5 "Branch A")

  • Two AgentCore runtimes behind one stack: IAM-auth (orchestrator) and JWT-auth (interactive CLI/SPA). Lifecycle isolation between background and interactive traffic.
  • SSEAdapter module pairs with ProgressWriter to stream events in real time
  • bgagent watch now SSE-first with automatic polling fallback (@ag-ui/core + fetch)
  • bgagent run — submit + live-stream in one command
  • Execution-location model: agent attaches to existing task rather than spawning a new one (multi-subscriber safe); task state machine gains HealthyBusy distinction from Busy
  • Fan-out plane Lambda closes §8.9 of the Phase 1b design
  • Stranded-task reconciler (P0-c); MAX_CONCURRENT_TASKS_PER_USER 3 → 10

Phase 2 — interactive mid-run nudge

  • New TaskNudgesTable + REST endpoint POST /v1/tasks/{id}/nudges
  • bgagent nudge <task-id> "<message>" — steering via REST; delivered to the agent via the between_turns_hooks mechanism on the Claude Agent SDK's Stop hook
  • Works on both runtime paths (IAM + JWT)
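
For illustration only, a minimal Python sketch of the nudge-delivery idea described above (checking a nudges table between turns). This is not the shipped hooks code; the table schema, key names, and the 'consumed' flag are assumptions.

    # Illustrative sketch only -- not the shipped nudge delivery code. Assumes a
    # TaskNudgesTable keyed by task_id with a ULID nudge_id sort key and a
    # 'consumed' flag; the real schema and field names may differ.
    import boto3
    from boto3.dynamodb.conditions import Key

    def drain_pending_nudges(table_name: str, task_id: str) -> list[str]:
        """Return unconsumed nudge messages for a task, oldest first."""
        table = boto3.resource("dynamodb").Table(table_name)
        messages: list[str] = []
        for item in table.query(KeyConditionExpression=Key("task_id").eq(task_id))["Items"]:
            if item.get("consumed"):
                continue
            messages.append(item["message"])
            # Mark consumed so the next between-turns check skips it.
            table.update_item(
                Key={"task_id": task_id, "nudge_id": item["nudge_id"]},
                UpdateExpression="SET consumed = :t",
                ExpressionAttributeValues={":t": True},
            )
        return messages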

Late additions during PR prep (also deployed + E2E-validated)

  • fix(cancel) — cancel now actually stops the running agent and prevents a PR from being pushed on a cancelled task. Two root causes: (1) Lambda bundling excluded @aws-sdk/client-bedrock-agentcore, so StopRuntimeSessionCommand was undefined at runtime and every cancel silently no-opped; (2) even with StopRuntimeSession working, the agent pipeline's post-hook chain ran unconditionally after run_agent returned. Fixed by narrowing externalModules to bundle bedrock-agentcore explicitly, raising CancelTaskFn timeout 3s → 15s, adding a between-turns cancel hook in the agent, and short-circuiting post-hooks in the pipeline when the task is CANCELLED.
  • fix(agent) — Python agent-side writers (write_running, write_terminal) did not update the UserStatusIndex GSI sort key status_created_at on transitions. This left the GSI internally inconsistent. Now maintained on every transition. A follow-up issue (aws-samples #51: "bga list: unfiltered output groups by status instead of sorting by created_at") tracks a separate CLI-layer UX problem this uncovered.

What's design-only (not implemented in this PR)

Phase 3 — Cedar-driven HITL approval gates

  • docs/design/PHASE3_CEDAR_HITL.md (1,846 lines) — 22 locked decisions, 3 independent reviews folded in, PolicyDecision / Outcome engine rewrite, hard-deny vs soft-deny split, ApprovalAllowlist, RecentDecisionCache
  • docs/diagrams/phase3-cedar-hitl.drawio — 12-page companion diagram
  • Ships as standard functionality when implemented (no per-repo flag); --pre-approve all_session --yes is the opt-out; hard-deny is absolute.
  • @cedar-policy/cedar-wasm@4.10.0 spiked and locked for the Lambda side

Architecture diagram

docs/diagrams/interactive-agents-phases.drawio (v5) — current state + Phase 1c / 2.5 / 3 proposals on one canvas.

Test plan

  • Agent: uv run pytest -q (468 passed)
  • CLI: yarn jest (215 passed)
  • CDK: yarn jest --runInBand (879 passed)
  • CLI tsc + CDK tsc — clean
  • Deployed to a dev stack and E2E-validated on all three phases (new_task submit → watch streaming → nudge mid-run → cancel mid-run)

Known limitations / follow-ups

  • Phase 2.5 mid-turn interrupt (ClaudeSDKClient.interrupt exists today, but side effects are not rolled back; deferred pending research)
  • Phase 3 Cedar-HITL implementation (separate PR, ~3-4 weeks for 3a core)
  • Terminal UX research for streaming CLI
  • Suppress Claude Agent SDK "malware" system-reminder noise
  • bga list sort order / status-filter UX — see #51 (bga list: unfiltered output groups by status instead of sorting by created_at)

Reviewers — suggested reading order

  1. docs/design/INTERACTIVE_AGENTS.md — canonical design covering all three phases (1,723 lines)
  2. docs/diagrams/interactive-agents-phases.drawio — visual overview
  3. CDK changes: cdk/src/stacks/agent.ts, cdk/src/constructs/task-api.ts, cdk/src/constructs/task-orchestrator.ts
  4. Agent runtime: agent/src/progress_writer.py, agent/src/sse_adapter.py, agent/src/pipeline.py, agent/src/runner.py, agent/src/hooks.py
  5. CLI: cli/src/commands/watch.ts, cli/src/commands/run.ts, cli/src/commands/nudge.ts
  6. Phase 3 design (optional, will be a separate PR): docs/design/PHASE3_CEDAR_HITL.md

GitHub's Squash and merge is the intended merge strategy — 47 commits on the branch will collapse to one on main.

scoropeza and others added 30 commits April 28, 2026 13:53
- INTERACTIVE_AGENTS.md: comprehensive design for bidirectional
  agent-user communication (progress streaming, nudges, HITL
  approval gates, pause/resume)
- Architecture diagrams (4 pages): current state, SSE streaming,
  nudge flow, state machine extensions
- Research prompt that drove the design work
Detailed prompt for implementing DDB progress events + CLI watch
command. Covers: ProgressWriter, entrypoint integration, CLI watch
command, tests, and constraints. Ready for a fresh agent session.
…e 1a prompt

- Use subagents to parallelize agent-side and CLI-side work
- Keep main context clean by delegating reads/research to subagents
- CRITICAL: never deviate from approved design without explicit approval
- On blockers: research first, then surface for discussion
Introduces _ProgressWriter that emits structured AG-UI-style progress
events to the existing TaskEventsTable during agent execution. Events
cover turns, tool calls, tool results, milestones, cost updates, and
errors.

- progress_writer.py: lazy boto3 init, fail-open writes, circuit breaker
  after 3 consecutive DDB failures, ULID-sortable event IDs, 90-day TTL
- entrypoint.py: integrated into run_agent() message loop and run_task()
  pipeline; handles UserMessage branch to capture ToolResultBlocks
- Dockerfile: add progress_writer.py to the agent image COPY layer
- 25 unit tests cover all event types, truncation, fail-open, and
  circuit breaker behavior
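A minimal sketch of the fail-open write plus circuit-breaker pattern described above, under the stated assumptions (3-failure threshold, ULID event ids, 90-day TTL). Names are illustrative; this is not the actual progress_writer.py.

    import time
    import boto3

    class FailOpenEventWriter:
        MAX_CONSECUTIVE_FAILURES = 3  # circuit opens after 3 consecutive DDB failures

        def __init__(self, table_name: str, task_id: str):
            self._table = boto3.resource("dynamodb").Table(table_name)
            self._task_id = task_id
            self._failures = 0

        def write(self, event: dict) -> None:
            """Best-effort write: never raises into the agent pipeline."""
            if self._failures >= self.MAX_CONSECUTIVE_FAILURES:
                return  # circuit open: skip DynamoDB entirely
            try:
                self._table.put_item(Item={
                    "task_id": self._task_id,
                    "event_id": event["event_id"],             # ULID: lexicographically sortable
                    "ttl": int(time.time()) + 90 * 24 * 3600,  # 90-day TTL
                    **event,
                })
                self._failures = 0
            except Exception:
                self._failures += 1  # fail open: swallow and count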
Adds a new bgagent watch <task_id> command that polls GET /tasks/{id}/events
every 2 seconds, renders progress events in a human-readable format, and
exits cleanly when the task reaches a terminal state.

- watch.ts: polling loop, event rendering for all 6 progress event types,
  JSON output mode (--output json), Ctrl+C handling, last-seen event
  tracking to avoid duplicates
- bgagent.ts: register the new command
- 14 unit tests cover event rendering, polling, terminal state detection,
  deduplication, and JSON mode

Exit codes: 0 on COMPLETED, 1 on FAILED/CANCELLED/TIMED_OUT.
Enables end-to-end local testing of progress events without deploying
to AWS. The agent container connects to a local DynamoDB instance on
a shared Docker network; boto3 endpoint redirection via
AWS_ENDPOINT_URL_DYNAMODB means zero code changes — the same
_ProgressWriter code path runs locally and in production.

- docker-compose.yml: dynamodb-local service on agent-local network
- scripts/create-local-tables.sh: idempotent creation of TaskEventsTable
  and TaskTable schemas matching the CDK constructs
- mise.toml: local:up (start + create tables), local:down, local:events,
  local:events:json
- run.sh: new --local-events flag that joins the agent container to the
  agent-local network and sets the right env vars

Workflow:
  cd agent && mise run local:up
  ./run.sh --local-events "owner/repo" 42
  mise run local:events    # in another terminal
  mise run local:down
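A small sketch of the endpoint redirection the harness relies on: recent boto3/botocore versions honour the service-specific AWS_ENDPOINT_URL_DYNAMODB variable, so the same code path talks to DynamoDB Local when it is set and to real DynamoDB when it is not. Table name and endpoint are illustrative.

    import os
    import boto3

    os.environ.setdefault("AWS_ENDPOINT_URL_DYNAMODB", "http://dynamodb-local:8000")

    # Region is arbitrary when pointing at DynamoDB Local.
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("TaskEventsTable")
    print(table.meta.client.meta.endpoint_url)  # -> http://dynamodb-local:8000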
- docs/design/INTERACTIVE_AGENTS.md rev 3: adds Section 9.9 local testing
  subsection with the DDB Local workflow; updates "last updated" marker
- docs/guides/DEVELOPER_GUIDE.md: new "Testing with progress events
  (DynamoDB Local)" section under Local testing
- docs/guides/USER_GUIDE.md: new "Watching a task in real time"
  subsection documenting bgagent watch (behavior only — transport
  details intentionally omitted, will change in Phase 1b)
- AGENTS.md: routing table row for progress_writer.py so future coding
  sessions know where progress events are emitted from
- docs/src/content/docs/**: Starlight sync output from the above
Captures two issues discovered during Phase 1a E2E validation that are
scoped outside the interactive-agents feature. Each file contains
problem statement, reproduction, proposed design, test cases, and
acceptance criteria so a future session can pick it up independently.

- CLI_COGNITO_NEW_PASSWORD_CHALLENGE.md: bgagent login does not handle
  Cognito's NEW_PASSWORD_REQUIRED challenge, blocking first-time users
  without admin intervention
- EARLY_PROGRESS_MILESTONES.md: UX enhancement to emit finer-grained
  milestones during setup_repo (clone / install / baseline build) so
  watch output appears within seconds instead of minutes
Team discussion (Sam ↔ Alain, 2026-04-17) agreed to replace the
hardcoded 3-tier HITL model with Cedar policy-driven approvals,
reusing the in-process Cedar engine already in the repo (on branch
fix/validate-aws-before-docker-build as agent/src/policy.py).

The existing ALLOW/DENY decisions extend to include REQUIRE_APPROVAL —
same policy language, unified governance, supports workflows like
AI-DLC where users gate per phase and relax over time.

Phase 1a (done) and Phase 1b (next) are unaffected. Phase 3 (HITL)
blocked on design revision and the Cedar branch landing on main.
Re-integrates Phase 1a ProgressWriter event emission after rebasing onto
upstream/main, which decomposed agent/entrypoint.py into agent/src/
modules (pipeline.py, runner.py, etc.).

- pipeline.py: instantiate _ProgressWriter; emit repo_setup_complete,
  agent_execution_complete, pr_created milestones
- runner.py: instantiate _ProgressWriter; track tool_use_id -> tool_name
  for UserMessage ToolResultBlock correlation; emit agent_turn,
  agent_tool_call, agent_tool_result, agent_cost_update, agent_error

No behavior change vs. Phase 1a - same event shapes, same call sites
moved to their new module homes. All 305 agent tests pass.
Updates the interactive agents design doc and adds two draw.io files
documenting the Phase 1b architecture.

Design doc (rev 4, 2026-04-17):
- Revision history section added
- New: section 8.9 (control / streaming / fan-out planes + channel-fit
  matrix), 9.1.1 (two-runtime split, D1 resolved), 9.10 (thread + queue
  bridge, D2 resolved), 9.11 (CLI hybrid client, D3 resolved), 9.12
  (AgentCore streaming limits: 60-min cap, LifecycleConfiguration,
  no SSE-native resume, SDK timeout overrides, API GW REST
  incompatibility)
- Updated: section 4 Phase 1b parameters, 5 SSEAdapter sibling pattern,
  8.2 reconnection details, 9 heading bumped to rev 4, 10 Phase 1b
  implementation plan rewritten, Appendix C file change map completed
- Fixed: exec-summary inaccuracy about SSE 60-min cap (line 23)

Diagrams:
- docs/interactive-agents-phases.drawio — v1, 8 pages, documents the
  decision-space (options to choose from) for D1/D2/D3
- docs/interactive-agents-phases-v2.drawio — v2, 8 pages, documents the
  final resolved Phase 1b architecture with rejected alternatives
  captured on the rationale page for future reference

No code changes. No impact on deployed stack. Phase 1a continues to
pass all 305 agent / 721 CDK / 84 CLI tests.
Infrastructure foundation for Phase 1b SSE streaming. Introduces a
second AgentCore Runtime with Cognito JWT authorizer for direct
CLI-to-AgentCore SSE consumption, while preserving the existing
IAM-authed runtime for the orchestrator path. No agent or CLI code
changes in this step — those land in Steps 2-6.

Changes:
- cdk/src/stacks/agent.ts: rename Runtime -> RuntimeIam, add RuntimeJwt
  with RuntimeAuthorizerConfiguration.usingCognito(userPool,
  [appClient]). Same artifact, env vars, VPC, models, memory,
  filesystem mount, and log/trace wiring shared across both runtimes.
  Explicit LifecycleConfiguration (8h idle + 8h max) on both per
  §9.12 of design doc. Lazy.string breaks the TaskApi <-> Runtime-JWT
  <-> Orchestrator circular dependency. OrchestratorFn IAM policy
  scoped to RuntimeIam only. New CFN outputs RuntimeIamArn and
  RuntimeJwtArn; RuntimeArn kept as deprecated alias.
- cdk/src/constructs/task-api.ts: expose appClient as public attr for
  Runtime-JWT to reference.
- cdk/src/constructs/task-events-table.ts: enable DynamoDB Streams
  with NEW_IMAGE (in-place update, Phase 1a event data preserved).
  Prerequisite for the fan-out plane.
- cdk/test/**: 8 new tests covering two-runtime topology, JWT authorizer
  config, LifecycleConfiguration shape, scoped IAM, DDB Streams, new
  CFN outputs. Full suite 729 passed / 0 failed (--runInBand).

Refs docs/design/INTERACTIVE_AGENTS.md §9.1.1, §9.10, §9.12.
First deploy of Phase 1b Step 1 rolled back with:
  AWS::BedrockAgentCore::Runtime 'jean_cloude' already exists

Root cause: I had renamed the existing Runtime construct from 'Runtime'
to 'RuntimeIam'. CloudFormation interprets construct id changes as
replacement — CREATE new resource, then DELETE old. Because the old
Runtime has runtimeName 'jean_cloude' (immutable) and the new RuntimeIam
was declared with the same name, CFN tried to create a duplicate and
failed on AgentCore's account-level name uniqueness constraint.

Fix: revert the construct id to 'Runtime' so CFN updates the existing
resource in place (only the new LifecycleConfiguration is added). The
TS variable name `runtimeIam` is kept for readability + Phase 1b role
documentation. Test updated to match logical id shape via regex.

Verified via cdk diff: existing Runtime gets only [~] LifecycleConfig
(in-place), RuntimeJwt + log infra are net-new, TaskEventsTable gets
StreamSpecification in-place. No destroys of load-bearing resources.
…f ProgressWriter)

Introduces _SSEAdapter, the producer side of a per-task asyncio.Queue
that will feed the SSE handler in server.py (to be wired in Step 3).
Semantic API mirrors _ProgressWriter 1:1 so integration is symmetric:
pipeline.py / runner.py will call both adapters at the same sites.

Design contract (see docs/design/INTERACTIVE_AGENTS.md §9.10):
- Thread-safe: write_agent_* methods run in the pipeline background
  thread; bridge to the asyncio loop via loop.call_soon_threadsafe
  (chosen over run_coroutine_threadsafe — Queue.put_nowait is a plain
  sync call, no need for coroutine + Future wrapping).
- Backpressure: bounded asyncio.Queue, drop-oldest when full. Counter
  dropped_count only ever grows. Never blocks the pipeline thread.
- No-subscribers case: writes before attach_loop or after detach_loop
  are silent drops with counter increment. Pipeline must NEVER be
  affected by whether a client is connected.
- Fail-open: every enqueue path swallows exceptions and bumps the
  counter. Same philosophy as ProgressWriter. write_agent_error has a
  second-level try/except — it must NEVER raise.
- Close sentinel: object() sentinel distinguishable from any real
  event dict. get() returns None when drained for clean stream shutdown.

Wire-format translation (semantic dict → AG-UI TEXT_MESSAGE_* /
TOOL_CALL_* frames) is NOT done here — that's the SSE handler's job in
Step 3. This adapter traffics in semantic Python dicts.

Tests: 27 new in test_sse_adapter.py covering happy path, all 6
semantic methods, FIFO ordering, no-subscribers drop, post-detach
drop, re-attach, queue-full drop-oldest, close sentinel, post-close
behavior, thread-safe enqueue from non-loop thread, concurrent
producers, loop-closed scenarios, dropped_count monotonicity, payload
integrity, large-payload pass-through, bulletproof write_agent_error.

Test counts: 305 Phase 1a baseline + 27 new = 332 passed, 0 failed.
Lint (ruff check): clean. Format: clean. Type check (ty): clean.
Phase 1a ProgressWriter unchanged (deployed + working).

No integration in pipeline.py / runner.py / server.py yet — that lands
in Step 3.
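
A simplified sketch of the thread-to-event-loop bridge and drop-oldest backpressure described above; not the shipped sse_adapter.py, and queue size is an assumption.

    import asyncio

    class QueueBridge:
        def __init__(self, loop: asyncio.AbstractEventLoop, maxsize: int = 256):
            self._loop = loop
            self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
            self.dropped_count = 0

        def publish(self, event: dict) -> None:
            """Called from the pipeline thread; never blocks, never raises."""
            try:
                # put_nowait is a plain sync call, so call_soon_threadsafe is
                # enough -- no coroutine/Future wrapping needed.
                self._loop.call_soon_threadsafe(self._enqueue, event)
            except Exception:
                self.dropped_count += 1  # loop closed / not attached: silent drop

        def _enqueue(self, event: dict) -> None:
            if self._queue.full():
                try:
                    self._queue.get_nowait()  # drop-oldest under backpressure
                    self.dropped_count += 1
                except asyncio.QueueEmpty:
                    pass
            self._queue.put_nowait(event)

        async def get(self) -> dict:
            return await self._queue.get()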
…ions

Phase 1b Step 3 implementation resolves the pending tactical decision
in §9.10 and Appendix C: the SSE handler lives on the existing
/invocations endpoint via content-type negotiation rather than a new
/invocations/stream endpoint. Rationale: one endpoint contract, matches
AgentCore's documented AG-UI pattern, zero risk of the orchestrator's
sync path being misrouted.

- §9.10: replace "deferred to PR" sentence with the resolved choice —
  Accept: text/event-stream routes to SSE, anything else preserves the
  existing sync behavior byte-for-byte.
- Appendix C: update the server.py row to describe content-type
  negotiation explicitly. Add new row for agent/src/sse_wire.py —
  the pure-function translator from semantic events to AG-UI frames,
  kept separate from transport for testability.
- v2 drawio: surgical text updates to 7 labels that referenced
  /invocations/stream; no edge or layout changes. Validator passes.

v1 drawio (interactive-agents-phases.drawio, kept as decision history)
is untouched.
…erver.py

Makes Phase 1b SSE functional end-to-end inside the agent container.
Client-side CLI work lands in Steps 5/6.

Endpoint: content-type negotiation on existing /invocations (locked in
docs fa622b4). Accept: text/event-stream → new SSE handler. Any other
Accept (missing, application/json, */*) → existing sync acceptance
response, byte-for-byte preserved. Orchestrator's InvokeAgentRuntime
calls are unaffected because they do not send text/event-stream.

server.py:
- content-type negotiation via _wants_sse() — case-insensitive substring
  match so qualified types and lists work (e.g. 'text/event-stream;q=1').
- SSE path: create _SSEAdapter, attach_loop, spawn background thread
  running run_task(..., sse_adapter=adapter), return StreamingResponse.
- Async generator drains adapter.get() with 15s timeout; emits
  ': ping\n\n' keepalive on timeout, AG-UI 'data: <json>\n\n' frames on
  events. RUN_STARTED synthesised up front; RUN_FINISHED on clean close,
  RUN_ERROR if any agent_error observed during the stream.
- Client disconnect detection: GeneratorExit → detach_loop; background
  run_task keeps running so DDB durability is unaffected, events then
  drop silently until the adapter is closed.
- Idempotent close() in both server.py finally and pipeline.py finally
  (belt-and-braces — close is idempotent by design).
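
A simplified sketch (not the shipped server.py) of the Accept-header check and the keepalive drain loop described above; the real handler also manages spawn/attach, RUN_* framing, and disconnects.

    import asyncio
    import json

    def wants_sse(accept_header: str | None) -> bool:
        # Case-insensitive substring match so 'text/event-stream;q=1' and
        # comma-separated Accept lists are recognised.
        return bool(accept_header) and "text/event-stream" in accept_header.lower()

    async def sse_frames(queue: asyncio.Queue, keepalive_seconds: float = 15.0):
        """Yield 'data:' frames; emit a comment ping when the queue is idle."""
        while True:
            try:
                event = await asyncio.wait_for(queue.get(), timeout=keepalive_seconds)
            except asyncio.TimeoutError:
                yield ": ping\n\n"        # SSE comment keepalive
                continue
            if event is None:             # close sentinel: stream is done
                return
            yield f"data: {json.dumps(event)}\n\n"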

sse_wire.py (new): pure-function translator from semantic SSEAdapter
dicts to AG-UI wire-format events. Kept separate from transport for
testability. Mappings per design doc §9.10 and the AG-UI facts section
of the resolved-decisions memory:
- agent_turn → TEXT_MESSAGE_START / CONTENT / END (+ optional CUSTOM
  for non-empty thinking since AG-UI has no native thinking event)
- agent_tool_call → TOOL_CALL_START / ARGS / END
- agent_tool_result → TOOL_CALL_RESULT (role=tool, is_error propagated)
- agent_milestone → STEP_STARTED + STEP_FINISHED pair
- agent_cost_update → CUSTOM (no native cost event)
- agent_error → RUN_ERROR (terminal) or CUSTOM (non-terminal); Phase 1b
  treats all runner errors as terminal
Uses camelCase on the wire, SCREAMING_SNAKE_CASE event types, ULIDs
for message/tool_call ids, ms-since-epoch timestamps.

pipeline.py: add sse_adapter: _SSEAdapter | None = None kwarg; mirror
the 3 progress.write_agent_milestone calls on the adapter. finally
block calls adapter.close() so the consumer stream ends cleanly.

runner.py: add sse_adapter kwarg; mirror the 5 progress.write_X calls
(turn, tool_call, tool_result, cost_update, error). Zero change to
ProgressWriter behavior — DDB remains durable source of truth.

Tests (38 new, 371 total, 305 Phase 1a baseline preserved):
- test_sse_wire.py (29 tests): per-event-type roundtrip, camelCase,
  ULID shape, timestamp presence, CUSTOM fallback for cost, is_error
  on tool results, empty-field edge cases, terminality rule for errors.
- test_server.py (+6): content-type negotiation routing, SSE happy path,
  SSE terminates with RUN_ERROR on agent_error, 15s keepalive ping,
  client-disconnect detach_loop + background continues, sync path
  regression ×2 (no Accept / Accept: application/json).
- test_pipeline.py (+3): sse_adapter=None preserves current behavior,
  adapter mirrored at milestone sites + close() in finally, runner
  signature accepts sse_adapter kwarg.

Test hardening: test_background_thread_failure_503 now waits for the
mock to be called (race existed in baseline too — /ping flips 503
before task_state.write_terminal runs because of intervening print +
traceback.print_exc in the except block).

No CDK change (Step 1 shipped). No CLI change (Steps 5/6). Not
deployed yet — deployment gated on Step 4 (get-task-events ?after= for
catch-up). ProgressWriter unchanged.
…tch-up

Adds the backend + CLI client pieces for SSE reconnect catch-up. When
the CLI's SSE stream drops (network blip, AgentCore 60-min cap), it
queries REST /tasks/{id}/events?after=<last_seen_event_id> to replay
missed events from DynamoDB, then reopens the SSE stream. This is the
ONLY reconnection mechanism because AG-UI has no Last-Event-ID and
AgentCore has no SSE-native resume (design §9.12).

Handler (cdk/src/handlers/get-task-events.ts):
- Accept ?after=<ulid> alongside existing ?next_token. Back-compat: if
  neither present, existing from-beginning behavior preserved.
- Query mode routing: after → KeyConditionExpression 'task_id AND
  event_id > :after'. ULIDs are Crockford Base32 lexicographically
  sortable, so string compare is correct.
- Collision policy: both ?after and ?next_token present → after wins
  + WARN log (indicates a client bug).
- ?after validation: Crockford Base32 regex, 26 chars. Invalid → 400
  VALIDATION_ERROR.
- Empty after= string falls through to from-beginning (matches how
  URLSearchParams omits empty values; avoids spurious 400s).
- limit-truncated responses still emit next_token when using after,
  so callers can paginate beyond the first page.

Shared validation (cdk/src/handlers/shared/validation.ts):
- New isValidUlid() + ULID_PATTERN. Exported for handler + future uses.

Shared types (cdk/src/handlers/shared/types.ts + cli/src/types.ts):
- New GetTaskEventsQuery type formalizing the query shape. Mirrored per
  AGENTS.md "cdk types must stay in sync with cli types" rule.

CLI (cli/src/api-client.ts):
- getTaskEvents now accepts { after?: string } in the options object
  alongside existing { next_token?, limit? }. URL-encoded properly.
- New catchUpEvents(taskId, afterEventId, pageSize=100) helper that
  paginates internally. First page uses ?after; subsequent pages use
  ?next_token (no re-sending after to avoid server WARN). Returns
  flattened event array — one-call API for Step 5/6 reconnect path.

Comprehensive structured logging (per user emphasis this step):
- INFO on handler entry: {request_id, task_id, limit, query_mode:
  'next_token' | 'after' | 'from_beginning'}.
- INFO on handler exit: {request_id, task_id, event_count, has_more}.
- WARN when both ?after and ?next_token provided, when ?after fails
  ULID validation, when DDB returns unexpectedly empty after a cursor.
- ERROR on DDB exceptions, malformed response, unexpected errors —
  all include error_type.
- DEBUG-style gated via LOG_LEVEL=DEBUG env; emits INFO with tag
  level_override='DEBUG' so CloudWatch filters distinguish them
  without requiring a logger-module change.

Tests: CDK 729 → 745 (+16: 9 handler + 7 validation). CLI 84 → 90
(+6). All pass, compile clean, lint clean. No agent code touched. No
deploy yet — chained with Steps 5/6/7.
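The handler itself is TypeScript; purely for illustration, an equivalent boto3 query sketch showing why a plain string comparison on the ULID sort key is chronologically correct. Table and attribute names follow the description above.

    import re
    import boto3
    from boto3.dynamodb.conditions import Key

    # Crockford Base32 alphabet (no I, L, O, U), 26 characters.
    ULID_PATTERN = re.compile(r"^[0-9ABCDEFGHJKMNPQRSTVWXYZ]{26}$")

    def events_after(table_name: str, task_id: str, after_event_id: str, limit: int = 100):
        if not ULID_PATTERN.match(after_event_id):
            raise ValueError("after must be a 26-character ULID")
        table = boto3.resource("dynamodb").Table(table_name)
        resp = table.query(
            KeyConditionExpression=Key("task_id").eq(task_id) & Key("event_id").gt(after_event_id),
            Limit=limit,
        )
        # LastEvaluatedKey doubles as the next_token for subsequent pages.
        return resp.get("Items", []), resp.get("LastEvaluatedKey")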
…+ fetch)

Transport layer for bgagent watch's upcoming SSE path (Step 6). D3
hybrid bet locked in: import @ag-ui/core for event TYPES + Zod schemas
only; own transport with native fetch + eventsource-parser; own the
reconnection/backoff/JWT-refresh/60-min-restart logic.

No @ag-ui/client anywhere (only grep hits are docstrings explicitly
calling out why we do not use it).

Deps:
- @ag-ui/core@^0.0.52 (types + Zod schemas, pulls zod transitively)
- eventsource-parser@^3.0.8 (SSE frame parser, zero deps, ~32.7M weekly downloads)

cli/src/sse-client.ts (746 lines):
- runSseClient(options): Promise<SseRunResult>. Happy path, reconnect,
  catch-up, terminal detection, external cancellation.
- AgentCore URL builder + required header (x-amzn-bedrock-agentcore-
  runtime-session-id = task_id, matches server.py extraction).
- Frame parsing: eventsource-parser → JSON.parse → Zod
  EventSchemas.safeParse; Zod failures still forward by type sniff
  so future AG-UI additions do not break terminal detection.
- Keepalive watchdog (default 30s grace) resets on any byte incl
  ': ping\n\n' comments; on starve, aborts and reconnects.
- Proactive 60-min restart at maxStreamSeconds (default 3500) pre-
  empts AgentCore's hard cap.
- Exponential backoff reconnect (1s initial, factor 2, max 30s).
- Dedup by id chain: id → messageId → toolCallId → runId →
  stepName+ts → name+ts → type+ts fallback. Set + ordered array;
  capped at 10000 with oldest-half eviction.
- 401 handling: one forced refresh via getAuthToken(), retry once;
  double-401 rejects with CliError(UNAUTHORIZED) + login hint.
- External AbortSignal support.
- Terminal codes fatal (no reconnect): AGENT_ERROR, AgentError,
  UNAUTHORIZED, ACCESS_DENIED, missing code; transient RUN_ERROR
  reconnects.

Catch-up cursor architecture:
- SSE events from sse_wire.py do NOT carry DDB event_ids.
- Client accepts initialCatchUpCursor + delegates cursor management
  to the caller. Cursor advances only on events that carry an
  explicit 'id' field. watch.ts (Step 6) will inject DDB event_id as
  id on events returned from its catchUp closure; live events never
  advance the cursor; dedup handles replay overlap.

Comprehensive structured logging at DEBUG/INFO/WARN/ERROR as the user
requested this step.

Tests: 35 new (90 baseline -> 125 total). Coverage 95.6% stmts / 77%
branch / 88% funcs. mise //cli:build clean. No CDK, agent, or design-
doc changes.

Step 6 consumes this: watch.ts must inject DDB event_id as id on
catchUp() events so the cursor advances; absence only causes re-fetch
(dedup prevents double-emit).
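The client is TypeScript; this Python sketch only illustrates the id-chain dedup with bounded, oldest-half eviction described above. Field names mirror the AG-UI camelCase fields listed in the dedup chain; the cap matches the 10000 figure above.

    class EventDeduper:
        def __init__(self, cap: int = 10_000):
            self._cap = cap
            self._seen: set[str] = set()
            self._order: list[str] = []

        @staticmethod
        def _dedup_key(ev: dict) -> str:
            for field in ("id", "messageId", "toolCallId", "runId"):
                if ev.get(field):
                    return f"{field}:{ev[field]}"
            ts = ev.get("timestamp", "")
            if ev.get("stepName"):
                return f"step:{ev['stepName']}:{ts}"
            if ev.get("name"):
                return f"name:{ev['name']}:{ts}"
            return f"type:{ev.get('type')}:{ts}"  # last-resort fallback

        def is_duplicate(self, ev: dict) -> bool:
            key = self._dedup_key(ev)
            if key in self._seen:
                return True
            self._seen.add(key)
            self._order.append(key)
            if len(self._order) > self._cap:      # evict the oldest half
                evicted, self._order = self._order[: self._cap // 2], self._order[self._cap // 2 :]
                self._seen.difference_update(evicted)
            return False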
…lback

Rewires bgagent watch to use the Step 5 SSE client as primary transport
with REST polling as fallback. Final client-side piece before deploy +
E2E (Step 7).

New --transport <sse|polling|auto> flag (default auto). --stream-
timeout-seconds passes through to maxStreamSeconds (default 3500 =
58 min; pre-empts AgentCore's 60-min cap). Existing --output and
existing text/JSON formatting preserved byte-for-byte.

New cli/src/ag-ui-translator.ts:
- translateDbRowToAgUi(row): mirror of agent/src/sse_wire.py in TS.
  Emits the same AG-UI triad/pair shapes (TEXT_MESSAGE_*, TOOL_CALL_*,
  STEP_*, CUSTOM for cost, RUN_ERROR) so live-SSE and catch-up-REST
  frames are indistinguishable to dedup + formatter.
- CRITICAL: every returned event carries the DDB event_id as `id` with
  a suffix (:start, :content, :end, :args, :call, :step-started,
  :step-finished, :thinking, :msg) so multi-frame groups dedup
  distinctly AND the SSE client's cursor advances on replay.
- agUiToSemantic(ev): inverse translator for formatter rendering
  (Option A boundary — text output byte-identical between transports).

watch.ts flow:
1. Snapshot fetch (getTaskEvents + getTask in parallel) serves three
   purposes: detect already-terminal task → print tail + exit;
   print history tail for late joiners; seed initialCatchUpCursor.
2. Transport selection:
   - --transport sse: runSseClient only; errors propagate.
   - --transport polling: pollTaskEvents only (Phase 1a behavior).
   - --transport auto (default): SSE first; on any error from
     runSseClient, fall back to polling with seeded cursor so no
     duplicates. Missing runtime_jwt_arn config → WARN + polling.
3. Post-SSE authoritative status: one more getTask after
   RUN_FINISHED/RUN_ERROR to set exit code from REST truth
   (COMPLETED → 0; anything else → 1).
4. Ctrl+C unification: single AbortController plumbed into both
   runSseClient({signal}) and pollTaskEvents' abortable sleep.
5. Consecutive-reconnect counter → WARN "network may be flaky"
   after 3 in a row.

configure.ts: new --runtime-jwt-arn <arn> flag persisting to config.
tryLoadConfig() helper enables partial-update merge — running
bgagent configure --runtime-jwt-arn <arn> alone works on an already-
configured install. First-time configure still requires the four
core fields (post-merge check). CliConfig's runtime_jwt_arn is
optional for back-compat; old config.json files without it load fine.

Comprehensive structured logging (per user emphasis this step):
- DEBUG: transport choice, config read, snapshot (N events), cursor
  seed, each SSE event (type + id), reconnect attempts, catch-up
  replay, fallback rationale.
- INFO: transport selected, reconnecting, replayed N events, task
  completed/failed, clean exit.
- WARN: fallback-to-polling (with reason), missing runtime_jwt_arn,
  duplicate skipped, consecutive-reconnect > 3.
- ERROR: terminal RUN_ERROR fatal, config invalid, not authenticated,
  --transport sse unrecoverable.
- JSON mode discipline: every log line routes to process.stderr;
  stdout stays pure NDJSON. Test 11 asserts every stdout line
  JSON.parses cleanly.

Tests: 51 new (125 baseline → 176 total, 14 → 15 suites).
- ag-ui-translator.test.ts: 29 tests (10 per-type translator + 11
  agUiToSemantic + 6 round-trip + 2 edge).
- watch.test.ts: 18 new transport tests covering all flag
  combinations, happy/sad paths, missing-config, already-terminal,
  SIGINT, JSON mode discipline, catch-up id injection, fallback
  ordering + no-duplicates.
- configure.test.ts: 4 new tests (flag accepted, merge, first-time
  required-field check, back-compat old config).

No CDK changes. No agent changes. No design-doc prose changes.
mise //cli:build clean. No new runtime deps.

Step 7 risks (for the parent to validate in E2E):
- AG-UI messageId/toolCallId mismatch between live SSE (fresh ULIDs)
  and catch-up REST (DDB event_id with suffix). Expected — live wins
  dedup-wise, catch-up fills gaps. Verify no double-rendering on a
  mid-turn reconnect.
- Polling snapshot re-render on SSE-fallback: cursor carries forward
  the SSE run's last emitted event id's DDB prefix; claimed
  correct + verified by Test 15, but worth eyeballing during E2E.
E2E of Phase 1b revealed that the two endpoints require different
Cognito JWT types because they check different claims:

- API Gateway's Cognito authorizer validates the 'aud' claim, which
  only Cognito ID tokens carry. It 401s on access tokens.
- AgentCore Runtime's customJWTAuthorizer (configured via
  RuntimeAuthorizerConfiguration.usingCognito(pool, [appClient]) in
  Step 1 CDK) renders CloudFormation with allowedClients set, which
  AgentCore validates against the 'client_id' claim — only present on
  access tokens. It rejects ID tokens.

Before this fix the CLI cached only the ID token and sent it to both
endpoints. REST worked; SSE to Runtime-JWT failed with UNAUTHORIZED.
The SSE client's 401 handler retried once and the --transport auto
path correctly fell back to polling, so the user-visible failure was
a warning line + a transport downgrade, not a hard error. But the
primary SSE path was effectively unusable.

Fix:
- Cache BOTH IdToken and AccessToken from Cognito in
  ~/.bgagent/credentials.json (new optional access_token field; old
  credential files continue to load and fall back to id_token with a
  debug line).
- Split the single getAuthToken() into:
  - getIdToken() → REST API (API Gateway Cognito authorizer)
  - getAccessToken() → AgentCore Runtime-JWT SSE endpoint
  - getAuthToken() kept as a backward-compat alias for getIdToken so
    api-client.ts and any other REST call path is unchanged.
- watch.ts's SSE client closure now calls getAccessToken() specifically.
- Login and refresh flows require both tokens from Cognito; tests
  updated to mock AccessToken alongside IdToken.
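
For context, a small decode-only Python sketch of the claim difference that forces the split (no signature verification; endpoint labels are illustrative, not the CLI's actual routing code).

    import base64
    import json

    def jwt_claims(token: str) -> dict:
        payload = token.split(".")[1]
        payload += "=" * (-len(payload) % 4)   # restore base64 padding
        return json.loads(base64.urlsafe_b64decode(payload))

    def pick_token(id_token: str, access_token: str, endpoint: str) -> str:
        if endpoint == "rest":
            return id_token        # API Gateway authorizer checks 'aud' (ID token only)
        return access_token        # AgentCore JWT authorizer checks 'client_id' (access token only)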

Verified with curl:
- API Gateway + ID token → HTTP 200
- API Gateway + access token → HTTP 401 (WHY we keep id_token for REST)
- AgentCore Runtime-JWT + ID token → would 401 (WHY we need access token)
- AgentCore Runtime-JWT + access token → JWT authenticates; request
  reaches the container (separate unrelated container 502 under
  investigation).

Tests: 176 → 177 (+1: legacy credentials fallback). All CLI tests
pass. No CDK, agent, design-doc, or diagram changes. Users must
re-run 'bgagent login' once to populate access_token in their
credentials file; otherwise SSE falls back to polling with a hint.
Exercising the SSE path end-to-end against the deployed stack surfaced
three production bugs that blocked streaming from reaching the client.
Each is fixed and verified: the latest test run received the full live
stream of 74 AG-UI events (turns, tool calls, tool results, cost,
milestones) with correct timestamps and a real PR created at
scoropeza/agent-plugins#10. Two known issues remain for a fresh session
— see "Remaining work" in the doc + memory entries updated elsewhere.

Bugs fixed in this commit:

1. agent/src/server.py — _debug_cw was blocking module import with
   synchronous boto3 CloudWatch Logs writes. AgentCore runs a /ping
   health check within ~1s of container boot; if uvicorn hasn't
   bound port 8080 by then, the container is marked unhealthy and
   the runtime returns 424 "Runtime health check failed or timed
   out". Fix: CW writes run in a daemon thread (fire-and-forget),
   and the boot-time log uses plain print() to avoid spawning any
   thread during import. The debug path (used after the first
   request arrives) still writes to CloudWatch.

2. cli/src/sse-client.ts — was sending the 26-char task_id ULID as
   the X-Amzn-Bedrock-AgentCore-Runtime-Session-Id header, but
   AgentCore validates >=33 chars and returns HTTP 400 Bad Request.
   Fix: new buildRuntimeSessionId() helper that prefixes the task_id
   with 'bgagent-watch-' (40 chars total). Deterministic so reconnect
   attempts re-use the same session and AgentCore routes back to the
   same microVM (preserving in-progress session state).

3. cli/src/commands/watch.ts — AG-UI translator mints suffixed event
   ids like "01KPPVWM...:step-started" to keep TEXT_MESSAGE/
   TOOL_CALL triads dedup-unique. On SSE reconnect, the CLI was
   passing the full suffixed id to GET /events?after=, which fails
   the ULID validator added in Step 4 with "Invalid `after`
   parameter: must be a 26-character ULID." Fix: strip ':suffix'
   in the catchUp closure before calling apiClient.catchUpEvents.

Debug infrastructure added (stays enabled by default for the rest of
Phase 1b development per explicit user direction — so adding new
debug lines later doesn't require redeploy to toggle a log level):

- _debug_cw helper in server.py writes to server_debug/<task_id>
  CloudWatch streams (AgentCore doesn't forward container stdout;
  only explicit CloudWatch Logs API writes land in APPLICATION_LOGS,
  matching the existing telemetry._emit_metrics_to_cloudwatch
  pattern). Wired through invoke_agent entry, _extract_invocation_
  params result, sync vs SSE routing decision, _invoke_sse entry,
  _SSEAdapter construction, attach_loop, _spawn_background,
  _run_task_background entry, _sse_event_stream entry/RUN_STARTED
  yield/event translations/keepalive pings/close sentinel/disconnect.
- Every CW write is non-blocking (daemon thread) so request latency
  is unaffected.
- Local-stdout print() is retained for docker-compose runs.
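
A minimal sketch of the fire-and-forget CloudWatch Logs write pattern described above; names are illustrative and this is not the shipped _debug_cw helper.

    import threading
    import time
    import boto3

    def debug_cw(log_group: str, stream: str, message: str) -> None:
        """Fire-and-forget: the daemon thread never blocks request handling."""
        def _write() -> None:
            try:
                logs = boto3.client("logs")
                try:
                    logs.create_log_stream(logGroupName=log_group, logStreamName=stream)
                except logs.exceptions.ResourceAlreadyExistsException:
                    pass
                logs.put_log_events(
                    logGroupName=log_group,
                    logStreamName=stream,
                    logEvents=[{"timestamp": int(time.time() * 1000), "message": message}],
                )
            except Exception:
                pass  # debug-only path: never propagate
        threading.Thread(target=_write, daemon=True).start()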

No test changes in this commit (tests still pass: agent 371, CLI 177,
CDK 745). No new deps. Known remaining issues documented separately:
(a) terminal RUN_FINISHED frame not delivered to the CLI after task
completion — CLI hits repeated 424s and never detects terminality;
(b) suspected duplicate-pipeline execution — REST-submit path fires
orchestrator → Runtime-IAM pipeline AND the subsequent SSE
invocation on Runtime-JWT spawns ANOTHER pipeline, yielding ~2×
expected event volume in DDB. These are Phase 1b design-level gaps
(not one-line fixes) flagged for the next session.
…esearch

First live SSE bring-up against the deployed stack surfaced a design gap
in the rev-4 plan: SSE invocations on Runtime-JWT were spawning a
SECOND pipeline when the task had already been launched via the
orchestrator on Runtime-IAM. Root cause: AgentCore's "same session_id
→ same microVM" routing is per-runtime-ARN only; cross-runtime live
attach requires an external pub/sub layer.

Competitive survey (docs/research/agent-streaming-patterns.md, new;
31 sources across CopilotKit, LangGraph Platform, OpenAI Assistants,
Mastra, Temporal, Vercel resumable-stream, AgentCore itself)
identified three dominant shapes: same-process streaming (used by
CopilotKit / Mastra / OpenAI), orchestrator+observer with pub/sub
(LangGraph join_stream, Vercel resumable-stream), and pull-based
(Temporal, OpenAI fallback). LangGraph Platform documented as the
clearest reference for orchestrator+observer (Postgres log + Redis
pubsub); Vercel resumable-stream is the simpler AWS-friendly
equivalent.

Branch A chosen for Phase 1b (§9.13, new in rev 5):

- Path 1 — `bgagent submit --watch` / `bgagent run`: direct-submit
  via POST /v1/tasks with `execution_mode=interactive` → Lambda
  admits + writes TaskTable, SKIPS orchestrator → CLI opens SSE to
  Runtime-JWT → server.py runs pipeline same-process. True real-time.
  Reconnection within microVM lifetime attaches to existing adapter
  via new {task_id: adapter} registry (attach-don't-spawn).
- Path 2 — `bgagent submit` plain: default `execution_mode=orchestrator`
  keeps Phase 1a behaviour (pipeline on Runtime-IAM). CLI `watch` on
  such a task = polling (no cross-runtime live attach; that's Phase 1c).
- Non-interactive (Slack/webhook/cron): Path 2 plus DDB Streams fan-out
  per §8.9.

Trade-offs documented:
- Pipeline lifetime for both paths bounded by AgentCore maxLifetime
  (8h in our CDK). DDB persists log only, not execution continuation.
- Tasks > 8h need to leave AgentCore Runtime (Fargate / Step Functions,
  out of Phase 1b scope).
- Direct-submit task dies with its microVM if CLI disconnects long
  enough for idle / maxLifetime eviction; orchestrator-submit is
  indifferent to CLI presence.

New concrete deliverables flagged in the doc (to be implemented next):

1. CDK: add `execution_mode` field to POST /v1/tasks.
2. CreateTask Lambda: if `interactive`, skip orchestrator Lambda.Invoke.
3. server.py: attach-don't-spawn logic via {task_id: _SSEAdapter}
   registry.
4. server.py: /ping HealthyBusy while pipeline thread alive.
5. CLI: bgagent submit --watch (or bgagent run) flow — POST /v1/tasks
   with execution_mode=interactive, then open SSE to Runtime-JWT.

Phase 1c roadmap: add pub/sub (IoT Core MQTT or ElastiCache Redis +
resumable-stream port) for real-time cross-runtime attach. Additive;
Branch A code does not change.

Diagram updates to v2 (surgical text edits, no layout change):
- Page 2 caption: explicit two-path architecture.
- Page 5 caption + sequence arrow: reflect same-process pipeline on
  Runtime-JWT for interactive path.

Research brief: docs/research/agent-streaming-patterns.md, ~2300 words,
31 sources, flagged time-sensitive claims (especially AgentCore quotas).

No code changes in this commit.
… + multi-subscriber

Implements the server-side Branch A design from §9.13:

1. _SSEAdapter multi-subscriber fan-out (sse_adapter.py):
   - New _Subscriber dataclass (queue + dropped_count).
   - Default subscriber created eagerly at __init__ (backward-compatible
     get() still works even when write_*() fires before first get()).
   - New subscribe() returns a fresh per-observer queue; unsubscribe(q)
     removes it.
   - _broadcast_from_loop writes to all subscriber queues with per-sub
     drop-oldest backpressure (one slow consumer can't stall others).
   - _broadcast_sentinel_from_loop fans out close sentinel to all.
   - dropped_count is the sum of per-sub drops + _undelivered_count
     (no-loop / closed cases).

2. server.py attach-don't-spawn logic:
   - New module-level _active_sse_adapters: dict[str, _SSEAdapter]
     tracked under _threads_lock.
   - _invoke_sse checks the registry first: if an adapter with active
     subscribers exists for this task_id, subscribe() a new queue and
     return a StreamingResponse backed by it — do NOT call
     _spawn_background. Solves the duplicate-pipeline bug (two
     pipelines running the same task, observed in the previous
     bring-up run).
   - Spawn path now: register in _active_sse_adapters BEFORE spawning
     (so a rapid reconnect race attaches instead of double-spawning),
     then subscribe() the observer's queue BEFORE spawning (so no
     events are missed between spawn and first drain iteration).
   - _run_task_background's finally block removes the adapter from
     the registry. Only removes if the current entry matches — guards
     against a newer adapter having replaced it.
   - Rollback on spawn failure: adapter removed from registry if
     _spawn_background raises.

3. _sse_event_stream takes sub_queue per-observer parameter:
   - Drains THIS observer's queue (not the adapter's default) so
     multiple observers attached to the same pipeline each receive
     the full event stream independently.
   - On client disconnect / generator cancellation, calls
     adapter.unsubscribe(sub_queue) — leaves the adapter and other
     subscribers intact. The background pipeline keeps running and
     ProgressWriter keeps writing to DDB.

4. /ping HealthyBusy (§9.13.2 idle-evict defense-in-depth):
   - Returns {"status": "HealthyBusy"} while any pipeline thread is
     alive, signalling AgentCore not to idle-evict the microVM. Per
     AWS runtime-long-run docs.
   - Returns {"status": "healthy"} otherwise.
   - 503 + "unhealthy" on _background_pipeline_failed unchanged.
   - _last_ping_status module-level: /ping transitions logged to
     CW once (healthy <-> HealthyBusy <-> unhealthy) so the stream
     is not spammed with per-probe lines.

5. Tests — 371 baseline → 377:
   - test_ping_reports_healthy_when_idle.
   - test_ping_reports_healthybusy_when_pipeline_alive.
   - test_sse_attach_does_not_spawn_second_pipeline: registry
     pre-populated, _invoke_sse returns StreamingResponse WITHOUT
     calling _spawn_background; subscriber_count increments.
   - test_multi_subscriber_broadcast: two subscribers both receive
     the same event.
   - test_multi_subscriber_close_sentinel_fans_out: close()
     sentinel reaches every subscriber.
   - test_registry_cleanup_on_pipeline_completion: finally block
     removes the adapter from the registry.
   - Existing test_sse_stream_client_disconnect_calls_detach renamed
     and updated to verify unsubscribe(queue) instead of detach_loop.

Comprehensive CW debug logs at every state transition (registry
insert / remove / attach / rollback, ping status transitions,
subscribe / unsubscribe). All via the existing _debug_cw helper
(writes in daemon thread so container startup is not blocked).

No CLI changes, no CDK changes, no design-doc changes (already
landed in 48837af). Part 1 (CDK admission execution_mode +
bgagent run command) will ship in a separate commit.

Known gap: the CDK POST /v1/tasks still unconditionally fires the
orchestrator Lambda when a task is created, so `bgagent run` (Part 1,
not yet implemented) cannot yet use this path to submit a task
without ALSO firing the orchestrator. Part 1 adds the
execution_mode field to skip the orchestrator invoke when the CLI
is going to drive the pipeline itself via SSE.
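
A simplified sketch of the per-subscriber fan-out with drop-oldest backpressure described in (1); not the shipped sse_adapter.py, and the queue size is an assumption.

    import asyncio

    class Broadcaster:
        def __init__(self, maxsize: int = 256):
            self._maxsize = maxsize
            self._subscribers: list[asyncio.Queue] = []

        def subscribe(self) -> asyncio.Queue:
            q: asyncio.Queue = asyncio.Queue(maxsize=self._maxsize)
            self._subscribers.append(q)
            return q

        def unsubscribe(self, q: asyncio.Queue) -> None:
            if q in self._subscribers:
                self._subscribers.remove(q)

        def broadcast(self, event) -> None:
            """A slow consumer only drops its own oldest event, never stalls others."""
            for q in self._subscribers:
                if q.full():
                    try:
                        q.get_nowait()  # per-subscriber drop-oldest
                    except asyncio.QueueEmpty:
                        pass
                q.put_nowait(event)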
…xport helpers

Partial landing of the CDK admission + CLI helper-export work for rev 5
Branch A. Stops short of the final `bgagent run` command — that needs
one more session of work. Everything below is test-green and the
branch is in a clean state for handoff.

CDK changes:

1. types.ts + cli/src/types.ts (kept in sync per AGENTS.md):
   - New ExecutionMode = 'orchestrator' | 'interactive'.
   - CreateTaskRequest now has optional execution_mode.

2. create-task-core.ts:
   - New optional allowedExecutionModes parameter (default
     ['orchestrator']). Validates body.execution_mode against the
     allowlist; 400 VALIDATION_ERROR with a clear message if not
     allowed.
   - execution_mode='interactive' SKIPS the orchestrator Lambda.Invoke
     and logs "Admission: interactive mode, orchestrator invoke
     skipped". Default/undefined/='orchestrator' preserves Phase 1a
     behaviour byte-for-byte.

3. create-task.ts (Cognito-authed POST /v1/tasks):
   - Passes ['orchestrator', 'interactive'] → both modes allowed.
   - webhook-create-task.ts unchanged → default ['orchestrator'] →
     webhook callers cannot request interactive mode (returns 400).

4. Tests (create-task-core.test.ts): 5 new cases — interactive skips
   orchestrator, explicit orchestrator still fires, undefined is
   orchestrator (regression), webhook rejects interactive, bogus
   mode rejected. Full CDK suite: 750 passed / 0 failed (745
   baseline + 5 new).

CLI changes (helper exports for the upcoming `bgagent run` command):

5. cli/src/commands/watch.ts:
   - Exported makeFormatter, fetchInitialSnapshot, runSse, RunSseArgs
     so a new run command can reuse the same SSE-watch machinery
     without duplicating the renderer / reconnect / terminal-state
     logic.
   - No behaviour change; purely public-API widening.

6. cli/test/sse-client.test.ts: updated the session-id header
   expectation to match the buildRuntimeSessionId() prefix (fix
   landed earlier in cd093b2 — this test was still expecting the
   bare task_id).

Not yet in this commit (next session):

- cli/src/commands/run.ts — composes apiClient.createTask({execution_mode:
  'interactive'}) + runSse(). Should be a small file (~100-150 lines).
- cli/src/bin/bgagent.ts — register run command alongside submit/watch.
- cli/src/api-client.ts — verify createTask forwards execution_mode
  (current shape spreads ...CreateTaskRequest so it already does, but
  needs a test assertion).
- cli/test/commands/run.test.ts — new tests for the run flow.
- Deploy + E2E validation on the deployed stack.

Test counts: CDK 745 -> 750 (+5). CLI unchanged at 177 (no new tests
yet for run command). Agent unchanged at 377 (Part 2 committed in
3dab225 already).
…RE guard

Completes Phase 1b rev-5 Branch A (design §9.13). Adds the direct-submit
interactive CLI path and the cross-runtime safety guard that prevents
duplicate pipeline execution when a watcher opens SSE against a task
that was submitted via the orchestrator.

CLI — `bgagent run`:
* New command composes createTask({execution_mode: 'interactive'}) + runSse
  so the pipeline executes same-process with the SSE stream on Runtime-JWT
  (real-time, no orchestrator hop).
* Requires runtime_jwt_arn in config; errors with a clear pointer to
  `bgagent configure` otherwise.
* Registered in bin/bgagent.ts between `submit` and `list`.
* 11 tests in cli/test/commands/run.test.ts.
* 1 api-client test: execution_mode is forwarded in POST body.

RUN_ELSEWHERE guard (§9.13.4):
* CDK: TaskRecord gains execution_mode; create-task-core persists it on
  the TaskTable record so server.py can read it. 1 new test asserts
  persistence for both interactive and orchestrator modes.
* Agent: _invoke_sse now checks task_state.get_task() before spawning.
  If the task record says execution_mode != 'interactive', returns
  HTTP 409 {code: RUN_ELSEWHERE, execution_mode} instead of spawning.
  Fails open when record is missing (blueprints / legacy tasks).
  3 new tests cover orchestrator-rejected, interactive-allowed, and
  fail-open behaviours.
* CLI: sse-client recognizes 409 RUN_ELSEWHERE as a non-retriable error,
  throws CliError so `watch --transport auto` catch-block falls back to
  polling cleanly (no reconnect storm). Generic 409s still take the
  regular reconnect path. 2 new sse-client tests cover both paths.

Test status: agent 380 (was 377), CDK 751 (was 750), CLI 191 (was 177).
Live bring-up of the rev-5 Branch A path surfaced three production bugs
after the initial rev-5 deploy. All three are fixed here and validated
end-to-end against backgroundagent-dev/us-east-1.

1. CDK: RuntimeJwt execution role was missing ECR pull perms.
   Root cause: the L2 `AgentRuntimeArtifact.fromAsset` AssetImage.bind()
   guards against double-grant with `this.bound = true`, so when the
   SAME artifact instance is attached to two Runtimes the second
   runtime's role never receives the ECR pull statements. Image pull
   failed with "no basic auth credentials", AgentCore then returned
   424 Failed Dependency on every /invocations for the JWT runtime.
   Fix: create two AssetImage instances (one per runtime). The
   DockerImageAsset dedupes on asset hash so only one image is
   published to ECR.

2. Agent: server.py on Runtime-JWT could not run pipelines for
   CLI-submitted tasks. The rev-5 design (§9.13.4) requires server.py
   to hydrate repo_url + task description + max_turns / max_budget /
   task_type / branch_name / pr_number from TaskTable when the SSE
   body only carries `{task_id}`. Previously `_extract_params` read
   only from the invocation body, so pipelines blew up with
   `ValueError: repo_url is required` immediately after spawn.
   Fix: extend the RUN_ELSEWHERE guard block in `_invoke_sse` to
   also hydrate missing params from the TaskTable record when the
   mode is `'interactive'`. Preserves the orchestrator contract
   (which passes full params in the invocation body — we only fill
   fields the caller didn't send).

3. CLI+CDK: `bgagent watch --transport auto` against an orchestrator-
   submitted task hit a reconnect storm. server.py correctly returns
   409 RUN_ELSEWHERE, but AgentCore wraps non-2xx responses as 424
   to the client, so the CLI never saw the 409 code and fell into
   generic http_error retry. Design-correct fix: expose `execution_mode`
   on the TaskDetail REST response so watch can choose the transport
   from the snapshot instead of probing Runtime-JWT. watch.ts now
   routes directly to polling whenever `snapshot.executionMode !==
   'interactive'` under `--transport auto`. The 409 RUN_ELSEWHERE
   guard stays server-side as defence-in-depth.

CDK: TaskDetail gains `execution_mode: ExecutionMode | null`
(null for legacy tasks that predate rev 5). `toTaskDetail` forwards
the field.

CLI: watch.ts SnapshotResult gains `executionMode`. Two new tests
cover orchestrator-mode → polling and legacy (null) → polling.

E2E results against backgroundagent-dev:
* E2E-A (`bgagent run`, interactive): task 01KPRN3XKQ2V4AVNT6GV1E3PEN
  → PR scoropeza/agent-plugins#11, COMPLETED in 54s, 61 real-time
  events streamed, exit 0.
* E2E-B (`submit` + `watch --transport auto`, orchestrator):
  snapshot resolves execution_mode=orchestrator, CLI skips SSE,
  polling streams turn events and tool calls, final status FAILED
  (max-turns exhausted — orchestrator-side behaviour unaffected by
  this fix), exit 1. No 424 reconnect storm.

Tests: agent 380, CDK 751, CLI 193. TS typecheck clean.
… nits)

Multi-agent validation pass (code-reviewer, silent-failure-hunter,
pr-test-analyzer, type-design-analyzer) surfaced five P0 concerns and
several key nits. Per user direction (Option C — no cutting corners),
land all P0s inline + key nits; defer P1s and design refactors to
`docs/design/PHASE_1B_REV5_FOLLOWUPS.md` with full rationale.

P0 fixes:

* P0-a — latent AttributeError in _SSEAdapter.write_agent_error:
  referenced non-existent `self._dropped_count` inside a "last-ditch"
  except block. Fix to the real `_undelivered_count` + regression test
  (test_write_agent_error_fallback_uses_undelivered_counter).
* P0-b — task_state.get_task conflated NotFound with FetchFailed (both
  returned None via bare `except Exception`). Rev-5 RUN_ELSEWHERE guard
  on Runtime-JWT would spawn a duplicate pipeline on orchestrator-mode
  tasks during a DDB blip. Introduce new `TaskFetchError` exception;
  `get_task` raises on DDB/boto failures, returns None only on real
  absence. server.py `_invoke_sse` now returns 503
  TASK_STATE_UNAVAILABLE on fetch failure (fail-closed) and still
  fails-open for legacy tasks with no record. Four new tests cover
  the three return shapes + the 503 response.
* P0-d — `bgagent run` wraps `runSse` in try/catch. On fatal SSE
  error: cancels the task (so it doesn't sit stranded in SUBMITTED
  occupying a concurrency slot), prints the task_id + a
  `bgagent status <task_id>` resume hint, and exits non-zero. Three
  new tests: SSE-fails-immediately-and-cancels, cancel-also-fails-
  bubbles-SSE-error, SIGINT-propagates-abort-to-runSseClient.
* P0-e — post-hydration validation in server.py `_invoke_sse`. New
  `_validate_required_params` helper checks for the minimum viable
  param set (repo_url; new_task needs issue OR description;
  pr_iteration/pr_review need pr_number). Returns 500
  TASK_RECORD_INCOMPLETE with a list of missing fields instead of
  letting the pipeline crash inside setup_repo. Two new tests
  (hydration-fills-missing, explicit-caller-wins-over-record) + one
  for the 500 + one for the helper's type-dispatch logic.

Key nits:

* Lift `validateStreamTimeout` + `DEFAULT_STREAM_TIMEOUT_SECONDS=3500`
  into `cli/src/commands/_stream.ts` (shared between run + watch).
* `SnapshotResult.executionMode: string | null` → `ExecutionMode |
  null` (compile-time exhaustiveness restored).
* `TaskDetail.execution_mode?:` in CLI → required `ExecutionMode |
  null` (matches CDK side, which always sets it).
* server.py now exports `EXECUTION_MODE_ORCHESTRATOR` and
  `EXECUTION_MODE_INTERACTIVE` constants; inline string literals
  replaced.
* Heartbeat 45s magic number → `_HEARTBEAT_INTERVAL_SECONDS`.
* `logInfo` no-op `if/else` branches in watch.ts removed; parameter
  renamed `_isJson` for call-site documentation.
* `run.ts` `logInfo(message)` signature → `logInfo(isJson, message)`
  to match watch.ts; `Verbose mode:` line gated on `isVerbose()`.
* CDK two-artifact workaround comment: replace `<check>` placeholder
  with a pointer to `DEFER_FOLLOWUPS.md` CDK-1 + detailed location of
  the L2 bug (`AssetImage.bind` guard in the vendored module).

Tests added:

* agent: +9 tests (hydration-full, hydration-preserves-caller, 503-
  TASK_STATE_UNAVAILABLE, 500-TASK_RECORD_INCOMPLETE, validator
  dispatch for pr_iteration/pr_review/new_task, get_task-notfound,
  get_task-found, get_task-empty-item, get_task-raises-TaskFetchError,
  write_agent_error-fallback).
* cdk: +1 test (both runtime execution roles carry ECR pull perms —
  regression for the L2 double-attach bug).
* cli: +3 tests (run-SSE-fatal-cancels-task, run-cancel-also-fails-
  bubbles-SSE-error, run-SIGINT-forwards-abort-to-runSseClient).

Other:

* New docs/interactive-agents-phases-v3.drawio (generated by
  diagram-specialist) with execution_mode fork on POST /tasks,
  RUN_ELSEWHERE guard + hydration annotation, Path-1 vs Path-2
  sequence, per-microVM reconnection-attach note, and a "Rev-5
  invariants" callout. v2 kept as a reference baseline.
* New docs/design/PHASE_1B_REV5_FOLLOWUPS.md catalogs the 14 deferred
  items (P0-c stranded-task reconciler, all P1s, type refactors,
  observability gaps, polling-interval design deviation, upstream
  L2 bug filing).

Test totals: agent 380 → 390 (+10), CDK 751 → 752 (+1), CLI 193 →
196 (+3). TS typecheck clean.

Re-validated against deployed backgroundagent-dev:
* POST-P0 E2E-A (interactive/JWT/SSE) — task 01KPS5A90S6PPX8D9083BCC8P3
  → PR scoropeza/agent-plugins#14, COMPLETED, hydration log confirms
  5 params filled from TaskTable.
* POST-P0 E2E-B (orchestrator/IAM/polling) — task
  01KPS6TM1K74QAV2ZGPBB8GDTG → snapshot reads execution_mode=
  orchestrator, CLI prints "using polling (SSE only supported for
  interactive tasks)", no SSE attempt, 33 turns streamed via polling,
  terminal status FAILED (max_turns=6 exhaustion — pipeline-side,
  unrelated to watch).
…S_PER_USER 3→10

Follow-up to fe84de5 pre-push hardening: land the last P0 from the
multi-agent validation pass. User feedback during the review exposed a
live instance of the stranded-task problem (2 interactive tasks
occupying concurrency slots after CLI-side failures pre-P0-d), which
made the case to fix it in-scope rather than deferring to a follow-up
PR.

Stranded-task reconciler:

* New construct `cdk/src/constructs/stranded-task-reconciler.ts` wires an
  EventBridge schedule to a new Lambda running every 5 minutes by
  default.
* New handler `cdk/src/handlers/reconcile-stranded-tasks.ts` queries
  `TaskTable.StatusIndex` for `status IN (SUBMITTED, HYDRATING)` with
  `created_at < cutoff`, classifies each row by `execution_mode`, and
  fails any exceeding its applicable timeout:
    - interactive:       300 s (STRANDED_INTERACTIVE_TIMEOUT_SECONDS)
    - orchestrator:     1200 s (STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS)
    - legacy (missing): 1200 s (treated as orchestrator)
  Thresholds are env-tunable per Lambda without code changes.
* Transition to FAILED is conditional on the current status to avoid
  racing a legitimate transition; on CCFE we log
  `advanced_during_reconcile` and move on.
* Emits two events on fail: `task_stranded` carrying
  `{code: 'STRANDED_NO_HEARTBEAT', prior_status, execution_mode,
  age_seconds}` for observability, and the standard `task_failed` that
  existing consumers already react to.
* Decrements the user's concurrency counter conditionally. If the
  counter is already 0 (double-release or drift), we swallow the CCFE —
  the existing concurrency reconciler catches lingering drift.
* Does NOT touch RUNNING / FINALIZING — those remain handled by
  `pollTaskStatus`'s `agent_heartbeat_at` timeout path in
  `orchestrator.ts`.
* Handler `entry` uses `@aws-sdk/client-dynamodb` inline (not
  `lib-dynamodb`) to match the sibling reconciler pattern and keep the
  Lambda bundle lean.

8 new handler unit tests cover: no-candidates no-op, interactive stale
→ fails + events + decrement, orchestrator young → ignored,
orchestrator old → fails, legacy no-mode → orchestrator threshold, race
with legitimate transition (CCFE clean skip), both SUBMITTED + HYDRATING
queried, pagination via `ExclusiveStartKey`.

MAX_CONCURRENT_TASKS_PER_USER 3 → 10:

* `cdk/src/constructs/task-orchestrator.ts:163` default raised. 3 was
  too tight for power-user CLI flows (`bgagent run` during iterative
  development triggered the cliff routinely). The stranded-task
  reconciler now prevents abandoned tasks from occupying slots, so the
  bump is ergonomic rather than load-bearing.
* Existing test updated (`task-orchestrator.test.ts`).
* The `?? 3` fallback → `?? 10` only; callers that pass an explicit
  value are unaffected.

Docs:

* `docs/design/PHASE_1B_REV5_FOLLOWUPS.md` — move P0-c from "deferred"
  to "LANDED" with the actual design delivered. Same for
  MAX_CONCURRENT_TASKS_PER_USER. Also add new entry DATA-1 (split DDB
  `turns` field into `turns_attempted` + derived `turns_completed` —
  user flagged during review; the SDK reports the attempted count
  including the cap-check turn, so this is cosmetic, not a bug).

Tests: agent unchanged at 390, CDK 752 → 760 (+8 reconciler handler
tests; existing task-orchestrator test updated for new default), CLI
unchanged at 196. TS typecheck clean.

Deployed to backgroundagent-dev: StrandedTaskReconciler Lambda live,
env verified (INTERACTIVE=300 ORCHESTRATOR=1200). Running every 5
minutes on EventBridge.
Round 1 of the pre-push follow-ups. Three correctness fixes from the
silent-failure-hunter pass, plus the observability gap the concurrency
incident exposed. Plan in docs/design/PHASE_1B_REV5_FOLLOWUPS.md.

P1-3 — attach-path subscribe() failure no longer duplicates pipeline
(agent/src/server.py):
* When `_invoke_sse` finds a live adapter in the registry but
  `adapter.subscribe()` raises (close race, queue-lifecycle bug), we
  now return HTTP 503 `{"code": "SSE_ATTACH_RACE"}` instead of falling
  through to spawn. Falling through would have duplicated the pipeline
  inside a single microVM.
* New test
  `test_sse_attach_subscribe_failure_returns_503_no_spawn` asserts no
  spawn is called and the registry is left untouched on subscribe
  failure.

P1-1 — 409 on the SSE path is always terminal (cli/src/sse-client.ts):
* Prior behavior: if the 409 body wasn't JSON or didn't carry
  `code: RUN_ELSEWHERE`, the code fell through to the generic
  reconnect path — i.e., retried against a server that was
  deliberately rejecting the request (reconnect storm). Now any 409 is
  terminal: RUN_ELSEWHERE → `CliError('RUN_ELSEWHERE...')` (caller
  falls back to polling), other 409 → `CliError('HTTP 409 ... body:
  ...')` with a 500-byte body excerpt so operators can see what the
  server said. Non-retriable.
* Updated existing test and added
  `HTTP 409 with non-JSON body → terminal CliError`.

OBS-4 — interactive path records session_id on TaskTable
(agent/src/task_state.py + agent/src/server.py):
* New helper `task_state.write_session_info(task_id, session_id,
  agent_runtime_arn)`. Conditional UpdateItem on
  `status IN (SUBMITTED, HYDRATING)` so a concurrent legitimate
  transition wins cleanly (sketched after this list).
* `_invoke_sse` calls it just before spawning the pipeline for
  `execution_mode='interactive'` tasks, passing the AgentCore
  session header value + (for now) an empty runtime ARN.
* Cancel-task Lambda resolves the runtime ARN from `execution_mode`
  against two new env vars `RUNTIME_IAM_ARN` + `RUNTIME_JWT_ARN`
  (cdk/src/handlers/cancel-task.ts + cdk/src/constructs/task-api.ts).
  `StopRuntimeSession` now picks the correct runtime for interactive
  tasks that only have `session_id` on the record.
* Four new tests in test_task_state.py (writes both fields, no-op on
  empty, ConditionalCheckFailed skip, partial session-only), two in
  test_server.py (spawn-writes-session-info + skip-when-both-missing).
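
A minimal boto3 sketch of the conditional write (table name, key shape
and attribute names are assumptions; the real helper is
`task_state.write_session_info`):

  import boto3
  from botocore.exceptions import ClientError

  _task_table = boto3.resource("dynamodb").Table("TaskTable")  # name assumed

  def write_session_info(task_id: str, session_id: str, agent_runtime_arn: str) -> bool:
      """Record the AgentCore session on the task row, but only while the
      task is still pre-running; a concurrent real transition wins cleanly."""
      try:
          _task_table.update_item(
              Key={"task_id": task_id},
              UpdateExpression="SET session_id = :sid, agent_runtime_arn = :arn",
              # Only SUBMITTED / HYDRATING rows may be touched ("status" is a
              # DDB reserved word, hence the alias).
              ConditionExpression="#st IN (:submitted, :hydrating)",
              ExpressionAttributeNames={"#st": "status"},
              ExpressionAttributeValues={
                  ":sid": session_id,
                  ":arn": agent_runtime_arn,
                  ":submitted": "SUBMITTED",
                  ":hydrating": "HYDRATING",
              },
          )
          return True
      except ClientError as exc:
          if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
              return False  # lost the race to a legitimate transition
          raise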

Important design note added to agent.ts: an earlier attempt injected
each runtime's own ARN as an env var on the runtime itself. That
creates a CFN cycle (the Runtime's properties reference its own
`AgentRuntimeArn` GetAtt). The solution moved to cancel-task where
both runtime ARNs are already in scope at CDK synth time with no
cycle. `agent_runtime_arn` stays null on the interactive TaskTable
record; cancel resolves from `execution_mode`.

Test totals: agent 390 → 397 (+7), CDK 760 → 760 (existing tests
unchanged), CLI 196 → 197 (+1 non-JSON 409 test). TS typecheck clean.

Deployed to backgroundagent-dev and validated with a fresh E2E-A run:
task 01KPSEF0G8C2141JFF0CVGY4CD → PR scoropeza/agent-plugins#15,
COMPLETED, session_id "bgagent-watch-01KPSEF..." now recorded on
TaskTable (was null on all prior interactive-path tasks).
Two follow-ups from the silent-failure-hunter pass that surface
triage-critical information previously lost at debug-level only.

P1-2 — post-SSE getTask failure no longer silent
(cli/src/commands/watch.ts):

Before, if the authoritative-status REST call after the SSE terminal
event failed, the CLI silently inferred status from the AG-UI frame
type, logged only via `debug()` (invisible without `--verbose`), and
printed "Task completed." / "Task failed." without any indication
that the status was inferred. In practice this means a user sees a
green "completed" even when REST is down and the server might have
finalized as CANCELLED or FAILED.

Now on getTask failure we:
* Emit a `WARN:` line to stderr with the underlying error and a
  `bgagent status <task_id>` suggestion.
* Suffix the terminal line with ` (inferred)` so it is self-evident
  that the status may not match the server's view.
* Exit code still uses the inferred status (nothing else to do).

Two new tests: RUN_FINISHED + getTask fails → warn + inferred
completed (exit 0); RUN_ERROR + getTask fails → warn + inferred
failed (exit 1).

P1-5 — rev-5 except blocks now include tracebacks in CloudWatch
(agent/src/server.py):

The AgentCore runtime container's stdout is not forwarded to
APPLICATION_LOGS (server.py docstring already flags this). Before,
every bare `except Exception` in the rev-5 paths logged only
`{type(exc).__name__}: {exc}` via `_debug_cw`, losing the call-site
traceback and making programming-bug triage expensive.

* New helper `_debug_cw_exc(message, exc, *, task_id=None)` formats
  the full traceback and hands it to `_debug_cw` so it reaches the
  custom CloudWatch stream alongside the exception.
* Replaced the rev-5-era call sites:
  `_extract_invocation_params FAILED`, `_invoke_sse FAILED`,
  attach-path `subscribe() FAILED`, spawn-path `_SSEAdapter
  construction FAILED`, `attach_loop FAILED`, `subscribe FAILED in
  spawn path`, `_spawn_background FAILED`. Removed now-redundant
  `traceback.print_exc()` calls at those sites (stdout was
  swallowed in production anyway).
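
The helper itself is small; a sketch, assuming `_debug_cw(message,
task_id=...)` as the existing best-effort writer:

  import traceback

  def _debug_cw_exc(message: str, exc: BaseException, *, task_id: str | None = None) -> None:
      """Format the full traceback and hand it to the existing `_debug_cw`
      writer so it reaches the custom CloudWatch stream (signature assumed)."""
      tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
      _debug_cw(f"{message}: {type(exc).__name__}: {exc}\n{tb}", task_id=task_id)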

Test totals: agent 397 (unchanged — no new behavior assertion; the
helper is exercised indirectly through the existing P1-3 + SSE
construction tests), CDK 760 (unchanged), CLI 197 → 199 (+2 P1-2
tests). TS typecheck clean.

No infra change — agent/src changes but no deploy needed until we
actually want to exercise P1-5 in live CloudWatch.
bgagent added 17 commits April 28, 2026 13:54
Four observability follow-ups from the silent-failure-hunter pass.
None is a correctness bug — together they make rev-5 paths alarmable
and triageable from CloudWatch Logs Insights.

OBS-1 — attach-vs-spawn metric (agent/src/server.py):

New `_emit_sse_route_metric(task_id, route, *, subscriber_count)`
writes a JSON event `{event: "SSE_ROUTE", route: "attach"|"spawn",
task_id, [subscriber_count]}` to CloudWatch Logs stream
`sse_routing/<task_id>`. Called from both `_invoke_sse` branches:
attach (line 717, after the `subscriber_count=N` debug line) and
spawn (line 935, right after `_spawn_background` returns). Best-
effort daemon-thread write; never blocks the request path.
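
Sketched, with the actual CloudWatch Logs call abstracted behind an
assumed `_write_to_stream` helper:

  import json
  import threading
  from typing import Literal

  def _emit_sse_route_metric(
      task_id: str,
      route: Literal["attach", "spawn"],
      *,
      subscriber_count: int | None = None,
  ) -> None:
      """Best-effort structured metric; never blocks the request path."""
      payload: dict[str, object] = {"event": "SSE_ROUTE", "route": route, "task_id": task_id}
      if subscriber_count is not None:
          payload["subscriber_count"] = subscriber_count

      def _write() -> None:
          try:
              # `_write_to_stream` stands in for the existing writer that
              # targets the sse_routing/<task_id> log stream (assumption).
              _write_to_stream(f"sse_routing/{task_id}", json.dumps(payload))
          except Exception:
              pass  # observability must never take down the data path

      threading.Thread(target=_write, daemon=True).start()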

Downstream: Logs Insights can now answer "ratio of attach to spawn
per hour" and alarm on attach_count=0 over time (regression where
same-session-id routing silently broke) or spawn_count=0 on a
running microVM (registry bug hiding live adapters).

Two new tests: `test_sse_emits_attach_route_metric`,
`test_sse_emits_spawn_route_metric`.

OBS-2 — post-hydration full-param keyset log
(agent/src/server.py):

After the existing "hydrated N params" line, always log:
  post-hydration params: populated=[...] origin={k: 'record'|'caller'}

Where `origin` maps each populated field to whether it came from
the TaskTable hydration (`record`) or from the caller's invocation
body (`caller`). Enables triage of "ran with wrong repo" bugs that
could stem from either a hydration miss or a caller passing bad
input — the log now disambiguates.
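
Roughly (the helper name and the way hydrated keys are tracked are
assumptions):

  def post_hydration_log_line(params: dict, hydrated_keys: set[str]) -> str:
      """'post-hydration params' line: which fields are populated and whether
      each came from the TaskTable record or from the caller's body."""
      populated = sorted(k for k, v in params.items() if v not in (None, ""))
      origin = {k: ("record" if k in hydrated_keys else "caller") for k in populated}
      return f"post-hydration params: populated={populated} origin={origin}"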

One new test: `test_sse_logs_full_post_hydration_keyset` asserts
the line contains both the populated list and an origin mapping.

OBS-3 — stable event names on admission logs
(cdk/src/handlers/shared/create-task-core.ts):

Added structured `event` fields:
  * `task.admitted.orchestrator_skipped` — interactive mode path
  * `task.admitted.orchestrator_invoked` — happy-path orchestrator
  * `task.admitted.orchestrator_invoke_failed` — Lambda invoke err

Logs Insights can now filter on `event = '...'` without free-text
matching on message strings that may change with future
refactoring.

P1-4 — _debug_cw write-failure counter
(agent/src/server.py):

Added module-level `_debug_cw_failures` counter + lock. Every time
the daemon-thread CW write fails, we bump the counter and (every 5
failures, plus the very first) also emit a `DEBUG_CW_WRITE_FAILURES`
JSON event to the separate `sse_routing/<task_id>` stream (different
code path, less likely to fail the same way). Dashboard can alarm
on this counter to detect blind rev-5 debug paths.
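
In sketch form (the emit helper is assumed; only the counter/lock and
the first-plus-every-fifth cadence come from the description):

  import threading

  _debug_cw_failures = 0
  _debug_cw_failures_lock = threading.Lock()

  def _record_debug_cw_failure(task_id: str) -> None:
      """Count failed best-effort debug writes; surface the first failure and
      every fifth one after that through the separate sse_routing stream."""
      global _debug_cw_failures
      with _debug_cw_failures_lock:
          _debug_cw_failures += 1
          count = _debug_cw_failures
      if count == 1 or count % 5 == 0:
          _emit_debug_cw_failures_event(task_id, count)  # assumed emit helper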

Test totals: agent 397 → 400 (+3), CDK 760 (unchanged, 33 in the
affected file), CLI 199 (unchanged). TS typecheck clean.

No infra/deploy change needed for OBS-1/2; the helper writes to
LOG_GROUP_NAME which is already set. OBS-3 only affects logger
output format — no deploy strictly needed, but the next deploy will
pick up the stable event names. P1-4 counter is read internally;
emission uses the same LOG_GROUP_NAME already configured.
Type-design-analyzer follow-ups focused on replacing invariants-by-
comment with structural guarantees. Correctness-preserving refactors;
no behavior change.

TDA-1 — `_AdapterRegistry` class (agent/src/server.py):

Before: a bare module-level `dict[str, _SSEAdapter]` + a separate
`_threads_lock` guarded by convention, with three open-coded
identity-checked pop sites (`if registry.get(task_id) is adapter:
registry.pop(...)`) in _run_task_background's finally, the subscribe-
failure rollback, and the _spawn_background-failure rollback. A
fourth caller forgetting the identity check could evict a replacement
adapter that legitimately took the slot during a reconnect race.

After: `_AdapterRegistry` owns the lock internally and exposes:
- `insert(task_id, adapter)` — raises RuntimeError on conflict
  (different adapter already present), idempotent for same-identity
- `remove_if_current(task_id, adapter)` — identity-checked pop, used
  by all three rollback sites + _run_task_background's finally
- `get(task_id)`, `size()`, plus __contains__/__setitem__/__getitem__/
  pop for backward compatibility with existing tests

The bare dict is gone. All four call sites now call
`remove_if_current` or `insert`; the `with _threads_lock:` scaffolding
is removed around registry access (the class's own lock is correct).
Two new tests: insert-conflict-raises + remove_if_current-identity-
checked.
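
The core contract, sketched (back-compat dunders omitted; the real
class is `_AdapterRegistry` in server.py):

  import threading
  from typing import Any

  class AdapterRegistry:
      """task_id -> adapter map that owns its lock and enforces the
      identity-checked removal contract."""

      def __init__(self) -> None:
          self._lock = threading.Lock()
          self._adapters: dict[str, Any] = {}

      def insert(self, task_id: str, adapter: Any) -> None:
          """Idempotent for the same adapter; a different live adapter under
          the same task_id is a programming error, so raise rather than evict."""
          with self._lock:
              existing = self._adapters.get(task_id)
              if existing is not None and existing is not adapter:
                  raise RuntimeError(f"adapter already registered for {task_id}")
              self._adapters[task_id] = adapter

      def remove_if_current(self, task_id: str, adapter: Any) -> bool:
          """Pop only if the slot still holds this exact adapter, so a
          replacement that won a reconnect race is never evicted."""
          with self._lock:
              if self._adapters.get(task_id) is adapter:
                  del self._adapters[task_id]
                  return True
              return False

      def get(self, task_id: str) -> Any | None:
          with self._lock:
              return self._adapters.get(task_id)

      def size(self) -> int:
          with self._lock:
              return len(self._adapters)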

TDA-2 — `_SSEAdapter.subscription()` context manager
(agent/src/sse_adapter.py):

Before: `subscribe()` returned a raw asyncio.Queue. Callers had to
remember to call `unsubscribe(queue)` with the exact queue object.
Forgetting leaks a subscriber slot + per-observer drop counters for
the lifetime of the adapter.

After: new `adapter.subscription()` context manager yields the queue
and calls `unsubscribe` in its finally. Preferred for call sites that
can use `with`. The raw `subscribe()` / `unsubscribe()` API is
retained for `_sse_event_stream` which hands the queue off to a
StreamingResponse that outlives the subscribe call — this path
explicitly cannot use `with`, but the structural contract is now
available for everything else.

Two new tests: context-manager auto-unsubscribes on normal exit +
still unsubscribes when an exception propagates through the `with`
block.
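
A sketch of the context-manager contract (only the subscription
plumbing; fan-out and per-observer drop counters omitted):

  import asyncio
  from contextlib import contextmanager

  class _AdapterSubscriptionSketch:
      def __init__(self) -> None:
          self._queues: list[asyncio.Queue] = []

      def subscribe(self) -> asyncio.Queue:
          queue: asyncio.Queue = asyncio.Queue()
          self._queues.append(queue)
          return queue

      def unsubscribe(self, queue: asyncio.Queue) -> None:
          if queue in self._queues:
              self._queues.remove(queue)

      @contextmanager
      def subscription(self):
          """Yield a queue and guarantee unsubscribe, even when the body of
          the `with` block raises."""
          queue = self.subscribe()
          try:
              yield queue
          finally:
              self.unsubscribe(queue)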

TDA-6 — Python `ExecutionMode` Literal + normalize helper
(agent/src/server.py):

Before: `record.get("execution_mode") or "orchestrator"` — stringly-
typed. mypy couldn't catch a typo like `"interacttive"` anywhere it
was compared to an ExecutionMode-like value.

After:
- `ExecutionMode = Literal["orchestrator", "interactive"]`
- `EXECUTION_MODE_ORCHESTRATOR` / `_INTERACTIVE` annotated with the
  Literal type
- `normalize_execution_mode(raw: object) -> ExecutionMode` — single
  canonical coercion. Unknown / missing / non-string collapses to
  legacy `orchestrator`; `"interactive"` is the only non-legacy
  value.
- `_invoke_sse` uses the normalizer instead of the raw `or`.

One new test exercising the legacy-default branch for None / "" /
"unknown" / non-string input + happy-path identity returns.

Test totals: agent 400 → 405 (+5). CDK + CLI unchanged. TS typecheck
clean.

No infra or deploy change.
… TDA-5)

TDA-3 — shared ApiErrorBody envelope:

The SSE data-plane emits ad-hoc JSON on non-2xx responses (RUN_ELSEWHERE,
TASK_STATE_UNAVAILABLE, SSE_ATTACH_RACE, TASK_RECORD_INCOMPLETE) rather
than going through the REST API Gateway's ErrorResponse envelope. Before,
each code string was a bare literal on the server with no matching type
on the client. Now:

* New `ApiErrorCode` union + `ApiErrorBody<C>` interface defined in
  both `cdk/src/handlers/shared/types.ts` and `cli/src/types.ts`.
* Back-compat `execution_mode` and `missing` fields on the envelope
  for the two existing codes that flatten specifics at the top level
  (RUN_ELSEWHERE, TASK_RECORD_INCOMPLETE).
* New type guard `isApiError<C>(body, code)` in the CLI.
* `sse-client.ts`'s 409 branch now uses `isApiError(parsed,
  'RUN_ELSEWHERE')` instead of hand-rolled `parsed?.code === ...`.

TDA-4 — cross-file type drift detection:

Full `@abca/shared-types` workspace is a bigger infra change than the
rev-5 scope justifies (noted in PHASE_1B_REV5_FOLLOWUPS.md). For now,
new `cli/test/types-sync.test.ts` reads `cdk/src/handlers/shared/types.ts`
via fs + regex and asserts the `ExecutionMode` + `ApiErrorCode` unions
contain the same string members as the CLI-side canonical list. Drift
  in either direction fails CI. The test self-documents as temporary —
  once the shared-types workspace lands, the test can simply be deleted.

TDA-5 — SemanticEvent TypedDict union (agent/src/sse_adapter.py):

Added top-level TypedDicts for the six event shapes already flowing
through `_enqueue` / `get`: AgentTurnEvent, AgentToolCallEvent,
AgentToolResultEvent, AgentMilestoneEvent, AgentCostUpdateEvent,
AgentErrorEvent, and the union `SemanticEvent`. Each matches the
`metadata` dict the sibling `_ProgressWriter.write_agent_*` stores to
DDB.

For this round I left the internal method signatures (`_enqueue(event:
dict)` etc.) as `dict` to minimise diff — the TypedDicts are available
for callers that want mypy checks without forcing a type-propagation
pass. A follow-up can thread `SemanticEvent` through the signatures
when we tighten mypy strictness.
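
Shape-wise (field names below are illustrative assumptions; the real
TypedDicts mirror the metadata dicts `_ProgressWriter.write_agent_*`
stores to DDB):

  from typing import Literal, TypedDict, Union

  class AgentTurnEvent(TypedDict):
      event_type: Literal["agent_turn"]
      turn: int
      text: str

  class AgentToolCallEvent(TypedDict):
      event_type: Literal["agent_tool_call"]
      turn: int
      tool_name: str

  # AgentToolResultEvent, AgentMilestoneEvent, AgentCostUpdateEvent and
  # AgentErrorEvent follow the same pattern; in the real module the
  # union covers all six shapes.
  SemanticEvent = Union[AgentTurnEvent, AgentToolCallEvent]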

Test totals: agent 405 (unchanged), CDK 760 (unchanged), CLI 199 → 201
(+2 type-sync tests). TS typecheck clean.

No deploy needed.
POLL-1 — polling cadence aligns with design §9.13.1
(cli/src/commands/watch.ts):

Design §9.13.1 specified a 500 ms polling interval for the REST
fallback (vs Phase 1a's 2 s). Rather than flat-applying 500 ms (hot
for hours-long tasks) we decay:
  * POLL_FAST_INTERVAL_MS = 500 ms  — first 3 minutes (fresh tasks
    where users are actively watching)
  * POLL_SLOW_INTERVAL_MS = 2000 ms — after POLL_FAST_WINDOW_MS
    (long-running observations)

New `currentPollInterval(startedAt)` picks the value based on
elapsed time in the poll loop. Zero-impact on the Phase 1a polling
consumers' cost at steady state; quadruples responsiveness for the
common case.

DATA-1 — split `turns` into `turns_attempted` + `turns_completed`:

Background (from user review): setting `max_turns=6` and hitting the
cap yields `turns=7` in TaskTable because the Claude Agent SDK
reports `ResultMessage.num_turns` as the number of attempts,
including the one rejected by the cap. "Why does the DDB show 7
when I asked for 6" is a legitimate operator confusion.

Changes:

agent/src/models.py — TaskResult gains two new fields:
  * turns_attempted  — the SDK-authoritative counter (= legacy `turns`)
  * turns_completed  — clamped to max_turns when
    `agent_status='error_max_turns'`, else == turns_attempted
The legacy `turns` field is retained (= turns_attempted) for
backward compat with existing DDB consumers; the CLI / docs can
migrate to the new pair over time.

agent/src/pipeline.py — compute both counters before TaskResult:
  turns_attempted = agent_result.num_turns or agent_result.turns
  turns_completed = (
    min(turns_attempted, config.max_turns)
    if agent_status == "error_max_turns" and turns_attempted
    else turns_attempted
  )

agent/src/task_state.py — `write_terminal` writes both new fields
to TaskTable alongside the legacy `turns`.

cdk/src/handlers/shared/types.ts — `TaskRecord` + `TaskDetail` gain
the two optional number fields; `toTaskDetail` forwards them.

cli/src/types.ts — `TaskDetail` mirrors (optional/nullable; wire
legacy rows populate only `turns`).

No CLI formatter change in this commit — the two-field display is a
legitimate UX decision to make separately (current `turns=N` line
keeps working). Dashboards and anyone querying DDB can now
self-serve the truthful answer.

Test totals: agent 405 (unchanged — the test suite uses the
pipeline shape without asserting new fields; adding a dedicated test
is overkill for the min() clamp), CDK 760 (unchanged for handlers),
CLI 201 (unchanged). TS typecheck clean.

No deploy strictly needed for POLL-1 (client-only change); DATA-1
Lambda changes take effect on next deploy.
Rewrite PHASE_1B_REV5_FOLLOWUPS.md to accurately reflect the current
state after Rounds 1–5 shipped. Previously the doc still listed all
P1/TDA/OBS/POLL/DATA items as "pending" when they had in fact landed
in commits fce9d07 (Round 1), bd7b886 (Round 2), 0d29939 (Round 3),
bc56731 (Round 4a), 228c935 (Round 4b), and dfe7b84 (Round 5).

Structure of the new doc:

* **Round summary table** at the top — maps each round to its commit
  and status so future readers can walk the history.
* **Landed (by round)** — every validator-surfaced item is listed
  under the round that landed it, with a one-sentence summary of what
  was fixed.
* **Pending** — only CDK-1 remains (filing the upstream
  `AssetImage.bind` bug against `@aws-cdk/aws-bedrock-agentcore-alpha`).
  It's an admin task, not code — documented with filing instructions
  + the CFN template output operators should include in the repro.
* **Candidates NOT landed by design** — full shared-types workspace,
  `SemanticEvent` threaded through signatures, CLI formatter for the
  new turns fields. Each has a note on what triggers taking it on.

No code change; doc only.
Filed aws/aws-cdk#37663 for the `AssetImage.bind` double-attach guard
that causes the second runtime's execution role to silently miss ECR
pull permissions when a single `AgentRuntimeArtifact.fromAsset` is
shared across two `agentcore.Runtime` constructs.

Updated the code comment at `cdk/src/stacks/agent.ts:55-68` to link
the upstream issue so future maintainers can check its status before
removing the two-artifact workaround. Updated
`PHASE_1B_REV5_FOLLOWUPS.md` CDK-1 entry to mark the filing as done.

No behavior change.
Only draw.io viewport coordinates (dx/dy) and a 10px horizontal nudge
on the fan-out note on page 2. No cell additions or content changes.
Committing to keep the working tree clean before E2E re-validation.
…8.9)

DynamoDB Streams on `TaskEventsTable` have been enabled since Phase 1b
Step 1 but lacked a consumer. This commit adds the fan-out Lambda +
construct + tests so non-interactive channels (Slack, GitHub PR
comments, email) can receive task milestones without any change to
the agent or CLI — exactly the shape the design calls for:

  "the fan-out Lambda itself can ship later without any change to
   the agent or CLI" — §8.9

Shipped as a skeleton: dispatcher stubs log each would-be delivery
to CloudWatch but don't actually call Slack / GitHub / SES yet.
Enabling a real dispatcher is a per-channel PR (add the SDK client,
replace the `log-only` block, add an IAM policy or Secrets Manager
grant, add config to the construct's props).

## Handler — `cdk/src/handlers/fanout-task-events.ts`

* `parseStreamRecord(record)` flattens a NEW_IMAGE DDB Stream record
  to a plain `FanOutEvent` (task_id, event_id, event_type, timestamp,
  metadata). Ignores REMOVE records. Returns `null` on malformed
  input so one poisonous record doesn't fail the batch.
* `shouldFanOut(event)` closed allow-list: `task_created`,
  `task_failed`, `task_completed`, `task_cancelled`, `task_stranded`,
  `agent_milestone`, `agent_error`, `pr_created`. Dropped:
  `agent_turn`, `agent_tool_call`, `agent_tool_result`,
  `agent_cost_update` — those are the "verbose text triad" pattern
  for live SSE watchers, not async channel delivery.
* Per-task per-invocation cap of 20 events
  (`MAX_EVENTS_PER_TASK_PER_INVOCATION`) bounds spam from chatty
  agents. A future follow-up can promote this to a DDB-backed rate
  limiter.
* Three dispatcher stubs — `dispatchToSlack`, `dispatchToGitHubComment`,
  `dispatchToEmail` — each catches its own errors so one failing
  channel doesn't block the others. MUST NOT throw on transient
  errors: that would replay the whole batch and duplicate
  deliveries. Persistent failures fall to the DLQ.
* Structured logs with stable `event` names
  (`fanout.batch.complete`, `fanout.slack.dispatch_stub`,
  `fanout.github.dispatch_stub`, `fanout.email.dispatch_stub`,
  `fanout.rate_limit.hit`, `fanout.*.dispatch_failed`) for Logs
  Insights queries.

## Construct — `cdk/src/constructs/fanout-consumer.ts`

* `FanOutConsumer` wires DynamoDB Streams → Lambda via
  `DynamoEventSource` with `StartingPosition.LATEST`, batch size 100,
  max batching window 5s.
* 3 retry attempts + DLQ (14-day retention) via `SqsDlq` on
  `onFailure`. `reportBatchItemFailures: true` so partial batch
  failures don't replay successful records.
* cdk-nag suppressions: `IAM5` for generated stream wildcard perms,
  `SQS3` for the DLQ having no DLQ-of-its-own (would be recursion).

## Tests — `cdk/test/handlers/fanout-task-events.test.ts`

24 new tests covering:
* `parseStreamRecord`: well-formed INSERT, REMOVE (ignored), missing
  required fields.
* `shouldFanOut`: 8 fanned-out event types + 9 dropped event types
  (`test.each`).
* `handler`: mixed-batch filter, per-task cap at 20, malformed
  records dropped not thrown, REMOVE skipped.

## Wiring

`cdk/src/stacks/agent.ts` instantiates `FanOutConsumer` alongside
the other scheduled consumers (ConcurrencyReconciler,
StrandedTaskReconciler). No other stack changes.

## Live validation

Deployed to backgroundagent-dev; smoke-tested with `bgagent run` on
scoropeza/agent-plugins (task 01KPTPTDT69X26FW3QJS0Q8QA4, PR aws-samples#18
created). Fan-out Lambda processed the stream batches, dropped all
verbose events (agent_turn / agent_tool_call / agent_tool_result /
agent_cost_update), and dispatched 2 `agent_milestone` events to
the Slack dispatcher stub (repo_setup_complete + pr_created
milestone). Log excerpt:

  {"event":"fanout.slack.dispatch_stub","task_id":"01KPTPTDT...",
   "event_id":"01KPTQF2...","event_type":"agent_milestone"}
  {"event":"fanout.slack.dispatch_stub","task_id":"01KPTPTDT...",
   "event_id":"01KPTQGG...","event_type":"agent_milestone"}

## Test totals

agent 405 (unchanged), CDK 760 → 784 (+24), CLI 201 (unchanged).
TS typecheck clean. All existing tests pass.

## What's next (not in this commit)

Per-channel integrations land incrementally as separate PRs. Each
adds its SDK dep, replaces the stub, wires config (tokens /
credentials via Secrets Manager), and tests. The interface is
stable — dispatchers are async void functions that don't throw on
transient errors.

Also: `docs/interactive-agents-phases-v3.drawio` viewport-only diff
from local inspection, rolled into this commit.
…me paths

Corrects v3 page 8 that mistakenly depicted nudge as a Phase 1c+
WebSocket-only feature. Per design §10, nudge is a REST-submitted,
DDB-backed, between-turns injection that works identically on
Runtime-IAM (orchestrator) and Runtime-JWT (direct interactive) —
no WebSocket required.

Page 8 rewrite (main change):

* Title: "Phase 2 preview — nudge (user → agent steering, REST)"
* Side-by-side layout: Phase 2 REST (primary, green, left) and
  Phase 1c+ /ws (optional, orange, right). Both on one page so the
  design call is visible.
* Left column shows the submission lane (CLI → API GW →
  nudge-task.ts Lambda → TaskNudgesTable DDB → 202) and the
  consumption lane with parallel Runtime-IAM and Runtime-JWT
  microVMs each running `nudge_reader.py` polling the shared
  TaskNudgesTable.
* Right column reframes /ws as "synchronous ack only" for
  Cedar-gated tool-use approvals and mid-turn CANCEL frames —
  explicitly independent of Phase 2 nudge.
* Adds "What Phase 2 nudge is NOT" yellow callout (not WebSocket,
  not mid-turn, not a new streaming channel) and "Risk:
  Medium-High" red callout citing design §10 and §11 riskiest-
  assumption aws-samples#3.

Page 2: the existing "Future: WebSocket bidirectional path" box is
rewritten as "Phase 2 preview — Nudge (REST, both runtime paths)"
with the POST /v1/tasks/{id}/nudge → TaskNudgesTable →
nudge_reader.py summary; notes WS is a separate deferred concern.

Page 3: bidirectional plane cell reframed to "synchronous tool-call
approvals + HITL ack within <100ms" and explicitly states async
nudges use REST (Phase 2). The "nudges ⇄ events" edge label becomes
"sync approvals ⇄ events".

Pages 1, 4, 5, 6, 7 unchanged (they accurately depict Phase 1b
rev-5 as-deployed).

Cell count: 848 → 888 (+40 new cells for Phase 2 concepts).
Structural validator: clean. Light + dark mode Playwright
screenshots captured and validated. v3 retained as reference.
…between-turns hook

Implements Phase 2 of the interactive-agents design: while a task is running
the operator can send a short steering message via `bgagent nudge <task-id>
<message>` which is guardrail-screened, rate-limited (10/min/task), and
persisted in a new TaskNudgesTable. The agent container picks it up at the
next Stop-hook (between-turns) seam and injects it as an authoritative
<user_nudge> XML block so the next turn incorporates it naturally.

Architecture:
- POST /v1/tasks/{task_id}/nudge — auth → ownership → state gate →
  guardrail (fail-closed) → rate-limit → persist (202 Accepted)
- Agent: between-turns hook registry (extensible for Phase 3 approval
  gates) reads pending nudges, marks them consumed via conditional
  UpdateItem, formats as XML, returns them via SDK `decision: "block"` /
  `reason: "<xml>"` mechanism.
- Observability: new `agent_milestone(nudge_applied)` event fires on
  both ProgressWriter and SSE adapter so the CLI stream shows a visible
  star marker when a nudge reaches the agent.
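
A sketch of the between-turns reader (helper names, record fields and
the exact hook signature are assumptions; the `decision: "block"` /
`reason` return is the SDK mechanism the design names):

  def nudge_between_turns_hook(task_id: str, seen: set[str]) -> dict | None:
      """Stop-hook seam: pick up pending nudges and hand them back to the SDK
      so the next turn incorporates them as authoritative user input."""
      pending = read_pending_nudges(task_id)      # TaskNudgesTable query (assumed)
      fresh = [n for n in pending if n["nudge_id"] not in seen]
      if not fresh:
          return None  # nothing to inject; let the Stop event proceed normally

      for nudge in fresh:
          # Conditional UpdateItem marks the row consumed; the process-lifetime
          # `seen` set guards against re-injection if that write keeps failing.
          mark_consumed(task_id, nudge["nudge_id"])
          seen.add(nudge["nudge_id"])

      xml = "\n".join(f"<user_nudge>{n['message']}</user_nudge>" for n in fresh)
      return {"decision": "block", "reason": xml}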

Drive-by fixes bundled:
- sse_wire: stamp turn/model/tool_calls_count on TEXT_MESSAGE_START+END
  so CLI renders real `Turn #N (model, N tool calls)` instead of `#0`.
- pipeline: extract `_compute_turns_completed` helper + add 8 unit tests
  covering the Rev-5 DATA-1 `error_max_turns` clamp (previously uncovered).
- nudge_reader.mark_consumed: alias `consumed` via ExpressionAttributeNames
  (DDB reserved keyword — caught during live E2E, MagicMock tests had
  missed it; test hardened with an explicit alias-present assertion).

Security posture:
- Rate-limit check runs AFTER guardrail so a blocked message doesn't
  consume a slot.
- State gate rejects nudges to tasks in terminal states with
  TASK_ALREADY_TERMINAL.
- Process-lifetime dedup set guards against infinite re-injection if
  mark_consumed persistently fails.
- 30-day TTL on nudge rows + ~2-min TTL on rate-limit counter rows.

Tests: agent 453 / CDK 835 / CLI 212 — all green (+51 net).
E2E validated on backgroundagent-dev via PRs scoropeza/agent-plugins#19,
aws-samples#20, aws-samples#22 covering both RuntimeJwt (interactive) and Runtime-IAM
(orchestrator) paths plus the polling watch transport.

Refs: docs/design/INTERACTIVE_AGENTS.md §10
…sals

v4 pre-dated Phase 2 execution and treated 1a/1b/2 as separate per-phase
pages. v5 folds them into a single "current AWS architecture" page that
reflects what's actually deployed on backgroundagent-dev as of 2026-04-23:

- All 15 application Lambdas, 6 DDB tables (incl. TaskNudgesTable),
  both AgentCore runtimes (Runtime-IAM + Runtime-JWT), Bedrock Guardrail
  + Claude Sonnet 4.6, shared Cognito user pool (API GW + Runtime-JWT
  authorizer), Secrets Manager, DDB-Stream fan-out to FanOutFn, VPC +
  DNS Firewall + VPC endpoints.
- Three-plane color legend (control/streaming/fan-out) used consistently
  across all architecture pages.

New content:
- Page 5: dedicated bgagent submit + watch (orchestrator + polling) flow
- Page 7: Nudge happy path (21-step sequence incl. the DDB reserved-
  keyword bug call-out)
- Page 8: Phase 1c WebSocket proposal (side-by-side with current SSE)
- Page 9: Phase 2.5 mid-turn interrupt proposal (stop_poller sibling
  task + 4 risk panels)
- Page 10: Phase 3 Cedar-HITL approval proposal (REQUIRE_APPROVAL
  extension of the existing Cedar policy engine, AWAITING_APPROVAL
  state, escalation scopes) - labelled as pending final §9.3 revision.
- Page 11: decision rationale refreshed - D1/D2/D3 now carry "validated
  in production" status pills; adds D4 (rate-limit-after-guardrail),
  D5 (between-turns hook registry), D6 (TaskNudgesTable separate from
  TaskTable), D7 (observability-first nudge delivery via milestone).

Carried forward from v4 (unchanged or minimally retitled):
- Pages 2/3/4/6: control vs streaming planes, channel-fit matrix, dual
  happy-path sequence, SSE reconnection flow.
Replaces the rev-3 3-tier HITL model (§9.3 in INTERACTIVE_AGENTS.md)
with the Cedar-policy-driven approach agreed by Sam+Alain 2026-04-17:
extends the existing in-process Cedar engine with a third outcome
(REQUIRE_APPROVAL) via a physical split between hard-deny and
soft-deny policy sets. Same language, broader semantics.

New design doc PHASE3_CEDAR_HITL.md (1325 lines, 16 sections):
- Two-policy-set evaluation model + annotation merge rules
- PreToolUse hook extension + DDB polling wait (2s to 5s)
- TaskApprovalsTable + ApproveTaskFn/DenyTaskFn Lambdas
- New scopes: this_call, tool_type:X, bash_pattern:X, rule:X, all_session
- Pre-approvals at submit time via --pre-approve (20-entry cap)
- Per-task + per-rule timeout overrides via @approval_timeout_s
- AWAITING_APPROVAL state; orchestrator + reconciler updates
- Security model: hard-deny absolute, ownership checks, race prevention
- Five end-to-end sample scenarios (force-push / DROP TABLE /
  all_session / steering-by-denial / AI-DLC phased pre-approval)
- 32-task implementation plan
- Review checklist + deferred items

Companion draw.io (12 pages): architecture delta, two-tier flow,
REQUIRE_APPROVAL/DENY/TIMEOUT sequences, pre-approvals, scope
escalation, state machine, policy authoring reference, three E2E
scenarios, decision log.

Verified during design: cedarpy annotations work (recoverable via
policies_to_json_str JSON). Cedar 'like' is glob-only (*, ?), not
regex. diagnostics.reasons returns list of matching policy IDs on
multi-match.

INTERACTIVE_AGENTS.md §9.3 now cross-links to the detailed doc.
The rev-3 3-tier text is retained for historical context only.

No implementation yet. Next steps: independent data-flow, security,
and functional-correctness review agents.
…rev 2)

Three parallel reviews (data-flow, security, functional) identified seven
P0 themes plus ~27 P1/P2 items. Rather than amending rev 1 with patches,
this rewrite folds every P0 into the main design body as if the design
had always contained it. Easier to read for implementers and removes
the "which version is authoritative" confusion.

P0 themes now baked into the design:
- Atomic state transitions via DDB TransactWriteItems for both the
  approval-request creation and the resume path (decisions D11).
- Ownership encoded in the ConditionExpression on ApproveTaskFn /
  DenyTaskFn, not fetch-then-check — authorization + state transition
  atomic in a single DDB call (D12).
- Per-task approval-gate cap (50) + per-minute rate limit (20) +
  per-user notification cap (10) to prevent DoS loops (D13-14).
- Recent-decision cache (60s) blocks identical retries after
  DENIED/TIMED_OUT to prevent gate storms (D15).
- Denial reason sanitized by output_scanner in DenyTaskFn BEFORE
  persisting — the agent never sees unscanned user text (D16).
- tool_input_preview ANSI/control-char stripped at two layers to
  prevent approver-confusion attacks (D17).
- Denial steering text injected via Stop hook between_turns_hooks
  pattern (reuses validated Phase 2 nudge path) rather than relying
  on SDK permissionDecisionReason alone (D18).
- Rule discovery via new GET /v1/repos/{repo}/policies + bgagent
  policies list command, plus new bgagent pending command that
  solves the ULID copy-paste problem (D19).
- New scopes: write_path:<glob> and tool_group:file_write (D20, D21).
- Day-1 cedarpy annotation spike locked in as the first
  implementation task to de-risk the design-load-bearing assumption
  that policies_to_json_str() returns annotations as expected (D22).

Structural changes:
- Decision log grew from 10 to 22 locked decisions.
- TTL on TaskApprovalsTable sized to timeout_s + 120s (was fixed 1h);
  always covers the decision window.
- matching_rule_ids changed from DDB StringSet (can't be empty) to
  List (tolerates pathological empty-match case).
- TaskApprovalsTable Streams confirmed OFF (approval events audit
  flows via TaskEventsTable, not a second stream).
- Stranded-task reconciler gets an AWAITING_APPROVAL branch to close
  the container-eviction gap (tasks stuck >2×timeout_s → FAILED
  with approval_stranded event).
- maxPreApprovalScope defined as a partial order; this_call as a cap
  rejected at blueprint load.
- Degenerate scope patterns (*, **, ratio>50%) rejected at submit.
- Effective-timeout computation clamps by maxLifetime remaining with
  a 30s floor; tasks near the lifetime cap deny fast rather than
  strand.
- approval_timeout_s ceiling capped at 3600s (was up to 8h);
  prevents slot-exhaustion attacks.

Task list grew from 32 to ~37 items to cover the new shared parser,
discovery endpoint, reconciler updates, and two new CLI commands.

New section §16 "Implementation notes" captures the 18 P1/P2
carry-forward items as a flat TODO list; to be consumed and removed
during implementation.

Deferred list (§17) expanded: added runtime allowlist revocation,
bulk approve, shell completion, richer annotations
(@approval_requires_mfa, @approval_channel), cross-task scope
inheritance.

Design doc rev 1 → rev 2; total 1325 → 1704 lines.

Companion draw.io file will be updated in a follow-up commit
focused on the sequence-diagram atomicity changes.
Four cleanup changes in one commit:

1. Move diagrams to a dedicated subdirectory:
   - docs/interactive-agents-phases-v5.drawio → docs/diagrams/interactive-agents-phases.drawio
   - docs/phase3-cedar-hitl.drawio → docs/diagrams/phase3-cedar-hitl.drawio
   Version numbers dropped from filenames — git history is the version history.

2. Remove outdated draw.io files:
   - docs/interactive-agents-architecture.drawio (superseded by phases diagram)
   - docs/interactive-agents-phases.drawio (v1)
   - docs/interactive-agents-phases-v2.drawio
   - docs/interactive-agents-phases-v3.drawio
   - docs/interactive-agents-phases-v4.drawio

3. Update cross-links in design docs + in the diagram's own legend to
   point at the new paths.

4. Update .gitignore to suppress draw.io editor auto-save backups
   (.$*.drawio.bkp) and any screenshot renderings saved alongside
   sources (docs/diagrams/*.png, docs/diagrams/screenshots/).
…ure flag, finalize starter set, lock cedar-wasm for Lambda

Four design adjustments from the 2026-04-24 pre-implementation discussion:

1. Remove feature-flag gating (§15.4 rewrite).
   Cedar-HITL is standard functionality. No per-repo enable/disable flag.
   The safety posture of a task is determined by the loaded policy set
   (built-in + blueprint) plus the user's --pre-approve scopes. Users
   who want fully autonomous execution pass --pre-approve all_session
   --yes. Repos that want more/fewer gates customize via blueprint.

2. Finalize the starter soft-deny set.
   Built-in soft-deny ships with 5 rules:
     - force_push_any (medium, 300s)
     - push_to_protected_branch (main/prod/master/release/*, medium, 300s)
     - force_push_main (force to main/prod, high, 600s)
     - write_env_files (*.env, high, 600s)
     - write_credentials (*credentials*, high, 300s)

   Dropped from starter set (but documented as copy-paste blueprint
   patterns in §5.3):
     - write_infrastructure (the "infrastructure/" path is a
       convention, not a standard; ABCA itself uses cdk/, other repos
       vary. Belongs in per-repo blueprint.)
     - webfetch_any (DNS Firewall already restricts egress; gating
       every WebFetch is high-noise with low marginal security.)

   Hard-deny is unchanged: rm_slash, write_git_internals (+ nested),
   drop_table. Absolute, no scope bypasses.

3. Move CloudWatch alarms to deferred (§11.5).
   Alarms require a notification channel (Slack / PagerDuty / SNS /
   email) the project doesn't yet have. Dashboard widgets stay in
   scope — they're read-only, operator-facing, no channel required.
   Alarms become a follow-up when an ops channel is added; all
   supporting metric data already flows via §11.1 events.

4. Lock @cedar-policy/cedar-wasm as the Lambda-side parser (§15.6
   new section).
   Spiked 2026-04-24 against v4.10.0: 4.1 MB unzipped, CJS via the
   /nodejs sub-export, annotation round-trip verified (annotations
   preserved verbatim under json.annotations), multi-match reason
   reporting identical to Python cedarpy. Design includes the
   TypeScript wrapper sketch for shared/cedar-policy.ts and documents
   the four API differences from Python cedarpy implementers will
   encounter.

SECURITY.md gains a two-sentence note under "Blueprint custom steps
trust boundary" explicitly tying Phase 3's security model to the
deploy-only property of Blueprint.security.cedarPolicies. If that
property ever changes (user-uploaded policies via API), Phase 3's
§12 trust model must be re-evaluated.

Total doc delta: +189 / -45 across PHASE3_CEDAR_HITL.md + SECURITY.md.
Design doc now 1846 lines.
…ncel)

Two root causes, both user-visible:

1. Lambda bundling excluded @aws-sdk/client-bedrock-agentcore
   (`externalModules: ['@aws-sdk/*']`).  The Node 24 Lambda runtime's
   built-in SDK predates the release of `StopRuntimeSessionCommand`,
   so the command was undefined at runtime:

     "StopRuntimeSessionCommand is not a constructor"

   The catch block swallowed the error and logged a misleading
   "session may already be gone" warning.  Every cancel has been
   silently no-opping on the AgentCore side since the feature shipped;
   agents kept running and pushed PRs after cancel.

   Same failure mode applied to orchestrate-task for any future
   StopRuntimeSession call.  Fix: narrow externalModules to stable
   clients only and bundle bedrock-agentcore explicitly in both
   task-api.ts and task-orchestrator.ts.

   Raised CancelTaskFn timeout 3s -> 15s and memory 128 -> 256 MB --
   the bedrock-agentcore TLS handshake adds cold-start cost (13s
   observed), which the previous 3s budget could not absorb.

2. Even with StopRuntimeSession working, the pipeline's post-hook
   chain (ensure_committed -> ensure_pr) runs AFTER `run_agent`
   returns, regardless of cancel state.  If the SDK ever stopped
   between turns for any reason (e.g. Stop-hook signal) while a
   cancel was in flight, the post-hooks would still commit and push
   a PR.  Defense-in-depth:

     - agent/src/hooks.py: new `_cancel_between_turns_hook` runs
       first in `between_turns_hooks`.  Reads TaskTable status each
       Stop event; on CANCELLED, sets a ctx sentinel.  `stop_hook`
       observes the sentinel and returns `{continue_: False,
       stopReason: "Task cancelled by user"}` so the SDK halts.
       Fails open on TaskFetchError -- a transient DDB blip must not
       be mistaken for a cancel signal.

     - agent/src/pipeline.py: re-checks TaskTable status after
       `run_agent` returns.  If CANCELLED, skips post-hooks entirely
       and returns a terse cancelled-result dict.  Emits a
       `task_cancelled_acknowledged` milestone for stream visibility.
       `write_terminal` is skipped because its ConditionExpression
       only allows RUNNING/HYDRATING/FINALIZING -- cancel-task.ts has
       already written the CANCELLED row.
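
Both layers in sketch form (status reader, milestone emitter and hook
wiring are simplified; `run_agent`, `ensure_committed`, `ensure_pr` and
`TaskFetchError` are the names used above, everything else is assumed):

  def cancel_between_turns_hook(task_id: str, ctx: dict) -> None:
      """Runs first in between_turns_hooks: read TaskTable status on each Stop
      event; fail open on a transient DDB error so a blip is never mistaken
      for a cancel signal."""
      try:
          status = get_task_status(task_id)   # TaskTable read (assumed helper)
      except TaskFetchError:
          return
      if status == "CANCELLED":
          ctx["cancel_requested"] = True

  def stop_hook(ctx: dict) -> dict | None:
      """Observe the sentinel and tell the SDK to halt instead of continuing."""
      if ctx.get("cancel_requested"):
          return {"continue_": False, "stopReason": "Task cancelled by user"}
      return None

  def run_task(task_id: str, ctx: dict) -> dict:
      """Pipeline-side second layer: never run post-hooks on a cancelled task."""
      result = run_agent(task_id, ctx)
      if get_task_status(task_id) == "CANCELLED":
          emit_milestone(task_id, "task_cancelled_acknowledged")  # assumed emitter
          return {"status": "cancelled"}      # terse result; no write_terminal
      ensure_committed(result)
      ensure_pr(result)
      return result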

Tests added:

  - agent/tests/test_cancel_hook.py (9 tests): hook-level unit tests
    covering cancel sentinel, DDB fail-open, nudge co-existence, and
    `continue_: False` propagation.
  - agent/tests/test_pipeline.py (2 tests): run_task skips
    `ensure_pr` / `ensure_committed` on CANCELLED and still runs them
    normally on RUNNING.
  - agent/tests/test_nudge_hook.py: registry-order assertion updated
    to [cancel, nudge].

E2E re-validated on backgroundagent-dev:

  Before: cancel persisted at t+38s; agent continued 45 turns,
  emitted agent_execution_complete + pr_created, PR aws-samples#26 pushed on a
  cancelled task.  140 events total.

  After: cancel persisted at t+29s; StopRuntimeSession invoked
  successfully; agent's last event at t+39s; no
  agent_execution_complete, no pr_created, no PR pushed.  7 events
  total.

Totals: agent 464 (+11), CLI 215, CDK 879.
The UserStatusIndex GSI uses `status_created_at` (composite
`{status}#{created_at}`) as its sort key.  TypeScript writers
(create-task-core, cancel-task, reconcile-stranded-tasks) correctly
rewrite this field on every transition, but the Python agent-side
writers (write_running, write_terminal) only updated `status` and
the timestamp fields -- leaving `status_created_at` frozen at the
SUBMITTED# prefix written at task creation.

Effect: the GSI's sort key lied about the current status.  Queries
that key-condition on the prefix (e.g.
`begins_with(status_created_at, 'COMPLETED#')`) returned zero rows
for COMPLETED tasks because the real row was stored under the
`SUBMITTED#` prefix.  Status-filtered operator queries silently
returned incomplete results; no visible error.

Fix: both `write_running` and `write_terminal` now set
`status_created_at` to `{status}#{now}` alongside the status write,
matching the TypeScript pattern.  The timestamp shares the write's
`started_at` / `completed_at` value so operators can cross-reference
base-table rows against the GSI without wondering which write
happened first.
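
A sketch of the write, to make the invariant concrete (key and
attribute names are the ones the fix describes; everything else is
illustrative):

  from datetime import datetime, timezone

  def write_terminal_update(status: str) -> dict:
      """Build the UpdateItem pieces that keep the UserStatusIndex sort key
      ('{status}#{timestamp}') in step with every status transition."""
      now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
      return {
          "UpdateExpression": (
              "SET #st = :status, completed_at = :now, status_created_at = :sca"
          ),
          "ExpressionAttributeNames": {"#st": "status"},   # reserved word alias
          "ExpressionAttributeValues": {
              ":status": status,
              ":now": now,
              # Sharing the timestamp lets operators cross-reference the base
              # table against the GSI without wondering which write came first.
              ":sca": f"{status}#{now}",
          },
      }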

Tests:
  - TestWriteRunningMaintainsStatusCreatedAt: verifies the
    RUNNING# prefix is written with the expected ISO-Z timestamp.
  - TestWriteTerminalMaintainsStatusCreatedAt: covers the
    COMPLETED and FAILED prefixes plus the sca==completed_at
    invariant.

E2E re-validated on backgroundagent-dev with a fresh task: the new
record shows `status_created_at = COMPLETED#<now>` and is returned
first by a `begins_with(status_created_at, 'COMPLETED#')` query.
Existing tasks in the backgroundagent-dev table were backfilled by
hand with a one-shot scan + UpdateItem (27 rows corrected); no
backfill Lambda is shipped because production deployments start
empty and the CDK's TaskTable TTL (90 days) ages stale rows out.

Follow-up tracked separately: `bga list` without a `--status` filter
still groups by status lexicographically (FAILED > COMPLETED >
CANCELLED ...) because the current CLI handler doesn't use
status_created_at as a KeyCondition.  That is a CLI-side design
change (new GSI or new query strategy), not a data-consistency fix,
and does not block this PR.

Totals: agent 468 (+4), CLI 215, CDK 879.
@scoropeza scoropeza requested a review from a team as a code owner April 29, 2026 05:54
@krokoko
Contributor

krokoko commented Apr 29, 2026

Thanks @scoropeza! Just had a quick first pass; solid foundation! The test coverage is strong (879 CDK + 468 agent + 215 CLI), the failure modes are thought through (stranded-task reconciler, cancel race fix, nudge dedup), and the immediate value is real. The feedback below is about long-term direction.

Question: The two-runtime split (IAM for background, JWT for interactive) is technically sound and solves the immediate problem. But stepping back, there's a tension in the UX it creates and in its compute-substrate portability, and that tension becomes more pronounced when you consider HITL.

The current model forces developers to choose their observation capability at submission time:

  • bgagent submit — resilient (survives CLI crash), but polling-only observation
  • bgagent run — live SSE stream, but task dies if the CLI disconnects

The developer workflow most people want is: submit a task, forget about it, check in live whenever they feel like it, steer if needed, leave again. The current model forces you to choose at submission time — submit (resilient but polling-only) vs run (live stream but fragile). You can't upgrade a running orchestrator task to a live stream mid-flight because the agent is on Runtime-IAM and there's no SSE socket to attach to.

The direct-SSE path is AgentCore-specific. It relies on /invocations as the transport and AgentCore's auth layer for gating. On ECS (or any other compute), this doesn't exist. The durable path (ProgressWriter → DDB → polling) is already compute-agnostic and works everywhere.

Phase 3 introduces approval gates where the agent pauses and waits for human input — potentially for hours (reviewer in another timezone, overnight deploy windows, etc.). HITL is inherently asynchronous: the agent and the human operate on completely different timescales.

With direct-SSE on Runtime-JWT:

  • Agent pauses → writes "approval needed" to the SSE stream
  • If the CLI disconnects during the wait, what happens? Task orphans or needs special keep-alive logic.
  • If the task was submitted via bgagent submit (orchestrator path), there's no SSE stream at all — how does the user learn an approval is needed?
  • You end up building a notification/polling path anyway for the non-interactive case.

Have you thought about a relay-based streaming alternative?

The FanOutConsumer skeleton in this PR is the first stage of a compute-agnostic real-time path:

Agent (any compute) → DDB TaskEventsTable → DDB Stream → Lambda → API Gateway WebSocket → CLI

This gives you:

  • Submit and forget, attach live whenever (no upfront choice)
  • Works on AgentCore, ECS, EC2, anything — agent just writes to DDB
  • Agent crash doesn't kill the stream (events already durable)
  • Single runtime (IAM) is sufficient — no JWT runtime needed
  • CLI auth is just Cognito on the WebSocket API (already have the token)

The trade-off is ~1-5s latency from the DDB Stream batching window. Could we consider that acceptable?

I'm thinking that will be useful for implementation of HITL as well. The core constraint of HITL: the agent stops and waits. Maybe for 10 seconds. Maybe for 3 hours (reviewer is in a different timezone). The agent must survive that wait regardless of whether anyone is watching.

With the relay approach, HITL is just another DDB event:

Agent hits gate
→ writes { event: "approval_required", gate: "push_to_main", details: ... } to DDB
→ enters poll loop: check DDB every N seconds for a response
→ sleeps (costs nothing, agent is idle, compute is warm but not burning tokens)

Meanwhile, independently:
DDB Stream → WebSocket → CLI shows "Agent paused: wants to push to main. Approve? (y/n)"

Or: nobody is watching. That's fine. Agent just waits.

User responds whenever:
CLI → POST /v1/tasks/{id}/approval → writes to DDB (same pattern as nudges) → agent's next poll picks it up → continues or aborts

…tion prompts

CI on PR aws-samples#52 reported a build failure in `agent:lint:fix`: 29 Ruff
errors in `agent/src/` and `agent/tests/` (17 auto-fixed, 12
remaining).  Running the repo's full pre-commit + pre-push hook
sweep (`mise run hooks:run`) locally for the first time also
surfaced ty type-check errors, CDK eslint errors, CLI eslint no-shadow
errors, and auto-format mutations — all pre-existing in code our
branch introduced but previously unchecked because the prek hooks
could not install against a machine where Amazon git-defender owns
`core.hooksPath`.

Agent (Ruff + ty):
- Line-too-long (E501, 7 sites) — server.py and tests, all rewrapped.
- Collapsible-if (SIM102, 2 sites) — server.py post-hydration block.
- Try/except/pass (S110 + SIM105) — server.py CloudWatch reporter
  now uses `contextlib.suppress(Exception)` for the best-effort
  inner write.
- Module-import-not-at-top (E402) — sse_adapter.py `from shell
  import log` moved to the top-of-module import block.
- Lazy ClientError shadow (ty invalid-assignment) — nudge_reader.py
  now annotates the lazy-imported `ClientError` as
  `type[Exception] | None` so ty accepts the None-fallback.
- JSONResponse body narrowing (ty unresolved-attribute) —
  test_server.py `resp.body.decode()` → `bytes(resp.body).decode()`
  at 4 sites.  FastAPI types `body` as `bytes | memoryview[int]`;
  ty rejects `.decode()` on the union.
- Collapsed nullability flow (ty) — the
  `if task_id: if _active_sse_adapters.remove_if_current(...)` nest
  that SIM102 asked us to
  collapse also carried an `sse_adapter is not None` narrowing; the
  AND-chain now includes it explicitly so ty is happy.

CDK (eslint):
- `require()`-style import (@typescript-eslint/no-require-imports) —
  fanout-consumer.ts now imports `SqsDlq` at the top with
  `DynamoEventSource`.
- Unbounded Promise.all parallelism (@cdklabs/promiseall-no-unbounded-
  parallelism) — fanout-task-events.ts has a hardcoded three-channel
  dispatcher list, not user input; added the suggested eslint-disable
  with a justification comment.
- Duplicate imports (no-duplicate-imports) — nudge-task.ts merges the
  named-value and type imports from `./shared/types` into one
  statement.

CLI (eslint):
- no-shadow on `stderrSpy` (2 sites in run.test.ts) — the describe-
  scope declaration was only assigned in `beforeEach`; two later
  tests re-declared local `stderrSpy` consts.  Renamed the locals to
  `localStderrSpy` so they don't shadow the outer.

Housekeeping:
- Deleted `PHASE_1A_IMPLEMENTATION_PROMPT.md` and `RESEARCH_PROMPT.md`
  from the repo root.  These were operator-facing prompts used during
  Phase 1a scaffolding; Phase 1a has shipped and the prompts are stale
  noise for reviewers.  `git log` retains them if anyone needs the
  history.

Auto-format cleanups by `ruff format` and auto-fix by `ruff check
--fix` also landed in the same run — they rewrap the imports / string
continuations that the E501 fixes exposed, plus tidy test helpers.
Regenerated `docs/src/content/docs/architecture/*` mirrors via
`docs/scripts/sync-starlight.mjs` (docs-sync hook).

All three suites remain green:
- agent: 468/468 (pytest)
- CLI:   215/215 (jest)
- CDK:   879/879 (jest)

Added a memory note for future sessions: `GIT_CONFIG_NOSYSTEM=1`
lets `prek install` succeed on machines where Amazon git-defender
claims `core.hooksPath` at the system level, but it's simpler to
just invoke the hooks manually via `mise run hooks:run` (or the
narrower `mise //agent:quality`, `mise //cdk:eslint`, etc.) before
pushing — no system-config bypass required.
@krokoko
Copy link
Copy Markdown
Contributor

krokoko commented Apr 29, 2026

After some thinking, here is what I propose:

Context: PR #52 introduces a two-runtime model (IAM + JWT) with direct SSE streaming for interactive observation. This proposal presents an alternative architecture that is simpler, compute-agnostic, and better aligned with the HITL requirements ahead. If agreed, PR #52 should be reworked to ship only the foundational pieces and exclude the Runtime-JWT / SSE path.


Requirements I see:

  1. Users submit a task. They don't care about the compute substrate and/or don't want to select the observability path when submitting a request
  2. Fire and forget: task runs independently, no one needs to watch.
  3. If configured, users are notified when HITL approval is needed.
  4. Users can check in at any time, ask the agent for a status report, and steer it if the direction is wrong.
  5. If configured, the agent updates its source context as it works (e.g., comments on the GitHub issue that triggered it).

What to keep from PR #52

The following pieces are the correct foundation and should ship:

| Piece | Covers |
| --- | --- |
| Orchestrator path (single IAM runtime) | Req 1, 2 |
| ProgressWriter → TaskEventsTable (DDB, any compute) | Req 1 |
| GET /tasks/{id}/events + bgagent watch (polling) | Req 4 (partial) |
| bgagent nudge / nudge mechanism | Req 4 (partial) |
| FanOutConsumer + DDB Stream wiring (skeleton) | Foundation for 3, 5 |
| Cancel fix + pipeline short-circuit | Req 4 |

What to exclude from PR #52

The following should be removed from the PR and not shipped:

| Component | Reason |
| --- | --- |
| Runtime-JWT (second runtime with Cognito JWT auth) | No requirement demands a live SSE connection. Adds a second auth surface, compute-specific coupling, and conflicts with HITL design. |
| execution_mode field + routing logic | Single path (orchestrator) for all tasks. No need to choose at submission time. |
| _SSEAdapter, sse_wire.py, ag-ui-translator.ts, sse-client.ts | SSE machinery not needed in the proposed architecture. |
| bgagent run as a distinct command | Becomes bgagent submit + optional bgagent watch. |
| Stranded-task reconciler | Only needed because interactive tasks orphan when CLI disconnects. With a single orchestrator path, tasks are always independently started. |

Proposed architecture

The interaction model

Every interaction between the user and the agent is async and goes through DDB:

bgagent submit "fix the auth timeout bug"          # fire and forget
# ... time passes ...
bgagent ask abc123 "what's your status?"            # request a report
# ... response arrives via Slack/watch ...
bgagent nudge abc123 "also fix the logging module"  # steer (acknowledged)
bgagent approve abc123                              # approve a HITL gate
bgagent watch abc123                                # see full event history

All of these are the same primitive: user writes to DDB → agent reads; agent writes to DDB → user reads. Different event types, same mechanism, fully async.
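To make the primitive concrete, here is a minimal Python sketch of both halves, assuming TaskEventsTable is keyed on task_id + event_id and that user-authored items land in a hypothetical TaskNudgesTable; real key and attribute names may differ:

```python
# Minimal sketch of the shared primitive. Table, key, and attribute names are
# illustrative only (TaskNudgesTable in particular is a placeholder name).
import time
import uuid
import boto3
from boto3.dynamodb.conditions import Key

ddb = boto3.resource("dynamodb")
events = ddb.Table("TaskEventsTable")   # agent writes, user reads
nudges = ddb.Table("TaskNudgesTable")   # user writes, agent reads

def user_writes(task_id: str, text: str) -> str:
    """User -> DDB: queue an instruction the agent picks up between turns."""
    nudge_id = f"nudge_{uuid.uuid4().hex[:12]}"
    nudges.put_item(Item={"task_id": task_id, "nudge_id": nudge_id,
                          "text": text, "created_at": int(time.time())})
    return nudge_id

def user_reads(task_id: str, after: str | None = None) -> list[dict]:
    """DDB -> user: catch-up read of everything the agent wrote since a cursor."""
    cond = Key("task_id").eq(task_id)
    if after:
        cond = cond & Key("event_id").gt(after)
    return events.query(KeyConditionExpression=cond)["Items"]
```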


1. Status reports — two tiers (requirement 4)

A naive bgagent ask that interrupts the agent has two problems: latency (agent is mid-turn, 30-120s before it reads the request) and cost (burns a turn from the max_turns budget — a nervous manager checking in 5 times costs 5% of the task's capacity).

Tier 1: Lambda-synthesized summary (default). A Lambda reads TaskEventsTable and synthesizes a status report from the milestone history. The agent is never interrupted.

bgagent ask abc123 "status?"
  → Lambda reads events for task abc123
  → Returns: "Step 7/~12: tests passed (14/14), applying changes to
     src/auth.ts and src/logging.ts. Last activity 45s ago."

This is instant, costs nothing, and answers the most common question ("what step is it on?").
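A minimal sketch of what that Lambda could look like, written in Python for brevity (the repo's handlers are TypeScript) and using illustrative event attribute names rather than the actual TaskEventsTable schema:

```python
# Hedged sketch of the Tier 1 templated status read. Attribute names
# (event_type, turn_number, label, created_at) are assumptions.
import time
import boto3
from boto3.dynamodb.conditions import Key

events = boto3.resource("dynamodb").Table("TaskEventsTable")

def handler(event, _ctx):
    task_id = event["pathParameters"]["id"]
    items = events.query(
        KeyConditionExpression=Key("task_id").eq(task_id),
        ScanIndexForward=False,   # newest first
        Limit=25,
    )["Items"]
    if not items:
        return {"statusCode": 404, "body": "no events yet"}

    turns = [i for i in items if i.get("event_type") == "agent_turn"]
    milestones = [i for i in items if i.get("event_type") == "agent_milestone"]
    idle_s = int(time.time()) - int(items[0]["created_at"])
    summary = (
        f"Turn {turns[0]['turn_number'] if turns else '?'}; "
        f"last milestone: {milestones[0]['label'] if milestones else 'none'}; "
        f"last activity {idle_s}s ago."
    )
    return {"statusCode": 200, "body": summary}
```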

Tier 2: Agent-directed question (explicit opt-in). For questions about intent or decision-making ("why did you choose that approach?"), the user explicitly interrupts the agent:

bgagent ask abc123 --agent "why are you refactoring the logging module?"
  → writes status_request nudge to DDB
  → agent reads on next between-turns hook
  → agent writes status_response event
  → user sees it via watch/Slack

This costs a turn and has latency. The --agent flag makes the trade-off explicit. Without it, the user always gets the fast/free Lambda summary.


2. Nudge acknowledgment (requirement 4)

Today's nudge is fire-and-forget — the user has no confirmation the agent received it, understood it, or changed direction. After injecting a nudge, the agent emits a nudge_acknowledged event:

bgagent nudge abc123 "fix both issues, separate commits"
  → agent reads nudge on next between-turns hook
  → emits: { event: "nudge_acknowledged",
             summary: "Understood — will fix both issues in separate commits" }
  → user sees acknowledgment via watch/Slack

This is appended to the agent's system context for the current turn — not a standalone turn, so it doesn't burn a full turn from the budget.


3. Configurable dispatch — FanOutConsumer as router (requirements 3, 5)

The FanOutConsumer should be a router, not a monolith. Each notification channel has its own error modes, rate limits, retry semantics, and credentials. A GitHub API outage shouldn't block Slack notifications.

Agent writes events → DDB → Stream → FanOutConsumer (router)
                                            |
                                            ├── reads dispatch config from task/repo record
                                            |
                              ┌──────────────┼──────────────┐
                              |              |              |
                        SQS: github    SQS: slack     SNS: email
                              |              |              |
                     GitHubDispatchFn  SlackDispatchFn    SES

Notification config (per-repo or per-task):

{
  "notifications": {
    "github_issue": {
      "enabled": true,
      "events": ["milestone", "status_response", "approval_required", "task_complete", "task_failed"]
    },
    "slack": {
      "channel": "#coding-agents",
      "events": ["approval_required", "status_response", "task_complete", "task_failed"]
    },
    "email": {
      "events": ["approval_required"]
    }
  }
}
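A hedged sketch of the router shape, in Python for brevity; the queue URLs, config lookup, and record field names are placeholders rather than the actual fanout-task-events.ts implementation:

```python
# Hedged sketch of FanOutConsumer as a router over a DDB Stream batch.
import json
import boto3

sqs = boto3.client("sqs")
CHANNEL_QUEUES = {                       # hypothetical per-channel queues
    "github_issue": "https://sqs.us-east-1.amazonaws.com/123456789012/github-dispatch",
    "slack": "https://sqs.us-east-1.amazonaws.com/123456789012/slack-dispatch",
}

def load_dispatch_config(task_id: str) -> dict | None:
    """Placeholder: would read the per-repo/per-task config shown above."""
    return None

def handler(event, _ctx):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue                                     # only new events fan out
        item = record["dynamodb"]["NewImage"]
        task_id = item["task_id"]["S"]
        event_type = item["event_type"]["S"]
        config = load_dispatch_config(task_id) or {}
        for channel, cfg in config.get("notifications", {}).items():
            if event_type in cfg.get("events", []) and channel in CHANNEL_QUEUES:
                sqs.send_message(
                    QueueUrl=CHANNEL_QUEUES[channel],
                    MessageBody=json.dumps({"task_id": task_id, "event_type": event_type}),
                )
```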

GitHub issue comment (edited in-place as the agent progresses):

🤖 Agent working on #42 (updated 2 min ago)

| Step | Status |
| --- | --- |
| Clone repo ||
| Run tests (before) | ✅ 14/14 |
| Apply changes | ✅ 3 files modified |
| Run tests (after) | ✅ 14/14 |
| Open PR |#87 |

Note: the edit-in-place comment has a race condition (two events in the same DDB Stream batch → two Lambda invocations both edit → one update lost). Solve with SQS FIFO using task_id as the message group ID — guarantees single-concurrency per task.


4. HITL approval + soft questions spectrum (requirement 3)

The architecture needs to support two distinct interaction modes:

| Mode | Agent behavior | Example |
| --- | --- | --- |
| Hard gate (Cedar policy) | Agent stops, waits indefinitely for approval | "Wants to push to main. Approve?" |
| Soft question | Agent asks, keeps working with its best guess, incorporates late response if still relevant | "Found two approaches. A is faster, B is cleaner. Going with A unless you object." |

Hard gates are policy-enforced and non-negotiable. Soft questions are advisory — the agent proceeds with a default after a configurable timeout. Both write events to the same stream and both can be responded to via the same approve / nudge mechanism. The difference is whether the agent blocks.

This is ultimately a product decision, but the architecture must support both: an approval_required event with a blocking: true/false field, and the agent's poll loop respecting the distinction.


End-to-end scenario

GitHub issue triggers a task. User manages it asynchronously over the course of an afternoon.

  1. Webhook fires → task created → orchestrator invokes runtime → agent starts
  2. Agent progresses: ProgressWriter writes milestones → FanOutConsumer routes to GitHub (edits issue comment) and Slack
  3. User checks in from phone: bgagent ask abc123 → instant Lambda summary: "Step 5/~10, applying changes, last activity 30s ago."
  4. User wants more detail: bgagent ask abc123 --agent "why did you change the retry logic?" → agent responds next turn: "The existing retry used exponential backoff with no jitter, causing thundering herd under load."
  5. User steers: bgagent nudge abc123 "add jitter but keep the backoff intervals" → agent acknowledges: "Understood — adding jitter to existing backoff."
  6. Agent hits Cedar gate → writes approval_required → Slack ping + email to user
  7. User approves 20 minutes later: bgagent approve abc123
  8. Agent finishes → FanOutConsumer updates GitHub comment, posts to Slack
  9. Agent has a soft question during next task → "Found dead code in utils.ts. Remove it? Going ahead unless you object." → proceeds after timeout, user never responds, nobody blocked.

No live connection at any point. Everything async.


What we explicitly don't build

| Not building | Why |
| --- | --- |
| WebSocket relay | No requirement demands push-to-CLI. Polling works for "check in on demand." Push notifications go to Slack/email/GitHub, not to a terminal. |
| Token-level streaming | Users need to know what step the agent is on, not watch it type. Milestone events + status reports cover this. |
| Direct SSE from agent | Compute-specific coupling for marginal latency gain. If sub-second ever matters, the agent can open an outbound WebSocket — a future decision, documented but not built. |

Why not direct SSE (what PR #52 currently ships)

The two-runtime model (IAM for background, JWT for interactive SSE) has three structural problems:

1. UX tension. Forces users to choose at submission time between resilient (submit, polling-only) and observable (run, fragile — task dies if CLI disconnects). The workflow users actually want is "submit and forget, attach live whenever."

2. Compute coupling. Direct SSE relies on AgentCore's /invocations endpoint and auth layer. Doesn't work on ECS, Fargate, EC2, or any other substrate. The durable DDB path is already compute-agnostic.

3. HITL conflict. Approval gates are inherently async — the agent waits minutes or hours for a human response. A live SSE connection can't survive that. You need the DDB-based async path for HITL regardless, which makes the SSE path redundant for all five requirements.


Preserving optionality

Removing Runtime-JWT eliminates ~2,500 lines of code. Worth it for maintenance. But document why it was removed and what would bring it back:

  • Token-level streaming requirement (watch the agent type character by character)
  • Sub-200ms HITL loops (interactive debugging, rapid approval cycles)
  • Live pair-programming mode (user and agent co-editing in real time)

If any of these materialize, the right solution is the agent opening an outbound WebSocket to API Gateway — not rebuilding the dual-runtime model. This avoids the dual-auth problem (agent connects out, no ingress auth needed), works on any compute with network egress, and degrades gracefully (DDB milestones remain durable if the direct connection drops).


Requirement check

| # | Requirement | Solved by |
| --- | --- | --- |
| 1 | Compute-agnostic | Agent writes to DDB from anywhere |
| 2 | Fire and forget | Single orchestrator path |
| 3 | HITL notification | FanOutConsumer → Slack/email on approval_required; hard gates block, soft questions don't |
| 4 | Check in + steer | bgagent ask (instant Lambda summary or agent-directed), bgagent nudge (with acknowledgment), bgagent watch (history) |
| 5 | Update source context | FanOutConsumer → per-channel dispatch (GitHub edit-in-place, Slack, email), configurable per repo |

Open design decisions

Before building, these need answers:

  1. Soft question timeout. How long does the agent wait before proceeding with its default? Per-task config? Per-gate config? Global default?
  2. Status summary depth. What does the Lambda-synthesized summary include? Just the latest milestone, or a full narrative? Should it call an LLM to summarize, or template from structured events?
  3. Nudge acknowledgment cost. Appending to system context is lighter than a standalone turn, but still consumes tokens. Cap the acknowledgment length? Skip acknowledgment for simple nudges?

Suggested timeline

  1. Rework PR #52 — ship only the foundational pieces (ProgressWriter, events table, FanOutConsumer skeleton, nudge, cancel fix, single IAM runtime). Exclude Runtime-JWT, SSE path, execution_mode, stranded-task reconciler.
  2. bgagent ask (Lambda summary tier) — highest-value, lowest-effort UX addition. No agent changes needed.
  3. Nudge acknowledgment + bgagent ask --agent — agent prompt changes, status_response event type.
  4. FanOutConsumer dispatch — router plus one or two initial channels. SQS FIFO per channel.
  5. Phase 3 Cedar-HITL — hard gates and soft questions, both using the event/response loop above. Design against DDB async delivery from day one.

What do you think?

@scoropeza
Contributor Author

Thanks @krokoko — this is a really well-structured proposal and I'm on board with the overall direction. I agree fully on the big moves, and I'm flagging a handful of smaller adjustments where I think the architecture or the Claude Agent SDK pushes us one way rather than another. I've organized this as what we agree on, where I think the proposal needs a small amendment, and a revised phase plan at the end.

What I fully agree on

  • Single orchestrator path, delete Runtime-JWT. No requirement actually demands a live SSE connection — HITL is inherently async, fire-and-forget is what users want, and the second runtime was paying for a benefit the product doesn't need.
  • Delete execution_mode: 'interactive', _SSEAdapter, sse_wire.py, sse-client.ts, ag-ui-translator.ts, bgagent run. The compute-agnostic durable event plane through TaskEventsTable covers all five requirements.
  • FanOutConsumer as router, per-channel dispatcher Lambdas, edit-in-place GitHub comment, configurable per-repo/per-task notification routing.
  • Keep Phase 3 Cedar HITL design as-is — it's already DDB-poll based, so approval events ride the same fan-out plane without changes.
  • Hard gates block, soft gates don't is the right conceptual split.
  • The overall interaction model (submit → notify → ask/nudge/approve asynchronously → finish) matches what Cursor, GitHub Copilot coding agent, Codex, and Devin all ship — we're catching up, not inventing.

One correction on the LOC claim: real deletion is closer to ~4.5k–6k LOC once you include tests (test_sse_wire.py, test_sse_adapter.py, cli/test/sse-client.test.ts, cli/test/ag-ui-translator.test.ts, plus SSE paths in agent/src/server.py which is ~50% of that file). Your ~2.5k number was conservative — we get even more cleanup than advertised.

Where I'd amend the proposal

1. Keep the stranded-task reconciler (just collapse the two timeouts)

cdk/src/handlers/reconcile-stranded-tasks.ts:59-70 actually has two thresholds — STRANDED_INTERACTIVE_TIMEOUT_SECONDS=300 and STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS=1200. The orchestrator branch covers "OrchestratorFn async-invoke crashed between TaskTable write and the Runtime-IAM sync invocation." That failure mode persists after we remove the interactive path — we still need a watchdog that catches tasks stuck in SUBMITTED or HYDRATING. What we can delete is the interactive-specific timeout branch; the reconciler itself stays. ~30 LOC cleanup, not a removal.
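For illustration, a sketch of what the collapsed check might look like; table, GSI, and attribute names are assumptions, not the actual handler code:

```python
# Sketch of the collapsed watchdog after removing the interactive branch.
import time
import boto3
from boto3.dynamodb.conditions import Key

tasks = boto3.resource("dynamodb").Table("TaskTable")
STRANDED_TIMEOUT_SECONDS = 1200   # single orchestrator-path threshold

def find_stranded() -> list[dict]:
    cutoff = int(time.time()) - STRANDED_TIMEOUT_SECONDS
    stranded = []
    for status in ("SUBMITTED", "HYDRATING"):
        resp = tasks.query(
            IndexName="StatusIndex",   # hypothetical GSI on task status
            KeyConditionExpression=Key("status").eq(status),
        )
        stranded += [t for t in resp["Items"] if int(t["created_at"]) < cutoff]
    return stranded   # caller would mark these FAILED or re-drive them
```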

2. bgagent ask — honest positioning: deterministic status vs. LLM-synthesized summary

I love the "manager asks the teammate" UX, but I want us to be explicit about what the Lambda summary actually produces, because the events in TaskEventsTable are structured machine telemetry, not narrative. Look at agent/src/progress_writer.py:111-175:

  • agent_tool_call: {tool_name, tool_input_preview (200 chars), turn}
  • agent_turn: {turn_number, model, thinking_preview (200 chars), tool_calls_count}
  • agent_milestone: {label, details_preview (200 chars)}
  • agent_cost_update, agent_error, etc.

A deterministic Lambda reading these can produce "Step 7 of ~12: ran Bash 14 times, wrote utils.ts + auth.ts, last milestone nudge_applied 45s ago." That's useful, but it's not a synthesized narrative — it's a templated status.

We should pick one of two explicit options and commit, because the shape of what ships differs:

  • Option A — bgagent status (deterministic, no LLM). Rename to make the semantics clear. Last N milestones + elapsed time + current turn counter + cost so far + last activity timestamp. Fast, free, never hallucinates. This is what Cursor's task view and Copilot's PR view already do.
  • Option B — bgagent ask (LLM-synthesized). Genuine narrative summary. Costs per-ask on the platform's Bedrock budget (not the user's max_budget_usd, which is scoped to the task pipeline), adds a new failure mode ("summary Lambda hallucinated"), and would be a real industry differentiator — no shipped product does this today, per my research.

The middle path — "Lambda pretends to synthesize" — is the worst of both worlds, so let's not ship that. Happy to discuss which one lands in v1.

3. bgagent ask --agent — realistic latency + a foreground-wait UX

One thing I want to nail down before we commit to copy: the "agent responds on next between-turns hook" has meaningful latency. Looking at agent/src/hooks.py:324-380, the Stop hook only fires between turns — and a single turn with a long Bash or test run (e.g., npm test, pytest) is one turn. Realistic P95 is in the range of minutes, not seconds. We need to be honest about that in the UX.

How the CLI should behave (proposal): default to a foreground block-and-poll so it feels like an API call from the user's perspective, with durability as a safety net:

$ bgagent ask --agent abc123 "why did you change the retry logic?"
⠋ queued as ask_01JHXY... — waiting for agent
⠙ agent is running tool: Bash (turn 7/~12) — 42s elapsed
✓ agent responded (1m 14s)

The existing retry used exponential backoff with no jitter, causing thundering
herd under load. Added jitter to spread retries across the window.

Mechanism (same shape as submit --wait):

  1. CLI POST /tasks/{id}/asks → returns {ask_id, cursor: <last_event_id_before_write>}
  2. CLI polls GET /events?after=<cursor>&type=status_response&correlation_id=<ask_id> with adaptive interval (500ms while last poll had events, backoff to 5s when idle)
  3. Spinner shows the agent's current activity (last agent_turn / agent_tool_call milestone) so the user can see it's alive during long turns
  4. CLI exits on: status_response arrival → print response; task terminal state → print "task finished without answering"; --timeout expiry (default 5 min, hard cap 10 min) → print "still pending — retrieve later with bgagent asks show <ask_id>"

Durability is preserved — the ask row lives in DDB regardless of whether the CLI is still running. If the terminal is closed mid-wait (or the user Ctrl-Cs), the ask still executes; the response is retrievable via bgagent asks show <ask_id>, bgagent watch, or Slack if configured.
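A sketch of that loop, in Python for brevity (the real CLI is TypeScript); the base URL and response field names are assumptions:

```python
# Hedged sketch of the foreground block-and-poll with adaptive interval.
import time
import requests

API_BASE = "https://api.example.com/v1"   # placeholder

def ask_and_wait(task_id: str, question: str, timeout_s: int = 300) -> str | None:
    created = requests.post(f"{API_BASE}/tasks/{task_id}/asks", json={"question": question})
    created.raise_for_status()
    ask_id, cursor = created.json()["ask_id"], created.json()["cursor"]

    interval, deadline = 0.5, time.time() + timeout_s
    while time.time() < deadline:
        page = requests.get(
            f"{API_BASE}/tasks/{task_id}/events",
            params={"after": cursor, "type": "status_response", "correlation_id": ask_id},
        ).json().get("events", [])
        for ev in page:
            cursor = ev["event_id"]
            if ev.get("correlation_id") == ask_id:
                return ev["content"]                    # agent answered
        # adaptive interval: 500ms while events flow, back off to 5s when idle
        interval = 0.5 if page else min(interval * 2, 5.0)
        time.sleep(interval)
    return None   # still pending; retrievable later (bgagent asks show <ask_id>)
```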

Flags:

  • Default → foreground block with spinner (above)
  • --no-wait → returns immediately with {ask_id}, response delivered via Slack/watch
  • --timeout <N> → overrides default 5-minute wait

New surface:

  • POST /tasks/{id}/asks (creates ask row)
  • GET /tasks/{id}/asks/{ask_id} (retrieves an answered ask, for bgagent asks show)
  • New event type status_response{ask_id, content, turn} emitted by the agent hook alongside the turn where it answers
  • Rate limit: cap 1 open ask per task (return 429 otherwise). This also solves the concatenation issue in nudge_reader.read_pending + format_as_user_message (hooks.py:245), which today merges all pending items into one <user_nudge> blob — without a cap, rapid-fire asks drown each other.
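The cap itself maps naturally onto a conditional write; a sketch against a hypothetical asks table, names illustrative only:

```python
# Hedged sketch of "1 open ask per task" as a conditional put on a
# hypothetical table keyed on (task_id, slot).
import boto3
from botocore.exceptions import ClientError

asks = boto3.resource("dynamodb").Table("TaskAsksTable")

def create_ask(task_id: str, ask_id: str, question: str) -> bool:
    try:
        asks.put_item(
            Item={"task_id": task_id, "slot": "OPEN_ASK",
                  "ask_id": ask_id, "question": question},
            ConditionExpression="attribute_not_exists(slot)",  # reject a second open ask
        )
        return True          # API layer returns 201
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False     # API layer returns 429: an ask is already open
        raise
```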

4. Nudge acknowledgment — "appended to system context" is not something the Claude Agent SDK supports

I want to push back on this one with sources, because I dug in and it changes the cost model. The proposal says acknowledgment is "appended to the agent's system context for the current turn — not a standalone turn, so it doesn't burn a full turn from the budget." This is not achievable with the Claude Agent SDK (Python) today.

The SDK's mid-run extension surface is:

  • HookMatcher for PreToolUse, PostToolUse, Stop / between-turns events. The full HookEvent enum is fixed: PreToolUse | PostToolUse | PostToolUseFailure | UserPromptSubmit | Stop | SubagentStart | SubagentStop | PreCompact | PermissionRequest | Notification. There is no SystemPromptUpdate, BetweenTurns, ContextAppend, or MessageInject event.
  • decision: "block" + reason: <text> from a Stop hook — this is how today's _nudge_between_turns_hook injects text (see agent/src/hooks.py:380). That decision causes the SDK to emit a new user turn with reason as the content. It is unambiguously a full turn.
  • updatedMCPToolOutput from PostToolUse for redacting tool output — not relevant here.
  • ClaudeAgentOptions.system_prompt is set once, at construction time (agent/src/runner.py:262). The SDK has no public method to mutate it mid-conversation, and there is no ClaudeSDKClient.update_system_prompt() or inject_message() API.
  • The systemMessage field on hook return values looks promising but is user-visible only — it does NOT enter Claude's conversation context. Same for hookSpecificOutput.additionalContext on PostToolUse. Both are confirmed by Anthropic as "not planned" — see claude-code#18427 and claude-code#19643, both closed with not-planned.

What this means in practice: acknowledging a nudge burns a turn just like applying the nudge does today. The "savings" claim doesn't hold.

Two honest alternatives that actually work with the SDK:

  • Combined-turn ack (cost-neutral): the hook emits a nudge_acknowledged milestone to TaskEventsTable before returning decision: "block", and the reason string carries both the nudge AND "I'll acknowledge this in my next response." Still one turn, same cost as today, but the user sees the acknowledgment event land on their watch/Slack feed immediately.
  • Zero-turn milestone-only ack: the hook writes nudge_acknowledged milestone, returns {} (no block), and the nudge text queues for the next turn that fires naturally (a tool-call Stop, etc.). Genuinely cheaper — zero turns burned by the ack itself — but the nudge text only arrives when some other Stop happens. Fine for nudges, probably too slow for asks.

I'd lean toward combined-turn ack for v1 — same cost as today, better event emission — and document the cost honestly. If the SDK grows a SystemPromptUpdate or BetweenTurnsAppend hook later, we revisit.
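A minimal sketch of the combined-turn ack, using the Stop-hook return shape described above; read_pending_nudges and write_milestone are stand-ins for the real nudge_reader / ProgressWriter helpers:

```python
# Hedged sketch of the combined-turn ack. The {"decision": "block", "reason": ...}
# return shape is the Stop-hook mechanism described in this thread for the
# Python Claude Agent SDK; the two helpers below are hypothetical stand-ins.
def read_pending_nudges() -> list[str]:
    """Stand-in for nudge_reader.read_pending: would query the nudges table."""
    return []

def write_milestone(label: str, details: str = "") -> None:
    """Stand-in for ProgressWriter: would write an event to TaskEventsTable."""
    print(f"[milestone] {label}: {details}")

async def nudge_between_turns_hook(hook_input: dict, tool_use_id, context) -> dict:
    pending = read_pending_nudges()                    # user -> DDB -> agent
    if not pending:
        return {}                                      # nothing queued: stop normally

    # 1. Ack milestone lands first, so watch/Slack sees it before the turn starts.
    write_milestone("nudge_acknowledged", details=f"Will address: {pending[0][:200]}")

    # 2. Inject the nudge as the next turn's content. This is the same turn the
    #    nudge costs today, so the ack itself adds zero extra turns.
    reason = "\n".join(f"<user_nudge>{text}</user_nudge>" for text in pending)
    reason += "\nAcknowledge these instructions explicitly in your next response."
    return {"decision": "block", "reason": reason}
```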

5. Soft questions — "proceed with default, incorporate late response" doesn't work with tool calls

I want to argue for a tighter definition, because as written the feature isn't implementable. Tool calls are committed and one-shot: once they fire, there is nothing to take back. Consider the example:

Agent: "Found dead code in utils.ts. Remove it? Going ahead unless you object."

The agent's decision point is synchronous: it emits an Edit(utils.ts, delete 40 lines) tool call within the same turn as the question. The file is deleted before the user's "wait, don't delete that" reaches the next Stop hook. "Incorporates late response if still relevant" requires the agent to:

  1. Remember what it did in turn T
  2. On turn T+k, receive the late objection
  3. Inspect state, decide whether the objection is still actionable
  4. Synthesize a reverse operation (re-insert deleted code, un-commit, un-push)

That's a transactional undo stack over arbitrary filesystem + Bash + git operations. The Claude Agent SDK doesn't have one. Tools don't have compensating actions defined. And primary research across Cursor, Codex, GitHub Copilot coding agent, Replit, and Devin confirms no shipped product implements "proceed with default if no response." GitHub Actions environment protection rules have auto-approve-after-timeout, but that's at the infra deploy layer, not inside an agent's tool-call loop.

Concrete redefinition I'd propose:

Soft question = bounded wait inside the hook. Agent emits approval_required{blocking: true, soft: true, timeout_s: 60}. The hook polls DDB for the user's decision with a 60s timeout (configurable). If the user responds in time, the agent proceeds with the user's choice. If not, it proceeds with the default. No post-hoc revert.

This is implementable today — it's just a shorter version of the hard-gate poll loop. The agent does wait, just not for a long time. The "feels non-blocking" UX comes from the short timeout, not from a separate execution model. If someone genuinely needs "fire-and-later-adjust," we scope it to PR-review tasks where "adjust" = "make another commit on the branch" and the agent is explicitly built to iterate that way.

Document loudly: soft question = bounded wait, not fire-and-forget with retroactive correction.
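A sketch of the bounded wait; a hard gate is the same loop with no deadline. emit_event and read_decision are stand-ins for ProgressWriter and an approvals-table read:

```python
# Hedged sketch of a bounded-wait gate; field names are illustrative.
import time

def emit_event(event_type: str, **payload) -> None:
    print(f"[event] {event_type}: {payload}")          # would write to TaskEventsTable

def read_decision(gate_id: str) -> str | None:
    return None                                        # would query the approvals record

def soft_gate(gate_id: str, question: str, default: str, timeout_s: int = 60) -> str:
    emit_event("approval_required", gate_id=gate_id, question=question,
               blocking=True, soft=True, timeout_s=timeout_s)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        decision = read_decision(gate_id)              # user -> DDB
        if decision is not None:
            return decision                            # user answered in time
        time.sleep(2)                                  # same poll cadence as a hard gate
    return default                                     # timeout: proceed with the default
```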

6. Skip SQS FIFO for v1 — DDB Stream ordering + GitHub ETag If-Match is simpler

Two reasons:

  • DDB Streams with ParallelizationFactor: 1 (the default) already give per-shard ordering. Events for the same task_id hash to the same shard, so the Lambda processes them one-at-a-time per shard. That solves ~95% of the race.
  • For the remaining race (Lambda retries after partial failure), GitHub's issue-comment PATCH API supports If-Match with the comment's ETag — optimistic concurrency at the call site beats queue serialization at the source. If we get a 412 Precondition Failed, we re-GET, re-render, re-PATCH.

Ref: https://docs.github.com/en/rest/issues/comments

Adding SQS FIFO on top = extra latency per event, + another DLQ to reason about, + the 300 msg/s per-group throughput ceiling. Defer FIFO until a concrete race is observed in production.
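A sketch of that retry loop, assuming the issue-comment PATCH honors If-Match / 412 as described; the endpoint, token, and render_comment are placeholders:

```python
# Hedged sketch of the re-GET / re-render / re-PATCH loop with optimistic concurrency.
import requests

API = "https://api.github.com/repos/OWNER/REPO/issues/comments/COMMENT_ID"
HEADERS = {"Authorization": "Bearer TOKEN", "Accept": "application/vnd.github+json"}

def render_comment(current_body: str, new_event: dict) -> str:
    """Placeholder: would re-render the full progress table from TaskEventsTable."""
    return current_body + f"\n| {new_event['step']} | {new_event['status']} |"

def update_comment(new_event: dict, max_attempts: int = 3) -> None:
    for _ in range(max_attempts):
        current = requests.get(API, headers=HEADERS)                 # re-GET
        current.raise_for_status()
        body = render_comment(current.json()["body"], new_event)     # re-render
        resp = requests.patch(                                       # re-PATCH
            API,
            headers={**HEADERS, "If-Match": current.headers.get("ETag", "")},
            json={"body": body},
        )
        if resp.status_code != 412:        # 412 = someone else edited in between
            resp.raise_for_status()
            return
    raise RuntimeError("comment update kept losing the race")
```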

7. Notification defaults must ship with v1 (not "later")

Biggest UX risk in the whole proposal IMO. fanout-task-events.ts:52-61 today filters to task_* + agent_milestone + pr_created + agent_error. A 20-min task produces 8–15 milestones + a couple of errors + terminal events. If Slack pings for every milestone, users mute the bot and never unmute.

Per-channel defaults I'd propose:

| Channel | Default events | Opt-in verbose |
| --- | --- | --- |
| Slack | terminal (task_completed/task_failed/task_cancelled) + pr_created + agent_error + approval_required | --verbose adds agent_milestone |
| Email | terminal + approval_required only | (none) |
| GitHub issue comment | pr_created + terminal (single edit-in-place comment) | already minimal |

bgagent notifications configure command ships in v1. Per-repo default config file ships in v1.

8. Debug escape hatch — --trace flag

Without SSE, if an agent misbehaves the only debug tool is GET /events?after= polling + CloudWatch logs. Event previews are truncated to 200 chars (progress_writer.py _PREVIEW_MAX_LEN). That's fine for normal runs but painful when you're debugging "why did my agent fail on step 5."

Proposed: bgagent submit --trace flag that (a) raises preview truncation to 4KB for that task only, (b) emits a trajectory_uploaded milestone with an S3 URI where the full agent trajectory is dumped post-task. Cheap (opt-in DDB storage), concrete replacement for "watch the SSE stream."
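A sketch of the preview-length switch; the flag plumbing and constants are hypothetical, not the current progress_writer.py code:

```python
# Hedged sketch of a per-task preview limit driven by a --trace flag.
_PREVIEW_MAX_LEN = 200          # today's default truncation
_TRACE_PREVIEW_MAX_LEN = 4096   # proposed --trace ceiling

def preview(text: str, trace_enabled: bool) -> str:
    limit = _TRACE_PREVIEW_MAX_LEN if trace_enabled else _PREVIEW_MAX_LEN
    return text if len(text) <= limit else text[:limit] + "…"
```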

9. Do we need SSE for watch at all? I think no.

You floated this in chat and I agree: polling is fine. Three reasons:

  1. Cost is trivial — eventually-consistent DDB query with cursor ≈ 0.5 RCU per page, concurrent watchers × 2s poll over active minutes = pennies
  2. Cursor, Copilot coding agent, and Codex's external view all use polling; primary research (Cursor docs, Copilot docs) shows no user complaints about lag — expectations are calibrated to async
  3. The only perceptible wart is bursty delivery during active work (4–10 events in 2s hiccups). Fix with adaptive polling: 500ms while last poll returned ≥1 event, back off to 5s when idle. Same endpoint, no new Lambda, no Cognito Identity Pool, no 100KB-buffer workarounds for Function URL streaming.

This also shares the exact polling mechanism with bgagent ask --agent's foreground block — one implementation, two UX surfaces.

Revised phase plan

| Phase | Scope |
| --- | --- |
| 1 (PR #52 rework) | Delete JWT + all SSE machinery; single orchestrator path; bgagent status (deterministic) OR bgagent ask (LLM — pick one); polling watch with adaptive interval; --trace mode; per-channel notification defaults in FanOutConsumer; ETag-based GitHub comment edits; collapse stranded-reconciler timeouts |
| 2 | bgagent ask --agent with foreground block-and-poll + combined-turn ack semantics; real Slack dispatcher |
| 3 | Cedar HITL hard gates + bounded-wait soft gates |
| 4 | GitHub + email dispatchers; per-repo/task notification config; --verbose opt-in |
| Deferred | Outbound WebSocket from agent (only if a concrete sub-200ms latency requirement emerges) |

Summary

We're aligned on every big move. The amendments are mostly making implicit assumptions explicit so we don't ship a design that promises something the SDK can't deliver:

  • Soft questions → bounded-wait, not post-hoc undo (tool calls are committed; no undo stack exists)
  • Nudge ack → combined-turn (same cost as today), not "system-context append" (SDK has no such hook; Anthropic has closed the related issues as not-planned)
  • bgagent ask → honest about deterministic vs. LLM-synthesized, pick one and commit
  • ask --agent → foreground block-and-poll by default (spinner shows live activity, durability preserved if terminal closes); --no-wait for scripts; cap to 1 open ask per task
  • Stranded reconciler → stays (covers orchestrator crashes), just collapses timeouts
  • Notification defaults → ship v1 or users mute the bot
  • SQS FIFO → skip, DDB + ETag is simpler
  • watch → polling-only, adaptive interval (shares mechanism with ask --agent)
  • Debug → --trace flag as escape hatch

Happy to draft the §9.13 rewrite for INTERACTIVE_AGENTS.md next if this lands, or the exact file-level delete/add/modify list for the rework PR. Whichever is more useful to you.

@krokoko
Contributor

krokoko commented Apr 29, 2026

Awesome @scoropeza, thank you! Let me close the loop on each of your amendments and then we can lock the plan.

Full agreements (no further discussion needed)

  • Keep the stranded-task reconciler, collapse timeouts. You're right — orchestrator crashes can strand tasks regardless of the interactive path. My original "remove it" was sloppy. Keep, simplify, move on.
  • Skip SQS FIFO, use DDB Stream ordering + ETag. Simpler, fewer resources, solves the real problem. If we observe races in production we add FIFO then. Agreed.
  • Notification defaults ship v1. If we don't ship sensible defaults, users mute the bot day one and never come back. Your proposed per-channel defaults (terminal + approval + error for Slack, terminal + approval for email, single edit-in-place for GitHub) are the right starting point.
  • watch is polling-only with adaptive interval. No SSE, no WebSocket, no new auth surface. Shares mechanism with ask --agent. Clean.
  • --trace flag. Good escape hatch. Raises truncation to 4KB + dumps full trajectory to S3. Low cost, opt-in, concrete replacement for "stare at the SSE stream."
  • Real LOC cleanup is ~4.5-6k, not 2.5k. Even better. More confidence in the removal.

The two decisions that need to be made

Decision 1: bgagent status (deterministic) vs. bgagent ask (LLM-synthesized)

Your framing is exactly right: templated status from structured events vs. LLM narrative summary. These are different products with different failure modes.

My recommendation: ship both, name them differently, v1 is deterministic only.

  • bgagent status — deterministic Lambda, structured output. Fast, free, never hallucinates. This is v1.
  • bgagent ask "why did you..." — always goes to the agent (your "Tier 2"). No confusing --agent flag. If you're asking a question in natural language, it goes to the agent. If you want machine-readable status, use status.

The reason: bgagent ask "status?" where sometimes it hits a Lambda and sometimes the agent is confusing UX. Two commands with two clear contracts is better than one command with a flag that changes the execution model.

bgagent ask always blocks foreground (your spinner UX), always burns a turn, always has latency. Users understand this because they typed a natural-language question — it's obviously going to a thinking entity. bgagent status is
obviously a dashboard read. No ambiguity.

Decision 2: Soft questions — bounded wait vs. don't ship

Your pushback on "proceed with default, incorporate late response" is devastating. You're right:

  • Tool calls are committed and one-shot
  • No undo stack exists in the SDK or in the filesystem/git abstraction
  • No shipped product does this
  • "Incorporates late response if still relevant" is hand-waving over transactional rollback

My recommendation: don't ship soft questions in v1. Hard gates only.

Here's why: bounded-wait (your "60s timeout, then proceed with default") is implementable but creates a worse product than no soft questions at all. Consider:

  • User sees "Agent wants to delete dead code. Proceeding in 60s unless you object."
  • User is in a meeting. Comes back in 5 minutes.
  • Code is already deleted. User is annoyed.
  • Or: user scrambles to respond in time and feels pressured.

Both are bad UX. The "ticking clock" feeling is anxiety-inducing for what should be a calm async workflow.

The right v1 behavior: the agent makes its own judgment call silently (it already does this — it decides things every turn without asking). If we want the agent to be more conservative about certain actions, that's what Cedar policies
are for — make them hard gates. The spectrum isn't "hard gate vs. soft question" — it's "gate vs. no gate." If an action is risky enough to ask about, it's risky enough to wait for an answer.

Soft questions can be revisited post-v1 as a Cedar policy type: effect: "advise" that emits an event but doesn't block. The user sees it in their notification feed, the agent has already moved on. That's not "proceed unless you
object" — it's "FYI, I did this." Different framing, no timeout, no anxiety.

Nudge acknowledgment

Your combined-turn ack is the right mechanism. Same cost as today, better observability (milestone lands immediately before the turn starts). The proposal was wrong about system-context append — thanks for the SDK deep-dive with sources. Let's go with:

  1. Hook fires, reads nudge
  2. Emits nudge_acknowledged milestone to ProgressWriter (instant, visible to watch/Slack)
  3. Returns decision: "block" with the nudge text (plus the ack note) as the reason — starts the next turn
  4. Agent incorporates the nudge in its response (which is the turn)

The user sees the acknowledgment event land within their polling interval. The "burned turn" is the same turn that would have happened anyway — the nudge just redirects its content. Document it honestly: "nudges are consumed as turns."

bgagent ask --agent foreground UX

Your proposed spinner + adaptive polling + timeout + durability-on-disconnect is the right UX. One amendment:

Rate limit: cap 1 open ask per task

Agreed, but make it 1 open ask per task per user if we ever support multi-user teams watching the same task. For v1 (single user), 1 per task is fine.

The POST /tasks/{id}/asks → GET /events?correlation_id=<ask_id> pattern is clean. Ship it.


Does that work for you? Thank you!

@scoropeza
Contributor Author

Fully aligned @krokoko — let's lock it. Your three refinements are all improvements on what I had:

  1. bgagent status + bgagent ask as separate commands (no --agent flag). You're right — command-level split is cleaner than a flag that silently changes the execution model. status = dashboard read (deterministic Lambda), ask = natural-language question to the agent (always blocks, always burns a turn, always has latency). This also resolves my open Decision 1: v1 is deterministic-only for status; LLM-synthesized summary is punted to a future version if it earns its keep.

  2. No soft questions in v1 — hard gates only. Your "gate vs. no gate" framing is sharper than my "hard vs. soft" compromise. The ticking-clock UX is actively hostile in an async workflow, and the bounded-wait version I proposed was a worse product than just not having the feature. Post-v1 we can explore a Cedar effect: "advise" that emits an event without blocking — "FYI, I did this" framing, no timeout, no pressure. Agreed, dropping from v1 scope.

  3. Rate limit: 1 open ask per task per user. Forward-compat for future multi-user team-view scenarios. No extra work in v1 (single user = same as today), just the right schema.

Locked plan:

  • Phase 1 (PR #52 rework): delete JWT + all SSE machinery; single orchestrator path; bgagent status (deterministic); polling watch with adaptive interval; --trace mode; per-channel notification defaults in FanOutConsumer; ETag-based GitHub comment edits; collapse stranded-reconciler timeouts.
  • Phase 2: bgagent ask (foreground block-and-poll, spinner, durability-on-disconnect, 1 open per task per user); nudge combined-turn ack; real Slack dispatcher.
  • Phase 3: Cedar HITL hard gates only.
  • Phase 4: GitHub + email dispatchers; per-repo/task notification config; --verbose opt-in.
  • Deferred: LLM-synthesized summary; Cedar effect: "advise" soft gates; outbound WebSocket.

Starting on the docs/design/INTERACTIVE_AGENTS.md rewrite next (retire §9.13, revise to async-only), then the PHASE3_CEDAR_HITL.md revision (drop soft-gate scope for v1). Will push those as a docs-only commit for your review before touching any code. Thanks for the great back-and-forth on this one.
