docs(adr): ADR-0001 — SSE transport tier-2 evaluation (status-quo)#244

Open
howard-eridani wants to merge 1 commit into main from daedalus/adr-0001-event-transport

@howard-eridani

Member

Summary

  • Research spike: evaluate whether SSE cursor+replay (shipped 2026-05-21) is sufficient, or whether we need NATS JetStream per-agent consumers or Redis Streams as a tier-2 transport.
  • 21-day prod observation window (2026-05-21 → 2026-06-11).
  • Decision: Option C (status quo). Cursor+replay on the existing JetStream+Postgres stack is sufficient.
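
For reviewers less familiar with the cursor+replay model, here is a minimal sketch of the idea. It is not the shipped implementation: `EventLog`, `Event`, and `replay` are hypothetical names, the log is an in-memory stand-in for the agent_events table, and the 410 status for an expired cursor is an assumption inferred from the `sse_cursor_410_total` counter name.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    seq: int      # monotonically increasing cursor value
    kind: str     # event type, e.g. "task.assigned"
    payload: str


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for the agent_events table."""
    events: List[Event] = field(default_factory=list)
    next_seq: int = 1

    def append(self, kind: str, payload: str) -> Event:
        ev = Event(self.next_seq, kind, payload)
        self.events.append(ev)
        self.next_seq += 1
        return ev

    def prune_before(self, seq: int) -> None:
        """Drop events with a sequence below `seq` (simulates retention expiry)."""
        self.events = [e for e in self.events if e.seq >= seq]

    def replay(self, cursor: Optional[int]) -> Tuple[int, List[Event]]:
        """Return (status, events) for a client reconnecting with `cursor`.

        `cursor` is the last sequence the client saw; None means a fresh
        client that only wants live events.
        """
        if cursor is None:
            return 200, []
        oldest = self.events[0].seq if self.events else self.next_seq
        # If the event right after the cursor was already pruned, a replay
        # would silently skip events, so signal "gone" instead (assumed 410).
        if cursor + 1 < oldest:
            return 410, []
        return 200, [e for e in self.events if e.seq > cursor]
```

A client that reconnects inside the retention window gets the missed events back in order; one that reconnects after its cursor has aged out gets a 410 and would need a full resync.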

Key Findings

  • NATS JetStream MESH_EVENTS is already the durable write-path backbone (running since 2026-03-13): the pg-writer consumer shows 0 redeliveries and 0 pending messages across 13,228 events.
  • agent_events table retention: 7 days for critical events (task.assigned, task.created, task.mentioned) and 24h for all others. This covers all reconnect scenarios within normal operating patterns.
  • 0 lost-event incidents, 0 stale-redispatch triggers in 21-day window.
  • Option A (per-agent JetStream consumers for SSE fan-out) is a valid future upgrade once we reach 50+ agents; it is not warranted now.
  • Option B (Redis Streams) would be redundant because JetStream already provides the same guarantees.
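
To make the retention finding concrete, the two-tier rule can be sketched as below. This only illustrates the 7-day / 24h policy stated above: the function names are hypothetical, and the real pruning presumably happens in Postgres rather than in application code.

```python
from datetime import datetime, timedelta

# Event types listed as critical in the findings above.
CRITICAL_EVENT_TYPES = {"task.assigned", "task.created", "task.mentioned"}
CRITICAL_RETENTION = timedelta(days=7)
DEFAULT_RETENTION = timedelta(hours=24)


def retention_for(event_type: str) -> timedelta:
    """Retention window for an event type (hypothetical helper)."""
    if event_type in CRITICAL_EVENT_TYPES:
        return CRITICAL_RETENTION
    return DEFAULT_RETENTION


def is_replayable(event_type: str, created_at: datetime, now: datetime) -> bool:
    """True if the event is still inside its retention window at `now`."""
    return now - created_at <= retention_for(event_type)
```

Under this rule an agent that has been offline for three days can still replay a task.assigned event, while a non-critical event from the same period has already expired.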

Follow-up (non-blocking)

  • F1: sse_cursor_410_total Prometheus counter (closes telemetry gap)
  • F2: sse_replay_events_total counter
  • Both are separate small tasks for Linus, not part of this ADR.
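
As a rough idea of what F1/F2 would expose, here is a stdlib-only sketch that renders the two counters in the Prometheus text exposition format. The `Counter` class is a simplified stand-in rather than the `prometheus_client` API, and both the help strings and the `record_reconnect` hook are assumptions about where the counters would be incremented.

```python
import threading


class Counter:
    """Minimal monotonic counter (stand-in for a real Prometheus client)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        with self._lock:
            self._value += amount

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

    def render(self) -> str:
        """Render in the Prometheus text exposition format."""
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


# Help strings are assumed wording, not taken from the ADR.
SSE_CURSOR_410 = Counter(
    "sse_cursor_410_total", "SSE reconnects rejected because the cursor expired."
)
SSE_REPLAY_EVENTS = Counter(
    "sse_replay_events_total", "Events replayed to reconnecting SSE clients."
)


def record_reconnect(status: int, replayed: int) -> None:
    """Hypothetical hook called once per SSE reconnect attempt."""
    if status == 410:
        SSE_CURSOR_410.inc()
    else:
        SSE_REPLAY_EVENTS.inc(replayed)
```

With both counters in place, a non-zero 410 rate would be the first signal that the retention windows no longer cover real reconnect gaps.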

Test plan

  • ADR reviewed by Pavel
  • If approved → task fc2acff3 closed (no implementation needed per acceptance criteria)
  • F1/F2 counters tracked as separate follow-up tasks

Research spike after 21-day prod telemetry window (dep 81f5cec1 deployed 2026-05-21).

Key findings:
- NATS JetStream MESH_EVENTS is already the durable backbone (since 2026-03-13)
- pg-writer consumer: 13,228 events, 0 redeliveries, 0 pending (no durability issues observed)
- agent_events TTL: 7d (critical) / 24h (others) covers reconnect gaps via cursor replay
- 0 lost-event incidents in 21-day observation window

Decision: C (status-quo — cursor+replay on existing JetStream+Postgres stack).
Option A (per-agent JetStream consumers for SSE) deferred until the 50+ agent threshold.
Option B (Redis Streams) is redundant given JetStream is already present.

Follow-up: add sse_cursor_410_total + sse_replay_events_total Prometheus counters (F1/F2).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>