Problem
The outbound send path is synchronous and not durable: a.sender.Send() (internal/agent/api.go:1185) runs the SES SMTP transaction inline in the HTTP request, with only in-process retries (1s+5s+15s backoff for transient SES 4xx). If the process dies mid-send, nothing resumes it.
This bites on every deploy (and any crash):
- The server gets SIGTERM and drains via http.Server.Shutdown (~30s budget, cmd/e2a/main.go). A send that finishes in time is fine.
- A send still in flight when the grace window expires is SIGKILLed → lost, with no server-side retry. Note the relay's own backoff (~21s) can exceed a tight grace window. (Mitigated, not fixed, by stop_grace_period: 35s; see Mnexa-AI/e2a-ops#153.)
- On a hard SIGKILL the idempotency guard's deferred finalize() never runs (internal/agent/idempotency_guard.go), leaving the claim stuck "in-flight" → client retries hit 409 until the lock expires. And if the kill landed after SES accepted the message but before finalize() ran, the original delivers, the server has no record of it, and the eventual client retry re-sends → duplicate.
Net: outbound send is at-most-once on the happy path, but a mid-flight kill exposes a small lost-or-duplicated window that the server never auto-recovers.
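The timing argument above can be checked with a little arithmetic. The following is a minimal sketch, not code from the repository: it assumes each of the three backoff delays is slept in full before a retry (so up to four attempts), and assumedAttempt is a made-up per-attempt SMTP duration used only for illustration.

```go
package main

import "time"

// Backoff delays between in-process retries, as described above.
var relayBackoff = []time.Duration{1 * time.Second, 5 * time.Second, 15 * time.Second}

const (
	defaultGrace   = 30 * time.Second // approximate http.Server.Shutdown budget
	mitigatedGrace = 35 * time.Second // stop_grace_period after e2a-ops#153
	assumedAttempt = 3 * time.Second  // hypothetical time per SMTP attempt
)

// worstCaseSend returns the longest a single send can take when every attempt
// lasts attemptTime and every backoff delay is slept in full.
func worstCaseSend(attemptTime time.Duration) time.Duration {
	total := attemptTime // first attempt
	for _, delay := range relayBackoff {
		total += delay + attemptTime // sleep, then retry
	}
	return total
}

// fitsInGrace reports whether a send that starts just before SIGTERM can
// still finish inside the grace window in the worst case.
func fitsInGrace(grace, attemptTime time.Duration) bool {
	return worstCaseSend(attemptTime) <= grace
}
```

With zero-cost attempts the backoff alone adds up to the ~21s quoted above. Under the assumed 3s per attempt the worst case is 33s, which misses a 30s window and fits a 35s one, so the longer grace period helps but still depends on how slow each SMTP attempt is.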
Proposed solution
Make the send durable, mirroring the pattern the webhook delivery path already uses (transactional outbox + River worker):
- Enqueue, don't send inline. In the send handler, write the composed message row and a send job in one transaction (durable outbox), then return 202/the message id. No SES call on the request path.
- Relay from a worker. A River (or equivalent) worker picks up the job and does the SES SMTP relay, with bounded retries + backoff, updating the message's send status (queued → sending → sent | failed) and provider_message_id.
- Idempotent delivery. Carry an idempotency token on the job so a worker retry after a crash can't double-send (dedup on the message id / a sent marker checked before the SES call).
- Reconcile stuck sending rows on startup (re-enqueue), and surface terminal failed to the agent (webhook / status).
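The four steps above can be sketched end to end. This is an illustration under stated assumptions, not the proposed implementation: an in-memory Store with a mutex stands in for the Postgres messages table, the River queue, and the transaction that ties them together, and every name here (Store, Enqueue, WorkOne, Reconcile, Relay, ErrTransient, maxAttempts) is hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

type Status string

const (
	Queued  Status = "queued"
	Sending Status = "sending"
	Sent    Status = "sent"
	Failed  Status = "failed"
)

type Message struct {
	ID                string
	Status            Status
	Attempts          int
	ProviderMessageID string
}

// Relay stands in for the SES SMTP call; it returns the provider's message id.
type Relay func(messageID string) (string, error)

// ErrTransient marks a retryable relay failure (for example an SES 4xx).
var ErrTransient = errors.New("transient relay error")

const maxAttempts = 4 // assumed bound, not taken from the codebase

// Store is an in-memory stand-in for the messages table plus the job queue.
type Store struct {
	mu       sync.Mutex
	messages map[string]*Message
	jobs     []string // message ids waiting for a worker
}

func NewStore() *Store { return &Store{messages: make(map[string]*Message)} }

// Enqueue writes the message row and its send job together. Holding the lock
// for both writes stands in for a single database transaction.
func (s *Store) Enqueue(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.messages[id]; exists {
		return // already accepted; never create a second job for the same id
	}
	s.messages[id] = &Message{ID: id, Status: Queued}
	s.jobs = append(s.jobs, id)
}

// WorkOne runs one queued job and reports whether there was a job to run.
func (s *Store) WorkOne(relay Relay) bool {
	s.mu.Lock()
	if len(s.jobs) == 0 {
		s.mu.Unlock()
		return false
	}
	id := s.jobs[0]
	s.jobs = s.jobs[1:]
	m := s.messages[id]
	if m.Status == Sent || m.Status == Failed {
		s.mu.Unlock()
		return true // sent marker: a terminal row is never relayed again
	}
	m.Status = Sending
	m.Attempts++
	s.mu.Unlock()

	providerID, err := relay(id) // the slow call happens outside the lock

	s.mu.Lock()
	defer s.mu.Unlock()
	switch {
	case err == nil:
		m.Status, m.ProviderMessageID = Sent, providerID
	case errors.Is(err, ErrTransient) && m.Attempts < maxAttempts:
		m.Status = Queued
		s.jobs = append(s.jobs, id) // bounded retry; real code would add backoff
	default:
		m.Status = Failed // terminal: surface to the agent
	}
	return true
}

// Reconcile re-enqueues rows a crashed worker left in "sending". It is meant
// to run once on startup, before workers start, and returns the requeue count.
func (s *Store) Reconcile() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for _, m := range s.messages {
		if m.Status == Sending {
			m.Status = Queued
			s.jobs = append(s.jobs, m.ID)
			n++
		}
	}
	return n
}
```

One caveat the sketch makes visible: the marker check only stops retries that begin after sent was recorded. If the process dies after the relay accepted the message but before the status update, Reconcile would requeue the row and the message would go out twice, so that window likely needs its own handling.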
Acceptance criteria
- A kill -9 of the server mid-send results in the message being delivered exactly once after restart (no loss, no duplicate).
- A deploy never loses an accepted send.
- /send returns promptly (no longer blocks on the SES round-trip + backoff).
- HITL hold, screening, and the existing idempotency-key contract still hold.
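The "returns promptly" criterion is easy to pin down in a handler-level test. A minimal sketch, assuming a hypothetical enqueueFunc that durably records the message and its send job; the request and response field names (to, body, message_id, status) are illustrative, not the existing /send contract.

```go
package main

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"strings"
)

// enqueueFunc durably records the message and its send job and returns the
// message id. It is a hypothetical seam; no relay call happens behind it.
type enqueueFunc func(to, body string) (messageID string, err error)

type sendRequest struct {
	To   string `json:"to"`
	Body string `json:"body"`
}

// sendHandler accepts a message, enqueues it, and answers 202 without ever
// touching the relay on the request path.
func sendHandler(enqueue enqueueFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req sendRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.To == "" {
			http.Error(w, "invalid request", http.StatusBadRequest)
			return
		}
		id, err := enqueue(req.To, req.Body)
		if err != nil {
			http.Error(w, "could not queue message", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		json.NewEncoder(w).Encode(map[string]string{"message_id": id, "status": "queued"})
	})
}

// postSend drives the handler in-process and returns the recorded response.
func postSend(h http.Handler, jsonBody string) *httptest.ResponseRecorder {
	req := httptest.NewRequest(http.MethodPost, "/send", strings.NewReader(jsonBody))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec
}
```

A test can pass an enqueueFunc that only counts calls, post a valid body through postSend, and assert a 202 with exactly one enqueue. HITL hold, screening, and the idempotency-key check would sit in front of the enqueue call and are left out here.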
Context
Surfaced while reviewing deploy-time availability. Partial mitigation shipped: stop_grace_period: 35s (Mnexa-AI/e2a-ops#153) lets graceful drain + relay backoff finish, converting most in-flight sends from killed → completed — but the durable path is the real fix.