Make outbound /send durable (enqueue + worker) so it survives restarts & crashes #327

Description

@jiashuoz

Problem

The outbound send path is synchronous and not durable: a.sender.Send() (internal/agent/api.go:1185) runs the SES SMTP transaction inline in the HTTP request, with only in-process retries (1s+5s+15s backoff for transient SES 4xx). If the process dies mid-send, nothing resumes it.

This bites on every deploy (and any crash):

  • The server gets SIGTERM and drains via http.Server.Shutdown (~30s budget, cmd/e2a/main.go). A send that finishes in time is fine.
  • A send still in-flight when the grace window expires is SIGKILLed → lost, with no server-side retry. Note the relay's own backoff (~21s) can exceed a tight grace window. (Mitigated, not fixed, by stop_grace_period: 35s — Mnexa-AI/e2a-ops#153.)
  • On a hard SIGKILL the idempotency guard's deferred finalize() never runs (internal/agent/idempotency_guard.go), leaving the claim stuck "in-flight" → client retries hit 409 until the lock expires. If the kill lands after SES accepted the message but before finalize() ran, the original is delivered, the server has no record of it, and the eventual client retry re-sends → duplicate.

Net: outbound send is at-most-once on the happy path, but a mid-flight kill exposes a small lost-or-duplicated window that the server never auto-recovers.

Proposed solution

Make the send durable, mirroring the pattern the webhook delivery path already uses (transactional outbox + River worker):

  1. Enqueue, don't send inline. In the send handler, write the composed message row and a send job in one transaction (durable outbox), then return 202 with the message id. No SES call on the request path.
  2. Relay from a worker. A River (or equivalent) worker picks up the job and does the SES SMTP relay, with bounded retries + backoff, updating the message's send status (queued → sending → sent | failed) and provider_message_id.
  3. Idempotent delivery. Carry an idempotency token on the job so a worker retry after a crash can't double-send (dedup on the message id, or on a sent marker checked before the SES call).
  4. Reconcile rows stuck in sending on startup (re-enqueue them), and surface the terminal failed status to the agent (webhook / status).
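
The four steps above can be sketched with an in-memory stand-in for the message table and job queue. Everything here is hypothetical (the `Store` type, `Relay` signature, `maxAttempts`); a real version would do `Enqueue` as one database transaction and run `Work` inside a River worker rather than a loop.

```go
package main

import (
	"errors"
	"fmt"
)

type Status string

const (
	Queued  Status = "queued"
	Sending Status = "sending"
	Sent    Status = "sent"
	Failed  Status = "failed"
)

type Message struct {
	ID                string
	Status            Status
	ProviderMessageID string
	Attempts          int
}

// Store stands in for the messages table plus the job queue.
type Store struct {
	msgs map[string]*Message
	jobs []string
}

func NewStore() *Store { return &Store{msgs: map[string]*Message{}} }

// Enqueue records the message and its job together (step 1).
// Enqueuing an existing id is a no-op.
func (s *Store) Enqueue(id string) {
	if _, ok := s.msgs[id]; ok {
		return
	}
	s.msgs[id] = &Message{ID: id, Status: Queued}
	s.jobs = append(s.jobs, id)
}

// Relay is the provider call (SES in the issue); it returns a provider id.
type Relay func(id string) (string, error)

const maxAttempts = 3 // assumed retry bound

// Work drains the current jobs once (steps 2 and 3).
func (s *Store) Work(relay Relay) {
	pending := s.jobs
	s.jobs = nil
	for _, id := range pending {
		m := s.msgs[id]
		if m.Status == Sent {
			continue // sent marker: never relay twice
		}
		m.Status = Sending
		m.Attempts++
		pid, err := relay(id)
		switch {
		case err == nil:
			m.Status, m.ProviderMessageID = Sent, pid
		case m.Attempts >= maxAttempts:
			m.Status = Failed // terminal; would be surfaced to the agent
		default:
			m.Status = Queued
			s.jobs = append(s.jobs, id)
		}
	}
}

// Reconcile re-enqueues rows left in sending by a crash (step 4).
func (s *Store) Reconcile() int {
	n := 0
	for id, m := range s.msgs {
		if m.Status == Sending {
			m.Status = Queued
			s.jobs = append(s.jobs, id)
			n++
		}
	}
	return n
}

func main() {
	s := NewStore()
	s.Enqueue("msg-1")
	s.Enqueue("msg-1") // duplicate request: no second job

	calls := 0
	flaky := func(id string) (string, error) {
		calls++
		if calls == 1 {
			return "", errors.New("451 transient")
		}
		return "provider-" + id, nil
	}
	s.Work(flaky) // attempt 1 fails, job is re-queued
	s.Work(flaky) // attempt 2 succeeds
	s.Work(flaky) // nothing left to do

	m := s.msgs["msg-1"]
	fmt.Println(m.Status, m.Attempts, calls)
	// prints "sent 2 2"
}
```

One limitation of a sent marker alone: a crash after the provider accepts but before the marker is written leaves the row in sending, and reconcile will relay it again. Closing that gap would need either a provider-side dedup or a way to query the provider for the message before re-sending.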

Acceptance criteria

  • A kill -9 of the server mid-send results in the message being delivered exactly once after restart (no loss, no duplicate).
  • A deploy never loses an accepted send.
  • /send returns promptly (no longer blocks on the SES round-trip + backoff).
  • HITL hold, screening, and the existing idempotency-key contract still hold.

Context

Surfaced while reviewing deploy-time availability. Partial mitigation shipped: stop_grace_period: 35s (Mnexa-AI/e2a-ops#153) lets graceful drain + relay backoff finish, converting most in-flight sends from killed → completed — but the durable path is the real fix.
