retrymq limit & DLQ #663

@alexluong

When a retry task's executor fails (e.g., event not found in logstore, transient errors), the message stays in the queue and becomes visible again after a fixed 30s visibility timeout. This cycle repeats indefinitely; nothing caps the number of attempts.

Problems

  • A permanently failing retry message cycles forever with no cap
  • No dead-letter path to detect or surface stuck messages
  • Fixed visibility timeout on re-fetch failures, with no backoff between attempts

The underlying queue already tracks receive count and supports per-message visibility changes, so the primitives are there.

Open questions

Max receive count

What should the default be?

Suggestion: 5 internal re-fetch attempts before giving up. This is separate from the delivery retry max limit, which controls how many times we re-deliver to the destination.

Backoff on re-fetch

Should we apply exponential backoff on internal failures (e.g., 30s → 60s → 120s), or is a fixed interval fine since these are typically short-lived transient issues?
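To make the exponential option concrete, here is a minimal sketch of how the visibility timeout could be derived from the receive count. The function name, the 1-based receive count, and the 900s cap are assumptions for illustration, not existing retrymq behavior.

```python
def refetch_visibility_timeout(
    receive_count: int,
    base_seconds: int = 30,   # matches the current fixed 30s timeout
    cap_seconds: int = 900,   # hypothetical upper bound, not from the issue
) -> int:
    """Return the visibility timeout to apply after a failed receive.

    receive_count is assumed to be 1-based: 1 for the first failed attempt.
    The timeout doubles with each attempt and is clamped to cap_seconds.
    """
    if receive_count < 1:
        raise ValueError("receive_count must be >= 1")
    return min(base_seconds * 2 ** (receive_count - 1), cap_seconds)


if __name__ == "__main__":
    # Prints the schedule for the first five attempts.
    print([refetch_visibility_timeout(n) for n in range(1, 6)])
```

With the defaults above, the first three failures yield 30s, 60s, and 120s, which matches the progression in the question. A fixed interval is the same function with the exponent removed.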

What happens when max is exceeded

Suggestion: Route to a DLQ. Gives observability into stuck messages and the ability to replay them.
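A rough sketch of what the routing decision might look like, assuming the queue increments a per-message receive count on every fetch. The `Message` and `Queues` types and the `process` function are hypothetical stand-ins, not the actual retrymq or queue API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    id: str
    body: str
    receive_count: int = 0  # assumed to be maintained by the queue on each fetch


@dataclass
class Queues:
    """In-memory stand-in for the ack, requeue, and dead-letter paths."""
    acked: List[str] = field(default_factory=list)
    requeued: List[str] = field(default_factory=list)
    dlq: List[Message] = field(default_factory=list)


def process(
    msg: Message,
    execute: Callable[[Message], None],
    queues: Queues,
    max_receive_count: int = 5,  # suggested default from this issue
) -> str:
    """Run the executor once and decide what happens to the message."""
    try:
        execute(msg)
    except Exception:
        if msg.receive_count >= max_receive_count:
            # Attempts exhausted: park the message for inspection or replay.
            queues.dlq.append(msg)
            return "dead-lettered"
        # Leave the message to become visible again after the timeout.
        queues.requeued.append(msg.id)
        return "retry"
    queues.acked.append(msg.id)
    return "acked"
```

In this sketch a permanently failing message is retried four times and dead-lettered on the fifth receive, so it can no longer cycle forever.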

Configuration

Suggestion: Expose these as retrymq config, similar to how deliverymq is configured, e.g., RETRYMQ_MAX_RECEIVE_COUNT and RETRYMQ_VISIBILITY_TIMEOUT_SECONDS.
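As an illustration of the shape this config could take, the sketch below reads the two proposed environment variables and falls back to the values discussed in this issue (5 attempts, 30s). The `RetryMQConfig` type and `load_retrymq_config` function are hypothetical and do not mirror how deliverymq config is actually loaded.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RetryMQConfig:
    max_receive_count: int = 5            # suggested default from this issue
    visibility_timeout_seconds: int = 30  # current fixed timeout


def _positive_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a positive integer from env, using default when unset or empty."""
    raw = env.get(name, "").strip()
    if not raw:
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


def load_retrymq_config(env: Mapping[str, str] = os.environ) -> RetryMQConfig:
    """Build the config from the proposed RETRYMQ_* variables."""
    return RetryMQConfig(
        max_receive_count=_positive_int(env, "RETRYMQ_MAX_RECEIVE_COUNT", 5),
        visibility_timeout_seconds=_positive_int(
            env, "RETRYMQ_VISIBILITY_TIMEOUT_SECONDS", 30
        ),
    )
```

Rejecting zero and negative values at load time keeps a misconfigured limit from silently disabling the cap.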
