
fix(campaign-packer): terminal-classify deterministic audience failures to stop redelivery loops (EVO-1676)#36

Merged: dpaes merged 1 commit into develop from fix/EVO-1676 on Jun 9, 2026



@nickoliveira23


Summary

A campaigns.pack message whose audience computation fails deterministically (malformed segment SQL, invalid campaign config) was requeued forever. Only CampaignNotFoundError was terminal; every other error fell through to nack(requeue=true). Because the config never changes between redeliveries, the poison message hot-looped and blocked the partition without ever showing up in terminal_failures.

This introduces a shared, layered terminal-error taxonomy + ack/nack policy so the "can never succeed on retry" decision lives at the boundary where the knowledge is, not in copy-pasted try/catch blocks. The campaign-sender (4.2 / EVO-1217) consumer will reuse the same policy.

  • shared/errors/TerminalError: neutral marker base for permanent failures.
  • shared/broker/consumer/processWithAckPolicy: success → ack, TerminalError → nack(false), anything else → nack(true); optional meta merged into the failure log.
  • shared/persistence/isDeterministicDbError: SQLSTATE classifier (42/22 deterministic; 08/53/57/40 transient; unknown → transient/retry).
  • shared/audience/errors: AudienceConfigError (segment/SQL validation) + DeterministicAudienceError (deterministic DB failure, wraps cause).
  • CampaignNotFoundError / InvalidEnvelopeError now extend TerminalError; segment-query-builder validation throws AudienceConfigError; pack() wraps computeAudience and classifies; both consumers delegate to the shared policy.
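
For readers who have not opened the diff, the policy in the list above can be sketched roughly as follows. This is a minimal illustration, not the merged code: `AckHandle`, `FailureLogger`, `Disposition`, `dispositionFor`, and the exact signature of `processWithAckPolicy` are assumptions made for the example. Only the three outcomes (ack, nack(false), nack(true)), the `terminal` log flag, and the merged `meta` come from this PR.

```typescript
// Hypothetical sketch: names mirror the PR description, bodies and signatures are assumptions.

/** Neutral marker base: anything extending this can never succeed on retry. */
class TerminalError extends Error {
  constructor(message: string) {
    super(message);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
  }
}

/** Assumed minimal broker surface for one in-flight message. */
interface AckHandle {
  ack(): Promise<void>;
  nack(requeue: boolean): Promise<void>;
}

/** Assumed structured-logger surface. */
interface FailureLogger {
  error(fields: Record<string, unknown>, message: string): void;
}

type Disposition = "ack" | "drop" | "requeue";

/** The whole retry decision: terminal errors drop, everything else requeues. */
function dispositionFor(err: unknown): "drop" | "requeue" {
  return err instanceof TerminalError ? "drop" : "requeue";
}

async function processWithAckPolicy(
  handle: AckHandle,
  work: () => Promise<void>,
  logger: FailureLogger,
  meta: Record<string, unknown> = {},
): Promise<Disposition> {
  try {
    await work();
    // ack() sits inside the try, so an ack failure falls through to requeue,
    // matching the pre-existing behavior listed under "Out of scope".
    await handle.ack();
    return "ack";
  } catch (err) {
    const disposition = dispositionFor(err);
    const terminal = disposition === "drop";
    // Caller-supplied meta is merged into the failure log alongside the terminal flag.
    logger.error({ ...meta, terminal, err }, "message processing failed");
    await handle.nack(!terminal);
    return disposition;
  }
}
```

Under this shape a consumer only decides which errors extend `TerminalError`; it never chooses the requeue flag itself.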

Deterministic failures now drop terminally and surface in terminal_failures. The residual risk (an unclassified deterministic error) is left to the broker-level redelivery backstop tracked in EVO-1677 (filed, Backlog).

Behavior change

Terminal drops (incl. campaign-not-found) now log at error level with a terminal flag, unifying the two consumers (event-process already did this; campaign-packer previously logged not-found at warn).

Out of scope (pre-existing, not introduced here)

  • validateSQLQuery uses substring keyword matching (e.g. created_at contains CREATE) — this change actually makes such false-positives drop terminally instead of looping.
  • ack() failure after a successful pack() requeues a processed message — identical to prior behavior; relies on idempotency (EVO-1204).
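
To make the first note concrete, here is a small sketch of why substring keyword matching trips on `created_at`. The keyword list and both function names are hypothetical; the real `validateSQLQuery` guards are not shown in this PR, and the word-boundary variant is only one possible alternative, not something this change proposes.

```typescript
// Hypothetical keyword list; the real validateSQLQuery guard list is not shown in this PR.
const FORBIDDEN_KEYWORDS = ["CREATE", "DROP", "ALTER"];

/** Substring matching as the note describes: flags a keyword even inside an identifier. */
function hasForbiddenSubstring(sql: string): boolean {
  const upper = sql.toUpperCase();
  return FORBIDDEN_KEYWORDS.some((kw) => upper.includes(kw));
}

/** Assumed alternative for comparison: match the keyword only as a whole word. */
function hasForbiddenWord(sql: string): boolean {
  return FORBIDDEN_KEYWORDS.some((kw) => new RegExp(`\\b${kw}\\b`, "i").test(sql));
}

// A harmless segment query that mentions a created_at column.
const segmentQuery = "SELECT id FROM contacts WHERE created_at > now()";

// Substring matching rejects it because "CREATED_AT" contains "CREATE".
const rejectedBySubstring = hasForbiddenSubstring(segmentQuery);

// Whole-word matching accepts it because "created" continues past the keyword.
const rejectedByWord = hasForbiddenWord(segmentQuery);
```

With this PR, such a false positive raises a typed terminal error and the message is dropped once, rather than being redelivered indefinitely.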

Security

No auth/tenant surface touched. The SQLSTATE classifier reads only error codes; no user input reaches it. validateSQLQuery SELECT-only / no-comments / no-DDL guards are unchanged (now raising a typed terminal error).
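
As an illustration of "reads only error codes", a classifier of this kind can be sketched as a lookup on the two-character SQLSTATE class. The class prefixes are the ones listed in the Summary; `DbErrorClass`, `SQLSTATE_CLASS`, `classifySqlState`, and `isDeterministicSqlState` are hypothetical names, and the merged `isDeterministicDbError` may be structured differently.

```typescript
// Hypothetical sketch of a SQLSTATE class lookup; not the merged implementation.
type DbErrorClass = "deterministic" | "transient" | "unknown";

// Class prefixes as listed in this PR's summary, with their standard SQLSTATE class names.
const SQLSTATE_CLASS: Record<string, DbErrorClass> = {
  "42": "deterministic", // syntax error or access rule violation
  "22": "deterministic", // data exception
  "08": "transient", // connection exception
  "53": "transient", // insufficient resources
  "57": "transient", // operator intervention
  "40": "transient", // transaction rollback
};

/** Classifies a raw code; anything that is not a five-character SQLSTATE is "unknown". */
function classifySqlState(sqlState: unknown): DbErrorClass {
  if (typeof sqlState !== "string" || !/^[0-9A-Z]{5}$/.test(sqlState)) {
    return "unknown";
  }
  return SQLSTATE_CLASS[sqlState.slice(0, 2)] ?? "unknown";
}

/** Only an explicitly deterministic class is terminal; "unknown" stays on the retry path. */
function isDeterministicSqlState(sqlState: unknown): boolean {
  return classifySqlState(sqlState) === "deterministic";
}
```

The input is only ever compared against a fixed table, so no part of the code string is interpolated or executed. Defaulting unlisted classes to retry is the conservative choice: a misclassified transient error would otherwise be dropped permanently.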

Test plan

  • evo-flow: npm run typecheck — clean
  • evo-flow: npx eslint <changed files> — clean (no net-new lint in segment-query-builder; pre-existing baseline preserved)
  • evo-flow: npm test -- src/shared/persistence src/shared/broker/consumer src/runners/campaign-packer src/runners/event-process (34/34)
  • evo-flow: npm test (full) — 493 passed; the single failure (campaigns.controller.spec) is pre-existing on develop (confirmed via stash), unrelated to this change.

Changed Files

  • src/shared/errors/terminal-error.ts (new)
  • src/shared/persistence/deterministic-db-error.ts (+spec, new)
  • src/shared/broker/consumer/process-with-ack-policy.ts (+spec, new)
  • src/shared/audience/errors/audience.errors.ts (new)
  • src/shared/audience/segment-query-builder.service.ts
  • src/runners/campaign-packer/services/campaign-packer.service.ts (+spec)
  • src/runners/campaign-packer/consumers/campaigns-pack.consumer.ts
  • src/runners/campaign-packer/errors/campaign-not-found.error.ts
  • src/runners/event-process/services/event-process.service.ts
  • src/runners/event-process/services/events-received.consumer.ts

Linked Issue

  • EVO-1676

🤖 Generated with Claude Code

…es to stop redelivery loops (EVO-1676)

A campaigns.pack message whose audience computation fails deterministically
(malformed segment SQL, invalid campaign config) was requeued forever: only
CampaignNotFoundError was treated as terminal, every other error fell through
to nack(requeue=true), and the config never changes between redeliveries — so
the poison message hot-looped and blocked the partition without ever showing up
in terminal_failures.

Introduce a shared, layered terminal-error taxonomy and ack/nack policy so the
"this can never succeed on retry" decision lives at the boundary where the
knowledge is, not in copy-pasted try/catch blocks (the campaign-sender / 4.2
consumer will reuse it):

- shared/errors/TerminalError: neutral marker base for permanent failures.
- shared/broker/consumer/processWithAckPolicy: success -> ack, TerminalError ->
  nack(false), anything else -> nack(true). Optional meta merged into the
  failure log.
- shared/persistence/isDeterministicDbError: SQLSTATE classifier (42/22
  deterministic; 08/53/57/40 transient; unknown -> transient/retry).
- shared/audience/errors: AudienceConfigError (segment/SQL validation) and
  DeterministicAudienceError (deterministic DB failure, wraps cause).
- CampaignNotFoundError / InvalidEnvelopeError now extend TerminalError;
  segment-query-builder validation throws AudienceConfigError; pack() wraps
  computeAudience and classifies; both consumers delegate to the shared policy.

Deterministic failures now drop terminally (and surface in terminal_failures).
The residual risk (an unclassified deterministic error) is covered by the
broker-level redelivery backstop tracked in EVO-1677.

Behavior change: terminal drops (incl. campaign-not-found) now log at error
level with a `terminal` flag, unifying the two consumers.

Tests: SQLSTATE classifier, ack/nack policy, and pack() classification
(deterministic -> terminal, config -> terminal, transient -> requeue).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@sourcery-ai sourcery-ai Bot left a comment


Sorry @nickoliveira23, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@dpaes dpaes left a comment


Approving. Reviewed against the EVO-1676 acceptance criteria with independent adversarial verification of each finding.

Core goal verified to primary source. Deterministic audience failures are now terminal-classified instead of requeued forever:

  • The SQLSTATE plumbing is correct — pg sets DatabaseError.code, TypeORM copies driverError props onto QueryFailedError, and extractSqlState reads driverError?.code ?? code, so a deterministic SQL error (class 42/22) from the raw manager.query path reaches the terminal branch rather than falling through to unknown → retry.
  • The Kafka partition-block half is genuinely resolved: a terminal nack(requeue=false) commits offset+1, advancing past the poison message; evo_broker_terminal_failures_total now increments on the deterministic drop.
  • The taxonomy (neutral TerminalError base + processWithAckPolicy) is the right shape — new consumers inherit correct redelivery for free.
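
The plumbing described in the first bullet can be sketched as below. The `driverError?.code ?? code` lookup and the names `extractSqlState`, `computeAudienceOrClassify`, `TerminalError`, and `DeterministicAudienceError` come from this PR; the signatures, the `original` field, and the inline class check are assumptions made to keep the example self-contained.

```typescript
// Hypothetical sketch: signatures and field names are assumptions, not the merged code.

class TerminalError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
  }
}

/** Deterministic DB failure; keeps the original error for diagnostics. */
class DeterministicAudienceError extends TerminalError {
  readonly original: unknown;
  constructor(message: string, original: unknown) {
    super(message);
    this.original = original;
  }
}

/** Prefers the driver-level code and falls back to a code on the error itself. */
function extractSqlState(err: unknown): string | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const e = err as { code?: unknown; driverError?: { code?: unknown } | null };
  const code = e.driverError?.code ?? e.code;
  return typeof code === "string" ? code : undefined;
}

/** Runs the audience computation and promotes class 42/22 failures to a terminal error. */
async function computeAudienceOrClassify<T>(compute: () => Promise<T>): Promise<T> {
  try {
    return await compute();
  } catch (err) {
    const sqlState = extractSqlState(err);
    const sqlClass = sqlState?.slice(0, 2);
    if (sqlClass === "42" || sqlClass === "22") {
      throw new DeterministicAudienceError(
        `deterministic audience query failure (SQLSTATE ${sqlState})`,
        err,
      );
    }
    // Transient or unknown: rethrow unchanged so the consumer requeues.
    throw err;
  }
}
```

Because the wrapper rethrows everything it does not recognise, an error without a usable code keeps the previous requeue behavior.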

Non-blocking notes (left to your discretion for a follow-up card):

  • N1: pack()'s campaignRepository.findOne() runs outside computeAudienceOrClassify; a deterministic error there would still requeue. Acceptable: it's an entity-mapped query on the core campaigns table (deploy-wide outage, not a per-message poison), and the card scopes computeAudience, which is wrapped.
  • N2: SQLSTATE class 23 (integrity) falls to unknown → retry. Safe here because clearAudience runs before the insert (idempotent on redelivery) and the entity has no unique constraint + a UUID PK. Worth a one-line comment so a future reader doesn't read it as an oversight.
  • N3: the deterministic → nack(false) → metric path is proven by composition of unit specs, not one integration test. Each seam is individually pinned; optionally add a consumer-level test that rejects { code: '42601' } and asserts broker.nack(msg, false).
  • CI note: evo-flow CI runs Sourcery only — the "typecheck clean / eslint clean / 493 passed" is self-reported; a local npm test is the standard confirmation for this repo.
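
A consumer-level check along the lines of N3 could look roughly like this. Everything here is a stand-in written for the example: `FakeBroker`, `Message`, `consume`, and `runConsumerLevelCheck` are hypothetical, and the real consumer, broker client, and test framework in evo-flow are not shown in this PR.

```typescript
// Hypothetical stand-ins; the real consumer and broker types are not shown in this PR.
class TerminalError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

interface Message {
  id: string;
}

/** Records every ack/nack so the test can assert on the exact sequence. */
class FakeBroker {
  readonly calls: string[] = [];
  async ack(msg: Message): Promise<void> {
    this.calls.push(`ack:${msg.id}`);
  }
  async nack(msg: Message, requeue: boolean): Promise<void> {
    this.calls.push(`nack:${msg.id}:${requeue}`);
  }
}

/** Compressed consumer: classify class 42/22 as terminal, then apply the ack/nack policy. */
async function consume(
  broker: FakeBroker,
  msg: Message,
  pack: () => Promise<void>,
): Promise<void> {
  try {
    try {
      await pack();
    } catch (err) {
      const code = (err as { code?: unknown } | null)?.code;
      if (typeof code === "string" && /^(42|22)/.test(code)) {
        throw new TerminalError(`SQLSTATE ${code}`);
      }
      throw err;
    }
    await broker.ack(msg);
  } catch (err) {
    // Terminal errors are dropped; everything else is requeued.
    await broker.nack(msg, !(err instanceof TerminalError));
  }
}

/** Drives one poison, one transient, and one healthy message through the same seam. */
async function runConsumerLevelCheck(): Promise<string[]> {
  const broker = new FakeBroker();
  await consume(broker, { id: "poison" }, async () => {
    throw { code: "42601" };
  });
  await consume(broker, { id: "flaky" }, async () => {
    throw { code: "08006" };
  });
  await consume(broker, { id: "ok" }, async () => {});
  return broker.calls;
}
```

A real version would inject the actual consumer and assert `broker.nack(msg, false)` with the repository's test runner; the point of the sketch is only that the rejection, the classification, and the nack flag are exercised in a single call.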

Merging to develop.

@dpaes dpaes merged commit e7b9f24 into develop Jun 9, 2026
4 checks passed
@dpaes dpaes deleted the fix/EVO-1676 branch June 9, 2026 18:04