fix(campaign-packer): terminal-classify deterministic audience failures to stop redelivery loops (EVO-1676)#36
Merged
Merged
Conversation
…es to stop redelivery loops (EVO-1676) A campaigns.pack message whose audience computation fails deterministically (malformed segment SQL, invalid campaign config) was requeued forever: only CampaignNotFoundError was treated as terminal, every other error fell through to nack(requeue=true), and the config never changes between redeliveries — so the poison message hot-looped and blocked the partition without ever showing up in terminal_failures. Introduce a shared, layered terminal-error taxonomy and ack/nack policy so the "this can never succeed on retry" decision lives at the boundary where the knowledge is, not in copy-pasted try/catch blocks (the campaign-sender / 4.2 consumer will reuse it): - shared/errors/TerminalError: neutral marker base for permanent failures. - shared/broker/consumer/processWithAckPolicy: success -> ack, TerminalError -> nack(false), anything else -> nack(true). Optional meta merged into the failure log. - shared/persistence/isDeterministicDbError: SQLSTATE classifier (42/22 deterministic; 08/53/57/40 transient; unknown -> transient/retry). - shared/audience/errors: AudienceConfigError (segment/SQL validation) and DeterministicAudienceError (deterministic DB failure, wraps cause). - CampaignNotFoundError / InvalidEnvelopeError now extend TerminalError; segment-query-builder validation throws AudienceConfigError; pack() wraps computeAudience and classifies; both consumers delegate to the shared policy. Deterministic failures now drop terminally (and surface in terminal_failures). The residual risk (an unclassified deterministic error) is covered by the broker-level redelivery backstop tracked in EVO-1677. Behavior change: terminal drops (incl. campaign-not-found) now log at error level with a `terminal` flag, unifying the two consumers. Tests: SQLSTATE classifier, ack/nack policy, and pack() classification (deterministic -> terminal, config -> terminal, transient -> requeue). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Sorry @nickoliveira23, you have reached your weekly rate limit of 500000 diff characters.
Please try again later or upgrade to continue using Sourcery
dpaes
approved these changes
Jun 9, 2026
dpaes
left a comment
There was a problem hiding this comment.
Approving. Reviewed against the EVO-1676 acceptance criteria with independent adversarial verification of each finding.
Core goal verified to primary source. Deterministic audience failures are now terminal-classified instead of requeued forever:
- The SQLSTATE plumbing is correct — pg sets
DatabaseError.code, TypeORM copiesdriverErrorprops ontoQueryFailedError, andextractSqlStatereadsdriverError?.code ?? code, so a deterministic SQL error (class42/22) from the rawmanager.querypath reaches the terminal branch rather than falling through tounknown → retry. - The Kafka partition-block half is genuinely resolved: a terminal
nack(requeue=false)commitsoffset+1, advancing past the poison message;evo_broker_terminal_failures_totalnow increments on the deterministic drop. - The taxonomy (neutral
TerminalErrorbase +processWithAckPolicy) is the right shape — new consumers inherit correct redelivery for free.
Non-blocking notes (left to your discretion for a follow-up card):
N1—pack()'scampaignRepository.findOne()runs outsidecomputeAudienceOrClassify; a deterministic error there would still requeue. Acceptable: it's an entity-mapped query on the corecampaignstable (deploy-wide outage, not a per-message poison), and the card scopescomputeAudience, which is wrapped.N2— SQLSTATE class23(integrity) falls tounknown → retry. Safe here becauseclearAudienceruns before the insert (idempotent on redelivery) and the entity has no unique constraint + a UUID PK. Worth a one-line comment so a future reader doesn't read it as an oversight.N3— the deterministic →nack(false)→ metric path is proven by composition of unit specs, not one integration test. Each seam is individually pinned; optionally add a consumer-level test that rejects{ code: '42601' }and assertsbroker.nack(msg, false).- CI note: evo-flow CI runs Sourcery only — the "typecheck clean / eslint clean / 493 passed" is self-reported; a local
npm testis the standard confirmation for this repo.
Merging to develop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A
campaigns.packmessage whose audience computation fails deterministically (malformed segment SQL, invalid campaign config) was requeued forever — onlyCampaignNotFoundErrorwas terminal, everything else fell tonack(requeue=true), the config never changes between redeliveries, so the poison message hot-looped and blocked the partition without ever showing interminal_failures.This introduces a shared, layered terminal-error taxonomy + ack/nack policy so the "can never succeed on retry" decision lives at the boundary where the knowledge is, not in copy-pasted try/catch blocks. The campaign-sender (4.2 / EVO-1217) consumer reuses the same policy.
shared/errors/TerminalError— neutral marker base for permanent failures.shared/broker/consumer/processWithAckPolicy— success →ack,TerminalError→nack(false), else →nack(true); optionalmetamerged into the failure log.shared/persistence/isDeterministicDbError— SQLSTATE classifier (42/22 deterministic; 08/53/57/40 transient; unknown → transient/retry).shared/audience/errors—AudienceConfigError(segment/SQL validation) +DeterministicAudienceError(deterministic DB failure, wrapscause).CampaignNotFoundError/InvalidEnvelopeErrornow extendTerminalError;segment-query-buildervalidation throwsAudienceConfigError;pack()wrapscomputeAudienceand classifies; both consumers delegate to the shared policy.Deterministic failures now drop terminally and surface in
terminal_failures. The residual risk (an unclassified deterministic error) is the broker-level redelivery backstop tracked in EVO-1677 (filed, Backlog).Behavior change
Terminal drops (incl. campaign-not-found) now log at error level with a
terminalflag, unifying the two consumers (event-process already did this; campaign-packer previously logged not-found atwarn).Out of scope (pre-existing, not introduced here)
validateSQLQueryuses substring keyword matching (e.g.created_atcontainsCREATE) — this change actually makes such false-positives drop terminally instead of looping.ack()failure after a successfulpack()requeues a processed message — identical to prior behavior; relies on idempotency (EVO-1204).Security
No auth/tenant surface touched. The SQLSTATE classifier reads only error codes; no user input reaches it.
validateSQLQuerySELECT-only / no-comments / no-DDL guards are unchanged (now raising a typed terminal error).Test plan
evo-flow: npm run typecheck— cleanevo-flow: npx eslint <changed files>— clean (no net-new lint insegment-query-builder; pre-existing baseline preserved)evo-flow: npm test -- src/shared/persistence src/shared/broker/consumer src/runners/campaign-packer src/runners/event-process— 34/34evo-flow: npm test(full) — 493 passed; the single failure (campaigns.controller.spec) is pre-existing ondevelop(confirmed via stash), unrelated to this change.Changed Files
src/shared/errors/terminal-error.ts(new)src/shared/persistence/deterministic-db-error.ts(+spec, new)src/shared/broker/consumer/process-with-ack-policy.ts(+spec, new)src/shared/audience/errors/audience.errors.ts(new)src/shared/audience/segment-query-builder.service.tssrc/runners/campaign-packer/services/campaign-packer.service.ts(+spec)src/runners/campaign-packer/consumers/campaigns-pack.consumer.tssrc/runners/campaign-packer/errors/campaign-not-found.error.tssrc/runners/event-process/services/event-process.service.tssrc/runners/event-process/services/events-received.consumer.tsLinked Issue
🤖 Generated with Claude Code