[NO REVIEW YET]fix: isolate per-record ID-generation failures in Kafka sink transformer (#49268) #49269

Open
xinlian12 wants to merge 5 commits into main from fix/issue-49268-single-record-dlq

Conversation

@xinlian12
Member

@xinlian12 xinlian12 commented May 26, 2026

Fix: Single record failing ID parsing should not fail entire batch (#49268)

Problem

When a single SinkRecord in a batch fails during ID generation (e.g., due to an invalid JsonPath in ProvidedInStrategy.generateId()), the entire batch fails and all records are routed to the DLQ — not just the malformed record.

Root cause: SinkRecordTransformer.transform() lacked per-record error isolation. An exception from idStrategy.generateId() would abort the entire transform() call before records reached the writer-level DLQ handling in CosmosWriterBase.sendToDlqIfConfigured().

Solution

Added per-record try-catch in SinkRecordTransformer.transform() that is consistent with the writer-level pattern in CosmosWriterBase:

  1. DLQ report is a fire-and-forget side effect: always report to ErrantRecordReporter if available (guarded against reporter failures)
  2. Tolerance level controls flow: ALL skips and continues, NONE throws (regardless of whether a DLQ reporter is present)

| Scenario | ErrantRecordReporter | ToleranceOnErrorLevel | Behavior |
|---|---|---|---|
| DLQ configured | Available | ALL | Report to DLQ, skip bad record, continue |
| DLQ configured | Available | NONE | Report to DLQ, then throw (fail-fast) |
| DLQ not configured | null | ALL | Log warning, skip bad record, continue |
| DLQ not configured | null | NONE | Throw (fail-fast, preserves existing behavior) |
| DLQ reporter fails | Available (throws) | ALL | Log DLQ error, skip bad record, continue |
| DLQ reporter fails | Available (throws) | NONE | Log DLQ error, throw original exception |
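
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual SinkRecordTransformer code: `Tolerance` and `Reporter` are hypothetical stand-ins for ToleranceOnErrorLevel and ErrantRecordReporter, records are plain strings, and the ID generator is a simple function that throws for records starting with "bad".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PerRecordIsolationSketch {
    // Hypothetical stand-in for the connector's ToleranceOnErrorLevel.
    enum Tolerance { ALL, NONE }

    // Hypothetical stand-in for Kafka Connect's ErrantRecordReporter.
    interface Reporter {
        void report(String record, Throwable error);
    }

    static List<String> transform(List<String> records,
                                  Function<String, String> idGenerator,
                                  Reporter reporter, // null when no DLQ is configured
                                  Tolerance tolerance) {
        List<String> transformed = new ArrayList<>();
        for (String record : records) {
            try {
                transformed.add(idGenerator.apply(record));
            } catch (RuntimeException e) {
                if (reporter != null) {
                    try {
                        // Side effect only: a reporter failure must not decide the flow.
                        reporter.report(record, e);
                    } catch (RuntimeException reportException) {
                        System.err.println("DLQ report failed for " + record
                            + ": " + reportException.getMessage());
                    }
                } else {
                    System.err.println("No DLQ reporter for failed record " + record);
                }
                // Tolerance alone controls continue-vs-throw.
                if (tolerance == Tolerance.NONE) {
                    throw e;
                }
            }
        }
        return transformed;
    }

    public static void main(String[] args) {
        Function<String, String> idGenerator = r -> {
            if (r.startsWith("bad")) {
                throw new IllegalArgumentException("cannot parse id: " + r);
            }
            return "id-" + r;
        };
        List<String> dlq = new ArrayList<>();
        List<String> out = transform(List.of("a", "bad-1", "b"), idGenerator,
            (rec, err) -> dlq.add(rec), Tolerance.ALL);
        System.out.println(out + " dlq=" + dlq); // prints [id-a, id-b] dlq=[bad-1]
    }
}
```

With `Tolerance.NONE` the same call would still add "bad-1" to the DLQ list and then rethrow the IllegalArgumentException, matching the second row of the table.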

Changes

  • SinkRecordTransformer.java
    • Added ErrantRecordReporter and ToleranceOnErrorLevel fields
    • Wrapped per-record processing in try-catch with DLQ/tolerance handling
    • Guarded ErrantRecordReporter.report() against secondary failures
    • Added package-private constructor for testability
    • Made createIdStrategy() static for constructor chain
  • CosmosSinkTask.java
    • Pass errantRecordReporter and toleranceOnErrorLevel to SinkRecordTransformer
    • Guard context.errantRecordReporter() against older Kafka Connect runtimes
    • Fix bookkeeping to count transformedRecords.size() (post-filter) instead of entry.getValue().size() (pre-filter)
  • SinkRecordTransformerTest.java — 8 new unit tests:
    • T1: Mixed batch with reporter + tolerance ALL → bad→DLQ, good records survive
    • T2: Tolerance ALL without reporter → bad record skipped
    • T3: Tolerance NONE without reporter → exception thrown (fail-fast)
    • T4: All valid records → all in output, reporter never called (regression)
    • T5: All bad with reporter + tolerance ALL → all→DLQ, empty output
    • T6: Reporter throws + tolerance NONE → original exception rethrown
    • T7: Reporter throws + tolerance ALL → record skipped, continues
    • T8: Tolerance NONE with reporter → DLQ report AND exception thrown
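
The bookkeeping fix listed under CosmosSinkTask.java above can be illustrated with a small sketch. It assumes a hypothetical `transform` that simply drops records starting with "bad", and a map of container name to records; the real task code is not shown in this PR description, so the names here are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class WrittenRecordCountSketch {
    // Hypothetical filter standing in for SinkRecordTransformer.transform():
    // records whose ID cannot be generated are dropped from the output.
    static List<String> transform(List<String> records) {
        return records.stream()
            .filter(r -> !r.startsWith("bad"))
            .collect(Collectors.toList());
    }

    static int countWritten(Map<String, List<String>> recordsByContainer) {
        int written = 0;
        for (Map.Entry<String, List<String>> entry : recordsByContainer.entrySet()) {
            List<String> transformedRecords = transform(entry.getValue());
            // Count post-filter size; entry.getValue().size() would also
            // count the records that were skipped or sent to the DLQ.
            written += transformedRecords.size();
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, List<String>> batch = Map.of("c1", List.of("a", "bad-1", "b"));
        System.out.println(countWritten(batch)); // prints 2
    }
}
```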

Fixes #49268

…mer (#49268)

When a single record's ID strategy fails (e.g., ProvidedInStrategy JsonPath
parse error), only that record should be routed to DLQ — not the entire batch.

Previously, SinkRecordTransformer.transform() had no per-record error handling,
so one malformed record would abort transformation of all records in the batch.

Changes:
- SinkRecordTransformer: Add per-record try-catch in transform(). Accept
  ErrantRecordReporter and ToleranceOnErrorLevel. Report failing records to DLQ
  when available, skip when tolerance is ALL, rethrow when tolerance is NONE.
- CosmosSinkTask: Pass reporter and tolerance to SinkRecordTransformer. Fix
  written-record bookkeeping to count only successfully transformed records.

Fixes #49268

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xinlian12 and others added 2 commits May 26, 2026 14:34
Covers: DLQ reporting, tolerance ALL skip, tolerance NONE rethrow,
all-valid regression, all-bad with reporter, reporter precedence over tolerance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wrap errantRecordReporter.report() in its own try/catch to prevent DLQ
reporter failures from collapsing the entire batch. When the reporter
throws:
- ToleranceOnErrorLevel.ALL: log and continue (skip the bad record)
- ToleranceOnErrorLevel.NONE: rethrow the original transform exception

Add T6/T7 tests covering both scenarios. Renumber T6→T8.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 marked this pull request as ready for review May 26, 2026 21:47
Copilot AI review requested due to automatic review settings May 26, 2026 21:47
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners May 26, 2026 21:47
Address review findings:
- Align DLQ/tolerance precedence with writer pattern: DLQ report is
  fire-and-forget side-effect, tolerance level controls continue-vs-throw.
  With tolerance=NONE + reporter, record is reported AND task fails.
- Guard context.errantRecordReporter() against older Kafka Connect runtimes
  that lack the API (catch NoClassDefFoundError/NoSuchMethodError).
- Add package-private constructor for testability (eliminates reflection).
- Consolidate double-logging: one log entry per failed record.
- Rewrite tests to use package-private constructor and align with new
  semantics. T8 now tests tolerance=NONE+reporter → DLQ+throw.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
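
The runtime guard mentioned in the commit above might look roughly like the following sketch. It uses a generic `Supplier` as a stand-in for the `context.errantRecordReporter()` call so the example runs without Kafka Connect on the classpath; `lookupOrNull` is a hypothetical helper name, not something from the PR.

```java
import java.util.function.Supplier;

class ReporterLookupSketch {
    // Returns the looked-up value, or null when the runtime lacks the API.
    // On an older runtime the call site fails with a linkage error rather
    // than a regular exception, so those are the types caught here.
    static <T> T lookupOrNull(Supplier<T> lookup) {
        try {
            return lookup.get();
        } catch (NoSuchMethodError | NoClassDefFoundError e) {
            System.err.println("Errant record reporter unavailable: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        String present = ReporterLookupSketch.<String>lookupOrNull(() -> "reporter");
        String missing = ReporterLookupSketch.<String>lookupOrNull(() -> {
            // Simulates a runtime where the method does not exist.
            throw new NoSuchMethodError("errantRecordReporter");
        });
        System.out.println(present + " " + missing); // prints reporter null
    }
}
```

A null result then flows into the transformer the same way as the "DLQ not configured" rows in the PR description.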
Contributor

Copilot AI left a comment


Pull request overview

This PR improves the Cosmos DB Kafka sink’s robustness by isolating per-record failures during transformation (notably ID generation) so that a single malformed SinkRecord doesn’t fail the entire batch and unnecessarily route all records to the DLQ.

Changes:

  • Added per-record error handling in SinkRecordTransformer.transform() with optional DLQ reporting and tolerance-based behavior.
  • Updated CosmosSinkTask to pass ErrantRecordReporter + ToleranceOnErrorLevel into the transformer and to count only successfully transformed records.
  • Added a dedicated unit test suite validating mixed/invalid batches, DLQ reporter behavior, and tolerance modes.

| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos-kafka-connect/src/main/java/com/azure/cosmos/kafka/connect/implementation/sink/SinkRecordTransformer.java | Adds per-record try/catch around transformation and routes failures to DLQ or tolerance logic. |
| sdk/cosmos/azure-cosmos-kafka-connect/src/main/java/com/azure/cosmos/kafka/connect/implementation/sink/CosmosSinkTask.java | Wires DLQ reporter/tolerance into transformer and fixes per-container written-record counting post-filtering. |
| sdk/cosmos/azure-cosmos-kafka-connect/src/test/java/com/azure/cosmos/kafka/connect/implementation/sink/SinkRecordTransformerTest.java | Introduces unit tests covering tolerant vs fail-fast behavior and DLQ reporter success/failure scenarios. |
| .gitignore | Ignores .coding-harness/ directory. |

Copilot's findings

  • Files reviewed: 3/4 changed files
  • Comments generated: 2

Comment on lines +96 to +100
record.topic(),
record.kafkaPartition(),
record.kafkaOffset(),
containerName,
reportException);
Member Author


Fixed in 4ac387b — changed the outer catch to catch(RuntimeException e) since all exceptions from ID strategies (ConnectException, etc.) are RuntimeException subclasses.

record.kafkaOffset(),
containerName,
e);
throw e;
Member Author


Fixed in 4ac387b — same fix as above, outer catch changed to catch(RuntimeException e).

@xinlian12 xinlian12 changed the title from "fix: isolate per-record ID-generation failures in Kafka sink transformer (#49268)" to "[NO REVIEW YET]fix: isolate per-record ID-generation failures in Kafka sink transformer (#49268)" on May 26, 2026
Change catch clause from Exception (checked) to RuntimeException
(unchecked) since transform() doesn't declare throws Exception.
ConnectException and all other exceptions from ID strategies are
RuntimeException subclasses. This fixes the CI build failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ErrantRecordReporter errantRecordReporter = null;
try {
errantRecordReporter = this.context.errantRecordReporter();
Member Author

@xinlian12 xinlian12 May 26, 2026


Do we really need this check? Per the public API, this.context.errantRecordReporter() returns an ErrantRecordReporter, so I think it should be safe to pass it straight into SinkRecordTransformer, similar to the writer pattern.


Development

Successfully merging this pull request may close these issues.

[BUG][Kafka Connector]Single record which can not parse id successfully would fail the whole batch and cause all batches be routed to DLQ