feat: retry failed transaction commit #576

Open
linguoxuan wants to merge 1 commit into apache:main from linguoxuan:main

@linguoxuan linguoxuan commented Feb 26, 2026

This commit implements retry for transaction commits. It introduces a generic RetryRunner utility with exponential backoff and error-kind filtering, and integrates it into Transaction::Commit() so that a commit conflict automatically triggers a table metadata refresh followed by a retry.
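
As a rough illustration of the idea (not the PR's actual code), a retry loop with error-kind filtering might look like the sketch below. The names `RunWithRetry`, `RetryConfig`, and the three-value `ErrorKind` enum are hypothetical stand-ins, the `on_retry` callback stands in for "refresh table metadata", and the total-timeout limit and jitter are left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <set>
#include <thread>

// Hypothetical error kinds; the real ErrorKind enum in the PR has more values.
enum class ErrorKind { kOk, kCommitFailed, kValidationFailed };

// Hypothetical config; defaults follow the values quoted in the review
// (4 retries, 100 ms minimum wait, 60 s maximum wait).
struct RetryConfig {
  int num_retries = 4;                        // retries after the first attempt
  std::chrono::milliseconds min_wait{100};    // delay before the first retry
  std::chrono::milliseconds max_wait{60000};  // cap on any single delay
};

// Runs `task` until it succeeds, fails with an error kind that is not in
// `retry_on`, or uses up the retry budget. `on_retry` runs before each retry.
inline ErrorKind RunWithRetry(const RetryConfig& config,
                              const std::set<ErrorKind>& retry_on,
                              const std::function<ErrorKind()>& task,
                              const std::function<void()>& on_retry) {
  assert(config.num_retries >= 0);
  std::chrono::milliseconds wait = config.min_wait;
  for (int attempt = 0;; ++attempt) {
    ErrorKind result = task();
    if (result == ErrorKind::kOk) return result;
    if (retry_on.count(result) == 0) return result;    // not retryable
    if (attempt >= config.num_retries) return result;  // budget exhausted
    std::this_thread::sleep_for(wait);
    wait = std::min(wait * 2, config.max_wait);        // exponential backoff
    on_retry();
  }
}
```

Under these assumptions, a caller would pass `{ErrorKind::kCommitFailed}` as `retry_on`, so that a validation failure surfaces immediately instead of being retried.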

@linguoxuan linguoxuan force-pushed the main branch 2 times, most recently from 82ada96 to ff6c292 Compare February 26, 2026 11:28
Member

@wgtmac wgtmac left a comment

Review Report: PR #576

(Generated by gemini-cli)

📄 File: src/iceberg/table.cc

Java Counterpart: Table.java / BaseTable.java

  • Parity Check: ✅ Correctly matches the Java behavior where the cached metadata location is updated upon Refresh().
  • Style & Comments: ✅ Clean and straightforward.
  • Logic Check: ✅ Fixes a bug where metadata_location_ could be stale.
  • Design & Conciseness: ✅ Good.
  • Test Quality: ✅ Covered by general refresh logic in existing tests.

📄 File: src/iceberg/transaction.cc & src/iceberg/transaction.h

Java Counterpart: BaseTransaction.java

  • Parity Check: ✅ The retry execution cleanly mimics BaseTransaction.commitSimpleTransaction() behavior utilizing RetryRunner. The use of applying_updates_ correctly prevents recursive commits for single-operation transactions.
  • Style & Comments: ✅ Well documented.
  • Logic Check: ✅ Mapping ErrorKind::kCommitFailed inside CommitOnce() during a re-apply to ValidationFailed is a clever and correct way to abort the retry loop when a logical conflict occurs with the newly refreshed state.
  • Design & Conciseness: ✅ Elegant restructuring of the commit phase.
  • Test Quality: ✅ Good.

📄 File: src/iceberg/update/snapshot_update.cc

Java Counterpart: SnapshotProducer.java

  • Parity Check: ✅ Clearing manifest_lists_ and cleaning uncommitted files before reapplying correctly mirrors the cleanup loop in Java's SnapshotProducer.apply(). Snapshot ID generation also properly handles collisions, achieving parity.
  • Style & Comments: ✅ Good.
  • Logic Check: ✅ State resets cleanly before each retry attempt.
  • Design & Conciseness: ✅ Well integrated into the existing Apply() flow.
  • Test Quality: ✅ Assumed covered by integration tests simulating conflicts.

📄 File: src/iceberg/update/update_snapshot_reference.cc

Java Counterpart: UpdateSnapshotReferencesOperation.java

  • Parity Check: ✅ Matches and arguably improves upon Java's internalApply().
  • Style & Comments: ✅ Good.
  • Logic Check: ✅ The introduction of initial_refs_ to compute the user's intended delta (additions/removals) and applying that delta to the refreshed current_refs is a very robust way to handle transaction rebasing. It properly avoids inadvertently dropping concurrent reference changes during a retry.
  • Design & Conciseness: ✅ Concise and logical.
  • Test Quality: ✅ Good.
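
To make the delta-rebasing idea concrete, here is a minimal sketch under simplifying assumptions: a ref is reduced to a name mapped to a snapshot ID, and `Refs`, `RefDelta`, `ComputeDelta`, and `ApplyDelta` are hypothetical names, not identifiers from the PR. It diffs the user's edited refs against the refs captured at the start, then replays only that difference onto the refreshed refs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Simplified stand-in: ref name -> snapshot id. A real snapshot ref also
// carries a type and retention settings.
using Refs = std::map<std::string, std::int64_t>;

struct RefDelta {
  Refs upserts;                    // refs the user added or changed
  std::set<std::string> removals;  // refs the user removed
};

// Diffs the user's edited refs against the refs seen when the update began.
inline RefDelta ComputeDelta(const Refs& initial, const Refs& edited) {
  RefDelta delta;
  for (const auto& kv : edited) {
    auto it = initial.find(kv.first);
    if (it == initial.end() || it->second != kv.second) {
      delta.upserts[kv.first] = kv.second;
    }
  }
  for (const auto& kv : initial) {
    if (edited.count(kv.first) == 0) delta.removals.insert(kv.first);
  }
  return delta;
}

// Replays the delta on freshly refreshed refs. Refs that the user never
// touched keep whatever value a concurrent writer gave them.
inline Refs ApplyDelta(Refs current, const RefDelta& delta) {
  for (const auto& name : delta.removals) current.erase(name);
  for (const auto& kv : delta.upserts) {
    assert(delta.removals.count(kv.first) == 0);  // a name is never both
    current[kv.first] = kv.second;
  }
  return current;
}
```

In this sketch, overwriting `current_refs` with the user's full edited map would discard a concurrent writer's new branch; replaying only the delta does not.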

📄 File: src/iceberg/util/retry_util.h & src/iceberg/test/retry_util_test.cc

Java Counterpart: Tasks.java (specifically the retry builder)

  • Parity Check: ✅ Config defaults (e.g., 4 retries, 100ms min wait, 60s max wait, 30m total timeout) directly match Iceberg Java's COMMIT_NUM_RETRIES_DEFAULT, etc.
  • Style & Comments: ⚠️ Minor Style Point: In RetryRunner::CalculateDelay, std::random_device rd; and std::mt19937 gen(rd()); are instantiated on every retry attempt. The overhead is mostly negligible next to the surrounding sleep, but constructing std::random_device may open an OS entropy source each time, and it can throw in constrained environments where no such source is available. A common C++ practice is to seed a thread_local generator once per thread rather than re-initializing the PRNG on each call. For example:
    static thread_local std::mt19937 gen(std::random_device{}());
  • Logic Check: ✅ Exponential backoff and jitter math is sound. Timeout and max attempts logic is flawless.
  • Design & Conciseness: ✅ Excellent standalone utility. Does not reinvent the wheel as STL/Arrow lack an equivalent drop-in.
  • Test Quality: ✅ retry_util_test.cc is exhaustive and tests all major edge cases (exhaustion, filters, zero retries).
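
The PRNG suggestion above could be combined with the backoff calculation roughly as follows. This is a sketch, not the PR's RetryRunner::CalculateDelay: the free function `CalculateDelay`, the helper `ThreadLocalGenerator`, the fixed scale factor of 2, and the jitter of up to 10% of the base delay are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <random>

// Seeded once per thread on first use, then reused by every later call.
inline std::mt19937& ThreadLocalGenerator() {
  static thread_local std::mt19937 gen(std::random_device{}());
  return gen;
}

// Hypothetical helper. `attempt` is 1-based: attempt 1 waits about min_wait,
// attempt 2 about twice that, and so on, with the base capped at max_wait.
// Jitter is added after the cap, so the result can exceed max_wait slightly.
inline std::chrono::milliseconds CalculateDelay(
    int attempt, std::chrono::milliseconds min_wait,
    std::chrono::milliseconds max_wait) {
  assert(attempt >= 1);
  assert(min_wait.count() >= 0 && max_wait >= min_wait);
  const double uncapped =
      static_cast<double>(min_wait.count()) * std::pow(2.0, attempt - 1);
  const auto base_ms = static_cast<std::int64_t>(
      std::min(uncapped, static_cast<double>(max_wait.count())));
  // Assumed jitter: a uniform value in [0, 10% of base), at least [0, 1).
  const std::int64_t jitter_bound = std::max<std::int64_t>(1, base_ms / 10);
  std::uniform_int_distribution<std::int64_t> dist(0, jitter_bound - 1);
  return std::chrono::milliseconds(base_ms + dist(ThreadLocalGenerator()));
}
```

With the defaults quoted above (100 ms minimum, 60 s maximum), the third attempt in this sketch would wait between 400 ms and just under 440 ms.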

Summary & Recommendation

  • Comment
    Overall, a beautifully constructed PR. The implementation of transaction retries and rebasing (especially calculating the intent delta in UpdateSnapshotReference) is very robust and accurately achieves parity with the Java implementation.

I have left a minor suggestion regarding the PRNG initialization in RetryRunner::CalculateDelay to adhere to modern C++ best practices, but otherwise, the code is solid.

@linguoxuan linguoxuan force-pushed the main branch 3 times, most recently from ec7234f to c624990 Compare February 28, 2026 09:44