feat: Add incremental scan API with IncrementalAppendScan and Increme…#559

Open
WZhuo wants to merge 1 commit into apache:main from WZhuo:increment_scan

Conversation

Contributor

WZhuo commented Feb 10, 2026

No description provided.

WZhuo force-pushed the increment_scan branch 4 times, most recently from ef8f9b8 to 3214b82, on February 11, 2026 10:28
Member

wgtmac left a comment

Review Report: PR #559

(Generated by gemini-cli)

📄 Files: src/iceberg/table_scan.h & src/iceberg/table_scan.cc

Java Counterpart: TableScan.java / IncrementalScan.java

  • Parity Check & API Design:

    • CRITICAL API FLAW: The PR makes BaseIncrementalScanBuilder inherit from TableScanBuilder. However, TableScanBuilder's methods (like Filter(), Project(), Select()) return TableScanBuilder&.
    • This breaks the fluent builder pattern. If a user tries to chain methods:
      table->NewIncrementalAppendScan().value()
           ->Filter(expr)
           .FromSnapshot(1); // ERROR: TableScanBuilder has no FromSnapshot method
      It will fail to compile because Filter() returns TableScanBuilder&.
    • Furthermore, calling Build() after Filter() will invoke TableScanBuilder::Build(), which returns a DataTableScan instead of the expected IncrementalAppendScan.
    • Action: Refactor the builders to use the Curiously Recurring Template Pattern (CRTP) for a common base builder (e.g., template <typename Derived> class ScanBuilderBase), so that inherited methods return the correct derived type. Alternatively, override the methods in the derived builders to cast and return *this.
    • Good catch: Removing UseBranch from TableScanBuilder and moving it to BaseIncrementalScanBuilder matches the Java API, where useBranch exists only in IncrementalScan.
  • Style & Comments: ⚠️

    • Java uses fromSnapshotInclusive(long) and fromSnapshotExclusive(long). The PR uses FromSnapshot(int64_t, bool inclusive = false). While acceptable in C++, boolean arguments can cause "boolean blindness" at the call site (e.g., FromSnapshot(100, true)). Consider using an enum (e.g., enum class SnapshotBoundary { kInclusive, kExclusive }) or mirroring Java's explicit method names for better readability.
  • Logic Check:

    • In table_scan.cc, the Build() implementation is hardcoded to return the same error message regardless of the ScanType:
      template <typename ScanType>
      Result<std::unique_ptr<ScanType>> IncrementalScanBuilder<ScanType>::Build() {
        return NotImplemented("IncrementalAppendScanBuilder is not implemented");
      }
      Action: If ScanType is IncrementalChangelogScan, the error message is misleading. Update the error message to be generic (e.g., "Incremental scan builder is not implemented") or use typeid or template specialization to provide an accurate message.
  • Design & Conciseness:

    • The separation of DataTableScan and the new incremental scan classes is logically sound and structurally aligned with the Iceberg specification.
  • Test Quality:

    • No tests were added for the new incremental builders. While they currently return NotImplemented, there should be basic instantiation tests to ensure the builders can be created from Table::NewIncremental...().
    • Action: Add a test case that specifically chains a common method (like Filter) with an incremental method (like FromSnapshot) to ensure the fluent API compiles and works correctly once the CRTP/inheritance issue is fixed.
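
A minimal sketch of the CRTP approach suggested above, using simplified stand-in types rather than the actual iceberg-cpp classes. The names ScanBuilderBase, IncrementalAppendScanBuilder, and SnapshotBoundary, and the string-based Filter() argument, are assumptions made for illustration; the real builders take expressions and return Result types. The sketch also uses an enum in place of the bool parameter to illustrate the "boolean blindness" point.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical enum replacing the `bool inclusive` parameter.
enum class SnapshotBoundary { kInclusive, kExclusive };

// CRTP base: shared setters return Derived&, so chaining keeps the
// concrete builder type instead of decaying to a common base.
template <typename Derived>
class ScanBuilderBase {
 public:
  Derived& Filter(std::string expr) {
    filter_ = std::move(expr);
    return self();
  }

  const std::string& filter() const { return filter_; }

 protected:
  Derived& self() { return static_cast<Derived&>(*this); }

 private:
  std::string filter_;  // stand-in for a real expression object
};

// Incremental builder adds its own methods on top of the shared ones.
class IncrementalAppendScanBuilder
    : public ScanBuilderBase<IncrementalAppendScanBuilder> {
 public:
  IncrementalAppendScanBuilder& FromSnapshot(
      std::int64_t snapshot_id,
      SnapshotBoundary boundary = SnapshotBoundary::kExclusive) {
    from_snapshot_id_ = snapshot_id;
    from_inclusive_ = (boundary == SnapshotBoundary::kInclusive);
    return *this;
  }

  std::optional<std::int64_t> from_snapshot_id() const {
    return from_snapshot_id_;
  }
  bool from_inclusive() const { return from_inclusive_; }

 private:
  std::optional<std::int64_t> from_snapshot_id_;
  bool from_inclusive_ = false;
};

// Compile-time check: Filter() on the incremental builder returns the
// incremental builder, so FromSnapshot() can follow it in a chain.
static_assert(
    std::is_same_v<
        decltype(std::declval<IncrementalAppendScanBuilder&>().Filter("")),
        IncrementalAppendScanBuilder&>);
```

With this shape, a chain such as `builder.Filter("x > 1").FromSnapshot(100, SnapshotBoundary::kInclusive)` compiles, which is the kind of case the suggested test would cover.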

📄 Files: src/iceberg/table.h & src/iceberg/table.cc

Java Counterpart: Table.java

  • Parity Check: ✅ The new methods NewIncrementalAppendScan() and NewIncrementalChangelogScan() match Java's Table interface. Returning a builder instead of the scan itself aligns with the existing C++ paradigm (NewScan() returning TableScanBuilder).
  • Style & Comments: ✅ Clean and consistent with the rest of the file.
  • Logic Check:
  • Design & Conciseness:
  • Test Quality: ⚠️ No tests were added to verify that Table::NewIncrementalAppendScan() successfully returns a builder.

Summary & Recommendation

  • Request Changes
  • The PR has a critical API design flaw regarding the C++ Builder pattern inheritance. Because TableScanBuilder methods return TableScanBuilder&, method chaining in IncrementalScanBuilder is broken. The builder hierarchy needs to be refactored (likely using CRTP) to support fluent chaining correctly. Please also address the hardcoded error message in the template builder and add basic tests to catch compile-time API issues.
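
One possible way to make the error message scan-type-aware is a small name trait specialized per scan type. This is only a sketch: ScanTypeName and NotImplementedMessage are hypothetical helpers, the two scan structs are empty stand-ins, and the function returns just the message string rather than the Result that the PR's NotImplemented(...) call produces.

```cpp
#include <string>
#include <string_view>

// Empty stand-ins for the scan classes named in the PR.
struct IncrementalAppendScan {};
struct IncrementalChangelogScan {};

// Hypothetical trait mapping each scan type to a printable name.
template <typename ScanType>
struct ScanTypeName;

template <>
struct ScanTypeName<IncrementalAppendScan> {
  static constexpr std::string_view value = "IncrementalAppendScan";
};

template <>
struct ScanTypeName<IncrementalChangelogScan> {
  static constexpr std::string_view value = "IncrementalChangelogScan";
};

// Builds the message a templated Build() could pass to NotImplemented().
template <typename ScanType>
std::string NotImplementedMessage() {
  return std::string(ScanTypeName<ScanType>::value) +
         "Builder is not implemented";
}
```

A scan type without a specialization fails to compile here, which surfaces a missing name early instead of producing a misleading message at runtime.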


/// \brief Use the specified branch
/// \param branch the branch name
BaseIncrementalScanBuilder& UseBranch(const std::string& branch);
Contributor

Shouldn't UseBranch also be available for non-incremental scans?

Contributor Author

The branch is only used in IncrementalScan, to validate that to_snapshot_id is on the specified branch.

Member

Technically, a snapshot read in a regular table scan should support branches too, so I still think UseBranch should be kept there even though the Java API does not implement it.


/// \brief Use the specified branch
/// \param branch the branch name
TableScanBuilder& UseBranch(const std::string& branch);
Member

Why was this API moved away?

Contributor Author

The branch is only used in IncrementalScan, to validate that to_snapshot_id is on the specified branch.


/// \brief Plans the scan tasks by resolving manifests and data files.
/// \return A Result containing scan tasks or an error.
virtual Result<std::vector<std::shared_ptr<FileScanTask>>> PlanFiles() const = 0;
Member

Should we keep this in the base class? It is safe to let subclasses return a vector of ChangelogScanTask.

///
/// Forwards to TableScanBuilder and returns the derived type to preserve fluent chaining.
template <class Builder>
class ICEBERG_EXPORT ScanBuilderBase : public TableScanBuilder {
Member

This name looks odd: a class called "base" inherits from a concrete class.

template <class Builder>
class ICEBERG_EXPORT ScanBuilderBase : public TableScanBuilder {
public:
Builder& Option(std::string key, std::string value) {
Member

I know the intention of this class hierarchy is to reuse code. However, it duplicates every API of the builder, so it does not seem worth the effort.

Contributor Author

Any good suggestions?

Member

How about a single template scan builder class for all scan types, with std::enable_if enabling the incremental-only functions?
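
A rough sketch of what that could look like, under simplified assumptions: ScanBuilder, IsIncremental, and HasFromSnapshot are hypothetical names, the scan structs are empty stand-ins, and Filter() takes a string instead of an expression. The incremental-only method is a member template gated by std::enable_if_t, so it drops out of overload resolution for non-incremental scan types.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <type_traits>
#include <utility>

// Empty stand-ins for the scan classes.
struct DataTableScan {};
struct IncrementalAppendScan {};

// Hypothetical trait marking which scan types are incremental.
template <typename T>
struct IsIncremental : std::false_type {};
template <>
struct IsIncremental<IncrementalAppendScan> : std::true_type {};

// One builder template for every scan type.
template <typename ScanType>
class ScanBuilder {
 public:
  // Shared method: always available, returns the concrete builder type.
  ScanBuilder& Filter(std::string expr) {
    filter_ = std::move(expr);
    return *this;
  }

  // Incremental-only method: the defaulted T makes the condition
  // dependent, so SFINAE removes it for non-incremental scan types.
  template <typename T = ScanType,
            typename = std::enable_if_t<IsIncremental<T>::value>>
  ScanBuilder& FromSnapshot(std::int64_t snapshot_id) {
    from_snapshot_id_ = snapshot_id;
    return *this;
  }

  std::optional<std::int64_t> from_snapshot_id() const {
    return from_snapshot_id_;
  }

 private:
  std::string filter_;
  std::optional<std::int64_t> from_snapshot_id_;
};

// Detection idiom: true if B has a callable FromSnapshot(int64_t).
template <typename B, typename = void>
struct HasFromSnapshot : std::false_type {};
template <typename B>
struct HasFromSnapshot<
    B, std::void_t<decltype(std::declval<B&>().FromSnapshot(std::int64_t{}))>>
    : std::true_type {};

static_assert(HasFromSnapshot<ScanBuilder<IncrementalAppendScan>>::value);
static_assert(!HasFromSnapshot<ScanBuilder<DataTableScan>>::value);
```

Since every method returns `ScanBuilder<ScanType>&`, fluent chaining works without duplicating the shared setters in each builder. With C++20 available, a `requires` clause on the member could express the same constraint more directly.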

3 participants