feat: Add incremental scan API with IncrementalAppendScan and Increme…#559
feat: Add incremental scan API with IncrementalAppendScan and Increme…#559WZhuo wants to merge 1 commit intoapache:mainfrom
Conversation
ef8f9b8 to
3214b82
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Review Report: PR #559
(Generated by gemini-cli)
📄 Files: src/iceberg/table_scan.h & src/iceberg/table_scan.cc
Java Counterpart: TableScan.java / IncrementalScan.java
-
Parity Check & API Design: ❌
- CRITICAL API FLAW: The PR makes
BaseIncrementalScanBuilderinherit fromTableScanBuilder. However,TableScanBuilder's methods (likeFilter(),Project(),Select()) returnTableScanBuilder&. - This breaks the fluent builder pattern. If a user tries to chain methods:
It will fail to compile because
table->NewIncrementalAppendScan().value() ->Filter(expr) .FromSnapshot(1); // ERROR: TableScanBuilder has no FromSnapshot method
Filter()returnsTableScanBuilder&. - Furthermore, calling
Build()afterFilter()will invokeTableScanBuilder::Build(), which returns aDataTableScaninstead of the expectedIncrementalAppendScan. - Action: Refactor the builders to use the Curiously Recurring Template Pattern (CRTP) for a common base builder (e.g.,
template <typename Derived> class ScanBuilderBase), so that inherited methods return the correct derived type. Alternatively, override the methods in the derived builders to cast and return*this. - Good catch: Removing
UseBranchfromTableScanBuilderand moving it toBaseIncrementalScanBuilderperfectly matches Java's parity, whereuseBranchonly exists inIncrementalScan.
- CRITICAL API FLAW: The PR makes
-
Style & Comments:
⚠️ - Java uses
fromSnapshotInclusive(long)andfromSnapshotExclusive(long). The PR usesFromSnapshot(int64_t, bool inclusive = false). While acceptable in C++, boolean arguments can cause "boolean blindness" at the call site (e.g.,FromSnapshot(100, true)). Consider using an enum (e.g.,enum class SnapshotBoundary { kInclusive, kExclusive }) or mirroring Java's explicit method names for better readability.
- Java uses
-
Logic Check: ❌
- In
table_scan.cc, theBuild()implementation is hardcoded to return the same error message regardless of theScanType:Action: Iftemplate <typename ScanType> Result<std::unique_ptr<ScanType>> IncrementalScanBuilder<ScanType>::Build() { return NotImplemented("IncrementalAppendScanBuilder is not implemented"); }
ScanTypeisIncrementalChangelogScan, the error message is misleading. Update the error message to be generic (e.g.,"Incremental scan builder is not implemented") or usetypeidor template specialization to provide an accurate message.
- In
-
Design & Conciseness: ✅
- The separation of
DataTableScanand the new incremental scan classes is logically sound and structurally aligned with the Iceberg specification.
- The separation of
-
Test Quality: ❌
- No tests were added for the new incremental builders. While they currently return
NotImplemented, there should be basic instantiation tests to ensure the builders can be created fromTable::NewIncremental...(). - Action: Add a test case that specifically chains a common method (like
Filter) with an incremental method (likeFromSnapshot) to ensure the fluent API compiles and works correctly once the CRTP/inheritance issue is fixed.
- No tests were added for the new incremental builders. While they currently return
📄 Files: src/iceberg/table.h & src/iceberg/table.cc
Java Counterpart: Table.java
- Parity Check: ✅ The new methods
NewIncrementalAppendScan()andNewIncrementalChangelogScan()match Java'sTableinterface. Returning a builder instead of the scan itself aligns with the existing C++ paradigm (NewScan()returningTableScanBuilder). - Style & Comments: ✅ Clean and consistent with the rest of the file.
- Logic Check: ✅
- Design & Conciseness: ✅
- Test Quality:
⚠️ No tests were added to verify thatTable::NewIncrementalAppendScan()successfully returns a builder.
Summary & Recommendation
- Request Changes
- The PR has a critical API design flaw regarding the C++ Builder pattern inheritance. Because
TableScanBuildermethods returnTableScanBuilder&, method chaining inIncrementalScanBuilderis broken. The builder hierarchy needs to be refactored (likely using CRTP) to support fluent chaining correctly. Please also address the hardcoded error message in the template builder and add basic tests to catch compile-time API issues.
src/iceberg/table_scan.h
Outdated
|
|
||
| /// \brief Use the specified branch | ||
| /// \param branch the branch name | ||
| BaseIncrementalScanBuilder& UseBranch(const std::string& branch); |
There was a problem hiding this comment.
UseBranch should also be used in non-incremental scan scenarios ?
There was a problem hiding this comment.
branch is only used in the IncrementalScan, for validating the to_snapshot_id is on the specified branch.
There was a problem hiding this comment.
Technically snapshot read of a table scan should support branch. So I still think it should be kept even if the Java api does not implement this.
3214b82 to
5e3c1d0
Compare
…ntalChangelogScan
5e3c1d0 to
9d73001
Compare
|
|
||
| /// \brief Use the specified branch | ||
| /// \param branch the branch name | ||
| TableScanBuilder& UseBranch(const std::string& branch); |
There was a problem hiding this comment.
branch is only used in the IncrementalScan, for validating the to_snapshot_id is on the specified branch.
|
|
||
| /// \brief Plans the scan tasks by resolving manifests and data files. | ||
| /// \return A Result containing scan tasks or an error. | ||
| virtual Result<std::vector<std::shared_ptr<FileScanTask>>> PlanFiles() const = 0; |
There was a problem hiding this comment.
Should we keep this in the base class? It is safe to let subclass to return vector of ChangelogScanTask.
| /// | ||
| /// Forwards to TableScanBuilder and returns the derived type to preserve fluent chaining. | ||
| template <class Builder> | ||
| class ICEBERG_EXPORT ScanBuilderBase : public TableScanBuilder { |
There was a problem hiding this comment.
This name looks weird that base is inheriting from a concrete class.
| template <class Builder> | ||
| class ICEBERG_EXPORT ScanBuilderBase : public TableScanBuilder { | ||
| public: | ||
| Builder& Option(std::string key, std::string value) { |
There was a problem hiding this comment.
I know that the intention of this class hierarchy design is to reuse code. However, this duplicates all apis of the builder so it does not seem worth the effort.
There was a problem hiding this comment.
Any good suggestions?
There was a problem hiding this comment.
Use a template scan builder class for all and then use std::enable_if to enable functions of incremental scan builder?
No description provided.