Skip to content

Implement Builder for Scans on TableChanges #521

Merged
OussamaSaoudi-db merged 14 commits into
delta-io:mainfrom
OussamaSaoudi-db:scan_builder
Nov 23, 2024
Merged

Implement Builder for Scans on TableChanges #521
OussamaSaoudi-db merged 14 commits into
delta-io:mainfrom
OussamaSaoudi-db:scan_builder

Conversation

@OussamaSaoudi-db

@OussamaSaoudi-db OussamaSaoudi-db commented Nov 21, 2024

Copy link
Copy Markdown
Collaborator

What changes are proposed in this pull request?

This PR introduces the TableChangesScanBuilder which constructs a TableChangesScan given a TableChanges, and optionally, a predicate and schema.

This introduces the following structs:

  • TableChangesScan
  • TableChangesScanBuilder

I also introduce methods to TableChanges to get a builder:

  • into_scan_builder
  • scan_builder

How was this change tested?

I ensure that schema projection works for CDF's generated columns, predicates are correctly processed, and that the ColumnTypes are correctly created for the TableChangesScan

@codecov

codecov Bot commented Nov 21, 2024

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 93.22034% with 8 lines in your changes missing coverage. Please review.

Project coverage is 80.58%. Comparing base (4a0fad2) to head (53732c4).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/table_changes/scan.rs 95.53% 2 Missing and 3 partials ⚠️
kernel/src/table_changes/mod.rs 50.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #521      +/-   ##
==========================================
+ Coverage   80.43%   80.58%   +0.15%     
==========================================
  Files          62       63       +1     
  Lines       13645    13762     +117     
  Branches    13645    13762     +117     
==========================================
+ Hits        10975    11090     +115     
+ Misses       2114     2113       -1     
- Partials      556      559       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@github-actions github-actions Bot added the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Nov 21, 2024

@zachschuermann zachschuermann left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick pass with some early comments

Comment thread kernel/src/table_changes/mod.rs
Comment thread kernel/src/scan/mod.rs Outdated
Comment thread kernel/src/table_changes/scan.rs
Comment thread kernel/src/table_changes/scan.rs
Comment thread kernel/src/table_changes/scan.rs
Comment thread kernel/src/table_changes/scan.rs Outdated
Comment thread kernel/src/table_changes/scan.rs Outdated
Comment thread kernel/src/table_changes/scan.rs Outdated
Comment thread kernel/src/actions/mod.rs
Option::<Protocol>::get_struct_field(PROTOCOL_NAME),
Option::<SetTransaction>::get_struct_field(SET_TRANSACTION_NAME),
Option::<CommitInfo>::get_struct_field(COMMIT_INFO_NAME),
Option::<Cdc>::get_struct_field(CDC_NAME),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR need to rebase? I could have sworn Cdc infra already merged?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, needs a rebase

@OussamaSaoudi-db OussamaSaoudi-db changed the title Implement Builder for Scans on TableChangs Implement Builder for Scans on TableChanges Nov 21, 2024
@OussamaSaoudi-db OussamaSaoudi-db force-pushed the scan_builder branch 2 times, most recently from 9ac7550 to 2e1fd66 Compare November 21, 2024 23:19

@scovich scovich left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, few nits

Comment thread kernel/src/table_changes/scan.rs Outdated
Comment on lines +54 to +58
match schema_opt {
Some(schema) => self.with_schema(schema),
None => self,
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this also work?

Suggested change
match schema_opt {
Some(schema) => self.with_schema(schema),
None => self,
}
schema_opt.map_or(self, |schema| self.with_schema(schema))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it doesn't because it sees self is moving ownership to map_or's default and also to the closure. Not allowed 😔

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map_or_else might delay the move enough for it to work?

schema_opt.map_or_else(|| self, |schema| self.with_schema(schema))

Comment thread kernel/src/table_changes/scan.rs Outdated
Comment thread kernel/src/table_changes/scan.rs Outdated
// Add to read schema, store field so we can build a `Column` expression later
// if needed (i.e. if we have partition columns)
let physical_field =
logical_field.make_physical(*self.table_changes.column_mapping_mode())?;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this changes a lot to support column mapping and/or nested columns: #512

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm how do you think we should handle this?

@scovich scovich Nov 22, 2024

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an area where CDF is not different from a normal scan. We just need to leverage the same approach the other PR introduces. IMO we should NOT solve it here -- just let the other PR pick it up.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet, ty!

Comment on lines +156 to +188
ColumnType::Selected("_change_type".to_string()),
ColumnType::Selected("_commit_version".to_string()),
ColumnType::Selected("_commit_timestamp".to_string()),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these generated columns are not file-constant values, so we have to treat them like normal columns even tho they don't come from the parquet?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I considered introducing a ColumnType::Generated. I feel like this is the "right" way, but may take some more discussion and changes to kernel/scan/mod.rs. If you think ColumnType::Generated is not controversial, we can go with that.

The current plan is to treat them as selected, and later check when transforming physical to logical:

match (column_type) {
    ColumnType::Selected(col) => {
       if CDF_FIELDS.contains(col) {
           // Treat as CDF generated column
       } else {
           // Usual path for ColumnType::Selected
       }
   }
   ...
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan @zachschuermann If you feel strongly that we should do Generated or keep Selected, let me know!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it as-is for now. Once we add support for row tracking there will be more generated columns to worry about and we can revisit with more context available.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I think the expression stuff will change this as well. Rather than a "selected" field, we'll just generate an expression to say, add a column with this value.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure SGTM :)

Comment thread kernel/src/table_changes/table_changes_scan.rs Outdated
Comment on lines +156 to +188
ColumnType::Selected("_change_type".to_string()),
ColumnType::Selected("_commit_version".to_string()),
ColumnType::Selected("_commit_timestamp".to_string()),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I think the expression stuff will change this as well. Rather than a "selected" field, we'll just generate an expression to say, add a column with this value.

@zachschuermann zachschuermann left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a couple nits!

Comment on lines +156 to +188
ColumnType::Selected("_change_type".to_string()),
ColumnType::Selected("_commit_version".to_string()),
ColumnType::Selected("_commit_timestamp".to_string()),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure SGTM :)


#[test]
fn simple_table_changes_scan_builder() {
let path = "./tests/data/table-with-cdf";

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: little bit confusing there's a column named 'part' but there are no partition columns. maybe just add a comment that this is a non-partitioned table?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added 👍

Comment thread kernel/src/table_changes/scan.rs
@zachschuermann zachschuermann removed the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Nov 22, 2024
@OussamaSaoudi-db OussamaSaoudi-db merged commit db29db0 into delta-io:main Nov 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants