feat!: Introduce struct patch builder#2665
Conversation
| if insertions.is_empty() { | ||
| return Err(Error::generic(format!( | ||
| "Internal error: builder produced a no-op patch for field '{field_name}'" | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Probably need to get rid of this check, at least for top-level -- kernel does rely on no-op transforms to apply new schemas to relax column rename and relax nullability.
There was a problem hiding this comment.
Actually, this only errors out on patches that become struct fields.
Patches that become the new top-level are unaffected, so we're good.
scovich
left a comment
There was a problem hiding this comment.
Self-review comments
| int patch_op_cmp(const void* a, const void* b) { | ||
| const struct FieldPatch* op_a = ((ExpressionItem*)a)->ref; | ||
| const struct FieldPatch* op_b = ((ExpressionItem*)b)->ref; | ||
| if (op_a->field_name == NULL && op_b->field_name == NULL) { |
There was a problem hiding this comment.
note: Field names are no longer nullable, because append and prepend have separate handling paths now
|
|
||
| ExpressionItemList construct_predicate(SharedPredicate* predicate) { | ||
| ExpressionBuilder data = { 0 }; | ||
| ExpressionBuilder data = { .list_count = 1 }; |
There was a problem hiding this comment.
The kernel code no longer assumes that list id 0 exists and is empty.
Here we code defensively by ensuring list id 0 is never allocated at all (acts like NULL)
| if insertions.is_empty() { | ||
| return Err(Error::generic(format!( | ||
| "Internal error: builder produced a no-op patch for field '{field_name}'" | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Actually, this only errors out on patches that become struct fields.
Patches that become the new top-level are unaffected, so we're good.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2665 +/- ##
==========================================
- Coverage 88.70% 88.69% -0.02%
==========================================
Files 209 210 +1
Lines 65409 65805 +396
Branches 65409 65805 +396
==========================================
+ Hits 58024 58367 +343
- Misses 5170 5212 +42
- Partials 2215 2226 +11 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
… it (delta-io#2682) ## What changes are proposed in this pull request? The `WriteContext` code wraps a `WriteState` that is common to both partitioned and unpartitioned tables. Originally, it used a `OnceLock` to memoize the value. However: * This is a footgun because the `Transaction` object that produces these context objects is _mutable_, so there's no guarantee the memoized value matches current state. At least one unit test already mutates the transaction after construction, tho fortunately it does so before triggering the once-lock. * It forces infallible initialization. This is not guaranteed to remain true, e.g. delta-io#2665 adds validation to patch creation to catch invalid patches that currently go undetected. Just remove the OnceLock. It's memoizing something that's dirt cheap to compute and that could be invalidated. For unpartitioned tables, it would anyway only be called once per transaction. For partitioned tables, the per-partition cost of validating and serializing partition values easily outweighs the cost of recreating the patch expression. While we're at it, add missing validation that the set of named partition columns in `metadata.partitionColumns` actually corresponds to columns present in the table's schema. This _could_ have been added to the newly-fallible `Transaction::generate_logical_to_physical` method, but it's actually a better fit for the `TableConfiguration` constructor that is the intentional narrow waist for all such checks. ## How was this change tested? Existing unit tests for the `WriteState` change; new unit tests for the new validation.
scovich
left a comment
There was a problem hiding this comment.
Leaving notes for reviewers
| checkpoint_data_schema.clone(), | ||
| Arc::new(Expression::struct_patch( | ||
| ExpressionStructPatch::new_top_level() | ||
| ExpressionStructPatchBuilder::new() |
There was a problem hiding this comment.
Renamed because the old name was clunky.
| .get(field) | ||
| .map(|ft| ft.is_replace && ft.exprs.len() == 1) | ||
| .unwrap_or(false) | ||
| .is_some_and(|ft| !ft.keep_input && !ft.insertions.is_empty()) |
There was a problem hiding this comment.
I don't know why these were map+unwrap_or before... fixed since I anyway had to touch them.
| .transpose()?; | ||
|
|
||
| // For nested patches, get the source struct's null bitmap to preserve null rows | ||
| let source_null_buffer = source_array.as_ref().and_then(|arr| { |
There was a problem hiding this comment.
moved down to where it's actually used
| .collect(); | ||
|
|
||
| // For nested patches, get the source struct's null bitmap to preserve null rows | ||
| let source_null_buffer = source_array.as_ref().and_then(|arr| { |
| Some("stats"), | ||
| Expression::column([FILE_CONSTANT_VALUES_NAME, TAGS_NAME]).into(), | ||
| ); | ||
| .with_dropped_field(STATS_PARSED_NAME); |
There was a problem hiding this comment.
This could be simplified even in the original code -- we always insert tags, so the else can be eliminated.
… it (delta-io#2682) ## What changes are proposed in this pull request? The `WriteContext` code wraps a `WriteState` that is common to both partitioned and unpartitioned tables. Originally, it used a `OnceLock` to memoize the value. However: * This is a footgun because the `Transaction` object that produces these context objects is _mutable_, so there's no guarantee the memoized value matches current state. At least one unit test already mutates the transaction after construction, tho fortunately it does so before triggering the once-lock. * It forces infallible initialization. This is not guaranteed to remain true, e.g. delta-io#2665 adds validation to patch creation to catch invalid patches that currently go undetected. Just remove the OnceLock. It's memoizing something that's dirt cheap to compute and that could be invalidated. For unpartitioned tables, it would anyway only be called once per transaction. For partitioned tables, the per-partition cost of validating and serializing partition values easily outweighs the cost of recreating the patch expression. While we're at it, add missing validation that the set of named partition columns in `metadata.partitionColumns` actually corresponds to columns present in the table's schema. This _could_ have been added to the newly-fallible `Transaction::generate_logical_to_physical` method, but it's actually a better fit for the `TableConfiguration` constructor that is the intentional narrow waist for all such checks. ## How was this change tested? Existing unit tests for the `WriteState` change; new unit tests for the new validation.
| patch->field_patches = get_expr_list(data, child_list_id); | ||
| patch->prepended_fields = get_expr_list(data, prepended_field_list_id); | ||
| patch->field_patches = get_expr_list(data, field_patch_list_id); | ||
| patch->appended_fields = get_expr_list(data, appended_field_list_id); |
There was a problem hiding this comment.
By handling prepended and appended fields directly, we no longer need a complicated "field" model -- fields are simply fields.
| /// |field_name? |expr_list? |is_replace? |meaning| | ||
| /// |-|-|-|-| | ||
| /// | NO | NO | * | NO-OP (prepend an empty list of expressions to the output) | ||
| /// | NO | YES | * | Prepend a list of expressions to the output | ||
| /// | YES | NO | NO | NO-OP (insert an empty list of expressions after the named input field) | ||
| /// | YES | NO | YES | Drop the named input field | ||
| /// | YES | YES | NO | Insert a list of expressions after the named input field | ||
| /// | YES | YES | YES | Replace the named input field with a list of expressions | ||
| /// | ||
| /// NOTE: Treating list id 0 as an empty list yields a simplified truth table: |
| Err(de::Error::custom("Cannot deserialize an Opaque Expression")) | ||
| } | ||
|
|
||
| /// A patch affecting a single input field. |
There was a problem hiding this comment.
All this moved to a new patches.rs mod
| } | ||
|
|
||
| // Adapter that converts the insert_after option into a method call on the patch. | ||
| fn apply_insert_after( |
There was a problem hiding this comment.
Needed because we no longer model prepend as insert-after None
Benchmark resultsCommit:
|
What changes are proposed in this pull request?
The current struct expression patching API has several warts including:
Rework the API to be easier to use and less error-prone and introduce a builder that can validate the changes before emitting a much simpler patch format.
The builder exposes the following operations:
The builder's
buildmethod detects and rejects invalid combinations of operations such as:The output of
buildis a simple patch format:The FFI visitors were then rewritten to take advantage of this new simpler form (builders don't cross the FFI boundary).
This PR affects the following public APIs
Reworked the whole expression struct patch API.
How was this change tested?
Existing code updated to use the new builder, plus new tests.