feat!: Introduce struct patch builder#2665

Open
scovich wants to merge 17 commits into delta-io:main from scovich:struct-patch-builder

@scovich scovich commented May 30, 2026

Collaborator

What changes are proposed in this pull request?

The current struct expression patching API has several warts including:

  • No easy way to append a column at the end of the schema (must find/know last column and insert-after it)
  • Weird semantics of field replace and drop operations that change depending on invocation order
  • Awkward FFI support

Rework the API to be easier to use and less error-prone, and introduce a builder that validates the changes before emitting a much simpler patch format.

The builder exposes the following operations:

  • prepend (insert new field before-first)
  • append (insert new field after-last)
  • insert-after (some named predecessor)
  • replace (some named expression)
  • drop (error if target field is missing)
  • optional drop (ignore if target field is missing)
  • xxx_at (all of the above, but acting on some nested field)

The builder's build method detects and rejects invalid combinations of operations such as:

  • double-drop, double-replace, or drop-and-replace
  • insert-after an optional drop (where to insert if target is missing?)
  • drop or replace a field with nested changes
  • no-op patch on a nested field
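
As a rough illustration of the operations and rejection rules listed above, here is a minimal sketch. All names in it (`PatchBuilder`, `FieldOps`, `drop_field`, `drop_if_present`, `validate`, `is_noop`) are hypothetical and are not the PR's actual API. Expressions are reduced to plain strings, and the nested (`xxx_at`) operations and their checks are left out.

```rust
use std::collections::BTreeMap;

/// Everything requested for one named input field (hypothetical model).
#[derive(Debug, Default)]
struct FieldOps {
    drops: usize,              // required drops requested
    optional_drops: usize,     // optional drops requested
    replacements: Vec<String>, // replacement expressions requested
    insertions: Vec<String>,   // expressions to insert after this field
}

/// Hypothetical builder that records operations and validates them later.
#[derive(Debug, Default)]
struct PatchBuilder {
    prepended: Vec<String>,
    appended: Vec<String>,
    fields: BTreeMap<String, FieldOps>,
}

impl PatchBuilder {
    fn new() -> Self {
        Self::default()
    }

    fn ops(&mut self, field: &str) -> &mut FieldOps {
        self.fields.entry(field.to_string()).or_default()
    }

    fn prepend(mut self, expr: &str) -> Self {
        self.prepended.push(expr.to_string());
        self
    }

    fn append(mut self, expr: &str) -> Self {
        self.appended.push(expr.to_string());
        self
    }

    fn insert_after(mut self, field: &str, expr: &str) -> Self {
        self.ops(field).insertions.push(expr.to_string());
        self
    }

    fn replace(mut self, field: &str, expr: &str) -> Self {
        self.ops(field).replacements.push(expr.to_string());
        self
    }

    fn drop_field(mut self, field: &str) -> Self {
        self.ops(field).drops += 1;
        self
    }

    fn drop_if_present(mut self, field: &str) -> Self {
        self.ops(field).optional_drops += 1;
        self
    }

    /// True when no operation was recorded at all.
    fn is_noop(&self) -> bool {
        self.prepended.is_empty() && self.appended.is_empty() && self.fields.is_empty()
    }

    /// Rejects the flat (non-nested) invalid combinations from the list above.
    fn validate(&self) -> Result<(), String> {
        for (name, ops) in &self.fields {
            // Covers double-drop, double-replace, and drop-and-replace.
            if ops.drops + ops.optional_drops + ops.replacements.len() > 1 {
                return Err(format!("conflicting drop/replace operations on field '{name}'"));
            }
            // If the target may be missing, there is no position to insert at.
            if ops.optional_drops > 0 && !ops.insertions.is_empty() {
                return Err(format!("cannot insert after optionally dropped field '{name}'"));
            }
        }
        Ok(())
    }
}

fn main() {
    let ok = PatchBuilder::new()
        .prepend("first")
        .append("last")
        .insert_after("b", "b_extra")
        .drop_field("c");
    assert!(ok.validate().is_ok());

    let bad = PatchBuilder::new().drop_field("c").replace("c", "c2");
    assert!(bad.validate().is_err());
}
```

Because every operation is only recorded and all checks run in one place, the outcome in this sketch does not depend on the order in which the operations were called.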

The output of build is a simple patch format:

  • at struct level
    • prepended fields
    • appended fields
    • altered fields
  • at field level
    • keep_input - if true, propagate the named input to the output
    • insertions - a list of new fields to insert after the named input
    • optional - identifies an optional drop
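
One possible reading of this format, sketched under stated assumptions: the field names `keep_input`, `insertions`, and `optional` come from the description above, while `StructPatch`, `FieldPatch`, `apply`, and the use of plain strings for fields and expressions are hypothetical. In this reading a replace is `keep_input = false` with a non-empty `insertions` list, a drop is `keep_input = false` with no insertions, and an insert-after is `keep_input = true` with insertions.

```rust
use std::collections::HashMap;

/// Per-field entry of the patch (field-level part of the format).
struct FieldPatch {
    keep_input: bool,        // propagate the named input to the output
    insertions: Vec<String>, // new fields to emit after the named input
    optional: bool,          // an optional drop: a missing target is fine
}

/// Struct-level part of the format.
struct StructPatch {
    prepended: Vec<String>,
    appended: Vec<String>,
    altered: HashMap<String, FieldPatch>,
}

/// Applies a patch to a list of input field names (hypothetical helper).
fn apply(input: &[&str], patch: &StructPatch) -> Result<Vec<String>, String> {
    // Every altered field must exist unless it is an optional drop.
    for (name, fp) in &patch.altered {
        if !fp.optional && !input.contains(&name.as_str()) {
            return Err(format!("field '{name}' not found in input"));
        }
    }
    let mut out = patch.prepended.clone();
    for &name in input {
        match patch.altered.get(name) {
            None => out.push(name.to_string()),
            Some(fp) => {
                if fp.keep_input {
                    out.push(name.to_string());
                }
                out.extend(fp.insertions.iter().cloned());
            }
        }
    }
    out.extend(patch.appended.iter().cloned());
    Ok(out)
}

fn example_patch() -> StructPatch {
    let mut altered = HashMap::new();
    // Replace "b" with "b2".
    altered.insert(
        "b".to_string(),
        FieldPatch { keep_input: false, insertions: vec!["b2".to_string()], optional: false },
    );
    // Insert "c1" after "c".
    altered.insert(
        "c".to_string(),
        FieldPatch { keep_input: true, insertions: vec!["c1".to_string()], optional: false },
    );
    // Optionally drop "gone", which the input below does not contain.
    altered.insert(
        "gone".to_string(),
        FieldPatch { keep_input: false, insertions: vec![], optional: true },
    );
    StructPatch { prepended: vec!["p".to_string()], appended: vec!["z".to_string()], altered }
}

fn main() {
    let out = apply(&["a", "b", "c"], &example_patch()).unwrap();
    println!("{out:?}"); // ["p", "a", "b2", "c", "c1", "z"]
}
```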

The FFI visitors were then rewritten to take advantage of this new simpler form (builders don't cross the FFI boundary).

This PR affects the following public APIs

Reworked the whole expression struct patch API.

How was this change tested?

Existing code updated to use the new builder, plus new tests.

github-actions Bot added the breaking-change label (Public API change that could cause downstream compilation failures. Requires a major version bump.) May 30, 2026
Comment thread kernel/src/expressions/mod.rs Outdated
Comment on lines +832 to +836
if insertions.is_empty() {
return Err(Error::generic(format!(
"Internal error: builder produced a no-op patch for field '{field_name}'"
)));
}

@scovich scovich Jun 2, 2026

We probably need to remove this check, at least for top-level patches: kernel relies on no-op transforms to apply new schemas, e.g. to rename columns or relax nullability.

Actually, this only errors out on patches that become struct fields.
Patches that become the new top-level are unaffected, so we're good.

@scovich scovich left a comment

Self-review comments

int patch_op_cmp(const void* a, const void* b) {
const struct FieldPatch* op_a = ((ExpressionItem*)a)->ref;
const struct FieldPatch* op_b = ((ExpressionItem*)b)->ref;
if (op_a->field_name == NULL && op_b->field_name == NULL) {

@scovich scovich Jun 3, 2026

note: Field names are no longer nullable, because append and prepend have separate handling paths now


 ExpressionItemList construct_predicate(SharedPredicate* predicate) {
-  ExpressionBuilder data = { 0 };
+  ExpressionBuilder data = { .list_count = 1 };

The kernel code no longer assumes that list id 0 exists and is empty.
Here we code defensively by ensuring list id 0 is never allocated at all (acts like NULL)
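
The same defensive idea can be sketched in Rust (the example program in the PR is C). `ListArena` and its methods are hypothetical names used only for illustration: ids are handed out starting at 1, so id 0 never refers to a list and behaves like NULL.

```rust
/// Hypothetical arena of expression lists whose ids start at 1.
struct ListArena {
    lists: Vec<Vec<String>>, // lists[i] holds the list with id i + 1
}

impl ListArena {
    fn new() -> Self {
        Self { lists: Vec::new() }
    }

    /// Allocates a new empty list and returns its id (never 0).
    fn make_list(&mut self) -> usize {
        self.lists.push(Vec::new());
        self.lists.len()
    }

    /// Appends to a list; returns false for id 0 or any unknown id.
    fn push(&mut self, id: usize, item: &str) -> bool {
        if id == 0 {
            return false;
        }
        match self.lists.get_mut(id - 1) {
            Some(list) => {
                list.push(item.to_string());
                true
            }
            None => false,
        }
    }

    /// Looks up a list; id 0 always yields None.
    fn get(&self, id: usize) -> Option<&[String]> {
        if id == 0 {
            return None;
        }
        self.lists.get(id - 1).map(|v| v.as_slice())
    }
}

fn main() {
    let mut arena = ListArena::new();
    let id = arena.make_list();
    assert_eq!(id, 1);
    assert!(arena.push(id, "col_a"));
    assert!(arena.get(0).is_none()); // id 0 is never a valid list
}
```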

codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 88.08777% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.69%. Comparing base (e3cc537) to head (0072895).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ffi/src/expressions/engine_visitor.rs | 0.00% | 20 Missing ⚠️ |
| ffi/src/test_ffi.rs | 0.00% | 19 Missing ⚠️ |
| kernel/src/expressions/patches.rs | 94.62% | 11 Missing and 6 partials ⚠️ |
| kernel/src/expressions/mod.rs | 80.64% | 3 Missing and 3 partials ⚠️ |
| kernel/src/transaction/mod.rs | 86.36% | 2 Missing and 1 partial ⚠️ |
| kernel/src/checkpoint/mod.rs | 33.33% | 1 Missing and 1 partial ⚠️ |
| kernel/src/scan/log_replay.rs | 80.00% | 2 Missing ⚠️ |
| kernel/src/transaction/update.rs | 66.66% | 2 Missing ⚠️ |
| kernel/src/checkpoint/checkpoint_transform.rs | 96.66% | 0 Missing and 1 partial ⚠️ |
| kernel/src/checkpoint/sidecar/mod.rs | 50.00% | 1 Missing ⚠️ |

... and 3 more

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2665      +/-   ##
==========================================
- Coverage   88.70%   88.69%   -0.02%     
==========================================
  Files         209      210       +1     
  Lines       65409    65805     +396     
  Branches    65409    65805     +396     
==========================================
+ Hits        58024    58367     +343     
- Misses       5170     5212      +42     
- Partials     2215     2226      +11     

scovich added a commit to scovich/delta-kernel-rs that referenced this pull request Jun 3, 2026
… it (delta-io#2682)

## What changes are proposed in this pull request?

The `WriteContext` code wraps a `WriteState` that is common to both
partitioned and unpartitioned tables. Originally, it used a `OnceLock`
to memoize the value. However:
* This is a footgun because the `Transaction` object that produces these
context objects is _mutable_, so there's no guarantee the memoized value
matches current state. At least one unit test already mutates the
transaction after construction, tho fortunately it does so before
triggering the once-lock.
* It forces infallible initialization. This is not guaranteed to remain
true, e.g. delta-io#2665 adds
validation to patch creation to catch invalid patches that currently go
undetected.

Just remove the OnceLock. It's memoizing something that's dirt cheap to
compute and that could be invalidated. For unpartitioned tables, it
would anyway only be called once per transaction. For partitioned
tables, the per-partition cost of validating and serializing partition
values easily outweighs the cost of recreating the patch expression.
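
The staleness hazard described above can be sketched as follows. This is a simplified stand-in, not the kernel's code: the `Transaction` here has hypothetical fields, and a comma-joined string stands in for the real `WriteState`.

```rust
use std::sync::OnceLock;

/// Simplified stand-in for a mutable transaction with a memoized derived value.
struct Transaction {
    partition_columns: Vec<String>,
    memo: OnceLock<String>,
}

impl Transaction {
    fn new(cols: &[&str]) -> Self {
        Self {
            partition_columns: cols.iter().map(|c| c.to_string()).collect(),
            memo: OnceLock::new(),
        }
    }

    /// Memoized: computed on first call, never refreshed afterwards.
    fn write_state_memoized(&self) -> &str {
        self.memo.get_or_init(|| self.partition_columns.join(","))
    }

    /// Recomputed on every call: always reflects the current state.
    fn write_state(&self) -> String {
        self.partition_columns.join(",")
    }
}

fn main() {
    let mut txn = Transaction::new(&["year"]);
    assert_eq!(txn.write_state_memoized(), "year");

    // Mutating after the first call leaves the memoized value stale.
    txn.partition_columns.push("month".to_string());
    assert_eq!(txn.write_state_memoized(), "year");
    assert_eq!(txn.write_state(), "year,month");
}
```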

While we're at it, add missing validation that the set of named
partition columns in `metadata.partitionColumns` actually corresponds to
columns present in the table's schema. This _could_ have been added to
the newly-fallible `Transaction::generate_logical_to_physical` method,
but it's actually a better fit for the `TableConfiguration` constructor
that is the intentional narrow waist for all such checks.

## How was this change tested?

Existing unit tests for the `WriteState` change; new unit tests for the
new validation.

@scovich scovich left a comment

Leaving notes for reviewers

 checkpoint_data_schema.clone(),
 Arc::new(Expression::struct_patch(
-    ExpressionStructPatch::new_top_level()
+    ExpressionStructPatchBuilder::new()

Renamed because the old name was clunky.

 .get(field)
-.map(|ft| ft.is_replace && ft.exprs.len() == 1)
-.unwrap_or(false)
+.is_some_and(|ft| !ft.keep_input && !ft.insertions.is_empty())

I don't know why these were map+unwrap_or before... fixed since I anyway had to touch them.
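
For readers unfamiliar with the idiom: `Option::is_some_and(f)` returns the same result as `.map(f).unwrap_or(false)`. The sketch below uses a hypothetical `FieldPatch` with the `keep_input` and `insertions` fields named in this PR, and applies the same predicate in both forms purely to show the equivalence (the PR's old and new predicates differ because the underlying data model changed).

```rust
use std::collections::HashMap;

/// Hypothetical per-field patch entry.
struct FieldPatch {
    keep_input: bool,
    insertions: Vec<String>,
}

/// Older spelling: map to a bool, then default to false when the key is absent.
fn is_replaced_map_unwrap(patches: &HashMap<String, FieldPatch>, field: &str) -> bool {
    patches
        .get(field)
        .map(|ft| !ft.keep_input && !ft.insertions.is_empty())
        .unwrap_or(false)
}

/// Equivalent spelling with `Option::is_some_and`.
fn is_replaced_is_some_and(patches: &HashMap<String, FieldPatch>, field: &str) -> bool {
    patches
        .get(field)
        .is_some_and(|ft| !ft.keep_input && !ft.insertions.is_empty())
}

fn example_patches() -> HashMap<String, FieldPatch> {
    let mut patches = HashMap::new();
    patches.insert(
        "replaced".to_string(),
        FieldPatch { keep_input: false, insertions: vec!["new_expr".to_string()] },
    );
    patches.insert(
        "dropped".to_string(),
        FieldPatch { keep_input: false, insertions: vec![] },
    );
    patches
}

fn main() {
    let patches = example_patches();
    for field in ["replaced", "dropped", "absent"] {
        assert_eq!(
            is_replaced_map_unwrap(&patches, field),
            is_replaced_is_some_and(&patches, field)
        );
    }
}
```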

.transpose()?;

// For nested patches, get the source struct's null bitmap to preserve null rows
let source_null_buffer = source_array.as_ref().and_then(|arr| {

moved down to where it's actually used

.collect();

// For nested patches, get the source struct's null bitmap to preserve null rows
let source_null_buffer = source_array.as_ref().and_then(|arr| {

moved from above

Some("stats"),
Expression::column([FILE_CONSTANT_VALUES_NAME, TAGS_NAME]).into(),
);
.with_dropped_field(STATS_PARSED_NAME);

This could be simplified even in the original code -- we always insert tags, so the else can be eliminated.

Jameson-Crate pushed a commit to Jameson-Crate/delta-kernel-rs that referenced this pull request Jun 4, 2026
… it (delta-io#2682)

@scovich scovich requested a review from OussamaSaoudi June 5, 2026 19:12
@scovich scovich changed the title from "[WIP] Struct patch builder" to "feat!: Introduce struct patch builder" Jun 5, 2026
@scovich scovich marked this pull request as ready for review June 5, 2026 20:35

-patch->field_patches = get_expr_list(data, child_list_id);
+patch->prepended_fields = get_expr_list(data, prepended_field_list_id);
+patch->field_patches = get_expr_list(data, field_patch_list_id);
+patch->appended_fields = get_expr_list(data, appended_field_list_id);

By handling prepended and appended fields directly, we no longer need a complicated "field" model -- fields are simply fields.

Comment on lines -233 to -242
/// |field_name? |expr_list? |is_replace? |meaning|
/// |-|-|-|-|
/// | NO | NO | * | NO-OP (prepend an empty list of expressions to the output)
/// | NO | YES | * | Prepend a list of expressions to the output
/// | YES | NO | NO | NO-OP (insert an empty list of expressions after the named input field)
/// | YES | NO | YES | Drop the named input field
/// | YES | YES | NO | Insert a list of expressions after the named input field
/// | YES | YES | YES | Replace the named input field with a list of expressions
///
/// NOTE: Treating list id 0 as an empty list yields a simplified truth table:

Good riddance!

Err(de::Error::custom("Cannot deserialize an Opaque Expression"))
}

/// A patch affecting a single input field.

All this moved to a new patches.rs mod

}

// Adapter that converts the insert_after option into a method call on the patch.
fn apply_insert_after(

Needed because we no longer model prepend as insert-after None
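
A minimal sketch of what such an adapter might look like, assuming a caller that still passes an optional predecessor name. The `Patch` type, its methods, and the function name `insert_after_or_prepend` are hypothetical; the PR's actual `apply_insert_after` signature is not shown in this excerpt and may differ.

```rust
/// Hypothetical patch that models prepend and insert-after separately.
#[derive(Debug, Default, PartialEq)]
struct Patch {
    prepended: Vec<String>,
    inserted_after: Vec<(String, String)>, // (predecessor, new expression)
}

impl Patch {
    fn prepend(mut self, expr: &str) -> Self {
        self.prepended.push(expr.to_string());
        self
    }

    fn insert_after(mut self, predecessor: &str, expr: &str) -> Self {
        self.inserted_after.push((predecessor.to_string(), expr.to_string()));
        self
    }
}

/// Maps the old "insert after None means prepend" convention onto two methods.
fn insert_after_or_prepend(patch: Patch, predecessor: Option<&str>, expr: &str) -> Patch {
    match predecessor {
        Some(name) => patch.insert_after(name, expr),
        None => patch.prepend(expr),
    }
}

fn main() {
    let patch = insert_after_or_prepend(Patch::default(), None, "row_id");
    let patch = insert_after_or_prepend(patch, Some("stats"), "tags");
    assert_eq!(patch, Patch::default().prepend("row_id").insert_after("stats", "tags"));
}
```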

github-actions Bot commented Jun 5, 2026

Benchmark results

Commit: 0072895 · Trigger: auto-push · Tags: base

| Test | Base | PR | Change |
|---|---|---|---|
| 101kAdds1kCommitsSinceChkpt1Chkpt/readLatest/readMetadata/serial | 357.5±11.25ms | 356.8±10.17ms | 1.00x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV10/readMetadata/serial | 1626.0±57.58µs | 1674.8±46.76µs | 1.03x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV110/readMetadata/serial | 36.2±1.16ms | 36.9±1.24ms | 1.02x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV210/readMetadata/serial | 72.0±7.02ms | 70.5±4.72ms | 1.02x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV510/readMetadata/serial | 174.9±5.65ms | 175.8±5.92ms | 1.01x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV60/readMetadata/serial | 18.0±0.73ms | 18.4±0.61ms | 1.02x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotLatest/snapshotConstruction | 94.7±4.02ms | 95.0±3.81ms | 1.00x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV10/snapshotConstruction | 20.9±0.25ms | 21.2±0.46ms | 1.01x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV110/snapshotConstruction | 28.6±0.56ms | 29.2±2.31ms | 1.02x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV210/snapshotConstruction | 36.9±0.78ms | 37.2±0.73ms | 1.01x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV510/snapshotConstruction | 60.1±1.73ms | 60.5±1.70ms | 1.01x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV60/snapshotConstruction | 24.8±0.53ms | 24.8±0.52ms | 1.00x |
