Skip to content

Delta kernel conversion target#801

Open
vaibhavk1992 wants to merge 60 commits intoapache:mainfrom
vaibhavk1992:test-4
Open

Delta kernel conversion target#801
vaibhavk1992 wants to merge 60 commits intoapache:mainfrom
vaibhavk1992:test-4

Conversation

@vaibhavk1992
Copy link
Copy Markdown
Contributor

@vaibhavk1992 vaibhavk1992 commented Feb 6, 2026

What is the purpose of the pull request

This PR migrates
XTable's Delta Lake integration from Delta Standalone to
Delta Kernel for writers

Brief change log

  • Added unit test for delta kernel data file updates
  • Added unit tests for conversion target

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added TestConversionController to verify the change.
  • Manually verified the change by running a job locally.

@the-other-tim-brown
Copy link
Copy Markdown
Contributor

@vaibhavk1992 can you update the PR description?

Copy link
Copy Markdown
Contributor

@the-other-tim-brown the-other-tim-brown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavk1992 can you complete a self-review of the code first to clean up a bit?

Merged 7 upstream commits:
- f991e31 Parquet Source: snapshot sync fixes (apache#806)
- 4307565 Parquet source: column stats support (apache#805)
- 5c25674 Remove wildcard imports and enforce with spotless (apache#809)
- fe7215e add .sdkmanrc to .gitignore (apache#793)
- abbf4b7 fix(iceberg): nested comments (apache#797)
- 8e58367 Remove redundant getSnapshotAt calls (apache#791)
- 8cab6a2 fix(delta): avoid NPE for binary in map/array (apache#795)

Resolved conflicts:
- TestDeltaKernelSchemaExtractor.java: kept StructField import needed for new tests
Fixed wildcard imports in Delta Kernel test files to comply with
spotless rules enforced in upstream commit 5c25674.
The spotless:apply command removed wildcard imports but didn't add
back all necessary specific imports. Added missing imports:

TestDeltaKernelReadWriteIntegration.java:
- Static assertions (assertEquals, assertTrue, assertFalse, assertNotNull)
- java.util.* (Random, UUID, List, Map, Set, Arrays, Collections, etc.)

TestDeltaKernelSync.java:
- Static assertions (including fail)
- java.util.* (Random, UUID, List, Map, Set, Arrays, Collections, etc.)

TestDeltaKernelDataFileUpdatesExtractor.java:
- Static assertions (assertEquals, assertTrue, assertFalse, assertNotNull)
- java.util.* (List, Arrays, Collections)

All tests now compile successfully.
@vaibhavk1992
Copy link
Copy Markdown
Contributor Author

I have tried to address all the comments @vinishjail97 @the-other-tim-brown

}

@Override
public void init(TargetTable targetTable, Configuration configuration) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init() creates a new Engine from configuration, silently discarding the engine injected via the constructor. If someone builds a DeltaKernelConversionTarget with a custom engine (e.g. in tests or via direct construction), calling init() will override it without any warning. The two initialization paths (constructor vs init()) are competing and confusing — should either merge them or explicitly document when each is used.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this. @vaibhavk1992 can you address this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavk1992 any updates on whether this can be addressed?

}
} catch (Exception e) {
// Log and continue to next commit
log.warn(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.warn("...", version, e.getMessage()) swallows the stack trace. Pass the exception as the last argument instead: log.warn("Failed to parse commit metadata for version {}", version, e). On-call engineers debugging a production issue won't know where the parse failure originated.

try {
Table table = Table.forPath(engine, basePath);
this.latestSchema = table.getLatestSnapshot(engine).getSchema();
} catch (Exception e) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catch (Exception e) silently sets this.latestSchema = null. If the error is a network issue, permissions error, or anything other than "table doesn't exist", this will silently proceed with a null schema and cause an NPE in addColumn() at line 366 (latestSchema.add(field)). Should catch only the specific "table not found" exception from Delta Kernel, not the broad Exception.

@vaibhavk1992
Copy link
Copy Markdown
Contributor Author

@vinishjail97 I have implemented all the above suggestions.

for (RowBackedAction action : actions) {

if (action instanceof io.delta.kernel.internal.actions.AddFile) {
io.delta.kernel.internal.actions.AddFile addFile =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these classes from the delta kernel be imported?

Map<String, String> configMap = new HashMap<>();

configMap.put(TableSyncMetadata.XTABLE_METADATA, metadata.toJson());
configMap.put(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a constant we can use here?

DeltaKernelDataFileUpdatesExtractor.builder()
.engine(engine)
.basePath(targetTable.getBasePath())
// Column statistics are not needed for conversion operations
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we convert from one format to another, we actually do want the statistics. This allows query engines to leverage them for their planning operations for improved efficiency.

if (action instanceof AddFile) {
AddFile addFile = (AddFile) action;
Row wrappedRow =
io.delta.kernel.internal.actions.SingleAction.createAddFileSingleAction(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's start using imports throughout the PR please. Do a sanity check of the files and make sure you are using them throughout. Highlighting every line leads to a lot of noise on the reviews.

public static boolean tableExists(Engine engine, String basePath) {
try {
Table table = Table.forPath(engine, basePath);
table.getLatestSnapshot(engine);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it will load the snapshot, is there a more lightweight way to do this?

DeltaKernelSchemaExtractor.getInstance().toInternalSchema(structRepresentation));
}

// ========== Tests for fromInternalSchema() - New Tests ==========
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this comment for New Tests?


// ========== Tests for fromInternalSchema() - New Tests ==========

@Test
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have tests with nested fields, lists and maps as well


@Override
public Optional<String> getTargetCommitIdentifier(String sourceIdentifier) {
// Delta Kernel 4.0.0 does not support commit tags in commitInfo, which are required for
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure there is an issue in XTable tracking this as well.

private Seq<RowBackedAction> applyDiff(
Set<? extends InternalFile> filesAdded,
Collection<RowBackedAction> removeFileActions,
InternalSchema tableSchema,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tableSchema is unused, can it be removed from the method signature?

diff.getFilesAdded(),
diff.getFilesRemoved(),
tableSchema,
table.getPath(engine),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be set to basePath?


// Verify we have AddFile actions
boolean hasAddFile = actionList.stream().anyMatch(action -> action instanceof AddFile);
assertTrue(hasAddFile, "Should contain AddFile actions");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assert on the content of the AddFile to make sure it is aligned with our expectations?


// Note: file2 should not appear in actions as it's unchanged
// file1 should appear as RemoveFile as it's not in the new sync
System.out.println(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavk1992 can you address this? We don't want printing in our tests

actionList.stream().filter(action -> action instanceof RemoveFile).count();

// Verify: Should have AddFile for file3 (new file)
assertTrue(addFileCount >= 1, "Should have at least 1 AddFile action for new file (file3)");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These counts should be strict. Only 1 file is expected to be added and 1 removed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants