Support Iceberg nested and binary GPU writes#14611
Support Iceberg nested and binary GPU writes#14611res-life wants to merge 15 commits intoNVIDIA:mainfrom
Conversation
Greptile SummaryThis PR enables GPU-accelerated Iceberg writes for nested types (arrays, maps, structs) and binary columns by propagating Iceberg field IDs through the Spark schema metadata and into cuDF's Confidence Score: 5/5Safe to merge; no P0/P1 issues found — the two comments are P2 style suggestions about a deliberately temporary reflection workaround. The previous P1 (accidental Hex removal) is absent in this revision. No resource leaks, no broken field-ID round-trip logic, and no correctness gaps were found. The reflection-based cuDF field mutation is explicitly acknowledged with a TODO + tracking issue (rapidsai/cudf#22347). Test coverage is substantively improved. Score stays at the P2-only ceiling of 5.
Important Files Changed
Sequence DiagramsequenceDiagram
participant IcebergSchema as Iceberg Schema
participant GpuTypeToSparkType
participant StructField as Spark StructField (metadata)
participant writerOptionsFromField as SchemaUtils.writerOptionsFromField
participant setParquetFieldId as setParquetFieldId (reflection)
participant ColumnWriterOptions as cuDF ColumnWriterOptions
participant ParquetFile as Parquet File
IcebergSchema->>GpuTypeToSparkType: toSparkType(schema)
GpuTypeToSparkType->>GpuTypeToSparkType: appendNestedFieldIdMetadata(fieldType)
GpuTypeToSparkType->>StructField: attach {parquet.field.id, rapids.parquet.list.element.field.id, ...}
Note over StructField: Top-level field carries nested element/key/value IDs as side-channel metadata
StructField->>writerOptionsFromField: fieldMeta with nested IDs
writerOptionsFromField->>writerOptionsFromField: nestedFieldMetadata() → child Metadata
writerOptionsFromField->>ColumnWriterOptions: build() element/key/value options
writerOptionsFromField->>setParquetFieldId: setParquetFieldId(listOptions/mapOptions, parentId)
setParquetFieldId->>ColumnWriterOptions: reflection: hasParquetFieldId=true, parquetFieldId=N
ColumnWriterOptions->>ParquetFile: Write with Parquet field IDs preserved
Reviews (8): Last reviewed commit: "Update docs" | Re-trigger Greptile |
22d5160 to
f29e294
Compare
Signed-off-by: Chong Gao <res_life@163.com>
f29e294 to
8c0ad91
Compare
Signed-off-by: Chong Gao <res_life@163.com>
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @res-life for this pr. One missing point is that I hope you can do a careful check in the dml tests, since there are some fallbacks no longer necessary.
Use ParquetUtil.footerMetrics directly instead of routing the original Iceberg schema through GpuParquetMetricsHelper. With field IDs correctly written in the Parquet footer, the schema argument is redundant. Addresses review comment from liurenjie1024 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two @allow_non_gpu("ColumnarToRowExec", "BatchScanExec") decorators
on Iceberg delete tests are no longer required.
Addresses review comment from liurenjie1024 on PR NVIDIA#14611.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the queue-based sub-batch emission change that was bundled into the nested/binary writes branch. It is unrelated to Iceberg field-id preservation and should be pursued separately. Addresses review comment from liurenjie1024 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the standalone IcebergParquetMetadataKeys file. The nested list-element / map-key / map-value field-id keys are an internal contract between the Iceberg writer (GpuTypeToSparkType) and SchemaUtils, so define them where they're consumed and rename them without the iceberg-specific prefix. Replace SchemaUtils' duplicate private FIELD_ID_METADATA_KEY with the canonical ParquetUtils.FIELD_ID_METADATA_KEY. Addresses review comment from liurenjie1024 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cuDF already exposes withBinaryColumn(name, nullable, parquetFieldId) on NestedBuilder, so the buildBinaryOptions / markAsBinary helpers and the isBinary reflection are unnecessary. Switch the BinaryType writer paths to call the public API directly. The remaining reflection on hasParquetFieldId / parquetFieldId for list and map containers stays until cuDF exposes builder variants that accept parquetFieldId; comment updated to reflect that and drop the bogus issues/NEW placeholder. Addresses review comment from liurenjie1024 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
iceberg_full_gens_list already covers nested arrays, structs, and maps, so the *_nested_types variants in iceberg_append_test.py and iceberg_ctas_test.py duplicate the *_all_cols tests. Keep the binary-focused CTAS test since it exercises a distinct code path. Addresses review comments from liurenjie1024 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add scaladoc to nestedMetadataJson / appendNestedFieldIdMetadata explaining why nested Iceberg field ids are stored on the parent StructField via flat keys plus JSON-serialized sub-Metadata, and how SchemaUtils consumes them. Promote both helpers to package-private so they can be exercised directly. Add GpuTypeToSparkTypeSuite covering primitive, list, map, list-of-list, map-of-(list, list), struct-of-struct, and an end-to-end toSparkType round-trip that confirms the nested ids land on the parent field. Addresses review comment from liurenjie1024 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The branch's GpuOverrides.scala / mathExpressions.scala edits dropped the Hex registration that landed on main in PR NVIDIA#14575, so hex(...) was silently falling back to CPU. Reinstate: - expr[Hex] entry in GpuOverrides - GpuHex case class in mathExpressions - test_hex coverage in arithmetic_ops_test.py - Hex rows in operatorsScore.csv / supportedExprs.csv (3.3.0+) - Hex docs in advanced_configs.md and supported_ops.md Addresses Greptile P1 on PR NVIDIA#14611. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This reverts commit 24ad8e0.
|
build |
Signed-off-by: Chong Gao <res_life@163.com>
|
build |
Signed-off-by: Chong Gao <res_life@163.com>
|
build |
Fixes #14239.
Depends on:
Description
Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance