feat: add S3/Iceberg sink (append-only changelog) #100
Open
dariomazzitellireplik-coder wants to merge 2 commits into
New IcebergSink behind the opt-in cargo feature sink-iceberg (not in default builds). Streams CDC into Apache Iceberg tables on S3 via a REST catalog (Lakekeeper, Polaris, Nessie, Tabular, Glue REST).

Model: append-only changelog. Every INSERT/UPDATE/DELETE becomes one row carrying the source columns plus _cdc_op / _cdc_position / _cdc_ts metadata columns. Consumers materialize current state downstream.

Durability: each write_batch() writes Parquet through iceberg-rust's DataFileWriter (spec field-ids + stats) and commits a FastAppend snapshot per touched table before returning, so the pipeline's LSN confirmation never outruns durable data. No staging window; commit retries with table refresh on transient catalog failures.

Type mapping is exhaustive and strict: UInt64 -> decimal(20,0), NUMERIC -> decimal(p,s), UUID -> uuid, etc. Text-encoded values from the snapshot path are parsed strictly; unparseable or mismatched values fail the batch instead of being coerced to defaults. TOAST Unchanged becomes NULL with a WARN counter.

Schema evolution is not auto-applied (iceberg-rust 0.9 transactions cannot update schemas): SchemaChange events increment the existing schema_evolution_skipped metric; capabilities report it honestly.

Also bumps arrow/parquet 53 -> 57 (required by iceberg 0.9); the Snowflake sink passes its suite against v57.

Verified end-to-end against MinIO + tabulario/iceberg-rest + PG16: snapshot + live I/U/D committed and read back exactly via pyiceberg, including crash/restart recovery with zero loss.
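For readers consuming the changelog, the downstream materialization step can be sketched as follows. This is a minimal illustration, not code from this PR: the `ChangeRow` struct, its `pk`/`payload` fields, the `char` encoding of `_cdc_op`, and the `u64` type for `_cdc_position` are all assumptions made for the example. It keeps the row with the highest position per primary key and then drops keys whose latest operation is a delete.

```rust
use std::collections::BTreeMap;

/// One changelog row as a consumer might see it after reading the table.
/// Hypothetical shape: `pk` and `payload` stand in for the source columns.
#[derive(Debug, Clone, PartialEq)]
pub struct ChangeRow {
    pub pk: i64,
    pub payload: String,
    /// Assumed encoding of `_cdc_op`: 'I', 'U' or 'D'.
    pub cdc_op: char,
    /// Assumed to be a totally ordered position (type chosen for the sketch).
    pub cdc_position: u64,
}

/// Materialize current state: latest row per primary key, deletes removed.
pub fn materialize(rows: &[ChangeRow]) -> BTreeMap<i64, String> {
    let mut latest: BTreeMap<i64, &ChangeRow> = BTreeMap::new();
    for row in rows {
        // Replace the stored row only when this one has a strictly higher
        // position; on a tie the first row seen is kept.
        let is_newer = latest
            .get(&row.pk)
            .map_or(true, |seen| row.cdc_position > seen.cdc_position);
        if is_newer {
            latest.insert(row.pk, row);
        }
    }
    latest
        .into_iter()
        // A key whose most recent change is a delete no longer exists.
        .filter(|(_, row)| row.cdc_op != 'D')
        .map(|(pk, row)| (pk, row.payload.clone()))
        .collect()
}
```

In practice this would more likely be a query in the consumer's engine; the Rust version only pins down the intended semantics.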
iceberg-catalog-rest 0.9.1 requires rustc 1.92; default-feature builds still compile on 1.91.1.
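The "commit retries with table refresh on transient catalog failures" behavior described above might look roughly like the loop below. This is a sketch under stated assumptions, not the PR's implementation: `CommitError`, `commit_with_retry`, and the `refresh`/`commit` closures are hypothetical stand-ins for the real catalog calls, and backoff between attempts is omitted for brevity.

```rust
/// Hypothetical error classification for the sketch.
#[derive(Debug, PartialEq)]
pub enum CommitError {
    /// Worth retrying, e.g. a concurrent commit or a brief catalog outage.
    Transient(String),
    /// Not worth retrying; surfaced to the caller immediately.
    Fatal(String),
}

/// Reload the table before every attempt, then try to commit against it.
/// Returns as soon as a commit succeeds or a non-transient error occurs.
pub fn commit_with_retry<T>(
    max_attempts: u32,
    mut refresh: impl FnMut() -> Result<T, CommitError>,
    mut commit: impl FnMut(&T) -> Result<(), CommitError>,
) -> Result<(), CommitError> {
    let mut last = CommitError::Fatal("max_attempts must be at least 1".into());
    for _ in 0..max_attempts {
        // Refresh errors are propagated as-is in this sketch.
        let table = refresh()?;
        match commit(&table) {
            Ok(()) => return Ok(()),
            // Remember the transient error and retry with fresh metadata.
            Err(CommitError::Transient(msg)) => last = CommitError::Transient(msg),
            Err(fatal) => return Err(fatal),
        }
    }
    // All attempts failed transiently; the batch must not be confirmed.
    Err(last)
}
```

Because the function only returns `Ok(())` after a successful commit, a caller that confirms the LSN after it returns preserves the ordering the description relies on.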
Summary
New `IcebergSink` behind the opt-in cargo feature `sink-iceberg` (not in default builds). Streams CDC into Apache Iceberg tables on S3-compatible stores via an Iceberg REST catalog (Lakekeeper, Polaris, Nessie, Tabular, Glue REST endpoint).

Replaces the abandoned `feat/s3-iceberg-sink` branch, rebuilt from scratch after review found its write path unsalvageable (no-op catalog, deferred commits past LSN confirmation, deletes written as live rows, Parquet without Iceberg field-ids).

Design
- Append-only changelog: every INSERT/UPDATE/DELETE becomes one row carrying the source columns plus `_cdc_op`/`_cdc_position`/`_cdc_ts` metadata columns. Consumers materialize current state (dedup by PK + max `_cdc_position`, drop `D`).
- Durability: each `write_batch()` writes Parquet via iceberg-rust's `DataFileWriter` (spec field-ids + column stats) and commits a FastAppend snapshot per touched table before returning, so the pipeline's LSN confirmation never outruns durable data. No staging window; uncommitted files from a crash are simply unreferenced.
- Type mapping is exhaustive and strict: `UInt64`→`decimal(20,0)` (full range, no wrap), `NUMERIC`→`decimal(p,s)`, `UUID`→`uuid`, temporal types to their Iceberg counterparts. Text-encoded values (snapshot path) are parsed strictly; unparseable/mismatched values fail the batch and are never coerced to defaults.
- Capabilities: `supports_upsert: false`, `supports_schema_evolution: false` (iceberg-rust 0.9 transactions cannot update schemas; SchemaChange events feed the existing `schema_evolution_skipped` metric), `optimal_flush_interval_ms: 60s` to bound snapshot count.
- Tables: `schema__table` naming inside the `SINK_DATABASE` namespace; existing tables validated for column compatibility.

Also in this PR
- `docs/configuration.md`: Iceberg env var reference + operator requirements (snapshot expiry/compaction is operator-owned).
- `src/connectors/sinks/iceberg/README.md`.
- … `[Unreleased]`.

Verification
- `cargo test --features sink-iceberg`: 183 passed (includes Snowflake regression on arrow 57).
- `cargo check` on default, `--features sink-iceberg`, and `--no-default-features --features 'sink-postgres sink-iceberg'`; the default build never references iceberg crates.
- `cargo fmt --check` + `clippy --all-targets -- -D warnings` clean.
- … `_cdc_position`).
- … `I` rows from the engine's re-snapshot; materialized state matched the source exactly.

v1 non-goals (documented)
Merge-on-read/equality deletes, schema evolution auto-apply, partitioned tables, non-REST catalogs, table maintenance (expiry/compaction).
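To make the strict type-mapping rule from the Design section concrete, here is a small sketch of parsing a text-encoded `UInt64` into the unscaled value of a `decimal(20,0)`. The function name `parse_u64_as_decimal20` and the `ParseError` type are hypothetical and not taken from this PR. The sketch assumes the decimal is carried as an `i128` unscaled integer (as Arrow's Decimal128 does), which covers the full `u64` range where an `i64` would wrap, and it returns an error rather than a default when the text does not parse.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    /// The text is not a valid unsigned 64-bit integer.
    NotUnsignedInteger(String),
}

/// Parse a text-encoded unsigned 64-bit value into the unscaled `i128`
/// of a `decimal(20,0)`. u64::MAX has 20 digits, hence precision 20.
///
/// Strict by construction: negative numbers, empty strings, trailing
/// garbage and out-of-range values all return an error. Nothing is
/// coerced to 0 or NULL.
pub fn parse_u64_as_decimal20(text: &str) -> Result<i128, ParseError> {
    text.parse::<u64>()
        // Widening u64 -> i128 is lossless, so no wrap can occur.
        .map(i128::from)
        .map_err(|_| ParseError::NotUnsignedInteger(text.to_owned()))
}
```

A caller following the behavior described above would propagate the `Err` and fail the whole batch, so a bad value surfaces as a pipeline error instead of silently landing in the table.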