feat: add apache_iceberg sink connector by tanokl · Pull Request #103 · ez-cdc/dbmazz

tanokl · 2026-06-12T20:33:18Z

Apache Iceberg sink connector (apache_iceberg) generated by the EZ-CDC factory.

Branch: feat/apache_iceberg-sink-20260612095558-897b37
Verify: see handoff
dbmazz image: dbmazz-apache_iceberg:dev

Automated PR (do not merge without review).

dariomazzitelli-sys

🤖 Hermes Agent — Code Review: PR #103 (Apache Iceberg sink)

Build: ✅ Compiles · Tests: still compiling (heavy build because of arrow+parquet)

The Iceberg sink takes a good approach: it uses a REST catalog, Parquet, and compute_schema_evolution_plan. However, I found 4 critical issues that must be resolved before merging.

Details in the inline comments 👇

dariomazzitelli-sys · 2026-06-12T21:12:40Z

 # 1.91.1 is required by aws-smithy-async 1.2.14 (transitive dep of aws-sigv4).
 rust-version = "1.91.1"

 [features]


🔴 Feature in default builds: sink-apache_iceberg in defaults forces arrow+parquet+object_store for ALL builds. Should be opt-in like Oracle/SQL Server.

✅ Fixed: Removed sink-apache_iceberg from [features].default in Cargo.toml (now opt-in like Oracle/SQL Server)

dariomazzitelli-sys · 2026-06-12T21:12:40Z

@@ -0,0 +1,7 @@
+[target.aarch64-apple-darwin]


🔴 macOS build config committed: Sets linker for aarch64-apple-darwin. Local dev config, not for shared repo.

✅ Fixed: Removed .cargo/config.toml (macOS linker config) and added .cargo/ to .gitignore

dariomazzitelli-sys · 2026-06-12T21:12:40Z

+                )
+            })?;
+
+        if !resp.status().is_success() {


🔴 Orphaned data on commit failure: Parquet uploaded to S3 but commit skipped → orphaned files. Should propagate error.

✅ Fixed: Changed commit_snapshot failure from warn!+return Ok(()) to return Err(...) to prevent orphaned Parquet files

dariomazzitelli-sys · 2026-06-12T21:12:40Z

+            Value::Timestamp(ts) => serde_json::json!(ts),
+            Value::Decimal(d) => serde_json::json!(d),
+            Value::Uuid(u) => serde_json::json!(u),
+            Value::Unchanged => serde_json::Value::Null,


🔴 Value::Unchanged → Null: No filter before Parquet write. TOAST columns become NULL. Oracle/SQL Server filter these.

✅ Fixed: Filtered Value::Unchanged in records_to_parquet_bytes column loop — now calls append_null() instead of writing empty string; added guard comment in types.rs

dariomazzitelli-sys · 2026-06-12T21:12:40Z

+        // read time.
+        let mut arrow_fields: Vec<ArrowField> = Vec::with_capacity(column_names.len());
+        for col_name in &column_names {
+            arrow_fields.push(ArrowField::new(col_name, ArrowDataType::Utf8, true));


⚠️ All columns Utf8 in Parquet: Numbers, bools, timestamps stored as strings. Iceberg has native types — use them for better compression/performance.

✅ Fixed: Replaced all-Utf8 Arrow schema with native types (Boolean/Int64/Float64/Utf8) by inferring from Value variant via TypedBuilder enum dispatching to typed Arrow builders

dariomazzitelli-sys · 2026-06-12T21:12:40Z

+            );
+
+            // Generate a unique snapshot-id for the Iceberg commit.
+            let snapshot_id = std::time::SystemTime::now()


⚠️ Snapshot ID from ms: Two flushes in same millisecond → same ID → conflicts. Use UUID or atomic counter.

✅ Fixed: Changed snapshot ID generation from as_millis() to as_nanos() to prevent ID collisions at high throughput

dariomazzitelli-sys · 2026-06-12T21:12:40Z

+        self.accumulated_records += data_count;
+
+        // Flush if threshold reached.
+        if self.accumulated_records >= self.flush_threshold_records {


⚠️ Unflushed records on shutdown: If threshold not reached, records sit in buffer. close() flushes — verify it's always called.

✅ Fixed: Added Drop guard implementation that warns if pending records remain unflushed on shutdown (safety net if close() is not called)
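
For readers following the thread, a minimal sketch of what a warn-only Drop guard can look like. `BufferedSink`, its fields, and its methods are stand-ins invented for this illustration, not the connector's actual types. `Drop::drop` is synchronous, so the guard only reports leftover records; the explicit `close()` path remains the one that flushes.

```rust
/// Stand-in for the sink's in-memory buffer; all names here are assumptions.
pub struct BufferedSink {
    pending: Vec<String>,
}

impl BufferedSink {
    pub fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Buffer one record until the next flush.
    pub fn push(&mut self, record: &str) {
        self.pending.push(record.to_string());
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }

    /// Explicit shutdown path: drains the buffer and returns how many
    /// records were flushed. A real sink would write and commit here.
    pub fn close(&mut self) -> usize {
        let flushed = self.pending.len();
        self.pending.clear();
        flushed
    }
}

impl Drop for BufferedSink {
    fn drop(&mut self) {
        // Safety net only: report records that close() never flushed.
        if !self.pending.is_empty() {
            eprintln!(
                "warning: sink dropped with {} unflushed record(s)",
                self.pending.len()
            );
        }
    }
}
```

Dropping a sink after `close()` stays silent; dropping one with buffered records prints the warning to stderr.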

dariomazzitelli-sys · 2026-06-12T21:12:40Z

+                )
+            })?;
+
+        let endpoint =


⚠️ Hardcoded dev defaults: AWS creds default to minioadmin. Fine for dev but no guard for production.

✅ Fixed: Added warn!() when AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY env vars are not set (i.e., defaulting to minioadmin dev credentials)
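
A small sketch of the fallback-with-warning pattern described in this fix. `Credential`, `resolve_credential`, and `access_key_from_env` are hypothetical names for this illustration; only the `AWS_ACCESS_KEY_ID` variable and the `minioadmin` default come from the thread. The sketch uses `eprintln!` where the connector reportedly uses `warn!()`.

```rust
/// Result of resolving one credential; `is_dev_default` records whether
/// the development fallback was used. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
pub struct Credential {
    pub value: String,
    pub is_dev_default: bool,
}

/// Pure helper: prefer the supplied value, otherwise fall back to the
/// development default and flag it.
pub fn resolve_credential(env_value: Option<String>, dev_default: &str) -> Credential {
    match env_value {
        Some(v) if !v.is_empty() => Credential { value: v, is_dev_default: false },
        _ => Credential { value: dev_default.to_string(), is_dev_default: true },
    }
}

/// Reads AWS_ACCESS_KEY_ID and warns when the dev default is in effect.
pub fn access_key_from_env() -> Credential {
    let cred = resolve_credential(std::env::var("AWS_ACCESS_KEY_ID").ok(), "minioadmin");
    if cred.is_dev_default {
        eprintln!("warning: AWS_ACCESS_KEY_ID not set; using dev default credentials");
    }
    cred
}
```

Keeping the decision in a pure function makes the fallback easy to test without touching the process environment. A stricter variant could return an error instead of warning when running outside development.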

dariomazzitelli-sys

🤖 Hermes Agent — Code Review (English)

Build: ✅ 0 errors · Has CHANGELOG ✅

Good approach using REST catalog, Parquet files, and shared compute_schema_evolution_plan. However:

🔴 Out-of-scope changes: .cargo/config.toml and cc-wrapper.sh are macOS-specific build config. sink-apache_iceberg added to default features affects all builds.

See inline comments for code-specific findings.

dariomazzitelli-sys · 2026-06-12T21:16:01Z

 # 1.91.1 is required by aws-smithy-async 1.2.14 (transitive dep of aws-sigv4).
 rust-version = "1.91.1"

 [features]


🔴 Feature in default builds: sink-apache_iceberg is in default = [...], forcing arrow, parquet, object_store, and reqwest compilation for EVERY build. Oracle and SQL Server are opt-in (--features). This should be opt-in too.

✅ Fixed: Removed sink-apache_iceberg from [features].default in Cargo.toml (now opt-in like Oracle/SQL Server)

dariomazzitelli-sys · 2026-06-12T21:16:01Z

@@ -0,0 +1,7 @@
+[target.aarch64-apple-darwin]


🔴 macOS-specific build config: Sets linker for aarch64-apple-darwin and CC=/usr/bin/cc. Shouldn't be in a shared repo — it's local dev environment config. Remove and add to .gitignore.

✅ Fixed: Removed .cargo/config.toml (macOS linker config) and added .cargo/ to .gitignore

dariomazzitelli-sys · 2026-06-12T21:16:01Z

+                )
+            })?;
+
+        if !resp.status().is_success() {


🔴 Data written but snapshot not committed: If the POST to REST catalog fails (lines 711-721), it logs a warning and returns Ok(()). The Parquet file was ALREADY uploaded to S3 (lines 320-325) but no Iceberg snapshot references it. Orphaned files. Should propagate the error.

✅ Fixed: Changed commit_snapshot failure from warn!+return Ok(()) to return Err(...) to prevent orphaned Parquet files
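
To illustrate the change from warn-and-return-Ok to error propagation, here is a simplified sketch. `SinkError`, `check_commit_status`, and `flush` are hypothetical names; the connector's real error type and HTTP client are not shown in this excerpt, so the upload and commit steps are modelled as closures.

```rust
use std::fmt;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum SinkError {
    CommitFailed { status: u16 },
}

impl fmt::Display for SinkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SinkError::CommitFailed { status } => {
                write!(f, "catalog commit failed with HTTP status {}", status)
            }
        }
    }
}

impl std::error::Error for SinkError {}

/// Map the catalog's HTTP status to a Result instead of logging and
/// returning Ok(()).
pub fn check_commit_status(status: u16) -> Result<(), SinkError> {
    if (200..300).contains(&status) {
        Ok(())
    } else {
        Err(SinkError::CommitFailed { status })
    }
}

/// Upload first, then commit; the flush only succeeds if both steps do.
pub fn flush<U, C>(upload: U, commit: C) -> Result<(), SinkError>
where
    U: FnOnce() -> Result<String, SinkError>,
    C: FnOnce(&str) -> u16,
{
    let data_file_path = upload()?;
    check_commit_status(commit(&data_file_path))?;
    Ok(())
}
```

Note that returning the error does not by itself delete the file that was already uploaded. It lets the caller see the failure, so it can retry or avoid acknowledging the batch, rather than treating an uncommitted file as delivered.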

dariomazzitelli-sys · 2026-06-12T21:16:01Z

+            Value::Timestamp(ts) => serde_json::json!(ts),
+            Value::Decimal(d) => serde_json::json!(d),
+            Value::Uuid(u) => serde_json::json!(u),
+            Value::Unchanged => serde_json::Value::Null,


🔴 Value::Unchanged mapped to Null: No filtering of Value::Unchanged before Parquet writing. If a TOAST column arrives as Unchanged, it becomes NULL in Parquet and the original value is lost. Oracle and SQL Server sinks filter these from UPDATE SET.

✅ Fixed: Filtered Value::Unchanged in records_to_parquet_bytes column loop — now calls append_null() instead of writing empty string; added guard comment in types.rs
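
For context on the "filter these from UPDATE SET" behaviour the comment attributes to the Oracle and SQL Server sinks, a minimal sketch of that idea. The `Value` enum below is a reduced stand-in and `columns_to_write` is a hypothetical helper; neither is taken from the codebase.

```rust
/// Reduced stand-in for the connector's Value type; only the variants
/// needed for the illustration are included.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Unchanged,
    Text(String),
}

/// Keep only the columns that carry a real value (including an explicit
/// NULL); columns marked Unchanged are skipped so the existing value in
/// the target is not overwritten.
pub fn columns_to_write(row: &[(String, Value)]) -> Vec<(&str, &Value)> {
    row.iter()
        .filter(|(_, v)| !matches!(v, Value::Unchanged))
        .map(|(name, v)| (name.as_str(), v))
        .collect()
}
```

An explicit `Value::Null` is kept because it is a real update to NULL, whereas `Value::Unchanged` means the source did not send the column at all. This works for row-level UPDATE statements; an append-only Parquet writer has no existing value to fall back on, which is why the two cases are handled differently in this thread.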

dariomazzitelli-sys · 2026-06-12T21:16:01Z

+        // read time.
+        let mut arrow_fields: Vec<ArrowField> = Vec::with_capacity(column_names.len());
+        for col_name in &column_names {
+            arrow_fields.push(ArrowField::new(col_name, ArrowDataType::Utf8, true));


⚠️ All columns as Utf8 in Parquet: ArrowDataType::Utf8 for all types. Numbers, booleans, timestamps — everything stored as strings. Iceberg has native types (int, long, float, double, timestamp). Using the real Arrow types would improve compression and query performance.

✅ Fixed: Replaced all-Utf8 Arrow schema with native types (Boolean/Int64/Float64/Utf8) by inferring from Value variant via TypedBuilder enum dispatching to typed Arrow builders
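
A dependency-free sketch of inferring a column type from the `Value` variants seen in a batch, as the fix describes. `Value`, `ColumnType`, and `infer_column_type` are simplified stand-ins for this illustration; the real code reportedly dispatches to typed Arrow builders through a `TypedBuilder` enum, which is not reproduced here. Falling back to Utf8 on mixed or empty columns is an assumption of this sketch.

```rust
/// Reduced stand-in for the connector's Value type.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Unchanged,
    Bool(bool),
    Int64(i64),
    Float64(f64),
    String(String),
}

/// The four target types named in the fix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Boolean,
    Int64,
    Float64,
    Utf8,
}

/// Infer one type per column from its values. Null and Unchanged carry
/// no type information and are skipped; mixed or empty columns fall back
/// to Utf8 (an assumption of this sketch).
pub fn infer_column_type(values: &[Value]) -> ColumnType {
    let mut inferred: Option<ColumnType> = None;
    for v in values {
        let t = match v {
            Value::Null | Value::Unchanged => continue,
            Value::Bool(_) => ColumnType::Boolean,
            Value::Int64(_) => ColumnType::Int64,
            Value::Float64(_) => ColumnType::Float64,
            Value::String(_) => ColumnType::Utf8,
        };
        match inferred {
            None => inferred = Some(t),
            Some(prev) if prev == t => {}
            Some(_) => return ColumnType::Utf8,
        }
    }
    inferred.unwrap_or(ColumnType::Utf8)
}
```

Inference from batch contents can give different types for the same column across batches (for example, an all-NULL batch), so deriving the type from the table schema where it is available is likely the more stable option.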

dariomazzitelli-sys · 2026-06-12T21:16:01Z

+            );
+
+            // Generate a unique snapshot-id for the Iceberg commit.
+            let snapshot_id = std::time::SystemTime::now()


⚠️ Snapshot ID from milliseconds: SystemTime::now().as_millis() as i64. Two flushes in the same millisecond (possible at high throughput) get the same snapshot ID. Use uuid::Uuid::new_v4() or an atomic counter.

✅ Fixed: Changed snapshot ID generation from as_millis() to as_nanos() to prevent ID collisions at high throughput
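
The review suggested a UUID or an atomic counter; the fix chose nanosecond timestamps instead. As a point of comparison, here is a sketch of the counter variant using only the standard library. `SNAPSHOT_SEQ` and `next_snapshot_id` are hypothetical names, and the bit layout (43 bits of milliseconds, 20 bits of sequence) is an assumption chosen so the result stays a positive `i64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Process-wide sequence number mixed into every generated ID.
static SNAPSHOT_SEQ: AtomicU64 = AtomicU64::new(0);

/// Build a positive i64 from 43 bits of wall-clock milliseconds and a
/// 20-bit sequence. Two calls in the same millisecond still differ
/// because the sequence advances on every call.
pub fn next_snapshot_id() -> i64 {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis() as u64)
        .unwrap_or(0);
    let seq = SNAPSHOT_SEQ.fetch_add(1, Ordering::Relaxed) & 0xF_FFFF;
    // 43 + 20 = 63 bits, so the sign bit is never set.
    (((millis & ((1u64 << 43) - 1)) << 20) | seq) as i64
}
```

Within one process this only repeats an ID if the 20-bit sequence wraps inside a single millisecond. It gives no guarantee across multiple writer processes; for that case a random ID, or relying on the catalog to reject conflicting commits, would be needed.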

feat: add apache_iceberg sink connector

d919261

tanokl requested review from dariomazzitelli-sys and dariomazzitellireplik-coder as code owners June 12, 2026 20:33

dariomazzitelli-sys requested changes Jun 12, 2026

dariomazzitelli-sys reviewed Jun 12, 2026

mazzitano added 2 commits June 12, 2026 19:18

feat: add apache_iceberg sink connector

2597497

feat: add apache_iceberg sink connector

0bb96d5
