
Spark + Iceberg batch processing job #9

@kiyeonjeon21

Description


Priority: P1

The batch-processing.env recipe starts Spark, Iceberg, Trino, and Jupyter but has no actual Spark job.

What to do

  • Create stacks/processing/spark/jobs/etl_to_iceberg.py — a PySpark job that reads from PostgreSQL and writes Iceberg tables
  • Configure Spark to use Iceberg REST catalog
  • Create Jupyter notebook 04-spark-iceberg.ipynb demonstrating the workflow
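A minimal sketch of what etl_to_iceberg.py could look like, covering the first two bullets. The catalog name (`lake`), service hostnames, credentials, and table names are assumptions for illustration, not values from the repo; the Iceberg REST import is deferred into `main()` so the config helper can be inspected without a Spark installation.

```python
def iceberg_spark_conf(catalog="lake", rest_uri="http://iceberg-rest:8181",
                       warehouse="s3://warehouse/",
                       s3_endpoint="http://minio:9000"):
    """Spark conf entries for an Iceberg REST catalog backed by MinIO (values assumed)."""
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "rest",
        f"spark.sql.catalog.{catalog}.uri": rest_uri,
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
        f"spark.sql.catalog.{catalog}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"spark.sql.catalog.{catalog}.s3.endpoint": s3_endpoint,
    }

def main():
    # Deferred import so the module loads without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("etl-to-iceberg")
    for key, value in iceberg_spark_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Read a source table from PostgreSQL over JDBC (connection details assumed).
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://postgres:5432/appdb")
              .option("dbtable", "public.orders")
              .option("user", "app")
              .option("password", "app")
              .load())

    # Write it out as an Iceberg table; createOrReplace keeps reruns idempotent.
    orders.writeTo("lake.raw.orders").createOrReplace()

if __name__ == "__main__":
    main()
```

Using `writeTo(...).createOrReplace()` (the DataFrameWriterV2 API) rather than `saveAsTable` makes rerunning the job safe, which matters for a demo recipe people will execute repeatedly.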

Acceptance criteria

  • spark-submit runs the ETL job without errors
  • At least 2 Iceberg tables created in MinIO
  • Trino can query the Spark-created tables
  • Jupyter notebook demonstrates full workflow
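The first and third criteria could be checked roughly as follows. The runtime package version, Scala build, and Trino catalog/schema names are assumptions and need to match whatever the recipe actually pins.

```shell
# Run the ETL job with the Iceberg Spark runtime on the classpath
# (version is an assumption; match it to the recipe's Spark version).
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
  stacks/processing/spark/jobs/etl_to_iceberg.py

# Verify Trino can see and query a Spark-created table
# (catalog/schema/table names assumed).
trino --execute "SELECT count(*) FROM lake.raw.orders"
```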

Metadata


    Labels

    enhancement (New feature or request), integration (Cross-tool end-to-end scenarios)
