Add Apache Iceberg format support #8148

Merged
lhoestq merged 2 commits into huggingface:main from frankliee:iceberg
May 27, 2026

Contributor

@frankliee frankliee commented Apr 23, 2026

Add Apache Iceberg format support

Motivation

Apache Iceberg is one of the most widely adopted open table formats for data lakes, supported by Databricks,
Snowflake, AWS Glue, Dremio, and others. A large amount of ML training data lives in Iceberg tables.
Currently, users must manually export Iceberg data to Parquet before loading it into HuggingFace Datasets —
this PR removes that friction.

Fixes #7863.

Usage

Users pass a pre-configured pyiceberg Catalog object and a table identifier:

from pyiceberg.catalog.sql import SqlCatalog
from datasets import load_dataset

catalog = SqlCatalog("my_catalog", uri="sqlite:///catalog.db", warehouse="/tmp/warehouse")

Basic loading

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table")

Column selection + row filtering (predicate pushdown)

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table",
                  columns=["text", "label"],
                  filters=[("label", ">", 0)])

Multiple splits from different tables

ds = load_dataset("iceberg", catalog=catalog,
                  table={"train": "db.train", "test": "db.test"})

Time travel via snapshot_id

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table",
                  snapshot_id=7051729674881785648)

Streaming

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table", streaming=True)

Works with any pyiceberg-supported catalog backend (REST, Hive, Glue, SQL, etc.) — the builder is agnostic
to how the catalog is configured.

Design decisions

  • Catalog object passed in, not constructed internally. Iceberg catalog configuration varies widely across
    backends (REST, Hive, Glue, SQL each have different auth/connection params). Rather than re-implementing a
    "catalog factory" inside the builder, users bring their own catalog — similar to how the sql builder accepts
    an existing SQLAlchemy connection. This keeps the builder simple and forward-compatible with new catalog
    types.
  • No _EXTENSION_TO_MODULE registration. Unlike file-based formats (Parquet, Lance, CSV), Iceberg tables are
    addressed via catalog + table identifier, not file extensions. Users must specify "iceberg" explicitly as
    the path argument.
  • create_config_id override for fingerprinting. Catalog objects (containing SQLAlchemy engines, connection
    pools, etc.) are not picklable by dill. The override replaces the catalog with a stable string
    representation ("{ClassName}_{name}") before hashing.
  • _CountableBuilderMixin for fast row counting. Uses scan.plan_files() metadata to count rows without
    reading data files.

@frankliee frankliee force-pushed the iceberg branch 2 times, most recently from c567422 to 62ee472 on April 23, 2026 12:18
@frankliee frankliee changed the title [WIP] Add Apache Iceberg format support Add Apache Iceberg format support Apr 23, 2026
Member

@lhoestq lhoestq left a comment

Awesome ! I just have one comment:

splits.append(
    datasets.SplitGenerator(
        name=split_name,
        gen_kwargs={"scan": scan},
Member

Do you think we can have a list here instead ? This would enable parallel processing/streaming

e.g. one scan object per file maybe

Contributor Author

Thanks for the suggestion! I've refactored the implementation to support num_proc > 1 parallel processing:

Changes:

  1. List-based gen_kwargs: _split_generators now passes tasks = list(scan.plan_files()) as a list in gen_kwargs, which allows _split_gen_kwargs to distribute FileScanTask objects across
    workers automatically.
  2. Picklable scan_context: Instead of passing the unpicklable scan object, I extract a tuple of individually-serializable components (table_metadata, io, projected_schema, row_filter,
    case_sensitive, limit) that can reconstruct an ArrowScan reader in each worker.
  3. Drop catalog after use: At the end of _split_generators, both self.config.catalog and self.config_kwargs["catalog"] are set to None so the builder itself can be pickled when sent to
    child processes. The catalog is no longer needed after planning — all reading state lives in scan_context.
  4. Per-task reading: _generate_tables now iterates over its assigned tasks list and uses ArrowScan.to_record_batches([task]) for each one, yielding Key(task_idx, batch_idx) for proper
    shard-level parallelism.
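
The changes listed above can be sketched with the standard library alone. This is a simplified illustration, not the code from the PR: `shard_tasks`, `generate_keys` and `is_picklable` are hypothetical helpers, the builder state is modeled as a plain dict, a `threading.Lock` stands in for the unpicklable parts of a catalog, and the round-robin split is an assumption (the library's own distribution strategy may differ).

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def shard_tasks(tasks, num_proc):
    """Split a list of tasks round-robin into num_proc shards (assumed strategy)."""
    return [tasks[i::num_proc] for i in range(num_proc)]


def generate_keys(tasks, batches_per_task):
    """Yield (task_idx, batch_idx) keys for the tasks assigned to one worker."""
    for task_idx, _task in enumerate(tasks):
        for batch_idx in range(batches_per_task):
            yield (task_idx, batch_idx)


# Hypothetical builder state: the "catalog" holds a lock, so it cannot be pickled.
builder_state = {"catalog": {"engine_lock": threading.Lock()}, "table": "db.my_table"}
picklable_before = is_picklable(builder_state)

# Planning is done with the catalog; afterwards it is dropped so the state can be
# sent to child processes.
tasks = ["f0", "f1", "f2", "f3", "f4"]  # stand-ins for planned file scan tasks
builder_state["catalog"] = None
picklable_after = is_picklable(builder_state)

shards = shard_tasks(tasks, num_proc=2)
keys_for_second_worker = list(generate_keys(shards[1], batches_per_task=2))
```

Because the planned tasks travel as a list, each worker can be handed its own slice and emit keys that are unique within that slice.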

Contributor Author

@lhoestq
Thanks for the review! I’ve addressed this comment in the latest commit.
Could you please take another look when you have a chance?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Member

@lhoestq lhoestq left a comment

Great ! I have one last comment before we merge:

Comment thread tests/packaged_modules/test_iceberg.py Outdated
Comment on lines +4 to +6
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, FloatType, ListType, LongType, NestedField, StringType
Member

Can you move the imports inside the test functions and decorate them with @require_pyiceberg? This way people can run the test suite even if they don't have all the test dependencies.

Contributor Author

Thanks. I have added @require_pyiceberg and @require_not_windows to the Iceberg test cases, since the Iceberg tests are not supported on Windows.
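
For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of skip decorators combined with a lazy import. It uses `unittest` as a stand-in: the `require_pyiceberg` and `require_not_windows` defined below are hypothetical re-implementations for illustration, not the helpers from the repository's test utilities, and `require_module` is a made-up name.

```python
import importlib.util
import sys
import unittest


def require_module(name):
    """Skip the decorated test when the named top-level package is not installed."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"test requires {name}")


# Hypothetical stand-ins for the decorators mentioned in the review.
require_pyiceberg = require_module("pyiceberg")
require_not_windows = unittest.skipIf(sys.platform == "win32", "test should not run on Windows")


class IcebergImportTest(unittest.TestCase):
    @require_pyiceberg
    @require_not_windows
    def test_schema_import(self):
        # Imported inside the test so that collecting this module never needs
        # pyiceberg; the import only runs when the test is not skipped.
        from pyiceberg.schema import Schema

        self.assertTrue(callable(Schema))
```

With module-level imports, a missing optional dependency fails at collection time for the whole file; with this layout the affected tests are reported as skipped instead.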

Member

@lhoestq lhoestq left a comment

lgtm !

@lhoestq lhoestq merged commit 6fab6b1 into huggingface:main May 27, 2026
13 of 14 checks passed


3 participants