Add Apache Iceberg format support by frankliee · Pull Request #8148 · huggingface/datasets

frankliee · 2026-04-23T11:29:11Z

Add Apache Iceberg format support

Motivation

Apache Iceberg is one of the most widely adopted open table formats for data lakes, supported by Databricks,
Snowflake, AWS Glue, Dremio, and others. A large amount of ML training data lives in Iceberg tables.
Currently, users must manually export Iceberg data to Parquet before loading it into Hugging Face Datasets.
This PR removes that friction.

Addresses #7863.

Usage

Users pass a pre-configured pyiceberg Catalog object and a table identifier:

from pyiceberg.catalog.sql import SqlCatalog
from datasets import load_dataset

catalog = SqlCatalog("my_catalog", uri="sqlite:///catalog.db", warehouse="/tmp/warehouse")

Basic loading

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table")

Column selection + row filtering (predicate pushdown)

ds = load_dataset(
    "iceberg",
    catalog=catalog,
    table="db.my_table",
    columns=["text", "label"],
    filters=[("label", ">", 0)],
)

Multiple splits from different tables

ds = load_dataset(
    "iceberg",
    catalog=catalog,
    table={"train": "db.train", "test": "db.test"},
)

Time travel via snapshot_id

ds = load_dataset(
    "iceberg",
    catalog=catalog,
    table="db.my_table",
    snapshot_id=7051729674881785648,
)

Streaming

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table", streaming=True)

Works with any pyiceberg-supported catalog backend (REST, Hive, Glue, SQL, etc.) — the builder is agnostic
to how the catalog is configured.
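
As a rough illustration of what the filters argument implies for predicate pushdown, the sketch below turns (column, op, value) tuples like the ones in the usage above into a row-filter expression string of the kind that pyiceberg's table.scan(row_filter=...) accepts. The helper name filters_to_row_filter and the set of supported operators are assumptions made for this example; the PR's actual translation logic is not shown in this thread.

```python
from typing import Any, Sequence, Tuple

# Hypothetical operator table: tuple operator -> expression syntax.
_OPS = {"==": "=", "=": "=", "!=": "!=", ">": ">", ">=": ">=", "<": "<", "<=": "<="}


def _literal(value: Any) -> str:
    """Render a Python value as an expression literal.

    Strings are single-quoted; quotes inside the value are not escaped here.
    """
    if isinstance(value, str):
        return f"'{value}'"
    return repr(value)


def filters_to_row_filter(filters: Sequence[Tuple[str, str, Any]]) -> str:
    """AND together (column, op, value) tuples into one expression string."""
    clauses = []
    for column, op, value in filters:
        if op not in _OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        clauses.append(f"{column} {_OPS[op]} {_literal(value)}")
    return " AND ".join(clauses)


print(filters_to_row_filter([("label", ">", 0), ("lang", "==", "en")]))
```

Pushing the expression into the scan, rather than filtering after loading, lets the table format skip data files whose statistics rule them out.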

Design decisions

Catalog object passed in, not constructed internally. Iceberg catalog configuration varies widely across
backends (REST, Hive, Glue, SQL each have different auth/connection params). Rather than re-implementing a
"catalog factory" inside the builder, users bring their own catalog, similar to how the sql builder accepts
an existing SQLAlchemy connection. This keeps the builder simple and forward-compatible with new catalog
types.

No _EXTENSION_TO_MODULE registration. Unlike file-based formats (Parquet, Lance, CSV), Iceberg tables are
addressed via catalog + table identifier, not file extensions. Users must specify "iceberg" explicitly as
the path argument.

create_config_id override for fingerprinting. Catalog objects (containing SQLAlchemy engines, connection
pools, etc.) are not picklable by dill. The override replaces the catalog with a stable string
representation ("{ClassName}_{name}") before hashing.

_CountableBuilderMixin for fast row counting. Uses scan.plan_files() metadata to count rows without
reading data files.
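
The last two decisions can be sketched with stand-in objects. This is a minimal illustration only: FakeCatalog, FakeDataFile, FakeTask, stable_config_id and count_rows are hypothetical names, the hashing scheme is an assumption rather than what create_config_id actually does, and the row count assumes each planned task exposes its data file's record_count, with no delete files or residual filters to account for.

```python
import hashlib
import pickle
import threading
from dataclasses import dataclass
from typing import Any, Dict, List


class FakeCatalog:
    """Stand-in for a catalog object; the lock makes it unpicklable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._lock = threading.Lock()


def stable_config_id(config_kwargs: Dict[str, Any]) -> str:
    """Hash the kwargs after swapping the catalog for "{ClassName}_{name}"."""
    hashable = dict(config_kwargs)
    catalog = hashable.get("catalog")
    if catalog is not None:
        hashable["catalog"] = f"{type(catalog).__name__}_{catalog.name}"
    payload = repr(sorted(hashable.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


@dataclass
class FakeDataFile:
    record_count: int  # row count stored in table metadata


@dataclass
class FakeTask:
    file: FakeDataFile


def count_rows(tasks: List[FakeTask]) -> int:
    """Sum metadata row counts without opening any data file."""
    return sum(task.file.record_count for task in tasks)


kwargs = {"catalog": FakeCatalog("my_catalog"), "table": "db.my_table"}
try:
    pickle.dumps(kwargs)
except (TypeError, pickle.PicklingError) as err:
    print("raw kwargs are not picklable:", err)
print("config id:", stable_config_id(kwargs))
print("rows:", count_rows([FakeTask(FakeDataFile(10)), FakeTask(FakeDataFile(20))]))
```

Two catalogs of the same class and name produce the same id in this sketch, which is the property that makes the fingerprint stable across processes.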

lhoestq

Awesome ! I just have one comment:

lhoestq · 2026-04-24T14:21:51Z

+            splits.append(
+                datasets.SplitGenerator(
+                    name=split_name,
+                    gen_kwargs={"scan": scan},


Do you think we could pass a list here instead? That would enable parallel processing/streaming, e.g. with one scan object per file.
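
To illustrate why a list matters here, the sketch below shows a simplified stand-in for the kind of sharding a builder can apply to gen_kwargs: list values are cut into contiguous chunks, one chunk per job, while non-list values are copied to every job. The function name shard_gen_kwargs is hypothetical and this is not the library's own helper; it only shows that a single scan object offers nothing to split, whereas a list of per-file items does.

```python
from typing import Any, Dict, List


def shard_gen_kwargs(gen_kwargs: Dict[str, Any], max_num_jobs: int) -> List[Dict[str, Any]]:
    """Split every list value into contiguous chunks, one dict per job."""
    lengths = {len(v) for v in gen_kwargs.values() if isinstance(v, list)}
    if not lengths:
        return [dict(gen_kwargs)]  # nothing to shard: a single job
    if len(lengths) > 1:
        raise ValueError("all list values must have the same length")
    num_shards = lengths.pop()
    num_jobs = max(1, min(max_num_jobs, num_shards))
    base, extra = divmod(num_shards, num_jobs)
    jobs, start = [], 0
    for job_idx in range(num_jobs):
        # The first `extra` jobs take one additional shard each.
        stop = start + base + (1 if job_idx < extra else 0)
        jobs.append(
            {k: v[start:stop] if isinstance(v, list) else v for k, v in gen_kwargs.items()}
        )
        start = stop
    return jobs


# A single opaque object cannot be distributed.
print(len(shard_gen_kwargs({"scan": object()}, max_num_jobs=4)))
# A list of per-file tasks can.
for job in shard_gen_kwargs({"tasks": ["f0", "f1", "f2", "f3", "f4"]}, max_num_jobs=2):
    print(job)
```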

Thanks for the suggestion! I've refactored the implementation to support num_proc > 1 parallel processing:

Changes:

List-based gen_kwargs: _split_generators now passes tasks = list(scan.plan_files()) as a list in gen_kwargs, which allows _split_gen_kwargs to distribute FileScanTask objects across workers automatically.

Picklable scan_context: Instead of passing the unpicklable scan object, I extract a tuple of individually serializable components (table_metadata, io, projected_schema, row_filter, case_sensitive, limit) that can reconstruct an ArrowScan reader in each worker.

Drop catalog after use: At the end of _split_generators, both self.config.catalog and self.config_kwargs["catalog"] are set to None so the builder itself can be pickled when sent to child processes. The catalog is no longer needed after planning; all reading state lives in scan_context.

Per-task reading: _generate_tables now iterates over its assigned tasks list and uses ArrowScan.to_record_batches([task]) for each one, yielding Key(task_idx, batch_idx) for proper shard-level parallelism.
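
The per-task reading loop described in the changes above can be sketched without pyiceberg by passing the reader in as a callable. Here generate_tables and fake_read_task are hypothetical names; in the PR the per-task reader is described as ArrowScan.to_record_batches([task]), which this sketch replaces with a stand-in that yields strings instead of record batches.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Key = Tuple[int, int]  # (task_idx, batch_idx)


def generate_tables(
    tasks: List[Any], read_task: Callable[[Any], Iterable[Any]]
) -> Iterator[Tuple[Key, Any]]:
    """Yield ((task_idx, batch_idx), batch) for each batch of each assigned task."""
    for task_idx, task in enumerate(tasks):
        for batch_idx, batch in enumerate(read_task(task)):
            yield (task_idx, batch_idx), batch


def fake_read_task(task: str) -> Iterator[str]:
    """Stand-in reader: pretend every file holds two batches."""
    for i in range(2):
        yield f"{task}-batch{i}"


for key, batch in generate_tables(["f0", "f1"], fake_read_task):
    print(key, batch)
```

Because each worker only sees its own slice of the task list, the keys are unique within a worker's shard.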

@lhoestq
Thanks for the review! I’ve addressed this comment in the latest commit.
Could you please take another look when you have a chance?

HuggingFaceDocBuilderDev · 2026-04-24T14:25:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

lhoestq

Great ! I have one last comment before we merge:

lhoestq · 2026-05-26T16:09:17Z

+from pyiceberg.catalog.sql import SqlCatalog
+from pyiceberg.schema import Schema
+from pyiceberg.types import DoubleType, FloatType, ListType, LongType, NestedField, StringType


Can you move the imports inside the test functions and decorate those functions with @require_pyiceberg? This way people can run the test suite even if they don't have all the test dependencies installed.

Thanks! I have added @require_pyiceberg and @require_not_windows to the Iceberg test cases, since the Iceberg tests are not supported on Windows.
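
For readers unfamiliar with the pattern, here is a minimal sketch of how such skip decorators can be built with the standard library. The require_module helper and the test body are assumptions for illustration; the decorators added in the PR are not shown in this thread and may be implemented differently.

```python
import importlib.util
import sys
import unittest


def require_module(name: str):
    """Return a decorator that skips the test unless `name` is importable."""
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"test requires {name}")


# Hypothetical re-implementations of the two decorators named above.
require_pyiceberg = require_module("pyiceberg")
require_not_windows = unittest.skipIf(
    sys.platform == "win32", "test is not supported on Windows"
)


@require_not_windows
@require_pyiceberg
def test_sql_catalog_can_be_created():
    # Imported inside the test so that collecting this module never
    # requires pyiceberg to be installed.
    from pyiceberg.catalog.sql import SqlCatalog

    catalog = SqlCatalog("test", uri="sqlite:///:memory:", warehouse="/tmp/warehouse")
    assert catalog.name == "test"
```

With the import moved into the function body, a missing dependency results in a skipped test rather than an ImportError when the test module is collected.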

lhoestq

lgtm !

frankliee force-pushed the iceberg branch 2 times, most recently from c567422 to 62ee472, April 23, 2026 12:18

frankliee changed the title from "[WIP] Add Apache Iceberg format support" to "Add Apache Iceberg format support" Apr 23, 2026

frankliee mentioned this pull request Apr 23, 2026

Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub #7863

Open

lhoestq reviewed Apr 24, 2026


Add Apache Iceberg format support

0651412

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

frankliee force-pushed the iceberg branch from 182d803 to 0651412, May 2, 2026 15:09

lhoestq reviewed May 26, 2026


fix comment

e9811ff

lhoestq approved these changes May 27, 2026


lhoestq merged commit 6fab6b1 into huggingface:main May 27, 2026
13 of 14 checks passed

lhoestq merged 2 commits into huggingface:main from frankliee:iceberg

frankliee commented Apr 23, 2026 (edited)

lhoestq left a comment

lhoestq Apr 24, 2026

frankliee May 2, 2026

frankliee May 12, 2026

HuggingFaceDocBuilderDev commented Apr 24, 2026

lhoestq left a comment

lhoestq May 26, 2026

frankliee May 27, 2026

lhoestq left a comment

3 participants
