Support hosting lance / vortex / iceberg / zarr datasets on huggingface hub #7863

@pavanramkumar

Feature request

Huggingface datasets has great support for large tabular datasets in parquet with large partitions. I would love to see two things in the future:

  • equivalent support for lance, vortex, iceberg, zarr (in that order) in a way that I can stream them using the datasets library
  • more fine-grained control of streaming, so that I can stream at the partition / shard level
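
As an illustration of the second point, here is a minimal sketch of what shard-level control could look like: an explicit plan that maps each shard to one (node, worker) pair so that every shard is read exactly once. The `assign_shards` helper and the shard names are hypothetical and are not part of the `datasets` API.

```python
from typing import Dict, List, Tuple


def assign_shards(
    shards: List[str], num_nodes: int, workers_per_node: int
) -> Dict[Tuple[int, int], List[str]]:
    """Round-robin shards over (node, worker) slots so each shard has one reader."""
    if num_nodes < 1 or workers_per_node < 1:
        raise ValueError("num_nodes and workers_per_node must be >= 1")
    total_slots = num_nodes * workers_per_node
    plan: Dict[Tuple[int, int], List[str]] = {
        (node, worker): []
        for node in range(num_nodes)
        for worker in range(workers_per_node)
    }
    for i, shard in enumerate(shards):
        slot = i % total_slots
        # Slot index -> (node, worker): consecutive slots fill one node first.
        plan[(slot // workers_per_node, slot % workers_per_node)].append(shard)
    return plan


# Hypothetical shard names; a real dataset would list its own files or fragments.
shards = [f"shard-{i:05d}" for i in range(10)]
plan = assign_shards(shards, num_nodes=2, workers_per_node=2)
for slot, assigned in sorted(plan.items()):
    print(slot, assigned)
```

With a plan like this, each worker would open only the shards assigned to it, rather than iterating over one global stream.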

Motivation

I work with very large lance datasets on S3 and often require random access for AI/ML applications like multi-node training. I was able to achieve high-throughput data loading on a lance dataset with ~150B rows by building distributed dataloaders that can be scaled both vertically (until I/O and CPU are saturated) and then horizontally (to work around network bottlenecks).

Using this strategy I was able to achieve 10-20x the throughput of the streaming data loader from the huggingface/datasets library.

I realized that these would be great features for huggingface to support natively.
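
The two-level scaling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `node_row_range` and `load_batches` are hypothetical helpers, and `fake_fetch` stands in for a real random-access read of a batch of row ids from the dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def node_row_range(total_rows: int, rank: int, world_size: int) -> range:
    """Horizontal scaling: give each node a contiguous, near-equal slice of row ids."""
    base, extra = divmod(total_rows, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def load_batches(
    rows: Sequence[int],
    batch_size: int,
    fetch_rows: Callable[[Sequence[int]], List[int]],
    num_workers: int,
) -> Iterator[List[int]]:
    """Vertical scaling: fetch batches concurrently on one node with a thread pool."""
    # Note: this materializes the batch list up front, which is fine for a sketch
    # but would need to be lazy for very large row ranges.
    batches = [rows[i : i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves batch order even though fetches overlap in time.
        yield from pool.map(fetch_rows, batches)


def fake_fetch(indices: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for a random-access read of the given row ids."""
    return [i * 2 for i in indices]


rows = node_row_range(total_rows=10, rank=1, world_size=3)
batches = list(load_batches(rows, batch_size=2, fetch_rows=fake_fetch, num_workers=4))
print(list(rows), batches)
```

Here `num_workers` is the vertical knob (raise it until I/O or CPU saturates) and `world_size` is the horizontal one (add nodes once a single node's network link is the bottleneck).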

Your contribution

I'm not ready to make a PR yet, but I'm open to it with the right pointers!

Labels: enhancement