
Support composed splits in streaming datasets #8220

Open
lanarkite99 wants to merge 1 commit into huggingface:main from lanarkite99:resume/datasets-streaming-split-composition


@lanarkite99

Fixes #2699
Fixes #4804

This PR adds support for unsliced split composition when loading datasets in streaming mode, e.g. split="train+validation".

Previously, DatasetBuilder.as_streaming_dataset() only accepted a single split name or returned all splits as an IterableDatasetDict, so composed split strings raised ValueError: Bad split.

The change resolves a composed split instruction by building each requested streaming split and concatenating the resulting IterableDatasets. It also supports the "all" split sentinel (Split.ALL) in streaming mode.

This intentionally does not add support for sliced streaming expressions such as train[:10%], which require separate handling.
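The resolution logic described above can be illustrated with a minimal, standard-library-only sketch. It is not the code in this PR: `resolve_split_names`, `stream_split`, and `demo_sources` are hypothetical names, and plain generators stand in for IterableDatasets. It only shows the idea of splitting on "+", expanding "all", rejecting slices, and chaining the per-split streams lazily.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in: each split name maps to a factory that returns a fresh
# stream of examples. This is NOT the datasets API, only an illustration.
SplitSources = Dict[str, Callable[[], Iterable[dict]]]


def resolve_split_names(split: str, available: List[str]) -> List[str]:
    """Turn "train+validation" or "all" into an ordered list of split names."""
    if "[" in split or "]" in split:
        # Sliced expressions such as train[:10%] are deliberately out of scope.
        raise ValueError(f"Sliced streaming splits are not supported: {split!r}")
    if split == "all":
        return list(available)
    names = [part.strip() for part in split.split("+")]
    for name in names:
        if name not in available:
            raise ValueError(f"Bad split: {name}. Available splits: {available}")
    return names


def stream_split(split: str, sources: SplitSources) -> Iterator[dict]:
    """Validate the split string eagerly, then chain the streams lazily."""
    names = resolve_split_names(split, list(sources))
    return chain.from_iterable(sources[name]() for name in names)


# Toy data for illustration only.
demo_sources: SplitSources = {
    "train": lambda: ({"id": i} for i in range(3)),
    "validation": lambda: ({"id": i} for i in range(100, 102)),
}

if __name__ == "__main__":
    for example in stream_split("train+validation", demo_sources):
        print(example)
```

Chaining keeps the result lazy: no split is read until iteration reaches it, which is the property streaming mode depends on.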

Tests added for:

  • string split composition: "train+test"
  • object split composition: Split.TRAIN + Split.TEST
  • "all"
  • Split.ALL

Validation:

  • python -m pytest tests/test_builder.py -q

@lanarkite99
Author

@lhoestq could you please review this PR when you get a chance?


Development

Successfully merging this pull request may close these issues.

  • streaming dataset with concatenating splits raises an error
  • cannot combine splits merging and streaming?
