fix(webdataset): error when loading data in WebDataset format using load_dataset during multi-machine training. by Wolfram-St · Pull Request #8203 · huggingface/datasets

Wolfram-St · 2026-05-16T06:41:37Z

Description

The Bug:
During distributed training with streaming=True, if a user provides fewer .tar files than the number of distributed ranks, some ranks receive an empty list of files. This caused an IndexError: list index out of range in webdataset.py because the code assumed tar_paths[0] always existed for feature inference.
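The failure mode can be reproduced in isolation with a minimal sketch. It assumes a simple round-robin assignment of files to ranks; `shard_for_rank` and `first_file_unsafe` are hypothetical helpers written for illustration and are not the library's actual sharding or inference code.

```python
from typing import List


def shard_for_rank(files: List[str], rank: int, world_size: int) -> List[str]:
    # Hypothetical round-robin assignment; the real sharding logic may differ.
    return files[rank::world_size]


def first_file_unsafe(tar_paths: List[str]) -> str:
    # Mirrors the assumption described above: the first path always exists.
    return tar_paths[0]


tar_files = ["shard-000.tar", "shard-001.tar"]  # 2 files
world_size = 4                                  # 4 ranks

shards = [shard_for_rank(tar_files, rank, world_size) for rank in range(world_size)]

# Collect the ranks on which the unguarded access fails.
failing_ranks = []
for rank, paths in enumerate(shards):
    try:
        first_file_unsafe(paths)
    except IndexError:
        failing_ranks.append(rank)

print(shards)
print(failing_ranks)
```

With two files and four ranks, the last two ranks end up with an empty list, and the unguarded `tar_paths[0]` access raises `IndexError` on exactly those ranks.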

The Fix:

Updated `_split_generators` to scan across all splits and record the first available tar file as `first_tar_path`. Feature inference is now skipped for ranks that receive zero files instead of indexing into an empty list.
Updated `_generate_examples` to handle the case where `self.info.features` is None (because inference was skipped) by defaulting to an empty dictionary, which prevents a downstream AttributeError.
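The two guards described above can be sketched roughly as follows. This is an illustration of the pattern under stated assumptions, not the code from the PR: `find_first_tar_path`, `infer_features`, and `iter_feature_items` are hypothetical names, and the inferred feature mapping is a placeholder.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def find_first_tar_path(tar_paths_per_split: Dict[str, List[str]]) -> Optional[str]:
    # Scan every split and return the first tar path found, or None
    # when this rank received no files at all.
    for tar_paths in tar_paths_per_split.values():
        if tar_paths:
            return tar_paths[0]
    return None


def infer_features(first_tar_path: Optional[str]) -> Optional[Dict[str, str]]:
    # Skip inference entirely on ranks with nothing to read.
    if first_tar_path is None:
        return None
    # Placeholder result; real inference would read the first examples of the archive.
    return {"__key__": "string", "__url__": "string"}


def iter_feature_items(features: Optional[Dict[str, str]]) -> Iterable[Tuple[str, str]]:
    # Fall back to an empty dict so that calling .items() never hits None.
    return (features or {}).items()


# A rank whose only non-empty split is "validation".
print(find_first_tar_path({"train": [], "validation": ["val-000.tar"]}))
# A rank that received zero files: inference is skipped and iteration is a no-op.
empty_rank_features = infer_features(find_first_tar_path({"train": []}))
print(list(iter_feature_items(empty_rank_features)))
```

The design choice here is to let an empty rank proceed with no features rather than fail, so that it simply yields no examples.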

Testing

Verified logic locally.
Ran pytest tests/packaged_modules/test_webdataset.py (Note: test_audio_webdataset skips/fails locally on Windows due to missing FFmpeg libtorchcodec dependencies, but core logic tests pass).
Formatted code using ruff.

During multi-machine distributed training with streaming=True, if the number of tar files is fewer than the number of ranks, some ranks receive an empty list of files. Previously, `_split_generators` hardcoded `tar_paths[0]` for feature inference, causing an IndexError on empty ranks. This commit:

1. Iterates through splits to find the first valid file for feature inference.
2. Safely defaults to an empty dictionary in `_generate_examples` if `features` remains None, preventing subsequent AttributeErrors.

HuggingFaceDocBuilderDev · 2026-05-18T14:06:27Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wolfram-St mentioned this pull request May 16, 2026

Encountered an error when loading data in WebDataset format using load_dataset during multi-machine training. #8201

Open

Wolfram-St wants to merge 1 commit into huggingface:main from Wolfram-St:main
