fix(webdataset): when loading data in WebDataset format using load_dataset during multi-machine training. #8203

Open
Wolfram-St wants to merge 1 commit into huggingface:main from Wolfram-St:main

@Wolfram-St

Description

Fixes #8201

The Bug:
During distributed training with streaming=True, if a user provides fewer .tar files than the number of distributed ranks, some ranks receive an empty list of files. This caused an IndexError: list index out of range in webdataset.py because the code assumed tar_paths[0] always existed for feature inference.
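The failure mode can be reproduced without the library. The sketch below is only an illustration: `shard_files` is a hypothetical helper that assigns files round-robin, which is an assumption and not necessarily how `datasets` distributes shards, and `first_file_unsafe` stands in for the unguarded `tar_paths[0]` access.

```python
from typing import List


def shard_files(files: List[str], rank: int, world_size: int) -> List[str]:
    """Hypothetical round-robin assignment of files to a single rank."""
    return files[rank::world_size]


def first_file_unsafe(tar_paths: List[str]) -> str:
    """Mimics the unguarded access: assumes at least one file is present."""
    return tar_paths[0]


tar_files = ["shard-000.tar", "shard-001.tar"]  # only 2 files
world_size = 4  # but 4 ranks

# Ranks 2 and 3 end up with an empty list.
per_rank = [shard_files(tar_files, rank, world_size) for rank in range(world_size)]

for rank, paths in enumerate(per_rank):
    try:
        print(rank, first_file_unsafe(paths))
    except IndexError as err:
        print(rank, "IndexError:", err)
```

With two files and four ranks, the last two ranks hold no files, so any code path that indexes the first element fails on those ranks only.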

The Fix:

  1. Updated _split_generators to scan across splits and capture the first_tar_path. Feature inference is now safely skipped for ranks that receive zero files.
  2. Updated _generate_examples to gracefully handle cases where self.info.features is None (due to the skipped inference), defaulting to an empty dictionary to prevent downstream AttributeError crashes.
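The two steps above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the PR's actual diff: `find_first_tar_path` and `resolve_features` are hypothetical names, and the real change lives inside `_split_generators` and `_generate_examples`.

```python
from typing import Dict, List, Optional


def find_first_tar_path(tar_paths_by_split: Dict[str, List[str]]) -> Optional[str]:
    """Return the first tar path found in any split, or None if every split is empty.

    Hypothetical helper: a None result signals that feature inference
    should be skipped on this rank.
    """
    for paths in tar_paths_by_split.values():
        if paths:
            return paths[0]
    return None


def resolve_features(features: Optional[dict]) -> dict:
    """Fall back to an empty dict when no features were inferred (hypothetical helper)."""
    return features if features is not None else {}


# A rank that received files: inference can use the first available path.
print(find_first_tar_path({"train": [], "validation": ["val-000.tar"]}))

# A rank that received nothing: inference is skipped and features stay empty.
empty_rank = {"train": [], "validation": []}
if find_first_tar_path(empty_rank) is None:
    print(resolve_features(None))
```

Returning `None` instead of raising keeps empty ranks alive, so they simply yield no examples rather than crashing the whole job.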

Testing

  • Verified logic locally.
  • Ran pytest tests/packaged_modules/test_webdataset.py (Note: test_audio_webdataset skips/fails locally on Windows due to missing FFmpeg libtorchcodec dependencies, but core logic tests pass).
  • Formatted code using ruff.

During multi-machine distributed training with streaming=True,
if the number of tar files is fewer than the number of ranks,
some ranks receive an empty list of files.

Previously, `_split_generators` hardcoded `tar_paths[0]` for
feature inference, causing an IndexError on empty ranks.

This commit:
1. Iterates through splits to find the first valid file for feature inference.
2. Safely defaults to an empty dictionary in `_generate_examples` if `features` remains None, preventing subsequent AttributeErrors.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Development

Successfully merging this pull request may close these issues.

Encountered an error when loading data in WebDataset format using load_dataset during multi-machine training.

2 participants