fix(webdataset): when loading data in WebDataset format using load_dataset during multi-machine training. #8203
Open
Wolfram-St wants to merge 1 commit into
During multi-machine distributed training with streaming=True, if the number of tar files is fewer than the number of ranks, some ranks receive an empty list of files. Previously, `_split_generators` hardcoded `tar_paths[0]` for feature inference, causing an IndexError on empty ranks.
This commit:
1. Iterates through splits to find the first valid file for feature inference.
2. Safely defaults to an empty dictionary in `_generate_examples` if `features` remains None, preventing subsequent AttributeErrors.
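The failure mode can be reproduced in isolation. The sketch below assumes a plain round-robin assignment of shards to ranks. That assignment and the helper name `assign_shards` are illustrative assumptions, not necessarily how `datasets` distributes files internally.

```python
from typing import List


def assign_shards(tar_paths: List[str], world_size: int) -> List[List[str]]:
    """Hypothetical round-robin split of shard paths across ranks."""
    return [tar_paths[rank::world_size] for rank in range(world_size)]


# Two shards spread over four ranks: ranks 2 and 3 get nothing.
per_rank = assign_shards(["shard-0.tar", "shard-1.tar"], world_size=4)

failed_ranks = []
for rank, paths in enumerate(per_rank):
    try:
        first = paths[0]  # the unguarded access the old code relied on
    except IndexError:
        failed_ranks.append(rank)

print(per_rank)
print("ranks that would crash:", failed_ranks)
```

Any rank whose list is empty hits the same `IndexError` as soon as index 0 is read without a guard.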
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Description
Fixes #8201
The Bug:
During distributed training with `streaming=True`, if a user provides fewer `.tar` files than the number of distributed ranks, some ranks receive an empty list of files. This caused an `IndexError: list index out of range` in `webdataset.py` because the code assumed `tar_paths[0]` always existed for feature inference.
The Fix:
- Updated `_split_generators` to scan across splits and capture the `first_tar_path`. Feature inference is now safely skipped for ranks that receive zero files.
- Updated `_generate_examples` to gracefully handle cases where `self.info.features` is `None` (due to the skipped inference), defaulting to an empty dictionary to prevent downstream `AttributeError` crashes.
Testing
- Ran `pytest tests/packaged_modules/test_webdataset.py` (Note: `test_audio_webdataset` skips/fails locally on Windows due to missing FFmpeg/`libtorchcodec` dependencies, but core logic tests pass).
- Ran `ruff`.
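The two guards described under "The Fix" can be sketched as standalone helpers. This is a minimal illustration rather than the code in `webdataset.py`: the names `find_first_tar_path` and `feature_names` are hypothetical, and a plain dictionary stands in for the real features object.

```python
from typing import Dict, List, Optional


def find_first_tar_path(splits: Dict[str, List[str]]) -> Optional[str]:
    """Return the first tar path found in any split, or None if all are empty."""
    for tar_paths in splits.values():
        if tar_paths:  # skip splits that received no files on this rank
            return tar_paths[0]
    return None


def feature_names(features: Optional[dict]) -> List[str]:
    """Fall back to an empty dict when feature inference was skipped."""
    return list((features or {}).keys())


# A rank that received files in only one split still finds a path to inspect.
print(find_first_tar_path({"train": [], "validation": ["val-0.tar"]}))
# A rank that received nothing gets None and skips inference instead of crashing.
print(find_first_tar_path({"train": [], "validation": []}))
# Code that later walks the features tolerates the missing value.
print(feature_names(None))
```

Returning `None` instead of raising lets an empty rank continue and yield zero examples, which matches the intent stated in the description.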