Validate split name before starting download #8222
Open
ParamChordiya wants to merge 1 commit into
When load_dataset() is called with an invalid split name, the error is
currently raised deep inside arrow_reader after the full download has
already completed. For large datasets this wastes a lot of time.
Add _check_split_names() to splits.py and call it from two places:
* load.py: immediately after load_dataset_builder(), before
download_and_prepare(). If the builder already has split info
(from Hub YAML metadata or a cached dataset_info.json) we can
bail out early.
* builder.as_dataset(): after the default-split expansion, before
map_nested(). This guarantees a clear ValueError with the list of
available splits instead of a confusing error bubbling up from
arrow_reader, even in cases where the early check wasn't possible.
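The two call sites above can be sketched as a two-stage guard. This is a minimal illustration, not the code in the PR: FakeBuilder, load, and check_plain_split are hypothetical stand-ins, and the validator here only handles plain split names.

```python
from typing import Iterable, List, Optional


def check_plain_split(split: str, known_splits: Iterable[str]) -> None:
    """Stand-in validator for this sketch: handles plain split names only."""
    known = list(known_splits)
    if split not in known:
        raise ValueError(f"Unknown split {split!r}. Available splits: {known}")


class FakeBuilder:
    """Hypothetical stand-in for a dataset builder, not the real class."""

    def __init__(self, splits_before_download: Optional[List[str]],
                 splits_after_download: List[str]):
        # None models a builder with no Hub YAML metadata and no cached info.
        self.known_splits = splits_before_download
        self._splits_after_download = splits_after_download
        self.downloaded = False

    def download_and_prepare(self) -> None:
        self.downloaded = True  # stands in for the expensive download
        self.known_splits = self._splits_after_download

    def as_dataset(self, split: str) -> str:
        # Safety net: split info is always known once the data is prepared.
        check_plain_split(split, self.known_splits)
        return f"<dataset split={split}>"


def load(builder: FakeBuilder, split: str) -> str:
    # Early check: only possible when split info is known up front.
    if builder.known_splits:
        check_plain_split(split, builder.known_splits)
    builder.download_and_prepare()
    return builder.as_dataset(split)
```

With split info available up front, a misspelled split raises before download_and_prepare() runs. Without it, the download still happens, but the guard in as_dataset() produces the same clear ValueError.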
The helper handles composite specs ("train+test"), sliced specs
("train[:1000]"), lists, and dicts transparently.
Fixes huggingface#5523
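As a rough illustration of the normalisation the helper performs, here is a minimal sketch. The names base_split_names and check_split_names are hypothetical and the body is not taken from the PR. It assumes that None and "all" always pass, and that an empty known_splits means there is nothing to validate against.

```python
def base_split_names(split) -> set:
    """Reduce any supported split spec to the set of base split names it uses."""
    if split is None:
        return set()
    if isinstance(split, dict):
        split = list(split.values())
    if isinstance(split, (list, tuple)):
        names = set()
        for item in split:
            names |= base_split_names(item)
        return names
    names = set()
    # "train[:50%]+test" -> parts "train[:50%]" and "test"
    for part in str(split).split("+"):
        # Drop a trailing slice such as "[:1000]" to get the base name.
        name = part.split("[", 1)[0].strip()
        if name:
            names.add(name)
    return names


def check_split_names(split, known_splits) -> None:
    """Raise ValueError if the spec names a split missing from known_splits."""
    known = sorted(known_splits or [])
    if not known:
        return  # assumption: no split info yet, so skip validation
    # Assumption: "all" is treated as a special value that is always valid.
    unknown = sorted(base_split_names(split) - set(known) - {"all"})
    if unknown:
        raise ValueError(f"Unknown split(s) {unknown}. Available splits: {known}")
```

For example, check_split_names("train+trian[:10]", ["train", "test"]) raises a ValueError that names "trian" and lists the available splits, while check_split_names({"a": "train[:50%]", "b": "test"}, ["train", "test"]) returns without error.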
What this fixes
Closes #5523.
When you pass an invalid split name to load_dataset(), the error only surfaces after the entire dataset has already been downloaded. For large datasets that can mean waiting many minutes before seeing a one-line error. The expected behaviour is to raise immediately.
How it works
* splits.py: new _check_split_names(split, known_splits) helper. It normalises all the split spec shapes the library accepts (plain name, "train[:50%]", "train+test", list, dict) and raises a ValueError with the list of available splits if any base name is not recognised.
* load.py: calls _check_split_names right after load_dataset_builder() returns, before download_and_prepare() is ever reached. This fires whenever split info is already available, i.e. when the builder was initialised from Hub YAML metadata or from a cached dataset_info.json.
* builder.py: calls _check_split_names at the top of as_dataset(), before map_nested(). This is the safety net for the cases where the early check wasn't possible (first download with no Hub YAML split info): instead of an opaque error from arrow_reader, the user gets a clean ValueError with the list of available splits.
Tests
* test_splits.py: 14 parametrised cases covering valid names, slices, compound specs, lists, dicts, None, "all", and empty known_splits.
* test_builder.py: test_builder_as_dataset_unknown_split_raises verifies the guard in as_dataset() for both a plain bad name and a compound spec with one bad part.