Validate split name before starting download#8222

Open
ParamChordiya wants to merge 1 commit into
huggingface:mainfrom
ParamChordiya:fix/validate-split-before-download

@ParamChordiya

What this fixes

Closes #5523.

When you pass an invalid split name to load_dataset(), the error only surfaces after the entire dataset has already been downloaded. For large datasets that can mean waiting many minutes before seeing a one-liner like:

```python
from datasets import load_dataset

load_dataset("mozilla-foundation/common_voice_11_0", "en", split="blabla")
# ... full download happens first ...
# ValueError: Unknown split "blabla". Should be one of [...]
```

The expected behaviour is to raise the ValueError immediately, before any data is downloaded.

How it works

splits.py — new _check_split_names(split, known_splits) helper. It normalises all the split spec shapes the library accepts (plain name, "train[:50%]", "train+test", list, dict) and raises a ValueError with the list of available splits if any base name is not recognised.

load.py — calls _check_split_names right after load_dataset_builder() returns, before download_and_prepare() is ever reached. This fires whenever split info is already available – i.e. when the builder was initialised from Hub YAML metadata or from a cached dataset_info.json.

builder.py — calls _check_split_names in as_dataset(), after the default-split expansion and before map_nested(). This is the safety net for the cases where the early check wasn't possible (first download with no Hub YAML split info): instead of an opaque error from arrow_reader, the user gets a clean ValueError with the list of available splits.
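The interplay of the two call sites can be illustrated with a toy model. Everything here is a stand-in: `StubBuilder`, `check_split`, and `load` are hypothetical names, the check only handles plain split names, and none of it mirrors the real `DatasetBuilder` API.

```python
def check_split(split, known_splits):
    """Simplified check: plain names only; a no-op when split info is missing."""
    if known_splits and split not in known_splits:
        raise ValueError(
            f'Unknown split "{split}". Should be one of {list(known_splits)}'
        )


class StubBuilder:
    """Stand-in for a dataset builder (hypothetical, for illustration only)."""

    def __init__(self, known_splits=None):
        # None models "no Hub YAML metadata and no cached dataset_info.json".
        self.known_splits = known_splits
        self.downloaded = False

    def download_and_prepare(self):
        self.downloaded = True
        if self.known_splits is None:
            # Split info only becomes known once the data has been prepared.
            self.known_splits = ["train", "test"]

    def as_dataset(self, split):
        # Safety net: always runs, even when the early check was skipped.
        check_split(split, self.known_splits)
        return f"dataset[{split}]"


def load(builder, split):
    # Early check: fires only when split info is already available.
    check_split(split, builder.known_splits)
    builder.download_and_prepare()
    return builder.as_dataset(split)
```

With `StubBuilder(["train", "test"])`, a bad split fails before `download_and_prepare()` runs. With `StubBuilder()`, the download still happens, but the failure is the same clear `ValueError` raised from `as_dataset()`.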

Tests

  • test_splits.py: 14 parametrised cases covering valid names, slices, compound specs, lists, dicts, None, "all", and empty known_splits.
  • test_builder.py: test_builder_as_dataset_unknown_split_raises – verifies the guard in as_dataset() for both a plain bad name and a compound spec with one bad part.

When load_dataset() is called with an invalid split name, the error is
currently raised deep inside arrow_reader after the full download has
already completed.  For large datasets this wastes a lot of time.

Add _check_split_names() to splits.py and call it from two places:

  * load.py: immediately after load_dataset_builder(), before
    download_and_prepare().  If the builder already has split info
    (from Hub YAML metadata or a cached dataset_info.json) we can
    bail out early.

  * builder.as_dataset(): after the default-split expansion, before
    map_nested().  This guarantees a clear ValueError with the list of
    available splits instead of a confusing error bubbling up from
    arrow_reader, even in cases where the early check wasn't possible.

The helper handles composite specs ("train+test"), sliced specs
("train[:1000]"), lists, and dicts transparently.

Fixes huggingface#5523
Development

Successfully merging this pull request may close these issues.

Checking that split name is correct happens only after the data is downloaded
