Validate split name before starting download by ParamChordiya · Pull Request #8222 · huggingface/datasets

ParamChordiya · 2026-05-27T01:54:14Z

What this fixes

Closes #5523.

When you pass an invalid split name to load_dataset(), the error only surfaces after the entire dataset has already been downloaded. For large datasets that can mean waiting many minutes before seeing a one-liner like:

from datasets import load_dataset
load_dataset("mozilla-foundation/common_voice_11_0", "en", split="blabla")
# ... full download happens first ...
# ValueError: Unknown split "blabla". Should be one of [...]

The expected behaviour is to raise immediately.

How it works

splits.py — new _check_split_names(split, known_splits) helper. It normalises all the split spec shapes the library accepts (plain name, "train[:50%]", "train+test", list, dict) and raises a ValueError with the list of available splits if any base name is not recognised.

load.py — calls _check_split_names right after load_dataset_builder() returns, before download_and_prepare() is ever reached. This fires whenever split info is already available – i.e. when the builder was initialised from Hub YAML metadata or from a cached dataset_info.json.

builder.py — calls _check_split_names at the top of as_dataset(), before map_nested(). This is the safety-net for the cases where the early check wasn't possible (first download with no Hub YAML split info): instead of an opaque error from arrow_reader, the user gets a clean ValueError with the list of available splits.

Tests

test_splits.py: 14 parametrised cases covering valid names, slices, compound specs, lists, dicts, None, "all", and empty known_splits.
test_builder.py: test_builder_as_dataset_unknown_split_raises – verifies the guard in as_dataset() for both a plain bad name and a compound spec with one bad part.
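The two as_dataset() cases named above (a plain bad name and a compound spec with one bad part) can be illustrated with a table-driven check. This is a sketch only: `ToyBuilder` is a hypothetical stand-in rather than the library's builder class, and its guard handles string specs only. In the real test suite these cases would presumably be expressed as pytest parametrised tests.

```python
class ToyBuilder:
    """Hypothetical stand-in for a dataset builder; not the real class."""

    def __init__(self, known_splits):
        self.known_splits = list(known_splits)

    def as_dataset(self, split):
        # Guard first, so a bad name never reaches the reading step.
        for part in split.split("+"):
            name = part.split("[", 1)[0].strip()
            if name not in self.known_splits:
                raise ValueError(
                    f'Unknown split "{name}". '
                    f"Should be one of {self.known_splits}."
                )
        return f"<dataset for {split}>"


def raises_value_error(builder, split):
    """Return True if as_dataset(split) raises ValueError, else False."""
    try:
        builder.as_dataset(split)
    except ValueError:
        return True
    return False


# (split spec, should it be rejected?)
CASES = [
    ("train", False),
    ("train[:10]", False),
    ("train+test", False),
    ("blabla", True),        # plain bad name
    ("train+blabla", True),  # compound spec with one bad part
]

builder = ToyBuilder(["train", "test"])
for spec, should_raise in CASES:
    assert raises_value_error(builder, spec) == should_raise, spec
```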

When load_dataset() is called with an invalid split name, the error is currently raised deep inside arrow_reader after the full download has already completed. For large datasets this wastes a lot of time.

Add _check_split_names() to splits.py and call it from two places:

* load.py: immediately after load_dataset_builder(), before download_and_prepare(). If the builder already has split info (from Hub YAML metadata or a cached dataset_info.json) we can bail out early.
* builder.as_dataset(): after the default-split expansion, before map_nested(). This guarantees a clear ValueError with the list of available splits instead of a confusing error bubbling up from arrow_reader, even in cases where the early check wasn't possible.

The helper handles composite specs ("train+test"), sliced specs ("train[:1000]"), lists, and dicts transparently.

Fixes huggingface#5523

ParamChordiya wants to merge 1 commit into huggingface:main from ParamChordiya:fix/validate-split-before-download
