Read cached dataset_info.json to populate config_names offline #8216

Open
adityasingh2400 wants to merge 1 commit into huggingface:main from adityasingh2400:fix-cached-dataset-config-names-offline-7947

@adityasingh2400

What

get_dataset_config_names with HF_DATASETS_OFFLINE=1 returns ['default'] even when the cache contains multiple config subdirectories for the dataset. For example, after caching cais/mmlu:

$ HF_DATASETS_OFFLINE=1 python -c "import datasets; print(datasets.get_dataset_config_names('cais/mmlu'))"
Using the latest cached version of the dataset since cais/mmlu couldn't be found on the Hugging Face Hub (offline mode is enabled).
['default']

This happens because CachedDatasetModuleFactory.get_module returns a DatasetModule without builder_configs_parameters, so get_dataset_config_names falls back to ['default'].

Fix

Walk the cached repo subdirectories, parse each dataset_info.json, and populate builder_configs_parameters with the discovered config names. Malformed or unreadable dataset_info.json files are skipped silently so a single bad cache entry does not break the lookup.
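The discovery step described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: the function name `discover_cached_config_names` is hypothetical, and the sketch assumes a cache layout of `<repo_dir>/<config>/<version>/<hash>/dataset_info.json` with a `config_name` key in each file. When that key is missing, it falls back to the config directory name, which is also an assumption.

```python
import json
from pathlib import Path


def discover_cached_config_names(cached_repo_dir):
    """Return sorted config names found under a cached dataset directory.

    Assumed layout (not confirmed by the PR text):
        <cached_repo_dir>/<config>/<version>/<hash>/dataset_info.json
    """
    root = Path(cached_repo_dir)
    if not root.is_dir():
        return []

    names = set()
    for info_path in root.glob("*/*/*/dataset_info.json"):
        try:
            with info_path.open(encoding="utf-8") as f:
                info = json.load(f)
        except (OSError, ValueError):
            # Unreadable or malformed file: skip it silently so one bad
            # cache entry does not break the whole lookup.
            continue
        if not isinstance(info, dict):
            continue
        name = info.get("config_name")
        if not isinstance(name, str) or not name:
            # Assumed fallback: use the <config> directory name.
            name = info_path.parents[2].name
        names.add(name)
    return sorted(names)
```

Catching `ValueError` covers both `json.JSONDecodeError` and `UnicodeDecodeError`, so a truncated or mis-encoded file is handled the same way as an unreadable one.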

With this change:

$ HF_DATASETS_OFFLINE=1 python -c "import datasets; print(datasets.get_dataset_config_names('cais/mmlu'))"
Using the latest cached version of the dataset since cais/mmlu couldn't be found on the Hugging Face Hub (offline mode is enabled).
['abstract_algebra', 'all', 'anatomy', ...]

Test

Adds test_CachedDatasetModuleFactory_offline_populates_config_names in tests/test_load.py that builds a fake cache layout with three configs, calls the factory with HF_HUB_OFFLINE patched to True, and asserts every cached config name is returned. No network access required.
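A fake cache layout of the kind this test relies on can be built with the standard library alone. The sketch below is illustrative and is not the test added in this PR: the helper name `build_fake_cache`, the directory name `cais___mmlu`, the version string, and the fingerprint are all assumptions chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def build_fake_cache(root, repo_dir_name, config_names,
                     version="0.0.0", fingerprint="abc123"):
    """Write one dataset_info.json per config under an assumed cache layout:
    <root>/<repo_dir_name>/<config>/<version>/<fingerprint>/dataset_info.json
    """
    repo_dir = Path(root) / repo_dir_name
    for name in config_names:
        leaf = repo_dir / name / version / fingerprint
        leaf.mkdir(parents=True, exist_ok=True)
        (leaf / "dataset_info.json").write_text(
            json.dumps({"config_name": name}), encoding="utf-8"
        )
    return repo_dir


# Build three configs in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    repo_dir = build_fake_cache(
        tmp, "cais___mmlu", ["abstract_algebra", "all", "anatomy"]
    )
    created = sorted(p.name for p in repo_dir.iterdir() if p.is_dir())
```

Because everything lives in a temporary directory, a test built this way touches neither the real cache nor the network.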

Fixes #7947. Supersedes the abandoned #7977.

get_dataset_config_names with HF_DATASETS_OFFLINE=1 falls back to a
['default'] list even when multiple config subdirs are present in the
cache, because CachedDatasetModuleFactory.get_module never reads the
cached dataset_info.json files to find them. This breaks any code path
that relies on the cached config list (e.g. iterating MMLU subjects
without internet).

Walk the cached repo subdirs, parse each dataset_info.json, and populate
builder_configs_parameters with the discovered config names.

Fixes huggingface#7947

Development

Successfully merging this pull request may close these issues.

MMLU get_dataset_config_names provides different lists of subsets in online and offline modes