Read cached dataset_info.json to populate config_names offline #8216
Open
adityasingh2400 wants to merge 1 commit into
get_dataset_config_names with HF_DATASETS_OFFLINE=1 falls back to a ['default'] list even when multiple config subdirs are present in the cache, because CachedDatasetModuleFactory.get_module never reads the cached dataset_info.json files to find them. This breaks any code path that relies on the cached config list (e.g. iterating MMLU subjects without internet). Walk the cached repo subdirs, parse each dataset_info.json, and populate builder_configs_parameters with the discovered config names. Fixes huggingface#7947
What
get_dataset_config_names with HF_DATASETS_OFFLINE=1 returns ['default'] even when the cache contains multiple config subdirectories for the dataset. This happens, for example, after caching cais/mmlu.
This is because CachedDatasetModuleFactory.get_module returns a DatasetModule without builder_configs_parameters, so get_dataset_config_names falls back to default.
Fix
Walk the cached repo subdirectories, parse each dataset_info.json, and populate builder_configs_parameters with the discovered config names. Malformed or unreadable dataset_info.json files are skipped silently so a single bad cache entry does not break the lookup.
With this change, the same call returns the config names found in the cache instead of ['default'].
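For reviewers who want a picture of the lookup, here is a minimal standalone sketch of the directory walk described above. It is not the code in this PR: the helper name discover_cached_config_names is hypothetical, and it assumes each config has its own subdirectory under the dataset's cache directory with a dataset_info.json somewhere below it that may carry a "config_name" field.

```python
import json
from pathlib import Path


def discover_cached_config_names(dataset_cache_dir):
    """Return sorted config names found under one dataset's cache directory.

    Hypothetical helper for illustration only. Assumed layout:
    <dataset_cache_dir>/<config>/.../dataset_info.json
    """
    root = Path(dataset_cache_dir)
    if not root.is_dir():
        return []

    names = set()
    for config_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for info_path in config_dir.rglob("dataset_info.json"):
            try:
                info = json.loads(info_path.read_text(encoding="utf-8"))
            except (OSError, ValueError):
                # Unreadable or malformed JSON: skip this entry silently
                # so one bad cache entry does not break the lookup.
                continue
            if not isinstance(info, dict):
                continue
            name = info.get("config_name")
            if not isinstance(name, str) or not name:
                # Assumption: fall back to the subdirectory name.
                name = config_dir.name
            names.add(name)
    return sorted(names)
```

A config subdirectory whose only dataset_info.json is malformed contributes nothing to the result, and an empty or missing cache directory yields an empty list, which would leave the existing ['default'] fallback in place.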
Test
Adds test_CachedDatasetModuleFactory_offline_populates_config_names in tests/test_load.py that builds a fake cache layout with three configs, calls the factory with HF_HUB_OFFLINE patched to True, and asserts every cached config name is returned. No network access required.
Fixes #7947. Supersedes the abandoned #7977.
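As a rough illustration of the fake cache layout such a test relies on, the sketch below builds three config directories on disk and reads them back. It is not the test added in this PR: build_fake_cache, the cais___mmlu directory name, and the version and fingerprint path segments are assumptions for illustration, and the real test calls CachedDatasetModuleFactory rather than reading the files directly.

```python
import json
import tempfile
from pathlib import Path


def build_fake_cache(cache_root, dataset_dir_name, config_names,
                     version="0.0.0", fingerprint="abc123"):
    """Create <root>/<dataset>/<config>/<version>/<fingerprint>/dataset_info.json.

    Hypothetical helper; the path segments are assumed, not taken from the PR.
    """
    dataset_dir = Path(cache_root) / dataset_dir_name
    for name in config_names:
        leaf = dataset_dir / name / version / fingerprint
        leaf.mkdir(parents=True, exist_ok=True)
        (leaf / "dataset_info.json").write_text(
            json.dumps({"config_name": name}), encoding="utf-8"
        )
    return dataset_dir


def test_fake_cache_has_every_config():
    expected = ["abstract_algebra", "anatomy", "astronomy"]
    with tempfile.TemporaryDirectory() as tmp:
        dataset_dir = build_fake_cache(tmp, "cais___mmlu", expected)
        # Read the config names back from disk; no network access is needed.
        found = sorted(
            json.loads(p.read_text(encoding="utf-8"))["config_name"]
            for p in dataset_dir.rglob("dataset_info.json")
        )
    assert found == expected
```

Everything lives in a temporary directory, so the check runs without touching the real cache or the network.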