feat: Add return_file_name parameter to JSON builder (#5806) #8214
Open
Kinetic-Labs-GT wants to merge 5 commits into
This adds support for an optional boolean parameter `return_file_name` in `JsonConfig`. When set to `True`, table generation adds a new column named "file_name" that holds the source file path for every row, to address Issue huggingface#5806.
Co-authored-by: Kinetic-Labs-GT <250848918+Kinetic-Labs-GT@users.noreply.github.com>
…name-7019226806567132062 Add `return_file_name` config parameter
Added warning about memory bloat when using keep_in_memory with PyTorch DataLoader.
Description
Resolves #5806.
This PR introduces an optional `return_file_name` boolean parameter to the packaged JSON builder. When enabled, it injects a `file_name` column into the generated PyArrow tables, allowing downstream users to track exactly which data shard a specific row originated from during interrupted LLM training runs.

Changes Made:
- Added `return_file_name: bool = False` to the `JsonConfig` dataclass.
- Updated `_generate_tables` in `src/datasets/packaged_modules/json/json.py` to inspect `self.config.return_file_name`.
- … `self.features` if provided dynamically.

Note on CI Status:
Local testing and the `deps-minimum` CI matrices passed successfully. I noticed that the `ubuntu-latest, deps-latest` and `windows-latest, deps-latest` integration tests are currently failing on this branch. Reviewing the logs, these failures occur in `test_inspect.py` and `test_offline_util.py` and are caused by a recent `huggingface_hub` dependency update that enforces strict `namespace/name` validation (e.g., rejecting the `'paws'` and `'dummy'` fixtures).

Since this is an upstream dependency breakage unrelated to the `json.py` `return_file_name` implementation, I have left those test files untouched to avoid scope pollution in this PR.
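
For reviewers who want a quick mental model of the `return_file_name` behavior described above, here is a minimal plain-Python sketch. It is not the code in this PR: the real builder works on PyArrow tables inside `_generate_tables`, whereas the hypothetical `iter_rows` helper below reads JSON Lines shards with the standard library and tags each row dict with its source path. The shard names and the helper's signature are assumptions made for illustration only.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def iter_rows(files: Iterable[str], return_file_name: bool = False) -> Iterator[dict]:
    """Yield one dict per JSON Lines record, optionally tagged with its source file.

    Hypothetical helper: it mirrors the idea of the flag, not the builder's code.
    """
    for file in files:
        with open(file, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                if return_file_name:
                    # The same path is repeated for every row read from this file.
                    row["file_name"] = file
                yield row


# Two small shards written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    shard_a = Path(tmp) / "shard_a.jsonl"
    shard_b = Path(tmp) / "shard_b.jsonl"
    shard_a.write_text('{"text": "first"}\n{"text": "second"}\n', encoding="utf-8")
    shard_b.write_text('{"text": "third"}\n', encoding="utf-8")
    paths = [str(shard_a), str(shard_b)]

    rows = list(iter_rows(paths, return_file_name=True))
    plain_rows = list(iter_rows(paths))  # default: no extra column

# Keep only the base names so the result does not depend on the temp directory.
shard_names = [Path(row["file_name"]).name for row in rows]
```

With the flag off, the rows are unchanged, which matches the `False` default in `JsonConfig`. With it on, every row carries the path of the shard it came from, so a resumed training run can tell which shard a given row belongs to.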