feat: Add return_file_name parameter to JSON builder (#5806) #8214

Open
Kinetic-Labs-GT wants to merge 5 commits into huggingface:main from Kinetic-Labs-GT:main

@Kinetic-Labs-GT

Description

Resolves #5806.

This PR introduces an optional return_file_name boolean parameter to the packaged JSON builder. When enabled, it injects a file_name column into the generated PyArrow tables, allowing downstream users to track exactly which data shard a specific row originated from during interrupted LLM training runs.

Changes Made:

  • Added return_file_name: bool = False to the JsonConfig dataclass.
  • Updated _generate_tables in src/datasets/packaged_modules/json/json.py to inspect self.config.return_file_name.
  • Appended the string representation of the source file to the output dictionary prior to PyArrow yield.
  • Added schema checks to ensure "file_name" is included in self.features if provided dynamically.
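
The core of the change list above can be illustrated with a small standalone sketch. It is not the code from this PR: `SketchJsonConfig`, `generate_rows`, and `demo` are hypothetical names, the input is assumed to be JSON Lines, and plain dictionaries stand in for the PyArrow tables that the real `_generate_tables` yields, so the example runs with only the standard library.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass
class SketchJsonConfig:
    """Hypothetical stand-in for JsonConfig; only the new flag is modelled."""

    return_file_name: bool = False


def generate_rows(files: List[str], config: SketchJsonConfig) -> Iterator[Tuple[int, Dict]]:
    """Yield (file_index, row) pairs, optionally tagging each row with its source file."""
    for file_idx, file in enumerate(files):
        with open(file, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines between records
                row = json.loads(line)
                if config.return_file_name:
                    # Assumption: the column is named "file_name" and holds
                    # the string form of the source path, as the PR describes.
                    row["file_name"] = str(file)
                yield file_idx, row


def demo(return_file_name: bool) -> List[Dict]:
    """Write two tiny shards to a temporary directory and read them back."""
    shards = {
        "shard-0.jsonl": [{"text": "a"}, {"text": "b"}],
        "shard-1.jsonl": [{"text": "c"}],
    }
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, rows in shards.items():
            path = os.path.join(tmp, name)
            with open(path, "w", encoding="utf-8") as f:
                for row in rows:
                    f.write(json.dumps(row) + "\n")
            paths.append(path)
        config = SketchJsonConfig(return_file_name=return_file_name)
        return [row for _, row in generate_rows(paths, config)]
```

With the flag left at its default of `False`, the rows come back unchanged, which mirrors the backward-compatible default described for `JsonConfig`.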

Note on CI Status:

Local tests and the deps-minimum CI matrices pass. The (ubuntu-latest, deps-latest) and (windows-latest, deps-latest) integration jobs are currently failing on this branch. From the logs, the failures occur in test_inspect.py and test_offline_util.py and are caused by a recent huggingface_hub dependency update that enforces strict namespace/name validation (e.g., rejecting the 'paws' and 'dummy' fixtures).

Since this is an upstream dependency breakage unrelated to the json.py return_file_name implementation, I have left those test files untouched to avoid scope pollution in this PR.

google-labs-jules Bot and others added 4 commits May 21, 2026 13:36
This adds support for an optional boolean parameter `return_file_name` in
`JsonConfig`. When set to True, the data table generation injects a new
column named "file_name" containing the current source file path string
for every single row, to address Issue huggingface#5806.

Co-authored-by: Kinetic-Labs-GT <250848918+Kinetic-Labs-GT@users.noreply.github.com>
…name-7019226806567132062

Add `return_file_name` config parameter
Added warning about memory bloat when using keep_in_memory with PyTorch DataLoader.
Development

Successfully merging this pull request may close these issues.

Return the name of the currently loaded file in the load_dataset function.

1 participant