DataLoader.collate now clones cached HData when sampling full hypergraph #176
Merged
tizianocitro merged 2 commits into hypernetwork-research-group:main on Apr 29, 2026
`DataLoader.collate()` returned `self.__cached_dataset_hdata.to(...)` when `sample_full_hypergraph=True`. Because `HData.to()` is in-place, that returned the cached dataset object itself, so iterating the dataloader and mutating the batch (or transferring through a different device path on the next iteration) silently mutated the dataset's cached `hdata`.

This change adds an `HData.clone()` method that returns a structurally independent `HData` (every tensor field cloned, scalar fields passed through), and wires the loader to call `clone().to(...)` instead of `to(...)` directly.

The clone happens once per batch in the sample-full path, so the cost is bounded by the dataset size: the same order of magnitude as the device transfer that already happens there.

Three regression tests were added in `hyperbench/tests/data/loader_test.py`:

- `test_collate_sample_full_hypergraph_does_not_share_storage_with_cached_hdata` asserts `data_ptr` inequality across `x`, `hyperedge_index`, `hyperedge_attr`.
- `test_collate_sample_full_hypergraph_mutating_batch_does_not_affect_cached_hdata` mutates the batch in place and confirms the cached hdata's tensors are unchanged.
- `test_collate_sample_full_hypergraph_with_weights_isolates_weights` exercises the same isolation for `hyperedge_weights`.

Each fails when the changes to `loader.py` and `hdata.py` are stashed, confirming they exercise the new behaviour. The existing `test_collate_sample_full_hypergraph_returns_cached_hdata` continues to pass, since the equality of contents is preserved.

Closes hypernetwork-research-group#173
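To make the aliasing concrete, here is a minimal sketch of the pre-fix behaviour using a toy stand-in. `ToyHData` and `ToyLoader` are hypothetical names invented for this illustration, and plain Python lists stand in for tensors; this is not the actual `HData` or `DataLoader` code, only a model of "an in-place `to()` that returns `self`".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyHData:
    """Hypothetical stand-in for HData; a list plays the role of a tensor."""
    x: List[float]
    device: str = "cpu"

    def to(self, device: str) -> "ToyHData":
        # In-place, as described for HData.to(): mutates and returns self.
        self.device = device
        return self


class ToyLoader:
    """Hypothetical stand-in for the sample-full path of the loader."""

    def __init__(self, cached: ToyHData) -> None:
        self._cached = cached

    def collate(self, device: str) -> ToyHData:
        # Pre-fix shape: the cached object itself is handed to the caller.
        return self._cached.to(device)


cached = ToyHData(x=[1.0, 2.0, 3.0])
loader = ToyLoader(cached)

batch = loader.collate("cpu")
batch.x[0] = -1.0  # mutate the "batch" in place

aliased = batch is cached   # the batch and the cache are the same object
leaked_value = cached.x[0]  # so the mutation is visible through the cache
```

In this toy model the mutation of `batch` shows up in `cached`, which is the failure mode the regression tests guard against.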
Contributor
I had to make a few small changes so that it follows repo practices and passes the tests. Since that was only a little work, I'm approving this one. Next time, please follow the contribution guide listed in the template or README.
tizianocitro approved these changes on Apr 29, 2026
Merged 819bdf5 into hypernetwork-research-group:main
19 checks passed
All Submissions
Description
Closes #173.
`DataLoader.collate()` returned `self.__cached_dataset_hdata.to(batch[0].device)` when `sample_full_hypergraph=True`. Because `HData.to()` is in-place, this returned the cached dataset object itself. As a result, iterating the dataloader and mutating the batch, or transferring through a different device path on the next iteration, could silently mutate the dataset's cached `hdata`.

This change introduces a public `HData.clone()` method that returns a structurally independent `HData`: tensor fields are cloned, while scalar fields are passed through. The loader now uses `clone().to(...)` instead of calling `to(...)` directly on the cached object.

The clone happens once per batch in the sample-full path, so the cost is bounded by dataset size and is the same order of magnitude as the device transfer already performed there.
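A minimal sketch of the clone-then-transfer pattern, under stated assumptions: `ToyHData`, `collate_full`, and the `num_nodes` scalar field are hypothetical and exist only for this illustration, and Python lists stand in for tensors. The real `HData.clone()` may differ in its fields and details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyHData:
    """Hypothetical stand-in for HData; lists play the role of tensors."""
    x: List[float]
    hyperedge_index: List[List[int]]
    hyperedge_weights: Optional[List[float]] = None
    num_nodes: int = 0  # hypothetical scalar field
    device: str = "cpu"

    def to(self, device: str) -> "ToyHData":
        # In-place, mirroring the behaviour described for HData.to().
        self.device = device
        return self

    def clone(self) -> "ToyHData":
        # Copy every tensor-like field; pass scalar fields through unchanged.
        return ToyHData(
            x=list(self.x),
            hyperedge_index=[list(row) for row in self.hyperedge_index],
            hyperedge_weights=(
                None
                if self.hyperedge_weights is None
                else list(self.hyperedge_weights)
            ),
            num_nodes=self.num_nodes,
            device=self.device,
        )


def collate_full(cached: ToyHData, device: str) -> ToyHData:
    # Post-fix shape: clone first, then apply the in-place transfer to the copy.
    return cached.clone().to(device)


cached = ToyHData(
    x=[1.0, 2.0],
    hyperedge_index=[[0, 1], [0, 0]],
    hyperedge_weights=[0.5],
    num_nodes=2,
)

batch = collate_full(cached, "cuda")  # "cuda" is only a label in this toy model
batch.x[0] = -1.0
batch.hyperedge_weights[0] = 9.0
```

Here the in-place `to()` and the later mutations only touch the copy, so `cached` keeps its original contents and its original `device` label.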
Test plan:

Three regression tests were added in `hyperbench/tests/data/loader_test.py`:

- `test_collate_sample_full_hypergraph_does_not_share_storage_with_cached_hdata`: checks `data_ptr` inequality across `x`, `hyperedge_index`, and `hyperedge_attr`.
- `test_collate_sample_full_hypergraph_mutating_batch_does_not_affect_cached_hdata`: mutates the batch in place and confirms the cached `hdata` tensors are unchanged.
- `test_collate_sample_full_hypergraph_with_weights_isolates_weights`: verifies the same isolation for `hyperedge_weights`.

Each test fails when only `loader.py` and `hdata.py` are reverted, confirming they exercise the new behavior. The existing `test_collate_sample_full_hypergraph_returns_cached_hdata` content-equality test continues to pass.

Checklist
- `make test`
- `make lint`
- `make typecheck`