DataLoader.collate now clones cached HData when sampling full hypergraph #176

Merged
tizianocitro merged 2 commits into hypernetwork-research-group:main from SAY-5:fix/dataloader-collate-clones-cached-hdata
Apr 29, 2026

@SAY-5 commented on Apr 28, 2026

All Submissions

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

Description

Closes #173.

DataLoader.collate() returned self.__cached_dataset_hdata.to(batch[0].device) when sample_full_hypergraph=True. Because HData.to() is in-place, this returned the cached dataset object itself. As a result, iterating the dataloader and mutating the batch, or transferring through a different device path on the next iteration, could silently mutate the dataset's cached hdata.

This change introduces a public HData.clone() method that returns a structurally independent HData: tensor fields are cloned, while scalar fields are passed through. The loader now uses clone().to(...) instead of calling to(...) directly on the cached object.

The clone happens once per batch in the sample-full path, so the cost is bounded by dataset size and is the same order of magnitude as the device transfer already performed there.
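
The aliasing described above, and the way clone().to(...) avoids it, can be sketched without the real classes. This is a minimal illustration under stated assumptions, not the library's code: ToyHData and ToyLoader are hypothetical stand-ins, plain Python lists play the role of tensors so the sketch runs without PyTorch, and only the in-place to() that returns self is taken from the description.

```python
from dataclasses import dataclass


@dataclass
class ToyHData:
    """Hypothetical stand-in for HData; lists play the role of tensors."""

    x: list
    hyperedge_index: list
    device: str = "cpu"

    def to(self, device: str) -> "ToyHData":
        # In-place move: updates this object and returns it (assumed to
        # mirror the in-place HData.to() behaviour described in the PR).
        self.device = device
        return self

    def clone(self) -> "ToyHData":
        # Copy container fields; pass scalar fields through unchanged.
        return ToyHData(
            x=list(self.x),
            hyperedge_index=[list(row) for row in self.hyperedge_index],
            device=self.device,
        )


class ToyLoader:
    """Hypothetical loader holding one cached ToyHData."""

    def __init__(self, cached: ToyHData) -> None:
        self._cached = cached

    def collate_aliasing(self, device: str) -> ToyHData:
        # Old behaviour: hands out the cached object itself.
        return self._cached.to(device)

    def collate_cloning(self, device: str) -> ToyHData:
        # New behaviour: hands out an independent copy.
        return self._cached.clone().to(device)


cached = ToyHData(x=[1.0, 2.0], hyperedge_index=[[0, 1], [0, 0]])
loader = ToyLoader(cached)

aliased = loader.collate_aliasing("cpu")
aliased.x[0] = -1.0  # also changes cached.x[0], because aliased is cached

cloned = loader.collate_cloning("cpu")
cloned.x[1] = 99.0  # cached.x[1] keeps its original value
```

In the aliasing path the returned batch is the cache, so any in-place change leaks back into the dataset. In the cloning path the batch owns its own containers.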

Test plan:

Three regression tests were added in hyperbench/tests/data/loader_test.py:

  • test_collate_sample_full_hypergraph_does_not_share_storage_with_cached_hdata
    — checks data_ptr inequality across x, hyperedge_index, and hyperedge_attr.
  • test_collate_sample_full_hypergraph_mutating_batch_does_not_affect_cached_hdata
    — mutates the batch in place and confirms the cached hdata tensors are unchanged.
  • test_collate_sample_full_hypergraph_with_weights_isolates_weights
    — verifies the same isolation for hyperedge_weights.

Each test fails when only loader.py and hdata.py are reverted, confirming they exercise the new behavior. The existing test_collate_sample_full_hypergraph_returns_cached_hdata content-equality test continues to pass.
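
The kind of assertion these tests rely on can be shown on a single plain tensor. This is a hedged sketch, not the code in loader_test.py: it assumes PyTorch is installed, cached_x is an illustrative stand-in for one cached field, and the real tests go through DataLoader.collate and HData rather than bare tensors.

```python
import torch

# Stand-in for one cached tensor field (for example x); values are illustrative.
cached_x = torch.arange(6, dtype=torch.float32).reshape(3, 2)
snapshot = cached_x.clone()

# Tensor.to() returns the same tensor when dtype and device already match,
# so this "batch" shares storage with the cache.
aliased = cached_x.to(cached_x.device)
shares_storage = aliased.data_ptr() == cached_x.data_ptr()

# Cloning first gives the batch its own storage.
isolated = cached_x.clone().to(cached_x.device)
independent = isolated.data_ptr() != cached_x.data_ptr()

# An in-place mutation of the isolated batch leaves the cache untouched.
isolated.add_(1.0)
cache_unchanged = torch.equal(cached_x, snapshot)
```

Comparing data_ptr values checks storage identity directly, while the mutation check confirms the observable consequence, which is why the test plan covers both.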

Checklist

  • Does your submission pass all tests? (use make test)
  • Have you written tests to cover all your changes? If not, provide a reason.
  • Have you linted your code locally before submission? (use make lint)
  • Have you type checked your code locally before submission? (use make typecheck)
  • Have you added an explanation of what your changes are and why you'd like us to include them?

`DataLoader.collate()` returned `self.__cached_dataset_hdata.to(...)`
when `sample_full_hypergraph=True`. Because `HData.to()` is in-place,
that returned the cached dataset object itself — so iterating the
dataloader and mutating the batch (or transferring through a different
device path on the next iteration) silently mutated the dataset's
cached `hdata`.

This change adds an `HData.clone()` method that returns a structurally
independent `HData` (every tensor field cloned, scalar fields passed
through), and wires the loader to `clone().to(...)` instead of
`to(...)` directly. The clone happens once per batch in the
sample-full path, so the cost is bounded by the dataset size — same
order of magnitude as the device transfer that already happens there.

Three regression tests in `hyperbench/tests/data/loader_test.py`:

- `test_collate_sample_full_hypergraph_does_not_share_storage_with_cached_hdata`
  asserts `data_ptr` inequality across `x`, `hyperedge_index`,
  `hyperedge_attr`.
- `test_collate_sample_full_hypergraph_mutating_batch_does_not_affect_cached_hdata`
  mutates the batch in place and confirms the cached hdata's tensors
  are unchanged.
- `test_collate_sample_full_hypergraph_with_weights_isolates_weights`
  exercises the same isolation for `hyperedge_weights`.

Each fails when `loader.py` and `hdata.py` are stashed, confirming
they exercise the new behaviour. Existing
`test_collate_sample_full_hypergraph_returns_cached_hdata` continues
to pass — the equality of contents is preserved.

Closes hypernetwork-research-group#173
@tizianocitro changed the title from "fix: DataLoader.collate clones cached hdata on sample_full_hypergraph" to "DataLoader.collate now clones cached HData when sampling full hypergraph" on Apr 29, 2026
@tizianocitro

I had to make small changes to make it follow repo practices and pass tests.

Since I only had to do a little work, I'm approving this one. Next time, please follow the contribution guide listed in the template or README.

@tizianocitro merged commit 819bdf5 into hypernetwork-research-group:main on Apr 29, 2026
19 checks passed

Development

Successfully merging this pull request may close these issues.

DataLoader.collate() mutates and returns cached dataset state when sample_full_hypergraph=True
