Replace shape-based empty batch handling inside `DPDataLoader` with structure-aware approach by david-stan · Pull Request #806 · meta-pytorch/opacus

david-stan · 2026-01-26T19:43:29Z

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Docs change / refactoring / dependency upgrade

Motivation and Context / Related issue

Replaces unstable shape-based empty batch handling with a stateful approach that learns and replicates the actual output structure from collate_fn. This fixes a critical bug where custom collate functions returning non-list structures (dicts, custom classes) were incompatible with Poisson sampling.

The old implementation inspected dataset[0] to pre-compute shapes, then hardcoded empty batches as lists:

def collate(batch, collate_fn, sample_empty_shapes, dtypes):
    if len(batch) > 0:
        return collate_fn(batch)  # Could return dict, custom class, etc.
    else:
        # Empty batches are always a list, whatever collate_fn returns
        return [
            torch.zeros(shape, dtype=dtype)
            for shape, dtype in zip(sample_empty_shapes, dtypes)
        ]

Bug: if collate_fn returns a dict, non-empty batches are dicts but empty batches are lists, so any code that expects a dict crashes on the first empty batch with a type mismatch.
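The mismatch is easy to reproduce outside Opacus. The sketch below is illustrative only: `old_style_collate` mirrors the snippet above, while `dict_collate`, `training_step_input`, and the shapes are made-up stand-ins, not code from the PR.

```python
import torch


def old_style_collate(batch, collate_fn, sample_empty_shapes, dtypes):
    # Mirrors the old behaviour: empty batches are always a list of zero tensors.
    if len(batch) > 0:
        return collate_fn(batch)
    return [
        torch.zeros(shape, dtype=dtype)
        for shape, dtype in zip(sample_empty_shapes, dtypes)
    ]


def dict_collate(samples):
    # Hypothetical custom collate_fn that returns a dict instead of a list.
    xs = torch.stack([x for x, _ in samples])
    ys = torch.tensor([y for _, y in samples])
    return {"x": xs, "y": ys}


def training_step_input(batch):
    # What a dict-consuming training loop would do with every batch.
    return batch["x"]


shapes = [(0, 3), (0,)]
dtypes = [torch.float32, torch.int64]

full = old_style_collate([(torch.ones(3), 1)], dict_collate, shapes, dtypes)
empty = old_style_collate([], dict_collate, shapes, dtypes)

training_step_input(full)  # works: full is a dict
try:
    training_step_input(empty)  # empty is a list, so string indexing fails
    mismatch = None
except TypeError as err:
    mismatch = err
```

The failure only shows up when Poisson sampling happens to draw an empty batch, which is why it can go unnoticed at larger sample rates.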

Existing, related issue: #534

Solution:

New CollateFnWithEmpty learns the structure from the first non-empty batch:

class CollateFnWithEmpty:
    def __call__(self, batch):
        if len(batch) > 0:
            output = self.wrapped_collator_fn(batch)
            if self.first_batch is None:
                self.first_batch = copy.deepcopy(output)  # Learn structure
        else:
            output = self._make_empty_batch(self.first_batch)  # Replicate structure
        return output

Now empty batches match the structure of non-empty batches, regardless of what collate_fn returns.
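A minimal sketch of how such structure replication could work, assuming dim 0 is the batch dimension and that only tensors, mappings, lists, and tuples need to be supported. `make_empty_like` and `StructureAwareCollate` are hypothetical names used for illustration; the PR's actual `CollateFnWithEmpty` also takes `batch_first` and `rand_on_empty`, which are not modelled here.

```python
import copy
from collections.abc import Mapping

import torch


def make_empty_like(sample):
    """Recursively build a zero-length batch with the same structure as sample."""
    if isinstance(sample, torch.Tensor):
        # Assumption: dim 0 is the batch dimension, so only that dim becomes 0.
        return torch.zeros(
            (0, *sample.shape[1:]), dtype=sample.dtype, device=sample.device
        )
    if isinstance(sample, Mapping):
        return type(sample)({k: make_empty_like(v) for k, v in sample.items()})
    if isinstance(sample, (list, tuple)):
        # Plain lists/tuples only; namedtuples would need separate handling.
        return type(sample)(make_empty_like(v) for v in sample)
    # Never fall back to returning real data for an empty batch.
    raise TypeError(f"Unsupported batch element type: {type(sample).__name__}")


class StructureAwareCollate:
    """Wraps a collate_fn and remembers the structure of its first output."""

    def __init__(self, collate_fn):
        self.collate_fn = collate_fn
        self.first_batch = None

    def __call__(self, batch):
        if len(batch) > 0:
            output = self.collate_fn(batch)
            if self.first_batch is None:
                self.first_batch = copy.deepcopy(output)  # learn structure
            return output
        if self.first_batch is None:
            raise ValueError("First sampled batch cannot be empty.")
        return make_empty_like(self.first_batch)  # replicate structure
```

Because the wrapper is stateful, it only knows the structure after it has seen one non-empty batch; what to do before that point is a separate design decision.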

If the very first sampled batch is empty, there is no structure to replicate yet, so an error is raised:

if self.first_batch is None:
    raise ValueError(
        "First sampled batch cannot be empty. Please ensure your dataset "
        "has sufficient samples or increase sample_rate."
    )

Key Changes

Removed: shape_safe(), dtype_safe(), hardcoded list return
Added: CollateFnWithEmpty class with recursive structure replication
Changed: wrap_collate_with_empty() signature: (collate_fn, sample_empty_shapes, dtype) -> (collate_fn, batch_first, rand_on_empty)

It is compatible with the existing API.
One caveat: for the small percentage of users who worked around the old empty-batch handling, it might cause problems, but in the majority of cases it should be compatible.

How Has This Been Tested (if it applies)

We used this approach to fine-tune a Qwen 7B model with the trl library for model alignment
Tested on fine-tuning the 5B-parameter Mellum model

Checklist

[ ] The documentation is up-to-date with the changes I made.
[ ] I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
[x] All tests passed, and additional code has been covered with new tests.

meta-codesync · 2026-01-26T19:46:26Z

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D91500466. (Because this pull request was imported automatically, there will not be any future comments.)

coveralls · 2026-02-08T17:14:38Z

Pull Request Test Coverage Report for Build 21371492613

Details

160 of 162 (98.77%) changed or added relevant lines in 4 files are covered.
27 unchanged lines in 4 files lost coverage.
Overall coverage increased (+0.02%) to 78.194%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
opacus/data_loader.py	33	34	97.06%
opacus/tests/dpdataloader_test.py	123	124	99.19%

Files with Coverage Reduction	New Missed Lines	%
opacus/optimizers/optimizer.py	1	87.78%
opacus/utils/batch_memory_manager.py	3	83.33%
opacus/tests/privacy_engine_test.py	4	94.53%
opacus/tests/batch_memory_manager_test.py	19	81.55%

Totals
Change from base Build 21792510836:	0.02%
Covered Lines:	5784
Relevant Lines:	7397

💛 - Coveralls

iden-kalemaj · 2026-02-08T17:10:42Z

+            return type(sample)(converted)
+
+        # base case
+        return sample


@david-stan am I understanding correctly that if the return of the collate_fn does not follow any of the 3 listed instances, you always return the first batch instead of the empty batch? This breaks the DP guarantee because it violates the assumption that each sample is used in training with a certain probability.

Let's raise an error describing what the supported output types are together with a note to either raise an issue or provide a PR if there's a need for a different output type.

david-stan · 2026-02-19

You are right, we should raise an error here.

iden-kalemaj · 2026-02-08T17:12:03Z


        dataset = TensorDataset(x, y)
-        data_loader = DPDataLoader(dataset, sample_rate=1e-5)
+        # Use moderate sample rate to get non-empty batches


@david-stan were you able to check that with this sampling rate there are indeed some empty batches produced?

david-stan · 2026-02-19

Updated the test so that the first batch is deterministically non-empty, and lowered the sample rate so that empty batches are consistently generated after that.
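For intuition on how often empty batches occur: under Poisson sampling each of the N samples is included independently with probability q = sample_rate, so a batch is empty with probability (1 - q)^N. The sketch below just evaluates that formula; `empty_batch_probability` is a hypothetical helper and the dataset sizes are made-up values, not the ones used in the test.

```python
def empty_batch_probability(sample_rate: float, dataset_size: int) -> float:
    """Probability that one Poisson-sampled batch contains no samples."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    # Each sample is skipped independently with probability (1 - sample_rate).
    return (1.0 - sample_rate) ** dataset_size


# Hypothetical dataset sizes, chosen only for illustration.
for n in (100, 10_000, 1_000_000):
    p = empty_batch_probability(1e-5, n)
    print(f"N={n:>9}: P(empty batch) = {p:.6f}")
```

When the dataset is small relative to 1/q, nearly every batch is empty, so a test that wants to exercise the learned-structure path has to force a non-empty first batch by some other means, such as seeding.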

iden-kalemaj · 2026-02-08T17:16:30Z

        return SampleConvNet()


+@pytest.mark.skip(("Incompatible with the new empty batch handling"))


Let's delete this test instead of skipping.

But it could be useful to maintain some of the old behavior, per one of my comments.

iden-kalemaj · 2026-02-08T17:25:11Z

+                self.first_batch = copy.deepcopy(output)
+        else:
+            if self.first_batch is None:
+                raise ValueError(


@david-stan when first_batch is empty, how about we maintain the old behavior of using lists, so that we still offer some support for more basic collate functions for the case when sampling rate is small and first batch is empty. We can raise a warning here that lists are used. Open to your opinion here as well.

return [
    torch.zeros(shape, dtype=dtype)
    for shape, dtype in zip(sample_empty_shapes, dtypes)
]

david-stan · 2026-02-19

Having default behavior on random seems like a bigger concern. Also, having an extra parameter for this scenario is also debatable. Generally, interesting idea. What would you suggest?

iden-kalemaj · 2026-02-08T17:26:16Z

Hi @david-stan, thank you for this change and the overall approach looks good to me. Could you please address the comments and also see the failed lint test. Please ping me when ready, so I can re-run the tests.

iden-kalemaj · 2026-02-20T19:55:53Z

+            f"CollateFnWithEmpty only supports batches containing torch.Tensor, "
+            f"dict (Mapping), list, or tuple types. "
+            f"If you need support for a different output type, please open an issue at "
+            f"https://github.com/JetBrains-Research/opacus/issues or submit a PR."


Let's remove the link and just say "... please open an issue on Opacus or submit a PR."

iden-kalemaj · 2026-02-22T22:14:10Z

@david-stan please also see the failing lint test.

david-stan · 2026-02-23T12:44:13Z

> @david-stan when first_batch is empty, how about we maintain the old behavior of using lists, so that we still offer some support for more basic collate functions for the case when sampling rate is small and first batch is empty. We can raise a warning here that lists are used. Open to your opinion here as well.
>
> return [
>     torch.zeros(shape, dtype=dtype)
>     for shape, dtype in zip(sample_empty_shapes, dtypes)
> ]

This one is last it seems, what is your decision on this one? Are we sticking to lists at the end

iden-kalemaj · 2026-02-23T18:50:58Z

> This one is last it seems, what is your decision on this one? Are we sticking to lists at the end

Yes, for backward compatibility: if self.first_batch is None, let's return an empty list and raise a Warning that says 'First batch is empty. We are using a list of zero-valued tensors as a batch. This may cause issues if the model expects a different batch format. To fix, use more data, increase epsilon, or increase sampling rate'.

Also please see failing lint (and test code with black and isort as well to make sure those pass too).

david-stan · 2026-02-24T13:16:10Z

> > This one is last it seems, what is your decision on this one? Are we sticking to lists at the end
>
> Yes, for backward compatibility: if self.first_batch is None, let's return an empty list and raise a Warning that says 'First batch is empty. We are using a list of zero-valued tensors as a batch. This may cause issues if the model expects a different batch format. To fix, use more data, increase epsilon, or increase sampling rate'.
>
> Also please see failing lint (and test code with black and isort as well to make sure those pass too).

Changed to return empty list. Potential problem is if you explicitly wanted list of zero-valued tensors instead. In that case I will need to reintroduce sample_empty_shapes and dtypes, which would require additional API changes.

iden-kalemaj · 2026-03-06T16:10:55Z

@david-stan apologies, the behavior I intended was to return a list of zero valued tensors using sample_empty_shapes, i.e., reverting to the original behavior.

We can either:

Raise a warning if first batch is empty (i.e., revert your last commit)
If you have the time, implement returning a list of zero valued tensors.

Please let me know which one you would prefer.

Reintroduce sample_empty_shapes and dtypes from dataset[0] so that when the first Poisson-sampled batch is empty, CollateFnWithEmpty returns properly shaped zero tensors instead of an empty list. Add thorough tests with deterministic seeds for the empty first batch path and the transition to learned batch structure.
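A rough sketch of that fallback path, not the merged implementation. `shapes_and_dtypes_from_sample` and `fallback_empty_batch` are hypothetical helpers; using `torch.as_tensor` to normalise non-tensor fields is an assumption made here, and the warning text is abbreviated from the wording proposed in the review.

```python
import warnings

import torch


def shapes_and_dtypes_from_sample(sample):
    """Derive zero-length batch shapes and dtypes from one dataset item."""
    shapes, dtypes = [], []
    for field in sample:
        tensor = torch.as_tensor(field)  # also handles plain ints/floats
        shapes.append((0, *tensor.shape))
        dtypes.append(tensor.dtype)
    return shapes, dtypes


def fallback_empty_batch(sample_empty_shapes, dtypes):
    """Used only when the first sampled batch is empty and no structure is known yet."""
    warnings.warn(
        "First batch is empty. Using a list of zero-valued tensors as a batch; "
        "this may cause issues if the model expects a different batch format."
    )
    return [
        torch.zeros(shape, dtype=dtype)
        for shape, dtype in zip(sample_empty_shapes, dtypes)
    ]


# Stand-in for dataset[0]: an (input, label) pair.
sample = (torch.ones(3, 4), 1)
shapes, dtypes = shapes_and_dtypes_from_sample(sample)
batch = fallback_empty_batch(shapes, dtypes)
```

Once a non-empty batch has been seen, the learned structure would take over and this list-based fallback would no longer be used.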

david-stan · 2026-03-23T11:27:30Z

> @david-stan apologies, the behavior I intended was to return a list of zero valued tensors using sample_empty_shapes, i.e., reverting to the original behavior.
>
> We can either:
>
> Raise a warning if first batch is empty (i.e., revert your last commit)
>
> If you have the time, implement returning a list of zero valued tensors.
>
> Please let me know which one you would prefer.

Committed, please review!

david-stan · 2026-03-24T16:12:18Z

Just saw the lint error, fixed

iden-kalemaj · 2026-03-24T17:15:16Z

@david-stan please see another lint failure. Just curious if you tried all the linting tests from our contribution guide before submitting and if those passed?

david-stan · 2026-03-26T12:16:38Z

Should be fine

…tructure-aware approach (meta-pytorch#806)

Summary:

## Types of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Docs change / refactoring / dependency upgrade

## Motivation and Context / Related issue

Replaces unstable shape-based empty batch handling with a stateful approach that learns and replicates the actual output structure from `collate_fn`. This fixes a critical bug where custom collate functions returning non-list structures (dicts, custom classes) were incompatible with Poisson sampling.

The old implementation inspected `dataset[0]` to pre-compute shapes, then hardcoded empty batches as lists:

```python
def collate(batch, collate_fn, sample_empty_shapes, dtypes):
    if len(batch) > 0:
        return collate_fn(batch)  # Could return dict, custom class, etc.
    else:
        return [torch.zeros(shape, dtype=dtype) for ...]  # Always list!
```

Bug -> if `collate_fn` returns a dict, non-empty batches are dicts but empty batches are lists -> type mismatch crash

Existing, related issue: meta-pytorch#534

### Solution:

New `CollateFnWithEmpty` learns the structure from the first non-empty batch:

```python
class CollateFnWithEmpty:
    def __call__(self, batch):
        if len(batch) > 0:
            output = self.wrapped_collator_fn(batch)
            if self.first_batch is None:
                self.first_batch = copy.deepcopy(output)  # Learn structure
        else:
            output = self._make_empty_batch(self.first_batch)  # Replicate structure
        return output
```

Now empty batches match the structure of non-empty batches, regardless of what `collate_fn` returns.

If the first non-empty batch is actually the first batch, then it returns an error:

```python
if self.first_batch is None:
    raise ValueError(
        "First sampled batch cannot be empty. Please ensure your dataset "
        "has sufficient samples or increase sample_rate."
    )
```

### Key Changes

- Removed: `shape_safe()`, `dtype_safe()`, hardcoded list return
- Added: `CollateFnWithEmpty` class with recursive structure replication
- Changed: `wrap_collate_with_empty()` signature: `(collate_fn, sample_empty_shapes, dtype)` -> `(collate_fn, batch_first, rand_on_empty)`

It is compatible with existing API. A small disclosure: for small percentage of users who hacked around empty batches handling, it might cause problems but in majority of cases it should be compatible.

## How Has This Been Tested (if it applies)

- We used this approach to fine-tune `Qwen 7B` model using `trl` library for model alignment
- Tested on `Mellum` 5B parameter model fine-tuning

## Checklist

- [ ] The documentation is up-to-date with the changes I made.
- [] I have read the **CONTRIBUTING** document and completed the CLA (see **CONTRIBUTING**).
- [x] All tests passed, and additional code has been covered with new tests.

Pull Request resolved: meta-pytorch#806

Differential Revision: D91500466

…tructure-aware approach (meta-pytorch#806) Summary: same PR description as in the commit message above. Test Plan: Imported from GitHub, without a `Test Plan:` line. Unit tests Differential Revision: D98312879 Pulled By: iden-kalemaj

meta-codesync · 2026-03-26T21:16:09Z

@iden-kalemaj merged this pull request in 6dc0a27.

david-stan added 3 commits January 26, 2026 16:41

7aa0556 Introduce CollateFnWithEmpty to handle empty batches and ensure consistent batch structure. Mark tests incompatible with new empty batch handling as skipped
51a20d8 Enhance CollateFnWithEmpty to support diverse batch structures, improve documentation, and add extensive test coverage
d5d8414 Improve tests.

meta-cla Bot added the CLA Signed label Jan 26, 2026

evgri243 mentioned this pull request Jan 26, 2026

Non-wrapping mode for better Transformers compatibility #794

Closed

7 tasks

iden-kalemaj reviewed Feb 8, 2026

iden-kalemaj self-assigned this Feb 16, 2026

david-stan added 3 commits February 19, 2026 16:52

d7317f1 Make DPDataLoader tests deterministically verify empty batch handling with seeded low sample rate
14f2bb9 Add error handling for unsupported batch types in CollateFnWithEmpty to preserve DP guarantees
09dfd48 Update issue url

iden-kalemaj reviewed Feb 20, 2026

david-stan added 2 commits February 23, 2026 13:33

fdc1ac2 Remove GitHub-specific URL from error message
6e81057 Apply black formatting to data_loader.py and dpdataloader_test.py

david-stan added 2 commits February 24, 2026 13:52

ecf7dfd isort lint fix
6ea5f9f Change empty first batch from error to warning with empty list return

723446f lint fix
df59ab8 Remove unused variable in dpdataloader test

meta-codesync Bot closed this in 6dc0a27 Mar 26, 2026

facebook-github-tools Bot added the Merged label Mar 26, 2026

iden-kalemaj mentioned this pull request Mar 30, 2026

TypeError: zeros() received an invalid combination of arguments - got (tuple, dtype=type), but expected one of: #743

Closed
