Fix incorrect sample_rate with WeightedRandomSampler (Issue #813) #816

Open
intagliated wants to merge 2 commits into meta-pytorch:main from intagliated:fix-weighted-sampler-final

@intagliated

Problem Statement

When using WeightedRandomSampler with a DataLoader, Opacus's make_private() and make_private_with_epsilon() were computing an incorrect sample_rate.

The original logic derived the rate from 1 / len(data_loader). However, len(data_loader) is the number of batches per epoch, and with a WeightedRandomSampler that count is determined by the sampler's num_samples, not by the dataset size. As a result, the privacy budget was consumed significantly faster than reported, effectively breaking the Differential Privacy (DP) guarantees.

  • Impact: In a dataset of 100k samples with a batch size of 16, the rate was computed as 0.0078 instead of the correct 0.00016—a ~781x discrepancy in epsilon consumption.
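
To make the mismatch concrete, here is a minimal sketch of the two formulas side by side. The function names are illustrative and not taken from Opacus, and `num_samples=2048` is an assumed sampler setting chosen only because it reproduces a rate of about 0.0078; the PR does not state the value used.

```python
import math


def legacy_sample_rate(num_batches: int) -> float:
    """Rate as originally derived: 1 / len(data_loader)."""
    return 1 / num_batches


def dataset_sample_rate(batch_size: int, dataset_size: int) -> float:
    """Rate grounded in the dataset size: batch_size / N."""
    return batch_size / dataset_size


dataset_size = 100_000
batch_size = 16

# Standard shuffle: the loader draws dataset_size samples per epoch,
# so len(data_loader) == ceil(N / batch_size) and both formulas agree.
uniform_batches = math.ceil(dataset_size / batch_size)
uniform_legacy = legacy_sample_rate(uniform_batches)

# Weighted sampler: the loader draws num_samples per epoch (assumed value),
# so len(data_loader) no longer reflects the dataset size.
num_samples = 2_048  # assumption, not stated in the PR
weighted_batches = math.ceil(num_samples / batch_size)
weighted_legacy = legacy_sample_rate(weighted_batches)

expected = dataset_sample_rate(batch_size, dataset_size)

print(f"uniform legacy rate:  {uniform_legacy:.7f}")
print(f"weighted legacy rate: {weighted_legacy:.7f}")
print(f"batch_size / N:       {expected:.7f}")
```

With a standard shuffle the two formulas coincide, which is presumably why the discrepancy only surfaces once the sampler's length is decoupled from the dataset length.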

Solution

The calculation was refactored to be mathematically consistent by grounding the rate in the absolute dataset length and explicit batch size.

Key Changes:

  1. Standardized Formula: Implemented $\text{sample\_rate} = \frac{\text{batch\_size}}{N}$ across all sampler types to ensure the accountant matches the physical sampling probability.
  2. Metadata Privacy: Added support for metadata_epsilon, allowing Laplace noise injection into the dataset size $N$ used for accounting. This protects metadata privacy while maintaining a perfect 1.0x ratio between the accountant and the sampler.
  3. Robust Detection: Added logic to extract batch_size from either the DataLoader or BatchSampler to handle NoneType edge cases in newer PyTorch versions.
  4. User Safety: Added an explicit UserWarning when WeightedRandomSampler is detected to ensure transparency in how the privacy rate is derived.
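
The four changes above could fit together roughly as follows. This is a sketch under stated assumptions, not the code in this PR: the helper names (`resolve_batch_size`, `noisy_dataset_size`, `compute_sample_rate`) are hypothetical, the Laplace scale of `1 / metadata_epsilon` assumes the dataset size has sensitivity 1, and a `SimpleNamespace` stands in for a real `DataLoader` so the example runs without PyTorch.

```python
import random
import warnings
from types import SimpleNamespace


def resolve_batch_size(data_loader):
    """Read batch_size from the loader, falling back to its batch_sampler."""
    batch_size = getattr(data_loader, "batch_size", None)
    if batch_size is None:
        # A DataLoader built with batch_sampler=... exposes batch_size=None.
        batch_sampler = getattr(data_loader, "batch_sampler", None)
        batch_size = getattr(batch_sampler, "batch_size", None)
    if batch_size is None:
        raise ValueError("could not determine batch_size from data_loader")
    return batch_size


def noisy_dataset_size(n, metadata_epsilon, rng=random):
    """Add Laplace noise to N (assumed scale: 1 / metadata_epsilon)."""
    scale = 1.0 / metadata_epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return max(1.0, n + noise)  # keep the denominator positive


def compute_sample_rate(data_loader, metadata_epsilon=None, rng=random):
    """Hypothetical helper: sample_rate = batch_size / N for any sampler."""
    batch_size = resolve_batch_size(data_loader)
    n = len(data_loader.dataset)
    sampler = getattr(data_loader, "sampler", None)
    # Name-based check so this sketch does not need to import torch.
    if type(sampler).__name__ == "WeightedRandomSampler":
        warnings.warn(
            "WeightedRandomSampler detected: sample_rate is derived from "
            "batch_size / len(dataset), not from len(data_loader).",
            UserWarning,
        )
    if metadata_epsilon is not None:
        n = noisy_dataset_size(n, metadata_epsilon, rng)
    return batch_size / n


# Stand-in for a DataLoader that was constructed with a BatchSampler.
loader = SimpleNamespace(
    batch_size=None,
    batch_sampler=SimpleNamespace(batch_size=16),
    dataset=range(100_000),
    sampler=None,
)
rate = compute_sample_rate(loader)
print(f"sample_rate = {rate:.5f}")
```

Matching the sampler by class name keeps the sketch dependency-free; real code would more likely use `isinstance` against `torch.utils.data.WeightedRandomSampler`.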

Verification

Validated using a dedicated audit script (verify_randomness.py) comparing the expected privacy ratio against the actual consumption.

| Sampler Type | Before Fix (Ratio) | After Fix (Ratio) | Status |
| --- | --- | --- | --- |
| Standard Shuffle | 1.0x | 1.0x | ✅ PASS |
| WeightedRandomSampler | 781.2x | 1.0x | ✅ PASS |
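
The contents of verify_randomness.py are not shown here, so the following is only a hypothetical sketch of one way a ratio check of this kind might be written. It covers the uniform case only: it draws batches uniformly without replacement, measures how often one fixed record is included, and compares that frequency with the rate `batch_size / N` that would be handed to the accountant. The function name, dataset size, and step count are all assumptions.

```python
import random


def empirical_inclusion_rate(dataset_size, batch_size, steps, seed=0):
    """Fraction of steps in which record 0 lands in a uniformly drawn batch."""
    rng = random.Random(seed)
    tracked = 0  # index of the record whose inclusion we count
    hits = 0
    for _ in range(steps):
        batch = rng.sample(range(dataset_size), batch_size)
        if tracked in batch:
            hits += 1
    return hits / steps


dataset_size, batch_size = 1_000, 50
accounted_rate = batch_size / dataset_size  # what the accountant would see
observed_rate = empirical_inclusion_rate(dataset_size, batch_size, steps=20_000)
ratio = observed_rate / accounted_rate

print(f"accounted={accounted_rate:.4f} observed={observed_rate:.4f} ratio={ratio:.2f}x")
```

A ratio close to 1.0x indicates that the accounted rate matches the observed sampling frequency. A weighted sampler would need a per-record analysis, since inclusion probabilities then depend on each record's weight.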

…rch#813)

WeightedRandomSampler caused sample_rate to be computed from num_samples
instead of dataset size, burning epsilon 781x faster than expected silently.

Fix: compute sample_rate as batch_size / len(dataset) which is correct
for all sampler types. Also adds UserWarning when WeightedRandomSampler
is detected. Same fix applied to DPDataLoader.from_data_loader().
@meta-cla meta-cla Bot added the CLA Signed label Apr 27, 2026
@meta-codesync

meta-codesync Bot commented Apr 27, 2026

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D102646860. (Because this pull request was imported automatically, there will not be any future comments.)

@intagliated
Author

Hi @HuanyuZhang, I noticed this PR was imported into Meta's internal system (D98158224) and assigned a month ago.
#814 is a PR on the same issue and it was already merged. Just checking in to see if there will be any feedback from the internal review team or if there are any additional changes needed on my end to move this toward a merge.

Thanks for your time.
