Potential mismatch between FEASGM-style normalization and SGM-based privacy accounting #819

@Eexac

🐛 Bug

I would like to report a potential privacy-accounting mismatch related to DP-SGD with Poisson sampling and gradient normalization.

I found that Opacus versions v1.0.0--v1.6.0 appear to implement a floor-based Expected-Averaged Subsampled Gaussian Mechanism (FEASGM), as indicated by the floor-based expected-batch-size normalization in the implementation:

https://github.com/meta-pytorch/opacus/blame/v1.6.0/opacus/privacy_engine.py#L441C10-L441C10

However, the privacy accounting appears to rely on the standard SGM-based analysis.

Our auditing shows that, in some small-dataset, high-dimensional-output settings where the normalizing factor differs between two neighboring datasets, the privacy leakage can be very large: the two datasets can become almost fully distinguishable. We further analyzed the mechanism under the f-DP framework:

https://academic.oup.com/jrsssb

Our analysis suggests that this occurs because the privacy guarantee deteriorates as the output dimension increases and can vanish as the output dimension approaches infinity.

This issue is related to the prior bug report:

#571

However, our findings differ from and extend that report.

As we understand it, issue #571 made the following observations:

  1. In small-dataset settings, the empirical privacy leakage was estimated to be around 2.5.
  2. Averaging by the realized batch size, which we denote as Averaged SGM (ASGM), may make the leakage bounded by the SGM guarantee (see the comment in "Privacy Leakage at low sample size", #571).
  3. In large-dataset settings, the privacy leakage may be bounded by the SGM guarantee.

In contrast, our analysis identifies the following additional findings:

  1. In some small-dataset and high-dimensional-output settings, the privacy leakage can be significantly larger, and two neighboring datasets can become almost fully distinguishable.
  2. Averaging by the realized batch size does not necessarily make the privacy leakage bounded by the SGM guarantee; ASGM can still leak more privacy than what is captured by SGM-based accounting.
  3. In some large-dataset regimes, our analysis suggests that, under practical parameter settings, the actual privacy guarantee can still be weaker than the guarantee reported by the SGM-based accountant.
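
To make the three normalizations named above concrete, here is a toy sketch of a single noisy step. It is not Opacus code: the function names, the `mode` strings, and the empty-batch guard for ASGM are assumptions made for illustration, and the noise scale `sigma * max_norm` follows the usual DP-SGD convention. The only point it illustrates is that the FEASGM-style denominator depends on the dataset size, while the plain SGM sum does not.

```python
import math
import random


def clip(vec, max_norm):
    """Scale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def noisy_step(grads, q, sigma, max_norm, mode, rng):
    """One toy step over per-record gradients (illustrative, not Opacus).

    mode: "sgm"    -> noisy clipped sum, no normalization
          "feasgm" -> divide by floor(N * q), which depends on N
          "asgm"   -> divide by the realized batch size
    Returns (output_vector, denominator_used).
    """
    n = len(grads)
    dim = len(grads[0])
    # Poisson sampling: each record is included independently with probability q.
    batch = [g for g in grads if rng.random() < q]
    total = [0.0] * dim
    for g in batch:
        for i, x in enumerate(clip(g, max_norm)):
            total[i] += x
    # Gaussian noise with standard deviation sigma * max_norm per coordinate.
    noisy = [t + rng.gauss(0.0, sigma * max_norm) for t in total]
    if mode == "sgm":
        denom = 1
    elif mode == "feasgm":
        denom = math.floor(n * q)
        if denom == 0:
            raise ValueError("floor(N * q) is 0; normalizer undefined")
    elif mode == "asgm":
        # Assumption: fall back to 1 when the sampled batch is empty.
        denom = max(1, len(batch))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [x / denom for x in noisy], denom
```

Calling `noisy_step` with `mode="feasgm"` on two datasets that differ by one record uses `floor(N * q)` and `floor((N + 1) * q)` respectively, so the two outputs may be scaled differently even before any record-level difference is considered.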

To Reproduce

This is not a runtime crash, so there is no traceback. The issue concerns the formal mechanism being accounted for.

The detailed auditing procedure, including the experimental setup, reproduction steps, empirical results, and theoretical analysis, is provided in our paper on arXiv:

https://arxiv.org/abs/2605.15648

Expected behavior

I think it would be helpful if Opacus provided a warning or documentation explaining when the implemented mechanism may not exactly match the standard SGM assumption used by the privacy accountant. Clarifying this point would help users better understand the privacy guarantee provided by Opacus.

In particular, such a warning or documentation could ask users to check whether the floor-based expected-batch-size normalizer changes between neighboring dataset sizes, e.g., from $\lfloor Nq \rfloor$ to $\lfloor (N+1)q \rfloor$, such as for datasets with 190 and 191 records. If these two normalizers differ, a potential mismatch may arise between the FEASGM-style implementation and the SGM-based privacy accounting.
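
The check described above could look roughly like the following sketch. The helper names are hypothetical and not part of Opacus, and the sampling rate in the usage note is chosen only for illustration, since the report does not state which rate was used for the 190/191 example. Exact rational arithmetic is used so that the floor is not affected by floating-point rounding.

```python
import math
from fractions import Fraction


def floor_normalizers(n, q):
    """Return (floor(n * q), floor((n + 1) * q)) for sampling rate q.

    q may be an int, float, or Fraction; it is converted to an exact
    Fraction before multiplying.
    """
    q = Fraction(q)
    return math.floor(n * q), math.floor((n + 1) * q)


def normalizer_changes(n, q):
    """True if the floor-based normalizer differs between sizes n and n + 1."""
    small, large = floor_normalizers(n, q)
    return small != large
```

For instance, with the illustrative rate `Fraction(10, 191)`, `floor_normalizers(190, Fraction(10, 191))` gives `(9, 10)`, so the normalizer changes between 190 and 191 records, whereas with `Fraction(1, 10)` both sizes give 19 and no change is flagged.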

I look forward to discussing this issue further with the maintainers and hearing your thoughts on this potential mismatch.
