
Fix GitHub issue #792: Fast gradient clipping ignores ignore_index masking #808

Closed
HuanyuZhang wants to merge 1 commit into meta-pytorch:main from HuanyuZhang:export-D95489302

@HuanyuZhang
Contributor

Summary:
Context/Motivation: Fixes #792

When using fast/ghost gradient clipping for NLP tasks, DPLossFastGradientClipping
computes per-sample mean loss via .mean(dim=1), which divides by the full sequence
length. This ignores the ignore_index parameter from the criterion (e.g.,
CrossEntropyLoss(ignore_index=-100)), causing masked/padded positions to dilute
the loss. For tasks like SQuAD where only a few tokens are real targets out of a
long sequence, the loss becomes orders of magnitude too small, preventing training.
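
The dilution described above can be checked with a little arithmetic. The following is a minimal sketch in plain Python, not code from the PR; the sequence length, target positions, and per-token loss value are hypothetical numbers chosen only for illustration.

```python
IGNORE_INDEX = -100  # same value as in the CrossEntropyLoss example above

# Hypothetical sample: 384 positions, only 2 of which are real targets.
seq_len = 384
targets = [IGNORE_INDEX] * seq_len
targets[17], targets[21] = 5, 9  # arbitrary class ids at arbitrary positions

# Assume each real target has a per-token loss of 2.0; ignored positions
# contribute 0, as they do with reduction="none".
per_token_loss = [2.0 if t != IGNORE_INDEX else 0.0 for t in targets]

# Mean over the full sequence length (the behaviour reported in the issue).
full_mean = sum(per_token_loss) / seq_len

# Mean over non-ignored positions only (the intended behaviour).
n_valid = sum(t != IGNORE_INDEX for t in targets)
masked_mean = sum(per_token_loss) / n_valid

# How much smaller the full-length mean is than the masked mean.
dilution = masked_mean / full_mean
print(full_mean, masked_mean, dilution)
```

In this made-up case the full-length mean is smaller than the masked mean by a factor of `seq_len / n_valid`, here roughly 192.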

This diff:

  • Modifies DPLossFastGradientClipping.__call__() to check for ignore_index on the
    criterion and, when it is present, compute the mean only over non-ignored positions
  • Adds a regression test, github_issue_test.py, verifying that ignore_index is
    respected for both mean and sum reductions, plus a backwards-compatibility test
    for the no-masking case
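
The idea behind the first bullet can be sketched as follows. This is an illustrative sketch only, assuming PyTorch is installed; it is not the code merged in this PR, and the helper name `per_sample_masked_mean` and the tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def per_sample_masked_mean(logits, targets, ignore_index=IGNORE_INDEX):
    """Per-sample mean token loss over non-ignored positions only.

    logits: (batch, seq_len, num_classes); targets: (batch, seq_len).
    Hypothetical helper, not part of the Opacus API.
    """
    # cross_entropy expects the class dimension second: (batch, classes, seq_len).
    token_loss = F.cross_entropy(
        logits.transpose(1, 2),
        targets,
        ignore_index=ignore_index,
        reduction="none",
    )  # shape (batch, seq_len); ignored positions contribute 0
    valid = targets != ignore_index
    # clamp avoids division by zero for a sample with no valid targets.
    n_valid = valid.sum(dim=1).clamp(min=1)
    return token_loss.sum(dim=1) / n_valid


torch.manual_seed(0)
logits = torch.randn(2, 6, 5)
targets = torch.full((2, 6), IGNORE_INDEX, dtype=torch.long)
targets[0, 1] = 3  # sample 0 has one real target
targets[1, 0] = 1  # sample 1 has two real targets
targets[1, 4] = 2

token_loss = F.cross_entropy(
    logits.transpose(1, 2), targets, ignore_index=IGNORE_INDEX, reduction="none"
)
naive = token_loss.mean(dim=1)  # divides by the full sequence length
masked = per_sample_masked_mean(logits, targets)
print(naive, masked)
```

With 6 positions per sample, the naive per-sample mean is smaller than the masked mean by a factor of 6 for the first sample and 3 for the second.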

Differential Revision: D95489302

meta-codesync Bot commented Mar 6, 2026

@HuanyuZhang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95489302.

meta-cla Bot added the CLA Signed label Mar 6, 2026
HuanyuZhang added a commit to HuanyuZhang/opacus that referenced this pull request Mar 6, 2026
HuanyuZhang force-pushed the export-D95489302 branch 2 times, most recently from c2efd81 to 1e0a727 on March 7, 2026 14:40
HuanyuZhang added a commit to HuanyuZhang/opacus that referenced this pull request Mar 7, 2026
Reviewed By: aparna-aketi

meta-codesync Bot commented Mar 9, 2026

This pull request has been merged in 8493eeb.

Labels

CLA Signed, fb-exported, Merged, meta-exported

Development

Successfully merging this pull request may close these issues.

Fast gradient clipping ignores masking
