Fix GitHub issue #792: Fast gradient clipping ignores ignore_index masking#808
Closed
HuanyuZhang wants to merge 1 commit into
Closed
Fix GitHub issue #792: Fast gradient clipping ignores ignore_index masking#808HuanyuZhang wants to merge 1 commit into
HuanyuZhang wants to merge 1 commit into
Conversation
|
@HuanyuZhang has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95489302. |
HuanyuZhang
added a commit
to HuanyuZhang/opacus
that referenced
this pull request
Mar 6, 2026
…ore_index masking (meta-pytorch#808) Summary: Context/Motivation: Fixes meta-pytorch#792 When using fast/ghost gradient clipping for NLP tasks, `DPLossFastGradientClipping` computes per-sample mean loss via `.mean(dim=1)`, which divides by the full sequence length. This ignores the `ignore_index` parameter from the criterion (e.g., `CrossEntropyLoss(ignore_index=-100)`), causing masked/padded positions to dilute the loss. For tasks like SQuAD where only a few tokens are real targets out of a long sequence, the loss becomes orders of magnitude too small, preventing training. This diff: - Modified `DPLossFastGradientClipping.__call__()` to check for `ignore_index` on the criterion and compute mean only over non-ignored positions when present - Added regression test `github_issue_test.py` verifying ignore_index is respected for both mean and sum reductions, plus a backwards-compatibility test for the no-masking case Differential Revision: D95489302
c2efd81 to
1e0a727
Compare
HuanyuZhang
added a commit
to HuanyuZhang/opacus
that referenced
this pull request
Mar 7, 2026
…ore_index masking (meta-pytorch#808) Summary: Context/Motivation: Fixes meta-pytorch#792 When using fast/ghost gradient clipping for NLP tasks, `DPLossFastGradientClipping` computes per-sample mean loss via `.mean(dim=1)`, which divides by the full sequence length. This ignores the `ignore_index` parameter from the criterion (e.g., `CrossEntropyLoss(ignore_index=-100)`), causing masked/padded positions to dilute the loss. For tasks like SQuAD where only a few tokens are real targets out of a long sequence, the loss becomes orders of magnitude too small, preventing training. This diff: - Modified `DPLossFastGradientClipping.__call__()` to check for `ignore_index` on the criterion and compute mean only over non-ignored positions when present - Added regression test `github_issue_test.py` verifying ignore_index is respected for both mean and sum reductions, plus a backwards-compatibility test for the no-masking case Differential Revision: D95489302
…ore_index masking (meta-pytorch#808) Summary: Context/Motivation: Fixes meta-pytorch#792 When using fast/ghost gradient clipping for NLP tasks, `DPLossFastGradientClipping` computes per-sample mean loss via `.mean(dim=1)`, which divides by the full sequence length. This ignores the `ignore_index` parameter from the criterion (e.g., `CrossEntropyLoss(ignore_index=-100)`), causing masked/padded positions to dilute the loss. For tasks like SQuAD where only a few tokens are real targets out of a long sequence, the loss becomes orders of magnitude too small, preventing training. This diff: - Modified `DPLossFastGradientClipping.__call__()` to check for `ignore_index` on the criterion and compute mean only over non-ignored positions when present - Added regression test `github_issue_test.py` verifying ignore_index is respected for both mean and sum reductions, plus a backwards-compatibility test for the no-masking case Reviewed By: aparna-aketi Differential Revision: D95489302
1e0a727 to
9b82a22
Compare
|
This pull request has been merged in 8493eeb. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Context/Motivation: Fixes #792
When using fast/ghost gradient clipping for NLP tasks,
DPLossFastGradientClippingcomputes per-sample mean loss via
.mean(dim=1), which divides by the full sequencelength. This ignores the
ignore_indexparameter from the criterion (e.g.,CrossEntropyLoss(ignore_index=-100)), causing masked/padded positions to dilutethe loss. For tasks like SQuAD where only a few tokens are real targets out of a
long sequence, the loss becomes orders of magnitude too small, preventing training.
This diff:
DPLossFastGradientClipping.__call__()to check forignore_indexon thecriterion and compute mean only over non-ignored positions when present
github_issue_test.pyverifying ignore_index is respectedfor both mean and sum reductions, plus a backwards-compatibility test for the
no-masking case
Differential Revision: D95489302