Fix MTP recompute padding mask forwarding by BestJuly · Pull Request #4963 · NVIDIA/Megatron-LM

BestJuly · 2026-05-25T02:06:00Z

Summary

Fix MultiTokenPredictionLayer full-recompute path to accept and forward padding_mask.
Resolve the TypeError reported in #4933 (🐛 CI failure: MultiTokenPredictionLayer._checkpointed_forward() got unexpected kwarg 'padding_mask'), which is raised when the recompute path calls _checkpointed_forward with padding_mask.
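The failure mode can be illustrated with a minimal, framework-free sketch. The classes below are hypothetical stand-ins, not the Megatron-LM implementation: they only borrow the method names from this PR, and the "zero out masked positions" computation is an assumption made so the effect of the mask is visible.

```python
class BrokenLayer:
    """Toy stand-in: forward passes padding_mask, but the recompute helper rejects it."""

    def _checkpointed_forward(self, hidden_states):
        # The signature has no padding_mask parameter.
        return list(hidden_states)

    def forward(self, hidden_states, padding_mask=None):
        # Raises TypeError: unexpected keyword argument 'padding_mask'.
        return self._checkpointed_forward(hidden_states, padding_mask=padding_mask)


class FixedLayer:
    """Toy stand-in: padding_mask is accepted and forwarded at every hop."""

    def _proj_and_transformer_layer(self, hidden_states, padding_mask=None):
        # Assumed toy semantics: a True mask entry zeroes that position.
        if padding_mask is None:
            return list(hidden_states)
        return [0 if masked else h for h, masked in zip(hidden_states, padding_mask)]

    def _checkpointed_forward(self, hidden_states, padding_mask=None):
        def custom_forward(hidden_states, padding_mask):
            return self._proj_and_transformer_layer(
                hidden_states, padding_mask=padding_mask
            )

        return custom_forward(hidden_states, padding_mask)

    def forward(self, hidden_states, padding_mask=None):
        return self._checkpointed_forward(hidden_states, padding_mask=padding_mask)
```

Calling BrokenLayer().forward(...) with a padding_mask fails at the call site, while FixedLayer carries the mask through to the innermost function.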

Validation

PYTHONPATH=.venv/lib/python3.12/site-packages:$PYTHONPATH /usr/bin/python3.12 -m torch.distributed.run --standalone --nproc-per-node 8 -m pytest -q --tb=short --capture=fd tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[1-1-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[1-2-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[1-4-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[2-1-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[2-2-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[2-4-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[4-1-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_forward_backward[4-2-True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_fp8_support[True] tests/unit_tests/transformer/test_multi_token_prediction.py::TestMultiTokenPrediction::test_packed_sequences_with_full_recompute
PYTHONPATH=.venv/lib/python3.12/site-packages:$PYTHONPATH /usr/bin/python3.12 -m torch.distributed.run --standalone --nproc-per-node 8 -m pytest -q --tb=short --capture=fd tests/unit_tests/distributed/megatron_fsdp/test_mcore_fully_sharded_data_parallel.py::TestMegatronFSDPE2E::test_compatible_with_nd_parallel[optim_grads_params_double_buffer-TP2]
PYTHONPATH=.venv/lib/python3.12/site-packages:$PYTHONPATH /usr/bin/python3.12 -m torch.distributed.run --standalone --nproc-per-node 8 -m pytest -q --tb=short --capture=fd tests/unit_tests/distributed/megatron_fsdp/test_mcore_fully_sharded_data_parallel.py::TestMegatronFSDPE2E::test_compatible_with_nd_parallel[optim_grads_params_double_buffer-EP2_ETP2] tests/unit_tests/distributed/megatron_fsdp/test_mcore_fully_sharded_data_parallel.py::TestMegatronFSDPE2E::test_compatible_with_nd_parallel[optim_grads_params_double_buffer-OUTER_DP2_EP2]

Dev branch

Checked latest origin/dev (56481b050) and this issue is not present there: megatron/core/transformer/multi_token_prediction.py contains no padding_mask usage, so there is no _checkpointed_forward/padding_mask signature mismatch to fix. No dev PR is needed.
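A signature mismatch of this kind can also be checked programmatically. The helper below is a sketch using the standard-library inspect module; accepts_keyword and the two toy functions are hypothetical names for illustration and are not part of Megatron-LM or of the check described above.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if param is not None and param.kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical callee signatures, before and after adding the parameter.
def checkpointed_forward_old(hidden_states, attention_mask=None):
    return hidden_states


def checkpointed_forward_new(hidden_states, attention_mask=None, padding_mask=None):
    return hidden_states
```

Here accepts_keyword(checkpointed_forward_old, "padding_mask") is False and the same call on checkpointed_forward_new is True, which mirrors the difference between a caller/callee mismatch and a consistent pair.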

copy-pr-bot · 2026-05-25T02:06:03Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

BestJuly · 2026-05-26T07:56:11Z

Hi @ko3n1g, this should fix the issue. Could you help merge this? Thanks.

ko3n1g · 2026-05-26T10:26:38Z

@BestJuly could you also re-enable the tests like I did here https://github.com/NVIDIA/Megatron-LM/pull/4983/changes#diff-aacc823c21e05a0581eda86d881c4ea79ca36e6339efc02e79e72764a8f8581b?

MultiTokenPredictionLayer.forward calls self._checkpointed_forward(padding_mask=padding_mask, ...) (multi_token_prediction.py:1305), but _checkpointed_forward and its inner custom_forward never accepted padding_mask. With recompute_granularity == 'full' and self.training, this raised:

TypeError: MultiTokenPredictionLayer._checkpointed_forward() got an unexpected keyword argument 'padding_mask'

at multi_token_prediction.py:1301.

The kwarg was introduced in NVIDIA#2645 on the call site; the _checkpointed_forward refactor in NVIDIA#4593 dropped padding_mask from the recompute path.

Add padding_mask:
* to _checkpointed_forward's signature
* to custom_forward's signature so it flows into _proj_and_transformer_layer
* positionally to te_checkpoint and tensor_parallel.checkpoint, matching the other tensor / None args (padding_mask is a rolled tensor, not a non-tensor closure-captured arg like attention_bias)
* to the recompute_method == 'block' fallback that also calls _proj_and_transformer_layer directly

Also remove the @pytest.mark.flaky_in_dev markers from test_forward_backward, test_fp8_support, and test_packed_sequences_with_full_recompute, which were added in NVIDIA#4931 to mask this exact failure.

Closes NVIDIA#4933

Signed-off-by: oliver könig <okoenig@nvidia.com>
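The commit message above distinguishes arguments passed positionally to the checkpoint wrapper from arguments captured by the inner closure. The sketch below illustrates that pattern in plain Python. toy_checkpoint is a hypothetical stand-in that merely saves its positional arguments and can re-run the function; it is not te_checkpoint or tensor_parallel.checkpoint, and the layer arithmetic is an assumption chosen to keep the example small.

```python
def toy_checkpoint(function, *args):
    """Hypothetical wrapper: run function now and keep args for a later re-run."""
    saved_args = args
    output = function(*saved_args)

    def recompute():
        # A real checkpoint re-runs the forward during backward; here we
        # simply call the function again with the saved positional args.
        return function(*saved_args)

    return output, recompute


def toy_layer(hidden_states, padding_mask=None, attention_bias=0.0):
    # Assumed toy semantics: add the bias, then zero masked positions.
    shifted = [h + attention_bias for h in hidden_states]
    if padding_mask is None:
        return shifted
    return [0.0 if masked else h for h, masked in zip(shifted, padding_mask)]


def checkpointed_forward(hidden_states, padding_mask=None, attention_bias=0.0):
    def custom_forward(hidden_states, padding_mask):
        # attention_bias is captured from the enclosing scope (closure),
        # while hidden_states and padding_mask arrive positionally.
        return toy_layer(
            hidden_states, padding_mask=padding_mask, attention_bias=attention_bias
        )

    # padding_mask travels next to the other tensor / None arguments.
    return toy_checkpoint(custom_forward, hidden_states, padding_mask)
```

In this sketch the recomputed result matches the first forward because the wrapper saved padding_mask together with hidden_states; passing None for the mask still works, since it is just another positional value.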

ko3n1g · 2026-05-26T11:01:22Z

/ok to test 4a8785b

ko3n1g · 2026-05-26T14:22:55Z

/ok to test 4a8785b

ko3n1g · 2026-05-26T14:28:31Z

/ok to test c89ec01

BestJuly marked this pull request as ready for review May 25, 2026 02:07

BestJuly requested review from a team as code owners May 25, 2026 02:07

svcnvidia-nemo-ci requested a review from a team May 25, 2026 02:07

copy-pr-bot Bot temporarily deployed to public May 25, 2026 02:07 Inactive

svcnvidia-nemo-ci added the complexity: low label May 25, 2026

copy-pr-bot Bot temporarily deployed to test May 25, 2026 02:08 Inactive

copy-pr-bot Bot temporarily deployed to public May 25, 2026 02:10 Inactive

copy-pr-bot Bot temporarily deployed to public May 25, 2026 02:11 Inactive

copy-pr-bot Bot temporarily deployed to public May 25, 2026 02:18 Inactive

ko3n1g approved these changes May 26, 2026

ko3n1g mentioned this pull request May 26, 2026

Fix _checkpointed_forward missing padding_mask parameter #4966

Closed

ko3n1g mentioned this pull request May 26, 2026

fix: forward padding_mask through MTP recompute path #4983

Closed

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:35 Inactive

copy-pr-bot Bot temporarily deployed to test May 26, 2026 10:35 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:38 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 10:39 Inactive

Fixes NVIDIA#4933

95652c7

copy-pr-bot Bot requested a deployment to public May 26, 2026 10:46 Abandoned

BestJuly force-pushed the lit/issue_4933 branch from b6409bd to 4a8785b Compare May 26, 2026 10:47

Merge branch 'main' into lit/issue_4933

c89ec01

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:29 Inactive

copy-pr-bot Bot temporarily deployed to test May 26, 2026 14:29 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:32 Inactive

copy-pr-bot Bot temporarily deployed to public May 26, 2026 14:40 Inactive

BestJuly wants to merge 3 commits into NVIDIA:main from BestJuly:lit/issue_4933

3 participants
