[Upstream] Update megatron version to dev branch (Feb 13) and rebase modifications by guapisolo · Pull Request #13 · radixark/Megatron-LM

guapisolo · 2026-02-27T03:48:16Z

This PR rebases from the radixark Megatron fork miles-20260218 and resolves conflicts.

Upgrade Megatron from Dec 17 (3714d81) to Feb 13 (1dcf0da)

This PR has been reviewed by Yueming.

The description was generated by Claude Code and reviewed by me.

The dp_reshardable fix should be removed after [DO NOT MERGE] Upstream feb 26 #15.
The MTP R3 bypass should also be simplified after [DO NOT MERGE] Upstream feb 26 #15.

Rebase Diff Report: miles-main vs rebase-miles-main

Base branches:

miles-main (rdxa/miles-main) diverges from rdxa/dev at commit 3714d81d4 (older base)
rebase-miles-main rebases the same feature set onto rdxa/dev commit 1dcf0dafa

Commit Mapping

| # | miles-main | rebase-miles-main | Status |
| --- | --- | --- | --- |
| 1 | `[1/8]` fix: misc compatibility fixes for PyTorch and TE | `[1/8]` same | Equivalent intent; minor context adaptation on `rdxa/dev` |
| 2 | `[2/8]` support partial checkpoint loading | `[2/8]` same | Equivalent intent; minor context adaptation on `rdxa/dev` |
| 3 | `[3/8]` post-attention and post-MLP layernorm | `[3/8]` same | Modified; adapted to rdxa/dev structure |
| 4 | `[4/8]` MLA RoPE triton kernel fix | `[4/8]` same | Identical |
| 5 | `[5/8]` detach output layer params for RL | `[5/8]` support MTP training in RL | Dropped as standalone; behavior absorbed and rewritten in rebase [5/8] |
| 6 | `[6/8]` support MTP training in RL | (mapped to rebase [5/8]) | Rewritten commit message + implementation expanded |
| 7 | `[7/8]` support rollout routing replay (R3) | `[6/8]` R3 + bypass for MTP layers | Merged with [fix], adapted to rdxa/dev |
| - | `[fix]` bypass r3 for mtp layer | (merged into rebase [6/8]) | Merged |
| 8 | `[8/8]` INT4 fake QAT for MoE | `[7/8]` same | Identical (renumbered) |
| - | (not present) | `[8/8]` fix: CUDA IPC incompatibility from Megatron bump | New; fixes rdxa/dev upstream conflict causing IPC failure |
| - | (not present) | `[9/8]` fix: dp_reshardable checkpoint backward compat | New; direct Megatron core fixes |

Detailed Differences

[3/8] feat: add post-attention and post-MLP layernorm support

Files changed: gpt_layer_specs.py, transformer_layer.py, transformer_config.py, arguments.py

`gpt_layer_specs.py` — structural adaptation

miles-main: Adds layernorm spec via get_transformer_layer_spec_for_backend() helper — only one insertion point needed
rebase-miles-main: Adds layernorm spec inline in both MLA and non-MLA code paths (two insertion points)
Reason: rdxa/dev removed the get_transformer_layer_spec_for_backend() helper and inlined spec construction into two separate branches. Functionally equivalent.

`transformer_layer.py` — identical

Both versions add the same +24 lines: dataclass fields, module build in __init__, and layernorm application after self-attention and MLP outputs. No difference.

Note: An earlier version of the rebase had a bug (duplicate recompute_pre_mlp_layernorm block) which was already fixed before this report.

`transformer_config.py` — minor difference

miles-main: Adds post_self_attn_layernorm, post_mlp_layernorm, and use_gated_attention fields
rebase-miles-main: Adds post_self_attn_layernorm and post_mlp_layernorm only
Reason: use_gated_attention has no consumers in either codebase; dropped during rebase.

`arguments.py` — different mechanism

miles-main: Adds 3 manual group.add_argument() calls (--post-self-attn-layernorm, --post-mlp-layernorm, --use-gated-attention) + 2 explicit kw_args[...] = args.xxx assignments in core_transformer_config_from_args()
rebase-miles-main: Removes post_self_attn_layernorm and post_mlp_layernorm from the exclude list in ArgumentGroupFactory(TransformerConfig, exclude=exclude) + keeps the 2 explicit kw_args assignments (redundant but harmless)
Reason: rdxa/dev uses ArgumentGroupFactory to auto-generate CLI args from TransformerConfig dataclass fields. Adding manual add_argument() for the same fields causes an argparse conflict error (conflicting option string). Removing them from the exclude list lets the auto-generation handle it. The explicit kw_args assignments are redundant (the auto-loop in core_transformer_config_from_args already does this) but kept for clarity.
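The argparse conflict described above is easy to reproduce outside Megatron. The sketch below is a minimal stand-in, not the real `ArgumentGroupFactory`: `ToyConfig` and `add_args_from_dataclass` are hypothetical names invented for illustration, and only the standard-library `argparse` and `dataclasses` behavior is real.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ToyConfig:
    # Hypothetical stand-in for TransformerConfig (illustration only).
    post_self_attn_layernorm: bool = False
    post_mlp_layernorm: bool = False
    hidden_size: int = 1024


def add_args_from_dataclass(parser, config_cls, exclude=()):
    """Hypothetical helper: create one CLI flag per dataclass field not in `exclude`."""
    group = parser.add_argument_group(config_cls.__name__)
    for f in fields(config_cls):
        if f.name in exclude:
            continue
        flag = "--" + f.name.replace("_", "-")
        if f.type is bool:
            group.add_argument(flag, action="store_true", default=f.default)
        else:
            group.add_argument(flag, type=f.type, default=f.default)
    return group


parser = argparse.ArgumentParser()
# Fields are NOT excluded, so the flags are auto-generated.
add_args_from_dataclass(parser, ToyConfig, exclude=())

# A manual add_argument() for an auto-generated flag is rejected by argparse.
try:
    parser.add_argument("--post-mlp-layernorm", action="store_true")
    conflict = None
except argparse.ArgumentError as err:
    conflict = str(err)

args = parser.parse_args(["--post-mlp-layernorm"])
# Auto-loop that copies every dataclass field from the parsed args.
kw_args = {f.name: getattr(args, f.name) for f in fields(ToyConfig)}
config = ToyConfig(**kw_args)
```

With auto-generation in place, an extra explicit `kw_args[...] = args.xxx` assignment for the same field writes the value the loop already copied, which is why it is redundant but harmless.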

[5/8] feat: support MTP training in RL (rebased from miles-main [6/8], with [5/8] detach logic folded in)

miles-main had two separate commits:

[5/8] detach output layer params — edits existing compute_output_layer_and_language_model_loss() in language_module.py (1 file, 9 insertions, 1 deletion)
[6/8] support MTP training in RL — adds mtp_kwargs interface, changes MTP label/loss_mask flow in gpt_model.py + multi_token_prediction.py

In range-diff, miles-main [5/8] is dropped as a standalone commit, and miles-main [6/8] is rewritten into rebase [5/8] with an expanded commit message that includes detach + MTP behavior.

Commit message vs actual behavior (miles-main)

miles-main [5/8] message says "detach output layer params for RL training", but code only changes the non-fused branch in language_module.py (functional_call with detached module params). The fused linear_cross_entropy path still uses the caller-provided weight as-is.
miles-main [6/8] message says "support MTP training in RL". It adds mtp_kwargs and MTP flow changes, and calls compute_output_layer_and_language_model_loss(...) from MTP path, but fused-path detach is still not fully enforced there.
rebase [5/8] message explicitly combines these concerns (detach + mtp_kwargs + label/loss_mask roll), and implementation detaches both fused and non-fused MTP output-layer paths in gpt_model.py.

Detach implementation — where it lives

| Aspect | miles-main [5/8]+[6/8] | rebase-miles-main [5/8] |
| --- | --- | --- |
| Where | Existing `compute_output_layer_and_language_model_loss()` in `language_module.py` + MTP callsite in `gpt_model.py` | Inline in `gpt_model.py` `_postprocess` |
| `language_module.py` | Modified existing method only (small delta: `functional_call` detach in non-fused branch) | Unchanged from rdxa/dev |
| Non-fused detach | `functional_call` with all params detached + `col_linear_kwargs={'weight': output_weight.detach()}` | Same approach: `functional_call` with all params detached + `weight=output_weight.detach()` in kwargs |
| Fused detach | `weight` parameter comes from caller as `self.shared_embedding_or_output_weight()`; NOT detached | `weight=output_weight.detach()`; correctly detached |

Key difference: In miles-main, the fused path (linear_cross_entropy) receives the weight parameter from the method signature, which is passed by the caller in gpt_model.py as self.shared_embedding_or_output_weight() without .detach(). This means the fused path does not block MTP gradient from flowing back to the output layer. The non-fused path is correct in both versions.

rebase-miles-main fixes this by detaching weight in both paths at the call site.

`mtp_kwargs` interface — identical

Both versions add the same mtp_kwargs: Optional[dict] = {} parameter, the same mtp_labels sourcing from mtp_kwargs['mtp_labels'], and the same loss_mask roll logic.

`multi_token_prediction.py` — identical

Both versions make the same changes: position_ids None check, decoder_input.detach(), keep_graph=False, and _checkpointed_forward rewrite to support non-tensor arguments.

Gradient flow (both versions intend the same behavior)

hidden_states → output_layer(detached params) → mtp_loss
                     ✗ gradient blocked to output layer
    ↓
MTP layer params ✓ trained normally
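The gradient flow above can be checked in isolation with a small PyTorch sketch. This is not the code from `gpt_model.py`: the two `torch.nn.Linear` layers, their sizes, the random inputs, and the `run` helper are stand-ins chosen for illustration. Only the use of `torch.func.functional_call` with detached parameters and of `weight.detach()` mirrors the approach described in this report. The plain `F.linear` call is an assumed analogue of a path that receives the weight as an argument, not the fused `linear_cross_entropy` kernel itself.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

torch.manual_seed(0)
mtp_layer = torch.nn.Linear(8, 8)                  # stand-in for an MTP layer
output_layer = torch.nn.Linear(8, 16, bias=False)  # stand-in for the output layer
x = torch.randn(4, 8)
labels = torch.randint(0, 16, (4,))


def run(logits_fn):
    """Run one backward pass; report which modules received a gradient."""
    mtp_layer.zero_grad(set_to_none=True)
    output_layer.zero_grad(set_to_none=True)
    hidden = mtp_layer(x)
    loss = F.cross_entropy(logits_fn(hidden), labels)
    loss.backward()
    return (mtp_layer.weight.grad is not None,
            output_layer.weight.grad is not None)


# Module call with every parameter detached (the non-fused approach).
detached_params = {n: p.detach() for n, p in output_layer.named_parameters()}
non_fused = run(lambda h: functional_call(output_layer, detached_params, (h,)))

# Weight passed as an argument WITHOUT detach: gradient leaks to the output layer.
leaky = run(lambda h: F.linear(h, output_layer.weight))

# Weight passed as an argument WITH detach: gradient is blocked.
fixed = run(lambda h: F.linear(h, output_layer.weight.detach()))
```

In each case the result is a pair `(mtp_layer got grad, output_layer got grad)`. The detached variants train the MTP layer while leaving the output layer untouched; the un-detached variant updates both.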

[6/8] feat: support rollout routing replay (R3) and bypass for MTP layers (was miles-main [7/8] + [fix])

miles-main had two separate commits:

[7/8] support rollout routing replay — adds R3 integration (2 files, 6 lines)
[fix] bypass r3 for mtp layer — adds is_mtp bypass (4 files, 13 lines)

rebase-miles-main merges both into a single [6/8], and additionally replaces rdxa/dev's built-in RouterReplay class.

API — identical

| Aspect | miles-main [7/8] + [fix] | rebase-miles-main [6/8] |
| --- | --- | --- |
| Import path | `from miles.utils.replay_base import routing_replay_manager` | Same |
| Registration | `routing_replay_manager.register_to_module(self, "routing_replay")` | Same |
| topk wrapping | `routing_replay_manager.get_topk_fn(compute_topk, return_probs=True)` | `routing_replay_manager.get_topk_fn(_compute_topk, return_probs=True)`; rdxa/dev renamed the internal function, functionally equivalent |
| MTP bypass (`is_mtp`) | `[fix]` adds `is_mtp` flag, `set_is_mtp()` method, bypass logic | Same logic, merged into single commit |
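For readers unfamiliar with the pattern, the following toy sketch illustrates the general idea of wrapping a top-k function so that routing decisions can be recorded, replayed, or bypassed. It is not the `miles.utils.replay_base` implementation, whose internals are not shown in this PR. `ToyRoutingReplay`, its `wrap` method, and the `replaying` attribute are hypothetical; only the `is_mtp` flag and the `set_is_mtp()` method name are borrowed from the table above.

```python
from typing import Callable, List, Sequence


def compute_topk(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


class ToyRoutingReplay:
    """Hypothetical record/replay wrapper (illustration only)."""

    def __init__(self) -> None:
        self.recorded: List[List[int]] = []
        self.cursor = 0
        self.replaying = False
        self.is_mtp = False

    def set_is_mtp(self, flag: bool = True) -> None:
        self.is_mtp = flag

    def wrap(self, topk_fn: Callable[[Sequence[float], int], List[int]]):
        def wrapped(scores: Sequence[float], k: int) -> List[int]:
            if self.is_mtp:
                # Bypass: MTP layers always route from their own scores.
                return topk_fn(scores, k)
            if self.replaying:
                # Replay: ignore the scores, reuse the recorded decision.
                idx = self.recorded[self.cursor]
                self.cursor += 1
                return idx
            idx = topk_fn(scores, k)
            self.recorded.append(idx)
            return idx
        return wrapped


replay = ToyRoutingReplay()
topk = replay.wrap(compute_topk)

recorded = topk([0.1, 0.9, 0.5], 2)   # record pass

replay.replaying = True
replayed = topk([0.9, 0.1, 0.5], 2)   # different scores, same routing

replay.set_is_mtp(True)
bypassed = topk([0.9, 0.1, 0.5], 2)   # MTP layer: computed from its own scores
```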

rdxa/dev `RouterReplay` handling

rdxa/dev has its own built-in RouterReplay class (router_replay.py), with moe_enable_routing_replay config and parameter-passing style integration. rebase-miles-main [6/8]:

Removes all RouterReplay imports and usage from router.py and moe_utils.py
Removes router_replay=self.router_replay parameter passing
Replaces with miles.utils.replay_base.routing_replay_manager (same as miles-main)
router_replay.py file still exists but is dead code (no imports remain)

[8/8] fix: CUDA IPC incompatibility from Megatron bump (new)

Not present on miles-main. Commit message documents a failure after rebasing to rdxa/dev: colocated IPC weight update hits torch.AcceleratorError: CUDA error: invalid argument during CUDA tensor serialization with torch.multiprocessing.

Motivation (from commit message): the torch_memory_saver (TMS) hook behavior brought in by the upstream Megatron bump can make allocator behavior IPC-incompatible in this flow.

Code-level fix in this commit:

dynamic_context.py — disables torch_memory_saver hook mode in this context (HAVE_TORCH_MEMORY_SAVER = False)

The bug was introduced by the upstream commit NVIDIA@42986ac.

[9/8] fix: dp_reshardable checkpoint backward compat in Megatron core (new)

Background: When loading a dp_reshardable checkpoint saved with a different DP world size, bucket counts may differ. Megatron's sharded_param_state_dp_reshardable pads the bucket state list with {"padding": True} entries for alignment, but the loading side had three bugs:

dict_utils.merge list length mismatch — merge() raises ValueError when shard file lists have different lengths (extra padding entries from save side). Fix: detect optimizer/param_state paths via _is_optimizer_param_state_key() and truncate the longer list (x1) to match x2.
distrib_optimizer.py KeyError on ['padding'] — Old checkpoint entries lack the padding field entirely. Fix: bucket_state_elem.get('padding', False) instead of bucket_state_elem['padding'].
mapping.py flattened_range exception — ShardedTensor.__init__ raises CheckpointingException for flattened_range. Fix: downgrade to deprecation warning (logged once) for backward compat with older checkpoint formats.
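The first two fixes can be illustrated with a simplified sketch. The `merge` below is a toy recursive merge written for this example, not the code in `dict_utils.py`, and the body of `is_optimizer_param_state_key` is a guess at what such a predicate might check; the real `_is_optimizer_param_state_key()` may differ. The shard contents are made-up data.

```python
from typing import Any, Tuple


def is_optimizer_param_state_key(key: Tuple[Any, ...]) -> bool:
    """Hypothetical predicate: True for paths under optimizer -> param_state."""
    return "optimizer" in key and "param_state" in key


def merge(x1: Any, x2: Any, key: Tuple[Any, ...] = ()) -> Any:
    """Toy merge of x2 into x1 (in place for dicts and lists)."""
    if isinstance(x1, dict) and isinstance(x2, dict):
        for k, v2 in x2.items():
            x1[k] = merge(x1[k], v2, key + (k,)) if k in x1 else v2
        return x1
    if isinstance(x1, list) and isinstance(x2, list):
        if len(x1) != len(x2):
            if is_optimizer_param_state_key(key) and len(x1) > len(x2):
                # Drop trailing entries (e.g. padding) so the lengths match.
                del x1[len(x2):]
            else:
                raise ValueError(
                    f"list length mismatch at {key}: {len(x1)} vs {len(x2)}"
                )
        for i, v2 in enumerate(x2):
            x1[i] = merge(x1[i], v2, key + (i,))
        return x1
    raise ValueError(f"duplicate non-container value at {key}")


shard_a = {"optimizer": {"param_state": [
    {"exp_avg": 1}, {"exp_avg": 2}, {"padding": True},
]}}
shard_b = {"optimizer": {"param_state": [
    {"exp_avg_sq": 10}, {"exp_avg_sq": 20},
]}}
merged = merge(shard_a, shard_b)

# Outside optimizer/param_state paths, a mismatch is still an error.
try:
    merge({"model": [1, 2, 3]}, {"model": [4, 5]})
    strict_error = False
except ValueError:
    strict_error = True

# Old checkpoint entries have no 'padding' key; .get() treats them as unpadded.
old_entry, padded_entry = {"param": "w"}, {"padding": True}
padding_flags = [e.get("padding", False) for e in (old_entry, padded_entry)]
```

Restricting the truncation to optimizer param-state paths keeps the strict length check everywhere else, so unrelated shard mismatches still fail loudly.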

Files changed:

megatron/core/dist_checkpointing/dict_utils.py — _is_optimizer_param_state_key() helper + merge() truncation logic
megatron/core/optimizer/distrib_optimizer.py — .get('padding', False) in load_parameter_state_from_dp_reshardable
megatron/core/dist_checkpointing/mapping.py — flattened_range deprecation warning

Summary of Changes vs miles-main

| Change | Location | Description |
| --- | --- | --- |
| Adapted to rdxa/dev code structure | `gpt_layer_specs.py` | Inline spec in MLA/non-MLA paths (no helper function) |
| Adapted to rdxa/dev arg mechanism | `arguments.py` | Use `ArgumentGroupFactory` exclude list instead of manual `add_argument` |
| Fused-path detach fix | `gpt_model.py` | miles-main fused path doesn't detach weight; rebase correctly detaches |
| Inline detach placement | `gpt_model.py` vs `language_module.py` | Rebase inlines detach logic in `_postprocess`; miles-main touched existing `language_module.py` method (no new method added) |
| Replace rdxa/dev RouterReplay | `moe_utils.py`, `router.py` | rdxa/dev's built-in `RouterReplay` replaced with `miles.utils.replay_base` |
| Dropped `use_gated_attention` | `transformer_config.py`, `arguments.py` | No consumers in codebase |
| New: CUDA IPC fix | `mapping.py`, `dynamic_context.py` | Megatron bump broke IPC via TMS hook allocator change; isolate side effects |
| New: dp_reshardable compat | `dict_utils.py`, `distrib_optimizer.py`, `mapping.py` | Direct Megatron core fixes replacing Miles monkey-patches |

guapisolo mentioned this pull request Feb 27, 2026

[Docker] Megatron version bump to Feb 13 and upgrade fla==0.4.1 radixark/miles#643

Merged

guapisolo force-pushed the upstream/1dcf0dafa branch 2 times, most recently from 4ac0f2f to 07220d2 Compare February 28, 2026 01:02

guapisolo changed the title ~~[Upstream] Update megatron version to latest dev branch and rebase modifications~~ [Upstream] Update megatron version to dev branch (Feb 13) and rebase modifications Mar 4, 2026

guapisolo merged commit 038e8e5 into miles-main Mar 4, 2026

guapisolo force-pushed the miles-main branch from 992c0a2 to 038e8e5 Compare March 4, 2026 22:59
