
[Upstream] Update megatron version to dev branch (Feb 13) and rebase modifications #13

Merged
guapisolo merged 0 commit into miles-main from upstream/1dcf0dafa
Mar 4, 2026


@guapisolo guapisolo commented Feb 27, 2026

This PR rebases from the radixark Megatron fork miles-20260218 and resolves conflicts.

Upgrade Megatron from Dec 17 (3714d81) to Feb 13 (1dcf0da)

This PR has been reviewed by Yueming.

The description was generated by Claude Code and reviewed by me.

Rebase Diff Report: miles-main vs rebase-miles-main

Base branches:

  • miles-main (rdxa/miles-main) diverges from rdxa/dev at commit 3714d81d4 (older base)
  • rebase-miles-main rebases the same feature set onto rdxa/dev commit 1dcf0dafa

Commit Mapping

| # | miles-main | rebase-miles-main | Status |
|---|---|---|---|
| 1 | [1/8] fix: misc compatibility fixes for PyTorch and TE | [1/8] same | Equivalent intent; minor context adaptation on rdxa/dev |
| 2 | [2/8] support partial checkpoint loading | [2/8] same | Equivalent intent; minor context adaptation on rdxa/dev |
| 3 | [3/8] post-attention and post-MLP layernorm | [3/8] same | Modified; adapted to rdxa/dev structure |
| 4 | [4/8] MLA RoPE triton kernel fix | [4/8] same | Identical |
| 5 | [5/8] detach output layer params for RL | [5/8] support MTP training in RL | Dropped as standalone; behavior absorbed and rewritten in rebase [5/8] |
| 6 | [6/8] support MTP training in RL | (mapped to rebase [5/8]) | Rewritten commit message + implementation expanded |
| 7 | [7/8] support rollout routing replay (R3) | [6/8] R3 + bypass for MTP layers | Merged with [fix], adapted to rdxa/dev |
|   | [fix] bypass r3 for mtp layer | (merged into rebase [6/8]) | Merged |
| 8 | [8/8] INT4 fake QAT for MoE | [7/8] same | Identical (renumbered) |
|   | (not present) | [8/8] fix: CUDA IPC incompatibility from Megatron bump | New; fixes rdxa/dev upstream conflict causing IPC failure |
|   | (not present) | [9/8] fix: dp_reshardable checkpoint backward compat | New; direct Megatron core fixes |

Detailed Differences

[3/8] feat: add post-attention and post-MLP layernorm support

Files changed: gpt_layer_specs.py, transformer_layer.py, transformer_config.py, arguments.py

gpt_layer_specs.py — structural adaptation

  • miles-main: Adds layernorm spec via get_transformer_layer_spec_for_backend() helper — only one insertion point needed
  • rebase-miles-main: Adds layernorm spec inline in both MLA and non-MLA code paths (two insertion points)
  • Reason: rdxa/dev removed the get_transformer_layer_spec_for_backend() helper and inlined spec construction into two separate branches. Functionally equivalent.

transformer_layer.py — identical

Both versions add the same +24 lines: dataclass fields, module build in __init__, and layernorm application after self-attention and MLP outputs. No difference.

Note: An earlier version of the rebase had a bug (duplicate recompute_pre_mlp_layernorm block) which was already fixed before this report.
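
For readers unfamiliar with the pattern, here is a minimal sketch of optional post-sublayer layernorms. It is not the Megatron code: ToyLayer, the Linear stand-in for attention, and the choice to normalize each sublayer output before the residual add are all assumptions made for illustration. Only the two flag names come from this PR.

```python
import torch
from torch import nn


class ToyLayer(nn.Module):
    """Hypothetical pre-norm layer with optional post-sublayer norms."""

    def __init__(self, hidden_size, post_self_attn_layernorm=False,
                 post_mlp_layernorm=False):
        super().__init__()
        self.input_norm = nn.LayerNorm(hidden_size)
        # Stand-in for self-attention; a real layer would use an attention module.
        self.self_attention = nn.Linear(hidden_size, hidden_size)
        self.pre_mlp_norm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        # When a flag is off, the extra norm degrades to a no-op.
        self.post_self_attn_layernorm = (
            nn.LayerNorm(hidden_size) if post_self_attn_layernorm else nn.Identity()
        )
        self.post_mlp_layernorm = (
            nn.LayerNorm(hidden_size) if post_mlp_layernorm else nn.Identity()
        )

    def forward(self, x):
        # Assumption: the post-norm is applied to the sublayer output
        # before it is added back to the residual stream.
        attn_out = self.self_attention(self.input_norm(x))
        x = x + self.post_self_attn_layernorm(attn_out)
        mlp_out = self.mlp(self.pre_mlp_norm(x))
        return x + self.post_mlp_layernorm(mlp_out)
```

With both flags left at their defaults the layer behaves like a plain pre-norm block, which is why the change is backward compatible for existing configs.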

transformer_config.py — minor difference

  • miles-main: Adds post_self_attn_layernorm, post_mlp_layernorm, and use_gated_attention fields
  • rebase-miles-main: Adds post_self_attn_layernorm and post_mlp_layernorm only
  • Reason: use_gated_attention has no consumers in either codebase; dropped during rebase.

arguments.py — different mechanism

  • miles-main: Adds 3 manual group.add_argument() calls (--post-self-attn-layernorm, --post-mlp-layernorm, --use-gated-attention) + 2 explicit kw_args[...] = args.xxx assignments in core_transformer_config_from_args()
  • rebase-miles-main: Removes post_self_attn_layernorm and post_mlp_layernorm from the exclude list in ArgumentGroupFactory(TransformerConfig, exclude=exclude) + keeps the 2 explicit kw_args assignments (redundant but harmless)
  • Reason: rdxa/dev uses ArgumentGroupFactory to auto-generate CLI args from TransformerConfig dataclass fields. Adding manual add_argument() for the same fields causes an argparse conflict error (conflicting option string). Removing them from the exclude list lets the auto-generation handle it. The explicit kw_args assignments are redundant (the auto-loop in core_transformer_config_from_args already does this) but kept for clarity.
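
The argparse conflict described above can be reproduced with the standard library alone. The sketch below does not use Megatron's ArgumentGroupFactory; add_args_from_dataclass, ToyConfig, and try_manual_flag are hypothetical stand-ins that only mimic the idea of generating flags from dataclass fields with an exclude list.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ToyConfig:
    post_self_attn_layernorm: bool = False
    post_mlp_layernorm: bool = False
    hidden_size: int = 64


def add_args_from_dataclass(parser, config_cls, exclude=()):
    """Hypothetical helper: one CLI flag per dataclass field not in `exclude`.

    Toy assumption: every field has a non-None default whose type can be
    used as the argparse type.
    """
    group = parser.add_argument_group(config_cls.__name__)
    for f in fields(config_cls):
        if f.name in exclude:
            continue
        flag = "--" + f.name.replace("_", "-")
        if isinstance(f.default, bool):
            group.add_argument(flag, action="store_true", default=f.default)
        else:
            group.add_argument(flag, type=type(f.default), default=f.default)
    return group


def try_manual_flag(parser, flag):
    """Return True if the flag could be added by hand, False on a conflict."""
    try:
        parser.add_argument(flag, action="store_true")
        return True
    except argparse.ArgumentError:
        # argparse reports "conflicting option string" for duplicate flags.
        return False
```

A field that stays in the exclude list can still be registered by hand, while a field that is auto-generated cannot, which is the situation the rebase avoids by dropping the manual add_argument() calls.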

[5/8] feat: support MTP training in RL (rebased from miles-main [6/8], with [5/8] detach logic folded in)

miles-main had two separate commits:

  • [5/8] detach output layer params — edits existing compute_output_layer_and_language_model_loss() in language_module.py (1 file, 9 insertions, 1 deletion)
  • [6/8] support MTP training in RL — adds mtp_kwargs interface, changes MTP label/loss_mask flow in gpt_model.py + multi_token_prediction.py

In range-diff, miles-main [5/8] is dropped as a standalone commit, and miles-main [6/8] is rewritten into rebase [5/8] with an expanded commit message that includes detach + MTP behavior.

Commit message vs actual behavior (miles-main)

  • miles-main [5/8] message says "detach output layer params for RL training", but code only changes the non-fused branch in language_module.py (functional_call with detached module params). The fused linear_cross_entropy path still uses the caller-provided weight as-is.
  • miles-main [6/8] message says "support MTP training in RL". It adds mtp_kwargs and MTP flow changes, and calls compute_output_layer_and_language_model_loss(...) from MTP path, but fused-path detach is still not fully enforced there.
  • rebase [5/8] message explicitly combines these concerns (detach + mtp_kwargs + label/loss_mask roll), and implementation detaches both fused and non-fused MTP output-layer paths in gpt_model.py.

Detach implementation — where it lives

| Aspect | miles-main [5/8]+[6/8] | rebase-miles-main [5/8] |
|---|---|---|
| Where | Existing compute_output_layer_and_language_model_loss() in language_module.py + MTP callsite in gpt_model.py | Inline in gpt_model.py _postprocess |
| language_module.py | Modified existing method only (small delta: functional_call detach in non-fused branch) | Unchanged from rdxa/dev |
| Non-fused detach | functional_call with all params detached + col_linear_kwargs={'weight': output_weight.detach()} | Same approach: functional_call with all params detached + weight=output_weight.detach() in kwargs |
| Fused detach | weight parameter comes from caller as self.shared_embedding_or_output_weight(); NOT detached | weight=output_weight.detach(); correctly detached |

Key difference: In miles-main, the fused path (linear_cross_entropy) receives the weight parameter from the method signature, which is passed by the caller in gpt_model.py as self.shared_embedding_or_output_weight() without .detach(). This means the fused path does not block MTP gradient from flowing back to the output layer. The non-fused path is correct in both versions.

rebase-miles-main fixes this by detaching weight in both paths at the call site.
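
The effect of detaching the output-layer parameters can be checked in isolation. The following is a minimal sketch, assuming a plain nn.Linear as the output head and another nn.Linear as a stand-in for the MTP layer; loss_with_detached_head is a hypothetical name, and none of this is the gpt_model.py code. It uses torch.func.functional_call, which runs a module with a substituted parameter dict.

```python
import torch
from torch.func import functional_call


def loss_with_detached_head(head, hidden, labels):
    """Compute cross-entropy through `head` without training `head`."""
    # Detached copies share storage with the real parameters but are cut
    # out of the autograd graph, so no gradient reaches the head.
    detached = {name: p.detach() for name, p in head.named_parameters()}
    logits = functional_call(head, detached, (hidden,))
    return torch.nn.functional.cross_entropy(logits, labels)


torch.manual_seed(0)
mtp_layer = torch.nn.Linear(8, 8)                 # stand-in for the MTP layer
output_head = torch.nn.Linear(8, 16, bias=False)  # stand-in for the output layer
x = torch.randn(4, 8)
labels = torch.randint(0, 16, (4,))

loss = loss_with_detached_head(output_head, mtp_layer(x), labels)
loss.backward()

# The head received no gradient; the MTP stand-in did.
head_has_grad = output_head.weight.grad is not None
mtp_has_grad = mtp_layer.weight.grad is not None
```

Passing the head's weight without .detach(), as the miles-main fused path does according to the description above, would leave the weight in the graph and let the MTP loss update the output layer.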

mtp_kwargs interface — identical

Both versions add the same mtp_kwargs: Optional[dict] = {} parameter, the same mtp_labels sourcing from mtp_kwargs['mtp_labels'], and the same loss_mask roll logic.

multi_token_prediction.py — identical

Both versions make the same changes: position_ids None check, decoder_input.detach(), keep_graph=False, and _checkpointed_forward rewrite to support non-tensor arguments.

Gradient flow (both versions intend the same behavior)

hidden_states → output_layer(detached params) → mtp_loss
                     ✗ gradient blocked to output layer
    ↓
MTP layer params ✓ trained normally

[6/8] feat: support rollout routing replay (R3) and bypass for MTP layers (was miles-main [7/8] + [fix])

miles-main had two separate commits:

  • [7/8] support rollout routing replay — adds R3 integration (2 files, 6 lines)
  • [fix] bypass r3 for mtp layer — adds is_mtp bypass (4 files, 13 lines)

rebase-miles-main merges both into a single [6/8], and additionally replaces rdxa/dev's built-in RouterReplay class.

API — identical

| Aspect | miles-main [7/8] + [fix] | rebase-miles-main [6/8] |
|---|---|---|
| Import path | from miles.utils.replay_base import routing_replay_manager | Same |
| Registration | routing_replay_manager.register_to_module(self, "routing_replay") | Same |
| topk wrapping | routing_replay_manager.get_topk_fn(compute_topk, return_probs=True) | routing_replay_manager.get_topk_fn(_compute_topk, return_probs=True); rdxa/dev renamed the internal function; functionally equivalent |
| MTP bypass (is_mtp) | [fix] adds is_mtp flag, set_is_mtp() method, bypass logic | Same logic, merged into single commit |

rdxa/dev RouterReplay handling

rdxa/dev has its own built-in RouterReplay class (router_replay.py), with moe_enable_routing_replay config and parameter-passing style integration. rebase-miles-main [6/8]:

  • Removes all RouterReplay imports and usage from router.py and moe_utils.py
  • Removes router_replay=self.router_replay parameter passing
  • Replaces with miles.utils.replay_base.routing_replay_manager (same as miles-main)
  • router_replay.py file still exists but is dead code (no imports remain)
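
As a rough illustration of what wrapping a top-k function for replay means, here is a toy sketch. It is not the miles routing_replay_manager, whose internals are not shown in this PR; ToyRoutingReplay, wrap_topk, and the record/replay modes are hypothetical, and the sketch assumes replay simply reuses previously recorded expert indices with the current scores.

```python
from typing import Callable, List, Sequence, Tuple

TopK = Tuple[List[float], List[int]]


def compute_topk(scores: Sequence[float], k: int) -> TopK:
    """Plain top-k: return the k largest scores and their indices."""
    idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in idx], idx


class ToyRoutingReplay:
    """Hypothetical stand-in for a routing replay manager."""

    def __init__(self) -> None:
        self.mode = "record"  # "record" during rollout, "replay" during training
        self.recorded: List[List[int]] = []
        self._cursor = 0

    def wrap_topk(self, topk_fn: Callable[..., TopK], is_mtp: bool = False):
        def wrapped(scores: Sequence[float], k: int) -> TopK:
            if is_mtp:
                # MTP layers bypass replay and always route normally.
                return topk_fn(scores, k)
            if self.mode == "record":
                probs, idx = topk_fn(scores, k)
                self.recorded.append(list(idx))
                return probs, idx
            # Replay: reuse the recorded expert indices, read current scores.
            idx = self.recorded[self._cursor]
            self._cursor += 1
            return [scores[i] for i in idx], list(idx)

        return wrapped
```

The is_mtp branch mirrors the bypass described above: layers flagged as MTP never consume recorded routing decisions, so the replay cursor stays aligned with the regular decoder layers.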

[8/8] fix: CUDA IPC incompatibility from Megatron bump (new)

Not present on miles-main. Commit message documents a failure after rebasing to rdxa/dev: colocated IPC weight update hits torch.AcceleratorError: CUDA error: invalid argument during CUDA tensor serialization with torch.multiprocessing.

Motivation (from commit message): TMS hook behavior from upstream Megatron bump can make allocator behavior IPC-incompatible in this flow.

Code-level fix in this commit:

  • dynamic_context.py — disables torch_memory_saver hook mode in this context (HAVE_TORCH_MEMORY_SAVER = False)

The bug was introduced by upstream commit NVIDIA@42986ac.


[9/8] fix: dp_reshardable checkpoint backward compat in Megatron core (new)

Background: When loading a dp_reshardable checkpoint saved with a different DP world size, bucket counts may differ. Megatron's sharded_param_state_dp_reshardable pads the bucket state list with {"padding": True} entries for alignment, but the loading side had three bugs:

  1. dict_utils.merge list length mismatch: merge() raises ValueError when shard file lists have different lengths (extra padding entries from the save side). Fix: detect optimizer/param_state paths via _is_optimizer_param_state_key() and truncate the longer list (x1) to match x2.

  2. distrib_optimizer.py KeyError on ['padding']: Old checkpoint entries lack the padding field entirely. Fix: bucket_state_elem.get('padding', False) instead of bucket_state_elem['padding'].

  3. mapping.py flattened_range exception: ShardedTensor.__init__ raises CheckpointingException for flattened_range. Fix: downgrade to a deprecation warning (logged once) for backward compat with older checkpoint formats.
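
The first two fixes can be sketched on plain dicts and lists. This is a simplified toy, not Megatron's dict_utils.merge: the predicate is_param_state_key is a hypothetical stand-in for _is_optimizer_param_state_key() (its real matching rule is not shown in this PR), and the leaf rule "second value wins" is an assumption of the sketch.

```python
def is_param_state_key(key):
    """Hypothetical predicate: does the key path point at optimizer param state?"""
    return "optimizer" in key and "param_state" in key


def merge(x1, x2, key=()):
    """Toy recursive merge of x2 into x1 (x1 is modified in place)."""
    if isinstance(x1, dict) and isinstance(x2, dict):
        for k, v2 in x2.items():
            x1[k] = merge(x1[k], v2, key + (k,)) if k in x1 else v2
        return x1
    if isinstance(x1, list) and isinstance(x2, list):
        if len(x1) != len(x2):
            if len(x1) > len(x2) and is_param_state_key(key):
                # Drop trailing padding entries so the lists line up.
                del x1[len(x2):]
            else:
                raise ValueError(f"list length mismatch at {key}")
        for i, v2 in enumerate(x2):
            x1[i] = merge(x1[i], v2, key + (i,))
        return x1
    return x2  # toy assumption: for leaves, the second value wins


def is_padding(bucket_state_elem):
    """Old checkpoints have no 'padding' field, so default to False."""
    return bucket_state_elem.get("padding", False)
```

Restricting the truncation to optimizer param-state paths keeps the strict length check everywhere else, so genuine mismatches in other parts of the state dict still surface as errors.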

Files changed:

  • megatron/core/dist_checkpointing/dict_utils.py: _is_optimizer_param_state_key() helper + merge() truncation logic
  • megatron/core/optimizer/distrib_optimizer.py: .get('padding', False) in load_parameter_state_from_dp_reshardable
  • megatron/core/dist_checkpointing/mapping.py: flattened_range deprecation warning

Summary of Changes vs miles-main

| Change | Location | Description |
|---|---|---|
| Adapted to rdxa/dev code structure | gpt_layer_specs.py | Inline spec in MLA/non-MLA paths (no helper function) |
| Adapted to rdxa/dev arg mechanism | arguments.py | Use ArgumentGroupFactory exclude list instead of manual add_argument |
| Fused-path detach fix | gpt_model.py | miles-main fused path doesn't detach weight; rebase correctly detaches |
| Inline detach placement | gpt_model.py vs language_module.py | Rebase inlines detach logic in _postprocess; miles-main touched existing language_module.py method (no new method added) |
| Replace rdxa/dev RouterReplay | moe_utils.py, router.py | rdxa/dev's built-in RouterReplay replaced with miles.utils.replay_base |
| Dropped use_gated_attention | transformer_config.py, arguments.py | No consumers in codebase |
| New: CUDA IPC fix | mapping.py, dynamic_context.py | Megatron bump broke IPC via TMS hook allocator change; isolate side effects |
| New: dp_reshardable compat | dict_utils.py, distrib_optimizer.py, mapping.py | Direct Megatron core fixes replacing Miles monkey-patches |

@guapisolo guapisolo force-pushed the upstream/1dcf0dafa branch 2 times, most recently from 4ac0f2f to 07220d2 Compare February 28, 2026 01:02
@guapisolo guapisolo changed the title [Upstream] Update megatron version to latest dev branch and rebase modifications [Upstream] Update megatron version to dev branch (Feb 13) and rebase modifications Mar 4, 2026
@guapisolo guapisolo merged commit 038e8e5 into miles-main Mar 4, 2026