[DO NOT MERGE] trainer ft by fzyzcjy · Pull Request #21 · radixark/Megatron-LM

fzyzcjy · 2026-04-02T02:27:11Z

No description provided.

Add _pre_decoder_hooks list and register_pre_decoder_hook() method. Hooks are called between _preprocess and decoder, allowing external code to transform decoder_input without Megatron knowing specifics.
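A minimal sketch of the hook mechanism this commit describes, assuming each hook is a callable that takes `decoder_input` and returns the transformed value. `TinyModel` and its stand-in `_preprocess` and `decoder` methods are hypothetical and use plain Python lists instead of tensors; only the names `_pre_decoder_hooks` and `register_pre_decoder_hook` come from the commit message.

```python
from typing import Callable, List

# Assumed hook signature: decoder_input -> decoder_input.
Hook = Callable[[list], list]


class TinyModel:
    """Hypothetical stand-in for GPTModel, for illustration only."""

    def __init__(self) -> None:
        # Initialized up front, so forward() can use direct attribute access.
        self._pre_decoder_hooks: List[Hook] = []

    def register_pre_decoder_hook(self, hook: Hook) -> None:
        self._pre_decoder_hooks.append(hook)

    def _preprocess(self, input_ids: list) -> list:
        # Stand-in for the embedding step.
        return [float(i) for i in input_ids]

    def decoder(self, decoder_input: list) -> list:
        # Stand-in for the transformer stack.
        return [2.0 * x for x in decoder_input]

    def forward(self, input_ids: list) -> list:
        decoder_input = self._preprocess(input_ids)
        # Hooks run in registration order, between _preprocess and decoder.
        for hook in self._pre_decoder_hooks:
            decoder_input = hook(decoder_input)
        return self.decoder(decoder_input)


model = TinyModel()
model.register_pre_decoder_hook(lambda x: [v + 1.0 for v in x])
result = model.forward([1, 2, 3])
```

The model only iterates over opaque callables, so the code that registers a hook owns all knowledge of what the transformation does.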

_pre_decoder_hooks is already initialized in __init__, so the getattr fallback is not needed.

Remove _pre_decoder_hooks list and register_pre_decoder_hook() method. Add witness_ids parameter to forward() and build_schedule_plan(). Witness logic is inline: hasattr(self, 'head_witness') check + add to decoder_input or decoder.input_tensor depending on PP stage.

- build_schedule_plan: accept witness_ids, pass to schedule plan
- TransformerModelChunkSchedulePlan: store witness_ids in chunk_state
- PreProcessNode.forward_impl: apply witness after _preprocess (same logic as GPTModel.forward)

- GPTModel.forward: add tail_witness after decoder, before _postprocess
- build_schedule_plan: revert witness_ids param (not supported)
- model_chunk_schedule_plan: revert witness_ids in chunk_state
- fine_grained_callables: revert witness logic in Pre/PostProcessNode

_DataWitness.forward returns [b, s, 1], but Megatron's decoder_input is in [s, b, h] format after the embedding layer. Without a transpose, broadcasting [s, b, h] + [b, s, 1] (which is only shape-compatible when b = 1) creates an [s, s, h] tensor, causing OOM (648 GiB for s=18432).
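The shape arithmetic behind this OOM can be checked without allocating anything. The sketch below implements the standard NumPy/PyTorch broadcasting rule in pure Python. The values `h = 1024` and 2 bytes per element are assumptions chosen because they reproduce the quoted 648 GiB; the commit message does not state the hidden size or dtype.

```python
from typing import Tuple


def broadcast_shape(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Result shape of an elementwise op under NumPy/PyTorch broadcasting rules."""
    n = max(len(a), len(b))
    # Left-pad the shorter shape with 1s, then compare dimension by dimension.
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        out.append(max(x, y))
    return tuple(out)


s, b = 18432, 1
h = 1024             # assumed hidden size (not stated in the commit message)
bytes_per_elem = 2   # assumed bf16/fp16 activations

# Untransposed witness output [b, s, 1] added to decoder_input [s, b, h].
bad = broadcast_shape((s, b, h), (b, s, 1))
# Witness output transposed to [s, b, 1] first.
good = broadcast_shape((s, b, h), (s, b, 1))

bad_gib = bad[0] * bad[1] * bad[2] * bytes_per_elem / 2**30
good_mib = good[0] * good[1] * good[2] * bytes_per_elem / 2**20
```

Under these assumptions the untransposed sum has shape (18432, 18432, 1024) and needs 648 GiB, while the transposed one keeps the shape (18432, 1, 1024) and needs 36 MiB.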

When sequence parallel is active, decoder_input and hidden_states are scattered along the sequence dimension ([s/tp, b, h]). The witness output must also be scattered to match, otherwise shapes mismatch (e.g. [18432, 1, 1] vs [9216, 1, h] with TP=2).
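A minimal sketch of the scatter step, assuming each tensor-parallel rank keeps one contiguous, equally sized chunk of the sequence dimension. Both helper functions are hypothetical and operate on plain shapes and lists rather than tensors; they are not Megatron APIs.

```python
from typing import List, Tuple


def local_shape(shape: Tuple[int, ...], tp_size: int) -> Tuple[int, ...]:
    """Shape held by one rank after scattering along dim 0 (the sequence dim)."""
    assert shape[0] % tp_size == 0, "sequence length must divide evenly by TP size"
    return (shape[0] // tp_size,) + tuple(shape[1:])


def scatter_along_sequence(rows: List[float], tp_size: int, tp_rank: int) -> List[float]:
    """Return the contiguous chunk of the sequence owned by tp_rank."""
    assert len(rows) % tp_size == 0
    chunk = len(rows) // tp_size
    return rows[tp_rank * chunk:(tp_rank + 1) * chunk]


# Shapes from the example above: witness output [s, b, 1] with s=18432, TP=2.
witness_local = local_shape((18432, 1, 1), tp_size=2)

# Toy sequence of 8 positions split across 2 ranks.
toy = [float(i) for i in range(8)]
rank0 = scatter_along_sequence(toy, tp_size=2, tp_rank=0)
rank1 = scatter_along_sequence(toy, tp_size=2, tp_rank=1)
```

After the scatter the witness output has shape (9216, 1, 1), which broadcasts against the local [9216, 1, h] activations.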

…ptimizer

The distributed optimizer replaces optimizer param_groups with shard main params (fp32 copies). get_main_grads_for_grad_norm checks _is_witness_param on these main params, but the flag was only set on the original model params. Copy the flag when building main param groups for both float16→fp32 and fp32 paths.
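The flag propagation can be illustrated with plain objects. In this sketch `Param`, `build_main_param`, and `grads_for_grad_norm` are hypothetical stand-ins, not the distributed optimizer's actual code; only the attribute name `_is_witness_param` comes from the commit message.

```python
class Param:
    """Hypothetical stand-in for a parameter tensor."""

    def __init__(self, name: str, grad: float, is_witness: bool = False) -> None:
        self.name = name
        self.grad = grad
        if is_witness:
            self._is_witness_param = True


def build_main_param(model_param: Param) -> Param:
    """Create the optimizer-side copy and carry the witness flag over."""
    main = Param(model_param.name, model_param.grad)
    # Without this copy, the grad-norm filter below would never see the flag.
    if getattr(model_param, "_is_witness_param", False):
        main._is_witness_param = True
    return main


def grads_for_grad_norm(main_params: list) -> list:
    """Collect grads, skipping any param flagged as a witness param."""
    return [p.grad for p in main_params
            if not getattr(p, "_is_witness_param", False)]


model_params = [Param("w", 3.0), Param("witness", 100.0, is_witness=True)]
main_params = [build_main_param(p) for p in model_params]
grads = grads_for_grad_norm(main_params)
```

Here the witness gradient (100.0) is excluded, so only the regular parameter contributes to the norm.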

fzyzcjy added 3 commits April 1, 2026 22:53

feat: add generic pre-decoder hook mechanism to GPTModel

c5f9de8

fix: use direct attribute access for _pre_decoder_hooks

77185e1

more

153fa7f

fzyzcjy changed the title ~~Add register_pre_decoder_hook~~ [DO NOT MERGE] trainer ft Apr 2, 2026

fzyzcjy marked this pull request as draft April 2, 2026 02:28

fzyzcjy added 19 commits April 2, 2026 10:36

fix: use witness_ids param directly instead of miles_kwargs dict

91bbff1

more

f12f77b

gitignore

4a8e159

more

b7af836

more

8225413

fix: guard witness forward with hasattr for PP stage awareness

3c25800

fix: use pre_process/post_process assert instead of hasattr for witness

2592086

fix: exclude witness params from grad_norm computation

bc8ed09

fix: assert witness params not used with precision-aware optimizer

4028201

fix: use hasattr guard for witness forward instead of pre_process/pos…

755c9f7

…t_process

refactor: rename head_witness/tail_witness to local_head_witness/loca…

9cc8cb8

…l_tail_witness

fix: use witness_broadcast_add for tail witness to prevent gradient c…

43cb211

…ancellation

refactor: simplify witness calls in gpt_model — one-line API

9a5447d

fzyzcjy wants to merge 22 commits into miles-main from trainer_ft/dev