[Docker] Megatron version bump to Feb 13 and upgrade fla==0.4.1#643
Conversation
Summary of ChangesHello @guapisolo, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the integration with Megatron-LM to ensure compatibility with recent changes, specifically regarding how tensor parallel sharding is handled for certain layers. It refactors the parameter gathering mechanism to support strided partitioning, adjusts model provider function signatures to accommodate new configuration parameters, and includes a minor correction to an argument mapping for Hugging Face configurations. These changes collectively aim to maintain correct behavior and functionality with an updated Megatron version. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request appears to be updating the Megatron integration to a newer version. The changes include updating model provider function signatures, refactoring weight gathering logic to properly support strided tensor parallelism via partition_stride, and adjusting argument names for validation. Overall, these changes align the codebase with upstream updates.
However, I've identified a critical regression in miles/backends/megatron_utils/actor.py where the refactoring of parallel_state initialization can lead to incorrect behavior or crashes when using virtual pipeline parallelism. My review includes a specific comment and code suggestion to address this issue.
| (self.model, self.optimizer, self.opt_param_scheduler, loaded_rollout_id) = initialize_model_and_optimizer( | ||
| args, role | ||
| ) | ||
|
|
There was a problem hiding this comment.
The parallel_state should be re-initialized here after the model is created. The initial creation at line 83 with model=None is sufficient for the debug_rollout_only path, but for the main training path, parallel_state requires information from the model's configuration, especially when virtual pipeline parallelism (vpp_size > 1) is used.
Removing this line causes the training process to use an incomplete parallel_state, which can lead to an AssertionError or incorrect behavior during training. Re-adding this line ensures the parallel state is always correctly configured.
| self.parallel_state = create_megatron_parallel_state(model=self.model) |
b81daf9 to
6b7856d
Compare
debug_rollout_only mode calls train() which needs parallel_state for rollout data preprocessing and logging. Previously parallel_state was only created after model initialization, which is skipped in debug_rollout_only mode. Move it before the early return with model=None. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
772a011 to
de4a130
Compare
…d fc1/fc2 logic Megatron-LM PR #2708 fixed partition_stride to correctly report stride=2 for linear_fc1 (GLU/SwiGLU interleaved [gate, up]) and stride=1 for linear_fc2. Replace the old hard-coded fc1 chunk reordering and fc2 partition_dim workaround with generic stride-aware gathering. Add _check_partition_stride() asserts to validate expected stride values for linear_fc1 (must be 2) and linear_fc2 (must be 1). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…name Megatron-LM renamed the config field from norm_epsilon to layernorm_epsilon. Update the HF config validation mapping accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o_hf Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6fa3ec6 to
8db6954
Compare
yueming-yuan
left a comment
There was a problem hiding this comment.
LGTM as offline discussed
…xark#643) Co-authored-by: Yueming Yuan <yym022502@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…xark#643) Co-authored-by: Yueming Yuan <yym022502@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
NOTICE: SOMEONE NOTICE THERE IS PROBLEM TO RUN THIS PR WITH GLM-5. IF PROBLEM EXIST, try downgrade megatron, turn on
DEPRECATED_MEGATRON_COMPATIBLEand try.ci-megatron-pr: upstream/1dcf0dafa
PR related: radixark/Megatron-LM#13
There is some critical megatron changes between Dec 18, 2025 and Feb 13, 2026.
If there is any problem with new megatron, you can revert it to the old version Rebased Megatron from Dec 18 and add
DEPRECATED_MEGATRON_COMPATIBLEto align with old version code.Adapt to upstream Megatron-LM breaking changes
1. TransformerConfig auto-registration
Megatron-LM now auto-generates CLI arguments directly from
TransformerConfigdataclass fields (#2896), and passes a pre-builtconfig+pg_collectionintomodel_provider(#2608).model_provider.py— Accept the newconfigandpg_collectionkwargs in all three provider paths (custom, bridge, default). Assertconfig is Nonesince Miles builds its own config from args viacore_transformer_config_from_args.arguments.py— The old CLI arg--norm-epsilonwas manually mapped toTransformerConfig.layernorm_epsilon. With auto-registration, this mapping is gone — the field is now exposed directly asargs.layernorm_epsilon. Update the HF config validation accordingly.2. Fix TP weight gather to use
partition_strideBackground. In tensor-parallel SwiGLU/GLU models, each TP rank's
linear_fc1.weightstores interleaved[gate, up]blocks — the correctpartition_strideis 2. Megatron-LM previously hard-coded stride=1 for all parameters; our old code compensated with manualchunk(2) → reorder → catfor fc1 and apartition_dimswap workaround for fc2's grouped-MoE bug.Megatron-LM #2708 partially fixed this:
linear_fc1now correctly setpartition_stride=2(whengated_linear_unit=True) andlinear_fc2setpartition_stride=1. But when--moe-grouped-gemmis set,partition_strideis still 1.So the solution is to remove the original assertion
partition_stride == 13. Remove old Megatron ckpt format dependents
fully_sharded_model_spacetodp_reshardable.Related bug fix can be found at radixark/Megatron-LM#13