NMFW-464: HyperCommGrid alt-factorization + Nemotron VLM E2E (MIMO hetero parallel) #17
Draft
yashaswikarnati wants to merge 2 commits into
Phase 1 substrate that lets the Nemotron VLM (RADIO ViT + MLP projection + Mamba-MoE LLM) run end-to-end with heterogeneous parallelism in MIMO, in both colocated and non-colocated modes, without inflating the world size with orthogonal expert axes.

Core:
- HyperCommGrid first-class alt-factorization: a single grid object can carry both a primary factorization (tp/cp/dp/pp) and an alt factorization (etp/ep/edp) over the same per-PP-stage rank slab. The constraint is tp*cp*dp == ep*etp*edp with PP shared. World size stays tp*cp*dp*pp.
- ProcessGroupCollection.from_hyper_comm_grid(): builds the standard model-parallel + expert PG fields directly from a HyperCommGrid, with no global parallel_state initialization. Expert fields are populated to None when the grid carries no alt factorization so DDP/optimizer hasattr probes work uniformly.

Tests (all 11 pass on 8 GPUs in a Slurm batch job; see scripts/nmfw464_e2e_batch.sh):
- HyperCommGrid alt-factorization unit + 2 NCCL integration tests (overlap + PP-confined invariants).
- ProcessGroupCollection.from_hyper_comm_grid distributed tests.
- MoE GPT through forward_backward_no_pipelining with HCG-only PGs.
- MIMO + GPT-MoE colocated (basic + Nemotron-flavor TransformerConfig).
- MIMO + literal Mamba-MoE colocated.
- Literal Nemotron VLM (RADIOEncoderWrapper + MultimodalProjector + MambaModel) colocated.
- MIMO + Mamba-MoE non-colocated (1F1B with MultiModulePipelineCommunicator bridge).
- Literal Nemotron VLM non-colocated.

Also pulled the Nemotron-MoE VLM model provider, config, and RADIOEncoderWrapper from feat/nemotron-moe-vlm-mimo so the production assembly is carried alongside the substrate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
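The factorization constraint in this commit message can be sketched in plain Python. This is an illustrative check only, not HyperCommGrid's implementation: the helper name check_alt_factorization is hypothetical, and the example sizes are chosen for an 8-GPU run rather than taken from the tests.

```python
def check_alt_factorization(tp, cp, dp, pp, etp, ep, edp):
    """Return (slab_size, world_size) if both factorizations cover the same slab.

    Hypothetical helper for illustration; not part of Megatron-Core.
    """
    primary = tp * cp * dp  # ranks per PP stage under the primary factorization
    alt = etp * ep * edp    # ranks per PP stage under the alt (expert) factorization
    if primary != alt:
        raise ValueError(f"tp*cp*dp ({primary}) must equal ep*etp*edp ({alt})")
    # PP is shared, so the expert axes add no ranks: world size stays tp*cp*dp*pp.
    return primary, primary * pp


# Dense layers see tp=2, dp=4; expert layers reuse the same 8 ranks as ep=8.
slab, world = check_alt_factorization(tp=2, cp=1, dp=4, pp=1, etp=1, ep=8, edp=1)
print(slab, world)  # 8 8
```

A mismatched pair such as tp=2, dp=4 against ep=4, etp=1, edp=1 raises ValueError, which is the case the constraint is meant to rule out.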
The MIMO hetero path is meant to flow process groups end-to-end through HyperCommGrid + ProcessGroupCollection, never via parallel_state. The previous E2E tests called parallel_state.initialize_model_parallel(...) to keep HybridModel and RADIOViTModel from asserting; that masked the risk that any code reaching for parallel_state would silently pick up a group that disagreed with the HCG topology.

Three plumbing fixes that move us off parallel_state for the colocated and non-colocated MIMO paths:
- megatron/core/models/hybrid/hybrid_layer_allocation.py: thread tp_group and dp_cp_group through select_pipeline_segment so its log_on_each_pipeline_stage call no longer falls back to parallel_state.get_tensor_model_parallel_rank.
- megatron/core/models/hybrid/hybrid_model.py: pass pg_collection.tp / pg_collection.dp_cp into select_pipeline_segment.
- megatron/core/models/vision/radio.py: thread pg_collection.tp into the embedder's ColumnParallelLinear so it doesn't fall back to parallel_state.get_tensor_model_parallel_group at forward time.
- examples/mimo/model_providers/radio_encoder.py: accept and forward pg_collection to RADIOViTModel.
- tests/unit_tests/models/test_mimo_moe_e2e.py: replace _reset_parallel_state(...) (which destroyed-and-reinit'd parallel_state) with _reset_global_singletons() that only destroys any leftover state. Tests now verify every group used by the model comes from the HyperCommGrid.

Full E2E batch (11 tests, 8 GPUs Slurm) green with no parallel_state init for the MIMO/MoE path; HCG groups are the only model-parallel topology the process knows about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
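The masking risk described in this commit message can be shown with a toy model. Everything here is a stand-in invented for the example (FakeGroup, the module-level global, and the function names); it uses no Megatron-Core or torch.distributed API and only mimics the shape of a global-fallback accessor.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FakeGroup:
    """Stand-in for a process group; only records which ranks it spans."""
    name: str
    ranks: Tuple[int, ...]


# Hypothetical stand-in for a global parallel_state-style singleton.
_GLOBAL_TP_GROUP: Optional[FakeGroup] = None


def init_global_tp_group(group: FakeGroup) -> None:
    global _GLOBAL_TP_GROUP
    _GLOBAL_TP_GROUP = group


def destroy_global_tp_group() -> None:
    global _GLOBAL_TP_GROUP
    _GLOBAL_TP_GROUP = None


def tp_ranks_with_fallback(tp_group: Optional[FakeGroup] = None) -> Tuple[int, ...]:
    """Fallback pattern: a forgotten argument silently resolves to the global group."""
    group = tp_group if tp_group is not None else _GLOBAL_TP_GROUP
    if group is None:
        raise RuntimeError("no tp_group passed and global state is uninitialized")
    return group.ranks


grid_tp = FakeGroup("grid_tp", (0, 1))                      # topology from the grid
init_global_tp_group(FakeGroup("global_tp", (0, 1, 2, 3)))  # a disagreeing global

print(tp_ranks_with_fallback(grid_tp))  # threaded group: (0, 1)
print(tp_ranks_with_fallback())         # forgotten argument: (0, 1, 2, 3), no error
```

Once destroy_global_tp_group() has run, the forgotten-argument call raises instead of returning the wrong ranks. In this toy model, that is why leaving the global state uninitialized turns a silent topology mismatch into a visible failure.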
Summary
- Adds first-class alt-factorization to HyperCommGrid so EP/ETP/EDP overlap onto the same physical ranks as TP/CP/DP within each PP stage. Constraint: tp*cp*dp == ep*etp*edp with pp shared. World size stays tp*cp*dp*pp. (NMFW-464 expert-overlap fix.)
- Adds ProcessGroupCollection.from_hyper_comm_grid() so MIMO / DDP / optimizer / MoE call sites can build their PG collection directly from a HyperCommGrid, with no parallel_state.initialize_model_parallel() required for the MIMO hetero path. Expert fields are populated to None when the grid carries no alt factorization, so hasattr probes uniformly succeed.
- Adds E2E tests that run the Nemotron VLM in MIMO in both colocated and non-colocated modes, the latter through MultiModulePipelineCommunicator + 1F1B schedule.
- Pulls the Nemotron-MoE VLM model provider, config, and RADIOEncoderWrapper from feat/nemotron-moe-vlm-mimo so the production assembly lives alongside the substrate.

Test plan
All 11 tests pass under a single Slurm batch job (scripts/nmfw464_e2e_batch.sh); each test runs in its own torch.distributed.run invocation so global singletons can't leak.

- tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGrid (mock)
- tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGridAltFactorization (mock)
- tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGridIntegration (NCCL: alt-factorization overlap + PP-confined invariants)
- tests/unit_tests/test_process_groups_config.py::TestPGConfigFromHyperCommGrid (NCCL)
- tests/unit_tests/transformer/moe/test_moe_with_hcg_pg.py (MoE GPT through forward_backward_no_pipelining with HCG-only PGs)
- tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_moe_colocated_8gpu[False] (basic GPT-MoE)
- tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_moe_colocated_8gpu[True] (Nemotron-flavor GPT-MoE)
- tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_nemotron_mamba_moe_colocated_8gpu (literal Mamba-MoE)
- tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_nemotron_radio_mamba_moe_colocated_8gpu (literal Nemotron VLM, colocated)
- tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_mamba_moe_non_colocated_8gpu (Mamba-MoE non-colocated)
- tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_nemotron_radio_mamba_moe_non_colocated_8gpu (literal Nemotron VLM, non-colocated)

Three rounds of code review (overdesign / correctness / software practices) plus three batch e2e runs were completed before pushing.
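The per-test isolation in the test plan could be driven by something like the sketch below, which builds one launcher command per test ID. The contents of scripts/nmfw464_e2e_batch.sh are not shown in this PR page, so the helper name build_isolated_commands and the exact flags are assumptions; the code only constructs the command lines and does not launch anything.

```python
import shlex
from typing import List


def build_isolated_commands(test_ids: List[str], nproc_per_node: int = 8) -> List[List[str]]:
    """Build one launcher command per test ID so each test gets fresh processes.

    Hypothetical helper; the real batch script may be structured differently.
    """
    return [
        [
            "python", "-m", "torch.distributed.run",
            f"--nproc_per_node={nproc_per_node}",
            "-m", "pytest", test_id,  # second -m tells the launcher to run pytest as a module
        ]
        for test_id in test_ids
    ]


test_ids = [
    "tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGridIntegration",
    "tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_moe_colocated_8gpu[True]",
]
for cmd in build_isolated_commands(test_ids):
    # shlex.join quotes the bracketed parametrization so the line is shell-safe.
    print(shlex.join(cmd))
```

Because every command starts new worker processes, module-level singletons created by one test cannot be observed by the next, which is the property the test plan relies on.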
Notes
- HybridModel still calls into a few parallel_state accessors (e.g. log_on_each_pipeline_stage); the e2e tests minimally call parallel_state.initialize_model_parallel(...) with topology matching the LLM grid via a _reset_parallel_state helper that's also order-safe (destroys + reinits).
- mamba_ssm and causal-conv1d are required (already in pyproject.toml [dev] / [lts]).
- examples/mimo/{configs,model_providers}/nemotron_moe_vlm.py are cherry-picked from feat/nemotron-moe-vlm-mimo and not exercised directly by these tests; their relative from configs.… import works through the existing examples/mimo/train.py sys.path setup.

🤖 Generated with Claude Code