NMFW-464: HyperCommGrid alt-factorization + Nemotron VLM E2E (MIMO hetero parallel) by yashaswikarnati · Pull Request #17 · yashaswikarnati/Megatron-LM

yashaswikarnati · 2026-05-10T16:22:28Z

Summary

Adds first-class alt-factorization to HyperCommGrid so EP/ETP/EDP overlap onto the same physical ranks as TP/CP/DP within each PP stage. Constraint: tp*cp*dp == ep*etp*edp with pp shared. World size stays tp*cp*dp*pp. (NMFW-464 expert-overlap fix.)
Adds ProcessGroupCollection.from_hyper_comm_grid() so MIMO / DDP / optimizer / MoE call sites can build their PG collection directly from a HyperCommGrid, with no parallel_state.initialize_model_parallel() required for the MIMO hetero path. Expert fields are set to None when the grid carries no alt factorization, so hasattr probes succeed uniformly.
Verifies the literal Nemotron VLM (RADIOEncoderWrapper + MultimodalProjector + MambaModel/HybridModel) end-to-end on 8 GPUs in both colocated and non-colocated MIMO modes with the new substrate (mock images + token IDs). Non-colocated path uses MultiModulePipelineCommunicator + 1F1B schedule.
Carries the Nemotron-MoE VLM config / model_provider / RADIOEncoderWrapper from feat/nemotron-moe-vlm-mimo so the production assembly lives alongside the substrate.
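The constraint above can be illustrated with a small, self-contained sketch. It does not use the real HyperCommGrid API: the helper names (`check_alt_factorization`, `groups_along`) are hypothetical, and the rank ordering (first listed dimension varies fastest) is an assumption made only for illustration. It shows that two factorizations of equal size carve the same per-PP-stage rank slab into different groups.

```python
from itertools import product


def check_alt_factorization(tp, cp, dp, pp, ep, etp, edp):
    """Return the world size if the alt factorization fits the per-stage slab."""
    primary = tp * cp * dp
    alt = ep * etp * edp
    if primary != alt:
        raise ValueError(f"tp*cp*dp={primary} must equal ep*etp*edp={alt}")
    # PP is shared, so expert axes add no ranks on top of tp*cp*dp*pp.
    return primary * pp


def groups_along(dim_sizes, dim, stage_offset=0):
    """Enumerate the rank groups along `dim` inside one PP-stage slab.

    `dim_sizes` maps dimension name -> size. Assumption for this sketch:
    the first listed dimension varies fastest across ranks.
    """
    strides, stride = {}, 1
    for name, size in dim_sizes.items():
        strides[name] = stride
        stride *= size
    others = [name for name in dim_sizes if name != dim]
    groups = []
    for coords in product(*(range(dim_sizes[name]) for name in others)):
        base = stage_offset + sum(c * strides[n] for c, n in zip(coords, others))
        groups.append([base + i * strides[dim] for i in range(dim_sizes[dim])])
    return sorted(groups)


# Example: 8 ranks in one PP stage, factored two ways.
primary = {"tp": 2, "cp": 1, "dp": 4}
alt = {"etp": 1, "ep": 4, "edp": 2}
world = check_alt_factorization(tp=2, cp=1, dp=4, pp=1, ep=4, etp=1, edp=2)
tp_groups = groups_along(primary, "tp")  # [[0, 1], [2, 3], [4, 5], [6, 7]]
ep_groups = groups_along(alt, "ep")      # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Both group lists cover exactly ranks 0 through 7, which is the overlap property the summary describes: the expert groups reuse the physical ranks of the primary factorization instead of extending the world size.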

Test plan

All 11 tests pass under a single Slurm batch job (scripts/nmfw464_e2e_batch.sh); each test runs in its own torch.distributed.run invocation so global singletons can't leak.

Three rounds of code review (overdesign / correctness / software practices) plus three batch e2e runs were completed before pushing.

Notes

HybridModel still calls into a few parallel_state accessors (e.g. log_on_each_pipeline_stage), so the e2e tests make a minimal parallel_state.initialize_model_parallel(...) call with a topology matching the LLM grid. This goes through a _reset_parallel_state helper that is also order-safe (it destroys and then reinitializes).
mamba_ssm and causal-conv1d are required (already in pyproject.toml [dev]/[lts]).
examples/mimo/{configs,model_providers}/nemotron_moe_vlm.py are cherry-picked from feat/nemotron-moe-vlm-mimo and not exercised directly by these tests; their relative from configs.… import works through the existing examples/mimo/train.py sys.path setup.

🤖 Generated with Claude Code

Phase 1 substrate that lets the Nemotron VLM (RADIO ViT + MLP projection + Mamba-MoE LLM) run end-to-end with heterogeneous parallelism in MIMO, in both colocated and non-colocated modes, without inflating the world size with orthogonal expert axes.

Core:
- HyperCommGrid first-class alt-factorization: a single grid object can carry both a primary factorization (tp/cp/dp/pp) and an alt factorization (etp/ep/edp) over the same per-PP-stage rank slab. The constraint is tp*cp*dp == ep*etp*edp with PP shared. World size stays tp*cp*dp*pp.
- ProcessGroupCollection.from_hyper_comm_grid(): builds the standard model-parallel + expert PG fields directly from a HyperCommGrid, with no global parallel_state initialization. Expert fields are populated to None when the grid carries no alt factorization so DDP/optimizer hasattr probes work uniformly.

Tests (all 11 pass on 8 GPUs in a Slurm batch job; see scripts/nmfw464_e2e_batch.sh):
- HyperCommGrid alt-factorization unit + 2 NCCL integration tests (overlap + PP-confined invariants).
- ProcessGroupCollection.from_hyper_comm_grid distributed tests.
- MoE GPT through forward_backward_no_pipelining with HCG-only PGs.
- MIMO + GPT-MoE colocated (basic + Nemotron-flavor TransformerConfig).
- MIMO + literal Mamba-MoE colocated.
- Literal Nemotron VLM (RADIOEncoderWrapper + MultimodalProjector + MambaModel) colocated.
- MIMO + Mamba-MoE non-colocated (1F1B with MultiModulePipelineCommunicator bridge).
- Literal Nemotron VLM non-colocated.

Also pulled the Nemotron-MoE VLM model provider, config, and RADIOEncoderWrapper from feat/nemotron-moe-vlm-mimo so the production assembly is carried alongside the substrate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
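The point about expert fields being None rather than absent can be sketched as follows. This is not the real ProcessGroupCollection: `SparseCollection`, `from_grid_sketch`, `uses_expert_parallel`, and the expert field names are hypothetical stand-ins, and plain strings take the place of process groups. The sketch assumes a container whose attributes exist only if they were explicitly set, which is the situation where a hasattr probe can fail.

```python
# Hypothetical expert field names, used only for this illustration.
EXPERT_FIELDS = ("ep", "expt_tp", "expt_dp")


class SparseCollection:
    """Stand-in container: only attributes passed in are set."""

    def __init__(self, **groups):
        for name, group in groups.items():
            setattr(self, name, group)


def from_grid_sketch(primary_groups, expert_groups=None):
    """Build a collection that always defines every expert field.

    Fields with no expert group are set to None instead of being left
    unset, so downstream `hasattr` probes behave the same for dense and
    MoE configurations.
    """
    fields = dict(primary_groups)
    for name in EXPERT_FIELDS:
        fields[name] = (expert_groups or {}).get(name)
    return SparseCollection(**fields)


def uses_expert_parallel(pgs):
    """Consumer-side probe: attribute must exist and be non-None."""
    return hasattr(pgs, "ep") and pgs.ep is not None


dense = from_grid_sketch({"tp": "tp-group"})
moe = from_grid_sketch({"tp": "tp-group"}, {"ep": "ep-group"})
bare = SparseCollection(tp="tp-group")  # no expert fields set at all
```

With this shape, `hasattr(dense, "ep")` is true and `dense.ep` is None, whereas `hasattr(bare, "ep")` is false. A consumer that reads `pgs.ep` directly would raise AttributeError on `bare` but not on `dense`.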

The MIMO hetero path is meant to flow process groups end-to-end through HyperCommGrid + ProcessGroupCollection, never via parallel_state. The previous E2E tests called parallel_state.initialize_model_parallel(...) to keep HybridModel and RADIOViTModel from asserting. That masked the risk that any code reaching for parallel_state would silently pick up a group that disagreed with the HCG topology.

Three plumbing fixes that move us off parallel_state for the colocated and non-colocated MIMO paths:
- megatron/core/models/hybrid/hybrid_layer_allocation.py: thread tp_group and dp_cp_group through select_pipeline_segment so its log_on_each_pipeline_stage call no longer falls back to parallel_state.get_tensor_model_parallel_rank.
- megatron/core/models/hybrid/hybrid_model.py: pass pg_collection.tp / pg_collection.dp_cp into select_pipeline_segment.
- megatron/core/models/vision/radio.py: thread pg_collection.tp into the embedder's ColumnParallelLinear so it doesn't fall back to parallel_state.get_tensor_model_parallel_group at forward time.
- examples/mimo/model_providers/radio_encoder.py: accept and forward pg_collection to RADIOViTModel.
- tests/unit_tests/models/test_mimo_moe_e2e.py: replace _reset_parallel_state(...) (which destroyed-and-reinit'd parallel_state) with _reset_global_singletons() that only destroys any leftover state. Tests now verify every group used by the model comes from the HyperCommGrid.

Full E2E batch (11 tests, 8 GPUs Slurm) green with no parallel_state init for the MIMO/MoE path; HCG groups are the only model-parallel topology the process knows about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
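The plumbing pattern described in this commit, passing a group explicitly instead of falling back to process-wide state, can be sketched in isolation. Everything here is a hypothetical stand-in: `FakeGroup`, `global_tp_rank`, `should_log_with_fallback`, and `should_log_explicit` are not Megatron functions, and the real signatures of select_pipeline_segment and log_on_each_pipeline_stage are not reproduced.

```python
from dataclasses import dataclass

# Stand-in for process-wide parallel state; None means "never initialized".
_GLOBAL_TP_RANK = None


def global_tp_rank():
    """Mimics a global accessor that only works after global init."""
    if _GLOBAL_TP_RANK is None:
        raise RuntimeError("global parallel state is not initialized")
    return _GLOBAL_TP_RANK


@dataclass(frozen=True)
class FakeGroup:
    """Minimal process-group stand-in: member ranks plus the caller's rank."""

    ranks: tuple
    my_rank: int

    def rank(self):
        # Position of this process inside the group.
        return self.ranks.index(self.my_rank)


def should_log_with_fallback(tp_group=None):
    """Old shape: silently consults global state when no group is given."""
    rank = tp_group.rank() if tp_group is not None else global_tp_rank()
    return rank == 0


def should_log_explicit(tp_group):
    """New shape: the caller must supply the group, so the topology is
    always the one carried by the grid-derived collection."""
    return tp_group.rank() == 0


# A TP group made of global ranks 4 and 5, seen from each member.
first = should_log_explicit(FakeGroup(ranks=(4, 5), my_rank=4))
second = should_log_explicit(FakeGroup(ranks=(4, 5), my_rank=5))
```

In this sketch the explicit variant works with no global initialization, while calling `should_log_with_fallback()` without a group raises. With the real library, the more dangerous case is the one the commit message names: a global state that is initialized but disagrees with the grid, which the fallback would use without any error.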

yashaswikarnati and others added 2 commits May 10, 2026 03:05

yashaswikarnati wants to merge 2 commits into main from ykarnati/nmfw-464-claude
