
NMFW-464: HyperCommGrid alt-factorization + Nemotron VLM E2E (MIMO hetero parallel) #17

Draft
yashaswikarnati wants to merge 2 commits into main from ykarnati/nmfw-464-claude

@yashaswikarnati (Owner)

Summary

  • Adds first-class alt-factorization to HyperCommGrid so EP/ETP/EDP overlap onto the same physical ranks as TP/CP/DP within each PP stage. Constraint: tp*cp*dp == ep*etp*edp with pp shared. World size stays tp*cp*dp*pp. (NMFW-464 expert-overlap fix.)
  • Adds ProcessGroupCollection.from_hyper_comm_grid() so MIMO / DDP / optimizer / MoE call sites can build their PG collection directly from a HyperCommGrid — no parallel_state.initialize_model_parallel() required for the MIMO hetero path. Expert fields are populated to None when the grid carries no alt factorization, so hasattr probes uniformly succeed.
  • Verifies the literal Nemotron VLM (RADIOEncoderWrapper + MultimodalProjector + MambaModel/HybridModel) end-to-end on 8 GPUs in both colocated and non-colocated MIMO modes with the new substrate (mock images + token IDs). Non-colocated path uses MultiModulePipelineCommunicator + 1F1B schedule.
  • Carries the Nemotron-MoE VLM config / model_provider / RADIOEncoderWrapper from feat/nemotron-moe-vlm-mimo so the production assembly lives alongside the substrate.
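
The constraint in the first bullet can be illustrated with a small sketch. This is not the HyperCommGrid API: `check_alt_factorization` and `groups_along` are hypothetical helpers, and the rank layout (first listed axis varies fastest inside a per-PP-stage slab) is an assumption made only for illustration.

```python
from itertools import product


def check_alt_factorization(tp, cp, dp, pp, ep, etp, edp):
    """Return the world size, or raise if the two factorizations disagree."""
    stage_size = tp * cp * dp
    alt_size = ep * etp * edp
    if alt_size != stage_size:
        raise ValueError(
            f"tp*cp*dp = {stage_size} but ep*etp*edp = {alt_size}"
        )
    # PP is shared, so the alt factorization adds no extra ranks.
    return stage_size * pp


def groups_along(axis_sizes, axis, base=0):
    """List the rank groups along `axis` inside one per-PP-stage slab.

    ASSUMPTION: ranks in a slab are laid out with the first axis in
    `axis_sizes` varying fastest. The real grid may order axes differently.
    """
    strides, stride = {}, 1
    for name, size in axis_sizes.items():
        strides[name] = stride
        stride *= size
    others = [name for name in axis_sizes if name != axis]
    groups = []
    for coords in product(*(range(axis_sizes[name]) for name in others)):
        start = base + sum(c * strides[name] for c, name in zip(coords, others))
        groups.append([start + i * strides[axis] for i in range(axis_sizes[axis])])
    return groups


# Example: 8 ranks in one PP stage, viewed two ways.
world = check_alt_factorization(tp=2, cp=1, dp=4, pp=1, ep=4, etp=2, edp=1)
tp_groups = groups_along({"tp": 2, "cp": 1, "dp": 4}, "tp")
ep_groups = groups_along({"etp": 2, "ep": 4, "edp": 1}, "ep")
```

Under these assumptions both `tp_groups` and `ep_groups` partition the same eight ranks, which is the overlap property the alt factorization is meant to provide.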

Test plan

All 11 tests pass under a single Slurm batch job (scripts/nmfw464_e2e_batch.sh); each test runs in its own torch.distributed.run invocation so global singletons can't leak.

  • tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGrid (mock)
  • tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGridAltFactorization (mock)
  • tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGridIntegration (NCCL: alt-factorization overlap + PP-confined invariants)
  • tests/unit_tests/test_process_groups_config.py::TestPGConfigFromHyperCommGrid (NCCL)
  • tests/unit_tests/transformer/moe/test_moe_with_hcg_pg.py (MoE GPT through forward_backward_no_pipelining with HCG-only PGs)
  • tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_moe_colocated_8gpu[False] (basic GPT-MoE)
  • tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_moe_colocated_8gpu[True] (Nemotron-flavor GPT-MoE)
  • tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_nemotron_mamba_moe_colocated_8gpu (literal Mamba-MoE)
  • tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_nemotron_radio_mamba_moe_colocated_8gpu (literal Nemotron VLM, colocated)
  • tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_mamba_moe_non_colocated_8gpu (Mamba-MoE non-colocated)
  • tests/unit_tests/models/test_mimo_moe_e2e.py::test_mimo_nemotron_radio_mamba_moe_non_colocated_8gpu (literal Nemotron VLM, non-colocated)
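
The per-test isolation described above could be driven from Python roughly as follows. This is a sketch only: the contents of scripts/nmfw464_e2e_batch.sh are not shown in this PR description, so `launch_command` and `run_all` are hypothetical helpers, and the choice of 8 processes per node simply mirrors the 8-GPU setup mentioned in the summary.

```python
import subprocess

# A subset of the test IDs from the list above, used as sample input.
TESTS = [
    "tests/unit_tests/test_hyper_comm_grid.py::TestHyperCommGrid",
    "tests/unit_tests/transformer/moe/test_moe_with_hcg_pg.py",
]


def launch_command(test_id, nproc_per_node=8):
    """Build one torch.distributed.run invocation for a single test ID."""
    return [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "-m", "pytest", test_id,
    ]


def run_all(test_ids, nproc_per_node=8):
    """Run each test in a fresh launcher process and collect the failures.

    Every test gets new worker processes, so process-global singletons
    from one test cannot be observed by the next.
    """
    failed = []
    for test_id in test_ids:
        result = subprocess.run(launch_command(test_id, nproc_per_node))
        if result.returncode != 0:
            failed.append(test_id)
    return failed
```

Calling `run_all(TESTS)` on a machine with the required GPUs would launch the tests one after another; the sketch does not call it at import time.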

Three rounds of code review (overdesign / correctness / software practices) plus three batch e2e runs were completed before pushing.

Notes

  • HybridModel still calls into a few parallel_state accessors (e.g. log_on_each_pipeline_stage), so the e2e tests make a minimal parallel_state.initialize_model_parallel(...) call with a topology matching the LLM grid. The call goes through a _reset_parallel_state helper that is order-safe: it destroys any existing state before reinitializing.
  • mamba_ssm and causal-conv1d are required (already in pyproject.toml [dev]/[lts]).
  • examples/mimo/{configs,model_providers}/nemotron_moe_vlm.py are cherry-picked from feat/nemotron-moe-vlm-mimo and not exercised directly by these tests; their relative from configs.… import works through the existing examples/mimo/train.py sys.path setup.

🤖 Generated with Claude Code

yashaswikarnati and others added 2 commits May 10, 2026 03:05
Phase 1 substrate that lets the Nemotron VLM (RADIO ViT + MLP projection
+ Mamba-MoE LLM) run end-to-end with heterogeneous parallelism in MIMO,
in both colocated and non-colocated modes, without inflating the world
size with orthogonal expert axes.

Core:
- HyperCommGrid first-class alt-factorization: a single grid object can
  carry both a primary factorization (tp/cp/dp/pp) and an alt
  factorization (etp/ep/edp) over the same per-PP-stage rank slab. The
  constraint is tp*cp*dp == ep*etp*edp with PP shared. World size stays
  tp*cp*dp*pp.
- ProcessGroupCollection.from_hyper_comm_grid(): builds the standard
  model-parallel + expert PG fields directly from a HyperCommGrid, with
  no global parallel_state initialization. Expert fields are populated
  to None when the grid carries no alt factorization so DDP/optimizer
  hasattr probes work uniformly.
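
Why populating expert fields with None helps can be sketched in a few lines. This is not the ProcessGroupCollection implementation: `build_collection`, `EXPERT_FIELDS`, and the field names are hypothetical, and plain strings stand in for process groups.

```python
from types import SimpleNamespace

# Hypothetical expert field names, chosen only for this illustration.
EXPERT_FIELDS = ("ep", "etp", "edp")


def build_collection(primary, expert=None):
    """Build a collection whose expert fields always exist.

    When no alt factorization is given, each expert field is set to None
    rather than left undefined, so hasattr() gives the same answer for
    dense and MoE configurations.
    """
    fields = dict(primary)
    for name in EXPERT_FIELDS:
        fields[name] = None if expert is None else expert[name]
    return SimpleNamespace(**fields)


def uses_expert_parallelism(collection):
    """Consumer-side probe: the field is always present, possibly None."""
    return hasattr(collection, "ep") and collection.ep is not None


dense = build_collection({"tp": "tp_group", "dp": "dp_group"})
moe = build_collection(
    {"tp": "tp_group", "dp": "dp_group"},
    expert={"ep": "ep_group", "etp": "etp_group", "edp": "edp_group"},
)
```

With this shape, a consumer needs only one code path: check that the attribute exists, then branch on whether it is None.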

Tests (all 11 pass on 8 GPUs in a Slurm batch job; see
scripts/nmfw464_e2e_batch.sh):
- HyperCommGrid alt-factorization unit + 2 NCCL integration tests
  (overlap + PP-confined invariants).
- ProcessGroupCollection.from_hyper_comm_grid distributed tests.
- MoE GPT through forward_backward_no_pipelining with HCG-only PGs.
- MIMO + GPT-MoE colocated (basic + Nemotron-flavor TransformerConfig).
- MIMO + literal Mamba-MoE colocated.
- Literal Nemotron VLM (RADIOEncoderWrapper + MultimodalProjector +
  MambaModel) colocated.
- MIMO + Mamba-MoE non-colocated (1F1B with
  MultiModulePipelineCommunicator bridge).
- Literal Nemotron VLM non-colocated.

Also pulled the Nemotron-MoE VLM model provider, config, and
RADIOEncoderWrapper from feat/nemotron-moe-vlm-mimo so the production
assembly is carried alongside the substrate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MIMO hetero path is meant to flow process groups end-to-end through
HyperCommGrid + ProcessGroupCollection, never via parallel_state. The
previous E2E tests called parallel_state.initialize_model_parallel(...)
to keep HybridModel and RADIOViTModel from asserting — that masked the
risk that any code reaching for parallel_state would silently pick up a
group that disagreed with the HCG topology.

Three plumbing fixes (hybrid model, RADIO encoder, tests), listed by file,
move us off parallel_state for the colocated and non-colocated MIMO paths:

- megatron/core/models/hybrid/hybrid_layer_allocation.py: thread tp_group
  and dp_cp_group through select_pipeline_segment so its
  log_on_each_pipeline_stage call no longer falls back to
  parallel_state.get_tensor_model_parallel_rank.
- megatron/core/models/hybrid/hybrid_model.py: pass
  pg_collection.tp / pg_collection.dp_cp into select_pipeline_segment.
- megatron/core/models/vision/radio.py: thread pg_collection.tp into the
  embedder's ColumnParallelLinear so it doesn't fall back to
  parallel_state.get_tensor_model_parallel_group at forward time.
- examples/mimo/model_providers/radio_encoder.py: accept and forward
  pg_collection to RADIOViTModel.
- tests/unit_tests/models/test_mimo_moe_e2e.py: replace
  _reset_parallel_state(...) (which destroyed-and-reinit'd
  parallel_state) with _reset_global_singletons() that only destroys
  any leftover state. Tests now verify every group used by the model
  comes from the HyperCommGrid.
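
The pattern behind these fixes, passing the group explicitly instead of reaching for global state, can be sketched as follows. None of these names come from Megatron: `FakeGroup`, `get_global_tp_rank`, `log_legacy`, and `log_explicit` are hypothetical stand-ins, with `FakeGroup.rank()` imitating the `rank()` method of a torch.distributed process group.

```python
_GLOBAL_TP_RANK = None  # stand-in for parallel_state-style global state


def get_global_tp_rank():
    """Global accessor: fails unless some earlier code initialized it."""
    if _GLOBAL_TP_RANK is None:
        raise RuntimeError("global parallel state is not initialized")
    return _GLOBAL_TP_RANK


class FakeGroup:
    """Minimal stand-in for a process group that knows the caller's rank."""

    def __init__(self, rank):
        self._rank = rank

    def rank(self):
        return self._rank


def log_legacy(message):
    """Old style: the rank comes from hidden global state."""
    return message if get_global_tp_rank() == 0 else None


def log_explicit(message, tp_group):
    """New style: the caller supplies the group, so no global is consulted."""
    return message if tp_group.rank() == 0 else None


# Works with no global initialization at all.
first = log_explicit("stage layout", FakeGroup(rank=0))
other = log_explicit("stage layout", FakeGroup(rank=1))
```

In this sketch the explicit variant cannot pick up a group that disagrees with the caller's topology, because the only group it sees is the one it was handed.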

Full E2E batch (11 tests, 8 GPUs Slurm) green with no parallel_state
init for the MIMO/MoE path; HCG groups are the only model-parallel
topology the process knows about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>