NMFW-17: Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training by yashaswikarnati · Pull Request #1 · yashaswikarnati/Megatron-LM

yashaswikarnati · 2026-03-26T18:48:19Z

Summary

ColocatedBridgeCommunicator: autograd-aware fan-in/fan-out/equal-DP communication between encoder and LLM with different TP/DP on same ranks
MimoModel colocated forward path with config, role, and optimizer support
3 test files: communicator unit tests (11), multi-iteration correctness (9 checks x 3 iters x 3 configs), e2e VLM with MimoOptimizer
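
As a rough illustration of the three communication modes named above, the sketch below classifies an encoder-to-LLM bridge from the two DP sizes. It is not the communicator's code. The helper name `classify_dp_relation` is hypothetical, the ratio mirrors the `dp_scale_factor = src_dp_size / dest_dp_size` line in the diff, the fan-in direction follows the "fan-in TP2/DP4→TP2/DP2/PP2" test name, and the fan-out direction and the even-divisibility check are assumptions.

```python
from fractions import Fraction
from typing import Tuple


def classify_dp_relation(src_dp_size: int, dest_dp_size: int) -> Tuple[str, Fraction]:
    """Return (mode, dp_scale_factor) for a source -> destination bridge.

    Hypothetical helper: the mode names follow the PR summary, but the
    mapping from DP sizes to modes is an assumption of this sketch.
    """
    if src_dp_size <= 0 or dest_dp_size <= 0:
        raise ValueError("DP sizes must be positive")
    larger = max(src_dp_size, dest_dp_size)
    smaller = min(src_dp_size, dest_dp_size)
    if larger % smaller != 0:
        # Assumption: one DP size must be an integer multiple of the other.
        raise ValueError("DP sizes must divide evenly")

    scale = Fraction(src_dp_size, dest_dp_size)
    if scale > 1:
        mode = "fan_in"    # several source DP replicas feed one destination replica
    elif scale < 1:
        mode = "fan_out"   # one source DP replica feeds several destination replicas
    else:
        mode = "equal_dp"  # one-to-one DP mapping; only the TP layout differs
    return mode, scale


if __name__ == "__main__":
    print(classify_dp_relation(4, 2))
    print(classify_dp_relation(2, 2))
```

Using `Fraction` instead of float division keeps the fan-out ratio exact (for example 1/2), which is convenient if the factor is later used to split a batch dimension.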

Test commands (8 GPUs, run individually)

uv run python -m torch.distributed.run --nproc_per_node=8 -m pytest tests/unit_tests/models/test_mimo_colocated_communicator.py -v
uv run python -m torch.distributed.run --nproc_per_node=8 -m pytest "tests/unit_tests/models/test_mimo_colocated_correctness.py::TestColocatedCorrectness::test_correctness[fan_in]" -v
uv run python -m torch.distributed.run --nproc_per_node=8 -m pytest tests/unit_tests/models/test_mimo_colocated_e2e.py -v

Linear: NMFW-17

🤖 Generated with Claude Code

yashaswikarnati · 2026-03-26T19:04:23Z

+        packing_kwargs: Optional[dict] = None,
+    ):
+        """Forward pass for colocated mode: encoder and LLM on same ranks, different TP/DP."""
+        packed_seq_params = None


Let's not worry about sequence packing for now.

yashaswikarnati · 2026-03-26T19:04:55Z

+        )
+
+        # 4. Optional partition adapter
+        if self.partition_adapter is not None:


Also, don't worry about the partition adapter yet.

yashaswikarnati · 2026-03-26T19:07:18Z

+        packing_kwargs: Optional[dict] = None,
+    ):
+        """Forward pass for colocated mode: encoder and LLM on same ranks, different TP/DP."""
+        packed_seq_params = None


Also, this function seems a little verbose. Did we copy almost the whole thing and just add the step that applies the colocated comms?

yashaswikarnati · 2026-03-26T19:11:41Z



+@dataclass
+class ColocatedCommConfig:


Do we really need a separate ColocatedCommConfig? Also, the module-to-grid map seems to be replicated in both places: in the MIMO model config and here. Please suggest simpler and cleaner alternatives.

yashaswikarnati · 2026-03-26T19:12:22Z

+        self._extract_parallelism_info()
+        self._build_rank_mappings()
+
+        self.all_gather_pg: Optional[dist.ProcessGroup] = None


When can this be None?

yashaswikarnati · 2026-03-26T19:12:55Z

+            )
+
+    def _extract_parallelism_info(self):
+        self.src_tp_size = self.src_grid.shape[self.src_grid.dim_names.index('tp')]


Can we use the process group for this, e.g. pg.size()?

yashaswikarnati · 2026-03-26T19:14:46Z

+    """Config for colocated modules with different TP/DP on same ranks."""
+
+    module_to_grid_map: Dict[str, 'HyperCommGrid'] = field(default_factory=dict)
+    topology: Dict[str, list] = field(default_factory=dict)


What do we need topology for?

yashaswikarnati · 2026-03-26T19:16:54Z

    from megatron.core.optimizer import get_megatron_optimizer

    grid_map = mimo_model.mimo_config.module_to_grid_map
+    if grid_map is None and mimo_model.mimo_config.colocated_comm_config is not None:


This seems a little redundant. Doesn't it create two sources of truth?

yashaswikarnati · 2026-03-26T19:17:38Z

@@ -0,0 +1,348 @@
+# Colocated MIMO Correctness Testing Design


Don't push local planning docs here: nothing from docs/plans unless explicitly asked.

yashaswikarnati · 2026-03-26T19:20:22Z

+        self.dp_scale_factor = self.src_dp_size / self.dest_dp_size
+
+    def _build_rank_mappings(self):
+        self.rank_to_src_pos: Dict[int, Tuple[int, int]] = {}


What is this variable storing here?
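
The diff shows only the type of `rank_to_src_pos` (`Dict[int, Tuple[int, int]]`), not what the tuple holds. One plausible reading, offered purely as an assumption, is a map from global rank to that rank's `(dp_idx, tp_idx)` position in the source grid. The sketch below builds such a map with a hypothetical helper `build_rank_positions`, assuming TP is the fastest-varying dimension; the actual grid ordering may differ.

```python
from typing import Dict, List, Tuple


def build_rank_positions(
    ranks: List[int], tp_size: int, dp_size: int
) -> Dict[int, Tuple[int, int]]:
    """Map each global rank to a hypothetical (dp_idx, tp_idx) position.

    Assumption: ranks are laid out with TP varying fastest, so consecutive
    ranks belong to the same TP group.
    """
    if len(ranks) != tp_size * dp_size:
        raise ValueError("number of ranks must equal tp_size * dp_size")
    # divmod(i, tp_size) yields (i // tp_size, i % tp_size) == (dp_idx, tp_idx)
    return {rank: divmod(i, tp_size) for i, rank in enumerate(ranks)}


if __name__ == "__main__":
    # 8 ranks arranged as TP2/DP4
    for rank, pos in build_rank_positions(list(range(8)), tp_size=2, dp_size=4).items():
        print(rank, pos)
```

If this reading is right, a docstring or a more descriptive name (for example `rank_to_src_dp_tp`) on the real attribute would answer the question in place.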

yashaswikarnati · 2026-03-26T19:22:33Z

-        if mimo_config.module_to_grid_map:
+        self.colocated_comms = {}
+        if mimo_config.colocated_comm_config is not None:
+            self.role = RankRole.colocated(modality_names + [MIMO_LANGUAGE_MODULE_KEY])


This seems a little flaky. Can we also build this from the grid map?

yashaswikarnati · 2026-03-26T19:24:04Z

+                )
+        return modality_embeddings
+
+    def _forward_colocated(


Would unified mode still make sense, or will that break things now? Can we cleanly support both under the colocated umbrella: colocated with custom process groups, and legacy unified with process groups taken from parallel state?

… (NMFW-17)

COLOCATED mode replaces UNIFIED; it covers both legacy (no grid map) and heterogeneous TP/DP on shared ranks. Auto-detects colocated from grid overlap.

Core: ColocatedBridgeCommunicator with fan-in/fan-out/equal-DP autograd.
Model: _forward_all_modules with optional colocated communication.
Tests: communicator unit tests, multi-iteration correctness, e2e VLM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
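
The commit message says colocated mode is auto-detected from grid overlap and that the legacy case has no grid map. A minimal sketch of that idea is below. It works on plain sets of ranks rather than on `HyperCommGrid` objects, and the function name `detect_placement` and its return labels are hypothetical, not names from the PR.

```python
from typing import Dict, Set


def detect_placement(module_to_ranks: Dict[str, Set[int]]) -> str:
    """Classify module placement from rank overlap (illustrative only).

    Returns "colocated" when there is no map (legacy case) or when every
    module occupies exactly the same ranks, "disjoint" when no two modules
    share a rank, and "partial_overlap" otherwise.
    """
    if not module_to_ranks:
        # Assumption: no grid map means the legacy single-group setup.
        return "colocated"

    rank_sets = list(module_to_ranks.values())
    if all(ranks == rank_sets[0] for ranks in rank_sets):
        return "colocated"

    for i, first in enumerate(rank_sets):
        for second in rank_sets[i + 1:]:
            if first & second:
                return "partial_overlap"
    return "disjoint"


if __name__ == "__main__":
    shared = {"images": set(range(8)), "language": set(range(8))}
    print(detect_placement(shared))
```

Deriving the mode from the grid map alone would leave a single source of truth, which is what several of the review comments above ask for.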

Three-phase execution for colocated encoder PP=1 + LLM PP>1:
- Phase 1: Encoder forward + communicate on full batch (all ranks sync)
- Phase 2: LLM 1F1B pipeline with detached encoder embeddings
- Phase 3: Encoder backward on full batch (all ranks sync)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yashaswikarnati · 2026-03-28T03:47:13Z

Status Update (NMFW-17 + NMFW-19)

Implementation Complete — 19 tests passing

PR structure:

Commit 1 (NMFW-17): PP=1 ColocatedBridgeCommunicator — fan-in/fan-out/equal-DP autograd, COLOCATED mode replaces UNIFIED, auto-detect from grid overlap
Commit 2 (NMFW-19): PP>1 three-phase schedule — encoder batch → LLM 1F1B pipeline → encoder backward
Commit 3: Moved correctness tests to PR #2 (NMFW-50: Colocated correctness tests)

PP=1 tests: communicator (11), e2e VLM + MimoOptimizer (1)
PP>1 tests: fan-in TP2/DP4→TP2/DP2/PP2 (1), equal-DP TP4/DP2→TP2/DP2/PP2 (1), grad accumulation 6mb (1), extreme TP1/DP8→TP4/DP1/PP2 (1)
Correctness tests: moved to PR #2
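
As a quick sanity check on the PP>1 configurations listed above: since both modules share the same 8 ranks, each side's parallel sizes should multiply to 8. The sketch below verifies that arithmetic for the three named configs. The helper `world_size` is hypothetical, and it assumes no other parallel dimensions (such as CP or EP) are in play.

```python
import re
from math import prod


def world_size(spec: str) -> int:
    """Product of the sizes in a spec such as 'TP2/DP2/PP2'."""
    sizes = [int(n) for n in re.findall(r"[A-Z]+(\d+)", spec)]
    if not sizes:
        raise ValueError(f"no parallel sizes found in {spec!r}")
    return prod(sizes)


# Encoder -> LLM layouts taken from the PP>1 test names above.
configs = [
    "TP2/DP4→TP2/DP2/PP2",  # fan-in
    "TP4/DP2→TP2/DP2/PP2",  # equal-DP
    "TP1/DP8→TP4/DP1/PP2",  # extreme
]

for config in configs:
    encoder, llm = config.split("→")
    # Both modules live on the same 8 GPUs, so both products must be 8.
    assert world_size(encoder) == world_size(llm) == 8
```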

PP>1 Design Summary

Phase 1: One encoder forward + one communicate on full batch (all ranks sync)
Phase 2: 1F1B pipeline for LLM with detached encoder embeddings sliced per microbatch
Phase 3: Broadcast gradient from PP stage 0 → 1+, one encoder backward (all ranks sync)
Detaching prevents the encoder TP all-reduce (which may cross PP stages) from running inside the staggered pipeline
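
The ordering described above can be sketched as a plain list of steps per pipeline rank. This is a toy model of the schedule, not the PR's implementation: all function and step names are hypothetical, the 1F1B ordering follows the standard non-interleaved pattern (warmup forwards, steady one-forward-one-backward, cooldown backwards), and equal-sized microbatches are assumed.

```python
from typing import List, Tuple


def microbatch_slices(batch_size: int, num_microbatches: int) -> List[Tuple[int, int]]:
    """(start, end) index ranges used to slice the detached embeddings."""
    if batch_size % num_microbatches != 0:
        raise ValueError("batch_size must be divisible by num_microbatches")
    step = batch_size // num_microbatches
    return [(i * step, (i + 1) * step) for i in range(num_microbatches)]


def one_f_one_b(num_microbatches: int, pp_size: int, pp_rank: int) -> List[str]:
    """Forward/backward order for one stage of a non-interleaved 1F1B pipeline."""
    warmup = min(pp_size - pp_rank - 1, num_microbatches)
    steps = [f"F{i}" for i in range(warmup)]       # warmup forwards
    fwd, bwd = warmup, 0
    for _ in range(num_microbatches - warmup):     # steady state: 1 forward, 1 backward
        steps.append(f"F{fwd}")
        fwd += 1
        steps.append(f"B{bwd}")
        bwd += 1
    steps += [f"B{i}" for i in range(bwd, num_microbatches)]  # cooldown backwards
    return steps


def three_phase_schedule(num_microbatches: int, pp_size: int, pp_rank: int) -> List[str]:
    """Phase 1 (encoder, all ranks), phase 2 (LLM 1F1B), phase 3 (encoder backward)."""
    phase1 = ["encoder_forward", "bridge_communicate"]
    phase2 = one_f_one_b(num_microbatches, pp_size, pp_rank)
    phase3 = ["broadcast_embedding_grad", "encoder_backward"]
    return phase1 + phase2 + phase3


if __name__ == "__main__":
    for rank in range(2):
        print(rank, three_phase_schedule(num_microbatches=4, pp_size=2, pp_rank=rank))
    print(microbatch_slices(batch_size=12, num_microbatches=6))
```

In this model, phases 1 and 3 are identical on every rank, so the encoder's collectives stay aligned, and only phase 2 is staggered across pipeline stages.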

yashaswikarnati force-pushed the ykarnati/nmfw-17-colocated-colocated-bridge-communicator branch from c12e3db to a8122d5 Compare March 26, 2026 20:48

yashaswikarnati and others added 2 commits March 27, 2026 20:28

yashaswikarnati force-pushed the ykarnati/nmfw-17-colocated-colocated-bridge-communicator branch from a8122d5 to d834b76 Compare March 28, 2026 03:29

Move correctness tests to separate PR (NMFW-50)

a1b6238

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yashaswikarnati mentioned this pull request Mar 28, 2026

NMFW-50: Colocated correctness tests #2

Draft

yashaswikarnati closed this Apr 28, 2026
