docs: add agentic heterogeneous parallelism discussion article by yashaswikarnati · Pull Request #13 · yashaswikarnati/Megatron-LM

yashaswikarnati · 2026-04-28T06:34:52Z

What does this PR do?

Adds a discussion article to docs/discussions/ documenting agent-driven exploration of heterogeneous parallelism for multimodal (MIMO) training in Megatron-LM.

Article content

Agentic research methodology and operating model (autonomous exploration + human guardrails)
Agent team structure: team lead, campaign manager, systems expert, SLURM runner
Throughput results: +8.1% to +14.0% at 8K seq len, up to +41.8% at 64K (H100 BF16, 16-64 GPUs)
Optimization chains discovered by agents (encoder TP=1, encoder recompute as throughput lever, colocated PP + offload)
Failure modes and human verification requirements
Reusable patterns for adapting agent teams to other workloads

Files changed

docs/discussions/agentic-heterogeneous-parallelism/agentic-heterogeneous-parallelism.md (new)
docs/discussions/agentic-heterogeneous-parallelism/images/throughput_8k.png (new)
docs/discussions/agentic-heterogeneous-parallelism/images/throughput_long_context.png (new)
docs/discussions/README.md (updated index)

Checklist

Documentation only
Unit tests
Functional tests

Ref: NMFW-451

Adds a discussion article documenting agent-driven exploration of heterogeneous parallelism for multimodal (MIMO) training in Megatron-LM.

- Article covers agentic research methodology, agent team structure, throughput results (H100 BF16, 16-64 GPUs), discovered optimization chains, failure modes, and reusable patterns
- Includes throughput charts for 8K and long-context sequences
- Updates docs/discussions/README.md with new entry

Ref: NMFW-451

yashaswikarnati

Review: agentic heterogeneous parallelism article

Verified the full diff and cross-checked every number in Tables 1 and 2 against raw/data_charts/MiMo-Colocated Throughput - Master.csv. Tables match the source data exactly (TFLOPs, deltas, memory, iteration times, vision-token fractions, parallelism configs).

What's working

Narrative leads with the operating model and frames throughput as evidence (intro and "Results: Throughput Evidence" header both reinforce this). Aligned with the requested framing.
Tone is engineering throughout. No marketing language; honesty about the 80/20 ceiling, propagating baseline bias, and over-generalized rules is the right move.
No internal strings (cog, gitlab-master, coreai_dlalgo, /lustre/, cw-dfw, NMFW-, CLAUDE_CODE_EXPERIMENTAL) anywhere in the article body or README addition. Lustre appears only as a generic filesystem-type example, which is fine.
The optimization-stack notes under each table (encoder TP=1 + offload, PP=2 enabling lower LLM TP, etc.) match the per-row "Notes" column in the source CSV.
Image references images/throughput_8k.png and images/throughput_long_context.png resolve.
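
The internal-string check above can be repeated mechanically. Below is a minimal sketch, assuming the deny-list is exactly the strings named in this review; the function names (`find_internal_strings`, `scan_file`) and the whole-word rule for `cog` are illustrative assumptions, not an existing tool in the repo.

```python
import re
from pathlib import Path

# Deny-list copied from the review comment. Matching is case-sensitive, so
# the generic filesystem name "Lustre" does not trip the "/lustre/" entry.
INTERNAL_STRINGS = [
    "cog",
    "gitlab-master",
    "coreai_dlalgo",
    "/lustre/",
    "cw-dfw",
    "NMFW-",
    "CLAUDE_CODE_EXPERIMENTAL",
]


def _pattern(needle):
    # Assumption: purely alphabetic needles ("cog") match whole words only,
    # so ordinary words such as "recognize" are not flagged.
    escaped = re.escape(needle)
    return re.compile(rf"\b{escaped}\b" if needle.isalpha() else escaped)


def find_internal_strings(text, needles=INTERNAL_STRINGS):
    """Return (line_number, needle) for every deny-listed hit in text."""
    patterns = [(needle, _pattern(needle)) for needle in needles]
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for needle, pattern in patterns:
            if pattern.search(line):
                hits.append((lineno, needle))
    return hits


def scan_file(path):
    """Scan one file, e.g. the article markdown or the README."""
    return find_internal_strings(Path(path).read_text(encoding="utf-8"))
```

Running `scan_file` on the article and on docs/discussions/README.md should return an empty list if the claim above holds.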

Suggestions (non-blocking)

Table 1, 1B+7B iter time. The cell shows "—" and the table caption explains it ("the 1B+7B 8K iter time was not captured in the source data"). Confirmed against the CSV (the Fwd+Bwd columns are blank for that row). The wording is fine; I am flagging it only because this is the only "—" in either table, and a reader skimming the column might assume a measurement failure rather than missing source data. Consider tightening the caption to "not recorded in the source data" to remove that ambiguity.
Resources section. Only two links. Could also link the Megatron-FSDP discussion in the same docs/discussions/ directory, since the operating-model framing applies to other systems-research efforts. Optional.
"Sanitized agent role definitions ... follow-up change." Either land them in this PR or open a tracking issue and reference it; otherwise the promise rots.

No factual issues. Ready to approve: the content is solid, and tightening the iter-time caption is optional.

yashaswikarnati · 2026-04-28T16:30:06Z

+
+Sanitized agent role definitions for adaptation to other workloads will be added in a follow-up change.
+
+## Resources


remove resources

yashaswikarnati · 2026-04-28T16:43:38Z

+
+| Encoder | LLM   | World | GBS | Homo TP/DP/PP | Hetero Enc TP/DP/PP | Hetero LLM TP/DP/PP | Homo TFLOPs/GPU | Hetero TFLOPs/GPU |    Δ    | Mem Homo (GB) | Mem Hetero (GB) | Iter Homo (ms) | Iter Hetero (ms) |
+| ------- | ----- | ----: | --: | ------------- | ------------------- | ------------------- | --------------: | ----------------: | ------: | ------------: | --------------: | -------------: | ---------------: |
+| 1B      | 7B    |    16 |  64 | 4/4/1         | 1/16/1              | 4/4/1               |           333.8 |             360.8 |  +8.1%  |          66.3 |            62.2 |              — |                — |


we can remove first row
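
For reference, the quoted 1B+7B row is internally consistent, which can be checked from its cells alone. The sketch below assumes that in the colocated setup each TP/DP/PP triple multiplies out to the world size (true for this row) and that Δ is the relative TFLOPs/GPU gain rounded to one decimal. The dictionary keys and function names are hypothetical and are not column names from the source CSV.

```python
from math import prod

# Cells copied from the quoted 1B+7B row of Table 1 (key names are made up).
ROW = {
    "world": 16,
    "homo": "4/4/1",
    "hetero_enc": "1/16/1",
    "hetero_llm": "4/4/1",
    "homo_tflops": 333.8,
    "hetero_tflops": 360.8,
    "delta": "+8.1%",
}

LAYOUT_KEYS = ("homo", "hetero_enc", "hetero_llm")


def parse_layout(cell):
    """Parse a 'TP/DP/PP' cell such as '4/4/1' into a tuple of ints."""
    tp, dp, pp = (int(part) for part in cell.split("/"))
    return tp, dp, pp


def layouts_fill_world(row):
    """Check that TP * DP * PP equals the world size for every layout cell."""
    return all(prod(parse_layout(row[key])) == row["world"] for key in LAYOUT_KEYS)


def format_delta(homo_tflops, hetero_tflops):
    """Relative throughput gain of hetero over homo, formatted like the table."""
    gain = (hetero_tflops - homo_tflops) / homo_tflops * 100.0
    return f"{gain:+.1f}%"
```

The same two checks apply row by row to the remaining entries of Tables 1 and 2 once their cells are loaded in this shape.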

…ticle

- Drop the 1B+7B 8K row from Table 1 and regenerate Figure 1 to match. The remaining four rows (1B+14B, 3B+14B, 3B+32B, 3B+70B) are all at 64 GPUs with LLM PP=2; the gain range tightens to +12.7% to +14.0%. Update the Standard sequence length prose and Table 1 caption accordingly. Removes the only "—" entry (1B+7B iter time was not captured in the source data).
- Remove the Resources section and its Table of Contents entry.
- Remove the trailing "follow-up change" promise about sanitized agent role definitions; nothing in this PR commits to it, so the line was load-bearing rot.

Lead with what we did and found, not what the article contains. Drops the meta framing and the word "headline" entirely. Now structured as: what we gave the agents, the three optimization chains they found and the throughput gains, the days of human verification, and what is portable beyond the Megatron-specific knobs. Also rolls up other in-flight edits to the same file.

yashaswikarnati mentioned this pull request Apr 28, 2026

docs: add Megatron-LM skills for unit testing and SLURM execution #14

Draft

3 tasks


yashaswikarnati added 4 commits April 28, 2026 16:54

docs: sharpen opening hook

eb927b2

Mirror the narrative cadence of the prior Megatron-techblog opener: strong action verb in sentence 1, the dual frame (throughput gains and a study of where agents help) in sentence 2, then numbers, human verification, and what is portable.

docs: simplify opening to three sentences

d734aa4

What we did and why, what we found, what the rest of the article covers. No human-verification timeline, no two-goals frame, no portability claim — those belong further down.

yashaswikarnati wants to merge 5 commits into main from techblog-NMFW-451-agentic-heterogeneous-parallelism
