docs: add agentic heterogeneous parallelism discussion article #13
Draft
yashaswikarnati wants to merge 5 commits into
Adds a discussion article documenting agent-driven exploration of heterogeneous parallelism for multimodal (MIMO) training in Megatron-LM.
- Article covers agentic research methodology, agent team structure, throughput results (H100 BF16, 16-64 GPUs), discovered optimization chains, failure modes, and reusable patterns
- Includes throughput charts for 8K and long-context sequences
- Updates docs/discussions/README.md with new entry

Ref: NMFW-451
yashaswikarnati (Owner, Author) left a review comment on Apr 28, 2026:
Review: agentic heterogeneous parallelism article
Verified the full diff and cross-checked every number in Tables 1 and 2 against `raw/data_charts/MiMo-Colocated Throughput - Master.csv`. Tables match the source data exactly (TFLOPs, deltas, memory, iteration times, vision-token fractions, parallelism configs).
What's working
- Narrative leads with the operating model and frames throughput as evidence (intro and "Results: Throughput Evidence" header both reinforce this). Aligned with the requested framing.
- Tone is engineering throughout. No marketing language; honesty about the 80/20 ceiling, propagating baseline bias, and over-generalized rules is the right move.
- No internal strings (`cog`, `gitlab-master`, `coreai_dlalgo`, `/lustre/`, `cw-dfw`, `NMFW-`, `CLAUDE_CODE_EXPERIMENTAL`) anywhere in the article body or README addition. `Lustre` appears only as a generic filesystem-type example, which is fine.
- The optimization-stack notes under each table (encoder TP=1 + offload, PP=2 enabling lower LLM TP, etc.) match the per-row "Notes" column in the source CSV.
- Image references `images/throughput_8k.png` and `images/throughput_long_context.png` resolve.
Suggestions (non-blocking)
- Table 1, 1B+7B iter time. The cell shows `—` and the table caption explains it ("the 1B+7B 8K iter time was not captured in the source data"). Confirmed against CSV (Fwd+Bwd columns are blank for that row). Wording is fine; flagging only because the dash is the only `—` in either table, so a reader skimming the column might assume measurement failure rather than missing source data. Consider tightening the caption to "not recorded in the source data" if you want to remove that ambiguity.
- Resources section. Only two links. Could also link the Megatron-FSDP discussion in the same `docs/discussions/` directory, since the operating-model framing applies to other systems-research efforts. Optional.
- "Sanitized agent role definitions ... follow-up change." Either land them in this PR or open a tracking issue and reference it; otherwise the promise rots.
No factual issues. Approve once the iter-time caption is tightened (optional) — content is solid.
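A check like the internal-strings scan above can be scripted. The following is a minimal sketch, not the tooling used for this review: the needle list is copied from the comment, while the function name `find_internal_strings` and the matching rules (case-sensitive, and not embedded in a longer alphanumeric word) are assumptions chosen so that `Lustre` in prose and `cog` inside "recognize" are not flagged.

```python
import re

# Needles taken from the review comment; the matching rules are an assumption.
INTERNAL_STRINGS = [
    "cog", "gitlab-master", "coreai_dlalgo", "/lustre/",
    "cw-dfw", "NMFW-", "CLAUDE_CODE_EXPERIMENTAL",
]


def _compile(needle):
    """Build a case-sensitive pattern that ignores hits inside longer words."""
    # Only add a boundary on a side where the needle itself starts or ends
    # with a letter or digit, so "/lustre/fs1" and "NMFW-451" still match.
    prefix = r"(?<![A-Za-z0-9])" if needle[0].isalnum() else ""
    suffix = r"(?![A-Za-z0-9])" if needle[-1].isalnum() else ""
    return re.compile(prefix + re.escape(needle) + suffix)


def find_internal_strings(text, needles=INTERNAL_STRINGS):
    """Return (line_number, needle, stripped_line) for every hit in text."""
    patterns = [(needle, _compile(needle)) for needle in needles]
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for needle, pattern in patterns:
            if pattern.search(line):
                hits.append((lineno, needle, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "We recognize Lustre as a filesystem type.\n"
        "Data lives in /lustre/fs1/data.\n"
        "Ref: NMFW-451\n"
    )
    for lineno, needle, line in find_internal_strings(sample):
        print(f"line {lineno}: found {needle!r} in {line!r}")
```

Any hit still needs a human look; short needles such as `cog` can match legitimate text even with the word-boundary rule.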
yashaswikarnati commented on Apr 28, 2026, on these lines of the diff:
Sanitized agent role definitions for adaptation to other workloads will be added in a follow-up change.
## Resources
yashaswikarnati commented on Apr 28, 2026, on this part of Table 1:
| Encoder | LLM | World | GBS | Homo TP/DP/PP | Hetero Enc TP/DP/PP | Hetero LLM TP/DP/PP | Homo TFLOPs/GPU | Hetero TFLOPs/GPU | Δ | Mem Homo (GB) | Mem Hetero (GB) | Iter Homo (ms) | Iter Hetero (ms) |
| ------- | ----- | ----: | --: | ------------- | ------------------- | ------------------- | --------------: | ----------------: | ------: | ------------: | --------------: | -------------: | ---------------: |
| 1B | 7B | 16 | 64 | 4/4/1 | 1/16/1 | 4/4/1 | 333.8 | 360.8 | +8.1% | 66.3 | 62.2 | — | — |
we can remove first row
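For readers checking the row under discussion by hand, the arithmetic can be sketched as below. This is an illustration under two assumptions that the thread does not state explicitly: that Δ is the relative TFLOPs/GPU gain of the heterogeneous layout over the homogeneous one, and that in this colocated setup each module's TP × DP × PP product equals the world size. The helper names are hypothetical.

```python
from math import prod


def parse_layout(layout):
    """Turn a 'TP/DP/PP' string such as '4/4/1' into a tuple of ints."""
    return tuple(int(part) for part in layout.split("/"))


def gpus_used(layout):
    """Number of GPUs a layout spans, assuming TP * DP * PP ranks."""
    return prod(parse_layout(layout))


def percent_gain(homo_tflops, hetero_tflops):
    """Relative gain of hetero over homo, in percent, to one decimal place."""
    return round((hetero_tflops - homo_tflops) / homo_tflops * 100, 1)


# Values copied from the 1B+7B row quoted above.
row = {
    "world": 16,
    "homo": "4/4/1",
    "hetero_enc": "1/16/1",
    "hetero_llm": "4/4/1",
    "homo_tflops": 333.8,
    "hetero_tflops": 360.8,
}

for key in ("homo", "hetero_enc", "hetero_llm"):
    # Assumption: every layout in a colocated run covers all ranks.
    assert gpus_used(row[key]) == row["world"], key

gain = percent_gain(row["homo_tflops"], row["hetero_tflops"])
print(f"{gain:+.1f}%")  # prints "+8.1%"
```

Under those assumptions the row is internally consistent: all three layouts span 16 GPUs and the computed gain matches the Δ column.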
…ticle
- Drop the 1B+7B 8K row from Table 1 and regenerate Figure 1 to match. The remaining four rows (1B+14B, 3B+14B, 3B+32B, 3B+70B) are all at 64 GPUs with LLM PP=2; the gain range tightens to +12.7% to +14.0%. Update the Standard sequence length prose and Table 1 caption accordingly. Removes the only "—" entry (1B+7B iter time was not captured in the source data).
- Remove the Resources section and its Table of Contents entry.
- Remove the trailing "follow-up change" promise about sanitized agent role definitions; nothing in this PR commits to it, so the line was load-bearing rot.
Lead with what we did and found, not what the article contains. Drops the meta framing and the word "headline" entirely. Now structured as: what we gave the agents, the three optimization chains they found and the throughput gains, the days of human verification, and what is portable beyond the Megatron-specific knobs. Also rolls up other in-flight edits to the same file.
Mirror the narrative cadence of the prior Megatron-techblog opener: strong action verb in sentence 1, the dual frame (throughput gains and a study of where agents help) in sentence 2, then numbers, human verification, and what is portable.
What we did and why, what we found, what the rest of the article covers. No human-verification timeline, no two-goals frame, no portability claim — those belong further down.
What does this PR do?
Adds a discussion article to `docs/discussions/` documenting agent-driven exploration of heterogeneous parallelism for multimodal (MIMO) training in Megatron-LM.
Article content
Files changed
- `docs/discussions/agentic-heterogeneous-parallelism/agentic-heterogeneous-parallelism.md` (new)
- `docs/discussions/agentic-heterogeneous-parallelism/images/throughput_8k.png` (new)
- `docs/discussions/agentic-heterogeneous-parallelism/images/throughput_long_context.png` (new)
- `docs/discussions/README.md` (updated index)

Checklist
Ref: NMFW-451