docs: add agentic heterogeneous parallelism discussion article #13

Draft
yashaswikarnati wants to merge 5 commits into main from techblog-NMFW-451-agentic-heterogeneous-parallelism

@yashaswikarnati (Owner)

What does this PR do?

Adds a discussion article to docs/discussions/ documenting agent-driven exploration of heterogeneous parallelism for multimodal (MIMO) training in Megatron-LM.

Article content

  • Agentic research methodology and operating model (autonomous exploration + human guardrails)
  • Agent team structure: team lead, campaign manager, systems expert, SLURM runner
  • Throughput results: +8.1% to +14.0% at 8K seq len, up to +41.8% at 64K (H100 BF16, 16-64 GPUs)
  • Optimization chains discovered by agents (encoder TP=1, encoder recompute as throughput lever, colocated PP + offload)
  • Failure modes and human verification requirements
  • Reusable patterns for adapting agent teams to other workloads

Files changed

  • docs/discussions/agentic-heterogeneous-parallelism/agentic-heterogeneous-parallelism.md (new)
  • docs/discussions/agentic-heterogeneous-parallelism/images/throughput_8k.png (new)
  • docs/discussions/agentic-heterogeneous-parallelism/images/throughput_long_context.png (new)
  • docs/discussions/README.md (updated index)

Checklist

  • Documentation only
  • Unit tests
  • Functional tests

Ref: NMFW-451

Adds a discussion article documenting agent-driven exploration of
heterogeneous parallelism for multimodal (MIMO) training in Megatron-LM.

- Article covers agentic research methodology, agent team structure,
  throughput results (H100 BF16, 16-64 GPUs), discovered optimization
  chains, failure modes, and reusable patterns
- Includes throughput charts for 8K and long-context sequences
- Updates docs/discussions/README.md with new entry

Ref: NMFW-451
@yashaswikarnati (Owner, Author) left a comment

Review: agentic heterogeneous parallelism article

Verified the full diff and cross-checked every number in Tables 1 and 2 against raw/data_charts/MiMo-Colocated Throughput - Master.csv. Tables match the source data exactly (TFLOPs, deltas, memory, iteration times, vision-token fractions, parallelism configs).
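
The delta part of this cross-check can be reproduced with a few lines, shown here as a minimal sketch: recompute the percentage gain from the two TFLOPs/GPU columns and compare it with the reported Δ. The helper names are hypothetical, the values are the 1B+7B row quoted later in this thread, and rounding to one decimal place is an assumption about how the table was generated.

```python
def pct_delta(homo: float, hetero: float) -> float:
    """Relative change of heterogeneous over homogeneous throughput, in percent."""
    return (hetero - homo) / homo * 100.0


def format_delta(homo: float, hetero: float) -> str:
    """Format the change the way the table's delta column shows it (assumed: sign, one decimal)."""
    return f"{pct_delta(homo, hetero):+.1f}%"


# Values copied from the 1B+7B row of Table 1 as quoted in this PR.
homo_tflops, hetero_tflops, reported = 333.8, 360.8, "+8.1%"

computed = format_delta(homo_tflops, hetero_tflops)
print(computed, "matches" if computed == reported else "MISMATCH")
```

Running the same comparison over every row of the CSV would flag any delta that was typed by hand rather than derived from the throughput columns.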

What's working

  • Narrative leads with the operating model and frames throughput as evidence (intro and "Results: Throughput Evidence" header both reinforce this). Aligned with the requested framing.
  • Tone is engineering throughout. No marketing language; honesty about the 80/20 ceiling, propagating baseline bias, and over-generalized rules is the right move.
  • No internal strings (cog, gitlab-master, coreai_dlalgo, /lustre/, cw-dfw, NMFW-, CLAUDE_CODE_EXPERIMENTAL) anywhere in the article body or README addition. Lustre appears only as a generic filesystem-type example, which is fine.
  • The optimization-stack notes under each table (encoder TP=1 + offload, PP=2 enabling lower LLM TP, etc.) match the per-row "Notes" column in the source CSV.
  • Image references images/throughput_8k.png and images/throughput_long_context.png resolve.
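
The internal-string check above could be scripted roughly as follows. This is a hypothetical sketch, not the tool used for this review: the marker list is copied from the bullet, while the function names and the word-boundary handling (so that `cog` does not match inside "recognize") are assumptions.

```python
import re

# Markers copied from the review bullet; matching is case-sensitive,
# so the generic word "Lustre" is not flagged, only the path form "/lustre/".
INTERNAL_MARKERS = [
    "cog",
    "gitlab-master",
    "coreai_dlalgo",
    "/lustre/",
    "cw-dfw",
    "NMFW-",
    "CLAUDE_CODE_EXPERIMENTAL",
]


def marker_pattern(marker: str) -> re.Pattern:
    """Build a regex that only adds a boundary where the marker starts or ends with a letter or digit."""
    prefix = r"(?<![A-Za-z0-9_])" if marker[0].isalnum() else ""
    suffix = r"(?![A-Za-z0-9_])" if marker[-1].isalnum() else ""
    return re.compile(prefix + re.escape(marker) + suffix)


def find_internal_strings(text: str) -> list[tuple[int, str]]:
    """Return (line number, marker) pairs for every marker found in the text."""
    patterns = [(m, marker_pattern(m)) for m in INTERNAL_MARKERS]
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for marker, pattern in patterns:
            if pattern.search(line):
                hits.append((lineno, marker))
    return hits


sample = "We recognize Lustre as a filesystem type.\nSee /lustre/fs1 and NMFW-451."
print(find_internal_strings(sample))
```

An empty result for the article body and the README addition corresponds to the "no internal strings" finding.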

Suggestions (non-blocking)

  1. Table 1, 1B+7B iter time. The cell shows a dash and the table caption explains it ("the 1B+7B 8K iter time was not captured in the source data"). Confirmed against the CSV (Fwd+Bwd columns are blank for that row). Wording is fine; flagging only because this is the only dash entry in either table, so a reader skimming the column might assume measurement failure rather than missing source data. Consider tightening the caption to "not recorded in the source data" if you want to remove that ambiguity.
  2. Resources section. Only two links. Could also link the Megatron-FSDP discussion in the same docs/discussions/ directory, since the operating-model framing applies to other systems-research efforts. Optional.
  3. "Sanitized agent role definitions ... follow-up change." Either land them in this PR or open a tracking issue and reference it; otherwise the promise rots.

No factual issues. Approve once the iter-time caption is tightened (optional) — content is solid.


Sanitized agent role definitions for adaptation to other workloads will be added in a follow-up change.

## Resources
Comment (Owner, Author):

remove resources


| Encoder | LLM | World | GBS | Homo TP/DP/PP | Hetero Enc TP/DP/PP | Hetero LLM TP/DP/PP | Homo TFLOPs/GPU | Hetero TFLOPs/GPU | Δ | Mem Homo (GB) | Mem Hetero (GB) | Iter Homo (ms) | Iter Hetero (ms) |
| ------- | ----- | ----: | --: | ------------- | ------------------- | ------------------- | --------------: | ----------------: | ------: | ------------: | --------------: | -------------: | ---------------: |
| 1B | 7B | 16 | 64 | 4/4/1 | 1/16/1 | 4/4/1 | 333.8 | 360.8 | +8.1% | 66.3 | 62.2 | — | — |
Comment (Owner, Author):

We can remove the first row.

…ticle

- Drop the 1B+7B 8K row from Table 1 and regenerate Figure 1 to match.
  The remaining four rows (1B+14B, 3B+14B, 3B+32B, 3B+70B) are all at
  64 GPUs with LLM PP=2; the gain range tightens to +12.7% to +14.0%.
  Update the Standard sequence length prose and Table 1 caption
  accordingly. Removes the only "—" entry (1B+7B iter time was not
  captured in the source data).
- Remove the Resources section and its Table of Contents entry.
- Remove the trailing "follow-up change" promise about sanitized agent
  role definitions; nothing in this PR commits to it, so the line was
  load-bearing rot.

Lead with what we did and found, not what the article contains.
Drops the meta framing and the word "headline" entirely. Now structured
as: what we gave the agents, the three optimization chains they found
and the throughput gains, the days of human verification, and what is
portable beyond the Megatron-specific knobs.

Also rolls up other in-flight edits to the same file.

Mirror the narrative cadence of the prior Megatron-techblog opener:
strong action verb in sentence 1, the dual frame (throughput gains
and a study of where agents help) in sentence 2, then numbers,
human verification, and what is portable.

What we did and why, what we found, what the rest of the article
covers. No human-verification timeline, no two-goals frame, no
portability claim; those belong further down.