docs: add Megatron-LM skills for unit testing and SLURM execution#14
docs: add Megatron-LM skills for unit testing and SLURM execution#14yashaswikarnati wants to merge 2 commits into
Conversation
yashaswikarnati
left a comment
There was a problem hiding this comment.
Review: Megatron-LM skills (unit tests + SLURM)
Verified both files against the live repo state. Bash-syntax-checked the sbatch template (passes bash -n). Tried uv run pytest --collect-only on the example test path — the path tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor resolves correctly (collection only fails outside the canonical CUDA container, which is expected and is the case the skill explicitly tells the reader about).
Issues that need fixing
-
running-unit-tests-in-megatron-lm.md:-xclaim is wrong.The skill says:
The default pytest config in
pyproject.tomlincludes-x(stop on first failure). Use--maxfail=Nto see more failures per run.Actual
addoptsinpyproject.toml(line 235):addopts = "--durations=15 -s -rA". There is no-x. The "stop on first failure" claim is incorrect, and the--maxfail=Nworkaround is unnecessary. Either delete this bullet or rewrite it to reflect the real defaults (-sdisables capture,-rAshows summary for all outcomes,--durations=15reports the 15 slowest tests). -
mlmextra is deprecated.Both skills recommend
uv sync --extra mlm --extra dev.pyproject.tomlline 80 says explicitly:'mlm' group is deprecated. please use 'training' instead
The
trainingextra is defined immediately above it with the same dependency list. Update both skills touv sync --extra training --extra dev(preserve the--extra ltsalternative note). -
running-on-slurm.md:CUDA_DEVICE_MAX_CONNECTIONS=1is conditional, not universal.The skill exports
CUDA_DEVICE_MAX_CONNECTIONS=1unconditionally and the "Common pitfalls" section frames omitting it as a deadlock risk. The actual behavior inmegatron/training/arguments.py(around line 1290–1320):- Required when
TP > 1orCP > 1on pre-Blackwell (get_device_arch_version() < 10) — Megatron asserts and refuses to start, it does not deadlock. - No longer required on Blackwell (per the comment at line 1295).
- Must NOT be 1 with FSDP —
use_torch_fsdp2anduse_megatron_fsdpboth assert!= "1"(lines 938, 1040). - Different value with
overlap_moe_expert_parallel_comm— recommends 32, not 1.
Recommend rewording to: "Set
CUDA_DEVICE_MAX_CONNECTIONS=1when running TP>1 or CP>1 on pre-Blackwell hardware without FSDP. Megatron will assert at startup if it's required and missing. Do not set it with Torch-FSDP2 or Megatron-FSDP." The current "deadlock" framing is inaccurate. - Required when
Smaller issues
-
Inconsistent flag style.
running-unit-tests-in-megatron-lm.mduses--nproc-per-node(hyphen);running-on-slurm.mduses--nproc_per_node(underscore).torch.distributed.runaccepts both, but pick one for consistency across the skill set. -
Multi-rank command:
pyproject.tomladdoptsincludes-s. Worth noting in the multi-rank section that-s(no capture) means rank 0's stdout interleaves with other ranks' — readers debugging multi-rank tests often want--capture=fdinstead. Optional.
What's correct
- Markers
internal,flaky,flaky_in_devare real (declared atpyproject.toml:238-242). --experimentalflag is real (tests/unit_tests/conftest.py:22-25)./opt/datareference matchestests/unit_tests/conftest.py:80.tests/unit_tests/run_ci_test.shexists.pretrain_gpt.pyexists at repo root.docker/.ngc_version.devanddocker/.ngc_version.ltsexist.- sbatch template parses with
bash -n. uv run python -m torch.distributed.run(not baretorchrun) is the right call and is used consistently.- No internal strings (
cog,gitlab-master,coreai_dlalgo,/lustre/,cw-dfw,NMFW-,CLAUDE_CODE_EXPERIMENTAL).
Block on items 1, 2, 3. Items 4 and 5 are nice-to-haves.
yashaswikarnati
left a comment
There was a problem hiding this comment.
Follow-up review: all 4 fixes verified
Re-reviewed the latest diff. All four issues from the previous review are correctly addressed.
1. -x claim — FIXED ✓
running-unit-tests-in-megatron-lm.md now describes the actual addopts (--durations=15 -s -rA) and explains each flag accurately. No more spurious -x / --maxfail advice.
2. mlm extra deprecation — FIXED ✓
Both skills now recommend uv sync --extra training --extra dev. Re-ran the resolver against the worktree:
uv sync --extra trainingresolves cleanly.uv sync --extra training --extra devonly fails at thenvidia-resiliency-extbuild step outside the canonical NGC container, which is the expected behavior the skill already calls out under Prerequisites ("A working PyTorch + CUDA environment matching the repo's required versions").
3. CUDA_DEVICE_MAX_CONNECTIONS rework — FIXED ✓
Now correctly handled:
- Removed the unconditional
export CUDA_DEVICE_MAX_CONNECTIONS=1from the sbatch template. - Added a dedicated section enumerating all four cases (pre-Blackwell+TP/CP non-FSDP → 1; Blackwell → no-op; FSDP → must NOT be 1;
overlap_moe_expert_parallel_comm→ 32). - "Common pitfalls" now says "the code asserts, it does not deadlock" — accurate.
- In-template comment points at the dedicated section. Good UX.
4. Flag-style consistency — FIXED ✓
Both skills now use the hyphen form: --nproc-per-node, --node-rank, --master-addr, --master-port. The sbatch template matches running-unit-tests-in-megatron-lm.md consistently.
Re-validation
bash -non the updated sbatch template: passes.- All file references still resolve:
pretrain_gpt.py,docker/.ngc_version.{dev,lts},tests/unit_tests/run_ci_test.sh,tests/unit_tests/conftest.py. - No internal strings.
All blockers cleared. Approving — ready to take out of draft when you're ready.
|
|
||
| ## Run the full unit suite | ||
|
|
||
| For tests that do not require multiple GPUs: |
There was a problem hiding this comment.
all tests require GPUs , we dont need just pure pytest ones
|
|
||
| ## Common gotchas | ||
|
|
||
| - Prefer `uv run python -m pytest` over `uv run pytest`. On some managed environments the pytest console script shebang can point to a stale venv path (typically after the venv is renamed or rebuilt), causing the command to fail with exit code 127. Using `python -m pytest` invokes pytest as a module through the current Python and avoids the issue. All examples in this guide use the `python -m pytest` form for that reason. |
There was a problem hiding this comment.
we dont need just uv run python -m pytest, all tests actually use torch.distrributerd.run
|
|
||
| A practical guide to enable Megatron-FSDP training, including a quick-start example for DeepSeek-V3, required and recommended configurations, and instructions for checkpoint conversion from torch_dist to fsdp_dtensor. | ||
|
|
||
| ### Skills |
There was a problem hiding this comment.
we dont need this in discussions
eb15ccc to
485ac83
Compare
Two new skills documenting how to run Megatron-LM unit tests and how to launch distributed training on a SLURM cluster. run-unit-tests covers environment setup with `uv sync --extra training --extra dev`, the fact that all unit tests initialize a torch distributed group and must be launched through `torch.distributed.run`, single-test and marker-filtered invocations, the CI bucket runner, and common gotchas around the default pytest addopts and per-rank stdout interleaving. run-on-slurm covers a minimal sbatch skeleton, `MASTER_ADDR` / `MASTER_PORT` / `WORLD_SIZE` / `NODE_RANK` setup, the conditional rules for `CUDA_DEVICE_MAX_CONNECTIONS` (pre-Blackwell vs Blackwell, FSDP, `overlap_moe_expert_parallel_comm`), container conventions, `squeue`/`sacct`/`scancel`, and per-rank failure diagnosis. Both follow the existing skills/ convention: a directory per skill with a SKILL.md file and the standard frontmatter (name, description, TRIGGER when, DO NOT TRIGGER when).
485ac83 to
31f3634
Compare
The existing testsystem skill covers CI pipelines, recipe YAMLs, and test-failure investigation. With the new run-unit-tests skill landing alongside it, the unqualified name "testsystem" is easy to confuse with "system for running tests." Rename to ci-test-system, which matches the actual scope, and update every cross-reference: - skills/ci-test-system/SKILL.md: name field - skills/build-and-dependency/SKILL.md: DO NOT TRIGGER cross-reference - skills/run-unit-tests/SKILL.md: DO NOT TRIGGER cross-reference - skills/run-on-slurm/SKILL.md: DO NOT TRIGGER cross-reference The body of ci-test-system/SKILL.md does not reference its own name, so no further edits inside that file.
What does this PR do?
Adds two concise skill/guide documents to
docs/discussions/megatron-lm-skills/:Files
uv run pytestanduv run python -m torch.distributed.runfor multi-rank testsBoth documents use
uvfor dependency management and are written as generic Megatron-LM contributor guides.Checklist
Ref: NMFW-451
Related: #13 (article PR)