docs: add Megatron-LM skills for unit testing and SLURM execution by yashaswikarnati · Pull Request #14 · yashaswikarnati/Megatron-LM

yashaswikarnati · 2026-04-28T06:36:22Z

What does this PR do?

Adds two concise skill/guide documents to docs/discussions/megatron-lm-skills/:

Files

running-unit-tests-in-megatron-lm.md — How to run unit tests locally and in CI using uv run pytest and uv run python -m torch.distributed.run for multi-rank tests
running-on-slurm.md — Minimal sbatch skeleton for multi-node training, covering environment setup, containers, monitoring, and failure diagnosis

Both documents use uv for dependency management and are written as generic Megatron-LM contributor guides.
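As a rough illustration of the launch form the unit-test guide is built around, the sketch below assembles the argument list for one test target. The helper name `build_test_command` is hypothetical, and passing `-m pytest` to the launcher is an assumption about how the guide invokes pytest; only the `uv run python -m torch.distributed.run` prefix and the example test path come from this PR.

```python
import shlex
from typing import List


def build_test_command(test_target: str, nproc_per_node: int = 1) -> List[str]:
    """Return an argv list that launches one pytest target under torch.distributed.run.

    Hypothetical helper for illustration; it only builds the command, it does not run it.
    """
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return [
        "uv", "run", "python", "-m", "torch.distributed.run",
        "--nproc-per-node", str(nproc_per_node),
        # Assumption: the launcher's -m option runs pytest as a module on every rank.
        "-m", "pytest",
        test_target,
    ]


# Example test path taken from the review in this PR.
target = "tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor"
cmd = build_test_command(target, nproc_per_node=2)
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting concerns.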

Checklist

Documentation only
Unit tests
Functional tests

Ref: NMFW-451
Related: #13 (article PR)

yashaswikarnati

Review: Megatron-LM skills (unit tests + SLURM)

Verified both files against the live repo state. Bash-syntax-checked the sbatch template (passes bash -n). Tried uv run pytest --collect-only on the example test path — the path tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor resolves correctly (collection only fails outside the canonical CUDA container, which is expected and is the case the skill explicitly tells the reader about).

Issues that need fixing

1. running-unit-tests-in-megatron-lm.md: the -x claim is wrong.

The skill says:

The default pytest config in pyproject.toml includes -x (stop on first failure). Use --maxfail=N to see more failures per run.

Actual addopts in pyproject.toml (line 235): addopts = "--durations=15 -s -rA". There is no -x. The "stop on first failure" claim is incorrect, and the --maxfail=N workaround is unnecessary. Either delete this bullet or rewrite it to reflect the real defaults (-s disables capture, -rA shows summary for all outcomes, --durations=15 reports the 15 slowest tests).
2. The mlm extra is deprecated.

Both skills recommend uv sync --extra mlm --extra dev. pyproject.toml line 80 says explicitly:

'mlm' group is deprecated. please use 'training' instead

The training extra is defined immediately above it with the same dependency list. Update both skills to uv sync --extra training --extra dev (preserve the --extra lts alternative note).
3. running-on-slurm.md: CUDA_DEVICE_MAX_CONNECTIONS=1 is conditional, not universal.

The skill exports CUDA_DEVICE_MAX_CONNECTIONS=1 unconditionally, and the "Common pitfalls" section frames omitting it as a deadlock risk. The actual behavior in megatron/training/arguments.py (around lines 1290–1320):
- Required when TP > 1 or CP > 1 on pre-Blackwell (get_device_arch_version() < 10) — Megatron asserts and refuses to start, it does not deadlock.
- No longer required on Blackwell (per the comment at line 1295).
- Must NOT be 1 with FSDP — use_torch_fsdp2 and use_megatron_fsdp both assert != "1" (lines 938, 1040).
- Different value with overlap_moe_expert_parallel_comm — recommends 32, not 1.
Recommend rewording to: "Set CUDA_DEVICE_MAX_CONNECTIONS=1 when running TP>1 or CP>1 on pre-Blackwell hardware without FSDP. Megatron will assert at startup if it's required and missing. Do not set it with Torch-FSDP2 or Megatron-FSDP." The current "deadlock" framing is inaccurate.
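The four cases above can be sketched as a small decision function. This is an illustration only, not Megatron's code: `RunConfig`, `recommended_max_connections`, and the field names are hypothetical, `arch_major` stands in for the value returned by `get_device_arch_version()`, and the order in which the cases are checked (FSDP first, then MoE overlap, then TP/CP) is an assumption the review does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunConfig:
    """Hypothetical summary of the settings the review mentions."""
    tp: int = 1                        # tensor-parallel size
    cp: int = 1                        # context-parallel size
    arch_major: int = 9                # stand-in for get_device_arch_version(); 10+ means Blackwell
    uses_fsdp: bool = False            # Torch-FSDP2 or Megatron-FSDP
    overlap_moe_ep_comm: bool = False  # overlap_moe_expert_parallel_comm


def recommended_max_connections(cfg: RunConfig) -> Optional[str]:
    """Return a suggested CUDA_DEVICE_MAX_CONNECTIONS value, or None to leave it unset."""
    if cfg.uses_fsdp:
        # With FSDP the value must not be "1"; this sketch simply leaves it unset.
        return None
    if cfg.overlap_moe_ep_comm:
        # The review notes that 32 is recommended in this mode, not 1.
        return "32"
    if cfg.arch_major < 10 and (cfg.tp > 1 or cfg.cp > 1):
        # Pre-Blackwell with TP > 1 or CP > 1: Megatron asserts at startup if this is missing.
        return "1"
    # Blackwell, or no TP/CP: not required.
    return None


print(recommended_max_connections(RunConfig(tp=2)))
print(recommended_max_connections(RunConfig(tp=2, arch_major=10)))
print(recommended_max_connections(RunConfig(tp=2, uses_fsdp=True)))
print(recommended_max_connections(RunConfig(overlap_moe_ep_comm=True)))
```

Returning `None` for "leave unset" keeps the sbatch template free of an unconditional export, which is the behavior the review asks for.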

Smaller issues

4. Inconsistent flag style. running-unit-tests-in-megatron-lm.md uses --nproc-per-node (hyphen); running-on-slurm.md uses --nproc_per_node (underscore). torch.distributed.run accepts both, but pick one for consistency across the skill set.
5. Multi-rank command: the pyproject.toml addopts include -s. It is worth noting in the multi-rank section that -s (no capture) means rank 0's stdout interleaves with that of the other ranks; readers debugging multi-rank tests often want --capture=fd instead. Optional.
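One way to make interleaved output easier to read, separate from the --capture=fd suggestion, is to tag each line with the rank that produced it. The sketch below is not something either skill prescribes; `rank_prefix` and `tag_lines` are hypothetical helpers, and they rely on the `RANK` and `WORLD_SIZE` environment variables that the torch.distributed.run launcher sets for each worker.

```python
import os
from typing import Mapping


def rank_prefix(env: Mapping[str, str] = os.environ) -> str:
    """Build a '[rank i/n]' prefix from launcher-provided environment variables."""
    rank = env.get("RANK", "0")
    world_size = env.get("WORLD_SIZE", "1")
    return f"[rank {rank}/{world_size}]"


def tag_lines(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Prefix every line of text so interleaved output can be attributed to a rank."""
    prefix = rank_prefix(env)
    return "\n".join(f"{prefix} {line}" for line in text.splitlines())


# Simulated environment for the second of two ranks.
fake_env = {"RANK": "1", "WORLD_SIZE": "2"}
print(tag_lines("loading checkpoint\nstep 0 done", fake_env))
```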

What's correct

Markers internal, flaky, flaky_in_dev are real (declared at pyproject.toml:238-242).
--experimental flag is real (tests/unit_tests/conftest.py:22-25).
/opt/data reference matches tests/unit_tests/conftest.py:80.
tests/unit_tests/run_ci_test.sh exists.
pretrain_gpt.py exists at repo root.
docker/.ngc_version.dev and docker/.ngc_version.lts exist.
sbatch template parses with bash -n.
uv run python -m torch.distributed.run (not bare torchrun) is the right call and is used consistently.
No internal strings (cog, gitlab-master, coreai_dlalgo, /lustre/, cw-dfw, NMFW-, CLAUDE_CODE_EXPERIMENTAL).

Block on items 1, 2, 3. Items 4 and 5 are nice-to-haves.

yashaswikarnati

Follow-up review: all 4 fixes verified

Re-reviewed the latest diff. All four issues from the previous review are correctly addressed.

1. `-x` claim — FIXED ✓

running-unit-tests-in-megatron-lm.md now describes the actual addopts (--durations=15 -s -rA) and explains each flag accurately. No more spurious -x / --maxfail advice.
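For reference, the check behind this item can be sketched in a few lines. The addopts string is the one quoted in the first review; `explain_addopts` and `stops_on_first_failure` are hypothetical helpers, and the flag descriptions paraphrase the review rather than the pytest documentation.

```python
import shlex
from typing import Dict

# Meanings of the three flags, as described in the review.
FLAG_NOTES = {
    "--durations": "report the N slowest tests",
    "-s": "disable output capture",
    "-rA": "show a summary for all outcomes",
}


def explain_addopts(addopts: str) -> Dict[str, str]:
    """Map each token in an addopts string to a short description."""
    notes = {}
    for token in shlex.split(addopts):
        name = token.split("=", 1)[0]  # strip any '=value' suffix
        notes[token] = FLAG_NOTES.get(name, "not covered in this sketch")
    return notes


def stops_on_first_failure(addopts: str) -> bool:
    """True if the options contain -x or one of its equivalent spellings."""
    tokens = shlex.split(addopts)
    return any(t in ("-x", "--exitfirst", "--maxfail=1") for t in tokens)


addopts = "--durations=15 -s -rA"
for flag, note in explain_addopts(addopts).items():
    print(f"{flag}: {note}")
print("stops on first failure:", stops_on_first_failure(addopts))
```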

2. `mlm` extra deprecation — FIXED ✓

Both skills now recommend uv sync --extra training --extra dev. Re-ran the resolver against the worktree:

uv sync --extra training resolves cleanly.
uv sync --extra training --extra dev only fails at the nvidia-resiliency-ext build step outside the canonical NGC container, which is the expected behavior the skill already calls out under Prerequisites ("A working PyTorch + CUDA environment matching the repo's required versions").

3. `CUDA_DEVICE_MAX_CONNECTIONS` rework — FIXED ✓

Now correctly handled:

Removed the unconditional export CUDA_DEVICE_MAX_CONNECTIONS=1 from the sbatch template.
Added a dedicated section enumerating all four cases (pre-Blackwell+TP/CP non-FSDP → 1; Blackwell → no-op; FSDP → must NOT be 1; overlap_moe_expert_parallel_comm → 32).
"Common pitfalls" now says "the code asserts, it does not deadlock" — accurate.
In-template comment points at the dedicated section. Good UX.

4. Flag-style consistency — FIXED ✓

Both skills now use the hyphen form: --nproc-per-node, --node-rank, --master-addr, --master-port. The sbatch template matches running-unit-tests-in-megatron-lm.md consistently.
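A consistency pass like this can be automated with a small normalizer. The sketch below is a hypothetical helper, not part of either skill; it rewrites underscores to hyphens in long option names only and leaves option values untouched. It should be applied to launcher arguments only, since a training script may define flags of its own.

```python
from typing import List


def hyphenate_long_flags(argv: List[str]) -> List[str]:
    """Rewrite --snake_case option names to --kebab-case, leaving values as-is."""
    normalized = []
    for arg in argv:
        if arg.startswith("--"):
            # Split off an inline '=value' so underscores in the value are preserved.
            name, sep, value = arg.partition("=")
            normalized.append(name.replace("_", "-") + sep + value)
        else:
            normalized.append(arg)
    return normalized


launcher_args = ["--nproc_per_node=8", "--node_rank", "0", "--master_addr", "node_a"]
print(hyphenate_long_flags(launcher_args))
```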

Re-validation

bash -n on the updated sbatch template: passes.
All file references still resolve: pretrain_gpt.py, docker/.ngc_version.{dev,lts}, tests/unit_tests/run_ci_test.sh, tests/unit_tests/conftest.py.
No internal strings.

All blockers cleared. Approving — ready to take out of draft when you're ready.

yashaswikarnati · 2026-04-28T18:00:40Z

+
+## Run the full unit suite
+
+For tests that do not require multiple GPUs:


All tests require GPUs, so we don't need a section for pure-pytest ones.

yashaswikarnati · 2026-04-28T18:01:41Z

+
+## Common gotchas
+
+- Prefer `uv run python -m pytest` over `uv run pytest`. On some managed environments the pytest console script shebang can point to a stale venv path (typically after the venv is renamed or rebuilt), causing the command to fail with exit code 127. Using `python -m pytest` invokes pytest as a module through the current Python and avoids the issue. All examples in this guide use the `python -m pytest` form for that reason.


We don't need a bare uv run python -m pytest; all tests actually use torch.distributed.run.

yashaswikarnati · 2026-04-28T18:02:00Z


  A practical guide to enable Megatron-FSDP training, including a quick-start example for DeepSeek-V3, required and recommended configurations, and instructions for checkpoint conversion from torch_dist to fsdp_dtensor.

+### Skills


We don't need this in discussions.

Two new skills documenting how to run Megatron-LM unit tests and how to launch distributed training on a SLURM cluster. run-unit-tests covers environment setup with `uv sync --extra training --extra dev`, the fact that all unit tests initialize a torch distributed group and must be launched through `torch.distributed.run`, single-test and marker-filtered invocations, the CI bucket runner, and common gotchas around the default pytest addopts and per-rank stdout interleaving. run-on-slurm covers a minimal sbatch skeleton, `MASTER_ADDR` / `MASTER_PORT` / `WORLD_SIZE` / `NODE_RANK` setup, the conditional rules for `CUDA_DEVICE_MAX_CONNECTIONS` (pre-Blackwell vs Blackwell, FSDP, `overlap_moe_expert_parallel_comm`), container conventions, `squeue`/`sacct`/`scancel`, and per-rank failure diagnosis. Both follow the existing skills/ convention: a directory per skill with a SKILL.md file and the standard frontmatter (name, description, TRIGGER when, DO NOT TRIGGER when).
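To make the rendezvous setup concrete, here is a minimal sketch of deriving `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `NODE_RANK` from SLURM-provided variables. It is not the skill's sbatch template: `rendezvous_env` is a hypothetical helper, the port formula is an arbitrary convention chosen to reduce collisions between jobs, and `first_host` is assumed to come from something like `scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1`.

```python
from typing import Dict, Mapping


def rendezvous_env(
    slurm_env: Mapping[str, str], first_host: str, gpus_per_node: int
) -> Dict[str, str]:
    """Derive distributed-launch variables from a SLURM job environment."""
    nnodes = int(slurm_env.get("SLURM_NNODES", "1"))
    node_rank = int(slurm_env.get("SLURM_NODEID", "0"))
    job_id = int(slurm_env.get("SLURM_JOB_ID", "0"))
    # Assumption: derive a port in 20000-29999 from the job id so that
    # concurrent jobs sharing a node are less likely to pick the same one.
    master_port = 20000 + job_id % 10000
    return {
        "MASTER_ADDR": first_host,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(nnodes * gpus_per_node),
        "NODE_RANK": str(node_rank),
    }


# Simulated environment for the second node of a two-node job.
fake_slurm = {"SLURM_NNODES": "2", "SLURM_NODEID": "1", "SLURM_JOB_ID": "123456"}
env = rendezvous_env(fake_slurm, first_host="node-a", gpus_per_node=8)
for key, value in env.items():
    print(f"{key}={value}")
```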

The existing testsystem skill covers CI pipelines, recipe YAMLs, and test-failure investigation. With the new run-unit-tests skill landing alongside it, the unqualified name "testsystem" is easy to confuse with "system for running tests." Rename to ci-test-system, which matches the actual scope, and update every cross-reference:
- skills/ci-test-system/SKILL.md: name field
- skills/build-and-dependency/SKILL.md: DO NOT TRIGGER cross-reference
- skills/run-unit-tests/SKILL.md: DO NOT TRIGGER cross-reference
- skills/run-on-slurm/SKILL.md: DO NOT TRIGGER cross-reference

The body of ci-test-system/SKILL.md does not reference its own name, so no further edits inside that file.

yashaswikarnati force-pushed the docs-skills-NMFW-451-megatron-lm-skills branch from eb15ccc to 485ac83 Compare April 28, 2026 18:47

yashaswikarnati force-pushed the docs-skills-NMFW-451-megatron-lm-skills branch from 485ac83 to 31f3634 Compare April 28, 2026 22:29

yashaswikarnati wants to merge 2 commits into main from docs-skills-NMFW-451-megatron-lm-skills



