Skip to content

docs: add Megatron-LM skills for unit testing and SLURM execution#14

Draft
yashaswikarnati wants to merge 2 commits into
mainfrom
docs-skills-NMFW-451-megatron-lm-skills
Draft

docs: add Megatron-LM skills for unit testing and SLURM execution#14
yashaswikarnati wants to merge 2 commits into
mainfrom
docs-skills-NMFW-451-megatron-lm-skills

Conversation

@yashaswikarnati
Copy link
Copy Markdown
Owner

What does this PR do?

Adds two concise skill/guide documents to docs/discussions/megatron-lm-skills/:

Files

  • running-unit-tests-in-megatron-lm.md — How to run unit tests locally and in CI using uv run pytest and uv run python -m torch.distributed.run for multi-rank tests
  • running-on-slurm.md — Minimal sbatch skeleton for multi-node training, covering environment setup, containers, monitoring, and failure diagnosis

Both documents use uv for dependency management and are written as generic Megatron-LM contributor guides.

Checklist

  • Documentation only
  • Unit tests
  • Functional tests

Ref: NMFW-451
Related: #13 (article PR)

Copy link
Copy Markdown
Owner Author

@yashaswikarnati yashaswikarnati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Megatron-LM skills (unit tests + SLURM)

Verified both files against the live repo state. Bash-syntax-checked the sbatch template (passes bash -n). Tried uv run pytest --collect-only on the example test path — the path tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor resolves correctly (collection only fails outside the canonical CUDA container, which is expected and is the case the skill explicitly tells the reader about).

Issues that need fixing

  1. running-unit-tests-in-megatron-lm.md: -x claim is wrong.

    The skill says:

    The default pytest config in pyproject.toml includes -x (stop on first failure). Use --maxfail=N to see more failures per run.

    Actual addopts in pyproject.toml (line 235): addopts = "--durations=15 -s -rA". There is no -x. The "stop on first failure" claim is incorrect, and the --maxfail=N workaround is unnecessary. Either delete this bullet or rewrite it to reflect the real defaults (-s disables capture, -rA shows summary for all outcomes, --durations=15 reports the 15 slowest tests).

  2. mlm extra is deprecated.

    Both skills recommend uv sync --extra mlm --extra dev. pyproject.toml line 80 says explicitly:

    'mlm' group is deprecated. please use 'training' instead

    The training extra is defined immediately above it with the same dependency list. Update both skills to uv sync --extra training --extra dev (preserve the --extra lts alternative note).

  3. running-on-slurm.md: CUDA_DEVICE_MAX_CONNECTIONS=1 is conditional, not universal.

    The skill exports CUDA_DEVICE_MAX_CONNECTIONS=1 unconditionally and the "Common pitfalls" section frames omitting it as a deadlock risk. The actual behavior in megatron/training/arguments.py (around line 1290–1320):

    • Required when TP > 1 or CP > 1 on pre-Blackwell (get_device_arch_version() < 10) — Megatron asserts and refuses to start, it does not deadlock.
    • No longer required on Blackwell (per the comment at line 1295).
    • Must NOT be 1 with FSDPuse_torch_fsdp2 and use_megatron_fsdp both assert != "1" (lines 938, 1040).
    • Different value with overlap_moe_expert_parallel_comm — recommends 32, not 1.

    Recommend rewording to: "Set CUDA_DEVICE_MAX_CONNECTIONS=1 when running TP>1 or CP>1 on pre-Blackwell hardware without FSDP. Megatron will assert at startup if it's required and missing. Do not set it with Torch-FSDP2 or Megatron-FSDP." The current "deadlock" framing is inaccurate.

Smaller issues

  1. Inconsistent flag style. running-unit-tests-in-megatron-lm.md uses --nproc-per-node (hyphen); running-on-slurm.md uses --nproc_per_node (underscore). torch.distributed.run accepts both, but pick one for consistency across the skill set.

  2. Multi-rank command: pyproject.toml addopts includes -s. Worth noting in the multi-rank section that -s (no capture) means rank 0's stdout interleaves with other ranks' — readers debugging multi-rank tests often want --capture=fd instead. Optional.

What's correct

  • Markers internal, flaky, flaky_in_dev are real (declared at pyproject.toml:238-242).
  • --experimental flag is real (tests/unit_tests/conftest.py:22-25).
  • /opt/data reference matches tests/unit_tests/conftest.py:80.
  • tests/unit_tests/run_ci_test.sh exists.
  • pretrain_gpt.py exists at repo root.
  • docker/.ngc_version.dev and docker/.ngc_version.lts exist.
  • sbatch template parses with bash -n.
  • uv run python -m torch.distributed.run (not bare torchrun) is the right call and is used consistently.
  • No internal strings (cog, gitlab-master, coreai_dlalgo, /lustre/, cw-dfw, NMFW-, CLAUDE_CODE_EXPERIMENTAL).

Block on items 1, 2, 3. Items 4 and 5 are nice-to-haves.

Copy link
Copy Markdown
Owner Author

@yashaswikarnati yashaswikarnati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review: all 4 fixes verified

Re-reviewed the latest diff. All four issues from the previous review are correctly addressed.

1. -x claim — FIXED ✓

running-unit-tests-in-megatron-lm.md now describes the actual addopts (--durations=15 -s -rA) and explains each flag accurately. No more spurious -x / --maxfail advice.

2. mlm extra deprecation — FIXED ✓

Both skills now recommend uv sync --extra training --extra dev. Re-ran the resolver against the worktree:

  • uv sync --extra training resolves cleanly.
  • uv sync --extra training --extra dev only fails at the nvidia-resiliency-ext build step outside the canonical NGC container, which is the expected behavior the skill already calls out under Prerequisites ("A working PyTorch + CUDA environment matching the repo's required versions").

3. CUDA_DEVICE_MAX_CONNECTIONS rework — FIXED ✓

Now correctly handled:

  • Removed the unconditional export CUDA_DEVICE_MAX_CONNECTIONS=1 from the sbatch template.
  • Added a dedicated section enumerating all four cases (pre-Blackwell+TP/CP non-FSDP → 1; Blackwell → no-op; FSDP → must NOT be 1; overlap_moe_expert_parallel_comm → 32).
  • "Common pitfalls" now says "the code asserts, it does not deadlock" — accurate.
  • In-template comment points at the dedicated section. Good UX.

4. Flag-style consistency — FIXED ✓

Both skills now use the hyphen form: --nproc-per-node, --node-rank, --master-addr, --master-port. The sbatch template matches running-unit-tests-in-megatron-lm.md consistently.

Re-validation

  • bash -n on the updated sbatch template: passes.
  • All file references still resolve: pretrain_gpt.py, docker/.ngc_version.{dev,lts}, tests/unit_tests/run_ci_test.sh, tests/unit_tests/conftest.py.
  • No internal strings.

All blockers cleared. Approving — ready to take out of draft when you're ready.


## Run the full unit suite

For tests that do not require multiple GPUs:
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all tests require GPUs , we dont need just pure pytest ones


## Common gotchas

- Prefer `uv run python -m pytest` over `uv run pytest`. On some managed environments the pytest console script shebang can point to a stale venv path (typically after the venv is renamed or rebuilt), causing the command to fail with exit code 127. Using `python -m pytest` invokes pytest as a module through the current Python and avoids the issue. All examples in this guide use the `python -m pytest` form for that reason.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need just uv run python -m pytest, all tests actually use torch.distrributerd.run

Comment thread docs/discussions/README.md Outdated

A practical guide to enable Megatron-FSDP training, including a quick-start example for DeepSeek-V3, required and recommended configurations, and instructions for checkpoint conversion from torch_dist to fsdp_dtensor.

### Skills
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need this in discussions

@yashaswikarnati yashaswikarnati force-pushed the docs-skills-NMFW-451-megatron-lm-skills branch from eb15ccc to 485ac83 Compare April 28, 2026 18:47
Two new skills documenting how to run Megatron-LM unit tests and how to
launch distributed training on a SLURM cluster.

run-unit-tests covers environment setup with `uv sync --extra training
--extra dev`, the fact that all unit tests initialize a torch
distributed group and must be launched through `torch.distributed.run`,
single-test and marker-filtered invocations, the CI bucket runner, and
common gotchas around the default pytest addopts and per-rank stdout
interleaving.

run-on-slurm covers a minimal sbatch skeleton, `MASTER_ADDR` /
`MASTER_PORT` / `WORLD_SIZE` / `NODE_RANK` setup, the conditional rules
for `CUDA_DEVICE_MAX_CONNECTIONS` (pre-Blackwell vs Blackwell, FSDP,
`overlap_moe_expert_parallel_comm`), container conventions,
`squeue`/`sacct`/`scancel`, and per-rank failure diagnosis.

Both follow the existing skills/ convention: a directory per skill with
a SKILL.md file and the standard frontmatter (name, description,
TRIGGER when, DO NOT TRIGGER when).
@yashaswikarnati yashaswikarnati force-pushed the docs-skills-NMFW-451-megatron-lm-skills branch from 485ac83 to 31f3634 Compare April 28, 2026 22:29
The existing testsystem skill covers CI pipelines, recipe YAMLs, and
test-failure investigation. With the new run-unit-tests skill landing
alongside it, the unqualified name "testsystem" is easy to confuse
with "system for running tests." Rename to ci-test-system, which
matches the actual scope, and update every cross-reference:

- skills/ci-test-system/SKILL.md: name field
- skills/build-and-dependency/SKILL.md: DO NOT TRIGGER cross-reference
- skills/run-unit-tests/SKILL.md: DO NOT TRIGGER cross-reference
- skills/run-on-slurm/SKILL.md: DO NOT TRIGGER cross-reference

The body of ci-test-system/SKILL.md does not reference its own name,
so no further edits inside that file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant