yashaswikarnati · yashaswikarnati · Apr 28, 2026 · Apr 29, 2026
diff --git a/skills/build-and-dependency/SKILL.md b/skills/build-and-dependency/SKILL.md
@@ -2,7 +2,7 @@
 name: build-and-dependency
 description: Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, updating uv.lock, and linting.
 TRIGGER when: user asks to add, remove, or update a dependency; user edits or asks about pyproject.toml or uv.lock; uv.lock has a merge conflict; user asks to set up a dev environment or pull/build the CI container; user hits a container build error or uv error; user asks to run linting or autoformat.
-DO NOT TRIGGER when: user is only running tests, investigating CI failures, or opening a PR (use testsystem instead).
+DO NOT TRIGGER when: user is only running tests, investigating CI failures, or opening a PR (use ci-test-system instead).
 ---
 
 # Build & Dependency Guide

diff --git a/skills/testsystem/SKILL.md b/skills/ci-test-system/SKILL.md
@@ -1,5 +1,5 @@
 ---
-name: testsystem
+name: ci-test-system
 description: Test system, CI pipeline, and CI failure investigation for Megatron-LM. Covers test layout, recipe YAML structure, adding unit and functional tests, CI scope labels, triggering internal GitLab CI, pipeline structure, and debugging CI failures.
 TRIGGER when: user asks to run tests, add a test, investigate a CI failure, understand the CI pipeline, or work with test recipes; user opens or pushes to a PR and needs to know which CI label to attach; user wants to trigger the internal GitLab CI pipeline; user asks to download golden values or references a pipeline/run ID in the context of golden values.
 DO NOT TRIGGER when: user is only setting up the dev environment or managing dependencies (use build-and-dependency instead).

diff --git a/skills/run-on-slurm/SKILL.md b/skills/run-on-slurm/SKILL.md
@@ -0,0 +1,115 @@
+---
+name: run-on-slurm
+description: How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
+TRIGGER when: user asks to submit a SLURM job for Megatron-LM, write or debug an sbatch script, configure multi-node distributed training, set MASTER_ADDR / MASTER_PORT / WORLD_SIZE, or diagnose a SLURM job failure.
+DO NOT TRIGGER when: user is running on a single GPU node without SLURM; user is asking about CI test infrastructure (use ci-test-system); user is asking about container builds or dependency management (use build-and-dependency).
+---
+
+# Run Megatron-LM on SLURM
+
+## Prerequisites
+
+- A SLURM cluster login with submission rights to a GPU partition.
+- Megatron-LM checked out on a filesystem visible to all nodes in the allocation (NFS, Lustre, or similar). All nodes must reach the same paths for code, data, checkpoints, and output.
+- `uv` installed. Run `uv sync --extra training --extra dev` (or `--extra lts` for the long-term-support stack) in the worktree once before submission so the `.venv` is materialized and visible to every node.
+
+## Minimal sbatch script
+
+Save as `run_megatron.slurm` in the worktree:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=megatron
+#SBATCH --account=<SLURM_ACCOUNT>
+#SBATCH --partition=<SLURM_PARTITION>
+#SBATCH --nodes=<NODES>
+#SBATCH --ntasks-per-node=1
+#SBATCH --gpus-per-node=<GPUS_PER_NODE>
+#SBATCH --time=<HH:MM:SS>
+#SBATCH --output=logs/%x-%j.out
+#SBATCH --error=logs/%x-%j.err
+
+set -euo pipefail
+cd <MEGATRON_WORKTREE>
+
+export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
+export MASTER_PORT=${MASTER_PORT:-29500}
+export NNODES=${SLURM_NNODES}
+export GPUS_PER_NODE=<GPUS_PER_NODE>
+export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
+
+# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
+# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
+# (non-FSDP):
+#   export CUDA_DEVICE_MAX_CONNECTIONS=1
+
+srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
+  # NODE_RANK comes from SLURM_NODEID with one task per node.
+  NODE_RANK=${SLURM_NODEID}
+  uv run python -m torch.distributed.run \
+    --nnodes='"${NNODES}"' \
+    --nproc-per-node='"${GPUS_PER_NODE}"' \
+    --node-rank=${NODE_RANK} \
+    --master-addr='"${MASTER_ADDR}"' \
+    --master-port='"${MASTER_PORT}"' \
+    pretrain_gpt.py \
+      <MEGATRON_ARGS>
+'
+```
+
+Submit:
+
+```bash
+mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
+echo "Submitted ${JOB_ID}"
+```
+
+## Multi-node rules
+
+- Submit from the worktree you intend to run, or `cd` to it in the script. All nodes must reach the same path on a shared filesystem (NFS, Lustre, or similar); node-local paths are not visible to ranks on other nodes.
+- Use one `torchrun` worker group across all nodes; do not start independent single-node jobs.
+- `--nproc-per-node` should equal the number of visible GPUs per node.
+- Write checkpoints, tensorboard data, and structured logs to shared storage.
+
+## CUDA_DEVICE_MAX_CONNECTIONS
+
+The right value depends on your hardware and parallelism mode. Do not export it unconditionally:
+
+- **Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP:** set to `1`. The relevant code path asserts on this, so anything other than `1` fails with an assertion error rather than a silent deadlock.
+- **Blackwell:** not required; setting it has no effect.
+- **Torch-FSDP2 or Megatron-FSDP:** must NOT be `1`. Leave the env var unset, or set it to a value greater than `1`.
+- **`overlap_moe_expert_parallel_comm` enabled:** set to `32`.
+
+Set it explicitly in the sbatch script when your configuration calls for it.
+
+## Containers
+
+Many sites run Megatron-LM inside a container (enroot/pyxis on some clusters, singularity on others). If you do, the uv-managed `.venv` must live on a path that is visible from inside the container, and the container image must provide the CUDA / NCCL / torch versions the repo expects (see `docker/.ngc_version.dev` and `docker/.ngc_version.lts`). The skeleton above stays the same; add your container runtime's flags (`--container-image=…`, `--container-mounts=…`, etc.) to the `srun` invocation.
+
+## Monitor and collect
+
+```bash
+squeue -j "$JOB_ID" -o "%.10i %.8T %.10M %.6D %R"
+sacct -j "$JOB_ID" --format=JobID,State,ExitCode,Elapsed
+scancel "$JOB_ID"
+```
+
+If your training script writes a result artifact (a JSON metrics file from rank 0, a final checkpoint, etc.), poll for the artifact rather than waiting only on `squeue` state. Useful output usually appears before SLURM marks the job complete, and polling on the artifact lets you cancel the job as soon as it lands instead of holding the allocation until the timeout.
+
+## Failure diagnosis
+
+Scan stderr from every rank, not just rank 0. The earliest non-NCCL Python traceback is usually the root cause; later NCCL timeouts on other ranks are downstream symptoms of the first crash.
+
+Classify quickly:
+
+- **OOM**: record rank, phase (forward / backward / optimizer), batch size, sequence length, parallelism (TP/DP/CP/PP), and peak memory before adjusting.
+- **Shape / divisibility error**: check `WORLD_SIZE = TP × DP × CP × PP` and head-count divisibility (`num_attention_heads % TP == 0`).
+- **Import error**: wrong worktree, missing `uv sync`, or stale `PYTHONPATH`. Confirm `cd <MEGATRON_WORKTREE>` before launch.
+- **NCCL failure** with no Python traceback: verify allocation, port reachability, `MASTER_ADDR` resolution, and command consistency across ranks.
+
+## Common pitfalls
+
+- Forgetting `uv sync` before the first submission. If the venv is missing, every job rebuilds it from inside `srun`, costing minutes per job.
+- Writing logs to a node-local path that disappears at job exit. Always write to the shared filesystem.
+- Setting `CUDA_DEVICE_MAX_CONNECTIONS=1` blindly. The right value depends on hardware and parallelism mode (see the dedicated section above). With Torch-FSDP2 or Megatron-FSDP it must not be `1`; on Blackwell it has no effect; on pre-Blackwell with TP>1 or CP>1 (non-FSDP) it is required, and any other value trips an assertion rather than a deadlock.
+- Running bare `torchrun` instead of `uv run python -m torch.distributed.run`. Bare `torchrun` may dispatch through a python interpreter that does not see venv packages, depending on how the venv is set up.
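
As a minimal sketch, the `CUDA_DEVICE_MAX_CONNECTIONS` rules from the run-on-slurm skill can be encoded in a small shell helper. The function name, its arguments, and the precedence between rules (MoE overlap first, then FSDP, then the pre-Blackwell TP/CP case) are assumptions made for illustration; the skill lists the rules without ordering them.

```bash
# Hypothetical helper, not part of Megatron-LM. Prints the value implied by
# the rules in the skill, or prints nothing when the variable should stay unset.
#   $1 arch         "blackwell", or anything else (treated as pre-Blackwell)
#   $2 tp           tensor-parallel size
#   $3 cp           context-parallel size
#   $4 fsdp         1 if Torch-FSDP2 or Megatron-FSDP is in use, else 0
#   $5 moe_overlap  1 if overlap_moe_expert_parallel_comm is enabled, else 0
cuda_max_connections() {
  local arch=$1 tp=$2 cp=$3 fsdp=$4 moe_overlap=$5
  if [ "$moe_overlap" -eq 1 ]; then
    echo 32          # assumed to take precedence over the other rules
  elif [ "$fsdp" -eq 1 ]; then
    :                # must not be 1, so leave the variable unset
  elif [ "$arch" != "blackwell" ] && { [ "$tp" -gt 1 ] || [ "$cp" -gt 1 ]; }; then
    echo 1           # pre-Blackwell with TP>1 or CP>1, non-FSDP
  fi
}

# Example: Hopper, TP=2, CP=1, no FSDP, no MoE overlap.
value=$(cuda_max_connections hopper 2 1 0 0)
if [ -n "$value" ]; then
  export CUDA_DEVICE_MAX_CONNECTIONS="$value"
fi
```

Calling the helper from the sbatch script before `srun` keeps the export conditional, which is the point of the pitfall above. If your configuration combines several of these modes, check the relevant Megatron-LM assertions instead of relying on the assumed ordering.
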
diff --git a/skills/run-unit-tests/SKILL.md b/skills/run-unit-tests/SKILL.md
@@ -0,0 +1,82 @@
+---
+name: run-unit-tests
+description: How to run Megatron-LM unit tests on a GPU node. Covers environment setup with uv, launching tests through torch.distributed.run, marker filters, CI parity, and common gotchas. All Megatron-LM unit tests initialize a torch distributed group, so every invocation requires GPU access and is launched through torch.distributed.run.
+TRIGGER when: user asks to run unit tests, debug a unit test failure, reproduce a CI test failure locally, set up the unit-test environment, or invoke pytest on Megatron-LM.
+DO NOT TRIGGER when: user is asking about functional or end-to-end tests, CI infrastructure, or recipe files (use ci-test-system instead); user is asking about container or dependency setup (use build-and-dependency instead).
+---
+
+# Run Megatron-LM Unit Tests
+
+## Prerequisites
+
+- Megatron-LM checked out; `pyproject.toml` and `uv.lock` present at the repo root.
+- `uv` installed (https://docs.astral.sh/uv/).
+- A working PyTorch + CUDA environment matching the repo's required versions. The supported NGC PyTorch base images are pinned in `docker/.ngc_version.dev` (development stack) and `docker/.ngc_version.lts` (long-term-support stack).
+- A node with one or more visible NVIDIA GPUs. All Megatron-LM unit tests initialize a torch distributed group, so every invocation requires GPU access and is launched through `torch.distributed.run`.
+
+## Set up the environment
+
+From the repo root:
+
+```bash
+uv sync --extra training --extra dev
+```
+
+Use `--extra lts` for the long-term-support image stack. This materializes a `.venv` with the Megatron-LM dependencies. All commands below run through `uv run`, which executes inside that venv.
+
+## Run the unit suite
+
+```bash
+uv run python -m torch.distributed.run --nproc-per-node N -m pytest -q tests/unit_tests
+```
+
+Replace `N` with the GPU count your tests need. Use `8` (or whatever your node provides) for distributed-feature tests; `1` is enough for tests that do not fan out beyond rank 0 but still require an initialized process group.
+
+## Run a specific test
+
+Single file:
+
+```bash
+uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
+  tests/unit_tests/models/test_gpt_model.py
+```
+
+Single test:
+
+```bash
+uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
+  tests/unit_tests/models/test_gpt_model.py::TestGPTModel::test_constructor
+```
+
+Filter by name substring:
+
+```bash
+uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
+  tests/unit_tests -k optimizer
+```
+
+## Marker filters
+
+Megatron-LM uses pytest markers `internal`, `flaky`, and `flaky_in_dev` (declared in `pyproject.toml`). To exclude unstable tests during development:
+
+```bash
+uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
+  tests/unit_tests -m "not flaky and not flaky_in_dev"
+```
+
+Experimental tests are gated behind a flag:
+
+```bash
+uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q \
+  tests/unit_tests --experimental
+```
+
+## CI parity
+
+The CI bucket runner is `tests/unit_tests/run_ci_test.sh`. It expands buckets, applies marker filters by environment, and writes coverage. Use it to reproduce a CI bucket failure locally; otherwise prefer the direct `torch.distributed.run` invocations above.
+
+## Common gotchas
+
+- The default pytest config in `pyproject.toml` sets `addopts = --durations=15 -s -rA`. That means stdout is not captured (`-s`), the slowest 15 tests are reported, and a short summary of all outcomes is printed at the end. Override with explicit pytest flags if you need different defaults.
+- Because `addopts` includes `-s`, stdout from every rank interleaves in the same terminal during multi-rank runs. When debugging a specific rank, override with `--capture=fd` so each process captures its own stdout.
+- `tests/unit_tests/conftest.py` looks for test data under `/opt/data` and attempts a download if it is missing. If you are running outside the canonical container, supply the data manually or skip data-dependent tests.
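
The invocations in the run-unit-tests skill share one prefix, so a thin wrapper can cut down on retyping and make the `--capture=fd` override from the gotchas list easy to toggle. This is a sketch only: the function name and the `NPROC`, `CAPTURE`, and `DRY_RUN` variables are hypothetical, not something the repository provides.

```bash
# Hypothetical wrapper around the launch command shown in the skill.
#   NPROC    GPUs to use (defaults to 8, matching the examples)
#   CAPTURE  if set, passed as --capture=<value> to override the -s in addopts
#   DRY_RUN  if 1, print the command instead of executing it
run_unit_tests() {
  local nproc=${NPROC:-8}
  if [ -n "${CAPTURE:-}" ]; then
    # Prepend the capture flag to the pytest arguments supplied by the caller.
    set -- "--capture=${CAPTURE}" "$@"
  fi
  set -- uv run python -m torch.distributed.run \
    --nproc-per-node "$nproc" -m pytest -q "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Dry run: prints the assembled command without launching anything.
DRY_RUN=1 CAPTURE=fd run_unit_tests tests/unit_tests -k optimizer
```

Drop `DRY_RUN=1` to execute the command. Everything after the function name is forwarded to pytest unchanged, so the marker filters and `--experimental` flag from the skill work as written.
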