Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions runners/launch_mi300x-amds.sh
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,9 @@

set -x

# Pin to the known-good mi300x nodes; others are unavailable:
# chi-mi300x-033, chi-mi300x-037: down (Not responding)
# chi-mi300x-049: drained (persistent /nvme_home disk-full)
JOB_ID=$(salloc --partition=$PARTITION --nodelist=chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
# Exclude known-bad nodes; let Slurm pick from anything else:
# chi-mi300x-049: drained (persistent /nvme_home disk-full)
JOB_ID=$(salloc --partition=$PARTITION --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')

Check warning on line 14 in runners/launch_mi300x-amds.sh

View check run for this annotation

Claude / Claude Code Review

Stale KLAUD_DEBUG.md §5.3 reference to old --nodelist pin

Documentation drift: KLAUD_DEBUG.md §5.3 (line 121) still describes `runners/launch_mi300x-amds.sh` as pinning salloc to `chi-mi300x-[034-036,054,057-058]` via `--nodelist`, but after this PR the script uses `--exclude=chi-mi300x-049` only. Consider updating that paragraph in the same PR so the operator playbook matches the new exclude-only strategy.
Comment on lines +12 to +14
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Documentation drift: KLAUD_DEBUG.md §5.3 (line 121) still describes runners/launch_mi300x-amds.sh as pinning salloc to chi-mi300x-[034-036,054,057-058] via --nodelist, but after this PR the script uses --exclude=chi-mi300x-049 only. Consider updating that paragraph in the same PR so the operator playbook matches the new exclude-only strategy.

Extended reasoning...

What the drift is

KLAUD_DEBUG.md §5.3 ("chi-mi300x-049/nvme_home disk-full") at line 121 currently reads:

Fix already landed: runners/launch_mi300x-amds.sh now pins salloc to only known-good mi300x nodes (chi-mi300x-[034-036,054,057-058]) — see PR #1462.

This PR replaces the --nodelist=chi-mi300x-[034-036,054,057-058]... pin with --exclude=chi-mi300x-049.... Two specific claims in §5.3 become factually wrong on merge:

  1. The script no longer "pins salloc to only known-good mi300x nodes" — it allows Slurm to pick any node except -049.
  2. The enumerated set chi-mi300x-[034-036,054,057-058] is no longer the allowed set; per the PR description, -033/-035/-037 may rejoin at any time and would be eligible.

Why this matters

The file's own intro (lines 3–5) says it is a "running playbook" that should be read first when debugging, with an explicit instruction: "When you fix something not yet listed, add it here so the next session doesn't re-learn it." So §5.3 is a maintained operator-facing statement of current state, not historical context — the present-tense "now pins" framing is a live claim. The PR is itself a one-line behavior change motivated by exactly the -049 issue documented in §5.3, so updating that paragraph in this same PR keeps the playbook coherent.

Step-by-step proof

  1. Read KLAUD_DEBUG.md line 121: it says launch_mi300x-amds.sh "now pins salloc to only known-good mi300x nodes (chi-mi300x-[034-036,054,057-058])".
  2. Read the post-PR contents of runners/launch_mi300x-amds.sh line 14: the salloc invocation uses --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com and no --nodelist flag at all.
  3. Therefore, after merge, the playbook's claim about pinning to that specific set is false: the script does not pin to any set, and -033/-035/-037 (currently down*) would be allocated automatically if they come back up — which the PR description explicitly anticipates.

How to fix

A one-line edit in the same PR. Suggested rewording for §5.3:

Fix already landed: runners/launch_mi300x-amds.sh now excludes chi-mi300x-049 from salloc and lets Slurm pick any other healthy mi300x node — see PR #1462 (original pin) and PR #1532 (switch to exclude-only). chi-mi300x-049 is held in State=DOWN by a watchdog on the controller (/home/gharunner/_audit/drain_049_watchdog.sh)…

The watchdog/drain explanation in the rest of the paragraph remains accurate and can stay as-is.

Severity

nit — internal-debug-doc drift, not a runtime defect. Non-blocking, but a near-zero-cost fix that keeps the playbook honest about the very script this PR is changing.


if [ -z "$JOB_ID" ]; then
echo "ERROR: salloc failed to allocate a job"
Expand Down