feat(power): multinode measured-power aggregation by arygupt · Pull Request #1574 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-27T19:23:25Z

Summary

Extends single-node measured-power aggregation (#1558) to multinode srt-slurm benchmarks. Wires per-node perf_samples_<host>.csv from srt-slurm's PR #35 perfmon through the launcher into process_result.py → aggregate_power.py, which now namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collapse across nodes.

Backward compatible: aggregate_power() accepts both Path and Iterable[Path]; single-CSV callers (single-node start_gpu_monitor path) are unchanged. csv_path param name preserved.

Pipeline

srt-slurm perfmon (PR #35 on NVIDIA/srt-slurm, layered on
  SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
  → perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
  → launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
  → process_result.py expands glob → aggregate_power.run() with list
  → aggregate_power.py emits cluster-wide avg_power_w + joules_per_*_token
  → InferenceX-app ETL auto-captures (no schema change)

Files

utils/aggregate_power.py — csv_path widened to Path | Iterable[Path]. Per-source GPU-id namespacing only kicks in for 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (mutually exclusive with --csv).
utils/process_result.py — bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set.
runners/launch_gb300-cw.sh — point dynamo-sglang at our srt-slurm fork, append monitoring: block to each recipe post-copy (idempotent), write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job.
utils/test_aggregate_power.py — 8 new multinode cases: per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E with list.
utils/test_process_result.py — 3 new cases: glob aggregation, precedence over single CSV, empty-match falls through.
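
The per-source namespacing described in the file list can be sketched as follows. This is a minimal illustration, not the code in utils/aggregate_power.py: it assumes each CSV has a `gpu` column (the real column detection is not shown in this PR description), and the helper names `_as_paths` and `namespaced_gpu_ids` are hypothetical.

```python
from __future__ import annotations

import csv
from pathlib import Path
from typing import Iterable, Union


def _as_paths(csv_path: Union[Path, str, Iterable[Path]]) -> list[Path]:
    """Normalize one path or an iterable of paths to a list (hypothetical helper)."""
    if isinstance(csv_path, (str, Path)):
        return [Path(csv_path)]
    return [Path(p) for p in csv_path]


def namespaced_gpu_ids(csv_path: Union[Path, str, Iterable[Path]]) -> set[str]:
    """Collect distinct GPU ids, prefixing the CSV stem only when 2+ sources are given."""
    paths = _as_paths(csv_path)
    multi = len(paths) > 1
    seen: set[str] = set()
    for path in paths:
        if not path.is_file():
            continue  # a missing CSV is skipped rather than treated as an error
        with path.open(newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                gpu = row["gpu"].strip()  # assumed column name
                # Single source: keep the bare local index so num_gpus is unchanged.
                seen.add(f"{path.stem}:{gpu}" if multi else gpu)
    return seen
```

With two nodes that each report local indices 0 and 1, the namespaced set has four entries instead of two, which is the under-count the PR is guarding against.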

Test plan

36/36 aggregator tests pass (28 existing + 8 new)
28/28 process_result tests pass (25 existing + 3 new)
nvidia-smi inside sglang container on real gb300-cw emits expected columns (timestamp, gpu, power_w) — verified manually with srun --container-image=...sglang...sqsh nvidia-smi --query-gpu=...
First E2E multinode sweep produces avg_power_w + joules_per_*_token in the agg JSON (pending perf-changelog.yaml entry + sweep-enabled label)
num_gpus in agg JSON matches prefill_gpus + decode_gpus from launcher (validates per-source namespacing — without the fix, num_gpus would equal a single node's gpus_per_node)
Chart at inferencex.semianalysis.com renders the new data via ?unofficialrun=<run_id>

Depends on

SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon (pinned by the launcher). Tracks NVIDIA/srt-slurm PR #35 head; will rebase to upstream main once #35 merges.

Note

Medium Risk
Changes how agg JSON power metrics are computed for multinode runs (GPU count and time-window inference); failures are skipped gracefully but wrong math would skew dashboard energy data until caught by tests/smoke sweep.

Overview
Extends measured-power aggregation from single-node GPU metrics to multinode srt-slurm runs by wiring per-node perf_samples_*.csv (perfmon PR #35) through the GB300-CW launcher into result processing.

aggregate_power.py now accepts one or many CSVs (--csv-glob), namespaces GPU IDs by per-node CSV stem so repeated local indices 0..N-1 across nodes do not under-count total GPUs, and adds joules_per_total_token. For multinode bench JSONs without benchmark_*_time_unix, it derives the load window from date + duration or file mtime.

process_result.py honors GPU_METRICS_CSV_GLOB (glob wins over single GPU_METRICS_CSV); aggregation remains best-effort and never fails the upload.

launch_gb300-cw.sh pins SemiAnalysisAI/srt-slurm feat/inferencex-perfmon, idempotently appends monitoring: to overlaid recipes, and exports the perf CSV glob to $GITHUB_ENV after the job. perf-changelog.yaml adds a smoke entry to trigger the first multinode sweep with power fields on agg JSONs.

Reviewed by Cursor Bugbot for commit 8d30341. Bugbot is set up for automated code reviews on this repo.

…regation Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1574. The sweep produces the first multinode agg JSONs with avg_power_w + joules_per_*_token, validating the per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge end-to-end on real GB300 hardware (gb300-cw cluster).

github-actions · 2026-05-27T19:35:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26534043069
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26534043069

claude · 2026-05-27T19:38:03Z

+    _csv_arg = None
+    _glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
+    if _glob_pattern:
+        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
+        if _matched:
+            _csv_arg = _matched
+        else:
+            print(
+                f'[process_result] GPU_METRICS_CSV_GLOB={_glob_pattern!r} matched no files',
+                file=sys.stderr,
+            )
+
+    if _csv_arg is None:
+        # Single-node path: gpu_metrics.csv written by start_gpu_monitor in the
+        # bench container.
+        _csv_candidates = [
+            os.environ.get('GPU_METRICS_CSV'),
+            'gpu_metrics.csv',
+            '/workspace/gpu_metrics.csv',
+        ]
+        _csv_arg = next(
+            (Path(p) for p in _csv_candidates if p and Path(p).is_file()),
+            None,
+        )


🔴 When GPU_METRICS_CSV_GLOB is set but matches no files, _csv_arg stays None and the code falls through to the single-CSV candidate list (GPU_METRICS_CSV, gpu_metrics.csv, /workspace/gpu_metrics.csv) — contradicting the comment at lines 145-148 that the glob 'Takes precedence over the single-CSV fallback'. On a persistent self-hosted runner with a stale /workspace/gpu_metrics.csv from a prior single-node run (or a leaked GPU_METRICS_CSV env var), a multinode run whose perfmon failed on every node would silently patch wrong single-node avg_power_w / joules_per_*_token values into the multinode agg JSON. Fix: when _glob_pattern is truthy, skip the single-CSV fallback regardless of whether the glob matched anything.

Extended reasoning...

The contract violation

The block at utils/process_result.py:142-159 documents the precedence contract clearly:

Takes precedence over the single-CSV fallback — if the launcher set the glob, the run was multinode and there is no single-CSV fallback to make.

But the implementation only honors that contract when the glob actually matches files. On empty match:

    _csv_arg = None
    _glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            print(..., file=sys.stderr)  # warns but doesn't prevent fallthrough
    if _csv_arg is None:  # still None, so this falls into the single-CSV branch
        _csv_candidates = [
            os.environ.get('GPU_METRICS_CSV'),
            'gpu_metrics.csv',
            '/workspace/gpu_metrics.csv',
        ]
        ...

The else branch just logs; _csv_arg stays None, and the next if _csv_arg is None block consults the single-CSV candidates.

Step-by-step proof on a persistent self-hosted runner

Single-node run on gb300-cw_N completes successfully. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV=/workspace/gpu_metrics.csv (it lives in gpu_metrics.csv in cwd too). The file is left behind because the runner is persistent across jobs.

Next job is a multinode dynamo-sglang sweep. runners/launch_gb300-cw.sh (lines 297-318) writes GPU_METRICS_CSV_GLOB=$LOGS_DIR/perf_samples_*.csv to $GITHUB_ENV, but only when perf_csv_count > 0. Suppose perfmon failed to start on every node (srt-slurm PR #35 had startup issues, host driver mismatch, etc.): perf_csv_count would be 0 and the glob env var would not be written. Fine, that path is safe.

However, suppose perfmon CSVs were written at the end of the job (so the launcher writes the GLOB), but a downstream cleanup hook between launcher and process_result.py removed them, OR srt-slurm wrote the CSVs to a different path on a subsequent retry, OR a persistent env var (GPU_METRICS_CSV_GLOB from a prior job) leaks in. The glob expansion in process_result.py returns empty.

process_result.py enters the else branch on line 155, prints a warning, and falls through. os.environ.get('GPU_METRICS_CSV') from the prior single-node job returns /workspace/gpu_metrics.csv (or gpu_metrics.csv in cwd is still there). Path(p).is_file() is True. _csv_arg = Path('/workspace/gpu_metrics.csv').

_aggregate_power_run is called with the stale single-node CSV.

Why the bench-window timestamp filter doesn't always save us

One verifier argued the start_unix <= ts <= end_unix filter at aggregate_power.py:177-178 would reject stale samples. That's true if the window comes from explicit Unix timestamps. But this PR adds two new fallback tiers in _load_bench_window:

Tier 2: date field parsed as a UTC string (YYYYMMDD-HHMMSS).

Tier 3: bench_result_path.stat().st_mtime — the bench JSON's own mtime, which is the current run's mtime, used as bench-end with start = end - duration.

The mtime tier is exactly the danger zone: on a persistent runner the bench JSON is freshly written, so its mtime is now. If the stale gpu_metrics.csv was also written recently (within the derived [mtime - duration, mtime] window — possible if the prior single-node run finished a few minutes ago), its samples do fall inside the window. Result: silent wrong avg_power_w and joules_per_*_token patched into the multinode agg JSON, which InferenceX-app's ETL auto-captures into the dashboard.
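
To make the three tiers concrete, here is a minimal sketch of a tiered window lookup. It is not the PR's `_load_bench_window`: the key names `benchmark_end_time_unix` and `duration` are assumptions, and treating `date` as the window start is also an assumption (the source only says the window is derived from date + duration).

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Tuple


def load_bench_window(bench_result_path: Path) -> Optional[Tuple[float, float]]:
    """Return (start_unix, end_unix) for the bench run, or None if underivable."""
    try:
        bench = json.loads(bench_result_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None

    # Tier 1: explicit Unix timestamps (end key name is assumed).
    start = bench.get("benchmark_start_time_unix")
    end = bench.get("benchmark_end_time_unix")
    if start is not None and end is not None:
        return float(start), float(end)

    duration = bench.get("duration")  # assumed key name, in seconds
    if duration is None:
        return None
    duration = float(duration)

    # Tier 2: `date` parsed as a UTC string, YYYYMMDD-HHMMSS.
    # Assumption: the date marks the start of the load window.
    try:
        parsed = datetime.strptime(bench.get("date"), "%Y%m%d-%H%M%S")
    except (TypeError, ValueError):
        parsed = None
    if parsed is not None:
        start_unix = parsed.replace(tzinfo=timezone.utc).timestamp()
        return start_unix, start_unix + duration

    # Tier 3: the bench JSON's own mtime as bench-end, start = end - duration.
    # This window is always "now"-relative, so recent stale samples can fall inside it.
    end_unix = bench_result_path.stat().st_mtime
    return end_unix - duration, end_unix
```

Only tier 1 ties the window to timestamps recorded by the benchmark itself. Tier 3 in particular accepts any sample written within `duration` seconds of the JSON being saved, which is why a recently written stale CSV can pass the filter.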

What the test misses

The accompanying test test_multinode_csv_glob_empty_match_falls_through_silently only asserts the no-stale-file case (asserts 'avg_power_w' not in patched). It does not stage a stale fallback CSV, so it can't catch the precedence violation. test_multinode_csv_glob_takes_precedence_over_single_csv only tests precedence when the glob matches.

Fix

One-line change in the empty-match branch:

    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            _csv_arg = []  # sentinel: glob attempted, fallback forbidden
            print(...)
    if not _csv_arg:  # treats [] same as None for the downstream check, but…
        if _glob_pattern:
            pass  # …skip single-CSV candidates when glob was attempted
        else:
            _csv_candidates = [...]
            _csv_arg = next(...)

Or more cleanly: guard the single-CSV block on not _glob_pattern instead of _csv_arg is None.
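
A minimal sketch of that cleaner variant, pulled into a standalone function so it can be tested in isolation. The function name `resolve_power_csvs` and its `env` / `fallbacks` parameters are hypothetical; process_result.py does this inline against `os.environ`.

```python
from __future__ import annotations

import glob
import sys
from pathlib import Path
from typing import List, Mapping, Sequence, Union


def resolve_power_csvs(
    env: Mapping[str, str],
    fallbacks: Sequence[str] = ("gpu_metrics.csv", "/workspace/gpu_metrics.csv"),
) -> Union[Path, List[Path], None]:
    """Pick the CSV source(s) for power aggregation.

    If GPU_METRICS_CSV_GLOB is set, the run is treated as multinode and the
    single-CSV candidates are never consulted, even when the glob is empty.
    """
    pattern = env.get("GPU_METRICS_CSV_GLOB")
    if pattern:
        matched = sorted(Path(p) for p in glob.glob(pattern))
        if not matched:
            print(
                f"[process_result] GPU_METRICS_CSV_GLOB={pattern!r} matched no files",
                file=sys.stderr,
            )
        return matched or None  # no fallthrough to a possibly stale single CSV

    # Single-node path: only reached when no glob was configured.
    candidates = [env.get("GPU_METRICS_CSV"), *fallbacks]
    return next((Path(p) for p in candidates if p and Path(p).is_file()), None)
```

A regression test for the precedence contract would stage a stale single-node CSV, point the glob at an empty directory, and assert that the resolver returns None rather than the stale file.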

github-actions · 2026-05-28T00:57:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26547958720
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26547958720

… runs

Builds on PR #1558 (single-node measured-power) for multinode benchmarks via srt-slurm.

Pipeline:

srt-slurm perfmon (per-node nvidia-smi sampling; PR #35 on NVIDIA/srt-slurm, layered on
  SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
  → perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
  → launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
  → process_result.py expands the glob and hands the list to aggregate_power.run()
  → aggregate_power.py namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collide across nodes; emits cluster-wide avg_power_w + joules_per_*_token
  → InferenceX-app ETL auto-captures the numeric fields (no schema change)

Changes:

- utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping the original param name. Per-source GPU-id namespacing only kicks in when there are 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (Python-side glob, mutually exclusive with --csv).
- utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set.
- runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork, append `monitoring:` block to each recipe post-copy (idempotent), and write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the downstream Process result step.
- 8 new multinode tests in test_aggregate_power.py (per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E run with list). 3 new in test_process_result.py (glob aggregation, precedence over single CSV, empty-match falls through).

64/64 pass.

Verified data-format end-to-end on gb300 hardware: nvidia-smi inside the sglang container emits the columns aggregate_power.py needs: timestamp, gpu, power_w.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-28T01:09:40Z

        bench = json.loads(bench_result_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
-    start = bench.get("benchmark_start_time_unix")


Multinode num_gpus wrong without GPU column and clock drift

Low Severity

When multiple CSV paths are provided but none have a recognized GPU column (no match for _GPU_INDEX_COL_RE), num_gpus falls back to max(per_sample_row_count.values()). With multinode clock drift, each timestamp bucket only contains rows from a single node, so max returns one node's GPU count instead of the cluster total. This underestimates num_gpus and produces an incorrect total_system_energy_j, leading to wrong joules_per_*_token values in the agg JSON.
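
The under-count can be reproduced with a few synthetic rows. The sketch below is illustrative only: `num_gpus_global_max` mirrors the described fallback (max rows in any one timestamp bucket), while `num_gpus_per_source` is one possible mitigation that takes the max per source CSV and sums across sources. Neither function name comes from aggregate_power.py, and the PR does not state that it adopts this mitigation.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Row = Tuple[str, float]  # (source CSV stem, sample timestamp)


def num_gpus_global_max(rows: Iterable[Row]) -> int:
    """Fallback as described: largest row count in any single timestamp bucket."""
    per_sample_row_count = Counter(ts for _source, ts in rows)
    return max(per_sample_row_count.values(), default=0)


def num_gpus_per_source(rows: Iterable[Row]) -> int:
    """Possible mitigation: max rows per timestamp within each source, summed."""
    per_source = defaultdict(Counter)
    for source, ts in rows:
        per_source[source][ts] += 1
    return sum(max(counts.values()) for counts in per_source.values())


# Two nodes with 4 GPUs each; node1's clock is 0.4 s ahead of node0's.
drifted_rows = (
    [("node0", 100.0)] * 4 + [("node1", 100.4)] * 4
    + [("node0", 101.0)] * 4 + [("node1", 101.4)] * 4
)

print(num_gpus_global_max(drifted_rows))  # 4: one node's GPU count
print(num_gpus_per_source(drifted_rows))  # 8: cluster total
```

With perfectly aligned clocks both functions agree, so the discrepancy only appears once node timestamps stop landing in the same bucket.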


github-actions · 2026-05-28T04:57:33Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26548110246
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26548110246

arygupt requested a review from a team May 27, 2026 19:23

github-project-automation Bot added this to InferenceMAX Board May 27, 2026

arygupt added the sweep-enabled label May 27, 2026

functionstackx changed the title ~~feat(power): multinode measured-power aggregation~~ feat(power): multinode measured-montiroring aggregation May 27, 2026

claude Bot reviewed May 27, 2026


arygupt added sweep-enabled and removed sweep-enabled labels May 28, 2026

arygupt changed the title ~~feat(power): multinode measured-montiroring aggregation~~ feat(power): multinode measured-power aggregation May 28, 2026

arygupt added 2 commits May 27, 2026 18:00

arygupt force-pushed the feat/measured-power-multinode branch from 3caf593 to 8d30341 Compare May 28, 2026 01:01

cursor Bot reviewed May 28, 2026


arygupt mentioned this pull request May 28, 2026

feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574) #1577

Open

3 tasks

arygupt wants to merge 2 commits into main from feat/measured-power-multinode
