feat(power): multinode measured-power aggregation #1574
…regation Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1574. The sweep produces the first multinode agg JSONs with avg_power_w + joules_per_*_token, validating the per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge end-to-end on real GB300 hardware (gb300-cw cluster).
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26534043069
    _csv_arg = None
    _glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            print(
                f'[process_result] GPU_METRICS_CSV_GLOB={_glob_pattern!r} matched no files',
                file=sys.stderr,
            )

    if _csv_arg is None:
        # Single-node path: gpu_metrics.csv written by start_gpu_monitor in the
        # bench container.
        _csv_candidates = [
            os.environ.get('GPU_METRICS_CSV'),
            'gpu_metrics.csv',
            '/workspace/gpu_metrics.csv',
        ]
        _csv_arg = next(
            (Path(p) for p in _csv_candidates if p and Path(p).is_file()),
            None,
        )
🔴 When GPU_METRICS_CSV_GLOB is set but matches no files, _csv_arg stays None and the code falls through to the single-CSV candidate list (GPU_METRICS_CSV, gpu_metrics.csv, /workspace/gpu_metrics.csv) — contradicting the comment at lines 145-148 that the glob 'Takes precedence over the single-CSV fallback'. On a persistent self-hosted runner with a stale /workspace/gpu_metrics.csv from a prior single-node run (or a leaked GPU_METRICS_CSV env var), a multinode run whose perfmon failed on every node would silently patch wrong single-node avg_power_w / joules_per_*_token values into the multinode agg JSON. Fix: when _glob_pattern is truthy, skip the single-CSV fallback regardless of whether the glob matched anything.
Extended reasoning...
The contract violation
The block at utils/process_result.py:142-159 documents the precedence contract clearly:
Takes precedence over the single-CSV fallback — if the launcher set the glob, the run was multinode and there is no single-CSV fallback to make.
But the implementation only honors that contract when the glob actually matches files. On empty match:
    _csv_arg = None
    _glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            print(..., file=sys.stderr)  # warns but doesn't prevent fallthrough
    if _csv_arg is None:  # still None: falls into single-CSV branch
        _csv_candidates = [
            os.environ.get('GPU_METRICS_CSV'),
            'gpu_metrics.csv',
            '/workspace/gpu_metrics.csv',
        ]
        ...
The else branch just logs; _csv_arg stays None, and the next if _csv_arg is None block consults the single-CSV candidates.
Step-by-step proof on a persistent self-hosted runner
- Single-node run on gb300-cw_N completes successfully. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV=/workspace/gpu_metrics.csv (it lives in gpu_metrics.csv in cwd too). The file is left behind because the runner is persistent across jobs.
- Next job is a multinode dynamo-sglang sweep. runners/launch_gb300-cw.sh (lines 297-318) writes GPU_METRICS_CSV_GLOB=$LOGS_DIR/perf_samples_*.csv to $GITHUB_ENV, but only when perf_csv_count > 0. Suppose perfmon failed to start on every node (srt-slurm PR #35 had startup issues, host driver mismatch, etc.): perf_csv_count would be 0 and the glob env var would not be written. Fine, that path is safe.
- However, suppose perfmon CSVs were written at the end of the job (so the launcher writes the GLOB), but a downstream cleanup hook between launcher and process_result.py removed them, OR srt-slurm wrote the CSVs to a different path on a subsequent retry, OR a persistent env var (GPU_METRICS_CSV_GLOB from a prior job) leaks in. The glob expansion in process_result.py returns empty.
- process_result.py enters the else branch on line 155, prints a warning, and falls through.
- os.environ.get('GPU_METRICS_CSV') from the prior single-node job returns /workspace/gpu_metrics.csv (or gpu_metrics.csv in cwd is still there). Path(p).is_file() is True.
- _csv_arg = Path('/workspace/gpu_metrics.csv').
- _aggregate_power_run is called with the stale single-node CSV.
Why the bench-window timestamp filter doesn't always save us
One verifier argued the start_unix <= ts <= end_unix filter at aggregate_power.py:177-178 would reject stale samples. That's true if the window comes from explicit Unix timestamps. But this PR adds two new fallback tiers in _load_bench_window:
- Tier 2: date field parsed as a UTC string (YYYYMMDD-HHMMSS).
- Tier 3: bench_result_path.stat().st_mtime, the bench JSON's own mtime, which is the current run's mtime, used as bench-end with start = end - duration.
The mtime tier is exactly the danger zone: on a persistent runner the bench JSON is freshly written, so its mtime is now. If the stale gpu_metrics.csv was also written recently (within the derived [mtime - duration, mtime] window — possible if the prior single-node run finished a few minutes ago), its samples do fall inside the window. Result: silent wrong avg_power_w and joules_per_*_token patched into the multinode agg JSON, which InferenceX-app's ETL auto-captures into the dashboard.
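The three tiers described above can be sketched roughly as follows. This is not the PR's `_load_bench_window`; it is a minimal illustration under stated assumptions: the function name `load_bench_window`, the `benchmark_end_time_unix` key, the `duration` key holding seconds, and the reading of `date` as the run start are all hypothetical. Only `benchmark_start_time_unix`, the `YYYYMMDD-HHMMSS` format, and `start = end - duration` for the mtime tier come from the discussion.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_bench_window(bench_result_path: Path):
    """Return (start_unix, end_unix, tier), or None if no window can be derived."""
    try:
        bench = json.loads(bench_result_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None

    # Tier 1: explicit Unix timestamps ("benchmark_end_time_unix" is an assumed key).
    start = bench.get("benchmark_start_time_unix")
    end = bench.get("benchmark_end_time_unix")
    if start is not None and end is not None:
        return float(start), float(end), "unix"

    duration = bench.get("duration")  # assumed to be seconds
    if duration is None:
        return None
    duration = float(duration)

    # Tier 2: "date" as a UTC string; treating it as the run start is an assumption.
    date = bench.get("date")
    if date:
        try:
            begin = datetime.strptime(date, "%Y%m%d-%H%M%S").replace(tzinfo=timezone.utc)
            return begin.timestamp(), begin.timestamp() + duration, "date"
        except ValueError:
            pass  # malformed date: fall through to the mtime tier

    # Tier 3: the bench JSON's own mtime is taken as bench-end.
    end = bench_result_path.stat().st_mtime
    return end - duration, end, "mtime"


def sample_in_window(ts: float, window) -> bool:
    """Apply the start <= ts <= end filter to one sample timestamp."""
    start, end, _tier = window
    return start <= ts <= end
```

In the mtime tier the window is anchored to whenever the bench JSON was written, so a sample from a different run that happens to fall inside `[mtime - duration, mtime]` passes `sample_in_window` just like a genuine one.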
What the test misses
The accompanying test test_multinode_csv_glob_empty_match_falls_through_silently only asserts the no-stale-file case (asserts 'avg_power_w' not in patched). It does not stage a stale fallback CSV, so it can't catch the precedence violation. test_multinode_csv_glob_takes_precedence_over_single_csv only tests precedence when the glob matches.
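A regression check that closes this gap would need to stage a stale single-node CSV alongside an empty glob. The sketch below is illustrative only: `resolve_as_reviewed` is a hypothetical stand-in that mirrors the reviewed logic (with the cwd and /workspace candidates left out so the example stays hermetic), and `stale_csv_is_picked_up` is a hypothetical helper, not a test from the PR.

```python
import glob
import tempfile
from pathlib import Path


def resolve_as_reviewed(env):
    """Mirror of the reviewed logic: an empty glob falls through to the single CSV."""
    csv_arg = None
    pattern = env.get("GPU_METRICS_CSV_GLOB")
    if pattern:
        matched = sorted(Path(p) for p in glob.glob(pattern))
        if matched:
            csv_arg = matched
    if csv_arg is None:
        # Only the env-var candidate is kept here to avoid touching the real cwd.
        candidates = [env.get("GPU_METRICS_CSV")]
        csv_arg = next((Path(p) for p in candidates if p and Path(p).is_file()), None)
    return csv_arg


def stale_csv_is_picked_up(resolver) -> bool:
    """Stage a stale single-node CSV plus a glob that matches nothing."""
    with tempfile.TemporaryDirectory() as tmp:
        stale = Path(tmp) / "gpu_metrics.csv"
        stale.write_text("timestamp,gpu,power_w\n", encoding="utf-8")
        env = {
            "GPU_METRICS_CSV_GLOB": str(Path(tmp) / "perf_samples_*.csv"),
            "GPU_METRICS_CSV": str(stale),
        }
        # Any non-None result means the stale file leaked into a multinode run.
        return resolver(env) is not None
```

Calling `stale_csv_is_picked_up(resolve_as_reviewed)` reports whether the stale file leaks through; a resolver that honors the precedence contract should make it return False.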
Fix
One-line change in the empty-match branch:
    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            _csv_arg = []  # sentinel: glob attempted, fallback forbidden
            print(...)
    if not _csv_arg:  # treats [] same as None for the downstream check, but…
        if _glob_pattern:
            pass  # …skip single-CSV candidates when glob was attempted
        else:
            _csv_candidates = [...]
            _csv_arg = next(...)
Or more cleanly: guard the single-CSV block on not _glob_pattern instead of _csv_arg is None.
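The cleaner variant, guarding the single-CSV block on whether a glob was configured at all, might look like the sketch below. The helper name `resolve_power_csvs` and its `env` parameter are hypothetical (the real code reads `os.environ` inline in process_result.py), and returning None on an empty match is one possible choice, assuming the caller treats None as "skip aggregation".

```python
import glob
import os
import sys
from pathlib import Path


def resolve_power_csvs(env=None):
    """Return a list of Paths (multinode), one Path (single-node), or None."""
    env = os.environ if env is None else env
    pattern = env.get("GPU_METRICS_CSV_GLOB")
    if pattern:
        # Multinode: the glob is authoritative, even when it matches nothing.
        matched = sorted(Path(p) for p in glob.glob(pattern))
        if not matched:
            print(
                f"[process_result] GPU_METRICS_CSV_GLOB={pattern!r} matched no files",
                file=sys.stderr,
            )
            return None  # no single-CSV fallback once a glob was configured
        return matched
    # Single-node: reached only when no glob was configured.
    candidates = [
        env.get("GPU_METRICS_CSV"),
        "gpu_metrics.csv",
        "/workspace/gpu_metrics.csv",
    ]
    return next((Path(p) for p in candidates if p and Path(p).is_file()), None)
```

Structuring the two paths as early returns keeps the precedence rule visible in the control flow: once the glob branch is entered, the single-CSV candidates are unreachable.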
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26547958720
… runs
Builds on PR #1558 (single-node measured-power) for multinode benchmarks via srt-slurm.
Pipeline:
- srt-slurm perfmon (per-node nvidia-smi sampling; PR #35 on NVIDIA/srt-slurm, layered on SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
- perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
- launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
- process_result.py expands the glob and hands the list to aggregate_power.run()
- aggregate_power.py namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collide across nodes; emits cluster-wide avg_power_w + joules_per_*_token
- InferenceX-app ETL auto-captures the numeric fields (no schema change)
Changes:
- utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping the original param name. Per-source GPU-id namespacing only kicks in when there are 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (Python-side glob, mutually exclusive with --csv).
- utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set.
- runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork, append `monitoring:` block to each recipe post-copy (idempotent), and write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the downstream Process result step.
- 8 new multinode tests in test_aggregate_power.py (per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E run with list). 3 new in test_process_result.py (glob aggregation, precedence over single CSV, empty-match falls through). 64/64 pass.
Verified data-format end-to-end on gb300 hardware: nvidia-smi inside the sglang container emits the columns aggregate_power.py needs: timestamp, gpu, power_w.
3caf593 to 8d30341
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8d30341. Configure here.
        bench = json.loads(bench_result_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    start = bench.get("benchmark_start_time_unix")
Multinode num_gpus wrong without GPU column and clock drift
Low Severity
When multiple CSV paths are provided but none have a recognized GPU column (no match for _GPU_INDEX_COL_RE), num_gpus falls back to max(per_sample_row_count.values()). With multinode clock drift, each timestamp bucket only contains rows from a single node, so max returns one node's GPU count instead of the cluster total. This underestimates num_gpus and produces an incorrect total_system_energy_j, leading to wrong joules_per_*_token values in the agg JSON.
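The under-count described in this finding can be reproduced with a few lines of arithmetic. The sketch below is a toy model, not the code in aggregate_power.py: `fallback_num_gpus` is a hypothetical helper that only imitates the `max(per_sample_row_count.values())` fallback, and the timestamps and power values are made up for illustration.

```python
from collections import defaultdict


def fallback_num_gpus(rows):
    """Max rows per timestamp bucket, imitating the no-GPU-column fallback."""
    per_sample_row_count = defaultdict(int)
    for ts, _power_w in rows:
        per_sample_row_count[ts] += 1
    return max(per_sample_row_count.values(), default=0)


# Two nodes with 4 GPUs each. Node B's clock runs 1 s ahead of node A's,
# so the two nodes never land in the same timestamp bucket.
node_a = [(100, 300.0)] * 4 + [(102, 300.0)] * 4
node_b = [(101, 300.0)] * 4 + [(103, 300.0)] * 4

drifted = fallback_num_gpus(node_a + node_b)
aligned = fallback_num_gpus(node_a + [(ts - 1, w) for ts, w in node_b])
print(drifted, aligned)  # 4 with drift, 8 once the clocks line up
```

With drift, every bucket holds rows from one node only, so the fallback reports one node's GPU count rather than the cluster total.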
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26548110246


Summary
Extends single-node measured-power aggregation (#1558) to multinode srt-slurm benchmarks. Wires per-node perf_samples_<host>.csv from srt-slurm's PR #35 perfmon through the launcher into process_result.py → aggregate_power.py, which now namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collapse across nodes.
Backward compatible: aggregate_power() accepts both Path and Iterable[Path]; single-CSV callers (single-node start_gpu_monitor path) are unchanged. csv_path param name preserved.
Pipeline
- srt-slurm perfmon (per-node nvidia-smi sampling)
- perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
- launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
- process_result.py expands the glob and hands the list to aggregate_power.run()
- aggregate_power.py namespaces local GPU indices per source CSV stem; emits cluster-wide avg_power_w + joules_per_*_token
- InferenceX-app ETL auto-captures the numeric fields (no schema change)
Files
- utils/aggregate_power.py: csv_path widened to Path | Iterable[Path]. Per-source GPU-id namespacing only kicks in for 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (mutually exclusive with --csv).
- utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set.
- runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork, append monitoring: block to each recipe post-copy (idempotent), write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job.
- utils/test_aggregate_power.py: 8 new multinode cases: per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E with list.
- utils/test_process_result.py: 3 new cases: glob aggregation, precedence over single CSV, empty-match falls through.
Test plan
- nvidia-smi inside sglang container on real gb300-cw emits expected columns (timestamp, gpu, power_w); verified manually with srun --container-image=...sglang...sqsh nvidia-smi --query-gpu=...
- avg_power_w + joules_per_*_token in the agg JSON (pending perf-changelog.yaml entry + sweep-enabled label)
- num_gpus in agg JSON matches prefill_gpus + decode_gpus from launcher (validates per-source namespacing; without the fix, num_gpus would equal a single node's gpus_per_node)
- ?unofficialrun=<run_id>
Depends on
SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon (pinned by the launcher). Tracks NVIDIA/srt-slurm PR #35 head; will rebase to upstream main once #35 merges.
Note
Medium Risk
Changes how agg JSON power metrics are computed for multinode runs (GPU count and time-window inference); failures are skipped gracefully but wrong math would skew dashboard energy data until caught by tests/smoke sweep.
Overview
Extends measured-power aggregation from single-node GPU metrics to multinode srt-slurm runs by wiring per-node perf_samples_*.csv (perfmon PR #35) through the GB300-CW launcher into result processing.
aggregate_power.py now accepts one or many CSVs (--csv-glob), namespaces GPU IDs by per-node CSV stem so repeated local indices 0..N-1 across nodes do not under-count total GPUs, and adds joules_per_total_token. For multinode bench JSONs without benchmark_*_time_unix, it derives the load window from date + duration or file mtime.
process_result.py honors GPU_METRICS_CSV_GLOB (glob wins over single GPU_METRICS_CSV); aggregation remains best-effort and never fails the upload.
launch_gb300-cw.sh pins SemiAnalysisAI/srt-slurm feat/inferencex-perfmon, idempotently appends monitoring: to overlaid recipes, and exports the perf CSV glob to $GITHUB_ENV after the job.
perf-changelog.yaml adds a smoke entry to trigger the first multinode sweep with power fields on agg JSONs.
Reviewed by Cursor Bugbot for commit 8d30341. Bugbot is set up for automated code reviews on this repo.
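The per-source namespacing described in the overview can be illustrated with a small sketch. This is not the implementation in aggregate_power.py: `count_gpus` is a hypothetical helper that takes CSV text keyed by file name, and only the column names (timestamp, gpu, power_w), the use of the CSV stem as namespace, and the "2+ sources" rule come from the PR text.

```python
import csv
import io
from pathlib import Path


def count_gpus(sources):
    """Count distinct GPUs across sources, given a {csv_name: csv_text} mapping."""
    namespaced = len(sources) >= 2  # single-source runs keep bare GPU ids
    seen = set()
    for name, text in sources.items():
        stem = Path(name).stem
        for row in csv.DictReader(io.StringIO(text)):
            gpu = row["gpu"]
            # Prefix with the CSV stem so node A's GPU 0 and node B's GPU 0 differ.
            seen.add(f"{stem}:{gpu}" if namespaced else gpu)
    return len(seen)


# Each node reports local GPU indices 0..3 (made-up sample values).
node_csv = "timestamp,gpu,power_w\n" + "".join(f"100,{i},300\n" for i in range(4))

print(count_gpus({"perf_samples_a.csv": node_csv, "perf_samples_b.csv": node_csv}))
print(count_gpus({"gpu_metrics.csv": node_csv}))
```

Without the stem prefix, both nodes' indices would collapse into the same four ids and a two-node run would be counted as a single node.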