feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577
feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577arygupt wants to merge 4 commits into
Conversation
… runs Builds on PR #1558 (single-node measured-power) for multinode benchmarks via srt-slurm. Pipeline: srt-slurm perfmon (per-node nvidia-smi sampling — PR #35 on NVIDIA/srt-slurm, layered on SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon) perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV process_result.py expands the glob and hands the list to aggregate_power.run() aggregate_power.py namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collide across nodes; emits cluster-wide avg_power_w + joules_per_*_token InferenceX-app ETL auto-captures the numeric fields (no schema change) Changes: - utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping the original param name. Per-source GPU-id namespacing only kicks in when there are 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (Python-side glob, mutually exclusive with --csv). - utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set. - runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork, append `monitoring:` block to each recipe post-copy (idempotent), and write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the downstream Process result step. - 8 new multinode tests in test_aggregate_power.py (per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E run with list). 3 new in test_process_result.py (glob aggregation, precedence over single CSV, empty-match falls through). 64/64 pass. Verified data-format end-to-end on gb300 hardware: nvidia-smi inside the sglang container emits the columns aggregate_power.py needs timestamp, gpu, power_w.
…regation Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1574. The sweep produces the first multinode agg JSONs with avg_power_w + joules_per_*_token, validating the per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge end-to-end on real GB300 hardware (gb300-cw cluster).
There was a problem hiding this comment.
LGTM — additive per-worker breakdown on top of existing cluster-wide aggregation, backward compatible for legacy CSV names, well-covered by 9 new tests.
Extended reasoning...
Overview
Extends utils/aggregate_power.py with per-worker power attribution: a new _parse_role_from_filename helper, a new aggregate_power_per_worker function that wraps the existing aggregator and groups CSVs by (role, worker_idx) parsed from perf_samples_<role>_w<idx>_<host>.csv, a new optional extras kwarg to patch_agg_result, and an extended run() that emits a workers array plus role-split scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode) when both prefill and decode workers are present. utils/test_aggregate_power.py adds 9 tests.
Security risks
None. This is offline post-processing of NVIDIA/AMD SMI CSVs produced by the benchmark harness; no network calls, no auth, no user-controlled input. Filename regex is anchored and narrow.
Level of scrutiny
Low. This touches only optional benchmark power telemetry — not the inference, serving, or scheduling hot paths. The aggregator is best-effort and returns 0 on every error path so monitoring failures cannot break benchmark uploads. Cluster-wide fields (avg_power_w, joules_per_output_token, joules_per_total_token) keep their existing semantics and old-format unlabeled CSVs produce bit-for-bit identical output.
Other factors
- Tests cover filename parsing (labeled / multi-digit idx / old format), per-worker grouping with multi-node decode worker aggregation, mixed labeled+unlabeled inputs, disagg E2E with role-split joules, agg E2E without disagg-only fields, and legacy single-CSV bit-for-bit compat.
- The V1 limitation (multiple workers of the same role colocated on one node collapse to lowest worker_idx) is documented in the PR description and does not affect the target gb300-cw multi-node disagg case.
- Bug hunter found no issues.
… recipes The previous glob `$SRT_RECIPE_DST/*.yaml` only matched top-level YAMLs, but recipes live under workload subdirectories (e.g. 8k1k/*.yaml). The loop iterated zero times, no recipe got the monitoring: block, perfmon never spawned, no perf_samples_*.csv were written, aggregate_power silently skipped patching the agg JSON, and the dashboard had no power data. Sweep #26548110246 burned hours of GB300 time and shipped "success" with zero power keys in every agg artifact — exactly the silent-failure chain we should have caught earlier. Fix: recurse via `find -type f -name '*.yaml'`. Add a loud WARNING when zero recipes get the injection so future regressions surface immediately instead of waiting for missing dashboard data to be noticed.
… joules Layers per-worker breakdown on top of the cluster-wide multinode aggregation in the parent PR #1574. New agg JSON fields (additive — all existing keys preserved bit-for-bit for backward compat): workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx) aggregates across all CSVs for that worker — a multi-node TP=16 decode worker on 4 nodes produces one workers entry with num_gpus=16. prefill_avg_power_w, decode_avg_power_w (disagg only) Weighted per-GPU averages within each role. joules_per_input_token = prefill_energy / total_input_tokens joules_per_output_token_decode = decode_energy / total_output_tokens Disagg-only role-split metrics. Existing joules_per_output_token and joules_per_total_token keep their cluster-wide semantics so the chart won't shift on existing data. Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change on SemiAnalysisAI/srt-slurm c4c86dc) writes `perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old single-CSV format) silently emit empty workers list and skip the role split — cluster-wide metrics unchanged in that case. 77/77 tests pass (68 existing + 9 new — per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E omitting disagg-only fields, bit-for-bit backward compat for old-format callers).
f951aef to
d04634e
Compare
f5b5c77 to
1af17ab
Compare
Stacked on #1574 — review #1574 first; this PR adds per-worker breakdown on top of the cluster-wide multinode aggregation that PR lands.
Summary
Manager-requested follow-up. Adds per-worker power attribution and role-split energy metrics for disagg runs:
workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]— per-worker breakdown in the agg JSON.role∈prefill/decode/agg/frontend. Multi-node workers (e.g. TP=16 decode across 4 nodes) aggregate as one entry withnum_gpus=16.prefill_avg_power_w,decode_avg_power_w— weighted per-GPU averages within each role (disagg only).joules_per_input_token=prefill_energy / total_input_tokens(disagg only).joules_per_output_token_decode=decode_energy / total_output_tokens(disagg only).Backward compatible: existing
avg_power_w,joules_per_output_token,joules_per_total_tokenkeep their cluster-wide semantics. Old-format unlabeled CSVs (single-nodestart_gpu_monitorpath) emit noworkersfield and no role split — bit-for-bit identical to the parent PR's behavior for those callers.Pipeline
Companion srt-slurm change
SemiAnalysisAI/srt-slurm c4c86dc:
_start_perf_monitorderives role+worker_idx per node frombackend_processes(which carriesendpoint_modeandendpoint_indexper Process) and labels output filenames. Allocator invariant — prefill and decode never share a node — means a single (role, worker_idx) per node is well-defined.Files
utils/aggregate_power.py: new_parse_role_from_filename+aggregate_power_per_worker.patch_agg_resultgainsextraskwarg.run()computes role-split energy and per-worker breakdown when filenames are labeled.utils/test_aggregate_power.py: 9 new tests.Test plan
workersarray +prefill_avg_power_w/decode_avg_power_w/joules_per_input_token/joules_per_output_token_decodein the agg JSONbenchmark-transform.tsto surfaceworkersarray; add prefill/decode chart legendApp-side follow-up
ETL auto-captures any numeric field, so the scalar role-split fields (
prefill_avg_power_wetc.) land inmetricsJSONB automatically. The nestedworkersarray needs explicit handling — that's a separate PR onInferenceX-appcovering: type extension,benchmark-transform.tsmapper, chart legend with prefill/decode swatches.Limitation (V1)
When multiple workers of the same role share a node (uncommon — happens in single-node multi-worker setups, e.g. 2 decode workers per 8-GPU node), perfmon labels by the lowest worker_idx and the aggregator currently attributes that node's full power to the lowest-idx worker. The other colocated workers would not appear in the workers list. For the GB300 multi-node disagg case (worker spans 1+ nodes, never shares), this isn't an issue.
Note
Low Risk
Additive benchmark JSON fields with legacy single-CSV behavior unchanged; risk is mainly mis-attributed disagg energy if filename labels or grouping assumptions are wrong, which the new tests target.
Overview
Extends
aggregate_powerso labeled srt-slurm perf CSVs (perf_samples_<role>_w<idx>_<host>.csv) drive per-worker power and disaggregated energy fields in the agg JSON, while cluster-wideavg_power_w/ joules metrics stay the same.aggregate_power_per_workergroups files by parsed(role, worker_idx)(multi-node workers merge several CSVs into one entry with summed GPUs).run()uses that instead of bareaggregate_power, merges optionalextrasviapatch_agg_result, and when both prefill and decode workers exist addsworkers,prefill_avg_power_w/decode_avg_power_w,joules_per_input_token, andjoules_per_output_token_decode. Unlabeled legacy CSVs still patch only the original three keys (noworkers).Nine new tests cover filename parsing, grouping, mixed labeled/unlabeled inputs, and disagg vs agg vs old-format behavior.
Reviewed by Cursor Bugbot for commit d04634e. Bugbot is set up for automated code reviews on this repo. Configure here.