feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574) by arygupt · Pull Request #1577 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-28T01:22:11Z

Stacked on #1574 — review #1574 first; this PR adds per-worker breakdown on top of the cluster-wide multinode aggregation that PR lands.

Summary

Manager-requested follow-up. Adds per-worker power attribution and role-split energy metrics for disagg runs:

workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] — per-worker breakdown in the agg JSON. role ∈ prefill / decode / agg / frontend. Multi-node workers (e.g. TP=16 decode across 4 nodes) aggregate as one entry with num_gpus=16.
prefill_avg_power_w, decode_avg_power_w — weighted per-GPU averages within each role (disagg only).
joules_per_input_token = prefill_energy / total_input_tokens (disagg only).
joules_per_output_token_decode = decode_energy / total_output_tokens (disagg only).

Backward compatible: existing avg_power_w, joules_per_output_token, joules_per_total_token keep their cluster-wide semantics. Old-format unlabeled CSVs (single-node start_gpu_monitor path) emit no workers field and no role split — bit-for-bit identical to the parent PR's behavior for those callers.

Pipeline

srt-slurm perfmon (SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon c4c86dc)
  → perf_samples_<role>_w<worker_idx>_<host>.csv on shared NFS
  → process_result.py (parent PR's bridge) → aggregate_power.run()
  → aggregate_power_per_worker() groups CSVs by (role, worker_idx)
  → agg JSON gets workers array + role-split joules (disagg) + existing cluster fields

Companion srt-slurm change

SemiAnalysisAI/srt-slurm c4c86dc: _start_perf_monitor derives role+worker_idx per node from backend_processes (which carries endpoint_mode and endpoint_index per Process) and labels output filenames. Allocator invariant — prefill and decode never share a node — means a single (role, worker_idx) per node is well-defined.

Files

utils/aggregate_power.py: new _parse_role_from_filename + aggregate_power_per_worker. patch_agg_result gains extras kwarg. run() computes role-split energy and per-worker breakdown when filenames are labeled.
utils/test_aggregate_power.py: 9 new tests.

Test plan

77/77 tests pass (68 existing + 9 new — filename parsing, per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E without disagg-only fields, bit-for-bit backward compat)
After parent PR merges and srt-slurm change lands, first real disagg sweep on gb300-cw emits workers array + prefill_avg_power_w / decode_avg_power_w / joules_per_input_token / joules_per_output_token_decode in the agg JSON
InferenceX-app stage (separate PR): extend benchmark-transform.ts to surface workers array; add prefill/decode chart legend

App-side follow-up

ETL auto-captures any numeric field, so the scalar role-split fields (prefill_avg_power_w etc.) land in metrics JSONB automatically. The nested workers array needs explicit handling — that's a separate PR on InferenceX-app covering: type extension, benchmark-transform.ts mapper, chart legend with prefill/decode swatches.

Limitation (V1)

When multiple workers of the same role share a node (uncommon — happens in single-node multi-worker setups, e.g. 2 decode workers per 8-GPU node), perfmon labels by the lowest worker_idx and the aggregator currently attributes that node's full power to the lowest-idx worker. The other colocated workers would not appear in the workers list. For the GB300 multi-node disagg case (worker spans 1+ nodes, never shares), this isn't an issue.

Note

Low Risk
Additive benchmark JSON fields with legacy single-CSV behavior unchanged; risk is mainly mis-attributed disagg energy if filename labels or grouping assumptions are wrong, which the new tests target.

Overview
Extends aggregate_power so labeled srt-slurm perf CSVs (perf_samples_<role>_w<idx>_<host>.csv) drive per-worker power and disaggregated energy fields in the agg JSON, while cluster-wide avg_power_w / joules metrics stay the same.

aggregate_power_per_worker groups files by parsed (role, worker_idx) (multi-node workers merge several CSVs into one entry with summed GPUs). run() uses that instead of bare aggregate_power, merges optional extras via patch_agg_result, and when both prefill and decode workers exist adds workers, prefill_avg_power_w / decode_avg_power_w, joules_per_input_token, and joules_per_output_token_decode. Unlabeled legacy CSVs still patch only the original three keys (no workers).

Nine new tests cover filename parsing, grouping, mixed labeled/unlabeled inputs, and disagg vs agg vs old-format behavior.

^{Reviewed by Cursor Bugbot for commit d04634e. Bugbot is set up for automated code reviews on this repo. Configure here.}

… runs Builds on PR #1558 (single-node measured-power) for multinode benchmarks via srt-slurm. Pipeline: srt-slurm perfmon (per-node nvidia-smi sampling — PR #35 on NVIDIA/srt-slurm, layered on SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon) perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV process_result.py expands the glob and hands the list to aggregate_power.run() aggregate_power.py namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collide across nodes; emits cluster-wide avg_power_w + joules_per_*_token InferenceX-app ETL auto-captures the numeric fields (no schema change) Changes: - utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping the original param name. Per-source GPU-id namespacing only kicks in when there are 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (Python-side glob, mutually exclusive with --csv). - utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set. - runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork, append `monitoring:` block to each recipe post-copy (idempotent), and write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the downstream Process result step. - 8 new multinode tests in test_aggregate_power.py (per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E run with list). 3 new in test_process_result.py (glob aggregation, precedence over single CSV, empty-match falls through). 64/64 pass. Verified data-format end-to-end on gb300 hardware: nvidia-smi inside the sglang container emits the columns aggregate_power.py needs timestamp, gpu, power_w.

…regation Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1574. The sweep produces the first multinode agg JSONs with avg_power_w + joules_per_*_token, validating the per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge end-to-end on real GB300 hardware (gb300-cw cluster).

claude

LGTM — additive per-worker breakdown on top of existing cluster-wide aggregation, backward compatible for legacy CSV names, well-covered by 9 new tests.

Extended reasoning...

Overview

Extends utils/aggregate_power.py with per-worker power attribution: a new _parse_role_from_filename helper, a new aggregate_power_per_worker function that wraps the existing aggregator and groups CSVs by (role, worker_idx) parsed from perf_samples_<role>_w<idx>_<host>.csv, a new optional extras kwarg to patch_agg_result, and an extended run() that emits a workers array plus role-split scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode) when both prefill and decode workers are present. utils/test_aggregate_power.py adds 9 tests.

Security risks

None. This is offline post-processing of NVIDIA/AMD SMI CSVs produced by the benchmark harness; no network calls, no auth, no user-controlled input. Filename regex is anchored and narrow.

Level of scrutiny

Low. This touches only optional benchmark power telemetry — not the inference, serving, or scheduling hot paths. The aggregator is best-effort and returns 0 on every error path so monitoring failures cannot break benchmark uploads. Cluster-wide fields (avg_power_w, joules_per_output_token, joules_per_total_token) keep their existing semantics and old-format unlabeled CSVs produce bit-for-bit identical output.

Other factors

Tests cover filename parsing (labeled / multi-digit idx / old format), per-worker grouping with multi-node decode worker aggregation, mixed labeled+unlabeled inputs, disagg E2E with role-split joules, agg E2E without disagg-only fields, and legacy single-CSV bit-for-bit compat.
The V1 limitation (multiple workers of the same role colocated on one node collapse to lowest worker_idx) is documented in the PR description and does not affect the target gb300-cw multi-node disagg case.
Bug hunter found no issues.

… recipes The previous glob `$SRT_RECIPE_DST/*.yaml` only matched top-level YAMLs, but recipes live under workload subdirectories (e.g. 8k1k/*.yaml). The loop iterated zero times, no recipe got the monitoring: block, perfmon never spawned, no perf_samples_*.csv were written, aggregate_power silently skipped patching the agg JSON, and the dashboard had no power data. Sweep #26548110246 burned hours of GB300 time and shipped "success" with zero power keys in every agg artifact — exactly the silent-failure chain we should have caught earlier. Fix: recurse via `find -type f -name '*.yaml'`. Add a loud WARNING when zero recipes get the injection so future regressions surface immediately instead of waiting for missing dashboard data to be noticed.

… joules Layers per-worker breakdown on top of the cluster-wide multinode aggregation in the parent PR #1574. New agg JSON fields (additive — all existing keys preserved bit-for-bit for backward compat): workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx) aggregates across all CSVs for that worker — a multi-node TP=16 decode worker on 4 nodes produces one workers entry with num_gpus=16. prefill_avg_power_w, decode_avg_power_w (disagg only) Weighted per-GPU averages within each role. joules_per_input_token = prefill_energy / total_input_tokens joules_per_output_token_decode = decode_energy / total_output_tokens Disagg-only role-split metrics. Existing joules_per_output_token and joules_per_total_token keep their cluster-wide semantics so the chart won't shift on existing data. Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change on SemiAnalysisAI/srt-slurm c4c86dc) writes `perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old single-CSV format) silently emit empty workers list and skip the role split — cluster-wide metrics unchanged in that case. 77/77 tests pass (68 existing + 9 new — per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E omitting disagg-only fields, bit-for-bit backward compat for old-format callers).

arygupt added 2 commits May 27, 2026 18:00

github-project-automation Bot added this to InferenceMAX Board May 28, 2026

claude Bot reviewed May 28, 2026

View reviewed changes

arygupt added 2 commits May 28, 2026 10:39

arygupt force-pushed the feat/measured-power-per-worker branch from f951aef to d04634e Compare May 28, 2026 17:40

arygupt force-pushed the feat/measured-power-multinode branch from f5b5c77 to 1af17ab Compare May 28, 2026 18:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577

feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577
arygupt wants to merge 4 commits into
feat/measured-power-multinodefrom
feat/measured-power-per-worker

arygupt commented May 28, 2026 •

edited by cursor Bot

Loading

Uh oh!

claude Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arygupt commented May 28, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Pipeline

Companion srt-slurm change

Files

Test plan

App-side follow-up

Limitation (V1)

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arygupt commented May 28, 2026 •

edited by cursor Bot

Loading