Skip to content

feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577

Open
arygupt wants to merge 4 commits into
feat/measured-power-multinodefrom
feat/measured-power-per-worker
Open

feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574)#1577
arygupt wants to merge 4 commits into
feat/measured-power-multinodefrom
feat/measured-power-per-worker

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 28, 2026

Stacked on #1574 — review #1574 first; this PR adds per-worker breakdown on top of the cluster-wide multinode aggregation that PR lands.

Summary

Manager-requested follow-up. Adds per-worker power attribution and role-split energy metrics for disagg runs:

  • workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] — per-worker breakdown in the agg JSON. roleprefill / decode / agg / frontend. Multi-node workers (e.g. TP=16 decode across 4 nodes) aggregate as one entry with num_gpus=16.
  • prefill_avg_power_w, decode_avg_power_w — weighted per-GPU averages within each role (disagg only).
  • joules_per_input_token = prefill_energy / total_input_tokens (disagg only).
  • joules_per_output_token_decode = decode_energy / total_output_tokens (disagg only).

Backward compatible: existing avg_power_w, joules_per_output_token, joules_per_total_token keep their cluster-wide semantics. Old-format unlabeled CSVs (single-node start_gpu_monitor path) emit no workers field and no role split — bit-for-bit identical to the parent PR's behavior for those callers.

Pipeline

srt-slurm perfmon (SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon c4c86dc)
  → perf_samples_<role>_w<worker_idx>_<host>.csv on shared NFS
  → process_result.py (parent PR's bridge) → aggregate_power.run()
  → aggregate_power_per_worker() groups CSVs by (role, worker_idx)
  → agg JSON gets workers array + role-split joules (disagg) + existing cluster fields

Companion srt-slurm change

SemiAnalysisAI/srt-slurm c4c86dc: _start_perf_monitor derives role+worker_idx per node from backend_processes (which carries endpoint_mode and endpoint_index per Process) and labels output filenames. Allocator invariant — prefill and decode never share a node — means a single (role, worker_idx) per node is well-defined.

Files

  • utils/aggregate_power.py: new _parse_role_from_filename + aggregate_power_per_worker. patch_agg_result gains extras kwarg. run() computes role-split energy and per-worker breakdown when filenames are labeled.
  • utils/test_aggregate_power.py: 9 new tests.

Test plan

  • 77/77 tests pass (68 existing + 9 new — filename parsing, per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E without disagg-only fields, bit-for-bit backward compat)
  • After parent PR merges and srt-slurm change lands, first real disagg sweep on gb300-cw emits workers array + prefill_avg_power_w / decode_avg_power_w / joules_per_input_token / joules_per_output_token_decode in the agg JSON
  • InferenceX-app stage (separate PR): extend benchmark-transform.ts to surface workers array; add prefill/decode chart legend

App-side follow-up

ETL auto-captures any numeric field, so the scalar role-split fields (prefill_avg_power_w etc.) land in metrics JSONB automatically. The nested workers array needs explicit handling — that's a separate PR on InferenceX-app covering: type extension, benchmark-transform.ts mapper, chart legend with prefill/decode swatches.

Limitation (V1)

When multiple workers of the same role share a node (uncommon — happens in single-node multi-worker setups, e.g. 2 decode workers per 8-GPU node), perfmon labels by the lowest worker_idx and the aggregator currently attributes that node's full power to the lowest-idx worker. The other colocated workers would not appear in the workers list. For the GB300 multi-node disagg case (worker spans 1+ nodes, never shares), this isn't an issue.


Note

Low Risk
Additive benchmark JSON fields with legacy single-CSV behavior unchanged; risk is mainly mis-attributed disagg energy if filename labels or grouping assumptions are wrong, which the new tests target.

Overview
Extends aggregate_power so labeled srt-slurm perf CSVs (perf_samples_<role>_w<idx>_<host>.csv) drive per-worker power and disaggregated energy fields in the agg JSON, while cluster-wide avg_power_w / joules metrics stay the same.

aggregate_power_per_worker groups files by parsed (role, worker_idx) (multi-node workers merge several CSVs into one entry with summed GPUs). run() uses that instead of bare aggregate_power, merges optional extras via patch_agg_result, and when both prefill and decode workers exist adds workers, prefill_avg_power_w / decode_avg_power_w, joules_per_input_token, and joules_per_output_token_decode. Unlabeled legacy CSVs still patch only the original three keys (no workers).

Nine new tests cover filename parsing, grouping, mixed labeled/unlabeled inputs, and disagg vs agg vs old-format behavior.

Reviewed by Cursor Bugbot for commit d04634e. Bugbot is set up for automated code reviews on this repo. Configure here.

arygupt added 2 commits May 27, 2026 18:00
… runs

Builds on PR #1558 (single-node measured-power) for multinode benchmarks
via srt-slurm. Pipeline:

  srt-slurm perfmon (per-node nvidia-smi sampling — PR #35 on
    NVIDIA/srt-slurm, layered on SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
   perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
   launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
   process_result.py expands the glob and hands the list to
   aggregate_power.run()
   aggregate_power.py namespaces local GPU indices per source CSV stem so
   each node's local indices 0..N-1 don't collide across nodes; emits
   cluster-wide avg_power_w + joules_per_*_token
   InferenceX-app ETL auto-captures the numeric fields (no schema change)

Changes:

- utils/aggregate_power.py: widen csv_path to Path | Iterable[Path] keeping
  the original param name. Per-source GPU-id namespacing only kicks in when
  there are 2+ sources so single-node num_gpus is unchanged. CLI adds
  --csv-glob (Python-side glob, mutually exclusive with --csv).
- utils/process_result.py: bridge GPU_METRICS_CSV_GLOB env var. Glob takes
  precedence over single GPU_METRICS_CSV when both are set.
- runners/launch_gb300-cw.sh: point dynamo-sglang at our srt-slurm fork,
  append `monitoring:` block to each recipe post-copy (idempotent), and
  write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job for the
  downstream Process result step.
- 8 new multinode tests in test_aggregate_power.py (per-source namespacing,
  sub-second clock drift, asymmetric prefill/decode power, missing-CSV
  silent skip, backward-compat single-path-in-list, Iterable acceptance,
  E2E run with list). 3 new in test_process_result.py (glob aggregation,
  precedence over single CSV, empty-match falls through). 64/64 pass.

Verified data-format end-to-end on gb300 hardware: nvidia-smi
inside the sglang container emits the columns aggregate_power.py needs
timestamp, gpu, power_w.
…regation

Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when
the sweep-enabled label is added to PR #1574. The sweep produces the first
multinode agg JSONs with avg_power_w + joules_per_*_token, validating the
per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge
end-to-end on real GB300 hardware (gb300-cw cluster).
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — additive per-worker breakdown on top of existing cluster-wide aggregation, backward compatible for legacy CSV names, well-covered by 9 new tests.

Extended reasoning...

Overview

Extends utils/aggregate_power.py with per-worker power attribution: a new _parse_role_from_filename helper, a new aggregate_power_per_worker function that wraps the existing aggregator and groups CSVs by (role, worker_idx) parsed from perf_samples_<role>_w<idx>_<host>.csv, a new optional extras kwarg to patch_agg_result, and an extended run() that emits a workers array plus role-split scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode) when both prefill and decode workers are present. utils/test_aggregate_power.py adds 9 tests.

Security risks

None. This is offline post-processing of NVIDIA/AMD SMI CSVs produced by the benchmark harness; no network calls, no auth, no user-controlled input. Filename regex is anchored and narrow.

Level of scrutiny

Low. This touches only optional benchmark power telemetry — not the inference, serving, or scheduling hot paths. The aggregator is best-effort and returns 0 on every error path so monitoring failures cannot break benchmark uploads. Cluster-wide fields (avg_power_w, joules_per_output_token, joules_per_total_token) keep their existing semantics and old-format unlabeled CSVs produce bit-for-bit identical output.

Other factors

  • Tests cover filename parsing (labeled / multi-digit idx / old format), per-worker grouping with multi-node decode worker aggregation, mixed labeled+unlabeled inputs, disagg E2E with role-split joules, agg E2E without disagg-only fields, and legacy single-CSV bit-for-bit compat.
  • The V1 limitation (multiple workers of the same role colocated on one node collapse to lowest worker_idx) is documented in the PR description and does not affect the target gb300-cw multi-node disagg case.
  • Bug hunter found no issues.

arygupt added 2 commits May 28, 2026 10:39
… recipes

The previous glob `$SRT_RECIPE_DST/*.yaml` only matched top-level YAMLs,
but recipes live under workload subdirectories (e.g. 8k1k/*.yaml). The
loop iterated zero times, no recipe got the monitoring: block, perfmon
never spawned, no perf_samples_*.csv were written, aggregate_power
silently skipped patching the agg JSON, and the dashboard had no power
data.

Sweep #26548110246 burned hours of GB300 time and shipped "success" with
zero power keys in every agg artifact — exactly the silent-failure chain
we should have caught earlier.

Fix: recurse via `find -type f -name '*.yaml'`. Add a loud WARNING when
zero recipes get the injection so future regressions surface immediately
instead of waiting for missing dashboard data to be noticed.
… joules

Layers per-worker breakdown on top of the cluster-wide multinode
aggregation in the parent PR #1574. New agg JSON fields (additive — all
existing keys preserved bit-for-bit for backward compat):

  workers: [{role, worker_idx, num_gpus, avg_power_w}, ...]
    role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx)
    aggregates across all CSVs for that worker — a multi-node TP=16
    decode worker on 4 nodes produces one workers entry with num_gpus=16.

  prefill_avg_power_w, decode_avg_power_w  (disagg only)
    Weighted per-GPU averages within each role.

  joules_per_input_token         = prefill_energy / total_input_tokens
  joules_per_output_token_decode = decode_energy  / total_output_tokens
    Disagg-only role-split metrics. Existing joules_per_output_token and
    joules_per_total_token keep their cluster-wide semantics so the chart
    won't shift on existing data.

Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change
on SemiAnalysisAI/srt-slurm c4c86dc) writes
`perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old
single-CSV format) silently emit empty workers list and skip the role
split — cluster-wide metrics unchanged in that case.

77/77 tests pass (68 existing + 9 new — per-worker grouping, multi-node
worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role
split, agg E2E omitting disagg-only fields, bit-for-bit backward compat
for old-format callers).
@arygupt arygupt force-pushed the feat/measured-power-per-worker branch from f951aef to d04634e Compare May 28, 2026 17:40
@arygupt arygupt force-pushed the feat/measured-power-multinode branch from f5b5c77 to 1af17ab Compare May 28, 2026 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant