Skip to content

feat(power): aggregate measured GPU power into agg result JSON#1551

Closed
arygupt wants to merge 3 commits into
SemiAnalysisAI:mainfrom
arygupt:chore/measured-power-aggregation
Closed

feat(power): aggregate measured GPU power into agg result JSON#1551
arygupt wants to merge 3 commits into
SemiAnalysisAI:mainfrom
arygupt:chore/measured-power-aggregation

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 22, 2026

Summary

Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (semianalysisai/InferenceX-app) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.

Two new fields land in the agg JSON:

  • avg_power_w — mean per-GPU draw during the load window
  • joules_per_output_tokenavg_power_w * num_gpus * duration / total_output_tokens

How it works

  1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window — without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.

  2. utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout — missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.

  3. utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV./gpu_metrics.csv/workspace/gpu_metrics.csv, accommodating the scripts in benchmarks/single_node/ that override the default path. Wrapped in try/except so telemetry never blocks the upload.

  4. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.

No workflow YAML change, no schema migration anywhere downstream — the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.

Verification

End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):

[aggregate_power] avg_power_w=715.69 (per GPU, n=8)
                  joules_per_output_token=8.3869
                  duration=120.0s output_tokens=81920

Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).

Test plan

  • 26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs
  • 3 subprocess integration tests through process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
  • All 22 existing test_process_result.py tests still pass (no regressions in the established flow)
  • First real benchmark run after merge — verify the two new keys appear in the uploaded agg_<run>.json artifact
  • InferenceX-app ingest picks up the new keys (no METRIC_KEYS warning in ETL logs)

Backfill option

The workflow already uploads gpu_metrics.csv as an artifact for every run (.github/workflows/benchmark-tmpl.yml). After this merges, historical runs that have both a CSV and an agg_<run>.json could be backfilled by re-running this aggregator against the artifact store. Out of scope here.

🤖 Generated with Claude Code

arygupt and others added 2 commits May 21, 2026 16:40
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard
can chart measured-energy metrics alongside the existing TDP-derived
ones:

  - avg_power_w               (mean per-GPU draw during the load window)
  - joules_per_output_token   (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and
   benchmark_end_time_unix alongside the existing duration field so the
   aggregator knows exactly which slice of the long-running monitor CSV
   to read (the bracket-the-whole-job monitor includes server warmup
   and the optional eval phase, which would otherwise bias the average).

2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable
   via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the
   vendor schema by header regex (handles nvidia-smi "power.draw [W]"
   and amd-smi socket_power formats), filters samples to the bench
   window, and atomically patches the agg JSON. Best-effort: missing /
   empty / malformed CSV is logged to stderr and skipped without
   failing the run.

3. process_result.py invokes the aggregator right after writing the
   agg JSON — no workflow YAML change needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown
numeric metrics into the metrics JSONB column, so no schema migration
or downstream change is required for the data to land in the DB. A
follow-up PR on InferenceX-app adds the two Y-axis options to the
inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering,
multi-GPU per-sample aggregation, malformed-row resilience, missing
files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three new tests in TestPowerAggregationIntegration:

  - test_agg_json_gets_patched_with_power_and_joules: full pipeline.
    Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
    process_result.py as a subprocess with GPU_METRICS_CSV set, and
    verifies the agg JSON gets patched with avg_power_w (600W) and
    joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
    Warmup (100W) and eval (200W) samples must be excluded by the
    timestamp window — would otherwise bias the result downward.

  - test_missing_csv_does_not_break_process_result: production case for
    runs that ship without monitoring. process_result.py succeeds and
    writes the agg JSON sans power fields.

  - test_missing_bench_timestamps_does_not_patch: legacy bench JSON
    without benchmark_start_time_unix gracefully skips aggregation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team May 22, 2026 00:42
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR SemiAnalysisAI#1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.
@arygupt
Copy link
Copy Markdown
Collaborator Author

arygupt commented May 22, 2026

Closed this in favor of #1558

@arygupt arygupt closed this May 22, 2026
arygupt added a commit that referenced this pull request May 22, 2026
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR #1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.
Klaud-Cold pushed a commit that referenced this pull request May 26, 2026
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR #1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.
functionstackx pushed a commit that referenced this pull request May 27, 2026
* feat(power): aggregate measured GPU power into agg result JSON

Adds two new fields to agg_<run>.json so the InferenceX-app dashboard
can chart measured-energy metrics alongside the existing TDP-derived
ones:

  - avg_power_w               (mean per-GPU draw during the load window)
  - joules_per_output_token   (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and
   benchmark_end_time_unix alongside the existing duration field so the
   aggregator knows exactly which slice of the long-running monitor CSV
   to read (the bracket-the-whole-job monitor includes server warmup
   and the optional eval phase, which would otherwise bias the average).

2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable
   via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the
   vendor schema by header regex (handles nvidia-smi "power.draw [W]"
   and amd-smi socket_power formats), filters samples to the bench
   window, and atomically patches the agg JSON. Best-effort: missing /
   empty / malformed CSV is logged to stderr and skipped without
   failing the run.

3. process_result.py invokes the aggregator right after writing the
   agg JSON — no workflow YAML change needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown
numeric metrics into the metrics JSONB column, so no schema migration
or downstream change is required for the data to land in the DB. A
follow-up PR on InferenceX-app adds the two Y-axis options to the
inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering,
multi-GPU per-sample aggregation, malformed-row resilience, missing
files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(power): subprocess integration covering process_result + aggregator

Three new tests in TestPowerAggregationIntegration:

  - test_agg_json_gets_patched_with_power_and_joules: full pipeline.
    Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
    process_result.py as a subprocess with GPU_METRICS_CSV set, and
    verifies the agg JSON gets patched with avg_power_w (600W) and
    joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
    Warmup (100W) and eval (200W) samples must be excluded by the
    timestamp window — would otherwise bias the result downward.

  - test_missing_csv_does_not_break_process_result: production case for
    runs that ship without monitoring. process_result.py succeeds and
    writes the agg JSON sans power fields.

  - test_missing_bench_timestamps_does_not_patch: legacy bench JSON
    without benchmark_start_time_unix gracefully skips aggregation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(perf-changelog): trigger sweep for measured-power aggregation

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR #1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.

Cheap single-node H200 config picked to minimize runner-pool contention.

* fix(aggregate_power): infer num_gpus from row count when GPU column absent

Addresses review feedback on PR #1558. The original _detect_columns used a
strict-anchored regex ^(index|gpu|gpu_id|gpu_index|card|device)$ for the
GPU-index column while the power column uses a permissive r"power" match.
That asymmetry meant a future SMI schema variant with header device_id,
gpu_serial, or "GPU ID" would silently collapse every row to gpu_id="0",
yielding system-total power instead of per-GPU mean.

In-tree pipelines (nvidia-smi index, amd-smi gpu) match the regex and are
unaffected — this is a latent bug, not a current-production one. But
avg_power_w is the standalone headline new metric this PR adds; a future
4000W reading on the dashboard is the worst kind of regression (still
plausible-looking, off by 8x).

Fix:
- Maintain per_sample_row_count independently of GPU-column detection.
- When gpu_col is present, divisor stays len(per_sample_gpus[ts]) — same
  as before, behavior unchanged for the existing pipeline.
- When gpu_col is absent, divisor is per_sample_row_count[ts] and
  num_gpus is the modal row count per sample. Both assume one row per
  GPU per sample, which is what every SMI tool we've encountered emits.

New test (test_aggregate_power_no_gpu_column_infers_from_row_count)
exercises a CSV header with device_id (regex miss) and asserts avg_power
is the per-GPU mean (500W), not the system sum (2000W).

Verified:
- 27 unit tests pass (was 26)
- 25 process_result tests pass (no change)

* docs(perf-changelog): clarify why measured-power entry is kept past merge

* chore(perf-changelog): also validate aggregator on AMD MI355X

Adds dsr1-fp8-mi355x-sglang to the validation entry so the sweep
also runs on AMD hardware. Same framework (sglang) as the H200
config so the only variable is GPU vendor: NVIDIA nvidia-smi CSV
(power.draw [W] column) vs AMD amd-smi CSV (socket_power column).

If the aggregator handles both, no further work needed. If AMD
fails, the workflow log shows exactly which step (column detection,
timestamp parsing, etc.) and the fix is local to aggregate_power.py.

* chore: re-trigger workflow synchronize for AMD sweep

* fix(launch-mi355x): add --container-remap-root so amd-smi can read power

amd-smi metric -p inside an enroot container requires the user to be in
the render or video group. The current srun setup runs as the regular
user (uid 300070) with no supplementary groups, so the command throws
RuntimeError("User missing render/video groups"). Combined with the
2>/dev/null in benchmark_lib.sh start_gpu_monitor, this manifests as a
silent 0-byte gpu_metrics.csv → aggregate_power.py skips with "no
usable power samples" → AMD agg JSONs have no avg_power_w.

Verified on real MI355X (mia1-p01-g10) via direct srun + enroot:
  Without --container-remap-root: 0 bytes
  With --container-remap-root:    45056 bytes / 6s, populated socket_power

Equivalent to what the docker-based launchers (mi300x/mi325x) already
get via --group-add video on the docker run command.

Side effect: the benchmark inside the container now runs as root. The
benchmark scripts already work fine in root contexts (the docker-based
launchers run via sudo docker), and files written through the mounted
/workspace land back on the host with root ownership which is fine for
the downstream process_result + artifact-upload steps.

* feat(power): emit joules_per_total_token alongside joules_per_output_token

Adds a third measured-power field to agg_<run>.json:

  joules_per_total_token = system_energy / (input_tokens + output_tokens)

Where the existing joules_per_output_token divides only by output tokens
(treating input as free), the new field divides by all tokens the system
actually processed. For prompt-heavy workloads (8K input, 1K output)
J/total-token is ~9x smaller than J/output-token — and arguably more
honest about real workload cost.

Mechanically: extend _load_bench_window to also return total_input_tokens
(default 0 if absent, which makes J/total-token degrade to J/output-token
for older bench JSONs). Compute total_system_energy_j once and divide by
both denominators. Patch both fields atomically into the agg JSON.

Tests: +1 covering the (input + output) denominator with an 8k1k
prompt-heavy workload, asserting the ~9x ratio sanity check.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

1 participant