feat(power): aggregate measured GPU power into agg result JSON by arygupt · Pull Request #1551 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-22T00:42:58Z

Summary

Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (semianalysisai/InferenceX-app) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.

Two new fields land in the agg JSON:

- avg_power_w: mean per-GPU draw during the load window
- joules_per_output_token: avg_power_w * num_gpus * duration / total_output_tokens
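
Written out, the second field is the total system energy over the load window divided by the number of output tokens. The symbols below are shorthand introduced here for readability, not names taken from the code:

```math
E_{\text{system}} = \bar{P}_{\text{GPU}} \cdot N_{\text{GPU}} \cdot T,
\qquad
\text{joules\_per\_output\_token} = \frac{E_{\text{system}}}{N_{\text{out}}}
```

Here $\bar{P}_{\text{GPU}}$ is avg_power_w in watts, $N_{\text{GPU}}$ is num_gpus, $T$ is duration in seconds, and $N_{\text{out}}$ is total_output_tokens. Since 1 W·s = 1 J, the result is in joules per token.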

How it works

1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window. Without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.
2. utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout: a missing, empty, or malformed CSV is logged to stderr and skipped without ever failing the run.
3. utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV → ./gpu_metrics.csv → /workspace/gpu_metrics.csv, accommodating the scripts in benchmarks/single_node/ that override the default path. Wrapped in try/except so telemetry never blocks the upload.
4. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.
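
The window filtering and two-stage averaging described for utils/aggregate_power.py can be sketched as follows. This is an illustrative sketch, not the code in the PR: the function name, the pre-parsed `(timestamp, gpu_id, power_w)` tuple shape, and the inclusive window bounds are all assumptions made for the example.

```python
from collections import defaultdict


def mean_gpu_power_in_window(samples, start_unix, end_unix):
    """Return mean per-GPU power (W) inside [start_unix, end_unix], or None.

    `samples` is an iterable of (timestamp_unix, gpu_id, power_w) tuples.
    This shape is hypothetical; the real script parses vendor CSV rows.
    """
    per_timestamp = defaultdict(list)
    for ts, _gpu_id, power_w in samples:
        # Keep only samples taken while the benchmark was generating load
        # (inclusive bounds are an assumption of this sketch).
        if start_unix <= ts <= end_unix:
            per_timestamp[ts].append(power_w)

    if not per_timestamp:
        return None  # nothing usable: the caller should skip patching

    # Stage 1: average across GPUs within each timestamp.
    per_sample_means = [sum(v) / len(v) for v in per_timestamp.values()]
    # Stage 2: average those per-timestamp means over time.
    return sum(per_sample_means) / len(per_sample_means)
```

Averaging per timestamp first means a sample that is missing one GPU's row does not skew the weighting of the other timestamps.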

No workflow YAML change, no schema migration anywhere downstream — the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.
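
For the CSV path resolution order mentioned above ($GPU_METRICS_CSV, then ./gpu_metrics.csv, then /workspace/gpu_metrics.csv), a minimal sketch might look like this. The function name and the "first existing file wins" rule are assumptions; the PR only states the order in which the locations are checked.

```python
import os
from pathlib import Path


def resolve_gpu_metrics_csv(env=None, cwd=None):
    """Return the first existing candidate CSV path, or None.

    Hypothetical helper: `env` and `cwd` are injectable so the lookup
    can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)

    candidates = []
    override = env.get("GPU_METRICS_CSV")
    if override:
        candidates.append(Path(override))            # per-script override
    candidates.append(cwd / "gpu_metrics.csv")        # current directory
    candidates.append(Path("/workspace/gpu_metrics.csv"))  # default path

    for path in candidates:
        if path.is_file():
            return path
    return None  # best-effort: the caller skips power aggregation
```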

Verification

End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):

[aggregate_power] avg_power_w=715.69 (per GPU, n=8)
                  joules_per_output_token=8.3869
                  duration=120.0s output_tokens=81920

Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).
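
The arithmetic behind the smoke-test output and the cross-check can be replayed directly from the numbers printed above. The variable names are chosen for this example only.

```python
# Values copied from the smoke-test output above.
avg_power_w = 715.69          # per GPU
num_gpus = 8
duration_s = 120.0
total_output_tokens = 81920

# Energy-based figure: W * GPUs * s = J, then divide by tokens.
energy_j = avg_power_w * num_gpus * duration_s
joules_per_output_token = energy_j / total_output_tokens

# Throughput-based cross-check using the nominal 720 W bench draw.
throughput_tok_s = total_output_tokens / duration_s
nominal_j_per_tok = 720 * num_gpus / throughput_tok_s

print(f"measured: {joules_per_output_token:.3f} J/tok")
print(f"nominal:  {nominal_j_per_tok:.4f} J/tok")
```

The measured figure comes out at about 8.387 J/tok, in line with the reported 8.3869 given that avg_power_w is itself rounded to two decimals. The nominal figure is 8.4375 J/tok when the unrounded throughput (about 682.67 tok/s) is used, which is consistent with the ≈ 8.45 quoted above for 682 tok/s.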

Test plan

26 unit tests covering NVIDIA + AMD CSV formats, multi-GPU per-sample aggregation, window filtering, malformed-row resilience, missing files, atomic JSON patching, divide-by-zero on failed runs
3 subprocess integration tests through process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
All 22 existing test_process_result.py tests still pass (no regressions in the established flow)
First real benchmark run after merge — verify the two new keys appear in the uploaded agg_<run>.json artifact
InferenceX-app ingest picks up the new keys (no METRIC_KEYS warning in ETL logs)

Backfill option

The workflow already uploads gpu_metrics.csv as an artifact for every run (.github/workflows/benchmark-tmpl.yml). After this merges, historical runs that have both a CSV and an agg_<run>.json could be backfilled by re-running this aggregator against the artifact store. Out of scope here.

🤖 Generated with Claude Code

Adds two new fields to agg_<run>.json so the InferenceX-app dashboard can chart measured-energy metrics alongside the existing TDP-derived ones:

- avg_power_w (mean per-GPU draw during the load window)
- joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix alongside the existing duration field so the aggregator knows exactly which slice of the long-running monitor CSV to read (the bracket-the-whole-job monitor includes server warmup and the optional eval phase, which would otherwise bias the average).
2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the vendor schema by header regex (handles nvidia-smi "power.draw [W]" and amd-smi socket_power formats), filters samples to the bench window, and atomically patches the agg JSON. Best-effort: missing / empty / malformed CSV is logged to stderr and skipped without failing the run.
3. process_result.py invokes the aggregator right after writing the agg JSON, so no workflow YAML change is needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown numeric metrics into the metrics JSONB column, so no schema migration or downstream change is required for the data to land in the DB. A follow-up PR on InferenceX-app adds the two Y-axis options to the inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering, multi-GPU per-sample aggregation, malformed-row resilience, missing files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
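
The "atomically patches the agg JSON" step in the commit message above is commonly done by writing to a temporary file in the same directory and renaming it over the original. The sketch below shows that general pattern with only the standard library. It is not the PR's code: the function name and the log message wording are hypothetical, and it assumes the agg JSON is a top-level object.

```python
import json
import os
import sys
import tempfile


def patch_json_atomically(path, new_fields):
    """Merge `new_fields` into the JSON object at `path`; never raise on I/O errors.

    Returns True if the file was patched, False if the patch was skipped.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        data.update(new_fields)

        # The temp file must be on the same filesystem for os.replace
        # to be an atomic rename, so create it next to the target.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                json.dump(data, tmp, indent=2)
            os.replace(tmp_path, path)  # readers see old or new, never partial
        except BaseException:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)  # do not leave a stray temp file behind
            raise
        return True
    except (OSError, ValueError) as exc:
        # Best-effort: log and move on (json.JSONDecodeError is a ValueError).
        print(f"[aggregate_power] skipping patch: {exc}", file=sys.stderr)
        return False
```

Returning a boolean instead of raising keeps the caller's upload path unaffected when telemetry is missing or malformed.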

Three new tests in TestPowerAggregationIntegration:

- test_agg_json_gets_patched_with_power_and_joules: full pipeline. Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs process_result.py as a subprocess with GPU_METRICS_CSV set, and verifies the agg JSON gets patched with avg_power_w (600W) and joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok). Warmup (100W) and eval (200W) samples must be excluded by the timestamp window; they would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for runs that ship without monitoring. process_result.py succeeds and writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON without benchmark_start_time_unix gracefully skips aggregation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR SemiAnalysisAI#1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.

arygupt · 2026-05-22T20:56:32Z

Closed this in favor of #1558


* feat(power): aggregate measured GPU power into agg result JSON

Adds two new fields to agg_<run>.json so the InferenceX-app dashboard can chart measured-energy metrics alongside the existing TDP-derived ones:

- avg_power_w (mean per-GPU draw during the load window)
- joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens)

How it works:

1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix alongside the existing duration field so the aggregator knows exactly which slice of the long-running monitor CSV to read (the bracket-the-whole-job monitor includes server warmup and the optional eval phase, which would otherwise bias the average).
2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the vendor schema by header regex (handles nvidia-smi "power.draw [W]" and amd-smi socket_power formats), filters samples to the bench window, and atomically patches the agg JSON. Best-effort: missing / empty / malformed CSV is logged to stderr and skipped without failing the run.
3. process_result.py invokes the aggregator right after writing the agg JSON, so no workflow YAML change is needed.

The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown numeric metrics into the metrics JSONB column, so no schema migration or downstream change is required for the data to land in the DB. A follow-up PR on InferenceX-app adds the two Y-axis options to the inference scatter chart.

26 unit tests covering NVIDIA + AMD CSV shapes, window filtering, multi-GPU per-sample aggregation, malformed-row resilience, missing files, division-by-zero guards, and atomic JSON patching.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(power): subprocess integration covering process_result + aggregator

Three new tests in TestPowerAggregationIntegration:

- test_agg_json_gets_patched_with_power_and_joules: full pipeline. Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs process_result.py as a subprocess with GPU_METRICS_CSV set, and verifies the agg JSON gets patched with avg_power_w (600W) and joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok). Warmup (100W) and eval (200W) samples must be excluded by the timestamp window; they would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for runs that ship without monitoring. process_result.py succeeds and writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON without benchmark_start_time_unix gracefully skips aggregation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(perf-changelog): trigger sweep for measured-power aggregation

Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.

* fix(aggregate_power): infer num_gpus from row count when GPU column absent

Addresses review feedback on PR #1558. The original _detect_columns used a strict-anchored regex ^(index|gpu|gpu_id|gpu_index|card|device)$ for the GPU-index column while the power column uses a permissive r"power" match. That asymmetry meant a future SMI schema variant with header device_id, gpu_serial, or "GPU ID" would silently collapse every row to gpu_id="0", yielding system-total power instead of per-GPU mean.

In-tree pipelines (nvidia-smi index, amd-smi gpu) match the regex and are unaffected; this is a latent bug, not a current-production one. But avg_power_w is the standalone headline new metric this PR adds; a future 4000W reading on the dashboard is the worst kind of regression (still plausible-looking, off by 8x).

Fix:

- Maintain per_sample_row_count independently of GPU-column detection.
- When gpu_col is present, divisor stays len(per_sample_gpus[ts]): same as before, behavior unchanged for the existing pipeline.
- When gpu_col is absent, divisor is per_sample_row_count[ts] and num_gpus is the modal row count per sample. Both assume one row per GPU per sample, which is what every SMI tool we've encountered emits.

New test (test_aggregate_power_no_gpu_column_infers_from_row_count) exercises a CSV header with device_id (regex miss) and asserts avg_power is the per-GPU mean (500W), not the system sum (2000W).

Verified:

- 27 unit tests pass (was 26)
- 25 process_result tests pass (no change)

* docs(perf-changelog): clarify why measured-power entry is kept past merge

* chore(perf-changelog): also validate aggregator on AMD MI355X

Adds dsr1-fp8-mi355x-sglang to the validation entry so the sweep also runs on AMD hardware. Same framework (sglang) as the H200 config so the only variable is GPU vendor: NVIDIA nvidia-smi CSV (power.draw [W] column) vs AMD amd-smi CSV (socket_power column). If the aggregator handles both, no further work needed. If AMD fails, the workflow log shows exactly which step (column detection, timestamp parsing, etc.) and the fix is local to aggregate_power.py.

* chore: re-trigger workflow synchronize for AMD sweep

* fix(launch-mi355x): add --container-remap-root so amd-smi can read power

amd-smi metric -p inside an enroot container requires the user to be in the render or video group. The current srun setup runs as the regular user (uid 300070) with no supplementary groups, so the command throws RuntimeError("User missing render/video groups"). Combined with the 2>/dev/null in benchmark_lib.sh start_gpu_monitor, this manifests as a silent 0-byte gpu_metrics.csv → aggregate_power.py skips with "no usable power samples" → AMD agg JSONs have no avg_power_w.

Verified on real MI355X (mia1-p01-g10) via direct srun + enroot:

- Without --container-remap-root: 0 bytes
- With --container-remap-root: 45056 bytes / 6s, populated socket_power

Equivalent to what the docker-based launchers (mi300x/mi325x) already get via --group-add video on the docker run command.

Side effect: the benchmark inside the container now runs as root. The benchmark scripts already work fine in root contexts (the docker-based launchers run via sudo docker), and files written through the mounted /workspace land back on the host with root ownership, which is fine for the downstream process_result + artifact-upload steps.

* feat(power): emit joules_per_total_token alongside joules_per_output_token

Adds a third measured-power field to agg_<run>.json:

joules_per_total_token = system_energy / (input_tokens + output_tokens)

Where the existing joules_per_output_token divides only by output tokens (treating input as free), the new field divides by all tokens the system actually processed. For prompt-heavy workloads (8K input, 1K output) J/total-token is ~9x smaller than J/output-token, and arguably more honest about real workload cost.

Mechanically: extend _load_bench_window to also return total_input_tokens (default 0 if absent, which makes J/total-token degrade to J/output-token for older bench JSONs). Compute total_system_energy_j once and divide by both denominators. Patch both fields atomically into the agg JSON.

Tests: +1 covering the (input + output) denominator with an 8k1k prompt-heavy workload, asserting the ~9x ratio sanity check.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
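
The divisor rule from the fix(aggregate_power) commit above can be illustrated with a small sketch. This is not the code from the PR: the function name and the pre-parsed `(timestamp, gpu_id, power_w)` row shape are hypothetical. Using the modal distinct-GPU count for num_gpus when the GPU column is present is also an assumption, since the commit message only specifies the modal row count for the column-absent case.

```python
from collections import Counter, defaultdict


def per_gpu_mean_power(rows, has_gpu_column):
    """Return (avg_power_w, num_gpus) from (timestamp, gpu_id, power_w) rows.

    When no GPU column was detected, gpu_id is ignored and the number of
    rows per timestamp is used as the divisor (one row per GPU per sample).
    """
    power_sum = defaultdict(float)
    gpus_seen = defaultdict(set)
    row_count = Counter()

    for ts, gpu_id, power_w in rows:
        power_sum[ts] += power_w
        row_count[ts] += 1          # tracked regardless of column detection
        if has_gpu_column:
            gpus_seen[ts].add(gpu_id)

    if not power_sum:
        return None, 0

    if has_gpu_column:
        divisors = {ts: len(gpus_seen[ts]) for ts in power_sum}
    else:
        divisors = dict(row_count)  # fallback: rows per sample

    per_sample = [power_sum[ts] / divisors[ts] for ts in power_sum]
    avg_power_w = sum(per_sample) / len(per_sample)
    # Modal divisor across samples stands in for the GPU count.
    num_gpus = Counter(divisors.values()).most_common(1)[0][0]
    return avg_power_w, num_gpus
```

With four rows of 500W per timestamp and no recognized GPU column, this returns a 500W per-GPU mean and num_gpus of 4, rather than the 2000W system sum that the commit message describes as the failure mode.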

arygupt and others added 2 commits May 21, 2026 16:40

arygupt requested a review from a team May 22, 2026 00:42

github-project-automation Bot added this to InferenceMAX Board May 22, 2026

claude Bot reviewed May 22, 2026

arygupt mentioned this pull request May 22, 2026

feat(inference): measured-power Y-axis metrics on scatter chart SemiAnalysisAI/InferenceX-app#375

Merged

9 tasks

arygupt added sweep-enabled full-sweep-enabled labels May 22, 2026

arygupt mentioned this pull request May 22, 2026

feat(power): aggregate measured GPU power into agg result JSON #1558

Merged

5 tasks

arygupt closed this May 22, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board May 22, 2026

arygupt wants to merge 3 commits into SemiAnalysisAI:main from arygupt:chore/measured-power-aggregation
