feat(power): aggregate measured GPU power into agg result JSON#1551
Closed
arygupt wants to merge 3 commits into
Conversation
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard can chart measured-energy metrics alongside the existing TDP-derived ones:
- avg_power_w (mean per-GPU draw during the load window)
- joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens)
How it works:
1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix alongside the existing duration field so the aggregator knows exactly which slice of the long-running monitor CSV to read (the bracket-the-whole-job monitor includes server warmup and the optional eval phase, which would otherwise bias the average).
2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the vendor schema by header regex (handles nvidia-smi "power.draw [W]" and amd-smi socket_power formats), filters samples to the bench window, and atomically patches the agg JSON. Best-effort: missing / empty / malformed CSV is logged to stderr and skipped without failing the run.
3. process_result.py invokes the aggregator right after writing the agg JSON, so no workflow YAML change is needed.
The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown numeric metrics into the metrics JSONB column, so no schema migration or downstream change is required for the data to land in the DB. A follow-up PR on InferenceX-app adds the two Y-axis options to the inference scatter chart.
26 unit tests covering NVIDIA + AMD CSV shapes, window filtering, multi-GPU per-sample aggregation, malformed-row resilience, missing files, division-by-zero guards, and atomic JSON patching.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
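The atomic, best-effort patch step described above could be sketched roughly as follows. This is a minimal illustration, not the code in aggregate_power.py: the function name patch_agg_json, the temp-file suffix, and the exact exceptions caught are assumptions. It writes to a temporary file in the same directory and renames it over the original with os.replace, so a reader never sees a half-written agg JSON.

```python
import json
import os
import sys
import tempfile


def patch_agg_json(agg_path, new_fields):
    """Merge new_fields into the JSON object at agg_path (hypothetical helper).

    Returns True on success. On any I/O or parse problem it logs to stderr
    and returns False instead of raising, mirroring the best-effort intent.
    """
    try:
        with open(agg_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, dict):
            raise ValueError("agg JSON is not an object")
        data.update(new_fields)

        # The temp file lives next to the target so the rename stays on one filesystem.
        target_dir = os.path.dirname(os.path.abspath(agg_path))
        fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                json.dump(data, tmp, indent=2)
            os.replace(tmp_path, agg_path)  # atomic swap over the original file
        except Exception:
            os.unlink(tmp_path)  # do not leave a stray temp file behind
            raise
        return True
    except (OSError, ValueError, TypeError) as exc:
        print(f"power aggregation skipped: {exc}", file=sys.stderr)
        return False
```

A caller would pass only the new keys, for example patch_agg_json("agg_run.json", {"avg_power_w": 600.0}), and treat a False return as "no power data for this run".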
Three new tests in TestPowerAggregationIntegration:
- test_agg_json_gets_patched_with_power_and_joules: full pipeline.
Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
process_result.py as a subprocess with GPU_METRICS_CSV set, and
verifies the agg JSON gets patched with avg_power_w (600W) and
joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
Warmup (100W) and eval (200W) samples must be excluded by the
timestamp window — would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for
runs that ship without monitoring. process_result.py succeeds and
writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON
without benchmark_start_time_unix gracefully skips aggregation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
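The numbers in the first integration test can be reproduced with a small in-memory sketch. The helper names, the tuple layout of the samples, the inclusive window bounds, and the 30 s warmup and eval lengths are assumptions made for illustration; only the 8 GPUs, the 60 s bench phase, the 100 W / 600 W / 200 W levels, and the 30k output tokens come from the commit message.

```python
from collections import defaultdict


def windowed_avg_power(samples, start, end):
    """samples: iterable of (unix_ts, gpu_id, watts). Returns (avg_power_w, num_gpus) or None."""
    per_ts = defaultdict(dict)
    for ts, gpu, watts in samples:
        if start <= ts <= end:  # keep only samples inside the bench window
            per_ts[ts][gpu] = watts
    if not per_ts:
        return None
    # Mean across GPUs at each timestamp, then mean across timestamps.
    per_ts_means = [sum(g.values()) / len(g) for g in per_ts.values()]
    avg_power_w = sum(per_ts_means) / len(per_ts_means)
    num_gpus = max(len(g) for g in per_ts.values())
    return avg_power_w, num_gpus


def joules_per_output_token(avg_power_w, num_gpus, duration_s, total_output_tokens):
    if total_output_tokens <= 0:  # division-by-zero guard
        return None
    return avg_power_w * num_gpus * duration_s / total_output_tokens


def make_samples():
    # 1 Hz, 8 GPUs: 30 s warmup at 100 W, 60 s bench at 600 W, 30 s eval at 200 W.
    samples = []
    for t in range(120):
        watts = 100.0 if t < 30 else 600.0 if t < 90 else 200.0
        for gpu in range(8):
            samples.append((1000 + t, gpu, watts))
    return samples


samples = make_samples()
avg, n = windowed_avg_power(samples, start=1030, end=1089)
print(avg, n)                                        # 600.0 8
print(joules_per_output_token(avg, n, 60, 30_000))   # 9.6
naive, _ = windowed_avg_power(samples, start=1000, end=1119)
print(naive)                                         # 375.0
```

With these assumed phase lengths, averaging over the whole CSV would report 375 W instead of 600 W, which is the downward bias the timestamp window is meant to remove.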
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR SemiAnalysisAI#1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.
Collaborator
Author
Closed this in favor of #1558.
arygupt
added a commit
that referenced
this pull request
May 22, 2026
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.
Klaud-Cold
pushed a commit
that referenced
this pull request
May 26, 2026
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1551. The sweep will produce the first agg_<run>.json containing avg_power_w and joules_per_output_token, validating the aggregator end-to-end on real GPU hardware. Cheap single-node H200 config picked to minimize runner-pool contention.
functionstackx
pushed a commit
that referenced
this pull request
May 27, 2026
* feat(power): aggregate measured GPU power into agg result JSON
Adds two new fields to agg_<run>.json so the InferenceX-app dashboard
can chart measured-energy metrics alongside the existing TDP-derived
ones:
- avg_power_w (mean per-GPU draw during the load window)
- joules_per_output_token (avg_power_w * num_gpus * duration / total_output_tokens)
How it works:
1. benchmark_serving.py now records benchmark_start_time_unix and
benchmark_end_time_unix alongside the existing duration field so the
aggregator knows exactly which slice of the long-running monitor CSV
to read (the bracket-the-whole-job monitor includes server warmup
and the optional eval phase, which would otherwise bias the average).
2. aggregate_power.py reads /workspace/gpu_metrics.csv (path overridable
via GPU_METRICS_CSV, which benchmark_lib.sh now exports), detects the
vendor schema by header regex (handles nvidia-smi "power.draw [W]"
and amd-smi socket_power formats), filters samples to the bench
window, and atomically patches the agg JSON. Best-effort: missing /
empty / malformed CSV is logged to stderr and skipped without
failing the run.
3. process_result.py invokes the aggregator right after writing the
agg JSON — no workflow YAML change needed.
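The PR summary gives the CSV lookup order used by process_result.py as $GPU_METRICS_CSV → ./gpu_metrics.csv → /workspace/gpu_metrics.csv. A minimal sketch of that fallback chain is shown below; the function name resolve_gpu_metrics_csv is hypothetical, and the choices to skip empty files and to fall through when the environment variable points at a missing file are assumptions rather than confirmed behavior.

```python
import os


def resolve_gpu_metrics_csv(env=None, cwd="."):
    """Return the first usable monitor CSV path, or None if none is found."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("GPU_METRICS_CSV"):
        candidates.append(env["GPU_METRICS_CSV"])             # explicit override first
    candidates.append(os.path.join(cwd, "gpu_metrics.csv"))   # then the working directory
    candidates.append("/workspace/gpu_metrics.csv")           # then the default location
    for path in candidates:
        # Assumption: a zero-byte CSV is treated the same as a missing one.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return path
    return None
```

Returning None rather than raising keeps the caller free to skip power aggregation and still write the agg JSON.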
The InferenceX-app ETL (benchmark-mapper.ts) auto-captures unknown
numeric metrics into the metrics JSONB column, so no schema migration
or downstream change is required for the data to land in the DB. A
follow-up PR on InferenceX-app adds the two Y-axis options to the
inference scatter chart.
26 unit tests covering NVIDIA + AMD CSV shapes, window filtering,
multi-GPU per-sample aggregation, malformed-row resilience, missing
files, division-by-zero guards, and atomic JSON patching.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* test(power): subprocess integration covering process_result + aggregator
Three new tests in TestPowerAggregationIntegration:
- test_agg_json_gets_patched_with_power_and_joules: full pipeline.
Stages a 1Hz nvidia-smi CSV with warmup/bench/eval phases, runs
process_result.py as a subprocess with GPU_METRICS_CSV set, and
verifies the agg JSON gets patched with avg_power_w (600W) and
joules_per_output_token (9.6 J/tok = 600W * 8 GPUs * 60s / 30k tok).
Warmup (100W) and eval (200W) samples must be excluded by the
timestamp window — would otherwise bias the result downward.
- test_missing_csv_does_not_break_process_result: production case for
runs that ship without monitoring. process_result.py succeeds and
writes the agg JSON sans power fields.
- test_missing_bench_timestamps_does_not_patch: legacy bench JSON
without benchmark_start_time_unix gracefully skips aggregation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* chore(perf-changelog): trigger sweep for measured-power aggregation
Appends an entry listing qwen3.5-fp8-h200-sglang so run-sweep.yml fires
when the sweep-enabled label is added to PR #1551. The sweep will produce
the first agg_<run>.json containing avg_power_w and joules_per_output_token,
validating the aggregator end-to-end on real GPU hardware.
Cheap single-node H200 config picked to minimize runner-pool contention.
* fix(aggregate_power): infer num_gpus from row count when GPU column absent
Addresses review feedback on PR #1558. The original _detect_columns used a
strict-anchored regex ^(index|gpu|gpu_id|gpu_index|card|device)$ for the
GPU-index column while the power column uses a permissive r"power" match.
That asymmetry meant a future SMI schema variant with header device_id,
gpu_serial, or "GPU ID" would silently collapse every row to gpu_id="0",
yielding system-total power instead of per-GPU mean.
In-tree pipelines (nvidia-smi index, amd-smi gpu) match the regex and are
unaffected — this is a latent bug, not a current-production one. But
avg_power_w is the standalone headline new metric this PR adds; a future
4000W reading on the dashboard is the worst kind of regression (still
plausible-looking, off by 8x).
Fix:
- Maintain per_sample_row_count independently of GPU-column detection.
- When gpu_col is present, divisor stays len(per_sample_gpus[ts]) — same
as before, behavior unchanged for the existing pipeline.
- When gpu_col is absent, divisor is per_sample_row_count[ts] and
num_gpus is the modal row count per sample. Both assume one row per
GPU per sample, which is what every SMI tool we've encountered emits.
New test (test_aggregate_power_no_gpu_column_infers_from_row_count)
exercises a CSV header with device_id (regex miss) and asserts avg_power
is the per-GPU mean (500W), not the system sum (2000W).
Verified:
- 27 unit tests pass (was 26)
- 25 process_result tests pass (no change)
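The row-count fallback can be illustrated with a short sketch. It reuses the GPU-column regex quoted in the commit message and a permissive "power" match, but the function name, the "timestamp" column name, lowercasing headers before matching, and the parsing of values such as "500.00 W" are assumptions, not the actual _detect_columns implementation.

```python
import csv
import io
import re
from collections import Counter, defaultdict

GPU_COL_RE = re.compile(r"^(index|gpu|gpu_id|gpu_index|card|device)$")
POWER_COL_RE = re.compile(r"power")


def avg_power_per_gpu(csv_text, ts_col="timestamp"):
    """Return (per-GPU mean watts, num_gpus), or None if nothing usable is found."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    if not reader.fieldnames:
        return None
    reader.fieldnames = [h.strip() for h in reader.fieldnames]
    gpu_col = next((h for h in reader.fieldnames if GPU_COL_RE.match(h.lower())), None)
    power_col = next((h for h in reader.fieldnames if POWER_COL_RE.search(h.lower())), None)
    if power_col is None or ts_col not in reader.fieldnames:
        return None

    per_sample_sum = defaultdict(float)
    per_sample_gpus = defaultdict(set)
    per_sample_row_count = Counter()  # tracked whether or not a GPU column exists
    for row in reader:
        try:
            watts = float(str(row[power_col]).replace("W", "").strip())
        except (TypeError, ValueError):
            continue  # skip malformed rows
        ts = row[ts_col]
        per_sample_sum[ts] += watts
        per_sample_row_count[ts] += 1
        if gpu_col is not None:
            per_sample_gpus[ts].add(row[gpu_col])
    if not per_sample_sum:
        return None

    def divisor(ts):
        # GPU column present: distinct GPU ids. Absent: rows per sample.
        return len(per_sample_gpus[ts]) if gpu_col is not None else per_sample_row_count[ts]

    means = [per_sample_sum[ts] / divisor(ts) for ts in per_sample_sum]
    if gpu_col is not None:
        num_gpus = max(len(g) for g in per_sample_gpus.values())
    else:
        # Modal row count per sample, assuming one row per GPU per sample.
        num_gpus = Counter(per_sample_row_count.values()).most_common(1)[0][0]
    return sum(means) / len(means), num_gpus
```

With a device_id header (a regex miss) and four 500 W rows per timestamp, this returns a 500 W per-GPU mean and num_gpus of 4 rather than a 2000 W system sum.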
* docs(perf-changelog): clarify why measured-power entry is kept past merge
* chore(perf-changelog): also validate aggregator on AMD MI355X
Adds dsr1-fp8-mi355x-sglang to the validation entry so the sweep
also runs on AMD hardware. Same framework (sglang) as the H200
config so the only variable is GPU vendor: NVIDIA nvidia-smi CSV
(power.draw [W] column) vs AMD amd-smi CSV (socket_power column).
If the aggregator handles both, no further work needed. If AMD
fails, the workflow log shows exactly which step (column detection,
timestamp parsing, etc.) and the fix is local to aggregate_power.py.
* chore: re-trigger workflow synchronize for AMD sweep
* fix(launch-mi355x): add --container-remap-root so amd-smi can read power
amd-smi metric -p inside an enroot container requires the user to be in
the render or video group. The current srun setup runs as the regular
user (uid 300070) with no supplementary groups, so the command throws
RuntimeError("User missing render/video groups"). Combined with the
2>/dev/null in benchmark_lib.sh start_gpu_monitor, this manifests as a
silent 0-byte gpu_metrics.csv → aggregate_power.py skips with "no
usable power samples" → AMD agg JSONs have no avg_power_w.
Verified on real MI355X (mia1-p01-g10) via direct srun + enroot:
Without --container-remap-root: 0 bytes
With --container-remap-root: 45056 bytes / 6s, populated socket_power
Equivalent to what the docker-based launchers (mi300x/mi325x) already
get via --group-add video on the docker run command.
Side effect: the benchmark inside the container now runs as root. The
benchmark scripts already work fine in root contexts (the docker-based
launchers run via sudo docker), and files written through the mounted
/workspace land back on the host with root ownership which is fine for
the downstream process_result + artifact-upload steps.
* feat(power): emit joules_per_total_token alongside joules_per_output_token
Adds a third measured-power field to agg_<run>.json:
joules_per_total_token = system_energy / (input_tokens + output_tokens)
Where the existing joules_per_output_token divides only by output tokens
(treating input as free), the new field divides by all tokens the system
actually processed. For prompt-heavy workloads (8K input, 1K output)
J/total-token is ~9x smaller than J/output-token — and arguably more
honest about real workload cost.
Mechanically: extend _load_bench_window to also return total_input_tokens
(default 0 if absent, which makes J/total-token degrade to J/output-token
for older bench JSONs). Compute total_system_energy_j once and divide by
both denominators. Patch both fields atomically into the agg JSON.
Tests: +1 covering the (input + output) denominator with an 8k1k
prompt-heavy workload, asserting the ~9x ratio sanity check.
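The relationship between the two per-token fields can be sketched as below. The function name and the specific workload (100 requests of 8192 input and 1024 output tokens on 8 GPUs at 600 W for 60 s) are illustrative assumptions; the formulas and the default of 0 input tokens follow the commit message.

```python
def energy_per_token_fields(avg_power_w, num_gpus, duration_s,
                            total_output_tokens, total_input_tokens=0):
    """Compute both measured-energy fields from one system-energy value."""
    total_system_energy_j = avg_power_w * num_gpus * duration_s
    fields = {}
    if total_output_tokens > 0:
        fields["joules_per_output_token"] = total_system_energy_j / total_output_tokens
    total_tokens = total_input_tokens + total_output_tokens
    if total_tokens > 0:
        fields["joules_per_total_token"] = total_system_energy_j / total_tokens
    return fields


# Hypothetical 8k1k workload: 100 requests, 8192 input and 1024 output tokens each.
fields = energy_per_token_fields(600.0, 8, 60.0,
                                 total_output_tokens=100 * 1024,
                                 total_input_tokens=100 * 8192)
print(fields["joules_per_output_token"])  # 2.8125
print(fields["joules_per_total_token"])   # 0.3125
print(fields["joules_per_output_token"] / fields["joules_per_total_token"])  # 9.0

# Older bench JSONs without input tokens: both fields coincide.
legacy = energy_per_token_fields(600.0, 8, 60.0, total_output_tokens=30_000)
print(legacy["joules_per_total_token"] == legacy["joules_per_output_token"])  # True
```

The 9x ratio is simply (8192 + 1024) / 1024, so it depends only on the input-to-output token mix, not on power or duration.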
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds measured per-GPU power and joules-per-output-token to every benchmark's agg_<run>.json, sourced from the existing gpu_metrics.csv that start_gpu_monitor already produces. The InferenceX-app dashboard consumes these via a companion PR (semianalysisai/InferenceX-app) to render new chart options alongside the existing TDP-derived jTotal/jOutput/jInput.
Two new fields land in the agg JSON:
- avg_power_w: mean per-GPU draw during the load window
- joules_per_output_token: avg_power_w * num_gpus * duration / total_output_tokens
How it works
1. benchmark_serving.py now records benchmark_start_time_unix and benchmark_end_time_unix (wall-clock epoch) alongside the existing duration. The aggregator needs these to know which slice of the long-running monitor CSV is the actual load window; without them, naive averaging would mix in ~60s of server warmup (~120W) and the optional eval phase (~300W), biasing a 720W per-GPU draw down to roughly 440W.
2. utils/aggregate_power.py (new, stdlib only, ~210 lines) reads the CSV, detects vendor schema by header regex (handles nvidia-smi power.draw [W] and amd-smi socket_power), filters samples to the bench window, averages per-GPU power per timestamp then over time, and atomically patches the agg JSON. Best-effort throughout: missing/empty/malformed CSV is logged to stderr and skipped without ever failing the run.
3. utils/process_result.py calls the aggregator right after writing the agg JSON. Path resolution checks $GPU_METRICS_CSV → ./gpu_metrics.csv → /workspace/gpu_metrics.csv, accommodating the scripts in benchmarks/single_node/ that override the default path. Wrapped in try/except so telemetry never blocks the upload.
4. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV so per-script CSV-path overrides cross the shell→Python boundary.
No workflow YAML change, no schema migration anywhere downstream: the InferenceX-app ETL's benchmark-mapper.ts is permissive about numeric keys in the agg JSON.
Verification
End-to-end smoke test on a synthesized 1680-row CSV (8 GPUs × 210s spanning warmup at 120W, bench at 720W, eval at 300W):
Cross-check: 720W × 8 GPUs / 682 tok/s ≈ 8.45 J/tok. ✓ Window isolation correctly excluded the warmup + eval samples (naive average would have given ~440W).
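The cross-check works because duration / total_output_tokens is the reciprocal of output throughput, so joules_per_output_token reduces to avg_power_w * num_gpus / (output tokens per second). A tiny sketch with a hypothetical helper name:

```python
def joules_per_token_from_throughput(avg_power_w, num_gpus, output_tok_per_s):
    """Equivalent form of avg_power_w * num_gpus * duration / total_output_tokens."""
    return avg_power_w * num_gpus / output_tok_per_s


# Figures quoted in the cross-check: 720 W per GPU, 8 GPUs, 682 output tok/s.
print(round(joules_per_token_from_throughput(720, 8, 682), 2))  # 8.45
```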
Test plan
- process_result.py: stages a CSV + bench JSON + env vars, asserts agg_<run>.json gets patched
- test_process_result.py tests still pass (no regressions in the established flow)
- agg_<run>.json artifact
Backfill option
The workflow already uploads gpu_metrics.csv as an artifact for every run (.github/workflows/benchmark-tmpl.yml). After this merges, historical runs that have both a CSV and an agg_<run>.json could be backfilled by re-running this aggregator against the artifact store. Out of scope here.
🤖 Generated with Claude Code