Component
xpumd — receiver: github.com/intel/xpumanager/xpumd/receiver/sysman
Environment
| Field | Value |
| --- | --- |
| GPU | Intel Arc Pro B70 (PCI DID 0x8086:e223) |
| Platform | Dell validation platform - 8× B70 per host |
| Host | <internal-host> |
| OS | Ubuntu 24.04.4 LTS |
| Kernel | 6.17.0-1009-intel |
| GPU driver | xe |
| libxpum1 | 1.3.6-1~24.04~ppa1 |
| Level Zero | libze-intel-gpu1 26.14.37833.4-1~24.04~ppa1 |
| Metric source | job=xpumd, otel_scope_name=github.com/intel/xpumanager/xpumd/receiver/sysman |
Observed Behavior
Three of the eight B70 cards on the host intermittently report hw_power_watts values of approximately 3.69 × 10¹² W (~3.69 TW) — roughly 70 billion times the expected idle power (~52 W).
Affected BDFs in this event:
0001:91:00.0
0000:da:00.0
0000:b4:00.0
The spike lasts exactly one scrape interval (~30 s), then the value returns to normal. The other five cards on the same host are unaffected in the same window.
Prometheus Samples (Observed)
# Normal value
hw_power_watts{pci_bdf="0001:91:00.0", node="<internal-host>"} 52.03
# Spike (13:20 UTC May 27 2026)
hw_power_watts{pci_bdf="0001:91:00.0", node="<internal-host>"} 3689342859416.54
# Spike (13:25 UTC)
hw_power_watts{pci_bdf="0000:da:00.0", ...} 3689219639677.50
# Spike (11:51 UTC and 14:40 UTC)
hw_power_watts{pci_bdf="0000:b4:00.0", ...} 3689068393166.92
Root Cause Analysis
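The sysman receiver derives instantaneous power from the Level Zero energy counter: it divides the change in accumulated energy between two consecutive readings by the change in their timestamps.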
The affected hwmon devices (xe driver) expose only energy1_input (accumulated energy in µJ) — there is no power1_input in sysfs for these cards:
# /sys/class/hwmon/hwmon24 (BDF 0001:91:00.0)
energy1_input = 30184145309326 # µJ accumulated since boot
# power1_input → does NOT EXIST
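Because sysfs exposes only the accumulator, an independent cross-check of the metric has to do the same delta computation. The following is a minimal sketch of such a check, not part of xpumd: `parseEnergyUJ` and `readEnergyUJ` are hypothetical helper names, and the default path is the one from the listing above (the hwmon index varies per host and boot).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// parseEnergyUJ converts the text content of energy1_input
// (a decimal microjoule count, usually newline-terminated) to an integer.
func parseEnergyUJ(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readEnergyUJ reads and parses one energy1_input file.
func readEnergyUJ(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseEnergyUJ(string(raw))
}

func main() {
	// Assumed default path; pass another energy1_input path as the first argument.
	path := "/sys/class/hwmon/hwmon24/energy1_input"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	e0, err := readEnergyUJ(path)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	t0 := time.Now()
	time.Sleep(time.Second)
	e1, err := readEnergyUJ(path)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	dt := time.Since(t0).Seconds()

	// A counter that moved backwards would wrap in unsigned subtraction.
	if e1 < e0 {
		fmt.Println("counter moved backwards; sample skipped")
		return
	}
	// µJ -> J, then divide by elapsed seconds to get watts.
	fmt.Printf("%.2f W\n", float64(e1-e0)/1e6/dt)
}
```

Running this in a loop next to xpumd during a spike window would show whether the kernel counter itself jumps or whether the bad delta originates above sysfs.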
When a GPU undergoes a power-state transition (D0 ↔ D3), the Level Zero zesPowerGetEnergyCounter() call returns a stale or uninitialized energy timestamp on the first sample after wake-up. This causes the delta:
ΔE = (large accumulated value) − (near-zero or stale previous reading)
Δt = very small interval (first valid tick after state change)
Result: power = ~4×10¹³ µJ / ~1×10¹ µs ≈ 3.69×10¹² W
The value magnitude (~3.69 TW) correlates directly with the energy accumulator value at the time of the transition — not random noise.
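The arithmetic above can be reproduced with a small Go sketch. It assumes the receiver divides the energy delta (µJ) by the timestamp delta (µs); the `sample` type and `powerWatts` function are hypothetical names used for illustration, not taken from the xpumd source, and the 10 µs interval is the approximate figure quoted above.

```go
package main

import "fmt"

// sample is a hypothetical pair of readings from an energy counter:
// accumulated energy in microjoules and its timestamp in microseconds.
type sample struct {
	energyUJ    uint64
	timestampUS uint64
}

// powerWatts returns the average power between two samples.
// µJ divided by µs is J/s, i.e. watts. No validation is done here,
// which is exactly the unguarded behavior being illustrated.
func powerWatts(prev, cur sample) float64 {
	dE := float64(cur.energyUJ - prev.energyUJ)
	dT := float64(cur.timestampUS - prev.timestampUS)
	return dE / dT
}

func main() {
	// Healthy case: 52 W sustained over a 30 s scrape interval.
	prev := sample{energyUJ: 30_184_145_309_326, timestampUS: 1_000_000_000}
	cur := sample{
		energyUJ:    prev.energyUJ + 52*30_000_000,
		timestampUS: prev.timestampUS + 30_000_000,
	}
	fmt.Printf("normal: %.2f W\n", powerWatts(prev, cur))

	// Faulty case: the previous reading is near zero and the
	// timestamps are only ~10 µs apart.
	stale := sample{energyUJ: 0, timestampUS: cur.timestampUS - 10}
	fmt.Printf("spike:  %.3e W\n", powerWatts(stale, cur))
}
```

With the accumulator value from the sysfs listing, the faulty case lands in the terawatt range, the same order of magnitude as the observed samples.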
Why only 3 of 8 cards?
The 3 affected cards entered a low-power / idle state at that moment (inferred from the intermittent pattern and the fact that different cards are affected on different scrape cycles). The 5 unaffected cards maintained active workloads and never cleared their counter baseline.
Expected Behavior
hw_power_watts should return a plausible value (or be skipped / reported as NaN) when zesPowerGetEnergyCounter() returns an invalid first-sample delta after a power-state transition.
Suggested Fix
In xpumd/receiver/sysman, add a sanity guard after computing the power delta:
// Before reporting:
if powerWatts > maxSanePowerWatts { // e.g. 1000.0 for B70 (TDP ≈ 150W)
    // skip this sample: counter baseline not yet stable
    return
}
Alternatively, skip the first sample after detecting a power-state transition (D3→D0) and re-baseline previousEnergy and previousTimestamp on wake-up.
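The guard and the re-baselining alternative described above could be combined roughly as follows. This is a sketch under stated assumptions, not the receiver's actual code: `powerTracker` and `update` are hypothetical names, the 1000 W ceiling is the example value from the snippet above, and the check for a counter that moves backwards is an extra precaution (an unsigned subtraction would wrap in that case); the report does not establish whether that happens here.

```go
package main

import (
	"fmt"
	"math"
)

// maxSanePowerWatts is an assumed per-card ceiling; tune it per device.
const maxSanePowerWatts = 1000.0

// powerTracker is a hypothetical helper holding the previous
// energy-counter reading for one device.
type powerTracker struct {
	prevEnergyUJ    uint64
	prevTimestampUS uint64
	hasBaseline     bool
}

// update records a new reading and returns the average power since the
// previous one. ok is false when no trustworthy value can be derived;
// in every case the new reading becomes the baseline for the next call.
func (t *powerTracker) update(energyUJ, timestampUS uint64) (watts float64, ok bool) {
	prevE, prevT, had := t.prevEnergyUJ, t.prevTimestampUS, t.hasBaseline
	t.prevEnergyUJ, t.prevTimestampUS, t.hasBaseline = energyUJ, timestampUS, true

	// No baseline yet, or a counter/timestamp that did not advance.
	if !had || energyUJ < prevE || timestampUS <= prevT {
		return math.NaN(), false
	}
	watts = float64(energyUJ-prevE) / float64(timestampUS-prevT)
	if watts > maxSanePowerWatts {
		return math.NaN(), false
	}
	return watts, true
}

func main() {
	var tr powerTracker
	readings := [][2]uint64{ // {energy µJ, timestamp µs}
		{1_000_000_000, 0},               // first sample: baseline only
		{2_560_000_000, 30_000_000},      // 52 W over 30 s
		{40_000_000_000_000, 30_000_010}, // implausible jump
		{40_001_560_000_000, 60_000_010}, // 52 W again after re-baseline
	}
	for _, r := range readings {
		if w, ok := tr.update(r[0], r[1]); ok {
			fmt.Printf("%.2f W\n", w)
		} else {
			fmt.Println("skipped")
		}
	}
}
```

Re-baselining on every call, including rejected ones, means a single bad reading costs at most one skipped sample instead of poisoning the following delta as well.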
Workaround (Applied)
Context: This node is part of a GPU/BMG validation cluster monitored via Grafana Unified Alerting. We have four alert rules firing on hw_power_watts thresholds (B60 High ≥190 W, B60 Critical ≥210 W, B70 High ≥250 W, B70 Critical ≥275 W). When the spurious TW spikes occurred, all four rules fired as critical for the affected cards — generating noise in the on-call alert queue and ES (Elasticsearch) event index used for audit trails.
Why we applied it here and not upstream: Since xpumd is deployed as a system daemon managed by the platform team and we do not own the binary, patching xpumd was not an immediate option. The fastest mitigation was to guard the PromQL expression in the alert rules themselves.
Applied in: Grafana Unified Alerting rules (4 rules, GPU/BMG validation cluster), PromQL query node A:
max by (node, pci_bdf) (
  hw_power_watts{hw_sensor_location="card"} < 500
)
* on (node, pci_bdf) group_left(pci_device_id)
  hw_gpu_info{pci_device_id="e223", pci_vendor_id="8086"}
The < 500 filter drops any sample at or above 500 W before the threshold expression evaluates, which suppresses the false-positive critical alerts. It does not fix the root cause: the bogus samples still land in the hw_power_watts time series and will corrupt any max_over_time or capacity-planning query unless the same filter is applied there.
Reproducibility
- Frequency: Several times per day on an 8-card B70 host at idle
- Duration: 1 scrape interval (30 s)
- Trigger: GPU power-state transition (D3 → D0 wake-up)
- Reproducible: Monitor max_over_time(hw_power_watts[6h]) on any multi-B70 host at mixed load/idle