hw_power_watts reports spurious Terawatt values (~3.69e12 W) for Intel Arc Pro B70 cards during GPU power-state transitions #130

@sjlealru

Description

Component

xpumd — receiver: github.com/intel/xpumanager/xpumd/receiver/sysman

Environment

Field          Value
GPU            Intel Arc Pro B70 (PCI ID 8086:e223, vendor:device)
Platform       Dell validation platform, 8× B70 per host
Host           <internal-host>
OS             Ubuntu 24.04.4 LTS
Kernel         6.17.0-1009-intel
GPU driver     xe
libxpum1       1.3.6-1~24.04~ppa1
Level Zero     libze-intel-gpu1 26.14.37833.4-1~24.04~ppa1
Metric source  job=xpumd, otel_scope_name=github.com/intel/xpumanager/xpumd/receiver/sysman

Observed Behavior

Three of the eight B70 cards on the host intermittently report hw_power_watts values of approximately 3.69 × 10¹² W (~3.69 TW) — roughly 70 billion times the expected idle power (~52 W).

Affected BDFs in this event:

  • 0001:91:00.0
  • 0000:da:00.0
  • 0000:b4:00.0

The spike lasts exactly one scrape interval (~30 s), then the value returns to normal. The other five cards on the same host are unaffected in the same window.

Prometheus Samples (Observed)

# Normal value
hw_power_watts{pci_bdf="0001:91:00.0", node="<internal-host>"} 52.03

# Spike (13:20 UTC May 27 2026)
hw_power_watts{pci_bdf="0001:91:00.0", node="<internal-host>"} 3689342859416.54

# Spike (13:25 UTC)
hw_power_watts{pci_bdf="0000:da:00.0", ...} 3689219639677.50

# Spike (11:51 UTC and 14:40 UTC)
hw_power_watts{pci_bdf="0000:b4:00.0", ...} 3689068393166.92

Root Cause Analysis

The sysman receiver computes instantaneous power from the Level Zero energy counter:

power_W = ΔE_µJ / Δt_µs
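As a minimal illustration of the formula above (not the receiver's actual code), the calculation can be sketched in Go. The EnergySample type and AveragePowerWatts function are hypothetical names introduced here; the only facts taken from this report are that energy is in µJ and time in µs, so the quotient is already in watts. The sample values in main are made up to reproduce the 52.03 W idle reading.

```go
package main

import "fmt"

// EnergySample is a hypothetical stand-in for one energy-counter reading:
// accumulated energy in microjoules and the time of the reading in microseconds.
type EnergySample struct {
	EnergyUJ    uint64
	TimestampUS uint64
}

// AveragePowerWatts returns ΔE/Δt between two samples.
// µJ divided by µs is J/s, so no unit conversion is needed.
// No validation is done here: cur is assumed to be newer than prev.
func AveragePowerWatts(prev, cur EnergySample) float64 {
	deltaE := float64(cur.EnergyUJ - prev.EnergyUJ)
	deltaT := float64(cur.TimestampUS - prev.TimestampUS)
	return deltaE / deltaT
}

func main() {
	prev := EnergySample{EnergyUJ: 30_184_145_309_326, TimestampUS: 1_000_000}
	// 30 s later the counter has advanced by 1560.9 J (illustrative numbers).
	cur := EnergySample{
		EnergyUJ:    prev.EnergyUJ + 1_560_900_000,
		TimestampUS: prev.TimestampUS + 30_000_000,
	}
	fmt.Printf("%.2f W\n", AveragePowerWatts(prev, cur)) // prints "52.03 W"
}
```

Because the result depends entirely on the two samples being a consistent pair, a single bad baseline is enough to produce an arbitrarily large value.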

The affected hwmon devices (xe driver) expose only energy1_input (accumulated energy in µJ) — there is no power1_input in sysfs for these cards:

# /sys/class/hwmon/hwmon24 (BDF 0001:91:00.0)
energy1_input = 30184145309326  # µJ accumulated since boot
# power1_input → does NOT EXIST
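Since only the energy accumulator is exposed, power can be cross-checked independently of xpumd by reading energy1_input twice and dividing by the elapsed time. The sketch below assumes the file holds a decimal microjoule count, as the hwmon sysfs convention specifies; parseEnergyUJ and readEnergyUJ are hypothetical helper names, and the hwmon index differs per host, so the path is passed as an argument.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// parseEnergyUJ parses the contents of an hwmon energyN_input file
// (a decimal microjoule count followed by a newline).
func parseEnergyUJ(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readEnergyUJ reads and parses one energy sample from sysfs.
func readEnergyUJ(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseEnergyUJ(string(b))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: energypower /sys/class/hwmon/hwmonN/energy1_input")
		return
	}
	path := os.Args[1]

	e0, err := readEnergyUJ(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	start := time.Now()
	time.Sleep(time.Second)
	e1, err := readEnergyUJ(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	elapsedUS := float64(time.Since(start).Microseconds())

	// A counter that moved backwards would wrap in unsigned subtraction.
	if e1 < e0 {
		fmt.Fprintln(os.Stderr, "energy counter moved backwards; discarding sample")
		os.Exit(1)
	}
	fmt.Printf("%.2f W\n", float64(e1-e0)/elapsedUS)
}
```

Running this in a loop while a card idles and wakes would show whether the sysfs counter itself misbehaves or whether the anomaly is introduced further up the stack.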

When a GPU undergoes a power-state transition (D0 ↔ D3), the Level Zero zesPowerGetEnergyCounter() call returns a stale or uninitialized energy or timestamp value on the first sample after wake-up. The receiver then computes its delta from mismatched readings:

ΔE = (large accumulated value) − (near-zero or stale previous reading)
Δt = very small interval (first valid tick after state change)

Result: power ≈ 4×10¹³ µJ / 1×10¹ µs = 4×10¹² W, the same order of magnitude as the observed ~3.69×10¹² W

The value magnitude (~3.69 TW) correlates directly with the energy accumulator value at the time of the transition — not random noise.

Why only 3 of 8 cards?

The three affected cards entered a low-power/idle state at that moment, which is consistent with the intermittent pattern and with different cards being affected on different scrape cycles. The five unaffected cards maintained active workloads and never cleared their counter baseline.

Expected Behavior

hw_power_watts should return a plausible value (or be skipped / reported as NaN) when zesPowerGetEnergyCounter() returns an invalid first-sample delta after a power-state transition.

Suggested Fix

In xpumd/receiver/sysman, add a sanity guard after computing the power delta:

// Before reporting:
if powerWatts > maxSanePowerWatts {  // e.g. 1000.0 for B70 (TDP ≈ 150W)
    // skip this sample — counter baseline not yet stable
    return
}

Alternatively, skip the first sample after detecting a power-state transition (D3→D0) and re-baseline previousEnergy and previousTimestamp on wake-up.
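Both suggestions can be combined in one small state holder. The following is a sketch only, under the assumption that the receiver keeps one previous sample per device; PowerTracker, EnergySample, and Update are hypothetical names, not xpumd identifiers. It always re-baselines on the newest sample, and it skips a reading when there is no baseline yet, when either counter failed to advance (unsigned subtraction would wrap), or when the result exceeds the plausibility bound.

```go
package main

import "fmt"

// EnergySample is a hypothetical energy-counter reading (µJ, µs).
type EnergySample struct {
	EnergyUJ    uint64
	TimestampUS uint64
}

// PowerTracker holds the previous sample for one device (hypothetical type).
type PowerTracker struct {
	prev     EnergySample
	havePrev bool
	MaxWatts float64 // plausibility bound, e.g. 1000 for a B70
}

// Update returns (watts, true) for a plausible sample and (0, false) when
// the sample should be skipped. The baseline is replaced in every case, so
// one bad reading cannot affect more than one interval.
func (t *PowerTracker) Update(cur EnergySample) (float64, bool) {
	prev, hadPrev := t.prev, t.havePrev
	t.prev, t.havePrev = cur, true

	if !hadPrev {
		return 0, false // first sample: nothing to diff against
	}
	if cur.EnergyUJ < prev.EnergyUJ || cur.TimestampUS <= prev.TimestampUS {
		return 0, false // counter reset or stale timestamp
	}
	watts := float64(cur.EnergyUJ-prev.EnergyUJ) /
		float64(cur.TimestampUS-prev.TimestampUS)
	if watts > t.MaxWatts {
		return 0, false // implausible: baseline not yet stable
	}
	return watts, true
}

func main() {
	tr := &PowerTracker{MaxWatts: 1000}
	samples := []EnergySample{
		{EnergyUJ: 1_000_000_000, TimestampUS: 0},               // no baseline yet
		{EnergyUJ: 2_560_900_000, TimestampUS: 30_000_000},      // normal interval
		{EnergyUJ: 30_184_145_309_326, TimestampUS: 30_000_010}, // huge ΔE, tiny Δt
		{EnergyUJ: 30_185_706_209_326, TimestampUS: 60_000_010}, // normal again
	}
	for _, s := range samples {
		if w, ok := tr.Update(s); ok {
			fmt.Printf("%.2f W\n", w)
		} else {
			fmt.Println("skipped")
		}
	}
	// Output: skipped, 52.03 W, skipped, 52.03 W
}
```

Returning a boolean instead of NaN keeps the decision about what to export (drop the point or emit NaN) with the caller, which matches either option listed under Expected Behavior.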

Workaround (Applied)

Context: This node is part of a GPU/BMG validation cluster monitored via Grafana Unified Alerting. We have four alert rules firing on hw_power_watts thresholds (B60 High ≥190 W, B60 Critical ≥210 W, B70 High ≥250 W, B70 Critical ≥275 W). When the spurious TW spikes occurred, all four rules fired as critical for the affected cards — generating noise in the on-call alert queue and ES (Elasticsearch) event index used for audit trails.

Why we applied it here and not upstream: Since xpumd is deployed as a system daemon managed by the platform team and we do not own the binary, patching xpumd was not an immediate option. The fastest mitigation was to guard the PromQL expression in the alert rules themselves.

Applied in: Grafana Unified Alerting rules (4 rules, GPU/BMG validation cluster), PromQL query node A:

max by (node, pci_bdf) (
  hw_power_watts{hw_sensor_location="card"} < 500
)
* on (node, pci_bdf) group_left(pci_device_id)
hw_gpu_info{pci_device_id="e223", pci_vendor_id="8086"}

The < 500 predicate drops any sample of 500 W or more before the threshold expression evaluates, which silences the false-positive alerts. It does not fix the root cause: the bogus samples still land in the hw_power_watts time series and will corrupt any max_over_time or capacity-planning query unless they are filtered everywhere.

Reproducibility

  • Frequency: Several times per day on an 8-card B70 host at idle
  • Duration: 1 scrape interval (30 s)
  • Trigger: GPU power-state transition (D3 → D0 wake-up)
  • Reproducible: Monitor max_over_time(hw_power_watts[6h]) on any multi-B70 host at mixed load/idle
