`hw_power_watts` reports spurious Terawatt values (~3.69e12 W) for Intel Arc Pro B70 cards during GPU power-state transitions

### Component
`xpumd` — receiver: `github.com/intel/xpumanager/xpumd/receiver/sysman`

### Environment

| Field | Value |
|---|---|
| GPU | Intel Arc Pro B70 (PCI DID `0x8086:e223`) |
| Platform | Dell validation platform - 8× B70 per host |
| Host | `<internal-host>` |
| OS | Ubuntu 24.04.4 LTS |
| Kernel | `6.17.0-1009-intel` |
| GPU driver | `xe` |
| `libxpum1` | `1.3.6-1~24.04~ppa1` |
| Level Zero | `libze-intel-gpu1 26.14.37833.4-1~24.04~ppa1` |
| Metric source | `job=xpumd`, `otel_scope_name=github.com/intel/xpumanager/xpumd/receiver/sysman` |

### Observed Behavior
Three of the eight B70 cards on the host intermittently report `hw_power_watts` values of approximately **3.69 × 10¹² W (~3.69 TW)** — roughly 70 billion times the expected idle power (~52 W).

Affected BDFs in this event:
- `0001:91:00.0`
- `0000:da:00.0`
- `0000:b4:00.0`

The spike lasts **one scrape interval (~30 s)**, then the value returns to normal. The other five cards on the same host are unaffected in the same window.

### Prometheus Samples (Observed)

```
# Normal value
hw_power_watts{pci_bdf="0001:91:00.0", node="<internal-host>"} 52.03

# Spike (13:20 UTC May 27 2026)
hw_power_watts{pci_bdf="0001:91:00.0", node="<internal-host>"} 3689342859416.54

# Spike (13:25 UTC)
hw_power_watts{pci_bdf="0000:da:00.0", ...} 3689219639677.50

# Spike (11:51 UTC and 14:40 UTC)
hw_power_watts{pci_bdf="0000:b4:00.0", ...} 3689068393166.92
```

### Root Cause Analysis

The `sysman` receiver computes instantaneous power from the Level Zero energy counter:

```
power_W = ΔE_µJ / Δt_µs
```

The affected hwmon devices (xe driver) expose **only `energy1_input`** (accumulated energy in µJ) — there is no `power1_input` in sysfs for these cards:

```bash
# /sys/class/hwmon/hwmon24 (BDF 0001:91:00.0)
energy1_input = 30184145309326  # µJ accumulated since boot
# power1_input → does NOT EXIST
```

When a GPU undergoes a power-state transition (D0 ↔ D3), the Level Zero `zesPowerGetEnergyCounter()` call returns a **stale or uninitialized energy timestamp** on the first sample after wake-up. This causes the delta:

```
ΔE = (large accumulated value) − (near-zero or stale previous reading)
Δt = very small interval (first valid tick after state change)
```

Result: `power ≈ 4×10¹³ µJ / 1×10¹ µs ≈ 4×10¹² W`, the same order of magnitude as the observed ~3.69×10¹² W.

The value magnitude (~3.69 TW) correlates directly with the energy accumulator value at the time of the transition — not random noise.
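The arithmetic can be reproduced with a minimal Go sketch. The `energySample` type and `powerWatts` function are hypothetical names used only for illustration, not the receiver's actual code, and the sample values are assumptions chosen to match the orders of magnitude quoted above.

```go
package main

import "fmt"

// energySample is a hypothetical stand-in for one energy-counter reading:
// accumulated energy in microjoules and a timestamp in microseconds.
type energySample struct {
	EnergyUJ    uint64
	TimestampUS uint64
}

// powerWatts returns ΔE/Δt. µJ divided by µs is W, so no scaling is needed.
// It deliberately has no guards, to show how a bad baseline blows up.
func powerWatts(prev, cur energySample) float64 {
	dE := float64(cur.EnergyUJ - prev.EnergyUJ)
	dt := float64(cur.TimestampUS - prev.TimestampUS)
	return dE / dt
}

func main() {
	// Healthy case: 30 s between samples at roughly 52 W.
	prev := energySample{EnergyUJ: 30184145309326, TimestampUS: 1_000_000_000}
	cur := energySample{
		EnergyUJ:    prev.EnergyUJ + 1_560_000_000,
		TimestampUS: prev.TimestampUS + 30_000_000,
	}
	fmt.Printf("normal: %.2f W\n", powerWatts(prev, cur))

	// Assumed failure case: the previous reading is zero and the
	// two timestamps are only 10 µs apart.
	stale := energySample{EnergyUJ: 0, TimestampUS: cur.TimestampUS - 10}
	fmt.Printf("spike:  %.3e W\n", powerWatts(stale, cur))
}
```

With these assumed inputs the second call yields a value on the order of 10¹² W, which is why the spike tracks the size of the accumulator rather than looking like random noise.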

### Why only 3 of 8 cards?
The three affected cards had entered a low-power / idle state at that moment; this is inferred from the intermittent pattern and from different cards being affected on different scrape cycles. The five unaffected cards maintained active workloads and never cleared their counter baseline.

### Expected Behavior
`hw_power_watts` should return a plausible value (or be skipped / reported as `NaN`) when `zesPowerGetEnergyCounter()` returns an invalid first-sample delta after a power-state transition.

### Suggested Fix
In `xpumd/receiver/sysman`, add a sanity guard after computing the power delta:

```go
// Before reporting:
if powerWatts > maxSanePowerWatts {  // e.g. 1000.0 for B70 (TDP ≈ 150W)
    // skip this sample — counter baseline not yet stable
    return
}
```

Alternatively, skip the first sample after detecting a power-state transition (D3→D0) and re-baseline `previousEnergy` and `previousTimestamp` on wake-up.
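Both ideas can be combined in one small per-device helper. The following is a sketch under stated assumptions, not the receiver's real structure: `powerTracker`, `Update`, and `Rebaseline` are hypothetical names, the 1000 W ceiling is the example value from above, and how a D3→D0 wake-up is detected is left out entirely.

```go
package main

import (
	"fmt"
	"math"
)

// maxSanePowerWatts is an assumed ceiling, taken from the 1000 W example above.
const maxSanePowerWatts = 1000.0

// powerTracker is a hypothetical per-device helper holding the previous sample.
type powerTracker struct {
	prevEnergyUJ    uint64
	prevTimestampUS uint64
	hasBaseline     bool
}

// Rebaseline discards the stored sample, e.g. after a detected D3→D0 wake-up.
func (t *powerTracker) Rebaseline() { t.hasBaseline = false }

// Update stores the new sample and returns the derived power in watts.
// ok is false when the sample should be skipped rather than reported.
func (t *powerTracker) Update(energyUJ, timestampUS uint64) (watts float64, ok bool) {
	prevE, prevT, had := t.prevEnergyUJ, t.prevTimestampUS, t.hasBaseline
	// Always adopt the newest reading as the next baseline.
	t.prevEnergyUJ, t.prevTimestampUS, t.hasBaseline = energyUJ, timestampUS, true

	if !had || timestampUS <= prevT || energyUJ < prevE {
		// No baseline yet, or the counter or timestamp went backwards.
		return math.NaN(), false
	}
	watts = float64(energyUJ-prevE) / float64(timestampUS-prevT)
	if watts > maxSanePowerWatts {
		// Implausible delta: skip this sample.
		return math.NaN(), false
	}
	return watts, true
}

func main() {
	var tr powerTracker
	_, ok := tr.Update(1_000_000_000, 1_000_000) // first sample only sets the baseline
	fmt.Println(ok)
	w, ok := tr.Update(2_560_000_000, 31_000_000) // 1.56e9 µJ over 30 s
	fmt.Println(w, ok)
	_, ok = tr.Update(30_184_145_309_326, 31_000_010) // implausible jump
	fmt.Println(ok)
}
```

Because the newest reading always becomes the baseline, a single rejected sample costs one scrape interval and the next delta is computed against sane values again.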

### Workaround (Applied)

**Context:** This node is part of a GPU/BMG validation cluster monitored via Grafana Unified Alerting. We have four alert rules firing on `hw_power_watts` thresholds (B60 High ≥190 W, B60 Critical ≥210 W, B70 High ≥250 W, B70 Critical ≥275 W). When the spurious TW spikes occurred, all four rules fired as `critical` for the affected cards — generating noise in the on-call alert queue and ES (Elasticsearch) event index used for audit trails.

**Why we applied it here and not upstream:** Since xpumd is deployed as a system daemon managed by the platform team and we do not own the binary, patching xpumd was not an immediate option. The fastest mitigation was to guard the PromQL expression in the alert rules themselves.

**Applied in:** Grafana Unified Alerting rules (4 rules, GPU/BMG validation cluster), PromQL query node A:
```promql
max by (node, pci_bdf) (
  hw_power_watts{hw_sensor_location="card"} < 500
)
* on (node, pci_bdf) group_left(pci_device_id)
hw_gpu_info{pci_device_id="e223", pci_vendor_id="8086"}
```

The `< 500` predicate drops every sample of 500 W or more before the threshold expression evaluates, which silences the false-positive critical alerts. It does **not** fix the root cause: the bogus samples still land in the `hw_power_watts` time series and will corrupt any `max_over_time` or capacity-planning query that does not apply the same filter.
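Any consumer that aggregates these samples outside Grafana needs an equivalent filter. A minimal Go sketch of that idea follows; `maxPlausible` and `plausibleCeilingWatts` are hypothetical names, the ceiling mirrors the 500 W cut-off above, and the third sample in the example window is made up for illustration.

```go
package main

import "fmt"

// plausibleCeilingWatts mirrors the 500 W cut-off used in the alert rules.
const plausibleCeilingWatts = 500.0

// maxPlausible returns the largest sample below the ceiling.
// ok is false when no sample passes the filter.
func maxPlausible(samples []float64) (best float64, ok bool) {
	for _, s := range samples {
		// s != s is true only for NaN.
		if s != s || s < 0 || s >= plausibleCeilingWatts {
			continue
		}
		if !ok || s > best {
			best, ok = s, true
		}
	}
	return best, ok
}

func main() {
	// One normal reading, one spurious spike, one illustrative reading.
	window := []float64{52.03, 3689342859416.54, 51.87}
	fmt.Println(maxPlausible(window))
}
```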

### Reproducibility
- **Frequency**: Several times per day on an 8-card B70 host at idle
- **Duration**: 1 scrape interval (30 s)
- **Trigger**: GPU power-state transition (D3 → D0 wake-up)
- **Reproducible**: Monitor `max_over_time(hw_power_watts[6h])` on any multi-B70 host at mixed load/idle
