xpumd v2 crashloops with ERROR_UNINITIALIZED on Gen9.5 (Coffee Lake) iGPUs #129

## Summary

xpumd v2 (Helm chart `oci://ghcr.io/intel/xpumanager/charts/xpumd`) crashloops on Coffee Lake / Gen9.5 iGPUs with `failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED`. The same daemon and chart work on Alder Lake-N (Gen12) iGPUs in the same cluster, so the failure appears to be specific to the older Gen9.5 hardware, which the v2 Level Zero Sysman backend may not support.

Because xpumd exits with a fatal error rather than degrading or skipping, the affected node ends up in a permanent CrashLoopBackOff. This also chains into `intel/intel-resource-drivers-for-kubernetes` ≥ v0.10.0: its kubelet-plugin runs an `xpumdListen` goroutine that panics if it cannot reach xpumd, so a single unsupported GPU node takes down the DRA driver pod on that node too.

## Environment

| Item | Value |
|---|---|
| xpumd chart | `oci://ghcr.io/intel/xpumanager/charts/xpumd:0.0.0-v2.x` (also reproduced with `2.0.0-rc.0` and `0.0.0-main`) |
| xpumd image | `ghcr.io/intel/xpumanager/xpumd:main` (digest `sha256:4ca9e8d626891087a34179ac240ebcdae95322e1ef7801008e2adfa4739883a0`) |
| `gpuAccess` | `dra` |
| Companion driver | `intel-gpu-resource-driver-chart` v0.10.1 (DRA mode) |
| Kubernetes | v1.x on Talos Linux |
| Affected GPU | Intel CoffeeLake-S GT2 [UHD Graphics 630] — PCI `8086:3e92`, class `0380` |
| Affected kernel driver | `i915` |
| Working GPUs (same cluster) | Alder Lake-N UHD Graphics — PCI `8086:46d1`, class `0300` (`i915`) |

## Symptoms

The xpumd pod on the affected node restarts indefinitely. From the container logs:

```
2026-05-12T16:08:54.441Z    info    builders/builders.go:28    Development component. May change in the future.    {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "otelcol.component.id": "intelxpuinfo", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics"}
2026-05-12T16:08:54.442Z    info    builders/builders.go:28    Development component. May change in the future.    {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "otelcol.component.id": "intelxpustatus", "otelcol.component.kind": "processor", "otelcol.pipeline.id": "metrics", "otelcol.signal": "metrics"}
2026-05-12T16:08:54.490Z    error    service@v0.149.0/service.go:165    error found during service initialization    {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "error": "failed to build pipelines: failed to create \"intelxpu\" receiver for data type \"metrics\": failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED"}
go.opentelemetry.io/collector/service.New.func2
    go.opentelemetry.io/collector/service@v0.149.0/service.go:165
go.opentelemetry.io/collector/service.New
    go.opentelemetry.io/collector/service@v0.149.0/service.go:232
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents
    go.opentelemetry.io/collector/otelcol@v0.149.0/collector.go:211
go.opentelemetry.io/collector/otelcol.(*Collector).Run
    go.opentelemetry.io/collector/otelcol@v0.149.0/collector.go:329
go.opentelemetry.io/collector/otelcol.NewCommand.func1
    go.opentelemetry.io/collector/otelcol@v0.149.0/command.go:41
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.10.2/command.go:1015
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.10.2/command.go:1148
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.10.2/command.go:1071
main.runInteractive
    github.com/intel/xpumanager/xpumd/main.go:60
main.run
    github.com/intel/xpumanager/xpumd/main_others.go:10
main.main
    github.com/intel/xpumanager/xpumd/main.go:51
runtime.main
    runtime/proc.go:285
Error: failed to build pipelines: failed to create "intelxpu" receiver for data type "metrics": failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED
```

Plex (using VAAPI on `/dev/dri/renderD128` via the same DRA driver) works correctly on the same node, so kernel-driver-level access to the device is fine; only the Level Zero Sysman path fails.

## Reproduction

1. Install `intel-gpu-resource-driver` v0.10.1 (chart default `kubeletPlugin.healthMonitoring.enabled: true`, which requires the xpumd socket).
2. Install xpumd via:
   ```
   helm install xpumd oci://ghcr.io/intel/xpumanager/charts/xpumd \
     --version 0.0.0-main \
     --set gpuAccess=dra \
     --namespace intel-xpumd
   ```
   on a node with an Intel UHD Graphics 630 / Coffee Lake / Gen9.5 iGPU.
3. Observe xpumd container crashlooping with `ERROR_UNINITIALIZED` from L0 Sysman.

## Expected behavior

One of:

- xpumd successfully initializes on Gen9.5 i915 GPUs.
- xpumd detects that L0 Sysman is unavailable for the bound device and reports a clear, documented "unsupported hardware" condition instead of exiting fatally and crashlooping. A signal that distinguishes permanently unsupported hardware from a transient failure would let cluster operators schedule away cleanly.
- Documentation explicitly lists the minimum supported GPU generation for xpumd v2 so operators can scope their `nodeSelector` accordingly.

## Questions for maintainers

1. Is Gen9.5 (Coffee Lake, e.g. UHD 630, `8086:3e92`) intended to be supported by xpumd v2's L0 Sysman backend? The chart README does not mention a minimum supported generation.
2. If Gen9.5 is not supported, is there a documented label / discriminator (NFD label, device ID list, etc.) that the chart could ship to scope the DaemonSet only to supported hardware?
3. Would maintainers accept a PR adding either (a) graceful exit on `ERROR_UNINITIALIZED` with an "unsupported hardware" log, or (b) a values-level allowlist / regex over PCI device IDs?
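
To make option (b) concrete, here is a rough Go sketch of an allowlist check over `vendor:device` PCI IDs. The function name `deviceAllowed` and the example pattern are assumptions for illustration only; the pattern is not a list of hardware that xpumd actually supports, and how such a list would be exposed in chart values is left open.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// deviceAllowed reports whether a "vendor:device" PCI ID matches any
// pattern in the allowlist. Each pattern is anchored so that it must
// match the whole ID, not just a substring.
func deviceAllowed(pciID string, patterns []string) (bool, error) {
	id := strings.ToLower(strings.TrimSpace(pciID))
	for _, p := range patterns {
		re, err := regexp.Compile("^(?:" + p + ")$")
		if err != nil {
			return false, fmt.Errorf("invalid allowlist pattern %q: %w", p, err)
		}
		if re.MatchString(id) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical allowlist: illustrative only, not a real support matrix.
	allow := []string{`8086:46[0-9a-f]{2}`}

	// The two device IDs from the Environment table.
	for _, id := range []string{"8086:46d1", "8086:3e92"} {
		ok, err := deviceAllowed(id, allow)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%s allowed=%v\n", id, ok)
	}
}
```

With this example pattern, the Alder Lake-N ID `8086:46d1` is accepted and the Coffee Lake ID `8086:3e92` is rejected.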

## Related: downstream impact

`intel/intel-resource-drivers-for-kubernetes` v0.10.0+ treats xpumd as a hard dependency: its kubelet-plugin's `xpumdListen` goroutine panics if the socket isn't reachable. Combined with the issue above, a single unsupported GPU node breaks DRA-driven workloads on that node entirely. Our current workaround is to set `kubeletPlugin.healthMonitoring.enabled=false` and `kubeletPlugin.privileged=true` cluster-wide and to exclude the Gen9.5 node from the xpumd DaemonSet, but losing health monitoring everywhere because of one unsupported node feels heavier than necessary. Happy to file a corresponding issue on that repo if useful.

## Workaround in place

- `intel-gpu-resource-driver` values: `kubeletPlugin.privileged=true`, `kubeletPlugin.healthMonitoring.enabled=false` (cluster-wide).
- xpumd values: `nodeAffinity` excluding the Gen9.5 node by hostname.

