xpumd v2 crashloops with ERROR_UNINITIALIZED on Gen9.5 (Coffee Lake) iGPUs #129

@niklasfrick

Summary

xpumd v2 (helm chart oci://ghcr.io/intel/xpumanager/charts/xpumd) crashloops on Coffee Lake / Gen9.5 iGPUs with the error "failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED". The same daemon and chart work on Alder Lake-N (Gen12) iGPUs in the same cluster, so the failure appears to be specific to older Gen9.5 hardware, which the v2 Level Zero Sysman backend apparently does not support.

Because xpumd exits with a fatal error rather than degrading or skipping, the affected node ends up in a permanent CrashLoopBackOff. This also chains into intel/intel-resource-drivers-for-kubernetes ≥ v0.10.0: its kubelet-plugin runs an xpumdListen goroutine that panics if it cannot reach xpumd, so a single unsupported GPU node takes down the DRA driver pod on that node too.

Environment

Item | Value
xpumd chart | oci://ghcr.io/intel/xpumanager/charts/xpumd:0.0.0-v2.x (also reproduced with 2.0.0-rc.0, 0.0.0-v2.x and 0.0.0-main)
xpumd image | ghcr.io/intel/xpumanager/xpumd:main (digest sha256:4ca9e8d626891087a34179ac240ebcdae95322e1ef7801008e2adfa4739883a0)
gpuAccess | dra
Companion driver | intel-gpu-resource-driver-chart v0.10.1 (DRA mode)
Kubernetes | v1.x on Talos Linux
Affected GPU | Intel CoffeeLake-S GT2 [UHD Graphics 630], PCI 8086:3e92, class 0380
Affected kernel driver | i915
Working GPUs (same cluster) | Alder Lake-N UHD Graphics, PCI 8086:46d1, class 0300 (i915)

Symptoms

xpumd pod on the affected node restarts indefinitely. From the container logs:

2026-05-12T16:08:54.441Z    info    builders/builders.go:28    Development component. May change in the future.    {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "otelcol.component.id": "intelxpuinfo", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics"}
2026-05-12T16:08:54.442Z    info    builders/builders.go:28    Development component. May change in the future.    {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "otelcol.component.id": "intelxpustatus", "otelcol.component.kind": "processor", "otelcol.pipeline.id": "metrics", "otelcol.signal": "metrics"}
2026-05-12T16:08:54.490Z    error    service@v0.149.0/service.go:165    error found during service initialization    {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "error": "failed to build pipelines: failed to create \"intelxpu\" receiver for data type \"metrics\": failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED"}
go.opentelemetry.io/collector/service.New.func2
    go.opentelemetry.io/collector/service@v0.149.0/service.go:165
go.opentelemetry.io/collector/service.New
    go.opentelemetry.io/collector/service@v0.149.0/service.go:232
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents
    go.opentelemetry.io/collector/otelcol@v0.149.0/collector.go:211
go.opentelemetry.io/collector/otelcol.(*Collector).Run
    go.opentelemetry.io/collector/otelcol@v0.149.0/collector.go:329
go.opentelemetry.io/collector/otelcol.NewCommand.func1
    go.opentelemetry.io/collector/otelcol@v0.149.0/command.go:41
github.com/spf13/cobra.(*Command).execute
    github.com/spf13/cobra@v1.10.2/command.go:1015
github.com/spf13/cobra.(*Command).ExecuteC
    github.com/spf13/cobra@v1.10.2/command.go:1148
github.com/spf13/cobra.(*Command).Execute
    github.com/spf13/cobra@v1.10.2/command.go:1071
main.runInteractive
    github.com/intel/xpumanager/xpumd/main.go:60
main.run
    github.com/intel/xpumanager/xpumd/main_others.go:10
main.main
    github.com/intel/xpumanager/xpumd/main.go:51
runtime.main
    runtime/proc.go:285
Error: failed to build pipelines: failed to create "intelxpu" receiver for data type "metrics": failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED

Plex (using VAAPI on /dev/dri/renderD128 via the same DRA driver) works correctly on the same node, so kernel-driver-level access is fine — only the Level Zero Sysman path is failing.

Reproduction

  1. Install intel-gpu-resource-driver v0.10.1 (the chart defaults to kubeletPlugin.healthMonitoring.enabled: true, which requires the xpumd socket).
  2. Install xpumd via:
    helm install xpumd oci://ghcr.io/intel/xpumanager/charts/xpumd \
      --version 0.0.0-main \
      --set gpuAccess=dra \
      --namespace intel-xpumd
    
    on a node with an Intel UHD Graphics 630 / Coffee Lake / Gen9.5 iGPU.
  3. Observe xpumd container crashlooping with ERROR_UNINITIALIZED from L0 Sysman.

Expected behavior

One of:

  • xpumd successfully initializes on Gen9.5 i915 GPUs.
  • xpumd detects that L0 Sysman is unavailable for the bound device and exits with a clear non-fatal status / a documented "unsupported hardware" condition, rather than crashlooping. A signal that distinguishes this permanent condition from a transient failure would let cluster operators schedule around the node cleanly.
  • Documentation explicitly lists the minimum supported GPU generation for xpumd v2 so operators can scope their nodeSelector accordingly.

Questions for maintainers

  1. Is Gen9.5 (Coffee Lake, e.g. UHD 630, 8086:3e92) intended to be supported by xpumd v2's L0 Sysman backend? The chart README does not mention a minimum supported generation.
  2. If Gen9.5 is not supported, is there a documented label / discriminator (NFD label, device ID list, etc.) that the chart could ship to scope the DaemonSet only to supported hardware?
  3. Would maintainers accept a PR adding either (a) graceful exit on ERROR_UNINITIALIZED with an "unsupported hardware" log, or (b) a values-level allowlist / regex over PCI device IDs?

Related: downstream impact

intel/intel-resource-drivers-for-kubernetes v0.10.0+ treats xpumd as a hard dependency: its kubelet-plugin's xpumdListen goroutine panics if the socket isn't reachable. Combined with the issue above, a single unsupported GPU node breaks DRA-driven workloads on that node entirely. Our current workaround is to set kubeletPlugin.healthMonitoring.enabled=false and kubeletPlugin.privileged=true cluster-wide and to exclude the Gen9.5 node from the xpumd DaemonSet, but losing health monitoring everywhere because of one unsupported node feels heavier than necessary. Happy to file a corresponding issue on that repo if useful.

Workaround in place

  • intel-gpu-resource-driver values: kubeletPlugin.privileged=true, kubeletPlugin.healthMonitoring.enabled=false (cluster-wide).
  • xpumd values: nodeAffinity excluding the Gen9.5 node by hostname.
