Summary
xpumd v2 (Helm chart oci://ghcr.io/intel/xpumanager/charts/xpumd) crashloops on Coffee Lake / Gen9.5 iGPUs with the error "failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED". The same daemon and chart work fine on Alder Lake-N (Gen12) iGPUs in the same cluster, so the failure appears to be specific to older Gen9.5 hardware, which the v2 Level Zero Sysman backend apparently does not support.
Because xpumd exits with a fatal error rather than degrading or skipping, the affected node ends up in a permanent CrashLoopBackOff. This also chains into intel/intel-resource-drivers-for-kubernetes ≥ v0.10.0: its kubelet-plugin runs an xpumdListen goroutine that panics if it cannot reach xpumd, so a single unsupported GPU node takes down the DRA driver pod on that node too.
Environment
| Item | Value |
| --- | --- |
| xpumd chart | oci://ghcr.io/intel/xpumanager/charts/xpumd:0.0.0-v2.x (also reproduced with 2.0.0-rc.0, 0.0.0-v2.x and 0.0.0-main) |
| xpumd image | ghcr.io/intel/xpumanager/xpumd:main (digest sha256:4ca9e8d626891087a34179ac240ebcdae95322e1ef7801008e2adfa4739883a0) |
| gpuAccess | dra |
| Companion driver | intel-gpu-resource-driver-chart v0.10.1 (DRA mode) |
| Kubernetes | v1.x on Talos Linux |
| Affected GPU | Intel CoffeeLake-S GT2 [UHD Graphics 630], PCI 8086:3e92, class 0380 |
| Affected kernel driver | i915 |
| Working GPUs (same cluster) | Alder Lake-N UHD Graphics, PCI 8086:46d1, class 0300 (i915) |
Symptoms
xpumd pod on the affected node restarts indefinitely. From the container logs:
2026-05-12T16:08:54.441Z info builders/builders.go:28 Development component. May change in the future. {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "otelcol.component.id": "intelxpuinfo", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics"}
2026-05-12T16:08:54.442Z info builders/builders.go:28 Development component. May change in the future. {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "otelcol.component.id": "intelxpustatus", "otelcol.component.kind": "processor", "otelcol.pipeline.id": "metrics", "otelcol.signal": "metrics"}
2026-05-12T16:08:54.490Z error service@v0.149.0/service.go:165 error found during service initialization {"resource": {"service.instance.id": "0a87364d-ef34-440f-b6df-7aa3f55392af", "service.name": "xpumd", "service.version": "0.0.0"}, "error": "failed to build pipelines: failed to create \"intelxpu\" receiver for data type \"metrics\": failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED"}
go.opentelemetry.io/collector/service.New.func2
go.opentelemetry.io/collector/service@v0.149.0/service.go:165
go.opentelemetry.io/collector/service.New
go.opentelemetry.io/collector/service@v0.149.0/service.go:232
go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents
go.opentelemetry.io/collector/otelcol@v0.149.0/collector.go:211
go.opentelemetry.io/collector/otelcol.(*Collector).Run
go.opentelemetry.io/collector/otelcol@v0.149.0/collector.go:329
go.opentelemetry.io/collector/otelcol.NewCommand.func1
go.opentelemetry.io/collector/otelcol@v0.149.0/command.go:41
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra@v1.10.2/command.go:1015
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra@v1.10.2/command.go:1148
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra@v1.10.2/command.go:1071
main.runInteractive
github.com/intel/xpumanager/xpumd/main.go:60
main.run
github.com/intel/xpumanager/xpumd/main_others.go:10
main.main
github.com/intel/xpumanager/xpumd/main.go:51
runtime.main
runtime/proc.go:285
Error: failed to build pipelines: failed to create "intelxpu" receiver for data type "metrics": failed to initialize L0 Sysman API (likely a Level-Zero driver / device access issue): ERROR_UNINITIALIZED
Plex (using VAAPI on /dev/dri/renderD128 via the same DRA driver) works correctly on the same node, so kernel-driver-level access is fine — only the Level Zero Sysman path is failing.
Reproduction
- Install intel-gpu-resource-driver v0.10.1 (chart default healthMonitoring.enabled: true, which requires the xpumd socket).
- Install xpumd on a node with an Intel UHD Graphics 630 / Coffee Lake / Gen9.5 iGPU via:
helm install xpumd oci://ghcr.io/intel/xpumanager/charts/xpumd \
  --version 0.0.0-main \
  --set gpuAccess=dra \
  --namespace intel-xpumd
- Observe the xpumd container crashlooping with ERROR_UNINITIALIZED from L0 Sysman.
Expected behavior
One of:
- xpumd successfully initializes on Gen9.5 i915 GPUs.
- xpumd detects that L0 Sysman is unavailable for the bound device and exits with a clear non-fatal status / a documented "unsupported hardware" condition, rather than crashlooping. A liveness signal that lets a CrashLoopBackOff be distinguished from a transient failure would let cluster operators schedule away cleanly.
- Documentation explicitly lists the minimum supported GPU generation for xpumd v2 so operators can scope their nodeSelector accordingly.
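To illustrate the second option, the startup path could map the Sysman initialization error to an action instead of always exiting. The following is a minimal Go sketch, not xpumd's actual code: ErrSysmanUninitialized, errTransient, startupAction and classifyInit are hypothetical names, and we do not know how xpumd represents the ERROR_UNINITIALIZED result internally.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSysmanUninitialized is a hypothetical sentinel standing in for the
// ERROR_UNINITIALIZED result seen in the log; xpumd's real error type may differ.
var ErrSysmanUninitialized = errors.New("ERROR_UNINITIALIZED")

// errTransient is a made-up example of any other startup failure.
var errTransient = errors.New("device temporarily unavailable")

// startupAction is what the daemon does after trying to initialize Sysman.
type startupAction int

const (
	actionRun  startupAction = iota // Sysman is up: collect metrics as usual
	actionIdle                      // unsupported hardware: log once, stay up, export nothing
	actionFail                      // anything else: exit non-zero and let the kubelet retry
)

// classifyInit maps an initialization error to a startup action.
func classifyInit(err error) startupAction {
	switch {
	case err == nil:
		return actionRun
	case errors.Is(err, ErrSysmanUninitialized):
		return actionIdle
	default:
		return actionFail
	}
}

func main() {
	// Wrap the sentinel the way a receiver factory might, to show that
	// errors.Is still recognizes it through the wrapping.
	wrapped := fmt.Errorf("failed to initialize L0 Sysman API: %w", ErrSysmanUninitialized)
	for _, err := range []error{nil, wrapped, errTransient} {
		switch classifyInit(err) {
		case actionRun:
			fmt.Println("run: Sysman initialized")
		case actionIdle:
			fmt.Println("idle: unsupported hardware:", err)
		case actionFail:
			fmt.Println("fail: exiting non-zero:", err)
		}
	}
}
```

Staying up in the idle case is only one possible design. Exiting with a dedicated, documented status would also let operators tell the two situations apart.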
Questions for maintainers
- Is Gen9.5 (Coffee Lake, e.g. UHD 630, 8086:3e92) intended to be supported by xpumd v2's L0 Sysman backend? The chart README does not mention a minimum supported generation.
- If Gen9.5 is not supported, is there a documented label / discriminator (NFD label, device ID list, etc.) that the chart could ship to scope the DaemonSet only to supported hardware?
- Would maintainers accept a PR adding either (a) graceful exit on ERROR_UNINITIALIZED with an "unsupported hardware" log, or (b) a values-level allowlist / regex over PCI device IDs?
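To make option (b) more concrete, here is a rough Go sketch of how an allowlist of regular expressions over "vendor:device" PCI IDs could be evaluated. The function names compileAllowlist and isAllowed, the pattern syntax, and the idea that the list comes from chart values are all assumptions for illustration; nothing like this exists in the chart today as far as we know.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileAllowlist turns patterns such as "8086:46d1" or "8086:46.."
// into anchored, case-insensitive regular expressions.
func compileAllowlist(patterns []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile("(?i)^(?:" + p + ")$")
		if err != nil {
			return nil, fmt.Errorf("invalid PCI ID pattern %q: %w", p, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// isAllowed reports whether a "vendor:device" PCI ID matches any pattern.
func isAllowed(pciID string, allow []*regexp.Regexp) bool {
	id := strings.TrimSpace(pciID)
	for _, re := range allow {
		if re.MatchString(id) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical allowlist containing only the Alder Lake-N ID from the
	// Environment table; the Coffee Lake ID is deliberately left out.
	allow, err := compileAllowlist([]string{"8086:46d1"})
	if err != nil {
		panic(err)
	}
	for _, id := range []string{"8086:46d1", "8086:3e92"} {
		fmt.Printf("%s allowed=%v\n", id, isAllowed(id, allow))
	}
}
```

Anchoring each pattern keeps a short pattern from accidentally matching a longer ID, and rejecting invalid patterns at startup surfaces typos in the values file early.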
Related: downstream impact
intel/intel-resource-drivers-for-kubernetes v0.10.0+ treats xpumd as a hard dependency — its kubelet-plugin's xpumdListen goroutine panics if the socket isn't reachable. Combined with the issue above, a single unsupported GPU node breaks DRA-driven workloads on that node entirely. Current workaround on our side is to set kubeletPlugin.healthMonitoring.enabled=false + kubeletPlugin.privileged=true cluster-wide and exclude the Gen9.5 node from the xpumd DaemonSet, but losing health-monitoring everywhere because of one unsupported node feels heavier than necessary. Happy to file a corresponding issue on that repo if useful.
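As a sketch of what a softer dependency might look like on the driver side, the listener could retry the socket with backoff and then continue without health data instead of panicking. This is not the kubelet-plugin's code: dialWithBackoff, healthAvailable, the socket path and the (deliberately short) timings are hypothetical and chosen only so the example runs quickly.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// Demo timings only; a real component would likely use much longer values.
const (
	initialWait = 20 * time.Millisecond
	maxWait     = 100 * time.Millisecond
	giveUpAfter = 300 * time.Millisecond
)

// dialWithBackoff keeps trying to connect to a Unix socket until ctx is done,
// doubling the wait between attempts up to maxWait. It returns an error
// rather than panicking so the caller can decide how to degrade.
func dialWithBackoff(ctx context.Context, path string) (net.Conn, error) {
	var d net.Dialer
	wait := initialWait
	for {
		conn, err := d.DialContext(ctx, "unix", path)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("xpumd socket %s unreachable: %w", path, err)
		case <-time.After(wait):
		}
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

// healthAvailable reports whether the socket could be reached in time.
func healthAvailable(path string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), giveUpAfter)
	defer cancel()
	conn, err := dialWithBackoff(ctx, path)
	if err != nil {
		fmt.Println("health monitoring disabled:", err)
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Hypothetical socket path; the real path is not shown in this issue.
	if !healthAvailable("/nonexistent/xpumd.sock") {
		fmt.Println("continuing without xpumd health data")
	}
}
```

With something along these lines, health monitoring could stay enabled cluster-wide and only the unsupported node would run without it.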
Workaround in place
- intel-gpu-resource-driver values: kubeletPlugin.privileged=true, kubeletPlugin.healthMonitoring.enabled=false (cluster-wide).
- xpumd values: nodeAffinity excluding the Gen9.5 node by hostname.