This repository was archived by the owner on May 27, 2026. It is now read-only.
When I run vLLM from the image intel/vllm:0.11.1-xpu on Kubernetes, I get several errors indicating that it can no longer use MXFP4.
I'm running this on a node with two Arc B580 GPUs with 12 GB each. The host runs Ubuntu 24.04.3 LTS (Noble Numbat). Here is the OS info on the host node:
uname -a
Linux fs5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Inside the container:
uname -a
Linux vllm-gptoss-xpu-5f9755f86d-n7m9h 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
python --version
Python 3.12.3
cat /etc/release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
Error Logs
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [gpu_model_runner.py:3258] Starting to load model openai/gpt-oss-20b...
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
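To make the fallback easier to see per worker, here is a minimal sketch that groups the MXFP4-related lines from a log like the one above. It is not part of vLLM: the function name `summarize_mxfp4`, the `LOG_LINE` pattern, and the `SAMPLE` excerpt are hypothetical helpers, and the regular expression only assumes the line shape visible in this report.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log excerpt in this report:
# (Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] message...
LOG_LINE = re.compile(
    r"^\((?P<worker>\S+) pid=(?P<pid>\d+)\) "
    r"(?P<level>[A-Z]+) (?P<date>\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def summarize_mxfp4(log_text):
    """Group MXFP4-related log lines by worker as (level, source, message)."""
    summary = defaultdict(list)
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything in another format
        # Keep lines that mention MXFP4 or that were emitted by mxfp4.py.
        if "MXFP4" in match["message"] or match["source"].startswith("mxfp4.py"):
            summary[match["worker"]].append(
                (match["level"], match["source"], match["message"])
            )
    return dict(summary)


# Excerpt copied from the log above (hypothetical variable name).
SAMPLE = """
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [gpu_model_runner.py:3258] Starting to load model openai/gpt-oss-20b...
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
"""

if __name__ == "__main__":
    for worker, entries in sorted(summarize_mxfp4(SAMPLE).items()):
        print(worker)
        for level, source, message in entries:
            print(f"  {level:7} [{source}] {message}")
```

Run against the full log, this should show the same two WARNING lines and the one mxfp4.py INFO line for both Worker_TP0 and Worker_TP1, which suggests the fallback is not specific to one of the two GPUs.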