This repository was archived by the owner on May 27, 2026. It is now read-only.

MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod. #958

@prentrodgers

Describe the bug

When I run vLLM from the image intel/vllm:0.11.1-xpu on Kubernetes, I get several warnings indicating that it can no longer use MXFP4.

I'm running this on a node with two Arc B580 GPUs with 12 GB of memory each. Here is the OS info on the host node:
uname -a
Linux fs5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu

Inside the container:
uname -a
Linux vllm-gptoss-xpu-5f9755f86d-n7m9h 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
python --version
Python 3.12.3
cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"

Error Logs

(Worker_TP0 pid=227) INFO 01-30 18:16:00 [gpu_model_runner.py:3258] Starting to load model openai/gpt-oss-20b...
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
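
To see at a glance which workers hit the fallback, the excerpt above can be tallied with a short script. This is only a minimal sketch: it assumes the line layout shown in the logs above (`(Worker_TPn pid=...) LEVEL date time [file:line] message`), and the regex and the name `summarize_fallbacks` are illustrative, not part of vLLM.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log excerpt in this issue:
# (Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] message
LOG_LINE = re.compile(
    r"^\((?P<worker>Worker_TP\d+) pid=(?P<pid>\d+)\)\s+"
    r"(?P<level>[A-Z]+)\s+\S+\s+\S+\s+\[(?P<source>[^\]]+)\]\s+(?P<message>.*)$"
)


def summarize_fallbacks(log_text):
    """Map each worker to the MXFP4-related WARNING messages it logged."""
    summary = defaultdict(list)
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        if match["level"] == "WARNING" and "MXFP4" in match["message"]:
            summary[match["worker"]].append(match["message"])
    return dict(summary)


# Three lines copied from the logs above, used as sample input.
SAMPLE = """\
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
"""

if __name__ == "__main__":
    for worker, messages in sorted(summarize_fallbacks(SAMPLE).items()):
        print(worker, len(messages))
```

Run against the full pod log, this should show whether both tensor-parallel workers report the same two warnings, as they do in the excerpt.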

Reproduction Instructions

Yaml:
# Deployment for fs5 running gpt-oss-20b on Intel Arc B580 GPUs with MXFP4
# Image: intel/vllm:0.11.1-xpu
# NodePort: 30024
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gptoss-xpu
  namespace: vllm
  labels:
    app: vllm-gptoss-xpu
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: vllm-gptoss-xpu
  template:
    metadata:
      labels:
        app: vllm-gptoss-xpu
    spec:
      nodeSelector:
        kubernetes.io/hostname: fs5
        intel.feature.node.kubernetes.io/gpu: "true"
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: vllm-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "10Gi"
      - name: dri
        hostPath:
          path: /dev/dri
      containers:
      - name: vllm-gptoss-xpu
        image: intel/vllm:0.11.1-xpu
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["/bin/bash", "-c"]
        args:
          - |
            export HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
            export VLLM_WORKER_MULTIPROC_METHOD=spawn
            export VLLM_XPU_MEMORY_LIMIT=10GiB
            vllm serve openai/gpt-oss-20b \
              --dtype=bfloat16 \
              --trust-remote-code \
              --enforce-eager \
              --port 8000 \
              --block-size 16 \
              --gpu-memory-utilization 0.85 \
              --no-enable-prefix-caching \
              --disable-log-requests \
              --max-num-batched-tokens 512 \
              --max-model-len 1024 \
              --tensor-parallel-size 2 \
              --cpu-offload-gb 20
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: HF_HOME
          value: "/root/.cache/huggingface"
        - name: TRANSFORMERS_CACHE
          value: "/root/.cache/huggingface"
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 40Gi
          requests:
            cpu: "4"
            memory: 12Gi
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        - name: dri
          mountPath: /dev/dri
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 15
          failureThreshold: 40
        readinessProbe:
          httpGet:
            path: /v1/models
            port: 8000
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 30
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-gptoss-xpu
  namespace: vllm
spec:
  type: NodePort
  selector:
    app: vllm-gptoss-xpu
  ports:
  - port: 8000
    targetPort: 8000
    nodePort: 30024
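
Once the pod is up, the endpoints the probes use (`/health` and `/v1/models` on port 8000, exposed as NodePort 30024) can also be queried from outside the cluster. The following is a rough sketch, assuming the node is reachable from the client under the hostname `fs5` taken from the nodeSelector, and that `/v1/models` returns an OpenAI-style list body; the helper names are hypothetical.

```python
import json
import urllib.request

NODE_HOST = "fs5"   # assumption: the nodeSelector hostname resolves from the client
NODE_PORT = 30024   # nodePort from the Service in the manifest above


def endpoint_url(path, host=NODE_HOST, port=NODE_PORT):
    """Build the URL for a path served through the NodePort."""
    return f"http://{host}:{port}{path}"


def fetch(path, timeout=5.0):
    """Return (status, body) for a GET request on the given path."""
    with urllib.request.urlopen(endpoint_url(path), timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


def served_model_ids(models_body):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    payload = json.loads(models_body)
    return [entry["id"] for entry in payload.get("data", [])]


def check_deployment():
    """Query the same endpoints as the startup/liveness and readiness probes."""
    status, _ = fetch("/health")
    print("health:", status)
    _, body = fetch("/v1/models")
    print("models:", served_model_ids(body))
```

Calling `check_deployment()` from a machine that can reach the node prints the health status code and the model ids the server reports; it raises an exception if the server is not yet listening.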

Affected Subfolder

  • classical-ml
  • enterprise
  • preset
  • python
  • pytorch
  • tensorflow
  • test-runner
  • workflows

Versions

lscpu
lspci
cat /etc/os-release
docker --version
docker compose version
python --version
pip freeze
