
[Web] fp16 and q4f16 Gemma 3 models produce invalid outputs on WebGPU due to overflow in ONNX runtime #26732

@fosple


Describe the issue

Running Gemma 3 (270M) models exported to ONNX produces unusable outputs on WebGPU with fp16.

The same model works correctly on:

  • ONNX Runtime CPU / WASM
  • WebGPU when using fp32

This strongly suggests an overflow / numerical-stability issue inside ONNX Runtime’s WebGPU fp16 execution path, likely identical or closely related to the issue “[WebGPU] fp16 nanochat produces NaNs (CPU works fine)”.

Could this also be related to the Unsloth finding that Gemma 3 activations become infinite in float16?
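For intuition (not taken from ONNX Runtime's actual kernels), here is a minimal sketch of the suspected failure mode: fp16 tops out at ±65504, so an activation that is perfectly representable in fp32 can overflow to Infinity, and two overflowed values turn into NaN as soon as they are subtracted, e.g. inside a softmax. The `toFp16` helper and the softmax functions below are hypothetical illustrations:

```javascript
// fp16 (IEEE 754 binary16) tops out at ±65504, vs ~3.4e38 for fp32.
// Crude model of fp16 overflow (ignores precision loss below the max):
const FP16_MAX = 65504;
const toFp16 = (x) => (Math.abs(x) > FP16_MAX ? Math.sign(x) * Infinity : x);

console.log(toFp16(70000)); // Infinity — fine in fp32, overflows in fp16
// Two overflowed values become NaN as soon as they are subtracted:
console.log(toFp16(70000) - toFp16(70000)); // NaN

// The same failure mode in a softmax: exp() of a large logit overflows.
function softmaxNaive(logits) {
  const exps = logits.map(Math.exp);
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum); // Infinity / Infinity = NaN
}

// The usual guard: subtract the max logit so exp() never sees a large
// positive argument. (Illustration only, not ONNX Runtime's code.)
function softmaxStable(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

console.log(softmaxNaive([1000, 1001]));  // [NaN, NaN]
console.log(softmaxStable([1000, 1001])); // ≈ [0.269, 0.731]
```

If a WebGPU fp16 kernel skips a guard like this (or lets an intermediate value exceed the fp16 range before the guard runs), NaNs propagate through the logits and decoding degenerates into repeated filler tokens like `<unused56>`.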

To reproduce

The failure can be reproduced with a minimal Transformers.js script:

main.js:

import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/gemma-3-270m-it-ONNX",
  { 
    dtype: 'fp16',
    device: 'webgpu',
  },
);

const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is the capital of France?" },
];

const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);

HTML to try it out:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Transformers.js Browser Demo</title>
  <script type="importmap">
  {
    "imports": {
      "@huggingface/transformers": "https://cdn.jsdelivr.net/npm/@huggingface/transformers"
    }
  }
  </script>
</head>
<body>
  <h1>Transformers.js Browser Demo</h1>
  <p>Open the browser console to see the generated output.</p>
  <script type="module" src="./main.js"></script>
</body>
</html>

For model HuggingFaceTB/SmolLM2-360M-Instruct with dtype: 'fp16' and device: 'webgpu' it returns:

The capital of France is Paris.

For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'fp16' and device: 'webgpu' it returns:

<unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56>

For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'fp16' and device: 'wasm' it returns:

The capital of France is Paris.

For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'q4f16' and device: 'webgpu' it returns:

<unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56>

For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'q4f16' and device: 'wasm' it returns:

The capital of France is Paris.

Urgency

High.

Gemma-3-270M is designed for edge and client-side inference, especially via WebGPU.
Because fp16 and q4f16 are the intended fast modes, this overflow makes the model unusable with ONNX Runtime WebGPU in real deployments.

This affects:

  • ONNX Runtime-based inference pipelines in browsers
  • Transformers.js WebGPU backend
  • Any WebGPU client expecting FP16 performance

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.23.2

Execution Provider

'webgpu' (WebGPU)


Labels

    .NET · ep:WebGPU (ort-web webgpu provider) · model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.) · platform:web (issues related to ONNX Runtime web)
