Describe the issue
Running Gemma 3 (270M) models exported to ONNX on WebGPU with fp16 produces unusable outputs.
The same model works correctly on:
- ONNX Runtime CPU / WASM
- WebGPU when using fp32
This strongly suggests an overflow / numerical-stability issue inside ONNX Runtime's WebGPU fp16 execution path, likely identical or closely related to the issue "[WebGPU] fp16 nanochat produces NaNs (CPU works fine)".
It may also be connected to the Unsloth finding that activations overflow to infinity in float16.
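For illustration only, here is a minimal JavaScript sketch of the suspected failure mode (not ONNX Runtime code; it assumes an engine with Math.f16round support, i.e. ES2025 / recent Chrome, Firefox, Node 22+):

// fp16 can only represent magnitudes up to 65504; larger values saturate to
// Infinity, and Infinity - Infinity produces NaN.
const FP16_MAX = 65504;

console.log(Math.f16round(FP16_MAX)); // 65504  -> still representable
console.log(Math.f16round(70000));    // Infinity -> overflows the fp16 range
console.log(Math.f16round(70000) - Math.f16round(70000)); // NaN

// If an intermediate activation in the Gemma 3 graph exceeds this range on the
// WebGPU fp16 path, downstream ops can turn Infinity into NaN and the sampled
// token ids degenerate (e.g. the repeated <unused56> tokens shown below).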
To reproduce
The failure can be reproduced with a minimal Transformers.js script:
main.js:
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "onnx-community/gemma-3-270m-it-ONNX",
  {
    dtype: 'fp16',
    device: 'webgpu',
  },
);

const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is the capital of France?" },
];

const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
HTML to try it out:
<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Transformers.js Browser Demo</title>
    <script type="importmap">
      {
        "imports": {
          "@huggingface/transformers": "https://cdn.jsdelivr.net/npm/@huggingface/transformers"
        }
      }
    </script>
  </head>
  <body>
    <h1>Transformers.js Browser Demo</h1>
    <p>Open the browser console to see the generated output.</p>
    <script type="module" src="./main.js"></script>
  </body>
</html>
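(To run the repro locally, the page generally needs to be served over HTTP with any static file server rather than opened directly from the filesystem, since module scripts and import maps are subject to CORS restrictions.)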
For model HuggingFaceTB/SmolLM2-360M-Instruct with dtype: 'fp16' and device: 'webgpu' it returns:
The capital of France is Paris.
For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'fp16' and device: 'webgpu' it returns:
<unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56>
For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'fp16' and device: 'wasm' it returns:
The capital of France is Paris.
For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'q4f16' and device: 'webgpu' it returns:
<unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56><unused56>
For model onnx-community/gemma-3-270m-it-ONNX with dtype: 'q4f16' and device: 'wasm' it returns:
The capital of France is Paris.
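For reference, a minimal sketch of the configuration that does produce correct output on WebGPU, based on the fp32 observation above; only the dtype changes relative to main.js:

import { pipeline } from "@huggingface/transformers";

// Same pipeline as main.js above, but with dtype 'fp32' instead of 'fp16'.
// Per the results above this avoids the broken output on WebGPU, at the cost
// of the memory/speed benefits that fp16 and q4f16 are meant to provide.
const generator = await pipeline(
  "text-generation",
  "onnx-community/gemma-3-270m-it-ONNX",
  {
    dtype: 'fp32',
    device: 'webgpu',
  },
);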
Urgency
High.
Gemma-3-270M is designed for edge and client-side inference, especially WebGPU.
Because FP16 and Q4F16 are the intended fast modes, this overflow prevents the model from being usable on ONNX Runtime WebGPU in real deployments.
This affects:
- ONNX Runtime-based inference pipelines in browsers
- Transformers.js WebGPU backend
- Any WebGPU client expecting FP16 performance
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.23.2
Execution Provider
'webgpu' (WebGPU)