Reporting concrete importer gaps found while trying to run a real Gemma-3 ONNX export through
tract. The model architecture is fully supported op-wise; the blockers are all in how the
ORT-GenAI fused decode export packs things into com.microsoft.GroupQueryAttention.
Repro
onnx-community/gemma-3-1b-it-ONNX → onnx/model_q4.onnx (int4 MatMulNBits, block_size 32)
tract model_q4.onnx -f onnx dump
→ GroupQueryAttention: internal rotary (do_rotary) is unsupported; apply RotaryEmbedding separately
What the export uses (the GQA node)
attrs: num_heads=4, kv_num_heads=1, scale=0.0625, local_window_size=512,
softcap=0.0, do_rotary=1, rotary_interleaved=0
inputs: query, key, value, past_key, past_value, seqlens_k, total_seq,
cos_cache, sin_cache, "", attention_bias
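To make the head-related attributes concrete, here is a minimal sketch of what num_heads=4, kv_num_heads=1 and scale=0.0625 would imply. It assumes the usual grouped-query convention (contiguous groups of query heads share one KV head) and that scale follows the common 1/sqrt(head_size) rule. The helper names `kv_head_for` and `implied_head_size` are hypothetical, and the head size itself is not stated in the report.

```python
import math

# Attribute values copied from the GQA node described above.
NUM_HEADS = 4
KV_NUM_HEADS = 1
SCALE = 0.0625


def kv_head_for(q_head, num_heads=NUM_HEADS, kv_num_heads=KV_NUM_HEADS):
    """Index of the KV head shared by query head `q_head`.

    Assumes contiguous grouping: the first num_heads // kv_num_heads
    query heads use KV head 0, the next group uses KV head 1, and so on.
    """
    if num_heads % kv_num_heads != 0:
        raise ValueError("num_heads must be a multiple of kv_num_heads")
    group_size = num_heads // kv_num_heads
    return q_head // group_size


def implied_head_size(scale=SCALE):
    """Head size d for which 1/sqrt(d) == scale (assumes that convention)."""
    return round(1.0 / (scale * scale))


# With 4 query heads and 1 KV head, every query head reads the same K/V.
mapping = [kv_head_for(h) for h in range(NUM_HEADS)]
```

Under these assumptions all four query heads map to KV head 0, and the scale is consistent with a head size of 256.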
Beyond tract's current prefill-only / no-rotary GroupQueryAttention, this needs:
- do_rotary=1: RoPE applied inside GQA to Q and K from cos_cache/sin_cache
  (half-rotation, rotary_interleaved=0). The rotation math already exists in tract's
  RotaryEmbedding handler; it just isn't wired from GQA.
- decode / have_past: past_key/past_value inputs → concat with current → attend →
  present_key/present_value outputs. The current handler rejects any past KV (prefill only).
- RoPE positions from seqlens_k: the op indexes the cos/sin tables at positions derived
  from the per-batch length, not from tensor shapes.
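The three items above can be sketched in plain Python for a single batch entry and a single KV head. This is an illustration under stated assumptions, not tract's or ONNX Runtime's implementation: it assumes seqlens_k[b] holds the total sequence length minus one, that cos_cache/sin_cache have shape [max_pos][head_dim / 2], and that keys are rotated before being appended to the past. All function names and the `base` default are hypothetical.

```python
import math


def build_caches(max_pos, head_dim, base=10000.0):
    """cos/sin tables of shape [max_pos][head_dim // 2] (base is an assumed value)."""
    half = head_dim // 2
    cos_cache, sin_cache = [], []
    for pos in range(max_pos):
        angles = [pos * base ** (-2.0 * i / head_dim) for i in range(half)]
        cos_cache.append([math.cos(a) for a in angles])
        sin_cache.append([math.sin(a) for a in angles])
    return cos_cache, sin_cache


def rope_half(vec, cos_row, sin_row):
    """Half-rotation RoPE (rotary_interleaved=0): pair element i with i + half."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    out1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos_row, sin_row)]
    out2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos_row, sin_row)]
    return out1 + out2


def positions_from_seqlens_k(seqlens_k, seq_len):
    """Per-batch token positions, derived from lengths rather than tensor shapes.

    Assumption: seqlens_k[b] == total_len - 1, so the past length is
    seqlens_k[b] + 1 - seq_len and new token s sits at past_len + s.
    """
    return [[(k + 1 - seq_len) + s for s in range(seq_len)] for k in seqlens_k]


def decode_step_keys(k_new, past_key, cos_cache, sin_cache, seqlens_k_b):
    """Rotate the new keys at their positions, then concat onto past_key."""
    pos = positions_from_seqlens_k([seqlens_k_b], len(k_new))[0]
    rotated = [rope_half(k, cos_cache[p], sin_cache[p]) for k, p in zip(k_new, pos)]
    return past_key + rotated  # plays the role of present_key


cos_cache, sin_cache = build_caches(max_pos=8, head_dim=4)
past_key = [[0.0] * 4, [0.0] * 4]          # two cached tokens
new_key = [[1.0, 2.0, 3.0, 4.0]]           # one decode token
present_key = decode_step_keys(new_key, past_key, cos_cache, sin_cache, seqlens_k_b=2)
```

In this sketch the decode token lands at position 2 (two cached tokens plus itself gives a total length of 3, so seqlens_k is 2), and the query would be rotated with the same cos/sin row before attending over present_key.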
GatherBlockQuantized (1 node) — quantized embedding lookup contrib op, also unhandled.
Note
None of this is an architecture gap — tract has every op the Gemma family needs
(RotaryEmbedding, GQA prefill, SimplifiedLayerNormalization, Gelu, and local_window_size
sliding-window from #2323). The blocker is purely the importer not unpacking the fused
ORT-GenAI decode form. The same gaps block the broader ORT-GenAI fused-export family
(Gemma, Phi-3, Llama "genai" exports), so handling them would unlock several models at once.
(For contrast: an optimum/transformers.js decomposed export — separate RoPE, no GQA op — does
import and run; I validated Qwen2.5-0.5B int4 that way to bit-identical logits vs ONNXRuntime.)
Filing for visibility / triage — not proposing to take it on right now.