ONNX import: GroupQueryAttention do_rotary + decode KV cache (ORT-GenAI fused exports, e.g. Gemma-3) #2345

@czoli1976

Reporting concrete importer gaps found while trying to run a real Gemma-3 ONNX export through
tract. The model architecture is fully supported op-wise; the blockers are all in how the
ORT-GenAI fused decode export packs things into com.microsoft.GroupQueryAttention.

Repro

onnx-community/gemma-3-1b-it-ONNX → onnx/model_q4.onnx     (int4 MatMulNBits, block_size 32)
tract model_q4.onnx -f onnx dump
→ GroupQueryAttention: internal rotary (do_rotary) is unsupported; apply RotaryEmbedding separately

What the export uses (the GQA node)

attrs:  num_heads=4, kv_num_heads=1, scale=0.0625, local_window_size=512,
        softcap=0.0, do_rotary=1, rotary_interleaved=0
inputs: query, key, value, past_key, past_value, seqlens_k, total_seq,
        cos_cache, sin_cache, "", attention_bias
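
As a reading aid, the attributes above can be unpacked with a little arithmetic. This is a minimal sketch, not tract or ONNX Runtime code: the variable names are illustrative, and the head size is only inferred under the assumption that `scale` follows the usual 1/sqrt(head_size) convention (the issue does not state the head size).

```python
# Attribute values copied from the GQA node above.
num_heads = 4
kv_num_heads = 1
scale = 0.0625

# Grouped-query attention: each KV head is shared by a group of query heads.
group_size = num_heads // kv_num_heads

# Which KV head each query head attends with (illustrative mapping).
kv_head_for_query = [h // group_size for h in range(num_heads)]

# ASSUMPTION: scale == 1 / sqrt(head_size). Under that convention the
# head size implied by scale=0.0625 is 1 / scale**2.
implied_head_size = round(1.0 / scale ** 2)

print(group_size, kv_head_for_query, implied_head_size)
```

With kv_num_heads=1 all four query heads read the same key/value head, so the past/present KV tensors carry a single head per layer.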

So beyond tract's current prefill-only / no-rotary GroupQueryAttention, this needs:

  1. do_rotary=1 — RoPE applied inside GQA to Q and K from cos_cache/sin_cache
    (half-rotation, rotary_interleaved=0). The rotation math already exists in tract's
    RotaryEmbedding handler; it just isn't wired from GQA.
  2. decode / have_past: past_key/past_value inputs → concat with current → attend →
    present_key/present_value outputs. The current handler rejects any past KV (prefill only).
  3. RoPE positions from seqlens_k — the op indexes the cos/sin tables at positions derived
    from the per-batch length, not from tensor shapes.
  4. GatherBlockQuantized (1 node) — quantized embedding lookup contrib op, also unhandled.
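
Items 1 to 3 can be sketched together for a single batch entry and a single KV head. This is an illustration in plain Python, not tract's handler. It assumes that seqlens_k holds the total sequence length minus one (my reading of the ONNX Runtime contrib-op documentation), that the cache stores keys that are already rotated, and that cos_cache/sin_cache have one row of head_size/2 values per position. `build_caches` and its base of 10000 are hypothetical stand-ins for the tables the model ships with.

```python
import math

def build_caches(max_pos, head_size, base=10000.0):
    """Hypothetical cos/sin tables: one row of head_size // 2 values per position."""
    half = head_size // 2
    cos, sin = [], []
    for p in range(max_pos):
        angles = [p / base ** (2 * i / head_size) for i in range(half)]
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin

def rope_half(x, cos_row, sin_row):
    """Half-rotation RoPE (rotary_interleaved=0): pair element i with i + half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    out1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos_row, sin_row)]
    out2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos_row, sin_row)]
    return out1 + out2

def append_rotated_keys(k_new, past_k, seqlens_k, cos_cache, sin_cache):
    """Rotate the new keys at their absolute positions and append them to the cache.

    ASSUMPTION: seqlens_k == total_sequence_length - 1, so the number of
    cached tokens is (seqlens_k + 1) - len(k_new).
    """
    total_len = seqlens_k + 1
    past_len = total_len - len(k_new)
    if past_len != len(past_k):
        raise ValueError("seqlens_k is inconsistent with the past cache length")
    rotated = [
        rope_half(k, cos_cache[past_len + i], sin_cache[past_len + i])
        for i, k in enumerate(k_new)
    ]
    return past_k + rotated  # plays the role of present_key

# Decode step: two cached tokens, one new token, so seqlens_k = 2.
cos_cache, sin_cache = build_caches(max_pos=8, head_size=4)
past_k = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.2, 0.3, 0.4]]
present_k = append_rotated_keys([[1.0, 0.0, 0.0, 0.0]], past_k, 2, cos_cache, sin_cache)
print(len(present_k))
```

The point of the sketch is that the rotation position comes from seqlens_k rather than from the shape of the key tensor, which has sequence length 1 during decode. Queries would be rotated the same way, and values are concatenated without rotation.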

Note

None of this is an architecture gap — tract has every op the Gemma family needs
(RotaryEmbedding, GQA prefill, SimplifiedLayerNormalization, Gelu, and local_window_size
sliding-window from #2323). The blocker is purely the importer not unpacking the fused
ORT-GenAI decode form. The same gaps block the broader ORT-GenAI fused-export family
(Gemma, Phi-3, Llama "genai" exports), so handling them would unlock several models at once.

(For contrast: an optimum/transformers.js decomposed export — separate RoPE, no GQA op — does
import and run; I validated Qwen2.5-0.5B int4 that way to bit-identical logits vs ONNXRuntime.)

Filing for visibility / triage — not proposing to take it on right now.