ONNX import: GroupQueryAttention do_rotary + decode KV cache (ORT-GenAI fused exports, e.g. Gemma-3)

Reporting concrete importer gaps found while trying to run a real Gemma-3 ONNX export through
tract. The model architecture itself is supported op-wise; the blockers are almost all in how the
ORT-GenAI **fused decode export** packs things into `com.microsoft.GroupQueryAttention`, plus one
quantized-embedding contrib op (listed below).

## Repro
```
onnx-community/gemma-3-1b-it-ONNX → onnx/model_q4.onnx     (int4 MatMulNBits, block_size 32)
tract model_q4.onnx -f onnx dump
→ GroupQueryAttention: internal rotary (do_rotary) is unsupported; apply RotaryEmbedding separately
```

## What the export uses (the GQA node)
```
attrs:  num_heads=4, kv_num_heads=1, scale=0.0625, local_window_size=512,
        softcap=0.0, do_rotary=1, rotary_interleaved=0
inputs: query, key, value, past_key, past_value, seqlens_k, total_seq,
        cos_cache, sin_cache, "", attention_bias
```
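
For readers less familiar with the attributes, here is a small sketch of what `num_heads`, `kv_num_heads` and `scale` imply. The attribute values are copied from the node above; `head_dim = 256` is my assumption (it is not an attribute of the op, and I have not listed the tensor shapes here), and `kv_head_for` is a hypothetical helper name, not a tract or ONNX Runtime function.

```python
import math

# Attribute values copied from the GQA node above.
num_heads = 4
kv_num_heads = 1
scale = 0.0625

# Assumption: per-head dimension of 256 (not stated in the node attributes).
head_dim = 256


def kv_head_for(q_head: int) -> int:
    """Return the index of the KV head shared by query head `q_head`."""
    group_size = num_heads // kv_num_heads  # query heads per KV head
    return q_head // group_size


# With kv_num_heads=1, every query head reads the same single KV head.
mapping = [kv_head_for(h) for h in range(num_heads)]

# Under the head_dim assumption, the explicit scale equals 1/sqrt(head_dim).
scale_matches_default = math.isclose(scale, 1.0 / math.sqrt(head_dim))
```

If the head dimension really is 256, the explicit `scale` attribute is just the usual `1/sqrt(head_dim)` written out, so it should not need special handling beyond being honoured when present.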

So beyond tract's current prefill-only / no-rotary `GroupQueryAttention`, this needs:

1. **`do_rotary=1`** — RoPE applied *inside* GQA to Q and K from `cos_cache`/`sin_cache`
   (half-rotation, `rotary_interleaved=0`). The rotation math already exists in tract's
   `RotaryEmbedding` handler; it just isn't wired from GQA.
2. **decode / `have_past`** — `past_key`/`past_value` inputs → concat with current → attend →
   `present_key`/`present_value` outputs. The current handler rejects any past KV (prefill only).
3. **RoPE positions from `seqlens_k`** — the op indexes the cos/sin tables at positions derived
   from the per-batch length, not from tensor shapes.
4. **`GatherBlockQuantized`** (1 node) — quantized embedding lookup contrib op, also unhandled.
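
Items 1 to 3 can be sketched together as one decode step. This is a minimal pure-Python illustration for a single batch entry with one KV head, not tract's or ONNX Runtime's implementation. Assumptions on my side: `seqlens_k` holds the total sequence length minus one (so for a one-token step it equals the new token's position), the cos/sin tables have `head_dim / 2` columns, and the RoPE base is 10000. Sliding window, `softcap` and `attention_bias` are left out, and all function names are hypothetical.

```python
import math


def build_caches(max_pos, head_dim, base=10000.0):
    """Build cos/sin tables of shape [max_pos][head_dim // 2] (base is assumed)."""
    half = head_dim // 2
    cos, sin = [], []
    for p in range(max_pos):
        angles = [p * base ** (-2.0 * i / head_dim) for i in range(half)]
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin


def rope_half(x, cos_row, sin_row):
    """Half-rotation RoPE (rotary_interleaved=0): pair element i with i + d/2."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        a, b = x[i], x[i + half]
        out[i] = a * cos_row[i] - b * sin_row[i]
        out[i + half] = b * cos_row[i] + a * sin_row[i]
    return out


def gqa_decode_step(q_heads, k_new, v_new, past_k, past_v,
                    seqlens_k, cos_cache, sin_cache, scale):
    """One-token decode: rotate Q/K, append to the cache, attend over all keys.

    q_heads        : list of query vectors, one per query head
    k_new, v_new   : key / value vector of the new token (single KV head)
    past_k, past_v : cached key (already rotated) and value vectors
    seqlens_k      : assumed to be total sequence length - 1
    """
    pos = seqlens_k  # position comes from data, not from tensor shapes
    assert pos == len(past_k), "sketch assumes a dense, unpadded cache"

    k_rot = rope_half(k_new, cos_cache[pos], sin_cache[pos])
    present_k = past_k + [k_rot]   # concat past with current
    present_v = past_v + [v_new]

    outputs = []
    for q in q_heads:              # every query head shares the one KV head
        q_rot = rope_half(q, cos_cache[pos], sin_cache[pos])
        scores = [scale * sum(a * b for a, b in zip(q_rot, k)) for k in present_k]
        m = max(scores)            # subtract the max for a stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out = [sum(w[t] / z * present_v[t][d] for t in range(len(present_v)))
               for d in range(len(v_new))]
        outputs.append(out)
    return outputs, present_k, present_v


# Usage: two consecutive decode steps, feeding present_* back in as past_*.
cos, sin = build_caches(max_pos=8, head_dim=4)
q = [1.0, 2.0, 3.0, 4.0]
out0, pk, pv = gqa_decode_step([q], [1.0, 0.0, 0.0, 0.0], [5.0, 6.0, 7.0, 8.0],
                               [], [], 0, cos, sin, 0.5)
out1, pk, pv = gqa_decode_step([q], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0],
                               pk, pv, 1, cos, sin, 0.5)
```

The point of the sketch is that the cache stores keys that are already rotated, so only the new token's Q and K need RoPE at each step, and the position used for the table lookup has to come from the `seqlens_k` value rather than from the shape of `past_key`.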

## Note
None of this is an architecture gap — tract has every op the Gemma family needs
(`RotaryEmbedding`, GQA prefill, `SimplifiedLayerNormalization`, `Gelu`, and `local_window_size`
sliding-window from #2323). The blocker is purely the importer not unpacking the **fused
ORT-GenAI decode form**. The same gaps block the broader ORT-GenAI fused-export family
(Gemma, Phi-3, Llama "genai" exports), so handling them would unlock several models at once.

(For contrast: an optimum/transformers.js *decomposed* export — separate RoPE, no GQA op — does
import and run; I validated Qwen2.5-0.5B int4 that way to bit-identical logits vs ONNXRuntime.)

Filing for visibility / triage — not proposing to take it on right now.


ONNX import: GroupQueryAttention do_rotary + decode KV cache (ORT-GenAI fused exports, e.g. Gemma-3) #2345
