Add Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe — WIP by justinchuby · Pull Request #432 · microsoft/olive-recipes

justinchuby · 2026-05-27T05:47:54Z

Summary

Adds google-gemma-4-E2B-it/QNN/ — exploratory Olive recipe for compiling all four Gemma 4 components (decoder, embedding, vision_encoder, audio_encoder) into QNN EPContext binaries for HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.

Marked WIP because not yet hardware-validated.

Approach

All components compile to QNN. Olive's CompositeModelHandler dispatch runs quant + StaticLLM per component automatically; EPContextBinaryGenerator and ComposeOnnxModels (both _accepts_composite_model = True) finalize the multimodal package without any manual splitting.

Pipeline

HfModel (multimodal Gemma 4)
   ↓ MobiusBuilder (fp32)               4 ONNX components + sidecars (genai_config, tokenizer, processors)
   ↓ OnnxKQuantQuantization (INT4)      mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits
   ↓ MatMulNBitsToQDQ                   MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ)
   ↓ OnnxStaticQuantization             activations uint16 / weights uint8 (calibrated)
   ↓ StaticLLM                          static shapes
   ↓ EPContextBinaryGenerator           HTP blobs, weight-shared across components
   ↓ ComposeOnnxModels                  final package

MatMulNBitsToQDQ is required between OnnxKQuantQuantization and OnnxStaticQuantization: the former emits com.microsoft::MatMulNBits which the QNN EP partitioner does not claim, so without QDQ rewriting every quantized MatMul silently falls back to CPU.
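The ordering constraint above can be checked mechanically before a long build. The following is a minimal sketch, not part of the recipe: it assumes the config's `passes` mapping runs in insertion order, and every pass key other than `mnb_to_qdq` is a hypothetical name chosen for illustration.

```python
# Required relative order, taken from the pipeline described above.
REQUIRED_ORDER = [
    "OnnxKQuantQuantization",
    "MatMulNBitsToQDQ",
    "OnnxStaticQuantization",
]


def check_pass_order(passes: dict) -> list[str]:
    """Return a list of problems; an empty list means the order looks right.

    `passes` maps a pass name to a dict with a "type" entry. Python dicts
    (and json.load) preserve insertion order, which this check relies on.
    """
    types = [cfg.get("type") for cfg in passes.values()]
    problems = []
    positions = []
    for required in REQUIRED_ORDER:
        if required not in types:
            problems.append(f"missing pass: {required}")
        else:
            positions.append(types.index(required))
    if not problems and positions != sorted(positions):
        problems.append(
            "passes present but out of order: " + " -> ".join(REQUIRED_ORDER)
        )
    return problems


# Hypothetical pass keys; only "mnb_to_qdq" appears in the reviewed config.
example_passes = {
    "kquant": {"type": "OnnxKQuantQuantization"},
    "mnb_to_qdq": {"type": "MatMulNBitsToQDQ"},
    "static_quant": {"type": "OnnxStaticQuantization"},
}
print(check_pass_order(example_passes))
```

A check like this only catches the config-level mistake. Whether the QNN partitioner actually claims the rewritten nodes still has to be confirmed on hardware.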

Why no `GraphSurgeries`?

Existing QNN recipes (Phi-3, Qwen) use RemoveRopeMultiCache, AttentionMaskToSequenceLengths, and SimplifiedLayerNormToL2Norm to rewrite ModelBuilder-specific contrib ops into HTP-friendly shapes. MobiusBuilder emits opset-23 standard ops (RMSNormalization, Attention) instead of the contrib variants, so those surgeries are either no-ops or inapplicable. Gemma-4–specific surgeries may still be needed (notably for the final logit soft-cap cap * tanh(x / cap)), but the set borrowed from the existing recipes does not cover them.
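For reference, the soft-cap formula cap * tanh(x / cap) is small enough to apply on the host if it turns out not to lower to HTP. The sketch below is plain Python over a list of floats. The cap value of 30.0 is an assumption for illustration only; the real value would come from the model config.

```python
import math


def soft_cap(logits, cap):
    """Apply cap * tanh(x / cap) element-wise to a list of logits."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


# ASSUMPTION: 30.0 is an illustrative cap, not a value taken from this PR.
raw = [-1000.0, -5.0, 0.0, 0.1, 5.0, 1000.0]
capped = soft_cap(raw, 30.0)
for before, after in zip(raw, capped):
    print(f"{before:10.2f} -> {after:8.4f}")
```

Because tanh is monotonically increasing, the cap preserves the ranking of logits and only changes their magnitudes, so it matters for sampling probabilities rather than for greedy argmax (barring ties from floating-point saturation at extreme values).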

Known limitations (documented in README)

Logit soft-cap may not lower to HTP — may need a new RemoveLogitSoftcap GraphSurgery upstream, or host post-processing.
Hybrid local/global attention with dual head_dim (256 / 512) — HTP per-layer dispatch needs testing.
per_layer_inputs data flow — should "just work" when both embedding + decoder are on QNN, but StaticLLM might need a hint.
Tokenizer calibration — wikitext-2 under-represents the 256k Gemma 4 vocab (image/audio specials).
StaticLLM context_length=64 — placeholder; tune for target SKU.
Standard Attention vs GroupQueryAttention — mobius's QNN ep_capabilities() advertises an empty gqa_dtypes, so the decoder uses opset-23 Attention(attention_mask) rather than GQA(seqlens_k, total_seq_len). See discussion comment.

Asks for review

Anyone with Snapdragon HW willing to run this and report which pass breaks?
Is OnnxKQuantQuantization the right INT4 path for QNN, or should this use OnnxBlockWiseRtnQuantization / GptqModel?
Should I attempt the upstream RemoveLogitSoftcap Olive surgery now, or wait for HW validation to confirm it's actually the blocker?

Adds google-gemma-4-E2B-it/QNN/ as a starting-point recipe for compiling Gemma 4's text decoder into a QNN EPContext binary for HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.

Pipeline: MobiusBuilder fp32 → OnnxKQuantQuantization (INT4 weights) → MatMulNBitsToQDQ → GraphSurgeries (RemoveRopeMultiCache / AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm) → OnnxStaticQuantization (uint16 act / uint8 wt) → SplitModel + StaticLLM → EPContextBinaryGenerator (HTP blob) → ComposeOnnxModels

Marked WORK IN PROGRESS in the README. Known limitations called out explicitly:

1) MobiusBuilder always exports the multimodal 4-component package for google/gemma-4-E2B-it; no current way to force the text-only gemma4_text path from the recipe config. Splitting the QNN passes to apply only to the decoder component is still TODO.
2) GraphSurgeries borrowed from Phi-3 / Qwen QNN recipes have not been verified against Gemma 4's hybrid local/global attention, dual head_dim KV cache, or final logit soft-capping (tanh-cap).
3) per_layer_inputs (second embedding output, consumed by every decoder block) needs custom split orchestration if embedding stays on CPU and decoder runs on HTP.
4) Calibration via wikitext-2 may under-represent multimodal-format tokens (256k vocab includes vision/audio specials).
5) StaticLLM context_length=64 is a placeholder for HW tuning.

Filed as exploratory template so other contributors with Snapdragon HW can iterate.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

…urgeries

All four Gemma 4 components (decoder + embedding + vision_encoder + audio_encoder) compile to QNN EPContext binaries together. Olive's CompositeModelHandler dispatch runs quant + StaticLLM per component automatically, then EPContextBinaryGenerator + ComposeOnnxModels (both _accepts_composite_model = True) finalise the multimodal package.

Drop:

* SplitModel: not needed when all components stay on QNN
* MatMulNBitsToQDQ: was a ModelBuilder-specific stepping stone
* GraphSurgeries with RemoveRopeMultiCache / AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm: those rewrite ModelBuilder contrib ops that mobius does not emit in the first place (mobius uses opset-23 RMSNormalization / Attention, not com.microsoft variants)

The README now explains the surgery removal explicitly and lists what might still need a Gemma-4–specific upstream surgery (logit soft-cap).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

OnnxKQuantQuantization emits com.microsoft::MatMulNBits, which is fast on CPU / CUDA but not in the QNN EP's supported-op list. Without MatMulNBitsToQDQ the QNN partitioner rejects every quantized MatMul node and the model silently falls back to CPU, defeating the point of compiling to HTP.

Restore MatMulNBitsToQDQ between the INT4 quant and the static activation quant so each MatMulNBits gets rewritten into the standard MatMul + DequantizeLinear pair the QNN partitioner can claim and lower onto HTP. README updated with an explanation of why both passes are needed.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Make explicit that mobius emits opset-23 Attention (with attention_mask input) for QNN, not com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len), because mobius's QNN ep_capabilities() advertises an empty gqa_dtypes list. The existing AttentionMaskToSequenceLengths GraphSurgery is therefore inapplicable (it only rewrites GQA), and no surgery is needed if HTP's standard-attention kernel lowers cleanly.

Two follow-up options spelled out if HW shows the standard Attention path is too slow on HTP:

(a) extend mobius ep_capabilities for QNN to set gqa_dtypes so the builder emits GQA directly; or
(b) port AttentionMaskToSequenceLengths to also rewrite standard Attention (currently it short-circuits when GQA is absent).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

justinchuby · 2026-05-27T06:10:42Z

Question for QNN reviewers: should mobius emit com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len) for the QNN path, or is the opset-23 standard Attention(attention_mask=[B, past+seq]) better?

Right now mobius's QNN ep_capabilities() advertises an empty gqa_dtypes, so Gemma4TextModel.forward falls through to standard Attention — see src/mobius/models/gemma4.py:1500-1508. The existing AttentionMaskToSequenceLengths surgery only rewrites GroupQueryAttention (olive/passes/onnx/graph_surgeries.py:1638), so it's a no-op on the mobius output.

Three paths forward, and I don't know which is preferred:

1. Have mobius emit GQA directly. Extend mobius's QNN EP capability to include INT4/INT8/uint16 in gqa_dtypes, so Gemma4TextModel switches to the GQA branch. No GraphSurgery needed: mobius emits GQA(query, key, value, past_key, past_value, seqlens_k, total_seq_len, cos_cache, sin_cache, local_window_size=...) directly with the right inputs for HTP.
2. Stay on standard Attention and assume HTP lowers it. Simpler if HTP's standard-attention kernel is fast enough. No mobius change needed.
3. Stay on standard Attention and add a new surgery. Port AttentionMaskToSequenceLengths (or write a new one) that rewrites Attention(..., attention_mask) into a sequence-length form QNN prefers. Most work, but keeps mobius's default output portable.

@jambayk @xiaoyu-work — what does QNN HTP actually want here? Option 1 looks cleanest from the mobius side; willing to send the EP-capability PR if that's the right call.
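To make the two input conventions in this comment concrete, here is a rough pure-Python sketch of how an attention_mask of shape [B, past+seq] relates to the (seqlens_k, total_seq_len) pair that GroupQueryAttention takes. It assumes the convention that seqlens_k holds each row's valid token count minus one and that total_seq_len is the mask width. That is my understanding of the ONNX Runtime contrib op, not something verified against the Olive surgery or on HTP, and the helper name is hypothetical.

```python
def mask_to_seqlens(attention_mask):
    """Derive (seqlens_k, total_seq_len) from a 0/1 attention mask.

    attention_mask: list of rows, shape [B, past+seq]; 1 marks a valid
    token and 0 marks padding.
    ASSUMPTION: seqlens_k[b] = (number of valid tokens in row b) - 1 and
    total_seq_len = mask width.
    """
    if not attention_mask:
        raise ValueError("attention_mask must have at least one row")
    width = len(attention_mask[0])
    if any(len(row) != width for row in attention_mask):
        raise ValueError("all rows must have the same length")
    seqlens_k = [sum(row) - 1 for row in attention_mask]
    total_seq_len = width
    return seqlens_k, total_seq_len


# Batch of two: the first row has one padded position, the second has none.
mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(mask_to_seqlens(mask))
```

The point of the sketch is only that the mask carries strictly more information than the length pair, which is why a rewrite in one direction (mask to lengths) is straightforward while the reverse needs a padding convention.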

Copilot

Pull request overview

Adds a new (WIP / exploratory) Olive recipe under google-gemma-4-E2B-it/QNN/ intended to compile Gemma 4 E2B’s multimodal components into QNN EPContext binaries for execution on Qualcomm Hexagon (HTP) via ONNX Runtime QNN EP.

Changes:

Introduces a QNN-focused Olive pipeline (MobiusBuilder → INT4 quant → static quant → StaticLLM → QNN EPContext generation → compose).
Adds end-to-end README guidance for setup, build, and known limitations for Snapdragon targets.
Registers the recipe via info.yml and adds a QNN-specific requirements.txt.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File	Description
google-gemma-4-E2B-it/QNN/requirements.txt	Adds Python dependencies for running the QNN recipe workflow.
google-gemma-4-E2B-it/QNN/README.md	Documents the intended pipeline, environment setup, build command, and limitations.
google-gemma-4-E2B-it/QNN/info.yml	Registers the recipe metadata for repo scanning/indexing.
google-gemma-4-E2B-it/QNN/config.json	Defines the Olive passes and QNN EPContext generation configuration.

justinchuby · 2026-05-27T06:16:19Z

+datasets
+mobius-ai
+olive-ai
+onnxruntime-gpu
+transformers>=5.0


Pinned olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to match microsoft-Phi-3-mini-4k-instruct/QNN/requirements.txt (the last-validated versions). Kept mobius-ai and transformers>=5.0 unpinned since this recipe is still WIP and the validated set will only firm up after HW validation. Fixed in cbf992c.

Per follow-up from author: mobius isn't published yet, so freezing versions doesn't aid reproducibility — anyone trying the recipe needs floating latest anyway. Reverted the version pins in 811f3a0; we can revisit when the recipe is hardware-validated.

justinchuby · 2026-05-27T06:16:21Z

+### AOT compilation environment (separate venv, x64 with QNN SDK)
+```bash
+pip install olive-ai mobius-ai
+pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps


Pinned to onnxruntime-qnn==1.22.2 to match Phi-3 QNN README. Fixed in cbf992c.

justinchuby · 2026-05-27T06:16:22Z

+    devices:
+      - npu
+    eps: QNNExecutionProvider
+name: gemma4_e2b_qnn


Good catch — renamed top-level name from gemma4_e2b_qnn to gemma4-e2b-qnn to match the recipe name. Fixed in cbf992c.

justinchuby · 2026-05-27T06:16:24Z

+        "mnb_to_qdq": {
+            "type": "MatMulNBitsToQDQ",
+            "use_int4": true,
+            "add_zero_point": true,
+            "save_as_external_data": true
+        },


The PR description was stale. mnb_to_qdq is intentional — OnnxKQuantQuantization emits com.microsoft::MatMulNBits which QNN EP doesn't claim, so without QDQ rewriting every quantized MatMul silently falls back to CPU. Fixed by updating the PR description (commit 1e7c186 already restored the pass; the description was just out of sync).

justinchuby · 2026-05-27T06:16:25Z

+```
+
+## Build
+


Added an explicit note in the Build section: run olive run from the quantization environment; Olive invokes the QNN AOT venv automatically via systems.qnn_system.python_environment_path for the EPContextBinaryGenerator pass. Fixed in cbf992c.

* requirements.txt: pin olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to match the last-validated versions used by the other QNN recipes in this repo (e.g. microsoft-Phi-3-mini-4k-instruct/QNN/). Keep mobius-ai and transformers>=5.0 unpinned for now since this recipe is still WIP and the validated version set will only stabilize after HW validation.
* README: pin onnxruntime-qnn==1.22.2 in the AOT compilation env install command, matching microsoft-Phi-3-mini-4k-instruct/QNN/.
* README: state explicitly that 'olive run' runs from the quantization environment, with Olive invoking the QNN AOT venv via systems.qnn_system.python_environment_path for the EPContextBinary pass. Avoids the easy mistake of running 'olive run' from the QNN venv (which lacks GPU quantization deps).
* info.yml: align the top-level name (gemma4_e2b_qnn → gemma4-e2b-qnn) with the recipe name so scanner tables aren't ambiguous.

PR description updated to drop the stale 'v2 drops MatMulNBitsToQDQ' claim; that pass was restored in 1e7c186 (QNN cannot run MatMulNBits).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

Mobius isn't published yet, so freezing olive-ai / onnxruntime-gpu / onnxruntime-qnn / transformers at specific versions doesn't help reproducibility: anyone trying this recipe needs the floating latest of each anyway. Revert the version pins added in cbf992c and let upstream tracking ride. When the recipe is hardware-validated and the project starts publishing pinned-version-validated recipes we can revisit.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>

justinchuby added 2 commits May 27, 2026 05:47

justinchuby marked this pull request as ready for review May 27, 2026 06:04

Copilot AI review requested due to automatic review settings May 27, 2026 06:04

Copilot started reviewing on behalf of justinchuby May 27, 2026 06:05

justinchuby added 2 commits May 27, 2026 06:05

Copilot AI reviewed May 27, 2026

justinchuby added 2 commits May 27, 2026 06:16

justinchuby wants to merge 6 commits into main from gemma4-qnn-recipe
