Add Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe — WIP#432
justinchuby wants to merge 6 commits into
Adds google-gemma-4-E2B-it/QNN/ as a starting-point recipe for
compiling Gemma 4's text decoder into a QNN EPContext binary for
HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.
Pipeline:
MobiusBuilder fp32 → OnnxKQuantQuantization (INT4 weights)
→ MatMulNBitsToQDQ
→ GraphSurgeries (RemoveRopeMultiCache /
AttentionMaskToSequenceLengths /
SimplifiedLayerNormToL2Norm)
→ OnnxStaticQuantization (uint16 act / uint8 wt)
→ SplitModel + StaticLLM
→ EPContextBinaryGenerator (HTP blob)
→ ComposeOnnxModels
Marked WORK IN PROGRESS in the README. Known limitations called out
explicitly:
1) MobiusBuilder always exports the multimodal 4-component package
for google/gemma-4-E2B-it; no current way to force the text-only
gemma4_text path from the recipe config. Splitting the QNN passes
to apply only to the decoder component is still TODO.
2) GraphSurgeries borrowed from Phi-3 / Qwen QNN recipes have not
been verified against Gemma 4's hybrid local/global attention,
dual head_dim KV cache, or final logit soft-capping (tanh-cap).
3) per_layer_inputs (second embedding output, consumed by every
decoder block) needs custom split orchestration if embedding stays
on CPU and decoder runs on HTP.
4) Calibration via wikitext-2 may under-represent multimodal-format
tokens (256k vocab includes vision/audio specials).
5) StaticLLM context_length=64 is a placeholder for HW tuning.
Filed as exploratory template so other contributors with Snapdragon
HW can iterate.
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
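Limitation 5 concerns static shapes. The sketch below is a rough illustration only: it assumes that a static-shape prefill consumes the prompt in fixed windows of `context_length` tokens and right-pads the last window. How `StaticLLM` actually pads or masks is not confirmed here. `chunk_prompt` and `PAD_ID` are hypothetical names.

```python
from typing import List

CONTEXT_LENGTH = 64  # placeholder value from the recipe; tune per target SKU
PAD_ID = 0           # hypothetical padding token id (assumption)


def chunk_prompt(token_ids: List[int],
                 context_length: int = CONTEXT_LENGTH,
                 pad_id: int = PAD_ID) -> List[List[int]]:
    """Split a prompt into fixed-size windows, right-padding the last one."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    chunks = []
    for start in range(0, len(token_ids), context_length):
        window = token_ids[start:start + context_length]
        # Pad so every window has exactly context_length entries.
        window = window + [pad_id] * (context_length - len(window))
        chunks.append(window)
    return chunks


prompt = list(range(1, 151))  # 150 fake token ids
chunks = chunk_prompt(prompt)
print(len(chunks), [len(c) for c in chunks])
```

A smaller `context_length` wastes less padding on short prompts but needs more prefill invocations for long ones, which is presumably why the value needs hardware tuning.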
…urgeries
All four Gemma 4 components (decoder + embedding + vision_encoder +
audio_encoder) compile to QNN EPContext binaries together. Olive's
CompositeModelHandler dispatch runs quant + StaticLLM per component
automatically, then EPContextBinaryGenerator + ComposeOnnxModels
(both _accepts_composite_model = True) finalise the multimodal
package.
Drop:
* SplitModel — not needed when all components stay on QNN
* MatMulNBitsToQDQ — was a ModelBuilder-specific stepping stone
* GraphSurgeries with RemoveRopeMultiCache /
AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm —
those rewrite ModelBuilder contrib ops that mobius does not emit
in the first place (mobius uses opset-23 RMSNormalization /
Attention, not com.microsoft variants)
The README now explains the surgery removal explicitly and lists what
might still need a Gemma-4–specific upstream surgery (logit soft-cap).
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
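For reference, the logit soft-cap mentioned above has the form `cap * tanh(x / cap)` (as written in the PR description). Below is a minimal sketch of that arithmetic applied on the host. The cap value is illustrative, not taken from the Gemma 4 config, and `soft_cap` is a hypothetical helper.

```python
import math
from typing import List


def soft_cap(logits: List[float], cap: float) -> List[float]:
    """Apply cap * tanh(x / cap) elementwise; outputs stay within [-cap, cap]."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


CAP = 30.0  # illustrative value only (assumption)
logits = [-100.0, -1.0, 0.0, 1.0, 100.0]
capped = soft_cap(logits, CAP)

# tanh is monotonic, so the position of the largest logit is preserved
# (barring floating-point saturation for extremely large inputs).
argmax_before = max(range(len(logits)), key=logits.__getitem__)
argmax_after = max(range(len(capped)), key=capped.__getitem__)
print(capped, argmax_before == argmax_after)
```

Because the function is monotonic, greedy decoding would pick the same token with or without the cap, which is one reason host post-processing is a plausible alternative to a graph surgery.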
OnnxKQuantQuantization emits com.microsoft::MatMulNBits, which is fast on CPU / CUDA but is not in the QNN EP's supported-op list. Without MatMulNBitsToQDQ the QNN partitioner rejects every quantized MatMul node and the model silently falls back to CPU, defeating the point of compiling to HTP.
Restore MatMulNBitsToQDQ between the INT4 quant and the static activation quant so each MatMulNBits gets rewritten into the standard MatMul + DequantizeLinear pair the QNN partitioner can claim and lower onto HTP.
README updated with an explanation of why both passes are needed.
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
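To make the MatMul + DequantizeLinear pair concrete, here is a small sketch of the arithmetic it expresses: block-wise `w = (q - zero_point) * scale` followed by an ordinary dot product. This is not Olive's implementation. The block size, scales, and values are made up for illustration, and `dequantize_blockwise` is a hypothetical helper.

```python
from typing import List


def dequantize_blockwise(q: List[int], scales: List[float],
                         zero_points: List[int], block_size: int) -> List[float]:
    """DequantizeLinear-style arithmetic: w = (q - zero_point) * scale per block."""
    out = []
    for i, qv in enumerate(q):
        block = i // block_size  # each block shares one scale / zero point
        out.append((qv - zero_points[block]) * scales[block])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# One weight column of length K=8 stored as unsigned 4-bit values (0..15).
q_col = [0, 15, 8, 9, 7, 8, 12, 4]
scales = [0.5, 0.25]   # one scale per block of 4 (illustrative)
zero_points = [8, 8]   # one zero point per block (illustrative)
block_size = 4

w = dequantize_blockwise(q_col, scales, zero_points, block_size)
x = [1.0] * 8          # activation row
y = dot(x, w)          # the plain MatMul that follows the dequantize step
print(w, y)
```

A fused op computes the same result in one node. The QDQ form just makes the dequantize step visible as a standard operator that a partitioner can pattern-match.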
Make explicit that mobius emits opset-23 Attention (with attention_mask
input) for QNN, not com.microsoft::GroupQueryAttention(seqlens_k,
total_seq_len), because mobius's QNN ep_capabilities() advertises an
empty gqa_dtypes list. The existing AttentionMaskToSequenceLengths
GraphSurgery is therefore inapplicable (it only rewrites GQA), and
no surgery is needed if HTP's standard-attention kernel lowers cleanly.
Two follow-up options spelled out if HW shows the standard Attention
path is too slow on HTP:
(a) extend mobius ep_capabilities for QNN to set gqa_dtypes so the
builder emits GQA directly; or
(b) port AttentionMaskToSequenceLengths to also rewrite standard
Attention (currently it short-circuits when GQA is absent).
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
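As background for the two options, the sketch below shows the kind of conversion a mask-to-sequence-lengths rewrite performs. It assumes that `seqlens_k` is the per-row count of valid tokens minus one and that `total_seq_len` is the mask width; check both against the GroupQueryAttention operator documentation. `mask_to_sequence_lengths` is a hypothetical helper, not the Olive surgery itself.

```python
from typing import List, Tuple


def mask_to_sequence_lengths(attention_mask: List[List[int]]) -> Tuple[List[int], int]:
    """Derive (seqlens_k, total_seq_len) from a 0/1 attention mask.

    Assumption: seqlens_k[b] = (number of valid tokens in row b) - 1,
    total_seq_len = width of the mask.
    """
    if not attention_mask:
        raise ValueError("attention_mask must have at least one row")
    total_seq_len = len(attention_mask[0])
    if any(len(row) != total_seq_len for row in attention_mask):
        raise ValueError("all mask rows must have the same length")
    seqlens_k = [sum(row) - 1 for row in attention_mask]
    return seqlens_k, total_seq_len


mask = [
    [1, 1, 1, 0, 0],  # 3 valid tokens, 2 padded
    [1, 1, 1, 1, 1],  # 5 valid tokens
]
seqlens_k, total_seq_len = mask_to_sequence_lengths(mask)
print(seqlens_k, total_seq_len)
```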
Question for QNN reviewers: should mobius emit Right now mobius's QNN Two paths forward and I don't know which is preferred:
@jambayk @xiaoyu-work: what does QNN HTP actually want here? Option 1 looks cleanest from the mobius side; willing to send the EP-capability PR if that's the right call.
Pull request overview
Adds a new (WIP / exploratory) Olive recipe under google-gemma-4-E2B-it/QNN/ intended to compile Gemma 4 E2B’s multimodal components into QNN EPContext binaries for execution on Qualcomm Hexagon (HTP) via ONNX Runtime QNN EP.
Changes:
- Introduces a QNN-focused Olive pipeline (`MobiusBuilder` → INT4 quant → static quant → `StaticLLM` → QNN EPContext generation → compose).
- Adds end-to-end README guidance for setup, build, and known limitations for Snapdragon targets.
- Registers the recipe via `info.yml` and adds a QNN-specific `requirements.txt`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| google-gemma-4-E2B-it/QNN/requirements.txt | Adds Python dependencies for running the QNN recipe workflow. |
| google-gemma-4-E2B-it/QNN/README.md | Documents the intended pipeline, environment setup, build command, and limitations. |
| google-gemma-4-E2B-it/QNN/info.yml | Registers the recipe metadata for repo scanning/indexing. |
| google-gemma-4-E2B-it/QNN/config.json | Defines the Olive passes and QNN EPContext generation configuration. |
datasets
mobius-ai
olive-ai
onnxruntime-gpu
transformers>=5.0
Pinned olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to match microsoft-Phi-3-mini-4k-instruct/QNN/requirements.txt (the last-validated versions). Kept mobius-ai and transformers>=5.0 unpinned since this recipe is still WIP and the validated set will only firm up after HW validation. Fixed in cbf992c.
Per follow-up from author: mobius isn't published yet, so freezing versions doesn't aid reproducibility — anyone trying the recipe needs floating latest anyway. Reverted the version pins in 811f3a0; we can revisit when the recipe is hardware-validated.
### AOT compilation environment (separate venv, x64 with QNN SDK)
```bash
pip install olive-ai mobius-ai
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps
```
Pinned to onnxruntime-qnn==1.22.2 to match Phi-3 QNN README. Fixed in cbf992c.
devices:
- npu
eps: QNNExecutionProvider
name: gemma4_e2b_qnn
Good catch — renamed top-level name from gemma4_e2b_qnn to gemma4-e2b-qnn to match the recipe name. Fixed in cbf992c.
"mnb_to_qdq": {
    "type": "MatMulNBitsToQDQ",
    "use_int4": true,
    "add_zero_point": true,
    "save_as_external_data": true
},
The PR description was stale. mnb_to_qdq is intentional — OnnxKQuantQuantization emits com.microsoft::MatMulNBits which QNN EP doesn't claim, so without QDQ rewriting every quantized MatMul silently falls back to CPU. Fixed by updating the PR description (commit 1e7c186 already restored the pass; the description was just out of sync).
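The ordering constraint described in this reply can be sketched as a small Python check over a `passes` mapping. This assumes Olive runs passes in the order they are listed. Every pass key except `mnb_to_qdq` is a hypothetical name, and only the `type` values and the `mnb_to_qdq` options come from this PR.

```python
import json

# Pass order as described in the PR; keys other than "mnb_to_qdq" are
# hypothetical labels chosen for this sketch.
PASS_ORDER = [
    ("builder", "MobiusBuilder"),
    ("kquant", "OnnxKQuantQuantization"),
    ("mnb_to_qdq", "MatMulNBitsToQDQ"),
    ("static_quant", "OnnxStaticQuantization"),
    ("static_llm", "StaticLLM"),
    ("ep_context", "EPContextBinaryGenerator"),
    ("compose", "ComposeOnnxModels"),
]

passes = {name: {"type": pass_type} for name, pass_type in PASS_ORDER}
# Options shown in the reviewed config.json hunk.
passes["mnb_to_qdq"].update(
    {"use_int4": True, "add_zero_point": True, "save_as_external_data": True}
)


def qdq_rewrite_is_between_quant_passes(passes: dict) -> bool:
    """True if MatMulNBitsToQDQ sits after the INT4 quant and before static quant."""
    types = [p["type"] for p in passes.values()]  # dicts keep insertion order
    return (types.index("OnnxKQuantQuantization")
            < types.index("MatMulNBitsToQDQ")
            < types.index("OnnxStaticQuantization"))


print(json.dumps({"passes": passes}, indent=2))
print(qdq_rewrite_is_between_quant_passes(passes))
```

A check like this could run in CI to catch a description or config drifting out of sync again, though whether that is worth the maintenance is a judgment call.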
## Build
Added an explicit note in the Build section: run olive run from the quantization environment; Olive invokes the QNN AOT venv automatically via systems.qnn_system.python_environment_path for the EPContextBinaryGenerator pass. Fixed in cbf992c.
* requirements.txt: pin olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to match the last-validated versions used by the other QNN recipes in this repo (e.g. microsoft-Phi-3-mini-4k-instruct/QNN/). Keep mobius-ai and transformers>=5.0 unpinned for now since this recipe is still WIP and the validated version set will only stabilize after HW validation.
* README: pin onnxruntime-qnn==1.22.2 in the AOT compilation env install command, matching microsoft-Phi-3-mini-4k-instruct/QNN/.
* README: state explicitly that 'olive run' runs from the quantization environment, with Olive invoking the QNN AOT venv via systems.qnn_system.python_environment_path for the EPContextBinaryGenerator pass. Avoids the easy mistake of running 'olive run' from the QNN venv (which lacks GPU quantization deps).
* info.yml: align the top-level name (gemma4_e2b_qnn → gemma4-e2b-qnn) with the recipe name so scanner tables aren't ambiguous.
PR description updated to drop the stale 'v2 drops MatMulNBitsToQDQ' claim; that pass was restored in 1e7c186 (QNN cannot run MatMulNBits).
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Mobius isn't published yet, so freezing olive-ai / onnxruntime-gpu / onnxruntime-qnn / transformers at specific versions doesn't help reproducibility: anyone trying this recipe needs the floating latest of each anyway. Revert the version pins added in cbf992c and let upstream tracking ride. When the recipe is hardware-validated and the project starts publishing pinned-version-validated recipes we can revisit.
Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Summary
Adds `google-gemma-4-E2B-it/QNN/`, an exploratory Olive recipe for compiling all four Gemma 4 components (decoder, embedding, vision_encoder, audio_encoder) into QNN EPContext binaries for HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.
Marked WIP because not yet hardware-validated.
Approach
All components compile to QNN. Olive's `CompositeModelHandler` dispatch runs quant + `StaticLLM` per component automatically; `EPContextBinaryGenerator` and `ComposeOnnxModels` (both `_accepts_composite_model = True`) finalize the multimodal package without any manual splitting.
Pipeline
`MatMulNBitsToQDQ` is required between `OnnxKQuantQuantization` and `OnnxStaticQuantization`: the former emits `com.microsoft::MatMulNBits`, which the QNN EP partitioner does not claim, so without QDQ rewriting every quantized MatMul silently falls back to CPU.
Why no `GraphSurgeries`?
Existing QNN recipes (Phi-3, Qwen) use `RemoveRopeMultiCache`, `AttentionMaskToSequenceLengths`, `SimplifiedLayerNormToL2Norm` to rewrite ModelBuilder-specific contrib ops into HTP-friendly shapes. `MobiusBuilder` emits opset-23 standard ops (`RMSNormalization`, `Attention`) instead of the contrib variants, so those surgeries are either no-ops or inapplicable. Gemma-4-specific surgeries may still be needed (notably for the final logit soft-cap `cap * tanh(x / cap)`), but the existing borrowed set is not it.
Known limitations (documented in README)
- `RemoveLogitSoftcap` GraphSurgery upstream, or host post-processing.
- `per_layer_inputs` data flow: should "just work" when both embedding + decoder are on QNN, but `StaticLLM` might need a hint.
- `wikitext-2` under-represents the 256k Gemma 4 vocab (image/audio specials).
- `StaticLLM context_length=64`: placeholder; tune for target SKU.
- `Attention` vs `GroupQueryAttention`: mobius's QNN `ep_capabilities()` advertises an empty `gqa_dtypes`, so the decoder uses opset-23 `Attention(attention_mask)` rather than `GQA(seqlens_k, total_seq_len)`. See discussion comment.
Asks for review
- Is `OnnxKQuantQuantization` the right INT4 path for QNN, or should this use `OnnxBlockWiseRtnQuantization` / `GptqModel`?
- `RemoveLogitSoftcap` Olive surgery now, or wait for HW validation to confirm it's actually the blocker?
Related