Add Gemma 4 E2B QNN (Snapdragon Hexagon NPU) recipe — WIP#432

Open
justinchuby wants to merge 6 commits into main from gemma4-qnn-recipe

@justinchuby justinchuby commented May 27, 2026

Summary

Adds google-gemma-4-E2B-it/QNN/ — exploratory Olive recipe for compiling all four Gemma 4 components (decoder, embedding, vision_encoder, audio_encoder) into QNN EPContext binaries for HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.

Marked WIP because it has not yet been validated on hardware.

Approach

All components compile to QNN. Olive's CompositeModelHandler dispatch runs quant + StaticLLM per component automatically; EPContextBinaryGenerator and ComposeOnnxModels (both _accepts_composite_model = True) finalize the multimodal package without any manual splitting.

Pipeline

HfModel (multimodal Gemma 4)
   ↓ MobiusBuilder (fp32)               4 ONNX components + sidecars (genai_config, tokenizer, processors)
   ↓ OnnxKQuantQuantization (INT4)      mobius-standard Q4_K_M; weights → com.microsoft::MatMulNBits
   ↓ MatMulNBitsToQDQ                   MatMulNBits → MatMul + DequantizeLinear (QNN-compatible QDQ)
   ↓ OnnxStaticQuantization             activations uint16 / weights uint8 (calibrated)
   ↓ StaticLLM                          static shapes
   ↓ EPContextBinaryGenerator           HTP blobs, weight-shared across components
   ↓ ComposeOnnxModels                  final package
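
The ordering constraint in this pipeline can be expressed as a small sanity check. This is a minimal sketch: `check_qdq_ordering` is a hypothetical helper, not part of Olive, and it only looks at pass type names in the order listed above.

```python
# Pass types in the order given in the pipeline diagram.
PIPELINE = [
    "MobiusBuilder",
    "OnnxKQuantQuantization",
    "MatMulNBitsToQDQ",
    "OnnxStaticQuantization",
    "StaticLLM",
    "EPContextBinaryGenerator",
    "ComposeOnnxModels",
]


def check_qdq_ordering(pass_types):
    """Return True if MatMulNBitsToQDQ sits between the INT4 and static quant passes.

    Hypothetical helper for illustration only; it is not an Olive API.
    """
    try:
        kquant = pass_types.index("OnnxKQuantQuantization")
        to_qdq = pass_types.index("MatMulNBitsToQDQ")
        static = pass_types.index("OnnxStaticQuantization")
    except ValueError:
        return False  # at least one of the three passes is missing
    return kquant < to_qdq < static


print(check_qdq_ordering(PIPELINE))
```

A check like this could run against the `type` fields of the recipe's pass list before a long build starts, assuming the passes are listed in execution order.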

MatMulNBitsToQDQ is required between OnnxKQuantQuantization and OnnxStaticQuantization: the former emits com.microsoft::MatMulNBits which the QNN EP partitioner does not claim, so without QDQ rewriting every quantized MatMul silently falls back to CPU.
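
One way to confirm that the rewrite actually happened is to count any `com.microsoft::MatMulNBits` nodes left in a component after the pass. The sketch below is an assumption-laden illustration: `leftover_matmulnbits` is a hypothetical helper, it only inspects top-level graph nodes (not subgraphs), and the demo uses stand-in objects so it runs without the `onnx` package.

```python
from collections import Counter
from types import SimpleNamespace


def count_ops(nodes):
    """Count nodes by (domain, op_type).

    Works with any objects exposing .domain and .op_type, which includes
    the NodeProto entries in onnx.load(path).graph.node.
    """
    return Counter((n.domain, n.op_type) for n in nodes)


def leftover_matmulnbits(nodes):
    """Number of contrib MatMulNBits nodes still present (hypothetical helper)."""
    return count_ops(nodes)[("com.microsoft", "MatMulNBits")]


# Stand-in nodes for illustration; a real check would pass graph.node instead.
demo_nodes = [
    SimpleNamespace(domain="com.microsoft", op_type="MatMulNBits"),
    SimpleNamespace(domain="", op_type="DequantizeLinear"),
    SimpleNamespace(domain="", op_type="MatMul"),
]
print(leftover_matmulnbits(demo_nodes))
```

A non-zero count after `MatMulNBitsToQDQ` would suggest that some quantized MatMul nodes could still fall back to CPU.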

Why no GraphSurgeries?

Existing QNN recipes (Phi-3, Qwen) use RemoveRopeMultiCache, AttentionMaskToSequenceLengths, SimplifiedLayerNormToL2Norm to rewrite ModelBuilder-specific contrib ops into HTP-friendly shapes. MobiusBuilder emits opset-23 standard ops (RMSNormalization, Attention) instead of the contrib variants, so those surgeries are either no-ops or inapplicable. Gemma-4–specific surgeries may still be needed (notably for the final logit soft-cap cap * tanh(x / cap)), but the set borrowed from the existing recipes does not cover them.

Known limitations (documented in README)

  1. Logit soft-cap may not lower to HTP — may need a new RemoveLogitSoftcap GraphSurgery upstream, or host post-processing.
  2. Hybrid local/global attention with dual head_dim (256 / 512) — HTP per-layer dispatch needs testing.
  3. per_layer_inputs data flow — should "just work" when both embedding + decoder are on QNN, but StaticLLM might need a hint.
  4. Tokenizer calibration: wikitext-2 under-represents the 256k Gemma 4 vocab (image/audio specials).
  5. StaticLLM context_length=64 — placeholder; tune for target SKU.
  6. Standard Attention vs GroupQueryAttention — mobius's QNN ep_capabilities() advertises an empty gqa_dtypes, so the decoder uses opset-23 Attention(attention_mask) rather than GQA(seqlens_k, total_seq_len). See discussion comment.
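
For limitation 1, the soft-cap itself is simple enough to apply on the host. The sketch below implements `cap * tanh(x / cap)` in plain Python; the cap value of 30.0 and the sample logits are illustrative assumptions, not values taken from the Gemma 4 config. Because tanh is strictly increasing, the cap preserves the ordering of logits, so greedy (argmax) decoding should pick the same token with or without it, while temperature-based sampling would see a different distribution.

```python
import math


def soft_cap(logits, cap):
    """Apply cap * tanh(x / cap) element-wise to a list of logits."""
    return [cap * math.tanh(x / cap) for x in logits]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [-80.0, 3.5, 42.0, 41.0]      # made-up example values
capped = soft_cap(logits, cap=30.0)    # cap=30.0 is an assumption for illustration

# Every capped value lies strictly inside (-cap, cap) for these inputs.
assert all(abs(v) < 30.0 for v in capped)
# Ordering is preserved, so the greedy choice is unchanged.
assert argmax(capped) == argmax(logits)
```

This is only an argument about the math; whether removing or relocating the op is acceptable for a given sampling setup still needs checking.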

Asks for review

  • Anyone with Snapdragon HW willing to run this and report which pass breaks?
  • Is OnnxKQuantQuantization the right INT4 path for QNN, or should this use OnnxBlockWiseRtnQuantization / GptqModel?
  • Should I attempt the upstream RemoveLogitSoftcap Olive surgery now, or wait for HW validation to confirm it's actually the blocker?

Related

Adds google-gemma-4-E2B-it/QNN/ as a starting-point recipe for
compiling Gemma 4's text decoder into a QNN EPContext binary for
HTP execution on Snapdragon X / Copilot+ PC / Snapdragon 8 Gen 3+.

Pipeline:
  MobiusBuilder fp32 → OnnxKQuantQuantization (INT4 weights)
                     → MatMulNBitsToQDQ
                     → GraphSurgeries (RemoveRopeMultiCache /
                       AttentionMaskToSequenceLengths /
                       SimplifiedLayerNormToL2Norm)
                     → OnnxStaticQuantization (uint16 act / uint8 wt)
                     → SplitModel + StaticLLM
                     → EPContextBinaryGenerator (HTP blob)
                     → ComposeOnnxModels

Marked WORK IN PROGRESS in the README. Known limitations called out
explicitly:

  1) MobiusBuilder always exports the multimodal 4-component package
     for google/gemma-4-E2B-it; no current way to force the text-only
     gemma4_text path from the recipe config. Splitting the QNN passes
     to apply only to the decoder component is still TODO.
  2) GraphSurgeries borrowed from Phi-3 / Qwen QNN recipes have not
     been verified against Gemma 4's hybrid local/global attention,
     dual head_dim KV cache, or final logit soft-capping (tanh-cap).
  3) per_layer_inputs (second embedding output, consumed by every
     decoder block) needs custom split orchestration if embedding stays
     on CPU and decoder runs on HTP.
  4) Calibration via wikitext-2 may under-represent multimodal-format
     tokens (256k vocab includes vision/audio specials).
  5) StaticLLM context_length=64 is a placeholder for HW tuning.

Filed as exploratory template so other contributors with Snapdragon
HW can iterate.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
…urgeries

All four Gemma 4 components (decoder + embedding + vision_encoder +
audio_encoder) compile to QNN EPContext binaries together. Olive's
CompositeModelHandler dispatch runs quant + StaticLLM per component
automatically, then EPContextBinaryGenerator + ComposeOnnxModels
(both _accepts_composite_model = True) finalise the multimodal
package.

Drop:

  * SplitModel — not needed when all components stay on QNN
  * MatMulNBitsToQDQ — was a ModelBuilder-specific stepping stone
  * GraphSurgeries with RemoveRopeMultiCache /
    AttentionMaskToSequenceLengths / SimplifiedLayerNormToL2Norm —
    those rewrite ModelBuilder contrib ops that mobius does not emit
    in the first place (mobius uses opset-23 RMSNormalization /
    Attention, not com.microsoft variants)

The README now explains the surgery removal explicitly and lists what
might still need a Gemma-4–specific upstream surgery (logit soft-cap).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
@justinchuby justinchuby marked this pull request as ready for review May 27, 2026 06:04
Copilot AI review requested due to automatic review settings May 27, 2026 06:04
OnnxKQuantQuantization emits com.microsoft::MatMulNBits which is fast
on CPU / CUDA but not in the QNN EP's supported-op list. Without
MatMulNBitsToQDQ the QNN partitioner rejects every quantized MatMul
node and the model silently falls back to CPU — defeating the point
of compiling to HTP.

Restore MatMulNBitsToQDQ between the INT4 quant and the static
activation quant so each MatMulNBits gets rewritten into the standard
MatMul + DequantizeLinear pair the QNN partitioner can claim and
lower onto HTP.

README updated with an explanation of why both passes are needed.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Make explicit that mobius emits opset-23 Attention (with attention_mask
input) for QNN, not com.microsoft::GroupQueryAttention(seqlens_k,
total_seq_len), because mobius's QNN ep_capabilities() advertises an
empty gqa_dtypes list. The existing AttentionMaskToSequenceLengths
GraphSurgery is therefore inapplicable (it only rewrites GQA), and
no surgery is needed if HTP's standard-attention kernel lowers cleanly.

Two follow-up options spelled out if HW shows the standard Attention
path is too slow on HTP:
  (a) extend mobius ep_capabilities for QNN to set gqa_dtypes so the
      builder emits GQA directly; or
  (b) port AttentionMaskToSequenceLengths to also rewrite standard
      Attention (currently it short-circuits when GQA is absent).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
justinchuby commented May 27, 2026

Question for QNN reviewers: should mobius emit com.microsoft::GroupQueryAttention(seqlens_k, total_seq_len) for the QNN path, or is the opset-23 standard Attention(attention_mask=[B, past+seq]) better?

Right now mobius's QNN ep_capabilities() advertises an empty gqa_dtypes, so Gemma4TextModel.forward falls through to standard Attention — see src/mobius/models/gemma4.py:1500-1508. The existing AttentionMaskToSequenceLengths surgery only rewrites GroupQueryAttention (olive/passes/onnx/graph_surgeries.py:1638), so it's a no-op on the mobius output.

Three paths forward and I don't know which is preferred:

  1. Have mobius emit GQA directly. Extend mobius's QNN EP capability to include INT4/INT8/uint16 in gqa_dtypes, so Gemma4TextModel switches to the GQA branch. No GraphSurgery needed — mobius emits GQA(query, key, value, past_key, past_value, seqlens_k, total_seq_len, cos_cache, sin_cache, local_window_size=...) directly with the right inputs for HTP.

  2. Stay on standard Attention and assume HTP lowers it. Simpler if HTP's standard-attention kernel is fast enough. No mobius change needed.

  3. Stay on standard Attention and add a new surgery. Port AttentionMaskToSequenceLengths (or write a new one) that rewrites Attention(..., attention_mask) into a sequence-length form QNN prefers. Most work, but keeps mobius's default output portable.
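
To make the difference between the two input forms concrete, here is a minimal sketch of deriving GQA-style length inputs from a 0/1 attention mask. It assumes the GroupQueryAttention convention in which `seqlens_k` holds each row's total valid length minus one and `total_seq_len` is the mask width (past + seq); both assumptions should be checked against the ONNX Runtime contrib-op documentation. `mask_to_sequence_lengths` is a hypothetical name and is not the code of the existing surgery.

```python
def mask_to_sequence_lengths(attention_mask):
    """Convert a [B, past+seq] 0/1 mask into (seqlens_k, total_seq_len).

    attention_mask: list of rows, each a list of 0/1 ints of equal length.
    seqlens_k[b] is assumed to be (number of valid positions in row b) - 1.
    """
    if not attention_mask:
        raise ValueError("attention_mask must contain at least one row")
    total_seq_len = len(attention_mask[0])
    if any(len(row) != total_seq_len for row in attention_mask):
        raise ValueError("all mask rows must have the same length")
    seqlens_k = [sum(row) - 1 for row in attention_mask]
    return seqlens_k, total_seq_len


mask = [
    [1, 1, 1, 0],  # three valid tokens, one padding position
    [1, 1, 1, 1],  # four valid tokens
]
print(mask_to_sequence_lengths(mask))
```

A real surgery would do the equivalent with ONNX ops (for example a reduction over the mask) rather than Python lists.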

@jambayk @xiaoyu-work — what does QNN HTP actually want here? Option 1 looks cleanest from the mobius side; willing to send the EP-capability PR if that's the right call.

Copilot AI left a comment

Pull request overview

Adds a new (WIP / exploratory) Olive recipe under google-gemma-4-E2B-it/QNN/ intended to compile Gemma 4 E2B’s multimodal components into QNN EPContext binaries for execution on Qualcomm Hexagon (HTP) via ONNX Runtime QNN EP.

Changes:

  • Introduces a QNN-focused Olive pipeline (MobiusBuilder → INT4 quant → static quant → StaticLLM → QNN EPContext generation → compose).
  • Adds end-to-end README guidance for setup, build, and known limitations for Snapdragon targets.
  • Registers the recipe via info.yml and adds a QNN-specific requirements.txt.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| google-gemma-4-E2B-it/QNN/requirements.txt | Adds Python dependencies for running the QNN recipe workflow. |
| google-gemma-4-E2B-it/QNN/README.md | Documents the intended pipeline, environment setup, build command, and limitations. |
| google-gemma-4-E2B-it/QNN/info.yml | Registers the recipe metadata for repo scanning/indexing. |
| google-gemma-4-E2B-it/QNN/config.json | Defines the Olive passes and QNN EPContext generation configuration. |

Comment on lines +1 to +5
datasets
mobius-ai
olive-ai
onnxruntime-gpu
transformers>=5.0
Contributor Author

Pinned olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to match microsoft-Phi-3-mini-4k-instruct/QNN/requirements.txt (the last-validated versions). Kept mobius-ai and transformers>=5.0 unpinned since this recipe is still WIP and the validated set will only firm up after HW validation. Fixed in cbf992c.

Contributor Author

Per follow-up from author: mobius isn't published yet, so freezing versions doesn't aid reproducibility — anyone trying the recipe needs floating latest anyway. Reverted the version pins in 811f3a0; we can revisit when the recipe is hardware-validated.

Comment thread google-gemma-4-E2B-it/QNN/README.md Outdated
### AOT compilation environment (separate venv, x64 with QNN SDK)
```bash
pip install olive-ai mobius-ai
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple "onnxruntime-qnn" --no-deps
Contributor Author

Pinned to onnxruntime-qnn==1.22.2 to match Phi-3 QNN README. Fixed in cbf992c.

Comment thread google-gemma-4-E2B-it/QNN/info.yml Outdated
devices:
- npu
eps: QNNExecutionProvider
name: gemma4_e2b_qnn
Contributor Author

Good catch — renamed top-level name from gemma4_e2b_qnn to gemma4-e2b-qnn to match the recipe name. Fixed in cbf992c.

Comment on lines +40 to +45
"mnb_to_qdq": {
"type": "MatMulNBitsToQDQ",
"use_int4": true,
"add_zero_point": true,
"save_as_external_data": true
},
Contributor Author

The PR description was stale. mnb_to_qdq is intentional — OnnxKQuantQuantization emits com.microsoft::MatMulNBits which QNN EP doesn't claim, so without QDQ rewriting every quantized MatMul silently falls back to CPU. Fixed by updating the PR description (commit 1e7c186 already restored the pass; the description was just out of sync).

```

## Build

Contributor Author

Added an explicit note in the Build section: run olive run from the quantization environment; Olive invokes the QNN AOT venv automatically via systems.qnn_system.python_environment_path for the EPContextBinaryGenerator pass. Fixed in cbf992c.
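
As a rough illustration of the setting referred to here, the sketch below builds the `systems` fragment as a Python dict and prints it as JSON. It assumes Olive's `PythonEnvironment` system type with a `python_environment_path` field and an `accelerators` list; the path is a hypothetical placeholder, and the actual fragment in this recipe's config.json may differ.

```python
import json

# Hypothetical location of the QNN AOT virtual environment's bin directory.
QNN_VENV_BIN = "/path/to/qnn-venv/bin"

config_fragment = {
    "systems": {
        "qnn_system": {
            "type": "PythonEnvironment",  # assumed Olive system type
            "python_environment_path": QNN_VENV_BIN,
            "accelerators": [
                {"execution_providers": ["QNNExecutionProvider"]}
            ],
        }
    }
}


def qnn_env_path(config):
    """Read systems.qnn_system.python_environment_path from a config dict."""
    return config["systems"]["qnn_system"]["python_environment_path"]


print(json.dumps(config_fragment, indent=2))
```

With a fragment like this in place, `olive run` is started from the quantization environment and only the EPContext generation step is expected to use the QNN venv.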

* requirements.txt: pin olive-ai==0.9.3 and onnxruntime-gpu==1.21.1 to
  match the last-validated versions used by the other QNN recipes in
  this repo (e.g. microsoft-Phi-3-mini-4k-instruct/QNN/). Keep
  mobius-ai and transformers>=5.0 unpinned for now since this recipe
  is still WIP and the validated version set will only stabilize after
  HW validation.

* README: pin onnxruntime-qnn==1.22.2 in the AOT compilation env
  install command, matching microsoft-Phi-3-mini-4k-instruct/QNN/.

* README: state explicitly that 'olive run' runs from the
  quantization environment, with Olive invoking the QNN AOT venv via
  systems.qnn_system.python_environment_path for the EPContextBinary
  pass. Avoids the easy mistake of running 'olive run' from the QNN
  venv (which lacks GPU quantization deps).

* info.yml: align the top-level name (gemma4_e2b_qnn → gemma4-e2b-qnn)
  with the recipe name so scanner tables aren't ambiguous.

PR description updated to drop the stale 'v2 drops MatMulNBitsToQDQ'
claim — that pass was restored in 1e7c186 (QNN cannot run
MatMulNBits).

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Mobius isn't published yet, so freezing olive-ai / onnxruntime-gpu /
onnxruntime-qnn / transformers at specific versions doesn't help
reproducibility — anyone trying this recipe needs the floating latest
of each anyway. Revert the version pins added in cbf992c and let
upstream tracking ride. When the recipe is hardware-validated and the
project starts publishing pinned-version-validated recipes we can
revisit.

Signed-off-by: justinchuby <11205048+justinchuby@users.noreply.github.com>