Enhance LiteRT performance: enable GPU (OpenCL/Metal) and NPU (QNN) backends #31

## Summary

Inference currently runs **CPU-only** (6 threads). Our Phase-0 benchmark concluded that GPU/NPU "didn't work," but that was a property of the **`litert-community` `.litertlm` bundles we pull**, which don't ship GPU/NPU artifacts, **not** a limitation of LiteRT itself. Current LiteRT-LM ships working GPU and NPU acceleration. This issue tracks enabling it so we reach the runtime's actual performance ceiling, instead of leaving the GPU/NPU idle or adopting a second backend (llama.cpp) that would be *slower* on Android anyway.

## Why this is the right lever (2026 facts)

- LiteRT-LM reaches **~52 tok/s decode on Gemma 4 E2B via the Android GPU (OpenCL)** and **~56 tok/s on iOS (Metal)**, with **NPU support via Qualcomm QNN**.
- By contrast, **llama.cpp on Android is effectively CPU-bound**: its OpenCL Adreno backend gives only modest gains (~6–8 tok/s on a 7B model), and Vulkan is unreliable and crashes on Adreno and Mali. Switching runtimes would therefore not buy us speed.
- Conclusion: our "we're slow" problem is solvable **within LiteRT** by enabling the GPU/NPU path. (Model-catalog/bring-your-own-model is a separate concern and a separate decision.)

## Scope

- [ ] Source or build **GPU-capable `.litertlm` bundles** (Google's optimized artifacts) rather than the community bundles that fail to load on GPU/NPU.
- [ ] Wire the **GPU backend (OpenCL on Android, Metal on iOS)** through the `LocalAiEngine` interface with graceful CPU fallback.
- [ ] Investigate the **NPU path via Qualcomm QNN** for supported devices (Snapdragon).
- [ ] Add a backend-selection / capability-detection step (NPU → GPU → CPU) so we always pick the best available accelerator per device.
- [ ] Re-run the Phase-0 benchmark across CPU vs GPU vs NPU and record tok/s + TTFT per device tier.
- [ ] Confirm multimodal (Gemma 4 E2B vision) still works on the accelerated path.
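
The NPU → GPU → CPU selection with graceful fallback described in the scope list could look roughly like the sketch below. It is language-neutral logic written in Python for brevity, not the app's actual code: `select_backend`, `BackendInitError`, and the loader callables are hypothetical names, and the real wiring would go through `LocalAiEngine` and whatever initialisation calls LiteRT-LM exposes.

```python
from typing import Callable, Dict, List, Tuple

# Preference order from the scope list: NPU first, then GPU, then CPU.
BACKEND_PRIORITY = ("npu", "gpu", "cpu")


class BackendInitError(RuntimeError):
    """Raised by a (hypothetical) loader when a backend cannot be initialised."""


def select_backend(
    loaders: Dict[str, Callable[[], object]],
    priority: Tuple[str, ...] = BACKEND_PRIORITY,
) -> Tuple[str, object, List[str]]:
    """Try each backend in priority order and return the first that loads.

    Returns (backend_name, engine, failures). `failures` records why each
    skipped backend was rejected, so a silent fallback still shows up in logs.
    """
    failures: List[str] = []
    for name in priority:
        loader = loaders.get(name)
        if loader is None:
            # No loader means the device or bundle does not offer this backend.
            failures.append(f"{name}: no loader registered")
            continue
        try:
            return name, loader(), failures
        except BackendInitError as exc:
            # Initialisation failed (e.g. bundle lacks artifacts); try the next one.
            failures.append(f"{name}: {exc}")
    raise BackendInitError("no backend could be initialised: " + "; ".join(failures))
```

Returning the failure reasons alongside the chosen backend is a design choice for this sketch: it would let the benchmark record which accelerator actually ran on each device rather than assuming the preferred one did.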

## Acceptance criteria

- Measurable decode-speed improvement over the current CPU-only baseline on at least one mid/high-tier Android device, with GPU enabled and CPU fallback intact.
- Documented benchmark table (CPU vs GPU vs NPU) committed to the repo.
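
One possible way to turn raw timings into the committed benchmark table is sketched below. The `RunResult` fields and the `markdown_table` helper are hypothetical, and the decode-speed definition (tokens after the first, divided by the time after TTFT) is an assumption that should be aligned with however Phase-0 measured it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    device: str            # device label, e.g. a tier name
    backend: str           # "cpu", "gpu", or "npu"
    ttft_s: float          # seconds from request to first token
    total_s: float         # seconds from request to last token
    tokens_generated: int  # number of output tokens, including the first

    @property
    def decode_tok_per_s(self) -> float:
        # Assumed definition: tokens produced after the first one,
        # divided by the time elapsed after the first token arrived.
        decode_window = self.total_s - self.ttft_s
        if decode_window <= 0 or self.tokens_generated < 2:
            return 0.0
        return (self.tokens_generated - 1) / decode_window


def markdown_table(results: List[RunResult]) -> str:
    """Render results as a markdown table with speedup relative to CPU per device."""
    baseline = {r.device: r.decode_tok_per_s for r in results if r.backend == "cpu"}
    lines = [
        "| Device | Backend | TTFT (s) | Decode (tok/s) | Speedup vs CPU |",
        "| --- | --- | --- | --- | --- |",
    ]
    for r in results:
        base = baseline.get(r.device)
        # Speedup is only meaningful when a non-zero CPU run exists for the device.
        speedup = f"{r.decode_tok_per_s / base:.2f}x" if base else "n/a"
        lines.append(
            f"| {r.device} | {r.backend.upper()} | {r.ttft_s:.2f} "
            f"| {r.decode_tok_per_s:.1f} | {speedup} |"
        )
    return "\n".join(lines)
```

Computing speedup against the same device's CPU run keeps the first acceptance criterion (improvement over the CPU-only baseline) directly readable from the table.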

## References

- LiteRT-LM overview: https://ai.google.dev/edge/litert-lm/overview
- Blazing-fast on-device GenAI with LiteRT-LM: https://developers.googleblog.com/blazing-fast-on-device-genai-with-litert-lm/
- Gemma 4 E2B + Qualcomm QNN deep-dive: https://medium.com/google-developer-experts/bringing-multimodal-gemma-4-e2b-to-the-edge-a-deep-dive-into-litert-lm-and-qualcomm-qnn-4e1e06f3030c
- Run LLMs using LiteRT-LM (NPU): https://ai.google.dev/edge/litert/next/litert_lm_npu
