Summary
Inference today runs CPU-only (6 threads). Our Phase-0 benchmark concluded GPU/NPU "didn't work," but that was a property of the litert-community .litertlm bundles we pull — they don't ship GPU/NPU artifacts — not a limitation of LiteRT itself. Current LiteRT-LM has real, shipping GPU and NPU acceleration. This issue tracks unlocking it so we get the runtime's actual performance ceiling instead of leaving the GPU/NPU on the table or reaching for a second backend (llama.cpp) that would be slower on Android anyway.
Why this is the right lever (2026 facts)
- LiteRT-LM reaches ~52 tok/s decode on Gemma 4 E2B via the Android GPU (OpenCL) and ~56 tok/s on iOS (Metal), with NPU support via Qualcomm QNN.
- By contrast, llama.cpp on Android is effectively CPU-bound: the OpenCL Adreno backend gives only modest gains (~6–8 tok/s on a 7B model), and Vulkan is unreliable and prone to crashes on Adreno and Mali GPUs. Switching runtimes would therefore not buy us speed.
- Conclusion: our "we're slow" problem is solvable within LiteRT by enabling the GPU/NPU path. (Model-catalog/bring-your-own-model is a separate concern and a separate decision.)
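Scope
- [...] .litertlm bundles (Google's optimized artifacts) rather than the community bundles that fail to load on GPU/NPU.
- [...] LocalAiEngine interface with graceful CPU fallback.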
Acceptance criteria
- Measurable decode-speed improvement over the current CPU-only baseline on at least one mid/high-tier Android device, with GPU enabled and CPU fallback intact.
- Documented benchmark table (CPU vs GPU vs NPU) committed to the repo.
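The two criteria above could be checked with a small offline script along these lines. This is only a sketch under stated assumptions: `BenchResult`, `pick_backend`, `render_table`, and the `try_init` callback are hypothetical names invented for illustration, not LiteRT-LM or app APIs, and the measurements themselves would have to come from real on-device runs. The NPU-then-GPU preference order is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BenchResult:
    backend: str           # backend that actually ran: "cpu", "gpu" or "npu"
    decode_tokens: int     # tokens generated during the decode phase
    decode_seconds: float  # wall-clock time spent decoding

    @property
    def tok_per_s(self) -> float:
        return self.decode_tokens / self.decode_seconds


def pick_backend(preferred: Sequence[str], try_init: Callable[[str], bool]) -> str:
    """Return the first accelerator that initializes; otherwise fall back to CPU.

    `try_init` is a hypothetical hook standing in for whatever the engine does
    to load a model on a given backend. A False return or an exception both
    count as "this backend is unavailable".
    """
    for name in preferred:
        try:
            if try_init(name):
                return name
        except Exception:
            continue  # treat a failed init like an unavailable backend
    return "cpu"  # CPU is assumed to be the always-available baseline


def render_table(results: Sequence[BenchResult], baseline: str = "cpu") -> str:
    """Render a Markdown table of decode speed and speedup over the baseline."""
    base = next((r for r in results if r.backend == baseline), None)
    if base is None:
        raise ValueError(f"no result for baseline backend {baseline!r}")
    lines = [
        f"| Backend | Decode tok/s | Speedup vs {baseline.upper()} |",
        "|---|---|---|",
    ]
    for r in results:
        speedup = r.tok_per_s / base.tok_per_s
        lines.append(f"| {r.backend.upper()} | {r.tok_per_s:.1f} | {speedup:.2f}x |")
    return "\n".join(lines)
```

Calling `pick_backend(["npu", "gpu"], try_init)` before each run and recording the returned name in `BenchResult.backend` keeps the table honest about which backend actually executed, so a silent CPU fallback cannot be reported as a GPU result.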
References