Enhance LiteRT performance: enable GPU (OpenCL/Metal) and NPU (QNN) backends #31

@sagar-develop

Summary

Inference today runs CPU-only (6 threads). Our Phase-0 benchmark concluded that GPU/NPU "didn't work," but that result reflects the litert-community .litertlm bundles we pull, which don't ship GPU/NPU artifacts; it is not a limitation of LiteRT itself. Current LiteRT-LM has real, shipping GPU and NPU acceleration. This issue tracks unlocking it so that we reach the runtime's actual performance ceiling, instead of leaving the GPU/NPU on the table or reaching for a second backend (llama.cpp) that would be slower on Android anyway.

Why this is the right lever (2026 facts)

  • LiteRT-LM reaches ~52 tok/s decode on Gemma 4 E2B via the Android GPU (OpenCL) and ~56 tok/s on iOS (Metal), with NPU support via Qualcomm QNN.
  • By contrast, llama.cpp on Android is effectively CPU-bound: the OpenCL Adreno backend gives only modest gains (~6–8 tok/s on a 7B model), and Vulkan is unreliable and crashes on Adreno and Mali GPUs. Switching runtimes would therefore not buy us speed.
  • Conclusion: our "we're slow" problem is solvable within LiteRT by enabling the GPU/NPU path. (Model-catalog/bring-your-own-model is a separate concern and a separate decision.)

Scope

  • Source or build GPU-capable .litertlm bundles (Google's optimized artifacts) rather than the community bundles that fail to load on GPU/NPU.
  • Wire the GPU backend (OpenCL on Android, Metal on iOS) through the LocalAiEngine interface with graceful CPU fallback.
  • Investigate the NPU path via Qualcomm QNN for supported devices (Snapdragon).
  • Add a backend-selection / capability-detection step (NPU → GPU → CPU) so we always pick the best available accelerator per device.
  • Re-run the Phase-0 benchmark across CPU vs GPU vs NPU and record tok/s + TTFT per device tier.
  • Confirm multimodal (Gemma 4 E2B vision) still works on the accelerated path.
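
The capability-detection step (NPU → GPU → CPU with graceful CPU fallback) could look roughly like the sketch below. It is language-neutral logic written in Python for brevity, not the LocalAiEngine API: `Backend`, `select_backend`, and the `try_load` callback are hypothetical names, and the real implementation would call into LiteRT-LM from the app's own code.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Tuple


class Backend(Enum):
    NPU = "npu"
    GPU = "gpu"
    CPU = "cpu"


# Preference order from the scope list: best accelerator first, CPU last.
PREFERENCE = (Backend.NPU, Backend.GPU, Backend.CPU)


def select_backend(
    available: Iterable[Backend],
    try_load: Callable[[Backend], None],
) -> Tuple[Backend, Dict[Backend, str]]:
    """Pick the first backend that actually loads the model.

    `available` is what device detection reports (hypothetical input).
    `try_load` is a hypothetical callback that raises if the bundle
    cannot be loaded on that backend (e.g. missing GPU/NPU artifacts).
    Returns the chosen backend plus the failures seen on the way,
    so they can be logged alongside benchmark results.
    """
    available = set(available)
    failures: Dict[Backend, str] = {}
    for backend in PREFERENCE:
        # CPU is always attempted; accelerators only if detected.
        if backend is not Backend.CPU and backend not in available:
            continue
        try:
            try_load(backend)
            return backend, failures
        except Exception as exc:  # any load failure falls through to the next backend
            failures[backend] = str(exc)
    raise RuntimeError(f"no backend could load the model: {failures}")
```

Attempting the load, rather than trusting detection alone, matters here because the Phase-0 failure came from the bundle and not from the device: a GPU can be present and still fail to load a bundle that ships no GPU artifacts.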

Acceptance criteria

  • Measurable decode-speed improvement over the current CPU-only baseline on at least one mid/high-tier Android device, with GPU enabled and CPU fallback intact.
  • Documented benchmark table (CPU vs GPU vs NPU) committed to the repo.
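
For the benchmark table, a minimal sketch of how tok/s and TTFT might be derived from per-run timestamps and rendered as a Markdown table is shown below. The `Run` record, its field names, and the choice to attribute the first token to prefill (so decode speed covers the remaining tokens) are assumptions for illustration, not the Phase-0 harness's actual definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One benchmark run; all timestamps are in seconds (hypothetical schema)."""
    device_tier: str
    backend: str          # "CPU", "GPU", or "NPU"
    t_request: float      # when the prompt was submitted
    t_first_token: float  # when the first output token arrived
    t_last_token: float   # when the last output token arrived
    decoded_tokens: int   # total output tokens, including the first

    @property
    def ttft_ms(self) -> float:
        # Time to first token: prompt submission to first output token.
        return (self.t_first_token - self.t_request) * 1000.0

    @property
    def decode_tok_s(self) -> float:
        # Assumed definition: the first token belongs to prefill, so the
        # decode window covers the remaining (decoded_tokens - 1) tokens.
        window = self.t_last_token - self.t_first_token
        if self.decoded_tokens < 2 or window <= 0:
            return 0.0
        return (self.decoded_tokens - 1) / window


def to_markdown(runs: List[Run]) -> str:
    """Render runs as a Markdown table suitable for committing to the repo."""
    lines = [
        "| Device tier | Backend | TTFT (ms) | Decode (tok/s) |",
        "| --- | --- | ---: | ---: |",
    ]
    for r in runs:
        lines.append(
            f"| {r.device_tier} | {r.backend} | {r.ttft_ms:.0f} | {r.decode_tok_s:.1f} |"
        )
    return "\n".join(lines)
```

Whatever definitions the harness ends up using, recording them next to the table keeps CPU, GPU, and NPU rows comparable across device tiers.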
