Snapdragon NPU experiments that grew into a real phone-side SDXL pipeline.

> [!WARNING]
> WAN end-to-end beta is NOT VERIFIED and may not work at all. Hot-swap `WxH` buckets and HotSwap LoRA are still test-stage features and can break. Stabilization/polish target after `v0.5.0`: about 2 weeks.
Docs: English · Русский · Android APK
This repository is about running large diffusion models on Qualcomm Snapdragon devices, not just exporting graphs and calling it a day.
Right now the most complete path is:
- SDXL on Snapdragon 8 Elite NPU;
- real phone-side generation with CLIP + split UNet + VAE;
- a custom persistent QNN server in C;
- a standalone phone runtime via `phone_generate.py`;
- an Android app in `APK/`.
- SDXL end-to-end exists: `checkpoint -> build/export -> deploy -> phone PNG`;
- `v0.5.0` APK split SDXL/WAN into separate tabs with independent per-tab generation settings;
- WAN remains beta: end-to-end path and hot resolution switching are available for testing, but not validated for production;
- AI Hub helpers already exist for heavyweight compile flows, especially useful for WAN and large UNet pieces.
- Publicly usable now: SDXL
- Active engineering focus: WAN and FLUX
- Training / method labs: SD1.5 and SD3.5
SDXL is temporarily frozen as the main product branch while the repo shifts toward broader model-family support.
- English documentation
- Russian documentation
- Android app notes
- WAN 2.1 workspace
- Archive index (EN)
- Archive index (RU)
- License
- Notice / attribution
The validated warm SDXL path on OnePlus 13 / Snapdragon 8 Elite still takes about 30 s total at 1024×1024, 8 steps, with cached CLIP, split UNet, and VAE on-device. In v0.4.8-beta3, the custom QNN server got a stronger HTP perf configuration, and the major decoder regression shrank from about 820 ms per decoder pass to 725–776 ms. A residual tail of about 50 ms versus the historical ideal marker remains, and it is documented as a known issue instead of being swept under the rug.
All gallery samples and the currently documented phone-side examples are 1024×1024 outputs from the current Lightning-merged SDXL path.
Public screenshot lineage so far: 273.6 s → 100.8 s → 78.0 s → 34.6 s, with the fourth slot now showing the current 34.6 s cold-start APK proof image.
Inside that latest run, the accelerator-visible stages add up to ~16.25 s total: CLIP 0.134 s + UNet 14.248 s + VAE 1.872 s. The screenshot-visible 34.6 s total therefore still includes cold start / runtime bring-up / UI orchestration overhead rather than pure accelerator work.
The best validated historical warm-path marker remains 30.4 s total. The new proof slot is intentionally described as a cold-start APK run, not as a replacement for that warm-path number.
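The timing figures above can be cross-checked with a few lines of arithmetic. This is only a sketch that re-derives numbers already quoted in the text; the variable names are illustrative and nothing here is measured on a device.

```python
# Figures copied from the text above; this snippet measures nothing itself.
STAGE_SECONDS = {"CLIP": 0.134, "UNet": 14.248, "VAE": 1.872}
COLD_START_TOTAL_S = 34.6                          # screenshot-visible total
SCREENSHOT_LINEAGE_S = [273.6, 100.8, 78.0, 34.6]  # public screenshot history

# Time spent in accelerator-visible stages.
accelerator_s = sum(STAGE_SECONDS.values())

# Everything else: cold start, runtime bring-up, UI orchestration.
overhead_s = COLD_START_TOTAL_S - accelerator_s
overhead_share = overhead_s / COLD_START_TOTAL_S

# Speedup from the first public screenshot to the latest one.
overall_speedup = SCREENSHOT_LINEAGE_S[0] / SCREENSHOT_LINEAGE_S[-1]

print(f"accelerator-visible: {accelerator_s:.3f} s")
print(f"non-accelerator overhead: {overhead_s:.3f} s ({overhead_share:.0%} of total)")
print(f"first-to-latest screenshot speedup: {overall_speedup:.1f}x")
```

By these numbers, roughly half of the cold-start run is spent outside the accelerator stages, which is consistent with treating the 34.6 s figure separately from the warm-path marker.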
Observed fast-path thermals in the current short-run proof cycle sat around 85–95°C without visible throttling, so a few back-to-back generations remained practically safe/usable in the tested burst window.
- `phone_generate.py`: the main phone runtime entrypoint. SDXL generation runs through this file; WAN support here is currently a runtime/probe path, not final generation.
- `phone_runtime_accel.py`: optional native math/layout helper for scheduler and tensor operations, with a safe NumPy fallback if the native library is unavailable.
- `NPU/qnn_multi_context_server.c`: the persistent QNN server that keeps contexts alive and runs split UNet faster than repeated `qnn-net-run` spawns.
- `SDXL/run_end_to_end.ps1`: the most practical host-side wrapper for `checkpoint -> build -> deploy`.
- `scripts/build_all.py`: reproducible early build stages for SDXL: checkpoint conversion, Lightning merge, ONNX export.
- `scripts/deploy_to_phone.py`: pushes runtime files, QNN libs/bins, contexts, and optional TAESD pieces to the phone.
- `WAN 2.1 1.3B/export_and_compile_wan_aihub.py`: AI Hub helper for WAN package prep, compile jobs, status, and downloads.
- `WAN 2.1 1.3B/wan_tool.py`: WAN helper CLI for model selection, downloads, and phone checks.
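The "optional native helper with a safe fallback" idea described for `phone_runtime_accel.py` can be illustrated with a small sketch. Everything named here is an assumption: the library name `libphone_accel_example.so`, the symbol `scaled_add_f32`, and the function `scaled_add` are hypothetical and not taken from the repository. The real helper falls back to NumPy; this sketch uses plain Python for the fallback only to stay dependency-free.

```python
import ctypes

# Hypothetical library name; the real helper's file name is not given here.
_NATIVE_LIB_NAME = "libphone_accel_example.so"


def _load_native(name):
    """Return a ctypes handle, or None if the library cannot be loaded."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


_native = _load_native(_NATIVE_LIB_NAME)


def scaled_add(x, d, scale):
    """Return x + scale * d elementwise, the shape of a scheduler update."""
    if len(x) != len(d):
        raise ValueError("x and d must have the same length")

    # Hypothetical exported symbol; a missing symbol falls through to the
    # pure-Python path the same way a missing library does.
    fn = getattr(_native, "scaled_add_f32", None) if _native is not None else None
    if fn is not None:
        n = len(x)
        buf_t = ctypes.c_float * n
        xb, db, out = buf_t(*x), buf_t(*d), buf_t()
        fn.restype = None
        fn.argtypes = [ctypes.POINTER(ctypes.c_float)] * 3 + [
            ctypes.c_float,
            ctypes.c_size_t,
        ]
        fn(xb, db, out, scale, n)
        return list(out)

    # Fallback path: same result, no native dependency.
    return [xi + scale * di for xi, di in zip(x, d)]
```

The design point is that callers use one function regardless of whether the native library is present, so a failed load degrades speed rather than breaking generation.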
- `SDXL/`: SDXL build/export/verification scripts and experiments
- `WAN 2.1 1.3B/`: WAN research workspace, AI Hub helpers, runtime probes
- `NPU/`: custom native runtime pieces, including the QNN multi-context server
- `APK/`: Android app
- `scripts/`: deployment and utility scripts
- `tokenizer/`: shared tokenizer assets
- 0.5.0: APK split into SDXL/WAN tabs with separate saved settings per tab; WAN host flow got hot manifest bucket selection by requested `WxH`; added `WAN 2.1 1.3B/run_end_to_end.ps1` beta wrapper.
- 0.4.8-beta3: stronger HTP perf mode in `qnn-multi-context-server`; major decoder regression mostly fixed; residual tail documented as known issue.
- 0.4.8-beta2: APK runtime hotfix plus a dedicated Copy error action.
- 0.4.8-beta: bundled Python runtime, dual root/no-root paths, TAESD preview intentionally disabled in APK because it hurt the fast path.
- 0.4.7: exact CFG forwarding including `1.0`, better TAESD failure reporting.
- 0.4.6: more deterministic packaged runtime refresh and safer public APK behavior.
The repo accumulated a lot of historical notes, one-off reviews, and working documents. They are still useful, but they no longer belong on the front page.
- Use ARCHIVE_EN.md for English archive links.
- Use ARCHIVE_RU.md for Russian archive links.
This repo is distributed under PolyForm Noncommercial License 1.0.0.
That means, in plain language:
- non-commercial use/study/modification/forks are allowed;
- redistributions must keep the required notice text;
- third-party components keep their own licenses.







