VitalikDen0/Model-To-NPU

Model-to-NPU Pipeline for Snapdragon

Snapdragon NPU experiments that grew into a real phone-side SDXL pipeline.

WARNING: The WAN end-to-end beta is NOT VERIFIED and may not work at all. Hot-swap WxH buckets and HotSwap LoRA are still test-stage features and can break. Stabilization/polish target after v0.5.0: about 2 weeks.

Docs: English · Русский · Android APK

This repository is about running large diffusion models on Qualcomm Snapdragon devices, not just exporting graphs and calling it a day.

Right now the most complete path is:

  • SDXL on Snapdragon 8 Elite NPU;
  • real phone-side generation with CLIP + split UNet + VAE;
  • a custom persistent QNN server in C;
  • a standalone phone runtime via phone_generate.py;
  • an Android app in APK/.

What is working today

  • SDXL end-to-end exists: checkpoint -> build/export -> deploy -> phone PNG;
  • v0.5.0 APK split SDXL/WAN into separate tabs with independent per-tab generation settings;
  • WAN remains beta: end-to-end path and hot resolution switching are available for testing, but not validated for production;
  • AI Hub helpers already exist for heavyweight compile flows, especially useful for WAN and large UNet pieces.

Current direction

  • Publicly usable now: SDXL
  • Active engineering focus: WAN and FLUX
  • Training / method labs: SD1.5 and SD3.5

SDXL is temporarily frozen as the main product branch while the repo shifts toward broader model-family support.

Performance in one paragraph

The validated warm SDXL path on OnePlus 13 / Snapdragon 8 Elite is still in the ~30 s total class at 1024x1024, 8 steps, with cached CLIP, split UNet, and VAE on-device. In v0.4.8-beta3, the custom QNN server got a stronger HTP perf configuration, and the major decoder regression shrank from roughly 820 ms per decoder pass to about 725–776 ms. A residual tail of around 50 ms versus the historical ideal marker remains, and it is documented honestly instead of being swept under the rug.

Gallery

[Images: SDXL on phone, samples 1–4]

All gallery samples and the currently documented phone-side examples are 1024×1024 outputs from the current Lightning-merged SDXL path.

Proof that it actually runs on-device

  • Earlier public screenshot: 273.6 s total
  • v0.2.0 public marker: 100.8 s total
  • v0.2.3 screenshot (Live Preview ON): 78.0 s total
  • Current cold-start APK proof: 34.6 s total. Measured accelerator-visible time inside this run: ~16.25 s.

Public screenshot lineage so far: 273.6 s → 100.8 s → 78.0 s → 34.6 s, with the fourth slot now showing the current 34.6 s cold-start APK proof image.

Inside that latest run, the accelerator-visible stages add up to ~16.25 s total: CLIP 0.134 s + UNet 14.248 s + VAE 1.872 s. The screenshot-visible 34.6 s total therefore still includes cold start / runtime bring-up / UI orchestration overhead rather than pure accelerator work.
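The breakdown above is plain arithmetic and can be re-checked with a short Python sketch. The stage timings and the 34.6 s total are the figures quoted for this run; the variable names are illustrative only and are not taken from the repository's code.

```python
# Stage timings (seconds) quoted for the cold-start APK proof run.
stage_seconds = {"CLIP": 0.134, "UNet": 14.248, "VAE": 1.872}
screenshot_total = 34.6  # total shown in the screenshot

# Time attributable to the accelerator-visible stages.
accelerator_total = sum(stage_seconds.values())

# Everything else: cold start, runtime bring-up, UI orchestration.
overhead = screenshot_total - accelerator_total

print(f"accelerator-visible: {accelerator_total:.3f} s")
print(f"non-accelerator overhead: {overhead:.3f} s")
print(f"accelerator share of total: {accelerator_total / screenshot_total:.1%}")
```

By this arithmetic, a bit over half of the screenshot-visible time in the cold-start run is spent outside the accelerator stages, which is why the run is not compared directly against the warm-path number.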

The best validated historical warm-path marker remains 30.4 s total. The new proof slot is intentionally described as a cold-start APK run, not as a replacement for that warm-path number.

Fast-path thermals observed during the current short-run proof cycle stayed around 85–95 °C with no visible throttling, so a few back-to-back generations remained usable within the tested burst window.

Important files, explained like a human

  • phone_generate.py — the main phone runtime entrypoint. SDXL really runs through this file; WAN support here is currently a runtime/probe path, not final generation.
  • phone_runtime_accel.py — optional native math/layout helper for scheduler and tensor operations, with a safe NumPy fallback if the native library is unavailable.
  • NPU/qnn_multi_context_server.c — the persistent QNN server that keeps contexts alive and runs split UNet faster than repeated qnn-net-run spawns.
  • SDXL/run_end_to_end.ps1 — the most practical host-side wrapper for checkpoint -> build -> deploy.
  • scripts/build_all.py — reproducible early build stages for SDXL: checkpoint conversion, Lightning merge, ONNX export.
  • scripts/deploy_to_phone.py — pushes runtime files, QNN libs/bins, contexts, and optional TAESD pieces to the phone.
  • WAN 2.1 1.3B/export_and_compile_wan_aihub.py — AI Hub helper for WAN package prep, compile jobs, status, and downloads.
  • WAN 2.1 1.3B/wan_tool.py — WAN helper CLI for model selection, downloads, and phone checks.
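The "optional native helper with a safe NumPy fallback" idea from the phone_runtime_accel.py entry can be sketched as follows. This is not the repository's code: the library name, the C symbol `nchw_to_nhwc_f32`, and the choice of a layout conversion as the example operation are all assumptions made for illustration. The point is only the pattern: try to load a native library once, and fall back to NumPy when it is missing.

```python
import ctypes
import numpy as np

# Hypothetical library name; the real helper may load something else entirely.
_LIB_NAME = "libphone_accel_example.so"


def _load_native():
    """Return the native library handle, or None if it cannot be loaded."""
    try:
        return ctypes.CDLL(_LIB_NAME)
    except OSError:
        return None


_NATIVE = _load_native()


def nchw_to_nhwc(x: np.ndarray) -> np.ndarray:
    """Convert a float32 NCHW tensor to a contiguous NHWC tensor."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    n, c, h, w = x.shape
    # hasattr() is False when the library lacks the (hypothetical) symbol.
    if _NATIVE is not None and hasattr(_NATIVE, "nchw_to_nhwc_f32"):
        out = np.empty((n, h, w, c), dtype=np.float32)
        ptr = ctypes.POINTER(ctypes.c_float)
        _NATIVE.nchw_to_nhwc_f32(
            x.ctypes.data_as(ptr), out.ctypes.data_as(ptr),
            ctypes.c_int(n), ctypes.c_int(c), ctypes.c_int(h), ctypes.c_int(w),
        )
        return out
    # NumPy fallback: same result, no native dependency.
    return np.ascontiguousarray(x.transpose(0, 2, 3, 1))
```

On a machine without the native library, the call silently takes the NumPy branch, so callers never need to know which path ran.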

Repository map

  • SDXL/ — SDXL build/export/verification scripts and experiments
  • WAN 2.1 1.3B/ — WAN research workspace, AI Hub helpers, runtime probes
  • NPU/ — custom native runtime pieces, including the QNN multi-context server
  • APK/ — Android app
  • scripts/ — deployment and utility scripts
  • tokenizer/ — shared tokenizer assets

Recent changes

  • 0.5.0 — APK split into SDXL/WAN tabs with separate saved settings per tab; WAN host flow got hot manifest bucket selection by requested WxH; added WAN 2.1 1.3B/run_end_to_end.ps1 beta wrapper.
  • 0.4.8-beta3 — stronger HTP perf mode in qnn-multi-context-server; major decoder regression mostly fixed; residual tail documented as known issue.
  • 0.4.8-beta2 — APK runtime hotfix plus a dedicated Copy error action.
  • 0.4.8-beta — bundled Python runtime, dual root/no-root paths, TAESD preview intentionally disabled in APK because it hurt the fast path.
  • 0.4.7 — exact CFG forwarding including 1.0, better TAESD failure reporting.
  • 0.4.6 — more deterministic packaged runtime refresh and safer public APK behavior.
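The 0.5.0 entry mentions manifest bucket selection by requested WxH. The actual manifest format and matching rule are not described on this page, so the following is only a sketch under assumptions: the manifest is treated as a list of entries with a width, a height, and a context path; an exact match is preferred; otherwise the bucket with the closest aspect ratio, then the closest area, is chosen. The `Bucket` type, the `select_bucket` function, and all paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    width: int
    height: int
    context: str  # hypothetical path to a compiled context for this size


def select_bucket(buckets, width, height):
    """Pick an exact WxH match, else the nearest by aspect ratio, then area."""
    if not buckets:
        raise ValueError("manifest has no buckets")
    for b in buckets:
        if (b.width, b.height) == (width, height):
            return b
    want_ratio = width / height
    want_area = width * height
    return min(
        buckets,
        key=lambda b: (
            abs(b.width / b.height - want_ratio),
            abs(b.width * b.height - want_area),
        ),
    )


# Hypothetical manifest contents, for illustration only.
manifest = [
    Bucket(832, 480, "ctx/wan_832x480.bin"),
    Bucket(480, 832, "ctx/wan_480x832.bin"),
    Bucket(512, 512, "ctx/wan_512x512.bin"),
]

print(select_bucket(manifest, 480, 832).context)  # exact match
print(select_bucket(manifest, 640, 360).context)  # nearest landscape bucket
```

A real implementation might instead reject non-matching sizes outright; falling back to the nearest bucket is a design choice assumed here, not something the changelog confirms.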

About the archive

The repo accumulated a lot of historical notes, one-off reviews, and working documents. They are still useful, but they no longer belong on the front page.

License

This repo is distributed under PolyForm Noncommercial License 1.0.0.

That means, in plain language:

  • non-commercial use/study/modification/forks are allowed;
  • redistributions must keep the required notice text;
  • third-party components keep their own licenses.

About

Model-to-NPU pipelines for Qualcomm Snapdragon: QNN/ONNX/Android runtimes for on-device image and video generation.
