[QNN EP] Add prepare_and_load session option for single-session AOT inference #517
Draft
qti-mbadnara wants to merge 3 commits into
Summary
Add a new session config flag, `ep.qnnexecutionprovider.enable_htp_prepare_and_load=1`, that performs AOT compilation and context loading within a single ORT session. This allows large models (e.g., LLMs) to bypass the single process-domain (PD) memory limit imposed by the QNN JIT flow without requiring two separate sessions.

Motivation / Context
QNN EP's JIT flow places all graph splits in a single QNN process domain (PD), which exhausts memory for large models. The existing AOT workaround requires two sessions:
1. `prepare_only=1`: compile and emit `_ctx.onnx`
2. Load `_ctx.onnx` for inference (QNN spreads splits across multiple PDs)

This two-session flow is cumbersome for embedded customers who want a single API call. The new `prepare_and_load` option performs both steps internally:

Behavior matrix
`ep.context_enable` / `prepare_and_load`: `_ctx.onnx` + `.bin` → reload → infer. Artifact persists.

Test Plan: prepare_and_load
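For a quick scripted check of the single-session flow, the flag can also be set through the ORT Python API. The sketch below is a minimal illustration and not part of this PR: the helper names `prepare_and_load_config` and `create_session` are hypothetical, `QnnHtp.dll` is an assumed backend path, and Path A / Path B are assumed to correspond to `ep.context_enable` set to 0 / 1. It only has an effect on an onnxruntime build that includes the QNN EP and this change.

```python
from typing import Dict, Optional

# Flag name taken from this PR's summary.
PREPARE_AND_LOAD = "ep.qnnexecutionprovider.enable_htp_prepare_and_load"
# Existing ORT EP-context session config keys.
CONTEXT_ENABLE = "ep.context_enable"
CONTEXT_FILE_PATH = "ep.context_file_path"


def prepare_and_load_config(persist: bool, ctx_file_path: Optional[str] = None) -> Dict[str, str]:
    """Build session config entries for Path A (persist=False) or Path B (persist=True).

    Assumption: Path A means ep.context_enable=0 and Path B means ep.context_enable=1.
    """
    config = {PREPARE_AND_LOAD: "1", CONTEXT_ENABLE: "1" if persist else "0"}
    # A file path is only meaningful when the artifact is persisted.
    if persist and ctx_file_path:
        config[CONTEXT_FILE_PATH] = ctx_file_path
    return config


def create_session(model_path: str, config: Dict[str, str], backend_path: str = "QnnHtp.dll"):
    """Create a single ORT session on the QNN EP with the given config entries."""
    # Imported lazily so the config helper above works without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    for key, value in config.items():
        options.add_session_config_entry(key, value)
    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["QNNExecutionProvider"],
        provider_options=[{"backend_path": backend_path}],
    )
```

For example, `create_session("model.onnx", prepare_and_load_config(persist=True, ctx_file_path="model_ctx.onnx"))` would request the persist-artifact path, where `model.onnx` is a placeholder model name.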
Unit Tests
Run the new prepare_and_load tests:
`./onnxruntime_provider_test.exe --gtest_filter="*PrepareAndLoad*"`
Expected: 5 tests pass (PathA, PathB, MutuallyExclusive, ContextDisabledWithFilePath, EmbedModeOverridden)
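The test names MutuallyExclusive, ContextDisabledWithFilePath, and EmbedModeOverridden suggest three option-resolution rules. The sketch below models them in plain Python as an illustration only; it is not the PR's C++ implementation. The function name `resolve_options` is hypothetical, the `prepare_only` key is a placeholder because the full key name is not given in this description, and applying the embed-mode override whenever `prepare_and_load` is set is an assumption.

```python
from typing import Dict, List, Tuple

PREPARE_AND_LOAD = "ep.qnnexecutionprovider.enable_htp_prepare_and_load"
PREPARE_ONLY = "prepare_only"  # placeholder: full session config key not given in this PR text
CONTEXT_ENABLE = "ep.context_enable"
CONTEXT_FILE_PATH = "ep.context_file_path"
CONTEXT_EMBED_MODE = "ep.context_embed_mode"  # "1" embeds the binary, "0" keeps it external


def resolve_options(config: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (resolved config, warnings); raise ValueError on contradictory options."""
    resolved = dict(config)
    warnings: List[str] = []
    if resolved.get(PREPARE_AND_LOAD, "0") != "1":
        return resolved, warnings  # the rules below apply only to the new flag

    # MutuallyExclusive: prepare_and_load cannot be combined with prepare_only.
    if resolved.get(PREPARE_ONLY, "0") == "1":
        raise ValueError("prepare_and_load and prepare_only are mutually exclusive")

    # ContextDisabledWithFilePath: a file path without persistence is contradictory.
    if resolved.get(CONTEXT_ENABLE, "0") != "1" and resolved.get(CONTEXT_FILE_PATH):
        raise ValueError("ep.context_file_path set while ep.context_enable is off")

    # EmbedModeOverridden: embedded mode is forced to external, with a warning.
    if resolved.get(CONTEXT_EMBED_MODE, "0") == "1":
        resolved[CONTEXT_EMBED_MODE] = "0"
        warnings.append("embed mode overridden: context binary forced to external")

    return resolved, warnings
```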
Manual Validation with onnxruntime_perf_test
Test 1: Path A — Prepare and load, no artifact
Verify:
- No `_ctx.onnx` or `_qnn.bin` file created near the model

Test 2: Path B - Prepare and load + persist artifact
Verify:
- `model_ctx.onnx` exists on disk
- `model_ctx_qnn.bin` exists on disk (external binary)

Test 3: Load saved artifact from Test 2 (existing AOT flow)
Verify:
Test 4: Regression — JIT flow (default, no new flags)
Verify:
Test 5: Regression — Prepare-only flow (existing AOT step 1)
Verify:
- `prep_only_ctx.onnx` is created on disk

Test 6: Error - Both prepare_and_load + prepare_only (mutually exclusive)
Verify:
Test 7: Error — Contradictory options (no persist + file path)
Verify:
Test 8: Embed mode override (warning + forced to external)
Verify:
- `embed_test_ctx.onnx` exists
- `embed_test_ctx_qnn.bin` exists (external binary, NOT embedded)

Test 9: Prepare and load on already-compiled context model (warning, no-op)
(Uses `model_ctx.onnx` from Test 2)
Verify:
Summary Checklist
- `.onnx` + `.bin` created
- `.bin` created
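The file-existence checks in the Verify steps can be scripted. The sketch below is a hypothetical helper, not part of the PR. It assumes the external binary sits next to the context model and is named by appending `_qnn.bin` to the context model's stem, as in the `model_ctx.onnx` / `model_ctx_qnn.bin` and `embed_test_ctx.onnx` / `embed_test_ctx_qnn.bin` pairs in this test plan.

```python
from pathlib import Path
from typing import Dict


def external_binary_for(ctx_model: Path) -> Path:
    """Map e.g. model_ctx.onnx to model_ctx_qnn.bin (naming assumed from this test plan)."""
    return ctx_model.with_name(ctx_model.stem + "_qnn.bin")


def artifact_status(ctx_model: Path) -> Dict[str, bool]:
    """Report whether the context model and its external binary exist on disk."""
    return {
        "ctx_model": ctx_model.is_file(),
        "external_bin": external_binary_for(ctx_model).is_file(),
    }
```

Under these assumptions, Test 1 would expect both entries to be `False`, while Tests 2 and 8 would expect both to be `True`.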