[QNN EP] Add prepare_and_load session option for single-session AOT inference #517

Draft
qti-mbadnara wants to merge 3 commits into main from dev/qti-mbadnara/enable_prepare_and_load

qti-mbadnara commented Jun 11, 2026

Summary

Add a new session config flag ep.qnnexecutionprovider.enable_htp_prepare_and_load=1 that performs AOT compilation and context loading within a single ORT session. This allows large models (e.g., LLMs) to bypass the single process-domain (PD) memory limit imposed by the QNN JIT flow without requiring two separate sessions.

Motivation / Context

QNN EP's JIT flow places all graph splits in a single QNN process domain (PD), which exhausts memory for large models. The existing AOT workaround requires two sessions:

  1. Session 1: prepare_only=1 — compile and emit _ctx.onnx
  2. Session 2: load _ctx.onnx for inference (QNN spreads splits across multiple PDs)

This two-session flow is cumbersome for embedded customers who want a single API call. The new prepare_and_load option performs both steps internally:

  1. Compile the model (JIT path)
  2. Extract the QNN context binary
  3. Release the compile-time single-PD context
  4. Reload the binary via the AOT path (multi-PD)
  5. Session is immediately ready for inference

Behavior matrix

| ep.context_enable | prepare_and_load | Behavior |
|---|---|---|
| 0 | 0 | Existing JIT flow (unchanged) |
| 1 | 0 | Existing AOT flow (unchanged) |
| 0 | 1 | New Path A: compile → reload from memory → infer. No persisted artifact. |
| 1 | 1 | New Path B: compile → write _ctx.onnx + .bin → reload → infer. Artifact persists. |
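
As a reading aid, the matrix above can be expressed as a small lookup. This is only an illustrative Python sketch, not the EP's C++ implementation; the function name `select_flow` and the returned labels are hypothetical and chosen to mirror the table rows.

```python
def select_flow(context_enable: bool, prepare_and_load: bool) -> str:
    """Map the two flags to the flow named in the behavior matrix.

    Hypothetical helper for illustration only; the labels mirror the
    table rows and are not identifiers from the QNN EP source.
    """
    flows = {
        (False, False): "JIT",     # existing flow, unchanged
        (True, False): "AOT",      # existing flow, unchanged
        (False, True): "Path A",   # compile, reload from memory, infer
        (True, True): "Path B",    # compile, write artifact, reload, infer
    }
    return flows[(context_enable, prepare_and_load)]


if __name__ == "__main__":
    for ctx in (False, True):
        for pal in (False, True):
            print(int(ctx), int(pal), select_flow(ctx, pal))
```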

Test Plan: prepare_and_load

Unit Tests

Run new prepare_and_load tests

./onnxruntime_provider_test.exe --gtest_filter="*PrepareAndLoad*"

Expected: 5 tests pass (PathA, PathB, MutuallyExclusive, ContextDisabledWithFilePath, EmbedModeOverridden)


Manual Validation with onnxruntime_perf_test

Test 1: Path A — Prepare and load, no artifact

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.qnnexecutionprovider.enable_htp_prepare_and_load|1" <model.onnx>

Verify:

  • Inference completes successfully
  • No _ctx.onnx or _qnn.bin file created near the model
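
The manual tests in this plan differ only in their `-C` entries and the model argument, so the command line can be assembled from a dictionary. The following is a minimal sketch under that assumption; `build_perf_test_args` is a hypothetical helper, `model.onnx` is a placeholder, and the fixed arguments are copied from the Test 1 command.

```python
def build_perf_test_args(model, config_entries=None,
                         exe=r".\onnxruntime_perf_test.exe"):
    """Build an argument list shaped like the commands in this test plan.

    Hypothetical helper: the fixed arguments are copied from the Test 1
    command; each config entry becomes one -C "key|value" pair.
    """
    args = [
        exe,
        "--plugin_ep_libs", "QNNExecutionProvider|onnxruntime_providers_qnn.dll",
        "--plugin_eps", "QNNExecutionProvider",
        "-i", "backend_path|QnnHTP.dll",
        "-m", "times", "-r", "1", "-p", "burst",
    ]
    for key, value in (config_entries or {}).items():
        args += ["-C", f"{key}|{value}"]
    args.append(model)
    return args


if __name__ == "__main__":
    # Path A: only the new flag is set.
    path_a = build_perf_test_args(
        "model.onnx",
        {"ep.qnnexecutionprovider.enable_htp_prepare_and_load": "1"},
    )
    print(" ".join(path_a))
```

The resulting list can be passed to `subprocess.run` on a machine that has the test binary and the QNN libraries available.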

Test 2: Path B — Prepare and load + persist artifact

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.context_enable|1" -C "ep.context_file_path|model_ctx.onnx" -C "ep.context_embed_mode|0" -C "ep.qnnexecutionprovider.enable_htp_prepare_and_load|1" <model.onnx>

Verify:

  • Inference completes successfully
  • model_ctx.onnx exists on disk
  • model_ctx_qnn.bin exists on disk (external binary)
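
The file checks above can be scripted. The sketch below assumes the external binary is named by replacing the `.onnx` suffix of the context file with `_qnn.bin`; that rule is inferred only from the file names in this test plan (`model_ctx.onnx` and `model_ctx_qnn.bin`), and both helper names are hypothetical.

```python
from pathlib import Path


def expected_artifacts(context_file_path):
    """Return (context model path, external binary path).

    Assumption: the binary name is the context file stem plus "_qnn.bin",
    as suggested by the file names listed in this test plan.
    """
    ctx = Path(context_file_path)
    return ctx, ctx.with_name(ctx.stem + "_qnn.bin")


def missing_artifacts(context_file_path):
    """List the expected artifact paths that do not exist on disk."""
    return [p for p in expected_artifacts(context_file_path) if not p.exists()]


if __name__ == "__main__":
    missing = missing_artifacts("model_ctx.onnx")
    if missing:
        print("missing:", ", ".join(str(p) for p in missing))
    else:
        print("both artifacts present")
```

For Path A the same helper can be used the other way round: both paths should be reported as missing after the run.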

Test 3: Load saved artifact from Test 2 (existing AOT flow)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst model_ctx.onnx

Verify:

  • Loads and runs without recompilation
  • Startup is faster than Test 2 (no compile step)

Test 4: Regression — JIT flow (default, no new flags)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst <model.onnx>

Verify:

  • Works as before, no behavior change

Test 5: Regression — Prepare-only flow (existing AOT step 1)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.context_enable|1" -C "ep.context_file_path|prep_only_ctx.onnx" -C "ep.qnnexecutionprovider.enable_htp_prepare_only|1" <model.onnx>

Verify:

  • prep_only_ctx.onnx is created on disk
  • Inference does NOT run (returns EP_FAIL or error about prepare_only mode)

Test 6: Error — Both prepare_and_load + prepare_only (mutually exclusive)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.context_enable|1" -C "ep.qnnexecutionprovider.enable_htp_prepare_only|1" -C "ep.qnnexecutionprovider.enable_htp_prepare_and_load|1" <model.onnx>

Verify:

  • Session creation fails with error containing "mutually exclusive"

Test 7: Error — Contradictory options (no persist + file path)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.context_file_path|some_path.onnx" -C "ep.qnnexecutionprovider.enable_htp_prepare_and_load|1" <model.onnx>

Verify:

  • Session creation fails with error containing "Contradictory"

Test 8: Embed mode override (warning + forced to external)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.context_enable|1" -C "ep.context_file_path|embed_test_ctx.onnx" -C "ep.context_embed_mode|1" -C "ep.qnnexecutionprovider.enable_htp_prepare_and_load|1" <model.onnx>

Verify:

  • Warning in logs: "Overriding ep.context_embed_mode to 0"
  • Inference completes successfully
  • embed_test_ctx.onnx exists
  • embed_test_ctx_qnn.bin exists (external binary, NOT embedded)

Test 9: Prepare and load on already-compiled context model (warning, no-op)

.\onnxruntime_perf_test.exe --plugin_ep_libs "QNNExecutionProvider|onnxruntime_providers_qnn.dll" --plugin_eps "QNNExecutionProvider" -i "backend_path|QnnHTP.dll" -m times -r 1 -p burst -C "ep.qnnexecutionprovider.enable_htp_prepare_and_load|1" model_ctx.onnx

(Uses model_ctx.onnx from Test 2)

Verify:

  • Warning in logs: "prepare_and_load=1 is ignored because the input model is already a pre-compiled context model"
  • Loads and runs normally via existing AOT path
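
The option handling exercised by Tests 6 to 9 can be summarised as a small resolver. This is a hedged Python sketch of the expected outcomes, not the EP's actual code: the function name, the order of the checks, and the full error wording are assumptions. Only the option keys and the quoted message fragments ("mutually exclusive", "Contradictory", and the two warnings) come from this test plan.

```python
PREPARE_ONLY = "ep.qnnexecutionprovider.enable_htp_prepare_only"
PREPARE_AND_LOAD = "ep.qnnexecutionprovider.enable_htp_prepare_and_load"


def resolve_options(options, input_is_context_model=False):
    """Return (effective_options, warnings) or raise ValueError.

    Illustrative only: check order and error wording are assumptions.
    """
    effective = dict(options)
    warnings = []
    if effective.get(PREPARE_AND_LOAD) != "1":
        return effective, warnings  # existing flows are untouched

    # Test 6: the two prepare flags cannot be combined.
    if effective.get(PREPARE_ONLY) == "1":
        raise ValueError(
            "prepare_only and prepare_and_load are mutually exclusive")

    # Test 9: the flag is a no-op on an already compiled context model.
    if input_is_context_model:
        warnings.append(
            "prepare_and_load=1 is ignored because the input model is "
            "already a pre-compiled context model")
        effective[PREPARE_AND_LOAD] = "0"
        return effective, warnings

    context_enabled = effective.get("ep.context_enable") == "1"

    # Test 7: a file path without ep.context_enable=1 is rejected.
    if not context_enabled and "ep.context_file_path" in effective:
        raise ValueError(
            "Contradictory options: ep.context_file_path is set but "
            "ep.context_enable is not")

    # Test 8: embed mode 1 is forced to external (0) with a warning.
    if context_enabled and effective.get("ep.context_embed_mode") == "1":
        warnings.append("Overriding ep.context_embed_mode to 0")
        effective["ep.context_embed_mode"] = "0"

    return effective, warnings
```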

Summary Checklist

| # | Test | Expected Result |
|---|---|---|
| 1 | Path A (no artifact) | Infer OK, no files written |
| 2 | Path B (persist) | Infer OK, .onnx + .bin created |
| 3 | Reload artifact | Infer OK, fast startup |
| 4 | JIT regression | Unchanged behavior |
| 5 | Prepare-only regression | Writes ctx, no inference |
| 6 | Mutually exclusive error | Fails: "mutually exclusive" |
| 7 | Contradictory error | Fails: "Contradictory" |
| 8 | Embed mode override | Warning + external .bin created |
| 9 | Flag on pre-compiled model | Warning + loads normally |

@qti-mbadnara

/ci

@github-actions

🔄 CI triggered on dev/qti-mbadnara/enable_prepare_and_load (draft PR — CI was skipped on push) by @qti-mbadnara. Check the Actions tab for progress.

onnxruntime deleted a comment from github-actions (bot) on Jun 18, 2026
qti-mbadnara force-pushed the dev/qti-mbadnara/enable_prepare_and_load branch from a33e6a7 to 3ccf1fe on June 18, 2026 21:31
qti-mbadnara changed the title from "[QNN EP] Add support for htp_prepare_and_load EP Provider Option" to "[QNN EP] Add prepare_and_load session option for single-session AOT inference" on Jun 18, 2026
qti-mbadnara reopened this on Jun 18, 2026
qti-mbadnara marked this pull request as ready for review on June 18, 2026 23:30
qti-mbadnara marked this pull request as draft on June 19, 2026 01:12