
Commit d1da8a0

[QNN EP] documentation updates for the GPU backend. (#26508)
### Description

Updating QNN EP documentation to include the GPU backend.

### Motivation and Context

GPU backend differs in specific areas from the HTP backend.
1 parent 4b76c2d commit d1da8a0

1 file changed

docs/execution-providers/QNN-ExecutionProvider.md

Lines changed: 17 additions & 0 deletions
@@ -64,6 +64,7 @@ The QNN Execution Provider supports a number of configuration options. These pro
|---|-----|
|'libQnnCpu.so' or 'QnnCpu.dll'|Enable CPU backend. See `backend_type` 'cpu'.|
|'libQnnHtp.so' or 'QnnHtp.dll'|Enable HTP backend. See `backend_type` 'htp'.|
|'libQnnGpu.so' or 'QnnGpu.dll'|Enable GPU backend. See `backend_type` 'gpu'.|

**Note:** `backend_path` is an alternative to `backend_type`. At most one of the two should be specified.
`backend_path` requires a platform-specific path (e.g., `libQnnCpu.so` vs. `QnnCpu.dll`) but also allows one to specify an arbitrary path.
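For illustration, the sketch below shows both selection styles side by side; it is not part of the original document. The model path `model.onnx` and the fresh `SessionOptions` object are placeholders, in practice only one of the two sessions would be created, and the `backend_type` value `'gpu'` and the `QnnGpu.dll` path come from the table above.

```python
# Minimal sketch (placeholder model path): selecting the QNN backend by type vs. by path.
import onnxruntime

options = onnxruntime.SessionOptions()

# Portable form: QNN EP resolves the platform-specific library for the 'gpu' backend type.
session_by_type = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_type": "gpu"}],
)

# Explicit form: point directly at a backend library (platform-specific name or arbitrary path).
session_by_path = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnGpu.dll"}],
)
```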
@@ -392,6 +393,22 @@ Available session configurations include:

The above snippet only specifies the `backend_path` provider option. Refer to the [Configuration options section](./QNN-ExecutionProvider.md#configuration-options) for a list of all available QNN EP provider options.

## Running a model with QNN EP's GPU backend
The QNN GPU backend can run models with 32-bit or 16-bit floating-point activations and weights as-is, without prior quantization. A 16-bit floating-point model generally runs inference faster on the GPU than its 32-bit counterpart. To help reduce the size of large models, quantizing weights to `uint8` while keeping activations in floating point is also supported.
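One common way to obtain a 16-bit floating-point variant of an existing 32-bit model is the float16 conversion tool from the `onnxconverter-common` package; the sketch below assumes that package (it is not covered by this document) and uses placeholder file paths.

```python
# Sketch under stated assumptions: convert an FP32 ONNX model to FP16 with onnxconverter-common.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("model_fp32.onnx")                   # placeholder input path
model_fp16 = float16.convert_float_to_float16(model_fp32)   # converts float32 tensors in the graph to float16
onnx.save(model_fp16, "model_fp16.onnx")                    # placeholder output path
```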
Apart from the quantized-model requirement mentioned in the HTP backend section above, all requirements for the HTP backend also apply to the GPU backend. The same model inference sample code applies as well, except for the portion that specifies the backend:
```python
import onnxruntime

options = onnxruntime.SessionOptions()

# Create an ONNX Runtime session.
# TODO: Provide the path to your ONNX model.
session = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnGpu.dll"}],  # Provide the path to the GPU backend library in the QNN SDK.
)
```
## QNN context binary cache feature
A QNN context contains the QNN graphs produced by converting, compiling, and finalizing the model. QNN can serialize this context into a binary file so that it can be used directly for further inference (without the QDQ model), reducing model loading cost.
The QNN Execution Provider supports a number of session options to configure this.
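As a rough illustration, the cache is typically controlled through session config entries such as `ep.context_enable`; treat the exact keys and values in the sketch below as assumptions to be checked against the option list in the full document, and the file paths as placeholders.

```python
# Sketch (verify keys against the QNN EP session option list): enable the QNN context binary cache.
import onnxruntime

options = onnxruntime.SessionOptions()
options.add_session_config_entry("ep.context_enable", "1")                  # generate/use a context binary
options.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")  # placeholder cache path

session = onnxruntime.InferenceSession(
    "model.onnx",  # placeholder model path
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)
```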
