
Commit d1da8a0

[QNN EP] documentation updates for the GPU backend. (#26508)
### Description

Updating QNN EP documentation to include the GPU backend.

### Motivation and Context

GPU backend differs in specific areas from the HTP backend.
1 parent 4b76c2d commit d1da8a0

1 file changed

docs/execution-providers/QNN-ExecutionProvider.md

Lines changed: 17 additions & 0 deletions
@@ -64,6 +64,7 @@ The QNN Execution Provider supports a number of configuration options. These pro
|---|-----|
|'libQnnCpu.so' or 'QnnCpu.dll'|Enable CPU backend. See `backend_type` 'cpu'.|
|'libQnnHtp.so' or 'QnnHtp.dll'|Enable HTP backend. See `backend_type` 'htp'.|
|'libQnnGpu.so' or 'QnnGpu.dll'|Enable GPU backend. See `backend_type` 'gpu'.|

**Note:** `backend_path` is an alternative to `backend_type`. At most one of the two should be specified.
`backend_path` requires a platform-specific path (e.g., `libQnnCpu.so` vs. `QnnCpu.dll`) but also allows one to specify an arbitrary path.
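For illustration, the sketch below shows both selection styles side by side; it is not part of the original document. The model path `model.onnx` and the fresh `SessionOptions` object are placeholders, in practice only one of the two sessions would be created, and the `backend_type` value `'gpu'` and the `QnnGpu.dll` path come from the table above.

```python
# Minimal sketch (placeholder model path): selecting the QNN backend by type vs. by path.
import onnxruntime

options = onnxruntime.SessionOptions()

# Portable form: QNN EP resolves the platform-specific library for the 'gpu' backend type.
session_by_type = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_type": "gpu"}],
)

# Explicit form: point directly at a backend library (platform-specific name or arbitrary path).
session_by_path = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnGpu.dll"}],
)
```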
@@ -392,6 +393,22 @@ Available session configurations include:

The above snippet only specifies the `backend_path` provider option. Refer to the [Configuration options section](./QNN-ExecutionProvider.md#configuration-options) for a list of all available QNN EP provider options.

## Running a model with QNN EP's GPU backend
The QNN GPU backend can run models with 32-bit or 16-bit floating-point activations and weights as-is, without prior quantization. A 16-bit floating-point model generally runs inference faster on the GPU than its 32-bit counterpart. To help reduce the size of large models, quantizing weights to `uint8` while keeping activations in floating point is also supported.
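One common way to obtain a 16-bit floating-point variant of an existing 32-bit model is the float16 conversion tool from the `onnxconverter-common` package; the sketch below assumes that package (it is not covered by this document) and uses placeholder file paths.

```python
# Sketch under stated assumptions: convert an FP32 ONNX model to FP16 with onnxconverter-common.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("model_fp32.onnx")                   # placeholder input path
model_fp16 = float16.convert_float_to_float16(model_fp32)   # converts float32 tensors in the graph to float16
onnx.save(model_fp16, "model_fp16.onnx")                    # placeholder output path
```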
Apart from the quantized-model requirement mentioned in the HTP backend section above, all requirements for the HTP backend also apply to the GPU backend. The same model inference sample code applies as well, except for the portion that specifies the backend:
```python
import onnxruntime

options = onnxruntime.SessionOptions()

# Create an ONNX Runtime session.
# TODO: Provide the path to your ONNX model.
session = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnGpu.dll"}],  # Provide the path to the GPU backend library in the QNN SDK.
)
```
## QNN context binary cache feature
A QNN context contains the QNN graphs produced by converting, compiling, and finalizing the model. QNN can serialize this context into a binary file so that it can be used directly for further inference (without the QDQ model), reducing model loading cost.
The QNN Execution Provider supports a number of session options to configure this.
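As a rough illustration, the cache is typically controlled through session config entries such as `ep.context_enable`; treat the exact keys and values in the sketch below as assumptions to be checked against the option list in the full document, and the file paths as placeholders.

```python
# Sketch (verify keys against the QNN EP session option list): enable the QNN context binary cache.
import onnxruntime

options = onnxruntime.SessionOptions()
options.add_session_config_entry("ep.context_enable", "1")                  # generate/use a context binary
options.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")  # placeholder cache path

session = onnxruntime.InferenceSession(
    "model.onnx",  # placeholder model path
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)
```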
