microsoft · xieofxie · Jul 29, 2025 · Jul 25, 2025 · Jul 25, 2025 · Jul 25, 2025
@@ -0,0 +1,5 @@
+__pycache__
+/cache
+/history/*/*
+!/history/*/history.config
+!/history/*/olive_config.json
@@ -0,0 +1,160 @@
+# Qwen2.5-1.5B-Instruct Model Optimization
+
+This repository demonstrates the optimization of the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model using **post-training quantization (PTQ)** techniques. The optimization process is divided into three main workflows:
+
+- QDQ for AMD NPU
+- PTQ + AOT for QNN NPU
+   + This process extends the QDQ flow and compiles the model specifically for **Qualcomm NPUs**
+- OpenVINO for Intel NPU
+   + This process uses OpenVINO-specific passes such as `OpenVINOOptimumConversion`, `OpenVINOIoUpdate`, and `OpenVINOEncapsulation`
+
+## **QDQ Model with 4-bit Weights & 16-bit Activations**
+
+This workflow produces an ONNX QDQ model that is agnostic to the target hardware and accelerator, making it suitable for general inference.
+
+### **Optimization Process**
+
+The model is optimized using **weight-only quantization** and **activation quantization** for efficient deployment. The process includes:
+
+1. **Weight Rotation ([QuaRot](https://arxiv.org/abs/2404.00456))**
+   - Reduces outliers in weights and hidden states so that they quantize with less error.
+
+2. **4-bit Per-Channel Symmetric Quantization ([GPTQ](https://arxiv.org/abs/2210.17323))**
+   - Quantizes the transformer layer weights to 4 bits, reducing their size while preserving accuracy.
+
+3. **ONNX Graph Capture**
+   - Exports the model to ONNX for further optimization.
+
+4. **4-bit Block-wise Quantization**
+   - Applies weight-only quantization to the **embedding layer** and **language modeling head**.
+
+5. **16-bit Activation Quantization**
+   - Uses 16-bit activations to balance precision and efficiency.
+
+The final output is a **QDQ model** with **4-bit weights** and **16-bit activations**. This model also leverages [GroupQueryAttention (GQA)](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.GroupQueryAttention) for efficient long-context processing and long-sequence generation.
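As a rough illustration of what "4-bit per-channel symmetric quantization" means, the sketch below applies plain round-to-nearest quantization to a small weight matrix. It is not GPTQ, which additionally uses calibration data to compensate for quantization error. The function names and the signed range [-8, 7] are assumptions chosen for illustration.

```python
from typing import List, Tuple

QMIN, QMAX = -8, 7  # signed 4-bit integer range (assumed for this sketch)


def quantize_per_channel(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Quantize each row (output channel) with its own symmetric scale."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Symmetric: the zero-point is 0 and the scale maps the largest magnitude to QMAX.
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        q_rows.append([max(QMIN, min(QMAX, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights from integers and per-channel scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Each channel stores one floating-point scale plus 4-bit integers, so a channel with a few large weights does not force a coarse step size onto the others.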
+
+### **Handling Dynamic and Static Input Shapes**
+
+NPUs require **precompiled graphs**, meaning the model must use **static input shapes**. However, **text generation** involves two distinct processing stages:
+
+- **Prefill (Prompt Processing)**: Processes multiple tokens simultaneously.
+- **Token Generation (Iteration)**: Processes one token at a time.
+
+To support both efficiently, we create **two model instances**:
+1. **Prefill model**: Optimized for processing many prompt tokens in a single pass.
+2. **Token generation model**: Optimized for one-token-at-a-time inference.
+
+## **PTQ + AOT Compilation for Qualcomm NPUs using QNN EP**
+
+This process extends the [**QDQ Model with 4-bit Weights & 16-bit Activations**](#qdq-model-with-4-bit-weights--16-bit-activations) by compiling it specifically for **Qualcomm NPUs** using the **QNN Execution Provider**.
+
+### **Resource Optimization Strategy**
+
+To maximize efficiency while supporting dynamic input handling:
+
+- **Embedding Layer & Language Model Head** → Executed on CPU (handles dynamic input).
+- **Transformer Layers** → Executed on NPU (requires static input shapes).
+- **Weight Sharing** → Prefill & token generation models reuse weights to minimize memory usage.
+
+> ⚠️ **Note:** GQA is an ONNX Runtime *contrib operator* and must be executed on the CPU. The model graph is partitioned into **CPU (GQA nodes)** and **NPU (other nodes)** for execution.
+
+### **Compilation for Qualcomm NPU Deployment**
+
+Once optimized, the model is compiled for Qualcomm NPUs using **ONNX Runtime QNNExecutionProvider**. The steps include:
+
+1. **Split the Quantized Model** → Divide into three parts:
+   - **Embedding Layer**
+   - **Transformer Layers**
+   - **Language Model Head**
+2. **Set Static Input Shapes**:
+   - **(1, 64)** for prefill (batch size, sequence length).
+   - **(1, 1)** for token generation.
+3. **Compile using QNNExecutionProvider**:
+   - Leverages **weight sharing** across the prefill and token generation models.
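The static shapes in step 2 imply that a prompt has to be fed to the prefill model in fixed 64-token windows, with the last window padded, before the (1, 1) token-generation model takes over. How the runtime actually does this is not described here. The following is only a minimal sketch of the idea, assuming right padding and a hypothetical `PAD_ID` of 0.

```python
from typing import List, Tuple

PREFILL_LEN = 64  # static prefill sequence length, from the (1, 64) shape
PAD_ID = 0        # hypothetical padding token id


def chunk_prompt(token_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Split a prompt into (window, valid_count) pairs of fixed length PREFILL_LEN."""
    chunks = []
    for start in range(0, len(token_ids), PREFILL_LEN):
        window = token_ids[start:start + PREFILL_LEN]
        valid = len(window)
        # Right-pad the final window so every window has the same static shape.
        window = window + [PAD_ID] * (PREFILL_LEN - valid)
        chunks.append((window, valid))
    return chunks
```

Returning the count of valid tokens alongside each window lets a caller ignore the padded positions.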
+
+### **Usage**
+
+This workflow is configured using the `qnn_config.json` file. It contains all of the quantization and compilation steps. It requires two separate Python environments described below.
+
+#### A Known-Working Configuration
+
+- python=3.10
+- CUDA=12.1
+- cudnn=9.2.0
+
+#### Quantization Python Environment Setup
+
+Quantization is resource-intensive and requires GPU acceleration. In an [x64 Python environment with Olive installed](https://github.com/microsoft/Olive/blob/main/examples/README.md#important), install the required packages:
+
+```bash
+# Install common dependencies
+pip install -r requirements.txt
+
+# Install ONNX Runtime GPU packages
+pip install "onnxruntime-gpu>=1.21.0" "onnxruntime-genai-cuda>=0.6.0"
+
+# AutoGPTQ: Install from source (stable package may be slow for weight packing)
+# Disable CUDA extension build (not required)
+# Linux
+export BUILD_CUDA_EXT=0
+# Windows
+# set BUILD_CUDA_EXT=0
+
+# Install AutoGPTQ from source
+pip install --no-build-isolation git+https://github.com/PanQiWei/AutoGPTQ.git
+```
+
+> ⚠️ Only set up the environment and install the packages. Do not run the `olive run` command at this point.
+
+#### AOT Compilation Python Environment Setup
+
+Model compilation using QNN Execution Provider requires a Python environment with onnxruntime-qnn installed. In a separate Python environment with Olive installed, install the required packages:
+
+```bash
+# Install ONNX Runtime QNN
+pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
+pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
+```
+
+Replace `/path/to/qnn/env/bin` in `qnn_config.json` with the path to the directory containing your QNN environment's Python executable. This path can be found by running the following command in the environment:
+
+```bash
+# Linux
+command -v python
+# Windows
+# where python
+```
+
+This command returns the path to the Python executable. Use the parent directory of that executable in place of `/path/to/qnn/env/bin` in the config file.
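As an alternative to the shell commands above, the same directory can be printed from Python when run inside the QNN environment. This is a small standard-library sketch; `env_bin_dir` is a name introduced here for illustration.

```python
import sys
from pathlib import Path


def env_bin_dir() -> Path:
    """Return the directory that contains the running Python executable."""
    # Deliberately not resolving symlinks: in a virtual environment the
    # executable may link to the base interpreter, which is the wrong directory.
    return Path(sys.executable).parent


if __name__ == "__main__":
    print(env_bin_dir())
```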
+
+#### **Run the Quantization + Compilation Config**
+
+Activate the **Quantization Python Environment** and run the workflow:
+
+```bash
+olive run --config qnn_config.json
+```
+
+Olive will run the AOT compilation step in the **AOT Compilation Python Environment** specified in the config file using a subprocess. All other steps will run in the **Quantization Python Environment** natively.
+
+✅ Optimized model saved in: `./model`
+
+> ⚠️ If optimization fails with an out-of-memory error, remove `calibration_providers` from the config file.
+
+> ⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.
+
+### **Inference**
+
+The optimized model can be run with ONNX Runtime QNNExecutionProvider and ONNX Runtime GenAI. **Inference must be run on a Windows Copilot+ PC with a Qualcomm NPU.**
+
+#### **Install Required Packages (arm64 Python)**
+```bash
+pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
+pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
+pip install "onnxruntime-genai>=0.7.0rc2"
+```
+
+#### **Run Console-Based Chat Interface**
+Execute the provided `inference_sample.ipynb` notebook.
+
+
@@ -0,0 +1,144 @@
+{
+    "copies": [
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/model_project.config",
+            "dst": "model_project.config",
+            "replacements": [
+                {
+                    "find": "deepseek_qnn_config",
+                    "replace": "qwen2_5_qnn_config"
+                },
+                {
+                    "find": "deepseek_vitis_ai_config",
+                    "replace": "qwen2_5_vitis_ai_config"
+                },
+                {
+                    "find": "deepseek_ov_config",
+                    "replace": "qwen2_5_ov_config"
+                },
+                {
+                    "find": "deepseek_dml_config",
+                    "replace": "qwen2_5_dml_config"
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_qnn_config.json",
+            "dst": "qwen2_5_qnn_config.json",
+            "replacements": [
+                {
+                    "find": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+                    "replace": "Qwen/Qwen2.5-1.5B-Instruct"
+                },
+                {
+                    "find": "model/deepseek",
+                    "replace": "model/qwen2_5"
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_qnn_config.json.config",
+            "dst": "qwen2_5_qnn_config.json.config",
+            "replacements": [
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_vitis_ai_config.json",
+            "dst": "qwen2_5_vitis_ai_config.json",
+            "replacements": [
+                {
+                    "find": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+                    "replace": "Qwen/Qwen2.5-1.5B-Instruct"
+                },
+                {
+                    "find": "model/deepseek",
+                    "replace": "model/qwen2_5"
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_vitis_ai_config.json.config",
+            "dst": "qwen2_5_vitis_ai_config.json.config",
+            "replacements": [
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_ov_config.json",
+            "dst": "qwen2_5_ov_config.json",
+            "replacements": [
+                {
+                    "find": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+                    "replace": "Qwen/Qwen2.5-1.5B-Instruct"
+                },
+                {
+                    "find": "model/deepseek",
+                    "replace": "model/qwen2_5"
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_ov_config.json.config",
+            "dst": "qwen2_5_ov_config.json.config",
+            "replacements": [
+                {
+                    "find": "deepseek/openvino/DeepSeek-R1-Distill-Qwen-1.5B_context_ov_dynamic_sym_gs128_bkp_int8_sym_r1.json",
+                    "replace": "qwen2_5/openvino/Qwen2.5-1.5B-instruct_context_ov_dynamic_sym_bkp_int8_sym_r1.json"
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_dml_config.json",
+            "dst": "qwen2_5_dml_config.json",
+            "replacements": [
+                {
+                    "find": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+                    "replace": "Qwen/Qwen2.5-1.5B-Instruct"
+                },
+                {
+                    "find": "model/deepseek",
+                    "replace": "model/qwen2_5"
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/deepseek_dml_config.json.config",
+            "dst": "qwen2_5_dml_config.json.config",
+            "replacements": [
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/README.md",
+            "dst": "README.md",
+            "replacements": [
+                {
+                    "find": "# DeepSeek-R1-Distill-Qwen-1.5B Model Optimization",
+                    "replace": "# Qwen2.5-1.5B-Instruct Model Optimization"
+                },
+                {
+                    "find": "[DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)",
+                    "replace": "[Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)"
+                },
+                {
+                    "find": "> ⚠️ If got 6033 error, replace `genai_config.json` in `./model` folder",
+                    "replace": ""
+                }
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/requirements.txt",
+            "dst": "requirements.txt",
+            "replacements": [
+            ]
+        },
+        {
+            "src": "../../../deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/1/inference_sample.ipynb",
+            "dst": "inference_sample.ipynb",
+            "replacements": [
+                {
+                    "find": "<｜User｜>{input}<｜Assistant｜><think>",
+                    "replace": "<|im_start|>user\\\\n{input}<|im_end|>\\\\n<|im_start|>assistant\\\\n"
+                }
+            ]
+        }
+    ]
+}
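The tool that consumes this copy config is not part of this diff. Assuming each `replacements` entry is a literal (non-regex) find-and-replace applied in order to the text of `src` before it is written to `dst`, the behavior can be sketched as follows. `apply_copies` is a hypothetical name, and resolving paths relative to the config file's directory is also an assumption.

```python
import json
from pathlib import Path


def apply_copies(config_path: Path) -> None:
    """Copy each src to dst, applying literal find/replace pairs in order."""
    base = config_path.parent  # assumption: paths are relative to the config file
    config = json.loads(config_path.read_text(encoding="utf-8"))
    for entry in config["copies"]:
        text = (base / entry["src"]).read_text(encoding="utf-8")
        for rep in entry.get("replacements", []):
            text = text.replace(rep["find"], rep["replace"])
        dst = base / entry["dst"]
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_text(text, encoding="utf-8")
```

Under this reading, an entry with an empty `replacements` list is a plain copy, and a replacement whose `replace` value is `""` deletes the matched text.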
@@ -0,0 +1,31 @@
+{
+  "Name": "Qwen2.5-1.5B-Instruct",
+  "PromptTemplate": {
+    "assistant": "{Content}",
+    "prompt": "<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant\n"
+  },
+  "ParameterSchema": {
+    "enabled": [
+      {
+        "name": "max_tokens",
+        "default": 512
+      },
+      {
+        "name": "temperature",
+        "default": 0.6
+      },
+      {
+        "name": "top_p",
+        "default": 0.95
+      },
+      {
+        "name": "top_k",
+        "default": 5
+      },
+      {
+        "name": "random_seed",
+        "default": 3328
+      }
+    ]
+  }
+}
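To make the `PromptTemplate` concrete, the sketch below substitutes a user message into the `prompt` string from the config above. It assumes `{Content}` is a literal placeholder filled by plain string substitution; the consuming application's actual logic is not shown in this diff, and `render_prompt` is a hypothetical helper.

```python
PROMPT_TEMPLATE = "<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant\n"


def render_prompt(user_message: str, template: str = PROMPT_TEMPLATE) -> str:
    """Insert the user's message at the {Content} placeholder."""
    # str.replace is used instead of str.format so that braces in the
    # user's text are passed through untouched.
    return template.replace("{Content}", user_message)
```

The rendered string ends with the assistant header, so the model's generated tokens continue directly as the assistant's reply.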