[QNN EP] Convert BQ to LPBQ encodings by qti-ashimaj · Pull Request #307 · onnxruntime/onnxruntime-qnn

qti-ashimaj · 2026-04-28T06:32:55Z

Description

Convert BQ encodings to htp supported LPBQ encodings

Motivation and Context

The block quantized ONNX models are currently not supported in QNN-EP. In order to execute them via QNN HTP, we are converting the BQ encodings to LPBQ encodings.

CLAassistant · 2026-04-28T06:33:02Z

All committers have signed the CLA.

qti-kromero · 2026-05-08T16:10:46Z

    params_.axisScaleOffsetEncoding.axis = static_cast<int32_t>(axis);
    params_.axisScaleOffsetEncoding.numScaleOffsets = static_cast<uint32_t>(num_elems);
    params_.axisScaleOffsetEncoding.scaleOffset = data_span.data();
+  } else if (is_block_quant) {


I feel like this logic and some of the other logic in qnn_utils warrants a new test (or a few new tests)? is that planned to be added later or something we can add in this PR?

qti-chuteng · 2026-05-13T02:34:05Z

Critical

[C-1] onnxruntime/core/providers/qnn/builder/qnn_utils.cc:1216,1267-1268
max_int_scale = 1u << bitwidth is wrong by one. For int8 the value is 256, then static_cast<uint8_t>(clamped) wraps the channel's max-scale block to 0, encoding it as zero-scale; for int4 the value 16 overflows the 4-bit block-scale field. The standard LPBQ convention is int_scale ∈ [1, 2^bitwidth − 1]. Use max_int_scale = (1u << bitwidth) - 1 and divide by it in per_channel_scales[c] = max_scale / max_int_scale. This will hit 100% of inputs since the algorithm intentionally maps max_scale to max_int_scale. Pair this fix with M-4.

[C-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc:1
A UTF-8 BOM (EF BB BF) was accidentally inserted at the start of the file. No other .cc file in the repo has a BOM, and some toolchains (notably nvcc and older MSVC) treat the BOM as the first source character and break preprocessing. Re-save without BOM, e.g. sed -i '1s/^\xEF\xBB\xBF//' matmul_op_builder.cc.

Major

[M-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:553-555
constexpr int64_t kDefaultBlockAxis = 0; violates the ONNX QuantizeLinear / DequantizeLinear spec, which has defaulted axis to 1 since opset 13. The two per-channel branches in this same function correctly use DEFAULT_QDQ_AXIS = 1. Change to 1 (or reuse DEFAULT_QDQ_AXIS); otherwise models that omit the axis attribute will be processed with the wrong layout, and qnn_axis = 1 − onnx_axis will compute the wrong direction.

[M-2] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:434-442
The new flag chain has a silent fallback: when scales.size() == 1 && block_size > 0, the is_per_tensor branch wins and block_size is silently ignored. Either set is_per_tensor = scales.size() == 1 && !is_block_quant, or add RETURN_IF_NOT(!(is_per_tensor && is_block_quant), "Per-tensor + block_size>0 is not supported") before the if-else.

[M-3] onnxruntime/test/providers/qnn/ (no new files)
~125 lines of new numeric-conversion logic and zero unit-test coverage. Per checklist [G02] every new logic change must have a corresponding test. Add at minimum: (1) a QnnHTPBackendTests end-to-end with opset-21 MatMul + DequantizeLinear(block_size) for both int4 and int8 weights asserting ExpectedEPNodeAssignment::All; (2) a standalone gtest for TryConvertBlockQuantScalesToLpbq covering max-block-scale round-trip (which would catch C-1), axis=0 vs axis=1 transpose, all-zero scales, asymmetric zero-points must fail; (3) a kill-test fence — disable the BQ→LPBQ change and confirm the new tests fail.

[M-4] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:153
lpbq.blockScaleBitwidth = is_int4 ? 4 : 0; writes blockScaleBitwidth = 0 for int8. Before this PR the false branch was dead code (only the int4 LPBQ fusions used the ctor); this PR exercises it. Fix: lpbq.blockScaleBitwidth = is_int4 ? 4 : 8;. Same root cause as C-1, please fix together.

[M-5] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:548-563
After the scale_shape.size() == 2 check, please also assert scale_shape[0] > 0 && scale_shape[1] > 0 and scale_shape[0] * scale_shape[1] == scales.size() so dynamic-shape scales fail loudly with a clear message instead of overflowing inside TryConvertBlockQuantScalesToLpbq.

[M-6] onnxruntime/core/providers/qnn/builder/qnn_utils.h:289 / qnn_utils.cc:1202
TryConvertBlockQuantScalesToLpbq uses the Try-prefix idiom (which in ORT — e.g. TryFusion — implies bool / non-fatal failure), but it actually returns Ort::Status and the caller uses RETURN_IF_ERROR. Drop the prefix → ConvertBlockQuantScalesToLpbq.

Minor

[N-1] qnn_quant_params_wrapper.cc:581-589 — bq_offsets transpose path is dead code: TryConvertBlockQuantScalesToLpbq immediately requires every offset to be 0. Validate zero_points are all zero up front and drop the transpose.

[N-2] qnn_utils.h:280-284 / qnn_utils.cc:1232-1239 — Doc comment "max_int_scale = 2^bitwidth (256 for int8, 16 for int4)" embeds the same off-by-one as C-1. Update together with C-1.

[N-3] qnn_utils.cc:382-393 — operator<< for BLOCKWISE_EXPANSION uses numBlocksPerAxis as the print length for scaleOffsets, but QnnTypes.h documents that array size as axisSize. Either pass axisSize through (e.g. via per_channel_scales_size_) or update the comment to clarify "first numBlocksPerAxis entries only".

[N-4] qnn_quant_params_wrapper.cc:434-610 — The new is_block_quant flag reorders the if-else chain. Add 1-2 lines of comment above mapping (per-tensor / per-channel / block-quant × int4) to each branch and noting the final else as a defensive fallback.

[N-5] qnn_quant_params_wrapper.cc:597 — const uint32_t kBitwidth = is_int4_type ? 4u : 8u; is a runtime const, not a constexpr. The k prefix is reserved for constexpr / static const in Google C++ style. Rename to bitwidth.

quic-ristha · 2026-05-20T10:40:09Z

+                "BQ offsets size must be empty or equal to bq_scales size");
+  RETURN_IF_NOT(bitwidth > 0 && bitwidth <= 16, "bitwidth must be in range [1, 16]");
+
+  const uint32_t max_int_scale = (1u << bitwidth) - 1u;


max_int_scale calculation logic differes from what we used in the standard LPBQ workflow

For 4->8 expansion

the int scale will range from [1, 16]

Thus the max_int_sacle is calculated as 2^(decompressed_bw - bq_bw) i.e. 16 in this case

You can refer this class code : https://github.com/qualcomm/aimet/blob/66b2834fcb2711ef3a0bf6cec847090b42da683b/TrainingExtensions/torch/src/python/aimet_torch/quantization/affine/quantizer.py#L1192

sorry thought I had commented this but I guess it got lost - we need to upgrade the datatype on per_block_int_scales or otherwise adjust for the differing range. otherwise, we will have overflow for uint8 storing 256

cc @quic-calvnguy

Done. The int_scale range will be [1, 16] for 4-bit as suggested. Since the original algorithm is applicable for 4-bits only, we are adding a check to convert bq to lpbq encodings for only 4 bit and so the overflow for 8 bits will not happen with this.

quic-pkaul · 2026-05-20T11:12:49Z

+  // QNN uses different structs to represent quantization parameters depending on:
+  // - per-tensor (scales.size()==1, no block_size): SCALE_OFFSET or BW_SCALE_OFFSET
+  // - per-channel (scales.size()>1, no block_size): AXIS_SCALE_OFFSET or BW_AXIS_SCALE_OFFSET
+  // - block quantization (block_size>0): BLOCKWISE_EXPANSION (LPBQ)


Specifically, We have both block quantization (QNN_QUANTIZATION_ENCODING_BLOCK) and lpbq (QNN_QUANTIZATION_BLOCKWISE_EXPANSION) supported in QNN.

quic-pkaul · 2026-05-20T11:15:20Z

    params_.bwAxisScaleOffsetEncoding.scales = scales_span.data();
    params_.bwAxisScaleOffsetEncoding.offsets = zps_span.data();
-  } else if (!is_per_tensor && !is_int4_type) {
+  } else if (is_per_channel && !is_int4_type) {


Not sure if this gets taken elsewhere.
But before we make BQ->LPBQ transition, besides the two constraints (is_per_channel and is_int4_type) we also need to make sure that the input and output activation of the op is Integer as well. Otherwise better keep in BQ.

Added a check for (is_block_quant && is_int4_type) , the conversion will only happen for these data types else not.

quic-pkaul · 2026-05-20T11:20:11Z

+
+    // Determine block axis (= ONNX axis attribute).
+    constexpr int64_t DEFAULT_QDQ_AXIS = 1;
+    int64_t axis = ort_quant_params->axis.value_or(DEFAULT_QDQ_AXIS);


The code doesn't seem to take care of transposed weights. Which will swap axis.
Please see transB here.
https://onnx.ai/onnx/operators/onnx__Gemm.html

This can be handled in the separate op builders using the HandleTranspose/HandleUnsqueeze API which will update the axis.

quic-pkaul · 2026-05-20T11:26:44Z

+  }
+
+  // Algorithm:
+  //   max_int_scale             = 2^bitwidth - 1


Also correct based on rishabh's comment

qti-kromero · 2026-05-20T20:09:08Z

+          << " blockScaleBitwidth=" << lpbq.blockScaleBitwidth;
+      // For LPBQ, num_elems are not present in the quantize_params,
+      // we are using numBlocksPerAxis instead to print the first numBlocksPerAxis scale offset values
+      size_t num_elems = lpbq.numBlocksPerAxis;


is lpbq.scaleOffsets channel-major or block-major?

am wondering what this does in plain language? prints the first numBlocksPerAxis scaleOffsets so is that just the blocks for the first row of axis?

The scale offsets are of size num_channels only but since this qnn struct doesn't have numScaleOffsets member, we are taking numBlocksPerAxis as a proxy to print the first few scale and offset values.

qti-kromero · 2026-05-20T20:11:26Z

+        utils::GetInitializerShape(ort_quant_params->scale, qnn_model_wrapper.GetOrtApi());
+    RETURN_IF_NOT(scale_shape.size() >= 2,
+                  "Block quantization scale tensors must have at least rank 2 for LPBQ conversion");
+    RETURN_IF_NOT(scale_shape[0] > 0 && scale_shape[1] > 0,


does this and below work properly for ranks > 2?

Yes, we have tested this on models having MatMul BQ (rank 2) and Conv BQ (rank 4) weights.

qti-kromero · 2026-06-09T20:30:50Z

LGTM aside from missing tests

qti-ashimaj · 2026-06-10T09:46:54Z

#307 (comment)

LGTM aside from missing tests

The test cases are being added in the individual op builder PRs

qti-chuteng · 2026-06-17T07:00:48Z

🔴 Critical

[C-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L597
The else if (is_block_quant) branch only builds LPBQ params when is_int4_type && convert_bq_to_lpbq; the inner if has no else, so a non-int4 (e.g. int8) block-quant weight falls through to return Ort::Status() with params_ left as QNN_QUANTIZE_PARAMS_INIT (UNDEFINED). This is reachable under the default convert_bq_to_lpbq=true, where an int8 BQ MatMul weight is no longer claimed by IsBQWeight() and routes through this generic path — yielding wrong numerics or a QNN compile failure instead of a clean error. Add an else that returns MAKE_EP_FAIL(...).

🟠 Major

[M-1] onnxruntime/core/providers/qnn/qnn_execution_provider.cc:L915
The new option convert_bq_to_lpbq defaults to true (both in ModelSettings and ParseBoolOption). This violates the opt-in convention for new session config keys and silently flips existing int4 BQ models (BW_FLOAT_BLOCK, landed in PR #288) to a lossy LPBQ approximation after upgrade, changing their output. Default to false, or document the behavior change explicitly.

[M-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc:L91
Only MatMul's IsBQWeight() honors convert_bq_to_lpbq. The BQ detectors in conv_op_builder.cc and gemm_op_builder.cc (IsBQGemmWeight) ignore the option and always use BW_FLOAT_BLOCK, so under convert_bq_to_lpbq=1 MatMul uses LPBQ while Conv/Gemm still use BQ — inconsistent and contradicting the doc. Gate Conv/Gemm the same way, or scope the doc to MatMul-only.

[M-3] onnxruntime/test/providers/qnn/matmul_test.cc:L365
Both new provider-option helpers set opts["convert_bq_to_lpbq"] = "0", which disables the entire new LPBQ conversion path. No test runs it with "1" (or the default), so the new algorithm could be fully broken and all tests would still pass. Add an HTP QDQ test (and ideally a direct unit test for ConvertBlockQuantScalesToLpbq) that exercises the conversion.

🔵 Minor

[N-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.h → per_channel_scales_size_
uint32_t per_channel_scales_size_; has no in-class initializer (unlike adjacent block_encoding_tensor_rank_ = 0 / num_blocks_ = 0) and is read in the copy ctor for LPBQ. In the degenerate num_elems==0 LPBQ case it stays indeterminate.

  uint32_t per_channel_scales_size_ = 0;

[N-2] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L655
const uint32_t bitwidth = is_int4_type ? 4u : 8u; sits inside if (is_int4_type && ...), so the 8u arm is dead code; ConvertBlockQuantScalesToLpbq also hard-rejects non-4 bitwidth. Use const uint32_t bitwidth = 4u; with a comment to avoid implying int8 support.

[N-3] docs/execution_providers/QNN-ExecutionProvider.md:L200
The doc says block-quant encodings "will be converted to LPBQ", but conversion is int4-only and (per M-2) MatMul-only. Update to match the actual scope and final default value.

qti-ashimaj · 2026-06-17T11:40:56Z

🔴 Critical

[C-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L597 The else if (is_block_quant) branch only builds LPBQ params when is_int4_type && convert_bq_to_lpbq; the inner if has no else, so a non-int4 (e.g. int8) block-quant weight falls through to return Ort::Status() with params_ left as QNN_QUANTIZE_PARAMS_INIT (UNDEFINED). This is reachable under the default convert_bq_to_lpbq=true, where an int8 BQ MatMul weight is no longer claimed by IsBQWeight() and routes through this generic path — yielding wrong numerics or a QNN compile failure instead of a clean error. Add an else that returns MAKE_EP_FAIL(...).

🟠 Major

[M-1] onnxruntime/core/providers/qnn/qnn_execution_provider.cc:L915 The new option convert_bq_to_lpbq defaults to true (both in ModelSettings and ParseBoolOption). This violates the opt-in convention for new session config keys and silently flips existing int4 BQ models (BW_FLOAT_BLOCK, landed in PR #288) to a lossy LPBQ approximation after upgrade, changing their output. Default to false, or document the behavior change explicitly.

[M-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc:L91 Only MatMul's IsBQWeight() honors convert_bq_to_lpbq. The BQ detectors in conv_op_builder.cc and gemm_op_builder.cc (IsBQGemmWeight) ignore the option and always use BW_FLOAT_BLOCK, so under convert_bq_to_lpbq=1 MatMul uses LPBQ while Conv/Gemm still use BQ — inconsistent and contradicting the doc. Gate Conv/Gemm the same way, or scope the doc to MatMul-only.

[M-3] onnxruntime/test/providers/qnn/matmul_test.cc:L365 Both new provider-option helpers set opts["convert_bq_to_lpbq"] = "0", which disables the entire new LPBQ conversion path. No test runs it with "1" (or the default), so the new algorithm could be fully broken and all tests would still pass. Add an HTP QDQ test (and ideally a direct unit test for ConvertBlockQuantScalesToLpbq) that exercises the conversion.

🔵 Minor

[N-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.h → per_channel_scales_size_ uint32_t per_channel_scales_size_; has no in-class initializer (unlike adjacent block_encoding_tensor_rank_ = 0 / num_blocks_ = 0) and is read in the copy ctor for LPBQ. In the degenerate num_elems==0 LPBQ case it stays indeterminate.
  uint32_t per_channel_scales_size_ = 0;
[N-2] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L655 const uint32_t bitwidth = is_int4_type ? 4u : 8u; sits inside if (is_int4_type && ...), so the 8u arm is dead code; ConvertBlockQuantScalesToLpbq also hard-rejects non-4 bitwidth. Use const uint32_t bitwidth = 4u; with a comment to avoid implying int8 support.

[N-3] docs/execution_providers/QNN-ExecutionProvider.md:L200 The doc says block-quant encodings "will be converted to LPBQ", but conversion is int4-only and (per M-2) MatMul-only. Update to match the actual scope and final default value.

[C-1] If the conditions for BQ->LPBQ conversion are not met, ORT::Status is returned and the respective op builder files will construct the BW_FLOAT_BLOCK encodings.

[M-1] Updated the default value to false

[M-2] The respective op lpbq PRs will take care of these.

[M-3] The respective op lpbq PRs have the tests added.

[N-1] Done

[N-2] Done

[N-3] Updated the description

qti-ashimaj marked this pull request as ready for review May 7, 2026 09:34

qti-ashimaj requested review from qti-ashwshan, qti-chuteng, qti-jkilpatrick, qti-kromero, qti-yuduo, tirupath-qti and yath1 as code owners May 7, 2026 09:34

qti-ashimaj changed the title ~~[WIP] Convert BQ to LPBQ encodings~~ [QNN EP] Convert BQ to LPBQ encodings May 7, 2026

qti-kromero reviewed May 8, 2026

View reviewed changes

huaychou mentioned this pull request May 13, 2026

[OpBuilder] Support HTP MatMulNBits. #288

Merged

quic-ristha reviewed May 20, 2026

View reviewed changes

quic-pkaul reviewed May 20, 2026

View reviewed changes

qti-kromero reviewed May 20, 2026

View reviewed changes

This was referenced Jun 3, 2026

Translate MatMul to Conv2D 1x1 for LPBQ encodings #357

Open

[QNN EP] Support LPBQ encodings for Conv Op #483

Draft

qti-ashimaj added 5 commits June 15, 2026 16:01

convert BQ to LPBQ encodings

6689161

address review comments

6fd25fc

address review comments

09ac5dc

address review comments

d2e886a

fix lint

965a281

qti-ashimaj force-pushed the dev/qti-ashimaj/bq2lpbq branch from da97fe8 to 965a281 Compare June 15, 2026 10:33

add provider option for bq to lpbq conversion

24b4741

address review comments

0131740

fix scale size

ef2680e

Conversation

qti-ashimaj commented Apr 28, 2026

Description

Motivation and Context

Uh oh!

CLAassistant commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

qti-chuteng commented May 13, 2026

Critical

Major

Minor

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qti-kromero commented Jun 9, 2026

Uh oh!

qti-ashimaj commented Jun 10, 2026

Uh oh!

qti-chuteng commented Jun 17, 2026

🔴 Critical

🟠 Major

🔵 Minor

Uh oh!

qti-ashimaj commented Jun 17, 2026

🔴 Critical

🟠 Major

🔵 Minor

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

CLAassistant commented Apr 28, 2026 •

edited

Loading