Skip to content

[QNN EP] Convert BQ to LPBQ encodings#307

Open
qti-ashimaj wants to merge 8 commits into
mainfrom
dev/qti-ashimaj/bq2lpbq
Open

[QNN EP] Convert BQ to LPBQ encodings#307
qti-ashimaj wants to merge 8 commits into
mainfrom
dev/qti-ashimaj/bq2lpbq

Conversation

@qti-ashimaj

Copy link
Copy Markdown
Collaborator

Description

Convert BQ encodings to htp supported LPBQ encodings

Motivation and Context

The block quantized ONNX models are currently not supported in QNN-EP. In order to execute them via QNN HTP, we are converting the BQ encodings to LPBQ encodings.

@CLAassistant

CLAassistant commented Apr 28, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@qti-ashimaj qti-ashimaj marked this pull request as ready for review May 7, 2026 09:34
@qti-ashimaj qti-ashimaj changed the title [WIP] Convert BQ to LPBQ encodings [QNN EP] Convert BQ to LPBQ encodings May 7, 2026
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc Outdated
params_.axisScaleOffsetEncoding.axis = static_cast<int32_t>(axis);
params_.axisScaleOffsetEncoding.numScaleOffsets = static_cast<uint32_t>(num_elems);
params_.axisScaleOffsetEncoding.scaleOffset = data_span.data();
} else if (is_block_quant) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this logic and some of the other logic in qnn_utils warrants a new test (or a few new tests)? is that planned to be added later or something we can add in this PR?

Comment thread onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.h
Comment thread onnxruntime/core/providers/qnn/builder/qnn_utils.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/qnn_utils.cc
Comment thread onnxruntime/core/providers/qnn/builder/qnn_utils.cc Outdated
Comment thread onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc Outdated
@qti-chuteng

Copy link
Copy Markdown
Collaborator

Critical

[C-1] onnxruntime/core/providers/qnn/builder/qnn_utils.cc:1216,1267-1268
max_int_scale = 1u << bitwidth is wrong by one. For int8 the value is 256, then static_cast<uint8_t>(clamped) wraps the channel's max-scale block to 0, encoding it as zero-scale; for int4 the value 16 overflows the 4-bit block-scale field. The standard LPBQ convention is int_scale ∈ [1, 2^bitwidth − 1]. Use max_int_scale = (1u << bitwidth) - 1 and divide by it in per_channel_scales[c] = max_scale / max_int_scale. This will hit 100% of inputs since the algorithm intentionally maps max_scale to max_int_scale. Pair this fix with M-4.

[C-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc:1
A UTF-8 BOM (EF BB BF) was accidentally inserted at the start of the file. No other .cc file in the repo has a BOM, and some toolchains (notably nvcc and older MSVC) treat the BOM as the first source character and break preprocessing. Re-save without BOM, e.g. sed -i '1s/^\xEF\xBB\xBF//' matmul_op_builder.cc.


Major

[M-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:553-555
constexpr int64_t kDefaultBlockAxis = 0; violates the ONNX QuantizeLinear / DequantizeLinear spec, which has defaulted axis to 1 since opset 13. The two per-channel branches in this same function correctly use DEFAULT_QDQ_AXIS = 1. Change to 1 (or reuse DEFAULT_QDQ_AXIS); otherwise models that omit the axis attribute will be processed with the wrong layout, and qnn_axis = 1 − onnx_axis will compute the wrong direction.

[M-2] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:434-442
The new flag chain has a silent fallback: when scales.size() == 1 && block_size > 0, the is_per_tensor branch wins and block_size is silently ignored. Either set is_per_tensor = scales.size() == 1 && !is_block_quant, or add RETURN_IF_NOT(!(is_per_tensor && is_block_quant), "Per-tensor + block_size>0 is not supported") before the if-else.

[M-3] onnxruntime/test/providers/qnn/ (no new files)
~125 lines of new numeric-conversion logic and zero unit-test coverage. Per checklist [G02] every new logic change must have a corresponding test. Add at minimum: (1) a QnnHTPBackendTests end-to-end with opset-21 MatMul + DequantizeLinear(block_size) for both int4 and int8 weights asserting ExpectedEPNodeAssignment::All; (2) a standalone gtest for TryConvertBlockQuantScalesToLpbq covering max-block-scale round-trip (which would catch C-1), axis=0 vs axis=1 transpose, all-zero scales, asymmetric zero-points must fail; (3) a kill-test fence — disable the BQ→LPBQ change and confirm the new tests fail.

[M-4] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:153
lpbq.blockScaleBitwidth = is_int4 ? 4 : 0; writes blockScaleBitwidth = 0 for int8. Before this PR the false branch was dead code (only the int4 LPBQ fusions used the ctor); this PR exercises it. Fix: lpbq.blockScaleBitwidth = is_int4 ? 4 : 8;. Same root cause as C-1, please fix together.

[M-5] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:548-563
After the scale_shape.size() == 2 check, please also assert scale_shape[0] > 0 && scale_shape[1] > 0 and scale_shape[0] * scale_shape[1] == scales.size() so dynamic-shape scales fail loudly with a clear message instead of overflowing inside TryConvertBlockQuantScalesToLpbq.

[M-6] onnxruntime/core/providers/qnn/builder/qnn_utils.h:289 / qnn_utils.cc:1202
TryConvertBlockQuantScalesToLpbq uses the Try-prefix idiom (which in ORT — e.g. TryFusion — implies bool / non-fatal failure), but it actually returns Ort::Status and the caller uses RETURN_IF_ERROR. Drop the prefix → ConvertBlockQuantScalesToLpbq.


Minor

[N-1] qnn_quant_params_wrapper.cc:581-589bq_offsets transpose path is dead code: TryConvertBlockQuantScalesToLpbq immediately requires every offset to be 0. Validate zero_points are all zero up front and drop the transpose.

[N-2] qnn_utils.h:280-284 / qnn_utils.cc:1232-1239 — Doc comment "max_int_scale = 2^bitwidth (256 for int8, 16 for int4)" embeds the same off-by-one as C-1. Update together with C-1.

[N-3] qnn_utils.cc:382-393operator<< for BLOCKWISE_EXPANSION uses numBlocksPerAxis as the print length for scaleOffsets, but QnnTypes.h documents that array size as axisSize. Either pass axisSize through (e.g. via per_channel_scales_size_) or update the comment to clarify "first numBlocksPerAxis entries only".

[N-4] qnn_quant_params_wrapper.cc:434-610 — The new is_block_quant flag reorders the if-else chain. Add 1-2 lines of comment above mapping (per-tensor / per-channel / block-quant × int4) to each branch and noting the final else as a defensive fallback.

[N-5] qnn_quant_params_wrapper.cc:597const uint32_t kBitwidth = is_int4_type ? 4u : 8u; is a runtime const, not a constexpr. The k prefix is reserved for constexpr / static const in Google C++ style. Rename to bitwidth.

"BQ offsets size must be empty or equal to bq_scales size");
RETURN_IF_NOT(bitwidth > 0 && bitwidth <= 16, "bitwidth must be in range [1, 16]");

const uint32_t max_int_scale = (1u << bitwidth) - 1u;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_int_scale calculation logic differes from what we used in the standard LPBQ workflow

For 4->8 expansion

the int scale will range from [1, 16]

Thus the max_int_sacle is calculated as 2^(decompressed_bw - bq_bw) i.e. 16 in this case

You can refer this class code : https://github.com/qualcomm/aimet/blob/66b2834fcb2711ef3a0bf6cec847090b42da683b/TrainingExtensions/torch/src/python/aimet_torch/quantization/affine/quantizer.py#L1192

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry thought I had commented this but I guess it got lost - we need to upgrade the datatype on per_block_int_scales or otherwise adjust for the differing range. otherwise, we will have overflow for uint8 storing 256

cc @quic-calvnguy

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The int_scale range will be [1, 16] for 4-bit as suggested. Since the original algorithm is applicable for 4-bits only, we are adding a check to convert bq to lpbq encodings for only 4 bit and so the overflow for 8 bits will not happen with this.

// QNN uses different structs to represent quantization parameters depending on:
// - per-tensor (scales.size()==1, no block_size): SCALE_OFFSET or BW_SCALE_OFFSET
// - per-channel (scales.size()>1, no block_size): AXIS_SCALE_OFFSET or BW_AXIS_SCALE_OFFSET
// - block quantization (block_size>0): BLOCKWISE_EXPANSION (LPBQ)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, We have both block quantization (QNN_QUANTIZATION_ENCODING_BLOCK) and lpbq (QNN_QUANTIZATION_BLOCKWISE_EXPANSION) supported in QNN.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

params_.bwAxisScaleOffsetEncoding.scales = scales_span.data();
params_.bwAxisScaleOffsetEncoding.offsets = zps_span.data();
} else if (!is_per_tensor && !is_int4_type) {
} else if (is_per_channel && !is_int4_type) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this gets taken elsewhere.
But before we make BQ->LPBQ transition, besides the two constraints (is_per_channel and is_int4_type) we also need to make sure that the input and output activation of the op is Integer as well. Otherwise better keep in BQ.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check for (is_block_quant && is_int4_type) , the conversion will only happen for these data types else not.


// Determine block axis (= ONNX axis attribute).
constexpr int64_t DEFAULT_QDQ_AXIS = 1;
int64_t axis = ort_quant_params->axis.value_or(DEFAULT_QDQ_AXIS);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code doesn't seem to take care of transposed weights. Which will swap axis.
Please see transB here.
https://onnx.ai/onnx/operators/onnx__Gemm.html

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be handled in the separate op builders using the HandleTranspose/HandleUnsqueeze API which will update the axis.

}

// Algorithm:
// max_int_scale = 2^bitwidth - 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also correct based on rishabh's comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

<< " blockScaleBitwidth=" << lpbq.blockScaleBitwidth;
// For LPBQ, num_elems are not present in the quantize_params,
// we are using numBlocksPerAxis instead to print the first numBlocksPerAxis scale offset values
size_t num_elems = lpbq.numBlocksPerAxis;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is lpbq.scaleOffsets channel-major or block-major?

am wondering what this does in plain language? prints the first numBlocksPerAxis scaleOffsets so is that just the blocks for the first row of axis?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scale offsets are of size num_channels only but since this qnn struct doesn't have numScaleOffsets member, we are taking numBlocksPerAxis as a proxy to print the first few scale and offset values.

utils::GetInitializerShape(ort_quant_params->scale, qnn_model_wrapper.GetOrtApi());
RETURN_IF_NOT(scale_shape.size() >= 2,
"Block quantization scale tensors must have at least rank 2 for LPBQ conversion");
RETURN_IF_NOT(scale_shape[0] > 0 && scale_shape[1] > 0,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this and below work properly for ranks > 2?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have tested this on models having MatMul BQ (rank 2) and Conv BQ (rank 4) weights.

@qti-kromero

Copy link
Copy Markdown
Collaborator

LGTM aside from missing tests

@qti-ashimaj

Copy link
Copy Markdown
Collaborator Author

#307 (comment)

LGTM aside from missing tests

The test cases are being added in the individual op builder PRs

@qti-ashimaj qti-ashimaj force-pushed the dev/qti-ashimaj/bq2lpbq branch from da97fe8 to 965a281 Compare June 15, 2026 10:33
@qti-chuteng

Copy link
Copy Markdown
Collaborator

🔴 Critical

[C-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L597
The else if (is_block_quant) branch only builds LPBQ params when is_int4_type && convert_bq_to_lpbq; the inner if has no else, so a non-int4 (e.g. int8) block-quant weight falls through to return Ort::Status() with params_ left as QNN_QUANTIZE_PARAMS_INIT (UNDEFINED). This is reachable under the default convert_bq_to_lpbq=true, where an int8 BQ MatMul weight is no longer claimed by IsBQWeight() and routes through this generic path — yielding wrong numerics or a QNN compile failure instead of a clean error. Add an else that returns MAKE_EP_FAIL(...).


🟠 Major

[M-1] onnxruntime/core/providers/qnn/qnn_execution_provider.cc:L915
The new option convert_bq_to_lpbq defaults to true (both in ModelSettings and ParseBoolOption). This violates the opt-in convention for new session config keys and silently flips existing int4 BQ models (BW_FLOAT_BLOCK, landed in PR #288) to a lossy LPBQ approximation after upgrade, changing their output. Default to false, or document the behavior change explicitly.

[M-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc:L91
Only MatMul's IsBQWeight() honors convert_bq_to_lpbq. The BQ detectors in conv_op_builder.cc and gemm_op_builder.cc (IsBQGemmWeight) ignore the option and always use BW_FLOAT_BLOCK, so under convert_bq_to_lpbq=1 MatMul uses LPBQ while Conv/Gemm still use BQ — inconsistent and contradicting the doc. Gate Conv/Gemm the same way, or scope the doc to MatMul-only.

[M-3] onnxruntime/test/providers/qnn/matmul_test.cc:L365
Both new provider-option helpers set opts["convert_bq_to_lpbq"] = "0", which disables the entire new LPBQ conversion path. No test runs it with "1" (or the default), so the new algorithm could be fully broken and all tests would still pass. Add an HTP QDQ test (and ideally a direct unit test for ConvertBlockQuantScalesToLpbq) that exercises the conversion.


🔵 Minor

[N-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.h → per_channel_scales_size_
uint32_t per_channel_scales_size_; has no in-class initializer (unlike adjacent block_encoding_tensor_rank_ = 0 / num_blocks_ = 0) and is read in the copy ctor for LPBQ. In the degenerate num_elems==0 LPBQ case it stays indeterminate.

  uint32_t per_channel_scales_size_ = 0;

[N-2] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L655
const uint32_t bitwidth = is_int4_type ? 4u : 8u; sits inside if (is_int4_type && ...), so the 8u arm is dead code; ConvertBlockQuantScalesToLpbq also hard-rejects non-4 bitwidth. Use const uint32_t bitwidth = 4u; with a comment to avoid implying int8 support.

[N-3] docs/execution_providers/QNN-ExecutionProvider.md:L200
The doc says block-quant encodings "will be converted to LPBQ", but conversion is int4-only and (per M-2) MatMul-only. Update to match the actual scope and final default value.

@qti-ashimaj

Copy link
Copy Markdown
Collaborator Author

🔴 Critical

[C-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L597 The else if (is_block_quant) branch only builds LPBQ params when is_int4_type && convert_bq_to_lpbq; the inner if has no else, so a non-int4 (e.g. int8) block-quant weight falls through to return Ort::Status() with params_ left as QNN_QUANTIZE_PARAMS_INIT (UNDEFINED). This is reachable under the default convert_bq_to_lpbq=true, where an int8 BQ MatMul weight is no longer claimed by IsBQWeight() and routes through this generic path — yielding wrong numerics or a QNN compile failure instead of a clean error. Add an else that returns MAKE_EP_FAIL(...).

🟠 Major

[M-1] onnxruntime/core/providers/qnn/qnn_execution_provider.cc:L915 The new option convert_bq_to_lpbq defaults to true (both in ModelSettings and ParseBoolOption). This violates the opt-in convention for new session config keys and silently flips existing int4 BQ models (BW_FLOAT_BLOCK, landed in PR #288) to a lossy LPBQ approximation after upgrade, changing their output. Default to false, or document the behavior change explicitly.

[M-2] onnxruntime/core/providers/qnn/builder/opbuilder/matmul_op_builder.cc:L91 Only MatMul's IsBQWeight() honors convert_bq_to_lpbq. The BQ detectors in conv_op_builder.cc and gemm_op_builder.cc (IsBQGemmWeight) ignore the option and always use BW_FLOAT_BLOCK, so under convert_bq_to_lpbq=1 MatMul uses LPBQ while Conv/Gemm still use BQ — inconsistent and contradicting the doc. Gate Conv/Gemm the same way, or scope the doc to MatMul-only.

[M-3] onnxruntime/test/providers/qnn/matmul_test.cc:L365 Both new provider-option helpers set opts["convert_bq_to_lpbq"] = "0", which disables the entire new LPBQ conversion path. No test runs it with "1" (or the default), so the new algorithm could be fully broken and all tests would still pass. Add an HTP QDQ test (and ideally a direct unit test for ConvertBlockQuantScalesToLpbq) that exercises the conversion.

🔵 Minor

[N-1] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.h → per_channel_scales_size_ uint32_t per_channel_scales_size_; has no in-class initializer (unlike adjacent block_encoding_tensor_rank_ = 0 / num_blocks_ = 0) and is read in the copy ctor for LPBQ. In the degenerate num_elems==0 LPBQ case it stays indeterminate.

  uint32_t per_channel_scales_size_ = 0;

[N-2] onnxruntime/core/providers/qnn/builder/qnn_quant_params_wrapper.cc:L655 const uint32_t bitwidth = is_int4_type ? 4u : 8u; sits inside if (is_int4_type && ...), so the 8u arm is dead code; ConvertBlockQuantScalesToLpbq also hard-rejects non-4 bitwidth. Use const uint32_t bitwidth = 4u; with a comment to avoid implying int8 support.

[N-3] docs/execution_providers/QNN-ExecutionProvider.md:L200 The doc says block-quant encodings "will be converted to LPBQ", but conversion is int4-only and (per M-2) MatMul-only. Update to match the actual scope and final default value.

[C-1] If the conditions for BQ->LPBQ conversion are not met, ORT::Status is returned and the respective op builder files will construct the BW_FLOAT_BLOCK encodings.

[M-1] Updated the default value to false

[M-2] The respective op lpbq PRs will take care of these.

[M-3] The respective op lpbq PRs have the tests added.

[N-1] Done

[N-2] Done

[N-3] Updated the description

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants