[QNN EP] Convert BQ to LPBQ encodings #307
params_.axisScaleOffsetEncoding.axis = static_cast<int32_t>(axis);
params_.axisScaleOffsetEncoding.numScaleOffsets = static_cast<uint32_t>(num_elems);
params_.axisScaleOffsetEncoding.scaleOffset = data_span.data();
} else if (is_block_quant) {
I feel like this logic and some of the other logic in qnn_utils warrants a new test (or a few new tests). Is that planned for later, or is it something we can add in this PR?
Critical[C-1] [C-2] Major[M-1] [M-2] [M-3] [M-4] [M-5] [M-6] Minor[N-1] [N-2] [N-3] [N-4] [N-5] |
"BQ offsets size must be empty or equal to bq_scales size");
RETURN_IF_NOT(bitwidth > 0 && bitwidth <= 16, "bitwidth must be in range [1, 16]");

const uint32_t max_int_scale = (1u << bitwidth) - 1u;
The max_int_scale calculation differs from what we use in the standard LPBQ workflow.
For 4->8 expansion, the int scale ranges over [1, 16].
Thus max_int_scale is calculated as 2^(decompressed_bw - bq_bw), i.e. 16 in this case.
You can refer to this class: https://github.com/qualcomm/aimet/blob/66b2834fcb2711ef3a0bf6cec847090b42da683b/TrainingExtensions/torch/src/python/aimet_torch/quantization/affine/quantizer.py#L1192
Sorry, I thought I had commented on this but I guess it got lost. We need to upgrade the datatype of per_block_int_scales or otherwise adjust for the differing range; otherwise, storing 256 in a uint8 will overflow.
Done. The int_scale range will be [1, 16] for 4-bit, as suggested. Since the original algorithm applies to 4-bit only, we are adding a check so that BQ encodings are converted to LPBQ only for 4-bit, so the 8-bit overflow cannot occur.
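For readers following the thread, here is a minimal sketch of the decomposition being discussed, assuming the usual LPBQ scheme in which each channel keeps one float scale and each block stores a small integer multiplier. The names `MaxIntScale`, `DecomposeChannel`, and `LpbqChannel` are hypothetical and are not taken from the PR; the rounding and clamping policy is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for one channel's LPBQ parameters (not the PR's type).
struct LpbqChannel {
  float channel_scale;                    // per-channel float scale
  std::vector<uint8_t> int_block_scales;  // per-block multipliers in [1, max_int_scale]
};

// max_int_scale = 2^(decompressed_bw - bq_bw), e.g. 16 for 4->8 expansion.
inline uint32_t MaxIntScale(uint32_t decompressed_bw, uint32_t bq_bw) {
  assert(decompressed_bw > bq_bw && decompressed_bw - bq_bw < 32);
  return 1u << (decompressed_bw - bq_bw);
}

// Splits the float block scales of one channel into a channel scale and
// integer block scales (assumed scheme: largest block maps to max_int_scale).
inline LpbqChannel DecomposeChannel(const std::vector<float>& block_scales,
                                    uint32_t decompressed_bw, uint32_t bq_bw) {
  assert(!block_scales.empty());
  const uint32_t max_int_scale = MaxIntScale(decompressed_bw, bq_bw);
  assert(max_int_scale <= 255u);  // must fit the uint8 storage used here
  const float max_scale = *std::max_element(block_scales.begin(), block_scales.end());
  assert(max_scale > 0.0f);

  LpbqChannel out;
  out.channel_scale = max_scale / static_cast<float>(max_int_scale);
  for (float s : block_scales) {
    long q = std::lround(s / out.channel_scale);
    // Clamp to [1, max_int_scale] so tiny blocks never collapse to zero.
    q = std::max<long>(1, std::min<long>(q, static_cast<long>(max_int_scale)));
    out.int_block_scales.push_back(static_cast<uint8_t>(q));
  }
  return out;
}
```

With `decompressed_bw = 8` and `bq_bw = 4`, the multipliers stay within [1, 16], which fits comfortably in a `uint8_t`; the explicit assertion on `max_int_scale` is there to catch the overflow case raised above.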
// QNN uses different structs to represent quantization parameters depending on:
// - per-tensor (scales.size()==1, no block_size): SCALE_OFFSET or BW_SCALE_OFFSET
// - per-channel (scales.size()>1, no block_size): AXIS_SCALE_OFFSET or BW_AXIS_SCALE_OFFSET
// - block quantization (block_size>0): BLOCKWISE_EXPANSION (LPBQ)
Specifically, QNN supports both block quantization (QNN_QUANTIZATION_ENCODING_BLOCK) and LPBQ (QNN_QUANTIZATION_ENCODING_BLOCKWISE_EXPANSION).
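The selection rules in the comment block above, together with the BQ versus LPBQ distinction, can be sketched as a small decision function. This is an illustration only: `EncodingKind` and `ChooseEncoding` are hypothetical names, the mapping of int4 types to the BW_* variants is an assumption inferred from the surrounding diff, and the activation-type constraint raised later in the review is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the QNN encoding kinds named in this thread;
// the real enum lives in the QNN SDK headers.
enum class EncodingKind {
  kScaleOffset,
  kBwScaleOffset,
  kAxisScaleOffset,
  kBwAxisScaleOffset,
  kBlock,               // plain block quantization (BQ)
  kBlockwiseExpansion   // LPBQ
};

// Picks an encoding from the number of scales, the block size, and the type.
inline EncodingKind ChooseEncoding(std::size_t num_scales, std::int64_t block_size,
                                   bool is_int4_type) {
  assert(num_scales > 0);
  if (block_size > 0) {
    // Assumption: only 4-bit block-quantized tensors are converted to LPBQ.
    return is_int4_type ? EncodingKind::kBlockwiseExpansion : EncodingKind::kBlock;
  }
  if (num_scales == 1) {
    // Per-tensor; assumed: int4 uses the bitwidth-carrying variant.
    return is_int4_type ? EncodingKind::kBwScaleOffset : EncodingKind::kScaleOffset;
  }
  // Per-channel.
  return is_int4_type ? EncodingKind::kBwAxisScaleOffset : EncodingKind::kAxisScaleOffset;
}
```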
params_.bwAxisScaleOffsetEncoding.scales = scales_span.data();
params_.bwAxisScaleOffsetEncoding.offsets = zps_span.data();
} else if (!is_per_tensor && !is_int4_type) {
} else if (is_per_channel && !is_int4_type) {
Not sure if this is taken care of elsewhere.
But before we make the BQ->LPBQ transition, besides the two constraints (is_per_channel and is_int4_type), we also need to make sure that the input and output activations of the op are integer as well. Otherwise it is better to keep BQ.
Added a check for (is_block_quant && is_int4_type); the conversion happens only for these data types and is skipped otherwise.
// Determine block axis (= ONNX axis attribute).
constexpr int64_t DEFAULT_QDQ_AXIS = 1;
int64_t axis = ort_quant_params->axis.value_or(DEFAULT_QDQ_AXIS);
The code doesn't seem to handle transposed weights, which will swap the axis.
Please see transB here:
https://onnx.ai/onnx/operators/onnx__Gemm.html
This can be handled in the individual op builders using the HandleTranspose/HandleUnsqueeze API, which will update the axis.
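To make the transB concern concrete, here is a minimal sketch of how a quantization axis moves when a weight is permuted. `AxisAfterTranspose` is a hypothetical helper written for illustration; it does not describe what HandleTranspose actually does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the position of the original `axis` after applying `perm`, where
// output dimension i is taken from input dimension perm[i] (ONNX Transpose
// convention). Negative axes are normalized first.
inline std::int64_t AxisAfterTranspose(std::int64_t axis,
                                       const std::vector<std::int64_t>& perm) {
  const std::int64_t rank = static_cast<std::int64_t>(perm.size());
  if (axis < 0) axis += rank;
  assert(axis >= 0 && axis < rank);
  for (std::size_t i = 0; i < perm.size(); ++i) {
    if (perm[i] == axis) return static_cast<std::int64_t>(i);
  }
  assert(false && "perm is not a valid permutation");
  return -1;
}
```

For a rank-2 Gemm weight with transB set, the permutation is {1, 0}, so a block axis of 1 becomes 0 and vice versa.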
}

// Algorithm:
// max_int_scale = 2^bitwidth - 1
Also correct this based on Rishabh's comment.
<< " blockScaleBitwidth=" << lpbq.blockScaleBitwidth;
// For LPBQ, num_elems are not present in the quantize_params,
// we are using numBlocksPerAxis instead to print the first numBlocksPerAxis scale offset values
size_t num_elems = lpbq.numBlocksPerAxis;
Is lpbq.scaleOffsets channel-major or block-major?
I am wondering what this does in plain language. It prints the first numBlocksPerAxis scaleOffsets, so is that just the blocks for the first row of the axis?
The scale offsets are of size num_channels only, but since this QNN struct doesn't have a numScaleOffsets member, we use numBlocksPerAxis as a proxy to print the first few scale and offset values.
utils::GetInitializerShape(ort_quant_params->scale, qnn_model_wrapper.GetOrtApi());
RETURN_IF_NOT(scale_shape.size() >= 2,
"Block quantization scale tensors must have at least rank 2 for LPBQ conversion");
RETURN_IF_NOT(scale_shape[0] > 0 && scale_shape[1] > 0,
Does this, and the code below it, work properly for ranks > 2?
Yes, we have tested this on models having MatMul BQ (rank 2) and Conv BQ (rank 4) weights.
LGTM aside from missing tests
The test cases are being added in the individual op builder PRs
da97fe8 to 965a281
🔴 Critical[C-1] 🟠 Major[M-1] [M-2] [M-3] 🔵 Minor[N-1] [N-2] [N-3] |
[C-1] If the conditions for BQ->LPBQ conversion are not met, ORT::Status is returned and the respective op builder files will construct the
[M-1] Updated the default value to
[M-2] The respective op lpbq PRs will take care of these.
[M-3] The respective op lpbq PRs have the tests added.
[N-1] Done
[N-2] Done
[N-3] Updated the description
Description
Convert BQ encodings to HTP-supported LPBQ encodings.
Motivation and Context
Block-quantized ONNX models are currently not supported in the QNN EP. To execute them on QNN HTP, we convert the BQ encodings to LPBQ encodings.