[ET-VK][Ops] torchao.quantize_affine vulkan impl and shader and cleanup#12575
Conversation
# Changes
* Implement `torchao.quantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support
* Add block-wise quantization mode in `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization
* Introduce comprehensive test suite in `affine_test.cpp` with multi-dimensional tensor validation and reference implementation
* Extend quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters
BE: Improved the documentation in the shader logic which is more detailed and clear
NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as `affine_test.cpp` as the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future.
# Motivation
The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for enabling dynamic quantization efficiently. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:
* **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions
* **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping
# Operator Description
The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision.
The quantization formula is: `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)`
**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.
# Block-wise Quantization Implementation
Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`:
* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation
The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`
# Shader Algorithm Overview
## Texture Storage Implementation (`quantize_texture.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)
**Block-wise Mode Algorithm**:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing.
For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension.
The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
Each component is quantized using its corresponding block's parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])` and written to the output texel.
## Buffer Storage Implementation (`quantize_buffer.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints
**Block-wise Mode Algorithm**:
The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations.
For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`.
**Future Improvements**: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently.
Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12575
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 495b62b with merge base b6b7a16 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
|
This pull request was exported from Phabricator. Differential Revision: D78302195 |
…r and cleanup"
# Changes
* Implement `torchao.quantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support
* Add block-wise quantization mode in `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization
* Introduce comprehensive test suite in `affine_test.cpp` with multi-dimensional tensor validation and reference implementation
* Extend quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters
BE: Improved the documentation in the shader logic which is more detailed and clear
NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as `affine_test.cpp` as the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future.
# Motivation
The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for enabling dynamic quantization efficiently. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:
* **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions
* **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping
# Operator Description
The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision.
The quantization formula is: `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)`
**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.
# Block-wise Quantization Implementation
Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`:
* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation
The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`
# Shader Algorithm Overview
## Texture Storage Implementation (`quantize_texture.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)
**Block-wise Mode Algorithm**:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing.
For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension.
The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
Each component is quantized using its corresponding block's parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])` and written to the output texel.
## Buffer Storage Implementation (`quantize_buffer.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints
**Block-wise Mode Algorithm**:
The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations.
For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`.
**Future Improvements**: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently.
Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)
cc SS-JIA manuelcandales cbilgin
[ghstack-poisoned]
|
This pull request was exported from Phabricator. Differential Revision: D78302195 |
…r and cleanup"
# Changes
* Implement `torchao.quantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support
* Add block-wise quantization mode in `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization
* Introduce comprehensive test suite in `affine_test.cpp` with multi-dimensional tensor validation and reference implementation
* Extend quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters
BE: Improved the documentation in the shader logic which is more detailed and clear
NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as `affine_test.cpp` as the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future.
# Motivation
The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for enabling dynamic quantization efficiently. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:
* **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions
* **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping
# Operator Description
The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision.
The quantization formula is: `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)`
**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.
# Block-wise Quantization Implementation
Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`:
* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation
The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`
# Shader Algorithm Overview
## Texture Storage Implementation (`quantize_texture.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)
**Block-wise Mode Algorithm**:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing.
For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension.
The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
Each component is quantized using its corresponding block's parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])` and written to the output texel.
## Buffer Storage Implementation (`quantize_buffer.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints
**Block-wise Mode Algorithm**:
The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations.
For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`.
**Future Improvements**: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently.
Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)
cc SS-JIA manuelcandales cbilgin
[ghstack-poisoned]
|
This pull request was exported from Phabricator. Differential Revision: D78302195 |
…r and cleanup"
# Changes
* Implement `torchao.quantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support
* Add block-wise quantization mode in `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization
* Introduce comprehensive test suite in `affine_test.cpp` with multi-dimensional tensor validation and reference implementation
* Extend quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters
BE: Improved the documentation in the shader logic which is more detailed and clear
NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as `affine_test.cpp` as the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future.
# Motivation
The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for enabling dynamic quantization efficiently. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:
* **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions
* **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping
# Operator Description
The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision.
The quantization formula is: `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)`
**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.
# Block-wise Quantization Implementation
Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`:
* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation
The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`
# Shader Algorithm Overview
## Texture Storage Implementation (`quantize_texture.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)
**Block-wise Mode Algorithm**:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing.
For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension.
The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
Each component is quantized using its corresponding block's parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])` and written to the output texel.
## Buffer Storage Implementation (`quantize_buffer.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints
**Block-wise Mode Algorithm**:
The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations.
For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`.
**Future Improvements**: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently.
Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)
cc SS-JIA manuelcandales cbilgin
[ghstack-poisoned]
|
This pull request was exported from Phabricator. Differential Revision: D78302195 |
…r and cleanup"
# Changes
* Implement `torchao.quantize_affine` operator in Vulkan backend with comprehensive texture and buffer storage support
* Add block-wise quantization mode in `quantize_texture.glsl` and `quantize_buffer.glsl` shaders for configurable tensor block quantization
* Introduce comprehensive test suite in `affine_test.cpp` with multi-dimensional tensor validation and reference implementation
* Extend quantization infrastructure in `Quantize.cpp` to handle affine transformations with configurable block sizes and quantization parameters
BE: Improved the documentation in the shader logic which is more detailed and clear
NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as `affine_test.cpp` as the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future.
# Motivation
The existing Vulkan quantization infrastructure lacked support for the `torchao.quantize_affine` operator, which is essential for enabling dynamic quantization efficiently. The `quantize_affine` operator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:
* **Block-wise Quantization**: Applies quantization parameters to configurable tensor blocks rather than entire tensors, improving quantization accuracy for heterogeneous data distributions
* **Affine Transformation**: Uses the formula `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)` for precise floating-point to integer mapping
# Operator Description
The `quantize_affine` operator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision.
The quantization formula is: `qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)`
**Storage Requirements**: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.
# Block-wise Quantization Implementation
Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in `Quantize.cpp`:
* **`block_size_vec`**: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)
* **`tensor_size_whcn`**: Input tensor dimensions converted to WHCN layout using `utils::make_whcn_ivec4()`
* **`num_blocks_vec`**: Number of blocks per dimension calculated as `tensor_size_whcn / block_size_vec`
* **`block_stride_vec`**: Pre-computed linear strides for block grid indexing `{1, #W, #W*#H, #W*#H*#C}` to enable efficient block ID calculation
The block coordinate calculation uses: `bcoord = tidx / blockSize` where `tidx` is the tensor coordinate in WHCN layout, then the linear block ID is computed as: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`
# Shader Algorithm Overview
## Texture Storage Implementation (`quantize_texture.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on texture dimensions
- **Local WG Size**: Default with special handling for batch dimension quantization (Z dimension set to 1 for proper workgroup dispatching when `global_workgroup_size[2] > 1`)
**Block-wise Mode Algorithm**:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position `pos`, it calculates a base tensor index `base_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)` to account for width-packing.
For each of the 4 components in the texel, it computes the actual tensor coordinate: `tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))` where `foldedZ = pos.z` handles batch-channel folding in 4D tensors and `C_total = numBlocks.z * blockSize.z` represents the total channel dimension.
The block coordinate is calculated using integer division: `bcoord = tidx / blockSize`, then the linear block ID uses pre-computed strides: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
Each component is quantized using its corresponding block's parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])` and written to the output texel.
## Buffer Storage Implementation (`quantize_buffer.glsl`)
**Workgroup Configuration**:
- **Global WG Size**: Default sizing based on buffer element count
- **Local WG Size**: Default sizing without special constraints
**Block-wise Mode Algorithm**:
The shader processes linear buffer indices using `gl_GlobalInvocationID.x` as the output buffer index. It converts this to tensor coordinates using `bufi_to_tidx(out_bufi, t_out_strides, out_dim_order)` which handles the buffer-to-tensor index mapping with proper stride calculations.
For each element, it computes the block coordinate directly: `bcoord = out_tidx / blockSize` where `out_tidx` is the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach: `block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w`.
The element value is loaded using the corresponding input buffer index: `value = t_in[in_bufi]` where `in_bufi = tidx_to_bufi(out_tidx, t_in_strides)`. Quantization applies the block-specific parameters: `qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])`.
**Future Improvements**: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently.
Differential Revision: [D78302195](https://our.internmc.facebook.com/intern/diff/D78302195/)
cc SS-JIA manuelcandales cbilgin
[ghstack-poisoned]
|
This pull request was exported from Phabricator. Differential Revision: D78302195 |
…up (#13001) This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: #12575 by @ahmtox ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/43/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/43/head Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/42/orig Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/43/orig @diff-train-skip-merge cc @SS-JIA @manuelcandales @cbilgin --------- Co-authored-by: morelos <morelos@devvm4573.ash0.facebook.com> Co-authored-by: ahmtox <69552192+ahmtox@users.noreply.github.com>
Stack from ghstack (oldest at bottom):
Changes
torchao.quantize_affineoperator in Vulkan backend with comprehensive texture and buffer storage supportquantize_texture.glslandquantize_buffer.glslshaders for configurable tensor block quantizationaffine_test.cppwith multi-dimensional tensor validation and reference implementationQuantize.cppto handle affine transformations with configurable block sizes and quantization parametersBE: Improved the documentation in the shader logic which is more detailed and clear
NOTE: I delegated the quantize_affine and future affine operators through a new custom test file denoted as
affine_test.cppas the other quantization testing framework was getting a little large, and it makes more sense to separate the namespace between torchao and quantized_decomposed. I believe the _decomposed namespace is getting phased out in favor of this affine operator so deprecation will be easier in the future.Motivation
The existing Vulkan quantization infrastructure lacked support for the
torchao.quantize_affineoperator, which is essential for enabling dynamic quantization efficiently. Thequantize_affineoperator provides flexible block-wise quantization that allows different scale and zero-point values for tensor blocks, enabling:qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)for precise floating-point to integer mappingOperator Description
The
quantize_affineoperator converts floating-point tensor values to n-bit integer representations using pre-computed quantization parameters (scale and zero_point) applied to configurable tensor blocks. Block-wise quantization divides tensors into blocks and applies separate quantization parameters to each block, allowing fine-grained control over quantization precision.The quantization formula is:
qvalue = clamp(round(value / scale) + zero_point, quant_min, quant_max)Storage Requirements: Scale and zero_point tensors must use buffer storage with width-packed layout. Input/output tensors support both buffer and texture storage with standard axis mapping.
Block-wise Quantization Implementation
Block-wise quantization enables fine-grained quantization by dividing tensors into blocks and applying separate quantization parameters to each block. The implementation uses several key data structures computed in
Quantize.cpp:block_size_vec: WHCN-ordered block dimensions converted from PyTorch NCHW layout (e.g., [3,3,2,1] for 3×3×2×1 blocks)tensor_size_whcn: Input tensor dimensions converted to WHCN layout usingutils::make_whcn_ivec4()num_blocks_vec: Number of blocks per dimension calculated astensor_size_whcn / block_size_vecblock_stride_vec: Pre-computed linear strides for block grid indexing{1, #W, #W*#H, #W*#H*#C}to enable efficient block ID calculationThe block coordinate calculation uses:
bcoord = tidx / blockSizewheretidxis the tensor coordinate in WHCN layout, then the linear block ID is computed as:block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.wShader Algorithm Overview
Texture Storage Implementation (
quantize_texture.glsl)Workgroup Configuration:
global_workgroup_size[2] > 1)Block-wise Mode Algorithm:
The shader processes 3D texture positions where each position represents a texel containing 4 width-packed components. For each texel at position
pos, it calculates a base tensor indexbase_tidx = ivec4(pos.x * 4, pos.y, pos.z, 0)to account for width-packing.For each of the 4 components in the texel, it computes the actual tensor coordinate:
tidx = ivec4(base_tidx.x + i, base_tidx.y, (foldedZ % C_total), (foldedZ / C_total))wherefoldedZ = pos.zhandles batch-channel folding in 4D tensors andC_total = numBlocks.z * blockSize.zrepresents the total channel dimension.The block coordinate is calculated using integer division:
bcoord = tidx / blockSize, then the linear block ID uses pre-computed strides:block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w.Each component is quantized using its corresponding block's parameters:
qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id])and written to the output texel.Buffer Storage Implementation (
quantize_buffer.glsl)Workgroup Configuration:
Block-wise Mode Algorithm:
The shader processes linear buffer indices using
gl_GlobalInvocationID.xas the output buffer index. It converts this to tensor coordinates usingbufi_to_tidx(out_bufi, t_out_strides, out_dim_order)which handles the buffer-to-tensor index mapping with proper stride calculations.For each element, it computes the block coordinate directly:
bcoord = out_tidx / blockSizewhereout_tidxis the 4D tensor coordinate in WHCN layout. The linear block ID calculation uses the same pre-computed stride approach:block_id = bcoord.x * blockStride.x + bcoord.y * blockStride.y + bcoord.z * blockStride.z + bcoord.w * blockStride.w.The element value is loaded using the corresponding input buffer index:
value = t_in[in_bufi]wherein_bufi = tidx_to_bufi(out_tidx, t_in_strides). Quantization applies the block-specific parameters:qvalue = quantize_val(value, t_scale[block_id], t_zero_point[block_id]).Future Improvements: Dynamic workgroup sizing based on block dimensions, there is likely a better method to making it better than what it is currently.
Differential Revision: D78302195
cc @SS-JIA @manuelcandales @cbilgin