
[Inductor] Support qlinear #3

Open
LevelDownRefine wants to merge 427 commits into wengshiy/scaled_mm from wengshiy/qlinear

Conversation

@LevelDownRefine
Owner

No description provided.

@LevelDownRefine LevelDownRefine changed the title Wengshiy/qlinear [Inductor] Support qlinear Jul 7, 2025
@LevelDownRefine
Owner Author

blocked by pytorch/pytorch#157684

danielvegamyhre and others added 28 commits August 7, 2025 11:34
…to (pytorch#2518)

tuples

Summary:
This is needed because lists are not hashable (since they are mutable),
and as a result we cannot have literals_to_ph in pattern rewrites used
inside reference_representation_rewrite.py.
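
As a minimal aside (not part of the original commit), this is the Python behavior the conversion works around: lists cannot serve as dict keys or set members, while tuples of the same values can.

```python
# Lists are mutable and therefore unhashable; tuples of the same values hash fine.
literal_list = [1, 2, 3]
literal_tuple = (1, 2, 3)

try:
    {literal_list: "placeholder"}
except TypeError as e:
    print(e)  # unhashable type: 'list'

print({literal_tuple: "placeholder"})  # {(1, 2, 3): 'placeholder'}
```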

Test Plan:
CI + next diff relies on this feature

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
* Update KleidiAI

* up

* up

* up
…erences (pytorch#2651)

Summary:
This PR checks that different kernel preferences for Float8Tensor (AUTO, TORCH, and FBGEMM)
produce similar numerics.

The triton implementation and the torchao implementation are actually a bit different right now; we need to decide whether we should fix that.

1. Difference in the quantize op
The main difference seems to be that the triton implementation is using:
```
a_scale = MAX_FP8 / max_abs
# then
a_scale = 1.0 / a_scale
a_fp8 = a * a_scale
```

while torch is doing:
```
a_scale = max_abs / MAX_FP8
a_fp8 = a / a_scale
```

The hp_value_lb and hp_value_ub settings are also slightly different.

triton choose scale and quantize code: https://github.com/pytorch/FBGEMM/blob/a4286c01ef01dad435b2ec8798605127d3032cd8/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py#L2382-L2392

torchao choose scale and quantize code:
https://github.com/pytorch/ao/blob/3c466f844684af0fb80014094f2ca8663881eb33/torchao/quantization/quant_primitives.py#L2183
https://github.com/pytorch/ao/blob/3c466f844684af0fb80014094f2ca8663881eb33/torchao/quantization/quant_primitives.py#L2283
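
A rough side-by-side sketch of the two scale computations quoted above (my reading of the pseudocode, not the actual FBGEMM or torchao kernels); the only intended difference is whether the reciprocal of the scale is materialized, which can perturb the low-order bits:

```python
import torch

MAX_FP8 = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_triton_style(a: torch.Tensor):
    # scale computed as MAX_FP8 / max_abs; the stored (dequant) scale is its reciprocal
    max_abs = a.abs().amax().float()
    a_scale = MAX_FP8 / max_abs
    a_fp8 = (a * a_scale).clamp(-MAX_FP8, MAX_FP8).to(torch.float8_e4m3fn)
    return a_fp8, 1.0 / a_scale

def quantize_torch_style(a: torch.Tensor):
    # scale computed directly as max_abs / MAX_FP8
    max_abs = a.abs().amax().float()
    a_scale = max_abs / MAX_FP8
    a_fp8 = (a / a_scale).clamp(-MAX_FP8, MAX_FP8).to(torch.float8_e4m3fn)
    return a_fp8, a_scale

x = torch.randn(64, 64)
fp8_a, scale_a = quantize_triton_style(x)
fp8_b, scale_b = quantize_torch_style(x)
# Mathematically identical, but 1 / (MAX_FP8 / max_abs) != max_abs / MAX_FP8 exactly
# in floating point, so results can differ in the last bits.
print((fp8_a.float() - fp8_b.float()).abs().max(), (scale_a - scale_b).abs())
```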

2. A (potential) difference in the matrix multiplication ops

TORCH and AUTO/FBGEMM are using different quantized mm ops.

Added a reverse option to bring the SQNR closer:
```
granularity: PerTensor()  sizes: ((128,), 256, 128)  kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor()  sizes: ((128,), 256, 128)  kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
.granularity: PerTensor()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
.granularity: PerRow()  sizes: ((128,), 256, 128)  kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow()  sizes: ((128,), 256, 128)  kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
.granularity: PerRow()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.AUTO tensor(64.5000, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.FBGEMM tensor(68., device='cuda:0', dtype=torch.bfloat16)
```
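For reference, a minimal sketch of how an SQNR value like the ones above can be computed (the actual test presumably uses torchao's own error helper):

```python
import torch

def sqnr(reference: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in dB; inf when the two tensors match exactly.
    noise = (reference - candidate).pow(2).mean()
    signal = reference.pow(2).mean()
    return 10 * torch.log10(signal / noise)

ref = torch.randn(32, 128)
print(sqnr(ref, ref))          # inf, like the PerTensor rows above
print(sqnr(ref, ref + 1e-3))   # a large but finite value
```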
Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_kernel_preference_numerical_equivalence

Reviewers:

Subscribers:

Tasks:

Tags:
…amicActivationInt4WeightConfig (pytorch#2474)

Summary:
We will:
* deprecate FbgemmConfig since it's a single kernel (later)
* categorize things by derived dtype + packing format, e.g. int4 preshuffled, float8 plain
* add PackingFormat (preshuffled, plain) in Version 2 of Int4WeightOnlyConfig; the older AQT tensor will remain in Version 1 (see the usage sketch below)
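
A hypothetical usage sketch of the Version 2 config described above; the exact kwarg names (`packing_format`, `version`) and accepted values are assumptions based on this summary, not a verified API:

```python
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Hypothetical: version=2 selects the new tensor-subclass path and
# packing_format picks between "preshuffled" and "plain".
config = Int4WeightOnlyConfig(group_size=128, packing_format="preshuffled", version=2)

model = torch.nn.Sequential(torch.nn.Linear(256, 256, dtype=torch.bfloat16))
quantize_(model, config)  # weights are replaced by int4 tensors in the chosen packing format
```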

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py
python test/quantization/quantize_/workflows/int4/test_int4_preshuffled_tensor.py
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
)

Summary:
Currently we have a long queue, so would like to reduce it

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Differential Revision: D79846881

Pull Request resolved: pytorch#2716
Summary:
att

Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
**Summary:** Similar to pytorch#2628,
but for `FakeQuantizer`. It is cleaner to isolate the logic of
each quantizer in separate classes, e.g. intx vs nvfp4 vs fp8.
Naming change:

```
FakeQuantizer -> IntxFakeQuantizer
```

**BC-breaking notes:** This is technically not BC-breaking yet
since we are just deprecating the old APIs while keeping them
around. It will be when we do remove the old APIs in the future
according to pytorch#2630.

Before:
```
config = IntxFakeQuantizeConfig(torch.int8, "per_channel")
FakeQuantizer(config)
```

After:
```
config = IntxFakeQuantizeConfig(torch.int8, "per_channel")
IntxFakeQuantizer(config) # or
FakeQuantizerBase.from_config(config)
```

**Test Plan:**
```
python test/quantization/test_qat.py
```

[ghstack-poisoned]
…vided by coreml

Differential Revision: D79119940

Pull Request resolved: pytorch#2679
* Deprecate old TORCH_VERSION variables

**Summary:** This commit deprecates the following variables:

```
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_2
TORCH_VERSION_AFTER_2_5
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_2
```

As of this commit, the latest released version of PyTorch is 2.8,
which means we can drop support for 2.5 and before since we only
support 3 of the latest releases.

The next commit will remove usages of all of these variables
from within torchao.
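
As a rough illustration of the cleanup (hypothetical call site, not actual torchao code), the deprecated flags reduce to a version comparison that is now always true, so the guarded fallback can simply be deleted:

```python
import torch
from packaging.version import Version  # assumes packaging is available

def torch_at_least(min_version: str) -> bool:
    # Rough equivalent of the deprecated flags: compare the installed torch version.
    return Version(torch.__version__.split("+")[0]) >= Version(min_version)

TORCH_VERSION_AT_LEAST_2_5 = torch_at_least("2.5")

x = torch.randn(4, 4)
# Before: call sites guard on a flag that is now always True.
if TORCH_VERSION_AT_LEAST_2_5:
    y = x.to(torch.float8_e4m3fn)  # newer-torch path
else:
    y = x.half()                   # fallback, now dead code

# After the cleanup, only the unguarded path remains:
y = x.to(torch.float8_e4m3fn)
```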

**Test Plan:**
```
python test/test_utils.py -k torch_version_deprecation
```

[ghstack-poisoned]

* Update on "Deprecate old TORCH_VERSION variables"


**Summary:** This commit deprecates the following variables:

```
# Always True
TORCH_VERSION_AT_LEAST_2_6
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_2
# TORCH_VERSION_AFTER* was confusing to users
TORCH_VERSION_AFTER_2_5
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_2
```

As of this commit, the latest released version of PyTorch is 2.8, which means the oldest pytorch version we support is now 2.6 since we only support 3 of the latest releases.

The next commit will remove usages of all of these variables from within torchao.

**Test Plan:**
```
python test/test_utils.py -k torch_version_deprecation
```

[ghstack-poisoned]

Differential Revision: D79936256

Pull Request resolved: pytorch#2726
I got a devgpu with 8 AMD MI300X GPUs, ran the torchtitan benchmarks (without any performance debugging), and added the numbers I saw to the README.

The tensorwise number looks lower than expected; we can debug/fix this in a future PR.
Differential Revision: D79119958

Pull Request resolved: pytorch#2702
* When replacing literals with placeholders, lists are always converted to
tuples

Summary:
This is needed because lists are not hashable (since they are mutable),
and as a result we cannot have literals_to_ph in pattern rewrites used
inside reference_representation_rewrite.py.

Test Plan:
CI + next diff relies on this feature

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

* Allow pattern replacement to ignore literals

Summary:
This is necessary because the matched patterns sometimes contain literals
such as tuples of ints. These values shouldn't be used for
pattern matching since they are often constants derived from the
example inputs.

This is not exactly a safe thing to do in general, so it is turned off by default
(see the sketch below).
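
The helper changed here is torchao's own rewrite utility, but torch.fx's subgraph rewriter exposes a similar `ignore_literals` switch; a small sketch of the idea (illustrative only, not the torchao code path):

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.subgraph_rewriter import replace_pattern_with_filters

class M(torch.nn.Module):
    def forward(self, x):
        # The (2, 8) tuple is a literal baked in from example inputs.
        return torch.reshape(x, (2, 8)) + 1

def pattern(x):
    return torch.reshape(x, (4, 4)) + 1  # different tuple-of-ints literal

def replacement(x):
    return torch.reshape(x, (4, 4)) * 2 + 1

gm = symbolic_trace(M())
# With ignore_literals=True, the (2, 8) vs (4, 4) mismatch no longer blocks the match.
# Note the replacement still hard-codes its own literal, which is exactly why this
# behavior is unsafe in general and therefore off by default.
matches = replace_pattern_with_filters(
    gm, pattern, replacement, match_filters=None, ignore_literals=True
)
print(len(matches))
```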

Test Plan:
Subsequent diff adds a pattern that relies on this

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

Differential Revision: D79401683

Pull Request resolved: pytorch#2704
Differential Revision: D79401818

Pull Request resolved: pytorch#2731
Summary:

Small fixes to make the float8 training rowwise benchmarks work properly
on AMD GPUs, just making sure the right float8 flavor is used.

Test Plan:

```bash
python benchmarks/float8/float8_roofline.py ~/local/tmp/20250811_amd_mi300x_rowwise_with_gw_hp.csv --float8_recipe_name rowwise_with_gw_hp --shape_gen_name pow2_extended
```

MI300x results:
https://gist.github.com/vkuzo/586af24b4c9a90f107590ba5e96dd7eb
H100 results:
https://gist.github.com/vkuzo/586af24b4c9a90f107590ba5e96dd7eb

Reviewers:

Subscribers:

Tasks:

Tags:
…or (pytorch#2687)

Summary:
Int4Tensor is the non-preshuffled version of the int4 quantized Tensor; data has shape [N, K/2], and scale/zero_point have shape [K/group_size, N].

Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops)
* defined `tensor_data_names` and `tensor_attribute_names` so we can remove some of the implementations from TorchAOBaseTensor
* Migrated op implementation and tests from pytorch#2387

Note: This is just refactoring Int4Tensor, no BC related changes in this PR

The Int4Tensor path is exposed in version 2 of `Int4WeightOnlyConfig` (the default version is still 1, which uses the old AQT path).
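
A small shape sketch (made-up sizes, plain tensors standing in for the subclass components) of the layout described above:

```python
import torch

N, K, group_size = 256, 512, 128

weight_hp = torch.randn(N, K, dtype=torch.bfloat16)    # original high-precision weight
int4_data = torch.empty(N, K // 2, dtype=torch.uint8)  # two int4 values packed per byte
scale = torch.empty(K // group_size, N, dtype=torch.bfloat16)
zero_point = torch.empty(K // group_size, N, dtype=torch.bfloat16)

print(int4_data.shape)  # torch.Size([256, 256])
print(scale.shape)      # torch.Size([4, 256])
```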

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
Allows subclasses inheriting from TorchAOBaseTensor to have optional tensor attributes, and updates
all common util functions to support an `optional_tensor_names` list, including `__tensor_flatten__`, `__tensor_unflatten__`, and ops like aten._to_copy, contiguous, alias, etc.
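
A minimal sketch of what that looks like on a subclass (hypothetical tensor names; construction and dispatch plumbing omitted):

```python
import torch
from torchao.utils import TorchAOBaseTensor

class MyQuantTensor(TorchAOBaseTensor):
    # Required tensor components go in tensor_data_names; components that may
    # legitimately be None (e.g. zero_point for symmetric schemes) go in
    # optional_tensor_names so __tensor_flatten__ / __tensor_unflatten__ and ops
    # like aten._to_copy, contiguous, and alias can skip them when absent.
    tensor_data_names = ["qdata", "scale"]
    optional_tensor_names = ["zero_point"]
    tensor_attribute_names = ["block_size", "dtype"]
```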

Test Plan:
python test/test_utils.py

Reviewers:

Subscribers:

Tasks:

Tags:
…the Float8Tensor (pytorch#2738)

Summary:
Similar to pytorch#2687, we updated Int4PreshuffledTensor to align
the implementation details, and also used TorchAOBaseTensor to simplify some of the implementations.

Note: This is just refactoring Int4PreshuffledTensor, no BC related changes in this PR

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_preshuffled_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
…Int4WeightConfig` (pytorch#2746)

Summary:
This was a mistake; we need to align the name with the other dynamic quant configs.

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
…orch#2747)

Summary:
We typically should not call contiguous in the op implementations, since
this does not align with the semantics of the op, e.g. transpose.
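
A plain-PyTorch illustration of the point: transpose is supposed to return a strided view, and forcing `.contiguous()` inside the op silently changes the storage and layout the caller expects.

```python
import torch

x = torch.randn(4, 8)

t_view = x.transpose(0, 1)                  # correct semantics: a non-contiguous view
t_forced = x.transpose(0, 1).contiguous()   # same values, but a copy with a new layout

print(t_view.is_contiguous())               # False, as transpose semantics imply
print(t_view.data_ptr() == x.data_ptr())    # True: still a view of the same storage
print(t_forced.data_ptr() == x.data_ptr())  # False: a copy was silently made
```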

Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
We have recently updated our design for structuring tensor subclasses in torchao
to remove unnecessary abstractions, reduce indirection, and arrive at a structure that
aligns better with people's intuitive understanding of different quantization use cases.
Examples using the new design: pytorch#2463, pytorch#2687

Test Plan:
check generated doc
Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
Didn't repro the error before due to some installation cache

Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:

stack-info: PR: pytorch#2751, branch: jerryzh168/stack/27
jcaip and others added 30 commits September 17, 2025 15:01
* [sparse] Add in missing op support for FP8 Sparse

Summary:

For ads, we are missing some op support in their lowering stack, namely
`.to(dtype=torch.float)` and `.clone()`

This PR adds in op support for the `CutlassSemiSparseLayout`.

Test Plan:
```
python test/test_sparse_api -k lowering
```

Reviewers:

Subscribers:

Tasks:

Tags:

* update

* ruff fix

* update tests

* fix test to add in layout kwarg

* skip non h100
Summary:
This was used for a prototype previously and is not used now; we now expose fbgemm
kernels through Int4WeightOnlyConfig (for int4) and Float8DynamicActivationFloat8WeightConfig (for FP8).

We are not considering this BC-breaking since we haven't publicized the API yet.

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
* Support PLAIN_INT32 for AWQ on Intel GPU

* Support PLAIN_INT32 for AWQ on Intel GPU

* Support PLAIN_INT32 for AWQ on Intel GPU
Summary:
Added `_convert_to_packed_tensor_based_on_current_hardware`
to convert a tensor from the unpacked / plain version to a packed version

This is to enable vLLM with packed weights: vLLM will slice the quantized weight,
but slicing is not supported by all torchao tensor subclasses. So we want to
first ship a plain / unpacked checkpoint and then convert it to the packed version using this API.
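
A hypothetical sketch of that flow (the import path, checkpoint handling, and slicing pattern here are assumptions; only the function name comes from this summary):

```python
import torch
# Hypothetical import path; the summary only names the function.
from torchao.prototype.tensor_conversion import (
    _convert_to_packed_tensor_based_on_current_hardware,
)

# 1. Ship a plain / unpacked quantized weight, which vLLM can slice freely
#    (stand-in tensor here; in practice this comes from the checkpoint).
unpacked_weight = torch.zeros(4096, 2048)
shard = unpacked_weight[:, :1024]  # tensor-parallel-style slice works on the plain layout

# 2. Convert the shard to the packed layout best suited to the GPU serving the model.
packed_weight = _convert_to_packed_tensor_based_on_current_hardware(shard)
```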

Test Plan:
pytest test/prototype/test_tensor_conversion.py

Reviewers:

Subscribers:

Tasks:

Tags:
* Support Int4OpaqueTensor for HQQ

Make Int4OpaqueTensor support HQQ.

Signed-off-by: Cui, Lily <lily.cui@intel.com>

* Format codes

Signed-off-by: Cui, Lily <lily.cui@intel.com>

---------

Signed-off-by: Cui, Lily <lily.cui@intel.com>
**Summary:** Add support to pass scales and zero points
learned during QAT range learning to the PTQ base config.
Currently only the following configs support this feature:

```
IntxWeightOnlyConfig
Int8DynamicActivationInt4WeightConfig
Int8DynamicActivationIntxWeightConfig
```

During the convert phase, QAT will detect if range learning
was used during training, and pass the learned scales and
zero points as custom qparams to the quantized tensor
subclass, so PTQ will produce more consistent numerics.
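
A conceptual sketch (plain PyTorch, not the torchao QAT API) of what passing the learned qparams through means: quantize with the scale learned during training rather than re-deriving it from weight statistics at convert time.

```python
import torch

def quantize_int8(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(w / scale), -128, 127)

w = torch.randn(16, 32)

# PTQ would normally re-derive the scale from the weight statistics...
ptq_scale = w.abs().amax(dim=1, keepdim=True) / 127

# ...but range learning produced its own, slightly different scale during training
# (stand-in value here).
learned_scale = ptq_scale * 1.05

# Passing the learned scale through to convert keeps QAT and PTQ numerics consistent;
# re-deriving it would quantize some weights differently.
q_with_learned = quantize_int8(w, learned_scale)
q_rederived = quantize_int8(w, ptq_scale)
print((q_with_learned - q_rederived).abs().max())
```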

Fixes part of pytorch#2271.

**Test Plan:**
```
python test/quantization/test_qat.py -k
test_range_learning_convert_pass_qparams
```
…rch#2991)

* [CPU] Fix fp8 sdpa compiling issue with latest pytorch

* disable fp8 fusion
pytorch#2961)

* register fp8 quant/dequant only on CPU

* add non-decomposed quantize_affine_float8 and dequantize_affine_float8
* Summary:
In the `from_pretrained()` method in `huggingface/transformers`, `torch_dtype`
is deprecated and `dtype` replaces it. To prevent deprecation warnings,
this PR replaces `torch_dtype` with `dtype`.
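
For example (model name chosen arbitrarily; the kwarg change is the one documented in huggingface/transformers#39782):

```python
import torch
from transformers import AutoModelForCausalLM

# Deprecated spelling, emits a warning on recent transformers:
# model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16)

# Spelling used after this PR:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", dtype=torch.bfloat16)
```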

Test plan: CI

Reference: huggingface/transformers#39782

* fix pre-commit

* revert to source: model uploader
* Avoid normalization layers in HF's quantization_config

* Add TestTorchAoConfigIntegration

* Use PreTrainedModel.from_pretrained
Summary:
* removed the requirement of setting the VLLM_DIR environment variable, since the benchmark is now a CLI command
* reordered the evals and summarization of results to better match the order of the model card

Test Plan:
local manual runs achieving the desired results

Reviewers:

Subscribers:

Tasks:

Tags:
Differential Revision: D82492826

Pull Request resolved: pytorch#3031
* Unify get_block_size

* Remove granularity defines in the pt2e path

* Fix format