
[Inductor] Support qlinear #3

Open
LevelDownRefine wants to merge 427 commits into wengshiy/scaled_mm from wengshiy/qlinear

Conversation

@LevelDownRefine
Owner

No description provided.

@LevelDownRefine LevelDownRefine changed the title Wengshiy/qlinear [Inductor] Support qlinear Jul 7, 2025
@LevelDownRefine
Owner Author

blocked by pytorch/pytorch#157684

danielvegamyhre and others added 28 commits August 7, 2025 11:34
…to (pytorch#2518)

tuples

Summary:
This is needed because lists are not hashable (since they are mutable),
and as a result we cannot have literals_to_ph in pattern rewrites used
inside reference_representation_rewrite.py.
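
As a minimal aside (not part of the original commit), this is the Python behavior the conversion works around: lists cannot serve as dict keys or set members, while tuples of the same values can.

```python
# Lists are mutable and therefore unhashable; tuples of the same values hash fine.
literal_list = [1, 2, 3]
literal_tuple = (1, 2, 3)

try:
    {literal_list: "placeholder"}
except TypeError as e:
    print(e)  # unhashable type: 'list'

print({literal_tuple: "placeholder"})  # {(1, 2, 3): 'placeholder'}
```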

Test Plan:
CI + next diff relies on this feature

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
* Update KleidiAI

* up

* up

* up
…erences (pytorch#2651)

Summary:
This PR checks that different kernel preferences for Float8Tensor (AUTO, TORCH, and FBGEMM)
produce similar numerics.

The triton implementation and the torchao implementation are actually a bit different right now; we need to decide whether we should fix that.

1. Difference in the quantize op
The main difference seems to be that the triton implementation is using:
```
a_scale = MAX_FP8 / max_abs
# then
a_scale = 1.0 / a_scale
a_fp8 = a * a_scale
```

while torch is doing:
```
a_scale = max_abs / MAX_FP8
a_fp8 = a / a_scale
```

The hp_value_lb and hp_value_ub settings are also slightly different.

triton choose scale and quantize code: https://github.com/pytorch/FBGEMM/blob/a4286c01ef01dad435b2ec8798605127d3032cd8/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py#L2382-L2392

torchao choose scale and quantize code:
https://github.com/pytorch/ao/blob/3c466f844684af0fb80014094f2ca8663881eb33/torchao/quantization/quant_primitives.py#L2183
https://github.com/pytorch/ao/blob/3c466f844684af0fb80014094f2ca8663881eb33/torchao/quantization/quant_primitives.py#L2283
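
A rough side-by-side sketch of the two scale computations quoted above (my reading of the pseudocode, not the actual FBGEMM or torchao kernels); the only intended difference is whether the reciprocal of the scale is materialized, which can perturb the low-order bits:

```python
import torch

MAX_FP8 = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_triton_style(a: torch.Tensor):
    # scale computed as MAX_FP8 / max_abs; the stored (dequant) scale is its reciprocal
    max_abs = a.abs().amax().float()
    a_scale = MAX_FP8 / max_abs
    a_fp8 = (a * a_scale).clamp(-MAX_FP8, MAX_FP8).to(torch.float8_e4m3fn)
    return a_fp8, 1.0 / a_scale

def quantize_torch_style(a: torch.Tensor):
    # scale computed directly as max_abs / MAX_FP8
    max_abs = a.abs().amax().float()
    a_scale = max_abs / MAX_FP8
    a_fp8 = (a / a_scale).clamp(-MAX_FP8, MAX_FP8).to(torch.float8_e4m3fn)
    return a_fp8, a_scale

x = torch.randn(64, 64)
fp8_a, scale_a = quantize_triton_style(x)
fp8_b, scale_b = quantize_torch_style(x)
# Mathematically identical, but 1 / (MAX_FP8 / max_abs) != max_abs / MAX_FP8 exactly
# in floating point, so results can differ in the last bits.
print((fp8_a.float() - fp8_b.float()).abs().max(), (scale_a - scale_b).abs())
```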

2. A (potential) difference in the matrix multiplication ops

TORCH and AUTO/FBGEMM are using different quantized mm ops.

Added a reverse option to bring the SQNR closer:
```
granularity: PerTensor()  sizes: ((128,), 256, 128)  kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor()  sizes: ((128,), 256, 128)  kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
.granularity: PerTensor()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
.granularity: PerRow()  sizes: ((128,), 256, 128)  kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow()  sizes: ((128,), 256, 128)  kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
.granularity: PerRow()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.AUTO tensor(64.5000, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow()  sizes: ((32, 128), 64, 256)  kp: KernelPreference.FBGEMM tensor(68., device='cuda:0', dtype=torch.bfloat16)
```
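For reference, a minimal sketch of how an SQNR value like the ones above can be computed (the actual test presumably uses torchao's own error helper):

```python
import torch

def sqnr(reference: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in dB; inf when the two tensors match exactly.
    noise = (reference - candidate).pow(2).mean()
    signal = reference.pow(2).mean()
    return 10 * torch.log10(signal / noise)

ref = torch.randn(32, 128)
print(sqnr(ref, ref))          # inf, like the PerTensor rows above
print(sqnr(ref, ref + 1e-3))   # a large but finite value
```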
Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_kernel_preference_numerical_equivalence

Reviewers:

Subscribers:

Tasks:

Tags:
…amicActivationInt4WeightConfig (pytorch#2474)

Summary:
We will:
* deprecate FbgemmConfig since it's a single kernel (later)
* categorize things by derived dtype + packing format, e.g. int4 preshuffled, float8 plain
* add PackingFormat (preshuffled, plain) in Version 2 of Int4WeightOnlyConfig; the older AQT tensor will remain in Version 1 (see the usage sketch below)
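
A hypothetical usage sketch of the Version 2 config described above; the exact kwarg names (`packing_format`, `version`) and accepted values are assumptions based on this summary, not a verified API:

```python
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Hypothetical: version=2 selects the new tensor-subclass path and
# packing_format picks between "preshuffled" and "plain".
config = Int4WeightOnlyConfig(group_size=128, packing_format="preshuffled", version=2)

model = torch.nn.Sequential(torch.nn.Linear(256, 256, dtype=torch.bfloat16))
quantize_(model, config)  # weights are replaced by int4 tensors in the chosen packing format
```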

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py
python test/quantization/quantize_/workflows/int4/test_int4_preshuffled_tensor.py
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
)

Summary:
Currently we have a long queue, so would like to reduce it

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Differential Revision: D79846881

Pull Request resolved: pytorch#2716
Summary:
att

Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
**Summary:** Similar to pytorch#2628,
but for `FakeQuantizer`. It is cleaner to isolate the logic of
each quantizer in separate classes, e.g. intx vs nvfp4 vs fp8.
Naming change:

```
FakeQuantizer -> IntxFakeQuantizer
```

**BC-breaking notes:** This is technically not BC-breaking yet
since we are just deprecating the old APIs while keeping them
around. It will be when we do remove the old APIs in the future
according to pytorch#2630.

Before:
```
config = IntxFakeQuantizeConfig(torch.int8, "per_channel")
FakeQuantizer(config)
```

After:
```
config = IntxFakeQuantizeConfig(torch.int8, "per_channel")
IntxFakeQuantizer(config) # or
FakeQuantizerBase.from_config(config)
```

**Test Plan:**
```
python test/quantization/test_qat.py
```

[ghstack-poisoned]
…vided by coreml

Differential Revision: D79119940

Pull Request resolved: pytorch#2679
* Deprecate old TORCH_VERSION variables

**Summary:** This commit deprecates the following variables:

```
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_2
TORCH_VERSION_AFTER_2_5
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_2
```

As of this commit, the latest released version of PyTorch is 2.8,
which means we can drop support for 2.5 and before since we only
support 3 of the latest releases.

The next commit will remove usages of all of these variables
from within torchao.
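
As a rough illustration of the cleanup (hypothetical call site, not actual torchao code), the deprecated flags reduce to a version comparison that is now always true, so the guarded fallback can simply be deleted:

```python
import torch
from packaging.version import Version  # assumes packaging is available

def torch_at_least(min_version: str) -> bool:
    # Rough equivalent of the deprecated flags: compare the installed torch version.
    return Version(torch.__version__.split("+")[0]) >= Version(min_version)

TORCH_VERSION_AT_LEAST_2_5 = torch_at_least("2.5")

x = torch.randn(4, 4)
# Before: call sites guard on a flag that is now always True.
if TORCH_VERSION_AT_LEAST_2_5:
    y = x.to(torch.float8_e4m3fn)  # newer-torch path
else:
    y = x.half()                   # fallback, now dead code

# After the cleanup, only the unguarded path remains:
y = x.to(torch.float8_e4m3fn)
```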

**Test Plan:**
```
python test/test_utils.py -k torch_version_deprecation
```

[ghstack-poisoned]

* Update on "Deprecate old TORCH_VERSION variables"


**Summary:** This commit deprecates the following variables:

```
# Always True
TORCH_VERSION_AT_LEAST_2_6
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_2
# TORCH_VERSION_AFTER* was confusing to users
TORCH_VERSION_AFTER_2_5
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_2
```

As of this commit, the latest released version of PyTorch is 2.8, which means the oldest pytorch version we support is now 2.6 since we only support 3 of the latest releases.

The next commit will remove usages of all of these variables from within torchao.

**Test Plan:**
```
python test/test_utils.py -k torch_version_deprecation
```

[ghstack-poisoned]

Differential Revision: D79936256

Pull Request resolved: pytorch#2726
I got a devgpu with 8 AMD MI300X GPUs, ran the torchtitan benchmarks (without any performance debugging), and added the numbers I saw to the README.

The tensorwise number looks lower than expected; we can debug/fix this in a future PR.
Differential Revision: D79119958

Pull Request resolved: pytorch#2702
* When replacing literals with placeholders, lists are always converted to
tuples

Summary:
This is needed because lists are not hashable (since they are mutable),
and as a result we cannot have literals_to_ph in pattern rewrites used
inside reference_representation_rewrite.py.

Test Plan:
CI + next diff relies on this feature

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

* Allow pattern replacement to ignore literals

Summary:
This is necessary because the matched patterns sometimes contain literals
such as tuples of ints. These values shouldn't be used for
pattern matching since they are often constants derived from the
example inputs.

This is not exactly a safe thing to do in general, so it is turned off by default
(see the sketch below).
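
The helper changed here is torchao's own rewrite utility, but torch.fx's subgraph rewriter exposes a similar `ignore_literals` switch; a small sketch of the idea (illustrative only, not the torchao code path):

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.subgraph_rewriter import replace_pattern_with_filters

class M(torch.nn.Module):
    def forward(self, x):
        # The (2, 8) tuple is a literal baked in from example inputs.
        return torch.reshape(x, (2, 8)) + 1

def pattern(x):
    return torch.reshape(x, (4, 4)) + 1  # different tuple-of-ints literal

def replacement(x):
    return torch.reshape(x, (4, 4)) * 2 + 1

gm = symbolic_trace(M())
# With ignore_literals=True, the (2, 8) vs (4, 4) mismatch no longer blocks the match.
# Note the replacement still hard-codes its own literal, which is exactly why this
# behavior is unsafe in general and therefore off by default.
matches = replace_pattern_with_filters(
    gm, pattern, replacement, match_filters=None, ignore_literals=True
)
print(len(matches))
```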

Test Plan:
Subsequent diff adds a pattern that relies on this

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]

Differential Revision: D79401683

Pull Request resolved: pytorch#2704
Differential Revision: D79401818

Pull Request resolved: pytorch#2731
Summary:

Small fixes to make the float8 training rowwise benchmarks work properly
on AMD GPUs, just making sure the right float8 flavor is used.

Test Plan:

```bash
python benchmarks/float8/float8_roofline.py ~/local/tmp/20250811_amd_mi300x_rowwise_with_gw_hp.csv --float8_recipe_name rowwise_with_gw_hp --shape_gen_name pow2_extended
```

MI300x results:
https://gist.github.com/vkuzo/586af24b4c9a90f107590ba5e96dd7eb
H100 results:
https://gist.github.com/vkuzo/586af24b4c9a90f107590ba5e96dd7eb

Reviewers:

Subscribers:

Tasks:

Tags:
…or (pytorch#2687)

Summary:
Int4Tensor is the non-preshuffled version of the int4 quantized Tensor; data has shape [N, K/2], and scale/zero_point have shape [K/group_size, N].

Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops)
* defined `tensor_data_names` and `tensor_attribute_names` so we can remove some of the implementations from TorchAOBaseTensor
* Migrated op implementation and tests from pytorch#2387

Note: This is just refactoring Int4Tensor, no BC related changes in this PR

The Int4Tensor path is exposed in version 2 of `Int4WeightOnlyConfig` (the default version is still 1, which uses the old AQT path).
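
A small shape sketch (made-up sizes, plain tensors standing in for the subclass components) of the layout described above:

```python
import torch

N, K, group_size = 256, 512, 128

weight_hp = torch.randn(N, K, dtype=torch.bfloat16)    # original high-precision weight
int4_data = torch.empty(N, K // 2, dtype=torch.uint8)  # two int4 values packed per byte
scale = torch.empty(K // group_size, N, dtype=torch.bfloat16)
zero_point = torch.empty(K // group_size, N, dtype=torch.bfloat16)

print(int4_data.shape)  # torch.Size([256, 256])
print(scale.shape)      # torch.Size([4, 256])
```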

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
Allows subclasses inheriting from TorchAOBaseTensor to have optional tensor attributes, and updates
all common util functions to support an `optional_tensor_names` list, including `__tensor_flatten__`, `__tensor_unflatten__`, and ops like aten._to_copy, contiguous, alias, etc.
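
A minimal sketch of what that looks like on a subclass (hypothetical tensor names; construction and dispatch plumbing omitted):

```python
import torch
from torchao.utils import TorchAOBaseTensor

class MyQuantTensor(TorchAOBaseTensor):
    # Required tensor components go in tensor_data_names; components that may
    # legitimately be None (e.g. zero_point for symmetric schemes) go in
    # optional_tensor_names so __tensor_flatten__ / __tensor_unflatten__ and ops
    # like aten._to_copy, contiguous, and alias can skip them when absent.
    tensor_data_names = ["qdata", "scale"]
    optional_tensor_names = ["zero_point"]
    tensor_attribute_names = ["block_size", "dtype"]
```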

Test Plan:
python test/test_utils.py

Reviewers:

Subscribers:

Tasks:

Tags:
…the Float8Tensor (pytorch#2738)

Summary:
Similar to pytorch#2687, we updated Int4PreshuffledTensor to align
the implementation details, and also used TorchAOBaseTensor to simplify some of the implementations.

Note: This is just refactoring Int4PreshuffledTensor, no BC related changes in this PR

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_preshuffled_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
…Int4WeightConfig` (pytorch#2746)

Summary:
This was a mistake; we need to align the name with the other dynamic quant configs.

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
…orch#2747)

Summary:
We typically should not call contiguous in the op implementations, since
this does not align with the semantics of the op, e.g. transpose.
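
A plain-PyTorch illustration of the point: transpose is supposed to return a strided view, and forcing `.contiguous()` inside the op silently changes the storage and layout the caller expects.

```python
import torch

x = torch.randn(4, 8)

t_view = x.transpose(0, 1)                  # correct semantics: a non-contiguous view
t_forced = x.transpose(0, 1).contiguous()   # same values, but a copy with a new layout

print(t_view.is_contiguous())               # False, as transpose semantics imply
print(t_view.data_ptr() == x.data_ptr())    # True: still a view of the same storage
print(t_forced.data_ptr() == x.data_ptr())  # False: a copy was silently made
```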

Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
We have recently updated our design for structuring tensor subclasses in torchao
to remove unnecessary abstractions, reduce indirection, and arrive at a structure that
aligns better with people's intuitive understanding of different quantization use cases.
Examples using the new design: pytorch#2463, pytorch#2687

Test Plan:
check generated doc
Reviewers:

Subscribers:

Tasks:

Tags:
Summary:
Didn't repro the error before due to some installation cache

Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py

Reviewers:

Subscribers:

Tasks:

Tags:

stack-info: PR: pytorch#2751, branch: jerryzh168/stack/27
jcaip and others added 30 commits September 17, 2025 15:01
* [sparse] Add in missing op support for FP8 Sparse

Summary:

For ads, we are missing some op support in their lowering stack, namely
`.to(dtype=torch.float)` and `.clone()`

This PR adds in op support for the `CutlassSemiSparseLayout`.

Test Plan:
```
python test/test_sparse_api -k lowering
```

Reviewers:

Subscribers:

Tasks:

Tags:

* update

* ruff fix

* update tests

* fix test to add in layout kwarg

* skip non h100
Summary:
This was used for a prototype previously and is not used now; we now expose fbgemm
kernels through Int4WeightOnlyConfig (for int4) and Float8DynamicActivationFloat8WeightConfig (for FP8).

We are not considering this BC-breaking since we haven't publicized the API yet.

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
* Support PLAIN_INT32 for AWQ on Intel GPU

* Support PLAIN_INT32 for AWQ on Intel GPU

* Support PLAIN_INT32 for AWQ on Intel GPU
Summary:
Added `_convert_to_packed_tensor_based_on_current_hardware`
to convert a tensor from the unpacked / plain version to a packed version

This is to enable vLLM with packed weights: vLLM will slice the quantized weight,
but slicing is not supported by all torchao tensor subclasses. So we want to
first ship a plain / unpacked checkpoint and then convert it to the packed version using this API.
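
A hypothetical sketch of that flow (the import path, checkpoint handling, and slicing pattern here are assumptions; only the function name comes from this summary):

```python
import torch
# Hypothetical import path; the summary only names the function.
from torchao.prototype.tensor_conversion import (
    _convert_to_packed_tensor_based_on_current_hardware,
)

# 1. Ship a plain / unpacked quantized weight, which vLLM can slice freely
#    (stand-in tensor here; in practice this comes from the checkpoint).
unpacked_weight = torch.zeros(4096, 2048)
shard = unpacked_weight[:, :1024]  # tensor-parallel-style slice works on the plain layout

# 2. Convert the shard to the packed layout best suited to the GPU serving the model.
packed_weight = _convert_to_packed_tensor_based_on_current_hardware(shard)
```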

Test Plan:
pytest test/prototype/test_tensor_conversion.py

Reviewers:

Subscribers:

Tasks:

Tags:
* Support Int4OpaqueTensor for HQQ

Make Int4OpaqueTensor support HQQ.

Signed-off-by: Cui, Lily <lily.cui@intel.com>

* Format codes

Signed-off-by: Cui, Lily <lily.cui@intel.com>

---------

Signed-off-by: Cui, Lily <lily.cui@intel.com>
**Summary:** Add support to pass scales and zero points
learned during QAT range learning to the PTQ base config.
Currently only the following configs support this feature:

```
IntxWeightOnlyConfig
Int8DynamicActivationInt4WeightConfig
Int8DynamicActivationIntxWeightConfig
```

During the convert phase, QAT will detect if range learning
was used during training, and pass the learned scales and
zero points as custom qparams to the quantized tensor
subclass, so PTQ will produce more consistent numerics.
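
A conceptual sketch (plain PyTorch, not the torchao QAT API) of what passing the learned qparams through means: quantize with the scale learned during training rather than re-deriving it from weight statistics at convert time.

```python
import torch

def quantize_int8(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(w / scale), -128, 127)

w = torch.randn(16, 32)

# PTQ would normally re-derive the scale from the weight statistics...
ptq_scale = w.abs().amax(dim=1, keepdim=True) / 127

# ...but range learning produced its own, slightly different scale during training
# (stand-in value here).
learned_scale = ptq_scale * 1.05

# Passing the learned scale through to convert keeps QAT and PTQ numerics consistent;
# re-deriving it would quantize some weights differently.
q_with_learned = quantize_int8(w, learned_scale)
q_rederived = quantize_int8(w, ptq_scale)
print((q_with_learned - q_rederived).abs().max())
```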

Fixes part of pytorch#2271.

**Test Plan:**
```
python test/quantization/test_qat.py -k
test_range_learning_convert_pass_qparams
```
…rch#2991)

* [CPU] Fix fp8 sdpa compiling issue with latest pytorch

* disable fp8 fusion
pytorch#2961)

* register fp8 quant/dequant only on CPU

* add non-decomposed quantize_affine_float8 and dequantize_affine_float8
* Summary:
In the `from_pretrained()` method in `huggingface/transformers`, `torch_dtype`
is deprecated and `dtype` replaces it. To prevent deprecation warnings,
this PR replaces `torch_dtype` with `dtype`.
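
For example (model name chosen arbitrarily; the kwarg change is the one documented in huggingface/transformers#39782):

```python
import torch
from transformers import AutoModelForCausalLM

# Deprecated spelling, emits a warning on recent transformers:
# model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.bfloat16)

# Spelling used after this PR:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", dtype=torch.bfloat16)
```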

Test plan: CI

Reference: huggingface/transformers#39782

* fix pre-commit

* revert to source: model uploader
* Avoid normalization layers in HF's quantization_config

* Add TestTorchAoConfigIntegration

* Use PreTrainedModel.from_pretrained
Summary:
* removed the requirement of setting the VLLM_DIR environment variable, since the benchmark is now a CLI command
* reordered the evals and summarization of results to better match the order of the model card

Test Plan:
local manual runs achieving the desired results

Reviewers:

Subscribers:

Tasks:

Tags:
Differential Revision: D82492826

Pull Request resolved: pytorch#3031
* Unify get_block_size

* Remove granularity defines in the pt2e path

* Fix format