[Inductor] Support qlinear #3
Open
LevelDownRefine wants to merge 427 commits into
blocked by pytorch/pytorch#157684
…to (pytorch#2518) tuples

Summary: This is needed because lists are mutable and therefore not hashable, so we cannot have literals_to_ph in the pattern rewrites used inside reference_representation_rewrite.py

Test Plan: CI + the next diff relies on this feature

[ghstack-poisoned]
* Update KleidiAI * up * up * up
…erences (pytorch#2651) Summary: This PR checks that different kernel preferences for Float8Tensor (AUTO, TORCH and FBGEMM) are similar in numerics. The triton implementation and the torchao implementation are a bit different right now; we need to decide whether to fix that.

1. Difference in the quantize op. The main difference is that the triton implementation uses:

```
a_scale = MAX_FP8 / max_abs
a_scale = 1.0 / a_scale
a_fp8 = a * a_scale
```

while torch does:

```
a_scale = max_abs / MAX_FP8
a_fp8 = a / a_scale
```

The hp_value_lb and hp_value_ub settings are also slightly different.

triton choose-scale-and-quantize code: https://github.com/pytorch/FBGEMM/blob/a4286c01ef01dad435b2ec8798605127d3032cd8/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py#L2382-L2392

torchao choose-scale-and-quantize code:
https://github.com/pytorch/ao/blob/3c466f844684af0fb80014094f2ca8663881eb33/torchao/quantization/quant_primitives.py#L2183
https://github.com/pytorch/ao/blob/3c466f844684af0fb80014094f2ca8663881eb33/torchao/quantization/quant_primitives.py#L2283

2. (Potentially) a difference in the matrix multiplication ops: TORCH and AUTO/FBGEMM use different quantized mm ops. Added a reverse option to bring sqnr closer:

```
granularity: PerTensor() sizes: ((128,), 256, 128) kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor() sizes: ((128,), 256, 128) kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor() sizes: ((32, 128), 64, 256) kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerTensor() sizes: ((32, 128), 64, 256) kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow() sizes: ((128,), 256, 128) kp: KernelPreference.AUTO tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow() sizes: ((128,), 256, 128) kp: KernelPreference.FBGEMM tensor(inf, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow() sizes: ((32, 128), 64, 256) kp: KernelPreference.AUTO tensor(64.5000, device='cuda:0', dtype=torch.bfloat16)
granularity: PerRow() sizes: ((32, 128), 64, 256) kp: KernelPreference.FBGEMM tensor(68., device='cuda:0', dtype=torch.bfloat16)
```

Test Plan: python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_kernel_preference_numerical_equivalence
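The two scale conventions described in point 1 are algebraically equivalent but round differently. A minimal plain-Python sketch of the difference, with no torch dependency (the helper names and the standalone constant are illustrative, not the actual fbgemm/torchao code):

```python
MAX_FP8 = 448.0  # max representable value of float8_e4m3fn

def scale_triton_style(max_abs: float) -> float:
    # fbgemm/triton convention: compute the reciprocal scale first
    # (used as a multiplier during quantization), then invert it to
    # get the stored dequantization scale
    recip_scale = MAX_FP8 / max_abs
    return 1.0 / recip_scale

def scale_torch_style(max_abs: float) -> float:
    # torchao convention: compute the scale directly; quantization divides by it
    return max_abs / MAX_FP8

# the two scales agree up to floating-point rounding, but the extra
# reciprocal in the triton path can perturb the last mantissa bit
for max_abs in (0.1, 3.7, 1234.5):
    a = scale_triton_style(max_abs)
    b = scale_torch_style(max_abs)
    assert abs(a - b) <= 1e-12 * b
```

Either convention quantizes to the same range; the numerics only diverge by rounding error, which is one reason the sqnr values above differ slightly between kernel preferences.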
…amicActivationInt4WeightConfig (pytorch#2474)

Summary:
* we will deprecate FbgemmConfig since it's a single kernel (later)
* we'd like to categorize things by derived dtype + packed format, e.g. int4 preshuffled, float8 plain
* added PackingFormat (preshuffled, plain) in Version 2 of Int4WeightOnlyConfig; the older AQT tensor will remain in Version 1

Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py
python test/quantization/quantize_/workflows/int4/test_int4_preshuffled_tensor.py
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py
Differential Revision: D79846881 Pull Request resolved: pytorch#2716
Summary: att Test Plan: python test/quantization/quantize_/workflows/float8/test_float8_tensor.py Reviewers: Subscribers: Tasks: Tags:
**Summary:** Similar to pytorch#2628, but for `FakeQuantizer`. It is cleaner to isolate the logic of each quantizer in separate classes, e.g. intx vs nvfp4 vs fp8. Naming change:

```
FakeQuantizer -> IntxFakeQuantizer
```

**BC-breaking notes:** This is technically not BC-breaking yet since we are just deprecating the old APIs while keeping them around. It will be when we remove the old APIs in the future according to pytorch#2630.

Before:
```
config = IntxFakeQuantizeConfig(torch.int8, "per_channel")
FakeQuantizer(config)
```

After:
```
config = IntxFakeQuantizeConfig(torch.int8, "per_channel")
IntxFakeQuantizer(config)  # or FakeQuantizerBase.from_config(config)
```

**Test Plan:**
```
python test/quantization/test_qat.py
```

[ghstack-poisoned]
…vided by coreml Differential Revision: D79119940 Pull Request resolved: pytorch#2679
* Deprecate old TORCH_VERSION variables

**Summary:** This commit deprecates the following variables:

```
# Always True
TORCH_VERSION_AT_LEAST_2_6
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_2

# TORCH_VERSION_AFTER_* was confusing to users
TORCH_VERSION_AFTER_2_5
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_2
```

As of this commit, the latest released version of PyTorch is 2.8, which means the oldest PyTorch version we support is now 2.6, since we only support the 3 latest releases. The next commit will remove usages of all of these variables from within torchao.

**Test Plan:**
```
python test/test_utils.py -k torch_version_deprecation
```

[ghstack-poisoned]
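With the boolean flags gone, a call site can compare the installed version directly. A minimal sketch of such a check (`parse_version` and `torch_version_at_least` are hypothetical helpers written here for illustration, not torchao APIs):

```python
def parse_version(v: str) -> tuple[int, ...]:
    # "2.8.0+cu121" -> (2, 8, 0): drop any local build suffix,
    # then keep only the digits of each dotted component
    base = v.split("+")[0]
    parts = []
    for piece in base.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def torch_version_at_least(installed: str, required: str) -> bool:
    # tuple comparison gives the usual lexicographic version ordering
    return parse_version(installed) >= parse_version(required)

assert torch_version_at_least("2.8.0", "2.6")
assert not torch_version_at_least("2.5.1", "2.6")
```

In practice one would feed `torch.__version__` as the `installed` argument; the string parsing here is deliberately simplistic and ignores pre-release ordering.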
Differential Revision: D79936256 Pull Request resolved: pytorch#2726
I got a devgpu with 8 AMD MI300X GPUs, ran the torchtitan benchmarks (without any performance debugging), and added the numbers I saw to the README. The tensorwise number looks lower than expected; we can debug/fix this in a future PR.
Differential Revision: D79119958 Pull Request resolved: pytorch#2702
* When replacing literals with placeholders, lists are always converted to tuples

Summary: This is needed because lists are mutable and therefore not hashable, so we cannot have literals_to_ph in the pattern rewrites used inside reference_representation_rewrite.py

Test Plan: CI + the next diff relies on this feature

* Allow pattern replacement to ignore literals

Summary: This is necessary because the patterns found sometimes contain literals, including tuple-of-ints literals. These values shouldn't be used for pattern matching, since they are often consts derived from the example inputs. This is not exactly a safe thing to do in general, so by default it is turned off.

Test Plan: A subsequent diff adds a pattern that relies on this

[ghstack-poisoned]
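Why the list-to-tuple conversion matters: dict keys and set members must be hashable, and Python lists are not. A minimal sketch of the conversion (`freeze_literal` is a hypothetical helper for illustration, not the actual torchao code):

```python
def freeze_literal(lit):
    # recursively convert lists to tuples so a literal can serve as a
    # dict key or set member during pattern rewriting
    if isinstance(lit, list):
        return tuple(freeze_literal(x) for x in lit)
    return lit

nested = [1, [2, 3], "x"]
frozen = freeze_literal(nested)
assert frozen == (1, (2, 3), "x")
hash(frozen)  # fine: a tuple of hashables is hashable

try:
    hash(nested)
except TypeError:
    pass  # lists are mutable, hence unhashable
```

Tuples of hashable elements hash by value, so two structurally equal literals map to the same key, which is exactly what the placeholder bookkeeping needs.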
Differential Revision: D79401683 Pull Request resolved: pytorch#2704
Differential Revision: D79401818 Pull Request resolved: pytorch#2731
Summary: Small fixes to make the float8 training rowwise benchmarks work properly on AMD GPUs, just making sure the right float8 flavor is used. Test Plan: ```bash python benchmarks/float8/float8_roofline.py ~/local/tmp/20250811_amd_mi300x_rowwise_with_gw_hp.csv --float8_recipe_name rowwise_with_gw_hp --shape_gen_name pow2_extended ``` MI300x results: https://gist.github.com/vkuzo/586af24b4c9a90f107590ba5e96dd7eb H100 results: https://gist.github.com/vkuzo/586af24b4c9a90f107590ba5e96dd7eb Reviewers: Subscribers: Tasks: Tags:
…or (pytorch#2687)

Summary: Int4Tensor is the non-preshuffled version of the int4 quantized tensor; data is [N, K/2], scale/zero_point have shape [K/group_size, N]. Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops):
* defined `tensor_data_names` and `tensor_attribute_names` so we can remove some of the implementations from TorchAOBaseTensor
* migrated op implementations and tests from pytorch#2387

Note: this is just refactoring Int4Tensor; no BC-related changes in this PR. The Int4Tensor path is exposed in version 2 of `Int4WeightOnlyConfig` (the default version is still 1, which uses the old AQT path).

Test Plan: python test/quantization/quantize_/workflows/int4/test_int4_tensor.py
Summary: Allows subclasses inheriting from TorchAOBaseTensor to have optional tensor attributes; updated all common util functions to support an `optional_tensor_names` list, including `__tensor_flatten__`, `__tensor_unflatten__`, and ops like aten._to_copy, contiguous, alias, etc.

Test Plan: python test/test_utils.py
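The flatten side of this can be sketched in plain Python, with no torch dependency (class and attribute names below are illustrative, not the actual TorchAOBaseTensor code): required tensor attributes are always included, while optional ones are included only when present.

```python
class SketchTensor:
    # required tensor attributes: always part of the flattened spec
    tensor_data_names = ["qdata", "scale"]
    # optional tensor attributes: flattened only when not None
    optional_tensor_names = ["zero_point"]

    def __init__(self, qdata, scale, zero_point=None):
        self.qdata, self.scale, self.zero_point = qdata, scale, zero_point

    def tensor_flatten(self):
        # analogous to __tensor_flatten__: return the names of the
        # inner tensors that actually exist on this instance
        names = list(self.tensor_data_names)
        names += [n for n in self.optional_tensor_names
                  if getattr(self, n) is not None]
        return names

assert SketchTensor("q", "s").tensor_flatten() == ["qdata", "scale"]
assert SketchTensor("q", "s", "zp").tensor_flatten() == [
    "qdata", "scale", "zero_point"]
```

The unflatten direction then just looks up each name in the flattened dict and passes `None` for any optional attribute that was skipped.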
…the Float8Tensor (pytorch#2738) Summary: similar to pytorch#2687, we updated Int4PreshuffledTensor to align the implementation details, also used TorchAOBaseTensor to simplify some of the implementations Note: This is just refactoring Int4PreshuffledTensor, no BC related changes in this PR Test Plan: python test/quantization/quantize_/workflows/int4/test_int4_preshuffled_tensor.py Reviewers: Subscribers: Tasks: Tags:
…Int4WeightConfig` (pytorch#2746) Summary: This was a mistake, we need to align the name with other dynamic quant configs Test Plan: CI Reviewers: Subscribers: Tasks: Tags:
…orch#2747)

Summary: We typically should not be calling contiguous in the op implementations, since it does not align with the semantics of ops such as transpose.

Test Plan: python test/quantization/quantize_/workflows/float8/test_float8_tensor.py
Summary: We have recently updated our design for structuring tensor subclasses in torchao to remove unnecessary abstractions, reduce indirection, and adopt a structure that aligns better with people's intuitive understanding of different quantization use cases. Examples using the new design: pytorch#2463, pytorch#2687.

Test Plan: check the generated doc
Summary: Didn't repro the error before due to some installation cache Test Plan: python test/quantization/quantize_/workflows/float8/test_float8_tensor.py Reviewers: Subscribers: Tasks: Tags: stack-info: PR: pytorch#2751, branch: jerryzh168/stack/27
* [sparse] Add in missing op support for FP8 Sparse

Summary: For ads, we are missing some op support in their lowering stack, namely `.to(dtype=torch.float)` and `.clone()`. This PR adds op support for the `CutlassSemiSparseLayout`.

Test Plan:
```
python test/test_sparse_api -k lowering
```

* update
* ruff fix
* update tests
* fix test to add in layout kwarg
* skip non h100
Summary: This was previously used for a prototype and is not used now; we now expose fbgemm kernels through Int4WeightOnlyConfig (for int4) and Float8DynamicActivationFloat8WeightConfig (for fp8). Not considering this BC-breaking since we haven't publicized the API yet.

Test Plan: CI
* Support PLAIN_INT32 for AWQ on Intel GPU
Summary: Added `_convert_to_packed_tensor_based_on_current_hardware` to convert a tensor from the unpacked / plain version to a packed version. This is to enable vllm for packed weights: vllm will slice the quantized weight, but slicing is not always supported for all torchao tensor subclasses. So we want to first ship a plain / unpacked checkpoint and then convert it to the packed version using this API.

Test Plan: pytest test/prototype/test_tensor_conversion.py
* Support Int4OpaqueTensor for HQQ Make Int4OpaqueTensor support HQQ. Signed-off-by: Cui, Lily <lily.cui@intel.com> * Format codes Signed-off-by: Cui, Lily <lily.cui@intel.com> --------- Signed-off-by: Cui, Lily <lily.cui@intel.com>
**Summary:** Add support to pass scales and zero points learned during QAT range learning to the PTQ base config. Currently only the following configs support this feature:

```
IntxWeightOnlyConfig
Int8DynamicActivationInt4WeightConfig
Int8DynamicActivationIntxWeightConfig
```

During the convert phase, QAT will detect whether range learning was used during training and pass the learned scales and zero points as custom qparams to the quantized tensor subclass, so PTQ will produce more consistent numerics. Fixes part of pytorch#2271.

**Test Plan:**
```
python test/quantization/test_qat.py -k test_range_learning_convert_pass_qparams
```
…ngle input/output TMA descriptors (pytorch#3034)
…rch#2991) * [CPU] Fix fp8 sdpa compiling issue with latest pytorch * disable fp8 fusion
pytorch#2961) * register fp8 quant/dequant only on CPU * add non-decomposed quantize_affine_float8 and dequantize_affine_float8
* Summary: In `from_pretrained()` method in `huggingface/transformers`, `torch_dtype` is deprecated and `dtype` replaces it. To prevent deprecation warnings, this PR replaces `torch_dtype` with `dtype`. Test plan: CI Reference: huggingface/transformers#39782 * fix pre-commit * revert to source: model uploader
* Avoid normalization layers in HF's quantization_config * Add TestTorchAoConfigIntegration * Use PreTrainedModel.from_pretrained
Summary:
* removed the requirement of setting the VLLM_DIR environment variable, since the benchmark is now a CLI command
* reordered the evals and the summarization of results to better match the order of the model card

Test Plan: local manual runs achieving the desired results
Differential Revision: D82492826 Pull Request resolved: pytorch#3031
…nto wengshiy/qlinear
* Unify get_block_size * Remove granularity defines in the pt2e path * Fix format