FastDeploy supports inference at a range of quantization precisions, including FP8, INT8, INT4, and 2-bit. Weights, activations, and KVCache tensors can each use a different precision, covering scenarios such as low cost, low latency, and long context.
| Quantization Method | Weight Precision | Activation Precision | KVCache Precision | Online/Offline | Supported Hardware |
|---|---|---|---|---|---|
| WINT8 | INT8 | BF16 | BF16 | Online | GPU, XPU |
| WINT4 | INT4 | BF16 | BF16 | Online | GPU, XPU |
| Block-wise FP8 | block-wise static FP8 | token-wise dynamic FP8 | BF16 | Online | GPU |
| WINT2 | 2-bit | BF16 | BF16 | Offline | GPU |
| MixQuant | INT4/INT8 | INT8/BF16 | INT8/BF16 | Offline | GPU, XPU |
Notes
- Quantization Method: Corresponds to the "quantization" field in the quantization configuration file.
- Online/Offline Quantization: distinguishes when the weights are quantized.
  - Online quantization: weights are quantized after being loaded into the inference engine.
  - Offline quantization: weights are quantized ahead of inference and stored as low-bit numerical types; at inference time the already-quantized low-bit values are loaded directly.
- Dynamic/Static Quantization: distinguishes how the quantization coefficients for activations are obtained (see the sketch after these notes).
  - Static quantization: quantization coefficients are determined and stored before inference, and the pre-calculated coefficients are loaded at inference time. Because the coefficients stay fixed (static) during inference, this is called static quantization.
  - Dynamic quantization: quantization coefficients for the current batch are computed in real time during inference. Because the coefficients change dynamically as inference runs, this is called dynamic quantization.
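To make these distinctions concrete, below is a toy NumPy sketch (illustrative only, not FastDeploy internals) of online weight quantization and of static versus dynamic activation scales:

```python
import numpy as np

def quantize_int8(x, scale):
    # Map float values to INT8 using a per-tensor scale.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Online weight quantization: high-precision weights are loaded first,
# then the scale is computed and the weights are quantized inside the engine.
weights = np.random.randn(512, 512).astype(np.float32)
w_scale = np.abs(weights).max() / 127.0            # computed at load time
w_int8 = quantize_int8(weights, w_scale)

# Static activation quantization: the scale was calibrated offline and is
# simply loaded; it stays fixed for every batch.
static_a_scale = 0.05                              # pre-calculated offline

# Dynamic activation quantization: the scale is recomputed from the
# current batch on every forward pass.
activations = np.random.randn(16, 512).astype(np.float32)
dynamic_a_scale = np.abs(activations).max() / 127.0
a_int8 = quantize_int8(activations, dynamic_a_scale)
```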
The following models currently support quantized inference:

| Model Name | Supported Quantization Precisions |
|---|---|
| ERNIE-4.5-300B-A47B | WINT8, WINT4, Block-wise FP8, MixQuant |
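For example, online quantization can be requested when the model is loaded. The sketch below uses FastDeploy's offline `LLM` API; the exact entry point and the `quantization` and `tensor_parallel_size` argument names are assumptions based on the current FastDeploy documentation and may differ between versions:

```python
# Minimal sketch; argument names are assumptions, check the docs for your
# FastDeploy version. Requesting "wint4" asks the engine to quantize the
# BF16 weights to INT4 online, right after loading them.
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="baidu/ERNIE-4.5-300B-A47B-Paddle",  # model from the table above
    quantization="wint4",                      # online WINT4 (weight-only INT4)
    tensor_parallel_size=8,
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
```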
FastDeploy names quantization precisions using the following format:

`{tensor abbreviation}{numerical type}{tensor abbreviation}{numerical type}{tensor abbreviation}{numerical type}`
Examples:
- W8A8C8: W=weights, A=activations, C=KVCache; 8 defaults to INT8
- W8A8C16: 16 defaults to BF16, others same as above
- W4A16C16 / WInt4 / weight-only int4: 4 defaults to INT4
- WNF4A8C8: NF4 refers to the 4-bit norm-float (NormalFloat) numerical type
- Wfp8Afp8: Both weights and activations are FP8 precision
- W4Afp8: Weights are INT4, activations are FP8
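As a rough illustration of this scheme, here is a small hypothetical helper (not a FastDeploy API) that expands such names into per-tensor precisions using the defaults listed above:

```python
import re

# Hypothetical helper, not part of FastDeploy: split a precision name such as
# "W8A8C8" or "Wfp8Afp8" into per-tensor precisions, applying the defaults
# described above (4 -> INT4, 8 -> INT8, 16 -> BF16).
DEFAULTS = {"4": "INT4", "8": "INT8", "16": "BF16"}
TENSORS = {"W": "weights", "A": "activations", "C": "kv_cache"}

def parse_precision(name: str) -> dict:
    parts = {}
    # A tensor letter (W/A/C) followed by a numeric-type token (8, 16, fp8, nf4, ...).
    for tensor, dtype in re.findall(r"([WAC])((?:nf|fp|int)?\d+)", name, re.IGNORECASE):
        parts[TENSORS[tensor.upper()]] = DEFAULTS.get(dtype, dtype.upper())
    # Tensors not named in the string keep the BF16 default.
    for key in TENSORS.values():
        parts.setdefault(key, "BF16")
    return parts

print(parse_precision("W8A8C8"))    # {'weights': 'INT8', 'activations': 'INT8', 'kv_cache': 'INT8'}
print(parse_precision("WNF4A8C8"))  # {'weights': 'NF4', 'activations': 'INT8', 'kv_cache': 'INT8'}
print(parse_precision("W4Afp8"))    # {'weights': 'INT4', 'activations': 'FP8', 'kv_cache': 'BF16'}
```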