# Quantization

imp supports both GGUF quantization (loaded directly from llama.cpp-compatible files) and SafeTensors NVFP4 prequant (produced by external calibration tools). This page explains what each format is, where it is used inside the engine, and what the trade-offs are.

For per-model picks see supported-models.md. For benchmark numbers see performance.md.

## Formats and where they show up

| Format | Bits / weight | Source | Used for |
|---|---|---|---|
| Q8_0 | 8.0 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q6_K | 6.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q5_K_M | 5.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q4_K_M | 4.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q4_0 | 4.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| FP8 E4M3 | 8.0 | runtime | KV cache (opt-in), prefill weight cache |
| INT8 | 8.0 | runtime | KV cache (opt-in) |
| INT4 | 4.0 | runtime | KV cache (long-ctx, opt-in) |
| NVFP4 | 4.0 | SafeTensors | weights (decode + prefill), KV cache |
| MXFP4 | 4.5 | GGUF | weights (decode + prefill attention) |

GGUF formats are mmap'd from disk and uploaded as-is to the GPU; the `*_K` quants store block scales in the layout the dp4a kernels expect. NVFP4 prequant arrives packed two elements per byte with FP8 E4M3 micro-scales (one per 16 elements) and an FP32 tensor scale; imp registers these directly into the NVFP4 decode cache and the CUTLASS NVFP4 GEMM path with no re-quantization.
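
To make the packed layout concrete, here is a minimal reference dequantizer in Python. It only illustrates the layout described above (two E2M1 nibbles per byte, one FP8 E4M3 micro-scale per 16 elements, one FP32 tensor scale); it is not imp's loader code, the nibble ordering is an assumption, and the function name is hypothetical.

```python
import numpy as np

# E2M1 (FP4) code points: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_nvfp4(packed: np.ndarray, micro_scales: np.ndarray,
                  tensor_scale: float) -> np.ndarray:
    """Reference dequant for a 1-D NVFP4 tensor: `packed` is uint8 with two
    E2M1 nibbles per byte, `micro_scales` holds one scale per 16 elements
    (FP8 E4M3 on disk, assumed already widened to float32 here)."""
    lo = E2M1_LUT[packed & 0x0F]      # low nibble (element ordering assumed)
    hi = E2M1_LUT[packed >> 4]        # high nibble
    vals = np.stack([lo, hi], axis=-1).reshape(-1)
    scales = np.repeat(micro_scales.astype(np.float32), 16)
    return vals * scales * np.float32(tensor_scale)
```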

## NVFP4 prequant (SafeTensors)

NVFP4 prequant checkpoints carry per-tensor scales calibrated with AWQ or SmoothQuant. Two upstream tools produce compatible files:

| Tool | Status |
|---|---|
| NVIDIA Model Optimizer (Modelopt) | Primary path. Coherent on Qwen3-Coder-30B, Mistral-3.2, Qwen3.6, Gemma-4 (after PR #88 lit up the CUTLASS NVFP4×NVFP4 prefill cache). |
| llm-compressor | Loads, but several models degenerate past ~30 tokens. See roadmap. Prefer Modelopt where available. |

Workflow with Modelopt:

```bash
pip install nvidia-modelopt

python -m modelopt.llm.ptq \
  --model Qwen/Qwen3-8B \
  --quant nvfp4 \
  --output ./Qwen3-8B-nvfp4/

imp-cli --model ./Qwen3-8B-nvfp4/ --prompt "Hello"
```

Modelopt quantization modes:

| Mode | What's quantized |
|---|---|
| `nvfp4` | all linear layers |
| `nvfp4_mlp_only` | MLP / FFN layers only |
| `nvfp4_experts_only` | MoE expert layers only |
| `nvfp4_omlp_only` | MLP + output projection |
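
A quick way to see which projections actually ended up quantized under a given mode is to list the tensors in the exported SafeTensors shards. This is a generic inspection sketch using the safetensors library; the note about packed-weight dtypes and scale-tensor naming is an assumption and may differ between Modelopt versions.

```python
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("./Qwen3-8B-nvfp4")          # output of the ptq step above
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            # Packed FP4 weights typically show up as uint8 with a halved
            # inner dimension; companion scale tensors (naming varies)
            # sit next to them, while unquantized layers stay BF16/FP16.
            print(f"{name:60s} {str(t.dtype):12s} {tuple(t.shape)}")
```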

### NVFP4 internal pipeline

Dense layers:

```text
SafeTensors NVFP4 packed weights + scales
  → loader (BF16 norms / router → FP16, packed FP4 stays packed)
  → Phase 0: register in NVFP4 decode cache (no re-quant)
  → Phase 3b: CUTLASS scale-factor layout (SfAtom) for prefill
  → prefill: CUTLASS NVFP4 GEMM via gemm_dispatch() (sm_120 tensor cores)
  → decode:  NVFP4 GEMV (prmt register LUT, K-parallel)
```

MoE layers (Modelopt SafeTensors, per-expert):

```text
SafeTensors per-expert weights
  → cache_moe_native_nvfp4 builds one contiguous [ne, N, K_packed]
    buffer per layer per projection (D2D-memcpy from per-expert tensors)
  → per-expert tensors freed inline (32 GiB VRAM ceiling on 35B-A3B)
  → CUDA Graphs capture cleanly via the decode fast-path
```

Without `cache_moe_native_nvfp4`, the legacy FP16 dequant + cuBLAS sm_80 WMMA fallback fires per layer per token, which breaks CUDA Graph capture and drops decode throughput 5–17×.
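
The following is a hedged sketch, in PyTorch rather than imp's CUDA code, of what the consolidation step does conceptually: concatenate per-expert packed tensors into one contiguous `[ne, N, K_packed]` buffer with device-to-device copies, freeing each source tensor as it is consumed to cap peak VRAM. Function and variable names are illustrative.

```python
import torch

def build_contiguous_expert_buffer(expert_weights: list) -> torch.Tensor:
    """Pack per-expert packed-FP4 weights of shape [N, K_packed] (uint8, already
    on the GPU) into one contiguous [ne, N, K_packed] buffer."""
    ne = len(expert_weights)
    n, k_packed = expert_weights[0].shape
    buf = torch.empty((ne, n, k_packed), dtype=torch.uint8, device="cuda")
    for e in range(ne):
        buf[e].copy_(expert_weights[e], non_blocking=True)  # D2D memcpy
        expert_weights[e] = None   # drop the per-expert copy inline
    torch.cuda.synchronize()
    return buf
```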

The NVFP4 KV cache (`--kv-nvfp4`) has supported chunked prefill since PR #149: past chunks' K/V are gathered from the paged cache via `paged_kv_gather_nvfp4_to_fp16` (a PTX `cvt.rn.f16x2.e2m1x2` inner loop with the UE4M3 scale folded in) and concatenated with the current chunk before the rectangular cuBLAS attention. Hybrid GDN+MoE / Mamba2+MoE architectures (Qwen3.5/3.6, Nemotron-H) are in scope since PR #156.
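
A rough data-flow sketch of that gather-and-concatenate step, in Python for illustration only; the page layout, shapes, and the `dequant` callable are assumptions, not imp's kernel interface.

```python
import torch

def gather_past_kv_fp16(pages: list, page_table: list, past_len: int,
                        dequant) -> torch.Tensor:
    """Conceptual analogue of the paged NVFP4 gather: walk the page table,
    dequantize each packed page to fp16, return [past_len, n_kv_heads, head_dim]."""
    chunks = [dequant(pages[page_id]) for page_id in page_table]
    return torch.cat(chunks, dim=0)[:past_len]

# Chunked prefill then attends over the past plus the current fp16 chunk:
# k_full = torch.cat([gather_past_kv_fp16(...), k_current_chunk], dim=0)
# followed by the rectangular attention over [q_chunk, k_full].
```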

## MXFP4 (GGUF)

MXFP4 uses the same FP4 E2M1 nibble layout as NVFP4 but with UE8M0 micro-scales (per 32 elements) and no separate tensor scale. This matches the format the Blackwell tensor cores expect natively, so MXFP4 prefill goes through CUTLASS at full FP4 throughput.
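
For contrast with the NVFP4 sketch above, here is the corresponding per-block MXFP4 decode: the same E2M1 nibbles, but the per-32 scale is a UE8M0 byte (an unsigned power of two, 2^(byte − 127)) and there is no tensor-level scale. A hypothetical helper, not imp's kernel; nibble ordering again assumed.

```python
import numpy as np

E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_mxfp4_block(packed16: np.ndarray, ue8m0_scale: int) -> np.ndarray:
    """Decode one MXFP4 block: 16 packed uint8 bytes -> 32 elements,
    scaled by 2**(scale_byte - 127)."""
    lo = E2M1_LUT[packed16 & 0x0F]
    hi = E2M1_LUT[packed16 >> 4]
    vals = np.stack([lo, hi], axis=-1).reshape(-1)        # 32 values
    scale = np.float32(2.0 ** (int(ue8m0_scale) - 127))   # UE8M0: pure exponent
    return vals * scale
```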

imp ships MXFP4 inside GGUF using a proprietary tensor-type code (31). llama.cpp reads this as the removed Q4_0_4_4 format, so cross-tool perplexity comparison is not possible without a standard MXFP4 export.

Round-to-nearest MXFP4 costs +5–15% perplexity vs Q8_0, worse than Q4_K_M (+2.2% on Qwen3-4B wikitext-2). MR-GPTQ calibration would close this gap; it is on the roadmap.

## KV cache element type

Set via `--kv-fp8` / `--kv-int8` / `--kv-int4` / `--kv-turboquant`, or in `imp.conf`:

```toml
[kv_cache]
dtype = "fp16"  # fp16 (default) | fp8 | int8 | int4 | nvfp4
```

The default flipped to FP16 in PR #51 — FP8 had been silently breaking Llama, Mistral, and DeepSeek at first decode. FP8 is now opt-in; it is verified coherent on Qwen3 dense, Qwen3.5 / 3.6 GDN, and Llama-3.2 (the FP8 KV warmup-calibration bug was fixed in PR #89). Gemma-4 keeps a force-FP16 carve-out — its dual head_dim layout (256 SWA / 512 global) doesn't fit the FP8 KV write/read kernel's single-stride assumption.

INT4 KV is for VRAM-pressure cases only — coherent but ~22% decode regression at 20K context. TurboQuant (PolarQuant + QJL) goes lower still but loses ~23% vs FP8 at short context.
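
To put the dtype choice in VRAM terms: per token the KV cache holds 2 (K and V) × n_layers × n_kv_heads × head_dim elements, times the element width. A minimal calculator follows; the model shape is illustrative, not taken from any specific model card, and block-scale overhead for the sub-byte formats is ignored.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bits_per_elem: float) -> float:
    """Approximate KV cache bytes per token (K + V), ignoring paging overhead."""
    return 2 * n_layers * n_kv_heads * head_dim * bits_per_elem / 8

# Illustrative shape: 48 layers, 8 KV heads, head_dim 128.
for name, bits in [("fp16", 16), ("fp8", 8), ("int8", 8), ("int4", 4)]:
    per_tok = kv_bytes_per_token(48, 8, 128, bits)
    print(f"{name:5s} {per_tok / 1024:6.1f} KiB/token  "
          f"{per_tok * 32768 / 2**30:5.2f} GiB at 32K context")
```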

## Choosing a quant

Quick guidance, not a benchmark:

- Q8_0 is the cleanest baseline. Use it when output quality matters and VRAM allows.
- Q4_K_M is the most VRAM-efficient GGUF. Sufficient for most chat; can degenerate on long code-gen on Gemma-4 — use Q5_K_M or Q8_0 there.
- Q6_K sits in between. Good MoE pick on Qwen3-Coder-30B (234 tok/s).
- NVFP4 (SafeTensors prequant) gives the highest decode throughput on prequant-aware models — Qwen3-Coder-30B at 272 tok/s, Qwen3.6-35B at 217 tok/s, Gemma-4-26B at 213 tok/s. Requires AWQ/SmoothQuant calibration; only Modelopt is fully tested.
- MXFP4 is GGUF-native FP4. Smallest footprint (Qwen3-4B at 2.8 GB), but quality lags Q4_K_M without MR-GPTQ calibration.