# Quantization

imp supports both GGUF quantization (loaded directly from llama.cpp-compatible files) and SafeTensors NVFP4 prequant (produced by external calibration tools). This page explains what each format is, where it is used inside the engine, and what the trade-offs are.

For per-model picks see supported-models.md. For benchmark numbers see performance.md.

## Formats and where they show up

| Format | Bits / weight | Source | Used for |
|---|---|---|---|
| Q8_0 | 8.0 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q6_K | 6.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q5_K_M | 5.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q4_K_M | 4.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| Q4_0 | 4.5 | GGUF | dp4a GEMV decode + cuBLAS prefill |
| FP8 E4M3 | 8.0 | runtime | KV cache (opt-in), prefill weight cache |
| INT8 | 8.0 | runtime | KV cache (opt-in) |
| INT4 | 4.0 | runtime | KV cache (long-ctx, opt-in) |
| NVFP4 | 4.0 | SafeTensors | weights (decode + prefill), KV cache |
| MXFP4 | 4.5 | GGUF | weights (decode + prefill attention) |

GGUF formats are mmap'd from disk and uploaded as-is to the GPU; the `*_K` quants store block scales in the layout the dp4a kernels expect. NVFP4 prequant arrives packed two elements per byte with FP8 E4M3 micro-scales (one per 16 elements) and an FP32 tensor scale; imp registers these directly into the NVFP4 decode cache and the CUTLASS NVFP4 GEMM path with no re-quantization.
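
To make the packed layout concrete, here is a minimal reference dequantizer in Python. It only illustrates the layout described above (two E2M1 nibbles per byte, one FP8 E4M3 micro-scale per 16 elements, one FP32 tensor scale); it is not imp's loader code, the nibble ordering is an assumption, and the function name is hypothetical.

```python
import numpy as np

# E2M1 (FP4) code points: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_nvfp4(packed: np.ndarray, micro_scales: np.ndarray,
                  tensor_scale: float) -> np.ndarray:
    """Reference dequant for a 1-D NVFP4 tensor: `packed` is uint8 with two
    E2M1 nibbles per byte, `micro_scales` holds one scale per 16 elements
    (FP8 E4M3 on disk, assumed already widened to float32 here)."""
    lo = E2M1_LUT[packed & 0x0F]      # low nibble (element ordering assumed)
    hi = E2M1_LUT[packed >> 4]        # high nibble
    vals = np.stack([lo, hi], axis=-1).reshape(-1)
    scales = np.repeat(micro_scales.astype(np.float32), 16)
    return vals * scales * np.float32(tensor_scale)
```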

## NVFP4 prequant (SafeTensors)

NVFP4 prequant checkpoints carry per-tensor scales calibrated with AWQ or SmoothQuant. Two upstream tools produce compatible files:

| Tool | Status |
|---|---|
| NVIDIA Model Optimizer (Modelopt) | Primary path. Coherent on Qwen3-Coder-30B, Mistral-3.2, Qwen3.6, Gemma-4 (after PR #88 lit up the CUTLASS NVFP4×NVFP4 prefill cache). |
| llm-compressor | Loads, but several models degenerate past ~30 tokens. See roadmap. Prefer Modelopt where available. |

Workflow with Modelopt:

```bash
pip install nvidia-modelopt

python -m modelopt.llm.ptq \
  --model Qwen/Qwen3-8B \
  --quant nvfp4 \
  --output ./Qwen3-8B-nvfp4/

imp-cli --model ./Qwen3-8B-nvfp4/ --prompt "Hello"
```

Modelopt quantization modes:

| Mode | What's quantized |
|---|---|
| `nvfp4` | all linear layers |
| `nvfp4_mlp_only` | MLP / FFN layers only |
| `nvfp4_experts_only` | MoE expert layers only |
| `nvfp4_omlp_only` | MLP + output projection |
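
A quick way to see which projections actually ended up quantized under a given mode is to list the tensors in the exported SafeTensors shards. This is a generic inspection sketch using the safetensors library; the note about packed-weight dtypes and scale-tensor naming is an assumption and may differ between Modelopt versions.

```python
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("./Qwen3-8B-nvfp4")          # output of the ptq step above
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            # Packed FP4 weights typically show up as uint8 with a halved
            # inner dimension; companion scale tensors (naming varies)
            # sit next to them, while unquantized layers stay BF16/FP16.
            print(f"{name:60s} {str(t.dtype):12s} {tuple(t.shape)}")
```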

### NVFP4 internal pipeline

Dense layers:

```text
SafeTensors NVFP4 packed weights + scales
  → loader (BF16 norms / router → FP16, packed FP4 stays packed)
  → Phase 0: register in NVFP4 decode cache (no re-quant)
  → Phase 3b: CUTLASS scale-factor layout (SfAtom) for prefill
  → prefill: CUTLASS NVFP4 GEMM via gemm_dispatch() (sm_120 tensor cores)
  → decode:  NVFP4 GEMV (prmt register LUT, K-parallel)
```

MoE layers (Modelopt SafeTensors, per-expert):

```text
SafeTensors per-expert weights
  → cache_moe_native_nvfp4 builds one contiguous [ne, N, K_packed]
    buffer per layer per projection (D2D-memcpy from per-expert tensors)
  → per-expert tensors freed inline (32 GiB VRAM ceiling on 35B-A3B)
  → CUDA Graphs capture cleanly via the decode fast-path
```

Without `cache_moe_native_nvfp4`, the legacy FP16 dequant + cuBLAS sm_80 WMMA fallback fires per layer per token, which breaks CUDA Graph capture and drops decode throughput 5–17×.
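
The following is a hedged sketch, in PyTorch rather than imp's CUDA code, of what the consolidation step does conceptually: concatenate per-expert packed tensors into one contiguous `[ne, N, K_packed]` buffer with device-to-device copies, freeing each source tensor as it is consumed to cap peak VRAM. Function and variable names are illustrative.

```python
import torch

def build_contiguous_expert_buffer(expert_weights: list) -> torch.Tensor:
    """Pack per-expert packed-FP4 weights of shape [N, K_packed] (uint8, already
    on the GPU) into one contiguous [ne, N, K_packed] buffer."""
    ne = len(expert_weights)
    n, k_packed = expert_weights[0].shape
    buf = torch.empty((ne, n, k_packed), dtype=torch.uint8, device="cuda")
    for e in range(ne):
        buf[e].copy_(expert_weights[e], non_blocking=True)  # D2D memcpy
        expert_weights[e] = None   # drop the per-expert copy inline
    torch.cuda.synchronize()
    return buf
```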

The NVFP4 KV cache (`--kv-nvfp4`) has supported chunked prefill since PR #149: past chunks' K/V are gathered from the paged cache via `paged_kv_gather_nvfp4_to_fp16` (a PTX `cvt.rn.f16x2.e2m1x2` inner loop with the UE4M3 scale folded in) and concatenated with the current chunk before the rectangular cuBLAS attention. Hybrid GDN+MoE / Mamba2+MoE architectures (Qwen3.5/3.6, Nemotron-H) are in scope since PR #156.
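
A rough data-flow sketch of that gather-and-concatenate step, in Python for illustration only; the page layout, shapes, and the `dequant` callable are assumptions, not imp's kernel interface.

```python
import torch

def gather_past_kv_fp16(pages: list, page_table: list, past_len: int,
                        dequant) -> torch.Tensor:
    """Conceptual analogue of the paged NVFP4 gather: walk the page table,
    dequantize each packed page to fp16, return [past_len, n_kv_heads, head_dim]."""
    chunks = [dequant(pages[page_id]) for page_id in page_table]
    return torch.cat(chunks, dim=0)[:past_len]

# Chunked prefill then attends over the past plus the current fp16 chunk:
# k_full = torch.cat([gather_past_kv_fp16(...), k_current_chunk], dim=0)
# followed by the rectangular attention over [q_chunk, k_full].
```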

## MXFP4 (GGUF)

MXFP4 uses the same FP4 E2M1 nibble layout as NVFP4 but with UE8M0 micro-scales (per 32 elements) and no separate tensor scale. This matches the format the Blackwell tensor cores expect natively, so MXFP4 prefill goes through CUTLASS at full FP4 throughput.
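
For contrast with the NVFP4 sketch above, here is the corresponding per-block MXFP4 decode: the same E2M1 nibbles, but the per-32 scale is a UE8M0 byte (an unsigned power of two, 2^(byte − 127)) and there is no tensor-level scale. A hypothetical helper, not imp's kernel; nibble ordering again assumed.

```python
import numpy as np

E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_mxfp4_block(packed16: np.ndarray, ue8m0_scale: int) -> np.ndarray:
    """Decode one MXFP4 block: 16 packed uint8 bytes -> 32 elements,
    scaled by 2**(scale_byte - 127)."""
    lo = E2M1_LUT[packed16 & 0x0F]
    hi = E2M1_LUT[packed16 >> 4]
    vals = np.stack([lo, hi], axis=-1).reshape(-1)        # 32 values
    scale = np.float32(2.0 ** (int(ue8m0_scale) - 127))   # UE8M0: pure exponent
    return vals * scale
```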

imp ships MXFP4 inside GGUF using a proprietary tensor-type code (31). llama.cpp reads this as the removed Q4_0_4_4 format, so cross-tool perplexity comparison is not possible without a standard MXFP4 export.

Round-to-nearest MXFP4 costs +5–15% perplexity vs Q8_0, worse than Q4_K_M (+2.2% on Qwen3-4B wikitext-2). MR-GPTQ calibration would close this gap; it is on the roadmap.

## KV cache element type

Set via `--kv-fp8` / `--kv-int8` / `--kv-int4` / `--kv-turboquant`, or in `imp.conf`:

```toml
[kv_cache]
dtype = "fp16"  # fp16 (default) | fp8 | int8 | int4 | nvfp4
```

The default flipped to FP16 in PR #51 — FP8 had been silently breaking Llama, Mistral, and DeepSeek at first decode. FP8 is now opt-in; it is verified coherent on Qwen3 dense, Qwen3.5 / 3.6 GDN, and Llama-3.2 (the FP8 KV warmup-calibration bug was fixed in PR #89). Gemma-4 keeps a force-FP16 carve-out — its dual head_dim layout (256 SWA / 512 global) doesn't fit the FP8 KV write/read kernel's single-stride assumption.

INT4 KV is for VRAM-pressure cases only — coherent but ~22% decode regression at 20K context. TurboQuant (PolarQuant + QJL) goes lower still but loses ~23% vs FP8 at short context.
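
To put the dtype choice in VRAM terms: per token the KV cache holds 2 (K and V) × n_layers × n_kv_heads × head_dim elements, times the element width. A minimal calculator follows; the model shape is illustrative, not taken from any specific model card, and block-scale overhead for the sub-byte formats is ignored.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bits_per_elem: float) -> float:
    """Approximate KV cache bytes per token (K + V), ignoring paging overhead."""
    return 2 * n_layers * n_kv_heads * head_dim * bits_per_elem / 8

# Illustrative shape: 48 layers, 8 KV heads, head_dim 128.
for name, bits in [("fp16", 16), ("fp8", 8), ("int8", 8), ("int4", 4)]:
    per_tok = kv_bytes_per_token(48, 8, 128, bits)
    print(f"{name:5s} {per_tok / 1024:6.1f} KiB/token  "
          f"{per_tok * 32768 / 2**30:5.2f} GiB at 32K context")
```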

## Choosing a quant

Quick guidance, not a benchmark:

- Q8_0 is the cleanest baseline. Use it when output quality matters and VRAM allows.
- Q4_K_M is the most VRAM-efficient GGUF. Sufficient for most chat; can degenerate on long code-gen on Gemma-4 — use Q5_K_M or Q8_0 there.
- Q6_K sits in between. Good MoE pick on Qwen3-Coder-30B (234 tok/s).
- NVFP4 (SafeTensors prequant) gives the highest decode throughput on prequant-aware models — Qwen3-Coder-30B at 272 tok/s, Qwen3.6-35B at 217 tok/s, Gemma-4-26B at 213 tok/s. Requires AWQ/SmoothQuant calibration; only Modelopt is fully tested.
- MXFP4 is GGUF-native FP4. Smallest footprint (Qwen3-4B at 2.8 GB), but quality lags Q4_K_M without MR-GPTQ calibration.