
feat: svdquant w4a4 + awq w4a16 kitchen native (CORE-31)#36

Open
HK416-TYPED wants to merge 11 commits into Comfy-Org:main from HK416-TYPED:feat/svdquant-w4a4-kitchen-native

Conversation


HK416-TYPED commented Apr 21, 2026

Add SVDQuant W4A4 + AWQ W4A16 int4 quantization support for Qwen-Image-Edit and similar SVD-LoRA / AWQ-modulation checkpoints. Built as kitchen-native row-major backends.

Targets sm_80+ (Ampere and newer).

SVDQuant W4A4

  • comfy_kitchen.quantize_svdquant_w4a4 / scaled_mm_svdquant_w4a4 as torch.library.custom_op entry points with eager pure-PyTorch reference.
  • TensorCoreSVDQuantW4A4Layout dispatching aten.t / aten.mm / aten.addmm / aten.linear over int4 qdata + proj_down / proj_up / smooth_factor + per-group weight_scale.
  • CUDA kernels: m16n8k64 int4 MMA GEMM + fused activation quantize over kitchen-native row-major tensors; fp32 accumulator (fp16 overflows at production Qwen scale ≈ 28 — see kernel header).
  • act_unsigned wired end-to-end for the post-GELU fc2 layers: u4.s4 MMA variant with a +0.171875 shift applied at the layer (the kernel API stays shift-free).
  • Signed quantizer clamps to [-7, 7] (skipping -8) to match nunchaku's absmax/7 dequant-symmetric contract; a minimal sketch of both the signed and unsigned contracts follows this list.
  • LoRA-up + bias applied in Python via bf16 addmm_ (faster than kernel fusion for R=96; see scaled_mm_svdquant_w4a4 docstring for the precision tradeoff and the cuBLASLt upgrade path if
    needed).
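
For reference, a minimal sketch of the signed clamp contract and the unsigned variant mentioned above. The helper names, per-group reshaping, and the absmax/15 scaling on the unsigned path are illustrative assumptions, not the actual comfy_kitchen API:

```python
import torch

def signed_quant_ref(x: torch.Tensor, group: int = 64):
    """Symmetric int4 quantizer: absmax/7 scale, codes clamped to [-7, 7].
    Skipping -8 keeps dequantization (q * scale) exactly symmetric around zero."""
    xg = x.reshape(-1, group)
    scale = (xg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.round(xg / scale).clamp_(-7, 7).to(torch.int8)
    return q, scale

def unsigned_quant_ref(x: torch.Tensor, group: int = 64):
    """Unsigned u4 variant for non-negative post-GELU activations (codes 0..15).
    The +0.171875 shift that makes GELU outputs non-negative is applied by the
    layer before this point; the quantizer itself stays shift-free."""
    xg = x.reshape(-1, group)
    scale = (xg.amax(dim=-1, keepdim=True) / 15.0).clamp_min(1e-8)
    q = torch.round(xg / scale).clamp_(0, 15).to(torch.uint8)
    return q, scale
```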

AWQ W4A16

  • comfy_kitchen.gemv_awq_w4a16 as a single op covering both GEMV (M ≤ 8) and GEMM regimes — kernel routes internally.
  • TensorCoreAWQW4A16Layout dispatching aten.t / aten.mm / aten.addmm / aten.linear over kitchen-native uint4 qdata + per-group wscales / wzeros.
  • CUDA kernels:
    • M ≤ 8: naive 1-thread-per-output (18× faster than eager at M=1, 4× at M=8 on RTX 5090 / Qwen-Image modulation shape N=18432, K=3072).
    • 8 < M ≤ 256: fused mma.m16n8k16.f32.bf16.bf16.f32 with cooperative dequant of the int4 weight tile to bf16 in shmem. No 113 MB intermediate W workspace; shmem stride padded by 8 b16 (to 72) to break the 128-byte ldmatrix bank conflict.
    • M > 256: dequantize + cuBLAS bf16 matmul fallback (the MMA kernel's single-thread-per-N-row dequant pass and lack of cp.async pipelining stop scaling past M ≈ 256 on Blackwell; raising the
      limit is a follow-up tuning task).
  • Bias is applied externally in the Python wrapper (out.add_(bias)), mirroring scaled_mm_svdquant_w4a4's epilogue contract; a rough eager sketch of the routing and epilogue follows this list.
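
A rough pure-PyTorch stand-in for the routing and epilogue above. Packing is ignored, the function names are hypothetical, and the real op keeps all of this inside the CUDA kernel behind the single gemv_awq_w4a16 entry point:

```python
import torch

def dequant_awq_ref(qweight, wscales, wzeros, group=64):
    """Reference dequant: (nibble - 8) * scale + zero, one (scale, zero) per group.
    qweight: unpacked 4-bit codes as uint8, shape (K, N);
    wscales / wzeros: per-group tensors of shape (K // group, N)."""
    w = qweight.to(torch.float32) - 8.0
    w = w * wscales.repeat_interleave(group, 0).float() \
          + wzeros.repeat_interleave(group, 0).float()
    return w.to(torch.bfloat16)

def awq_linear_sketch(x, qweight, wscales, wzeros, bias=None):
    # Real routing (inside the kernel): M <= 8 -> 1-thread-per-output GEMV,
    # 8 < M <= 256 -> fused m16n8k16 MMA with shmem dequant,
    # M > 256 -> dequantize + cuBLAS. The eager math below matches that
    # large-M fallback and doubles as the correctness reference.
    out = x @ dequant_awq_ref(qweight, wscales, wzeros).to(x.dtype)
    if bias is not None:
        out.add_(bias)  # bias stays outside the op, per the epilogue contract
    return out
```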

Tests

  • tests/test_svdquant_w4a4.py — 16 tests: quantizer clamp contract, signed/unsigned MMA dispatch, lora_x separation, cross-shape smoke.

Verification

  • AWQ: vs the nunchaku runtime ops.gemv_awq oracle on a real r96 modulation tensor — max abs error / max magnitude ≤ 0.65% at all M (within the bf16 ULP precision floor).
  • E2E: Qwen-Image-Edit r96 ComfyUI sampling on the kitchen-native AWQ checkpoint — wall clock 274 s vs 342 s for eager-AWQ (20% faster), PSNR 33.5 dB vs the bf16-dequant baseline. Stratified samples across 6 variant families (balanced / fast / mid / quality × ranks 32/64/96/128 × base/lightning4/lightning8) all complete successfully with the right (steps, cfg) per distillation contract.
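
For context, the PSNR comparison against the bf16-dequant baseline image can be reproduced with a helper along these lines (the actual measurement script is not part of this PR):

```python
import torch

def psnr(img_a: torch.Tensor, img_b: torch.Tensor, peak: float = 1.0) -> float:
    """PSNR in dB between two images with values in [0, peak]."""
    mse = torch.mean((img_a.float() - img_b.float()) ** 2)
    return float(10.0 * torch.log10(peak ** 2 / mse))
```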

Model repository

Introduces:
- comfy_kitchen.quantize_svdquant_w4a4 / scaled_mm_svdquant_w4a4 as
  torch.library.custom_op entry points backed by the registry.
- Eager pure-PyTorch reference impls for SVDQuant W4A4 (activation quant
  + smooth + LoRA-down fused; scaled_mm with LoRA-up + bias) and AWQ
  W4A16 GEMV, following kitchen's existing eager backend conventions.
- TensorCoreSVDQuantW4A4Layout dispatching aten.t / aten.mm / aten.addmm
  / aten.linear for int4 storage with proj_down / proj_up / smooth
  params alongside weight scales.

The eager path is a correctness reference. Kernel-level parity with
nunchaku's MMA tile layout is not part of this change; it is covered
by the CUDA backend port in the following commit.
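
A rough pure-PyTorch sketch of the decomposition this commit describes. Tensor names, the multiply-vs-divide direction of the smooth factor, per-token activation scaling, and treating weight_scale as per-channel rather than per-group are all simplifying assumptions; the eager backend added here is the authoritative reference:

```python
import torch

def quantize_svdquant_ref(x, smooth_factor, proj_down):
    # Smooth the activation (direction assumed), peel off the LoRA-down branch
    # in high precision, then int4-quantize the smoothed activation per token.
    xs = x / smooth_factor
    lora_x = xs @ proj_down.t()                                  # (M, R), stays bf16
    scale = (xs.abs().amax(-1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.round(xs / scale).clamp_(-7, 7).to(torch.int8)
    return q, scale, lora_x

def scaled_mm_svdquant_ref(q, scale, lora_x, qweight, weight_scale, proj_up, bias):
    # Int4 GEMM emulated in fp32, then LoRA-up + bias as the Python-side
    # bf16 addmm epilogue described in the PR (all bf16 inputs assumed).
    out = (q.float() * scale) @ (qweight.float() * weight_scale).t()
    out = out.to(torch.bfloat16)
    out = out + torch.addmm(bias, lora_x, proj_up.t())
    return out
```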
Native int4 GEMM + activation quantize kernels that consume
kitchen-native row-major tensors (no nunchaku vendor code, no
tile-interleaved packing). sm_80+. End-to-end verified on
Qwen-Image-Edit via ComfyUI.

Accumulator is fp32 by design, not fp16 half2: production Qwen
weight scales reach ~28 so per-term products overflow fp16
(±65504) and silently cascade into NaN → black images. See the
header of ops/scaled_mm_svdquant_w4a4.cu for the full rationale
before considering any fp16-accumulator optimization.
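A tiny illustration of that failure mode (the activation scale below is an assumed value; the kernel header has the real derivation):

```python
import torch

# With a weight scale near 28 and an assumed activation scale of 4, a single
# dequantized product is already ~5.5k; summed over K = 3072 terms the running
# total passes fp16's ±65504 ceiling within a handful of iterations and turns
# into inf, which later ops propagate as NaN. fp32 holds the same sum easily.
s_w, s_a, K = 28.0, 4.0, 3072
term = (7 * s_w) * (7 * s_a)
acc = torch.zeros((), dtype=torch.float16)
for _ in range(K):
    acc = acc + term
print(acc)                                          # tensor(inf, dtype=torch.float16)
print(torch.tensor(term * K, dtype=torch.float32))  # ~1.7e7, fine in fp32
```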

Also wires act_unsigned end-to-end for the post-GELU fc2 layers
(u4.s4 MMA variant, layer-level +0.171875 shift), aligns the
eager signed-quantize clamp to [-7, 7] to match nunchaku's
absmax/7 scheme, and adds tests/test_svdquant_w4a4.py (16 tests
covering clamp contract, MMA dispatch, lora_x separation,
cross-shape smoke).
alexisrolland changed the title from Feat/svdquant w4a4 kitchen native to Feat/svdquant w4a4 kitchen native (CORE-31) on Apr 24, 2026
The eager AWQ W4A16 GEMV (`comfy_kitchen.gemv_awq_w4a16`, registered as a
torch.library custom op) was already in place but had no aten dispatch
plumbing — there was no `QuantizedTensor` layout to wrap pre-quantized
weights and route `F.linear` / `mm` / `addmm` to the GEMV.

Add `TensorCoreAWQW4A16Layout` mirroring `TensorCoreSVDQuantW4A4Layout`:

- `Params(scale=wscales, zeros=wzeros, group_size=64, transposed=False)`
  on top of `BaseLayoutParams`; both wscales and wzeros are per-group fp
  tensors of shape `(K // G, N)` matching the eager kernel's contract.
- aten.t / aten.mm / aten.addmm / aten.linear handlers, plus a
  `dequantize()` reference for the fallback path (used when the RHS is
  not transposed or operand types don't match).

Targets the modulation linears (`img_mod.1` / `txt_mod.1` in
Qwen-Image-Edit, ~13 GB of bf16 storage on a r96 checkpoint after the
current dequant-on-load path) so they can stay int4 end-to-end and cut
both checkpoint size and resident VRAM by ~4x. CUDA fast path TBD;
eager backs the dispatch in the meantime.
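
Back-of-envelope for the ~4x figure, using the modulation shape quoted later in this PR (N=18432, K=3072, group 64; fp16 scale/zero storage assumed):

```python
K, N, G = 3072, 18432, 64
bf16_bytes = K * N * 2              # ~113 MB dense bf16 per tensor
int4_bytes = K * N // 2             # ~28 MB packed 4-bit qdata
meta_bytes = 2 * (K // G) * N * 2   # wscales + wzeros, ~3.5 MB
print(bf16_bytes / (int4_bytes + meta_bytes))   # ~3.6x per tensor
```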

Verified F.linear via this layout is bit-exact with `dequantize()` +
plain matmul on synthetic AWQ tensors.
Native int4 × bf16/fp16 matmul on the kitchen-native row-major layout:

  * GEMV path (M ≤ 8): naive 1-thread-per-output kernel, fp32 accum.
    18× faster than eager at M=1, 4× at M=8 on RTX 5090 / Qwen-Image
    modulation shape (N=18432, K=3072).

  * Fused MMA path (8 < M ≤ 256): cooperative dequantize qweight to
    bf16 in shmem, then mma.m16n8k16.row.col.f32.bf16.bf16.f32 (or
    fp16 specialization) along K. No 113 MB intermediate W workspace.
    BLOCK_M=16 / BLOCK_N=128 / BLOCK_K=64 (= one quant group),
    4 warps, ldmatrix.x4 for both A and B (W is N-major row-major in
    shmem, which matches mma B's lane layout directly), shmem stride
    padded to 72 b16 to break the 128-byte ldmatrix bank conflict (see the
    bank-index sketch after this list). 1.6× at M=256 vs eager.

  * Large-M fallback (M > 256): dequantize-then-cuBLAS bf16 matmul.
    The crossover sits there on Blackwell because the MMA kernel's
    single-thread-per-N-row dequant pass and lack of cp.async
    pipelining stop scaling past M ≈ 256; raising the limit is a
    follow-up tuning task. 1.2× at M=4096 vs eager.
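
The 72-element padding can be sanity-checked with plain bank arithmetic (32 banks x 4 bytes assumed; this is index math only, not the kernel's actual addressing):

```python
# Row r of an 8-row ldmatrix tile starts at byte offset r * stride * 2 (b16
# stride). With stride 64 every row lands on the same bank; stride 72 staggers
# the starting banks across the tile and removes the conflict.
for stride in (64, 72):
    print(stride, [(r * stride * 2 // 4) % 32 for r in range(8)])
# 64 [0, 0, 0, 0, 0, 0, 0, 0]
# 72 [0, 4, 8, 12, 16, 20, 24, 28]
```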

bias is applied externally in the Python wrapper (`out.add_(bias)`),
mirroring scaled_mm_svdquant_w4a4's epilogue contract.

Routing is internal to `gemv_awq_w4a16` so the public API and the
ComfyUI MixedPrecisionOps dispatch are unchanged.

Verified vs nunchaku runtime `ops.gemv_awq` oracle on a real r96
modulation tensor: max-abs / max-magnitude ≤ 0.65 % at all M (within
the bf16 ULP precision floor of `(nibble - 8) * scale + zero` chains).

Qwen-Image-Edit r96 ComfyUI E2E sampling: 274 s vs 342 s eager-AWQ
(20 % faster), PSNR 33.48 dB vs the bf16-dequant kitchen-native
baseline image — visually equivalent.
HK416-TYPED changed the title from Feat/svdquant w4a4 kitchen native (CORE-31) to Feat/svdquant w4a4 + awq w4a16 kitchen native (CORE-31) on Apr 27, 2026
alexisrolland changed the title from Feat/svdquant w4a4 + awq w4a16 kitchen native (CORE-31) to feat: svdquant w4a4 + awq w4a16 kitchen native (CORE-31) on Apr 27, 2026
HK416-TYPED and others added 7 commits April 29, 2026 02:56
- Rename M/N/K/R/M_pad/K_half/G locals to lowercase per pep8-naming (N806/N803).
  Math notation in docstrings/comments preserved.
- Replace × (multiplication sign) with x in docstrings/comments (RUF002/003).
- Remove quoted type annotations (UP037).
- Tag the standard `import torch.nn.functional as F` alias with noqa: N812.
- Drop unused get_capable_backends import + reflow imports in test_svdquant_w4a4.py.
The previous lint pass missed comfy_kitchen/backends/cuda/__init__.py
(N806 uppercase math vars in Python wrappers for quantize/scaled_mm and
gemv_awq_w4a16) and the I001 import sort in tensor/__init__.py. Apply
the same M/N/K/R/G/M_pad/K_half lowercase rename plus × → x in the two
remaining docstring/comment occurrences.

silveroxides commented May 11, 2026

Does this have an associated PR for https://github.com/Comfy-Org/ComfyUI ?

Edit: Ignore me. I am just blind xD
