Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090
cuda quantization mtp kv-cache fwht llama-cpp flash-attention qwen speculative-decoding rtx-4090 multi-token-prediction turboquant tbq4 tensor-sharing
Updated May 9, 2026 - C++
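One plausible accounting for the headline 4.25 bpv figure (an assumption for illustration; the actual TBQ4 block layout is not spelled out here): 4-bit quantized codes plus one fp16 scale shared per 64-value block gives 4 + 16/64 = 4.25 bits per value. Other splits, such as an 8-bit scale per 32-value block, land on the same total.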