[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA) by alecco · Pull Request #145 · Dao-AILab/quack

alecco · 2026-05-25T11:30:17Z

This PR requires NVIDIA/cutlass#3273.

CUDA example "79a_blackwell_geforce_nvfp4_bf16_gemm" ~289 TFLOP/s.

$ SM120_NVFP4_PATH=fast WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh
279.6 TFLOP/s

Summary

This PR adds a narrow SM120 NVFP4 blockscaled GEMM path to QuACK.

It follows the existing blockscaled compile/test facade where possible, but the kernel implementation is SM120-specific. SM120 does not use the SM100 tcgen05 / TMEM path, so this PR builds the SM120-native path around:

packed K-major Float4E2M1FN A/B storage
compact 1D interleaved Float8E4M3FN scale storage
native A/B TMA
native FP8 scale TMA
bundled MXF4/NVFP4 warp MMA
split ping-pong TMA producer paths
direct global BF16 stores in the default validated path

This PR depends on the corresponding CUTLASS CuTe DSL SM120 MXF4/NVFP4 native-TMA support.

Supported configuration

The public SM120 blockscaled path is intentionally narrow:

device: SM120
A/B dtype: Float4E2M1FN
scale dtype: Float8E4M3FN
scale vector size: 16
output dtype: BFloat16
A/B layout: K-major packed FP4
D layout: N-major BF16
scale storage: compact 1D interleaved FP8
tile shape: 128x128x128
cluster shape: 1x1
no varlen M/K

Unsupported SM120 blockscaled configurations fail early instead of falling through to the SM100 path.

Host storage and facade

Adds SM120 NVFP4 host-side helpers for:

validating packed FP4 A/B tensors
validating compact interleaved FP8 scale storage
validating BF16 N-major output tensors
creating packed FP4 A/B test tensors
creating compact interleaved scale tensors
creating TensorFill-like nonzero FP4/scale test inputs

The compile path routes the supported SM120 NVFP4 configuration through compile_blockscaled_gemm_tvm_ffi(...) with early validation of A, B, D, SFA, and SFB.

The public scale validator intentionally rejects the older rank-4 physical scale tensor form, so callers cannot pass storage that the kernel would reinterpret as compact interleaved scale storage.

Kernel path

Adds the SM120 NVFP4 kernel path in GemmSm120.

The default validated path uses:

native A/B TMA
native FP8 scale TMA
compact interleaved scale storage
split ping-pong TMA producer paths
bundled MXF4/NVFP4 warp MMA
static scheduling
direct global BF16 stores

The implementation keeps the SM120 path separate from the SM100 blockscaled path. In particular, it does not use SM100 tcgen05, TMEM, cluster shared-memory multicast, or tensor-map proxy-fence assumptions.

A local PipelineTmaWarpMma shim is used directly by the SM120 path rather than monkey-patching cutlass.pipeline.

Path policy

Adds an explicit SM120 NVFP4 path policy:

validated: default conservative path using static scheduling and direct global BF16 stores
fast: opt-in benchmark path using the CLC/full-grid scheduler and delayed-TMA epilogue path

The validated path is the default public path. The fast path is exposed so it can be benchmarked without editing source.

Why direct global stores by default

The default path keeps correctness and mainloop validation conservative:

static scheduling
split ping-pong TMA
direct BF16 global stores

This gives a stable baseline for landing the SM120 NVFP4 implementation first. The faster CLC / delayed-TMA epilogue path is available through sm120_nvfp4_path=fast, but it is kept opt-in rather than being the default.

Tests / coverage

Adds focused SM120 NVFP4 coverage for:

supported/unsupported configuration validation
compact interleaved scale storage validation
rejection of legacy rank-4 scale storage
existing dense SM120 ping-pong constructor regression coverage
single-CTA correctness
K64 scale split behavior
K384 scale-page crossing
multi-tile nonzero scale patterns
TensorFill-like 6x6-tile correctness
PTX signature checks
validated/fast path policy selection

The PTX checks verify the SM120-native path:

native A/B 3D TMA
native scale 2D TMA
MXF4/NVFP4 MMA instructions
direct global stores
no tcgen05
no cluster shared-memory path
no multicast
no tensor-map proxy fence

Benchmark

Adds an SM120 NVFP4 benchmark entry point and script.

The benchmark path supports TensorFill-like bounded nonzero FP4/scales and the older all-ones setup. The TensorFill-like path is the preferred default because it is better at catching scale-layout and row/column mapping issues.

The benchmark also accepts --sm120_nvfp4_path {validated,fast}. The run script forwards SM120_NVFP4_PATH, defaulting to validated.

Notes

This PR is intentionally scoped to the known-good SM120 NVFP4 case. Broader SM120 blockscaled shapes, clusters, varlen support, and additional epilogue/scheduler variants can be layered on top.

Add host-side validation and storage helpers for the narrow SM120 NVFP4 blockscaled GEMM contract: Float4E2M1FN A/B packed K-major operands, compact 1D interleaved Float8E4M3FN scale storage, and BFloat16 N-major output. Route the supported SM120 NVFP4 configuration through compile_blockscaled_gemm_tvm_ffi with early validation for A, B, D, SFA, and SFB. The compile path accepts GPU_ARCH when explicitly set and otherwise follows CUTE_DSL_ARCH, matching the benchmark/test environment convention. The public scale validator intentionally rejects the older rank-4 physical scale tensor form so callers cannot pass storage that the kernel would reinterpret as compact interleaved scales.

Add the SM120 NVFP4 blockscaled GEMM implementation around native A/B TMA, native FP8 scale TMA, bundled MXF4/NVFP4 warp MMA, compact interleaved scale storage, and direct global BFloat16 stores. Keep the SM120 path separate from SM100 tcgen05/TMEM assumptions: the helper layer builds the Blackwell GeForce native TMA/MMA path, rejects non-1x1 clusters, and uses a local PipelineTmaWarpMma shim directly instead of mutating cutlass.pipeline at import time. Keep the large NVFP4 implementation helper private as quack._sm120_nvfp4_utils and leave quack.sm120_utils as a narrow public facade with only stable TX-byte inspection helpers. GemmSm120 imports the private implementation directly so low-level scheduling, TMA, epilogue, and fragment helpers are not advertised as public QuACK API. Scope the NVFP4 pingpong pipeline guard to blockscaled kernels so the existing dense SM120 pingpong constructor remains valid. Also make the compact interleaved scale layout helper reject non-divisible logical K directly before deriving scale tiles. The default validated path keeps split ping-pong tiles and direct stores. Faster CLC/delayed TMA store variants were investigated on the experimental branch but are not part of this clean path because they failed larger-grid validation.

Add correctness, validation, and PTX coverage for the SM120 NVFP4 blockscaled GEMM path. The tests cover the narrow public config gate, compact 1D interleaved scale storage, rejection of legacy rank-4 physical scale tensors, K64 scale splitting, K384 page crossing, multi-tile nonzero scale mapping, TensorFill-like 6x6 tile data, and compact native TMA/PTX instruction checks. Add a dense SM120 pingpong constructor regression to prove the NVFP4-specific pingpong pipeline guard does not break the existing non-blockscaled path, and keep facade validation focused on the three stable sm120_utils TX-byte helpers. Extend the blockscaled benchmark entry point and add a convenience script for the SM120 NVFP4 benchmark configuration. The benchmark path raises deterministic RuntimeError for unsupported architectures instead of relying on assert. Focused validation before rewriting: CUTE_DSL_LIBS=/home/agent/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache CUTE_DSL_ARCH=sm_120a python -m pytest -q -s tests/test_gemm_sm120_nvfp4_validation.py tests/test_gemm_sm120_nvfp4_ptx.py tests/test_gemm_sm120_nvfp4_correctness.py -> 10 passed. Experimental branch benchmark notes for the final interleaved-scale path reported 4096^3 TensorFill-like data at 0.645 ms / 213.1 TFLOP/s; faster CLC/delayed-TMA epilogue variants were left out because they did not pass larger-grid validation.

Add an explicit sm120_nvfp4_path policy for the SM120 NVFP4 benchmark and compile path. The default validated policy keeps the conservative static-scheduler/direct-store path, while the fast policy selects the CLC/full-grid scheduler with the delayed TMA epilogue path so it can be benchmarked without editing source. The run_sm120_nvfp4_bench.sh script now forwards SM120_NVFP4_PATH, and benchmark_gemm.py also accepts --sm120_nvfp4_path {validated,fast}. Add a focused validation test proving the two policies select the intended scheduler and epilogue switches. Validation: python -m py_compile quack/gemm_sm120.py quack/blockscaled_gemm_utils.py benchmarks/benchmark_gemm.py tests/test_gemm_sm120_nvfp4_validation.py; python -m ruff check quack/gemm_sm120.py quack/blockscaled_gemm_utils.py benchmarks/benchmark_gemm.py tests/test_gemm_sm120_nvfp4_validation.py; bash -n scripts/run_sm120_nvfp4_bench.sh; CUTE_DSL_LIBS=/home/agent/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache CUTE_DSL_ARCH=sm_120a python -m pytest -q tests/test_gemm_sm120_nvfp4_validation.py -> 5 passed. Benchmark smoke: SM120_NVFP4_PATH=fast WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh -> 0.498 ms, 276.2 TFLOP/s, PASS; SM120_NVFP4_PATH=fast ./scripts/run_sm120_nvfp4_bench.sh -> 0.498 ms, 275.9 TFLOP/s, PASS; WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh -> 0.659 ms, 208.4 TFLOP/s, PASS.

Add a focused SM120 NVFP4 blockscaled GEMM suite covering the public compile/run contract for the (128,128,128) path. The tests exercise TensorFill-like packed FP4 inputs, compact interleaved FP8 scale storage, BF16 N-major output, and both sm120_nvfp4_path=validated and sm120_nvfp4_path=fast. They also distinguish validated direct global stores from the fast delayed TMA epilogue in PTX, and explicitly reject unsupported tilers, clusters, dtypes, varlen, and legacy rank-4 scale storage. Validation run: CUTE_DSL_ARCH=sm_120a CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache python -m pytest -q tests/test_gemm_sm120_nvfp4_blockscaled.py CUTE_DSL_ARCH=sm_120a CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache python -m pytest -q tests/test_gemm_sm120_nvfp4_correctness.py tests/test_gemm_sm120_nvfp4_validation.py tests/test_gemm_sm120_nvfp4_ptx.py python -m ruff check tests/test_gemm_sm120_nvfp4_blockscaled.py

agent added 2 commits May 24, 2026 21:30

alecco had a problem deploying to gpu-ci May 25, 2026 11:30 — with GitHub Actions Error

alecco mentioned this pull request May 25, 2026

[CuTeDSL] Add SM120 MXF4/NVFP4 native-TMA path NVIDIA/cutlass#3273

Open

alecco changed the title ~~[SM120] Add NVFP4 blockscaled GEMM path (~95% of equivalent CUDA)~~ [SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA) May 25, 2026

alecco had a problem deploying to gpu-ci May 25, 2026 15:27 — with GitHub Actions Error

alecco requested a deployment to gpu-ci May 25, 2026 15:27 — with GitHub Actions Waiting

agent added 3 commits May 25, 2026 17:35

alecco force-pushed the sm120-nvfp4 branch from 2a9f7e2 to 43d38d0 Compare May 25, 2026 15:37

alecco requested a deployment to gpu-ci May 25, 2026 15:37 — with GitHub Actions Waiting

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA)#145

[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA)#145
alecco wants to merge 5 commits into
Dao-AILab:mainfrom
alecco:sm120-nvfp4

alecco commented May 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alecco commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Supported configuration

Host storage and facade

Kernel path

Path policy

Why direct global stores by default

Tests / coverage

Benchmark

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

alecco commented May 25, 2026 •

edited

Loading