[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA) #145
Open
alecco wants to merge 5 commits into
added 2 commits
May 24, 2026 21:30
Add host-side validation and storage helpers for the narrow SM120 NVFP4 blockscaled GEMM contract: Float4E2M1FN A/B packed K-major operands, compact 1D interleaved Float8E4M3FN scale storage, and BFloat16 N-major output. Route the supported SM120 NVFP4 configuration through compile_blockscaled_gemm_tvm_ffi with early validation for A, B, D, SFA, and SFB. The compile path accepts GPU_ARCH when explicitly set and otherwise follows CUTE_DSL_ARCH, matching the benchmark/test environment convention. The public scale validator intentionally rejects the older rank-4 physical scale tensor form so callers cannot pass storage that the kernel would reinterpret as compact interleaved scales.
Add the SM120 NVFP4 blockscaled GEMM implementation around native A/B TMA, native FP8 scale TMA, bundled MXF4/NVFP4 warp MMA, compact interleaved scale storage, and direct global BFloat16 stores.
Keep the SM120 path separate from SM100 tcgen05/TMEM assumptions: the helper layer builds the Blackwell GeForce native TMA/MMA path, rejects non-1x1 clusters, and uses a local PipelineTmaWarpMma shim directly instead of mutating cutlass.pipeline at import time.
Keep the large NVFP4 implementation helper private as quack._sm120_nvfp4_utils and leave quack.sm120_utils as a narrow public facade with only stable TX-byte inspection helpers. GemmSm120 imports the private implementation directly so low-level scheduling, TMA, epilogue, and fragment helpers are not advertised as public QuACK API.
Scope the NVFP4 pingpong pipeline guard to blockscaled kernels so the existing dense SM120 pingpong constructor remains valid. Also make the compact interleaved scale layout helper reject non-divisible logical K directly before deriving scale tiles.
The default validated path keeps split ping-pong tiles and direct stores. Faster CLC/delayed TMA store variants were investigated on the experimental branch but are not part of this clean path because they failed larger-grid validation.
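To make the "TX-byte" idea concrete, here is a rough sketch of how many bytes one pipeline stage would move for a single (128,128,128) tile, assuming the stage loads exactly one A tile, one B tile, and their scale tiles with no padding. The function names are hypothetical and are not the helpers exposed by quack.sm120_utils; the real kernel may split or pad these transfers differently.

```python
# Storage width in bits for the two packed types named in this PR.
BITS_PER_ELEMENT = {"Float4E2M1FN": 4, "Float8E4M3FN": 8}


def tile_bytes(rows: int, cols: int, dtype: str) -> int:
    """Bytes needed to store a rows x cols tile of `dtype`, tightly packed."""
    bits = rows * cols * BITS_PER_ELEMENT[dtype]
    if bits % 8 != 0:
        raise ValueError("tile does not pack into a whole number of bytes")
    return bits // 8


def stage_tx_bytes(tile_m: int, tile_n: int, tile_k: int, sf_vec_size: int = 16) -> dict:
    """Hypothetical per-stage byte count: A + B + SFA + SFB (assumes no padding)."""
    if tile_k % sf_vec_size != 0:
        raise ValueError("tile_k must be divisible by the scale vector size")
    k_scales = tile_k // sf_vec_size  # one FP8 scale per sf_vec_size FP4 values
    sizes = {
        "a": tile_bytes(tile_m, tile_k, "Float4E2M1FN"),
        "b": tile_bytes(tile_n, tile_k, "Float4E2M1FN"),
        "sfa": tile_bytes(tile_m, k_scales, "Float8E4M3FN"),
        "sfb": tile_bytes(tile_n, k_scales, "Float8E4M3FN"),
    }
    sizes["total"] = sum(sizes.values())
    return sizes


if __name__ == "__main__":
    print(stage_tx_bytes(128, 128, 128))
```

Under these assumptions the scale tiles are one eighth the size of the operand tiles, which is why their TMA transfers are counted separately from the A/B transfers.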
added 3 commits
May 25, 2026 17:35
Add correctness, validation, and PTX coverage for the SM120 NVFP4 blockscaled GEMM path. The tests cover the narrow public config gate, compact 1D interleaved scale storage, rejection of legacy rank-4 physical scale tensors, K64 scale splitting, K384 page crossing, multi-tile nonzero scale mapping, TensorFill-like 6x6 tile data, and compact native TMA/PTX instruction checks.
Add a dense SM120 pingpong constructor regression to prove the NVFP4-specific pingpong pipeline guard does not break the existing non-blockscaled path, and keep facade validation focused on the three stable sm120_utils TX-byte helpers.
Extend the blockscaled benchmark entry point and add a convenience script for the SM120 NVFP4 benchmark configuration. The benchmark path raises deterministic RuntimeError for unsupported architectures instead of relying on assert.
Focused validation before rewriting:
CUTE_DSL_LIBS=/home/agent/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache CUTE_DSL_ARCH=sm_120a python -m pytest -q -s tests/test_gemm_sm120_nvfp4_validation.py tests/test_gemm_sm120_nvfp4_ptx.py tests/test_gemm_sm120_nvfp4_correctness.py
-> 10 passed.
Experimental branch benchmark notes for the final interleaved-scale path reported 4096^3 TensorFill-like data at 0.645 ms / 213.1 TFLOP/s; faster CLC/delayed-TMA epilogue variants were left out because they did not pass larger-grid validation.
Add an explicit sm120_nvfp4_path policy for the SM120 NVFP4 benchmark and compile path. The default validated policy keeps the conservative static-scheduler/direct-store path, while the fast policy selects the CLC/full-grid scheduler with the delayed TMA epilogue path so it can be benchmarked without editing source.
The run_sm120_nvfp4_bench.sh script now forwards SM120_NVFP4_PATH, and benchmark_gemm.py also accepts --sm120_nvfp4_path {validated,fast}. Add a focused validation test proving the two policies select the intended scheduler and epilogue switches.
Validation:
python -m py_compile quack/gemm_sm120.py quack/blockscaled_gemm_utils.py benchmarks/benchmark_gemm.py tests/test_gemm_sm120_nvfp4_validation.py
python -m ruff check quack/gemm_sm120.py quack/blockscaled_gemm_utils.py benchmarks/benchmark_gemm.py tests/test_gemm_sm120_nvfp4_validation.py
bash -n scripts/run_sm120_nvfp4_bench.sh
CUTE_DSL_LIBS=/home/agent/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache CUTE_DSL_ARCH=sm_120a python -m pytest -q tests/test_gemm_sm120_nvfp4_validation.py
-> 5 passed.
Benchmark smoke:
SM120_NVFP4_PATH=fast WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh -> 0.498 ms, 276.2 TFLOP/s, PASS
SM120_NVFP4_PATH=fast ./scripts/run_sm120_nvfp4_bench.sh -> 0.498 ms, 275.9 TFLOP/s, PASS
WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh -> 0.659 ms, 208.4 TFLOP/s, PASS
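The throughput figures in these notes can be cross-checked with the usual GEMM FLOP count of 2*M*N*K. The sketch below assumes the smoke runs use the same 4096^3 problem as the experimental-branch notes (the smoke lines themselves do not state the size); gemm_tflops is a hypothetical helper, not part of benchmark_gemm.py.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """TFLOP/s for an M x N x K GEMM, counting one multiply and one add per MAC."""
    flops = 2 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Timings copied from the commit notes; the problem size is an assumption.
    for label, ms in [("experimental", 0.645), ("fast", 0.498), ("validated", 0.659)]:
        print(f"{label:>12}: {ms:.3f} ms -> {gemm_tflops(4096, 4096, 4096, ms):.1f} TFLOP/s")
```

Because the reported timings are rounded to three decimal places, the recomputed values only agree with the reported TFLOP/s to within a few tenths.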
Add a focused SM120 NVFP4 blockscaled GEMM suite covering the public compile/run contract for the (128,128,128) path. The tests exercise TensorFill-like packed FP4 inputs, compact interleaved FP8 scale storage, BF16 N-major output, and both sm120_nvfp4_path=validated and sm120_nvfp4_path=fast. They also distinguish validated direct global stores from the fast delayed TMA epilogue in PTX, and explicitly reject unsupported tilers, clusters, dtypes, varlen, and legacy rank-4 scale storage.
Validation run:
CUTE_DSL_ARCH=sm_120a CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache python -m pytest -q tests/test_gemm_sm120_nvfp4_blockscaled.py
CUTE_DSL_ARCH=sm_120a CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache python -m pytest -q tests/test_gemm_sm120_nvfp4_correctness.py tests/test_gemm_sm120_nvfp4_validation.py tests/test_gemm_sm120_nvfp4_ptx.py
python -m ruff check tests/test_gemm_sm120_nvfp4_blockscaled.py
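One way a PTX-level test could tell the two epilogues apart is by scanning the generated PTX text for store instructions. The sketch below is an illustration only: the helper names are hypothetical, the PTX snippets in the usage are synthetic, and the actual tests in this PR may match different patterns. It relies on the PTX convention that a TMA tensor store is a cp.async.bulk.tensor instruction with a global destination and a shared::cta source.

```python
import re

# "st" followed by optional qualifiers and then ".global." (e.g. st.global.v2.b32).
_DIRECT_GLOBAL_STORE = re.compile(r"(?:^|\s)st\.(?:\w+\.)*global\.")


def _code_lines(ptx: str) -> list:
    """Stripped PTX lines with blank lines and // comments dropped."""
    lines = (line.strip() for line in ptx.splitlines())
    return [line for line in lines if line and not line.startswith("//")]


def uses_tcgen05(ptx: str) -> bool:
    """True if any instruction from the SM100 tcgen05 family appears."""
    return any("tcgen05." in line for line in _code_lines(ptx))


def has_tma_store(ptx: str) -> bool:
    """True if a bulk tensor copy from shared memory to global memory appears."""
    return any(
        "cp.async.bulk.tensor" in line and ".global.shared::cta" in line
        for line in _code_lines(ptx)
    )


def has_direct_global_store(ptx: str) -> bool:
    """True if a plain st.global instruction appears."""
    return any(_DIRECT_GLOBAL_STORE.search(line) for line in _code_lines(ptx))


def classify_epilogue(ptx: str) -> str:
    """Label the epilogue style; a TMA store takes priority over plain stores."""
    if has_tma_store(ptx):
        return "delayed_tma_store"
    if has_direct_global_store(ptx):
        return "direct_global_store"
    return "unknown"
```

A test built on this would assert that uses_tcgen05 is false for both policies, that the validated policy classifies as direct_global_store, and that the fast policy classifies as delayed_tma_store.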
This PR requires NVIDIA/cutlass#3273.
Reference baseline: the CUDA example "79a_blackwell_geforce_nvfp4_bf16_gemm" runs at ~289 TFLOP/s.
Summary
This PR adds a narrow SM120 NVFP4 blockscaled GEMM path to QuACK.
It follows the existing blockscaled compile/test facade where possible, but the kernel implementation is SM120-specific. SM120 does not use the SM100 tcgen05/TMEM path, so this PR builds the SM120-native path around:
- Float4E2M1FN A/B storage
- Float8E4M3FN scale storage
This PR depends on the corresponding CUTLASS CuTe DSL SM120 MXF4/NVFP4 native-TMA support.
Supported configuration
The public SM120 blockscaled path is intentionally narrow:
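- A/B dtype: Float4E2M1FN
- Scale dtype: Float8E4M3FN
- Scale vector size: 16
- Output dtype: BFloat16
- Tile shape: 128x128x128
- Cluster shape: 1x1
Unsupported SM120 blockscaled configurations fail early instead of falling through to the SM100 path.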
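A narrow config gate of this kind can be sketched as a single comparison against the one supported configuration. The class and function names below are hypothetical, dtypes are modeled as plain strings rather than CuTe DSL types, and reading the value 16 as the scale vector size is an assumption; this is not the validator added by the PR.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BlockscaledConfig:
    """Hypothetical description of a blockscaled GEMM configuration."""
    ab_dtype: str
    sf_dtype: str
    sf_vec_size: int
    d_dtype: str
    tile_shape_mnk: tuple
    cluster_shape_mn: tuple


# The single configuration the SM120 NVFP4 path accepts in this sketch.
SUPPORTED_SM120_NVFP4 = BlockscaledConfig(
    ab_dtype="Float4E2M1FN",
    sf_dtype="Float8E4M3FN",
    sf_vec_size=16,
    d_dtype="BFloat16",
    tile_shape_mnk=(128, 128, 128),
    cluster_shape_mn=(1, 1),
)


def validate_sm120_nvfp4_config(cfg: BlockscaledConfig) -> None:
    """Raise ValueError naming every field that differs from the supported config."""
    mismatches = [
        f"{f.name}={getattr(cfg, f.name)!r} (supported: {getattr(SUPPORTED_SM120_NVFP4, f.name)!r})"
        for f in fields(cfg)
        if getattr(cfg, f.name) != getattr(SUPPORTED_SM120_NVFP4, f.name)
    ]
    if mismatches:
        raise ValueError("unsupported SM120 NVFP4 config: " + "; ".join(mismatches))
```

Raising before any kernel is built is what keeps an unsupported request from silently taking the SM100 route.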
Host storage and facade
Adds SM120 NVFP4 host-side helpers for:
The compile path routes the supported SM120 NVFP4 configuration through compile_blockscaled_gemm_tvm_ffi(...) with early validation of A, B, D, SFA, and SFB.
The public scale validator intentionally rejects the older rank-4 physical scale tensor form, so callers cannot pass storage that the kernel would reinterpret as compact interleaved scale storage.
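The scale-storage checks described above could look roughly like the following. This is a sketch under stated assumptions: one scale per 16 elements along K, a rank-1 buffer that must hold at least rows * (K / 16) scales, and no modeling of the interleaving or any padding the real layout may require. Both function names are hypothetical.

```python
def logical_scale_count(rows: int, k: int, sf_vec_size: int = 16) -> int:
    """Number of logical scales for a rows x k operand (assumes no padding)."""
    if k % sf_vec_size != 0:
        raise ValueError(f"logical K={k} is not divisible by sf_vec_size={sf_vec_size}")
    return rows * (k // sf_vec_size)


def validate_compact_scale_shape(shape: tuple, rows: int, k: int, sf_vec_size: int = 16) -> None:
    """Accept only rank-1 scale storage large enough for the logical scales."""
    if len(shape) != 1:
        # Covers the legacy rank-4 physical form, which the kernel would misread.
        raise ValueError(f"expected compact rank-1 scale storage, got rank {len(shape)}")
    needed = logical_scale_count(rows, k, sf_vec_size)
    if shape[0] < needed:
        raise ValueError(f"scale storage holds {shape[0]} elements, need at least {needed}")
```

Checking rank before size means a legacy rank-4 tensor is rejected with a message about its form, even if its total element count happens to match.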
Kernel path
Adds the SM120 NVFP4 kernel path in GemmSm120.
The default validated path uses:
The implementation keeps the SM120 path separate from the SM100 blockscaled path. In particular, it does not use SM100 tcgen05, TMEM, cluster shared-memory multicast, or tensor-map proxy-fence assumptions.
A local PipelineTmaWarpMma shim is used directly by the SM120 path rather than monkey-patching cutlass.pipeline.
Path policy
Adds an explicit SM120 NVFP4 path policy:
- validated: default conservative path using static scheduling and direct global BF16 stores
- fast: opt-in benchmark path using the CLC/full-grid scheduler and delayed-TMA epilogue path
The validated path is the default public path. The fast path is exposed so it can be benchmarked without editing source.
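The policy can be pictured as a small lookup from policy name to a pair of kernel switches, plus a command-line flag that selects the name. The sketch below is illustrative only: the switch names, the Sm120Nvfp4Switches class, and the choice to read SM120_NVFP4_PATH as the flag's default inside Python are assumptions, not the code in quack/gemm_sm120.py or benchmark_gemm.py.

```python
import argparse
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Sm120Nvfp4Switches:
    """Hypothetical pair of switches selected by a path policy."""
    scheduler: str  # "static" or "clc_full_grid"
    epilogue: str   # "direct_global_store" or "delayed_tma_store"


POLICIES = {
    "validated": Sm120Nvfp4Switches("static", "direct_global_store"),
    "fast": Sm120Nvfp4Switches("clc_full_grid", "delayed_tma_store"),
}


def resolve_policy(name: str = "validated") -> Sm120Nvfp4Switches:
    """Map a policy name to its switches, rejecting unknown names."""
    try:
        return POLICIES[name]
    except KeyError:
        raise ValueError(f"unknown sm120_nvfp4_path {name!r}; expected one of {sorted(POLICIES)}") from None


def parse_args(argv=None, env=None) -> argparse.Namespace:
    """Parse --sm120_nvfp4_path, defaulting to SM120_NVFP4_PATH or 'validated'."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sm120_nvfp4_path",
        choices=sorted(POLICIES),
        default=env.get("SM120_NVFP4_PATH", "validated"),
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # resolve_policy also validates a default that came from the environment.
    print(args.sm120_nvfp4_path, resolve_policy(args.sm120_nvfp4_path))
```

Keeping the mapping in one table means a test can assert the scheduler and epilogue switches for each policy without compiling a kernel.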
Why direct global stores by default
The default path keeps correctness and mainloop validation conservative:
This gives a stable baseline for landing the SM120 NVFP4 implementation first. The faster CLC / delayed-TMA epilogue path is available through sm120_nvfp4_path=fast, but it is kept opt-in rather than being the default.
Tests / coverage
Adds focused SM120 NVFP4 coverage for:
The PTX checks verify the SM120-native path:
- tcgen05
Benchmark
Adds an SM120 NVFP4 benchmark entry point and script.
The benchmark path supports TensorFill-like bounded nonzero FP4/scales and the older all-ones setup. The TensorFill-like path is the preferred default because it is better at catching scale-layout and row/column mapping issues.
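A toy model shows why all-ones data can hide a scale-layout bug. The sketch below computes a block-scaled dot product in plain Python with one scale per 16 elements, then simulates a layout bug by swapping two scales. The function names and data are illustrative, not taken from the benchmark.

```python
SF_VEC_SIZE = 16  # elements covered by one scale


def blockscaled_dot(a, sfa, b, sfb):
    """Dot product where element i is scaled by the scale of block i // 16."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        block = i // SF_VEC_SIZE
        total += (x * sfa[block]) * (y * sfb[block])
    return total


def swap_first_two(scales):
    """Simulate a scale-layout bug: the first two blocks read each other's scale."""
    out = list(scales)
    out[0], out[1] = out[1], out[0]
    return out


# All-ones setup: every value and scale is 1.0, so any permutation is invisible.
ones_a = [1.0] * 32
ones_b = [1.0] * 32
ones_sf = [1.0, 1.0]

# Bounded nonzero setup: FP4-representable values and distinct power-of-two scales.
a = [1.0] * 16 + [2.0] * 16
b = [1.0] * 32
sfa = [0.5, 2.0]
sfb = [1.0, 1.0]

if __name__ == "__main__":
    print("all-ones:", blockscaled_dot(ones_a, ones_sf, ones_b, ones_sf),
          blockscaled_dot(ones_a, swap_first_two(ones_sf), ones_b, ones_sf))
    print("nonuniform:", blockscaled_dot(a, sfa, b, sfb),
          blockscaled_dot(a, swap_first_two(sfa), b, sfb))
```

With all-ones data the correct and swapped results are identical, so the bug passes; with distinct scales the two results differ and a reference comparison would fail.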
The benchmark also accepts --sm120_nvfp4_path {validated,fast}. The run script forwards SM120_NVFP4_PATH, defaulting to validated.
Notes
This PR is intentionally scoped to the known-good SM120 NVFP4 case. Broader SM120 blockscaled shapes, clusters, varlen support, and additional epilogue/scheduler variants can be layered on top.