Skip to content

[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA)#145

Open
alecco wants to merge 5 commits into
Dao-AILab:mainfrom
alecco:sm120-nvfp4
Open

[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA)#145
alecco wants to merge 5 commits into
Dao-AILab:mainfrom
alecco:sm120-nvfp4

Conversation

@alecco
Copy link
Copy Markdown
Contributor

@alecco alecco commented May 25, 2026

This PR requires NVIDIA/cutlass#3273.

CUDA example "79a_blackwell_geforce_nvfp4_bf16_gemm" ~289 TFLOP/s.

$ SM120_NVFP4_PATH=fast WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh
279.6 TFLOP/s

Summary

This PR adds a narrow SM120 NVFP4 blockscaled GEMM path to QuACK.

It follows the existing blockscaled compile/test facade where possible, but the kernel implementation is SM120-specific. SM120 does not use the SM100 tcgen05 / TMEM path, so this PR builds the SM120-native path around:

  • packed K-major Float4E2M1FN A/B storage
  • compact 1D interleaved Float8E4M3FN scale storage
  • native A/B TMA
  • native FP8 scale TMA
  • bundled MXF4/NVFP4 warp MMA
  • split ping-pong TMA producer paths
  • direct global BF16 stores in the default validated path

This PR depends on the corresponding CUTLASS CuTe DSL SM120 MXF4/NVFP4 native-TMA support.

Supported configuration

The public SM120 blockscaled path is intentionally narrow:

  • device: SM120
  • A/B dtype: Float4E2M1FN
  • scale dtype: Float8E4M3FN
  • scale vector size: 16
  • output dtype: BFloat16
  • A/B layout: K-major packed FP4
  • D layout: N-major BF16
  • scale storage: compact 1D interleaved FP8
  • tile shape: 128x128x128
  • cluster shape: 1x1
  • no varlen M/K

Unsupported SM120 blockscaled configurations fail early instead of falling through to the SM100 path.

Host storage and facade

Adds SM120 NVFP4 host-side helpers for:

  • validating packed FP4 A/B tensors
  • validating compact interleaved FP8 scale storage
  • validating BF16 N-major output tensors
  • creating packed FP4 A/B test tensors
  • creating compact interleaved scale tensors
  • creating TensorFill-like nonzero FP4/scale test inputs

The compile path routes the supported SM120 NVFP4 configuration through compile_blockscaled_gemm_tvm_ffi(...) with early validation of A, B, D, SFA, and SFB.

The public scale validator intentionally rejects the older rank-4 physical scale tensor form, so callers cannot pass storage that the kernel would reinterpret as compact interleaved scale storage.

Kernel path

Adds the SM120 NVFP4 kernel path in GemmSm120.

The default validated path uses:

  • native A/B TMA
  • native FP8 scale TMA
  • compact interleaved scale storage
  • split ping-pong TMA producer paths
  • bundled MXF4/NVFP4 warp MMA
  • static scheduling
  • direct global BF16 stores

The implementation keeps the SM120 path separate from the SM100 blockscaled path. In particular, it does not use SM100 tcgen05, TMEM, cluster shared-memory multicast, or tensor-map proxy-fence assumptions.

A local PipelineTmaWarpMma shim is used directly by the SM120 path rather than monkey-patching cutlass.pipeline.

Path policy

Adds an explicit SM120 NVFP4 path policy:

  • validated: default conservative path using static scheduling and direct global BF16 stores
  • fast: opt-in benchmark path using the CLC/full-grid scheduler and delayed-TMA epilogue path

The validated path is the default public path. The fast path is exposed so it can be benchmarked without editing source.

Why direct global stores by default

The default path keeps correctness and mainloop validation conservative:

  • static scheduling
  • split ping-pong TMA
  • direct BF16 global stores

This gives a stable baseline for landing the SM120 NVFP4 implementation first. The faster CLC / delayed-TMA epilogue path is available through sm120_nvfp4_path=fast, but it is kept opt-in rather than being the default.

Tests / coverage

Adds focused SM120 NVFP4 coverage for:

  • supported/unsupported configuration validation
  • compact interleaved scale storage validation
  • rejection of legacy rank-4 scale storage
  • existing dense SM120 ping-pong constructor regression coverage
  • single-CTA correctness
  • K64 scale split behavior
  • K384 scale-page crossing
  • multi-tile nonzero scale patterns
  • TensorFill-like 6x6-tile correctness
  • PTX signature checks
  • validated/fast path policy selection

The PTX checks verify the SM120-native path:

  • native A/B 3D TMA
  • native scale 2D TMA
  • MXF4/NVFP4 MMA instructions
  • direct global stores
  • no tcgen05
  • no cluster shared-memory path
  • no multicast
  • no tensor-map proxy fence

Benchmark

Adds an SM120 NVFP4 benchmark entry point and script.

The benchmark path supports TensorFill-like bounded nonzero FP4/scales and the older all-ones setup. The TensorFill-like path is the preferred default because it is better at catching scale-layout and row/column mapping issues.

The benchmark also accepts --sm120_nvfp4_path {validated,fast}. The run script forwards SM120_NVFP4_PATH, defaulting to validated.

Notes

This PR is intentionally scoped to the known-good SM120 NVFP4 case. Broader SM120 blockscaled shapes, clusters, varlen support, and additional epilogue/scheduler variants can be layered on top.

agent added 2 commits May 24, 2026 21:30
Add host-side validation and storage helpers for the narrow SM120 NVFP4 blockscaled GEMM contract: Float4E2M1FN A/B packed K-major operands, compact 1D interleaved Float8E4M3FN scale storage, and BFloat16 N-major output.

Route the supported SM120 NVFP4 configuration through compile_blockscaled_gemm_tvm_ffi with early validation for A, B, D, SFA, and SFB. The compile path accepts GPU_ARCH when explicitly set and otherwise follows CUTE_DSL_ARCH, matching the benchmark/test environment convention.

The public scale validator intentionally rejects the older rank-4 physical scale tensor form so callers cannot pass storage that the kernel would reinterpret as compact interleaved scales.
Add the SM120 NVFP4 blockscaled GEMM implementation around native A/B TMA, native FP8 scale TMA, bundled MXF4/NVFP4 warp MMA, compact interleaved scale storage, and direct global BFloat16 stores.

Keep the SM120 path separate from SM100 tcgen05/TMEM assumptions: the helper layer builds the Blackwell GeForce native TMA/MMA path, rejects non-1x1 clusters, and uses a local PipelineTmaWarpMma shim directly instead of mutating cutlass.pipeline at import time.

Keep the large NVFP4 implementation helper private as quack._sm120_nvfp4_utils and leave quack.sm120_utils as a narrow public facade with only stable TX-byte inspection helpers. GemmSm120 imports the private implementation directly so low-level scheduling, TMA, epilogue, and fragment helpers are not advertised as public QuACK API.

Scope the NVFP4 pingpong pipeline guard to blockscaled kernels so the existing dense SM120 pingpong constructor remains valid. Also make the compact interleaved scale layout helper reject non-divisible logical K directly before deriving scale tiles.

The default validated path keeps split ping-pong tiles and direct stores. Faster CLC/delayed TMA store variants were investigated on the experimental branch but are not part of this clean path because they failed larger-grid validation.
agent added 3 commits May 25, 2026 17:35
Add correctness, validation, and PTX coverage for the SM120 NVFP4 blockscaled GEMM path. The tests cover the narrow public config gate, compact 1D interleaved scale storage, rejection of legacy rank-4 physical scale tensors, K64 scale splitting, K384 page crossing, multi-tile nonzero scale mapping, TensorFill-like 6x6 tile data, and compact native TMA/PTX instruction checks.

Add a dense SM120 pingpong constructor regression to prove the NVFP4-specific pingpong pipeline guard does not break the existing non-blockscaled path, and keep facade validation focused on the three stable sm120_utils TX-byte helpers.

Extend the blockscaled benchmark entry point and add a convenience script for the SM120 NVFP4 benchmark configuration. The benchmark path raises deterministic RuntimeError for unsupported architectures instead of relying on assert.

Focused validation before rewriting: CUTE_DSL_LIBS=/home/agent/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache CUTE_DSL_ARCH=sm_120a python -m pytest -q -s tests/test_gemm_sm120_nvfp4_validation.py tests/test_gemm_sm120_nvfp4_ptx.py tests/test_gemm_sm120_nvfp4_correctness.py -> 10 passed.

Experimental branch benchmark notes for the final interleaved-scale path reported 4096^3 TensorFill-like data at 0.645 ms / 213.1 TFLOP/s; faster CLC/delayed-TMA epilogue variants were left out because they did not pass larger-grid validation.
Add an explicit sm120_nvfp4_path policy for the SM120 NVFP4 benchmark and compile path. The default validated policy keeps the conservative static-scheduler/direct-store path, while the fast policy selects the CLC/full-grid scheduler with the delayed TMA epilogue path so it can be benchmarked without editing source.

The run_sm120_nvfp4_bench.sh script now forwards SM120_NVFP4_PATH, and benchmark_gemm.py also accepts --sm120_nvfp4_path {validated,fast}. Add a focused validation test proving the two policies select the intended scheduler and epilogue switches.

Validation: python -m py_compile quack/gemm_sm120.py quack/blockscaled_gemm_utils.py benchmarks/benchmark_gemm.py tests/test_gemm_sm120_nvfp4_validation.py; python -m ruff check quack/gemm_sm120.py quack/blockscaled_gemm_utils.py benchmarks/benchmark_gemm.py tests/test_gemm_sm120_nvfp4_validation.py; bash -n scripts/run_sm120_nvfp4_bench.sh; CUTE_DSL_LIBS=/home/agent/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache CUTE_DSL_ARCH=sm_120a python -m pytest -q tests/test_gemm_sm120_nvfp4_validation.py -> 5 passed.

Benchmark smoke: SM120_NVFP4_PATH=fast WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh -> 0.498 ms, 276.2 TFLOP/s, PASS; SM120_NVFP4_PATH=fast ./scripts/run_sm120_nvfp4_bench.sh -> 0.498 ms, 275.9 TFLOP/s, PASS; WARMUP=1 ITERS=1 ./scripts/run_sm120_nvfp4_bench.sh -> 0.659 ms, 208.4 TFLOP/s, PASS.
Add a focused SM120 NVFP4 blockscaled GEMM suite covering the public compile/run contract for the (128,128,128) path.

The tests exercise TensorFill-like packed FP4 inputs, compact interleaved FP8 scale storage, BF16 N-major output, and both sm120_nvfp4_path=validated and sm120_nvfp4_path=fast. They also distinguish validated direct global stores from the fast delayed TMA epilogue in PTX, and explicitly reject unsupported tilers, clusters, dtypes, varlen, and legacy rank-4 scale storage.

Validation run:

  CUTE_DSL_ARCH=sm_120a CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache python -m pytest -q tests/test_gemm_sm120_nvfp4_blockscaled.py

  CUTE_DSL_ARCH=sm_120a CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache python -m pytest -q tests/test_gemm_sm120_nvfp4_correctness.py tests/test_gemm_sm120_nvfp4_validation.py tests/test_gemm_sm120_nvfp4_ptx.py

  python -m ruff check tests/test_gemm_sm120_nvfp4_blockscaled.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant