Skip to content

cuda: add launch-bounded tile8 MoE down path#145

Open
amarrmb wants to merge 2 commits into
antirez:mainfrom
amarrmb:cuda-moe-down-tile8-rowspan
Open

cuda: add launch-bounded tile8 MoE down path#145
amarrmb wants to merge 2 commits into
antirez:mainfrom
amarrmb:cuda-moe-down-tile8-rowspan

Conversation

@amarrmb
Copy link
Copy Markdown

@amarrmb amarrmb commented May 14, 2026

Adds an optional CUDA MoE down path for prefill, guarded by DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1.

This is stacked on top of the Blackwell F16 dispatch change from #121. Default behavior is unchanged unless the env flag is set.

Why

Nsight Systems showed MoE down as one of the dominant CUDA kernels during prefill on Blackwell-class devices. The existing tile16 row-span path is fast, but the launch stats show high register and shared-memory pressure.

This PR adds a tile8 row-span variant and uses __launch_bounds__(256, 2) to reduce the register footprint while preserving the row-span structure.

Thor NCU launch stats for the new kernel:

kernel registers/thread static shared memory/block grid size
existing tile16 row-span 165 37,376 bytes 608
new launch-bounded tile8 row-span 128 18,688 bytes 704

speed-bench

Command shape:

./ds4-bench --cuda -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 \
  --step-incr 2048 --gen-tokens 128

Results are averages over the 32 reported context points.

device variant avg prefill tok/s avg gen tok/s
Jetson Thor default 138.17 9.67
Jetson Thor DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1 147.11 9.66
DGX Spark default 341.26 13.37
DGX Spark DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1 354.08 13.36

That is approximately +6.5% average prefill on Thor and +3.8% average prefill on DGX Spark, with generation unchanged within noise.

Tests

Passed on both Jetson Thor and DGX Spark:

make ds4-bench ds4_test
DS4_CUDA_MOE_DOWN_TILE8_ROWSPAN=1 ./ds4_test --metal-kernels --server

Notes

This PR intentionally keeps the new path behind an env flag. It is a prefill-oriented tuning path and should not change defaults without more bake time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant