
decode bf16 smallm: support arbitrary 1<heads_per_group<=8 via direct Q/Y GMEM when TMA unsuitable#37

Open
Religious-J wants to merge 1 commit into Tencent:main from Religious-J:feat/decode_push

Conversation


@Religious-J Religious-J commented Apr 2, 2026

Motivation

decode bf16 smallm: support heads_per_group = 2, 3, 4, 5, 6, 7, 9

Main changes: in the decode stage, the amount of data involved in loading Q and storing Y is small and does not affect pipeline scheduling, so direct global memory access is used instead of TMA to support heads_per_group = 2, 3, 4, 5, 6, 7, 9.

Result

H20 GPU, cuda13.0

  1. Qwen2.5-7B, heads_per_group = 7
     (benchmark results image)
  2. Qwen3-14B, heads_per_group = 5
     (benchmark results image)
  3. Qwen3-8B, heads_per_group = 4
     (benchmark results image)

P.S. hpc means splitk = false, and hpc-splitk means splitk = true.
sgl-flash_attn is flash_attn_v3.
For an easy test, see https://github.com/Religious-J/ops-beanchmark

Both configurations (splitk = false and splitk = true) can achieve optimal performance.

