
decode bf16 smallm: support arbitrary 1<heads_per_group<=8 via direct Q/Y GMEM when TMA unsuitable#37

Open
Religious-J wants to merge 1 commit into Tencent:main from Religious-J:feat/decode_push

Conversation


@Religious-J Religious-J commented Apr 2, 2026

Motivation

decode bf16 smallm: support heads_per_group = 2, 3, 4, 5, 6, 7, 9

Main changes: in the decode stage, the amount of data involved in loading Q and storing Y is small and does not affect pipeline scheduling, so direct global memory access is used instead of TMA to support heads_per_group = 2, 3, 4, 5, 6, 7, 9.

Result

H20 GPU, cuda13.0

  1. Qwen2.5-7B, heads_per_group = 7
     (benchmark results image)
  2. Qwen3-14B, heads_per_group = 5
     (benchmark results image)
  3. Qwen3-8B, heads_per_group = 4
     (benchmark results image)

P.S. hpc means splitk = false, and hpc-splitk means splitk = true.
sgl-flash_attn is flash_attn_v3.
For an easy test, see https://github.com/Religious-J/ops-beanchmark

Both configurations (splitk = false and splitk = true) can achieve optimal performance.

