
Add rope_norm_store_kv fusion op #39

Open
xueyangcs wants to merge 1 commit into Tencent:main from xueyangcs:feature/ryannxue_rope_norm_store_kv

Conversation

@xueyangcs
Contributor

This PR adds a fused CUDA operator that performs RoPE rotation, optional QK RMSNorm, and blocked KV-cache write in a single kernel.

API

| API | Input dtype | Output | Description |
|---|---|---|---|
| `hpc.rope_norm_store_kv` | BF16 | `out_q`; K/V written in-place to the KV cache | Standard BF16 inference |
| `hpc.rope_norm_store_kv_fp8` | BF16 → FP8 | `(out_q_fp8, q_scale, split_k_flag)`; K/V written to the FP8 KV cache | FP8 quantized inference |

Both variants support prefill and decode modes via the `is_prefill` flag.
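To make the fused operation concrete, here is a hedged NumPy reference sketch of the RoPE rotation step (not the CUDA kernel itself). It assumes the interleaved-pair convention and the standard `theta = 10000` base; the real kernel's layout and convention may differ.

```python
import numpy as np

def apply_rope(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) channel pair of x by a position-dependent angle.

    x: [num_heads, head_dim] activations for one token at position `pos`.
    Illustrative reference only; names and layout are assumptions.
    """
    num_heads, head_dim = x.shape
    half = head_dim // 2
    # Per-pair inverse frequencies: theta^(-2i/head_dim)
    inv_freq = theta ** (-np.arange(half) * 2.0 / head_dim)
    angles = pos * inv_freq                      # [half]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin    # 2D rotation per channel pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because RoPE is a pure rotation, it preserves the per-head norm of Q/K, which is why it can be fused with the KV-cache write without extra rescaling.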

Policy Parameters

qk_norm_policy

Controls whether RMSNorm is applied to Q/K and its order relative to RoPE.

| Value | Behavior |
|---|---|
| 0 | No RMSNorm |
| 1 | RoPE → RMSNorm |
| 2 | RMSNorm → RoPE |
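The policy dispatch above can be sketched as follows. This is an illustrative Python model of the ordering semantics, not the kernel's code; `rms_norm`, `process_qk`, and the `rope_fn` callback are hypothetical names.

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last (head_dim) axis."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def process_qk(x, weight, rope_fn, qk_norm_policy: int):
    """Apply RoPE and optional RMSNorm in the order selected by the policy."""
    if qk_norm_policy == 0:          # no RMSNorm
        return rope_fn(x)
    if qk_norm_policy == 1:          # RoPE -> RMSNorm
        return rms_norm(rope_fn(x), weight)
    if qk_norm_policy == 2:          # RMSNorm -> RoPE
        return rope_fn(rms_norm(x, weight))
    raise ValueError("unknown qk_norm_policy")
```

Note that policies 1 and 2 generally differ: RMSNorm after RoPE normalizes the rotated values, while RMSNorm before RoPE normalizes the raw projections and then rotates (rotation preserves the norm, so policy 2 also yields unit-RMS outputs when the weight is 1).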

quant_policy

Controls the Q quantization granularity. K/V always use static scaling via `k_scale` / `v_scale`.

| Value | Name | Q Quantization |
|---|---|---|
| 1 | `dqskv` | Dynamic per-token, per-head; scale computed by the kernel and written to `q_scale` |
| 2 | `sqskv` | Static; uses the caller-supplied `q_scale_inv` |
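For `quant_policy = 1` (dynamic), a plausible reference model of the per-token, per-head scale computation is sketched below. It assumes FP8 E4M3 with a max finite value of 448 and skips the actual FP8 rounding; function names are illustrative, not the PR's API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed target format; max finite E4M3 value

def quantize_q_dynamic(q: np.ndarray):
    """q: [num_tokens, num_heads, head_dim] -> (q_quant, q_scale).

    One scale per (token, head), chosen so the max magnitude maps to
    the FP8 range; the kernel would write q_scale alongside out_q_fp8.
    """
    amax = np.abs(q).max(axis=-1, keepdims=True)          # [T, H, 1]
    q_scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX      # avoid divide-by-zero
    q_quant = np.clip(q / q_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q_quant, q_scale.squeeze(-1)

def dequantize_q(q_quant: np.ndarray, q_scale: np.ndarray) -> np.ndarray:
    """Inverse transform the attention kernel would apply."""
    return q_quant * q_scale[..., None]
```

With `quant_policy = 2` (static), the division by a computed scale would instead be a multiplication by the caller-supplied `q_scale_inv`, trading per-token adaptivity for one fewer reduction in the kernel.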

@xueyangcs xueyangcs force-pushed the feature/ryannxue_rope_norm_store_kv branch from fa6062b to d2e38d2 Compare April 3, 2026 05:50
@xueyangcs xueyangcs force-pushed the feature/ryannxue_rope_norm_store_kv branch from d2e38d2 to b3524d1 Compare April 6, 2026 16:14
