
Add rope_norm_store_kv fusion op #39

Open
xueyangcs wants to merge 1 commit into Tencent:main from xueyangcs:feature/ryannxue_rope_norm_store_kv

Conversation

@xueyangcs
Contributor

This PR adds a fused CUDA operator that performs RoPE rotation, optional QK RMSNorm, and blocked KV-cache write in a single kernel.

API

| API | Input dtype | Output | Description |
|---|---|---|---|
| `hpc.rope_norm_store_kv` | BF16 | `out_q`; K/V written in-place to the KV cache | Standard BF16 inference |
| `hpc.rope_norm_store_kv_fp8` | BF16 → FP8 | `(out_q_fp8, q_scale, split_k_flag)`; K/V written to the FP8 KV cache | FP8 quantized inference |

Both variants support prefill and decode modes via the `is_prefill` flag.
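To make the fused operation concrete, here is a hedged NumPy reference sketch of the RoPE rotation step (not the CUDA kernel itself). It assumes the interleaved-pair convention and the standard `theta = 10000` base; the real kernel's layout and convention may differ.

```python
import numpy as np

def apply_rope(x: np.ndarray, pos: int, theta: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) channel pair of x by a position-dependent angle.

    x: [num_heads, head_dim] activations for one token at position `pos`.
    Illustrative reference only; names and layout are assumptions.
    """
    num_heads, head_dim = x.shape
    half = head_dim // 2
    # Per-pair inverse frequencies: theta^(-2i/head_dim)
    inv_freq = theta ** (-np.arange(half) * 2.0 / head_dim)
    angles = pos * inv_freq                      # [half]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin    # 2D rotation per channel pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because RoPE is a pure rotation, it preserves the per-head norm of Q/K, which is why it can be fused with the KV-cache write without extra rescaling.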

Policy Parameters

qk_norm_policy

Controls whether RMSNorm is applied to Q/K and its order relative to RoPE.

| Value | Behavior |
|---|---|
| 0 | No RMSNorm |
| 1 | RoPE → RMSNorm |
| 2 | RMSNorm → RoPE |
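The policy dispatch above can be sketched as follows. This is an illustrative Python model of the ordering semantics, not the kernel's code; `rms_norm`, `process_qk`, and the `rope_fn` callback are hypothetical names.

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last (head_dim) axis."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def process_qk(x, weight, rope_fn, qk_norm_policy: int):
    """Apply RoPE and optional RMSNorm in the order selected by the policy."""
    if qk_norm_policy == 0:          # no RMSNorm
        return rope_fn(x)
    if qk_norm_policy == 1:          # RoPE -> RMSNorm
        return rms_norm(rope_fn(x), weight)
    if qk_norm_policy == 2:          # RMSNorm -> RoPE
        return rope_fn(rms_norm(x, weight))
    raise ValueError("unknown qk_norm_policy")
```

Note that policies 1 and 2 generally differ: RMSNorm after RoPE normalizes the rotated values, while RMSNorm before RoPE normalizes the raw projections and then rotates (rotation preserves the norm, so policy 2 also yields unit-RMS outputs when the weight is 1).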

quant_policy

Controls the Q quantization granularity. K/V always use static scaling via `k_scale` / `v_scale`.

| Value | Name | Q Quantization |
|---|---|---|
| 1 | `dqskv` | Dynamic per-token, per-head; scale computed by the kernel and written to `q_scale` |
| 2 | `sqskv` | Static; uses the caller-supplied `q_scale_inv` |
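For `quant_policy = 1` (dynamic), a plausible reference model of the per-token, per-head scale computation is sketched below. It assumes FP8 E4M3 with a max finite value of 448 and skips the actual FP8 rounding; function names are illustrative, not the PR's API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed target format; max finite E4M3 value

def quantize_q_dynamic(q: np.ndarray):
    """q: [num_tokens, num_heads, head_dim] -> (q_quant, q_scale).

    One scale per (token, head), chosen so the max magnitude maps to
    the FP8 range; the kernel would write q_scale alongside out_q_fp8.
    """
    amax = np.abs(q).max(axis=-1, keepdims=True)          # [T, H, 1]
    q_scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX      # avoid divide-by-zero
    q_quant = np.clip(q / q_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q_quant, q_scale.squeeze(-1)

def dequantize_q(q_quant: np.ndarray, q_scale: np.ndarray) -> np.ndarray:
    """Inverse transform the attention kernel would apply."""
    return q_quant * q_scale[..., None]
```

With `quant_policy = 2` (static), the division by a computed scale would instead be a multiplication by the caller-supplied `q_scale_inv`, trading per-token adaptivity for one fewer reduction in the kernel.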

@xueyangcs xueyangcs force-pushed the feature/ryannxue_rope_norm_store_kv branch from fa6062b to d2e38d2 Compare April 3, 2026 05:50
@xueyangcs xueyangcs force-pushed the feature/ryannxue_rope_norm_store_kv branch from d2e38d2 to b3524d1 Compare April 6, 2026 16:14
