[Rotary] Fuse copy into kernel by simveit · Pull Request #144 · Dao-AILab/quack

simveit · 2026-05-24T08:31:17Z

Remove blocking host-side computation by moving it into the kernel, and add a benchmark for rotary.

The following copy is quite expensive; this PR removes it:

    out = x if inplace else torch.empty_like(x)
    if rotary_dim < headdim and not inplace:
        out[..., rotary_dim:].copy_(x[..., rotary_dim:])
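
To make the intent concrete, here is a minimal pure-Python sketch of the out-of-place semantics for a single head vector. It assumes the half-split (non-interleaved) rotary convention, which the PR does not state, and `rotary_row_fused` is a hypothetical name, not a function in this repository. The only point is that the pass-through tail `x[rotary_dim:]` is written in the same pass as the rotated part instead of by a separate `copy_` beforehand.

```python
import math

def rotary_row_fused(x, cos, sin, rotary_dim):
    """Reference for one head vector; hypothetical helper, not the quack kernel.

    Assumes the half-split convention: element i is paired with
    element i + rotary_dim // 2.
    """
    half = rotary_dim // 2
    assert rotary_dim % 2 == 0 and rotary_dim <= len(x)
    assert len(cos) == half and len(sin) == half
    out = [0.0] * len(x)  # stands in for torch.empty_like(x)
    for i in range(half):
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * cos[i] - x2 * sin[i]
        out[i + half] = x1 * sin[i] + x2 * cos[i]
    # Fused pass-through: the tail is written here, not by a separate copy.
    for i in range(rotary_dim, len(x)):
        out[i] = x[i]
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
theta = [0.0, math.pi / 2]
out = rotary_row_fused(
    x,
    [math.cos(t) for t in theta],
    [math.sin(t) for t in theta],
    rotary_dim=4,
)
```

With `rotary_dim=4` and `len(x) == 6`, the last two elements of `out` are plain copies of the input, while the first four are rotated.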

Benchmark:

| Path | Shape `(B,S,H,D,rotary_dim)` | Main (ms) | PR (ms) | Speedup | Main (GB/s) | PR (GB/s) | GB/s delta |
|---|---|---|---|---|---|---|---|
| fwd | `(8,4096,32,128,128)` | 0.0921 | 0.0921 | 1.00x | 5841 | 5840 | -0.0% |
| fwd | `(8,4096,32,128,64)` | 0.1931 | 0.0999 | 1.93x | 2783 | 5382 | +93.4% |
| fwd | `(4,4096,32,256,128)` | 0.1910 | 0.0963 | 1.98x | 2816 | 5588 | +98.4% |
| fwd | `(2,4096,32,512,256)` | 0.1909 | 0.0938 | 2.04x | 2823 | 5747 | +103.6% |
| bwd | `(8,4096,32,128,128)` | 0.0921 | 0.0921 | 1.00x | 5839 | 5838 | -0.0% |
| bwd | `(8,4096,32,128,64)` | 0.1930 | 0.0997 | 1.94x | 2785 | 5388 | +93.5% |
| bwd | `(4,4096,32,256,128)` | 0.1937 | 0.0982 | 1.97x | 2777 | 5476 | +97.2% |
| bwd | `(2,4096,32,512,256)` | 0.1912 | 0.0933 | 2.05x | 2818 | 5777 | +105.0% |
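
As a rough cross-check of the GB/s columns, the sketch below recomputes effective bandwidth from shape and time. It assumes 2-byte elements (for example bf16 or fp16) and counts only one full read of `x` and one full write of the output, ignoring the much smaller `cos`/`sin` tables. Neither assumption is stated in the PR, and `effective_gbps` is a hypothetical helper.

```python
def effective_gbps(shape, ms, bytes_per_elem=2):
    """Effective bandwidth in GB/s, assuming one full read and one full write.

    shape is (B, S, H, D); bytes_per_elem=2 is an assumption (bf16/fp16).
    """
    b, s, h, d = shape
    total_bytes = 2 * b * s * h * d * bytes_per_elem  # read x + write out
    return total_bytes / (ms * 1e-3) / 1e9

# fwd rows for (B, S, H, D) = (8, 4096, 32, 128), times taken from the table
full_rotary = effective_gbps((8, 4096, 32, 128), 0.0921)    # rotary_dim = 128
partial_pr = effective_gbps((8, 4096, 32, 128), 0.0999)     # rotary_dim = 64, PR
partial_main = effective_gbps((8, 4096, 32, 128), 0.1931)   # rotary_dim = 64, main
```

Under these assumptions the three values land within about 1% of the reported 5841, 5382, and 2783 GB/s, which suggests the benchmark counts the full tensor as traffic in every row.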

tridao · 2026-05-24T19:57:56Z

If (D, rotary_dim) = (128, 64), doesn't the kernel in main read only 64 elements and write 64 elements, while the copy reads 64 elements and writes 64 elements? So in terms of IO, the version in main should be fine?
I'm just trying to understand where the suboptimality comes from.
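
The IO accounting in this question can be written out as a small sketch. It counts elements per head vector only, ignores the `cos`/`sin` reads, and uses hypothetical helper names.

```python
def io_elements_main(d, rotary_dim):
    """Elements moved per head vector on main: kernel plus separate copy."""
    kernel = 2 * rotary_dim        # read + write of the rotated part
    copy = 2 * (d - rotary_dim)    # read + write of the pass-through tail
    return kernel + copy

def io_elements_fused(d, rotary_dim):
    """Elements moved per head vector when the kernel also writes the tail."""
    return 2 * d                   # read all of x, write all of out

main_io = io_elements_main(128, 64)
fused_io = io_elements_fused(128, 64)
```

Both come to 256 elements per head vector, so by this count the slowdown on main is not explained by extra bytes moved.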

simveit · 2026-05-24T20:22:22Z

My naive guess for why this gives such a large speedup is that the op itself is very cheap, so the overhead of launching a separate copy, plus the additional ordering it imposes within the stream we run on, may dominate. I could check this claim tomorrow in Nsight Systems.

Fuse the copy for rotary.

e92cacf


simveit changed the title ~~Fuse copy into kernel~~ [Rotary] Fuse copy into kernel May 24, 2026

simveit wants to merge 1 commit into Dao-AILab:main from simveit:feature/rotary
