[Rotary] Fuse copy into kernel #144

Open
simveit wants to merge 1 commit into Dao-AILab:main from simveit:feature/rotary

Contributor

simveit commented May 24, 2026

Remove the blocking host-side computation by moving it into the kernel, and add a benchmark for rotary.

The following copy is quite expensive; this PR removes it:

    out = x if inplace else torch.empty_like(x)
    if rotary_dim < headdim and not inplace:
        out[..., rotary_dim:].copy_(x[..., rotary_dim:])

Benchmark:

| Path | Shape (B, S, H, D, rotary_dim) | Main (ms) | PR (ms) | Speedup | Main (GB/s) | PR (GB/s) | GB/s delta |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fwd | (8, 4096, 32, 128, 128) | 0.0921 | 0.0921 | 1.00x | 5841 | 5840 | -0.0% |
| fwd | (8, 4096, 32, 128, 64) | 0.1931 | 0.0999 | 1.93x | 2783 | 5382 | +93.4% |
| fwd | (4, 4096, 32, 256, 128) | 0.1910 | 0.0963 | 1.98x | 2816 | 5588 | +98.4% |
| fwd | (2, 4096, 32, 512, 256) | 0.1909 | 0.0938 | 2.04x | 2823 | 5747 | +103.6% |
| bwd | (8, 4096, 32, 128, 128) | 0.0921 | 0.0921 | 1.00x | 5839 | 5838 | -0.0% |
| bwd | (8, 4096, 32, 128, 64) | 0.1930 | 0.0997 | 1.94x | 2785 | 5388 | +93.5% |
| bwd | (4, 4096, 32, 256, 128) | 0.1937 | 0.0982 | 1.97x | 2777 | 5476 | +97.2% |
| bwd | (2, 4096, 32, 512, 256) | 0.1912 | 0.0933 | 2.05x | 2818 | 5777 | +105.0% |
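
For readers who want to sanity-check the GB/s columns, here is a rough sketch of how they could be reproduced. It is not taken from the PR's benchmark script. It assumes a 2-byte dtype (fp16 or bf16), counts one full read and one full write of `x`, and adds a read of `cos` and `sin`, each assumed to have shape `(S, rotary_dim / 2)`. The helper name `rotary_effective_gbps` is hypothetical. Under those assumptions the result lands within about 0.1% of the reported values.

```python
def rotary_effective_gbps(batch, seqlen, nheads, headdim, rotary_dim,
                          time_ms, bytes_per_elem=2):
    """Estimate effective bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumptions (not stated in the PR): 2-byte elements, x read once and
    the output written once, cos and sin each of shape (seqlen, rotary_dim // 2).
    """
    x_bytes = batch * seqlen * nheads * headdim * bytes_per_elem
    cos_sin_bytes = 2 * seqlen * (rotary_dim // 2) * bytes_per_elem
    total_bytes = 2 * x_bytes + cos_sin_bytes  # read x + write out + read cos/sin
    return total_bytes / (time_ms * 1e-3) / 1e9


# Compare against three forward rows of the table.
rows = [
    # (B, S, H, D, rotary_dim, time_ms, GB/s reported in the table)
    (8, 4096, 32, 128, 128, 0.0921, 5841),  # main, full rotary
    (8, 4096, 32, 128, 64, 0.1931, 2783),   # main, partial rotary
    (8, 4096, 32, 128, 64, 0.0999, 5382),   # PR, partial rotary
]
for b, s, h, d, rd, ms, reported in rows:
    estimate = rotary_effective_gbps(b, s, h, d, rd, ms)
    print(f"{(b, s, h, d, rd)} {ms} ms: estimated {estimate:.0f} GB/s, reported {reported}")
```

If this accounting is right, the metric charges both versions for the same number of bytes, so the GB/s delta is just the speedup restated and does not by itself show that main moves more data.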

simveit changed the title from "Fuse copy into kernel" to "[Rotary] Fuse copy into kernel" on May 24, 2026
Member

tridao commented May 24, 2026

If (D, rotary_dim) = (128, 64), doesn't the kernel in main read only 64 elements and write 64 elements, while the copy reads 64 elements and writes 64 elements? So in terms of IO, the version in main should be fine?
I'm just trying to understand where the suboptimality comes from.

Contributor Author

simveit commented May 24, 2026

My naive guess as to why this gives such a good boost is that the op itself is very cheap, so the overhead of launching a separate copy, plus the additional ordering it imposes within the stream we run on, may dominate. I could check this claim tomorrow in Nsight Systems.
