flash-attn2: [XPU] Tune FA2 performance on B60 #671

Merged: danieldk merged 20 commits into main from perf-bmg on May 8, 2026

@danieldk
Member

Original PR: #666

IGC uses a different register allocation strategy for B60 than for PVC. The kernels we previously tuned for PVC cause heavy register spilling on B60, and performance drops sharply as a result.

This PR fixes the issue by tuning the kernels specifically for B60. Current test results:

torch       : 2.11.0+xpu
XPU device  : Intel(R) Arc(TM) Pro B60 Graphics

======================================================================================
TEST 1: flash_attn_func (dense forward)
  local  recorded: 2026-04-25 01:14:38
  remote recorded: 2026-04-25 01:21:12
======================================================================================

Config                              local       remote                 diff
----------------------------------------------------------------------------
B=4  S=512   H=8  D=64            0.098ms      0.397ms  local  4.06x faster
B=4  S=512   H=8  D=128           0.172ms      1.883ms local  10.98x faster
B=4  S=1024  H=8  D=64            0.132ms      0.646ms  local  4.91x faster
B=4  S=2048  H=8  D=64            0.415ms      1.142ms  local  2.75x faster
B=4  S=2048  H=16 D=64            0.774ms      2.029ms  local  2.62x faster
B=8  S=1024  H=8  D=64            0.233ms      1.101ms  local  4.72x faster
B=1  S=4096  H=32 D=128           3.394ms     22.093ms  local  6.51x faster
B=4  S=512   H=8  D=96            0.140ms      0.548ms  local  3.93x faster
B=4  S=512   H=8  D=192           0.207ms      3.134ms local  15.14x faster
B=4  S=1024  H=8  D=96            0.223ms      0.944ms  local  4.23x faster
B=4  S=2048  H=8  D=192           1.276ms     12.015ms  local  9.42x faster
B=4  S=2048  H=16 D=96            1.427ms      2.877ms  local  2.02x faster
B=8  S=1024  H=8  D=192           0.717ms     11.023ms local  15.37x faster
B=1  S=4096  H=32 D=192           5.028ms     30.823ms  local  6.13x faster

======================================================================================
TEST 2: flash_attn_varlen_func (forward)
  local  recorded: 2026-04-25 01:14:38
  remote recorded: 2026-04-25 01:21:12
======================================================================================

Config                              local       remote                 diff
----------------------------------------------------------------------------
B=4  S=512   H=8  D=64            0.095ms      0.359ms  local  3.78x faster
B=4  S=512   H=8  D=128           0.181ms      1.932ms local  10.70x faster
B=4  S=1024  H=8  D=64            0.128ms      0.617ms  local  4.83x faster
B=4  S=2048  H=8  D=64            0.405ms      1.105ms  local  2.73x faster
B=4  S=2048  H=16 D=64            0.755ms      1.978ms  local  2.62x faster
B=8  S=1024  H=8  D=64            0.226ms      1.068ms  local  4.74x faster
B=1  S=4096  H=32 D=128           3.509ms     21.941ms  local  6.25x faster
B=4  S=512   H=8  D=96            0.149ms      0.557ms  local  3.74x faster
B=4  S=512   H=8  D=192           0.221ms      3.244ms local  14.71x faster
B=4  S=1024  H=8  D=96            0.223ms      0.907ms  local  4.06x faster
B=4  S=2048  H=8  D=192           1.280ms     12.283ms  local  9.60x faster
B=4  S=2048  H=16 D=96            1.444ms      2.816ms  local  1.95x faster
B=8  S=1024  H=8  D=192           0.722ms     11.191ms local  15.50x faster
B=1  S=4096  H=32 D=192           5.079ms     27.010ms  local  5.32x faster
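
The `diff` column in both tables appears to be the ratio of the remote time to the local time. A minimal sketch of that calculation follows, assuming this reading is correct; `speedup` and `format_row` are hypothetical helper names, not part of the benchmark script used for this PR.

```python
def speedup(local_ms: float, remote_ms: float) -> float:
    """Return remote_ms / local_ms, i.e. how many times faster local is."""
    if local_ms <= 0 or remote_ms <= 0:
        raise ValueError("timings must be positive")
    return remote_ms / local_ms


def format_row(config: str, local_ms: float, remote_ms: float) -> str:
    """Format one result row in a layout similar to the tables above."""
    ratio = speedup(local_ms, remote_ms)
    # Report whichever side is faster, with a factor >= 1.
    if ratio >= 1.0:
        winner, factor = "local", ratio
    else:
        winner, factor = "remote", 1.0 / ratio
    return (
        f"{config:<28} {local_ms:>10.3f}ms {remote_ms:>10.3f}ms"
        f"  {winner} {factor:.2f}x faster"
    )


if __name__ == "__main__":
    # Timings copied from TEST 1 above.
    print(format_row("B=4  S=2048  H=8  D=192", 1.276, 12.015))
    print(format_row("B=1  S=4096  H=32 D=128", 3.394, 22.093))
```

Ratios recomputed from the rounded millisecond values can differ from the table in the last digit, presumably because the original figures were derived from unrounded timings.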

danieldk requested a review from drbh as a code owner on April 27, 2026 06:38
danieldk merged commit 042c80b into main on May 8, 2026
11 of 12 checks passed
danieldk deleted the perf-bmg branch on May 8, 2026 06:10