flash-attn2: [XPU] Tune FA2 performance on B60 #666

Open
YangKai0616 wants to merge 20 commits into huggingface:main from YangKai0616:perf-bmg

YangKai0616 commented Apr 20, 2026

IGC allocates registers differently for B60 than for PVC. The kernels we previously tuned for PVC spill registers heavily on B60, and performance drops sharply as a result.

This PR fixes the issue by optimizing specifically for B60’s hardware. Current test results:

torch       : 2.11.0+xpu
XPU device  : Intel(R) Arc(TM) Pro B60 Graphics

======================================================================================
TEST 1: flash_attn_func (dense forward)
  local  recorded: 2026-04-25 01:14:38
  remote recorded: 2026-04-25 01:21:12
======================================================================================

Config                              local       remote                 diff
----------------------------------------------------------------------------
B=4  S=512   H=8  D=64            0.098ms      0.397ms  local  4.06x faster
B=4  S=512   H=8  D=128           0.172ms      1.883ms local  10.98x faster
B=4  S=1024  H=8  D=64            0.132ms      0.646ms  local  4.91x faster
B=4  S=2048  H=8  D=64            0.415ms      1.142ms  local  2.75x faster
B=4  S=2048  H=16 D=64            0.774ms      2.029ms  local  2.62x faster
B=8  S=1024  H=8  D=64            0.233ms      1.101ms  local  4.72x faster
B=1  S=4096  H=32 D=128           3.394ms     22.093ms  local  6.51x faster
B=4  S=512   H=8  D=96            0.140ms      0.548ms  local  3.93x faster
B=4  S=512   H=8  D=192           0.207ms      3.134ms local  15.14x faster
B=4  S=1024  H=8  D=96            0.223ms      0.944ms  local  4.23x faster
B=4  S=2048  H=8  D=192           1.276ms     12.015ms  local  9.42x faster
B=4  S=2048  H=16 D=96            1.427ms      2.877ms  local  2.02x faster
B=8  S=1024  H=8  D=192           0.717ms     11.023ms local  15.37x faster
B=1  S=4096  H=32 D=192           5.028ms     30.823ms  local  6.13x faster

======================================================================================
TEST 2: flash_attn_varlen_func (forward)
  local  recorded: 2026-04-25 01:14:38
  remote recorded: 2026-04-25 01:21:12
======================================================================================

Config                              local       remote                 diff
----------------------------------------------------------------------------
B=4  S=512   H=8  D=64            0.095ms      0.359ms  local  3.78x faster
B=4  S=512   H=8  D=128           0.181ms      1.932ms local  10.70x faster
B=4  S=1024  H=8  D=64            0.128ms      0.617ms  local  4.83x faster
B=4  S=2048  H=8  D=64            0.405ms      1.105ms  local  2.73x faster
B=4  S=2048  H=16 D=64            0.755ms      1.978ms  local  2.62x faster
B=8  S=1024  H=8  D=64            0.226ms      1.068ms  local  4.74x faster
B=1  S=4096  H=32 D=128           3.509ms     21.941ms  local  6.25x faster
B=4  S=512   H=8  D=96            0.149ms      0.557ms  local  3.74x faster
B=4  S=512   H=8  D=192           0.221ms      3.244ms local  14.71x faster
B=4  S=1024  H=8  D=96            0.223ms      0.907ms  local  4.06x faster
B=4  S=2048  H=8  D=192           1.280ms     12.283ms  local  9.60x faster
B=4  S=2048  H=16 D=96            1.444ms      2.816ms  local  1.95x faster
B=8  S=1024  H=8  D=192           0.722ms     11.191ms local  15.50x faster
B=1  S=4096  H=32 D=192           5.079ms     27.010ms  local  5.32x faster
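
For readers who want to reproduce numbers of this kind, the following is a minimal timing sketch, not the script that produced the tables above. `time_ms` and `speedup` are hypothetical helper names. The `sync` argument is assumed to be a device barrier, for example `torch.xpu.synchronize` on an XPU build of PyTorch, so that queued kernels finish before the clock stops. The diff column is assumed to be the remote time divided by the local time.

```python
import statistics
import time
from typing import Callable


def time_ms(fn: Callable[[], object],
            sync: Callable[[], None] = lambda: None,
            warmup: int = 5,
            iters: int = 20) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    # Warm-up runs absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for queued device work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(local_ms: float, remote_ms: float) -> float:
    """How many times faster the local build is than the remote one."""
    return remote_ms / local_ms


# First row of TEST 1: 0.098 ms local vs 0.397 ms remote.
print(f"{speedup(0.098, 0.397):.2f}x")
```

Recomputing the ratio from the rounded timings in a row can differ from the printed diff in the last digit, presumably because the tables were computed from unrounded measurements.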

YangKai0616 marked this pull request as ready for review April 27, 2026 03:18
YangKai0616 requested a review from drbh as a code owner April 27, 2026 03:18
YangKai0616 commented

It looks like the failed CI run is hanging at the kernel build stage. @danieldk, do you happen to know how to handle this?

danieldk commented May 4, 2026

> It looks like the failed CI is hanging at the build kernel stage. @danieldk , do you happen to know how to handle this?

It looks like the builders are running out of memory. Have you seen significantly higher memory use when building this PR?

danieldk added a commit that referenced this pull request May 7, 2026
* flash-attn2: make the build concurrency configurable

PR #666 fails due to going OOM. Add a mechanism to support
per-kernel/backend configurable concurrency for such cases.

* Update flake to get new outputs
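
As an illustration of the idea only, and not the mechanism added in the referenced commit, a per-kernel/backend cap on parallel build jobs could be sketched as below. The `MAX_JOBS` table, its value of 2, and the `build_jobs` helper are all hypothetical.

```python
import os
from typing import Optional

# Hypothetical per-(kernel, backend) caps on parallel compile jobs.
# Memory-hungry builds get a lower cap so the builder does not go OOM.
MAX_JOBS = {("flash-attn2", "xpu"): 2}


def build_jobs(kernel: str, backend: str, default: Optional[int] = None) -> int:
    """Pick the number of parallel build jobs for one kernel/backend pair."""
    if default is None:
        default = os.cpu_count() or 1
    # Fall back to the default when no cap is configured for this pair.
    cap = MAX_JOBS.get((kernel, backend), default)
    return max(1, min(cap, default))
```

With this table, `build_jobs("flash-attn2", "xpu", default=8)` is limited to 2 jobs, while any pair without an entry keeps the default.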