FlashAttention 3 exists, and surprisingly few people know about it. It is mainly advertised as fast on sm90 (H100), but in my benchmarks it is also a bit faster than FA2 on sm86 (RTX 3080), so it helps on consumer GPUs too.
Some FA3 wheels are available at https://download.pytorch.org/whl/flash-attn-3/ (see the discussion in Dao-AILab/flash-attention#2223). The one missing piece is a Windows + cu130 wheel, which they could not build; the patch in https://github.com/windreamer/flash-attention3-wheels fixes that. This is worth tracking for us.
Once the wheels are available, we can replace FA2 with FA3 on supported GPUs in ComfyUI.
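
To make the "FA3 on supported GPUs, FA2 otherwise" idea concrete, here is a rough sketch of the selection logic. It is not ComfyUI code. The helper names and backend labels are made up for illustration, the module names `flash_attn_interface` (FA3) and `flash_attn` (FA2) are my assumption about what the wheels install, and the supported set only contains the two architectures mentioned above.

```python
import importlib.util

# Assumption: FA3 wheels install a module named `flash_attn_interface`
# and FA2 installs `flash_attn`. Adjust if the wheels differ.
FA3_MODULE = "flash_attn_interface"
FA2_MODULE = "flash_attn"

# Assumption: only the architectures named in this post (sm86, sm90).
# Extend this set once other GPUs are confirmed to work.
FA3_CAPABILITIES = frozenset({(8, 6), (9, 0)})


def module_available(name):
    """Return True if `name` looks importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def pick_attention_backend(capability, has_fa3=None, has_fa2=None):
    """Pick "fa3", "fa2", or "default" for a (major, minor) compute capability.

    `has_fa3` / `has_fa2` can be passed explicitly (handy for testing);
    when left as None they are detected from the installed packages.
    """
    if has_fa3 is None:
        has_fa3 = module_available(FA3_MODULE)
    if has_fa2 is None:
        has_fa2 = module_available(FA2_MODULE)

    # Prefer FA3 only when it is installed and the GPU is in the supported set.
    if has_fa3 and tuple(capability) in FA3_CAPABILITIES:
        return "fa3"
    # Otherwise keep the existing FA2 path if it is installed.
    if has_fa2:
        return "fa2"
    # Fall back to whatever attention implementation is used without flash-attn.
    return "default"
```

In practice the capability tuple would come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`. Keeping the detection separate from the attention call itself means a missing or broken FA3 wheel just falls through to the current FA2 path.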