Description
Bug
In iris/ops/matmul_all_reduce.py, the lock-signal atomic in _fused_matmul_all_reduce_kernel uses tl.atomic_xchg(lock_ptr, value, sem="release"), which defaults to scope="gpu". The lock array lives on the iris symmetric heap and is mapped into every GPU's address space via IPC. When Rank 0 writes to its lock entry, Rank 1 polls it via iris.atomic_add(..., scope="sys"). Because the write uses scope="gpu", the store may remain visible only within Rank 0's caches and never propagate through the system-level coherence protocol (xGMI).
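A minimal sketch of the signaling pattern described above; kernel and argument names here are illustrative, not the exact iris source:

```python
import triton
import triton.language as tl

@triton.jit
def producer_signal(lock_ptr):
    # Rank 0 publishes "done" on its lock entry in the symmetric heap.
    # BUG: scope defaults to "gpu", so this release-store may stay in
    # the local GPU's caches and never cross xGMI to peer GPUs.
    tl.atomic_xchg(lock_ptr, 1, sem="release")

@triton.jit
def consumer_spin(lock_ptr):
    # A remote rank polls at system scope (as iris.atomic_add does),
    # but it can spin forever if the producer's store is GPU-scoped.
    while tl.atomic_add(lock_ptr, 0, sem="acquire", scope="sys") == 0:
        pass
```

The asymmetry is the core of the bug: the consumer reads at system scope, but the producer writes at GPU scope, so the two ranks never agree on the lock's value.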
Impact
Remote ranks spin indefinitely on their locks, which manifests as intermittent hangs on multi-GPU runs. The failure is non-deterministic and hard to diagnose.
Fix
Change the lock signal to tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys"). All atomic operations on symmetric heap memory that cross GPU boundaries must use scope="sys".
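The one-line change, sketched here outside its surrounding kernel for brevity:

```python
# Before (GPU-scoped: peer GPUs may never observe the store):
tl.atomic_xchg(lock_ptr, value, sem="release")

# After (system-scoped: the release-store propagates through the
# system-level coherence protocol to all peers mapped via IPC):
tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys")
```

The sem="release" ordering is kept as-is; only the visibility scope changes.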
Component
iris/ops/matmul_all_reduce.py