
matmul_all_reduce: lock writes use scope="gpu", invisible to remote GPUs #462

@aamarnat

Description


Bug

In iris/ops/matmul_all_reduce.py, the lock-signal atomic in _fused_matmul_all_reduce_kernel uses tl.atomic_xchg(lock_ptr, value, sem="release") which defaults to scope="gpu". The lock array lives on the iris symmetric heap, mapped into all GPUs' address spaces via IPC. When Rank 0 writes to its lock entry, Rank 1 polls via iris.atomic_add(..., scope="sys"). Because the write uses scope="gpu", the store may only be visible within Rank 0's GPU caches and never propagate through the system-level coherence protocol (xGMI).

Impact

Remote ranks spin indefinitely on their lock entries, which manifests as intermittent hangs on multi-GPU runs. The failure is non-deterministic and hard to diagnose.

Fix

Change the lock signal to tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys"). All atomic operations on symmetric heap memory that cross GPU boundaries must use scope="sys".
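A minimal sketch of the proposed change (exact surrounding code in `_fused_matmul_all_reduce_kernel` may differ; `lock_ptr` and `value` are the names used above):

```diff
- tl.atomic_xchg(lock_ptr, value, sem="release")           # defaults to scope="gpu"
+ tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys")
```

With `scope="sys"`, the release store participates in system-level coherence, so a remote rank polling with `iris.atomic_add(..., scope="sys")` is guaranteed to eventually observe the lock write.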

Component

iris/ops/matmul_all_reduce.py

Metadata

Labels: bug (Something isn't working), iris (Iris project issue)
