Description
Bug
In iris/ops/matmul_all_reduce.py, the lock-signal atomic in _fused_matmul_all_reduce_kernel uses tl.atomic_xchg(lock_ptr, value, sem="release"), which defaults to scope="gpu". The lock array lives on the iris symmetric heap and is mapped into every GPU's address space via IPC. When Rank 0 writes to its lock entry, Rank 1 polls it via iris.atomic_add(..., scope="sys"). Because the write uses scope="gpu", the store may remain visible only within Rank 0's caches and never propagate through the system-level coherence protocol (xGMI).
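A minimal sketch of the signaling pattern described above; kernel and argument names here are illustrative, not the exact iris source:

```python
import triton
import triton.language as tl

@triton.jit
def producer_signal(lock_ptr):
    # Rank 0 publishes "done" on its lock entry in the symmetric heap.
    # BUG: scope defaults to "gpu", so this release-store may stay in
    # the local GPU's caches and never cross xGMI to peer GPUs.
    tl.atomic_xchg(lock_ptr, 1, sem="release")

@triton.jit
def consumer_spin(lock_ptr):
    # A remote rank polls at system scope (as iris.atomic_add does),
    # but it can spin forever if the producer's store is GPU-scoped.
    while tl.atomic_add(lock_ptr, 0, sem="acquire", scope="sys") == 0:
        pass
```

The asymmetry is the core of the bug: the consumer reads at system scope, but the producer writes at GPU scope, so the two ranks never agree on the lock's value.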
Impact
Remote ranks spin indefinitely on their locks, which manifests as intermittent hangs on multi-GPU runs. The failure is non-deterministic and hard to diagnose.
Fix
Change the lock signal to tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys"). All atomic operations on symmetric heap memory that cross GPU boundaries must use scope="sys".
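The one-line change, sketched here outside its surrounding kernel for brevity:

```python
# Before (GPU-scoped: peer GPUs may never observe the store):
tl.atomic_xchg(lock_ptr, value, sem="release")

# After (system-scoped: the release-store propagates through the
# system-level coherence protocol to all peers mapped via IPC):
tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys")
```

The sem="release" ordering is kept as-is; only the visibility scope changes.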
Component
iris/ops/matmul_all_reduce.py