
fix: matmul_all_reduce lock signal must use scope="sys" for xGMI visibility#477

Draft
Copilot wants to merge 2 commits into main from copilot/fix-lock-writes-scope-gpu

Conversation


Copilot AI commented Mar 24, 2026

Lock writes in _fused_matmul_all_reduce_kernel used scope="gpu", making them invisible to remote GPUs that poll with system-scope atomics. The result was intermittent hangs on multi-GPU runs.

Change

  • iris/ops/matmul_all_reduce.py: Change lock-signal atomic from scope="gpu" to scope="sys"
# Before — write confined to local GPU cache hierarchy
tl.atomic_xchg(lock_ptr, 1, sem="release", scope="gpu")

# After — propagates through xGMI system-level coherence
tl.atomic_xchg(lock_ptr, 1, sem="release", scope="sys")

The lock array lives on the symmetric heap and is mapped into all GPUs' address spaces via IPC. Any atomic on symmetric heap memory consumed by a remote rank must use scope="sys".
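To make the failure mode concrete, here is a small host-side Python model (hypothetical, not the Iris or Triton API) with a per-GPU write buffer and a shared system memory: a scope="gpu" store stays in the writer's local buffer, so a remote reader polling system memory never observes it, while a scope="sys" store is flushed through to the shared memory that all ranks see.

```python
# Toy model of atomic-store visibility across GPUs. Illustrative only:
# real xGMI coherence is hardware-managed; this just mimics the bug's shape.

class ToyGPU:
    def __init__(self, system_mem):
        self.system_mem = system_mem   # shared memory, coherent at system scope
        self.local_cache = {}          # writes visible only to this GPU

    def atomic_xchg(self, addr, value, scope):
        old = self.read(addr)
        self.local_cache[addr] = value
        if scope == "sys":
            # system-scope atomics propagate through the coherence fabric
            self.system_mem[addr] = value
        return old

    def read(self, addr):
        # a GPU sees its own cached writes first, then system memory
        return self.local_cache.get(addr, self.system_mem.get(addr, 0))

system_mem = {}
rank0, rank1 = ToyGPU(system_mem), ToyGPU(system_mem)

LOCK = 0x10
rank0.atomic_xchg(LOCK, 1, scope="gpu")
print(rank1.read(LOCK))  # 0 — the remote rank never sees the gpu-scope write

rank0.atomic_xchg(LOCK, 1, scope="sys")
print(rank1.read(LOCK))  # 1 — the sys-scope write is globally visible
```

In this model the consumer's spin loop on a gpu-scope write would never terminate, which is exactly the hang described above.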

Original prompt

This section details the original issue you should resolve

<issue_title>matmul_all_reduce: lock writes use scope="gpu", invisible to remote GPUs</issue_title>
<issue_description>## Bug

In iris/ops/matmul_all_reduce.py, the lock-signal atomic in _fused_matmul_all_reduce_kernel uses tl.atomic_xchg(lock_ptr, value, sem="release") which defaults to scope="gpu". The lock array lives on the iris symmetric heap, mapped into all GPUs' address spaces via IPC. When Rank 0 writes to its lock entry, Rank 1 polls via iris.atomic_add(..., scope="sys"). Because the write uses scope="gpu", the store may only be visible within Rank 0's GPU caches and never propagate through the system-level coherence protocol (xGMI).

Impact

Remote ranks spin indefinitely on locks — manifests as intermittent hangs on multi-GPU runs. Non-deterministic and hard to diagnose.

Fix

Change the lock signal to tl.atomic_xchg(lock_ptr, value, sem="release", scope="sys"). All atomic operations on symmetric heap memory that cross GPU boundaries must use scope="sys".

Component

iris/ops/matmul_all_reduce.py</issue_description>

<agent_instructions>Agreed with this bug. Consumer can be remote. Need system scope.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title from "[WIP] Fix lock writes to use scope="sys" for matmul_all_reduce" to "fix: matmul_all_reduce lock signal must use scope="sys" for xGMI visibility" on Mar 24, 2026
Copilot AI requested a review from mawad-amd March 24, 2026 01:01

Development

Successfully merging this pull request may close these issues.

matmul_all_reduce: lock writes use scope="gpu", invisible to remote GPUs