Status: Open · Labels: Bug

Description
In iris/ops/matmul_all_reduce.py, matmul_all_reduce_preamble performs both:
- Workspace buffer allocation — shmem.zeros() for locks and aux_buffer, which is a collective operation (all ranks must call it together)
- Per-call preparation — C.zero_(), shmem.barrier()

If the workspace matches and can be reused, only some ranks call the preamble while others skip it. This causes shmem.zeros to be called from only a subset of ranks, deadlocking the collective.
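The failure mode can be reproduced in miniature with any collective primitive. The sketch below uses threading.Barrier as a stand-in for shmem.zeros, with threads standing in for ranks; the function name maybe_preamble and the even/odd split are illustrative assumptions, not iris code:

```python
import threading

NUM_RANKS = 4
# Stand-in for a collective op: completes only if all "ranks" arrive.
collective = threading.Barrier(NUM_RANKS, timeout=0.5)

results = []

def maybe_preamble(rank, workspace_matches):
    # Buggy pattern: only ranks whose workspace doesn't match
    # enter the collective; the others skip the preamble entirely.
    if not workspace_matches:
        try:
            collective.wait()
            results.append((rank, "ok"))
        except threading.BrokenBarrierError:
            results.append((rank, "deadlock"))

threads = [
    threading.Thread(target=maybe_preamble, args=(r, r % 2 == 0))
    for r in range(NUM_RANKS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Ranks 1 and 3 entered the barrier; ranks 0 and 2 skipped it. The barrier
# never fills and breaks on timeout -- the deadlock in miniature. In the real
# runtime there is no timeout, so the hang is permanent.
print(sorted(results))
```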
Additionally, the preamble zeros the lock array and calls shmem.barrier() on every call. For lock-based variants (one_shot, two_shot), this overhead is unnecessary if versioned locks are used instead.
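A common way to avoid re-zeroing a lock array before every call is an epoch-based (versioned) lock: each call compares flags against a fresh expected value, so stale values from earlier calls read as "not done" without any clearing. A minimal single-process sketch of the idea (the class and its layout are assumptions for illustration, not iris's actual lock format):

```python
class VersionedLocks:
    """Epoch-based locks: bumping the epoch 'resets' every slot in O(1),
    replacing the per-call zeroing pass and the barrier that guards it."""

    def __init__(self, n):
        self.flags = [0] * n   # zeroed once, at allocation time only
        self.epoch = 0

    def new_call(self):
        # Per-call 'reset': no memory traffic, no collective barrier.
        self.epoch += 1

    def signal(self, i):
        # Producer marks slot i done for the current call.
        self.flags[i] = self.epoch

    def is_done(self, i):
        # Consumer check: values written under earlier epochs read as not-done.
        return self.flags[i] == self.epoch
```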
Impact
Deadlock when workspace is reused across different problem sizes or when ranks take different code paths.
Fix
Separate allocation from preparation:
- _allocate_workspace() — called only when the workspace doesn't match (shape/variant changed). Handles the collective shmem.zeros for locks and aux_buffer.
- _pre_kernel_sync() — called every time, but variant-specific:
  - atomic/spinlock: C.zero_() + stream-level barrier
  - one_shot/two_shot: no-op (versioned locks + overwrite semantics)
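The separation above might look like the following sketch. The FakeShmem stub, the MatmulAllReduceWorkspace class, and the prepare driver are assumptions for illustration; only the method names _allocate_workspace/_pre_kernel_sync and the variant behavior come from the issue:

```python
class FakeShmem:
    """Stand-in for the iris shmem handle; counts collective calls."""
    def __init__(self):
        self.zeros_calls = 0
        self.barrier_calls = 0

    def zeros(self, n):
        self.zeros_calls += 1
        return [0] * n

    def barrier(self):
        self.barrier_calls += 1


class MatmulAllReduceWorkspace:
    def __init__(self):
        self.key = None          # (shape, variant) of the current buffers
        self.locks = None
        self.aux_buffer = None

    def _allocate_workspace(self, shmem, shape, variant):
        # Collective: all ranks must call shmem.zeros together, so this runs
        # only when the rank-uniform key (shape, variant) actually changed --
        # every rank takes the same branch, so no subset-of-ranks deadlock.
        self.locks = shmem.zeros(shape[0])
        self.aux_buffer = shmem.zeros(shape[0] * shape[1])
        self.key = (shape, variant)

    def _pre_kernel_sync(self, shmem, C, variant):
        # Per-call preparation; cheap and variant-specific.
        if variant in ("atomic", "spinlock"):
            for i in range(len(C)):
                C[i] = 0          # stands in for C.zero_()
            shmem.barrier()       # stands in for the stream-level barrier
        # one_shot / two_shot: no-op (versioned locks + overwrite semantics)

    def prepare(self, shmem, C, shape, variant):
        if self.key != (shape, variant):
            self._allocate_workspace(shmem, shape, variant)
        self._pre_kernel_sync(shmem, C, variant)
```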
Component
iris/ops/matmul_all_reduce.py, iris/ops/workspace.py