
matmul_all_reduce_preamble mixes collective allocation with per-call preparation #464

@aamarnat

Description


Bug

In iris/ops/matmul_all_reduce.py, matmul_all_reduce_preamble performs both:

  1. Workspace buffer allocation: shmem.zeros() for locks and aux_buffer, which is a collective operation (all ranks must call it together)
  2. Per-call preparation: C.zero_() and shmem.barrier()

If the cached workspace matches and can be reused on some ranks but not others, only a subset of ranks calls the preamble. shmem.zeros is then invoked from only that subset, deadlocking the collective.
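The hazard can be illustrated with a plain-Python analogy (a threading.Barrier standing in for the collective; this is not the iris API): when one rank reuses its cached workspace and skips the allocation path, the rank that did enter the collective blocks until something breaks it.

```python
import threading

# Analogy only: a collective requires every rank to participate,
# modeled here with a threading.Barrier for NUM_RANKS parties.
NUM_RANKS = 2
collective = threading.Barrier(NUM_RANKS)
stuck_ranks = []

def rank_fn(rank, workspace_matches):
    if not workspace_matches:
        # Only ranks whose workspace is stale enter the "collective".
        try:
            collective.wait(timeout=0.5)
        except threading.BrokenBarrierError:
            # In a real run this rank would hang forever; the timeout
            # just lets us observe the deadlock.
            stuck_ranks.append(rank)

threads = [
    threading.Thread(target=rank_fn, args=(0, False)),  # stale: allocates
    threading.Thread(target=rank_fn, args=(1, True)),   # cached: skips
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stuck_ranks)  # rank 0 was left waiting in the collective
```

With a real collective there is no timeout to rescue the stuck rank, which is why the allocation must be taken by all ranks or by none.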

Additionally, the preamble zeros the lock array and calls shmem.barrier() on every call. For lock-based variants (one_shot, two_shot), this overhead is unnecessary if versioned locks are used instead.

Impact

Deadlock when workspace is reused across different problem sizes or when ranks take different code paths.

Fix

Separate allocation from preparation:

  • _allocate_workspace() — only called when workspace doesn't match (shape/variant changed). Handles collective shmem.zeros for locks and aux_buffer.
  • _pre_kernel_sync() — called every time, but variant-specific:
    • atomic/spinlock: C.zero_() + stream-level barrier
    • one_shot/two_shot: no-op (versioned locks + overwrite semantics)
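A minimal sketch of the proposed split, keyed on (shape, variant). The helper names _allocate_workspace and _pre_kernel_sync come from this issue; the shmem stub and the Workspace/prepare shape below are assumptions for illustration, not the real iris API.

```python
class _FakeShmem:
    """Stand-in for the iris shmem handle, counting collective calls."""
    def __init__(self):
        self.zeros_calls = 0

    def zeros(self, n):
        self.zeros_calls += 1  # collective: all ranks must reach this
        return [0] * n

    def barrier(self):
        pass


class Workspace:
    def __init__(self, shmem):
        self.shmem = shmem
        self.key = None  # (shape, variant) the current buffers were built for
        self.locks = None
        self.aux_buffer = None

    def _allocate_workspace(self, shape, variant):
        # Collective path: runs only when the cached key changes; the
        # caller must guarantee every rank takes it together.
        self.locks = self.shmem.zeros(shape)
        self.aux_buffer = self.shmem.zeros(shape)
        self.key = (shape, variant)

    def _pre_kernel_sync(self, C, variant):
        # Per-call path, variant-specific.
        if variant in ("atomic", "spinlock"):
            C[:] = [0] * len(C)   # stands in for C.zero_()
            self.shmem.barrier()  # stands in for a stream-level barrier
        # one_shot / two_shot: no-op (versioned locks + overwrite semantics)

    def prepare(self, C, shape, variant):
        if self.key != (shape, variant):
            self._allocate_workspace(shape, variant)
        self._pre_kernel_sync(C, variant)


shmem = _FakeShmem()
ws = Workspace(shmem)
C = [1] * 4
ws.prepare(C, 4, "atomic")    # first call: allocates, then zeros C
ws.prepare(C, 4, "atomic")    # reuse: no collective allocation
ws.prepare(C, 4, "one_shot")  # variant change: reallocate, no sync work
print(shmem.zeros_calls)  # 4 (two buffers x two allocations)
```

The point of the split is that the reuse check gates only the collective path, so ranks that disagree on reuse can no longer strand each other inside shmem.zeros.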

Component

iris/ops/matmul_all_reduce.py, iris/ops/workspace.py


Labels

bug (Something isn't working), iris (Iris project issue)
