
matmul_all_reduce_preamble mixes collective allocation with per-call preparation #464

@aamarnat

Description


Bug

In iris/ops/matmul_all_reduce.py, matmul_all_reduce_preamble performs both:

  1. Workspace buffer allocation: shmem.zeros() for locks and aux_buffer, which is a collective operation (all ranks must call it together)
  2. Per-call preparation: C.zero_() and shmem.barrier()

If the cached workspace matches and can be reused on some ranks but not others, only a subset of ranks calls the preamble. shmem.zeros is then invoked from only that subset, deadlocking the collective.
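The hazard can be illustrated with a plain-Python analogy (a threading.Barrier standing in for the collective; this is not the iris API): when one rank reuses its cached workspace and skips the allocation path, the rank that did enter the collective blocks until something breaks it.

```python
import threading

# Analogy only: a collective requires every rank to participate,
# modeled here with a threading.Barrier for NUM_RANKS parties.
NUM_RANKS = 2
collective = threading.Barrier(NUM_RANKS)
stuck_ranks = []

def rank_fn(rank, workspace_matches):
    if not workspace_matches:
        # Only ranks whose workspace is stale enter the "collective".
        try:
            collective.wait(timeout=0.5)
        except threading.BrokenBarrierError:
            # In a real run this rank would hang forever; the timeout
            # just lets us observe the deadlock.
            stuck_ranks.append(rank)

threads = [
    threading.Thread(target=rank_fn, args=(0, False)),  # stale: allocates
    threading.Thread(target=rank_fn, args=(1, True)),   # cached: skips
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stuck_ranks)  # rank 0 was left waiting in the collective
```

With a real collective there is no timeout to rescue the stuck rank, which is why the allocation must be taken by all ranks or by none.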

Additionally, the preamble zeros the lock array and calls shmem.barrier() on every call. For lock-based variants (one_shot, two_shot), this overhead is unnecessary if versioned locks are used instead.

Impact

Deadlock when workspace is reused across different problem sizes or when ranks take different code paths.

Fix

Separate allocation from preparation:

  • _allocate_workspace() — only called when workspace doesn't match (shape/variant changed). Handles collective shmem.zeros for locks and aux_buffer.
  • _pre_kernel_sync() — called every time, but variant-specific:
    • atomic/spinlock: C.zero_() + stream-level barrier
    • one_shot/two_shot: no-op (versioned locks + overwrite semantics)
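A minimal sketch of the proposed split, keyed on (shape, variant). The helper names _allocate_workspace and _pre_kernel_sync come from this issue; the shmem stub and the Workspace/prepare shape below are assumptions for illustration, not the real iris API.

```python
class _FakeShmem:
    """Stand-in for the iris shmem handle, counting collective calls."""
    def __init__(self):
        self.zeros_calls = 0

    def zeros(self, n):
        self.zeros_calls += 1  # collective: all ranks must reach this
        return [0] * n

    def barrier(self):
        pass


class Workspace:
    def __init__(self, shmem):
        self.shmem = shmem
        self.key = None  # (shape, variant) the current buffers were built for
        self.locks = None
        self.aux_buffer = None

    def _allocate_workspace(self, shape, variant):
        # Collective path: runs only when the cached key changes; the
        # caller must guarantee every rank takes it together.
        self.locks = self.shmem.zeros(shape)
        self.aux_buffer = self.shmem.zeros(shape)
        self.key = (shape, variant)

    def _pre_kernel_sync(self, C, variant):
        # Per-call path, variant-specific.
        if variant in ("atomic", "spinlock"):
            C[:] = [0] * len(C)   # stands in for C.zero_()
            self.shmem.barrier()  # stands in for a stream-level barrier
        # one_shot / two_shot: no-op (versioned locks + overwrite semantics)

    def prepare(self, C, shape, variant):
        if self.key != (shape, variant):
            self._allocate_workspace(shape, variant)
        self._pre_kernel_sync(C, variant)


shmem = _FakeShmem()
ws = Workspace(shmem)
C = [1] * 4
ws.prepare(C, 4, "atomic")    # first call: allocates, then zeros C
ws.prepare(C, 4, "atomic")    # reuse: no collective allocation
ws.prepare(C, 4, "one_shot")  # variant change: reallocate, no sync work
print(shmem.zeros_calls)  # 4 (two buffers x two allocations)
```

The point of the split is that the reuse check gates only the collective path, so ranks that disagree on reuse can no longer strand each other inside shmem.zeros.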

Component

iris/ops/matmul_all_reduce.py, iris/ops/workspace.py


Labels

bug (Something isn't working), iris (Iris project issue)
