Releases: facebookresearch/tensor-layouts

tensor-layouts 0.3.2

15 May 18:09

What's Changed

New analysis helpers

  • permutation_parity() and is_even_permutation() — detect orientation of dense, injective layouts (thanks @neuralsorcerer, #18)
  • from_F2_matrix() — inverse constructor for to_F2_matrix() with affine + brute-force-Swizzle-extraction reconstruction; round-trip identity holds. to_F2_matrix() strengthened to accept any F2-linear ComposedLayout.
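
As a rough illustration of what a parity check on a dense, injective layout computes, the sketch below treats the layout as a permutation of 0..n-1 (a plain list of offsets) and counts cycles. It is standalone Python and does not call the library; `parity_of` is a hypothetical helper, and the 0 = even / 1 = odd convention is an assumption, not necessarily what permutation_parity() returns.

```python
def parity_of(perm):
    """Return 0 for an even permutation of range(len(perm)), 1 for an odd one."""
    n = len(perm)
    if sorted(perm) != list(range(n)):
        raise ValueError("offsets must be a permutation of 0..n-1 (dense and injective)")
    seen = [False] * n
    transpositions = 0
    for start in range(n):
        if seen[start]:
            continue
        # Walk one cycle; a cycle of length k is a product of k - 1 transpositions.
        length = 0
        i = start
        while not seen[i]:
            seen[i] = True
            i = perm[i]
            length += 1
        transpositions += length - 1
    return transpositions % 2


# Offsets of shape (2, 3) with stride (3, 1), leftmost mode fastest.
transpose_2x3 = [(i % 2) * 3 + (i // 2) for i in range(6)]
print(transpose_2x3, parity_of(transpose_2x3))
```

Here the offsets form a single 4-cycle plus two fixed points, so the permutation is odd.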

Tensor API

  • Tensor.to_list() and Tensor.copy_from() — flat copy / snapshot helpers (thanks @neuralsorcerer, #23)

Layout API

  • Layout is now purely affine. The swizzle= constructor kwarg and the Layout.swizzle attribute are removed; ComposedLayout is the canonical (and only) carrier for every swizzled / non-affine form. Code that built Layout(..., swizzle=Sw) directly should switch to compose(Sw, layout) or ComposedLayout(Sw, layout). Layout.__repr__ is now exact eval-roundtrip (Layout(shape, stride)).
  • ComposedLayout.preoffset → ComposedLayout.offset (renamed).
  • ComposedLayout's offset is now keyword-only — both ComposedLayout(Sw, L, k) and the CuTe-style ComposedLayout(Sw, k, L) porting trap now fail at the call site.
  • Swizzle is now allowed in ComposedLayout's inner slot — the inverse-form ComposedLayout(Layout, offset, Swizzle) arising from right_inverse / left_inverse on offset-bearing swizzle-fronted ComposedLayouts. coalesce() on this form is a no-op (rank-1; no structure to merge).
  • complement() now forwards through ComposedLayout (was unsupported).
  • split_outer_swizzle() — new public structural recogniser for the canonical ComposedLayout(Sw, L, offset=0) form. Replaces the private _split_zero_offset_swizzle that tensor.py had been reaching into.
  • LayoutError(ValueError), UnsupportedComposedLayoutError(NotImplementedError), TensorStorageError(ValueError) — new exception classes for catching layout-algebra and tensor-storage errors specifically. Existing except ValueError / except NotImplementedError handlers continue to catch them.
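
The backward-compatibility claim for the new exception classes follows from ordinary subclassing. The sketch below defines local stand-ins with the same names and base classes as listed above (the real classes live in the package; their import path is not shown here) and checks which handler catches each one. `classify` is a hypothetical helper for the demonstration.

```python
class LayoutError(ValueError):
    """Local stand-in: layout-algebra error."""


class UnsupportedComposedLayoutError(NotImplementedError):
    """Local stand-in: operation not supported on this composed form."""


class TensorStorageError(ValueError):
    """Local stand-in: tensor-storage error."""


def classify(exc_type):
    """Raise exc_type and report which handler caught it."""
    try:
        raise exc_type("demo")
    except LayoutError:
        return "specific LayoutError handler"
    except ValueError:
        return "existing ValueError handler"
    except NotImplementedError:
        return "existing NotImplementedError handler"


for exc in (LayoutError, TensorStorageError, UnsupportedComposedLayoutError):
    print(exc.__name__, "->", classify(exc))
```

New code can catch the narrow class first; handlers written against the built-in base classes keep working unchanged.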

Fixes

  • Tensor offset alignment with CuTe. A Tensor over a Layout-with-embedded-swizzle previously folded the external offset into the swizzle's input domain (Sw(offset + L(coord))) while a Tensor over a ComposedLayout added the offset AFTER the layout call. The two forms thus disagreed on addresses for nonzero Tensor offset. Both forms now follow CuTe's tensor(coord) == tensor.offset + tensor.layout(coord).
  • cosize(ComposedLayout) now uses max(L(i)) + 1 enumeration over the full domain — the previous delegation to inner-or-outer mis-reported the codomain extent for five common forms and could cause buffer under-allocation. O(n) instead of O(1); cached on the instance for amortised cost.
  • cosize() on embedded-swizzle Layout for non-power-of-2 shapes now correctly accounts for the swizzle's XOR (e.g. Layout(5, 1, swizzle=Swizzle(2, 0, 2)) was reporting cosize 5, true value is 6).
  • Surjectivity checks for explicit and shifted codomains (thanks @neuralsorcerer, #21)
  • Transfer swizzle through logical_product against swizzled tiles (was silently dropping the embedded swizzle and returning a semantically wrong plain layout).
  • Tensor[(slice(None), 0), 1] (slice nested in a hierarchical coordinate tuple) now raises TypeError instead of being silently passed through to slice_and_offset.
  • Drop the typing.Self fallback that imported the undeclared typing_extensions — would ImportError on a fresh 3.10 install.
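
The offset and cosize fixes above can be reproduced with a few lines of standalone Python. This is a sketch that does not use the library: `swizzle` is a hand-written CuTe-style XOR swizzle for non-negative shifts, and the layout is the Layout(5, 1) from the cosize example.

```python
def swizzle(x, bits, base, shift):
    """CuTe-style swizzle for shift >= 0: XOR the low bit field with the high one."""
    y_mask = ((1 << bits) - 1) << (base + shift)
    return x ^ ((x & y_mask) >> shift)


def sw(x):
    return swizzle(x, bits=2, base=0, shift=2)


def layout(i):
    return i  # Layout(5, 1): five elements, stride 1


# cosize as "largest address touched + 1", enumerated over the whole domain.
cosize = max(sw(layout(i)) for i in range(5)) + 1

# Two ways of applying an external tensor offset of 4.
offset = 4
folded = [sw(offset + layout(i)) for i in range(5)]  # offset folded into the swizzle input
added = [offset + sw(layout(i)) for i in range(5)]   # offset + layout(coord)

print(cosize, folded, added)
```

Index 4 maps to 4 XOR 1 = 5, so the codomain extent is 6 rather than 5, and the two offset conventions produce different address lists for the same coordinates.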

Robustness

  • Reject complement / logical_product / logical_divide on the inverse-form ComposedLayout(Layout, offset, Swizzle) with NotImplementedError.
  • Reject logical_product on ComposedLayout(Layout, Swizzle, offset) (was crashing in the affine fallback with AttributeError on .stride).
  • as_affine_layout() performs an explicit is_affine() post-check; the error points callers at as_layout_expr() for the non-affine path.
  • coalescing_efficiency / segment_analysis validate warp_size > 0 (was silently producing nonsense from min(thread_count, 0)).
  • viz raises an actionable ImportError pointing at pip install tensor-layouts[viz] when matplotlib/numpy are missing, instead of surfacing a deep ModuleNotFoundError on matplotlib internals.
  • Aligned four pre-existing exception-class inconsistencies before introducing the new hierarchy: to_F2_matrix F6 rejection ValueError → NotImplementedError; slice_modes / dice_modes structure mismatch TypeError → ValueError; prefix_product / suffix_product tuple-init-on-scalar TypeError → ValueError; _validate_order_permutation 'not iterable' ValueError → TypeError.
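
The actionable ImportError for missing visualization dependencies is a common optional-dependency pattern. A minimal sketch, assuming a hypothetical `require` helper (the library's actual mechanism is not shown in these notes):

```python
import importlib


def require(module_name, extra):
    """Import an optional dependency, or raise an ImportError naming the pip extra."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with: pip install tensor-layouts[{extra}]"
        ) from exc


json_module = require("json", "viz")  # a stdlib module, so this succeeds
```

Chaining with `from exc` keeps the original ModuleNotFoundError available in the traceback while the top-level message tells the user what to install.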

Performance

  • cosize() results cached on each ComposedLayout instance and on swizzled Layout instances (the ComposedLayout cache uses a declarative dataclass field with init/repr/eq/hash all False, so cached and uncached layouts still compare equal and remain dict-key compatible).
  • _address_bounds has an O(1) fast path for the canonical Sw o L form (bounds = (offset, offset + cosize(layout) - 1)), replacing the O(size) per-coordinate walk in _validate_storage. Works for any Tensor offset on ComposedLayout.
  • complement(ComposedLayout) decays swizzled slices to plain Layout when the swizzle's Y/Z bits aren't both touched on the surviving subspace.
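
The cache-field trick mentioned for cosize() can be sketched with a plain frozen dataclass. This is an illustration under assumptions, not the library's class: `CachedExtent` and `_cosize` are hypothetical names, and the sketch uses `compare=False`, which is the `dataclasses.field` parameter that controls participation in `__eq__`.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class CachedExtent:
    offsets: Tuple[int, ...]
    # Cache slot: excluded from __init__, __repr__, __eq__ and __hash__.
    _cosize: Optional[int] = field(
        default=None, init=False, repr=False, compare=False, hash=False
    )

    def cosize(self) -> int:
        if self._cosize is None:
            # Frozen dataclasses block normal assignment, so write the cache directly.
            object.__setattr__(self, "_cosize", max(self.offsets) + 1)
        return self._cosize


warm = CachedExtent((0, 2, 5))
cold = CachedExtent((0, 2, 5))
warm.cosize()  # fills the cache on `warm` only
print(warm == cold, hash(warm) == hash(cold), repr(warm))
```

Because the cache slot is invisible to equality and hashing, an instance with a warm cache can still be used to look up a dict entry stored under a cold one.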

Refactors (no functional change)

  • layouts.py (4.4k LOC) split into a layouts/ package with three layered modules: core (exceptions, type predicates, tuple operations, Layout, Tile, Swizzle), expr (ComposedLayout and the LayoutExpr = Layout | ComposedLayout predicates / coercers), and algebra (compose, complement, divide, product, inverses, ...). Dependency direction is strictly one-way (core ← expr ← algebra), enforced by the import graph. No public API change — every name previously importable from tensor_layouts.layouts remains importable from the same path.
  • bank_conflicts / per_group_bank_conflicts deduped via shared _bank_conflicts_for_thread_range.
  • coalescing_efficiency / per_group_coalescing deduped via shared _coalescing_for_thread_range.
  • Layout._calculate_max_offset moved to module-level _affine_max_offset (the staticmethod never used self).
  • Internal _affine_inner_strip_swizzle rename.

CI and Tests

  • Python 3.14 added to the CI matrix and a lint job added (thanks @neuralsorcerer, #17)
  • 20 new CuTe C++ oracle entries pinning complement, coalesce, compose, right_inverse on ComposedLayout form variants F2-F8; compose_truncation_paper oracle case for paper section 3.3.3.
  • CuTe C++ oracle: CUTLASS_PATH / CUTLASS_INCLUDE_DIR env-var override for out-of-tree CUTLASS installations.
  • 32 hand-written AMD oracle C-layout per-atom tests parametrized into a single ORACLE_C_LAYOUT_CASES-driven test.
  • examples/composed.py added to make examples (was the only example not exercised by the smoke target).

Docs

  • docs/layout_api.md / docs/tensor_api.md / docs/analysis_api.md and the examples/ rewritten to reflect Layout-becomes-purely-affine: no more "Layout may also carry one canonical final swizzle" framing, single-form Layout(shape, stride) repr, compose(Sw, L) always returns ComposedLayout(Sw, L, 0).
  • New 'Constructor signature vs CuTe / pycute' subsection in docs/layout_api.md documenting the ComposedLayout(outer, inner, offset=k) ordering vs CuTe's positional ComposedLayout<A, Offset, B>.
  • permutation_parity / is_even_permutation documented in docs/analysis_api.md (thanks @neuralsorcerer)
  • Document supported / unsupported ops for the inverse-form ComposedLayout in the class docstring and docs/layout_api.md. to_F2_matrix / from_F2_matrix documented in docs/analysis_api.md.

Other

  • Revert CONTRIBUTING.md change from D101685100 (thanks @FindHao, #20)

Full Changelog: v0.3.1...v0.3.2

tensor-layouts 0.3.1

22 Apr 20:35

What's Changed

New analysis helpers

  • aliasing_profile() — analysis helper for detecting layout aliasing patterns (thanks @soumyadipsarkar, #11)
  • thread_stride_profile() — analysis helper for inspecting per-thread stride behavior (thanks @soumyadipsarkar)
  • gap_profile() — layout sparsity analysis helper (thanks @soumyadipsarkar)

Layout API

  • is_empty() — helper for the unit/empty-shape layout (rank 0, size 1, multiplicative identity for composition and concatenation); distinct from a zero-sized layout like Layout((0,), (0,))
  • as_list() — helper that replaces the list(as_tuple(...)) pattern when shapes/strides need to be mutated
  • is_affine_layout() → is_affine(): renamed and widened to apply to any type with a .layout attribute; structural check (a swizzle-free ComposedLayout still returns False, since there is no machinery to coalesce one back into a flat Layout with non-zero preoffset)

Robustness

  • Reverse swizzle composition — fixed, with new pycute and CuTe C++ oracle regressions covering the failure mode
  • Exact MMA tile sizes: tile_mma_grid() now rejects tile_mnk values that are not exact multiples of the natural MMA atom shape, instead of silently floor-dividing and producing a smaller-than-requested grid.
  • Tighter layout helper validation across layout_utils.py
  • Flat 1D tensor storage required: Tensor now rejects multi-dimensional storage backings, with clearer error messages and updated docs/tensor_api.md
  • Negative layout shapes rejected at Layout construction
  • Internal compose() / divide() asserts promoted to TypeError — checks now survive python -O. src/tensor_layouts/ is now assert-free in production paths.
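
To see why exact-multiple validation matters for tile sizes, the sketch below contrasts silent floor division with an explicit check. `grid_dims` is a hypothetical helper, and raising ValueError is an assumption; the notes above do not say which exception tile_mma_grid() raises.

```python
def grid_dims(tile_mnk, atom_mnk):
    """Atoms needed along M, N, K; rejects tiles that are not exact multiples."""
    dims = []
    for name, tile, atom in zip("MNK", tile_mnk, atom_mnk):
        if tile % atom != 0:
            # An explicit raise (unlike assert) still runs under `python -O`.
            raise ValueError(
                f"tile {name}={tile} is not an exact multiple of atom {name}={atom}"
            )
        dims.append(tile // atom)
    return tuple(dims)


print(grid_dims((32, 16, 16), (16, 8, 16)))
print(40 // 16)  # floor division alone would quietly cover only 32 of 40 rows
```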

Refactors (no functional change)

  • compose() split into one helper per (lhs, rhs) case for readability
  • _draw_grid() split into single-purpose passes (font auto-sizing, base cells, hierarchy overlays, highlight overlay, value/margin labels)
  • Single-axis figure builders routed through a shared _new_axes() helper so matplotlib defaults can be tuned in one place
  • explain() dispatched through a function→handler table, keyed on the callable so wrappers/aliases resolve correctly
  • IPython detection cleaned up
  • tests/tensor.py converted to flat def test_*() style, matching every other test module

Docs

  • README: links to the algorithms / applications / GEMM example notebooks, plus a few meaningful external references
  • tests/<name>.py mirrors src/tensor_layouts/<name>.py naming convention spelled out in pyproject.toml and CONTRIBUTING.md
  • SM90 GMMA preamble explaining the warpgroup-level (128-thread) convention with shared-memory operands behind a hardware descriptor — distinct from the warp-level SM_70/80 atoms next to it
  • Thread-Value (T, V) layout convention documented in _tv_dimensions, so callers of bank_conflicts, coalescing_efficiency, etc. can interpret mode 0 vs mode 1+
  • AMD make_mfma_atom() parameters documented
  • CDNA_4x4x4 naming-vs-shape note clarified
  • Bit-twiddling intent in make_swizzle clarified
  • viz_api.md keyword fixed: num_shades → num_colors (examples would have hit TypeError)
  • NVIDIA CuTe quickstart link fixed (thanks @soumyadipsarkar)

Other

  • File restoration after inadvertent cross-project modifications (#4, thanks @paulshen / @oshannessy)

Full Changelog: v0.3.0...v0.3.1

tensor-layouts 0.3.0

20 Apr 19:56

What's Changed

Composed layouts

  • ComposedLayout: the biggest feature of release v0.3. Previously a Layout could carry a single hard-coded Swizzle, which did not compose. Layouts and Swizzles can now be composed arbitrarily: multi-stage compositions (outer ∘ preoffset ∘ inner) that cannot be collapsed into a single affine Layout; double-swizzle, affine-on-swizzled, and recursive compositions now preserve full mapping semantics
  • LayoutExpr type alias (Layout | ComposedLayout) — all public APIs that accept a layout now accept either form transparently
  • Layout traits: is_layout() widened to the CuTe-style trait; new is_affine_layout(), as_layout_expr(), as_affine_layout() for explicit trait boundaries between generic and affine-only code paths
  • Exact compose() — canonical compose(Swizzle, Layout) fast path preserved; non-canonical cases (double-swizzle, affine-on-swizzled, recursive through existing ComposedLayout) produce a ComposedLayout instead of silently losing information
  • CuTe-specific parity rules: Layout ∘ Swizzle composition, zero-preoffset collapse for Layout ∘ ComposedLayout, swizzled composed inverse support, and swizzle-aware max_common_layout() / max_common_vector()
  • Structural transforms forwarded: append, prepend, replace, group, flatten, sort, coalesce, logical_divide, logical_product operate on the inner domain of a ComposedLayout instead of dropping to affine-only assumptions
  • Slicing without offset leaks: slice_and_offset() generalized so fixed-coordinate contributions inside a nonlinear composition stay inside the resulting ComposedLayout (external offset 0) instead of being turned into an incorrect pointer offset
  • Swizzle.__hash__: Swizzle objects are now hashable, so ComposedLayout with a Swizzle outer works in sets and dicts
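
The outer ∘ preoffset ∘ inner evaluation order can be sketched without the library. `Swz` and `composed` below are hypothetical stand-ins written for this note; they only illustrate why a swizzled mapping cannot be collapsed into a single (shape, stride) pair, and why hashability of the swizzle matters for sets and dicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Swz:
    """Toy stand-in for a swizzle; frozen, so instances are hashable."""
    bits: int
    base: int
    shift: int

    def __call__(self, x):
        y_mask = ((1 << self.bits) - 1) << (self.base + self.shift)
        return x ^ ((x & y_mask) >> self.shift)


def composed(outer, inner, offset=0):
    """Return the mapping coord -> outer(offset + inner(coord))."""
    def mapping(coord):
        return outer(offset + inner(coord))
    return mapping


def identity(i):
    return i  # inner layout: 16 elements, stride 1


f = composed(Swz(2, 0, 2), identity)
offsets = [f(i) for i in range(16)]
steps = {b - a for a, b in zip(offsets, offsets[1:])}
print(offsets, steps)
```

The step between consecutive offsets is not constant, so no single-stride layout reproduces this mapping, even though it is still a bijection on 0..15.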

Tensor

  • Tensor now accepts LayoutExpr — indexing, slicing, storage validation, and address computation route through layout-expression-aware helpers; Tensor.stride remains deliberately affine-only and raises clearly on composed layouts

Analysis

  • Generic LayoutExpr consumers — image(), is_injective(), is_surjective(), is_bijective(), is_contiguous(), functionally_equal(), offset_table(), footprint(), bank_conflicts(), coalescing_efficiency(), segment_analysis(), per_group_bank_conflicts(), per_group_coalescing(), cycles(), fixed_points(), order() all accept ComposedLayout transparently
  • Affine-only helpers (to_F2_matrix, weakly_congruent, explain) now fail clearly on composed layouts via as_affine_layout()
  • max_common_vector() and max_common_layout() treat embedded-swizzle Layout(..., swizzle=...) and zero-preoffset ComposedLayout(Swizzle, inner) as the same semantic form

Inverse helpers

  • right_inverse() and left_inverse() preserve embedded swizzles by inverting the affine inner layout and recomposing the original swizzle
  • right_inverse() now skips noncontiguous sorted modes instead of terminating immediately, matching CuTe on broadcast-unit examples
  • Composition divisibility tightened: stricter truncation gate restored for partially fitting non-divisible strides while preserving valid §3.3.3 truncation cases
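
One way to read "inverting the affine inner layout and recomposing the original swizzle": a swizzle whose two bit fields do not overlap is its own inverse, so undoing Sw ∘ L amounts to applying Sw and then the inverse of L. The sketch below checks this on a 4x4 transpose layout; the function names are hypothetical and the recomposition order is this note's interpretation, not a statement about the library's internals.

```python
def swizzle(x, bits=2, base=0, shift=2):
    y_mask = ((1 << bits) - 1) << (base + shift)
    return x ^ ((x & y_mask) >> shift)


def inner(i):
    """Shape (4, 4), stride (4, 1), leftmost mode fastest: a 4x4 transpose."""
    return (i % 4) * 4 + (i // 4)


def inner_inverse(k):
    """A transpose of a square index space is its own inverse."""
    return (k % 4) * 4 + (k // 4)


def forward(i):
    return swizzle(inner(i))


def inverse(k):
    return inner_inverse(swizzle(k))


round_trip = [forward(inverse(k)) for k in range(16)]
print(round_trip)
```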

Visualization

  • draw_layout(), draw_slice(), and multi-panel rendering accept ComposedLayout directly
  • draw_slice titles reflect internal composed preoffset instead of leaking external offset
  • Parameter types widened to LayoutExpr in docs/viz_api.md

Examples & notebooks

  • examples/composed.py — runnable example covering canonical swizzle fast path, exact composed fallback, slicing/tensor offset split, and optional --draw figures
  • examples/gemm.ipynb — fully explained GEMM kernel walkthrough with layout algebra
  • examples/viz.ipynb — composed-layout discoverability note added
  • tests/paper_examples.py — full coverage of all figures (1–12) and tables (1–7) in arXiv 2603.02298, with --draw support for rendering paper figures

Bug fixes

  • oracle_cute_cpp skips gracefully when nvidia package is absent instead of crashing
  • Defensive assertions in _compose_with_tiler() and _logical_divide_with_tiler() guard against composed results leaking into affine-only rebuild paths

Docs & build

  • Composed-layout sections added to docs/layout_api.md, docs/tensor_api.md, docs/analysis_api.md, docs/viz_api.md
  • Composed-layout figures (composed_exact.png, composed_slice.png) generated and checked in
  • Makefile clean target updated for tests/figures/
  • pytest --draw conftest hook renders paper figures into tests/figures/

Tests

  • tests/composed.py — 45 regression tests covering representation contract, trait behavior, exact composition, divide/product cascades, recursive chains, hierarchical inners, full-slice identity, multi-mode, Tensor.view, and generic analysis coverage
  • tests/viz.py — cell-value and panel-color correctness tests for composed layouts
  • tests/oracle_cute_cpp.py — nonzero-preoffset composition, recursive composed chains, composed logical_divide/logical_product, make_tensor with ComposedLayout
  • tests/paper_examples.py — full arXiv 2603.02298 coverage with exact offset-value assertions

Full Changelog: v0.2.1...v0.3.0

tensor-layouts 0.2.1

09 Apr 20:35

What's Changed

Negative stride support

  • Full negative stride support across Layout, Tensor, analysis, and visualization — cosize() and compose() decompose by magnitude and carry sign, matching CuTe C++; Tensor.view() preserves base offset; storage validation uses true addressed range instead of cosize alone
  • Analysis functions (coalescing_efficiency, segment_analysis, per_group_coalescing, cycles, order) rebase the addressed footprint to a local origin for negative-stride layouts
  • Visualization TV mapping rebases negative offsets; explicit cell_labels no longer use Python negative-index wraparound
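
With negative strides the lowest address touched is no longer the base, which is why validating storage against the true addressed range differs from checking an extent alone. A minimal sketch for flat (non-nested) shapes; `address_range` is a hypothetical helper, not a library function.

```python
def address_range(shape, stride, base=0):
    """Smallest and largest address touched by a flat layout placed at `base`."""
    lo = hi = base
    for extent, step in zip(shape, stride):
        span = (extent - 1) * step
        if span < 0:
            lo += span  # negative strides extend the range downwards
        else:
            hi += span
    return lo, hi


print(address_range((4, 8), (-1, 4)))          # relative to the base
print(address_range((4, 8), (-1, 4), base=3))  # base shifted so every address is >= 0
```

A storage buffer is large enough only if both ends of this range fall inside it, so the base offset has to be at least 3 for this layout.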

CuTe conformance fixes

  • left_inverse for non-contiguous (padded) layouts — complete rewrite
  • compose to truncate unreachable modes before the divisibility check (§3.3.2 of arXiv:2603.02298v1)
  • compose and logical_divide for nested tuple tilers
  • zipped_divide, tiled_divide, flat_divide to preserve Layout tiler strides instead of silently degrading to shape tilers
  • Canonicalize stride to 0 for unit-extent modes in logical_divide
  • Layout.__call__(None) as full-slice identity, matching CuTe's slice(_, layout)

Tensor

  • tensor[:] whole-view full slice, matching the explicit tensor[:, :] behavior
  • Preserve swizzle attribute in slice_and_offset sublayout results

Analysis

  • explain(compose) crash on tuple tilers
  • explain(logical_product) to use cosize(B) for complement bound
  • Move exhaustive introspection helpers (image, is_injective, is_surjective, is_bijective, is_contiguous, functionally_equal) from layouts.py to analysis.py — keeps the core module efficient, O(size) enumeration is opt-in

Visualization

  • Vertical arrangement in draw_swizzle for wide layouts — before/after grids stack top-to-bottom when columns exceed a threshold

Testing

  • CuTe C++ oracle test suite — compiles regression cases directly against installed CUTLASS headers for compose, logical_divide, zipped_divide, tiled_divide, flat_divide, left_inverse, and logical_product; gracefully skips when CUTLASS or a C++ compiler is unavailable
  • Paper examples test suite (arXiv:2603.02298v1) with --draw pytest option for visual output
  • Fix duplicate test name shadowing draw_swizzle coverage

Robustness

  • Reject free coordinates (slices, None) in Tensor.__setitem__ with a clear TypeError guiding users to the slice-then-index pattern

Cleanup

  • Configure Ruff with correct src/tensor_layouts/ paths, add extend-exclude = ["*.ipynb"], fix lint warnings across the codebase

Docs & build

  • im2col figure and CONV→GEMM mapping clarification in applications notebook
  • Document shape_div strict scalar divisibility policy: an intentional divergence from CuTe C++ ceil_div fallback

Full Changelog: v0.2.0...v0.2.1

tensor-layouts 0.2

06 Apr 19:02

What's Changed

Tensor class

  • Storage-backed tensors with coordinate indexing (tensor[i, j]), write-through, view semantics on slicing, None as free-dimension marker, Tensor.view(layout) for same-storage reinterpretation, and __str__ with offset notation
  • size(), rank(), cosize(), depth(), mode(), flatten(), image() accept Tensors transparently
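
The storage-backed model (flat storage, a layout, coordinate indexing, write-through views) can be sketched in a few lines. `MiniTensor` is a toy written for this note with flat shapes only; it is not the library's Tensor class and its constructor signature is an assumption.

```python
class MiniTensor:
    """Toy tensor: flat list storage + (shape, stride) + base offset."""

    def __init__(self, storage, shape, stride, offset=0):
        self.storage = storage
        self.shape = shape
        self.stride = stride
        self.offset = offset

    def _address(self, coord):
        if len(coord) != len(self.shape):
            raise IndexError("coordinate rank does not match the shape")
        for c, extent in zip(coord, self.shape):
            if not 0 <= c < extent:
                raise IndexError(f"coordinate {c} out of range for extent {extent}")
        return self.offset + sum(c * s for c, s in zip(coord, self.stride))

    def __getitem__(self, coord):
        return self.storage[self._address(coord)]

    def __setitem__(self, coord, value):
        self.storage[self._address(coord)] = value  # writes go straight to storage

    def view(self, shape, stride):
        """Reinterpret the same storage with another layout (no copy)."""
        return MiniTensor(self.storage, shape, stride, self.offset)


data = list(range(6))
rows = MiniTensor(data, (2, 3), (3, 1))
cols = rows.view((3, 2), (1, 3))  # transposed view of the same storage
cols[2, 1] = 99
print(rows[1, 2], data)
```

Writing through the transposed view changes what the original tensor reads, because both share one list.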

GPU atom definitions

  • Intel AMX tile matrix multiply atoms
  • Intel Xe GPU DPAS atoms
  • AMD RDNA3/RDNA4 WMMA atoms
  • MMAAtom.__str__ / CopyAtom.__str__ for concise display
  • Community feedback request notice added to all atom definition files

Analysis

  • to_F2_matrix() — convert power-of-2 layouts to binary matrix representation over GF(2); validated against Triton's LinearLayoutConversionsTest.cpp
  • TV-aware vectorized access modeling — bank_conflicts(), coalescing_efficiency(), and segment_analysis() now iterate all values per thread for multi-mode (TV) layouts, correctly modeling vectorized loads
  • is_contiguous() as an alias for is_bijective()
  • weakly_congruent() for partial-order profile matching
  • element_bytes now required for bank_conflicts(), coalescing_efficiency(), etc.
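
The idea behind a binary-matrix view of a power-of-2 layout is that each input index bit contributes a fixed offset, and contributions combine by XOR. The sketch below stores the matrix as a list of column integers, which is a representation chosen for this note and not necessarily the output format of to_F2_matrix(); all function names are hypothetical.

```python
def layout_offset(i, shape, stride):
    """Offset of flat index i under a flat layout, leftmost mode fastest."""
    off = 0
    for extent, step in zip(shape, stride):
        off += (i % extent) * step
        i //= extent
    return off


def f2_columns(shape, stride):
    """One column per input bit: the offset reached by that bit alone."""
    size = 1
    for extent in shape:
        size *= extent
    n_bits = size.bit_length() - 1  # assumes size is a power of two
    return [layout_offset(1 << b, shape, stride) for b in range(n_bits)]


def apply_f2(columns, i):
    """Matrix-vector product over GF(2): XOR the columns selected by the bits of i."""
    out = 0
    for b, col in enumerate(columns):
        if (i >> b) & 1:
            out ^= col
    return out


shape, stride = (4, 8), (8, 1)
columns = f2_columns(shape, stride)
print([bin(c) for c in columns])
```

For this layout the XOR of the selected columns reproduces the ordinary integer offset for every index, since no two modes write to the same output bit.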

Visualization

  • draw_gemm() for matmul spatial arrangement of A, B, C operand panels
  • Hierarchical layout support in draw_composite, with auto-computed panel_size and rendering options passed through **kwargs
  • cell_labels parameter for user-supplied per-cell text
  • interleave_colors option for hue-grouped palette
  • transpose option for rank-1 column vectors
  • precision parameter for float cell labels
  • Remove show_*() functions — draw_*(filename=None) handles inline display, fixing double-render in Jupyter
  • Layout.__repr__ now returns eval-safe constructor string, distinct from Layout.__str__

Notebooks

  • algorithms.ipynb — COPY, GEMM, Grouped GEMM, REDUCE, Epilogue Fusion, and Online Softmax visualized with layout algebra
  • applications.ipynb — six layout algebra patterns from arXiv:2603.02298v1

Bug fixes

  • rank() for single-mode Layouts: rank(Layout(32)) returns 1, not 0
  • idx2crd() coordinate wrapping for scalar shapes
  • idx2crd() / crd2flat() to accept Layout objects as shape argument
  • crd2crd() to thread src_shape through per-mode recursion
  • explain() crash with tuple tilers
  • draw_slice() for 1D layouts
  • draw_composite() auto-sizing to respect grid_rows/grid_cols overrides
  • Rank≥3 panel splitting to match CuTe convention
  • slice_modes to preserve hierarchical mode boundaries
  • Tensor slicing for hierarchical specs with nested Nones
  • Trailing comma in Layout.__str__ for 1-tuple shapes
  • per_group analysis iteration for TV layouts

Robustness

  • Type validation for Layout shape and stride arguments
  • Grid overflow warning when panels exceed capacity instead of silently dropping
  • Duck-type Tensor detection in viz instead of isinstance()

Docs & build

  • Missing license headers added to all source files
  • examples and check targets in Makefile

Full Changelog: v0.1.1...v0.2.0

tensor-layouts 0.1.1

17 Mar 23:07

Minor documentation fixes so that project info displays correctly on PyPI.

tensor-layouts 0.1

17 Mar 22:51

This is the first release of the tensor-layouts library — a pure-Python implementation of NVIDIA's CuTe layout algebra. No GPU required.

Highlights

  • Full CuTe layout algebra — compose, complement, logical_divide, logical_product, coalesce, flatten, upcast, downcast, and more
  • Swizzle support — XOR-based bank conflict avoidance patterns with Swizzle(B, M, S)
  • MMA atom definitions for NVIDIA and AMD architectures:
    • NVIDIA: SM70 (Volta), SM75 (Turing), SM80 (Ampere), SM89 (Ada), SM90 (Hopper GMMA), SM100 (Blackwell UMMA), SM120 (Blackwell, RTX 50 series)
    • AMD: CDNA1 (MI100), CDNA2 (MI200), CDNA3 (MI300), CDNA3+ (gfx950)
  • Rich visualization (pip install tensor-layouts[viz]):
    • Layout grids with thread/value coloring
    • Swizzle before/after comparison views
    • MMA atom thread-value layouts
    • Combined tiled MMA grids
    • Copy atom layouts
    • Hierarchical N-level rendering with color-coded boundaries
  • Oracle tests that cross-validate against NVIDIA's pycute and AMD's matrix instruction calculator
  • Zero dependencies for the core library; only matplotlib needed for visualization
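
A small worked example of XOR-based bank conflict avoidance, as a standalone sketch that does not import the library. The 8-bank memory and the 8x8 row-major tile are toy assumptions chosen to keep the numbers small, and `swizzle` is a hand-written CuTe-style function for non-negative shifts.

```python
from collections import Counter

NUM_BANKS = 8  # toy value for illustration


def swizzle(x, bits, base, shift):
    y_mask = ((1 << bits) - 1) << (base + shift)
    return x ^ ((x & y_mask) >> shift)


def worst_conflict(offsets):
    """Largest number of accesses that land in the same bank."""
    return max(Counter(off % NUM_BANKS for off in offsets).values())


# Eight threads each read one element of column 0 of an 8x8 row-major tile.
column = [row * 8 for row in range(8)]
plain = worst_conflict(column)
swizzled = worst_conflict([swizzle(off, 3, 0, 3) for off in column])
print(plain, swizzled)
```

Without the swizzle every access falls in bank 0; with it, the row index is XORed into the low bits and the eight accesses spread across eight different banks.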

Install

pip install tensor-layouts        # core
pip install tensor-layouts[viz]   # with visualization

Links
Documentation: https://github.com/facebookresearch/tensor-layouts
PyPI: https://pypi.org/project/tensor-layouts/