15 May 18:09

jduprat

e732284

tensor-layouts 0.3.2 Latest

What's Changed

New analysis helpers

permutation_parity() and is_even_permutation() — detect orientation of dense, injective layouts (thanks @neuralsorcerer, #18)
from_F2_matrix() — inverse constructor for to_F2_matrix() with affine + brute-force-Swizzle-extraction reconstruction; round-trip identity holds. to_F2_matrix() strengthened to accept any F2-linear ComposedLayout.
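
To make the parity check concrete, here is a minimal standalone sketch that computes the parity of a permutation given as a list of offsets. It does not call the library and is not its implementation; the name `parity_of` and the 0 = even / 1 = odd convention are assumptions made for this example.

```python
def parity_of(offsets):
    """Return 0 for an even permutation of range(n), 1 for an odd one."""
    n = len(offsets)
    if sorted(offsets) != list(range(n)):
        raise ValueError("offsets must be a permutation of range(n)")
    seen = [False] * n
    transpositions = 0
    for start in range(n):
        if seen[start]:
            continue
        # Walk one cycle; a cycle of length k decomposes into k - 1 swaps.
        length = 0
        i = start
        while not seen[i]:
            seen[i] = True
            i = offsets[i]
            length += 1
        transpositions += length - 1
    return transpositions % 2


# Offsets of a shape (2, 3), stride (3, 1) layout, enumerated column-major:
# coordinate (i, j) has flat index i + 2*j and offset 3*i + j.
row_major_2x3 = [3 * i + j for j in range(3) for i in range(2)]
```

Here `row_major_2x3` is `[0, 3, 1, 4, 2, 5]`, which contains a single 4-cycle and is therefore odd under this convention.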

Tensor API

Tensor.to_list() and Tensor.copy_from() — flat copy / snapshot helpers (thanks @neuralsorcerer, #23)

Layout API

Layout is now purely affine. The swizzle= constructor kwarg and the Layout.swizzle attribute are removed; ComposedLayout is the canonical (and only) carrier for every swizzled / non-affine form. Code that built Layout(..., swizzle=Sw) directly should switch to compose(Sw, layout) or ComposedLayout(Sw, layout). Layout.__repr__ is now exact eval-roundtrip (Layout(shape, stride)).
ComposedLayout.preoffset → ComposedLayout.offset (renamed).
ComposedLayout's offset is now keyword-only — both ComposedLayout(Sw, L, k) and the CuTe-style ComposedLayout(Sw, k, L) porting trap now fail at the call site.
Swizzle is now allowed in ComposedLayout's inner slot — the inverse-form ComposedLayout(Layout, offset, Swizzle) arising from right_inverse / left_inverse on offset-bearing swizzle-fronted ComposedLayouts. coalesce() on this form is a no-op (rank-1; no structure to merge).
complement() now forwards through ComposedLayout (was unsupported).
split_outer_swizzle() — new public structural recogniser for the canonical ComposedLayout(Sw, L, offset=0) form. Replaces the private _split_zero_offset_swizzle that tensor.py had been reaching into.
LayoutError(ValueError), UnsupportedComposedLayoutError(NotImplementedError), TensorStorageError(ValueError) — new exception classes for catching layout-algebra and tensor-storage errors specifically. Existing except ValueError / except NotImplementedError handlers continue to catch them.
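
The backward-compatibility claim for the new exception classes follows from ordinary Python subclassing. The sketch below uses local stand-in classes with the same names and base classes as listed above; they are not imported from the library, and the real classes may carry extra behaviour.

```python
# Local stand-ins mirroring the base classes named in the release notes.
class LayoutError(ValueError):
    pass


class UnsupportedComposedLayoutError(NotImplementedError):
    pass


class TensorStorageError(ValueError):
    pass


def classify(exc):
    """Report which handler a caller would reach for a raised exception."""
    try:
        raise exc
    except LayoutError:
        return "layout-specific handler"
    except ValueError:
        return "generic ValueError handler"
    except NotImplementedError:
        return "generic NotImplementedError handler"


def legacy_handler(exc):
    """An older handler that only knows the builtin exception classes."""
    try:
        raise exc
    except (ValueError, NotImplementedError):
        return True
```

New code can catch `LayoutError` first for layout-algebra failures, while `legacy_handler` shows that handlers written against the builtin classes still fire.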

Fixes

Tensor offset alignment with CuTe. A Tensor over a Layout-with-embedded-swizzle previously folded the external offset into the swizzle's input domain (Sw(offset + L(coord))) while a Tensor over a ComposedLayout added the offset AFTER the layout call. The two forms thus disagreed on addresses for nonzero Tensor offset. Both forms now follow CuTe's tensor(coord) == tensor.offset + tensor.layout(coord).
cosize(ComposedLayout) now uses max(L(i)) + 1 enumeration over the full domain — the previous delegation to inner-or-outer mis-reported the codomain extent for five common forms and could cause buffer under-allocation. O(n) instead of O(1); cached on the instance for amortised cost.
cosize() on embedded-swizzle Layout for non-power-of-2 shapes now correctly accounts for the swizzle's XOR (e.g. Layout(5, 1, swizzle=Swizzle(2, 0, 2)) was reporting cosize 5, true value is 6).
Surjectivity checks for explicit and shifted codomains (thanks @neuralsorcerer, #21)
Transfer swizzle through logical_product against swizzled tiles (was silently dropping the embedded swizzle and returning a semantically wrong plain layout).
Tensor[(slice(None), 0), 1] (slice nested in a hierarchical coordinate tuple) now raises TypeError instead of being silently passed through to slice_and_offset.
Drop the typing.Self fallback that imported the undeclared typing_extensions — would ImportError on a fresh 3.10 install.
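
The offset and cosize fixes above can be checked by hand with a few lines of plain Python. This sketch does not use the library: `swizzle` is a hypothetical helper that models a CuTe-style Swizzle(B, M, S) with non-negative shift as `x ^ ((x & mask) >> S)`, and the layout is a bare lambda.

```python
def swizzle(bits, base, shift):
    """Return f(x) = x ^ ((x & mask) >> shift), assuming shift >= 0."""
    mask = ((1 << bits) - 1) << (base + shift)
    return lambda x: x ^ ((x & mask) >> shift)


sw = swizzle(2, 0, 2)        # models Swizzle(2, 0, 2) from the example above
layout = lambda i: i * 1     # models Layout(5, 1): shape 5, stride 1

# cosize by enumeration: largest address over the whole domain, plus one.
# Index 4 is XORed up to address 5, so the result is 6 rather than 5.
cosize = max(sw(layout(i)) for i in range(5)) + 1

# Two ways to apply an external tensor offset of 4 at coordinate 0.
offset = 4
folded = sw(offset + layout(0))        # offset folded into the swizzle input
added_after = offset + sw(layout(0))   # tensor.offset + tensor.layout(coord)
```

With these values `folded` is 5 and `added_after` is 4, which is the kind of disagreement between the two forms that the fix removes.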

Robustness

Reject complement / logical_product / logical_divide on the inverse-form ComposedLayout(Layout, offset, Swizzle) with NotImplementedError.
Reject logical_product on ComposedLayout(Layout, Swizzle, offset) (was crashing in the affine fallback with AttributeError on .stride).
as_affine_layout() performs an explicit is_affine() post-check; the error points callers at as_layout_expr() for the non-affine path.
coalescing_efficiency / segment_analysis validate warp_size > 0 (was silently producing nonsense from min(thread_count, 0)).
viz raises an actionable ImportError pointing at pip install tensor-layouts[viz] when matplotlib/numpy are missing, instead of surfacing a deep ModuleNotFoundError on matplotlib internals.
Aligned four pre-existing exception-class inconsistencies before introducing the new hierarchy: to_F2_matrix F6 rejection ValueError → NotImplementedError; slice_modes / dice_modes structure mismatch TypeError → ValueError; prefix_product / suffix_product tuple-init-on-scalar TypeError → ValueError; _validate_order_permutation 'not iterable' ValueError → TypeError.

Performance

cosize() results cached on each ComposedLayout instance and on swizzled Layout instances (the ComposedLayout cache uses a declarative dataclass field with init/repr/eq/hash all False, so cached and uncached layouts still compare equal and remain dict-key compatible).
_address_bounds has an O(1) fast path for the canonical Sw o L form (bounds = (offset, offset + cosize(layout) - 1)), replacing the O(size) per-coordinate walk in _validate_storage. Works for any Tensor offset on ComposedLayout.
complement(ComposedLayout) decays swizzled slices to plain Layout when the swizzle's Y/Z bits aren't both touched on the surviving subspace.
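
The cache-field pattern described for ComposedLayout can be sketched with the standard `dataclasses` module. `StridedRange` is a hypothetical stand-in, not a library class; note that the per-field equality switch in the standard API is spelled `compare`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class StridedRange:
    """Hypothetical 1-D layout stand-in: index i maps to i * stride."""
    size: int
    stride: int
    # Cache slot excluded from __init__, __repr__, __eq__ and __hash__.
    _cosize: Optional[int] = field(
        default=None, init=False, repr=False, compare=False, hash=False
    )

    def cosize(self) -> int:
        if self._cosize is None:
            # O(size) enumeration, done once per instance (assumes size >= 1).
            value = max(i * self.stride for i in range(self.size)) + 1
            # Frozen dataclasses block normal assignment, so bypass it.
            object.__setattr__(self, "_cosize", value)
        return self._cosize


warm = StridedRange(4, 2)
cold = StridedRange(4, 2)
warm.cosize()   # fills the cache on `warm` only
```

Because the cache field is excluded from comparison and hashing, `warm == cold` holds and both can be used interchangeably as dict keys, even though only one of them has a populated cache.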

Refactors (no functional change)

layouts.py (4.4k LOC) split into a layouts/ package with three layered modules: core (exceptions, type predicates, tuple operations, Layout, Tile, Swizzle), expr (ComposedLayout and the LayoutExpr = Layout | ComposedLayout predicates / coercers), and algebra (compose, complement, divide, product, inverses, ...). Dependency direction is strictly one-way (core ← expr ← algebra), enforced by the import graph. No public API change — every name previously importable from tensor_layouts.layouts remains importable from the same path.
bank_conflicts / per_group_bank_conflicts deduped via shared _bank_conflicts_for_thread_range.
coalescing_efficiency / per_group_coalescing deduped via shared _coalescing_for_thread_range.
Layout._calculate_max_offset moved to module-level _affine_max_offset (the staticmethod never used self).
Internal _affine_inner → _strip_swizzle rename.

CI and Tests

Python 3.14 added to the CI matrix and a lint job added (thanks @neuralsorcerer, #17)
20 new CuTe C++ oracle entries pinning complement, coalesce, compose, right_inverse on ComposedLayout form variants F2-F8; compose_truncation_paper oracle case for paper section 3.3.3.
CuTe C++ oracle: CUTLASS_PATH / CUTLASS_INCLUDE_DIR env-var override for out-of-tree CUTLASS installations.
32 hand-written AMD oracle C-layout per-atom tests parametrized into a single ORACLE_C_LAYOUT_CASES-driven test.
examples/composed.py added to make examples (was the only example not exercised by the smoke target).

Docs

docs/layout_api.md / docs/tensor_api.md / docs/analysis_api.md and the examples/ rewritten to reflect Layout-becomes-purely-affine: no more "Layout may also carry one canonical final swizzle" framing, single-form Layout(shape, stride) repr, compose(Sw, L) always returns ComposedLayout(Sw, L, 0).
New 'Constructor signature vs CuTe / pycute' subsection in docs/layout_api.md documenting the ComposedLayout(outer, inner, offset=k) ordering vs CuTe's positional ComposedLayout<A, Offset, B>.
permutation_parity / is_even_permutation documented in docs/analysis_api.md (thanks @neuralsorcerer)
Document supported / unsupported ops for the inverse-form ComposedLayout in the class docstring and docs/layout_api.md. to_F2_matrix / from_F2_matrix documented in docs/analysis_api.md.

Other

Revert CONTRIBUTING.md change from D101685100 (thanks @FindHao, #20)

Full Changelog: v0.3.1...v0.3.2

22 Apr 20:35

jduprat

v0.3.1

72e2038

tensor-layouts 0.3.1

What's Changed

New analysis helpers

aliasing_profile() — analysis helper for detecting layout aliasing patterns (thanks @soumyadipsarkar, #11)
thread_stride_profile() — analysis helper for inspecting per-thread stride behavior (thanks @soumyadipsarkar)
gap_profile() — layout sparsity analysis helper (thanks @soumyadipsarkar)

Layout API

is_empty() — helper for the unit/empty-shape layout (rank 0, size 1, multiplicative identity for composition and concatenation); distinct from a zero-sized layout like Layout((0,), (0,))
as_list() — helper that replaces the list(as_tuple(...)) pattern when shapes/strides need to be mutated
is_affine_layout() → is_affine() — renamed and widened to apply to any type with a .layout attribute; structural check (a swizzle-free ComposedLayout still returns False, since there is no machinery to coalesce one back into a flat Layout with non-zero preoffset)

Robustness

Reverse swizzle composition — fixed, with new pycute and CuTe C++ oracle regressions covering the failure mode
Exact MMA tile sizes — tile_mma_grid() now rejects tile_mnk values that are not exact multiples of the natural MMA atom shape, instead of silently floor-dividing and producing a smaller-than-requested grid.
Tighter layout helper validation across layout_utils.py
Flat 1D tensor storage required — Tensor now rejects multi-dimensional storage backings, with clearer error messages and updated docs/tensor_api.md
Negative layout shapes rejected at Layout construction
Internal compose() / divide() asserts promoted to TypeError — checks now survive python -O. src/tensor_layouts/ is now assert-free in production paths.
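
Two of the items above share a theme: validate with an explicit `raise` rather than floor-dividing or relying on `assert`. The sketch below is a hypothetical helper, not the library's `tile_mma_grid()`; the name `tiles_per_axis` and the choice of `ValueError` for inexact tiles are assumptions.

```python
def tiles_per_axis(tile_mnk, atom_mnk):
    """Return how many atoms fit along M, N, K, rejecting inexact tilings."""
    if len(tile_mnk) != 3 or len(atom_mnk) != 3:
        raise TypeError("tile_mnk and atom_mnk must each have three entries")
    counts = []
    for name, tile, atom in zip("MNK", tile_mnk, atom_mnk):
        # An explicit raise, unlike assert, is not stripped by `python -O`.
        if tile % atom != 0:
            raise ValueError(
                f"{name}: tile {tile} is not a multiple of atom {atom}"
            )
        counts.append(tile // atom)
    return tuple(counts)
```

For example, `tiles_per_axis((32, 16, 16), (16, 8, 16))` gives `(2, 2, 1)`, while a tile of 24 along M against an atom of 16 raises instead of silently producing a one-atom grid.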

Refactors (no functional change)

compose() split into one helper per (lhs, rhs) case for readability
_draw_grid() split into single-purpose passes (font auto-sizing, base cells, hierarchy overlays, highlight overlay, value/margin labels)
Single-axis figure builders routed through a shared _new_axes() helper so matplotlib defaults can be tuned in one place
explain() dispatched through a function→handler table, keyed on the callable so wrappers/aliases resolve correctly
IPython detection cleaned up
tests/tensor.py converted to flat def test_*() style, matching every other test module

Docs

README: links to the algorithms / applications / GEMM example notebooks, plus a few meaningful external references
tests/<name>.py mirrors src/tensor_layouts/<name>.py naming convention spelled out in pyproject.toml and CONTRIBUTING.md
SM90 GMMA preamble explaining the warpgroup-level (128-thread) convention with shared-memory operands behind a hardware descriptor — distinct from the warp-level SM_70/80 atoms next to it
Thread-Value (T, V) layout convention documented in _tv_dimensions, so callers of bank_conflicts, coalescing_efficiency, etc. can interpret mode 0 vs mode 1+
AMD make_mfma_atom() parameters documented
CDNA_4x4x4 naming-vs-shape note clarified
Bit-twiddling intent in make_swizzle clarified
viz_api.md keyword fixed: num_shades → num_colors (examples would have hit TypeError)
NVIDIA CuTe quickstart link fixed (thanks @soumyadipsarkar)

Other

File restoration after inadvertent cross-project modifications (#4, thanks @paulshen / @oshannessy)

Full Changelog: v0.3.0...v0.3.1

Contributors

paulshen and soumyadipsarkar

20 Apr 19:56

jduprat

v0.3.0

289daaa

tensor-layouts 0.3.0

What's Changed

Composed layouts

ComposedLayout: the biggest feature of release v0.3. A Layout could previously accept a single Swizzle, but that path was hard-coded and did not compose. Layouts and Swizzles can now be composed arbitrarily: multi-stage compositions (outer ∘ preoffset ∘ inner) that cannot be collapsed into a single affine Layout; double-swizzle, affine-on-swizzled, and recursive compositions now preserve full mapping semantics
LayoutExpr type alias (Layout | ComposedLayout) — all public APIs that accept a layout now accept either form transparently
Layout traits — is_layout() widened to the CuTe-style trait; new is_affine_layout(), as_layout_expr(), as_affine_layout() for explicit trait boundaries between generic and affine-only code paths
Exact compose() — canonical compose(Swizzle, Layout) fast path preserved; non-canonical cases (double-swizzle, affine-on-swizzled, recursive through existing ComposedLayout) produce a ComposedLayout instead of silently losing information
CuTe-specific parity rules — Layout ∘ Swizzle composition, zero-preoffset collapse for Layout ∘ ComposedLayout, swizzled composed inverse support, and swizzle-aware max_common_layout() / max_common_vector()
Structural transforms forwarded — append, prepend, replace, group, flatten, sort, coalesce, logical_divide, logical_product operate on the inner domain of a ComposedLayout instead of dropping to affine-only assumptions
Slicing without offset leaks — slice_and_offset() generalized so fixed-coordinate contributions inside a nonlinear composition stay inside the resulting ComposedLayout (external offset 0) instead of being turned into an incorrect pointer offset
Swizzle.__hash__ — Swizzle objects are now hashable, so ComposedLayout with a Swizzle outer works in sets and dicts

Tensor

Tensor now accepts LayoutExpr — indexing, slicing, storage validation, and address computation route through layout-expression-aware helpers; Tensor.stride remains deliberately affine-only and raises clearly on composed layouts

Analysis

Generic LayoutExpr consumers — image(), is_injective(), is_surjective(), is_bijective(), is_contiguous(), functionally_equal(), offset_table(), footprint(), bank_conflicts(), coalescing_efficiency(), segment_analysis(), per_group_bank_conflicts(), per_group_coalescing(), cycles(), fixed_points(), order() all accept ComposedLayout transparently
Affine-only helpers (to_F2_matrix, weakly_congruent, explain) now fail clearly on composed layouts via as_affine_layout()
max_common_vector() and max_common_layout() treat embedded-swizzle Layout(..., swizzle=...) and zero-preoffset ComposedLayout(Swizzle, inner) as the same semantic form

Inverse helpers

right_inverse() and left_inverse() preserve embedded swizzles by inverting the affine inner layout and recomposing the original swizzle
right_inverse() now skips noncontiguous sorted modes instead of terminating immediately, matching CuTe on broadcast-unit examples
Composition divisibility tightened: stricter truncation gate restored for partially fitting non-divisible strides while preserving valid §3.3.3 truncation cases

Visualization

draw_layout(), draw_slice(), and multi-panel rendering accept ComposedLayout directly
draw_slice titles reflect internal composed preoffset instead of leaking external offset
Parameter types widened to LayoutExpr in docs/viz_api.md

Examples & notebooks

examples/composed.py — runnable example covering canonical swizzle fast path, exact composed fallback, slicing/tensor offset split, and optional --draw figures
examples/gemm.ipynb — fully explained GEMM kernel walkthrough with layout algebra
examples/viz.ipynb — composed-layout discoverability note added
tests/paper_examples.py — full coverage of all figures (1–12) and tables (1–7) in arXiv 2603.02298, with --draw support for rendering paper figures

Bug fixes

oracle_cute_cpp skips gracefully when nvidia package is absent instead of crashing
Defensive assertions in _compose_with_tiler() and _logical_divide_with_tiler() guard against composed results leaking into affine-only rebuild paths

Docs & build

Composed-layout sections added to docs/layout_api.md, docs/tensor_api.md, docs/analysis_api.md, docs/viz_api.md
Composed-layout figures (composed_exact.png, composed_slice.png) generated and checked in
Makefile clean target updated for tests/figures/
pytest --draw conftest hook renders paper figures into tests/figures/

Tests

tests/composed.py — 45 regression tests covering representation contract, trait behavior, exact composition, divide/product cascades, recursive chains, hierarchical inners, full-slice identity, multi-mode, Tensor.view, and generic analysis coverage
tests/viz.py — cell-value and panel-color correctness tests for composed layouts
tests/oracle_cute_cpp.py — nonzero-preoffset composition, recursive composed chains, composed logical_divide/logical_product, make_tensor with ComposedLayout
tests/paper_examples.py — full arXiv 2603.02298 coverage with exact offset-value assertions

Full Changelog: v0.2.1...v0.3.0

09 Apr 20:35

jduprat

v0.2.1

68dfea5

tensor-layouts 0.2.1

What's Changed

Negative stride support

Full negative stride support across Layout, Tensor, analysis, and visualization — cosize() and compose() decompose by magnitude and carry sign, matching CuTe C++; Tensor.view() preserves base offset; storage validation uses true addressed range instead of cosize alone
Analysis functions (coalescing_efficiency, segment_analysis, per_group_coalescing, cycles, order) rebase the addressed footprint to a local origin for negative-stride layouts
Visualization TV mapping rebases negative offsets; explicit cell_labels no longer use Python negative-index wraparound
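
To see why storage validation needs the true addressed range once strides can be negative, the sketch below computes the lowest and highest offsets reached by a flat affine layout. It is a standalone illustration; `address_bounds` and `required_storage` are hypothetical names, and every extent is assumed to be at least 1.

```python
def address_bounds(shape, stride):
    """Return (lowest, highest) offset reached by a flat affine layout."""
    lo = hi = 0
    for extent, step in zip(shape, stride):
        reach = (extent - 1) * step   # offset of the last index in this mode
        if reach < 0:
            lo += reach               # negative strides extend the low end
        else:
            hi += reach               # positive strides extend the high end
    return lo, hi


def required_storage(shape, stride):
    """Number of contiguous elements spanned by the addressed range."""
    lo, hi = address_bounds(shape, stride)
    return hi - lo + 1
```

For shape (4, 2) with stride (-1, 4) the bounds are (-3, 4), so the layout spans 8 elements and needs a base offset of at least 3 to stay inside its storage.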

CuTe conformance fixes

left_inverse for non-contiguous (padded) layouts — complete rewrite
compose to truncate unreachable modes before the divisibility check (§3.3.2 of arXiv:2603.02298v1)
compose and logical_divide for nested tuple tilers
zipped_divide, tiled_divide, flat_divide to preserve Layout tiler strides instead of silently degrading to shape tilers
Canonicalize stride to 0 for unit-extent modes in logical_divide
Layout.__call__(None) as full-slice identity, matching CuTe's slice(_, layout)

Tensor

tensor[:] whole-view full slice, matching the explicit tensor[:, :] behavior
Preserve swizzle attribute in slice_and_offset sublayout results

Analysis

explain(compose) crash on tuple tilers
explain(logical_product) to use cosize(B) for complement bound
Move exhaustive introspection helpers (image, is_injective, is_surjective, is_bijective, is_contiguous, functionally_equal) from layouts.py to analysis.py — keeps the core module efficient, O(size) enumeration is opt-in

Visualization

Vertical arrangement in draw_swizzle for wide layouts — before/after grids stack top-to-bottom when columns exceed a threshold

Testing

CuTe C++ oracle test suite — compiles regression cases directly against installed CUTLASS headers for compose, logical_divide, zipped_divide, tiled_divide, flat_divide, left_inverse, and logical_product; gracefully skips when CUTLASS or a C++ compiler is unavailable
Paper examples test suite (arXiv:2603.02298v1) with --draw pytest option for visual output
Fix duplicate test name shadowing draw_swizzle coverage

Robustness

Reject free coordinates (slices, None) in Tensor.__setitem__ with a clear TypeError guiding users to the slice-then-index pattern

Cleanup

Configure Ruff with correct src/tensor_layouts/ paths, add extend-exclude = ["*.ipynb"], fix lint warnings across the codebase

Docs & build

im2col figure and CONV→GEMM mapping clarification in applications notebook
Document shape_div strict scalar divisibility policy — intentional divergence from CuTe C++ ceil_div fallback

Full Changelog: v0.2.0...v0.2.1

06 Apr 19:02

jduprat

v0.2.0

b3b9332

tensor-layouts 0.2

What's Changed

Tensor class

Storage-backed tensors with coordinate indexing (tensor[i, j]), write-through, view semantics on slicing, None as free-dimension marker, Tensor.view(layout) for same-storage reinterpretation, and __str__ with offset notation
size(), rank(), cosize(), depth(), mode(), flatten(), image() accept Tensors transparently

GPU atom definitions

Intel AMX tile matrix multiply atoms
Intel Xe GPU DPAS atoms
AMD RDNA3/RDNA4 WMMA atoms
MMAAtom.__str__ / CopyAtom.__str__ for concise display
Community feedback request notice added to all atom definition files

Analysis

to_F2_matrix() — convert power-of-2 layouts to binary matrix representation over GF(2); validated against Triton's LinearLayoutConversionsTest.cpp
TV-aware vectorized access modeling — bank_conflicts(), coalescing_efficiency(), and segment_analysis() now iterate all values per thread for multi-mode (TV) layouts, correctly modeling vectorized loads
is_contiguous() as an alias for is_bijective()
weakly_congruent() for partial-order profile matching
element_bytes now required for bank_conflicts(), coalescing_efficiency(), etc.
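
The idea behind a binary-matrix view of a power-of-2 layout can be sketched without the library: each input bit contributes one column, and applying the matrix means XORing the columns selected by the set bits of the index. The helpers below are hypothetical, and storing each column as a plain integer is an assumption, not the representation used by to_F2_matrix().

```python
def f2_columns(fn, n_bits):
    """Column k is the image of the basis vector 1 << k, stored as an int."""
    return [fn(1 << k) for k in range(n_bits)]


def apply_f2(columns, x):
    """Multiply the bit-vector x by the matrix: XOR the selected columns."""
    out = 0
    for k, col in enumerate(columns):
        if (x >> k) & 1:
            out ^= col
    return out


def layout_4x8(idx):
    """Shape (4, 8), stride (8, 1), indexed by a column-major flat index."""
    i, j = idx % 4, idx // 4
    return 8 * i + 1 * j


columns = f2_columns(layout_4x8, 5)   # 32 elements, so 5 input bits
```

Here `columns` is `[8, 16, 1, 2, 4]`, and `apply_f2(columns, x)` reproduces `layout_4x8(x)` for every index because the two modes occupy disjoint output bits, which makes addition and XOR coincide.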

Visualization

draw_gemm() for matmul spatial arrangement of A, B, C operand panels
Hierarchical layout support in draw_composite, with auto-computed panel_size and rendering options passed through **kwargs
cell_labels parameter for user-supplied per-cell text
interleave_colors option for hue-grouped palette
transpose option for rank-1 column vectors
precision parameter for float cell labels
Remove show_*() functions — draw_*(filename=None) handles inline display, fixing double-render in Jupyter
Layout.__repr__ now returns eval-safe constructor string, distinct from Layout.__str__

Notebooks

algorithms.ipynb — COPY, GEMM, Grouped GEMM, REDUCE, Epilogue Fusion, and Online Softmax visualized with layout algebra
applications.ipynb — six layout algebra patterns from arXiv:2603.02298v1

Bug fixes

rank() for single-mode Layouts — rank(Layout(32)) returns 1, not 0
idx2crd() coordinate wrapping for scalar shapes
idx2crd() / crd2flat() to accept Layout objects as shape argument
crd2crd() to thread src_shape through per-mode recursion
explain() crash with tuple tilers
draw_slice() for 1D layouts
draw_composite() auto-sizing to respect grid_rows/grid_cols overrides
Rank≥3 panel splitting to match CuTe convention
slice_modes to preserve hierarchical mode boundaries
Tensor slicing for hierarchical specs with nested Nones
Trailing comma in Layout.__str__ for 1-tuple shapes
per_group analysis iteration for TV layouts

Robustness

Type validation for Layout shape and stride arguments
Grid overflow warning when panels exceed capacity instead of silently dropping
Duck-type Tensor detection in viz instead of isinstance()

Docs & build

Missing license headers added to all source files
examples and check targets in Makefile

Full Changelog: v0.1.1...v0.2.0

17 Mar 23:07

jduprat

v0.1.1

b658f20

tensor-layouts 0.1.1

Minor documentation fixes so project info looks correct on PyPI.

17 Mar 22:51

jduprat

v0.1.0

112a023

tensor-layouts 0.1

This is the first release of the tensor-layouts library — a pure-Python implementation of NVIDIA's CuTe layout algebra. No GPU required.

Highlights

Full CuTe layout algebra — compose, complement, logical_divide, logical_product, coalesce, flatten, upcast, downcast, and more
Swizzle support — XOR-based bank conflict avoidance patterns with Swizzle(B, M, S)
MMA atom definitions for NVIDIA and AMD architectures:
- NVIDIA: SM70 (Volta), SM75 (Turing), SM80 (Ampere), SM89 (Ada), SM90 (Hopper GMMA), SM100 (Blackwell UMMA), SM120 (Blackwell B200)
- AMD: CDNA1 (MI100), CDNA2 (MI200), CDNA3 (MI300), CDNA3+ (gfx950)
Rich visualization (pip install tensor-layouts[viz]):
- Layout grids with thread/value coloring
- Swizzle before/after comparison views
- MMA atom thread-value layouts
- Combined tiled MMA grids
- Copy atom layouts
- Hierarchical N-level rendering with color-coded boundaries
Oracle tests that cross-validate against NVIDIA's pycute and AMD's matrix instruction calculator
Zero dependencies for the core library; only matplotlib needed for visualization
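
As a rough illustration of XOR-based bank conflict avoidance, the sketch below counts conflicts for a column read of an 8x8 row-major tile before and after a swizzle. It is plain Python and does not import the library: `swizzle` is a hypothetical helper modelling Swizzle(B, M, S) with non-negative shift, and the 8-bank, one-element-per-bank memory model is a toy assumption.

```python
from collections import Counter


def swizzle(bits, base, shift):
    """Return f(x) = x ^ ((x & mask) >> shift), assuming shift >= 0."""
    mask = ((1 << bits) - 1) << (base + shift)
    return lambda x: x ^ ((x & mask) >> shift)


NUM_BANKS = 8   # toy model: one element per bank, banks repeat every 8


def worst_conflict(addresses):
    """Largest number of addresses that land in the same bank."""
    banks = Counter(addr % NUM_BANKS for addr in addresses)
    return max(banks.values())


# Eight threads read column 0 of an 8x8 row-major tile.
column = [row * 8 + 0 for row in range(8)]
sw = swizzle(3, 0, 3)   # XOR the row bits into the column bits

plain = worst_conflict(column)
swizzled = worst_conflict(sw(addr) for addr in column)
```

Without the swizzle all eight addresses fall in bank 0; with it, row `r` lands in bank `r`, so the read is conflict-free in this toy model.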

Install

pip install tensor-layouts        # core
pip install tensor-layouts[viz]   # with visualization

Links

Documentation: https://github.com/facebookresearch/tensor-layouts
PyPI: https://pypi.org/project/tensor-layouts/

Releases: facebookresearch/tensor-layouts
