Releases: facebookresearch/tensor-layouts
tensor-layouts 0.3.2
What's Changed
New analysis helpers
- permutation_parity() and is_even_permutation(): detect orientation of dense, injective layouts (thanks @neuralsorcerer, #18)
- from_F2_matrix(): inverse constructor for to_F2_matrix() with affine + brute-force-Swizzle-extraction reconstruction; round-trip identity holds.
- to_F2_matrix() strengthened to accept any F2-linear ComposedLayout.
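A dense, injective layout is a permutation of 0..n-1, so its orientation can be read off its offset table by counting cycles. The following is a minimal standalone sketch of that idea: `parity_of_offsets` is a hypothetical helper, not the library's permutation_parity(), and the two offset tables are written out by hand assuming CuTe's leftmost-mode-fastest index ordering.

```python
def parity_of_offsets(offsets):
    """Return 0 if the permutation is even, 1 if it is odd.

    Hypothetical standalone helper; offsets[i] is the offset that a
    layout assigns to linear index i.
    """
    n = len(offsets)
    if sorted(offsets) != list(range(n)):
        raise ValueError("offsets must be a permutation of 0..n-1")
    seen = [False] * n
    parity = 0
    for start in range(n):
        if seen[start]:
            continue
        # Walk one cycle; a cycle of length k equals k - 1 transpositions.
        length = 0
        i = start
        while not seen[i]:
            seen[i] = True
            i = offsets[i]
            length += 1
        parity ^= (length - 1) & 1
    return parity


# Shape (2, 2) with strides (1, 2) visits offsets in order: the identity.
column_major = [0, 1, 2, 3]
# Shape (2, 2) with strides (2, 1) swaps the two middle offsets.
row_major = [0, 2, 1, 3]

print(parity_of_offsets(column_major))  # 0 (even)
print(parity_of_offsets(row_major))     # 1 (odd)
```

Cycle counting runs in O(n) and avoids the O(n²) inversion count that a naive parity check would need.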
Tensor API
- Tensor.to_list() and Tensor.copy_from(): flat copy / snapshot helpers (thanks @neuralsorcerer, #23)
Layout API
- Layout is now purely affine. The swizzle= constructor kwarg and the Layout.swizzle attribute are removed; ComposedLayout is the canonical (and only) carrier for every swizzled / non-affine form. Code that built Layout(..., swizzle=Sw) directly should switch to compose(Sw, layout) or ComposedLayout(Sw, layout).
- Layout.__repr__ is now exact eval-roundtrip (Layout(shape, stride)).
- ComposedLayout.preoffset → ComposedLayout.offset (renamed).
- ComposedLayout's offset is now keyword-only: both ComposedLayout(Sw, L, k) and the CuTe-style ComposedLayout(Sw, k, L) porting trap now fail at the call site.
- Swizzle is now allowed in ComposedLayout's inner slot: the inverse-form ComposedLayout(Layout, offset, Swizzle) arising from right_inverse / left_inverse on offset-bearing swizzle-fronted ComposedLayouts. coalesce() on this form is a no-op (rank-1; no structure to merge).
- complement() now forwards through ComposedLayout (was unsupported).
- split_outer_swizzle(): new public structural recogniser for the canonical ComposedLayout(Sw, L, offset=0) form. Replaces the private _split_zero_offset_swizzle that tensor.py had been reaching into.
- LayoutError(ValueError), UnsupportedComposedLayoutError(NotImplementedError), TensorStorageError(ValueError): new exception classes for catching layout-algebra and tensor-storage errors specifically. Existing except ValueError / except NotImplementedError handlers continue to catch them.
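To illustrate why a keyword-only offset turns the positional porting trap into an immediate error, here is a small standalone sketch. `ComposedSketch` is a hypothetical stand-in, not the library's ComposedLayout, and it assumes the CuTe-style evaluation rule outer(offset + inner(index)); the two toy maps are likewise invented for the example.

```python
class ComposedSketch:
    """Hypothetical stand-in for a composed layout: outer(offset + inner(i))."""

    def __init__(self, outer, inner, *, offset=0):
        # The bare * makes offset keyword-only, so a third positional
        # argument is rejected when the object is constructed.
        self.outer = outer
        self.inner = inner
        self.offset = offset

    def __call__(self, index):
        return self.outer(self.offset + self.inner(index))


def flip_low_bit(x):
    return x ^ 1      # toy non-affine outer map


def double(i):
    return 2 * i      # toy affine inner map with stride 2


ok = ComposedSketch(flip_low_bit, double, offset=1)
print(ok(3))  # flip_low_bit(1 + 6) = 7 ^ 1 = 6

for bad_args in [(flip_low_bit, double, 1), (flip_low_bit, 1, double)]:
    try:
        ComposedSketch(*bad_args)
    except TypeError as exc:
        print("rejected:", exc)
```

Both positional orderings fail the same way, so code ported from a positional (outer, offset, inner) convention cannot silently swap the offset and the inner layout.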
Fixes
- Tensor offset alignment with CuTe. A Tensor over a Layout-with-embedded-swizzle previously folded the external offset into the swizzle's input domain (Sw(offset + L(coord))) while a Tensor over a ComposedLayout added the offset AFTER the layout call. The two forms thus disagreed on addresses for nonzero Tensor offset. Both forms now follow CuTe's tensor(coord) == tensor.offset + tensor.layout(coord).
- cosize(ComposedLayout) now uses max(L(i)) + 1 enumeration over the full domain; the previous delegation to inner-or-outer mis-reported the codomain extent for five common forms and could cause buffer under-allocation. O(n) instead of O(1); cached on the instance for amortised cost.
- cosize() on embedded-swizzle Layout for non-power-of-2 shapes now correctly accounts for the swizzle's XOR (e.g. Layout(5, 1, swizzle=Swizzle(2, 0, 2)) was reporting cosize 5, true value is 6).
- Surjectivity checks for explicit and shifted codomains (thanks @neuralsorcerer, #21)
- Transfer swizzle through logical_product against swizzled tiles (was silently dropping the embedded swizzle and returning a semantically wrong plain layout).
- Tensor[(slice(None), 0), 1] (slice nested in a hierarchical coordinate tuple) now raises TypeError instead of being silently passed through to slice_and_offset.
- Drop the typing.Self fallback that imported the undeclared typing_extensions; it would raise ImportError on a fresh 3.10 install.
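The cosize and offset-placement fixes can be checked by hand with a few lines of plain Python. This sketch does not use the library: the `swizzle` helper is hypothetical and assumes the CuTe convention for Swizzle(B, M, S) with a non-negative shift (XOR the B bits starting at position M + S down into the B bits starting at position M).

```python
def swizzle(bits, base, shift):
    """XOR swizzle in the style of CuTe's Swizzle<B, M, S>, for shift >= 0."""
    if shift < 0:
        raise ValueError("this sketch only handles shift >= 0")
    mask = ((1 << bits) - 1) << (base + shift)
    return lambda x: x ^ ((x & mask) >> shift)


sw = swizzle(2, 0, 2)

# cosize = largest offset + 1, enumerated over the whole domain.
# Inner layout: shape 5, stride 1, so L(i) = i.
offsets = [sw(i) for i in range(5)]
print(offsets)            # [0, 1, 2, 3, 5]
print(max(offsets) + 1)   # 6, not 5

# Offset placement: inner layout of shape 4, stride 1, external offset 4.
tensor_offset = 4
folded = [sw(tensor_offset + i) for i in range(4)]   # Sw(offset + L(coord))
added = [tensor_offset + sw(i) for i in range(4)]    # offset + Sw(L(coord))
print(folded)   # [5, 4, 7, 6]
print(added)    # [4, 5, 6, 7]
```

Index 4 is the first input with a bit inside the swizzle mask, which is why a shape of 5 reaches offset 5 and why the two offset conventions diverge as soon as the offset is nonzero.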
Robustness
- Reject complement / logical_product / logical_divide on the inverse-form ComposedLayout(Layout, offset, Swizzle) with NotImplementedError.
- Reject logical_product on ComposedLayout(Layout, Swizzle, offset) (was crashing in the affine fallback with AttributeError on .stride).
- as_affine_layout() performs an explicit is_affine() post-check; the error points callers at as_layout_expr() for the non-affine path.
- coalescing_efficiency / segment_analysis validate warp_size > 0 (was silently producing nonsense from min(thread_count, 0)).
- viz raises an actionable ImportError pointing at pip install tensor-layouts[viz] when matplotlib/numpy are missing, instead of surfacing a deep ModuleNotFoundError on matplotlib internals.
- Aligned four pre-existing exception-class inconsistencies before introducing the new hierarchy: to_F2_matrix F6 rejection ValueError → NotImplementedError; slice_modes / dice_modes structure mismatch TypeError → ValueError; prefix_product / suffix_product tuple-init-on-scalar TypeError → ValueError; _validate_order_permutation 'not iterable' ValueError → TypeError.
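Because each new exception class derives from a built-in, handlers written before the hierarchy existed keep working. The sketch below re-declares three classes with the same names and base classes as listed in these notes, purely for illustration; the library's real classes may carry more detail, and the error messages and the `run_with_legacy_handlers` helper are invented.

```python
# Standalone re-declarations mirroring the documented base classes.
class LayoutError(ValueError):
    pass


class UnsupportedComposedLayoutError(NotImplementedError):
    pass


class TensorStorageError(ValueError):
    pass


def run_with_legacy_handlers(action):
    """Older-style handler that knows nothing about the new classes."""
    try:
        action()
    except ValueError as exc:
        return "ValueError branch caught " + type(exc).__name__
    except NotImplementedError as exc:
        return "NotImplementedError branch caught " + type(exc).__name__
    return "no error"


def bad_layout():
    raise LayoutError("illustrative message: shape and stride disagree")


def unsupported_op():
    raise UnsupportedComposedLayoutError("illustrative message: unsupported form")


def bad_storage():
    raise TensorStorageError("illustrative message: storage too small")


for action in (bad_layout, unsupported_op, bad_storage):
    print(run_with_legacy_handlers(action))
```

New code can catch the narrower classes to separate layout-algebra failures from unrelated ValueErrors raised elsewhere in the same call.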
Performance
- cosize() results cached on each ComposedLayout instance and on swizzled Layout instances (the ComposedLayout cache uses a declarative dataclass field with init/repr/eq/hash all False, so cached and uncached layouts still compare equal and remain dict-key compatible).
- _address_bounds has an O(1) fast path for the canonical Sw o L form (bounds = (offset, offset + cosize(layout) - 1)), replacing the O(size) per-coordinate walk in _validate_storage. Works for any Tensor offset on ComposedLayout.
- complement(ComposedLayout) decays swizzled slices to plain Layout when the swizzle's Y/Z bits aren't both touched on the surviving subspace.
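The caching pattern described for ComposedLayout can be sketched with the standard library alone. `OffsetTable` below is a hypothetical value type, not a class from tensor-layouts; note that `dataclasses.field` spells the equality switch `compare` rather than `eq`.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class OffsetTable:
    """Hypothetical immutable value type with a lazily cached extent."""

    offsets: Tuple[int, ...]
    # Excluded from __init__, __repr__, __eq__ and __hash__, so the cache
    # never affects how two instances compare or hash.
    _cosize: Optional[int] = field(
        default=None, init=False, repr=False, compare=False, hash=False
    )

    def cosize(self) -> int:
        if self._cosize is None:
            # frozen=True blocks normal assignment, so bypass it for the cache.
            object.__setattr__(self, "_cosize", max(self.offsets) + 1)
        return self._cosize


warm = OffsetTable((0, 1, 2, 3, 5))
cold = OffsetTable((0, 1, 2, 3, 5))
warm.cosize()                      # fills the cache on `warm` only

print(warm == cold)                # True
print(hash(warm) == hash(cold))    # True
print({warm: "buffer"}[cold])      # buffer
print(warm)                        # OffsetTable(offsets=(0, 1, 2, 3, 5))
```

Keeping the cache out of equality and hashing is what lets a layout that has already computed its extent stand in for a fresh one as a dictionary key.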
Refactors (no functional change)
- layouts.py (4.4k LOC) split into a layouts/ package with three layered modules: core (exceptions, type predicates, tuple operations, Layout, Tile, Swizzle), expr (ComposedLayout and the LayoutExpr = Layout | ComposedLayout predicates / coercers), and algebra (compose, complement, divide, product, inverses, ...). Dependency direction is strictly one-way (core ← expr ← algebra), enforced by the import graph. No public API change: every name previously importable from tensor_layouts.layouts remains importable from the same path.
- bank_conflicts / per_group_bank_conflicts deduped via shared _bank_conflicts_for_thread_range.
- coalescing_efficiency / per_group_coalescing deduped via shared _coalescing_for_thread_range.
- Layout._calculate_max_offset moved to module-level _affine_max_offset (the staticmethod never used self).
- Internal _affine_inner → _strip_swizzle rename.
CI and Tests
- Python 3.14 added to the CI matrix and a lint job added (thanks @neuralsorcerer, #17)
- 20 new CuTe C++ oracle entries pinning complement, coalesce, compose, right_inverse on ComposedLayout form variants F2-F8; compose_truncation_paper oracle case for paper section 3.3.3.
- CuTe C++ oracle: CUTLASS_PATH / CUTLASS_INCLUDE_DIR env-var override for out-of-tree CUTLASS installations.
- 32 hand-written AMD oracle C-layout per-atom tests parametrized into a single ORACLE_C_LAYOUT_CASES-driven test.
- examples/composed.py added to make examples (was the only example not exercised by the smoke target).
Docs
- docs/layout_api.md / docs/tensor_api.md / docs/analysis_api.md and the examples/ rewritten to reflect Layout-becomes-purely-affine: no more "Layout may also carry one canonical final swizzle" framing, single-form Layout(shape, stride) repr, compose(Sw, L) always returns ComposedLayout(Sw, L, 0).
- New 'Constructor signature vs CuTe / pycute' subsection in docs/layout_api.md documenting the ComposedLayout(outer, inner, offset=k) ordering vs CuTe's positional ComposedLayout<A, Offset, B>.
- permutation_parity / is_even_permutation documented in docs/analysis_api.md (thanks @neuralsorcerer)
- Document supported / unsupported ops for the inverse-form ComposedLayout in the class docstring and docs/layout_api.md.
- to_F2_matrix / from_F2_matrix documented in docs/analysis_api.md.
Other
Full Changelog: v0.3.1...v0.3.2
tensor-layouts 0.3.1
What's Changed
New analysis helpers
- aliasing_profile(): analysis helper for detecting layout aliasing patterns (thanks @soumyadipsarkar, #11)
- thread_stride_profile(): analysis helper for inspecting per-thread stride behavior (thanks @soumyadipsarkar)
- gap_profile(): layout sparsity analysis helper (thanks @soumyadipsarkar)
Layout API
- is_empty(): helper for the unit/empty-shape layout (rank 0, size 1, multiplicative identity for composition and concatenation); distinct from a zero-sized layout like Layout((0,), (0,))
- as_list(): helper that replaces the list(as_tuple(...)) pattern when shapes/strides need to be mutated
- is_affine_layout() → is_affine(): renamed and widened to apply to any type with a .layout attribute; structural check (a swizzle-free ComposedLayout still returns False, since there is no machinery to coalesce one back into a flat Layout with non-zero preoffset)
Robustness
- Reverse swizzle composition: fixed, with new pycute and CuTe C++ oracle regressions covering the failure mode
- Exact MMA tile sizes: tile_mma_grid() now rejects tile_mnk values that are not exact multiples of the natural MMA atom shape, instead of silently floor-dividing and producing a smaller-than-requested grid.
- Tighter layout helper validation across layout_utils.py
- Flat 1D tensor storage required: Tensor now rejects multi-dimensional storage backings, with clearer error messages and updated docs/tensor_api.md
- Negative layout shapes rejected at Layout construction
- Internal compose() / divide() asserts promoted to TypeError: checks now survive python -O. src/tensor_layouts/ is now assert-free in production paths.
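Two of the items above share one idea: validate with an explicit raise rather than floor-dividing or asserting. The sketch below shows that idea on a hypothetical `tiles_per_mode` helper; it is not tile_mma_grid(), and the choice of ValueError for a non-multiple is an assumption made for the example.

```python
def tiles_per_mode(tile_mnk, atom_mnk):
    """Hypothetical helper: how many atoms fit along each of M, N and K."""
    counts = []
    for name, tile, atom in zip("MNK", tile_mnk, atom_mnk):
        # Explicit raises still fire under `python -O`, which strips asserts.
        if not isinstance(tile, int) or not isinstance(atom, int):
            raise TypeError(f"{name}: tile and atom extents must be ints")
        if tile <= 0 or atom <= 0 or tile % atom != 0:
            raise ValueError(
                f"{name}: tile extent {tile} is not a positive multiple "
                f"of atom extent {atom}"
            )
        counts.append(tile // atom)
    return tuple(counts)


print(tiles_per_mode((32, 16, 16), (16, 8, 16)))   # (2, 2, 1)

try:
    # Floor division would silently give (1, 1, 1) here and cover only
    # 16 of the 24 requested rows.
    tiles_per_mode((24, 8, 16), (16, 8, 16))
except ValueError as exc:
    print("rejected:", exc)
```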
Refactors (no functional change)
- compose() split into one helper per (lhs, rhs) case for readability
- _draw_grid() split into single-purpose passes (font auto-sizing, base cells, hierarchy overlays, highlight overlay, value/margin labels)
- Single-axis figure builders routed through a shared _new_axes() helper so matplotlib defaults can be tuned in one place
- explain() dispatched through a function→handler table, keyed on the callable so wrappers/aliases resolve correctly
- IPython detection cleaned up
- tests/tensor.py converted to flat def test_*() style, matching every other test module
Docs
- README: links to the algorithms / applications / GEMM example notebooks, plus a few meaningful external references
- tests/<name>.py mirrors src/tensor_layouts/<name>.py naming convention spelled out in pyproject.toml and CONTRIBUTING.md
- SM90 GMMA preamble explaining the warpgroup-level (128-thread) convention with shared-memory operands behind a hardware descriptor, distinct from the warp-level SM_70/80 atoms next to it
- Thread-Value (T, V) layout convention documented in _tv_dimensions, so callers of bank_conflicts, coalescing_efficiency, etc. can interpret mode 0 vs mode 1+
- AMD make_mfma_atom() parameters documented
- CDNA_4x4x4 naming-vs-shape note clarified
- Bit-twiddling intent in make_swizzle clarified
- viz_api.md keyword fixed: num_shades → num_colors (examples would have hit TypeError)
- NVIDIA CuTe quickstart link fixed (thanks @soumyadipsarkar)
Other
Full Changelog: v0.3.0...v0.3.1
tensor-layouts 0.3.0
What's Changed
Composed layouts
- ComposedLayout: the biggest feature of release v0.3. Layout could already accept a single Swizzle, but that path was hard-coded and did not compose. Layouts and Swizzles can now be composed arbitrarily: multi-stage compositions (outer ∘ preoffset ∘ inner) that cannot be collapsed into a single affine Layout; double-swizzle, affine-on-swizzled, and recursive compositions now preserve full mapping semantics
- LayoutExpr type alias (Layout | ComposedLayout): all public APIs that accept a layout now accept either form transparently
- Layout traits: is_layout() widened to the CuTe-style trait; new is_affine_layout(), as_layout_expr(), as_affine_layout() for explicit trait boundaries between generic and affine-only code paths
- Exact compose(): canonical compose(Swizzle, Layout) fast path preserved; non-canonical cases (double-swizzle, affine-on-swizzled, recursive through existing ComposedLayout) produce a ComposedLayout instead of silently losing information
- CuTe-specific parity rules: Layout ∘ Swizzle composition, zero-preoffset collapse for Layout ∘ ComposedLayout, swizzled composed inverse support, and swizzle-aware max_common_layout() / max_common_vector()
- Structural transforms forwarded: append, prepend, replace, group, flatten, sort, coalesce, logical_divide, logical_product operate on the inner domain of a ComposedLayout instead of dropping to affine-only assumptions
- Slicing without offset leaks: slice_and_offset() generalized so fixed-coordinate contributions inside a nonlinear composition stay inside the resulting ComposedLayout (external offset 0) instead of being turned into an incorrect pointer offset
- Swizzle.__hash__: Swizzle objects are now hashable, so ComposedLayout with a Swizzle outer works in sets and dicts
Tensor
- Tensor now accepts LayoutExpr: indexing, slicing, storage validation, and address computation route through layout-expression-aware helpers; Tensor.stride remains deliberately affine-only and raises clearly on composed layouts
Analysis
- Generic LayoutExpr consumers: image(), is_injective(), is_surjective(), is_bijective(), is_contiguous(), functionally_equal(), offset_table(), footprint(), bank_conflicts(), coalescing_efficiency(), segment_analysis(), per_group_bank_conflicts(), per_group_coalescing(), cycles(), fixed_points(), order() all accept ComposedLayout transparently
- Affine-only helpers (to_F2_matrix, weakly_congruent, explain) now fail clearly on composed layouts via as_affine_layout()
- max_common_vector() and max_common_layout() treat embedded-swizzle Layout(..., swizzle=...) and zero-preoffset ComposedLayout(Swizzle, inner) as the same semantic form
Inverse helpers
- right_inverse() and left_inverse() preserve embedded swizzles by inverting the affine inner layout and recomposing the original swizzle
- right_inverse() now skips noncontiguous sorted modes instead of terminating immediately, matching CuTe on broadcast-unit examples
- Composition divisibility tightened: stricter truncation gate restored for partially fitting non-divisible strides while preserving valid §3.3.3 truncation cases
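As a reminder of what a right inverse means, the sketch below builds one by brute force from an offset table: R is a right inverse of L when L(R(k)) == k for every k in a contiguous prefix of L's image. Both helpers are hypothetical and handle only flat, swizzle-free shapes; the library's right_inverse() works symbolically on shapes and strides instead.

```python
from itertools import product


def offsets_of(shape, stride):
    """Offsets of a flat strided layout, leftmost mode varying fastest."""
    table = []
    # product() varies its last argument fastest, so feed the modes reversed.
    for coord in product(*(range(s) for s in reversed(shape))):
        table.append(sum(c * d for c, d in zip(reversed(coord), stride)))
    return table


def right_inverse_table(offsets):
    """Table R with offsets[R[k]] == k over the contiguous image from 0."""
    first_index = {}
    for index, off in enumerate(offsets):
        first_index.setdefault(off, index)
    table = []
    k = 0
    while k in first_index:
        table.append(first_index[k])
        k += 1
    return table


layout = offsets_of((2, 3), (3, 1))
inverse = right_inverse_table(layout)

print(layout)    # [0, 3, 1, 4, 2, 5]
print(inverse)   # [0, 2, 4, 1, 3, 5]
print([layout[i] for i in inverse])             # [0, 1, 2, 3, 4, 5]
print(inverse == offsets_of((3, 2), (2, 1)))    # True
```

For this example the brute-force table coincides with the strided layout of shape (3, 2) and stride (2, 1), which is the kind of closed form an algebraic inverse would return directly.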
Visualization
- draw_layout(), draw_slice(), and multi-panel rendering accept ComposedLayout directly
- draw_slice titles reflect internal composed preoffset instead of leaking external offset
- Parameter types widened to LayoutExpr in docs/viz_api.md
Examples & notebooks
- examples/composed.py: runnable example covering canonical swizzle fast path, exact composed fallback, slicing/tensor offset split, and optional --draw figures
- examples/gemm.ipynb: fully explained GEMM kernel walkthrough with layout algebra
- examples/viz.ipynb: composed-layout discoverability note added
- tests/paper_examples.py: full coverage of all figures (1–12) and tables (1–7) in arXiv 2603.02298, with --draw support for rendering paper figures
Bug fixes
- oracle_cute_cpp skips gracefully when nvidia package is absent instead of crashing
- Defensive assertions in _compose_with_tiler() and _logical_divide_with_tiler() guard against composed results leaking into affine-only rebuild paths
Docs & build
- Composed-layout sections added to docs/layout_api.md, docs/tensor_api.md, docs/analysis_api.md, docs/viz_api.md
- Composed-layout figures (composed_exact.png, composed_slice.png) generated and checked in
- Makefile clean target updated for tests/figures/
- pytest --draw conftest hook renders paper figures into tests/figures/
Tests
- tests/composed.py: 45 regression tests covering representation contract, trait behavior, exact composition, divide/product cascades, recursive chains, hierarchical inners, full-slice identity, multi-mode, Tensor.view, and generic analysis coverage
- tests/viz.py: cell-value and panel-color correctness tests for composed layouts
- tests/oracle_cute_cpp.py: nonzero-preoffset composition, recursive composed chains, composed logical_divide / logical_product, make_tensor with ComposedLayout
- tests/paper_examples.py: full arXiv 2603.02298 coverage with exact offset-value assertions
Full Changelog: v0.2.1...v0.3.0
tensor-layouts 0.2.1
What's Changed
Negative stride support
- Full negative stride support across Layout, Tensor, analysis, and visualization — cosize() and compose() decompose by magnitude and carry sign, matching CuTe C++; Tensor.view() preserves base offset; storage validation uses true addressed range instead of cosize alone
- Analysis functions (coalescing_efficiency, segment_analysis, per_group_coalescing, cycles, order) rebase the addressed footprint to a local origin for negative-stride layouts
- Visualization TV mapping rebases negative offsets; explicit cell_labels no longer use Python negative-index wraparound
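With negative strides the addressed range no longer starts at 0, which is why storage validation needs the true minimum and maximum offsets. A minimal sketch of that computation for flat shapes follows; the function names are hypothetical, every extent is assumed to be at least 1, and the brute-force version is there only to cross-check the closed form.

```python
from itertools import product


def addressed_range(shape, stride):
    """Closed-form (min, max) offset of a flat strided layout."""
    lowest = sum((s - 1) * d for s, d in zip(shape, stride) if d < 0)
    highest = sum((s - 1) * d for s, d in zip(shape, stride) if d > 0)
    return lowest, highest


def brute_force_range(shape, stride):
    """Same result by enumerating every coordinate."""
    offsets = [
        sum(c * d for c, d in zip(coord, stride))
        for coord in product(*(range(s) for s in shape))
    ]
    return min(offsets), max(offsets)


print(addressed_range((4,), (-1,)))        # (-3, 0)
print(addressed_range((2, 3), (3, -1)))    # (-2, 3)
print(brute_force_range((2, 3), (3, -1)))  # (-2, 3)

lowest, highest = addressed_range((2, 3), (3, -1))
print(highest - lowest + 1)   # 6 elements of storage
print(-lowest)                # base offset 2 keeps every access in bounds
```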
CuTe conformance fixes
- left_inverse for non-contiguous (padded) layouts — complete rewrite
- compose to truncate unreachable modes before the divisibility check (§3.3.2 of arXiv:2603.02298v1)
- compose and logical_divide for nested tuple tilers
- zipped_divide, tiled_divide, flat_divide to preserve Layout tiler strides instead of silently degrading to shape tilers
- Canonicalize stride to 0 for unit-extent modes in logical_divide
- Layout.__call__(None) as full-slice identity, matching CuTe's slice(_, layout)
Tensor
- tensor[:] whole-view full slice, matching the explicit tensor[:, :] behavior
- Preserve swizzle attribute in slice_and_offset sublayout results
Analysis
- explain(compose) crash on tuple tilers
- explain(logical_product) to use cosize(B) for complement bound
- Move exhaustive introspection helpers (image, is_injective, is_surjective, is_bijective, is_contiguous, functionally_equal) from layouts.py to analysis.py — keeps the core module efficient, O(size) enumeration is opt-in
Visualization
- Vertical arrangement in draw_swizzle for wide layouts — before/after grids stack top-to-bottom when columns exceed a threshold
Testing
- CuTe C++ oracle test suite — compiles regression cases directly against installed CUTLASS headers for compose, logical_divide, zipped_divide, tiled_divide, flat_divide, left_inverse, and logical_product; gracefully skips when CUTLASS or a C++ compiler is unavailable
- Paper examples test suite (arXiv:2603.02298v1) with --draw pytest option for visual output
- Fix duplicate test name shadowing draw_swizzle coverage
Robustness
- Reject free coordinates (slices, None) in Tensor.__setitem__ with a clear TypeError guiding users to the slice-then-index pattern
Cleanup
- Configure Ruff with correct src/tensor_layouts/ paths, add extend-exclude = ["*.ipynb"], fix lint warnings across the codebase
Docs & build
- im2col figure and CONV→GEMM mapping clarification in applications notebook
- Document shape_div strict scalar divisibility policy, an intentional divergence from CuTe C++ ceil_div fallback
Full Changelog: v0.2.0...v0.2.1
tensor-layouts 0.2
What's Changed
Tensor class
- Storage-backed tensors with coordinate indexing (tensor[i, j]), write-through, view semantics on slicing, None as free-dimension marker, Tensor.view(layout) for same-storage reinterpretation, and __str__ with offset notation
- size(), rank(), cosize(), depth(), mode(), flatten(), image() accept Tensors transparently
GPU atom definitions
- Intel AMX tile matrix multiply atoms
- Intel Xe GPU DPAS atoms
- AMD RDNA3/RDNA4 WMMA atoms
- MMAAtom.__str__ / CopyAtom.__str__ for concise display
- Community feedback request notice added to all atom definition files
Analysis
- to_F2_matrix(): convert power-of-2 layouts to binary matrix representation over GF(2); validated against Triton's LinearLayoutConversionsTest.cpp
- TV-aware vectorized access modeling: bank_conflicts(), coalescing_efficiency(), and segment_analysis() now iterate all values per thread for multi-mode (TV) layouts, correctly modeling vectorized loads
- is_contiguous() as an alias for is_bijective()
- weakly_congruent() for partial-order profile matching
- element_bytes now required for bank_conflicts(), coalescing_efficiency(), etc.
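The GF(2) view of a power-of-2 layout can be illustrated without the library. In the sketch below each matrix column is stored as an integer: the offset produced by a single input bit. That column convention and all three function names are assumptions made for this example, not the representation that to_F2_matrix() returns.

```python
def layout_offset(index, shape, stride):
    """Evaluate a flat strided layout at a linear index (leftmost mode fastest)."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


def f2_columns(shape, stride):
    """Column j is the offset produced by input bit j (linear index 2**j)."""
    size = 1
    for extent in shape:
        size *= extent
    n_bits = size.bit_length() - 1
    if size != 1 << n_bits:
        raise ValueError("every extent must be a power of 2")
    return [layout_offset(1 << j, shape, stride) for j in range(n_bits)]


def apply_f2(columns, index):
    """Matrix-vector product over GF(2): XOR the columns of the set bits."""
    out = 0
    for j, column in enumerate(columns):
        if (index >> j) & 1:
            out ^= column
    return out


shape, stride = (4, 8), (8, 1)
columns = f2_columns(shape, stride)
print(columns)   # [8, 16, 1, 2, 4]

# The layout is F2-linear when XOR of columns reproduces every offset.
print(all(apply_f2(columns, i) == layout_offset(i, shape, stride)
          for i in range(32)))   # True
```

The check succeeds here because the two modes write to disjoint offset bits, so integer addition and XOR agree; a layout whose strides overlap in their bits would fail it.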
Visualization
- draw_gemm() for matmul spatial arrangement of A, B, C operand panels
- Hierarchical layout support in draw_composite, with auto-computed panel_size and rendering options passed through **kwargs
- cell_labels parameter for user-supplied per-cell text
- interleave_colors option for hue-grouped palette
- transpose option for rank-1 column vectors
- precision parameter for float cell labels
- Remove show_*() functions: draw_*(filename=None) handles inline display, fixing double-render in Jupyter
- Layout.__repr__ now returns eval-safe constructor string, distinct from Layout.__str__
Notebooks
- algorithms.ipynb — COPY, GEMM, Grouped GEMM, REDUCE, Epilogue Fusion, and Online Softmax visualized with layout algebra
- applications.ipynb — six layout algebra patterns from arXiv:2603.02298v1
Bug fixes
- rank() for single-mode Layouts: rank(Layout(32)) returns 1, not 0
- idx2crd() coordinate wrapping for scalar shapes
- idx2crd() / crd2flat() to accept Layout objects as shape argument
- crd2crd() to thread src_shape through per-mode recursion
- explain() crash with tuple tilers
- draw_slice() for 1D layouts
- draw_composite() auto-sizing to respect grid_rows/grid_cols overrides
- Rank≥3 panel splitting to match CuTe convention
- slice_modes to preserve hierarchical mode boundaries
- Tensor slicing for hierarchical specs with nested Nones
- Trailing comma in Layout.__str__ for 1-tuple shapes
- per_group analysis iteration for TV layouts
Robustness
- Type validation for Layout shape and stride arguments
- Grid overflow warning when panels exceed capacity instead of silently dropping
- Duck-type Tensor detection in viz instead of isinstance()
Docs & build
- Missing license headers added to all source files
- examples and check targets in Makefile
Full Changelog: v0.1.1...v0.2.0
tensor-layouts 0.1.1
Minor documentation fixes so project info looks correct on PyPI.
tensor-layouts 0.1
This is the first release of the tensor-layouts library — a pure-Python implementation of NVIDIA's CuTe layout algebra. No GPU required.
Highlights
- Full CuTe layout algebra — compose, complement, logical_divide, logical_product, coalesce, flatten, upcast, downcast, and more
- Swizzle support — XOR-based bank conflict avoidance patterns with Swizzle(B, M, S)
- MMA atom definitions for NVIDIA and AMD architectures:
- NVIDIA: SM70 (Volta), SM75 (Turing), SM80 (Ampere), SM89 (Ada), SM90 (Hopper GMMA), SM100 (Blackwell UMMA), SM120 (Blackwell B200)
- AMD: CDNA1 (MI100), CDNA2 (MI200), CDNA3 (MI300), CDNA3+ (gfx950)
- Rich visualization (pip install tensor-layouts[viz]):
- Layout grids with thread/value coloring
- Swizzle before/after comparison views
- MMA atom thread-value layouts
- Combined tiled MMA grids
- Copy atom layouts
- Hierarchical N-level rendering with color-coded boundaries
- Oracle tests that cross-validate against NVIDIA's pycute and AMD's matrix instruction calculator
- Zero dependencies for the core library; only matplotlib needed for visualization
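To see why an XOR swizzle avoids bank conflicts, consider one column of a row-major 32x32 tile. The sketch below uses a deliberately simplified memory model, which is an assumption: 32 banks with one element per bank, so the bank is the offset modulo 32. The XOR written inline folds the row index into the column bits, which under the CuTe convention would correspond to Swizzle(5, 0, 5) for offsets below 1024.

```python
NUM_BANKS = 32   # assumed: one element per bank, 32 banks
ROW_LENGTH = 32  # elements per row of the tile


def banks_hit(offsets):
    """Number of distinct banks touched by a group of accesses."""
    return len({offset % NUM_BANKS for offset in offsets})


def xor_rows_into_columns(offset):
    """Fold bits 5..9 (the row index) into bits 0..4 (the column index)."""
    return offset ^ ((offset >> 5) & 31)


# 32 threads each read column 0 of a different row.
column = [row * ROW_LENGTH for row in range(32)]
swizzled = [xor_rows_into_columns(offset) for offset in column]

print(banks_hit(column))     # 1: every access lands in bank 0
print(banks_hit(swizzled))   # 32: one access per bank

# The swizzle only permutes offsets, so the tile still fits in 1024 elements.
all_offsets = sorted(xor_rows_into_columns(o) for o in range(1024))
print(all_offsets == list(range(1024)))   # True
```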
Install
pip install tensor-layouts # core
pip install tensor-layouts[viz] # with visualization
Links
Documentation: https://github.com/facebookresearch/tensor-layouts
PyPI: https://pypi.org/project/tensor-layouts/