Skip to content

Rebase/cherry-pick xsmm-main on top of triton-lang/triton-cpu/main#49

Open
adam-smnk wants to merge 407 commits into
libxsmm:xsmm-mainfrom
adam-smnk:rebase-on-main-25-Jan-09
Open

Rebase/cherry-pick xsmm-main on top of triton-lang/triton-cpu/main#49
adam-smnk wants to merge 407 commits into
libxsmm:xsmm-mainfrom
adam-smnk:rebase-on-main-25-Jan-09

Conversation

@adam-smnk
Copy link
Copy Markdown

No description provided.

ThomasRaoux and others added 30 commits November 21, 2024 14:55
The pass was reordering scf.if operations without checking the extra
dependencies coming from the region.
For now just prevent this case although this part of the code might
still be fragile.
If you build using the `CMakeLists.txt` and not `setup.py` and you build
in `Release` then you get

```
/__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp: In function ‘std::pair<mlir::Type, mlir::Type> mlir::TypesFromMfmaId(MLIRContext*, MfmaTypeId)’:
Warning: /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp:240:1: warning: control reaches end of non-void function [-Wreturn-type]
```
While working on some higher dimension tensor kernels, I noticed poor
performance due to the fact that layouts wouldn't propagate to local
loads. Since we do allow layout folding with local store and local
alloc, this seems like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVidia
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)
…riton-lang#5139)

Adding a shortcut case for fp8 MFMA to dot operand layout conversion
that avoids using shared memory, to speed up FP8 attention kernels.
…#5189 (triton-lang#5200)

We simplify the implementation of `getElemsPerThread` and strengthen the
preconditions of `getRepForOperand`.

More generally, we should try to minimise the calls to `isAmpere` and
`isHopper` throughout the codebase. I'll do a pass fixing many of these
once we land LLs for `ldmatrix` and Hopper.
TMA fences require CUDA toolchain 12.3 or greater, but current gating
does not check the CUDA toolchain version. This causes
`test_experimental_tma.py` to fail when run with older CUDA toolchains.

## Before
With cuda-12.0:
```
55 failed, 9 passed in 18.11s
```

With cuda-12.4:
```
64 passed in 11.99s
```

## After
With cuda-12.0:
```
9 passed, 55 skipped in 4.26s
```

With cuda-12.4:
```
64 passed in 11.96s
```
If you build with `-DTRITON_BUILD_UT=OFF` on Mac you will get something
like

```
-- Looking for histedit.h
CMake Error at /opt/homebrew/Cellar/cmake/3.30.5/share/cmake/Modules/CheckIncludeFile.cmake:90 (try_compile):
  Unknown extension ".c" for file
-- Looking for histedit.h - not found

    /Users/runner/work/triton/triton/triton-build/CMakeFiles/CMakeScratch/TryCompile-QA06d6/CheckIncludeFile.c

  try_compile() works only for enabled languages.  Currently these are:

    CXX

  See project() command to enable other languages.
Call Stack (most recent call first):
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/FindLibEdit.cmake:28 (check_include_file)
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/LLVMConfig.cmake:177 (find_package)
  llvm-bd9145c8-macos-arm64/lib/cmake/mlir/MLIRConfig.cmake:10 (find_package)
```

because `C` isn't an enabled project language.
)

This PR disables inline of print related functions, which speeds up
compilation of test_scan_layouts dramatically.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
triton-lang#5153 fixed
the issue; but we missed enabling one of the disabled
case.
This is causing some performance regression. I'll investigate and reland
it.
Reverts triton-lang#5219
…ion (FP8)" (triton-lang#5240)

It is causing performance regression, revert until it can be
investigated
Reverts triton-lang#5139
If a kernel is launched on a thread which has not initialized a CUDA
context (as can happen in the linked issue), it will throw an error. A
simple fix is to call `cudaFree(0)` to establish a device context.

Fixes triton-lang#3729
…#5234)

Currently the llvm path changes every time the pin updates which makes
it annoying to use the included tools. e.g. I use the tablegen language
server, but currently need to update my editor config every time the
llvm pin changes.

This adds a stable symlink which for me is
`~/.triton/llvm/llvm-macos-x64`. This will always point to the most
recent version of llvm used to build triton.

As a bonus this also refactors the symlink update code which was
copy-pasted a few times.
Don't pipeline the dot accumulator in the default heuristic.
In the finer grain control will allow user to decide.
…ang#5249)

recommit of triton-lang#5219

While working on some higher dimension tensor kernels, I noticed poor
performance due to the fact that layouts wouldn't propagate to local
loads. Since we do allow layout folding with local store and local
alloc, this seems like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVidia
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Co-authored-by: Matthew Brookhart <matthewbrookhart@gmail.com>
Upstreaming some of our Windows related changes assuming that there is
interest in this
triton-lang#5094 (comment)
and hoping that it will not make it much more difficult to support this
CMake file.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This commit unified the names of header guards in third_party/amd.
Since StreamPipelineV2 has been the default for a while, this
commit promoted StreamPipelineV2 to the general
StreamPipeline by removing 'v2' suffix.
Also cleans up some includes clang thinks are unused.
Add folding for a transpose of a splat constant.

---------

Co-authored-by: peterbell10 <peterbell10@live.co.uk>
…n-lang#5235)

The legacy path has some bugs for cases like `kWidth=1`. I'm starting to
port Hopper to use LLs to try to isolate them.
Refactored namespace hierarchy by squeezing separate
namespace hierarchy together.
)

Fixes proton unit tests when upgrading to ROCm 6.2 by
adding missing event handlers.

Magic number is replaced with the corresponding enum
value which was added by upgrading the HIP headers
triton-lang#5077.
ienkovich and others added 26 commits December 6, 2024 23:54
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
* Fix isSigned in PrintOp

* Add float16 support for print

* Support float16 printing for old compilers
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
* Add pass to decompose matmul to FMA operations.

Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>

* Use block pointers and padding in 03-matrix-multiplication-cpu.py.

* Fix review comments.

Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>

---------

Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
* Improve AMX lowering to minimize loads and stores.

Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>

* Support bfloat16 in CPU matmul tutorials.

Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>

---------

Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
* [Setup] Skip hatchet pip package for now

This does not exist for Darwin + Arm64.
TODO: Enable this selectively when possible.

* [CPU][driver] Skip non-existent sys paths

* [mac-arm64] Add GH CI support

- look into faster triton install
- enable bf16 tests
- enable openmp
Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>
…bxsmm#30)

* Matrix multiplication tutorial block pointer variant (libxsmm#1)

Adds a `USE_BLOCK_POINTER` flag to the matmul_kernel so we can get IR for pointers-to-tensors instead of tensors-of-pointers.

* Vector to XSMM (libxsmm#3)

Implements lowering pass from vector to XSMM microkernels.
libxsmm is added as an external dependency together with general MLIR infrastructure for handling XSMM code generation and runtime execution.
The XSMM lowering is optional and can be enabled at JIT step by environment variable TRITON_CPU_XSMM=1

libxsmm is built as a shared library and linked with XSMM-related libraries. These are also added to the Python infrastructure.
Additionally, general MLIR utilities are imported to allow analysis, code generation and microkernel execution.
Initially, a simple pattern mapping vector contraction to an XSMM kernel is added.

* [XSMM] Place contraction accumulation buffer outside reduction loop (libxsmm#5)

Contraction lowering now moves accumulation buffer outside of a reduction loop when possible.

This reduces data movement between memory and registers needed to accommodate mixed memref and vector abstractions.

* Triton to XSMM (libxsmm#4)

Adds lowering pass from triton to XSMM microkernels.
XSMM utility APIs are generalized to work on opaque operations
representing contractions.

A simple pattern mapping tt.dot to XSMM kernel is added.
The runtime lowering to XSMM is now controlled by two separate flags:
- TRITON_CPU_VECTOR_XSMM=1 to lower from vector as before
- TRITON_CPU_TRITON_XSMM=1 to lower from triton ops

* Lift -triton-raise-block-pointer pass from intel-xpu-backend-for-triton (libxsmm#7)

* Lift -triton-raise-block-pointer pass from intel-xpu-backend-for-triton

Code was in turn taken from triton-shared (though does not use the tts
dialect).

* [triton][XSMM] Hoist accumulation buffer (libxsmm#8)

Ports hoisting from Vector to XSMM pass to Triton lowering.
Dot lowering now moves accumulation buffer outside of a reduction loop
when possible.

* Bump libxsmm (libxsmm#9)

Updates libxsmm version.
Brings support for vnni sw pipeline.

* Enable bfloat16 support (libxsmm#10)

Extends XSMM code generation to allow for mixed precision computations to match triton requirements for <bf16 x bf16 -> f32> contraction. Data type selection is added as a global variable to the matmul tutorial.

BF16 can suffer from some inaccuracies compared to PyTorch baseline. However, the difference appears to be the same between native triton-cpu and XSMM lowering - no mismatch on SPR.
The matmul tutorial is aligned more with the main branch.
V2 backend benchmarking is disable due to its instabilities.
Default tile sizes are increased to improve general performance.

* Dynamic shape/stride/offset support by way of memref.extract_strided_metadata (libxsmm#11)

* Matmul tutorial - K padding and dynamic K block size (libxsmm#12)

Adds two new optional flags to the matmul tutorial:
- K dim padding - pads input matrices into multiple
  of chosen BLOCK_SIZE_K
- dynamic K blocking - overrides set BLOCK_SIZE_K
  and adjusts it based on the input K dimension;
  input is padded if needed

The main motivation is to allow testing with larger reduction
dimension blocks without kernel lossing support for various sizes.
Padding is required to meet triton's requirement for power-of-2 sizes.
Dynamic blocking can be used to decrease reduction dimension range or
completely eliminate it.

Allowing the kernel to work on larger K blocks is also helpful for
future rewriting of GEMM into BRGEMM to ensure larger batch dimension.

* Matmul tutorial - cache padding (libxsmm#14)

Adds extra optional padding that can be use to ensure that input
matrices' strides are non-power-of-two to improve cache behavior.

Currently, it is most useful with DYNAMIC_K_BLOCK enabled.

* Lower GEMM to BRGEMM kernel (libxsmm#13)

Extends contraction lowering to XSMM by rewriting plain GEMM into
a BRGEMM kernel when possible.

The rewrite improves performance of larger K block sizes thanks to
extra reduction dim tiling. Use of BRGEMM kernel also enables online
VNNI packing for BF16.

* Matmul tutorial - external preprocessing (libxsmm#15)

Adds an optional flag to move matmul input preprocessing
outside of the benchmarked kernel.
This option allows to exclude preprocessing overhead from
performance measurements.

* Fix up useless num_threads arg

* Matmul tutorial - external preprocessing (libxsmm#15)

Adds an optional flag to move matmul input preprocessing
outside of the benchmarked kernel.
This option allows to exclude preprocessing overhead from
performance measurements.

* Utility libxsmm Python extension (libxsmm#17)

Adds a python wrapper for a parallelized in-place copy function using libxsmm and OpenMP.
It is intended to be used for efficient tensor padding implementation.

The libxsmm path have to be specified through env variables:
  - XSMM_ROOT_DIR - path to libxsmm root dir with headers
  - XSMM_LIB_DIR - path to libxsmm.so location

libxsmm .so also has to be available during runtime execution e.g., exposed through LD_LIBRARY_PATH.
The XSMM python module can be built and installed using command:
  pip install -e ./third_party/cpu/python/

* Reduction loop GEMM to XSMM BRGEMM (libxsmm#18)

Adds experimental rewrite collapsing reduction loop over GEMM into a BRGEMM ukernel.

The pattern matches the hand-written kernel using block pointers and is not compatible with IR generated by triton pointer raising. Direct lowering to XSMM allows to bypass triton load restriction when K dimension is not power-of-two.
The pattern is quite brittle but functional for the matmul tutorial example.

The rewriting is disable by default and can be enabled with environment variable:
  TRITON_CPU_LOOP_BRGEMM_XSMM=1

* Matmul tutorial - pad weights only (libxsmm#19)

Adds option to apply padding only to matrix B.

This allows to explore potential speedups by limiting padding to
weights which is reasonably common strategy in e.g., ML inference.
Full padding still has to occur when K dimension is padded to avoid
dimension mismatch and/or meet power-of-two size requirement.

* Add script for running configs (libxsmm#20)

Also makes it so that xsmm_py is only imported when needed

* added thread mgmt

* Update configs to be as indicated on slides (libxsmm#22)

also changes --datatype-bf16 to --datatype bf16

* Fix typo in config parameter (libxsmm#23)

* added set of prepare and build script

* added zen5 patch

* BF8 support (libxsmm#24)

BF8 support

* update XSMM

* don't run torch/torch.compile for bf8 as it is at 1 GFLOP

* revert because of libxsmm#27

* Add run all benchmark script + use internal libxsmm (libxsmm#27)

* go up to 4096

---------

Co-authored-by: Adam Siemieniuk <adam.siemieniuk@intel.com>
Co-authored-by: Alexander Heinecke <alexander.heinecke@intel.com>
Co-authored-by: Renato Golin <rengolin@systemcall.eu>
… a time

Includes fixes for automatically deriving num_threads as well - thanks
@rengolin
According to docs of triton.testing.do_bench:
"rep (int): Repetition time (in ms)"
invoke by with raw data from run_all_benchmarks either on stdin on or
with the filename as the sole argument. Output is to stdout.

Note includes example raw data to test the script with.
It's just too slow, especially when accidentally invoked in a batch job.
Broke due to version hell, see:
triton-lang#5026

This change is the one suggested on that issue.
@adam-smnk
Copy link
Copy Markdown
Author

Based on #48

@adam-smnk
Copy link
Copy Markdown
Author

Passes tests after build, no performance regressions on EMR.

adam-smnk pushed a commit to adam-smnk/triton-cpu that referenced this pull request Mar 6, 2025
This commit enables several fp8 tests, that uses `tl.range`.

Signed-off-by: Dmitrii Makarenko <dmitrii.makarenko@intel.com>
adam-smnk pushed a commit to adam-smnk/triton-cpu that referenced this pull request Apr 15, 2025
This commit enables several fp8 tests, that uses `tl.range`.

Signed-off-by: Dmitrii Makarenko <dmitrii.makarenko@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.