Open
407 commits
c8a31a0
[AMD] Prevent wrong reordering of scf operations (#5203)
ThomasRaoux Nov 21, 2024
3164a4e
[AMD] Cover default case in MfmaGroup (#5218)
makslevental Nov 21, 2024
82b8f0f
Allow Layouts to propogate to local_load (#5219)
mbrookhart Nov 21, 2024
4ae95e7
[BACKEND] Fix transpose optimization missed during refactor (#5226)
ThomasRaoux Nov 22, 2024
af0649d
[AMD] Use warp shuffle for fp8 MFMA to dot operand layout conversion …
ilia-cher Nov 22, 2024
4330372
[LAYOUTS] [BE] Simplify Ampere/Hopper paths introduced in #5189 (#5200)
lezcano Nov 22, 2024
e558838
[BACKEND] Use LL to simplify redundant elements check and fix related…
Jokeren Nov 22, 2024
bede39f
Make TMA tests compatible with older CUDA toolchains (#5221)
embg Nov 22, 2024
03c6312
[CMake] Add C as project language (#5217)
makslevental Nov 22, 2024
422e5d3
[AMD] Fix slow compilation due to inlining print calls (#5153)
binarman Nov 22, 2024
16ce143
[AMD] Re-enable overflow test in test_reduce_layouts (#5233)
antiagainst Nov 22, 2024
340cbc6
[BACKEND] Fix a missed transpose optimization during refactor (#5236)
ThomasRaoux Nov 22, 2024
84ced0e
Revert "Allow Layouts to propogate to local_load" (#5237)
ThomasRaoux Nov 22, 2024
c6da81a
Revert "[AMD] Use warp shuffle for MFMA to Dot operand layout convers…
ThomasRaoux Nov 22, 2024
f637ea7
Updated README.md to show the steps for overriding kernel's IR (#5239)
arakhmati-openai Nov 22, 2024
85256a6
Ensure device context before launching kernel (#3731)
bertmaher Nov 22, 2024
6404fbb
[LLVM] Update to llvm-project@86b69c3 (#5242)
peterbell10 Nov 23, 2024
e3ab295
[BUILD] Add a stable symlink to llvm in the triton cache (#5234)
peterbell10 Nov 23, 2024
4107453
[PIPELINER] tweak pipeline heuristic (#5247)
ThomasRaoux Nov 25, 2024
09fcc52
Allow Layouts to propogate to local_load (#5219) (#5249)
ThomasRaoux Nov 25, 2024
f9397bc
Windows related changes in `CMakeLists.txt` (#5186)
anmyachev Nov 25, 2024
e1ebeed
[AMD] NFC: Unified header guard in third_party/amd (#5244)
knwng Nov 25, 2024
4210274
[AMD] NFC: Drop v2 Suffix from Stream Pipeline (#5251)
knwng Nov 25, 2024
c780bf4
[NFC] Cleanup references to unused index dialect (#5257)
peterbell10 Nov 25, 2024
22e212b
[BUILD] Ensure parent directory exists before creating symlinks (#5258)
peterbell10 Nov 25, 2024
deee78f
[BACKEND] Fold transpose(splat_const) (#5259)
ThomasRaoux Nov 26, 2024
e2dc77b
[LAYOUTS] Use LLs for Hopper whenever we wouldn't use ldmatrix (#5235)
lezcano Nov 26, 2024
317e483
[AMD] NFC: Cleanup namespace hierachy (#5246)
knwng Nov 26, 2024
68a08dd
[AMD] Fix unhandled profile event in RoctracerProfiler (#5252)
AlexAUT Nov 26, 2024
3f1d70f
Fix Blocked FMA path in isLayoutOK (#5260)
lezcano Nov 26, 2024
25c1014
[Tutorial] Remove incorrect caching from softmax tutorial (#5162)
Mogball Nov 27, 2024
678e492
[INSTRUMENTATION] Generalize code in `test_gpuhello.py` (#5263)
anmyachev Nov 27, 2024
8b29bb7
Create an aggregate `check-triton-unit` target (#5150)
Mogball Nov 27, 2024
2ea9daa
[NFC] Add `test_bessel` into `test_libdevice.py` (#5261)
anmyachev Nov 27, 2024
9e508a4
[NFC] Add functional regression test for cummax with bool type (#5264)
anmyachev Nov 27, 2024
e7a0561
[AMD] NFC: Unified comment style (#5248)
knwng Nov 27, 2024
dbebe10
[AMD] Upgrade AMD CI docker image (#5230)
AlexAUT Nov 27, 2024
5e59bdf
Implement `dot_scaled(mmav3)` (#5269)
lezcano Nov 27, 2024
2003685
[BUILD] Some CMake cleanup/modernisation (#5271)
peterbell10 Nov 27, 2024
6d3ed0b
[DIALECT] Rename `triton_gpu` to `ttg` and `triton_nvidia_gpu` to `tt…
Jokeren Nov 27, 2024
7b2beae
[BACKEND] Fix inline asm bug for multiple packed <32bit output (#5273)
ggengnv Nov 27, 2024
b8a4b87
[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcastin…
ggengnv Nov 27, 2024
2c0b791
[Triton] Add `tl.gather` with a naive codegen implementation (#5262)
Mogball Nov 28, 2024
a575895
[NVIDIA][Launcher] Ensure device context is valid before calling getP…
peterbell10 Nov 28, 2024
61daa33
[CMAKE] Add `triton-tensor-layout` dep to lit tests (#5275)
Mogball Nov 28, 2024
58f8a24
[BACKEND] Fix and document logic for creating warp shapes in MMAv3 (#…
Jokeren Nov 28, 2024
f062089
[NFC] Remove dead code for python<3.8 (#5280)
anmyachev Nov 28, 2024
55b741d
[NFC] Remove `CMAKE_VERBOSE_MAKEFILE` var (#5282)
anmyachev Nov 28, 2024
1cb0d99
[AMD] Use Linear Layout convertions for AMDWmma (#5255)
joviliast Nov 28, 2024
cc89dac
Add tests for 3D local_load local_alloc and relax asserts (#5285)
ThomasRaoux Nov 29, 2024
27e11ab
[Build] Don't require Development.Embed python component (#5287)
peterbell10 Nov 29, 2024
912e595
[NFC] Remove unused forOp argument from `setStageCluster` (#5288)
peterbell10 Nov 30, 2024
d80b015
[PROTON] Don't use designated initializers in `CuptiPCSampling.cpp` a…
anmyachev Nov 30, 2024
b7e0601
Define `pytest-forked` and `pytest-xdist` as `tests` target deps (#5292)
anmyachev Dec 1, 2024
0b0ffc3
[BUILD] Skip installing test related python packages (#5294)
Jokeren Dec 1, 2024
83cf436
[TESTING] Add golden sample test for pipelining matmul with descripto…
peterbell10 Dec 1, 2024
e950d9a
Specify in `setup.py` that `setuptools>=40.8.0` is a required depende…
anmyachev Dec 2, 2024
9c62d88
[TOOLS] Improve `generate-test-checks.py` (#5300)
Jokeren Dec 2, 2024
817cfc2
[NFC][DIALECT] Remove dependency on `mlir::tensor::TensorDialect` (#5…
Jokeren Dec 3, 2024
8d42d21
[IR] Improve `ttg.memdesc` (#5296)
Jokeren Dec 3, 2024
01fb036
[Pipeliner] Handle masking for atomic_rmw (#5231)
scxiao Dec 3, 2024
d3a94e0
[TESTS] Forward fix for CI break (#5323)
peterbell10 Dec 4, 2024
fa0c2bd
Search for `ptxas` only for cuda backend in `supports_tma` function (…
anmyachev Dec 4, 2024
1d5e9a2
[LLVM] Update to llvm/llvm-project@1f20eee6dc36 (#5308)
jungpark-mlir Dec 4, 2024
134b3eb
[AMD][BACKEND] Add gfx950 target definitions. (#5281)
jungpark-mlir Dec 4, 2024
b0ebcfc
[AMD] Adjust local_store and global_load ordering (#5254)
ravil-mobile Dec 4, 2024
712ac66
Re-align main and llvm-head (#5334)
gflegar Dec 4, 2024
00cc5d0
[PIPELINER] Cleanup of LoopScheduling.cpp, introduction of AssignLate…
pawelszczerbuk Dec 4, 2024
a4f1854
Move tensor layout verifier impl into dialect interface (#5312)
alexbaden Dec 4, 2024
86f0de6
Add the python + lit test cases for Ampere small-tile-size mixed prec…
bingyizh233 Dec 4, 2024
7f2c56a
Revert "[LLVM] Update to llvm/llvm-project@1f20eee6dc36" (#5341)
ThomasRaoux Dec 4, 2024
6e24f72
Revert "[AMD][BACKEND] Add gfx950 target definitions." (#5342)
ThomasRaoux Dec 4, 2024
147d332
[Backend] Tiny cleanup/refactor (NFC) (#5340)
Mogball Dec 4, 2024
10552c5
[BACKEND] Use an optimized ptx code sequence for fp4 upcasting (#5344)
ThomasRaoux Dec 5, 2024
390e27f
Use `L` integer literal for `int64_t` (#5351)
anmyachev Dec 5, 2024
67ea999
[LAYOUTS] Use least squares solution in invertAndCompose (#5309)
lezcano Dec 5, 2024
983757b
[Pipeliner] Fix a dangling insertion point issue. (#5352)
htyu Dec 6, 2024
bc2dcbf
Short preamble for the README, explaining why this clone exists
bertmaher Apr 17, 2024
83078a6
OSS Automated Fix: Addition of Code of Conduct (#1)
facebook-github-bot May 1, 2024
b935aec
[BACKEND][CPU] Initial plumbing for cpu backend (#2)
minjang May 2, 2024
e235598
[BACKEND][CPU] Create TritonCPU and conversion dialects (#3)
minjang May 6, 2024
88d1a35
Update README.md
minjang May 6, 2024
c214238
Convert tt.func and tt.return (#4)
minjang May 13, 2024
047a677
[BACKEND][CPU] Convert tt.get_program_id and tt.print (Hello World) (#1)
minjang May 14, 2024
1513b37
Quick patches to make it work after rebasing (#3)
minjang May 16, 2024
48cf851
Support basic lowering through vector dialect in CPU backend.
ienkovich May 2, 2024
693a9f8
Revert unreviewed changes. (#5)
shanenay May 17, 2024
633e190
Add a workaround for LLVM bug in codegen for bf16 vector cast. (#4)
ienkovich May 17, 2024
2b8c201
Prototype of the Triton CPU backend with basic compilation and execut…
ienkovich May 24, 2024
a198dd6
Add support for tl.cat operation. (#9)
ienkovich May 28, 2024
f69beb8
[BACKEND][CPU] Make it buildable and runnable in a different environm…
minjang May 28, 2024
8425ba6
Add support for simple reductions. (#10)
ienkovich May 29, 2024
159fac4
Support tl.histogram for CPU. (#12)
ienkovich May 29, 2024
10b93d1
Fix merge and compile errors (#13)
minjang May 30, 2024
61ecff1
[CPU] Support flexible active driver + update vector-add tutorial (#11)
minjang May 31, 2024
c201f98
Added a simple workflow to run on self-hosted intel runner (#16)
gshimansky Jun 7, 2024
7c8ca4e
Fixed build and test workflow for intel self-hosted runner (#17)
gshimansky Jun 9, 2024
0c9b9c9
[CPU] Add an OpenMP-based CPU launcher (#15)
minjang Jun 10, 2024
aaa1417
Support generic reduction and scan cases. (#14)
ienkovich Jun 10, 2024
0a033bf
[CPU] Dump human-readable asm code in TRITON_CACHE_DIR (#19)
minjang Jun 11, 2024
aea6125
Added g++ installation after switching to ubuntu-22.04 (#21)
gshimansky Jun 11, 2024
7b6ed89
Support atomic ops for CPU. (#20)
ienkovich Jun 11, 2024
033f9b5
[TUTORIAL] Add unmasked matrix multiply example to triton-cpu (#23)
Kuigesi Jun 14, 2024
196f7a6
Update matrix-multiplication-cpu tutorial, use preallocated output bu…
Kuigesi Jun 15, 2024
852b375
Fixes for x86 CI workflow (#26)
ienkovich Jun 18, 2024
7c2619e
Use static compilation for kernels. (#29)
ienkovich Jun 20, 2024
613bdda
Move byte manipulation ops from elwise ops conversion. (#28)
ienkovich Jun 20, 2024
b3ffbad
[TUTORIAL] Add the non-persistent softmax and make it for CPU (#22)
minjang Jun 20, 2024
1ef85d3
Enable few more core tests for CPU. (#31)
ienkovich Jun 20, 2024
2506210
Support tt.split for CPU. (#30)
ienkovich Jun 20, 2024
a41adf4
[BACKEND][CPU] Make the CPU backend buildable and runnable in Mac M1.…
Kuigesi Jun 25, 2024
c9fb202
[CPU] Add conversion for unsupported BF16 ops via target-specific sta…
ienkovich Jun 25, 2024
6ca9ea9
Enabled simple build&test workflow, disabled old Integration Tests wo…
gshimansky Jun 25, 2024
b6499e4
[BACKEND][CPU] Specify CPU target to native for GNU/Linux Arm (#34)
Kuigesi Jun 25, 2024
1ff7c09
Add conversions for mixed precision matmuls. (#32)
ienkovich Jul 2, 2024
37ec39c
[Op support] Support 'get_num_programs' (#39)
Devjiu Jul 3, 2024
e0f0f5d
Add fast-math option: allow fp reduction reassociation
Kuigesi Jul 8, 2024
3e163f2
Change the lowering option for vector.multi_reduction from InnerParal…
Kuigesi Jul 8, 2024
41d6f0d
Fix: TrapUnreachable is not controled by fast-math, we set it uncondi…
Kuigesi Jul 9, 2024
9e4cc37
[so] Compile asm to .so as part of staged lowering (#53)
int3 Jul 17, 2024
a1401f1
Add libdevice for CPU. (#52)
ienkovich Jul 17, 2024
87db5c1
[Op support] Dot3D support (#43)
Devjiu Jul 17, 2024
e6f09a5
Support FP8 conversions for CPU. (#40)
ienkovich Jul 17, 2024
2acf6c7
[CPU] Support device_print for scalar types first (#54)
minjang Jul 18, 2024
cb15d8d
[TUTORIAL] Add matrix vector multiplication tutorial (#46)
Kuigesi Jul 19, 2024
2cf5ad6
Fix FuncOp lowering. (#61)
ienkovich Jul 19, 2024
cf90d9b
[CPU] Easy: remove the old initial boilerplate code (#59)
minjang Jul 19, 2024
04e4928
[Scf If types] Support conversion of types for scf::if (#45)
Devjiu Jul 22, 2024
3821e01
[WA for fp16 torch.matmul] Replace torch.matmul with np.matmul (#44)
Devjiu Jul 23, 2024
88d3276
[cpu] Have MulhiUI lowering support scalars (#64)
int3 Jul 23, 2024
7cf4eaa
[cpu] Fix formatting (#65)
int3 Jul 23, 2024
a4dbfdd
[cpu] Support tl.load(..., padding="nan") (#69)
int3 Jul 23, 2024
5b0de41
[cpu] Use helpers from OptCommon.h to simplify code (#67)
int3 Jul 23, 2024
199e65e
[cpu] Follow up to #69 (#70)
int3 Jul 23, 2024
e3677c7
[cpu] Add runtime library for CPU kernels (#73)
int3 Jul 25, 2024
cf56b35
[FP8 tests] Enable several fp8 tests (#49)
Devjiu Jul 25, 2024
8c581d1
[cpu] Make runtime library build on Linux too (#75)
int3 Jul 25, 2024
7dafc13
[cpu] Get more of test_random.py working (#77)
int3 Jul 26, 2024
56c7b2a
[FIX Pytest] Resolve 'importlib' issue (#78)
Devjiu Jul 29, 2024
25d84b0
Fix importlib issues (#80)
int3 Jul 29, 2024
3c88dfb
[cpu] Add test_annotations.py to CI (#81)
int3 Jul 30, 2024
6da29d0
Reduce/disable some tests on CPU for faster CI runs. (#83)
ienkovich Aug 1, 2024
738ff5b
[cpu] Don't reuse shuffle dummies (#88)
int3 Aug 5, 2024
fd3c8d8
Utilize vector math functions from libmvec. (#55)
ienkovich Aug 5, 2024
50ab503
Make tl.debug_barrier() a no-op on CPU (#89)
int3 Aug 6, 2024
e85da96
ConvertMemoryOps should not use cf dialect (#91)
int3 Aug 6, 2024
0428f85
Remove registered pipelines in favor of explicit lists in python. (#93)
ienkovich Aug 7, 2024
84c3f4e
Don't use cf dialect in ConvertAtomicOps (#94)
int3 Aug 7, 2024
edbd15f
atomic_rmw ops should return original value (#95)
int3 Aug 7, 2024
87fa70f
Compute a scalar pointer for vector load instead of extracting it fro…
ienkovich Aug 7, 2024
5431fcf
Add pass to optimize masked loads and stores. (#96)
ienkovich Aug 7, 2024
bd566e9
Fix incorrect casts in mask optimization. (#101)
ienkovich Aug 8, 2024
f93a8d3
Add conversion for scf.while (#103)
int3 Aug 8, 2024
6602d0e
[TUTORIAL] Add bf16 matrix vector multiplication tutorial (#90)
Kuigesi Aug 8, 2024
416d456
Add an option to use sleef instead of libmvec. (#104)
ienkovich Aug 9, 2024
75f1c45
Enable fast math by default. (#108)
ienkovich Aug 9, 2024
889be25
Add more libdevice lowerings (#97)
int3 Aug 9, 2024
bcc0d46
Enable rsqrt and floor for BF16. (#109)
ienkovich Aug 9, 2024
4d7aa94
Remove specific dwarf version from -g option. (#110)
ienkovich Aug 9, 2024
6aee6b3
Enable `min_dot_size`
Devjiu Aug 13, 2024
986e7f1
[Formatting] Apply formating
Devjiu Aug 13, 2024
6a1e656
Remove is_cpu arg from do_bench. (#113)
ienkovich Aug 13, 2024
bb56353
Enable few more tutorials for CPU (#114)
ienkovich Aug 13, 2024
ef938e8
Pass device type to do_bench in autotuner. (#115)
ienkovich Aug 13, 2024
b51e8a6
Fix indices extraction from block pointer. (#116)
ienkovich Aug 14, 2024
667c30c
[cpu] Rework device_print with triton_cpu.print and 1D vector printin…
minjang Aug 14, 2024
a04bd9b
[Pytests] Add several suits (#106)
Devjiu Aug 14, 2024
75225ee
Identify dot product pattern (mul followed by a sum) for bf16, and co…
Kuigesi Aug 15, 2024
86270e6
Add optional packing for converting bf16 dot product. (#118)
Kuigesi Aug 16, 2024
0dedba1
Add load/store scalarization through loops. (#119)
ienkovich Aug 21, 2024
f474689
Fix typo. (#122)
ienkovich Aug 21, 2024
92642ec
Add lit tests for load/store scalarization. (#121)
ienkovich Aug 22, 2024
1fa19a2
[cpu][easy] Fix compiler error on clang (#120)
minjang Aug 22, 2024
6acc8ad
Offload a part of masks optimization to the canonicalizer.
ienkovich Aug 22, 2024
b3c3937
Implement get_module_map for cpu backend
int3 Aug 29, 2024
93fbe9d
Make test_hooks CPU-compatible by using get_device_key (#128)
int3 Aug 29, 2024
ee480a8
Make CPU runtime lib lookup work for Python 3.8 (#129)
int3 Aug 29, 2024
2c4feb9
Implement device_assert (#126)
int3 Aug 30, 2024
77fcb05
Implement isnan, isinf, signbit (#127)
int3 Aug 30, 2024
6b2cd51
Vendor sleef as a submodule (#130)
int3 Aug 31, 2024
28997fb
Add test_debug_dump.py to CI (#131)
int3 Sep 1, 2024
a94adaa
Refactor MathToLibmvec pass (#135)
int3 Sep 4, 2024
3c60aee
[CPU] Add unit test for print with isSigned and several fixes (#132)
minjang Sep 4, 2024
cf058b5
Refactor math tests + select vector lib backend via kernel option (#136)
int3 Sep 4, 2024
1249451
Vectorize expm1, sqrt, and floor using sleef (#137)
int3 Sep 4, 2024
dbe70c1
Fix infinite optimization loop for mask optimization. (#138)
ienkovich Sep 4, 2024
82bfb1a
Implement libdevice.trunc (#140)
int3 Sep 5, 2024
fc67b1c
Remove old LLVM bug workaround. (#141)
ienkovich Sep 6, 2024
532e922
Add kernel execution time measurement using hooks for do_bench (#139)
ienkovich Sep 9, 2024
8db7197
Use llvm_unreachable in cpu_runtime.cpp (#145)
minjang Sep 9, 2024
0141fcc
Fix undefined symbole error in libTritonCPURuntime.so (#146)
minjang Sep 9, 2024
f9ee7ff
[Dot3D test] Enable with lower block size (#117)
Devjiu Sep 18, 2024
769282b
Add an option to choose between default reduction lowering and our ow…
ienkovich Sep 20, 2024
50e0937
Fix regressions due to rebasing to the latest upstream
minjang Sep 21, 2024
4db29c5
Update build-test.yml for pybind11
minjang Sep 22, 2024
57da305
[FP8 support] Enable Float8 tests failed after rebase (#151)
Devjiu Sep 23, 2024
332e8db
Use 1-D vector reduction op to convert reduce op (#152)
Sep 27, 2024
ab0dcc3
[Keep materialization] Turn on meterialization (#154)
Devjiu Sep 27, 2024
0fce0a7
[Scalarization/Loops generation] Refactor and new pass/interfaces int…
Devjiu Sep 30, 2024
ae7c132
Lower memory ops with vector gather and scatter (#158)
Oct 14, 2024
863eee6
Introduce DotOp lowering to AMX (#157)
ienkovich Oct 17, 2024
04c2bfd
Implement more libdevice functions using extern_elementwise (#161)
int3 Oct 21, 2024
a04ce5b
Fix compilation when ARCH_REQ_XCOMP_PERM isn't defined (#163)
int3 Oct 21, 2024
646769b
[CPU] Drop MLIR prefix in ScalarizeInterface (#164)
minjang Oct 21, 2024
1a1184f
Pad size 2 vectors to size 4 when lowering extern_elementwise ops (#162)
int3 Oct 22, 2024
1e7cf8c
Rebase onto upstream triton ff306da26b and fix regressions
minjang Oct 22, 2024
eaeb32c
Simple fixes to build on MacOSx (#165)
digantdesai Oct 23, 2024
eb1ef8c
Fix trailing null char in ulpSuffix (#166)
digantdesai Oct 23, 2024
75fc8d8
Rebase onto upstream triton 4a5431159a and fix regressions
minjang Oct 24, 2024
58455cb
[Test][Autotuner] Skip use_cuda_graph for non cuda devices (#169)
Devjiu Oct 25, 2024
7e1c317
Add num_threads option to control threading per kernel invocation. (#…
ienkovich Oct 28, 2024
dc6c0d6
[TTC Print Memref] Simplify further multidimensional tensor printing …
Devjiu Oct 28, 2024
b5eb445
Small fixes for autotuner on CPU (#172)
ienkovich Oct 30, 2024
8bd6314
Small fixes for clang + macosx (#173)
digantdesai Oct 30, 2024
346abb0
Support multi-dimensional tensor prints in CPU runtime. (#174)
ienkovich Oct 30, 2024
90be1a1
Fix linux-aarch64 build (#176)
desertfire Nov 11, 2024
22ce152
Fix math tests for armv8 (#178)
digantdesai Nov 27, 2024
eed31e0
Allow using local omp with Apple clang (#181)
digantdesai Dec 4, 2024
203d16a
Add pytest.mark.cpu to two more already-passing tests (#183)
int3 Dec 6, 2024
0389f38
Move libdevice to third_party (#182)
int3 Dec 6, 2024
02c9a81
Introduce triton_cpu.DotOp.
ienkovich Nov 22, 2024
67c75c5
Fixes to use the latest LLVM.
ienkovich Oct 2, 2024
0116651
Fix pybind11 build issue for TritonCPU.
ienkovich Dec 6, 2024
755aed7
Use mlir::amx::TileType.
ienkovich Oct 2, 2024
a608930
Fix formatting
ienkovich Dec 6, 2024
5304375
Fix test_tl_range.
ienkovich Dec 6, 2024
148ebf5
Fix test_conversions.
ienkovich Dec 6, 2024
a74ee76
Disable test_block_copy with lower bound check.
ienkovich Dec 6, 2024
958b889
Fix isSigned and add float16 in PrintOp (#191)
minjang Dec 9, 2024
5830c67
Add TritonCPU canonicalizer. (#192)
ienkovich Dec 10, 2024
3f11034
Introduce FMA lowering for DotOp. (#193)
ienkovich Dec 12, 2024
feb95c3
AMX lowering improvements (#194)
ienkovich Dec 12, 2024
afeda07
Fix extra-store in matmul tutorial. (#198)
ienkovich Dec 17, 2024
8390fa2
Remove unnecessary bounds checks. (#199)
ienkovich Dec 19, 2024
4e4e6b8
Enable armv8 CI (#195)
digantdesai Dec 21, 2024
daa7eb0
Fix isSigned usage for scalar prints. (#201)
ienkovich Dec 23, 2024
a801f4e
Rebase/cherry-pick our work on top of triton-lang/triton-cpu/main (#30)
Nov 24, 2024
0ef97fe
Add --backend option and run experiments one backend*config option at…
Nov 24, 2024
d2850f2
[Triton-benchmark] Set rep=1000 (vs default of 100)
Nov 24, 2024
3cc95d9
Fix debug sizes still being set
Nov 24, 2024
e321e82
Fix typo: benchmark -> config
Nov 25, 2024
7a367d6
Add raw-to-csv.py converter script (#29)
Nov 24, 2024
6190c46
Make raw-to-csv.py compatible with BENCHMARK-outputting script
Nov 25, 2024
c354d81
Fix errant call to do-nothing "preproccess" function on some configs …
Nov 25, 2024
2b748a5
Improve raw-to-csv a la @rengoline and prepare for summary line to be…
Nov 25, 2024
ee0f4f7
Change back max dim size to 3k for data collection
Nov 25, 2024
bcd3d28
Error out when attempting run bf8 with torch
Nov 25, 2024
92ea798
Store commit hash with benchmark run details
Nov 25, 2024
8f22031
Update xsmm_utils.cpp to avoid using libxsmm for now
alheinecke Dec 2, 2024
b2fa69a
Fix broken torch.compile
Nov 24, 2024
48 changes: 48 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.yml
@@ -0,0 +1,48 @@
name: Report a bug
description: Report triton failing to compile a kernel, or giving incorrect results
labels: ["bug"]

body:
- type: markdown
  attributes:
    value: |
      #### Disclaimer
      The core triton team is small and has very limited capacity. We may not have time to look into your report.
      For the best results, please:
      - Avoid submitting duplicates. Search through [the existing and past issues](https://github.com/triton-lang/triton/issues?q=is%3Aissue+sort%3Acreated-desc+) first to see if it's been reported previously.
      - Check if the issue persists with a build from the latest source.
      - Provide all relevant information in the initial report, to prevent unnecessary back and forth discussion.
      - If you can, try to diagnose and/or fix the issue yourself. We welcome high quality contributions.
- type: textarea
  attributes:
    label: Describe the bug
    description: |
      Please provide a clear and concise description of what the bug is.

      If relevant, add a [minimal complete example](https://stackoverflow.com/help/minimal-reproducible-example) that reproduces the bug. It is very important for the snippet to be as simple as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did, so include both the kernel and launching code as well as any relevant imports.

      If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.

      Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
    placeholder: |
      A clear and concise description of what the bug is.

      ```python
      # Sample code to reproduce the problem
      ```

      ```
      The error message you got, with the full traceback.
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment details
    description: |
      Please include any relevant context about how you're running the reproducer e.g. which version of triton, and what GPU you are using.
    placeholder: |
      Triton: ...
      GPU: ...
  validations:
    required: true
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
- name: Community help
  url: https://discord.gg/gpumode
  about: GPU-mode discord community has a triton channel which is a great resource for help writing/learning triton
44 changes: 44 additions & 0 deletions .github/ISSUE_TEMPLATE/performance.yml
@@ -0,0 +1,44 @@
name: Report a performance issue
description: Report cases where triton is generating sub-optimal (but functionally correct) PTX/LLVM IR
labels: ["performance"]

body:
- type: markdown
  attributes:
    value: |
      #### Disclaimer
      The core triton team is small and has very limited capacity. We may not have time to look into your report.
      For the best results, please:
      - Avoid submitting duplicates. Search through [the existing and past issues](https://github.com/triton-lang/triton/issues?q=is%3Aissue+sort%3Acreated-desc+) first to see if it's been reported previously.
      - Check if the issue persists with a build from the latest source.
      - Provide all relevant information in the initial report, to prevent unnecessary back and forth discussion.
      - If you can, try to diagnose and/or fix the issue yourself. We welcome high quality contributions.
- type: textarea
  attributes:
    label: Describe the issue
    description: |
      Please provide a clear and concise description of the issue.

      Include a [minimal complete example](https://stackoverflow.com/help/minimal-reproducible-example) that reproduces the issue. It is very important for the snippet to be as simple as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did.

      A reproducer could be a python program that runs a triton kernel and prints out the relevant suboptimal IR, or an IR file with an accompanying triton-opt command.

      If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.
    placeholder: |
      A clear and concise description of the issue.

      ```python
      # Sample code to reproduce the problem
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: Environment details
    description: |
      Please include any relevant context about how you're running the reproducer e.g. which version of triton, and what GPU you are using.
    placeholder: |
      Triton: ...
      GPU: ...
  validations:
    required: true
3 changes: 3 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,11 +1,14 @@
<!---
The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a comment.

- [ ] I have written a PR description following these
158 changes: 158 additions & 0 deletions .github/workflows/build-test.yml
@@ -0,0 +1,158 @@
name: Build and test
run-name: ${{ inputs.run_name }}

on:
  workflow_dispatch:
  pull_request:
    branches:
      - main
      # You can name your branch dev-foo to get CI runs.
      - 'dev-**'
  push:
    branches:
      - main

jobs:
  pre-commit:
    name: Pre-commit checks
    runs-on:
      - glados
      - intel
      - x86
    steps:
      - name: Print inputs
        run: |
          echo "${{ toJSON(github.event.inputs) }}"
          echo INSTALL_IPEX=${{ env.INSTALL_IPEX }}

      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Run pre-commit checks
        run: |
          pip install --upgrade pre-commit

          # TODO: ignore the first yapf failure until https://github.com/google/yapf/issues/1164 is fixed
          python3 -m pre_commit run --all-files --verbose yapf &> /dev/null || true
          # If first run of yapf worked and made changes reset the tree to the original state
          git reset --hard

          python3 -m pre_commit run --show-diff-on-failure --color=always --all-files --verbose

  build-test:
    name: Build and test on ${{ matrix.config.runner }}
    runs-on: ${{ matrix.config.runs_on }}
    strategy:
      matrix:
        python: ['3.11']
        config:
        - {runner: 'Ubuntu Intel x86', runs_on: ['glados', 'intel', 'x86'], target-os: 'ubuntu', arch: 'x86'}
        - {runner: 'MacOS-latest ARM64', runs_on: ['macos-latest'], target-os: 'macos', arch: 'arm64'}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Install Python ${{ matrix.python }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}

      - name: Install pip and apt dependencies
        env:
          RUNNER_TARGET_OS: ${{ matrix.config.target-os }}
        run: |
          echo "RUNNER_TARGET_OS: ${RUNNER_TARGET_OS}"
          python3 -m pip install --upgrade pip
          python3 -m pip install wheel cmake==3.24 ninja pytest-xdist lit pybind11
          if [[ "${RUNNER_TARGET_OS}" == "ubuntu" ]]; then
            sudo apt-get update
            sudo apt-get install -y zlib1g-dev g++
          fi
          pip install torch==2.1.2

      - name: Install Triton
        run: |
          echo "PATH is '$PATH'"
          cd python
          python3 -m pip install --no-build-isolation -vvv '.[tests]'

      - name: Run python unit tests for MacOS Arm64
        if: matrix.config.target-os == 'macos'
        run: |
          export CC=$(which clang)
          export TRITON_DISABLE_OPENMP=1  # temporary
          export TRITON_CPU_BACKEND=1

          # Document some versions/flags
          echo "xcode-select:"; xcode-select -p
          echo "CC: ${CC}"
          clang --version
          echo "TRITON_DISABLE_OPENMP=${TRITON_DISABLE_OPENMP}"
          echo "TRITON_CPU_BACKEND=${TRITON_CPU_BACKEND}"

          # Skip bfloat16 tests for now
          # We are generating bfcvt for bfloat16 tests when converting to fp32.
          # This is only for Clang15, works OK for Clang16
          # TODO - fix this using driver flags.
          python -m pytest -s -n 32 --device cpu \
            python/test/unit/language/test_core.py -m cpu -k "not bfloat16"
          python -m pytest -s -n 32 --device cpu \
            python/test/unit/cpu/test_math.py \
            python/test/unit/cpu/test_opt.py \
            python/test/unit/language/test_annotations.py \
            python/test/unit/language/test_block_pointer.py \
            python/test/unit/language/test_compile_errors.py \
            python/test/unit/language/test_conversions.py \
            python/test/unit/language/test_decorator.py \
            python/test/unit/language/test_pipeliner.py \
            python/test/unit/language/test_random.py \
            python/test/unit/language/test_standard.py \
            python/test/unit/runtime/test_autotuner.py \
            python/test/unit/runtime/test_bindings.py \
            python/test/unit/runtime/test_cache.py \
            python/test/unit/runtime/test_driver.py \
            python/test/unit/runtime/test_jit.py \
            python/test/unit/runtime/test_launch.py \
            python/test/unit/runtime/test_subproc.py \
            python/test/unit/test_debug_dump.py \
            -k "not bfloat16"

      - name: Run python unit tests for Intel
        if: matrix.config.target-os == 'ubuntu'
        run: |
          python -m pytest -s -n 32 --device cpu python/test/unit/language/test_core.py -m cpu
          python -m pytest -s -n 32 --device cpu \
            python/test/unit/cpu/test_math.py \
            python/test/unit/cpu/test_opt.py \
            python/test/unit/language/test_annotations.py \
            python/test/unit/language/test_block_pointer.py \
            python/test/unit/language/test_compile_errors.py \
            python/test/unit/language/test_conversions.py \
            python/test/unit/language/test_decorator.py \
            python/test/unit/language/test_pipeliner.py \
            python/test/unit/language/test_random.py \
            python/test/unit/language/test_standard.py \
            python/test/unit/runtime/test_autotuner.py \
            python/test/unit/runtime/test_bindings.py \
            python/test/unit/runtime/test_cache.py \
            python/test/unit/runtime/test_driver.py \
            python/test/unit/runtime/test_jit.py \
            python/test/unit/runtime/test_launch.py \
            python/test/unit/runtime/test_subproc.py \
            python/test/unit/test_debug_dump.py

      - name: Run lit tests
        run: |
          cd python
          LIT_TEST_DIR="build/$(ls build | grep -i cmake)/test"
          if [ ! -d "${LIT_TEST_DIR}" ]; then
            echo "Could not find '${LIT_TEST_DIR}'" ; exit -1
          fi
          lit -v "${LIT_TEST_DIR}/TritonCPU"
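
Note on the "Run lit tests" step: the step finds the lit test directory by listing `build` and keeping the entry whose name contains "cmake" (case-insensitive). If more than one entry matches, the shell substitution yields a multi-line string and the step fails with a confusing "Could not find" message. The Python sketch below mirrors that lookup but insists on exactly one match. It is an illustration only and not part of this PR; the function name `find_lit_test_dir` and the directory names in the usage example are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def find_lit_test_dir(build_root: Path) -> Path:
    """Return <build_root>/<cmake dir>/test, similar to `ls build | grep -i cmake`.

    Stricter than the shell one-liner (an assumption of this sketch):
    only directories are considered, and exactly one must match.
    """
    # Case-insensitive substring match on "cmake", like `grep -i cmake`.
    candidates = sorted(
        entry
        for entry in build_root.iterdir()
        if entry.is_dir() and re.search("cmake", entry.name, re.IGNORECASE)
    )
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one cmake build directory under {build_root}, "
            f"found {len(candidates)}"
        )
    test_dir = candidates[0] / "test"
    if not test_dir.is_dir():
        raise FileNotFoundError(f"Could not find '{test_dir}'")
    return test_dir


if __name__ == "__main__":
    # Usage example with a throwaway tree; the directory name is made up.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "cmake.example-build" / "test").mkdir(parents=True)
        print(find_lit_test_dir(root))
```

Raising on zero or multiple matches keeps the failure message specific, which the one-line shell version cannot do.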