diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..f087b42 --- /dev/null +++ b/.gitattributes @@ -0,0 +1 @@ +*.tar.gz filter=lfs diff=lfs merge=lfs -text diff --git a/README.md b/README.md index 0b10325..2b744a9 100644 --- a/README.md +++ b/README.md @@ -1,81 +1,149 @@ # SuperNPUBench -SuperNPUBench is a TileOP API, kernel, model, and benchmark workspace for -LinxISA/PTO-style tile programming experiments. The repository is organized -around header-only TileOP APIs, reusable kernels, model-level examples, and -make-driven test suites. +SuperNPUBench is a Linx/PTO TileOP benchmark workspace. Active benchmark +entrypoints live under `benchmarks/`, reusable library code stays under +`include/`, `kernels/`, and `models/`, non-benchmark correctness checks live +under `tests/`, and superseded material is preserved under `archive/outdated/`. ## Repository Map | Path | Purpose | | --- | --- | +| [`benchmarks`](benchmarks) | Primary active Linx-buildable benchmark tree. Start here for benchmark source and build commands. | +| [`benchmarks/INDEX.md`](benchmarks/INDEX.md) | Source catalog with benchmark paths, build commands, category, status, and required data objects. | +| [`benchmarks/common`](benchmarks/common) | Shared make harness, platform selection, compiler flags, simulator targets, and benchmark-local helper headers. | +| [`compiler`](compiler) | Compiler artifact staging. The checked-in legacy archive is a Git LFS pointer; active builds should pass a real toolchain through `COMPILER_DIR`. | | [`include/common`](include/common) | Shared TileOP API surface, data types, tensor helpers, layouts, and compile-time utilities. | -| [`include/cpu_sim`](include/cpu_sim) | CPU simulation backend used by tests built with `PLAT=cpu`. | +| [`include/benchmark_support`](include/benchmark_support) | Benchmark-only support headers, including NPU helper APIs used by active suites. | +| [`include/cpu_sim`](include/cpu_sim) | CPU simulation backend used by checks built with `PLAT=cpu`. | | [`include/aarch64`](include/aarch64) | Host/Arm-oriented backend headers and TileOP API variants. | -| [`include/accelerator`](include/accelerator) | Accelerator-facing headers split by versioned targets such as `v220` and `v310`. | -| [`include/jcore`](include/jcore) | Linx/JCore backend headers used by tests built with `PLAT=linx`. | -| [`kernels`](kernels) | Reusable kernel implementations, grouped by domain such as element-wise, memory, reduction, and matmul. | -| [`models`](models) | Model-level examples and compositions, currently including DeepSeekV3-oriented code. | -| [`test/common`](test/common) | Shared make harness, platform selection, compiler flags, and simulator targets. | -| [`test/tileop_api`](test/tileop_api) | Focused TileOP API compile and behavior tests. Start here for single-operation API work. | -| [`test/py_api`](test/py_api) | Python extension and golden-comparison flow for TileOP API checks. | -| [`test/accelerator`](test/accelerator) | Accelerator-oriented benchmark and validation suites. | -| [`test/kernel`](test/kernel) | Kernel benchmark and validation suites. | -| [`test/other`](test/other) | Additional model, microbenchmark, TileOP, vector, and script-driven suites. | -| [`test/script`](test/script) | Recursive compile/run helper for batch workflows. | +| [`include/accelerator`](include/accelerator) | Versioned device API headers consumed by Linx/NPU code. | +| [`include/jcore`](include/jcore) | Linx/JCore backend headers used by `PLAT=linx`. | +| [`kernels`](kernels) | Reusable kernel implementations shared by benchmark entrypoints. | +| [`models`](models) | Reusable model-level implementation code shared by model benchmarks. | +| [`samples`](samples) | Checked-in sample compiler disassembly for representative flash attention and GEMM outputs. | +| [`tests`](tests) | Non-benchmark correctness material that is not the primary Linx benchmark navigation surface. | +| [`archive/outdated`](archive/outdated) | Preserved duplicate, superseded, generated, or unusable historical material with replacement notes. | | `output/` | Generated build products. Treat this as local output, not source. | -## Quick Navigation +## Getting Started: Linx Compiler And QEMU -- To understand the public API, start with [`include/common`](include/common). -- To inspect a backend implementation of an operation, compare the matching - headers under [`include/cpu_sim`](include/cpu_sim), - [`include/jcore`](include/jcore), and [`include/aarch64`](include/aarch64). -- To add or update reusable compute code, use the matching domain under - [`kernels`](kernels). -- To compile one small API test, use [`test/tileop_api`](test/tileop_api). -- To validate a Python-facing API path, use [`test/py_api`](test/py_api) and - [`test/py_api/golden_cmp`](test/py_api/golden_cmp). -- To run larger suites, use the `compile.all` files in the relevant - [`test`](test) subdirectories. +These commands assume this workload lives at `$LINXISA_ROOT/workloads/SuperNPUBench`. Set `LINXISA_ROOT` to the root of your Linx superproject checkout. -## Header Installation +Build the Linx LLVM compiler from `compiler/llvm`: -The top-level `Makefile` installs the TileOP header tree into the active -Clang resource directory under `include/tileop-api`. +```bash +export LINXISA_ROOT=/path/to/linx-isa +cd "$LINXISA_ROOT" -```sh -make -n CLANG_PREFIX=/usr -make install CLANG_PREFIX=/path/to/clang/prefix -make uninstall CLANG_PREFIX=/path/to/clang/prefix +cmake -S compiler/llvm/llvm -B compiler/llvm/build-linxisa-clang -G Ninja \ + -DLLVM_ENABLE_PROJECTS="clang;lld" \ + -DLLVM_TARGETS_TO_BUILD=Linx +cmake --build compiler/llvm/build-linxisa-clang --target clang lld llvm-mc llvm-objdump llvm-objcopy + +export COMPILER_DIR="$LINXISA_ROOT/compiler/llvm/build-linxisa-clang/bin" +``` + +For incremental compiler rebuilds after the CMake tree exists: + +```bash +cd "$LINXISA_ROOT" +bash tools/bringup/run_llvm_incremental_build.sh clang lld llvm-mc llvm-objdump llvm-objcopy +``` + +Build the Linx QEMU target and run a QEMU smoke suite: + +```bash +cd "$LINXISA_ROOT" +QEMU="$(bash tools/bringup/run_qemu_build_local.sh)" + +cd "$LINXISA_ROOT/avs/qemu" +CLANG="$COMPILER_DIR/clang" LLD="$COMPILER_DIR/ld.lld" QEMU="$QEMU" ./run_tests.sh --suite arithmetic --timeout 10 +``` + +Compile one SuperNPUBench case with the rebuilt compiler: + +```bash +cd "$LINXISA_ROOT/workloads/SuperNPUBench" +make -C benchmarks/api/tileop TESTCASE=TAdd PLAT=linx COMPILER_DIR="$COMPILER_DIR" ``` -`CLANG_PREFIX` should point to a prefix containing `bin/clang`. The dry run is -useful because the install location is derived from -`clang -print-resource-dir`. +The benchmark `sim` target invokes `$(QEMU) -run-supertest -blk_optimize force_tb_chained `. Use it with a SuperTest-compatible QEMU binary: -## Building Tests +```bash +make -C benchmarks/api/tileop TESTCASE=TAdd PLAT=linx COMPILER_DIR="$COMPILER_DIR" QEMU=/path/to/supertest-compatible-qemu sim +``` + +For standard `qemu-system-linx64 -machine virt` validation, use the `avs/qemu` runner shown above. + +## Benchmark Navigation + +Active benchmark source is grouped by workload intent: + +| Path | Category | +| --- | --- | +| [`benchmarks/api/tileop`](benchmarks/api/tileop) | TileOP API operation benchmarks. | +| [`benchmarks/npu`](benchmarks/npu) | NPU cube, fusion, NDDMA, vec SIMD, and vec SIMT suites. | +| [`benchmarks/kernels`](benchmarks/kernels) | Control, element-wise, GEMM, fusion, memory, reduction, sort, and composite kernel suites. | +| [`benchmarks/models/deepseekv3`](benchmarks/models/deepseekv3) | DeepSeekV3 model-level benchmark cases. | +| [`benchmarks/microbench`](benchmarks/microbench) | Cube, memory, and vector microbenchmarks. | +| [`benchmarks/scripts`](benchmarks/scripts) | Batch and recursive helper scripts for benchmark workflows. | + +Use [`benchmarks/INDEX.md`](benchmarks/INDEX.md) when you need the exact source path, suite-level build command, or data-object requirement for a case. + +## Benchmark Names -Most test directories include [`test/common/Makefile.common`](test/common/Makefile.common). -The common harness is controlled mainly by `TESTCASE`, `PLAT`, -`COMPILER_DIR`, and `QEMU`. +The table below is generated from active benchmark source files. It currently lists 166 C++ benchmark entrypoints across 26 suites and 44 batch scripts. + +| Suite | Benchmark names | +| --- | --- | +| [`benchmarks/api/tileop`](benchmarks/api/tileop) | `Cus_Template_ASM`, `MatMacc`, `MatMul`, `MatMul_e4m3`, `Print`, `TAbs`, `TAdd`
`TAdd_mask`, `TAdds`, `TAnd`, `TAssemble`, `TCI`, `TCast`, `TCmp`
`TCopy`, `TLoad`, `TStore`, `TCvt`, `TDiv`, `TDivs`, `TExp`
`TExpandCol`, `TExpandRow`, `TExpandScalar`, `TExtract`, `TFillPad`, `TGather`, `TMax`
`TMaxs`, `TMin`, `TMins`, `TMul`, `TMuls`, `TOr`, `TPad`
`TRSqrt`, `TRecip`, `TRem`, `TReshape`, `TRowMax`, `TRowMaxExpand`, `TRowSum`
`TRowSumExpand`, `TScatter`, `TSelect`, `TSqrt`, `TSub`, `TSubs`, `TTrans`
`test_MatMacc`, `test_MatMmxac`, `test_MatMul`, `test_MatMulmx` | +| [`benchmarks/npu/cube`](benchmarks/npu/cube) | `LLAMA3_70B_attn_matmul_decode_bs_192`, `LLAMA3_70B_ffn_matmul_3_decode_bs_192`, `QuantBatchMatmulV3_292_hif4`, `QuantBatchMatmulV3_293_hif4`, `QuantBatchMatmulV3_294_hif4`, `QuantBatchMatmulV3_295_hif4`, `QuantBatchMatmulV3_296_hif4`
`QuantBatchMatmulV3_297_hif4`, `dsv3_q_up_proj_mxfp8`, `llama3_70b_w8_bs_1_case_4`, `llama_train_mm_2_A16W4`, `llama_train_mm_2_A16W8`, `llama_train_mm_2_mxfp8_mxfp4`, `llava1_6_6`
`mat_mul_o1_align_0001`, `matmul_1_bs16_fp8_GB_test`, `model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf`, `moe_w1w3_bs16_fp8_GB_DN_nbuf`, `mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022`, `mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16`, `xinghuo_13b_tp8_matmul_01_A16W8`
`xinghuo_13b_tp8_matmul_01_mxfp8_modified`, `xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4` | +| [`benchmarks/npu/fusion`](benchmarks/npu/fusion) | `fa1`, `fa10`, `fa11`, `fa2`, `fa3`, `fa4`, `fa5`
`fa6`, `fa7`, `fa8`, `fa9`, `fa_fp4`, `flashmla13` | +| [`benchmarks/npu/nddma`](benchmarks/npu/nddma) | `transpose_053_mgather`, `transpose_053_tload` | +| [`benchmarks/npu/vec_simd`](benchmarks/npu/vec_simd) | `Add_ND_bfloat16_float32_DeepSeek_V3_000028`, `LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic`, `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic`, `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV`, `gemm_18x128x256`, `layernorm_vcadd_vaddx3_12288_fp16`, `moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV`
`rmsnorm_reduce_1_16384_fp16`, `rmsnorm_reduce_2_8192_fp16`, `rmsnorm_reduce_4_4096_fp16`, `rmsnorm_reduce_4_5120_fp16`, `rope_32_40_1_64_bf16`, `softmax_8_34_fp16`, `softmax_LLM_2`
`softmax_vaddx3_vcadd_1_4096_bf16`, `softmax_vaddx3_vcadd_1_4096_fp16`, `swiglu_64_1024_fp16` | +| [`benchmarks/npu/vec_simt`](benchmarks/npu/vec_simt) | `npu_hashtable_insert_cmp_host`, `npu_hashtable_lookup_cmp_host`, `hashfind` | +| [`benchmarks/kernels/composite`](benchmarks/kernels/composite) | `flash_attention`, `flash_attention_mask`, `gemm`, `linear`, `matmul`, `normalization`, `onlinesoftmax`
`softmax` | +| [`benchmarks/kernels/control`](benchmarks/kernels/control) | `hashfind`, `hashtable_lookup_simd`, `hashtable_lookup_simt`, `hashtable_lookup_simt_v2`, `hkv` | +| [`benchmarks/kernels/element_wise/gelu`](benchmarks/kernels/element_wise/gelu) | `gelu` | +| [`benchmarks/kernels/fusion`](benchmarks/kernels/fusion) | `fa_hif4` | +| [`benchmarks/kernels/gemm/matmul`](benchmarks/kernels/gemm/matmul) | `A16W4`, `HiF4_HiF4` | +| [`benchmarks/kernels/memory/broadcast`](benchmarks/kernels/memory/broadcast) | `broadcast`, `broadcast_019`, `broadcast_039`, `broadcast_07`, `broadcast_Hunyuan`, `broadcast_mscatter`, `broadcast_nostore`
`broadcast_nomg`, `broadcast_tst` | +| [`benchmarks/kernels/memory/broadcast_vec`](benchmarks/kernels/memory/broadcast_vec) | `broadcast_vec_019`, `broadcast_vec_039`, `broadcast_vec_07` | +| [`benchmarks/kernels/memory/concat_gather`](benchmarks/kernels/memory/concat_gather) | `concat_gather` | +| [`benchmarks/kernels/memory/concat_scatter`](benchmarks/kernels/memory/concat_scatter) | `concat_scatter` | +| [`benchmarks/kernels/memory/gather`](benchmarks/kernels/memory/gather) | `gather` | +| [`benchmarks/kernels/memory/transpose`](benchmarks/kernels/memory/transpose) | `transpose` | +| [`benchmarks/kernels/reduction/reducemax_col`](benchmarks/kernels/reduction/reducemax_col) | `reducemax_col` | +| [`benchmarks/kernels/reduction/reducemax_row`](benchmarks/kernels/reduction/reducemax_row) | `reducemax_row` | +| [`benchmarks/kernels/reduction/reducesum_col`](benchmarks/kernels/reduction/reducesum_col) | `reducesum_col` | +| [`benchmarks/kernels/reduction/reducesum_row`](benchmarks/kernels/reduction/reducesum_row) | `reducesum_row` | +| [`benchmarks/kernels/sort/topk`](benchmarks/kernels/sort/topk) | `topk` | +| [`benchmarks/models/deepseekv3`](benchmarks/models/deepseekv3) | `concat`, `expand`, `gate`, `mask`, `mla`, `mlp`, `moe`
`permute`, `projection`, `rmsnorm`, `rope`, `split`, `topk`, `transformer` | +| [`benchmarks/microbench/cube`](benchmarks/microbench/cube) | `matop` | +| [`benchmarks/microbench/lmbench`](benchmarks/microbench/lmbench) | `mem` | +| [`benchmarks/microbench/vec`](benchmarks/microbench/vec) | `lat_bw` | + +## Building Benchmarks + +Most benchmark directories include [`benchmarks/common/Makefile.common`](benchmarks/common/Makefile.common). The common harness is controlled mainly by `TESTCASE`, `PLAT`, `COMPILER_DIR`, and `QEMU`. ```sh -cd test/tileop_api +cd benchmarks/api/tileop make clean make TESTCASE=TAdd PLAT=cpu COMPILER_DIR=/usr/bin make TESTCASE=TAdd PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin -make TESTCASE=TAdd PLAT=linx QEMU=/path/to/qemu-linx sim +make TESTCASE=TAdd PLAT=linx QEMU=/path/to/supertest-compatible-qemu sim ``` | Variable | Meaning | | --- | --- | -| `TESTCASE` | Source basename to build, for example `TAdd` in `test/tileop_api/src/TAdd.cpp`. | +| `TESTCASE` | Source basename or suite-local case name, for example `TAdd` in `benchmarks/api/tileop/src/TAdd.cpp`. | | `PLAT=cpu` | Builds against the CPU simulation backend and defines `__cpu_sim__`. | | `PLAT=linx` | Builds for the Linx target and defines `__linx`. | | `PLAT=arm_sme` | Builds Arm SME-oriented cases and defines `__ARM_FEATURE_SME`. | -| `COMPILER_DIR` | Directory containing `clang`, `clang++`, `llvm-objdump`, and related tools. | -| `QEMU` | Simulator binary used by `make sim` for Linx-targeted test execution. | +| `COMPILER_DIR` | Directory containing `clang`, `clang++`, `llvm-objdump`, `llvm-objcopy`, and related tools. | +| `QEMU` | Simulator binary used by `make sim` for Linx-targeted execution. | Common targets: @@ -88,63 +156,48 @@ make clean make clean_all ``` -Generated objects, executables, disassembly, and logs are written below -`output/`. +Generated objects, executables, disassembly, and logs are written below `output/`. ## Batch Suites -Many suites provide a local `compile.all` file. Run these from the suite -directory so relative paths and local make variables resolve as intended. +Many suites provide a local `compile*.all` file. Run these from the suite directory so relative paths and local make variables resolve as intended. Examples: ```sh -cd test/tileop_api && bash compile.all -cd test/py_api && bash compile.all -cd test/accelerator/vec_simt && bash compile.all -cd test/kernel/gemm/matmul && bash compile.all +cd benchmarks/api/tileop && bash compile.all +cd benchmarks/npu/vec_simt && bash compile.all +cd benchmarks/kernels/gemm/matmul && bash compile.all +cd benchmarks/models/deepseekv3 && bash compile.all ``` -For recursive compile/run automation, see -[`test/script/README.md`](test/script/README.md). +For recursive compile/run automation, see [`benchmarks/scripts/recursive/README.md`](benchmarks/scripts/recursive/README.md). -## Python API And Golden Comparison +## Header Installation -The Python API flow builds a shared object and compares selected cases against -Python golden logic. +The top-level `Makefile` installs the TileOP header tree into the active Clang resource directory under `include/tileop-api`. ```sh -cd test/py_api -make clean -make TESTCASE=tileop_py -python3 golden_cmp/golden_cmp.py -i tadd +make -n CLANG_PREFIX=/usr +make install CLANG_PREFIX=/path/to/clang/prefix +make uninstall CLANG_PREFIX=/path/to/clang/prefix ``` -For adding new golden cases, see -[`test/py_api/golden_cmp/README.md`](test/py_api/golden_cmp/README.md). +`CLANG_PREFIX` should point to a prefix containing `bin/clang`. The dry run is useful because the install location is derived from `clang -print-resource-dir`. ## Adding Work -Use the existing directory shape when adding code: +Use the current directory ownership when adding code: 1. Add API-facing definitions or declarations in [`include/common`](include/common). -2. Add backend behavior in the relevant backend directory, usually - [`include/cpu_sim`](include/cpu_sim), [`include/jcore`](include/jcore), or - [`include/aarch64`](include/aarch64). -3. Add reusable compute kernels under the matching [`kernels`](kernels) - domain. -4. Add focused tests under [`test/tileop_api`](test/tileop_api) or the - relevant suite under [`test/kernel`](test/kernel), - [`test/accelerator`](test/accelerator), or [`test/other`](test/other). -5. Add the case to the local `compile.all` file when it should be part of the - batch suite. - -New make-driven test directories should keep the local `Makefile` small and -include [`test/common/Makefile.common`](test/common/Makefile.common) for shared -platform flags, output paths, simulator targets, and cleanup behavior. +2. Add backend behavior in [`include/cpu_sim`](include/cpu_sim), [`include/jcore`](include/jcore), [`include/aarch64`](include/aarch64), or the relevant support include directory. +3. Add reusable compute kernels under the matching [`kernels`](kernels) domain. +4. Add Linx-buildable benchmark entrypoints under the matching [`benchmarks`](benchmarks) category. +5. Add non-benchmark correctness material under [`tests`](tests). +6. Move superseded or duplicate material to [`archive/outdated`](archive/outdated) with a replacement note instead of deleting it. + +New make-driven benchmark directories should keep the local `Makefile` small and include [`benchmarks/common/Makefile.common`](benchmarks/common/Makefile.common) for shared platform flags, output paths, simulator targets, and cleanup behavior. ## Generated Files -Do not commit generated files from `output/`, object files, executable test -artifacts, local logs, or disassembly files. Keep source changes in `include/`, -`kernels/`, `models/`, and `test/`. +Do not commit generated files from `output/`, object files, executable artifacts, local logs, or ad hoc disassembly files outside [`samples`](samples). Keep source changes in `include/`, `kernels/`, `models/`, `benchmarks/`, `tests/`, `samples/`, `compiler/`, and `archive/outdated/`. diff --git a/archive/outdated/README.md b/archive/outdated/README.md new file mode 100644 index 0000000..3aa7e3b --- /dev/null +++ b/archive/outdated/README.md @@ -0,0 +1,12 @@ +# Outdated Archive + +This directory preserves superseded or unusable material that should not be the default benchmark navigation surface. Nothing here was deleted; the table below records why each item moved and where active work should happen instead. + +| Archived path | Rationale | Replacement | +| --- | --- | --- | +| [`tests/other/tileop_api`](tests/other/tileop_api) | Legacy duplicate TileOP API surface with committed generated logs under `script/checknum_true`. | [`../../benchmarks/api/tileop`](../../benchmarks/api/tileop) | +| [`tests/other/py_api`](tests/other/py_api) | Older Python API duplicate. Active Python correctness material is kept outside the benchmark tree. | [`../../tests/py_api`](../../tests/py_api) | +| [`tests/accelerator/v220`](tests/accelerator/v220) | Superseded legacy NPU validation surface, not part of the active Linx benchmark catalog. | [`../../benchmarks/npu`](../../benchmarks/npu) | +| [`tests/accelerator/v310`](tests/accelerator/v310) | Superseded legacy NPU validation surface, not part of the active Linx benchmark catalog. | [`../../benchmarks/npu`](../../benchmarks/npu) | + +Archive files may retain historical path references because they document the old layout. Do not add new active benchmark cases here. diff --git a/test/accelerator/v220/src/common/data.hpp b/archive/outdated/tests/accelerator/v220/src/common/data.hpp similarity index 100% rename from test/accelerator/v220/src/common/data.hpp rename to archive/outdated/tests/accelerator/v220/src/common/data.hpp diff --git a/test/accelerator/v220/src/st/st1.cpp b/archive/outdated/tests/accelerator/v220/src/st/st1.cpp similarity index 100% rename from test/accelerator/v220/src/st/st1.cpp rename to archive/outdated/tests/accelerator/v220/src/st/st1.cpp diff --git a/test/accelerator/v220/src/ut/TAdd.cpp b/archive/outdated/tests/accelerator/v220/src/ut/TAdd.cpp similarity index 100% rename from test/accelerator/v220/src/ut/TAdd.cpp rename to archive/outdated/tests/accelerator/v220/src/ut/TAdd.cpp diff --git a/test/accelerator/v310/common/data.hpp b/archive/outdated/tests/accelerator/v310/common/data.hpp similarity index 100% rename from test/accelerator/v310/common/data.hpp rename to archive/outdated/tests/accelerator/v310/common/data.hpp diff --git a/test/accelerator/v310/st/st1.cpp b/archive/outdated/tests/accelerator/v310/st/st1.cpp similarity index 100% rename from test/accelerator/v310/st/st1.cpp rename to archive/outdated/tests/accelerator/v310/st/st1.cpp diff --git a/test/accelerator/v310/ut/TAdd.cpp b/archive/outdated/tests/accelerator/v310/ut/TAdd.cpp similarity index 100% rename from test/accelerator/v310/ut/TAdd.cpp rename to archive/outdated/tests/accelerator/v310/ut/TAdd.cpp diff --git a/test/other/py_api/Makefile b/archive/outdated/tests/other/py_api/Makefile similarity index 100% rename from test/other/py_api/Makefile rename to archive/outdated/tests/other/py_api/Makefile diff --git a/test/other/py_api/compile.all b/archive/outdated/tests/other/py_api/compile.all similarity index 100% rename from test/other/py_api/compile.all rename to archive/outdated/tests/other/py_api/compile.all diff --git a/test/other/py_api/golden_cmp/README.md b/archive/outdated/tests/other/py_api/golden_cmp/README.md similarity index 94% rename from test/other/py_api/golden_cmp/README.md rename to archive/outdated/tests/other/py_api/golden_cmp/README.md index b1aef28..433fe7f 100644 --- a/test/other/py_api/golden_cmp/README.md +++ b/archive/outdated/tests/other/py_api/golden_cmp/README.md @@ -7,12 +7,12 @@ · 文件路径:JanusCoreBench/test/golden_cmp/py_api/src/ · 操作说明: - + 1. 如果是添加一个新的运算方式(如 texp),则需要新建一个 HPP 文件。 2. 如果是同一运算方式的不同属性(如不同的矩阵尺寸或 tile 大小),则直接在对应的 HPP 文件中添加。 · 标准函数格式: - + · 文件头需要包含必要的头文件。 · 声明变量和函数名称时,需注意命名规范。 · 函数声明后,需将函数与模块绑定(m.def)。 @@ -47,9 +47,9 @@ void texp1_py(py::array_t dst_py, py::array_t src_py) { tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXP(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -72,14 +72,14 @@ void bind_texp(py::module_& m) { 步骤说明: 1. 添加文件头: - + · 在文件开头添加包含新 HPP 文件的头文件路径。 ``` #include "src/texp1.hpp" ``` 2. 添加绑定内容: - + · 在模块中绑定新函数。 ``` py::module_ _api = m.def_submodule("_api", "API module"); @@ -93,7 +93,7 @@ void bind_texp(py::module_& m) { 步骤说明: 1. 在 cases 中添加新测试用例: - + · 按照以下格式添加新函数的属性。 ``` { @@ -109,7 +109,7 @@ void bind_texp(py::module_& m) { · **output_shapes**:输出矩阵的形状。 2. 在 op_map 中添加操作映射: - + · 按照以下格式添加新操作的映射。 ``` "texp": [ @@ -129,10 +129,10 @@ void bind_texp(py::module_& m) { 在 /JanusCoreBench/test 路径下,执行以下命令: ``` -make clean -make TESTCASE=tileop_py PLAT=cpu PY_LIB=on -python3 golden_cmp/golden_cmp.py -i tadd1 +make clean +make TESTCASE=tileop_py PLAT=cpu PY_LIB=on +python3 golden_cmp/golden_cmp.py -i tadd1 ``` -其中,PLAT 和 PY_LIB 的值可以根据需要进行修改。具体的可选项可参考 common 文件夹下的 Makefile 。其中 -i 后面跟着的是函数的名称,具体的函数名可以参考 config.json 文件中的内容。 +其中,PLAT 和 PY_LIB 的值可以根据需要进行修改。具体的可选项可参考 common 文件夹下的 Makefile 。其中 -i 后面跟着的是函数的名称,具体的函数名可以参考 config.json 文件中的内容。 之后print出的对比结果中,在最后两行会显示loss(误差)以及是否pass or fail diff --git a/test/other/py_api/golden_cmp/config.json b/archive/outdated/tests/other/py_api/golden_cmp/config.json similarity index 100% rename from test/other/py_api/golden_cmp/config.json rename to archive/outdated/tests/other/py_api/golden_cmp/config.json diff --git a/test/other/py_api/golden_cmp/golden_cmp.py b/archive/outdated/tests/other/py_api/golden_cmp/golden_cmp.py similarity index 100% rename from test/other/py_api/golden_cmp/golden_cmp.py rename to archive/outdated/tests/other/py_api/golden_cmp/golden_cmp.py diff --git a/test/other/py_api/golden_cmp/ref_func_lib.py b/archive/outdated/tests/other/py_api/golden_cmp/ref_func_lib.py similarity index 100% rename from test/other/py_api/golden_cmp/ref_func_lib.py rename to archive/outdated/tests/other/py_api/golden_cmp/ref_func_lib.py diff --git a/test/other/py_api/golden_cmp/test.sh b/archive/outdated/tests/other/py_api/golden_cmp/test.sh similarity index 100% rename from test/other/py_api/golden_cmp/test.sh rename to archive/outdated/tests/other/py_api/golden_cmp/test.sh diff --git a/test/other/py_api/src/flash_attention_py.hpp b/archive/outdated/tests/other/py_api/src/flash_attention_py.hpp similarity index 100% rename from test/other/py_api/src/flash_attention_py.hpp rename to archive/outdated/tests/other/py_api/src/flash_attention_py.hpp diff --git a/test/other/py_api/src/matmul_py.hpp b/archive/outdated/tests/other/py_api/src/matmul_py.hpp similarity index 100% rename from test/other/py_api/src/matmul_py.hpp rename to archive/outdated/tests/other/py_api/src/matmul_py.hpp diff --git a/test/other/py_api/src/softmax_py.hpp b/archive/outdated/tests/other/py_api/src/softmax_py.hpp similarity index 100% rename from test/other/py_api/src/softmax_py.hpp rename to archive/outdated/tests/other/py_api/src/softmax_py.hpp diff --git a/test/other/py_api/src/tadd.hpp b/archive/outdated/tests/other/py_api/src/tadd.hpp similarity index 89% rename from test/other/py_api/src/tadd.hpp rename to archive/outdated/tests/other/py_api/src/tadd.hpp index 68fdc19..a15d36e 100644 --- a/test/other/py_api/src/tadd.hpp +++ b/archive/outdated/tests/other/py_api/src/tadd.hpp @@ -18,18 +18,18 @@ void tadd_py(float* dst, float* src0, float* src1){ int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); + gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } -#ifdef __cpu_sim__ +#ifdef __cpu_sim__ void bind_tadd(py::module_& m) { m.def("tadd", [](py::array_t dst_py, py::array_t src0_py, py::array_t src1_py){ float* dst = static_cast(dst_py.request().ptr); diff --git a/test/other/py_api/src/texp.hpp b/archive/outdated/tests/other/py_api/src/texp.hpp similarity index 95% rename from test/other/py_api/src/texp.hpp rename to archive/outdated/tests/other/py_api/src/texp.hpp index 9c1ab3b..5564bba 100644 --- a/test/other/py_api/src/texp.hpp +++ b/archive/outdated/tests/other/py_api/src/texp.hpp @@ -23,9 +23,9 @@ void texp_py(float* dst, float* src) { tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXP(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } diff --git a/test/other/py_api/src/tileop_py.cpp b/archive/outdated/tests/other/py_api/src/tileop_py.cpp similarity index 100% rename from test/other/py_api/src/tileop_py.cpp rename to archive/outdated/tests/other/py_api/src/tileop_py.cpp diff --git a/test/other/py_api/src/tmax.hpp b/archive/outdated/tests/other/py_api/src/tmax.hpp similarity index 93% rename from test/other/py_api/src/tmax.hpp rename to archive/outdated/tests/other/py_api/src/tmax.hpp index f144668..bdadeff 100644 --- a/test/other/py_api/src/tmax.hpp +++ b/archive/outdated/tests/other/py_api/src/tmax.hpp @@ -23,10 +23,10 @@ void tmax_py(float* dst, float* src0, float* src1){ tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMAX(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } diff --git a/test/other/py_api/src/tsub.hpp b/archive/outdated/tests/other/py_api/src/tsub.hpp similarity index 89% rename from test/other/py_api/src/tsub.hpp rename to archive/outdated/tests/other/py_api/src/tsub.hpp index 7f05ea1..eabf0f6 100644 --- a/test/other/py_api/src/tsub.hpp +++ b/archive/outdated/tests/other/py_api/src/tsub.hpp @@ -18,18 +18,18 @@ void tsub_py(float* dst, float* src0, float* src1) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); + gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TSUB(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } -#ifdef __cpu_sim__ +#ifdef __cpu_sim__ void bind_tsub(py::module_& m) { m.def("tsub", [](py::array_t dst_py, py::array_t src0_py, py::array_t src1_py){ float* dst = static_cast(dst_py.request().ptr); diff --git a/test/other/tileop_api/Makefile b/archive/outdated/tests/other/tileop_api/Makefile similarity index 100% rename from test/other/tileop_api/Makefile rename to archive/outdated/tests/other/tileop_api/Makefile diff --git a/test/other/tileop_api/compile.all b/archive/outdated/tests/other/tileop_api/compile.all similarity index 88% rename from test/other/tileop_api/compile.all rename to archive/outdated/tests/other/tileop_api/compile.all index 116d3be..4ff38ab 100755 --- a/test/other/tileop_api/compile.all +++ b/archive/outdated/tests/other/tileop_api/compile.all @@ -8,13 +8,12 @@ make TESTCASE=TAdd_mask make TESTCASE=TAdd make TESTCASE=TAdds make TESTCASE=TCopy -make TESTCASE=TCopyIn -make TESTCASE=TCopyOut +make TESTCASE=TLoad +make TESTCASE=TStore make TESTCASE=TCvt make TESTCASE=TDiv make TESTCASE=TDivs make TESTCASE=test_MatMacc -make TESTCASE=test_matmul make TESTCASE=test_MatMul make TESTCASE=TExp make TESTCASE=TExpandCol @@ -33,4 +32,4 @@ make TESTCASE=TRowSumExpand make TESTCASE=TSqrt make TESTCASE=TSub make TESTCASE=TSubs -make TESTCASE=TTrans \ No newline at end of file +make TESTCASE=TTrans diff --git a/test/tileop_api/data.hpp b/archive/outdated/tests/other/tileop_api/data.hpp similarity index 90% rename from test/tileop_api/data.hpp rename to archive/outdated/tests/other/tileop_api/data.hpp index c595f25..afca271 100644 --- a/test/tileop_api/data.hpp +++ b/archive/outdated/tests/other/tileop_api/data.hpp @@ -1,16 +1,34 @@ #ifndef DATA_H #define DATA_H +#ifdef __linx +#include +#include +extern "C" void exit(int); +extern "C" void free(void *); +extern "C" void *malloc(size_t); +extern "C" int printf(const char *, ...); +#else #include #include +#endif #include "common/type.hpp" +#ifdef __linx +static constexpr float s_fp32 = 0.1f; +static constexpr __half s_fp16 = __half(0.0f); +static constexpr int8_t s_i8 = 1; +static constexpr int16_t s_i16 = 1; +static constexpr int32_t s_i32 = 1; +static constexpr int64_t s_i64 = 1; +#else float s_fp32 = 0.1; __half s_fp16 = 0.1; int8_t s_i8 = 1; int16_t s_i16 = 1; int32_t s_i32 = 1; int64_t s_i64 = 1; +#endif template void init_src_uint(T *aar, uint16_t size) { for (uint16_t i = 0; i < size; i++) { @@ -36,7 +54,13 @@ void init_src_int8(int8_t *aar, uint16_t size) { template void init_src_fp(T *aar, uint16_t size) { for (uint16_t i = 0; i < size; i++) { +#ifdef __linx + const float x = (i + 1) / 100.0f; + const float x2 = x * x; + aar[i] = x * (1.0f - x2 / 6.0f + (x2 * x2) / 120.0f); +#else aar[i] = sin((i + 1) / 100.0f); +#endif } } @@ -81,22 +105,37 @@ template void init_rows_fp(T *aar, uint16_t row, uint16_t col) { } template void OutArray(const T *aar, size_t size) { +#ifdef __linx + (void)aar; + (void)size; +#else for (uint16_t i = 0; i < size; i++) { std::cout << aar[i] << " "; } std::cout << std::endl; +#endif } void OutArray(const int8_t *aar, size_t size) { +#ifdef __linx + (void)aar; + (void)size; +#else for (uint16_t i = 0; i < size; i++) { std::cout << static_cast(aar[i]) << " "; } std::cout << std::endl; +#endif } void OutArray(const __half *aar, size_t size) { +#ifdef __linx + (void)aar; + (void)size; +#else for (uint16_t i = 0; i < size; i++) { std::cout << static_cast<__fp16>(aar[i]) << " "; } std::cout << std::endl; +#endif } // check memory allocation @@ -181,4 +220,4 @@ template void check_mem_alloc(const T *p) { free(d2); \ free(d3); -#endif \ No newline at end of file +#endif diff --git a/test/other/tileop_api/script/README.md b/archive/outdated/tests/other/tileop_api/script/README.md similarity index 100% rename from test/other/tileop_api/script/README.md rename to archive/outdated/tests/other/tileop_api/script/README.md diff --git a/test/other/tileop_api/script/checknum_true/MatMacc.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/MatMacc.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/MatMacc.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/MatMacc.log diff --git a/test/other/tileop_api/script/checknum_true/MatMul.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/MatMul.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/MatMul.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/MatMul.log diff --git a/test/other/tileop_api/script/checknum_true/TAbs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAbs.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TAbs.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAbs.log diff --git a/test/other/tileop_api/script/checknum_true/TAdd.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAdd.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TAdd.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAdd.log diff --git a/test/other/tileop_api/script/checknum_true/TAdds.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAdds.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TAdds.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAdds.log diff --git a/test/other/tileop_api/script/checknum_true/TAssemble.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAssemble.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TAssemble.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAssemble.log diff --git a/test/other/tileop_api/script/checknum_true/TCopy.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TCopy.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TCopy.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TCopy.log diff --git a/test/other/tileop_api/script/checknum_true/TCvt.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TCvt.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TCvt.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TCvt.log diff --git a/test/other/tileop_api/script/checknum_true/TDiv.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TDiv.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TDiv.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TDiv.log diff --git a/test/other/tileop_api/script/checknum_true/TDivs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TDivs.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TDivs.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TDivs.log diff --git a/test/other/tileop_api/script/checknum_true/TExp.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExp.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TExp.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExp.log diff --git a/test/other/tileop_api/script/checknum_true/TExpandCol.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandCol.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TExpandCol.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandCol.log diff --git a/test/other/tileop_api/script/checknum_true/TExpandRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandRow.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TExpandRow.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandRow.log diff --git a/test/other/tileop_api/script/checknum_true/TExpandScalar.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandScalar.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TExpandScalar.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandScalar.log diff --git a/test/other/tileop_api/script/checknum_true/TExtract.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExtract.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TExtract.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExtract.log diff --git a/test/other/tileop_api/script/checknum_true/TGatherElementCol.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementCol.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TGatherElementCol.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementCol.log diff --git a/test/other/tileop_api/script/checknum_true/TGatherElementRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementRow.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TGatherElementRow.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementRow.log diff --git a/test/other/tileop_api/script/checknum_true/TGatherRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherRow.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TGatherRow.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherRow.log diff --git a/test/other/tileop_api/script/checknum_true/TCopyIn.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TLoad.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TCopyIn.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TLoad.log diff --git a/test/other/tileop_api/script/checknum_true/TMax.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMax.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TMax.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMax.log diff --git a/test/other/tileop_api/script/checknum_true/TMaxs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMaxs.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TMaxs.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMaxs.log diff --git a/test/other/tileop_api/script/checknum_true/TMin.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMin.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TMin.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMin.log diff --git a/test/other/tileop_api/script/checknum_true/TMins.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMins.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TMins.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMins.log diff --git a/test/other/tileop_api/script/checknum_true/TMul.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMul.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TMul.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMul.log diff --git a/test/other/tileop_api/script/checknum_true/TMuls.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMuls.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TMuls.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMuls.log diff --git a/test/other/tileop_api/script/checknum_true/TRSqrt.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRSqrt.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TRSqrt.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRSqrt.log diff --git a/test/other/tileop_api/script/checknum_true/TRecip.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRecip.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TRecip.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRecip.log diff --git a/test/other/tileop_api/script/checknum_true/TReshape.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TReshape.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TReshape.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TReshape.log diff --git a/test/other/tileop_api/script/checknum_true/TRowMax.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMax.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TRowMax.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMax.log diff --git a/test/other/tileop_api/script/checknum_true/TRowMaxExpand.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMaxExpand.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TRowMaxExpand.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMaxExpand.log diff --git a/test/other/tileop_api/script/checknum_true/TRowSum.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSum.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TRowSum.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSum.log diff --git a/test/other/tileop_api/script/checknum_true/TRowSumExpand.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSumExpand.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TRowSumExpand.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSumExpand.log diff --git a/test/other/tileop_api/script/checknum_true/TScatterElementCol.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementCol.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TScatterElementCol.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementCol.log diff --git a/test/other/tileop_api/script/checknum_true/TScatterElementRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementRow.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TScatterElementRow.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementRow.log diff --git a/test/other/tileop_api/script/checknum_true/TSelect.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSelect.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TSelect.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSelect.log diff --git a/test/other/tileop_api/script/checknum_true/TSqrt.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSqrt.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TSqrt.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSqrt.log diff --git a/test/other/tileop_api/script/checknum_true/TCopyOut.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TStore.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TCopyOut.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TStore.log diff --git a/test/other/tileop_api/script/checknum_true/TSub.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSub.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TSub.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSub.log diff --git a/test/other/tileop_api/script/checknum_true/TSubs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSubs.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TSubs.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSubs.log diff --git a/test/other/tileop_api/script/checknum_true/TTrans.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TTrans.log similarity index 100% rename from test/other/tileop_api/script/checknum_true/TTrans.log rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TTrans.log diff --git a/test/other/tileop_api/script/get_checknum.py b/archive/outdated/tests/other/tileop_api/script/get_checknum.py similarity index 100% rename from test/other/tileop_api/script/get_checknum.py rename to archive/outdated/tests/other/tileop_api/script/get_checknum.py diff --git a/test/other/tileop_api/script/test.py b/archive/outdated/tests/other/tileop_api/script/test.py similarity index 100% rename from test/other/tileop_api/script/test.py rename to archive/outdated/tests/other/tileop_api/script/test.py diff --git a/test/tileop_api/src/MatMacc.cpp b/archive/outdated/tests/other/tileop_api/src/MatMacc.cpp similarity index 62% rename from test/tileop_api/src/MatMacc.cpp rename to archive/outdated/tests/other/tileop_api/src/MatMacc.cpp index b0643fe..2c665a2 100644 --- a/test/tileop_api/src/MatMacc.cpp +++ b/archive/outdated/tests/other/tileop_api/src/MatMacc.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape_A = global_tensor>; @@ -23,11 +74,11 @@ void test_RowMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, res); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, res); MATMACC(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -48,21 +99,70 @@ void test_ColMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, res); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, res); MATMACC(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; +#else const uint16_t M = 16; const uint16_t K = 8; const uint16_t N = 32; +#endif + + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + +#ifdef __linx + static int64_t dst_rm[size_C]; + static int64_t src0_rm[size_A]; + static int64_t src1_rm[size_B]; + static int64_t base_rm[size_C]; - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; + for (size_t row = 0; row < M; ++row) { + for (size_t k = 0; k < K; ++k) { + const int64_t value = static_cast((row + 1) * (k + 2)); + src0_rm[row * K + k] = value; + } + } + for (size_t k = 0; k < K; ++k) { + for (size_t col = 0; col < N; ++col) { + const int64_t value = static_cast((k + 1) + (col + 1)); + src1_rm[k * N + col] = value; + } + } + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + const int64_t value = static_cast(10 + row * N + col); + dst_rm[row * N + col] = value; + base_rm[row * N + col] = value; + } + } + + test_RowMajor(dst_rm, src0_rm, src1_rm); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = base_rm[row * N + col]; + for (size_t k = 0; k < K; ++k) { + expected += src0_rm[row * K + k] * src1_rm[k * N + col]; + } + if (dst_rm[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else float *dst = (float *)malloc(size_C * sizeof(float)); check_mem_alloc(dst); @@ -77,7 +177,7 @@ int main() { __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half)); check_mem_alloc(dst_f16); - init_dst_no_zero(dst_f16, size_C); + init_dst_no_zero(dst_f16, size_C); __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half)); check_mem_alloc(src0_f16); @@ -85,44 +185,44 @@ int main() { __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, size_B); - + int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst_no_zero(dst_i8, size_C); - + int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, size_A); int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, size_B); - + int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst_no_zero(dst_i16, size_C); - + int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, size_A); int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, size_B); - + int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst_no_zero(dst_i32, size_C); - + int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, size_A); int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, size_B); - + int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst_no_zero(dst_i64, size_C); - + int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, size_A); @@ -135,7 +235,7 @@ int main() { #endif //test_RowMajor(dst, src0, src1); - + //test_RowMajor(dst_f16, src0_f16, src1_f16); //test_RowMajor(dst_i8, src0_i8, src1_i8); @@ -157,30 +257,31 @@ int main() { OutArray(dst_i16, size_C); OutArray(dst_i32, size_C); OutArray(dst_i64, size_C); - + free(dst); free(src0); free(src1); - + free(dst_f16); free(src0_f16); free(src1_f16); - + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/MatMul.cpp b/archive/outdated/tests/other/tileop_api/src/MatMul.cpp similarity index 70% rename from test/tileop_api/src/MatMul.cpp rename to archive/outdated/tests/other/tileop_api/src/MatMul.cpp index 1942e97..99d8a33 100644 --- a/test/tileop_api/src/MatMul.cpp +++ b/archive/outdated/tests/other/tileop_api/src/MatMul.cpp @@ -5,6 +5,48 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + auto *d = static_cast(dst); + const auto *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape_A = global_tensor>; @@ -23,10 +65,10 @@ void test_RowMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -47,13 +89,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + init_dst(dst_i64, size_C); + init_src_int(src0_i64, size_A); + init_src_int(src1_i64, size_B); + + test_RowMajor(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else const uint16_t M = 16; const uint16_t K = 32; const uint16_t N = 32; @@ -75,7 +149,7 @@ int main() { __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half)); check_mem_alloc(dst_f16); - init_dst(dst_f16, size_C); + init_dst(dst_f16, size_C); __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half)); check_mem_alloc(src0_f16); @@ -83,44 +157,44 @@ int main() { __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, size_B); - + int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, size_C); - + int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, size_A); int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, size_B); - + int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, size_C); - + int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, size_A); int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, size_B); - + int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, size_C); - + int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, size_A); int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, size_B); - + int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, size_C); - + int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, size_A); @@ -133,7 +207,7 @@ int main() { #endif test_RowMajor(dst, src0, src1); - + test_RowMajor(dst_f16, src0_f16, src1_f16); test_RowMajor(dst_i8, src0_i8, src1_i8); @@ -155,30 +229,31 @@ int main() { OutArray(dst_i16, size_C); OutArray(dst_i32, size_C); OutArray(dst_i64, size_C); - + free(dst); free(src0); free(src1); - + free(dst_f16); free(src0_f16); free(src1_f16); - + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/archive/outdated/tests/other/tileop_api/src/MatMul_e4m3.cpp b/archive/outdated/tests/other/tileop_api/src/MatMul_e4m3.cpp new file mode 100644 index 0000000..d9870b0 --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/MatMul_e4m3.cpp @@ -0,0 +1,188 @@ +#include +#include "../data.hpp" +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + auto *d = static_cast(dst); + const auto *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} + +template +void test(int64_t *dst, int64_t *src0, int64_t *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = Tile; + using tile_shape_B = Tile; + using tile_shape_C = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + TSTORE(res, d2); +} +#else +template +void __vec__ test_cvt(typename TA::TileDType __out__ a, + typename TB::TileDType __in__ b) { + using AType = typename TA::DType; + using BType = typename TB::DType; + BType *pb = blkv_get_tile_ptr(b); + AType *pa = blkv_get_tile_ptr(a); + int x = blkv_get_index_x(); + int y = blkv_get_index_y(); + int idx = index(y, x); + AType o = (AType)(pb[idx]); + pa[idx] = o; +} + +template +void test(float *dst, float *src0, float *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = TileLeft; + using tile_shape_B = TileRight; + using tile_shape_C = TileAcc; + using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>; + using tile_shape_LB = TileRight<__fp8_e4m3, K, N>; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + tile_shape_LA lda; + tile_shape_LB ldb; + + TLOAD(d0, s0); + TLOAD(d1, s1); + test_cvt<<>>(lda.data(), d0.data()); + test_cvt<<>>(ldb.data(), d1.data()); + MATMUL(d2, lda, ldb); + TSTORE(res, d2); +} +#endif + +int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + init_dst(dst_i64, size_C); + init_src_int(src0_i64, size_A); + init_src_int(src1_i64, size_B); + + test(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 64; + const uint16_t K = 32; + const uint16_t N = 64; + + size_t size_A = M * K; + size_t size_B = K * N; + size_t size_C = M * N; + + float *dst = (float *)malloc(size_C * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size_C); + + float *src0 = (float *)malloc(size_A * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, size_A); + float *src1 = (float *)malloc(size_B * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, size_B); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_C); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TAbs.cpp b/archive/outdated/tests/other/tileop_api/src/TAbs.cpp similarity index 62% rename from test/tileop_api/src/TAbs.cpp rename to archive/outdated/tests/other/tileop_api/src/TAbs.cpp index dc9f3f6..e580599 100644 --- a/test/tileop_api/src/TAbs.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TAbs.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -20,21 +52,21 @@ void test_RowMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TABS(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } - + template void test_ColMajor(T *dst, T *src0) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -44,24 +76,42 @@ void test_ColMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TABS(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 16; - const uint16_t tile_col = 16; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 16; + constexpr uint16_t tile_col = 16; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + test_RowMajor(dst, src); + return 0; +#else float *dst_col = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst_col); init_dst(dst_col, gm_size); @@ -73,7 +123,7 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); @@ -96,9 +146,10 @@ int main() { free(dst_col); free(src0_col); - + free(dst_f16); free(src0_f16); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TAdd.cpp b/archive/outdated/tests/other/tileop_api/src/TAdd.cpp similarity index 78% rename from test/tileop_api/src/TAdd.cpp rename to archive/outdated/tests/other/tileop_api/src/TAdd.cpp index 250f76e..7fc20ed 100644 --- a/test/tileop_api/src/TAdd.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TAdd.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } - + template void test_ColMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else const uint16_t gm_row = 64; const uint16_t gm_col = 32; const uint16_t tile_row = 64; const uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; +#ifdef __linx + static int64_t dst_i64[gm_size]; + static int64_t src0_i64[gm_size]; + static int64_t src1_i64[gm_size]; + init_dst(dst_i64, gm_size); + init_src_int(src0_i64, gm_size); + init_src_int(src1_i64, gm_size); + + test_RowMajor(dst_i64, src0_i64, src1_i64); + + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -88,54 +140,56 @@ int main() { check_mem_alloc(src1_col); init_src_fp(src1_col, gm_size); +#ifndef __linx __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, gm_size); - +#endif + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -149,14 +203,16 @@ int main() { test_ColMajor(dst_col, src0_col, src1_col); +#ifndef __linx test_ColMajor(dst_f16, src0_f16, src1_f16); +#endif test_ColMajor(dst_i8, src0_i8, src1_i8); test_RowMajor(dst_i16, src0_i16, src1_i16); - + test_RowMajor(dst_i32, src0_i32, src1_i32); - + test_RowMajor(dst_i64, src0_i64, src1_i64); #ifdef LINX_PMC @@ -166,12 +222,14 @@ int main() { printf("Result:\n"); OutArray(dst, gm_size); OutArray(dst_col, gm_size); +#ifndef __linx OutArray(dst_f16, gm_size); +#endif OutArray(dst_i8, gm_size); OutArray(dst_i16, gm_size); OutArray(dst_i32, gm_size); OutArray(dst_i64, gm_size); - + free(dst); free(src0); free(src1); @@ -180,25 +238,28 @@ int main() { free(src0_col); free(src1_col); +#ifndef __linx free(dst_f16); free(src0_f16); free(src1_f16); - +#endif + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/archive/outdated/tests/other/tileop_api/src/TAdd_mask.cpp b/archive/outdated/tests/other/tileop_api/src/TAdd_mask.cpp new file mode 100644 index 0000000..a1770b2 --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/TAdd_mask.cpp @@ -0,0 +1,191 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +using namespace pto; + +template +void test(T *c_ptr, T *a_ptr, T *b_ptr) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + static constexpr int block_row = gm_row / tile_row; + static constexpr int block_col = gm_col / tile_col; + static constexpr int remainder_row = gm_row % tile_row; + static constexpr int remainder_col = gm_col % tile_col; + + using trailing_rows_shape = + Tile; + using trailing_cols_shape = + Tile; + using trailing_corner_shape = Tile; + + glb_iterator gAIter(a_ptr); + glb_iterator gBIter(b_ptr); + glb_iterator gCIter(c_ptr); + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto gA = gAIter(i, j); + auto gB = gBIter(i, j); + auto gC = gCIter(i, j); + + tile_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + if constexpr (remainder_col) { + auto gA = gAIter(i, block_col); + auto gB = gBIter(i, block_col); + auto gC = gCIter(i, block_col); + + trailing_rows_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + } + if constexpr (remainder_row) { + for (int j = 0; j < block_col; ++j) { + auto gA = gAIter(block_row, j); + auto gB = gBIter(block_row, j); + auto gC = gCIter(block_row, j); + + trailing_cols_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + if constexpr (remainder_col) { + auto gA = gAIter(block_row, block_col); + auto gB = gBIter(block_row, block_col); + auto gC = gCIter(block_row, block_col); + + trailing_corner_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 6; + constexpr uint16_t gm_col = 6; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + const uint16_t gm_row = 66; + const uint16_t gm_col = 66; + const uint16_t tile_row = 16; + const uint16_t tile_col = 16; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src0[gm_size]; + static int64_t src1[gm_size]; + init_dst(dst, gm_size); + init_src_int(src0, gm_size); + init_src_uint(src1, gm_size); + + test(dst, src0, src1); + return 0; +#else + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + float *src1 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TAdds.cpp b/archive/outdated/tests/other/tileop_api/src/TAdds.cpp similarity index 73% rename from test/tileop_api/src/TAdds.cpp rename to archive/outdated/tests/other/tileop_api/src/TAdds.cpp index 38170fd..3545585 100644 --- a/test/tileop_api/src/TAdds.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TAdds.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T s) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -20,21 +52,21 @@ void test_RowMajor(T *dst, T *src0, T s) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TADDS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } - + template void test_ColMajor(T *dst, T *src0, T s) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -44,24 +76,42 @@ void test_ColMajor(T *dst, T *src0, T s) { int offset = i * (tile_col * gm_row) + j * tile_row; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TADDS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst_i64[gm_size]; + static int64_t src0_i64[gm_size]; + init_dst(dst_i64, gm_size); + init_src_int(src0_i64, gm_size); + + test_RowMajor(dst_i64, src0_i64, s_i64); + return 0; +#else float *dst_col = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst_col); init_dst(dst_col, gm_size); @@ -73,7 +123,7 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); @@ -81,31 +131,31 @@ int main() { int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -121,9 +171,9 @@ int main() { test_RowMajor(dst_i8, src0_i8, s_i8); test_RowMajor(dst_i16, src0_i16, s_i16); - + test_RowMajor(dst_i32, src0_i32, s_i32); - + test_RowMajor(dst_i64, src0_i64, s_i64); #ifdef LINX_PMC @@ -143,18 +193,19 @@ int main() { free(dst_f16); free(src0_f16); - + free(dst_i8); free(src0_i8); - + free(dst_i16); free(src0_i16); - + free(dst_i32); free(src0_i32); - + free(dst_i64); free(src0_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/other/tileop_api/src/TAssemble.cpp b/archive/outdated/tests/other/tileop_api/src/TAssemble.cpp similarity index 96% rename from test/other/tileop_api/src/TAssemble.cpp rename to archive/outdated/tests/other/tileop_api/src/TAssemble.cpp index c2ec2e8..d7940f3 100644 --- a/test/other/tileop_api/src/TAssemble.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TAssemble.cpp @@ -29,11 +29,11 @@ void test(float *dst, float *src0, float *src1, float *src2) { tile_shape_src2 d2; tile_shape_dst d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, s2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, s2); TASSEMBLE(d3, d0, d1, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } int main() { diff --git a/test/other/tileop_api/src/TCast.cpp b/archive/outdated/tests/other/tileop_api/src/TCast.cpp similarity index 97% rename from test/other/tileop_api/src/TCast.cpp rename to archive/outdated/tests/other/tileop_api/src/TCast.cpp index 2a36007..6790915 100644 --- a/test/other/tileop_api/src/TCast.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TCast.cpp @@ -18,9 +18,9 @@ void test(T2 *dst, T1 *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCAST(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { diff --git a/test/tileop_api/src/TCopy.cpp b/archive/outdated/tests/other/tileop_api/src/TCopy.cpp similarity index 83% rename from test/tileop_api/src/TCopy.cpp rename to archive/outdated/tests/other/tileop_api/src/TCopy.cpp index bd45f38..d0fb15c 100644 --- a/test/tileop_api/src/TCopy.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TCopy.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_Nz(T *dst, T *src0) { @@ -14,18 +46,18 @@ void test_Nz(T *dst, T *src0) { glb_iterator gS0Iter(src0); glb_iterator gDIter(dst); - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { auto s0 = gS0Iter(i, j); auto res = gDIter(i, j); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCOPY(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -40,7 +72,7 @@ void test_Nz_Dynamic(T *dst, T *src0) { uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; - + for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { uint16_t remainder_row = gm_row - i * tile_valid_row; @@ -55,9 +87,9 @@ void test_Nz_Dynamic(T *dst, T *src0) { tile_shape d0(active_row, active_col); tile_shape d1(active_row, active_col); - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCOPY(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -67,7 +99,7 @@ template >; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -77,11 +109,11 @@ void test_RowMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCOPY(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -112,19 +144,19 @@ void test_RowMajor_Dynamic(T *dst, T *src0) { tile_shape d0(active_row, active_col); tile_shape d1(active_row, active_col); - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCOPY(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } - + template void test_ColMajor(T *dst, T *src0) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -134,24 +166,42 @@ void test_ColMajor(T *dst, T *src0) { int offset = i * (tile_col * gm_row) + j * tile_row; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCOPY(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + test_RowMajor(dst, src); + + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -171,7 +221,7 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); @@ -179,31 +229,31 @@ int main() { int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -211,7 +261,7 @@ int main() { int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst1_i32); init_dst(dst1_i32, gm_size); - + int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); @@ -219,7 +269,7 @@ int main() { int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_nz_i32); init_dst(dst_nz_i32, gm_size); - + int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src_nz_i32); init_src_int(src_nz_i32, gm_size); @@ -239,7 +289,7 @@ int main() { test_RowMajor(dst_i16, src0_i16); test_RowMajor(dst_i32, src0_i32); - + test_RowMajor(dst_i64, src0_i64); test_RowMajor_Dynamic(dst1_i32, src1_i32); @@ -259,22 +309,22 @@ int main() { OutArray(dst_i64, gm_size); OutArray(dst1_i32, gm_size); OutArray(dst_nz_i32, gm_size); - + free(dst); free(src0); - + free(dst_f16); free(src0_f16); - + free(dst_i8); free(src0_i8); - + free(dst_i16); free(src0_i16); - + free(dst_i32); free(src0_i32); - + free(dst_i64); free(src0_i64); @@ -285,4 +335,5 @@ int main() { free(src_nz_i32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TCvt.cpp b/archive/outdated/tests/other/tileop_api/src/TCvt.cpp similarity index 57% rename from test/tileop_api/src/TCvt.cpp rename to archive/outdated/tests/other/tileop_api/src/TCvt.cpp index cdc88f0..f8652e3 100644 --- a/test/tileop_api/src/TCvt.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TCvt.cpp @@ -5,6 +5,39 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void testRow2Nz(float *dst, float *src) { using gm_shape = global_tensor>; @@ -17,10 +50,10 @@ template void testRow2Nz(float *dst, float *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCVT(d1, d0); TCVT(d0, d1); - TCOPYOUT(res, d0); + TSTORE(res, d0); } template void testNz2Col(float *dst, float *src) { @@ -35,10 +68,10 @@ template void testNz2Col(float *dst, float *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCVT(d1, d0); TCVT(d0, d1); - TCOPYOUT(res, d0); + TSTORE(res, d0); } template void testNz2Zn(float *dst, float *src) { @@ -53,10 +86,10 @@ template void testNz2Zn(float *dst, float *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCVT(d1, d0); TCVT(d0, d1); - TCOPYOUT(res, d0); + TSTORE(res, d0); } template void testZn2Nz(float *dst, float *src) { @@ -71,10 +104,10 @@ template void testZn2Nz(float *dst, float *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCVT(d1, d0); TCVT(d0, d1); - TCOPYOUT(res, d0); + TSTORE(res, d0); } template void testNz2Nz(float *dst, float *src) { @@ -89,13 +122,69 @@ template void testNz2Nz(float *dst, float *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCVT(d1, d0); TCVT(d0, d1); - TCOPYOUT(res, d0); + TSTORE(res, d0); } int main() { +#ifdef __linx + constexpr uint16_t row = 16; + constexpr uint16_t col = 16; + using row_tile = Tile; + using col_tile = Tile; + using nz_tile = TileLeft; + using zn_tile = TileRight; + + row_tile row_src; + row_tile row_round; + col_tile col_src; + col_tile col_round; + nz_tile nz_a; + nz_tile nz_b; + zn_tile zn; + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + row_src.data()[index(i, j)] = + static_cast((i + 1) * 100 + j); + col_src.data()[index(i, j)] = + static_cast((i + 1) * 1000 + j); + } + } + + TCVT(nz_a, row_src); + TCVT(row_round, nz_a); + TCVT(zn, nz_a); + TCVT(nz_b, zn); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + if (row_round.data()[index(i, j)] != + row_src.data()[index(i, j)]) { + return 1; + } + if (nz_b.data()[index(i, j)] != + nz_a.data()[index(i, j)]) { + return 2; + } + } + } + + TCVT(nz_a, col_src); + TCVT(col_round, nz_a); + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + if (col_round.data()[index(i, j)] != + col_src.data()[index(i, j)]) { + return 3; + } + } + } + + return 0; +#else const uint16_t row = 16; const uint16_t col = 32; @@ -150,4 +239,5 @@ int main() { free(src2); return 0; -} \ No newline at end of file +#endif +} diff --git a/archive/outdated/tests/other/tileop_api/src/TDiv.cpp b/archive/outdated/tests/other/tileop_api/src/TDiv.cpp new file mode 100644 index 0000000..d8950a1 --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/TDiv.cpp @@ -0,0 +1,251 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape s1(src1 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1, d2; + TLOAD(d0, s0); + TLOAD(d1, s1); + TDIV(d2, d1, d0); + TSTORE(res, d2); + } + } +} +template +void test_cm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape s1(src1 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1, d2; + TLOAD(d0, s0); + TLOAD(d1, s1); + TDIV(d2, d1, d0); + TSTORE(res, d2); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + // 64*64-16*16 + const uint16_t gm_row = 64; + const uint16_t gm_col = 64; + const uint16_t tile_row = 32; + const uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_rm[gm_size]; + static int64_t dst_cm[gm_size]; + static int64_t src0_rm[gm_size]; + static int64_t src1_rm[gm_size]; + static int64_t src0_cm[gm_size]; + static int64_t src1_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + init_src_uint(src0_rm, gm_size); + init_src_int(src1_rm, gm_size); + init_src_uint(src0_cm, gm_size); + init_src_int(src1_cm, gm_size); + + test_rm(dst_rm, src0_rm, src1_rm); + test_cm(dst_cm, src0_cm, src1_cm); + + return 0; +#else + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + float *src1 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + + __half *src2 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src2); + init_src_fp(src2, gm_size); + __half *src3 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src3); + init_src_fp(src3, gm_size); + // int8 + int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + + int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src4); + init_src_int8(src4, gm_size); + int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src5); + init_src_int8(src5, gm_size); + // int16 + int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + + int16_t *src6 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src6); + init_src_int(src6, gm_size); + int16_t *src7 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src7); + init_src_int(src7, gm_size); + // int32 + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + + int32_t *src8 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src8); + init_src_int(src8, gm_size); + int32_t *src9 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src9); + init_src_int(src9, gm_size); + // int64 + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + + int64_t *src10 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src10); + init_src_int(src10, gm_size); + int64_t *src11 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src11); + init_src_int(src11, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + test_rm(dst, src0, src1); + test_rm(dst1, src2, src3); + test_rm(dst2, src4, src5); + test_rm(dst3, src6, src7); + test_rm(dst4, src8, src9); + test_rm(dst5, src10, src11); + +#ifdef LINX_PMC + PMC_END(); +#endif + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + + free(dst); + free(src0); + free(src1); + free(dst1); + free(src2); + free(src3); + free(dst2); + free(src4); + free(src5); + free(dst3); + free(src6); + free(src7); + free(dst4); + free(src8); + free(src9); + free(dst5); + free(src10); + free(src11); + return 0; +#endif +} diff --git a/test/tileop_api/src/TDivs.cpp b/archive/outdated/tests/other/tileop_api/src/TDivs.cpp similarity index 65% rename from test/tileop_api/src/TDivs.cpp rename to archive/outdated/tests/other/tileop_api/src/TDivs.cpp index e3773ab..59b5023 100644 --- a/test/tileop_api/src/TDivs.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TDivs.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src, T s) { @@ -20,9 +71,9 @@ void test_rm(T *dst, T *src, T s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TDIVS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -41,21 +92,44 @@ void test_cm(T *dst, T *src, T s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TDIVS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { +#ifdef __linx + constexpr uint64_t gm_row = 4; + constexpr uint64_t gm_col = 4; + constexpr uint64_t tile_row = 4; + constexpr uint64_t tile_col = 4; +#else const uint16_t gm_row = 64; const uint16_t gm_col = 64; const uint16_t tile_row = 32; const uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_rm[gm_size]; + static int64_t dst_cm[gm_size]; + static int64_t src_rm[gm_size]; + static int64_t src_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + init_src_uint(src_rm, gm_size); + init_src_uint(src_cm, gm_size); + + test_rm(dst_rm, src_rm, 2); + test_cm(dst_cm, src_cm, 2); + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); @@ -137,4 +211,5 @@ int main() { free(dst5); free(src5); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TExp.cpp b/archive/outdated/tests/other/tileop_api/src/TExp.cpp similarity index 50% rename from test/tileop_api/src/TExp.cpp rename to archive/outdated/tests/other/tileop_api/src/TExp.cpp index 233d727..2b7e767 100644 --- a/test/tileop_api/src/TExp.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TExp.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src) { @@ -20,9 +71,9 @@ void test_rm(T *dst, T *src) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXP(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -41,14 +92,54 @@ void test_cm(T *dst, T *src) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXP(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { +#ifdef __linx + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; + using row_tile = Tile; + using col_tile = + Tile; + + row_tile src_rm, dst_rm; + col_tile src_cm, dst_cm; + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t value = static_cast((i + j) % 6); + size_t row_index = index(i, j); + size_t col_index = index(i, j); + src_rm.data()[row_index] = value; + src_cm.data()[col_index] = value; + dst_rm.data()[row_index] = 0; + dst_cm.data()[col_index] = 0; + } + } + + TEXP(dst_rm, src_rm); + TEXP(dst_cm, src_cm); + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t value = static_cast((i + j) % 6); + int64_t expected = linx_tile_iexp(value); + if (dst_rm.data()[index(i, j)] != expected) { + return 1; + } + if (dst_cm.data()[index(i, j)] != expected) { + return 2; + } + } + } + + return 0; +#else const uint16_t gm_row = 64; const uint16_t gm_col = 64; const uint16_t tile_row = 16; @@ -95,4 +186,5 @@ int main() { free(src2); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TExpandCol.cpp b/archive/outdated/tests/other/tileop_api/src/TExpandCol.cpp similarity index 73% rename from test/tileop_api/src/TExpandCol.cpp rename to archive/outdated/tests/other/tileop_api/src/TExpandCol.cpp index 2950077..4b6137a 100644 --- a/test/tileop_api/src/TExpandCol.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TExpandCol.cpp @@ -5,6 +5,39 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -17,9 +50,9 @@ template void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXPANDCOL(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template void test_cm(T *dst, T *src) { using gm_shape_in = global_tensor>; @@ -33,12 +66,30 @@ template void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXPANDCOL(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else const uint16_t row = 32; const uint16_t col = 32; @@ -131,4 +182,5 @@ int main() { free(src5); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TExpandRow.cpp b/archive/outdated/tests/other/tileop_api/src/TExpandRow.cpp similarity index 73% rename from test/tileop_api/src/TExpandRow.cpp rename to archive/outdated/tests/other/tileop_api/src/TExpandRow.cpp index d6d0005..f448a92 100644 --- a/test/tileop_api/src/TExpandRow.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TExpandRow.cpp @@ -5,6 +5,39 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; @@ -18,9 +51,9 @@ void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXPANDROW(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template void test_cm(T *dst, T *src) { @@ -35,12 +68,30 @@ void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXPANDROW(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else const uint16_t row = 32; const uint16_t col = 32; size_t size_in = col; @@ -135,4 +186,5 @@ int main() { free(dst5); free(src5); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TExpandScalar.cpp b/archive/outdated/tests/other/tileop_api/src/TExpandScalar.cpp similarity index 72% rename from test/tileop_api/src/TExpandScalar.cpp rename to archive/outdated/tests/other/tileop_api/src/TExpandScalar.cpp index 1bff82a..7b9e347 100644 --- a/test/tileop_api/src/TExpandScalar.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TExpandScalar.cpp @@ -5,6 +5,39 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T s) { @@ -14,7 +47,7 @@ void test_rm(T *dst, T s) { tile_shape d0; TEXPANDSCALAR(d0, s); - TCOPYOUT(res, d0); + TSTORE(res, d0); } template (dst_rm, s_i64); + test_cm(dst_cm, s_i64); + return 0; +#else const uint16_t gm_row = 16; const uint16_t gm_col = 32; const uint16_t tile_row = 16; @@ -116,4 +165,5 @@ int main() { free(dst6); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/other/tileop_api/src/TExtract.cpp b/archive/outdated/tests/other/tileop_api/src/TExtract.cpp similarity index 97% rename from test/other/tileop_api/src/TExtract.cpp rename to archive/outdated/tests/other/tileop_api/src/TExtract.cpp index dd38494..3e80365 100644 --- a/test/other/tileop_api/src/TExtract.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TExtract.cpp @@ -20,9 +20,9 @@ void test(float *dst, float *src, uint16_t offest) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXTRACT(d1, d0, offest); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { diff --git a/test/other/tileop_api/src/TGatherElementCol.cpp b/archive/outdated/tests/other/tileop_api/src/TGatherElementCol.cpp similarity index 96% rename from test/other/tileop_api/src/TGatherElementCol.cpp rename to archive/outdated/tests/other/tileop_api/src/TGatherElementCol.cpp index e72a015..31dd225 100644 --- a/test/other/tileop_api/src/TGatherElementCol.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TGatherElementCol.cpp @@ -23,10 +23,10 @@ void test(float *dst, float *src0, uint16_t *src1) { tile_shape_src1 d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TGATHERELEMENTCOL(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { diff --git a/test/other/tileop_api/src/TGatherElementRow.cpp b/archive/outdated/tests/other/tileop_api/src/TGatherElementRow.cpp similarity index 96% rename from test/other/tileop_api/src/TGatherElementRow.cpp rename to archive/outdated/tests/other/tileop_api/src/TGatherElementRow.cpp index cbe5573..ed9adbb 100644 --- a/test/other/tileop_api/src/TGatherElementRow.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TGatherElementRow.cpp @@ -23,10 +23,10 @@ void test(float *dst, float *src0, uint16_t *src1) { tile_shape_src1 d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TGATHERELEMENTROW(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { diff --git a/test/other/tileop_api/src/TGatherRow.cpp b/archive/outdated/tests/other/tileop_api/src/TGatherRow.cpp similarity index 97% rename from test/other/tileop_api/src/TGatherRow.cpp rename to archive/outdated/tests/other/tileop_api/src/TGatherRow.cpp index e5728be..a0204e5 100644 --- a/test/other/tileop_api/src/TGatherRow.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TGatherRow.cpp @@ -24,10 +24,10 @@ void test(float *dst, float *src0, uint16_t *src1) { tile_shape_src1 d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TGATHERROW(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { diff --git a/test/tileop_api/src/TCopyIn.cpp b/archive/outdated/tests/other/tileop_api/src/TLoad.cpp similarity index 82% rename from test/tileop_api/src/TCopyIn.cpp rename to archive/outdated/tests/other/tileop_api/src/TLoad.cpp index 944f9f4..e13b0d6 100644 --- a/test/tileop_api/src/TCopyIn.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TLoad.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0) { @@ -12,7 +44,7 @@ void test_RowMajor(T *dst, T *src0) { using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>; using gm_shape = GlobalTensor; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -22,10 +54,10 @@ void test_RowMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } @@ -57,12 +89,12 @@ void test_RowMajor_Dynamic(T *dst, T *src0) { gm_shape res(dst + offset, gm_valid_row, gm_valid_col); tile_shape d0(active_row, active_col); - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } - + template void test_ColMajor(T *dst, T *src0) { @@ -70,7 +102,7 @@ void test_ColMajor(T *dst, T *src0) { using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>; using gm_shape = GlobalTensor; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -80,10 +112,10 @@ void test_ColMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } @@ -101,7 +133,7 @@ void test_Nz_Dynamic(T *dst, T *src0) { uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; - + for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { uint16_t remainder_row = gm_row - i * tile_valid_row; @@ -116,21 +148,39 @@ void test_Nz_Dynamic(T *dst, T *src0) { tile_shape d0(active_row, active_col); tile_shape d1(active_row, active_col); - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_RowMajor(dst, src); + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -142,7 +192,7 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); @@ -150,31 +200,31 @@ int main() { int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -182,7 +232,7 @@ int main() { int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst1_i32); init_dst(dst1_i32, gm_size); - + int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); @@ -190,7 +240,7 @@ int main() { int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_nz_i32); init_dst(dst_nz_i32, gm_size); - + int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src_nz_i32); init_src_int(src_nz_i32, gm_size); @@ -200,15 +250,15 @@ int main() { #endif test_RowMajor(dst, src0); - + test_RowMajor(dst_f16, src0_f16); - + test_RowMajor(dst_i8, src0_i8); test_RowMajor(dst_i16, src0_i16); - + test_RowMajor(dst_i32, src0_i32); - + test_RowMajor(dst_i64, src0_i64); test_RowMajor_Dynamic(dst1_i32, src1_i32); @@ -228,22 +278,22 @@ int main() { OutArray(dst_i64, gm_size); OutArray(dst1_i32, gm_size); OutArray(dst_nz_i32, gm_size); - + free(dst); free(src0); - + free(dst_f16); free(src0_f16); - + free(dst_i8); free(src0_i8); - + free(dst_i16); free(src0_i16); - + free(dst_i32); free(src0_i32); - + free(dst_i64); free(src0_i64); @@ -251,4 +301,5 @@ int main() { free(src1_i32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TMax.cpp b/archive/outdated/tests/other/tileop_api/src/TMax.cpp similarity index 74% rename from test/tileop_api/src/TMax.cpp rename to archive/outdated/tests/other/tileop_api/src/TMax.cpp index 710c31f..25301d6 100644 --- a/test/tileop_api/src/TMax.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TMax.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src0, T *src1) { @@ -21,10 +53,10 @@ void test_rm(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMAX(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -44,22 +76,43 @@ void test_cm(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMAX(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src0[gm_size]; + static int64_t src1[gm_size]; + init_dst(dst, gm_size); + init_src_int(src0, gm_size); + init_src_int(src1, gm_size); + + test_rm(dst, src0, src1); + + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); @@ -170,4 +223,5 @@ int main() { free(src11); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TMaxs.cpp b/archive/outdated/tests/other/tileop_api/src/TMaxs.cpp similarity index 72% rename from test/tileop_api/src/TMaxs.cpp rename to archive/outdated/tests/other/tileop_api/src/TMaxs.cpp index 3593c8e..8833ab9 100644 --- a/test/tileop_api/src/TMaxs.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TMaxs.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src, T s) { @@ -20,9 +52,9 @@ void test_rm(T *dst, T *src, T s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMAXS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -41,21 +73,40 @@ void test_cm(T *dst, T *src, T s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMAXS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_rm(dst, src, s_i64); + + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); @@ -142,4 +193,5 @@ int main() { free(src5); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/other/tileop_api/src/TMin.cpp b/archive/outdated/tests/other/tileop_api/src/TMin.cpp similarity index 95% rename from test/other/tileop_api/src/TMin.cpp rename to archive/outdated/tests/other/tileop_api/src/TMin.cpp index c711356..b6418bb 100644 --- a/test/other/tileop_api/src/TMin.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TMin.cpp @@ -21,10 +21,10 @@ void test(float *dst, float *src0, float *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMIN(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } diff --git a/test/other/tileop_api/src/TMins.cpp b/archive/outdated/tests/other/tileop_api/src/TMins.cpp similarity index 96% rename from test/other/tileop_api/src/TMins.cpp rename to archive/outdated/tests/other/tileop_api/src/TMins.cpp index de3dc22..5673a48 100644 --- a/test/other/tileop_api/src/TMins.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TMins.cpp @@ -20,9 +20,9 @@ void test(float *dst, float *src, float s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMINS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } diff --git a/test/tileop_api/src/TMul.cpp b/archive/outdated/tests/other/tileop_api/src/TMul.cpp similarity index 73% rename from test/tileop_api/src/TMul.cpp rename to archive/outdated/tests/other/tileop_api/src/TMul.cpp index 7436241..9967351 100644 --- a/test/tileop_api/src/TMul.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TMul.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src0, T *src1) { @@ -18,13 +50,13 @@ void test_rm(T *dst, T *src0, T *src1) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); - + gm_shape res(dst + offset); + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMUL(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -41,30 +73,51 @@ void test_cm(T *dst, T *src0, T *src1) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); - + gm_shape res(dst + offset); + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMUL(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src0[gm_size]; + static int64_t src1[gm_size]; + init_dst(dst, gm_size); + init_src_int(src0, gm_size); + init_src_int(src1, gm_size); + + test_rm(dst, src0, src1); - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); - + float *src0 = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(src0); init_src_fp(src0, gm_size); @@ -170,4 +223,5 @@ int main() { free(src11); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TMuls.cpp b/archive/outdated/tests/other/tileop_api/src/TMuls.cpp similarity index 72% rename from test/tileop_api/src/TMuls.cpp rename to archive/outdated/tests/other/tileop_api/src/TMuls.cpp index ae280f5..e451ee0 100644 --- a/test/tileop_api/src/TMuls.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TMuls.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src, T s) { @@ -20,9 +52,9 @@ void test_rm(T *dst, T *src, T s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMULS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -41,21 +73,40 @@ void test_cm(T *dst, T *src, T s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMULS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_rm(dst, src, s_i64); + + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); @@ -142,4 +193,5 @@ int main() { free(src5); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/other/tileop_api/src/TRSqrt.cpp b/archive/outdated/tests/other/tileop_api/src/TRSqrt.cpp similarity index 96% rename from test/other/tileop_api/src/TRSqrt.cpp rename to archive/outdated/tests/other/tileop_api/src/TRSqrt.cpp index 8b2a10c..7e1d236 100644 --- a/test/other/tileop_api/src/TRSqrt.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TRSqrt.cpp @@ -20,9 +20,9 @@ void test(float *dst, float *src) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TRSQRT(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } diff --git a/test/tileop_api/src/TRecip.cpp b/archive/outdated/tests/other/tileop_api/src/TRecip.cpp similarity index 62% rename from test/tileop_api/src/TRecip.cpp rename to archive/outdated/tests/other/tileop_api/src/TRecip.cpp index 70c591b..61555ec 100644 --- a/test/tileop_api/src/TRecip.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TRecip.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src) { @@ -23,9 +74,9 @@ void test_rm(T *dst, T *src) { auto res = gDIter(i, j); tile_shape t0, t1; - TCOPYIN(t0, s0); + TLOAD(t0, s0); TRECIP(t1, t0); - TCOPYOUT(res, t1); + TSTORE(res, t1); } } } @@ -48,22 +99,65 @@ void test_cm(T *dst, T *src) { auto res = gDIter(j, i); tile_shape t0, t1; - TCOPYIN(t0, s0); + TLOAD(t0, s0); TRECIP(t1, t0); - TCOPYOUT(res, t1); + TSTORE(res, t1); } } } int main() { +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else const size_t gm_row = 32; const size_t gm_col = 32; const size_t tile_row = 32; const size_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)gm_size; + (void)tile_size; + +#ifdef __linx + using row_tile = Tile; + using col_tile = + Tile; + row_tile src_rm, dst_rm; + col_tile src_cm, dst_cm; + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + size_t row_index = index(i, j); + size_t col_index = index(i, j); + src_rm.data()[row_index] = 1; + src_cm.data()[col_index] = 1; + dst_rm.data()[row_index] = 0; + dst_cm.data()[col_index] = 0; + } + } + TRECIP(dst_rm, src_rm); + TRECIP(dst_cm, src_cm); + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + if (dst_rm.data()[index(i, j)] != 1) { + return 1; + } + if (dst_cm.data()[index(i, j)] != 1) { + return 2; + } + } + } + + return 0; +#else // int8_t int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_int8); @@ -155,4 +249,5 @@ int main() { free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TReshape.cpp b/archive/outdated/tests/other/tileop_api/src/TReshape.cpp similarity index 70% rename from test/tileop_api/src/TReshape.cpp rename to archive/outdated/tests/other/tileop_api/src/TReshape.cpp index 410a9cb..f83024e 100644 --- a/test/tileop_api/src/TReshape.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TReshape.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test(T *dst, T *src) { @@ -18,20 +50,38 @@ void test(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TRESHAPE(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { - const size_t gm_row = 64; - const size_t gm_col = 64; - const size_t tile_row = 64; - const size_t tile_col = 64; +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 8; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 8; +#else + constexpr size_t gm_row = 64; + constexpr size_t gm_col = 64; + constexpr size_t tile_row = 64; + constexpr size_t tile_col = 64; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_uint(src, gm_size); + + test(dst, src); + + return 0; +#else // int8_t int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_int8); @@ -108,7 +158,7 @@ int main() { OutArray(dst_int64, gm_size); OutArray(dst_f16, gm_size); OutArray(dst_f32, gm_size); - + free(dst_int8); free(src_int8); @@ -123,4 +173,5 @@ int main() { free(dst_f32); free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/archive/outdated/tests/other/tileop_api/src/TRowMax.cpp b/archive/outdated/tests/other/tileop_api/src/TRowMax.cpp new file mode 100644 index 0000000..1c91357 --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/TRowMax.cpp @@ -0,0 +1,203 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAX(d1, d0); + TSTORE(res, d1); +} + +template void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAX(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else + const size_t row = 32; + const size_t col = 32; + + size_t size_in = row * col; + size_t size_out = row * col; + + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, size_out); + + int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_uint(src_int8, size_in); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, size_out); + + int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_uint(src_int16, size_in); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, size_out); + + int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_uint(src_int32, size_in); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, size_out); + + int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_uint(src_int64, size_in); + + // __half + __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, size_out); + + __half *src_f16 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, size_in); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, size_out); + + __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, size_in); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8); + test_rm(dst_int16, src_int16); + test_rm(dst_int32, src_int32); + test_rm(dst_int64, src_int64); + test_cm(dst_f16, src_f16); + test_cm(dst_f32, src_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, size_out); + OutArray(dst_int16, size_out); + OutArray(dst_int32, size_out); + OutArray(dst_int64, size_out); + OutArray(dst_f16, size_out); + OutArray(dst_f32, size_out); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TRowMaxExpand.cpp b/archive/outdated/tests/other/tileop_api/src/TRowMaxExpand.cpp similarity index 69% rename from test/tileop_api/src/TRowMaxExpand.cpp rename to archive/outdated/tests/other/tileop_api/src/TRowMaxExpand.cpp index d4814ca..669ff1e 100644 --- a/test/tileop_api/src/TRowMaxExpand.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TRowMaxExpand.cpp @@ -5,7 +5,49 @@ #include "../linxStartEnd.hpp" #endif -template void test_rm(T *dst, T *src) { +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -18,12 +60,12 @@ template void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWMAXEXPAND(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } -template void test_cm(T *dst, T *src) { +template void test_cm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -36,11 +78,12 @@ template void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWMAXEXPAND(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } +#ifndef __linx template void test_Nz(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -54,11 +97,31 @@ template void test_Nz(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWMAXEXPAND(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } +#endif + int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else const size_t row = 32; const size_t col = 32; @@ -156,4 +219,5 @@ int main() { free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TRowSum.cpp b/archive/outdated/tests/other/tileop_api/src/TRowSum.cpp similarity index 71% rename from test/tileop_api/src/TRowSum.cpp rename to archive/outdated/tests/other/tileop_api/src/TRowSum.cpp index f881873..f1a6a59 100644 --- a/test/tileop_api/src/TRowSum.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TRowSum.cpp @@ -5,7 +5,40 @@ #include "../linxStartEnd.hpp" #endif -template void test_rm(T *dst, T *src) { +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -18,12 +51,12 @@ template void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWSUM(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } -template void test_cm(T *dst, T *src) { +template void test_cm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -36,12 +69,30 @@ template void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWSUM(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else const size_t row = 32; const size_t col = 32; @@ -139,4 +190,5 @@ int main() { free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TRowSumExpand.cpp b/archive/outdated/tests/other/tileop_api/src/TRowSumExpand.cpp similarity index 66% rename from test/tileop_api/src/TRowSumExpand.cpp rename to archive/outdated/tests/other/tileop_api/src/TRowSumExpand.cpp index 99a1f79..a5203eb 100644 --- a/test/tileop_api/src/TRowSumExpand.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TRowSumExpand.cpp @@ -5,7 +5,49 @@ #include "../linxStartEnd.hpp" #endif -template void test_rm(T *dst, T *src) { +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -18,12 +60,12 @@ template void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWSUMEXPAND(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } -template void test_cm(T *dst, T *src) { +template void test_cm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -36,12 +78,30 @@ template void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TROWSUMEXPAND(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else const size_t row = 32; const size_t col = 32; @@ -106,7 +166,7 @@ int main() { PMC_START(); #endif - test_rm(dst_int8, src_int8); + test_rm(dst_int8, src_int8); test_rm(dst_int16, src_int16); test_rm(dst_int32, src_int32); test_rm(dst_int64, src_int64); @@ -137,4 +197,5 @@ int main() { free(dst_f32); free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/other/tileop_api/src/TScatterElementCol.cpp b/archive/outdated/tests/other/tileop_api/src/TScatterElementCol.cpp similarity index 96% rename from test/other/tileop_api/src/TScatterElementCol.cpp rename to archive/outdated/tests/other/tileop_api/src/TScatterElementCol.cpp index a14ba89..71ae198 100644 --- a/test/other/tileop_api/src/TScatterElementCol.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TScatterElementCol.cpp @@ -21,10 +21,10 @@ void test(float *dst, uint16_t *srci, float s) { tile_shape_srci d0; tile_shape_dst d1; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TSCATTERELEMENTCOL(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { diff --git a/test/other/tileop_api/src/TScatterElementRow.cpp b/archive/outdated/tests/other/tileop_api/src/TScatterElementRow.cpp similarity index 96% rename from test/other/tileop_api/src/TScatterElementRow.cpp rename to archive/outdated/tests/other/tileop_api/src/TScatterElementRow.cpp index dc3cf50..c74b0cc 100644 --- a/test/other/tileop_api/src/TScatterElementRow.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TScatterElementRow.cpp @@ -21,10 +21,10 @@ void test(float *dst, uint16_t *srci, float s) { tile_shape_srci d0; tile_shape_dst d1; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TSCATTERELEMENTROW(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { diff --git a/test/other/tileop_api/src/TSelect.cpp b/archive/outdated/tests/other/tileop_api/src/TSelect.cpp similarity index 95% rename from test/other/tileop_api/src/TSelect.cpp rename to archive/outdated/tests/other/tileop_api/src/TSelect.cpp index 999cfc5..b8b68e4 100644 --- a/test/other/tileop_api/src/TSelect.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TSelect.cpp @@ -23,11 +23,11 @@ void test(float *dst, float *src0, float *src1, uint16_t *src2) { tile_shape_uint16 d2; tile_shape_fp32 d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, s2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, s2); TSELECT(d3, d0, d1, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } int main() { diff --git a/archive/outdated/tests/other/tileop_api/src/TSqrt.cpp b/archive/outdated/tests/other/tileop_api/src/TSqrt.cpp new file mode 100644 index 0000000..9d4923f --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/TSqrt.cpp @@ -0,0 +1,204 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto s0 = gSIter(i, j); + auto res = gDIter(i, j); + + tile_shape t0, t1; + TLOAD(t0, s0); + TSQRT(t1, t0); + TSTORE(res, t1); + } + } +} + +template +void test_cm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_col; ++i) { + for (int j = 0; j < block_row; ++j) { + auto s0 = gSIter(j, i); + auto res = gDIter(j, i); + + tile_shape t0, t1; + TLOAD(t0, s0); + TSQRT(t1, t0); + TSTORE(res, t1); + } + } +} + +int main() { + +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + const size_t gm_row = 32; + const size_t gm_col = 32; + const size_t tile_row = 16; + const size_t tile_col = 16; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)gm_size; + (void)tile_size; + +#ifdef __linx + using row_tile = Tile; + using col_tile = + Tile; + row_tile src_rm, dst_rm; + col_tile src_cm, dst_cm; + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t expected = static_cast(i * tile_col + j); + size_t row_index = index(i, j); + size_t col_index = index(i, j); + src_rm.data()[row_index] = expected * expected; + src_cm.data()[col_index] = expected * expected; + dst_rm.data()[row_index] = 0; + dst_cm.data()[col_index] = 0; + } + } + + TSQRT(dst_rm, src_rm); + TSQRT(dst_cm, src_cm); + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t expected = static_cast(i * tile_col + j); + if (dst_rm.data()[index(i, j)] != expected) { + return 1; + } + if (dst_cm.data()[index(i, j)] != expected) { + return 2; + } + } + } + + return 0; +#else + // __half + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src_f16); + init_rows_fp(src_f16, gm_row, gm_col); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, gm_size); + + __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_rows_fp(src_f32, gm_row, gm_col); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_f16, src_f16); + test_cm(dst_f16, src_f16); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_f16, gm_size); + OutArray(dst_f32, gm_size); + + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TCopyOut.cpp b/archive/outdated/tests/other/tileop_api/src/TStore.cpp similarity index 82% rename from test/tileop_api/src/TCopyOut.cpp rename to archive/outdated/tests/other/tileop_api/src/TStore.cpp index a791431..02a2cf9 100644 --- a/test/tileop_api/src/TCopyOut.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TStore.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0) { @@ -12,7 +44,7 @@ void test_RowMajor(T *dst, T *src0) { using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>; using gm_shape = GlobalTensor; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -22,10 +54,10 @@ void test_RowMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } @@ -57,12 +89,12 @@ void test_RowMajor_Dynamic(T *dst, T *src0) { gm_shape res(dst + offset, gm_valid_row, gm_valid_col); tile_shape d0(active_row, active_col); - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } - + template void test_ColMajor(T *dst, T *src0) { @@ -70,7 +102,7 @@ void test_ColMajor(T *dst, T *src0) { using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>; using gm_shape = GlobalTensor; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -80,10 +112,10 @@ void test_ColMajor(T *dst, T *src0) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } @@ -101,7 +133,7 @@ void test_Nz_Dynamic(T *dst, T *src0) { uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; - + for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { uint16_t remainder_row = gm_row - i * tile_valid_row; @@ -116,21 +148,39 @@ void test_Nz_Dynamic(T *dst, T *src0) { tile_shape d0(active_row, active_col); tile_shape d1(active_row, active_col); - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); + TLOAD(d0, s0); + TSTORE(res, d0); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_RowMajor(dst, src); + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -142,7 +192,7 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); @@ -150,31 +200,31 @@ int main() { int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -182,7 +232,7 @@ int main() { int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst1_i32); init_dst(dst1_i32, gm_size); - + int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); @@ -190,7 +240,7 @@ int main() { int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_nz_i32); init_dst(dst_nz_i32, gm_size); - + int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src_nz_i32); init_src_int(src_nz_i32, gm_size); @@ -200,15 +250,15 @@ int main() { #endif test_RowMajor(dst, src0); - + test_RowMajor(dst_f16, src0_f16); - + test_RowMajor(dst_i8, src0_i8); test_RowMajor(dst_i16, src0_i16); - + test_ColMajor(dst_i32, src0_i32); - + test_ColMajor(dst_i64, src0_i64); test_RowMajor_Dynamic(dst1_i32, src1_i32); @@ -228,22 +278,22 @@ int main() { OutArray(dst_i64, gm_size); OutArray(dst1_i32, gm_size); OutArray(dst_nz_i32, gm_size); - + free(dst); free(src0); - + free(dst_f16); free(src0_f16); - + free(dst_i8); free(src0_i8); - + free(dst_i16); free(src0_i16); - + free(dst_i32); free(src0_i32); - + free(dst_i64); free(src0_i64); @@ -251,4 +301,5 @@ int main() { free(src1_i32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TSub.cpp b/archive/outdated/tests/other/tileop_api/src/TSub.cpp similarity index 77% rename from test/tileop_api/src/TSub.cpp rename to archive/outdated/tests/other/tileop_api/src/TSub.cpp index 0552ecb..8f8480b 100644 --- a/test/tileop_api/src/TSub.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TSub.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + // C = A - B template @@ -26,10 +58,10 @@ void test_rm(T *dst, T *src0, T *src1) { auto res = gCIter(i, j); tile_shape t0, t1, t2; - TCOPYIN(t0, s0); - TCOPYIN(t1, s1); + TLOAD(t0, s0); + TLOAD(t1, s1); TSUB(t2, t1, t0); - TCOPYOUT(res, t2); + TSTORE(res, t2); } } } @@ -54,23 +86,44 @@ void test_cm(T *dst, T *src0, T *src1) { auto res = gCIter(j, i); tile_shape t0, t1, t2; - TCOPYIN(t0, s0); - TCOPYIN(t1, s1); + TLOAD(t0, s0); + TLOAD(t1, s1); TSUB(t2, t1, t0); - TCOPYOUT(res, t2); + TSTORE(res, t2); } } } int main() { - const size_t gm_row = 32; - const size_t gm_col = 32; - const size_t tile_row = 32; - const size_t tile_col = 32; +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + constexpr size_t gm_row = 32; + constexpr size_t gm_col = 32; + constexpr size_t tile_row = 32; + constexpr size_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst_int64[gm_size]; + static int64_t src0_int64[gm_size]; + static int64_t src1_int64[gm_size]; + init_dst(dst_int64, gm_size); + init_src_int(src0_int64, gm_size); + init_src_int(src1_int64, gm_size); + test_rm(dst_int64, src0_int64, + src1_int64); + + return 0; +#else // int8_t int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_int8); @@ -191,4 +244,5 @@ int main() { free(src1_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TSubs.cpp b/archive/outdated/tests/other/tileop_api/src/TSubs.cpp similarity index 73% rename from test/tileop_api/src/TSubs.cpp rename to archive/outdated/tests/other/tileop_api/src/TSubs.cpp index 06d0288..fd3e234 100644 --- a/test/tileop_api/src/TSubs.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TSubs.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src, T s) { @@ -23,9 +55,9 @@ void test_rm(T *dst, T *src, T s) { auto res = gDIter(i, j); tile_shape t0, t1; - TCOPYIN(t0, s0); + TLOAD(t0, s0); TSUBS(t1, t0, s); - TCOPYOUT(res, t1); + TSTORE(res, t1); } } } @@ -48,22 +80,41 @@ void test_cm(T *dst, T *src, T s) { auto res = gDIter(j, i); tile_shape t0, t1; - TCOPYIN(t0, s0); + TLOAD(t0, s0); TSUBS(t1, t0, s); - TCOPYOUT(res, t1); + TSTORE(res, t1); } } } int main() { - const size_t gm_row = 32; - const size_t gm_col = 32; - const size_t tile_row = 32; - const size_t tile_col = 32; +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + constexpr size_t gm_row = 32; + constexpr size_t gm_col = 32; + constexpr size_t tile_row = 32; + constexpr size_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; +#ifdef __linx + static int64_t dst_int64[gm_size]; + static int64_t src_int64[gm_size]; + init_dst(dst_int64, gm_size); + init_src_int(src_int64, gm_size); + + test_rm(dst_int64, src_int64, + s_i64); + + return 0; +#else // int8_t int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_int8); @@ -158,4 +209,5 @@ int main() { free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TTrans.cpp b/archive/outdated/tests/other/tileop_api/src/TTrans.cpp similarity index 74% rename from test/tileop_api/src/TTrans.cpp rename to archive/outdated/tests/other/tileop_api/src/TTrans.cpp index 324d88a..0d441a3 100644 --- a/test/tileop_api/src/TTrans.cpp +++ b/archive/outdated/tests/other/tileop_api/src/TTrans.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -17,9 +49,9 @@ template void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TTRANS(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template void test_cm(T *dst, T *src) { @@ -34,18 +66,33 @@ template void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TTRANS(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { - const size_t row = 32; - const size_t col = 32; +#ifdef __linx + constexpr size_t row = 4; + constexpr size_t col = 4; +#else + constexpr size_t row = 32; + constexpr size_t col = 32; +#endif - size_t size_in = row * col; - size_t size_out = col * row; + constexpr size_t size_in = row * col; + constexpr size_t size_out = col * row; +#ifdef __linx + static int64_t dst[size_out]; + static int64_t src[size_in]; + init_dst(dst, size_out); + init_src_int(src, size_in); + + test_rm(dst, src); + + return 0; +#else // int8 int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t)); check_mem_alloc(dst_int8); @@ -137,4 +184,5 @@ int main() { free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/archive/outdated/tests/other/tileop_api/src/test_MatMacc.cpp b/archive/outdated/tests/other/tileop_api/src/test_MatMacc.cpp new file mode 100644 index 0000000..5316b61 --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/test_MatMacc.cpp @@ -0,0 +1,194 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} + +template +void test_linx_row_major(T *dst, T *src0, T *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = Tile; + using tile_shape_B = Tile; + using tile_shape_C = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + MATMACC(d2, d0, d1); + TSTORE(res, d2); +} +#endif + +template +void test(float *dst, float *src0, float *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = TileLeft; + using tile_shape_B = TileRight; + using tile_shape_C = TileAcc; + using tile_shape_O = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + tile_shape_O d3; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + MATMACC(d2, d0, d1); + TCVT(d3, d2); + TSTORE(res, d3); +} + +int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + for (size_t row = 0; row < M; ++row) { + for (size_t k = 0; k < K; ++k) { + src0_i64[row * K + k] = static_cast((row + 1) * (k + 2)); + } + } + for (size_t k = 0; k < K; ++k) { + for (size_t col = 0; col < N; ++col) { + src1_i64[k * N + col] = static_cast((k + 1) + (col + 1)); + } + } + for (size_t i = 0; i < size_C; ++i) { + dst_i64[i] = 0; + } + + test_linx_row_major(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + expected *= 2; + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 16; + const uint16_t K = 8; + const uint16_t N = 32; + + size_t size_A = M * K; + size_t size_B = K * N; + size_t size_C = M * N; + + float *dst = (float *)malloc(size_C * sizeof(float)); + check_mem_alloc(dst); + init_src_fp(dst, size_C); + + float *src0 = (float *)malloc(size_A * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, size_A); + float *src1 = (float *)malloc(size_B * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, size_B); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_C); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/archive/outdated/tests/other/tileop_api/src/test_MatMul.cpp b/archive/outdated/tests/other/tileop_api/src/test_MatMul.cpp new file mode 100644 index 0000000..bf01255 --- /dev/null +++ b/archive/outdated/tests/other/tileop_api/src/test_MatMul.cpp @@ -0,0 +1,191 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} + +template +void test_linx_row_major(T *dst, T *src0, T *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = Tile; + using tile_shape_B = Tile; + using tile_shape_C = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + TSTORE(res, d2); +} +#endif + +template +void test(float *dst, float *src0, float *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = TileLeft; + using tile_shape_B = TileRight; + using tile_shape_C = TileAcc; + using tile_shape_O = TileLeft; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + tile_shape_O d3; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + TCVT(d3, d2); + TSTORE(res, d3); +} + +int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + for (size_t row = 0; row < M; ++row) { + for (size_t k = 0; k < K; ++k) { + src0_i64[row * K + k] = static_cast((row + 1) * (k + 2)); + } + } + for (size_t k = 0; k < K; ++k) { + for (size_t col = 0; col < N; ++col) { + src1_i64[k * N + col] = static_cast((k + 1) + (col + 1)); + } + } + for (size_t i = 0; i < size_C; ++i) { + dst_i64[i] = 0; + } + + test_linx_row_major(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 16; + const uint16_t K = 8; + const uint16_t N = 32; + + size_t size_A = M * K; + size_t size_B = K * N; + size_t size_C = M * N; + + float *dst = (float *)malloc(size_C * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size_C); + + float *src0 = (float *)malloc(size_A * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, size_A); + float *src1 = (float *)malloc(size_B * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, size_B); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_C); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/benchmarks/INDEX.md b/benchmarks/INDEX.md new file mode 100644 index 0000000..39ff3f9 --- /dev/null +++ b/benchmarks/INDEX.md @@ -0,0 +1,232 @@ +# Benchmark Index + +Generated from the active `benchmarks/` tree. The suite table records batch build surfaces; the source table records benchmark entrypoints with source path, build command, category, active/archive status, and required data objects. + +## Suite Batch Commands + +| Category | Source path | Build command | Commands | Required data objects | Status | +| --- | --- | --- | --- | --- | --- | +| `api/tileop` | [`benchmarks/api/tileop/compile.all`](../benchmarks/api/tileop/compile.all) | `cd benchmarks/api/tileop && bash compile.all` | 39 | none | active | +| `kernels/composite` | [`benchmarks/kernels/composite/compile_flash_attention.all`](../benchmarks/kernels/composite/compile_flash_attention.all) | `cd benchmarks/kernels/composite && bash compile_flash_attention.all` | 109 | none | active | +| `kernels/composite` | [`benchmarks/kernels/composite/compile_gemm.all`](../benchmarks/kernels/composite/compile_gemm.all) | `cd benchmarks/kernels/composite && bash compile_gemm.all` | 12 | none | active | +| `kernels/composite` | [`benchmarks/kernels/composite/compile_linear.all`](../benchmarks/kernels/composite/compile_linear.all) | `cd benchmarks/kernels/composite && bash compile_linear.all` | 6 | none | active | +| `kernels/composite` | [`benchmarks/kernels/composite/compile_matmul.all`](../benchmarks/kernels/composite/compile_matmul.all) | `cd benchmarks/kernels/composite && bash compile_matmul.all` | 92 | none | active | +| `kernels/composite` | [`benchmarks/kernels/composite/compile_norm.all`](../benchmarks/kernels/composite/compile_norm.all) | `cd benchmarks/kernels/composite && bash compile_norm.all` | 18 | none | active | +| `kernels/composite` | [`benchmarks/kernels/composite/compile_softmax.all`](../benchmarks/kernels/composite/compile_softmax.all) | `cd benchmarks/kernels/composite && bash compile_softmax.all` | 12 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic_reuse.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic_reuseA.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic_reuseB.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_reuseA.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_reuseAB.all` | 96 | none | active | +| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_reuseB.all` | 96 | none | active | +| `kernels/control` | [`benchmarks/kernels/control/compile.all`](../benchmarks/kernels/control/compile.all) | `cd benchmarks/kernels/control && bash compile.all` | 10 | `hashtable_lookup_simd/data_obj`; `hkv/data_obj` for data-backed cases | active | +| `kernels/element_wise/gelu` | [`benchmarks/kernels/element_wise/gelu/compile.all`](../benchmarks/kernels/element_wise/gelu/compile.all) | `cd benchmarks/kernels/element_wise/gelu && bash compile.all` | 1 | none | active | +| `kernels/fusion` | [`benchmarks/kernels/fusion/compile.all`](../benchmarks/kernels/fusion/compile.all) | `cd benchmarks/kernels/fusion && bash compile.all` | 14 | none | active | +| `kernels/gemm/matmul` | [`benchmarks/kernels/gemm/matmul/compile.all`](../benchmarks/kernels/gemm/matmul/compile.all) | `cd benchmarks/kernels/gemm/matmul && bash compile.all` | 115 | none | active | +| `kernels/memory/broadcast` | [`benchmarks/kernels/memory/broadcast/compile.all`](../benchmarks/kernels/memory/broadcast/compile.all) | `cd benchmarks/kernels/memory/broadcast && bash compile.all` | 5 | none | active | +| `kernels/memory/broadcast_vec` | [`benchmarks/kernels/memory/broadcast_vec/compile.all`](../benchmarks/kernels/memory/broadcast_vec/compile.all) | `cd benchmarks/kernels/memory/broadcast_vec && bash compile.all` | 3 | none | active | +| `kernels/memory/concat_gather` | [`benchmarks/kernels/memory/concat_gather/compile.all`](../benchmarks/kernels/memory/concat_gather/compile.all) | `cd benchmarks/kernels/memory/concat_gather && bash compile.all` | 1 | none | active | +| `kernels/memory/concat_scatter` | [`benchmarks/kernels/memory/concat_scatter/compile.all`](../benchmarks/kernels/memory/concat_scatter/compile.all) | `cd benchmarks/kernels/memory/concat_scatter && bash compile.all` | 1 | none | active | +| `kernels/memory/gather` | [`benchmarks/kernels/memory/gather/compile.all`](../benchmarks/kernels/memory/gather/compile.all) | `cd benchmarks/kernels/memory/gather && bash compile.all` | 4 | none | active | +| `kernels/memory/transpose` | [`benchmarks/kernels/memory/transpose/compile.all`](../benchmarks/kernels/memory/transpose/compile.all) | `cd benchmarks/kernels/memory/transpose && bash compile.all` | 1 | none | active | +| `kernels/reduction/reducemax_col` | [`benchmarks/kernels/reduction/reducemax_col/compile.all`](../benchmarks/kernels/reduction/reducemax_col/compile.all) | `cd benchmarks/kernels/reduction/reducemax_col && bash compile.all` | 1 | none | active | +| `kernels/reduction/reducemax_row` | [`benchmarks/kernels/reduction/reducemax_row/compile.all`](../benchmarks/kernels/reduction/reducemax_row/compile.all) | `cd benchmarks/kernels/reduction/reducemax_row && bash compile.all` | 1 | none | active | +| `kernels/reduction/reducesum_col` | [`benchmarks/kernels/reduction/reducesum_col/compile.all`](../benchmarks/kernels/reduction/reducesum_col/compile.all) | `cd benchmarks/kernels/reduction/reducesum_col && bash compile.all` | 1 | none | active | +| `kernels/reduction/reducesum_row` | [`benchmarks/kernels/reduction/reducesum_row/compile.all`](../benchmarks/kernels/reduction/reducesum_row/compile.all) | `cd benchmarks/kernels/reduction/reducesum_row && bash compile.all` | 1 | none | active | +| `kernels/sort` | [`benchmarks/kernels/sort/compile.all`](../benchmarks/kernels/sort/compile.all) | `cd benchmarks/kernels/sort && bash compile.all` | 1 | `topk/data_obj` | active | +| `microbench/cube` | [`benchmarks/microbench/cube/compile.all`](../benchmarks/microbench/cube/compile.all) | `cd benchmarks/microbench/cube && bash compile.all` | 72 | none | active | +| `microbench/lmbench` | [`benchmarks/microbench/lmbench/compile_mem.all`](../benchmarks/microbench/lmbench/compile_mem.all) | `cd benchmarks/microbench/lmbench && bash compile_mem.all` | 78 | none | active | +| `microbench/vec` | [`benchmarks/microbench/vec/compile_lat_bw.all`](../benchmarks/microbench/vec/compile_lat_bw.all) | `cd benchmarks/microbench/vec && bash compile_lat_bw.all` | 120 | none | active | +| `models/deepseekv3` | [`benchmarks/models/deepseekv3/compile.all`](../benchmarks/models/deepseekv3/compile.all) | `cd benchmarks/models/deepseekv3 && bash compile.all` | 47 | none | active | +| `models/deepseekv3` | [`benchmarks/models/deepseekv3/compile_cpu.all`](../benchmarks/models/deepseekv3/compile_cpu.all) | `cd benchmarks/models/deepseekv3 && bash compile_cpu.all` | 47 | none | active | +| `npu/cube` | [`benchmarks/npu/cube/compile.all`](../benchmarks/npu/cube/compile.all) | `cd benchmarks/npu/cube && bash compile.all` | 10 | none | active | +| `npu/fusion` | [`benchmarks/npu/fusion/compile.all`](../benchmarks/npu/fusion/compile.all) | `cd benchmarks/npu/fusion && bash compile.all` | 71 | none | active | +| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_2d_unroll.all`](../benchmarks/npu/fusion/compile_fusion_2d_unroll.all) | `cd benchmarks/npu/fusion && bash compile_fusion_2d_unroll.all` | 672 | none | active | +| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_dcore.all`](../benchmarks/npu/fusion/compile_fusion_dcore.all) | `cd benchmarks/npu/fusion && bash compile_fusion_dcore.all` | 96 | none | active | +| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_dynamic.all`](../benchmarks/npu/fusion/compile_fusion_dynamic.all) | `cd benchmarks/npu/fusion && bash compile_fusion_dynamic.all` | 15 | none | active | +| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_fp4.all`](../benchmarks/npu/fusion/compile_fusion_fp4.all) | `cd benchmarks/npu/fusion && bash compile_fusion_fp4.all` | 62 | none | active | +| `npu/nddma` | [`benchmarks/npu/nddma/compile_transpose.all`](../benchmarks/npu/nddma/compile_transpose.all) | `cd benchmarks/npu/nddma && bash compile_transpose.all` | 1 | none | active | +| `npu/vec_simd` | [`benchmarks/npu/vec_simd/compile.all`](../benchmarks/npu/vec_simd/compile.all) | `cd benchmarks/npu/vec_simd && bash compile.all` | 16 | none | active | +| `npu/vec_simt` | [`benchmarks/npu/vec_simt/compile.all`](../benchmarks/npu/vec_simt/compile.all) | `cd benchmarks/npu/vec_simt && bash compile.all` | 3 | `hashfind/data_obj` when `TESTCASE=hashfind` | active | + +## Benchmark Source Entry Points + +| Category | Benchmark name | Source path | Build command | Required data objects | Status | +| --- | --- | --- | --- | --- | --- | +| `api/tileop` | `Cus_Template_ASM` | [`benchmarks/api/tileop/src/Cus_Template_ASM.cpp`](../benchmarks/api/tileop/src/Cus_Template_ASM.cpp) | `cd benchmarks/api/tileop && make TESTCASE=Cus_Template_ASM PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `MatMacc` | [`benchmarks/api/tileop/src/MatMacc.cpp`](../benchmarks/api/tileop/src/MatMacc.cpp) | `cd benchmarks/api/tileop && make TESTCASE=MatMacc PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `MatMul` | [`benchmarks/api/tileop/src/MatMul.cpp`](../benchmarks/api/tileop/src/MatMul.cpp) | `cd benchmarks/api/tileop && make TESTCASE=MatMul PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `MatMul_e4m3` | [`benchmarks/api/tileop/src/MatMul_e4m3.cpp`](../benchmarks/api/tileop/src/MatMul_e4m3.cpp) | `cd benchmarks/api/tileop && make TESTCASE=MatMul_e4m3 PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `Print` | [`benchmarks/api/tileop/src/Print.cpp`](../benchmarks/api/tileop/src/Print.cpp) | `cd benchmarks/api/tileop && make TESTCASE=Print PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TAbs` | [`benchmarks/api/tileop/src/TAbs.cpp`](../benchmarks/api/tileop/src/TAbs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAbs PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TAdd` | [`benchmarks/api/tileop/src/TAdd.cpp`](../benchmarks/api/tileop/src/TAdd.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAdd PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TAdd_mask` | [`benchmarks/api/tileop/src/TAdd_mask.cpp`](../benchmarks/api/tileop/src/TAdd_mask.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAdd_mask PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TAdds` | [`benchmarks/api/tileop/src/TAdds.cpp`](../benchmarks/api/tileop/src/TAdds.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAdds PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TAnd` | [`benchmarks/api/tileop/src/TAnd.cpp`](../benchmarks/api/tileop/src/TAnd.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAnd PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TAssemble` | [`benchmarks/api/tileop/src/TAssemble.cpp`](../benchmarks/api/tileop/src/TAssemble.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAssemble PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TCI` | [`benchmarks/api/tileop/src/TCI.cpp`](../benchmarks/api/tileop/src/TCI.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCI PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TCast` | [`benchmarks/api/tileop/src/TCast.cpp`](../benchmarks/api/tileop/src/TCast.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCast PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TCmp` | [`benchmarks/api/tileop/src/TCmp.cpp`](../benchmarks/api/tileop/src/TCmp.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCmp PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TCopy` | [`benchmarks/api/tileop/src/TCopy.cpp`](../benchmarks/api/tileop/src/TCopy.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCopy PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TLoad` | [`benchmarks/api/tileop/src/TLoad.cpp`](../benchmarks/api/tileop/src/TLoad.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TLoad PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TStore` | [`benchmarks/api/tileop/src/TStore.cpp`](../benchmarks/api/tileop/src/TStore.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TStore PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TCvt` | [`benchmarks/api/tileop/src/TCvt.cpp`](../benchmarks/api/tileop/src/TCvt.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCvt PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TDiv` | [`benchmarks/api/tileop/src/TDiv.cpp`](../benchmarks/api/tileop/src/TDiv.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TDiv PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TDivs` | [`benchmarks/api/tileop/src/TDivs.cpp`](../benchmarks/api/tileop/src/TDivs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TDivs PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TExp` | [`benchmarks/api/tileop/src/TExp.cpp`](../benchmarks/api/tileop/src/TExp.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExp PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TExpandCol` | [`benchmarks/api/tileop/src/TExpandCol.cpp`](../benchmarks/api/tileop/src/TExpandCol.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExpandCol PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TExpandRow` | [`benchmarks/api/tileop/src/TExpandRow.cpp`](../benchmarks/api/tileop/src/TExpandRow.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExpandRow PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TExpandScalar` | [`benchmarks/api/tileop/src/TExpandScalar.cpp`](../benchmarks/api/tileop/src/TExpandScalar.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExpandScalar PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TExtract` | [`benchmarks/api/tileop/src/TExtract.cpp`](../benchmarks/api/tileop/src/TExtract.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExtract PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TFillPad` | [`benchmarks/api/tileop/src/TFillPad.cpp`](../benchmarks/api/tileop/src/TFillPad.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TFillPad PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TGather` | [`benchmarks/api/tileop/src/TGather.cpp`](../benchmarks/api/tileop/src/TGather.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TGather PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TMax` | [`benchmarks/api/tileop/src/TMax.cpp`](../benchmarks/api/tileop/src/TMax.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMax PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TMaxs` | [`benchmarks/api/tileop/src/TMaxs.cpp`](../benchmarks/api/tileop/src/TMaxs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMaxs PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TMin` | [`benchmarks/api/tileop/src/TMin.cpp`](../benchmarks/api/tileop/src/TMin.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMin PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TMins` | [`benchmarks/api/tileop/src/TMins.cpp`](../benchmarks/api/tileop/src/TMins.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMins PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TMul` | [`benchmarks/api/tileop/src/TMul.cpp`](../benchmarks/api/tileop/src/TMul.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMul PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TMuls` | [`benchmarks/api/tileop/src/TMuls.cpp`](../benchmarks/api/tileop/src/TMuls.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMuls PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TOr` | [`benchmarks/api/tileop/src/TOr.cpp`](../benchmarks/api/tileop/src/TOr.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TOr PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TPad` | [`benchmarks/api/tileop/src/TPad.cpp`](../benchmarks/api/tileop/src/TPad.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TPad PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRSqrt` | [`benchmarks/api/tileop/src/TRSqrt.cpp`](../benchmarks/api/tileop/src/TRSqrt.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRSqrt PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRecip` | [`benchmarks/api/tileop/src/TRecip.cpp`](../benchmarks/api/tileop/src/TRecip.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRecip PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRem` | [`benchmarks/api/tileop/src/TRem.cpp`](../benchmarks/api/tileop/src/TRem.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRem PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TReshape` | [`benchmarks/api/tileop/src/TReshape.cpp`](../benchmarks/api/tileop/src/TReshape.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TReshape PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRowMax` | [`benchmarks/api/tileop/src/TRowMax.cpp`](../benchmarks/api/tileop/src/TRowMax.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowMax PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRowMaxExpand` | [`benchmarks/api/tileop/src/TRowMaxExpand.cpp`](../benchmarks/api/tileop/src/TRowMaxExpand.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowMaxExpand PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRowSum` | [`benchmarks/api/tileop/src/TRowSum.cpp`](../benchmarks/api/tileop/src/TRowSum.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowSum PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TRowSumExpand` | [`benchmarks/api/tileop/src/TRowSumExpand.cpp`](../benchmarks/api/tileop/src/TRowSumExpand.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowSumExpand PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TScatter` | [`benchmarks/api/tileop/src/TScatter.cpp`](../benchmarks/api/tileop/src/TScatter.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TScatter PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TSelect` | [`benchmarks/api/tileop/src/TSelect.cpp`](../benchmarks/api/tileop/src/TSelect.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSelect PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TSqrt` | [`benchmarks/api/tileop/src/TSqrt.cpp`](../benchmarks/api/tileop/src/TSqrt.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSqrt PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TSub` | [`benchmarks/api/tileop/src/TSub.cpp`](../benchmarks/api/tileop/src/TSub.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSub PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TSubs` | [`benchmarks/api/tileop/src/TSubs.cpp`](../benchmarks/api/tileop/src/TSubs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSubs PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `TTrans` | [`benchmarks/api/tileop/src/TTrans.cpp`](../benchmarks/api/tileop/src/TTrans.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TTrans PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `test_MatMacc` | [`benchmarks/api/tileop/src/test_MatMacc.cpp`](../benchmarks/api/tileop/src/test_MatMacc.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMacc PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `test_MatMmxac` | [`benchmarks/api/tileop/src/test_MatMmxac.cpp`](../benchmarks/api/tileop/src/test_MatMmxac.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMmxac PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `test_MatMul` | [`benchmarks/api/tileop/src/test_MatMul.cpp`](../benchmarks/api/tileop/src/test_MatMul.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMul PLAT=linx COMPILER_DIR=` | none | active | +| `api/tileop` | `test_MatMulmx` | [`benchmarks/api/tileop/src/test_MatMulmx.cpp`](../benchmarks/api/tileop/src/test_MatMulmx.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMulmx PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `flash_attention` | [`benchmarks/kernels/composite/src/flash_attention.cpp`](../benchmarks/kernels/composite/src/flash_attention.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=flash_attention PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `flash_attention_mask` | [`benchmarks/kernels/composite/src/flash_attention_mask.cpp`](../benchmarks/kernels/composite/src/flash_attention_mask.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=flash_attention_mask PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `gemm` | [`benchmarks/kernels/composite/src/gemm.cpp`](../benchmarks/kernels/composite/src/gemm.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=gemm PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `linear` | [`benchmarks/kernels/composite/src/linear.cpp`](../benchmarks/kernels/composite/src/linear.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=linear PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `matmul` | [`benchmarks/kernels/composite/src/matmul.cpp`](../benchmarks/kernels/composite/src/matmul.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=matmul PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `normalization` | [`benchmarks/kernels/composite/src/normalization.cpp`](../benchmarks/kernels/composite/src/normalization.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=normalization PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `onlinesoftmax` | [`benchmarks/kernels/composite/src/onlinesoftmax.cpp`](../benchmarks/kernels/composite/src/onlinesoftmax.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=onlinesoftmax PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/composite` | `softmax` | [`benchmarks/kernels/composite/src/softmax.cpp`](../benchmarks/kernels/composite/src/softmax.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=softmax PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/control` | `hashfind` | [`benchmarks/kernels/control/hashfind/hashfind.cpp`](../benchmarks/kernels/control/hashfind/hashfind.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashfind PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/control` | `hashtable_lookup_simd` | [`benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp`](../benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashtable_lookup_simd PLAT=linx COMPILER_DIR=` | `hashtable_lookup_simd/data_obj`: `inserted_slot.o`, `lookup_keys.o`, `lookup_values.o` | active | +| `kernels/control` | `hashtable_lookup_simt` | [`benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp`](../benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashtable_lookup_simt PLAT=linx COMPILER_DIR=` | `hashtable_lookup_simd/data_obj`: `inserted_slot.o`, `lookup_keys.o`, `lookup_values.o` | active | +| `kernels/control` | `hashtable_lookup_simt_v2` | [`benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp`](../benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashtable_lookup_simt_v2 PLAT=linx COMPILER_DIR=` | `hashtable_lookup_simd/data_obj`: `inserted_slot.o`, `lookup_keys.o`, `lookup_values.o` | active | +| `kernels/control` | `hkv` | [`benchmarks/kernels/control/hkv/hkv.cpp`](../benchmarks/kernels/control/hkv/hkv.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hkv PLAT=linx COMPILER_DIR=` | `hkv/data_obj`: `buckets.bin.o`, `buckets_size.bin.o`, `lookup_keys.bin.o`, `lookedup_values.bin.o`, `key_score_digest.bin.o` | active | +| `kernels/element_wise/gelu` | `gelu` | [`benchmarks/kernels/element_wise/gelu/src/gelu.cpp`](../benchmarks/kernels/element_wise/gelu/src/gelu.cpp) | `cd benchmarks/kernels/element_wise/gelu && make TESTCASE=gelu PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/fusion` | `fa_hif4` | [`benchmarks/kernels/fusion/src/fa_hif4.cpp`](../benchmarks/kernels/fusion/src/fa_hif4.cpp) | `cd benchmarks/kernels/fusion && make TESTCASE=fa_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/gemm/matmul` | `A16W4` | [`benchmarks/kernels/gemm/matmul/src/A16W4.cpp`](../benchmarks/kernels/gemm/matmul/src/A16W4.cpp) | `cd benchmarks/kernels/gemm/matmul && make TESTCASE=matmul TYPE=A16W4 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/gemm/matmul` | `HiF4_HiF4` | [`benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp`](../benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp) | `cd benchmarks/kernels/gemm/matmul && make TESTCASE=matmul TYPE=HIF4_HIF4 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast` | [`benchmarks/kernels/memory/broadcast/src/broadcast.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_019` | [`benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_019 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_039` | [`benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_039 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_07` | [`benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_07 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_Hunyuan` | [`benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_Hunyuan PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_mscatter` | [`benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_mscatter PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_nostore` | [`benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_nostore PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_nomg` | [`benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_nomg PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast` | `broadcast_tst` | [`benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_tst PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast_vec` | `broadcast_vec_019` | [`benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp`](../benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp) | `cd benchmarks/kernels/memory/broadcast_vec && make TESTCASE=broadcast_vec_019 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast_vec` | `broadcast_vec_039` | [`benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp`](../benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp) | `cd benchmarks/kernels/memory/broadcast_vec && make TESTCASE=broadcast_vec_039 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/broadcast_vec` | `broadcast_vec_07` | [`benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp`](../benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp) | `cd benchmarks/kernels/memory/broadcast_vec && make TESTCASE=broadcast_vec_07 PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/concat_gather` | `concat_gather` | [`benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp`](../benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp) | `cd benchmarks/kernels/memory/concat_gather && make TESTCASE=concat_gather PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/concat_scatter` | `concat_scatter` | [`benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp`](../benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp) | `cd benchmarks/kernels/memory/concat_scatter && make TESTCASE=concat_scatter PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/gather` | `gather` | [`benchmarks/kernels/memory/gather/src/gather.cpp`](../benchmarks/kernels/memory/gather/src/gather.cpp) | `cd benchmarks/kernels/memory/gather && make TESTCASE=gather PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/memory/transpose` | `transpose` | [`benchmarks/kernels/memory/transpose/src/transpose.cpp`](../benchmarks/kernels/memory/transpose/src/transpose.cpp) | `cd benchmarks/kernels/memory/transpose && make TESTCASE=transpose PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/reduction/reducemax_col` | `reducemax_col` | [`benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp`](../benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp) | `cd benchmarks/kernels/reduction/reducemax_col && make TESTCASE=reducemax_col PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/reduction/reducemax_row` | `reducemax_row` | [`benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp`](../benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp) | `cd benchmarks/kernels/reduction/reducemax_row && make TESTCASE=reducemax_row PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/reduction/reducesum_col` | `reducesum_col` | [`benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp`](../benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp) | `cd benchmarks/kernels/reduction/reducesum_col && make TESTCASE=reducesum_col PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/reduction/reducesum_row` | `reducesum_row` | [`benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp`](../benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp) | `cd benchmarks/kernels/reduction/reducesum_row && make TESTCASE=reducesum_row PLAT=linx COMPILER_DIR=` | none | active | +| `kernels/sort` | `topk` | [`benchmarks/kernels/sort/topk/topk.cpp`](../benchmarks/kernels/sort/topk/topk.cpp) | `cd benchmarks/kernels/sort && make TESTCASE=topk PLAT=linx COMPILER_DIR=` | `topk/data_obj`: `input_131072.o`, `top_2048_out.o` | active | +| `microbench/cube` | `matop` | [`benchmarks/microbench/cube/src/matop.cpp`](../benchmarks/microbench/cube/src/matop.cpp) | `cd benchmarks/microbench/cube && make TESTCASE=matop PLAT=linx COMPILER_DIR=` | none | active | +| `microbench/lmbench` | `mem` | [`benchmarks/microbench/lmbench/src/mem.cpp`](../benchmarks/microbench/lmbench/src/mem.cpp) | `cd benchmarks/microbench/lmbench && make TESTCASE=mem PLAT=linx COMPILER_DIR=` | none | active | +| `microbench/vec` | `lat_bw` | [`benchmarks/microbench/vec/src/lat_bw.cpp`](../benchmarks/microbench/vec/src/lat_bw.cpp) | `cd benchmarks/microbench/vec && make TESTCASE=lat_bw PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `concat` | [`benchmarks/models/deepseekv3/src/concat.cpp`](../benchmarks/models/deepseekv3/src/concat.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=concat PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `expand` | [`benchmarks/models/deepseekv3/src/expand.cpp`](../benchmarks/models/deepseekv3/src/expand.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=expand PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `gate` | [`benchmarks/models/deepseekv3/src/gate.cpp`](../benchmarks/models/deepseekv3/src/gate.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=gate PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `mask` | [`benchmarks/models/deepseekv3/src/mask.cpp`](../benchmarks/models/deepseekv3/src/mask.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=mask PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `mla` | [`benchmarks/models/deepseekv3/src/mla.cpp`](../benchmarks/models/deepseekv3/src/mla.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=mla PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `mlp` | [`benchmarks/models/deepseekv3/src/mlp.cpp`](../benchmarks/models/deepseekv3/src/mlp.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=mlp PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `moe` | [`benchmarks/models/deepseekv3/src/moe.cpp`](../benchmarks/models/deepseekv3/src/moe.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=moe PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `permute` | [`benchmarks/models/deepseekv3/src/permute.cpp`](../benchmarks/models/deepseekv3/src/permute.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=permute PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `projection` | [`benchmarks/models/deepseekv3/src/projection.cpp`](../benchmarks/models/deepseekv3/src/projection.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=projection PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `rmsnorm` | [`benchmarks/models/deepseekv3/src/rmsnorm.cpp`](../benchmarks/models/deepseekv3/src/rmsnorm.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=rmsnorm PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `rope` | [`benchmarks/models/deepseekv3/src/rope.cpp`](../benchmarks/models/deepseekv3/src/rope.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=rope PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `split` | [`benchmarks/models/deepseekv3/src/split.cpp`](../benchmarks/models/deepseekv3/src/split.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=split PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `topk` | [`benchmarks/models/deepseekv3/src/topk.cpp`](../benchmarks/models/deepseekv3/src/topk.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=topk PLAT=linx COMPILER_DIR=` | none | active | +| `models/deepseekv3` | `transformer` | [`benchmarks/models/deepseekv3/src/transformer.cpp`](../benchmarks/models/deepseekv3/src/transformer.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=transformer PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `LLAMA3_70B_attn_matmul_decode_bs_192` | [`benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp`](../benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp) | `cd benchmarks/npu/cube && make TESTCASE=LLAMA3_70B_attn_matmul_decode_bs_192 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `LLAMA3_70B_ffn_matmul_3_decode_bs_192` | [`benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp`](../benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp) | `cd benchmarks/npu/cube && make TESTCASE=LLAMA3_70B_ffn_matmul_3_decode_bs_192 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `QuantBatchMatmulV3_292_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_292_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `QuantBatchMatmulV3_293_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_293_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `QuantBatchMatmulV3_294_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_294_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `QuantBatchMatmulV3_295_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_295_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `QuantBatchMatmulV3_296_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_296_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `QuantBatchMatmulV3_297_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_297_hif4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `dsv3_q_up_proj_mxfp8` | [`benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp`](../benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp) | `cd benchmarks/npu/cube && make TESTCASE=dsv3_q_up_proj_mxfp8 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `llama3_70b_w8_bs_1_case_4` | [`benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp`](../benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama3_70b_w8_bs_1_case_4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `llama_train_mm_2_A16W4` | [`benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp`](../benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama_train_mm_2_A16W4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `llama_train_mm_2_A16W8` | [`benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp`](../benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama_train_mm_2_A16W8 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `llama_train_mm_2_mxfp8_mxfp4` | [`benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp`](../benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama_train_mm_2_mxfp8_mxfp4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `llava1_6_6` | [`benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp`](../benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llava1_6_6 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `mat_mul_o1_align_0001` | [`benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp`](../benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp) | `cd benchmarks/npu/cube && make TESTCASE=mat_mul_o1_align_0001 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `matmul_1_bs16_fp8_GB_test` | [`benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp`](../benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp) | `cd benchmarks/npu/cube && make TESTCASE=matmul_1_bs16_fp8_GB_test PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf` | [`benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp`](../benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp) | `cd benchmarks/npu/cube && make TESTCASE=model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `moe_w1w3_bs16_fp8_GB_DN_nbuf` | [`benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp`](../benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp) | `cd benchmarks/npu/cube && make TESTCASE=moe_w1w3_bs16_fp8_GB_DN_nbuf PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022` | [`benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp`](../benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp) | `cd benchmarks/npu/cube && make TESTCASE=mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16` | [`benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp`](../benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp) | `cd benchmarks/npu/cube && make TESTCASE=mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `xinghuo_13b_tp8_matmul_01_A16W8` | [`benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp`](../benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp) | `cd benchmarks/npu/cube && make TESTCASE=xinghuo_13b_tp8_matmul_01_A16W8 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `xinghuo_13b_tp8_matmul_01_mxfp8_modified` | [`benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp`](../benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp) | `cd benchmarks/npu/cube && make TESTCASE=xinghuo_13b_tp8_matmul_01_mxfp8_modified PLAT=linx COMPILER_DIR=` | none | active | +| `npu/cube` | `xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4` | [`benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp`](../benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa1` | [`benchmarks/npu/fusion/fa1/fa1.cpp`](../benchmarks/npu/fusion/fa1/fa1.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa1 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa10` | [`benchmarks/npu/fusion/fa10/fa10.cpp`](../benchmarks/npu/fusion/fa10/fa10.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa10 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa11` | [`benchmarks/npu/fusion/fa11/fa11.cpp`](../benchmarks/npu/fusion/fa11/fa11.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa11 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa2` | [`benchmarks/npu/fusion/fa2/fa2.cpp`](../benchmarks/npu/fusion/fa2/fa2.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa2 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa3` | [`benchmarks/npu/fusion/fa3/fa3.cpp`](../benchmarks/npu/fusion/fa3/fa3.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa3 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa4` | [`benchmarks/npu/fusion/fa4/fa4.cpp`](../benchmarks/npu/fusion/fa4/fa4.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa5` | [`benchmarks/npu/fusion/fa5/fa5.cpp`](../benchmarks/npu/fusion/fa5/fa5.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa5 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa6` | [`benchmarks/npu/fusion/fa6/fa6.cpp`](../benchmarks/npu/fusion/fa6/fa6.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa6 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa7` | [`benchmarks/npu/fusion/fa7/fa7.cpp`](../benchmarks/npu/fusion/fa7/fa7.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa7 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa8` | [`benchmarks/npu/fusion/fa8/fa8.cpp`](../benchmarks/npu/fusion/fa8/fa8.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa8 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa9` | [`benchmarks/npu/fusion/fa9/fa9.cpp`](../benchmarks/npu/fusion/fa9/fa9.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa9 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `fa_fp4` | [`benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp`](../benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa_fp4 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/fusion` | `flashmla13` | [`benchmarks/npu/fusion/flashmla13/flashmla13.cpp`](../benchmarks/npu/fusion/flashmla13/flashmla13.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=flashmla13 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/nddma` | `transpose_053_mgather` | [`benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp`](../benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp) | `cd benchmarks/npu/nddma && make TESTCASE=transpose_053_mgather PLAT=linx COMPILER_DIR=` | none | active | +| `npu/nddma` | `transpose_053_tload` | [`benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp`](../benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp) | `cd benchmarks/npu/nddma && make TESTCASE=transpose_053_tload PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `Add_ND_bfloat16_float32_DeepSeek_V3_000028` | [`benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp`](../benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=Add_ND_bfloat16_float32_DeepSeek_V3_000028 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic` | [`benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp`](../benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic` | [`benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp`](../benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV` | [`benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp`](../benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `gemm_18x128x256` | [`benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp`](../benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=gemm_18x128x256 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `layernorm_vcadd_vaddx3_12288_fp16` | [`benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp`](../benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=layernorm_vcadd_vaddx3_12288_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV` | [`benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp`](../benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `rmsnorm_reduce_1_16384_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_1_16384_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `rmsnorm_reduce_2_8192_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_2_8192_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `rmsnorm_reduce_4_4096_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_4_4096_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `rmsnorm_reduce_4_5120_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_4_5120_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `rope_32_40_1_64_bf16` | [`benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp`](../benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rope_32_40_1_64_bf16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `softmax_8_34_fp16` | [`benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp`](../benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_8_34_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `softmax_LLM_2` | [`benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp`](../benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_LLM_2 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `softmax_vaddx3_vcadd_1_4096_bf16` | [`benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp`](../benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_vaddx3_vcadd_1_4096_bf16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `softmax_vaddx3_vcadd_1_4096_fp16` | [`benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp`](../benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_vaddx3_vcadd_1_4096_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simd` | `swiglu_64_1024_fp16` | [`benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp`](../benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=swiglu_64_1024_fp16 PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simt` | `npu_hashtable_insert_cmp_host` | [`benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp`](../benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp) | `cd benchmarks/npu/vec_simt && make TESTCASE=npu_hashtable_insert_cmp_host PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simt` | `npu_hashtable_lookup_cmp_host` | [`benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp`](../benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp) | `cd benchmarks/npu/vec_simt && make TESTCASE=npu_hashtable_lookup_cmp_host PLAT=linx COMPILER_DIR=` | none | active | +| `npu/vec_simt` | `hashfind` | [`benchmarks/npu/vec_simt/hashfind/hashfind.cpp`](../benchmarks/npu/vec_simt/hashfind/hashfind.cpp) | `cd benchmarks/npu/vec_simt && make TESTCASE=hashfind PLAT=linx COMPILER_DIR=` | `hashfind/data_obj`: `simple_inserted_slot.o`, `simple_lookup_keys.o`, `simple_lookup_values.o` | active | + +## Archived And Outdated Surfaces + +| Category | Source path | Replacement | Required data objects | Status | +| --- | --- | --- | --- | --- | +| `legacy/api/tileop` | [`archive/outdated/tests/other/tileop_api`](../archive/outdated/tests/other/tileop_api) | [`benchmarks/api/tileop`](../benchmarks/api/tileop) | none | archive/outdated | +| `legacy/api/python` | [`archive/outdated/tests/other/py_api`](../archive/outdated/tests/other/py_api) | [`tests/py_api`](../tests/py_api) | none | archive/outdated | +| `legacy/npu/v220` | [`archive/outdated/tests/accelerator/v220`](../archive/outdated/tests/accelerator/v220) | [`benchmarks/npu`](../benchmarks/npu) | none | archive/outdated | +| `legacy/npu/v310` | [`archive/outdated/tests/accelerator/v310`](../archive/outdated/tests/accelerator/v310) | [`benchmarks/npu`](../benchmarks/npu) | none | archive/outdated | diff --git a/benchmarks/README.md b/benchmarks/README.md new file mode 100644 index 0000000..4326e76 --- /dev/null +++ b/benchmarks/README.md @@ -0,0 +1,44 @@ +# Benchmarks + +This is the primary navigation surface for active SuperNPUBench benchmark sources. These suites are intended to build through the shared make harness with `PLAT=linx COMPILER_DIR=` unless a local `compile*.all` file explicitly selects another platform for comparison. + +## Layout + +| Path | Purpose | +| --- | --- | +| [`common`](common) | Shared make harness and benchmark-local utility headers. | +| [`api/tileop`](api/tileop) | Focused TileOP API operation benchmarks. | +| [`npu/cube`](npu/cube) | Cube/matmul NPU benchmark cases. | +| [`npu/fusion`](npu/fusion) | Flash-attention and fusion NPU cases. | +| [`npu/nddma`](npu/nddma) | NDDMA transpose cases. | +| [`npu/vec_simd`](npu/vec_simd) | Vector SIMD NPU cases. | +| [`npu/vec_simt`](npu/vec_simt) | Vector SIMT NPU cases, including embedded data-object cases. | +| [`kernels/control`](kernels/control) | Control-flow/hash-table kernels. | +| [`kernels/element_wise`](kernels/element_wise) | Element-wise kernels. | +| [`kernels/gemm`](kernels/gemm) | GEMM/matmul kernels. | +| [`kernels/fusion`](kernels/fusion) | Kernel-level fusion cases. | +| [`kernels/memory`](kernels/memory) | Broadcast, gather, scatter, concat, and transpose memory kernels. | +| [`kernels/reduction`](kernels/reduction) | Row/column reduction kernels. | +| [`kernels/sort`](kernels/sort) | Sort/top-k kernels with embedded data-object support. | +| [`kernels/composite`](kernels/composite) | Composite kernels formerly grouped under `orther`. | +| [`models/deepseekv3`](models/deepseekv3) | DeepSeekV3 model-level benchmark kernels. | +| [`microbench`](microbench) | Cube, memory, and vector microbenchmark suites. | +| [`scripts`](scripts) | Batch and recursive helper scripts. | + +## Build Pattern + +Run local make commands from the suite directory: + +```sh +cd benchmarks/api/tileop +make TESTCASE=TAdd PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin +``` + +Run suite batches from the same directory as the script: + +```sh +cd benchmarks/kernels/gemm/matmul +bash compile.all +``` + +For the complete source and build catalog, use [`INDEX.md`](INDEX.md). diff --git a/test/tileop_api/Makefile b/benchmarks/api/tileop/Makefile similarity index 78% rename from test/tileop_api/Makefile rename to benchmarks/api/tileop/Makefile index 8d51541..f5085ab 100644 --- a/test/tileop_api/Makefile +++ b/benchmarks/api/tileop/Makefile @@ -3,4 +3,4 @@ TARGET = $(ELF_HEAD)_$(TESTCASE)_$(PLAT).elf SRC_FILE += $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp endif -include ../common/Makefile.common \ No newline at end of file +include ../../common/Makefile.common diff --git a/test/tileop_api/compile.all b/benchmarks/api/tileop/compile.all similarity index 94% rename from test/tileop_api/compile.all rename to benchmarks/api/tileop/compile.all index a3f4706..fccb53e 100755 --- a/test/tileop_api/compile.all +++ b/benchmarks/api/tileop/compile.all @@ -11,8 +11,8 @@ make clean;make TESTCASE=TAnd make clean;make TESTCASE=TCI make clean;make TESTCASE=TCmp make clean;make TESTCASE=TCopy -make clean;make TESTCASE=TCopyIn -make clean;make TESTCASE=TCopyOut +make clean;make TESTCASE=TLoad +make clean;make TESTCASE=TStore make clean;make TESTCASE=TCvt make clean;make TESTCASE=TDiv make clean;make TESTCASE=TDivs diff --git a/test/other/tileop_api/data.hpp b/benchmarks/api/tileop/data.hpp similarity index 78% rename from test/other/tileop_api/data.hpp rename to benchmarks/api/tileop/data.hpp index 394ff52..afca271 100644 --- a/test/other/tileop_api/data.hpp +++ b/benchmarks/api/tileop/data.hpp @@ -1,16 +1,34 @@ #ifndef DATA_H #define DATA_H +#ifdef __linx +#include +#include +extern "C" void exit(int); +extern "C" void free(void *); +extern "C" void *malloc(size_t); +extern "C" int printf(const char *, ...); +#else #include #include -#include - +#endif +#include "common/type.hpp" + +#ifdef __linx +static constexpr float s_fp32 = 0.1f; +static constexpr __half s_fp16 = __half(0.0f); +static constexpr int8_t s_i8 = 1; +static constexpr int16_t s_i16 = 1; +static constexpr int32_t s_i32 = 1; +static constexpr int64_t s_i64 = 1; +#else float s_fp32 = 0.1; __half s_fp16 = 0.1; int8_t s_i8 = 1; int16_t s_i16 = 1; int32_t s_i32 = 1; int64_t s_i64 = 1; +#endif template void init_src_uint(T *aar, uint16_t size) { for (uint16_t i = 0; i < size; i++) { @@ -23,10 +41,26 @@ template void init_src_int(T *aar, uint16_t size) { aar[i] = -(i + 1); } } +void init_src_int8(int8_t *aar, uint16_t size) { + for (uint16_t i = 0; i < size; i++) { + uint16_t val = i % 256; + if (val != 128) { + aar[i] = val - 128; + } else { + aar[i] = -128; + } + } +} template void init_src_fp(T *aar, uint16_t size) { for (uint16_t i = 0; i < size; i++) { +#ifdef __linx + const float x = (i + 1) / 100.0f; + const float x2 = x * x; + aar[i] = x * (1.0f - x2 / 6.0f + (x2 * x2) / 120.0f); +#else aar[i] = sin((i + 1) / 100.0f); +#endif } } @@ -36,6 +70,12 @@ template void init_dst(T *aar, uint16_t size) { } } +template void init_dst_no_zero(T *aar, uint16_t size) { + for (uint16_t i = 0; i < size; i++) { + aar[i] = 1.0; + } +} + template void init_index(T *aar, uint16_t row, uint16_t col) { for (uint16_t i = 0; i < row; ++i) { for (uint16_t j = 0; j < col; ++j) { @@ -56,11 +96,46 @@ template void init_01(T *aar, uint16_t row, uint16_t col) { } } +template void init_rows_fp(T *aar, uint16_t row, uint16_t col) { + for (uint16_t i = 0; i < row; ++i) { + for (uint16_t j = 0; j < col; ++j) { + aar[i * col + j] = (i * col + j) / 100.0f; + } + } +} + template void OutArray(const T *aar, size_t size) { +#ifdef __linx + (void)aar; + (void)size; +#else for (uint16_t i = 0; i < size; i++) { std::cout << aar[i] << " "; } std::cout << std::endl; +#endif +} +void OutArray(const int8_t *aar, size_t size) { +#ifdef __linx + (void)aar; + (void)size; +#else + for (uint16_t i = 0; i < size; i++) { + std::cout << static_cast(aar[i]) << " "; + } + std::cout << std::endl; +#endif +} +void OutArray(const __half *aar, size_t size) { +#ifdef __linx + (void)aar; + (void)size; +#else + for (uint16_t i = 0; i < size; i++) { + std::cout << static_cast<__fp16>(aar[i]) << " "; + } + std::cout << std::endl; +#endif } // check memory allocation @@ -145,4 +220,4 @@ template void check_mem_alloc(const T *p) { free(d2); \ free(d3); -#endif \ No newline at end of file +#endif diff --git a/test/common/linxStartEnd.hpp b/benchmarks/api/tileop/linxStartEnd.hpp similarity index 100% rename from test/common/linxStartEnd.hpp rename to benchmarks/api/tileop/linxStartEnd.hpp diff --git a/test/tileop_api/src/Cus_Template_ASM.cpp b/benchmarks/api/tileop/src/Cus_Template_ASM.cpp similarity index 95% rename from test/tileop_api/src/Cus_Template_ASM.cpp rename to benchmarks/api/tileop/src/Cus_Template_ASM.cpp index 418e89b..5491da1 100644 --- a/test/tileop_api/src/Cus_Template_ASM.cpp +++ b/benchmarks/api/tileop/src/Cus_Template_ASM.cpp @@ -8,7 +8,7 @@ #ifdef ENABLE_TENSOR_INSTR template -void TCOPYIN_ASM(tile_shape &dst, gm_shape &src) { +void TLOAD_ASM(tile_shape &dst, gm_shape &src) { asm volatile( "BSTART.PAR 33, %c1\n" @@ -37,9 +37,9 @@ void test_Nz(T *dst) { gm_shape g(dst); tile_shape t; #ifdef ENABLE_TENSOR_INSTR - TCOPYIN_ASM(t, g); + TLOAD_ASM(t, g); #else - TCOPYIN(t, g); + TLOAD(t, g); #endif print_tile(t); } diff --git a/test/other/tileop_api/src/MatMacc.cpp b/benchmarks/api/tileop/src/MatMacc.cpp similarity index 54% rename from test/other/tileop_api/src/MatMacc.cpp rename to benchmarks/api/tileop/src/MatMacc.cpp index afb4a37..2c665a2 100644 --- a/test/other/tileop_api/src/MatMacc.cpp +++ b/benchmarks/api/tileop/src/MatMacc.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape_A = global_tensor>; @@ -12,8 +63,8 @@ void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape_C = global_tensor>; using tile_shape_A = Tile; - using tile_shape_B = Tile; - using tile_shape_C = Tile; + using tile_shape_B = Tile;; + using tile_shape_C = Tile;; gm_shape_A s0(src0); gm_shape_B s1(src1); @@ -23,10 +74,11 @@ void test_RowMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, res); MATMACC(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -47,24 +99,74 @@ void test_ColMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, res); MATMACC(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { - const uint16_t M = 64; - const uint16_t K = 32; - const uint16_t N = 64; +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; +#else + const uint16_t M = 16; + const uint16_t K = 8; + const uint16_t N = 32; +#endif + + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + +#ifdef __linx + static int64_t dst_rm[size_C]; + static int64_t src0_rm[size_A]; + static int64_t src1_rm[size_B]; + static int64_t base_rm[size_C]; - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; + for (size_t row = 0; row < M; ++row) { + for (size_t k = 0; k < K; ++k) { + const int64_t value = static_cast((row + 1) * (k + 2)); + src0_rm[row * K + k] = value; + } + } + for (size_t k = 0; k < K; ++k) { + for (size_t col = 0; col < N; ++col) { + const int64_t value = static_cast((k + 1) + (col + 1)); + src1_rm[k * N + col] = value; + } + } + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + const int64_t value = static_cast(10 + row * N + col); + dst_rm[row * N + col] = value; + base_rm[row * N + col] = value; + } + } + + test_RowMajor(dst_rm, src0_rm, src1_rm); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = base_rm[row * N + col]; + for (size_t k = 0; k < K; ++k) { + expected += src0_rm[row * K + k] * src1_rm[k * N + col]; + } + if (dst_rm[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else float *dst = (float *)malloc(size_C * sizeof(float)); check_mem_alloc(dst); - init_src_fp(dst, size_C); + init_dst_no_zero(dst, size_C); float *src0 = (float *)malloc(size_A * sizeof(float)); check_mem_alloc(src0); @@ -75,7 +177,7 @@ int main() { __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half)); check_mem_alloc(dst_f16); - init_src_fp(dst_f16, size_C); + init_dst_no_zero(dst_f16, size_C); __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half)); check_mem_alloc(src0_f16); @@ -83,44 +185,44 @@ int main() { __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, size_B); - + int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t)); check_mem_alloc(dst_i8); - init_src_fp(dst_i8, size_C); - + init_dst_no_zero(dst_i8, size_C); + int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, size_A); int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, size_B); - + int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t)); check_mem_alloc(dst_i16); - init_src_fp(dst_i16, size_C); - + init_dst_no_zero(dst_i16, size_C); + int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, size_A); int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, size_B); - + int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t)); check_mem_alloc(dst_i32); - init_src_fp(dst_i32, size_C); - + init_dst_no_zero(dst_i32, size_C); + int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, size_A); int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, size_B); - + int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t)); check_mem_alloc(dst_i64); - init_src_fp(dst_i64, size_C); - + init_dst_no_zero(dst_i64, size_C); + int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, size_A); @@ -132,13 +234,13 @@ int main() { PMC_START(); #endif - test_RowMajor(dst, src0, src1); - - test_RowMajor(dst_f16, src0_f16, src1_f16); + //test_RowMajor(dst, src0, src1); + + //test_RowMajor(dst_f16, src0_f16, src1_f16); - test_RowMajor(dst_i8, src0_i8, src1_i8); + //test_RowMajor(dst_i8, src0_i8, src1_i8); - test_RowMajor(dst_i16, src0_i16, src1_i16); + //test_RowMajor(dst_i16, src0_i16, src1_i16); test_RowMajor(dst_i32, src0_i32, src1_i32); @@ -150,35 +252,36 @@ int main() { printf("Result:\n"); OutArray(dst, size_C); - //OutArray(dst_f16, size_C); + OutArray(dst_f16, size_C); OutArray(dst_i8, size_C); OutArray(dst_i16, size_C); OutArray(dst_i32, size_C); OutArray(dst_i64, size_C); - + free(dst); free(src0); free(src1); - + free(dst_f16); free(src0_f16); free(src1_f16); - + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/other/tileop_api/src/MatMul.cpp b/benchmarks/api/tileop/src/MatMul.cpp similarity index 69% rename from test/other/tileop_api/src/MatMul.cpp rename to benchmarks/api/tileop/src/MatMul.cpp index cfaec47..99d8a33 100644 --- a/test/other/tileop_api/src/MatMul.cpp +++ b/benchmarks/api/tileop/src/MatMul.cpp @@ -5,6 +5,48 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + auto *d = static_cast(dst); + const auto *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape_A = global_tensor>; @@ -12,8 +54,8 @@ void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape_C = global_tensor>; using tile_shape_A = Tile; - using tile_shape_B = Tile; - using tile_shape_C = Tile; + using tile_shape_B = Tile;; + using tile_shape_C = Tile;; gm_shape_A s0(src0); gm_shape_B s1(src1); @@ -23,10 +65,10 @@ void test_RowMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -47,16 +89,48 @@ void test_ColMajor(T *dst, T *src0, T *src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { - const uint16_t M = 64; +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + init_dst(dst_i64, size_C); + init_src_int(src0_i64, size_A); + init_src_int(src1_i64, size_B); + + test_RowMajor(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 16; const uint16_t K = 32; - const uint16_t N = 64; + const uint16_t N = 32; size_t size_A = M * K; size_t size_B = K * N; @@ -75,7 +149,7 @@ int main() { __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half)); check_mem_alloc(dst_f16); - init_dst(dst_f16, size_C); + init_dst(dst_f16, size_C); __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half)); check_mem_alloc(src0_f16); @@ -83,44 +157,44 @@ int main() { __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, size_B); - + int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, size_C); - + int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, size_A); int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, size_B); - + int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, size_C); - + int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, size_A); int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, size_B); - + int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, size_C); - + int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, size_A); int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, size_B); - + int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, size_C); - + int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, size_A); @@ -133,7 +207,7 @@ int main() { #endif test_RowMajor(dst, src0, src1); - + test_RowMajor(dst_f16, src0_f16, src1_f16); test_RowMajor(dst_i8, src0_i8, src1_i8); @@ -150,35 +224,36 @@ int main() { printf("Result:\n"); OutArray(dst, size_C); - //OutArray(dst_f16, size_C); + OutArray(dst_f16, size_C); OutArray(dst_i8, size_C); OutArray(dst_i16, size_C); OutArray(dst_i32, size_C); OutArray(dst_i64, size_C); - + free(dst); free(src0); free(src1); - + free(dst_f16); free(src0_f16); free(src1_f16); - + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/benchmarks/api/tileop/src/MatMul_e4m3.cpp b/benchmarks/api/tileop/src/MatMul_e4m3.cpp new file mode 100644 index 0000000..f81d692 --- /dev/null +++ b/benchmarks/api/tileop/src/MatMul_e4m3.cpp @@ -0,0 +1,188 @@ +#include "../data.hpp" +#include +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + auto *d = static_cast(dst); + const auto *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} + +template +void test(int64_t *dst, int64_t *src0, int64_t *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = Tile; + using tile_shape_B = Tile; + using tile_shape_C = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + TSTORE(res, d2); +} +#else +template +void __vec__ test_cvt(typename TA::TileDType __out__ a, + typename TB::TileDType __in__ b) { + using AType = typename TA::DType; + using BType = typename TB::DType; + __vbuf__ BType *pb = blkv_get_tile_ptr(b); + __vbuf__ AType *pa = blkv_get_tile_ptr(a); + int x = blkv_get_index_x(); + int y = blkv_get_index_y(); + int idx = index(y, x); + AType o = (AType)(pb[idx]); + pa[idx] = o; +} + +template +void test(float *dst, float *src0, float *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = TileLeft; + using tile_shape_B = TileRight; + using tile_shape_C = TileAcc; + using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>; + using tile_shape_LB = TileRight<__fp8_e4m3, K, N>; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + tile_shape_LA lda; + tile_shape_LB ldb; + + TLOAD(d0, s0); + TLOAD(d1, s1); + test_cvt<<>>(lda.data(), d0.data()); + test_cvt<<>>(ldb.data(), d1.data()); + MATMUL(d2, lda, ldb); + TSTORE(res, d2); +} +#endif + +int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + init_dst(dst_i64, size_C); + init_src_int(src0_i64, size_A); + init_src_int(src1_i64, size_B); + + test(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 64; + const uint16_t K = 32; + const uint16_t N = 128; + + size_t size_A = M * K; + size_t size_B = K * N; + size_t size_C = M * N; + + float *dst = (float *)malloc(size_C * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size_C); + + float *src0 = (float *)malloc(size_A * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, size_A); + float *src1 = (float *)malloc(size_B * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, size_B); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_C); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/test/tileop_api/src/Print.cpp b/benchmarks/api/tileop/src/Print.cpp similarity index 96% rename from test/tileop_api/src/Print.cpp rename to benchmarks/api/tileop/src/Print.cpp index 7f6dba2..56c56fb 100644 --- a/test/tileop_api/src/Print.cpp +++ b/benchmarks/api/tileop/src/Print.cpp @@ -25,12 +25,12 @@ void test_ACC(float *dst, float *src0, float *src1) { tile_shape_C d2; tile_shape_O d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); TCVT(d3, d2); print_tile(d3); - TCOPYOUT(res, d3); + TSTORE(res, d3); } template ; gm_shape g(dst); tile_shape t; - TCOPYIN(t, g); + TLOAD(t, g); print_tile(t); } @@ -51,7 +51,7 @@ void test_Zn(T *dst) { using tile_shape = TileRight; gm_shape g(dst); tile_shape t; - TCOPYIN(t, g); + TLOAD(t, g); print_tile(t); } @@ -62,10 +62,10 @@ void test_RowMajor(T *dst) { using tile_shape = Tile; gm_shape g(dst); tile_shape t; - TCOPYIN(t, g); + TLOAD(t, g); print_tile(t); } - + template void test_ColMajor(T *dst) { @@ -73,7 +73,7 @@ void test_ColMajor(T *dst) { using tile_shape = Tile; gm_shape g(dst); tile_shape t; - TCOPYIN(t, g); + TLOAD(t, g); print_tile(t); } diff --git a/benchmarks/api/tileop/src/TAbs.cpp b/benchmarks/api/tileop/src/TAbs.cpp new file mode 100644 index 0000000..e580599 --- /dev/null +++ b/benchmarks/api/tileop/src/TAbs.cpp @@ -0,0 +1,155 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_RowMajor(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_row; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TABS(d1, d0); + TSTORE(res, d1); + } + } +} + +template +void test_ColMajor(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_col; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_row; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TABS(d1, d0); + TSTORE(res, d1); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 16; + constexpr uint16_t tile_col = 16; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_RowMajor(dst, src); + + return 0; +#else + float *dst_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst_col); + init_dst(dst_col, gm_size); + + float *src0_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0_col); + init_src_fp(src0_col, gm_size); + + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src0_f16); + init_src_fp(src0_f16, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_ColMajor(dst_col, src0_col); + + test_RowMajor(dst_f16, src0_f16); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_col, gm_size); + OutArray(dst_f16, gm_size); + + free(dst_col); + free(src0_col); + + free(dst_f16); + free(src0_f16); + + return 0; +#endif +} diff --git a/test/other/tileop_api/src/TAdd.cpp b/benchmarks/api/tileop/src/TAdd.cpp similarity index 66% rename from test/other/tileop_api/src/TAdd.cpp rename to benchmarks/api/tileop/src/TAdd.cpp index 3da9aea..7fc20ed 100644 --- a/test/other/tileop_api/src/TAdd.cpp +++ b/benchmarks/api/tileop/src/TAdd.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; - using tile_shape = Tile; - + using tile_shape = Tile; + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } - + template void test_ColMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { - const uint16_t gm_row = 128; - const uint16_t gm_col = 128; - const uint16_t tile_row = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + const uint16_t gm_row = 64; + const uint16_t gm_col = 32; + const uint16_t tile_row = 64; const uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_i64[gm_size]; + static int64_t src0_i64[gm_size]; + static int64_t src1_i64[gm_size]; + init_dst(dst_i64, gm_size); + init_src_int(src0_i64, gm_size); + init_src_int(src1_i64, gm_size); - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + test_RowMajor(dst_i64, src0_i64, src1_i64); + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -77,54 +129,67 @@ int main() { check_mem_alloc(src1); init_src_fp(src1, gm_size); + float *dst_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst_col); + init_dst(dst_col, gm_size); + + float *src0_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0_col); + init_src_fp(src0_col, gm_size); + float *src1_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src1_col); + init_src_fp(src1_col, gm_size); + +#ifndef __linx __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, gm_size); - +#endif + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -136,16 +201,18 @@ int main() { PMC_START(); #endif - test_RowMajor(dst, src0, src1); - - test_RowMajor(dst_f16, src0_f16, src1_f16); - - test_RowMajor(dst_i8, src0_i8, src1_i8); + test_ColMajor(dst_col, src0_col, src1_col); + +#ifndef __linx + test_ColMajor(dst_f16, src0_f16, src1_f16); +#endif + + test_ColMajor(dst_i8, src0_i8, src1_i8); test_RowMajor(dst_i16, src0_i16, src1_i16); - + test_RowMajor(dst_i32, src0_i32, src1_i32); - + test_RowMajor(dst_i64, src0_i64, src1_i64); #ifdef LINX_PMC @@ -154,35 +221,45 @@ int main() { printf("Result:\n"); OutArray(dst, gm_size); - //OutArray(dst_f16, gm_size); + OutArray(dst_col, gm_size); +#ifndef __linx + OutArray(dst_f16, gm_size); +#endif OutArray(dst_i8, gm_size); OutArray(dst_i16, gm_size); OutArray(dst_i32, gm_size); OutArray(dst_i64, gm_size); - + free(dst); free(src0); free(src1); - + + free(dst_col); + free(src0_col); + free(src1_col); + +#ifndef __linx free(dst_f16); free(src0_f16); free(src1_f16); - +#endif + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/benchmarks/api/tileop/src/TAdd_mask.cpp b/benchmarks/api/tileop/src/TAdd_mask.cpp new file mode 100644 index 0000000..a1770b2 --- /dev/null +++ b/benchmarks/api/tileop/src/TAdd_mask.cpp @@ -0,0 +1,191 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +using namespace pto; + +template +void test(T *c_ptr, T *a_ptr, T *b_ptr) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + static constexpr int block_row = gm_row / tile_row; + static constexpr int block_col = gm_col / tile_col; + static constexpr int remainder_row = gm_row % tile_row; + static constexpr int remainder_col = gm_col % tile_col; + + using trailing_rows_shape = + Tile; + using trailing_cols_shape = + Tile; + using trailing_corner_shape = Tile; + + glb_iterator gAIter(a_ptr); + glb_iterator gBIter(b_ptr); + glb_iterator gCIter(c_ptr); + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto gA = gAIter(i, j); + auto gB = gBIter(i, j); + auto gC = gCIter(i, j); + + tile_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + if constexpr (remainder_col) { + auto gA = gAIter(i, block_col); + auto gB = gBIter(i, block_col); + auto gC = gCIter(i, block_col); + + trailing_rows_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + } + if constexpr (remainder_row) { + for (int j = 0; j < block_col; ++j) { + auto gA = gAIter(block_row, j); + auto gB = gBIter(block_row, j); + auto gC = gCIter(block_row, j); + + trailing_cols_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + if constexpr (remainder_col) { + auto gA = gAIter(block_row, block_col); + auto gB = gBIter(block_row, block_col); + auto gC = gCIter(block_row, block_col); + + trailing_corner_shape tA, tB, tC; + TLOAD(tA, gA); + TLOAD(tB, gB); + TADD(tC, tA, tB); + TSTORE(gC, tC); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 6; + constexpr uint16_t gm_col = 6; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + const uint16_t gm_row = 66; + const uint16_t gm_col = 66; + const uint16_t tile_row = 16; + const uint16_t tile_col = 16; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src0[gm_size]; + static int64_t src1[gm_size]; + init_dst(dst, gm_size); + init_src_int(src0, gm_size); + init_src_uint(src1, gm_size); + + test(dst, src0, src1); + return 0; +#else + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + float *src1 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/test/other/tileop_api/src/TAdds.cpp b/benchmarks/api/tileop/src/TAdds.cpp similarity index 64% rename from test/other/tileop_api/src/TAdds.cpp rename to benchmarks/api/tileop/src/TAdds.cpp index 305fd6a..3545585 100644 --- a/test/other/tileop_api/src/TAdds.cpp +++ b/benchmarks/api/tileop/src/TAdds.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T s) { using gm_shape = global_tensor>; - using tile_shape = Tile; - + using tile_shape = Tile; + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -20,21 +52,21 @@ void test_RowMajor(T *dst, T *src0, T s) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TADDS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } - + template void test_ColMajor(T *dst, T *src0, T s) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -44,36 +76,54 @@ void test_ColMajor(T *dst, T *src0, T s) { int offset = i * (tile_col * gm_row) + j * tile_row; gm_shape s0(src0 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TADDS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst_i64[gm_size]; + static int64_t src0_i64[gm_size]; + init_dst(dst_i64, gm_size); + init_src_int(src0_i64, gm_size); - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); + test_RowMajor(dst_i64, src0_i64, s_i64); + + return 0; +#else + float *dst_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst_col); + init_dst(dst_col, gm_size); - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); + float *src0_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0_col); + init_src_fp(src0_col, gm_size); __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); @@ -81,31 +131,31 @@ int main() { int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -114,16 +164,16 @@ int main() { PMC_START(); #endif - test_RowMajor(dst, src0, s_fp32); - + test_ColMajor(dst_col, src0_col, s_fp32); + test_RowMajor(dst_f16, src0_f16, s_fp16); test_RowMajor(dst_i8, src0_i8, s_i8); test_RowMajor(dst_i16, src0_i16, s_i16); - + test_RowMajor(dst_i32, src0_i32, s_i32); - + test_RowMajor(dst_i64, src0_i64, s_i64); #ifdef LINX_PMC @@ -131,30 +181,31 @@ int main() { #endif printf("Result:\n"); - OutArray(dst, gm_size); - //OutArray(dst_f16, gm_size); + OutArray(dst_col, gm_size); + OutArray(dst_f16, gm_size); OutArray(dst_i8, gm_size); OutArray(dst_i16, gm_size); OutArray(dst_i32, gm_size); OutArray(dst_i64, gm_size); - - free(dst); - free(src0); - + + free(dst_col); + free(src0_col); + free(dst_f16); free(src0_f16); - + free(dst_i8); free(src0_i8); - + free(dst_i16); free(src0_i16); - + free(dst_i32); free(src0_i32); - + free(dst_i64); free(src0_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TAnd.cpp b/benchmarks/api/tileop/src/TAnd.cpp similarity index 77% rename from test/tileop_api/src/TAnd.cpp rename to benchmarks/api/tileop/src/TAnd.cpp index 714b320..14e97d7 100644 --- a/test/tileop_api/src/TAnd.cpp +++ b/benchmarks/api/tileop/src/TAnd.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TAND(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } - + template void test_ColMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TAND(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 32; - const uint16_t tile_row = 64; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 32; + constexpr uint16_t tile_row = 64; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst_i64[gm_size]; + static int64_t src0_i64[gm_size]; + static int64_t src1_i64[gm_size]; + init_dst(dst_i64, gm_size); + init_src_int(src0_i64, gm_size); + init_src_int(src1_i64, gm_size); + + test_RowMajor(dst_i64, src0_i64, src1_i64); + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -91,51 +143,51 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, gm_size); - + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -154,9 +206,9 @@ int main() { test_ColMajor(dst_i8, src0_i8, src1_i8); test_RowMajor(dst_i16, src0_i16, src1_i16); - + test_RowMajor(dst_i32, src0_i32, src1_i32); - + test_RowMajor(dst_i64, src0_i64, src1_i64); #ifdef LINX_PMC @@ -171,7 +223,7 @@ int main() { OutArray(dst_i16, gm_size); OutArray(dst_i32, gm_size); OutArray(dst_i64, gm_size); - + free(dst); free(src0); free(src1); @@ -183,22 +235,23 @@ int main() { free(dst_f16); free(src0_f16); free(src1_f16); - + free(dst_i8); free(src0_i8); free(src1_i8); - + free(dst_i16); free(src0_i16); free(src1_i16); - + free(dst_i32); free(src0_i32); free(src1_i32); - + free(dst_i64); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TAssemble.cpp b/benchmarks/api/tileop/src/TAssemble.cpp similarity index 95% rename from test/tileop_api/src/TAssemble.cpp rename to benchmarks/api/tileop/src/TAssemble.cpp index efd4e25..04a535c 100644 --- a/test/tileop_api/src/TAssemble.cpp +++ b/benchmarks/api/tileop/src/TAssemble.cpp @@ -28,11 +28,11 @@ void test_rm(float *dst, float *src0, float *src1, float *src2) { tile_shape_src2 d2; tile_shape_dst d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, s2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, s2); TASSEMBLE(d3, d0, d1, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } template (dst1, src0, src1, src2); - + test_cm(dst2, src0, src1, src2); - + test_rm_mask(dst3, src0, src1, src2); diff --git a/test/tileop_api/src/TCI.cpp b/benchmarks/api/tileop/src/TCI.cpp similarity index 74% rename from test/tileop_api/src/TCI.cpp rename to benchmarks/api/tileop/src/TCI.cpp index 776cad0..092e7c3 100644 --- a/test/tileop_api/src/TCI.cpp +++ b/benchmarks/api/tileop/src/TCI.cpp @@ -5,6 +5,39 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T s) { @@ -20,7 +53,7 @@ void test_rm(T *dst, T s) { tile_shape d1; TCI(d1, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -40,12 +73,28 @@ void test_cm(T *dst, T s) { tile_shape d1; TCI(d1, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } int main() { +#ifdef __linx + constexpr uint16_t gm_row = 8; + constexpr uint16_t gm_col = 8; + constexpr uint16_t tile_row = 8; + constexpr uint16_t tile_col = 8; + constexpr uint16_t gm_size = gm_row * gm_col; + + static int32_t dst_rm[gm_size]; + static int32_t dst_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + + test_rm(dst_rm, s_i32); + test_cm(dst_cm, s_i32); + return 0; +#else const uint16_t gm_row = 64; const uint16_t gm_col = 32; const uint16_t tile_row = 64; @@ -127,4 +176,5 @@ int main() { free(dst5); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TCast.cpp b/benchmarks/api/tileop/src/TCast.cpp similarity index 95% rename from test/tileop_api/src/TCast.cpp rename to benchmarks/api/tileop/src/TCast.cpp index b40d9ba..5d71936 100644 --- a/test/tileop_api/src/TCast.cpp +++ b/benchmarks/api/tileop/src/TCast.cpp @@ -17,9 +17,9 @@ void test_rm(T2 *dst, T1 *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCAST(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template @@ -34,9 +34,9 @@ void test_cm(T2 *dst, T1 *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCAST(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template @@ -51,9 +51,9 @@ void test_Nz(T2 *dst, T1 *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TCAST(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { diff --git a/test/tileop_api/src/TCmp.cpp b/benchmarks/api/tileop/src/TCmp.cpp similarity index 69% rename from test/tileop_api/src/TCmp.cpp rename to benchmarks/api/tileop/src/TCmp.cpp index 7d6c16e..495e118 100644 --- a/test/tileop_api/src/TCmp.cpp +++ b/benchmarks/api/tileop/src/TCmp.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor_CmpMode(int32_t *dst, T *src0, T *src1) { @@ -12,7 +63,7 @@ void test_RowMajor_CmpMode(int32_t *dst, T *src0, T *src1) { using gm_shape_out = global_tensor>; using tile_shape_in = Tile; using tile_shape_out = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -23,13 +74,13 @@ void test_RowMajor_CmpMode(int32_t *dst, T *src0, T *src1) { gm_shape_in s0(src0 + offset); gm_shape_in s1(src1 + offset); gm_shape_out res(dst + offset); - + tile_shape_in d0, d1; tile_shape_out d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TCMP(d2, d1, d0, Mode); // 使用模板参数Mode - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -41,7 +92,7 @@ void test_ColMajor_CmpMode(int32_t *dst, T *src0, T *src1) { using gm_shape_out = global_tensor>; using tile_shape_in = Tile; using tile_shape_out = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -52,13 +103,13 @@ void test_ColMajor_CmpMode(int32_t *dst, T *src0, T *src1) { gm_shape_in s0(src0 + offset); gm_shape_in s1(src1 + offset); gm_shape_out res(dst + offset); - + tile_shape_in d0, d1; tile_shape_out d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TCMP(d2, d1, d0, Mode); // 使用模板参数Mode - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -68,7 +119,7 @@ template void test_SingleCmpMode_RowMajor() { size_t gm_size = gm_row * gm_col; - + int32_t *dst = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -89,7 +140,7 @@ void test_SingleCmpMode_RowMajor() { #endif OutArray(dst, gm_size); - + free(dst); free(src0); free(src1); @@ -99,7 +150,7 @@ template void test_SingleCmpMode_ColMajor() { size_t gm_size = gm_row * gm_col; - + int32_t *dst = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -120,7 +171,7 @@ void test_SingleCmpMode_ColMajor() { #endif OutArray(dst, gm_size); - + free(dst); free(src0); free(src1); @@ -147,6 +198,42 @@ void test_AllCmpModes_ColMajor() { } int main() { +#ifdef __linx + constexpr uint16_t gm_row = 8; + constexpr uint16_t gm_col = 8; + constexpr uint16_t tile_row = 8; + constexpr uint16_t tile_col = 8; + constexpr uint16_t gm_size = gm_row * gm_col; + + static int32_t dst_rm[gm_size]; + static int32_t dst_cm[gm_size]; + static int32_t dst_eq[gm_size]; + static int64_t src0_rm[gm_size]; + static int64_t src1_rm[gm_size]; + static int64_t src0_cm[gm_size]; + static int64_t src1_cm[gm_size]; + static int32_t src0_eq[gm_size]; + static int32_t src1_eq[gm_size]; + + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + init_dst(dst_eq, gm_size); + init_src_int(src0_rm, gm_size); + init_src_uint(src1_rm, gm_size); + init_src_uint(src0_cm, gm_size); + init_src_int(src1_cm, gm_size); + init_src_int(src0_eq, gm_size); + init_src_int(src1_eq, gm_size); + + test_RowMajor_CmpMode(dst_rm, src0_rm, src1_rm); + test_ColMajor_CmpMode(dst_cm, src0_cm, src1_cm); + test_RowMajor_CmpMode(dst_eq, src0_eq, src1_eq); + + return 0; +#else const uint16_t gm_row = 64; const uint16_t gm_col = 32; const uint16_t tile_row = 64; @@ -154,6 +241,7 @@ int main() { size_t gm_size = gm_row * gm_col; size_t tile_size = tile_row * tile_col; + (void)tile_size; printf("Result:\n"); // 测试float类型的所有比较模式 @@ -175,4 +263,5 @@ int main() { test_SingleCmpMode_RowMajor(); test_SingleCmpMode_ColMajor(); return 0; -} \ No newline at end of file +#endif +} diff --git a/benchmarks/api/tileop/src/TCopy.cpp b/benchmarks/api/tileop/src/TCopy.cpp new file mode 100644 index 0000000..d0fb15c --- /dev/null +++ b/benchmarks/api/tileop/src/TCopy.cpp @@ -0,0 +1,339 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_Nz(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = TileLeft; + using glb_iterator = global_iterator; + + glb_iterator gS0Iter(src0); + glb_iterator gDIter(dst); + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto s0 = gS0Iter(i, j); + auto res = gDIter(i, j); + + tile_shape d0, d1; + TLOAD(d0, s0); + TCOPY(d1, d0); + TSTORE(res, d1); + } + } +} + +template +void test_Nz_Dynamic(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = TileLeft; + + volatile size_t tile_valid_row = tile_row - 2; + volatile size_t tile_valid_col = tile_col - 2; + + uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; + uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; + + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + uint16_t remainder_row = gm_row - i * tile_valid_row; + uint16_t remainder_col = gm_col - j * tile_valid_col; + + uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row; + uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col; + + int offset = i * (tile_valid_row * gm_col) + j * tile_valid_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0(active_row, active_col); + tile_shape d1(active_row, active_col); + TLOAD(d0, s0); + TCOPY(d1, d0); + TSTORE(res, d1); + } + } +} + +template +void test_RowMajor(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_row; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TCOPY(d1, d0); + TSTORE(res, d1); + } + } +} + +template +void test_RowMajor_Dynamic(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + volatile size_t tile_valid_row = tile_row - 2; + volatile size_t tile_valid_col = tile_col - 2; + + uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; + uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; + + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + uint16_t remainder_row = gm_row - i * tile_valid_row; + uint16_t remainder_col = gm_col - j * tile_valid_col; + + uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row; + uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col; + + int offset = i * (tile_valid_row * gm_col) + j * tile_valid_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0(active_row, active_col); + tile_shape d1(active_row, active_col); + TLOAD(d0, s0); + TCOPY(d1, d0); + TSTORE(res, d1); + } + } +} + +template +void test_ColMajor(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_col; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_row; ++j) { + int offset = i * (tile_col * gm_row) + j * tile_row; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TCOPY(d1, d0); + TSTORE(res, d1); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_RowMajor(dst, src); + + return 0; +#else + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + + float *dst_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst_col); + init_dst(dst_col, gm_size); + + float *src0_col = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0_col); + init_src_fp(src0_col, gm_size); + + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src0_f16); + init_src_fp(src0_f16, gm_size); + + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_i8); + init_dst(dst_i8, gm_size); + + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src0_i8); + init_src_int(src0_i8, gm_size); + + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_i16); + init_dst(dst_i16, gm_size); + + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src0_i16); + init_src_int(src0_i16, gm_size); + + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_i32); + init_dst(dst_i32, gm_size); + + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src0_i32); + init_src_int(src0_i32, gm_size); + + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_i64); + init_dst(dst_i64, gm_size); + + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src0_i64); + init_src_int(src0_i64, gm_size); + + int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst1_i32); + init_dst(dst1_i32, gm_size); + + int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src1_i32); + init_src_int(src1_i32, gm_size); + + int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_nz_i32); + init_dst(dst_nz_i32, gm_size); + + int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src_nz_i32); + init_src_int(src_nz_i32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + //test for fp32 Nz + test_Nz(dst, src0); + + test_ColMajor(dst_col, src0_col); + + test_RowMajor(dst_f16, src0_f16); + + test_RowMajor(dst_i8, src0_i8); + + test_RowMajor(dst_i16, src0_i16); + + test_RowMajor(dst_i32, src0_i32); + + test_RowMajor(dst_i64, src0_i64); + + test_RowMajor_Dynamic(dst1_i32, src1_i32); + + test_Nz_Dynamic(dst_nz_i32, src_nz_i32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_i8, gm_size); + OutArray(dst_i16, gm_size); + OutArray(dst_i32, gm_size); + OutArray(dst_i64, gm_size); + OutArray(dst1_i32, gm_size); + OutArray(dst_nz_i32, gm_size); + + free(dst); + free(src0); + + free(dst_f16); + free(src0_f16); + + free(dst_i8); + free(src0_i8); + + free(dst_i16); + free(src0_i16); + + free(dst_i32); + free(src0_i32); + + free(dst_i64); + free(src0_i64); + + free(dst1_i32); + free(src1_i32); + + free(dst_nz_i32); + free(src_nz_i32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TCvt.cpp b/benchmarks/api/tileop/src/TCvt.cpp new file mode 100644 index 0000000..f8652e3 --- /dev/null +++ b/benchmarks/api/tileop/src/TCvt.cpp @@ -0,0 +1,243 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void testRow2Nz(float *dst, float *src) { + using gm_shape = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = TileLeft; + + gm_shape s0(src); + gm_shape res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TCVT(d1, d0); + TCVT(d0, d1); + TSTORE(res, d0); +} + +template void testNz2Col(float *dst, float *src) { + using gm_shape = global_tensor>; + + using tile_shape_in = TileLeft; + using tile_shape_out = Tile; + + gm_shape s0(src); + gm_shape res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TCVT(d1, d0); + TCVT(d0, d1); + TSTORE(res, d0); +} + +template void testNz2Zn(float *dst, float *src) { + using gm_shape = global_tensor>; + + using tile_shape_in = TileLeft; + using tile_shape_out = TileRight; + + gm_shape s0(src); + gm_shape res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TCVT(d1, d0); + TCVT(d0, d1); + TSTORE(res, d0); +} + +template void testZn2Nz(float *dst, float *src) { + using gm_shape = global_tensor>; + + using tile_shape_in = TileRight; + using tile_shape_out = TileLeft; + + gm_shape s0(src); + gm_shape res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TCVT(d1, d0); + TCVT(d0, d1); + TSTORE(res, d0); +} + +template void testNz2Nz(float *dst, float *src) { + using gm_shape = global_tensor>; + + using tile_shape_in = TileLeft; + using tile_shape_out = TileLeft; + + gm_shape s0(src); + gm_shape res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TCVT(d1, d0); + TCVT(d0, d1); + TSTORE(res, d0); +} + +int main() { +#ifdef __linx + constexpr uint16_t row = 16; + constexpr uint16_t col = 16; + using row_tile = Tile; + using col_tile = Tile; + using nz_tile = TileLeft; + using zn_tile = TileRight; + + row_tile row_src; + row_tile row_round; + col_tile col_src; + col_tile col_round; + nz_tile nz_a; + nz_tile nz_b; + zn_tile zn; + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + row_src.data()[index(i, j)] = + static_cast((i + 1) * 100 + j); + col_src.data()[index(i, j)] = + static_cast((i + 1) * 1000 + j); + } + } + + TCVT(nz_a, row_src); + TCVT(row_round, nz_a); + TCVT(zn, nz_a); + TCVT(nz_b, zn); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + if (row_round.data()[index(i, j)] != + row_src.data()[index(i, j)]) { + return 1; + } + if (nz_b.data()[index(i, j)] != + nz_a.data()[index(i, j)]) { + return 2; + } + } + } + + TCVT(nz_a, col_src); + TCVT(col_round, nz_a); + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + if (col_round.data()[index(i, j)] != + col_src.data()[index(i, j)]) { + return 3; + } + } + } + + return 0; +#else + const uint16_t row = 16; + const uint16_t col = 32; + + size_t size = row * col; + + float *dst = (float *)malloc(size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size); + + float *src = (float *)malloc(size * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, size); + + float *dst1 = (float *)malloc(size * sizeof(float)); + check_mem_alloc(dst1); + init_dst(dst1, size); + + float *src1 = (float *)malloc(size * sizeof(float)); + check_mem_alloc(src1); + init_rows_fp(src1, row, col); + + float *dst2 = (float *)malloc(size * sizeof(float)); + check_mem_alloc(dst2); + init_dst(dst2, size); + + float *src2 = (float *)malloc(size * sizeof(float)); + check_mem_alloc(src2); + init_rows_fp(src2, row, col); + +#ifdef LINX_PMC + PMC_START(); +#endif + + testRow2Nz(dst, src); + testNz2Col(dst1, src1); + testNz2Zn(dst2, src2); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size); + OutArray(dst1, size); + OutArray(dst2, size); + + free(dst); + free(src); + free(dst1); + free(src1); + free(dst2); + free(src2); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TDiv.cpp b/benchmarks/api/tileop/src/TDiv.cpp new file mode 100644 index 0000000..d8950a1 --- /dev/null +++ b/benchmarks/api/tileop/src/TDiv.cpp @@ -0,0 +1,251 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape s1(src1 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1, d2; + TLOAD(d0, s0); + TLOAD(d1, s1); + TDIV(d2, d1, d0); + TSTORE(res, d2); + } + } +} +template +void test_cm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape s1(src1 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1, d2; + TLOAD(d0, s0); + TLOAD(d1, s1); + TDIV(d2, d1, d0); + TSTORE(res, d2); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + // 64*64-16*16 + const uint16_t gm_row = 64; + const uint16_t gm_col = 64; + const uint16_t tile_row = 32; + const uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_rm[gm_size]; + static int64_t dst_cm[gm_size]; + static int64_t src0_rm[gm_size]; + static int64_t src1_rm[gm_size]; + static int64_t src0_cm[gm_size]; + static int64_t src1_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + init_src_uint(src0_rm, gm_size); + init_src_int(src1_rm, gm_size); + init_src_uint(src0_cm, gm_size); + init_src_int(src1_cm, gm_size); + + test_rm(dst_rm, src0_rm, src1_rm); + test_cm(dst_cm, src0_cm, src1_cm); + + return 0; +#else + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + float *src1 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + + __half *src2 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src2); + init_src_fp(src2, gm_size); + __half *src3 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src3); + init_src_fp(src3, gm_size); + // int8 + int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + + int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src4); + init_src_int8(src4, gm_size); + int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src5); + init_src_int8(src5, gm_size); + // int16 + int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + + int16_t *src6 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src6); + init_src_int(src6, gm_size); + int16_t *src7 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src7); + init_src_int(src7, gm_size); + // int32 + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + + int32_t *src8 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src8); + init_src_int(src8, gm_size); + int32_t *src9 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src9); + init_src_int(src9, gm_size); + // int64 + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + + int64_t *src10 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src10); + init_src_int(src10, gm_size); + int64_t *src11 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src11); + init_src_int(src11, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + test_rm(dst, src0, src1); + test_rm(dst1, src2, src3); + test_rm(dst2, src4, src5); + test_rm(dst3, src6, src7); + test_rm(dst4, src8, src9); + test_rm(dst5, src10, src11); + +#ifdef LINX_PMC + PMC_END(); +#endif + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + + free(dst); + free(src0); + free(src1); + free(dst1); + free(src2); + free(src3); + free(dst2); + free(src4); + free(src5); + free(dst3); + free(src6); + free(src7); + free(dst4); + free(src8); + free(src9); + free(dst5); + free(src10); + free(src11); + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TDivs.cpp b/benchmarks/api/tileop/src/TDivs.cpp new file mode 100644 index 0000000..59b5023 --- /dev/null +++ b/benchmarks/api/tileop/src/TDivs.cpp @@ -0,0 +1,215 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TDIVS(d1, d0, s); + TSTORE(res, d1); + } + } +} +template +void test_cm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TDIVS(d1, d0, s); + TSTORE(res, d1); + } + } +} + +int main() { +#ifdef __linx + constexpr uint64_t gm_row = 4; + constexpr uint64_t gm_col = 4; + constexpr uint64_t tile_row = 4; + constexpr uint64_t tile_col = 4; +#else + const uint16_t gm_row = 64; + const uint16_t gm_col = 64; + const uint16_t tile_row = 32; + const uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_rm[gm_size]; + static int64_t dst_cm[gm_size]; + static int64_t src_rm[gm_size]; + static int64_t src_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + init_src_uint(src_rm, gm_size); + init_src_uint(src_cm, gm_size); + + test_rm(dst_rm, src_rm, 2); + test_cm(dst_cm, src_cm, 2); + return 0; +#else + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, gm_size); + + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + __half *src1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + + // int16_t + int16_t *dst2 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + int16_t *src2 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src2); + init_src_int(src2, gm_size); + // int8 + int8_t *dst3 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + int8_t *src3 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src3); + init_src_int(src3, gm_size); + // int32_t + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + int32_t *src4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src4); + init_src_int(src4, gm_size); + // int64_t + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + int64_t *src5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src5); + init_src_int(src5, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + // qemu error: simt instructions do not support writing toscalar registers. + test_rm(dst, src, s_fp32); + test_rm(dst1, src1, s_fp16); + test_rm(dst3, src3, s_i8); + test_rm(dst2, src2, s_i16); + test_rm(dst4, src4, s_i32); + test_rm(dst5, src5, s_i64); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + free(dst); + free(src); + free(dst1); + free(src1); + free(dst2); + free(src2); + free(dst3); + free(src3); + free(dst4); + free(src4); + free(dst5); + free(src5); + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TExp.cpp b/benchmarks/api/tileop/src/TExp.cpp new file mode 100644 index 0000000..2b7e767 --- /dev/null +++ b/benchmarks/api/tileop/src/TExp.cpp @@ -0,0 +1,190 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TEXP(d1, d0); + TSTORE(res, d1); + } + } +} +template +void test_cm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TEXP(d1, d0); + TSTORE(res, d1); + } + } +} + +int main() { +#ifdef __linx + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; + using row_tile = Tile; + using col_tile = + Tile; + + row_tile src_rm, dst_rm; + col_tile src_cm, dst_cm; + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t value = static_cast((i + j) % 6); + size_t row_index = index(i, j); + size_t col_index = index(i, j); + src_rm.data()[row_index] = value; + src_cm.data()[col_index] = value; + dst_rm.data()[row_index] = 0; + dst_cm.data()[col_index] = 0; + } + } + + TEXP(dst_rm, src_rm); + TEXP(dst_cm, src_cm); + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t value = static_cast((i + j) % 6); + int64_t expected = linx_tile_iexp(value); + if (dst_rm.data()[index(i, j)] != expected) { + return 1; + } + if (dst_cm.data()[index(i, j)] != expected) { + return 2; + } + } + } + + return 0; +#else + const uint16_t gm_row = 64; + const uint16_t gm_col = 64; + const uint16_t tile_row = 16; + const uint16_t tile_col = 16; + + size_t gm_size = gm_row * gm_col; + size_t tile_size = tile_row * tile_col; + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + float *src = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, gm_size); + // float16 + __half *dst2 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + __half *src2 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src2); + init_src_fp(src2, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + // TExp只支持float32和16 + test_rm(dst, src); + // half编译通过,运行出错 + // test_rm(dst2,src2); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst2, gm_size); + + + free(dst); + free(src); + free(dst2); + free(src2); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TExpandCol.cpp b/benchmarks/api/tileop/src/TExpandCol.cpp new file mode 100644 index 0000000..4b6137a --- /dev/null +++ b/benchmarks/api/tileop/src/TExpandCol.cpp @@ -0,0 +1,186 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TEXPANDCOL(d1, d0); + TSTORE(res, d1); +} +template void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TEXPANDCOL(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else + const uint16_t row = 32; + const uint16_t col = 32; + + size_t size_in = row * col; + size_t size_out = row * col; + size_t print_out = row; + // float32 + float *dst = (float *)malloc(size_out * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size_out); + + float *src = (float *)malloc(size_in * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, size_in); + // float16 + __half *dst2 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst2); + init_dst(dst2, size_out); + + __half *src2 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src2); + init_src_fp(src2, size_in); + // int16 + int16_t *dst1 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst1); + init_dst(dst1, size_out); + + int16_t *src1 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src1); + init_src_int(src1, size_in); + // int8 + int8_t *dst3 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst3); + init_dst(dst3, size_out); + + int8_t *src3 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src3); + init_src_int(src3, size_in); + // int32 + int32_t *dst4 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, size_out); + + int32_t *src4 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src4); + init_src_int(src4, size_in); + // int64 + int64_t *dst5 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, size_out); + + int64_t *src5 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src5); + init_src_int(src5, size_in); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst, src); + test_rm(dst2, src2); + test_rm(dst1, src1); + test_rm(dst3, src3); + test_rm(dst4, src4); + test_rm(dst5, src5); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, print_out); + OutArray(dst1, print_out); + OutArray(dst2, print_out); + OutArray(dst3, print_out); + OutArray(dst4, print_out); + OutArray(dst5, print_out); + + free(dst); + free(src); + free(dst1); + free(src1); + free(dst2); + free(src2); + free(dst3); + free(src3); + free(dst4); + free(src4); + free(dst5); + free(src5); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TExpandRow.cpp b/benchmarks/api/tileop/src/TExpandRow.cpp new file mode 100644 index 0000000..f448a92 --- /dev/null +++ b/benchmarks/api/tileop/src/TExpandRow.cpp @@ -0,0 +1,190 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + TLOAD(d0, s0); + TEXPANDROW(d1, d0); + TSTORE(res, d1); +} +template +void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + TLOAD(d0, s0); + TEXPANDROW(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else + const uint16_t row = 32; + const uint16_t col = 32; + size_t size_in = col; + size_t size_out = row * col; + + const uint16_t row1 = 64; + const uint16_t col1 = 64; + size_t size_in1 = col1; + size_t size_out1 = row1 * col1; + +//float32 + float *dst = (float *)malloc(size_out1 * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size_out1); + + float *src = (float *)malloc(size_in1 * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, size_in1); + //float16 + __half *dst1 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, size_out); + + __half *src1 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src1); + init_src_fp(src1, size_in); + //int8 + int8_t *dst2 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, size_out); + + int8_t *src2 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src2); + init_src_int(src2, size_in); +//int16 + int16_t *dst3 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, size_out); + + int16_t *src3 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src3); + init_src_int(src3, size_in); + //int32 + int32_t *dst4 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, size_out); + + int32_t *src4 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src4); + init_src_int(src4, size_in); + //int64 + int64_t *dst5 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, size_out); + + int64_t *src5 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src5); + init_src_int(src5, size_in); +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst, src); + test_rm(dst1, src1); + test_rm(dst2, src2); + test_rm(dst3, src3); + test_rm(dst4, src4); + test_rm(dst5, src5); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_out); + OutArray(dst1, size_out); + OutArray(dst2, size_out); + OutArray(dst3, size_out); + OutArray(dst4, size_out); + OutArray(dst5, size_out); + + free(dst); + free(src); + free(dst1); + free(src1); + free(dst2); + free(src2); + free(dst3); + free(src3); + free(dst4); + free(src4); + free(dst5); + free(src5); + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TExpandScalar.cpp b/benchmarks/api/tileop/src/TExpandScalar.cpp new file mode 100644 index 0000000..7b9e347 --- /dev/null +++ b/benchmarks/api/tileop/src/TExpandScalar.cpp @@ -0,0 +1,169 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + gm_shape res(dst); + + tile_shape d0; + TEXPANDSCALAR(d0, s); + TSTORE(res, d0); +} + +template +void test_rm_dynamic(T *dst, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + volatile size_t tile_valid_row = tile_row; + volatile size_t tile_valid_col = tile_col; + + gm_shape res(dst); + tile_shape d0(tile_valid_row, tile_valid_col); + + TEXPANDSCALAR(d0, s); + TSTORE(res, d0); +} + +template +void test_cm(T *dst, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + gm_shape res(dst); + + tile_shape d0; + TEXPANDSCALAR(d0, s); + TSTORE(res, d0); +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 8; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 8; + constexpr uint16_t gm_size = gm_row * gm_col; + + static int64_t dst_rm[gm_size]; + static int64_t dst_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + + test_rm(dst_rm, s_i64); + test_cm(dst_cm, s_i64); + return 0; +#else + const uint16_t gm_row = 16; + const uint16_t gm_col = 32; + const uint16_t tile_row = 16; + const uint16_t tile_col = 32; + + size_t gm_size = gm_row * gm_col; + size_t tile_size = tile_row * tile_col; + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + // int8 + int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + // int16 + int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + // int32 + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + // int64 + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + // int32 dynamic + int32_t *dst6 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst6); + init_dst(dst6, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst, s_fp32); + test_rm(dst1, s_fp16); + test_rm(dst2, s_i8); + test_rm(dst3, s_i16); + test_rm(dst4, s_i32); + test_rm(dst5, s_i64); + test_rm_dynamic(dst6, s_i32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + OutArray(dst6, gm_size); + + free(dst); + free(dst1); + free(dst2); + free(dst3); + free(dst4); + free(dst5); + free(dst6); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TExtract.cpp b/benchmarks/api/tileop/src/TExtract.cpp similarity index 96% rename from test/tileop_api/src/TExtract.cpp rename to benchmarks/api/tileop/src/TExtract.cpp index 8a6e0bf..9fc7c35 100644 --- a/test/tileop_api/src/TExtract.cpp +++ b/benchmarks/api/tileop/src/TExtract.cpp @@ -20,9 +20,9 @@ void test_rm(float *dst, float *src, size_t offset_i, size_t offset_j) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXTRACT(d1, d0, offset_i, offset_j); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template @@ -43,9 +43,9 @@ void test_rm_dynamic(float *dst, float *src, size_t offset_i, size_t offset_j) { tile_shape_in d0(src_valid_row, src_valid_col); tile_shape_out d1(dst_valid_row, dst_valid_col); - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXTRACT(d1, d0, offset_i, offset_j); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template @@ -61,9 +61,9 @@ void test_cm(float *dst, float *src, size_t offset_i, size_t offset_j) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXTRACT(d1, d0, offset_i, offset_j); - TCOPYOUT(res, d1); + TSTORE(res, d1); } int main() { diff --git a/test/tileop_api/src/TFillPad.cpp b/benchmarks/api/tileop/src/TFillPad.cpp similarity index 96% rename from test/tileop_api/src/TFillPad.cpp rename to benchmarks/api/tileop/src/TFillPad.cpp index 0174d9d..8cb8ec4 100644 --- a/test/tileop_api/src/TFillPad.cpp +++ b/benchmarks/api/tileop/src/TFillPad.cpp @@ -20,9 +20,9 @@ void test_rm(int32_t *dst, int32_t *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TFILLPAD(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template @@ -45,9 +45,9 @@ void test_rm_dynamic(int32_t *dst, int32_t *src) { tile_shape_in d0(src_valid_row, src_valid_col); tile_shape_out d1(dst_valid_row, dst_valid_col); - TCOPYIN(d0, s0); + TLOAD(d0, s0); TFILLPAD(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } template @@ -64,9 +64,9 @@ void test_cm(int32_t *dst, int32_t *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TFILLPAD(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } @@ -89,7 +89,7 @@ int main() { int32_t *dst3 = (int32_t *)malloc(size * sizeof(int32_t)); check_mem_alloc(dst3); init_dst(dst3, size); - + int32_t *src = (int32_t *)malloc(size * sizeof(int32_t)); check_mem_alloc(src); init_src_int(src, size); diff --git a/test/tileop_api/src/TGather.cpp b/benchmarks/api/tileop/src/TGather.cpp similarity index 96% rename from test/tileop_api/src/TGather.cpp rename to benchmarks/api/tileop/src/TGather.cpp index 088ad59..cd04d9b 100644 --- a/test/tileop_api/src/TGather.cpp +++ b/benchmarks/api/tileop/src/TGather.cpp @@ -23,10 +23,10 @@ void test_RowMajor(float *dst, float *src, uint16_t *indices) { tile_shape_indices d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TGATHER(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -47,10 +47,10 @@ void test_ColMajor(float *dst, float *src, uint16_t *indices) { tile_shape_indices d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TGATHER(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { diff --git a/benchmarks/api/tileop/src/TLoad.cpp b/benchmarks/api/tileop/src/TLoad.cpp new file mode 100644 index 0000000..e13b0d6 --- /dev/null +++ b/benchmarks/api/tileop/src/TLoad.cpp @@ -0,0 +1,305 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_RowMajor(T *dst, T *src0) { + using shape = Shape<1, 1, 1, tile_row, tile_col>; + using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>; + using gm_shape = GlobalTensor; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_row; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0; + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +template +void test_RowMajor_Dynamic(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + volatile size_t tile_valid_row = tile_row - 2; + volatile size_t tile_valid_col = tile_col - 2; + + volatile size_t gm_valid_row = gm_row; + volatile size_t gm_valid_col = gm_col; + + uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; + uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; + + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + uint16_t remainder_row = gm_row - i * tile_valid_row; + uint16_t remainder_col = gm_col - j * tile_valid_col; + + uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row; + uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col; + + int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col; + gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col); + gm_shape res(dst + offset, gm_valid_row, gm_valid_col); + + tile_shape d0(active_row, active_col); + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +template +void test_ColMajor(T *dst, T *src0) { + using shape = Shape<1, 1, 1, tile_row, tile_col>; + using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>; + using gm_shape = GlobalTensor; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_col; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_row; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0; + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +template +void test_Nz_Dynamic(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = TileLeft; + + volatile size_t tile_valid_row = tile_row - 2; + volatile size_t tile_valid_col = tile_col - 2; + + volatile size_t gm_valid_row = gm_row; + volatile size_t gm_valid_col = gm_col; + + uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; + uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; + + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + uint16_t remainder_row = gm_row - i * tile_valid_row; + uint16_t remainder_col = gm_col - j * tile_valid_col; + + uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row; + uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col; + + int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col; + gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col); + gm_shape res(dst + offset, gm_valid_row, gm_valid_col); + + tile_shape d0(active_row, active_col); + tile_shape d1(active_row, active_col); + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_RowMajor(dst, src); + + return 0; +#else + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src0_f16); + init_src_fp(src0_f16, gm_size); + + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_i8); + init_dst(dst_i8, gm_size); + + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src0_i8); + init_src_int(src0_i8, gm_size); + + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_i16); + init_dst(dst_i16, gm_size); + + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src0_i16); + init_src_int(src0_i16, gm_size); + + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_i32); + init_dst(dst_i32, gm_size); + + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src0_i32); + init_src_int(src0_i32, gm_size); + + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_i64); + init_dst(dst_i64, gm_size); + + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src0_i64); + init_src_int(src0_i64, gm_size); + + int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst1_i32); + init_dst(dst1_i32, gm_size); + + int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src1_i32); + init_src_int(src1_i32, gm_size); + + int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_nz_i32); + init_dst(dst_nz_i32, gm_size); + + int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src_nz_i32); + init_src_int(src_nz_i32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_RowMajor(dst, src0); + + test_RowMajor(dst_f16, src0_f16); + + test_RowMajor(dst_i8, src0_i8); + + test_RowMajor(dst_i16, src0_i16); + + test_RowMajor(dst_i32, src0_i32); + + test_RowMajor(dst_i64, src0_i64); + + test_RowMajor_Dynamic(dst1_i32, src1_i32); + + test_Nz_Dynamic(dst_nz_i32, src_nz_i32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_i8, gm_size); + OutArray(dst_i16, gm_size); + OutArray(dst_i32, gm_size); + OutArray(dst_i64, gm_size); + OutArray(dst1_i32, gm_size); + OutArray(dst_nz_i32, gm_size); + + free(dst); + free(src0); + + free(dst_f16); + free(src0_f16); + + free(dst_i8); + free(src0_i8); + + free(dst_i16); + free(src0_i16); + + free(dst_i32); + free(src0_i32); + + free(dst_i64); + free(src0_i64); + + free(dst1_i32); + free(src1_i32); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TDiv.cpp b/benchmarks/api/tileop/src/TMax.cpp similarity index 71% rename from test/tileop_api/src/TDiv.cpp rename to benchmarks/api/tileop/src/TMax.cpp index f9ecf10..25301d6 100644 --- a/test/tileop_api/src/TDiv.cpp +++ b/benchmarks/api/tileop/src/TMax.cpp @@ -5,6 +5,38 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src0, T *src1) { @@ -21,10 +53,10 @@ void test_rm(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TDIV(d2, d1, d0); - TCOPYOUT(res, d2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TMAX(d2, d1, d0); + TSTORE(res, d2); } } } @@ -44,24 +76,43 @@ void test_cm(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TDIV(d2, d1, d0); - TCOPYOUT(res, d2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TMAX(d2, d1, d0); + TSTORE(res, d2); } } } int main() { - // 64*64-16*16 - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src0[gm_size]; + static int64_t src1[gm_size]; + init_dst(dst, gm_size); + init_src_int(src0, gm_size); + init_src_int(src1, gm_size); + + test_rm(dst, src0, src1); + + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); @@ -91,10 +142,10 @@ int main() { int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src4); - init_src_int8(src4, gm_size); + init_src_int(src4, gm_size); int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src5); - init_src_int8(src5, gm_size); + init_src_int(src5, gm_size); // int16 int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst3); @@ -132,7 +183,8 @@ int main() { #ifdef LINX_PMC PMC_START(); #endif - test_rm(dst, src0, src1); + + test_rm(dst, src0, src1); test_rm(dst1, src2, src3); test_rm(dst2, src4, src5); test_rm(dst3, src6, src7); @@ -142,6 +194,7 @@ int main() { #ifdef LINX_PMC PMC_END(); #endif + printf("Result:\n"); OutArray(dst, gm_size); OutArray(dst1, gm_size); @@ -168,5 +221,7 @@ int main() { free(dst5); free(src10); free(src11); + return 0; -} \ No newline at end of file +#endif +} diff --git a/benchmarks/api/tileop/src/TMaxs.cpp b/benchmarks/api/tileop/src/TMaxs.cpp new file mode 100644 index 0000000..8833ab9 --- /dev/null +++ b/benchmarks/api/tileop/src/TMaxs.cpp @@ -0,0 +1,197 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TMAXS(d1, d0, s); + TSTORE(res, d1); + } + } +} +template +void test_cm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TMAXS(d1, d0, s); + TSTORE(res, d1); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_rm(dst, src, s_i64); + + return 0; +#else + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, gm_size); + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + + __half *src1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + // int8 + int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + + int8_t *src2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src2); + init_src_int(src2, gm_size); + // int16 + int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + + int16_t *src3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src3); + init_src_int(src3, gm_size); + // int32 + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + + int32_t *src4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src4); + init_src_int(src4, gm_size); + // int64 + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + + int64_t *src5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src5); + init_src_int(src5, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst, src, s_fp32); + test_rm(dst1, src1, s_fp16); + test_rm(dst2, src2, s_i8); + test_rm(dst3, src3, s_i16); + test_rm(dst4, src4, s_i32); + test_rm(dst5, src5, s_i64); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + + free(dst); + free(src); + free(dst1); + free(src1); + free(dst2); + free(src2); + free(dst3); + free(src3); + free(dst4); + free(src4); + free(dst5); + free(src5); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TMin.cpp b/benchmarks/api/tileop/src/TMin.cpp similarity index 96% rename from test/tileop_api/src/TMin.cpp rename to benchmarks/api/tileop/src/TMin.cpp index 35d8f0d..6a8f6b3 100644 --- a/test/tileop_api/src/TMin.cpp +++ b/benchmarks/api/tileop/src/TMin.cpp @@ -21,10 +21,10 @@ void test_RowMajor(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMIN(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -45,10 +45,10 @@ void test_ColMajor(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMIN(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -98,7 +98,7 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); diff --git a/test/tileop_api/src/TMins.cpp b/benchmarks/api/tileop/src/TMins.cpp similarity index 96% rename from test/tileop_api/src/TMins.cpp rename to benchmarks/api/tileop/src/TMins.cpp index 1cdbee2..7155098 100644 --- a/test/tileop_api/src/TMins.cpp +++ b/benchmarks/api/tileop/src/TMins.cpp @@ -20,9 +20,9 @@ void test_RowMajor(float *dst, float *src, float s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMINS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } @@ -42,9 +42,9 @@ void test_ColMajor(float *dst, float *src, float s) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TMINS(d1, d0, s); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } diff --git a/benchmarks/api/tileop/src/TMul.cpp b/benchmarks/api/tileop/src/TMul.cpp new file mode 100644 index 0000000..9967351 --- /dev/null +++ b/benchmarks/api/tileop/src/TMul.cpp @@ -0,0 +1,227 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape s1(src1 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1, d2; + TLOAD(d0, s0); + TLOAD(d1, s1); + TMUL(d2, d1, d0); + TSTORE(res, d2); + } + } +} +template +void test_cm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape s1(src1 + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1, d2; + TLOAD(d0, s0); + TLOAD(d1, s1); + TMUL(d2, d1, d0); + TSTORE(res, d2); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src0[gm_size]; + static int64_t src1[gm_size]; + init_dst(dst, gm_size); + init_src_int(src0, gm_size); + init_src_int(src1, gm_size); + + test_rm(dst, src0, src1); + + return 0; +#else + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + float *src1 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + + __half *src2 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src2); + init_src_fp(src2, gm_size); + __half *src3 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src3); + init_src_fp(src3, gm_size); + // int8 + int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + + int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src4); + init_src_int(src4, gm_size); + int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src5); + init_src_int(src5, gm_size); + // int16 + int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + + int16_t *src6 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src6); + init_src_int(src6, gm_size); + int16_t *src7 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src7); + init_src_int(src7, gm_size); + // int32 + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + + int32_t *src8 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src8); + init_src_int(src8, gm_size); + int32_t *src9 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src9); + init_src_int(src9, gm_size); + // int64 + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + + int64_t *src10 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src10); + init_src_int(src10, gm_size); + int64_t *src11 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src11); + init_src_int(src11, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst, src0, src1); + test_rm(dst1, src2, src3); + test_rm(dst2, src4, src5); + test_rm(dst3, src6, src7); + test_rm(dst4, src8, src9); + test_rm(dst5, src10, src11); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + + free(dst); + free(src0); + free(src1); + free(dst1); + free(src2); + free(src3); + free(dst2); + free(src4); + free(src5); + free(dst3); + free(src6); + free(src7); + free(dst4); + free(src8); + free(src9); + free(dst5); + free(src10); + free(src11); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TMuls.cpp b/benchmarks/api/tileop/src/TMuls.cpp new file mode 100644 index 0000000..e451ee0 --- /dev/null +++ b/benchmarks/api/tileop/src/TMuls.cpp @@ -0,0 +1,197 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TMULS(d1, d0, s); + TSTORE(res, d1); + } + } +} +template +void test_cm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src + offset); + gm_shape res(dst + offset); + + tile_shape d0, d1; + TLOAD(d0, s0); + TMULS(d1, d0, s); + TSTORE(res, d1); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_rm(dst, src, s_i64); + + return 0; +#else + // float32 + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src); + init_src_fp(src, gm_size); + // float16 + __half *dst1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst1); + init_dst(dst1, gm_size); + + __half *src1 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src1); + init_src_fp(src1, gm_size); + // int8 + int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst2); + init_dst(dst2, gm_size); + + int8_t *src2 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src2); + init_src_int(src2, gm_size); + // int16 + int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst3); + init_dst(dst3, gm_size); + + int16_t *src3 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src3); + init_src_int(src3, gm_size); + // int32 + int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst4); + init_dst(dst4, gm_size); + + int32_t *src4 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src4); + init_src_int(src4, gm_size); + // int64 + int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst5); + init_dst(dst5, gm_size); + + int64_t *src5 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src5); + init_src_int(src5, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst, src, s_fp32); + test_rm(dst1, src1, s_fp16); + test_rm(dst2, src2, s_i8); + test_rm(dst3, src3, s_i16); + test_rm(dst4, src4, s_i32); + test_rm(dst5, src5, s_i64); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst1, gm_size); + OutArray(dst2, gm_size); + OutArray(dst3, gm_size); + OutArray(dst4, gm_size); + OutArray(dst5, gm_size); + + free(dst); + free(src); + free(dst1); + free(src1); + free(dst2); + free(src2); + free(dst3); + free(src3); + free(dst4); + free(src4); + free(dst5); + free(src5); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TOr.cpp b/benchmarks/api/tileop/src/TOr.cpp similarity index 80% rename from test/tileop_api/src/TOr.cpp rename to benchmarks/api/tileop/src/TOr.cpp index 922bd5c..be0c249 100644 --- a/test/tileop_api/src/TOr.cpp +++ b/benchmarks/api/tileop/src/TOr.cpp @@ -5,12 +5,44 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_RowMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TOR(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } - + template void test_ColMajor(T *dst, T *src0, T *src1) { using gm_shape = global_tensor>; using tile_shape = Tile; - + uint16_t block_row = gm_row / tile_row; uint16_t block_col = gm_col / tile_col; #pragma clang loop unroll(full) @@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) { gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); gm_shape res(dst + offset); - + tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TOR(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 32; - const uint16_t tile_row = 64; - const uint16_t tile_col = 32; +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 32; + constexpr uint16_t tile_row = 64; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; +#ifdef __linx + static int64_t dst_i64[gm_size]; + static int64_t src0_i64[gm_size]; + static int64_t src1_i64[gm_size]; + init_dst(dst_i64, gm_size); + init_src_int(src0_i64, gm_size); + init_src_int(src1_i64, gm_size); + + test_RowMajor(dst_i64, src0_i64, src1_i64); + return 0; +#else float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); init_dst(dst, gm_size); @@ -91,63 +143,63 @@ int main() { __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(dst_f16); init_dst(dst_f16, gm_size); - + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src0_f16); init_src_fp(src0_f16, gm_size); __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half)); check_mem_alloc(src1_f16); init_src_fp(src1_f16, gm_size); - + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8); init_dst(dst_i8, gm_size); int8_t *dst_i8_col = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(dst_i8_col); init_dst(dst_i8_col, gm_size); - + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src0_i8); init_src_int(src0_i8, gm_size); int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); check_mem_alloc(src1_i8); init_src_int(src1_i8, gm_size); - + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16); init_dst(dst_i16, gm_size); int16_t *dst_i16_col = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(dst_i16_col); init_dst(dst_i16_col, gm_size); - + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src0_i16); init_src_int(src0_i16, gm_size); int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); check_mem_alloc(src1_i16); init_src_int(src1_i16, gm_size); - + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32, gm_size); int32_t *dst_i32_col = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(dst_i32); init_dst(dst_i32_col, gm_size); - + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src0_i32); init_src_int(src0_i32, gm_size); int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); check_mem_alloc(src1_i32); init_src_int(src1_i32, gm_size); - + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64); init_dst(dst_i64, gm_size); int64_t *dst_i64_col = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(dst_i64_col); init_dst(dst_i64_col, gm_size); - + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); check_mem_alloc(src0_i64); init_src_int(src0_i64, gm_size); @@ -168,10 +220,10 @@ int main() { test_RowMajor(dst_i16, src0_i16, src1_i16); test_ColMajor(dst_i16_col, src0_i16, src1_i16); - + test_RowMajor(dst_i32, src0_i32, src1_i32); test_ColMajor(dst_i32_col, src0_i32, src1_i32); - + test_RowMajor(dst_i64, src0_i64, src1_i64); test_ColMajor(dst_i64_col, src0_i64, src1_i64); @@ -191,7 +243,7 @@ int main() { OutArray(dst_i32_col, gm_size); OutArray(dst_i64, gm_size); OutArray(dst_i64_col, gm_size); - + free(dst); free(src0); free(src1); @@ -203,26 +255,27 @@ int main() { free(dst_f16); free(src0_f16); free(src1_f16); - + free(dst_i8); free(dst_i8_col); free(src0_i8); free(src1_i8); - + free(dst_i16); free(dst_i16_col); free(src0_i16); free(src1_i16); - + free(dst_i32); free(dst_i32_col); free(src0_i32); free(src1_i32); - + free(dst_i64); free(dst_i64_col); free(src0_i64); free(src1_i64); return 0; -} \ No newline at end of file +#endif +} diff --git a/test/tileop_api/src/TPad.cpp b/benchmarks/api/tileop/src/TPad.cpp similarity index 69% rename from test/tileop_api/src/TPad.cpp rename to benchmarks/api/tileop/src/TPad.cpp index 1ae7fae..87d3df5 100644 --- a/test/tileop_api/src/TPad.cpp +++ b/benchmarks/api/tileop/src/TPad.cpp @@ -5,6 +5,39 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_pad_rm(T *dst, T *src, T pad_value, size_t up_pad, size_t left_pad, size_t down_pad, size_t right_pad) { @@ -19,9 +52,9 @@ void test_pad_rm(T *dst, T *src, T pad_value, size_t up_pad, size_t left_pad, si tile_shape_src src_tensor; tile_shape_dst dst_tensor; - TCOPYIN(src_tensor, s0); + TLOAD(src_tensor, s0); TPAD(dst_tensor, src_tensor, pad_value, up_pad, left_pad, down_pad, right_pad); - TCOPYOUT(res, dst_tensor); + TSTORE(res, dst_tensor); } template ) { if constexpr (std::is_unsigned_v) { @@ -104,6 +137,29 @@ void test_single_type() { } int main() { +#ifdef __linx + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; + constexpr uint16_t valid_row = 2; + constexpr uint16_t valid_col = 2; + constexpr size_t up_pad = 1; + constexpr size_t left_pad = 1; + constexpr size_t down_pad = 1; + constexpr size_t right_pad = 1; + constexpr uint16_t dst_tile_row = valid_row + up_pad + down_pad; + constexpr uint16_t dst_tile_col = valid_col + left_pad + right_pad; + constexpr uint16_t size = tile_row * tile_col; + + static int64_t dst[size]; + static int64_t src[size]; + init_dst(dst, size); + init_src_int(src, size); + + test_pad_rm(dst, src, static_cast(0), + up_pad, left_pad, down_pad, right_pad); + return 0; +#else printf("Results:\n"); // 依次测试各种数据类型可通过, 一起运行测试会有精度错误 // test_single_type(); @@ -112,6 +168,7 @@ int main() { // test_single_type(); // test_single_type<__half>(); test_single_type(); - + return 0; +#endif } diff --git a/test/tileop_api/src/TRSqrt.cpp b/benchmarks/api/tileop/src/TRSqrt.cpp similarity index 96% rename from test/tileop_api/src/TRSqrt.cpp rename to benchmarks/api/tileop/src/TRSqrt.cpp index 8d33fc2..89b4731 100644 --- a/test/tileop_api/src/TRSqrt.cpp +++ b/benchmarks/api/tileop/src/TRSqrt.cpp @@ -20,9 +20,9 @@ void test(float *dst, float *src) { gm_shape res(dst + offset); tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TRSQRT(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } diff --git a/benchmarks/api/tileop/src/TRecip.cpp b/benchmarks/api/tileop/src/TRecip.cpp new file mode 100644 index 0000000..61555ec --- /dev/null +++ b/benchmarks/api/tileop/src/TRecip.cpp @@ -0,0 +1,253 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto s0 = gSIter(i, j); + auto res = gDIter(i, j); + + tile_shape t0, t1; + TLOAD(t0, s0); + TRECIP(t1, t0); + TSTORE(res, t1); + } + } +} + +template +void test_cm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_col; ++i) { + for (int j = 0; j < block_row; ++j) { + auto s0 = gSIter(j, i); + auto res = gDIter(j, i); + + tile_shape t0, t1; + TLOAD(t0, s0); + TRECIP(t1, t0); + TSTORE(res, t1); + } + } +} + +int main() { +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + const size_t gm_row = 32; + const size_t gm_col = 32; + const size_t tile_row = 32; + const size_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)gm_size; + (void)tile_size; + +#ifdef __linx + using row_tile = Tile; + using col_tile = + Tile; + row_tile src_rm, dst_rm; + col_tile src_cm, dst_cm; + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + size_t row_index = index(i, j); + size_t col_index = index(i, j); + src_rm.data()[row_index] = 1; + src_cm.data()[col_index] = 1; + dst_rm.data()[row_index] = 0; + dst_cm.data()[col_index] = 0; + } + } + + TRECIP(dst_rm, src_rm); + TRECIP(dst_cm, src_cm); + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + if (dst_rm.data()[index(i, j)] != 1) { + return 1; + } + if (dst_cm.data()[index(i, j)] != 1) { + return 2; + } + } + } + + return 0; +#else + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, gm_size); + + int8_t *src_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_uint(src_int8, gm_size); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, gm_size); + + int16_t *src_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_uint(src_int16, gm_size); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, gm_size); + + int32_t *src_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_uint(src_int32, gm_size); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, gm_size); + + int64_t *src_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_uint(src_int64, gm_size); + + // __half + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, gm_size); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, gm_size); + + __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8); + test_rm(dst_int16, src_int16); + test_rm(dst_int32, src_int32); + test_rm(dst_int64, src_int64); + test_cm(dst_f16, src_f16); + test_cm(dst_f32, src_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, gm_size); + OutArray(dst_int16, gm_size); + OutArray(dst_int32, gm_size); + OutArray(dst_int64, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_f32, gm_size); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TRem.cpp b/benchmarks/api/tileop/src/TRem.cpp similarity index 70% rename from test/tileop_api/src/TRem.cpp rename to benchmarks/api/tileop/src/TRem.cpp index 3648b1d..fac15b0 100644 --- a/test/tileop_api/src/TRem.cpp +++ b/benchmarks/api/tileop/src/TRem.cpp @@ -5,6 +5,57 @@ #include "../linxStartEnd.hpp" #endif +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + template void test_rm(T *dst, T *src0, T *src1) { @@ -21,10 +72,10 @@ void test_rm(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TREM(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } @@ -44,24 +95,51 @@ void test_cm(T *dst, T *src0, T *src1) { gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TREM(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } int main() { +#ifdef __linx + constexpr uint16_t gm_row = 8; + constexpr uint16_t gm_col = 8; + constexpr uint16_t tile_row = 8; + constexpr uint16_t tile_col = 8; +#else // 64*64-16*16 const uint16_t gm_row = 64; const uint16_t gm_col = 64; const uint16_t tile_row = 32; const uint16_t tile_col = 32; +#endif - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; +#ifdef __linx + static int32_t dst_rm[gm_size]; + static int32_t dst_cm[gm_size]; + static int32_t src0_rm[gm_size]; + static int32_t src1_rm[gm_size]; + static int32_t src0_cm[gm_size]; + static int32_t src1_cm[gm_size]; + init_dst(dst_rm, gm_size); + init_dst(dst_cm, gm_size); + init_src_uint(src0_rm, gm_size); + init_src_int(src1_rm, gm_size); + init_src_uint(src0_cm, gm_size); + init_src_int(src1_cm, gm_size); + + test_rm(dst_rm, src0_rm, src1_rm); + test_cm(dst_cm, src0_cm, src1_cm); + + return 0; +#else // float32 float *dst = (float *)malloc(gm_size * sizeof(float)); check_mem_alloc(dst); @@ -181,4 +259,5 @@ int main() { free(src10); free(src11); return 0; -} \ No newline at end of file +#endif +} diff --git a/benchmarks/api/tileop/src/TReshape.cpp b/benchmarks/api/tileop/src/TReshape.cpp new file mode 100644 index 0000000..f83024e --- /dev/null +++ b/benchmarks/api/tileop/src/TReshape.cpp @@ -0,0 +1,177 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + TLOAD(d0, s0); + TRESHAPE(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 8; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 8; +#else + constexpr size_t gm_row = 64; + constexpr size_t gm_col = 64; + constexpr size_t tile_row = 64; + constexpr size_t tile_col = 64; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_uint(src, gm_size); + + test(dst, src); + + return 0; +#else + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, gm_size); + + int8_t *src_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_uint(src_int8, gm_size); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, gm_size); + + int16_t *src_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_uint(src_int16, gm_size); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, gm_size); + + int32_t *src_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_uint(src_int32, gm_size); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, gm_size); + + int64_t *src_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_uint(src_int64, gm_size); + + // __half + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, gm_size); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, gm_size); + + __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst_int8, src_int8); + test(dst_int16, src_int16); + test(dst_int32, src_int32); + test(dst_int64, src_int64); + test(dst_f16, src_f16); + test(dst_f32, src_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, gm_size); + OutArray(dst_int16, gm_size); + OutArray(dst_int32, gm_size); + OutArray(dst_int64, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_f32, gm_size); + + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TRowMax.cpp b/benchmarks/api/tileop/src/TRowMax.cpp new file mode 100644 index 0000000..1c91357 --- /dev/null +++ b/benchmarks/api/tileop/src/TRowMax.cpp @@ -0,0 +1,203 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAX(d1, d0); + TSTORE(res, d1); +} + +template void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAX(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else + const size_t row = 32; + const size_t col = 32; + + size_t size_in = row * col; + size_t size_out = row * col; + + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, size_out); + + int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_uint(src_int8, size_in); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, size_out); + + int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_uint(src_int16, size_in); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, size_out); + + int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_uint(src_int32, size_in); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, size_out); + + int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_uint(src_int64, size_in); + + // __half + __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, size_out); + + __half *src_f16 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, size_in); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, size_out); + + __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, size_in); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8); + test_rm(dst_int16, src_int16); + test_rm(dst_int32, src_int32); + test_rm(dst_int64, src_int64); + test_cm(dst_f16, src_f16); + test_cm(dst_f32, src_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, size_out); + OutArray(dst_int16, size_out); + OutArray(dst_int32, size_out); + OutArray(dst_int64, size_out); + OutArray(dst_f16, size_out); + OutArray(dst_f32, size_out); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TRowMaxExpand.cpp b/benchmarks/api/tileop/src/TRowMaxExpand.cpp new file mode 100644 index 0000000..669ff1e --- /dev/null +++ b/benchmarks/api/tileop/src/TRowMaxExpand.cpp @@ -0,0 +1,223 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAXEXPAND(d1, d0); + TSTORE(res, d1); +} + +template void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAXEXPAND(d1, d0); + TSTORE(res, d1); +} + +#ifndef __linx +template void test_Nz(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = TileLeft; + using tile_shape_out = TileLeft; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWMAXEXPAND(d1, d0); + TSTORE(res, d1); +} +#endif + +int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else + const size_t row = 32; + const size_t col = 32; + + size_t size_in = row * col; + size_t size_out = row * col; + + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, size_out); + + int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_uint(src_int8, size_in); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, size_out); + + int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_uint(src_int16, size_in); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, size_out); + + int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_uint(src_int32, size_in); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, size_out); + + int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_uint(src_int64, size_in); + + // __half + __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, size_out); + + __half *src_f16 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, size_in); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, size_out); + + __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, size_in); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8); + test_rm(dst_int16, src_int16); + test_rm(dst_int32, src_int32); + test_rm(dst_int64, src_int64); + test_rm(dst_f16, src_f16); + // test_rm(dst_f32, src_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, size_out); + OutArray(dst_int16, size_out); + OutArray(dst_int32, size_out); + OutArray(dst_int64, size_out); + OutArray(dst_f16, size_out); + OutArray(dst_f32, size_out); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/test/tileop_api/src/TRowMax.cpp b/benchmarks/api/tileop/src/TRowSum.cpp similarity index 70% rename from test/tileop_api/src/TRowMax.cpp rename to benchmarks/api/tileop/src/TRowSum.cpp index e0c52e8..f1a6a59 100644 --- a/test/tileop_api/src/TRowMax.cpp +++ b/benchmarks/api/tileop/src/TRowSum.cpp @@ -5,7 +5,40 @@ #include "../linxStartEnd.hpp" #endif -template void test_rm(T *dst, T *src) { +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -18,12 +51,12 @@ template void test_rm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); - TROWMAX(d1, d0); - TCOPYOUT(res, d1); + TLOAD(d0, s0); + TROWSUM(d1, d0); + TSTORE(res, d1); } -template void test_cm(T *dst, T *src) { +template void test_cm(T *dst, T *src) { using gm_shape_in = global_tensor>; using gm_shape_out = global_tensor>; @@ -36,12 +69,30 @@ template void test_cm(T *dst, T *src) { tile_shape_in d0; tile_shape_out d1; - TCOPYIN(d0, s0); - TROWMAX(d1, d0); - TCOPYOUT(res, d1); + TLOAD(d0, s0); + TROWSUM(d1, d0); + TSTORE(res, d1); } int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else const size_t row = 32; const size_t col = 32; @@ -139,4 +190,5 @@ int main() { free(src_f32); return 0; -} \ No newline at end of file +#endif +} diff --git a/benchmarks/api/tileop/src/TRowSumExpand.cpp b/benchmarks/api/tileop/src/TRowSumExpand.cpp new file mode 100644 index 0000000..a5203eb --- /dev/null +++ b/benchmarks/api/tileop/src/TRowSumExpand.cpp @@ -0,0 +1,201 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWSUMEXPAND(d1, d0); + TSTORE(res, d1); +} + +template void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TROWSUMEXPAND(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr uint16_t row = 4; + constexpr uint16_t col = 8; + constexpr uint16_t size = row * col; + + static int64_t dst_rm[size]; + static int64_t dst_cm[size]; + static int64_t src_rm[size]; + static int64_t src_cm[size]; + init_dst(dst_rm, size); + init_dst(dst_cm, size); + init_src_int(src_rm, size); + init_src_int(src_cm, size); + + test_rm(dst_rm, src_rm); + test_cm(dst_cm, src_cm); + return 0; +#else + const size_t row = 32; + const size_t col = 32; + + size_t size_in = row * col; + size_t size_out = row * col; + + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, size_out); + + int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_uint(src_int8, size_in); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, size_out); + + int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_uint(src_int16, size_in); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, size_out); + + int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_uint(src_int32, size_in); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, size_out); + + int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_uint(src_int64, size_in); + + // __half + __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, size_out); + + __half *src_f16 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, size_in); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, size_out); + + __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, size_in); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8); + test_rm(dst_int16, src_int16); + test_rm(dst_int32, src_int32); + test_rm(dst_int64, src_int64); + test_rm(dst_f16, src_f16); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, size_out); + OutArray(dst_int16, size_out); + OutArray(dst_int32, size_out); + OutArray(dst_int64, size_out); + OutArray(dst_f16, size_out); + OutArray(dst_f32, size_out); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + return 0; +#endif +} diff --git a/test/tileop_api/src/TScatter.cpp b/benchmarks/api/tileop/src/TScatter.cpp similarity index 95% rename from test/tileop_api/src/TScatter.cpp rename to benchmarks/api/tileop/src/TScatter.cpp index c8f34e8..bdf4473 100644 --- a/test/tileop_api/src/TScatter.cpp +++ b/benchmarks/api/tileop/src/TScatter.cpp @@ -23,11 +23,11 @@ void test_RowMajor(float *dst, float *src, uint16_t *indices) { tile_shape_indices d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, res); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, res); TSCATTER(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -48,11 +48,11 @@ void test_ColMajor(float *dst, float *src, uint16_t *indices) { tile_shape_indices d1; tile_shape_dst d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, res); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, res); TSCATTER(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } int main() { diff --git a/test/tileop_api/src/TSelect.cpp b/benchmarks/api/tileop/src/TSelect.cpp similarity index 93% rename from test/tileop_api/src/TSelect.cpp rename to benchmarks/api/tileop/src/TSelect.cpp index 7001a8c..7c69c02 100644 --- a/test/tileop_api/src/TSelect.cpp +++ b/benchmarks/api/tileop/src/TSelect.cpp @@ -23,11 +23,11 @@ void test_RowMajor(float *dst, float *src0, float *src1, uint16_t *cond) { tile_shape_uint16 d2; tile_shape_fp32 d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, s2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, s2); TSELECT(d3, d2, d0, d1); - TCOPYOUT(res, d3); + TSTORE(res, d3); } template @@ -48,11 +48,11 @@ void test_ColMajor(float *dst, float *src0, float *src1, uint16_t *cond) { tile_shape_uint16 d2; tile_shape_fp32 d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d2, s2); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d2, s2); TSELECT(d3, d2, d0, d1); - TCOPYOUT(res, d3); + TSTORE(res, d3); } int main() { diff --git a/benchmarks/api/tileop/src/TSqrt.cpp b/benchmarks/api/tileop/src/TSqrt.cpp new file mode 100644 index 0000000..9d4923f --- /dev/null +++ b/benchmarks/api/tileop/src/TSqrt.cpp @@ -0,0 +1,204 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto s0 = gSIter(i, j); + auto res = gDIter(i, j); + + tile_shape t0, t1; + TLOAD(t0, s0); + TSQRT(t1, t0); + TSTORE(res, t1); + } + } +} + +template +void test_cm(T *dst, T *src) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_col; ++i) { + for (int j = 0; j < block_row; ++j) { + auto s0 = gSIter(j, i); + auto res = gDIter(j, i); + + tile_shape t0, t1; + TLOAD(t0, s0); + TSQRT(t1, t0); + TSTORE(res, t1); + } + } +} + +int main() { + +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + const size_t gm_row = 32; + const size_t gm_col = 32; + const size_t tile_row = 16; + const size_t tile_col = 16; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)gm_size; + (void)tile_size; + +#ifdef __linx + using row_tile = Tile; + using col_tile = + Tile; + row_tile src_rm, dst_rm; + col_tile src_cm, dst_cm; + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t expected = static_cast(i * tile_col + j); + size_t row_index = index(i, j); + size_t col_index = index(i, j); + src_rm.data()[row_index] = expected * expected; + src_cm.data()[col_index] = expected * expected; + dst_rm.data()[row_index] = 0; + dst_cm.data()[col_index] = 0; + } + } + + TSQRT(dst_rm, src_rm); + TSQRT(dst_cm, src_cm); + + for (size_t i = 0; i < tile_row; ++i) { + for (size_t j = 0; j < tile_col; ++j) { + int64_t expected = static_cast(i * tile_col + j); + if (dst_rm.data()[index(i, j)] != expected) { + return 1; + } + if (dst_cm.data()[index(i, j)] != expected) { + return 2; + } + } + } + + return 0; +#else + // __half + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src_f16); + init_rows_fp(src_f16, gm_row, gm_col); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, gm_size); + + __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_rows_fp(src_f32, gm_row, gm_col); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_f16, src_f16); + test_cm(dst_f16, src_f16); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_f16, gm_size); + OutArray(dst_f32, gm_size); + + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TStore.cpp b/benchmarks/api/tileop/src/TStore.cpp new file mode 100644 index 0000000..02a2cf9 --- /dev/null +++ b/benchmarks/api/tileop/src/TStore.cpp @@ -0,0 +1,305 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_RowMajor(T *dst, T *src0) { + using shape = Shape<1, 1, 1, tile_row, tile_col>; + using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>; + using gm_shape = GlobalTensor; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_row; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_col; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0; + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +template +void test_RowMajor_Dynamic(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + + volatile size_t tile_valid_row = tile_row - 2; + volatile size_t tile_valid_col = tile_col - 2; + + volatile size_t gm_valid_row = gm_row; + volatile size_t gm_valid_col = gm_col; + + uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; + uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; + + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + uint16_t remainder_row = gm_row - i * tile_valid_row; + uint16_t remainder_col = gm_col - j * tile_valid_col; + + uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row; + uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col; + + int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col; + gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col); + gm_shape res(dst + offset, gm_valid_row, gm_valid_col); + + tile_shape d0(active_row, active_col); + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +template +void test_ColMajor(T *dst, T *src0) { + using shape = Shape<1, 1, 1, tile_row, tile_col>; + using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>; + using gm_shape = GlobalTensor; + using tile_shape = Tile; + + uint16_t block_row = gm_row / tile_row; + uint16_t block_col = gm_col / tile_col; + #pragma clang loop unroll(full) + for (int i = 0; i < block_col; ++i) { + #pragma clang loop unroll(full) + for (int j = 0; j < block_row; ++j) { + int offset = i * (tile_row * gm_col) + j * tile_col; + gm_shape s0(src0 + offset); + gm_shape res(dst + offset); + + tile_shape d0; + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +template +void test_Nz_Dynamic(T *dst, T *src0) { + using gm_shape = global_tensor>; + using tile_shape = TileLeft; + + volatile size_t tile_valid_row = tile_row - 2; + volatile size_t tile_valid_col = tile_col - 2; + + volatile size_t gm_valid_row = gm_row; + volatile size_t gm_valid_col = gm_col; + + uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row; + uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col; + + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + uint16_t remainder_row = gm_row - i * tile_valid_row; + uint16_t remainder_col = gm_col - j * tile_valid_col; + + uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row; + uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col; + + int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col; + gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col); + gm_shape res(dst + offset, gm_valid_row, gm_valid_col); + + tile_shape d0(active_row, active_col); + tile_shape d1(active_row, active_col); + TLOAD(d0, s0); + TSTORE(res, d0); + } + } +} + +int main() { +#ifdef __linx + constexpr uint16_t gm_row = 4; + constexpr uint16_t gm_col = 4; + constexpr uint16_t tile_row = 4; + constexpr uint16_t tile_col = 4; +#else + constexpr uint16_t gm_row = 64; + constexpr uint16_t gm_col = 64; + constexpr uint16_t tile_row = 32; + constexpr uint16_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst[gm_size]; + static int64_t src[gm_size]; + init_dst(dst, gm_size); + init_src_int(src, gm_size); + + test_RowMajor(dst, src); + + return 0; +#else + float *dst = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, gm_size); + + float *src0 = (float *)malloc(gm_size * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, gm_size); + + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src0_f16); + init_src_fp(src0_f16, gm_size); + + int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_i8); + init_dst(dst_i8, gm_size); + + int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src0_i8); + init_src_int(src0_i8, gm_size); + + int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_i16); + init_dst(dst_i16, gm_size); + + int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src0_i16); + init_src_int(src0_i16, gm_size); + + int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_i32); + init_dst(dst_i32, gm_size); + + int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src0_i32); + init_src_int(src0_i32, gm_size); + + int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_i64); + init_dst(dst_i64, gm_size); + + int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src0_i64); + init_src_int(src0_i64, gm_size); + + int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst1_i32); + init_dst(dst1_i32, gm_size); + + int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src1_i32); + init_src_int(src1_i32, gm_size); + + int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_nz_i32); + init_dst(dst_nz_i32, gm_size); + + int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src_nz_i32); + init_src_int(src_nz_i32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_RowMajor(dst, src0); + + test_RowMajor(dst_f16, src0_f16); + + test_RowMajor(dst_i8, src0_i8); + + test_RowMajor(dst_i16, src0_i16); + + test_ColMajor(dst_i32, src0_i32); + + test_ColMajor(dst_i64, src0_i64); + + test_RowMajor_Dynamic(dst1_i32, src1_i32); + + test_Nz_Dynamic(dst_nz_i32, src_nz_i32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_i8, gm_size); + OutArray(dst_i16, gm_size); + OutArray(dst_i32, gm_size); + OutArray(dst_i64, gm_size); + OutArray(dst1_i32, gm_size); + OutArray(dst_nz_i32, gm_size); + + free(dst); + free(src0); + + free(dst_f16); + free(src0_f16); + + free(dst_i8); + free(src0_i8); + + free(dst_i16); + free(src0_i16); + + free(dst_i32); + free(src0_i32); + + free(dst_i64); + free(src0_i64); + + free(dst1_i32); + free(src1_i32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TSub.cpp b/benchmarks/api/tileop/src/TSub.cpp new file mode 100644 index 0000000..8f8480b --- /dev/null +++ b/benchmarks/api/tileop/src/TSub.cpp @@ -0,0 +1,248 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +// C = A - B +template +void test_rm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gAIter(src0); + glb_iterator gBIter(src1); + glb_iterator gCIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto s0 = gAIter(i, j); + auto s1 = gBIter(i, j); + auto res = gCIter(i, j); + + tile_shape t0, t1, t2; + TLOAD(t0, s0); + TLOAD(t1, s1); + TSUB(t2, t1, t0); + TSTORE(res, t2); + } + } +} + +template +void test_cm(T *dst, T *src0, T *src1) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gAIter(src0); + glb_iterator gBIter(src1); + glb_iterator gCIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_col; ++i) { + for (int j = 0; j < block_row; ++j) { + auto s0 = gAIter(j, i); + auto s1 = gBIter(j, i); + auto res = gCIter(j, i); + + tile_shape t0, t1, t2; + TLOAD(t0, s0); + TLOAD(t1, s1); + TSUB(t2, t1, t0); + TSTORE(res, t2); + } + } +} + +int main() { +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + constexpr size_t gm_row = 32; + constexpr size_t gm_col = 32; + constexpr size_t tile_row = 32; + constexpr size_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_int64[gm_size]; + static int64_t src0_int64[gm_size]; + static int64_t src1_int64[gm_size]; + init_dst(dst_int64, gm_size); + init_src_int(src0_int64, gm_size); + init_src_int(src1_int64, gm_size); + + test_rm(dst_int64, src0_int64, + src1_int64); + + return 0; +#else + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, gm_size); + + int8_t *src0_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src0_int8); + init_src_int(src0_int8, gm_size); + int8_t *src1_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src1_int8); + init_src_int(src1_int8, gm_size); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, gm_size); + + int16_t *src0_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src0_int16); + init_src_int(src0_int16, gm_size); + int16_t *src1_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src1_int16); + init_src_int(src1_int16, gm_size); + + // int32_t + int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, gm_size); + + int32_t *src0_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src0_int32); + init_src_int(src0_int32, gm_size); + int32_t *src1_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src1_int32); + init_src_int(src1_int32, gm_size); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, gm_size); + + int64_t *src0_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src0_int64); + init_src_int(src0_int64, gm_size); + int64_t *src1_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src1_int64); + init_src_int(src1_int64, gm_size); + + // __half + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src0_f16); + init_src_fp(src0_f16, gm_size); + __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src1_f16); + init_src_fp(src1_f16, gm_size); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, gm_size); + + __fp32 *src0_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src0_f32); + init_src_fp(src0_f32, gm_size); + __fp32 *src1_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src1_f32); + init_src_fp(src1_f32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src0_int8, + src1_int8); + test_rm(dst_int16, src0_int16, + src1_int16); + test_rm(dst_int32, src0_int32, + src1_int32); + test_rm(dst_int64, src0_int64, + src1_int64); + test_cm(dst_f16, src0_f16, + src1_f16); + test_cm(dst_f32, src0_f32, + src1_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, gm_size); + OutArray(dst_int16, gm_size); + OutArray(dst_int32, gm_size); + OutArray(dst_int64, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_f32, gm_size); + + free(dst_int8); + free(src0_int8); + free(src1_int8); + free(dst_int16); + free(src0_int16); + free(src1_int16); + free(dst_int32); + free(src0_int32); + free(dst_int64); + free(src0_int64); + free(src1_int64); + free(dst_f16); + free(src0_f16); + free(src1_f16); + free(dst_f32); + free(src0_f32); + free(src1_f32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TSubs.cpp b/benchmarks/api/tileop/src/TSubs.cpp new file mode 100644 index 0000000..fd3e234 --- /dev/null +++ b/benchmarks/api/tileop/src/TSubs.cpp @@ -0,0 +1,213 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template +void test_rm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_row; ++i) { + for (int j = 0; j < block_col; ++j) { + auto s0 = gSIter(i, j); + auto res = gDIter(i, j); + + tile_shape t0, t1; + TLOAD(t0, s0); + TSUBS(t1, t0, s); + TSTORE(res, t1); + } + } +} + +template +void test_cm(T *dst, T *src, T s) { + using gm_shape = global_tensor>; + using tile_shape = Tile; + using glb_iterator = global_iterator; + + glb_iterator gSIter(src); + glb_iterator gDIter(dst); + + size_t block_row = gm_row / tile_row; + size_t block_col = gm_col / tile_col; + for (int i = 0; i < block_col; ++i) { + for (int j = 0; j < block_row; ++j) { + auto s0 = gSIter(j, i); + auto res = gDIter(j, i); + + tile_shape t0, t1; + TLOAD(t0, s0); + TSUBS(t1, t0, s); + TSTORE(res, t1); + } + } +} + +int main() { +#ifdef __linx + constexpr size_t gm_row = 4; + constexpr size_t gm_col = 4; + constexpr size_t tile_row = 4; + constexpr size_t tile_col = 4; +#else + constexpr size_t gm_row = 32; + constexpr size_t gm_col = 32; + constexpr size_t tile_row = 32; + constexpr size_t tile_col = 32; +#endif + + constexpr size_t gm_size = gm_row * gm_col; + constexpr size_t tile_size = tile_row * tile_col; + (void)tile_size; + +#ifdef __linx + static int64_t dst_int64[gm_size]; + static int64_t src_int64[gm_size]; + init_dst(dst_int64, gm_size); + init_src_int(src_int64, gm_size); + + test_rm(dst_int64, src_int64, + s_i64); + + return 0; +#else + // int8_t + int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, gm_size); + + int8_t *src_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_int(src_int8, gm_size); + + // int16_t + int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, gm_size); + + int16_t *src_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_int(src_int16, gm_size); + + // int32 + int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, gm_size); + + int32_t *src_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_int(src_int32, gm_size); + + // int64_t + int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, gm_size); + + int64_t *src_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_int(src_int64, gm_size); + + // __half + __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, gm_size); + + __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, gm_size); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, gm_size); + + __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, gm_size); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8, s_i8); + test_rm(dst_int16, src_int16, + s_i16); + test_rm(dst_int32, src_int32, + s_i32); + test_rm(dst_int64, src_int64, + s_i64); + test_cm(dst_f16, src_f16, s_fp16); + test_cm(dst_f32, src_f32, s_fp32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, gm_size); + OutArray(dst_int16, gm_size); + OutArray(dst_int32, gm_size); + OutArray(dst_int64, gm_size); + OutArray(dst_f16, gm_size); + OutArray(dst_f32, gm_size); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/TTrans.cpp b/benchmarks/api/tileop/src/TTrans.cpp new file mode 100644 index 0000000..0d441a3 --- /dev/null +++ b/benchmarks/api/tileop/src/TTrans.cpp @@ -0,0 +1,188 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif + +template void test_rm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TTRANS(d1, d0); + TSTORE(res, d1); +} + +template void test_cm(T *dst, T *src) { + using gm_shape_in = global_tensor>; + using gm_shape_out = global_tensor>; + + using tile_shape_in = Tile; + using tile_shape_out = Tile; + + gm_shape_in s0(src); + gm_shape_out res(dst); + tile_shape_in d0; + tile_shape_out d1; + + TLOAD(d0, s0); + TTRANS(d1, d0); + TSTORE(res, d1); +} + +int main() { +#ifdef __linx + constexpr size_t row = 4; + constexpr size_t col = 4; +#else + constexpr size_t row = 32; + constexpr size_t col = 32; +#endif + + constexpr size_t size_in = row * col; + constexpr size_t size_out = col * row; + +#ifdef __linx + static int64_t dst[size_out]; + static int64_t src[size_in]; + init_dst(dst, size_out); + init_src_int(src, size_in); + + test_rm(dst, src); + + return 0; +#else + // int8 + int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t)); + check_mem_alloc(dst_int8); + init_dst(dst_int8, size_out); + + int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t)); + check_mem_alloc(src_int8); + init_src_int(src_int8, size_in); + + // int16 + int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t)); + check_mem_alloc(dst_int16); + init_dst(dst_int16, size_out); + + int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t)); + check_mem_alloc(src_int16); + init_src_int(src_int16, size_in); + + // int32 + int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t)); + check_mem_alloc(dst_int32); + init_dst(dst_int32, size_out); + + int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t)); + check_mem_alloc(src_int32); + init_src_int(src_int32, size_in); + + // int 64 + int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t)); + check_mem_alloc(dst_int64); + init_dst(dst_int64, size_out); + + int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t)); + check_mem_alloc(src_int64); + init_src_int(src_int64, size_in); + + // __half + __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half)); + check_mem_alloc(dst_f16); + init_dst(dst_f16, size_out); + + __half *src_f16 = (__half *)malloc(size_in * sizeof(__half)); + check_mem_alloc(src_f16); + init_src_fp(src_f16, size_in); + + // __fp32 + __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32)); + check_mem_alloc(dst_f32); + init_dst(dst_f32, size_out); + + __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32)); + check_mem_alloc(src_f32); + init_src_fp(src_f32, size_in); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test_rm(dst_int8, src_int8); + test_rm(dst_int16, src_int16); + test_rm(dst_int32, src_int32); + test_rm(dst_int64, src_int64); + test_cm(dst_f16, src_f16); + test_cm(dst_f32, src_f32); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst_int8, size_out); + OutArray(dst_int16, size_out); + OutArray(dst_int32, size_out); + OutArray(dst_int64, size_out); + OutArray(dst_f16, size_out); + OutArray(dst_f32, size_out); + + free(dst_int8); + free(src_int8); + free(dst_int16); + free(src_int16); + free(dst_int32); + free(src_int32); + free(dst_int64); + free(src_int64); + free(dst_f16); + free(src_f16); + free(dst_f32); + free(src_f32); + + return 0; +#endif +} diff --git a/benchmarks/api/tileop/src/test_MatMacc.cpp b/benchmarks/api/tileop/src/test_MatMacc.cpp new file mode 100644 index 0000000..5316b61 --- /dev/null +++ b/benchmarks/api/tileop/src/test_MatMacc.cpp @@ -0,0 +1,194 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} + +template +void test_linx_row_major(T *dst, T *src0, T *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = Tile; + using tile_shape_B = Tile; + using tile_shape_C = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + MATMACC(d2, d0, d1); + TSTORE(res, d2); +} +#endif + +template +void test(float *dst, float *src0, float *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = TileLeft; + using tile_shape_B = TileRight; + using tile_shape_C = TileAcc; + using tile_shape_O = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + tile_shape_O d3; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + MATMACC(d2, d0, d1); + TCVT(d3, d2); + TSTORE(res, d3); +} + +int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + for (size_t row = 0; row < M; ++row) { + for (size_t k = 0; k < K; ++k) { + src0_i64[row * K + k] = static_cast((row + 1) * (k + 2)); + } + } + for (size_t k = 0; k < K; ++k) { + for (size_t col = 0; col < N; ++col) { + src1_i64[k * N + col] = static_cast((k + 1) + (col + 1)); + } + } + for (size_t i = 0; i < size_C; ++i) { + dst_i64[i] = 0; + } + + test_linx_row_major(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + expected *= 2; + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 16; + const uint16_t K = 8; + const uint16_t N = 32; + + size_t size_A = M * K; + size_t size_B = K * N; + size_t size_C = M * N; + + float *dst = (float *)malloc(size_C * sizeof(float)); + check_mem_alloc(dst); + init_src_fp(dst, size_C); + + float *src0 = (float *)malloc(size_A * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, size_A); + float *src1 = (float *)malloc(size_B * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, size_B); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_C); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/test/tileop_api/src/test_MatMmxac.cpp b/benchmarks/api/tileop/src/test_MatMmxac.cpp similarity index 94% rename from test/tileop_api/src/test_MatMmxac.cpp rename to benchmarks/api/tileop/src/test_MatMmxac.cpp index a972acf..27d3f70 100644 --- a/test/tileop_api/src/test_MatMmxac.cpp +++ b/benchmarks/api/tileop/src/test_MatMmxac.cpp @@ -34,16 +34,16 @@ void test_mx(float *dst, float *src0, float *src0x, float *src1, float *src1x) { tile_shape_C d2; tile_shape_O d3; - TCOPYIN(d0, s0); - TCOPYIN(d0x, s0x); - TCOPYIN(d1, s1); - TCOPYIN(d1x, s1x); + TLOAD(d0, s0); + TLOAD(d0x, s0x); + TLOAD(d1, s1); + TLOAD(d1x, s1x); MATMULMX(d2, d0, d0x, d1, d1x); MATMACCMX(d2, d0, d0x, d1, d1x); TCVT(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } template @@ -70,15 +70,15 @@ void test_mxb(float *dst, float *src0, float *src1, float *src1x) { tile_shape_C d2; tile_shape_O d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d1x, s1x); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d1x, s1x); MATMULMXB(d2, d0, d1, d1x); MATMACCMXB(d2, d0, d1, d1x); TCVT(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } int main() { diff --git a/benchmarks/api/tileop/src/test_MatMul.cpp b/benchmarks/api/tileop/src/test_MatMul.cpp new file mode 100644 index 0000000..bf01255 --- /dev/null +++ b/benchmarks/api/tileop/src/test_MatMul.cpp @@ -0,0 +1,191 @@ +#include "../data.hpp" +#include + +#ifdef LINX_PMC +#include "../linxStartEnd.hpp" +#endif + +#ifdef __linx +int main(); + +extern "C" void *memcpy(void *dst, const void *src, size_t n) { + volatile uint8_t *d = static_cast(dst); + const volatile uint8_t *s = static_cast(src); + for (size_t i = 0; i < n; ++i) { + d[i] = s[i]; + } + return dst; +} + +extern "C" void *memset(void *dst, int value, size_t n) { + volatile uint8_t *d = static_cast(dst); + const uint8_t byte = static_cast(value); + for (size_t i = 0; i < n; ++i) { + d[i] = byte; + } + return dst; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void +_start(void) { + linx_supernpu_exit(static_cast(main())); +} + +template +void test_linx_row_major(T *dst, T *src0, T *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = Tile; + using tile_shape_B = Tile; + using tile_shape_C = Tile; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + TSTORE(res, d2); +} +#endif + +template +void test(float *dst, float *src0, float *src1) { + using gm_shape_A = global_tensor>; + using gm_shape_B = global_tensor>; + using gm_shape_C = global_tensor>; + + using tile_shape_A = TileLeft; + using tile_shape_B = TileRight; + using tile_shape_C = TileAcc; + using tile_shape_O = TileLeft; + + gm_shape_A s0(src0); + gm_shape_B s1(src1); + gm_shape_C res(dst); + + tile_shape_A d0; + tile_shape_B d1; + tile_shape_C d2; + tile_shape_O d3; + + TLOAD(d0, s0); + TLOAD(d1, s1); + MATMUL(d2, d0, d1); + TCVT(d3, d2); + TSTORE(res, d3); +} + +int main() { +#ifdef __linx + constexpr uint16_t M = 4; + constexpr uint16_t K = 4; + constexpr uint16_t N = 4; + constexpr size_t size_A = M * K; + constexpr size_t size_B = K * N; + constexpr size_t size_C = M * N; + + static int64_t dst_i64[size_C]; + static int64_t src0_i64[size_A]; + static int64_t src1_i64[size_B]; + + for (size_t row = 0; row < M; ++row) { + for (size_t k = 0; k < K; ++k) { + src0_i64[row * K + k] = static_cast((row + 1) * (k + 2)); + } + } + for (size_t k = 0; k < K; ++k) { + for (size_t col = 0; col < N; ++col) { + src1_i64[k * N + col] = static_cast((k + 1) + (col + 1)); + } + } + for (size_t i = 0; i < size_C; ++i) { + dst_i64[i] = 0; + } + + test_linx_row_major(dst_i64, src0_i64, src1_i64); + + for (size_t row = 0; row < M; ++row) { + for (size_t col = 0; col < N; ++col) { + int64_t expected = 0; + for (size_t k = 0; k < K; ++k) { + expected += src0_i64[row * K + k] * src1_i64[k * N + col]; + } + if (dst_i64[row * N + col] != expected) { + return 1; + } + } + } + + return 0; +#else + const uint16_t M = 16; + const uint16_t K = 8; + const uint16_t N = 32; + + size_t size_A = M * K; + size_t size_B = K * N; + size_t size_C = M * N; + + float *dst = (float *)malloc(size_C * sizeof(float)); + check_mem_alloc(dst); + init_dst(dst, size_C); + + float *src0 = (float *)malloc(size_A * sizeof(float)); + check_mem_alloc(src0); + init_src_fp(src0, size_A); + float *src1 = (float *)malloc(size_B * sizeof(float)); + check_mem_alloc(src1); + init_src_fp(src1, size_B); + +#ifdef LINX_PMC + PMC_START(); +#endif + + test(dst, src0, src1); + +#ifdef LINX_PMC + PMC_END(); +#endif + + printf("Result:\n"); + OutArray(dst, size_C); + + free(dst); + free(src0); + free(src1); + + return 0; +#endif +} diff --git a/test/tileop_api/src/test_MatMulmx.cpp b/benchmarks/api/tileop/src/test_MatMulmx.cpp similarity index 94% rename from test/tileop_api/src/test_MatMulmx.cpp rename to benchmarks/api/tileop/src/test_MatMulmx.cpp index f017582..60b444e 100644 --- a/test/tileop_api/src/test_MatMulmx.cpp +++ b/benchmarks/api/tileop/src/test_MatMulmx.cpp @@ -34,14 +34,14 @@ void test_mx(float *dst, float *src0, float *src0x, float *src1, float *src1x) { tile_shape_C d2; tile_shape_O d3; - TCOPYIN(d0, s0); - TCOPYIN(d0x, s0x); - TCOPYIN(d1, s1); - TCOPYIN(d1x, s1x); + TLOAD(d0, s0); + TLOAD(d0x, s0x); + TLOAD(d1, s1); + TLOAD(d1x, s1x); MATMULMX(d2, d0, d0x, d1, d1x); TCVT(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } // MATMULMXB: A, B + BX @@ -69,13 +69,13 @@ void test_mxb(float *dst, float *src0, float *src1, float *src1x) { tile_shape_C d2; tile_shape_O d3; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TCOPYIN(d1x, s1x); + TLOAD(d0, s0); + TLOAD(d1, s1); + TLOAD(d1x, s1x); MATMULMXB(d2, d0, d1, d1x); TCVT(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } int main() { diff --git a/test/common/Makefile.common b/benchmarks/common/Makefile.common similarity index 84% rename from test/common/Makefile.common rename to benchmarks/common/Makefile.common index afae221..e745d23 100644 --- a/test/common/Makefile.common +++ b/benchmarks/common/Makefile.common @@ -1,7 +1,9 @@ COMMON_MAKEFILE := $(abspath $(lastword $(MAKEFILE_LIST))) -TEST_ROOT := $(abspath $(dir $(COMMON_MAKEFILE))/..) -ROOT := $(abspath $(TEST_ROOT)/..) -CATEGORY := $(patsubst $(TEST_ROOT)/%,%,$(CURDIR)) +BENCHMARK_ROOT ?= $(abspath $(dir $(COMMON_MAKEFILE))/..) +ROOT ?= $(abspath $(BENCHMARK_ROOT)/..) +CATEGORY_ROOT ?= $(BENCHMARK_ROOT) +TEST_ROOT ?= $(CATEGORY_ROOT) +CATEGORY := $(patsubst $(CATEGORY_ROOT)/%,%,$(CURDIR)) CATEGORY_NAME := $(subst /,_,$(CATEGORY)) OBJ_ROOT := $(abspath $(ROOT)/output) CASE_SRC_DIR := $(CATEGORY)/src @@ -50,12 +52,13 @@ CXX = $(COMPILER_DIR)/clang++ LINK = $(COMPILER_DIR)/clang++ DUMP = $(COMPILER_DIR)/llvm-objdump COPY = $(COMPILER_DIR)/llvm-objcopy -CC_O = -c -mlxbc -fenable-matrix -O2 -mllvm -enable-all-vector-as-tilereg=true +CC_O = -c -target linx64-linx-none-elf -fenable-matrix -O2 +CC_LINK ?= -target linx64-linx-none-elf -nostdlib CC_VER ?= -std=c++20 -# COMM_SRC_FILE += $(ROOT)/test/common/_start.s +# COMM_SRC_FILE += $(BENCHMARK_ROOT)/common/_start.s # COMM_SRC_DIR = $(shell dirname $(COMM_SRC_FILE)) # COMM_OBJ += $(patsubst %.s, %.o, $(subst $(COMM_SRC_DIR), $(OBJ_DIR), $(COMM_SRC_FILE))) -# CC_LINK += -nostartfiles $(ROOT)/test/common/_start.s +# CC_LINK += -nostartfiles $(BENCHMARK_ROOT)/common/_start.s endif ifeq ($(PY_LIB), on) @@ -70,7 +73,7 @@ CC_O += -fPIC CC_LINK += -shared endif -INCLUDE += -I$(ROOT)/include -I$(ROOT)/test/common -I$(ROOT)/test/kernels/src +INCLUDE += -I$(ROOT)/include -I$(ROOT)/kernels -I$(BENCHMARK_ROOT)/common -I$(BENCHMARK_ROOT)/common/src QEMU = /remote/lms60/c00622284/qemu/LinxBlockModel/build/qemu-linx CC_O_ALL = $(CC_O) $(CC_VER) $(CC_OPTS) @@ -111,9 +114,9 @@ $(OBJ_DIR)%.o: $(COMM_SRC_DIR)%.s @mkdir -p $(shell dirname $@) $(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@ -$(TARGET): $(OBJ) $(COMM_OBJ) +$(TARGET): $(OBJ) $(COMM_OBJ) $(EXTRA_OBJ_FILES) @mkdir -p $(shell dirname $@) - $(LINK) $(CC_LINK) $(COMM_OBJ) $(OBJ) -o $@ + $(LINK) $(CC_LINK) $(COMM_OBJ) $(OBJ) $(EXTRA_OBJ_FILES) -o $@ pre_work: @mkdir -p $(OBJ_DIR) diff --git a/test/common/_start.s b/benchmarks/common/_start.s similarity index 100% rename from test/common/_start.s rename to benchmarks/common/_start.s diff --git a/test/common/fileop.h b/benchmarks/common/fileop.h similarity index 100% rename from test/common/fileop.h rename to benchmarks/common/fileop.h diff --git a/test/tileop_api/linxStartEnd.hpp b/benchmarks/common/linxStartEnd.hpp similarity index 100% rename from test/tileop_api/linxStartEnd.hpp rename to benchmarks/common/linxStartEnd.hpp diff --git a/test/common/multi_tile.hpp b/benchmarks/common/multi_tile.hpp similarity index 96% rename from test/common/multi_tile.hpp rename to benchmarks/common/multi_tile.hpp index 6ca6a6c..e7ba914 100644 --- a/test/common/multi_tile.hpp +++ b/benchmarks/common/multi_tile.hpp @@ -129,19 +129,19 @@ void TCAST(tO &o, tA &a) { } template -void TCOPYIN(tile_shape &dst, itfn it) { +void TLOAD(tile_shape &dst, itfn it) { #pragma clang loop unroll(full) for (int i = 0; i < tile_shape::NumTiles; ++i) { auto gm = it(i); - TCOPYIN(dst.Tiles[i], gm); + TLOAD(dst.Tiles[i], gm); } } template -void TCOPYIN(tile_shape &dst, gm_shape &src) { +void TLOAD(tile_shape &dst, gm_shape &src) { #ifdef MULTI_REUSE typename tile_shape::TileType t; - TCOPYIN(t, src); + TLOAD(t, src); #pragma clang loop unroll(full) for (int i = 0; i < tile_shape::NumTiles; ++i) { dst.Tiles[i] = t; @@ -149,17 +149,17 @@ void TCOPYIN(tile_shape &dst, gm_shape &src) { #else #pragma clang loop unroll(full) for (int i = 0; i < tile_shape::NumTiles; ++i) { - TCOPYIN(dst.Tiles[i], src); + TLOAD(dst.Tiles[i], src); } #endif } template -void TCOPYOUT(itfn it, tile_shape &src) { +void TSTORE(itfn it, tile_shape &src) { #pragma clang loop unroll(full) for (int i = 0; i < tile_shape::NumTiles; ++i) { auto gm = it(i); - TCOPYOUT(gm, src.Tiles[i]); + TSTORE(gm, src.Tiles[i]); } } diff --git a/test/common/readBinary.h b/benchmarks/common/readBinary.h similarity index 100% rename from test/common/readBinary.h rename to benchmarks/common/readBinary.h diff --git a/test/common/src/assembler.h b/benchmarks/common/src/assembler.h similarity index 100% rename from test/common/src/assembler.h rename to benchmarks/common/src/assembler.h diff --git a/test/common/src/baremetal_linx.lds.S b/benchmarks/common/src/baremetal_linx.lds.S similarity index 100% rename from test/common/src/baremetal_linx.lds.S rename to benchmarks/common/src/baremetal_linx.lds.S diff --git a/test/common/src/benchmark.h b/benchmarks/common/src/benchmark.h similarity index 100% rename from test/common/src/benchmark.h rename to benchmarks/common/src/benchmark.h diff --git a/test/common/src/benchmark_boot_linx.s b/benchmarks/common/src/benchmark_boot_linx.s similarity index 100% rename from test/common/src/benchmark_boot_linx.s rename to benchmarks/common/src/benchmark_boot_linx.s diff --git a/test/common/src/chip_def.h b/benchmarks/common/src/chip_def.h similarity index 100% rename from test/common/src/chip_def.h rename to benchmarks/common/src/chip_def.h diff --git a/test/common/src/common.h b/benchmarks/common/src/common.h similarity index 100% rename from test/common/src/common.h rename to benchmarks/common/src/common.h diff --git a/test/common/src/ldv5.lds.S b/benchmarks/common/src/ldv5.lds.S similarity index 100% rename from test/common/src/ldv5.lds.S rename to benchmarks/common/src/ldv5.lds.S diff --git a/test/common/src/stackheap_linx.c b/benchmarks/common/src/stackheap_linx.c similarity index 100% rename from test/common/src/stackheap_linx.c rename to benchmarks/common/src/stackheap_linx.c diff --git a/test/common/src/sys-sections.h b/benchmarks/common/src/sys-sections.h similarity index 100% rename from test/common/src/sys-sections.h rename to benchmarks/common/src/sys-sections.h diff --git a/test/common/src/sys_linx.c b/benchmarks/common/src/sys_linx.c similarity index 100% rename from test/common/src/sys_linx.c rename to benchmarks/common/src/sys_linx.c diff --git a/test/common/template_asm.h b/benchmarks/common/template_asm.h similarity index 100% rename from test/common/template_asm.h rename to benchmarks/common/template_asm.h diff --git a/test/common/tensorwrite.hpp b/benchmarks/common/tensorwrite.hpp similarity index 100% rename from test/common/tensorwrite.hpp rename to benchmarks/common/tensorwrite.hpp diff --git a/test/common/writeBinary.h b/benchmarks/common/writeBinary.h similarity index 100% rename from test/common/writeBinary.h rename to benchmarks/common/writeBinary.h diff --git a/test/kernel/orther/Makefile b/benchmarks/kernels/composite/Makefile similarity index 98% rename from test/kernel/orther/Makefile rename to benchmarks/kernels/composite/Makefile index ec6acc1..3dc0454 100644 --- a/test/kernel/orther/Makefile +++ b/benchmarks/kernels/composite/Makefile @@ -66,6 +66,7 @@ endif CP = cp DEST_DIR = ~/elf_subset/subset_matmul_reuse/ +INCLUDE += -I$(ROOT)/kernels/other SRC_FILE += $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp include ../../common/Makefile.common diff --git a/test/kernel/orther/compile_flash_attention.all b/benchmarks/kernels/composite/compile_flash_attention.all similarity index 100% rename from test/kernel/orther/compile_flash_attention.all rename to benchmarks/kernels/composite/compile_flash_attention.all diff --git a/test/kernel/orther/compile_gemm.all b/benchmarks/kernels/composite/compile_gemm.all similarity index 100% rename from test/kernel/orther/compile_gemm.all rename to benchmarks/kernels/composite/compile_gemm.all diff --git a/test/kernel/orther/compile_linear.all b/benchmarks/kernels/composite/compile_linear.all similarity index 100% rename from test/kernel/orther/compile_linear.all rename to benchmarks/kernels/composite/compile_linear.all diff --git a/test/kernel/orther/compile_matmul.all b/benchmarks/kernels/composite/compile_matmul.all similarity index 100% rename from test/kernel/orther/compile_matmul.all rename to benchmarks/kernels/composite/compile_matmul.all diff --git a/test/kernel/orther/compile_norm.all b/benchmarks/kernels/composite/compile_norm.all similarity index 100% rename from test/kernel/orther/compile_norm.all rename to benchmarks/kernels/composite/compile_norm.all diff --git a/test/kernel/orther/compile_softmax.all b/benchmarks/kernels/composite/compile_softmax.all similarity index 100% rename from test/kernel/orther/compile_softmax.all rename to benchmarks/kernels/composite/compile_softmax.all diff --git a/benchmarks/kernels/composite/npu_compile.sh b/benchmarks/kernels/composite/npu_compile.sh new file mode 100755 index 0000000..08750da --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile.sh @@ -0,0 +1,12 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + +"$SCRIPT_DIR/npu_compile/compile_matmul.all" +"$SCRIPT_DIR/npu_compile/compile_matmul_reuseA.all" +"$SCRIPT_DIR/npu_compile/compile_matmul_reuseB.all" +"$SCRIPT_DIR/npu_compile/compile_matmul_reuseAB.all" + +"$SCRIPT_DIR/npu_compile/compile_matmul_dynamic.all" +"$SCRIPT_DIR/npu_compile/compile_matmul_dynamic_reuseA.all" +"$SCRIPT_DIR/npu_compile/compile_matmul_dynamic_reuseB.all" diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul.all b/benchmarks/kernels/composite/npu_compile/compile_matmul.all new file mode 100755 index 0000000..0539888 --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all new file mode 100755 index 0000000..1ae6f6e --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all new file mode 100644 index 0000000..39b28e0 --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all new file mode 100755 index 0000000..bb0bdbf --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all new file mode 100755 index 0000000..5355f6f --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all new file mode 100755 index 0000000..1bfd87f --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all new file mode 100755 index 0000000..05ee361 --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all new file mode 100755 index 0000000..f18b5d6 --- /dev/null +++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all @@ -0,0 +1,113 @@ +#! /bin/bash + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 + + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=256 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=256 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=256 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=256 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=256 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=256 tK=64 tN=64 + +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=64 +make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/src/FA.py b/benchmarks/kernels/composite/src/FA.py similarity index 100% rename from test/kernel/orther/src/FA.py rename to benchmarks/kernels/composite/src/FA.py diff --git a/test/kernel/orther/src/flash_attention.cpp b/benchmarks/kernels/composite/src/flash_attention.cpp similarity index 100% rename from test/kernel/orther/src/flash_attention.cpp rename to benchmarks/kernels/composite/src/flash_attention.cpp diff --git a/test/kernel/orther/src/flash_attention_mask.cpp b/benchmarks/kernels/composite/src/flash_attention_mask.cpp similarity index 100% rename from test/kernel/orther/src/flash_attention_mask.cpp rename to benchmarks/kernels/composite/src/flash_attention_mask.cpp diff --git a/test/kernel/orther/src/gemm.cpp b/benchmarks/kernels/composite/src/gemm.cpp similarity index 100% rename from test/kernel/orther/src/gemm.cpp rename to benchmarks/kernels/composite/src/gemm.cpp diff --git a/test/kernel/orther/src/linear.cpp b/benchmarks/kernels/composite/src/linear.cpp similarity index 100% rename from test/kernel/orther/src/linear.cpp rename to benchmarks/kernels/composite/src/linear.cpp diff --git a/test/kernel/orther/src/matmul.cpp b/benchmarks/kernels/composite/src/matmul.cpp similarity index 100% rename from test/kernel/orther/src/matmul.cpp rename to benchmarks/kernels/composite/src/matmul.cpp diff --git a/test/kernel/orther/src/normalization.cpp b/benchmarks/kernels/composite/src/normalization.cpp similarity index 100% rename from test/kernel/orther/src/normalization.cpp rename to benchmarks/kernels/composite/src/normalization.cpp diff --git a/test/kernel/orther/src/onlinesoftmax.cpp b/benchmarks/kernels/composite/src/onlinesoftmax.cpp similarity index 83% rename from test/kernel/orther/src/onlinesoftmax.cpp rename to benchmarks/kernels/composite/src/onlinesoftmax.cpp index edee11a..20af0fb 100644 --- a/test/kernel/orther/src/onlinesoftmax.cpp +++ b/benchmarks/kernels/composite/src/onlinesoftmax.cpp @@ -3,7 +3,7 @@ #include "softmax.hpp" #include "benchmark.h" -#ifndef globM +#ifndef globM #define globM 120 #endif @@ -36,14 +36,14 @@ void onlinesoftmax_test(dtype* dst, dtype* src){ for(int i=0;i diff --git a/test/accelerator/vec_simt/hashfind/compute_offsets.py b/benchmarks/kernels/control/hashtable_lookup_simd/compute_offsets.py similarity index 100% rename from test/accelerator/vec_simt/hashfind/compute_offsets.py rename to benchmarks/kernels/control/hashtable_lookup_simd/compute_offsets.py diff --git a/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/.gitignore b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/.gitignore new file mode 100644 index 0000000..dbf14ab --- /dev/null +++ b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/.gitignore @@ -0,0 +1,3 @@ +*.s +*.o +*.data diff --git a/test/kernel/control/hashtable_lookup_simd/data_obj/build_data_obj.sh b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/build_data_obj.sh similarity index 51% rename from test/kernel/control/hashtable_lookup_simd/data_obj/build_data_obj.sh rename to benchmarks/kernels/control/hashtable_lookup_simd/data_obj/build_data_obj.sh index 2668212..132c16b 100755 --- a/test/kernel/control/hashtable_lookup_simd/data_obj/build_data_obj.sh +++ b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/build_data_obj.sh @@ -1,10 +1,21 @@ #!/bin/bash -COMPILER_DIR="${COMPILER_DIR:-/remote/lms60/c00622284/janus/linxisa_compiler_v0.55/linx_blockisa_llvm_musl/bin}" -DATA_OBJ_DIR="$1" -OUTPUT_DIR="$2" +set -euo pipefail + +COMPILER_DIR="${COMPILER_DIR:-/usr/bin}" +LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}" +DATA_OBJ_DIR="${1:?data object directory required}" +OUTPUT_DIR="${2:?output directory required}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" mkdir -p "$OUTPUT_DIR" +if [[ ! -f "${DATA_OBJ_DIR}/inserted_slot.data" || + ! -f "${DATA_OBJ_DIR}/lookup_keys.data" || + ! -f "${DATA_OBJ_DIR}/lookup_values.data" ]]; then + (cd "$CASE_DIR" && python3 gen_data_simple.py) +fi + build_one() { local name="$1" local data_file="${DATA_OBJ_DIR}/${name}.data" @@ -26,11 +37,11 @@ _binary_${name}_data_end: .equ _binary_${name}_data_size, .-_binary_${name}_data_start EOF - $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file" + "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file" } build_one "inserted_slot" build_one "lookup_keys" build_one "lookup_values" -echo "Done building data object files" \ No newline at end of file +echo "Done building data object files" diff --git a/test/kernel/control/hashtable_lookup_simd/data_obj/probe_statistics.md b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/probe_statistics.md similarity index 100% rename from test/kernel/control/hashtable_lookup_simd/data_obj/probe_statistics.md rename to benchmarks/kernels/control/hashtable_lookup_simd/data_obj/probe_statistics.md diff --git a/test/kernel/control/hashtable_lookup_simd/gen_data_simple.py b/benchmarks/kernels/control/hashtable_lookup_simd/gen_data_simple.py similarity index 90% rename from test/kernel/control/hashtable_lookup_simd/gen_data_simple.py rename to benchmarks/kernels/control/hashtable_lookup_simd/gen_data_simple.py index e1405b0..540b887 100644 --- a/test/kernel/control/hashtable_lookup_simd/gen_data_simple.py +++ b/benchmarks/kernels/control/hashtable_lookup_simd/gen_data_simple.py @@ -153,29 +153,29 @@ def u64_to_i64(u): return u - (1 << 64) return u - # Write simple_inserted_slot.data (hashtable) + # Write inserted_slot.data (hashtable) output_dir = "data_obj" - with open(f"{output_dir}/simple_inserted_slot.data", "wb") as f: + with open(f"{output_dir}/inserted_slot.data", "wb") as f: for key, value, padding in table: # Pack as: key(int64), value(int32), padding(int32) f.write(struct.pack(" #include "benchmark.h" +#ifndef __linx #include "fileop.h" +#endif #include "template_asm.h" +#ifndef __linx #include +#endif // ============================================================================ // Tile operation implementations @@ -453,7 +457,7 @@ void linearProbing(typename HashFindTypes::TileI32& outTil // Compare, advance, and check if all lanes found probe_step(queryKeyTile, tableKeyTile, tableValueTile, probeIdxTile, outTile, countTile, kCap, kNotFound); - TCOPYOUT(countGT, countTile); + TSTORE(countGT, countTile); bool all_done = true; for (int g = 0; g < kNumGroups; g++) { @@ -492,7 +496,7 @@ void runHashFind(int32_t __out__ *out, // copy in KeyGT key_gt(queries); - TCOPYIN(queryKeyTile, key_gt); + TLOAD(queryKeyTile, key_gt); // compute hash (writes int64_t byte offsets into probeIdxTile) compute_hash_vec(queryKeyTile, probeIdxTile, kCap); @@ -504,7 +508,7 @@ void runHashFind(int32_t __out__ *out, // copy out OutGT outGlobal(out); - TCOPYOUT(outGlobal, outTile); + TSTORE(outGlobal, outTile); } template @@ -551,6 +555,7 @@ int main() { } } +#ifndef __linx printf("=== hashtable_lookup_simd ===\n"); printf("Match: %d/%d (%.4f%%)\n", match, kNum, 100.0 * double(match) / double(kNum)); @@ -567,9 +572,6 @@ int main() { } } fflush(stdout); -#endif - -#ifndef FOR_GFSIM int ret = (match == kNum) ? 0 : 1; if (!ret) { printf("PASS\n"); @@ -577,8 +579,9 @@ int main() { printf("FAIL\n"); } fflush(stdout); +#endif return ret; #else return 0; #endif -} \ No newline at end of file +} diff --git a/test/kernel/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md b/benchmarks/kernels/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md similarity index 64% rename from test/kernel/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md rename to benchmarks/kernels/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md index 72fdd73..9d4d32c 100644 --- a/test/kernel/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md +++ b/benchmarks/kernels/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md @@ -9,7 +9,7 @@ ``` {WORKSPACE}/ LinxBlockModel/ ← QEMU(LinxBlockModel) - JanusCoreBench/ ← Benchmark + SuperNPUBench/ ← Benchmark ``` 在以下命令中,将 `{WORKSPACE}` 替换为实际路径,例如: @@ -50,7 +50,7 @@ ninja -j 32 `compile.all` 本质是对 Makefile 的多次调用: ```bash -cd {WORKSPACE}/JanusCoreBench/test/kernel/control +cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control # hashtable_lookup_simt make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum409600 EXTRA_DEFINES="-DkNum=409600" @@ -73,9 +73,9 @@ make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum256 EXTRA_DEFINES="-DkNum=256" ```bash # 自动创建,Makefile 会执行: -mkdir -p output/kernel/control/src/ -mkdir -p output/kernel/control/elf/ -mkdir -p output/kernel/control/hashtable_lookup_simd/data_obj/ +mkdir -p output/kernels/control/src/ +mkdir -p output/kernels/control/elf/ +mkdir -p output/kernels/control/hashtable_lookup_simd/data_obj/ ``` --- @@ -83,8 +83,8 @@ mkdir -p output/kernel/control/hashtable_lookup_simd/data_obj/ #### Step 2 — 把数据 .data 文件转为 .o(hashtable_lookup_simd 专用) ```bash -cd {WORKSPACE}/JanusCoreBench/test/kernel/control/hashtable_lookup_simd -COMPILER_DIR= bash data_obj/build_data_obj.sh data_obj ../../../output/kernel/control/hashtable_lookup_simd/data_obj +cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control/hashtable_lookup_simd +COMPILER_DIR= bash data_obj/build_data_obj.sh data_obj ../../../output/kernels/control/hashtable_lookup_simd/data_obj ``` `build_data_obj.sh` 遍历 `data_obj/` 下所有 `.data` 文件,调用: @@ -96,9 +96,9 @@ $COMPILER_DIR/clang++ -target linx64v5 -c *.s -o output/.../*.o 生成的 `.o` 文件: ``` -output/kernel/control/hashtable_lookup_simd/data_obj/inserted_slot.o # hash 表 -output/kernel/control/hashtable_lookup_simd/data_obj/lookup_keys.o # 查询 key -output/kernel/control/hashtable_lookup_simd/data_obj/lookup_values.o # 期望 value +output/kernels/control/hashtable_lookup_simd/data_obj/inserted_slot.o # hash 表 +output/kernels/control/hashtable_lookup_simd/data_obj/lookup_keys.o # 查询 key +output/kernels/control/hashtable_lookup_simd/data_obj/lookup_values.o # 期望 value ``` --- @@ -107,15 +107,15 @@ output/kernel/control/hashtable_lookup_simd/data_obj/lookup_values.o # 期望 ```bash # 实际执行(make 自动推导): -cd {WORKSPACE}/JanusCoreBench/test/kernel/control +cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control $COMPILER_DIR/clang++ \ -c -mlxbc -fenable-matrix -O2 \ -std=c++20 \ - -I{WORKSPACE}/JanusCoreBench/include \ - -I{WORKSPACE}/JanusCoreBench/test/common \ + -I{WORKSPACE}/SuperNPUBench/include \ + -I{WORKSPACE}/SuperNPUBench/benchmarks/common \ -D__linx -DENABLE_TENSOR_INSTR \ hashtable_lookup_simd/hashtable_lookup_simd.cpp \ - -o output/kernel/control/src/hashtable_lookup_simd.o + -o output/kernels/control/src/hashtable_lookup_simd.o ``` --- @@ -125,11 +125,11 @@ $COMPILER_DIR/clang++ \ ```bash $COMPILER_DIR/clang++ \ -nostartfiles \ - output/kernel/control/src/hashtable_lookup_simd.o \ - output/kernel/control/hashtable_lookup_simd/data_obj/inserted_slot.o \ - output/kernel/control/hashtable_lookup_simd/data_obj/lookup_keys.o \ - output/kernel/control/hashtable_lookup_simd/data_obj/lookup_values.o \ - -o output/kernel/control/elf/kernel_control_hashtable_lookup_simd_kNum409600.elf + output/kernels/control/src/hashtable_lookup_simd.o \ + output/kernels/control/hashtable_lookup_simd/data_obj/inserted_slot.o \ + output/kernels/control/hashtable_lookup_simd/data_obj/lookup_keys.o \ + output/kernels/control/hashtable_lookup_simd/data_obj/lookup_values.o \ + -o output/kernels/control/elf/kernels_control_hashtable_lookup_simd_kNum409600.elf ``` --- @@ -138,8 +138,8 @@ $COMPILER_DIR/clang++ \ | ELF 名称 | 编译命令 | |----------|----------| -| `kernel_control_hashtable_lookup_simd_kNum409600.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum409600 EXTRA_DEFINES="-DkNum=409600"` | -| `kernel_control_hashtable_lookup_simd_kNum256.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum256 EXTRA_DEFINES="-DkNum=256"` | +| `kernels_control_hashtable_lookup_simd_kNum409600.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum409600 EXTRA_DEFINES="-DkNum=409600"` | +| `kernels_control_hashtable_lookup_simd_kNum256.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum256 EXTRA_DEFINES="-DkNum=256"` | --- @@ -148,8 +148,8 @@ $COMPILER_DIR/clang++ \ ```bash cd {WORKSPACE}/LinxBlockModel ./build/qemu-linx \ - ../JanusCoreBench/output/kernel/control/elf/kernel_control_hashtable_lookup_simd_kNum256.elf \ - 2>&1 | tee ../JanusCoreBench/test/kernel/control/hashtable_lookup_simd/run.log + ../SuperNPUBench/output/kernels/control/elf/kernels_control_hashtable_lookup_simd_kNum256.elf \ + 2>&1 | tee ../SuperNPUBench/benchmarks/kernels/control/hashtable_lookup_simd/run.log ``` **说明:** @@ -165,7 +165,7 @@ cd {WORKSPACE}/LinxBlockModel ```bash export COMPILER_DIR=/remote/lms01/j00827727/jcore/compilers/linx_blockisa_llvm_musl0.56.18/bin -cd {WORKSPACE}/JanusCoreBench/test/kernel/control +cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control ./compile.all ``` @@ -173,7 +173,7 @@ cd {WORKSPACE}/JanusCoreBench/test/kernel/control ## 文件说明 -路径前缀约定:`{WORKSPACE}/JanusCoreBench/test/kernel/control/hashtable_lookup_simd/` +路径前缀约定:`{WORKSPACE}/SuperNPUBench/benchmarks/kernels/control/hashtable_lookup_simd/` | 文件 | 说明 | |------|------| diff --git a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp similarity index 59% rename from test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp rename to benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp index 72a7bf1..5ff6dc7 100644 --- a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp +++ b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp @@ -1,5 +1,170 @@ +#if defined(__linx) && defined(FOR_GFSIM) && (defined(LINX_HT_DIRECT) || defined(LINX_HASHTABLE_DIRECT_SMOKE)) +typedef unsigned char uint8_t; +typedef unsigned int uint32_t; +typedef int int32_t; +typedef long long int64_t; + +#ifndef kNum +#define kNum 1024 +#endif + +#ifndef MAX_PROBE +#define MAX_PROBE 512 +#endif + +#ifndef LINX_HT_CAPACITY +#ifdef LINX_HASH_CAPACITY +#define LINX_HT_CAPACITY LINX_HASH_CAPACITY +#else +#define LINX_HT_CAPACITY 2048u +#endif +#endif + +#ifndef LINX_HT_SCAN +#ifdef LINX_HASH_LINEAR_SCAN +#define LINX_HT_SCAN LINX_HASH_LINEAR_SCAN +#else +#define LINX_HT_SCAN 0 +#endif +#endif + +struct TableEntry { + int64_t key; + int32_t value; + int32_t padding; +}; + +extern "C" { + extern const uint8_t _binary_inserted_slot_data_start[]; + extern const uint8_t _binary_lookup_keys_data_start[]; + extern const uint8_t _binary_lookup_values_data_start[]; +} + +static uint32_t rotl32(uint32_t value, uint32_t shift) { + return (value << shift) | (value >> (32u - shift)); +} + +static uint32_t murmurhash3_i64(int64_t key) { + const uint32_t c1_local = 0xcc9e2d51u; + const uint32_t c2_local = 0x1b873593u; + const uint32_t c3_local = 0xe6546b64u; + unsigned long long bits = (unsigned long long)key; + uint32_t h = 0u; + uint32_t block = (uint32_t)bits; + + block *= c1_local; + block = rotl32(block, 15u); + block *= c2_local; + h ^= block; + h = rotl32(h, 13u); + h = h * 5u + c3_local; + + block = (uint32_t)(bits >> 32); + block *= c1_local; + block = rotl32(block, 15u); + block *= c2_local; + h ^= block; + h = rotl32(h, 13u); + h = h * 5u + c3_local; + + h ^= 8u; + h ^= h >> 16u; + h *= 0x85ebca6bu; + h ^= h >> 13u; + h *= 0xc2b2ae35u; + h ^= h >> 16u; + return h; +} + +static uint32_t first_slot(uint32_t hash) { +#if (LINX_HT_CAPACITY & (LINX_HT_CAPACITY - 1u)) == 0 + return hash & (LINX_HT_CAPACITY - 1u); +#else + return hash % LINX_HT_CAPACITY; +#endif +} + +int main() { + const TableEntry* table = + (const TableEntry*)_binary_inserted_slot_data_start; + const int64_t* keys = + (const int64_t*)_binary_lookup_keys_data_start; + const int32_t* expected = + (const int32_t*)_binary_lookup_values_data_start; + + int32_t mismatches = 0; + for (int32_t i = 0; i < kNum; ++i) { +#if LINX_HT_SCAN + int32_t found = -1; + for (uint32_t slot = 0; slot < LINX_HT_CAPACITY; ++slot) { + const TableEntry* entry = table + slot; + if (entry->key == keys[i]) { + found = entry->value; + break; + } + } +#else + uint32_t slot = first_slot(murmurhash3_i64(keys[i])); + int32_t found = -1; + for (int32_t probe = 0; probe < MAX_PROBE; ++probe) { + const TableEntry* entry = table + slot; + if (entry->key == keys[i]) { + found = entry->value; + break; + } + if ((unsigned long long)entry->key == 0x8000000000000000ull) { + break; + } + ++slot; + if (slot == LINX_HT_CAPACITY) { + slot = 0; + } + } +#endif + if (found != expected[i]) { + ++mismatches; + } + } + return mismatches == 0 ? 0 : 1; +} + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit((uint32_t)main()); +} + +#else + #include #include "benchmark.h" +#include "template_asm.h" +#ifndef __linx +#include +#endif // ============================================================================ // ELF Data layout — embedded binary data produced by build_data_obj.sh @@ -153,6 +318,7 @@ int main() { } } +#ifndef __linx if (mismatch_count > 0) { printf("\n=== Mismatching keys (%d total) ===\n", mismatch_count); printf("%7s %22s %10s %10s\n", "Idx", "Key", "Got", "Expected"); @@ -167,8 +333,11 @@ int main() { printf("\n=== hashtable_lookup_simt ===\n"); printf("Match: %d/%d (%d %%)\n", match, kNum, int(100 * double(match) / double(kNum))); fflush(stdout); +#endif return (match == kNum) ? 0 : 1; #else return 0; #endif } + +#endif diff --git a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp similarity index 97% rename from test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp rename to benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp index d10bbba..cd29221 100644 --- a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp +++ b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp @@ -1,5 +1,9 @@ #include #include "benchmark.h" +#include "template_asm.h" +#ifndef __linx +#include +#endif // ============================================================================ // ELF Data layout — embedded binary data produced by build_data_obj.sh @@ -141,7 +145,7 @@ int main() { kEntrySize, kMaxProbe); BENCHEND; -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) // Print SIMT kernel computed hash values for first 64 keys printf("\n=== SIMT kernel hash values (first 64 keys) ===\n"); printf("%4s %22s %10s %7s %10s %10s\n", "Idx", "Key", "Hash(hex)", "Slot", "SIMT_out", "Expected"); @@ -164,7 +168,7 @@ int main() { } } -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) if (mismatch_count > 0) { printf("\n=== Mismatching keys (%d total) ===\n", mismatch_count); printf("%7s %22s %10s %7s %10s %10s\n", "Idx", "Key", "Hash(hex)", "Slot", "Got", "Expected"); diff --git a/benchmarks/kernels/control/hkv/data_obj/.gitignore b/benchmarks/kernels/control/hkv/data_obj/.gitignore new file mode 100644 index 0000000..75f1c9f --- /dev/null +++ b/benchmarks/kernels/control/hkv/data_obj/.gitignore @@ -0,0 +1,3 @@ +*.s +*.o +*.bin diff --git a/test/kernel/control/hkv/data_obj/build_data_obj.sh b/benchmarks/kernels/control/hkv/data_obj/build_data_obj.sh similarity index 56% rename from test/kernel/control/hkv/data_obj/build_data_obj.sh rename to benchmarks/kernels/control/hkv/data_obj/build_data_obj.sh index ec4819c..7771bdb 100755 --- a/test/kernel/control/hkv/data_obj/build_data_obj.sh +++ b/benchmarks/kernels/control/hkv/data_obj/build_data_obj.sh @@ -1,10 +1,23 @@ #!/bin/bash -COMPILER_DIR="${COMPILER_DIR:-/remote/lms01/j00827727/jcore/compilers/linx_blockisa_llvm_musl0.56.16/bin}" -DATA_OBJ_DIR="$1" -OUTPUT_DIR="$2" +set -euo pipefail + +COMPILER_DIR="${COMPILER_DIR:-/usr/bin}" +LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}" +DATA_OBJ_DIR="${1:?data object directory required}" +OUTPUT_DIR="${2:?output directory required}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" mkdir -p "$OUTPUT_DIR" +if [[ ! -f "${DATA_OBJ_DIR}/buckets.bin" || + ! -f "${DATA_OBJ_DIR}/buckets_size.bin" || + ! -f "${DATA_OBJ_DIR}/lookup_keys.bin" || + ! -f "${DATA_OBJ_DIR}/lookedup_values.bin" || + ! -f "${DATA_OBJ_DIR}/key_score_digest.bin" ]]; then + (cd "$CASE_DIR" && python3 gen_data.py) +fi + build_one() { local name="$1" local data_file="${DATA_OBJ_DIR}/${name}" @@ -28,7 +41,7 @@ _binary_${sym_name}_end: .equ _binary_${sym_name}_size, .-_binary_${sym_name}_start EOF - $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file" + "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file" } build_one "buckets.bin" diff --git a/test/kernel/control/hkv/gen_data.py b/benchmarks/kernels/control/hkv/gen_data.py similarity index 100% rename from test/kernel/control/hkv/gen_data.py rename to benchmarks/kernels/control/hkv/gen_data.py diff --git a/test/kernel/control/hkv/hkv.cpp b/benchmarks/kernels/control/hkv/hkv.cpp similarity index 100% rename from test/kernel/control/hkv/hkv.cpp rename to benchmarks/kernels/control/hkv/hkv.cpp diff --git a/test/kernel/element_wise/gelu/Makefile b/benchmarks/kernels/element_wise/gelu/Makefile similarity index 100% rename from test/kernel/element_wise/gelu/Makefile rename to benchmarks/kernels/element_wise/gelu/Makefile diff --git a/test/kernel/element_wise/gelu/compile.all b/benchmarks/kernels/element_wise/gelu/compile.all similarity index 100% rename from test/kernel/element_wise/gelu/compile.all rename to benchmarks/kernels/element_wise/gelu/compile.all diff --git a/test/kernel/element_wise/gelu/src/gelu.cpp b/benchmarks/kernels/element_wise/gelu/src/gelu.cpp similarity index 97% rename from test/kernel/element_wise/gelu/src/gelu.cpp rename to benchmarks/kernels/element_wise/gelu/src/gelu.cpp index 1dd8af9..2e70935 100644 --- a/test/kernel/element_wise/gelu/src/gelu.cpp +++ b/benchmarks/kernels/element_wise/gelu/src/gelu.cpp @@ -1,9 +1,14 @@ #include +#ifdef __linx +#include +#include +#else #include #include #include "fileop.h" +#endif #include "element_wise/gelu.hpp" @@ -76,4 +81,4 @@ int main() { // #ifdef RES_CHECK // writeBinaryFile(OUTPUT_PATH, (uint8_t*)output, gMs * sizeof(dtype)); // #endif -} \ No newline at end of file +} diff --git a/test/kernel/element_wise/gelu/src/gelu_data_compare.py b/benchmarks/kernels/element_wise/gelu/src/gelu_data_compare.py similarity index 100% rename from test/kernel/element_wise/gelu/src/gelu_data_compare.py rename to benchmarks/kernels/element_wise/gelu/src/gelu_data_compare.py diff --git a/test/kernel/element_wise/gelu/src/gen_gelu_data.py b/benchmarks/kernels/element_wise/gelu/src/gen_gelu_data.py similarity index 100% rename from test/kernel/element_wise/gelu/src/gen_gelu_data.py rename to benchmarks/kernels/element_wise/gelu/src/gen_gelu_data.py diff --git a/test/kernel/element_wise/gelu/src/tmp.list b/benchmarks/kernels/element_wise/gelu/src/tmp.list similarity index 100% rename from test/kernel/element_wise/gelu/src/tmp.list rename to benchmarks/kernels/element_wise/gelu/src/tmp.list diff --git a/test/kernel/fusion_op/Makefile b/benchmarks/kernels/fusion/Makefile similarity index 100% rename from test/kernel/fusion_op/Makefile rename to benchmarks/kernels/fusion/Makefile diff --git a/test/kernel/fusion_op/compile.all b/benchmarks/kernels/fusion/compile.all similarity index 100% rename from test/kernel/fusion_op/compile.all rename to benchmarks/kernels/fusion/compile.all diff --git a/test/kernel/fusion_op/src/fa_hif4.cpp b/benchmarks/kernels/fusion/src/fa_hif4.cpp similarity index 98% rename from test/kernel/fusion_op/src/fa_hif4.cpp rename to benchmarks/kernels/fusion/src/fa_hif4.cpp index 00899e0..8307e51 100644 --- a/test/kernel/fusion_op/src/fa_hif4.cpp +++ b/benchmarks/kernels/fusion/src/fa_hif4.cpp @@ -1,5 +1,5 @@ #include -// #include "../../include/accelerator_fusion.h" +// #include #include "benchmark.h" #include "fileop.h" #include "fa_mx/fa_hif4.hpp" diff --git a/test/kernel/gemm/matmul/Makefile b/benchmarks/kernels/gemm/matmul/Makefile similarity index 89% rename from test/kernel/gemm/matmul/Makefile rename to benchmarks/kernels/gemm/matmul/Makefile index aba422d..25dafb3 100644 --- a/test/kernel/gemm/matmul/Makefile +++ b/benchmarks/kernels/gemm/matmul/Makefile @@ -4,11 +4,11 @@ ifeq ($(TYPE), HIF4_HIF4) OPFILE = $(notdir $(OPPATH)) OPNAME = $(patsubst %.cpp,%, $(OPFILE)) DEFINES += -DglobM=$(M) -DglobN=$(N) -DglobK=$(K) -DtilM=$(tM) -DtilN=$(tN) -DtilK=$(tK) - Batch = 1 - ifneq ($(B), ) - Batch = $(B) - DEFINES += -DBatch=$(B) - endif + Batch = 1 + ifneq ($(B), ) + Batch = $(B) + endif + DEFINES += -DBatch=$(Batch) ifneq ($(VER), ) DEFINES += -D$(VER) else @@ -29,11 +29,11 @@ ifeq ($(TYPE), A16W4) OPFILE = $(notdir $(OPPATH)) OPNAME = $(patsubst %.cpp,%, $(OPFILE)) DEFINES += -DglobM=$(M) -DglobN=$(N) -DglobK=$(K) -DtilM=$(tM) -DtilN=$(tN) -DtilK=$(tK) - Batch = 1 - ifneq ($(B), ) - Batch = $(B) - endif - DEFINES += -DBatch=$(B) + Batch = 1 + ifneq ($(B), ) + Batch = $(B) + endif + DEFINES += -DBatch=$(Batch) # TARGET = $(OPNAME)_B$(Batch)_M$(M)_N$(N)_K$(K)_tM$(tM)_tN$(tN)_tK$(tK).elf TARGET = $(ELF_HEAD)/$(TESTCASE)_$(TYPE)_$(MODE)_B$(Batch)_M$(M)_N$(N)_K$(K)_tM$(tM)_tN$(tN)_tK$(tK).elf endif @@ -56,4 +56,4 @@ DEST_DIR = ~/elf_subset/subset_matmul_reuse/ $(TARGET): clean $(LINK_SCRIPT) $(COMM_OBJ) $(OBJ) $(EXTRA_OBJ_FILES) @mkdir -p $(shell dirname $@) $(LINK) $(CC_LINK) $(OBJ) $(COMM_OBJ) $(EXTRA_OBJ_FILES) -o $@ - $(CP) $(TARGET) $(DEST_DIR) \ No newline at end of file + $(CP) $(TARGET) $(DEST_DIR) diff --git a/test/kernel/gemm/matmul/compile.all b/benchmarks/kernels/gemm/matmul/compile.all similarity index 100% rename from test/kernel/gemm/matmul/compile.all rename to benchmarks/kernels/gemm/matmul/compile.all diff --git a/test/kernel/gemm/matmul/src/A16W4.cpp b/benchmarks/kernels/gemm/matmul/src/A16W4.cpp similarity index 63% rename from test/kernel/gemm/matmul/src/A16W4.cpp rename to benchmarks/kernels/gemm/matmul/src/A16W4.cpp index 7defbf1..f52b5d8 100644 --- a/test/kernel/gemm/matmul/src/A16W4.cpp +++ b/benchmarks/kernels/gemm/matmul/src/A16W4.cpp @@ -1,8 +1,49 @@ #include +#ifdef __linx +#include +#include +#else #include #include "fileop.h" #include "common.h" #include "benchmark.h" +#endif + +#ifdef __linx +#define BENCHSTART __asm__ __volatile__("B.HINT TRACE.begin\n" : : :); +#define BENCHEND __asm__ __volatile__("B.HINT TRACE.end\n" : : :); + +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif #ifndef globM #define globM 120 @@ -68,4 +109,4 @@ int main() { #endif return 0; -} \ No newline at end of file +} diff --git a/test/kernel/gemm/matmul/src/HiF4_HiF4.cpp b/benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp similarity index 73% rename from test/kernel/gemm/matmul/src/HiF4_HiF4.cpp rename to benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp index 9d7a351..6e23751 100644 --- a/test/kernel/gemm/matmul/src/HiF4_HiF4.cpp +++ b/benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp @@ -1,8 +1,49 @@ #include +#ifdef __linx +#include +#include +#else #include #include "fileop.h" #include "common.h" #include "benchmark.h" +#endif + +#ifdef __linx +#define BENCHSTART __asm__ __volatile__("B.HINT TRACE.begin\n" : : :); +#define BENCHEND __asm__ __volatile__("B.HINT TRACE.end\n" : : :); + +int main(); + +static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) { + if (code == 0) { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 5, ->t\n" + "addi t#1, 1365, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } else { + __asm__ volatile( + "BSTART.STD\n" + "lui 65545, ->u\n" + "lui 19, ->t\n" + "addi t#1, 819, ->t\n" + "c.swi t#1, [u#1, 0]\n" + "BSTOP\n" + ::: "memory"); + } + while (1) { + __asm__ volatile("" ::: "memory"); + } +} + +extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) { + linx_supernpu_exit(static_cast(main())); +} +#endif #ifndef globM #define globM 120 @@ -82,4 +123,4 @@ int main() { #endif return 0; -} \ No newline at end of file +} diff --git a/test/kernel/memory/broadcast/Makefile b/benchmarks/kernels/memory/broadcast/Makefile similarity index 98% rename from test/kernel/memory/broadcast/Makefile rename to benchmarks/kernels/memory/broadcast/Makefile index cffedf1..c257021 100644 --- a/test/kernel/memory/broadcast/Makefile +++ b/benchmarks/kernels/memory/broadcast/Makefile @@ -3,7 +3,7 @@ DEFINES += -DDType=$(DType) -DtMs=$(tMs) -DMAX_DIMs=$(MAX_DIMs) -DIN_SHAPEs=$(IN TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_DType$(DType)_tM$(tMs)_IN_SHAPE$(IN_SHAPE_NAME)_OUT_SHAPE$(OUT_SHAPE_NAME).elf endif -ifeq ($(TESTCASE), broadcast_nocopyout) +ifeq ($(TESTCASE), broadcast_nostore) DEFINES += -DDType=$(DType) -DtMs=$(tMs) -DMAX_DIMs=$(MAX_DIMs) -DIN_SHAPEs=$(IN_SHAPEs) -DOUT_SHAPEs=$(OUT_SHAPEs) -DIN_DIMs=$(IN_DIMs) -DOUT_DIMs=$(OUT_DIMs) -DgIMs=$(gIMs) -DgOMs=$(gOMs) TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_DType$(DType)_tM$(tMs)_IN_SHAPE$(IN_SHAPE_NAME)_OUT_SHAPE$(OUT_SHAPE_NAME).elf endif diff --git a/test/kernel/memory/broadcast/compile.all b/benchmarks/kernels/memory/broadcast/compile.all similarity index 89% rename from test/kernel/memory/broadcast/compile.all rename to benchmarks/kernels/memory/broadcast/compile.all index 9206219..cebb560 100755 --- a/test/kernel/memory/broadcast/compile.all +++ b/benchmarks/kernels/memory/broadcast/compile.all @@ -3,98 +3,98 @@ make TESTCASE=broadcast_07 DType=__half tMs=2048 MAX_DIMs=2 \ IN_SHAPEs=1334,1 OUT_SHAPEs=1334,129 IN_DIMs=2 OUT_DIMs=2 \ gIMs=1334*1 gOMs=1334*129 \ TESTNAME=broadcast_to_07 \ - IN_SHAPE_NAME=1334_1 OUT_SHAPE_NAME=1334_129 res_check=off diss + IN_SHAPE_NAME=1334_1 OUT_SHAPE_NAME=1334_129 res_check=off diss # # 2\broadcast_to_ABA_019_new fp16 make TESTCASE=broadcast_019 DType=__half tMs=2048 MAX_DIMs=3 \ IN_SHAPEs=1280,1,49 OUT_SHAPEs=1280,8,49 IN_DIMs=3 OUT_DIMs=3 \ gIMs=1280*1*49 gOMs=1280*8*49 \ TESTNAME=broadcast_to_ABA_019_new \ - IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss + IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss # 3\BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 fp16 make TESTCASE=broadcast_Hunyuan DType=__half tMs=2048 MAX_DIMs=5 \ IN_SHAPEs=1,1,1,65,128 OUT_SHAPEs=1,1,7,65,128 IN_DIMs=5 OUT_DIMs=5 \ gIMs=1*1*1*65*128 gOMs=1*1*7*65*128 \ TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \ - IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128 res_check=off diss + IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128 res_check=off diss # 4\Bbroadcast_to_BABA_039 fp16 int32 make TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \ IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ gIMs=1*128*1*16 gOMs=64*128*8*16 \ TESTNAME=Bbroadcast_to_BABA_039 \ - IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss + IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss make TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \ IN_SHAPEs=1,8192,1,16 OUT_SHAPEs=1,8192,8,16 IN_DIMs=4 OUT_DIMs=4 \ gIMs=1*8192*1*16 gOMs=1*8192*8*16 \ TESTNAME=Bbroadcast_to_BABA_039 \ - IN_SHAPE_NAME=1_8192_1_16 OUT_SHAPE_NAME=1_8192_8_16 res_check=off diss + IN_SHAPE_NAME=1_8192_1_16 OUT_SHAPE_NAME=1_8192_8_16 res_check=off diss # make TESTCASE=broadcast_039 DType=int32_t tMs=2048 MAX_DIMs=4 \ # IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*128*1*16 gOMs=64*128*8*16 \ # TESTNAME=Bbroadcast_to_BABA_039 \ -# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss +# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss -# make TESTCASE=broadcast_nocopyout DType=int32_t tMs=512 MAX_DIMs=2 \ +# make TESTCASE=broadcast_nostore DType=int32_t tMs=512 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss # # NoCopyout 性能分析 # # # 1\broadcast_to_07 fp16 -# make TESTCASE=broadcast_nocopyout DType=__half tMs=128 MAX_DIMs=2 \ +# make TESTCASE=broadcast_nostore DType=__half tMs=128 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss -# make TESTCASE=broadcast_nocopyout DType=__half tMs=512 MAX_DIMs=2 \ +# make TESTCASE=broadcast_nostore DType=__half tMs=512 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss -# make TESTCASE=broadcast_nocopyout DType=__half tMs=2048 MAX_DIMs=2 \ +# make TESTCASE=broadcast_nostore DType=__half tMs=2048 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss # # # 2\broadcast_to_ABA_019_new fp16 -# make TESTCASE=broadcast_nocopyout DType=__half tMs=2048 MAX_DIMs=3 \ +# make TESTCASE=broadcast_nostore DType=__half tMs=2048 MAX_DIMs=3 \ # IN_SHAPEs=1280,1,49 OUT_SHAPEs=1280,8,49 IN_DIMs=3 OUT_DIMs=3 \ # gIMs=1280*1*49 gOMs=1280*8*49 \ # TESTNAME=broadcast_to_ABA_019_new \ -# IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss +# IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss # # # 3\BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 fp16 -# make TESTCASE=broadcast_nocopyout DType=__half tMs=2048 MAX_DIMs=5 \ +# make TESTCASE=broadcast_nostore DType=__half tMs=2048 MAX_DIMs=5 \ # IN_SHAPEs=1,1,1,80,128 OUT_SHAPEs=1,1,7,80,128 IN_DIMs=5 OUT_DIMs=5 \ # gIMs=1*1*1*80*128 gOMs=1*1*7*80*128 \ # TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \ -# IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128 res_check=off diss +# IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128 res_check=off diss # # 4\Bbroadcast_to_BABA_039 fp16 int32 -# make TESTCASE=broadcast_nocopyout DType=__half tMs=512 MAX_DIMs=4 \ +# make TESTCASE=broadcast_nostore DType=__half tMs=512 MAX_DIMs=4 \ # IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*128*1*16 gOMs=64*128*8*16 \ # TESTNAME=Bbroadcast_to_BABA_039 \ -# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss +# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss -# make TESTCASE=broadcast_nocopyout DType=int32_t tMs=512 MAX_DIMs=4 \ +# make TESTCASE=broadcast_nostore DType=int32_t tMs=512 MAX_DIMs=4 \ # IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*128*1*16 gOMs=64*128*8*16 \ # TESTNAME=Bbroadcast_to_BABA_039 \ -# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss +# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss @@ -104,70 +104,70 @@ make TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss - +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss + # make TESTCASE=broadcast DType=__half tMs=256 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss - +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss + # make TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss - +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss + # make TESTCASE=broadcast DType=__half tMs=768 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss # make TESTCASE=broadcast DType=__half tMs=1024 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss # make TESTCASE=broadcast DType=__half tMs=1536 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss # make TESTCASE=broadcast DType=__half tMs=2048 MAX_DIMs=2 \ # IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=1042*1 gOMs=1042*129 \ # TESTNAME=broadcast_to_07 \ -# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss +# IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss # # # 2\broadcast_to_ABA_019_new fp16 # make TESTCASE=broadcast DType=__half tMs=2048 MAX_DIMs=3 \ # IN_SHAPEs=1280,1,49 OUT_SHAPEs=1280,8,49 IN_DIMs=3 OUT_DIMs=3 \ # gIMs=1280*1*49 gOMs=1280*8*49 \ # TESTNAME=broadcast_to_ABA_019_new \ -# IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss +# IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss # # # 3\BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 fp16 # make TESTCASE=broadcast DType=__half tMs=2048 MAX_DIMs=5 \ # IN_SHAPEs=1,1,1,65,128 OUT_SHAPEs=1,1,7,65,128 IN_DIMs=5 OUT_DIMs=5 \ # gIMs=1*1*1*65*128 gOMs=1*1*7*65*128 \ # TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \ -# IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128 res_check=off diss +# IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128 res_check=off diss # # 4\Bbroadcast_to_BABA_039 fp16 int32 # make TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=4 \ # IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*128*1*16 gOMs=64*128*8*16 \ # TESTNAME=Bbroadcast_to_BABA_039 \ -# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss +# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=4 \ # IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*128*1*16 gOMs=64*128*8*16 \ # TESTNAME=Bbroadcast_to_BABA_039 \ -# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss +# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16 res_check=off diss # gfrun验证通过 @@ -175,7 +175,7 @@ make TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \ # IN_SHAPEs=1,1,1,80,128 OUT_SHAPEs=1,1,7,80,128 IN_DIMs=5 OUT_DIMs=5 \ # gIMs=1*1*1*80*128 gOMs=1*1*7*80*128 \ # TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \ -# IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128 res_check=on diss +# IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128 res_check=on diss # make TESTCASE=broadcast_mscatter DType=int32_t tMs=512 MAX_DIMs=3 \ # IN_SHAPEs=64,1,49 OUT_SHAPEs=64,8,49 IN_DIMs=3 OUT_DIMs=3 \ @@ -185,23 +185,23 @@ make TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \ # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=4 \ # IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*128*1*16 gOMs=64*128*8*16 \ -# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16-Tst res_check=off diss +# IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16-Tst res_check=off diss # make TESTCASE=broadcast DType=int32_t tMs=2048 MAX_DIMs=3 \ # IN_SHAPEs=8192,1,49 OUT_SHAPEs=8192,8,49 IN_DIMs=3 OUT_DIMs=3 \ # gIMs=8192*1*49 gOMs=8192*8*49 \ -# IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss +# IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=3 \ # IN_SHAPEs=64,1,49 OUT_SHAPEs=64,8,49 IN_DIMs=3 OUT_DIMs=3 \ # gIMs=64*1*49 gOMs=64*8*49 \ -# IN_SHAPE_NAME=64_1_49 OUT_SHAPE_NAME=64_8_49 res_check=off diss +# IN_SHAPE_NAME=64_1_49 OUT_SHAPE_NAME=64_8_49 res_check=off diss # # BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 # make TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=5 \ # IN_SHAPEs=1,4,1,1034,128 OUT_SHAPEs=1,4,7,1034,128 IN_DIMs=5 OUT_DIMs=5 \ # gIMs=1*4*1*1034*128 gOMs=1*4*7*1034*128 \ -# IN_SHAPE_NAME=1_4_1_1034_128 OUT_SHAPE_NAME=1_4_7_1034_128 res_check=off diss +# IN_SHAPE_NAME=1_4_1_1034_128 OUT_SHAPE_NAME=1_4_7_1034_128 res_check=off diss @@ -210,31 +210,31 @@ make TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \ # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=2 \ # IN_SHAPEs=66661,1 OUT_SHAPEs=66661,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=66661*1 gOMs=66661*129 \ -# IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss +# IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss # make TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=2 \ # IN_SHAPEs=66661,1 OUT_SHAPEs=66661,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=66661*1 gOMs=66661*129 \ -# IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss +# IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=3 \ # IN_SHAPEs=81920,1,49 OUT_SHAPEs=81920,8,49 IN_DIMs=3 OUT_DIMs=3 \ # gIMs=81920*1*49 gOMs=81920*8*49 \ -# IN_SHAPE_NAME=81920_1_49 OUT_SHAPE_NAME=81920_8_49 res_check=off diss +# IN_SHAPE_NAME=81920_1_49 OUT_SHAPE_NAME=81920_8_49 res_check=off diss # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=2 \ # IN_SHAPEs=66661,1 OUT_SHAPEs=66661,129 IN_DIMs=2 OUT_DIMs=2 \ # gIMs=66661*1 gOMs=66661*129 \ -# IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss - +# IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss + # make TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=4 \ # IN_SHAPEs=1,81920,1,49 OUT_SHAPEs=64,81920,8,49 IN_DIMs=4 OUT_DIMs=4 \ # gIMs=1*81920*1*49 gOMs=64*81920*8*49 \ -# IN_SHAPE_NAME=1_81920_1_49 OUT_SHAPE_NAME=64_81920_8_49 res_check=off diss +# IN_SHAPE_NAME=1_81920_1_49 OUT_SHAPE_NAME=64_81920_8_49 res_check=off diss ### for test ### # make TESTCASE=broadcast_tst DType=int32_t tMs=2048 MAX_DIMs=3 \ # IN_SHAPEs=8192,1,49 OUT_SHAPEs=8192,8,49 IN_DIMs=3 OUT_DIMs=3 \ # gIMs=8192*1*49 gOMs=8192*8*49 \ -# IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss \ No newline at end of file +# IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss \ No newline at end of file diff --git a/test/kernel/memory/broadcast/src/broadcast.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_019.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_019.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_039.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_039.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_07.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_07.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_Hunyuan.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_Hunyuan.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_data_compare.py b/benchmarks/kernels/memory/broadcast/src/broadcast_data_compare.py similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_data_compare.py rename to benchmarks/kernels/memory/broadcast/src/broadcast_data_compare.py diff --git a/test/kernel/memory/broadcast/src/broadcast_mscatter.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_mscatter.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_nomg.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_nomg.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp diff --git a/test/kernel/memory/broadcast/src/broadcast_nocopyout.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp similarity index 93% rename from test/kernel/memory/broadcast/src/broadcast_nocopyout.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp index 9e64e9d..c7070e0 100644 --- a/test/kernel/memory/broadcast/src/broadcast_nocopyout.cpp +++ b/benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp @@ -4,7 +4,7 @@ #include #include "fileop.h" -#include "memory/broadcast_nocopyout.hpp" +#include "memory/broadcast_nostore.hpp" #ifndef DType @@ -92,8 +92,8 @@ int main() { printf("input[2]=%d\n",input[2]); printf("input[3]=%d\n",input[3]); #endif - - broadcast_nocopyout(input, output, in_shape, out_shape); + + broadcast_nostore(input, output, in_shape, out_shape); #ifdef RES_CHECK writeBinaryFile(OUTPUT_PATH, (uint8_t*)output, gOMs * sizeof(dtype)); diff --git a/test/kernel/memory/broadcast/src/broadcast_tst.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp similarity index 100% rename from test/kernel/memory/broadcast/src/broadcast_tst.cpp rename to benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp diff --git a/test/kernel/memory/broadcast/src/gen_broadcast_data.py b/benchmarks/kernels/memory/broadcast/src/gen_broadcast_data.py similarity index 100% rename from test/kernel/memory/broadcast/src/gen_broadcast_data.py rename to benchmarks/kernels/memory/broadcast/src/gen_broadcast_data.py diff --git a/test/kernel/memory/broadcast/src/gfrun_broadcast.py b/benchmarks/kernels/memory/broadcast/src/gfrun_broadcast.py similarity index 100% rename from test/kernel/memory/broadcast/src/gfrun_broadcast.py rename to benchmarks/kernels/memory/broadcast/src/gfrun_broadcast.py diff --git a/test/kernel/memory/broadcast/src/tmp.list b/benchmarks/kernels/memory/broadcast/src/tmp.list similarity index 100% rename from test/kernel/memory/broadcast/src/tmp.list rename to benchmarks/kernels/memory/broadcast/src/tmp.list diff --git a/test/kernel/memory/broadcast_vec/Makefile b/benchmarks/kernels/memory/broadcast_vec/Makefile similarity index 100% rename from test/kernel/memory/broadcast_vec/Makefile rename to benchmarks/kernels/memory/broadcast_vec/Makefile diff --git a/test/kernel/memory/broadcast_vec/compile.all b/benchmarks/kernels/memory/broadcast_vec/compile.all similarity index 100% rename from test/kernel/memory/broadcast_vec/compile.all rename to benchmarks/kernels/memory/broadcast_vec/compile.all diff --git a/test/kernel/memory/broadcast_vec/src/broadcast_vec_019.cpp b/benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp similarity index 100% rename from test/kernel/memory/broadcast_vec/src/broadcast_vec_019.cpp rename to benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp diff --git a/test/kernel/memory/broadcast_vec/src/broadcast_vec_039.cpp b/benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp similarity index 100% rename from test/kernel/memory/broadcast_vec/src/broadcast_vec_039.cpp rename to benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp diff --git a/test/kernel/memory/broadcast_vec/src/broadcast_vec_07.cpp b/benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp similarity index 100% rename from test/kernel/memory/broadcast_vec/src/broadcast_vec_07.cpp rename to benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp diff --git a/test/kernel/memory/concat_gather/Makefile b/benchmarks/kernels/memory/concat_gather/Makefile similarity index 100% rename from test/kernel/memory/concat_gather/Makefile rename to benchmarks/kernels/memory/concat_gather/Makefile diff --git a/test/kernel/memory/concat_gather/compile.all b/benchmarks/kernels/memory/concat_gather/compile.all similarity index 100% rename from test/kernel/memory/concat_gather/compile.all rename to benchmarks/kernels/memory/concat_gather/compile.all diff --git a/test/kernel/memory/concat_gather/src/concat_gather.cpp b/benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp similarity index 100% rename from test/kernel/memory/concat_gather/src/concat_gather.cpp rename to benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp diff --git a/test/kernel/memory/concat_scatter/Makefile b/benchmarks/kernels/memory/concat_scatter/Makefile similarity index 100% rename from test/kernel/memory/concat_scatter/Makefile rename to benchmarks/kernels/memory/concat_scatter/Makefile diff --git a/test/kernel/memory/concat_scatter/compile.all b/benchmarks/kernels/memory/concat_scatter/compile.all similarity index 100% rename from test/kernel/memory/concat_scatter/compile.all rename to benchmarks/kernels/memory/concat_scatter/compile.all diff --git a/test/kernel/memory/concat_scatter/src/concat_scatter.cpp b/benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp similarity index 100% rename from test/kernel/memory/concat_scatter/src/concat_scatter.cpp rename to benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp diff --git a/test/kernel/memory/gather/Makefile b/benchmarks/kernels/memory/gather/Makefile similarity index 100% rename from test/kernel/memory/gather/Makefile rename to benchmarks/kernels/memory/gather/Makefile diff --git a/test/kernel/memory/gather/compile.all b/benchmarks/kernels/memory/gather/compile.all similarity index 100% rename from test/kernel/memory/gather/compile.all rename to benchmarks/kernels/memory/gather/compile.all diff --git a/test/kernel/memory/gather/src/gather.cpp b/benchmarks/kernels/memory/gather/src/gather.cpp similarity index 100% rename from test/kernel/memory/gather/src/gather.cpp rename to benchmarks/kernels/memory/gather/src/gather.cpp diff --git a/test/kernel/memory/gather/src/gen_gather_data.py b/benchmarks/kernels/memory/gather/src/gen_gather_data.py similarity index 100% rename from test/kernel/memory/gather/src/gen_gather_data.py rename to benchmarks/kernels/memory/gather/src/gen_gather_data.py diff --git a/test/kernel/memory/gather/src/tmp.list b/benchmarks/kernels/memory/gather/src/tmp.list similarity index 100% rename from test/kernel/memory/gather/src/tmp.list rename to benchmarks/kernels/memory/gather/src/tmp.list diff --git a/test/kernel/memory/transpose/Makefile b/benchmarks/kernels/memory/transpose/Makefile similarity index 100% rename from test/kernel/memory/transpose/Makefile rename to benchmarks/kernels/memory/transpose/Makefile diff --git a/test/kernel/memory/transpose/compile.all b/benchmarks/kernels/memory/transpose/compile.all similarity index 100% rename from test/kernel/memory/transpose/compile.all rename to benchmarks/kernels/memory/transpose/compile.all diff --git a/test/kernel/memory/transpose/src/transpose.cpp b/benchmarks/kernels/memory/transpose/src/transpose.cpp similarity index 100% rename from test/kernel/memory/transpose/src/transpose.cpp rename to benchmarks/kernels/memory/transpose/src/transpose.cpp diff --git a/test/kernel/reduction/reducemax_col/Makefile b/benchmarks/kernels/reduction/reducemax_col/Makefile similarity index 100% rename from test/kernel/reduction/reducemax_col/Makefile rename to benchmarks/kernels/reduction/reducemax_col/Makefile diff --git a/test/kernel/reduction/reducemax_col/compile.all b/benchmarks/kernels/reduction/reducemax_col/compile.all similarity index 100% rename from test/kernel/reduction/reducemax_col/compile.all rename to benchmarks/kernels/reduction/reducemax_col/compile.all diff --git a/test/kernel/reduction/reducemax_col/src/reducemax_col.cpp b/benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp similarity index 100% rename from test/kernel/reduction/reducemax_col/src/reducemax_col.cpp rename to benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp diff --git a/test/kernel/reduction/reducemax_row/Makefile b/benchmarks/kernels/reduction/reducemax_row/Makefile similarity index 100% rename from test/kernel/reduction/reducemax_row/Makefile rename to benchmarks/kernels/reduction/reducemax_row/Makefile diff --git a/test/kernel/reduction/reducemax_row/compile.all b/benchmarks/kernels/reduction/reducemax_row/compile.all similarity index 100% rename from test/kernel/reduction/reducemax_row/compile.all rename to benchmarks/kernels/reduction/reducemax_row/compile.all diff --git a/test/kernel/reduction/reducemax_row/src/reducemax_row.cpp b/benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp similarity index 100% rename from test/kernel/reduction/reducemax_row/src/reducemax_row.cpp rename to benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp diff --git a/test/kernel/reduction/reducesum_col/Makefile b/benchmarks/kernels/reduction/reducesum_col/Makefile similarity index 100% rename from test/kernel/reduction/reducesum_col/Makefile rename to benchmarks/kernels/reduction/reducesum_col/Makefile diff --git a/test/kernel/reduction/reducesum_col/compile.all b/benchmarks/kernels/reduction/reducesum_col/compile.all similarity index 100% rename from test/kernel/reduction/reducesum_col/compile.all rename to benchmarks/kernels/reduction/reducesum_col/compile.all diff --git a/test/kernel/reduction/reducesum_col/src/reducesum_col.cpp b/benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp similarity index 100% rename from test/kernel/reduction/reducesum_col/src/reducesum_col.cpp rename to benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp diff --git a/test/kernel/reduction/reducesum_row/Makefile b/benchmarks/kernels/reduction/reducesum_row/Makefile similarity index 100% rename from test/kernel/reduction/reducesum_row/Makefile rename to benchmarks/kernels/reduction/reducesum_row/Makefile diff --git a/test/kernel/reduction/reducesum_row/compile.all b/benchmarks/kernels/reduction/reducesum_row/compile.all similarity index 100% rename from test/kernel/reduction/reducesum_row/compile.all rename to benchmarks/kernels/reduction/reducesum_row/compile.all diff --git a/test/kernel/reduction/reducesum_row/src/reducesum_row.cpp b/benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp similarity index 100% rename from test/kernel/reduction/reducesum_row/src/reducesum_row.cpp rename to benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp diff --git a/test/kernel/sort/Makefile b/benchmarks/kernels/sort/Makefile similarity index 70% rename from test/kernel/sort/Makefile rename to benchmarks/kernels/sort/Makefile index a07b811..2fee7f6 100644 --- a/test/kernel/sort/Makefile +++ b/benchmarks/kernels/sort/Makefile @@ -1,3 +1,5 @@ +.DEFAULT_GOAL := all + TARGET = $(ELF_HEAD)_$(TESTCASE).elf # Override target name for topk @@ -7,10 +9,10 @@ endif SRC_FILE += $(TEST_ROOT)/$(CATEGORY)/$(TESTCASE)/$(TESTCASE).cpp # Special handling for topk - embed data as object files -EXTRA_OBJ_FILES := -EXTRA_OBJ_DEPS := +EXTRA_OBJ_FILES = +EXTRA_OBJ_DEPS = DATA_OBJ_DIR := topk/data_obj -OUTPUT_DATA_OBJ_DIR := ../../../output/kernel/sort/topk/data_obj +OUTPUT_DATA_OBJ_DIR = $(OBJ_ROOT)/$(CATEGORY)/topk/data_obj ifeq ($(TESTCASE), topk) EXTRA_OBJ_FILES += $(OUTPUT_DATA_OBJ_DIR)/input_131072.o @@ -21,9 +23,6 @@ pre_work: build_data_objs build_data_objs: @COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR) -$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work - @mkdir -p $(shell dirname $@) - $(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@ endif ifeq ($(opt), on) @@ -31,4 +30,9 @@ DEFINES += -DOPT TARGET = $(ELF_HEAD)_$(TESTCASE)_OPT.elf endif -include ../../common/Makefile.common \ No newline at end of file +include ../../common/Makefile.common + +ifneq ($(EXTRA_OBJ_FILES),) +$(EXTRA_OBJ_FILES): pre_work + @true +endif diff --git a/test/kernel/sort/compile.all b/benchmarks/kernels/sort/compile.all similarity index 100% rename from test/kernel/sort/compile.all rename to benchmarks/kernels/sort/compile.all diff --git a/benchmarks/kernels/sort/topk/.gitignore b/benchmarks/kernels/sort/topk/.gitignore new file mode 100644 index 0000000..e60f4e8 --- /dev/null +++ b/benchmarks/kernels/sort/topk/.gitignore @@ -0,0 +1,3 @@ +*.o +*.s +*.data diff --git a/test/kernel/sort/topk/data_obj/build_data_obj.sh b/benchmarks/kernels/sort/topk/data_obj/build_data_obj.sh similarity index 50% rename from test/kernel/sort/topk/data_obj/build_data_obj.sh rename to benchmarks/kernels/sort/topk/data_obj/build_data_obj.sh index 8128f1e..7f3172b 100755 --- a/test/kernel/sort/topk/data_obj/build_data_obj.sh +++ b/benchmarks/kernels/sort/topk/data_obj/build_data_obj.sh @@ -1,10 +1,20 @@ #!/bin/bash -COMPILER_DIR="${COMPILER_DIR:-/remote/lms01/j00827727/jcore/compilers/linx_blockisa_llvm_musl0.56.16/bin}" -DATA_OBJ_DIR="$1" -OUTPUT_DIR="$2" +set -euo pipefail + +COMPILER_DIR="${COMPILER_DIR:-/usr/bin}" +LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}" +DATA_OBJ_DIR="${1:?data object directory required}" +OUTPUT_DIR="${2:?output directory required}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" mkdir -p "$OUTPUT_DIR" +if [[ ! -f "${DATA_OBJ_DIR}/input_131072.data" || + ! -f "${DATA_OBJ_DIR}/top_2048_out.data" ]]; then + (cd "$CASE_DIR" && python3 gen_topk_data.py) +fi + build_one() { local name="$1" local data_file="${DATA_OBJ_DIR}/${name}.data" @@ -25,10 +35,10 @@ _binary_${name}_data_end: .equ _binary_${name}_data_size, .-_binary_${name}_data_start EOF - $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file" + "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file" } build_one "input_131072" build_one "top_2048_out" -echo "Done building data object files" \ No newline at end of file +echo "Done building data object files" diff --git a/test/kernel/sort/topk/gen_topk_data.py b/benchmarks/kernels/sort/topk/gen_topk_data.py similarity index 100% rename from test/kernel/sort/topk/gen_topk_data.py rename to benchmarks/kernels/sort/topk/gen_topk_data.py diff --git a/test/kernel/sort/topk/topk.cpp b/benchmarks/kernels/sort/topk/topk.cpp similarity index 94% rename from test/kernel/sort/topk/topk.cpp rename to benchmarks/kernels/sort/topk/topk.cpp index edee4bc..1f717ea 100644 --- a/test/kernel/sort/topk/topk.cpp +++ b/benchmarks/kernels/sort/topk/topk.cpp @@ -1,9 +1,13 @@ #include #include "benchmark.h" +#ifndef __linx #include "fileop.h" +#endif #include "template_asm.h" +#ifndef __linx #include #include +#endif // #define FOR_GFSIM // ============================================================================ @@ -143,7 +147,7 @@ static int find_kth_bin(const uint32_t hist[256], int k, int& need_from_kth) { // ============================================================================ int main() { -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("=== TopK Test (SIMT per-bucket) ===\n"); printf("Input: %d TopK: %d Tiles: %d TileSize: %d\n", kInputCount, kTopK, kNumTiles, kTileSize); @@ -161,14 +165,14 @@ int main() { using HistGT = GlobalTensor, Stride<1,1,1,16,1>>; uint32_t histResult[256]; HistGT histGlobal(histResult); - TCOPYOUT(histGlobal, high8HistTile); + TSTORE(histGlobal, high8HistTile); uint32_t global_high8_hist[256] = {0}; for (int b = 0; b < 256; b++) { global_high8_hist[b] = histResult[b]; } -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("\nPhase 1: high8 histograms built (1 SIMT launch, 256 lanes).\n"); fflush(stdout); #endif @@ -179,7 +183,7 @@ int main() { int need_from_kth_bin = 0; int kth_bin = find_kth_bin(global_high8_hist, kTopK, need_from_kth_bin); -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("\nPhase 2: kth_bin=%d need_from_kth_bin=%d\n", kth_bin, need_from_kth_bin); uint64_t total_above = 0; @@ -200,7 +204,7 @@ int main() { uint32_t low8HistResult[256]; HistGT low8HistGlobal(low8HistResult); - TCOPYOUT(low8HistGlobal, low8HistTile); + TSTORE(low8HistGlobal, low8HistTile); uint32_t global_low8_hist_kth[256] = {0}; for (int b = 0; b < 256; b++) { @@ -220,7 +224,7 @@ int main() { } } -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("\nPhase 4: low8_boundary=%d\n", low8_boundary); printf(" Global low8 hist (kth bin) total: %lu\n", cumsum_low); fflush(stdout); @@ -229,7 +233,9 @@ int main() { // ------------------------------------------------------------------------- // Phase 5: Scalar masked scatter (directly on g_input / g_output) // ------------------------------------------------------------------------- - memset(g_output, 0, sizeof(g_output)); + for (int i = 0; i < kInputCount; i++) { + g_output[i] = 0; + } for (int i = 0; i < kInputCount; i++) { uint16_t val = g_input[i]; uint8_t high8 = static_cast(val >> 8); @@ -252,7 +258,7 @@ int main() { } } -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("\nPhase 5: Collected %d output elements (expected %d)\n", out_count, kTopK); fflush(stdout); @@ -264,7 +270,9 @@ int main() { int cmp_count = (out_count < kTopK) ? out_count : kTopK; uint16_t result_sorted[2048]; - memcpy(result_sorted, result, sizeof(result_sorted)); + for (int i = 0; i < cmp_count; i++) { + result_sorted[i] = result[i]; + } for (int i = 0; i < cmp_count; i++) { for (int j = i + 1; j < cmp_count; j++) { if (result_sorted[i] < result_sorted[j]) { @@ -280,7 +288,7 @@ int main() { if (result_sorted[i] == g_expected[i]) match++; } -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("\n=== Verification (vs. embedded standard answer) ===\n"); printf("Match: %d/%d (%.1f%%)\n", match, cmp_count, 100.0 * match / cmp_count); printf("Output[0..9]: "); @@ -291,7 +299,7 @@ int main() { #endif int ret = (match == cmp_count) ? 0 : 1; -#ifndef FOR_GFSIM +#if !defined(FOR_GFSIM) && !defined(__linx) printf("%s\n", ret ? "FAIL" : "PASS"); fflush(stdout); #endif diff --git a/test/other/cube/Makefile b/benchmarks/microbench/cube/Makefile similarity index 100% rename from test/other/cube/Makefile rename to benchmarks/microbench/cube/Makefile diff --git a/test/other/cube/compile.all b/benchmarks/microbench/cube/compile.all similarity index 100% rename from test/other/cube/compile.all rename to benchmarks/microbench/cube/compile.all diff --git a/test/other/cube/src/matop.cpp b/benchmarks/microbench/cube/src/matop.cpp similarity index 87% rename from test/other/cube/src/matop.cpp rename to benchmarks/microbench/cube/src/matop.cpp index b7aef50..49ae9f6 100644 --- a/test/other/cube/src/matop.cpp +++ b/benchmarks/microbench/cube/src/matop.cpp @@ -108,8 +108,8 @@ void matmul(const int loop, src_dtype *a, src_dtype *b) { tile_shapeB tB; tile_shapeACC tACC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); #pragma clang loop unroll(full) for(int i=0;i>; using tile_shape = Tile; Tile tsrc(0); - + BENCHSTART; #pragma clang loop unroll(full) for(int i=0;i tmp; - uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); + uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); gm_rd<<>>(gsrc.data(), static_cast(STRD/sizeof(dtype)), tmp.data()); } else if(!strcmp(MODE, "gm_frd")){ @@ -374,7 +374,7 @@ int main() { dtype dst[gm_size]; gm_shape gdst(dst); Tile tmp; - uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); + uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); gm_wr<<>>(gdst.data(), static_cast(STRD/sizeof(dtype)), tmp.data()); } else if(!strcmp(MODE, "gm_fwr")){ @@ -389,7 +389,7 @@ int main() { gm_shape gsrc(src); gm_shape gdst(dst); Tile tmp; - uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); + uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); gm_copy<<>>(gdst.data(), gsrc.data(), static_cast(STRD/sizeof(dtype)), tmp.data()); } else if(!strcmp(MODE, "gm_fcopy")){ @@ -402,7 +402,7 @@ int main() { } else if(!strcmp(MODE, "tile_rd")){ //Tile tsrc(0); - uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype)); + uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype)); tile_rd<<>>(tsrc.data(), static_cast(STRD/sizeof(dtype))); } else if(!strcmp(MODE, "tile_frd")){ @@ -411,7 +411,7 @@ int main() { } else if(!strcmp(MODE, "tile_wr")){ Tile tdst; - uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype)); + uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype)); tile_wr<<>>(tdst.data(), static_cast(STRD/sizeof(dtype))); } else if(!strcmp(MODE, "tile_fwr")){ @@ -421,7 +421,7 @@ int main() { else if(!strcmp(MODE, "tile_copy")){ //Tile tsrc(0); Tile tdst; - uint16_t real_col =tile_shape::Cols / (STRD/sizeof(dtype)); + uint16_t real_col =tile_shape::Cols / (STRD/sizeof(dtype)); tile_copy<<>>(tdst.data(), tsrc.data(), static_cast(STRD/sizeof(dtype))); } else if(!strcmp(MODE, "tile_fcopy")){ diff --git a/test/other/vec/Makefile b/benchmarks/microbench/vec/Makefile similarity index 100% rename from test/other/vec/Makefile rename to benchmarks/microbench/vec/Makefile diff --git a/test/other/vec/compile_lat_bw.all b/benchmarks/microbench/vec/compile_lat_bw.all similarity index 100% rename from test/other/vec/compile_lat_bw.all rename to benchmarks/microbench/vec/compile_lat_bw.all diff --git a/test/other/vec/src/lat_bw.cpp b/benchmarks/microbench/vec/src/lat_bw.cpp similarity index 100% rename from test/other/vec/src/lat_bw.cpp rename to benchmarks/microbench/vec/src/lat_bw.cpp diff --git a/test/other/vec/src/lat_bw_func.h b/benchmarks/microbench/vec/src/lat_bw_func.h similarity index 100% rename from test/other/vec/src/lat_bw_func.h rename to benchmarks/microbench/vec/src/lat_bw_func.h diff --git a/test/other/vec/src/lat_bw_vec.h b/benchmarks/microbench/vec/src/lat_bw_vec.h similarity index 100% rename from test/other/vec/src/lat_bw_vec.h rename to benchmarks/microbench/vec/src/lat_bw_vec.h diff --git a/test/other/deepseek/Makefile b/benchmarks/models/deepseekv3/Makefile similarity index 100% rename from test/other/deepseek/Makefile rename to benchmarks/models/deepseekv3/Makefile diff --git a/test/other/deepseek/compile.all b/benchmarks/models/deepseekv3/compile.all similarity index 100% rename from test/other/deepseek/compile.all rename to benchmarks/models/deepseekv3/compile.all diff --git a/test/other/deepseek/compile_cpu.all b/benchmarks/models/deepseekv3/compile_cpu.all similarity index 100% rename from test/other/deepseek/compile_cpu.all rename to benchmarks/models/deepseekv3/compile_cpu.all diff --git a/test/other/deepseek/src/concat.cpp b/benchmarks/models/deepseekv3/src/concat.cpp similarity index 100% rename from test/other/deepseek/src/concat.cpp rename to benchmarks/models/deepseekv3/src/concat.cpp diff --git a/test/other/deepseek/src/expand.cpp b/benchmarks/models/deepseekv3/src/expand.cpp similarity index 100% rename from test/other/deepseek/src/expand.cpp rename to benchmarks/models/deepseekv3/src/expand.cpp diff --git a/test/other/deepseek/src/gate.cpp b/benchmarks/models/deepseekv3/src/gate.cpp similarity index 100% rename from test/other/deepseek/src/gate.cpp rename to benchmarks/models/deepseekv3/src/gate.cpp diff --git a/test/other/deepseek/src/mask.cpp b/benchmarks/models/deepseekv3/src/mask.cpp similarity index 100% rename from test/other/deepseek/src/mask.cpp rename to benchmarks/models/deepseekv3/src/mask.cpp diff --git a/test/other/deepseek/src/mla.cpp b/benchmarks/models/deepseekv3/src/mla.cpp similarity index 100% rename from test/other/deepseek/src/mla.cpp rename to benchmarks/models/deepseekv3/src/mla.cpp diff --git a/test/other/deepseek/src/mlp.cpp b/benchmarks/models/deepseekv3/src/mlp.cpp similarity index 100% rename from test/other/deepseek/src/mlp.cpp rename to benchmarks/models/deepseekv3/src/mlp.cpp diff --git a/test/other/deepseek/src/moe.cpp b/benchmarks/models/deepseekv3/src/moe.cpp similarity index 100% rename from test/other/deepseek/src/moe.cpp rename to benchmarks/models/deepseekv3/src/moe.cpp diff --git a/test/other/deepseek/src/permute.cpp b/benchmarks/models/deepseekv3/src/permute.cpp similarity index 100% rename from test/other/deepseek/src/permute.cpp rename to benchmarks/models/deepseekv3/src/permute.cpp diff --git a/test/other/deepseek/src/projection.cpp b/benchmarks/models/deepseekv3/src/projection.cpp similarity index 100% rename from test/other/deepseek/src/projection.cpp rename to benchmarks/models/deepseekv3/src/projection.cpp diff --git a/test/other/deepseek/src/rmsnorm.cpp b/benchmarks/models/deepseekv3/src/rmsnorm.cpp similarity index 100% rename from test/other/deepseek/src/rmsnorm.cpp rename to benchmarks/models/deepseekv3/src/rmsnorm.cpp diff --git a/test/other/deepseek/src/rope.cpp b/benchmarks/models/deepseekv3/src/rope.cpp similarity index 100% rename from test/other/deepseek/src/rope.cpp rename to benchmarks/models/deepseekv3/src/rope.cpp diff --git a/test/other/deepseek/src/split.cpp b/benchmarks/models/deepseekv3/src/split.cpp similarity index 100% rename from test/other/deepseek/src/split.cpp rename to benchmarks/models/deepseekv3/src/split.cpp diff --git a/test/other/deepseek/src/topk.cpp b/benchmarks/models/deepseekv3/src/topk.cpp similarity index 100% rename from test/other/deepseek/src/topk.cpp rename to benchmarks/models/deepseekv3/src/topk.cpp diff --git a/test/other/deepseek/src/transformer.cpp b/benchmarks/models/deepseekv3/src/transformer.cpp similarity index 100% rename from test/other/deepseek/src/transformer.cpp rename to benchmarks/models/deepseekv3/src/transformer.cpp diff --git a/test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp b/benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp similarity index 84% rename from test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp rename to benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp +++ b/benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h b/benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h rename to benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h diff --git a/test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp b/benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp similarity index 84% rename from test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp rename to benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp +++ b/benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h b/benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h rename to benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h diff --git a/test/accelerator/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h b/benchmarks/npu/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h rename to benchmarks/npu/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h diff --git a/test/accelerator/cube/Makefile b/benchmarks/npu/cube/Makefile similarity index 100% rename from test/accelerator/cube/Makefile rename to benchmarks/npu/cube/Makefile diff --git a/test/accelerator/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp similarity index 81% rename from test/accelerator/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp rename to benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp index 68e3aeb..fa67648 100644 --- a/test/accelerator/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp +++ b/benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A4W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h similarity index 100% rename from test/accelerator/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h rename to benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h diff --git a/test/accelerator/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp similarity index 81% rename from test/accelerator/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp rename to benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp index 68e3aeb..fa67648 100644 --- a/test/accelerator/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp +++ b/benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A4W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h similarity index 100% rename from test/accelerator/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h rename to benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h diff --git a/test/accelerator/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp similarity index 81% rename from test/accelerator/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp rename to benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp index 68e3aeb..fa67648 100644 --- a/test/accelerator/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp +++ b/benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A4W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h similarity index 100% rename from test/accelerator/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h rename to benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h diff --git a/test/accelerator/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp similarity index 81% rename from test/accelerator/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp rename to benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp index 68e3aeb..fa67648 100644 --- a/test/accelerator/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp +++ b/benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A4W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h similarity index 100% rename from test/accelerator/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h rename to benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h diff --git a/test/accelerator/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp similarity index 81% rename from test/accelerator/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp rename to benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp index 68e3aeb..fa67648 100644 --- a/test/accelerator/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp +++ b/benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A4W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h similarity index 100% rename from test/accelerator/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h rename to benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h diff --git a/test/accelerator/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp similarity index 81% rename from test/accelerator/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp rename to benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp index 68e3aeb..fa67648 100644 --- a/test/accelerator/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp +++ b/benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A4W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h similarity index 100% rename from test/accelerator/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h rename to benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h diff --git a/test/accelerator/cube/compile.all b/benchmarks/npu/cube/compile.all similarity index 100% rename from test/accelerator/cube/compile.all rename to benchmarks/npu/cube/compile.all diff --git a/test/accelerator/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h b/benchmarks/npu/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h rename to benchmarks/npu/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h diff --git a/test/accelerator/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp b/benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp similarity index 84% rename from test/accelerator/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp rename to benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp index 4f15ff2..a6a4d4d 100644 --- a/test/accelerator/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp +++ b/benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h b/benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h rename to benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h diff --git a/test/accelerator/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp b/benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp similarity index 83% rename from test/accelerator/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp rename to benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp index 07dc46c..d3f66ec 100644 --- a/test/accelerator/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp +++ b/benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp @@ -1,6 +1,6 @@ #include #include "params_A16W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __half a[MPC*KPC]; diff --git a/test/accelerator/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h b/benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h similarity index 100% rename from test/accelerator/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h rename to benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h diff --git a/test/accelerator/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp b/benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp similarity index 81% rename from test/accelerator/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp rename to benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp index af1359a..839ca1a 100644 --- a/test/accelerator/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp +++ b/benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp b/benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp similarity index 83% rename from test/accelerator/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp rename to benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp index 07dc46c..d3f66ec 100644 --- a/test/accelerator/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp +++ b/benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp @@ -1,6 +1,6 @@ #include #include "params_A16W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __half a[MPC*KPC]; diff --git a/test/accelerator/cube/llama_train_mm_2_A16W8/params_A16W8.h b/benchmarks/npu/cube/llama_train_mm_2_A16W8/params_A16W8.h similarity index 100% rename from test/accelerator/cube/llama_train_mm_2_A16W8/params_A16W8.h rename to benchmarks/npu/cube/llama_train_mm_2_A16W8/params_A16W8.h diff --git a/test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp b/benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp similarity index 81% rename from test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp rename to benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp index 48dd040..21467ee 100644 --- a/test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp +++ b/benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h b/benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h similarity index 100% rename from test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h rename to benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h diff --git a/test/accelerator/cube/llava1_6_6/llava1_6_6.cpp b/benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp similarity index 83% rename from test/accelerator/cube/llava1_6_6/llava1_6_6.cpp rename to benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp index 07dc46c..d3f66ec 100644 --- a/test/accelerator/cube/llava1_6_6/llava1_6_6.cpp +++ b/benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp @@ -1,6 +1,6 @@ #include #include "params_A16W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __half a[MPC*KPC]; diff --git a/test/accelerator/cube/llava1_6_6/params_A16W8.h b/benchmarks/npu/cube/llava1_6_6/params_A16W8.h similarity index 100% rename from test/accelerator/cube/llava1_6_6/params_A16W8.h rename to benchmarks/npu/cube/llava1_6_6/params_A16W8.h diff --git a/test/accelerator/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp b/benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp similarity index 100% rename from test/accelerator/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp rename to benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp diff --git a/test/accelerator/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp b/benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp similarity index 84% rename from test/accelerator/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp rename to benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp +++ b/benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h b/benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h rename to benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h diff --git a/test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp b/benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp similarity index 84% rename from test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp rename to benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp +++ b/benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h b/benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h rename to benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h diff --git a/test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp b/benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp similarity index 84% rename from test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp rename to benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp +++ b/benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h b/benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h rename to benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h diff --git a/test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp b/benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp similarity index 81% rename from test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp rename to benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp index 48dd040..21467ee 100644 --- a/test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp +++ b/benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h b/benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h similarity index 100% rename from test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h rename to benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h diff --git a/test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp b/benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp similarity index 81% rename from test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp rename to benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp index 48dd040..21467ee 100644 --- a/test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp +++ b/benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W4.h" -#include "../../include/accelerator_cube.h" +#include int main(){ float a[M*K]; diff --git a/test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h b/benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h similarity index 100% rename from test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h rename to benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h similarity index 100% rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp similarity index 83% rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp index 07dc46c..d3f66ec 100644 --- a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp +++ b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp @@ -1,6 +1,6 @@ #include #include "params_A16W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __half a[MPC*KPC]; diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h similarity index 100% rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp similarity index 84% rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp +++ b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h similarity index 100% rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp similarity index 84% rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp index fe5bb69..517344e 100644 --- a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp +++ b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp @@ -1,6 +1,6 @@ #include #include "params_mx_A8W8.h" -#include "../../include/accelerator_cube.h" +#include int main(){ __fp8_e4m3 a[MPC*KPC]; diff --git a/test/accelerator/fusion/Makefile b/benchmarks/npu/fusion/Makefile similarity index 100% rename from test/accelerator/fusion/Makefile rename to benchmarks/npu/fusion/Makefile diff --git a/test/accelerator/fusion/compile.all b/benchmarks/npu/fusion/compile.all similarity index 100% rename from test/accelerator/fusion/compile.all rename to benchmarks/npu/fusion/compile.all diff --git a/test/accelerator/fusion/compile_fusion_2d_unroll.all b/benchmarks/npu/fusion/compile_fusion_2d_unroll.all similarity index 100% rename from test/accelerator/fusion/compile_fusion_2d_unroll.all rename to benchmarks/npu/fusion/compile_fusion_2d_unroll.all diff --git a/test/accelerator/fusion/compile_fusion_dcore.all b/benchmarks/npu/fusion/compile_fusion_dcore.all similarity index 100% rename from test/accelerator/fusion/compile_fusion_dcore.all rename to benchmarks/npu/fusion/compile_fusion_dcore.all diff --git a/test/accelerator/fusion/compile_fusion_dynamic.all b/benchmarks/npu/fusion/compile_fusion_dynamic.all similarity index 100% rename from test/accelerator/fusion/compile_fusion_dynamic.all rename to benchmarks/npu/fusion/compile_fusion_dynamic.all diff --git a/test/accelerator/fusion/compile_fusion_fp4.all b/benchmarks/npu/fusion/compile_fusion_fp4.all similarity index 100% rename from test/accelerator/fusion/compile_fusion_fp4.all rename to benchmarks/npu/fusion/compile_fusion_fp4.all diff --git a/test/accelerator/fusion/dynamic.list b/benchmarks/npu/fusion/dynamic.list similarity index 100% rename from test/accelerator/fusion/dynamic.list rename to benchmarks/npu/fusion/dynamic.list diff --git a/test/accelerator/fusion/fa1/fa1.cpp b/benchmarks/npu/fusion/fa1/fa1.cpp similarity index 98% rename from test/accelerator/fusion/fa1/fa1.cpp rename to benchmarks/npu/fusion/fa1/fa1.cpp index c155569..911b983 100644 --- a/test/accelerator/fusion/fa1/fa1.cpp +++ b/benchmarks/npu/fusion/fa1/fa1.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #include "fileop.h" diff --git a/test/accelerator/fusion/fa10/fa10.cpp b/benchmarks/npu/fusion/fa10/fa10.cpp similarity index 98% rename from test/accelerator/fusion/fa10/fa10.cpp rename to benchmarks/npu/fusion/fa10/fa10.cpp index 84f75a5..b219a36 100644 --- a/test/accelerator/fusion/fa10/fa10.cpp +++ b/benchmarks/npu/fusion/fa10/fa10.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa11/fa11.cpp b/benchmarks/npu/fusion/fa11/fa11.cpp similarity index 95% rename from test/accelerator/fusion/fa11/fa11.cpp rename to benchmarks/npu/fusion/fa11/fa11.cpp index 42b3d88..d4967c4 100644 --- a/test/accelerator/fusion/fa11/fa11.cpp +++ b/benchmarks/npu/fusion/fa11/fa11.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" //need with bytemask diff --git a/test/accelerator/fusion/fa2/fa2.cpp b/benchmarks/npu/fusion/fa2/fa2.cpp similarity index 98% rename from test/accelerator/fusion/fa2/fa2.cpp rename to benchmarks/npu/fusion/fa2/fa2.cpp index 05604f6..803696a 100644 --- a/test/accelerator/fusion/fa2/fa2.cpp +++ b/benchmarks/npu/fusion/fa2/fa2.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa3/fa3.cpp b/benchmarks/npu/fusion/fa3/fa3.cpp similarity index 98% rename from test/accelerator/fusion/fa3/fa3.cpp rename to benchmarks/npu/fusion/fa3/fa3.cpp index 2d45c86..d41cc9c 100644 --- a/test/accelerator/fusion/fa3/fa3.cpp +++ b/benchmarks/npu/fusion/fa3/fa3.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa4/fa4.cpp b/benchmarks/npu/fusion/fa4/fa4.cpp similarity index 98% rename from test/accelerator/fusion/fa4/fa4.cpp rename to benchmarks/npu/fusion/fa4/fa4.cpp index d2c10e0..80431d0 100644 --- a/test/accelerator/fusion/fa4/fa4.cpp +++ b/benchmarks/npu/fusion/fa4/fa4.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa5/fa5.cpp b/benchmarks/npu/fusion/fa5/fa5.cpp similarity index 98% rename from test/accelerator/fusion/fa5/fa5.cpp rename to benchmarks/npu/fusion/fa5/fa5.cpp index 8511996..e4be6ca 100644 --- a/test/accelerator/fusion/fa5/fa5.cpp +++ b/benchmarks/npu/fusion/fa5/fa5.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa6/fa6.cpp b/benchmarks/npu/fusion/fa6/fa6.cpp similarity index 98% rename from test/accelerator/fusion/fa6/fa6.cpp rename to benchmarks/npu/fusion/fa6/fa6.cpp index 1be9a70..04670c0 100644 --- a/test/accelerator/fusion/fa6/fa6.cpp +++ b/benchmarks/npu/fusion/fa6/fa6.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa7/fa7.cpp b/benchmarks/npu/fusion/fa7/fa7.cpp similarity index 98% rename from test/accelerator/fusion/fa7/fa7.cpp rename to benchmarks/npu/fusion/fa7/fa7.cpp index d35d506..0d9e0ce 100644 --- a/test/accelerator/fusion/fa7/fa7.cpp +++ b/benchmarks/npu/fusion/fa7/fa7.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa8/fa8.cpp b/benchmarks/npu/fusion/fa8/fa8.cpp similarity index 98% rename from test/accelerator/fusion/fa8/fa8.cpp rename to benchmarks/npu/fusion/fa8/fa8.cpp index 799638c..b1d264e 100644 --- a/test/accelerator/fusion/fa8/fa8.cpp +++ b/benchmarks/npu/fusion/fa8/fa8.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #define B 1 diff --git a/test/accelerator/fusion/fa9/fa9.cpp b/benchmarks/npu/fusion/fa9/fa9.cpp similarity index 98% rename from test/accelerator/fusion/fa9/fa9.cpp rename to benchmarks/npu/fusion/fa9/fa9.cpp index 1cced45..6ea94aa 100644 --- a/test/accelerator/fusion/fa9/fa9.cpp +++ b/benchmarks/npu/fusion/fa9/fa9.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" //need with philox diff --git a/test/accelerator/fusion/fa_fp4/fa_fp4.cpp b/benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp similarity index 98% rename from test/accelerator/fusion/fa_fp4/fa_fp4.cpp rename to benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp index dce68c7..8f0c80d 100644 --- a/test/accelerator/fusion/fa_fp4/fa_fp4.cpp +++ b/benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_fusion.h" +#include #include "benchmark.h" #include "fileop.h" diff --git a/test/accelerator/fusion/flashmla13/flashmla13.cpp b/benchmarks/npu/fusion/flashmla13/flashmla13.cpp similarity index 100% rename from test/accelerator/fusion/flashmla13/flashmla13.cpp rename to benchmarks/npu/fusion/flashmla13/flashmla13.cpp diff --git a/test/accelerator/fusion/opt.list b/benchmarks/npu/fusion/opt.list similarity index 100% rename from test/accelerator/fusion/opt.list rename to benchmarks/npu/fusion/opt.list diff --git a/test/accelerator/fusion/program.list b/benchmarks/npu/fusion/program.list similarity index 100% rename from test/accelerator/fusion/program.list rename to benchmarks/npu/fusion/program.list diff --git a/test/accelerator/fusion/simall.py b/benchmarks/npu/fusion/simall.py similarity index 100% rename from test/accelerator/fusion/simall.py rename to benchmarks/npu/fusion/simall.py diff --git a/test/accelerator/nddma/Makefile b/benchmarks/npu/nddma/Makefile similarity index 100% rename from test/accelerator/nddma/Makefile rename to benchmarks/npu/nddma/Makefile diff --git a/test/accelerator/nddma/compile_transpose.all b/benchmarks/npu/nddma/compile_transpose.all similarity index 100% rename from test/accelerator/nddma/compile_transpose.all rename to benchmarks/npu/nddma/compile_transpose.all diff --git a/test/accelerator/nddma/transpose_053_mgather/transpose_053_mgather.cpp b/benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp similarity index 96% rename from test/accelerator/nddma/transpose_053_mgather/transpose_053_mgather.cpp rename to benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp index e9152fe..b481f34 100644 --- a/test/accelerator/nddma/transpose_053_mgather/transpose_053_mgather.cpp +++ b/benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_transpose.h" +#include #include "benchmark.h" #include "fileop.h" @@ -71,7 +71,7 @@ int main() TableGT dstGlobal(gDst); MGATHER(elemTile, srcGlobal, loadIdxTile16); - // TCOPYIN(elemTile, srcGlobal); + // TLOAD(elemTile, srcGlobal); MSCATTER(dstGlobal, elemTile, storeIdxTile16); } diff --git a/test/accelerator/nddma/transpose_053_tload/transpose_053_tload.cpp b/benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp similarity index 95% rename from test/accelerator/nddma/transpose_053_tload/transpose_053_tload.cpp rename to benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp index 1d5767b..abb9929 100644 --- a/test/accelerator/nddma/transpose_053_tload/transpose_053_tload.cpp +++ b/benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp @@ -54,8 +54,8 @@ int main() gm_shape_out gDst(out + dst_slice_offset); tile_shape tmp; - TCOPYIN(tmp, gSrc); - TCOPYOUT(gDst, tmp); + TLOAD(tmp, gSrc); + TSTORE(gDst, tmp); } } } diff --git a/test/accelerator/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp b/benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp similarity index 82% rename from test/accelerator/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp rename to benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp index cc178b9..e3d929f 100644 --- a/test/accelerator/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp +++ b/benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #define kM 1024 #define kN 1024 diff --git a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp similarity index 95% rename from test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp rename to benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp index f37b51d..4b4e445 100644 --- a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp +++ b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp similarity index 95% rename from test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp rename to benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp index 2878ed5..4d8e7dc 100644 --- a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp +++ b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp similarity index 95% rename from test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp rename to benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp index 43043b9..0e26fc9 100644 --- a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp +++ b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/Makefile b/benchmarks/npu/vec_simd/Makefile similarity index 100% rename from test/accelerator/vec_simd/Makefile rename to benchmarks/npu/vec_simd/Makefile diff --git a/test/accelerator/vec_simd/compile.all b/benchmarks/npu/vec_simd/compile.all similarity index 100% rename from test/accelerator/vec_simd/compile.all rename to benchmarks/npu/vec_simd/compile.all diff --git a/test/accelerator/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp b/benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp similarity index 95% rename from test/accelerator/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp rename to benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp index e8a2da8..89dee66 100644 --- a/test/accelerator/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp +++ b/benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp b/benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp similarity index 95% rename from test/accelerator/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp rename to benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp index ab6a0ff..d72a518 100644 --- a/test/accelerator/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp +++ b/benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp b/benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp similarity index 95% rename from test/accelerator/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp rename to benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp index 271d9fc..46e8675 100644 --- a/test/accelerator/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp +++ b/benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp similarity index 93% rename from test/accelerator/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp rename to benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp index 3dfb6c2..74b4735 100644 --- a/test/accelerator/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp +++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp similarity index 93% rename from test/accelerator/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp rename to benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp index 93ef753..1c50c92 100644 --- a/test/accelerator/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp +++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp similarity index 93% rename from test/accelerator/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp rename to benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp index 3194b39..f170d87 100644 --- a/test/accelerator/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp +++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp similarity index 93% rename from test/accelerator/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp rename to benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp index 5d782f3..ed31dac 100644 --- a/test/accelerator/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp +++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp b/benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp similarity index 95% rename from test/accelerator/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp rename to benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp index 1240850..6ea2179 100644 --- a/test/accelerator/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp +++ b/benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp b/benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp similarity index 94% rename from test/accelerator/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp rename to benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp index 5a5024f..97100bf 100644 --- a/test/accelerator/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp +++ b/benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" @@ -18,7 +18,7 @@ void softmax_local(__half* dst, __half* src){ gm_shape gsrc(src); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); + TLOAD(tsrc, gsrc); tMax tLocalMax; TROWMAX(tLocalMax, tsrc); @@ -42,7 +42,7 @@ void softmax_local(__half* dst, __half* src){ TDIV(tres, tExp, tSumExpand); gm_shape gdst(dst); - TCOPYOUT(gdst, tres); + TSTORE(gdst, tres); } int main(){ diff --git a/test/accelerator/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp b/benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp similarity index 94% rename from test/accelerator/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp rename to benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp index 8a5be54..0a8ff03 100644 --- a/test/accelerator/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp +++ b/benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp similarity index 94% rename from test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp rename to benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp index a260789..9075da5 100644 --- a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp +++ b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp similarity index 94% rename from test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp rename to benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp index 9bac6bb..bf861a2 100644 --- a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp +++ b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp b/benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp similarity index 96% rename from test/accelerator/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp rename to benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp index f952193..fb080ac 100644 --- a/test/accelerator/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp +++ b/benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simd.h" +#include #include "benchmark.h" #include "common.h" diff --git a/test/accelerator/vec_simt/Makefile b/benchmarks/npu/vec_simt/Makefile similarity index 64% rename from test/accelerator/vec_simt/Makefile rename to benchmarks/npu/vec_simt/Makefile index 80e8b22..fca51d5 100644 --- a/test/accelerator/vec_simt/Makefile +++ b/benchmarks/npu/vec_simt/Makefile @@ -1,13 +1,15 @@ +.DEFAULT_GOAL := all + TARGET = $(ELF_HEAD)_$(TESTCASE).elf SRC_FILE += $(TEST_ROOT)/$(CATEGORY)/$(TESTCASE)/$(TESTCASE).cpp # Special handling for hashfind - embed data as object files -EXTRA_OBJ_FILES := -EXTRA_OBJ_DEPS := +EXTRA_OBJ_FILES = +EXTRA_OBJ_DEPS = # Data object files location (relative paths) DATA_OBJ_DIR := hashfind/data_obj -OUTPUT_DATA_OBJ_DIR := ../../output/accelerator/vec_simt/hashfind/data_obj +OUTPUT_DATA_OBJ_DIR = $(OBJ_ROOT)/$(CATEGORY)/hashfind/data_obj # hashfind uses embedded data (simple_ dataset) ifeq ($(TESTCASE), hashfind) @@ -20,11 +22,6 @@ pre_work: build_data_objs build_data_objs: @COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR) -# Pattern rule so make doesn't use a generic implicit rule for .s → .o in the data_obj dir -$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work - @mkdir -p $(shell dirname $@) - $(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@ - endif # hashfind_simple also uses embedded data (simple_ prefixed) @@ -38,11 +35,11 @@ pre_work: build_data_objs build_data_objs: @COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR) -# Pattern rule so make doesn't use a generic implicit rule for .s → .o in the data_obj dir -$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work - @mkdir -p $(shell dirname $@) - $(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@ - endif -include ../../common/Makefile.common \ No newline at end of file +include ../../common/Makefile.common + +ifneq ($(EXTRA_OBJ_FILES),) +$(EXTRA_OBJ_FILES): pre_work + @true +endif diff --git a/benchmarks/npu/vec_simt/compile.all b/benchmarks/npu/vec_simt/compile.all new file mode 100755 index 0000000..1f7b2fc --- /dev/null +++ b/benchmarks/npu/vec_simt/compile.all @@ -0,0 +1,5 @@ +#! /bin/bash + +make TESTCASE=npu_hashtable_insert_cmp_host +make TESTCASE=npu_hashtable_lookup_cmp_host +make TESTCASE=hashfind diff --git a/test/kernel/control/hashtable_lookup_simd/compute_offsets.py b/benchmarks/npu/vec_simt/hashfind/compute_offsets.py similarity index 100% rename from test/kernel/control/hashtable_lookup_simd/compute_offsets.py rename to benchmarks/npu/vec_simt/hashfind/compute_offsets.py diff --git a/benchmarks/npu/vec_simt/hashfind/data_obj/.gitignore b/benchmarks/npu/vec_simt/hashfind/data_obj/.gitignore new file mode 100644 index 0000000..dbf14ab --- /dev/null +++ b/benchmarks/npu/vec_simt/hashfind/data_obj/.gitignore @@ -0,0 +1,3 @@ +*.s +*.o +*.data diff --git a/test/accelerator/vec_simt/hashfind/data_obj/build_data_obj.sh b/benchmarks/npu/vec_simt/hashfind/data_obj/build_data_obj.sh similarity index 53% rename from test/accelerator/vec_simt/hashfind/data_obj/build_data_obj.sh rename to benchmarks/npu/vec_simt/hashfind/data_obj/build_data_obj.sh index 095eed2..e4cbbff 100755 --- a/test/accelerator/vec_simt/hashfind/data_obj/build_data_obj.sh +++ b/benchmarks/npu/vec_simt/hashfind/data_obj/build_data_obj.sh @@ -1,10 +1,21 @@ #!/bin/bash -COMPILER_DIR="${COMPILER_DIR:-/remote/lms60/c00622284/janus/linxisa_compiler_v0.55/linx_blockisa_llvm_musl/bin}" -DATA_OBJ_DIR="$1" -OUTPUT_DIR="$2" +set -euo pipefail + +COMPILER_DIR="${COMPILER_DIR:-/usr/bin}" +LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}" +DATA_OBJ_DIR="${1:?data object directory required}" +OUTPUT_DIR="${2:?output directory required}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" mkdir -p "$OUTPUT_DIR" +if [[ ! -f "${DATA_OBJ_DIR}/simple_inserted_slot.data" || + ! -f "${DATA_OBJ_DIR}/simple_lookup_keys.data" || + ! -f "${DATA_OBJ_DIR}/simple_lookup_values.data" ]]; then + (cd "$CASE_DIR" && python3 gen_data_simple.py) +fi + build_one() { local name="$1" local data_file="${DATA_OBJ_DIR}/${name}.data" @@ -26,16 +37,12 @@ _binary_${name}_data_end: .equ _binary_${name}_data_size, .-_binary_${name}_data_start EOF - $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file" + "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file" } -build_one "inserted_slot" -build_one "lookup_keys" -build_one "lookup_values" - # Simple dataset (8192 entries, 80% load, 1024 queries) build_one "simple_inserted_slot" build_one "simple_lookup_keys" build_one "simple_lookup_values" -echo "Done building data object files" \ No newline at end of file +echo "Done building data object files" diff --git a/test/accelerator/vec_simt/hashfind/gen_data_simple.py b/benchmarks/npu/vec_simt/hashfind/gen_data_simple.py similarity index 100% rename from test/accelerator/vec_simt/hashfind/gen_data_simple.py rename to benchmarks/npu/vec_simt/hashfind/gen_data_simple.py diff --git a/test/accelerator/vec_simt/hashfind/hashfind.cpp b/benchmarks/npu/vec_simt/hashfind/hashfind.cpp similarity index 99% rename from test/accelerator/vec_simt/hashfind/hashfind.cpp rename to benchmarks/npu/vec_simt/hashfind/hashfind.cpp index 88a46cd..46b7522 100644 --- a/test/accelerator/vec_simt/hashfind/hashfind.cpp +++ b/benchmarks/npu/vec_simt/hashfind/hashfind.cpp @@ -282,14 +282,14 @@ void loadKeys(typename HashFindTypes::TileU32& lowTile, TileU16 offsetLowTile, offsetHighTile; OffsetGT offsetLowGlobal(g_offset_low); OffsetGT offsetHighGlobal(g_offset_high); - TCOPYIN(offsetLowTile, offsetLowGlobal); - TCOPYIN(offsetHighTile, offsetHighGlobal); + TLOAD(offsetLowTile, offsetLowGlobal); + TLOAD(offsetHighTile, offsetHighGlobal); KeyGT keysGlobal(queries); MGATHER(lowTile, keysGlobal, offsetLowTile); MGATHER(highTile, keysGlobal, offsetHighTile); - TCOPYIN(queryKeyTile, keysGlobal); + TLOAD(queryKeyTile, keysGlobal); } // ============================================================================ @@ -487,12 +487,12 @@ void runHashFind(int32_t __out__ *out, // Load the 256 distinct update values, then MSCATTER to the table using UpdGT = GlobalTensor, Stride<1,1,1,kTileCols,1>>; UpdGT updGlobal(update_values); - TCOPYIN(updateTile, updGlobal); + TLOAD(updateTile, updGlobal); updateValues(updateTile, foundIdxTile, table); } TileGT outGlobal(out); - TCOPYOUT(outGlobal, outTile); + TSTORE(outGlobal, outTile); } template diff --git a/test/accelerator/vec_simt/accel_hashtable_insert_cmp_host/accel_hashtable_insert_cmp_host.cpp b/benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp similarity index 94% rename from test/accelerator/vec_simt/accel_hashtable_insert_cmp_host/accel_hashtable_insert_cmp_host.cpp rename to benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp index c487c92..c1e6ac7 100644 --- a/test/accelerator/vec_simt/accel_hashtable_insert_cmp_host/accel_hashtable_insert_cmp_host.cpp +++ b/benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simt.h" +#include #include "benchmark.h" #define INSERT_NUM 4096 diff --git a/test/accelerator/vec_simt/accel_hashtable_lookup_cmp_host/accel_hashtable_lookup_cmp_host.cpp b/benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp similarity index 93% rename from test/accelerator/vec_simt/accel_hashtable_lookup_cmp_host/accel_hashtable_lookup_cmp_host.cpp rename to benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp index 41296ce..db93a2b 100644 --- a/test/accelerator/vec_simt/accel_hashtable_lookup_cmp_host/accel_hashtable_lookup_cmp_host.cpp +++ b/benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp @@ -1,5 +1,5 @@ #include -#include "../../include/accelerator_vec_simt.h" +#include #include "benchmark.h" #define LOOKUP_NUM 4096 diff --git a/test/run_ci.py b/benchmarks/run_ci.py similarity index 94% rename from test/run_ci.py rename to benchmarks/run_ci.py index 090c65d..1057a8d 100755 --- a/test/run_ci.py +++ b/benchmarks/run_ci.py @@ -8,7 +8,7 @@ import subprocess def compile(): - os.chdir(os.path.dirname(__file__)+"/tileop_api") + os.chdir(os.path.join(os.path.dirname(__file__), "api", "tileop")) print(os.getcwd()) subprocess.run("./compile.all", shell=True) return @@ -18,7 +18,6 @@ def run(): return def verify(): - return if __name__ == '__main__': @@ -39,5 +38,3 @@ def verify(): compile() run() verify() - - \ No newline at end of file diff --git a/benchmarks/scripts/legacy_batch/bench_all.sh b/benchmarks/scripts/legacy_batch/bench_all.sh new file mode 100755 index 0000000..8256298 --- /dev/null +++ b/benchmarks/scripts/legacy_batch/bench_all.sh @@ -0,0 +1,26 @@ +#!/bin/bash +set -e +set -x +set -o pipefail + +cd $(dirname $0)/../.. + +export CC_OPT=default + +python3 benchmarks/scripts/legacy_batch/run_compile.py +ERRS=$(grep fail: benchmarks/cm_log/compile_summary.log | awk '{print $2}') +PASS=$(($ERRS <= 4)) +if [[ x"$PASS" != x"1" ]]; then + cat benchmarks/cm_log/compile_summary.log + exit 1 +fi + +# ELF_LIST="output/api/tileop/elf/*.elf output/microbench/lmbench/elf/*.elf output/kernels/composite/elf/*.elf output/models/deepseekv3/elf/*.elf" +# +# realpath $ELF_LIST > tmp.list +# +# if [[ -f $QEMU ]]; then +# ARGS="$ARGS -m $QEMU" +# fi +# python3 benchmarks/scripts/legacy_batch/run_qemu.py -i tmp.list -o benchmarks/cm_log/qemu_run.log $ARGS +# rm -f tmp.list diff --git a/test/other/scripts/run_ci.py b/benchmarks/scripts/legacy_batch/run_ci.py similarity index 89% rename from test/other/scripts/run_ci.py rename to benchmarks/scripts/legacy_batch/run_ci.py index 5fd0388..35bf327 100755 --- a/test/other/scripts/run_ci.py +++ b/benchmarks/scripts/legacy_batch/run_ci.py @@ -8,10 +8,11 @@ import subprocess from pathlib import Path -OTHER_ROOT = Path(__file__).resolve().parent.parent +REPO_ROOT = Path(__file__).resolve().parents[3] +BENCHMARK_ROOT = REPO_ROOT / "benchmarks" def compile(): - compile_dir = OTHER_ROOT / "tileop_api" + compile_dir = BENCHMARK_ROOT / "api" / "tileop" print(compile_dir) subprocess.run("./compile.all", cwd=compile_dir, shell=True) return diff --git a/test/other/scripts/run_compile.py b/benchmarks/scripts/legacy_batch/run_compile.py similarity index 85% rename from test/other/scripts/run_compile.py rename to benchmarks/scripts/legacy_batch/run_compile.py index a9bbd6a..1140fc8 100755 --- a/test/other/scripts/run_compile.py +++ b/benchmarks/scripts/legacy_batch/run_compile.py @@ -14,33 +14,33 @@ MAX_WORKERS = 20 #parallel thread num depend on your machine REPO_ROOT = Path(__file__).resolve().parents[3] -TEST_ROOT = REPO_ROOT / "test" -CM_LOG_DIR = TEST_ROOT / "cm_log" +BENCHMARK_ROOT = REPO_ROOT / "benchmarks" +CM_LOG_DIR = BENCHMARK_ROOT / "cm_log" compile_result = {"pass":[], "fail":[], "timeout":[]} compile_list = [ - "other/tileop_test/compile.all", - "kernel/orther/compile_softmax.all", - "kernel/orther/compile_gemm.all", - "kernel/orther/compile_linear.all", - "kernel/orther/compile_matmul.all", - "kernel/orther/compile_flash_attention.all", - "other/lmbench/compile_mem.all", - "other/vec/compile_lat_bw.all", - "other/cube/compile.all", - "other/deepseek/compile.all", - "accelerator/vec_simd/compile.all", - "accelerator/fusion/compile.all", + "api/tileop/compile.all", + "kernels/composite/compile_softmax.all", + "kernels/composite/compile_gemm.all", + "kernels/composite/compile_linear.all", + "kernels/composite/compile_matmul.all", + "kernels/composite/compile_flash_attention.all", + "microbench/lmbench/compile_mem.all", + "microbench/vec/compile_lat_bw.all", + "microbench/cube/compile.all", + "models/deepseekv3/compile.all", + "npu/vec_simd/compile.all", + "npu/fusion/compile.all", ] def cmd_config_parse(list): pass def compile_elf(compile_file): - cmd_path = TEST_ROOT / compile_file + cmd_path = BENCHMARK_ROOT / compile_file cmd_dir = cmd_path.parent - cmd_rel_dir = cmd_dir.relative_to(TEST_ROOT) + cmd_rel_dir = cmd_dir.relative_to(BENCHMARK_ROOT) cmd = f"{cmd_path.stem}_{str(cmd_rel_dir).replace(os.sep, '_')}" print(f"processing {cmd}...") @@ -77,7 +77,7 @@ def compile_elf(compile_file): args = parser.parse_args() os.environ["PLAT"] = args.plat - print(f"test_root is {TEST_ROOT}") + print(f"benchmark_root is {BENCHMARK_ROOT}") if CM_LOG_DIR.exists(): shutil.rmtree(CM_LOG_DIR) CM_LOG_DIR.mkdir(parents=True) diff --git a/test/other/scripts/run_qemu.py b/benchmarks/scripts/legacy_batch/run_qemu.py similarity index 100% rename from test/other/scripts/run_qemu.py rename to benchmarks/scripts/legacy_batch/run_qemu.py diff --git a/test/other/scripts/run_result_check.py b/benchmarks/scripts/legacy_batch/run_result_check.py similarity index 100% rename from test/other/scripts/run_result_check.py rename to benchmarks/scripts/legacy_batch/run_result_check.py diff --git a/test/script/README.md b/benchmarks/scripts/recursive/README.md similarity index 57% rename from test/script/README.md rename to benchmarks/scripts/recursive/README.md index 5b1394a..399e232 100644 --- a/test/script/README.md +++ b/benchmarks/scripts/recursive/README.md @@ -6,7 +6,7 @@ - options: -h show this help message and exit -lib TileOP库的根目录, case: /xx/PTOTileLib/ - -src 需要编译的目录(递归的), case: /xx/PTOTileLib/test/tileop_api/src/ + -src 需要编译的目录(递归的), case: /xx/PTOTileLib/benchmarks/api/tileop/src/ 默认等于lib -m test model: cmp or run, default cmp -lc linx clang++ path, case: /xx/linx_blockisa_llvm/bin/clang++ @@ -29,16 +29,16 @@ run: 编译 + 运行 ## 使用实例 - 编译 cpu_sim版本 -python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -hc /xx/llvm-15.0.4/bin/clang++ +python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -hc /xx/llvm-15.0.4/bin/clang++ - 编译 jcore版本 -python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -lc /xx/linx_blockisa_llvm/bin/clang++ +python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -lc /xx/linx_blockisa_llvm/bin/clang++ - 编译+运行 cpu_sim版本 -python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -hc /xx/llvm-15.0.4/bin/clang++ -m run +python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -hc /xx/llvm-15.0.4/bin/clang++ -m run - 编译+运行+功能验证 jcore版本 -python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -lc /xx/linx_blockisa_llvm/bin/clang++ -hc /xx/llvm-15.0.4/bin/clang++ -qemu /xx/qemu-linx -m run +python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -lc /xx/linx_blockisa_llvm/bin/clang++ -hc /xx/llvm-15.0.4/bin/clang++ -qemu /xx/qemu-linx -m run - 编译+运行+功能验证 单用例 -python3 {$REPO}/test/tileop_api/test_tileop.py -lib {$REPO} -src {$REPO}/test/tileop_api/src -lc {$L_CHAIN}/bin/clang++ -hc {$H_CHAIN}/bin/clang++ -qemu {$QEMU}/qemu-linx -m run -case=Txxx \ No newline at end of file +python3 {$REPO}/benchmarks/scripts/recursive/test.py -lib {$REPO} -src {$REPO}/benchmarks/api/tileop/src -lc {$L_CHAIN}/bin/clang++ -hc {$H_CHAIN}/bin/clang++ -qemu {$QEMU}/qemu-linx -m run -case=Txxx \ No newline at end of file diff --git a/test/script/test.py b/benchmarks/scripts/recursive/test.py similarity index 99% rename from test/script/test.py rename to benchmarks/scripts/recursive/test.py index 47f682d..3158791 100644 --- a/test/script/test.py +++ b/benchmarks/scripts/recursive/test.py @@ -870,7 +870,7 @@ def main( "-src", type=str, default="None", - help="input test dir, case: /xx/Linx-TileOP-API/test/tileop_api/src", + help="input test dir, case: /xx/Linx-TileOP-API/benchmarks/api/tileop/src", ) argParser.add_argument( "-m", type=str, default="cmp", help="test model: cmp or run, default cmp" diff --git a/compiler/linx_blockisa_llvm_musl.tar.gz b/compiler/linx_blockisa_llvm_musl.tar.gz new file mode 100644 index 0000000..ad015cb --- /dev/null +++ b/compiler/linx_blockisa_llvm_musl.tar.gz @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0d7b21b7b0888299562a2b06ff78f2d2e973c0cfdb9bd61aec140686624d2029 +size 821947387 diff --git a/include/aarch64/TCopyIn.hpp b/include/aarch64/TLoad.hpp similarity index 100% rename from include/aarch64/TCopyIn.hpp rename to include/aarch64/TLoad.hpp diff --git a/include/aarch64/TCopyOut.hpp b/include/aarch64/TStore.hpp similarity index 100% rename from include/aarch64/TCopyOut.hpp rename to include/aarch64/TStore.hpp diff --git a/test/accelerator/include/accelerator_cube.h b/include/benchmark_support/npu/npu_cube.h similarity index 89% rename from test/accelerator/include/accelerator_cube.h rename to include/benchmark_support/npu/npu_cube.h index 9b08f02..424a758 100644 --- a/test/accelerator/include/accelerator_cube.h +++ b/include/benchmark_support/npu/npu_cube.h @@ -1,6 +1,6 @@ #include -template void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float *dequant) { using gm_shapeA = global_tensor<__half, RowMajor>; @@ -70,8 +70,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA tA; tile_shapeB tB_ori; tile_shapeB_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -83,8 +83,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA_trows tA; tile_shapeB_tcols tB_ori; tile_shapeB_tcols_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -92,7 +92,7 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float //quant pre VQF322BF16_PRE tile_shapeACC_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -109,8 +109,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA tA; tile_shapeB_trows tB_ori; tile_shapeB_trows_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -121,8 +121,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA_trows tA; tile_shapeB_tcorner tB_ori; tile_shapeB_tcorner_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -130,7 +130,7 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float //quant pre tile_shapeC_trows_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } } if constexpr (rmd_M) { @@ -149,8 +149,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA_tcols tA; tile_shapeB tB_ori; tile_shapeB_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -161,15 +161,15 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA_tcorner tA; tile_shapeB_tcols tB_ori; tile_shapeB_tcols_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } tile_shapeC_tcols_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -177,7 +177,7 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeC_tcorner tACC; tile_shapeA_tcols tA(0); tile_shapeB_trows_cvt tB(0); - MATMUL(tACC, tA, tB); + MATMUL(tACC, tA, tB); #pragma clang loop unroll(full) for (int k = 0; k < Kb; ++k) { auto gA = gAIter(Mb, k); @@ -186,8 +186,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA_tcols tA; tile_shapeB_trows tB_ori; tile_shapeB_trows_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -198,20 +198,20 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float tile_shapeA_tcorner tA; tile_shapeB_tcorner tB_ori; tile_shapeB_tcorner_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } tile_shapeC_tcorner_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } } } -template void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, float *dequant) { using gm_shapeA = global_tensor<__fp8_e4m3, RowMajor>; @@ -277,8 +277,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA tA; tile_shapeB tB_ori; tile_shapeB_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -290,8 +290,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_trows tA; tile_shapeB_tcols tB_ori; tile_shapeB_tcols_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -299,7 +299,7 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, //quant pre VQF322BF16_PRE tile_shapeACC_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -316,8 +316,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA tA; tile_shapeB_trows tB_ori; tile_shapeB_trows_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -328,8 +328,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_trows tA; tile_shapeB_tcorner tB_ori; tile_shapeB_tcorner_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -337,7 +337,7 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, //quant pre tile_shapeC_trows_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } } if constexpr (rmd_M) { @@ -356,8 +356,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcols tA; tile_shapeB tB_ori; tile_shapeB_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -368,15 +368,15 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcorner tA; tile_shapeB_tcols tB_ori; tile_shapeB_tcols_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } tile_shapeC_tcols_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -384,7 +384,7 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeC_tcorner tACC; TileLeft<__half, kTM, kTK, rmd_M, kTK> tA(0); TileRight<__half, kTK, kTN, kTK, rmd_N> tB(0); - MATMUL(tACC, tA, tB); + MATMUL(tACC, tA, tB); #pragma clang loop unroll(full) for (int k = 0; k < Kb; ++k) { auto gA = gAIter(Mb, k); @@ -393,8 +393,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcols tA; tile_shapeB_trows tB_ori; tile_shapeB_trows_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -405,20 +405,20 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcorner tA; tile_shapeB_tcorner tB_ori; tile_shapeB_tcorner_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCVT(tB, tB_ori); MATMACC(tACC, tA, tB); } tile_shapeC_tcorner_cvt tACC_cvt; TCVT(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } } } -template void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, __fp8_e4m3 *amx, __fp8_e4m3 *bmx, float *dequant) { using gm_shapeA = global_tensor<__fp8_e4m3, RowMajor>; @@ -499,8 +499,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA tA; tile_shapeB tB_ori; tile_shapeB_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -512,16 +512,16 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_trows tA; tile_shapeB_tcols tB_ori; tile_shapeB_tcols_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB_ori, gB); + TLOAD(tA, gA); + TLOAD(tB_ori, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } - //quant pre(acc * dequant_scale) + //quant pre(acc * dequant_scale) auto gDQ = gDQIter(0,j); tile_shapeDQ tDQ; - TCOPYIN(tDQ, gDQ); + TLOAD(tDQ, gDQ); tile_shapeACC tDQ_expand; TEXPANDROW(tDQ_expand,tDQ); TMULS(tACC, tACC, tDQ_expand); @@ -531,7 +531,7 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, // TMULS(tACC_scale, tACC_scale, static_cast(2)); tile_shapeACC_cvt tACC_cvt; TCAST(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -548,8 +548,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA tA; tile_shapeB_trows tB_ori; tile_shapeB_trows_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -560,22 +560,22 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_trows tA; tile_shapeB_tcorner tB_ori; tile_shapeB_tcorner_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } - //quant pre(acc * dequant_scale) + //quant pre(acc * dequant_scale) auto gDQ = gDQIter(0,Nb); tile_shapeDQ tDQ; - TCOPYIN(tDQ, gDQ); + TLOAD(tDQ, gDQ); tile_shapeC_trows tDQ_expand; TEXPANDROW(tDQ_expand,tDQ); TMULS(tACC, tACC, tDQ_expand); tile_shapeC_trows_cvt tACC_cvt; TCAST(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } } if constexpr (rmd_M) { @@ -594,8 +594,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcols tA; tile_shapeB tB_ori; tile_shapeB_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -606,8 +606,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcorner tA; tile_shapeB_tcols tB_ori; tile_shapeB_tcols_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -615,13 +615,13 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, //need quantization for C auto gDQ = gDQIter(0,j); tile_shapeDQ tDQ; - TCOPYIN(tDQ, gDQ); + TLOAD(tDQ, gDQ); tile_shapeC_tcols tDQ_expand; TEXPANDROW(tDQ_expand,tDQ); TMULS(tACC, tACC, tDQ_expand); tile_shapeC_tcols_cvt tACC_cvt; TCAST(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -629,7 +629,7 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeC_tcorner tACC; tile_shapeA_tcols tA(0); tile_shapeB_trows_cvt tB(0); - MATMUL(tACC, tA, tB); + MATMUL(tACC, tA, tB); #pragma clang loop unroll(full) for (int k = 0; k < Kb; ++k) { auto gA = gAIter(Mb, k); @@ -638,8 +638,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcols tA; tile_shapeB_trows tB_ori; tile_shapeB_trows_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -650,8 +650,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, tile_shapeA_tcorner tA; tile_shapeB_tcorner tB_ori; tile_shapeB_tcorner_cvt tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TCAST(tB, tB_ori); MATMACC(tACC, tA, tB); } @@ -660,13 +660,13 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, //need quantization for C auto gDQ = gDQIter(0,Nb); tile_shapeDQ tDQ; - TCOPYIN(tDQ, gDQ); + TLOAD(tDQ, gDQ); tile_shapeC_tcorner tDQ_expand; TEXPANDROW(tDQ_expand,tDQ); TMULS(tACC, tACC, tDQ_expand); tile_shapeC_tcorner_cvt tACC_cvt; TCAST(tACC_cvt, tACC); - TCOPYOUT(gC, tACC_cvt); + TSTORE(gC, tACC_cvt); } } } @@ -724,8 +724,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } @@ -735,12 +735,12 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA_trows tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -756,8 +756,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -766,11 +766,11 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA_trows tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } } if constexpr (rmd_M) { @@ -788,8 +788,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -798,11 +798,11 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA_tcorner tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -810,7 +810,7 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeC_tcorner tACC; tile_shapeA_tcols tA(0); tile_shapeB_trows tB(0); - MATMUL(tACC, tA, tB); + MATMUL(tACC, tA, tB); #pragma clang loop unroll(full) for (int k = 0; k < Kb; ++k) { auto gA = gAIter(Mb, k); @@ -818,8 +818,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -828,11 +828,11 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) { tile_shapeA_tcorner tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } } } diff --git a/test/accelerator/include/accelerator_fa_2d_unroll.h b/include/benchmark_support/npu/npu_fa_2d_unroll.h similarity index 98% rename from test/accelerator/include/accelerator_fa_2d_unroll.h rename to include/benchmark_support/npu/npu_fa_2d_unroll.h index d076b17..253448a 100644 --- a/test/accelerator/include/accelerator_fa_2d_unroll.h +++ b/include/benchmark_support/npu/npu_fa_2d_unroll.h @@ -49,9 +49,9 @@ void __vec__ new_max_1src( #ifndef RES_CHECK upd_max = upd_max * src_scale; #endif - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template @@ -100,7 +100,7 @@ void __vec__ new_sum_1src( typename tileSrc::DType exp_src_3 = src_ptr[src_idx_3]; typename tileSrc::DType exp_src_01 = exp_src_0 + exp_src_1; typename tileSrc::DType exp_src_23 = exp_src_2 + exp_src_3; - typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23; + typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23; upd_sum += exp_src_0123; } blkv_get_tile_ptr(new_sum)[sum_idx] = upd_sum; @@ -332,9 +332,9 @@ void __vec__ new_max_4src( upd_max = blkv_max(upd_max, s3_max_0123); } upd_max = upd_max * src_scale; - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template @@ -407,7 +407,7 @@ void __vec__ src_exp_2src_with_local_sum( BLKC_ASSIGN_CAST(src_exp1, idx_2, src1_exp2); BLKC_ASSIGN_CAST(src_exp1, idx_3, src1_exp3); typename tileSum::DType src1_exp_sum = src1_exp0 + src1_exp1 + src1_exp2 + src1_exp3; - + upd_sum += src0_exp_sum + src1_exp_sum; } size_t idx_sum = i * tileSum::RowStride; @@ -450,7 +450,7 @@ void __vec__ new_sum_4src( typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3]; typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1; typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3; - typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; + typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0]; typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1]; @@ -458,7 +458,7 @@ void __vec__ new_sum_4src( typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3]; typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1; typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3; - typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; + typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0]; typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1]; @@ -466,7 +466,7 @@ void __vec__ new_sum_4src( typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3]; typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1; typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3; - typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; + typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0]; typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1]; @@ -532,7 +532,7 @@ void __vec__ local_max_4src( upd_max = blkv_max(upd_max, s0123_max); } upd_max = upd_max * src_scale; - local_max_ptr[max_idx] = upd_max; + local_max_ptr[max_idx] = upd_max; } template @@ -568,7 +568,7 @@ void __vec__ local_sum_4src( typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3]; typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1; typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3; - typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; + typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0]; typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1]; @@ -576,7 +576,7 @@ void __vec__ local_sum_4src( typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3]; typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1; typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3; - typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; + typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0]; typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1]; @@ -584,7 +584,7 @@ void __vec__ local_sum_4src( typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3]; typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1; typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3; - typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; + typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0]; typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1]; @@ -622,7 +622,7 @@ void __vec__ new_max_of_2_loc_max( typename tileMax::DType local_max_01 = blkv_max(local_max_0_ptr[max_idx], local_max_1_ptr[max_idx]); upd_max = blkv_max(upd_max, local_max_01); new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template void __vec__ new_sum_of_2_loc_sum( @@ -677,7 +677,7 @@ void __vec__ new_max_of_4_loc_max( typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23); upd_max = blkv_max(upd_max, local_max_0123); new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template void __vec__ new_sum_of_4_loc_sum( @@ -702,7 +702,7 @@ void __vec__ new_sum_of_4_loc_sum( size_t sum_idx = i*tileSum::RowStride; - new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] + + new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] + local_sum_0_ptr[sum_idx] + local_sum_1_ptr[sum_idx] + local_sum_2_ptr[sum_idx] + local_sum_3_ptr[sum_idx]; } @@ -731,7 +731,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile::DType, kTm, kTk, BLayout::ColMajor>; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -768,7 +768,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype tileQ tQ[Xdim]; - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data(), scale); @@ -888,7 +888,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tNewMax[x].data(), // scale); - + src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -906,7 +906,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -926,7 +926,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -966,7 +966,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype #pragma clang loop unroll(full) for(int y=0;y; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile::DType, kTm, kTk, BLayout::ColMajor>; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -50,7 +50,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d #pragma clang loop unroll(full) for(int x=0;x Nz @@ -233,7 +233,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d TCOLEXPANDMUL_TEPL(tO[x], tO[x], tInvSum[x]); TCAST_TEPL(tO_cast[x], tO[x]); auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]); + TSTORE(dstO, tO_cast[x]); } } } diff --git a/test/accelerator/include/accelerator_fa_dcore.h b/include/benchmark_support/npu/npu_fa_dcore.h similarity index 98% rename from test/accelerator/include/accelerator_fa_dcore.h rename to include/benchmark_support/npu/npu_fa_dcore.h index 80322da..1ed8436 100644 --- a/test/accelerator/include/accelerator_fa_dcore.h +++ b/include/benchmark_support/npu/npu_fa_dcore.h @@ -124,7 +124,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ using tileQ = MultiTile>; // [kTm×qD] using tileK = MultiTile>; // [vD×kTk] // using tileW_out = TileAcc; // [kTm×kTk] - using tileW = MultiTile>; + using tileW = MultiTile>; using tileW_cast = MultiTile>; using tileW_left = MultiTile>; @@ -151,7 +151,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ const float scale = 1.0f / sqrt((float)qD); const int Qb = (Sq + kTm - 1) / kTm; const int Kb = (Skv + kTk - 1) / kTk; - + // 对每个 Q-block (i) for (int i = 0; i < Qb; i += MULTI) { // 加载当前Q块 (仅一次) @@ -161,7 +161,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ auto gQ = gIterQ(i,0); TLOAD2_ND2NZ(tQ.Tiles[1], tQ.Tiles[0], gQ); #else - TCOPYIN(tQ, [&](int t) { return gIterQ(i + t, 0); }); + TLOAD(tQ, [&](int t) { return gIterQ(i + t, 0); }); #endif // 初始化状态: 最大值/指数和/输出累加 @@ -178,7 +178,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ // 加载K_j和V_j auto gK = gIterK(0, j); tileK tK; - TCOPYIN(tK, gK); + TLOAD(tK, gK); // 计算注意力分数块 tileW tW; @@ -207,7 +207,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ // 计算当前块的加权输出: O_j = W * V auto gV = gIterV(j, 0); tileV tV; - TCOPYIN(tV, gV); + TLOAD(tV, gV); MATMUL(tPV, tW_left, tV); if(j==0){ @@ -233,7 +233,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ auto gO = gIterO(i, 0); TSTORE2_DN2DN(gO, tO_cast.Tiles[1], tO_cast.Tiles[0]); #else - TCOPYOUT([&](int t) { return gIterO(i + t, 0); }, tO_cast); + TSTORE([&](int t) { return gIterO(i + t, 0); }, tO_cast); #endif } } diff --git a/test/accelerator/include/accelerator_fa_dynamic.h b/include/benchmark_support/npu/npu_fa_dynamic.h similarity index 98% rename from test/accelerator/include/accelerator_fa_dynamic.h rename to include/benchmark_support/npu/npu_fa_dynamic.h index 0882089..c0be2a3 100644 --- a/test/accelerator/include/accelerator_fa_dynamic.h +++ b/include/benchmark_support/npu/npu_fa_dynamic.h @@ -231,9 +231,9 @@ void __vec__ new_max_4src_dynamic( upd_max = blkv_max(upd_max, s3_max_0123); } upd_max = upd_max * src_scale; - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template @@ -274,7 +274,7 @@ void __vec__ new_sum_4src_dynamic( typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3]; typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1; typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3; - typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; + typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0]; typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1]; @@ -282,7 +282,7 @@ void __vec__ new_sum_4src_dynamic( typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3]; typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1; typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3; - typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; + typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0]; typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1]; @@ -290,7 +290,7 @@ void __vec__ new_sum_4src_dynamic( typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3]; typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1; typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3; - typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; + typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0]; typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1]; @@ -358,7 +358,7 @@ void __vec__ local_max_4src_dynamic( upd_max = blkv_max(upd_max, s0123_max); } upd_max = upd_max * src_scale; - local_max_ptr[max_idx] = upd_max; + local_max_ptr[max_idx] = upd_max; } template @@ -396,7 +396,7 @@ void __vec__ local_sum_4src_dynamic( typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3]; typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1; typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3; - typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; + typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0]; typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1]; @@ -404,7 +404,7 @@ void __vec__ local_sum_4src_dynamic( typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3]; typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1; typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3; - typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; + typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0]; typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1]; @@ -412,7 +412,7 @@ void __vec__ local_sum_4src_dynamic( typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3]; typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1; typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3; - typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; + typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0]; typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1]; @@ -463,7 +463,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_ using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -491,7 +491,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_ int dyn_m = (i+1) * kTm > Sq ? rQ:kTm; tileQ tQ[Xdim]; for (auto& x : tQ) { x = tileQ(dyn_m);} - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data(), scale, tW[0][0].GetValidCol()); @@ -636,7 +636,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_ for(int y=0;y Nz @@ -690,7 +690,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_ for (int x = 0; x < Xdim; ++x) { size_t offset_O = (i+x) * tileO_cast::Rows * vD; gmO dstO(out_ptr+offset_O, Sq); - TCOPYOUT(dstO, tO_cast[x]); + TSTORE(dstO, tO_cast[x]); } i+=Xdim; @@ -895,7 +895,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -922,7 +922,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt for(int x=0;x Skv ? rK:kTk; @@ -951,7 +951,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt for(int y=0;y<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data(), scale, tW[0][0].GetValidCol()); - + src_exp_2src_with_local_sum_dynamic<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale, tW[0][0].GetValidCol()); src_exp_2src_with_local_sum_dynamic<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -1019,7 +1019,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt for(int y=0;y Nz @@ -1066,7 +1066,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt for (int x = 0; x < dyn_Xdim; ++x) { size_t offset_O = (i+x) * tileO_cast::Rows * vD; gmO dstO(out_ptr+offset_O, Sq); - TCOPYOUT(dstO, tO_cast[x]); + TSTORE(dstO, tO_cast[x]); } i+=dyn_Xdim; @@ -1083,7 +1083,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt for(int x=0;x Nz */ \ /* 计算当前块的加权输出: O_j = W * V */ \ @@ -1270,7 +1270,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt for (int x = 0; x < Xdim; ++x) { \ size_t offset_O = (i+x) * tileO_cast::Rows * vD; \ gmO dstO(out_ptr+offset_O, Sq); \ - TCOPYOUT(dstO, tO_cast[x]); \ + TSTORE(dstO, tO_cast[x]); \ } template @@ -1286,7 +1286,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] diff --git a/test/accelerator/include/accelerator_fa_fp4.h b/include/benchmark_support/npu/npu_fa_fp4.h similarity index 100% rename from test/accelerator/include/accelerator_fa_fp4.h rename to include/benchmark_support/npu/npu_fa_fp4.h diff --git a/test/accelerator/include/accelerator_fa_manual.h b/include/benchmark_support/npu/npu_fa_manual.h similarity index 97% rename from test/accelerator/include/accelerator_fa_manual.h rename to include/benchmark_support/npu/npu_fa_manual.h index 691ebdd..8401529 100644 --- a/test/accelerator/include/accelerator_fa_manual.h +++ b/include/benchmark_support/npu/npu_fa_manual.h @@ -37,9 +37,9 @@ void __vec__ new_max_manual( #ifndef RES_CHECK upd_max = upd_max * src_scale; #endif - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template @@ -88,7 +88,7 @@ void __vec__ new_sum_manual( typename tileSrc::DType exp_src_3 = src_ptr[src_idx_3]; typename tileSrc::DType exp_src_01 = exp_src_0 + exp_src_1; typename tileSrc::DType exp_src_23 = exp_src_2 + exp_src_3; - typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23; + typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23; upd_sum += exp_src_0123; } blkv_get_tile_ptr(new_sum)[sum_idx] = upd_sum; @@ -108,7 +108,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -148,7 +148,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v #pragma clang loop unroll(full) for(int x=0;x; // [kTm×qD] using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] - using tileW = Tile; - using tileW_left = TileLeft; + using tileW = Tile; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -40,7 +40,7 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i, 0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -55,8 +55,8 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -94,6 +94,6 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p TCAST(tO_cast, tO); // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } \ No newline at end of file diff --git a/test/accelerator/include/accelerator_fa_opt2.h b/include/benchmark_support/npu/npu_fa_opt2.h similarity index 94% rename from test/accelerator/include/accelerator_fa_opt2.h rename to include/benchmark_support/npu/npu_fa_opt2.h index a2f856d..2f0d157 100644 --- a/test/accelerator/include/accelerator_fa_opt2.h +++ b/include/benchmark_support/npu/npu_fa_opt2.h @@ -9,8 +9,8 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] - using tileW = Tile; - using tileW_left = TileLeft; + using tileW = Tile; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -34,13 +34,13 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p const float scale = 1.0f / sqrt((float)qD); const int Qb = (Sq + kTm - 1) / kTm; const int Kb = (Skv + kTk - 1) / kTk; - + // 对每个 Q-block (i) for (int i = 0; i < Qb; ++i) { // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i,0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -55,8 +55,8 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -94,6 +94,6 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p TCAST(tO_cast, tO); // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } \ No newline at end of file diff --git a/test/accelerator/include/accelerator_fa_opt3.h b/include/benchmark_support/npu/npu_fa_opt3.h similarity index 97% rename from test/accelerator/include/accelerator_fa_opt3.h rename to include/benchmark_support/npu/npu_fa_opt3.h index 70741d6..9ed0634 100644 --- a/test/accelerator/include/accelerator_fa_opt3.h +++ b/include/benchmark_support/npu/npu_fa_opt3.h @@ -124,8 +124,8 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] - using tileW = Tile; - using tileW_left = TileLeft; + using tileW = Tile; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -150,13 +150,13 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p const float scale = 1.0f / sqrt((float)qD); const int Qb = (Sq + kTm - 1) / kTm; const int Kb = (Skv + kTk - 1) / kTk; - + // 对每个 Q-block (i) for (int i = 0; i < Qb; ++i) { // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i,0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -171,8 +171,8 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -180,7 +180,7 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // Nz -> ColMajor tileW tW; - #ifdef TEMPLATE + #ifdef TEMPLATE ACCSCALE_NZ2DN(tW, tW_out, scale); tileMax tNewMax; @@ -225,7 +225,7 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p normalize<<>>(tO_cast.data(), tO.data(), tRescaleO.data(), tSum.data()); // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } @@ -240,7 +240,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, using tileQ = MultiTile>; // [kTm×qD] using tileK = MultiTile>; // [vD×kTk] // using tileW_out = TileAcc; // [kTm×kTk] - using tileW = MultiTile>; + using tileW = MultiTile>; using tileW_left = MultiTile>; // using tileO_out = TileAcc; @@ -266,7 +266,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, const float scale = 1.0f / sqrt((float)qD); const int Qb = (Sq + kTm - 1) / kTm; const int Kb = (Skv + kTk - 1) / kTk; - + // 对每个 Q-block (i) for (int i = 0; i < Qb; i += MULTI) { // 加载当前Q块 (仅一次) @@ -276,7 +276,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, auto gQ = gIterQ(i,0); TLOAD2_ND2NZ(tQ.Tiles[1], tQ.Tiles[0], gQ); #else - TCOPYIN(tQ, [&](int t) { return gIterQ(i + t, 0); }); + TLOAD(tQ, [&](int t) { return gIterQ(i + t, 0); }); #endif // 初始化状态: 最大值/指数和/输出累加 @@ -294,7 +294,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // 加载K_j和V_j auto gK = gIterK(0, j); tileK tK; - TCOPYIN(tK, gK); + TLOAD(tK, gK); // 计算注意力分数块 tileW tW; @@ -330,7 +330,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // 计算当前块的加权输出: O_j = W * V auto gV = gIterV(j, 0); tileV tV; - TCOPYIN(tV, gV); + TLOAD(tV, gV); MATMUL(tO, tW_left, tV); // 更新最大值状态 tMax = tNewMax; @@ -356,7 +356,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, auto gO = gIterO(i, 0); TSTORE2_DN2DN(gO, tO_cast.Tiles[1], tO_cast.Tiles[0]); #else - TCOPYOUT([&](int t) { return gIterO(i + t, 0); }, tO_cast); + TSTORE([&](int t) { return gIterO(i + t, 0); }, tO_cast); #endif } } \ No newline at end of file diff --git a/test/accelerator/include/accelerator_fa_opt4.h b/include/benchmark_support/npu/npu_fa_opt4.h similarity index 97% rename from test/accelerator/include/accelerator_fa_opt4.h rename to include/benchmark_support/npu/npu_fa_opt4.h index 5081a94..b108619 100644 --- a/test/accelerator/include/accelerator_fa_opt4.h +++ b/include/benchmark_support/npu/npu_fa_opt4.h @@ -25,7 +25,7 @@ void __vec__ flashsoftmax_opt4_with_scale( typename tileSum::DType old_sum_val = old_sum_ptr[sum_idx]; typename tileSum::DType upd_sum = old_sum_val; typename tileMax::DType upd_max = old_max_val; - + #pragma clang loop unroll(full) for(size_t j=0;j; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -195,19 +195,19 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p const float scale = 1.0f / sqrt((float)qD); const int Qb = (Sq + kTm - 1) / kTm; const int Kb = (Skv + kTk - 1) / kTk; - + // 对每个 Q-block (i) for (int i = 0; i < Qb; ++i) { // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i,0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; TEXPANDSCALAR(tMax, -1e30f); // 初始化为极小值 tileSum tSum(0); // 指数和归零 - tileO_out tPV_out; + tileO_out tPV_out; tileO tPV; tileO tO(0); // 输出累加归零 @@ -217,8 +217,8 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -226,7 +226,7 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // Nz -> ColMajor tileW tW; - #ifdef TEMPLATE + #ifdef TEMPLATE ACCSCALE_NZ2DN(tW, tW_out, scale); tileMax tNewMax; @@ -270,6 +270,6 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p // TCAST(tO_cast, tO); normalize_opt4<<>>(tO_cast.data(), tO.data(), tSum.data()); auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } \ No newline at end of file diff --git a/test/accelerator/include/accelerator_fa_template_2d_unroll.h b/include/benchmark_support/npu/npu_fa_template_2d_unroll.h similarity index 98% rename from test/accelerator/include/accelerator_fa_template_2d_unroll.h rename to include/benchmark_support/npu/npu_fa_template_2d_unroll.h index fe4a781..2b93d6d 100644 --- a/test/accelerator/include/accelerator_fa_template_2d_unroll.h +++ b/include/benchmark_support/npu/npu_fa_template_2d_unroll.h @@ -240,9 +240,9 @@ void __vec__ new_max_4src_template( typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23); upd_max = blkv_max(upd_max, local_max_0123); - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template @@ -313,7 +313,7 @@ void __vec__ src_exp_2src_with_local_sum_template( BLKC_ASSIGN_CAST(src_exp1, idx_2, src1_exp2); BLKC_ASSIGN_CAST(src_exp1, idx_3, src1_exp3); typename tileSum::DType src1_exp_sum = src1_exp0 + src1_exp1 + src1_exp2 + src1_exp3; - + upd_sum += src0_exp_sum + src1_exp_sum; } size_t idx_sum = i * tileSum::RowStride; @@ -356,7 +356,7 @@ void __vec__ new_sum_4src_template( typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3]; typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1; typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3; - typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; + typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0]; typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1]; @@ -364,7 +364,7 @@ void __vec__ new_sum_4src_template( typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3]; typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1; typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3; - typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; + typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0]; typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1]; @@ -372,7 +372,7 @@ void __vec__ new_sum_4src_template( typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3]; typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1; typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3; - typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; + typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0]; typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1]; @@ -415,7 +415,7 @@ void __vec__ local_max_4src_template( typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23); upd_max = blkv_max(upd_max, local_max_0123); - local_max_ptr[max_idx] = upd_max; + local_max_ptr[max_idx] = upd_max; } template @@ -451,7 +451,7 @@ void __vec__ local_sum_4src_template( typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3]; typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1; typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3; - typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; + typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23; typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0]; typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1]; @@ -459,7 +459,7 @@ void __vec__ local_sum_4src_template( typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3]; typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1; typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3; - typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; + typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23; typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0]; typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1]; @@ -467,7 +467,7 @@ void __vec__ local_sum_4src_template( typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3]; typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1; typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3; - typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; + typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23; typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0]; typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1]; @@ -505,7 +505,7 @@ void __vec__ new_max_of_2_loc_max_template( typename tileMax::DType local_max_01 = blkv_max(local_max_0_ptr[max_idx], local_max_1_ptr[max_idx]); upd_max = blkv_max(upd_max, local_max_01); new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template void __vec__ new_sum_of_2_loc_sum_template( @@ -560,7 +560,7 @@ void __vec__ new_max_of_4_loc_max_template( typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23); upd_max = blkv_max(upd_max, local_max_0123); new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template void __vec__ new_sum_of_4_loc_sum_template( @@ -585,7 +585,7 @@ void __vec__ new_sum_of_4_loc_sum_template( size_t sum_idx = i*tileSum::RowStride; - new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] + + new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] + local_sum_0_ptr[sum_idx] + local_sum_1_ptr[sum_idx] + local_sum_2_ptr[sum_idx] + local_sum_3_ptr[sum_idx]; } @@ -639,7 +639,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p tileQ tQ[Xdim]; - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tLocalMax[x][0].data(), tLocalMax[x][1].data(), tLocalMax[x][2].data(), tLocalMax[x][3].data(), tMax[x].data()); // src_exp_4src_template<<>>( @@ -760,7 +760,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p src_exp_2src_with_local_sum_template<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), tW[x][0].data(), tW[x][1].data(), tNewMax[x].data()); src_exp_2src_with_local_sum_template<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), - tW[x][2].data(), tW[x][3].data(), tNewMax[x].data()); + tW[x][2].data(), tW[x][3].data(), tNewMax[x].data()); // new_sum_4src_template<<>>( // tNewSum[x].data(), // tExpW[x][0].data(), tExpW[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -774,7 +774,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p tileSum tLocalSum[Xdim][2]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax4[x][k].data(), tLocalMax[x][4*k].data(), tLocalMax[x][4*k+1].data(), tLocalMax[x][4*k+2].data(), tLocalMax[x][4*k+3].data()); @@ -800,7 +800,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax4[x][k].data(), tLocalMax[x][4*k].data(), tLocalMax[x][4*k+1].data(), tLocalMax[x][4*k+2].data(), tLocalMax[x][4*k+3].data()); } @@ -837,7 +837,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p #pragma clang loop unroll(full) for(int y=0;y; // [kTm×kTk] using tileW = Tile; using tileW_cast = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -100,7 +100,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data(), scale); @@ -216,7 +216,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt #pragma clang loop unroll(full) for(int y=0;y Nz @@ -261,7 +261,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt tileK_tcols tK; auto gK = gIterK(0, Kb); - TCOPYIN(tK, gK); + TLOAD(tK, gK); tileW_tcols tW[Xdim]; #pragma clang loop unroll(full) @@ -291,7 +291,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt tileV_trows tV; auto gV = gIterV(Kb, 0); - TCOPYIN(tV, gV); + TLOAD(tV, gV); // 计算当前块的加权输出: O_j = W * V tileW_left_tcols tW_left[Xdim]; @@ -325,7 +325,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]); + TSTORE(dstO, tO_cast[x]); } } @@ -334,7 +334,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt tileQ_trows tQ; auto gQ = gIterQ(Qb,0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); tileMax_trows tMax; TEXPANDSCALAR(tMax, -1e30f); @@ -354,7 +354,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt #pragma clang loop unroll(full) for(int y=0;y<<>>( - tScale.data(), - tNewMax.data(), + tScale.data(), + tNewMax.data(), tW[0].data(), tW[1].data(), tW[2].data(), tW[3].data(), tMax.data(), scale); @@ -432,7 +432,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt #pragma clang loop unroll(full) for(int y=0;y<<>>(tO_cast.data(), tO.data(), tSum.data()); auto dstO = gIterO(Qb, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } \ No newline at end of file diff --git a/test/accelerator/include/accelerator_fusion.h b/include/benchmark_support/npu/npu_fusion.h similarity index 97% rename from test/accelerator/include/accelerator_fusion.h rename to include/benchmark_support/npu/npu_fusion.h index 00e816c..12d4a33 100644 --- a/test/accelerator/include/accelerator_fusion.h +++ b/include/benchmark_support/npu/npu_fusion.h @@ -70,9 +70,9 @@ void __vec__ flashsoftmax_new_max( #ifndef RES_CHECK upd_max = upd_max * src_scale; #endif - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; - scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); + scale_ptr[max_idx] = blkv_fexp(old_max_val - upd_max); } template @@ -147,7 +147,7 @@ void __vec__ flashsoftmax_new_sum( typename tileSrc::DType exp_src_3 = src_ptr[src_idx_3]; typename tileSrc::DType exp_src_01 = exp_src_0 + exp_src_1; typename tileSrc::DType exp_src_23 = exp_src_2 + exp_src_3; - typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23; + typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23; upd_sum += exp_src_0123; } blkv_get_tile_ptr(new_sum)[sum_idx] = upd_sum; @@ -441,19 +441,19 @@ void __vec__ normalize( blkv_get_tile_ptr(out_cast)[idx] = static_cast( (blkv_get_tile_ptr(rescale_out)[idx] + blkv_get_tile_ptr(out)[idx]) / blkv_get_tile_ptr(sum)[idx_sum] ); } -#include "accelerator_fa_fp4.h" -#include "accelerator_fa_opt1.h" -#include "accelerator_fa_opt2.h" -#include "accelerator_fa_opt3.h" -#include "accelerator_fa_opt4.h" -#include "accelerator_fa_dcore.h" -#include "accelerator_fa_2d_unroll.h" +#include "npu_fa_fp4.h" +#include "npu_fa_opt1.h" +#include "npu_fa_opt2.h" +#include "npu_fa_opt3.h" +#include "npu_fa_opt4.h" +#include "npu_fa_dcore.h" +#include "npu_fa_2d_unroll.h" #ifdef _2D_UNROLL_PTO -#include "accelerator_fa_2d_unroll_pto.h" +#include "npu_fa_2d_unroll_pto.h" #endif -#include "accelerator_fa_template_2d_unroll.h" -#include "accelerator_fa_dynamic.h" -#include "accelerator_fa_manual.h" +#include "npu_fa_template_2d_unroll.h" +#include "npu_fa_dynamic.h" +#include "npu_fa_manual.h" template void flashsoftmax(float *input, float *max, float *sum, float *input_scale, uint16_t *bitmask_gm, __half *output) { @@ -467,11 +467,11 @@ void flashsoftmax(float *input, float *max, float *sum, float *input_scale, uint using tileMax = Tile; using tileSum = Tile; using tileScale = Tile; - + using tileO = Tile; using tileO_cast = Tile; - const int Bm = S/tM; + const int Bm = S/tM; const int Bk = S/tK; for(int i=0;i(in_ptr[j]); - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(sum) :"vr"(data) ); - + #pragma clang loop unroll(full) for(int i=1;i(in_ptr[idx]) * static_cast<__half>(in_ptr[idx]); typename tile_shape::DType local_sum; - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(local_sum) :"vr"(data) ); @@ -61,7 +61,7 @@ void __vec__ rmsnorm_kernel( } sum = blkv_fsqrt( (sum / static_cast<__half>(tile_shape::ValidCol)) ); - + #pragma clang loop unroll(full) for(int i=0;i<<>>(tdst.data(), tsrc.data(), LaneNum, iter); gm_shape gdst(dst); - TCOPYOUT(gdst, tdst); + TSTORE(gdst, tdst); } // x / (Ex^2) ^ .5 @@ -92,9 +92,9 @@ template>; using tile_shape = Tile; - + using tSum = Tile; - + using gIter = global_iterator; gIter giter_src(src); @@ -111,19 +111,19 @@ void rmsnorm(dtype *dst, dtype *src){ { auto gsrc = giter_src(i, j); tile_shape tsrc; - - TCOPYIN(tsrc, gsrc); + + TLOAD(tsrc, gsrc); tSum tLocalSum; TMUL(tsrc, tsrc, tsrc); TROWSUM(tLocalSum, tsrc); TADD(tAccSquareSum, tAccSquareSum, tLocalSum); } - + tSum gSqureMean; TDIVS(gSqureMean, tAccSquareSum, kN); TSQRT(gSqureMean, gSqureMean); - + tile_shape gSqureMean_i; TEXPANDCOL(gSqureMean_i, gSqureMean); @@ -131,12 +131,12 @@ void rmsnorm(dtype *dst, dtype *src){ { auto gsrc = giter_src(i,j); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); - + TLOAD(tsrc, gsrc); + TDIV(tsrc, tsrc, gSqureMean_i); - + auto gdst = giter_dst(i,j); - TCOPYOUT(gdst, tsrc); + TSTORE(gdst, tsrc); } } } @@ -161,15 +161,15 @@ void __vec__ layernorm_kernel( __half data = static_cast<__half>(in_ptr[j]); __half data_square = data * data; - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(sum) :"vr"(data) ); - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(square_sum) :"vr"(data_square) ); - + #pragma clang loop unroll(full) for(int i=1;i%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(local_sum) :"vr"(data) ); - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(local_square_sum) :"vr"(data) ); @@ -191,7 +191,7 @@ void __vec__ layernorm_kernel( sum = (sum / static_cast<__half>(tile_shape::ValidCol)); square_sum = (square_sum / static_cast<__half>(tile_shape::ValidCol)); - + #pragma clang loop unroll(full) for(int i=0;i<<>>(tdst.data(), tsrc.data(), LaneNum, iter); gm_shape gdst(dst); - TCOPYOUT(gdst, tdst); + TSTORE(gdst, tdst); } @@ -225,9 +225,9 @@ void layernorm(dtype *dst, dtype *src, float *gamma, float *beta) using gm_shape = global_tensor>; using tile_shape = Tile; - + using tSum = Tile; - + using gIter = global_iterator; gIter giter_src(src); @@ -240,23 +240,23 @@ void layernorm(dtype *dst, dtype *src, float *gamma, float *beta) { tSum tAccSum(0); // tiling sum tSum tAccSquareSum(0); // tiling square sum - + for(int j=0;j(*gamma)); TADDS(tsrc, tsrc, static_cast(*beta)); - + auto gdst = giter_dst(i,j); - TCOPYOUT(gdst, tsrc); + TSTORE(gdst, tsrc); } } } @@ -297,9 +297,9 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta) using tile_shape = Tile; using tile_shape_cast = Tile; - + using tSum = Tile; - + using gIter = global_iterator; gIter giter_src(src); @@ -312,24 +312,24 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta) { tSum tAccSum(0); // tiling sum tSum tAccSquareSum(0); // tiling square sum - + for(int j=0;j(*gamma)); TADDS(tsrc, tsrc, static_cast<__half>(*beta)); - + auto gdst = giter_dst(i,j); tile_shape_cast tdst; TCAST(tdst, tsrc); - TCOPYOUT(gdst, tdst); + TSTORE(gdst, tdst); } } } @@ -401,14 +401,14 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta) // gm_pic gpic(pic+ n*C*H*W + c*H*W + h*pool.stride*W + w*pool.stride); //pic[n, c, h*pool.stride, w*pool.stride] // tile_filt tpic; -// TCOPYIN(tpic, gpic); +// TLOAD(tpic, gpic); // TROWMAXEXPAND(tpic, tpic); // TCOLMAXEXPAND(tpic, tpic); // TCOPY(tmp, tpic); // int offset = n*C*H_out*W_out + c*H_out*W_out + h*W_out + w; // gm_out gO(out+offset); -// TCOPYOUT(gO, tpic); +// TSTORE(gO, tpic); // } // } // } @@ -430,35 +430,35 @@ void __vec__ softmax_kernel( __half max; __half sum; - + __half data = static_cast<__half>(in_ptr[j]); - asm volatile("l.rdfmax %1.fh, ->%0.h" + asm volatile("l.rdfmax %1.fh, ->%0.h" :"=r"(max) :"vr"(data) ); - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(sum) :"vr"(data) ); - + #pragma clang loop unroll(full) for(int i=1;i(in_ptr[idx]); typename tile_shape::DType local_max; - asm volatile("l.rdfmax %1.fh, ->%0.h" + asm volatile("l.rdfmax %1.fh, ->%0.h" :"=r"(local_max) :"vr"(data) ); max = blkv_max(max, local_max); typename tile_shape::DType local_sum; - asm volatile("l.rdfadd %1.fh, ->%0.h" + asm volatile("l.rdfadd %1.fh, ->%0.h" :"=r"(local_sum) :"vr"(data) ); - sum += local_sum; + sum += local_sum; } #pragma clang loop unroll(full) @@ -476,14 +476,14 @@ void softmax_oneline(dtype *dst, dtype *src){ gm_shape gsrc(src); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); + TLOAD(tsrc, gsrc); const int iter = tile_shape::ValidCol/ LaneNum; tile_shape tdst; softmax_kernel<<>>(tdst.data(), tsrc.data(), LaneNum, iter); gm_shape gdst(dst); - TCOPYOUT(gdst, tdst); + TSTORE(gdst, tdst); } template @@ -506,7 +506,7 @@ void softmax_bf16(__bf16* dst, __bf16* src){ gm_shape gsrc(src+offset); tile_shape_ori tsrc_ori; tile_shape tsrc; - TCOPYIN(tsrc_ori, gsrc); + TLOAD(tsrc_ori, gsrc); TCAST(tsrc, tsrc_ori); tMax tLocalMax; @@ -541,7 +541,7 @@ void softmax_bf16(__bf16* dst, __bf16* src){ gm_shape gsrc(src+offset); tile_shape_ori tsrc_ori; tile_shape tsrc; - TCOPYIN(tsrc_ori, gsrc); + TLOAD(tsrc_ori, gsrc); TCAST(tsrc, tsrc_ori); tile_shape gMax; @@ -557,7 +557,7 @@ void softmax_bf16(__bf16* dst, __bf16* src){ tile_shape_ori tsrc_out; TCAST(tsrc_out, tsrc); - TCOPYOUT(gdst, tsrc_out); + TSTORE(gdst, tsrc_out); } } } @@ -581,7 +581,7 @@ void softmax(dtype* dst, dtype* src){ uint32_t offset = i*kTM*kN+j*kTN; gm_shape gsrc(src+offset); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); + TLOAD(tsrc, gsrc); tMax tLocalMax; TROWMAX(tLocalMax, tsrc); @@ -615,7 +615,7 @@ void softmax(dtype* dst, dtype* src){ uint32_t offset = i*kTM*kN+j*kTN; gm_shape gsrc(src+offset); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); + TLOAD(tsrc, gsrc); tile_shape gMax; tile_shape gSum; @@ -627,14 +627,14 @@ void softmax(dtype* dst, dtype* src){ TDIV(tsrc, tsrc, gSum); gm_shape gdst(dst+offset); - TCOPYOUT(gdst, tsrc); + TSTORE(gdst, tsrc); } } } template void gemm(float *c, float *a, float *b, float alpha, float beta) -{ +{ using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; @@ -681,8 +681,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } @@ -692,20 +692,20 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA_trows tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } tile_shapeACC oldC; - TCOPYIN(oldC, gC); + TLOAD(oldC, gC); TMULS(tACC, tACC, alpha); TMULS(oldC, oldC, beta); TADD(tACC, tACC, oldC); if constexpr(Relu){ TMAXS(tACC, tACC, 0); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -718,8 +718,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -728,20 +728,20 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA_trows tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } tile_shapeC_trows oldC; - TCOPYIN(oldC, gC); + TLOAD(oldC, gC); TMULS(tACC, tACC, alpha); TMULS(oldC, oldC, beta); TADD(tACC, tACC, oldC); if constexpr(Relu){ TMAXS(tACC, tACC, 0); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } } if constexpr (rmd_M) { @@ -756,8 +756,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -766,20 +766,20 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA_tcorner tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } tile_shapeC_tcols oldC; - TCOPYIN(oldC, gC); + TLOAD(oldC, gC); TMULS(tACC, tACC, alpha); TMULS(oldC, oldC, beta); TADD(tACC, tACC, oldC); if constexpr(Relu){ TMAXS(tACC, tACC, 0); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -792,8 +792,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -802,22 +802,22 @@ void gemm(float *c, float *a, float *b, float alpha, float beta) tile_shapeA_tcorner tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } tile_shapeC_tcorner oldC; - TCOPYIN(oldC, gC); + TLOAD(oldC, gC); TMULS(tACC, tACC, alpha); TMULS(oldC, oldC, beta); TADD(tACC, tACC, oldC); if constexpr(Relu){ TMAXS(tACC, tACC, 0); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } - } + } } template @@ -826,7 +826,7 @@ void gelu(dtype *out, dtype* in){ } // w1(in) * silu(w2(in)) -//silu : x / (1 + e^-x) +//silu : x / (1 + e^-x) template void swiglu(dtype *out, dtype *in, dtype *w1, dtype *w2){ using gmIn = global_tensor>; @@ -849,7 +849,7 @@ void swiglu(dtype *out, dtype *in, dtype *w1, dtype *w2){ const int Sb = S / tS; const int Inb = InDim / tInDim; const int Outb = OutDim / tOutDim; - + for(int i=0;i 1024*2 tile_shape_half tin_real; - tile_shape_half tin_imag; + tile_shape_half tin_imag; TEXTRACT(tin_real, resh_tin, 0, 0); // real 1024*1 TEXTRACT(tin_imag, resh_tin, 0, 1); // image 1024*1 @@ -918,7 +918,7 @@ void rope(__bf16 *out, __bf16 *x, __bf16 *freqs_cis){ gm_shape freqs(freqs_cis+offset); tile_shape tfreqs_ori; tile_shape_rope tfreqs_resh; - TCOPYIN(tfreqs_ori, freqs); + TLOAD(tfreqs_ori, freqs); tile_shape_cast tfreqs; TCAST(tfreqs, tfreqs_ori); TRESHAPE(tfreqs_resh, tfreqs); @@ -958,7 +958,7 @@ void rope(__bf16 *out, __bf16 *x, __bf16 *freqs_cis){ tile_shape_cast tout_resh_cast; TCAST(tout_resh_cast, tout_resh); - TCOPYOUT(output, tout_resh_cast); + TSTORE(output, tout_resh_cast); } } } @@ -985,16 +985,16 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp( "v.lw [ta, vn#1.reuse.uh<<2], ->vt.w\n" // src[index_part+col/2] = partner_idx "v.lw [ta, vm#2.reuse.uh<<2], ->vt.w\n" // src[index] = cur_value "v.lw [ta, vm#1.reuse.uh<<2], ->vt.w\n" // src[index_part] = partner_value - "v.sw vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n" // dst[tid] = src[tid] // copy first + "v.sw vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n" // dst[tid] = src[tid] // copy first "v.sw vt#1.reuse.sw, [to, vm#1.reuse.uh<<2]\n" // dst[partner] = src[partner] // copy first - "v.sw vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n" // dst[tid+col/2] = src[tid+col/2] // copy first + "v.sw vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n" // dst[tid+col/2] = src[tid+col/2] // copy first "v.sw vt#3.reuse.sw, [to, vn#1.reuse.uh<<2]\n" // dst[partner+col/2] = src[partner+col/2] // copy first "v.cmp.lt lc0.uh, vu#1.reuse.uh, ->vn.b\n" // tid < partner "v.and vu#1.reuse.uh, ri0.uh, ->vn.h\n" // partner & stage "v.cmp.eqi vn#1.reuse.uh, 0, ->vn.b\n" // partner & stage == 0 "v.cmp.lt vt#2.reuse.sw, vt#1.reuse.sw, ->vn.b\n" // cur_value < partner_value "v.and vn#4.reuse.ub, vn#2.reuse.ub, ->vu.b\n" // (tid < partner) & (partner & stage) == 0 - "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n" // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value) + "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n" // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value) "v.cmp.eqi vu#1.ub, 1, ->vm.b\n" // sort_descend "" "v.cmp.eqi vn#3.uh, 1, ->vn.b\n" // partner & stage == 1 @@ -1015,7 +1015,7 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp( "v.sw vt#3.sw, [to, vn#2.uh<<2]\n" // dst[tid+col/2] = src[partner] "v.sw vt#4.sw, [to, vn#1.uh<<2]\n" // dst[partner+col/2] = src[tid] "l.addi t#1.ud, 0, ->p\n" // resave p from 1st branch - "" // merge 2nd branch two result + "" // merge 2nd branch two result "c.bstop\n" : :"i"(tile_shape::ValidCol) @@ -1043,7 +1043,7 @@ void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) { using tile_shape_sort = Tile; tile_shape_sort dst_sort; - tile_shape_sort src_sort; + tile_shape_sort src_sort; TRANGE_RowMajor<<>>(indices.data()); tile_shape_sort padding(-1); @@ -1072,14 +1072,14 @@ void topk(dtype *weight, dtype* indices, dtype *x){ using gmOut = global_tensor>; using tileIn = Tile; using tileOut = Tile; // topk < 32 - + const int block = tokens/tS; for(int i=0;i(tWeight, tIndice, tIn); tileOut tWeightOut; TEXTRACT(tWeightOut, tWeight, 0, 0); @@ -1088,9 +1088,9 @@ void topk(dtype *weight, dtype* indices, dtype *x){ TEXTRACT(tIndiceOut, tIndice, 0, 0); gmOut gWeight(weight+i*tS*tK); - TCOPYOUT(gWeight, tWeightOut); + TSTORE(gWeight, tWeightOut); gmOut gIndice(indices+i*tS*tK); - TCOPYOUT(gIndice, tIndiceOut); + TSTORE(gIndice, tIndiceOut); } } \ No newline at end of file diff --git a/test/accelerator/include/accelerator_vec_simt.h b/include/benchmark_support/npu/npu_vec_simt.h similarity index 100% rename from test/accelerator/include/accelerator_vec_simt.h rename to include/benchmark_support/npu/npu_vec_simt.h diff --git a/include/common/debug_utils.hpp b/include/common/debug_utils.hpp index 5dd882a..7569d49 100644 --- a/include/common/debug_utils.hpp +++ b/include/common/debug_utils.hpp @@ -1,14 +1,18 @@ #ifndef DEBUG_UIILS_HPP #define DEBUG_UIILS_HPP -#ifdef __linx -#include "jcore/utils.hpp" +#if defined(__linx) +namespace pto { +template +void print_tile(tile_shape &) {} +} // namespace pto #elif defined(__ARM_FEATURE_SME) #include "aarch64/utils.hpp" #elif defined(__cpu_sim__) #include "cpu_sim/utils.hpp" #endif +#ifndef __linx namespace pto { template void print_tile(tile_shape &tile) { @@ -16,5 +20,6 @@ void print_tile(tile_shape &tile) { } } // namespace pto +#endif -#endif \ No newline at end of file +#endif diff --git a/include/common/layout.hpp b/include/common/layout.hpp index d746528..3317ae2 100644 --- a/include/common/layout.hpp +++ b/include/common/layout.hpp @@ -2,8 +2,11 @@ #define LAYOUT_HPP #include +#include +#ifndef __linx #include +#endif #include #include "common/math_utils.hpp" @@ -65,6 +68,7 @@ const char *layout_type_to_str(LayoutEnum type) { return "UnsupportedLayout"; } +#ifndef __linx class MatrixLayoutPrettyPrinter { template static void print(std::ostream &out, const Layout &layout) { @@ -73,6 +77,7 @@ class MatrixLayoutPrettyPrinter { << Layout::ColStride << ">, Numel = " << Layout::Numel; } }; +#endif template static std::ostream & operator<<(std::ostream &out, @@ -206,6 +214,7 @@ operator<<(std::ostream &out, << " }"; return out; } +#endif template concept BlockRowMajorLayout = @@ -234,4 +243,4 @@ template using BlockMixed = BlockMatrixLayout; } // namespace pto -#endif \ No newline at end of file +#endif diff --git a/include/common/pto_tile.hpp b/include/common/pto_tile.hpp index 1177823..06f3409 100644 --- a/include/common/pto_tile.hpp +++ b/include/common/pto_tile.hpp @@ -416,7 +416,7 @@ struct Tile { static_assert(SFractalSize_ == 512 || SFractalSize_ == 1024, "SFractalSize_ illegal"); -#ifdef __linx +#if defined(__linx) && defined(SUPERNPUBENCH_LINX_TILE_SIZE) using TileDType = DType tile_size(Rows *Cols / (sizeof(DType) * 8 / type_traits::bits)); #else using TileDType = DType[Rows * Cols]; @@ -656,6 +656,7 @@ const char* get_layout_str() { template void print_tile_info() { +#ifndef __linx std::cout << "Tile Rows Number: " << tile_shape::Rows << std::endl; std::cout << "Tile Columns Number: " << tile_shape::Cols << std::endl; std::cout << "Tile Active Rows Number: " << tile_shape::ValidRow << std::endl; @@ -667,6 +668,7 @@ void print_tile_info() { std::cout << "Tile Size: " << tile_shape::Numel << std::endl; std::cout << "Tile Layout: " << get_layout_str() << std::endl; std::cout << "Tile Data Dump: " << std::endl; +#endif } } // namespace pto diff --git a/include/common/tileop_api.hpp b/include/common/tileop_api.hpp index 9774dd9..aacf95f 100644 --- a/include/common/tileop_api.hpp +++ b/include/common/tileop_api.hpp @@ -36,7 +36,7 @@ void MATMACCMX(tile_shape_C &dst, tile_shape_A &src0, tile_shape_AX &src0x, MATMACCMX_Impl(dst, src0, src0x, src1, src1x); } -template void MATMACCMXB(tile_shape_C &dst, tile_shape_A &src0, tile_shape_B &src1, tile_shape_BX &src1x) { @@ -81,12 +81,12 @@ void TCOPY(tile_shape &dst, tile_shape &src) { TCOPY_Impl(dst, src); } template -void TCOPYIN(tile_shape &dst, gm_shape &src) { - TCOPYIN_Impl(dst, src); +void TLOAD(tile_shape &dst, gm_shape &src) { + TLOAD_Impl(dst, src); } template -void TCOPYOUT(gm_shape &dst, tile_shape &src) { - TCOPYOUT_Impl(dst, src); +void TSTORE(gm_shape &dst, tile_shape &src) { + TSTORE_Impl(dst, src); } template void TCVT(tile_shape_out &dst, tile_shape_in &src) { diff --git a/include/common/tileop_api_impl.hpp b/include/common/tileop_api_impl.hpp index ff76138..fe8ad19 100644 --- a/include/common/tileop_api_impl.hpp +++ b/include/common/tileop_api_impl.hpp @@ -8,44 +8,34 @@ #include "jcore/TAdd.hpp" #include "jcore/TAdds.hpp" #include "jcore/TAnd.hpp" -#include "jcore/TAssemble.hpp" -#include "jcore/TCast.hpp" -#include "jcore/TCI.hpp" #include "jcore/TCmp.hpp" +#include "jcore/TCI.hpp" #include "jcore/TCopy.hpp" -#include "jcore/TCopyIn.hpp" -#include "jcore/TCopyOut.hpp" +#include "jcore/TLoad.hpp" +#include "jcore/TStore.hpp" #include "jcore/TCvt.hpp" #include "jcore/TDiv.hpp" #include "jcore/TDivs.hpp" #include "jcore/TExp.hpp" +#include "jcore/TOr.hpp" +#include "jcore/TRem.hpp" +#include "jcore/TRecip.hpp" +#include "jcore/TSqrt.hpp" +#include "jcore/TSub.hpp" +#include "jcore/TSubs.hpp" +#include "jcore/TMul.hpp" +#include "jcore/TMuls.hpp" +#include "jcore/TMax.hpp" +#include "jcore/TMaxs.hpp" #include "jcore/TExpandCol.hpp" #include "jcore/TExpandRow.hpp" #include "jcore/TExpandScalar.hpp" -#include "jcore/TExtract.hpp" -#include "jcore/TFillPad.hpp" -#include "jcore/TGather.hpp" -#include "jcore/TMax.hpp" -#include "jcore/TMaxs.hpp" -#include "jcore/TMin.hpp" -#include "jcore/TMins.hpp" -#include "jcore/TMul.hpp" -#include "jcore/TMuls.hpp" -#include "jcore/TOr.hpp" -#include "jcore/TPad.hpp" -#include "jcore/TRSqrt.hpp" -#include "jcore/TRecip.hpp" -#include "jcore/TRem.hpp" #include "jcore/TReshape.hpp" #include "jcore/TRowMax.hpp" #include "jcore/TRowMaxExpand.hpp" #include "jcore/TRowSum.hpp" #include "jcore/TRowSumExpand.hpp" -#include "jcore/TScatter.hpp" -#include "jcore/TSelect.hpp" -#include "jcore/TSqrt.hpp" -#include "jcore/TSub.hpp" -#include "jcore/TSubs.hpp" +#include "jcore/TPad.hpp" #include "jcore/TTrans.hpp" #elif defined(__ARM_FEATURE_SME) @@ -60,8 +50,8 @@ #include "aarch64/TCI.hpp" #include "aarch64/TCmp.hpp" #include "aarch64/TCopy.hpp" -#include "aarch64/TCopyIn.hpp" -#include "aarch64/TCopyOut.hpp" +#include "aarch64/TLoad.hpp" +#include "aarch64/TStore.hpp" #include "aarch64/TDiv.hpp" #include "aarch64/TDivs.hpp" #include "aarch64/TExp.hpp" @@ -103,8 +93,8 @@ #include "cpu_sim/TCI.hpp" #include "cpu_sim/TCmp.hpp" #include "cpu_sim/TCopy.hpp" -#include "cpu_sim/TCopyIn.hpp" -#include "cpu_sim/TCopyOut.hpp" +#include "cpu_sim/TLoad.hpp" +#include "cpu_sim/TStore.hpp" #include "cpu_sim/TCvt.hpp" #include "cpu_sim/TDiv.hpp" #include "cpu_sim/TDivs.hpp" @@ -142,4 +132,4 @@ #error "__linx, __ARM_FEATURE_SME, or __cpu_sim__ must be defined" #endif -#endif \ No newline at end of file +#endif diff --git a/include/cpu_sim/TCopyIn.hpp b/include/cpu_sim/TLoad.hpp similarity index 85% rename from include/cpu_sim/TCopyIn.hpp rename to include/cpu_sim/TLoad.hpp index 56476ea..17e32de 100644 --- a/include/cpu_sim/TCopyIn.hpp +++ b/include/cpu_sim/TLoad.hpp @@ -1,12 +1,12 @@ -#ifndef TCOPYIN_HPP -#define TCOPYIN_HPP +#ifndef CPU_SIM_TLOAD_HPP +#define CPU_SIM_TLOAD_HPP #include "common/pto_tile.hpp" using namespace pto; template -void CopyInRow2NzImpl1D(typename tile_shape::TileDType dst, +void LoadRow2NzImpl1D(typename tile_shape::TileDType dst, const typename gm_shape::DType *src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -29,7 +29,7 @@ void CopyInRow2NzImpl1D(typename tile_shape::TileDType dst, } template -void CopyInRow2ZnImpl1D(typename tile_shape::TileDType dst, +void LoadRow2ZnImpl1D(typename tile_shape::TileDType dst, const typename gm_shape::DType *src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -52,7 +52,7 @@ void CopyInRow2ZnImpl1D(typename tile_shape::TileDType dst, } template -void CopyInCol2ZnImpl1D(typename tile_shape::TileDType dst, +void LoadCol2ZnImpl1D(typename tile_shape::TileDType dst, const typename gm_shape::DType *src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -75,7 +75,7 @@ void CopyInCol2ZnImpl1D(typename tile_shape::TileDType dst, } template -void TCopyIn_RowMajor_Impl(typename tile_shape::TileDType dst, +void TLoad_RowMajor_Impl(typename tile_shape::TileDType dst, const typename gm_shape::DType *src) { for (size_t i = 0; i < tile_shape::ValidRow; ++i) for (size_t j = 0; j < tile_shape::ValidCol; ++j) { @@ -86,7 +86,7 @@ void TCopyIn_RowMajor_Impl(typename tile_shape::TileDType dst, } template -void TCopyIn_ColMajor_Impl(typename tile_shape::TileDType dst, +void TLoad_ColMajor_Impl(typename tile_shape::TileDType dst, const typename gm_shape::DType *src) { for (size_t i = 0; i < tile_shape::ValidCol; ++i) for (size_t j = 0; j < tile_shape::ValidRow; ++j) { @@ -97,7 +97,7 @@ void TCopyIn_ColMajor_Impl(typename tile_shape::TileDType dst, } template -void CopyInRow2NzImpl1D_Dynamic(tile_shape &dst, +void LoadRow2NzImpl1D_Dynamic(tile_shape &dst, gm_shape &src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -120,7 +120,7 @@ void CopyInRow2NzImpl1D_Dynamic(tile_shape &dst, } template -void CopyInRow2ZnImpl1D_Dynamic(tile_shape &dst, +void LoadRow2ZnImpl1D_Dynamic(tile_shape &dst, gm_shape &src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -143,7 +143,7 @@ void CopyInRow2ZnImpl1D_Dynamic(tile_shape &dst, } template -void CopyInCol2ZnImpl1D_Dynamic(tile_shape &dst, +void LoadCol2ZnImpl1D_Dynamic(tile_shape &dst, gm_shape &src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -166,7 +166,7 @@ void CopyInCol2ZnImpl1D_Dynamic(tile_shape &dst, } template -void TCopyIn_RowMajor_Impl_Dynamic(tile_shape &dst, +void TLoad_RowMajor_Impl_Dynamic(tile_shape &dst, gm_shape &src) { for (size_t i = 0; i < dst.GetValidRow(); ++i) { for (size_t j = 0; j < dst.GetValidCol(); ++j) { @@ -178,7 +178,7 @@ void TCopyIn_RowMajor_Impl_Dynamic(tile_shape &dst, } template -void TCopyIn_ColMajor_Impl_Dynamic(tile_shape &dst, +void TLoad_ColMajor_Impl_Dynamic(tile_shape &dst, gm_shape &src) { for (size_t i = 0; i < dst.GetValidCol(); ++i) { for (size_t j = 0; j < dst.GetValidRow(); ++j) { @@ -190,26 +190,26 @@ void TCopyIn_ColMajor_Impl_Dynamic(tile_shape &dst, } template -void TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { +void TLOAD_Impl(tile_shape &dst, gm_shape &src) { static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here"); if (tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic if constexpr (is_Nz_layout::value) { if constexpr (gm_shape::isRowMajor) { - CopyInRow2NzImpl1D_Dynamic(dst, src); + LoadRow2NzImpl1D_Dynamic(dst, src); } else { static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor"); } } else if constexpr (is_Zn_layout::value) { if constexpr (!gm_shape::isRowMajor) { - CopyInCol2ZnImpl1D_Dynamic(dst, src); + LoadCol2ZnImpl1D_Dynamic(dst, src); } else { - CopyInRow2ZnImpl1D_Dynamic(dst, src); + LoadRow2ZnImpl1D_Dynamic(dst, src); } } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (gm_shape::isRowMajor) { - TCopyIn_RowMajor_Impl_Dynamic(dst, src); + TLoad_RowMajor_Impl_Dynamic(dst, src); } else { - TCopyIn_ColMajor_Impl_Dynamic(dst, src); + TLoad_ColMajor_Impl_Dynamic(dst, src); } } else { static_assert(tile_shape::isBoxedLayout == false, "Data type not supported"); @@ -217,21 +217,21 @@ void TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } else { // static if constexpr (is_Nz_layout::value) { if constexpr (gm_shape::isRowMajor) { - CopyInRow2NzImpl1D(dst.data(), src.data()); + LoadRow2NzImpl1D(dst.data(), src.data()); } else { static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor"); } } else if constexpr (is_Zn_layout::value) { if constexpr (!gm_shape::isRowMajor) { - CopyInCol2ZnImpl1D(dst.data(), src.data()); + LoadCol2ZnImpl1D(dst.data(), src.data()); } else { - CopyInRow2ZnImpl1D(dst.data(), src.data()); + LoadRow2ZnImpl1D(dst.data(), src.data()); } } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (gm_shape::isRowMajor) { - TCopyIn_RowMajor_Impl(dst.data(), src.data()); + TLoad_RowMajor_Impl(dst.data(), src.data()); } else { - TCopyIn_ColMajor_Impl(dst.data(), src.data()); + TLoad_ColMajor_Impl(dst.data(), src.data()); } } else { static_assert(tile_shape::isBoxedLayout == false, "Data type not supported"); diff --git a/include/cpu_sim/TCopyOut.hpp b/include/cpu_sim/TStore.hpp similarity index 84% rename from include/cpu_sim/TCopyOut.hpp rename to include/cpu_sim/TStore.hpp index 6a59712..d3412b4 100644 --- a/include/cpu_sim/TCopyOut.hpp +++ b/include/cpu_sim/TStore.hpp @@ -1,12 +1,12 @@ -#ifndef TCOPYOUT_HPP -#define TCOPYOUT_HPP +#ifndef CPU_SIM_TSTORE_HPP +#define CPU_SIM_TSTORE_HPP #include "common/pto_tile.hpp" using namespace pto; template -void CopyOut2NzImpl1D(typename gm_shape::DType *dst, +void Store2NzImpl1D(typename gm_shape::DType *dst, const typename tile_shape::TileDType src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -29,7 +29,7 @@ void CopyOut2NzImpl1D(typename gm_shape::DType *dst, } template -void TCopyOut_ColMajor_Impl(typename gm_shape::DType *dst, +void TStore_ColMajor_Impl(typename gm_shape::DType *dst, typename tile_shape::TileDType src) { for (size_t i = 0; i < tile_shape::ValidCol; ++i) for (size_t j = 0; j < tile_shape::ValidRow; ++j) { @@ -40,7 +40,7 @@ void TCopyOut_ColMajor_Impl(typename gm_shape::DType *dst, } template -void TCopyOut_RowMajor_Impl(typename gm_shape::DType *dst, +void TStore_RowMajor_Impl(typename gm_shape::DType *dst, typename tile_shape::TileDType src) { for (size_t i = 0; i < tile_shape::ValidRow; ++i) for (size_t j = 0; j < tile_shape::ValidCol; ++j) { @@ -51,7 +51,7 @@ void TCopyOut_RowMajor_Impl(typename gm_shape::DType *dst, } template -void CopyOut2NzImpl1D_Dynamic(gm_shape &dst, +void Store2NzImpl1D_Dynamic(gm_shape &dst, const tile_shape &src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -74,7 +74,7 @@ void CopyOut2NzImpl1D_Dynamic(gm_shape &dst, } template -void TCopyOut_ColMajor_Impl_Dynamic(gm_shape& dst, +void TStore_ColMajor_Impl_Dynamic(gm_shape& dst, tile_shape &src) { for (size_t i = 0; i < src.GetValidCol(); ++i) { for (size_t j = 0; j < src.GetValidRow(); ++j) { @@ -86,7 +86,7 @@ void TCopyOut_ColMajor_Impl_Dynamic(gm_shape& dst, } template -void TCopyOut_RowMajor_Impl_Dynamic(gm_shape &dst, +void TStore_RowMajor_Impl_Dynamic(gm_shape &dst, tile_shape &src) { for (size_t i = 0; i < src.GetValidRow(); ++i) { for (size_t j = 0; j < src.GetValidCol(); ++j) { @@ -98,20 +98,20 @@ void TCopyOut_RowMajor_Impl_Dynamic(gm_shape &dst, } template -void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { +void TSTORE_Impl(gm_shape &dst, tile_shape &src) { static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here"); if (tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic if constexpr (is_Nz_layout::value) { if constexpr (gm_shape::isRowMajor) { - CopyOut2NzImpl1D_Dynamic(dst, src); + Store2NzImpl1D_Dynamic(dst, src); } else { static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor"); } } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (gm_shape::isRowMajor) { - TCopyOut_RowMajor_Impl_Dynamic(dst, src); + TStore_RowMajor_Impl_Dynamic(dst, src); } else { - TCopyOut_ColMajor_Impl_Dynamic(dst, src); + TStore_ColMajor_Impl_Dynamic(dst, src); } } else { static_assert(tile_shape::isBoxedLayout == false, "Data type not supported"); @@ -119,15 +119,15 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { } else { // static if constexpr (is_Nz_layout::value) { if constexpr (gm_shape::isRowMajor) { - CopyOut2NzImpl1D(dst.data(), src.data()); + Store2NzImpl1D(dst.data(), src.data()); } else { static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor"); } } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (gm_shape::isRowMajor) { - TCopyOut_RowMajor_Impl(dst.data(), src.data()); + TStore_RowMajor_Impl(dst.data(), src.data()); } else { - TCopyOut_ColMajor_Impl(dst.data(), src.data()); + TStore_ColMajor_Impl(dst.data(), src.data()); } } else { static_assert(tile_shape::isBoxedLayout == false, "Data type not supported"); diff --git a/include/jcore/MatMacc.hpp b/include/jcore/MatMacc.hpp index 39f4acc..a274bbc 100644 --- a/include/jcore/MatMacc.hpp +++ b/include/jcore/MatMacc.hpp @@ -5,6 +5,68 @@ using namespace pto; +#ifdef __linx +template +struct linx_matmacc_unsupported { + static constexpr bool value = false; +}; + +// Matrix Multiply and Accumulate: C[MxN] += A[MxK] x B[KxN] +template +void MATMACC_Impl(tile_shape_C &dst, tile_shape_A &src0, tile_shape_B &src1) { + static_assert(tile_shape_A::ValidCol == tile_shape_B::ValidRow, + "Linx scalar MATMACC requires A columns to match B rows"); + static_assert(!tile_shape_A::isBoxedLayout && !tile_shape_B::isBoxedLayout && + !tile_shape_C::isBoxedLayout, + "Linx scalar MATMACC supports only unboxed layouts"); + static_assert(tile_shape_A::Loc != Location::Acc && + tile_shape_B::Loc != Location::Acc && + tile_shape_C::Loc != Location::Acc, + "Linx scalar MATMACC does not support ACC tile operands"); + static_assert(std::is_integral::value && + std::is_integral::value && + std::is_integral::value, + "Linx scalar MATMACC direct smoke supports integral tiles only"); + + constexpr size_t rows = tile_shape_C::ValidRow; + constexpr size_t cols = tile_shape_C::ValidCol; + constexpr size_t inner = tile_shape_A::ValidCol; + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + typename tile_shape_C::DType acc = dst.data()[index(row, col)]; + for (size_t k = 0; k < inner; ++k) { + acc += src0.data()[index(row, k)] * + src1.data()[index(k, col)]; + } + dst.data()[index(row, col)] = acc; + } + } +} + +template +void MATMACCMX_Impl(tile_shape_C &, tile_shape_A &, tile_shape_AX &, + tile_shape_B &, tile_shape_BX &) { + static_assert(linx_matmacc_unsupported::value, + "Linx direct MATMACCMX smoke is not implemented"); +} + +template +void MATMACCMXB_Impl(tile_shape_C &, tile_shape_A &, tile_shape_B &, + tile_shape_BX &) { + static_assert(linx_matmacc_unsupported::value, + "Linx direct MATMACCMXB smoke is not implemented"); +} + +#else + template void __vec__ MatMacc_Vec_Impl( typename tile_shape_C::TileDType __out__ dst, @@ -235,4 +297,5 @@ void MATMACCMXB_Impl(tile_shape_C &dst, } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/MatMul.hpp b/include/jcore/MatMul.hpp index 166abf8..80bf41a 100644 --- a/include/jcore/MatMul.hpp +++ b/include/jcore/MatMul.hpp @@ -5,6 +5,45 @@ using namespace pto; +#ifdef __linx + +// Direct-boot Linx smoke uses the scalar fallback until the vector launch +// syntax is supported in this toolchain lane. +template +void MATMUL_Impl(tile_shape_C &dst, tile_shape_A &src0, tile_shape_B &src1) { + static_assert(!tile_shape_A::isBoxedLayout && !tile_shape_B::isBoxedLayout && + !tile_shape_C::isBoxedLayout, + "Linx scalar MATMUL supports only unboxed layouts"); + static_assert(tile_shape_A::Loc != Location::Acc && + tile_shape_B::Loc != Location::Acc && + tile_shape_C::Loc != Location::Acc, + "Linx scalar MATMUL does not support ACC tile operands"); + + const size_t rows = dst.GetValidRow(); + const size_t cols = dst.GetValidCol(); + const size_t inner = + src0.GetValidCol() > src1.GetValidRow() ? src0.GetValidCol() + : src1.GetValidRow(); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + typename tile_shape_C::DType acc = 0; + for (size_t k = 0; k < inner; ++k) { + if constexpr (!std::is_same::value && + !std::is_same::value && + !std::is_same::value) { + acc += src0.data()[index(row, k)] * + src1.data()[index(k, col)]; + } + } + dst.data()[index(row, col)] = acc; + } + } +} + +#else + template void __vec__ MatMul_Vec_Impl( typename tile_shape_C::TileDType __out__ dst, @@ -245,3 +284,5 @@ void MATMULMXB_Impl(tile_shape_C &dst, } #endif + +#endif diff --git a/include/jcore/TAbs.hpp b/include/jcore/TAbs.hpp index 73e55be..273cce6 100644 --- a/include/jcore/TAbs.hpp +++ b/include/jcore/TAbs.hpp @@ -5,6 +5,27 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TABS_Impl(tile_shape &dst, tile_shape &src) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TABS not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + auto src_value = src.data()[index]; + auto zero = typename tile_shape::DType{}; + dst.data()[index] = src_value < zero ? -src_value : src_value; + } + } +} +#else template void __vec__ TAbs_Vec_RowMajor( typename tile_shape::TileDType __out__ dst, @@ -85,5 +106,6 @@ template void TABS_Impl(tile_shape &dst, tile_shape } } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TAdd.hpp b/include/jcore/TAdd.hpp index 87faa41..7a59dfa 100644 --- a/include/jcore/TAdd.hpp +++ b/include/jcore/TAdd.hpp @@ -5,6 +5,26 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TADD_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + size_t rows = src0.GetValidRow(); + size_t cols = src0.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "TADD not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src0.data()[index] + src1.data()[index]; + } + } +} +#else template void __vec__ TAdd_Vec_RowMajor( typename tile_shape::TileDType __out__ dst, @@ -72,5 +92,6 @@ void TADD_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { } } +#endif #endif diff --git a/include/jcore/TAdds.hpp b/include/jcore/TAdds.hpp index 9143023..adc353f 100644 --- a/include/jcore/TAdds.hpp +++ b/include/jcore/TAdds.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TADDS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TADDS not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src.data()[index] + s; + } + } +} +#else template void __vec__ TAdds_Vec_RowMajor( typename tile_shape::TileDType __out__ dst, @@ -80,5 +99,6 @@ void TADDS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) } } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TAnd.hpp b/include/jcore/TAnd.hpp index b7dda75..bc43bf4 100644 --- a/include/jcore/TAnd.hpp +++ b/include/jcore/TAnd.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TAND_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + size_t rows = src0.GetValidRow(); + size_t cols = src0.GetValidCol(); + static_assert(tile_shape::Loc == Location::Vec, + "Only VEC tile type are supported"); + static_assert(!tile_shape::isBoxedLayout, "TAND not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src0.data()[index] & src1.data()[index]; + } + } +} +#else template void __vec__ TAnd_Vec_RowMajor( typename tile_shape::TileDType __out__ dst, @@ -70,5 +89,6 @@ void TAND_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Only int data type are supported"); } } +#endif #endif diff --git a/include/jcore/TCI.hpp b/include/jcore/TCI.hpp index 3849eab..148aa00 100644 --- a/include/jcore/TCI.hpp +++ b/include/jcore/TCI.hpp @@ -5,7 +5,41 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TCI_Impl(tile_shape &dst, T s) { + static constexpr size_t row = tile_shape::ValidRow; + static constexpr size_t col = tile_shape::ValidCol; + + static_assert(std::is_same::value, + "Dst and scalar must be same data type!"); + static_assert((descending == 0) || (descending == 1), + "descending must be 0 or 1!"); + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape::Loc == Location::Vec, + "Only VEC tile type are supported"); + static_assert(!tile_shape::isBoxedLayout, "TCI not support Boxed Layout!"); + static_assert(std::is_same::value || + std::is_same::value || + std::is_same::value || + std::is_same::value, + "Data type not supported"); + for (size_t row_idx = 0; row_idx < row; ++row_idx) { + for (size_t col_idx = 0; col_idx < col; ++col_idx) { + size_t tile_index = index(row_idx, col_idx); + if constexpr (descending) { + dst.data()[tile_index] = + s - static_cast(tile_index); + } else { + dst.data()[tile_index] = + s + static_cast(tile_index); + } + } + } +} +#else template void __vec__ TCIImpl_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::DType __in__ s) { @@ -70,4 +104,5 @@ if constexpr (std::is_same::value || } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TCmp.hpp b/include/jcore/TCmp.hpp index 608e76f..6a37c0c 100644 --- a/include/jcore/TCmp.hpp +++ b/include/jcore/TCmp.hpp @@ -3,9 +3,64 @@ #include "common/pto_tile.hpp" #include "jcore/constants.hpp" +#ifdef __linx +#include +#else #include +#endif using namespace pto; +#ifdef __linx +template static inline int32_t linx_tcmp_value(T a, T b, CmpMode mode) { + switch (mode) { + case CmpMode::EQ: + return a == b; + case CmpMode::NE: + return a != b; + case CmpMode::GT: + return a > b; + case CmpMode::LT: + return a < b; + case CmpMode::GE: + return a >= b; + case CmpMode::LE: + return a <= b; + } + return 0; +} + +template +void TCMP_Impl(tile_shape_out &dst, tile_shape_in &src0, tile_shape_in &src1, + CmpMode cmpMode) { + static_assert(tile_shape_in::Rows == tile_shape_out::Rows && + tile_shape_in::Cols == tile_shape_out::Cols, + "Error! Input shape != Output shape"); + static_assert(tile_shape_in::InnerRows == tile_shape_out::InnerRows && + tile_shape_in::InnerCols == tile_shape_out::InnerCols, + "Error! Inner shape is not equal!"); + static_assert(tile_shape_out::Loc == Location::Vec && + tile_shape_in::Loc == Location::Vec, + "Only VEC tile type are supported"); + static_assert(tile_shape_out::isBoxedLayout == false && + tile_shape_in::isBoxedLayout == false, + "TCMP not support Boxed Layout!"); + + static constexpr size_t row = tile_shape_in::ValidRow; + static constexpr size_t col = tile_shape_in::ValidCol; + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + size_t in_index = index(i, j); + size_t out_index = index(i, j); + dst.data()[out_index] = + static_cast(linx_tcmp_value( + src0.data()[in_index], src1.data()[in_index], cmpMode)); + } + } +} +#else template void __vec__ TCmp_Vec_RowMajor(typename tile_shape_out::TileDType __out__ dst, const typename tile_shape_in::TileDType __in__ src0, @@ -160,3 +215,4 @@ void TCMP_Impl(tile_shape_out &dst, tile_shape_in &src0, tile_shape_in &src1, Cm } } #endif +#endif diff --git a/include/jcore/TCopy.hpp b/include/jcore/TCopy.hpp index 956c432..c0a2778 100644 --- a/include/jcore/TCopy.hpp +++ b/include/jcore/TCopy.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TCOPY_Impl(tile_shape &dst, tile_shape &src) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TCOPY not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src.data()[index]; + } + } +} +#else template void __vec__ TCopy_Vec_RowMajor( typename tile_shape::TileDType __out__ dst, @@ -68,5 +87,6 @@ void TCOPY_Impl(tile_shape &dst, tile_shape &src) { } } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TCvt.hpp b/include/jcore/TCvt.hpp index e1ab2f5..19601cc 100644 --- a/include/jcore/TCvt.hpp +++ b/include/jcore/TCvt.hpp @@ -2,10 +2,38 @@ #define TCVT_HPP #include "common/pto_tile.hpp" +#ifndef __linx #include "template_asm.hpp" +#endif using namespace pto; +#ifdef __linx +template +void TCVT_Impl(tile_shape_out &dst, tile_shape_in &src) { + static_assert(tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC && + tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_in::Loc != Location::Acc, + "Linx direct TCVT smoke does not support ACC input"); + static_assert(tile_shape_out::Loc != Location::Acc, + "ACC can not be output tile!"); + static_assert(tile_shape_in::ValidRow == tile_shape_out::ValidRow && + tile_shape_in::ValidCol == tile_shape_out::ValidCol, + "TCVT direct path requires matching logical shapes"); + + for (size_t row = 0; row < tile_shape_in::ValidRow; ++row) { + for (size_t col = 0; col < tile_shape_in::ValidCol; ++col) { + size_t src_index = index(row, col); + size_t dst_index = index(row, col); + dst.data()[dst_index] = + static_cast(src.data()[src_index]); + } + } +} +#else template struct blkc_has_data_member : std::false_type {}; @@ -707,4 +735,5 @@ void TCVT_Impl(tile_shape_out &dst, tile_shape_in &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TDiv.hpp b/include/jcore/TDiv.hpp index b13ff80..50666dd 100644 --- a/include/jcore/TDiv.hpp +++ b/include/jcore/TDiv.hpp @@ -5,6 +5,26 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TDIV_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + static constexpr size_t row = tile_shape::ValidRow; + static constexpr size_t col = tile_shape::ValidCol; + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "TDIV not support Boxed Layout!"); + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + size_t tile_index = index(i, j); + dst.data()[tile_index] = src0.data()[tile_index] / src1.data()[tile_index]; + } + } +} +#else template void __vec__ TDivImpl_RowMajor(typename tile_shape::TileDType __out__ dst, @@ -68,5 +88,6 @@ void TDIV_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Storage type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TDivs.hpp b/include/jcore/TDivs.hpp index 7f5ec9f..8a6f435 100644 --- a/include/jcore/TDivs.hpp +++ b/include/jcore/TDivs.hpp @@ -5,6 +5,26 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TDIVS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) { + static constexpr size_t row = tile_shape::ValidRow; + static constexpr size_t col = tile_shape::ValidCol; + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "TDIVS not support Boxed Layout!"); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + size_t tile_index = index(i, j); + dst.data()[tile_index] = src.data()[tile_index] / s; + } + } +} +#else template void __vec__ TDivsImpl_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src, @@ -62,5 +82,6 @@ void TDIVS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TExp.hpp b/include/jcore/TExp.hpp index 0e43020..b180984 100644 --- a/include/jcore/TExp.hpp +++ b/include/jcore/TExp.hpp @@ -5,6 +5,43 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +T linx_tile_iexp(T value) { + T result = static_cast(1); + result += + (value >= static_cast(1)) ? static_cast(2) : static_cast(0); + result += + (value >= static_cast(2)) ? static_cast(4) : static_cast(0); + result += + (value >= static_cast(3)) ? static_cast(13) : static_cast(0); + result += + (value >= static_cast(4)) ? static_cast(35) : static_cast(0); + result += + (value >= static_cast(5)) ? static_cast(93) : static_cast(0); + return result; +} + +template +void TEXP_Impl(tile_shape &dst, tile_shape &src) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TEXP not support Boxed Layout!"); + static_assert(std::is_integral::value, + "Linx direct TEXP supports integral smoke types only"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t tile_index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[tile_index] = linx_tile_iexp(src.data()[tile_index]); + } + } +} +#else template void __vec__ TExpImpl_RowMajor(typename tile_shape::TileDType __out__ dst, @@ -65,4 +102,6 @@ template void TEXP_Impl(tile_shape &dst, tile_shape } } -#endif \ No newline at end of file +#endif + +#endif diff --git a/include/jcore/TExpandCol.hpp b/include/jcore/TExpandCol.hpp index 25c6988..51c1d5d 100644 --- a/include/jcore/TExpandCol.hpp +++ b/include/jcore/TExpandCol.hpp @@ -5,6 +5,29 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TEXPANDCOL_Impl(tile_shape_out &dst, tile_shape_in &src) { + static_assert((tile_shape_out::Rows == tile_shape_in::Rows) && + (tile_shape_out::ValidRow == tile_shape_in::ValidRow), + "Error! Cude A:Rows != Cude B:Rows"); + static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout, + "Not support Fractal layout"); + static_assert(tile_shape_out::Loc != Location::Acc && + tile_shape_in::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + + size_t row = dst.GetValidRow(); + size_t col = dst.GetValidCol(); + for (size_t row_idx = 0; row_idx < row; ++row_idx) { + for (size_t col_idx = 0; col_idx < col; ++col_idx) { + size_t dst_index = index(row_idx, col_idx); + size_t src_index = index(row_idx, 0); + dst.data()[dst_index] = src.data()[src_index]; + } + } +} +#else template void __vec__ TExpandCol_RowImpl(typename tile_shape_out::TileDType __out__ dst, @@ -74,4 +97,5 @@ void TEXPANDCOL_Impl(tile_shape_out &dst, tile_shape_in &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TExpandRow.hpp b/include/jcore/TExpandRow.hpp index d37d634..a945426 100644 --- a/include/jcore/TExpandRow.hpp +++ b/include/jcore/TExpandRow.hpp @@ -5,6 +5,34 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TEXPANDROW_Impl(tile_shape_out &dst, tile_shape_in &src) { + static_assert((tile_shape_out::Cols == tile_shape_in::Cols) && + (tile_shape_out::ValidCol == tile_shape_in::ValidCol), + "Error! Cude A:Columns != Cude B:Columns"); + static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout, + "Not support Fractal layout"); + static_assert(tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC && + tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_out::Loc != Location::Acc && + tile_shape_in::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + + size_t row = dst.GetValidRow(); + size_t col = dst.GetValidCol(); + for (size_t row_idx = 0; row_idx < row; ++row_idx) { + for (size_t col_idx = 0; col_idx < col; ++col_idx) { + size_t dst_index = index(row_idx, col_idx); + size_t src_index = index(0, col_idx); + dst.data()[dst_index] = src.data()[src_index]; + } + } +} +#else template void __vec__ TExpandRow_RowImpl(typename tile_shape_out::TileDType __out__ dst, @@ -78,4 +106,5 @@ void TEXPANDROW_Impl(tile_shape_out &dst, tile_shape_in &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TExpandScalar.hpp b/include/jcore/TExpandScalar.hpp index dd316c8..97cfb4a 100644 --- a/include/jcore/TExpandScalar.hpp +++ b/include/jcore/TExpandScalar.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TEXPANDSCALAR_Impl(tile_shape &dst, typename tile_shape::DType s) { + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, + "TEXPANDSCALAR Linx smoke supports only unboxed tiles"); + + size_t row = dst.GetValidRow(); + size_t col = dst.GetValidCol(); + + for (size_t row_idx = 0; row_idx < row; ++row_idx) { + for (size_t col_idx = 0; col_idx < col; ++col_idx) { + size_t tile_index = index(row_idx, col_idx); + dst.data()[tile_index] = s; + } + } +} +#else template void __vec__ ExpandScalarImpl_RowMajor(typename tile_shape::TileDType __out__ dst, @@ -140,4 +159,4 @@ void TEXPANDSCALAR_Impl(tile_shape &dst, typename tile_shape::DType s) { } #endif - +#endif diff --git a/include/jcore/TCopyIn.hpp b/include/jcore/TLoad.hpp similarity index 84% rename from include/jcore/TCopyIn.hpp rename to include/jcore/TLoad.hpp index 5c8a3fe..b96ec3f 100644 --- a/include/jcore/TCopyIn.hpp +++ b/include/jcore/TLoad.hpp @@ -1,14 +1,42 @@ -#ifndef TCOPYIN_HPP -#define TCOPYIN_HPP +#ifndef JCORE_TLOAD_HPP +#define JCORE_TLOAD_HPP #include "common/pto_tile.hpp" +#ifdef ENABLE_TENSOR_INSTR #include "template_asm.hpp" +#endif using namespace pto; +#ifdef __linx +template +void TLOAD_Impl(tile_shape &dst, gm_shape &src) { + size_t rows = dst.GetValidRow(); + size_t cols = dst.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(gm_shape::staticStride[0] == 1 && + gm_shape::staticStride[1] == 1, + "TODO: Support global tensor more than 3 dimensions"); + static_assert(tile_shape::isBoxedLayout == false, + "Linx smoke TLOAD supports only unboxed tiles"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t gm_index = gm_shape::isRowMajor + ? row * gm_shape::RowStride + col + : col * gm_shape::ColStride + row; + size_t tile_index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[tile_index] = src.data()[gm_index]; + } + } +} +#else // gm row major -> tile Nz template -void __mtc__ CopyInRow2NzImpl1D(typename tile_shape::TileDType __out__ dst, +void __mtc__ LoadRow2NzImpl1D(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -34,7 +62,7 @@ void __mtc__ CopyInRow2NzImpl1D(typename tile_shape::TileDType __out__ dst, // gm col major -> tile Zn template -void __mtc__ CopyInCol2ZnImpl1D(typename tile_shape::TileDType __out__ dst, +void __mtc__ LoadCol2ZnImpl1D(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -60,7 +88,7 @@ void __mtc__ CopyInCol2ZnImpl1D(typename tile_shape::TileDType __out__ dst, // gm row major -> tile Zn template -void __mtc__ CopyInRow2ZnImpl1D(typename tile_shape::TileDType __out__ dst, +void __mtc__ LoadRow2ZnImpl1D(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -86,9 +114,9 @@ void __mtc__ CopyInRow2ZnImpl1D(typename tile_shape::TileDType __out__ dst, //no fractal template -void __mtc__ TCopyIn_Vec_ColMajor(typename tile_shape::TileDType __out__ dst, +void __mtc__ TLoad_Vec_ColMajor(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src) { - + size_t i = blkv_get_index_x(); size_t j = blkv_get_index_y(); @@ -96,11 +124,11 @@ void __mtc__ TCopyIn_Vec_ColMajor(typename tile_shape::TileDType __out__ dst, size_t index_tile = j * tile_shape::ColStride + i; blkv_get_tile_ptr(dst)[index_tile] = src[index_gm]; } - + template -void __mtc__ TCopyIn_Vec_RowMajor(typename tile_shape::TileDType __out__ dst, +void __mtc__ TLoad_Vec_RowMajor(typename tile_shape::TileDType __out__ dst, typename gm_shape::DType __in__ *src) { - + size_t i = blkv_get_index_x(); size_t j = blkv_get_index_y(); @@ -111,7 +139,7 @@ void __mtc__ TCopyIn_Vec_RowMajor(typename tile_shape::TileDType __out__ dst, // gm row major -> tile Nz template -void __mtc__ CopyInRow2NzImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst, +void __mtc__ LoadRow2NzImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src, const size_t __in__ gm_row_stride) { static constexpr int inner_rows = tile_shape::InnerRows; @@ -134,7 +162,7 @@ void __mtc__ CopyInRow2NzImpl2D_Dynamic(typename tile_shape::TileDType __out__ d // gm col major -> tile Zn template -void __mtc__ CopyInCol2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst, +void __mtc__ LoadCol2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src, const size_t __in__ gm_col_stride) { static constexpr int inner_rows = tile_shape::InnerRows; @@ -157,7 +185,7 @@ void __mtc__ CopyInCol2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ d // gm row major -> tile Zn template -void __mtc__ CopyInRow2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst, +void __mtc__ LoadRow2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src, const size_t __in__ gm_row_stride) { static constexpr int inner_rows = tile_shape::InnerRows; @@ -180,10 +208,10 @@ void __mtc__ CopyInRow2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ d //no fractal template -void __mtc__ TCopyIn_Vec_ColMajor_Dynamic(typename tile_shape::TileDType __out__ dst, +void __mtc__ TLoad_Vec_ColMajor_Dynamic(typename tile_shape::TileDType __out__ dst, const typename gm_shape::DType __in__ *src, const size_t __in__ gm_col_stride) { - + size_t i = blkv_get_index_x(); size_t j = blkv_get_index_y(); @@ -193,10 +221,10 @@ void __mtc__ TCopyIn_Vec_ColMajor_Dynamic(typename tile_shape::TileDType __out__ } template -void __mtc__ TCopyIn_Vec_RowMajor_Dynamic(typename tile_shape::TileDType __out__ dst, +void __mtc__ TLoad_Vec_RowMajor_Dynamic(typename tile_shape::TileDType __out__ dst, typename gm_shape::DType __in__ *src, const size_t __in__ gm_row_stride) { - + size_t i = blkv_get_index_x(); size_t j = blkv_get_index_y(); @@ -206,7 +234,7 @@ void __mtc__ TCopyIn_Vec_RowMajor_Dynamic(typename tile_shape::TileDType __out__ } template -void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { +void _TLOAD_Impl(tile_shape &dst, gm_shape &src) { size_t tile_rows = dst.GetValidRow(); size_t tile_cols = dst.GetValidCol(); static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here"); @@ -290,7 +318,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic if constexpr (is_Nz_layout::value) { // Nz if constexpr (gm_shape::isRowMajor) { - CopyInRow2NzImpl2D_Dynamic + LoadRow2NzImpl2D_Dynamic <<>>(dst.data(), src.data(), src.GetStride(3)); } else { static_assert(gm_shape::isRowMajor, @@ -298,16 +326,16 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } } else if constexpr (is_Zn_layout::value) { //Zn if constexpr (!gm_shape::isRowMajor) { - CopyInCol2ZnImpl2D_Dynamic + LoadCol2ZnImpl2D_Dynamic <<>>(dst.data(), src.data(), src.GetStride(4)); } else { - CopyInRow2ZnImpl2D_Dynamic + LoadRow2ZnImpl2D_Dynamic <<>>(dst.data(), src.data(), src.GetStride(3)); } } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (tile_shape::isRowMajor) { if constexpr (gm_shape::isRowMajor) { - TCopyIn_Vec_RowMajor_Dynamic + TLoad_Vec_RowMajor_Dynamic <<>>(dst.data(), src.data(), src.GetStride(3)); } else { static_assert(gm_shape::isRowMajor, @@ -315,7 +343,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } } else if constexpr (!tile_shape::isRowMajor) { if constexpr (!gm_shape::isRowMajor) { - TCopyIn_Vec_ColMajor_Dynamic + TLoad_Vec_ColMajor_Dynamic <<>>(dst.data(), src.data(), src.GetStride(4)); } else { static_assert(!gm_shape::isRowMajor, @@ -329,7 +357,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } else { // static if constexpr (is_Nz_layout::value) { // Nz if constexpr (gm_shape::isRowMajor) { - CopyInRow2NzImpl1D + LoadRow2NzImpl1D <<>>(dst.data(), src.data()); } else { static_assert(gm_shape::isRowMajor, @@ -337,16 +365,16 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } } else if constexpr (is_Zn_layout::value) { //Zn if constexpr (!gm_shape::isRowMajor) { - CopyInCol2ZnImpl1D + LoadCol2ZnImpl1D <<>>(dst.data(), src.data()); } else { - CopyInRow2ZnImpl1D + LoadRow2ZnImpl1D <<>>(dst.data(), src.data()); } } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (tile_shape::isRowMajor) { if constexpr (gm_shape::isRowMajor) { - TCopyIn_Vec_RowMajor + TLoad_Vec_RowMajor <<>>(dst.data(), src.data()); } else { static_assert(gm_shape::isRowMajor, @@ -354,7 +382,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } } else if constexpr (!tile_shape::isRowMajor) { if constexpr (!gm_shape::isRowMajor) { - TCopyIn_Vec_ColMajor + TLoad_Vec_ColMajor <<>>(dst.data(), src.data()); } else { static_assert(!gm_shape::isRowMajor, @@ -367,19 +395,19 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { } } - + #endif } template -void TCOPYIN_2LVL(tile_shape &dst, gm_shape &src){ +void TLOAD_2LVL(tile_shape &dst, gm_shape &src){ using tile_tmp = Tile; tile_tmp tmp; - _TCOPYIN_Impl(tmp, src); + _TLOAD_Impl(tmp, src); if constexpr(gm_shape::isRowMajor && is_Nz_layout::value){ - TCVT_ND2NZ(dst, tmp); + TCVT_ND2NZ(dst, tmp); }else if constexpr(gm_shape::isRowMajor && is_Zn_layout::value){ - TCVT_ND2ZN(dst, tmp); + TCVT_ND2ZN(dst, tmp); }else if constexpr(!gm_shape::isRowMajor && is_Nz_layout::value){ TCVT_DN2NZ(dst, tmp); }else if constexpr(!gm_shape::isRowMajor && is_Zn_layout::value){ @@ -388,12 +416,13 @@ void TCOPYIN_2LVL(tile_shape &dst, gm_shape &src){ } template -void TCOPYIN_Impl(tile_shape &dst, gm_shape &src) { +void TLOAD_Impl(tile_shape &dst, gm_shape &src) { #ifdef RUMINATE - TCOPYIN_2LVL(dst, src); + TLOAD_2LVL(dst, src); #else - _TCOPYIN_Impl(dst, src); + _TLOAD_Impl(dst, src); #endif } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TMax.hpp b/include/jcore/TMax.hpp index 2739f0f..3d9415d 100644 --- a/include/jcore/TMax.hpp +++ b/include/jcore/TMax.hpp @@ -5,6 +5,27 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TMAX_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + size_t rows = src0.GetValidRow(); + size_t cols = src0.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TMAX not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + auto src0_value = src0.data()[index]; + auto src1_value = src1.data()[index]; + dst.data()[index] = src0_value > src1_value ? src0_value : src1_value; + } + } +} +#else template void __vec__ @@ -68,5 +89,6 @@ void TMAX_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TMaxs.hpp b/include/jcore/TMaxs.hpp index 9cec5aa..6d7e3dc 100644 --- a/include/jcore/TMaxs.hpp +++ b/include/jcore/TMaxs.hpp @@ -5,6 +5,26 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TMAXS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TMAXS not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + auto src_value = src.data()[index]; + dst.data()[index] = src_value > s ? src_value : s; + } + } +} +#else template void __vec__ TMaxsImpl_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src, @@ -64,5 +84,6 @@ void TMAXS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TMul.hpp b/include/jcore/TMul.hpp index c059c62..dd989e1 100644 --- a/include/jcore/TMul.hpp +++ b/include/jcore/TMul.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TMUL_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + size_t rows = src0.GetValidRow(); + size_t cols = src0.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TMUL not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src0.data()[index] * src1.data()[index]; + } + } +} +#else template void __vec__ TmulImpl_RowMajor(typename tile_shape::TileDType __out__ dst, @@ -67,5 +86,6 @@ void TMUL_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TMuls.hpp b/include/jcore/TMuls.hpp index 3d01439..af125e6 100644 --- a/include/jcore/TMuls.hpp +++ b/include/jcore/TMuls.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TMULS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TMULS not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src.data()[index] * s; + } + } +} +#else template void __vec__ TMulsImpl_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src, @@ -62,5 +81,6 @@ void TMULS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TOr.hpp b/include/jcore/TOr.hpp index a7eee26..f7e5854 100644 --- a/include/jcore/TOr.hpp +++ b/include/jcore/TOr.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TOR_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + size_t rows = src0.GetValidRow(); + size_t cols = src0.GetValidCol(); + static_assert(tile_shape::Loc == Location::Vec, + "Only VEC tile type are supported"); + static_assert(!tile_shape::isBoxedLayout, "TOR not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src0.data()[index] | src1.data()[index]; + } + } +} +#else template void __vec__ TOr_Vec_RowMajor( typename tile_shape::TileDType __out__ dst, @@ -70,5 +89,6 @@ void TOR_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Only int data type are supported"); } } +#endif #endif diff --git a/include/jcore/TPad.hpp b/include/jcore/TPad.hpp index 9ffd31c..3a6af6a 100644 --- a/include/jcore/TPad.hpp +++ b/include/jcore/TPad.hpp @@ -3,9 +3,61 @@ #include "common/pto_tile.hpp" #include "jcore/constants.hpp" +#ifndef __linx #include +#endif using namespace pto; +#ifdef __linx +template +void TPAD_Impl(tile_shape_out &dst, const tile_shape_in &src, T pad_value, + size_t up_pad, size_t left_pad, size_t down_pad, + size_t right_pad) { + static_assert(!tile_shape_out::isBoxedLayout && + !tile_shape_in::isBoxedLayout, + "Not support Boxed Layout!"); + static_assert(tile_shape_out::Loc == Location::Vec && + tile_shape_in::Loc == Location::Vec, + "Only VEC tile type are supported"); + static_assert(tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC && + tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_out::ValidRow >= tile_shape_in::ValidRow && + tile_shape_out::ValidCol >= tile_shape_in::ValidCol, + "Dst must cover src shape!"); + + size_t src_valid_row = src.GetValidRow(); + size_t src_valid_col = src.GetValidCol(); + size_t dst_valid_row = dst.GetValidRow(); + size_t dst_valid_col = dst.GetValidCol(); + size_t after_pad_row = up_pad + src_valid_row + down_pad; + size_t after_pad_col = left_pad + src_valid_col + right_pad; + + if (after_pad_row > dst_valid_row || after_pad_col > dst_valid_col) { + return; + } + + for (size_t row = 0; row < dst_valid_row; ++row) { + for (size_t col = 0; col < dst_valid_col; ++col) { + size_t dst_index = index(row, col); + bool in_src_range = row >= up_pad && row < up_pad + src_valid_row && + col >= left_pad && col < left_pad + src_valid_col; + if (in_src_range) { + size_t src_row = row - up_pad; + size_t src_col = col - left_pad; + size_t src_index = index(src_row, src_col); + dst.data()[dst_index] = src.data()[src_index]; + } else { + dst.data()[dst_index] = + static_cast(pad_value); + } + } + } +} +#else template void __vec__ TPad_Vec_RowMajor(typename tile_shape_out::TileDType __out__ dst, const typename tile_shape_in::TileDType __in__ src, const T __in__ pad_value, @@ -101,4 +153,5 @@ void TPAD_Impl(tile_shape_out &dst, const tile_shape_in &src, "Storage layout type not supported"); } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TRecip.hpp b/include/jcore/TRecip.hpp index 9680437..43fb3fd 100644 --- a/include/jcore/TRecip.hpp +++ b/include/jcore/TRecip.hpp @@ -5,6 +5,27 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TRECIP_Impl(tile_shape &dst, tile_shape &src) { + static constexpr size_t row = tile_shape::ValidRow; + static constexpr size_t col = tile_shape::ValidCol; + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "TRECIP not support Boxed Layout!"); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + size_t tile_index = index(i, j); + dst.data()[tile_index] = + static_cast(1) / src.data()[tile_index]; + } + } +} +#else template void __vec__ TRecip_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src) { @@ -62,4 +83,5 @@ void TRECIP_Impl(tile_shape &dst, tile_shape &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TRem.hpp b/include/jcore/TRem.hpp index be2ef5e..38789bc 100644 --- a/include/jcore/TRem.hpp +++ b/include/jcore/TRem.hpp @@ -5,6 +5,29 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TREM_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + static constexpr size_t row = tile_shape::ValidRow; + static constexpr size_t col = tile_shape::ValidCol; + static_assert(tile_shape::Loc == Location::Vec, + "Only VEC tile type are supported"); + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(std::is_same::value || + std::is_same::value, + "Data type not supported"); + static_assert(tile_shape::isBoxedLayout == false, + "TREM not support Boxed Layout!"); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + size_t tile_index = index(i, j); + dst.data()[tile_index] = src0.data()[tile_index] % src1.data()[tile_index]; + } + } +} +#else template void __vec__ TRemImpl_RowMajor(typename tile_shape::TileDType __out__ dst, @@ -71,5 +94,6 @@ void TREM_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Storage type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TReshape.hpp b/include/jcore/TReshape.hpp index 64aff12..782a1ae 100644 --- a/include/jcore/TReshape.hpp +++ b/include/jcore/TReshape.hpp @@ -5,6 +5,28 @@ using namespace pto; +#ifdef __linx +template +void TRESHAPE_Impl(tile_shape_out &tile_out, tile_shape_in &tile_in) { + static_assert(tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC && + tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_out::Loc != Location::Acc && + tile_shape_in::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape_out::isBoxedLayout && + !tile_shape_in::isBoxedLayout, + "TRESHAPE not support Boxed Layout!"); + static_assert(tile_shape_out::Numel == tile_shape_in::Numel, + "TRESHAPE requires equal tile element counts"); + + for (size_t index = 0; index < tile_shape_in::Numel; ++index) { + tile_out.data()[index] = tile_in.data()[index]; + } +} +#else template void TRESHAPE_Impl(tile_shape_out &tile_out, tile_shape_in &tile_in) { static_assert(tile_shape_in::ValidRow != DYNAMIC && tile_shape_in::ValidCol != DYNAMIC && @@ -14,5 +36,6 @@ void TRESHAPE_Impl(tile_shape_out &tile_out, tile_shape_in &tile_in) { "Unsupport ACC to be input or output here"); tile_out.data() = tile_in.data(); } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TRowMax.hpp b/include/jcore/TRowMax.hpp index 917af27..dc46afd 100644 --- a/include/jcore/TRowMax.hpp +++ b/include/jcore/TRowMax.hpp @@ -5,6 +5,36 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TROWMAX_Impl(tile_shape_out &dst, tile_shape_in &src) { + static_assert(tile_shape_in::Rows == tile_shape_out::Rows, + "Error! Input row != Output row."); + static_assert(tile_shape_out::ValidCol == 1, "valid column must be 1."); + static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout, + "Not support Fractal layout"); + static_assert(tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC && + tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_out::Loc != Location::Acc && + tile_shape_in::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + for (size_t row = 0; row < rows; ++row) { + typename tile_shape_in::DType max_val = + src.data()[index(row, 0)]; + for (size_t col = 1; col < cols; ++col) { + auto now_val = src.data()[index(row, col)]; + max_val = max_val > now_val ? max_val : now_val; + } + dst.data()[index(row, 0)] = max_val; + } +} +#else template void __vec__ TRowMax_NoFractal_Impl(typename tile_shape_out::TileDType __out__ dst, @@ -105,4 +135,5 @@ void TROWMAX_Impl(tile_shape_out &dst, tile_shape_in &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TRowMaxExpand.hpp b/include/jcore/TRowMaxExpand.hpp index 7c1dbd0..b552b5a 100644 --- a/include/jcore/TRowMaxExpand.hpp +++ b/include/jcore/TRowMaxExpand.hpp @@ -5,6 +5,31 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +// ROWMAX + EXPAND +template +void TROWMAXEXPAND_Impl(tile_shape &dst, tile_shape &src) { + static_assert(tile_shape::ValidRow != DYNAMIC && + tile_shape::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "Not support Fractal layout"); + + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + for (size_t row = 0; row < rows; ++row) { + typename tile_shape::DType max_val = src.data()[index(row, 0)]; + for (size_t col = 1; col < cols; ++col) { + auto now_val = src.data()[index(row, col)]; + max_val = max_val > now_val ? max_val : now_val; + } + for (size_t col = 0; col < cols; ++col) { + dst.data()[index(row, col)] = max_val; + } + } +} +#else template void __vec__ TRowMaxExpand_NoFractal_Impl(typename tile_shape::TileDType __out__ dst, @@ -77,3 +102,4 @@ void TROWMAXEXPAND_Impl(tile_shape &dst, tile_shape &src) { } } #endif +#endif diff --git a/include/jcore/TRowSum.hpp b/include/jcore/TRowSum.hpp index c6ebb8c..c49ee8c 100644 --- a/include/jcore/TRowSum.hpp +++ b/include/jcore/TRowSum.hpp @@ -5,6 +5,34 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TROWSUM_Impl(tile_shape_out &dst, tile_shape_in &src) { + static_assert(tile_shape_in::Rows == tile_shape_out::Rows, + "Error! Input row != Output row."); + static_assert(tile_shape_out::ValidCol == 1, "valid column must be 1."); + static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout, + "Not support Fractal layout"); + static_assert(tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC && + tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_out::Loc != Location::Acc && + tile_shape_in::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + for (size_t row = 0; row < rows; ++row) { + typename tile_shape_in::DType sum = src.data()[index(row, 0)]; + for (size_t col = 1; col < cols; ++col) { + sum += src.data()[index(row, col)]; + } + dst.data()[index(row, 0)] = sum; + } +} +#else template void __vec__ TRowSum_NoFractal_Impl(typename tile_shape_out::TileDType __out__ dst, @@ -107,4 +135,5 @@ void TROWSUM_Impl(tile_shape_out &dst, tile_shape_in &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TRowSumExpand.hpp b/include/jcore/TRowSumExpand.hpp index 1be1bef..710db56 100644 --- a/include/jcore/TRowSumExpand.hpp +++ b/include/jcore/TRowSumExpand.hpp @@ -5,6 +5,30 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +// ROWSUM + EXPAND +template +void TROWSUMEXPAND_Impl(tile_shape &dst, tile_shape &src) { + static_assert(tile_shape::ValidRow != DYNAMIC && + tile_shape::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "Not support Fractal layout"); + + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + for (size_t row = 0; row < rows; ++row) { + typename tile_shape::DType sum = src.data()[index(row, 0)]; + for (size_t col = 1; col < cols; ++col) { + sum += src.data()[index(row, col)]; + } + for (size_t col = 0; col < cols; ++col) { + dst.data()[index(row, col)] = sum; + } + } +} +#else template void __vec__ TRowSumExpand_NoFractal_Impl(typename tile_shape::TileDType __out__ dst, @@ -77,3 +101,4 @@ void TROWSUMEXPAND_Impl(tile_shape &dst, tile_shape &src) { } #endif +#endif diff --git a/include/jcore/TSqrt.hpp b/include/jcore/TSqrt.hpp index dd8db9d..c819d7c 100644 --- a/include/jcore/TSqrt.hpp +++ b/include/jcore/TSqrt.hpp @@ -5,6 +5,49 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +T linx_tile_isqrt(T value) { + T root = 0; + root += (value >= static_cast(1)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(4)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(9)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(16)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(25)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(36)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(49)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(64)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(81)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(100)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(121)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(144)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(169)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(196)) ? static_cast(1) : static_cast(0); + root += (value >= static_cast(225)) ? static_cast(1) : static_cast(0); + return root; +} + +template +void TSQRT_Impl(tile_shape &dst, tile_shape &src) { + static constexpr size_t row = tile_shape::ValidRow; + static constexpr size_t col = tile_shape::ValidCol; + static_assert(row != DYNAMIC && col != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "TSQRT not support Boxed Layout!"); + static_assert(std::is_integral::value, + "Linx direct TSQRT supports integral smoke types only"); + + for (size_t i = 0; i < row; ++i) { + for (size_t j = 0; j < col; ++j) { + size_t tile_index = index(i, j); + dst.data()[tile_index] = linx_tile_isqrt(src.data()[tile_index]); + } + } +} +#else template void __vec__ TSqrt_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src) { @@ -74,4 +117,5 @@ void TSQRT_Impl(tile_shape &dst, tile_shape &src) { } } -#endif \ No newline at end of file +#endif +#endif diff --git a/include/jcore/TCopyOut.hpp b/include/jcore/TStore.hpp similarity index 82% rename from include/jcore/TCopyOut.hpp rename to include/jcore/TStore.hpp index cd8725c..f6bfbd5 100644 --- a/include/jcore/TCopyOut.hpp +++ b/include/jcore/TStore.hpp @@ -1,13 +1,36 @@ -#ifndef TCOPYOUT_HPP -#define TCOPYOUT_HPP +#ifndef JCORE_TSTORE_HPP +#define JCORE_TSTORE_HPP #include "common/pto_tile.hpp" using namespace pto; +#ifdef __linx +template +void TSTORE_Impl(gm_shape &dst, tile_shape &src) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "Linx smoke TSTORE supports only unboxed tiles"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t gm_index = gm_shape::isRowMajor + ? row * gm_shape::RowStride + col + : col * gm_shape::ColStride + row; + size_t tile_index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[gm_index] = src.data()[tile_index]; + } + } +} +#else // cube left -> gm row major template -void __mtc__ CopyOut2NzImpl1D(typename gm_shape::DType __out__ *dst, +void __mtc__ Store2NzImpl1D(typename gm_shape::DType __out__ *dst, const typename tile_shape::TileDType __in__ src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -32,7 +55,7 @@ void __mtc__ CopyOut2NzImpl1D(typename gm_shape::DType __out__ *dst, } template -void __mtc__ CopyOut2ZnImpl1D(typename gm_shape::DType __out__ *dst, +void __mtc__ Store2ZnImpl1D(typename gm_shape::DType __out__ *dst, const typename tile_shape::TileDType __in__ src) { static constexpr int inner_rows = tile_shape::InnerRows; static constexpr int inner_cols = tile_shape::InnerCols; @@ -59,7 +82,7 @@ void __mtc__ CopyOut2ZnImpl1D(typename gm_shape::DType __out__ *dst, //no fractal template void __mtc__ -TCopyOut_Vec_ColMajor(typename gm_shape::DType __out__ *dst, +TStore_Vec_ColMajor(typename gm_shape::DType __out__ *dst, const typename tile_shape::TileDType __in__ src) { size_t i = blkv_get_index_x(); size_t j = blkv_get_index_y(); @@ -68,10 +91,10 @@ TCopyOut_Vec_ColMajor(typename gm_shape::DType __out__ *dst, size_t index_tile = j * tile_shape::ColStride + i; dst[index_gm] = blkv_get_tile_ptr(src)[index_tile]; } - + template void __mtc__ -TCopyOut_Vec_RowMajor(typename gm_shape::DType __out__ *dst, +TStore_Vec_RowMajor(typename gm_shape::DType __out__ *dst, typename tile_shape::TileDType __in__ src) { size_t i = blkv_get_index_x(); size_t j = blkv_get_index_y(); @@ -83,7 +106,7 @@ TCopyOut_Vec_RowMajor(typename gm_shape::DType __out__ *dst, // cube left -> gm row major template -void __mtc__ CopyOut2NzImpl2D_Dynamic(typename gm_shape::DType __out__ *dst, +void __mtc__ Store2NzImpl2D_Dynamic(typename gm_shape::DType __out__ *dst, const typename tile_shape::TileDType __in__ src, const size_t __in__ gm_row_stride) { static constexpr int inner_rows = tile_shape::InnerRows; @@ -105,7 +128,7 @@ void __mtc__ CopyOut2NzImpl2D_Dynamic(typename gm_shape::DType __out__ *dst, } template -void __mtc__ CopyOut2ZnImpl2D_Dynamic(typename gm_shape::DType __out__ *dst, +void __mtc__ Store2ZnImpl2D_Dynamic(typename gm_shape::DType __out__ *dst, const typename tile_shape::TileDType __in__ src, const size_t __in__ gm_row_stride) { static constexpr int inner_rows = tile_shape::InnerRows; @@ -128,7 +151,7 @@ void __mtc__ CopyOut2ZnImpl2D_Dynamic(typename gm_shape::DType __out__ *dst, //no fractal template -void __mtc__ TCopyOut_Vec_ColMajor_Dynamic(typename gm_shape::DType __out__ *dst, +void __mtc__ TStore_Vec_ColMajor_Dynamic(typename gm_shape::DType __out__ *dst, const typename tile_shape::TileDType __in__ src, const size_t __in__ gm_col_stride) { size_t i = blkv_get_index_x(); @@ -138,9 +161,9 @@ void __mtc__ TCopyOut_Vec_ColMajor_Dynamic(typename gm_shape::DType __out__ *dst size_t index_tile = j * tile_shape::ColStride + i; dst[index_gm] = blkv_get_tile_ptr(src)[index_tile]; } - + template -void __mtc__ TCopyOut_Vec_RowMajor_Dynamic(typename gm_shape::DType __out__ *dst, +void __mtc__ TStore_Vec_RowMajor_Dynamic(typename gm_shape::DType __out__ *dst, typename tile_shape::TileDType __in__ src, const size_t __in__ gm_row_stride) { size_t i = blkv_get_index_x(); @@ -152,7 +175,7 @@ void __mtc__ TCopyOut_Vec_RowMajor_Dynamic(typename gm_shape::DType __out__ *dst } template -void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { +void TSTORE_Impl(gm_shape &dst, tile_shape &src) { size_t tile_rows = src.GetValidRow(); size_t tile_cols = src.GetValidCol(); static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here"); @@ -203,7 +226,7 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { } } } else { - static_assert(tile_shape::isBoxedLayout == false, + static_assert(tile_shape::isBoxedLayout == false, "Storage layout type not supported"); } @@ -211,15 +234,15 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { if constexpr (gm_shape::RowStride == DYNAMIC || gm_shape::ColStride == DYNAMIC || tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic if constexpr (is_Nz_layout::value) { // Nz - CopyOut2NzImpl2D_Dynamic + Store2NzImpl2D_Dynamic <<>>(dst.data(), src.data(), dst.GetStride(3)); } else if constexpr (is_Zn_layout::value) { // Zn - CopyOut2ZnImpl2D_Dynamic + Store2ZnImpl2D_Dynamic <<>>(dst.data(), src.data(), dst.GetStride(3)); } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (tile_shape::isRowMajor) { if constexpr (gm_shape::isRowMajor) { - TCopyOut_Vec_RowMajor_Dynamic + TStore_Vec_RowMajor_Dynamic <<>>(dst.data(), src.data(), dst.GetStride(3)); } else { static_assert(gm_shape::isRowMajor, @@ -227,7 +250,7 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { } } else if constexpr (!tile_shape::isRowMajor) { if constexpr (!gm_shape::isRowMajor) { - TCopyOut_Vec_ColMajor_Dynamic + TStore_Vec_ColMajor_Dynamic <<>>(dst.data(), src.data(), dst.GetStride(4)); } else { static_assert(!gm_shape::isRowMajor, @@ -235,20 +258,20 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { } } } else { - static_assert(tile_shape::isBoxedLayout == false, + static_assert(tile_shape::isBoxedLayout == false, "Storage layout type not supported"); } } else { // static if constexpr (is_Nz_layout::value) { // Nz - CopyOut2NzImpl1D + Store2NzImpl1D <<>>(dst.data(), src.data()); } else if constexpr (is_Zn_layout::value) { // Zn - CopyOut2ZnImpl1D + Store2ZnImpl1D <<>>(dst.data(), src.data()); } else if constexpr (tile_shape::isBoxedLayout == false) { if constexpr (tile_shape::isRowMajor) { if constexpr (gm_shape::isRowMajor) { - TCopyOut_Vec_RowMajor + TStore_Vec_RowMajor <<>>(dst.data(), src.data()); } else { static_assert(gm_shape::isRowMajor, @@ -256,7 +279,7 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { } } else if constexpr (!tile_shape::isRowMajor) { if constexpr (!gm_shape::isRowMajor) { - TCopyOut_Vec_ColMajor + TStore_Vec_ColMajor <<>>(dst.data(), src.data()); } else { static_assert(!gm_shape::isRowMajor, @@ -264,13 +287,14 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) { } } } else { - static_assert(tile_shape::isBoxedLayout == false, + static_assert(tile_shape::isBoxedLayout == false, "Storage layout type not supported"); } } - + #endif } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TSub.hpp b/include/jcore/TSub.hpp index 5c9ee29..8689e63 100644 --- a/include/jcore/TSub.hpp +++ b/include/jcore/TSub.hpp @@ -5,6 +5,26 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TSUB_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { + size_t rows = src0.GetValidRow(); + size_t cols = src0.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape::isBoxedLayout == false, + "TSUB not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src0.data()[index] - src1.data()[index]; + } + } +} +#else template void __vec__ Tsub_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src0, @@ -71,5 +91,6 @@ void TSUB_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) { "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TSubs.hpp b/include/jcore/TSubs.hpp index 90f462b..daa151f 100644 --- a/include/jcore/TSubs.hpp +++ b/include/jcore/TSubs.hpp @@ -5,6 +5,25 @@ #include "jcore/constants.hpp" using namespace pto; +#ifdef __linx +template +void TSUBS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) { + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + static_assert(tile_shape::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(!tile_shape::isBoxedLayout, "TSUBS not support Boxed Layout!"); + + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t index = tile_shape::isRowMajor + ? row * tile_shape::RowStride + col + : col * tile_shape::ColStride + row; + dst.data()[index] = src.data()[index] - s; + } + } +} +#else template void __vec__ TSubs_RowMajor(typename tile_shape::TileDType __out__ dst, const typename tile_shape::TileDType __in__ src, @@ -67,5 +86,6 @@ void TSUBS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/TTrans.hpp b/include/jcore/TTrans.hpp index 3933340..9df15ef 100644 --- a/include/jcore/TTrans.hpp +++ b/include/jcore/TTrans.hpp @@ -6,6 +6,40 @@ using namespace pto; +#ifdef __linx +template +void TTRANS_Impl(tile_shape_out &dst, tile_shape_in &src) { + static_assert( + tile_shape_in::Rows == tile_shape_out::Cols && + tile_shape_in::Cols == tile_shape_out::Rows, + "Error! Input rows != Output Columns or Input Columns != Output rows"); + static_assert(tile_shape_in::ValidRow != DYNAMIC && + tile_shape_in::ValidCol != DYNAMIC && + tile_shape_out::ValidRow != DYNAMIC && + tile_shape_out::ValidCol != DYNAMIC, + "TODO: Support tile dynamic shape!"); + static_assert(tile_shape_out::Loc != Location::Acc && + tile_shape_in::Loc != Location::Acc, + "Unsupport ACC to be input or output here"); + static_assert(tile_shape_out::isBoxedLayout == false && + tile_shape_in::isBoxedLayout == false, + "Storage layout type not supported"); + + size_t rows = src.GetValidRow(); + size_t cols = src.GetValidCol(); + for (size_t row = 0; row < rows; ++row) { + for (size_t col = 0; col < cols; ++col) { + size_t src_index = tile_shape_in::isRowMajor + ? row * tile_shape_in::RowStride + col + : col * tile_shape_in::ColStride + row; + size_t dst_index = tile_shape_out::isRowMajor + ? col * tile_shape_out::RowStride + row + : row * tile_shape_out::ColStride + col; + dst.data()[dst_index] = src.data()[src_index]; + } + } +} +#else template void __vec__ TTrans_RowMajor(typename tile_shape_out::TileDType __out__ dst, @@ -96,5 +130,6 @@ void TTRANS_Impl(tile_shape_out &dst, tile_shape_in &src) { "Storage layout type not supported"); } } +#endif -#endif \ No newline at end of file +#endif diff --git a/include/jcore/type.hpp b/include/jcore/type.hpp index 2165a2c..bde7cdc 100644 --- a/include/jcore/type.hpp +++ b/include/jcore/type.hpp @@ -1,8 +1,76 @@ #ifndef _INCLUDE_JCORE_TYPE_H_ #define _INCLUDE_JCORE_TYPE_H_ -#include #include +#include +#include + +#ifdef __linx +#ifndef tile_size +#define tile_size(N) __attribute__((tile_size(N))) +#endif +#ifndef __in__ +#define __in__ +#endif +#ifndef __out__ +#define __out__ +#endif +#ifndef __vec__ +#define __vec__ +#endif +#ifndef __mtc__ +#define __mtc__ +#endif + +struct __fp32 { + float value; + constexpr __fp32(float v = 0.0f) : value(v) {} + constexpr operator float() const { return value; } +}; +struct __tf32 { + uint32_t bits; +}; +struct __hf32 { + uint32_t bits; +}; +struct __half { + uint16_t bits; + constexpr __half(float = 0.0f) : bits(0) {} +}; +struct __hif8 { + uint8_t bits; +}; +struct __fp8_e4m3 { + uint8_t bits; +}; +struct __fp8_e5m2 { + uint8_t bits; +}; +struct __fp6_e3m2 { + uint8_t bits; +}; +struct __fp6_e2m3 { + uint8_t bits; +}; +struct __fp4_e2m1x2 { + uint8_t bits; +}; +struct __fp4_e1m2x2 { + uint8_t bits; +}; +struct __fp8_e8m0 { + uint8_t bits; +}; +struct __fp4_hif4x2 { + uint8_t bits; +}; +struct __int4x2 { + uint8_t bits; +}; +struct __uint4x2 { + uint8_t bits; +}; +#endif enum __type_code { __type_fp64 = 0, @@ -50,7 +118,9 @@ template<> struct type_traits<__tf32> : public type_traits_base<__type_t template<> struct type_traits<__hf32> : public type_traits_base<__type_hf32, 32> {}; template<> struct type_traits<__half> : public type_traits_base<__type_fp16, 16> {}; +#ifndef __linx template<> struct type_traits<__bf16> : public type_traits_base<__type_bf16, 16> {}; +#endif template<> struct type_traits<__hif8> : public type_traits_base<__type_hif8, 8> {}; template<> struct type_traits<__fp8_e4m3> : public type_traits_base<__type_fp8_e4m3, 8> {}; diff --git a/include/jcore/utils.hpp b/include/jcore/utils.hpp index 3268155..ee7dee0 100644 --- a/include/jcore/utils.hpp +++ b/include/jcore/utils.hpp @@ -12,7 +12,7 @@ void print_tile_Impl(tile_shape &tile) { typename tile_shape::DType d[tile_size] = {0}; using dtype = typename tile_shape::DType; using shape = Shape<1, 1, 1, 1, 1>; - using stride = + using stride = std::conditional_t, Stride<1, 1, tile_shape::Rows * tile_shape::Cols, 1, tile_shape::Rows>>; @@ -21,7 +21,7 @@ void print_tile_Impl(tile_shape &tile) { GlobalTensor, GlobalTensor>; gm_shape dst(d); - TCOPYOUT(dst, tile); + TSTORE(dst, tile); print_tile_info(); std::cout << std::fixed << std::scientific << std::setprecision(4); diff --git a/kernels/element_wise/gelu.hpp b/kernels/element_wise/gelu.hpp index 69cc4ee..191eb59 100644 --- a/kernels/element_wise/gelu.hpp +++ b/kernels/element_wise/gelu.hpp @@ -1,9 +1,14 @@ #include -#include "../test/accelerator/include/accelerator_fusion.h" +#include #include "template_asm.h" +#ifdef __linx +#include +#include +#else #include #include +#endif // 海思解决方案 新版多项式拟合 // ============================================== @@ -21,7 +26,7 @@ void __vec__ gelu_simd( // 数据格式转换 V.FCVT float x = static_cast(indata); - + constexpr uint32_t TOTAL_COUNT = 24*8*1024; constexpr float SCALAR_A5 = -3.5123395303315874e-09f; constexpr float SCALAR_A4 = 2.6452661927578447e-07f; @@ -31,7 +36,7 @@ void __vec__ gelu_simd( constexpr float SCALAR_A0 = -7.2666168212890625e-02f; constexpr float SCALAR_AM1 = -1.5957698822021484e+00f; constexpr float FP32_MAX = 5.75f; - + float t = blkv_max(x, -FP32_MAX); t = blkv_min(t, FP32_MAX); float t2 = t * t; @@ -45,7 +50,7 @@ void __vec__ gelu_simd( float exp_val = blkv_fexp(t * p); float y = x / (1.0f + exp_val); - + BLKC_ASSIGN_CAST(out, index, y); // blkv_get_tile_ptr(out)[index] = static_cast(result); } @@ -65,7 +70,7 @@ void gelu_impl( GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -83,7 +88,7 @@ void gelu( bool approximate = false // false:none, true:tanh ) { const int Mb = gM / tM; - + const int rmd_M = gM % tM; using gm_shape = global_tensor>; @@ -112,18 +117,18 @@ void gelu( // printf("iter i %d\n",i); auto gI = gIIter(0, i); auto gO = gOIter(0, i); - TCOPYIN(inTile, gI); + TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_intTile, 1, tM); gelu_impl(inTile, outTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { auto gI = gIIter(0, Mb); auto gO = gOIter(0, Mb); - TCOPYIN(inTile_rmd, gI); + TLOAD(inTile_rmd, gI); gelu_impl(inTile_rmd, outTile_rmd); // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } -} \ No newline at end of file +} diff --git a/kernels/element_wise/gelu_origin.hpp b/kernels/element_wise/gelu_origin.hpp index 0db59e6..62f4fb4 100644 --- a/kernels/element_wise/gelu_origin.hpp +++ b/kernels/element_wise/gelu_origin.hpp @@ -1,5 +1,5 @@ #include -#include "../test/accelerator/include/accelerator_fusion.h" +#include #include "template_asm.h" #include @@ -91,7 +91,7 @@ void __vec__ gelu_simd( float result; // 数据格式转换 V.FCVT float x = static_cast(indata); - + // GELU(x)=x∗Φ(x), Φ(x)=负无穷~x积分 φ(exp(-0.5f*x*x) / sqrt(2π)) // 等价于GELU(x)=0.5⋅x⋅(1+erf(x/sqrt(2)) if (!approximate) { @@ -126,7 +126,7 @@ void gelu_impl( GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -144,7 +144,7 @@ void gelu( bool approximate = false // false:none, true:tanh ) { const int Mb = gM / tM; - + const int rmd_M = gM % tM; using gm_shape = global_tensor>; @@ -173,19 +173,19 @@ void gelu( // printf("iter i %d\n",i); auto gI = gIIter(0, i); auto gO = gOIter(0, i); - TCOPYIN(inTile, gI); + TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_intTile, 1, tM); gelu_impl(inTile, outTile, approximate); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { auto gI = gIIter(0, Mb); auto gO = gOIter(0, Mb); - TCOPYIN(inTile_rmd, gI); + TLOAD(inTile_rmd, gI); gelu_impl(inTile_rmd, outTile_rmd, approximate); // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } \ No newline at end of file diff --git a/kernels/fa_mx/fa_hif4.hpp b/kernels/fa_mx/fa_hif4.hpp index 280ef24..a655750 100644 --- a/kernels/fa_mx/fa_hif4.hpp +++ b/kernels/fa_mx/fa_hif4.hpp @@ -272,7 +272,7 @@ void __vec__ pkg_rowmax( __bf16x2 upd_max; __bf16 old_max_bf160, old_max_bf161; linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); + linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); linx_cvt_package(upd_max, old_max_bf160, old_max_bf161); // calc tile rowmax @@ -323,7 +323,7 @@ void __vec__ pkg_rowmax( // recalculate scale of softmax __bf16x2 scale, old_max_bf16x2; linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161); - blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); + blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); blkv_bf16x2_fexp(scale, old_max_bf16x2); // opt1 union { __bf16x2 vec; uint32_t u32; } scale_u; @@ -365,7 +365,7 @@ void __vec__ pkg_rowmax_4src( __bf16x2 upd_max; __bf16 old_max_bf160, old_max_bf161; linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); + linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); linx_cvt_package(upd_max, old_max_bf160, old_max_bf161); // calc tile rowmax @@ -379,7 +379,7 @@ void __vec__ pkg_rowmax_4src( uint32_t src_idx_21 = (2*i + 1) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride; uint32_t src_idx_30 = (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; uint32_t src_idx_31 = (2*i + 1) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; - + __bf16x2 s0_0, s0_1, s0_2, s0_3; linx_cvt_package(s0_0, src0_ptr[src_idx_00], src0_ptr[src_idx_01]); linx_cvt_package(s0_1, src0_ptr[src_idx_10], src0_ptr[src_idx_11]); @@ -435,7 +435,7 @@ void __vec__ pkg_rowmax_4src( // recalculate scale of softmax __bf16x2 scale, old_max_bf16x2; linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161); - blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); + blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); blkv_bf16x2_fexp(scale, old_max_bf16x2); union { __bf16x2 vec; uint32_t u32; } scale_u; @@ -480,7 +480,7 @@ void __vec__ pkg_rowmax_4srcx2( __bf16x2 upd_max; __bf16 old_max_bf160, old_max_bf161; linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); + linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); linx_cvt_package(upd_max, old_max_bf160, old_max_bf161); // calc tile rowmax @@ -491,7 +491,7 @@ void __vec__ pkg_rowmax_4srcx2( uint32_t src_idx_10 = (2*i) * tileSrc::RowStride + (j + 1) * tileSrc::ColStride; uint32_t src_idx_20 = (2*i) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride; uint32_t src_idx_30 = (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; - + __bf16x2 s0_0, s0_1; blkv_bf16x2_fmax(s0_0, src0_x2_ptr[src_idx_00], src0_x2_ptr[src_idx_10]); blkv_bf16x2_fmax(s0_1, src0_x2_ptr[src_idx_20], src0_x2_ptr[src_idx_30]); @@ -531,7 +531,7 @@ void __vec__ pkg_rowmax_4srcx2( // recalculate scale of softmax __bf16x2 scale, old_max_bf16x2; linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161); - blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); + blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); blkv_bf16x2_fexp(scale, old_max_bf16x2); union { __bf16x2 vec; uint32_t u32; } scale_u; @@ -564,12 +564,12 @@ void __vec__ rowsum_2src_with_local_sum( linx_cvt_package(src_scale, 1.0f / sqrt((float)qD), 1.0f / sqrt((float)qD)); __bf16x2 upd_sum, new_max_val; __bf16 new_max_bf16_0, new_max_bf16_1; - + // Initialize local sum to 0 linx_cvt_package(upd_sum, 0.0f, 0.0f); linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); + linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1); #pragma clang loop unroll(full) @@ -582,7 +582,7 @@ void __vec__ rowsum_2src_with_local_sum( uint32_t src_idx_21 = (2*i + 1) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride; uint32_t src_idx_30 = (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; uint32_t src_idx_31 = (2*i + 1) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; - + // Process src0 __bf16x2 s0_0, s0_1, s0_2, s0_3; __bf16x2 sum01_0, sum23_0, sum0123_0; @@ -590,7 +590,7 @@ void __vec__ rowsum_2src_with_local_sum( linx_cvt_package(s0_1, src0_ptr[src_idx_10], src0_ptr[src_idx_11]); linx_cvt_package(s0_2, src0_ptr[src_idx_20], src0_ptr[src_idx_21]); linx_cvt_package(s0_3, src0_ptr[src_idx_30], src0_ptr[src_idx_31]); - + blkv_bf16x2_fmsub(s0_0, s0_0, src_scale, new_max_val); blkv_bf16x2_fmsub(s0_1, s0_1, src_scale, new_max_val); blkv_bf16x2_fmsub(s0_2, s0_2, src_scale, new_max_val); @@ -603,7 +603,7 @@ void __vec__ rowsum_2src_with_local_sum( blkv_bf16x2_fadd(sum23_0, s0_2, s0_3); blkv_bf16x2_fadd(sum0123_0, sum01_0, sum23_0); blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_0); - + BLKC_ASSIGN_CAST(src_exp0, src_idx_00, s0_0); BLKC_ASSIGN_CAST(src_exp0, src_idx_10, s0_1); BLKC_ASSIGN_CAST(src_exp0, src_idx_20, s0_2); @@ -616,7 +616,7 @@ void __vec__ rowsum_2src_with_local_sum( linx_cvt_package(s1_1, src1_ptr[src_idx_10], src1_ptr[src_idx_11]); linx_cvt_package(s1_2, src1_ptr[src_idx_20], src1_ptr[src_idx_21]); linx_cvt_package(s1_3, src1_ptr[src_idx_30], src1_ptr[src_idx_31]); - + blkv_bf16x2_fmsub(s1_0, s1_0, src_scale, new_max_val); blkv_bf16x2_fmsub(s1_1, s1_1, src_scale, new_max_val); blkv_bf16x2_fmsub(s1_2, s1_2, src_scale, new_max_val); @@ -629,7 +629,7 @@ void __vec__ rowsum_2src_with_local_sum( blkv_bf16x2_fadd(sum23_1, s1_2, s1_3); blkv_bf16x2_fadd(sum0123_1, sum01_1, sum23_1); blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_1); - + BLKC_ASSIGN_CAST(src_exp1, src_idx_00, s1_0); BLKC_ASSIGN_CAST(src_exp1, src_idx_10, s1_1); BLKC_ASSIGN_CAST(src_exp1, src_idx_20, s1_2); @@ -640,7 +640,7 @@ void __vec__ rowsum_2src_with_local_sum( sum_u.vec = upd_sum; __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff; __bf16 sum1 = (sum_u.u32 & 0xffff); - + local_sum_ptr[(i*2)*tileSum::RowStride] = sum0; local_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1; } @@ -670,12 +670,12 @@ void __vec__ rowsum_2src_with_local_sumx2( linx_cvt_package(src_scale, 1.0f / sqrt((float)qD), 1.0f / sqrt((float)qD)); __bf16x2 upd_sum, new_max_val; __bf16 new_max_bf16_0, new_max_bf16_1; - + // Initialize local sum to 0 linx_cvt_package(upd_sum, 0.0f, 0.0f); linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); + linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1); // row sum @@ -686,17 +686,17 @@ void __vec__ rowsum_2src_with_local_sumx2( uint32_t src_idx_10 = (2*i) * tileSrc::RowStride + (j + 1) * tileSrc::ColStride; uint32_t src_idx_20 = (2*i) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride; uint32_t src_idx_30 = (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; - + // Process src0 __bf16x2 s0_0, s0_1, s0_2, s0_3; __bf16x2 sum01_0, sum23_0, sum0123_0; - + // 直接将内存读取作为 fmsub 的输入操作数 blkv_bf16x2_fmsub(s0_0, src0_x2_ptr[src_idx_00], src_scale, new_max_val); blkv_bf16x2_fmsub(s0_1, src0_x2_ptr[src_idx_10], src_scale, new_max_val); blkv_bf16x2_fmsub(s0_2, src0_x2_ptr[src_idx_20], src_scale, new_max_val); blkv_bf16x2_fmsub(s0_3, src0_x2_ptr[src_idx_30], src_scale, new_max_val); - + blkv_bf16x2_fexp(s0_0, s0_0); blkv_bf16x2_fexp(s0_1, s0_1); blkv_bf16x2_fexp(s0_2, s0_2); @@ -713,12 +713,12 @@ void __vec__ rowsum_2src_with_local_sumx2( // Process src1 __bf16x2 s1_0, s1_1, s1_2, s1_3; __bf16x2 sum01_1, sum23_1, sum0123_1; - + blkv_bf16x2_fmsub(s1_0, src1_x2_ptr[src_idx_00], src_scale, new_max_val); blkv_bf16x2_fmsub(s1_1, src1_x2_ptr[src_idx_10], src_scale, new_max_val); blkv_bf16x2_fmsub(s1_2, src1_x2_ptr[src_idx_20], src_scale, new_max_val); blkv_bf16x2_fmsub(s1_3, src1_x2_ptr[src_idx_30], src_scale, new_max_val); - + blkv_bf16x2_fexp(s1_0, s1_0); blkv_bf16x2_fexp(s1_1, s1_1); blkv_bf16x2_fexp(s1_2, s1_2); @@ -727,7 +727,7 @@ void __vec__ rowsum_2src_with_local_sumx2( blkv_bf16x2_fadd(sum23_1, s1_2, s1_3); blkv_bf16x2_fadd(sum0123_1, sum01_1, sum23_1); blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_1); - + blkc_assign_elem(src_exp1_x2_ptr, src_idx_00, s1_0); blkc_assign_elem(src_exp1_x2_ptr, src_idx_10, s1_1); blkc_assign_elem(src_exp1_x2_ptr, src_idx_20, s1_2); @@ -738,7 +738,7 @@ void __vec__ rowsum_2src_with_local_sumx2( sum_u.vec = upd_sum; __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff; __bf16 sum1 = (sum_u.u32 & 0xffff); - + local_sum_ptr[(i*2)*tileSum::RowStride] = sum0; local_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1; } @@ -768,12 +768,12 @@ void __vec__ rowsum_2src_with_local_expx2( linx_cvt_package(src_scale, 1.0f / sqrt((float)qD), 1.0f / sqrt((float)qD)); __bf16x2 upd_sum, new_max_val; __bf16 new_max_bf16_0, new_max_bf16_1; - + // Initialize local sum to 0 linx_cvt_package(upd_sum, 0.0f, 0.0f); linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); + linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1); // row sum @@ -784,17 +784,17 @@ void __vec__ rowsum_2src_with_local_expx2( uint32_t src_idx_10 = (2*i) * tileSrc::RowStride + (j + 1) * tileSrc::ColStride; uint32_t src_idx_20 = (2*i) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride; uint32_t src_idx_30 = (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride; - + // Process src0 __bf16x2 s0_0, s0_1, s0_2, s0_3; __bf16x2 sum01_0, sum23_0, sum0123_0; - + // 直接将内存读取作为 fmsub 的输入操作数 blkv_bf16x2_fmsub(s0_0, src0_x2_ptr[src_idx_00], src_scale, new_max_val); blkv_bf16x2_fmsub(s0_1, src0_x2_ptr[src_idx_10], src_scale, new_max_val); blkv_bf16x2_fmsub(s0_2, src0_x2_ptr[src_idx_20], src_scale, new_max_val); blkv_bf16x2_fmsub(s0_3, src0_x2_ptr[src_idx_30], src_scale, new_max_val); - + blkv_bf16x2_fexp(s0_0, s0_0); blkv_bf16x2_fexp(s0_1, s0_1); blkv_bf16x2_fexp(s0_2, s0_2); @@ -811,12 +811,12 @@ void __vec__ rowsum_2src_with_local_expx2( // Process src1 __bf16x2 s1_0, s1_1, s1_2, s1_3; __bf16x2 sum01_1, sum23_1, sum0123_1; - + blkv_bf16x2_fmsub(s1_0, src1_x2_ptr[src_idx_00], src_scale, new_max_val); blkv_bf16x2_fmsub(s1_1, src1_x2_ptr[src_idx_10], src_scale, new_max_val); blkv_bf16x2_fmsub(s1_2, src1_x2_ptr[src_idx_20], src_scale, new_max_val); blkv_bf16x2_fmsub(s1_3, src1_x2_ptr[src_idx_30], src_scale, new_max_val); - + blkv_bf16x2_fexp(s1_0, s1_0); blkv_bf16x2_fexp(s1_1, s1_1); blkv_bf16x2_fexp(s1_2, s1_2); @@ -825,7 +825,7 @@ void __vec__ rowsum_2src_with_local_expx2( // blkv_bf16x2_fadd(sum23_1, s1_2, s1_3); // blkv_bf16x2_fadd(sum0123_1, sum01_1, sum23_1); // blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_1); - + blkc_assign_elem(src_exp1_x2_ptr, src_idx_00, s1_0); blkc_assign_elem(src_exp1_x2_ptr, src_idx_10, s1_1); blkc_assign_elem(src_exp1_x2_ptr, src_idx_20, s1_2); @@ -836,7 +836,7 @@ void __vec__ rowsum_2src_with_local_expx2( // sum_u.vec = upd_sum; // __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff; // __bf16 sum1 = (sum_u.u32 & 0xffff); - + // local_sum_ptr[(i*2)*tileSum::RowStride] = sum0; // local_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1; } @@ -890,14 +890,14 @@ void __vec__ rowsum( __bf16 new_max_bf16_0, new_max_bf16_1; // old_sum * rescale + new_sum linx_cvt(old_sum_bf16_0, old_sum_ptr[i*2*tileSum::RowStride]); //float->bf16 - linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]); + linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]); linx_cvt_package(upd_sum, old_sum_bf16_0, old_sum_bf16_1); linx_cvt(scale_bf16_0, scale_ptr[i*2*tileSum::RowStride]); //float->bf16 - linx_cvt(scale_bf16_1, scale_ptr[(i*2+1)*tileSum::RowStride]); + linx_cvt(scale_bf16_1, scale_ptr[(i*2+1)*tileSum::RowStride]); linx_cvt_package(scale, scale_bf16_0, scale_bf16_1); - blkv_bf16x2_fmul(upd_sum, upd_sum, scale); + blkv_bf16x2_fmul(upd_sum, upd_sum, scale); linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); + linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1); // calculate row sum of softmax, l_i @@ -952,7 +952,7 @@ void __vec__ rowsum( sum_u.vec = upd_sum; __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff; __bf16 sum1 = (sum_u.u32 & 0xffff); - + new_sum_ptr[(i*2)*tileSum::RowStride] = sum0; new_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1; } @@ -987,7 +987,7 @@ void __vec__ flashsoftmax_dn_mout_cast_kernel_bf16x2( __bf16x2 upd_max; __bf16 old_max_bf160, old_max_bf161; linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16 - linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); + linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); linx_cvt_package(upd_max, old_max_bf160, old_max_bf161); // calc tile rowmax @@ -1044,7 +1044,7 @@ void __vec__ flashsoftmax_dn_mout_cast_kernel_bf16x2( // recalculate scale of softmax __bf16x2 scale, old_max_bf16x2; linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161); - blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); + blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); blkv_bf16x2_fexp(scale, old_max_bf16x2); uint32_t scale_idx00 = i*2*tileScale::RowStride; uint32_t scale_idx01 = (i*2+1)*tileScale::RowStride; @@ -1064,7 +1064,7 @@ void __vec__ flashsoftmax_dn_mout_cast_kernel_bf16x2( __bf16x2 upd_sum; __bf16 old_sum_bf16_0, old_sum_bf16_1; linx_cvt(old_sum_bf16_0, old_sum_ptr[i*2*tileSum::RowStride]); //float->bf16 - linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]); + linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]); linx_cvt_package(upd_sum, old_sum_bf16_0, old_sum_bf16_1); blkv_bf16x2_fmul(upd_sum, upd_sum, scale); // *** TODO @@ -1208,9 +1208,9 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] using gmO = global_tensor>; // O: [SxvD] - using gm_QMX = global_tensor>; - using gm_KMX = global_tensor>; - using gm_VMX = global_tensor>; + using gm_QMX = global_tensor>; + using gm_KMX = global_tensor>; + using gm_VMX = global_tensor>; // tile 寄存器形状 using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] @@ -1278,7 +1278,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, tileQ tQ[Xdim]; tile_QMX tQMX[Xdim]; // load tile Q, TODO: add ND2ZZ transform for QMX - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); + TLOAD(tQ[x], gQ); + gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); MGATHER(tQMX[x], gQMX, nd2zz_offset); - // TCOPYIN(tQMX[x], gQMX); + // TLOAD(tQMX[x], gQMX); } #endif @@ -1330,10 +1330,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); - MGATHER(tKMX[y], gKMX, nd2nn_offset); - // TCOPYIN(tKMX[y], gKM); + MGATHER(tKMX[y], gKMX, nd2nn_offset); + // TLOAD(tKMX[y], gKM); } #endif @@ -1372,7 +1372,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -1383,7 +1383,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4_bf16x2<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); @@ -1422,8 +1422,8 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // #pragma clang loop unroll(full) // for(int x=0;x<<>>( - // tScale[x].data(), - // tNewMax[x].data(), + // tScale[x].data(), + // tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); @@ -1432,7 +1432,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // // tNewMax[x].data(), // // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -1450,7 +1450,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -1470,7 +1470,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -1512,10 +1512,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); - MGATHER(tVMX[y], gVMX, nd2nn_offset); - // TCOPYIN(tVMX[y], gVMX); + MGATHER(tVMX[y], gVMX, nd2nn_offset); + // TLOAD(tVMX[y], gVMX); } #endif @@ -1586,7 +1586,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]);//TMOV + TSTORE(dstO, tO_cast[x]);//TMOV } #endif @@ -1601,9 +1601,9 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // using gmK = global_tensor>; // K: [qD×S] // using gmV = global_tensor>; // V: [S×vD] // using gmO = global_tensor>; // O: [SxvD] -// using gm_QMX = global_tensor>; -// using gm_KMX = global_tensor>; -// using gm_VMX = global_tensor>; +// using gm_QMX = global_tensor>; +// using gm_KMX = global_tensor>; +// using gm_VMX = global_tensor>; // // tile 寄存器形状 // using tileQ = TileLeft; // [kTm×qD] // using tileK = TileRight; // [vD×kTk] @@ -1671,7 +1671,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // tileQ tQ[Xdim]; // tile_QMX tQMX[Xdim]; // // load tile Q, TODO: add ND2ZZ transform for QMX -// #ifdef MULTI_LDST // don't use, no need for multi tload/tstore +// #ifdef MULTI_LDST // don't use, no need for multi tload/tstore // #pragma clang loop unroll(full) // for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); +// TLOAD(tQ[x], gQ); +// gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); // MGATHER(tQMX[x], gQMX, nd2zz_offset); -// // TCOPYIN(tQMX[x], gQMX); +// // TLOAD(tQMX[x], gQMX); // } // #endif @@ -1723,10 +1723,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); -// MGATHER(tKMX[y], gKMX, nd2nn_offset); -// // TCOPYIN(tKMX[y], gKM); +// MGATHER(tKMX[y], gKMX, nd2nn_offset); +// // TLOAD(tKMX[y], gKM); // } // #endif @@ -1765,7 +1765,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // tNewMax[x].data(), // tNewSum[x].data(), // tExpW[x][0].data(), -// tW[x][0].data(), // +// tW[x][0].data(), // // tMax[x].data(), // tSum[x].data()); // tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -1776,7 +1776,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // // tNewMax[x].data(), // // tNewSum[x].data(), // // tExpW[x][0].data(), -// // tW[x][0].data(), // +// // tW[x][0].data(), // // // tMax[x].data(), // // tSum[x].data()); @@ -1821,8 +1821,8 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // #pragma clang loop unroll(full) // for(int x=0;x<<>>( -// tScale[x].data(), -// tNewMax[x].data(), +// tScale[x].data(), +// tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); @@ -1831,7 +1831,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // // tNewMax[x].data(), // // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -1849,7 +1849,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // tileSum tLocalSum[Xdim][4]; // #pragma clang loop unroll(full) -// for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -1869,7 +1869,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // tileSum tLocalSum[Xdim][4]; // #pragma clang loop unroll(full) -// for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); // } @@ -1911,10 +1911,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); -// MGATHER(tVMX[y], gVMX, nd2nn_offset); -// // TCOPYIN(tVMX[y], gVMX); +// MGATHER(tVMX[y], gVMX, nd2nn_offset); +// // TLOAD(tVMX[y], gVMX); // } // #endif @@ -1985,7 +1985,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, // #pragma clang loop unroll(full) // for (int x = 0; x < Xdim; ++x) { // auto dstO = gIterO(i+x, 0); -// TCOPYOUT(dstO, tO_cast[x]);//TMOV +// TSTORE(dstO, tO_cast[x]);//TMOV // } // #endif @@ -2001,9 +2001,9 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] using gmO = global_tensor>; // O: [SxvD] - using gm_QMX = global_tensor>; - using gm_KMX = global_tensor>; - using gm_VMX = global_tensor>; + using gm_QMX = global_tensor>; + using gm_KMX = global_tensor>; + using gm_VMX = global_tensor>; // tile 寄存器形状 using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] @@ -2071,7 +2071,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype tileQ tQ[Xdim]; tile_QMX tQMX[Xdim]; // load tile Q, TODO: add ND2ZZ transform for QMX - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); + TLOAD(tQ[x], gQ); + // gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); // MGATHER(tQMX[x], gQMX, nd2zz_offset); - TCOPYIN(tQMX[x], gQMX); + TLOAD(tQMX[x], gQMX); } #endif @@ -2123,10 +2123,10 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); - // MGATHER(tKMX[y], gKMX, nd2nn_offset); - TCOPYIN(tKMX[y], gKM); + // MGATHER(tKMX[y], gKMX, nd2nn_offset); + TLOAD(tKMX[y], gKM); } #endif @@ -2164,7 +2164,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -2176,7 +2176,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4_bf16x2<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); @@ -2215,8 +2215,8 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype // #pragma clang loop unroll(full) // for(int x=0;x<<>>( - // tScale[x].data(), - // tNewMax[x].data(), + // tScale[x].data(), + // tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); @@ -2225,7 +2225,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype // // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // // tNewMax[x].data(), // // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -2243,7 +2243,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -2263,7 +2263,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -2305,10 +2305,10 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); - // MGATHER(tVMX[y], gVMX, nd2nn_offset); + // MGATHER(tVMX[y], gVMX, nd2nn_offset); } #endif @@ -2380,7 +2380,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]);//TMOV + TSTORE(dstO, tO_cast[x]);//TMOV } #endif @@ -2395,9 +2395,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] using gmO = global_tensor>; // O: [SxvD] - using gm_QMX = global_tensor>; - using gm_KMX = global_tensor>; - using gm_VMX = global_tensor>; + using gm_QMX = global_tensor>; + using gm_KMX = global_tensor>; + using gm_VMX = global_tensor>; // tile 寄存器形状 using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] @@ -2466,7 +2466,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty tileQ tQ[Xdim]; tile_QMX tQMX[Xdim]; // load tile Q, TODO: add ND2ZZ transform for QMX - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); + TLOAD(tQ[x], gQ); + // gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); // MGATHER(tQMX[x], gQMX, nd2zz_offset); - TCOPYIN(tQMX[x], gQMX); + TLOAD(tQMX[x], gQMX); } #endif @@ -2518,10 +2518,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); - // MGATHER(tKMX[y], gKMX, nd2nn_offset); - TCOPYIN(tKMX[y], gKM); + // MGATHER(tKMX[y], gKMX, nd2nn_offset); + TLOAD(tKMX[y], gKM); } #endif @@ -2560,7 +2560,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -2572,7 +2572,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty // tNewMax[x].data(), // tNewSum[x].data(), // tExpW[x][0].data(), - // tW[x][0].data(), // + // tW[x][0].data(), // // tMax[x].data(), // tSum[x].data()); // bf16tobf16x2(); @@ -2606,13 +2606,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty // #pragma clang loop unroll(full) // for(int x=0;x<<>>( - // tScale[x].data(), - // tNewMax[x].data(), + // tScale[x].data(), + // tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -2624,8 +2624,8 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data()); @@ -2635,7 +2635,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty rowsum_2src_with_local_sum<<>>( tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), tW[x][2].data(), tW[x][3].data(), tNewMax[x].data()); - + new_sum_of_2_loc_sum_bf16x2<<>>( tNewSum[x].data(), tLocalSum[x][0].data(), tLocalSum[x][1].data(), tSum[x].data(), tScale[x].data()); @@ -2649,7 +2649,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -2669,7 +2669,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -2711,10 +2711,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); - // MGATHER(tVMX[y], gVMX, nd2nn_offset); + // MGATHER(tVMX[y], gVMX, nd2nn_offset); } #endif @@ -2786,7 +2786,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]);//TMOV + TSTORE(dstO, tO_cast[x]);//TMOV } #endif @@ -2801,9 +2801,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] using gmO = global_tensor>; // O: [SxvD] - using gm_QMX = global_tensor>; - using gm_KMX = global_tensor>; - using gm_VMX = global_tensor>; + using gm_QMX = global_tensor>; + using gm_KMX = global_tensor>; + using gm_VMX = global_tensor>; // tile 寄存器形状 using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] @@ -2872,7 +2872,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p tileQ tQ[Xdim]; tile_QMX tQMX[Xdim]; // load tile Q, TODO: add ND2ZZ transform for QMX - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); + TLOAD(tQ[x], gQ); + // gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); // MGATHER(tQMX[x], gQMX, nd2zz_offset); - TCOPYIN(tQMX[x], gQMX); + TLOAD(tQMX[x], gQMX); } #endif @@ -2924,10 +2924,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); - // MGATHER(tKMX[y], gKMX, nd2nn_offset); - TCOPYIN(tKMX[y], gKM); + // MGATHER(tKMX[y], gKMX, nd2nn_offset); + TLOAD(tKMX[y], gKM); } #endif @@ -2966,7 +2966,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -2978,7 +2978,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p // tNewMax[x].data(), // tNewSum[x].data(), // tExpW[x][0].data(), - // tW[x][0].data(), // + // tW[x][0].data(), // // tMax[x].data(), // tSum[x].data()); // bf16tobf16x2(); @@ -3012,13 +3012,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p // #pragma clang loop unroll(full) // for(int x=0;x<<>>( - // tScale[x].data(), - // tNewMax[x].data(), + // tScale[x].data(), + // tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -3030,8 +3030,8 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data()); @@ -3041,7 +3041,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p rowsum_2src_with_local_sumx2<<>>( tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), tW[x][2].data(), tW[x][3].data(), tNewMax[x].data()); - + new_sum_of_2_loc_sum_bf16x2<<>>( tNewSum[x].data(), tLocalSum[x][0].data(), tLocalSum[x][1].data(), tSum[x].data(), tScale[x].data()); @@ -3055,7 +3055,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -3075,7 +3075,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -3117,10 +3117,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); - // MGATHER(tVMX[y], gVMX, nd2nn_offset); + // MGATHER(tVMX[y], gVMX, nd2nn_offset); } #endif @@ -3192,7 +3192,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]);//TMOV + TSTORE(dstO, tO_cast[x]);//TMOV } #endif @@ -3225,9 +3225,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] using gmO = global_tensor>; // O: [SxvD] - using gm_QMX = global_tensor>; - using gm_KMX = global_tensor>; - using gm_VMX = global_tensor>; + using gm_QMX = global_tensor>; + using gm_KMX = global_tensor>; + using gm_VMX = global_tensor>; // tile 寄存器形状 using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] @@ -3303,7 +3303,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype tileQ tQ[Xdim]; tile_QMX tQMX[Xdim]; // load tile Q, TODO: add ND2ZZ transform for QMX - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); + TLOAD(tQ[x], gQ); + // gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); // MGATHER(tQMX[x], gQMX, nd2zz_offset); - TCOPYIN(tQMX[x], gQMX); + TLOAD(tQMX[x], gQMX); } #endif @@ -3355,10 +3355,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); - // MGATHER(tKMX[y], gKMX, nd2nn_offset); - TCOPYIN(tKMX[y], gKM); + // MGATHER(tKMX[y], gKMX, nd2nn_offset); + TLOAD(tKMX[y], gKM); } #endif @@ -3400,7 +3400,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -3412,7 +3412,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype // tNewMax[x].data(), // tNewSum[x].data(), // tExpW[x][0].data(), - // tW[x][0].data(), // + // tW[x][0].data(), // // tMax[x].data(), // tSum[x].data()); // bf16tobf16x2(); @@ -3446,13 +3446,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype // #pragma clang loop unroll(full) // for(int x=0;x<<>>( - // tScale[x].data(), - // tNewMax[x].data(), + // tScale[x].data(), + // tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -3465,12 +3465,12 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data()); - + rowsum_2src_with_local_expx2<<>>( tExpW[x][0].data(), tExpW[x][1].data(), tW[x][0].data(), tW[x][1].data(), tNewMax[x].data()); @@ -3503,7 +3503,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -3523,7 +3523,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -3565,10 +3565,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); - // MGATHER(tVMX[y], gVMX, nd2nn_offset); + // MGATHER(tVMX[y], gVMX, nd2nn_offset); } #endif @@ -3640,7 +3640,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]);//TMOV + TSTORE(dstO, tO_cast[x]);//TMOV } #endif } @@ -3655,9 +3655,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] using gmO = global_tensor>; // O: [SxvD] - using gm_QMX = global_tensor>; - using gm_KMX = global_tensor>; - using gm_VMX = global_tensor>; + using gm_QMX = global_tensor>; + using gm_KMX = global_tensor>; + using gm_VMX = global_tensor>; // tile 寄存器形状 using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] @@ -3740,7 +3740,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp tileQ tQ[Xdim]; tile_QMX tQMX[Xdim]; // load tile Q, TODO: add ND2ZZ transform for QMX - #ifdef MULTI_LDST // don't use, no need for multi tload/tstore + #ifdef MULTI_LDST // don't use, no need for multi tload/tstore #pragma clang loop unroll(full) for(int x=0;x(gQMX, tQMX[x], nd2zz_offset, i+x, 0); + TLOAD(tQ[x], gQ); + // gen_ND2ZZ_offset_Impl(gQMX, tQMX[x], nd2zz_offset, i+x, 0); // MGATHER(tQMX[x], gQMX, nd2zz_offset); - TCOPYIN(tQMX[x], gQMX); + TLOAD(tQMX[x], gQMX); } #endif @@ -3792,10 +3792,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp for(int y=0;y(gKMX, tKMX[y], nd2nn_offset, 0, j+y); - // MGATHER(tKMX[y], gKMX, nd2nn_offset); - TCOPYIN(tKMX[y], gKM); + // MGATHER(tKMX[y], gKMX, nd2nn_offset); + TLOAD(tKMX[y], gKM); } #endif @@ -3837,7 +3837,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp tNewMax[x].data(), tNewSum[x].data(), tExpW[x][0].data(), - tW[x][0].data(), // + tW[x][0].data(), // tMax[x].data(), tSum[x].data()); tohif4<<>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16) @@ -3849,7 +3849,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp // tNewMax[x].data(), // tNewSum[x].data(), // tExpW[x][0].data(), - // tW[x][0].data(), // + // tW[x][0].data(), // // tMax[x].data(), // tSum[x].data()); // bf16tobf16x2(); @@ -3883,13 +3883,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp // #pragma clang loop unroll(full) // for(int x=0;x<<>>( - // tScale[x].data(), - // tNewMax[x].data(), + // tScale[x].data(), + // tNewMax[x].data(), // tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), // tMax[x].data(), // scale); - + // src_exp_2src_with_local_sum<<>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(), // tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale); // src_exp_2src_with_local_sum<<>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(), @@ -3902,12 +3902,12 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp #pragma clang loop unroll(full) for(int x=0;x<<>>( - tScale[x].data(), - tNewMax[x].data(), + tScale[x].data(), + tNewMax[x].data(), tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(), tMax[x].data()); - + rowsum_2src_with_local_expx2<<>>( tExpW[x][0].data(), tExpW[x][1].data(), tW[x][0].data(), tW[x][1].data(), tNewMax[x].data()); @@ -3940,7 +3940,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); @@ -3960,7 +3960,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp tileSum tLocalSum[Xdim][4]; #pragma clang loop unroll(full) - for(int x=0;x<<>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale); } @@ -4002,10 +4002,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp for(int y=0;y(gVMX, tVMX[y], nd2nn_offset, j+y, 0); - // MGATHER(tVMX[y], gVMX, nd2nn_offset); + // MGATHER(tVMX[y], gVMX, nd2nn_offset); } #endif @@ -4085,7 +4085,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp #pragma clang loop unroll(full) for (int x = 0; x < Xdim; ++x) { auto dstO = gIterO(i+x, 0); - TCOPYOUT(dstO, tO_cast[x]);//TMOV + TSTORE(dstO, tO_cast[x]);//TMOV } #endif } diff --git a/kernels/matmul_mx/matmul_mx.hpp b/kernels/matmul_mx/matmul_mx.hpp index 52e6ee6..ddf435d 100644 --- a/kernels/matmul_mx/matmul_mx.hpp +++ b/kernels/matmul_mx/matmul_mx.hpp @@ -4,7 +4,9 @@ #include #include "template_asm.h" #include +#ifndef __linx #include +#endif #include "utils/layout_transform.hpp" @@ -13,7 +15,7 @@ // GlobalTensor, \ // Stride<1,1,1,Cols,1>> _g(DumpBuf); \ -// TCOPYOUT(_g, TileVar); \ +// TSTORE(_g, TileVar); \ // printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ // for (int ri = 0; ri < Rows; ri++) { \ // printf(" row%2d: ", ri); \ @@ -32,20 +34,20 @@ using namespace pto; // TODO, move to utils.cpp template -void TCOPYOUT_ACC(GmOut &Gout, TileAcc &tAcc){ +void TSTORE_ACC(GmOut &Gout, TileAcc &tAcc){ using TileAccOut = Tile; TileAccOut tAccOut; TCVT(tAccOut, tAcc); - TCOPYOUT(Gout, tAccOut); + TSTORE(Gout, tAccOut); } // typeb_wfactor 表明typeA和typeB的位宽比例,比如fp8是fp4x2的两倍, // smatrix_wfactor : scaling matrix 与计算matrix位宽比 -template void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) { // only support regular shape now for this operator! - // static_assert(gM % tM == 0); + // static_assert(gM % tM == 0); // static_assert(gN % tN == 0); // static_assert(gK % tK == 0); static const uint32_t valid_row = (tM > gM) ? gM : tM; @@ -54,7 +56,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; - using tile_shapeA = TileLeft; + using tile_shapeA = TileLeft; using tile_shapeB = TileRight; using tile_shapeACC = TileAcc; using itA = global_iterator; @@ -65,7 +67,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 itB gBIter(src1); itC gCIter(dst); - using gm_shapeAMX = global_tensor>; + using gm_shapeAMX = global_tensor>; gm_shapeAMX gAMX(src0_mx); using tile_shapeAMX = Tile; // 实际tile尺寸, 需初始化为0 using itAMX = global_iterator; @@ -119,8 +121,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 auto gB = gBIter(k,j); tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); // if (src0_mx != nullptr && src1_mx != nullptr) { tile_shapeAMX tAMX; @@ -141,8 +143,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 auto gB = gBIter(Kb,j); tile_shapeA_trows tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX_trows tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, i, Kb); @@ -158,7 +160,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -168,8 +170,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 auto gB = gBIter(k, Nb); tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, i, k); @@ -191,8 +193,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 tile_shapeA_trows tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX_trows tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, i, Kb); @@ -206,7 +208,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } if constexpr (rmd_M) { @@ -221,8 +223,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX_tcols tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, Mb, k); @@ -244,8 +246,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 tile_shapeA_tcorner tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX_tcorner tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, Mb, Kb); @@ -260,7 +262,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -273,8 +275,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX_tcols tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, Mb, k); @@ -295,8 +297,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 tile_shapeA_tcorner tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); tile_shapeAMX_tcorner tAMX; gen_ND2ZZ_offset_Impl(gAMX, tAMX, nd2zz_offset, Mb, Kb); @@ -310,12 +312,12 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8 MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } -template void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) { static_assert(typeb_wfactor == 1 ); @@ -325,7 +327,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; - using tile_shapeA = TileLeft; + using tile_shapeA = TileLeft; using tile_shapeB = TileRight; using tile_shapeACC = TileAcc; using itA = global_iterator; @@ -336,7 +338,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx itB gBIter(src1); itC gCIter(dst); - using gm_shapeAMX = global_tensor>; + using gm_shapeAMX = global_tensor>; using tile_shapeAMX = Tile; // 实际tile尺寸, 需初始化为0 using itAMX = global_iterator; itAMX gAMXIter(src0_mx); @@ -390,10 +392,10 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx tile_shapeB tB; tile_shapeAMX tAMX; tile_shapeBMX tBMX; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA, tAMX, tB, tBMX); @@ -406,14 +408,14 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx auto gB = gBIter(Kb,j); tile_shapeA_trows tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(i, Kb); auto gBMX = gBMXIter(Kb, j); tile_shapeAMX_trows tAMX; tile_shapeBMX_tcols tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA, tAMX, tB, tBMX); @@ -421,7 +423,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -431,15 +433,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx auto gB = gBIter(k, Nb); tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(i, k); auto gBMX = gBMXIter(k, Nb); tile_shapeAMX tAMX; tile_shapeBMX_trows tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA, tAMX, tB, tBMX); @@ -454,15 +456,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx tile_shapeA_trows tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(i, Kb); auto gBMX = gBMXIter(Kb, Nb); tile_shapeAMX_trows tAMX; tile_shapeBMX_tcorner tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA, tAMX, tB, tBMX); @@ -470,7 +472,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } if constexpr (rmd_M) { @@ -485,15 +487,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(Mb, k); auto gBMX = gBMXIter(k, j); tile_shapeAMX_tcols tAMX; tile_shapeBMX tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA, tAMX, tB, tBMX); @@ -508,15 +510,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx tile_shapeA_tcorner tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(Mb, Kb); auto gBMX = gBMXIter(Kb, j); tile_shapeAMX_tcorner tAMX; tile_shapeBMX_tcols tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA, tAMX, tB, tBMX); @@ -524,7 +526,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -537,15 +539,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(Mb, k); auto gBMX = gBMXIter(k, Nb); tile_shapeAMX_tcols tAMX; tile_shapeBMX_trows tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA, tAMX, tB, tBMX); @@ -559,22 +561,22 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx tile_shapeA_tcorner tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); auto gAMX = gAMXIter(Mb, Kb); auto gBMX = gBMXIter(Kb, Nb); tile_shapeAMX_tcorner tAMX; tile_shapeBMX_tcorner tBMX; - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA, tAMX, tB, tBMX); } else { MATMULMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -609,22 +611,22 @@ constexpr ResA find_reuseA(int Mb, int Kb, int MAX_TILE_NUM) { return {best_m, best_k, best_val}; } -template void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) { static_assert(typeb_wfactor == 1 ); static const uint32_t valid_row = (tM > gM) ? gM : tM; static const uint32_t valid_col = (tN > gN) ? gN : tN; static const uint32_t MAX_TILE_NUM = 24; // TODO, check this value - + using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; - using tile_shapeA = TileLeft; + using tile_shapeA = TileLeft; using tile_shapeB = TileRight; using tile_shapeACC = TileAcc; - + using itA = global_iterator; using itB = global_iterator; using itC = global_iterator; @@ -633,8 +635,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * itB gBIter(src1); itC gCIter(dst); - using gm_shapeAMX = global_tensor>; - using tile_shapeAMX = Tile; + using gm_shapeAMX = global_tensor>; + using tile_shapeAMX = Tile; using itAMX = global_iterator; itAMX gAMXIter(src0_mx); @@ -691,8 +693,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * // for(int k=0; k0){ MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } else { @@ -763,7 +765,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * } auto gC = gCIter(i*R.m+ii, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [m, rmd_N, k] @@ -776,8 +778,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * auto gBMX = gBMXIter(k, Nb); tile_shapeB_trows tB; tile_shapeBMX_trows tBMX; - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA[ii][k], tAMX[ii][k], tB, tBMX); }else{ @@ -795,10 +797,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * auto gAMX = gAMXIter(i*R.m+ii, k); auto gB = gBIter(k, Nb); auto gBMX = gBMXIter(k, Nb); - TCOPYIN(tA_tmp, gA); - TCOPYIN(tAMX_tmp, gAMX); - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tA_tmp, gA); + TLOAD(tAMX_tmp, gAMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } } @@ -814,11 +816,11 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * tile_shapeAMX_trows tAMX_tmp; tile_shapeB_tcorner tB; tile_shapeBMX_tcorner tBMX; - - TCOPYIN(tA_tmp, gA); - TCOPYIN(tAMX_tmp, gAMX); - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + + TLOAD(tA_tmp, gA); + TLOAD(tAMX_tmp, gAMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } else { @@ -827,7 +829,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * } auto gC = gCIter(i*R.m+ii, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -835,7 +837,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * if constexpr(rM>0){ tile_shapeA tA[rM][R.k]; tile_shapeAMX tAMX[rM][R.k]; - + #pragma clang loop unroll(full) for(int i=0; i0){ MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } else { @@ -914,7 +916,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * } } auto gC = gCIter(i+dM*R.m, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [rM, rmd_N, k] @@ -927,8 +929,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * auto gBMX = gBMXIter(k, Nb); tile_shapeB_trows tB; tile_shapeBMX_trows tBMX; - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA[i][k], tAMX[i][k], tB, tBMX); }else{ @@ -946,10 +948,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * auto gAMX = gAMXIter(i+dM*R.m, k); auto gB = gBIter(k, Nb); auto gBMX = gBMXIter(k, Nb); - TCOPYIN(tA_tmp, gA); - TCOPYIN(tAMX_tmp, gAMX); - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tA_tmp, gA); + TLOAD(tAMX_tmp, gAMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } } @@ -966,10 +968,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * tile_shapeB_tcorner tB; tile_shapeBMX_tcorner tBMX; - TCOPYIN(tA_tmp, gA); - TCOPYIN(tAMX_tmp, gAMX); - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tA_tmp, gA); + TLOAD(tAMX_tmp, gAMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } else { @@ -977,7 +979,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * } } auto gC = gCIter(i+dM*R.m, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -986,13 +988,13 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * if constexpr (rmd_M) { tile_shapeA_tcols tA[R.k]; tile_shapeAMX_tcols tAMX[R.k]; - + #pragma clang loop unroll(full) for(int k=0; k0){ MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } else { @@ -1055,7 +1057,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * } } auto gC = gCIter(Mb, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [rmd_M, rmd_N, k] @@ -1068,8 +1070,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * auto gBMX = gBMXIter(k, Nb); tile_shapeB_trows tB; tile_shapeBMX_trows tBMX; - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA[k], tAMX[k], tB, tBMX); }else{ @@ -1087,10 +1089,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * auto gAMX = gAMXIter(Mb, k); auto gB = gBIter(k, Nb); auto gBMX = gBMXIter(k, Nb); - TCOPYIN(tA_tmp, gA); - TCOPYIN(tAMX_tmp, gAMX); - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tA_tmp, gA); + TLOAD(tAMX_tmp, gAMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } } @@ -1107,10 +1109,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * tile_shapeB_tcorner tB; tile_shapeBMX_tcorner tBMX; - TCOPYIN(tA_tmp, gA); - TCOPYIN(tAMX_tmp, gAMX); - TCOPYIN(tB, gB); - TCOPYIN(tBMX, gBMX); + TLOAD(tA_tmp, gA); + TLOAD(tAMX_tmp, gAMX); + TLOAD(tB, gB); + TLOAD(tBMX, gBMX); if constexpr(Kb>0){ MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX); } else { @@ -1118,25 +1120,25 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t * } } auto gC = gCIter(Mb, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } // typeb_wfactor 表明typeA和typeB的位宽比例,比如fp8是fp4x2的两倍, // smatrix_wfactor : scaling matrix 与计算matrix位宽比 -template void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) { // only support regular shape now for this operator! - static_assert(gM % tM == 0); + static_assert(gM % tM == 0); static_assert(gN % tN == 0); static_assert(gK % tK == 0); using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; - using tile_shapeA = TileLeft; + using tile_shapeA = TileLeft; using tile_shapeB = TileRight; using tile_shapeACC = TileAcc; using itA = global_iterator; @@ -1147,7 +1149,7 @@ void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src itB gBIter(src1); itC gCIter(dst); - using gm_shapeAMX = global_tensor>; + using gm_shapeAMX = global_tensor>; // gm_shapeAMX gAMX(src0_mx); // using tile_shapeAMX = Tile; // 实际tile尺寸, 需初始化为0 using tile_shapeAMX = Tile; // 实际tile尺寸, 需初始化为0 @@ -1184,10 +1186,10 @@ void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src tile_shapeB tB; tile_shapeAMX tAMX; tile_shapeBMX tBMX; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if(k==0){ MATMULMX(tACC, tA, tAMX, tB, tBMX); @@ -1197,25 +1199,25 @@ void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src // MATMACC(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } // typeb_wfactor 表明typeA和typeB的位宽比例,比如fp8是fp4x2的两倍, // smatrix_wfactor : scaling matrix 与计算matrix位宽比 -template void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) { // only support regular shape now for this operator! - static_assert(gM % tM == 0); + static_assert(gM % tM == 0); static_assert(gN % tN == 0); static_assert(gK % tK == 0); using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; - using tile_shapeA = TileLeft; + using tile_shapeA = TileLeft; using tile_shapeB = TileRight; using tile_shapeACC = TileAcc; using itA = global_iterator; @@ -1226,7 +1228,7 @@ void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, itB gBIter(src1); itC gCIter(dst); - using gm_shapeAMX = global_tensor>; + using gm_shapeAMX = global_tensor>; // gm_shapeAMX gAMX(src0_mx); // using tile_shapeAMX = Tile; // 实际tile尺寸, 需初始化为0 using tile_shapeAMX = Tile; // 实际tile尺寸, 需初始化为0 @@ -1263,10 +1265,10 @@ void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, tile_shapeB tB; tile_shapeAMX tAMX; tile_shapeBMX tBMX; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(tAMX, gAMX); - TCOPYIN(tBMX, gBMX); + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(tAMX, gAMX); + TLOAD(tBMX, gBMX); if(k==0){ MATMUL(tACC, tA, tB); @@ -1275,7 +1277,7 @@ void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, MATMACC(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -1323,7 +1325,7 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; // 伪量化固定float, group 大小128, 128个fp4共享一个scaling factor, 128的partial sum* scale - using gm_shape_scale = global_tensor>; + using gm_shape_scale = global_tensor>; using gm_shapeACC = global_tensor>; using tile_shapeA = TileLeft; using tile_shapeB = TileRight; @@ -1331,7 +1333,7 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { using tile_shape_dequant = Tile; using tile_shapeACC = TileAcc; // copy of acc, input as vector - using tile_ACCin = Tile; + using tile_ACCin = Tile; using itA = global_iterator; using itB = global_iterator; @@ -1367,9 +1369,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { using tile_shape_scale_tcols = Tile; using tile_shape_scale_tcorner = Tile; - using tile_ACCin_trows = Tile; - using tile_ACCin_tcols = Tile; - using tile_ACCin_tconer = Tile; + using tile_ACCin_trows = Tile; + using tile_ACCin_tcols = Tile; + using tile_ACCin_tconer = Tile; using tile_shape_dequant_trows = Tile; using tile_shape_dequant_tcols = Tile; @@ -1392,9 +1394,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { tile_shapeB tB; tile_shape_scale ts; tile_shape_dequant tC_dequant; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(ts, gS); // [1, tN] + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(ts, gS); // [1, tN] MATMUL(tACC, tA, tB); TCVT(tACCin, tACC);//[tM, tN] 256->1 , 256 -> 2 scaling factor @@ -1413,22 +1415,22 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { tile_shape_scale_tcols ts; tile_shape_dequant tC_dequant; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(ts, gS); + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(ts, gS); MATMUL(tACC, tA, tB); TCVT(tACCin, tACC); dequant_acc<<>>(tACCin.data(), ts.data(), tAdder[k%2].data(), tC_dequant.data()); tAdder[(k+1)%2] = tC_dequant; } - TCOPYOUT(gACC, tAdder[(k+1)%2]); + TSTORE(gACC, tAdder[(k+1)%2]); } // if constexpr (rmd_N) // TODO } if constexpr (rmd_M) { for (int j = 0; j < Nb; ++j) { - auto gACC = gACCIter(Mb, j); + auto gACC = gACCIter(Mb, j); tile_shapeC_tcols tACC; tile_ACCin_tcols tACCin; tile_shape_dequant_tcols tAdder[2]; @@ -1444,9 +1446,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { tile_shapeB tB; tile_shape_scale ts; tile_shape_dequant_tcols tC_dequant; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(ts, gS); + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(ts, gS); MATMUL(tACC, tA, tB); TCVT(tACCin, tACC); dequant_acc<<>>(tACCin.data(), ts.data(), tAdder[k%2].data(), tC_dequant.data()); @@ -1462,15 +1464,15 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { tile_shapeB_tcols tB; tile_shape_scale_tcols ts; tile_shape_dequant tC_dequant; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TCOPYIN(ts, gS); + TLOAD(tA, gA); + TLOAD(tB, gB); + TLOAD(ts, gS); MATMUL(tACC, tA, tB); TCVT(tACCin, tACC); dequant_acc<<>>(tACCin.data(), ts.data(), tAdder[k%2].data(), tC_dequant.data()); tAdder[(k+1)%2] = tC_dequant; } - TCOPYOUT(gACC, tAdder[(k+1)%2]); + TSTORE(gACC, tAdder[(k+1)%2]); } // todo // if constexpr (rmd_N) { @@ -1483,9 +1485,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { // tile_shapeA_tcols tA; // tile_shapeB_trows tB; - // TCOPYIN(tA, gA); - // TCOPYIN(tB, gB); - // MATMUL(tACC, tA, tB); + // TLOAD(tA, gA); + // TLOAD(tB, gB); + // MATMUL(tACC, tA, tB); // } // #pragma clang loop unroll(full) // for (int k = 1; k < Kb; ++k) { @@ -1494,8 +1496,8 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { // tile_shapeA_tcols tA; // tile_shapeB_trows tB; - // TCOPYIN(tA, gA); - // TCOPYIN(tB, gB); + // TLOAD(tA, gA); + // TLOAD(tB, gB); // MATMACC(tACC, tA, tB); // } // if constexpr (rmd_K) { @@ -1504,18 +1506,18 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) { // tile_shapeA_tcorner tA; // tile_shapeB_tcorner tB; - // TCOPYIN(tA, gA); - // TCOPYIN(tB, gB); + // TLOAD(tA, gA); + // TLOAD(tB, gB); // if constexpr(Kb>0){ // MATMACC(tACC, tA, tB); // } else { // MATMUL(tACC, tA, tB); // } // } - // TCOPYOUT_ACC(gC, tACC); + // TSTORE_ACC(gC, tACC); // } } } -#endif \ No newline at end of file +#endif diff --git a/kernels/memory/broadcast.hpp b/kernels/memory/broadcast.hpp index 5276b2b..0b6756a 100644 --- a/kernels/memory/broadcast.hpp +++ b/kernels/memory/broadcast.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -162,7 +162,7 @@ void gen_offset_impl( const size_t total_elements) { static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1, "Only static shape supported"); - + #if MAX_DIMs >= 1 size_t in_shape0 = in_shape[0]; size_t out_shape0 = out_shape[0]; @@ -236,11 +236,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -276,18 +276,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -296,7 +296,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_019.hpp b/kernels/memory/broadcast_019.hpp index b6916f2..c1e9b28 100644 --- a/kernels/memory/broadcast_019.hpp +++ b/kernels/memory/broadcast_019.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -268,11 +268,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -308,18 +308,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -328,7 +328,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; // MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_039.hpp b/kernels/memory/broadcast_039.hpp index b3fc5e5..1faf050 100644 --- a/kernels/memory/broadcast_039.hpp +++ b/kernels/memory/broadcast_039.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -317,11 +317,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -357,18 +357,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -377,7 +377,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; // MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_07.hpp b/kernels/memory/broadcast_07.hpp index 4b3b7b1..d38101c 100644 --- a/kernels/memory/broadcast_07.hpp +++ b/kernels/memory/broadcast_07.hpp @@ -11,7 +11,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -268,11 +268,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -308,18 +308,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -328,7 +328,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_07_simple.hpp b/kernels/memory/broadcast_07_simple.hpp index 75e03d0..1e39c97 100644 --- a/kernels/memory/broadcast_07_simple.hpp +++ b/kernels/memory/broadcast_07_simple.hpp @@ -40,11 +40,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -66,7 +66,7 @@ void broadcast( gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile, inGm, offsetTile); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { auto gO = gOIter(0, Mb); @@ -74,6 +74,6 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } \ No newline at end of file diff --git a/kernels/memory/broadcast_Hunyuan.hpp b/kernels/memory/broadcast_Hunyuan.hpp index 80b5b1c..191ab31 100644 --- a/kernels/memory/broadcast_Hunyuan.hpp +++ b/kernels/memory/broadcast_Hunyuan.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -273,11 +273,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -313,18 +313,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -333,7 +333,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_mscatter.hpp b/kernels/memory/broadcast_mscatter.hpp index f38161b..958da18 100644 --- a/kernels/memory/broadcast_mscatter.hpp +++ b/kernels/memory/broadcast_mscatter.hpp @@ -167,7 +167,7 @@ void broadcast_mscatter( // ====================== for (int i = 0; i < input_tiles; ++i) { auto gIn = gInIter(0, i); - TCOPYIN(inDataTile, gIn); + TLOAD(inDataTile, gIn); // 重置广播步进 memset(bcast_step, 0, sizeof(bcast_step)); @@ -185,7 +185,7 @@ void broadcast_mscatter( // 散射写入 MSCATTER(outGm, inDataTile, offsetTile); auto gOffset = gOffsetIter(0, offset_idx); - TCOPYOUT(gOffset, offsetTile); + TSTORE(gOffset, offsetTile); offset_idx ++; // 下一组广播坐标 @@ -200,7 +200,7 @@ void broadcast_mscatter( if constexpr (rmd_input > 0) { auto gIn = gInIter(0, input_tiles); total_elements = rmd_input; - TCOPYIN(inDataTile_rmd, gIn); + TLOAD(inDataTile_rmd, gIn); memset(bcast_step, 0, sizeof(bcast_step)); done = false; @@ -212,7 +212,7 @@ void broadcast_mscatter( ); MSCATTER(outGm, inDataTile_rmd, offsetTile_rmd); auto gOffset = gOffsetIter(0, offset_idx); - TCOPYOUT(gOffset, offsetTile_rmd); + TSTORE(gOffset, offsetTile_rmd); offset_idx ++; done = !next_broadcast_step(); diff --git a/kernels/memory/broadcast_nomg.hpp b/kernels/memory/broadcast_nomg.hpp index 16c5332..c619c0f 100644 --- a/kernels/memory/broadcast_nomg.hpp +++ b/kernels/memory/broadcast_nomg.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -162,7 +162,7 @@ void gen_offset_impl( const size_t total_elements) { static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1, "Only static shape supported"); - + #if MAX_DIMs >= 1 size_t in_shape0 = in_shape[0]; size_t out_shape0 = out_shape[0]; @@ -236,11 +236,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -276,18 +276,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); // MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, offsetTile); + TSTORE(gO, offsetTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -296,7 +296,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; // MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, offsetTile_rmd); + TSTORE(gO, offsetTile_rmd); } } diff --git a/kernels/memory/broadcast_nocopyout.hpp b/kernels/memory/broadcast_nostore.hpp similarity index 97% rename from kernels/memory/broadcast_nocopyout.hpp rename to kernels/memory/broadcast_nostore.hpp index 5340d09..4d429b0 100644 --- a/kernels/memory/broadcast_nocopyout.hpp +++ b/kernels/memory/broadcast_nostore.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -162,7 +162,7 @@ void gen_offset_impl( const size_t total_elements) { static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1, "Only static shape supported"); - + #if MAX_DIMs >= 1 size_t in_shape0 = in_shape[0]; size_t out_shape0 = out_shape[0]; @@ -229,18 +229,18 @@ void gen_offset_impl( template -void broadcast_nocopyout( +void broadcast_nostore( dtype *in_ptr, dtype *out_ptr, const size_t *in_shape, const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -273,10 +273,10 @@ void broadcast_nocopyout( auto gO = gOIter(0, i); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + MGATHER(outTile, inGm, offsetTile); - // TCOPYOUT(gO, outTile); + // TSTORE(gO, outTile); } if constexpr (rmd_M) { auto gO = gOIter(0, Mb); @@ -284,7 +284,7 @@ void broadcast_nocopyout( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile_rmd, inGm, offsetTile_rmd); - // TCOPYOUT(gO, outTile_rmd); + // TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_simple.hpp b/kernels/memory/broadcast_simple.hpp index c8ce29f..43e135e 100644 --- a/kernels/memory/broadcast_simple.hpp +++ b/kernels/memory/broadcast_simple.hpp @@ -85,7 +85,7 @@ void gen_offset_impl( const size_t total_elements) { static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1, "Only static shape supported"); - + #if MAX_DIMs >= 1 size_t in_shape0 = in_shape[0]; size_t out_shape0 = out_shape[0]; @@ -159,11 +159,11 @@ void broadcast( const size_t *out_shape ) { const size_t Mb = gOM / tM; - const size_t rmd_M = gOM % tM; + const size_t rmd_M = gOM % tM; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; using tile_shapeData_rmd = Tile; using tile_shapeOffset_rmd = Tile; @@ -199,18 +199,18 @@ void broadcast( // printf("total_elements = %d\n", total_elements); // printf("in_shape[0] = %d\n", in_shape[0]); // printf("inGm = %ld\n", inGm); - - // TCOPYIN(inTile, gI); + + // TLOAD(inTile, gI); // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM); gen_offset_impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; - + // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(outTile, inGm, offsetTile); // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_M) { // printf("rmd_M = %d\n", rmd_M); @@ -219,7 +219,7 @@ void broadcast( gen_offset_impl(offsetTile_rmd, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(outTile_rmd, inGm, offsetTile_rmd); - TCOPYOUT(gO, outTile_rmd); + TSTORE(gO, outTile_rmd); } } diff --git a/kernels/memory/broadcast_vec_019.hpp b/kernels/memory/broadcast_vec_019.hpp index e58991b..d03fdbb 100644 --- a/kernels/memory/broadcast_vec_019.hpp +++ b/kernels/memory/broadcast_vec_019.hpp @@ -5,7 +5,7 @@ using namespace pto; // ===================================================================== -// Broadcast (B,1,K) -> (B,N,K) via TCOPYIN + __vec__ broadcast + TCOPYOUT +// Broadcast (B,1,K) -> (B,N,K) via TLOAD + __vec__ broadcast + TSTORE // // Optimized for: (1280,1,49) -> (1280,8,49), dtype=half // @@ -17,7 +17,7 @@ using namespace pto; // Processing strategy: // Divide B batches into tiles of kTileBatch batches each. // Per tile: -// 1. TCOPYIN (kTileBatch, K) from GlobalMem -> TileReg +// 1. TLOAD (kTileBatch, K) from GlobalMem -> TileReg // Reads kTileBatch * K contiguous elements. // 2. __vec__ broadcast within TileReg: // Launch <<>> threads: @@ -27,7 +27,7 @@ using namespace pto; // batch_idx = y & (kTileBatch - 1) (0..kTileBatch-1, bitwise) // Read src[batch_idx * RowStride + x] // Write dst[batch_idx * RowStride + copy * K + x] -// 3. TCOPYOUT (kTileBatch, N*K) from TileReg -> GlobalMem +// 3. TSTORE (kTileBatch, N*K) from TileReg -> GlobalMem // // TileReg layout: // Physical tile cols = 512 (padded for 512B alignment). @@ -111,23 +111,23 @@ void broadcast(dtype *in_ptr, dtype *out_ptr, for (size_t i = 0; i < Nb; i++) { gm_in gsrc(in_ptr + i * kTileBatch * kInner); - TCOPYIN(inTile, gsrc); + TLOAD(inTile, gsrc); vec_broadcast_3d <<>>(outTile.data(), inTile.data()); gm_out gdst(out_ptr + i * kTileBatch * kBCast * kInner); - TCOPYOUT(gdst, outTile); + TSTORE(gdst, outTile); } if constexpr (rmd > 0) { gm_in gsrc(in_ptr + Nb * kTileBatch * kInner); - TCOPYIN(inTile_rmd, gsrc); + TLOAD(inTile_rmd, gsrc); vec_broadcast_3d <<>>(outTile_rmd.data(), inTile_rmd.data()); gm_out gdst(out_ptr + Nb * kTileBatch * kBCast * kInner); - TCOPYOUT(gdst, outTile_rmd); + TSTORE(gdst, outTile_rmd); } } \ No newline at end of file diff --git a/kernels/memory/broadcast_vec_039.hpp b/kernels/memory/broadcast_vec_039.hpp index 72cccbd..58cd521 100644 --- a/kernels/memory/broadcast_vec_039.hpp +++ b/kernels/memory/broadcast_vec_039.hpp @@ -5,7 +5,7 @@ using namespace pto; // ===================================================================== -// Broadcast (B,1,K) -> (B,N,K) via TCOPYIN + __vec__ broadcast + TCOPYOUT +// Broadcast (B,1,K) -> (B,N,K) via TLOAD + __vec__ broadcast + TSTORE // // Optimized for: (8192,1,16) -> (8192,8,16), dtype=half // @@ -17,7 +17,7 @@ using namespace pto; // Processing strategy: // Divide B batches into tiles of kTileBatch batches each. // Per tile: -// 1. TCOPYIN (kTileBatch, K) from GlobalMem -> TileReg +// 1. TLOAD (kTileBatch, K) from GlobalMem -> TileReg // Reads kTileBatch * K contiguous elements. // 2. __vec__ broadcast within TileReg: // For each batch, replicate its K elements N times (row-wise). @@ -28,7 +28,7 @@ using namespace pto; // col = x % K (inner column 0..K-1) // Read src[y * RowStride + col] // Write dst[y * RowStride + x] -// 3. TCOPYOUT (kTileBatch, N*K) from TileReg -> GlobalMem +// 3. TSTORE (kTileBatch, N*K) from TileReg -> GlobalMem // // TileReg layout: // Physical tile cols = 256 (padded for 512B alignment). @@ -113,23 +113,23 @@ void broadcast(dtype *in_ptr, dtype *out_ptr, for (size_t i = 0; i < Nb; i++) { gm_in gsrc(in_ptr + i * kTileBatch * kInner); - TCOPYIN(inTile, gsrc); + TLOAD(inTile, gsrc); vec_broadcast_3d <<>>(outTile.data(), inTile.data()); gm_out gdst(out_ptr + i * kTileBatch * kBCast * kInner); - TCOPYOUT(gdst, outTile); + TSTORE(gdst, outTile); } if constexpr (rmd > 0) { gm_in gsrc(in_ptr + Nb * kTileBatch * kInner); - TCOPYIN(inTile_rmd, gsrc); + TLOAD(inTile_rmd, gsrc); vec_broadcast_3d <<>>(outTile_rmd.data(), inTile_rmd.data()); gm_out gdst(out_ptr + Nb * kTileBatch * kBCast * kInner); - TCOPYOUT(gdst, outTile_rmd); + TSTORE(gdst, outTile_rmd); } } \ No newline at end of file diff --git a/kernels/memory/broadcast_vec_07.hpp b/kernels/memory/broadcast_vec_07.hpp index 5934187..2129afb 100644 --- a/kernels/memory/broadcast_vec_07.hpp +++ b/kernels/memory/broadcast_vec_07.hpp @@ -5,20 +5,20 @@ using namespace pto; // ===================================================================== -// Broadcast (N,1) -> (N,C) via TCOPYIN + __vec__ broadcast + TCOPYOUT +// Broadcast (N,1) -> (N,C) via TLOAD + __vec__ broadcast + TSTORE // // Optimized for: (1443,1) -> (1443,129), dtype=half // // Processing strategy: // Divide N rows into tiles of kTileRows rows each. // Per tile: -// 1. TCOPYIN (kTileRows, 1) from GlobalMem -> TileReg +// 1. TLOAD (kTileRows, 1) from GlobalMem -> TileReg // 2. __vec__ broadcast (kTileRows, 1) -> (kTileRows, C) within TileReg // Launch <<>> threads: // x = column index (0..kC-1), y = row index (0..kTileRows-1) // Each thread reads src[j*src_RowStride] (col 0 of row j) // and writes to dst[i + j*dst_RowStride] (col i of row j) -// 3. TCOPYOUT (kTileRows, C) from TileReg -> GlobalMem +// 3. TSTORE (kTileRows, C) from TileReg -> GlobalMem // // TileReg layout: // Physical tile cols padded to 256 for 512B alignment. @@ -77,13 +77,13 @@ void broadcast(dtype *in_ptr, dtype *out_ptr, for (size_t i = 0; i < Nb; i++) { gm_in gsrc(in_ptr + i * kTileRows); - TCOPYIN(inTile, gsrc); + TLOAD(inTile, gsrc); vec_broadcast_rowmajor <<>>(outTile.data(), inTile.data()); gm_out gdst(out_ptr + i * kTileRows * kC); - TCOPYOUT(gdst, outTile); + TSTORE(gdst, outTile); } using tile_in_r = Tile 0) { gm_in gsrc(in_ptr + Nb * kTileRows); - TCOPYIN(inTile_rmd, gsrc); + TLOAD(inTile_rmd, gsrc); vec_broadcast_rowmajor <<>>(outTile_rmd.data(), inTile_rmd.data()); gm_out gdst(out_ptr + Nb * kTileRows * kC); - TCOPYOUT(gdst, outTile_rmd); + TSTORE(gdst, outTile_rmd); } } \ No newline at end of file diff --git a/kernels/memory/broadcast_vec_07_handwrite.hpp b/kernels/memory/broadcast_vec_07_handwrite.hpp index 88cccf2..b6d3a6c 100644 --- a/kernels/memory/broadcast_vec_07_handwrite.hpp +++ b/kernels/memory/broadcast_vec_07_handwrite.hpp @@ -12,7 +12,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -58,7 +58,7 @@ void broadcast( const size_t rmd_N = rmd_M; const size_t vld_N = N * 129; // 实际一次写回数量 const size_t rmd_vld_N = rmd_M * 129; // 尾块,实际一次写回数量 - + Assert(tO0 > 129); Assert(tO0 % 128 == 0); @@ -86,26 +86,26 @@ void broadcast( auto gI = gIIter(0, i); t_in_offset = 0; t_out_offset = 0; - TCOPYIN(inTile, gI); + TLOAD(inTile, gI); for (int j = 0; j < N; ++i) { vec_broadcast<<<129, 1, 1>>>(inTile, outTile, t_in_offset, t_out_offset); t_in_offset += 1; t_out_offset += 129; } - TCOPYOUT(outTile, out_ptr); + TSTORE(outTile, out_ptr); out_ptr += sizeof(dtype) * vld_N; } if constexpr (rmd_M) { auto gI = gIIter(0, Mb); t_in_offset = 0; t_out_offset = 0; - TCOPYIN(inTile, gI); + TLOAD(inTile, gI); for (int j = 0; j < rmd_N; ++i) { vec_broadcast<<<129, 1, 1>>>(inTile_rmd, outTile_rmd, t_in_offset, t_out_offset); t_in_offset += 1; t_out_offset += 129; } - TCOPYOUT(outTile, out_ptr); + TSTORE(outTile, out_ptr); out_ptr += sizeof(dtype) * rmd_vld_N; } } diff --git a/kernels/memory/concat_gather.hpp b/kernels/memory/concat_gather.hpp index 85a69fd..404f7ae 100644 --- a/kernels/memory/concat_gather.hpp +++ b/kernels/memory/concat_gather.hpp @@ -16,7 +16,7 @@ using namespace pto; GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -36,7 +36,7 @@ void __vec__ gen_offset_concat( typename tile_shape::TileDType __out__ out, typename tile_Inshape::TileDType __in__ in_shape, typename tile_Outshape::TileDType __in__ out_shape, -// const size_t in_dim, +// const size_t in_dim, const size_t base, const size_t total_elements ) { @@ -64,14 +64,14 @@ void __vec__ gen_offset_concat( // 输出一维索引 → 输出坐标 size_t out_coord[MAX_DIM] = {0}; // size_t tmp = idx; // - + #pragma clang loop unroll(full) for (int d = DATA_DIM - 1; d >= 0; d--) { out_coord[d] = tmp % out_shape_ptr[d]; tmp /= out_shape_ptr[d]; } - size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM]; + size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM]; size_t offset = out_coord[CONCAT_DIM] % in_shape_ptr[CONCAT_DIM]; out_coord[CONCAT_DIM] = offset; @@ -88,10 +88,10 @@ void __vec__ gen_offset_concat( } } */ -// uint16_t in_offset = 0; - uint32_t in_offset = 0; +// uint16_t in_offset = 0; + uint32_t in_offset = 0; - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for (int i = 0; i < DATA_DIM; i++) { in_offset += out_coord[i] * stride[i] * sizeof(dtype); } @@ -109,7 +109,7 @@ void gen_offset_Impl( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0, +// const size_t transpose_dim0, const size_t base, const size_t total_elements ) @@ -130,31 +130,31 @@ void concat_gather( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0 -) +// const size_t transpose_dim0 +) { const int Mb = gOM / tM; - + const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_shapeOut = global_tensor>; - using gm_InDataShape = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_InDataShape = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_OutDataShape = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec using tile_Inshape = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - using tile_Outshape = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec + using tile_Outshape = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec // using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec gm_shapeIn inGm(in_ptr); - gm_InDataShape InShapeGm(in_shape); - gm_OutDataShape OutShapeGm(out_shape); + gm_InDataShape InShapeGm(in_shape); + gm_OutDataShape OutShapeGm(out_shape); tile_shapeData dataTile; tile_shapeOffset offsetTile; @@ -175,25 +175,25 @@ void concat_gather( for (int i = 0; i < Mb; ++i) { auto gO = gOIter(0, i); - TCOPYIN(InshapeTile, InShapeGm); - TCOPYIN(OutshapeTile, OutShapeGm); + TLOAD(InshapeTile, InShapeGm); + TLOAD(OutshapeTile, OutShapeGm); gen_offset_Impl(offsetTile, InshapeTile, OutshapeTile, base, total_elements); // printf("end genoffset\n"); base += total_elements; // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(dataTile, inGm, offsetTile); -// printf("end mgather\n"); - TCOPYOUT(gO, dataTile); +// printf("end mgather\n"); + TSTORE(gO, dataTile); } - if constexpr (rmd_M) { + if constexpr (rmd_M) { auto gO = gOIter(0, Mb); - TCOPYIN(InshapeTile, InShapeGm); - TCOPYIN(OutshapeTile, OutShapeGm); + TLOAD(InshapeTile, InShapeGm); + TLOAD(OutshapeTile, OutShapeGm); total_elements = rmd_M;//尾片的大小。 gen_offset_Impl(offsetTile, InshapeTile, OutshapeTile, base, total_elements); base += total_elements; MGATHER(dataTile, inGm, offsetTile); - TCOPYOUT(gO, dataTile); + TSTORE(gO, dataTile); } } diff --git a/kernels/memory/concat_scatter.hpp b/kernels/memory/concat_scatter.hpp index e0f14b8..13b6587 100644 --- a/kernels/memory/concat_scatter.hpp +++ b/kernels/memory/concat_scatter.hpp @@ -16,7 +16,7 @@ using namespace pto; GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -36,15 +36,15 @@ void __vec__ gen_offset_concat( typename tile_shape::TileDType __out__ out, typename tile_Inshape::TileDType __in__ in_shape, typename tile_Outshape::TileDType __in__ out_shape, -// const size_t in_dim, +// const size_t in_dim, const size_t base, const size_t total_elements ) { size_t index = blkv_get_index_x(); size_t idx = blkv_get_index_x(); - __vbuf__ typename tile_Inshape::DType *in_shape_ptr = blkv_get_tile_ptr(in_shape); - __vbuf__ typename tile_Outshape::DType *out_shape_ptr = blkv_get_tile_ptr(out_shape); + __vbuf__ typename tile_Inshape::DType *in_shape_ptr = blkv_get_tile_ptr(in_shape); + __vbuf__ typename tile_Outshape::DType *out_shape_ptr = blkv_get_tile_ptr(out_shape); if (index >= total_elements) return; idx = idx + base; // todo idx是个向量,base是个标量,获得所有的基地址或者说基offset @@ -64,7 +64,7 @@ void __vec__ gen_offset_concat( // 输出一维索引 → 输出坐标 size_t in_coord[MAX_DIM] = {0}; // size_t tmp = idx; // - + #pragma clang loop unroll(full) for (int d = DATA_DIM - 1; d >= 0; d--) { in_coord[d] = tmp % in_shape_ptr[d]; @@ -73,7 +73,7 @@ void __vec__ gen_offset_concat( size_t n = tmp; in_coord[CONCAT_DIM] = n * in_shape_ptr[CONCAT_DIM] + in_coord[CONCAT_DIM]; -// size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM]; +// size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM]; // size_t offset = out_coord[CONCAT_DIM] % in_shape_ptr[CONCAT_DIM]; // out_coord[CONCAT_DIM] = offset; @@ -90,11 +90,11 @@ void __vec__ gen_offset_concat( } } */ -// uint16_t in_offset = 0; -// uint32_t out_offset = 0; - uint16_t out_offset = 0; +// uint16_t in_offset = 0; +// uint32_t out_offset = 0; + uint16_t out_offset = 0; - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for (int i = 0; i < DATA_DIM; i++) { out_offset += in_coord[i] * stride[i] * sizeof(dtype); } @@ -111,7 +111,7 @@ void gen_offset_Impl( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0, +// const size_t transpose_dim0, const size_t base, const size_t total_elements ) @@ -132,33 +132,33 @@ void concat_scatter( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0 -) +// const size_t transpose_dim0 +) { const int Mb = gOM / tM; - + const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_shapeOut = global_tensor>; - using gm_InDataShape = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_InDataShape = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_OutDataShape = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec + using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec // using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec using tile_Inshape = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - using tile_Outshape = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec + using tile_Outshape = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec // using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec // gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); - gm_InDataShape InShapeGm(in_shape); - gm_OutDataShape OutShapeGm(out_shape); + gm_InDataShape InShapeGm(in_shape); + gm_OutDataShape OutShapeGm(out_shape); tile_shapeData dataTile; tile_shapeOffset offsetTile; @@ -181,24 +181,24 @@ void concat_scatter( for (int i = 0; i < Mb; ++i) { - auto gI = gIIter(0, i); - TCOPYIN(InshapeTile, InShapeGm); - TCOPYIN(OutshapeTile, OutShapeGm); + auto gI = gIIter(0, i); + TLOAD(InshapeTile, InShapeGm); + TLOAD(OutshapeTile, OutShapeGm); gen_offset_Impl(offsetTile, InshapeTile, OutshapeTile, base, total_elements); // printf("end genoffset\n"); base += total_elements; // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); MSCATTER(outGm, dataTile, offsetTile); } - if constexpr (rmd_M) { - auto gI = gIIter(0, Mb); - TCOPYIN(InshapeTile, InShapeGm); - TCOPYIN(OutshapeTile, OutShapeGm); + if constexpr (rmd_M) { + auto gI = gIIter(0, Mb); + TLOAD(InshapeTile, InShapeGm); + TLOAD(OutshapeTile, OutShapeGm); total_elements = rmd_M;//尾片的大小。 gen_offset_Impl(offsetTile, InshapeTile, OutshapeTile, base, total_elements); base += total_elements; - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); MSCATTER(outGm, dataTile, offsetTile); } } diff --git a/kernels/memory/gather.hpp b/kernels/memory/gather.hpp index 0222852..03a1069 100644 --- a/kernels/memory/gather.hpp +++ b/kernels/memory/gather.hpp @@ -10,7 +10,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -27,7 +27,7 @@ GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -40,7 +40,7 @@ template void __vec__ gen_offset( typename tile_shapeInOffset::TileDType __in__ in, // inOffset - typename tile_shapeOffset::TileDType __out__ out, // + typename tile_shapeOffset::TileDType __out__ out, // const size_t n_base ) { size_t data_width = sizeof(dtype); @@ -64,7 +64,7 @@ void gen_offset_impl( ) { static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1, "Only static shape supported"); - + gen_offset<<>>( in_offset.data(), @@ -86,25 +86,25 @@ void gather( using gm_shapeInOffset = global_tensor>; using gm_shapeIn = global_tensor>; using gm_shapeOut = global_tensor>; - - using tile_shapeInData = Tile; + + using tile_shapeInData = Tile; using itIn = global_iterator; tile_shapeInData inTile; itIn gInIter(in_data_ptr); - using tile_shapeInOffset = Tile; - using tile_shapeData = Tile; + using tile_shapeInOffset = Tile; + using tile_shapeData = Tile; using tile_shapeOffset = Tile; - using tile_shapeInOffset_rmd_n = Tile; + using tile_shapeInOffset_rmd_n = Tile; using tile_shapeData_rmd_n = Tile; using tile_shapeOffset_rmd_n = Tile; - using tile_shapeInOffset_rmd_mn = Tile; + using tile_shapeInOffset_rmd_mn = Tile; using tile_shapeData_rmd_mn = Tile; using tile_shapeOffset_rmd_mn = Tile; - using tile_shapeInOffset_rmd_m = Tile; + using tile_shapeInOffset_rmd_m = Tile; using tile_shapeData_rmd_m = Tile; using tile_shapeOffset_rmd_m = Tile; @@ -117,11 +117,11 @@ void gather( tile_shapeInOffset_rmd_n inOffsetTile_rmd_n; tile_shapeData_rmd_n outTile_rmd_n; tile_shapeOffset_rmd_n offsetTile_rmd_n; - + tile_shapeInOffset_rmd_mn inOffsetTile_rmd_mn; tile_shapeData_rmd_mn outTile_rmd_mn; tile_shapeOffset_rmd_mn offsetTile_rmd_mn; - + tile_shapeInOffset_rmd_m inOffsetTile_rmd_m; tile_shapeData_rmd_m outTile_rmd_m; tile_shapeOffset_rmd_m offsetTile_rmd_m; @@ -140,24 +140,24 @@ void gather( // /////////////////////////////////////// size_t n_base = 0; - + // #pragma clang loop unroll(full) for (int j = 0; j < Mb; ++j) { printf("j = %d\n", j); for (int i = 0; i < Nb; ++i) { auto gInOffset = gInOffsetIter(0, j); auto gO = gOIter(j, i); - TCOPYIN(inOffsetTile, gInOffset); + TLOAD(inOffsetTile, gInOffset); // test // auto gIn = gInIter(j, i); - // TCOPYIN(inTile, gIn); + // TLOAD(inTile, gIn); n_base = i * tN; // printf("j = %d\n", j); // printf("i = %d\n", i); // printf("base = %d\n", base); // printf("in_shape[0] = %d\n", in_shape[0]); gen_offset_impl(inOffsetTile, offsetTile, n_base); - + MGATHER(outTile, inGm, offsetTile); // printf("inGm = %d\n", inGm); @@ -165,19 +165,19 @@ void gather( // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tN); // DUMP_TILE_FLOAT("inTile", inTile, g_dump_outdata, 1, tN); // DUMP_TILE_FLOAT("outTile", outTile, g_dump_outdata, 1, tN); - TCOPYOUT(gO, outTile); + TSTORE(gO, outTile); } if constexpr (rmd_N) { auto gInOffset = gInOffsetIter(0, j); auto gO = gOIter(j, Nb); n_base = Nb * tN; - TCOPYIN(inOffsetTile_rmd_n, gInOffset); + TLOAD(inOffsetTile_rmd_n, gInOffset); gen_offset_impl(inOffsetTile_rmd_n, offsetTile_rmd_n, n_base); MGATHER(outTile_rmd_n, inGm, offsetTile_rmd_n); // DUMP_TILE("inOffsetTile_rmd_n", inOffsetTile_rmd_n, g_dump_inoffset, 1, rmd_N); // DUMP_TILE_FLOAT("outTile_rmd_n", outTile_rmd_n, g_dump_outdata, 20, rmd_N); // DUMP_TILE("offsetTile_rmd_n", offsetTile_rmd_n, g_dump, 20, rmd_N); - TCOPYOUT(gO, outTile_rmd_n); + TSTORE(gO, outTile_rmd_n); } } if constexpr (rmd_M) { @@ -185,19 +185,19 @@ void gather( auto gInOffset = gInOffsetIter(0, Mb); auto gO = gOIter(Mb, i); n_base = i * tN; - TCOPYIN(inOffsetTile_rmd_m, gInOffset); + TLOAD(inOffsetTile_rmd_m, gInOffset); gen_offset_impl(inOffsetTile_rmd_m, offsetTile_rmd_m, n_base); MGATHER(outTile_rmd_m, inGm, offsetTile_rmd_m); - TCOPYOUT(gO, outTile_rmd_m); + TSTORE(gO, outTile_rmd_m); } if constexpr (rmd_N) { auto gInOffset = gInOffsetIter(0, Mb); auto gO = gOIter(Mb, Nb); n_base = Nb * tN; - TCOPYIN(inOffsetTile_rmd_mn, gInOffset); + TLOAD(inOffsetTile_rmd_mn, gInOffset); gen_offset_impl(inOffsetTile_rmd_mn, offsetTile_rmd_mn, n_base); MGATHER(outTile_rmd_mn, inGm, offsetTile_rmd_mn); - TCOPYOUT(gO, outTile_rmd_mn); + TSTORE(gO, outTile_rmd_mn); } } diff --git a/kernels/memory/transpose.hpp b/kernels/memory/transpose.hpp index 37ca721..51d0a0c 100644 --- a/kernels/memory/transpose.hpp +++ b/kernels/memory/transpose.hpp @@ -16,7 +16,7 @@ using namespace pto; GlobalTensor, \ Stride<1,1,1,Cols,1>> _g(DumpBuf); \ - TCOPYOUT(_g, TileVar); \ + TSTORE(_g, TileVar); \ printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \ for (int ri = 0; ri < Rows; ri++) { \ printf(" row%2d: ", ri); \ @@ -38,7 +38,7 @@ void __vec__ gen_offset_trans( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0, +// const size_t transpose_dim0, const size_t base, const size_t total_elements ) { @@ -62,7 +62,7 @@ void __vec__ gen_offset_trans( // 输出一维索引 → 输出坐标 size_t out_coord[MAX_DIM] = {0}; // size_t tmp = idx; // - + #pragma clang loop unroll(full) for (int d = OUT_DIM - 1; d >= 0; d--) { out_coord[d] = tmp % out_shape[d]; @@ -81,10 +81,10 @@ void __vec__ gen_offset_trans( } } */ -// uint16_t in_offset = 0; - uint32_t in_offset = 0; +// uint16_t in_offset = 0; + uint32_t in_offset = 0; - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for (int i = 0; i < IN_DIM; i++) { in_offset += out_coord[i] * stride[i] * sizeof(dtype); } @@ -101,7 +101,7 @@ void gen_offset_Impl( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0, +// const size_t transpose_dim0, const size_t base, const size_t total_elements ) @@ -122,16 +122,16 @@ void transpose( // const size_t in_dim, // const size_t out_dim, // const size_t transpose_dim1, -// const size_t transpose_dim0 -) +// const size_t transpose_dim0 +) { const int Mb = gOM / tM; - + const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeOffset = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec @@ -158,9 +158,9 @@ void transpose( base += total_elements; // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM); MGATHER(dataTile, inGm, offsetTile); -// printf("end mgather\n"); - TCOPYOUT(gO, dataTile); -// TCOPYOUT(gO, dataTile); +// printf("end mgather\n"); + TSTORE(gO, dataTile); +// TSTORE(gO, dataTile); } if constexpr (rmd_M) { auto gO = gOIter(0, Mb); @@ -168,7 +168,7 @@ void transpose( gen_offset_Impl(offsetTile, in_shape, out_shape, base, total_elements); base += total_elements; MGATHER(dataTile, inGm, offsetTile); - TCOPYOUT(gO, dataTile); + TSTORE(gO, dataTile); } } diff --git a/kernels/memory/transpose_vector_007.hpp b/kernels/memory/transpose_vector_007.hpp index c9fbe07..4093211 100644 --- a/kernels/memory/transpose_vector_007.hpp +++ b/kernels/memory/transpose_vector_007.hpp @@ -4,11 +4,11 @@ using namespace pto; //AI/IA = A, placeholder -//like Ttrans +//like Ttrans template void __vec__ transpose_007_impl( typename tileOutData::TileDType __out__ out, - const typename tileInData::TileDType __in__ in + const typename tileInData::TileDType __in__ in ) { size_t i = blkv_get_index_x(); // 4096 @@ -27,13 +27,13 @@ void __vec__ transpose_007_impl( // in1[DimIn1, 1] in2 [DimIn2, 1] bias[DimOut, 1] weight [DimOut, DimIn1, DimIn2] template void transpose_007( - dtype *out_ptr, + dtype *out_ptr, dtype *in_ptr ) { - const int Mb = 4096 / 4096; + const int Mb = 4096 / 4096; - using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_shapeOut = global_tensor>; using tile_shapeInData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeOutData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 @@ -42,18 +42,18 @@ void transpose_007( using itOut = global_iterator; tile_shapeInData InDataTile; - tile_shapeOutData OutDataTile; + tile_shapeOutData OutDataTile; itIn gIIter(in_ptr); - itOut gOIter(out_ptr); + itOut gOIter(out_ptr); for (int i = 0; i < Mb; ++i) { auto gI = gIIter(0, i); auto gO = gOIter(0, i); - TCOPYIN(InDataTile, gI); + TLOAD(InDataTile, gI); transpose_007_impl<<>>(OutDataTile.data(), InDataTile.data()); - TCOPYOUT(gO, OutDataTile); - } + TSTORE(gO, OutDataTile); + } } diff --git a/kernels/memory/transpose_vector_050.hpp b/kernels/memory/transpose_vector_050.hpp index d280dd3..d9c2053 100644 --- a/kernels/memory/transpose_vector_050.hpp +++ b/kernels/memory/transpose_vector_050.hpp @@ -7,7 +7,7 @@ using namespace pto; template void __vec__ transpose_050_impl( typename tileData::TileDType __out__ out, - const typename tileData::TileDType __in__ in + const typename tileData::TileDType __in__ in ) { size_t i = blkv_get_index_x(); // y @@ -26,33 +26,33 @@ void __vec__ transpose_050_impl( template void transpose_050( - dtype *out_ptr, + dtype *out_ptr, dtype *in_ptr ) -{ +{ - using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using itIn = global_iterator; - using itOut = global_iterator; + using itOut = global_iterator; tile_shapeData InDataTile; tile_shapeData OutDataTile; itIn gIIter(in_ptr); - itOut gOIter(out_ptr); + itOut gOIter(out_ptr); auto gI = gIIter(0, 0); auto gO = gOIter(0, 0); - TCOPYIN(InDataTile, gI); + TLOAD(InDataTile, gI); transpose_050_impl<<>>(OutDataTile.data(), InDataTile.data()); - TCOPYOUT(gO, OutDataTile); - - + TSTORE(gO, OutDataTile); + + } diff --git a/kernels/other/attention.hpp b/kernels/other/attention.hpp index b40b402..ae4af5c 100644 --- a/kernels/other/attention.hpp +++ b/kernels/other/attention.hpp @@ -54,8 +54,8 @@ void flash_attention1(float *out_ptr, for (int j = 0; j < Kb; ++j) { // load Q_i, K_j - tileQ tQ; TCOPYIN(tQ, gQ(i, 0)); - tileK tK; TCOPYIN(tK, gK(0, j)); + tileQ tQ; TLOAD(tQ, gQ(i, 0)); + tileK tK; TLOAD(tK, gK(0, j)); // 计算分数块 tileW tW; MATMUL(tW, tQ, tK); @@ -72,8 +72,8 @@ void flash_attention1(float *out_ptr, // 2. 扫描所有 K‑blocks,计算行 exp 和的累加 tileSum tSum(0); for (int j = 0; j < Kb; ++j) { - tileQ tQ; TCOPYIN(tQ, gQ(i, 0)); - tileK tK; TCOPYIN(tK, gK(0, j)); + tileQ tQ; TLOAD(tQ, gQ(i, 0)); + tileK tK; TLOAD(tK, gK(0, j)); tileW tW; MATMUL(tW, tQ, tK); TMULS(tW, tW, scale); @@ -95,9 +95,9 @@ void flash_attention1(float *out_ptr, // 3. 重算加权,乘 V 并累加到输出 tO = tileO(0); for (int j = 0; j < Kb; ++j) { - tileQ tQ; TCOPYIN(tQ, gQ(i, 0)); - tileK tK; TCOPYIN(tK, gK(0, j)); - tileV tV; TCOPYIN(tV, gV(j, 0)); + tileQ tQ; TLOAD(tQ, gQ(i, 0)); + tileK tK; TLOAD(tK, gK(0, j)); + tileV tV; TLOAD(tV, gV(j, 0)); tileW tW; MATMUL(tW, tQ, tK); @@ -115,6 +115,6 @@ void flash_attention1(float *out_ptr, } // 写回 global - TCOPYOUT(dstO, tO); + TSTORE(dstO, tO); } } diff --git a/kernels/other/conv.hpp b/kernels/other/conv.hpp index 8cb22e3..f2f9301 100644 --- a/kernels/other/conv.hpp +++ b/kernels/other/conv.hpp @@ -4,13 +4,13 @@ using namespace pto; //in: Input data of shape (N, C, H, W) -> -> N, 1, C, H, W -> N, F, C, H, W //filter: Filter weights of shape (F, C, HH, WW) -> 1 ,F, C, HH, WW - > N ,F, C, HH, WW -//out: Output data, of shape (N, F, H', W') +//out: Output data, of shape (N, F, H', W') // for j in range(0, H_prime): // for i in range(0, W_prime): // tmp_w = w // tmp_w = tmp_w[np.newaxis,:] // tmp_w = np.repeat(tmp_w, N, axis=0) -// tmp_x = x_pad[:, :, j * stride:j * stride + HH, i * stride:i * stride + WW] +// tmp_x = x_pad[:, :, j * stride:j * stride + HH, i * stride:i * stride + WW] // tmp_x = tmp_x[:,np.newaxis] // tmp_x = np.repeat(tmp_x, F, axis=1) // out[:,:,j,i] = np.sum(np.sum(np.sum(tmp_x*tmp_w, axis=-1), axis=-1), axis=-1) \ @@ -26,11 +26,11 @@ using namespace pto; // for i in range(0, W_prime): // tmp_w = w[f, :, :, :] // tmp_w = tmp_w[np.newaxis,:] -// tmp_w = np.repeat(tmp_w, N, axis=0) +// tmp_w = np.repeat(tmp_w, N, axis=0) // out[:, f, j, i] = np.sum(np.sum(np.sum(x_pad[:, :, j * stride:j * stride + HH, i * stride:i * stride + WW] * tmp_w, axis=-3), axis=-2), axis=-1) //pic [N, C, H, W], filter [F, C, HH, WW] -> out [N, F, H', W'] -template void conv_forward(dtype *out, dtype *pic, dtype *filter){ const int stride = 1; @@ -59,8 +59,8 @@ void conv_forward(dtype *out, dtype *pic, dtype *filter){ tile_filt tfilt; tile_filt tpic; - TCOPYIN(tfilt, gfilt); - TCOPYIN(tpic, gpic); + TLOAD(tfilt, gfilt); + TLOAD(tpic, gpic); TMUL(tpic, tpic, tfilt); TROWSUMEXPAND(tpic, tpic, tpic); TCOLSUMEXPAND(tpic, tpic, tpic); // sum all element @@ -68,7 +68,7 @@ void conv_forward(dtype *out, dtype *pic, dtype *filter){ } int offset = n*F*H_prime*W_prime + f*H_prime*W_prime + h*W_prime + w; gm_out gO(out+offset); - TCOPYOUT(gO, tmp); + TSTORE(gO, tmp); } } } diff --git a/kernels/other/flash_attention.hpp b/kernels/other/flash_attention.hpp index b4a5429..37e1bc0 100644 --- a/kernels/other/flash_attention.hpp +++ b/kernels/other/flash_attention.hpp @@ -16,7 +16,7 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) { using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -46,7 +46,7 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) { // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i,0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -61,8 +61,8 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) { // 加载K_j和V_j auto gK = gIterK(0,j); auto gV = gIterV(j,0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -132,7 +132,7 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) { TCAST(tO_cast, tO); // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } @@ -160,7 +160,7 @@ void __vec__ flashsoftmax_new_max( upd_max = blkv_max(upd_max, src_ptr[src_idx] * src_scale); } - new_max_ptr[max_idx] = upd_max; + new_max_ptr[max_idx] = upd_max; } template @@ -282,8 +282,8 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] - using tileW = Tile; - using tileW_left = TileLeft; + using tileW = Tile; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -313,7 +313,7 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i, 0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -328,8 +328,8 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -367,7 +367,7 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt TCAST(tO_cast, tO); // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } @@ -382,8 +382,8 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p using tileQ = TileLeft; // [kTm×qD] using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] - using tileW = Tile; - using tileW_left = TileLeft; + using tileW = Tile; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -413,7 +413,7 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i, 0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -428,8 +428,8 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW_out tW_out; @@ -467,7 +467,7 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p TCAST(tO_cast, tO); // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } @@ -513,7 +513,7 @@ void flash_attention_frac(float *out_ptr, for (int i = 0; i < Qb; ++i) { // 加载当前Q块 (仅一次) tileQ tQ; - TCOPYIN(tQ, gQ(i, 0)); + TLOAD(tQ, gQ(i, 0)); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -525,8 +525,8 @@ void flash_attention_frac(float *out_ptr, #pragma clang loop unroll(full) for (int j = 0; j < Kb; ++j) { // 加载K_j和V_j - tileK tK; TCOPYIN(tK, gK(0, j)); - tileV tV; TCOPYIN(tV, gV(j, 0)); + tileK tK; TLOAD(tK, gK(0, j)); + tileV tV; TLOAD(tV, gV(j, 0)); // 计算注意力分数块 tileS tS; @@ -588,7 +588,7 @@ void flash_attention_frac(float *out_ptr, // 写回全局内存 auto dstO = gO(i, 0); - TCOPYOUT(dstO, tO); + TSTORE(dstO, tO); } } @@ -622,7 +622,7 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr itK gIterK(k_ptr); itV gIterV(v_ptr); itO gIterO(out_ptr); - + const float scale = 1.0f / sqrt((float)qD); const int Qb = (S + kTm - 1) / kTm; const int Kb = (S + kTk - 1) / kTk; @@ -633,7 +633,7 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr // 加载当前Q块 (仅一次) tileQ tQ; auto gQ = gIterQ(i, 0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -646,8 +646,8 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr // 加载K_j和V_j auto gK = gIterK(0, j); auto gV = gIterV(j, 0); - tileK tK; TCOPYIN(tK, gK); - tileV tV; TCOPYIN(tV, gV); + tileK tK; TLOAD(tK, gK); + tileV tV; TLOAD(tV, gV); // 计算注意力分数块 tileW tW; @@ -706,7 +706,7 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr // 写回全局内存 auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO); + TSTORE(dstO, tO); } } @@ -730,10 +730,10 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k using tileQ = TileLeft; using tileK = TileRight; using tileV = TileRight; - + using tileW_out = TileAcc; - using tileW = Tile; - using tileW_left = TileLeft; + using tileW = Tile; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; @@ -758,7 +758,7 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k for (int i = 0; i < Qb; ++i) { tileQ tQ; auto gQ = gIterQ(i, 0); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); // 初始化状态 tileMax tMax; TEXPANDSCALAR(tMax, -1e30f); @@ -768,17 +768,17 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k // --- 定义双缓冲寄存器 --- tileK tK_0, tK_1; tileV tV_0, tV_1; - + tileW_out tW_out_0, tW_out_1; tileW tW_0, tW_1; - + tileW tExpW_0, tExpW_1; tileW_left tW_left_0, tW_left_1; - + tileO_out tO_out_0, tO_out_1; - tileO tO_tmp_0, tO_tmp_1; + tileO tO_tmp_0, tO_tmp_1; tileO tRescaleO_0, tRescaleO_1; - + // 状态更新的临时变量 tileMax tNewMax_0, tNewMax_1; tileSum tNewSum_0, tNewSum_1; @@ -793,15 +793,15 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k auto gK_1 = gIterK(0, j + 1); auto gV_1 = gIterV(j + 1, 0); - TCOPYIN(tK_0, gK_0); - TCOPYIN(tV_0, gV_0); - TCOPYIN(tK_1, gK_1); - TCOPYIN(tV_1, gV_1); + TLOAD(tK_0, gK_0); + TLOAD(tV_0, gV_0); + TLOAD(tK_1, gK_1); + TLOAD(tV_1, gV_1); // 2. MatMul QK Grouping // Doubled buffer MATMUL(tW_out_0, tQ, tK_0); - MATMUL(tW_out_1, tQ, tK_1); + MATMUL(tW_out_1, tQ, tK_1); // 3. Convert TCVT(tW_0, tW_out_0); @@ -813,11 +813,11 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k ); TMUL(tO, tO, tRescaleO_0); // Rescale O_old - + TCVT(tW_left_0, tExpW_0); MATMUL(tO_out_0, tW_left_0, tV_0); // Compute P0 * V0 TCVT(tO_tmp_0, tO_out_0); - + TADD(tO, tO, tO_tmp_0); // Accumulate O_new // Update State 0 -> 1 @@ -855,14 +855,14 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k Tile tO_cast; TCAST(tO_cast, tO); auto dstO = gIterO(i, 0); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } template __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ptr, int Sq, int Skv) { - + using gmQ = global_tensor>; // Q: [S×qD] using gmK = global_tensor>; // K: [qD×S] using gmV = global_tensor>; // V: [S×vD] @@ -872,7 +872,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* using tileK = TileRight; // [vD×kTk] using tileW_out = TileAcc; // [kTm×kTk] using tileW = Tile; - using tileW_left = TileLeft; + using tileW_left = TileLeft; using tileO_out = TileAcc; using tileO = Tile; // [kTm×vD] @@ -897,7 +897,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* gmQ gQ(q_ptr+offset_Q, Sq); tileQ tQ(dyn_m); - TCOPYIN(tQ, gQ); + TLOAD(tQ, gQ); tileMax tMax(-1e30f, dyn_m); tileSum tSum(0, dyn_m); @@ -906,7 +906,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* #pragma clang loop unroll(full) for (int j = 0; j < Kb; ++j) { - + int dyn_k = (j+1) * kTk > Skv ? rK:kTk; size_t offset_K = j * tileK::Cols * qD; @@ -914,8 +914,8 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* gmK gK(k_ptr+offset_K, Skv); gmV gV(v_ptr+offset_V, Skv); - tileK tK(dyn_k); TCOPYIN(tK, gK); - tileV tV(dyn_k); TCOPYIN(tV, gV); + tileK tK(dyn_k); TLOAD(tK, gK); + tileV tV(dyn_k); TLOAD(tV, gV); tileW_out tW_out(dyn_m, dyn_k); MATMUL(tW_out, tQ, tK); @@ -985,7 +985,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* size_t offset_O = i * tileO_cast::Rows * vD; gmO dstO(out_ptr+offset_O, Sq); - TCOPYOUT(dstO, tO_cast); + TSTORE(dstO, tO_cast); } } diff --git a/kernels/other/flash_attention_mask.hpp b/kernels/other/flash_attention_mask.hpp index d626385..a6b9083 100644 --- a/kernels/other/flash_attention_mask.hpp +++ b/kernels/other/flash_attention_mask.hpp @@ -89,7 +89,7 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, for (int i = 0; i < Qb; ++i) { // 加载当前Q块 (仅一次) tileQ tQ; - TCOPYIN(tQ, gQ(i, 0)); + TLOAD(tQ, gQ(i, 0)); // 初始化状态: 最大值/指数和/输出累加 tileMax tMax; @@ -103,9 +103,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, for (int j = 0; j < Kb; ++j) { // 加载K_j和V_j tileK tK; - TCOPYIN(tK, gK(0, j)); + TLOAD(tK, gK(0, j)); tileV tV; - TCOPYIN(tV, gV(j, 0)); + TLOAD(tV, gV(j, 0)); // 计算注意力分数块 tileW_out tW_out; @@ -169,9 +169,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, if constexpr (rK) { // 加载K_Kb 和V_Kb tileK_tcols tK_tcols; - TCOPYIN(tK_tcols, gK(0, Kb)); + TLOAD(tK_tcols, gK(0, Kb)); tileV_trows tV_trows; - TCOPYIN(tV_trows, gV(Kb, 0)); + TLOAD(tV_trows, gV(Kb, 0)); // 计算注意力分数块 tileW_out_tcols tW_out_tcols; @@ -237,10 +237,10 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, TEXPANDCOL(tInvSumExpanded, tInvSum); TMUL(tO, tO, tInvSumExpanded); - // 写回全局内存-------将第一步和第二部合并,即完成输出tile块的计算,并copyout到 + // 写回全局内存-------将第一步和第二部合并,即完成输出tile块的计算,并store到 // gO(i,0) auto dstO = gO(i, 0); - TCOPYOUT(dstO, tO); + TSTORE(dstO, tO); } // 最后的Q-block块(Qb) @@ -248,7 +248,7 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, // 加载当前Q块 (仅一次) tileQ_trows tQ_trows; - TCOPYIN(tQ_trows, gQ(Qb, 0)); + TLOAD(tQ_trows, gQ(Qb, 0)); // 初始化状态: 最大值/指数和/输出累加 tileMax_trows tMax_trows; @@ -263,9 +263,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, for (int j = 0; j < Kb; ++j) { // 加载K_j和V_j tileK tK; - TCOPYIN(tK, gK(0, j)); + TLOAD(tK, gK(0, j)); tileV tV; - TCOPYIN(tV, gV(j, 0)); + TLOAD(tV, gV(j, 0)); // 计算注意力分数块 tileW_out_trows tW_out_trows; @@ -329,9 +329,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, if constexpr (rK) { // 加载K_Kb 和V_Kb tileK_tcols tK_tcols; - TCOPYIN(tK_tcols, gK(0, Kb)); + TLOAD(tK_tcols, gK(0, Kb)); tileV_trows tV_trows; - TCOPYIN(tV_trows, gV(Kb, 0)); + TLOAD(tV_trows, gV(Kb, 0)); // 计算注意力分数块 tileW_out_tcorner tW_out_tcorner; @@ -397,10 +397,10 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr, TEXPANDCOL(tInvSumExpanded_trows, tInvSum_trows); TMUL(tO_trows, tO_trows, tInvSumExpanded_trows); - // 写回全局内存-----将第三步和第四步合并,即完成输出tile块的计算,并copyout到 + // 写回全局内存-----将第三步和第四步合并,即完成输出tile块的计算,并store到 // gO(Qb,0) auto dstO = gO(Qb, 0); - TCOPYOUT(dstO, tO_trows); + TSTORE(dstO, tO_trows); } } diff --git a/kernels/other/gemm.hpp b/kernels/other/gemm.hpp index 3520552..f5ddd63 100644 --- a/kernels/other/gemm.hpp +++ b/kernels/other/gemm.hpp @@ -5,28 +5,28 @@ using namespace pto; template void gemm(DataType *dst, DataType *src0, DataType *src1, DataType *src2) -{ +{ using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; using gm_shapeC = global_tensor>; using gm_shapeBias = global_tensor>; - + using tile_shapeA = TileLeft; using tile_shapeB = TileRight; using tile_shapeACC = TileAcc; using tile_shapeACC_RM = Tile; using tile_shapeBias = Tile; - + using gm_iteratorA = global_iterator; using gm_iteratorB = global_iterator; using gm_iteratorC = global_iterator; using gm_iteratorBias = global_iterator; - + gm_iteratorA gAIter(src0); gm_iteratorB gBIter(src1); gm_iteratorC gCIter(dst); gm_iteratorBias gBiasIter(src2); - + const int Mb = gM / tM; const int Nb = gN / tN; const int Kb = gK / tK; @@ -40,7 +40,7 @@ void gemm(DataType *dst, DataType *src0, DataType *src1, DataType *src2) tile_shapeB tB(0); tile_shapeACC tACC; MATMUL(tACC, tA, tB); - + #pragma clang loop unroll(full) for(int k = 0; k < Kb; k++) { @@ -48,28 +48,28 @@ void gemm(DataType *dst, DataType *src0, DataType *src1, DataType *src2) auto gB = gBIter(k, j); tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } tile_shapeACC_RM tACC_RM; TCVT(tACC_RM, tACC); - + if constexpr (Bias) { tile_shapeBias tBias; tile_shapeACC_RM tExpandBias; auto gBias = gBiasIter(0,j); - TCOPYIN(tBias, gBias); + TLOAD(tBias, gBias); TEXPANDROW(tExpandBias, tBias); TADD(tACC_RM, tACC_RM, tExpandBias); } - + if constexpr (Relu) { TMAXS(tACC_RM, tACC_RM, 0); } - - TCOPYOUT(gC, tACC_RM); + + TSTORE(gC, tACC_RM); } } } \ No newline at end of file diff --git a/kernels/other/linear.hpp b/kernels/other/linear.hpp index 36fa7bc..d53b9f0 100644 --- a/kernels/other/linear.hpp +++ b/kernels/other/linear.hpp @@ -19,9 +19,9 @@ void Identity(dtype *dst, dtype *src){ for(int i=0;i -void Linear(dtype *out, dtype *in, dtype *weight, dtype *bias){ +void Linear(dtype *out, dtype *in, dtype *weight, dtype *bias){ gemm(out, in, weight, bias); } // y = x1^T.A.x2 + b, where x1=1xd1, x2=1xd2, A=doxd1xd2 // in1[DimIn1, 1] in2 [DimIn2, 1] bias[DimOut, 1] weight [DimOut, DimIn1, DimIn2] template -void BiLinear(dtype *out, dtype *weight, +void BiLinear(dtype *out, dtype *weight, dtype *in1, dtype *in2, dtype *bias){ using gm_shapeA = global_tensor>; using gm_shapeB = global_tensor>; @@ -57,25 +57,25 @@ void BiLinear(dtype *out, dtype *weight, tile_shapeW tW; tile_shapeBT tmp; tile_shapeO tO; - TCOPYIN(tin1, gin1); - TCOPYIN(tin2, gin2); + TLOAD(tin1, gin1); + TLOAD(tin2, gin2); for (int i=0;i tout; Tile tbias; gm_shapeO gO(out); gm_shapeO gbias(bias); - TCOPYIN(tout, gO); - TCOPYIN(tbias, gbias); + TLOAD(tout, gO); + TLOAD(tbias, gbias); TADD(tout, tout, tbias); - TCOPYOUT(gO, tout); + TSTORE(gO, tout); } } diff --git a/kernels/other/matadd.hpp b/kernels/other/matadd.hpp index 7bb031d..6860bee 100644 --- a/kernels/other/matadd.hpp +++ b/kernels/other/matadd.hpp @@ -27,10 +27,10 @@ void matadd(float *c_ptr, float *a_ptr, float *b_ptr) { auto gC = gCIter(i, j); tile_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); TADD(tC, tA, tB); - TCOPYOUT(gC, tC); + TSTORE(gC, tC); } } } diff --git a/kernels/other/matmul.hpp b/kernels/other/matmul.hpp index 3d3cbe1..6af6183 100644 --- a/kernels/other/matmul.hpp +++ b/kernels/other/matmul.hpp @@ -10,19 +10,19 @@ using namespace pto; template -void TCOPYOUT_ACC(GmOut &Gout, TileAcc &tAcc){ +void TSTORE_ACC(GmOut &Gout, TileAcc &tAcc){ using TileAccOut = Tile; TileAccOut tAccOut; TCVT(tAccOut, tAcc); - TCOPYOUT(Gout, tAccOut); + TSTORE(Gout, tAccOut); } template -void TCOPYOUT_ACC_DYNAMIC(GmOut &Gout, TileAcc &tAcc, size_t valid_row, size_t valid_col){ +void TSTORE_ACC_DYNAMIC(GmOut &Gout, TileAcc &tAcc, size_t valid_row, size_t valid_col){ using TileAccOut = Tile; TileAccOut tAccOut(valid_row, valid_col); TCVT(tAccOut, tAcc); - TCOPYOUT(Gout, tAccOut); + TSTORE(Gout, tAccOut); } // A * B -> C with any shape @@ -74,16 +74,16 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeACC tACC; // tile_shapecast tcast; - + if constexpr(Kb>0){ auto gA = gAIter(i, 0); auto gB = gBIter(0, j); tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD(tA, gA); + TLOAD(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -92,8 +92,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } @@ -103,8 +103,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_trows tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { @@ -112,8 +112,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { } } // TCVT(tCast, tACC); - // TCOPYOUT(gC, tCast); - TCOPYOUT_ACC(gC, tACC); + // TSTORE(gC, tCast); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -125,9 +125,9 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD(tA, gA); + TLOAD(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -136,8 +136,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -146,15 +146,15 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_trows tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } if constexpr (rmd_M) { @@ -168,9 +168,9 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD(tA, gA); + TLOAD(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -179,8 +179,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -189,15 +189,15 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcorner tA; tile_shapeB_tcols tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -209,9 +209,9 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD(tA, gA); + TLOAD(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -220,8 +220,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -230,15 +230,15 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcorner tA; tile_shapeB_tcorner tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -275,8 +275,8 @@ void matmul_frac(float* dst, dtype* src0, dtype* src1){ tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) @@ -285,11 +285,11 @@ void matmul_frac(float* dst, dtype* src0, dtype* src1){ auto gB = gBIter(k,j); tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -406,7 +406,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ // #pragma clang loop unroll(full) // for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -460,7 +460,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ } auto gC = gIterC(i*R.m+ii,j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [m, rmd_N, k] @@ -471,7 +471,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ for(int k=0;k Kb); if constexpr(R.k < Kb){ - + for(int k=R.k;k0){ MATMACC(tACC, tA, tB); } else { @@ -510,7 +510,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ } auto gC = gIterC(i*R.m+ii,Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -518,7 +518,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ if constexpr(rM>0){ tile_shapeA tA[rM][R.k]; - + #pragma clang loop unroll(full) for(int i=0;i0){ MATMACC(tACC, tA, tB); } else { @@ -574,7 +574,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(i+dM*R.m,j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [rM, rmd_N, k] @@ -585,7 +585,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -622,7 +622,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(i+dM*R.m,Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -630,11 +630,11 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ // [rmd_M, n, k] if constexpr (rmd_M) { tile_shapeA_tcols tA[R.k]; - + #pragma clang loop unroll(full) for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -682,7 +682,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(Mb,j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [rmd_M, rmd_N, k] @@ -693,7 +693,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -730,7 +730,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(Mb,Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } }// Batch @@ -802,7 +802,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < R.k; k++) { auto gA = gIterA(row, k); - TCOPYIN(tA_phase0[k], gA); + TLOAD(tA_phase0[k], gA); } // --- N 主列 --- @@ -813,12 +813,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < R.k; k++) { tile_shapeB tB; auto gB = gIterB(k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_phase0[k], tB); else MATMACC(tACC, tA_phase0[k], tB); } auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // --- N 余列 (rmd_N) --- @@ -828,12 +828,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < R.k; k++) { tile_shapeB_trows tB; auto gB = gIterB(k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_phase0[k], tB); else MATMACC(tACC, tA_phase0[k], tB); } auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // Phase B-1: 剩余 K 轴 Full chunks (每块 MAX_TILE_NUM 个 k tile) @@ -847,7 +847,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < MAX_TILE_NUM; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA_chunk[k], gA); + TLOAD(tA_chunk[k], gA); } // --- N 主列 --- @@ -858,12 +858,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < MAX_TILE_NUM; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_chunk[k], tB); else MATMACC(tACC, tA_chunk[k], tB); } auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // --- N 余列 --- @@ -873,12 +873,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < MAX_TILE_NUM; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_chunk[k], tB); else MATMACC(tACC, tA_chunk[k], tB); } auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -891,7 +891,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < K2_rem; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA_tail[k], gA); + TLOAD(tA_tail[k], gA); } // --- N 主列 --- @@ -902,12 +902,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < K2_rem; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_tail[k], tB); else MATMACC(tACC, tA_tail[k], tB); } auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // --- N 余列 --- @@ -917,12 +917,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < K2_rem; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_tail[k], tB); else MATMACC(tACC, tA_tail[k], tB); } auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -930,7 +930,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ if constexpr (rmd_K) { tile_shapeA_trows tA_rmdK; auto gA = gIterA(row, Kb); - TCOPYIN(tA_rmdK, gA); + TLOAD(tA_rmdK, gA); // --- N 主列 --- #pragma clang loop unroll(full) @@ -938,11 +938,11 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ tile_shapeACC tACC; tile_shapeB_tcols tB; auto gB = gIterB(Kb, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB); else MATMUL (tACC, tA_rmdK, tB); auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // --- N 余列 --- @@ -950,11 +950,11 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ tile_shapeC_trows tACC; tile_shapeB_tcorner tB; auto gB = gIterB(Kb, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB); else MATMUL (tACC, tA_rmdK, tB); auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -974,7 +974,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < R.k; k++) { auto gA = gIterA(row, k); - TCOPYIN(tA_phase0[k], gA); + TLOAD(tA_phase0[k], gA); } #pragma clang loop unroll(full) @@ -984,12 +984,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < R.k; k++) { tile_shapeB tB; auto gB = gIterB(k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_phase0[k], tB); else MATMACC(tACC, tA_phase0[k], tB); } auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { @@ -998,12 +998,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < R.k; k++) { tile_shapeB_trows tB; auto gB = gIterB(k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_phase0[k], tB); else MATMACC(tACC, tA_phase0[k], tB); } auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // Phase B-1: Full chunks @@ -1016,7 +1016,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < MAX_TILE_NUM; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA_chunk[k], gA); + TLOAD(tA_chunk[k], gA); } #pragma clang loop unroll(full) @@ -1026,12 +1026,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < MAX_TILE_NUM; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_chunk[k], tB); else MATMACC(tACC, tA_chunk[k], tB); } auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { @@ -1040,12 +1040,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < MAX_TILE_NUM; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_chunk[k], tB); else MATMACC(tACC, tA_chunk[k], tB); } auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -1058,7 +1058,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < K2_rem; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA_tail[k], gA); + TLOAD(tA_tail[k], gA); } #pragma clang loop unroll(full) @@ -1068,12 +1068,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < K2_rem; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_tail[k], tB); else MATMACC(tACC, tA_tail[k], tB); } auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { @@ -1082,12 +1082,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < K2_rem; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_tail[k], tB); else MATMACC(tACC, tA_tail[k], tB); } auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -1095,29 +1095,29 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ if constexpr (rmd_K) { tile_shapeA_trows tA_rmdK; auto gA = gIterA(row, Kb); - TCOPYIN(tA_rmdK, gA); + TLOAD(tA_rmdK, gA); #pragma clang loop unroll(full) for (int j = 0; j < Nb; j++) { tile_shapeACC tACC; tile_shapeB_tcols tB; auto gB = gIterB(Kb, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB); else MATMUL (tACC, tA_rmdK, tB); auto gC = gIterC(row, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { tile_shapeC_trows tACC; tile_shapeB_tcorner tB; auto gB = gIterB(Kb, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB); else MATMUL (tACC, tA_rmdK, tB); auto gC = gIterC(row, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -1134,7 +1134,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < R.k; k++) { auto gA = gIterA(Mb, k); - TCOPYIN(tA_phase0[k], gA); + TLOAD(tA_phase0[k], gA); } #pragma clang loop unroll(full) @@ -1144,12 +1144,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < R.k; k++) { tile_shapeB tB; auto gB = gIterB(k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_phase0[k], tB); else MATMACC(tACC, tA_phase0[k], tB); } auto gC = gIterC(Mb, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { @@ -1158,12 +1158,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < R.k; k++) { tile_shapeB_trows tB; auto gB = gIterB(k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_phase0[k], tB); else MATMACC(tACC, tA_phase0[k], tB); } auto gC = gIterC(Mb, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // Phase B-1: Full chunks (rmd_M 行,A 类型为 tcols) @@ -1176,7 +1176,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < MAX_TILE_NUM; k++) { auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA_chunk[k], gA); + TLOAD(tA_chunk[k], gA); } #pragma clang loop unroll(full) @@ -1186,12 +1186,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < MAX_TILE_NUM; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_chunk[k], tB); else MATMACC(tACC, tA_chunk[k], tB); } auto gC = gIterC(Mb, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { @@ -1200,12 +1200,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < MAX_TILE_NUM; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_chunk[k], tB); else MATMACC(tACC, tA_chunk[k], tB); } auto gC = gIterC(Mb, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -1218,7 +1218,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < K2_rem; k++) { auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA_tail[k], gA); + TLOAD(tA_tail[k], gA); } #pragma clang loop unroll(full) @@ -1228,12 +1228,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < K2_rem; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_tail[k], tB); else MATMACC(tACC, tA_tail[k], tB); } auto gC = gIterC(Mb, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { @@ -1242,12 +1242,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < K2_rem; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA_tail[k], tB); else MATMACC(tACC, tA_tail[k], tB); } auto gC = gIterC(Mb, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -1255,29 +1255,29 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){ if constexpr (rmd_K) { tile_shapeA_tcorner tA_rmdK; auto gA = gIterA(Mb, Kb); - TCOPYIN(tA_rmdK, gA); + TLOAD(tA_rmdK, gA); #pragma clang loop unroll(full) for (int j = 0; j < Nb; j++) { tile_shapeC_tcols tACC; tile_shapeB_tcols tB; auto gB = gIterB(Kb, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB); else MATMUL (tACC, tA_rmdK, tB); auto gC = gIterC(Mb, j); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { tile_shapeC_tcorner tACC; tile_shapeB_tcorner tB; auto gB = gIterB(Kb, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB); else MATMUL (tACC, tA_rmdK, tB); auto gC = gIterC(Mb, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } // rmd_M @@ -1380,7 +1380,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA[k], gA); + TLOAD(tA[k], gA); } #pragma clang loop unroll(full) @@ -1390,7 +1390,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1403,7 +1403,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1417,7 +1417,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA[k], gA); + TLOAD(tA[k], gA); } #pragma clang loop unroll(full) @@ -1427,7 +1427,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1440,7 +1440,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1463,7 +1463,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA[k], gA); + TLOAD(tA[k], gA); } #pragma clang loop unroll(full) @@ -1473,7 +1473,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1488,7 +1488,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1503,7 +1503,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA[k], gA); + TLOAD(tA[k], gA); } #pragma clang loop unroll(full) @@ -1513,7 +1513,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1528,7 +1528,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1556,7 +1556,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gA = gIterA(row, k_base + k); - TCOPYIN(tA[k], gA); + TLOAD(tA[k], gA); } #pragma clang loop unroll(full) @@ -1566,7 +1566,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1585,7 +1585,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1604,7 +1604,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA[k], gA); + TLOAD(tA[k], gA); } #pragma clang loop unroll(full) @@ -1614,7 +1614,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB tB; auto gB = gIterB(k_base + k, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1633,7 +1633,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeB_trows tB; auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if (k == 0) MATMUL (tACC, tA[k], tB); else MATMACC(tACC, tA[k], tB); } @@ -1659,14 +1659,14 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ for (int row = 0; row < Mb; row++) { tile_shapeA_trows tA_rmdK; auto gA = gIterA(row, Kb); - TCOPYIN(tA_rmdK, gA); + TLOAD(tA_rmdK, gA); #pragma clang loop unroll(full) for (int j = 0; j < Nb; j++) { tile_shapeACC tACC; tile_shapeB_tcols tB; auto gB = gIterB(Kb, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); MATMUL(tACC, tA_rmdK, tB); if constexpr (is_first) { @@ -1682,7 +1682,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ tile_shapeACC_trows tACC; tile_shapeB_tcorner tB; auto gB = gIterB(Kb, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); MATMUL(tACC, tA_rmdK, tB); if constexpr (is_first) { @@ -1698,14 +1698,14 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ if constexpr (rmd_M) { tile_shapeA_tcorner tA_rmdK; auto gA = gIterA(Mb, Kb); - TCOPYIN(tA_rmdK, gA); + TLOAD(tA_rmdK, gA); #pragma clang loop unroll(full) for (int j = 0; j < Nb; j++) { tile_shapeACC_tcols tACC; tile_shapeB_tcols tB; auto gB = gIterB(Kb, j); - TCOPYIN(tB, gB); + TLOAD(tB, gB); MATMUL(tACC, tA_rmdK, tB); if constexpr (is_first) { @@ -1721,7 +1721,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ tile_shapeACC_tcorner tACC; tile_shapeB_tcorner tB; auto gB = gIterB(Kb, Nb); - TCOPYIN(tB, gB); + TLOAD(tB, gB); MATMUL(tACC, tA_rmdK, tB); if constexpr (is_first) { @@ -1745,15 +1745,15 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ tile_C_bf16 tC_b; // TMOV_NZ2DN(tC_b, tC_main[m][n]); auto gC = gIterC(m, n); - // TCOPYOUT(gC, tC_b); - TCOPYOUT(gC, tC_main[m][n]); + // TSTORE(gC, tC_b); + TSTORE(gC, tC_main[m][n]); } if constexpr (rmd_N) { tile_C_bf16_trows tC_b; // TMOV_NZ2DN(tC_b, tC_rcol[m]); auto gC = gIterC(m, Nb); - TCOPYOUT(gC, tC_rcol[m]); - // TCOPYOUT(gC, tC_b); + TSTORE(gC, tC_rcol[m]); + // TSTORE(gC, tC_b); } } if constexpr (rmd_M) { @@ -1762,15 +1762,15 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){ tile_C_bf16_tcols tC_b; // TMOV_NZ2DN(tC_b, tC_rrow[n]); auto gC = gIterC(Mb, n); - TCOPYOUT(gC, tC_rrow[n]); - // TCOPYOUT(gC, tC_b); + TSTORE(gC, tC_rrow[n]); + // TSTORE(gC, tC_b); } if constexpr (rmd_N) { tile_C_bf16_tcorner tC_b; // TMOV_NZ2DN(tC_b, tC_corner); auto gC = gIterC(Mb, Nb); - TCOPYOUT(gC, tC_corner); - // TCOPYOUT(gC, tC_b); + TSTORE(gC, tC_corner); + // TSTORE(gC, tC_b); } } @@ -1873,7 +1873,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gB = gIterB(k_base + k, col); - TCOPYIN(tB[k], gB); + TLOAD(tB[k], gB); } #pragma clang loop unroll(full) @@ -1883,7 +1883,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA tA; auto gA = gIterA(row, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -1896,7 +1896,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA_tcols tA; auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -1910,7 +1910,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB[k], gB); + TLOAD(tB[k], gB); } #pragma clang loop unroll(full) @@ -1920,7 +1920,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA tA; auto gA = gIterA(row, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -1933,7 +1933,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA_tcols tA; auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -1956,7 +1956,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gB = gIterB(k_base + k, col); - TCOPYIN(tB[k], gB); + TLOAD(tB[k], gB); } #pragma clang loop unroll(full) @@ -1966,7 +1966,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA tA; auto gA = gIterA(row, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -1981,7 +1981,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA_tcols tA; auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -1996,7 +1996,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB[k], gB); + TLOAD(tB[k], gB); } #pragma clang loop unroll(full) @@ -2006,7 +2006,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA tA; auto gA = gIterA(row, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -2021,7 +2021,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA_tcols tA; auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -2049,7 +2049,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gB = gIterB(k_base + k, col); - TCOPYIN(tB[k], gB); + TLOAD(tB[k], gB); } #pragma clang loop unroll(full) @@ -2059,7 +2059,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA tA; auto gA = gIterA(row, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -2078,7 +2078,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA_tcols tA; auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -2097,7 +2097,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for (int k = 0; k < LEN; k++) { auto gB = gIterB(k_base + k, Nb); - TCOPYIN(tB[k], gB); + TLOAD(tB[k], gB); } #pragma clang loop unroll(full) @@ -2107,7 +2107,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA tA; auto gA = gIterA(row, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -2126,7 +2126,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int k = 0; k < LEN; k++) { tile_shapeA_tcols tA; auto gA = gIterA(Mb, k_base + k); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if (k == 0) MATMUL (tACC, tA, tB[k]); else MATMACC(tACC, tA, tB[k]); } @@ -2152,14 +2152,14 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ for (int col = 0; col < Nb; col++) { tile_shapeB_tcols tB_rmdK; auto gB = gIterB(Kb, col); - TCOPYIN(tB_rmdK, gB); + TLOAD(tB_rmdK, gB); #pragma clang loop unroll(full) for (int row = 0; row < Mb; row++) { tile_shapeACC tACC; tile_shapeA_trows tA; auto gA = gIterA(row, Kb); - TCOPYIN(tA, gA); + TLOAD(tA, gA); MATMUL(tACC, tA, tB_rmdK); if constexpr (is_first) { @@ -2175,7 +2175,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ tile_shapeACC_tcols tACC; tile_shapeA_tcorner tA; auto gA = gIterA(Mb, Kb); - TCOPYIN(tA, gA); + TLOAD(tA, gA); MATMUL(tACC, tA, tB_rmdK); if constexpr (is_first) { @@ -2191,14 +2191,14 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ if constexpr (rmd_N) { tile_shapeB_tcorner tB_rmdK; auto gB = gIterB(Kb, Nb); - TCOPYIN(tB_rmdK, gB); + TLOAD(tB_rmdK, gB); #pragma clang loop unroll(full) for (int row = 0; row < Mb; row++) { tile_shapeACC_trows tACC; tile_shapeA_trows tA; auto gA = gIterA(row, Kb); - TCOPYIN(tA, gA); + TLOAD(tA, gA); MATMUL(tACC, tA, tB_rmdK); if constexpr (is_first) { @@ -2214,7 +2214,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ tile_shapeACC_tcorner tACC; tile_shapeA_tcorner tA; auto gA = gIterA(Mb, Kb); - TCOPYIN(tA, gA); + TLOAD(tA, gA); MATMUL(tACC, tA, tB_rmdK); if constexpr (is_first) { @@ -2238,15 +2238,15 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ tile_C_bf16 tC_b; // TMOV_NZ2DN(tC_b, tC_main[m][n]); auto gC = gIterC(m, n); - // TCOPYOUT(gC, tC_b); - TCOPYOUT(gC, tC_main[m][n]); + // TSTORE(gC, tC_b); + TSTORE(gC, tC_main[m][n]); } if constexpr (rmd_N) { tile_C_bf16_trows tC_b; // TMOV_NZ2DN(tC_b, tC_rcol[m]); auto gC = gIterC(m, Nb); - TCOPYOUT(gC, tC_rcol[m]); - // TCOPYOUT(gC, tC_b); + TSTORE(gC, tC_rcol[m]); + // TSTORE(gC, tC_b); } } if constexpr (rmd_M) { @@ -2255,15 +2255,15 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){ tile_C_bf16_tcols tC_b; // TMOV_NZ2DN(tC_b, tC_rrow[n]); auto gC = gIterC(Mb, n); - TCOPYOUT(gC, tC_rrow[n]); - // TCOPYOUT(gC, tC_b); + TSTORE(gC, tC_rrow[n]); + // TSTORE(gC, tC_b); } if constexpr (rmd_N) { tile_C_bf16_tcorner tC_b; // TMOV_NZ2DN(tC_b, tC_corner); auto gC = gIterC(Mb, Nb); - TCOPYOUT(gC, tC_corner); - // TCOPYOUT(gC, tC_b); + TSTORE(gC, tC_corner); + // TSTORE(gC, tC_b); } } @@ -2331,7 +2331,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ // #pragma clang loop unroll(full) // for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -2387,7 +2387,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ } auto gC = gIterC(j, i*R.n+ii); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [n, rmd_M, k] @@ -2398,7 +2398,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -2438,7 +2438,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ } auto gC = gIterC(Mb, i*R.n+ii); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -2447,7 +2447,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ // [rN, m, k] if constexpr(rN>0){ tile_shapeB tB[R.k][rN]; - + #pragma clang loop unroll(full) for(int i=0;i0){ MATMACC(tACC, tA, tB); } else { @@ -2508,7 +2508,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(j, i+dN*R.n); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [rN, rmd_M, k] @@ -2519,7 +2519,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -2560,7 +2560,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(Mb, i+dN*R.n); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -2568,11 +2568,11 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ // [rmd_N, m, k] if constexpr (rmd_N) { tile_shapeB_trows tB[R.k]; - + #pragma clang loop unroll(full) for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -2624,7 +2624,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(j, Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } // [rmd_N, rmd_M, k] @@ -2635,7 +2635,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ for(int k=0;k0){ MATMACC(tACC, tA, tB); } else { @@ -2676,7 +2676,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){ } } auto gC = gIterC(Mb,Nb); - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } @@ -2760,10 +2760,10 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){ #pragma clang loop unroll(full) for(int k=0;k tK ? tK : gK - k; tile_shapeA tA(dyn_m, dyn_k); tile_shapeB tB(dyn_k, dyn_n); - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if(k==0){ MATMUL(tACC, tA, tB); }else{ MATMACC(tACC, tA, tB); } } - TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); + TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); } } } @@ -3150,15 +3150,15 @@ __attribute__((noinline)) void matmul_dynamic(float* dst, dtype* src0, dtype* sr int dyn_k = (k+1) * tK > gK ? rem_k:tK; tile_shapeA tA(dyn_m, dyn_k); tile_shapeB tB(dyn_k, dyn_n); - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if(k==0){ MATMUL(tACC, tA, tB); }else{ MATMACC(tACC, tA, tB); } } - TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); + TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); } } } @@ -3201,7 +3201,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt int rem_k = gK % tK; ResA R = find_reuseA_dynamic(Mb, Kb, MAX_TILE_NUM); - + int dM = R.m == 0? 0 : Mb / R.m; int rM = R.m == 0? 0 : Mb % R.m; @@ -3223,7 +3223,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt tile_shapeA tA[m_step][R.k]; for (int mm=0;mm gM ){ tA[mm][kk]= tile_shapeA(rem_m, tK); }else{ @@ -3236,7 +3236,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt for(int k=0;k gM? rem_m:tM; @@ -3249,7 +3249,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt size_t offset_B = k * gN * tile_shapeB::Rows + j * tile_shapeB::Cols; gm_shapeB gB(src1 + offset_B, gK, gN); tile_shapeB tB(tK, dyn_n); - TCOPYIN(tB, gB); + TLOAD(tB, gB); if(k==0){ MATMUL(tACC, tA[ii][k], tB); }else{ @@ -3268,8 +3268,8 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt tile_shapeA tA(dyn_m, dyn_k); tile_shapeB tB(dyn_k, dyn_n); - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if(k==0){ MATMUL(tACC, tA, tB); }else{ @@ -3280,7 +3280,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt size_t offset_C = (i+ii) * gN * tile_shapeACC::Rows + j * tile_shapeACC::Cols; gm_shapeC gC(dst + offset_C, gM, gN); - TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); + TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); } } @@ -3313,7 +3313,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt R.n = Ra.m; R.k = Ra.k; R.val = Ra.val; - + int dN = R.n == 0? 0 : Nb / R.n; int rN = R.n == 0? 0 : Nb % R.n; @@ -3333,7 +3333,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt tile_shapeB tB[R.k][n_step]; for (int nn=0;nn gN ){ tB[kk][nn]= tile_shapeB(tK, rem_n); }else{ @@ -3346,7 +3346,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt for(int k=0;k gN? rem_n:tN; @@ -3359,7 +3359,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt size_t offset_A = j * gK * tile_shapeA::Rows + k * tile_shapeA::Cols; gm_shapeA gA(src0 + offset_A, gM, gK); tile_shapeA tA(dyn_m, tK); - TCOPYIN(tA, gA); + TLOAD(tA, gA); if(k==0){ MATMUL(tACC, tA, tB[k][ii]); }else{ @@ -3378,8 +3378,8 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt tile_shapeA tA(dyn_m, dyn_k); tile_shapeB tB(dyn_k, dyn_n); - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); if(k==0){ MATMUL(tACC, tA, tB); }else{ @@ -3390,7 +3390,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt size_t offset_C = j * gN * tile_shapeACC::Rows + (i+ii) * tile_shapeACC::Cols; gm_shapeC gC(dst + offset_C, gM, gN); - TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); + TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); } } @@ -3446,8 +3446,8 @@ void matmul_mx(float *dst, dtype *src0, dtype *src1, uint8_t *src0_mx, uint8_t * tile_shapeB tB; tile_shapeAMX tAMX; tile_shapeBMX tBMX; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); blk_tload(tAMX.GetValidCol(), tAMX.GetValidRow(), tile_shapeAMX::Cols, type_traits::TypeCode, @@ -3471,7 +3471,7 @@ void matmul_mx(float *dst, dtype *src0, dtype *src1, uint8_t *src0_mx, uint8_t * MATMACCMX(tACC, tA, tAMX, tB, tBMX); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -3520,16 +3520,16 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { auto gC = gCIter(i, j); tile_shapeACC tACC; - + if constexpr(Kb>0){ auto gA = gAIter(i, 0); auto gB = gBIter(0, j); tile_shapeA tA; tile_shapeB tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -3538,8 +3538,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA tA; tile_shapeB tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); MATMACC(tACC, tA, tB); } @@ -3549,15 +3549,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_trows tA; tile_shapeB_tcols tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(i, Nb); @@ -3569,9 +3569,9 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -3580,8 +3580,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA tA; tile_shapeB_trows tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -3590,15 +3590,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_trows tA; tile_shapeB_tcorner tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } if constexpr (rmd_M) { @@ -3612,9 +3612,9 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -3623,8 +3623,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -3633,15 +3633,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcorner tA; tile_shapeB_tcols tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } if constexpr (rmd_N) { auto gC = gCIter(Mb, Nb); @@ -3653,9 +3653,9 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); - MATMUL(tACC, tA, tB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); + MATMUL(tACC, tA, tB); } #pragma clang loop unroll(full) for (int k = 1; k < Kb; ++k) { @@ -3664,8 +3664,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcols tA; tile_shapeB_trows tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); MATMACC(tACC, tA, tB); } if constexpr (rmd_K) { @@ -3674,15 +3674,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) { tile_shapeA_tcorner tA; tile_shapeB_tcorner tB; - TCOPYIN_2LVL(tA, gA); - TCOPYIN_2LVL(tB, gB); + TLOAD_2LVL(tA, gA); + TLOAD_2LVL(tB, gB); if constexpr(Kb>0){ MATMACC(tACC, tA, tB); } else { MATMUL(tACC, tA, tB); } } - TCOPYOUT_ACC(gC, tACC); + TSTORE_ACC(gC, tACC); } } } @@ -3718,11 +3718,11 @@ void matmul_vec(float* dst, float* src0, float* src1){ auto gB = gBIter(k,j); tile_shapeA tA; tile_shapeB tB; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); + TLOAD(tA, gA); + TLOAD(tB, gB); MATMACC(tACC, tA, tB); } - TCOPYOUT(gC, tACC); + TSTORE(gC, tACC); } } } @@ -3745,10 +3745,10 @@ void matmul_tile_vec(float* dst, float* src0, float* src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } template @@ -3769,10 +3769,10 @@ void matmul_tile_frac(float* dst, float* src0, float* src1) { tile_shape_B d1; tile_shape_C d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); MATMUL(d2, d0, d1); - TCOPYOUT_ACC(res, d2); + TSTORE_ACC(res, d2); } diff --git a/kernels/other/matmul_dynamic_reuse.hpp b/kernels/other/matmul_dynamic_reuse.hpp index 1600964..7a1693f 100644 --- a/kernels/other/matmul_dynamic_reuse.hpp +++ b/kernels/other/matmul_dynamic_reuse.hpp @@ -19,7 +19,7 @@ for(int k=0;k gM? rem_m:tM; \ @@ -32,7 +32,7 @@ size_t offset_B = k * gN * tile_shapeB::Rows + j * tile_shapeB::Cols; \ gm_shapeB gB(src1 + offset_B, gK, gN); \ tile_shapeB tB(tK, dyn_n); \ - TCOPYIN(tB, gB); \ + TLOAD(tB, gB); \ if(k==0){ \ MATMUL(tACC, tA[ii][k], tB); \ }else{ \ @@ -51,8 +51,8 @@ tile_shapeA tA(dyn_m, dyn_k); \ tile_shapeB tB(dyn_k, dyn_n); \ \ - TCOPYIN(tA, gA); \ - TCOPYIN(tB, gB); \ + TLOAD(tA, gA); \ + TLOAD(tB, gB); \ if(k==0){ \ MATMUL(tACC, tA, tB); \ }else{ \ @@ -63,7 +63,7 @@ \ size_t offset_C = (i+ii) * gN * tile_shapeACC::Rows + j * tile_shapeACC::Cols; \ gm_shapeC gC(dst + offset_C, gM, gN); \ - TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \ + TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \ } \ } @@ -114,7 +114,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt for(int k=0;k gN? rem_n:tN; \ @@ -127,7 +127,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt size_t offset_A = j * gK * tile_shapeA::Rows + k * tile_shapeA::Cols; \ gm_shapeA gA(src0 + offset_A, gM, gK); \ tile_shapeA tA(dyn_m, tK); \ - TCOPYIN(tA, gA); \ + TLOAD(tA, gA); \ if(k==0){ \ MATMUL(tACC, tA, tB[k][ii]); \ }else{ \ @@ -146,8 +146,8 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt tile_shapeA tA(dyn_m, dyn_k); \ tile_shapeB tB(dyn_k, dyn_n); \ \ - TCOPYIN(tA, gA); \ - TCOPYIN(tB, gB); \ + TLOAD(tA, gA); \ + TLOAD(tB, gB); \ if(k==0){ \ MATMUL(tACC, tA, tB); \ }else{ \ @@ -158,7 +158,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt \ size_t offset_C = j * gN * tile_shapeACC::Rows + (i+ii) * tile_shapeACC::Cols; \ gm_shapeC gC(dst + offset_C, gM, gN); \ - TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \ + TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \ } \ } @@ -178,7 +178,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt int rem_m = gM % tM; int rem_n = gN % tN; int rem_k = gK % tK; - + for (int b=0;b>; using tile_shape = Tile; - + using tSum = Tile; - + using gIter = global_iterator; gIter giter_src(src); @@ -26,19 +26,19 @@ void rmsnorm(dtype *dst, dtype *src){ { auto gsrc = giter_src(i, j); tile_shape tsrc; - - TCOPYIN(tsrc, gsrc); + + TLOAD(tsrc, gsrc); tSum tLocalSum; TMUL(tsrc, tsrc, tsrc); TROWSUM(tLocalSum, tsrc); TADD(tAccSquareSum, tAccSquareSum, tLocalSum); } - + tSum gSqureMean; TDIVS(gSqureMean, tAccSquareSum, kN); TSQRT(gSqureMean, gSqureMean); - + tile_shape gSqureMean_i; TEXPANDCOL(gSqureMean_i, gSqureMean); @@ -46,12 +46,12 @@ void rmsnorm(dtype *dst, dtype *src){ { auto gsrc = giter_src(i,j); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); - + TLOAD(tsrc, gsrc); + TDIV(tsrc, tsrc, gSqureMean_i); - + auto gdst = giter_dst(i,j); - TCOPYOUT(gdst, tsrc); + TSTORE(gdst, tsrc); } } } @@ -62,9 +62,9 @@ void layernorm(dtype *dst, dtype *src) { using gm_shape = global_tensor>; using tile_shape = Tile; - + using tSum = Tile; - + using gIter = global_iterator; gIter giter_src(src); @@ -77,23 +77,23 @@ void layernorm(dtype *dst, dtype *src) { tSum tAccSum(0); // tiling sum tSum tAccSquareSum(0); // tiling square sum - + for(int j=0;j(sum_out); - } - new_sum_ptr[i] = upd_sum; + } + new_sum_ptr[i] = upd_sum; } @@ -66,35 +66,35 @@ void __vec__ cumsum_col_kernel( template void cumsum_col_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; - const int rmd_M = gIM % tM; - const int rmd_N = gIN % tN; + const int rmd_M = gIM % tM; + const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeSum = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeSum = Tile; // - using tile_shapeData_row = Tile; // - using tile_shapeData_cor = Tile; // + using tile_shapeData_row = Tile; // + using tile_shapeData_cor = Tile; // using tile_shapeSum_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; tile_shapeData OutTile; @@ -105,19 +105,19 @@ void cumsum_col_rand( tile_shapeData_row dataTile_row; tile_shapeData_row OutTile_row; - tile_shapeData_cor dataTile_cor; - tile_shapeData_cor OutTile_cor; - + tile_shapeData_cor dataTile_cor; + tile_shapeData_cor OutTile_cor; + tile_shapeSum_row SumTile_row; - tile_shapeSum_row oldSumTile_row; + tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; -// using itZero = global_iterator; + using itIn = global_iterator; +// using itZero = global_iterator; using itOut = global_iterator; -// using itSum = global_iterator; +// using itSum = global_iterator; itIn gIIter(in_ptr); itOut gOIter(out_ptr); @@ -128,44 +128,44 @@ void cumsum_col_rand( TEXPANDSCALAR(oldSumTile, 0);//初始化为0 for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, j); - auto gO = gOIter(i, j); - TCOPYIN(dataTile, gI); -// printf("in0 : %d, %d\n",in_ptr[i*tM], i*tM); + auto gO = gOIter(i, j); + TLOAD(dataTile, gI); +// printf("in0 : %d, %d\n",in_ptr[i*tM], i*tM); cumsum_col_kernel<<>>(SumTile.data(), OutTile.data(), dataTile.data(), oldSumTile.data()); oldSumTile = SumTile; - TCOPYOUT(gO, OutTile); -// printf("out0 : %d,%d\n", out_ptr[i*tM],i*tM); + TSTORE(gO, OutTile); +// printf("out0 : %d,%d\n", out_ptr[i*tM],i*tM); } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, j); auto gO = gOIter(Mb, j); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); cumsum_col_kernel<<>>(SumTile.data(), OutTile_col.data(), dataTile_col.data(), oldSumTile.data()); oldSumTile = SumTile; - TCOPYOUT(gO, OutTile_col); + TSTORE(gO, OutTile_col); } -// TCOPYOUT(gO, SumTile); +// TSTORE(gO, SumTile); } if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); // auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); auto gO = gOIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); cumsum_col_kernel<<>>(SumTile_row.data(), OutTile_row.data(), dataTile_row.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; - TCOPYOUT(gO, OutTile_row); + TSTORE(gO, OutTile_row); } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); auto gO = gOIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); cumsum_col_kernel<<>>(SumTile_row.data(), OutTile_cor.data(), dataTile_cor.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; - TCOPYOUT(gO, OutTile_cor); + TSTORE(gO, OutTile_cor); } } /* diff --git a/kernels/reduction/cumsum_rowvec.hpp b/kernels/reduction/cumsum_rowvec.hpp index ec998c1..eeedc93 100644 --- a/kernels/reduction/cumsum_rowvec.hpp +++ b/kernels/reduction/cumsum_rowvec.hpp @@ -16,42 +16,42 @@ using namespace pto; template void __vec__ cumsum_row_kernel( typename tileSum::TileDType __out__ new_sum, - const typename tileData::TileDType __out__ out, + const typename tileData::TileDType __out__ out, const typename tileData::TileDType __in__ src, - const typename tileSum::TileDType __in__ old_sum + const typename tileSum::TileDType __in__ old_sum ) { -// size_t i = blkv_get_index_x(); - size_t j = blkv_get_index_y(); - size_t sum_idx = j * tileSum::RowStride; +// size_t i = blkv_get_index_x(); + size_t j = blkv_get_index_y(); + size_t sum_idx = j * tileSum::RowStride; __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); - __vbuf__ typename tileData::DType *out_ptr = blkv_get_tile_ptr(out); + __vbuf__ typename tileData::DType *out_ptr = blkv_get_tile_ptr(out); __vbuf__ typename tileData::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); typename tileSum::DType upd_sum = old_sum_ptr[sum_idx]; -/* +/* for(size_t j=0;j(sum_out); - } - new_sum_ptr[sum_idx] = upd_sum; + upd_sum = sum_out; + out_ptr[idx] = static_cast(sum_out); + } + new_sum_ptr[sum_idx] = upd_sum; } @@ -60,105 +60,105 @@ template>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_row = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeSum = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 + using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_cor = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeSum_col = Tile; + using tile_shapeSum_col = Tile; - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); - tile_shapeData dataTile; + tile_shapeData dataTile; tile_shapeData_row dataTile_row; tile_shapeData_col dataTile_col; - tile_shapeData_cor dataTile_cor; + tile_shapeData_cor dataTile_cor; - tile_shapeData OutTile; + tile_shapeData OutTile; tile_shapeData_row OutTile_row; tile_shapeData_col OutTile_col; - tile_shapeData_cor OutTile_cor; - + tile_shapeData_cor OutTile_cor; + tile_shapeSum SumTile; tile_shapeSum oldSumTile; tile_shapeSum_col SumTile_col; - tile_shapeSum_col oldSumTile_col; + tile_shapeSum_col oldSumTile_col; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itOut gOIter(out_ptr); // printf("tile_shapeSum::ValidCol = %d\n", tile_shapeSum::ValidCol); -// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); +// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); for (int j = 0; j < Mb; ++j) { // auto gO = gOIter(j, 0); TEXPANDSCALAR(oldSumTile, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(j, i); - auto gO = gOIter(j, i); - TCOPYIN(dataTile, gI); + auto gI = gIIter(j, i); + auto gO = gOIter(j, i); + TLOAD(dataTile, gI); cumsum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), OutTile.data(), dataTile.data(), oldSumTile.data()); // reducesum_row_kernel<<>>(SumTile.data(), dataTile.data(), oldSumTile.data()); oldSumTile = SumTile; - TCOPYOUT(gO, OutTile); + TSTORE(gO, OutTile); } // printf("end for%d\n",j); //for row corner if constexpr (rmd_N > 0){ auto gI = gIIter(j, Nb); auto gO = gOIter(j, Nb); - TCOPYIN(dataTile_row, gI); - cumsum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), OutTile_row.data(), dataTile_row.data(), oldSumTile.data()); + TLOAD(dataTile_row, gI); + cumsum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), OutTile_row.data(), dataTile_row.data(), oldSumTile.data()); // reducesum_row_kernel<<>>(SumTile.data(), dataTile_row.data(), oldSumTile.data()); oldSumTile = SumTile; - TCOPYOUT(gO, OutTile_row); + TSTORE(gO, OutTile_row); } } //for col cor if constexpr (rmd_M > 0){ TEXPANDSCALAR(oldSumTile_col, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(Mb, i); - auto gO = gOIter(Mb, i); - TCOPYIN(dataTile_col, gI); + auto gI = gIIter(Mb, i); + auto gO = gOIter(Mb, i); + TLOAD(dataTile_col, gI); cumsum_row_kernel<<<1, tile_shapeSum_col::ValidRow, 1>>>(SumTile_col.data(), OutTile_col.data(), dataTile_col.data(), oldSumTile_col.data()); oldSumTile_col = SumTile_col; - TCOPYOUT(gO, OutTile_col); + TSTORE(gO, OutTile_col); } if constexpr (rmd_N > 0){ auto gI = gIIter(Mb, Nb); auto gO = gOIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); cumsum_row_kernel<<<1, tile_shapeSum_col::ValidRow, 1>>>(SumTile_col.data(), OutTile_cor.data(), dataTile_cor.data(), oldSumTile_col.data()); oldSumTile_col = SumTile_col; - TCOPYOUT(gO, OutTile_cor); - } + TSTORE(gO, OutTile_cor); + } } /* for(int i = 0; i < gIM; i++){ diff --git a/kernels/reduction/reducemax_colvec.hpp b/kernels/reduction/reducemax_colvec.hpp index 502db96..4a87fd8 100644 --- a/kernels/reduction/reducemax_colvec.hpp +++ b/kernels/reduction/reducemax_colvec.hpp @@ -19,30 +19,30 @@ template void __vec__ reducemax_col_kernel( typename tileMax::TileDType __out__ new_max, const typename tileSrc::TileDType __in__ src, - const typename tileMax::TileDType __in__ old_max + const typename tileMax::TileDType __in__ old_max ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); + __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); typename tileMax::DType upd_max = old_max_ptr[i]; -/* +/* for(size_t j=0;j void reducemax_col_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeMax = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeMax = Tile; // - using tile_shapeData_row = Tile; // - using tile_shapeData_cor = Tile; // - using tile_shapeMax_row = Tile; // + using tile_shapeData_row = Tile; // + using tile_shapeData_cor = Tile; // + using tile_shapeMax_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeMax MaxTile; tile_shapeMax oldMaxTile; tile_shapeData_row dataTile_row; - tile_shapeData_cor dataTile_cor; + tile_shapeData_cor dataTile_cor; tile_shapeMax_row MaxTile_row; - tile_shapeMax_row oldMaxTile_row; + tile_shapeMax_row oldMaxTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -130,41 +130,41 @@ void reducemax_col_rand( // auto gZero = gZeroIter(0, j); auto gO = gOIter(0, j); TEXPANDSCALAR(oldMaxTile, 0);//初始化为0 -// TCOPYIN(oldSumTile, gZero);//初始化为0 - //初始化old_sum的tile - //need +// TLOAD(oldSumTile, gZero);//初始化为0 + //初始化old_sum的tile + //need for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, j); - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); reducemax_col_kernel<<>>(MaxTile.data(), dataTile.data(), oldMaxTile.data()); oldMaxTile = MaxTile; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, j); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reducemax_col_kernel<<>>(MaxTile.data(), dataTile_col.data(), oldMaxTile.data()); oldMaxTile = MaxTile; } - TCOPYOUT(gO, MaxTile); + TSTORE(gO, MaxTile); } if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldMaxTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldMaxTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); reducemax_col_kernel<<>>(MaxTile_row.data(), dataTile_row.data(), oldMaxTile_row.data()); oldMaxTile_row = MaxTile_row; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducemax_col_kernel<<>>(MaxTile_row.data(), dataTile_cor.data(), oldMaxTile_row.data()); oldMaxTile_row = MaxTile_row; } - TCOPYOUT(gO, MaxTile_row); + TSTORE(gO, MaxTile_row); } } diff --git a/kernels/reduction/reducemax_colvec_single.hpp b/kernels/reduction/reducemax_colvec_single.hpp index 02806e6..7eea81e 100644 --- a/kernels/reduction/reducemax_colvec_single.hpp +++ b/kernels/reduction/reducemax_colvec_single.hpp @@ -19,30 +19,30 @@ template void __vec__ reducemax_col_kernel( typename tileMax::TileDType __out__ new_max, const typename tileSrc::TileDType __in__ src, - const typename tileMax::TileDType __in__ old_max + const typename tileMax::TileDType __in__ old_max ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); + __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); typename tileMax::DType upd_max = old_max_ptr[i]; -/* +/* for(size_t j=0;j void reducemax_col_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeMax = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeMax = Tile; // - using tile_shapeData_row = Tile; // - using tile_shapeData_cor = Tile; // - using tile_shapeMax_row = Tile; // + using tile_shapeData_row = Tile; // + using tile_shapeData_cor = Tile; // + using tile_shapeMax_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeMax MaxTile; tile_shapeMax oldMaxTile; tile_shapeData_row dataTile_row; - tile_shapeData_cor dataTile_cor; + tile_shapeData_cor dataTile_cor; tile_shapeMax_row MaxTile_row; - tile_shapeMax_row oldMaxTile_row; + tile_shapeMax_row oldMaxTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -138,42 +138,42 @@ void reducemax_col_rand( // auto gZero = gZeroIter(0, j); auto gO = gOIter(0, 0); TEXPANDSCALAR(oldMaxTile, 0);//初始化为0 -// TCOPYIN(oldSumTile, gZero);//初始化为0 - //初始化old_sum的tile - //need +// TLOAD(oldSumTile, gZero);//初始化为0 + //初始化old_sum的tile + //need for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, 0); - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); reducemax_col_kernel<<>>(MaxTile.data(), dataTile.data(), oldMaxTile.data()); oldMaxTile = MaxTile; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, 0); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reducemax_col_kernel<<>>(MaxTile.data(), dataTile_col.data(), oldMaxTile.data()); oldMaxTile = MaxTile; } - TCOPYOUT(gO, MaxTile); + TSTORE(gO, MaxTile); // } /* if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldMaxTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldMaxTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); reducemax_col_kernel<<>>(MaxTile_row.data(), dataTile_row.data(), oldMaxTile_row.data()); oldMaxTile_row = MaxTile_row; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducemax_col_kernel<<>>(MaxTile_row.data(), dataTile_cor.data(), oldMaxTile_row.data()); oldMaxTile_row = MaxTile_row; } - TCOPYOUT(gO, MaxTile_row); + TSTORE(gO, MaxTile_row); } */ } diff --git a/kernels/reduction/reducemax_colvec_single_8192.hpp b/kernels/reduction/reducemax_colvec_single_8192.hpp index 0688726..eb25d6c 100644 --- a/kernels/reduction/reducemax_colvec_single_8192.hpp +++ b/kernels/reduction/reducemax_colvec_single_8192.hpp @@ -20,39 +20,39 @@ void __vec__ reducemax_col_kernel( typename tileTmpMax::TileDType __out__ new_max, const typename tileSrc::TileDType __in__ src, const typename tileTmpMax::TileDType __in__ old_max, - const size_t tile_idx + const size_t tile_idx ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileTmpMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); + __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for(size_t j=0;j void reducemax_col_rand( - dtype *in_ptr, + dtype *in_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // + using gm_shapeIn = global_tensor>; // using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeMax = Tile; // - using tile_shapeTmpMax = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeMax = Tile; // + using tile_shapeTmpMax = Tile; // -// using tile_shapeData_row = Tile; // -// using tile_shapeData_cor = Tile; // -// using tile_shapeMax_row = Tile; // +// using tile_shapeData_row = Tile; // +// using tile_shapeData_cor = Tile; // +// using tile_shapeMax_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); - gm_shapeOut outGm(out_ptr); + gm_shapeIn inGm(in_ptr); + gm_shapeOut outGm(out_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeMax MaxTile; tile_shapeTmpMax oldtmpMaxTile; tile_shapeTmpMax tmpMaxTile; // tile_shapeTmpMax_l2 tmpMaxTile_l2; // tile_shapeData_row dataTile_row; -// tile_shapeData_cor dataTile_cor; +// tile_shapeData_cor dataTile_cor; // tile_shapeMax_row MaxTile_row; -// tile_shapeMax_row oldMaxTile_row; +// tile_shapeMax_row oldMaxTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 @@ -192,7 +192,7 @@ void reducemax_col_rand( using itIn = global_iterator; using itOut = global_iterator; - itIn gIIter(in_ptr); + itIn gIIter(in_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -202,19 +202,19 @@ void reducemax_col_rand( auto gO = gOIter(0, 0); TEXPANDSCALAR(oldtmpMaxTile, 0);//初始化为0 // TEXPANDSCALAR(tmpMaxTile, 0);//初始化为0 -// TEXPANDSCALAR(tmpMaxTile_l2, 0);//初始化为0 +// TEXPANDSCALAR(tmpMaxTile_l2, 0);//初始化为0 for (size_t i = 0; i < Mb; ++i){ auto gI = gIIter(i, 0); - TCOPYIN(dataTile, gI); - reducemax_col_kernel<<>>(tmpMaxTile.data(), + TLOAD(dataTile, gI); + reducemax_col_kernel<<>>(tmpMaxTile.data(), dataTile.data(), - oldtmpMaxTile.data(), + oldtmpMaxTile.data(), i); oldtmpMaxTile = tmpMaxTile; } - reducemax_col_final_kernel<<>>(MaxTile.data(), + reducemax_col_final_kernel<<>>(MaxTile.data(), tmpMaxTile.data()); - TCOPYOUT(gO, MaxTile); + TSTORE(gO, MaxTile); } diff --git a/kernels/reduction/reducemax_colvec_unalign_120_8.hpp b/kernels/reduction/reducemax_colvec_unalign_120_8.hpp index adab096..4aa0ab1 100644 --- a/kernels/reduction/reducemax_colvec_unalign_120_8.hpp +++ b/kernels/reduction/reducemax_colvec_unalign_120_8.hpp @@ -18,18 +18,18 @@ using namespace pto; template void __vec__ reducemax_col_tmp( typename tileTmp::TileDType __out__ tmp_max, - const typename tileSrc::TileDType __in__ src + const typename tileSrc::TileDType __in__ src ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileTmp::DType *tmp_max_ptr = blkv_get_tile_ptr(tmp_max); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); -// __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); +// __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); typename tileTmp::DType upd_tmp_max = 0; - - #pragma clang loop unroll(full) + + #pragma clang loop unroll(full) for(size_t j=0;j void __vec__ reducemax_col_final( typename tileMax::TileDType __out__ new_max, - const typename tileTmp::TileDType __in__ src, - const typename tileMax::TileDType __in__ old_max + const typename tileTmp::TileDType __in__ src, + const typename tileMax::TileDType __in__ old_max ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max); __vbuf__ typename tileTmp::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); + __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); typename tileMax::DType upd_max = old_max_ptr[i]; - + size_t src_idx_0 = i * tileMax::ColStride + 0 * tileMax::ValidCol; size_t src_idx_1 = i * tileMax::ColStride + 1 * tileMax::ValidCol; size_t src_idx_2 = i * tileMax::ColStride + 2 * tileMax::ValidCol; - size_t src_idx_3 = i * tileMax::ColStride + 3 * tileMax::ValidCol; + size_t src_idx_3 = i * tileMax::ColStride + 3 * tileMax::ValidCol; size_t src_idx_4 = i * tileMax::ColStride + 4 * tileMax::ValidCol; size_t src_idx_5 = i * tileMax::ColStride + 5 * tileMax::ValidCol; size_t src_idx_6 = i * tileMax::ColStride + 6 * tileMax::ValidCol; - size_t src_idx_7 = i * tileMax::ColStride + 7 * tileMax::ValidCol; - typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]); - typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]); - typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]); - typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]); - typename tileMax::DType max_0123 = blkv_max(max_01, max_23); - typename tileMax::DType max_4567 = blkv_max(max_45, max_67); - typename tileMax::DType max_all = blkv_max(max_0123, max_4567); - -// upd_max = upd_max + max_tmp; - - - new_max_ptr[i] = blkv_max(max_all, upd_max); + size_t src_idx_7 = i * tileMax::ColStride + 7 * tileMax::ValidCol; + typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]); + typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]); + typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]); + typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]); + typename tileMax::DType max_0123 = blkv_max(max_01, max_23); + typename tileMax::DType max_4567 = blkv_max(max_45, max_67); + typename tileMax::DType max_all = blkv_max(max_0123, max_4567); + +// upd_max = upd_max + max_tmp; + + + new_max_ptr[i] = blkv_max(max_all, upd_max); } @@ -98,70 +98,70 @@ void __vec__ reducemax_col_final( template void reducemax_col_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { -// const int Mb = (gIM/8) / tM; +// const int Mb = (gIM/8) / tM; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeMax = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeMax = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // // using tile_shapeData_col = Tile; // - using tile_shapeTmp = Tile; // - using tile_shapeMax = Tile; // + using tile_shapeTmp = Tile; // + using tile_shapeMax = Tile; // -// using tile_shapeData_row = Tile; // -// using tile_shapeData_cor = Tile; // -// using tile_shapeMax_row = Tile; // +// using tile_shapeData_row = Tile; // +// using tile_shapeData_cor = Tile; // +// using tile_shapeMax_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeMax olcMaxGm(old_max_ptr); +// gm_shapeMax olcMaxGm(old_max_ptr); tile_shapeData dataTile; // tile_shapeData_col dataTile_col; - tile_shapeTmp TmpTile; + tile_shapeTmp TmpTile; tile_shapeMax MaxTile; tile_shapeMax oldMaxTile; // tile_shapeData_row dataTile_row; -// tile_shapeData_cor dataTile_cor; +// tile_shapeData_cor dataTile_cor; // tile_shapeMax_row MaxTile_row; -// tile_shapeMax_row oldMaxTile_row; +// tile_shapeMax_row oldMaxTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); auto gO = gOIter(0, 0); TEXPANDSCALAR(oldMaxTile, 0);//初始化为0 auto gI = gIIter(0, 0); - TCOPYIN(dataTile, gI);//补0的TLOAD + TLOAD(dataTile, gI);//补0的TLOAD reducemax_col_tmp<<>>(TmpTile.data(), dataTile.data()); reducemax_col_final<<>>(MaxTile.data(), TmpTile.data(), oldMaxTile.data()); oldMaxTile = MaxTile; - TCOPYOUT(gO, MaxTile); + TSTORE(gO, MaxTile); } #endif diff --git a/kernels/reduction/reducemax_rowvec.hpp b/kernels/reduction/reducemax_rowvec.hpp index 6b95d70..57830c3 100644 --- a/kernels/reduction/reducemax_rowvec.hpp +++ b/kernels/reduction/reducemax_rowvec.hpp @@ -18,17 +18,17 @@ template void __vec__ reducemax_row_kernel( typename tileMax::TileDType __out__ new_max, const typename tileSrc::TileDType __in__ src, - const typename tileMax::TileDType __in__ old_max + const typename tileMax::TileDType __in__ old_max ) { -// size_t i = blkv_get_index_x(); - size_t j = blkv_get_index_x(); +// size_t i = blkv_get_index_x(); + size_t j = blkv_get_index_x(); // size_t j = blkv_get_index_y(); - size_t idx = j * tileMax::RowStride; + size_t idx = j * tileMax::RowStride; __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); + __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); typename tileMax::DType upd_max = old_max_ptr[idx]; @@ -39,34 +39,34 @@ void __vec__ reducemax_row_kernel( size_t src_idx0 = i * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx1 = (i+1) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx2 = (i+2) * tileSrc::ColStride + j * tileSrc::RowStride; - size_t src_idx3 = (i+3) * tileSrc::ColStride + j * tileSrc::RowStride; + size_t src_idx3 = (i+3) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx4 = (i+4) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx5 = (i+5) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx6 = (i+6) * tileSrc::ColStride + j * tileSrc::RowStride; - size_t src_idx7 = (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; + size_t src_idx7 = (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx0], src_ptr[src_idx1]); - typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx2], src_ptr[src_idx3]); - typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx4], src_ptr[src_idx5]); - typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx6], src_ptr[src_idx7]); + typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx2], src_ptr[src_idx3]); + typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx4], src_ptr[src_idx5]); + typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx6], src_ptr[src_idx7]); typename tileMax::DType max_0123 = blkv_max(max_01, max_23); - typename tileMax::DType max_4567 = blkv_max(max_45, max_67); + typename tileMax::DType max_4567 = blkv_max(max_45, max_67); typename tileMax::DType max_tmp = blkv_max(max_0123, max_4567); - upd_max = blkv_max(upd_max, max_tmp); - } + upd_max = blkv_max(upd_max, max_tmp); + } /* #pragma clang loop unroll(full) for(size_t i=0;i>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_row = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeMax = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 + using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_cor = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeMax_col = Tile; + using tile_shapeMax_col = Tile; - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); - tile_shapeData dataTile; + tile_shapeData dataTile; tile_shapeData_row dataTile_row; tile_shapeData_col dataTile_col; - tile_shapeData_cor dataTile_cor; - + tile_shapeData_cor dataTile_cor; + tile_shapeMax MaxTile; tile_shapeMax oldMaxTile; tile_shapeMax_col MaxTile_col; - tile_shapeMax_col oldMaxTile_col; + tile_shapeMax_col oldMaxTile_col; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itOut gOIter(out_ptr); // printf("tile_shapeSum::ValidCol = %d\n", tile_shapeSum::ValidCol); -// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); +// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); // printf("before for\n"); for (int j = 0; j < Mb; ++j) { auto gO = gOIter(j, 0); TEXPANDSCALAR(oldMaxTile, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(j, i); -// printf("before copy in , %d\n", i); - TCOPYIN(dataTile, gI); + auto gI = gIIter(j, i); +// printf("before copy in , %d\n", i); + TLOAD(dataTile, gI); reducemax_row_kernel<<>>(MaxTile.data(), dataTile.data(), oldMaxTile.data()); // reducesum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data()); // printf("kernel , %d\n", i); @@ -141,40 +141,40 @@ void reducemax_row_rand( //for row corner if constexpr (rmd_N > 0){ auto gI = gIIter(j, Nb); - TCOPYIN(dataTile_row, gI); - reducemax_row_kernel<<>>(MaxTile.data(), dataTile_row.data(), oldMaxTile.data()); + TLOAD(dataTile_row, gI); + reducemax_row_kernel<<>>(MaxTile.data(), dataTile_row.data(), oldMaxTile.data()); // reducesum_row_kernel<<>>(SumTile.data(), dataTile_row.data(), oldSumTile.data()); oldMaxTile = MaxTile; } -// printf("before tcopyout\n"); - TCOPYOUT(gO, MaxTile); -// printf("end tcopyout\n"); +// printf("before tstore\n"); + TSTORE(gO, MaxTile); +// printf("end tstore\n"); } //for col cor if constexpr (rmd_M > 0){ auto gO = gOIter(Mb, 0); TEXPANDSCALAR(oldMaxTile_col, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { auto gI = gIIter(Mb, i); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reducemax_row_kernel<<>>(MaxTile_col.data(), dataTile_col.data(), oldMaxTile_col.data()); oldMaxTile_col = MaxTile_col; } if constexpr (rmd_N > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducemax_row_kernel<<>>(MaxTile_col.data(), dataTile_cor.data(), oldMaxTile_col.data()); oldMaxTile_col = MaxTile_col; } - TCOPYOUT(gO, MaxTile_col); + TSTORE(gO, MaxTile_col); } /* for(int i = 0; i < gIM; i++){ printf("out%d = %d\n", i, out_ptr[i]); } */ -// printf("end program\n"); +// printf("end program\n"); } #endif diff --git a/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp b/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp index 8f066d2..18cde97 100644 --- a/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp +++ b/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp @@ -19,26 +19,26 @@ void __vec__ reducemax_row_kernel( const typename tileSrc::TileDType __in__ src, const typename tileSrcCol::TileDType __in__ src_col, const typename tileTmpMax::TileDType __in__ old_max, - const size_t tile_idx + const size_t tile_idx ) { - size_t j = blkv_get_index_x(); - size_t z = blkv_get_index_y(); + size_t j = blkv_get_index_x(); + size_t z = blkv_get_index_y(); size_t stride_src = z * (tileSrc::ValidCol/4) * tileSrc::ColStride; - size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride; - + size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride; + __vbuf__ typename tileTmpMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col); - __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); + __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col); + __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max); /* - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for(size_t i=0;i>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeMax = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeMax = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeDataCol = Tile; // todo 尾块怎么处理?是否要作为参数写在这 + using tile_shapeDataCol = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeMax = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec using tile_shapeTmpMax = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeMax olcMaxGm(old_max_ptr); +// gm_shapeMax olcMaxGm(old_max_ptr); - tile_shapeData dataTile; - tile_shapeDataCol dataTile_col; + tile_shapeData dataTile; + tile_shapeDataCol dataTile_col; tile_shapeMax MaxTile; tile_shapeTmpMax oldtmpMaxTile; tile_shapeTmpMax tmpMaxTile; @@ -204,7 +204,7 @@ void reducemax_row_rand( // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); @@ -212,21 +212,21 @@ void reducemax_row_rand( auto gO = gOIter(0, 0); - TEXPANDSCALAR(oldtmpMaxTile, 0);//初始化为0 - TEXPANDSCALAR(dataTile_col, 0);//初始化为0 + TEXPANDSCALAR(oldtmpMaxTile, 0);//初始化为0 + TEXPANDSCALAR(dataTile_col, 0);//初始化为0 for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(0, i); - TCOPYIN(dataTile, gI); - reducemax_row_kernel<<>>(tmpMaxTile.data(), - dataTile.data(), - dataTile_col.data(), + auto gI = gIIter(0, i); + TLOAD(dataTile, gI); + reducemax_row_kernel<<>>(tmpMaxTile.data(), + dataTile.data(), + dataTile_col.data(), oldtmpMaxTile.data(), i); oldtmpMaxTile = tmpMaxTile; } - reducemax_row_final_kernel<<>>(MaxTile.data(), - tmpMaxTile.data()); - TCOPYOUT(gO, MaxTile); + reducemax_row_final_kernel<<>>(MaxTile.data(), + tmpMaxTile.data()); + TSTORE(gO, MaxTile); } #endif diff --git a/kernels/reduction/reduceprod_colvec.hpp b/kernels/reduction/reduceprod_colvec.hpp index 15d12c1..15bab7d 100644 --- a/kernels/reduction/reduceprod_colvec.hpp +++ b/kernels/reduction/reduceprod_colvec.hpp @@ -19,14 +19,14 @@ template void __vec__ reduceprod_col_kernel( typename timeProd::TileDType __out__ new_prod, const typename tileSrc::TileDType __in__ src, - const typename timeProd::TileDType __in__ old_prod + const typename timeProd::TileDType __in__ old_prod ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename timeProd::DType *new_prod_ptr = blkv_get_tile_ptr(new_prod); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename timeProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod); + __vbuf__ typename timeProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod); typename timeProd::DType upd_prod = old_prod_ptr[i]; @@ -34,9 +34,9 @@ void __vec__ reduceprod_col_kernel( #pragma clang loop unroll(full) for(size_t j=0;j void reduceprod_col_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeProd = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeProd = Tile; // - using tile_shapeData_row = Tile; // - using tile_shapeData_cor = Tile; // - using tile_shapeProd_row = Tile; // + using tile_shapeData_row = Tile; // + using tile_shapeData_cor = Tile; // + using tile_shapeProd_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeProd ProdTile; tile_shapeProd oldProdTile; tile_shapeData_row dataTile_row; - tile_shapeData_cor dataTile_cor; + tile_shapeData_cor dataTile_cor; tile_shapeProd_row ProdTile_row; - tile_shapeProd_row oldProdTile_row; + tile_shapeProd_row oldProdTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -104,41 +104,41 @@ void reduceprod_col_rand( // auto gZero = gZeroIter(0, j); auto gO = gOIter(0, j); TEXPANDSCALAR(oldProdTile, 0);//初始化为0 -// TCOPYIN(oldSumTile, gZero);//初始化为0 - //初始化old_sum的tile - //need +// TLOAD(oldSumTile, gZero);//初始化为0 + //初始化old_sum的tile + //need for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, j); - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); reduceprod_col_kernel<<>>(ProdTile.data(), dataTile.data(), oldProdTile.data()); oldProdTile = ProdTile; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, j); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reduceprod_col_kernel<<>>(ProdTile.data(), dataTile_col.data(), oldProdTile.data()); oldProdTile = ProdTile; } - TCOPYOUT(gO, ProdTile); + TSTORE(gO, ProdTile); } if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldProdTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldProdTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); reduceprod_col_kernel<<>>(ProdTile_row.data(), dataTile_row.data(), oldProdTile_row.data()); oldProdTile_row = ProdTile_row; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reduceprod_col_kernel<<>>(ProdTile_row.data(), dataTile_cor.data(), oldProdTile_row.data()); oldProdTile_row = ProdTile_row; } - TCOPYOUT(gO, ProdTile_row); + TSTORE(gO, ProdTile_row); } } diff --git a/kernels/reduction/reduceprod_rowvec.hpp b/kernels/reduction/reduceprod_rowvec.hpp index 7494130..931b238 100644 --- a/kernels/reduction/reduceprod_rowvec.hpp +++ b/kernels/reduction/reduceprod_rowvec.hpp @@ -17,17 +17,17 @@ template void __vec__ reduceprod_row_kernel( typename tileProd::TileDType __out__ new_prod, const typename tileSrc::TileDType __in__ src, - const typename tileProd::TileDType __in__ old_prod + const typename tileProd::TileDType __in__ old_prod ) { -// size_t i = blkv_get_index_x(); - size_t j = blkv_get_index_x(); +// size_t i = blkv_get_index_x(); + size_t j = blkv_get_index_x(); // size_t j = blkv_get_index_y(); - size_t idx = j * tileProd::RowStride; + size_t idx = j * tileProd::RowStride; __vbuf__ typename tileProd::DType *new_prod_ptr = blkv_get_tile_ptr(new_prod); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod); + __vbuf__ typename tileProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod); typename tileProd::DType upd_prod = old_prod_ptr[idx]; @@ -35,9 +35,9 @@ void __vec__ reduceprod_row_kernel( #pragma clang loop unroll(full) for(size_t i=0;i>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_row = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeProd = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 + using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_cor = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeProd_col = Tile; + using tile_shapeProd_col = Tile; - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); - tile_shapeData dataTile; + tile_shapeData dataTile; tile_shapeData_row dataTile_row; tile_shapeData_col dataTile_col; - tile_shapeData_cor dataTile_cor; - + tile_shapeData_cor dataTile_cor; + tile_shapeProd ProdTile; tile_shapeProd oldProdTile; tile_shapeProd_col ProdTile_col; - tile_shapeProd_col oldProdTile_col; + tile_shapeProd_col oldProdTile_col; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itOut gOIter(out_ptr); // printf("tile_shapeSum::ValidCol = %d\n", tile_shapeSum::ValidCol); -// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); +// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); // printf("before for\n"); for (int j = 0; j < Mb; ++j) { auto gO = gOIter(j, 0); TEXPANDSCALAR(oldProdTile, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(j, i); -// printf("before copy in , %d\n", i); - TCOPYIN(dataTile, gI); + auto gI = gIIter(j, i); +// printf("before copy in , %d\n", i); + TLOAD(dataTile, gI); reduceprod_row_kernel<<>>(ProdTile.data(), dataTile.data(), oldProdTile.data()); // reducesum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data()); // printf("kernel , %d\n", i); @@ -112,40 +112,40 @@ void reduceprod_row_rand( //for row corner if constexpr (rmd_N > 0){ auto gI = gIIter(j, Nb); - TCOPYIN(dataTile_row, gI); - reduceprod_row_kernel<<>>(ProdTile.data(), dataTile_row.data(), oldProdTile.data()); + TLOAD(dataTile_row, gI); + reduceprod_row_kernel<<>>(ProdTile.data(), dataTile_row.data(), oldProdTile.data()); // reducesum_row_kernel<<>>(SumTile.data(), dataTile_row.data(), oldSumTile.data()); oldProdTile = ProdTile; } -// printf("before tcopyout\n"); - TCOPYOUT(gO, ProdTile); -// printf("end tcopyout\n"); +// printf("before tstore\n"); + TSTORE(gO, ProdTile); +// printf("end tstore\n"); } //for col cor if constexpr (rmd_M > 0){ auto gO = gOIter(Mb, 0); TEXPANDSCALAR(oldProdTile_col, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(Mb, i); - TCOPYIN(dataTile_col, gI); + auto gI = gIIter(Mb, i); + TLOAD(dataTile_col, gI); reduceprod_row_kernel<<>>(ProdTile_col.data(), dataTile_col.data(), oldProdTile_col.data()); oldProdTile_col = ProdTile_col; } if constexpr (rmd_N > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reduceprod_row_kernel<<>>(ProdTile_col.data(), dataTile_cor.data(), oldProdTile_col.data()); oldProdTile_col = ProdTile_col; } - TCOPYOUT(gO, ProdTile_col); + TSTORE(gO, ProdTile_col); } /* for(int i = 0; i < gIM; i++){ printf("out%d = %d\n", i, out_ptr[i]); } */ -// printf("end program\n"); +// printf("end program\n"); } #endif diff --git a/kernels/reduction/reducesum_colvec.hpp b/kernels/reduction/reducesum_colvec.hpp index 431d8d4..65742ed 100644 --- a/kernels/reduction/reducesum_colvec.hpp +++ b/kernels/reduction/reducesum_colvec.hpp @@ -19,20 +19,20 @@ template void __vec__ reducesum_col_kernel( typename tileSum::TileDType __out__ new_sum, const typename tileSrc::TileDType __in__ src, - const typename tileSum::TileDType __in__ old_sum + const typename tileSum::TileDType __in__ old_sum ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); // typename tileSum::DType upd_sum = old_sum_ptr[i]; typename tileSum::DType upd_sum = 0; - - #pragma clang loop unroll(full) + + #pragma clang loop unroll(full) for(size_t j=0;j void reducesum_colsum_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeSum = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeSum = Tile; // - using tile_shapeData_row = Tile; // - using tile_shapeData_cor = Tile; // - using tile_shapeSum_row = Tile; // + using tile_shapeData_row = Tile; // + using tile_shapeData_cor = Tile; // + using tile_shapeSum_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeSum SumTile; tile_shapeSum oldSumTile; tile_shapeData_row dataTile_row; - tile_shapeData_cor dataTile_cor; + tile_shapeData_cor dataTile_cor; tile_shapeSum_row SumTile_row; - tile_shapeSum_row oldSumTile_row; + tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -128,41 +128,41 @@ void reducesum_colsum_rand( // auto gZero = gZeroIter(0, j); auto gO = gOIter(0, j); TEXPANDSCALAR(oldSumTile, 0);//初始化为0 -// TCOPYIN(oldSumTile, gZero);//初始化为0 - //初始化old_sum的tile - //need +// TLOAD(oldSumTile, gZero);//初始化为0 + //初始化old_sum的tile + //need for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, j); - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); reducesum_col_kernel<<>>(SumTile.data(), dataTile.data(), oldSumTile.data()); oldSumTile = SumTile; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, j); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reducesum_col_kernel<<>>(SumTile.data(), dataTile_col.data(), oldSumTile.data()); oldSumTile = SumTile; } - TCOPYOUT(gO, SumTile); + TSTORE(gO, SumTile); } if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); reducesum_col_kernel<<>>(SumTile_row.data(), dataTile_row.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducesum_col_kernel<<>>(SumTile_row.data(), dataTile_cor.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; } - TCOPYOUT(gO, SumTile_row); + TSTORE(gO, SumTile_row); } } diff --git a/kernels/reduction/reducesum_colvec_single.hpp b/kernels/reduction/reducesum_colvec_single.hpp index f20e3ad..1070fe2 100644 --- a/kernels/reduction/reducesum_colvec_single.hpp +++ b/kernels/reduction/reducesum_colvec_single.hpp @@ -19,19 +19,19 @@ template void __vec__ reducesum_col_kernel( typename tileSum::TileDType __out__ new_sum, const typename tileSrc::TileDType __in__ src, - const typename tileSum::TileDType __in__ old_sum + const typename tileSum::TileDType __in__ old_sum ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); typename tileSum::DType upd_sum = old_sum_ptr[i]; - - #pragma clang loop unroll(full) + + #pragma clang loop unroll(full) for(size_t j=0;j void reducesum_colsum_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeSum = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeSum = Tile; // - using tile_shapeData_row = Tile; // - using tile_shapeData_cor = Tile; // - using tile_shapeSum_row = Tile; // + using tile_shapeData_row = Tile; // + using tile_shapeData_cor = Tile; // + using tile_shapeSum_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeSum SumTile; tile_shapeSum oldSumTile; tile_shapeData_row dataTile_row; - tile_shapeData_cor dataTile_cor; + tile_shapeData_cor dataTile_cor; tile_shapeSum_row SumTile_row; - tile_shapeSum_row oldSumTile_row; + tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -127,42 +127,42 @@ void reducesum_colsum_rand( // auto gZero = gZeroIter(0, j); auto gO = gOIter(0, 0); TEXPANDSCALAR(oldSumTile, 0);//初始化为0 -// TCOPYIN(oldSumTile, gZero);//初始化为0 - //初始化old_sum的tile - //need +// TLOAD(oldSumTile, gZero);//初始化为0 + //初始化old_sum的tile + //need for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, 0); - TCOPYIN(dataTile, gI); + TLOAD(dataTile, gI); reducesum_col_kernel<<>>(SumTile.data(), dataTile.data(), oldSumTile.data()); oldSumTile = SumTile; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, 0); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reducesum_col_kernel<<>>(SumTile.data(), dataTile_col.data(), oldSumTile.data()); oldSumTile = SumTile; } - TCOPYOUT(gO, SumTile); + TSTORE(gO, SumTile); // } /* if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); reducesum_col_kernel<<>>(SumTile_row.data(), dataTile_row.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducesum_col_kernel<<>>(SumTile_row.data(), dataTile_cor.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; } - TCOPYOUT(gO, SumTile_row); + TSTORE(gO, SumTile_row); } */ } diff --git a/kernels/reduction/reducesum_colvec_single_8192.hpp b/kernels/reduction/reducesum_colvec_single_8192.hpp index 266d1e5..055df83 100644 --- a/kernels/reduction/reducesum_colvec_single_8192.hpp +++ b/kernels/reduction/reducesum_colvec_single_8192.hpp @@ -20,39 +20,39 @@ void __vec__ reducesum_col_kernel( typename tileTmpSum::TileDType __out__ new_sum, const typename tileSrc::TileDType __in__ src, const typename tileTmpSum::TileDType __in__ old_sum, - const size_t tile_idx + const size_t tile_idx ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for(size_t j=0;j void reducesum_colsum_rand( - dtype *in_ptr, + dtype *in_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // + using gm_shapeIn = global_tensor>; // using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeSum = Tile; // - using tile_shapeTmpSum = Tile; // -// using tile_shapeTmpSum_l2 = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeSum = Tile; // + using tile_shapeTmpSum = Tile; // +// using tile_shapeTmpSum_l2 = Tile; // -// using tile_shapeData_row = Tile; // -// using tile_shapeData_cor = Tile; // -// using tile_shapeSum_row = Tile; // +// using tile_shapeData_row = Tile; // +// using tile_shapeData_cor = Tile; // +// using tile_shapeSum_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); - gm_shapeOut outGm(out_ptr); + gm_shapeIn inGm(in_ptr); + gm_shapeOut outGm(out_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeSum SumTile; tile_shapeTmpSum oldtmpSumTile; tile_shapeTmpSum tmpSumTile; // tile_shapeTmpSum_l2 tmpSumTile_l2; // tile_shapeData_row dataTile_row; -// tile_shapeData_cor dataTile_cor; +// tile_shapeData_cor dataTile_cor; // tile_shapeSum_row SumTile_row; -// tile_shapeSum_row oldSumTile_row; +// tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 @@ -193,7 +193,7 @@ void reducesum_colsum_rand( using itIn = global_iterator; using itOut = global_iterator; - itIn gIIter(in_ptr); + itIn gIIter(in_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -203,24 +203,24 @@ void reducesum_colsum_rand( auto gO = gOIter(0, 0); TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0 // TEXPANDSCALAR(tmpSumTile, 0);//初始化为0 -// TEXPANDSCALAR(tmpSumTile_l2, 0);//初始化为0 +// TEXPANDSCALAR(tmpSumTile_l2, 0);//初始化为0 for (size_t i = 0; i < Mb; ++i){ auto gI = gIIter(i, 0); - TCOPYIN(dataTile, gI); - reducesum_col_kernel<<>>(tmpSumTile.data(), + TLOAD(dataTile, gI); + reducesum_col_kernel<<>>(tmpSumTile.data(), dataTile.data(), - oldtmpSumTile.data(), + oldtmpSumTile.data(), i); oldtmpSumTile = tmpSumTile; } - reducesum_col_final_kernel<<>>(SumTile.data(), + reducesum_col_final_kernel<<>>(SumTile.data(), tmpSumTile.data()); - TCOPYOUT(gO, SumTile); + TSTORE(gO, SumTile); } /* - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, 0); - TCOPYIN(dataTile_col, gI); + TLOAD(dataTile_col, gI); reducesum_col_kernel<<>>(SumTile.data(), dataTile_col.data(), oldSumTile.data()); oldSumTile = SumTile; } @@ -229,23 +229,23 @@ void reducesum_colsum_rand( // } /* if constexpr (rmd_N > 0){ -// auto gZero = gZeroIter(0, Nb); +// auto gZero = gZeroIter(0, Nb); auto gO = gOIter(0, Nb); - TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 -// TCOPYIN(oldSumTile_row, gZero);//初始化为0 - for (int i = 0; i < Mb; ++i) { + TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0 +// TLOAD(oldSumTile_row, gZero);//初始化为0 + for (int i = 0; i < Mb; ++i) { auto gI = gIIter(i, Nb); - TCOPYIN(dataTile_row, gI); + TLOAD(dataTile_row, gI); reducesum_col_kernel<<>>(SumTile_row.data(), dataTile_row.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; } - if constexpr (rmd_M > 0){ + if constexpr (rmd_M > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducesum_col_kernel<<>>(SumTile_row.data(), dataTile_cor.data(), oldSumTile_row.data()); oldSumTile_row = SumTile_row; } - TCOPYOUT(gO, SumTile_row); + TSTORE(gO, SumTile_row); } */ diff --git a/kernels/reduction/reducesum_colvec_single_tree.hpp b/kernels/reduction/reducesum_colvec_single_tree.hpp index f24fb56..462116d 100644 --- a/kernels/reduction/reducesum_colvec_single_tree.hpp +++ b/kernels/reduction/reducesum_colvec_single_tree.hpp @@ -21,40 +21,40 @@ void __vec__ reducesum_col_kernel( typename tileTmpSum::TileDType __out__ new_sum, const typename tileSrc::TileDType __in__ src, const typename tileTmpSum::TileDType __in__ old_sum, - const size_t tile_idx + const size_t tile_idx ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for(size_t j=0;j void reducesum_colsum_rand( - dtype *in_ptr, + dtype *in_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // + using gm_shapeIn = global_tensor>; // using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // - using tile_shapeData_col = Tile; // - using tile_shapeSum = Tile; // - using tile_shapeTmpSum = Tile; // -// using tile_shapeTmpSum_l2 = Tile; // + using tile_shapeData_col = Tile; // + using tile_shapeSum = Tile; // + using tile_shapeTmpSum = Tile; // +// using tile_shapeTmpSum_l2 = Tile; // -// using tile_shapeData_row = Tile; // -// using tile_shapeData_cor = Tile; // -// using tile_shapeSum_row = Tile; // +// using tile_shapeData_row = Tile; // +// using tile_shapeData_cor = Tile; // +// using tile_shapeSum_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); - gm_shapeOut outGm(out_ptr); + gm_shapeIn inGm(in_ptr); + gm_shapeOut outGm(out_ptr); tile_shapeData dataTile; - tile_shapeData_col dataTile_col; + tile_shapeData_col dataTile_col; tile_shapeSum SumTile; tile_shapeTmpSum oldtmpSumTile; tile_shapeTmpSum tmpSumTile; // tile_shapeTmpSum_l2 tmpSumTile_l2; // tile_shapeData_row dataTile_row; -// tile_shapeData_cor dataTile_cor; +// tile_shapeData_cor dataTile_cor; // tile_shapeSum_row SumTile_row; -// tile_shapeSum_row oldSumTile_row; +// tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 @@ -198,7 +198,7 @@ void reducesum_colsum_rand( using itIn = global_iterator; using itOut = global_iterator; - itIn gIIter(in_ptr); + itIn gIIter(in_ptr); itOut gOIter(out_ptr); // dtype zero = 0; @@ -208,19 +208,19 @@ void reducesum_colsum_rand( auto gO = gOIter(0, 0); TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0 // TEXPANDSCALAR(tmpSumTile, 0);//初始化为0 -// TEXPANDSCALAR(tmpSumTile_l2, 0);//初始化为0 +// TEXPANDSCALAR(tmpSumTile_l2, 0);//初始化为0 for (size_t i = 0; i < Mb; ++i){ auto gI = gIIter(i, 0); - TCOPYIN(dataTile, gI); - reducesum_col_kernel<<>>(tmpSumTile.data(), + TLOAD(dataTile, gI); + reducesum_col_kernel<<>>(tmpSumTile.data(), dataTile.data(), - oldtmpSumTile.data(), + oldtmpSumTile.data(), i); oldtmpSumTile = tmpSumTile; } - reducesum_col_final_kernel<<>>(SumTile.data(), + reducesum_col_final_kernel<<>>(SumTile.data(), tmpSumTile.data()); - TCOPYOUT(gO, SumTile); + TSTORE(gO, SumTile); } diff --git a/kernels/reduction/reducesum_colvec_unalign_120_8.hpp b/kernels/reduction/reducesum_colvec_unalign_120_8.hpp index 45cb8f8..39ba724 100644 --- a/kernels/reduction/reducesum_colvec_unalign_120_8.hpp +++ b/kernels/reduction/reducesum_colvec_unalign_120_8.hpp @@ -18,18 +18,18 @@ using namespace pto; template void __vec__ reducesum_col_tmp( typename tileTmp::TileDType __out__ tmp_sum, - const typename tileSrc::TileDType __in__ src + const typename tileSrc::TileDType __in__ src ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileTmp::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); -// __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); +// __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); typename tileTmp::DType upd_tmp_sum = 0; - - #pragma clang loop unroll(full) + + #pragma clang loop unroll(full) for(size_t j=0;j void __vec__ reducesum_col_final( typename tileSum::TileDType __out__ new_sum, - const typename tileTmp::TileDType __in__ src, - const typename tileSum::TileDType __in__ old_sum + const typename tileTmp::TileDType __in__ src, + const typename tileSum::TileDType __in__ old_sum ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileTmp::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); typename tileSum::DType upd_sum = old_sum_ptr[i]; - + size_t src_idx_0 = i * tileSum::ColStride + 0 * tileSum::ValidCol; size_t src_idx_1 = i * tileSum::ColStride + 1 * tileSum::ValidCol; size_t src_idx_2 = i * tileSum::ColStride + 2 * tileSum::ValidCol; - size_t src_idx_3 = i * tileSum::ColStride + 3 * tileSum::ValidCol; + size_t src_idx_3 = i * tileSum::ColStride + 3 * tileSum::ValidCol; size_t src_idx_4 = i * tileSum::ColStride + 4 * tileSum::ValidCol; size_t src_idx_5 = i * tileSum::ColStride + 5 * tileSum::ValidCol; size_t src_idx_6 = i * tileSum::ColStride + 6 * tileSum::ValidCol; - size_t src_idx_7 = i * tileSum::ColStride + 7 * tileSum::ValidCol; - typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1]; - typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3]; - typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5]; - typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7]; - typename tileSum::DType sum_0123 = sum_01 + sum_23; - typename tileSum::DType sum_4567 = sum_45 + sum_67; - typename tileSum::DType sum_all = sum_0123 + sum_4567; - -// upd_sum = upd_sum + sum_tmp; + size_t src_idx_7 = i * tileSum::ColStride + 7 * tileSum::ValidCol; + typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1]; + typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3]; + typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5]; + typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7]; + typename tileSum::DType sum_0123 = sum_01 + sum_23; + typename tileSum::DType sum_4567 = sum_45 + sum_67; + typename tileSum::DType sum_all = sum_0123 + sum_4567; + +// upd_sum = upd_sum + sum_tmp; /* #pragma clang loop unroll(full) for(size_t j=0;j void reducesum_colsum_rand( dtype *in_ptr, -// dtype *inzero_ptr, +// dtype *inzero_ptr, dtype *out_ptr -) +) { -// const int Mb = (gIM/8) / tM; +// const int Mb = (gIM/8) / tM; const int rmd_M = gIM % tM; const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; // +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // // using tile_shapeData_col = Tile; // - using tile_shapeTmp = Tile; // - using tile_shapeSum = Tile; // + using tile_shapeTmp = Tile; // + using tile_shapeSum = Tile; // -// using tile_shapeData_row = Tile; // -// using tile_shapeData_cor = Tile; // -// using tile_shapeSum_row = Tile; // +// using tile_shapeData_row = Tile; // +// using tile_shapeData_cor = Tile; // +// using tile_shapeSum_row = Tile; // //need tM = 1; - gm_shapeIn inGm(in_ptr); -// gm_shapeOut ZeroGm(inzero_ptr); + gm_shapeIn inGm(in_ptr); +// gm_shapeOut ZeroGm(inzero_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); tile_shapeData dataTile; // tile_shapeData_col dataTile_col; - tile_shapeTmp TmpTile; + tile_shapeTmp TmpTile; tile_shapeSum SumTile; tile_shapeSum oldSumTile; // tile_shapeData_row dataTile_row; -// tile_shapeData_cor dataTile_cor; +// tile_shapeData_cor dataTile_cor; // tile_shapeSum_row SumTile_row; -// tile_shapeSum_row oldSumTile_row; +// tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itIn_row = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itIn_row gIIter_rmd_row(in_ptr); -// itZero gZeroIter(inzero_ptr); +// itZero gZeroIter(inzero_ptr); itOut gOIter(out_ptr); auto gO = gOIter(0, 0); TEXPANDSCALAR(oldSumTile, 0);//初始化为0 auto gI = gIIter(0, 0); - TCOPYIN(dataTile, gI);//TLOAD应补0,目前gfrun默认补0,需要接口去弄 + TLOAD(dataTile, gI);//TLOAD应补0,目前gfrun默认补0,需要接口去弄 reducesum_col_tmp<<>>(TmpTile.data(), dataTile.data()); reducesum_col_final<<>>(SumTile.data(), TmpTile.data(), oldSumTile.data()); oldSumTile = SumTile; - TCOPYOUT(gO, SumTile); + TSTORE(gO, SumTile); } #endif diff --git a/kernels/reduction/reducesum_colvec_unalign_tree.hpp b/kernels/reduction/reducesum_colvec_unalign_tree.hpp index 5e603d6..d2c13d9 100644 --- a/kernels/reduction/reducesum_colvec_unalign_tree.hpp +++ b/kernels/reduction/reducesum_colvec_unalign_tree.hpp @@ -21,38 +21,38 @@ void __vec__ reducesum_col_kernel( const typename tileSrc::TileDType __in__ src ) { - size_t i = blkv_get_index_x(); + size_t i = blkv_get_index_x(); __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); -// __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); +// __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); /* - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for(size_t j=0;j void reducesum_colsum_rand( - dtype *in_ptr, + dtype *in_ptr, dtype *out_ptr -) +) { const int Mb = gIM / tM; - const int Nb = gIN / tN; + const int Nb = gIN / tN; // const int rmd_M = gIM % tM; // const int rmd_N = gIN % tN; // const int rmd_M = gOM % tM; // todo 尾块怎么处理? - using gm_shapeIn = global_tensor>; // + using gm_shapeIn = global_tensor>; // using gm_shapeOut = global_tensor>; - using tile_shapeData = Tile; // - using tile_shapeSum = Tile; // + using tile_shapeData = Tile; // + using tile_shapeSum = Tile; // - gm_shapeIn inGm(in_ptr); - gm_shapeOut outGm(out_ptr); + gm_shapeIn inGm(in_ptr); + gm_shapeOut outGm(out_ptr); tile_shapeData dataTile; -// tile_shapeData_col dataTile_col; +// tile_shapeData_col dataTile_col; tile_shapeSum SumTile; // tile_shapeTmpSum oldtmpSumTile; // tile_shapeTmpSum tmpSumTile; // tile_shapeTmpSum_l2 tmpSumTile_l2; // tile_shapeData_row dataTile_row; -// tile_shapeData_cor dataTile_cor; +// tile_shapeData_cor dataTile_cor; // tile_shapeSum_row SumTile_row; -// tile_shapeSum_row oldSumTile_row; +// tile_shapeSum_row oldSumTile_row; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 @@ -185,21 +185,21 @@ void reducesum_colsum_rand( using itIn = global_iterator; using itOut = global_iterator; - itIn gIIter(in_ptr); + itIn gIIter(in_ptr); itOut gOIter(out_ptr); // dtype zero = 0; // for (int j = 0; j < Nb; ++j) { // auto gZero = gZeroIter(0, j); - auto gO = gOIter(0, 0); + auto gO = gOIter(0, 0); // for (size_t i = 0; i < Mb; ++i){ auto gI = gIIter(0, 0); TLOAD_PAD_ZERO(dataTile, gI); - reducesum_col_kernel<<>>(SumTile.data(), + reducesum_col_kernel<<>>(SumTile.data(), dataTile.data() ); - TCOPYOUT(gO, SumTile); + TSTORE(gO, SumTile); } diff --git a/kernels/reduction/reducesum_rowvec.hpp b/kernels/reduction/reducesum_rowvec.hpp index be8b749..ea6abb4 100644 --- a/kernels/reduction/reducesum_rowvec.hpp +++ b/kernels/reduction/reducesum_rowvec.hpp @@ -17,17 +17,17 @@ template void __vec__ reducesum_row_kernel( typename tileSum::TileDType __out__ new_sum, const typename tileSrc::TileDType __in__ src, - const typename tileSum::TileDType __in__ old_sum + const typename tileSum::TileDType __in__ old_sum ) { -// size_t i = blkv_get_index_x(); - size_t j = blkv_get_index_x(); +// size_t i = blkv_get_index_x(); + size_t j = blkv_get_index_x(); // size_t j = blkv_get_index_y(); - size_t idx = j * tileSum::RowStride; + size_t idx = j * tileSum::RowStride; __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); typename tileSum::DType upd_sum = old_sum_ptr[idx]; @@ -37,32 +37,32 @@ void __vec__ reducesum_row_kernel( size_t src_idx0 = i * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx1 = (i+1) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx2 = (i+2) * tileSrc::ColStride + j * tileSrc::RowStride; - size_t src_idx3 = (i+3) * tileSrc::ColStride + j * tileSrc::RowStride; + size_t src_idx3 = (i+3) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx4 = (i+4) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx5 = (i+5) * tileSrc::ColStride + j * tileSrc::RowStride; size_t src_idx6 = (i+6) * tileSrc::ColStride + j * tileSrc::RowStride; - size_t src_idx7 = (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; + size_t src_idx7 = (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; typename tileSum::DType sum_01 = src_ptr[src_idx0] + src_ptr[src_idx1]; - typename tileSum::DType sum_23 = src_ptr[src_idx2] + src_ptr[src_idx3]; - typename tileSum::DType sum_45 = src_ptr[src_idx4] + src_ptr[src_idx5]; - typename tileSum::DType sum_67 = src_ptr[src_idx6] + src_ptr[src_idx7]; + typename tileSum::DType sum_23 = src_ptr[src_idx2] + src_ptr[src_idx3]; + typename tileSum::DType sum_45 = src_ptr[src_idx4] + src_ptr[src_idx5]; + typename tileSum::DType sum_67 = src_ptr[src_idx6] + src_ptr[src_idx7]; typename tileSum::DType sum_0123 = sum_01 + sum_23; - typename tileSum::DType sum_4567 = sum_45 + sum_67; + typename tileSum::DType sum_4567 = sum_45 + sum_67; typename tileSum::DType sum_tmp = sum_0123 + sum_4567; - upd_sum = upd_sum + sum_tmp; - } + upd_sum = upd_sum + sum_tmp; + } -/* +/* for(size_t i=0;i>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_row = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeSum = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 + using tile_shapeData_col = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeData_cor = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeSum_col = Tile; + using tile_shapeSum_col = Tile; - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); - tile_shapeData dataTile; + tile_shapeData dataTile; tile_shapeData_row dataTile_row; tile_shapeData_col dataTile_col; - tile_shapeData_cor dataTile_cor; - + tile_shapeData_cor dataTile_cor; + tile_shapeSum SumTile; tile_shapeSum oldSumTile; tile_shapeSum_col SumTile_col; - tile_shapeSum_col oldSumTile_col; + tile_shapeSum_col oldSumTile_col; // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); itOut gOIter(out_ptr); // printf("tile_shapeSum::ValidCol = %d\n", tile_shapeSum::ValidCol); -// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); +// printf("tile_shapeSum::ValidRow = %d\n", tile_shapeSum::ValidRow); // printf("before for\n"); for (int j = 0; j < Mb; ++j) { auto gO = gOIter(j, 0); TEXPANDSCALAR(oldSumTile, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(j, i); -// printf("before copy in , %d\n", i); - TCOPYIN(dataTile, gI); + auto gI = gIIter(j, i); +// printf("before copy in , %d\n", i); + TLOAD(dataTile, gI); reducesum_row_kernel<<>>(SumTile.data(), dataTile.data(), oldSumTile.data()); // reducesum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data()); // printf("kernel , %d\n", i); @@ -138,40 +138,40 @@ void reducesum_trowsum_rand( //for row corner if constexpr (rmd_N > 0){ auto gI = gIIter(j, Nb); - TCOPYIN(dataTile_row, gI); - reducesum_row_kernel<<>>(SumTile.data(), dataTile_row.data(), oldSumTile.data()); + TLOAD(dataTile_row, gI); + reducesum_row_kernel<<>>(SumTile.data(), dataTile_row.data(), oldSumTile.data()); // reducesum_row_kernel<<>>(SumTile.data(), dataTile_row.data(), oldSumTile.data()); oldSumTile = SumTile; } -// printf("before tcopyout\n"); - TCOPYOUT(gO, SumTile); -// printf("end tcopyout\n"); +// printf("before tstore\n"); + TSTORE(gO, SumTile); +// printf("end tstore\n"); } //for col cor if constexpr (rmd_M > 0){ auto gO = gOIter(Mb, 0); TEXPANDSCALAR(oldSumTile_col, 0);//初始化为0 - //初始化old_sum的tile + //初始化old_sum的tile for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(Mb, i); - TCOPYIN(dataTile_col, gI); + auto gI = gIIter(Mb, i); + TLOAD(dataTile_col, gI); reducesum_row_kernel<<>>(SumTile_col.data(), dataTile_col.data(), oldSumTile_col.data()); oldSumTile_col = SumTile_col; } if constexpr (rmd_N > 0){ auto gI = gIIter(Mb, Nb); - TCOPYIN(dataTile_cor, gI); + TLOAD(dataTile_cor, gI); reducesum_row_kernel<<>>(SumTile_col.data(), dataTile_cor.data(), oldSumTile_col.data()); oldSumTile_col = SumTile_col; } - TCOPYOUT(gO, SumTile_col); + TSTORE(gO, SumTile_col); } /* for(int i = 0; i < gIM; i++){ printf("out%d = %d\n", i, out_ptr[i]); } */ -// printf("end program\n"); +// printf("end program\n"); } #endif diff --git a/kernels/reduction/reducesum_rowvec_single_tree.hpp b/kernels/reduction/reducesum_rowvec_single_tree.hpp index 8b8cabf..ac2738b 100644 --- a/kernels/reduction/reducesum_rowvec_single_tree.hpp +++ b/kernels/reduction/reducesum_rowvec_single_tree.hpp @@ -18,38 +18,38 @@ void __vec__ reducesum_row_kernel( typename tileTmpSum::TileDType __out__ new_sum, const typename tileSrc::TileDType __in__ src, const typename tileTmpSum::TileDType __in__ old_sum, - const size_t tile_idx + const size_t tile_idx ) { - size_t j = blkv_get_index_x(); - + size_t j = blkv_get_index_x(); + __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); #pragma clang loop unroll(full) for(int i=0;i>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeSum = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec using tile_shapeTmpSum = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); - tile_shapeData dataTile; + tile_shapeData dataTile; tile_shapeSum SumTile; tile_shapeTmpSum oldtmpSumTile; tile_shapeTmpSum tmpSumTile; @@ -185,7 +185,7 @@ void reducesum_trowsum_rand( // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); @@ -194,22 +194,22 @@ void reducesum_trowsum_rand( // for (int j = 0; j < Mb; ++j) { auto gO = gOIter(0, 0); - TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0 + TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0 for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(0, i); -// printf("before copy in , %d\n", i); - TCOPYIN(dataTile, gI); - reducesum_row_kernel<<>>(tmpSumTile.data(), - dataTile.data(), + auto gI = gIIter(0, i); +// printf("before copy in , %d\n", i); + TLOAD(dataTile, gI); + reducesum_row_kernel<<>>(tmpSumTile.data(), + dataTile.data(), oldtmpSumTile.data(), i); // reducesum_row_kernel<<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data()); // printf("kernel , %d\n", i); oldtmpSumTile = tmpSumTile; } - reducesum_row_final_kernel<<>>(SumTile.data(), - tmpSumTile.data()); - TCOPYOUT(gO, SumTile); + reducesum_row_final_kernel<<>>(SumTile.data(), + tmpSumTile.data()); + TSTORE(gO, SumTile); } //} diff --git a/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp b/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp index ddfcca7..dd76974 100644 --- a/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp +++ b/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp @@ -19,26 +19,26 @@ void __vec__ reducesum_row_kernel( const typename tileSrc::TileDType __in__ src, const typename tileSrcCol::TileDType __in__ src_col, const typename tileTmpSum::TileDType __in__ old_sum, - const size_t tile_idx + const size_t tile_idx ) { - size_t j = blkv_get_index_x(); - size_t z = blkv_get_index_y(); + size_t j = blkv_get_index_x(); + size_t z = blkv_get_index_y(); size_t stride_src = z * (tileSrc::ValidCol/4) * tileSrc::ColStride; - size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride; - + size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride; + __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum); __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src); - __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col); - __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); + __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col); + __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum); /* - #pragma clang loop unroll(full) + #pragma clang loop unroll(full) for(size_t i=0;i>; //将gm中的Tensor先声明为一维数据 -// using gm_shapeSum = global_tensor>; + using gm_shapeIn = global_tensor>; //将gm中的Tensor先声明为一维数据 +// using gm_shapeSum = global_tensor>; using gm_shapeOut = global_tensor>; using tile_shapeData = Tile; // todo 尾块怎么处理?是否要作为参数写在这 - using tile_shapeDataCol = Tile; // todo 尾块怎么处理?是否要作为参数写在这 + using tile_shapeDataCol = Tile; // todo 尾块怎么处理?是否要作为参数写在这 using tile_shapeSum = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec using tile_shapeTmpSum = Tile; // todo 这里的location,一定要是Vec吗?哪怕没有传入Vec - gm_shapeIn inGm(in_ptr); + gm_shapeIn inGm(in_ptr); gm_shapeOut outGm(out_ptr); -// gm_shapeSum olcSumGm(old_sum_ptr); +// gm_shapeSum olcSumGm(old_sum_ptr); - tile_shapeData dataTile; - tile_shapeDataCol dataTile_col; + tile_shapeData dataTile; + tile_shapeDataCol dataTile_col; tile_shapeSum SumTile; tile_shapeTmpSum oldtmpSumTile; tile_shapeTmpSum tmpSumTile; @@ -204,7 +204,7 @@ void reducesum_trowsum_rand( // int base = 0;// todo 生成一个标量 // int all_num = gOM; // 总元素数量 - using itIn = global_iterator; + using itIn = global_iterator; using itOut = global_iterator; itIn gIIter(in_ptr); @@ -212,21 +212,21 @@ void reducesum_trowsum_rand( auto gO = gOIter(0, 0); - TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0 - TEXPANDSCALAR(dataTile_col, 0);//初始化为0 + TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0 + TEXPANDSCALAR(dataTile_col, 0);//初始化为0 for (int i = 0; i < Nb; ++i) { - auto gI = gIIter(0, i); - TCOPYIN(dataTile, gI); - reducesum_row_kernel<<>>(tmpSumTile.data(), - dataTile.data(), - dataTile_col.data(), + auto gI = gIIter(0, i); + TLOAD(dataTile, gI); + reducesum_row_kernel<<>>(tmpSumTile.data(), + dataTile.data(), + dataTile_col.data(), oldtmpSumTile.data(), i); oldtmpSumTile = tmpSumTile; } - reducesum_row_final_kernel<<>>(SumTile.data(), - tmpSumTile.data()); - TCOPYOUT(gO, SumTile); + reducesum_row_final_kernel<<>>(SumTile.data(), + tmpSumTile.data()); + TSTORE(gO, SumTile); } #endif diff --git a/models/deepseekv3/mla.hpp b/models/deepseekv3/mla.hpp index f309309..350d790 100644 --- a/models/deepseekv3/mla.hpp +++ b/models/deepseekv3/mla.hpp @@ -26,26 +26,26 @@ using namespace pto; // const int Sb = seq_len / tS; // for(int i=0;i freqs; +// Tile freqs; // Tile t; // TARANGE(freqs, 0, dim, 2); // TARANGE(t, 0, end, 1); // freq_tshape tfreq_cis; // freq_tshape tfreq_cis_real; // freq_tshape tfreq_cis_imag; -// TOUTDOT(tfreq_cis, freqs, t); //outer product -// TSIN(tfreq_cis_real, tfreq_cis); // +// TOUTDOT(tfreq_cis, freqs, t); //outer product +// TSIN(tfreq_cis_real, tfreq_cis); // // TCOS(tfreq_cis_imag, tfreq_cis); // auto gO_real = gFreq_real(i,0); // auto gO_imag = gFreq_imag(i,0); -// TCOPYOUT(gO_real, tfreq_cis_real); -// TCOPYOUT(gO_imag, tfreq_cis_imag); +// TSTORE(gO_real, tfreq_cis_real); +// TSTORE(gO_imag, tfreq_cis_imag); // } // } template void projection(Tensor &out, - Tensor &x, + Tensor &x, Tensor &proj){ for(int i=0;i>; using tile_shape = Tile; - + using tSum = Tile; - + using gIter = global_iterator; gIter giter_src(src); @@ -78,19 +78,19 @@ void rmsnorm(dtype *dst, dtype *src){ { auto gsrc = giter_src(i, j); tile_shape tsrc; - - TCOPYIN(tsrc, gsrc); + + TLOAD(tsrc, gsrc); tSum tLocalSum; TMUL(tsrc, tsrc, tsrc); TROWSUM(tLocalSum, tsrc); TADD(tAccSquareSum, tAccSquareSum, tLocalSum); } - + tSum gSqureMean; TDIVS(gSqureMean, tAccSquareSum, kN); TSQRT(gSqureMean, gSqureMean); - + tile_shape gSqureMean_i; TEXPANDCOL(gSqureMean_i, gSqureMean); @@ -98,12 +98,12 @@ void rmsnorm(dtype *dst, dtype *src){ { auto gsrc = giter_src(i,j); tile_shape tsrc; - TCOPYIN(tsrc, gsrc); - + TLOAD(tsrc, gsrc); + TDIV(tsrc, tsrc, gSqureMean_i); - + auto gdst = giter_dst(i,j); - TCOPYOUT(gdst, tsrc); + TSTORE(gdst, tsrc); } } } @@ -157,12 +157,12 @@ void apply_rotary_emb(dtype *x, dtype *freqs_cis){ gm_shape input(x+offset); tile_shape tin; tile_shape_rope resh_tin; - TCOPYIN(tin, input); // 64*32 + TLOAD(tin, input); // 64*32 TRESHAPE(resh_tin, tin); // 64*32 -> 1024*2 //TTRANS(); // 128*2 -> 2*128 tile_shape_half tin_real; - tile_shape_half tin_imag; + tile_shape_half tin_imag; TEXTRACT(tin_real, resh_tin, 0, 0); // real 1024*1 TEXTRACT(tin_imag, resh_tin, 0, 1); // image 1024*1 @@ -170,7 +170,7 @@ void apply_rotary_emb(dtype *x, dtype *freqs_cis){ gm_shape freqs(freqs_cis+offset); tile_shape tfreqs; tile_shape_rope tfreqs_resh; - TCOPYIN(tfreqs, freqs); + TLOAD(tfreqs, freqs); TRESHAPE(tfreqs_resh, tfreqs); tile_shape_half tfreqs_real; @@ -203,7 +203,7 @@ void apply_rotary_emb(dtype *x, dtype *freqs_cis){ tile_shape tout_resh; TRESHAPE(tout_resh, tout); - TCOPYOUT(input, tout_resh); + TSTORE(input, tout_resh); } } } @@ -221,18 +221,18 @@ void split(dtype *out1, dtype *out2, dtype *in){ uint32_t n_row = row/trow; uint32_t n_col1 = dim1/tcol; - uint32_t n_col2 = dim2/tcol; + uint32_t n_col2 = dim2/tcol; for(int i=0;i [row, ext_dim, dim] @@ -333,18 +333,18 @@ void expand(dtype *out, dtype *in){ uint32_t offset = i * dim; gm_in input(in + offset); tile_shape tmp; - TCOPYIN(tmp, input); + TLOAD(tmp, input); for(int j=0;j -void MLA(Tensor & out, - Tensor &x, +void MLA(Tensor & out, + Tensor &x, Tensor & freqs_cis, Tensor* atten_mask=nullptr){ // do down projection to q_down then do up projection to q_up @@ -374,7 +374,7 @@ void MLA(Tensor & out, Tensor q_nope; Tensor q_pe; split(q_nope.data(), q_pe.data(), q_up.data()); - + //for q_pe doing rotary embedding Tensor q_perm; permute(q_perm.data(), q_pe.data()); @@ -386,7 +386,7 @@ void MLA(Tensor & out, permute(q_pe.data(), q_perm.data()); //writeTensorToFile(q_pe.data(), "q_pe_cpp.txt"); - + Tensor q_attn; //concat q_nope and q_pe to Q concat(q_attn.data(), q_nope.data(), q_pe.data()); @@ -394,7 +394,7 @@ void MLA(Tensor & out, // do down projection to k_rope+k_lora_rank, and split to k_rope, k_lora_rank Tensor kv; Tensor k_pe; - { + { Tensor Wkv_down(1); Tensor kv_down; projection(kv_down, x, Wkv_down); @@ -432,15 +432,15 @@ void MLA(Tensor & out, //writeTensorToFile(k_attn.data(), "k_attn_cpp.txt"); //writeTensorToFile(v_attn.data(), "v_attn_cpp.txt"); - permute(q_attn_pm.data(), q_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim] - permute(k_attn_pm.data(), k_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim] + permute(q_attn_pm.data(), q_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim] + permute(k_attn_pm.data(), k_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim] permute(v_attn_pm.data(), v_attn.data()); //[b,s,n_heads,v_head_dim] permute to [b,n_heads,s,v_head_dim] Tensor attn_out; Tensor attn_tmp; for(int i=0;i(attn_tmp.data(i,j), q_attn_pm.data(i,j), k_attn_pm.data(i,j), v_attn_pm.data(i,j)); @@ -451,7 +451,7 @@ void MLA(Tensor & out, permute(attn_out.data(), attn_tmp.data()); //writeTensorToFile(attn_out.data(), "attn_out_cpp.txt"); - + //final output projection { Tensor Wout(1); diff --git a/models/deepseekv3/moe.hpp b/models/deepseekv3/moe.hpp index 4726706..da6387d 100644 --- a/models/deepseekv3/moe.hpp +++ b/models/deepseekv3/moe.hpp @@ -36,7 +36,7 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp( // dst_ptr[i * tile_shape::Cols + tid] = // src_ptr[i * tile_shape::Cols + tid]; // dst_ptr[i * tile_shape::Cols + partner] = - // src_ptr[i * tile_shape::Cols + partner]; + // src_ptr[i * tile_shape::Cols + partner]; // } // } else { // if (src_ptr[i * tile_shape::Cols + tid] > @@ -93,15 +93,15 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp( // "l.sw vt#1.sw, [to, vm#2.uh<<2]\n" // dst[tid] = src[partner] // "l.sw vt#2.sw, [to, vm#1.uh<<2]\n" // dst[partner] = src[tid] // "l.addi t#1.ud, 0, ->p\n" //resave p from 3rd branch - // "l.xori p, -1, ->p\n" + // "l.xori p, -1, ->p\n" // "l.and p, t#2.ud, ->p\n" //go else for 2nd branch - // "l.cmp.lt vt#1.sw, vt#2.sw, -> vn.b\n" // src[partner] < src[tid] + // "l.cmp.lt vt#1.sw, vt#2.sw, -> vn.b\n" // src[partner] < src[tid] // "l.addi p, 0 ->t.d\n" // save p for 4th branch // "l.cmp.eqi vn#1.ub, 1,->p\n" // set p if(src[tid] < src[partner]) // "l.sw vt#1.sw, [to, vm#2.uh<<2]\n" //dst[tid] = src[partner] // "l.sw vt#2.sw, [to, vm#1.uh<<2]\n" //dst[partner] = src[tid] // "l.addi t#1.ud, 0, ->p\n" //resave p from 4rd branch - // "" //merge 2nd branch two result + // "" //merge 2nd branch two result // "l.addi t#3.ud, 0, ->p\n" //resave p from 2nd branch // "l.addi t#4.ud, 0, ->p\n" //resave p from 1st branch // "c.bstop\n" @@ -122,16 +122,16 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp( "v.lw [ta, vn#1.reuse.uh<<2], ->vt.w\n" // src[index_part+col/2] = partner_idx "v.lw [ta, vm#2.reuse.uh<<2], ->vt.w\n" // src[index] = cur_value "v.lw [ta, vm#1.reuse.uh<<2], ->vt.w\n" // src[index_part] = partner_value - "v.sw vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n" // dst[tid] = src[tid] // copy first + "v.sw vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n" // dst[tid] = src[tid] // copy first "v.sw vt#1.reuse.sw, [to, vm#1.reuse.uh<<2]\n" // dst[partner] = src[partner] // copy first - "v.sw vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n" // dst[tid+col/2] = src[tid+col/2] // copy first + "v.sw vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n" // dst[tid+col/2] = src[tid+col/2] // copy first "v.sw vt#3.reuse.sw, [to, vn#1.reuse.uh<<2]\n" // dst[partner+col/2] = src[partner+col/2] // copy first "v.cmp.lt lc0.uh, vu#1.reuse.uh, ->vn.b\n" // tid < partner "v.and vu#1.reuse.uh, ri0.uh, ->vn.h\n" // partner & stage "v.cmp.eqi vn#1.reuse.uh, 0, ->vn.b\n" // partner & stage == 0 "v.cmp.lt vt#2.reuse.sw, vt#1.reuse.sw, ->vn.b\n" // cur_value < partner_value "v.and vn#4.reuse.ub, vn#2.reuse.ub, ->vu.b\n" // (tid < partner) & (partner & stage) == 0 - "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n" // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value) + "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n" // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value) "v.cmp.eqi vu#1.ub, 1, ->vm.b\n" // sort_descend "" "v.cmp.eqi vn#3.uh, 1, ->vn.b\n" // partner & stage == 1 @@ -152,7 +152,7 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp( "v.sw vt#3.sw, [to, vn#2.uh<<2]\n" // dst[tid+col/2] = src[partner] "v.sw vt#4.sw, [to, vn#1.uh<<2]\n" // dst[partner+col/2] = src[tid] "l.addi t#1.ud, 0, ->p\n" // resave p from 1st branch - "" // merge 2nd branch two result + "" // merge 2nd branch two result "c.bstop\n" : :"i"(tile_shape::ValidCol) @@ -171,7 +171,7 @@ TRANGE_RowMajor(typename tile_shape::TileDType __out__ dst) { } template -__attribute__((always_inline)) +__attribute__((always_inline)) void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) { static constexpr uint16_t row = tile_shape::ValidRow; static constexpr uint16_t col = tile_shape::ValidCol; @@ -181,7 +181,7 @@ void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) { using tile_shape_sort = Tile; tile_shape_sort dst_sort; - tile_shape_sort src_sort; + tile_shape_sort src_sort; TRANGE_RowMajor<<>>(indices.data()); tile_shape_sort padding(-1); @@ -196,7 +196,7 @@ void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) { BitonicSortStepDescend_RowMajor_Imp<<>>(dst_sort.data(), src_sort.data(), stage, step); TCOPY(src_sort, dst_sort); - // TCOPYOUT(gIn, dst); + // TSTORE(gIn, dst); // printf("stage:%d step:%d\n", stage, step); // for (int j=0;j @@ -279,7 +279,7 @@ template void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) { using tile_shape_sort = Tile; tile_shape_sort dst_sort; - tile_shape_sort src_sort; + tile_shape_sort src_sort; TRANGE_RowMajor(indices.data()); tile_shape_sort padding(0); @@ -325,7 +325,7 @@ void __vec__ TScatterRow_Vec_RowMajor( __vbuf__ typename tile_shape_dst::DType *dst_ptr = blkv_get_tile_ptr(dst); __vbuf__ typename tile_shape_dst::DType *src_ptr = blkv_get_tile_ptr(src); __vbuf__ typename tile_shape_srci::DType *si_ptr = blkv_get_tile_ptr(srci); - dst_ptr[j*tile_shape_dst::RowStride + i] = src_ptr[j*tile_shape_dst::RowStride + i]; + dst_ptr[j*tile_shape_dst::RowStride + i] = src_ptr[j*tile_shape_dst::RowStride + i]; for(uint16_t k=0;k>; using tileIn = Tile; using tileOut = Tile; // num < 32 - + #ifdef __cpu_sim__ //writeTensorToFile(x, "moe_topk_in_cpp.txt"); #endif @@ -374,7 +374,7 @@ void topk(dtype *weight, dtype* indices, dtype *x){ tileIn tIn; tileIn tWeight; tileIn tIndice; - TCOPYIN(tIn, gIn); + TLOAD(tIn, gIn); TSORTROW(tWeight, tIndice, tIn); tileOut tWeightOut; TEXTRACT(tWeightOut, tWeight, 0, 0); @@ -383,10 +383,10 @@ void topk(dtype *weight, dtype* indices, dtype *x){ TEXTRACT(tIndiceOut, tIndice, 0, 0); gmOut gWeight(weight+i*tS*num); - TCOPYOUT(gWeight, tWeightOut); + TSTORE(gWeight, tWeightOut); gmOut gIndice(indices+i*tS*num); - TCOPYOUT(gIndice, tIndiceOut); + TSTORE(gIndice, tIndiceOut); } #ifdef __cpu_sim__ @@ -410,18 +410,18 @@ void sigmoid(dtype *out, dtype* in){ for(int j=0;j(1)); // 1+ e^(-x) TRECIP(tmp,tmp); // 1/ (1 + e^-x) auto gOut = gIterOut(i,j); - TCOPYOUT(gOut, tmp); + TSTORE(gOut, tmp); } } } -//select idx array to index every row in "in", then mask with value and create last dim to ext_dim +//select idx array to index every row in "in", then mask with value and create last dim to ext_dim //out[idx] -> [tokens, in_dim] template void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){ @@ -441,11 +441,11 @@ void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){ for(int i=0;i(tIn.data(), tIn.data(), tIdx.data(), value); #endif - + Tile tRe; TRESHAPE(tRe, tIn); @@ -463,7 +463,7 @@ void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){ TEXPANDCOL(tOut, tRe); gmOut gOut(out + i*tS*in_dim*ext_dim); - TCOPYOUT(gOut, tOut); + TSTORE(gOut, tOut); } } @@ -481,19 +481,19 @@ void mask_fill(dtype *data, dtype *mask, dtype mask_value){ gm_shape gmask(mask+i*tS*dim); tile_shape tmask; - TCOPYIN(tdata, gdata); - TCOPYIN(tmask, gmask); + TLOAD(tdata, gdata); + TLOAD(tmask, gmask); tile_shape tmaskval(mask_value); TSELECT(tdata, tmask, tmaskval, tdata); - TCOPYOUT(gdata, tdata); + TSTORE(gdata, tdata); } } template -void Gate(dtype *weights, +void Gate(dtype *weights, dtype *indices, dtype *x, dtype *bias=nullptr){ @@ -623,14 +623,14 @@ void Gate(dtype *weights, uint64_t offset = i * tS * 2; gm_shape gIn(group_weight.data()+offset); tile_shape tmp; - TCOPYIN(tmp, gIn); + TLOAD(tmp, gIn); tile_shape_out rowsum; TROWSUM(rowsum,tmp); offset = i * tS * 1; gm_shape_out gOut(group_weight_sum.data()+offset); - TCOPYOUT(gOut, rowsum); + TSTORE(gOut, rowsum); } } #ifdef __cpu_sim__ @@ -671,7 +671,7 @@ void Gate(dtype *weights, //[b*s, n_expert_groups] all-1 matrix to index limit_groups_indices, and set to zero Tensor mask(1); - Tensor mask_expand; + Tensor mask_expand; scatter_expand(mask_expand.data(), limit_group_indices.data(), mask.data(), 0); //TSCATTER(重新定义每行indices选对应行的某些列) //scores [b*s, n_expert_groups, n_routed_experts/n_expert_groups] mask [b*s, n_expert_groups] // to mask irelevant groups with "-inf" except selcted limited groups @@ -710,7 +710,7 @@ void Gate(dtype *weights, writeTensorToFile(indices, "moe_gate_indices_masked_cpp.txt"); #endif - //weights = original_scores.gather(1, indices) should be same as above? + //weights = original_scores.gather(1, indices) should be same as above? //gather is extract corresponding score on dim=1, weight([b*s, n_activated_experts]) if constexpr(args::score_func == ScoreFunc::SIGMOID){ // weights /= weights.sum weight sum normalization since we have extract some weight so the extract sum is not 1 @@ -722,13 +722,13 @@ void Gate(dtype *weights, uint64_t offset = i * tS * args::n_activated_experts; gm_shape gIn(weights+offset); tile_shape tmp; - TCOPYIN(tmp, gIn); + TLOAD(tmp, gIn); tile_shape rowsum; TROWSUMEXPAND(rowsum,tmp); TDIV(tmp, tmp, rowsum); gm_shape gOut(weights+offset); - TCOPYOUT(gOut, tmp); + TSTORE(gOut, tmp); } } } @@ -750,7 +750,7 @@ void Gate(dtype *weights, // uint16_t index = j * tile_shape::RowStride + i; // uint16_t idx = src_ptr[index]; // sum[idx] = sum[idx] + 1; - + // dst_ptr[idx] = dst_ptr[idx] + 1; // } @@ -765,7 +765,7 @@ void Gate(dtype *weights, // for (uint16_t i = 0; i < tile_shape_src::ValidRow; ++i){ // for (uint16_t j = 0; j < tile_shape_src::ValidCol; ++j) { // uint16_t idx = src[i * tile_shape_src::RowStride + j]; -// sum[idx] = sum[idx] + 1; +// sum[idx] = sum[idx] + 1; // } // } @@ -792,13 +792,13 @@ void bincount(size_t *counts, dtype *indices, size_t size){ // for (int i=0;i void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ const int tS = 64; @@ -872,8 +872,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ auto gW1 = gIterW1(0,k); tileIO tIn; tileW13 tW1; - TCOPYIN(tIn, gIn); - TCOPYIN(tW1, gW1); + TLOAD(tIn, gIn); + TLOAD(tW1, gW1); MATMUL(tACC_W1, tIn, tW1); } @@ -883,8 +883,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ auto gW1 = gIterW1(d,k); tileIO tIn; tileW13 tW1; - TCOPYIN(tIn, gIn); - TCOPYIN(tW1, gW1); + TLOAD(tIn, gIn); + TLOAD(tW1, gW1); MATMACC(tACC_W1, tIn, tW1); } @@ -897,8 +897,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ auto gW3 = gIterW3(0,k); tileIO tIn; tileW13 tW3; - TCOPYIN(tIn, gIn); - TCOPYIN(tW3, gW3); + TLOAD(tIn, gIn); + TLOAD(tW3, gW3); MATMUL(tACC_W3, tIn, tW3); } @@ -908,8 +908,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ auto gW3 = gIterW3(d,k); tileIO tIn; tileW13 tW3; - TCOPYIN(tIn, gIn); - TCOPYIN(tW3, gW3); + TLOAD(tIn, gIn); + TLOAD(tW3, gW3); MATMACC(tACC_W3, tIn, tW3); } @@ -930,7 +930,7 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ auto gW2 = gIterW2(k,j); tileW2 tW2; - TCOPYIN(tW2, gW2); + TLOAD(tW2, gW2); MATMUL(tACC2_W2, tACC_W13, tW2); tileACC2_CVT tACC2_W2_CVT; @@ -939,7 +939,7 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){ } auto gOut = gIterOut(i,j); - TCOPYOUT(gOut, tACC2_W2_OUT); + TSTORE(gOut, tACC2_W2_OUT); } } } @@ -979,7 +979,7 @@ void TRowCondSet_Vec_RowMajor( dst[i*tile_shape::RowStride+j] = one; }else{ typename tile_shape::DType zero = 0; - dst[i*tile_shape::RowStride+j] = zero; + dst[i*tile_shape::RowStride+j] = zero; } } } @@ -987,7 +987,7 @@ void TRowCondSet_Vec_RowMajor( #endif template -void MoE(Tensor & out, +void MoE(Tensor & out, Tensor &x){ const int tokens = bsz*seq_len; //view(x, x); //[bsz, seq_len, dim] -> [b*s, dim] @@ -1026,7 +1026,7 @@ void MoE(Tensor & out, Tensor experts_w1[args::n_routed_experts]; Tensor experts_w2[args::n_routed_experts]; Tensor experts_w3[args::n_routed_experts]; - + #ifdef __cpu_sim__ writeTensorToFile(experts_w1[3].data(), "moe_tmp.txt"); #endif @@ -1039,14 +1039,14 @@ void MoE(Tensor & out, // printf("current idx is %d\n", idx); Tensor expert_w1 = experts_w1[idx]; Tensor expert_w2 = experts_w2[idx]; - Tensor expert_w3 = experts_w3[idx]; - + Tensor expert_w3 = experts_w3[idx]; + Tensor x_mask_w_wt; //Tensor weight_expand; //generate condition matrix for indices that indices == i -> 1, indices !=i -> 0 //[tokens, n_activated_experts] with corresponding tokens all zeros or all ones //finally get x_mask with unselect tokens row set to 0 and multiply with weight in advance; - { + { const int tS = 64; const int tdim = 64; using gmIn = global_tensor>; @@ -1059,7 +1059,7 @@ void MoE(Tensor & out, for(int i=0;i & out, gmIn gweight(weights.data()+i*tS*args::n_activated_experts); tileIn tweight; - TCOPYIN(tweight, gweight); + TLOAD(tweight, gweight); TMUL(tweight, tweight, tidx); Tile tweight_sum; TROWSUM(tweight_sum, tweight); @@ -1090,12 +1090,12 @@ void MoE(Tensor & out, uint64_t offset = i*(tS*args::dim)+j*tdim; gmOut gIn(x.data()+offset); tileOut tOut; - TCOPYIN(tOut, gIn); + TLOAD(tOut, gIn); TMUL(tOut, tOut, tcond); TMUL(tOut, tOut, tweight_expand); - + gmOut gOut(x_mask_w_wt.data()+offset); - TCOPYOUT(gOut, tOut); + TSTORE(gOut, tOut); } } } @@ -1124,7 +1124,7 @@ void MoE(Tensor & out, dtype tokens_idx[]; dtype topk_idx[]; where(tokens_idx, topk_idx, indices.data(), i); //idx, top = torch.where(indices == i) 返回index=i的expert的行索引(idx)即哪些token属于这个专家,列索引(top)这个专家属于topk的哪个,需要看下ascend c++做法 - + //Gather 选出来的tokens,在Scatter到最终out token dim for(int i=0;i & out, } */ } - + Tensor shared_expert_w1(1); Tensor shared_expert_w2(1); Tensor shared_expert_w3(1); Tensor shared_expert_out; MLP(shared_expert_out.data(), x.data(), shared_expert_w1.data(), shared_expert_w2.data(), shared_expert_w3.data()); - + matadd(out.data(), y.data(), shared_expert_out.data()); //reshape (bsz*seq_len, dim) -> (bsz, seq_len, dim) } diff --git a/samples/README.md b/samples/README.md new file mode 100644 index 0000000..194ec1d --- /dev/null +++ b/samples/README.md @@ -0,0 +1,15 @@ +# Samples + +This tree keeps small checked-in compiler-output examples for quick inspection. +Samples are reference artifacts, not build gates, and are intentionally separate +from generated local output under `output/`. + +| Workload | Sample | Related SuperNPUBench source | Provenance | +| --- | --- | --- | --- | +| Flash attention | [`flash_attention/flash_attention_block_template.diss`](flash_attention/flash_attention_block_template.diss) | [`../benchmarks/npu/fusion`](../benchmarks/npu/fusion), [`../benchmarks/kernels/composite/src/flash_attention.cpp`](../benchmarks/kernels/composite/src/flash_attention.cpp) | `pto_objdump`/`llvm-objdump` disassembly of a larger compiler-produced flash attention object, including block-template TileOP sequences such as `BSTART.TLOAD`, `BSTART.TMATMUL`, `BSTART.TCVT`, and `BSTART.TSTORE`. | +| GEMM | [`gemm/gemm_avs_tile_smoke.diss`](gemm/gemm_avs_tile_smoke.diss) | [`../benchmarks/npu/vec_simd/gemm_18x128x256`](../benchmarks/npu/vec_simd/gemm_18x128x256), [`../benchmarks/kernels/composite/src/gemm.cpp`](../benchmarks/kernels/composite/src/gemm.cpp) | `llvm-objdump -dl` of a compiler-produced `gemm.o` from the Linx superproject AVS tile smoke outputs. | + +A compatible Linx compiler can disassemble these objects, but direct +SuperNPUBench NPU source compilation may still require frontend support for the +block-vector builtins and tile-register inline-assembly constraints used by the +benchmark headers. diff --git a/samples/flash_attention/README.md b/samples/flash_attention/README.md new file mode 100644 index 0000000..011af42 --- /dev/null +++ b/samples/flash_attention/README.md @@ -0,0 +1,19 @@ +# Flash Attention Sample + +`flash_attention_block_template.diss` is a checked-in block-template TileOP +disassembly of a compiler-produced Linx object for a larger flash attention +case. It is intended to show representative `BSTART.*`, `B.ARG`, `B.IOR`, and +`B.IOTI` sequences instead of the older scalar smoke output. + +Related SuperNPUBench sources: + +| Path | Role | +| --- | --- | +| [`../../benchmarks/npu/fusion`](../../benchmarks/npu/fusion) | Active NPU flash-attention-style fusion benchmark suite. | +| [`../../benchmarks/kernels/composite/src/flash_attention.cpp`](../../benchmarks/kernels/composite/src/flash_attention.cpp) | Composite flash attention benchmark entrypoint. | + +Regenerate from a compatible Linx compiler object with: + +```sh +llvm-objdump -dl flash_attention.o > flash_attention_block_template.diss +``` diff --git a/samples/flash_attention/flash_attention_block_template.diss b/samples/flash_attention/flash_attention_block_template.diss new file mode 100644 index 0000000..b4c7487 --- /dev/null +++ b/samples/flash_attention/flash_attention_block_template.diss @@ -0,0 +1,926 @@ + +generated/pto_objdump/obj/flash_attention.o: file format elf64-linx + +Disassembly of section .text: + +0000000000000000 : + 0: FENTRY [ra ~ s8], sp!, 208 + 4: C.BSTART COND, 0x1c + 6: c.movr a3, ->s1 + 8: c.movr a1, ->s7 + a: sdi a0, [sp, 40] + e: c.movr zero, ->a0 + 10: c.setc.eq a4, a0 + 12: C.BSTART DIRECT, 0x24 + 14: c.lwi [a4, 4], ->t + 16: c.sdi t#1, [sp, 24] + 18: lwi [a4, 0], ->s0 + +000000000000001c <.LBB0_1>: + 1c: C.BSTART.STD + 1e: c.movi 5, ->s0 + 20: sdi s0, [sp, 24] + +0000000000000024 <.LBB0_3>: + 24: C.BSTART COND, 0x1e0 + 26: c.sext.w s0, ->t + 28: c.movi 5, ->a1 + 2a: c.setc.ne t#1, a1 + 2c: C.BSTART COND, 0x1e0 + 2e: c.ldi [sp, 24], ->t + 30: c.sext.w t#1, ->t + 32: c.setc.ne t#1, a1 + 34: C.BSTART.STD + 36: addi a2, 64, ->a1 + 3a: addi zero, 16, ->x1 + 3e: addi zero, 1024, ->a3 + 42: addi zero, 64, ->a4 + +0000000000000046 <.LBB0_6>: + 46: C.BSTART.STD + 48: slli a0, 8, ->t + 4c: add s7, t#1, ->a5 + 50: BSTART.TLOAD INT32 + 54: C.B.DIMI 4, ->lb0 + 56: C.B.DIMI 16, ->lb1 + 58: B.ARG ND2ZN.normal + 5c: B.IOR [a5,x1],[] + 60: B.IOTI [], last ->t<4KB> + 64: BSTART.TLOAD INT32 + 68: C.B.DIMI 4, ->lb0 + 6a: C.B.DIMI 4, ->lb1 + 6c: B.ARG DN2NZ.normal + 70: B.IOR [a2,x1],[] + 74: B.IOTI [], last ->t<4KB> + 78: BSTART.TLOAD INT32 + 7c: C.B.DIMI 4, ->lb0 + 7e: C.B.DIMI 16, ->lb1 + 80: B.ARG DN2NZ.normal + 84: B.IOR [s1,a3],[] + 88: B.IOTI [], last ->t<4KB> + 8c: BSTART.TMATMUL INT32 + 90: C.B.DIMI 16, ->lb0 + 92: C.B.DIMI 4, ->lb1 + 94: C.B.DIMI 4, ->lb2 + 96: B.IOTI [t#1, t#2], last ->acc<4KB> + 9a: BSTART.ACCCVT INT32 + 9e: B.IOTI [], last ->m<4KB> + a2: + a6: B.ARG VV + aa: B.IOTI [m#1] ->t<4KB> + ae: B.IOTI [], last ->t<4KB> + b2: + b6: B.ARG VV + ba: B.IOTI [t#2] ->t<4KB> + be: B.IOTI [], last ->t<4KB> + c2: BSTART.TMATMUL INT32 + c6: C.B.DIMI 16, ->lb0 + c8: C.B.DIMI 16, ->lb1 + ca: C.B.DIMI 4, ->lb2 + cc: B.IOTI [t#2, t#3], last ->acc<4KB> + d0: BSTART.ACCCVT INT32 + d4: B.IOTI [], last ->m<4KB> + d8: C.BSTART.STD + da: slli a0, 10, ->a5 + de: c.movr a1, ->a6 + e0: c.movr x1, ->a7 + +00000000000000e2 <.LBB0_15>: + e2: BSTART.TLOAD INT32 + e6: C.B.DIMI 4, ->lb0 + e8: C.B.DIMI 4, ->lb1 + ea: B.ARG DN2NZ.normal + ee: B.IOR [a6,x1],[] + f2: B.IOTI [], last ->t<4KB> + f6: C.BSTART.STD + f8: add s1, a7, ->x0 + fc: BSTART.TLOAD INT32 + 100: C.B.DIMI 4, ->lb0 + 102: C.B.DIMI 16, ->lb1 + 104: B.ARG DN2NZ.normal + 108: B.IOR [x0,a3],[] + 10c: B.IOTI [], last ->t<4KB> + 110: BSTART.TMATMUL INT32 + 114: C.B.DIMI 16, ->lb0 + 116: C.B.DIMI 4, ->lb1 + 118: C.B.DIMI 4, ->lb2 + 11a: B.IOTI [t#1, t#2], last ->acc<4KB> + 11e: BSTART.ACCCVT INT32 + 122: B.IOTI [], last ->m<4KB> + 126: + 12a: B.ARG VV + 12e: B.IOTI [m#2] ->t<4KB> + 132: B.IOTI [], last ->t<4KB> + 136: + 13a: B.ARG VV + 13e: B.IOTI [t#2] ->t<4KB> + 142: B.IOTI [], last ->t<4KB> + 146: BSTART.TMATMUL INT32 + 14a: C.B.DIMI 16, ->lb0 + 14c: C.B.DIMI 16, ->lb1 + 14e: C.B.DIMI 4, ->lb2 + 150: B.IOTI [t#2, t#3], last ->acc<4KB> + 154: BSTART.ACCCVT INT32 + 158: B.IOTI [], last ->m<4KB> + 15c: + 160: B.ARG VV + 164: B.IOTI [m#1] ->t<4KB> + 168: B.IOTI [], last ->t<4KB> + 16c: + 170: B.ARG VV + 174: B.IOTI [m#2] ->t<4KB> + 178: B.IOTI [], last ->t<4KB> + 17c: + 180: B.ARG VV + 184: B.IOTI [t#2, t#3] ->t<4KB> + 188: B.IOTI [], last ->t<4KB> + 18c: C.BSTART.STD + 18e: addi a6, 64, ->a6 + 192: + 196: B.ARG VV + 19a: B.IOTI [t#2] ->t<4KB> + 19e: B.IOTI [], last ->m<4KB> + 1a2: C.BSTART COND, 0xe2 + 1a4: addi a7, 16, ->a7 + 1a8: c.setc.ne a7, a3 + 1aa: C.BSTART.STD + 1ac: c.ldi [sp, 40], ->t + 1ae: add t#1, a5, ->a5 + 1b2: + 1b6: B.ARG VV + 1ba: B.IOTI [m#1] ->t<4KB> + 1be: B.IOTI [], last ->t<4KB> + 1c2: BSTART.TSTORE INT32 + 1c6: C.B.DIMI 16, ->lb0 + 1c8: C.B.DIMI 16, ->lb1 + 1ca: B.ARG NORM.normal + 1ce: B.IOR [a5,a4],[] + 1d2: B.IOTI [t#1], last ->t<4KB> + 1d6: C.BSTART COND, 0x306 + 1d8: addi a0, 1, ->a0 + 1dc: c.setc.eq a0, x1 + 1de: C.BSTART DIRECT, 0x46 + +00000000000001e0 <.LBB0_7>: + 1e0: C.BSTART.STD + 1e2: c.movr zero, ->u + 1e4: c.movi 1, ->s3 + 1e6: c.movi 4, ->s5 + 1e8: c.movi 4, ->t + 1ea: c.sdi t#1, [sp, 72] + 1ec: addi zero, 256, ->t + 1f0: c.sdi t#1, [sp, 64] + 1f2: addi zero, 16, ->t + 1f6: c.sdi t#1, [sp, 16] + 1f8: c.movr u#1, ->s4 + 1fa: sdi u#1, [sp, 96] + 1fe: c.movr u#1, ->a3 + 200: sdi s1, [sp, 56] + 204: sdi a2, [sp, 80] + +0000000000000208 <.LBB0_8>: + 208: C.BSTART.STD + 20a: sdi a3, [sp, 8] + 20e: slliw a3, 4, ->t + 212: c.sdi t#1, [sp, 32] + 214: ldi [sp, 96], ->s2 + +0000000000000218 <.LBB0_9>: + 218: C.BSTART.STD + 21a: sdi s2, [sp, 48] + 21e: slliw s2, 8, ->t + 222: c.sdi t#1, [sp, 88] + 224: c.ldi [sp, 96], ->t + 226: c.movr t#1, ->a2 + 228: c.movr t#1, ->s2 + 22a: c.movr t#1, ->a3 + +000000000000022c <.LBB0_10>: + 22c: C.BSTART.STD + 22e: sdi a2, [sp, 120] + 232: hl.sdip s2, a3, [sp, 104] + 238: ldi [sp, 96], ->s6 + 23c: c.movr s6, ->s1 + 23e: ldi [sp, 80], ->s2 + +0000000000000242 <.LBB0_13>: + 242: HL.BSTART.STD CALL, _ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi, ra=.LBB0_20 + 24a: addw s4, s1, ->a2 + 24e: c.movr s7, ->a0 + 250: c.movr s0, ->a1 + +0000000000000252 <.LBB0_20>: + 252: HL.BSTART.STD CALL, _ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi, ra=.LBB0_21 + 25a: c.movr s3, ->s8 + 25c: c.movr s5, ->s3 + 25e: c.movr s7, ->s5 + 260: c.movr a0, ->s7 + 262: ldi [sp, 120], ->a0 + 266: addw a0, s1, ->a2 + 26a: c.movr s2, ->a0 + 26c: c.movr s0, ->a1 + +000000000000026e <.LBB0_21>: + 26e: C.BSTART COND, 0x242 + 270: mulw a0, s7, ->t + 274: c.movr s5, ->s7 + 276: c.movr s3, ->s5 + 278: c.movr s8, ->s3 + 27a: addw s1, s3, ->s1 + 27e: addw t#1, s6, ->s6 + 282: c.sext.w s1, ->t + 284: c.setc.ne t#1, s5 + 286: HL.BSTART.STD CALL, _ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi, ra=.LBB0_19 + 28e: ldi [sp, 88], ->a0 + 292: ldi [sp, 104], ->s2 + 296: addw s2, a0, ->a2 + 29a: c.ldi [sp, 56], ->t + 29c: c.movr t#1, ->a0 + 29e: c.movr s0, ->a1 + +00000000000002a0 <.LBB0_19>: + 2a0: C.BSTART COND, 0x22c + 2a2: ldi [sp, 72], ->u + 2a6: c.ldi [sp, 120], ->t + 2a8: addw t#1, u#1, ->a2 + 2ac: mulw a0, s6, ->u + 2b0: c.ldi [sp, 112], ->t + 2b2: addw u#1, t#1, ->a3 + 2b6: addw s2, s3, ->s2 + 2ba: addw s2, zero, ->u + 2be: c.ldi [sp, 64], ->t + 2c0: c.setc.ne u#1, t#1 + 2c2: HL.BSTART.STD CALL, _ZN3pto7kernels16store_scalar_i32EPvNS0_9pto_dtypeEii, ra=.LBB0_18 + 2ca: ldi [sp, 32], ->a0 + 2ce: ldi [sp, 48], ->s2 + 2d2: addw s2, a0, ->a2 + 2d6: ldi [sp, 40], ->a0 + 2da: ldi [sp, 24], ->a1 + +00000000000002de <.LBB0_18>: + 2de: C.BSTART COND, 0x218 + 2e0: addw s2, s3, ->s2 + 2e4: addw s2, zero, ->u + 2e8: c.ldi [sp, 16], ->t + 2ea: c.setc.ne u#1, t#1 + 2ec: C.BSTART COND, 0x208 + 2ee: c.ldi [sp, 72], ->t + 2f0: addw s4, t#1, ->s4 + 2f4: c.ldi [sp, 8], ->t + 2f6: addw t#1, s3, ->a3 + 2fa: addw a3, zero, ->u + 2fe: ldi [sp, 80], ->a2 + 302: c.ldi [sp, 64], ->t + 304: c.setc.ne u#1, t#1 + +0000000000000306 <.LBB0_17>: + 306: FRET.STK [ra ~ s8], sp!, 208 + +Disassembly of section .text._ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi: + +0000000000000000 <_ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi>: + 0: FENTRY [ra ~ ra], sp!, 32 + 4: C.BSTART COND, 0x40 + 6: c.movr a0, ->a3 + 8: c.movr zero, ->a0 + a: c.sext.w a1, ->t + c: addw a1, zero, ->a1 + 10: c.movi 3, ->a4 + 12: setc.lt a4, t#1 + 16: C.BSTART COND, 0x5e + 18: setc.eqi a1, 1 + 1c: C.BSTART COND, 0x82 + 1e: setc.eqi a1, 2 + 22: C.BSTART COND, 0x134 + 24: c.setc.ne a1, a4 + 26: HL.BSTART.STD CALL, _ZN3pto17fp8_e4m3_to_floatENS_10fp8_e4m3_tE, ra=.LBB1_21 + 2e: addw a2, zero, ->a0 + 32: lb [a3, a0], ->a0 + 36: sbi a0, [sp, 23] + 3a: addi sp, 23, ->a0 + +000000000000003e <.LBB1_21>: + 3e: C.BSTART DIRECT, 0x12e + +0000000000000040 <.LBB1_5>: + 40: C.BSTART COND, 0x66 + 42: setc.eqi a1, 4 + 46: C.BSTART COND, 0xf8 + 48: setc.eqi a1, 6 + 4c: C.BSTART COND, 0x134 + 4e: setc.nei a1, 5 + 52: C.BSTART.STD + 54: c.sext.w a2, ->t + 56: lw [a3, t#1<<2], ->a0 + 5a: FRET.STK [ra ~ ra], sp!, 32 + +000000000000005e <.LBB1_9>: + 5e: C.BSTART DIRECT, 0x12e + 60: c.sext.w a2, ->t + 62: lwu [a3, t#1<<2], ->a0 + +0000000000000066 <.LBB1_10>: + 66: C.BSTART DIRECT, 0x12e + 68: c.sext.w a2, ->t + 6a: lbu [a3, t#1], ->t + 6e: andi t#1, 15, ->t + 72: slli t#1, 2, ->u + 76: addtpc 0, ->t + 7a: addi t#1, 0, ->t + 7e: lwu [u#1, t#1], ->a0 + +0000000000000082 <.LBB1_11>: + 82: C.BSTART COND, 0x104 + 84: c.sext.w a2, ->t + 86: lh [a3, t#1<<1], ->a3 + 8a: lui 524288, ->t + 8e: andw a3, t#1, ->a0 + 92: hl.lui 65535, ->t + 98: andw a3, t#1, ->a1 + 9c: srliw a3, 10, ->t + a0: andiw t#1, 31, ->a2 + a4: addw a2, zero, ->a4 + a8: setc.eqi a4, 31 + ac: C.BSTART COND, 0x116 + ae: andiw a3, 1023, ->a3 + b2: c.movr zero, ->a5 + b4: c.setc.ne a4, a5 + b6: C.BSTART COND, 0x12e + b8: c.sext.w a3, ->t + ba: c.setc.eq t#1, a5 + bc: C.BSTART DIRECT, 0x128 + be: clz a3, 31, ->a2 + c2: xoriw a2, 31, ->u + c6: c.movi 9, ->t + c8: subw t#1, u#1, ->t + cc: addi zero, 32, ->u + d0: sll t#1, u#1, ->t + d4: srl t#1, u#1, ->t + d8: sllw a1, t#1, ->t + dc: slliw t#1, 14, ->u + e0: lui 2044, ->t + e4: andw u#1, t#1, ->u + e8: slliw a2, 23, ->t + ec: subw u#1, t#1, ->t + f0: addw t#1, a0, ->a0 + f4: lui 274432, ->a1 + +00000000000000f8 <.LBB1_15>: + f8: C.BSTART.STD + fa: c.sext.w a2, ->t + fc: lb [a3, t#1], ->a0 + 100: FRET.STK [ra ~ ra], sp!, 32 + +0000000000000104 <.LBB1_16>: + 104: C.BSTART DIRECT, 0x12e + 106: slliw a1, 13, ->t + 10a: orw a0, t#1, ->u + 10e: lui 522240, ->t + 112: orw u#1, t#1, ->a0 + +0000000000000116 <.LBB1_17>: + 116: C.BSTART.STD + 118: slliw a3, 13, ->t + 11c: orw t#1, a2<<23, ->t + 120: addw t#1, a0, ->a0 + 124: lui 229376, ->a1 + +0000000000000128 <.LBB1_18>: + 128: C.BSTART.STD + 12a: addw a0, a1, ->a0 + +000000000000012e <.LBB1_19>: + 12e: C.BSTART.STD + 130: fcvtz.fs2sw a0, ->a0 + +0000000000000134 <.LBB1_20>: + 134: FRET.STK [ra ~ ra], sp!, 32 + +Disassembly of section .text._ZN3pto7kernels16store_scalar_i32EPvNS0_9pto_dtypeEii: + +0000000000000000 <_ZN3pto7kernels16store_scalar_i32EPvNS0_9pto_dtypeEii>: + 0: FENTRY [ra ~ s8], sp!, 96 + 4: C.BSTART COND, 0x5e + 6: c.sext.w a1, ->t + 8: addw a1, zero, ->a1 + c: c.movi 3, ->a4 + e: setc.lt a4, t#1 + 12: C.BSTART COND, 0x7c + 14: setc.eqi a1, 1 + 18: C.BSTART COND, 0x114 + 1a: setc.eqi a1, 2 + 1e: C.BSTART COND, 0x110 + 20: c.setc.ne a1, a4 + 22: C.BSTART COND, 0x1c4 + 24: addw a3, zero, ->a1 + 28: setc.eqi a1, 0 + 2c: C.BSTART COND, 0x1c8 + 2e: c.movr a2, ->s0 + 30: c.movr a0, ->s1 + 32: c.movr a3, ->s3 + 34: scvtf.sw2fs a3, ->a0 + 38: lui 524288, ->t + 3c: xorw a0, t#1, ->u + 40: cmp.lti a1, 0, ->t + 44: csel t#1, a0, u#1, ->s4 + 48: hl.lwu.pcr [.rodata.cst4+0x4], ->t + 4e: fge.fs s4, t#1, ->t + 52: xori t#1, 1, ->t + 56: c.setc.eq t#1, zero + 58: C.BSTART DIRECT, 0x212 + 5a: c.movr zero, ->s2 + 5c: c.movr s4, ->a0 + +000000000000005e <.LBB2_4>: + 5e: C.BSTART COND, 0x8c + 60: setc.eqi a1, 4 + 64: C.BSTART COND, 0x162 + 66: setc.eqi a1, 6 + 6a: C.BSTART COND, 0x110 + 6c: setc.nei a1, 5 + 70: C.BSTART.STD + 72: c.sext.w a2, ->t + 74: sw a3, [a0, t#1<<2] + 78: FRET.STK [ra ~ s8], sp!, 96 + +000000000000007c <.LBB2_9>: + 7c: C.BSTART.STD + 7e: scvtf.sw2fs a3, ->u + 82: c.sext.w a2, ->t + 84: sw u#1, [a0, t#1<<2] + 88: FRET.STK [ra ~ s8], sp!, 96 + +000000000000008c <.LBB2_46>: + 8c: C.BSTART.STD + 8e: c.movr a2, ->s2 + 90: c.movr a0, ->s1 + 92: scvtf.sw2fs a3, ->a0 + 96: lui 524288, ->t + 9a: xorw a0, t#1, ->u + 9e: c.sext.w a3, ->t + a0: cmp.lti t#1, 0, ->t + a4: csel t#1, a0, u#1, ->s4 + a8: c.movr zero, ->s3 + aa: c.movi 1, ->s5 + ac: addtpc 0, ->t + b0: addi t#1, 0, ->s6 + b4: addi zero, 32, ->s7 + b8: sll a0, s7, ->t + bc: srl t#1, s7, ->s0 + c0: addi zero, 16, ->s8 + +00000000000000c4 <.LBB2_47>: + c4: HL.BSTART.STD CALL, __subsf3, ra=.LBB2_58 + cc: lwi [s6, 0], ->a0 + d0: sll a0, s7, ->a0 + d4: srl a0, s7, ->a1 + d8: c.movr s0, ->a0 + +00000000000000da <.LBB2_58>: + da: C.BSTART COND, 0xc4 + dc: lui 524288, ->t + e0: xorw a0, t#1, ->u + e4: hl.lwu.pcr [.rodata.cst4], ->t + ea: flt.fs a0, t#1, ->t + ee: csel t#1, a0, u#1, ->u + f2: flt.fs u#1, s4, ->t + f6: csel t#1, s4, u#1, ->s4 + fa: csel t#1, s3, s5, ->s3 + fe: addi s6, 4, ->s6 + 102: addi s5, 1, ->s5 + 106: c.setc.ne s5, s8 + 108: C.BSTART.STD + 10a: c.sext.w s2, ->t + 10c: sb s3, [s1, t#1] + +0000000000000110 <.LBB2_49>: + 110: FRET.STK [ra ~ s8], sp!, 96 + +0000000000000114 <.LBB2_10>: + 114: C.BSTART COND, 0x16e + 116: scvtf.sw2fs a3, ->u + 11a: srliw u#1, 16, ->u + 11e: lui 8, ->t + 122: andw u#1, t#1, ->a1 + 126: hl.lui 8388607, ->t + 12c: andw u#2, t#1, ->a3 + 130: srliw u#2, 23, ->a5 + 134: andiw a5, 255, ->a4 + 138: addw a4, zero, ->a7 + 13c: setc.nei a7, 255 + 140: C.BSTART.STD + 142: c.sext.w a3, ->t + 144: cmp.eqi t#1, 0, ->a3 + 148: addiw zero, 512, ->u + 14c: c.movr zero, ->t + 14e: csel a3, u#1, t#1, ->t + 152: orw t#1, a1, ->a1 + +0000000000000156 <.LBB2_16>: + 156: C.BSTART DIRECT, 0x3aa + 158: hl.lui 31744, ->t + 15e: orw a1, t#1, ->a1 + +0000000000000162 <.LBB2_8>: + 162: C.BSTART.STD + 164: c.sext.w a2, ->t + 166: sb a3, [a0, t#1] + 16a: FRET.STK [ra ~ s8], sp!, 96 + +000000000000016e <.LBB2_12>: + 16e: C.BSTART COND, 0x2d6 + 170: addi zero, 32, ->a6 + 174: sll a4, a6, ->t + 178: srl t#1, a6, ->x0 + 17c: setc.geui x0, 113 + 180: C.BSTART COND, 0x3aa + 182: setc.ltui x0, 102 + 186: C.BSTART DIRECT, 0x3a4 + 188: lui 2048, ->t + 18c: orw a3, t#1, ->a3 + 190: addiw zero, 126, ->t + 194: subw t#1, a4, ->u + 198: addiw zero, 125, ->t + 19c: subw t#1, a4, ->t + 1a0: sll t#1, a6, ->t + 1a4: srl t#1, a6, ->u + 1a8: sll u#2, a6, ->t + 1ac: srl t#1, a6, ->t + 1b0: srlw a3, t#1, ->u + 1b4: srlw a3, u#2, ->t + 1b8: andiw t#1, 1, ->t + 1bc: addw t#1, u#1, ->t + 1c0: andiw t#1, 1023, ->a3 + +00000000000001c4 <.LBB2_24>: + 1c4: C.BSTART DIRECT, 0x3de + 1c6: c.movr zero, ->a1 + +00000000000001c8 <.LBB2_31>: + 1c8: C.BSTART.STD + 1ca: c.movr zero, ->s8 + 1cc: addi zero, 32, ->s5 + 1d0: c.movi 1, ->s6 + 1d2: addi zero, 29, ->s7 + 1d6: c.movr s4, ->a0 + +00000000000001d8 <.LBB2_32>: + 1d8: HL.BSTART.STD CALL, __mulsf3, ra=.LBB2_57 + 1e0: sll a0, s5, ->a0 + 1e4: srl a0, s5, ->a0 + 1e8: lui 258048, ->a1 + +00000000000001ec <.LBB2_57>: + 1ec: C.BSTART COND, 0x212 + 1ee: addw s8, s6, ->s2 + 1f2: hl.lwu.pcr [.rodata.cst4+0x4], ->t + 1f8: fge.fs a0, t#1, ->t + 1fc: xori t#1, 1, ->t + 200: c.setc.ne t#1, zero + 202: C.BSTART COND, 0x1d8 + 204: sll s8, s5, ->t + 208: srl t#1, s5, ->t + 20c: setc.ltu t#1, s7 + 210: c.movr s2, ->s8 + +0000000000000212 <.LBB2_27>: + 212: C.BSTART COND, 0x2c6 + 214: hl.lwu.pcr [.rodata.cst4+0x8], ->t + 21a: flt.fs a0, t#1, ->t + 21e: xori t#1, 1, ->t + 222: c.setc.ne t#1, zero + 224: C.BSTART.STD + 226: addi zero, 32, ->s5 + 22a: c.movi -1, ->s6 + 22c: subi zero, 29, ->s7 + +0000000000000230 <.LBB2_29>: + 230: HL.BSTART.STD CALL, __addsf3, ra=.LBB2_56 + 238: sll a0, s5, ->a0 + 23c: srl a0, s5, ->a0 + 240: c.movr a0, ->a1 + +0000000000000242 <.LBB2_56>: + 242: C.BSTART COND, 0x262 + 244: addw s2, s6, ->a1 + 248: hl.lwu.pcr [.rodata.cst4+0x8], ->t + 24e: flt.fs a0, t#1, ->t + 252: xori t#1, 1, ->t + 256: c.setc.ne t#1, zero + 258: C.BSTART COND, 0x230 + 25a: c.sext.w s2, ->t + 25c: setc.lt s7, t#1 + 260: c.movr a1, ->s2 + +0000000000000262 <.LBB2_34>: + 262: C.BSTART COND, 0x30c + 264: c.movi 7, ->s2 + 266: c.movi 1, ->t + 268: addw a1, t#1, ->u + 26c: sraiw s3, 31, ->t + 270: andiw t#1, -128, ->s3 + 274: addw u#1, zero, ->u + 278: c.movi -6, ->t + 27a: setc.lt t#1, u#1 + 27e: HL.BSTART.STD CALL, __mulsf3, ra=.LBB2_54 + 286: sll s4, s5, ->a0 + 28a: srl a0, s5, ->a0 + 28e: lui 278528, ->a1 + +0000000000000292 <.LBB2_54>: + 292: HL.BSTART.STD CALL, __addsf3, ra=.LBB2_55 + 29a: sll a0, s5, ->a0 + 29e: srl a0, s5, ->a0 + 2a2: lui 258048, ->a1 + +00000000000002a6 <.LBB2_55>: + 2a6: C.BSTART DIRECT, 0x326 + 2a8: fcvtz.fs2sw a0, ->a0 + 2ac: c.sext.w a0, ->t + 2ae: cmp.gei t#1, 1, ->u + 2b2: c.movr zero, ->t + 2b4: csel u#1, t#1, a0, ->u + 2b8: c.sext.w u#1, ->t + 2ba: cmp.lti t#1, 7, ->t + 2be: csel t#1, s2, u#1, ->t + 2c2: orw s3, t#1, ->a1 + +00000000000002c6 <.LBB2_37>: + 2c6: C.BSTART DIRECT, 0x314 + 2c8: c.movi 7, ->t + 2ca: addw s2, t#1, ->s4 + 2ce: sraiw s3, 31, ->t + 2d2: andiw t#1, -128, ->s3 + +00000000000002d6 <.LBB2_15>: + 2d6: C.BSTART COND, 0x156 + 2d8: setc.geui x0, 143 + 2dc: C.BSTART COND, 0x38e + 2de: sll a3, a6, ->t + 2e2: srl t#1, a6, ->u + 2e6: lui 2047, ->t + 2ea: setc.geu u#1, t#1 + 2ee: C.BSTART DIRECT, 0x3a4 + 2f0: slliw a4, 10, ->u + 2f4: lui 4, ->t + 2f8: addw u#1, t#1, ->u + 2fc: lui 1, ->t + 300: addw a3, t#1, ->t + 304: srliw t#1, 13, ->t + 308: orw t#1, u#1, ->a3 + +000000000000030c <.LBB2_35>: + 30c: C.BSTART.STD + 30e: addw a1, s2, ->s4 + 312: c.movr a1, ->s2 + +0000000000000314 <.LBB2_38>: + 314: C.BSTART COND, 0x32c + 316: addw s2, zero, ->s5 + 31a: c.movi 8, ->s6 + 31c: setc.lt s5, s6 + 320: C.BSTART.STD + 322: oriw s3, 126, ->a1 + +0000000000000326 <.LBB2_40>: + 326: C.BSTART DIRECT, 0x3de + 328: c.movr s1, ->a0 + 32a: c.movr s0, ->a2 + +000000000000032c <.LBB2_41>: + 32c: HL.BSTART.STD CALL, __addsf3, ra=.LBB2_51 + 334: addi zero, 32, ->s7 + 338: hl.lui 3212836864, ->a1 + 33e: bxu a1, 31, ->a1 + 342: sll a0, s7, ->a0 + 346: srl a0, s7, ->a0 + +000000000000034a <.LBB2_51>: + 34a: HL.BSTART.STD CALL, __mulsf3, ra=.LBB2_52 + 352: sll a0, s7, ->a0 + 356: srl a0, s7, ->a0 + 35a: lui 266240, ->a1 + +000000000000035e <.LBB2_52>: + 35e: HL.BSTART.STD CALL, __addsf3, ra=.LBB2_53 + 366: sll a0, s7, ->a0 + 36a: srl a0, s7, ->a0 + 36e: lui 258048, ->a1 + +0000000000000372 <.LBB2_53>: + 372: C.BSTART COND, 0x3c0 + 374: fcvtz.fs2sw a0, ->a1 + 378: c.sext.w a1, ->t + 37a: setc.lt t#1, s6 + 37e: c.movr s1, ->a0 + 380: c.movr s0, ->a2 + 382: C.BSTART COND, 0x3b6 + 384: setc.nei s5, 7 + 388: C.BSTART DIRECT, 0x3de + 38a: oriw s3, 126, ->a1 + +000000000000038e <.LBB2_18>: + 38e: C.BSTART COND, 0x156 + 390: setc.eqi a7, 142 + 394: C.BSTART.STD + 396: slliw a5, 10, ->u + 39a: hl.lui 17408, ->t + 3a0: addw u#1, t#1, ->a3 + +00000000000003a4 <.LBB2_21>: + 3a4: C.BSTART.STD + 3a6: orw a3, a1, ->a1 + +00000000000003aa <.LBB2_22>: + 3aa: C.BSTART.STD + 3ac: c.sext.w a2, ->t + 3ae: sh a1, [a0, t#1<<1] + 3b2: FRET.STK [ra ~ s8], sp!, 96 + +00000000000003b6 <.LBB2_43>: + 3b6: C.BSTART.STD + 3b8: c.movr zero, ->a1 + 3ba: c.movi 8, ->t + 3bc: addw s2, t#1, ->s4 + +00000000000003c0 <.LBB2_44>: + 3c0: C.BSTART.STD + 3c2: c.sext.w a1, ->t + 3c4: cmp.gei t#1, 1, ->u + 3c8: c.movr zero, ->t + 3ca: csel u#1, t#1, a1, ->u + 3ce: andiw s3, 255, ->u + 3d2: slliw s4, 3, ->t + 3d6: orw t#1, u#1, ->t + 3da: orw t#1, u#2, ->a1 + +00000000000003de <.LBB2_45>: + 3de: C.BSTART.STD + 3e0: c.sext.w a2, ->t + 3e2: sb a1, [a0, t#1] + 3e6: FRET.STK [ra ~ s8], sp!, 96 + +Disassembly of section .text._ZN3pto17fp8_e4m3_to_floatENS_10fp8_e4m3_tE: + +0000000000000000 <_ZN3pto17fp8_e4m3_to_floatENS_10fp8_e4m3_tE>: + 0: FENTRY [ra ~ s5], sp!, 64 + 4: C.BSTART COND, 0x5a + 6: lbui [a0, 0], ->u + a: slliw u#1, 24, ->t + e: sraiw t#1, 24, ->t + 12: c.sext.w t#1, ->t + 14: cmp.gei t#1, 0, ->u + 18: c.movr zero, ->s1 + 1a: c.movi 4, ->t + 1c: csel u#1, s1, t#1, ->u + 20: addtpc 0, ->t + 24: addi t#1, 0, ->t + 28: lwu [t#1, u#1], ->s0 + 2c: andiw u#3, 7, ->s3 + 30: srliw u#3, 3, ->t + 34: andiw t#1, 15, ->s4 + 38: c.sext.w s4, ->t + 3a: c.setc.eq t#1, s1 + 3c: C.BSTART COND, 0xe0 + 3e: setc.nei s4, 15 + 42: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_27 + 4a: addi zero, 32, ->a0 + 4e: bxu s0, 31, ->a0 + 52: lui 276224, ->a1 + +0000000000000056 <.LBB3_27>: + 56: FRET.STK [ra ~ s5], sp!, 64 + +000000000000005a <.LBB3_3>: + 5a: C.BSTART COND, 0x1be + 5c: c.sext.w s3, ->t + 5e: c.setc.eq t#1, s1 + 60: C.BSTART.STD + 62: hl.lwu.pcr [.rodata.cst4+0xc], ->a0 + 68: c.movi 6, ->s4 + 6a: addi zero, 32, ->s2 + 6e: c.movi -1, ->s5 + +0000000000000070 <.LBB3_5>: + 70: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_26 + 78: sll a0, s2, ->a0 + 7c: srl a0, s2, ->a0 + 80: lui 258048, ->a1 + +0000000000000084 <.LBB3_26>: + 84: C.BSTART COND, 0x70 + 86: addw s4, s5, ->s4 + 8a: c.sext.w s4, ->t + 8c: c.setc.ne t#1, s1 + 8e: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_23 + 96: ucvtf.uw2fs s3, ->a1 + 9a: sll a1, s2, ->a1 + 9e: srl a1, s2, ->a2 + a2: lui 253952, ->a1 + a6: c.movr a0, ->s1 + a8: c.movr a2, ->a0 + +00000000000000aa <.LBB3_23>: + aa: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_24 + b2: sll s0, s2, ->a1 + b6: srl a1, s2, ->a2 + ba: sll a0, s2, ->a0 + be: srl a0, s2, ->a1 + c2: c.movr a2, ->a0 + +00000000000000c4 <.LBB3_24>: + c4: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_25 + cc: sll s1, s2, ->a1 + d0: sll a0, s2, ->a0 + d4: srl a1, s2, ->a1 + d8: srl a0, s2, ->a0 + +00000000000000dc <.LBB3_25>: + dc: FRET.STK [ra ~ s5], sp!, 64 + +00000000000000e0 <.LBB3_7>: + e0: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_20 + e8: ucvtf.uw2fs s3, ->a0 + ec: addi zero, 32, ->s2 + f0: sll a0, s2, ->a0 + f4: srl a0, s2, ->a0 + f8: lui 253952, ->a1 + +00000000000000fc <.LBB3_20>: + fc: HL.BSTART.STD CALL, __addsf3, ra=.LBB3_21 + 104: sll a0, s2, ->a0 + 108: srl a0, s2, ->a0 + 10c: lui 260096, ->a1 + +0000000000000110 <.LBB3_21>: + 110: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_22 + 118: sll s0, s2, ->a1 + 11c: srl a1, s2, ->a2 + 120: sll a0, s2, ->a0 + 124: srl a0, s2, ->a1 + 128: c.movr a2, ->a0 + +000000000000012a <.LBB3_22>: + 12a: C.BSTART COND, 0x174 + 12c: c.movr a0, ->s0 + 12e: c.movi -7, ->t + 130: addw s4, t#1, ->s3 + 134: sll s4, s2, ->t + 138: srl t#1, s2, ->t + 13c: setc.ltui t#1, 7 + 140: C.BSTART COND, 0x1a0 + 142: hl.lwu.pcr [.rodata.cst4+0xc], ->a0 + 148: c.sext.w s3, ->t + 14a: c.setc.eq t#1, s1 + 14c: C.BSTART.STD + 14e: hl.lwu.pcr [.rodata.cst4+0xc], ->a0 + 154: c.movi -1, ->s4 + +0000000000000156 <.LBB3_12>: + 156: HL.BSTART.STD CALL, __addsf3, ra=.LBB3_19 + 15e: sll a0, s2, ->a0 + 162: srl a0, s2, ->a0 + 166: c.movr a0, ->a1 + +0000000000000168 <.LBB3_19>: + 168: C.BSTART COND, 0x1a0 + 16a: addw s3, s4, ->s3 + 16e: c.sext.w s3, ->t + 170: c.setc.eq t#1, s1 + 172: C.BSTART DIRECT, 0x156 + +0000000000000174 <.LBB3_8>: + 174: C.BSTART.STD + 176: hl.lwu.pcr [.rodata.cst4+0xc], ->a0 + 17c: c.movi 1, ->s1 + 17e: c.movi 1, ->s4 + +0000000000000180 <.LBB3_9>: + 180: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_18 + 188: sll a0, s2, ->a0 + 18c: srl a0, s2, ->a0 + 190: lui 258048, ->a1 + +0000000000000194 <.LBB3_18>: + 194: C.BSTART COND, 0x180 + 196: addw s3, s1, ->s3 + 19a: c.sext.w s3, ->t + 19c: c.cmp.eqi 0, ->t + 19e: c.setc.ne t#1, s4 + +00000000000001a0 <.LBB3_13>: + 1a0: HL.BSTART.STD CALL, __mulsf3, ra=.LBB3_17 + 1a8: sll s0, s2, ->a1 + 1ac: srl a1, s2, ->a2 + 1b0: sll a0, s2, ->a0 + 1b4: srl a0, s2, ->a1 + 1b8: c.movr a2, ->a0 + +00000000000001ba <.LBB3_17>: + 1ba: FRET.STK [ra ~ s5], sp!, 64 + +00000000000001be <.LBB3_16>: + 1be: C.BSTART.STD + 1c0: lui 524288, ->t + 1c4: andw s0, t#1, ->a0 + 1c8: FRET.STK [ra ~ s5], sp!, 64 diff --git a/samples/gemm/README.md b/samples/gemm/README.md new file mode 100644 index 0000000..f80c5b2 --- /dev/null +++ b/samples/gemm/README.md @@ -0,0 +1,18 @@ +# GEMM Sample + +`gemm_avs_tile_smoke.diss` is a checked-in `llvm-objdump -dl` disassembly of a +compiler-produced Linx object for an AVS tile smoke GEMM case. + +Related SuperNPUBench sources: + +| Path | Role | +| --- | --- | +| [`../../benchmarks/npu/vec_simd/gemm_18x128x256`](../../benchmarks/npu/vec_simd/gemm_18x128x256) | Active NPU GEMM benchmark case. | +| [`../../benchmarks/kernels/composite/src/gemm.cpp`](../../benchmarks/kernels/composite/src/gemm.cpp) | Composite GEMM benchmark entrypoint. | +| [`../../benchmarks/kernels/gemm/matmul`](../../benchmarks/kernels/gemm/matmul) | Matmul/GEMM kernel benchmark suite. | + +Regenerate from a compatible Linx compiler object with: + +```sh +llvm-objdump -dl gemm.o > gemm_avs_tile_smoke.diss +``` diff --git a/samples/gemm/gemm_avs_tile_smoke.diss b/samples/gemm/gemm_avs_tile_smoke.diss new file mode 100644 index 0000000..a873ec9 --- /dev/null +++ b/samples/gemm/gemm_avs_tile_smoke.diss @@ -0,0 +1,48 @@ + +generated/avs-tile-smoke/compiler/avs/obj/gemm.o: file format elf64-linx + +Disassembly of section .text: + +0000000000000000 : +; gemm_i32(): + 0: 41 00 d5 0a FENTRY [ra ~ s2], sp!, 40 + 4: 00 08 C.BSTART.STD + 6: 06 28 c.movr zero, ->a3 + 8: 15 03 00 04 addi zero, 64, ->a4 + c: 95 03 00 01 addi zero, 16, ->a5 + 10: 46 41 c.movr a3, ->a6 + +0000000000000012 <.LBB0_1>: +; .LBB0_1(): + 12: 00 08 C.BSTART.STD + 14: 95 7f 64 00 slli a6, 6, ->t + 18: 85 04 82 07 add a2, t#1, ->a7 + 1c: c6 a0 c.movr a1, ->x0 + 1e: 46 a9 c.movr a3, ->x1 + +0000000000000020 <.LBB0_2>: +; .LBB0_2(): + 20: 00 08 C.BSTART.STD + 22: 09 ab 54 17 lw [a7, x1<<2], ->x2 + 26: 95 ff 2a 00 slli x1, 2, ->t + 2a: 85 8b 84 07 add a7, t#1, ->x3 + 2e: 46 59 c.movr a3, ->s0 + +0000000000000030 <.LBB0_5>: +; .LBB0_5(): + 30: 04 00 C.BSTART COND, 0x30 + 32: 09 2f b1 06 lw [a0, s0], ->u + 36: 89 2f ba 06 lw [x0, s0], ->t + 3a: 47 7b cc b1 maddw t#1, u#1, x2, ->x2 + 3e: 95 85 45 00 addi s0, 4, ->s0 + 42: f6 32 c.setc.ne s0, a4 + 44: e4 fe C.BSTART COND, 0x20 + 46: 59 20 7b 01 swi x2, [x3, 0] + 4a: 15 0a 0a 04 addi x0, 64, ->x0 + 4e: 95 8a 1a 00 addi x1, 1, ->x1 + 52: 76 3d c.setc.ne x1, a5 + 54: f4 fd C.BSTART COND, 0x12 + 56: 15 01 01 04 addi a0, 64, ->a0 + 5a: 15 04 14 00 addi a6, 1, ->a6 + 5e: 36 3a c.setc.ne a6, a5 + 60: 41 30 d5 0a FRET.STK [ra ~ s2], sp!, 40 diff --git a/test/README.md b/test/README.md deleted file mode 100644 index ce041c4..0000000 --- a/test/README.md +++ /dev/null @@ -1,101 +0,0 @@ -# Test Navigation - -The `test` tree contains focused API tests, kernel and accelerator suites, -Python golden-comparison tests, and batch scripts. Most make-driven suites -reuse [`common/Makefile.common`](common/Makefile.common), so the same -`TESTCASE`, `PLAT`, `COMPILER_DIR`, and `QEMU` variables work across many -directories. - -## Directory Map - -| Path | Use it for | -| --- | --- | -| [`common`](common) | Shared make rules, platform flags, output layout, and simulator targets. | -| [`tileop_api`](tileop_api) | Small TileOP API tests. This is the best first stop for validating an individual API operation. | -| [`py_api`](py_api) | Python extension build and golden-comparison tests. | -| [`accelerator`](accelerator) | Accelerator-oriented suites such as cube, vector, DMA, fusion, and versioned target tests. | -| [`kernel`](kernel) | Kernel suites for control, element-wise, fusion, GEMM, memory, reduction, sort, and related cases. | -| [`other`](other) | Additional model, microbenchmark, TileOP, vector, and script-driven suites. | -| [`script`](script) | Recursive compile/run helper for larger batch workflows. | - -## Common Build Pattern - -```sh -cd test/tileop_api -make clean -make TESTCASE=TAdd PLAT=cpu COMPILER_DIR=/usr/bin -make TESTCASE=TAdd PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin -make TESTCASE=TAdd PLAT=linx QEMU=/path/to/qemu-linx sim -``` - -Platform values: - -| Platform | Backend | -| --- | --- | -| `PLAT=cpu` | CPU simulation backend with `__cpu_sim__`. | -| `PLAT=linx` | Linx target backend with `__linx`. | -| `PLAT=arm_sme` | Arm SME-oriented backend with `__ARM_FEATURE_SME`. | - -Common targets: - -```sh -make TESTCASE= all -make TESTCASE= diss -make TESTCASE= sim -make TESTCASE= debug -make clean -make clean_all -``` - -Build products are written below the repository-level `output/` directory. - -## Batch Runs - -Several suites include a local `compile.all` file. Run it from the suite -directory: - -```sh -cd test/tileop_api && bash compile.all -cd test/py_api && bash compile.all -cd test/kernel/gemm/matmul && bash compile.all -cd test/accelerator/vec_simt && bash compile.all -``` - -For recursive compile/run automation, see [`script/README.md`](script/README.md). - -## Python Golden Comparison - -```sh -cd test/py_api -make clean -make TESTCASE=tileop_py -python3 golden_cmp/golden_cmp.py -i tadd -``` - -For adding golden-comparison cases, see -[`py_api/golden_cmp/README.md`](py_api/golden_cmp/README.md). - -## Adding A Test Case - -For an existing make-driven suite: - -1. Add the source file under that suite's `src/` directory. -2. Set `SRC_FILE`, `TARGET`, and any suite-specific variables in the local - `Makefile`. -3. Include [`common/Makefile.common`](common/Makefile.common). -4. Add the case to the local `compile.all` file if it belongs in batch runs. - -Minimal local makefile shape: - -```make -SRC_FILE += $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp -TARGET = $(ELF_HEAD)_$(TESTCASE).elf -include ../common/Makefile.common -``` - -Adjust the relative include path when the suite is nested more deeply. - -For a new suite, create a directory with `src/`, a small local `Makefile`, and -an optional `compile.all` batch entrypoint. - -Back to the repository overview: [`../README.md`](../README.md). diff --git a/test/accelerator/vec_simt/compile.all b/test/accelerator/vec_simt/compile.all deleted file mode 100755 index 0991a56..0000000 --- a/test/accelerator/vec_simt/compile.all +++ /dev/null @@ -1,5 +0,0 @@ -#! /bin/bash - -make TESTCASE=accel_hashtable_insert_cmp_host -make TESTCASE=accel_hashtable_lookup_cmp_host -make TESTCASE=hashfind diff --git a/test/accelerator/vec_simt/hashfind/data_obj/.gitignore b/test/accelerator/vec_simt/hashfind/data_obj/.gitignore deleted file mode 100644 index b72b9e3..0000000 --- a/test/accelerator/vec_simt/hashfind/data_obj/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -*.s -*.o \ No newline at end of file diff --git a/test/kernel/control/compile.all b/test/kernel/control/compile.all deleted file mode 100755 index 9772a2f..0000000 --- a/test/kernel/control/compile.all +++ /dev/null @@ -1,12 +0,0 @@ -#! /bin/bash -for debug in on off; do - if [[ "$debug" == "on" ]]; then - debug_define="" - else - debug_define="-DFOR_GFSIM" - fi - make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum6144_kNumThreads6144_kMaxProbe512_break_debug_${debug} EXTRA_DEFINES="-DkNum=6144 -DkNumThreads=6144 -DMAX_PROBE=512 ${debug_define}" diss - for num_col in 256 512 1024; do - make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col${num_col}_debug_${debug} EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=${num_col} ${debug_define}" diss - done -done diff --git a/test/kernel/control/hashtable_lookup_simd/data_obj/.gitignore b/test/kernel/control/hashtable_lookup_simd/data_obj/.gitignore deleted file mode 100644 index b72b9e3..0000000 --- a/test/kernel/control/hashtable_lookup_simd/data_obj/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -*.s -*.o \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile.sh b/test/kernel/orther/accelerator_compile.sh deleted file mode 100755 index 414d95a..0000000 --- a/test/kernel/orther/accelerator_compile.sh +++ /dev/null @@ -1,10 +0,0 @@ -#! /bin/bash - -./accelerator_compile_new/compile_matmul.all -./accelerator_compile_new/compile_matmul_reuseA.all -./accelerator_compile_new/compile_matmul_reuseB.all -./accelerator_compile_new/compile_matmul_reuseAB.all - -./accelerator_compile_new/compile_matmul_dynamic.all -./accelerator_compile_new/compile_matmul_dynamic_reuseA.all -./accelerator_compile_new/compile_matmul_dynamic_reuseB.all \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul.all b/test/kernel/orther/accelerator_compile/compile_matmul.all deleted file mode 100755 index 03a6f46..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8 M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic.all deleted file mode 100755 index b7dc3b1..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuse.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuse.all deleted file mode 100644 index 48de0d9..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuse.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseA.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseA.all deleted file mode 100755 index a0ada86..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseA.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseB.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseB.all deleted file mode 100755 index 9fdd451..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseB.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_reuseA.all b/test/kernel/orther/accelerator_compile/compile_matmul_reuseA.all deleted file mode 100755 index f5a39cb..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_reuseA.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_reuseAB.all b/test/kernel/orther/accelerator_compile/compile_matmul_reuseAB.all deleted file mode 100755 index c313c6f..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_reuseAB.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_reuseB.all b/test/kernel/orther/accelerator_compile/compile_matmul_reuseB.all deleted file mode 100755 index 893c7fe..0000000 --- a/test/kernel/orther/accelerator_compile/compile_matmul_reuseB.all +++ /dev/null @@ -1,110 +0,0 @@ -#! /bin/bash - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=256 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=2048 K=2048 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048 N=2048 K=2048 tM=64 tK=64 tN=64 - - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=256 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=256 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=256 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=256 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=256 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=256 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=256 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=256 tK=64 tN=64 - -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=256 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=256 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256 N=777 K=777 tM=64 tK=64 tN=64 -make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777 N=777 K=777 tM=64 tK=64 tN=64 \ No newline at end of file diff --git a/test/kernel/sort/topk/.gitignore b/test/kernel/sort/topk/.gitignore deleted file mode 100644 index f406623..0000000 --- a/test/kernel/sort/topk/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -*.o -*.s \ No newline at end of file diff --git a/test/other/scripts/bench_all.sh b/test/other/scripts/bench_all.sh deleted file mode 100755 index d0cf362..0000000 --- a/test/other/scripts/bench_all.sh +++ /dev/null @@ -1,42 +0,0 @@ -#!/bin/bash -set -e -set -x -set -o pipefail - -cd $(dirname $0)/../.. - -export CC_OPT=default - -rm -rf output/matmul_compile_output -python3 test/ascpp/run_compile.py test/ascpp/matmul -o output/matmul_compile_output -f test/ascpp/filter.conf | tee bench.log -Total=$(grep "Total test cases:" bench.log | awk '{print $4}') -PASS=$(grep "Suaccelssful:" bench.log | awk '{print $2}') -if [[ x"$Total" != x"$PASS" ]]; then - exit 1 -fi - -rm -rf output/fa_normal_compile_output -python3 test/ascpp/run_compile.py test/ascpp/fa -o output/fa_normal_compile_output -f test/ascpp/filter.conf | tee bench.log -Total=$(grep "Total test cases:" bench.log | awk '{print $4}') -PASS=$(grep "Suaccelssful:" bench.log | awk '{print $2}') -if [[ x"$Total" != x"$PASS" ]]; then - exit 1 -fi - -python3 test/scripts/run_compile.py -ERRS=$(grep fail: cm_log/compile_summary.log | awk '{print $2}') -PASS=$(($ERRS <= 4)) -if [[ x"$PASS" != x"1" ]]; then - cat cm_log/compile_summary.log - exit 1 -fi - -# ELF_LIST="output/tileop_test/elf/*.elf output/lmbench/elf/*.elf output/kernel/elf/*.elf output/deepseek/elf/*.elf" -# -# realpath $ELF_LIST > tmp.list -# -# if [[ -f $QEMU ]]; then -# ARGS="$ARGS -m $QEMU" -# fi -# python3 test/scripts/run_qemu.py -i tmp.list -o cm_log/qemu_run.log $ARGS -# rm -f tmp.list diff --git a/test/other/tileop_api/src/MatMul_e4m3.cpp b/test/other/tileop_api/src/MatMul_e4m3.cpp deleted file mode 100644 index 7237b96..0000000 --- a/test/other/tileop_api/src/MatMul_e4m3.cpp +++ /dev/null @@ -1,89 +0,0 @@ -#include -#include "../data.hpp" -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void __vec__ test_cvt(typename TA::TileDType __out__ a, - typename TB::TileDType __in__ b) { - using AType = typename TA::DType; - using BType = typename TB::DType; - BType *pb = blkv_get_tile_ptr(b); - AType *pa = blkv_get_tile_ptr(a); - int x = blkv_get_index_x(); - int y = blkv_get_index_y(); - int idx = index(y, x); - AType o = (AType)(pb[idx]); - pa[idx] = o; -} - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape_A = global_tensor>; - using gm_shape_B = global_tensor>; - using gm_shape_C = global_tensor>; - - using tile_shape_A = TileLeft; - using tile_shape_B = TileRight; - using tile_shape_C = TileAcc; - using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>; - using tile_shape_LB = TileRight<__fp8_e4m3, K, N>; - - gm_shape_A s0(src0); - gm_shape_B s1(src1); - gm_shape_C res(dst); - - tile_shape_A d0; - tile_shape_B d1; - tile_shape_C d2; - tile_shape_LA lda; - tile_shape_LB ldb; - - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - test_cvt<<>>(lda.data(), d0.data()); - test_cvt<<>>(ldb.data(), d1.data()); - MATMUL(d2, lda, ldb); - TCOPYOUT(res, d2); -} - -int main() { - const uint16_t M = 64; - const uint16_t K = 32; - const uint16_t N = 64; - - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; - - float *dst = (float *)malloc(size_C * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_C); - - float *src0 = (float *)malloc(size_A * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, size_A); - float *src1 = (float *)malloc(size_B * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, size_B); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_C); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/test/other/tileop_api/src/TAbs.cpp b/test/other/tileop_api/src/TAbs.cpp deleted file mode 100644 index e3d06c7..0000000 --- a/test/other/tileop_api/src/TAbs.cpp +++ /dev/null @@ -1,151 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test_RowMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_row; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TABS(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -template -void test_ColMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_col; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_row; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TABS(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 128; - const uint16_t gm_col = 128; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - - __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(dst_f16); - init_dst(dst_f16, gm_size); - - __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(src0_f16); - init_src_fp(src0_f16, gm_size); - - int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(dst_i8); - init_dst(dst_i8, gm_size); - - int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(src0_i8); - init_src_int(src0_i8, gm_size); - - int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(dst_i16); - init_dst(dst_i16, gm_size); - - int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(src0_i16); - init_src_int(src0_i16, gm_size); - - int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(dst_i32); - init_dst(dst_i32, gm_size); - - int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(src0_i32); - init_src_int(src0_i32, gm_size); - - int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(dst_i64); - init_dst(dst_i64, gm_size); - - int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(src0_i64); - init_src_int(src0_i64, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test_RowMajor(dst, src0); - - test_RowMajor(dst_f16, src0_f16); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - OutArray(dst_i8, gm_size); - OutArray(dst_i16, gm_size); - OutArray(dst_i32, gm_size); - OutArray(dst_i64, gm_size); - - free(dst); - free(src0); - - free(dst_f16); - free(src0_f16); - - free(dst_i8); - free(src0_i8); - - free(dst_i16); - free(src0_i16); - - free(dst_i32); - free(src0_i32); - - free(dst_i64); - free(src0_i64); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TAdd_mask.cpp b/test/other/tileop_api/src/TAdd_mask.cpp deleted file mode 100644 index 011a5dd..0000000 --- a/test/other/tileop_api/src/TAdd_mask.cpp +++ /dev/null @@ -1,119 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -using namespace pto; - -template -void test(float *c_ptr, float *a_ptr, float *b_ptr) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - using glb_iterator = global_iterator; - - static constexpr int block_row = gm_row / tile_row; - static constexpr int block_col = gm_col / tile_col; - static constexpr int remainder_row = gm_row % tile_row; - static constexpr int remainder_col = gm_col % tile_col; - - using trailing_rows_shape = - Tile; - using trailing_cols_shape = - Tile; - using trailing_corner_shape = Tile; - - glb_iterator gAIter(a_ptr); - glb_iterator gBIter(b_ptr); - glb_iterator gCIter(c_ptr); - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - auto gA = gAIter(i, j); - auto gB = gBIter(i, j); - auto gC = gCIter(i, j); - - tile_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - if constexpr (remainder_col) { - auto gA = gAIter(i, block_col); - auto gB = gBIter(i, block_col); - auto gC = gCIter(i, block_col); - - trailing_rows_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - } - if constexpr (remainder_row) { - for (int j = 0; j < block_col; ++j) { - auto gA = gAIter(block_row, j); - auto gB = gBIter(block_row, j); - auto gC = gCIter(block_row, j); - - trailing_cols_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - if constexpr (remainder_col) { - auto gA = gAIter(block_row, block_col); - auto gB = gBIter(block_row, block_col); - auto gC = gCIter(block_row, block_col); - - trailing_corner_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - } -} - -int main() { - const uint16_t gm_row = 123; - const uint16_t gm_col = 123; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - float *src1 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/test/other/tileop_api/src/TCopy.cpp b/test/other/tileop_api/src/TCopy.cpp deleted file mode 100644 index f182e04..0000000 --- a/test/other/tileop_api/src/TCopy.cpp +++ /dev/null @@ -1,161 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test_RowMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_row; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TCOPY(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -template -void test_ColMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_col; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_row; ++j) { - int offset = i * (tile_col * gm_row) + j * tile_row; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TCOPY(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 128; - const uint16_t gm_col = 128; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - - __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(dst_f16); - init_dst(dst_f16, gm_size); - - __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(src0_f16); - init_src_fp(src0_f16, gm_size); - - int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(dst_i8); - init_dst(dst_i8, gm_size); - - int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(src0_i8); - init_src_int(src0_i8, gm_size); - - int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(dst_i16); - init_dst(dst_i16, gm_size); - - int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(src0_i16); - init_src_int(src0_i16, gm_size); - - int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(dst_i32); - init_dst(dst_i32, gm_size); - - int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(src0_i32); - init_src_int(src0_i32, gm_size); - - int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(dst_i64); - init_dst(dst_i64, gm_size); - - int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(src0_i64); - init_src_int(src0_i64, gm_size); - - -#ifdef LINX_PMC - PMC_START(); -#endif - - test_RowMajor(dst, src0); - - test_RowMajor(dst_f16, src0_f16); - - test_RowMajor(dst_i8, src0_i8); - - test_RowMajor(dst_i16, src0_i16); - - test_RowMajor(dst_i32, src0_i32); - - test_RowMajor(dst_i64, src0_i64); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - //OutArray(dst_f16, gm_size); - OutArray(dst_i8, gm_size); - OutArray(dst_i16, gm_size); - OutArray(dst_i32, gm_size); - OutArray(dst_i64, gm_size); - - free(dst); - free(src0); - - free(dst_f16); - free(src0_f16); - - free(dst_i8); - free(src0_i8); - - free(dst_i16); - free(src0_i16); - - free(dst_i32); - free(src0_i32); - - free(dst_i64); - free(src0_i64); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TCopyIn.cpp b/test/other/tileop_api/src/TCopyIn.cpp deleted file mode 100644 index b99ea98..0000000 --- a/test/other/tileop_api/src/TCopyIn.cpp +++ /dev/null @@ -1,170 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test_RowMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_row; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); - } - } -} - -template -void test_ColMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_col; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_row; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - - __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(dst_f16); - init_dst(dst_f16, gm_size); - - __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(src0_f16); - init_src_fp(src0_f16, gm_size); - - int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(dst_i8); - init_dst(dst_i8, gm_size); - - int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(src0_i8); - init_src_int(src0_i8, gm_size); - - int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(dst_i16); - init_dst(dst_i16, gm_size); - - int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(src0_i16); - init_src_int(src0_i16, gm_size); - - int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(dst_i32); - init_dst(dst_i32, gm_size); - - int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(src0_i32); - init_src_int(src0_i32, gm_size); - - int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(dst_i64); - init_dst(dst_i64, gm_size); - - int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(src0_i64); - init_src_int(src0_i64, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test_RowMajor(dst, src0); - - test_ColMajor(dst, src0); - - test_RowMajor(dst_f16, src0_f16); - - test_ColMajor(dst_f16, src0_f16); - - test_RowMajor(dst_i8, src0_i8); - - test_ColMajor(dst_i8, src0_i8); - - test_RowMajor(dst_i16, src0_i16); - - test_ColMajor(dst_i16, src0_i16); - - test_RowMajor(dst_i32, src0_i32); - - test_ColMajor(dst_i32, src0_i32); - - test_RowMajor(dst_i64, src0_i64); - - test_ColMajor(dst_i64, src0_i64); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - //OutArray(dst_f16, gm_size); - OutArray(dst_i8, gm_size); - OutArray(dst_i16, gm_size); - OutArray(dst_i32, gm_size); - OutArray(dst_i64, gm_size); - - free(dst); - free(src0); - - free(dst_f16); - free(src0_f16); - - free(dst_i8); - free(src0_i8); - - free(dst_i16); - free(src0_i16); - - free(dst_i32); - free(src0_i32); - - free(dst_i64); - free(src0_i64); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TCopyOut.cpp b/test/other/tileop_api/src/TCopyOut.cpp deleted file mode 100644 index b99ea98..0000000 --- a/test/other/tileop_api/src/TCopyOut.cpp +++ /dev/null @@ -1,170 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test_RowMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_row; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); - } - } -} - -template -void test_ColMajor(T *dst, T *src0) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - #pragma clang loop unroll(full) - for (int i = 0; i < block_col; ++i) { - #pragma clang loop unroll(full) - for (int j = 0; j < block_row; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape res(dst + offset); - - tile_shape d0; - TCOPYIN(d0, s0); - TCOPYOUT(res, d0); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - - __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(dst_f16); - init_dst(dst_f16, gm_size); - - __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(src0_f16); - init_src_fp(src0_f16, gm_size); - - int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(dst_i8); - init_dst(dst_i8, gm_size); - - int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t)); - check_mem_alloc(src0_i8); - init_src_int(src0_i8, gm_size); - - int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(dst_i16); - init_dst(dst_i16, gm_size); - - int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t)); - check_mem_alloc(src0_i16); - init_src_int(src0_i16, gm_size); - - int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(dst_i32); - init_dst(dst_i32, gm_size); - - int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t)); - check_mem_alloc(src0_i32); - init_src_int(src0_i32, gm_size); - - int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(dst_i64); - init_dst(dst_i64, gm_size); - - int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t)); - check_mem_alloc(src0_i64); - init_src_int(src0_i64, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test_RowMajor(dst, src0); - - test_ColMajor(dst, src0); - - test_RowMajor(dst_f16, src0_f16); - - test_ColMajor(dst_f16, src0_f16); - - test_RowMajor(dst_i8, src0_i8); - - test_ColMajor(dst_i8, src0_i8); - - test_RowMajor(dst_i16, src0_i16); - - test_ColMajor(dst_i16, src0_i16); - - test_RowMajor(dst_i32, src0_i32); - - test_ColMajor(dst_i32, src0_i32); - - test_RowMajor(dst_i64, src0_i64); - - test_ColMajor(dst_i64, src0_i64); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - //OutArray(dst_f16, gm_size); - OutArray(dst_i8, gm_size); - OutArray(dst_i16, gm_size); - OutArray(dst_i32, gm_size); - OutArray(dst_i64, gm_size); - - free(dst); - free(src0); - - free(dst_f16); - free(src0_f16); - - free(dst_i8); - free(src0_i8); - - free(dst_i16); - free(src0_i16); - - free(dst_i32); - free(src0_i32); - - free(dst_i64); - free(src0_i64); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TCvt.cpp b/test/other/tileop_api/src/TCvt.cpp deleted file mode 100644 index a4f3362..0000000 --- a/test/other/tileop_api/src/TCvt.cpp +++ /dev/null @@ -1,57 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void Test(float *dst, float *src) { - using gm_shape = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = TileLeft; - - gm_shape s0(src); - gm_shape res(dst); - - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TCVT(d1, d0); - TCVT(d0, d1); - TCOPYOUT(res, d0); -} - -int main() { - const uint16_t row = 64; - const uint16_t col = 128; - - size_t size = row * col; - - float *dst = (float *)malloc(size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size); - - float *src = (float *)malloc(size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - Test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TDiv.cpp b/test/other/tileop_api/src/TDiv.cpp deleted file mode 100644 index aa3668b..0000000 --- a/test/other/tileop_api/src/TDiv.cpp +++ /dev/null @@ -1,70 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape s1(src1 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TDIV(d2, d1, d0); - TCOPYOUT(res, d2); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - float *src1 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src0); - free(src1); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TDivs.cpp b/test/other/tileop_api/src/TDivs.cpp deleted file mode 100644 index 0c798c4..0000000 --- a/test/other/tileop_api/src/TDivs.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src, float s) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TDIVS(d1, d0, s); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src, s_fp32); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TExp.cpp b/test/other/tileop_api/src/TExp.cpp deleted file mode 100644 index 49f8645..0000000 --- a/test/other/tileop_api/src/TExp.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TEXP(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TExpandCol.cpp b/test/other/tileop_api/src/TExpandCol.cpp deleted file mode 100644 index f2633e4..0000000 --- a/test/other/tileop_api/src/TExpandCol.cpp +++ /dev/null @@ -1,57 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TEXPANDCOL(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 128; - const uint16_t col = 64; - - size_t size_in = row; - size_t size_out = row * col; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TExpandRow.cpp b/test/other/tileop_api/src/TExpandRow.cpp deleted file mode 100644 index eb2902f..0000000 --- a/test/other/tileop_api/src/TExpandRow.cpp +++ /dev/null @@ -1,57 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - TCOPYIN(d0, s0); - TEXPANDROW(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 64; - const uint16_t col = 128; - - size_t size_in = col; - size_t size_out = row * col; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TExpandScalar.cpp b/test/other/tileop_api/src/TExpandScalar.cpp deleted file mode 100644 index 83ac6ed..0000000 --- a/test/other/tileop_api/src/TExpandScalar.cpp +++ /dev/null @@ -1,49 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float s) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - gm_shape res(dst); - - tile_shape d0; - TEXPANDSCALAR(d0, s); - TCOPYOUT(res, d0); -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 128; - const uint16_t tile_row = 64; - const uint16_t tile_col = 128; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, s_fp32); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TMax.cpp b/test/other/tileop_api/src/TMax.cpp deleted file mode 100644 index 7d156c2..0000000 --- a/test/other/tileop_api/src/TMax.cpp +++ /dev/null @@ -1,70 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape s1(src1 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TMAX(d2, d1, d0); - TCOPYOUT(res, d2); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - float *src1 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src0); - free(src1); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TMaxs.cpp b/test/other/tileop_api/src/TMaxs.cpp deleted file mode 100644 index 5224e6e..0000000 --- a/test/other/tileop_api/src/TMaxs.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src, float s) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TMAXS(d1, d0, s); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src, s_fp32); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TMul.cpp b/test/other/tileop_api/src/TMul.cpp deleted file mode 100644 index 6800ab7..0000000 --- a/test/other/tileop_api/src/TMul.cpp +++ /dev/null @@ -1,70 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape s1(src1 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TMUL(d2, d1, d0); - TCOPYOUT(res, d2); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - float *src1 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src0); - free(src1); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TMuls.cpp b/test/other/tileop_api/src/TMuls.cpp deleted file mode 100644 index bbf8f6f..0000000 --- a/test/other/tileop_api/src/TMuls.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src, float s) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TMULS(d1, d0, s); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src, s_fp32); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TRecip.cpp b/test/other/tileop_api/src/TRecip.cpp deleted file mode 100644 index d30c423..0000000 --- a/test/other/tileop_api/src/TRecip.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TRECIP(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TReshape.cpp b/test/other/tileop_api/src/TReshape.cpp deleted file mode 100644 index f96a5bc..0000000 --- a/test/other/tileop_api/src/TReshape.cpp +++ /dev/null @@ -1,60 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - TCOPYIN(d0, s0); - TRESHAPE(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TRowMax.cpp b/test/other/tileop_api/src/TRowMax.cpp deleted file mode 100644 index 074f9c2..0000000 --- a/test/other/tileop_api/src/TRowMax.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TROWMAX(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 128; - const uint16_t col = 64; - - size_t size_in = row * col; - size_t size_out = row; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TRowMaxExpand.cpp b/test/other/tileop_api/src/TRowMaxExpand.cpp deleted file mode 100644 index b9cc1e1..0000000 --- a/test/other/tileop_api/src/TRowMaxExpand.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TROWMAXEXPAND(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 64; - const uint16_t col = 128; - - size_t size_in = row * col; - size_t size_out = row * col; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TRowSum.cpp b/test/other/tileop_api/src/TRowSum.cpp deleted file mode 100644 index c242a41..0000000 --- a/test/other/tileop_api/src/TRowSum.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TROWSUM(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 128; - const uint16_t col = 64; - - size_t size_in = row * col; - size_t size_out = row; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TRowSumExpand.cpp b/test/other/tileop_api/src/TRowSumExpand.cpp deleted file mode 100644 index 32b0a78..0000000 --- a/test/other/tileop_api/src/TRowSumExpand.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TROWSUMEXPAND(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 64; - const uint16_t col = 128; - - size_t size_in = row * col; - size_t size_out = row * col; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TSqrt.cpp b/test/other/tileop_api/src/TSqrt.cpp deleted file mode 100644 index 68813b1..0000000 --- a/test/other/tileop_api/src/TSqrt.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TSQRT(d1, d0); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TSub.cpp b/test/other/tileop_api/src/TSub.cpp deleted file mode 100644 index ac1c694..0000000 --- a/test/other/tileop_api/src/TSub.cpp +++ /dev/null @@ -1,70 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src0 + offset); - gm_shape s1(src1 + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - TSUB(d2, d1, d0); - TCOPYOUT(res, d2); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - float *src1 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src0); - free(src1); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TSubs.cpp b/test/other/tileop_api/src/TSubs.cpp deleted file mode 100644 index 615f129..0000000 --- a/test/other/tileop_api/src/TSubs.cpp +++ /dev/null @@ -1,64 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src, float s) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - - uint16_t block_row = gm_row / tile_row; - uint16_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - int offset = i * (tile_row * gm_col) + j * tile_col; - gm_shape s0(src + offset); - gm_shape res(dst + offset); - - tile_shape d0, d1; - TCOPYIN(d0, s0); - TSUBS(d1, d0, s); - TCOPYOUT(res, d1); - } - } -} - -int main() { - const uint16_t gm_row = 64; - const uint16_t gm_col = 64; - const uint16_t tile_row = 32; - const uint16_t tile_col = 32; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src, s_fp32); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/TTrans.cpp b/test/other/tileop_api/src/TTrans.cpp deleted file mode 100644 index a430236..0000000 --- a/test/other/tileop_api/src/TTrans.cpp +++ /dev/null @@ -1,58 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template void test(float *dst, float *src) { - using gm_shape_in = global_tensor>; - using gm_shape_out = global_tensor>; - - using tile_shape_in = Tile; - using tile_shape_out = Tile; - - gm_shape_in s0(src); - gm_shape_out res(dst); - - tile_shape_in d0; - tile_shape_out d1; - - TCOPYIN(d0, s0); - TTRANS(d1, d0); - TCOPYOUT(res, d1); -} - -int main() { - const uint16_t row = 64; - const uint16_t col = 128; - - size_t size_in = row * col; - size_t size_out = col * row; - - float *dst = (float *)malloc(size_out * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_out); - - float *src = (float *)malloc(size_in * sizeof(float)); - check_mem_alloc(src); - init_src_fp(src, size_in); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_out); - - free(dst); - free(src); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/test_MatMacc.cpp b/test/other/tileop_api/src/test_MatMacc.cpp deleted file mode 100644 index 866c92c..0000000 --- a/test/other/tileop_api/src/test_MatMacc.cpp +++ /dev/null @@ -1,71 +0,0 @@ -#include - -#include "../data.hpp" - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape_A = global_tensor>; - using gm_shape_B = global_tensor>; - using gm_shape_C = global_tensor>; - - using tile_shape_A = TileLeft; - using tile_shape_B = TileRight; - using tile_shape_C = TileAcc; - - gm_shape_A s0(src0); - gm_shape_B s1(src1); - gm_shape_C res(dst); - - tile_shape_A d0; - tile_shape_B d1; - tile_shape_C d2(0); - - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - MATMACC(d2, d0, d1); - TCOPYOUT(res, d2); -} - -int main() { - const uint16_t M = 64; - const uint16_t K = 32; - const uint16_t N = 128; - - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; - - float *dst = (float *)malloc(size_C * sizeof(float)); - check_mem_alloc(dst); - init_src_fp(dst, size_C); - - float *src0 = (float *)malloc(size_A * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, size_A); - float *src1 = (float *)malloc(size_B * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, size_B); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_C); - - free(dst); - free(src0); - free(src1); - - return 0; -} \ No newline at end of file diff --git a/test/other/tileop_api/src/test_MatMul.cpp b/test/other/tileop_api/src/test_MatMul.cpp deleted file mode 100644 index 92aa454..0000000 --- a/test/other/tileop_api/src/test_MatMul.cpp +++ /dev/null @@ -1,51 +0,0 @@ -#include - -#include "../../../kernels/matmul.hpp" -#include "../data.hpp" - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -int main() { - const uint16_t M = 160; - const uint16_t K = 80; - const uint16_t N = 320; - const uint16_t TM = 32; - const uint16_t TK = 32; - const uint16_t TN = 32; - - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; - - float *dst = (float *)malloc(size_C * sizeof(float)); - check_mem_alloc(dst); - init_src_fp(dst, size_C); - - float *src0 = (float *)malloc(size_A * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, size_A); - float *src1 = (float *)malloc(size_B * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, size_B); - -#ifdef LINX_PMC - PMC_START(); -#endif - - matmul(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_C); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/test/tileop_api/src/MatMul_e4m3.cpp b/test/tileop_api/src/MatMul_e4m3.cpp deleted file mode 100644 index 2392f07..0000000 --- a/test/tileop_api/src/MatMul_e4m3.cpp +++ /dev/null @@ -1,89 +0,0 @@ -#include -#include "../data.hpp" -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void __vec__ test_cvt(typename TA::TileDType __out__ a, - typename TB::TileDType __in__ b) { - using AType = typename TA::DType; - using BType = typename TB::DType; - __vbuf__ BType *pb = blkv_get_tile_ptr(b); - __vbuf__ AType *pa = blkv_get_tile_ptr(a); - int x = blkv_get_index_x(); - int y = blkv_get_index_y(); - int idx = index(y, x); - AType o = (AType)(pb[idx]); - pa[idx] = o; -} - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape_A = global_tensor>; - using gm_shape_B = global_tensor>; - using gm_shape_C = global_tensor>; - - using tile_shape_A = TileLeft; - using tile_shape_B = TileRight; - using tile_shape_C = TileAcc; - using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>; - using tile_shape_LB = TileRight<__fp8_e4m3, K, N>; - - gm_shape_A s0(src0); - gm_shape_B s1(src1); - gm_shape_C res(dst); - - tile_shape_A d0; - tile_shape_B d1; - tile_shape_C d2; - tile_shape_LA lda; - tile_shape_LB ldb; - - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - test_cvt<<>>(lda.data(), d0.data()); - test_cvt<<>>(ldb.data(), d1.data()); - MATMUL(d2, lda, ldb); - TCOPYOUT(res, d2); -} - -int main() { - const uint16_t M = 64; - const uint16_t K = 32; - const uint16_t N = 128; - - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; - - float *dst = (float *)malloc(size_C * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_C); - - float *src0 = (float *)malloc(size_A * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, size_A); - float *src1 = (float *)malloc(size_B * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, size_B); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_C); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/test/tileop_api/src/TAdd_mask.cpp b/test/tileop_api/src/TAdd_mask.cpp deleted file mode 100644 index a0f6f40..0000000 --- a/test/tileop_api/src/TAdd_mask.cpp +++ /dev/null @@ -1,120 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -using namespace pto; - -template -void test(float *c_ptr, float *a_ptr, float *b_ptr) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - using glb_iterator = global_iterator; - - static constexpr int block_row = gm_row / tile_row; - static constexpr int block_col = gm_col / tile_col; - static constexpr int remainder_row = gm_row % tile_row; - static constexpr int remainder_col = gm_col % tile_col; - - using trailing_rows_shape = - Tile; - using trailing_cols_shape = - Tile; - using trailing_corner_shape = Tile; - - glb_iterator gAIter(a_ptr); - glb_iterator gBIter(b_ptr); - glb_iterator gCIter(c_ptr); - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - auto gA = gAIter(i, j); - auto gB = gBIter(i, j); - auto gC = gCIter(i, j); - - tile_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - if constexpr (remainder_col) { - auto gA = gAIter(i, block_col); - auto gB = gBIter(i, block_col); - auto gC = gCIter(i, block_col); - - trailing_rows_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - } - if constexpr (remainder_row) { - for (int j = 0; j < block_col; ++j) { - auto gA = gAIter(block_row, j); - auto gB = gBIter(block_row, j); - auto gC = gCIter(block_row, j); - - trailing_cols_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - if constexpr (remainder_col) { - auto gA = gAIter(block_row, block_col); - auto gB = gBIter(block_row, block_col); - auto gC = gCIter(block_row, block_col); - - trailing_corner_shape tA, tB, tC; - TCOPYIN(tA, gA); - TCOPYIN(tB, gB); - TADD(tC, tA, tB); - TCOPYOUT(gC, tC); - } - } -} - -int main() { - const uint16_t gm_row = 66; - const uint16_t gm_col = 66; - const uint16_t tile_row = 16; - const uint16_t tile_col = 16; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - float *dst = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, gm_size); - - float *src0 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, gm_size); - float *src1 = (float *)malloc(gm_size * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, gm_size); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, gm_size); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/test/tileop_api/src/TSqrt.cpp b/test/tileop_api/src/TSqrt.cpp deleted file mode 100644 index 7fbc90f..0000000 --- a/test/tileop_api/src/TSqrt.cpp +++ /dev/null @@ -1,107 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test_rm(T *dst, T *src) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - using glb_iterator = global_iterator; - - glb_iterator gSIter(src); - glb_iterator gDIter(dst); - - size_t block_row = gm_row / tile_row; - size_t block_col = gm_col / tile_col; - for (int i = 0; i < block_row; ++i) { - for (int j = 0; j < block_col; ++j) { - auto s0 = gSIter(i, j); - auto res = gDIter(i, j); - - tile_shape t0, t1; - TCOPYIN(t0, s0); - TSQRT(t1, t0); - TCOPYOUT(res, t1); - } - } -} - -template -void test_cm(T *dst, T *src) { - using gm_shape = global_tensor>; - using tile_shape = Tile; - using glb_iterator = global_iterator; - - glb_iterator gSIter(src); - glb_iterator gDIter(dst); - - size_t block_row = gm_row / tile_row; - size_t block_col = gm_col / tile_col; - for (int i = 0; i < block_col; ++i) { - for (int j = 0; j < block_row; ++j) { - auto s0 = gSIter(j, i); - auto res = gDIter(j, i); - - tile_shape t0, t1; - TCOPYIN(t0, s0); - TSQRT(t1, t0); - TCOPYOUT(res, t1); - } - } -} - -int main() { - - const size_t gm_row = 32; - const size_t gm_col = 32; - const size_t tile_row = 16; - const size_t tile_col = 16; - - size_t gm_size = gm_row * gm_col; - size_t tile_size = tile_row * tile_col; - - // __half - __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(dst_f16); - init_dst(dst_f16, gm_size); - - __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half)); - check_mem_alloc(src_f16); - init_rows_fp(src_f16, gm_row, gm_col); - - // __fp32 - __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); - check_mem_alloc(dst_f32); - init_dst(dst_f32, gm_size); - - __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32)); - check_mem_alloc(src_f32); - init_rows_fp(src_f32, gm_row, gm_col); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test_rm(dst_f16, src_f16); - test_cm(dst_f16, src_f16); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst_f16, gm_size); - OutArray(dst_f32, gm_size); - - free(dst_f16); - free(src_f16); - free(dst_f32); - free(src_f32); - - return 0; -} \ No newline at end of file diff --git a/test/tileop_api/src/test_MatMacc.cpp b/test/tileop_api/src/test_MatMacc.cpp deleted file mode 100644 index ead53c2..0000000 --- a/test/tileop_api/src/test_MatMacc.cpp +++ /dev/null @@ -1,75 +0,0 @@ -#include - -#include "../data.hpp" - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape_A = global_tensor>; - using gm_shape_B = global_tensor>; - using gm_shape_C = global_tensor>; - - using tile_shape_A = TileLeft; - using tile_shape_B = TileRight; - using tile_shape_C = TileAcc; - using tile_shape_O = Tile; - - gm_shape_A s0(src0); - gm_shape_B s1(src1); - gm_shape_C res(dst); - - tile_shape_A d0; - tile_shape_B d1; - tile_shape_C d2; - tile_shape_O d3; - - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - MATMUL(d2, d0, d1); - MATMACC(d2, d0, d1); - TCVT(d3, d2); - TCOPYOUT(res, d3); -} - -int main() { - const uint16_t M = 16; - const uint16_t K = 8; - const uint16_t N = 32; - - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; - - float *dst = (float *)malloc(size_C * sizeof(float)); - check_mem_alloc(dst); - init_src_fp(dst, size_C); - - float *src0 = (float *)malloc(size_A * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, size_A); - float *src1 = (float *)malloc(size_B * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, size_B); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_C); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/test/tileop_api/src/test_MatMul.cpp b/test/tileop_api/src/test_MatMul.cpp deleted file mode 100644 index 56ad9c3..0000000 --- a/test/tileop_api/src/test_MatMul.cpp +++ /dev/null @@ -1,73 +0,0 @@ -#include "../data.hpp" -#include - -#ifdef LINX_PMC -#include "../linxStartEnd.hpp" -#endif - -template -void test(float *dst, float *src0, float *src1) { - using gm_shape_A = global_tensor>; - using gm_shape_B = global_tensor>; - using gm_shape_C = global_tensor>; - - using tile_shape_A = TileLeft; - using tile_shape_B = TileRight; - using tile_shape_C = TileAcc; - using tile_shape_O = TileLeft; - - gm_shape_A s0(src0); - gm_shape_B s1(src1); - gm_shape_C res(dst); - - tile_shape_A d0; - tile_shape_B d1; - tile_shape_C d2; - tile_shape_O d3; - - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); - MATMUL(d2, d0, d1); - TCVT(d3, d2); - TCOPYOUT(res, d3); -} - -int main() { - const uint16_t M = 16; - const uint16_t K = 8; - const uint16_t N = 32; - - size_t size_A = M * K; - size_t size_B = K * N; - size_t size_C = M * N; - - float *dst = (float *)malloc(size_C * sizeof(float)); - check_mem_alloc(dst); - init_dst(dst, size_C); - - float *src0 = (float *)malloc(size_A * sizeof(float)); - check_mem_alloc(src0); - init_src_fp(src0, size_A); - float *src1 = (float *)malloc(size_B * sizeof(float)); - check_mem_alloc(src1); - init_src_fp(src1, size_B); - -#ifdef LINX_PMC - PMC_START(); -#endif - - test(dst, src0, src1); - -#ifdef LINX_PMC - PMC_END(); -#endif - - printf("Result:\n"); - OutArray(dst, size_C); - - free(dst); - free(src0); - free(src1); - - return 0; -} diff --git a/tests/README.md b/tests/README.md new file mode 100644 index 0000000..e0558f9 --- /dev/null +++ b/tests/README.md @@ -0,0 +1,71 @@ +# Tests + +This tree keeps correctness material that is not the primary Linx benchmark +navigation surface. Active Linx benchmark entrypoints live under +[`../benchmarks`](../benchmarks); add new benchmark suites there instead of +recreating the old `test/` tree. + +## Directory Map + +| Path | Purpose | +| --- | --- | +| [`py_api`](py_api) | Active Python-facing TileOP correctness and golden-comparison flow. | +| [`tileop_layout`](tileop_layout) | TileOP layout and behavior checks that are not cataloged as primary benchmark suites. | + +These directories still use the shared benchmark harness through +[`../benchmarks/common/Makefile.common`](../benchmarks/common/Makefile.common), +so the same `TESTCASE`, `PLAT`, `COMPILER_DIR`, and `QEMU` variables work here. + +## Common Build Pattern + +```sh +cd tests/tileop_layout +make clean +make TESTCASE=TLOAD PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin +make TESTCASE=TSTORE PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin +``` + +Platform values: + +| Platform | Backend | +| --- | --- | +| `PLAT=cpu` | CPU simulation backend with `__cpu_sim__`. | +| `PLAT=linx` | Linx target backend with `__linx`. | +| `PLAT=arm_sme` | Arm SME-oriented backend with `__ARM_FEATURE_SME`. | + +Common targets: + +```sh +make TESTCASE= all +make TESTCASE= diss +make TESTCASE= sim +make TESTCASE= debug +make clean +make clean_all +``` + +Build products are written below the repository-level `output/` directory. + +## Batch Runs + +Run batch files from their suite directory so relative paths and make variables +resolve as intended: + +```sh +cd tests/tileop_layout && bash compile.all +cd tests/py_api && bash compile.all +``` + +## Python Golden Comparison + +```sh +cd tests/py_api +make clean +make TESTCASE=tileop_py +python3 golden_cmp/golden_cmp.py -i tadd +``` + +For adding golden-comparison cases, see +[`py_api/golden_cmp/README.md`](py_api/golden_cmp/README.md). + +Back to the repository overview: [`../README.md`](../README.md). diff --git a/test/py_api/Makefile b/tests/py_api/Makefile similarity index 65% rename from test/py_api/Makefile rename to tests/py_api/Makefile index 01a5e4e..21ffb78 100644 --- a/test/py_api/Makefile +++ b/tests/py_api/Makefile @@ -6,4 +6,5 @@ endif SRC_FILE += $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp -include ../common/Makefile.common \ No newline at end of file +CATEGORY_ROOT := $(abspath ../..) +include ../../benchmarks/common/Makefile.common diff --git a/test/py_api/compile.all b/tests/py_api/compile.all similarity index 100% rename from test/py_api/compile.all rename to tests/py_api/compile.all diff --git a/test/py_api/golden_cmp/README.md b/tests/py_api/golden_cmp/README.md similarity index 88% rename from test/py_api/golden_cmp/README.md rename to tests/py_api/golden_cmp/README.md index 068ddb8..4d89087 100644 --- a/test/py_api/golden_cmp/README.md +++ b/tests/py_api/golden_cmp/README.md @@ -7,12 +7,12 @@ · 文件路径:PTOTileLib/include/cpu_sim/ · 操作说明: - + 1. 如果是添加一个新的运算方式(如 texp),则需要新建一个 HPP 文件。 2. 如果是同一运算方式的不同属性(如不同的矩阵尺寸或 tile 大小),则直接在对应的 HPP 文件中添加。 · 标准函数格式: - + · 文件头需要包含必要的头文件。 · 声明变量和函数名称时,需注意命名规范。 · 如果有一个综合函数记得写清条件 @@ -57,16 +57,16 @@ void TADD(tile_shape &dst, tile_shape &src0, tile_shape &src1) { 2. 新建调用函数与python接口的hpp文件 -文件路径:PTOTileLib/test/py_api/src +文件路径:PTOTileLib/tests/py_api/src 步骤说明: 1. 新建文件: - + · 添加固定文件头。 . 声明要传入的参数 . 声明矩阵的形状与layout - . 进行矩阵操作,并且使用TCOPYIN,TCOPYOUT函数以及上一个步骤声明的函数来进行操作。注意满足TCOPYIN,TCOPYOUT对于矩阵layout的要求。 + . 进行矩阵操作,并且使用TLOAD,TSTORE函数以及上一个步骤声明的函数来进行操作。注意满足TLOAD,TSTORE对于矩阵layout的要求。 . 对函数进行绑定,注意在绑定时需要进行接口的转换以及参数的传入。 . 之后在tileop_py.cpp中加入要编译的文件名 ``` @@ -90,18 +90,18 @@ void tadd_py(float* dst, float* src0, float* src1){ int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); + gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } -#ifdef __cpu_sim__ +#ifdef __cpu_sim__ void bind_tadd(py::module_& m) { m.def("tadd", [](py::array_t dst_py, py::array_t src0_py, py::array_t src1_py){ float* dst = static_cast(dst_py.request().ptr); @@ -119,12 +119,12 @@ void tadd_py(float* dst, float* src0, float* src1){ 3. 修改 CONFIG.JSON 文件和 ref_func_lib.py 文件 -文件路径:/PTOTileLib/test/py_api/golden_cmp/ +文件路径:/PTOTileLib/tests/py_api/golden_cmp/ 步骤说明: 1. 在config.json文件中的 cases 中添加新测试用例: - + · 按照以下格式添加新函数的属性。 ``` { @@ -132,7 +132,7 @@ void tadd_py(float* dst, float* src0, float* src1){ "group": "tadd", "input_shapes": [[16, 16], [16, 16]], "output_shape": [16, 16], - "ref_func":"lambda input: tadd(input[0], input[1])", + "ref_func":"lambda input: tadd(input[0], input[1])", "test_func":"lambda res, input: tileop_py.tadd_api.tadd(res, input[0], input[1])" } ``` @@ -144,7 +144,7 @@ void tadd_py(float* dst, float* src0, float* src1){ . **test_func**: 该函数调用的是绑定的函数。 2. 在 ref_func_lib.py 中添加需要进行的python操作: - + · 按照以下格式添加python操作。 ``` def tadd(a, b): @@ -158,12 +158,12 @@ void tadd_py(float* dst, float* src0, float* src1){ 添加新的测试用例 按照以下步骤完成修改后,即可成功添加新的测试用例。确保所有文件的修改内容与 config.json 中的定义一致,以避免运行时错误。 -在 /PTOTileLib/test/py_api/ 路径下,执行以下命令: +在 /PTOTileLib/tests/py_api/ 路径下,执行以下命令: ``` -make clean -make TESTCASE=tileop_py -python3 golden_cmp/golden_cmp.py -i tadd +make clean +make TESTCASE=tileop_py +python3 golden_cmp/golden_cmp.py -i tadd ``` -其中 -i 后面跟着的是函数的名称,具体的函数名可以参考 config.json 文件中的内容。 +其中 -i 后面跟着的是函数的名称,具体的函数名可以参考 config.json 文件中的内容。 之后print出的对比结果中,在最后两行会显示loss(误差)以及是否pass or fail diff --git a/test/py_api/golden_cmp/config.json b/tests/py_api/golden_cmp/config.json similarity index 100% rename from test/py_api/golden_cmp/config.json rename to tests/py_api/golden_cmp/config.json diff --git a/test/py_api/golden_cmp/golden_cmp.py b/tests/py_api/golden_cmp/golden_cmp.py similarity index 100% rename from test/py_api/golden_cmp/golden_cmp.py rename to tests/py_api/golden_cmp/golden_cmp.py diff --git a/test/py_api/golden_cmp/ref_func_lib.py b/tests/py_api/golden_cmp/ref_func_lib.py similarity index 100% rename from test/py_api/golden_cmp/ref_func_lib.py rename to tests/py_api/golden_cmp/ref_func_lib.py diff --git a/test/py_api/golden_cmp/test.sh b/tests/py_api/golden_cmp/test.sh similarity index 100% rename from test/py_api/golden_cmp/test.sh rename to tests/py_api/golden_cmp/test.sh diff --git a/test/py_api/src/flash_attention_py.hpp b/tests/py_api/src/flash_attention_py.hpp similarity index 100% rename from test/py_api/src/flash_attention_py.hpp rename to tests/py_api/src/flash_attention_py.hpp diff --git a/test/py_api/src/tadd.hpp b/tests/py_api/src/tadd.hpp similarity index 89% rename from test/py_api/src/tadd.hpp rename to tests/py_api/src/tadd.hpp index 39bf976..3245e5c 100644 --- a/test/py_api/src/tadd.hpp +++ b/tests/py_api/src/tadd.hpp @@ -18,18 +18,18 @@ void tadd_py(float* dst, float* src0, float* src1){ int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); + gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TADD(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } -#ifdef __cpu_sim__ +#ifdef __cpu_sim__ void bind_tadd(py::module_& m) { m.def("tadd", [](py::array_t dst_py, py::array_t src0_py, py::array_t src1_py){ float* dst = static_cast(dst_py.request().ptr); diff --git a/test/py_api/src/tcvt.hpp b/tests/py_api/src/tcvt.hpp similarity index 94% rename from test/py_api/src/tcvt.hpp rename to tests/py_api/src/tcvt.hpp index 2931dd3..04bab23 100644 --- a/test/py_api/src/tcvt.hpp +++ b/tests/py_api/src/tcvt.hpp @@ -22,13 +22,13 @@ void tcvtnz2zn(float* dst, float* src) { tile_shape_nz d1; tile_shape_zn d2; tile_shape_out d3; - - TCOPYIN(d0, s); + + TLOAD(d0, s); TRESHAPE(d1, d0); TCVT(d2, d1); TRESHAPE(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } @@ -48,14 +48,14 @@ void tcvtzn2nz(float* dst, float* src) { tile_shape_zn d1; tile_shape_nz d2; tile_shape_out d3; - - TCOPYIN(d0, s); + + TLOAD(d0, s); TRESHAPE(d1, d0); TCVT(d2, d1); TRESHAPE(d3, d2); - TCOPYOUT(res, d3); - + TSTORE(res, d3); + } template void tcvtnz2rowmajor(float* dst, float* src) { @@ -73,13 +73,13 @@ void tcvtnz2rowmajor(float* dst, float* src) { tile_shape_nz d1; tile_shape_rowmajor d2; tile_shape_out d3; - - TCOPYIN(d0, s); + + TLOAD(d0, s); TRESHAPE(d1, d0); TCVT(d2, d1); TRESHAPE(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } @@ -100,13 +100,13 @@ void tcvtrowmajor2nz(float* dst, float* src) { tile_shape_rowmajor d1; tile_shape_nz d2; tile_shape_out d3; - - TCOPYIN(d0, s); + + TLOAD(d0, s); TRESHAPE(d1, d0); TCVT(d2, d1); TRESHAPE(d3, d2); - TCOPYOUT(res, d3); + TSTORE(res, d3); } // Python 接口绑定 diff --git a/test/py_api/src/texp.hpp b/tests/py_api/src/texp.hpp similarity index 94% rename from test/py_api/src/texp.hpp rename to tests/py_api/src/texp.hpp index b8d1566..7c5a577 100644 --- a/test/py_api/src/texp.hpp +++ b/tests/py_api/src/texp.hpp @@ -23,9 +23,9 @@ void texp_py(float* dst, float* src) { tile_shape d0, d1; - TCOPYIN(d0, s0); + TLOAD(d0, s0); TEXP(d1, d0); - TCOPYOUT(res, d1); + TSTORE(res, d1); } } } diff --git a/test/py_api/src/tileop_py.cpp b/tests/py_api/src/tileop_py.cpp similarity index 97% rename from test/py_api/src/tileop_py.cpp rename to tests/py_api/src/tileop_py.cpp index 557801f..c40a9d6 100644 --- a/test/py_api/src/tileop_py.cpp +++ b/tests/py_api/src/tileop_py.cpp @@ -11,7 +11,7 @@ #include "flash_attention_py.hpp" #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif namespace py = pybind11; diff --git a/test/py_api/src/tmax.hpp b/tests/py_api/src/tmax.hpp similarity index 93% rename from test/py_api/src/tmax.hpp rename to tests/py_api/src/tmax.hpp index 5b07694..618fb15 100644 --- a/test/py_api/src/tmax.hpp +++ b/tests/py_api/src/tmax.hpp @@ -23,10 +23,10 @@ void tmax_py(float* dst, float* src0, float* src1){ tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TMAX(d2, d1, d0); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } diff --git a/test/py_api/src/tsub.hpp b/tests/py_api/src/tsub.hpp similarity index 89% rename from test/py_api/src/tsub.hpp rename to tests/py_api/src/tsub.hpp index a13c69d..a47b0be 100644 --- a/test/py_api/src/tsub.hpp +++ b/tests/py_api/src/tsub.hpp @@ -18,18 +18,18 @@ void tsub_py(float* dst, float* src0, float* src1) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src0 + offset); gm_shape s1(src1 + offset); - gm_shape res(dst + offset); + gm_shape res(dst + offset); tile_shape d0, d1, d2; - TCOPYIN(d0, s0); - TCOPYIN(d1, s1); + TLOAD(d0, s0); + TLOAD(d1, s1); TSUB(d2, d0, d1); - TCOPYOUT(res, d2); + TSTORE(res, d2); } } } -#ifdef __cpu_sim__ +#ifdef __cpu_sim__ void bind_tsub(py::module_& m) { m.def("tsub", [](py::array_t dst_py, py::array_t src0_py, py::array_t src1_py){ float* dst = static_cast(dst_py.request().ptr); diff --git a/test/other/tileop_test/Makefile b/tests/tileop_layout/Makefile similarity index 97% rename from test/other/tileop_test/Makefile rename to tests/tileop_layout/Makefile index 11112a1..f274aea 100644 --- a/test/other/tileop_test/Makefile +++ b/tests/tileop_layout/Makefile @@ -1,9 +1,9 @@ -ifeq ($(TESTCASE), TCOPYIN) +ifeq ($(TESTCASE), TLOAD) DEFINES += -DGM_ROW=$(row) -DGM_COL=$(col) -DTR_ROW=$(trow) -DTR_COL=$(tcol) TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_r$(row)_c$(col)_tr$(trow)_tc$(tcol).elf endif -ifeq ($(TESTCASE), TCOPYOUT) +ifeq ($(TESTCASE), TSTORE) DEFINES += -DROW=$(row) -DCOL=$(col) -DTROW=$(trow) -DTCOL=$(tcol) TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_r$(row)_c$(col)_tr$(trow)_tc$(tcol).elf endif @@ -177,4 +177,5 @@ ifneq ($(MODE), ) endif SRC_FILE += $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp -include ../../common/Makefile.common +CATEGORY_ROOT := $(abspath ../..) +include ../../benchmarks/common/Makefile.common diff --git a/test/other/tileop_test/compile.all b/tests/tileop_layout/compile.all similarity index 85% rename from test/other/tileop_test/compile.all rename to tests/tileop_layout/compile.all index 94ccd7d..af37eeb 100755 --- a/test/other/tileop_test/compile.all +++ b/tests/tileop_layout/compile.all @@ -1,39 +1,39 @@ -make TESTCASE=TCOPYIN MODE=ND2ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=ND2ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=ND2ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=ND2ND row=64 col=128 trow=16 tcol=32 PLAT=linx - -make TESTCASE=TCOPYIN MODE=ND2NZ row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=ND2NZ row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=ND2NZ row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=ND2NZ row=64 col=128 trow=16 tcol=32 PLAT=linx - -make TESTCASE=TCOPYIN MODE=DN2ZN row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=DN2ZN row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=DN2ZN row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TCOPYIN MODE=DN2ZN row=64 col=128 trow=16 tcol=32 PLAT=linx - -#Tcopyout -make TESTCASE=TCOPYOUT MODE=ND2ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYOUT MODE=ND2ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYOUT MODE=ND2ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TCOPYOUT MODE=ND2ND row=64 col=128 trow=16 tcol=32 PLAT=linx - -make TESTCASE=TCOPYOUT MODE=NZ2ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYOUT MODE=NZ2ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPYOUT MODE=NZ2ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TCOPYOUT MODE=NZ2ND row=64 col=128 trow=16 tcol=32 PLAT=linx - -# make TESTCASE=TCOPYOUT MODE=ZN2DN row=16 col=16 trow=16 tcol=16 PLAT=linx -# make TESTCASE=TCOPYOUT MODE=ZN2DN row=64 col=64 trow=16 tcol=16 PLAT=linx -# make TESTCASE=TCOPYOUT MODE=ZN2DN row=128 col=64 trow=32 tcol=16 PLAT=linx -# make TESTCASE=TCOPYOUT MODE=ZN2DN row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TLOAD MODE=ND2ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=ND2ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=ND2ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=ND2ND row=64 col=128 trow=16 tcol=32 PLAT=linx + +make TESTCASE=TLOAD MODE=ND2NZ row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=ND2NZ row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=ND2NZ row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=ND2NZ row=64 col=128 trow=16 tcol=32 PLAT=linx + +make TESTCASE=TLOAD MODE=DN2ZN row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=DN2ZN row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=DN2ZN row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TLOAD MODE=DN2ZN row=64 col=128 trow=16 tcol=32 PLAT=linx + +#Tstore +make TESTCASE=TSTORE MODE=ND2ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TSTORE MODE=ND2ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TSTORE MODE=ND2ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TSTORE MODE=ND2ND row=64 col=128 trow=16 tcol=32 PLAT=linx + +make TESTCASE=TSTORE MODE=NZ2ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TSTORE MODE=NZ2ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TSTORE MODE=NZ2ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TSTORE MODE=NZ2ND row=64 col=128 trow=16 tcol=32 PLAT=linx + +# make TESTCASE=TSTORE MODE=ZN2DN row=16 col=16 trow=16 tcol=16 PLAT=linx +# make TESTCASE=TSTORE MODE=ZN2DN row=64 col=64 trow=16 tcol=16 PLAT=linx +# make TESTCASE=TSTORE MODE=ZN2DN row=128 col=64 trow=32 tcol=16 PLAT=linx +# make TESTCASE=TSTORE MODE=ZN2DN row=64 col=128 trow=16 tcol=32 PLAT=linx #Tadd -make TESTCASE=TADD MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TADD MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TADD MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TADD MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TADD MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TADD MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TADD MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TADD MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx # make TESTCASE=TADD MODE=NZ row=16 col=16 trow=16 tcol=16 PLAT=linx # make TESTCASE=TADD MODE=NZ row=64 col=64 trow=16 tcol=16 PLAT=linx @@ -46,10 +46,10 @@ make TESTCASE=TADD MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx # make TESTCASE=TADD MODE=ZN row=64 col=128 trow=16 tcol=32 PLAT=linx #Texp -make TESTCASE=TEXP MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TEXP MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TEXP MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TEXP MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TEXP MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TEXP MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TEXP MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TEXP MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx # make TESTCASE=TEXP MODE=NZ row=16 col=16 trow=16 tcol=16 PLAT=linx # make TESTCASE=TEXP MODE=NZ row=64 col=64 trow=16 tcol=16 PLAT=linx @@ -62,10 +62,10 @@ make TESTCASE=TEXP MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx # make TESTCASE=TEXP MODE=ZN row=64 col=128 trow=16 tcol=32 PLAT=linx #Tcopy -make TESTCASE=TCOPY MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPY MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TCOPY MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TCOPY MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TCOPY MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TCOPY MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TCOPY MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TCOPY MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx # make TESTCASE=TCOPY TEST_TYPE=tileop MODE=DN row=16 col=16 trow=16 tcol=16 PLAT=linx # make TESTCASE=TCOPY TEST_TYPE=tileop MODE=DN row=64 col=64 trow=16 tcol=16 PLAT=linx @@ -139,10 +139,10 @@ make TESTCASE=TASSEMBLE ROW=128 COL1=16 COL2=8 COL3=8 PLAT=linx make TESTCASE=TASSEMBLE ROW=128 COL1=8 COL2=16 COL3=8 PLAT=linx #TADDCAST -make TESTCASE=TADDCAST MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TADDCAST MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TADDCAST MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TADDCAST MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TADDCAST MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TADDCAST MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TADDCAST MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TADDCAST MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TABS make TESTCASE=TABS TEST_TYPE=tileop MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx @@ -151,10 +151,10 @@ make TESTCASE=TABS TEST_TYPE=tileop MODE=ND row=128 col=64 trow=32 tcol=16 PLAT make TESTCASE=TABS TEST_TYPE=tileop MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TADDMASK -make TESTCASE=TADDMASK MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TADDMASK MODE=ND row=66 col=66 trow=16 tcol=16 PLAT=linx -make TESTCASE=TADDMASK MODE=ND row=130 col=66 trow=32 tcol=16 PLAT=linx -make TESTCASE=TADDMASK MODE=ND row=66 col=130 trow=16 tcol=32 PLAT=linx +make TESTCASE=TADDMASK MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TADDMASK MODE=ND row=66 col=66 trow=16 tcol=16 PLAT=linx +make TESTCASE=TADDMASK MODE=ND row=130 col=66 trow=32 tcol=16 PLAT=linx +make TESTCASE=TADDMASK MODE=ND row=66 col=130 trow=16 tcol=32 PLAT=linx #TAND make TESTCASE=TAND TEST_TYPE=tileop MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx @@ -181,22 +181,22 @@ make TESTCASE=TEXPANDSCALAR TEST_TYPE=tileop MODE=ND row=128 col=64 trow=32 tco make TESTCASE=TEXPANDSCALAR TEST_TYPE=tileop MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TDIV -make TESTCASE=TDIV MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TDIV MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TDIV MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TDIV MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TDIV MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TDIV MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TDIV MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TDIV MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TMUL -make TESTCASE=TMUL MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TMUL MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TMUL MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TMUL MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TMUL MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TMUL MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TMUL MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TMUL MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TSUB -make TESTCASE=TSUB MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TSUB MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TSUB MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TSUB MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TSUB MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TSUB MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TSUB MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TSUB MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TCAST make TESTCASE=TCAST TEST_TYPE=tileop MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx @@ -229,10 +229,10 @@ make TESTCASE=TTRANS TEST_TYPE=tileop MODE=ND row=128 col=64 trow=32 tcol=16 PL make TESTCASE=TTRANS TEST_TYPE=tileop MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TMAKERANGE -make TESTCASE=TMAKERANGE MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx -make TESTCASE=TMAKERANGE MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx -make TESTCASE=TMAKERANGE MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx -make TESTCASE=TMAKERANGE MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx +make TESTCASE=TMAKERANGE MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx +make TESTCASE=TMAKERANGE MODE=ND row=64 col=64 trow=16 tcol=16 PLAT=linx +make TESTCASE=TMAKERANGE MODE=ND row=128 col=64 trow=32 tcol=16 PLAT=linx +make TESTCASE=TMAKERANGE MODE=ND row=64 col=128 trow=16 tcol=32 PLAT=linx #TOR make TESTCASE=TOR TEST_TYPE=tileop MODE=ND row=16 col=16 trow=16 tcol=16 PLAT=linx diff --git a/test/other/tileop_test/compile_fa_tileop.all b/tests/tileop_layout/compile_fa_tileop.all similarity index 100% rename from test/other/tileop_test/compile_fa_tileop.all rename to tests/tileop_layout/compile_fa_tileop.all diff --git a/test/other/tileop_test/src/CubeVecTrans.cpp b/tests/tileop_layout/src/CubeVecTrans.cpp similarity index 93% rename from test/other/tileop_test/src/CubeVecTrans.cpp rename to tests/tileop_layout/src/CubeVecTrans.cpp index b5aefb8..7c6837a 100644 --- a/test/other/tileop_test/src/CubeVecTrans.cpp +++ b/tests/tileop_layout/src/CubeVecTrans.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef GM @@ -64,8 +64,8 @@ void CubeVecTrans(float* dst, float* src0, float* src1){ for(int k=0;k #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TADD.cpp b/tests/tileop_layout/src/TADD.cpp similarity index 98% rename from test/other/tileop_test/src/TADD.cpp rename to tests/tileop_layout/src/TADD.cpp index bae1b85..7c096f6 100644 --- a/test/other/tileop_test/src/TADD.cpp +++ b/tests/tileop_layout/src/TADD.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TADDCAST.cpp b/tests/tileop_layout/src/TADDCAST.cpp similarity index 99% rename from test/other/tileop_test/src/TADDCAST.cpp rename to tests/tileop_layout/src/TADDCAST.cpp index f76949b..4772b76 100644 --- a/test/other/tileop_test/src/TADDCAST.cpp +++ b/tests/tileop_layout/src/TADDCAST.cpp @@ -3,7 +3,7 @@ #include "jcore/TAddCast.hpp" #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TADDMASK.cpp b/tests/tileop_layout/src/TADDMASK.cpp similarity index 90% rename from test/other/tileop_test/src/TADDMASK.cpp rename to tests/tileop_layout/src/TADDMASK.cpp index c52dcf9..b11fad5 100644 --- a/test/other/tileop_test/src/TADDMASK.cpp +++ b/tests/tileop_layout/src/TADDMASK.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -55,10 +55,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) { auto g2 = gdst(i, j); tile_shape td0(2*i+j), td1(i+2*j), td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TADD(td2, td0, td1); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } if constexpr (remainder_col) { auto g0 = gsrc0(i, block_col); @@ -66,10 +66,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) { auto g2 = gdst(i, block_col); trailing_rows_shape td0(2*i), td1(i), td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TADD(td2, td0, td1); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } } if constexpr (remainder_row) { @@ -79,10 +79,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) { auto g2 = gdst(block_row, j); trailing_cols_shape td0(j), td1(2*j), td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TADD(td2, td0, td1); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } if constexpr (remainder_col) { auto g0 = gsrc0(block_row, block_col); @@ -90,10 +90,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) { auto g2 = gdst(block_row, block_col); trailing_corner_shape td0(0), td1(1), td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TADD(td2, td0, td1); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } } } diff --git a/test/other/tileop_test/src/TAND.cpp b/tests/tileop_layout/src/TAND.cpp similarity index 95% rename from test/other/tileop_test/src/TAND.cpp rename to tests/tileop_layout/src/TAND.cpp index 6a4cbcf..3680161 100644 --- a/test/other/tileop_test/src/TAND.cpp +++ b/tests/tileop_layout/src/TAND.cpp @@ -1,9 +1,9 @@ #include -#include +#include #include "jcore/TAnd.hpp" #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -47,10 +47,10 @@ void tand_nd(T *dst, T *src0, T *src1) { auto g2 = gdst(i, j); tile_shape td0(2*i+j), td1(i+2*j), td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TAND_Impl(td2, td1, td0); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } } } diff --git a/test/other/tileop_test/src/TASSEMBLE.cpp b/tests/tileop_layout/src/TASSEMBLE.cpp similarity index 100% rename from test/other/tileop_test/src/TASSEMBLE.cpp rename to tests/tileop_layout/src/TASSEMBLE.cpp diff --git a/test/other/tileop_test/src/TCAST.cpp b/tests/tileop_layout/src/TCAST.cpp similarity index 98% rename from test/other/tileop_test/src/TCAST.cpp rename to tests/tileop_layout/src/TCAST.cpp index 33504fa..cdf679b 100644 --- a/test/other/tileop_test/src/TCAST.cpp +++ b/tests/tileop_layout/src/TCAST.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TCOPY.cpp b/tests/tileop_layout/src/TCOPY.cpp similarity index 99% rename from test/other/tileop_test/src/TCOPY.cpp rename to tests/tileop_layout/src/TCOPY.cpp index 2ec924f..594f6d0 100644 --- a/test/other/tileop_test/src/TCOPY.cpp +++ b/tests/tileop_layout/src/TCOPY.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TDIV.cpp b/tests/tileop_layout/src/TDIV.cpp similarity index 98% rename from test/other/tileop_test/src/TDIV.cpp rename to tests/tileop_layout/src/TDIV.cpp index b2f8a2b..a326d25 100644 --- a/test/other/tileop_test/src/TDIV.cpp +++ b/tests/tileop_layout/src/TDIV.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TEXP.cpp b/tests/tileop_layout/src/TEXP.cpp similarity index 98% rename from test/other/tileop_test/src/TEXP.cpp rename to tests/tileop_layout/src/TEXP.cpp index 320ed9a..3897710 100644 --- a/test/other/tileop_test/src/TEXP.cpp +++ b/tests/tileop_layout/src/TEXP.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TEXPANDCOL.cpp b/tests/tileop_layout/src/TEXPANDCOL.cpp similarity index 93% rename from test/other/tileop_test/src/TEXPANDCOL.cpp rename to tests/tileop_layout/src/TEXPANDCOL.cpp index d8f4e8d..6119706 100644 --- a/test/other/tileop_test/src/TEXPANDCOL.cpp +++ b/tests/tileop_layout/src/TEXPANDCOL.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -47,9 +47,9 @@ void texpandcol_nd(float *dst, float *src) { tile_shape_in td0(2*i+j); tile_shape_out td1; - // TCOPYIN(td0, g0); + // TLOAD(td0, g0); TEXPANDCOL(td1, td0); - // TCOPYOUT(g1, td1); + // TSTORE(g1, td1); } } } @@ -76,9 +76,9 @@ void texpandcol_dn(float *dst, float *src) { tile_shape_in td0(2*i+j); tile_shape_out td1; - // TCOPYIN(td0, g0); + // TLOAD(td0, g0); TEXPANDCOL(td1, td0); - // TCOPYOUT(g1, td1); + // TSTORE(g1, td1); } } } diff --git a/test/other/tileop_test/src/TEXPANDROW.cpp b/tests/tileop_layout/src/TEXPANDROW.cpp similarity index 93% rename from test/other/tileop_test/src/TEXPANDROW.cpp rename to tests/tileop_layout/src/TEXPANDROW.cpp index f5db59b..2eaea10 100644 --- a/test/other/tileop_test/src/TEXPANDROW.cpp +++ b/tests/tileop_layout/src/TEXPANDROW.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -47,9 +47,9 @@ void texpandrow_nd(float *dst, float *src) { tile_shape_in td0(2*i+j); tile_shape_out td1; - // TCOPYIN(td0, g0); + // TLOAD(td0, g0); TEXPANDROW(td1, td0); - // TCOPYOUT(g1, td1); + // TSTORE(g1, td1); } } } @@ -76,9 +76,9 @@ void texpandrow_dn(float *dst, float *src) { tile_shape_in td0(2*i+j); tile_shape_out td1; - // TCOPYIN(td0, g0); + // TLOAD(td0, g0); TEXPANDROW(td1, td0); - // TCOPYOUT(g1, td1); + // TSTORE(g1, td1); } } } diff --git a/test/other/tileop_test/src/TEXPANDSCALAR.cpp b/tests/tileop_layout/src/TEXPANDSCALAR.cpp similarity index 94% rename from test/other/tileop_test/src/TEXPANDSCALAR.cpp rename to tests/tileop_layout/src/TEXPANDSCALAR.cpp index 6834142..5621dfa 100644 --- a/test/other/tileop_test/src/TEXPANDSCALAR.cpp +++ b/tests/tileop_layout/src/TEXPANDSCALAR.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -43,7 +43,7 @@ void texpandscalar_nd(float *dst, float s) { tile_shape td0; TEXPANDSCALAR(td0, s); - // TCOPYOUT(g0, td0); + // TSTORE(g0, td0); } } } @@ -66,7 +66,7 @@ void texpandscalar_dn(float *dst, float s) { tile_shape td0; TEXPANDSCALAR(td0, s); - // TCOPYOUT(g0, td0); + // TSTORE(g0, td0); } } } diff --git a/test/other/tileop_test/src/TEXTRACT.cpp b/tests/tileop_layout/src/TEXTRACT.cpp similarity index 100% rename from test/other/tileop_test/src/TEXTRACT.cpp rename to tests/tileop_layout/src/TEXTRACT.cpp diff --git a/test/other/tileop_test/src/TFILLPAD.cpp b/tests/tileop_layout/src/TFILLPAD.cpp similarity index 98% rename from test/other/tileop_test/src/TFILLPAD.cpp rename to tests/tileop_layout/src/TFILLPAD.cpp index b80a8bd..f3f37de 100644 --- a/test/other/tileop_test/src/TFILLPAD.cpp +++ b/tests/tileop_layout/src/TFILLPAD.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TGATHER.cpp b/tests/tileop_layout/src/TGATHER.cpp similarity index 93% rename from test/other/tileop_test/src/TGATHER.cpp rename to tests/tileop_layout/src/TGATHER.cpp index e3bda07..86e4917 100644 --- a/test/other/tileop_test/src/TGATHER.cpp +++ b/tests/tileop_layout/src/TGATHER.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -55,10 +55,10 @@ void tgather_nd(float *dst, float *src, uint16_t *indices) { tile_shape_src td0(2*i+j); tile_shape_indices td1(1); tile_shape_dst td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TGATHER(td2, td0, td1); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } } } @@ -93,10 +93,10 @@ void tgather_dn(float *dst, float *src, uint16_t *indices) { tile_shape_src td0(2*i+j); tile_shape_indices td1(1); tile_shape_dst td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TGATHER(td2, td0, td1); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } } } diff --git a/test/other/tileop_test/src/TCOPYIN.cpp b/tests/tileop_layout/src/TLOAD.cpp similarity index 83% rename from test/other/tileop_test/src/TCOPYIN.cpp rename to tests/tileop_layout/src/TLOAD.cpp index b3d38b6..c947f3c 100644 --- a/test/other/tileop_test/src/TCOPYIN.cpp +++ b/tests/tileop_layout/src/TLOAD.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef GM_ROW @@ -26,7 +26,7 @@ #endif template -void copyin_nd2nd(float *src) { +void load_nd2nd(float *src) { using gm_shape = global_tensor>; using tile_shape = Tile; //带mask tile_shape = Tile; @@ -36,14 +36,14 @@ void copyin_nd2nd(float *src) { for (int j = 0; j < block_col; ++j) { int offset = i * (tile_row * gm_col) + j * tile_col; gm_shape s0(src + offset); - tile_shape d0; - TCOPYIN(d0, s0); + tile_shape d0; + TLOAD(d0, s0); } } } template -void copyin_nd2nz(float *src) { +void load_nd2nz(float *src) { using gm_shape = global_tensor>; using tile_shape = TileLeft; @@ -56,15 +56,15 @@ void copyin_nd2nz(float *src) { for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { - tile_shape d0; + tile_shape d0; auto g0 = gsrc(i,j); - TCOPYIN(d0, g0); + TLOAD(d0, g0); } } } template -void copyin_dn2zn(float *src) { +void load_dn2zn(float *src) { using gm_shape = global_tensor>; using tile_shape = TileRight; @@ -77,9 +77,9 @@ void copyin_dn2zn(float *src) { for (int i = 0; i < block_col; ++i) { for (int j = 0; j < block_row; ++j) { - tile_shape d0; + tile_shape d0; auto g0 = gsrc(i,j); - TCOPYIN(d0, g0); + TLOAD(d0, g0); } } } @@ -100,11 +100,11 @@ int main() { float src[gm_size]; if(!strcmp(MODE, "ND2ND")){ - copyin_nd2nd(src); + load_nd2nd(src); }else if(!strcmp(MODE, "ND2NZ")){ - copyin_nd2nz(src); + load_nd2nz(src); }else if(!strcmp(MODE, "DN2ZN")){ - copyin_dn2zn(src); + load_dn2zn(src); } #ifdef LINX_PMC diff --git a/test/other/tileop_test/src/TMAKERANGE.cpp b/tests/tileop_layout/src/TMAKERANGE.cpp similarity index 94% rename from test/other/tileop_test/src/TMAKERANGE.cpp rename to tests/tileop_layout/src/TMAKERANGE.cpp index b63317b..66ec41d 100644 --- a/test/other/tileop_test/src/TMAKERANGE.cpp +++ b/tests/tileop_layout/src/TMAKERANGE.cpp @@ -1,9 +1,9 @@ #include -#include +#include #include "jcore/TMakeRange.hpp" #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -44,7 +44,7 @@ void tmakerange_nd(float *dst, float s) { tile_shape td0; TMAKERANGE_Impl(td0, s); - // TCOPYOUT(g0, td0); + // TSTORE(g0, td0); } } } @@ -67,7 +67,7 @@ void tmakerange_dn(float *dst, float s) { tile_shape td0; // TMAKERANGE_Impl(td0, s); - // TCOPYOUT(g0, td0); + // TSTORE(g0, td0); } } } diff --git a/test/other/tileop_test/src/TMUL.cpp b/tests/tileop_layout/src/TMUL.cpp similarity index 98% rename from test/other/tileop_test/src/TMUL.cpp rename to tests/tileop_layout/src/TMUL.cpp index 47cd391..0389cb3 100644 --- a/test/other/tileop_test/src/TMUL.cpp +++ b/tests/tileop_layout/src/TMUL.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TOR.cpp b/tests/tileop_layout/src/TOR.cpp similarity index 95% rename from test/other/tileop_test/src/TOR.cpp rename to tests/tileop_layout/src/TOR.cpp index a31eb31..ae9e5e9 100644 --- a/test/other/tileop_test/src/TOR.cpp +++ b/tests/tileop_layout/src/TOR.cpp @@ -1,9 +1,9 @@ #include -#include +#include #include "jcore/TOr.hpp" #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -47,10 +47,10 @@ void tor_nd(T *dst, T *src0, T *src1) { auto g2 = gdst(i, j); tile_shape td0(i%2), td1(j%2), td2; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); + // TLOAD(td0, g0); + // TLOAD(td1, g1); TOR_Impl(td2, td1, td0); - // TCOPYOUT(g2, td2); + // TSTORE(g2, td2); } } } diff --git a/test/other/tileop_test/src/TRESHAPE.cpp b/tests/tileop_layout/src/TRESHAPE.cpp similarity index 100% rename from test/other/tileop_test/src/TRESHAPE.cpp rename to tests/tileop_layout/src/TRESHAPE.cpp diff --git a/test/other/tileop_test/src/TROWMAX.cpp b/tests/tileop_layout/src/TROWMAX.cpp similarity index 98% rename from test/other/tileop_test/src/TROWMAX.cpp rename to tests/tileop_layout/src/TROWMAX.cpp index b668481..a70bb93 100644 --- a/test/other/tileop_test/src/TROWMAX.cpp +++ b/tests/tileop_layout/src/TROWMAX.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TROWMAXEXPAND.cpp b/tests/tileop_layout/src/TROWMAXEXPAND.cpp similarity index 98% rename from test/other/tileop_test/src/TROWMAXEXPAND.cpp rename to tests/tileop_layout/src/TROWMAXEXPAND.cpp index 8adb2a1..8c4af0f 100644 --- a/test/other/tileop_test/src/TROWMAXEXPAND.cpp +++ b/tests/tileop_layout/src/TROWMAXEXPAND.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TROWSUM.cpp b/tests/tileop_layout/src/TROWSUM.cpp similarity index 98% rename from test/other/tileop_test/src/TROWSUM.cpp rename to tests/tileop_layout/src/TROWSUM.cpp index 7fd88b9..3ed3d08 100644 --- a/test/other/tileop_test/src/TROWSUM.cpp +++ b/tests/tileop_layout/src/TROWSUM.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TROWSUMEXPAND.cpp b/tests/tileop_layout/src/TROWSUMEXPAND.cpp similarity index 98% rename from test/other/tileop_test/src/TROWSUMEXPAND.cpp rename to tests/tileop_layout/src/TROWSUMEXPAND.cpp index cb5bff7..437c50c 100644 --- a/test/other/tileop_test/src/TROWSUMEXPAND.cpp +++ b/tests/tileop_layout/src/TROWSUMEXPAND.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TSELECT.cpp b/tests/tileop_layout/src/TSELECT.cpp similarity index 91% rename from test/other/tileop_test/src/TSELECT.cpp rename to tests/tileop_layout/src/TSELECT.cpp index 685aa80..519b55f 100644 --- a/test/other/tileop_test/src/TSELECT.cpp +++ b/tests/tileop_layout/src/TSELECT.cpp @@ -1,8 +1,8 @@ #include -#include +#include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -54,11 +54,11 @@ void tselect_nd(float *dst, float *src0, float *src1, uint16_t *cond) { tile_shape_fp32 td0(2*i+j), td1(i+2*j); tile_shape_uint16 td2(i%2); tile_shape_fp32 td3; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); - // TCOPYIN(td2, g2); + // TLOAD(td0, g0); + // TLOAD(td1, g1); + // TLOAD(td2, g2); TSELECT(td3, td2, td0, td1); - // TCOPYOUT(g3, td3); + // TSTORE(g3, td3); } } } @@ -92,11 +92,11 @@ void tselect_dn(float *dst, float *src0, float *src1, uint16_t *cond) { tile_shape_fp32 td0(2*i+j), td1(i+2*j); tile_shape_uint16 td2(i%2); tile_shape_fp32 td3; - // TCOPYIN(td0, g0); - // TCOPYIN(td1, g1); - // TCOPYIN(td2, g2); + // TLOAD(td0, g0); + // TLOAD(td1, g1); + // TLOAD(td2, g2); TSELECT(td3, td2, td0, td1); - // TCOPYOUT(g3, td3); + // TSTORE(g3, td3); } } } diff --git a/test/other/tileop_test/src/TCOPYOUT.cpp b/tests/tileop_layout/src/TSTORE.cpp similarity index 85% rename from test/other/tileop_test/src/TCOPYOUT.cpp rename to tests/tileop_layout/src/TSTORE.cpp index 7c81372..b8e7bc6 100644 --- a/test/other/tileop_test/src/TCOPYOUT.cpp +++ b/tests/tileop_layout/src/TSTORE.cpp @@ -1,7 +1,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW @@ -21,7 +21,7 @@ #endif template -void copyout_nd2nd(float *dst) { +void store_nd2nd(float *dst) { using tile_shape = Tile; using gm_shape = global_tensor>; @@ -36,13 +36,13 @@ void copyout_nd2nd(float *dst) { for (int j = 0; j < block_col; ++j) { tile_shape d0(i+j); auto dstO = gdst(i,j); - TCOPYOUT(dstO, d0); + TSTORE(dstO, d0); } } } template -void copyout_nz2nd(float *dst) { +void store_nz2nd(float *dst) { using tile_shape = TileLeft; using gm_shape = global_tensor>; @@ -56,13 +56,13 @@ void copyout_nz2nd(float *dst) { for (int j = 0; j < block_col; ++j) { tile_shape d0(i+j); auto dstO = gdst(i,j); - TCOPYOUT(dstO, d0); + TSTORE(dstO, d0); } } } template -void copyout_zn2dn(float *dst) { +void store_zn2dn(float *dst) { // using tile_shape = TileRight; // using gm_shape = global_tensor>; @@ -76,7 +76,7 @@ void copyout_zn2dn(float *dst) { // for (int j = 0; j < block_col; ++j) { // tile_shape d0(i+j); // auto dstO = gdst(i,j); - // TCOPYOUT(dstO, d0); + // TSTORE(dstO, d0); // } // } } @@ -92,15 +92,15 @@ int main() { #endif size_t gm_size = gm_row * gm_col; - + float dst[gm_size]; if(!strcmp(MODE, "ND2ND")){ - copyout_nd2nd(dst); + store_nd2nd(dst); }else if(!strcmp(MODE, "NZ2ND")){ - copyout_nz2nd(dst); + store_nz2nd(dst); }else if(!strcmp(MODE, "ZN2DN")){ - copyout_zn2dn(dst); + store_zn2dn(dst); } #ifdef LINX_PMC diff --git a/test/other/tileop_test/src/TSUB.cpp b/tests/tileop_layout/src/TSUB.cpp similarity index 98% rename from test/other/tileop_test/src/TSUB.cpp rename to tests/tileop_layout/src/TSUB.cpp index 9448a6e..afbed25 100644 --- a/test/other/tileop_test/src/TSUB.cpp +++ b/tests/tileop_layout/src/TSUB.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/TTRANS.cpp b/tests/tileop_layout/src/TTRANS.cpp similarity index 98% rename from test/other/tileop_test/src/TTRANS.cpp rename to tests/tileop_layout/src/TTRANS.cpp index 8975633..e2852a0 100644 --- a/test/other/tileop_test/src/TTRANS.cpp +++ b/tests/tileop_layout/src/TTRANS.cpp @@ -2,7 +2,7 @@ #include #ifdef LINX_PMC -#include "../linxStartEnd.hpp" +#include #endif #ifndef ROW diff --git a/test/other/tileop_test/src/fa_tileop.cpp b/tests/tileop_layout/src/fa_tileop.cpp similarity index 92% rename from test/other/tileop_test/src/fa_tileop.cpp rename to tests/tileop_layout/src/fa_tileop.cpp index 10812e2..d409687 100644 --- a/test/other/tileop_test/src/fa_tileop.cpp +++ b/tests/tileop_layout/src/fa_tileop.cpp @@ -38,7 +38,7 @@ #endif #else typedef float dtype; -#endif +#endif template void tsub_nz_left(dtype *dst, dtype *src0, dtype *src1) { @@ -57,11 +57,11 @@ void tsub_nz_left(dtype *dst, dtype *src0, dtype *src1) { for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { tile_shape tsrc0,tsrc1; - TCOPYIN(tsrc0, gsrc0(i, j)); - TCOPYIN(tsrc1, gsrc1(i, j)); + TLOAD(tsrc0, gsrc0(i, j)); + TLOAD(tsrc1, gsrc1(i, j)); TSUB(tsrc0, tsrc0, tsrc1); auto gO = gdst(i, j); - TCOPYOUT(gO, tsrc0); + TSTORE(gO, tsrc0); } } } @@ -86,11 +86,11 @@ void trowsum_nz_left(dtype *dst, dtype *src) { tile_shape_r trsum; for (int j = 0; j < block_col; ++j) { tile_shape_in tsrc; - TCOPYIN(tsrc, gsrc(i, j)); + TLOAD(tsrc, gsrc(i, j)); TROWSUM(trsum, tsrc); } auto gO = gdst(i, 0); - TCOPYOUT(gO, trsum); + TSTORE(gO, trsum); } } @@ -115,10 +115,10 @@ void texpandcol_nz_left(dtype *dst, dtype *src) { for (int j = 0; j < block_col; ++j) { tile_shape_in tsrc; tile_shape_expand texpand; - TCOPYIN(tsrc, gsrc(i, 0)); + TLOAD(tsrc, gsrc(i, 0)); TEXPANDCOL(texpand, tsrc); auto gO = gdst(i, j); - TCOPYOUT(gO, texpand); + TSTORE(gO, texpand); } } } @@ -140,11 +140,11 @@ void tmul_nz_out(dtype *dst, dtype *src0, dtype *src1) { for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { tile_shape tsrc0,tsrc1; - TCOPYIN(tsrc0, gsrc0(i, j)); - TCOPYIN(tsrc1, gsrc1(i, j)); + TLOAD(tsrc0, gsrc0(i, j)); + TLOAD(tsrc1, gsrc1(i, j)); TMUL(tsrc0, tsrc0, tsrc1); auto gO = gdst(i, j); - TCOPYOUT(gO, tsrc0); + TSTORE(gO, tsrc0); } } } @@ -170,10 +170,10 @@ void texpandcol_nz_out(dtype *dst, dtype *src) { for (int j = 0; j < block_col; ++j) { tile_shape_in tsrc; tile_shape_expand texpand; - TCOPYIN(tsrc, gsrc(i, 0)); + TLOAD(tsrc, gsrc(i, 0)); TEXPANDCOL(texpand, tsrc); auto gO = gdst(i, j); - TCOPYOUT(gO, texpand); + TSTORE(gO, texpand); } } } @@ -199,11 +199,11 @@ void trowmax_nz_left(dtype *dst, dtype *src) { tile_shape_r trmax; for (int j = 0; j < block_col; ++j) { tile_shape_in tsrc; - TCOPYIN(tsrc, gsrc(i, j)); + TLOAD(tsrc, gsrc(i, j)); TROWMAX(trmax, tsrc); } auto gO = gdst(i, 0); - TCOPYOUT(gO, trmax); + TSTORE(gO, trmax); } } @@ -223,10 +223,10 @@ void tmuls_nz_left(dtype *dst, dtype *src, dtype s) { for (int i = 0; i < block_row; ++i) { for (int j = 0; j < block_col; ++j) { tile_shape tsrc; - TCOPYIN(tsrc, gsrc(i, j)); + TLOAD(tsrc, gsrc(i, j)); TMULS(tsrc, tsrc, s); auto gO = gdst(i, j); - TCOPYOUT(gO, tsrc); + TSTORE(gO, tsrc); } } } @@ -249,10 +249,10 @@ void tcvt_out_left(dtype *dst, dtype *src) { for (int j = 0; j < block_col; ++j) { tile_shape_in tsrc; tile_shape_out tout; - TCOPYIN(tsrc, gsrc(i, j)); + TLOAD(tsrc, gsrc(i, j)); TCVT(tout, tsrc); auto gO = gdst(i, j); - TCOPYOUT(gO, tout); + TSTORE(gO, tout); } } }