diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..f087b42
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
diff --git a/README.md b/README.md
index 0b10325..2b744a9 100644
--- a/README.md
+++ b/README.md
@@ -1,81 +1,149 @@
 # SuperNPUBench
 
-SuperNPUBench is a TileOP API, kernel, model, and benchmark workspace for
-LinxISA/PTO-style tile programming experiments. The repository is organized
-around header-only TileOP APIs, reusable kernels, model-level examples, and
-make-driven test suites.
+SuperNPUBench is a Linx/PTO TileOP benchmark workspace. Active benchmark
+entrypoints live under `benchmarks/`, reusable library code stays under
+`include/`, `kernels/`, and `models/`, non-benchmark correctness checks live
+under `tests/`, and superseded material is preserved under `archive/outdated/`.
 
 ## Repository Map
 
 | Path | Purpose |
 | --- | --- |
+| [`benchmarks`](benchmarks) | Primary active Linx-buildable benchmark tree. Start here for benchmark source and build commands. |
+| [`benchmarks/INDEX.md`](benchmarks/INDEX.md) | Source catalog with benchmark paths, build commands, category, status, and required data objects. |
+| [`benchmarks/common`](benchmarks/common) | Shared make harness, platform selection, compiler flags, simulator targets, and benchmark-local helper headers. |
+| [`compiler`](compiler) | Compiler artifact staging. The checked-in legacy archive is a Git LFS pointer; active builds should pass a real toolchain through `COMPILER_DIR`. |
 | [`include/common`](include/common) | Shared TileOP API surface, data types, tensor helpers, layouts, and compile-time utilities. |
-| [`include/cpu_sim`](include/cpu_sim) | CPU simulation backend used by tests built with `PLAT=cpu`. |
+| [`include/benchmark_support`](include/benchmark_support) | Benchmark-only support headers, including NPU helper APIs used by active suites. |
+| [`include/cpu_sim`](include/cpu_sim) | CPU simulation backend used by checks built with `PLAT=cpu`. |
 | [`include/aarch64`](include/aarch64) | Host/Arm-oriented backend headers and TileOP API variants. |
-| [`include/accelerator`](include/accelerator) | Accelerator-facing headers split by versioned targets such as `v220` and `v310`. |
-| [`include/jcore`](include/jcore) | Linx/JCore backend headers used by tests built with `PLAT=linx`. |
-| [`kernels`](kernels) | Reusable kernel implementations, grouped by domain such as element-wise, memory, reduction, and matmul. |
-| [`models`](models) | Model-level examples and compositions, currently including DeepSeekV3-oriented code. |
-| [`test/common`](test/common) | Shared make harness, platform selection, compiler flags, and simulator targets. |
-| [`test/tileop_api`](test/tileop_api) | Focused TileOP API compile and behavior tests. Start here for single-operation API work. |
-| [`test/py_api`](test/py_api) | Python extension and golden-comparison flow for TileOP API checks. |
-| [`test/accelerator`](test/accelerator) | Accelerator-oriented benchmark and validation suites. |
-| [`test/kernel`](test/kernel) | Kernel benchmark and validation suites. |
-| [`test/other`](test/other) | Additional model, microbenchmark, TileOP, vector, and script-driven suites. |
-| [`test/script`](test/script) | Recursive compile/run helper for batch workflows. |
+| [`include/accelerator`](include/accelerator) | Versioned device API headers consumed by Linx/NPU code. |
+| [`include/jcore`](include/jcore) | Linx/JCore backend headers used by `PLAT=linx`. |
+| [`kernels`](kernels) | Reusable kernel implementations shared by benchmark entrypoints. |
+| [`models`](models) | Reusable model-level implementation code shared by model benchmarks. |
+| [`samples`](samples) | Checked-in sample compiler disassembly for representative flash attention and GEMM outputs. |
+| [`tests`](tests) | Non-benchmark correctness checks, kept separate from the primary Linx benchmark tree. |
+| [`archive/outdated`](archive/outdated) | Preserved duplicate, superseded, generated, or unusable historical material with replacement notes. |
 | `output/` | Generated build products. Treat this as local output, not source. |
 
-## Quick Navigation
+## Getting Started: Linx Compiler And QEMU
 
-- To understand the public API, start with [`include/common`](include/common).
-- To inspect a backend implementation of an operation, compare the matching
-  headers under [`include/cpu_sim`](include/cpu_sim),
-  [`include/jcore`](include/jcore), and [`include/aarch64`](include/aarch64).
-- To add or update reusable compute code, use the matching domain under
-  [`kernels`](kernels).
-- To compile one small API test, use [`test/tileop_api`](test/tileop_api).
-- To validate a Python-facing API path, use [`test/py_api`](test/py_api) and
-  [`test/py_api/golden_cmp`](test/py_api/golden_cmp).
-- To run larger suites, use the `compile.all` files in the relevant
-  [`test`](test) subdirectories.
+These commands assume this workload lives at `$LINXISA_ROOT/workloads/SuperNPUBench`. Set `LINXISA_ROOT` to the root of your Linx superproject checkout.
 
-## Header Installation
+Build the Linx LLVM compiler from `compiler/llvm`:
 
-The top-level `Makefile` installs the TileOP header tree into the active
-Clang resource directory under `include/tileop-api`.
+```bash
+export LINXISA_ROOT=/path/to/linx-isa
+cd "$LINXISA_ROOT"
 
-```sh
-make -n CLANG_PREFIX=/usr
-make install CLANG_PREFIX=/path/to/clang/prefix
-make uninstall CLANG_PREFIX=/path/to/clang/prefix
+cmake -S compiler/llvm/llvm -B compiler/llvm/build-linxisa-clang -G Ninja \
+  -DLLVM_ENABLE_PROJECTS="clang;lld" \
+  -DLLVM_TARGETS_TO_BUILD=Linx
+cmake --build compiler/llvm/build-linxisa-clang --target clang lld llvm-mc llvm-objdump llvm-objcopy
+
+export COMPILER_DIR="$LINXISA_ROOT/compiler/llvm/build-linxisa-clang/bin"
+```
+
+For incremental compiler rebuilds after the CMake tree exists:
+
+```bash
+cd "$LINXISA_ROOT"
+bash tools/bringup/run_llvm_incremental_build.sh clang lld llvm-mc llvm-objdump llvm-objcopy
+```
+
+Build the Linx QEMU target and run a QEMU smoke suite:
+
+```bash
+cd "$LINXISA_ROOT"
+QEMU="$(bash tools/bringup/run_qemu_build_local.sh)"
+
+cd "$LINXISA_ROOT/avs/qemu"
+CLANG="$COMPILER_DIR/clang" LLD="$COMPILER_DIR/ld.lld" QEMU="$QEMU" ./run_tests.sh --suite arithmetic --timeout 10
+```
+
+Compile one SuperNPUBench case with the rebuilt compiler:
+
+```bash
+cd "$LINXISA_ROOT/workloads/SuperNPUBench"
+make -C benchmarks/api/tileop TESTCASE=TAdd PLAT=linx COMPILER_DIR="$COMPILER_DIR"
 ```
 
-`CLANG_PREFIX` should point to a prefix containing `bin/clang`. The dry run is
-useful because the install location is derived from
-`clang -print-resource-dir`.
+The benchmark `sim` target invokes `$(QEMU) -run-supertest -blk_optimize force_tb_chained <elf>`. Use it with a SuperTest-compatible QEMU binary:
 
-## Building Tests
+```bash
+make -C benchmarks/api/tileop TESTCASE=TAdd PLAT=linx COMPILER_DIR="$COMPILER_DIR" QEMU=/path/to/supertest-compatible-qemu sim
+```
+
+For standard `qemu-system-linx64 -machine virt` validation, use the `avs/qemu` runner shown above.
+
+## Benchmark Navigation
+
+Active benchmark source is grouped by workload intent:
+
+| Path | Category |
+| --- | --- |
+| [`benchmarks/api/tileop`](benchmarks/api/tileop) | TileOP API operation benchmarks. |
+| [`benchmarks/npu`](benchmarks/npu) | NPU cube, fusion, NDDMA, vec SIMD, and vec SIMT suites. |
+| [`benchmarks/kernels`](benchmarks/kernels) | Control, element-wise, GEMM, fusion, memory, reduction, sort, and composite kernel suites. |
+| [`benchmarks/models/deepseekv3`](benchmarks/models/deepseekv3) | DeepSeekV3 model-level benchmark cases. |
+| [`benchmarks/microbench`](benchmarks/microbench) | Cube, memory, and vector microbenchmarks. |
+| [`benchmarks/scripts`](benchmarks/scripts) | Batch and recursive helper scripts for benchmark workflows. |
+
+Use [`benchmarks/INDEX.md`](benchmarks/INDEX.md) when you need the exact source path, suite-level build command, or data-object requirement for a case.
+
+## Benchmark Names
 
-Most test directories include [`test/common/Makefile.common`](test/common/Makefile.common).
-The common harness is controlled mainly by `TESTCASE`, `PLAT`,
-`COMPILER_DIR`, and `QEMU`.
+The table below is generated from active benchmark source files. It currently lists 166 C++ benchmark entrypoints across 26 suites, alongside 44 batch scripts.
+
+| Suite | Benchmark names |
+| --- | --- |
+| [`benchmarks/api/tileop`](benchmarks/api/tileop) | `Cus_Template_ASM`, `MatMacc`, `MatMul`, `MatMul_e4m3`, `Print`, `TAbs`, `TAdd`<br>`TAdd_mask`, `TAdds`, `TAnd`, `TAssemble`, `TCI`, `TCast`, `TCmp`<br>`TCopy`, `TLoad`, `TStore`, `TCvt`, `TDiv`, `TDivs`, `TExp`<br>`TExpandCol`, `TExpandRow`, `TExpandScalar`, `TExtract`, `TFillPad`, `TGather`, `TMax`<br>`TMaxs`, `TMin`, `TMins`, `TMul`, `TMuls`, `TOr`, `TPad`<br>`TRSqrt`, `TRecip`, `TRem`, `TReshape`, `TRowMax`, `TRowMaxExpand`, `TRowSum`<br>`TRowSumExpand`, `TScatter`, `TSelect`, `TSqrt`, `TSub`, `TSubs`, `TTrans`<br>`test_MatMacc`, `test_MatMmxac`, `test_MatMul`, `test_MatMulmx` |
+| [`benchmarks/npu/cube`](benchmarks/npu/cube) | `LLAMA3_70B_attn_matmul_decode_bs_192`, `LLAMA3_70B_ffn_matmul_3_decode_bs_192`, `QuantBatchMatmulV3_292_hif4`, `QuantBatchMatmulV3_293_hif4`, `QuantBatchMatmulV3_294_hif4`, `QuantBatchMatmulV3_295_hif4`, `QuantBatchMatmulV3_296_hif4`<br>`QuantBatchMatmulV3_297_hif4`, `dsv3_q_up_proj_mxfp8`, `llama3_70b_w8_bs_1_case_4`, `llama_train_mm_2_A16W4`, `llama_train_mm_2_A16W8`, `llama_train_mm_2_mxfp8_mxfp4`, `llava1_6_6`<br>`mat_mul_o1_align_0001`, `matmul_1_bs16_fp8_GB_test`, `model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf`, `moe_w1w3_bs16_fp8_GB_DN_nbuf`, `mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022`, `mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16`, `xinghuo_13b_tp8_matmul_01_A16W8`<br>`xinghuo_13b_tp8_matmul_01_mxfp8_modified`, `xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4` |
+| [`benchmarks/npu/fusion`](benchmarks/npu/fusion) | `fa1`, `fa10`, `fa11`, `fa2`, `fa3`, `fa4`, `fa5`<br>`fa6`, `fa7`, `fa8`, `fa9`, `fa_fp4`, `flashmla13` |
+| [`benchmarks/npu/nddma`](benchmarks/npu/nddma) | `transpose_053_mgather`, `transpose_053_tload` |
+| [`benchmarks/npu/vec_simd`](benchmarks/npu/vec_simd) | `Add_ND_bfloat16_float32_DeepSeek_V3_000028`, `LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic`, `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic`, `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV`, `gemm_18x128x256`, `layernorm_vcadd_vaddx3_12288_fp16`, `moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV`<br>`rmsnorm_reduce_1_16384_fp16`, `rmsnorm_reduce_2_8192_fp16`, `rmsnorm_reduce_4_4096_fp16`, `rmsnorm_reduce_4_5120_fp16`, `rope_32_40_1_64_bf16`, `softmax_8_34_fp16`, `softmax_LLM_2`<br>`softmax_vaddx3_vcadd_1_4096_bf16`, `softmax_vaddx3_vcadd_1_4096_fp16`, `swiglu_64_1024_fp16` |
+| [`benchmarks/npu/vec_simt`](benchmarks/npu/vec_simt) | `npu_hashtable_insert_cmp_host`, `npu_hashtable_lookup_cmp_host`, `hashfind` |
+| [`benchmarks/kernels/composite`](benchmarks/kernels/composite) | `flash_attention`, `flash_attention_mask`, `gemm`, `linear`, `matmul`, `normalization`, `onlinesoftmax`<br>`softmax` |
+| [`benchmarks/kernels/control`](benchmarks/kernels/control) | `hashfind`, `hashtable_lookup_simd`, `hashtable_lookup_simt`, `hashtable_lookup_simt_v2`, `hkv` |
+| [`benchmarks/kernels/element_wise/gelu`](benchmarks/kernels/element_wise/gelu) | `gelu` |
+| [`benchmarks/kernels/fusion`](benchmarks/kernels/fusion) | `fa_hif4` |
+| [`benchmarks/kernels/gemm/matmul`](benchmarks/kernels/gemm/matmul) | `A16W4`, `HiF4_HiF4` |
+| [`benchmarks/kernels/memory/broadcast`](benchmarks/kernels/memory/broadcast) | `broadcast`, `broadcast_019`, `broadcast_039`, `broadcast_07`, `broadcast_Hunyuan`, `broadcast_mscatter`, `broadcast_nostore`<br>`broadcast_nomg`, `broadcast_tst` |
+| [`benchmarks/kernels/memory/broadcast_vec`](benchmarks/kernels/memory/broadcast_vec) | `broadcast_vec_019`, `broadcast_vec_039`, `broadcast_vec_07` |
+| [`benchmarks/kernels/memory/concat_gather`](benchmarks/kernels/memory/concat_gather) | `concat_gather` |
+| [`benchmarks/kernels/memory/concat_scatter`](benchmarks/kernels/memory/concat_scatter) | `concat_scatter` |
+| [`benchmarks/kernels/memory/gather`](benchmarks/kernels/memory/gather) | `gather` |
+| [`benchmarks/kernels/memory/transpose`](benchmarks/kernels/memory/transpose) | `transpose` |
+| [`benchmarks/kernels/reduction/reducemax_col`](benchmarks/kernels/reduction/reducemax_col) | `reducemax_col` |
+| [`benchmarks/kernels/reduction/reducemax_row`](benchmarks/kernels/reduction/reducemax_row) | `reducemax_row` |
+| [`benchmarks/kernels/reduction/reducesum_col`](benchmarks/kernels/reduction/reducesum_col) | `reducesum_col` |
+| [`benchmarks/kernels/reduction/reducesum_row`](benchmarks/kernels/reduction/reducesum_row) | `reducesum_row` |
+| [`benchmarks/kernels/sort/topk`](benchmarks/kernels/sort/topk) | `topk` |
+| [`benchmarks/models/deepseekv3`](benchmarks/models/deepseekv3) | `concat`, `expand`, `gate`, `mask`, `mla`, `mlp`, `moe`<br>`permute`, `projection`, `rmsnorm`, `rope`, `split`, `topk`, `transformer` |
+| [`benchmarks/microbench/cube`](benchmarks/microbench/cube) | `matop` |
+| [`benchmarks/microbench/lmbench`](benchmarks/microbench/lmbench) | `mem` |
+| [`benchmarks/microbench/vec`](benchmarks/microbench/vec) | `lat_bw` |
+
+## Building Benchmarks
+
+Most benchmark directories include [`benchmarks/common/Makefile.common`](benchmarks/common/Makefile.common). The common harness is controlled mainly by `TESTCASE`, `PLAT`, `COMPILER_DIR`, and `QEMU`.
 
 ```sh
-cd test/tileop_api
+cd benchmarks/api/tileop
 make clean
 make TESTCASE=TAdd PLAT=cpu COMPILER_DIR=/usr/bin
 make TESTCASE=TAdd PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin
-make TESTCASE=TAdd PLAT=linx QEMU=/path/to/qemu-linx sim
+make TESTCASE=TAdd PLAT=linx QEMU=/path/to/supertest-compatible-qemu sim
 ```
 
 | Variable | Meaning |
 | --- | --- |
-| `TESTCASE` | Source basename to build, for example `TAdd` in `test/tileop_api/src/TAdd.cpp`. |
+| `TESTCASE` | Source basename or suite-local case name, for example `TAdd` in `benchmarks/api/tileop/src/TAdd.cpp`. |
 | `PLAT=cpu` | Builds against the CPU simulation backend and defines `__cpu_sim__`. |
 | `PLAT=linx` | Builds for the Linx target and defines `__linx`. |
 | `PLAT=arm_sme` | Builds Arm SME-oriented cases and defines `__ARM_FEATURE_SME`. |
-| `COMPILER_DIR` | Directory containing `clang`, `clang++`, `llvm-objdump`, and related tools. |
-| `QEMU` | Simulator binary used by `make sim` for Linx-targeted test execution. |
+| `COMPILER_DIR` | Directory containing `clang`, `clang++`, `llvm-objdump`, `llvm-objcopy`, and related tools. |
+| `QEMU` | Simulator binary used by `make sim` for Linx-targeted execution. |
 
 Common targets:
 
@@ -88,63 +156,48 @@ make clean
 make clean_all
 ```
 
-Generated objects, executables, disassembly, and logs are written below
-`output/`.
+Generated objects, executables, disassembly, and logs are written below `output/`.
 
 ## Batch Suites
 
-Many suites provide a local `compile.all` file. Run these from the suite
-directory so relative paths and local make variables resolve as intended.
+Many suites provide a local `compile*.all` file. Run these from the suite directory so relative paths and local make variables resolve as intended.
 
 Examples:
 
 ```sh
-cd test/tileop_api && bash compile.all
-cd test/py_api && bash compile.all
-cd test/accelerator/vec_simt && bash compile.all
-cd test/kernel/gemm/matmul && bash compile.all
+cd benchmarks/api/tileop && bash compile.all
+cd benchmarks/npu/vec_simt && bash compile.all
+cd benchmarks/kernels/gemm/matmul && bash compile.all
+cd benchmarks/models/deepseekv3 && bash compile.all
 ```
 
-For recursive compile/run automation, see
-[`test/script/README.md`](test/script/README.md).
+For recursive compile/run automation, see [`benchmarks/scripts/recursive/README.md`](benchmarks/scripts/recursive/README.md).
 
-## Python API And Golden Comparison
+## Header Installation
 
-The Python API flow builds a shared object and compares selected cases against
-Python golden logic.
+The top-level `Makefile` installs the TileOP header tree into the active Clang resource directory under `include/tileop-api`.
 
 ```sh
-cd test/py_api
-make clean
-make TESTCASE=tileop_py
-python3 golden_cmp/golden_cmp.py -i tadd
+make -n CLANG_PREFIX=/usr
+make install CLANG_PREFIX=/path/to/clang/prefix
+make uninstall CLANG_PREFIX=/path/to/clang/prefix
 ```
 
-For adding new golden cases, see
-[`test/py_api/golden_cmp/README.md`](test/py_api/golden_cmp/README.md).
+`CLANG_PREFIX` should point to a prefix containing `bin/clang`. The dry run is useful because the install location is derived from `clang -print-resource-dir`.
 
 ## Adding Work
 
-Use the existing directory shape when adding code:
+Put new code in the directory that owns that kind of material:
 
 1. Add API-facing definitions or declarations in [`include/common`](include/common).
-2. Add backend behavior in the relevant backend directory, usually
-   [`include/cpu_sim`](include/cpu_sim), [`include/jcore`](include/jcore), or
-   [`include/aarch64`](include/aarch64).
-3. Add reusable compute kernels under the matching [`kernels`](kernels)
-   domain.
-4. Add focused tests under [`test/tileop_api`](test/tileop_api) or the
-   relevant suite under [`test/kernel`](test/kernel),
-   [`test/accelerator`](test/accelerator), or [`test/other`](test/other).
-5. Add the case to the local `compile.all` file when it should be part of the
-   batch suite.
-
-New make-driven test directories should keep the local `Makefile` small and
-include [`test/common/Makefile.common`](test/common/Makefile.common) for shared
-platform flags, output paths, simulator targets, and cleanup behavior.
+2. Add backend behavior in [`include/cpu_sim`](include/cpu_sim), [`include/jcore`](include/jcore), [`include/aarch64`](include/aarch64), or the relevant support include directory.
+3. Add reusable compute kernels under the matching [`kernels`](kernels) domain.
+4. Add Linx-buildable benchmark entrypoints under the matching [`benchmarks`](benchmarks) category.
+5. Add non-benchmark correctness material under [`tests`](tests).
+6. Move superseded or duplicate material to [`archive/outdated`](archive/outdated) with a replacement note instead of deleting it.
+
+New make-driven benchmark directories should keep the local `Makefile` small and include [`benchmarks/common/Makefile.common`](benchmarks/common/Makefile.common) for shared platform flags, output paths, simulator targets, and cleanup behavior.
 
 ## Generated Files
 
-Do not commit generated files from `output/`, object files, executable test
-artifacts, local logs, or disassembly files. Keep source changes in `include/`,
-`kernels/`, `models/`, and `test/`.
+Do not commit generated files from `output/`, object files, executable artifacts, local logs, or ad hoc disassembly files outside [`samples`](samples). Keep source changes in `include/`, `kernels/`, `models/`, `benchmarks/`, `tests/`, `samples/`, `compiler/`, and `archive/outdated/`.
diff --git a/archive/outdated/README.md b/archive/outdated/README.md
new file mode 100644
index 0000000..3aa7e3b
--- /dev/null
+++ b/archive/outdated/README.md
@@ -0,0 +1,12 @@
+# Outdated Archive
+
+This directory preserves superseded or unusable material that should not be the default benchmark navigation surface. Nothing here was deleted; the table below records why each item moved and where active work should happen instead.
+
+| Archived path | Rationale | Replacement |
+| --- | --- | --- |
+| [`tests/other/tileop_api`](tests/other/tileop_api) | Legacy duplicate TileOP API surface with committed generated logs under `script/checknum_true`. | [`../../benchmarks/api/tileop`](../../benchmarks/api/tileop) |
+| [`tests/other/py_api`](tests/other/py_api) | Older Python API duplicate. Active Python correctness material is kept outside the benchmark tree. | [`../../tests/py_api`](../../tests/py_api) |
+| [`tests/accelerator/v220`](tests/accelerator/v220) | Superseded legacy NPU validation surface, not part of the active Linx benchmark catalog. | [`../../benchmarks/npu`](../../benchmarks/npu) |
+| [`tests/accelerator/v310`](tests/accelerator/v310) | Superseded legacy NPU validation surface, not part of the active Linx benchmark catalog. | [`../../benchmarks/npu`](../../benchmarks/npu) |
+
+Archive files may retain historical path references because they document the old layout. Do not add new active benchmark cases here.
diff --git a/test/accelerator/v220/src/common/data.hpp b/archive/outdated/tests/accelerator/v220/src/common/data.hpp
similarity index 100%
rename from test/accelerator/v220/src/common/data.hpp
rename to archive/outdated/tests/accelerator/v220/src/common/data.hpp
diff --git a/test/accelerator/v220/src/st/st1.cpp b/archive/outdated/tests/accelerator/v220/src/st/st1.cpp
similarity index 100%
rename from test/accelerator/v220/src/st/st1.cpp
rename to archive/outdated/tests/accelerator/v220/src/st/st1.cpp
diff --git a/test/accelerator/v220/src/ut/TAdd.cpp b/archive/outdated/tests/accelerator/v220/src/ut/TAdd.cpp
similarity index 100%
rename from test/accelerator/v220/src/ut/TAdd.cpp
rename to archive/outdated/tests/accelerator/v220/src/ut/TAdd.cpp
diff --git a/test/accelerator/v310/common/data.hpp b/archive/outdated/tests/accelerator/v310/common/data.hpp
similarity index 100%
rename from test/accelerator/v310/common/data.hpp
rename to archive/outdated/tests/accelerator/v310/common/data.hpp
diff --git a/test/accelerator/v310/st/st1.cpp b/archive/outdated/tests/accelerator/v310/st/st1.cpp
similarity index 100%
rename from test/accelerator/v310/st/st1.cpp
rename to archive/outdated/tests/accelerator/v310/st/st1.cpp
diff --git a/test/accelerator/v310/ut/TAdd.cpp b/archive/outdated/tests/accelerator/v310/ut/TAdd.cpp
similarity index 100%
rename from test/accelerator/v310/ut/TAdd.cpp
rename to archive/outdated/tests/accelerator/v310/ut/TAdd.cpp
diff --git a/test/other/py_api/Makefile b/archive/outdated/tests/other/py_api/Makefile
similarity index 100%
rename from test/other/py_api/Makefile
rename to archive/outdated/tests/other/py_api/Makefile
diff --git a/test/other/py_api/compile.all b/archive/outdated/tests/other/py_api/compile.all
similarity index 100%
rename from test/other/py_api/compile.all
rename to archive/outdated/tests/other/py_api/compile.all
diff --git a/test/other/py_api/golden_cmp/README.md b/archive/outdated/tests/other/py_api/golden_cmp/README.md
similarity index 94%
rename from test/other/py_api/golden_cmp/README.md
rename to archive/outdated/tests/other/py_api/golden_cmp/README.md
index b1aef28..433fe7f 100644
--- a/test/other/py_api/golden_cmp/README.md
+++ b/archive/outdated/tests/other/py_api/golden_cmp/README.md
@@ -7,12 +7,12 @@
  · 文件路径：JanusCoreBench/test/golden_cmp/py_api/src/
 
  · 操作说明：
-   
+
    1. 如果是添加一个新的运算方式（如 texp），则需要新建一个 HPP 文件。
    2. 如果是同一运算方式的不同属性（如不同的矩阵尺寸或 tile 大小），则直接在对应的 HPP 文件中添加。
 
  · 标准函数格式：
-   
+
    · 文件头需要包含必要的头文件。
    · 声明变量和函数名称时，需注意命名规范。
    · 函数声明后，需将函数与模块绑定（m.def）。
@@ -47,9 +47,9 @@ void texp1_py(py::array_t<float> dst_py, py::array_t<float> src_py) {
 
             tile_shape d0, d1;
 
-            TCOPYIN(d0, s0);
+            TLOAD(d0, s0);
             TEXP(d1, d0);
-            TCOPYOUT(res, d1);
+            TSTORE(res, d1);
         }
     }
 }
@@ -72,14 +72,14 @@ void bind_texp(py::module_& m) {
 步骤说明：
 
  1. 添加文件头：
-    
+
     · 在文件开头添加包含新 HPP 文件的头文件路径。
     ```
     #include "src/texp1.hpp"
     ```
 
  2. 添加绑定内容：
-    
+
     · 在模块中绑定新函数。
     ```
     py::module_ _api = m.def_submodule("_api", "API module");
@@ -93,7 +93,7 @@ void bind_texp(py::module_& m) {
 步骤说明：
 
  1. 在 cases 中添加新测试用例：
-    
+
     · 按照以下格式添加新函数的属性。
         ```
         {
@@ -109,7 +109,7 @@ void bind_texp(py::module_& m) {
     · **output_shapes**：输出矩阵的形状。
 
  2. 在 op_map 中添加操作映射：
-    
+
     · 按照以下格式添加新操作的映射。
     ```
     "texp": [
@@ -129,10 +129,10 @@ void bind_texp(py::module_& m) {
 
 在 /JanusCoreBench/test 路径下，执行以下命令：
 ```
-make clean  
-make TESTCASE=tileop_py PLAT=cpu PY_LIB=on  
-python3 golden_cmp/golden_cmp.py -i tadd1  
+make clean
+make TESTCASE=tileop_py PLAT=cpu PY_LIB=on
+python3 golden_cmp/golden_cmp.py -i tadd1
 ```
-其中，PLAT 和 PY_LIB 的值可以根据需要进行修改。具体的可选项可参考 common 文件夹下的 Makefile 。其中 -i 后面跟着的是函数的名称，具体的函数名可以参考 config.json 文件中的内容。  
+其中，PLAT 和 PY_LIB 的值可以根据需要进行修改。具体的可选项可参考 common 文件夹下的 Makefile 。其中 -i 后面跟着的是函数的名称，具体的函数名可以参考 config.json 文件中的内容。
 之后print出的对比结果中，在最后两行会显示loss（误差）以及是否pass or fail
 
diff --git a/test/other/py_api/golden_cmp/config.json b/archive/outdated/tests/other/py_api/golden_cmp/config.json
similarity index 100%
rename from test/other/py_api/golden_cmp/config.json
rename to archive/outdated/tests/other/py_api/golden_cmp/config.json
diff --git a/test/other/py_api/golden_cmp/golden_cmp.py b/archive/outdated/tests/other/py_api/golden_cmp/golden_cmp.py
similarity index 100%
rename from test/other/py_api/golden_cmp/golden_cmp.py
rename to archive/outdated/tests/other/py_api/golden_cmp/golden_cmp.py
diff --git a/test/other/py_api/golden_cmp/ref_func_lib.py b/archive/outdated/tests/other/py_api/golden_cmp/ref_func_lib.py
similarity index 100%
rename from test/other/py_api/golden_cmp/ref_func_lib.py
rename to archive/outdated/tests/other/py_api/golden_cmp/ref_func_lib.py
diff --git a/test/other/py_api/golden_cmp/test.sh b/archive/outdated/tests/other/py_api/golden_cmp/test.sh
similarity index 100%
rename from test/other/py_api/golden_cmp/test.sh
rename to archive/outdated/tests/other/py_api/golden_cmp/test.sh
diff --git a/test/other/py_api/src/flash_attention_py.hpp b/archive/outdated/tests/other/py_api/src/flash_attention_py.hpp
similarity index 100%
rename from test/other/py_api/src/flash_attention_py.hpp
rename to archive/outdated/tests/other/py_api/src/flash_attention_py.hpp
diff --git a/test/other/py_api/src/matmul_py.hpp b/archive/outdated/tests/other/py_api/src/matmul_py.hpp
similarity index 100%
rename from test/other/py_api/src/matmul_py.hpp
rename to archive/outdated/tests/other/py_api/src/matmul_py.hpp
diff --git a/test/other/py_api/src/softmax_py.hpp b/archive/outdated/tests/other/py_api/src/softmax_py.hpp
similarity index 100%
rename from test/other/py_api/src/softmax_py.hpp
rename to archive/outdated/tests/other/py_api/src/softmax_py.hpp
diff --git a/test/other/py_api/src/tadd.hpp b/archive/outdated/tests/other/py_api/src/tadd.hpp
similarity index 89%
rename from test/other/py_api/src/tadd.hpp
rename to archive/outdated/tests/other/py_api/src/tadd.hpp
index 68fdc19..a15d36e 100644
--- a/test/other/py_api/src/tadd.hpp
+++ b/archive/outdated/tests/other/py_api/src/tadd.hpp
@@ -18,18 +18,18 @@ void tadd_py(float* dst, float* src0, float* src1){
             int offset = i * (tile_row * gm_col) + j * tile_col;
             gm_shape s0(src0 + offset);
             gm_shape s1(src1 + offset);
-            gm_shape res(dst + offset);  
+            gm_shape res(dst + offset);
 
             tile_shape d0, d1, d2;
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TADD(d2, d0, d1);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
 
-#ifdef __cpu_sim__ 
+#ifdef __cpu_sim__
     void bind_tadd(py::module_& m) {
         m.def("tadd", [](py::array_t<float> dst_py, py::array_t<float> src0_py, py::array_t<float> src1_py){
             float* dst = static_cast<float*>(dst_py.request().ptr);
diff --git a/test/other/py_api/src/texp.hpp b/archive/outdated/tests/other/py_api/src/texp.hpp
similarity index 95%
rename from test/other/py_api/src/texp.hpp
rename to archive/outdated/tests/other/py_api/src/texp.hpp
index 9c1ab3b..5564bba 100644
--- a/test/other/py_api/src/texp.hpp
+++ b/archive/outdated/tests/other/py_api/src/texp.hpp
@@ -23,9 +23,9 @@ void texp_py(float* dst, float* src) {
 
             tile_shape d0, d1;
 
-            TCOPYIN(d0, s0);
+            TLOAD(d0, s0);
             TEXP(d1, d0);
-            TCOPYOUT(res, d1);
+            TSTORE(res, d1);
         }
     }
 }
diff --git a/test/other/py_api/src/tileop_py.cpp b/archive/outdated/tests/other/py_api/src/tileop_py.cpp
similarity index 100%
rename from test/other/py_api/src/tileop_py.cpp
rename to archive/outdated/tests/other/py_api/src/tileop_py.cpp
diff --git a/test/other/py_api/src/tmax.hpp b/archive/outdated/tests/other/py_api/src/tmax.hpp
similarity index 93%
rename from test/other/py_api/src/tmax.hpp
rename to archive/outdated/tests/other/py_api/src/tmax.hpp
index f144668..bdadeff 100644
--- a/test/other/py_api/src/tmax.hpp
+++ b/archive/outdated/tests/other/py_api/src/tmax.hpp
@@ -23,10 +23,10 @@ void tmax_py(float* dst, float* src0, float* src1){
 
             tile_shape d0, d1, d2;
 
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TMAX(d2, d1, d0);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
diff --git a/test/other/py_api/src/tsub.hpp b/archive/outdated/tests/other/py_api/src/tsub.hpp
similarity index 89%
rename from test/other/py_api/src/tsub.hpp
rename to archive/outdated/tests/other/py_api/src/tsub.hpp
index 7f05ea1..eabf0f6 100644
--- a/test/other/py_api/src/tsub.hpp
+++ b/archive/outdated/tests/other/py_api/src/tsub.hpp
@@ -18,18 +18,18 @@ void tsub_py(float* dst, float* src0, float* src1) {
             int offset = i * (tile_row * gm_col) + j * tile_col;
             gm_shape s0(src0 + offset);
             gm_shape s1(src1 + offset);
-            gm_shape res(dst + offset);  
+            gm_shape res(dst + offset);
 
             tile_shape d0, d1, d2;
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TSUB(d2, d0, d1);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
 
-#ifdef __cpu_sim__ 
+#ifdef __cpu_sim__
     void bind_tsub(py::module_& m) {
         m.def("tsub", [](py::array_t<float> dst_py, py::array_t<float> src0_py, py::array_t<float> src1_py){
             float* dst = static_cast<float*>(dst_py.request().ptr);
diff --git a/test/other/tileop_api/Makefile b/archive/outdated/tests/other/tileop_api/Makefile
similarity index 100%
rename from test/other/tileop_api/Makefile
rename to archive/outdated/tests/other/tileop_api/Makefile
diff --git a/test/other/tileop_api/compile.all b/archive/outdated/tests/other/tileop_api/compile.all
similarity index 88%
rename from test/other/tileop_api/compile.all
rename to archive/outdated/tests/other/tileop_api/compile.all
index 116d3be..4ff38ab 100755
--- a/test/other/tileop_api/compile.all
+++ b/archive/outdated/tests/other/tileop_api/compile.all
@@ -8,13 +8,12 @@ make TESTCASE=TAdd_mask
 make TESTCASE=TAdd
 make TESTCASE=TAdds
 make TESTCASE=TCopy
-make TESTCASE=TCopyIn
-make TESTCASE=TCopyOut
+make TESTCASE=TLoad
+make TESTCASE=TStore
 make TESTCASE=TCvt
 make TESTCASE=TDiv
 make TESTCASE=TDivs
 make TESTCASE=test_MatMacc
-make TESTCASE=test_matmul
 make TESTCASE=test_MatMul
 make TESTCASE=TExp
 make TESTCASE=TExpandCol
@@ -33,4 +32,4 @@ make TESTCASE=TRowSumExpand
 make TESTCASE=TSqrt
 make TESTCASE=TSub
 make TESTCASE=TSubs
-make TESTCASE=TTrans
\ No newline at end of file
+make TESTCASE=TTrans
diff --git a/test/tileop_api/data.hpp b/archive/outdated/tests/other/tileop_api/data.hpp
similarity index 90%
rename from test/tileop_api/data.hpp
rename to archive/outdated/tests/other/tileop_api/data.hpp
index c595f25..afca271 100644
--- a/test/tileop_api/data.hpp
+++ b/archive/outdated/tests/other/tileop_api/data.hpp
@@ -1,16 +1,34 @@
 #ifndef DATA_H
 #define DATA_H
 
+#ifdef __linx
+#include <stddef.h>
+#include <stdint.h>
+extern "C" void exit(int);
+extern "C" void free(void *);
+extern "C" void *malloc(size_t);
+extern "C" int printf(const char *, ...);
+#else
 #include <iostream>
 #include <cmath>
+#endif
 #include "common/type.hpp"
 
+#ifdef __linx
+static constexpr float s_fp32 = 0.1f;
+static constexpr __half s_fp16 = __half(0.0f);
+static constexpr int8_t s_i8 = 1;
+static constexpr int16_t s_i16 = 1;
+static constexpr int32_t s_i32 = 1;
+static constexpr int64_t s_i64 = 1;
+#else
 float s_fp32 = 0.1;
 __half s_fp16 = 0.1;
 int8_t s_i8 = 1;
 int16_t s_i16 = 1;
 int32_t s_i32 = 1;
 int64_t s_i64 = 1;
+#endif
 
 template <typename T> void init_src_uint(T *aar, uint16_t size) {
   for (uint16_t i = 0; i < size; i++) {
@@ -36,7 +54,13 @@ void init_src_int8(int8_t *aar, uint16_t size) {
 
 template <typename T> void init_src_fp(T *aar, uint16_t size) {
   for (uint16_t i = 0; i < size; i++) {
+#ifdef __linx
+    const float x = (i + 1) / 100.0f;
+    const float x2 = x * x;
+    aar[i] = x * (1.0f - x2 / 6.0f + (x2 * x2) / 120.0f);
+#else
     aar[i] = sin((i + 1) / 100.0f);
+#endif
   }
 }
 
@@ -81,22 +105,37 @@ template <typename T> void init_rows_fp(T *aar, uint16_t row, uint16_t col) {
 }
 
 template <typename T> void OutArray(const T *aar, size_t size) {
+#ifdef __linx
+  (void)aar;
+  (void)size;
+#else
   for (uint16_t i = 0; i < size; i++) {
     std::cout << aar[i] << " ";
   }
   std::cout << std::endl;
+#endif
 }
 void OutArray(const int8_t *aar, size_t size) {
+#ifdef __linx
+  (void)aar;
+  (void)size;
+#else
   for (uint16_t i = 0; i < size; i++) {
     std::cout << static_cast<int32_t>(aar[i]) << " ";
   }
   std::cout << std::endl;
+#endif
 }
 void OutArray(const __half *aar, size_t size) {
+#ifdef __linx
+  (void)aar;
+  (void)size;
+#else
   for (uint16_t i = 0; i < size; i++) {
     std::cout << static_cast<__fp16>(aar[i]) << " ";
   }
   std::cout << std::endl;
+#endif
 }
 
 // check memory allocation
@@ -181,4 +220,4 @@ template <typename T> void check_mem_alloc(const T *p) {
   free(d2);                                                                    \
   free(d3);
 
-#endif
\ No newline at end of file
+#endif
diff --git a/test/other/tileop_api/script/README.md b/archive/outdated/tests/other/tileop_api/script/README.md
similarity index 100%
rename from test/other/tileop_api/script/README.md
rename to archive/outdated/tests/other/tileop_api/script/README.md
diff --git a/test/other/tileop_api/script/checknum_true/MatMacc.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/MatMacc.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/MatMacc.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/MatMacc.log
diff --git a/test/other/tileop_api/script/checknum_true/MatMul.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/MatMul.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/MatMul.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/MatMul.log
diff --git a/test/other/tileop_api/script/checknum_true/TAbs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAbs.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TAbs.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAbs.log
diff --git a/test/other/tileop_api/script/checknum_true/TAdd.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAdd.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TAdd.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAdd.log
diff --git a/test/other/tileop_api/script/checknum_true/TAdds.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAdds.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TAdds.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAdds.log
diff --git a/test/other/tileop_api/script/checknum_true/TAssemble.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TAssemble.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TAssemble.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TAssemble.log
diff --git a/test/other/tileop_api/script/checknum_true/TCopy.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TCopy.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TCopy.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TCopy.log
diff --git a/test/other/tileop_api/script/checknum_true/TCvt.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TCvt.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TCvt.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TCvt.log
diff --git a/test/other/tileop_api/script/checknum_true/TDiv.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TDiv.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TDiv.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TDiv.log
diff --git a/test/other/tileop_api/script/checknum_true/TDivs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TDivs.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TDivs.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TDivs.log
diff --git a/test/other/tileop_api/script/checknum_true/TExp.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExp.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TExp.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExp.log
diff --git a/test/other/tileop_api/script/checknum_true/TExpandCol.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandCol.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TExpandCol.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandCol.log
diff --git a/test/other/tileop_api/script/checknum_true/TExpandRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandRow.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TExpandRow.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandRow.log
diff --git a/test/other/tileop_api/script/checknum_true/TExpandScalar.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandScalar.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TExpandScalar.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExpandScalar.log
diff --git a/test/other/tileop_api/script/checknum_true/TExtract.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TExtract.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TExtract.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TExtract.log
diff --git a/test/other/tileop_api/script/checknum_true/TGatherElementCol.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementCol.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TGatherElementCol.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementCol.log
diff --git a/test/other/tileop_api/script/checknum_true/TGatherElementRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementRow.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TGatherElementRow.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherElementRow.log
diff --git a/test/other/tileop_api/script/checknum_true/TGatherRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherRow.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TGatherRow.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TGatherRow.log
diff --git a/test/other/tileop_api/script/checknum_true/TCopyIn.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TLoad.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TCopyIn.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TLoad.log
diff --git a/test/other/tileop_api/script/checknum_true/TMax.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMax.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TMax.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMax.log
diff --git a/test/other/tileop_api/script/checknum_true/TMaxs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMaxs.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TMaxs.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMaxs.log
diff --git a/test/other/tileop_api/script/checknum_true/TMin.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMin.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TMin.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMin.log
diff --git a/test/other/tileop_api/script/checknum_true/TMins.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMins.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TMins.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMins.log
diff --git a/test/other/tileop_api/script/checknum_true/TMul.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMul.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TMul.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMul.log
diff --git a/test/other/tileop_api/script/checknum_true/TMuls.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TMuls.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TMuls.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TMuls.log
diff --git a/test/other/tileop_api/script/checknum_true/TRSqrt.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRSqrt.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TRSqrt.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRSqrt.log
diff --git a/test/other/tileop_api/script/checknum_true/TRecip.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRecip.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TRecip.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRecip.log
diff --git a/test/other/tileop_api/script/checknum_true/TReshape.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TReshape.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TReshape.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TReshape.log
diff --git a/test/other/tileop_api/script/checknum_true/TRowMax.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMax.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TRowMax.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMax.log
diff --git a/test/other/tileop_api/script/checknum_true/TRowMaxExpand.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMaxExpand.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TRowMaxExpand.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowMaxExpand.log
diff --git a/test/other/tileop_api/script/checknum_true/TRowSum.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSum.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TRowSum.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSum.log
diff --git a/test/other/tileop_api/script/checknum_true/TRowSumExpand.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSumExpand.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TRowSumExpand.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TRowSumExpand.log
diff --git a/test/other/tileop_api/script/checknum_true/TScatterElementCol.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementCol.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TScatterElementCol.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementCol.log
diff --git a/test/other/tileop_api/script/checknum_true/TScatterElementRow.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementRow.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TScatterElementRow.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TScatterElementRow.log
diff --git a/test/other/tileop_api/script/checknum_true/TSelect.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSelect.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TSelect.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSelect.log
diff --git a/test/other/tileop_api/script/checknum_true/TSqrt.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSqrt.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TSqrt.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSqrt.log
diff --git a/test/other/tileop_api/script/checknum_true/TCopyOut.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TStore.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TCopyOut.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TStore.log
diff --git a/test/other/tileop_api/script/checknum_true/TSub.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSub.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TSub.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSub.log
diff --git a/test/other/tileop_api/script/checknum_true/TSubs.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TSubs.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TSubs.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TSubs.log
diff --git a/test/other/tileop_api/script/checknum_true/TTrans.log b/archive/outdated/tests/other/tileop_api/script/checknum_true/TTrans.log
similarity index 100%
rename from test/other/tileop_api/script/checknum_true/TTrans.log
rename to archive/outdated/tests/other/tileop_api/script/checknum_true/TTrans.log
diff --git a/test/other/tileop_api/script/get_checknum.py b/archive/outdated/tests/other/tileop_api/script/get_checknum.py
similarity index 100%
rename from test/other/tileop_api/script/get_checknum.py
rename to archive/outdated/tests/other/tileop_api/script/get_checknum.py
diff --git a/test/other/tileop_api/script/test.py b/archive/outdated/tests/other/tileop_api/script/test.py
similarity index 100%
rename from test/other/tileop_api/script/test.py
rename to archive/outdated/tests/other/tileop_api/script/test.py
diff --git a/test/tileop_api/src/MatMacc.cpp b/archive/outdated/tests/other/tileop_api/src/MatMacc.cpp
similarity index 62%
rename from test/tileop_api/src/MatMacc.cpp
rename to archive/outdated/tests/other/tileop_api/src/MatMacc.cpp
index b0643fe..2c665a2 100644
--- a/test/tileop_api/src/MatMacc.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/MatMacc.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1)
+    __asm__ volatile("" ::: "memory");
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
@@ -23,11 +74,11 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, res);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, res);
   MATMACC(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
@@ -48,21 +99,70 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, res);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, res);
   MATMACC(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+#else
   const uint16_t M = 16;
   const uint16_t K = 8;
   const uint16_t N = 32;
+#endif
+
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+#ifdef __linx
+  static int64_t dst_rm[size_C];
+  static int64_t src0_rm[size_A];
+  static int64_t src1_rm[size_B];
+  static int64_t base_rm[size_C];
 
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t k = 0; k < K; ++k) {
+      const int64_t value = static_cast<int64_t>((row + 1) * (k + 2));
+      src0_rm[row * K + k] = value;
+    }
+  }
+  for (size_t k = 0; k < K; ++k) {
+    for (size_t col = 0; col < N; ++col) {
+      const int64_t value = static_cast<int64_t>((k + 1) + (col + 1));
+      src1_rm[k * N + col] = value;
+    }
+  }
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      const int64_t value = static_cast<int64_t>(10 + row * N + col);
+      dst_rm[row * N + col] = value;
+      base_rm[row * N + col] = value;
+    }
+  }
+
+  test_RowMajor<M, N, K, int64_t>(dst_rm, src0_rm, src1_rm);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = base_rm[row * N + col];
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_rm[row * K + k] * src1_rm[k * N + col];
+      }
+      if (dst_rm[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
 
   float *dst = (float *)malloc(size_C * sizeof(float));
   check_mem_alloc(dst);
@@ -77,7 +177,7 @@ int main() {
 
   __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half));
   check_mem_alloc(dst_f16);
-  init_dst_no_zero(dst_f16, size_C); 
+  init_dst_no_zero(dst_f16, size_C);
 
   __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half));
   check_mem_alloc(src0_f16);
@@ -85,44 +185,44 @@ int main() {
   __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, size_B);
- 
+
   int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst_no_zero(dst_i8, size_C);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, size_A);
   int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, size_B);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst_no_zero(dst_i16, size_C);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, size_A);
   int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, size_B);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst_no_zero(dst_i32, size_C);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, size_A);
   int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, size_B);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst_no_zero(dst_i64, size_C);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, size_A);
@@ -135,7 +235,7 @@ int main() {
 #endif
 
   //test_RowMajor<M, N, K, float>(dst, src0, src1);
- 
+
   //test_RowMajor<M, N, K, __half>(dst_f16, src0_f16, src1_f16);
 
   //test_RowMajor<M, N, K, int8_t>(dst_i8, src0_i8, src1_i8);
@@ -157,30 +257,31 @@ int main() {
   OutArray(dst_i16, size_C);
   OutArray(dst_i32, size_C);
   OutArray(dst_i64, size_C);
- 
+
   free(dst);
   free(src0);
   free(src1);
- 
+
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/MatMul.cpp b/archive/outdated/tests/other/tileop_api/src/MatMul.cpp
similarity index 70%
rename from test/tileop_api/src/MatMul.cpp
rename to archive/outdated/tests/other/tileop_api/src/MatMul.cpp
index 1942e97..99d8a33 100644
--- a/test/tileop_api/src/MatMul.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/MatMul.cpp
@@ -5,6 +5,48 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  auto *d = static_cast<unsigned char *>(dst);
+  const auto *s = static_cast<const unsigned char *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+    __asm__ volatile("" ::: "memory");
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
@@ -23,10 +65,10 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   MATMUL(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
@@ -47,13 +89,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   MATMUL(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  init_dst(dst_i64, size_C);
+  init_src_int(src0_i64, size_A);
+  init_src_int(src1_i64, size_B);
+
+  test_RowMajor<M, N, K, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
   const uint16_t M = 16;
   const uint16_t K = 32;
   const uint16_t N = 32;
@@ -75,7 +149,7 @@ int main() {
 
   __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half));
   check_mem_alloc(dst_f16);
-  init_dst(dst_f16, size_C); 
+  init_dst(dst_f16, size_C);
 
   __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half));
   check_mem_alloc(src0_f16);
@@ -83,44 +157,44 @@ int main() {
   __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, size_B);
- 
+
   int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, size_C);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, size_A);
   int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, size_B);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, size_C);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, size_A);
   int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, size_B);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, size_C);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, size_A);
   int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, size_B);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, size_C);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, size_A);
@@ -133,7 +207,7 @@ int main() {
 #endif
 
   test_RowMajor<M, N, K, float>(dst, src0, src1);
- 
+
   test_RowMajor<M, N, K, __half>(dst_f16, src0_f16, src1_f16);
 
   test_RowMajor<M, N, K, int8_t>(dst_i8, src0_i8, src1_i8);
@@ -155,30 +229,31 @@ int main() {
   OutArray(dst_i16, size_C);
   OutArray(dst_i32, size_C);
   OutArray(dst_i64, size_C);
- 
+
   free(dst);
   free(src0);
   free(src1);
- 
+
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/archive/outdated/tests/other/tileop_api/src/MatMul_e4m3.cpp b/archive/outdated/tests/other/tileop_api/src/MatMul_e4m3.cpp
new file mode 100644
index 0000000..d9870b0
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/MatMul_e4m3.cpp
@@ -0,0 +1,188 @@
+#include <common/pto_tileop.hpp>
+#include "../data.hpp"
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  auto *d = static_cast<unsigned char *>(dst);
+  const auto *s = static_cast<const unsigned char *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+    __asm__ volatile("" ::: "memory");
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(int64_t *dst, int64_t *src0, int64_t *src1) {
+  using gm_shape_A = global_tensor<int64_t, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<int64_t, RowMajor<K, N>>;
+  using gm_shape_C = global_tensor<int64_t, RowMajor<M, N>>;
+
+  using tile_shape_A = Tile<Location::Vec, int64_t, M, K, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, int64_t, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, int64_t, M, N, BLayout::RowMajor>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  TSTORE(res, d2);
+}
+#else
+template <typename TA, typename TB>
+void __vec__ test_cvt(typename TA::TileDType __out__ a,
+                      typename TB::TileDType __in__ b) {
+  using AType = typename TA::DType;
+  using BType = typename TB::DType;
+  BType *pb = blkv_get_tile_ptr(b);
+  AType *pa = blkv_get_tile_ptr(a);
+  int x = blkv_get_index_x();
+  int y = blkv_get_index_y();
+  int idx = index<TA>(y, x);
+  AType o = (AType)(pb[idx]);
+  pa[idx] = o;
+}
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(float *dst, float *src0, float *src1) {
+  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
+  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
+
+  using tile_shape_A = TileLeft<float, M, K>;
+  using tile_shape_B = TileRight<float, K, N>;
+  using tile_shape_C = TileAcc<float, M, N>;
+  using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>;
+  using tile_shape_LB = TileRight<__fp8_e4m3, K, N>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+  tile_shape_LA lda;
+  tile_shape_LB ldb;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  test_cvt<tile_shape_LA, tile_shape_A><<<M, K, 1>>>(lda.data(), d0.data());
+  test_cvt<tile_shape_LB, tile_shape_B><<<K, N, 1>>>(ldb.data(), d1.data());
+  MATMUL(d2, lda, ldb);
+  TSTORE(res, d2);
+}
+#endif
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  init_dst(dst_i64, size_C);
+  init_src_int(src0_i64, size_A);
+  init_src_int(src1_i64, size_B);
+
+  test<M, N, K>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 64;
+  const uint16_t K = 32;
+  const uint16_t N = 64;
+
+  size_t size_A = M * K;
+  size_t size_B = K * N;
+  size_t size_C = M * N;
+
+  float *dst = (float *)malloc(size_C * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_C);
+
+  float *src0 = (float *)malloc(size_A * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, size_A);
+  float *src1 = (float *)malloc(size_B * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_B);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<M, N, K>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_C);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TAbs.cpp b/archive/outdated/tests/other/tileop_api/src/TAbs.cpp
similarity index 62%
rename from test/tileop_api/src/TAbs.cpp
rename to archive/outdated/tests/other/tileop_api/src/TAbs.cpp
index dc9f3f6..e580599 100644
--- a/test/tileop_api/src/TAbs.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TAbs.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -20,21 +52,21 @@ void test_RowMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TABS(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -44,24 +76,42 @@ void test_ColMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TABS(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 16;
-  const uint16_t tile_col = 16;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 16;
+  constexpr uint16_t tile_col = 16;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
 
+  return 0;
+#else
   float *dst_col = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst_col);
   init_dst(dst_col, gm_size);
@@ -73,7 +123,7 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
@@ -96,9 +146,10 @@ int main() {
 
   free(dst_col);
   free(src0_col);
- 
+
   free(dst_f16);
   free(src0_f16);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TAdd.cpp b/archive/outdated/tests/other/tileop_api/src/TAdd.cpp
similarity index 78%
rename from test/tileop_api/src/TAdd.cpp
rename to archive/outdated/tests/other/tileop_api/src/TAdd.cpp
index 250f76e..7fc20ed 100644
--- a/test/tileop_api/src/TAdd.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TAdd.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TADD(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TADD(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
   const uint16_t gm_row = 64;
   const uint16_t gm_col = 32;
   const uint16_t tile_row = 64;
   const uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
+#ifdef __linx
+  static int64_t dst_i64[gm_size];
+  static int64_t src0_i64[gm_size];
+  static int64_t src1_i64[gm_size];
+  init_dst(dst_i64, gm_size);
+  init_src_int(src0_i64, gm_size);
+  init_src_int(src1_i64, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -88,54 +140,56 @@ int main() {
   check_mem_alloc(src1_col);
   init_src_fp(src1_col, gm_size);
 
+#ifndef __linx
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
   __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, gm_size);
- 
+#endif
+
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
   int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
   int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -149,14 +203,16 @@ int main() {
 
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst_col, src0_col, src1_col);
 
+#ifndef __linx
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16, src1_f16);
+#endif
 
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8, src1_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16, src1_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32, src1_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
 
 #ifdef LINX_PMC
@@ -166,12 +222,14 @@ int main() {
   printf("Result:\n");
   OutArray(dst, gm_size);
   OutArray(dst_col, gm_size);
+#ifndef __linx
   OutArray(dst_f16, gm_size);
+#endif
   OutArray(dst_i8, gm_size);
   OutArray(dst_i16, gm_size);
   OutArray(dst_i32, gm_size);
   OutArray(dst_i64, gm_size);
- 
+
   free(dst);
   free(src0);
   free(src1);
@@ -180,25 +238,28 @@ int main() {
   free(src0_col);
   free(src1_col);
 
+#ifndef __linx
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+#endif
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/archive/outdated/tests/other/tileop_api/src/TAdd_mask.cpp b/archive/outdated/tests/other/tileop_api/src/TAdd_mask.cpp
new file mode 100644
index 0000000..a1770b2
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/TAdd_mask.cpp
@@ -0,0 +1,191 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);  // volatile: likely keeps this loop from being lowered back into a memcpy call
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+using namespace pto;
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test(T *c_ptr, T *a_ptr, T *b_ptr) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  static constexpr int block_row = gm_row / tile_row;
+  static constexpr int block_col = gm_col / tile_col;
+  static constexpr int remainder_row = gm_row % tile_row;
+  static constexpr int remainder_col = gm_col % tile_col;
+
+  using trailing_rows_shape =  // right-edge tile: all tile_row rows, remainder_col valid columns
+      Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor, tile_row, remainder_col>;
+  using trailing_cols_shape =  // bottom-edge tile: remainder_row valid rows, all tile_col columns
+      Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor, remainder_row, tile_col>;
+  using trailing_corner_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor,
+                                            remainder_row, remainder_col>;  // bottom-right corner tile
+
+  glb_iterator gAIter(a_ptr);
+  glb_iterator gBIter(b_ptr);
+  glb_iterator gCIter(c_ptr);
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      auto gA = gAIter(i, j);
+      auto gB = gBIter(i, j);
+      auto gC = gCIter(i, j);
+
+      tile_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+    if constexpr (remainder_col) {
+      auto gA = gAIter(i, block_col);
+      auto gB = gBIter(i, block_col);
+      auto gC = gCIter(i, block_col);
+
+      trailing_rows_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+  }
+  if constexpr (remainder_row) {
+    for (int j = 0; j < block_col; ++j) {
+      auto gA = gAIter(block_row, j);
+      auto gB = gBIter(block_row, j);
+      auto gC = gCIter(block_row, j);
+
+      trailing_cols_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+    if constexpr (remainder_col) {
+      auto gA = gAIter(block_row, block_col);
+      auto gB = gBIter(block_row, block_col);
+      auto gC = gCIter(block_row, block_col);
+
+      trailing_corner_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 6;
+  constexpr uint16_t gm_col = 6;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 66;
+  constexpr uint16_t gm_col = 66;
+  constexpr uint16_t tile_row = 16;
+  constexpr uint16_t tile_col = 16;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src0[gm_size];
+  static int64_t src1[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src0, gm_size);
+  init_src_uint(src1, gm_size);
+
+  test<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src0, src1);
+  return 0;
+#else
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+  float *src1 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TAdds.cpp b/archive/outdated/tests/other/tileop_api/src/TAdds.cpp
similarity index 73%
rename from test/tileop_api/src/TAdds.cpp
rename to archive/outdated/tests/other/tileop_api/src/TAdds.cpp
index 38170fd..3545585 100644
--- a/test/tileop_api/src/TAdds.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TAdds.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0, T s) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -20,21 +52,21 @@ void test_RowMajor(T *dst, T *src0, T s) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TADDS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0, T s) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -44,24 +76,42 @@ void test_ColMajor(T *dst, T *src0, T s) {
       int offset = i * (tile_col * gm_row) + j * tile_row;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TADDS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst_i64[gm_size];
+  static int64_t src0_i64[gm_size];
+  init_dst(dst_i64, gm_size);
+  init_src_int(src0_i64, gm_size);
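+  const int64_t s_i64 = 1;  // assumed placeholder addend: s_i64 appears to be declared only in the #else (host) path
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, s_i64);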
 
+  return 0;
+#else
   float *dst_col = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst_col);
   init_dst(dst_col, gm_size);
@@ -73,7 +123,7 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
@@ -81,31 +131,31 @@ int main() {
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -121,9 +171,9 @@ int main() {
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8, s_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16, s_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32, s_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, s_i64);
 
 #ifdef LINX_PMC
@@ -143,18 +193,19 @@ int main() {
 
   free(dst_f16);
   free(src0_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/other/tileop_api/src/TAssemble.cpp b/archive/outdated/tests/other/tileop_api/src/TAssemble.cpp
similarity index 96%
rename from test/other/tileop_api/src/TAssemble.cpp
rename to archive/outdated/tests/other/tileop_api/src/TAssemble.cpp
index c2ec2e8..d7940f3 100644
--- a/test/other/tileop_api/src/TAssemble.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TAssemble.cpp
@@ -29,11 +29,11 @@ void test(float *dst, float *src0, float *src1, float *src2) {
   tile_shape_src2 d2;
   tile_shape_dst d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TASSEMBLE(d3, d0, d1, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 int main() {
diff --git a/test/other/tileop_api/src/TCast.cpp b/archive/outdated/tests/other/tileop_api/src/TCast.cpp
similarity index 97%
rename from test/other/tileop_api/src/TCast.cpp
rename to archive/outdated/tests/other/tileop_api/src/TCast.cpp
index 2a36007..6790915 100644
--- a/test/other/tileop_api/src/TCast.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TCast.cpp
@@ -18,9 +18,9 @@ void test(T2 *dst, T1 *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCAST(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
diff --git a/test/tileop_api/src/TCopy.cpp b/archive/outdated/tests/other/tileop_api/src/TCopy.cpp
similarity index 83%
rename from test/tileop_api/src/TCopy.cpp
rename to archive/outdated/tests/other/tileop_api/src/TCopy.cpp
index bd45f38..d0fb15c 100644
--- a/test/tileop_api/src/TCopy.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TCopy.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_Nz(T *dst, T *src0) {
@@ -14,18 +46,18 @@ void test_Nz(T *dst, T *src0) {
 
   glb_iterator gS0Iter(src0);
   glb_iterator gDIter(dst);
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   for (int i = 0; i < block_row; ++i) {
     for (int j = 0; j < block_col; ++j) {
       auto s0 = gS0Iter(i, j);
       auto res = gDIter(i, j);
- 
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -40,7 +72,7 @@ void test_Nz_Dynamic(T *dst, T *src0) {
 
   uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
   uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
-  
+
   for (int i = 0; i < block_row; ++i) {
     for (int j = 0; j < block_col; ++j) {
       uint16_t remainder_row = gm_row - i * tile_valid_row;
@@ -55,9 +87,9 @@ void test_Nz_Dynamic(T *dst, T *src0) {
 
       tile_shape d0(active_row, active_col);
       tile_shape d1(active_row, active_col);
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -67,7 +99,7 @@ template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
 void test_RowMajor(T *dst, T *src0) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -77,11 +109,11 @@ void test_RowMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -112,19 +144,19 @@ void test_RowMajor_Dynamic(T *dst, T *src0) {
 
       tile_shape d0(active_row, active_col);
       tile_shape d1(active_row, active_col);
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -134,24 +166,42 @@ void test_ColMajor(T *dst, T *src0) {
       int offset = i * (tile_col * gm_row) + j * tile_row;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
 
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -171,7 +221,7 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
@@ -179,31 +229,31 @@ int main() {
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -211,7 +261,7 @@ int main() {
   int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst1_i32);
   init_dst(dst1_i32, gm_size);
- 
+
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
@@ -219,7 +269,7 @@ int main() {
   int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_nz_i32);
   init_dst(dst_nz_i32, gm_size);
- 
+
   int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src_nz_i32);
   init_src_int(src_nz_i32, gm_size);
@@ -239,7 +289,7 @@ int main() {
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
 
   test_RowMajor_Dynamic<gm_row, gm_col, tile_row, tile_col, int32_t>(dst1_i32, src1_i32);
@@ -259,22 +309,22 @@ int main() {
   OutArray(dst_i64, gm_size);
   OutArray(dst1_i32, gm_size);
   OutArray(dst_nz_i32, gm_size);
- 
+
   free(dst);
   free(src0);
- 
+
   free(dst_f16);
   free(src0_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
 
@@ -285,4 +335,5 @@ int main() {
   free(src_nz_i32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TCvt.cpp b/archive/outdated/tests/other/tileop_api/src/TCvt.cpp
similarity index 57%
rename from test/tileop_api/src/TCvt.cpp
rename to archive/outdated/tests/other/tileop_api/src/TCvt.cpp
index cdc88f0..f8652e3 100644
--- a/test/tileop_api/src/TCvt.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TCvt.cpp
@@ -5,6 +5,39 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t row, uint16_t col> void testRow2Nz(float *dst, float *src) {
   using gm_shape = global_tensor<float, RowMajor<row, col>>;
 
@@ -17,10 +50,10 @@ template <uint16_t row, uint16_t col> void testRow2Nz(float *dst, float *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCVT(d1, d0);
   TCVT(d0, d1);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 template <uint16_t row, uint16_t col> void testNz2Col(float *dst, float *src) {
@@ -35,10 +68,10 @@ template <uint16_t row, uint16_t col> void testNz2Col(float *dst, float *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCVT(d1, d0);
   TCVT(d0, d1);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 template <uint16_t row, uint16_t col> void testNz2Zn(float *dst, float *src) {
@@ -53,10 +86,10 @@ template <uint16_t row, uint16_t col> void testNz2Zn(float *dst, float *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCVT(d1, d0);
   TCVT(d0, d1);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 template <uint16_t row, uint16_t col> void testZn2Nz(float *dst, float *src) {
@@ -71,10 +104,10 @@ template <uint16_t row, uint16_t col> void testZn2Nz(float *dst, float *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCVT(d1, d0);
   TCVT(d0, d1);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 template <uint16_t row, uint16_t col> void testNz2Nz(float *dst, float *src) {
@@ -89,13 +122,69 @@ template <uint16_t row, uint16_t col> void testNz2Nz(float *dst, float *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCVT(d1, d0);
   TCVT(d0, d1);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 16;
+  constexpr uint16_t col = 16;
+  using row_tile = Tile<Location::Vec, int64_t, row, col>;
+  using col_tile = Tile<Location::Vec, int64_t, row, col, BLayout::ColMajor>;
+  using nz_tile = TileLeft<int64_t, row, col>;
+  using zn_tile = TileRight<int64_t, row, col>;
+
+  row_tile row_src;
+  row_tile row_round;
+  col_tile col_src;
+  col_tile col_round;
+  nz_tile nz_a;
+  nz_tile nz_b;
+  zn_tile zn;
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      row_src.data()[index<row_tile>(i, j)] =
+          static_cast<int64_t>((i + 1) * 100 + j);
+      col_src.data()[index<col_tile>(i, j)] =
+          static_cast<int64_t>((i + 1) * 1000 + j);
+    }
+  }
+
+  TCVT(nz_a, row_src);
+  TCVT(row_round, nz_a);
+  TCVT(zn, nz_a);
+  TCVT(nz_b, zn);
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      if (row_round.data()[index<row_tile>(i, j)] !=
+          row_src.data()[index<row_tile>(i, j)]) {
+        return 1;
+      }
+      if (nz_b.data()[index<nz_tile>(i, j)] !=
+          nz_a.data()[index<nz_tile>(i, j)]) {
+        return 2;
+      }
+    }
+  }
+
+  TCVT(nz_a, col_src);
+  TCVT(col_round, nz_a);
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      if (col_round.data()[index<col_tile>(i, j)] !=
+          col_src.data()[index<col_tile>(i, j)]) {
+        return 3;
+      }
+    }
+  }
+
+  return 0;
+#else
   const uint16_t row = 16;
   const uint16_t col = 32;
 
@@ -150,4 +239,5 @@ int main() {
   free(src2);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/archive/outdated/tests/other/tileop_api/src/TDiv.cpp b/archive/outdated/tests/other/tileop_api/src/TDiv.cpp
new file mode 100644
index 0000000..d8950a1
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/TDiv.cpp
@@ -0,0 +1,251 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_rm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape s1(src1 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1, d2;
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TDIV(d2, d1, d0);
+      TSTORE(res, d2);
+    }
+  }
+}
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_cm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape s1(src1 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1, d2;
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TDIV(d2, d1, d0);
+      TSTORE(res, d2);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  // 64x64 global tensor, 32x32 tiles
+  const uint16_t gm_row = 64;
+  const uint16_t gm_col = 64;
+  const uint16_t tile_row = 32;
+  const uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_rm[gm_size];
+  static int64_t dst_cm[gm_size];
+  static int64_t src0_rm[gm_size];
+  static int64_t src1_rm[gm_size];
+  static int64_t src0_cm[gm_size];
+  static int64_t src1_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+  init_src_uint(src0_rm, gm_size);
+  init_src_int(src1_rm, gm_size);
+  init_src_uint(src0_cm, gm_size);
+  init_src_int(src1_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_rm, src0_rm, src1_rm);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_cm, src0_cm, src1_cm);
+
+  return 0;
+#else
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+  float *src1 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+
+  __half *src2 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src2);
+  init_src_fp(src2, gm_size);
+  __half *src3 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src3);
+  init_src_fp(src3, gm_size);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+
+  int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src4);
+  init_src_int8(src4, gm_size);
+  int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src5);
+  init_src_int8(src5, gm_size);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+
+  int16_t *src6 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src6);
+  init_src_int(src6, gm_size);
+  int16_t *src7 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src7);
+  init_src_int(src7, gm_size);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+
+  int32_t *src8 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src8);
+  init_src_int(src8, gm_size);
+  int32_t *src9 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src9);
+  init_src_int(src9, gm_size);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+
+  int64_t *src10 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src10);
+  init_src_int(src10, gm_size);
+  int64_t *src11 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src11);
+  init_src_int(src11, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src0, src1);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, src2, src3);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst2, src4, src5);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst3, src6, src7);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, src8, src9);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, src10, src11);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+
+  free(dst);
+  free(src0);
+  free(src1);
+  free(dst1);
+  free(src2);
+  free(src3);
+  free(dst2);
+  free(src4);
+  free(src5);
+  free(dst3);
+  free(src6);
+  free(src7);
+  free(dst4);
+  free(src8);
+  free(src9);
+  free(dst5);
+  free(src10);
+  free(src11);
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TDivs.cpp b/archive/outdated/tests/other/tileop_api/src/TDivs.cpp
similarity index 65%
rename from test/tileop_api/src/TDivs.cpp
rename to archive/outdated/tests/other/tileop_api/src/TDivs.cpp
index e3773ab..59b5023 100644
--- a/test/tileop_api/src/TDivs.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TDivs.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col, typename T>
 void test_rm(T *dst, T *src, T s) {
@@ -20,9 +71,9 @@ void test_rm(T *dst, T *src, T s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TDIVS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -41,21 +92,44 @@ void test_cm(T *dst, T *src, T s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TDIVS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint64_t gm_row = 4;
+  constexpr uint64_t gm_col = 4;
+  constexpr uint64_t tile_row = 4;
+  constexpr uint64_t tile_col = 4;
+#else
   const uint16_t gm_row = 64;
   const uint16_t gm_col = 64;
   const uint16_t tile_row = 32;
   const uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_rm[gm_size];
+  static int64_t dst_cm[gm_size];
+  static int64_t src_rm[gm_size];
+  static int64_t src_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+  init_src_uint(src_rm, gm_size);
+  init_src_uint(src_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_rm, src_rm, 2);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_cm, src_cm, 2);
+  return 0;
+#else
   // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
@@ -137,4 +211,5 @@ int main() {
   free(dst5);
   free(src5);
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TExp.cpp b/archive/outdated/tests/other/tileop_api/src/TExp.cpp
similarity index 50%
rename from test/tileop_api/src/TExp.cpp
rename to archive/outdated/tests/other/tileop_api/src/TExp.cpp
index 233d727..2b7e767 100644
--- a/test/tileop_api/src/TExp.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TExp.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col, typename T>
 void test_rm(T *dst, T *src) {
@@ -20,9 +71,9 @@ void test_rm(T *dst, T *src) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TEXP(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -41,14 +92,54 @@ void test_cm(T *dst, T *src) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TEXP(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
+#ifdef __linx
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+  using row_tile = Tile<Location::Vec, int64_t, tile_row, tile_col>;
+  using col_tile =
+      Tile<Location::Vec, int64_t, tile_row, tile_col, BLayout::ColMajor>;
+
+  row_tile src_rm, dst_rm;
+  col_tile src_cm, dst_cm;
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t value = static_cast<int64_t>((i + j) % 6);
+      size_t row_index = index<row_tile>(i, j);
+      size_t col_index = index<col_tile>(i, j);
+      src_rm.data()[row_index] = value;
+      src_cm.data()[col_index] = value;
+      dst_rm.data()[row_index] = 0;
+      dst_cm.data()[col_index] = 0;
+    }
+  }
+
+  TEXP(dst_rm, src_rm);
+  TEXP(dst_cm, src_cm);
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t value = static_cast<int64_t>((i + j) % 6);
+      int64_t expected = linx_tile_iexp(value);
+      if (dst_rm.data()[index<row_tile>(i, j)] != expected) {
+        return 1;
+      }
+      if (dst_cm.data()[index<col_tile>(i, j)] != expected) {
+        return 2;
+      }
+    }
+  }
+
+  return 0;
+#else
   const uint16_t gm_row = 64;
   const uint16_t gm_col = 64;
   const uint16_t tile_row = 16;
@@ -95,4 +186,5 @@ int main() {
   free(src2);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TExpandCol.cpp b/archive/outdated/tests/other/tileop_api/src/TExpandCol.cpp
similarity index 73%
rename from test/tileop_api/src/TExpandCol.cpp
rename to archive/outdated/tests/other/tileop_api/src/TExpandCol.cpp
index 2950077..4b6137a 100644
--- a/test/tileop_api/src/TExpandCol.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TExpandCol.cpp
@@ -5,6 +5,39 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
@@ -17,9 +50,9 @@ template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXPANDCOL(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
@@ -33,12 +66,30 @@ template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXPANDCOL(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
   const uint16_t row = 32;
   const uint16_t col = 32;
 
@@ -131,4 +182,5 @@ int main() {
   free(src5);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TExpandRow.cpp b/archive/outdated/tests/other/tileop_api/src/TExpandRow.cpp
similarity index 73%
rename from test/tileop_api/src/TExpandRow.cpp
rename to archive/outdated/tests/other/tileop_api/src/TExpandRow.cpp
index d6d0005..f448a92 100644
--- a/test/tileop_api/src/TExpandRow.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TExpandRow.cpp
@@ -5,6 +5,39 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t row, uint16_t col,typename T>
 void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
@@ -18,9 +51,9 @@ void test_rm(T *dst, T *src) {
 
   tile_shape_in d0;
   tile_shape_out d1;
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXPANDROW(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 template <uint16_t row, uint16_t col,typename T>
 void test_cm(T *dst, T *src) {
@@ -35,12 +68,30 @@ void test_cm(T *dst, T *src) {
 
   tile_shape_in d0;
   tile_shape_out d1;
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXPANDROW(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
   const uint16_t row = 32;
   const uint16_t col = 32;
   size_t size_in = col;
@@ -135,4 +186,5 @@ int main() {
   free(dst5);
   free(src5);
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TExpandScalar.cpp b/archive/outdated/tests/other/tileop_api/src/TExpandScalar.cpp
similarity index 72%
rename from test/tileop_api/src/TExpandScalar.cpp
rename to archive/outdated/tests/other/tileop_api/src/TExpandScalar.cpp
index 1bff82a..7b9e347 100644
--- a/test/tileop_api/src/TExpandScalar.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TExpandScalar.cpp
@@ -5,6 +5,39 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col,typename T>
 void test_rm(T *dst, T s) {
@@ -14,7 +47,7 @@ void test_rm(T *dst, T s) {
 
   tile_shape d0;
   TEXPANDSCALAR(d0, s);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
@@ -30,7 +63,7 @@ void test_rm_dynamic(T *dst, T s) {
   tile_shape d0(tile_valid_row, tile_valid_col);
 
   TEXPANDSCALAR(d0, s);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
@@ -42,10 +75,26 @@ void test_cm(T *dst, T s) {
 
   tile_shape d0;
   TEXPANDSCALAR(d0, s);
-  TCOPYOUT(res, d0);
+  TSTORE(res, d0);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 8;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 8;
+  constexpr uint16_t gm_size = gm_row * gm_col;
+
+  static int64_t dst_rm[gm_size];
+  static int64_t dst_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_rm, s_i64);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_cm, s_i64);
+  return 0;
+#else
   const uint16_t gm_row = 16;
   const uint16_t gm_col = 32;
   const uint16_t tile_row = 16;
@@ -116,4 +165,5 @@ int main() {
   free(dst6);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/other/tileop_api/src/TExtract.cpp b/archive/outdated/tests/other/tileop_api/src/TExtract.cpp
similarity index 97%
rename from test/other/tileop_api/src/TExtract.cpp
rename to archive/outdated/tests/other/tileop_api/src/TExtract.cpp
index dd38494..3e80365 100644
--- a/test/other/tileop_api/src/TExtract.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TExtract.cpp
@@ -20,9 +20,9 @@ void test(float *dst, float *src, uint16_t offest) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXTRACT(d1, d0, offest);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
diff --git a/test/other/tileop_api/src/TGatherElementCol.cpp b/archive/outdated/tests/other/tileop_api/src/TGatherElementCol.cpp
similarity index 96%
rename from test/other/tileop_api/src/TGatherElementCol.cpp
rename to archive/outdated/tests/other/tileop_api/src/TGatherElementCol.cpp
index e72a015..31dd225 100644
--- a/test/other/tileop_api/src/TGatherElementCol.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TGatherElementCol.cpp
@@ -23,10 +23,10 @@ void test(float *dst, float *src0, uint16_t *src1) {
   tile_shape_src1 d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TGATHERELEMENTCOL(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
diff --git a/test/other/tileop_api/src/TGatherElementRow.cpp b/archive/outdated/tests/other/tileop_api/src/TGatherElementRow.cpp
similarity index 96%
rename from test/other/tileop_api/src/TGatherElementRow.cpp
rename to archive/outdated/tests/other/tileop_api/src/TGatherElementRow.cpp
index cbe5573..ed9adbb 100644
--- a/test/other/tileop_api/src/TGatherElementRow.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TGatherElementRow.cpp
@@ -23,10 +23,10 @@ void test(float *dst, float *src0, uint16_t *src1) {
   tile_shape_src1 d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TGATHERELEMENTROW(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
diff --git a/test/other/tileop_api/src/TGatherRow.cpp b/archive/outdated/tests/other/tileop_api/src/TGatherRow.cpp
similarity index 97%
rename from test/other/tileop_api/src/TGatherRow.cpp
rename to archive/outdated/tests/other/tileop_api/src/TGatherRow.cpp
index e5728be..a0204e5 100644
--- a/test/other/tileop_api/src/TGatherRow.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TGatherRow.cpp
@@ -24,10 +24,10 @@ void test(float *dst, float *src0, uint16_t *src1) {
   tile_shape_src1 d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TGATHERROW(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
diff --git a/test/tileop_api/src/TCopyIn.cpp b/archive/outdated/tests/other/tileop_api/src/TLoad.cpp
similarity index 82%
rename from test/tileop_api/src/TCopyIn.cpp
rename to archive/outdated/tests/other/tileop_api/src/TLoad.cpp
index 944f9f4..e13b0d6 100644
--- a/test/tileop_api/src/TCopyIn.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TLoad.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0) {
@@ -12,7 +44,7 @@ void test_RowMajor(T *dst, T *src0) {
   using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>;
   using gm_shape = GlobalTensor<T, shape, stride, Layout::ND>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -22,10 +54,10 @@ void test_RowMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
@@ -57,12 +89,12 @@ void test_RowMajor_Dynamic(T *dst, T *src0) {
       gm_shape res(dst + offset, gm_valid_row, gm_valid_col);
 
       tile_shape d0(active_row, active_col);
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0) {
@@ -70,7 +102,7 @@ void test_ColMajor(T *dst, T *src0) {
   using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>;
   using gm_shape = GlobalTensor<T, shape, stride, Layout::DN>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -80,10 +112,10 @@ void test_ColMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
@@ -101,7 +133,7 @@ void test_Nz_Dynamic(T *dst, T *src0) {
 
   uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
   uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
-  
+
   for (int i = 0; i < block_row; ++i) {
     for (int j = 0; j < block_col; ++j) {
       uint16_t remainder_row = gm_row - i * tile_valid_row;
@@ -116,21 +148,39 @@ void test_Nz_Dynamic(T *dst, T *src0) {
 
       tile_shape d0(active_row, active_col);
       tile_shape d1(active_row, active_col);
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
 
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -142,7 +192,7 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
@@ -150,31 +200,31 @@ int main() {
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -182,7 +232,7 @@ int main() {
   int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst1_i32);
   init_dst(dst1_i32, gm_size);
- 
+
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
@@ -190,7 +240,7 @@ int main() {
   int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_nz_i32);
   init_dst(dst_nz_i32, gm_size);
- 
+
   int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src_nz_i32);
   init_src_int(src_nz_i32, gm_size);
@@ -200,15 +250,15 @@ int main() {
 #endif
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
-  
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
 
   test_RowMajor_Dynamic<gm_row + 1, gm_col + 1, tile_row, tile_col, int32_t>(dst1_i32, src1_i32);
@@ -228,22 +278,22 @@ int main() {
   OutArray(dst_i64, gm_size);
   OutArray(dst1_i32, gm_size);
   OutArray(dst_nz_i32, gm_size);
- 
+
   free(dst);
   free(src0);
- 
+
   free(dst_f16);
   free(src0_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
 
@@ -251,4 +301,5 @@ int main() {
   free(src1_i32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TMax.cpp b/archive/outdated/tests/other/tileop_api/src/TMax.cpp
similarity index 74%
rename from test/tileop_api/src/TMax.cpp
rename to archive/outdated/tests/other/tileop_api/src/TMax.cpp
index 710c31f..25301d6 100644
--- a/test/tileop_api/src/TMax.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TMax.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_rm(T *dst, T *src0, T *src1) {
@@ -21,10 +53,10 @@ void test_rm(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMAX(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -44,22 +76,43 @@ void test_cm(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMAX(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src0[gm_size];
+  static int64_t src1[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src0, gm_size);
+  init_src_int(src1, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src0, src1);
+
+  return 0;
+#else
   // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
@@ -170,4 +223,5 @@ int main() {
   free(src11);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TMaxs.cpp b/archive/outdated/tests/other/tileop_api/src/TMaxs.cpp
similarity index 72%
rename from test/tileop_api/src/TMaxs.cpp
rename to archive/outdated/tests/other/tileop_api/src/TMaxs.cpp
index 3593c8e..8833ab9 100644
--- a/test/tileop_api/src/TMaxs.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TMaxs.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col,typename T>
 void test_rm(T *dst, T *src, T s) {
@@ -20,9 +52,9 @@ void test_rm(T *dst, T *src, T s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMAXS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -41,21 +73,40 @@ void test_cm(T *dst, T *src, T s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMAXS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src, s_i64);
+
+  return 0;
+#else
   // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
@@ -142,4 +193,5 @@ int main() {
   free(src5);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/other/tileop_api/src/TMin.cpp b/archive/outdated/tests/other/tileop_api/src/TMin.cpp
similarity index 95%
rename from test/other/tileop_api/src/TMin.cpp
rename to archive/outdated/tests/other/tileop_api/src/TMin.cpp
index c711356..b6418bb 100644
--- a/test/other/tileop_api/src/TMin.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TMin.cpp
@@ -21,10 +21,10 @@ void test(float *dst, float *src0, float *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMIN(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
diff --git a/test/other/tileop_api/src/TMins.cpp b/archive/outdated/tests/other/tileop_api/src/TMins.cpp
similarity index 96%
rename from test/other/tileop_api/src/TMins.cpp
rename to archive/outdated/tests/other/tileop_api/src/TMins.cpp
index de3dc22..5673a48 100644
--- a/test/other/tileop_api/src/TMins.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TMins.cpp
@@ -20,9 +20,9 @@ void test(float *dst, float *src, float s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMINS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
diff --git a/test/tileop_api/src/TMul.cpp b/archive/outdated/tests/other/tileop_api/src/TMul.cpp
similarity index 73%
rename from test/tileop_api/src/TMul.cpp
rename to archive/outdated/tests/other/tileop_api/src/TMul.cpp
index 7436241..9967351 100644
--- a/test/tileop_api/src/TMul.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TMul.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col,typename T>
 void test_rm(T *dst, T *src0, T *src1) {
@@ -18,13 +50,13 @@ void test_rm(T *dst, T *src0, T *src1) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
-      gm_shape res(dst + offset);  
-  
+      gm_shape res(dst + offset);
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMUL(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -41,30 +73,51 @@ void test_cm(T *dst, T *src0, T *src1) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
-      gm_shape res(dst + offset);  
-  
+      gm_shape res(dst + offset);
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMUL(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src0[gm_size];
+  static int64_t src1[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src0, gm_size);
+  init_src_int(src1, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src0, src1);
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  return 0;
+#else
  // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
- 
+
   float *src0 = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(src0);
   init_src_fp(src0, gm_size);
@@ -170,4 +223,5 @@ int main() {
   free(src11);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TMuls.cpp b/archive/outdated/tests/other/tileop_api/src/TMuls.cpp
similarity index 72%
rename from test/tileop_api/src/TMuls.cpp
rename to archive/outdated/tests/other/tileop_api/src/TMuls.cpp
index ae280f5..e451ee0 100644
--- a/test/tileop_api/src/TMuls.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TMuls.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col,typename T>
 void test_rm(T *dst, T *src, T s) {
@@ -20,9 +52,9 @@ void test_rm(T *dst, T *src, T s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMULS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -41,21 +73,40 @@ void test_cm(T *dst, T *src, T s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMULS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src, s_i64);
+
+  return 0;
+#else
   // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
@@ -142,4 +193,5 @@ int main() {
   free(src5);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/other/tileop_api/src/TRSqrt.cpp b/archive/outdated/tests/other/tileop_api/src/TRSqrt.cpp
similarity index 96%
rename from test/other/tileop_api/src/TRSqrt.cpp
rename to archive/outdated/tests/other/tileop_api/src/TRSqrt.cpp
index 8b2a10c..7e1d236 100644
--- a/test/other/tileop_api/src/TRSqrt.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TRSqrt.cpp
@@ -20,9 +20,9 @@ void test(float *dst, float *src) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TRSQRT(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
diff --git a/test/tileop_api/src/TRecip.cpp b/archive/outdated/tests/other/tileop_api/src/TRecip.cpp
similarity index 62%
rename from test/tileop_api/src/TRecip.cpp
rename to archive/outdated/tests/other/tileop_api/src/TRecip.cpp
index 70c591b..61555ec 100644
--- a/test/tileop_api/src/TRecip.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TRecip.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col, typename T>
 void test_rm(T *dst, T *src) {
@@ -23,9 +74,9 @@ void test_rm(T *dst, T *src) {
       auto res = gDIter(i, j);
 
       tile_shape t0, t1;
-      TCOPYIN(t0, s0);
+      TLOAD(t0, s0);
       TRECIP(t1, t0);
-      TCOPYOUT(res, t1);
+      TSTORE(res, t1);
     }
   }
 }
@@ -48,22 +99,65 @@ void test_cm(T *dst, T *src) {
       auto res = gDIter(j, i);
 
       tile_shape t0, t1;
-      TCOPYIN(t0, s0);
+      TLOAD(t0, s0);
       TRECIP(t1, t0);
-      TCOPYOUT(res, t1);
+      TSTORE(res, t1);
     }
   }
 }
 
 int main() {
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
   const size_t gm_row = 32;
   const size_t gm_col = 32;
   const size_t tile_row = 32;
   const size_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)gm_size;
+  (void)tile_size;
+
+#ifdef __linx
+  using row_tile = Tile<Location::Vec, int64_t, tile_row, tile_col>;
+  using col_tile =
+      Tile<Location::Vec, int64_t, tile_row, tile_col, BLayout::ColMajor>;
+  row_tile src_rm, dst_rm;
+  col_tile src_cm, dst_cm;
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      size_t row_index = index<row_tile>(i, j);
+      size_t col_index = index<col_tile>(i, j);
+      src_rm.data()[row_index] = 1;
+      src_cm.data()[col_index] = 1;
+      dst_rm.data()[row_index] = 0;
+      dst_cm.data()[col_index] = 0;
+    }
+  }
 
+  TRECIP(dst_rm, src_rm);
+  TRECIP(dst_cm, src_cm);
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      if (dst_rm.data()[index<row_tile>(i, j)] != 1) {
+        return 1;
+      }
+      if (dst_cm.data()[index<col_tile>(i, j)] != 1) {
+        return 2;
+      }
+    }
+  }
+
+  return 0;
+#else
   // int8_t
   int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_int8);
@@ -155,4 +249,5 @@ int main() {
   free(src_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TReshape.cpp b/archive/outdated/tests/other/tileop_api/src/TReshape.cpp
similarity index 70%
rename from test/tileop_api/src/TReshape.cpp
rename to archive/outdated/tests/other/tileop_api/src/TReshape.cpp
index 410a9cb..f83024e 100644
--- a/test/tileop_api/src/TReshape.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TReshape.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col, typename T>
 void test(T *dst, T *src) {
@@ -18,20 +50,38 @@ void test(T *dst, T *src) {
 
   tile_shape_in d0;
   tile_shape_out d1;
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TRESHAPE(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
-  const size_t gm_row = 64;
-  const size_t gm_col = 64;
-  const size_t tile_row = 64;
-  const size_t tile_col = 64;
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 8;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 8;
+#else
+  constexpr size_t gm_row = 64;
+  constexpr size_t gm_col = 64;
+  constexpr size_t tile_row = 64;
+  constexpr size_t tile_col = 64;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_uint(src, gm_size);
+
+  test<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
   // int8_t
   int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_int8);
@@ -108,7 +158,7 @@ int main() {
   OutArray(dst_int64, gm_size);
   OutArray(dst_f16, gm_size);
   OutArray(dst_f32, gm_size);
-  
+
 
   free(dst_int8);
   free(src_int8);
@@ -123,4 +173,5 @@ int main() {
   free(dst_f32);
   free(src_f32);
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/archive/outdated/tests/other/tileop_api/src/TRowMax.cpp b/archive/outdated/tests/other/tileop_api/src/TRowMax.cpp
new file mode 100644
index 0000000..1c91357
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/TRowMax.cpp
@@ -0,0 +1,203 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::RowMajor, row, 1>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAX(d1, d0);
+  TSTORE(res, d1);
+}
+
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::ColMajor, row, 1>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAX(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
+  const size_t row = 32;
+  const size_t col = 32;
+
+  size_t size_in = row * col;
+  size_t size_out = row * col;
+
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, size_out);
+
+  int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_uint(src_int8, size_in);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, size_out);
+
+  int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_uint(src_int16, size_in);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, size_out);
+
+  int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_uint(src_int32, size_in);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, size_out);
+
+  int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_uint(src_int64, size_in);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, size_out);
+
+  __half *src_f16 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, size_in);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, size_out);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, size_in);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, int8_t>(dst_int8, src_int8);
+  test_rm<row, col, int16_t>(dst_int16, src_int16);
+  test_rm<row, col, int32_t>(dst_int32, src_int32);
+  test_rm<row, col, int64_t>(dst_int64, src_int64);
+  test_cm<row, col, __half>(dst_f16, src_f16);
+  test_cm<row, col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, size_out);
+  OutArray(dst_int16, size_out);
+  OutArray(dst_int32, size_out);
+  OutArray(dst_int64, size_out);
+  OutArray(dst_f16, size_out);
+  OutArray(dst_f32, size_out);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TRowMaxExpand.cpp b/archive/outdated/tests/other/tileop_api/src/TRowMaxExpand.cpp
similarity index 69%
rename from test/tileop_api/src/TRowMaxExpand.cpp
rename to archive/outdated/tests/other/tileop_api/src/TRowMaxExpand.cpp
index d4814ca..669ff1e 100644
--- a/test/tileop_api/src/TRowMaxExpand.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TRowMaxExpand.cpp
@@ -5,7 +5,49 @@
 #include "../linxStartEnd.hpp"
 #endif
 
-template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
 
@@ -18,12 +60,12 @@ template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWMAXEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
-template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
   using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
 
@@ -36,11 +78,12 @@ template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWMAXEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
+#ifndef __linx
 template <size_t row, size_t col, typename T> void test_Nz(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
@@ -54,11 +97,31 @@ template <size_t row, size_t col, typename T> void test_Nz(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWMAXEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
+#endif
+
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
   const size_t row = 32;
   const size_t col = 32;
 
@@ -156,4 +219,5 @@ int main() {
   free(src_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TRowSum.cpp b/archive/outdated/tests/other/tileop_api/src/TRowSum.cpp
similarity index 71%
rename from test/tileop_api/src/TRowSum.cpp
rename to archive/outdated/tests/other/tileop_api/src/TRowSum.cpp
index f881873..f1a6a59 100644
--- a/test/tileop_api/src/TRowSum.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TRowSum.cpp
@@ -5,7 +5,40 @@
 #include "../linxStartEnd.hpp"
 #endif
 
-template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
 
@@ -18,12 +51,12 @@ template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWSUM(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
-template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
   using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
 
@@ -36,12 +69,30 @@ template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWSUM(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
   const size_t row = 32;
   const size_t col = 32;
 
@@ -139,4 +190,5 @@ int main() {
   free(src_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TRowSumExpand.cpp b/archive/outdated/tests/other/tileop_api/src/TRowSumExpand.cpp
similarity index 66%
rename from test/tileop_api/src/TRowSumExpand.cpp
rename to archive/outdated/tests/other/tileop_api/src/TRowSumExpand.cpp
index 99a1f79..a5203eb 100644
--- a/test/tileop_api/src/TRowSumExpand.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TRowSumExpand.cpp
@@ -5,7 +5,49 @@
 #include "../linxStartEnd.hpp"
 #endif
 
-template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
 
@@ -18,12 +60,12 @@ template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWSUMEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
-template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
   using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
 
@@ -36,12 +78,30 @@ template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TROWSUMEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
   const size_t row = 32;
   const size_t col = 32;
 
@@ -106,7 +166,7 @@ int main() {
   PMC_START();
 #endif
 
-  test_rm<row, col, int8_t>(dst_int8, src_int8); 
+  test_rm<row, col, int8_t>(dst_int8, src_int8);
   test_rm<row, col, int16_t>(dst_int16, src_int16);
   test_rm<row, col, int32_t>(dst_int32, src_int32);
   test_rm<row, col, int64_t>(dst_int64, src_int64);
@@ -137,4 +197,5 @@ int main() {
   free(dst_f32);
   free(src_f32);
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/other/tileop_api/src/TScatterElementCol.cpp b/archive/outdated/tests/other/tileop_api/src/TScatterElementCol.cpp
similarity index 96%
rename from test/other/tileop_api/src/TScatterElementCol.cpp
rename to archive/outdated/tests/other/tileop_api/src/TScatterElementCol.cpp
index a14ba89..71ae198 100644
--- a/test/other/tileop_api/src/TScatterElementCol.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TScatterElementCol.cpp
@@ -21,10 +21,10 @@ void test(float *dst, uint16_t *srci, float s) {
   tile_shape_srci d0;
   tile_shape_dst d1;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TSCATTERELEMENTCOL(d1, d0, s);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
diff --git a/test/other/tileop_api/src/TScatterElementRow.cpp b/archive/outdated/tests/other/tileop_api/src/TScatterElementRow.cpp
similarity index 96%
rename from test/other/tileop_api/src/TScatterElementRow.cpp
rename to archive/outdated/tests/other/tileop_api/src/TScatterElementRow.cpp
index dc3cf50..c74b0cc 100644
--- a/test/other/tileop_api/src/TScatterElementRow.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TScatterElementRow.cpp
@@ -21,10 +21,10 @@ void test(float *dst, uint16_t *srci, float s) {
   tile_shape_srci d0;
   tile_shape_dst d1;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TSCATTERELEMENTROW(d1, d0, s);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
diff --git a/test/other/tileop_api/src/TSelect.cpp b/archive/outdated/tests/other/tileop_api/src/TSelect.cpp
similarity index 95%
rename from test/other/tileop_api/src/TSelect.cpp
rename to archive/outdated/tests/other/tileop_api/src/TSelect.cpp
index 999cfc5..b8b68e4 100644
--- a/test/other/tileop_api/src/TSelect.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TSelect.cpp
@@ -23,11 +23,11 @@ void test(float *dst, float *src0, float *src1, uint16_t *src2) {
   tile_shape_uint16 d2;
   tile_shape_fp32 d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TSELECT(d3, d0, d1, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 int main() {
diff --git a/archive/outdated/tests/other/tileop_api/src/TSqrt.cpp b/archive/outdated/tests/other/tileop_api/src/TSqrt.cpp
new file mode 100644
index 0000000..9d4923f
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/TSqrt.cpp
@@ -0,0 +1,204 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (size_t i = 0; i < block_row; ++i) {
+    for (size_t j = 0; j < block_col; ++j) {
+      auto s0 = gSIter(i, j);
+      auto res = gDIter(i, j);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TSQRT(t1, t0);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (size_t i = 0; i < block_col; ++i) {
+    for (size_t j = 0; j < block_row; ++j) {
+      auto s0 = gSIter(j, i);
+      auto res = gDIter(j, i);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TSQRT(t1, t0);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+int main() {
+
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  const size_t gm_row = 32;
+  const size_t gm_col = 32;
+  const size_t tile_row = 16;
+  const size_t tile_col = 16;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)gm_size;
+  (void)tile_size;
+
+#ifdef __linx
+  using row_tile = Tile<Location::Vec, int64_t, tile_row, tile_col>;
+  using col_tile =
+      Tile<Location::Vec, int64_t, tile_row, tile_col, BLayout::ColMajor>;
+  row_tile src_rm, dst_rm;
+  col_tile src_cm, dst_cm;
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t expected = static_cast<int64_t>(i * tile_col + j);
+      size_t row_index = index<row_tile>(i, j);
+      size_t col_index = index<col_tile>(i, j);
+      src_rm.data()[row_index] = expected * expected;
+      src_cm.data()[col_index] = expected * expected;
+      dst_rm.data()[row_index] = 0;
+      dst_cm.data()[col_index] = 0;
+    }
+  }
+
+  TSQRT(dst_rm, src_rm);
+  TSQRT(dst_cm, src_cm);
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t expected = static_cast<int64_t>(i * tile_col + j);
+      if (dst_rm.data()[index<row_tile>(i, j)] != expected) {
+        return 1;
+      }
+      if (dst_cm.data()[index<col_tile>(i, j)] != expected) {
+        return 2;
+      }
+    }
+  }
+
+  return 0;
+#else
+  // __half
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_rows_fp(src_f16, gm_row, gm_col);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, gm_size);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_rows_fp(src_f32, gm_row, gm_col);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_f32, gm_size);
+
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TCopyOut.cpp b/archive/outdated/tests/other/tileop_api/src/TStore.cpp
similarity index 82%
rename from test/tileop_api/src/TCopyOut.cpp
rename to archive/outdated/tests/other/tileop_api/src/TStore.cpp
index a791431..02a2cf9 100644
--- a/test/tileop_api/src/TCopyOut.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TStore.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0) {
@@ -12,7 +44,7 @@ void test_RowMajor(T *dst, T *src0) {
   using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>;
   using gm_shape = GlobalTensor<T, shape, stride, Layout::ND>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -22,10 +54,10 @@ void test_RowMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
@@ -57,12 +89,12 @@ void test_RowMajor_Dynamic(T *dst, T *src0) {
       gm_shape res(dst + offset, gm_valid_row, gm_valid_col);
 
       tile_shape d0(active_row, active_col);
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0) {
@@ -70,7 +102,7 @@ void test_ColMajor(T *dst, T *src0) {
   using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>;
   using gm_shape = GlobalTensor<T, shape, stride, Layout::DN>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -80,10 +112,10 @@ void test_ColMajor(T *dst, T *src0) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
@@ -101,7 +133,7 @@ void test_Nz_Dynamic(T *dst, T *src0) {
 
   uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
   uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
-  
+
   for (int i = 0; i < block_row; ++i) {
     for (int j = 0; j < block_col; ++j) {
       uint16_t remainder_row = gm_row - i * tile_valid_row;
@@ -116,21 +148,39 @@ void test_Nz_Dynamic(T *dst, T *src0) {
 
       tile_shape d0(active_row, active_col);
       tile_shape d1(active_row, active_col);
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
 
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -142,7 +192,7 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
@@ -150,31 +200,31 @@ int main() {
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -182,7 +232,7 @@ int main() {
   int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst1_i32);
   init_dst(dst1_i32, gm_size);
- 
+
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
@@ -190,7 +240,7 @@ int main() {
   int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_nz_i32);
   init_dst(dst_nz_i32, gm_size);
- 
+
   int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src_nz_i32);
   init_src_int(src_nz_i32, gm_size);
@@ -200,15 +250,15 @@ int main() {
 #endif
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
-  
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
- 
+
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
- 
+
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
 
   test_RowMajor_Dynamic<gm_row + 1, gm_col + 1, tile_row, tile_col, int32_t>(dst1_i32, src1_i32);
@@ -228,22 +278,22 @@ int main() {
   OutArray(dst_i64, gm_size);
   OutArray(dst1_i32, gm_size);
   OutArray(dst_nz_i32, gm_size);
- 
+
   free(dst);
   free(src0);
- 
+
   free(dst_f16);
   free(src0_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
 
@@ -251,4 +301,5 @@ int main() {
   free(src1_i32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TSub.cpp b/archive/outdated/tests/other/tileop_api/src/TSub.cpp
similarity index 77%
rename from test/tileop_api/src/TSub.cpp
rename to archive/outdated/tests/other/tileop_api/src/TSub.cpp
index 0552ecb..8f8480b 100644
--- a/test/tileop_api/src/TSub.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TSub.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 //  C = A - B
 template <size_t gm_row, size_t gm_col, size_t tile_row, size_t tile_col,
           typename T>
@@ -26,10 +58,10 @@ void test_rm(T *dst, T *src0, T *src1) {
       auto res = gCIter(i, j);
 
       tile_shape t0, t1, t2;
-      TCOPYIN(t0, s0);
-      TCOPYIN(t1, s1);
+      TLOAD(t0, s0);
+      TLOAD(t1, s1);
       TSUB(t2, t1, t0);
-      TCOPYOUT(res, t2);
+      TSTORE(res, t2);
     }
   }
 }
@@ -54,23 +86,44 @@ void test_cm(T *dst, T *src0, T *src1) {
       auto res = gCIter(j, i);
 
       tile_shape t0, t1, t2;
-      TCOPYIN(t0, s0);
-      TCOPYIN(t1, s1);
+      TLOAD(t0, s0);
+      TLOAD(t1, s1);
       TSUB(t2, t1, t0);
-      TCOPYOUT(res, t2);
+      TSTORE(res, t2);
     }
   }
 }
 
 int main() {
-  const size_t gm_row = 32;
-  const size_t gm_col = 32;
-  const size_t tile_row = 32;
-  const size_t tile_col = 32;
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  constexpr size_t gm_row = 32;
+  constexpr size_t gm_col = 32;
+  constexpr size_t tile_row = 32;
+  constexpr size_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst_int64[gm_size];
+  static int64_t src0_int64[gm_size];
+  static int64_t src1_int64[gm_size];
+  init_dst(dst_int64, gm_size);
+  init_src_int(src0_int64, gm_size);
+  init_src_int(src1_int64, gm_size);
 
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src0_int64,
+                                                       src1_int64);
+
+  return 0;
+#else
   // int8_t
   int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_int8);
@@ -191,4 +244,5 @@ int main() {
   free(src1_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TSubs.cpp b/archive/outdated/tests/other/tileop_api/src/TSubs.cpp
similarity index 73%
rename from test/tileop_api/src/TSubs.cpp
rename to archive/outdated/tests/other/tileop_api/src/TSubs.cpp
index 06d0288..fd3e234 100644
--- a/test/tileop_api/src/TSubs.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TSubs.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <size_t gm_row, size_t gm_col, size_t tile_row, size_t tile_col,
           typename T>
 void test_rm(T *dst, T *src, T s) {
@@ -23,9 +55,9 @@ void test_rm(T *dst, T *src, T s) {
       auto res = gDIter(i, j);
 
       tile_shape t0, t1;
-      TCOPYIN(t0, s0);
+      TLOAD(t0, s0);
       TSUBS(t1, t0, s);
-      TCOPYOUT(res, t1);
+      TSTORE(res, t1);
     }
   }
 }
@@ -48,22 +80,41 @@ void test_cm(T *dst, T *src, T s) {
       auto res = gDIter(j, i);
 
       tile_shape t0, t1;
-      TCOPYIN(t0, s0);
+      TLOAD(t0, s0);
       TSUBS(t1, t0, s);
-      TCOPYOUT(res, t1);
+      TSTORE(res, t1);
     }
   }
 }
 
 int main() {
-  const size_t gm_row = 32;
-  const size_t gm_col = 32;
-  const size_t tile_row = 32;
-  const size_t tile_col = 32;
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  constexpr size_t gm_row = 32;
+  constexpr size_t gm_col = 32;
+  constexpr size_t tile_row = 32;
+  constexpr size_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
+#ifdef __linx
+  static int64_t dst_int64[gm_size];
+  static int64_t src_int64[gm_size];
+  init_dst(dst_int64, gm_size);
+  init_src_int(src_int64, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src_int64,
+                                                       s_i64);
+
+  return 0;
+#else
   // int8_t
   int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_int8);
@@ -158,4 +209,5 @@ int main() {
   free(src_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TTrans.cpp b/archive/outdated/tests/other/tileop_api/src/TTrans.cpp
similarity index 74%
rename from test/tileop_api/src/TTrans.cpp
rename to archive/outdated/tests/other/tileop_api/src/TTrans.cpp
index 324d88a..0d441a3 100644
--- a/test/tileop_api/src/TTrans.cpp
+++ b/archive/outdated/tests/other/tileop_api/src/TTrans.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<col, row>>;
@@ -17,9 +49,9 @@ template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TTRANS(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
@@ -34,18 +66,33 @@ template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TTRANS(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
-  const size_t row = 32;
-  const size_t col = 32;
+#ifdef __linx
+  constexpr size_t row = 4;
+  constexpr size_t col = 4;
+#else
+  constexpr size_t row = 32;
+  constexpr size_t col = 32;
+#endif
 
-  size_t size_in = row * col;
-  size_t size_out = col * row;
+  constexpr size_t size_in = row * col;
+  constexpr size_t size_out = col * row;
 
+#ifdef __linx
+  static int64_t dst[size_out];
+  static int64_t src[size_in];
+  init_dst(dst, size_out);
+  init_src_int(src, size_in);
+
+  test_rm<row, col, int64_t>(dst, src);
+
+  return 0;
+#else
   // int8
   int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t));
   check_mem_alloc(dst_int8);
@@ -137,4 +184,5 @@ int main() {
   free(src_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/archive/outdated/tests/other/tileop_api/src/test_MatMacc.cpp b/archive/outdated/tests/other/tileop_api/src/test_MatMacc.cpp
new file mode 100644
index 0000000..5316b61
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/test_MatMacc.cpp
@@ -0,0 +1,194 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+
+template <uint16_t M, uint16_t N, uint16_t K, typename T>
+void test_linx_row_major(T *dst, T *src0, T *src1) {
+  using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<T, RowMajor<K, N>>;
+  using gm_shape_C = global_tensor<T, RowMajor<M, N>>;
+
+  using tile_shape_A = Tile<Location::Vec, T, M, K, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  MATMACC(d2, d0, d1);
+  TSTORE(res, d2);
+}
+#endif
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(float *dst, float *src0, float *src1) {
+  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
+  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
+
+  using tile_shape_A = TileLeft<float, M, K>;
+  using tile_shape_B = TileRight<float, K, N>;
+  using tile_shape_C = TileAcc<float, M, N>;
+  using tile_shape_O = Tile<Location::Vec, float, M, N>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+  tile_shape_O d3;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  MATMACC(d2, d0, d1);
+  TCVT(d3, d2);
+  TSTORE(res, d3);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t k = 0; k < K; ++k) {
+      src0_i64[row * K + k] = static_cast<int64_t>((row + 1) * (k + 2));
+    }
+  }
+  for (size_t k = 0; k < K; ++k) {
+    for (size_t col = 0; col < N; ++col) {
+      src1_i64[k * N + col] = static_cast<int64_t>((k + 1) + (col + 1));
+    }
+  }
+  for (size_t i = 0; i < size_C; ++i) {
+    dst_i64[i] = 0;
+  }
+
+  test_linx_row_major<M, N, K, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      expected *= 2; // MATMUL then MATMACC on the same operands gives 2 * (A x B)
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 16;
+  const uint16_t K = 8;
+  const uint16_t N = 32;
+
+  size_t size_A = M * K;
+  size_t size_B = K * N;
+  size_t size_C = M * N;
+
+  float *dst = (float *)malloc(size_C * sizeof(float));
+  check_mem_alloc(dst);
+  init_src_fp(dst, size_C);
+
+  float *src0 = (float *)malloc(size_A * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, size_A);
+  float *src1 = (float *)malloc(size_B * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_B);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<M, N, K>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_C);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/archive/outdated/tests/other/tileop_api/src/test_MatMul.cpp b/archive/outdated/tests/other/tileop_api/src/test_MatMul.cpp
new file mode 100644
index 0000000..bf01255
--- /dev/null
+++ b/archive/outdated/tests/other/tileop_api/src/test_MatMul.cpp
@@ -0,0 +1,191 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+
+template <uint16_t M, uint16_t N, uint16_t K, typename T>
+void test_linx_row_major(T *dst, T *src0, T *src1) {
+  using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<T, RowMajor<K, N>>;
+  using gm_shape_C = global_tensor<T, RowMajor<M, N>>;
+
+  using tile_shape_A = Tile<Location::Vec, T, M, K, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  TSTORE(res, d2);
+}
+#endif
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(float *dst, float *src0, float *src1) {
+  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
+  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
+
+  using tile_shape_A = TileLeft<float, M, K>;
+  using tile_shape_B = TileRight<float, K, N>;
+  using tile_shape_C = TileAcc<float, M, N>;
+  using tile_shape_O = Tile<Location::Vec, float, M, N>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+  tile_shape_O d3;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  TCVT(d3, d2);
+  TSTORE(res, d3);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t k = 0; k < K; ++k) {
+      src0_i64[row * K + k] = static_cast<int64_t>((row + 1) * (k + 2));
+    }
+  }
+  for (size_t k = 0; k < K; ++k) {
+    for (size_t col = 0; col < N; ++col) {
+      src1_i64[k * N + col] = static_cast<int64_t>((k + 1) + (col + 1));
+    }
+  }
+  for (size_t i = 0; i < size_C; ++i) {
+    dst_i64[i] = 0;
+  }
+
+  test_linx_row_major<M, N, K, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 16;
+  const uint16_t K = 8;
+  const uint16_t N = 32;
+
+  size_t size_A = M * K;
+  size_t size_B = K * N;
+  size_t size_C = M * N;
+
+  float *dst = (float *)malloc(size_C * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_C);
+
+  float *src0 = (float *)malloc(size_A * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, size_A);
+  float *src1 = (float *)malloc(size_B * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_B);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<M, N, K>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_C);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/INDEX.md b/benchmarks/INDEX.md
new file mode 100644
index 0000000..39ff3f9
--- /dev/null
+++ b/benchmarks/INDEX.md
@@ -0,0 +1,232 @@
+# Benchmark Index
+
+Generated from the active `benchmarks/` tree. The suite table records batch build surfaces; the source table records benchmark entrypoints with source path, build command, category, active/archive status, and required data objects.
+
+## Suite Batch Commands
+
+| Category | Source path | Build command | Command count | Required data objects | Status |
+| --- | --- | --- | --- | --- | --- |
+| `api/tileop` | [`benchmarks/api/tileop/compile.all`](../benchmarks/api/tileop/compile.all) | `cd benchmarks/api/tileop && bash compile.all` | 39 | none | active |
+| `kernels/composite` | [`benchmarks/kernels/composite/compile_flash_attention.all`](../benchmarks/kernels/composite/compile_flash_attention.all) | `cd benchmarks/kernels/composite && bash compile_flash_attention.all` | 109 | none | active |
+| `kernels/composite` | [`benchmarks/kernels/composite/compile_gemm.all`](../benchmarks/kernels/composite/compile_gemm.all) | `cd benchmarks/kernels/composite && bash compile_gemm.all` | 12 | none | active |
+| `kernels/composite` | [`benchmarks/kernels/composite/compile_linear.all`](../benchmarks/kernels/composite/compile_linear.all) | `cd benchmarks/kernels/composite && bash compile_linear.all` | 6 | none | active |
+| `kernels/composite` | [`benchmarks/kernels/composite/compile_matmul.all`](../benchmarks/kernels/composite/compile_matmul.all) | `cd benchmarks/kernels/composite && bash compile_matmul.all` | 92 | none | active |
+| `kernels/composite` | [`benchmarks/kernels/composite/compile_norm.all`](../benchmarks/kernels/composite/compile_norm.all) | `cd benchmarks/kernels/composite && bash compile_norm.all` | 18 | none | active |
+| `kernels/composite` | [`benchmarks/kernels/composite/compile_softmax.all`](../benchmarks/kernels/composite/compile_softmax.all) | `cd benchmarks/kernels/composite && bash compile_softmax.all` | 12 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic_reuse.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic_reuseA.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_dynamic_reuseB.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_reuseA.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_reuseAB.all` | 96 | none | active |
+| `kernels/composite/npu_compile` | [`benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all`](../benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all) | `cd benchmarks/kernels/composite/npu_compile && bash compile_matmul_reuseB.all` | 96 | none | active |
+| `kernels/control` | [`benchmarks/kernels/control/compile.all`](../benchmarks/kernels/control/compile.all) | `cd benchmarks/kernels/control && bash compile.all` | 10 | `hashtable_lookup_simd/data_obj`; `hkv/data_obj` for data-backed cases | active |
+| `kernels/element_wise/gelu` | [`benchmarks/kernels/element_wise/gelu/compile.all`](../benchmarks/kernels/element_wise/gelu/compile.all) | `cd benchmarks/kernels/element_wise/gelu && bash compile.all` | 1 | none | active |
+| `kernels/fusion` | [`benchmarks/kernels/fusion/compile.all`](../benchmarks/kernels/fusion/compile.all) | `cd benchmarks/kernels/fusion && bash compile.all` | 14 | none | active |
+| `kernels/gemm/matmul` | [`benchmarks/kernels/gemm/matmul/compile.all`](../benchmarks/kernels/gemm/matmul/compile.all) | `cd benchmarks/kernels/gemm/matmul && bash compile.all` | 115 | none | active |
+| `kernels/memory/broadcast` | [`benchmarks/kernels/memory/broadcast/compile.all`](../benchmarks/kernels/memory/broadcast/compile.all) | `cd benchmarks/kernels/memory/broadcast && bash compile.all` | 5 | none | active |
+| `kernels/memory/broadcast_vec` | [`benchmarks/kernels/memory/broadcast_vec/compile.all`](../benchmarks/kernels/memory/broadcast_vec/compile.all) | `cd benchmarks/kernels/memory/broadcast_vec && bash compile.all` | 3 | none | active |
+| `kernels/memory/concat_gather` | [`benchmarks/kernels/memory/concat_gather/compile.all`](../benchmarks/kernels/memory/concat_gather/compile.all) | `cd benchmarks/kernels/memory/concat_gather && bash compile.all` | 1 | none | active |
+| `kernels/memory/concat_scatter` | [`benchmarks/kernels/memory/concat_scatter/compile.all`](../benchmarks/kernels/memory/concat_scatter/compile.all) | `cd benchmarks/kernels/memory/concat_scatter && bash compile.all` | 1 | none | active |
+| `kernels/memory/gather` | [`benchmarks/kernels/memory/gather/compile.all`](../benchmarks/kernels/memory/gather/compile.all) | `cd benchmarks/kernels/memory/gather && bash compile.all` | 4 | none | active |
+| `kernels/memory/transpose` | [`benchmarks/kernels/memory/transpose/compile.all`](../benchmarks/kernels/memory/transpose/compile.all) | `cd benchmarks/kernels/memory/transpose && bash compile.all` | 1 | none | active |
+| `kernels/reduction/reducemax_col` | [`benchmarks/kernels/reduction/reducemax_col/compile.all`](../benchmarks/kernels/reduction/reducemax_col/compile.all) | `cd benchmarks/kernels/reduction/reducemax_col && bash compile.all` | 1 | none | active |
+| `kernels/reduction/reducemax_row` | [`benchmarks/kernels/reduction/reducemax_row/compile.all`](../benchmarks/kernels/reduction/reducemax_row/compile.all) | `cd benchmarks/kernels/reduction/reducemax_row && bash compile.all` | 1 | none | active |
+| `kernels/reduction/reducesum_col` | [`benchmarks/kernels/reduction/reducesum_col/compile.all`](../benchmarks/kernels/reduction/reducesum_col/compile.all) | `cd benchmarks/kernels/reduction/reducesum_col && bash compile.all` | 1 | none | active |
+| `kernels/reduction/reducesum_row` | [`benchmarks/kernels/reduction/reducesum_row/compile.all`](../benchmarks/kernels/reduction/reducesum_row/compile.all) | `cd benchmarks/kernels/reduction/reducesum_row && bash compile.all` | 1 | none | active |
+| `kernels/sort` | [`benchmarks/kernels/sort/compile.all`](../benchmarks/kernels/sort/compile.all) | `cd benchmarks/kernels/sort && bash compile.all` | 1 | `topk/data_obj` | active |
+| `microbench/cube` | [`benchmarks/microbench/cube/compile.all`](../benchmarks/microbench/cube/compile.all) | `cd benchmarks/microbench/cube && bash compile.all` | 72 | none | active |
+| `microbench/lmbench` | [`benchmarks/microbench/lmbench/compile_mem.all`](../benchmarks/microbench/lmbench/compile_mem.all) | `cd benchmarks/microbench/lmbench && bash compile_mem.all` | 78 | none | active |
+| `microbench/vec` | [`benchmarks/microbench/vec/compile_lat_bw.all`](../benchmarks/microbench/vec/compile_lat_bw.all) | `cd benchmarks/microbench/vec && bash compile_lat_bw.all` | 120 | none | active |
+| `models/deepseekv3` | [`benchmarks/models/deepseekv3/compile.all`](../benchmarks/models/deepseekv3/compile.all) | `cd benchmarks/models/deepseekv3 && bash compile.all` | 47 | none | active |
+| `models/deepseekv3` | [`benchmarks/models/deepseekv3/compile_cpu.all`](../benchmarks/models/deepseekv3/compile_cpu.all) | `cd benchmarks/models/deepseekv3 && bash compile_cpu.all` | 47 | none | active |
+| `npu/cube` | [`benchmarks/npu/cube/compile.all`](../benchmarks/npu/cube/compile.all) | `cd benchmarks/npu/cube && bash compile.all` | 10 | none | active |
+| `npu/fusion` | [`benchmarks/npu/fusion/compile.all`](../benchmarks/npu/fusion/compile.all) | `cd benchmarks/npu/fusion && bash compile.all` | 71 | none | active |
+| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_2d_unroll.all`](../benchmarks/npu/fusion/compile_fusion_2d_unroll.all) | `cd benchmarks/npu/fusion && bash compile_fusion_2d_unroll.all` | 672 | none | active |
+| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_dcore.all`](../benchmarks/npu/fusion/compile_fusion_dcore.all) | `cd benchmarks/npu/fusion && bash compile_fusion_dcore.all` | 96 | none | active |
+| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_dynamic.all`](../benchmarks/npu/fusion/compile_fusion_dynamic.all) | `cd benchmarks/npu/fusion && bash compile_fusion_dynamic.all` | 15 | none | active |
+| `npu/fusion` | [`benchmarks/npu/fusion/compile_fusion_fp4.all`](../benchmarks/npu/fusion/compile_fusion_fp4.all) | `cd benchmarks/npu/fusion && bash compile_fusion_fp4.all` | 62 | none | active |
+| `npu/nddma` | [`benchmarks/npu/nddma/compile_transpose.all`](../benchmarks/npu/nddma/compile_transpose.all) | `cd benchmarks/npu/nddma && bash compile_transpose.all` | 1 | none | active |
+| `npu/vec_simd` | [`benchmarks/npu/vec_simd/compile.all`](../benchmarks/npu/vec_simd/compile.all) | `cd benchmarks/npu/vec_simd && bash compile.all` | 16 | none | active |
+| `npu/vec_simt` | [`benchmarks/npu/vec_simt/compile.all`](../benchmarks/npu/vec_simt/compile.all) | `cd benchmarks/npu/vec_simt && bash compile.all` | 3 | `hashfind/data_obj` when `TESTCASE=hashfind` | active |
+
+## Benchmark Source Entry Points
+
+| Category | Benchmark name | Source path | Build command | Required data objects | Status |
+| --- | --- | --- | --- | --- | --- |
+| `api/tileop` | `Cus_Template_ASM` | [`benchmarks/api/tileop/src/Cus_Template_ASM.cpp`](../benchmarks/api/tileop/src/Cus_Template_ASM.cpp) | `cd benchmarks/api/tileop && make TESTCASE=Cus_Template_ASM PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `MatMacc` | [`benchmarks/api/tileop/src/MatMacc.cpp`](../benchmarks/api/tileop/src/MatMacc.cpp) | `cd benchmarks/api/tileop && make TESTCASE=MatMacc PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `MatMul` | [`benchmarks/api/tileop/src/MatMul.cpp`](../benchmarks/api/tileop/src/MatMul.cpp) | `cd benchmarks/api/tileop && make TESTCASE=MatMul PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `MatMul_e4m3` | [`benchmarks/api/tileop/src/MatMul_e4m3.cpp`](../benchmarks/api/tileop/src/MatMul_e4m3.cpp) | `cd benchmarks/api/tileop && make TESTCASE=MatMul_e4m3 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `Print` | [`benchmarks/api/tileop/src/Print.cpp`](../benchmarks/api/tileop/src/Print.cpp) | `cd benchmarks/api/tileop && make TESTCASE=Print PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TAbs` | [`benchmarks/api/tileop/src/TAbs.cpp`](../benchmarks/api/tileop/src/TAbs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAbs PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TAdd` | [`benchmarks/api/tileop/src/TAdd.cpp`](../benchmarks/api/tileop/src/TAdd.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAdd PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TAdd_mask` | [`benchmarks/api/tileop/src/TAdd_mask.cpp`](../benchmarks/api/tileop/src/TAdd_mask.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAdd_mask PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TAdds` | [`benchmarks/api/tileop/src/TAdds.cpp`](../benchmarks/api/tileop/src/TAdds.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAdds PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TAnd` | [`benchmarks/api/tileop/src/TAnd.cpp`](../benchmarks/api/tileop/src/TAnd.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAnd PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TAssemble` | [`benchmarks/api/tileop/src/TAssemble.cpp`](../benchmarks/api/tileop/src/TAssemble.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TAssemble PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TCI` | [`benchmarks/api/tileop/src/TCI.cpp`](../benchmarks/api/tileop/src/TCI.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCI PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TCast` | [`benchmarks/api/tileop/src/TCast.cpp`](../benchmarks/api/tileop/src/TCast.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCast PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TCmp` | [`benchmarks/api/tileop/src/TCmp.cpp`](../benchmarks/api/tileop/src/TCmp.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCmp PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TCopy` | [`benchmarks/api/tileop/src/TCopy.cpp`](../benchmarks/api/tileop/src/TCopy.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCopy PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TLoad` | [`benchmarks/api/tileop/src/TLoad.cpp`](../benchmarks/api/tileop/src/TLoad.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TLoad PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TStore` | [`benchmarks/api/tileop/src/TStore.cpp`](../benchmarks/api/tileop/src/TStore.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TStore PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TCvt` | [`benchmarks/api/tileop/src/TCvt.cpp`](../benchmarks/api/tileop/src/TCvt.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TCvt PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TDiv` | [`benchmarks/api/tileop/src/TDiv.cpp`](../benchmarks/api/tileop/src/TDiv.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TDiv PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TDivs` | [`benchmarks/api/tileop/src/TDivs.cpp`](../benchmarks/api/tileop/src/TDivs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TDivs PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TExp` | [`benchmarks/api/tileop/src/TExp.cpp`](../benchmarks/api/tileop/src/TExp.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExp PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TExpandCol` | [`benchmarks/api/tileop/src/TExpandCol.cpp`](../benchmarks/api/tileop/src/TExpandCol.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExpandCol PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TExpandRow` | [`benchmarks/api/tileop/src/TExpandRow.cpp`](../benchmarks/api/tileop/src/TExpandRow.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExpandRow PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TExpandScalar` | [`benchmarks/api/tileop/src/TExpandScalar.cpp`](../benchmarks/api/tileop/src/TExpandScalar.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExpandScalar PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TExtract` | [`benchmarks/api/tileop/src/TExtract.cpp`](../benchmarks/api/tileop/src/TExtract.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TExtract PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TFillPad` | [`benchmarks/api/tileop/src/TFillPad.cpp`](../benchmarks/api/tileop/src/TFillPad.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TFillPad PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TGather` | [`benchmarks/api/tileop/src/TGather.cpp`](../benchmarks/api/tileop/src/TGather.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TGather PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TMax` | [`benchmarks/api/tileop/src/TMax.cpp`](../benchmarks/api/tileop/src/TMax.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMax PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TMaxs` | [`benchmarks/api/tileop/src/TMaxs.cpp`](../benchmarks/api/tileop/src/TMaxs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMaxs PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TMin` | [`benchmarks/api/tileop/src/TMin.cpp`](../benchmarks/api/tileop/src/TMin.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMin PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TMins` | [`benchmarks/api/tileop/src/TMins.cpp`](../benchmarks/api/tileop/src/TMins.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMins PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TMul` | [`benchmarks/api/tileop/src/TMul.cpp`](../benchmarks/api/tileop/src/TMul.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMul PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TMuls` | [`benchmarks/api/tileop/src/TMuls.cpp`](../benchmarks/api/tileop/src/TMuls.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TMuls PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TOr` | [`benchmarks/api/tileop/src/TOr.cpp`](../benchmarks/api/tileop/src/TOr.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TOr PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TPad` | [`benchmarks/api/tileop/src/TPad.cpp`](../benchmarks/api/tileop/src/TPad.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TPad PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRSqrt` | [`benchmarks/api/tileop/src/TRSqrt.cpp`](../benchmarks/api/tileop/src/TRSqrt.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRSqrt PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRecip` | [`benchmarks/api/tileop/src/TRecip.cpp`](../benchmarks/api/tileop/src/TRecip.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRecip PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRem` | [`benchmarks/api/tileop/src/TRem.cpp`](../benchmarks/api/tileop/src/TRem.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRem PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TReshape` | [`benchmarks/api/tileop/src/TReshape.cpp`](../benchmarks/api/tileop/src/TReshape.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TReshape PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRowMax` | [`benchmarks/api/tileop/src/TRowMax.cpp`](../benchmarks/api/tileop/src/TRowMax.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowMax PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRowMaxExpand` | [`benchmarks/api/tileop/src/TRowMaxExpand.cpp`](../benchmarks/api/tileop/src/TRowMaxExpand.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowMaxExpand PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRowSum` | [`benchmarks/api/tileop/src/TRowSum.cpp`](../benchmarks/api/tileop/src/TRowSum.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowSum PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TRowSumExpand` | [`benchmarks/api/tileop/src/TRowSumExpand.cpp`](../benchmarks/api/tileop/src/TRowSumExpand.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TRowSumExpand PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TScatter` | [`benchmarks/api/tileop/src/TScatter.cpp`](../benchmarks/api/tileop/src/TScatter.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TScatter PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TSelect` | [`benchmarks/api/tileop/src/TSelect.cpp`](../benchmarks/api/tileop/src/TSelect.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSelect PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TSqrt` | [`benchmarks/api/tileop/src/TSqrt.cpp`](../benchmarks/api/tileop/src/TSqrt.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSqrt PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TSub` | [`benchmarks/api/tileop/src/TSub.cpp`](../benchmarks/api/tileop/src/TSub.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSub PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TSubs` | [`benchmarks/api/tileop/src/TSubs.cpp`](../benchmarks/api/tileop/src/TSubs.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TSubs PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `TTrans` | [`benchmarks/api/tileop/src/TTrans.cpp`](../benchmarks/api/tileop/src/TTrans.cpp) | `cd benchmarks/api/tileop && make TESTCASE=TTrans PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `test_MatMacc` | [`benchmarks/api/tileop/src/test_MatMacc.cpp`](../benchmarks/api/tileop/src/test_MatMacc.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMacc PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `test_MatMmxac` | [`benchmarks/api/tileop/src/test_MatMmxac.cpp`](../benchmarks/api/tileop/src/test_MatMmxac.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMmxac PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `test_MatMul` | [`benchmarks/api/tileop/src/test_MatMul.cpp`](../benchmarks/api/tileop/src/test_MatMul.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMul PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `api/tileop` | `test_MatMulmx` | [`benchmarks/api/tileop/src/test_MatMulmx.cpp`](../benchmarks/api/tileop/src/test_MatMulmx.cpp) | `cd benchmarks/api/tileop && make TESTCASE=test_MatMulmx PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `flash_attention` | [`benchmarks/kernels/composite/src/flash_attention.cpp`](../benchmarks/kernels/composite/src/flash_attention.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=flash_attention PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `flash_attention_mask` | [`benchmarks/kernels/composite/src/flash_attention_mask.cpp`](../benchmarks/kernels/composite/src/flash_attention_mask.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=flash_attention_mask PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `gemm` | [`benchmarks/kernels/composite/src/gemm.cpp`](../benchmarks/kernels/composite/src/gemm.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=gemm PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `linear` | [`benchmarks/kernels/composite/src/linear.cpp`](../benchmarks/kernels/composite/src/linear.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=linear PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `matmul` | [`benchmarks/kernels/composite/src/matmul.cpp`](../benchmarks/kernels/composite/src/matmul.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=matmul PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `normalization` | [`benchmarks/kernels/composite/src/normalization.cpp`](../benchmarks/kernels/composite/src/normalization.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=normalization PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `onlinesoftmax` | [`benchmarks/kernels/composite/src/onlinesoftmax.cpp`](../benchmarks/kernels/composite/src/onlinesoftmax.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=onlinesoftmax PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/composite` | `softmax` | [`benchmarks/kernels/composite/src/softmax.cpp`](../benchmarks/kernels/composite/src/softmax.cpp) | `cd benchmarks/kernels/composite && make TESTCASE=softmax PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/control` | `hashfind` | [`benchmarks/kernels/control/hashfind/hashfind.cpp`](../benchmarks/kernels/control/hashfind/hashfind.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashfind PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/control` | `hashtable_lookup_simd` | [`benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp`](../benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashtable_lookup_simd PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | `hashtable_lookup_simd/data_obj`: `inserted_slot.o`, `lookup_keys.o`, `lookup_values.o` | active |
+| `kernels/control` | `hashtable_lookup_simt` | [`benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp`](../benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashtable_lookup_simt PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | `hashtable_lookup_simd/data_obj`: `inserted_slot.o`, `lookup_keys.o`, `lookup_values.o` | active |
+| `kernels/control` | `hashtable_lookup_simt_v2` | [`benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp`](../benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hashtable_lookup_simt_v2 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | `hashtable_lookup_simd/data_obj`: `inserted_slot.o`, `lookup_keys.o`, `lookup_values.o` | active |
+| `kernels/control` | `hkv` | [`benchmarks/kernels/control/hkv/hkv.cpp`](../benchmarks/kernels/control/hkv/hkv.cpp) | `cd benchmarks/kernels/control && make TESTCASE=hkv PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | `hkv/data_obj`: `buckets.bin.o`, `buckets_size.bin.o`, `lookup_keys.bin.o`, `lookedup_values.bin.o`, `key_score_digest.bin.o` | active |
+| `kernels/element_wise/gelu` | `gelu` | [`benchmarks/kernels/element_wise/gelu/src/gelu.cpp`](../benchmarks/kernels/element_wise/gelu/src/gelu.cpp) | `cd benchmarks/kernels/element_wise/gelu && make TESTCASE=gelu PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/fusion` | `fa_hif4` | [`benchmarks/kernels/fusion/src/fa_hif4.cpp`](../benchmarks/kernels/fusion/src/fa_hif4.cpp) | `cd benchmarks/kernels/fusion && make TESTCASE=fa_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/gemm/matmul` | `A16W4` | [`benchmarks/kernels/gemm/matmul/src/A16W4.cpp`](../benchmarks/kernels/gemm/matmul/src/A16W4.cpp) | `cd benchmarks/kernels/gemm/matmul && make TESTCASE=matmul TYPE=A16W4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/gemm/matmul` | `HiF4_HiF4` | [`benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp`](../benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp) | `cd benchmarks/kernels/gemm/matmul && make TESTCASE=matmul TYPE=HIF4_HIF4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast` | [`benchmarks/kernels/memory/broadcast/src/broadcast.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_019` | [`benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_019 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_039` | [`benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_039 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_07` | [`benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_07 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_Hunyuan` | [`benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_Hunyuan PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_mscatter` | [`benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_mscatter PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_nostore` | [`benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_nostore PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_nomg` | [`benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_nomg PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast` | `broadcast_tst` | [`benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp`](../benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp) | `cd benchmarks/kernels/memory/broadcast && make TESTCASE=broadcast_tst PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast_vec` | `broadcast_vec_019` | [`benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp`](../benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp) | `cd benchmarks/kernels/memory/broadcast_vec && make TESTCASE=broadcast_vec_019 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast_vec` | `broadcast_vec_039` | [`benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp`](../benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp) | `cd benchmarks/kernels/memory/broadcast_vec && make TESTCASE=broadcast_vec_039 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/broadcast_vec` | `broadcast_vec_07` | [`benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp`](../benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp) | `cd benchmarks/kernels/memory/broadcast_vec && make TESTCASE=broadcast_vec_07 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/concat_gather` | `concat_gather` | [`benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp`](../benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp) | `cd benchmarks/kernels/memory/concat_gather && make TESTCASE=concat_gather PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/concat_scatter` | `concat_scatter` | [`benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp`](../benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp) | `cd benchmarks/kernels/memory/concat_scatter && make TESTCASE=concat_scatter PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/gather` | `gather` | [`benchmarks/kernels/memory/gather/src/gather.cpp`](../benchmarks/kernels/memory/gather/src/gather.cpp) | `cd benchmarks/kernels/memory/gather && make TESTCASE=gather PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/memory/transpose` | `transpose` | [`benchmarks/kernels/memory/transpose/src/transpose.cpp`](../benchmarks/kernels/memory/transpose/src/transpose.cpp) | `cd benchmarks/kernels/memory/transpose && make TESTCASE=transpose PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/reduction/reducemax_col` | `reducemax_col` | [`benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp`](../benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp) | `cd benchmarks/kernels/reduction/reducemax_col && make TESTCASE=reducemax_col PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/reduction/reducemax_row` | `reducemax_row` | [`benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp`](../benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp) | `cd benchmarks/kernels/reduction/reducemax_row && make TESTCASE=reducemax_row PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/reduction/reducesum_col` | `reducesum_col` | [`benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp`](../benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp) | `cd benchmarks/kernels/reduction/reducesum_col && make TESTCASE=reducesum_col PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/reduction/reducesum_row` | `reducesum_row` | [`benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp`](../benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp) | `cd benchmarks/kernels/reduction/reducesum_row && make TESTCASE=reducesum_row PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `kernels/sort` | `topk` | [`benchmarks/kernels/sort/topk/topk.cpp`](../benchmarks/kernels/sort/topk/topk.cpp) | `cd benchmarks/kernels/sort && make TESTCASE=topk PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | `topk/data_obj`: `input_131072.o`, `top_2048_out.o` | active |
+| `microbench/cube` | `matop` | [`benchmarks/microbench/cube/src/matop.cpp`](../benchmarks/microbench/cube/src/matop.cpp) | `cd benchmarks/microbench/cube && make TESTCASE=matop PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `microbench/lmbench` | `mem` | [`benchmarks/microbench/lmbench/src/mem.cpp`](../benchmarks/microbench/lmbench/src/mem.cpp) | `cd benchmarks/microbench/lmbench && make TESTCASE=mem PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `microbench/vec` | `lat_bw` | [`benchmarks/microbench/vec/src/lat_bw.cpp`](../benchmarks/microbench/vec/src/lat_bw.cpp) | `cd benchmarks/microbench/vec && make TESTCASE=lat_bw PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `concat` | [`benchmarks/models/deepseekv3/src/concat.cpp`](../benchmarks/models/deepseekv3/src/concat.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=concat PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `expand` | [`benchmarks/models/deepseekv3/src/expand.cpp`](../benchmarks/models/deepseekv3/src/expand.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=expand PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `gate` | [`benchmarks/models/deepseekv3/src/gate.cpp`](../benchmarks/models/deepseekv3/src/gate.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=gate PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `mask` | [`benchmarks/models/deepseekv3/src/mask.cpp`](../benchmarks/models/deepseekv3/src/mask.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=mask PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `mla` | [`benchmarks/models/deepseekv3/src/mla.cpp`](../benchmarks/models/deepseekv3/src/mla.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=mla PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `mlp` | [`benchmarks/models/deepseekv3/src/mlp.cpp`](../benchmarks/models/deepseekv3/src/mlp.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=mlp PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `moe` | [`benchmarks/models/deepseekv3/src/moe.cpp`](../benchmarks/models/deepseekv3/src/moe.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=moe PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `permute` | [`benchmarks/models/deepseekv3/src/permute.cpp`](../benchmarks/models/deepseekv3/src/permute.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=permute PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `projection` | [`benchmarks/models/deepseekv3/src/projection.cpp`](../benchmarks/models/deepseekv3/src/projection.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=projection PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `rmsnorm` | [`benchmarks/models/deepseekv3/src/rmsnorm.cpp`](../benchmarks/models/deepseekv3/src/rmsnorm.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=rmsnorm PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `rope` | [`benchmarks/models/deepseekv3/src/rope.cpp`](../benchmarks/models/deepseekv3/src/rope.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=rope PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `split` | [`benchmarks/models/deepseekv3/src/split.cpp`](../benchmarks/models/deepseekv3/src/split.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=split PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `topk` | [`benchmarks/models/deepseekv3/src/topk.cpp`](../benchmarks/models/deepseekv3/src/topk.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=topk PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `models/deepseekv3` | `transformer` | [`benchmarks/models/deepseekv3/src/transformer.cpp`](../benchmarks/models/deepseekv3/src/transformer.cpp) | `cd benchmarks/models/deepseekv3 && make TESTCASE=transformer PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `LLAMA3_70B_attn_matmul_decode_bs_192` | [`benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp`](../benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp) | `cd benchmarks/npu/cube && make TESTCASE=LLAMA3_70B_attn_matmul_decode_bs_192 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `LLAMA3_70B_ffn_matmul_3_decode_bs_192` | [`benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp`](../benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp) | `cd benchmarks/npu/cube && make TESTCASE=LLAMA3_70B_ffn_matmul_3_decode_bs_192 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `QuantBatchMatmulV3_292_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_292_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `QuantBatchMatmulV3_293_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_293_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `QuantBatchMatmulV3_294_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_294_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `QuantBatchMatmulV3_295_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_295_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `QuantBatchMatmulV3_296_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_296_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `QuantBatchMatmulV3_297_hif4` | [`benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp`](../benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=QuantBatchMatmulV3_297_hif4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `dsv3_q_up_proj_mxfp8` | [`benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp`](../benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp) | `cd benchmarks/npu/cube && make TESTCASE=dsv3_q_up_proj_mxfp8 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `llama3_70b_w8_bs_1_case_4` | [`benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp`](../benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama3_70b_w8_bs_1_case_4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `llama_train_mm_2_A16W4` | [`benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp`](../benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama_train_mm_2_A16W4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `llama_train_mm_2_A16W8` | [`benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp`](../benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama_train_mm_2_A16W8 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `llama_train_mm_2_mxfp8_mxfp4` | [`benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp`](../benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llama_train_mm_2_mxfp8_mxfp4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `llava1_6_6` | [`benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp`](../benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp) | `cd benchmarks/npu/cube && make TESTCASE=llava1_6_6 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `mat_mul_o1_align_0001` | [`benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp`](../benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp) | `cd benchmarks/npu/cube && make TESTCASE=mat_mul_o1_align_0001 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `matmul_1_bs16_fp8_GB_test` | [`benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp`](../benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp) | `cd benchmarks/npu/cube && make TESTCASE=matmul_1_bs16_fp8_GB_test PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf` | [`benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp`](../benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp) | `cd benchmarks/npu/cube && make TESTCASE=model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `moe_w1w3_bs16_fp8_GB_DN_nbuf` | [`benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp`](../benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp) | `cd benchmarks/npu/cube && make TESTCASE=moe_w1w3_bs16_fp8_GB_DN_nbuf PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022` | [`benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp`](../benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp) | `cd benchmarks/npu/cube && make TESTCASE=mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16` | [`benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp`](../benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp) | `cd benchmarks/npu/cube && make TESTCASE=mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `xinghuo_13b_tp8_matmul_01_A16W8` | [`benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp`](../benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp) | `cd benchmarks/npu/cube && make TESTCASE=xinghuo_13b_tp8_matmul_01_A16W8 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `xinghuo_13b_tp8_matmul_01_mxfp8_modified` | [`benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp`](../benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp) | `cd benchmarks/npu/cube && make TESTCASE=xinghuo_13b_tp8_matmul_01_mxfp8_modified PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/cube` | `xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4` | [`benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp`](../benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp) | `cd benchmarks/npu/cube && make TESTCASE=xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa1` | [`benchmarks/npu/fusion/fa1/fa1.cpp`](../benchmarks/npu/fusion/fa1/fa1.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa1 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa10` | [`benchmarks/npu/fusion/fa10/fa10.cpp`](../benchmarks/npu/fusion/fa10/fa10.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa10 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa11` | [`benchmarks/npu/fusion/fa11/fa11.cpp`](../benchmarks/npu/fusion/fa11/fa11.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa11 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa2` | [`benchmarks/npu/fusion/fa2/fa2.cpp`](../benchmarks/npu/fusion/fa2/fa2.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa2 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa3` | [`benchmarks/npu/fusion/fa3/fa3.cpp`](../benchmarks/npu/fusion/fa3/fa3.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa3 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa4` | [`benchmarks/npu/fusion/fa4/fa4.cpp`](../benchmarks/npu/fusion/fa4/fa4.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa5` | [`benchmarks/npu/fusion/fa5/fa5.cpp`](../benchmarks/npu/fusion/fa5/fa5.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa5 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa6` | [`benchmarks/npu/fusion/fa6/fa6.cpp`](../benchmarks/npu/fusion/fa6/fa6.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa6 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa7` | [`benchmarks/npu/fusion/fa7/fa7.cpp`](../benchmarks/npu/fusion/fa7/fa7.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa7 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa8` | [`benchmarks/npu/fusion/fa8/fa8.cpp`](../benchmarks/npu/fusion/fa8/fa8.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa8 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa9` | [`benchmarks/npu/fusion/fa9/fa9.cpp`](../benchmarks/npu/fusion/fa9/fa9.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa9 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `fa_fp4` | [`benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp`](../benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=fa_fp4 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/fusion` | `flashmla13` | [`benchmarks/npu/fusion/flashmla13/flashmla13.cpp`](../benchmarks/npu/fusion/flashmla13/flashmla13.cpp) | `cd benchmarks/npu/fusion && make TESTCASE=flashmla13 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/nddma` | `transpose_053_mgather` | [`benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp`](../benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp) | `cd benchmarks/npu/nddma && make TESTCASE=transpose_053_mgather PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/nddma` | `transpose_053_tload` | [`benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp`](../benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp) | `cd benchmarks/npu/nddma && make TESTCASE=transpose_053_tload PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `Add_ND_bfloat16_float32_DeepSeek_V3_000028` | [`benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp`](../benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=Add_ND_bfloat16_float32_DeepSeek_V3_000028 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic` | [`benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp`](../benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic` | [`benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp`](../benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV` | [`benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp`](../benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `gemm_18x128x256` | [`benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp`](../benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=gemm_18x128x256 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `layernorm_vcadd_vaddx3_12288_fp16` | [`benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp`](../benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=layernorm_vcadd_vaddx3_12288_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV` | [`benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp`](../benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `rmsnorm_reduce_1_16384_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_1_16384_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `rmsnorm_reduce_2_8192_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_2_8192_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `rmsnorm_reduce_4_4096_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_4_4096_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `rmsnorm_reduce_4_5120_fp16` | [`benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp`](../benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rmsnorm_reduce_4_5120_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `rope_32_40_1_64_bf16` | [`benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp`](../benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=rope_32_40_1_64_bf16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `softmax_8_34_fp16` | [`benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp`](../benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_8_34_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `softmax_LLM_2` | [`benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp`](../benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_LLM_2 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `softmax_vaddx3_vcadd_1_4096_bf16` | [`benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp`](../benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_vaddx3_vcadd_1_4096_bf16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `softmax_vaddx3_vcadd_1_4096_fp16` | [`benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp`](../benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=softmax_vaddx3_vcadd_1_4096_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simd` | `swiglu_64_1024_fp16` | [`benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp`](../benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp) | `cd benchmarks/npu/vec_simd && make TESTCASE=swiglu_64_1024_fp16 PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simt` | `npu_hashtable_insert_cmp_host` | [`benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp`](../benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp) | `cd benchmarks/npu/vec_simt && make TESTCASE=npu_hashtable_insert_cmp_host PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simt` | `npu_hashtable_lookup_cmp_host` | [`benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp`](../benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp) | `cd benchmarks/npu/vec_simt && make TESTCASE=npu_hashtable_lookup_cmp_host PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | none | active |
+| `npu/vec_simt` | `hashfind` | [`benchmarks/npu/vec_simt/hashfind/hashfind.cpp`](../benchmarks/npu/vec_simt/hashfind/hashfind.cpp) | `cd benchmarks/npu/vec_simt && make TESTCASE=hashfind PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` | `hashfind/data_obj`: `simple_inserted_slot.o`, `simple_lookup_keys.o`, `simple_lookup_values.o` | active |
+
+## Archived And Outdated Surfaces
+
+| Category | Source path | Replacement | Required data objects | Status |
+| --- | --- | --- | --- | --- |
+| `legacy/api/tileop` | [`archive/outdated/tests/other/tileop_api`](../archive/outdated/tests/other/tileop_api) | [`benchmarks/api/tileop`](../benchmarks/api/tileop) | none | archive/outdated |
+| `legacy/api/python` | [`archive/outdated/tests/other/py_api`](../archive/outdated/tests/other/py_api) | [`tests/py_api`](../tests/py_api) | none | archive/outdated |
+| `legacy/npu/v220` | [`archive/outdated/tests/accelerator/v220`](../archive/outdated/tests/accelerator/v220) | [`benchmarks/npu`](../benchmarks/npu) | none | archive/outdated |
+| `legacy/npu/v310` | [`archive/outdated/tests/accelerator/v310`](../archive/outdated/tests/accelerator/v310) | [`benchmarks/npu`](../benchmarks/npu) | none | archive/outdated |
diff --git a/benchmarks/README.md b/benchmarks/README.md
new file mode 100644
index 0000000..4326e76
--- /dev/null
+++ b/benchmarks/README.md
@@ -0,0 +1,44 @@
+# Benchmarks
+
+This is the primary navigation surface for active SuperNPUBench benchmark sources. These suites are intended to build through the shared make harness with `PLAT=linx COMPILER_DIR=<linx-isa-llvm-bin>` unless a local `compile*.all` file explicitly selects another platform for comparison.
+
+## Layout
+
+| Path | Purpose |
+| --- | --- |
+| [`common`](common) | Shared make harness and benchmark-local utility headers. |
+| [`api/tileop`](api/tileop) | Focused TileOP API operation benchmarks. |
+| [`npu/cube`](npu/cube) | Cube/matmul NPU benchmark cases. |
+| [`npu/fusion`](npu/fusion) | Flash-attention and fusion NPU cases. |
+| [`npu/nddma`](npu/nddma) | NDDMA transpose cases. |
+| [`npu/vec_simd`](npu/vec_simd) | Vector SIMD NPU cases. |
+| [`npu/vec_simt`](npu/vec_simt) | Vector SIMT NPU cases, including embedded data-object cases. |
+| [`kernels/control`](kernels/control) | Control-flow/hash-table kernels. |
+| [`kernels/element_wise`](kernels/element_wise) | Element-wise kernels. |
+| [`kernels/gemm`](kernels/gemm) | GEMM/matmul kernels. |
+| [`kernels/fusion`](kernels/fusion) | Kernel-level fusion cases. |
+| [`kernels/memory`](kernels/memory) | Broadcast, gather, scatter, concat, and transpose memory kernels. |
+| [`kernels/reduction`](kernels/reduction) | Row/column reduction kernels. |
+| [`kernels/sort`](kernels/sort) | Sort/top-k kernels with embedded data-object support. |
+| [`kernels/composite`](kernels/composite) | Composite kernels formerly grouped under `orther`. |
+| [`models/deepseekv3`](models/deepseekv3) | DeepSeekV3 model-level benchmark kernels. |
+| [`microbench`](microbench) | Cube, memory, and vector microbenchmark suites. |
+| [`scripts`](scripts) | Batch and recursive helper scripts. |
+
+## Build Pattern
+
+Run local make commands from the suite directory:
+
+```sh
+cd benchmarks/api/tileop
+make TESTCASE=TAdd PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin
+```
+
+Run a suite's batch script from the directory that contains it:
+
+```sh
+cd benchmarks/kernels/gemm/matmul
+bash compile.all
+```
+
+For the complete source and build catalog, use [`INDEX.md`](INDEX.md).
diff --git a/test/tileop_api/Makefile b/benchmarks/api/tileop/Makefile
similarity index 78%
rename from test/tileop_api/Makefile
rename to benchmarks/api/tileop/Makefile
index 8d51541..f5085ab 100644
--- a/test/tileop_api/Makefile
+++ b/benchmarks/api/tileop/Makefile
@@ -3,4 +3,4 @@ TARGET = $(ELF_HEAD)_$(TESTCASE)_$(PLAT).elf
 SRC_FILE +=  $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp
 endif
 
-include ../common/Makefile.common
\ No newline at end of file
+include ../../common/Makefile.common
diff --git a/test/tileop_api/compile.all b/benchmarks/api/tileop/compile.all
similarity index 94%
rename from test/tileop_api/compile.all
rename to benchmarks/api/tileop/compile.all
index a3f4706..fccb53e 100755
--- a/test/tileop_api/compile.all
+++ b/benchmarks/api/tileop/compile.all
@@ -11,8 +11,8 @@ make clean;make TESTCASE=TAnd
 make clean;make TESTCASE=TCI
 make clean;make TESTCASE=TCmp
 make clean;make TESTCASE=TCopy
-make clean;make TESTCASE=TCopyIn
-make clean;make TESTCASE=TCopyOut
+make clean;make TESTCASE=TLoad
+make clean;make TESTCASE=TStore
 make clean;make TESTCASE=TCvt
 make clean;make TESTCASE=TDiv
 make clean;make TESTCASE=TDivs
diff --git a/test/other/tileop_api/data.hpp b/benchmarks/api/tileop/data.hpp
similarity index 78%
rename from test/other/tileop_api/data.hpp
rename to benchmarks/api/tileop/data.hpp
index 394ff52..afca271 100644
--- a/test/other/tileop_api/data.hpp
+++ b/benchmarks/api/tileop/data.hpp
@@ -1,16 +1,34 @@
 #ifndef DATA_H
 #define DATA_H
 
+#ifdef __linx
+#include <stddef.h>
+#include <stdint.h>
+extern "C" void exit(int);
+extern "C" void free(void *);
+extern "C" void *malloc(size_t);
+extern "C" int printf(const char *, ...);
+#else
 #include <iostream>
 #include <cmath>
-#include <common/type.hpp>
-
+#endif
+#include "common/type.hpp"
+
+#ifdef __linx
+static constexpr float s_fp32 = 0.1f;
+static constexpr __half s_fp16 = __half(0.0f);
+static constexpr int8_t s_i8 = 1;
+static constexpr int16_t s_i16 = 1;
+static constexpr int32_t s_i32 = 1;
+static constexpr int64_t s_i64 = 1;
+#else
 float s_fp32 = 0.1;
 __half s_fp16 = 0.1;
 int8_t s_i8 = 1;
 int16_t s_i16 = 1;
 int32_t s_i32 = 1;
 int64_t s_i64 = 1;
+#endif
 
 template <typename T> void init_src_uint(T *aar, uint16_t size) {
   for (uint16_t i = 0; i < size; i++) {
@@ -23,10 +41,26 @@ template <typename T> void init_src_int(T *aar, uint16_t size) {
     aar[i] = -(i + 1);
   }
 }
+inline void init_src_int8(int8_t *aar, uint16_t size) {
+  for (uint16_t i = 0; i < size; i++) {
+    uint16_t val = i % 256;
+    if (val != 128) {
+      aar[i] = val - 128;
+    } else {
+      aar[i] = -128;
+    }
+  }
+}
 
 template <typename T> void init_src_fp(T *aar, uint16_t size) {
   for (uint16_t i = 0; i < size; i++) {
+#ifdef __linx
+    const float x = (i + 1) / 100.0f;
+    const float x2 = x * x;
+    aar[i] = x * (1.0f - x2 / 6.0f + (x2 * x2) / 120.0f);
+#else
     aar[i] = sin((i + 1) / 100.0f);
+#endif
   }
 }
 
@@ -36,6 +70,12 @@ template <typename T> void init_dst(T *aar, uint16_t size) {
   }
 }
 
+template <typename T> void init_dst_no_zero(T *aar, uint16_t size) {
+  for (uint16_t i = 0; i < size; i++) {
+    aar[i] = 1.0;
+  }
+}
+
 template <typename T> void init_index(T *aar, uint16_t row, uint16_t col) {
   for (uint16_t i = 0; i < row; ++i) {
     for (uint16_t j = 0; j < col; ++j) {
@@ -56,11 +96,46 @@ template <typename T> void init_01(T *aar, uint16_t row, uint16_t col) {
   }
 }
 
+template <typename T> void init_rows_fp(T *aar, uint16_t row, uint16_t col) {
+  for (uint16_t i = 0; i < row; ++i) {
+    for (uint16_t j = 0; j < col; ++j) {
+      aar[i * col + j] = (i * col + j) / 100.0f;
+    }
+  }
+}
+
 template <typename T> void OutArray(const T *aar, size_t size) {
+#ifdef __linx
+  (void)aar;
+  (void)size;
+#else
   for (uint16_t i = 0; i < size; i++) {
     std::cout << aar[i] << " ";
   }
   std::cout << std::endl;
+#endif
+}
+inline void OutArray(const int8_t *aar, size_t size) {
+#ifdef __linx
+  (void)aar;
+  (void)size;
+#else
+  for (size_t i = 0; i < size; i++) {
+    std::cout << static_cast<int32_t>(aar[i]) << " ";
+  }
+  std::cout << std::endl;
+#endif
+}
+inline void OutArray(const __half *aar, size_t size) {
+#ifdef __linx
+  (void)aar;
+  (void)size;
+#else
+  for (size_t i = 0; i < size; i++) {
+    std::cout << static_cast<__fp16>(aar[i]) << " ";
+  }
+  std::cout << std::endl;
+#endif
 }
 
 // check memory allocation
@@ -145,4 +220,4 @@ template <typename T> void check_mem_alloc(const T *p) {
   free(d2);                                                                    \
   free(d3);
 
-#endif
\ No newline at end of file
+#endif
diff --git a/test/common/linxStartEnd.hpp b/benchmarks/api/tileop/linxStartEnd.hpp
similarity index 100%
rename from test/common/linxStartEnd.hpp
rename to benchmarks/api/tileop/linxStartEnd.hpp
diff --git a/test/tileop_api/src/Cus_Template_ASM.cpp b/benchmarks/api/tileop/src/Cus_Template_ASM.cpp
similarity index 95%
rename from test/tileop_api/src/Cus_Template_ASM.cpp
rename to benchmarks/api/tileop/src/Cus_Template_ASM.cpp
index 418e89b..5491da1 100644
--- a/test/tileop_api/src/Cus_Template_ASM.cpp
+++ b/benchmarks/api/tileop/src/Cus_Template_ASM.cpp
@@ -8,7 +8,7 @@
 
 #ifdef ENABLE_TENSOR_INSTR
 template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
-void TCOPYIN_ASM(tile_shape &dst, gm_shape &src) {
+void TLOAD_ASM(tile_shape &dst, gm_shape &src) {
 
   asm volatile(
     "BSTART.PAR 33, %c1\n"
@@ -37,9 +37,9 @@ void test_Nz(T *dst) {
   gm_shape g(dst);
   tile_shape t;
 #ifdef ENABLE_TENSOR_INSTR
-  TCOPYIN_ASM(t, g);
+  TLOAD_ASM(t, g);
 #else
-  TCOPYIN(t, g);
+  TLOAD(t, g);
 #endif
   print_tile(t);
 }
diff --git a/test/other/tileop_api/src/MatMacc.cpp b/benchmarks/api/tileop/src/MatMacc.cpp
similarity index 54%
rename from test/other/tileop_api/src/MatMacc.cpp
rename to benchmarks/api/tileop/src/MatMacc.cpp
index afb4a37..2c665a2 100644
--- a/test/other/tileop_api/src/MatMacc.cpp
+++ b/benchmarks/api/tileop/src/MatMacc.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
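+  // Spin with a compiler barrier so the side-effect-free loop is not optimized away.
+  while (1) { __asm__ volatile("" ::: "memory"); }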
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
@@ -12,8 +63,8 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape_C = global_tensor<T, RowMajor<M, N>>;
 
   using tile_shape_A = Tile<Location::Vec, T, M, K, BLayout::RowMajor>;
-  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
-  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
 
   gm_shape_A s0(src0);
   gm_shape_B s1(src1);
@@ -23,10 +74,11 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, res);
   MATMACC(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
@@ -47,24 +99,74 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, res);
   MATMACC(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
-  const uint16_t M = 64;
-  const uint16_t K = 32;
-  const uint16_t N = 64;
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+#else
+  const uint16_t M = 16;
+  const uint16_t K = 8;
+  const uint16_t N = 32;
+#endif
+
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+#ifdef __linx
+  static int64_t dst_rm[size_C];
+  static int64_t src0_rm[size_A];
+  static int64_t src1_rm[size_B];
+  static int64_t base_rm[size_C];
 
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t k = 0; k < K; ++k) {
+      const int64_t value = static_cast<int64_t>((row + 1) * (k + 2));
+      src0_rm[row * K + k] = value;
+    }
+  }
+  for (size_t k = 0; k < K; ++k) {
+    for (size_t col = 0; col < N; ++col) {
+      const int64_t value = static_cast<int64_t>((k + 1) + (col + 1));
+      src1_rm[k * N + col] = value;
+    }
+  }
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      const int64_t value = static_cast<int64_t>(10 + row * N + col);
+      dst_rm[row * N + col] = value;
+      base_rm[row * N + col] = value;
+    }
+  }
+
+  test_RowMajor<M, N, K, int64_t>(dst_rm, src0_rm, src1_rm);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = base_rm[row * N + col];
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_rm[row * K + k] * src1_rm[k * N + col];
+      }
+      if (dst_rm[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
 
   float *dst = (float *)malloc(size_C * sizeof(float));
   check_mem_alloc(dst);
-  init_src_fp(dst, size_C);
+  init_dst_no_zero(dst, size_C);
 
   float *src0 = (float *)malloc(size_A * sizeof(float));
   check_mem_alloc(src0);
@@ -75,7 +177,7 @@ int main() {
 
   __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half));
   check_mem_alloc(dst_f16);
-  init_src_fp(dst_f16, size_C); 
+  init_dst_no_zero(dst_f16, size_C);
 
   __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half));
   check_mem_alloc(src0_f16);
@@ -83,44 +185,44 @@ int main() {
   __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, size_B);
- 
+
   int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t));
   check_mem_alloc(dst_i8);
-  init_src_fp(dst_i8, size_C);
- 
+  init_dst_no_zero(dst_i8, size_C);
+
   int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, size_A);
   int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, size_B);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t));
   check_mem_alloc(dst_i16);
-  init_src_fp(dst_i16, size_C);
- 
+  init_dst_no_zero(dst_i16, size_C);
+
   int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, size_A);
   int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, size_B);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t));
   check_mem_alloc(dst_i32);
-  init_src_fp(dst_i32, size_C);
- 
+  init_dst_no_zero(dst_i32, size_C);
+
   int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, size_A);
   int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, size_B);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t));
   check_mem_alloc(dst_i64);
-  init_src_fp(dst_i64, size_C);
- 
+  init_dst_no_zero(dst_i64, size_C);
+
   int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, size_A);
@@ -132,13 +234,13 @@ int main() {
   PMC_START();
 #endif
 
-  test_RowMajor<M, N, K, float>(dst, src0, src1);
- 
-  test_RowMajor<M, N, K, __half>(dst_f16, src0_f16, src1_f16);
+  //test_RowMajor<M, N, K, float>(dst, src0, src1);
+
+  //test_RowMajor<M, N, K, __half>(dst_f16, src0_f16, src1_f16);
 
-  test_RowMajor<M, N, K, int8_t>(dst_i8, src0_i8, src1_i8);
+  //test_RowMajor<M, N, K, int8_t>(dst_i8, src0_i8, src1_i8);
 
-  test_RowMajor<M, N, K, int16_t>(dst_i16, src0_i16, src1_i16);
+  //test_RowMajor<M, N, K, int16_t>(dst_i16, src0_i16, src1_i16);
 
   test_RowMajor<M, N, K, int32_t>(dst_i32, src0_i32, src1_i32);
 
@@ -150,35 +252,36 @@ int main() {
 
   printf("Result:\n");
   OutArray(dst, size_C);
-  //OutArray(dst_f16, size_C);
+  OutArray(dst_f16, size_C);
   OutArray(dst_i8, size_C);
   OutArray(dst_i16, size_C);
   OutArray(dst_i32, size_C);
   OutArray(dst_i64, size_C);
- 
+
   free(dst);
   free(src0);
   free(src1);
- 
+
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/other/tileop_api/src/MatMul.cpp b/benchmarks/api/tileop/src/MatMul.cpp
similarity index 69%
rename from test/other/tileop_api/src/MatMul.cpp
rename to benchmarks/api/tileop/src/MatMul.cpp
index cfaec47..99d8a33 100644
--- a/test/other/tileop_api/src/MatMul.cpp
+++ b/benchmarks/api/tileop/src/MatMul.cpp
@@ -5,6 +5,48 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  auto *d = static_cast<unsigned char *>(dst);
+  const auto *s = static_cast<const unsigned char *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+    __asm__ volatile("" ::: "memory");
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
@@ -12,8 +54,8 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape_C = global_tensor<T, RowMajor<M, N>>;
 
   using tile_shape_A = Tile<Location::Vec, T, M, K, BLayout::RowMajor>;
-  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
-  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
 
   gm_shape_A s0(src0);
   gm_shape_B s1(src1);
@@ -23,10 +65,10 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   MATMUL(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 template <uint16_t M, uint16_t N, uint16_t K, typename T>
@@ -47,16 +89,48 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
   tile_shape_B d1;
   tile_shape_C d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   MATMUL(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
-  const uint16_t M = 64;
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  init_dst(dst_i64, size_C);
+  init_src_int(src0_i64, size_A);
+  init_src_int(src1_i64, size_B);
+
+  test_RowMajor<M, N, K, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 16;
   const uint16_t K = 32;
-  const uint16_t N = 64;
+  const uint16_t N = 32;
 
   size_t size_A = M * K;
   size_t size_B = K * N;
@@ -75,7 +149,7 @@ int main() {
 
   __half *dst_f16 = (__half *)malloc(size_C * sizeof(__half));
   check_mem_alloc(dst_f16);
-  init_dst(dst_f16, size_C); 
+  init_dst(dst_f16, size_C);
 
   __half *src0_f16 = (__half *)malloc(size_A * sizeof(__half));
   check_mem_alloc(src0_f16);
@@ -83,44 +157,44 @@ int main() {
   __half *src1_f16 = (__half *)malloc(size_B * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, size_B);
- 
+
   int8_t *dst_i8 = (int8_t *)malloc(size_C * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, size_C);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(size_A * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, size_A);
   int8_t *src1_i8 = (int8_t *)malloc(size_B * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, size_B);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(size_C * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, size_C);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(size_A * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, size_A);
   int16_t *src1_i16 = (int16_t *)malloc(size_B * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, size_B);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(size_C * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, size_C);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(size_A * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, size_A);
   int32_t *src1_i32 = (int32_t *)malloc(size_B * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, size_B);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(size_C * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, size_C);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(size_A * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, size_A);
@@ -133,7 +207,7 @@ int main() {
 #endif
 
   test_RowMajor<M, N, K, float>(dst, src0, src1);
- 
+
   test_RowMajor<M, N, K, __half>(dst_f16, src0_f16, src1_f16);
 
   test_RowMajor<M, N, K, int8_t>(dst_i8, src0_i8, src1_i8);
@@ -150,35 +224,36 @@ int main() {
 
   printf("Result:\n");
   OutArray(dst, size_C);
-  //OutArray(dst_f16, size_C);
+  OutArray(dst_f16, size_C);
   OutArray(dst_i8, size_C);
   OutArray(dst_i16, size_C);
   OutArray(dst_i32, size_C);
   OutArray(dst_i64, size_C);
- 
+
   free(dst);
   free(src0);
   free(src1);
- 
+
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/benchmarks/api/tileop/src/MatMul_e4m3.cpp b/benchmarks/api/tileop/src/MatMul_e4m3.cpp
new file mode 100644
index 0000000..f81d692
--- /dev/null
+++ b/benchmarks/api/tileop/src/MatMul_e4m3.cpp
@@ -0,0 +1,188 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile unsigned char *d = static_cast<volatile unsigned char *>(dst);
+  const volatile unsigned char *s = static_cast<const volatile unsigned char *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+    __asm__ volatile("" ::: "memory");
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(int64_t *dst, int64_t *src0, int64_t *src1) {
+  using gm_shape_A = global_tensor<int64_t, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<int64_t, RowMajor<K, N>>;
+  using gm_shape_C = global_tensor<int64_t, RowMajor<M, N>>;
+
+  using tile_shape_A = Tile<Location::Vec, int64_t, M, K, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, int64_t, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, int64_t, M, N, BLayout::RowMajor>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  TSTORE(res, d2);
+}
+#else
+template <typename TA, typename TB>
+void __vec__ test_cvt(typename TA::TileDType __out__ a,
+                      typename TB::TileDType __in__ b) {
+  using AType = typename TA::DType;
+  using BType = typename TB::DType;
+  __vbuf__ BType *pb = blkv_get_tile_ptr(b);
+  __vbuf__ AType *pa = blkv_get_tile_ptr(a);
+  int x = blkv_get_index_x();
+  int y = blkv_get_index_y();
+  int idx = index<TA>(y, x);
+  AType o = (AType)(pb[idx]);
+  pa[idx] = o;
+}
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(float *dst, float *src0, float *src1) {
+  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
+  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
+
+  using tile_shape_A = TileLeft<float, M, K>;
+  using tile_shape_B = TileRight<float, K, N>;
+  using tile_shape_C = TileAcc<float, M, N>;
+  using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>;
+  using tile_shape_LB = TileRight<__fp8_e4m3, K, N>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+  tile_shape_LA lda;
+  tile_shape_LB ldb;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  test_cvt<tile_shape_LA, tile_shape_A><<<M, K, 1>>>(lda.data(), d0.data());
+  test_cvt<tile_shape_LB, tile_shape_B><<<K, N, 1>>>(ldb.data(), d1.data());
+  MATMUL(d2, lda, ldb);
+  TSTORE(res, d2);
+}
+#endif
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  init_dst(dst_i64, size_C);
+  init_src_int(src0_i64, size_A);
+  init_src_int(src1_i64, size_B);
+
+  test<M, N, K>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 64;
+  const uint16_t K = 32;
+  const uint16_t N = 128;
+
+  size_t size_A = M * K;
+  size_t size_B = K * N;
+  size_t size_C = M * N;
+
+  float *dst = (float *)malloc(size_C * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_C);
+
+  float *src0 = (float *)malloc(size_A * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, size_A);
+  float *src1 = (float *)malloc(size_B * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_B);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<M, N, K>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_C);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/Print.cpp b/benchmarks/api/tileop/src/Print.cpp
similarity index 96%
rename from test/tileop_api/src/Print.cpp
rename to benchmarks/api/tileop/src/Print.cpp
index 7f6dba2..56c56fb 100644
--- a/test/tileop_api/src/Print.cpp
+++ b/benchmarks/api/tileop/src/Print.cpp
@@ -25,12 +25,12 @@ void test_ACC(float *dst, float *src0, float *src1) {
   tile_shape_C d2;
   tile_shape_O d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   MATMUL(d2, d0, d1);
   TCVT(d3, d2);
   print_tile(d3);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
@@ -40,7 +40,7 @@ void test_Nz(T *dst) {
   using tile_shape = TileLeft<T, tile_row, tile_col, 16, 16>;
   gm_shape g(dst);
   tile_shape t;
-  TCOPYIN(t, g);
+  TLOAD(t, g);
   print_tile(t);
 }
 
@@ -51,7 +51,7 @@ void test_Zn(T *dst) {
   using tile_shape = TileRight<T, tile_row, tile_col, 16, 16>;
   gm_shape g(dst);
   tile_shape t;
-  TCOPYIN(t, g);
+  TLOAD(t, g);
   print_tile(t);
 }
 
@@ -62,10 +62,10 @@ void test_RowMajor(T *dst) {
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
   gm_shape g(dst);
   tile_shape t;
-  TCOPYIN(t, g);
+  TLOAD(t, g);
   print_tile(t);
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst) {
@@ -73,7 +73,7 @@ void test_ColMajor(T *dst) {
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
   gm_shape g(dst);
   tile_shape t;
-  TCOPYIN(t, g);
+  TLOAD(t, g);
   print_tile(t);
 }
 
diff --git a/benchmarks/api/tileop/src/TAbs.cpp b/benchmarks/api/tileop/src/TAbs.cpp
new file mode 100644
index 0000000..e580599
--- /dev/null
+++ b/benchmarks/api/tileop/src/TAbs.cpp
@@ -0,0 +1,155 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1)
+    __asm__ volatile("" ::: "memory");
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_RowMajor(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_row; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TABS(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_ColMajor(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_col; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_row; ++j) {
+      int offset = i * (tile_col * gm_row) + j * tile_row;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TABS(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 16;
+  constexpr uint16_t tile_col = 16;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
+  float *dst_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst_col);
+  init_dst(dst_col, gm_size);
+
+  float *src0_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0_col);
+  init_src_fp(src0_col, gm_size);
+
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src0_f16);
+  init_src_fp(src0_f16, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst_col, src0_col);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_col, gm_size);
+  OutArray(dst_f16, gm_size);
+
+  free(dst_col);
+  free(src0_col);
+
+  free(dst_f16);
+  free(src0_f16);
+
+  return 0;
+#endif
+}
diff --git a/test/other/tileop_api/src/TAdd.cpp b/benchmarks/api/tileop/src/TAdd.cpp
similarity index 66%
rename from test/other/tileop_api/src/TAdd.cpp
rename to benchmarks/api/tileop/src/TAdd.cpp
index 3da9aea..7fc20ed 100644
--- a/test/other/tileop_api/src/TAdd.cpp
+++ b/benchmarks/api/tileop/src/TAdd.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1)
+    __asm__ volatile("" ::: "memory");
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
- 
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TADD(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TADD(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 128;
-  const uint16_t gm_col = 128;
-  const uint16_t tile_row = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  const uint16_t gm_row = 64;
+  const uint16_t gm_col = 32;
+  const uint16_t tile_row = 64;
   const uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_i64[gm_size];
+  static int64_t src0_i64[gm_size];
+  static int64_t src1_i64[gm_size];
+  init_dst(dst_i64, gm_size);
+  init_src_int(src0_i64, gm_size);
+  init_src_int(src1_i64, gm_size);
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
 
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -77,54 +129,67 @@ int main() {
   check_mem_alloc(src1);
   init_src_fp(src1, gm_size);
 
+  float *dst_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst_col);
+  init_dst(dst_col, gm_size);
+
+  float *src0_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0_col);
+  init_src_fp(src0_col, gm_size);
+  float *src1_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src1_col);
+  init_src_fp(src1_col, gm_size);
+
+#ifndef __linx
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
   __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, gm_size);
- 
+#endif
+
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
   int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
   int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -136,16 +201,18 @@ int main() {
   PMC_START();
 #endif
 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0, src1);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16, src1_f16);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8, src1_i8);
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst_col, src0_col, src1_col);
+
+#ifndef __linx
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16, src1_f16);
+#endif
+
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8, src1_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16, src1_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32, src1_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
 
 #ifdef LINX_PMC
@@ -154,35 +221,45 @@ int main() {
 
   printf("Result:\n");
   OutArray(dst, gm_size);
-  //OutArray(dst_f16, gm_size);
+  OutArray(dst_col, gm_size);
+#ifndef __linx
+  OutArray(dst_f16, gm_size);
+#endif
   OutArray(dst_i8, gm_size);
   OutArray(dst_i16, gm_size);
   OutArray(dst_i32, gm_size);
   OutArray(dst_i64, gm_size);
- 
+
   free(dst);
   free(src0);
   free(src1);
- 
+
+  free(dst_col);
+  free(src0_col);
+  free(src1_col);
+
+#ifndef __linx
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+#endif
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TAdd_mask.cpp b/benchmarks/api/tileop/src/TAdd_mask.cpp
new file mode 100644
index 0000000..a1770b2
--- /dev/null
+++ b/benchmarks/api/tileop/src/TAdd_mask.cpp
@@ -0,0 +1,191 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1)
+    __asm__ volatile("" ::: "memory");
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+using namespace pto;
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test(T *c_ptr, T *a_ptr, T *b_ptr) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  static constexpr int block_row = gm_row / tile_row;
+  static constexpr int block_col = gm_col / tile_col;
+  static constexpr int remainder_row = gm_row % tile_row;
+  static constexpr int remainder_col = gm_col % tile_col;
+
+  using trailing_rows_shape =
+      Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor, tile_row, remainder_col>;
+  using trailing_cols_shape =
+      Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor, remainder_row, tile_col>;
+  using trailing_corner_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor,
+                                            remainder_row, remainder_col>;
+
+  glb_iterator gAIter(a_ptr);
+  glb_iterator gBIter(b_ptr);
+  glb_iterator gCIter(c_ptr);
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      auto gA = gAIter(i, j);
+      auto gB = gBIter(i, j);
+      auto gC = gCIter(i, j);
+
+      tile_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+    if constexpr (remainder_col) {
+      auto gA = gAIter(i, block_col);
+      auto gB = gBIter(i, block_col);
+      auto gC = gCIter(i, block_col);
+
+      trailing_rows_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+  }
+  if constexpr (remainder_row) {
+    for (int j = 0; j < block_col; ++j) {
+      auto gA = gAIter(block_row, j);
+      auto gB = gBIter(block_row, j);
+      auto gC = gCIter(block_row, j);
+
+      trailing_cols_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+    if constexpr (remainder_col) {
+      auto gA = gAIter(block_row, block_col);
+      auto gB = gBIter(block_row, block_col);
+      auto gC = gCIter(block_row, block_col);
+
+      trailing_corner_shape tA, tB, tC;
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
+      TADD(tC, tA, tB);
+      TSTORE(gC, tC);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 6;
+  constexpr uint16_t gm_col = 6;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  const uint16_t gm_row = 66;
+  const uint16_t gm_col = 66;
+  const uint16_t tile_row = 16;
+  const uint16_t tile_col = 16;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src0[gm_size];
+  static int64_t src1[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src0, gm_size);
+  init_src_uint(src1, gm_size);
+
+  test<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src0, src1);
+  return 0;
+#else
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+  float *src1 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/test/other/tileop_api/src/TAdds.cpp b/benchmarks/api/tileop/src/TAdds.cpp
similarity index 64%
rename from test/other/tileop_api/src/TAdds.cpp
rename to benchmarks/api/tileop/src/TAdds.cpp
index 305fd6a..3545585 100644
--- a/test/other/tileop_api/src/TAdds.cpp
+++ b/benchmarks/api/tileop/src/TAdds.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1)
+    __asm__ volatile("" ::: "memory");
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0, T s) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
- 
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -20,21 +52,21 @@ void test_RowMajor(T *dst, T *src0, T s) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TADDS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0, T s) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -44,36 +76,54 @@ void test_ColMajor(T *dst, T *src0, T s) {
       int offset = i * (tile_col * gm_row) + j * tile_row;
       gm_shape s0(src0 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TADDS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst_i64[gm_size];
+  static int64_t src0_i64[gm_size];
+  init_dst(dst_i64, gm_size);
+  init_src_int(src0_i64, gm_size);
 
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, s_i64);
+
+  return 0;
+#else
+  float *dst_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst_col);
+  init_dst(dst_col, gm_size);
 
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
+  float *src0_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0_col);
+  init_src_fp(src0_col, gm_size);
 
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
@@ -81,31 +131,31 @@ int main() {
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -114,16 +164,16 @@ int main() {
   PMC_START();
 #endif
 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0, s_fp32);
- 
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst_col, src0_col, s_fp32);
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16, s_fp16);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8, s_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16, s_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32, s_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, s_i64);
 
 #ifdef LINX_PMC
@@ -131,30 +181,31 @@ int main() {
 #endif
 
   printf("Result:\n");
-  OutArray(dst, gm_size);
-  //OutArray(dst_f16, gm_size);
+  OutArray(dst_col, gm_size);
+  OutArray(dst_f16, gm_size);
   OutArray(dst_i8, gm_size);
   OutArray(dst_i16, gm_size);
   OutArray(dst_i32, gm_size);
   OutArray(dst_i64, gm_size);
- 
-  free(dst);
-  free(src0);
- 
+
+  free(dst_col);
+  free(src0_col);
+
   free(dst_f16);
   free(src0_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TAnd.cpp b/benchmarks/api/tileop/src/TAnd.cpp
similarity index 77%
rename from test/tileop_api/src/TAnd.cpp
rename to benchmarks/api/tileop/src/TAnd.cpp
index 714b320..14e97d7 100644
--- a/test/tileop_api/src/TAnd.cpp
+++ b/benchmarks/api/tileop/src/TAnd.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TAND(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TAND(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 32;
-  const uint16_t tile_row = 64;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 32;
+  constexpr uint16_t tile_row = 64;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst_i64[gm_size];
+  static int64_t src0_i64[gm_size];
+  static int64_t src1_i64[gm_size];
+  init_dst(dst_i64, gm_size);
+  init_src_int(src0_i64, gm_size);
+  init_src_int(src1_i64, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
 
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -91,51 +143,51 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
   __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, gm_size);
- 
+
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
   int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
   int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -154,9 +206,9 @@ int main() {
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8, src1_i8);
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16, src1_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32, src1_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
 
 #ifdef LINX_PMC
@@ -171,7 +223,7 @@ int main() {
   OutArray(dst_i16, gm_size);
   OutArray(dst_i32, gm_size);
   OutArray(dst_i64, gm_size);
- 
+
   free(dst);
   free(src0);
   free(src1);
@@ -183,22 +235,23 @@ int main() {
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+
   free(dst_i8);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TAssemble.cpp b/benchmarks/api/tileop/src/TAssemble.cpp
similarity index 95%
rename from test/tileop_api/src/TAssemble.cpp
rename to benchmarks/api/tileop/src/TAssemble.cpp
index efd4e25..04a535c 100644
--- a/test/tileop_api/src/TAssemble.cpp
+++ b/benchmarks/api/tileop/src/TAssemble.cpp
@@ -28,11 +28,11 @@ void test_rm(float *dst, float *src0, float *src1, float *src2) {
   tile_shape_src2 d2;
   tile_shape_dst d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TASSEMBLE(d3, d0, d1, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 template <size_t dst_row, size_t dst_col, size_t src0_row, size_t src0_col,
@@ -58,11 +58,11 @@ void test_rm_mask(float *dst, float *src0, float *src1, float *src2) {
   tile_shape_src2 d2;
   tile_shape_dst d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TASSEMBLE(d3, d0, d1, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 template <size_t dst_row, size_t dst_col, size_t src0_row, size_t src0_col,
@@ -88,11 +88,11 @@ void test_cm(float *dst, float *src0, float *src1, float *src2) {
   tile_shape_src2 d2;
   tile_shape_dst d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TASSEMBLE(d3, d0, d1, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 int main() {
@@ -119,7 +119,7 @@ int main() {
   float *dst3 = (float *)malloc(size_dst * sizeof(float));
   check_mem_alloc(dst3);
   init_dst(dst3, size_dst);
-  
+
   float *src0 = (float *)malloc(size_src0 * sizeof(float));
   check_mem_alloc(src0);
   init_src_fp(src0, size_src0);
@@ -136,10 +136,10 @@ int main() {
 
   test_rm<dst_row, dst_col, src0_row, src0_col, src1_row, src1_col, src2_row,
           src2_col>(dst1, src0, src1, src2);
-          
+
   test_cm<dst_row, dst_col, src0_row, src0_col, src1_row, src1_col, src2_row,
           src2_col>(dst2, src0, src1, src2);
-  
+
   test_rm_mask<dst_row, dst_col, src0_row, src0_col, src1_row, src1_col, src2_row,
           src2_col>(dst3, src0, src1, src2);
 
diff --git a/test/tileop_api/src/TCI.cpp b/benchmarks/api/tileop/src/TCI.cpp
similarity index 74%
rename from test/tileop_api/src/TCI.cpp
rename to benchmarks/api/tileop/src/TCI.cpp
index 776cad0..092e7c3 100644
--- a/test/tileop_api/src/TCI.cpp
+++ b/benchmarks/api/tileop/src/TCI.cpp
@@ -5,6 +5,39 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
           uint64_t tile_col,typename T>
 void test_rm(T *dst, T s) {
@@ -20,7 +53,7 @@ void test_rm(T *dst, T s) {
 
       tile_shape d1;
       TCI<tile_shape, T, 0>(d1, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -40,12 +73,28 @@ void test_cm(T *dst, T s) {
 
       tile_shape d1;
       TCI<tile_shape, T, 0>(d1, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 8;
+  constexpr uint16_t gm_col = 8;
+  constexpr uint16_t tile_row = 8;
+  constexpr uint16_t tile_col = 8;
+  constexpr uint16_t gm_size = gm_row * gm_col;
+
+  static int32_t dst_rm[gm_size];
+  static int32_t dst_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_rm, s_i32);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_cm, s_i32);
+  return 0;
+#else
   const uint16_t gm_row = 64;
   const uint16_t gm_col = 32;
   const uint16_t tile_row = 64;
@@ -127,4 +176,5 @@ int main() {
   free(dst5);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TCast.cpp b/benchmarks/api/tileop/src/TCast.cpp
similarity index 95%
rename from test/tileop_api/src/TCast.cpp
rename to benchmarks/api/tileop/src/TCast.cpp
index b40d9ba..5d71936 100644
--- a/test/tileop_api/src/TCast.cpp
+++ b/benchmarks/api/tileop/src/TCast.cpp
@@ -17,9 +17,9 @@ void test_rm(T2 *dst, T1 *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCAST(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t row, size_t col, typename T1, typename T2>
@@ -34,9 +34,9 @@ void test_cm(T2 *dst, T1 *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCAST(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t row, size_t col, typename T1, typename T2>
@@ -51,9 +51,9 @@ void test_Nz(T2 *dst, T1 *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TCAST(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
diff --git a/test/tileop_api/src/TCmp.cpp b/benchmarks/api/tileop/src/TCmp.cpp
similarity index 69%
rename from test/tileop_api/src/TCmp.cpp
rename to benchmarks/api/tileop/src/TCmp.cpp
index 7d6c16e..495e118 100644
--- a/test/tileop_api/src/TCmp.cpp
+++ b/benchmarks/api/tileop/src/TCmp.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T, CmpMode Mode>
 void test_RowMajor_CmpMode(int32_t *dst, T *src0, T *src1) {
@@ -12,7 +63,7 @@ void test_RowMajor_CmpMode(int32_t *dst, T *src0, T *src1) {
   using gm_shape_out = global_tensor<int32_t, RowMajor<gm_row, gm_col>>;
   using tile_shape_in = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
   using tile_shape_out = Tile<Location::Vec, int32_t, tile_row, tile_col, BLayout::RowMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -23,13 +74,13 @@ void test_RowMajor_CmpMode(int32_t *dst, T *src0, T *src1) {
       gm_shape_in s0(src0 + offset);
       gm_shape_in s1(src1 + offset);
       gm_shape_out res(dst + offset);
-  
+
       tile_shape_in d0, d1;
       tile_shape_out d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TCMP(d2, d1, d0, Mode);  // Use the template parameter Mode
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -41,7 +92,7 @@ void test_ColMajor_CmpMode(int32_t *dst, T *src0, T *src1) {
   using gm_shape_out = global_tensor<int32_t, ColMajor<gm_row, gm_col>>;
   using tile_shape_in = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
   using tile_shape_out = Tile<Location::Vec, int32_t, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -52,13 +103,13 @@ void test_ColMajor_CmpMode(int32_t *dst, T *src0, T *src1) {
       gm_shape_in s0(src0 + offset);
       gm_shape_in s1(src1 + offset);
       gm_shape_out res(dst + offset);
-  
+
       tile_shape_in d0, d1;
       tile_shape_out d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TCMP(d2, d1, d0, Mode);  // Use the template parameter Mode
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -68,7 +119,7 @@ template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T, CmpMode Mode>
 void test_SingleCmpMode_RowMajor() {
   size_t gm_size = gm_row * gm_col;
-  
+
   int32_t *dst = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -89,7 +140,7 @@ void test_SingleCmpMode_RowMajor() {
 #endif
 
   OutArray(dst, gm_size);
-  
+
   free(dst);
   free(src0);
   free(src1);
@@ -99,7 +150,7 @@ template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T, CmpMode Mode>
 void test_SingleCmpMode_ColMajor() {
   size_t gm_size = gm_row * gm_col;
-  
+
   int32_t *dst = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -120,7 +171,7 @@ void test_SingleCmpMode_ColMajor() {
 #endif
 
   OutArray(dst, gm_size);
-  
+
   free(dst);
   free(src0);
   free(src1);
@@ -147,6 +198,42 @@ void test_AllCmpModes_ColMajor() {
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 8;
+  constexpr uint16_t gm_col = 8;
+  constexpr uint16_t tile_row = 8;
+  constexpr uint16_t tile_col = 8;
+  constexpr uint16_t gm_size = gm_row * gm_col;
+
+  static int32_t dst_rm[gm_size];
+  static int32_t dst_cm[gm_size];
+  static int32_t dst_eq[gm_size];
+  static int64_t src0_rm[gm_size];
+  static int64_t src1_rm[gm_size];
+  static int64_t src0_cm[gm_size];
+  static int64_t src1_cm[gm_size];
+  static int32_t src0_eq[gm_size];
+  static int32_t src1_eq[gm_size];
+
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+  init_dst(dst_eq, gm_size);
+  init_src_int(src0_rm, gm_size);
+  init_src_uint(src1_rm, gm_size);
+  init_src_uint(src0_cm, gm_size);
+  init_src_int(src1_cm, gm_size);
+  init_src_int(src0_eq, gm_size);
+  init_src_int(src1_eq, gm_size);
+
+  test_RowMajor_CmpMode<gm_row, gm_col, tile_row, tile_col, int64_t,
+                         CmpMode::GT>(dst_rm, src0_rm, src1_rm);
+  test_ColMajor_CmpMode<gm_row, gm_col, tile_row, tile_col, int64_t,
+                         CmpMode::LE>(dst_cm, src0_cm, src1_cm);
+  test_RowMajor_CmpMode<gm_row, gm_col, tile_row, tile_col, int32_t,
+                         CmpMode::EQ>(dst_eq, src0_eq, src1_eq);
+
+  return 0;
+#else
   const uint16_t gm_row = 64;
   const uint16_t gm_col = 32;
   const uint16_t tile_row = 64;
@@ -154,6 +241,7 @@ int main() {
 
   size_t gm_size = gm_row * gm_col;
   size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
   printf("Result:\n");
   // Test all comparison modes for the float type
@@ -175,4 +263,5 @@ int main() {
   test_SingleCmpMode_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t, CmpMode::EQ>();
   test_SingleCmpMode_ColMajor<gm_row, gm_col, tile_row, tile_col, int32_t, CmpMode::EQ>();
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TCopy.cpp b/benchmarks/api/tileop/src/TCopy.cpp
new file mode 100644
index 0000000..d0fb15c
--- /dev/null
+++ b/benchmarks/api/tileop/src/TCopy.cpp
@@ -0,0 +1,339 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_Nz(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = TileLeft<T, tile_row, tile_col>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gS0Iter(src0);
+  glb_iterator gDIter(dst);
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      auto s0 = gS0Iter(i, j);
+      auto res = gDIter(i, j);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TCOPY(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row, uint16_t tile_col, typename T>
+void test_Nz_Dynamic(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = TileLeft<T, tile_row, tile_col, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row - 2;
+  volatile size_t tile_valid_col = tile_col - 2;
+
+  uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
+  uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
+
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      uint16_t remainder_row = gm_row - i * tile_valid_row;
+      uint16_t remainder_col = gm_col - j * tile_valid_col;
+
+      uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row;
+      uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col;
+
+      int offset = i * (tile_valid_row * gm_col) + j * tile_valid_col;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0(active_row, active_col);
+      tile_shape d1(active_row, active_col);
+      TLOAD(d0, s0);
+      TCOPY(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_RowMajor(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_row; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TCOPY(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_RowMajor_Dynamic(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, 2*tile_row, 2*tile_col, BLayout::RowMajor, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row - 2;
+  volatile size_t tile_valid_col = tile_col - 2;
+
+  uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
+  uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
+
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      uint16_t remainder_row = gm_row - i * tile_valid_row;
+      uint16_t remainder_col = gm_col - j * tile_valid_col;
+
+      uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row;
+      uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col;
+
+      int offset = i * (tile_valid_row * gm_col) + j * tile_valid_col;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0(active_row, active_col);
+      tile_shape d1(active_row, active_col);
+      TLOAD(d0, s0);
+      TCOPY(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_ColMajor(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_col; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_row; ++j) {
+      int offset = i * (tile_col * gm_row) + j * tile_row;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TCOPY(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+
+  float *dst_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst_col);
+  init_dst(dst_col, gm_size);
+
+  float *src0_col = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0_col);
+  init_src_fp(src0_col, gm_size);
+
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src0_f16);
+  init_src_fp(src0_f16, gm_size);
+
+  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_i8);
+  init_dst(dst_i8, gm_size);
+
+  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src0_i8);
+  init_src_int(src0_i8, gm_size);
+
+  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_i16);
+  init_dst(dst_i16, gm_size);
+
+  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src0_i16);
+  init_src_int(src0_i16, gm_size);
+
+  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_i32);
+  init_dst(dst_i32, gm_size);
+
+  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src0_i32);
+  init_src_int(src0_i32, gm_size);
+
+  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_i64);
+  init_dst(dst_i64, gm_size);
+
+  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src0_i64);
+  init_src_int(src0_i64, gm_size);
+
+  int32_t *dst1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst1_i32);
+  init_dst(dst1_i32, gm_size);
+
+  int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src1_i32);
+  init_src_int(src1_i32, gm_size);
+
+  int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_nz_i32);
+  init_dst(dst_nz_i32, gm_size);
+
+  int32_t *src_nz_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src_nz_i32);
+  init_src_int(src_nz_i32, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+  // Test for fp32 Nz
+  test_Nz<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
+
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst_col, src0_col);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
+
+  test_RowMajor_Dynamic<gm_row, gm_col, tile_row, tile_col, int32_t>(dst1_i32, src1_i32);
+
+  test_Nz_Dynamic<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_nz_i32, src_nz_i32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_i8, gm_size);
+  OutArray(dst_i16, gm_size);
+  OutArray(dst_i32, gm_size);
+  OutArray(dst_i64, gm_size);
+  OutArray(dst1_i32, gm_size);
+  OutArray(dst_nz_i32, gm_size);
+
+  free(dst);
+  free(src0);
+
+  free(dst_f16);
+  free(src0_f16);
+
+  free(dst_i8);
+  free(src0_i8);
+
+  free(dst_i16);
+  free(src0_i16);
+
+  free(dst_i32);
+  free(src0_i32);
+
+  free(dst_i64);
+  free(src0_i64);
+
+  free(dst1_i32);
+  free(src1_i32);
+
+  free(dst_nz_i32);
+  free(src_nz_i32);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TCvt.cpp b/benchmarks/api/tileop/src/TCvt.cpp
new file mode 100644
index 0000000..f8652e3
--- /dev/null
+++ b/benchmarks/api/tileop/src/TCvt.cpp
@@ -0,0 +1,243 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col> void testRow2Nz(float *dst, float *src) {
+  using gm_shape = global_tensor<float, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
+  using tile_shape_out = TileLeft<float, row, col>;
+
+  gm_shape s0(src);
+  gm_shape res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TCVT(d1, d0);
+  TCVT(d0, d1);
+  TSTORE(res, d0);
+}
+
+template <uint16_t row, uint16_t col> void testNz2Col(float *dst, float *src) {
+  using gm_shape = global_tensor<float, RowMajor<row, col>>;
+
+  using tile_shape_in = TileLeft<float, row, col>;
+  using tile_shape_out = Tile<Location::Vec, float, row, col, BLayout::ColMajor>;
+
+  gm_shape s0(src);
+  gm_shape res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TCVT(d1, d0);
+  TCVT(d0, d1);
+  TSTORE(res, d0);
+}
+
+template <uint16_t row, uint16_t col> void testNz2Zn(float *dst, float *src) {
+  using gm_shape = global_tensor<float, RowMajor<row, col>>;
+
+  using tile_shape_in = TileLeft<float, row, col>;
+  using tile_shape_out = TileRight<float, row, col>;
+
+  gm_shape s0(src);
+  gm_shape res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TCVT(d1, d0);
+  TCVT(d0, d1);
+  TSTORE(res, d0);
+}
+
+template <uint16_t row, uint16_t col> void testZn2Nz(float *dst, float *src) {
+  using gm_shape = global_tensor<float, RowMajor<row, col>>;
+
+  using tile_shape_in = TileRight<float, row, col>;
+  using tile_shape_out = TileLeft<float, row, col>;
+
+  gm_shape s0(src);
+  gm_shape res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TCVT(d1, d0);
+  TCVT(d0, d1);
+  TSTORE(res, d0);
+}
+
+template <uint16_t row, uint16_t col> void testNz2Nz(float *dst, float *src) {
+  using gm_shape = global_tensor<float, RowMajor<row, col>>;
+
+  using tile_shape_in = TileLeft<float, row, col>;
+  using tile_shape_out = TileLeft<float, row, col>;
+
+  gm_shape s0(src);
+  gm_shape res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TCVT(d1, d0);
+  TCVT(d0, d1);
+  TSTORE(res, d0);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 16;
+  constexpr uint16_t col = 16;
+  using row_tile = Tile<Location::Vec, int64_t, row, col>;
+  using col_tile = Tile<Location::Vec, int64_t, row, col, BLayout::ColMajor>;
+  using nz_tile = TileLeft<int64_t, row, col>;
+  using zn_tile = TileRight<int64_t, row, col>;
+
+  row_tile row_src;
+  row_tile row_round;
+  col_tile col_src;
+  col_tile col_round;
+  nz_tile nz_a;
+  nz_tile nz_b;
+  zn_tile zn;
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      row_src.data()[index<row_tile>(i, j)] =
+          static_cast<int64_t>((i + 1) * 100 + j);
+      col_src.data()[index<col_tile>(i, j)] =
+          static_cast<int64_t>((i + 1) * 1000 + j);
+    }
+  }
+
+  TCVT(nz_a, row_src);
+  TCVT(row_round, nz_a);
+  TCVT(zn, nz_a);
+  TCVT(nz_b, zn);
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      if (row_round.data()[index<row_tile>(i, j)] !=
+          row_src.data()[index<row_tile>(i, j)]) {
+        return 1;
+      }
+      if (nz_b.data()[index<nz_tile>(i, j)] !=
+          nz_a.data()[index<nz_tile>(i, j)]) {
+        return 2;
+      }
+    }
+  }
+
+  TCVT(nz_a, col_src);
+  TCVT(col_round, nz_a);
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      if (col_round.data()[index<col_tile>(i, j)] !=
+          col_src.data()[index<col_tile>(i, j)]) {
+        return 3;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t row = 16;
+  const uint16_t col = 32;
+
+  size_t size = row * col;
+
+  float *dst = (float *)malloc(size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size);
+
+  float *src = (float *)malloc(size * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, size);
+
+  float *dst1 = (float *)malloc(size * sizeof(float));
+  check_mem_alloc(dst1);
+  init_dst(dst1, size);
+
+  float *src1 = (float *)malloc(size * sizeof(float));
+  check_mem_alloc(src1);
+  init_rows_fp(src1, row, col);
+
+  float *dst2 = (float *)malloc(size * sizeof(float));
+  check_mem_alloc(dst2);
+  init_dst(dst2, size);
+
+  float *src2 = (float *)malloc(size * sizeof(float));
+  check_mem_alloc(src2);
+  init_rows_fp(src2, row, col);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  testRow2Nz<row, col>(dst, src);
+  testNz2Col<row, col>(dst1, src1);
+  testNz2Zn<row, col>(dst2, src2);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size);
+  OutArray(dst1, size);
+  OutArray(dst2, size);
+
+  free(dst);
+  free(src);
+  free(dst1);
+  free(src1);
+  free(dst2);
+  free(src2);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TDiv.cpp b/benchmarks/api/tileop/src/TDiv.cpp
new file mode 100644
index 0000000..d8950a1
--- /dev/null
+++ b/benchmarks/api/tileop/src/TDiv.cpp
@@ -0,0 +1,251 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_rm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape s1(src1 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1, d2;
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TDIV(d2, d1, d0);
+      TSTORE(res, d2);
+    }
+  }
+}
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_cm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape s1(src1 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1, d2;
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TDIV(d2, d1, d0);
+      TSTORE(res, d2);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  // 64x64 global tensor split into 32x32 tiles
+  const uint16_t gm_row = 64;
+  const uint16_t gm_col = 64;
+  const uint16_t tile_row = 32;
+  const uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_rm[gm_size];
+  static int64_t dst_cm[gm_size];
+  static int64_t src0_rm[gm_size];
+  static int64_t src1_rm[gm_size];
+  static int64_t src0_cm[gm_size];
+  static int64_t src1_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+  init_src_uint(src0_rm, gm_size);
+  init_src_int(src1_rm, gm_size);
+  init_src_uint(src0_cm, gm_size);
+  init_src_int(src1_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_rm, src0_rm, src1_rm);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_cm, src0_cm, src1_cm);
+
+  return 0;
+#else
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+  float *src1 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+
+  __half *src2 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src2);
+  init_src_fp(src2, gm_size);
+  __half *src3 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src3);
+  init_src_fp(src3, gm_size);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+
+  int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src4);
+  init_src_int8(src4, gm_size);
+  int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src5);
+  init_src_int8(src5, gm_size);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+
+  int16_t *src6 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src6);
+  init_src_int(src6, gm_size);
+  int16_t *src7 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src7);
+  init_src_int(src7, gm_size);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+
+  int32_t *src8 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src8);
+  init_src_int(src8, gm_size);
+  int32_t *src9 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src9);
+  init_src_int(src9, gm_size);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+
+  int64_t *src10 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src10);
+  init_src_int(src10, gm_size);
+  int64_t *src11 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src11);
+  init_src_int(src11, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src0, src1);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, src2, src3);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst2, src4, src5);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst3, src6, src7);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, src8, src9);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, src10, src11);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+
+  free(dst);
+  free(src0);
+  free(src1);
+  free(dst1);
+  free(src2);
+  free(src3);
+  free(dst2);
+  free(src4);
+  free(src5);
+  free(dst3);
+  free(src6);
+  free(src7);
+  free(dst4);
+  free(src8);
+  free(src9);
+  free(dst5);
+  free(src10);
+  free(src11);
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TDivs.cpp b/benchmarks/api/tileop/src/TDivs.cpp
new file mode 100644
index 0000000..59b5023
--- /dev/null
+++ b/benchmarks/api/tileop/src/TDivs.cpp
@@ -0,0 +1,215 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TDIVS(d1, d0, s);
+      TSTORE(res, d1);
+    }
+  }
+}
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TDIVS(d1, d0, s);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint64_t gm_row = 4;
+  constexpr uint64_t gm_col = 4;
+  constexpr uint64_t tile_row = 4;
+  constexpr uint64_t tile_col = 4;
+#else
+  const uint16_t gm_row = 64;
+  const uint16_t gm_col = 64;
+  const uint16_t tile_row = 32;
+  const uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_rm[gm_size];
+  static int64_t dst_cm[gm_size];
+  static int64_t src_rm[gm_size];
+  static int64_t src_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+  init_src_uint(src_rm, gm_size);
+  init_src_uint(src_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_rm, src_rm, 2);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_cm, src_cm, 2);
+  return 0;
+#else
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, gm_size);
+
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+  __half *src1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+
+  // int16_t
+  int16_t *dst2 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+  int16_t *src2 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src2);
+  init_src_int(src2, gm_size);
+  // int8
+  int8_t *dst3 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+  int8_t *src3 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src3);
+  init_src_int(src3, gm_size);
+  // int32_t
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+  int32_t *src4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src4);
+  init_src_int(src4, gm_size);
+  // int64_t
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+  int64_t *src5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src5);
+  init_src_int(src5, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+  // QEMU error: SIMT instructions do not support writing to scalar registers.
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src, s_fp32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, src1, s_fp16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst3, src3, s_i8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst2, src2, s_i16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, src4, s_i32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, src5, s_i64);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+  free(dst);
+  free(src);
+  free(dst1);
+  free(src1);
+  free(dst2);
+  free(src2);
+  free(dst3);
+  free(src3);
+  free(dst4);
+  free(src4);
+  free(dst5);
+  free(src5);
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TExp.cpp b/benchmarks/api/tileop/src/TExp.cpp
new file mode 100644
index 0000000..2b7e767
--- /dev/null
+++ b/benchmarks/api/tileop/src/TExp.cpp
@@ -0,0 +1,190 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TEXP(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TEXP(d1, d0);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+  using row_tile = Tile<Location::Vec, int64_t, tile_row, tile_col>;
+  using col_tile =
+      Tile<Location::Vec, int64_t, tile_row, tile_col, BLayout::ColMajor>;
+
+  row_tile src_rm, dst_rm;
+  col_tile src_cm, dst_cm;
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t value = static_cast<int64_t>((i + j) % 6);
+      size_t row_index = index<row_tile>(i, j);
+      size_t col_index = index<col_tile>(i, j);
+      src_rm.data()[row_index] = value;
+      src_cm.data()[col_index] = value;
+      dst_rm.data()[row_index] = 0;
+      dst_cm.data()[col_index] = 0;
+    }
+  }
+
+  TEXP(dst_rm, src_rm);
+  TEXP(dst_cm, src_cm);
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t value = static_cast<int64_t>((i + j) % 6);
+      int64_t expected = linx_tile_iexp(value);
+      if (dst_rm.data()[index<row_tile>(i, j)] != expected) {
+        return 1;
+      }
+      if (dst_cm.data()[index<col_tile>(i, j)] != expected) {
+        return 2;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t gm_row = 64;
+  const uint16_t gm_col = 64;
+  const uint16_t tile_row = 16;
+  const uint16_t tile_col = 16;
+
+  size_t gm_size = gm_row * gm_col;
+  size_t tile_size = tile_row * tile_col;
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+  float *src = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, gm_size);
+  // float16
+  __half *dst2 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+  __half *src2 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src2);
+  init_src_fp(src2, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  // TEXP supports only float32 and float16.
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src);
+  // The __half variant compiles but fails at runtime, so it stays disabled.
+  // test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst2, src2);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst2, gm_size);
+
+
+  free(dst);
+  free(src);
+  free(dst2);
+  free(src2);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TExpandCol.cpp b/benchmarks/api/tileop/src/TExpandCol.cpp
new file mode 100644
index 0000000..4b6137a
--- /dev/null
+++ b/benchmarks/api/tileop/src/TExpandCol.cpp
@@ -0,0 +1,186 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor, row, 1>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TEXPANDCOL(d1, d0);
+  TSTORE(res, d1);
+}
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor, row, 1>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TEXPANDCOL(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
+  const uint16_t row = 32;
+  const uint16_t col = 32;
+
+  size_t size_in = row * col;
+  size_t size_out = row * col;
+  size_t print_out = row;
+  // float32
+  float *dst = (float *)malloc(size_out * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_out);
+
+  float *src = (float *)malloc(size_in * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, size_in);
+  // float16
+  __half *dst2 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst2);
+  init_dst(dst2, size_out);
+
+  __half *src2 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src2);
+  init_src_fp(src2, size_in);
+  // int16
+  int16_t *dst1 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst1);
+  init_dst(dst1, size_out);
+
+  int16_t *src1 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src1);
+  init_src_int(src1, size_in);
+  // int8
+  int8_t *dst3 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, size_out);
+
+  int8_t *src3 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src3);
+  init_src_int(src3, size_in);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, size_out);
+
+  int32_t *src4 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src4);
+  init_src_int(src4, size_in);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, size_out);
+
+  int64_t *src5 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src5);
+  init_src_int(src5, size_in);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, float>(dst, src);
+  test_rm<row, col, __half>(dst2, src2);
+  test_rm<row, col, int16_t>(dst1, src1);
+  test_rm<row, col, int8_t>(dst3, src3);
+  test_rm<row, col, int32_t>(dst4, src4);
+  test_rm<row, col, int64_t>(dst5, src5);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, print_out);
+  OutArray(dst1, print_out);
+  OutArray(dst2, print_out);
+  OutArray(dst3, print_out);
+  OutArray(dst4, print_out);
+  OutArray(dst5, print_out);
+
+  free(dst);
+  free(src);
+  free(dst1);
+  free(src1);
+  free(dst2);
+  free(src2);
+  free(dst3);
+  free(src3);
+  free(dst4);
+  free(src4);
+  free(dst5);
+  free(src5);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TExpandRow.cpp b/benchmarks/api/tileop/src/TExpandRow.cpp
new file mode 100644
index 0000000..f448a92
--- /dev/null
+++ b/benchmarks/api/tileop/src/TExpandRow.cpp
@@ -0,0 +1,190 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T>
+void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor, 1, col>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+  TLOAD(d0, s0);
+  TEXPANDROW(d1, d0);
+  TSTORE(res, d1);
+}
+template <uint16_t row, uint16_t col, typename T>
+void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor, 1, col>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+  TLOAD(d0, s0);
+  TEXPANDROW(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
+  const uint16_t row = 32;
+  const uint16_t col = 32;
+  size_t size_in = col;
+  size_t size_out = row * col;
+
+  const uint16_t row1 = 64;
+  const uint16_t col1 = 64;
+  size_t size_in1 = col1;
+  size_t size_out1 = row1 * col1;
+
+  // float32
+  float *dst = (float *)malloc(size_out1 * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_out1);
+
+  float *src = (float *)malloc(size_in1 * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, size_in1);
+  // float16
+  __half *dst1 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, size_out);
+
+  __half *src1 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_in);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, size_out);
+
+  int8_t *src2 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src2);
+  init_src_int(src2, size_in);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, size_out);
+
+  int16_t *src3 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src3);
+  init_src_int(src3, size_in);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, size_out);
+
+  int32_t *src4 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src4);
+  init_src_int(src4, size_in);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, size_out);
+
+  int64_t *src5 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src5);
+  init_src_int(src5, size_in);
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, float>(dst, src);
+  test_rm<row, col, __half>(dst1, src1);
+  test_rm<row, col, int8_t>(dst2, src2);
+  test_rm<row, col, int16_t>(dst3, src3);
+  test_rm<row, col, int32_t>(dst4, src4);
+  test_rm<row, col, int64_t>(dst5, src5);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_out);
+  OutArray(dst1, size_out);
+  OutArray(dst2, size_out);
+  OutArray(dst3, size_out);
+  OutArray(dst4, size_out);
+  OutArray(dst5, size_out);
+
+  free(dst);
+  free(src);
+  free(dst1);
+  free(src1);
+  free(dst2);
+  free(src2);
+  free(dst3);
+  free(src3);
+  free(dst4);
+  free(src4);
+  free(dst5);
+  free(src5);
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TExpandScalar.cpp b/benchmarks/api/tileop/src/TExpandScalar.cpp
new file mode 100644
index 0000000..7b9e347
--- /dev/null
+++ b/benchmarks/api/tileop/src/TExpandScalar.cpp
@@ -0,0 +1,169 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T s) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+  gm_shape res(dst);
+
+  tile_shape d0;
+  TEXPANDSCALAR(d0, s);
+  TSTORE(res, d0);
+}
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm_dynamic(T *dst, T s) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, 2*tile_row, 2*tile_col, BLayout::RowMajor, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row;
+  volatile size_t tile_valid_col = tile_col;
+
+  gm_shape res(dst);
+  tile_shape d0(tile_valid_row, tile_valid_col);
+
+  TEXPANDSCALAR(d0, s);
+  TSTORE(res, d0);
+}
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T s) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+  gm_shape res(dst);
+
+  tile_shape d0;
+  TEXPANDSCALAR(d0, s);
+  TSTORE(res, d0);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 8;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 8;
+  constexpr uint16_t gm_size = gm_row * gm_col;
+
+  static int64_t dst_rm[gm_size];
+  static int64_t dst_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_rm, s_i64);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_cm, s_i64);
+  return 0;
+#else
+  const uint16_t gm_row = 16;
+  const uint16_t gm_col = 32;
+  const uint16_t tile_row = 16;
+  const uint16_t tile_col = 32;
+
+  size_t gm_size = gm_row * gm_col;
+  size_t tile_size = tile_row * tile_col;
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+  // int32 dynamic
+  int32_t *dst6 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst6);
+  init_dst(dst6, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, s_fp32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, s_fp16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst2, s_i8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst3, s_i16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, s_i32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, s_i64);
+  test_rm_dynamic<gm_row, gm_col, tile_row, tile_col, int32_t>(dst6, s_i32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+  OutArray(dst6, gm_size);
+
+  free(dst);
+  free(dst1);
+  free(dst2);
+  free(dst3);
+  free(dst4);
+  free(dst5);
+  free(dst6);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TExtract.cpp b/benchmarks/api/tileop/src/TExtract.cpp
similarity index 96%
rename from test/tileop_api/src/TExtract.cpp
rename to benchmarks/api/tileop/src/TExtract.cpp
index 8a6e0bf..9fc7c35 100644
--- a/test/tileop_api/src/TExtract.cpp
+++ b/benchmarks/api/tileop/src/TExtract.cpp
@@ -20,9 +20,9 @@ void test_rm(float *dst, float *src, size_t offset_i, size_t offset_j) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXTRACT(d1, d0, offset_i, offset_j);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t src_row, size_t src_col, size_t dst_row, size_t dst_col>
@@ -43,9 +43,9 @@ void test_rm_dynamic(float *dst, float *src, size_t offset_i, size_t offset_j) {
   tile_shape_in d0(src_valid_row, src_valid_col);
   tile_shape_out d1(dst_valid_row, dst_valid_col);
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXTRACT(d1, d0, offset_i, offset_j);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t src_row, size_t src_col, size_t dst_row, size_t dst_col>
@@ -61,9 +61,9 @@ void test_cm(float *dst, float *src, size_t offset_i, size_t offset_j) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TEXTRACT(d1, d0, offset_i, offset_j);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 int main() {
diff --git a/test/tileop_api/src/TFillPad.cpp b/benchmarks/api/tileop/src/TFillPad.cpp
similarity index 96%
rename from test/tileop_api/src/TFillPad.cpp
rename to benchmarks/api/tileop/src/TFillPad.cpp
index 0174d9d..8cb8ec4 100644
--- a/test/tileop_api/src/TFillPad.cpp
+++ b/benchmarks/api/tileop/src/TFillPad.cpp
@@ -20,9 +20,9 @@ void test_rm(int32_t *dst, int32_t *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TFILLPAD(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t tile_row, size_t tile_col, size_t valid_row, size_t valid_col>
@@ -45,9 +45,9 @@ void test_rm_dynamic(int32_t *dst, int32_t *src) {
   tile_shape_in d0(src_valid_row, src_valid_col);
   tile_shape_out d1(dst_valid_row, dst_valid_col);
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TFILLPAD(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 template <size_t tile_row, size_t tile_col, size_t valid_row, size_t valid_col>
@@ -64,9 +64,9 @@ void test_cm(int32_t *dst, int32_t *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
+  TLOAD(d0, s0);
   TFILLPAD(d1, d0);
-  TCOPYOUT(res, d1);
+  TSTORE(res, d1);
 }
 
 
@@ -89,7 +89,7 @@ int main() {
   int32_t *dst3 = (int32_t *)malloc(size * sizeof(int32_t));
   check_mem_alloc(dst3);
   init_dst(dst3, size);
- 
+
   int32_t *src = (int32_t *)malloc(size * sizeof(int32_t));
   check_mem_alloc(src);
   init_src_int(src, size);
diff --git a/test/tileop_api/src/TGather.cpp b/benchmarks/api/tileop/src/TGather.cpp
similarity index 96%
rename from test/tileop_api/src/TGather.cpp
rename to benchmarks/api/tileop/src/TGather.cpp
index 088ad59..cd04d9b 100644
--- a/test/tileop_api/src/TGather.cpp
+++ b/benchmarks/api/tileop/src/TGather.cpp
@@ -23,10 +23,10 @@ void test_RowMajor(float *dst, float *src, uint16_t *indices) {
   tile_shape_indices d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TGATHER(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 template <uint16_t src_row, uint16_t src_col, uint16_t row, uint16_t col>
@@ -47,10 +47,10 @@ void test_ColMajor(float *dst, float *src, uint16_t *indices) {
   tile_shape_indices d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
   TGATHER(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
diff --git a/benchmarks/api/tileop/src/TLoad.cpp b/benchmarks/api/tileop/src/TLoad.cpp
new file mode 100644
index 0000000..e13b0d6
--- /dev/null
+++ b/benchmarks/api/tileop/src/TLoad.cpp
@@ -0,0 +1,305 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_RowMajor(T *dst, T *src0) {
+  using shape = Shape<1, 1, 1, tile_row, tile_col>;
+  using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>;
+  using gm_shape = GlobalTensor<T, shape, stride, Layout::ND>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_row; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0;
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row, uint16_t tile_col, typename T>
+void test_RowMajor_Dynamic(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<-1, -1>>;
+  using tile_shape = Tile<Location::Vec, T, 2*tile_row, 2*tile_col, BLayout::RowMajor, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row - 2;
+  volatile size_t tile_valid_col = tile_col - 2;
+
+  volatile size_t gm_valid_row = gm_row;
+  volatile size_t gm_valid_col = gm_col;
+
+  uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
+  uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
+
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      uint16_t remainder_row = gm_row - i * tile_valid_row;
+      uint16_t remainder_col = gm_col - j * tile_valid_col;
+
+      uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row;
+      uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col;
+
+      int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col;
+      gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col);
+      gm_shape res(dst + offset, gm_valid_row, gm_valid_col);
+
+      tile_shape d0(active_row, active_col);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_ColMajor(T *dst, T *src0) {
+  using shape = Shape<1, 1, 1, tile_row, tile_col>;
+  using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>;
+  using gm_shape = GlobalTensor<T, shape, stride, Layout::DN>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_col; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_row; ++j) {
+      int offset = i * (tile_col * gm_row) + j * tile_row;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0;
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row, uint16_t tile_col, typename T>
+void test_Nz_Dynamic(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<-1, -1>>;
+  using tile_shape = TileLeft<T, tile_row, tile_col, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row - 2;
+  volatile size_t tile_valid_col = tile_col - 2;
+
+  volatile size_t gm_valid_row = gm_row;
+  volatile size_t gm_valid_col = gm_col;
+
+  uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
+  uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
+
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      uint16_t remainder_row = gm_row - i * tile_valid_row;
+      uint16_t remainder_col = gm_col - j * tile_valid_col;
+
+      uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row;
+      uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col;
+
+      int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col;
+      gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col);
+      gm_shape res(dst + offset, gm_valid_row, gm_valid_col);
+
+      tile_shape d0(active_row, active_col);
+      tile_shape d1(active_row, active_col);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src0_f16);
+  init_src_fp(src0_f16, gm_size);
+
+  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_i8);
+  init_dst(dst_i8, gm_size);
+
+  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src0_i8);
+  init_src_int(src0_i8, gm_size);
+
+  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_i16);
+  init_dst(dst_i16, gm_size);
+
+  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src0_i16);
+  init_src_int(src0_i16, gm_size);
+
+  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_i32);
+  init_dst(dst_i32, gm_size);
+
+  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src0_i32);
+  init_src_int(src0_i32, gm_size);
+
+  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_i64);
+  init_dst(dst_i64, gm_size);
+
+  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src0_i64);
+  init_src_int(src0_i64, gm_size);
+
+  // The dynamic tests below address a (gm_row + 1) x (gm_col + 1) region.
+  constexpr size_t gm_size_dyn = (gm_row + 1) * (gm_col + 1);
+  int32_t *dst1_i32 = (int32_t *)malloc(gm_size_dyn * sizeof(int32_t));
+  check_mem_alloc(dst1_i32);
+  init_dst(dst1_i32, gm_size_dyn);
+  int32_t *src1_i32 = (int32_t *)malloc(gm_size_dyn * sizeof(int32_t));
+  check_mem_alloc(src1_i32);
+  init_src_int(src1_i32, gm_size_dyn);
+
+  int32_t *dst_nz_i32 = (int32_t *)malloc(gm_size_dyn * sizeof(int32_t));
+  check_mem_alloc(dst_nz_i32);
+  init_dst(dst_nz_i32, gm_size_dyn);
+  int32_t *src_nz_i32 = (int32_t *)malloc(gm_size_dyn * sizeof(int32_t));
+  check_mem_alloc(src_nz_i32);
+  init_src_int(src_nz_i32, gm_size_dyn);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
+
+  test_RowMajor_Dynamic<gm_row + 1, gm_col + 1, tile_row, tile_col, int32_t>(dst1_i32, src1_i32);
+
+  test_Nz_Dynamic<gm_row + 1, gm_col + 1, tile_row, tile_col, int32_t>(dst_nz_i32, src_nz_i32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_i8, gm_size);
+  OutArray(dst_i16, gm_size);
+  OutArray(dst_i32, gm_size);
+  OutArray(dst_i64, gm_size);
+  OutArray(dst1_i32, gm_size);
+  OutArray(dst_nz_i32, gm_size);
+
+  free(dst);
+  free(src0);
+
+  free(dst_f16);
+  free(src0_f16);
+
+  free(dst_i8);
+  free(src0_i8);
+
+  free(dst_i16);
+  free(src0_i16);
+
+  free(dst_i32);
+  free(src0_i32);
+
+  free(dst_i64);
+  free(src0_i64);
+  free(dst1_i32);
+  free(src1_i32);
+  free(dst_nz_i32);
+  free(src_nz_i32);
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TDiv.cpp b/benchmarks/api/tileop/src/TMax.cpp
similarity index 71%
rename from test/tileop_api/src/TDiv.cpp
rename to benchmarks/api/tileop/src/TMax.cpp
index f9ecf10..25301d6 100644
--- a/test/tileop_api/src/TDiv.cpp
+++ b/benchmarks/api/tileop/src/TMax.cpp
@@ -5,6 +5,38 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_rm(T *dst, T *src0, T *src1) {
@@ -21,10 +53,10 @@ void test_rm(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
-      TDIV(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TMAX(d2, d1, d0);
+      TSTORE(res, d2);
     }
   }
 }
@@ -44,24 +76,43 @@ void test_cm(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
-      TDIV(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TMAX(d2, d1, d0);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
-  // 64*64-16*16
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src0[gm_size];
+  static int64_t src1[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src0, gm_size);
+  init_src_int(src1, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src0, src1);
+
+  return 0;
+#else
   // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
@@ -91,10 +142,10 @@ int main() {
 
   int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src4);
-  init_src_int8(src4, gm_size);
+  init_src_int(src4, gm_size);
   int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src5);
-  init_src_int8(src5, gm_size);
+  init_src_int(src5, gm_size);
   // int16
   int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst3);
@@ -132,7 +183,8 @@ int main() {
 #ifdef LINX_PMC
   PMC_START();
 #endif
-  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src0, src1);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col,float>(dst, src0, src1);
   test_rm<gm_row, gm_col, tile_row, tile_col,__half>(dst1, src2, src3);
   test_rm<gm_row, gm_col, tile_row, tile_col,int8_t>(dst2, src4, src5);
   test_rm<gm_row, gm_col, tile_row, tile_col,int16_t>(dst3, src6, src7);
@@ -142,6 +194,7 @@ int main() {
 #ifdef LINX_PMC
   PMC_END();
 #endif
+
   printf("Result:\n");
   OutArray(dst, gm_size);
   OutArray(dst1, gm_size);
@@ -168,5 +221,7 @@ int main() {
   free(dst5);
   free(src10);
   free(src11);
+
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TMaxs.cpp b/benchmarks/api/tileop/src/TMaxs.cpp
new file mode 100644
index 0000000..8833ab9
--- /dev/null
+++ b/benchmarks/api/tileop/src/TMaxs.cpp
@@ -0,0 +1,197 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TMAXS(d1, d0, s);
+      TSTORE(res, d1);
+    }
+  }
+}
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TMAXS(d1, d0, s);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src, s_i64);
+
+  return 0;
+#else
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, gm_size);
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+
+  __half *src1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+
+  int8_t *src2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src2);
+  init_src_int(src2, gm_size);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+
+  int16_t *src3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src3);
+  init_src_int(src3, gm_size);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+
+  int32_t *src4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src4);
+  init_src_int(src4, gm_size);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+
+  int64_t *src5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src5);
+  init_src_int(src5, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src, s_fp32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, src1, s_fp16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst2, src2, s_i8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst3, src3, s_i16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, src4, s_i32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, src5, s_i64);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+
+  free(dst);
+  free(src);
+  free(dst1);
+  free(src1);
+  free(dst2);
+  free(src2);
+  free(dst3);
+  free(src3);
+  free(dst4);
+  free(src4);
+  free(dst5);
+  free(src5);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TMin.cpp b/benchmarks/api/tileop/src/TMin.cpp
similarity index 96%
rename from test/tileop_api/src/TMin.cpp
rename to benchmarks/api/tileop/src/TMin.cpp
index 35d8f0d..6a8f6b3 100644
--- a/test/tileop_api/src/TMin.cpp
+++ b/benchmarks/api/tileop/src/TMin.cpp
@@ -21,10 +21,10 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMIN(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -45,10 +45,10 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TMIN(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -98,7 +98,7 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
diff --git a/test/tileop_api/src/TMins.cpp b/benchmarks/api/tileop/src/TMins.cpp
similarity index 96%
rename from test/tileop_api/src/TMins.cpp
rename to benchmarks/api/tileop/src/TMins.cpp
index 1cdbee2..7155098 100644
--- a/test/tileop_api/src/TMins.cpp
+++ b/benchmarks/api/tileop/src/TMins.cpp
@@ -20,9 +20,9 @@ void test_RowMajor(float *dst, float *src, float s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMINS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
@@ -42,9 +42,9 @@ void test_ColMajor(float *dst, float *src, float s) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TMINS(d1, d0, s);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
diff --git a/benchmarks/api/tileop/src/TMul.cpp b/benchmarks/api/tileop/src/TMul.cpp
new file mode 100644
index 0000000..9967351
--- /dev/null
+++ b/benchmarks/api/tileop/src/TMul.cpp
@@ -0,0 +1,227 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_rm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape s1(src1 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1, d2;
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TMUL(d2, d1, d0);
+      TSTORE(res, d2);
+    }
+  }
+}
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_cm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape s1(src1 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1, d2;
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
+      TMUL(d2, d1, d0);
+      TSTORE(res, d2);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src0[gm_size];
+  static int64_t src1[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src0, gm_size);
+  init_src_int(src1, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src0, src1);
+
+  return 0;
+#else
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+  float *src1 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+
+  __half *src2 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src2);
+  init_src_fp(src2, gm_size);
+  __half *src3 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src3);
+  init_src_fp(src3, gm_size);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+
+  int8_t *src4 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src4);
+  init_src_int(src4, gm_size);
+  int8_t *src5 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src5);
+  init_src_int(src5, gm_size);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+
+  int16_t *src6 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src6);
+  init_src_int(src6, gm_size);
+  int16_t *src7 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src7);
+  init_src_int(src7, gm_size);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+
+  int32_t *src8 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src8);
+  init_src_int(src8, gm_size);
+  int32_t *src9 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src9);
+  init_src_int(src9, gm_size);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+
+  int64_t *src10 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src10);
+  init_src_int(src10, gm_size);
+  int64_t *src11 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src11);
+  init_src_int(src11, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src0, src1);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, src2, src3);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst2, src4, src5);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst3, src6, src7);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, src8, src9);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, src10, src11);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+
+  free(dst);
+  free(src0);
+  free(src1);
+  free(dst1);
+  free(src2);
+  free(src3);
+  free(dst2);
+  free(src4);
+  free(src5);
+  free(dst3);
+  free(src6);
+  free(src7);
+  free(dst4);
+  free(src8);
+  free(src9);
+  free(dst5);
+  free(src10);
+  free(src11);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TMuls.cpp b/benchmarks/api/tileop/src/TMuls.cpp
new file mode 100644
index 0000000..e451ee0
--- /dev/null
+++ b/benchmarks/api/tileop/src/TMuls.cpp
@@ -0,0 +1,197 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TMULS(d1, d0, s);
+      TSTORE(res, d1);
+    }
+  }
+}
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0, d1;
+      TLOAD(d0, s0);
+      TMULS(d1, d0, s);
+      TSTORE(res, d1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src, s_i64);
+
+  return 0;
+#else
+  // float32
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src);
+  init_src_fp(src, gm_size);
+  // float16
+  __half *dst1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst1);
+  init_dst(dst1, gm_size);
+
+  __half *src1 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src1);
+  init_src_fp(src1, gm_size);
+  // int8
+  int8_t *dst2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst2);
+  init_dst(dst2, gm_size);
+
+  int8_t *src2 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src2);
+  init_src_int(src2, gm_size);
+  // int16
+  int16_t *dst3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst3);
+  init_dst(dst3, gm_size);
+
+  int16_t *src3 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src3);
+  init_src_int(src3, gm_size);
+  // int32
+  int32_t *dst4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst4);
+  init_dst(dst4, gm_size);
+
+  int32_t *src4 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src4);
+  init_src_int(src4, gm_size);
+  // int64
+  int64_t *dst5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst5);
+  init_dst(dst5, gm_size);
+
+  int64_t *src5 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src5);
+  init_src_int(src5, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, float>(dst, src, s_fp32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst1, src1, s_fp16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst2, src2, s_i8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst3, src3, s_i16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst4, src4, s_i32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst5, src5, s_i64);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst1, gm_size);
+  OutArray(dst2, gm_size);
+  OutArray(dst3, gm_size);
+  OutArray(dst4, gm_size);
+  OutArray(dst5, gm_size);
+
+  free(dst);
+  free(src);
+  free(dst1);
+  free(src1);
+  free(dst2);
+  free(src2);
+  free(dst3);
+  free(src3);
+  free(dst4);
+  free(src4);
+  free(dst5);
+  free(src5);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TOr.cpp b/benchmarks/api/tileop/src/TOr.cpp
similarity index 80%
rename from test/tileop_api/src/TOr.cpp
rename to benchmarks/api/tileop/src/TOr.cpp
index 922bd5c..be0c249 100644
--- a/test/tileop_api/src/TOr.cpp
+++ b/benchmarks/api/tileop/src/TOr.cpp
@@ -5,12 +5,44 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_RowMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -21,22 +53,22 @@ void test_RowMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TOR(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
- 
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_ColMajor(T *dst, T *src0, T *src1) {
   using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
+
   uint16_t block_row = gm_row / tile_row;
   uint16_t block_col = gm_col / tile_col;
   #pragma clang loop unroll(full)
@@ -47,25 +79,45 @@ void test_ColMajor(T *dst, T *src0, T *src1) {
       gm_shape s0(src0 + offset);
       gm_shape s1(src1 + offset);
       gm_shape res(dst + offset);
-  
+
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TOR(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 32;
-  const uint16_t tile_row = 64;
-  const uint16_t tile_col = 32;
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 32;
+  constexpr uint16_t tile_row = 64;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+#ifdef __linx
+  static int64_t dst_i64[gm_size];
+  static int64_t src0_i64[gm_size];
+  static int64_t src1_i64[gm_size];
+  init_dst(dst_i64, gm_size);
+  init_src_int(src0_i64, gm_size);
+  init_src_int(src1_i64, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
 
+  return 0;
+#else
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
   init_dst(dst, gm_size);
@@ -91,63 +143,63 @@ int main() {
   __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(dst_f16);
   init_dst(dst_f16, gm_size);
- 
+
   __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src0_f16);
   init_src_fp(src0_f16, gm_size);
   __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half));
   check_mem_alloc(src1_f16);
   init_src_fp(src1_f16, gm_size);
- 
+
   int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8);
   init_dst(dst_i8, gm_size);
   int8_t *dst_i8_col = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(dst_i8_col);
   init_dst(dst_i8_col, gm_size);
- 
+
   int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src0_i8);
   init_src_int(src0_i8, gm_size);
   int8_t *src1_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
   check_mem_alloc(src1_i8);
   init_src_int(src1_i8, gm_size);
- 
+
   int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16);
   init_dst(dst_i16, gm_size);
   int16_t *dst_i16_col = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(dst_i16_col);
   init_dst(dst_i16_col, gm_size);
- 
+
   int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src0_i16);
   init_src_int(src0_i16, gm_size);
   int16_t *src1_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
   check_mem_alloc(src1_i16);
   init_src_int(src1_i16, gm_size);
-  
+
   int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(dst_i32);
   init_dst(dst_i32, gm_size);
   int32_t *dst_i32_col = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(dst_i32);
+  check_mem_alloc(dst_i32_col);
   init_dst(dst_i32_col, gm_size);
- 
+
   int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src0_i32);
   init_src_int(src0_i32, gm_size);
   int32_t *src1_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
   check_mem_alloc(src1_i32);
   init_src_int(src1_i32, gm_size);
- 
+
   int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64);
   init_dst(dst_i64, gm_size);
   int64_t *dst_i64_col = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(dst_i64_col);
   init_dst(dst_i64_col, gm_size);
- 
+
   int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
   check_mem_alloc(src0_i64);
   init_src_int(src0_i64, gm_size);
@@ -168,10 +220,10 @@ int main() {
 
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16, src1_i16);
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16_col, src0_i16, src1_i16);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32, src1_i32);
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32_col, src0_i32, src1_i32);
- 
+
   test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64, src1_i64);
   test_ColMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64_col, src0_i64, src1_i64);
 
@@ -191,7 +243,7 @@ int main() {
   OutArray(dst_i32_col, gm_size);
   OutArray(dst_i64, gm_size);
   OutArray(dst_i64_col, gm_size);
- 
+
   free(dst);
   free(src0);
   free(src1);
@@ -203,26 +255,27 @@ int main() {
   free(dst_f16);
   free(src0_f16);
   free(src1_f16);
- 
+
   free(dst_i8);
   free(dst_i8_col);
   free(src0_i8);
   free(src1_i8);
- 
+
   free(dst_i16);
   free(dst_i16_col);
   free(src0_i16);
   free(src1_i16);
- 
+
   free(dst_i32);
   free(dst_i32_col);
   free(src0_i32);
   free(src1_i32);
- 
+
   free(dst_i64);
   free(dst_i64_col);
   free(src0_i64);
   free(src1_i64);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/test/tileop_api/src/TPad.cpp b/benchmarks/api/tileop/src/TPad.cpp
similarity index 69%
rename from test/tileop_api/src/TPad.cpp
rename to benchmarks/api/tileop/src/TPad.cpp
index 1ae7fae..87d3df5 100644
--- a/test/tileop_api/src/TPad.cpp
+++ b/benchmarks/api/tileop/src/TPad.cpp
@@ -5,6 +5,39 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t tile_row, uint16_t tile_col, uint16_t valid_row, uint16_t valid_col,
           uint16_t dst_tile_row, uint16_t dst_tile_col, typename T>
 void test_pad_rm(T *dst, T *src, T pad_value, size_t up_pad, size_t left_pad, size_t down_pad, size_t right_pad) {
@@ -19,9 +52,9 @@ void test_pad_rm(T *dst, T *src, T pad_value, size_t up_pad, size_t left_pad, si
   tile_shape_src src_tensor;
   tile_shape_dst dst_tensor;
 
-  TCOPYIN(src_tensor, s0);
+  TLOAD(src_tensor, s0);
   TPAD(dst_tensor, src_tensor, pad_value, up_pad, left_pad, down_pad, right_pad);
-  TCOPYOUT(res, dst_tensor);
+  TSTORE(res, dst_tensor);
 }
 
 template <uint16_t tile_row, uint16_t tile_col, uint16_t valid_row, uint16_t valid_col,
@@ -38,9 +71,9 @@ void test_pad_cm(T *dst, T *src, T pad_value, size_t up_pad, size_t left_pad, si
   tile_shape_src src_tensor;
   tile_shape_dst dst_tensor;
 
-  TCOPYIN(src_tensor, s0);
+  TLOAD(src_tensor, s0);
   TPAD(dst_tensor, src_tensor, pad_value, up_pad, left_pad, down_pad, right_pad);
-  TCOPYOUT(res, dst_tensor);
+  TSTORE(res, dst_tensor);
 }
 
 // Function that tests a single data type
@@ -50,7 +83,7 @@ void test_single_type() {
     const uint16_t tile_col = 32;
     const uint16_t valid_row = 2;
     const uint16_t valid_col = 2;
-    
+
     const int32_t pad_value = 0;
     const size_t up_pad = 1, left_pad = 2, down_pad = 3, right_pad = 4;
     const uint16_t dst_tile_row = valid_row + up_pad + down_pad;
@@ -71,7 +104,7 @@ void test_single_type() {
     // Allocate source memory
     T *src = (T *)malloc(size * sizeof(T));
     check_mem_alloc(src);
-    
+
     // Select the appropriate initialization function based on the type
     if constexpr (std::is_integral_v<T>) {
         if constexpr (std::is_unsigned_v<T>) {
@@ -104,6 +137,29 @@ void test_single_type() {
 }
 
 int main() {
+#ifdef __linx
+    constexpr uint16_t tile_row = 4;
+    constexpr uint16_t tile_col = 4;
+    constexpr uint16_t valid_row = 2;
+    constexpr uint16_t valid_col = 2;
+    constexpr size_t up_pad = 1;
+    constexpr size_t left_pad = 1;
+    constexpr size_t down_pad = 1;
+    constexpr size_t right_pad = 1;
+    constexpr uint16_t dst_tile_row = valid_row + up_pad + down_pad;
+    constexpr uint16_t dst_tile_col = valid_col + left_pad + right_pad;
+    constexpr uint16_t size = tile_row * tile_col;
+
+    static int64_t dst[size];
+    static int64_t src[size];
+    init_dst(dst, size);
+    init_src_int(src, size);
+
+    test_pad_rm<tile_row, tile_col, valid_row, valid_col, dst_tile_row,
+                dst_tile_col, int64_t>(dst, src, static_cast<int64_t>(0),
+                                        up_pad, left_pad, down_pad, right_pad);
+    return 0;
+#else
     printf("Results:\n");
     // Each data type passes when tested on its own; running them all together causes precision errors
     // test_single_type<int8_t>();
@@ -112,6 +168,7 @@ int main() {
     // test_single_type<int64_t>();
     // test_single_type<__half>();
     test_single_type<float>();
-    
+
     return 0;
+#endif
 }
diff --git a/test/tileop_api/src/TRSqrt.cpp b/benchmarks/api/tileop/src/TRSqrt.cpp
similarity index 96%
rename from test/tileop_api/src/TRSqrt.cpp
rename to benchmarks/api/tileop/src/TRSqrt.cpp
index 8d33fc2..89b4731 100644
--- a/test/tileop_api/src/TRSqrt.cpp
+++ b/benchmarks/api/tileop/src/TRSqrt.cpp
@@ -20,9 +20,9 @@ void test(float *dst, float *src) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1;
-      TCOPYIN(d0, s0);
+      TLOAD(d0, s0);
       TRSQRT(d1, d0);
-      TCOPYOUT(res, d1);
+      TSTORE(res, d1);
     }
   }
 }
diff --git a/benchmarks/api/tileop/src/TRecip.cpp b/benchmarks/api/tileop/src/TRecip.cpp
new file mode 100644
index 0000000..61555ec
--- /dev/null
+++ b/benchmarks/api/tileop/src/TRecip.cpp
@@ -0,0 +1,253 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      auto s0 = gSIter(i, j);
+      auto res = gDIter(i, j);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TRECIP(t1, t0);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_col; ++i) {
+    for (int j = 0; j < block_row; ++j) {
+      auto s0 = gSIter(j, i);
+      auto res = gDIter(j, i);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TRECIP(t1, t0);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  const size_t gm_row = 32;
+  const size_t gm_col = 32;
+  const size_t tile_row = 32;
+  const size_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)gm_size;
+  (void)tile_size;
+
+#ifdef __linx
+  using row_tile = Tile<Location::Vec, int64_t, tile_row, tile_col>;
+  using col_tile =
+      Tile<Location::Vec, int64_t, tile_row, tile_col, BLayout::ColMajor>;
+  row_tile src_rm, dst_rm;
+  col_tile src_cm, dst_cm;
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      size_t row_index = index<row_tile>(i, j);
+      size_t col_index = index<col_tile>(i, j);
+      src_rm.data()[row_index] = 1;
+      src_cm.data()[col_index] = 1;
+      dst_rm.data()[row_index] = 0;
+      dst_cm.data()[col_index] = 0;
+    }
+  }
+
+  TRECIP(dst_rm, src_rm);
+  TRECIP(dst_cm, src_cm);
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      if (dst_rm.data()[index<row_tile>(i, j)] != 1) {
+        return 1;
+      }
+      if (dst_cm.data()[index<col_tile>(i, j)] != 1) {
+        return 2;
+      }
+    }
+  }
+
+  return 0;
+#else
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, gm_size);
+
+  int8_t *src_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_uint(src_int8, gm_size);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, gm_size);
+
+  int16_t *src_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_uint(src_int16, gm_size);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, gm_size);
+
+  int32_t *src_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_uint(src_int32, gm_size);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, gm_size);
+
+  int64_t *src_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_uint(src_int64, gm_size);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, gm_size);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, gm_size);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_int8, src_int8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_int16, src_int16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_int32, src_int32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src_int64);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, gm_size);
+  OutArray(dst_int16, gm_size);
+  OutArray(dst_int32, gm_size);
+  OutArray(dst_int64, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_f32, gm_size);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TRem.cpp b/benchmarks/api/tileop/src/TRem.cpp
similarity index 70%
rename from test/tileop_api/src/TRem.cpp
rename to benchmarks/api/tileop/src/TRem.cpp
index 3648b1d..fac15b0 100644
--- a/test/tileop_api/src/TRem.cpp
+++ b/benchmarks/api/tileop/src/TRem.cpp
@@ -5,6 +5,57 @@
 #include "../linxStartEnd.hpp"
 #endif
 
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
           uint16_t tile_col, typename T>
 void test_rm(T *dst, T *src0, T *src1) {
@@ -21,10 +72,10 @@ void test_rm(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TREM(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
@@ -44,24 +95,51 @@ void test_cm(T *dst, T *src0, T *src1) {
       gm_shape res(dst + offset);
 
       tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
+      TLOAD(d0, s0);
+      TLOAD(d1, s1);
       TREM(d2, d1, d0);
-      TCOPYOUT(res, d2);
+      TSTORE(res, d2);
     }
   }
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 8;
+  constexpr uint16_t gm_col = 8;
+  constexpr uint16_t tile_row = 8;
+  constexpr uint16_t tile_col = 8;
+#else
   // 64*64-16*16
   const uint16_t gm_row = 64;
   const uint16_t gm_col = 64;
   const uint16_t tile_row = 32;
   const uint16_t tile_col = 32;
+#endif
 
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
 
+#ifdef __linx
+  static int32_t dst_rm[gm_size];
+  static int32_t dst_cm[gm_size];
+  static int32_t src0_rm[gm_size];
+  static int32_t src1_rm[gm_size];
+  static int32_t src0_cm[gm_size];
+  static int32_t src1_cm[gm_size];
+  init_dst(dst_rm, gm_size);
+  init_dst(dst_cm, gm_size);
+  init_src_uint(src0_rm, gm_size);
+  init_src_int(src1_rm, gm_size);
+  init_src_uint(src0_cm, gm_size);
+  init_src_int(src1_cm, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_rm, src0_rm, src1_rm);
+  test_cm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_cm, src0_cm, src1_cm);
+
+  return 0;
+#else
   // float32
   float *dst = (float *)malloc(gm_size * sizeof(float));
   check_mem_alloc(dst);
@@ -181,4 +259,5 @@ int main() {
   free(src10);
   free(src11);
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TReshape.cpp b/benchmarks/api/tileop/src/TReshape.cpp
new file mode 100644
index 0000000..f83024e
--- /dev/null
+++ b/benchmarks/api/tileop/src/TReshape.cpp
@@ -0,0 +1,177 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<gm_row * 2, gm_col / 2>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, tile_row * 2, tile_col / 2, BLayout::RowMajor>;
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+  TLOAD(d0, s0);
+  TRESHAPE(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 8;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 8;
+#else
+  constexpr size_t gm_row = 64;
+  constexpr size_t gm_col = 64;
+  constexpr size_t tile_row = 64;
+  constexpr size_t tile_col = 64;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_uint(src, gm_size);
+
+  test<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, gm_size);
+
+  int8_t *src_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_uint(src_int8, gm_size);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, gm_size);
+
+  int16_t *src_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_uint(src_int16, gm_size);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, gm_size);
+
+  int32_t *src_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_uint(src_int32, gm_size);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, gm_size);
+
+  int64_t *src_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_uint(src_int64, gm_size);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, gm_size);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, gm_size);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_int8, src_int8);
+  test<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_int16, src_int16);
+  test<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_int32, src_int32);
+  test<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src_int64);
+  test<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16);
+  test<gm_row, gm_col, tile_row, tile_col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, gm_size);
+  OutArray(dst_int16, gm_size);
+  OutArray(dst_int32, gm_size);
+  OutArray(dst_int64, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_f32, gm_size);
+
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TRowMax.cpp b/benchmarks/api/tileop/src/TRowMax.cpp
new file mode 100644
index 0000000..1c91357
--- /dev/null
+++ b/benchmarks/api/tileop/src/TRowMax.cpp
@@ -0,0 +1,203 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::RowMajor, row, 1>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAX(d1, d0);
+  TSTORE(res, d1);
+}
+
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::ColMajor, row, 1>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAX(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
+  const size_t row = 32;
+  const size_t col = 32;
+
+  size_t size_in = row * col;
+  size_t size_out = row * col;
+
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, size_out);
+
+  int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_uint(src_int8, size_in);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, size_out);
+
+  int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_uint(src_int16, size_in);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, size_out);
+
+  int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_uint(src_int32, size_in);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, size_out);
+
+  int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_uint(src_int64, size_in);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, size_out);
+
+  __half *src_f16 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, size_in);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, size_out);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, size_in);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, int8_t>(dst_int8, src_int8);
+  test_rm<row, col, int16_t>(dst_int16, src_int16);
+  test_rm<row, col, int32_t>(dst_int32, src_int32);
+  test_rm<row, col, int64_t>(dst_int64, src_int64);
+  test_cm<row, col, __half>(dst_f16, src_f16);
+  test_cm<row, col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, size_out);
+  OutArray(dst_int16, size_out);
+  OutArray(dst_int32, size_out);
+  OutArray(dst_int64, size_out);
+  OutArray(dst_f16, size_out);
+  OutArray(dst_f32, size_out);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TRowMaxExpand.cpp b/benchmarks/api/tileop/src/TRowMaxExpand.cpp
new file mode 100644
index 0000000..669ff1e
--- /dev/null
+++ b/benchmarks/api/tileop/src/TRowMaxExpand.cpp
@@ -0,0 +1,223 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAXEXPAND(d1, d0);
+  TSTORE(res, d1);
+}
+
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAXEXPAND(d1, d0);
+  TSTORE(res, d1);
+}
+
+#ifndef __linx
+template <size_t row, size_t col, typename T> void test_Nz(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = TileLeft<T, row, col>;
+  using tile_shape_out = TileLeft<T, row, col>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWMAXEXPAND(d1, d0);
+  TSTORE(res, d1);
+}
+#endif
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
+  const size_t row = 32;
+  const size_t col = 32;
+
+  size_t size_in = row * col;
+  size_t size_out = row * col;
+
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, size_out);
+
+  int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_uint(src_int8, size_in);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, size_out);
+
+  int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_uint(src_int16, size_in);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, size_out);
+
+  int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_uint(src_int32, size_in);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, size_out);
+
+  int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_uint(src_int64, size_in);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, size_out);
+
+  __half *src_f16 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, size_in);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, size_out);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, size_in);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, int8_t>(dst_int8, src_int8);
+  test_rm<row, col, int16_t>(dst_int16, src_int16);
+  test_rm<row, col, int32_t>(dst_int32, src_int32);
+  test_rm<row, col, int64_t>(dst_int64, src_int64);
+  test_rm<row, col, __half>(dst_f16, src_f16);
+  // test_rm<row, col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, size_out);
+  OutArray(dst_int16, size_out);
+  OutArray(dst_int32, size_out);
+  OutArray(dst_int64, size_out);
+  OutArray(dst_f16, size_out);
+  OutArray(dst_f32, size_out);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TRowMax.cpp b/benchmarks/api/tileop/src/TRowSum.cpp
similarity index 70%
rename from test/tileop_api/src/TRowMax.cpp
rename to benchmarks/api/tileop/src/TRowSum.cpp
index e0c52e8..f1a6a59 100644
--- a/test/tileop_api/src/TRowMax.cpp
+++ b/benchmarks/api/tileop/src/TRowSum.cpp
@@ -5,7 +5,40 @@
 #include "../linxStartEnd.hpp"
 #endif
 
-template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
   using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
 
@@ -18,12 +51,12 @@ template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
-  TROWMAX(d1, d0);
-  TCOPYOUT(res, d1);
+  TLOAD(d0, s0);
+  TROWSUM(d1, d0);
+  TSTORE(res, d1);
 }
 
-template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
   using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
   using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
 
@@ -36,12 +69,30 @@ template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
   tile_shape_in d0;
   tile_shape_out d1;
 
-  TCOPYIN(d0, s0);
-  TROWMAX(d1, d0);
-  TCOPYOUT(res, d1);
+  TLOAD(d0, s0);
+  TROWSUM(d1, d0);
+  TSTORE(res, d1);
 }
 
 int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
   const size_t row = 32;
   const size_t col = 32;
 
@@ -139,4 +190,5 @@ int main() {
   free(src_f32);
 
   return 0;
-}
\ No newline at end of file
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TRowSumExpand.cpp b/benchmarks/api/tileop/src/TRowSumExpand.cpp
new file mode 100644
index 0000000..a5203eb
--- /dev/null
+++ b/benchmarks/api/tileop/src/TRowSumExpand.cpp
@@ -0,0 +1,201 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t row, uint16_t col, typename T> void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWSUMEXPAND(d1, d0);
+  TSTORE(res, d1);
+}
+
+template <uint16_t row, uint16_t col, typename T> void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<row, col>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TROWSUMEXPAND(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t row = 4;
+  constexpr uint16_t col = 8;
+  constexpr uint16_t size = row * col;
+
+  static int64_t dst_rm[size];
+  static int64_t dst_cm[size];
+  static int64_t src_rm[size];
+  static int64_t src_cm[size];
+  init_dst(dst_rm, size);
+  init_dst(dst_cm, size);
+  init_src_int(src_rm, size);
+  init_src_int(src_cm, size);
+
+  test_rm<row, col, int64_t>(dst_rm, src_rm);
+  test_cm<row, col, int64_t>(dst_cm, src_cm);
+  return 0;
+#else
+  const size_t row = 32;
+  const size_t col = 32;
+
+  size_t size_in = row * col;
+  size_t size_out = row * col;
+
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, size_out);
+
+  int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_uint(src_int8, size_in);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, size_out);
+
+  int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_uint(src_int16, size_in);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, size_out);
+
+  int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_uint(src_int32, size_in);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, size_out);
+
+  int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_uint(src_int64, size_in);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, size_out);
+
+  __half *src_f16 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, size_in);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, size_out);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, size_in);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, int8_t>(dst_int8, src_int8);
+  test_rm<row, col, int16_t>(dst_int16, src_int16);
+  test_rm<row, col, int32_t>(dst_int32, src_int32);
+  test_rm<row, col, int64_t>(dst_int64, src_int64);
+  test_rm<row, col, __half>(dst_f16, src_f16);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, size_out);
+  OutArray(dst_int16, size_out);
+  OutArray(dst_int32, size_out);
+  OutArray(dst_int64, size_out);
+  OutArray(dst_f16, size_out);
+  OutArray(dst_f32, size_out);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/TScatter.cpp b/benchmarks/api/tileop/src/TScatter.cpp
similarity index 95%
rename from test/tileop_api/src/TScatter.cpp
rename to benchmarks/api/tileop/src/TScatter.cpp
index c8f34e8..bdf4473 100644
--- a/test/tileop_api/src/TScatter.cpp
+++ b/benchmarks/api/tileop/src/TScatter.cpp
@@ -23,11 +23,11 @@ void test_RowMajor(float *dst, float *src, uint16_t *indices) {
   tile_shape_indices d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, res);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, res);
   TSCATTER(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 template <uint16_t dst_row, uint16_t dst_col, uint16_t row, uint16_t col>
@@ -48,11 +48,11 @@ void test_ColMajor(float *dst, float *src, uint16_t *indices) {
   tile_shape_indices d1;
   tile_shape_dst d2;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, res);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, res);
   TSCATTER(d2, d0, d1);
-  TCOPYOUT(res, d2);
+  TSTORE(res, d2);
 }
 
 int main() {
diff --git a/test/tileop_api/src/TSelect.cpp b/benchmarks/api/tileop/src/TSelect.cpp
similarity index 93%
rename from test/tileop_api/src/TSelect.cpp
rename to benchmarks/api/tileop/src/TSelect.cpp
index 7001a8c..7c69c02 100644
--- a/test/tileop_api/src/TSelect.cpp
+++ b/benchmarks/api/tileop/src/TSelect.cpp
@@ -23,11 +23,11 @@ void test_RowMajor(float *dst, float *src0, float *src1, uint16_t *cond) {
   tile_shape_uint16 d2;
   tile_shape_fp32 d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TSELECT(d3, d2, d0, d1);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 template <uint16_t row, uint16_t col>
@@ -48,11 +48,11 @@ void test_ColMajor(float *dst, float *src0, float *src1, uint16_t *cond) {
   tile_shape_uint16 d2;
   tile_shape_fp32 d3;
 
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  TCOPYIN(d2, s2);
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  TLOAD(d2, s2);
   TSELECT(d3, d2, d0, d1);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 int main() {
diff --git a/benchmarks/api/tileop/src/TSqrt.cpp b/benchmarks/api/tileop/src/TSqrt.cpp
new file mode 100644
index 0000000..9d4923f
--- /dev/null
+++ b/benchmarks/api/tileop/src/TSqrt.cpp
@@ -0,0 +1,204 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_rm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (size_t i = 0; i < block_row; ++i) {
+    for (size_t j = 0; j < block_col; ++j) {
+      auto s0 = gSIter(i, j);
+      auto res = gDIter(i, j);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TSQRT(t1, t0);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
+          uint64_t tile_col, typename T>
+void test_cm(T *dst, T *src) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (size_t i = 0; i < block_col; ++i) {
+    for (size_t j = 0; j < block_row; ++j) {
+      auto s0 = gSIter(j, i);
+      auto res = gDIter(j, i);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TSQRT(t1, t0);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+int main() {
+
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  constexpr size_t gm_row = 32;
+  constexpr size_t gm_col = 32;
+  constexpr size_t tile_row = 16;
+  constexpr size_t tile_col = 16;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)gm_size;
+  (void)tile_size;
+
+#ifdef __linx
+  using row_tile = Tile<Location::Vec, int64_t, tile_row, tile_col>;
+  using col_tile =
+      Tile<Location::Vec, int64_t, tile_row, tile_col, BLayout::ColMajor>;
+  row_tile src_rm, dst_rm;
+  col_tile src_cm, dst_cm;
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t expected = static_cast<int64_t>(i * tile_col + j);
+      size_t row_index = index<row_tile>(i, j);
+      size_t col_index = index<col_tile>(i, j);
+      src_rm.data()[row_index] = expected * expected;
+      src_cm.data()[col_index] = expected * expected;
+      dst_rm.data()[row_index] = 0;
+      dst_cm.data()[col_index] = 0;
+    }
+  }
+
+  TSQRT(dst_rm, src_rm);
+  TSQRT(dst_cm, src_cm);
+
+  for (size_t i = 0; i < tile_row; ++i) {
+    for (size_t j = 0; j < tile_col; ++j) {
+      int64_t expected = static_cast<int64_t>(i * tile_col + j);
+      if (dst_rm.data()[index<row_tile>(i, j)] != expected) {
+        return 1;
+      }
+      if (dst_cm.data()[index<col_tile>(i, j)] != expected) {
+        return 2;
+      }
+    }
+  }
+
+  return 0;
+#else
+  // __half
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_rows_fp(src_f16, gm_row, gm_col);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, gm_size);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_rows_fp(src_f32, gm_row, gm_col);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_f32, gm_size);
+
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TStore.cpp b/benchmarks/api/tileop/src/TStore.cpp
new file mode 100644
index 0000000..02a2cf9
--- /dev/null
+++ b/benchmarks/api/tileop/src/TStore.cpp
@@ -0,0 +1,305 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_RowMajor(T *dst, T *src0) {
+  using shape = Shape<1, 1, 1, tile_row, tile_col>;
+  using stride = Stride<1, 1, gm_row * gm_col, gm_col, 1>;
+  using gm_shape = GlobalTensor<T, shape, stride, Layout::ND>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_row; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_col; ++j) {
+      int offset = i * (tile_row * gm_col) + j * tile_col;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0;
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row, uint16_t tile_col, typename T>
+void test_RowMajor_Dynamic(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<-1, -1>>;
+  using tile_shape = Tile<Location::Vec, T, 2*tile_row, 2*tile_col, BLayout::RowMajor, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row - 2;
+  volatile size_t tile_valid_col = tile_col - 2;
+
+  volatile size_t gm_valid_row = gm_row;
+  volatile size_t gm_valid_col = gm_col;
+
+  uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
+  uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
+
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      uint16_t remainder_row = gm_row - i * tile_valid_row;
+      uint16_t remainder_col = gm_col - j * tile_valid_col;
+
+      uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row;
+      uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col;
+
+      int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col;
+      gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col);
+      gm_shape res(dst + offset, gm_valid_row, gm_valid_col);
+
+      tile_shape d0(active_row, active_col);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
+          uint16_t tile_col, typename T>
+void test_ColMajor(T *dst, T *src0) {
+  using shape = Shape<1, 1, 1, tile_row, tile_col>;
+  using stride = Stride<1, 1, gm_row * gm_col, 1, gm_row>;
+  using gm_shape = GlobalTensor<T, shape, stride, Layout::DN>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+
+  uint16_t block_row = gm_row / tile_row;
+  uint16_t block_col = gm_col / tile_col;
+  #pragma clang loop unroll(full)
+  for (int i = 0; i < block_col; ++i) {
+    #pragma clang loop unroll(full)
+    for (int j = 0; j < block_row; ++j) {
+      int offset = i * (tile_col * gm_row) + j * tile_row;
+      gm_shape s0(src0 + offset);
+      gm_shape res(dst + offset);
+
+      tile_shape d0;
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row, uint16_t tile_col, typename T>
+void test_Nz_Dynamic(T *dst, T *src0) {
+  using gm_shape = global_tensor<T, RowMajor<-1, -1>>;
+  using tile_shape = TileLeft<T, tile_row, tile_col, -1, -1>;
+
+  volatile size_t tile_valid_row = tile_row - 2;
+  volatile size_t tile_valid_col = tile_col - 2;
+
+  volatile size_t gm_valid_row = gm_row;
+  volatile size_t gm_valid_col = gm_col;
+
+  uint16_t block_row = (gm_row + tile_valid_row - 1) / tile_valid_row;
+  uint16_t block_col = (gm_col + tile_valid_col - 1) / tile_valid_col;
+
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      uint16_t remainder_row = gm_row - i * tile_valid_row;
+      uint16_t remainder_col = gm_col - j * tile_valid_col;
+
+      uint16_t active_row = remainder_row < tile_valid_row ? remainder_row : tile_valid_row;
+      uint16_t active_col = remainder_col < tile_valid_col ? remainder_col : tile_valid_col;
+
+      int offset = i * (tile_valid_row * gm_valid_col) + j * tile_valid_col;
+      gm_shape s0(src0 + offset, gm_valid_row, gm_valid_col);
+      gm_shape res(dst + offset, gm_valid_row, gm_valid_col);
+
+      tile_shape d0(active_row, active_col);
+      tile_shape d1(active_row, active_col);
+      TLOAD(d0, s0);
+      TSTORE(res, d0);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t gm_row = 4;
+  constexpr uint16_t gm_col = 4;
+  constexpr uint16_t tile_row = 4;
+  constexpr uint16_t tile_col = 4;
+#else
+  constexpr uint16_t gm_row = 64;
+  constexpr uint16_t gm_col = 64;
+  constexpr uint16_t tile_row = 32;
+  constexpr uint16_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst[gm_size];
+  static int64_t src[gm_size];
+  init_dst(dst, gm_size);
+  init_src_int(src, gm_size);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst, src);
+
+  return 0;
+#else
+  float *dst = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, gm_size);
+
+  float *src0 = (float *)malloc(gm_size * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, gm_size);
+
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src0_f16);
+  init_src_fp(src0_f16, gm_size);
+
+  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_i8);
+  init_dst(dst_i8, gm_size);
+
+  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src0_i8);
+  init_src_int(src0_i8, gm_size);
+
+  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_i16);
+  init_dst(dst_i16, gm_size);
+
+  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src0_i16);
+  init_src_int(src0_i16, gm_size);
+
+  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_i32);
+  init_dst(dst_i32, gm_size);
+
+  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src0_i32);
+  init_src_int(src0_i32, gm_size);
+
+  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_i64);
+  init_dst(dst_i64, gm_size);
+
+  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src0_i64);
+  init_src_int(src0_i64, gm_size);
+
+  // The dynamic tests address a (gm_row + 1) x (gm_col + 1) tensor.
+  constexpr size_t dyn_size = (gm_row + 1) * (gm_col + 1);
+  int32_t *dst1_i32 = (int32_t *)malloc(dyn_size * sizeof(int32_t));
+  check_mem_alloc(dst1_i32);
+  init_dst(dst1_i32, dyn_size);
+  int32_t *src1_i32 = (int32_t *)malloc(dyn_size * sizeof(int32_t));
+  check_mem_alloc(src1_i32);
+  init_src_int(src1_i32, dyn_size);
+
+  int32_t *dst_nz_i32 = (int32_t *)malloc(dyn_size * sizeof(int32_t));
+  check_mem_alloc(dst_nz_i32);
+  init_dst(dst_nz_i32, dyn_size);
+  int32_t *src_nz_i32 = (int32_t *)malloc(dyn_size * sizeof(int32_t));
+  check_mem_alloc(src_nz_i32);
+  init_src_int(src_nz_i32, dyn_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
+
+  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
+
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
+
+  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
+
+  test_RowMajor_Dynamic<gm_row + 1, gm_col + 1, tile_row, tile_col, int32_t>(dst1_i32, src1_i32);
+
+  test_Nz_Dynamic<gm_row + 1, gm_col + 1, tile_row, tile_col, int32_t>(dst_nz_i32, src_nz_i32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_i8, gm_size);
+  OutArray(dst_i16, gm_size);
+  OutArray(dst_i32, gm_size);
+  OutArray(dst_i64, gm_size);
+  OutArray(dst1_i32, gm_size);
+  OutArray(dst_nz_i32, gm_size);
+
+  free(dst);
+  free(src0);
+
+  free(dst_f16);
+  free(src0_f16);
+
+  free(dst_i8);
+  free(src0_i8);
+
+  free(dst_i16);
+  free(src0_i16);
+
+  free(dst_i32);
+  free(src0_i32);
+
+  free(dst_i64);
+  free(src0_i64);
+  free(dst1_i32);
+  free(src1_i32);
+  free(dst_nz_i32);
+  free(src_nz_i32);
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TSub.cpp b/benchmarks/api/tileop/src/TSub.cpp
new file mode 100644
index 0000000..8f8480b
--- /dev/null
+++ b/benchmarks/api/tileop/src/TSub.cpp
@@ -0,0 +1,248 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+//  C = A - B
+template <size_t gm_row, size_t gm_col, size_t tile_row, size_t tile_col,
+          typename T>
+void test_rm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gAIter(src0);
+  glb_iterator gBIter(src1);
+  glb_iterator gCIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      auto s0 = gAIter(i, j);
+      auto s1 = gBIter(i, j);
+      auto res = gCIter(i, j);
+
+      tile_shape t0, t1, t2;
+      TLOAD(t0, s0);
+      TLOAD(t1, s1);
+      TSUB(t2, t0, t1);
+      TSTORE(res, t2);
+    }
+  }
+}
+
+template <size_t gm_row, size_t gm_col, size_t tile_row, size_t tile_col,
+          typename T>
+void test_cm(T *dst, T *src0, T *src1) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gAIter(src0);
+  glb_iterator gBIter(src1);
+  glb_iterator gCIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_col; ++i) {
+    for (int j = 0; j < block_row; ++j) {
+      auto s0 = gAIter(j, i);
+      auto s1 = gBIter(j, i);
+      auto res = gCIter(j, i);
+
+      tile_shape t0, t1, t2;
+      TLOAD(t0, s0);
+      TLOAD(t1, s1);
+      TSUB(t2, t0, t1);
+      TSTORE(res, t2);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  constexpr size_t gm_row = 32;
+  constexpr size_t gm_col = 32;
+  constexpr size_t tile_row = 32;
+  constexpr size_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_int64[gm_size];
+  static int64_t src0_int64[gm_size];
+  static int64_t src1_int64[gm_size];
+  init_dst(dst_int64, gm_size);
+  init_src_int(src0_int64, gm_size);
+  init_src_int(src1_int64, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src0_int64,
+                                                       src1_int64);
+
+  return 0;
+#else
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, gm_size);
+
+  int8_t *src0_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src0_int8);
+  init_src_int(src0_int8, gm_size);
+  int8_t *src1_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src1_int8);
+  init_src_int(src1_int8, gm_size);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, gm_size);
+
+  int16_t *src0_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src0_int16);
+  init_src_int(src0_int16, gm_size);
+  int16_t *src1_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src1_int16);
+  init_src_int(src1_int16, gm_size);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, gm_size);
+
+  int32_t *src0_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src0_int32);
+  init_src_int(src0_int32, gm_size);
+  int32_t *src1_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src1_int32);
+  init_src_int(src1_int32, gm_size);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, gm_size);
+
+  int64_t *src0_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src0_int64);
+  init_src_int(src0_int64, gm_size);
+  int64_t *src1_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src1_int64);
+  init_src_int(src1_int64, gm_size);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src0_f16);
+  init_src_fp(src0_f16, gm_size);
+  __half *src1_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src1_f16);
+  init_src_fp(src1_f16, gm_size);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, gm_size);
+
+  __fp32 *src0_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src0_f32);
+  init_src_fp(src0_f32, gm_size);
+  __fp32 *src1_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src1_f32);
+  init_src_fp(src1_f32, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_int8, src0_int8,
+                                                      src1_int8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_int16, src0_int16,
+                                                       src1_int16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_int32, src0_int32,
+                                                       src1_int32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src0_int64,
+                                                       src1_int64);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16,
+                                                      src1_f16);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __fp32>(dst_f32, src0_f32,
+                                                      src1_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, gm_size);
+  OutArray(dst_int16, gm_size);
+  OutArray(dst_int32, gm_size);
+  OutArray(dst_int64, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_f32, gm_size);
+
+  free(dst_int8);
+  free(src0_int8);
+  free(src1_int8);
+  free(dst_int16);
+  free(src0_int16);
+  free(src1_int16);
+  free(dst_int32);
+  free(src0_int32);
+  free(src1_int32);
+  free(dst_int64);
+  free(src0_int64);
+  free(src1_int64);
+  free(dst_f16);
+  free(src0_f16);
+  free(src1_f16);
+  free(dst_f32);
+  free(src0_f32);
+  free(src1_f32);
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TSubs.cpp b/benchmarks/api/tileop/src/TSubs.cpp
new file mode 100644
index 0000000..fd3e234
--- /dev/null
+++ b/benchmarks/api/tileop/src/TSubs.cpp
@@ -0,0 +1,213 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <size_t gm_row, size_t gm_col, size_t tile_row, size_t tile_col,
+          typename T>
+void test_rm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_row; ++i) {
+    for (int j = 0; j < block_col; ++j) {
+      auto s0 = gSIter(i, j);
+      auto res = gDIter(i, j);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TSUBS(t1, t0, s);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+template <size_t gm_row, size_t gm_col, size_t tile_row, size_t tile_col,
+          typename T>
+void test_cm(T *dst, T *src, T s) {
+  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
+  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
+  using glb_iterator = global_iterator<gm_shape, tile_shape>;
+
+  glb_iterator gSIter(src);
+  glb_iterator gDIter(dst);
+
+  size_t block_row = gm_row / tile_row;
+  size_t block_col = gm_col / tile_col;
+  for (int i = 0; i < block_col; ++i) {
+    for (int j = 0; j < block_row; ++j) {
+      auto s0 = gSIter(j, i);
+      auto res = gDIter(j, i);
+
+      tile_shape t0, t1;
+      TLOAD(t0, s0);
+      TSUBS(t1, t0, s);
+      TSTORE(res, t1);
+    }
+  }
+}
+
+int main() {
+#ifdef __linx
+  constexpr size_t gm_row = 4;
+  constexpr size_t gm_col = 4;
+  constexpr size_t tile_row = 4;
+  constexpr size_t tile_col = 4;
+#else
+  constexpr size_t gm_row = 32;
+  constexpr size_t gm_col = 32;
+  constexpr size_t tile_row = 32;
+  constexpr size_t tile_col = 32;
+#endif
+
+  constexpr size_t gm_size = gm_row * gm_col;
+  constexpr size_t tile_size = tile_row * tile_col;
+  (void)tile_size;
+
+#ifdef __linx
+  static int64_t dst_int64[gm_size];
+  static int64_t src_int64[gm_size];
+  init_dst(dst_int64, gm_size);
+  init_src_int(src_int64, gm_size);
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src_int64,
+                                                       s_i64);
+
+  return 0;
+#else
+  // int8_t
+  int8_t *dst_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, gm_size);
+
+  int8_t *src_int8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_int(src_int8, gm_size);
+
+  // int16_t
+  int16_t *dst_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, gm_size);
+
+  int16_t *src_int16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_int(src_int16, gm_size);
+
+  // int32_t
+  int32_t *dst_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, gm_size);
+
+  int32_t *src_int32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_int(src_int32, gm_size);
+
+  // int64_t
+  int64_t *dst_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, gm_size);
+
+  int64_t *src_int64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_int(src_int64, gm_size);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, gm_size);
+
+  __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, gm_size);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, gm_size);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, gm_size);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_int8, src_int8, s_i8);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_int16, src_int16,
+                                                       s_i16);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_int32, src_int32,
+                                                       s_i32);
+  test_rm<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_int64, src_int64,
+                                                       s_i64);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16, s_fp16);
+  test_cm<gm_row, gm_col, tile_row, tile_col, __fp32>(dst_f32, src_f32, s_fp32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, gm_size);
+  OutArray(dst_int16, gm_size);
+  OutArray(dst_int32, gm_size);
+  OutArray(dst_int64, gm_size);
+  OutArray(dst_f16, gm_size);
+  OutArray(dst_f32, gm_size);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/TTrans.cpp b/benchmarks/api/tileop/src/TTrans.cpp
new file mode 100644
index 0000000..0d441a3
--- /dev/null
+++ b/benchmarks/api/tileop/src/TTrans.cpp
@@ -0,0 +1,188 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
+
+template <size_t row, size_t col, typename T> void test_rm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, RowMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, RowMajor<col, row>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::RowMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, col, row, BLayout::RowMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TTRANS(d1, d0);
+  TSTORE(res, d1);
+}
+
+template <size_t row, size_t col, typename T> void test_cm(T *dst, T *src) {
+  using gm_shape_in = global_tensor<T, ColMajor<row, col>>;
+  using gm_shape_out = global_tensor<T, ColMajor<col, row>>;
+
+  using tile_shape_in = Tile<Location::Vec, T, row, col, BLayout::ColMajor>;
+  using tile_shape_out = Tile<Location::Vec, T, col, row, BLayout::ColMajor>;
+
+  gm_shape_in s0(src);
+  gm_shape_out res(dst);
+  tile_shape_in d0;
+  tile_shape_out d1;
+
+  TLOAD(d0, s0);
+  TTRANS(d1, d0);
+  TSTORE(res, d1);
+}
+
+int main() {
+#ifdef __linx
+  constexpr size_t row = 4;
+  constexpr size_t col = 4;
+#else
+  constexpr size_t row = 32;
+  constexpr size_t col = 32;
+#endif
+
+  constexpr size_t size_in = row * col;
+  constexpr size_t size_out = col * row;
+
+#ifdef __linx
+  static int64_t dst[size_out];
+  static int64_t src[size_in];
+  init_dst(dst, size_out);
+  init_src_int(src, size_in);
+
+  test_rm<row, col, int64_t>(dst, src);
+
+  return 0;
+#else
+  // int8
+  int8_t *dst_int8 = (int8_t *)malloc(size_out * sizeof(int8_t));
+  check_mem_alloc(dst_int8);
+  init_dst(dst_int8, size_out);
+
+  int8_t *src_int8 = (int8_t *)malloc(size_in * sizeof(int8_t));
+  check_mem_alloc(src_int8);
+  init_src_int(src_int8, size_in);
+
+  // int16
+  int16_t *dst_int16 = (int16_t *)malloc(size_out * sizeof(int16_t));
+  check_mem_alloc(dst_int16);
+  init_dst(dst_int16, size_out);
+
+  int16_t *src_int16 = (int16_t *)malloc(size_in * sizeof(int16_t));
+  check_mem_alloc(src_int16);
+  init_src_int(src_int16, size_in);
+
+  // int32
+  int32_t *dst_int32 = (int32_t *)malloc(size_out * sizeof(int32_t));
+  check_mem_alloc(dst_int32);
+  init_dst(dst_int32, size_out);
+
+  int32_t *src_int32 = (int32_t *)malloc(size_in * sizeof(int32_t));
+  check_mem_alloc(src_int32);
+  init_src_int(src_int32, size_in);
+
+  // int64
+  int64_t *dst_int64 = (int64_t *)malloc(size_out * sizeof(int64_t));
+  check_mem_alloc(dst_int64);
+  init_dst(dst_int64, size_out);
+
+  int64_t *src_int64 = (int64_t *)malloc(size_in * sizeof(int64_t));
+  check_mem_alloc(src_int64);
+  init_src_int(src_int64, size_in);
+
+  // __half
+  __half *dst_f16 = (__half *)malloc(size_out * sizeof(__half));
+  check_mem_alloc(dst_f16);
+  init_dst(dst_f16, size_out);
+
+  __half *src_f16 = (__half *)malloc(size_in * sizeof(__half));
+  check_mem_alloc(src_f16);
+  init_src_fp(src_f16, size_in);
+
+  // __fp32
+  __fp32 *dst_f32 = (__fp32 *)malloc(size_out * sizeof(__fp32));
+  check_mem_alloc(dst_f32);
+  init_dst(dst_f32, size_out);
+
+  __fp32 *src_f32 = (__fp32 *)malloc(size_in * sizeof(__fp32));
+  check_mem_alloc(src_f32);
+  init_src_fp(src_f32, size_in);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test_rm<row, col, int8_t>(dst_int8, src_int8);
+  test_rm<row, col, int16_t>(dst_int16, src_int16);
+  test_rm<row, col, int32_t>(dst_int32, src_int32);
+  test_rm<row, col, int64_t>(dst_int64, src_int64);
+  test_cm<row, col, __half>(dst_f16, src_f16);
+  test_cm<row, col, __fp32>(dst_f32, src_f32);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst_int8, size_out);
+  OutArray(dst_int16, size_out);
+  OutArray(dst_int32, size_out);
+  OutArray(dst_int64, size_out);
+  OutArray(dst_f16, size_out);
+  OutArray(dst_f32, size_out);
+
+  free(dst_int8);
+  free(src_int8);
+  free(dst_int16);
+  free(src_int16);
+  free(dst_int32);
+  free(src_int32);
+  free(dst_int64);
+  free(src_int64);
+  free(dst_f16);
+  free(src_f16);
+  free(dst_f32);
+  free(src_f32);
+
+  return 0;
+#endif
+}
diff --git a/benchmarks/api/tileop/src/test_MatMacc.cpp b/benchmarks/api/tileop/src/test_MatMacc.cpp
new file mode 100644
index 0000000..5316b61
--- /dev/null
+++ b/benchmarks/api/tileop/src/test_MatMacc.cpp
@@ -0,0 +1,194 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+
+template <uint16_t M, uint16_t N, uint16_t K, typename T>
+void test_linx_row_major(T *dst, T *src0, T *src1) {
+  using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<T, RowMajor<K, N>>;
+  using gm_shape_C = global_tensor<T, RowMajor<M, N>>;
+
+  using tile_shape_A = Tile<Location::Vec, T, M, K, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  MATMACC(d2, d0, d1);
+  TSTORE(res, d2);
+}
+#endif
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(float *dst, float *src0, float *src1) {
+  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
+  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
+
+  using tile_shape_A = TileLeft<float, M, K>;
+  using tile_shape_B = TileRight<float, K, N>;
+  using tile_shape_C = TileAcc<float, M, N>;
+  using tile_shape_O = Tile<Location::Vec, float, M, N>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+  tile_shape_O d3;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  MATMACC(d2, d0, d1);
+  TCVT(d3, d2);
+  TSTORE(res, d3);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t k = 0; k < K; ++k) {
+      src0_i64[row * K + k] = static_cast<int64_t>((row + 1) * (k + 2));
+    }
+  }
+  for (size_t k = 0; k < K; ++k) {
+    for (size_t col = 0; col < N; ++col) {
+      src1_i64[k * N + col] = static_cast<int64_t>((k + 1) + (col + 1));
+    }
+  }
+  for (size_t i = 0; i < size_C; ++i) {
+    dst_i64[i] = 0;
+  }
+
+  test_linx_row_major<M, N, K, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      expected *= 2;  // MATMUL then MATMACC on the same operands gives 2 * (A x B)
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 16;
+  const uint16_t K = 8;
+  const uint16_t N = 32;
+
+  size_t size_A = M * K;
+  size_t size_B = K * N;
+  size_t size_C = M * N;
+
+  float *dst = (float *)malloc(size_C * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_C);
+
+  float *src0 = (float *)malloc(size_A * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, size_A);
+  float *src1 = (float *)malloc(size_B * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_B);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<M, N, K>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_C);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/test_MatMmxac.cpp b/benchmarks/api/tileop/src/test_MatMmxac.cpp
similarity index 94%
rename from test/tileop_api/src/test_MatMmxac.cpp
rename to benchmarks/api/tileop/src/test_MatMmxac.cpp
index a972acf..27d3f70 100644
--- a/test/tileop_api/src/test_MatMmxac.cpp
+++ b/benchmarks/api/tileop/src/test_MatMmxac.cpp
@@ -34,16 +34,16 @@ void test_mx(float *dst, float *src0, float *src0x, float *src1, float *src1x) {
   tile_shape_C  d2;
   tile_shape_O  d3;
 
-  TCOPYIN(d0,  s0);
-  TCOPYIN(d0x, s0x);
-  TCOPYIN(d1,  s1);
-  TCOPYIN(d1x, s1x);
+  TLOAD(d0,  s0);
+  TLOAD(d0x, s0x);
+  TLOAD(d1,  s1);
+  TLOAD(d1x, s1x);
 
   MATMULMX(d2, d0, d0x, d1, d1x);
   MATMACCMX(d2, d0, d0x, d1, d1x);
 
   TCVT(d3, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 template <uint16_t M, uint16_t N, uint16_t K>
@@ -70,15 +70,15 @@ void test_mxb(float *dst, float *src0, float *src1, float *src1x) {
   tile_shape_C  d2;
   tile_shape_O  d3;
 
-  TCOPYIN(d0,  s0);
-  TCOPYIN(d1,  s1);
-  TCOPYIN(d1x, s1x);
+  TLOAD(d0,  s0);
+  TLOAD(d1,  s1);
+  TLOAD(d1x, s1x);
 
   MATMULMXB(d2, d0, d1, d1x);
   MATMACCMXB(d2, d0, d1, d1x);
 
   TCVT(d3, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 int main() {
diff --git a/benchmarks/api/tileop/src/test_MatMul.cpp b/benchmarks/api/tileop/src/test_MatMul.cpp
new file mode 100644
index 0000000..bf01255
--- /dev/null
+++ b/benchmarks/api/tileop/src/test_MatMul.cpp
@@ -0,0 +1,191 @@
+#include "../data.hpp"
+#include <common/pto_tileop.hpp>
+
+#ifdef LINX_PMC
+#include "../linxStartEnd.hpp"
+#endif
+
+#ifdef __linx
+int main();
+
+extern "C" void *memcpy(void *dst, const void *src, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const volatile uint8_t *s = static_cast<const volatile uint8_t *>(src);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = s[i];
+  }
+  return dst;
+}
+
+extern "C" void *memset(void *dst, int value, size_t n) {
+  volatile uint8_t *d = static_cast<volatile uint8_t *>(dst);
+  const uint8_t byte = static_cast<uint8_t>(value);
+  for (size_t i = 0; i < n; ++i) {
+    d[i] = byte;
+  }
+  return dst;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void
+_start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+
+template <uint16_t M, uint16_t N, uint16_t K, typename T>
+void test_linx_row_major(T *dst, T *src0, T *src1) {
+  using gm_shape_A = global_tensor<T, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<T, RowMajor<K, N>>;
+  using gm_shape_C = global_tensor<T, RowMajor<M, N>>;
+
+  using tile_shape_A = Tile<Location::Vec, T, M, K, BLayout::RowMajor>;
+  using tile_shape_B = Tile<Location::Vec, T, K, N, BLayout::RowMajor>;
+  using tile_shape_C = Tile<Location::Vec, T, M, N, BLayout::RowMajor>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  TSTORE(res, d2);
+}
+#endif
+
+template <uint16_t M, uint16_t N, uint16_t K>
+void test(float *dst, float *src0, float *src1) {
+  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
+  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
+  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
+
+  using tile_shape_A = TileLeft<float, M, K>;
+  using tile_shape_B = TileRight<float, K, N>;
+  using tile_shape_C = TileAcc<float, M, N>;
+  using tile_shape_O = Tile<Location::Vec, float, M, N>;
+
+  gm_shape_A s0(src0);
+  gm_shape_B s1(src1);
+  gm_shape_C res(dst);
+
+  tile_shape_A d0;
+  tile_shape_B d1;
+  tile_shape_C d2;
+  tile_shape_O d3;
+
+  TLOAD(d0, s0);
+  TLOAD(d1, s1);
+  MATMUL(d2, d0, d1);
+  TCVT(d3, d2);
+  TSTORE(res, d3);
+}
+
+int main() {
+#ifdef __linx
+  constexpr uint16_t M = 4;
+  constexpr uint16_t K = 4;
+  constexpr uint16_t N = 4;
+  constexpr size_t size_A = M * K;
+  constexpr size_t size_B = K * N;
+  constexpr size_t size_C = M * N;
+
+  static int64_t dst_i64[size_C];
+  static int64_t src0_i64[size_A];
+  static int64_t src1_i64[size_B];
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t k = 0; k < K; ++k) {
+      src0_i64[row * K + k] = static_cast<int64_t>((row + 1) * (k + 2));
+    }
+  }
+  for (size_t k = 0; k < K; ++k) {
+    for (size_t col = 0; col < N; ++col) {
+      src1_i64[k * N + col] = static_cast<int64_t>((k + 1) + (col + 1));
+    }
+  }
+  for (size_t i = 0; i < size_C; ++i) {
+    dst_i64[i] = 0;
+  }
+
+  test_linx_row_major<M, N, K, int64_t>(dst_i64, src0_i64, src1_i64);
+
+  for (size_t row = 0; row < M; ++row) {
+    for (size_t col = 0; col < N; ++col) {
+      int64_t expected = 0;
+      for (size_t k = 0; k < K; ++k) {
+        expected += src0_i64[row * K + k] * src1_i64[k * N + col];
+      }
+      if (dst_i64[row * N + col] != expected) {
+        return 1;
+      }
+    }
+  }
+
+  return 0;
+#else
+  const uint16_t M = 16;
+  const uint16_t K = 8;
+  const uint16_t N = 32;
+
+  size_t size_A = M * K;
+  size_t size_B = K * N;
+  size_t size_C = M * N;
+
+  float *dst = (float *)malloc(size_C * sizeof(float));
+  check_mem_alloc(dst);
+  init_dst(dst, size_C);
+
+  float *src0 = (float *)malloc(size_A * sizeof(float));
+  check_mem_alloc(src0);
+  init_src_fp(src0, size_A);
+  float *src1 = (float *)malloc(size_B * sizeof(float));
+  check_mem_alloc(src1);
+  init_src_fp(src1, size_B);
+
+#ifdef LINX_PMC
+  PMC_START();
+#endif
+
+  test<M, N, K>(dst, src0, src1);
+
+#ifdef LINX_PMC
+  PMC_END();
+#endif
+
+  printf("Result:\n");
+  OutArray(dst, size_C);
+
+  free(dst);
+  free(src0);
+  free(src1);
+
+  return 0;
+#endif
+}
diff --git a/test/tileop_api/src/test_MatMulmx.cpp b/benchmarks/api/tileop/src/test_MatMulmx.cpp
similarity index 94%
rename from test/tileop_api/src/test_MatMulmx.cpp
rename to benchmarks/api/tileop/src/test_MatMulmx.cpp
index f017582..60b444e 100644
--- a/test/tileop_api/src/test_MatMulmx.cpp
+++ b/benchmarks/api/tileop/src/test_MatMulmx.cpp
@@ -34,14 +34,14 @@ void test_mx(float *dst, float *src0, float *src0x, float *src1, float *src1x) {
   tile_shape_C  d2;
   tile_shape_O  d3;
 
-  TCOPYIN(d0,  s0);
-  TCOPYIN(d0x, s0x);
-  TCOPYIN(d1,  s1);
-  TCOPYIN(d1x, s1x);
+  TLOAD(d0,  s0);
+  TLOAD(d0x, s0x);
+  TLOAD(d1,  s1);
+  TLOAD(d1x, s1x);
 
   MATMULMX(d2, d0, d0x, d1, d1x);
   TCVT(d3, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 // MATMULMXB: A, B + BX
@@ -69,13 +69,13 @@ void test_mxb(float *dst, float *src0, float *src1, float *src1x) {
   tile_shape_C  d2;
   tile_shape_O  d3;
 
-  TCOPYIN(d0,  s0);
-  TCOPYIN(d1,  s1);
-  TCOPYIN(d1x, s1x);
+  TLOAD(d0,  s0);
+  TLOAD(d1,  s1);
+  TLOAD(d1x, s1x);
 
   MATMULMXB(d2, d0, d1, d1x);
   TCVT(d3, d2);
-  TCOPYOUT(res, d3);
+  TSTORE(res, d3);
 }
 
 int main() {
diff --git a/test/common/Makefile.common b/benchmarks/common/Makefile.common
similarity index 84%
rename from test/common/Makefile.common
rename to benchmarks/common/Makefile.common
index afae221..e745d23 100644
--- a/test/common/Makefile.common
+++ b/benchmarks/common/Makefile.common
@@ -1,7 +1,9 @@
 COMMON_MAKEFILE := $(abspath $(lastword $(MAKEFILE_LIST)))
-TEST_ROOT := $(abspath $(dir $(COMMON_MAKEFILE))/..)
-ROOT := $(abspath $(TEST_ROOT)/..)
-CATEGORY := $(patsubst $(TEST_ROOT)/%,%,$(CURDIR))
+BENCHMARK_ROOT ?= $(abspath $(dir $(COMMON_MAKEFILE))/..)
+ROOT ?= $(abspath $(BENCHMARK_ROOT)/..)
+CATEGORY_ROOT ?= $(BENCHMARK_ROOT)
+TEST_ROOT ?= $(CATEGORY_ROOT)
+CATEGORY := $(patsubst $(CATEGORY_ROOT)/%,%,$(CURDIR))
 CATEGORY_NAME := $(subst /,_,$(CATEGORY))
 OBJ_ROOT := $(abspath $(ROOT)/output)
 CASE_SRC_DIR := $(CATEGORY)/src
@@ -50,12 +52,13 @@ CXX = $(COMPILER_DIR)/clang++
 LINK = $(COMPILER_DIR)/clang++
 DUMP = $(COMPILER_DIR)/llvm-objdump
 COPY = $(COMPILER_DIR)/llvm-objcopy
-CC_O = -c -mlxbc -fenable-matrix -O2 -mllvm -enable-all-vector-as-tilereg=true
+CC_O = -c -target linx64-linx-none-elf -fenable-matrix -O2
+CC_LINK ?= -target linx64-linx-none-elf -nostdlib
 CC_VER ?= -std=c++20
-# COMM_SRC_FILE += $(ROOT)/test/common/_start.s
+# COMM_SRC_FILE += $(BENCHMARK_ROOT)/common/_start.s
 # COMM_SRC_DIR = $(shell dirname $(COMM_SRC_FILE))
 # COMM_OBJ += $(patsubst %.s, %.o, $(subst $(COMM_SRC_DIR), $(OBJ_DIR), $(COMM_SRC_FILE)))
-# CC_LINK += -nostartfiles $(ROOT)/test/common/_start.s
+# CC_LINK += -nostartfiles $(BENCHMARK_ROOT)/common/_start.s
 endif
 
 ifeq ($(PY_LIB), on)
@@ -70,7 +73,7 @@ CC_O += -fPIC
 CC_LINK += -shared
 endif
 
-INCLUDE += -I$(ROOT)/include -I$(ROOT)/test/common -I$(ROOT)/test/kernels/src
+INCLUDE += -I$(ROOT)/include -I$(ROOT)/kernels -I$(BENCHMARK_ROOT)/common -I$(BENCHMARK_ROOT)/common/src
 QEMU = /remote/lms60/c00622284/qemu/LinxBlockModel/build/qemu-linx
 
 CC_O_ALL = $(CC_O) $(CC_VER) $(CC_OPTS)
@@ -111,9 +114,9 @@ $(OBJ_DIR)%.o: $(COMM_SRC_DIR)%.s
 	@mkdir -p $(shell dirname $@)
 	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
 
-$(TARGET): $(OBJ) $(COMM_OBJ)
+$(TARGET): $(OBJ) $(COMM_OBJ) $(EXTRA_OBJ_FILES)
 	@mkdir -p $(shell dirname $@)
-	$(LINK) $(CC_LINK) $(COMM_OBJ) $(OBJ) -o $@
+	$(LINK) $(CC_LINK) $(COMM_OBJ) $(OBJ) $(EXTRA_OBJ_FILES) -o $@
 
 pre_work:
 	@mkdir -p $(OBJ_DIR)
diff --git a/test/common/_start.s b/benchmarks/common/_start.s
similarity index 100%
rename from test/common/_start.s
rename to benchmarks/common/_start.s
diff --git a/test/common/fileop.h b/benchmarks/common/fileop.h
similarity index 100%
rename from test/common/fileop.h
rename to benchmarks/common/fileop.h
diff --git a/test/tileop_api/linxStartEnd.hpp b/benchmarks/common/linxStartEnd.hpp
similarity index 100%
rename from test/tileop_api/linxStartEnd.hpp
rename to benchmarks/common/linxStartEnd.hpp
diff --git a/test/common/multi_tile.hpp b/benchmarks/common/multi_tile.hpp
similarity index 96%
rename from test/common/multi_tile.hpp
rename to benchmarks/common/multi_tile.hpp
index 6ca6a6c..e7ba914 100644
--- a/test/common/multi_tile.hpp
+++ b/benchmarks/common/multi_tile.hpp
@@ -129,19 +129,19 @@ void TCAST(tO &o, tA &a) {
 }
 
 template <is_multi_tile tile_shape, is_gm_iter itfn>
-void TCOPYIN(tile_shape &dst, itfn it) {
+void TLOAD(tile_shape &dst, itfn it) {
   #pragma clang loop unroll(full)
   for (int i = 0; i < tile_shape::NumTiles; ++i) {
     auto gm = it(i);
-    TCOPYIN(dst.Tiles[i], gm);
+    TLOAD(dst.Tiles[i], gm);
   }
 }
 
 template <is_multi_tile tile_shape, is_global_data_v gm_shape>
-void TCOPYIN(tile_shape &dst, gm_shape &src) {
+void TLOAD(tile_shape &dst, gm_shape &src) {
 #ifdef MULTI_REUSE
   typename tile_shape::TileType t;
-  TCOPYIN(t, src);
+  TLOAD(t, src);
   #pragma clang loop unroll(full)
   for (int i = 0; i < tile_shape::NumTiles; ++i) {
     dst.Tiles[i] = t;
@@ -149,17 +149,17 @@ void TCOPYIN(tile_shape &dst, gm_shape &src) {
 #else
   #pragma clang loop unroll(full)
   for (int i = 0; i < tile_shape::NumTiles; ++i) {
-    TCOPYIN(dst.Tiles[i], src);
+    TLOAD(dst.Tiles[i], src);
   }
 #endif
 }
 
 template <class itfn, is_multi_tile tile_shape>
-void TCOPYOUT(itfn it, tile_shape &src) {
+void TSTORE(itfn it, tile_shape &src) {
   #pragma clang loop unroll(full)
   for (int i = 0; i < tile_shape::NumTiles; ++i) {
     auto gm = it(i);
-    TCOPYOUT(gm, src.Tiles[i]);
+    TSTORE(gm, src.Tiles[i]);
   }
 }
 
diff --git a/test/common/readBinary.h b/benchmarks/common/readBinary.h
similarity index 100%
rename from test/common/readBinary.h
rename to benchmarks/common/readBinary.h
diff --git a/test/common/src/assembler.h b/benchmarks/common/src/assembler.h
similarity index 100%
rename from test/common/src/assembler.h
rename to benchmarks/common/src/assembler.h
diff --git a/test/common/src/baremetal_linx.lds.S b/benchmarks/common/src/baremetal_linx.lds.S
similarity index 100%
rename from test/common/src/baremetal_linx.lds.S
rename to benchmarks/common/src/baremetal_linx.lds.S
diff --git a/test/common/src/benchmark.h b/benchmarks/common/src/benchmark.h
similarity index 100%
rename from test/common/src/benchmark.h
rename to benchmarks/common/src/benchmark.h
diff --git a/test/common/src/benchmark_boot_linx.s b/benchmarks/common/src/benchmark_boot_linx.s
similarity index 100%
rename from test/common/src/benchmark_boot_linx.s
rename to benchmarks/common/src/benchmark_boot_linx.s
diff --git a/test/common/src/chip_def.h b/benchmarks/common/src/chip_def.h
similarity index 100%
rename from test/common/src/chip_def.h
rename to benchmarks/common/src/chip_def.h
diff --git a/test/common/src/common.h b/benchmarks/common/src/common.h
similarity index 100%
rename from test/common/src/common.h
rename to benchmarks/common/src/common.h
diff --git a/test/common/src/ldv5.lds.S b/benchmarks/common/src/ldv5.lds.S
similarity index 100%
rename from test/common/src/ldv5.lds.S
rename to benchmarks/common/src/ldv5.lds.S
diff --git a/test/common/src/stackheap_linx.c b/benchmarks/common/src/stackheap_linx.c
similarity index 100%
rename from test/common/src/stackheap_linx.c
rename to benchmarks/common/src/stackheap_linx.c
diff --git a/test/common/src/sys-sections.h b/benchmarks/common/src/sys-sections.h
similarity index 100%
rename from test/common/src/sys-sections.h
rename to benchmarks/common/src/sys-sections.h
diff --git a/test/common/src/sys_linx.c b/benchmarks/common/src/sys_linx.c
similarity index 100%
rename from test/common/src/sys_linx.c
rename to benchmarks/common/src/sys_linx.c
diff --git a/test/common/template_asm.h b/benchmarks/common/template_asm.h
similarity index 100%
rename from test/common/template_asm.h
rename to benchmarks/common/template_asm.h
diff --git a/test/common/tensorwrite.hpp b/benchmarks/common/tensorwrite.hpp
similarity index 100%
rename from test/common/tensorwrite.hpp
rename to benchmarks/common/tensorwrite.hpp
diff --git a/test/common/writeBinary.h b/benchmarks/common/writeBinary.h
similarity index 100%
rename from test/common/writeBinary.h
rename to benchmarks/common/writeBinary.h
diff --git a/test/kernel/orther/Makefile b/benchmarks/kernels/composite/Makefile
similarity index 98%
rename from test/kernel/orther/Makefile
rename to benchmarks/kernels/composite/Makefile
index ec6acc1..3dc0454 100644
--- a/test/kernel/orther/Makefile
+++ b/benchmarks/kernels/composite/Makefile
@@ -66,6 +66,7 @@ endif
 
 CP = cp
 DEST_DIR = ~/elf_subset/subset_matmul_reuse/
+INCLUDE += -I$(ROOT)/kernels/other
 SRC_FILE +=  $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp
 include ../../common/Makefile.common
 
diff --git a/test/kernel/orther/compile_flash_attention.all b/benchmarks/kernels/composite/compile_flash_attention.all
similarity index 100%
rename from test/kernel/orther/compile_flash_attention.all
rename to benchmarks/kernels/composite/compile_flash_attention.all
diff --git a/test/kernel/orther/compile_gemm.all b/benchmarks/kernels/composite/compile_gemm.all
similarity index 100%
rename from test/kernel/orther/compile_gemm.all
rename to benchmarks/kernels/composite/compile_gemm.all
diff --git a/test/kernel/orther/compile_linear.all b/benchmarks/kernels/composite/compile_linear.all
similarity index 100%
rename from test/kernel/orther/compile_linear.all
rename to benchmarks/kernels/composite/compile_linear.all
diff --git a/test/kernel/orther/compile_matmul.all b/benchmarks/kernels/composite/compile_matmul.all
similarity index 100%
rename from test/kernel/orther/compile_matmul.all
rename to benchmarks/kernels/composite/compile_matmul.all
diff --git a/test/kernel/orther/compile_norm.all b/benchmarks/kernels/composite/compile_norm.all
similarity index 100%
rename from test/kernel/orther/compile_norm.all
rename to benchmarks/kernels/composite/compile_norm.all
diff --git a/test/kernel/orther/compile_softmax.all b/benchmarks/kernels/composite/compile_softmax.all
similarity index 100%
rename from test/kernel/orther/compile_softmax.all
rename to benchmarks/kernels/composite/compile_softmax.all
diff --git a/benchmarks/kernels/composite/npu_compile.sh b/benchmarks/kernels/composite/npu_compile.sh
new file mode 100755
index 0000000..08750da
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile.sh
@@ -0,0 +1,12 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+"$SCRIPT_DIR/npu_compile/compile_matmul.all"
+"$SCRIPT_DIR/npu_compile/compile_matmul_reuseA.all"
+"$SCRIPT_DIR/npu_compile/compile_matmul_reuseB.all"
+"$SCRIPT_DIR/npu_compile/compile_matmul_reuseAB.all"
+
+"$SCRIPT_DIR/npu_compile/compile_matmul_dynamic.all"
+"$SCRIPT_DIR/npu_compile/compile_matmul_dynamic_reuseA.all"
+"$SCRIPT_DIR/npu_compile/compile_matmul_dynamic_reuseB.all"
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul.all b/benchmarks/kernels/composite/npu_compile/compile_matmul.all
new file mode 100755
index 0000000..0539888
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all
new file mode 100755
index 0000000..1ae6f6e
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all
new file mode 100644
index 0000000..39b28e0
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuse.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all
new file mode 100755
index 0000000..bb0bdbf
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseA.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all
new file mode 100755
index 0000000..5355f6f
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_dynamic_reuseB.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all
new file mode 100755
index 0000000..1bfd87f
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseA.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all
new file mode 100755
index 0000000..05ee361
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseAB.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all
new file mode 100755
index 0000000..f18b5d6
--- /dev/null
+++ b/benchmarks/kernels/composite/npu_compile/compile_matmul_reuseB.all
@@ -0,0 +1,113 @@
+#! /bin/bash
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
+
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=256 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=256 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=256
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=256
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=256 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=256 tK=64 tN=64
+
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=64
+make -C "$SCRIPT_DIR/.." TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/src/FA.py b/benchmarks/kernels/composite/src/FA.py
similarity index 100%
rename from test/kernel/orther/src/FA.py
rename to benchmarks/kernels/composite/src/FA.py
diff --git a/test/kernel/orther/src/flash_attention.cpp b/benchmarks/kernels/composite/src/flash_attention.cpp
similarity index 100%
rename from test/kernel/orther/src/flash_attention.cpp
rename to benchmarks/kernels/composite/src/flash_attention.cpp
diff --git a/test/kernel/orther/src/flash_attention_mask.cpp b/benchmarks/kernels/composite/src/flash_attention_mask.cpp
similarity index 100%
rename from test/kernel/orther/src/flash_attention_mask.cpp
rename to benchmarks/kernels/composite/src/flash_attention_mask.cpp
diff --git a/test/kernel/orther/src/gemm.cpp b/benchmarks/kernels/composite/src/gemm.cpp
similarity index 100%
rename from test/kernel/orther/src/gemm.cpp
rename to benchmarks/kernels/composite/src/gemm.cpp
diff --git a/test/kernel/orther/src/linear.cpp b/benchmarks/kernels/composite/src/linear.cpp
similarity index 100%
rename from test/kernel/orther/src/linear.cpp
rename to benchmarks/kernels/composite/src/linear.cpp
diff --git a/test/kernel/orther/src/matmul.cpp b/benchmarks/kernels/composite/src/matmul.cpp
similarity index 100%
rename from test/kernel/orther/src/matmul.cpp
rename to benchmarks/kernels/composite/src/matmul.cpp
diff --git a/test/kernel/orther/src/normalization.cpp b/benchmarks/kernels/composite/src/normalization.cpp
similarity index 100%
rename from test/kernel/orther/src/normalization.cpp
rename to benchmarks/kernels/composite/src/normalization.cpp
diff --git a/test/kernel/orther/src/onlinesoftmax.cpp b/benchmarks/kernels/composite/src/onlinesoftmax.cpp
similarity index 83%
rename from test/kernel/orther/src/onlinesoftmax.cpp
rename to benchmarks/kernels/composite/src/onlinesoftmax.cpp
index edee11a..20af0fb 100644
--- a/test/kernel/orther/src/onlinesoftmax.cpp
+++ b/benchmarks/kernels/composite/src/onlinesoftmax.cpp
@@ -3,7 +3,7 @@
 #include "softmax.hpp"
 #include "benchmark.h"
 
-#ifndef globM 
+#ifndef globM
 #define globM 120
 #endif
 
@@ -36,14 +36,14 @@ void onlinesoftmax_test(dtype* dst, dtype* src){
 
     for(int i=0;i<Mb;i++){
       tile_shape2 tsum(0);
-      tile_shape2 tmax(-10000);    
-      tile_shape tsrc;               
+      tile_shape2 tmax(-10000);
+      tile_shape tsrc;
       for (int j = 0; j < Nb; j++) {
         uint32_t offset = i * kTM * kN + j * kTN;
         gm_shape gsrc(src + offset);
         // tile_shape tsrc;
         tDst tdst;
-        TCOPYIN(tsrc, gsrc);
+        TLOAD(tsrc, gsrc);
         OnlineSoftMax(tdst,tsrc,tmax,tsum);
         TEXTRACT(tsrc, tdst, 0, kTN);
         TEXTRACT(tmax, tdst, 0, 0);
@@ -51,20 +51,20 @@ void onlinesoftmax_test(dtype* dst, dtype* src){
       }
         int offset1 = i * kTM * kN;
         // gm_shape1 res(dst);
-        // TCOPYOUT(res,tmax);
-        gm_shape gsrc1(src + offset1);  
-        gm_shape res(dst + offset1); 
-        tile_shape1 d4;                 
-        tile_shape1 d5;                 
-        tile_shape1 d6; 
+        // TSTORE(res,tmax);
+        gm_shape gsrc1(src + offset1);
+        gm_shape res(dst + offset1);
+        tile_shape1 d4;
+        tile_shape1 d5;
+        tile_shape1 d6;
 
-        TCOPYIN(d4, gsrc1);
+        TLOAD(d4, gsrc1);
         TEXPANDCOL(d5, tmax);
         TEXPANDCOL(d6, tsum);
         TSUB(d4, d4, d5);
         TEXP(d4, d4);
         TDIV(d4, d4, d6);
-        TCOPYOUT(res, d4);
+        TSTORE(res, d4);
     }
 }
 
diff --git a/test/kernel/orther/src/softmax.cpp b/benchmarks/kernels/composite/src/softmax.cpp
similarity index 100%
rename from test/kernel/orther/src/softmax.cpp
rename to benchmarks/kernels/composite/src/softmax.cpp
diff --git a/test/kernel/control/Makefile b/benchmarks/kernels/control/Makefile
similarity index 80%
rename from test/kernel/control/Makefile
rename to benchmarks/kernels/control/Makefile
index 0a09736..806ef3f 100644
--- a/test/kernel/control/Makefile
+++ b/benchmarks/kernels/control/Makefile
@@ -1,3 +1,5 @@
+.DEFAULT_GOAL := all
+
 TARGET = $(ELF_HEAD)_$(TESTCASE)$(SUFFIX).elf
 
 # Override target names
@@ -34,12 +36,12 @@ SRC_FILE +=  $(TEST_ROOT)/$(CATEGORY)/$(TESTCASE)/$(TESTCASE).cpp
 endif
 
 # Special handling for hashtable_lookup_simd - embed data as object files
-EXTRA_OBJ_FILES :=
-EXTRA_OBJ_DEPS :=
+EXTRA_OBJ_FILES =
+EXTRA_OBJ_DEPS =
 
 # Data object files location (relative paths)
 DATA_OBJ_DIR := hashtable_lookup_simd/data_obj
-OUTPUT_DATA_OBJ_DIR := ../../../output/kernel/control/hashtable_lookup_simd/data_obj
+OUTPUT_DATA_OBJ_DIR = $(OBJ_ROOT)/$(CATEGORY)/hashtable_lookup_simd/data_obj
 
 # hashtable_lookup_simd uses embedded data (large dataset for 2.55M-entry table)
 ifeq ($(TESTCASE), hashtable_lookup_simd)
@@ -52,10 +54,6 @@ pre_work: build_sim_data_objs
 build_sim_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR)
 
-$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
-
 endif
 
 # hashtable_lookup_simt uses embedded data (same hashtable_lookup_simd data)
@@ -70,10 +68,6 @@ pre_work: build_simt_data_objs
 build_simt_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR)
 
-$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
-
 endif
 
 # hashtable_lookup_simt_v2 uses the same embedded data
@@ -87,15 +81,11 @@ pre_work: build_simt_v2_data_objs
 build_simt_v2_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR)
 
-$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
-
 endif
 
 # hkv uses embedded data
 HKV_DATA_OBJ_DIR := hkv/data_obj
-HKV_OUTPUT_DATA_OBJ_DIR := ../../../output/kernel/control/hkv/data_obj
+HKV_OUTPUT_DATA_OBJ_DIR = $(OBJ_ROOT)/$(CATEGORY)/hkv/data_obj
 
 ifeq ($(TESTCASE), hkv)
 EXTRA_OBJ_FILES += $(HKV_OUTPUT_DATA_OBJ_DIR)/buckets.bin.o
@@ -109,12 +99,13 @@ pre_work: build_hkv_data_objs
 build_hkv_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(HKV_DATA_OBJ_DIR)/build_data_obj.sh $(HKV_DATA_OBJ_DIR) $(HKV_OUTPUT_DATA_OBJ_DIR)
 
-$(HKV_OUTPUT_DATA_OBJ_DIR)/%.o: $(HKV_DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
-
 endif
 
 include ../../common/Makefile.common
 
 DEFINES += $(EXTRA_DEFINES)
+
+ifneq ($(EXTRA_OBJ_FILES),)
+$(EXTRA_OBJ_FILES): pre_work
+	@true
+endif
diff --git a/benchmarks/kernels/control/compile.all b/benchmarks/kernels/control/compile.all
new file mode 100755
index 0000000..6f9135f
--- /dev/null
+++ b/benchmarks/kernels/control/compile.all
@@ -0,0 +1,11 @@
+#! /bin/bash
+make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum16_htscan_gfsim EXTRA_DEFINES="-DkNum=16 -DLINX_HT_CAPACITY=2048 -DLINX_HT_SCAN=1 -DLINX_HT_DIRECT=1 -DFOR_GFSIM" diss
+make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum16_htprobe_gfsim EXTRA_DEFINES="-DkNum=16 -DLINX_HT_CAPACITY=2048 -DLINX_HT_DIRECT=1 -DFOR_GFSIM" diss
+make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum6144_kNumThreads6144_kMaxProbe512_break_debug_on EXTRA_DEFINES="-DkNum=6144 -DkNumThreads=6144 -DMAX_PROBE=512" diss
+make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum6144_kNumThreads6144_kMaxProbe512_break_debug_off EXTRA_DEFINES="-DkNum=6144 -DkNumThreads=6144 -DMAX_PROBE=512 -DFOR_GFSIM" diss
+make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col256_debug_on EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=256" diss
+make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col512_debug_on EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=512" diss
+make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col1024_debug_on EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=1024" diss
+make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col256_debug_off EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=256 -DFOR_GFSIM" diss
+make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col512_debug_off EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=512 -DFOR_GFSIM" diss
+make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col1024_debug_off EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=1024 -DFOR_GFSIM" diss
diff --git a/test/kernel/control/hashfind/hashfind.cpp b/benchmarks/kernels/control/hashfind/hashfind.cpp
similarity index 99%
rename from test/kernel/control/hashfind/hashfind.cpp
rename to benchmarks/kernels/control/hashfind/hashfind.cpp
index fb438ae..3437672 100644
--- a/test/kernel/control/hashfind/hashfind.cpp
+++ b/benchmarks/kernels/control/hashfind/hashfind.cpp
@@ -437,7 +437,7 @@ void runHashFind(int32_t __out__ *out,
 
     // copy in
     KeyGT key_gt(queries);
-    TCOPYIN(queryKeyTile, key_gt);
+    TLOAD(queryKeyTile, key_gt);
 
     // compute hash (writes int64_t byte offsets into probeIdxTile)
     compute_hash_vec(queryKeyTile, probeIdxTile, kCap);
@@ -449,7 +449,7 @@ void runHashFind(int32_t __out__ *out,
 
     // copy out
     OutGT outGlobal(out);
-    TCOPYOUT(outGlobal, outTile);
+    TSTORE(outGlobal, outTile);
 }
 
 template <int kTileRows, int kTileCols, int kCap, int kMaxProbe>
diff --git a/test/accelerator/vec_simt/hashfind/compute_offsets.py b/benchmarks/kernels/control/hashtable_lookup_simd/compute_offsets.py
similarity index 100%
rename from test/accelerator/vec_simt/hashfind/compute_offsets.py
rename to benchmarks/kernels/control/hashtable_lookup_simd/compute_offsets.py
diff --git a/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/.gitignore b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/.gitignore
new file mode 100644
index 0000000..dbf14ab
--- /dev/null
+++ b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/.gitignore
@@ -0,0 +1,3 @@
+*.s
+*.o
+*.data
diff --git a/test/kernel/control/hashtable_lookup_simd/data_obj/build_data_obj.sh b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/build_data_obj.sh
similarity index 51%
rename from test/kernel/control/hashtable_lookup_simd/data_obj/build_data_obj.sh
rename to benchmarks/kernels/control/hashtable_lookup_simd/data_obj/build_data_obj.sh
index 2668212..132c16b 100755
--- a/test/kernel/control/hashtable_lookup_simd/data_obj/build_data_obj.sh
+++ b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/build_data_obj.sh
@@ -1,10 +1,21 @@
 #!/bin/bash
-COMPILER_DIR="${COMPILER_DIR:-/remote/lms60/c00622284/janus/linxisa_compiler_v0.55/linx_blockisa_llvm_musl/bin}"
-DATA_OBJ_DIR="$1"
-OUTPUT_DIR="$2"
+set -euo pipefail
+
+COMPILER_DIR="${COMPILER_DIR:-/usr/bin}"
+LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}"
+DATA_OBJ_DIR="${1:?data object directory required}"
+OUTPUT_DIR="${2:?output directory required}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
 
 mkdir -p "$OUTPUT_DIR"
 
+if [[ ! -f "${DATA_OBJ_DIR}/inserted_slot.data" ||
+      ! -f "${DATA_OBJ_DIR}/lookup_keys.data" ||
+      ! -f "${DATA_OBJ_DIR}/lookup_values.data" ]]; then
+    (cd "$CASE_DIR" && python3 gen_data_simple.py)
+fi
+
 build_one() {
     local name="$1"
     local data_file="${DATA_OBJ_DIR}/${name}.data"
@@ -26,11 +37,11 @@ _binary_${name}_data_end:
 .equ _binary_${name}_data_size, .-_binary_${name}_data_start
 EOF
 
-    $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file"
+    "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file"
 }
 
 build_one "inserted_slot"
 build_one "lookup_keys"
 build_one "lookup_values"
 
-echo "Done building data object files"
\ No newline at end of file
+echo "Done building data object files"
diff --git a/test/kernel/control/hashtable_lookup_simd/data_obj/probe_statistics.md b/benchmarks/kernels/control/hashtable_lookup_simd/data_obj/probe_statistics.md
similarity index 100%
rename from test/kernel/control/hashtable_lookup_simd/data_obj/probe_statistics.md
rename to benchmarks/kernels/control/hashtable_lookup_simd/data_obj/probe_statistics.md
diff --git a/test/kernel/control/hashtable_lookup_simd/gen_data_simple.py b/benchmarks/kernels/control/hashtable_lookup_simd/gen_data_simple.py
similarity index 90%
rename from test/kernel/control/hashtable_lookup_simd/gen_data_simple.py
rename to benchmarks/kernels/control/hashtable_lookup_simd/gen_data_simple.py
index e1405b0..540b887 100644
--- a/test/kernel/control/hashtable_lookup_simd/gen_data_simple.py
+++ b/benchmarks/kernels/control/hashtable_lookup_simd/gen_data_simple.py
@@ -153,29 +153,29 @@ def u64_to_i64(u):
             return u - (1 << 64)
         return u
 
-    # Write simple_inserted_slot.data (hashtable)
+    # Write inserted_slot.data (hashtable)
     output_dir = "data_obj"
-    with open(f"{output_dir}/simple_inserted_slot.data", "wb") as f:
+    with open(f"{output_dir}/inserted_slot.data", "wb") as f:
         for key, value, padding in table:
             # Pack as: key(int64), value(int32), padding(int32)
             f.write(struct.pack("<q", u64_to_i64(key)))   # little-endian int64
             f.write(struct.pack("<i", value))  # little-endian int32
             f.write(struct.pack("<i", padding))  # little-endian int32
 
-    # Write simple_lookup_keys.data
-    with open(f"{output_dir}/simple_lookup_keys.data", "wb") as f:
+    # Write lookup_keys.data
+    with open(f"{output_dir}/lookup_keys.data", "wb") as f:
         for key in query_keys:
             f.write(struct.pack("<q", u64_to_i64(key)))
 
-    # Write simple_lookup_values.data
-    with open(f"{output_dir}/simple_lookup_values.data", "wb") as f:
+    # Write lookup_values.data
+    with open(f"{output_dir}/lookup_values.data", "wb") as f:
         for val in expected_values:
             f.write(struct.pack("<i", val))
 
     print(f"\nGenerated files in {output_dir}/:")
-    print(f"  simple_inserted_slot.data:  {CAP * 16} bytes ({CAP} entries)")
-    print(f"  simple_lookup_keys.data:    {NUM_QUERIES * 8} bytes ({NUM_QUERIES} keys)")
-    print(f"  simple_lookup_values.data:  {NUM_QUERIES * 4} bytes ({NUM_QUERIES} values)")
+    print(f"  inserted_slot.data:  {CAP * 16} bytes ({CAP} entries)")
+    print(f"  lookup_keys.data:    {NUM_QUERIES * 8} bytes ({NUM_QUERIES} keys)")
+    print(f"  lookup_values.data:  {NUM_QUERIES * 4} bytes ({NUM_QUERIES} values)")
 
     # Verify offsets exceed uint16_t
     max_offset = CAP * 16 - 8  # Last key starts 8 bytes before end
@@ -193,4 +193,4 @@ def u64_to_i64(u):
 
 
 if __name__ == "__main__":
-    main()
\ No newline at end of file
+    main()
diff --git a/test/kernel/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp b/benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp
similarity index 99%
rename from test/kernel/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp
rename to benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp
index e7baa40..a55299b 100644
--- a/test/kernel/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp
+++ b/benchmarks/kernels/control/hashtable_lookup_simd/hashtable_lookup_simd.cpp
@@ -1,8 +1,12 @@
 #include <common/pto_tileop.hpp>
 #include "benchmark.h"
+#ifndef __linx
 #include "fileop.h"
+#endif
 #include "template_asm.h"
+#ifndef __linx
 #include <stdio.h>
+#endif
 
 // ============================================================================
 // Tile operation implementations
@@ -453,7 +457,7 @@ void linearProbing(typename HashFindTypes<kTileRows, kTileCols>::TileI32& outTil
         // Compare, advance, and check if all lanes found
         probe_step(queryKeyTile, tableKeyTile, tableValueTile, probeIdxTile, outTile, countTile, kCap, kNotFound);
 
-        TCOPYOUT(countGT, countTile);
+        TSTORE(countGT, countTile);
 
         bool all_done = true;
         for (int g = 0; g < kNumGroups; g++) {
@@ -492,7 +496,7 @@ void runHashFind(int32_t __out__ *out,
 
     // copy in
     KeyGT key_gt(queries);
-    TCOPYIN(queryKeyTile, key_gt);
+    TLOAD(queryKeyTile, key_gt);
 
     // compute hash (writes int64_t byte offsets into probeIdxTile)
     compute_hash_vec(queryKeyTile, probeIdxTile, kCap);
@@ -504,7 +508,7 @@ void runHashFind(int32_t __out__ *out,
 
     // copy out
     OutGT outGlobal(out);
-    TCOPYOUT(outGlobal, outTile);
+    TSTORE(outGlobal, outTile);
 }
 
 template <int kTileRows, int kTileCols, int kCap, int kMaxProbe>
@@ -551,6 +555,7 @@ int main() {
         }
     }
 
+#ifndef __linx
     printf("=== hashtable_lookup_simd ===\n");
     printf("Match: %d/%d (%.4f%%)\n", match, kNum, 100.0 * double(match) / double(kNum));
 
@@ -567,9 +572,6 @@ int main() {
         }
     }
     fflush(stdout);
-#endif
-
-#ifndef FOR_GFSIM
     int ret = (match == kNum) ? 0 : 1;
     if (!ret) {
         printf("PASS\n");
@@ -577,8 +579,9 @@ int main() {
         printf("FAIL\n");
     }
     fflush(stdout);
+#endif
     return ret;
 #else
     return 0;
 #endif
-}
\ No newline at end of file
+}
diff --git a/test/kernel/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md b/benchmarks/kernels/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md
similarity index 64%
rename from test/kernel/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md
rename to benchmarks/kernels/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md
index 72fdd73..9d4d32c 100644
--- a/test/kernel/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md
+++ b/benchmarks/kernels/control/hashtable_lookup_simd/run_hashtable_lookup_simd.md
@@ -9,7 +9,7 @@
 ```
 {WORKSPACE}/
   LinxBlockModel/          ← QEMU (LinxBlockModel)
-  JanusCoreBench/          ← Benchmark
+  SuperNPUBench/          ← Benchmark
 ```
 
 In the commands below, replace `{WORKSPACE}` with the actual path, for example:
@@ -50,7 +50,7 @@ ninja -j 32
 `compile.all` is essentially a series of Makefile invocations:
 
 ```bash
-cd {WORKSPACE}/JanusCoreBench/test/kernel/control
+cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control
 
 # hashtable_lookup_simt
 make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum409600 EXTRA_DEFINES="-DkNum=409600"
@@ -73,9 +73,9 @@ make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum256 EXTRA_DEFINES="-DkNum=256"
 
 ```bash
 # Created automatically; the Makefile runs:
-mkdir -p output/kernel/control/src/
-mkdir -p output/kernel/control/elf/
-mkdir -p output/kernel/control/hashtable_lookup_simd/data_obj/
+mkdir -p output/kernels/control/src/
+mkdir -p output/kernels/control/elf/
+mkdir -p output/kernels/control/hashtable_lookup_simd/data_obj/
 ```
 
 ---
@@ -83,8 +83,8 @@ mkdir -p output/kernel/control/hashtable_lookup_simd/data_obj/
 #### Step 2: Convert the .data files to .o (hashtable_lookup_simd only)
 
 ```bash
-cd {WORKSPACE}/JanusCoreBench/test/kernel/control/hashtable_lookup_simd
-COMPILER_DIR=<your_compiler_dir> bash data_obj/build_data_obj.sh data_obj ../../../output/kernel/control/hashtable_lookup_simd/data_obj
+cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control/hashtable_lookup_simd
+COMPILER_DIR=<your_compiler_dir> bash data_obj/build_data_obj.sh data_obj ../../../output/kernels/control/hashtable_lookup_simd/data_obj
 ```
 
 `build_data_obj.sh` builds each `.data` file under `data_obj/` (`inserted_slot`, `lookup_keys`, `lookup_values`) by invoking:
@@ -96,9 +96,9 @@ $COMPILER_DIR/clang++ -target linx64v5 -c *.s -o output/.../*.o
 
 Generated `.o` files:
 ```
-output/kernel/control/hashtable_lookup_simd/data_obj/inserted_slot.o     # hash table
-output/kernel/control/hashtable_lookup_simd/data_obj/lookup_keys.o      # lookup keys
-output/kernel/control/hashtable_lookup_simd/data_obj/lookup_values.o    # expected values
+output/kernels/control/hashtable_lookup_simd/data_obj/inserted_slot.o     # hash table
+output/kernels/control/hashtable_lookup_simd/data_obj/lookup_keys.o      # lookup keys
+output/kernels/control/hashtable_lookup_simd/data_obj/lookup_values.o    # expected values
 ```
 
 ---
@@ -107,15 +107,15 @@ output/kernel/control/hashtable_lookup_simd/data_obj/lookup_values.o    # 期望
 
 ```bash
 # Actual command (derived automatically by make):
-cd {WORKSPACE}/JanusCoreBench/test/kernel/control
+cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control
 $COMPILER_DIR/clang++ \
   -c -mlxbc -fenable-matrix -O2 \
   -std=c++20 \
-  -I{WORKSPACE}/JanusCoreBench/include \
-  -I{WORKSPACE}/JanusCoreBench/test/common \
+  -I{WORKSPACE}/SuperNPUBench/include \
+  -I{WORKSPACE}/SuperNPUBench/benchmarks/common \
   -D__linx -DENABLE_TENSOR_INSTR \
   hashtable_lookup_simd/hashtable_lookup_simd.cpp \
-  -o output/kernel/control/src/hashtable_lookup_simd.o
+  -o output/kernels/control/src/hashtable_lookup_simd.o
 ```
 
 ---
@@ -125,11 +125,11 @@ $COMPILER_DIR/clang++ \
 ```bash
 $COMPILER_DIR/clang++ \
   -nostartfiles \
-  output/kernel/control/src/hashtable_lookup_simd.o \
-  output/kernel/control/hashtable_lookup_simd/data_obj/inserted_slot.o \
-  output/kernel/control/hashtable_lookup_simd/data_obj/lookup_keys.o \
-  output/kernel/control/hashtable_lookup_simd/data_obj/lookup_values.o \
-  -o output/kernel/control/elf/kernel_control_hashtable_lookup_simd_kNum409600.elf
+  output/kernels/control/src/hashtable_lookup_simd.o \
+  output/kernels/control/hashtable_lookup_simd/data_obj/inserted_slot.o \
+  output/kernels/control/hashtable_lookup_simd/data_obj/lookup_keys.o \
+  output/kernels/control/hashtable_lookup_simd/data_obj/lookup_values.o \
+  -o output/kernels/control/elf/kernels_control_hashtable_lookup_simd_kNum409600.elf
 ```
 
 ---
@@ -138,8 +138,8 @@ $COMPILER_DIR/clang++ \
 
 | ELF name | Build command |
 |----------|----------|
-| `kernel_control_hashtable_lookup_simd_kNum409600.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum409600 EXTRA_DEFINES="-DkNum=409600"` |
-| `kernel_control_hashtable_lookup_simd_kNum256.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum256 EXTRA_DEFINES="-DkNum=256"` |
+| `kernels_control_hashtable_lookup_simd_kNum409600.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum409600 EXTRA_DEFINES="-DkNum=409600"` |
+| `kernels_control_hashtable_lookup_simd_kNum256.elf` | `make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum256 EXTRA_DEFINES="-DkNum=256"` |
 
 ---
 
@@ -148,8 +148,8 @@ $COMPILER_DIR/clang++ \
 ```bash
 cd {WORKSPACE}/LinxBlockModel
 ./build/qemu-linx \
-  ../JanusCoreBench/output/kernel/control/elf/kernel_control_hashtable_lookup_simd_kNum256.elf \
-  2>&1 | tee ../JanusCoreBench/test/kernel/control/hashtable_lookup_simd/run.log
+  ../SuperNPUBench/output/kernels/control/elf/kernels_control_hashtable_lookup_simd_kNum256.elf \
+  2>&1 | tee ../SuperNPUBench/benchmarks/kernels/control/hashtable_lookup_simd/run.log
 ```
 
 **Notes:**
@@ -165,7 +165,7 @@ cd {WORKSPACE}/LinxBlockModel
 
 ```bash
 export COMPILER_DIR=/remote/lms01/j00827727/jcore/compilers/linx_blockisa_llvm_musl0.56.18/bin
-cd {WORKSPACE}/JanusCoreBench/test/kernel/control
+cd {WORKSPACE}/SuperNPUBench/benchmarks/kernels/control
 ./compile.all
 ```
 
@@ -173,7 +173,7 @@ cd {WORKSPACE}/JanusCoreBench/test/kernel/control
 
 ## File Descriptions
 
-Path prefix convention: `{WORKSPACE}/JanusCoreBench/test/kernel/control/hashtable_lookup_simd/`
+Path prefix convention: `{WORKSPACE}/SuperNPUBench/benchmarks/kernels/control/hashtable_lookup_simd/`
 
 | File | Description |
 |------|------|
diff --git a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp
similarity index 59%
rename from test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp
rename to benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp
index 72a7bf1..5ff6dc7 100644
--- a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp
+++ b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt.cpp
@@ -1,5 +1,170 @@
+#if defined(__linx) && defined(FOR_GFSIM) && (defined(LINX_HT_DIRECT) || defined(LINX_HASHTABLE_DIRECT_SMOKE))
+typedef unsigned char uint8_t;
+typedef unsigned int uint32_t;
+typedef int int32_t;
+typedef long long int64_t;
+
+#ifndef kNum
+#define kNum 1024
+#endif
+
+#ifndef MAX_PROBE
+#define MAX_PROBE 512
+#endif
+
+#ifndef LINX_HT_CAPACITY
+#ifdef LINX_HASH_CAPACITY
+#define LINX_HT_CAPACITY LINX_HASH_CAPACITY
+#else
+#define LINX_HT_CAPACITY 2048u
+#endif
+#endif
+
+#ifndef LINX_HT_SCAN
+#ifdef LINX_HASH_LINEAR_SCAN
+#define LINX_HT_SCAN LINX_HASH_LINEAR_SCAN
+#else
+#define LINX_HT_SCAN 0
+#endif
+#endif
+
+struct TableEntry {
+    int64_t key;
+    int32_t value;
+    int32_t padding;
+};
+
+extern "C" {
+    extern const uint8_t _binary_inserted_slot_data_start[];
+    extern const uint8_t _binary_lookup_keys_data_start[];
+    extern const uint8_t _binary_lookup_values_data_start[];
+}
+
+static uint32_t rotl32(uint32_t value, uint32_t shift) {
+    return (value << shift) | (value >> (32u - shift));
+}
+
+static uint32_t murmurhash3_i64(int64_t key) {
+    const uint32_t c1_local = 0xcc9e2d51u;
+    const uint32_t c2_local = 0x1b873593u;
+    const uint32_t c3_local = 0xe6546b64u;
+    unsigned long long bits = (unsigned long long)key;
+    uint32_t h = 0u;
+    uint32_t block = (uint32_t)bits;
+
+    block *= c1_local;
+    block = rotl32(block, 15u);
+    block *= c2_local;
+    h ^= block;
+    h = rotl32(h, 13u);
+    h = h * 5u + c3_local;
+
+    block = (uint32_t)(bits >> 32);
+    block *= c1_local;
+    block = rotl32(block, 15u);
+    block *= c2_local;
+    h ^= block;
+    h = rotl32(h, 13u);
+    h = h * 5u + c3_local;
+
+    h ^= 8u;
+    h ^= h >> 16u;
+    h *= 0x85ebca6bu;
+    h ^= h >> 13u;
+    h *= 0xc2b2ae35u;
+    h ^= h >> 16u;
+    return h;
+}
+
+static uint32_t first_slot(uint32_t hash) {
+#if (LINX_HT_CAPACITY & (LINX_HT_CAPACITY - 1u)) == 0
+    return hash & (LINX_HT_CAPACITY - 1u);
+#else
+    return hash % LINX_HT_CAPACITY;
+#endif
+}
+
+int main() {
+    const TableEntry* table =
+        (const TableEntry*)_binary_inserted_slot_data_start;
+    const int64_t* keys =
+        (const int64_t*)_binary_lookup_keys_data_start;
+    const int32_t* expected =
+        (const int32_t*)_binary_lookup_values_data_start;
+
+    int32_t mismatches = 0;
+    for (int32_t i = 0; i < kNum; ++i) {
+#if LINX_HT_SCAN
+        int32_t found = -1;
+        for (uint32_t slot = 0; slot < LINX_HT_CAPACITY; ++slot) {
+            const TableEntry* entry = table + slot;
+            if (entry->key == keys[i]) {
+                found = entry->value;
+                break;
+            }
+        }
+#else
+        uint32_t slot = first_slot(murmurhash3_i64(keys[i]));
+        int32_t found = -1;
+        for (int32_t probe = 0; probe < MAX_PROBE; ++probe) {
+            const TableEntry* entry = table + slot;
+            if (entry->key == keys[i]) {
+                found = entry->value;
+                break;
+            }
+            if ((unsigned long long)entry->key == 0x8000000000000000ull) {
+                break;
+            }
+            ++slot;
+            if (slot == LINX_HT_CAPACITY) {
+                slot = 0;
+            }
+        }
+#endif
+        if (found != expected[i]) {
+            ++mismatches;
+        }
+    }
+    return mismatches == 0 ? 0 : 1;
+}
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+    if (code == 0) {
+        __asm__ volatile(
+            "BSTART.STD\n"
+            "lui 65545, ->u\n"
+            "lui 5, ->t\n"
+            "addi t#1, 1365, ->t\n"
+            "c.swi t#1, [u#1, 0]\n"
+            "BSTOP\n"
+            ::: "memory");
+    } else {
+        __asm__ volatile(
+            "BSTART.STD\n"
+            "lui 65545, ->u\n"
+            "lui 19, ->t\n"
+            "addi t#1, 819, ->t\n"
+            "c.swi t#1, [u#1, 0]\n"
+            "BSTOP\n"
+            ::: "memory");
+    }
+    while (1) {
+        __asm__ volatile("" ::: "memory");
+    }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+    linx_supernpu_exit((uint32_t)main());
+}
+
+#else
+
 #include <common/pto_tileop.hpp>
 #include "benchmark.h"
+#include "template_asm.h"
+#ifndef __linx
+#include <stdio.h>
+#endif
 
 // ============================================================================
 // ELF Data layout — embedded binary data produced by build_data_obj.sh
@@ -153,6 +318,7 @@ int main() {
         }
     }
 
+#ifndef __linx
     if (mismatch_count > 0) {
         printf("\n=== Mismatching keys (%d total) ===\n", mismatch_count);
         printf("%7s  %22s  %10s  %10s\n", "Idx", "Key", "Got", "Expected");
@@ -167,8 +333,11 @@ int main() {
     printf("\n=== hashtable_lookup_simt ===\n");
     printf("Match: %d/%d (%d %%)\n", match, kNum, int(100 * double(match) / double(kNum)));
     fflush(stdout);
+#endif
     return (match == kNum) ? 0 : 1;
 #else
     return 0;
 #endif
 }
+
+#endif
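Review note: the direct `FOR_GFSIM` path added above hashes each 64-bit key with a MurmurHash3-style 32-bit mix (two 32-bit blocks, seed 0, length 8). It then probes linearly until it finds the key, reaches the empty sentinel `0x8000000000000000`, or exhausts `MAX_PROBE`. A minimal host-side sketch of the same lookup in Python, which may help when cross-checking generated data, could look like this. The tuple-based table layout and the `build_table` helper are hypothetical and only for illustration; the real data generator may insert entries differently.

```python
MASK32 = 0xFFFFFFFF
EMPTY_KEY = -(1 << 63)  # signed view of 0x8000000000000000, the empty-slot sentinel


def rotl32(value, shift):
    """Rotate a 32-bit value left by `shift` bits (0 < shift < 32)."""
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def murmurhash3_i64(key):
    """Mirror of the C++ murmurhash3_i64: two 32-bit blocks, seed 0, length 8."""
    bits = key & 0xFFFFFFFFFFFFFFFF
    h = 0
    for block in (bits & MASK32, bits >> 32):  # low word first, then high word
        block = (block * 0xCC9E2D51) & MASK32
        block = rotl32(block, 15)
        block = (block * 0x1B873593) & MASK32
        h ^= block
        h = rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    h ^= 8  # key length in bytes
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def lookup(table, key, max_probe=512):
    """Linear-probe lookup over a list of (key, value) tuples; -1 if absent."""
    capacity = len(table)
    slot = murmurhash3_i64(key) % capacity
    for _ in range(max_probe):
        slot_key, value = table[slot]
        if slot_key == key:
            return value
        if slot_key == EMPTY_KEY:
            return -1
        slot = (slot + 1) % capacity
    return -1


def build_table(items, capacity):
    """Hypothetical helper: insert (key, value) pairs with linear probing.

    Assumes fewer items than `capacity`, otherwise the probe never terminates.
    """
    table = [(EMPTY_KEY, 0)] * capacity
    for key, value in items:
        slot = murmurhash3_i64(key) % capacity
        while table[slot][0] != EMPTY_KEY:
            slot = (slot + 1) % capacity
        table[slot] = (key, value)
    return table
```

For a power-of-two capacity, `hash % capacity` gives the same slot as the `hash & (LINX_HT_CAPACITY - 1u)` branch of `first_slot`, so the sketch covers both preprocessor branches.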
diff --git a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp
similarity index 97%
rename from test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp
rename to benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp
index d10bbba..cd29221 100644
--- a/test/kernel/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp
+++ b/benchmarks/kernels/control/hashtable_lookup_simt/hashtable_lookup_simt_v2.cpp
@@ -1,5 +1,9 @@
 #include <common/pto_tileop.hpp>
 #include "benchmark.h"
+#include "template_asm.h"
+#ifndef __linx
+#include <stdio.h>
+#endif
 
 // ============================================================================
 // ELF Data layout — embedded binary data produced by build_data_obj.sh
@@ -141,7 +145,7 @@ int main() {
         kEntrySize, kMaxProbe);
     BENCHEND;
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     // Print SIMT kernel computed hash values for first 64 keys
     printf("\n=== SIMT kernel hash values (first 64 keys) ===\n");
     printf("%4s  %22s  %10s  %7s  %10s  %10s\n", "Idx", "Key", "Hash(hex)", "Slot", "SIMT_out", "Expected");
@@ -164,7 +168,7 @@ int main() {
         }
     }
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     if (mismatch_count > 0) {
         printf("\n=== Mismatching keys (%d total) ===\n", mismatch_count);
         printf("%7s  %22s  %10s  %7s  %10s  %10s\n", "Idx", "Key", "Hash(hex)", "Slot", "Got", "Expected");
diff --git a/benchmarks/kernels/control/hkv/data_obj/.gitignore b/benchmarks/kernels/control/hkv/data_obj/.gitignore
new file mode 100644
index 0000000..75f1c9f
--- /dev/null
+++ b/benchmarks/kernels/control/hkv/data_obj/.gitignore
@@ -0,0 +1,3 @@
+*.s
+*.o
+*.bin
diff --git a/test/kernel/control/hkv/data_obj/build_data_obj.sh b/benchmarks/kernels/control/hkv/data_obj/build_data_obj.sh
similarity index 56%
rename from test/kernel/control/hkv/data_obj/build_data_obj.sh
rename to benchmarks/kernels/control/hkv/data_obj/build_data_obj.sh
index ec4819c..7771bdb 100755
--- a/test/kernel/control/hkv/data_obj/build_data_obj.sh
+++ b/benchmarks/kernels/control/hkv/data_obj/build_data_obj.sh
@@ -1,10 +1,23 @@
 #!/bin/bash
-COMPILER_DIR="${COMPILER_DIR:-/remote/lms01/j00827727/jcore/compilers/linx_blockisa_llvm_musl0.56.16/bin}"
-DATA_OBJ_DIR="$1"
-OUTPUT_DIR="$2"
+set -euo pipefail
+
+COMPILER_DIR="${COMPILER_DIR:-/usr/bin}"
+LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}"
+DATA_OBJ_DIR="${1:?data object directory required}"
+OUTPUT_DIR="${2:?output directory required}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
 
 mkdir -p "$OUTPUT_DIR"
 
+if [[ ! -f "${DATA_OBJ_DIR}/buckets.bin" ||
+      ! -f "${DATA_OBJ_DIR}/buckets_size.bin" ||
+      ! -f "${DATA_OBJ_DIR}/lookup_keys.bin" ||
+      ! -f "${DATA_OBJ_DIR}/lookedup_values.bin" ||
+      ! -f "${DATA_OBJ_DIR}/key_score_digest.bin" ]]; then
+    (cd "$CASE_DIR" && python3 gen_data.py)
+fi
+
 build_one() {
     local name="$1"
     local data_file="${DATA_OBJ_DIR}/${name}"
@@ -28,7 +41,7 @@ _binary_${sym_name}_end:
 .equ _binary_${sym_name}_size, .-_binary_${sym_name}_start
 EOF
 
-    $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file"
+    "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file"
 }
 
 build_one "buckets.bin"
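Review note: `build_one` wraps each data file in a small assembly stub that exposes `_binary_<name>_start`, `_binary_<name>_end`, and `_binary_<name>_size` symbols, which the C++ sources then declare as `extern`. A rough Python sketch of that stub generation is shown below. Only the `_end` label and the `.equ ..._size` line are visible in this hunk; the `.section`, `.global`, and `.incbin` lines, and the rule that maps a file name to a symbol stem, are assumptions made for illustration.

```python
import re


def symbol_stem(filename):
    """Map a data file name to the stem used in _binary_<stem>_* symbols.

    Assumption: every character that is not alphanumeric or '_' becomes '_',
    which matches names such as _binary_inserted_slot_data_start.
    """
    return re.sub(r"[^0-9A-Za-z_]", "_", filename)


def render_data_asm(filename):
    """Return assembly text that embeds `filename` and labels its bounds."""
    sym = symbol_stem(filename)
    lines = [
        ".section .rodata",                  # assumed section
        f".global _binary_{sym}_start",      # assumed visibility directives
        f".global _binary_{sym}_end",
        f"_binary_{sym}_start:",
        f'.incbin "{filename}"',             # assumed embedding directive
        f"_binary_{sym}_end:",
        f".equ _binary_{sym}_size, .-_binary_{sym}_start",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_data_asm("buckets.bin")` yields text that could be written to a `.s` file and assembled with the `clang++ -target "$LINX_TARGET" -c` command from the script.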
diff --git a/test/kernel/control/hkv/gen_data.py b/benchmarks/kernels/control/hkv/gen_data.py
similarity index 100%
rename from test/kernel/control/hkv/gen_data.py
rename to benchmarks/kernels/control/hkv/gen_data.py
diff --git a/test/kernel/control/hkv/hkv.cpp b/benchmarks/kernels/control/hkv/hkv.cpp
similarity index 100%
rename from test/kernel/control/hkv/hkv.cpp
rename to benchmarks/kernels/control/hkv/hkv.cpp
diff --git a/test/kernel/element_wise/gelu/Makefile b/benchmarks/kernels/element_wise/gelu/Makefile
similarity index 100%
rename from test/kernel/element_wise/gelu/Makefile
rename to benchmarks/kernels/element_wise/gelu/Makefile
diff --git a/test/kernel/element_wise/gelu/compile.all b/benchmarks/kernels/element_wise/gelu/compile.all
similarity index 100%
rename from test/kernel/element_wise/gelu/compile.all
rename to benchmarks/kernels/element_wise/gelu/compile.all
diff --git a/test/kernel/element_wise/gelu/src/gelu.cpp b/benchmarks/kernels/element_wise/gelu/src/gelu.cpp
similarity index 97%
rename from test/kernel/element_wise/gelu/src/gelu.cpp
rename to benchmarks/kernels/element_wise/gelu/src/gelu.cpp
index 1dd8af9..2e70935 100644
--- a/test/kernel/element_wise/gelu/src/gelu.cpp
+++ b/benchmarks/kernels/element_wise/gelu/src/gelu.cpp
@@ -1,9 +1,14 @@
 #include <common/pto_tileop.hpp>
 
+#ifdef __linx
+#include <stddef.h>
+#include <stdint.h>
+#else
 #include <cstdint>
 #include <cstdio>
 
 #include "fileop.h"
+#endif
 #include "element_wise/gelu.hpp"
 
 
@@ -76,4 +81,4 @@ int main() {
     // #ifdef RES_CHECK
     // writeBinaryFile(OUTPUT_PATH, (uint8_t*)output, gMs * sizeof(dtype));
     // #endif
-}
\ No newline at end of file
+}
diff --git a/test/kernel/element_wise/gelu/src/gelu_data_compare.py b/benchmarks/kernels/element_wise/gelu/src/gelu_data_compare.py
similarity index 100%
rename from test/kernel/element_wise/gelu/src/gelu_data_compare.py
rename to benchmarks/kernels/element_wise/gelu/src/gelu_data_compare.py
diff --git a/test/kernel/element_wise/gelu/src/gen_gelu_data.py b/benchmarks/kernels/element_wise/gelu/src/gen_gelu_data.py
similarity index 100%
rename from test/kernel/element_wise/gelu/src/gen_gelu_data.py
rename to benchmarks/kernels/element_wise/gelu/src/gen_gelu_data.py
diff --git a/test/kernel/element_wise/gelu/src/tmp.list b/benchmarks/kernels/element_wise/gelu/src/tmp.list
similarity index 100%
rename from test/kernel/element_wise/gelu/src/tmp.list
rename to benchmarks/kernels/element_wise/gelu/src/tmp.list
diff --git a/test/kernel/fusion_op/Makefile b/benchmarks/kernels/fusion/Makefile
similarity index 100%
rename from test/kernel/fusion_op/Makefile
rename to benchmarks/kernels/fusion/Makefile
diff --git a/test/kernel/fusion_op/compile.all b/benchmarks/kernels/fusion/compile.all
similarity index 100%
rename from test/kernel/fusion_op/compile.all
rename to benchmarks/kernels/fusion/compile.all
diff --git a/test/kernel/fusion_op/src/fa_hif4.cpp b/benchmarks/kernels/fusion/src/fa_hif4.cpp
similarity index 98%
rename from test/kernel/fusion_op/src/fa_hif4.cpp
rename to benchmarks/kernels/fusion/src/fa_hif4.cpp
index 00899e0..8307e51 100644
--- a/test/kernel/fusion_op/src/fa_hif4.cpp
+++ b/benchmarks/kernels/fusion/src/fa_hif4.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-// #include "../../include/accelerator_fusion.h"
+// #include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 #include "fileop.h"
 #include "fa_mx/fa_hif4.hpp"
diff --git a/test/kernel/gemm/matmul/Makefile b/benchmarks/kernels/gemm/matmul/Makefile
similarity index 89%
rename from test/kernel/gemm/matmul/Makefile
rename to benchmarks/kernels/gemm/matmul/Makefile
index aba422d..25dafb3 100644
--- a/test/kernel/gemm/matmul/Makefile
+++ b/benchmarks/kernels/gemm/matmul/Makefile
@@ -4,11 +4,11 @@ ifeq ($(TYPE), HIF4_HIF4)
    OPFILE = $(notdir $(OPPATH))
    OPNAME = $(patsubst %.cpp,%, $(OPFILE))
    DEFINES += -DglobM=$(M) -DglobN=$(N) -DglobK=$(K) -DtilM=$(tM) -DtilN=$(tN) -DtilK=$(tK)
-      Batch = 1
-      ifneq ($(B), )
-      Batch = $(B)
-      DEFINES += -DBatch=$(B)
-      endif
+   Batch = 1
+   ifneq ($(B), )
+   Batch = $(B)
+   endif
+   DEFINES += -DBatch=$(Batch)
    ifneq ($(VER), )
       DEFINES += -D$(VER)
    else 
@@ -29,11 +29,11 @@ ifeq ($(TYPE), A16W4)
    OPFILE = $(notdir $(OPPATH))
    OPNAME = $(patsubst %.cpp,%, $(OPFILE))
    DEFINES += -DglobM=$(M) -DglobN=$(N) -DglobK=$(K) -DtilM=$(tM) -DtilN=$(tN) -DtilK=$(tK)
-      Batch = 1
-      ifneq ($(B), )
-      Batch = $(B)
-      endif
-   DEFINES += -DBatch=$(B)
+   Batch = 1
+   ifneq ($(B), )
+   Batch = $(B)
+   endif
+   DEFINES += -DBatch=$(Batch)
    # TARGET = $(OPNAME)_B$(Batch)_M$(M)_N$(N)_K$(K)_tM$(tM)_tN$(tN)_tK$(tK).elf
    TARGET = $(ELF_HEAD)/$(TESTCASE)_$(TYPE)_$(MODE)_B$(Batch)_M$(M)_N$(N)_K$(K)_tM$(tM)_tN$(tN)_tK$(tK).elf
 endif
@@ -56,4 +56,4 @@ DEST_DIR = ~/elf_subset/subset_matmul_reuse/
 $(TARGET): clean $(LINK_SCRIPT) $(COMM_OBJ) $(OBJ) $(EXTRA_OBJ_FILES)
 	@mkdir -p $(shell dirname $@)
 	$(LINK) $(CC_LINK) $(OBJ) $(COMM_OBJ) $(EXTRA_OBJ_FILES) -o $@
-	$(CP) $(TARGET) $(DEST_DIR)
\ No newline at end of file
+	$(CP) $(TARGET) $(DEST_DIR)
diff --git a/test/kernel/gemm/matmul/compile.all b/benchmarks/kernels/gemm/matmul/compile.all
similarity index 100%
rename from test/kernel/gemm/matmul/compile.all
rename to benchmarks/kernels/gemm/matmul/compile.all
diff --git a/test/kernel/gemm/matmul/src/A16W4.cpp b/benchmarks/kernels/gemm/matmul/src/A16W4.cpp
similarity index 63%
rename from test/kernel/gemm/matmul/src/A16W4.cpp
rename to benchmarks/kernels/gemm/matmul/src/A16W4.cpp
index 7defbf1..f52b5d8 100644
--- a/test/kernel/gemm/matmul/src/A16W4.cpp
+++ b/benchmarks/kernels/gemm/matmul/src/A16W4.cpp
@@ -1,8 +1,49 @@
 #include <common/pto_tileop.hpp>
+#ifdef __linx
+#include <stddef.h>
+#include <stdint.h>
+#else
 #include <cstring>
 #include "fileop.h"
 #include "common.h"
 #include "benchmark.h"
+#endif
+
+#ifdef __linx
+#define BENCHSTART __asm__ __volatile__("B.HINT TRACE.begin\n" : : :);
+#define BENCHEND __asm__ __volatile__("B.HINT TRACE.end\n" : : :);
+
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+    __asm__ volatile("" ::: "memory");
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
 
 #ifndef globM 
 #define globM 120
@@ -68,4 +109,4 @@ int main() {
     #endif
 
   return 0;
-}
\ No newline at end of file
+}
diff --git a/test/kernel/gemm/matmul/src/HiF4_HiF4.cpp b/benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp
similarity index 73%
rename from test/kernel/gemm/matmul/src/HiF4_HiF4.cpp
rename to benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp
index 9d7a351..6e23751 100644
--- a/test/kernel/gemm/matmul/src/HiF4_HiF4.cpp
+++ b/benchmarks/kernels/gemm/matmul/src/HiF4_HiF4.cpp
@@ -1,8 +1,49 @@
 #include <common/pto_tileop.hpp>
+#ifdef __linx
+#include <stddef.h>
+#include <stdint.h>
+#else
 #include <cstring>
 #include "fileop.h"
 #include "common.h"
 #include "benchmark.h"
+#endif
+
+#ifdef __linx
+#define BENCHSTART __asm__ __volatile__("B.HINT TRACE.begin\n" : : :);
+#define BENCHEND __asm__ __volatile__("B.HINT TRACE.end\n" : : :);
+
+int main();
+
+static inline __attribute__((noreturn)) void linx_supernpu_exit(uint32_t code) {
+  if (code == 0) {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 5, ->t\n"
+        "addi t#1, 1365, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  } else {
+    __asm__ volatile(
+        "BSTART.STD\n"
+        "lui 65545, ->u\n"
+        "lui 19, ->t\n"
+        "addi t#1, 819, ->t\n"
+        "c.swi t#1, [u#1, 0]\n"
+        "BSTOP\n"
+        ::: "memory");
+  }
+  while (1) {
+    __asm__ volatile("" ::: "memory");
+  }
+}
+
+extern "C" __attribute__((noreturn, section(".text._start"))) void _start(void) {
+  linx_supernpu_exit(static_cast<uint32_t>(main()));
+}
+#endif
 
 #ifndef globM 
 #define globM 120
@@ -82,4 +123,4 @@ int main() {
   #endif
 
   return 0;
-}
\ No newline at end of file
+}
diff --git a/test/kernel/memory/broadcast/Makefile b/benchmarks/kernels/memory/broadcast/Makefile
similarity index 98%
rename from test/kernel/memory/broadcast/Makefile
rename to benchmarks/kernels/memory/broadcast/Makefile
index cffedf1..c257021 100644
--- a/test/kernel/memory/broadcast/Makefile
+++ b/benchmarks/kernels/memory/broadcast/Makefile
@@ -3,7 +3,7 @@ DEFINES += -DDType=$(DType) -DtMs=$(tMs) -DMAX_DIMs=$(MAX_DIMs) -DIN_SHAPEs=$(IN
 TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_DType$(DType)_tM$(tMs)_IN_SHAPE$(IN_SHAPE_NAME)_OUT_SHAPE$(OUT_SHAPE_NAME).elf
 endif
 
-ifeq ($(TESTCASE), broadcast_nocopyout)
+ifeq ($(TESTCASE), broadcast_nostore)
 DEFINES += -DDType=$(DType) -DtMs=$(tMs) -DMAX_DIMs=$(MAX_DIMs) -DIN_SHAPEs=$(IN_SHAPEs) -DOUT_SHAPEs=$(OUT_SHAPEs) -DIN_DIMs=$(IN_DIMs) -DOUT_DIMs=$(OUT_DIMs) -DgIMs=$(gIMs) -DgOMs=$(gOMs)
 TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_DType$(DType)_tM$(tMs)_IN_SHAPE$(IN_SHAPE_NAME)_OUT_SHAPE$(OUT_SHAPE_NAME).elf
 endif
diff --git a/test/kernel/memory/broadcast/compile.all b/benchmarks/kernels/memory/broadcast/compile.all
similarity index 89%
rename from test/kernel/memory/broadcast/compile.all
rename to benchmarks/kernels/memory/broadcast/compile.all
index 9206219..cebb560 100755
--- a/test/kernel/memory/broadcast/compile.all
+++ b/benchmarks/kernels/memory/broadcast/compile.all
@@ -3,98 +3,98 @@ make    TESTCASE=broadcast_07 DType=__half tMs=2048 MAX_DIMs=2 \
         IN_SHAPEs=1334,1 OUT_SHAPEs=1334,129 IN_DIMs=2 OUT_DIMs=2 \
         gIMs=1334*1 gOMs=1334*129 \
         TESTNAME=broadcast_to_07 \
-        IN_SHAPE_NAME=1334_1 OUT_SHAPE_NAME=1334_129 res_check=off diss 
+        IN_SHAPE_NAME=1334_1 OUT_SHAPE_NAME=1334_129 res_check=off diss
 
 # # 2\broadcast_to_ABA_019_new fp16
 make    TESTCASE=broadcast_019 DType=__half tMs=2048 MAX_DIMs=3 \
         IN_SHAPEs=1280,1,49 OUT_SHAPEs=1280,8,49 IN_DIMs=3 OUT_DIMs=3 \
         gIMs=1280*1*49 gOMs=1280*8*49 \
         TESTNAME=broadcast_to_ABA_019_new \
-        IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss 
+        IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss
 
 # 3\BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004  fp16
 make    TESTCASE=broadcast_Hunyuan DType=__half tMs=2048 MAX_DIMs=5 \
         IN_SHAPEs=1,1,1,65,128 OUT_SHAPEs=1,1,7,65,128 IN_DIMs=5 OUT_DIMs=5 \
         gIMs=1*1*1*65*128 gOMs=1*1*7*65*128 \
         TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \
-        IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128  res_check=off diss 
+        IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128  res_check=off diss
 
 # 4\Bbroadcast_to_BABA_039  fp16 int32
 make    TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \
         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
         gIMs=1*128*1*16 gOMs=64*128*8*16 \
         TESTNAME=Bbroadcast_to_BABA_039 \
-        IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss 
+        IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss
 
 make    TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \
         IN_SHAPEs=1,8192,1,16 OUT_SHAPEs=1,8192,8,16 IN_DIMs=4 OUT_DIMs=4 \
         gIMs=1*8192*1*16 gOMs=1*8192*8*16 \
         TESTNAME=Bbroadcast_to_BABA_039 \
-        IN_SHAPE_NAME=1_8192_1_16 OUT_SHAPE_NAME=1_8192_8_16  res_check=off diss 
+        IN_SHAPE_NAME=1_8192_1_16 OUT_SHAPE_NAME=1_8192_8_16  res_check=off diss
 
 # make    TESTCASE=broadcast_039 DType=int32_t tMs=2048 MAX_DIMs=4 \
 #         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*128*1*16 gOMs=64*128*8*16 \
 #         TESTNAME=Bbroadcast_to_BABA_039 \
-#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss 
+#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss
 
 
 
 
 
 
-# make    TESTCASE=broadcast_nocopyout DType=int32_t tMs=512 MAX_DIMs=2 \
+# make    TESTCASE=broadcast_nostore DType=int32_t tMs=512 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
 # # NoCopyout performance analysis
 # # # 1\broadcast_to_07 fp16
-# make    TESTCASE=broadcast_nocopyout DType=__half tMs=128 MAX_DIMs=2 \
+# make    TESTCASE=broadcast_nostore DType=__half tMs=128 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
-# make    TESTCASE=broadcast_nocopyout DType=__half tMs=512 MAX_DIMs=2 \
+# make    TESTCASE=broadcast_nostore DType=__half tMs=512 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
-# make    TESTCASE=broadcast_nocopyout DType=__half tMs=2048 MAX_DIMs=2 \
+# make    TESTCASE=broadcast_nostore DType=__half tMs=2048 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
 # # # 2\broadcast_to_ABA_019_new fp16
-# make    TESTCASE=broadcast_nocopyout DType=__half tMs=2048 MAX_DIMs=3 \
+# make    TESTCASE=broadcast_nostore DType=__half tMs=2048 MAX_DIMs=3 \
 #         IN_SHAPEs=1280,1,49 OUT_SHAPEs=1280,8,49 IN_DIMs=3 OUT_DIMs=3 \
 #         gIMs=1280*1*49 gOMs=1280*8*49 \
 #         TESTNAME=broadcast_to_ABA_019_new \
-#         IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss 
+#         IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss
 
 # # # 3\BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004  fp16
-# make    TESTCASE=broadcast_nocopyout DType=__half tMs=2048 MAX_DIMs=5 \
+# make    TESTCASE=broadcast_nostore DType=__half tMs=2048 MAX_DIMs=5 \
 #         IN_SHAPEs=1,1,1,80,128 OUT_SHAPEs=1,1,7,80,128 IN_DIMs=5 OUT_DIMs=5 \
 #         gIMs=1*1*1*80*128 gOMs=1*1*7*80*128 \
 #         TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \
-#         IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128  res_check=off diss 
+#         IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128  res_check=off diss
 
 # # 4\Bbroadcast_to_BABA_039   fp16 int32
-# make    TESTCASE=broadcast_nocopyout DType=__half tMs=512 MAX_DIMs=4 \
+# make    TESTCASE=broadcast_nostore DType=__half tMs=512 MAX_DIMs=4 \
 #         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*128*1*16 gOMs=64*128*8*16 \
 #         TESTNAME=Bbroadcast_to_BABA_039 \
-#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss 
+#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss
 
-# make    TESTCASE=broadcast_nocopyout DType=int32_t tMs=512 MAX_DIMs=4 \
+# make    TESTCASE=broadcast_nostore DType=int32_t tMs=512 MAX_DIMs=4 \
 #         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*128*1*16 gOMs=64*128*8*16 \
 #         TESTNAME=Bbroadcast_to_BABA_039 \
-#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss 
+#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss
 
 
 
@@ -104,70 +104,70 @@ make    TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
-        
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
+
 # make    TESTCASE=broadcast DType=__half tMs=256 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
-        
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
+
 # make    TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
-        
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
+
 # make    TESTCASE=broadcast DType=__half tMs=768 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
 # make    TESTCASE=broadcast DType=__half tMs=1024 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
 # make    TESTCASE=broadcast DType=__half tMs=1536 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
 # make    TESTCASE=broadcast DType=__half tMs=2048 MAX_DIMs=2 \
 #         IN_SHAPEs=1042,1 OUT_SHAPEs=1042,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=1042*1 gOMs=1042*129 \
 #         TESTNAME=broadcast_to_07 \
-#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss 
+#         IN_SHAPE_NAME=1042_1 OUT_SHAPE_NAME=1042_129 res_check=off diss
 
 # # # 2\broadcast_to_ABA_019_new fp16
 # make    TESTCASE=broadcast DType=__half tMs=2048 MAX_DIMs=3 \
 #         IN_SHAPEs=1280,1,49 OUT_SHAPEs=1280,8,49 IN_DIMs=3 OUT_DIMs=3 \
 #         gIMs=1280*1*49 gOMs=1280*8*49 \
 #         TESTNAME=broadcast_to_ABA_019_new \
-#         IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss 
+#         IN_SHAPE_NAME=1280_1_49 OUT_SHAPE_NAME=1280_8_49 res_check=off diss
 
 # # # 3\BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004  fp16
 # make    TESTCASE=broadcast DType=__half tMs=2048 MAX_DIMs=5 \
 #         IN_SHAPEs=1,1,1,65,128 OUT_SHAPEs=1,1,7,65,128 IN_DIMs=5 OUT_DIMs=5 \
 #         gIMs=1*1*1*65*128 gOMs=1*1*7*65*128 \
 #         TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \
-#         IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128  res_check=off diss 
+#         IN_SHAPE_NAME=1_1_1_65_128 OUT_SHAPE_NAME=1_1_7_65_128  res_check=off diss
 
 # # 4\Bbroadcast_to_BABA_039  fp16 int32
 # make    TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=4 \
 #         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*128*1*16 gOMs=64*128*8*16 \
 #         TESTNAME=Bbroadcast_to_BABA_039 \
-#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss 
+#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss
 
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=4 \
 #         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*128*1*16 gOMs=64*128*8*16 \
 #         TESTNAME=Bbroadcast_to_BABA_039 \
-#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss 
+#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16  res_check=off diss
 
 # gfrun验证通过
 
@@ -175,7 +175,7 @@ make    TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \
 #         IN_SHAPEs=1,1,1,80,128 OUT_SHAPEs=1,1,7,80,128 IN_DIMs=5 OUT_DIMs=5 \
 #         gIMs=1*1*1*80*128 gOMs=1*1*7*80*128 \
 #         TESTNAME=BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004 \
-#         IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128  res_check=on diss 
+#         IN_SHAPE_NAME=1_1_1_80_128 OUT_SHAPE_NAME=1_1_7_80_128  res_check=on diss
 
 # make    TESTCASE=broadcast_mscatter DType=int32_t tMs=512 MAX_DIMs=3 \
 #         IN_SHAPEs=64,1,49 OUT_SHAPEs=64,8,49 IN_DIMs=3 OUT_DIMs=3 \
@@ -185,23 +185,23 @@ make    TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=4 \
 #         IN_SHAPEs=1,128,1,16 OUT_SHAPEs=64,128,8,16 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*128*1*16 gOMs=64*128*8*16 \
-#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16-Tst  res_check=off diss 
+#         IN_SHAPE_NAME=1_128_1_16 OUT_SHAPE_NAME=64_128_8_16-Tst  res_check=off diss
 
 # make    TESTCASE=broadcast DType=int32_t tMs=2048 MAX_DIMs=3 \
 #         IN_SHAPEs=8192,1,49 OUT_SHAPEs=8192,8,49 IN_DIMs=3 OUT_DIMs=3 \
 #         gIMs=8192*1*49 gOMs=8192*8*49 \
-#         IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss 
+#         IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss
 
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=3 \
 #         IN_SHAPEs=64,1,49 OUT_SHAPEs=64,8,49 IN_DIMs=3 OUT_DIMs=3 \
 #         gIMs=64*1*49 gOMs=64*8*49 \
-#         IN_SHAPE_NAME=64_1_49 OUT_SHAPE_NAME=64_8_49 res_check=off diss 
+#         IN_SHAPE_NAME=64_1_49 OUT_SHAPE_NAME=64_8_49 res_check=off diss
 
 # # BroadcastTo_ND_NCDHW_float16_int64_HunyuanImage21_MLLM_1015_000004
 # make    TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=5 \
 #         IN_SHAPEs=1,4,1,1034,128 OUT_SHAPEs=1,4,7,1034,128 IN_DIMs=5 OUT_DIMs=5 \
 #         gIMs=1*4*1*1034*128 gOMs=1*4*7*1034*128 \
-#         IN_SHAPE_NAME=1_4_1_1034_128 OUT_SHAPE_NAME=1_4_7_1034_128  res_check=off diss 
+#         IN_SHAPE_NAME=1_4_1_1034_128 OUT_SHAPE_NAME=1_4_7_1034_128  res_check=off diss
 
 
 
@@ -210,31 +210,31 @@ make    TESTCASE=broadcast_039 DType=__half tMs=2048 MAX_DIMs=4 \
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=2 \
 #         IN_SHAPEs=66661,1 OUT_SHAPEs=66661,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=66661*1 gOMs=66661*129 \
-#         IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss 
+#         IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss
 
 # make    TESTCASE=broadcast DType=__half tMs=512 MAX_DIMs=2 \
 #         IN_SHAPEs=66661,1 OUT_SHAPEs=66661,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=66661*1 gOMs=66661*129 \
-#         IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss 
+#         IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129 res_check=off diss
 
 
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=3 \
 #         IN_SHAPEs=81920,1,49 OUT_SHAPEs=81920,8,49 IN_DIMs=3 OUT_DIMs=3 \
 #         gIMs=81920*1*49 gOMs=81920*8*49 \
-#         IN_SHAPE_NAME=81920_1_49 OUT_SHAPE_NAME=81920_8_49 res_check=off diss 
+#         IN_SHAPE_NAME=81920_1_49 OUT_SHAPE_NAME=81920_8_49 res_check=off diss
 
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=2 \
 #         IN_SHAPEs=66661,1 OUT_SHAPEs=66661,129 IN_DIMs=2 OUT_DIMs=2 \
 #         gIMs=66661*1 gOMs=66661*129 \
-#         IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129  res_check=off diss 
-        
+#         IN_SHAPE_NAME=66661_1 OUT_SHAPE_NAME=66661_129  res_check=off diss
+
 # make    TESTCASE=broadcast DType=int32_t tMs=512 MAX_DIMs=4 \
 #         IN_SHAPEs=1,81920,1,49 OUT_SHAPEs=64,81920,8,49 IN_DIMs=4 OUT_DIMs=4 \
 #         gIMs=1*81920*1*49 gOMs=64*81920*8*49 \
-#         IN_SHAPE_NAME=1_81920_1_49 OUT_SHAPE_NAME=64_81920_8_49  res_check=off diss 
+#         IN_SHAPE_NAME=1_81920_1_49 OUT_SHAPE_NAME=64_81920_8_49  res_check=off diss
 
 ### for test ###
 # make    TESTCASE=broadcast_tst DType=int32_t tMs=2048 MAX_DIMs=3 \
 #         IN_SHAPEs=8192,1,49 OUT_SHAPEs=8192,8,49 IN_DIMs=3 OUT_DIMs=3 \
 #         gIMs=8192*1*49 gOMs=8192*8*49 \
-#         IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss 
\ No newline at end of file
+#         IN_SHAPE_NAME=8192_1_49 OUT_SHAPE_NAME=8192_8_49 res_check=off diss
\ No newline at end of file
diff --git a/test/kernel/memory/broadcast/src/broadcast.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_019.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_019.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_019.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_039.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_039.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_039.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_07.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_07.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_07.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_Hunyuan.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_Hunyuan.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_Hunyuan.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_data_compare.py b/benchmarks/kernels/memory/broadcast/src/broadcast_data_compare.py
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_data_compare.py
rename to benchmarks/kernels/memory/broadcast/src/broadcast_data_compare.py
diff --git a/test/kernel/memory/broadcast/src/broadcast_mscatter.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_mscatter.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_mscatter.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_nomg.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_nomg.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_nomg.cpp
diff --git a/test/kernel/memory/broadcast/src/broadcast_nocopyout.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp
similarity index 93%
rename from test/kernel/memory/broadcast/src/broadcast_nocopyout.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp
index 9e64e9d..c7070e0 100644
--- a/test/kernel/memory/broadcast/src/broadcast_nocopyout.cpp
+++ b/benchmarks/kernels/memory/broadcast/src/broadcast_nostore.cpp
@@ -4,7 +4,7 @@
 #include <cstdio>
 
 #include "fileop.h"
-#include "memory/broadcast_nocopyout.hpp"
+#include "memory/broadcast_nostore.hpp"
 
 
 #ifndef DType
@@ -92,8 +92,8 @@ int main() {
     printf("input[2]=%d\n",input[2]);
     printf("input[3]=%d\n",input[3]);
     #endif
-    
-    broadcast_nocopyout<dtype, MAX_DIMs, IN_DIMs, OUT_DIMs, gIMs, gOMs, tMs>(input, output, in_shape, out_shape);
+
+    broadcast_nostore<dtype, MAX_DIMs, IN_DIMs, OUT_DIMs, gIMs, gOMs, tMs>(input, output, in_shape, out_shape);
 
     #ifdef RES_CHECK
     writeBinaryFile(OUTPUT_PATH, (uint8_t*)output, gOMs * sizeof(dtype));
diff --git a/test/kernel/memory/broadcast/src/broadcast_tst.cpp b/benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp
similarity index 100%
rename from test/kernel/memory/broadcast/src/broadcast_tst.cpp
rename to benchmarks/kernels/memory/broadcast/src/broadcast_tst.cpp
diff --git a/test/kernel/memory/broadcast/src/gen_broadcast_data.py b/benchmarks/kernels/memory/broadcast/src/gen_broadcast_data.py
similarity index 100%
rename from test/kernel/memory/broadcast/src/gen_broadcast_data.py
rename to benchmarks/kernels/memory/broadcast/src/gen_broadcast_data.py
diff --git a/test/kernel/memory/broadcast/src/gfrun_broadcast.py b/benchmarks/kernels/memory/broadcast/src/gfrun_broadcast.py
similarity index 100%
rename from test/kernel/memory/broadcast/src/gfrun_broadcast.py
rename to benchmarks/kernels/memory/broadcast/src/gfrun_broadcast.py
diff --git a/test/kernel/memory/broadcast/src/tmp.list b/benchmarks/kernels/memory/broadcast/src/tmp.list
similarity index 100%
rename from test/kernel/memory/broadcast/src/tmp.list
rename to benchmarks/kernels/memory/broadcast/src/tmp.list
diff --git a/test/kernel/memory/broadcast_vec/Makefile b/benchmarks/kernels/memory/broadcast_vec/Makefile
similarity index 100%
rename from test/kernel/memory/broadcast_vec/Makefile
rename to benchmarks/kernels/memory/broadcast_vec/Makefile
diff --git a/test/kernel/memory/broadcast_vec/compile.all b/benchmarks/kernels/memory/broadcast_vec/compile.all
similarity index 100%
rename from test/kernel/memory/broadcast_vec/compile.all
rename to benchmarks/kernels/memory/broadcast_vec/compile.all
diff --git a/test/kernel/memory/broadcast_vec/src/broadcast_vec_019.cpp b/benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp
similarity index 100%
rename from test/kernel/memory/broadcast_vec/src/broadcast_vec_019.cpp
rename to benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_019.cpp
diff --git a/test/kernel/memory/broadcast_vec/src/broadcast_vec_039.cpp b/benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp
similarity index 100%
rename from test/kernel/memory/broadcast_vec/src/broadcast_vec_039.cpp
rename to benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_039.cpp
diff --git a/test/kernel/memory/broadcast_vec/src/broadcast_vec_07.cpp b/benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp
similarity index 100%
rename from test/kernel/memory/broadcast_vec/src/broadcast_vec_07.cpp
rename to benchmarks/kernels/memory/broadcast_vec/src/broadcast_vec_07.cpp
diff --git a/test/kernel/memory/concat_gather/Makefile b/benchmarks/kernels/memory/concat_gather/Makefile
similarity index 100%
rename from test/kernel/memory/concat_gather/Makefile
rename to benchmarks/kernels/memory/concat_gather/Makefile
diff --git a/test/kernel/memory/concat_gather/compile.all b/benchmarks/kernels/memory/concat_gather/compile.all
similarity index 100%
rename from test/kernel/memory/concat_gather/compile.all
rename to benchmarks/kernels/memory/concat_gather/compile.all
diff --git a/test/kernel/memory/concat_gather/src/concat_gather.cpp b/benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp
similarity index 100%
rename from test/kernel/memory/concat_gather/src/concat_gather.cpp
rename to benchmarks/kernels/memory/concat_gather/src/concat_gather.cpp
diff --git a/test/kernel/memory/concat_scatter/Makefile b/benchmarks/kernels/memory/concat_scatter/Makefile
similarity index 100%
rename from test/kernel/memory/concat_scatter/Makefile
rename to benchmarks/kernels/memory/concat_scatter/Makefile
diff --git a/test/kernel/memory/concat_scatter/compile.all b/benchmarks/kernels/memory/concat_scatter/compile.all
similarity index 100%
rename from test/kernel/memory/concat_scatter/compile.all
rename to benchmarks/kernels/memory/concat_scatter/compile.all
diff --git a/test/kernel/memory/concat_scatter/src/concat_scatter.cpp b/benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp
similarity index 100%
rename from test/kernel/memory/concat_scatter/src/concat_scatter.cpp
rename to benchmarks/kernels/memory/concat_scatter/src/concat_scatter.cpp
diff --git a/test/kernel/memory/gather/Makefile b/benchmarks/kernels/memory/gather/Makefile
similarity index 100%
rename from test/kernel/memory/gather/Makefile
rename to benchmarks/kernels/memory/gather/Makefile
diff --git a/test/kernel/memory/gather/compile.all b/benchmarks/kernels/memory/gather/compile.all
similarity index 100%
rename from test/kernel/memory/gather/compile.all
rename to benchmarks/kernels/memory/gather/compile.all
diff --git a/test/kernel/memory/gather/src/gather.cpp b/benchmarks/kernels/memory/gather/src/gather.cpp
similarity index 100%
rename from test/kernel/memory/gather/src/gather.cpp
rename to benchmarks/kernels/memory/gather/src/gather.cpp
diff --git a/test/kernel/memory/gather/src/gen_gather_data.py b/benchmarks/kernels/memory/gather/src/gen_gather_data.py
similarity index 100%
rename from test/kernel/memory/gather/src/gen_gather_data.py
rename to benchmarks/kernels/memory/gather/src/gen_gather_data.py
diff --git a/test/kernel/memory/gather/src/tmp.list b/benchmarks/kernels/memory/gather/src/tmp.list
similarity index 100%
rename from test/kernel/memory/gather/src/tmp.list
rename to benchmarks/kernels/memory/gather/src/tmp.list
diff --git a/test/kernel/memory/transpose/Makefile b/benchmarks/kernels/memory/transpose/Makefile
similarity index 100%
rename from test/kernel/memory/transpose/Makefile
rename to benchmarks/kernels/memory/transpose/Makefile
diff --git a/test/kernel/memory/transpose/compile.all b/benchmarks/kernels/memory/transpose/compile.all
similarity index 100%
rename from test/kernel/memory/transpose/compile.all
rename to benchmarks/kernels/memory/transpose/compile.all
diff --git a/test/kernel/memory/transpose/src/transpose.cpp b/benchmarks/kernels/memory/transpose/src/transpose.cpp
similarity index 100%
rename from test/kernel/memory/transpose/src/transpose.cpp
rename to benchmarks/kernels/memory/transpose/src/transpose.cpp
diff --git a/test/kernel/reduction/reducemax_col/Makefile b/benchmarks/kernels/reduction/reducemax_col/Makefile
similarity index 100%
rename from test/kernel/reduction/reducemax_col/Makefile
rename to benchmarks/kernels/reduction/reducemax_col/Makefile
diff --git a/test/kernel/reduction/reducemax_col/compile.all b/benchmarks/kernels/reduction/reducemax_col/compile.all
similarity index 100%
rename from test/kernel/reduction/reducemax_col/compile.all
rename to benchmarks/kernels/reduction/reducemax_col/compile.all
diff --git a/test/kernel/reduction/reducemax_col/src/reducemax_col.cpp b/benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp
similarity index 100%
rename from test/kernel/reduction/reducemax_col/src/reducemax_col.cpp
rename to benchmarks/kernels/reduction/reducemax_col/src/reducemax_col.cpp
diff --git a/test/kernel/reduction/reducemax_row/Makefile b/benchmarks/kernels/reduction/reducemax_row/Makefile
similarity index 100%
rename from test/kernel/reduction/reducemax_row/Makefile
rename to benchmarks/kernels/reduction/reducemax_row/Makefile
diff --git a/test/kernel/reduction/reducemax_row/compile.all b/benchmarks/kernels/reduction/reducemax_row/compile.all
similarity index 100%
rename from test/kernel/reduction/reducemax_row/compile.all
rename to benchmarks/kernels/reduction/reducemax_row/compile.all
diff --git a/test/kernel/reduction/reducemax_row/src/reducemax_row.cpp b/benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp
similarity index 100%
rename from test/kernel/reduction/reducemax_row/src/reducemax_row.cpp
rename to benchmarks/kernels/reduction/reducemax_row/src/reducemax_row.cpp
diff --git a/test/kernel/reduction/reducesum_col/Makefile b/benchmarks/kernels/reduction/reducesum_col/Makefile
similarity index 100%
rename from test/kernel/reduction/reducesum_col/Makefile
rename to benchmarks/kernels/reduction/reducesum_col/Makefile
diff --git a/test/kernel/reduction/reducesum_col/compile.all b/benchmarks/kernels/reduction/reducesum_col/compile.all
similarity index 100%
rename from test/kernel/reduction/reducesum_col/compile.all
rename to benchmarks/kernels/reduction/reducesum_col/compile.all
diff --git a/test/kernel/reduction/reducesum_col/src/reducesum_col.cpp b/benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp
similarity index 100%
rename from test/kernel/reduction/reducesum_col/src/reducesum_col.cpp
rename to benchmarks/kernels/reduction/reducesum_col/src/reducesum_col.cpp
diff --git a/test/kernel/reduction/reducesum_row/Makefile b/benchmarks/kernels/reduction/reducesum_row/Makefile
similarity index 100%
rename from test/kernel/reduction/reducesum_row/Makefile
rename to benchmarks/kernels/reduction/reducesum_row/Makefile
diff --git a/test/kernel/reduction/reducesum_row/compile.all b/benchmarks/kernels/reduction/reducesum_row/compile.all
similarity index 100%
rename from test/kernel/reduction/reducesum_row/compile.all
rename to benchmarks/kernels/reduction/reducesum_row/compile.all
diff --git a/test/kernel/reduction/reducesum_row/src/reducesum_row.cpp b/benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp
similarity index 100%
rename from test/kernel/reduction/reducesum_row/src/reducesum_row.cpp
rename to benchmarks/kernels/reduction/reducesum_row/src/reducesum_row.cpp
diff --git a/test/kernel/sort/Makefile b/benchmarks/kernels/sort/Makefile
similarity index 70%
rename from test/kernel/sort/Makefile
rename to benchmarks/kernels/sort/Makefile
index a07b811..2fee7f6 100644
--- a/test/kernel/sort/Makefile
+++ b/benchmarks/kernels/sort/Makefile
@@ -1,3 +1,5 @@
+.DEFAULT_GOAL := all
+
 TARGET = $(ELF_HEAD)_$(TESTCASE).elf
 
 # Override target name for topk
@@ -7,10 +9,10 @@ endif
 SRC_FILE +=  $(TEST_ROOT)/$(CATEGORY)/$(TESTCASE)/$(TESTCASE).cpp
 
 # Special handling for topk - embed data as object files
-EXTRA_OBJ_FILES :=
-EXTRA_OBJ_DEPS :=
+EXTRA_OBJ_FILES =
+EXTRA_OBJ_DEPS =
 DATA_OBJ_DIR := topk/data_obj
-OUTPUT_DATA_OBJ_DIR := ../../../output/kernel/sort/topk/data_obj
+OUTPUT_DATA_OBJ_DIR = $(OBJ_ROOT)/$(CATEGORY)/topk/data_obj
 
 ifeq ($(TESTCASE), topk)
 EXTRA_OBJ_FILES += $(OUTPUT_DATA_OBJ_DIR)/input_131072.o
@@ -21,9 +23,6 @@ pre_work: build_data_objs
 build_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR)
 
-$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
 endif
 
 ifeq ($(opt), on)
@@ -31,4 +30,9 @@ DEFINES += -DOPT
 TARGET = $(ELF_HEAD)_$(TESTCASE)_OPT.elf
 endif
 
-include ../../common/Makefile.common
\ No newline at end of file
+include ../../common/Makefile.common
+
+ifneq ($(EXTRA_OBJ_FILES),)
+$(EXTRA_OBJ_FILES): pre_work
+	@true
+endif
diff --git a/test/kernel/sort/compile.all b/benchmarks/kernels/sort/compile.all
similarity index 100%
rename from test/kernel/sort/compile.all
rename to benchmarks/kernels/sort/compile.all
diff --git a/benchmarks/kernels/sort/topk/.gitignore b/benchmarks/kernels/sort/topk/.gitignore
new file mode 100644
index 0000000..e60f4e8
--- /dev/null
+++ b/benchmarks/kernels/sort/topk/.gitignore
@@ -0,0 +1,3 @@
+*.o
+*.s
+*.data
diff --git a/test/kernel/sort/topk/data_obj/build_data_obj.sh b/benchmarks/kernels/sort/topk/data_obj/build_data_obj.sh
similarity index 50%
rename from test/kernel/sort/topk/data_obj/build_data_obj.sh
rename to benchmarks/kernels/sort/topk/data_obj/build_data_obj.sh
index 8128f1e..7f3172b 100755
--- a/test/kernel/sort/topk/data_obj/build_data_obj.sh
+++ b/benchmarks/kernels/sort/topk/data_obj/build_data_obj.sh
@@ -1,10 +1,20 @@
 #!/bin/bash
-COMPILER_DIR="${COMPILER_DIR:-/remote/lms01/j00827727/jcore/compilers/linx_blockisa_llvm_musl0.56.16/bin}"
-DATA_OBJ_DIR="$1"
-OUTPUT_DIR="$2"
+set -euo pipefail
+
+COMPILER_DIR="${COMPILER_DIR:-/usr/bin}"
+LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}"
+DATA_OBJ_DIR="${1:?data object directory required}"
+OUTPUT_DIR="${2:?output directory required}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
 
 mkdir -p "$OUTPUT_DIR"
 
+if [[ ! -f "${DATA_OBJ_DIR}/input_131072.data" ||
+      ! -f "${DATA_OBJ_DIR}/top_2048_out.data" ]]; then
+    (cd "$CASE_DIR" && python3 gen_topk_data.py)
+fi
+
 build_one() {
     local name="$1"
     local data_file="${DATA_OBJ_DIR}/${name}.data"
@@ -25,10 +35,10 @@ _binary_${name}_data_end:
 .equ _binary_${name}_data_size, .-_binary_${name}_data_start
 EOF
 
-    $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file"
+    "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file"
 }
 
 build_one "input_131072"
 build_one "top_2048_out"
 
-echo "Done building data object files"
\ No newline at end of file
+echo "Done building data object files"
diff --git a/test/kernel/sort/topk/gen_topk_data.py b/benchmarks/kernels/sort/topk/gen_topk_data.py
similarity index 100%
rename from test/kernel/sort/topk/gen_topk_data.py
rename to benchmarks/kernels/sort/topk/gen_topk_data.py
diff --git a/test/kernel/sort/topk/topk.cpp b/benchmarks/kernels/sort/topk/topk.cpp
similarity index 94%
rename from test/kernel/sort/topk/topk.cpp
rename to benchmarks/kernels/sort/topk/topk.cpp
index edee4bc..1f717ea 100644
--- a/test/kernel/sort/topk/topk.cpp
+++ b/benchmarks/kernels/sort/topk/topk.cpp
@@ -1,9 +1,13 @@
 #include <common/pto_tileop.hpp>
 #include "benchmark.h"
+#ifndef __linx
 #include "fileop.h"
+#endif
 #include "template_asm.h"
+#ifndef __linx
 #include <stdio.h>
 #include <string.h>
+#endif
 
 // #define FOR_GFSIM
 // ============================================================================
@@ -143,7 +147,7 @@ static int find_kth_bin(const uint32_t hist[256], int k, int& need_from_kth) {
 // ============================================================================
 
 int main() {
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("=== TopK Test (SIMT per-bucket) ===\n");
     printf("Input: %d  TopK: %d  Tiles: %d  TileSize: %d\n",
            kInputCount, kTopK, kNumTiles, kTileSize);
@@ -161,14 +165,14 @@ int main() {
     using HistGT = GlobalTensor<uint32_t, Shape<1,1,1,16,16>, Stride<1,1,1,16,1>>;
     uint32_t histResult[256];
     HistGT histGlobal(histResult);
-    TCOPYOUT(histGlobal, high8HistTile);
+    TSTORE(histGlobal, high8HistTile);
 
     uint32_t global_high8_hist[256] = {0};
     for (int b = 0; b < 256; b++) {
         global_high8_hist[b] = histResult[b];
     }
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("\nPhase 1: high8 histograms built (1 SIMT launch, 256 lanes).\n");
     fflush(stdout);
 #endif
@@ -179,7 +183,7 @@ int main() {
     int need_from_kth_bin = 0;
     int kth_bin = find_kth_bin(global_high8_hist, kTopK, need_from_kth_bin);
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("\nPhase 2: kth_bin=%d  need_from_kth_bin=%d\n",
            kth_bin, need_from_kth_bin);
     uint64_t total_above = 0;
@@ -200,7 +204,7 @@ int main() {
 
     uint32_t low8HistResult[256];
     HistGT low8HistGlobal(low8HistResult);
-    TCOPYOUT(low8HistGlobal, low8HistTile);
+    TSTORE(low8HistGlobal, low8HistTile);
 
     uint32_t global_low8_hist_kth[256] = {0};
     for (int b = 0; b < 256; b++) {
@@ -220,7 +224,7 @@ int main() {
         }
     }
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("\nPhase 4: low8_boundary=%d\n", low8_boundary);
     printf("  Global low8 hist (kth bin) total: %lu\n", cumsum_low);
     fflush(stdout);
@@ -229,7 +233,9 @@ int main() {
     // -------------------------------------------------------------------------
     // Phase 5: Scalar masked scatter (directly on g_input / g_output)
     // -------------------------------------------------------------------------
-    memset(g_output, 0, sizeof(g_output));
+    for (int i = 0; i < kInputCount; i++) {
+        g_output[i] = 0;
+    }
     for (int i = 0; i < kInputCount; i++) {
         uint16_t val  = g_input[i];
         uint8_t  high8 = static_cast<uint8_t>(val >> 8);
@@ -252,7 +258,7 @@ int main() {
         }
     }
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("\nPhase 5: Collected %d output elements (expected %d)\n",
            out_count, kTopK);
     fflush(stdout);
@@ -264,7 +270,9 @@ int main() {
     int cmp_count = (out_count < kTopK) ? out_count : kTopK;
 
     uint16_t result_sorted[2048];
-    memcpy(result_sorted, result, sizeof(result_sorted));
+    for (int i = 0; i < cmp_count; i++) {
+        result_sorted[i] = result[i];
+    }
     for (int i = 0; i < cmp_count; i++) {
         for (int j = i + 1; j < cmp_count; j++) {
             if (result_sorted[i] < result_sorted[j]) {
@@ -280,7 +288,7 @@ int main() {
         if (result_sorted[i] == g_expected[i]) match++;
     }
 
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("\n=== Verification (vs. embedded standard answer) ===\n");
     printf("Match: %d/%d (%.1f%%)\n", match, cmp_count, 100.0 * match / cmp_count);
     printf("Output[0..9]:    ");
@@ -291,7 +299,7 @@ int main() {
 #endif
 
     int ret = (match == cmp_count) ? 0 : 1;
-#ifndef FOR_GFSIM
+#if !defined(FOR_GFSIM) && !defined(__linx)
     printf("%s\n", ret ? "FAIL" : "PASS");
     fflush(stdout);
 #endif
diff --git a/test/other/cube/Makefile b/benchmarks/microbench/cube/Makefile
similarity index 100%
rename from test/other/cube/Makefile
rename to benchmarks/microbench/cube/Makefile
diff --git a/test/other/cube/compile.all b/benchmarks/microbench/cube/compile.all
similarity index 100%
rename from test/other/cube/compile.all
rename to benchmarks/microbench/cube/compile.all
diff --git a/test/other/cube/src/matop.cpp b/benchmarks/microbench/cube/src/matop.cpp
similarity index 87%
rename from test/other/cube/src/matop.cpp
rename to benchmarks/microbench/cube/src/matop.cpp
index b7aef50..49ae9f6 100644
--- a/test/other/cube/src/matop.cpp
+++ b/benchmarks/microbench/cube/src/matop.cpp
@@ -108,8 +108,8 @@ void matmul(const int loop, src_dtype *a, src_dtype *b) {
     tile_shapeB tB;
     tile_shapeACC tACC;
 
-    TCOPYIN(tA, gA);
-    TCOPYIN(tB, gB);
+    TLOAD(tA, gA);
+    TLOAD(tB, gB);
 
     #pragma clang loop unroll(full)
     for(int i=0;i<LOOP;i++){
@@ -154,22 +154,22 @@ void matmul_ch8(const int loop, src_dtype *a, src_dtype *b) {
     tile_shapeA tA7;
     tile_shapeB tB7;
 
-    TCOPYIN(tA0, gA);
-    TCOPYIN(tB0, gB);
-    TCOPYIN(tA1, gA);
-    TCOPYIN(tB1, gB);
-    TCOPYIN(tA2, gA);
-    TCOPYIN(tB2, gB);
-    TCOPYIN(tA3, gA);
-    TCOPYIN(tB3, gB);
-    TCOPYIN(tA4, gA);
-    TCOPYIN(tB4, gB);
-    TCOPYIN(tA5, gA);
-    TCOPYIN(tB5, gB);
-    TCOPYIN(tA6, gA);
-    TCOPYIN(tB6, gB);
-    TCOPYIN(tA7, gA);
-    TCOPYIN(tB7, gB);
+    TLOAD(tA0, gA);
+    TLOAD(tB0, gB);
+    TLOAD(tA1, gA);
+    TLOAD(tB1, gB);
+    TLOAD(tA2, gA);
+    TLOAD(tB2, gB);
+    TLOAD(tA3, gA);
+    TLOAD(tB3, gB);
+    TLOAD(tA4, gA);
+    TLOAD(tB4, gB);
+    TLOAD(tA5, gA);
+    TLOAD(tB5, gB);
+    TLOAD(tA6, gA);
+    TLOAD(tB6, gB);
+    TLOAD(tA7, gA);
+    TLOAD(tB7, gB);
 
     #pragma clang loop unroll(full)
     for(int i=0;i<LOOP/2;i++){
@@ -194,13 +194,13 @@ void matmacc(const int loop, src_dtype *a, src_dtype *b){
 
     gm_shapeA gA(a);
     gm_shapeB gB(b);
-    
+
     tile_shapeA tA;
     tile_shapeB tB;
     tile_shapeACC tACC;
-    
-    TCOPYIN(tA, gA);
-    TCOPYIN(tB, gB);
+
+    TLOAD(tA, gA);
+    TLOAD(tB, gB);
     MATMUL(tACC, tA, tB);
 
     #pragma clang loop unroll(full)
@@ -246,22 +246,22 @@ void matmacc_ch8(const int loop, src_dtype *a, src_dtype *b) {
     tile_shapeA tA7;
     tile_shapeB tB7;
 
-    TCOPYIN(tA0, gA);
-    TCOPYIN(tB0, gB);
-    TCOPYIN(tA1, gA);
-    TCOPYIN(tB1, gB);
-    TCOPYIN(tA2, gA);
-    TCOPYIN(tB2, gB);
-    TCOPYIN(tA3, gA);
-    TCOPYIN(tB3, gB);
-    TCOPYIN(tA4, gA);
-    TCOPYIN(tB4, gB);
-    TCOPYIN(tA5, gA);
-    TCOPYIN(tB5, gB);
-    TCOPYIN(tA6, gA);
-    TCOPYIN(tB6, gB);
-    TCOPYIN(tA7, gA);
-    TCOPYIN(tB7, gB);
+    TLOAD(tA0, gA);
+    TLOAD(tB0, gB);
+    TLOAD(tA1, gA);
+    TLOAD(tB1, gB);
+    TLOAD(tA2, gA);
+    TLOAD(tB2, gB);
+    TLOAD(tA3, gA);
+    TLOAD(tB3, gB);
+    TLOAD(tA4, gA);
+    TLOAD(tB4, gB);
+    TLOAD(tA5, gA);
+    TLOAD(tB5, gB);
+    TLOAD(tA6, gA);
+    TLOAD(tB6, gB);
+    TLOAD(tA7, gA);
+    TLOAD(tB7, gB);
     MATMUL(tACC, tA0, tB0);
     MATMUL(tACC, tA1, tB1);
     MATMUL(tACC, tA2, tB2);
diff --git a/test/other/lmbench/Makefile b/benchmarks/microbench/lmbench/Makefile
similarity index 100%
rename from test/other/lmbench/Makefile
rename to benchmarks/microbench/lmbench/Makefile
diff --git a/test/other/lmbench/compile_mem.all b/benchmarks/microbench/lmbench/compile_mem.all
similarity index 99%
rename from test/other/lmbench/compile_mem.all
rename to benchmarks/microbench/lmbench/compile_mem.all
index 94f88b6..06f0c46 100755
--- a/test/other/lmbench/compile_mem.all
+++ b/benchmarks/microbench/lmbench/compile_mem.all
@@ -73,7 +73,7 @@ make TESTCASE=mem MODE=tile_fcopy TROW=32  TCOL=128 LOOP=192
 make TESTCASE=mem MODE=tile_fcopy TROW=64  TCOL=128 LOOP=96
 
 
-#TCOPYIN/OUT Bandwidth Test
+#TLOAD/TSTORE Bandwidth Test
 make TESTCASE=mem MODE=tload_nd GROW=256  GCOL=256  TROW=16 TCOL=16 LOOP=12
 make TESTCASE=mem MODE=tload_nd GROW=512  GCOL=512  TROW=32 TCOL=32 LOOP=12
 make TESTCASE=mem MODE=tload_nd GROW=1024 GCOL=1024 TROW=64 TCOL=64 LOOP=6
diff --git a/test/other/lmbench/src/mem.cpp b/benchmarks/microbench/lmbench/src/mem.cpp
similarity index 97%
rename from test/other/lmbench/src/mem.cpp
rename to benchmarks/microbench/lmbench/src/mem.cpp
index 3355582..a1f03a4 100644
--- a/test/other/lmbench/src/mem.cpp
+++ b/benchmarks/microbench/lmbench/src/mem.cpp
@@ -192,8 +192,8 @@ void tload_nd(dtype *src) {
     for (int j = 0; j < bcol; ++j) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gsrc(src + offset);
-      tile_shape tsrc; 
-      TCOPYIN(tsrc, gsrc);
+      tile_shape tsrc;
+      TLOAD(tsrc, gsrc);
     }
   }
 }
@@ -210,8 +210,8 @@ void tload_dn(dtype *src) {
     for (int j = 0; j < bcol; ++j) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gsrc(src + offset);
-      tile_shape tsrc; 
-      TCOPYIN(tsrc, gsrc);
+      tile_shape tsrc;
+      TLOAD(tsrc, gsrc);
     }
   }
 }
@@ -228,8 +228,8 @@ void tload_nd2nz(dtype *src) {
     for (int j = 0; j < bcol; ++j) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gsrc(src + offset);
-      tile_shape tsrc; 
-      TCOPYIN(tsrc, gsrc);
+      tile_shape tsrc;
+      TLOAD(tsrc, gsrc);
     }
   }
 }
@@ -246,8 +246,8 @@ void tload_nd2zn(dtype *src) {
     for (int j = 0; j < bcol; ++j) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gsrc(src + offset);
-      tile_shape tsrc; 
-      TCOPYIN(tsrc, gsrc);
+      tile_shape tsrc;
+      TLOAD(tsrc, gsrc);
     }
   }
 }
@@ -269,8 +269,8 @@ void tload_dn2zn(dtype *src) {
     for (int j = 0; j < bcol; ++j) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gsrc(src + offset);
-      tile_shape tsrc; 
-      TCOPYIN(tsrc, gsrc);
+      tile_shape tsrc;
+      TLOAD(tsrc, gsrc);
     }
   }
 }
@@ -289,7 +289,7 @@ void tstore_nd(dtype *dst) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gdst(dst + offset);
       tile_shape tdst(0);
-      TCOPYOUT(gdst, tdst);
+      TSTORE(gdst, tdst);
     }
   }
 }
@@ -308,7 +308,7 @@ void tstore_dn(dtype *dst) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gdst(dst + offset);
       tile_shape tdst(0);
-      TCOPYOUT(gdst, tdst);
+      TSTORE(gdst, tdst);
     }
   }
 }
@@ -327,7 +327,7 @@ void tstore_nz2nd(dtype *dst) {
       uint16_t offset = i * (tile_shape::Rows * gm_shape::Cols) + j * tile_shape::Cols;
       gm_shape gdst(dst + offset);
       tile_shape tdst(0);
-      TCOPYOUT(gdst, tdst);
+      TSTORE(gdst, tdst);
     }
   }
 }
@@ -353,7 +353,7 @@ int main() {
   using gm_shape = global_tensor<dtype, RowMajor<GROW, GCOL>>;
   using tile_shape = Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor>;
   Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tsrc(0);
-  
+
   BENCHSTART;
   #pragma clang loop unroll(full)
   for(int i=0;i<LOOP;i++){
@@ -361,7 +361,7 @@ int main() {
       dtype src[gm_size];
       gm_shape gsrc(src);
       Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tmp;
-      uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype)); 
+      uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype));
       gm_rd<gm_shape><<<real_col, gm_shape::Rows, 1>>>(gsrc.data(), static_cast<uint16_t>(STRD/sizeof(dtype)), tmp.data());
     }
     else if(!strcmp(MODE, "gm_frd")){
@@ -374,7 +374,7 @@ int main() {
       dtype dst[gm_size];
       gm_shape gdst(dst);
       Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tmp;
-      uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype));  
+      uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype));
       gm_wr<gm_shape><<<real_col, gm_shape::Rows, 1>>>(gdst.data(), static_cast<uint16_t>(STRD/sizeof(dtype)), tmp.data());
     }
     else if(!strcmp(MODE, "gm_fwr")){
@@ -389,7 +389,7 @@ int main() {
       gm_shape gsrc(src);
       gm_shape gdst(dst);
       Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tmp;
-      uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype));  
+      uint16_t real_col = gm_shape::Cols / (STRD/sizeof(dtype));
       gm_copy<gm_shape><<<real_col, gm_shape::Rows, 1>>>(gdst.data(), gsrc.data(), static_cast<uint16_t>(STRD/sizeof(dtype)), tmp.data());
     }
     else if(!strcmp(MODE, "gm_fcopy")){
@@ -402,7 +402,7 @@ int main() {
     }
     else if(!strcmp(MODE, "tile_rd")){
       //Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tsrc(0);
-      uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype)); 
+      uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype));
       tile_rd<tile_shape><<<real_col, tile_shape::Rows, 1>>>(tsrc.data(), static_cast<uint16_t>(STRD/sizeof(dtype)));
     }
     else if(!strcmp(MODE, "tile_frd")){
@@ -411,7 +411,7 @@ int main() {
     }
     else if(!strcmp(MODE, "tile_wr")){
       Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tdst;
-      uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype)); 
+      uint16_t real_col = tile_shape::Cols / (STRD/sizeof(dtype));
       tile_wr<tile_shape><<<real_col, tile_shape::Rows, 1>>>(tdst.data(), static_cast<uint16_t>(STRD/sizeof(dtype)));
     }
     else if(!strcmp(MODE, "tile_fwr")){
@@ -421,7 +421,7 @@ int main() {
     else if(!strcmp(MODE, "tile_copy")){
       //Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tsrc(0);
       Tile<Location::Vec, dtype, TROW, TCOL, BLayout::RowMajor> tdst;
-      uint16_t real_col =tile_shape::Cols / (STRD/sizeof(dtype)); 
+      uint16_t real_col =tile_shape::Cols / (STRD/sizeof(dtype));
       tile_copy<tile_shape><<<real_col, tile_shape::Rows, 1>>>(tdst.data(), tsrc.data(), static_cast<uint16_t>(STRD/sizeof(dtype)));
     }
     else if(!strcmp(MODE, "tile_fcopy")){
diff --git a/test/other/vec/Makefile b/benchmarks/microbench/vec/Makefile
similarity index 100%
rename from test/other/vec/Makefile
rename to benchmarks/microbench/vec/Makefile
diff --git a/test/other/vec/compile_lat_bw.all b/benchmarks/microbench/vec/compile_lat_bw.all
similarity index 100%
rename from test/other/vec/compile_lat_bw.all
rename to benchmarks/microbench/vec/compile_lat_bw.all
diff --git a/test/other/vec/src/lat_bw.cpp b/benchmarks/microbench/vec/src/lat_bw.cpp
similarity index 100%
rename from test/other/vec/src/lat_bw.cpp
rename to benchmarks/microbench/vec/src/lat_bw.cpp
diff --git a/test/other/vec/src/lat_bw_func.h b/benchmarks/microbench/vec/src/lat_bw_func.h
similarity index 100%
rename from test/other/vec/src/lat_bw_func.h
rename to benchmarks/microbench/vec/src/lat_bw_func.h
diff --git a/test/other/vec/src/lat_bw_vec.h b/benchmarks/microbench/vec/src/lat_bw_vec.h
similarity index 100%
rename from test/other/vec/src/lat_bw_vec.h
rename to benchmarks/microbench/vec/src/lat_bw_vec.h
diff --git a/test/other/deepseek/Makefile b/benchmarks/models/deepseekv3/Makefile
similarity index 100%
rename from test/other/deepseek/Makefile
rename to benchmarks/models/deepseekv3/Makefile
diff --git a/test/other/deepseek/compile.all b/benchmarks/models/deepseekv3/compile.all
similarity index 100%
rename from test/other/deepseek/compile.all
rename to benchmarks/models/deepseekv3/compile.all
diff --git a/test/other/deepseek/compile_cpu.all b/benchmarks/models/deepseekv3/compile_cpu.all
similarity index 100%
rename from test/other/deepseek/compile_cpu.all
rename to benchmarks/models/deepseekv3/compile_cpu.all
diff --git a/test/other/deepseek/src/concat.cpp b/benchmarks/models/deepseekv3/src/concat.cpp
similarity index 100%
rename from test/other/deepseek/src/concat.cpp
rename to benchmarks/models/deepseekv3/src/concat.cpp
diff --git a/test/other/deepseek/src/expand.cpp b/benchmarks/models/deepseekv3/src/expand.cpp
similarity index 100%
rename from test/other/deepseek/src/expand.cpp
rename to benchmarks/models/deepseekv3/src/expand.cpp
diff --git a/test/other/deepseek/src/gate.cpp b/benchmarks/models/deepseekv3/src/gate.cpp
similarity index 100%
rename from test/other/deepseek/src/gate.cpp
rename to benchmarks/models/deepseekv3/src/gate.cpp
diff --git a/test/other/deepseek/src/mask.cpp b/benchmarks/models/deepseekv3/src/mask.cpp
similarity index 100%
rename from test/other/deepseek/src/mask.cpp
rename to benchmarks/models/deepseekv3/src/mask.cpp
diff --git a/test/other/deepseek/src/mla.cpp b/benchmarks/models/deepseekv3/src/mla.cpp
similarity index 100%
rename from test/other/deepseek/src/mla.cpp
rename to benchmarks/models/deepseekv3/src/mla.cpp
diff --git a/test/other/deepseek/src/mlp.cpp b/benchmarks/models/deepseekv3/src/mlp.cpp
similarity index 100%
rename from test/other/deepseek/src/mlp.cpp
rename to benchmarks/models/deepseekv3/src/mlp.cpp
diff --git a/test/other/deepseek/src/moe.cpp b/benchmarks/models/deepseekv3/src/moe.cpp
similarity index 100%
rename from test/other/deepseek/src/moe.cpp
rename to benchmarks/models/deepseekv3/src/moe.cpp
diff --git a/test/other/deepseek/src/permute.cpp b/benchmarks/models/deepseekv3/src/permute.cpp
similarity index 100%
rename from test/other/deepseek/src/permute.cpp
rename to benchmarks/models/deepseekv3/src/permute.cpp
diff --git a/test/other/deepseek/src/projection.cpp b/benchmarks/models/deepseekv3/src/projection.cpp
similarity index 100%
rename from test/other/deepseek/src/projection.cpp
rename to benchmarks/models/deepseekv3/src/projection.cpp
diff --git a/test/other/deepseek/src/rmsnorm.cpp b/benchmarks/models/deepseekv3/src/rmsnorm.cpp
similarity index 100%
rename from test/other/deepseek/src/rmsnorm.cpp
rename to benchmarks/models/deepseekv3/src/rmsnorm.cpp
diff --git a/test/other/deepseek/src/rope.cpp b/benchmarks/models/deepseekv3/src/rope.cpp
similarity index 100%
rename from test/other/deepseek/src/rope.cpp
rename to benchmarks/models/deepseekv3/src/rope.cpp
diff --git a/test/other/deepseek/src/split.cpp b/benchmarks/models/deepseekv3/src/split.cpp
similarity index 100%
rename from test/other/deepseek/src/split.cpp
rename to benchmarks/models/deepseekv3/src/split.cpp
diff --git a/test/other/deepseek/src/topk.cpp b/benchmarks/models/deepseekv3/src/topk.cpp
similarity index 100%
rename from test/other/deepseek/src/topk.cpp
rename to benchmarks/models/deepseekv3/src/topk.cpp
diff --git a/test/other/deepseek/src/transformer.cpp b/benchmarks/models/deepseekv3/src/transformer.cpp
similarity index 100%
rename from test/other/deepseek/src/transformer.cpp
rename to benchmarks/models/deepseekv3/src/transformer.cpp
diff --git a/test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp b/benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp
similarity index 84%
rename from test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp
rename to benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp
+++ b/benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/LLAMA3_70B_attn_matmul_decode_bs_192.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h b/benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h
rename to benchmarks/npu/cube/LLAMA3_70B_attn_matmul_decode_bs_192/params_mx_A8W8.h
diff --git a/test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp b/benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp
similarity index 84%
rename from test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp
rename to benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp
+++ b/benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/LLAMA3_70B_ffn_matmul_3_decode_bs_192.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h b/benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h
rename to benchmarks/npu/cube/LLAMA3_70B_ffn_matmul_3_decode_bs_192/params_mx_A8W8.h
diff --git a/test/accelerator/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h b/benchmarks/npu/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h
rename to benchmarks/npu/cube/Layer_6588_modified_fp8_GB_nbuf/params_mx_A8W8.h
diff --git a/test/accelerator/cube/Makefile b/benchmarks/npu/cube/Makefile
similarity index 100%
rename from test/accelerator/cube/Makefile
rename to benchmarks/npu/cube/Makefile
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp
similarity index 81%
rename from test/accelerator/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp
rename to benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp
index 68e3aeb..fa67648 100644
--- a/test/accelerator/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp
+++ b/benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/QuantBatchMatmulV3_292_hif4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A4W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h
similarity index 100%
rename from test/accelerator/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h
rename to benchmarks/npu/cube/QuantBatchMatmulV3_292_hif4/params_mx_A4W4.h
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp
similarity index 81%
rename from test/accelerator/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp
rename to benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp
index 68e3aeb..fa67648 100644
--- a/test/accelerator/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp
+++ b/benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/QuantBatchMatmulV3_293_hif4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A4W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h
similarity index 100%
rename from test/accelerator/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h
rename to benchmarks/npu/cube/QuantBatchMatmulV3_293_hif4/params_mx_A4W4.h
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp
similarity index 81%
rename from test/accelerator/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp
rename to benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp
index 68e3aeb..fa67648 100644
--- a/test/accelerator/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp
+++ b/benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/QuantBatchMatmulV3_294_hif4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A4W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h
similarity index 100%
rename from test/accelerator/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h
rename to benchmarks/npu/cube/QuantBatchMatmulV3_294_hif4/params_mx_A4W4.h
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp
similarity index 81%
rename from test/accelerator/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp
rename to benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp
index 68e3aeb..fa67648 100644
--- a/test/accelerator/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp
+++ b/benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/QuantBatchMatmulV3_295_hif4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A4W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h
similarity index 100%
rename from test/accelerator/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h
rename to benchmarks/npu/cube/QuantBatchMatmulV3_295_hif4/params_mx_A4W4.h
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp
similarity index 81%
rename from test/accelerator/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp
rename to benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp
index 68e3aeb..fa67648 100644
--- a/test/accelerator/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp
+++ b/benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/QuantBatchMatmulV3_296_hif4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A4W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h
similarity index 100%
rename from test/accelerator/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h
rename to benchmarks/npu/cube/QuantBatchMatmulV3_296_hif4/params_mx_A4W4.h
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp b/benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp
similarity index 81%
rename from test/accelerator/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp
rename to benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp
index 68e3aeb..fa67648 100644
--- a/test/accelerator/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp
+++ b/benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/QuantBatchMatmulV3_297_hif4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A4W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h b/benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h
similarity index 100%
rename from test/accelerator/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h
rename to benchmarks/npu/cube/QuantBatchMatmulV3_297_hif4/params_mx_A4W4.h
diff --git a/test/accelerator/cube/compile.all b/benchmarks/npu/cube/compile.all
similarity index 100%
rename from test/accelerator/cube/compile.all
rename to benchmarks/npu/cube/compile.all
diff --git a/test/accelerator/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h b/benchmarks/npu/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h
rename to benchmarks/npu/cube/dsv3_q_up_proj_fp8_GB_DN_3buf/params_mx_A8W8.h
diff --git a/test/accelerator/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp b/benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp
similarity index 84%
rename from test/accelerator/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp
rename to benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp
index 4f15ff2..a6a4d4d 100644
--- a/test/accelerator/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp
+++ b/benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/dsv3_q_up_proj_mxfp8.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h b/benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h
rename to benchmarks/npu/cube/dsv3_q_up_proj_mxfp8/params_mx_A8W8.h
diff --git a/test/accelerator/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp b/benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp
similarity index 83%
rename from test/accelerator/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp
rename to benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp
index 07dc46c..d3f66ec 100644
--- a/test/accelerator/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp
+++ b/benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/llama3_70b_w8_bs_1_case_4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_A16W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __half a[MPC*KPC];
diff --git a/test/accelerator/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h b/benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h
similarity index 100%
rename from test/accelerator/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h
rename to benchmarks/npu/cube/llama3_70b_w8_bs_1_case_4/params_A16W8.h
diff --git a/test/accelerator/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp b/benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp
similarity index 81%
rename from test/accelerator/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp
rename to benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp
index af1359a..839ca1a 100644
--- a/test/accelerator/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp
+++ b/benchmarks/npu/cube/llama_train_mm_2_A16W4/llama_train_mm_2_A16W4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp b/benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp
similarity index 83%
rename from test/accelerator/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp
rename to benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp
index 07dc46c..d3f66ec 100644
--- a/test/accelerator/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp
+++ b/benchmarks/npu/cube/llama_train_mm_2_A16W8/llama_train_mm_2_A16W8.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_A16W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __half a[MPC*KPC];
diff --git a/test/accelerator/cube/llama_train_mm_2_A16W8/params_A16W8.h b/benchmarks/npu/cube/llama_train_mm_2_A16W8/params_A16W8.h
similarity index 100%
rename from test/accelerator/cube/llama_train_mm_2_A16W8/params_A16W8.h
rename to benchmarks/npu/cube/llama_train_mm_2_A16W8/params_A16W8.h
diff --git a/test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp b/benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp
similarity index 81%
rename from test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp
rename to benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp
index 48dd040..21467ee 100644
--- a/test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp
+++ b/benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/llama_train_mm_2_mxfp8_mxfp4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h b/benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h
similarity index 100%
rename from test/accelerator/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h
rename to benchmarks/npu/cube/llama_train_mm_2_mxfp8_mxfp4/params_mx_A8W4.h
diff --git a/test/accelerator/cube/llava1_6_6/llava1_6_6.cpp b/benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp
similarity index 83%
rename from test/accelerator/cube/llava1_6_6/llava1_6_6.cpp
rename to benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp
index 07dc46c..d3f66ec 100644
--- a/test/accelerator/cube/llava1_6_6/llava1_6_6.cpp
+++ b/benchmarks/npu/cube/llava1_6_6/llava1_6_6.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_A16W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __half a[MPC*KPC];
diff --git a/test/accelerator/cube/llava1_6_6/params_A16W8.h b/benchmarks/npu/cube/llava1_6_6/params_A16W8.h
similarity index 100%
rename from test/accelerator/cube/llava1_6_6/params_A16W8.h
rename to benchmarks/npu/cube/llava1_6_6/params_A16W8.h
diff --git a/test/accelerator/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp b/benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp
similarity index 100%
rename from test/accelerator/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp
rename to benchmarks/npu/cube/mat_mul_o1_align_0001/mat_mul_o1_align_0001.cpp
diff --git a/test/accelerator/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp b/benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp
similarity index 84%
rename from test/accelerator/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp
rename to benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp
+++ b/benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/matmul_1_bs16_fp8_GB_test.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h b/benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h
rename to benchmarks/npu/cube/matmul_1_bs16_fp8_GB_test/params_mx_A8W8.h
diff --git a/test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp b/benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp
similarity index 84%
rename from test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp
rename to benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp
+++ b/benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h b/benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h
rename to benchmarks/npu/cube/model_graph_graph7_mat_mul_0279_fp8_GB_DN_nbuf/params_mx_A8W8.h
diff --git a/test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp b/benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp
similarity index 84%
rename from test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp
rename to benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp
+++ b/benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/moe_w1w3_bs16_fp8_GB_DN_nbuf.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h b/benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h
rename to benchmarks/npu/cube/moe_w1w3_bs16_fp8_GB_DN_nbuf/params_mx_A8W8.h
diff --git a/test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp b/benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp
similarity index 81%
rename from test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp
rename to benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp
index 48dd040..21467ee 100644
--- a/test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp
+++ b/benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h b/benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h
similarity index 100%
rename from test/accelerator/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h
rename to benchmarks/npu/cube/mx_a8w4_float8_e4m3fn_float4_e2m1_bfloat16_0022/params_mx_A8W4.h
diff --git a/test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp b/benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp
similarity index 81%
rename from test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp
rename to benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp
index 48dd040..21467ee 100644
--- a/test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp
+++ b/benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W4.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     float a[M*K];
diff --git a/test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h b/benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h
similarity index 100%
rename from test/accelerator/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h
rename to benchmarks/npu/cube/mx_a8w4_nz_0001_float8_e4m3fn_float4_e2m1_bfloat16/params_mx_A8W4.h
diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h
similarity index 100%
rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h
rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/params_A16W8.h
diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp
similarity index 83%
rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp
rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp
index 07dc46c..d3f66ec 100644
--- a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp
+++ b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_A16W8/xinghuo_13b_tp8_matmul_01_A16W8.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_A16W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __half a[MPC*KPC];
diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h
similarity index 100%
rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h
rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/params_mx_A8W8.h
diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp
similarity index 84%
rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp
rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp
+++ b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_modified/xinghuo_13b_tp8_matmul_01_mxfp8_modified.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h
similarity index 100%
rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h
rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/params_mx_A8W4.h
diff --git a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp
similarity index 84%
rename from test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp
rename to benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp
index fe5bb69..517344e 100644
--- a/test/accelerator/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp
+++ b/benchmarks/npu/cube/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4/xinghuo_13b_tp8_matmul_01_mxfp8_mxfp4.cpp
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 #include "params_mx_A8W8.h"
-#include "../../include/accelerator_cube.h"
+#include <benchmark_support/npu/npu_cube.h>
 
 int main(){
     __fp8_e4m3 a[MPC*KPC];
diff --git a/test/accelerator/fusion/Makefile b/benchmarks/npu/fusion/Makefile
similarity index 100%
rename from test/accelerator/fusion/Makefile
rename to benchmarks/npu/fusion/Makefile
diff --git a/test/accelerator/fusion/compile.all b/benchmarks/npu/fusion/compile.all
similarity index 100%
rename from test/accelerator/fusion/compile.all
rename to benchmarks/npu/fusion/compile.all
diff --git a/test/accelerator/fusion/compile_fusion_2d_unroll.all b/benchmarks/npu/fusion/compile_fusion_2d_unroll.all
similarity index 100%
rename from test/accelerator/fusion/compile_fusion_2d_unroll.all
rename to benchmarks/npu/fusion/compile_fusion_2d_unroll.all
diff --git a/test/accelerator/fusion/compile_fusion_dcore.all b/benchmarks/npu/fusion/compile_fusion_dcore.all
similarity index 100%
rename from test/accelerator/fusion/compile_fusion_dcore.all
rename to benchmarks/npu/fusion/compile_fusion_dcore.all
diff --git a/test/accelerator/fusion/compile_fusion_dynamic.all b/benchmarks/npu/fusion/compile_fusion_dynamic.all
similarity index 100%
rename from test/accelerator/fusion/compile_fusion_dynamic.all
rename to benchmarks/npu/fusion/compile_fusion_dynamic.all
diff --git a/test/accelerator/fusion/compile_fusion_fp4.all b/benchmarks/npu/fusion/compile_fusion_fp4.all
similarity index 100%
rename from test/accelerator/fusion/compile_fusion_fp4.all
rename to benchmarks/npu/fusion/compile_fusion_fp4.all
diff --git a/test/accelerator/fusion/dynamic.list b/benchmarks/npu/fusion/dynamic.list
similarity index 100%
rename from test/accelerator/fusion/dynamic.list
rename to benchmarks/npu/fusion/dynamic.list
diff --git a/test/accelerator/fusion/fa1/fa1.cpp b/benchmarks/npu/fusion/fa1/fa1.cpp
similarity index 98%
rename from test/accelerator/fusion/fa1/fa1.cpp
rename to benchmarks/npu/fusion/fa1/fa1.cpp
index c155569..911b983 100644
--- a/test/accelerator/fusion/fa1/fa1.cpp
+++ b/benchmarks/npu/fusion/fa1/fa1.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 #include "fileop.h"
 
diff --git a/test/accelerator/fusion/fa10/fa10.cpp b/benchmarks/npu/fusion/fa10/fa10.cpp
similarity index 98%
rename from test/accelerator/fusion/fa10/fa10.cpp
rename to benchmarks/npu/fusion/fa10/fa10.cpp
index 84f75a5..b219a36 100644
--- a/test/accelerator/fusion/fa10/fa10.cpp
+++ b/benchmarks/npu/fusion/fa10/fa10.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa11/fa11.cpp b/benchmarks/npu/fusion/fa11/fa11.cpp
similarity index 95%
rename from test/accelerator/fusion/fa11/fa11.cpp
rename to benchmarks/npu/fusion/fa11/fa11.cpp
index 42b3d88..d4967c4 100644
--- a/test/accelerator/fusion/fa11/fa11.cpp
+++ b/benchmarks/npu/fusion/fa11/fa11.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 //need with bytemask
diff --git a/test/accelerator/fusion/fa2/fa2.cpp b/benchmarks/npu/fusion/fa2/fa2.cpp
similarity index 98%
rename from test/accelerator/fusion/fa2/fa2.cpp
rename to benchmarks/npu/fusion/fa2/fa2.cpp
index 05604f6..803696a 100644
--- a/test/accelerator/fusion/fa2/fa2.cpp
+++ b/benchmarks/npu/fusion/fa2/fa2.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa3/fa3.cpp b/benchmarks/npu/fusion/fa3/fa3.cpp
similarity index 98%
rename from test/accelerator/fusion/fa3/fa3.cpp
rename to benchmarks/npu/fusion/fa3/fa3.cpp
index 2d45c86..d41cc9c 100644
--- a/test/accelerator/fusion/fa3/fa3.cpp
+++ b/benchmarks/npu/fusion/fa3/fa3.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa4/fa4.cpp b/benchmarks/npu/fusion/fa4/fa4.cpp
similarity index 98%
rename from test/accelerator/fusion/fa4/fa4.cpp
rename to benchmarks/npu/fusion/fa4/fa4.cpp
index d2c10e0..80431d0 100644
--- a/test/accelerator/fusion/fa4/fa4.cpp
+++ b/benchmarks/npu/fusion/fa4/fa4.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa5/fa5.cpp b/benchmarks/npu/fusion/fa5/fa5.cpp
similarity index 98%
rename from test/accelerator/fusion/fa5/fa5.cpp
rename to benchmarks/npu/fusion/fa5/fa5.cpp
index 8511996..e4be6ca 100644
--- a/test/accelerator/fusion/fa5/fa5.cpp
+++ b/benchmarks/npu/fusion/fa5/fa5.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa6/fa6.cpp b/benchmarks/npu/fusion/fa6/fa6.cpp
similarity index 98%
rename from test/accelerator/fusion/fa6/fa6.cpp
rename to benchmarks/npu/fusion/fa6/fa6.cpp
index 1be9a70..04670c0 100644
--- a/test/accelerator/fusion/fa6/fa6.cpp
+++ b/benchmarks/npu/fusion/fa6/fa6.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa7/fa7.cpp b/benchmarks/npu/fusion/fa7/fa7.cpp
similarity index 98%
rename from test/accelerator/fusion/fa7/fa7.cpp
rename to benchmarks/npu/fusion/fa7/fa7.cpp
index d35d506..0d9e0ce 100644
--- a/test/accelerator/fusion/fa7/fa7.cpp
+++ b/benchmarks/npu/fusion/fa7/fa7.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa8/fa8.cpp b/benchmarks/npu/fusion/fa8/fa8.cpp
similarity index 98%
rename from test/accelerator/fusion/fa8/fa8.cpp
rename to benchmarks/npu/fusion/fa8/fa8.cpp
index 799638c..b1d264e 100644
--- a/test/accelerator/fusion/fa8/fa8.cpp
+++ b/benchmarks/npu/fusion/fa8/fa8.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 #define B 1
diff --git a/test/accelerator/fusion/fa9/fa9.cpp b/benchmarks/npu/fusion/fa9/fa9.cpp
similarity index 98%
rename from test/accelerator/fusion/fa9/fa9.cpp
rename to benchmarks/npu/fusion/fa9/fa9.cpp
index 1cced45..6ea94aa 100644
--- a/test/accelerator/fusion/fa9/fa9.cpp
+++ b/benchmarks/npu/fusion/fa9/fa9.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 
 //need with philox
diff --git a/test/accelerator/fusion/fa_fp4/fa_fp4.cpp b/benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp
similarity index 98%
rename from test/accelerator/fusion/fa_fp4/fa_fp4.cpp
rename to benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp
index dce68c7..8f0c80d 100644
--- a/test/accelerator/fusion/fa_fp4/fa_fp4.cpp
+++ b/benchmarks/npu/fusion/fa_fp4/fa_fp4.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 #include "benchmark.h"
 #include "fileop.h"
 
diff --git a/test/accelerator/fusion/flashmla13/flashmla13.cpp b/benchmarks/npu/fusion/flashmla13/flashmla13.cpp
similarity index 100%
rename from test/accelerator/fusion/flashmla13/flashmla13.cpp
rename to benchmarks/npu/fusion/flashmla13/flashmla13.cpp
diff --git a/test/accelerator/fusion/opt.list b/benchmarks/npu/fusion/opt.list
similarity index 100%
rename from test/accelerator/fusion/opt.list
rename to benchmarks/npu/fusion/opt.list
diff --git a/test/accelerator/fusion/program.list b/benchmarks/npu/fusion/program.list
similarity index 100%
rename from test/accelerator/fusion/program.list
rename to benchmarks/npu/fusion/program.list
diff --git a/test/accelerator/fusion/simall.py b/benchmarks/npu/fusion/simall.py
similarity index 100%
rename from test/accelerator/fusion/simall.py
rename to benchmarks/npu/fusion/simall.py
diff --git a/test/accelerator/nddma/Makefile b/benchmarks/npu/nddma/Makefile
similarity index 100%
rename from test/accelerator/nddma/Makefile
rename to benchmarks/npu/nddma/Makefile
diff --git a/test/accelerator/nddma/compile_transpose.all b/benchmarks/npu/nddma/compile_transpose.all
similarity index 100%
rename from test/accelerator/nddma/compile_transpose.all
rename to benchmarks/npu/nddma/compile_transpose.all
diff --git a/test/accelerator/nddma/transpose_053_mgather/transpose_053_mgather.cpp b/benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp
similarity index 96%
rename from test/accelerator/nddma/transpose_053_mgather/transpose_053_mgather.cpp
rename to benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp
index e9152fe..b481f34 100644
--- a/test/accelerator/nddma/transpose_053_mgather/transpose_053_mgather.cpp
+++ b/benchmarks/npu/nddma/transpose_053_mgather/transpose_053_mgather.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_transpose.h"
+#include <benchmark_support/npu/npu_transpose.h>
 #include "benchmark.h"
 #include "fileop.h"
 
@@ -71,7 +71,7 @@ int main()
             TableGT dstGlobal(gDst);
 
             MGATHER(elemTile, srcGlobal, loadIdxTile16);
-            // TCOPYIN(elemTile, srcGlobal);
+            // TLOAD(elemTile, srcGlobal);
 
             MSCATTER(dstGlobal, elemTile, storeIdxTile16);
         }
diff --git a/test/accelerator/nddma/transpose_053_tload/transpose_053_tload.cpp b/benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp
similarity index 95%
rename from test/accelerator/nddma/transpose_053_tload/transpose_053_tload.cpp
rename to benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp
index 1d5767b..abb9929 100644
--- a/test/accelerator/nddma/transpose_053_tload/transpose_053_tload.cpp
+++ b/benchmarks/npu/nddma/transpose_053_tload/transpose_053_tload.cpp
@@ -54,8 +54,8 @@ int main()
                     gm_shape_out gDst(out + dst_slice_offset);
 
                     tile_shape tmp;
-                    TCOPYIN(tmp, gSrc);
-                    TCOPYOUT(gDst, tmp);
+                    TLOAD(tmp, gSrc);
+                    TSTORE(gDst, tmp);
                 }
             }
         }
diff --git a/test/accelerator/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp b/benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp
similarity index 82%
rename from test/accelerator/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp
rename to benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp
index cc178b9..e3d929f 100644
--- a/test/accelerator/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp
+++ b/benchmarks/npu/vec_simd/Add_ND_bfloat16_float32_DeepSeek_V3_000028/Add_ND_bfloat16_float32_DeepSeek_V3_000028.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 
 #define kM 1024
 #define kN 1024
diff --git a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp
similarity index 95%
rename from test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp
rename to benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp
index f37b51d..4b4e445 100644
--- a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp
+++ b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic/LayerNormV4_ND_bfloat16_IDZJ06_25B_8K_LORA_R6144_000001_grad_chip_generic.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp
similarity index 95%
rename from test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp
rename to benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp
index 2878ed5..4d8e7dc 100644
--- a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp
+++ b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R12288_000020_grad_chip_generic.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp
similarity index 95%
rename from test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp
rename to benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp
index 43043b9..0e26fc9 100644
--- a/test/accelerator/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp
+++ b/benchmarks/npu/vec_simd/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV/LayerNormV4_ND_bfloat16_float32_X1_ViT175B_R24576_000020_grad_GENERIC_AIV.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/Makefile b/benchmarks/npu/vec_simd/Makefile
similarity index 100%
rename from test/accelerator/vec_simd/Makefile
rename to benchmarks/npu/vec_simd/Makefile
diff --git a/test/accelerator/vec_simd/compile.all b/benchmarks/npu/vec_simd/compile.all
similarity index 100%
rename from test/accelerator/vec_simd/compile.all
rename to benchmarks/npu/vec_simd/compile.all
diff --git a/test/accelerator/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp b/benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp
similarity index 95%
rename from test/accelerator/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp
rename to benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp
index e8a2da8..89dee66 100644
--- a/test/accelerator/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp
+++ b/benchmarks/npu/vec_simd/gemm_18x128x256/gemm_18x128x256.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp b/benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp
similarity index 95%
rename from test/accelerator/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp
rename to benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp
index ab6a0ff..d72a518 100644
--- a/test/accelerator/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp
+++ b/benchmarks/npu/vec_simd/layernorm_vcadd_vaddx3_12288_fp16/layernorm_vcadd_vaddx3_12288_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp b/benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp
similarity index 95%
rename from test/accelerator/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp
rename to benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp
index 271d9fc..46e8675 100644
--- a/test/accelerator/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp
+++ b/benchmarks/npu/vec_simd/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV/moe_gating_top_k_deepseekv3_16_fp32_GENERIC_AIV.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp
similarity index 93%
rename from test/accelerator/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp
rename to benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp
index 3dfb6c2..74b4735 100644
--- a/test/accelerator/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp
+++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_1_16384_fp16/rmsnorm_reduce_1_16384_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp
similarity index 93%
rename from test/accelerator/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp
rename to benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp
index 93ef753..1c50c92 100644
--- a/test/accelerator/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp
+++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_2_8192_fp16/rmsnorm_reduce_2_8192_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp
similarity index 93%
rename from test/accelerator/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp
rename to benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp
index 3194b39..f170d87 100644
--- a/test/accelerator/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp
+++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_4096_fp16/rmsnorm_reduce_4_4096_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp
similarity index 93%
rename from test/accelerator/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp
rename to benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp
index 5d782f3..ed31dac 100644
--- a/test/accelerator/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp
+++ b/benchmarks/npu/vec_simd/rmsnorm_reduce_4_5120_fp16/rmsnorm_reduce_4_5120_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp b/benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp
similarity index 95%
rename from test/accelerator/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp
rename to benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp
index 1240850..6ea2179 100644
--- a/test/accelerator/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp
+++ b/benchmarks/npu/vec_simd/rope_32_40_1_64_bf16/rope_32_40_1_64_bf16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp b/benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp
similarity index 94%
rename from test/accelerator/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp
rename to benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp
index 5a5024f..97100bf 100644
--- a/test/accelerator/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp
+++ b/benchmarks/npu/vec_simd/softmax_8_34_fp16/softmax_8_34_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
@@ -18,7 +18,7 @@ void softmax_local(__half* dst, __half* src){
 
     gm_shape gsrc(src);
     tile_shape tsrc;
-    TCOPYIN(tsrc, gsrc);
+    TLOAD(tsrc, gsrc);
 
     tMax tLocalMax;
     TROWMAX(tLocalMax, tsrc);
@@ -42,7 +42,7 @@ void softmax_local(__half* dst, __half* src){
     TDIV(tres, tExp, tSumExpand);
 
     gm_shape gdst(dst);
-    TCOPYOUT(gdst, tres);
+    TSTORE(gdst, tres);
 }
 
 int main(){
diff --git a/test/accelerator/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp b/benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp
similarity index 94%
rename from test/accelerator/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp
rename to benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp
index 8a5be54..0a8ff03 100644
--- a/test/accelerator/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp
+++ b/benchmarks/npu/vec_simd/softmax_LLM_2/softmax_LLM_2.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp
similarity index 94%
rename from test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp
rename to benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp
index a260789..9075da5 100644
--- a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp
+++ b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_bf16/softmax_vaddx3_vcadd_1_4096_bf16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp
similarity index 94%
rename from test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp
rename to benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp
index 9bac6bb..bf861a2 100644
--- a/test/accelerator/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp
+++ b/benchmarks/npu/vec_simd/softmax_vaddx3_vcadd_1_4096_fp16/softmax_vaddx3_vcadd_1_4096_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp b/benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp
similarity index 96%
rename from test/accelerator/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp
rename to benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp
index f952193..fb080ac 100644
--- a/test/accelerator/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp
+++ b/benchmarks/npu/vec_simd/swiglu_64_1024_fp16/swiglu_64_1024_fp16.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simd.h"
+#include <benchmark_support/npu/npu_vec_simd.h>
 #include "benchmark.h"
 #include "common.h"
 
diff --git a/test/accelerator/vec_simt/Makefile b/benchmarks/npu/vec_simt/Makefile
similarity index 64%
rename from test/accelerator/vec_simt/Makefile
rename to benchmarks/npu/vec_simt/Makefile
index 80e8b22..fca51d5 100644
--- a/test/accelerator/vec_simt/Makefile
+++ b/benchmarks/npu/vec_simt/Makefile
@@ -1,13 +1,15 @@
+.DEFAULT_GOAL := all
+
 TARGET = $(ELF_HEAD)_$(TESTCASE).elf
 SRC_FILE +=  $(TEST_ROOT)/$(CATEGORY)/$(TESTCASE)/$(TESTCASE).cpp
 
 # Special handling for hashfind - embed data as object files
-EXTRA_OBJ_FILES :=
-EXTRA_OBJ_DEPS :=
+EXTRA_OBJ_FILES =
+EXTRA_OBJ_DEPS =
 
 # Data object files location (relative paths)
 DATA_OBJ_DIR := hashfind/data_obj
-OUTPUT_DATA_OBJ_DIR := ../../output/accelerator/vec_simt/hashfind/data_obj
+OUTPUT_DATA_OBJ_DIR = $(OBJ_ROOT)/$(CATEGORY)/hashfind/data_obj
 
 # hashfind uses embedded data (simple_ dataset)
 ifeq ($(TESTCASE), hashfind)
@@ -20,11 +22,6 @@ pre_work: build_data_objs
 build_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR)
 
-# Pattern rule so make doesn't use a generic implicit rule for .s → .o in the data_obj dir
-$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
-
 endif
 
 # hashfind_simple also uses embedded data (simple_ prefixed)
@@ -38,11 +35,11 @@ pre_work: build_data_objs
 build_data_objs:
 	@COMPILER_DIR="$(COMPILER_DIR)" $(DATA_OBJ_DIR)/build_data_obj.sh $(DATA_OBJ_DIR) $(OUTPUT_DATA_OBJ_DIR)
 
-# Pattern rule so make doesn't use a generic implicit rule for .s → .o in the data_obj dir
-$(OUTPUT_DATA_OBJ_DIR)/%.o: $(DATA_OBJ_DIR)/%.s pre_work
-	@mkdir -p $(shell dirname $@)
-	$(AS) $(CC_O_ALL) $(INCLUDE) $(DEFINES) $< -o $@
-
 endif
 
-include ../../common/Makefile.common
\ No newline at end of file
+include ../../common/Makefile.common
+
+ifneq ($(EXTRA_OBJ_FILES),)
+$(EXTRA_OBJ_FILES): pre_work
+	@true
+endif
diff --git a/benchmarks/npu/vec_simt/compile.all b/benchmarks/npu/vec_simt/compile.all
new file mode 100755
index 0000000..1f7b2fc
--- /dev/null
+++ b/benchmarks/npu/vec_simt/compile.all
@@ -0,0 +1,5 @@
+#! /bin/bash
+
+make TESTCASE=npu_hashtable_insert_cmp_host
+make TESTCASE=npu_hashtable_lookup_cmp_host
+make TESTCASE=hashfind
diff --git a/test/kernel/control/hashtable_lookup_simd/compute_offsets.py b/benchmarks/npu/vec_simt/hashfind/compute_offsets.py
similarity index 100%
rename from test/kernel/control/hashtable_lookup_simd/compute_offsets.py
rename to benchmarks/npu/vec_simt/hashfind/compute_offsets.py
diff --git a/benchmarks/npu/vec_simt/hashfind/data_obj/.gitignore b/benchmarks/npu/vec_simt/hashfind/data_obj/.gitignore
new file mode 100644
index 0000000..dbf14ab
--- /dev/null
+++ b/benchmarks/npu/vec_simt/hashfind/data_obj/.gitignore
@@ -0,0 +1,3 @@
+*.s
+*.o
+*.data
diff --git a/test/accelerator/vec_simt/hashfind/data_obj/build_data_obj.sh b/benchmarks/npu/vec_simt/hashfind/data_obj/build_data_obj.sh
similarity index 53%
rename from test/accelerator/vec_simt/hashfind/data_obj/build_data_obj.sh
rename to benchmarks/npu/vec_simt/hashfind/data_obj/build_data_obj.sh
index 095eed2..e4cbbff 100755
--- a/test/accelerator/vec_simt/hashfind/data_obj/build_data_obj.sh
+++ b/benchmarks/npu/vec_simt/hashfind/data_obj/build_data_obj.sh
@@ -1,10 +1,21 @@
 #!/bin/bash
-COMPILER_DIR="${COMPILER_DIR:-/remote/lms60/c00622284/janus/linxisa_compiler_v0.55/linx_blockisa_llvm_musl/bin}"
-DATA_OBJ_DIR="$1"
-OUTPUT_DIR="$2"
+set -euo pipefail
+
+COMPILER_DIR="${COMPILER_DIR:-/usr/bin}"
+LINX_TARGET="${LINX_TARGET:-linx64-linx-none-elf}"
+DATA_OBJ_DIR="${1:?data object directory required}"
+OUTPUT_DIR="${2:?output directory required}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+CASE_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
 
 mkdir -p "$OUTPUT_DIR"
 
+if [[ ! -f "${DATA_OBJ_DIR}/simple_inserted_slot.data" ||
+      ! -f "${DATA_OBJ_DIR}/simple_lookup_keys.data" ||
+      ! -f "${DATA_OBJ_DIR}/simple_lookup_values.data" ]]; then
+    (cd "$CASE_DIR" && python3 gen_data_simple.py)
+fi
+
 build_one() {
     local name="$1"
     local data_file="${DATA_OBJ_DIR}/${name}.data"
@@ -26,16 +37,12 @@ _binary_${name}_data_end:
 .equ _binary_${name}_data_size, .-_binary_${name}_data_start
 EOF
 
-    $COMPILER_DIR/clang++ -target linx64v5 -c "$asm_file" -o "$obj_file"
+    "${COMPILER_DIR}/clang++" -target "$LINX_TARGET" -c "$asm_file" -o "$obj_file"
 }
 
-build_one "inserted_slot"
-build_one "lookup_keys"
-build_one "lookup_values"
-
 # Simple dataset (8192 entries, 80% load, 1024 queries)
 build_one "simple_inserted_slot"
 build_one "simple_lookup_keys"
 build_one "simple_lookup_values"
 
-echo "Done building data object files"
\ No newline at end of file
+echo "Done building data object files"
diff --git a/test/accelerator/vec_simt/hashfind/gen_data_simple.py b/benchmarks/npu/vec_simt/hashfind/gen_data_simple.py
similarity index 100%
rename from test/accelerator/vec_simt/hashfind/gen_data_simple.py
rename to benchmarks/npu/vec_simt/hashfind/gen_data_simple.py
diff --git a/test/accelerator/vec_simt/hashfind/hashfind.cpp b/benchmarks/npu/vec_simt/hashfind/hashfind.cpp
similarity index 99%
rename from test/accelerator/vec_simt/hashfind/hashfind.cpp
rename to benchmarks/npu/vec_simt/hashfind/hashfind.cpp
index 88a46cd..46b7522 100644
--- a/test/accelerator/vec_simt/hashfind/hashfind.cpp
+++ b/benchmarks/npu/vec_simt/hashfind/hashfind.cpp
@@ -282,14 +282,14 @@ void loadKeys(typename HashFindTypes<kTileRows, kTileCols>::TileU32& lowTile,
     TileU16 offsetLowTile, offsetHighTile;
     OffsetGT offsetLowGlobal(g_offset_low);
     OffsetGT offsetHighGlobal(g_offset_high);
-    TCOPYIN(offsetLowTile,  offsetLowGlobal);
-    TCOPYIN(offsetHighTile, offsetHighGlobal);
+    TLOAD(offsetLowTile,  offsetLowGlobal);
+    TLOAD(offsetHighTile, offsetHighGlobal);
 
     KeyGT keysGlobal(queries);
     MGATHER(lowTile,  keysGlobal, offsetLowTile);
     MGATHER(highTile, keysGlobal, offsetHighTile);
 
-    TCOPYIN(queryKeyTile, keysGlobal);
+    TLOAD(queryKeyTile, keysGlobal);
 }
 
 // ============================================================================
@@ -487,12 +487,12 @@ void runHashFind(int32_t __out__ *out,
         // Load the 256 distinct update values, then MSCATTER to the table
         using UpdGT = GlobalTensor<int32_t, Shape<1,1,1,kTileRows,kTileCols>, Stride<1,1,1,kTileCols,1>>;
         UpdGT updGlobal(update_values);
-        TCOPYIN(updateTile, updGlobal);
+        TLOAD(updateTile, updGlobal);
         updateValues<kTileRows, kTileCols, kCap>(updateTile, foundIdxTile, table);
     }
 
     TileGT outGlobal(out);
-    TCOPYOUT(outGlobal, outTile);
+    TSTORE(outGlobal, outTile);
 }
 
 template <int kTileRows, int kTileCols, int kCap, int kMaxProbe>
diff --git a/test/accelerator/vec_simt/accel_hashtable_insert_cmp_host/accel_hashtable_insert_cmp_host.cpp b/benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp
similarity index 94%
rename from test/accelerator/vec_simt/accel_hashtable_insert_cmp_host/accel_hashtable_insert_cmp_host.cpp
rename to benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp
index c487c92..c1e6ac7 100644
--- a/test/accelerator/vec_simt/accel_hashtable_insert_cmp_host/accel_hashtable_insert_cmp_host.cpp
+++ b/benchmarks/npu/vec_simt/npu_hashtable_insert_cmp_host/npu_hashtable_insert_cmp_host.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simt.h"
+#include <benchmark_support/npu/npu_vec_simt.h>
 #include "benchmark.h"
 
 #define INSERT_NUM 4096
diff --git a/test/accelerator/vec_simt/accel_hashtable_lookup_cmp_host/accel_hashtable_lookup_cmp_host.cpp b/benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp
similarity index 93%
rename from test/accelerator/vec_simt/accel_hashtable_lookup_cmp_host/accel_hashtable_lookup_cmp_host.cpp
rename to benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp
index 41296ce..db93a2b 100644
--- a/test/accelerator/vec_simt/accel_hashtable_lookup_cmp_host/accel_hashtable_lookup_cmp_host.cpp
+++ b/benchmarks/npu/vec_simt/npu_hashtable_lookup_cmp_host/npu_hashtable_lookup_cmp_host.cpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../../include/accelerator_vec_simt.h"
+#include <benchmark_support/npu/npu_vec_simt.h>
 #include "benchmark.h"
 
 #define LOOKUP_NUM 4096
diff --git a/test/run_ci.py b/benchmarks/run_ci.py
similarity index 94%
rename from test/run_ci.py
rename to benchmarks/run_ci.py
index 090c65d..1057a8d 100755
--- a/test/run_ci.py
+++ b/benchmarks/run_ci.py
@@ -8,7 +8,7 @@
 import subprocess
 
 def compile():
-    os.chdir(os.path.dirname(__file__)+"/tileop_api")
+    os.chdir(os.path.join(os.path.dirname(__file__), "api", "tileop"))
     print(os.getcwd())
     subprocess.run("./compile.all", shell=True)
     return
@@ -18,7 +18,6 @@ def run():
     return
 
 def verify():
-    
     return 
 
 if __name__ == '__main__':
@@ -39,5 +38,3 @@ def verify():
     compile()
     run()
     verify()
-
-    
\ No newline at end of file
diff --git a/benchmarks/scripts/legacy_batch/bench_all.sh b/benchmarks/scripts/legacy_batch/bench_all.sh
new file mode 100755
index 0000000..8256298
--- /dev/null
+++ b/benchmarks/scripts/legacy_batch/bench_all.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+set -e
+set -x
+set -o pipefail
+
+cd "$(dirname "$0")/../../.."
+
+export CC_OPT=default
+
+python3 benchmarks/scripts/legacy_batch/run_compile.py
+ERRS=$(grep fail: benchmarks/cm_log/compile_summary.log | awk '{print $2}')
+PASS=$(($ERRS <= 4))
+if [[ x"$PASS" != x"1" ]]; then
+  cat benchmarks/cm_log/compile_summary.log
+  exit 1
+fi
+
+# ELF_LIST="output/api/tileop/elf/*.elf output/microbench/lmbench/elf/*.elf output/kernels/composite/elf/*.elf output/models/deepseekv3/elf/*.elf"
+# 
+# realpath $ELF_LIST > tmp.list
+# 
+# if [[ -f $QEMU ]]; then
+#   ARGS="$ARGS -m $QEMU"
+# fi
+# python3 benchmarks/scripts/legacy_batch/run_qemu.py -i tmp.list -o benchmarks/cm_log/qemu_run.log $ARGS
+# rm -f tmp.list
diff --git a/test/other/scripts/run_ci.py b/benchmarks/scripts/legacy_batch/run_ci.py
similarity index 89%
rename from test/other/scripts/run_ci.py
rename to benchmarks/scripts/legacy_batch/run_ci.py
index 5fd0388..35bf327 100755
--- a/test/other/scripts/run_ci.py
+++ b/benchmarks/scripts/legacy_batch/run_ci.py
@@ -8,10 +8,11 @@
 import subprocess
 from pathlib import Path
 
-OTHER_ROOT = Path(__file__).resolve().parent.parent
+REPO_ROOT = Path(__file__).resolve().parents[3]
+BENCHMARK_ROOT = REPO_ROOT / "benchmarks"
 
 def compile():
-    compile_dir = OTHER_ROOT / "tileop_api"
+    compile_dir = BENCHMARK_ROOT / "api" / "tileop"
     print(compile_dir)
     subprocess.run("./compile.all", cwd=compile_dir, shell=True)
     return
diff --git a/test/other/scripts/run_compile.py b/benchmarks/scripts/legacy_batch/run_compile.py
similarity index 85%
rename from test/other/scripts/run_compile.py
rename to benchmarks/scripts/legacy_batch/run_compile.py
index a9bbd6a..1140fc8 100755
--- a/test/other/scripts/run_compile.py
+++ b/benchmarks/scripts/legacy_batch/run_compile.py
@@ -14,33 +14,33 @@
 MAX_WORKERS = 20  #parallel thread num depend on your machine
 
 REPO_ROOT = Path(__file__).resolve().parents[3]
-TEST_ROOT = REPO_ROOT / "test"
-CM_LOG_DIR = TEST_ROOT / "cm_log"
+BENCHMARK_ROOT = REPO_ROOT / "benchmarks"
+CM_LOG_DIR = BENCHMARK_ROOT / "cm_log"
 
 compile_result = {"pass":[], "fail":[], "timeout":[]}
 
 compile_list = [
-    "other/tileop_test/compile.all",
-    "kernel/orther/compile_softmax.all",
-    "kernel/orther/compile_gemm.all",
-    "kernel/orther/compile_linear.all",
-    "kernel/orther/compile_matmul.all",
-    "kernel/orther/compile_flash_attention.all",
-    "other/lmbench/compile_mem.all",
-    "other/vec/compile_lat_bw.all",
-    "other/cube/compile.all",
-    "other/deepseek/compile.all",
-    "accelerator/vec_simd/compile.all",
-    "accelerator/fusion/compile.all",
+    "api/tileop/compile.all",
+    "kernels/composite/compile_softmax.all",
+    "kernels/composite/compile_gemm.all",
+    "kernels/composite/compile_linear.all",
+    "kernels/composite/compile_matmul.all",
+    "kernels/composite/compile_flash_attention.all",
+    "microbench/lmbench/compile_mem.all",
+    "microbench/vec/compile_lat_bw.all",
+    "microbench/cube/compile.all",
+    "models/deepseekv3/compile.all",
+    "npu/vec_simd/compile.all",
+    "npu/fusion/compile.all",
 ]
 
 def cmd_config_parse(list):
     pass
 
 def compile_elf(compile_file):
-    cmd_path = TEST_ROOT / compile_file
+    cmd_path = BENCHMARK_ROOT / compile_file
     cmd_dir = cmd_path.parent
-    cmd_rel_dir = cmd_dir.relative_to(TEST_ROOT)
+    cmd_rel_dir = cmd_dir.relative_to(BENCHMARK_ROOT)
     cmd = f"{cmd_path.stem}_{str(cmd_rel_dir).replace(os.sep, '_')}"
     print(f"processing {cmd}...")
 
@@ -77,7 +77,7 @@ def compile_elf(compile_file):
     args = parser.parse_args()
 
     os.environ["PLAT"] = args.plat
-    print(f"test_root is {TEST_ROOT}")
+    print(f"benchmark_root is {BENCHMARK_ROOT}")
     if CM_LOG_DIR.exists():
         shutil.rmtree(CM_LOG_DIR)
     CM_LOG_DIR.mkdir(parents=True)
diff --git a/test/other/scripts/run_qemu.py b/benchmarks/scripts/legacy_batch/run_qemu.py
similarity index 100%
rename from test/other/scripts/run_qemu.py
rename to benchmarks/scripts/legacy_batch/run_qemu.py
diff --git a/test/other/scripts/run_result_check.py b/benchmarks/scripts/legacy_batch/run_result_check.py
similarity index 100%
rename from test/other/scripts/run_result_check.py
rename to benchmarks/scripts/legacy_batch/run_result_check.py
diff --git a/test/script/README.md b/benchmarks/scripts/recursive/README.md
similarity index 57%
rename from test/script/README.md
rename to benchmarks/scripts/recursive/README.md
index 5b1394a..399e232 100644
--- a/test/script/README.md
+++ b/benchmarks/scripts/recursive/README.md
@@ -6,7 +6,7 @@
 - options:
   -h            show this help message and exit
   -lib          root directory of the TileOP library, case: /xx/PTOTileLib/
-  -src          directory to compile (recursive), case: /xx/PTOTileLib/test/tileop_api/src/
+  -src          directory to compile (recursive), case: /xx/PTOTileLib/benchmarks/api/tileop/src/
                 defaults to lib
   -m            test model: cmp or run, default cmp
   -lc           linx clang++ path, case: /xx/linx_blockisa_llvm/bin/clang++
@@ -29,16 +29,16 @@ run: compile + run
 ## Usage examples
 
 - Compile the cpu_sim version
-python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -hc /xx/llvm-15.0.4/bin/clang++
+python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -hc /xx/llvm-15.0.4/bin/clang++
 
 - Compile the jcore version
-python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -lc /xx/linx_blockisa_llvm/bin/clang++
+python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -lc /xx/linx_blockisa_llvm/bin/clang++
 
 - Compile + run the cpu_sim version
-python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -hc /xx/llvm-15.0.4/bin/clang++ -m run
+python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -hc /xx/llvm-15.0.4/bin/clang++ -m run
 
 - Compile + run + functional verification, jcore version
-python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/test/tileop_api/src -lc /xx/linx_blockisa_llvm/bin/clang++ -hc /xx/llvm-15.0.4/bin/clang++ -qemu /xx/qemu-linx -m run
+python3 /xx/test.py -lib /xx/PTOTileLib/ -src /xx/PTOTileLib/benchmarks/api/tileop/src -lc /xx/linx_blockisa_llvm/bin/clang++ -hc /xx/llvm-15.0.4/bin/clang++ -qemu /xx/qemu-linx -m run
 
 - Compile + run + functional verification, single test case
-python3 {$REPO}/test/tileop_api/test_tileop.py -lib {$REPO} -src {$REPO}/test/tileop_api/src -lc {$L_CHAIN}/bin/clang++ -hc {$H_CHAIN}/bin/clang++ -qemu {$QEMU}/qemu-linx -m run -case=Txxx
\ No newline at end of file
+python3 {$REPO}/benchmarks/scripts/recursive/test.py -lib {$REPO} -src {$REPO}/benchmarks/api/tileop/src -lc {$L_CHAIN}/bin/clang++ -hc {$H_CHAIN}/bin/clang++ -qemu {$QEMU}/qemu-linx -m run -case=Txxx
\ No newline at end of file
diff --git a/test/script/test.py b/benchmarks/scripts/recursive/test.py
similarity index 99%
rename from test/script/test.py
rename to benchmarks/scripts/recursive/test.py
index 47f682d..3158791 100644
--- a/test/script/test.py
+++ b/benchmarks/scripts/recursive/test.py
@@ -870,7 +870,7 @@ def main(
         "-src",
         type=str,
         default="None",
-        help="input test dir, case: /xx/Linx-TileOP-API/test/tileop_api/src",
+        help="input test dir, case: /xx/Linx-TileOP-API/benchmarks/api/tileop/src",
     )
     argParser.add_argument(
         "-m", type=str, default="cmp", help="test model: cmp or run, default cmp"
diff --git a/compiler/linx_blockisa_llvm_musl.tar.gz b/compiler/linx_blockisa_llvm_musl.tar.gz
new file mode 100644
index 0000000..ad015cb
--- /dev/null
+++ b/compiler/linx_blockisa_llvm_musl.tar.gz
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0d7b21b7b0888299562a2b06ff78f2d2e973c0cfdb9bd61aec140686624d2029
+size 821947387
diff --git a/include/aarch64/TCopyIn.hpp b/include/aarch64/TLoad.hpp
similarity index 100%
rename from include/aarch64/TCopyIn.hpp
rename to include/aarch64/TLoad.hpp
diff --git a/include/aarch64/TCopyOut.hpp b/include/aarch64/TStore.hpp
similarity index 100%
rename from include/aarch64/TCopyOut.hpp
rename to include/aarch64/TStore.hpp
diff --git a/test/accelerator/include/accelerator_cube.h b/include/benchmark_support/npu/npu_cube.h
similarity index 89%
rename from test/accelerator/include/accelerator_cube.h
rename to include/benchmark_support/npu/npu_cube.h
index 9b08f02..424a758 100644
--- a/test/accelerator/include/accelerator_cube.h
+++ b/include/benchmark_support/npu/npu_cube.h
@@ -1,6 +1,6 @@
 #include <common/pto_tileop.hpp>
 
-template <const int kM, const int kN, const int kK, 
+template <const int kM, const int kN, const int kK,
           const int kTM, const int kTN, const int kTK>
 void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float *dequant) {
     using gm_shapeA = global_tensor<__half, RowMajor<kM, kK>>;
@@ -70,8 +70,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA tA;
             tile_shapeB tB_ori;
             tile_shapeB_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -83,8 +83,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA_trows tA;
             tile_shapeB_tcols tB_ori;
             tile_shapeB_tcols_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -92,7 +92,7 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
         //quant pre VQF322BF16_PRE
         tile_shapeACC_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(i, Nb);
@@ -109,8 +109,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA tA;
             tile_shapeB_trows tB_ori;
             tile_shapeB_trows_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -121,8 +121,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA_trows tA;
             tile_shapeB_tcorner tB_ori;
             tile_shapeB_tcorner_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -130,7 +130,7 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
         //quant pre
         tile_shapeC_trows_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
     }
     if constexpr (rmd_M) {
@@ -149,8 +149,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA_tcols tA;
             tile_shapeB tB_ori;
             tile_shapeB_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -161,15 +161,15 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA_tcorner tA;
             tile_shapeB_tcols tB_ori;
             tile_shapeB_tcols_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_tcols_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(Mb, Nb);
@@ -177,7 +177,7 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
         tile_shapeC_tcorner tACC;
         tile_shapeA_tcols tA(0);
         tile_shapeB_trows_cvt tB(0);
-        MATMUL(tACC, tA, tB);  
+        MATMUL(tACC, tA, tB);
         #pragma clang loop unroll(full)
         for (int k = 0; k < Kb; ++k) {
             auto gA = gAIter(Mb, k);
@@ -186,8 +186,8 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA_tcols tA;
             tile_shapeB_trows tB_ori;
             tile_shapeB_trows_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -198,20 +198,20 @@ void matmul_kernel_a16w8(__half *c_ptr, __half *a_ptr, __fp8_e4m3 *b_ptr, float
             tile_shapeA_tcorner tA;
             tile_shapeB_tcorner tB_ori;
             tile_shapeB_tcorner_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_tcorner_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
     }
 }
 
-template <const int kM, const int kN, const int kK, 
+template <const int kM, const int kN, const int kK,
           const int kTM, const int kTN, const int kTK>
 void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, float *dequant) {
     using gm_shapeA = global_tensor<__fp8_e4m3, RowMajor<kM, kK>>;
@@ -277,8 +277,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA tA;
             tile_shapeB tB_ori;
             tile_shapeB_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -290,8 +290,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_trows tA;
             tile_shapeB_tcols tB_ori;
             tile_shapeB_tcols_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -299,7 +299,7 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         //quant pre VQF322BF16_PRE
         tile_shapeACC_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(i, Nb);
@@ -316,8 +316,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA tA;
             tile_shapeB_trows tB_ori;
             tile_shapeB_trows_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -328,8 +328,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_trows tA;
             tile_shapeB_tcorner tB_ori;
             tile_shapeB_tcorner_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -337,7 +337,7 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         //quant pre
         tile_shapeC_trows_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
     }
     if constexpr (rmd_M) {
@@ -356,8 +356,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcols tA;
             tile_shapeB tB_ori;
             tile_shapeB_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -368,15 +368,15 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcorner tA;
             tile_shapeB_tcols tB_ori;
             tile_shapeB_tcols_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_tcols_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(Mb, Nb);
@@ -384,7 +384,7 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         tile_shapeC_tcorner tACC;
         TileLeft<__half, kTM, kTK, rmd_M, kTK> tA(0);
         TileRight<__half, kTK, kTN, kTK, rmd_N> tB(0);
-        MATMUL(tACC, tA, tB);  
+        MATMUL(tACC, tA, tB);
         #pragma clang loop unroll(full)
         for (int k = 0; k < Kb; ++k) {
             auto gA = gAIter(Mb, k);
@@ -393,8 +393,8 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcols tA;
             tile_shapeB_trows tB_ori;
             tile_shapeB_trows_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -405,20 +405,20 @@ void matmul_kernel_a8w8(__fp8_e4m3 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcorner tA;
             tile_shapeB_tcorner tB_ori;
             tile_shapeB_tcorner_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCVT(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_tcorner_cvt tACC_cvt;
         TCVT(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
     }
 }
 
-template <const int kM, const int kN, const int kK, 
+template <const int kM, const int kN, const int kK,
           const int kTM, const int kTN, const int kTK>
 void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr, __fp8_e4m3 *amx, __fp8_e4m3 *bmx, float *dequant) {
     using gm_shapeA = global_tensor<__fp8_e4m3, RowMajor<kM, kK>>;
@@ -499,8 +499,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA tA;
             tile_shapeB tB_ori;
             tile_shapeB_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -512,16 +512,16 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_trows tA;
             tile_shapeB_tcols tB_ori;
             tile_shapeB_tcols_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB_ori, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
 
-        //quant pre(acc * dequant_scale) 
+        //quant pre(acc * dequant_scale)
         auto gDQ = gDQIter(0,j);
         tile_shapeDQ tDQ;
-        TCOPYIN(tDQ, gDQ);
+        TLOAD(tDQ, gDQ);
         tile_shapeACC tDQ_expand;
         TEXPANDROW(tDQ_expand,tDQ);
         TMULS(tACC, tACC, tDQ_expand);
@@ -531,7 +531,7 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         // TMULS(tACC_scale, tACC_scale, static_cast<float>(2));
         tile_shapeACC_cvt tACC_cvt;
         TCAST(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(i, Nb);
@@ -548,8 +548,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA tA;
             tile_shapeB_trows tB_ori;
             tile_shapeB_trows_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -560,22 +560,22 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_trows tA;
             tile_shapeB_tcorner tB_ori;
             tile_shapeB_tcorner_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
 
-        //quant pre(acc * dequant_scale) 
+        //quant pre(acc * dequant_scale)
         auto gDQ = gDQIter(0,Nb);
         tile_shapeDQ tDQ;
-        TCOPYIN(tDQ, gDQ);
+        TLOAD(tDQ, gDQ);
         tile_shapeC_trows tDQ_expand;
         TEXPANDROW(tDQ_expand,tDQ);
         TMULS(tACC, tACC, tDQ_expand);
         tile_shapeC_trows_cvt tACC_cvt;
         TCAST(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
     }
     if constexpr (rmd_M) {
@@ -594,8 +594,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcols tA;
             tile_shapeB tB_ori;
             tile_shapeB_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -606,8 +606,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcorner tA;
             tile_shapeB_tcols tB_ori;
             tile_shapeB_tcols_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -615,13 +615,13 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         //need quantization for C
         auto gDQ = gDQIter(0,j);
         tile_shapeDQ tDQ;
-        TCOPYIN(tDQ, gDQ);
+        TLOAD(tDQ, gDQ);
         tile_shapeC_tcols tDQ_expand;
         TEXPANDROW(tDQ_expand,tDQ);
         TMULS(tACC, tACC, tDQ_expand);
         tile_shapeC_tcols_cvt tACC_cvt;
         TCAST(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(Mb, Nb);
@@ -629,7 +629,7 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         tile_shapeC_tcorner tACC;
         tile_shapeA_tcols tA(0);
         tile_shapeB_trows_cvt tB(0);
-        MATMUL(tACC, tA, tB);  
+        MATMUL(tACC, tA, tB);
         #pragma clang loop unroll(full)
         for (int k = 0; k < Kb; ++k) {
             auto gA = gAIter(Mb, k);
@@ -638,8 +638,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcols tA;
             tile_shapeB_trows tB_ori;
             tile_shapeB_trows_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB_ori, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -650,8 +650,8 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
             tile_shapeA_tcorner tA;
             tile_shapeB_tcorner tB_ori;
             tile_shapeB_tcorner_cvt tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             TCAST(tB, tB_ori);
             MATMACC(tACC, tA, tB);
         }
@@ -660,13 +660,13 @@ void matmul_kernel_mx_a8w8(__bf16 *c_ptr, __fp8_e4m3 *a_ptr, __fp8_e5m2 *b_ptr,
         //need quantization for C
         auto gDQ = gDQIter(0,Nb);
         tile_shapeDQ tDQ;
-        TCOPYIN(tDQ, gDQ);
+        TLOAD(tDQ, gDQ);
         tile_shapeC_tcorner tDQ_expand;
         TEXPANDROW(tDQ_expand,tDQ);
         TMULS(tACC, tACC, tDQ_expand);
         tile_shapeC_tcorner_cvt tACC_cvt;
         TCAST(tACC_cvt, tACC);
-        TCOPYOUT(gC, tACC_cvt);
+        TSTORE(gC, tACC_cvt);
         }
     }
 }
@@ -724,8 +724,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA tA;
         tile_shapeB tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
 
@@ -735,12 +735,12 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA_trows tA;
         tile_shapeB_tcols tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
 
-      TCOPYOUT(gC, tACC);
+      TSTORE(gC, tACC);
     }
     if constexpr (rmd_N) {
       auto gC = gCIter(i, Nb);
@@ -756,8 +756,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA tA;
         tile_shapeB_trows tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
       if constexpr (rmd_K) {
@@ -766,11 +766,11 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA_trows tA;
         tile_shapeB_tcorner tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
-      TCOPYOUT(gC, tACC);
+      TSTORE(gC, tACC);
     }
   }
   if constexpr (rmd_M) {
@@ -788,8 +788,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA_tcols tA;
         tile_shapeB tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
       if constexpr (rmd_K) {
@@ -798,11 +798,11 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA_tcorner tA;
         tile_shapeB_tcols tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
-      TCOPYOUT(gC, tACC);
+      TSTORE(gC, tACC);
     }
     if constexpr (rmd_N) {
       auto gC = gCIter(Mb, Nb);
@@ -810,7 +810,7 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
       tile_shapeC_tcorner tACC;
       tile_shapeA_tcols tA(0);
       tile_shapeB_trows tB(0);
-      MATMUL(tACC, tA, tB);  
+      MATMUL(tACC, tA, tB);
       #pragma clang loop unroll(full)
       for (int k = 0; k < Kb; ++k) {
         auto gA = gAIter(Mb, k);
@@ -818,8 +818,8 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA_tcols tA;
         tile_shapeB_trows tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
       if constexpr (rmd_K) {
@@ -828,11 +828,11 @@ void matmul_a32w32(float *c_ptr, float *a_ptr, float *b_ptr) {
 
         tile_shapeA_tcorner tA;
         tile_shapeB_tcorner tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         MATMACC(tACC, tA, tB);
       }
-      TCOPYOUT(gC, tACC);
+      TSTORE(gC, tACC);
     }
   }
 }
diff --git a/test/accelerator/include/accelerator_fa_2d_unroll.h b/include/benchmark_support/npu/npu_fa_2d_unroll.h
similarity index 98%
rename from test/accelerator/include/accelerator_fa_2d_unroll.h
rename to include/benchmark_support/npu/npu_fa_2d_unroll.h
index d076b17..253448a 100644
--- a/test/accelerator/include/accelerator_fa_2d_unroll.h
+++ b/include/benchmark_support/npu/npu_fa_2d_unroll.h
@@ -49,9 +49,9 @@ void __vec__ new_max_1src(
     #ifndef RES_CHECK
     upd_max = upd_max * src_scale;
     #endif
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 
 template<typename tileSrc, typename tileSrc_cast, typename tileMax>
@@ -100,7 +100,7 @@ void __vec__ new_sum_1src(
         typename tileSrc::DType exp_src_3 = src_ptr[src_idx_3];
         typename tileSrc::DType exp_src_01 = exp_src_0 + exp_src_1;
         typename tileSrc::DType exp_src_23 = exp_src_2 + exp_src_3;
-        typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23;        
+        typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23;
         upd_sum += exp_src_0123;
     }
     blkv_get_tile_ptr(new_sum)[sum_idx] = upd_sum;
@@ -332,9 +332,9 @@ void __vec__ new_max_4src(
         upd_max = blkv_max(upd_max, s3_max_0123);
     }
     upd_max = upd_max * src_scale;
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 
 template<typename tileSrc, typename tileSrc_cast, typename tileMax>
@@ -407,7 +407,7 @@ void __vec__ src_exp_2src_with_local_sum(
         BLKC_ASSIGN_CAST(src_exp1, idx_2, src1_exp2);
         BLKC_ASSIGN_CAST(src_exp1, idx_3, src1_exp3);
         typename tileSum::DType src1_exp_sum = src1_exp0 + src1_exp1 + src1_exp2 + src1_exp3;
-        
+
         upd_sum += src0_exp_sum + src1_exp_sum;
     }
     size_t idx_sum = i * tileSum::RowStride;
@@ -450,7 +450,7 @@ void __vec__ new_sum_4src(
         typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3];
         typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1;
         typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3;
-        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;  
+        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;
 
         typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0];
         typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1];
@@ -458,7 +458,7 @@ void __vec__ new_sum_4src(
         typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3];
         typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1;
         typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3;
-        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;  
+        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;
 
         typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0];
         typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1];
@@ -466,7 +466,7 @@ void __vec__ new_sum_4src(
         typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3];
         typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1;
         typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3;
-        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;  
+        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;
 
         typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0];
         typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1];
@@ -532,7 +532,7 @@ void __vec__ local_max_4src(
         upd_max = blkv_max(upd_max, s0123_max);
     }
     upd_max = upd_max * src_scale;
-    local_max_ptr[max_idx] = upd_max;  
+    local_max_ptr[max_idx] = upd_max;
 }
 
 template<typename tileSrc, typename tileSum>
@@ -568,7 +568,7 @@ void __vec__ local_sum_4src(
         typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3];
         typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1;
         typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3;
-        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;  
+        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;
 
         typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0];
         typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1];
@@ -576,7 +576,7 @@ void __vec__ local_sum_4src(
         typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3];
         typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1;
         typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3;
-        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;  
+        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;
 
         typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0];
         typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1];
@@ -584,7 +584,7 @@ void __vec__ local_sum_4src(
         typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3];
         typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1;
         typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3;
-        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;  
+        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;
 
         typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0];
         typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1];
@@ -622,7 +622,7 @@ void __vec__ new_max_of_2_loc_max(
     typename tileMax::DType local_max_01 = blkv_max(local_max_0_ptr[max_idx], local_max_1_ptr[max_idx]);
     upd_max = blkv_max(upd_max, local_max_01);
     new_max_ptr[max_idx] = upd_max;
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 template<typename tileScale, typename tileSum>
 void __vec__ new_sum_of_2_loc_sum(
@@ -677,7 +677,7 @@ void __vec__ new_max_of_4_loc_max(
     typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23);
     upd_max = blkv_max(upd_max, local_max_0123);
     new_max_ptr[max_idx] = upd_max;
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 template<typename tileScale, typename tileSum>
 void __vec__ new_sum_of_4_loc_sum(
@@ -702,7 +702,7 @@ void __vec__ new_sum_of_4_loc_sum(
 
     size_t sum_idx = i*tileSum::RowStride;
 
-    new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] + 
+    new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] +
                            local_sum_0_ptr[sum_idx] + local_sum_1_ptr[sum_idx] +
                            local_sum_2_ptr[sum_idx] + local_sum_3_ptr[sum_idx];
 }
@@ -731,7 +731,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
     using tileW_cast = Tile<Location::Vec, typename tileW_type<dtype>::DType, kTm, kTk, BLayout::ColMajor>;
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor>; // [kTm×vD]
@@ -768,7 +768,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
 
         tileQ tQ[Xdim];
 
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -778,7 +778,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
-                TCOPYIN(tQ[x], gQ);
+                TLOAD(tQ[x], gQ);
             }
         #endif
 
@@ -813,7 +813,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
                 #pragma clang loop unroll(full)
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                 }
             #endif
 
@@ -878,8 +878,8 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data(),
                                                                 scale);
@@ -888,7 +888,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
                     //                                             tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                     //                                             tNewMax[x].data(),
                     //                                             scale);
-                    
+
                     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -906,7 +906,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -926,7 +926,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -966,7 +966,7 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
                 #pragma clang loop unroll(full)
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tV[y], gV);
                 }
             #endif
 
@@ -1036,11 +1036,11 @@ void flash_attention_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);
+                TSTORE(dstO, tO_cast[x]);
             }
         #endif
 
     }
 }
 
-#include "accelerator_fa_unalign_2d_unroll.h"
+#include "npu_fa_unalign_2d_unroll.h"
diff --git a/test/accelerator/include/accelerator_fa_2d_unroll_pto.h b/include/benchmark_support/npu/npu_fa_2d_unroll_pto.h
similarity index 97%
rename from test/accelerator/include/accelerator_fa_2d_unroll_pto.h
rename to include/benchmark_support/npu/npu_fa_2d_unroll_pto.h
index 3dfa040..5adff39 100644
--- a/test/accelerator/include/accelerator_fa_2d_unroll_pto.h
+++ b/include/benchmark_support/npu/npu_fa_2d_unroll_pto.h
@@ -11,7 +11,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
     using tileW_cast = Tile<Location::Vec, typename tileW_type<dtype>::DType, kTm, kTk, BLayout::ColMajor>;
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor>; // [kTm×vD]
@@ -50,7 +50,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
         #pragma clang loop unroll(full)
         for(int x=0;x<Xdim;x++){
             auto gQ = gIterQ(i+x,0);
-            TCOPYIN(tQ[x], gQ);
+            TLOAD(tQ[x], gQ);
         }
 
         tileMax tMax[Xdim];
@@ -77,7 +77,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
             #pragma clang loop unroll(full)
             for(int y=0;y<Ydim;y++){
                 auto gK = gIterK(0, j+y);
-                TCOPYIN(tK[y], gK);
+                TLOAD(tK[y], gK);
             }
 
             tileW tW[Xdim][Ydim];
@@ -97,7 +97,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
             tileSum tNewSum[Xdim];
 
             tileW_cast tExpW[Xdim][Ydim];
-            
+
             tileMax tLocalMax[Xdim][Ydim];
             tileSum tLocalSum[Xdim][Ydim];
             tileSum tScaledOldSum[Xdim];
@@ -111,7 +111,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
                 for(int y=0;y<Ydim;y++){
                     TCOLMAX_TEPL(tLocalMax[x][y], tW[x][y]);
                 }
-                
+
                 #if Ydim == 1
                     TMAX_TEPL(tNewMax[x], tMax[x], tLocalMax[x][0]);
                 #elif Ydim == 2
@@ -176,7 +176,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
             #pragma clang loop unroll(full)
             for(int y=0;y<Ydim;y++){
                 auto gV = gIterV(j+y, 0);
-                TCOPYIN(tV[y], gV);
+                TLOAD(tV[y], gV);
             }
 
             // ColMajor -> Nz
@@ -233,7 +233,7 @@ void flash_attention_2d_unroll_pto(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, d
             TCOLEXPANDMUL_TEPL(tO[x], tO[x], tInvSum[x]);
             TCAST_TEPL(tO_cast[x], tO[x]);
             auto dstO = gIterO(i+x, 0);
-            TCOPYOUT(dstO, tO_cast[x]);
+            TSTORE(dstO, tO_cast[x]);
         }
     }
 }
diff --git a/test/accelerator/include/accelerator_fa_dcore.h b/include/benchmark_support/npu/npu_fa_dcore.h
similarity index 98%
rename from test/accelerator/include/accelerator_fa_dcore.h
rename to include/benchmark_support/npu/npu_fa_dcore.h
index 80322da..1ed8436 100644
--- a/test/accelerator/include/accelerator_fa_dcore.h
+++ b/include/benchmark_support/npu/npu_fa_dcore.h
@@ -124,7 +124,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_
     using tileQ      = MultiTile<MULTI, TileLeft<dtype, kTm, (qD==192? 256:qD), kTm, qD>>;       // [kTm×qD]
     using tileK      = MultiTile<MULTI, TileRight<dtype, (qD==192? 256:qD), kTk, qD, kTk>>;      // [vD×kTk]
     // using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = MultiTile<MULTI, Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>>; 
+    using tileW      = MultiTile<MULTI, Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>>;
     using tileW_cast = MultiTile<MULTI, Tile<Location::Vec, dtype, kTm, kTk, BLayout::ColMajor>>;
     using tileW_left = MultiTile<MULTI, TileLeft<dtype, kTm, kTk>>;
 
@@ -151,7 +151,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_
     const float scale = 1.0f / sqrt((float)qD);
     const int Qb = (Sq + kTm - 1) / kTm;
     const int Kb = (Skv + kTk - 1) / kTk;
-    
+
     // For each Q-block (i)
     for (int i = 0; i < Qb; i += MULTI) {
         // Load the current Q block (only once)
@@ -161,7 +161,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_
         auto gQ = gIterQ(i,0);
         TLOAD2_ND2NZ(tQ.Tiles[1], tQ.Tiles[0], gQ);
         #else
-        TCOPYIN(tQ, [&](int t) { return gIterQ(i + t, 0); });
+        TLOAD(tQ, [&](int t) { return gIterQ(i + t, 0); });
         #endif
 
         // Initialize state: running max / exponential sum / output accumulator
@@ -178,7 +178,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_
             // Load K_j and V_j
             auto gK = gIterK(0, j);
             tileK tK;
-            TCOPYIN(tK, gK);
+            TLOAD(tK, gK);
 
             // Compute the attention score block
             tileW tW;
@@ -207,7 +207,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_
             // Compute the weighted output of the current block: O_j = W * V
             auto gV = gIterV(j, 0);
             tileV tV;
-            TCOPYIN(tV, gV);
+            TLOAD(tV, gV);
             MATMUL(tPV, tW_left, tV);
 
             if(j==0){
@@ -233,7 +233,7 @@ void flash_attention_dcore(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_
         auto gO = gIterO(i, 0);
         TSTORE2_DN2DN(gO, tO_cast.Tiles[1], tO_cast.Tiles[0]);
         #else
-        TCOPYOUT([&](int t) { return gIterO(i + t, 0); }, tO_cast);
+        TSTORE([&](int t) { return gIterO(i + t, 0); }, tO_cast);
         #endif
     }
 }
diff --git a/test/accelerator/include/accelerator_fa_dynamic.h b/include/benchmark_support/npu/npu_fa_dynamic.h
similarity index 98%
rename from test/accelerator/include/accelerator_fa_dynamic.h
rename to include/benchmark_support/npu/npu_fa_dynamic.h
index 0882089..c0be2a3 100644
--- a/test/accelerator/include/accelerator_fa_dynamic.h
+++ b/include/benchmark_support/npu/npu_fa_dynamic.h
@@ -231,9 +231,9 @@ void __vec__ new_max_4src_dynamic(
         upd_max = blkv_max(upd_max, s3_max_0123);
     }
     upd_max = upd_max * src_scale;
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 
 template<typename tileSrc, typename tileSum, typename tileScale>
@@ -274,7 +274,7 @@ void __vec__ new_sum_4src_dynamic(
         typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3];
         typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1;
         typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3;
-        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;  
+        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;
 
         typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0];
         typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1];
@@ -282,7 +282,7 @@ void __vec__ new_sum_4src_dynamic(
         typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3];
         typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1;
         typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3;
-        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;  
+        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;
 
         typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0];
         typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1];
@@ -290,7 +290,7 @@ void __vec__ new_sum_4src_dynamic(
         typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3];
         typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1;
         typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3;
-        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;  
+        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;
 
         typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0];
         typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1];
@@ -358,7 +358,7 @@ void __vec__ local_max_4src_dynamic(
         upd_max = blkv_max(upd_max, s0123_max);
     }
     upd_max = upd_max * src_scale;
-    local_max_ptr[max_idx] = upd_max;  
+    local_max_ptr[max_idx] = upd_max;
 }
 
 template<typename tileSrc, typename tileSum>
@@ -396,7 +396,7 @@ void __vec__ local_sum_4src_dynamic(
         typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3];
         typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1;
         typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3;
-        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;  
+        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;
 
         typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0];
         typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1];
@@ -404,7 +404,7 @@ void __vec__ local_sum_4src_dynamic(
         typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3];
         typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1;
         typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3;
-        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;  
+        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;
 
         typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0];
         typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1];
@@ -412,7 +412,7 @@ void __vec__ local_sum_4src_dynamic(
         typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3];
         typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1;
         typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3;
-        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;  
+        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;
 
         typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0];
         typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1];
@@ -463,7 +463,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
     using tileW_out  = TileAcc<float, kTm, kTk, -1, -1>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor, -1, -1>;
     using tileW_cast = Tile<Location::Vec, dtype, kTm, kTk, BLayout::ColMajor, -1, -1>;
-    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>;
 
     using tileO_out  = TileAcc<float, kTm, vD, -1, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor, -1, vD>; // [kTm×vD]
@@ -491,7 +491,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
         int dyn_m = (i+1) * kTm > Sq ? rQ:kTm;
         tileQ tQ[Xdim]; for (auto& x : tQ) { x = tileQ(dyn_m);}
 
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 size_t offset_Q = (i+x) * tileQ::Rows * qD;
@@ -503,7 +503,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
             for(int x=0;x<Xdim;x++){
                 size_t offset_Q = (i+x) * tileQ::Rows * qD;
                 gmQ gQ(q_ptr+offset_Q, Sq);
-                TCOPYIN(tQ[x], gQ);
+                TLOAD(tQ[x], gQ);
             }
         #endif
 
@@ -533,7 +533,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
             for(int y=0;y<Ydim;y++){
                 size_t offset_K = (j+y) * tileK::Cols * qD;
                 gmK gK(k_ptr+offset_K, Skv);
-                TCOPYIN(tK[y], gK);
+                TLOAD(tK[y], gK);
             }
 
             tileW tW[Xdim][Ydim];for (auto& row : tW) for (auto& x : row) { x = tileW(dyn_m, dyn_k);}
@@ -607,8 +607,8 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     new_max_4src_dynamic<tileW, tileMax><<<tMax[x].GetValidRow(), 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data(),
                                                                 scale, tW[0][0].GetValidCol());
@@ -636,7 +636,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
             for(int y=0;y<Ydim;y++){
                 size_t offset_V = (j+y) * tileV::Rows * vD;
                 gmV gV(v_ptr+offset_V, Skv);
-                TCOPYIN(tV[y], gV);
+                TLOAD(tV[y], gV);
             }
 
             // ColMajor -> Nz
@@ -690,7 +690,7 @@ __attribute__((noinline)) void flash_attention_dynamic(dtype* out_ptr, dtype* q_
         for (int x = 0; x < Xdim; ++x) {
             size_t offset_O = (i+x) * tileO_cast::Rows * vD;
             gmO dstO(out_ptr+offset_O, Sq);
-            TCOPYOUT(dstO, tO_cast[x]);
+            TSTORE(dstO, tO_cast[x]);
         }
 
         i+=Xdim;
@@ -895,7 +895,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
     using tileW_out  = TileAcc<float, kTm, kTk, -1, -1>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor, -1, -1>;
     using tileW_cast = Tile<Location::Vec, dtype, kTm, kTk, BLayout::ColMajor, -1, -1>;
-    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>;
 
     using tileO_out  = TileAcc<float, kTm, vD, -1, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor, -1, vD>; // [kTm×vD]
@@ -922,7 +922,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
         for(int x=0;x<dyn_Xdim;x++){
             size_t offset_Q = (i+x) * tileQ::Rows * qD;
             gmQ gQ(q_ptr+offset_Q, Sq);
-            TCOPYIN(tQ[x], gQ);
+            TLOAD(tQ[x], gQ);
         }
 
         tileMax tMax[Xdim]; for (auto& x : tMax) { x = tileMax(dyn_m);}
@@ -941,8 +941,8 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
         tileScale tScale[Xdim]; for (auto& x : tScale) { x = tileScale(dyn_m);}
 
         for (int j = 0; j < Kb; ) {
-            int dyn_Ydim = (j+Ydim)   < (Kb-1)? Ydim: 
-                           (j+Ydim/2) < (Kb-1)? Ydim/2:  
+            int dyn_Ydim = (j+Ydim)   < (Kb-1)? Ydim:
+                           (j+Ydim/2) < (Kb-1)? Ydim/2:
                            (j+Ydim/4) < (Kb-1)? Ydim/4:1;
             int dyn_k = (j+1) * kTk > Skv ? rK:kTk;
 
@@ -951,7 +951,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
             for(int y=0;y<dyn_Ydim;y++){
                 size_t offset_K = (j+y) * tileK::Cols * qD;
                 gmK gK(k_ptr+offset_K, Skv);
-                TCOPYIN(tK[y], gK);
+                TLOAD(tK[y], gK);
             }
 
             tileW tW[Xdim][Ydim];for (auto& row : tW) for (auto& x : row) { x = tileW(dyn_m, dyn_k);}
@@ -1000,12 +1000,12 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
                 tileSum tLocalSum[Xdim][2]; for (auto& row : tLocalSum) for (auto& x : row) { x = tileSum(dyn_m);}
                 for(int x=0;x<dyn_Xdim;x++){
                     new_max_4src_dynamic<tileW, tileMax><<<tMax[x].GetValidRow(), 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data(),
                                                                 scale, tW[0][0].GetValidCol());
-                    
+
                     src_exp_2src_with_local_sum_dynamic<tileW, tileW_cast, tileMax, tileSum><<<tW[0][0].GetValidRow(), 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale, tW[0][0].GetValidCol());
                     src_exp_2src_with_local_sum_dynamic<tileW, tileW_cast, tileMax, tileSum><<<tW[0][0].GetValidRow(), 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -1019,7 +1019,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
             for(int y=0;y<dyn_Ydim;y++){
                 size_t offset_V = (j+y) * tileV::Rows * vD;
                 gmV gV(v_ptr+offset_V, Skv);
-                TCOPYIN(tV[y], gV);
+                TLOAD(tV[y], gV);
             }
 
             // ColMajor -> Nz
@@ -1066,7 +1066,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
         for (int x = 0; x < dyn_Xdim; ++x) {
             size_t offset_O = (i+x) * tileO_cast::Rows * vD;
             gmO dstO(out_ptr+offset_O, Sq);
-            TCOPYOUT(dstO, tO_cast[x]);
+            TSTORE(dstO, tO_cast[x]);
         }
 
         i+=dyn_Xdim;
@@ -1083,7 +1083,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
         for(int x=0;x<Xdim;x++){                                    \
             size_t offset_Q = (i+x) * tileQ::Rows * qD;             \
             gmQ gQ(q_ptr+offset_Q, Sq);                             \
-            TCOPYIN(tQ[x], gQ);                                     \
+            TLOAD(tQ[x], gQ);                                       \
         }                                                           \
                                                                     \
         tileMax tMax[Xdim];                                         \
@@ -1133,7 +1133,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
             for(int y=0;y<Ydim;y++){                                \
                 size_t offset_K = (j+y) * tileK::Cols * qD;         \
                 gmK gK(k_ptr+offset_K, Skv);                        \
-                TCOPYIN(tK[y], gK);                                 \
+                TLOAD(tK[y], gK);                                   \
             }                                                       \
             tileW tW[Xdim][Ydim];                                   \
             _Pragma("clang loop unroll(full)")                      \
@@ -1215,7 +1215,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
             for(int y=0;y<Ydim;y++){ \
                 size_t offset_V = (j+y) * tileV::Rows * vD; \
                 gmV gV(v_ptr+offset_V, Skv); \
-                TCOPYIN(tV[y], gV); \
+                TLOAD(tV[y], gV); \
             } \
             /* ColMajor -> Nz */ \
             /* 计算当前块的加权输出: O_j = W * V */ \
@@ -1270,7 +1270,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
         for (int x = 0; x < Xdim; ++x) { \
             size_t offset_O = (i+x) * tileO_cast::Rows * vD; \
             gmO dstO(out_ptr+offset_O, Sq); \
-            TCOPYOUT(dstO, tO_cast[x]); \
+            TSTORE(dstO, tO_cast[x]); \
         }
 
 template <typename dtype, int qD, int vD, int kTm, int kTk>
@@ -1286,7 +1286,7 @@ __attribute__((noinline)) void flash_attention_dynamic_unroll(dtype* out_ptr, dt
     using tileW_out  = TileAcc<float, kTm, kTk, -1, -1>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor, -1, -1>;
     using tileW_cast = Tile<Location::Vec, dtype, kTm, kTk, BLayout::ColMajor, -1, -1>;
-    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>;
 
     using tileO_out  = TileAcc<float, kTm, vD, -1, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor, -1, vD>; // [kTm×vD]
diff --git a/test/accelerator/include/accelerator_fa_fp4.h b/include/benchmark_support/npu/npu_fa_fp4.h
similarity index 100%
rename from test/accelerator/include/accelerator_fa_fp4.h
rename to include/benchmark_support/npu/npu_fa_fp4.h
diff --git a/test/accelerator/include/accelerator_fa_manual.h b/include/benchmark_support/npu/npu_fa_manual.h
similarity index 97%
rename from test/accelerator/include/accelerator_fa_manual.h
rename to include/benchmark_support/npu/npu_fa_manual.h
index 691ebdd..8401529 100644
--- a/test/accelerator/include/accelerator_fa_manual.h
+++ b/include/benchmark_support/npu/npu_fa_manual.h
@@ -37,9 +37,9 @@ void __vec__ new_max_manual(
     #ifndef RES_CHECK
     upd_max = upd_max * src_scale;
     #endif
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 
 template<typename tileSrc, typename tileSrc_cast, typename tileMax>
@@ -88,7 +88,7 @@ void __vec__ new_sum_manual(
         typename tileSrc::DType exp_src_3 = src_ptr[src_idx_3];
         typename tileSrc::DType exp_src_01 = exp_src_0 + exp_src_1;
         typename tileSrc::DType exp_src_23 = exp_src_2 + exp_src_3;
-        typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23;        
+        typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23;
         upd_sum += exp_src_0123;
     }
     blkv_get_tile_ptr(new_sum)[sum_idx] = upd_sum;
@@ -108,7 +108,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
     using tileW_cast = Tile<Location::Vec, dtype, kTm, kTk, BLayout::ColMajor>;
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor>; // [kTm×vD]
@@ -148,7 +148,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
         #pragma clang loop unroll(full)
         for(int x=0;x<Xdim;x++){
             auto gQ = gIterQ(i+x,0);
-            TCOPYIN(tQ[x], gQ);
+            TLOAD(tQ[x], gQ);
         }
 
         tileMax tMax[Xdim];
@@ -177,9 +177,9 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
             tileW_cast tExpW[Xdim][Ydim];
 
             auto gK = gIterK(0, j+0);
-            TCOPYIN(tK[0], gK);
+            TLOAD(tK[0], gK);
             gK = gIterK(0, j+1);
-            TCOPYIN(tK[1], gK);
+            TLOAD(tK[1], gK);
 
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
@@ -219,7 +219,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
 
             //Q0 * K2 = S2
             gK = gIterK(0, j+2);
-            TCOPYIN(tK[2], gK);
+            TLOAD(tK[2], gK);
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
                 tileW_out tW_out;
@@ -230,7 +230,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
             // P0 * V0 = PV0
             tileV tV[Ydim];
             auto gV = gIterV(j+0, 0);
-            TCOPYIN(tV[0], gV);
+            TLOAD(tV[0], gV);
 
             tileW_left tW_left[Xdim][Ydim];
             #pragma clang loop unroll(full)
@@ -285,7 +285,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
 
             //Q0 * K3 = S3
             gK = gIterK(0, j+3);
-            TCOPYIN(tK[3], gK);
+            TLOAD(tK[3], gK);
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
                 tileW_out tW_out;
@@ -295,7 +295,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
 
             //P1 * V1 = PV1
             gV = gIterV(j+1, 0);
-            TCOPYIN(tV[1], gV);
+            TLOAD(tV[1], gV);
 
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
@@ -341,7 +341,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
 
             //P2 * V2 = PV2
             gV = gIterV(j+2, 0);
-            TCOPYIN(tV[2], gV);
+            TLOAD(tV[2], gV);
 
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
@@ -387,7 +387,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
 
             //P3 * V3 = PV3
             gV = gIterV(j+3, 0);
-            TCOPYIN(tV[3], gV);
+            TLOAD(tV[3], gV);
 
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
@@ -400,7 +400,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
             for(int x=0;x<Xdim;x++){
                 tMax[x] = tNewMax[x];
                 tSum[x] = tNewSum[x];
-            }        
+            }
 
         tileO_cast tO_cast[Xdim];
         #pragma clang loop unroll(full)
@@ -412,7 +412,7 @@ void flash_attention_manual(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v
         #pragma clang loop unroll(full)
         for (int x = 0; x < Xdim; ++x) {
             auto dstO = gIterO(i+x, 0);
-            TCOPYOUT(dstO, tO_cast[x]);
+            TSTORE(dstO, tO_cast[x]);
         }
 
     }
diff --git a/test/accelerator/include/accelerator_fa_opt1.h b/include/benchmark_support/npu/npu_fa_opt1.h
similarity index 94%
rename from test/accelerator/include/accelerator_fa_opt1.h
rename to include/benchmark_support/npu/npu_fa_opt1.h
index 81630b2..4d4d598 100644
--- a/test/accelerator/include/accelerator_fa_opt1.h
+++ b/include/benchmark_support/npu/npu_fa_opt1.h
@@ -9,8 +9,8 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD), kTm, qD>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD), kTk, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::RowMajor>; 
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::RowMajor>;
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor>; // [kTm×vD]
@@ -40,7 +40,7 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i, 0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -55,8 +55,8 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -94,6 +94,6 @@ void flash_attention_opt1(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         TCAST(tO_cast, tO);
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
\ No newline at end of file
diff --git a/test/accelerator/include/accelerator_fa_opt2.h b/include/benchmark_support/npu/npu_fa_opt2.h
similarity index 94%
rename from test/accelerator/include/accelerator_fa_opt2.h
rename to include/benchmark_support/npu/npu_fa_opt2.h
index a2f856d..2f0d157 100644
--- a/test/accelerator/include/accelerator_fa_opt2.h
+++ b/include/benchmark_support/npu/npu_fa_opt2.h
@@ -9,8 +9,8 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD), kTm, qD>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD), kTk, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>; 
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor>; // [kTm×vD]
@@ -34,13 +34,13 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     const float scale = 1.0f / sqrt((float)qD);
     const int Qb = (Sq + kTm - 1) / kTm;
     const int Kb = (Skv + kTk - 1) / kTk;
-    
+
     // 对每个 Q-block (i)
     for (int i = 0; i < Qb; ++i) {
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i,0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -55,8 +55,8 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -94,6 +94,6 @@ void flash_attention_opt2(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         TCAST(tO_cast, tO);
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
\ No newline at end of file
diff --git a/test/accelerator/include/accelerator_fa_opt3.h b/include/benchmark_support/npu/npu_fa_opt3.h
similarity index 97%
rename from test/accelerator/include/accelerator_fa_opt3.h
rename to include/benchmark_support/npu/npu_fa_opt3.h
index 70741d6..9ed0634 100644
--- a/test/accelerator/include/accelerator_fa_opt3.h
+++ b/include/benchmark_support/npu/npu_fa_opt3.h
@@ -124,8 +124,8 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD), kTm, qD>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD), kTk, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>; 
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor>; // [kTm×vD]
@@ -150,13 +150,13 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     const float scale = 1.0f / sqrt((float)qD);
     const int Qb = (Sq + kTm - 1) / kTm;
     const int Kb = (Skv + kTk - 1) / kTk;
-    
+
     // 对每个 Q-block (i)
     for (int i = 0; i < Qb; ++i) {
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i,0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -171,8 +171,8 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -180,7 +180,7 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
 
         // Nz -> ColMajor
         tileW tW;
-        #ifdef TEMPLATE 
+        #ifdef TEMPLATE
         ACCSCALE_NZ2DN(tW, tW_out, scale);
 
         tileMax tNewMax;
@@ -225,7 +225,7 @@ void flash_attention_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         normalize<tileO_cast, tileO, tileSum><<<tileO::ValidRow, tileO::ValidCol, 1>>>(tO_cast.data(), tO.data(), tRescaleO.data(), tSum.data());
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
 
@@ -240,7 +240,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
     using tileQ      = MultiTile<MULTI, TileLeft<dtype, kTm, (qD==192? 256:qD), kTm, qD>>;       // [kTm×qD]
     using tileK      = MultiTile<MULTI, TileRight<dtype, (qD==192? 256:qD), kTk, qD, kTk>>;      // [vD×kTk]
     // using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = MultiTile<MULTI, Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>>; 
+    using tileW      = MultiTile<MULTI, Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>>;
     using tileW_left = MultiTile<MULTI, TileLeft<dtype, kTm, kTk>>;
 
     // using tileO_out  = TileAcc<float, kTm, vD>;
@@ -266,7 +266,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
     const float scale = 1.0f / sqrt((float)qD);
     const int Qb = (Sq + kTm - 1) / kTm;
     const int Kb = (Skv + kTk - 1) / kTk;
-    
+
     // 对每个 Q-block (i)
     for (int i = 0; i < Qb; i += MULTI) {
         // 加载当前Q块 (仅一次)
@@ -276,7 +276,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
         auto gQ = gIterQ(i,0);
         TLOAD2_ND2NZ(tQ.Tiles[1], tQ.Tiles[0], gQ);
         #else
-        TCOPYIN(tQ, [&](int t) { return gIterQ(i + t, 0); });
+        TLOAD(tQ, [&](int t) { return gIterQ(i + t, 0); });
         #endif
 
         // 初始化状态: 最大值/指数和/输出累加
@@ -294,7 +294,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         tileK tK;
-        TCOPYIN(tK, gK);
+        TLOAD(tK, gK);
 
         // 计算注意力分数块
         tileW tW;
@@ -330,7 +330,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
         // 计算当前块的加权输出: O_j = W * V
         auto gV = gIterV(j, 0);
         tileV tV;
-        TCOPYIN(tV, gV);
+        TLOAD(tV, gV);
         MATMUL(tO, tW_left, tV);
         // 更新最大值状态
         tMax = tNewMax;
@@ -356,7 +356,7 @@ void flash_attention_multitile_opt3(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
         auto gO = gIterO(i, 0);
         TSTORE2_DN2DN(gO, tO_cast.Tiles[1], tO_cast.Tiles[0]);
         #else
-        TCOPYOUT([&](int t) { return gIterO(i + t, 0); }, tO_cast);
+        TSTORE([&](int t) { return gIterO(i + t, 0); }, tO_cast);
         #endif
     }
 }
\ No newline at end of file
diff --git a/test/accelerator/include/accelerator_fa_opt4.h b/include/benchmark_support/npu/npu_fa_opt4.h
similarity index 97%
rename from test/accelerator/include/accelerator_fa_opt4.h
rename to include/benchmark_support/npu/npu_fa_opt4.h
index 5081a94..b108619 100644
--- a/test/accelerator/include/accelerator_fa_opt4.h
+++ b/include/benchmark_support/npu/npu_fa_opt4.h
@@ -25,7 +25,7 @@ void __vec__ flashsoftmax_opt4_with_scale(
     typename tileSum::DType old_sum_val = old_sum_ptr[sum_idx];
     typename tileSum::DType upd_sum = old_sum_val;
     typename tileMax::DType upd_max = old_max_val;
-    
+
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidCol;j+=4){
         size_t src_idx_0 =  i * tileSrc::RowStride + j * tileSrc::ColStride;
@@ -89,7 +89,7 @@ void __vec__ flashsoftmax_opt4(
     typename tileSum::DType old_sum_val = old_sum_ptr[sum_idx];
     typename tileSum::DType upd_sum = old_sum_val;
     typename tileMax::DType upd_max = old_max_val;
-    
+
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidCol;j+=4){
         size_t src_idx_0 =  i * tileSrc::RowStride + j * tileSrc::ColStride;
@@ -170,7 +170,7 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     using tileK      = TileRight<dtype, (qD==192? 256:qD), kTk, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor>; // [kTm×vD]
@@ -195,19 +195,19 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
     const float scale = 1.0f / sqrt((float)qD);
     const int Qb = (Sq + kTm - 1) / kTm;
     const int Kb = (Skv + kTk - 1) / kTk;
-    
+
     // 对每个 Q-block (i)
     for (int i = 0; i < Qb; ++i) {
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i,0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
         TEXPANDSCALAR(tMax, -1e30f);  // 初始化为极小值
         tileSum tSum(0);              // 指数和归零
-        tileO_out tPV_out;            
+        tileO_out tPV_out;
         tileO tPV;
         tileO tO(0);                  // 输出累加归零
 
@@ -217,8 +217,8 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -226,7 +226,7 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
 
         // Nz -> ColMajor
         tileW tW;
-        #ifdef TEMPLATE 
+        #ifdef TEMPLATE
         ACCSCALE_NZ2DN(tW, tW_out, scale);
 
         tileMax tNewMax;
@@ -270,6 +270,6 @@ void flash_attention_opt4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_p
         // TCAST(tO_cast, tO);
         normalize_opt4<tileO_cast, tileO, tileSum><<<tileO::ValidRow, tileO::ValidCol, 1>>>(tO_cast.data(), tO.data(), tSum.data());
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
\ No newline at end of file
diff --git a/test/accelerator/include/accelerator_fa_template_2d_unroll.h b/include/benchmark_support/npu/npu_fa_template_2d_unroll.h
similarity index 98%
rename from test/accelerator/include/accelerator_fa_template_2d_unroll.h
rename to include/benchmark_support/npu/npu_fa_template_2d_unroll.h
index fe4a781..2b93d6d 100644
--- a/test/accelerator/include/accelerator_fa_template_2d_unroll.h
+++ b/include/benchmark_support/npu/npu_fa_template_2d_unroll.h
@@ -240,9 +240,9 @@ void __vec__ new_max_4src_template(
 
     typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23);
     upd_max = blkv_max(upd_max, local_max_0123);
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 
 template<typename tileSrc, typename tileSrc_cast, typename tileMax>
@@ -313,7 +313,7 @@ void __vec__ src_exp_2src_with_local_sum_template(
         BLKC_ASSIGN_CAST(src_exp1, idx_2, src1_exp2);
         BLKC_ASSIGN_CAST(src_exp1, idx_3, src1_exp3);
         typename tileSum::DType src1_exp_sum = src1_exp0 + src1_exp1 + src1_exp2 + src1_exp3;
-        
+
         upd_sum += src0_exp_sum + src1_exp_sum;
     }
     size_t idx_sum = i * tileSum::RowStride;
@@ -356,7 +356,7 @@ void __vec__ new_sum_4src_template(
         typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3];
         typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1;
         typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3;
-        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;  
+        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;
 
         typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0];
         typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1];
@@ -364,7 +364,7 @@ void __vec__ new_sum_4src_template(
         typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3];
         typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1;
         typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3;
-        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;  
+        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;
 
         typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0];
         typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1];
@@ -372,7 +372,7 @@ void __vec__ new_sum_4src_template(
         typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3];
         typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1;
         typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3;
-        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;  
+        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;
 
         typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0];
         typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1];
@@ -415,7 +415,7 @@ void __vec__ local_max_4src_template(
     typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23);
 
     upd_max = blkv_max(upd_max, local_max_0123);
-    local_max_ptr[max_idx] = upd_max;  
+    local_max_ptr[max_idx] = upd_max;
 }
 
 template<typename tileSrc, typename tileSum>
@@ -451,7 +451,7 @@ void __vec__ local_sum_4src_template(
         typename tileSrc::DType s0_exp_src_3 = src0_ptr[src_idx_3];
         typename tileSrc::DType s0_exp_src_01 = s0_exp_src_0 + s0_exp_src_1;
         typename tileSrc::DType s0_exp_src_23 = s0_exp_src_2 + s0_exp_src_3;
-        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;  
+        typename tileSrc::DType s0_exp_src_0123 = s0_exp_src_01 + s0_exp_src_23;
 
         typename tileSrc::DType s1_exp_src_0 = src1_ptr[src_idx_0];
         typename tileSrc::DType s1_exp_src_1 = src1_ptr[src_idx_1];
@@ -459,7 +459,7 @@ void __vec__ local_sum_4src_template(
         typename tileSrc::DType s1_exp_src_3 = src1_ptr[src_idx_3];
         typename tileSrc::DType s1_exp_src_01 = s1_exp_src_0 + s1_exp_src_1;
         typename tileSrc::DType s1_exp_src_23 = s1_exp_src_2 + s1_exp_src_3;
-        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;  
+        typename tileSrc::DType s1_exp_src_0123 = s1_exp_src_01 + s1_exp_src_23;
 
         typename tileSrc::DType s2_exp_src_0 = src2_ptr[src_idx_0];
         typename tileSrc::DType s2_exp_src_1 = src2_ptr[src_idx_1];
@@ -467,7 +467,7 @@ void __vec__ local_sum_4src_template(
         typename tileSrc::DType s2_exp_src_3 = src2_ptr[src_idx_3];
         typename tileSrc::DType s2_exp_src_01 = s2_exp_src_0 + s2_exp_src_1;
         typename tileSrc::DType s2_exp_src_23 = s2_exp_src_2 + s2_exp_src_3;
-        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;  
+        typename tileSrc::DType s2_exp_src_0123 = s2_exp_src_01 + s2_exp_src_23;
 
         typename tileSrc::DType s3_exp_src_0 = src3_ptr[src_idx_0];
         typename tileSrc::DType s3_exp_src_1 = src3_ptr[src_idx_1];
@@ -505,7 +505,7 @@ void __vec__ new_max_of_2_loc_max_template(
     typename tileMax::DType local_max_01 = blkv_max(local_max_0_ptr[max_idx], local_max_1_ptr[max_idx]);
     upd_max = blkv_max(upd_max, local_max_01);
     new_max_ptr[max_idx] = upd_max;
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 template<typename tileScale, typename tileSum>
 void __vec__ new_sum_of_2_loc_sum_template(
@@ -560,7 +560,7 @@ void __vec__ new_max_of_4_loc_max_template(
     typename tileMax::DType local_max_0123 = blkv_max(local_max_01, local_max_23);
     upd_max = blkv_max(upd_max, local_max_0123);
     new_max_ptr[max_idx] = upd_max;
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 template<typename tileScale, typename tileSum>
 void __vec__ new_sum_of_4_loc_sum_template(
@@ -585,7 +585,7 @@ void __vec__ new_sum_of_4_loc_sum_template(
 
     size_t sum_idx = i*tileSum::RowStride;
 
-    new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] + 
+    new_sum_ptr[sum_idx] = old_sum_ptr[sum_idx] * scale_ptr[sum_idx] +
                            local_sum_0_ptr[sum_idx] + local_sum_1_ptr[sum_idx] +
                            local_sum_2_ptr[sum_idx] + local_sum_3_ptr[sum_idx];
 }
@@ -639,7 +639,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
 
         tileQ tQ[Xdim];
 
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -649,7 +649,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
-                TCOPYIN(tQ[x], gQ);
+                TLOAD(tQ[x], gQ);
             }
         #endif
 
@@ -684,7 +684,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
                 #pragma clang loop unroll(full)
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                 }
             #endif
 
@@ -749,8 +749,8 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     new_max_4src_template<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tLocalMax[x][0].data(), tLocalMax[x][1].data(), tLocalMax[x][2].data(), tLocalMax[x][3].data(),
                                                                 tMax[x].data());
                     // src_exp_4src_template<tileW, tileMax><<<tileW::ValidRow, tileW::ValidCol, 1>>>(
@@ -760,7 +760,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
                     src_exp_2src_with_local_sum_template<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data());
                     src_exp_2src_with_local_sum_template<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
-                                                                                                   tW[x][2].data(), tW[x][3].data(), tNewMax[x].data());                    
+                                                                                                   tW[x][2].data(), tW[x][3].data(), tNewMax[x].data());
                     // new_sum_4src_template<tileW, tileSum, tileScale><<<tileSum::ValidRow, 1, 1>>>(
                     //                                             tNewSum[x].data(),
                     //                                             tExpW[x][0].data(), tExpW[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -774,7 +774,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
                 tileSum tLocalSum[Xdim][2];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src_template<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax4[x][k].data(), tLocalMax[x][4*k].data(), tLocalMax[x][4*k+1].data(), tLocalMax[x][4*k+2].data(), tLocalMax[x][4*k+3].data());
@@ -800,7 +800,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src_template<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax4[x][k].data(), tLocalMax[x][4*k].data(), tLocalMax[x][4*k+1].data(), tLocalMax[x][4*k+2].data(), tLocalMax[x][4*k+3].data());
                     }
@@ -837,7 +837,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
                 #pragma clang loop unroll(full)
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tV[y], gV);
                 }
             #endif
 
@@ -907,7 +907,7 @@ void flash_attention_template_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_p
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);
+                TSTORE(dstO, tO_cast[x]);
             }
         #endif
 
diff --git a/test/accelerator/include/accelerator_fa_unalign_2d_unroll.h b/include/benchmark_support/npu/npu_fa_unalign_2d_unroll.h
similarity index 97%
rename from test/accelerator/include/accelerator_fa_unalign_2d_unroll.h
rename to include/benchmark_support/npu/npu_fa_unalign_2d_unroll.h
index c0250ca..f451c58 100644
--- a/test/accelerator/include/accelerator_fa_unalign_2d_unroll.h
+++ b/include/benchmark_support/npu/npu_fa_unalign_2d_unroll.h
@@ -41,7 +41,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
     using tileW_cast = Tile<Location::Vec, dtype, kTm, kTk, BLayout::ColMajor>;
-    using tileW_left = TileLeft<dtype, kTm, kTk>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::ColMajor>; // [kTm×vD]
@@ -100,7 +100,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
         #pragma clang loop unroll(full)
         for(int x=0;x<Xdim;x++){
             auto gQ = gIterQ(i+x,0);
-            TCOPYIN(tQ[x], gQ);
+            TLOAD(tQ[x], gQ);
         }
 
         tileMax tMax[Xdim];
@@ -127,7 +127,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
             #pragma clang loop unroll(full)
             for(int y=0;y<Ydim;y++){
                 auto gK = gIterK(0, j+y);
-                TCOPYIN(tK[y], gK);
+                TLOAD(tK[y], gK);
             }
 
             tileW tW[Xdim][Ydim];
@@ -189,8 +189,8 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data(),
                                                                 scale);
@@ -216,7 +216,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
             #pragma clang loop unroll(full)
             for(int y=0;y<Ydim;y++){
                 auto gV = gIterV(j+y, 0);
-                TCOPYIN(tV[y], gV);
+                TLOAD(tV[y], gV);
             }
 
             // ColMajor -> Nz
@@ -261,7 +261,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
 
             tileK_tcols tK;
             auto gK = gIterK(0, Kb);
-            TCOPYIN(tK, gK);
+            TLOAD(tK, gK);
 
             tileW_tcols tW[Xdim];
             #pragma clang loop unroll(full)
@@ -291,7 +291,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
 
             tileV_trows tV;
             auto gV = gIterV(Kb, 0);
-            TCOPYIN(tV, gV);
+            TLOAD(tV, gV);
 
             // Compute the weighted output of the current block: O_j = W * V
             tileW_left_tcols tW_left[Xdim];
@@ -325,7 +325,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
         #pragma clang loop unroll(full)
         for (int x = 0; x < Xdim; ++x) {
             auto dstO = gIterO(i+x, 0);
-            TCOPYOUT(dstO, tO_cast[x]);
+            TSTORE(dstO, tO_cast[x]);
         }
     }
 
@@ -334,7 +334,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
         tileQ_trows tQ;
 
         auto gQ = gIterQ(Qb,0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         tileMax_trows tMax;
         TEXPANDSCALAR(tMax, -1e30f);
@@ -354,7 +354,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
             #pragma clang loop unroll(full)
             for(int y=0;y<Ydim;y++){
                 auto gK = gIterK(0, j+y);
-                TCOPYIN(tK[y], gK);
+                TLOAD(tK[y], gK);
             }
 
             tileW_trows tW[Ydim];
@@ -405,8 +405,8 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
                                                             );
             #elif Ydim == 4
                 new_max_4src<tileW_trows, tileMax_trows><<<tileMax_trows::ValidRow, 1, 1>>>(
-                                                            tScale.data(), 
-                                                            tNewMax.data(), 
+                                                            tScale.data(),
+                                                            tNewMax.data(),
                                                             tW[0].data(), tW[1].data(), tW[2].data(), tW[3].data(),
                                                             tMax.data(),
                                                             scale);
@@ -432,7 +432,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
             #pragma clang loop unroll(full)
             for(int y=0;y<Ydim;y++){
                 auto gV = gIterV(j+y, 0);
-                TCOPYIN(tV[y], gV);
+                TLOAD(tV[y], gV);
             }
 
             tileW_left_trows tW_left[Ydim];
@@ -462,7 +462,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
 
             tileK_tcols tK;
             auto gK = gIterK(0, Kb);
-            TCOPYIN(tK, gK);
+            TLOAD(tK, gK);
 
             tileW_tcorner tW;
             tileW_out_tcorner tW_out;
@@ -486,7 +486,7 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
 
             tileV_trows tV;
             auto gV = gIterV(Kb, 0);
-            TCOPYIN(tV, gV);
+            TLOAD(tV, gV);
 
             tileW_left_tcorner tW_left;
             TCVT_DN2NZ(tW_left, tExpW);
@@ -504,6 +504,6 @@ void flash_attention_unalign_2d_unroll(dtype* out_ptr, dtype* q_ptr, dtype* k_pt
         normalize_no_last_update<tileO_cast_trows, tileO_trows, tileSum_trows><<<tileO_trows::ValidRow, tileO_trows::ValidCol, 1>>>(tO_cast.data(), tO.data(), tSum.data());
 
         auto dstO = gIterO(Qb, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
\ No newline at end of file
diff --git a/test/accelerator/include/accelerator_fusion.h b/include/benchmark_support/npu/npu_fusion.h
similarity index 97%
rename from test/accelerator/include/accelerator_fusion.h
rename to include/benchmark_support/npu/npu_fusion.h
index 00e816c..12d4a33 100644
--- a/test/accelerator/include/accelerator_fusion.h
+++ b/include/benchmark_support/npu/npu_fusion.h
@@ -70,9 +70,9 @@ void __vec__ flashsoftmax_new_max(
     #ifndef RES_CHECK
     upd_max = upd_max * src_scale;
     #endif
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 
-    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max); 
+    scale_ptr[max_idx] =  blkv_fexp(old_max_val - upd_max);
 }
 
 template<typename tileMax, typename tileScale>
@@ -147,7 +147,7 @@ void __vec__ flashsoftmax_new_sum(
         typename tileSrc::DType exp_src_3 = src_ptr[src_idx_3];
         typename tileSrc::DType exp_src_01 = exp_src_0 + exp_src_1;
         typename tileSrc::DType exp_src_23 = exp_src_2 + exp_src_3;
-        typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23;        
+        typename tileSrc::DType exp_src_0123 = exp_src_01 + exp_src_23;
         upd_sum += exp_src_0123;
     }
     blkv_get_tile_ptr(new_sum)[sum_idx] = upd_sum;
@@ -441,19 +441,19 @@ void __vec__ normalize(
     blkv_get_tile_ptr(out_cast)[idx] = static_cast<typename tileO_cast::DType>( (blkv_get_tile_ptr(rescale_out)[idx] + blkv_get_tile_ptr(out)[idx]) /  blkv_get_tile_ptr(sum)[idx_sum] );
 }
 
-#include "accelerator_fa_fp4.h"
-#include "accelerator_fa_opt1.h"
-#include "accelerator_fa_opt2.h"
-#include "accelerator_fa_opt3.h"
-#include "accelerator_fa_opt4.h"
-#include "accelerator_fa_dcore.h"
-#include "accelerator_fa_2d_unroll.h"
+#include "npu_fa_fp4.h"
+#include "npu_fa_opt1.h"
+#include "npu_fa_opt2.h"
+#include "npu_fa_opt3.h"
+#include "npu_fa_opt4.h"
+#include "npu_fa_dcore.h"
+#include "npu_fa_2d_unroll.h"
 #ifdef _2D_UNROLL_PTO
-#include "accelerator_fa_2d_unroll_pto.h"
+#include "npu_fa_2d_unroll_pto.h"
 #endif
-#include "accelerator_fa_template_2d_unroll.h"
-#include "accelerator_fa_dynamic.h"
-#include "accelerator_fa_manual.h"
+#include "npu_fa_template_2d_unroll.h"
+#include "npu_fa_dynamic.h"
+#include "npu_fa_manual.h"
 
 template <int S, int D, int tM, int tK>
 void flashsoftmax(float *input, float *max, float *sum, float *input_scale, uint16_t *bitmask_gm, __half *output) {
@@ -467,11 +467,11 @@ void flashsoftmax(float *input, float *max, float *sum, float *input_scale, uint
     using tileMax = Tile<Location::Vec, float, tM, 16, BLayout::RowMajor, tM, 1>;
     using tileSum = Tile<Location::Vec, float, tM, 16, BLayout::RowMajor, tM, 1>;
     using tileScale = Tile<Location::Vec, float, tM, 16, BLayout::RowMajor, tM, 1>;
-    
+
     using tileO = Tile<Location::Vec, float, tM, D, BLayout::RowMajor>;
     using tileO_cast = Tile<Location::Vec, __half, tM, D, BLayout::RowMajor>;
 
-    const int Bm = S/tM; 
+    const int Bm = S/tM;
     const int Bk = S/tK;
 
     for(int i=0;i<Bm;i++){
@@ -481,10 +481,10 @@ void flashsoftmax(float *input, float *max, float *sum, float *input_scale, uint
 
         #pragma clang loop unroll(full)
         for(int j=0;j<Bk;j++){
-            uint32_t offset = i * tM * S + j * tK;  
+            uint32_t offset = i * tM * S + j * tK;
             gmIn gIn(input+offset);
             tileIn tIn;
-            TCOPYIN(tIn, gIn);
+            TLOAD(tIn, gIn);
 
             tileMax tNewMax;
             tileSum tNewSum;
@@ -501,15 +501,15 @@ void flashsoftmax(float *input, float *max, float *sum, float *input_scale, uint
 
         uint32_t offset = i * tM;
         gmMax gMax(max+offset);
-        TCOPYOUT(gMax, tMax);
+        TSTORE(gMax, tMax);
 
         gmSum gSum(sum+offset);
-        TCOPYOUT(gSum, tSum);
+        TSTORE(gSum, tSum);
 
         offset = i * tM * D;
         gmOut gOut(output + offset);
         tileO_cast tO_cast;
         TCAST(tO_cast, tO);
-        TCOPYOUT(gOut, tO_cast);
+        TSTORE(gOut, tO_cast);
     }
 }
diff --git a/test/accelerator/include/accelerator_transpose.h b/include/benchmark_support/npu/npu_transpose.h
similarity index 100%
rename from test/accelerator/include/accelerator_transpose.h
rename to include/benchmark_support/npu/npu_transpose.h
diff --git a/test/accelerator/include/accelerator_vec_simd.h b/include/benchmark_support/npu/npu_vec_simd.h
similarity index 92%
rename from test/accelerator/include/accelerator_vec_simd.h
rename to include/benchmark_support/npu/npu_vec_simd.h
index fd650db..63438b6 100644
--- a/test/accelerator/include/accelerator_vec_simd.h
+++ b/include/benchmark_support/npu/npu_vec_simd.h
@@ -19,10 +19,10 @@ void matadd(dtype *c_ptr, dtype *a_ptr, dtype *b_ptr) {
       auto gC = gCIter(i, j);
 
       tile_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
       TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
+      TSTORE(gC, tC);
     }
   }
 }
@@ -41,19 +41,19 @@ void __vec__ rmsnorm_kernel(
     __vbuf__ typename tile_shape::DType *out_ptr = blkv_get_tile_ptr(out);
 
     __half sum;
-    
+
     __half data = static_cast<__half>(in_ptr[j]);
-    asm volatile("l.rdfadd %1.fh, ->%0.h" 
+    asm volatile("l.rdfadd %1.fh, ->%0.h"
             :"=r"(sum)
             :"vr"(data)
             );
-    
+
     #pragma clang loop unroll(full)
     for(int i=1;i<iter;i++){
         size_t idx = i * col + j;
         data = static_cast<__half>(in_ptr[idx]) * static_cast<__half>(in_ptr[idx]);
         typename tile_shape::DType local_sum;
-        asm volatile("l.rdfadd %1.fh, ->%0.h" 
+        asm volatile("l.rdfadd %1.fh, ->%0.h"
                     :"=r"(local_sum)
                     :"vr"(data)
                     );
@@ -61,7 +61,7 @@ void __vec__ rmsnorm_kernel(
     }
 
     sum = blkv_fsqrt( (sum / static_cast<__half>(tile_shape::ValidCol)) );
-      
+
     #pragma clang loop unroll(full)
     for(int i=0;i<iter;i++){
         size_t idx = i * col + j;
@@ -77,14 +77,14 @@ void rmsnorm_oneline(dtype *dst, dtype *src){
 
     gm_shape gsrc(src);
     tile_shape tsrc;
-    TCOPYIN(tsrc, gsrc);
+    TLOAD(tsrc, gsrc);
 
     const int iter = tile_shape::ValidCol / LaneNum;
     tile_shape tdst;
     rmsnorm_kernel<tile_shape><<<LaneNum, 1, 1>>>(tdst.data(), tsrc.data(), LaneNum, iter);
 
     gm_shape gdst(dst);
-    TCOPYOUT(gdst, tdst);
+    TSTORE(gdst, tdst);
 }
 
 // x / (Ex^2) ^ .5
@@ -92,9 +92,9 @@ template<typename dtype, const int kM, const int kN, const int kTM, const int kT
 void rmsnorm(dtype *dst, dtype *src){
     using gm_shape = global_tensor<dtype, RowMajor<kM, kN>>;
     using tile_shape = Tile<Location::Vec, dtype, kTM, kTN, BLayout::RowMajor>;
- 
+
     using tSum = Tile<Location::Vec, dtype, kTM, 256, BLayout::RowMajor, kTM, 1>;
- 
+
     using gIter = global_iterator<gm_shape, tile_shape>;
 
     gIter giter_src(src);
@@ -111,19 +111,19 @@ void rmsnorm(dtype *dst, dtype *src){
         {
             auto gsrc = giter_src(i, j);
             tile_shape tsrc;
- 
-            TCOPYIN(tsrc, gsrc);
+
+            TLOAD(tsrc, gsrc);
 
             tSum tLocalSum;
             TMUL(tsrc, tsrc, tsrc);
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSquareSum, tAccSquareSum, tLocalSum);
         }
- 
+
         tSum gSqureMean;
         TDIVS(gSqureMean, tAccSquareSum, kN);
         TSQRT(gSqureMean, gSqureMean);
- 
+
         tile_shape gSqureMean_i;
         TEXPANDCOL(gSqureMean_i, gSqureMean);
 
@@ -131,12 +131,12 @@ void rmsnorm(dtype *dst, dtype *src){
         {
             auto  gsrc = giter_src(i,j);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
- 
+            TLOAD(tsrc, gsrc);
+
             TDIV(tsrc, tsrc, gSqureMean_i);
- 
+
             auto gdst = giter_dst(i,j);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
@@ -161,15 +161,15 @@ void __vec__ layernorm_kernel(
 
     __half data = static_cast<__half>(in_ptr[j]);
     __half data_square = data * data;
-    asm volatile("l.rdfadd %1.fh, ->%0.h" 
+    asm volatile("l.rdfadd %1.fh, ->%0.h"
             :"=r"(sum)
             :"vr"(data)
             );
-    asm volatile("l.rdfadd %1.fh, ->%0.h" 
+    asm volatile("l.rdfadd %1.fh, ->%0.h"
         :"=r"(square_sum)
         :"vr"(data_square)
         );
-    
+
     #pragma clang loop unroll(full)
     for(int i=1;i<iter;i++){
         size_t idx = i * col + j;
@@ -177,11 +177,11 @@ void __vec__ layernorm_kernel(
         data_square = data * data;
         typename tile_shape::DType local_sum;
         typename tile_shape::DType local_square_sum;
-        asm volatile("l.rdfadd %1.fh, ->%0.h" 
+        asm volatile("l.rdfadd %1.fh, ->%0.h"
                     :"=r"(local_sum)
                     :"vr"(data)
                     );
-        asm volatile("l.rdfadd %1.fh, ->%0.h" 
+        asm volatile("l.rdfadd %1.fh, ->%0.h"
                     :"=r"(local_square_sum)
                     :"vr"(data)
                     );
@@ -191,7 +191,7 @@ void __vec__ layernorm_kernel(
 
     sum = (sum / static_cast<__half>(tile_shape::ValidCol));
     square_sum = (square_sum / static_cast<__half>(tile_shape::ValidCol));
-      
+
     #pragma clang loop unroll(full)
     for(int i=0;i<iter;i++){
         size_t idx = i * col + j;
@@ -207,14 +207,14 @@ void layernorm_oneline(dtype *dst, dtype *src){
 
     gm_shape gsrc(src);
     tile_shape tsrc;
-    TCOPYIN(tsrc, gsrc);
+    TLOAD(tsrc, gsrc);
 
     const int iter = tile_shape::ValidCol / LaneNum;
     tile_shape tdst;
     layernorm_kernel<tile_shape><<<LaneNum, 1, 1>>>(tdst.data(), tsrc.data(), LaneNum, iter);
 
     gm_shape gdst(dst);
-    TCOPYOUT(gdst, tdst);
+    TSTORE(gdst, tdst);
 }
 
 
@@ -225,9 +225,9 @@ void layernorm(dtype *dst, dtype *src, float *gamma, float *beta)
     using gm_shape = global_tensor<dtype, RowMajor<kM, kN>>;
 
     using tile_shape = Tile<Location::Vec, dtype, kTM, kTN, BLayout::RowMajor>;
- 
+
     using tSum = Tile<Location::Vec, dtype, kTM, kTN, BLayout::RowMajor, kTM, 1>;
- 
+
     using gIter = global_iterator<gm_shape, tile_shape>;
 
     gIter giter_src(src);
@@ -240,23 +240,23 @@ void layernorm(dtype *dst, dtype *src, float *gamma, float *beta)
     {
         tSum tAccSum(0);        // tiling sum
         tSum tAccSquareSum(0);  // tiling square sum
-  
+
         for(int j=0;j<Nb;j++)
         {
             auto gsrc = giter_src(i, j);
             tile_shape tsrc;
- 
-            TCOPYIN(tsrc, gsrc);
- 
+
+            TLOAD(tsrc, gsrc);
+
             tSum tLocalSum;
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSum, tAccSum, tLocalSum);
- 
+
             TMUL(tsrc, tsrc, tsrc);
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSquareSum, tAccSquareSum, tLocalSum);
         }
- 
+
         tSum gMean;        // Ex
         tSum gMeanSquare;  // (Ex)^2
         tSum gStdDev;      // Ex^2
@@ -265,7 +265,7 @@ void layernorm(dtype *dst, dtype *src, float *gamma, float *beta)
         TDIVS(gStdDev, tAccSquareSum, kN);
         TSUB(gStdDev, gStdDev, gMeanSquare);
         TSQRT(gStdDev, gStdDev);
- 
+
         tile_shape gMean_i;
         tile_shape gStdDev_i;
         TEXPANDCOL(gMean_i, gMean);
@@ -275,16 +275,16 @@ void layernorm(dtype *dst, dtype *src, float *gamma, float *beta)
         {
             auto  gsrc = giter_src(i,j);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
- 
+            TLOAD(tsrc, gsrc);
+
             TSUB(tsrc, tsrc, gMean_i);    // (x - Ex)
             TDIV(tsrc, tsrc, gStdDev_i);  // (x - Ex) / (Ex^2 - (Ex)^2)^.5
 
             TMULS(tsrc, tsrc, static_cast<dtype>(*gamma));
             TADDS(tsrc, tsrc, static_cast<dtype>(*beta));
- 
+
             auto gdst = giter_dst(i,j);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
@@ -297,9 +297,9 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta)
     using tile_shape = Tile<Location::Vec, __bf16, kTM, kTN, BLayout::RowMajor>;
 
     using tile_shape_cast = Tile<Location::Vec, __half, kTM, kTN, BLayout::RowMajor>;
- 
+
     using tSum = Tile<Location::Vec, __half, kTM, kTN, BLayout::RowMajor, kTM, 1>;
- 
+
     using gIter = global_iterator<gm_shape, tile_shape>;
 
     gIter giter_src(src);
@@ -312,24 +312,24 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta)
     {
         tSum tAccSum(0);        // tiling sum
         tSum tAccSquareSum(0);  // tiling square sum
-  
+
         for(int j=0;j<Nb;j++)
         {
             auto gsrc = giter_src(i, j);
             tile_shape tsrc_ori;
             tile_shape_cast tsrc;
-            TCOPYIN(tsrc_ori, gsrc);
+            TLOAD(tsrc_ori, gsrc);
             TCAST(tsrc, tsrc_ori);
- 
+
             tSum tLocalSum;
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSum, tAccSum, tLocalSum);
- 
+
             TMUL(tsrc, tsrc, tsrc);
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSquareSum, tAccSquareSum, tLocalSum);
         }
- 
+
         tSum gMean;        // Ex
         tSum gMeanSquare;  // (Ex)^2
         tSum gStdDev;      // Ex^2
@@ -338,7 +338,7 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta)
         TDIVS(gStdDev, tAccSquareSum, kN);
         TSUB(gStdDev, gStdDev, gMeanSquare);
         TSQRT(gStdDev, gStdDev);
- 
+
         tile_shape_cast gMean_i;
         tile_shape_cast gStdDev_i;
         TEXPANDCOL(gMean_i, gMean);
@@ -349,19 +349,19 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta)
             auto  gsrc = giter_src(i,j);
             tile_shape tsrc_ori;
             tile_shape_cast tsrc;
-            TCOPYIN(tsrc_ori, gsrc);
+            TLOAD(tsrc_ori, gsrc);
             TCAST(tsrc, tsrc_ori);
- 
+
             TSUB(tsrc, tsrc, gMean_i);    // (x - Ex)
             TDIV(tsrc, tsrc, gStdDev_i);  // (x - Ex) / (Ex^2 - (Ex)^2)^.5
 
             TMULS(tsrc, tsrc, static_cast<__half>(*gamma));
             TADDS(tsrc, tsrc, static_cast<__half>(*beta));
- 
+
             auto gdst = giter_dst(i,j);
             tile_shape_cast tdst;
             TCAST(tdst, tsrc);
-            TCOPYOUT(gdst, tdst);
+            TSTORE(gdst, tdst);
         }
     }
 }
@@ -401,14 +401,14 @@ void layernorm_bf16(__bf16 *dst, __bf16 *src, float *gamma, float *beta)
 //                     gm_pic gpic(pic+ n*C*H*W + c*H*W + h*pool.stride*W + w*pool.stride); //pic[n, c, h*pool.stride, w*pool.stride]
 
 //                     tile_filt tpic;
-//                     TCOPYIN(tpic, gpic);
+//                     TLOAD(tpic, gpic);
 //                     TROWMAXEXPAND(tpic, tpic);
 //                     TCOLMAXEXPAND(tpic, tpic);
 //                     TCOPY(tmp, tpic);
 
 //                     int offset = n*C*H_out*W_out + c*H_out*W_out + h*W_out + w;
 //                     gm_out gO(out+offset);
-//                     TCOPYOUT(gO, tpic);
+//                     TSTORE(gO, tpic);
 //                 }
 //             }
 //         }
@@ -430,35 +430,35 @@ void __vec__ softmax_kernel(
 
     __half max;
     __half sum;
-    
+
     __half data = static_cast<__half>(in_ptr[j]);
-    asm volatile("l.rdfmax %1.fh, ->%0.h" 
+    asm volatile("l.rdfmax %1.fh, ->%0.h"
             :"=r"(max)
             :"vr"(data)
             );
 
-    asm volatile("l.rdfadd %1.fh, ->%0.h" 
+    asm volatile("l.rdfadd %1.fh, ->%0.h"
             :"=r"(sum)
             :"vr"(data)
             );
-    
+
     #pragma clang loop unroll(full)
     for(int i=1;i<iter;i++){
         size_t idx = i * col + j;
         data = static_cast<__half>(in_ptr[idx]);
         typename tile_shape::DType local_max;
-        asm volatile("l.rdfmax %1.fh, ->%0.h" 
+        asm volatile("l.rdfmax %1.fh, ->%0.h"
                     :"=r"(local_max)
                     :"vr"(data)
                     );
         max = blkv_max(max, local_max);
 
         typename tile_shape::DType local_sum;
-        asm volatile("l.rdfadd %1.fh, ->%0.h" 
+        asm volatile("l.rdfadd %1.fh, ->%0.h"
                     :"=r"(local_sum)
                     :"vr"(data)
                     );
-        sum += local_sum;       
+        sum += local_sum;
     }
 
     #pragma clang loop unroll(full)
@@ -476,14 +476,14 @@ void softmax_oneline(dtype *dst, dtype *src){
 
     gm_shape gsrc(src);
     tile_shape tsrc;
-    TCOPYIN(tsrc, gsrc);
+    TLOAD(tsrc, gsrc);
 
     const int iter = tile_shape::ValidCol/ LaneNum;
     tile_shape tdst;
     softmax_kernel<tile_shape><<<LaneNum, 1, 1>>>(tdst.data(), tsrc.data(), LaneNum, iter);
 
     gm_shape gdst(dst);
-    TCOPYOUT(gdst, tdst);
+    TSTORE(gdst, tdst);
 }
 
 template<const int kM, const int kN, const int kTM, const int kTN>
@@ -506,7 +506,7 @@ void softmax_bf16(__bf16* dst, __bf16* src){
             gm_shape gsrc(src+offset);
             tile_shape_ori tsrc_ori;
             tile_shape tsrc;
-            TCOPYIN(tsrc_ori, gsrc);
+            TLOAD(tsrc_ori, gsrc);
             TCAST(tsrc, tsrc_ori);
 
             tMax tLocalMax;
@@ -541,7 +541,7 @@ void softmax_bf16(__bf16* dst, __bf16* src){
             gm_shape gsrc(src+offset);
             tile_shape_ori tsrc_ori;
             tile_shape tsrc;
-            TCOPYIN(tsrc_ori, gsrc);
+            TLOAD(tsrc_ori, gsrc);
             TCAST(tsrc, tsrc_ori);
 
             tile_shape gMax;
@@ -557,7 +557,7 @@ void softmax_bf16(__bf16* dst, __bf16* src){
 
             tile_shape_ori tsrc_out;
             TCAST(tsrc_out, tsrc);
-            TCOPYOUT(gdst, tsrc_out);
+            TSTORE(gdst, tsrc_out);
         }
     }
 }
@@ -581,7 +581,7 @@ void softmax(dtype* dst, dtype* src){
             uint32_t offset = i*kTM*kN+j*kTN;
             gm_shape gsrc(src+offset);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
+            TLOAD(tsrc, gsrc);
 
             tMax tLocalMax;
             TROWMAX(tLocalMax, tsrc);
@@ -615,7 +615,7 @@ void softmax(dtype* dst, dtype* src){
             uint32_t offset = i*kTM*kN+j*kTN;
             gm_shape gsrc(src+offset);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
+            TLOAD(tsrc, gsrc);
 
             tile_shape gMax;
             tile_shape gSum;
@@ -627,14 +627,14 @@ void softmax(dtype* dst, dtype* src){
             TDIV(tsrc, tsrc, gSum);
 
             gm_shape gdst(dst+offset);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
 
 template <const int kM, const int kN, const int kK, const int kTM, const int kTN, const int kTK, const bool Relu>
 void gemm(float *c, float *a, float *b, float alpha, float beta)
-{ 
+{
     using gm_shapeA = global_tensor<float, RowMajor<kM, kK>>;
     using gm_shapeB = global_tensor<float, RowMajor<kK, kN>>;
     using gm_shapeC = global_tensor<float, RowMajor<kM, kN>>;
@@ -681,8 +681,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA tA;
             tile_shapeB tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
 
@@ -692,20 +692,20 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA_trows tA;
             tile_shapeB_tcols tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeACC oldC;
-        TCOPYIN(oldC, gC);
+        TLOAD(oldC, gC);
         TMULS(tACC, tACC, alpha);
         TMULS(oldC, oldC, beta);
         TADD(tACC, tACC, oldC);
         if constexpr(Relu){
             TMAXS(tACC, tACC, 0);
         }
-        TCOPYOUT(gC, tACC);
+        TSTORE(gC, tACC);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(i, Nb);
@@ -718,8 +718,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA tA;
             tile_shapeB_trows tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -728,20 +728,20 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA_trows tA;
             tile_shapeB_tcorner tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_trows oldC;
-        TCOPYIN(oldC, gC);
+        TLOAD(oldC, gC);
         TMULS(tACC, tACC, alpha);
         TMULS(oldC, oldC, beta);
         TADD(tACC, tACC, oldC);
         if constexpr(Relu){
             TMAXS(tACC, tACC, 0);
         }
-        TCOPYOUT(gC, tACC);
+        TSTORE(gC, tACC);
         }
     }
     if constexpr (rmd_M) {
@@ -756,8 +756,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA_tcols tA;
             tile_shapeB tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -766,20 +766,20 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA_tcorner tA;
             tile_shapeB_tcols tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_tcols oldC;
-        TCOPYIN(oldC, gC);
+        TLOAD(oldC, gC);
         TMULS(tACC, tACC, alpha);
         TMULS(oldC, oldC, beta);
         TADD(tACC, tACC, oldC);
         if constexpr(Relu){
             TMAXS(tACC, tACC, 0);
         }
-        TCOPYOUT(gC, tACC);
+        TSTORE(gC, tACC);
         }
         if constexpr (rmd_N) {
         auto gC = gCIter(Mb, Nb);
@@ -792,8 +792,8 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA_tcols tA;
             tile_shapeB_trows tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -802,22 +802,22 @@ void gemm(float *c, float *a, float *b, float alpha, float beta)
 
             tile_shapeA_tcorner tA;
             tile_shapeB_tcorner tB;
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             MATMACC(tACC, tA, tB);
         }
 
         tile_shapeC_tcorner oldC;
-        TCOPYIN(oldC, gC);
+        TLOAD(oldC, gC);
         TMULS(tACC, tACC, alpha);
         TMULS(oldC, oldC, beta);
         TADD(tACC, tACC, oldC);
         if constexpr(Relu){
             TMAXS(tACC, tACC, 0);
         }
-        TCOPYOUT(gC, tACC);
+        TSTORE(gC, tACC);
         }
-    }    
+    }
 }
 
 template<typename dtype, int M, int N, int tM, int tN>
@@ -826,7 +826,7 @@ void gelu(dtype *out, dtype* in){
 }
 
 // w1(in) * silu(w2(in))
-//silu : x / (1 + e^-x) 
+//silu : x / (1 + e^-x)
 template <typename dtype, const int S, const int InDim, const int OutDim, const int tS, const int tInDim, const int tOutDim>
 void swiglu(dtype *out, dtype *in, dtype *w1, dtype *w2){
     using gmIn = global_tensor<dtype, RowMajor<S, InDim>>;
@@ -849,7 +849,7 @@ void swiglu(dtype *out, dtype *in, dtype *w1, dtype *w2){
     const int Sb = S / tS;
     const int Inb = InDim / tInDim;
     const int Outb = OutDim / tOutDim;
-    
+
     for(int i=0;i<Sb;i++){
         for(int j=0;j<Outb;j++){
 
@@ -863,9 +863,9 @@ void swiglu(dtype *out, dtype *in, dtype *w1, dtype *w2){
                 tileIn tIn;
                 tileW tW1;
                 tileW tW2;
-                TCOPYIN(tIn, gIn);
-                TCOPYIN(tW1, gW1);
-                TCOPYIN(tW2, gW2);
+                TLOAD(tIn, gIn);
+                TLOAD(tW1, gW1);
+                TLOAD(tW2, gW2);
 
                 MATMACC(tACC_W1, tIn, tW1);
                 MATMACC(tACC_W2, tIn, tW2);
@@ -880,7 +880,7 @@ void swiglu(dtype *out, dtype *in, dtype *w1, dtype *w2){
             tileACC tOut;
             TMUL(tOut, tACC_W1, tACC_W2);
             auto gOut = gIterOut(i,j);
-            TCOPYOUT(gOut, tOut);
+            TSTORE(gOut, tOut);
         }
     }
 }
@@ -904,13 +904,13 @@ void rope(__bf16 *out, __bf16 *x, __bf16 *freqs_cis){
             gm_shape input(x+offset);
             tile_shape tin_ori;
             tile_shape_rope resh_tin;
-            TCOPYIN(tin_ori, input);   // 64*32
+            TLOAD(tin_ori, input);   // 64*32
             tile_shape_cast tin;
             TCAST(tin, tin_ori);
             TRESHAPE(resh_tin, tin); // 64*32 -> 1024*2
 
             tile_shape_half tin_real;
-            tile_shape_half tin_imag; 
+            tile_shape_half tin_imag;
             TEXTRACT(tin_real, resh_tin, 0, 0);        // real 1024*1
             TEXTRACT(tin_imag, resh_tin, 0, 1);        // image 1024*1
 
@@ -918,7 +918,7 @@ void rope(__bf16 *out, __bf16 *x, __bf16 *freqs_cis){
             gm_shape freqs(freqs_cis+offset);
             tile_shape tfreqs_ori;
             tile_shape_rope tfreqs_resh;
-            TCOPYIN(tfreqs_ori, freqs);
+            TLOAD(tfreqs_ori, freqs);
             tile_shape_cast tfreqs;
             TCAST(tfreqs, tfreqs_ori);
             TRESHAPE(tfreqs_resh, tfreqs);
@@ -958,7 +958,7 @@ void rope(__bf16 *out, __bf16 *x, __bf16 *freqs_cis){
 
             tile_shape_cast tout_resh_cast;
             TCAST(tout_resh_cast, tout_resh);
-            TCOPYOUT(output, tout_resh_cast);
+            TSTORE(output, tout_resh_cast);
         }
     }
 }
@@ -985,16 +985,16 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp(
         "v.lw   [ta, vn#1.reuse.uh<<2],     ->vt.w\n"       // src[index_part+col/2] = partner_idx
         "v.lw   [ta, vm#2.reuse.uh<<2],     ->vt.w\n"       // src[index] = cur_value
         "v.lw   [ta, vm#1.reuse.uh<<2],     ->vt.w\n"       // src[index_part] = partner_value
-        "v.sw  vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n"           // dst[tid] = src[tid]   // copy first 
+        "v.sw  vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n"           // dst[tid] = src[tid]   // copy first
         "v.sw  vt#1.reuse.sw, [to, vm#1.reuse.uh<<2]\n"           // dst[partner] = src[partner] // copy first
-        "v.sw  vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n"           // dst[tid+col/2] = src[tid+col/2]   // copy first 
+        "v.sw  vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n"           // dst[tid+col/2] = src[tid+col/2]   // copy first
         "v.sw  vt#3.reuse.sw, [to, vn#1.reuse.uh<<2]\n"           // dst[partner+col/2] = src[partner+col/2] // copy first
         "v.cmp.lt lc0.uh, vu#1.reuse.uh, ->vn.b\n"          // tid < partner
         "v.and  vu#1.reuse.uh, ri0.uh, ->vn.h\n"            // partner & stage
         "v.cmp.eqi vn#1.reuse.uh, 0, ->vn.b\n"              // partner & stage == 0
         "v.cmp.lt vt#2.reuse.sw, vt#1.reuse.sw, ->vn.b\n"         // cur_value < partner_value
         "v.and vn#4.reuse.ub, vn#2.reuse.ub, ->vu.b\n"            // (tid < partner) & (partner & stage) == 0
-        "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n"             // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value) 
+        "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n"             // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value)
         "v.cmp.eqi vu#1.ub, 1, ->vm.b\n"              // sort_descend
         ""
         "v.cmp.eqi vn#3.uh, 1, ->vn.b\n"                // partner & stage == 1
@@ -1015,7 +1015,7 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp(
         "v.sw  vt#3.sw, [to, vn#2.uh<<2]\n"           // dst[tid+col/2] = src[partner]
         "v.sw  vt#4.sw, [to, vn#1.uh<<2]\n"           // dst[partner+col/2] = src[tid]
         "l.addi t#1.ud, 0, ->p\n"                     // resave p from 1st branch
-        ""                                            // merge 2nd branch two result 
+        ""                                            // merge 2nd branch two result
         "c.bstop\n"
         :
         :"i"(tile_shape::ValidCol)
@@ -1043,7 +1043,7 @@ void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) {
 
     using tile_shape_sort = Tile<Location::Vec, dtype, tile_shape::Rows, 2*tile_shape::Cols, BLayout::RowMajor>;
     tile_shape_sort dst_sort;
-    tile_shape_sort src_sort; 
+    tile_shape_sort src_sort;
 
     TRANGE_RowMajor<tile_shape><<<col, row>>>(indices.data());
     tile_shape_sort padding(-1);
@@ -1072,14 +1072,14 @@ void topk(dtype *weight, dtype* indices, dtype *x){
     using gmOut = global_tensor<dtype, RowMajor<tokens, tK>>;
     using tileIn = Tile<Location::Vec, dtype, tS, scores, BLayout::RowMajor>;
     using tileOut = Tile<Location::Vec, dtype, tS, 32, BLayout::RowMajor, tS, tK>; // topk < 32
-   
+
     const int block = tokens/tS;
     for(int i=0;i<block;i++){
         gmIn gIn(x+i*tS*scores);
         tileIn tIn;
         tileIn tWeight;
         tileIn tIndice;
-        TCOPYIN(tIn, gIn);
+        TLOAD(tIn, gIn);
         TSORTROW<dtype>(tWeight, tIndice, tIn);
         tileOut tWeightOut;
         TEXTRACT(tWeightOut, tWeight, 0, 0);
@@ -1088,9 +1088,9 @@ void topk(dtype *weight, dtype* indices, dtype *x){
         TEXTRACT(tIndiceOut, tIndice, 0, 0);
 
         gmOut gWeight(weight+i*tS*tK);
-        TCOPYOUT(gWeight, tWeightOut);
+        TSTORE(gWeight, tWeightOut);
 
         gmOut gIndice(indices+i*tS*tK);
-        TCOPYOUT(gIndice, tIndiceOut);
+        TSTORE(gIndice, tIndiceOut);
     }
 }
\ No newline at end of file
diff --git a/test/accelerator/include/accelerator_vec_simt.h b/include/benchmark_support/npu/npu_vec_simt.h
similarity index 100%
rename from test/accelerator/include/accelerator_vec_simt.h
rename to include/benchmark_support/npu/npu_vec_simt.h
diff --git a/include/common/debug_utils.hpp b/include/common/debug_utils.hpp
index 5dd882a..7569d49 100644
--- a/include/common/debug_utils.hpp
+++ b/include/common/debug_utils.hpp
@@ -1,14 +1,18 @@
 #ifndef DEBUG_UIILS_HPP
 #define DEBUG_UIILS_HPP
 
-#ifdef __linx
-#include "jcore/utils.hpp"
+#if defined(__linx)
+namespace pto {
+template <typename tile_shape>
+void print_tile(tile_shape &) {}
+} // namespace pto
 #elif defined(__ARM_FEATURE_SME)
 #include "aarch64/utils.hpp"
 #elif defined(__cpu_sim__)
 #include "cpu_sim/utils.hpp"
 #endif
 
+#ifndef __linx
 namespace pto {
 template <typename tile_shape>
 void print_tile(tile_shape &tile) {
@@ -16,5 +20,6 @@ void print_tile(tile_shape &tile) {
 }
 
 } // namespace pto
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/common/layout.hpp b/include/common/layout.hpp
index d746528..3317ae2 100644
--- a/include/common/layout.hpp
+++ b/include/common/layout.hpp
@@ -2,8 +2,11 @@
 #define LAYOUT_HPP
 
 #include <stdint.h>
+#include <stddef.h>
 
+#ifndef __linx
 #include <iostream>
+#endif
 #include <type_traits>
 
 #include "common/math_utils.hpp"
@@ -65,6 +68,7 @@ const char *layout_type_to_str(LayoutEnum type) {
   return "UnsupportedLayout";
 }
 
+#ifndef __linx
 class MatrixLayoutPrettyPrinter {
   template <typename Layout>
   static void print(std::ostream &out, const Layout &layout) {
@@ -73,6 +77,7 @@ class MatrixLayoutPrettyPrinter {
         << Layout::ColStride << ">, Numel = " << Layout::Numel;
   }
 };
+#endif
 
 template <const int Rows_, const int Cols_, const int RowStride_,
           const int ColStride_,
@@ -171,6 +176,7 @@ struct BlockMatrixLayout {
     return outer_(outer_i, outer_j) + inner_(inner_i, inner_j);
   }
 
+#ifndef __linx
   void dump() const {
     for (int i = 0; i < Rows; ++i) {
       for (int j = 0; j < Cols; ++j) {
@@ -179,6 +185,7 @@ struct BlockMatrixLayout {
       printf("\n");
     }
   }
+#endif
 
   auto get_outer_layout() const { return decltype(outer_){}; }
 
@@ -196,6 +203,7 @@ struct BlockMatrixLayout {
 };
 
 /// @brief Pretty printer for BlockMatrixLayout
+#ifndef __linx
 template <typename OuterLayout_, typename InnerLayout_>
 static std::ostream &
 operator<<(std::ostream &out,
@@ -206,6 +214,7 @@ operator<<(std::ostream &out,
       << "  }";
   return out;
 }
+#endif
 
 template <typename OuterLayout_, typename InnerLayout_>
 concept BlockRowMajorLayout =
@@ -234,4 +243,4 @@ template <typename OuterLayout_, typename InnerLayout_>
 using BlockMixed = BlockMatrixLayout<OuterLayout_, InnerLayout_>;
 } // namespace pto
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/common/pto_tile.hpp b/include/common/pto_tile.hpp
index 1177823..06f3409 100644
--- a/include/common/pto_tile.hpp
+++ b/include/common/pto_tile.hpp
@@ -416,7 +416,7 @@ struct Tile {
   static_assert(SFractalSize_ == 512 || SFractalSize_ == 1024,
                 "SFractalSize_ illegal");
 
-#ifdef __linx
+#if defined(__linx) && defined(SUPERNPUBENCH_LINX_TILE_SIZE)
   using TileDType = DType tile_size(Rows *Cols / (sizeof(DType) * 8 / type_traits<DType>::bits));
 #else
   using TileDType = DType[Rows * Cols];
@@ -656,6 +656,7 @@ const char* get_layout_str() {
 
 template <typename tile_shape>
 void print_tile_info() {
+#ifndef __linx
   std::cout << "Tile Rows Number: " << tile_shape::Rows << std::endl;
   std::cout << "Tile Columns Number: " << tile_shape::Cols << std::endl;
   std::cout << "Tile Active Rows Number: " << tile_shape::ValidRow << std::endl;
@@ -667,6 +668,7 @@ void print_tile_info() {
   std::cout << "Tile Size: " << tile_shape::Numel << std::endl;
   std::cout << "Tile Layout: " << get_layout_str<tile_shape>() << std::endl;
   std::cout << "Tile Data Dump: " << std::endl;
+#endif
 }
 
 } // namespace pto
diff --git a/include/common/tileop_api.hpp b/include/common/tileop_api.hpp
index 9774dd9..aacf95f 100644
--- a/include/common/tileop_api.hpp
+++ b/include/common/tileop_api.hpp
@@ -36,7 +36,7 @@ void MATMACCMX(tile_shape_C &dst, tile_shape_A &src0,  tile_shape_AX &src0x,
   MATMACCMX_Impl(dst, src0, src0x, src1, src1x);
 }
 
-template <is_tile_data_v tile_shape_A, is_tile_data_v tile_shape_B, 
+template <is_tile_data_v tile_shape_A, is_tile_data_v tile_shape_B,
           is_tile_data_v tile_shape_BX, is_tile_data_v tile_shape_C>
 void MATMACCMXB(tile_shape_C &dst, tile_shape_A &src0,
                 tile_shape_B &src1, tile_shape_BX &src1x) {
@@ -81,12 +81,12 @@ void TCOPY(tile_shape &dst, tile_shape &src) {
   TCOPY_Impl(dst, src);
 }
 template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
-void TCOPYIN(tile_shape &dst, gm_shape &src) {
-  TCOPYIN_Impl(dst, src);
+void TLOAD(tile_shape &dst, gm_shape &src) {
+  TLOAD_Impl(dst, src);
 }
 template <is_global_data_v gm_shape, is_tile_data_v tile_shape>
-void TCOPYOUT(gm_shape &dst, tile_shape &src) {
-  TCOPYOUT_Impl(dst, src);
+void TSTORE(gm_shape &dst, tile_shape &src) {
+  TSTORE_Impl(dst, src);
 }
 template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
 void TCVT(tile_shape_out &dst, tile_shape_in &src) {
diff --git a/include/common/tileop_api_impl.hpp b/include/common/tileop_api_impl.hpp
index ff76138..fe8ad19 100644
--- a/include/common/tileop_api_impl.hpp
+++ b/include/common/tileop_api_impl.hpp
@@ -8,44 +8,34 @@
 #include "jcore/TAdd.hpp"
 #include "jcore/TAdds.hpp"
 #include "jcore/TAnd.hpp"
-#include "jcore/TAssemble.hpp"
-#include "jcore/TCast.hpp"
-#include "jcore/TCI.hpp"
 #include "jcore/TCmp.hpp"
+#include "jcore/TCI.hpp"
 #include "jcore/TCopy.hpp"
-#include "jcore/TCopyIn.hpp"
-#include "jcore/TCopyOut.hpp"
+#include "jcore/TLoad.hpp"
+#include "jcore/TStore.hpp"
 #include "jcore/TCvt.hpp"
 #include "jcore/TDiv.hpp"
 #include "jcore/TDivs.hpp"
 #include "jcore/TExp.hpp"
+#include "jcore/TOr.hpp"
+#include "jcore/TRem.hpp"
+#include "jcore/TRecip.hpp"
+#include "jcore/TSqrt.hpp"
+#include "jcore/TSub.hpp"
+#include "jcore/TSubs.hpp"
+#include "jcore/TMul.hpp"
+#include "jcore/TMuls.hpp"
+#include "jcore/TMax.hpp"
+#include "jcore/TMaxs.hpp"
 #include "jcore/TExpandCol.hpp"
 #include "jcore/TExpandRow.hpp"
 #include "jcore/TExpandScalar.hpp"
-#include "jcore/TExtract.hpp"
-#include "jcore/TFillPad.hpp"
-#include "jcore/TGather.hpp"
-#include "jcore/TMax.hpp"
-#include "jcore/TMaxs.hpp"
-#include "jcore/TMin.hpp"
-#include "jcore/TMins.hpp"
-#include "jcore/TMul.hpp"
-#include "jcore/TMuls.hpp"
-#include "jcore/TOr.hpp"
-#include "jcore/TPad.hpp"
-#include "jcore/TRSqrt.hpp"
-#include "jcore/TRecip.hpp"
-#include "jcore/TRem.hpp"
 #include "jcore/TReshape.hpp"
 #include "jcore/TRowMax.hpp"
 #include "jcore/TRowMaxExpand.hpp"
 #include "jcore/TRowSum.hpp"
 #include "jcore/TRowSumExpand.hpp"
-#include "jcore/TScatter.hpp"
-#include "jcore/TSelect.hpp"
-#include "jcore/TSqrt.hpp"
-#include "jcore/TSub.hpp"
-#include "jcore/TSubs.hpp"
+#include "jcore/TPad.hpp"
 #include "jcore/TTrans.hpp"
 
 #elif defined(__ARM_FEATURE_SME)
@@ -60,8 +50,8 @@
 #include "aarch64/TCI.hpp"
 #include "aarch64/TCmp.hpp"
 #include "aarch64/TCopy.hpp"
-#include "aarch64/TCopyIn.hpp"
-#include "aarch64/TCopyOut.hpp"
+#include "aarch64/TLoad.hpp"
+#include "aarch64/TStore.hpp"
 #include "aarch64/TDiv.hpp"
 #include "aarch64/TDivs.hpp"
 #include "aarch64/TExp.hpp"
@@ -103,8 +93,8 @@
 #include "cpu_sim/TCI.hpp"
 #include "cpu_sim/TCmp.hpp"
 #include "cpu_sim/TCopy.hpp"
-#include "cpu_sim/TCopyIn.hpp"
-#include "cpu_sim/TCopyOut.hpp"
+#include "cpu_sim/TLoad.hpp"
+#include "cpu_sim/TStore.hpp"
 #include "cpu_sim/TCvt.hpp"
 #include "cpu_sim/TDiv.hpp"
 #include "cpu_sim/TDivs.hpp"
@@ -142,4 +132,4 @@
 #error "__linx, __ARM_FEATURE_SME, or __cpu_sim__ must be defined"
 
 #endif
-#endif
\ No newline at end of file
+#endif
diff --git a/include/cpu_sim/TCopyIn.hpp b/include/cpu_sim/TLoad.hpp
similarity index 85%
rename from include/cpu_sim/TCopyIn.hpp
rename to include/cpu_sim/TLoad.hpp
index 56476ea..17e32de 100644
--- a/include/cpu_sim/TCopyIn.hpp
+++ b/include/cpu_sim/TLoad.hpp
@@ -1,12 +1,12 @@
-#ifndef TCOPYIN_HPP
-#define TCOPYIN_HPP
+#ifndef CPU_SIM_TLOAD_HPP
+#define CPU_SIM_TLOAD_HPP
 
 #include "common/pto_tile.hpp"
 
 using namespace pto;
 
 template <typename tile_shape, typename gm_shape>
-void CopyInRow2NzImpl1D(typename tile_shape::TileDType dst,
+void LoadRow2NzImpl1D(typename tile_shape::TileDType dst,
                      const typename gm_shape::DType *src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -29,7 +29,7 @@ void CopyInRow2NzImpl1D(typename tile_shape::TileDType dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void CopyInRow2ZnImpl1D(typename tile_shape::TileDType dst,
+void LoadRow2ZnImpl1D(typename tile_shape::TileDType dst,
                      const typename gm_shape::DType *src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -52,7 +52,7 @@ void CopyInRow2ZnImpl1D(typename tile_shape::TileDType dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void CopyInCol2ZnImpl1D(typename tile_shape::TileDType dst,
+void LoadCol2ZnImpl1D(typename tile_shape::TileDType dst,
                      const typename gm_shape::DType *src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -75,7 +75,7 @@ void CopyInCol2ZnImpl1D(typename tile_shape::TileDType dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void TCopyIn_RowMajor_Impl(typename tile_shape::TileDType dst,
+void TLoad_RowMajor_Impl(typename tile_shape::TileDType dst,
                            const typename gm_shape::DType *src) {
   for (size_t i = 0; i < tile_shape::ValidRow; ++i)
     for (size_t j = 0; j < tile_shape::ValidCol; ++j) {
@@ -86,7 +86,7 @@ void TCopyIn_RowMajor_Impl(typename tile_shape::TileDType dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void TCopyIn_ColMajor_Impl(typename tile_shape::TileDType dst,
+void TLoad_ColMajor_Impl(typename tile_shape::TileDType dst,
                            const typename gm_shape::DType *src) {
   for (size_t i = 0; i < tile_shape::ValidCol; ++i)
     for (size_t j = 0; j < tile_shape::ValidRow; ++j) {
@@ -97,7 +97,7 @@ void TCopyIn_ColMajor_Impl(typename tile_shape::TileDType dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void CopyInRow2NzImpl1D_Dynamic(tile_shape &dst,
+void LoadRow2NzImpl1D_Dynamic(tile_shape &dst,
                                 gm_shape &src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -120,7 +120,7 @@ void CopyInRow2NzImpl1D_Dynamic(tile_shape &dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void CopyInRow2ZnImpl1D_Dynamic(tile_shape &dst,
+void LoadRow2ZnImpl1D_Dynamic(tile_shape &dst,
                                 gm_shape &src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -143,7 +143,7 @@ void CopyInRow2ZnImpl1D_Dynamic(tile_shape &dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void CopyInCol2ZnImpl1D_Dynamic(tile_shape &dst,
+void LoadCol2ZnImpl1D_Dynamic(tile_shape &dst,
                                 gm_shape &src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -166,7 +166,7 @@ void CopyInCol2ZnImpl1D_Dynamic(tile_shape &dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void TCopyIn_RowMajor_Impl_Dynamic(tile_shape &dst,
+void TLoad_RowMajor_Impl_Dynamic(tile_shape &dst,
                                    gm_shape &src) {
   for (size_t i = 0; i < dst.GetValidRow(); ++i) {
     for (size_t j = 0; j < dst.GetValidCol(); ++j) {
@@ -178,7 +178,7 @@ void TCopyIn_RowMajor_Impl_Dynamic(tile_shape &dst,
 }
 
 template <typename tile_shape, typename gm_shape>
-void TCopyIn_ColMajor_Impl_Dynamic(tile_shape &dst,
+void TLoad_ColMajor_Impl_Dynamic(tile_shape &dst,
                                    gm_shape &src) {
   for (size_t i = 0; i < dst.GetValidCol(); ++i) {
     for (size_t j = 0; j < dst.GetValidRow(); ++j) {
@@ -190,26 +190,26 @@ void TCopyIn_ColMajor_Impl_Dynamic(tile_shape &dst,
 }
 
 template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
-void TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
+void TLOAD_Impl(tile_shape &dst, gm_shape &src) {
   static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here");
   if (tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic
     if constexpr (is_Nz_layout<tile_shape>::value) {
       if constexpr (gm_shape::isRowMajor) {
-        CopyInRow2NzImpl1D_Dynamic<tile_shape, gm_shape>(dst, src);
+        LoadRow2NzImpl1D_Dynamic<tile_shape, gm_shape>(dst, src);
       } else {
         static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor");
       }
     } else if constexpr (is_Zn_layout<tile_shape>::value) {
       if constexpr (!gm_shape::isRowMajor) {
-        CopyInCol2ZnImpl1D_Dynamic<tile_shape, gm_shape>(dst, src);
+        LoadCol2ZnImpl1D_Dynamic<tile_shape, gm_shape>(dst, src);
       } else {
-        CopyInRow2ZnImpl1D_Dynamic<tile_shape, gm_shape>(dst, src);
+        LoadRow2ZnImpl1D_Dynamic<tile_shape, gm_shape>(dst, src);
       }
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (gm_shape::isRowMajor) {
-        TCopyIn_RowMajor_Impl_Dynamic<tile_shape, gm_shape>(dst, src);
+        TLoad_RowMajor_Impl_Dynamic<tile_shape, gm_shape>(dst, src);
       } else {
-        TCopyIn_ColMajor_Impl_Dynamic<tile_shape, gm_shape>(dst, src);
+        TLoad_ColMajor_Impl_Dynamic<tile_shape, gm_shape>(dst, src);
       }
     } else {
       static_assert(tile_shape::isBoxedLayout == false, "Data type not supported");
@@ -217,21 +217,21 @@ void TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
   } else { // static
     if constexpr (is_Nz_layout<tile_shape>::value) {
       if constexpr (gm_shape::isRowMajor) {
-        CopyInRow2NzImpl1D<tile_shape, gm_shape>(dst.data(), src.data());
+        LoadRow2NzImpl1D<tile_shape, gm_shape>(dst.data(), src.data());
       } else {
         static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor");
       }
     } else if constexpr (is_Zn_layout<tile_shape>::value) {
       if constexpr (!gm_shape::isRowMajor) {
-        CopyInCol2ZnImpl1D<tile_shape, gm_shape>(dst.data(), src.data());
+        LoadCol2ZnImpl1D<tile_shape, gm_shape>(dst.data(), src.data());
       } else {
-        CopyInRow2ZnImpl1D<tile_shape, gm_shape>(dst.data(), src.data());
+        LoadRow2ZnImpl1D<tile_shape, gm_shape>(dst.data(), src.data());
       }
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (gm_shape::isRowMajor) {
-        TCopyIn_RowMajor_Impl<tile_shape, gm_shape>(dst.data(), src.data());
+        TLoad_RowMajor_Impl<tile_shape, gm_shape>(dst.data(), src.data());
       } else {
-        TCopyIn_ColMajor_Impl<tile_shape, gm_shape>(dst.data(), src.data());
+        TLoad_ColMajor_Impl<tile_shape, gm_shape>(dst.data(), src.data());
       }
     } else {
       static_assert(tile_shape::isBoxedLayout == false, "Data type not supported");
diff --git a/include/cpu_sim/TCopyOut.hpp b/include/cpu_sim/TStore.hpp
similarity index 84%
rename from include/cpu_sim/TCopyOut.hpp
rename to include/cpu_sim/TStore.hpp
index 6a59712..d3412b4 100644
--- a/include/cpu_sim/TCopyOut.hpp
+++ b/include/cpu_sim/TStore.hpp
@@ -1,12 +1,12 @@
-#ifndef TCOPYOUT_HPP
-#define TCOPYOUT_HPP
+#ifndef CPU_SIM_TSTORE_HPP
+#define CPU_SIM_TSTORE_HPP
 
 #include "common/pto_tile.hpp"
 
 using namespace pto;
 
 template <typename gm_shape, typename tile_shape>
-void CopyOut2NzImpl1D(typename gm_shape::DType *dst,
+void Store2NzImpl1D(typename gm_shape::DType *dst,
                       const typename tile_shape::TileDType src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -29,7 +29,7 @@ void CopyOut2NzImpl1D(typename gm_shape::DType *dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void TCopyOut_ColMajor_Impl(typename gm_shape::DType *dst,
+void TStore_ColMajor_Impl(typename gm_shape::DType *dst,
                             typename tile_shape::TileDType src) {
   for (size_t i = 0; i < tile_shape::ValidCol; ++i)
     for (size_t j = 0; j < tile_shape::ValidRow; ++j) {
@@ -40,7 +40,7 @@ void TCopyOut_ColMajor_Impl(typename gm_shape::DType *dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void TCopyOut_RowMajor_Impl(typename gm_shape::DType *dst,
+void TStore_RowMajor_Impl(typename gm_shape::DType *dst,
                             typename tile_shape::TileDType src) {
   for (size_t i = 0; i < tile_shape::ValidRow; ++i)
     for (size_t j = 0; j < tile_shape::ValidCol; ++j) {
@@ -51,7 +51,7 @@ void TCopyOut_RowMajor_Impl(typename gm_shape::DType *dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void CopyOut2NzImpl1D_Dynamic(gm_shape &dst,
+void Store2NzImpl1D_Dynamic(gm_shape &dst,
                               const tile_shape &src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -74,7 +74,7 @@ void CopyOut2NzImpl1D_Dynamic(gm_shape &dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void TCopyOut_ColMajor_Impl_Dynamic(gm_shape& dst,
+void TStore_ColMajor_Impl_Dynamic(gm_shape& dst,
                                     tile_shape &src) {
   for (size_t i = 0; i < src.GetValidCol(); ++i) {
     for (size_t j = 0; j < src.GetValidRow(); ++j) {
@@ -86,7 +86,7 @@ void TCopyOut_ColMajor_Impl_Dynamic(gm_shape& dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void TCopyOut_RowMajor_Impl_Dynamic(gm_shape &dst,
+void TStore_RowMajor_Impl_Dynamic(gm_shape &dst,
                                     tile_shape &src) {
   for (size_t i = 0; i < src.GetValidRow(); ++i) {
     for (size_t j = 0; j < src.GetValidCol(); ++j) {
@@ -98,20 +98,20 @@ void TCopyOut_RowMajor_Impl_Dynamic(gm_shape &dst,
 }
 
 template <is_global_data_v gm_shape, is_tile_data_v tile_shape>
-void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
+void TSTORE_Impl(gm_shape &dst, tile_shape &src) {
   static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here");
   if (tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic
     if constexpr (is_Nz_layout<tile_shape>::value) {
       if constexpr (gm_shape::isRowMajor) {
-        CopyOut2NzImpl1D_Dynamic<gm_shape, tile_shape>(dst, src);
+        Store2NzImpl1D_Dynamic<gm_shape, tile_shape>(dst, src);
       } else {
         static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor");
       }
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (gm_shape::isRowMajor) {
-        TCopyOut_RowMajor_Impl_Dynamic<gm_shape, tile_shape>(dst, src);
+        TStore_RowMajor_Impl_Dynamic<gm_shape, tile_shape>(dst, src);
       } else {
-        TCopyOut_ColMajor_Impl_Dynamic<gm_shape, tile_shape>(dst, src);
+        TStore_ColMajor_Impl_Dynamic<gm_shape, tile_shape>(dst, src);
       }
     } else {
       static_assert(tile_shape::isBoxedLayout == false, "Data type not supported");
@@ -119,15 +119,15 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
   } else { // static
     if constexpr (is_Nz_layout<tile_shape>::value) {
       if constexpr (gm_shape::isRowMajor) {
-        CopyOut2NzImpl1D<gm_shape, tile_shape>(dst.data(), src.data());
+        Store2NzImpl1D<gm_shape, tile_shape>(dst.data(), src.data());
       } else {
         static_assert(gm_shape::isRowMajor, "Storage layout type not supported, gm should rowmajor");
       }
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (gm_shape::isRowMajor) {
-        TCopyOut_RowMajor_Impl<gm_shape, tile_shape>(dst.data(), src.data());
+        TStore_RowMajor_Impl<gm_shape, tile_shape>(dst.data(), src.data());
       } else {
-        TCopyOut_ColMajor_Impl<gm_shape, tile_shape>(dst.data(), src.data());
+        TStore_ColMajor_Impl<gm_shape, tile_shape>(dst.data(), src.data());
       }
     } else {
       static_assert(tile_shape::isBoxedLayout == false, "Data type not supported");
diff --git a/include/jcore/MatMacc.hpp b/include/jcore/MatMacc.hpp
index 39f4acc..a274bbc 100644
--- a/include/jcore/MatMacc.hpp
+++ b/include/jcore/MatMacc.hpp
@@ -5,6 +5,68 @@
 
 using namespace pto;
 
+#ifdef __linx
+template <typename...>
+struct linx_matmacc_unsupported {
+  static constexpr bool value = false;
+};
+
+// Matrix Multiply and Accumulate: C[MxN] += A[MxK] x B[KxN]
+template <is_tile_data_v tile_shape_A, is_tile_data_v tile_shape_B,
+          is_tile_data_v tile_shape_C>
+void MATMACC_Impl(tile_shape_C &dst, tile_shape_A &src0, tile_shape_B &src1) {
+  static_assert(tile_shape_A::ValidCol == tile_shape_B::ValidRow,
+                "Linx scalar MATMACC requires A columns to match B rows");
+  static_assert(!tile_shape_A::isBoxedLayout && !tile_shape_B::isBoxedLayout &&
+                    !tile_shape_C::isBoxedLayout,
+                "Linx scalar MATMACC supports only unboxed layouts");
+  static_assert(tile_shape_A::Loc != Location::Acc &&
+                    tile_shape_B::Loc != Location::Acc &&
+                    tile_shape_C::Loc != Location::Acc,
+                "Linx scalar MATMACC does not support ACC tile operands");
+  static_assert(std::is_integral<typename tile_shape_A::DType>::value &&
+                    std::is_integral<typename tile_shape_B::DType>::value &&
+                    std::is_integral<typename tile_shape_C::DType>::value,
+                "Linx scalar MATMACC direct smoke supports integral tiles only");
+
+  constexpr size_t rows = tile_shape_C::ValidRow;
+  constexpr size_t cols = tile_shape_C::ValidCol;
+  constexpr size_t inner = tile_shape_A::ValidCol;
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      typename tile_shape_C::DType acc = dst.data()[index<tile_shape_C>(row, col)];
+      for (size_t k = 0; k < inner; ++k) {
+        acc += src0.data()[index<tile_shape_A>(row, k)] *
+               src1.data()[index<tile_shape_B>(k, col)];
+      }
+      dst.data()[index<tile_shape_C>(row, col)] = acc;
+    }
+  }
+}
+
+template <is_tile_data_v tile_shape_A, is_tile_data_v tile_shape_AX,
+          is_tile_data_v tile_shape_B, is_tile_data_v tile_shape_BX,
+          is_tile_data_v tile_shape_C>
+void MATMACCMX_Impl(tile_shape_C &, tile_shape_A &, tile_shape_AX &,
+                    tile_shape_B &, tile_shape_BX &) {
+  static_assert(linx_matmacc_unsupported<tile_shape_A, tile_shape_AX,
+                                         tile_shape_B, tile_shape_BX,
+                                         tile_shape_C>::value,
+                "Linx direct MATMACCMX smoke is not implemented");
+}
+
+template <is_tile_data_v tile_shape_A, is_tile_data_v tile_shape_B,
+          is_tile_data_v tile_shape_BX, is_tile_data_v tile_shape_C>
+void MATMACCMXB_Impl(tile_shape_C &, tile_shape_A &, tile_shape_B &,
+                     tile_shape_BX &) {
+  static_assert(linx_matmacc_unsupported<tile_shape_A, tile_shape_B,
+                                         tile_shape_BX, tile_shape_C>::value,
+                "Linx direct MATMACCMXB smoke is not implemented");
+}
+
+#else
+
 template <typename tile_shape_A, typename tile_shape_B, typename tile_shape_C>
 void __vec__ MatMacc_Vec_Impl(
     typename tile_shape_C::TileDType __out__ dst,
@@ -235,4 +297,5 @@ void MATMACCMXB_Impl(tile_shape_C &dst,
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/MatMul.hpp b/include/jcore/MatMul.hpp
index 166abf8..80bf41a 100644
--- a/include/jcore/MatMul.hpp
+++ b/include/jcore/MatMul.hpp
@@ -5,6 +5,45 @@
 
 using namespace pto;
 
+#ifdef __linx
+
+// Direct-boot Linx smoke uses the scalar fallback until the vector launch
+// syntax is supported in this toolchain lane.
+template <is_tile_data_v tile_shape_A, is_tile_data_v tile_shape_B,
+          is_tile_data_v tile_shape_C>
+void MATMUL_Impl(tile_shape_C &dst, tile_shape_A &src0, tile_shape_B &src1) {
+  static_assert(!tile_shape_A::isBoxedLayout && !tile_shape_B::isBoxedLayout &&
+                    !tile_shape_C::isBoxedLayout,
+                "Linx scalar MATMUL supports only unboxed layouts");
+  static_assert(tile_shape_A::Loc != Location::Acc &&
+                    tile_shape_B::Loc != Location::Acc &&
+                    tile_shape_C::Loc != Location::Acc,
+                "Linx scalar MATMUL does not support ACC tile operands");
+
+  const size_t rows = dst.GetValidRow();
+  const size_t cols = dst.GetValidCol();
+  const size_t inner =
+      src0.GetValidCol() > src1.GetValidRow() ? src0.GetValidCol()
+                                               : src1.GetValidRow();
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      typename tile_shape_C::DType acc = 0;
+      for (size_t k = 0; k < inner; ++k) {
+        if constexpr (!std::is_same<typename tile_shape_A::DType, __half>::value &&
+                      !std::is_same<typename tile_shape_B::DType, __half>::value &&
+                      !std::is_same<typename tile_shape_C::DType, __half>::value) {
+          acc += src0.data()[index<tile_shape_A>(row, k)] *
+                 src1.data()[index<tile_shape_B>(k, col)];
+        }
+      }
+      dst.data()[index<tile_shape_C>(row, col)] = acc;
+    }
+  }
+}
+
+#else
+
 template <typename tile_shape_A, typename tile_shape_B, typename tile_shape_C>
 void __vec__ MatMul_Vec_Impl(
     typename tile_shape_C::TileDType __out__ dst,
@@ -245,3 +284,5 @@ void MATMULMXB_Impl(tile_shape_C &dst,
 }
 
 #endif
+
+#endif
diff --git a/include/jcore/TAbs.hpp b/include/jcore/TAbs.hpp
index 73e55be..273cce6 100644
--- a/include/jcore/TAbs.hpp
+++ b/include/jcore/TAbs.hpp
@@ -5,6 +5,27 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TABS_Impl(tile_shape &dst, tile_shape &src) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TABS does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      auto src_value = src.data()[index];
+      auto zero = typename tile_shape::DType{};
+      dst.data()[index] = src_value < zero ? -src_value : src_value;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TAbs_Vec_RowMajor(
   typename tile_shape::TileDType __out__ dst,
@@ -85,5 +106,6 @@ template <is_tile_data_v tile_shape> void TABS_Impl(tile_shape &dst, tile_shape
     }
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TAdd.hpp b/include/jcore/TAdd.hpp
index 87faa41..7a59dfa 100644
--- a/include/jcore/TAdd.hpp
+++ b/include/jcore/TAdd.hpp
@@ -5,6 +5,26 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TADD_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  size_t rows = src0.GetValidRow();
+  size_t cols = src0.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TADD does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src0.data()[index] + src1.data()[index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TAdd_Vec_RowMajor(
   typename tile_shape::TileDType __out__ dst,
@@ -72,5 +92,6 @@ void TADD_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
   }
 
 }
+#endif
 
 #endif
diff --git a/include/jcore/TAdds.hpp b/include/jcore/TAdds.hpp
index 9143023..adc353f 100644
--- a/include/jcore/TAdds.hpp
+++ b/include/jcore/TAdds.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TADDS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TADDS does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src.data()[index] + s;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TAdds_Vec_RowMajor(
   typename tile_shape::TileDType __out__ dst,
@@ -80,5 +99,6 @@ void TADDS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s)
     }
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TAnd.hpp b/include/jcore/TAnd.hpp
index b7dda75..bc43bf4 100644
--- a/include/jcore/TAnd.hpp
+++ b/include/jcore/TAnd.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TAND_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  size_t rows = src0.GetValidRow();
+  size_t cols = src0.GetValidCol();
+  static_assert(tile_shape::Loc == Location::Vec,
+                "Only VEC tiles are supported");
+  static_assert(!tile_shape::isBoxedLayout, "TAND does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src0.data()[index] & src1.data()[index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TAnd_Vec_RowMajor(
   typename tile_shape::TileDType __out__ dst,
@@ -70,5 +89,6 @@ void TAND_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Only int data type are supported");
   }
 }
+#endif
 
 #endif
diff --git a/include/jcore/TCI.hpp b/include/jcore/TCI.hpp
index 3849eab..148aa00 100644
--- a/include/jcore/TCI.hpp
+++ b/include/jcore/TCI.hpp
@@ -5,7 +5,41 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape, typename T, int descending>
+void TCI_Impl(tile_shape &dst, T s) {
+  static constexpr size_t row = tile_shape::ValidRow;
+  static constexpr size_t col = tile_shape::ValidCol;
+
+  static_assert(std::is_same<typename tile_shape::DType, T>::value,
+                "Dst and scalar must be same data type!");
+  static_assert((descending == 0) || (descending == 1),
+                "descending must be 0 or 1!");
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape::Loc == Location::Vec,
+                "Only VEC tiles are supported");
+  static_assert(!tile_shape::isBoxedLayout, "TCI does not support Boxed Layout!");
+  static_assert(std::is_same<typename tile_shape::DType, int32_t>::value ||
+                    std::is_same<typename tile_shape::DType, uint32_t>::value ||
+                    std::is_same<typename tile_shape::DType, int16_t>::value ||
+                    std::is_same<typename tile_shape::DType, uint16_t>::value,
+                "Data type not supported");
 
+  for (size_t row_idx = 0; row_idx < row; ++row_idx) {
+    for (size_t col_idx = 0; col_idx < col; ++col_idx) {
+      size_t tile_index = index<tile_shape>(row_idx, col_idx);
+      if constexpr (descending) {
+        dst.data()[tile_index] =
+            s - static_cast<typename tile_shape::DType>(tile_index);
+      } else {
+        dst.data()[tile_index] =
+            s + static_cast<typename tile_shape::DType>(tile_index);
+      }
+    }
+  }
+}
+#else
 template <typename tile_shape, int desc>
 void __vec__ TCIImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
                                 const typename tile_shape::DType __in__ s) {
@@ -70,4 +104,5 @@ if constexpr (std::is_same<typename tile_shape::DType, int32_t>::value ||
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TCmp.hpp b/include/jcore/TCmp.hpp
index 608e76f..6a37c0c 100644
--- a/include/jcore/TCmp.hpp
+++ b/include/jcore/TCmp.hpp
@@ -3,9 +3,64 @@
 
 #include "common/pto_tile.hpp"
 #include "jcore/constants.hpp"
+#ifdef __linx
+#include <stddef.h>
+#else
 #include <assert.h>
+#endif
 using namespace pto;
 
+#ifdef __linx
+template <typename T> static inline int32_t linx_tcmp_value(T a, T b, CmpMode mode) {
+  switch (mode) {
+  case CmpMode::EQ:
+    return a == b;
+  case CmpMode::NE:
+    return a != b;
+  case CmpMode::GT:
+    return a > b;
+  case CmpMode::LT:
+    return a < b;
+  case CmpMode::GE:
+    return a >= b;
+  case CmpMode::LE:
+    return a <= b;
+  }
+  return 0;
+}
+
+template <typename tile_shape_out, typename tile_shape_in>
+void TCMP_Impl(tile_shape_out &dst, tile_shape_in &src0, tile_shape_in &src1,
+               CmpMode cmpMode) {
+  static_assert(tile_shape_in::Rows == tile_shape_out::Rows &&
+                    tile_shape_in::Cols == tile_shape_out::Cols,
+                "Error! Input shape != Output shape");
+  static_assert(tile_shape_in::InnerRows == tile_shape_out::InnerRows &&
+                    tile_shape_in::InnerCols == tile_shape_out::InnerCols,
+                "Error! Inner shape is not equal!");
+  static_assert(tile_shape_out::Loc == Location::Vec &&
+                    tile_shape_in::Loc == Location::Vec,
+                "Only VEC tiles are supported");
+  static_assert(tile_shape_out::isBoxedLayout == false &&
+                    tile_shape_in::isBoxedLayout == false,
+                "TCMP does not support Boxed Layout!");
+
+  static constexpr size_t row = tile_shape_in::ValidRow;
+  static constexpr size_t col = tile_shape_in::ValidCol;
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      size_t in_index = index<tile_shape_in>(i, j);
+      size_t out_index = index<tile_shape_out>(i, j);
+      dst.data()[out_index] =
+          static_cast<typename tile_shape_out::DType>(linx_tcmp_value(
+              src0.data()[in_index], src1.data()[in_index], cmpMode));
+    }
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in, CmpMode mode>
 void __vec__ TCmp_Vec_RowMajor(typename tile_shape_out::TileDType __out__ dst,
                                 const typename tile_shape_in::TileDType __in__ src0,
@@ -160,3 +215,4 @@ void TCMP_Impl(tile_shape_out &dst, tile_shape_in &src0, tile_shape_in &src1, Cm
   }
 }
 #endif
+#endif
diff --git a/include/jcore/TCopy.hpp b/include/jcore/TCopy.hpp
index 956c432..c0a2778 100644
--- a/include/jcore/TCopy.hpp
+++ b/include/jcore/TCopy.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TCOPY_Impl(tile_shape &dst, tile_shape &src) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TCOPY does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src.data()[index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TCopy_Vec_RowMajor(
   typename tile_shape::TileDType __out__ dst,
@@ -68,5 +87,6 @@ void TCOPY_Impl(tile_shape &dst, tile_shape &src) {
     }
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TCvt.hpp b/include/jcore/TCvt.hpp
index e1ab2f5..19601cc 100644
--- a/include/jcore/TCvt.hpp
+++ b/include/jcore/TCvt.hpp
@@ -2,10 +2,38 @@
 #define TCVT_HPP
 
 #include "common/pto_tile.hpp"
+#ifndef __linx
 #include "template_asm.hpp"
+#endif
 
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TCVT_Impl(tile_shape_out &dst, tile_shape_in &src) {
+  static_assert(tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC &&
+                    tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_in::Loc != Location::Acc,
+                "Linx direct TCVT smoke does not support ACC input");
+  static_assert(tile_shape_out::Loc != Location::Acc,
+                "ACC cannot be an output tile!");
+  static_assert(tile_shape_in::ValidRow == tile_shape_out::ValidRow &&
+                    tile_shape_in::ValidCol == tile_shape_out::ValidCol,
+                "TCVT direct path requires matching logical shapes");
+
+  for (size_t row = 0; row < tile_shape_in::ValidRow; ++row) {
+    for (size_t col = 0; col < tile_shape_in::ValidCol; ++col) {
+      size_t src_index = index<tile_shape_in>(row, col);
+      size_t dst_index = index<tile_shape_out>(row, col);
+      dst.data()[dst_index] =
+          static_cast<typename tile_shape_out::DType>(src.data()[src_index]);
+    }
+  }
+}
+#else
 template <typename, typename = void>
 struct blkc_has_data_member : std::false_type {};
 
@@ -707,4 +735,5 @@ void TCVT_Impl(tile_shape_out &dst, tile_shape_in &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TDiv.hpp b/include/jcore/TDiv.hpp
index b13ff80..50666dd 100644
--- a/include/jcore/TDiv.hpp
+++ b/include/jcore/TDiv.hpp
@@ -5,6 +5,26 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TDIV_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  static constexpr size_t row = tile_shape::ValidRow;
+  static constexpr size_t col = tile_shape::ValidCol;
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TDIV does not support Boxed Layout!");
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      size_t tile_index = index<tile_shape>(i, j);
+      dst.data()[tile_index] = src0.data()[tile_index] / src1.data()[tile_index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 TDivImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
@@ -68,5 +88,6 @@ void TDIV_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Storage type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TDivs.hpp b/include/jcore/TDivs.hpp
index 7f5ec9f..8a6f435 100644
--- a/include/jcore/TDivs.hpp
+++ b/include/jcore/TDivs.hpp
@@ -5,6 +5,26 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TDIVS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) {
+  static constexpr size_t row = tile_shape::ValidRow;
+  static constexpr size_t col = tile_shape::ValidCol;
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TDIVS does not support Boxed Layout!");
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      size_t tile_index = index<tile_shape>(i, j);
+      dst.data()[tile_index] = src.data()[tile_index] / s;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TDivsImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
                                 const typename tile_shape::TileDType __in__ src,
@@ -62,5 +82,6 @@ void TDIVS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s)
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TExp.hpp b/include/jcore/TExp.hpp
index 0e43020..b180984 100644
--- a/include/jcore/TExp.hpp
+++ b/include/jcore/TExp.hpp
@@ -5,6 +5,43 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <typename T>
+T linx_tile_iexp(T value) {
+  T result = static_cast<T>(1); // Running total is round(e^n): 1, 3, 7, 20, 55, 148.
+  result +=
+      (value >= static_cast<T>(1)) ? static_cast<T>(2) : static_cast<T>(0);
+  result +=
+      (value >= static_cast<T>(2)) ? static_cast<T>(4) : static_cast<T>(0);
+  result +=
+      (value >= static_cast<T>(3)) ? static_cast<T>(13) : static_cast<T>(0);
+  result +=
+      (value >= static_cast<T>(4)) ? static_cast<T>(35) : static_cast<T>(0);
+  result +=
+      (value >= static_cast<T>(5)) ? static_cast<T>(93) : static_cast<T>(0);
+  return result;
+}
+
+template <is_tile_data_v tile_shape>
+void TEXP_Impl(tile_shape &dst, tile_shape &src) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TEXP does not support Boxed Layout!");
+  static_assert(std::is_integral<typename tile_shape::DType>::value,
+                "Linx direct TEXP supports integral smoke types only");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t tile_index = tile_shape::isRowMajor
+                              ? row * tile_shape::RowStride + col
+                              : col * tile_shape::ColStride + row;
+      dst.data()[tile_index] = linx_tile_iexp(src.data()[tile_index]);
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 TExpImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
@@ -65,4 +102,6 @@ template <is_tile_data_v tile_shape> void TEXP_Impl(tile_shape &dst, tile_shape
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+
+#endif
diff --git a/include/jcore/TExpandCol.hpp b/include/jcore/TExpandCol.hpp
index 25c6988..51c1d5d 100644
--- a/include/jcore/TExpandCol.hpp
+++ b/include/jcore/TExpandCol.hpp
@@ -5,6 +5,29 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TEXPANDCOL_Impl(tile_shape_out &dst, tile_shape_in &src) {
+  static_assert((tile_shape_out::Rows == tile_shape_in::Rows) &&
+                    (tile_shape_out::ValidRow == tile_shape_in::ValidRow),
+                "Error! Output rows != Input rows");
+  static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout,
+                "Fractal layout is not supported");
+  static_assert(tile_shape_out::Loc != Location::Acc &&
+                    tile_shape_in::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+
+  size_t row = dst.GetValidRow();
+  size_t col = dst.GetValidCol();
+  for (size_t row_idx = 0; row_idx < row; ++row_idx) {
+    for (size_t col_idx = 0; col_idx < col; ++col_idx) {
+      size_t dst_index = index<tile_shape_out>(row_idx, col_idx);
+      size_t src_index = index<tile_shape_in>(row_idx, 0);
+      dst.data()[dst_index] = src.data()[src_index];
+    }
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in>
 void __vec__
 TExpandCol_RowImpl(typename tile_shape_out::TileDType __out__ dst,
@@ -74,4 +97,5 @@ void TEXPANDCOL_Impl(tile_shape_out &dst, tile_shape_in &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TExpandRow.hpp b/include/jcore/TExpandRow.hpp
index d37d634..a945426 100644
--- a/include/jcore/TExpandRow.hpp
+++ b/include/jcore/TExpandRow.hpp
@@ -5,6 +5,34 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TEXPANDROW_Impl(tile_shape_out &dst, tile_shape_in &src) {
+  static_assert((tile_shape_out::Cols == tile_shape_in::Cols) &&
+                    (tile_shape_out::ValidCol == tile_shape_in::ValidCol),
+                "Error! Output columns != Input columns");
+  static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout,
+                "Fractal layout is not supported");
+  static_assert(tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC &&
+                    tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_out::Loc != Location::Acc &&
+                    tile_shape_in::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+
+  size_t row = dst.GetValidRow();
+  size_t col = dst.GetValidCol();
+  for (size_t row_idx = 0; row_idx < row; ++row_idx) {
+    for (size_t col_idx = 0; col_idx < col; ++col_idx) {
+      size_t dst_index = index<tile_shape_out>(row_idx, col_idx);
+      size_t src_index = index<tile_shape_in>(0, col_idx);
+      dst.data()[dst_index] = src.data()[src_index];
+    }
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in>
 void __vec__
 TExpandRow_RowImpl(typename tile_shape_out::TileDType __out__ dst,
@@ -78,4 +106,5 @@ void TEXPANDROW_Impl(tile_shape_out &dst, tile_shape_in &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TExpandScalar.hpp b/include/jcore/TExpandScalar.hpp
index dd316c8..97cfb4a 100644
--- a/include/jcore/TExpandScalar.hpp
+++ b/include/jcore/TExpandScalar.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TEXPANDSCALAR_Impl(tile_shape &dst, typename tile_shape::DType s) {
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(!tile_shape::isBoxedLayout,
+                "TEXPANDSCALAR Linx smoke supports only unboxed tiles");
+
+  size_t row = dst.GetValidRow();
+  size_t col = dst.GetValidCol();
+
+  for (size_t row_idx = 0; row_idx < row; ++row_idx) {
+    for (size_t col_idx = 0; col_idx < col; ++col_idx) {
+      size_t tile_index = index<tile_shape>(row_idx, col_idx);
+      dst.data()[tile_index] = s;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 ExpandScalarImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
@@ -140,4 +159,4 @@ void TEXPANDSCALAR_Impl(tile_shape &dst, typename tile_shape::DType s) {
 }
 
 #endif
-
+#endif
diff --git a/include/jcore/TCopyIn.hpp b/include/jcore/TLoad.hpp
similarity index 84%
rename from include/jcore/TCopyIn.hpp
rename to include/jcore/TLoad.hpp
index 5c8a3fe..b96ec3f 100644
--- a/include/jcore/TCopyIn.hpp
+++ b/include/jcore/TLoad.hpp
@@ -1,14 +1,42 @@
-#ifndef TCOPYIN_HPP
-#define TCOPYIN_HPP
+#ifndef JCORE_TLOAD_HPP
+#define JCORE_TLOAD_HPP
 
 #include "common/pto_tile.hpp"
+#ifdef ENABLE_TENSOR_INSTR
 #include "template_asm.hpp"
+#endif
 
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
+void TLOAD_Impl(tile_shape &dst, gm_shape &src) {
+  size_t rows = dst.GetValidRow();
+  size_t cols = dst.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC tiles are not supported as input or output here");
+  static_assert(gm_shape::staticStride[0] == 1 &&
+                    gm_shape::staticStride[1] == 1,
+                "TODO: Support global tensors with more than 3 dimensions");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "Linx smoke TLOAD supports only unboxed tiles");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t gm_index = gm_shape::isRowMajor
+                            ? row * gm_shape::RowStride + col
+                            : col * gm_shape::ColStride + row;
+      size_t tile_index = tile_shape::isRowMajor
+                              ? row * tile_shape::RowStride + col
+                              : col * tile_shape::ColStride + row;
+      dst.data()[tile_index] = src.data()[gm_index];
+    }
+  }
+}
+#else
 // gm row major -> tile Nz
 template <typename tile_shape, typename gm_shape>
-void __mtc__ CopyInRow2NzImpl1D(typename tile_shape::TileDType __out__ dst,
+void __mtc__ LoadRow2NzImpl1D(typename tile_shape::TileDType __out__ dst,
                                 const typename gm_shape::DType __in__ *src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -34,7 +62,7 @@ void __mtc__ CopyInRow2NzImpl1D(typename tile_shape::TileDType __out__ dst,
 
 // gm col major -> tile Zn
 template <typename tile_shape, typename gm_shape>
-void __mtc__ CopyInCol2ZnImpl1D(typename tile_shape::TileDType __out__ dst,
+void __mtc__ LoadCol2ZnImpl1D(typename tile_shape::TileDType __out__ dst,
                                 const typename gm_shape::DType __in__ *src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -60,7 +88,7 @@ void __mtc__ CopyInCol2ZnImpl1D(typename tile_shape::TileDType __out__ dst,
 
 // gm row major -> tile Zn
 template <typename tile_shape, typename gm_shape>
-void __mtc__ CopyInRow2ZnImpl1D(typename tile_shape::TileDType __out__ dst,
+void __mtc__ LoadRow2ZnImpl1D(typename tile_shape::TileDType __out__ dst,
                                 const typename gm_shape::DType __in__ *src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -86,9 +114,9 @@ void __mtc__ CopyInRow2ZnImpl1D(typename tile_shape::TileDType __out__ dst,
 
 //no fractal
 template <typename tile_shape, typename gm_shape>
-void __mtc__ TCopyIn_Vec_ColMajor(typename tile_shape::TileDType __out__ dst,
+void __mtc__ TLoad_Vec_ColMajor(typename tile_shape::TileDType __out__ dst,
                                   const typename gm_shape::DType __in__ *src) {
- 
+
   size_t i = blkv_get_index_x();
   size_t j = blkv_get_index_y();
 
@@ -96,11 +124,11 @@ void __mtc__ TCopyIn_Vec_ColMajor(typename tile_shape::TileDType __out__ dst,
   size_t index_tile = j * tile_shape::ColStride + i;
   blkv_get_tile_ptr(dst)[index_tile] = src[index_gm];
 }
- 
+
 template <typename tile_shape, typename gm_shape>
-void __mtc__ TCopyIn_Vec_RowMajor(typename tile_shape::TileDType __out__ dst,
+void __mtc__ TLoad_Vec_RowMajor(typename tile_shape::TileDType __out__ dst,
                                   typename gm_shape::DType __in__ *src) {
- 
+
   size_t i = blkv_get_index_x();
   size_t j = blkv_get_index_y();
 
@@ -111,7 +139,7 @@ void __mtc__ TCopyIn_Vec_RowMajor(typename tile_shape::TileDType __out__ dst,
 
 // gm row major -> tile Nz
 template <typename tile_shape, typename gm_shape>
-void __mtc__ CopyInRow2NzImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst,
+void __mtc__ LoadRow2NzImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst,
                                         const typename gm_shape::DType __in__ *src,
                                         const size_t __in__ gm_row_stride) {
   static constexpr int inner_rows = tile_shape::InnerRows;
@@ -134,7 +162,7 @@ void __mtc__ CopyInRow2NzImpl2D_Dynamic(typename tile_shape::TileDType __out__ d
 
 // gm col major -> tile Zn
 template <typename tile_shape, typename gm_shape>
-void __mtc__ CopyInCol2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst,
+void __mtc__ LoadCol2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst,
                                         const typename gm_shape::DType __in__ *src,
                                         const size_t __in__ gm_col_stride) {
   static constexpr int inner_rows = tile_shape::InnerRows;
@@ -157,7 +185,7 @@ void __mtc__ CopyInCol2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ d
 
 // gm row major -> tile Zn
 template <typename tile_shape, typename gm_shape>
-void __mtc__ CopyInRow2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst,
+void __mtc__ LoadRow2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ dst,
                                         const typename gm_shape::DType __in__ *src,
                                         const size_t __in__ gm_row_stride) {
   static constexpr int inner_rows = tile_shape::InnerRows;
@@ -180,10 +208,10 @@ void __mtc__ CopyInRow2ZnImpl2D_Dynamic(typename tile_shape::TileDType __out__ d
 
 //no fractal
 template <typename tile_shape, typename gm_shape>
-void __mtc__ TCopyIn_Vec_ColMajor_Dynamic(typename tile_shape::TileDType __out__ dst,
+void __mtc__ TLoad_Vec_ColMajor_Dynamic(typename tile_shape::TileDType __out__ dst,
                                           const typename gm_shape::DType __in__ *src,
                                           const size_t __in__ gm_col_stride) {
- 
+
   size_t i = blkv_get_index_x();
   size_t j = blkv_get_index_y();
 
@@ -193,10 +221,10 @@ void __mtc__ TCopyIn_Vec_ColMajor_Dynamic(typename tile_shape::TileDType __out__
 }
 
 template <typename tile_shape, typename gm_shape>
-void __mtc__ TCopyIn_Vec_RowMajor_Dynamic(typename tile_shape::TileDType __out__ dst,
+void __mtc__ TLoad_Vec_RowMajor_Dynamic(typename tile_shape::TileDType __out__ dst,
                                           typename gm_shape::DType __in__ *src,
                                           const size_t __in__ gm_row_stride) {
- 
+
   size_t i = blkv_get_index_x();
   size_t j = blkv_get_index_y();
 
@@ -206,7 +234,7 @@ void __mtc__ TCopyIn_Vec_RowMajor_Dynamic(typename tile_shape::TileDType __out__
 }
 
 template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
-void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
+void _TLOAD_Impl(tile_shape &dst, gm_shape &src) {
   size_t tile_rows = dst.GetValidRow();
   size_t tile_cols = dst.GetValidCol();
   static_assert(tile_shape::Loc != Location::Acc, "Unsupport ACC to be input or output here");
@@ -290,7 +318,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
                 tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic
     if constexpr (is_Nz_layout<tile_shape>::value) { // Nz
       if constexpr (gm_shape::isRowMajor) {
-        CopyInRow2NzImpl2D_Dynamic<tile_shape, gm_shape>
+        LoadRow2NzImpl2D_Dynamic<tile_shape, gm_shape>
             <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data(), src.GetStride(3));
       } else {
         static_assert(gm_shape::isRowMajor,
@@ -298,16 +326,16 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
       }
     } else if constexpr (is_Zn_layout<tile_shape>::value) { //Zn
       if constexpr (!gm_shape::isRowMajor) {
-        CopyInCol2ZnImpl2D_Dynamic<tile_shape, gm_shape>
+        LoadCol2ZnImpl2D_Dynamic<tile_shape, gm_shape>
             <<<tile_rows, tile_cols, 1>>>(dst.data(), src.data(), src.GetStride(4));
       } else {
-        CopyInRow2ZnImpl2D_Dynamic<tile_shape, gm_shape>
+        LoadRow2ZnImpl2D_Dynamic<tile_shape, gm_shape>
             <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data(), src.GetStride(3));
       }
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (tile_shape::isRowMajor) {
         if constexpr (gm_shape::isRowMajor) {
-          TCopyIn_Vec_RowMajor_Dynamic<tile_shape, gm_shape>
+          TLoad_Vec_RowMajor_Dynamic<tile_shape, gm_shape>
               <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data(), src.GetStride(3));
         } else {
           static_assert(gm_shape::isRowMajor,
@@ -315,7 +343,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
         }
       } else if constexpr (!tile_shape::isRowMajor) {
         if constexpr (!gm_shape::isRowMajor) {
-          TCopyIn_Vec_ColMajor_Dynamic<tile_shape, gm_shape>
+          TLoad_Vec_ColMajor_Dynamic<tile_shape, gm_shape>
               <<<tile_rows, tile_cols, 1>>>(dst.data(), src.data(), src.GetStride(4));
         } else {
           static_assert(!gm_shape::isRowMajor,
@@ -329,7 +357,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
   } else { // static
     if constexpr (is_Nz_layout<tile_shape>::value) { // Nz
       if constexpr (gm_shape::isRowMajor) {
-        CopyInRow2NzImpl1D<tile_shape, gm_shape>
+        LoadRow2NzImpl1D<tile_shape, gm_shape>
             <<<tile_cols, 1, 1>>>(dst.data(), src.data());
       } else {
         static_assert(gm_shape::isRowMajor,
@@ -337,16 +365,16 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
       }
     } else if constexpr (is_Zn_layout<tile_shape>::value) { //Zn
       if constexpr (!gm_shape::isRowMajor) {
-        CopyInCol2ZnImpl1D<tile_shape, gm_shape>
+        LoadCol2ZnImpl1D<tile_shape, gm_shape>
             <<<tile_rows, 1, 1>>>(dst.data(), src.data());
       } else {
-        CopyInRow2ZnImpl1D<tile_shape, gm_shape>
+        LoadRow2ZnImpl1D<tile_shape, gm_shape>
             <<<tile_cols, 1, 1>>>(dst.data(), src.data());
       }
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (tile_shape::isRowMajor) {
         if constexpr (gm_shape::isRowMajor) {
-          TCopyIn_Vec_RowMajor<tile_shape, gm_shape>
+          TLoad_Vec_RowMajor<tile_shape, gm_shape>
               <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data());
         } else {
           static_assert(gm_shape::isRowMajor,
@@ -354,7 +382,7 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
         }
       } else if constexpr (!tile_shape::isRowMajor) {
         if constexpr (!gm_shape::isRowMajor) {
-          TCopyIn_Vec_ColMajor<tile_shape, gm_shape>
+          TLoad_Vec_ColMajor<tile_shape, gm_shape>
               <<<tile_rows, tile_cols, 1>>>(dst.data(), src.data());
         } else {
           static_assert(!gm_shape::isRowMajor,
@@ -367,19 +395,19 @@ void _TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
     }
   }
 
-  
+
 #endif
 }
 
 template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
-void TCOPYIN_2LVL(tile_shape &dst, gm_shape &src){
+void TLOAD_2LVL(tile_shape &dst, gm_shape &src){
   using tile_tmp = Tile<Location::Vec, typename gm_shape::DType, tile_shape::Rows, tile_shape::Cols, gm_shape::isRowMajor? BLayout::RowMajor:BLayout::ColMajor, tile_shape::ValidRow, tile_shape::ValidCol>;
   tile_tmp tmp;
-  _TCOPYIN_Impl(tmp, src);
+  _TLOAD_Impl(tmp, src);
   if constexpr(gm_shape::isRowMajor && is_Nz_layout<tile_shape>::value){
-    TCVT_ND2NZ(dst, tmp);    
+    TCVT_ND2NZ(dst, tmp);
   }else if constexpr(gm_shape::isRowMajor && is_Zn_layout<tile_shape>::value){
-    TCVT_ND2ZN(dst, tmp); 
+    TCVT_ND2ZN(dst, tmp);
   }else if constexpr(!gm_shape::isRowMajor && is_Nz_layout<tile_shape>::value){
     TCVT_DN2NZ(dst, tmp);
   }else if constexpr(!gm_shape::isRowMajor && is_Zn_layout<tile_shape>::value){
@@ -388,12 +416,13 @@ void TCOPYIN_2LVL(tile_shape &dst, gm_shape &src){
 }
 
 template <is_tile_data_v tile_shape, is_global_data_v gm_shape>
-void TCOPYIN_Impl(tile_shape &dst, gm_shape &src) {
+void TLOAD_Impl(tile_shape &dst, gm_shape &src) {
   #ifdef RUMINATE
-  TCOPYIN_2LVL(dst, src);
+  TLOAD_2LVL(dst, src);
   #else
-  _TCOPYIN_Impl(dst, src);
+  _TLOAD_Impl(dst, src);
   #endif
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TMax.hpp b/include/jcore/TMax.hpp
index 2739f0f..3d9415d 100644
--- a/include/jcore/TMax.hpp
+++ b/include/jcore/TMax.hpp
@@ -5,6 +5,27 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TMAX_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  size_t rows = src0.GetValidRow();
+  size_t cols = src0.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TMAX not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      auto src0_value = src0.data()[index];
+      auto src1_value = src1.data()[index];
+      dst.data()[index] = src0_value > src1_value ? src0_value : src1_value;
+    }
+  }
+}
+#else
 
 template <typename tile_shape>
 void __vec__
@@ -68,5 +89,6 @@ void TMAX_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TMaxs.hpp b/include/jcore/TMaxs.hpp
index 9cec5aa..6d7e3dc 100644
--- a/include/jcore/TMaxs.hpp
+++ b/include/jcore/TMaxs.hpp
@@ -5,6 +5,26 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TMAXS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TMAXS not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      auto src_value = src.data()[index];
+      dst.data()[index] = src_value > s ? src_value : s;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TMaxsImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
                                 const typename tile_shape::TileDType __in__ src,
@@ -64,5 +84,6 @@ void TMAXS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s)
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TMul.hpp b/include/jcore/TMul.hpp
index c059c62..dd989e1 100644
--- a/include/jcore/TMul.hpp
+++ b/include/jcore/TMul.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TMUL_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  size_t rows = src0.GetValidRow();
+  size_t cols = src0.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TMUL not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src0.data()[index] * src1.data()[index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 TmulImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
@@ -67,5 +86,6 @@ void TMUL_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TMuls.hpp b/include/jcore/TMuls.hpp
index 3d01439..af125e6 100644
--- a/include/jcore/TMuls.hpp
+++ b/include/jcore/TMuls.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TMULS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TMULS not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src.data()[index] * s;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TMulsImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
                                 const typename tile_shape::TileDType __in__ src,
@@ -62,5 +81,6 @@ void TMULS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s)
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TOr.hpp b/include/jcore/TOr.hpp
index a7eee26..f7e5854 100644
--- a/include/jcore/TOr.hpp
+++ b/include/jcore/TOr.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TOR_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  size_t rows = src0.GetValidRow();
+  size_t cols = src0.GetValidCol();
+  static_assert(tile_shape::Loc == Location::Vec,
+                "Only VEC tile type are supported");
+  static_assert(!tile_shape::isBoxedLayout, "TOR not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src0.data()[index] | src1.data()[index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TOr_Vec_RowMajor(
   typename tile_shape::TileDType __out__ dst,
@@ -70,5 +89,6 @@ void TOR_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Only int data type are supported");
   }
 }
+#endif
 
 #endif
diff --git a/include/jcore/TPad.hpp b/include/jcore/TPad.hpp
index 9ffd31c..3a6af6a 100644
--- a/include/jcore/TPad.hpp
+++ b/include/jcore/TPad.hpp
@@ -3,9 +3,61 @@
 
 #include "common/pto_tile.hpp"
 #include "jcore/constants.hpp"
+#ifndef __linx
 #include <assert.h>
+#endif
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in,
+          typename T>
+void TPAD_Impl(tile_shape_out &dst, const tile_shape_in &src, T pad_value,
+               size_t up_pad, size_t left_pad, size_t down_pad,
+               size_t right_pad) {
+  static_assert(!tile_shape_out::isBoxedLayout &&
+                    !tile_shape_in::isBoxedLayout,
+                "Not support Boxed Layout!");
+  static_assert(tile_shape_out::Loc == Location::Vec &&
+                    tile_shape_in::Loc == Location::Vec,
+                "Only VEC tile type are supported");
+  static_assert(tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC &&
+                    tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_out::ValidRow >= tile_shape_in::ValidRow &&
+                    tile_shape_out::ValidCol >= tile_shape_in::ValidCol,
+                "Dst must cover src shape!");
+
+  size_t src_valid_row = src.GetValidRow();
+  size_t src_valid_col = src.GetValidCol();
+  size_t dst_valid_row = dst.GetValidRow();
+  size_t dst_valid_col = dst.GetValidCol();
+  size_t after_pad_row = up_pad + src_valid_row + down_pad;
+  size_t after_pad_col = left_pad + src_valid_col + right_pad;
+
+  if (after_pad_row > dst_valid_row || after_pad_col > dst_valid_col) {
+    return;  // padded extent exceeds dst: leave dst untouched (<assert.h> is not included under __linx)
+  }
+
+  for (size_t row = 0; row < dst_valid_row; ++row) {
+    for (size_t col = 0; col < dst_valid_col; ++col) {
+      size_t dst_index = index<tile_shape_out>(row, col);
+      bool in_src_range = row >= up_pad && row < up_pad + src_valid_row &&
+                          col >= left_pad && col < left_pad + src_valid_col;
+      if (in_src_range) {
+        size_t src_row = row - up_pad;
+        size_t src_col = col - left_pad;
+        size_t src_index = index<tile_shape_in>(src_row, src_col);
+        dst.data()[dst_index] = src.data()[src_index];
+      } else {
+        dst.data()[dst_index] =
+            static_cast<typename tile_shape_out::DType>(pad_value);
+      }
+    }
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in, typename T>
 void __vec__ TPad_Vec_RowMajor(typename tile_shape_out::TileDType __out__ dst,
                             const typename tile_shape_in::TileDType __in__ src, const T __in__ pad_value,
@@ -101,4 +153,5 @@ void TPAD_Impl(tile_shape_out &dst, const tile_shape_in &src,
                     "Storage layout type not supported");
   }
 }
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TRecip.hpp b/include/jcore/TRecip.hpp
index 9680437..43fb3fd 100644
--- a/include/jcore/TRecip.hpp
+++ b/include/jcore/TRecip.hpp
@@ -5,6 +5,27 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TRECIP_Impl(tile_shape &dst, tile_shape &src) {
+  static constexpr size_t row = tile_shape::ValidRow;
+  static constexpr size_t col = tile_shape::ValidCol;
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TRECIP not support Boxed Layout!");
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      size_t tile_index = index<tile_shape>(i, j);
+      dst.data()[tile_index] =
+          static_cast<typename tile_shape::DType>(1) / src.data()[tile_index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TRecip_RowMajor(typename tile_shape::TileDType __out__ dst,
                              const typename tile_shape::TileDType __in__ src) {
@@ -62,4 +83,5 @@ void TRECIP_Impl(tile_shape &dst, tile_shape &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TRem.hpp b/include/jcore/TRem.hpp
index be2ef5e..38789bc 100644
--- a/include/jcore/TRem.hpp
+++ b/include/jcore/TRem.hpp
@@ -5,6 +5,29 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TREM_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  static constexpr size_t row = tile_shape::ValidRow;
+  static constexpr size_t col = tile_shape::ValidCol;
+  static_assert(tile_shape::Loc == Location::Vec,
+                "Only VEC tile type are supported");
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(std::is_same<typename tile_shape::DType, int32_t>::value ||
+                    std::is_same<typename tile_shape::DType, int16_t>::value,
+                "Data type not supported");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TREM not support Boxed Layout!");
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      size_t tile_index = index<tile_shape>(i, j);
+      dst.data()[tile_index] = src0.data()[tile_index] % src1.data()[tile_index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 TRemImpl_RowMajor(typename tile_shape::TileDType __out__ dst,
@@ -71,5 +94,6 @@ void TREM_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Storage type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TReshape.hpp b/include/jcore/TReshape.hpp
index 64aff12..782a1ae 100644
--- a/include/jcore/TReshape.hpp
+++ b/include/jcore/TReshape.hpp
@@ -5,6 +5,28 @@
 
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TRESHAPE_Impl(tile_shape_out &tile_out, tile_shape_in &tile_in) {
+  static_assert(tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC &&
+                    tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_out::Loc != Location::Acc &&
+                    tile_shape_in::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape_out::isBoxedLayout &&
+                    !tile_shape_in::isBoxedLayout,
+                "TRESHAPE not support Boxed Layout!");
+  static_assert(tile_shape_out::Numel == tile_shape_in::Numel,
+                "TRESHAPE requires equal tile element counts");
+
+  for (size_t i = 0; i < tile_shape_in::Numel; ++i) {
+    tile_out.data()[i] = tile_in.data()[i];
+  }
+}
+#else
 template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
 void TRESHAPE_Impl(tile_shape_out &tile_out, tile_shape_in &tile_in) {
   static_assert(tile_shape_in::ValidRow != DYNAMIC && tile_shape_in::ValidCol != DYNAMIC &&
@@ -14,5 +36,6 @@ void TRESHAPE_Impl(tile_shape_out &tile_out, tile_shape_in &tile_in) {
               "Unsupport ACC to be input or output here");
   tile_out.data() = tile_in.data();
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TRowMax.hpp b/include/jcore/TRowMax.hpp
index 917af27..dc46afd 100644
--- a/include/jcore/TRowMax.hpp
+++ b/include/jcore/TRowMax.hpp
@@ -5,6 +5,36 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TROWMAX_Impl(tile_shape_out &dst, tile_shape_in &src) {
+  static_assert(tile_shape_in::Rows == tile_shape_out::Rows,
+                "Error! Input row != Output row.");
+  static_assert(tile_shape_out::ValidCol == 1, "valid column must be 1.");
+  static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout,
+                "Not support Fractal layout");
+  static_assert(tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC &&
+                    tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_out::Loc != Location::Acc &&
+                    tile_shape_in::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  for (size_t row = 0; row < rows; ++row) {
+    typename tile_shape_in::DType max_val =
+        src.data()[index<tile_shape_in>(row, 0)];
+    for (size_t col = 1; col < cols; ++col) {
+      auto now_val = src.data()[index<tile_shape_in>(row, col)];
+      max_val = max_val > now_val ? max_val : now_val;
+    }
+    dst.data()[index<tile_shape_out>(row, 0)] = max_val;
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in>
 void __vec__
 TRowMax_NoFractal_Impl(typename tile_shape_out::TileDType __out__ dst,
@@ -105,4 +135,5 @@ void TROWMAX_Impl(tile_shape_out &dst, tile_shape_in &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TRowMaxExpand.hpp b/include/jcore/TRowMaxExpand.hpp
index 7c1dbd0..b552b5a 100644
--- a/include/jcore/TRowMaxExpand.hpp
+++ b/include/jcore/TRowMaxExpand.hpp
@@ -5,6 +5,31 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+// ROWMAX + EXPAND
+template <is_tile_data_v tile_shape>
+void TROWMAXEXPAND_Impl(tile_shape &dst, tile_shape &src) {
+  static_assert(tile_shape::ValidRow != DYNAMIC &&
+                    tile_shape::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "Not support Fractal layout");
+
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  for (size_t row = 0; row < rows; ++row) {
+    typename tile_shape::DType max_val = src.data()[index<tile_shape>(row, 0)];
+    for (size_t col = 1; col < cols; ++col) {
+      auto now_val = src.data()[index<tile_shape>(row, col)];
+      max_val = max_val > now_val ? max_val : now_val;
+    }
+    for (size_t col = 0; col < cols; ++col) {
+      dst.data()[index<tile_shape>(row, col)] = max_val;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 TRowMaxExpand_NoFractal_Impl(typename tile_shape::TileDType __out__ dst,
@@ -77,3 +102,4 @@ void TROWMAXEXPAND_Impl(tile_shape &dst, tile_shape &src) {
   }
 }
 #endif
+#endif
diff --git a/include/jcore/TRowSum.hpp b/include/jcore/TRowSum.hpp
index c6ebb8c..c49ee8c 100644
--- a/include/jcore/TRowSum.hpp
+++ b/include/jcore/TRowSum.hpp
@@ -5,6 +5,34 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TROWSUM_Impl(tile_shape_out &dst, tile_shape_in &src) {
+  static_assert(tile_shape_in::Rows == tile_shape_out::Rows,
+                "Error! Input row != Output row.");
+  static_assert(tile_shape_out::ValidCol == 1, "valid column must be 1.");
+  static_assert(!tile_shape_out::isBoxedLayout && !tile_shape_in::isBoxedLayout,
+                "Not support Fractal layout");
+  static_assert(tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC &&
+                    tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_out::Loc != Location::Acc &&
+                    tile_shape_in::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  for (size_t row = 0; row < rows; ++row) {
+    typename tile_shape_in::DType sum = src.data()[index<tile_shape_in>(row, 0)];
+    for (size_t col = 1; col < cols; ++col) {
+      sum += src.data()[index<tile_shape_in>(row, col)];
+    }
+    dst.data()[index<tile_shape_out>(row, 0)] = sum;
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in>
 void __vec__
 TRowSum_NoFractal_Impl(typename tile_shape_out::TileDType __out__ dst,
@@ -107,4 +135,5 @@ void TROWSUM_Impl(tile_shape_out &dst, tile_shape_in &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TRowSumExpand.hpp b/include/jcore/TRowSumExpand.hpp
index 1be1bef..710db56 100644
--- a/include/jcore/TRowSumExpand.hpp
+++ b/include/jcore/TRowSumExpand.hpp
@@ -5,6 +5,30 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+// ROWSUM + EXPAND
+template <is_tile_data_v tile_shape>
+void TROWSUMEXPAND_Impl(tile_shape &dst, tile_shape &src) {
+  static_assert(tile_shape::ValidRow != DYNAMIC &&
+                    tile_shape::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "Not support Fractal layout");
+
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  for (size_t row = 0; row < rows; ++row) {
+    typename tile_shape::DType sum = src.data()[index<tile_shape>(row, 0)];
+    for (size_t col = 1; col < cols; ++col) {
+      sum += src.data()[index<tile_shape>(row, col)];
+    }
+    for (size_t col = 0; col < cols; ++col) {
+      dst.data()[index<tile_shape>(row, col)] = sum;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__
 TRowSumExpand_NoFractal_Impl(typename tile_shape::TileDType __out__ dst,
@@ -77,3 +101,4 @@ void TROWSUMEXPAND_Impl(tile_shape &dst, tile_shape &src) {
 }
 
 #endif
+#endif
diff --git a/include/jcore/TSqrt.hpp b/include/jcore/TSqrt.hpp
index dd8db9d..c819d7c 100644
--- a/include/jcore/TSqrt.hpp
+++ b/include/jcore/TSqrt.hpp
@@ -5,6 +5,49 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <typename T>
+T linx_tile_isqrt(T value) {
+  T root = 0;  // exact for 0 <= value <= 255 when T can hold 225; saturates at 15 above
+  root += (value >= static_cast<T>(1)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(4)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(9)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(16)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(25)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(36)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(49)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(64)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(81)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(100)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(121)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(144)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(169)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(196)) ? static_cast<T>(1) : static_cast<T>(0);
+  root += (value >= static_cast<T>(225)) ? static_cast<T>(1) : static_cast<T>(0);
+  return root;
+}
+
+template <is_tile_data_v tile_shape>
+void TSQRT_Impl(tile_shape &dst, tile_shape &src) {
+  static constexpr size_t row = tile_shape::ValidRow;
+  static constexpr size_t col = tile_shape::ValidCol;
+  static_assert(row != DYNAMIC && col != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TSQRT not support Boxed Layout!");
+  static_assert(std::is_integral<typename tile_shape::DType>::value,
+                "Linx direct TSQRT supports integral smoke types only");
+
+  for (size_t i = 0; i < row; ++i) {
+    for (size_t j = 0; j < col; ++j) {
+      size_t tile_index = index<tile_shape>(i, j);
+      dst.data()[tile_index] = linx_tile_isqrt(src.data()[tile_index]);
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TSqrt_RowMajor(typename tile_shape::TileDType __out__ dst,
                             const typename tile_shape::TileDType __in__ src) {
@@ -74,4 +117,5 @@ void TSQRT_Impl(tile_shape &dst, tile_shape &src) {
   }
 }
 
-#endif
\ No newline at end of file
+#endif
+#endif
diff --git a/include/jcore/TCopyOut.hpp b/include/jcore/TStore.hpp
similarity index 82%
rename from include/jcore/TCopyOut.hpp
rename to include/jcore/TStore.hpp
index cd8725c..f6bfbd5 100644
--- a/include/jcore/TCopyOut.hpp
+++ b/include/jcore/TStore.hpp
@@ -1,13 +1,36 @@
-#ifndef TCOPYOUT_HPP
-#define TCOPYOUT_HPP
+#ifndef JCORE_TSTORE_HPP
+#define JCORE_TSTORE_HPP
 
 #include "common/pto_tile.hpp"
 
 using namespace pto;
 
+#ifdef __linx
+template <is_global_data_v gm_shape, is_tile_data_v tile_shape>
+void TSTORE_Impl(gm_shape &dst, tile_shape &src) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "Unsupport ACC to be input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "Linx smoke TSTORE supports only unboxed tiles");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t gm_index = gm_shape::isRowMajor
+                            ? row * gm_shape::RowStride + col
+                            : col * gm_shape::ColStride + row;
+      size_t tile_index = tile_shape::isRowMajor
+                              ? row * tile_shape::RowStride + col
+                              : col * tile_shape::ColStride + row;
+      dst.data()[gm_index] = src.data()[tile_index];
+    }
+  }
+}
+#else
 // cube left -> gm row major
 template <typename gm_shape, typename tile_shape>
-void __mtc__ CopyOut2NzImpl1D(typename gm_shape::DType __out__ *dst,
+void __mtc__ Store2NzImpl1D(typename gm_shape::DType __out__ *dst,
                               const typename tile_shape::TileDType __in__ src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -32,7 +55,7 @@ void __mtc__ CopyOut2NzImpl1D(typename gm_shape::DType __out__ *dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void __mtc__ CopyOut2ZnImpl1D(typename gm_shape::DType __out__ *dst,
+void __mtc__ Store2ZnImpl1D(typename gm_shape::DType __out__ *dst,
                               const typename tile_shape::TileDType __in__ src) {
   static constexpr int inner_rows = tile_shape::InnerRows;
   static constexpr int inner_cols = tile_shape::InnerCols;
@@ -59,7 +82,7 @@ void __mtc__ CopyOut2ZnImpl1D(typename gm_shape::DType __out__ *dst,
 //no fractal
 template <typename gm_shape, typename tile_shape>
 void __mtc__
-TCopyOut_Vec_ColMajor(typename gm_shape::DType __out__ *dst,
+TStore_Vec_ColMajor(typename gm_shape::DType __out__ *dst,
                       const typename tile_shape::TileDType __in__ src) {
   size_t i = blkv_get_index_x();
   size_t j = blkv_get_index_y();
@@ -68,10 +91,10 @@ TCopyOut_Vec_ColMajor(typename gm_shape::DType __out__ *dst,
   size_t index_tile = j * tile_shape::ColStride + i;
   dst[index_gm] = blkv_get_tile_ptr(src)[index_tile];
 }
- 
+
 template <typename gm_shape, typename tile_shape>
 void __mtc__
-TCopyOut_Vec_RowMajor(typename gm_shape::DType __out__ *dst,
+TStore_Vec_RowMajor(typename gm_shape::DType __out__ *dst,
                       typename tile_shape::TileDType __in__ src) {
   size_t i = blkv_get_index_x();
   size_t j = blkv_get_index_y();
@@ -83,7 +106,7 @@ TCopyOut_Vec_RowMajor(typename gm_shape::DType __out__ *dst,
 
 // cube left -> gm row major
 template <typename gm_shape, typename tile_shape>
-void __mtc__ CopyOut2NzImpl2D_Dynamic(typename gm_shape::DType __out__ *dst,
+void __mtc__ Store2NzImpl2D_Dynamic(typename gm_shape::DType __out__ *dst,
                                       const typename tile_shape::TileDType __in__ src,
                                       const size_t __in__ gm_row_stride) {
   static constexpr int inner_rows = tile_shape::InnerRows;
@@ -105,7 +128,7 @@ void __mtc__ CopyOut2NzImpl2D_Dynamic(typename gm_shape::DType __out__ *dst,
 }
 
 template <typename gm_shape, typename tile_shape>
-void __mtc__ CopyOut2ZnImpl2D_Dynamic(typename gm_shape::DType __out__ *dst,
+void __mtc__ Store2ZnImpl2D_Dynamic(typename gm_shape::DType __out__ *dst,
                                       const typename tile_shape::TileDType __in__ src,
                                       const size_t __in__ gm_row_stride) {
   static constexpr int inner_rows = tile_shape::InnerRows;
@@ -128,7 +151,7 @@ void __mtc__ CopyOut2ZnImpl2D_Dynamic(typename gm_shape::DType __out__ *dst,
 
 //no fractal
 template <typename gm_shape, typename tile_shape>
-void __mtc__ TCopyOut_Vec_ColMajor_Dynamic(typename gm_shape::DType __out__ *dst,
+void __mtc__ TStore_Vec_ColMajor_Dynamic(typename gm_shape::DType __out__ *dst,
                                            const typename tile_shape::TileDType __in__ src,
                                            const size_t __in__ gm_col_stride) {
   size_t i = blkv_get_index_x();
@@ -138,9 +161,9 @@ void __mtc__ TCopyOut_Vec_ColMajor_Dynamic(typename gm_shape::DType __out__ *dst
   size_t index_tile = j * tile_shape::ColStride + i;
   dst[index_gm] = blkv_get_tile_ptr(src)[index_tile];
 }
- 
+
 template <typename gm_shape, typename tile_shape>
-void __mtc__ TCopyOut_Vec_RowMajor_Dynamic(typename gm_shape::DType __out__ *dst,
+void __mtc__ TStore_Vec_RowMajor_Dynamic(typename gm_shape::DType __out__ *dst,
                                            typename tile_shape::TileDType __in__ src,
                                            const size_t __in__ gm_row_stride) {
   size_t i = blkv_get_index_x();
@@ -152,7 +175,7 @@ void __mtc__ TCopyOut_Vec_RowMajor_Dynamic(typename gm_shape::DType __out__ *dst
 }
 
 template <is_global_data_v gm_shape, is_tile_data_v tile_shape>
-void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
+void TSTORE_Impl(gm_shape &dst, tile_shape &src) {
   size_t tile_rows = src.GetValidRow();
   size_t tile_cols = src.GetValidCol();
   static_assert(tile_shape::Loc != Location::Acc, "ACC is not supported as input or output here");
@@ -203,7 +226,7 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
       }
     }
   } else {
-    static_assert(tile_shape::isBoxedLayout == false, 
+    static_assert(tile_shape::isBoxedLayout == false,
                   "Storage layout type not supported");
   }
 
@@ -211,15 +234,15 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
   if constexpr (gm_shape::RowStride == DYNAMIC || gm_shape::ColStride == DYNAMIC ||
                 tile_shape::ValidRow == DYNAMIC || tile_shape::ValidCol == DYNAMIC) { // dynamic
     if constexpr (is_Nz_layout<tile_shape>::value) { // Nz
-      CopyOut2NzImpl2D_Dynamic<gm_shape, tile_shape>
+      Store2NzImpl2D_Dynamic<gm_shape, tile_shape>
           <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data(), dst.GetStride(3));
     } else if constexpr (is_Zn_layout<tile_shape>::value) { // Zn
-      CopyOut2ZnImpl2D_Dynamic<gm_shape, tile_shape>
+      Store2ZnImpl2D_Dynamic<gm_shape, tile_shape>
           <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data(), dst.GetStride(3));
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (tile_shape::isRowMajor) {
         if constexpr (gm_shape::isRowMajor) {
-          TCopyOut_Vec_RowMajor_Dynamic<gm_shape, tile_shape>
+          TStore_Vec_RowMajor_Dynamic<gm_shape, tile_shape>
               <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data(), dst.GetStride(3));
         } else {
           static_assert(gm_shape::isRowMajor,
@@ -227,7 +250,7 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
         }
       } else if constexpr (!tile_shape::isRowMajor) {
         if constexpr (!gm_shape::isRowMajor) {
-          TCopyOut_Vec_ColMajor_Dynamic<gm_shape, tile_shape>
+          TStore_Vec_ColMajor_Dynamic<gm_shape, tile_shape>
               <<<tile_rows, tile_cols, 1>>>(dst.data(), src.data(), dst.GetStride(4));
         } else {
           static_assert(!gm_shape::isRowMajor,
@@ -235,20 +258,20 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
         }
       }
     } else {
-      static_assert(tile_shape::isBoxedLayout == false, 
+      static_assert(tile_shape::isBoxedLayout == false,
                     "Storage layout type not supported");
     }
   } else { // static
     if constexpr (is_Nz_layout<tile_shape>::value) { // Nz
-      CopyOut2NzImpl1D<gm_shape, tile_shape>
+      Store2NzImpl1D<gm_shape, tile_shape>
           <<<tile_cols, 1, 1>>>(dst.data(), src.data());
     } else if constexpr (is_Zn_layout<tile_shape>::value) { // Zn
-      CopyOut2ZnImpl1D<gm_shape, tile_shape>
+      Store2ZnImpl1D<gm_shape, tile_shape>
           <<<tile_cols, 1, 1>>>(dst.data(), src.data());
     } else if constexpr (tile_shape::isBoxedLayout == false) {
       if constexpr (tile_shape::isRowMajor) {
         if constexpr (gm_shape::isRowMajor) {
-          TCopyOut_Vec_RowMajor<gm_shape, tile_shape>
+          TStore_Vec_RowMajor<gm_shape, tile_shape>
               <<<tile_cols, tile_rows, 1>>>(dst.data(), src.data());
         } else {
           static_assert(gm_shape::isRowMajor,
@@ -256,7 +279,7 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
         }
       } else if constexpr (!tile_shape::isRowMajor) {
         if constexpr (!gm_shape::isRowMajor) {
-          TCopyOut_Vec_ColMajor<gm_shape, tile_shape>
+          TStore_Vec_ColMajor<gm_shape, tile_shape>
               <<<tile_rows, tile_cols, 1>>>(dst.data(), src.data());
         } else {
           static_assert(!gm_shape::isRowMajor,
@@ -264,13 +287,14 @@ void TCOPYOUT_Impl(gm_shape &dst, tile_shape &src) {
         }
       }
     } else {
-      static_assert(tile_shape::isBoxedLayout == false, 
+      static_assert(tile_shape::isBoxedLayout == false,
                     "Storage layout type not supported");
     }
   }
 
-  
+
 #endif
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TSub.hpp b/include/jcore/TSub.hpp
index 5c9ee29..8689e63 100644
--- a/include/jcore/TSub.hpp
+++ b/include/jcore/TSub.hpp
@@ -5,6 +5,26 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TSUB_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
+  size_t rows = src0.GetValidRow();
+  size_t cols = src0.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC is not supported as input or output here");
+  static_assert(tile_shape::isBoxedLayout == false,
+                "TSUB does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src0.data()[index] - src1.data()[index];
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ Tsub_RowMajor(typename tile_shape::TileDType __out__ dst,
                            const typename tile_shape::TileDType __in__ src0,
@@ -71,5 +91,6 @@ void TSUB_Impl(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TSubs.hpp b/include/jcore/TSubs.hpp
index 90f462b..daa151f 100644
--- a/include/jcore/TSubs.hpp
+++ b/include/jcore/TSubs.hpp
@@ -5,6 +5,25 @@
 #include "jcore/constants.hpp"
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape>
+void TSUBS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s) {
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  static_assert(tile_shape::Loc != Location::Acc,
+                "ACC is not supported as input or output here");
+  static_assert(!tile_shape::isBoxedLayout, "TSUBS does not support Boxed Layout!");
+
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t index = tile_shape::isRowMajor
+                         ? row * tile_shape::RowStride + col
+                         : col * tile_shape::ColStride + row;
+      dst.data()[index] = src.data()[index] - s;
+    }
+  }
+}
+#else
 template <typename tile_shape>
 void __vec__ TSubs_RowMajor(typename tile_shape::TileDType __out__ dst,
                             const typename tile_shape::TileDType __in__ src,
@@ -67,5 +86,6 @@ void TSUBS_Impl(tile_shape &dst, tile_shape &src, typename tile_shape::DType s)
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/TTrans.hpp b/include/jcore/TTrans.hpp
index 3933340..9df15ef 100644
--- a/include/jcore/TTrans.hpp
+++ b/include/jcore/TTrans.hpp
@@ -6,6 +6,40 @@
 
 using namespace pto;
 
+#ifdef __linx
+template <is_tile_data_v tile_shape_out, is_tile_data_v tile_shape_in>
+void TTRANS_Impl(tile_shape_out &dst, tile_shape_in &src) {
+  static_assert(
+      tile_shape_in::Rows == tile_shape_out::Cols &&
+          tile_shape_in::Cols == tile_shape_out::Rows,
+      "Error! Input rows != Output Columns or Input Columns != Output rows");
+  static_assert(tile_shape_in::ValidRow != DYNAMIC &&
+                    tile_shape_in::ValidCol != DYNAMIC &&
+                    tile_shape_out::ValidRow != DYNAMIC &&
+                    tile_shape_out::ValidCol != DYNAMIC,
+                "TODO: Support tile dynamic shape!");
+  static_assert(tile_shape_out::Loc != Location::Acc &&
+                    tile_shape_in::Loc != Location::Acc,
+                "ACC is not supported as input or output here");
+  static_assert(tile_shape_out::isBoxedLayout == false &&
+                    tile_shape_in::isBoxedLayout == false,
+                "Storage layout type not supported");
+
+  size_t rows = src.GetValidRow();
+  size_t cols = src.GetValidCol();
+  for (size_t row = 0; row < rows; ++row) {
+    for (size_t col = 0; col < cols; ++col) {
+      size_t src_index = tile_shape_in::isRowMajor
+                             ? row * tile_shape_in::RowStride + col
+                             : col * tile_shape_in::ColStride + row;
+      size_t dst_index = tile_shape_out::isRowMajor
+                             ? col * tile_shape_out::RowStride + row
+                             : row * tile_shape_out::ColStride + col;
+      dst.data()[dst_index] = src.data()[src_index];
+    }
+  }
+}
+#else
 template <typename tile_shape_out, typename tile_shape_in>
 void __vec__
 TTrans_RowMajor(typename tile_shape_out::TileDType __out__ dst,
@@ -96,5 +130,6 @@ void TTRANS_Impl(tile_shape_out &dst, tile_shape_in &src) {
                   "Storage layout type not supported");
   }
 }
+#endif
 
-#endif
\ No newline at end of file
+#endif
diff --git a/include/jcore/type.hpp b/include/jcore/type.hpp
index 2165a2c..bde7cdc 100644
--- a/include/jcore/type.hpp
+++ b/include/jcore/type.hpp
@@ -1,8 +1,76 @@
 #ifndef _INCLUDE_JCORE_TYPE_H_
 #define _INCLUDE_JCORE_TYPE_H_
 
-#include <type_traits>
 #include <cstddef>
+#include <stdint.h>
+#include <type_traits>
+
+#ifdef __linx
+#ifndef tile_size
+#define tile_size(N) __attribute__((tile_size(N)))
+#endif
+#ifndef __in__
+#define __in__
+#endif
+#ifndef __out__
+#define __out__
+#endif
+#ifndef __vec__
+#define __vec__
+#endif
+#ifndef __mtc__
+#define __mtc__
+#endif
+
+struct __fp32 {
+  float value;
+  constexpr __fp32(float v = 0.0f) : value(v) {}
+  constexpr operator float() const { return value; }
+};
+struct __tf32 {
+  uint32_t bits;
+};
+struct __hf32 {
+  uint32_t bits;
+};
+struct __half {
+  uint16_t bits;
+  constexpr __half(float = 0.0f) : bits(0) {}
+};
+struct __hif8 {
+  uint8_t bits;
+};
+struct __fp8_e4m3 {
+  uint8_t bits;
+};
+struct __fp8_e5m2 {
+  uint8_t bits;
+};
+struct __fp6_e3m2 {
+  uint8_t bits;
+};
+struct __fp6_e2m3 {
+  uint8_t bits;
+};
+struct __fp4_e2m1x2 {
+  uint8_t bits;
+};
+struct __fp4_e1m2x2 {
+  uint8_t bits;
+};
+struct __fp8_e8m0 {
+  uint8_t bits;
+};
+struct __fp4_hif4x2 {
+  uint8_t bits;
+};
+struct __int4x2 {
+  uint8_t bits;
+};
+struct __uint4x2 {
+  uint8_t bits;
+};
+#endif
 
 enum __type_code {
   __type_fp64 = 0,
@@ -50,7 +118,9 @@ template<> struct type_traits<__tf32>         : public type_traits_base<__type_t
 template<> struct type_traits<__hf32>         : public type_traits_base<__type_hf32, 32> {};
 
 template<> struct type_traits<__half>         : public type_traits_base<__type_fp16, 16> {};
+#ifndef __linx
 template<> struct type_traits<__bf16>         : public type_traits_base<__type_bf16, 16> {};
+#endif
 template<> struct type_traits<__hif8>         : public type_traits_base<__type_hif8, 8> {};
 
 template<> struct type_traits<__fp8_e4m3>     : public type_traits_base<__type_fp8_e4m3, 8> {};
diff --git a/include/jcore/utils.hpp b/include/jcore/utils.hpp
index 3268155..ee7dee0 100644
--- a/include/jcore/utils.hpp
+++ b/include/jcore/utils.hpp
@@ -12,7 +12,7 @@ void print_tile_Impl(tile_shape &tile) {
   typename tile_shape::DType d[tile_size] = {0};
   using dtype = typename tile_shape::DType;
   using shape = Shape<1, 1, 1, 1, 1>;
-  using stride = 
+  using stride =
        std::conditional_t<tile_shape::isRowMajor || tile_shape::isBoxedLayout,
           Stride<1, 1, tile_shape::Rows * tile_shape::Cols, tile_shape::Cols, 1>,
           Stride<1, 1, tile_shape::Rows * tile_shape::Cols, 1, tile_shape::Rows>>;
@@ -21,7 +21,7 @@ void print_tile_Impl(tile_shape &tile) {
           GlobalTensor<dtype, shape, stride, Layout::ND>,
           GlobalTensor<dtype, shape, stride, Layout::DN>>;
   gm_shape dst(d);
-  TCOPYOUT(dst, tile);
+  TSTORE(dst, tile);
 
   print_tile_info<tile_shape>();
   std::cout << std::fixed << std::scientific << std::setprecision(4);
diff --git a/kernels/element_wise/gelu.hpp b/kernels/element_wise/gelu.hpp
index 69cc4ee..191eb59 100644
--- a/kernels/element_wise/gelu.hpp
+++ b/kernels/element_wise/gelu.hpp
@@ -1,9 +1,14 @@
 #include <common/pto_tileop.hpp>
-#include "../test/accelerator/include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 
 #include "template_asm.h"
+#ifdef __linx
+#include <stddef.h>
+#include <stdint.h>
+#else
 #include <cstdint>
 #include <cstdio>
+#endif
 // HiSilicon solution: new polynomial fit
 
 // ==============================================
@@ -21,7 +26,7 @@ void __vec__ gelu_simd(
 
     // Data format conversion (V.FCVT)
     float x = static_cast<float>(indata);
-    
+
     constexpr uint32_t TOTAL_COUNT = 24*8*1024;
     constexpr float SCALAR_A5 = -3.5123395303315874e-09f;
     constexpr float SCALAR_A4 =  2.6452661927578447e-07f;
@@ -31,7 +36,7 @@ void __vec__ gelu_simd(
     constexpr float SCALAR_A0 = -7.2666168212890625e-02f;
     constexpr float SCALAR_AM1 = -1.5957698822021484e+00f;
     constexpr float FP32_MAX = 5.75f;
-    
+
     float t = blkv_max(x, -FP32_MAX);
     t = blkv_min(t, FP32_MAX);
     float t2 = t * t;
@@ -45,7 +50,7 @@ void __vec__ gelu_simd(
 
     float exp_val = blkv_fexp(t * p);
     float y = x / (1.0f + exp_val);
-    
+
     BLKC_ASSIGN_CAST(out, index, y);
     // blkv_get_tile_ptr(out)[index] = static_cast<typename tile_shape::DType>(result);
 }
@@ -65,7 +70,7 @@ void gelu_impl(
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -83,7 +88,7 @@ void gelu(
     bool approximate = false // false:none, true:tanh
     ) {
     const int Mb = gM / tM;
-    
+
     const int rmd_M = gM % tM;
 
     using gm_shape = global_tensor<dtype, RowMajor<1, gM>>;
@@ -112,18 +117,18 @@ void gelu(
         // printf("iter i %d\n",i);
         auto gI = gIIter(0, i);
         auto gO = gOIter(0, i);
-        TCOPYIN(inTile, gI);
+        TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_intTile, 1, tM);
         gelu_impl<tile_shapeData>(inTile, outTile);
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         auto gI = gIIter(0, Mb);
         auto gO = gOIter(0, Mb);
-        TCOPYIN(inTile_rmd, gI);
+        TLOAD(inTile_rmd, gI);
         gelu_impl<tile_shapeData_rmd>(inTile_rmd, outTile_rmd);
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
-}
\ No newline at end of file
+}
diff --git a/kernels/element_wise/gelu_origin.hpp b/kernels/element_wise/gelu_origin.hpp
index 0db59e6..62f4fb4 100644
--- a/kernels/element_wise/gelu_origin.hpp
+++ b/kernels/element_wise/gelu_origin.hpp
@@ -1,5 +1,5 @@
 #include <common/pto_tileop.hpp>
-#include "../test/accelerator/include/accelerator_fusion.h"
+#include <benchmark_support/npu/npu_fusion.h>
 
 #include "template_asm.h"
 #include <cstdint>
@@ -91,7 +91,7 @@ void __vec__ gelu_simd(
     float result;
     // Data format conversion (V.FCVT)
     float x = static_cast<float>(indata);
-    
+
     // GELU(x) = x * Φ(x), where Φ(x) is the integral from -∞ to x of φ(t) = exp(-0.5f*t*t) / sqrt(2π)
     // Equivalent to GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
     if (!approximate) {
@@ -126,7 +126,7 @@ void gelu_impl(
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -144,7 +144,7 @@ void gelu(
     bool approximate = false // false:none, true:tanh
     ) {
     const int Mb = gM / tM;
-    
+
     const int rmd_M = gM % tM;
 
     using gm_shape = global_tensor<dtype, RowMajor<1, gM>>;
@@ -173,19 +173,19 @@ void gelu(
         // printf("iter i %d\n",i);
         auto gI = gIIter(0, i);
         auto gO = gOIter(0, i);
-        TCOPYIN(inTile, gI);
+        TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_intTile, 1, tM);
         gelu_impl<tile_shapeData>(inTile, outTile, approximate);
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         auto gI = gIIter(0, Mb);
         auto gO = gOIter(0, Mb);
-        TCOPYIN(inTile_rmd, gI);
+        TLOAD(inTile_rmd, gI);
         gelu_impl<tile_shapeData_rmd>(inTile_rmd, outTile_rmd, approximate);
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
\ No newline at end of file
diff --git a/kernels/fa_mx/fa_hif4.hpp b/kernels/fa_mx/fa_hif4.hpp
index 280ef24..a655750 100644
--- a/kernels/fa_mx/fa_hif4.hpp
+++ b/kernels/fa_mx/fa_hif4.hpp
@@ -272,7 +272,7 @@ void __vec__ pkg_rowmax(
     __bf16x2 upd_max;
     __bf16 old_max_bf160, old_max_bf161;
     linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(upd_max, old_max_bf160, old_max_bf161);
 
     // calc tile rowmax
@@ -323,7 +323,7 @@ void __vec__ pkg_rowmax(
     // recalculate scale of softmax
     __bf16x2 scale, old_max_bf16x2;
     linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161);
-    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); 
+    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max);
     blkv_bf16x2_fexp(scale, old_max_bf16x2);
     // opt1
     union { __bf16x2 vec; uint32_t u32; } scale_u;
@@ -365,7 +365,7 @@ void __vec__ pkg_rowmax_4src(
     __bf16x2 upd_max;
     __bf16 old_max_bf160, old_max_bf161;
     linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(upd_max, old_max_bf160, old_max_bf161);
 
     // calc tile rowmax
@@ -379,7 +379,7 @@ void __vec__ pkg_rowmax_4src(
         uint32_t src_idx_21 =  (2*i + 1) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride;
         uint32_t src_idx_30 =  (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
         uint32_t src_idx_31 =  (2*i + 1) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
-        
+
         __bf16x2 s0_0, s0_1, s0_2, s0_3;
         linx_cvt_package(s0_0, src0_ptr[src_idx_00], src0_ptr[src_idx_01]);
         linx_cvt_package(s0_1, src0_ptr[src_idx_10], src0_ptr[src_idx_11]);
@@ -435,7 +435,7 @@ void __vec__ pkg_rowmax_4src(
     // recalculate scale of softmax
     __bf16x2 scale, old_max_bf16x2;
     linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161);
-    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); 
+    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max);
     blkv_bf16x2_fexp(scale, old_max_bf16x2);
 
     union { __bf16x2 vec; uint32_t u32; } scale_u;
@@ -480,7 +480,7 @@ void __vec__ pkg_rowmax_4srcx2(
     __bf16x2 upd_max;
     __bf16 old_max_bf160, old_max_bf161;
     linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(upd_max, old_max_bf160, old_max_bf161);
 
     // calc tile rowmax
@@ -491,7 +491,7 @@ void __vec__ pkg_rowmax_4srcx2(
         uint32_t src_idx_10 =  (2*i) * tileSrc::RowStride + (j + 1) * tileSrc::ColStride;
         uint32_t src_idx_20 =  (2*i) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride;
         uint32_t src_idx_30 =  (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
-        
+
         __bf16x2 s0_0, s0_1;
         blkv_bf16x2_fmax(s0_0, src0_x2_ptr[src_idx_00], src0_x2_ptr[src_idx_10]);
         blkv_bf16x2_fmax(s0_1, src0_x2_ptr[src_idx_20], src0_x2_ptr[src_idx_30]);
@@ -531,7 +531,7 @@ void __vec__ pkg_rowmax_4srcx2(
     // recalculate scale of softmax
     __bf16x2 scale, old_max_bf16x2;
     linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161);
-    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); 
+    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max);
     blkv_bf16x2_fexp(scale, old_max_bf16x2);
 
     union { __bf16x2 vec; uint32_t u32; } scale_u;
@@ -564,12 +564,12 @@ void __vec__ rowsum_2src_with_local_sum(
     linx_cvt_package(src_scale, 1.0f / sqrt((float)qD), 1.0f / sqrt((float)qD));
     __bf16x2 upd_sum, new_max_val;
     __bf16 new_max_bf16_0, new_max_bf16_1;
-    
+
     // Initialize local sum to 0
     linx_cvt_package(upd_sum, 0.0f, 0.0f);
 
     linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1);
 
     #pragma clang loop unroll(full)
@@ -582,7 +582,7 @@ void __vec__ rowsum_2src_with_local_sum(
         uint32_t src_idx_21 =  (2*i + 1) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride;
         uint32_t src_idx_30 =  (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
         uint32_t src_idx_31 =  (2*i + 1) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
-        
+
         // Process src0
         __bf16x2 s0_0, s0_1, s0_2, s0_3;
         __bf16x2 sum01_0, sum23_0, sum0123_0;
@@ -590,7 +590,7 @@ void __vec__ rowsum_2src_with_local_sum(
         linx_cvt_package(s0_1, src0_ptr[src_idx_10], src0_ptr[src_idx_11]);
         linx_cvt_package(s0_2, src0_ptr[src_idx_20], src0_ptr[src_idx_21]);
         linx_cvt_package(s0_3, src0_ptr[src_idx_30], src0_ptr[src_idx_31]);
-        
+
         blkv_bf16x2_fmsub(s0_0, s0_0, src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_1, s0_1, src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_2, s0_2, src_scale, new_max_val);
@@ -603,7 +603,7 @@ void __vec__ rowsum_2src_with_local_sum(
         blkv_bf16x2_fadd(sum23_0, s0_2, s0_3);
         blkv_bf16x2_fadd(sum0123_0, sum01_0, sum23_0);
         blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_0);
-        
+
         BLKC_ASSIGN_CAST(src_exp0, src_idx_00, s0_0);
         BLKC_ASSIGN_CAST(src_exp0, src_idx_10, s0_1);
         BLKC_ASSIGN_CAST(src_exp0, src_idx_20, s0_2);
@@ -616,7 +616,7 @@ void __vec__ rowsum_2src_with_local_sum(
         linx_cvt_package(s1_1, src1_ptr[src_idx_10], src1_ptr[src_idx_11]);
         linx_cvt_package(s1_2, src1_ptr[src_idx_20], src1_ptr[src_idx_21]);
         linx_cvt_package(s1_3, src1_ptr[src_idx_30], src1_ptr[src_idx_31]);
-        
+
         blkv_bf16x2_fmsub(s1_0, s1_0, src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_1, s1_1, src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_2, s1_2, src_scale, new_max_val);
@@ -629,7 +629,7 @@ void __vec__ rowsum_2src_with_local_sum(
         blkv_bf16x2_fadd(sum23_1, s1_2, s1_3);
         blkv_bf16x2_fadd(sum0123_1, sum01_1, sum23_1);
         blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_1);
-        
+
         BLKC_ASSIGN_CAST(src_exp1, src_idx_00, s1_0);
         BLKC_ASSIGN_CAST(src_exp1, src_idx_10, s1_1);
         BLKC_ASSIGN_CAST(src_exp1, src_idx_20, s1_2);
@@ -640,7 +640,7 @@ void __vec__ rowsum_2src_with_local_sum(
     sum_u.vec = upd_sum;
     __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff;
     __bf16 sum1 = (sum_u.u32 & 0xffff);
-    
+
     local_sum_ptr[(i*2)*tileSum::RowStride] = sum0;
     local_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1;
 }
@@ -670,12 +670,12 @@ void __vec__ rowsum_2src_with_local_sumx2(
     linx_cvt_package(src_scale, 1.0f / sqrt((float)qD), 1.0f / sqrt((float)qD));
     __bf16x2 upd_sum, new_max_val;
     __bf16 new_max_bf16_0, new_max_bf16_1;
-    
+
     // Initialize local sum to 0
     linx_cvt_package(upd_sum, 0.0f, 0.0f);
 
     linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1);
 
     // row sum
@@ -686,17 +686,17 @@ void __vec__ rowsum_2src_with_local_sumx2(
         uint32_t src_idx_10 =  (2*i) * tileSrc::RowStride + (j + 1) * tileSrc::ColStride;
         uint32_t src_idx_20 =  (2*i) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride;
         uint32_t src_idx_30 =  (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
-        
+
         // Process src0
         __bf16x2 s0_0, s0_1, s0_2, s0_3;
         __bf16x2 sum01_0, sum23_0, sum0123_0;
-        
+
         // Use the memory reads directly as input operands of fmsub
         blkv_bf16x2_fmsub(s0_0, src0_x2_ptr[src_idx_00], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_1, src0_x2_ptr[src_idx_10], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_2, src0_x2_ptr[src_idx_20], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_3, src0_x2_ptr[src_idx_30], src_scale, new_max_val);
-        
+
         blkv_bf16x2_fexp(s0_0, s0_0);
         blkv_bf16x2_fexp(s0_1, s0_1);
         blkv_bf16x2_fexp(s0_2, s0_2);
@@ -713,12 +713,12 @@ void __vec__ rowsum_2src_with_local_sumx2(
         // Process src1
         __bf16x2 s1_0, s1_1, s1_2, s1_3;
         __bf16x2 sum01_1, sum23_1, sum0123_1;
-        
+
         blkv_bf16x2_fmsub(s1_0, src1_x2_ptr[src_idx_00], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_1, src1_x2_ptr[src_idx_10], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_2, src1_x2_ptr[src_idx_20], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_3, src1_x2_ptr[src_idx_30], src_scale, new_max_val);
-        
+
         blkv_bf16x2_fexp(s1_0, s1_0);
         blkv_bf16x2_fexp(s1_1, s1_1);
         blkv_bf16x2_fexp(s1_2, s1_2);
@@ -727,7 +727,7 @@ void __vec__ rowsum_2src_with_local_sumx2(
         blkv_bf16x2_fadd(sum23_1, s1_2, s1_3);
         blkv_bf16x2_fadd(sum0123_1, sum01_1, sum23_1);
         blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_1);
-        
+
         blkc_assign_elem(src_exp1_x2_ptr, src_idx_00, s1_0);
         blkc_assign_elem(src_exp1_x2_ptr, src_idx_10, s1_1);
         blkc_assign_elem(src_exp1_x2_ptr, src_idx_20, s1_2);
@@ -738,7 +738,7 @@ void __vec__ rowsum_2src_with_local_sumx2(
     sum_u.vec = upd_sum;
     __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff;
     __bf16 sum1 = (sum_u.u32 & 0xffff);
-    
+
     local_sum_ptr[(i*2)*tileSum::RowStride] = sum0;
     local_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1;
 }
@@ -768,12 +768,12 @@ void __vec__ rowsum_2src_with_local_expx2(
     linx_cvt_package(src_scale, 1.0f / sqrt((float)qD), 1.0f / sqrt((float)qD));
     __bf16x2 upd_sum, new_max_val;
     __bf16 new_max_bf16_0, new_max_bf16_1;
-    
+
     // Initialize local sum to 0
     linx_cvt_package(upd_sum, 0.0f, 0.0f);
 
     linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1);
 
     // row sum
@@ -784,17 +784,17 @@ void __vec__ rowsum_2src_with_local_expx2(
         uint32_t src_idx_10 =  (2*i) * tileSrc::RowStride + (j + 1) * tileSrc::ColStride;
         uint32_t src_idx_20 =  (2*i) * tileSrc::RowStride + (j + 2) * tileSrc::ColStride;
         uint32_t src_idx_30 =  (2*i) * tileSrc::RowStride + (j + 3) * tileSrc::ColStride;
-        
+
         // Process src0
         __bf16x2 s0_0, s0_1, s0_2, s0_3;
         __bf16x2 sum01_0, sum23_0, sum0123_0;
-        
+
         // Use the memory reads directly as input operands of fmsub
         blkv_bf16x2_fmsub(s0_0, src0_x2_ptr[src_idx_00], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_1, src0_x2_ptr[src_idx_10], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_2, src0_x2_ptr[src_idx_20], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s0_3, src0_x2_ptr[src_idx_30], src_scale, new_max_val);
-        
+
         blkv_bf16x2_fexp(s0_0, s0_0);
         blkv_bf16x2_fexp(s0_1, s0_1);
         blkv_bf16x2_fexp(s0_2, s0_2);
@@ -811,12 +811,12 @@ void __vec__ rowsum_2src_with_local_expx2(
         // Process src1
         __bf16x2 s1_0, s1_1, s1_2, s1_3;
         __bf16x2 sum01_1, sum23_1, sum0123_1;
-        
+
         blkv_bf16x2_fmsub(s1_0, src1_x2_ptr[src_idx_00], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_1, src1_x2_ptr[src_idx_10], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_2, src1_x2_ptr[src_idx_20], src_scale, new_max_val);
         blkv_bf16x2_fmsub(s1_3, src1_x2_ptr[src_idx_30], src_scale, new_max_val);
-        
+
         blkv_bf16x2_fexp(s1_0, s1_0);
         blkv_bf16x2_fexp(s1_1, s1_1);
         blkv_bf16x2_fexp(s1_2, s1_2);
@@ -825,7 +825,7 @@ void __vec__ rowsum_2src_with_local_expx2(
         // blkv_bf16x2_fadd(sum23_1, s1_2, s1_3);
         // blkv_bf16x2_fadd(sum0123_1, sum01_1, sum23_1);
         // blkv_bf16x2_fadd(upd_sum, upd_sum, sum0123_1);
-        
+
         blkc_assign_elem(src_exp1_x2_ptr, src_idx_00, s1_0);
         blkc_assign_elem(src_exp1_x2_ptr, src_idx_10, s1_1);
         blkc_assign_elem(src_exp1_x2_ptr, src_idx_20, s1_2);
@@ -836,7 +836,7 @@ void __vec__ rowsum_2src_with_local_expx2(
     // sum_u.vec = upd_sum;
     // __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff;
     // __bf16 sum1 = (sum_u.u32 & 0xffff);
-    
+
     // local_sum_ptr[(i*2)*tileSum::RowStride] = sum0;
     // local_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1;
 }
@@ -890,14 +890,14 @@ void __vec__ rowsum(
     __bf16 new_max_bf16_0, new_max_bf16_1;
     // old_sum * rescale + new_sum
     linx_cvt(old_sum_bf16_0, old_sum_ptr[i*2*tileSum::RowStride]); //float->bf16
-    linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]); 
+    linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]);
     linx_cvt_package(upd_sum, old_sum_bf16_0, old_sum_bf16_1);
     linx_cvt(scale_bf16_0, scale_ptr[i*2*tileSum::RowStride]); //float->bf16
-    linx_cvt(scale_bf16_1, scale_ptr[(i*2+1)*tileSum::RowStride]); 
+    linx_cvt(scale_bf16_1, scale_ptr[(i*2+1)*tileSum::RowStride]);
     linx_cvt_package(scale, scale_bf16_0, scale_bf16_1);
-    blkv_bf16x2_fmul(upd_sum, upd_sum, scale);  
+    blkv_bf16x2_fmul(upd_sum, upd_sum, scale);
     linx_cvt(new_max_bf16_0, blkv_get_tile_ptr(new_max)[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(new_max_bf16_1, blkv_get_tile_ptr(new_max)[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(new_max_val, new_max_bf16_0, new_max_bf16_1);
 
     // calculate row sum of softmax, l_i
@@ -952,7 +952,7 @@ void __vec__ rowsum(
     sum_u.vec = upd_sum;
     __bf16 sum0 = (sum_u.u32 >> 16) & 0xffff;
     __bf16 sum1 = (sum_u.u32 & 0xffff);
-    
+
     new_sum_ptr[(i*2)*tileSum::RowStride] = sum0;
     new_sum_ptr[(i*2+1)*tileSum::RowStride] = sum1;
 }
@@ -987,7 +987,7 @@ void __vec__ flashsoftmax_dn_mout_cast_kernel_bf16x2(
     __bf16x2 upd_max;
     __bf16 old_max_bf160, old_max_bf161;
     linx_cvt(old_max_bf160, old_max_ptr[i*2*tileMax::RowStride]); //float->bf16
-    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]); 
+    linx_cvt(old_max_bf161, old_max_ptr[(i*2+1)*tileMax::RowStride]);
     linx_cvt_package(upd_max, old_max_bf160, old_max_bf161);
 
     // calc tile rowmax
@@ -1044,7 +1044,7 @@ void __vec__ flashsoftmax_dn_mout_cast_kernel_bf16x2(
     // recalculate scale of softmax
     __bf16x2 scale, old_max_bf16x2;
     linx_cvt_package(old_max_bf16x2, old_max_bf160, old_max_bf161);
-    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max); 
+    blkv_bf16x2_fsub(old_max_bf16x2, old_max_bf16x2, upd_max);
     blkv_bf16x2_fexp(scale, old_max_bf16x2);
     uint32_t scale_idx00 = i*2*tileScale::RowStride;
     uint32_t scale_idx01 = (i*2+1)*tileScale::RowStride;
@@ -1064,7 +1064,7 @@ void __vec__ flashsoftmax_dn_mout_cast_kernel_bf16x2(
     __bf16x2 upd_sum;
     __bf16 old_sum_bf16_0, old_sum_bf16_1;
     linx_cvt(old_sum_bf16_0, old_sum_ptr[i*2*tileSum::RowStride]); //float->bf16
-    linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]); 
+    linx_cvt(old_sum_bf16_1, old_sum_ptr[(i*2+1)*tileSum::RowStride]);
     linx_cvt_package(upd_sum, old_sum_bf16_0, old_sum_bf16_1);
     blkv_bf16x2_fmul(upd_sum, upd_sum, scale);  // *** TODO
 
@@ -1208,9 +1208,9 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
     // tile 寄存器形状
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -1278,7 +1278,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
         tileQ tQ[Xdim];
         tile_QMX tQMX[Xdim];
         // load tile Q,  TODO: add ND2ZZ transform for QMX
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -1290,10 +1290,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
                 // auto gQMX = gIterQMX(i+x,0);
-                TCOPYIN(tQ[x], gQ);
-                gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+                TLOAD(tQ[x], gQ);
+                gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
                 MGATHER(tQMX[x], gQMX, nd2zz_offset);
-                // TCOPYIN(tQMX[x], gQMX);
+                // TLOAD(tQMX[x], gQMX);
             }
         #endif
 
@@ -1330,10 +1330,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
                     // auto gKM = gIterKMX(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                     gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-                    MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-                    // TCOPYIN(tKMX[y], gKM);
+                    MGATHER(tKMX[y], gKMX, nd2nn_offset);
+                    // TLOAD(tKMX[y], gKM);
                 }
             #endif
 
@@ -1372,7 +1372,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -1383,7 +1383,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4_bf16x2<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data());
@@ -1422,8 +1422,8 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                 // #pragma clang loop unroll(full)
                 // for(int x=0;x<Xdim;x++){
                 //     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                //                                                 tScale[x].data(), 
-                //                                                 tNewMax[x].data(), 
+                //                                                 tScale[x].data(),
+                //                                                 tNewMax[x].data(),
                 //                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //                                                 tMax[x].data(),
                 //                                                 scale);
@@ -1432,7 +1432,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                 //     //                                             tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //     //                                             tNewMax[x].data(),
                 //     //                                             scale);
-                    
+
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                 //                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -1450,7 +1450,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -1470,7 +1470,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -1512,10 +1512,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
                     // auto gVMX = gIterVMX(j+y, 0);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tV[y], gV);
                     gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-                    MGATHER(tVMX[y], gVMX, nd2nn_offset); 
-                    // TCOPYIN(tVMX[y], gVMX);
+                    MGATHER(tVMX[y], gVMX, nd2nn_offset);
+                    // TLOAD(tVMX[y], gVMX);
                 }
             #endif
 
@@ -1586,7 +1586,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);//TMOV
+                TSTORE(dstO, tO_cast[x]);//TMOV
             }
         #endif
 
@@ -1601,9 +1601,9 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
 //     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
 //     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-//     using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-//     using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-//     using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+//     using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+//     using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+//     using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
 //     // tile 寄存器形状
 //     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
 //     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -1671,7 +1671,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //         tileQ tQ[Xdim];
 //         tile_QMX tQMX[Xdim];
 //         // load tile Q,  TODO: add ND2ZZ transform for QMX
-//         #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+//         #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
 //             #pragma clang loop unroll(full)
 //             for(int x=0;x<Xdim;x+=2){
 //                 auto gQ = gIterQ(i+x,0);
@@ -1683,10 +1683,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //             for(int x=0;x<Xdim;x++){
 //                 auto gQ = gIterQ(i+x,0);
 //                 // auto gQMX = gIterQMX(i+x,0);
-//                 TCOPYIN(tQ[x], gQ);
-//                 gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+//                 TLOAD(tQ[x], gQ);
+//                 gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
 //                 MGATHER(tQMX[x], gQMX, nd2zz_offset);
-//                 // TCOPYIN(tQMX[x], gQMX);
+//                 // TLOAD(tQMX[x], gQMX);
 //             }
 //         #endif
 
@@ -1723,10 +1723,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                 for(int y=0;y<Ydim;y++){
 //                     auto gK = gIterK(0, j+y);
 //                     // auto gKM = gIterKMX(0, j+y);
-//                     TCOPYIN(tK[y], gK);
+//                     TLOAD(tK[y], gK);
 //                     gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-//                     MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-//                     // TCOPYIN(tKMX[y], gKM);
+//                     MGATHER(tKMX[y], gKMX, nd2nn_offset);
+//                     // TLOAD(tKMX[y], gKM);
 //                 }
 //             #endif
 
@@ -1765,7 +1765,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                                                         tNewMax[x].data(),
 //                                                         tNewSum[x].data(),
 //                                                         tExpW[x][0].data(),
-//                                                         tW[x][0].data(), // 
+//                                                         tW[x][0].data(), //
 //                                                         tMax[x].data(),
 //                                                         tSum[x].data());
 //                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -1776,7 +1776,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                         //                                 tNewMax[x].data(),
 //                         //                                 tNewSum[x].data(),
 //                         //                                 tExpW[x][0].data(),
-//                         //                                 tW[x][0].data(), // 
+//                         //                                 tW[x][0].data(), //
 //                         //                                 tMax[x].data(),
 //                         //                                 tSum[x].data());
 
@@ -1821,8 +1821,8 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                 #pragma clang loop unroll(full)
 //                 for(int x=0;x<Xdim;x++){
 //                     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-//                                                                 tScale[x].data(), 
-//                                                                 tNewMax[x].data(), 
+//                                                                 tScale[x].data(),
+//                                                                 tNewMax[x].data(),
 //                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
 //                                                                 tMax[x].data(),
 //                                                                 scale);
@@ -1831,7 +1831,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                     //                                             tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
 //                     //                                             tNewMax[x].data(),
 //                     //                                             scale);
-                    
+
 //                     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
 //                                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
 //                     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -1849,7 +1849,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                 tileSum tLocalSum[Xdim][4];
 
 //                 #pragma clang loop unroll(full)
-//                 for(int x=0;x<Xdim;x++){    
+//                 for(int x=0;x<Xdim;x++){
 //                     #pragma clang loop unroll(full)
 //                     for(int k=0;k<2;k++){
 //                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -1869,7 +1869,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                 tileSum tLocalSum[Xdim][4];
 
 //                 #pragma clang loop unroll(full)
-//                 for(int x=0;x<Xdim;x++){       
+//                 for(int x=0;x<Xdim;x++){
 //                     for(int k=0;k<4;k++){
 //                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
 //                     }
@@ -1911,10 +1911,10 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //                 for(int y=0;y<Ydim;y++){
 //                     auto gV = gIterV(j+y, 0);
 //                     // auto gVMX = gIterVMX(j+y, 0);
-//                     TCOPYIN(tV[y], gV);
+//                     TLOAD(tV[y], gV);
 //                     gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-//                     MGATHER(tVMX[y], gVMX, nd2nn_offset); 
-//                     // TCOPYIN(tVMX[y], gVMX);
+//                     MGATHER(tVMX[y], gVMX, nd2nn_offset);
+//                     // TLOAD(tVMX[y], gVMX);
 //                 }
 //             #endif
 
@@ -1985,7 +1985,7 @@ void flash_attention_2d_unroll_hif4(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr,
 //             #pragma clang loop unroll(full)
 //             for (int x = 0; x < Xdim; ++x) {
 //                 auto dstO = gIterO(i+x, 0);
-//                 TCOPYOUT(dstO, tO_cast[x]);//TMOV
+//                 TSTORE(dstO, tO_cast[x]);//TMOV
 //             }
 //         #endif
 
@@ -2001,9 +2001,9 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
     // tile 寄存器形状
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -2071,7 +2071,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
         tileQ tQ[Xdim];
         tile_QMX tQMX[Xdim];
         // load tile Q,  TODO: add ND2ZZ transform for QMX
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -2083,10 +2083,10 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
                 auto gQMX = gIterQMX(i+x,0);
-                TCOPYIN(tQ[x], gQ);
-                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+                TLOAD(tQ[x], gQ);
+                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
                 // MGATHER(tQMX[x], gQMX, nd2zz_offset);
-                TCOPYIN(tQMX[x], gQMX);
+                TLOAD(tQMX[x], gQMX);
             }
         #endif
 
@@ -2123,10 +2123,10 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
                     auto gKM = gIterKMX(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                     // gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-                    // MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-                    TCOPYIN(tKMX[y], gKM);
+                    // MGATHER(tKMX[y], gKMX, nd2nn_offset);
+                    TLOAD(tKMX[y], gKM);
                 }
             #endif
 
@@ -2164,7 +2164,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -2176,7 +2176,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4_bf16x2<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data());
@@ -2215,8 +2215,8 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                 // #pragma clang loop unroll(full)
                 // for(int x=0;x<Xdim;x++){
                 //     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                //                                                 tScale[x].data(), 
-                //                                                 tNewMax[x].data(), 
+                //                                                 tScale[x].data(),
+                //                                                 tNewMax[x].data(),
                 //                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //                                                 tMax[x].data(),
                 //                                                 scale);
@@ -2225,7 +2225,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                 //     //                                             tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //     //                                             tNewMax[x].data(),
                 //     //                                             scale);
-                    
+
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                 //                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -2243,7 +2243,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -2263,7 +2263,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -2305,10 +2305,10 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
                     auto gVMX = gIterVMX(j+y, 0);
-                    TCOPYIN(tVMX[y], gVMX);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tVMX[y], gVMX);
+                    TLOAD(tV[y], gV);
                     // gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-                    // MGATHER(tVMX[y], gVMX, nd2nn_offset); 
+                    // MGATHER(tVMX[y], gVMX, nd2nn_offset);
 
                 }
             #endif
@@ -2380,7 +2380,7 @@ void flash_attention_2d_unroll_hif4_nogather(dtype* out_ptr, dtype* q_ptr, dtype
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);//TMOV
+                TSTORE(dstO, tO_cast[x]);//TMOV
             }
         #endif
 
@@ -2395,9 +2395,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
     // tile 寄存器形状
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -2466,7 +2466,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
         tileQ tQ[Xdim];
         tile_QMX tQMX[Xdim];
         // load tile Q,  TODO: add ND2ZZ transform for QMX
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -2478,10 +2478,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
                 auto gQMX = gIterQMX(i+x,0);
-                TCOPYIN(tQ[x], gQ);
-                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+                TLOAD(tQ[x], gQ);
+                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
                 // MGATHER(tQMX[x], gQMX, nd2zz_offset);
-                TCOPYIN(tQMX[x], gQMX);
+                TLOAD(tQMX[x], gQMX);
             }
         #endif
 
@@ -2518,10 +2518,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
                     auto gKM = gIterKMX(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                     // gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-                    // MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-                    TCOPYIN(tKMX[y], gKM);
+                    // MGATHER(tKMX[y], gKMX, nd2nn_offset);
+                    TLOAD(tKMX[y], gKM);
                 }
             #endif
 
@@ -2560,7 +2560,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -2572,7 +2572,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                         //                                 tNewMax[x].data(),
                         //                                 tNewSum[x].data(),
                         //                                 tExpW[x][0].data(),
-                        //                                 tW[x][0].data(), // 
+                        //                                 tW[x][0].data(), //
                         //                                 tMax[x].data(),
                         //                                 tSum[x].data());
                         // bf16tobf16x2();
@@ -2606,13 +2606,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                 // #pragma clang loop unroll(full)
                 // for(int x=0;x<Xdim;x++){
                 //     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                //                                                 tScale[x].data(), 
-                //                                                 tNewMax[x].data(), 
+                //                                                 tScale[x].data(),
+                //                                                 tNewMax[x].data(),
                 //                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //                                                 tMax[x].data(),
                 //                                                 scale);
 
-                    
+
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                 //                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -2624,8 +2624,8 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     pkg_rowmax_4src<tileW, tileW_cast, tileMax, tileScale, qD><<<tileW::ValidRow/2, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data());
 
@@ -2635,7 +2635,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                     rowsum_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum, qD><<<tileW::ValidRow/2, 1, 1>>>(
                                                                     tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
                                                                     tW[x][2].data(), tW[x][3].data(), tNewMax[x].data());
-                    
+
                     new_sum_of_2_loc_sum_bf16x2<tileScale, tileSum><<<tileSum::ValidRow/2, 1, 1>>>(
                                                                     tNewSum[x].data(), tLocalSum[x][0].data(), tLocalSum[x][1].data(), tSum[x].data(), tScale[x].data());
 
@@ -2649,7 +2649,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -2669,7 +2669,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -2711,10 +2711,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
                     auto gVMX = gIterVMX(j+y, 0);
-                    TCOPYIN(tVMX[y], gVMX);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tVMX[y], gVMX);
+                    TLOAD(tV[y], gV);
                     // gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-                    // MGATHER(tVMX[y], gVMX, nd2nn_offset); 
+                    // MGATHER(tVMX[y], gVMX, nd2nn_offset);
 
                 }
             #endif
@@ -2786,7 +2786,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax(dtype* out_ptr, dtype* q_ptr, dty
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);//TMOV
+                TSTORE(dstO, tO_cast[x]);//TMOV
             }
         #endif
 
@@ -2801,9 +2801,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
     // tile 寄存器形状
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -2872,7 +2872,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
         tileQ tQ[Xdim];
         tile_QMX tQMX[Xdim];
         // load tile Q,  TODO: add ND2ZZ transform for QMX
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -2884,10 +2884,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
                 auto gQMX = gIterQMX(i+x,0);
-                TCOPYIN(tQ[x], gQ);
-                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+                TLOAD(tQ[x], gQ);
+                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
                 // MGATHER(tQMX[x], gQMX, nd2zz_offset);
-                TCOPYIN(tQMX[x], gQMX);
+                TLOAD(tQMX[x], gQMX);
             }
         #endif
 
@@ -2924,10 +2924,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
                     auto gKM = gIterKMX(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                     // gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-                    // MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-                    TCOPYIN(tKMX[y], gKM);
+                    // MGATHER(tKMX[y], gKMX, nd2nn_offset);
+                    TLOAD(tKMX[y], gKM);
                 }
             #endif
 
@@ -2966,7 +2966,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -2978,7 +2978,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                         //                                 tNewMax[x].data(),
                         //                                 tNewSum[x].data(),
                         //                                 tExpW[x][0].data(),
-                        //                                 tW[x][0].data(), // 
+                        //                                 tW[x][0].data(), //
                         //                                 tMax[x].data(),
                         //                                 tSum[x].data());
                         // bf16tobf16x2();
@@ -3012,13 +3012,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                 // #pragma clang loop unroll(full)
                 // for(int x=0;x<Xdim;x++){
                 //     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                //                                                 tScale[x].data(), 
-                //                                                 tNewMax[x].data(), 
+                //                                                 tScale[x].data(),
+                //                                                 tNewMax[x].data(),
                 //                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //                                                 tMax[x].data(),
                 //                                                 scale);
 
-                    
+
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                 //                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -3030,8 +3030,8 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     pkg_rowmax_4srcx2<tileW, tileW_cast, tileMax, tileScale, qD><<<tileW::ValidRow/2, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data());
 
@@ -3041,7 +3041,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                     rowsum_2src_with_local_sumx2<tileW, tileW_cast, tileMax, tileSum, qD><<<tileW::ValidRow/2, 1, 1>>>(
                                                                     tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
                                                                     tW[x][2].data(), tW[x][3].data(), tNewMax[x].data());
-                    
+
                     new_sum_of_2_loc_sum_bf16x2<tileScale, tileSum><<<tileSum::ValidRow/2, 1, 1>>>(
                                                                     tNewSum[x].data(), tLocalSum[x][0].data(), tLocalSum[x][1].data(), tSum[x].data(), tScale[x].data());
 
@@ -3055,7 +3055,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -3075,7 +3075,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -3117,10 +3117,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
                     auto gVMX = gIterVMX(j+y, 0);
-                    TCOPYIN(tVMX[y], gVMX);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tVMX[y], gVMX);
+                    TLOAD(tV[y], gV);
                     // gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-                    // MGATHER(tVMX[y], gVMX, nd2nn_offset); 
+                    // MGATHER(tVMX[y], gVMX, nd2nn_offset);
 
                 }
             #endif
@@ -3192,7 +3192,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_loadx2(dtype* out_ptr, dtype* q_p
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);//TMOV
+                TSTORE(dstO, tO_cast[x]);//TMOV
             }
         #endif
 
@@ -3225,9 +3225,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
     // tile 寄存器形状
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -3303,7 +3303,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
         tileQ tQ[Xdim];
         tile_QMX tQMX[Xdim];
         // load tile Q,  TODO: add ND2ZZ transform for QMX
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -3315,10 +3315,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
                 auto gQMX = gIterQMX(i+x,0);
-                TCOPYIN(tQ[x], gQ);
-                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+                TLOAD(tQ[x], gQ);
+                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
                 // MGATHER(tQMX[x], gQMX, nd2zz_offset);
-                TCOPYIN(tQMX[x], gQMX);
+                TLOAD(tQMX[x], gQMX);
             }
         #endif
 
@@ -3355,10 +3355,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
                     auto gKM = gIterKMX(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                     // gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-                    // MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-                    TCOPYIN(tKMX[y], gKM);
+                    // MGATHER(tKMX[y], gKMX, nd2nn_offset);
+                    TLOAD(tKMX[y], gKM);
                 }
             #endif
 
@@ -3400,7 +3400,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -3412,7 +3412,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                         //                                 tNewMax[x].data(),
                         //                                 tNewSum[x].data(),
                         //                                 tExpW[x][0].data(),
-                        //                                 tW[x][0].data(), // 
+                        //                                 tW[x][0].data(), //
                         //                                 tMax[x].data(),
                         //                                 tSum[x].data());
                         // bf16tobf16x2();
@@ -3446,13 +3446,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                 // #pragma clang loop unroll(full)
                 // for(int x=0;x<Xdim;x++){
                 //     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                //                                                 tScale[x].data(), 
-                //                                                 tNewMax[x].data(), 
+                //                                                 tScale[x].data(),
+                //                                                 tNewMax[x].data(),
                 //                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //                                                 tMax[x].data(),
                 //                                                 scale);
 
-                    
+
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                 //                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -3465,12 +3465,12 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     pkg_rowmax_4srcx2<tileW, tileW_cast, tileMax, tileScale, qD><<<tileW::ValidRow/2, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data());
 
-                    
+
                     rowsum_2src_with_local_expx2<tileW, tileW_cast, tileMax, qD><<<tileW::ValidRow/2, 1, 1>>>(
                                                                     tExpW[x][0].data(), tExpW[x][1].data(),
                                                                     tW[x][0].data(), tW[x][1].data(), tNewMax[x].data());
@@ -3503,7 +3503,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -3523,7 +3523,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -3565,10 +3565,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
                     auto gVMX = gIterVMX(j+y, 0);
-                    TCOPYIN(tVMX[y], gVMX);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tVMX[y], gVMX);
+                    TLOAD(tV[y], gV);
                     // gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-                    // MGATHER(tVMX[y], gVMX, nd2nn_offset); 
+                    // MGATHER(tVMX[y], gVMX, nd2nn_offset);
 
                 }
             #endif
@@ -3640,7 +3640,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload(dtype* out_ptr, dtype
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);//TMOV
+                TSTORE(dstO, tO_cast[x]);//TMOV
             }
         #endif
     }
@@ -3655,9 +3655,9 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
     using gmK = global_tensor<dtype, ColMajor<qD/2, Skv>>;  // K: [qD×S]
     using gmV = global_tensor<dtype, RowMajor<Skv, vD/2>>;  // V: [S×vD]
     using gmO = global_tensor<dtype, ColMajor<Sq, vD/2>>;  // O: [SxvD]
-    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>; 
-    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>; 
-    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>; 
+    using gm_QMX = global_tensor<uint8_t, RowMajor<Sq, qD/w_factor>>;
+    using gm_KMX = global_tensor<uint8_t, ColMajor<qD/w_factor, Skv>>;
+    using gm_VMX = global_tensor<uint8_t, RowMajor<Skv, vD/w_factor>>;
     // tile 寄存器形状
     using tileQ      = TileLeft<dtype, kTm, (qD==192? 256:qD)/2, kTm, qD/2>;       // [kTm×qD]
     using tileK      = TileRight<dtype, (qD==192? 256:qD)/2, kTk, qD/2, kTk>;      // [vD×kTk]
@@ -3740,7 +3740,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
         tileQ tQ[Xdim];
         tile_QMX tQMX[Xdim];
         // load tile Q,  TODO: add ND2ZZ transform for QMX
-        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore 
+        #ifdef MULTI_LDST // don't use, no need for multi tload/tstore
             #pragma clang loop unroll(full)
             for(int x=0;x<Xdim;x+=2){
                 auto gQ = gIterQ(i+x,0);
@@ -3752,10 +3752,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
             for(int x=0;x<Xdim;x++){
                 auto gQ = gIterQ(i+x,0);
                 auto gQMX = gIterQMX(i+x,0);
-                TCOPYIN(tQ[x], gQ);
-                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0); 
+                TLOAD(tQ[x], gQ);
+                // gen_ND2ZZ_offset_Impl<gm_QMX, tile_QMX, tile_ND2ZZOffset>(gQMX, tQMX[x], nd2zz_offset, i+x, 0);
                 // MGATHER(tQMX[x], gQMX, nd2zz_offset);
-                TCOPYIN(tQMX[x], gQMX);
+                TLOAD(tQMX[x], gQMX);
             }
         #endif
 
@@ -3792,10 +3792,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                 for(int y=0;y<Ydim;y++){
                     auto gK = gIterK(0, j+y);
                     auto gKM = gIterKMX(0, j+y);
-                    TCOPYIN(tK[y], gK);
+                    TLOAD(tK[y], gK);
                     // gen_ND2NN_offset_Impl<gm_KMX, tile_KMX, tile_ND2NNOffset>(gKMX, tKMX[y], nd2nn_offset, 0, j+y);
-                    // MGATHER(tKMX[y], gKMX, nd2nn_offset); 
-                    TCOPYIN(tKMX[y], gKM);
+                    // MGATHER(tKMX[y], gKMX, nd2nn_offset);
+                    TLOAD(tKMX[y], gKM);
                 }
             #endif
 
@@ -3837,7 +3837,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                                                         tNewMax[x].data(),
                                                         tNewSum[x].data(),
                                                         tExpW[x][0].data(),
-                                                        tW[x][0].data(), // 
+                                                        tW[x][0].data(), //
                                                         tMax[x].data(),
                                                         tSum[x].data());
                         tohif4<tileP_hif4, tile_PMX, tileW_cast><<<tileW_cast::ValidRow, tileW_cast::ValidCol/64, 1>>>(tP_hif4[x][0].data(), tP_scale[x][0].data(), tExpW[x][0].data()); // 64 = 1 group, tExpW(Zn, bf16), tP_scale(ZZ, E6M2 with zero E1_8 && E1_16)
@@ -3849,7 +3849,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                         //                                 tNewMax[x].data(),
                         //                                 tNewSum[x].data(),
                         //                                 tExpW[x][0].data(),
-                        //                                 tW[x][0].data(), // 
+                        //                                 tW[x][0].data(), //
                         //                                 tMax[x].data(),
                         //                                 tSum[x].data());
                         // bf16tobf16x2();
@@ -3883,13 +3883,13 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                 // #pragma clang loop unroll(full)
                 // for(int x=0;x<Xdim;x++){
                 //     new_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(
-                //                                                 tScale[x].data(), 
-                //                                                 tNewMax[x].data(), 
+                //                                                 tScale[x].data(),
+                //                                                 tNewMax[x].data(),
                 //                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                 //                                                 tMax[x].data(),
                 //                                                 scale);
 
-                    
+
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][0].data(), tExpW[x][0].data(), tExpW[x][1].data(),
                 //                                                                                    tW[x][0].data(), tW[x][1].data(), tNewMax[x].data(), scale);
                 //     src_exp_2src_with_local_sum<tileW, tileW_cast, tileMax, tileSum><<<tileW::ValidRow, 1, 1>>>(tLocalSum[x][1].data(), tExpW[x][2].data(), tExpW[x][3].data(),
@@ -3902,12 +3902,12 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                 #pragma clang loop unroll(full)
                 for(int x=0;x<Xdim;x++){
                     pkg_rowmax_4srcx2<tileW, tileW_cast, tileMax, tileScale, qD><<<tileW::ValidRow/2, 1, 1>>>(
-                                                                tScale[x].data(), 
-                                                                tNewMax[x].data(), 
+                                                                tScale[x].data(),
+                                                                tNewMax[x].data(),
                                                                 tW[x][0].data(), tW[x][1].data(), tW[x][2].data(), tW[x][3].data(),
                                                                 tMax[x].data());
 
-                    
+
                     rowsum_2src_with_local_expx2<tileW, tileW_cast, tileMax, qD><<<tileW::ValidRow/2, 1, 1>>>(
                                                                     tExpW[x][0].data(), tExpW[x][1].data(),
                                                                     tW[x][0].data(), tW[x][1].data(), tNewMax[x].data());
@@ -3940,7 +3940,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){    
+                for(int x=0;x<Xdim;x++){
                     #pragma clang loop unroll(full)
                     for(int k=0;k<2;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
@@ -3960,7 +3960,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                 tileSum tLocalSum[Xdim][4];
 
                 #pragma clang loop unroll(full)
-                for(int x=0;x<Xdim;x++){       
+                for(int x=0;x<Xdim;x++){
                     for(int k=0;k<4;k++){
                         local_max_4src<tileW, tileMax><<<tileMax::ValidRow, 1, 1>>>(tLocalMax[x][k].data(), tW[x][4*k].data(), tW[x][4*k+1].data(), tW[x][4*k+2].data(), tW[x][4*k+3].data(), scale);
                     }
@@ -4002,10 +4002,10 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
                 for(int y=0;y<Ydim;y++){
                     auto gV = gIterV(j+y, 0);
                     auto gVMX = gIterVMX(j+y, 0);
-                    TCOPYIN(tVMX[y], gVMX);
-                    TCOPYIN(tV[y], gV);
+                    TLOAD(tVMX[y], gVMX);
+                    TLOAD(tV[y], gV);
                     // gen_ND2NN_offset_Impl<gm_VMX, tile_VMX, tile_ND2NNOffset>(gVMX, tVMX[y], nd2nn_offset, j+y, 0);
-                    // MGATHER(tVMX[y], gVMX, nd2nn_offset); 
+                    // MGATHER(tVMX[y], gVMX, nd2nn_offset);
 
                 }
             #endif
@@ -4085,7 +4085,7 @@ void flash_attention_2d_unroll_hif4_optsoftmax_cubeoffload2(dtype* out_ptr, dtyp
             #pragma clang loop unroll(full)
             for (int x = 0; x < Xdim; ++x) {
                 auto dstO = gIterO(i+x, 0);
-                TCOPYOUT(dstO, tO_cast[x]);//TMOV
+                TSTORE(dstO, tO_cast[x]);//TMOV
             }
         #endif
     }
diff --git a/kernels/matmul_mx/matmul_mx.hpp b/kernels/matmul_mx/matmul_mx.hpp
index 52e6ee6..ddf435d 100644
--- a/kernels/matmul_mx/matmul_mx.hpp
+++ b/kernels/matmul_mx/matmul_mx.hpp
@@ -4,7 +4,9 @@
 #include <common/pto_tileop.hpp>
 #include "template_asm.h"
 #include <cstdint>
+#ifndef __linx
 #include <cstdio>
+#endif
 #include "utils/layout_transform.hpp"
 
 
@@ -13,7 +15,7 @@
 //         GlobalTensor<typename decltype(TileVar)::DType, \
 //                      Shape<1,1,1,Rows,Cols>, \
 //                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-//         TCOPYOUT(_g, TileVar); \
+//         TSTORE(_g, TileVar); \
 //         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
 //         for (int ri = 0; ri < Rows; ri++) { \
 //             printf("  row%2d: ", ri); \
@@ -32,20 +34,20 @@ using namespace pto;
 
 // TODO, move to utils.cpp
 template <is_global_data_v GmOut, is_tile_data_v TileAcc>
-void TCOPYOUT_ACC(GmOut &Gout, TileAcc &tAcc){
+void TSTORE_ACC(GmOut &Gout, TileAcc &tAcc){
     using TileAccOut = Tile<Location::Vec, typename TileAcc::DType, TileAcc::Rows, TileAcc::Cols, BLayout::RowMajor, TileAcc::ValidRow, TileAcc::ValidCol>;
     TileAccOut tAccOut;
     TCVT(tAccOut, tAcc);
-    TCOPYOUT(Gout, tAccOut);
+    TSTORE(Gout, tAccOut);
 }
 
 // typeb_wfactor gives the bit-width ratio between typeA and typeB; for example, fp8 is twice as wide as fp4x2.
 // smatrix_wfactor: bit-width ratio between the scaling matrix and the compute matrix
-template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK, 
+template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK,
           typename dtypeB = dtypeA, const int typeb_wfactor = 1, const int smatrix_wfactor=1>
 void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) {
   // only support regular shape now for this operator!
-  // static_assert(gM % tM == 0); 
+  // static_assert(gM % tM == 0);
   // static_assert(gN % tN == 0);
   // static_assert(gK % tK == 0);
   static const uint32_t valid_row = (tM > gM) ? gM : tM;
@@ -54,7 +56,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
   using gm_shapeB = global_tensor<dtypeB, RowMajor<gK/typeb_wfactor, gN>>;
   using gm_shapeC = global_tensor<float, RowMajor<gM, gN>>;
 
-  using tile_shapeA = TileLeft<dtypeA, tM, tK, valid_row, tK>; 
+  using tile_shapeA = TileLeft<dtypeA, tM, tK, valid_row, tK>;
   using tile_shapeB = TileRight<dtypeB, tK/typeb_wfactor, tN, tK/typeb_wfactor, valid_col>;
   using tile_shapeACC = TileAcc<float, tM, tN, valid_row, valid_col>;
   using itA = global_iterator<gm_shapeA, tile_shapeA>;
@@ -65,7 +67,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
   itB gBIter(src1);
   itC gCIter(dst);
 
-  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>; 
+  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>;
   gm_shapeAMX gAMX(src0_mx);
   using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK, BLayout::RowMajor, valid_row, tK/smatrix_wfactor, SLayout::RowMajor>; // actual tile size is <tM, tK/32>; must be initialized to 0
   using itAMX = global_iterator<gm_shapeAMX, tile_shapeAMX>;
@@ -119,8 +121,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
         auto gB = gBIter(k,j);
         tile_shapeA tA;
         tile_shapeB tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         // if (src0_mx != nullptr && src1_mx != nullptr) {
         tile_shapeAMX tAMX;
@@ -141,8 +143,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
         auto gB = gBIter(Kb,j);
         tile_shapeA_trows tA;
         tile_shapeB_tcols tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX_trows tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX_trows, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, i, Kb);
@@ -158,7 +160,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
     if constexpr (rmd_N) {
       auto gC = gCIter(i, Nb);
@@ -168,8 +170,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
         auto gB = gBIter(k, Nb);
         tile_shapeA tA;
         tile_shapeB_trows tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, i, k);
@@ -191,8 +193,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
 
         tile_shapeA_trows tA;
         tile_shapeB_tcorner tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX_trows tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX_trows, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, i, Kb);
@@ -206,7 +208,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
   }
   if constexpr (rmd_M) {
@@ -221,8 +223,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
 
         tile_shapeA_tcols tA;
         tile_shapeB tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX_tcols tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX_tcols, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, Mb, k);
@@ -244,8 +246,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
 
         tile_shapeA_tcorner tA;
         tile_shapeB_tcols tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX_tcorner tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX_tcorner, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, Mb, Kb);
@@ -260,7 +262,7 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
     if constexpr (rmd_N) {
       auto gC = gCIter(Mb, Nb);
@@ -273,8 +275,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
 
         tile_shapeA_tcols tA;
         tile_shapeB_trows tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX_tcols tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX_tcols, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, Mb, k);
@@ -295,8 +297,8 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
 
         tile_shapeA_tcorner tA;
         tile_shapeB_tcorner tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         tile_shapeAMX_tcorner tAMX;
         gen_ND2ZZ_offset_Impl<gm_shapeAMX, tile_shapeAMX_tcorner, tile_ND2ZZOffset>(gAMX, tAMX, nd2zz_offset, Mb, Kb);
@@ -310,12 +312,12 @@ void matmul_mxfp(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
   }
 }
 
-template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK, 
+template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK,
           typename dtypeB = dtypeA, const int typeb_wfactor = 1, const int smatrix_wfactor=1>
 void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) {
   static_assert(typeb_wfactor == 1 );
@@ -325,7 +327,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
   using gm_shapeB = global_tensor<dtypeB, RowMajor<gK/typeb_wfactor, gN>>;
   using gm_shapeC = global_tensor<float, RowMajor<gM, gN>>;
 
-  using tile_shapeA = TileLeft<dtypeA, tM, tK, valid_row, tK/typeb_wfactor>; 
+  using tile_shapeA = TileLeft<dtypeA, tM, tK, valid_row, tK/typeb_wfactor>;
   using tile_shapeB = TileRight<dtypeB, tK/typeb_wfactor, tN, tK/typeb_wfactor, valid_col>;
   using tile_shapeACC = TileAcc<float, tM, tN, valid_row, valid_col>;
   using itA = global_iterator<gm_shapeA, tile_shapeA>;
@@ -336,7 +338,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
   itB gBIter(src1);
   itC gCIter(dst);
 
-  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>; 
+  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>;
   using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK, BLayout::RowMajor, valid_row, tK/smatrix_wfactor>; // actual tile size is <tM, tK/32>; must be initialized to 0
   using itAMX = global_iterator<gm_shapeAMX, tile_shapeAMX>;
   itAMX gAMXIter(src0_mx);
@@ -390,10 +392,10 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
         tile_shapeB tB;
         tile_shapeAMX tAMX;
         tile_shapeBMX tBMX;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if(k==0){
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
@@ -406,14 +408,14 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
         auto gB = gBIter(Kb,j);
         tile_shapeA_trows tA;
         tile_shapeB_tcols tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
         auto gAMX = gAMXIter(i, Kb);
         auto gBMX = gBMXIter(Kb, j);
         tile_shapeAMX_trows tAMX;
         tile_shapeBMX_tcols tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if constexpr(Kb>0){
           MATMACCMX(tACC, tA, tAMX, tB, tBMX);
@@ -421,7 +423,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
     if constexpr (rmd_N) {
       auto gC = gCIter(i, Nb);
@@ -431,15 +433,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
         auto gB = gBIter(k, Nb);
         tile_shapeA tA;
         tile_shapeB_trows tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         auto gAMX = gAMXIter(i, k);
         auto gBMX = gBMXIter(k, Nb);
         tile_shapeAMX tAMX;
         tile_shapeBMX_trows tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if(k==0){
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
@@ -454,15 +456,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
 
         tile_shapeA_trows tA;
         tile_shapeB_tcorner tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         auto gAMX = gAMXIter(i, Kb);
         auto gBMX = gBMXIter(Kb, Nb);
         tile_shapeAMX_trows tAMX;
         tile_shapeBMX_tcorner tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if constexpr(Kb>0){
           MATMACCMX(tACC, tA, tAMX, tB, tBMX);
@@ -470,7 +472,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
   }
   if constexpr (rmd_M) {
@@ -485,15 +487,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
 
         tile_shapeA_tcols tA;
         tile_shapeB tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         auto gAMX = gAMXIter(Mb, k);
         auto gBMX = gBMXIter(k, j);
         tile_shapeAMX_tcols tAMX;
         tile_shapeBMX tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if(k==0){
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
@@ -508,15 +510,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
 
         tile_shapeA_tcorner tA;
         tile_shapeB_tcols tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         auto gAMX = gAMXIter(Mb, Kb);
         auto gBMX = gBMXIter(Kb, j);
         tile_shapeAMX_tcorner tAMX;
         tile_shapeBMX_tcols tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if constexpr(Kb>0){
           MATMACCMX(tACC, tA, tAMX, tB, tBMX);
@@ -524,7 +526,7 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
     if constexpr (rmd_N) {
       auto gC = gCIter(Mb, Nb);
@@ -537,15 +539,15 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
 
         tile_shapeA_tcols tA;
         tile_shapeB_trows tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         auto gAMX = gAMXIter(Mb, k);
         auto gBMX = gBMXIter(k, Nb);
         tile_shapeAMX_tcols tAMX;
         tile_shapeBMX_trows tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
 
         if(k==0){
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
@@ -559,22 +561,22 @@ void matmul_mxfp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx
 
         tile_shapeA_tcorner tA;
         tile_shapeB_tcorner tB;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
 
         auto gAMX = gAMXIter(Mb, Kb);
         auto gBMX = gBMXIter(Kb, Nb);
         tile_shapeAMX_tcorner tAMX;
         tile_shapeBMX_tcorner tBMX;
-        TCOPYIN(tAMX, gAMX);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tAMX, gAMX);
+        TLOAD(tBMX, gBMX);
         if constexpr(Kb>0){
           MATMACCMX(tACC, tA, tAMX, tB, tBMX);
         } else {
           MATMULMX(tACC, tA, tAMX, tB, tBMX);
         }
       }
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
   }
 }
@@ -609,22 +611,22 @@ constexpr ResA find_reuseA(int Mb, int Kb, int MAX_TILE_NUM) {
     return {best_m, best_k, best_val};
 }
 
-template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK, 
+template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK,
           typename dtypeB = dtypeA, const int typeb_wfactor = 1, const int smatrix_wfactor=1>
 void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) {
   static_assert(typeb_wfactor == 1 );
   static const uint32_t valid_row = (tM > gM) ? gM : tM;
   static const uint32_t valid_col = (tN > gN) ? gN : tN;
   static const uint32_t MAX_TILE_NUM = 24; // TODO, check this value
-  
+
   using gm_shapeA = global_tensor<dtypeA, RowMajor<gM, gK/typeb_wfactor>>;
   using gm_shapeB = global_tensor<dtypeB, RowMajor<gK/typeb_wfactor, gN>>;
   using gm_shapeC = global_tensor<float, RowMajor<gM, gN>>;
 
-  using tile_shapeA = TileLeft<dtypeA, tM, tK, valid_row, tK/typeb_wfactor>; 
+  using tile_shapeA = TileLeft<dtypeA, tM, tK, valid_row, tK/typeb_wfactor>;
   using tile_shapeB = TileRight<dtypeB, tK/typeb_wfactor, tN, tK/typeb_wfactor, valid_col>;
   using tile_shapeACC = TileAcc<float, tM, tN, valid_row, valid_col>;
-  
+
   using itA = global_iterator<gm_shapeA, tile_shapeA>;
   using itB = global_iterator<gm_shapeB, tile_shapeB>;
   using itC = global_iterator<gm_shapeC, tile_shapeACC>;
@@ -633,8 +635,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
   itB gBIter(src1);
   itC gCIter(dst);
 
-  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>; 
-  using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK, BLayout::RowMajor, valid_row, tK/smatrix_wfactor>; 
+  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>;
+  using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK, BLayout::RowMajor, valid_row, tK/smatrix_wfactor>;
   using itAMX = global_iterator<gm_shapeAMX, tile_shapeAMX>;
   itAMX gAMXIter(src0_mx);
 
@@ -691,8 +693,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
       // for(int k=0; k<R.k; k++){
       //   auto gA = gAIter(ii+i*R.m, k);
       //   auto gAMX = gAMXIter(ii+i*R.m, k);
-      //   TCOPYIN(tA[ii][k], gA);
-      //   TCOPYIN(tAMX[ii][k], gAMX);
+      //   TLOAD(tA[ii][k], gA);
+      //   TLOAD(tAMX[ii][k], gAMX);
       // }
 
       #pragma clang loop unroll(full)
@@ -705,14 +707,14 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           auto gBMX = gBMXIter(k, j);
           tile_shapeB tB;
           tile_shapeBMX tBMX;
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if(j==0){
             // eliminate head cost
             auto gA = gAIter(ii+i*R.m, k);
             auto gAMX = gAMXIter(ii+i*R.m, k);
-            TCOPYIN(tA[ii][k], gA);
-            TCOPYIN(tAMX[ii][k], gAMX);
+            TLOAD(tA[ii][k], gA);
+            TLOAD(tAMX[ii][k], gAMX);
           }
           if(k==0){
             MATMULMX(tACC, tA[ii][k], tAMX[ii][k], tB, tBMX);
@@ -731,10 +733,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
             auto gAMX = gAMXIter(i*R.m+ii, k);
             auto gB = gBIter(k, j);
             auto gBMX = gBMXIter(k, j);
-            TCOPYIN(tA_tmp, gA);
-            TCOPYIN(tAMX_tmp, gAMX);
-            TCOPYIN(tB, gB);
-            TCOPYIN(tBMX, gBMX);
+            TLOAD(tA_tmp, gA);
+            TLOAD(tAMX_tmp, gAMX);
+            TLOAD(tB, gB);
+            TLOAD(tBMX, gBMX);
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           }
         }
@@ -751,10 +753,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           tile_shapeB_tcols tB;
           tile_shapeBMX_tcols tBMX;
 
-          TCOPYIN(tA_tmp, gA);
-          TCOPYIN(tAMX_tmp, gAMX);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA_tmp, gA);
+          TLOAD(tAMX_tmp, gAMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if constexpr(Kb>0){
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           } else {
@@ -763,7 +765,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         }
 
         auto gC = gCIter(i*R.m+ii, j);
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
 
       // [m, rmd_N, k]
@@ -776,8 +778,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           auto gBMX = gBMXIter(k, Nb);
           tile_shapeB_trows tB;
           tile_shapeBMX_trows tBMX;
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if(k==0){
             MATMULMX(tACC, tA[ii][k], tAMX[ii][k], tB, tBMX);
           }else{
@@ -795,10 +797,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
             auto gAMX = gAMXIter(i*R.m+ii, k);
             auto gB = gBIter(k, Nb);
             auto gBMX = gBMXIter(k, Nb);
-            TCOPYIN(tA_tmp, gA);
-            TCOPYIN(tAMX_tmp, gAMX);
-            TCOPYIN(tB, gB);
-            TCOPYIN(tBMX, gBMX);
+            TLOAD(tA_tmp, gA);
+            TLOAD(tAMX_tmp, gAMX);
+            TLOAD(tB, gB);
+            TLOAD(tBMX, gBMX);
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           }
         }
@@ -814,11 +816,11 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           tile_shapeAMX_trows tAMX_tmp;
           tile_shapeB_tcorner tB;
           tile_shapeBMX_tcorner tBMX;
-          
-          TCOPYIN(tA_tmp, gA);
-          TCOPYIN(tAMX_tmp, gAMX);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+
+          TLOAD(tA_tmp, gA);
+          TLOAD(tAMX_tmp, gAMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if constexpr(Kb>0){
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           } else {
@@ -827,7 +829,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         }
 
         auto gC = gCIter(i*R.m+ii, Nb);
-        TCOPYOUT_ACC(gC, tACC);       
+        TSTORE_ACC(gC, tACC);
       }
     }
   }
@@ -835,7 +837,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
   if constexpr(rM>0){
     tile_shapeA tA[rM][R.k];
     tile_shapeAMX tAMX[rM][R.k];
-    
+
     #pragma clang loop unroll(full)
     for(int i=0; i<rM; i++){
 
@@ -844,8 +846,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
       for(int k=0; k<R.k; k++){
         auto gA = gAIter(i+dM*R.m, k);
         auto gAMX = gAMXIter(i+dM*R.m, k);
-        TCOPYIN(tA[i][k], gA);
-        TCOPYIN(tAMX[i][k], gAMX);
+        TLOAD(tA[i][k], gA);
+        TLOAD(tAMX[i][k], gAMX);
       }
 
       #pragma clang loop unroll(full)
@@ -858,13 +860,13 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           auto gBMX = gBMXIter(k, j);
           tile_shapeB tB;
           tile_shapeBMX tBMX;
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if(j==0){
             auto gA = gAIter(i+dM*R.m, k);
             auto gAMX = gAMXIter(i+dM*R.m, k);
-            TCOPYIN(tA[i][k], gA);
-            TCOPYIN(tAMX[i][k], gAMX);
+            TLOAD(tA[i][k], gA);
+            TLOAD(tAMX[i][k], gAMX);
           }
           if(k==0){
             MATMULMX(tACC, tA[i][k], tAMX[i][k], tB, tBMX);
@@ -883,10 +885,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
             auto gAMX = gAMXIter(i+dM*R.m, k);
             auto gB = gBIter(k, j);
             auto gBMX = gBMXIter(k, j);
-            TCOPYIN(tA_tmp, gA);
-            TCOPYIN(tAMX_tmp, gAMX);
-            TCOPYIN(tB, gB);
-            TCOPYIN(tBMX, gBMX);
+            TLOAD(tA_tmp, gA);
+            TLOAD(tAMX_tmp, gAMX);
+            TLOAD(tB, gB);
+            TLOAD(tBMX, gBMX);
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           }
         }
@@ -903,10 +905,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           tile_shapeB_tcols tB;
           tile_shapeBMX_tcols tBMX;
 
-          TCOPYIN(tA_tmp, gA);
-          TCOPYIN(tAMX_tmp, gAMX);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA_tmp, gA);
+          TLOAD(tAMX_tmp, gAMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if constexpr(Kb>0){
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           } else {
@@ -914,7 +916,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           }
         }
         auto gC = gCIter(i+dM*R.m, j);
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
 
       // [rM, rmd_N, k]
@@ -927,8 +929,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           auto gBMX = gBMXIter(k, Nb);
           tile_shapeB_trows tB;
           tile_shapeBMX_trows tBMX;
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if(k==0){
             MATMULMX(tACC, tA[i][k], tAMX[i][k], tB, tBMX);
           }else{
@@ -946,10 +948,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
             auto gAMX = gAMXIter(i+dM*R.m, k);
             auto gB = gBIter(k, Nb);
             auto gBMX = gBMXIter(k, Nb);
-            TCOPYIN(tA_tmp, gA);
-            TCOPYIN(tAMX_tmp, gAMX);
-            TCOPYIN(tB, gB);
-            TCOPYIN(tBMX, gBMX);
+            TLOAD(tA_tmp, gA);
+            TLOAD(tAMX_tmp, gAMX);
+            TLOAD(tB, gB);
+            TLOAD(tBMX, gBMX);
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           }
         }
@@ -966,10 +968,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           tile_shapeB_tcorner tB;
           tile_shapeBMX_tcorner tBMX;
 
-          TCOPYIN(tA_tmp, gA);
-          TCOPYIN(tAMX_tmp, gAMX);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA_tmp, gA);
+          TLOAD(tAMX_tmp, gAMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           if constexpr(Kb>0){
             MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
           } else {
@@ -977,7 +979,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           }
         }
         auto gC = gCIter(i+dM*R.m, Nb);
-        TCOPYOUT_ACC(gC, tACC);        
+        TSTORE_ACC(gC, tACC);
       }
     }
   }
@@ -986,13 +988,13 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
   if constexpr (rmd_M) {
     tile_shapeA_tcols tA[R.k];
     tile_shapeAMX_tcols tAMX[R.k];
-    
+
     #pragma clang loop unroll(full)
     for(int k=0; k<R.k; k++){
       auto gA = gAIter(Mb, k);
       auto gAMX = gAMXIter(Mb, k);
-      TCOPYIN(tA[k], gA);
-      TCOPYIN(tAMX[k], gAMX);
+      TLOAD(tA[k], gA);
+      TLOAD(tAMX[k], gAMX);
     }
 
     #pragma clang loop unroll(full)
@@ -1005,8 +1007,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         auto gBMX = gBMXIter(k, j);
         tile_shapeB tB;
         tile_shapeBMX tBMX;
-        TCOPYIN(tB, gB);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tB, gB);
+        TLOAD(tBMX, gBMX);
         if(k==0){
           MATMULMX(tACC, tA[k], tAMX[k], tB, tBMX);
         }else{
@@ -1024,10 +1026,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           auto gAMX = gAMXIter(Mb, k);
           auto gB = gBIter(k, j);
           auto gBMX = gBMXIter(k, j);
-          TCOPYIN(tA_tmp, gA);
-          TCOPYIN(tAMX_tmp, gAMX);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA_tmp, gA);
+          TLOAD(tAMX_tmp, gAMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
         }
       }
@@ -1044,10 +1046,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         tile_shapeB_tcols tB;
         tile_shapeBMX_tcols tBMX;
 
-        TCOPYIN(tA_tmp, gA);
-        TCOPYIN(tAMX_tmp, gAMX);
-        TCOPYIN(tB, gB);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tA_tmp, gA);
+        TLOAD(tAMX_tmp, gAMX);
+        TLOAD(tB, gB);
+        TLOAD(tBMX, gBMX);
         if constexpr(Kb>0){
           MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
         } else {
@@ -1055,7 +1057,7 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         }
       }
       auto gC = gCIter(Mb, j);
-      TCOPYOUT_ACC(gC, tACC);
+      TSTORE_ACC(gC, tACC);
     }
 
     // [rmd_M, rmd_N, k]
@@ -1068,8 +1070,8 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         auto gBMX = gBMXIter(k, Nb);
         tile_shapeB_trows tB;
         tile_shapeBMX_trows tBMX;
-        TCOPYIN(tB, gB);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tB, gB);
+        TLOAD(tBMX, gBMX);
         if(k==0){
           MATMULMX(tACC, tA[k], tAMX[k], tB, tBMX);
         }else{
@@ -1087,10 +1089,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
           auto gAMX = gAMXIter(Mb, k);
           auto gB = gBIter(k, Nb);
           auto gBMX = gBMXIter(k, Nb);
-          TCOPYIN(tA_tmp, gA);
-          TCOPYIN(tAMX_tmp, gAMX);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA_tmp, gA);
+          TLOAD(tAMX_tmp, gAMX);
+          TLOAD(tB, gB);
+          TLOAD(tBMX, gBMX);
           MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
         }
       }
@@ -1107,10 +1109,10 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         tile_shapeB_tcorner tB;
         tile_shapeBMX_tcorner tBMX;
 
-        TCOPYIN(tA_tmp, gA);
-        TCOPYIN(tAMX_tmp, gAMX);
-        TCOPYIN(tB, gB);
-        TCOPYIN(tBMX, gBMX);
+        TLOAD(tA_tmp, gA);
+        TLOAD(tAMX_tmp, gAMX);
+        TLOAD(tB, gB);
+        TLOAD(tBMX, gBMX);
         if constexpr(Kb>0){
           MATMACCMX(tACC, tA_tmp, tAMX_tmp, tB, tBMX);
         } else {
@@ -1118,25 +1120,25 @@ void matmul_mxfp_notcvt_reuseA(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *
         }
       }
       auto gC = gCIter(Mb, Nb);
-      TCOPYOUT_ACC(gC, tACC);        
+      TSTORE_ACC(gC, tACC);
     }
   }
 }
 
 // typeb_wfactor: bit-width ratio between typeA and typeB, e.g. fp8 is 2x fp4x2.
 // smatrix_wfactor: bit-width ratio between the scaling matrix and the compute matrix.
-template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK, 
+template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK,
           typename dtypeB = dtypeA, const int typeb_wfactor = 1, const int smatrix_wfactor=1>
 void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) {
   // only support regular shape now for this operator!
-  static_assert(gM % tM == 0); 
+  static_assert(gM % tM == 0);
   static_assert(gN % tN == 0);
   static_assert(gK % tK == 0);
   using gm_shapeA = global_tensor<dtypeA, RowMajor<gM, gK/typeb_wfactor>>;
   using gm_shapeB = global_tensor<dtypeB, RowMajor<gK/typeb_wfactor, gN>>;
   using gm_shapeC = global_tensor<float, RowMajor<gM, gN>>;
 
-  using tile_shapeA = TileLeft<dtypeA, tM, tK/typeb_wfactor>; 
+  using tile_shapeA = TileLeft<dtypeA, tM, tK/typeb_wfactor>;
   using tile_shapeB = TileRight<dtypeB, tK/typeb_wfactor, tN>;
   using tile_shapeACC = TileAcc<float, tM, tN>;
   using itA = global_iterator<gm_shapeA, tile_shapeA>;
@@ -1147,7 +1149,7 @@ void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src
   itB gBIter(src1);
   itC gCIter(dst);
 
-  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>; 
+  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>;
   // gm_shapeAMX gAMX(src0_mx);
   // using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK/smatrix_wfactor, BLayout::RowMajor, tM, tK/smatrix_wfactor>; // actual tile size is <tM, tK/32>; must be initialized to 0
   using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK, BLayout::RowMajor, tM, tK/smatrix_wfactor>; // actual tile size is <tM, tK/32>; must be initialized to 0
@@ -1184,10 +1186,10 @@ void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src
           tile_shapeB tB;
           tile_shapeAMX tAMX;
           tile_shapeBMX tBMX;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tAMX, gAMX);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
+          TLOAD(tAMX, gAMX);
+          TLOAD(tBMX, gBMX);
 
             if(k==0){
               MATMULMX(tACC, tA, tAMX, tB, tBMX);
@@ -1197,25 +1199,25 @@ void matmul_mxfp_notcvt_old(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src
               // MATMACC(tACC, tA, tB);
             }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
     }
   }
 }
 
 // typeb_wfactor: bit-width ratio between typeA and typeB, e.g. fp8 is 2x fp4x2.
 // smatrix_wfactor: bit-width ratio between the scaling matrix and the compute matrix.
-template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK, 
+template <typename dtypeA, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK,
           typename dtypeB = dtypeA, const int typeb_wfactor = 1, const int smatrix_wfactor=1>
 void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx, uint8_t *src1_mx) {
   // only support regular shape now for this operator!
-  static_assert(gM % tM == 0); 
+  static_assert(gM % tM == 0);
   static_assert(gN % tN == 0);
   static_assert(gK % tK == 0);
   using gm_shapeA = global_tensor<dtypeA, RowMajor<gM, gK/typeb_wfactor>>;
   using gm_shapeB = global_tensor<dtypeB, RowMajor<gK/typeb_wfactor, gN>>;
   using gm_shapeC = global_tensor<float, RowMajor<gM, gN>>;
 
-  using tile_shapeA = TileLeft<dtypeA, tM, tK/typeb_wfactor>; 
+  using tile_shapeA = TileLeft<dtypeA, tM, tK/typeb_wfactor>;
   using tile_shapeB = TileRight<dtypeB, tK/typeb_wfactor, tN>;
   using tile_shapeACC = TileAcc<float, tM, tN>;
   using itA = global_iterator<gm_shapeA, tile_shapeA>;
@@ -1226,7 +1228,7 @@ void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx,
   itB gBIter(src1);
   itC gCIter(dst);
 
-  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>; 
+  using gm_shapeAMX = global_tensor<uint8_t, RowMajor<gM, gK/smatrix_wfactor>>;
   // gm_shapeAMX gAMX(src0_mx);
   // using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK/smatrix_wfactor, BLayout::RowMajor, tM, tK/smatrix_wfactor>; // actual tile size is <tM, tK/32>; must be initialized to 0
   using tile_shapeAMX = Tile<Location::Scaling, uint8_t, tM, tK, BLayout::RowMajor, tM, tK/smatrix_wfactor>; // actual tile size is <tM, tK/32>; must be initialized to 0
@@ -1263,10 +1265,10 @@ void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx,
           tile_shapeB tB;
           tile_shapeAMX tAMX;
           tile_shapeBMX tBMX;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
-          TCOPYIN(tAMX, gAMX);
-          TCOPYIN(tBMX, gBMX);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
+          TLOAD(tAMX, gAMX);
+          TLOAD(tBMX, gBMX);
 
             if(k==0){
               MATMUL(tACC, tA, tB);
@@ -1275,7 +1277,7 @@ void matmul_fp_notcvt(float *dst, dtypeA *src0, dtypeB *src1, uint8_t *src0_mx,
               MATMACC(tACC, tA, tB);
             }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
     }
   }
 }
@@ -1323,7 +1325,7 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
   using gm_shapeA = global_tensor<dtypeA, RowMajor<gM, gK>>;
   using gm_shapeB = global_tensor<dtypeB, RowMajor<gK/width_factor, gN>>;
   // Fake quantization with the scale fixed to float; group size is 128: 128 fp4 values share one scaling factor, and each 128-element partial sum is multiplied by the scale
-  using gm_shape_scale = global_tensor<float, RowMajor<gK/128, gN>>; 
+  using gm_shape_scale = global_tensor<float, RowMajor<gK/128, gN>>;
   using gm_shapeACC = global_tensor<float, RowMajor<gM, gN>>;
   using tile_shapeA = TileLeft<dtypeA, trow, tK, tM, tK>;
   using tile_shapeB = TileRight<dtypeB, tK/width_factor, tcol, tK/width_factor, tcol>;
@@ -1331,7 +1333,7 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
   using tile_shape_dequant = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, tM, tN>;
   using tile_shapeACC = TileAcc<float, trow, tcol, tM, tN>;
   // copy of acc, input as vector
-  using tile_ACCin = Tile<Location::Vec, float, trow, tcol, BLayout::ColMajor, tM, tN>; 
+  using tile_ACCin = Tile<Location::Vec, float, trow, tcol, BLayout::ColMajor, tM, tN>;
 
   using itA = global_iterator<gm_shapeA, tile_shapeA>;
   using itB = global_iterator<gm_shapeB, tile_shapeB>;
@@ -1367,9 +1369,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
   using tile_shape_scale_tcols = Tile<Location::Scaling, float, tK/128, tcol, BLayout::RowMajor, rmd_K/128, tN, SLayout::NoneBox>;
   using tile_shape_scale_tcorner = Tile<Location::Scaling, float, tK/128, tcol, BLayout::RowMajor, rmd_K/128, rmd_N, SLayout::NoneBox>;
 
-  using tile_ACCin_trows = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, tM, rmd_N>; 
-  using tile_ACCin_tcols = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, rmd_M, tN>; 
-  using tile_ACCin_tconer = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, rmd_M, rmd_N>; 
+  using tile_ACCin_trows = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, tM, rmd_N>;
+  using tile_ACCin_tcols = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, rmd_M, tN>;
+  using tile_ACCin_tconer = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, rmd_M, rmd_N>;
 
   using tile_shape_dequant_trows = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, tM, rmd_N>;
   using tile_shape_dequant_tcols = Tile<Location::Vec, float, trow, tcol, BLayout::RowMajor, rmd_M, tN>;
@@ -1392,9 +1394,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
         tile_shapeB tB;
         tile_shape_scale ts;
         tile_shape_dequant tC_dequant;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
-        TCOPYIN(ts, gS); // [1, tN]
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
+        TLOAD(ts, gS); // [1, tN]
 
         MATMUL(tACC, tA, tB);
         TCVT(tACCin, tACC);//[tM, tN] 256->1 , 256 -> 2 scaling factor
@@ -1413,22 +1415,22 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
         tile_shape_scale_tcols ts;
         tile_shape_dequant tC_dequant;
 
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
-        TCOPYIN(ts, gS);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
+        TLOAD(ts, gS);
 
         MATMUL(tACC, tA, tB);
         TCVT(tACCin, tACC);
         dequant_acc<tile_ACCin, tile_shape_scale, tile_shape_dequant><<<tile_shapeACC::ValidRow, tile_shapeACC::ValidCol, 1>>>(tACCin.data(), ts.data(), tAdder[k%2].data(), tC_dequant.data());
         tAdder[(k+1)%2] = tC_dequant;
       }
-      TCOPYOUT(gACC, tAdder[(k+1)%2]);
+      TSTORE(gACC, tAdder[(k+1)%2]);
     }
     // if constexpr (rmd_N) // TODO
   }
   if constexpr (rmd_M) {
     for (int j = 0; j < Nb; ++j) {
-      auto gACC = gACCIter(Mb, j); 
+      auto gACC = gACCIter(Mb, j);
       tile_shapeC_tcols tACC;
       tile_ACCin_tcols tACCin;
       tile_shape_dequant_tcols tAdder[2];
@@ -1444,9 +1446,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
         tile_shapeB tB;
         tile_shape_scale ts;
         tile_shape_dequant_tcols tC_dequant;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
-        TCOPYIN(ts, gS);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
+        TLOAD(ts, gS);
         MATMUL(tACC, tA, tB);
         TCVT(tACCin, tACC);
         dequant_acc<tile_ACCin_tcols, tile_shape_scale, tile_shape_dequant_tcols><<<tile_ACCin_tcols::ValidRow, tile_ACCin_tcols::ValidCol, 1>>>(tACCin.data(), ts.data(), tAdder[k%2].data(), tC_dequant.data());
@@ -1462,15 +1464,15 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
         tile_shapeB_tcols tB;
         tile_shape_scale_tcols ts;
         tile_shape_dequant tC_dequant;
-        TCOPYIN(tA, gA);
-        TCOPYIN(tB, gB);
-        TCOPYIN(ts, gS);
+        TLOAD(tA, gA);
+        TLOAD(tB, gB);
+        TLOAD(ts, gS);
         MATMUL(tACC, tA, tB);
         TCVT(tACCin, tACC);
         dequant_acc<tile_ACCin_tcols, tile_shape_scale_tcols, tile_shape_dequant_tcols><<<tile_shapeACC::ValidRow, tile_shapeACC::ValidCol, 1>>>(tACCin.data(), ts.data(), tAdder[k%2].data(), tC_dequant.data());
         tAdder[(k+1)%2] = tC_dequant;
       }
-      TCOPYOUT(gACC, tAdder[(k+1)%2]);
+      TSTORE(gACC, tAdder[(k+1)%2]);
     }
     // todo
     // if constexpr (rmd_N) {
@@ -1483,9 +1485,9 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
 
     //     tile_shapeA_tcols tA;
     //     tile_shapeB_trows tB;
-    //     TCOPYIN(tA, gA);
-    //     TCOPYIN(tB, gB);
-    //     MATMUL(tACC, tA, tB);        
+    //     TLOAD(tA, gA);
+    //     TLOAD(tB, gB);
+    //     MATMUL(tACC, tA, tB);
     //   }
     //   #pragma clang loop unroll(full)
     //   for (int k = 1; k < Kb; ++k) {
@@ -1494,8 +1496,8 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
 
     //     tile_shapeA_tcols tA;
     //     tile_shapeB_trows tB;
-    //     TCOPYIN(tA, gA);
-    //     TCOPYIN(tB, gB);
+    //     TLOAD(tA, gA);
+    //     TLOAD(tB, gB);
     //     MATMACC(tACC, tA, tB);
     //   }
     //   if constexpr (rmd_K) {
@@ -1504,18 +1506,18 @@ void matmul_mp(float *acc_ptr, dtypeA *a_ptr, dtypeB *b_ptr, float *c_ptr) {
 
     //     tile_shapeA_tcorner tA;
     //     tile_shapeB_tcorner tB;
-    //     TCOPYIN(tA, gA);
-    //     TCOPYIN(tB, gB);
+    //     TLOAD(tA, gA);
+    //     TLOAD(tB, gB);
     //     if constexpr(Kb>0){
     //       MATMACC(tACC, tA, tB);
     //     } else {
     //       MATMUL(tACC, tA, tB);
     //     }
     //   }
-    //   TCOPYOUT_ACC(gC, tACC);
+    //   TSTORE_ACC(gC, tACC);
     // }
   }
 }
 
 
-#endif
\ No newline at end of file
+#endif
diff --git a/kernels/memory/broadcast.hpp b/kernels/memory/broadcast.hpp
index 5276b2b..0b6756a 100644
--- a/kernels/memory/broadcast.hpp
+++ b/kernels/memory/broadcast.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -162,7 +162,7 @@ void gen_offset_impl(
     const size_t total_elements) {
     static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1,
                   "Only static shape supported");
-                  
+
     #if MAX_DIMs >= 1
     size_t in_shape0 = in_shape[0];
     size_t out_shape0 = out_shape[0];
@@ -236,11 +236,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -276,18 +276,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -296,7 +296,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_019.hpp b/kernels/memory/broadcast_019.hpp
index b6916f2..c1e9b28 100644
--- a/kernels/memory/broadcast_019.hpp
+++ b/kernels/memory/broadcast_019.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -268,11 +268,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -308,18 +308,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -328,7 +328,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         // MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_039.hpp b/kernels/memory/broadcast_039.hpp
index b3fc5e5..1faf050 100644
--- a/kernels/memory/broadcast_039.hpp
+++ b/kernels/memory/broadcast_039.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -317,11 +317,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -357,18 +357,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -377,7 +377,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         // MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_07.hpp b/kernels/memory/broadcast_07.hpp
index 4b3b7b1..d38101c 100644
--- a/kernels/memory/broadcast_07.hpp
+++ b/kernels/memory/broadcast_07.hpp
@@ -11,7 +11,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -268,11 +268,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -308,18 +308,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -328,7 +328,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_07_simple.hpp b/kernels/memory/broadcast_07_simple.hpp
index 75e03d0..1e39c97 100644
--- a/kernels/memory/broadcast_07_simple.hpp
+++ b/kernels/memory/broadcast_07_simple.hpp
@@ -40,11 +40,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -66,7 +66,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile, inGm, offsetTile);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         auto gO = gOIter(0, Mb);
@@ -74,6 +74,6 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 }
\ No newline at end of file
diff --git a/kernels/memory/broadcast_Hunyuan.hpp b/kernels/memory/broadcast_Hunyuan.hpp
index 80b5b1c..191ab31 100644
--- a/kernels/memory/broadcast_Hunyuan.hpp
+++ b/kernels/memory/broadcast_Hunyuan.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -273,11 +273,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -313,18 +313,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -333,7 +333,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_mscatter.hpp b/kernels/memory/broadcast_mscatter.hpp
index f38161b..958da18 100644
--- a/kernels/memory/broadcast_mscatter.hpp
+++ b/kernels/memory/broadcast_mscatter.hpp
@@ -167,7 +167,7 @@ void broadcast_mscatter(
     // ======================
     for (int i = 0; i < input_tiles; ++i) {
         auto gIn = gInIter(0, i);
-        TCOPYIN(inDataTile, gIn);
+        TLOAD(inDataTile, gIn);
 
         // Reset the broadcast step counters
         memset(bcast_step, 0, sizeof(bcast_step));
@@ -185,7 +185,7 @@ void broadcast_mscatter(
             // Scatter write
             MSCATTER(outGm, inDataTile, offsetTile);
             auto gOffset = gOffsetIter(0, offset_idx);
-            TCOPYOUT(gOffset, offsetTile);
+            TSTORE(gOffset, offsetTile);
             offset_idx ++;
 
             // Next set of broadcast coordinates
@@ -200,7 +200,7 @@ void broadcast_mscatter(
     if constexpr (rmd_input > 0) {
         auto gIn = gInIter(0, input_tiles);
         total_elements = rmd_input;
-        TCOPYIN(inDataTile_rmd, gIn);
+        TLOAD(inDataTile_rmd, gIn);
 
         memset(bcast_step, 0, sizeof(bcast_step));
         done = false;
@@ -212,7 +212,7 @@ void broadcast_mscatter(
             );
             MSCATTER(outGm, inDataTile_rmd, offsetTile_rmd);
             auto gOffset = gOffsetIter(0, offset_idx);
-            TCOPYOUT(gOffset, offsetTile_rmd);
+            TSTORE(gOffset, offsetTile_rmd);
             offset_idx ++;
 
             done = !next_broadcast_step();
diff --git a/kernels/memory/broadcast_nomg.hpp b/kernels/memory/broadcast_nomg.hpp
index 16c5332..c619c0f 100644
--- a/kernels/memory/broadcast_nomg.hpp
+++ b/kernels/memory/broadcast_nomg.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -162,7 +162,7 @@ void gen_offset_impl(
     const size_t total_elements) {
     static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1,
                   "Only static shape supported");
-                  
+
     #if MAX_DIMs >= 1
     size_t in_shape0 = in_shape[0];
     size_t out_shape0 = out_shape[0];
@@ -236,11 +236,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -276,18 +276,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         // MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, offsetTile);
+        TSTORE(gO, offsetTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -296,7 +296,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         // MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, offsetTile_rmd);
+        TSTORE(gO, offsetTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_nocopyout.hpp b/kernels/memory/broadcast_nostore.hpp
similarity index 97%
rename from kernels/memory/broadcast_nocopyout.hpp
rename to kernels/memory/broadcast_nostore.hpp
index 5340d09..4d429b0 100644
--- a/kernels/memory/broadcast_nocopyout.hpp
+++ b/kernels/memory/broadcast_nostore.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -162,7 +162,7 @@ void gen_offset_impl(
     const size_t total_elements) {
     static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1,
                   "Only static shape supported");
-                  
+
     #if MAX_DIMs >= 1
     size_t in_shape0 = in_shape[0];
     size_t out_shape0 = out_shape[0];
@@ -229,18 +229,18 @@ void gen_offset_impl(
 
 
 template<typename dtype, size_t MAX_DIM = 8, size_t IN_DIM, size_t OUT_DIM, size_t gIM, size_t gOM, size_t tM>
-void broadcast_nocopyout(
+void broadcast_nostore(
     dtype *in_ptr,
     dtype *out_ptr,
     const size_t *in_shape,
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -273,10 +273,10 @@ void broadcast_nocopyout(
         auto gO = gOIter(0, i);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         MGATHER(outTile, inGm, offsetTile);
 
-        // TCOPYOUT(gO, outTile);
+        // TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         auto gO = gOIter(0, Mb);
@@ -284,7 +284,7 @@ void broadcast_nocopyout(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        // TCOPYOUT(gO, outTile_rmd);
+        // TSTORE(gO, outTile_rmd);
     }
 }
 
diff --git a/kernels/memory/broadcast_simple.hpp b/kernels/memory/broadcast_simple.hpp
index c8ce29f..43e135e 100644
--- a/kernels/memory/broadcast_simple.hpp
+++ b/kernels/memory/broadcast_simple.hpp
@@ -85,7 +85,7 @@ void gen_offset_impl(
     const size_t total_elements) {
     static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1,
                   "Only static shape supported");
-                  
+
     #if MAX_DIMs >= 1
     size_t in_shape0 = in_shape[0];
     size_t out_shape0 = out_shape[0];
@@ -159,11 +159,11 @@ void broadcast(
     const size_t *out_shape
     ) {
     const size_t Mb = gOM / tM;
-    const size_t rmd_M = gOM % tM; 
+    const size_t rmd_M = gOM % tM;
 
     using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeOffset_rmd = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor, 1, rmd_M>;
@@ -199,18 +199,18 @@ void broadcast(
         // printf("total_elements = %d\n", total_elements);
         // printf("in_shape[0] = %d\n", in_shape[0]);
         // printf("inGm = %ld\n", inGm);
-        
-        // TCOPYIN(inTile, gI);
+
+        // TLOAD(inTile, gI);
         // DUMP_TILE("inTile", inTile, g_dump_inTile, 1, tM);
         gen_offset_impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
-        
+
         // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
 
         MGATHER(outTile, inGm, offsetTile);
 
         // DUMP_TILE("outTile", outTile, g_dump_outTile, 1, tM);
-        TCOPYOUT(gO, outTile);
+        TSTORE(gO, outTile);
     }
     if constexpr (rmd_M) {
         // printf("rmd_M = %d\n", rmd_M);
@@ -219,7 +219,7 @@ void broadcast(
         gen_offset_impl<dtype, tile_shapeOffset_rmd, MAX_DIM, IN_DIM, OUT_DIM>(offsetTile_rmd, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(outTile_rmd, inGm, offsetTile_rmd);
-        TCOPYOUT(gO, outTile_rmd);
+        TSTORE(gO, outTile_rmd);
     }
 
 }
diff --git a/kernels/memory/broadcast_vec_019.hpp b/kernels/memory/broadcast_vec_019.hpp
index e58991b..d03fdbb 100644
--- a/kernels/memory/broadcast_vec_019.hpp
+++ b/kernels/memory/broadcast_vec_019.hpp
@@ -5,7 +5,7 @@
 using namespace pto;
 
 // =====================================================================
-// Broadcast (B,1,K) -> (B,N,K) via TCOPYIN + __vec__ broadcast + TCOPYOUT
+// Broadcast (B,1,K) -> (B,N,K) via TLOAD + __vec__ broadcast + TSTORE
 //
 // Optimized for: (1280,1,49) -> (1280,8,49), dtype=half
 //
@@ -17,7 +17,7 @@ using namespace pto;
 // Processing strategy:
 //   Divide B batches into tiles of kTileBatch batches each.
 //   Per tile:
-//     1. TCOPYIN  (kTileBatch, K)   from GlobalMem -> TileReg
+//     1. TLOAD  (kTileBatch, K)   from GlobalMem -> TileReg
 //        Reads kTileBatch * K contiguous elements.
 //     2. __vec__ broadcast within TileReg:
 //        Launch <<<K, N*kTileBatch, 1>>> threads:
@@ -27,7 +27,7 @@ using namespace pto;
 //          batch_idx = y & (kTileBatch - 1)  (0..kTileBatch-1, bitwise)
 //          Read  src[batch_idx * RowStride + x]
 //          Write dst[batch_idx * RowStride + copy * K + x]
-//     3. TCOPYOUT (kTileBatch, N*K)  from TileReg -> GlobalMem
+//     3. TSTORE (kTileBatch, N*K)  from TileReg -> GlobalMem
 //
 // TileReg layout:
 //   Physical tile cols = 512 (padded for 512B alignment).
@@ -111,23 +111,23 @@ void broadcast(dtype *in_ptr, dtype *out_ptr,
 
     for (size_t i = 0; i < Nb; i++) {
         gm_in gsrc(in_ptr + i * kTileBatch * kInner);
-        TCOPYIN(inTile, gsrc);
+        TLOAD(inTile, gsrc);
 
         vec_broadcast_3d<tile_out, tile_in, kInner, kBCast, kTileBatch>
             <<<kInner, kBCast * kTileBatch, 1>>>(outTile.data(), inTile.data());
 
         gm_out gdst(out_ptr + i * kTileBatch * kBCast * kInner);
-        TCOPYOUT(gdst, outTile);
+        TSTORE(gdst, outTile);
     }
 
     if constexpr (rmd > 0) {
         gm_in gsrc(in_ptr + Nb * kTileBatch * kInner);
-        TCOPYIN(inTile_rmd, gsrc);
+        TLOAD(inTile_rmd, gsrc);
 
         vec_broadcast_3d<tile_out_r, tile_in_r, kInner, kBCast, rmd>
             <<<kInner, kBCast * rmd, 1>>>(outTile_rmd.data(), inTile_rmd.data());
 
         gm_out gdst(out_ptr + Nb * kTileBatch * kBCast * kInner);
-        TCOPYOUT(gdst, outTile_rmd);
+        TSTORE(gdst, outTile_rmd);
     }
 }
\ No newline at end of file
diff --git a/kernels/memory/broadcast_vec_039.hpp b/kernels/memory/broadcast_vec_039.hpp
index 72cccbd..58cd521 100644
--- a/kernels/memory/broadcast_vec_039.hpp
+++ b/kernels/memory/broadcast_vec_039.hpp
@@ -5,7 +5,7 @@
 using namespace pto;
 
 // =====================================================================
-// Broadcast (B,1,K) -> (B,N,K) via TCOPYIN + __vec__ broadcast + TCOPYOUT
+// Broadcast (B,1,K) -> (B,N,K) via TLOAD + __vec__ broadcast + TSTORE
 //
 // Optimized for: (8192,1,16) -> (8192,8,16), dtype=half
 //
@@ -17,7 +17,7 @@ using namespace pto;
 // Processing strategy:
 //   Divide B batches into tiles of kTileBatch batches each.
 //   Per tile:
-//     1. TCOPYIN  (kTileBatch, K)   from GlobalMem -> TileReg
+//     1. TLOAD  (kTileBatch, K)   from GlobalMem -> TileReg
 //        Reads kTileBatch * K contiguous elements.
 //     2. __vec__ broadcast within TileReg:
 //        For each batch, replicate its K elements N times (row-wise).
@@ -28,7 +28,7 @@ using namespace pto;
 //          col  = x % K (inner column   0..K-1)
 //          Read  src[y * RowStride + col]
 //          Write dst[y * RowStride + x]
-//     3. TCOPYOUT (kTileBatch, N*K)  from TileReg -> GlobalMem
+//     3. TSTORE (kTileBatch, N*K)  from TileReg -> GlobalMem
 //
 // TileReg layout:
 //   Physical tile cols = 256 (padded for 512B alignment).
@@ -113,23 +113,23 @@ void broadcast(dtype *in_ptr, dtype *out_ptr,
 
     for (size_t i = 0; i < Nb; i++) {
         gm_in gsrc(in_ptr + i * kTileBatch * kInner);
-        TCOPYIN(inTile, gsrc);
+        TLOAD(inTile, gsrc);
 
         vec_broadcast_3d<tile_out, tile_in, kInner>
             <<<kBCast * kInner, kTileBatch, 1>>>(outTile.data(), inTile.data());
 
         gm_out gdst(out_ptr + i * kTileBatch * kBCast * kInner);
-        TCOPYOUT(gdst, outTile);
+        TSTORE(gdst, outTile);
     }
 
     if constexpr (rmd > 0) {
         gm_in gsrc(in_ptr + Nb * kTileBatch * kInner);
-        TCOPYIN(inTile_rmd, gsrc);
+        TLOAD(inTile_rmd, gsrc);
 
         vec_broadcast_3d<tile_out_r, tile_in_r, kInner>
             <<<kBCast * kInner, rmd, 1>>>(outTile_rmd.data(), inTile_rmd.data());
 
         gm_out gdst(out_ptr + Nb * kTileBatch * kBCast * kInner);
-        TCOPYOUT(gdst, outTile_rmd);
+        TSTORE(gdst, outTile_rmd);
     }
 }
\ No newline at end of file
diff --git a/kernels/memory/broadcast_vec_07.hpp b/kernels/memory/broadcast_vec_07.hpp
index 5934187..2129afb 100644
--- a/kernels/memory/broadcast_vec_07.hpp
+++ b/kernels/memory/broadcast_vec_07.hpp
@@ -5,20 +5,20 @@
 using namespace pto;
 
 // =====================================================================
-// Broadcast (N,1) -> (N,C) via TCOPYIN + __vec__ broadcast + TCOPYOUT
+// Broadcast (N,1) -> (N,C) via TLOAD + __vec__ broadcast + TSTORE
 //
 // Optimized for: (1443,1) -> (1443,129), dtype=half
 //
 // Processing strategy:
 //   Divide N rows into tiles of kTileRows rows each.
 //   Per tile:
-//     1. TCOPYIN   (kTileRows, 1)   from GlobalMem -> TileReg
+//     1. TLOAD   (kTileRows, 1)   from GlobalMem -> TileReg
 //     2. __vec__ broadcast (kTileRows, 1) -> (kTileRows, C) within TileReg
 //        Launch <<<kC, kTileRows, 1>>> threads:
 //          x = column index (0..kC-1), y = row index (0..kTileRows-1)
 //          Each thread reads src[j*src_RowStride] (col 0 of row j)
 //          and writes to dst[i + j*dst_RowStride] (col i of row j)
-//     3. TCOPYOUT  (kTileRows, C)   from TileReg  -> GlobalMem
+//     3. TSTORE  (kTileRows, C)   from TileReg  -> GlobalMem
 //
 // TileReg layout:
 //   Physical tile cols padded to 256 for 512B alignment.
@@ -77,13 +77,13 @@ void broadcast(dtype *in_ptr, dtype *out_ptr,
 
     for (size_t i = 0; i < Nb; i++) {
         gm_in gsrc(in_ptr + i * kTileRows);
-        TCOPYIN(inTile, gsrc);
+        TLOAD(inTile, gsrc);
 
         vec_broadcast_rowmajor<tile_out, tile_in>
             <<<kC, kTileRows, 1>>>(outTile.data(), inTile.data());
 
         gm_out gdst(out_ptr + i * kTileRows * kC);
-        TCOPYOUT(gdst, outTile);
+        TSTORE(gdst, outTile);
     }
 
     using tile_in_r  = Tile<Location::Vec, dtype, kTileRows, tileCols,
@@ -94,12 +94,12 @@ void broadcast(dtype *in_ptr, dtype *out_ptr,
     tile_out_r outTile_rmd;
     if constexpr (rmd > 0) {
         gm_in gsrc(in_ptr + Nb * kTileRows);
-        TCOPYIN(inTile_rmd, gsrc);
+        TLOAD(inTile_rmd, gsrc);
 
         vec_broadcast_rowmajor<tile_out_r, tile_in_r>
             <<<kC, rmd, 1>>>(outTile_rmd.data(), inTile_rmd.data());
 
         gm_out gdst(out_ptr + Nb * kTileRows * kC);
-        TCOPYOUT(gdst, outTile_rmd);
+        TSTORE(gdst, outTile_rmd);
     }
 }
\ No newline at end of file
diff --git a/kernels/memory/broadcast_vec_07_handwrite.hpp b/kernels/memory/broadcast_vec_07_handwrite.hpp
index 88cccf2..b6d3a6c 100644
--- a/kernels/memory/broadcast_vec_07_handwrite.hpp
+++ b/kernels/memory/broadcast_vec_07_handwrite.hpp
@@ -12,7 +12,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -58,7 +58,7 @@ void broadcast(
     const size_t rmd_N = rmd_M;
     const size_t vld_N = N * 129;    // actual number of elements written back per store
     const size_t rmd_vld_N = rmd_M * 129;    // tail block: actual number of elements written back per store
-    
+
 
     Assert(tO0 > 129);
     Assert(tO0 % 128 == 0);
@@ -86,26 +86,26 @@ void broadcast(
         auto gI = gIIter(0, i);
         t_in_offset = 0;
         t_out_offset = 0;
-        TCOPYIN(inTile, gI);
+        TLOAD(inTile, gI);
         for (int j = 0; j < N; ++j) {
             vec_broadcast<dtype, tile_shapeIn, tile_shapeOut><<<129, 1, 1>>>(inTile, outTile, t_in_offset, t_out_offset);
             t_in_offset += 1;
             t_out_offset += 129;
         }
-        TCOPYOUT(outTile, out_ptr);
+        TSTORE(outTile, out_ptr);
         out_ptr += sizeof(dtype) * vld_N;
     }
     if constexpr (rmd_M) {
         auto gI = gIIter(0, Mb);
         t_in_offset = 0;
         t_out_offset = 0;
-        TCOPYIN(inTile, gI);
+        TLOAD(inTile, gI);
         for (int j = 0; j < rmd_N; ++j) {
             vec_broadcast<dtype, tile_shapeIn_rmd, tile_shapeOut_rmd><<<129, 1, 1>>>(inTile_rmd, outTile_rmd, t_in_offset, t_out_offset);
             t_in_offset += 1;
             t_out_offset += 129;
         }
-        TCOPYOUT(outTile, out_ptr);
+        TSTORE(outTile, out_ptr);
         out_ptr += sizeof(dtype) * rmd_vld_N;
     }
 }
diff --git a/kernels/memory/concat_gather.hpp b/kernels/memory/concat_gather.hpp
index 85a69fd..404f7ae 100644
--- a/kernels/memory/concat_gather.hpp
+++ b/kernels/memory/concat_gather.hpp
@@ -16,7 +16,7 @@ using namespace pto;
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -36,7 +36,7 @@ void __vec__ gen_offset_concat(
     typename tile_shape::TileDType __out__ out,
     typename tile_Inshape::TileDType __in__ in_shape,
     typename tile_Outshape::TileDType  __in__ out_shape,
-//    const size_t in_dim, 
+//    const size_t in_dim,
     const size_t base,
     const size_t total_elements
 ) {
@@ -64,14 +64,14 @@ void __vec__ gen_offset_concat(
     // output 1-D index → output coordinates
     size_t out_coord[MAX_DIM] = {0};    //
     size_t tmp = idx;   //
-    
+
     #pragma clang loop unroll(full)
     for (int d = DATA_DIM - 1; d >= 0; d--) {
         out_coord[d] = tmp % out_shape_ptr[d];
         tmp /= out_shape_ptr[d];
     }
 
-    size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM]; 
+    size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM];
     size_t offset = out_coord[CONCAT_DIM] % in_shape_ptr[CONCAT_DIM];
 
     out_coord[CONCAT_DIM] = offset;
@@ -88,10 +88,10 @@ void __vec__ gen_offset_concat(
         }
     }
 */
-//    uint16_t in_offset = 0; 
-    uint32_t in_offset = 0;   
+//    uint16_t in_offset = 0;
+    uint32_t in_offset = 0;
 
-    #pragma clang loop unroll(full)    
+    #pragma clang loop unroll(full)
     for (int i = 0; i < DATA_DIM; i++) {
         in_offset += out_coord[i] * stride[i] * sizeof(dtype);
     }
@@ -109,7 +109,7 @@ void gen_offset_Impl(
 //    const size_t in_dim,
 //   const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0,   
+//    const size_t transpose_dim0,
     const size_t base,
     const size_t total_elements
     )
@@ -130,31 +130,31 @@ void concat_gather(
 //    const size_t in_dim,
 //    const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0   
-) 
+//    const size_t transpose_dim0
+)
 {
 
     const int Mb = gOM / tM;
-    
+
     const int rmd_M = gOM % tM; // TODO: how should the tail block be handled?
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;     // first declare the Tensor in GM as one-dimensional data 
+    using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;     // first declare the Tensor in GM as one-dimensional data
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
 
-    using gm_InDataShape = global_tensor<size_t, RowMajor<1, DATA_DIM>>;     // first declare the Tensor in GM as one-dimensional data 
+    using gm_InDataShape = global_tensor<size_t, RowMajor<1, DATA_DIM>>;     // first declare the Tensor in GM as one-dimensional data
     using gm_OutDataShape = global_tensor<size_t, RowMajor<1, DATA_DIM>>;
 
     using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; // TODO: how should the tail block be handled? Should it be passed as a parameter here?
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>; // TODO: must the location here be Vec, even if Vec is not passed in?
 
     using tile_Inshape = Tile<Location::Vec, size_t, 1, 32, BLayout::RowMajor, 1, DATA_DIM>; // TODO: must the location here be Vec, even if Vec is not passed in?
-    using tile_Outshape = Tile<Location::Vec, size_t, 1, 32, BLayout::RowMajor, 1, DATA_DIM>; // TODO: must the location here be Vec, even if Vec is not passed in?        
+    using tile_Outshape = Tile<Location::Vec, size_t, 1, 32, BLayout::RowMajor, 1, DATA_DIM>; // TODO: must the location here be Vec, even if Vec is not passed in?
 //    using tile_shapeOffset = Tile<Location::Vec, uint16_t, 1, tM, BLayout::RowMajor>; // TODO: must the location here be Vec, even if Vec is not passed in?
 
     gm_shapeIn inGm(in_ptr);
 
-    gm_InDataShape InShapeGm(in_shape);   
-    gm_OutDataShape OutShapeGm(out_shape);      
+    gm_InDataShape InShapeGm(in_shape);
+    gm_OutDataShape OutShapeGm(out_shape);
 
     tile_shapeData dataTile;
     tile_shapeOffset offsetTile;
@@ -175,25 +175,25 @@ void concat_gather(
 
     for (int i = 0; i < Mb; ++i) {
         auto gO = gOIter(0, i);
-        TCOPYIN(InshapeTile, InShapeGm);
-        TCOPYIN(OutshapeTile, OutShapeGm);
+        TLOAD(InshapeTile, InShapeGm);
+        TLOAD(OutshapeTile, OutShapeGm);
         gen_offset_Impl<dtype, tile_shapeOffset, tile_Inshape, tile_Outshape, MAX_DIM, DATA_DIM, CONCAT_DIM>(offsetTile, InshapeTile, OutshapeTile, base, total_elements);
 //        printf("end genoffset\n");
         base += total_elements;
 //        DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
         MGATHER(dataTile, inGm, offsetTile);
-//        printf("end mgather\n");        
-        TCOPYOUT(gO, dataTile);
+//        printf("end mgather\n");
+        TSTORE(gO, dataTile);
     }
-    if constexpr (rmd_M) {        
+    if constexpr (rmd_M) {
         auto gO = gOIter(0, Mb);
-        TCOPYIN(InshapeTile, InShapeGm);
-        TCOPYIN(OutshapeTile, OutShapeGm);        
+        TLOAD(InshapeTile, InShapeGm);
+        TLOAD(OutshapeTile, OutShapeGm);
         total_elements = rmd_M; // size of the tail tile
         gen_offset_Impl<dtype, tile_shapeOffset, tile_Inshape, tile_Outshape, MAX_DIM, DATA_DIM, CONCAT_DIM>(offsetTile, InshapeTile, OutshapeTile, base, total_elements);
         base += total_elements;
         MGATHER(dataTile, inGm, offsetTile);
-        TCOPYOUT(gO, dataTile);
+        TSTORE(gO, dataTile);
     }
 }
 
diff --git a/kernels/memory/concat_scatter.hpp b/kernels/memory/concat_scatter.hpp
index e0f14b8..13b6587 100644
--- a/kernels/memory/concat_scatter.hpp
+++ b/kernels/memory/concat_scatter.hpp
@@ -16,7 +16,7 @@ using namespace pto;
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -36,15 +36,15 @@ void __vec__ gen_offset_concat(
     typename tile_shape::TileDType __out__ out,
     typename tile_Inshape::TileDType __in__ in_shape,
     typename tile_Outshape::TileDType  __in__ out_shape,
-//    const size_t in_dim, 
+//    const size_t in_dim,
     const size_t base,
     const size_t total_elements
 ) {
     size_t index = blkv_get_index_x();
     size_t idx = blkv_get_index_x();
 
-    __vbuf__ typename tile_Inshape::DType *in_shape_ptr = blkv_get_tile_ptr(in_shape);  
-    __vbuf__ typename tile_Outshape::DType *out_shape_ptr = blkv_get_tile_ptr(out_shape);      
+    __vbuf__ typename tile_Inshape::DType *in_shape_ptr = blkv_get_tile_ptr(in_shape);
+    __vbuf__ typename tile_Outshape::DType *out_shape_ptr = blkv_get_tile_ptr(out_shape);
 
     if (index >= total_elements) return;
     idx = idx + base;   // TODO: idx is a vector and base is a scalar; this yields all the base addresses (i.e. base offsets)
@@ -64,7 +64,7 @@ void __vec__ gen_offset_concat(
     // output 1-D index → output coordinates
     size_t in_coord[MAX_DIM] = {0};    //
     size_t tmp = idx;   //
-    
+
     #pragma clang loop unroll(full)
     for (int d = DATA_DIM - 1; d >= 0; d--) {
         in_coord[d] = tmp % in_shape_ptr[d];
@@ -73,7 +73,7 @@ void __vec__ gen_offset_concat(
     size_t n = tmp;
     in_coord[CONCAT_DIM] = n * in_shape_ptr[CONCAT_DIM] + in_coord[CONCAT_DIM];
 
-//    size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM]; 
+//    size_t n = out_coord[CONCAT_DIM] / in_shape_ptr[CONCAT_DIM];
 //    size_t offset = out_coord[CONCAT_DIM] % in_shape_ptr[CONCAT_DIM];
 
 //    out_coord[CONCAT_DIM] = offset;
@@ -90,11 +90,11 @@ void __vec__ gen_offset_concat(
         }
     }
 */
-//    uint16_t in_offset = 0; 
-//    uint32_t out_offset = 0;  
-    uint16_t out_offset = 0;        
+//    uint16_t in_offset = 0;
+//    uint32_t out_offset = 0;
+    uint16_t out_offset = 0;
 
-    #pragma clang loop unroll(full)    
+    #pragma clang loop unroll(full)
     for (int i = 0; i < DATA_DIM; i++) {
         out_offset += in_coord[i] * stride[i] * sizeof(dtype);
     }
@@ -111,7 +111,7 @@ void gen_offset_Impl(
 //    const size_t in_dim,
 //   const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0,   
+//    const size_t transpose_dim0,
     const size_t base,
     const size_t total_elements
     )
@@ -132,33 +132,33 @@ void concat_scatter(
 //    const size_t in_dim,
 //    const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0   
-) 
+//    const size_t transpose_dim0
+)
 {
 
     const int Mb = gOM / tM;
-    
+
     const int rmd_M = gOM % tM; // TODO: how should the tail block be handled?
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;     // first declare the Tensor in GM as one-dimensional data 
+    using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;     // first declare the Tensor in GM as one-dimensional data
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
 
-    using gm_InDataShape = global_tensor<size_t, RowMajor<1, DATA_DIM>>;     // first declare the Tensor in GM as one-dimensional data 
+    using gm_InDataShape = global_tensor<size_t, RowMajor<1, DATA_DIM>>;     // first declare the Tensor in GM as one-dimensional data
     using gm_OutDataShape = global_tensor<size_t, RowMajor<1, DATA_DIM>>;
 
     using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; // TODO: how should the tail block be handled? Should it be passed as a parameter here?
-    using tile_shapeOffset = Tile<Location::Vec, uint16_t, 1, tM, BLayout::RowMajor>; // TODO: must the location here be Vec, even if Vec is not passed in?    
+    using tile_shapeOffset = Tile<Location::Vec, uint16_t, 1, tM, BLayout::RowMajor>; // TODO: must the location here be Vec, even if Vec is not passed in?
 //    using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>; // TODO: must the location here be Vec, even if Vec is not passed in?
 
     using tile_Inshape = Tile<Location::Vec, size_t, 1, 32, BLayout::RowMajor, 1, DATA_DIM>; // TODO: must the location here be Vec, even if Vec is not passed in?
-    using tile_Outshape = Tile<Location::Vec, size_t, 1, 32, BLayout::RowMajor, 1, DATA_DIM>; // TODO: must the location here be Vec, even if Vec is not passed in?        
+    using tile_Outshape = Tile<Location::Vec, size_t, 1, 32, BLayout::RowMajor, 1, DATA_DIM>; // TODO: must the location here be Vec, even if Vec is not passed in?
 //    using tile_shapeOffset = Tile<Location::Vec, uint16_t, 1, tM, BLayout::RowMajor>; // TODO: must the location here be Vec, even if Vec is not passed in?
 
 //    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
 
-    gm_InDataShape InShapeGm(in_shape);   
-    gm_OutDataShape OutShapeGm(out_shape);      
+    gm_InDataShape InShapeGm(in_shape);
+    gm_OutDataShape OutShapeGm(out_shape);
 
     tile_shapeData dataTile;
     tile_shapeOffset offsetTile;
@@ -181,24 +181,24 @@ void concat_scatter(
 
 
     for (int i = 0; i < Mb; ++i) {
-        auto gI = gIIter(0, i);        
-        TCOPYIN(InshapeTile, InShapeGm);
-        TCOPYIN(OutshapeTile, OutShapeGm);
+        auto gI = gIIter(0, i);
+        TLOAD(InshapeTile, InShapeGm);
+        TLOAD(OutshapeTile, OutShapeGm);
         gen_offset_Impl<dtype, tile_shapeOffset, tile_Inshape, tile_Outshape, MAX_DIM, DATA_DIM, CONCAT_DIM>(offsetTile, InshapeTile, OutshapeTile, base, total_elements);
 //        printf("end genoffset\n");
         base += total_elements;
 //        DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
-        TCOPYIN(dataTile, gI);
+        TLOAD(dataTile, gI);
         MSCATTER(outGm, dataTile, offsetTile);
     }
-    if constexpr (rmd_M) {  
-        auto gI = gIIter(0, Mb);                  
-        TCOPYIN(InshapeTile, InShapeGm);
-        TCOPYIN(OutshapeTile, OutShapeGm);        
+    if constexpr (rmd_M) {
+        auto gI = gIIter(0, Mb);
+        TLOAD(InshapeTile, InShapeGm);
+        TLOAD(OutshapeTile, OutShapeGm);
         total_elements = rmd_M; // size of the tail tile
         gen_offset_Impl<dtype, tile_shapeOffset, tile_Inshape, tile_Outshape, MAX_DIM, DATA_DIM, CONCAT_DIM>(offsetTile, InshapeTile, OutshapeTile, base, total_elements);
         base += total_elements;
-        TCOPYIN(dataTile, gI);
+        TLOAD(dataTile, gI);
         MSCATTER(outGm, dataTile, offsetTile);
     }
 }
diff --git a/kernels/memory/gather.hpp b/kernels/memory/gather.hpp
index 0222852..03a1069 100644
--- a/kernels/memory/gather.hpp
+++ b/kernels/memory/gather.hpp
@@ -10,7 +10,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -27,7 +27,7 @@
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -40,7 +40,7 @@
 template<typename tile_shapeInOffset, typename tile_shapeOffset, typename dtype, int gN>
 void __vec__ gen_offset(
     typename tile_shapeInOffset::TileDType __in__ in,   // inOffset
-    typename tile_shapeOffset::TileDType __out__ out,   // 
+    typename tile_shapeOffset::TileDType __out__ out,   //
     const size_t n_base
 ) {
     size_t data_width = sizeof(dtype);
@@ -64,7 +64,7 @@ void gen_offset_impl(
     ) {
     static_assert(tile_shapeOffset::ValidRow != -1 && tile_shapeOffset::ValidCol != -1,
                   "Only static shape supported");
-                  
+
 
     gen_offset<tile_shapeInOffset, tile_shapeOffset, dtype, gN><<<tile_shapeOffset::ValidCol, tile_shapeOffset::ValidRow, 1>>>(
         in_offset.data(),
@@ -86,25 +86,25 @@ void gather(
     using gm_shapeInOffset = global_tensor<otype, RowMajor<1, gM>>;
     using gm_shapeIn = global_tensor<dtype, RowMajor<gK, gN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gM, gN>>;
-    
-    using tile_shapeInData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; 
+
+    using tile_shapeInData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>;
     using itIn = global_iterator<gm_shapeIn, tile_shapeInData>;
     tile_shapeInData inTile;
     itIn gInIter(in_data_ptr);
 
-    using tile_shapeInOffset = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor>; 
-    using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; 
+    using tile_shapeInOffset = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor>;
+    using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>;
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, tM, tN, BLayout::RowMajor>;
 
-    using tile_shapeInOffset_rmd_n  = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor>; 
+    using tile_shapeInOffset_rmd_n  = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor>;
     using tile_shapeData_rmd_n      = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>;
     using tile_shapeOffset_rmd_n    = Tile<Location::Vec, uint32_t, tM, tN, BLayout::RowMajor, tM, rmd_N>;
 
-    using tile_shapeInOffset_rmd_mn = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor, 1, rmd_M>; 
+    using tile_shapeInOffset_rmd_mn = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeData_rmd_mn     = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>;
     using tile_shapeOffset_rmd_mn   = Tile<Location::Vec, uint32_t, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>;
 
-    using tile_shapeInOffset_rmd_m  = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor, 1, rmd_M>; 
+    using tile_shapeInOffset_rmd_m  = Tile<Location::Vec, otype, 1, tM, BLayout::RowMajor, 1, rmd_M>;
     using tile_shapeData_rmd_m      = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>;
     using tile_shapeOffset_rmd_m    = Tile<Location::Vec, uint32_t, tM, tN, BLayout::RowMajor, rmd_M, tN>;
 
@@ -117,11 +117,11 @@ void gather(
     tile_shapeInOffset_rmd_n inOffsetTile_rmd_n;
     tile_shapeData_rmd_n outTile_rmd_n;
     tile_shapeOffset_rmd_n offsetTile_rmd_n;
-    
+
     tile_shapeInOffset_rmd_mn inOffsetTile_rmd_mn;
     tile_shapeData_rmd_mn outTile_rmd_mn;
     tile_shapeOffset_rmd_mn offsetTile_rmd_mn;
-    
+
     tile_shapeInOffset_rmd_m inOffsetTile_rmd_m;
     tile_shapeData_rmd_m outTile_rmd_m;
     tile_shapeOffset_rmd_m offsetTile_rmd_m;
@@ -140,24 +140,24 @@ void gather(
     // ///////////////////////////////////////
 
     size_t n_base = 0;
-    
+
     // #pragma clang loop unroll(full)
     for (int j = 0; j < Mb; ++j) {
     printf("j = %d\n", j);
         for (int i = 0; i < Nb; ++i) {
             auto gInOffset = gInOffsetIter(0, j);
             auto gO = gOIter(j, i);
-            TCOPYIN(inOffsetTile, gInOffset);
+            TLOAD(inOffsetTile, gInOffset);
             // test
             // auto gIn = gInIter(j, i);
-            // TCOPYIN(inTile, gIn);
+            // TLOAD(inTile, gIn);
             n_base = i * tN;
             // printf("j = %d\n", j);
             // printf("i = %d\n", i);
             // printf("base = %d\n", base);
             // printf("in_shape[0] = %d\n", in_shape[0]);
             gen_offset_impl<tile_shapeInOffset, tile_shapeOffset, dtype, gN>(inOffsetTile, offsetTile, n_base);
-        
+
             MGATHER(outTile, inGm, offsetTile);
 
             // printf("inGm = %d\n", inGm);
@@ -165,19 +165,19 @@ void gather(
             // DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tN);
             // DUMP_TILE_FLOAT("inTile", inTile, g_dump_outdata, 1, tN);
             // DUMP_TILE_FLOAT("outTile", outTile, g_dump_outdata, 1, tN);
-            TCOPYOUT(gO, outTile);
+            TSTORE(gO, outTile);
         }
         if constexpr (rmd_N) {
             auto gInOffset = gInOffsetIter(0, j);
             auto gO = gOIter(j, Nb);
             n_base = Nb * tN;
-            TCOPYIN(inOffsetTile_rmd_n, gInOffset);
+            TLOAD(inOffsetTile_rmd_n, gInOffset);
             gen_offset_impl<tile_shapeInOffset_rmd_n, tile_shapeOffset_rmd_n, dtype, gN>(inOffsetTile_rmd_n, offsetTile_rmd_n, n_base);
             MGATHER(outTile_rmd_n, inGm, offsetTile_rmd_n);
             // DUMP_TILE("inOffsetTile_rmd_n", inOffsetTile_rmd_n, g_dump_inoffset, 1, rmd_N);
             // DUMP_TILE_FLOAT("outTile_rmd_n", outTile_rmd_n, g_dump_outdata, 20, rmd_N);
             // DUMP_TILE("offsetTile_rmd_n", offsetTile_rmd_n, g_dump, 20, rmd_N);
-            TCOPYOUT(gO, outTile_rmd_n);
+            TSTORE(gO, outTile_rmd_n);
         }
     }
     if constexpr (rmd_M) {
@@ -185,19 +185,19 @@ void gather(
             auto gInOffset = gInOffsetIter(0, Mb);
             auto gO = gOIter(Mb, i);
             n_base = i * tN;
-            TCOPYIN(inOffsetTile_rmd_m, gInOffset);
+            TLOAD(inOffsetTile_rmd_m, gInOffset);
             gen_offset_impl<tile_shapeInOffset_rmd_m, tile_shapeOffset_rmd_m, dtype, gN>(inOffsetTile_rmd_m, offsetTile_rmd_m, n_base);
             MGATHER(outTile_rmd_m, inGm, offsetTile_rmd_m);
-            TCOPYOUT(gO, outTile_rmd_m);
+            TSTORE(gO, outTile_rmd_m);
         }
         if constexpr (rmd_N) {
             auto gInOffset = gInOffsetIter(0, Mb);
             auto gO = gOIter(Mb, Nb);
             n_base = Nb * tN;
-            TCOPYIN(inOffsetTile_rmd_mn, gInOffset);
+            TLOAD(inOffsetTile_rmd_mn, gInOffset);
             gen_offset_impl<tile_shapeInOffset_rmd_mn, tile_shapeOffset_rmd_mn, dtype, gN>(inOffsetTile_rmd_mn, offsetTile_rmd_mn, n_base);
             MGATHER(outTile_rmd_mn, inGm, offsetTile_rmd_mn);
-            TCOPYOUT(gO, outTile_rmd_mn);
+            TSTORE(gO, outTile_rmd_mn);
         }
     }
 
diff --git a/kernels/memory/transpose.hpp b/kernels/memory/transpose.hpp
index 37ca721..51d0a0c 100644
--- a/kernels/memory/transpose.hpp
+++ b/kernels/memory/transpose.hpp
@@ -16,7 +16,7 @@ using namespace pto;
         GlobalTensor<typename decltype(TileVar)::DType, \
                      Shape<1,1,1,Rows,Cols>, \
                      Stride<1,1,1,Cols,1>> _g(DumpBuf); \
-        TCOPYOUT(_g, TileVar); \
+        TSTORE(_g, TileVar); \
         printf("[DUMP] %s (shape=%dx%d):\n", label, Rows, Cols); \
         for (int ri = 0; ri < Rows; ri++) { \
             printf("  row%2d: ", ri); \
@@ -38,7 +38,7 @@ void __vec__ gen_offset_trans(
 //    const size_t in_dim,
 //    const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0,    
+//    const size_t transpose_dim0,
     const size_t base,
     const size_t total_elements
 ) {
@@ -62,7 +62,7 @@ void __vec__ gen_offset_trans(
     // 输出一维索引 → 输出坐标
     size_t out_coord[MAX_DIM] = {0};    //
     size_t tmp = idx;   //
-    
+
     #pragma clang loop unroll(full)
     for (int d = OUT_DIM - 1; d >= 0; d--) {
         out_coord[d] = tmp % out_shape[d];
@@ -81,10 +81,10 @@ void __vec__ gen_offset_trans(
         }
     }
 */
-//    uint16_t in_offset = 0; 
-    uint32_t in_offset = 0;   
+//    uint16_t in_offset = 0;
+    uint32_t in_offset = 0;
 
-    #pragma clang loop unroll(full)    
+    #pragma clang loop unroll(full)
     for (int i = 0; i < IN_DIM; i++) {
         in_offset += out_coord[i] * stride[i] * sizeof(dtype);
     }
@@ -101,7 +101,7 @@ void gen_offset_Impl(
 //    const size_t in_dim,
 //   const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0,   
+//    const size_t transpose_dim0,
     const size_t base,
     const size_t total_elements
     )
@@ -122,16 +122,16 @@ void transpose(
 //    const size_t in_dim,
 //    const size_t out_dim,
 //    const size_t transpose_dim1,
-//    const size_t transpose_dim0   
-) 
+//    const size_t transpose_dim0
+)
     {
 
     const int Mb = gOM / tM;
 
-    
+
     const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;     //将gm中的Tensor先声明为一维数据 
+    using gm_shapeIn = global_tensor<dtype, RowMajor<1, gIM>>;     // first declare the tensor in GM as one-dimensional data
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gOM>>;
     using tile_shapeData = Tile<Location::Vec, dtype, 1, tM, BLayout::RowMajor>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeOffset = Tile<Location::Vec, uint32_t, 1, tM, BLayout::RowMajor>; // todo 这里的location，一定要是Vec吗？哪怕没有传入Vec
@@ -158,9 +158,9 @@ void transpose(
         base += total_elements;
 //        DUMP_TILE("offsetTile", offsetTile, g_dump, 1, tM);
         MGATHER(dataTile, inGm, offsetTile);
-//        printf("end mgather\n");        
-        TCOPYOUT(gO, dataTile);
-//        TCOPYOUT(gO, dataTile);
+//        printf("end mgather\n");
+        TSTORE(gO, dataTile);
+//        TSTORE(gO, dataTile);
     }
     if constexpr (rmd_M) {
         auto gO = gOIter(0, Mb);
@@ -168,7 +168,7 @@ void transpose(
         gen_offset_Impl<dtype, tile_shapeOffset, MAX_DIM, IN_DIM, OUT_DIM, TRANSPOSE_DIM1, TRANSPOSE_DIM0>(offsetTile, in_shape, out_shape, base, total_elements);
         base += total_elements;
         MGATHER(dataTile, inGm, offsetTile);
-        TCOPYOUT(gO, dataTile);
+        TSTORE(gO, dataTile);
     }
 }
 
diff --git a/kernels/memory/transpose_vector_007.hpp b/kernels/memory/transpose_vector_007.hpp
index c9fbe07..4093211 100644
--- a/kernels/memory/transpose_vector_007.hpp
+++ b/kernels/memory/transpose_vector_007.hpp
@@ -4,11 +4,11 @@
 using namespace pto;
 
 //AI/IA = A, placeholder
-//like Ttrans 
+//like Ttrans
 template<typename dtype, typename tileInData, typename tileOutData>
 void __vec__ transpose_007_impl(
     typename tileOutData::TileDType __out__ out,
-    const typename tileInData::TileDType __in__ in    
+    const typename tileInData::TileDType __in__ in
 )
 {
     size_t i = blkv_get_index_x(); // 4096
@@ -27,13 +27,13 @@ void __vec__ transpose_007_impl(
 // in1[DimIn1, 1] in2 [DimIn2, 1] bias[DimOut, 1] weight [DimOut, DimIn1, DimIn2]
 template<typename dtype>
 void transpose_007(
-        dtype *out_ptr, 
+        dtype *out_ptr,
         dtype *in_ptr
 )
 {
-    const int Mb = 4096 / 4096;    
+    const int Mb = 4096 / 4096;
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<1, 4096*3>>;     //将gm中的Tensor先声明为一维数据 
+    using gm_shapeIn = global_tensor<dtype, RowMajor<1, 4096*3>>;     // first declare the tensor in GM as one-dimensional data
     using gm_shapeOut = global_tensor<dtype, RowMajor<3, 4096>>;
     using tile_shapeInData = Tile<Location::Vec, dtype, 1, 4096*4, BLayout::RowMajor, 1, 4096*3>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeOutData = Tile<Location::Vec, dtype, 4, 4096, BLayout::RowMajor, 3, 4096>; // todo 尾块怎么处理？是否要作为参数写在这
@@ -42,18 +42,18 @@ void transpose_007(
     using itOut = global_iterator<gm_shapeOut, tile_shapeOutData>;
 
     tile_shapeInData InDataTile;
-    tile_shapeOutData OutDataTile;  
+    tile_shapeOutData OutDataTile;
 
     itIn  gIIter(in_ptr);
-    itOut gOIter(out_ptr);  
+    itOut gOIter(out_ptr);
 
     for (int i = 0; i < Mb; ++i) {
         auto gI = gIIter(0, i);
         auto gO = gOIter(0, i);
-        TCOPYIN(InDataTile, gI);
+        TLOAD(InDataTile, gI);
         transpose_007_impl<dtype, tile_shapeInData, tile_shapeOutData><<<tile_shapeOutData::ValidCol, tile_shapeOutData::ValidRow, 1>>>(OutDataTile.data(), InDataTile.data());
-        TCOPYOUT(gO, OutDataTile);
-    }    
+        TSTORE(gO, OutDataTile);
+    }
 
 }
 
diff --git a/kernels/memory/transpose_vector_050.hpp b/kernels/memory/transpose_vector_050.hpp
index d280dd3..d9c2053 100644
--- a/kernels/memory/transpose_vector_050.hpp
+++ b/kernels/memory/transpose_vector_050.hpp
@@ -7,7 +7,7 @@ using namespace pto;
 template<typename dtype, typename tileData>
 void __vec__ transpose_050_impl(
     typename tileData::TileDType __out__ out,
-    const typename tileData::TileDType __in__ in    
+    const typename tileData::TileDType __in__ in
 )
 {
     size_t i = blkv_get_index_x(); // y
@@ -26,33 +26,33 @@ void __vec__ transpose_050_impl(
 
 template<typename dtype>
 void transpose_050(
-        dtype *out_ptr, 
+        dtype *out_ptr,
         dtype *in_ptr
 )
-{   
+{
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<512, 64>>;     //将gm中的Tensor先声明为一维数据 
+    using gm_shapeIn = global_tensor<dtype, RowMajor<512, 64>>;     // first declare the tensor in GM as one-dimensional data
     using gm_shapeOut = global_tensor<dtype, RowMajor<512, 64>>;
     using tile_shapeData = Tile<Location::Vec, dtype, 512, 64, BLayout::RowMajor, 512, 64>; // todo 尾块怎么处理？是否要作为参数写在这
 
     using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
-    using itOut = global_iterator<gm_shapeOut, tile_shapeData>;    
+    using itOut = global_iterator<gm_shapeOut, tile_shapeData>;
 
     tile_shapeData InDataTile;
     tile_shapeData OutDataTile;
 
     itIn  gIIter(in_ptr);
-    itOut gOIter(out_ptr);  
+    itOut gOIter(out_ptr);
 
 
     auto gI = gIIter(0, 0);
     auto gO = gOIter(0, 0);
-    TCOPYIN(InDataTile, gI);
+    TLOAD(InDataTile, gI);
     transpose_050_impl<dtype, tile_shapeData><<<tile_shapeData::ValidCol, tile_shapeData::ValidRow, 1>>>(OutDataTile.data(), InDataTile.data());
-    TCOPYOUT(gO, OutDataTile);
-   
- 
+    TSTORE(gO, OutDataTile);
+
+
 }
 
 
diff --git a/kernels/other/attention.hpp b/kernels/other/attention.hpp
index b40b402..ae4af5c 100644
--- a/kernels/other/attention.hpp
+++ b/kernels/other/attention.hpp
@@ -54,8 +54,8 @@ void flash_attention1(float *out_ptr,
 
     for (int j = 0; j < Kb; ++j) {
       // load Q_i, K_j
-      tileQ tQ; TCOPYIN(tQ, gQ(i, 0));
-      tileK tK; TCOPYIN(tK, gK(0, j));
+      tileQ tQ; TLOAD(tQ, gQ(i, 0));
+      tileK tK; TLOAD(tK, gK(0, j));
       // 计算分数块
       tileW tW;
       MATMUL(tW, tQ, tK);
@@ -72,8 +72,8 @@ void flash_attention1(float *out_ptr,
     // 2. 扫描所有 K‑blocks，计算行 exp 和的累加
     tileSum tSum(0);
     for (int j = 0; j < Kb; ++j) {
-      tileQ tQ; TCOPYIN(tQ, gQ(i, 0));
-      tileK tK; TCOPYIN(tK, gK(0, j));
+      tileQ tQ; TLOAD(tQ, gQ(i, 0));
+      tileK tK; TLOAD(tK, gK(0, j));
       tileW tW;
       MATMUL(tW, tQ, tK);
       TMULS(tW, tW, scale);
@@ -95,9 +95,9 @@ void flash_attention1(float *out_ptr,
     // 3. 重算加权，乘 V 并累加到输出
     tO = tileO(0);
     for (int j = 0; j < Kb; ++j) {
-      tileQ tQ; TCOPYIN(tQ, gQ(i, 0));
-      tileK tK; TCOPYIN(tK, gK(0, j));
-      tileV tV; TCOPYIN(tV, gV(j, 0));
+      tileQ tQ; TLOAD(tQ, gQ(i, 0));
+      tileK tK; TLOAD(tK, gK(0, j));
+      tileV tV; TLOAD(tV, gV(j, 0));
 
       tileW tW;
       MATMUL(tW, tQ, tK);
@@ -115,6 +115,6 @@ void flash_attention1(float *out_ptr,
     }
 
     // 写回 global
-    TCOPYOUT(dstO, tO);
+    TSTORE(dstO, tO);
   }
 }
diff --git a/kernels/other/conv.hpp b/kernels/other/conv.hpp
index 8cb22e3..f2f9301 100644
--- a/kernels/other/conv.hpp
+++ b/kernels/other/conv.hpp
@@ -4,13 +4,13 @@ using namespace pto;
 
 //in: Input data of shape (N, C, H, W) ->        -> N, 1, C, H, W   ->  N, F, C, H, W
 //filter: Filter weights of shape (F, C, HH, WW) -> 1 ,F, C, HH, WW - > N ,F, C, HH, WW
-//out: Output data, of shape (N, F, H', W')      
+//out: Output data, of shape (N, F, H', W')
 // for j in range(0, H_prime):
 //     for i in range(0, W_prime):
 //        tmp_w = w
 //        tmp_w = tmp_w[np.newaxis,:]
 //        tmp_w = np.repeat(tmp_w, N, axis=0)
-//        tmp_x = x_pad[:, :, j * stride:j * stride + HH, i * stride:i * stride + WW] 
+//        tmp_x = x_pad[:, :, j * stride:j * stride + HH, i * stride:i * stride + WW]
 //        tmp_x = tmp_x[:,np.newaxis]
 //        tmp_x = np.repeat(tmp_x, F, axis=1)
 //        out[:,:,j,i] = np.sum(np.sum(np.sum(tmp_x*tmp_w, axis=-1), axis=-1), axis=-1) \
@@ -26,11 +26,11 @@ using namespace pto;
 //         for i in range(0, W_prime):
 //             tmp_w = w[f, :, :, :]
 //             tmp_w = tmp_w[np.newaxis,:]
-//             tmp_w = np.repeat(tmp_w, N, axis=0) 
+//             tmp_w = np.repeat(tmp_w, N, axis=0)
 //             out[:, f, j, i] = np.sum(np.sum(np.sum(x_pad[:, :, j * stride:j * stride + HH, i * stride:i * stride + WW] * tmp_w, axis=-3), axis=-2), axis=-1)
 
 //pic [N, C, H, W], filter [F, C, HH, WW] -> out [N, F, H', W']
-template<typename dtype, const int N, const int C, const int H, const int W, 
+template<typename dtype, const int N, const int C, const int H, const int W,
         const int F, const int HH, const int WW>
 void conv_forward(dtype *out, dtype *pic, dtype *filter){
     const int stride = 1;
@@ -59,8 +59,8 @@ void conv_forward(dtype *out, dtype *pic, dtype *filter){
                         tile_filt tfilt;
                         tile_filt tpic;
 
-                        TCOPYIN(tfilt, gfilt);
-                        TCOPYIN(tpic, gpic);
+                        TLOAD(tfilt, gfilt);
+                        TLOAD(tpic, gpic);
                         TMUL(tpic, tpic, tfilt);
                         TROWSUMEXPAND(tpic, tpic, tpic);
                         TCOLSUMEXPAND(tpic, tpic, tpic); // sum all element
@@ -68,7 +68,7 @@ void conv_forward(dtype *out, dtype *pic, dtype *filter){
                     }
                     int offset = n*F*H_prime*W_prime + f*H_prime*W_prime + h*W_prime + w;
                     gm_out gO(out+offset);
-                    TCOPYOUT(gO, tmp);
+                    TSTORE(gO, tmp);
                 }
             }
         }
diff --git a/kernels/other/flash_attention.hpp b/kernels/other/flash_attention.hpp
index b4a5429..37e1bc0 100644
--- a/kernels/other/flash_attention.hpp
+++ b/kernels/other/flash_attention.hpp
@@ -16,7 +16,7 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) {
     using tileK      = TileRight<float, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::RowMajor>;
-    using tileW_left = TileLeft<float, kTm, kTk>; 
+    using tileW_left = TileLeft<float, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor>; // [kTm×vD]
@@ -46,7 +46,7 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) {
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i,0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -61,8 +61,8 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) {
         // 加载K_j和V_j
         auto gK = gIterK(0,j);
         auto gV = gIterV(j,0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -132,7 +132,7 @@ void flash_attention(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr) {
         TCAST(tO_cast, tO);
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
 
@@ -160,7 +160,7 @@ void __vec__ flashsoftmax_new_max(
         upd_max = blkv_max(upd_max, src_ptr[src_idx] * src_scale);
     }
 
-    new_max_ptr[max_idx] = upd_max; 
+    new_max_ptr[max_idx] = upd_max;
 }
 
 template<typename tileMax, typename tileScale>
@@ -282,8 +282,8 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt
     using tileQ      = TileLeft<float, kTm, qD>;       // [kTm×qD]
     using tileK      = TileRight<float, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::RowMajor>; 
-    using tileW_left = TileLeft<float, kTm, kTk>; 
+    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::RowMajor>;
+    using tileW_left = TileLeft<float, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor>; // [kTm×vD]
@@ -313,7 +313,7 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i, 0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -328,8 +328,8 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -367,7 +367,7 @@ void flash_attention_opt(float* out_ptr, float* q_ptr, float* k_ptr, float* v_pt
         TCAST(tO_cast, tO);
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
 
@@ -382,8 +382,8 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p
     using tileQ      = TileLeft<float, kTm, qD>;       // [kTm×qD]
     using tileK      = TileRight<float, qD, kTk>;      // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk>;      // [kTm×kTk]
-    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>; 
-    using tileW_left = TileLeft<float, kTm, kTk>; 
+    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
+    using tileW_left = TileLeft<float, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor>; // [kTm×vD]
@@ -413,7 +413,7 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i, 0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -428,8 +428,8 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW_out tW_out;
@@ -467,7 +467,7 @@ void flash_attention_opt2(float* out_ptr, float* q_ptr, float* k_ptr, float* v_p
         TCAST(tO_cast, tO);
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
 
@@ -513,7 +513,7 @@ void flash_attention_frac(float *out_ptr,
   for (int i = 0; i < Qb; ++i) {
     // 加载当前Q块 (仅一次)
     tileQ tQ;
-    TCOPYIN(tQ, gQ(i, 0));
+    TLOAD(tQ, gQ(i, 0));
 
     // 初始化状态: 最大值/指数和/输出累加
     tileMax tMax;
@@ -525,8 +525,8 @@ void flash_attention_frac(float *out_ptr,
     #pragma clang loop unroll(full)
     for (int j = 0; j < Kb; ++j) {
       // 加载K_j和V_j
-      tileK tK; TCOPYIN(tK, gK(0, j));
-      tileV tV; TCOPYIN(tV, gV(j, 0));
+      tileK tK; TLOAD(tK, gK(0, j));
+      tileV tV; TLOAD(tV, gV(j, 0));
 
       // 计算注意力分数块
       tileS tS;
@@ -588,7 +588,7 @@ void flash_attention_frac(float *out_ptr,
 
     // 写回全局内存
     auto dstO = gO(i, 0);
-    TCOPYOUT(dstO, tO);
+    TSTORE(dstO, tO);
   }
 }
 
@@ -622,7 +622,7 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr
     itK gIterK(k_ptr);
     itV gIterV(v_ptr);
     itO gIterO(out_ptr);
-    
+
     const float scale = 1.0f / sqrt((float)qD);
     const int Qb = (S + kTm - 1) / kTm;
     const int Kb = (S + kTk - 1) / kTk;
@@ -633,7 +633,7 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr
         // 加载当前Q块 (仅一次)
         tileQ tQ;
         auto gQ = gIterQ(i, 0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态: 最大值/指数和/输出累加
         tileMax tMax;
@@ -646,8 +646,8 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr
         // 加载K_j和V_j
         auto gK = gIterK(0, j);
         auto gV = gIterV(j, 0);
-        tileK tK; TCOPYIN(tK, gK);
-        tileV tV; TCOPYIN(tV, gV);
+        tileK tK; TLOAD(tK, gK);
+        tileV tV; TLOAD(tV, gV);
 
         // 计算注意力分数块
         tileW tW;
@@ -706,7 +706,7 @@ void flash_attention_rm(float* out_ptr, float* q_ptr, float* k_ptr, float* v_ptr
 
         // 写回全局内存
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO);
+        TSTORE(dstO, tO);
     }
 }
 
@@ -730,10 +730,10 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k
     using tileQ      = TileLeft<float, kTm, qD>;
     using tileK      = TileRight<float, qD, kTk>;
     using tileV      = TileRight<float, kTk, vD>;
-    
+
     using tileW_out  = TileAcc<float, kTm, kTk>;
-    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>; 
-    using tileW_left = TileLeft<float, kTm, kTk>; 
+    using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::ColMajor>;
+    using tileW_left = TileLeft<float, kTm, kTk>;
 
     using tileO_out  = TileAcc<float, kTm, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor>;
@@ -758,7 +758,7 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k
     for (int i = 0; i < Qb; ++i) {
         tileQ tQ;
         auto gQ = gIterQ(i, 0);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         // 初始化状态
         tileMax tMax; TEXPANDSCALAR(tMax, -1e30f);
@@ -768,17 +768,17 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k
         // --- 定义双缓冲寄存器 ---
         tileK tK_0, tK_1;
         tileV tV_0, tV_1;
-        
+
         tileW_out tW_out_0, tW_out_1;
         tileW tW_0, tW_1;
-        
+
         tileW tExpW_0, tExpW_1;
         tileW_left tW_left_0, tW_left_1;
-        
+
         tileO_out tO_out_0, tO_out_1;
-        tileO tO_tmp_0, tO_tmp_1; 
+        tileO tO_tmp_0, tO_tmp_1;
         tileO tRescaleO_0, tRescaleO_1;
-        
+
         // 状态更新的临时变量
         tileMax tNewMax_0, tNewMax_1;
         tileSum tNewSum_0, tNewSum_1;
@@ -793,15 +793,15 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k
             auto gK_1 = gIterK(0, j + 1);
             auto gV_1 = gIterV(j + 1, 0);
 
-            TCOPYIN(tK_0, gK_0);
-            TCOPYIN(tV_0, gV_0);
-            TCOPYIN(tK_1, gK_1);
-            TCOPYIN(tV_1, gV_1);
+            TLOAD(tK_0, gK_0);
+            TLOAD(tV_0, gV_0);
+            TLOAD(tK_1, gK_1);
+            TLOAD(tV_1, gV_1);
 
             // 2. MatMul QK Grouping
             // Doubled buffer
             MATMUL(tW_out_0, tQ, tK_0);
-            MATMUL(tW_out_1, tQ, tK_1); 
+            MATMUL(tW_out_1, tQ, tK_1);
 
             // 3. Convert
             TCVT(tW_0, tW_out_0);
@@ -813,11 +813,11 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k
             );
 
             TMUL(tO, tO, tRescaleO_0); // Rescale O_old
-            
+
             TCVT(tW_left_0, tExpW_0);
             MATMUL(tO_out_0, tW_left_0, tV_0); // Compute P0 * V0
             TCVT(tO_tmp_0, tO_out_0);
-            
+
             TADD(tO, tO, tO_tmp_0); // Accumulate O_new
 
             // Update State 0 -> 1
@@ -855,14 +855,14 @@ void flash_attention_opt2_unroll2_aligned(float* out_ptr, float* q_ptr, float* k
         Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor> tO_cast;
         TCAST(tO_cast, tO);
         auto dstO = gIterO(i, 0);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
 
 template <typename dtype, int qD, int vD, int kTm, int kTk>
 __attribute__((noinline))
 void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype* v_ptr, int Sq, int Skv) {
-    
+
     using gmQ = global_tensor<float, RowMajor<-1, qD>>;  // Q: [S×qD]
     using gmK = global_tensor<float, ColMajor<qD, -1>>;  // K: [qD×S]
     using gmV = global_tensor<float, RowMajor<-1, vD>>;  // V: [S×vD]
@@ -872,7 +872,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype*
     using tileK      = TileRight<dtype, qD, kTk, qD, -1>;        // [vD×kTk]
     using tileW_out  = TileAcc<float, kTm, kTk, -1, -1>;         // [kTm×kTk]
     using tileW      = Tile<Location::Vec, float, kTm, kTk, BLayout::RowMajor, -1, -1>;
-    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>; 
+    using tileW_left = TileLeft<dtype, kTm, kTk, -1, -1>;
 
     using tileO_out  = TileAcc<float, kTm, vD, -1, vD>;
     using tileO      = Tile<Location::Vec, float, kTm, vD, BLayout::RowMajor, -1, vD>; // [kTm×vD]
@@ -897,7 +897,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype*
         gmQ gQ(q_ptr+offset_Q, Sq);
 
         tileQ tQ(dyn_m);
-        TCOPYIN(tQ, gQ);
+        TLOAD(tQ, gQ);
 
         tileMax tMax(-1e30f, dyn_m);
         tileSum tSum(0, dyn_m);
@@ -906,7 +906,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype*
 
         #pragma clang loop unroll(full)
         for (int j = 0; j < Kb; ++j) {
-        
+
             int dyn_k = (j+1) * kTk > Skv ? rK:kTk;
 
             size_t offset_K = j * tileK::Cols * qD;
@@ -914,8 +914,8 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype*
             gmK gK(k_ptr+offset_K, Skv);
             gmV gV(v_ptr+offset_V, Skv);
 
-            tileK tK(dyn_k); TCOPYIN(tK, gK);
-            tileV tV(dyn_k); TCOPYIN(tV, gV);
+            tileK tK(dyn_k); TLOAD(tK, gK);
+            tileV tV(dyn_k); TLOAD(tV, gV);
 
             tileW_out tW_out(dyn_m, dyn_k);
             MATMUL(tW_out, tQ, tK);
@@ -985,7 +985,7 @@ void flash_attention_dynamic(dtype* out_ptr, dtype* q_ptr, dtype* k_ptr, dtype*
 
         size_t offset_O = i * tileO_cast::Rows * vD;
         gmO dstO(out_ptr+offset_O, Sq);
-        TCOPYOUT(dstO, tO_cast);
+        TSTORE(dstO, tO_cast);
     }
 }
 
diff --git a/kernels/other/flash_attention_mask.hpp b/kernels/other/flash_attention_mask.hpp
index d626385..a6b9083 100644
--- a/kernels/other/flash_attention_mask.hpp
+++ b/kernels/other/flash_attention_mask.hpp
@@ -89,7 +89,7 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
   for (int i = 0; i < Qb; ++i) {
     // 加载当前Q块 (仅一次)
     tileQ tQ;
-    TCOPYIN(tQ, gQ(i, 0));
+    TLOAD(tQ, gQ(i, 0));
 
     // 初始化状态: 最大值/指数和/输出累加
     tileMax tMax;
@@ -103,9 +103,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
     for (int j = 0; j < Kb; ++j) {
       // 加载K_j和V_j
       tileK tK;
-      TCOPYIN(tK, gK(0, j));
+      TLOAD(tK, gK(0, j));
       tileV tV;
-      TCOPYIN(tV, gV(j, 0));
+      TLOAD(tV, gV(j, 0));
 
       // 计算注意力分数块
       tileW_out tW_out;
@@ -169,9 +169,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
     if constexpr (rK) {
       // 加载K_Kb 和V_Kb
       tileK_tcols tK_tcols;
-      TCOPYIN(tK_tcols, gK(0, Kb));
+      TLOAD(tK_tcols, gK(0, Kb));
       tileV_trows tV_trows;
-      TCOPYIN(tV_trows, gV(Kb, 0));
+      TLOAD(tV_trows, gV(Kb, 0));
 
       // 计算注意力分数块
       tileW_out_tcols tW_out_tcols;
@@ -237,10 +237,10 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
     TEXPANDCOL(tInvSumExpanded, tInvSum);
     TMUL(tO, tO, tInvSumExpanded);
 
-    // 写回全局内存-------将第一步和第二部合并，即完成输出tile块的计算，并copyout到
+    // Write back to global memory: steps 1 and 2 are merged, i.e. the output tile is fully computed and stored to
     // gO(i,0)
     auto dstO = gO(i, 0);
-    TCOPYOUT(dstO, tO);
+    TSTORE(dstO, tO);
   }
 
   // 最后的Q-block块(Qb)
@@ -248,7 +248,7 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
 
     // 加载当前Q块 (仅一次)
     tileQ_trows tQ_trows;
-    TCOPYIN(tQ_trows, gQ(Qb, 0));
+    TLOAD(tQ_trows, gQ(Qb, 0));
 
     // 初始化状态: 最大值/指数和/输出累加
     tileMax_trows tMax_trows;
@@ -263,9 +263,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
     for (int j = 0; j < Kb; ++j) {
       // 加载K_j和V_j
       tileK tK;
-      TCOPYIN(tK, gK(0, j));
+      TLOAD(tK, gK(0, j));
       tileV tV;
-      TCOPYIN(tV, gV(j, 0));
+      TLOAD(tV, gV(j, 0));
 
       // 计算注意力分数块
       tileW_out_trows tW_out_trows;
@@ -329,9 +329,9 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
     if constexpr (rK) {
       // 加载K_Kb 和V_Kb
       tileK_tcols tK_tcols;
-      TCOPYIN(tK_tcols, gK(0, Kb));
+      TLOAD(tK_tcols, gK(0, Kb));
       tileV_trows tV_trows;
-      TCOPYIN(tV_trows, gV(Kb, 0));
+      TLOAD(tV_trows, gV(Kb, 0));
 
       // 计算注意力分数块
       tileW_out_tcorner tW_out_tcorner;
@@ -397,10 +397,10 @@ void flash_attention_frac(float *out_ptr, float *q_ptr, float *k_ptr,
     TEXPANDCOL(tInvSumExpanded_trows, tInvSum_trows);
     TMUL(tO_trows, tO_trows, tInvSumExpanded_trows);
 
-    // 写回全局内存-----将第三步和第四步合并，即完成输出tile块的计算，并copyout到
+    // Write back to global memory: steps 3 and 4 are merged, i.e. the output tile is fully computed and stored to
     // gO(Qb,0)
     auto dstO = gO(Qb, 0);
-    TCOPYOUT(dstO, tO_trows);
+    TSTORE(dstO, tO_trows);
   }
 }
 
diff --git a/kernels/other/gemm.hpp b/kernels/other/gemm.hpp
index 3520552..f5ddd63 100644
--- a/kernels/other/gemm.hpp
+++ b/kernels/other/gemm.hpp
@@ -5,28 +5,28 @@ using namespace pto;
 
 template <typename DataType, const int gM, const int gN, const int gK, const int tM, const int tN, const int tK, const bool Relu, const bool Bias>
 void gemm(DataType *dst, DataType *src0, DataType *src1, DataType *src2)
-{ 
+{
     using gm_shapeA = global_tensor<DataType, RowMajor<gM, gK>>;
     using gm_shapeB = global_tensor<DataType, RowMajor<gK, gN>>;
     using gm_shapeC = global_tensor<DataType, RowMajor<gM, gN>>;
     using gm_shapeBias = global_tensor<DataType, RowMajor<1, gN>>;
- 
+
     using tile_shapeA = TileLeft<DataType, tM, tK>;
     using tile_shapeB = TileRight<DataType, tK, tN>;
     using tile_shapeACC = TileAcc<DataType, tM, tN>;
     using tile_shapeACC_RM = Tile<Location::Vec, DataType, tM, tN, BLayout::RowMajor>;
     using tile_shapeBias = Tile<Location::Vec, DataType, tM, tN, BLayout::RowMajor, 1, tN>;
- 
+
     using gm_iteratorA = global_iterator<gm_shapeA, tile_shapeA>;
     using gm_iteratorB = global_iterator<gm_shapeB, tile_shapeB>;
     using gm_iteratorC = global_iterator<gm_shapeC, tile_shapeACC>;
     using gm_iteratorBias = global_iterator<gm_shapeBias, tile_shapeBias>;
- 
+
     gm_iteratorA gAIter(src0);
     gm_iteratorB gBIter(src1);
     gm_iteratorC gCIter(dst);
     gm_iteratorBias gBiasIter(src2);
- 
+
     const int Mb = gM / tM;
     const int Nb = gN / tN;
     const int Kb = gK / tK;
@@ -40,7 +40,7 @@ void gemm(DataType *dst, DataType *src0, DataType *src1, DataType *src2)
             tile_shapeB tB(0);
             tile_shapeACC tACC;
             MATMUL(tACC, tA, tB);
-            
+
             #pragma clang loop unroll(full)
             for(int k = 0; k < Kb; k++)
             {
@@ -48,28 +48,28 @@ void gemm(DataType *dst, DataType *src0, DataType *src1, DataType *src2)
                 auto gB = gBIter(k, j);
                 tile_shapeA tA;
                 tile_shapeB tB;
-                TCOPYIN(tA, gA);
-                TCOPYIN(tB, gB);
+                TLOAD(tA, gA);
+                TLOAD(tB, gB);
                 MATMACC(tACC, tA, tB);
             }
 
             tile_shapeACC_RM tACC_RM;
             TCVT(tACC_RM, tACC);
- 
+
             if constexpr (Bias) {
                 tile_shapeBias tBias;
                 tile_shapeACC_RM tExpandBias;
                 auto gBias = gBiasIter(0,j);
-                TCOPYIN(tBias, gBias);
+                TLOAD(tBias, gBias);
                 TEXPANDROW(tExpandBias, tBias);
                 TADD(tACC_RM, tACC_RM, tExpandBias);
             }
- 
+
             if constexpr (Relu) {
                 TMAXS(tACC_RM, tACC_RM, 0);
             }
- 
-            TCOPYOUT(gC, tACC_RM);
+
+            TSTORE(gC, tACC_RM);
         }
     }
 }
\ No newline at end of file
diff --git a/kernels/other/linear.hpp b/kernels/other/linear.hpp
index 36fa7bc..d53b9f0 100644
--- a/kernels/other/linear.hpp
+++ b/kernels/other/linear.hpp
@@ -19,9 +19,9 @@ void Identity(dtype *dst, dtype *src){
     for(int i=0;i<Mb;i++){
         for(int j=0;j<Nb;j++){
             tile_shape tsrc;
-            TCOPYIN(tsrc, git_src(i,j));
+            TLOAD(tsrc, git_src(i,j));
             auto gdst = git_dst(i,j);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
@@ -29,14 +29,14 @@ void Identity(dtype *dst, dtype *src){
 // output = input * A^T + bias ;;; input = [batch_size, in_features] , A = [out_features, in_features]
 // A should be row major
 template<typename dtype, const int BatSize, const int InFeat, const int OutFeat>
-void Linear(dtype *out, dtype *in, dtype *weight, dtype *bias){    
+void Linear(dtype *out, dtype *in, dtype *weight, dtype *bias){
     gemm<BatSize, OutFeat, InFeat, 64, 64, 64, false, true>(out, in, weight, bias);
 }
 
 // y = x1^T.A.x2 + b, where x1=1xd1, x2=1xd2, A=doxd1xd2
 // in1[DimIn1, 1] in2 [DimIn2, 1] bias[DimOut, 1] weight [DimOut, DimIn1, DimIn2]
 template<typename dtype, const int DimIn1, const int DimIn2, const int DimOut, const bool Bias>
-void BiLinear(dtype *out, dtype *weight, 
+void BiLinear(dtype *out, dtype *weight,
              dtype *in1, dtype *in2, dtype *bias){
     using gm_shapeA = global_tensor<dtype, RowMajor<1, DimIn1>>;
     using gm_shapeB = global_tensor<dtype, RowMajor<DimIn2, 1>>;
@@ -57,25 +57,25 @@ void BiLinear(dtype *out, dtype *weight,
     tile_shapeW  tW;
     tile_shapeBT tmp;
     tile_shapeO  tO;
-    TCOPYIN(tin1, gin1);
-    TCOPYIN(tin2, gin2);
+    TLOAD(tin1, gin1);
+    TLOAD(tin2, gin2);
     for (int i=0;i<DimOut;i++){
         gm_shapeW gW((weight + i * DimIn1 * DimIn2)); //weight[i, 0, 0]
-        TCOPYIN(tW, gW);
+        TLOAD(tW, gW);
         MATMUL(tmp, tin1, tW);
         MATMUL(tO, tmp, tin2);
         gm_shapeO gO(out+i);
-        TCOPYOUT(gO, tO);
+        TSTORE(gO, tO);
     }
     if constexpr (Bias){
         Tile<Location::Vec, dtype, DimOut, 1, BLayout::RowMajor> tout;
         Tile<Location::Vec, dtype, DimOut, 1, BLayout::RowMajor> tbias;
         gm_shapeO gO(out);
         gm_shapeO gbias(bias);
-        TCOPYIN(tout, gO);
-        TCOPYIN(tbias, gbias);
+        TLOAD(tout, gO);
+        TLOAD(tbias, gbias);
         TADD(tout, tout, tbias);
-        TCOPYOUT(gO, tout);
+        TSTORE(gO, tout);
     }
 }
 
diff --git a/kernels/other/matadd.hpp b/kernels/other/matadd.hpp
index 7bb031d..6860bee 100644
--- a/kernels/other/matadd.hpp
+++ b/kernels/other/matadd.hpp
@@ -27,10 +27,10 @@ void matadd(float *c_ptr, float *a_ptr, float *b_ptr) {
       auto gC = gCIter(i, j);
 
       tile_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
+      TLOAD(tA, gA);
+      TLOAD(tB, gB);
       TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
+      TSTORE(gC, tC);
     }
   }
 }
diff --git a/kernels/other/matmul.hpp b/kernels/other/matmul.hpp
index 3d3cbe1..6af6183 100644
--- a/kernels/other/matmul.hpp
+++ b/kernels/other/matmul.hpp
@@ -10,19 +10,19 @@
 using namespace pto;
 
 template <is_global_data_v GmOut, is_tile_data_v TileAcc>
-void TCOPYOUT_ACC(GmOut &Gout, TileAcc &tAcc){
+void TSTORE_ACC(GmOut &Gout, TileAcc &tAcc){
     using TileAccOut = Tile<Location::Vec, __bf16, TileAcc::Rows, TileAcc::Cols, BLayout::RowMajor, TileAcc::ValidRow, TileAcc::ValidCol>;
     TileAccOut tAccOut;
     TCVT(tAccOut, tAcc);
-    TCOPYOUT(Gout, tAccOut);
+    TSTORE(Gout, tAccOut);
 }
 
 template <is_global_data_v GmOut, is_tile_data_v TileAcc>
-void TCOPYOUT_ACC_DYNAMIC(GmOut &Gout, TileAcc &tAcc, size_t valid_row, size_t valid_col){
+void TSTORE_ACC_DYNAMIC(GmOut &Gout, TileAcc &tAcc, size_t valid_row, size_t valid_col){
     using TileAccOut = Tile<Location::Vec, typename TileAcc::DType, TileAcc::Rows, TileAcc::Cols, BLayout::RowMajor, -1, -1>;
     TileAccOut tAccOut(valid_row, valid_col);
     TCVT(tAccOut, tAcc);
-    TCOPYOUT(Gout, tAccOut);
+    TSTORE(Gout, tAccOut);
 }
 
 // A * B -> C with any shape
@@ -74,16 +74,16 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
         tile_shapeACC tACC;
         // tile_shapecast tcast;
-        
+
         if constexpr(Kb>0){
           auto gA = gAIter(i, 0);
           auto gB = gBIter(0, j);
 
           tile_shapeA tA;
           tile_shapeB tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -92,8 +92,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA tA;
           tile_shapeB tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           MATMACC(tACC, tA, tB);
         }
 
@@ -103,8 +103,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_trows tA;
           tile_shapeB_tcols tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
@@ -112,8 +112,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
           }
         }
         // TCVT(tCast, tACC);
-        // TCOPYOUT(gC, tCast);
-        TCOPYOUT_ACC(gC, tACC);
+        // TSTORE(gC, tCast);
+        TSTORE_ACC(gC, tACC);
       }
       if constexpr (rmd_N) {
         auto gC = gCIter(i, Nb);
@@ -125,9 +125,9 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA tA;
           tile_shapeB_trows tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -136,8 +136,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA tA;
           tile_shapeB_trows tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -146,15 +146,15 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_trows tA;
           tile_shapeB_tcorner tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
     }
     if constexpr (rmd_M) {
@@ -168,9 +168,9 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -179,8 +179,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -189,15 +189,15 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcorner tA;
           tile_shapeB_tcols tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
       if constexpr (rmd_N) {
         auto gC = gCIter(Mb, Nb);
@@ -209,9 +209,9 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB_trows tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -220,8 +220,8 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB_trows tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -230,15 +230,15 @@ void matmul_mask(__bf16 *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcorner tA;
           tile_shapeB_tcorner tB;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
     }
   }
@@ -275,8 +275,8 @@ void matmul_frac(float* dst, dtype* src0, dtype* src1){
 
               tile_shapeA tA;
               tile_shapeB tB;
-              TCOPYIN(tA, gA);
-              TCOPYIN(tB, gB);
+              TLOAD(tA, gA);
+              TLOAD(tB, gB);
               MATMUL(tACC, tA, tB);
             }
             #pragma clang loop unroll(full)
@@ -285,11 +285,11 @@ void matmul_frac(float* dst, dtype* src0, dtype* src1){
                 auto gB = gBIter(k,j);
                 tile_shapeA tA;
                 tile_shapeB tB;
-                TCOPYIN(tA, gA);
-                TCOPYIN(tB, gB);
+                TLOAD(tA, gA);
+                TLOAD(tB, gB);
                 MATMACC(tACC, tA, tB);
             }
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
         }
     }
 }
@@ -406,7 +406,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
         // #pragma clang loop unroll(full)
         // for(int k=0;k<R.k;k++){
         //   auto gA = gIterA(ii+i*R.m,k);
-        //   TCOPYIN(tA[ii][k], gA);
+        //   TLOAD(tA[ii][k], gA);
         // }
 
         #pragma clang loop unroll(full)
@@ -417,11 +417,11 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k,j);
             tile_shapeB tB;
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if(j==0){
               // eliminate head cost
               auto gA = gIterA(ii+i*R.m,k);
-              TCOPYIN(tA[ii][k], gA);
+              TLOAD(tA[ii][k], gA);
             }
             if(k==0){
               MATMUL(tACC, tA[ii][k], tB);
@@ -436,8 +436,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
               tile_shapeB tB;
               auto gA = gIterA(i*R.m+ii,k);
               auto gB = gIterB(k,j);
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               MATMACC(tACC, tA, tB);
             }
           }
@@ -450,8 +450,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             tile_shapeA_trows tA;
             tile_shapeB_tcols tB;
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -460,7 +460,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           }
 
           auto gC = gIterC(i*R.m+ii,j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // [m, rmd_N, k]
@@ -471,7 +471,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k,Nb);
             tile_shapeB_trows tB;
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if(k==0){
               MATMUL(tACC, tA[ii][k], tB);
             }else{
@@ -480,14 +480,14 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           }
           static_assert(R.k > Kb);
           if constexpr(R.k < Kb){
-            
+
             for(int k=R.k;k<Kb;k++){
               tile_shapeA tA;
               tile_shapeB_trows tB;
               auto gA = gIterA(i*R.m+ii,k);
               auto gB = gIterB(k,Nb);
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               MATMACC(tACC, tA, tB);
             }
           }
@@ -499,9 +499,9 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
 
             tile_shapeA_trows tA;
             tile_shapeB_tcorner tB;
-            
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
               MATMACC(tACC, tA, tB);
             } else {
@@ -510,7 +510,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           }
 
           auto gC = gIterC(i*R.m+ii,Nb);
-          TCOPYOUT_ACC(gC, tACC);       
+          TSTORE_ACC(gC, tACC);
         }
 
       }
@@ -518,7 +518,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
 
     if constexpr(rM>0){
       tile_shapeA tA[rM][R.k];
-      
+
       #pragma clang loop unroll(full)
       for(int i=0;i<rM;i++){
 
@@ -526,7 +526,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for(int k=0;k<R.k;k++){
           auto gA = gIterA(i+dM*R.m,k);
-          TCOPYIN(tA[i][k], gA);
+          TLOAD(tA[i][k], gA);
         }
 
         #pragma clang loop unroll(full)
@@ -537,7 +537,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k,j);
             tile_shapeB tB;
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if(k==0){
               MATMUL(tACC, tA[i][k], tB);
             }else{
@@ -551,8 +551,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
               tile_shapeB tB;
               auto gA = gIterA(i+dM*R.m,k);
               auto gB = gIterB(k,j);
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               MATMACC(tACC, tA, tB);
             }
           }
@@ -565,8 +565,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             tile_shapeA_trows tA;
             tile_shapeB_tcols tB;
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -574,7 +574,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             }
           }
           auto gC = gIterC(i+dM*R.m,j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // [rM, rmd_N, k]
@@ -585,7 +585,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k,Nb);
             tile_shapeB_trows tB;
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if(k==0){
               MATMUL(tACC, tA[i][k], tB);
             }else{
@@ -599,8 +599,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
               tile_shapeB_trows tB;
               auto gA = gIterA(i+dM*R.m,k);
               auto gB = gIterB(k,Nb);
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               MATMACC(tACC, tA, tB);
             }
           }
@@ -613,8 +613,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             tile_shapeA_trows tA;
             tile_shapeB_tcorner tB;
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -622,7 +622,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             }
           }
           auto gC = gIterC(i+dM*R.m,Nb);
-          TCOPYOUT_ACC(gC, tACC);        
+          TSTORE_ACC(gC, tACC);
         }
       }
     }
@@ -630,11 +630,11 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
     // [rmd_M, n, k]
     if constexpr (rmd_M) {
       tile_shapeA_tcols tA[R.k];
-      
+
       #pragma clang loop unroll(full)
       for(int k=0;k<R.k;k++){
         auto gA = gIterA(Mb,k);
-        TCOPYIN(tA[k], gA);
+        TLOAD(tA[k], gA);
       }
 
       #pragma clang loop unroll(full)
@@ -645,7 +645,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
         for(int k=0;k<R.k;k++){
           auto gB = gIterB(k,j);
           tile_shapeB tB;
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           if(k==0){
             MATMUL(tACC, tA[k], tB);
           }else{
@@ -659,8 +659,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             tile_shapeB tB;
             auto gA = gIterA(Mb,k);
             auto gB = gIterB(k,j);
-            TCOPYIN(tA,gA);
-            TCOPYIN(tB,gB);
+            TLOAD(tA,gA);
+            TLOAD(tB,gB);
             MATMACC(tACC, tA, tB);
           }
         }
@@ -673,8 +673,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           tile_shapeA_tcorner tA;
           tile_shapeB_tcols tB;
 
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
           MATMACC(tACC, tA, tB);
           } else {
@@ -682,7 +682,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           }
         }
         auto gC = gIterC(Mb,j);
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
 
       // [rmd_M, rmd_N, k]
@@ -693,7 +693,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
         for(int k=0;k<R.k;k++){
           auto gB = gIterB(k,Nb);
           tile_shapeB_trows tB;
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           if(k==0){
             MATMUL(tACC, tA[k], tB);
           }else{
@@ -707,8 +707,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
             tile_shapeB_trows tB;
             auto gA = gIterA(Mb,k);
             auto gB = gIterB(k,Nb);
-            TCOPYIN(tA,gA);
-            TCOPYIN(tB,gB);
+            TLOAD(tA,gA);
+            TLOAD(tB,gB);
             MATMACC(tACC, tA, tB);
           }
         }
@@ -721,8 +721,8 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           tile_shapeA_tcorner tA;
           tile_shapeB_tcorner tB;
 
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
           MATMACC(tACC, tA, tB);
           } else {
@@ -730,7 +730,7 @@ void matmul_mask_reuseA(float *dst, dtype *src0, dtype *src1){
           }
         }
         auto gC = gIterC(Mb,Nb);
-        TCOPYOUT_ACC(gC, tACC);        
+        TSTORE_ACC(gC, tACC);
       }
     }
   }// Batch
@@ -802,7 +802,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < R.k; k++) {
           auto gA = gIterA(row, k);
-          TCOPYIN(tA_phase0[k], gA);
+          TLOAD(tA_phase0[k], gA);
         }
 
         // --- N main columns ---
@@ -813,12 +813,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < R.k; k++) {
             tile_shapeB tB;
             auto gB = gIterB(k, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA_phase0[k], tB);
             else        MATMACC(tACC, tA_phase0[k], tB);
           }
           auto gC = gIterC(row, j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // --- N remainder columns (rmd_N) ---
@@ -828,12 +828,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < R.k; k++) {
             tile_shapeB_trows tB;
             auto gB = gIterB(k, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA_phase0[k], tB);
             else        MATMACC(tACC, tA_phase0[k], tB);
           }
           auto gC = gIterC(row, Nb);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // Phase B-1: full chunks along the remaining K axis (MAX_TILE_NUM k tiles per chunk)
@@ -847,7 +847,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             #pragma clang loop unroll(full)
             for (int k = 0; k < MAX_TILE_NUM; k++) {
               auto gA = gIterA(row, k_base + k);
-              TCOPYIN(tA_chunk[k], gA);
+              TLOAD(tA_chunk[k], gA);
             }
 
             // --- N main columns ---
@@ -858,12 +858,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < MAX_TILE_NUM; k++) {
                 tile_shapeB tB;
                 auto gB = gIterB(k_base + k, j);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA_chunk[k], tB);
                 else        MATMACC(tACC, tA_chunk[k], tB);
               }
               auto gC = gIterC(row, j);
-              TCOPYOUT_ACC(gC, tACC);
+              TSTORE_ACC(gC, tACC);
             }
 
             // --- N remainder columns ---
@@ -873,12 +873,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < MAX_TILE_NUM; k++) {
                 tile_shapeB_trows tB;
                 auto gB = gIterB(k_base + k, Nb);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA_chunk[k], tB);
                 else        MATMACC(tACC, tA_chunk[k], tB);
               }
               auto gC = gIterC(row, Nb);
-              TCOPYOUT_ACC(gC, tACC);
+              TSTORE_ACC(gC, tACC);
             }
           }
         }
@@ -891,7 +891,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < K2_rem; k++) {
             auto gA = gIterA(row, k_base + k);
-            TCOPYIN(tA_tail[k], gA);
+            TLOAD(tA_tail[k], gA);
           }
 
           // --- N main columns ---
@@ -902,12 +902,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < K2_rem; k++) {
               tile_shapeB tB;
               auto gB = gIterB(k_base + k, j);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA_tail[k], tB);
               else        MATMACC(tACC, tA_tail[k], tB);
             }
             auto gC = gIterC(row, j);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
 
           // --- N remainder columns ---
@@ -917,12 +917,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < K2_rem; k++) {
               tile_shapeB_trows tB;
               auto gB = gIterB(k_base + k, Nb);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA_tail[k], tB);
               else        MATMACC(tACC, tA_tail[k], tB);
             }
             auto gC = gIterC(row, Nb);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
         }
 
@@ -930,7 +930,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         if constexpr (rmd_K) {
           tile_shapeA_trows tA_rmdK;
           auto gA = gIterA(row, Kb);
-          TCOPYIN(tA_rmdK, gA);
+          TLOAD(tA_rmdK, gA);
 
           // --- N main columns ---
           #pragma clang loop unroll(full)
@@ -938,11 +938,11 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             tile_shapeACC tACC;
             tile_shapeB_tcols tB;
             auto gB = gIterB(Kb, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB);
             else                  MATMUL (tACC, tA_rmdK, tB);
             auto gC = gIterC(row, j);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
 
           // --- N remainder columns ---
@@ -950,11 +950,11 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             tile_shapeC_trows tACC;
             tile_shapeB_tcorner tB;
             auto gB = gIterB(Kb, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB);
             else                  MATMUL (tACC, tA_rmdK, tB);
             auto gC = gIterC(row, Nb);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
         }
 
@@ -974,7 +974,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < R.k; k++) {
           auto gA = gIterA(row, k);
-          TCOPYIN(tA_phase0[k], gA);
+          TLOAD(tA_phase0[k], gA);
         }
 
         #pragma clang loop unroll(full)
@@ -984,12 +984,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < R.k; k++) {
             tile_shapeB tB;
             auto gB = gIterB(k, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA_phase0[k], tB);
             else        MATMACC(tACC, tA_phase0[k], tB);
           }
           auto gC = gIterC(row, j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         if constexpr (rmd_N) {
@@ -998,12 +998,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < R.k; k++) {
             tile_shapeB_trows tB;
             auto gB = gIterB(k, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA_phase0[k], tB);
             else        MATMACC(tACC, tA_phase0[k], tB);
           }
           auto gC = gIterC(row, Nb);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // Phase B-1: Full chunks
@@ -1016,7 +1016,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             #pragma clang loop unroll(full)
             for (int k = 0; k < MAX_TILE_NUM; k++) {
               auto gA = gIterA(row, k_base + k);
-              TCOPYIN(tA_chunk[k], gA);
+              TLOAD(tA_chunk[k], gA);
             }
 
             #pragma clang loop unroll(full)
@@ -1026,12 +1026,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < MAX_TILE_NUM; k++) {
                 tile_shapeB tB;
                 auto gB = gIterB(k_base + k, j);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA_chunk[k], tB);
                 else        MATMACC(tACC, tA_chunk[k], tB);
               }
               auto gC = gIterC(row, j);
-              TCOPYOUT_ACC(gC, tACC);
+              TSTORE_ACC(gC, tACC);
             }
 
             if constexpr (rmd_N) {
@@ -1040,12 +1040,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < MAX_TILE_NUM; k++) {
                 tile_shapeB_trows tB;
                 auto gB = gIterB(k_base + k, Nb);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA_chunk[k], tB);
                 else        MATMACC(tACC, tA_chunk[k], tB);
               }
               auto gC = gIterC(row, Nb);
-              TCOPYOUT_ACC(gC, tACC);
+              TSTORE_ACC(gC, tACC);
             }
           }
         }
@@ -1058,7 +1058,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < K2_rem; k++) {
             auto gA = gIterA(row, k_base + k);
-            TCOPYIN(tA_tail[k], gA);
+            TLOAD(tA_tail[k], gA);
           }
 
           #pragma clang loop unroll(full)
@@ -1068,12 +1068,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < K2_rem; k++) {
               tile_shapeB tB;
               auto gB = gIterB(k_base + k, j);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA_tail[k], tB);
               else        MATMACC(tACC, tA_tail[k], tB);
             }
             auto gC = gIterC(row, j);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
 
           if constexpr (rmd_N) {
@@ -1082,12 +1082,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < K2_rem; k++) {
               tile_shapeB_trows tB;
               auto gB = gIterB(k_base + k, Nb);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA_tail[k], tB);
               else        MATMACC(tACC, tA_tail[k], tB);
             }
             auto gC = gIterC(row, Nb);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
         }
 
@@ -1095,29 +1095,29 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         if constexpr (rmd_K) {
           tile_shapeA_trows tA_rmdK;
           auto gA = gIterA(row, Kb);
-          TCOPYIN(tA_rmdK, gA);
+          TLOAD(tA_rmdK, gA);
 
           #pragma clang loop unroll(full)
           for (int j = 0; j < Nb; j++) {
             tile_shapeACC tACC;
             tile_shapeB_tcols tB;
             auto gB = gIterB(Kb, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB);
             else                  MATMUL (tACC, tA_rmdK, tB);
             auto gC = gIterC(row, j);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
 
           if constexpr (rmd_N) {
             tile_shapeC_trows tACC;
             tile_shapeB_tcorner tB;
             auto gB = gIterB(Kb, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB);
             else                  MATMUL (tACC, tA_rmdK, tB);
             auto gC = gIterC(row, Nb);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
         }
 
@@ -1134,7 +1134,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
       #pragma clang loop unroll(full)
       for (int k = 0; k < R.k; k++) {
         auto gA = gIterA(Mb, k);
-        TCOPYIN(tA_phase0[k], gA);
+        TLOAD(tA_phase0[k], gA);
       }
 
       #pragma clang loop unroll(full)
@@ -1144,12 +1144,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         for (int k = 0; k < R.k; k++) {
           tile_shapeB tB;
           auto gB = gIterB(k, j);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           if (k == 0) MATMUL (tACC, tA_phase0[k], tB);
           else        MATMACC(tACC, tA_phase0[k], tB);
         }
         auto gC = gIterC(Mb, j);
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
 
       if constexpr (rmd_N) {
@@ -1158,12 +1158,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         for (int k = 0; k < R.k; k++) {
           tile_shapeB_trows tB;
           auto gB = gIterB(k, Nb);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           if (k == 0) MATMUL (tACC, tA_phase0[k], tB);
           else        MATMACC(tACC, tA_phase0[k], tB);
         }
         auto gC = gIterC(Mb, Nb);
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
 
       // Phase B-1: Full chunks (rmd_M rows; A tiles are of type tcols)
@@ -1176,7 +1176,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < MAX_TILE_NUM; k++) {
             auto gA = gIterA(Mb, k_base + k);
-            TCOPYIN(tA_chunk[k], gA);
+            TLOAD(tA_chunk[k], gA);
           }
 
           #pragma clang loop unroll(full)
@@ -1186,12 +1186,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < MAX_TILE_NUM; k++) {
               tile_shapeB tB;
               auto gB = gIterB(k_base + k, j);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA_chunk[k], tB);
               else        MATMACC(tACC, tA_chunk[k], tB);
             }
             auto gC = gIterC(Mb, j);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
 
           if constexpr (rmd_N) {
@@ -1200,12 +1200,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < MAX_TILE_NUM; k++) {
               tile_shapeB_trows tB;
               auto gB = gIterB(k_base + k, Nb);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA_chunk[k], tB);
               else        MATMACC(tACC, tA_chunk[k], tB);
             }
             auto gC = gIterC(Mb, Nb);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
         }
       }
@@ -1218,7 +1218,7 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < K2_rem; k++) {
           auto gA = gIterA(Mb, k_base + k);
-          TCOPYIN(tA_tail[k], gA);
+          TLOAD(tA_tail[k], gA);
         }
 
         #pragma clang loop unroll(full)
@@ -1228,12 +1228,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < K2_rem; k++) {
             tile_shapeB tB;
             auto gB = gIterB(k_base + k, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA_tail[k], tB);
             else        MATMACC(tACC, tA_tail[k], tB);
           }
           auto gC = gIterC(Mb, j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         if constexpr (rmd_N) {
@@ -1242,12 +1242,12 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < K2_rem; k++) {
             tile_shapeB_trows tB;
             auto gB = gIterB(k_base + k, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA_tail[k], tB);
             else        MATMACC(tACC, tA_tail[k], tB);
           }
           auto gC = gIterC(Mb, Nb);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
       }
 
@@ -1255,29 +1255,29 @@ void matmul_mask_reuseA_OPT(float *dst, dtype *src0, dtype *src1){
       if constexpr (rmd_K) {
         tile_shapeA_tcorner tA_rmdK;
         auto gA = gIterA(Mb, Kb);
-        TCOPYIN(tA_rmdK, gA);
+        TLOAD(tA_rmdK, gA);
 
         #pragma clang loop unroll(full)
         for (int j = 0; j < Nb; j++) {
           tile_shapeC_tcols tACC;
           tile_shapeB_tcols tB;
           auto gB = gIterB(Kb, j);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB);
           else                  MATMUL (tACC, tA_rmdK, tB);
           auto gC = gIterC(Mb, j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         if constexpr (rmd_N) {
           tile_shapeC_tcorner tACC;
           tile_shapeB_tcorner tB;
           auto gB = gIterB(Kb, Nb);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           if constexpr (Kb > 0) MATMACC(tACC, tA_rmdK, tB);
           else                  MATMUL (tACC, tA_rmdK, tB);
           auto gC = gIterC(Mb, Nb);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
       }
     } // rmd_M
@@ -1380,7 +1380,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < LEN; k++) {
             auto gA = gIterA(row, k_base + k);
-            TCOPYIN(tA[k], gA);
+            TLOAD(tA[k], gA);
           }
 
           #pragma clang loop unroll(full)
@@ -1390,7 +1390,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeB tB;
               auto gB = gIterB(k_base + k, j);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA[k], tB);
               else        MATMACC(tACC, tA[k], tB);
             }
@@ -1403,7 +1403,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeB_trows tB;
               auto gB = gIterB(k_base + k, Nb);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA[k], tB);
               else        MATMACC(tACC, tA[k], tB);
             }
@@ -1417,7 +1417,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < LEN; k++) {
             auto gA = gIterA(Mb, k_base + k);
-            TCOPYIN(tA[k], gA);
+            TLOAD(tA[k], gA);
           }
 
           #pragma clang loop unroll(full)
@@ -1427,7 +1427,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeB tB;
               auto gB = gIterB(k_base + k, j);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA[k], tB);
               else        MATMACC(tACC, tA[k], tB);
             }
@@ -1440,7 +1440,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeB_trows tB;
               auto gB = gIterB(k_base + k, Nb);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if (k == 0) MATMUL (tACC, tA[k], tB);
               else        MATMACC(tACC, tA[k], tB);
             }
@@ -1463,7 +1463,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
             #pragma clang loop unroll(full)
             for (int k = 0; k < LEN; k++) {
               auto gA = gIterA(row, k_base + k);
-              TCOPYIN(tA[k], gA);
+              TLOAD(tA[k], gA);
             }
 
             #pragma clang loop unroll(full)
@@ -1473,7 +1473,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeB tB;
                 auto gB = gIterB(k_base + k, j);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA[k], tB);
                 else        MATMACC(tACC, tA[k], tB);
               }
@@ -1488,7 +1488,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeB_trows tB;
                 auto gB = gIterB(k_base + k, Nb);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA[k], tB);
                 else        MATMACC(tACC, tA[k], tB);
               }
@@ -1503,7 +1503,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
             #pragma clang loop unroll(full)
             for (int k = 0; k < LEN; k++) {
               auto gA = gIterA(Mb, k_base + k);
-              TCOPYIN(tA[k], gA);
+              TLOAD(tA[k], gA);
             }
 
             #pragma clang loop unroll(full)
@@ -1513,7 +1513,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeB tB;
                 auto gB = gIterB(k_base + k, j);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA[k], tB);
                 else        MATMACC(tACC, tA[k], tB);
               }
@@ -1528,7 +1528,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeB_trows tB;
                 auto gB = gIterB(k_base + k, Nb);
-                TCOPYIN(tB, gB);
+                TLOAD(tB, gB);
                 if (k == 0) MATMUL (tACC, tA[k], tB);
                 else        MATMACC(tACC, tA[k], tB);
               }
@@ -1556,7 +1556,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < LEN; k++) {
           auto gA = gIterA(row, k_base + k);
-          TCOPYIN(tA[k], gA);
+          TLOAD(tA[k], gA);
         }
 
         #pragma clang loop unroll(full)
@@ -1566,7 +1566,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeB tB;
             auto gB = gIterB(k_base + k, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA[k], tB);
             else        MATMACC(tACC, tA[k], tB);
           }
@@ -1585,7 +1585,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeB_trows tB;
             auto gB = gIterB(k_base + k, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA[k], tB);
             else        MATMACC(tACC, tA[k], tB);
           }
@@ -1604,7 +1604,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < LEN; k++) {
           auto gA = gIterA(Mb, k_base + k);
-          TCOPYIN(tA[k], gA);
+          TLOAD(tA[k], gA);
         }
 
         #pragma clang loop unroll(full)
@@ -1614,7 +1614,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeB tB;
             auto gB = gIterB(k_base + k, j);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA[k], tB);
             else        MATMACC(tACC, tA[k], tB);
           }
@@ -1633,7 +1633,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeB_trows tB;
             auto gB = gIterB(k_base + k, Nb);
-            TCOPYIN(tB, gB);
+            TLOAD(tB, gB);
             if (k == 0) MATMUL (tACC, tA[k], tB);
             else        MATMACC(tACC, tA[k], tB);
           }
@@ -1659,14 +1659,14 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
       for (int row = 0; row < Mb; row++) {
         tile_shapeA_trows tA_rmdK;
         auto gA = gIterA(row, Kb);
-        TCOPYIN(tA_rmdK, gA);
+        TLOAD(tA_rmdK, gA);
 
         #pragma clang loop unroll(full)
         for (int j = 0; j < Nb; j++) {
           tile_shapeACC tACC;
           tile_shapeB_tcols tB;
           auto gB = gIterB(Kb, j);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           MATMUL(tACC, tA_rmdK, tB);
 
           if constexpr (is_first) {
@@ -1682,7 +1682,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           tile_shapeACC_trows tACC;
           tile_shapeB_tcorner tB;
           auto gB = gIterB(Kb, Nb);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           MATMUL(tACC, tA_rmdK, tB);
 
           if constexpr (is_first) {
@@ -1698,14 +1698,14 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
       if constexpr (rmd_M) {
         tile_shapeA_tcorner tA_rmdK;
         auto gA = gIterA(Mb, Kb);
-        TCOPYIN(tA_rmdK, gA);
+        TLOAD(tA_rmdK, gA);
 
         #pragma clang loop unroll(full)
         for (int j = 0; j < Nb; j++) {
           tile_shapeACC_tcols tACC;
           tile_shapeB_tcols tB;
           auto gB = gIterB(Kb, j);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           MATMUL(tACC, tA_rmdK, tB);
 
           if constexpr (is_first) {
@@ -1721,7 +1721,7 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
           tile_shapeACC_tcorner tACC;
           tile_shapeB_tcorner tB;
           auto gB = gIterB(Kb, Nb);
-          TCOPYIN(tB, gB);
+          TLOAD(tB, gB);
           MATMUL(tACC, tA_rmdK, tB);
 
           if constexpr (is_first) {
@@ -1745,15 +1745,15 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
         tile_C_bf16 tC_b;
         // TMOV_NZ2DN(tC_b, tC_main[m][n]);
         auto gC = gIterC(m, n);
-        // TCOPYOUT(gC, tC_b);
-        TCOPYOUT(gC, tC_main[m][n]);
+        // TSTORE(gC, tC_b);
+        TSTORE(gC, tC_main[m][n]);
       }
       if constexpr (rmd_N) {
         tile_C_bf16_trows tC_b;
         // TMOV_NZ2DN(tC_b, tC_rcol[m]);
         auto gC = gIterC(m, Nb);
-        TCOPYOUT(gC, tC_rcol[m]);
-        // TCOPYOUT(gC, tC_b);
+        TSTORE(gC, tC_rcol[m]);
+        // TSTORE(gC, tC_b);
       }
     }
     if constexpr (rmd_M) {
@@ -1762,15 +1762,15 @@ void matmul_mask_reuseA_OPT2(float *dst, dtype *src0, dtype *src1){
         tile_C_bf16_tcols tC_b;
         // TMOV_NZ2DN(tC_b, tC_rrow[n]);
         auto gC = gIterC(Mb, n);
-        TCOPYOUT(gC,  tC_rrow[n]);
-        // TCOPYOUT(gC, tC_b);
+        TSTORE(gC,  tC_rrow[n]);
+        // TSTORE(gC, tC_b);
       }
       if constexpr (rmd_N) {
         tile_C_bf16_tcorner tC_b;
         // TMOV_NZ2DN(tC_b, tC_corner);
         auto gC = gIterC(Mb, Nb);
-        TCOPYOUT(gC, tC_corner);
-        // TCOPYOUT(gC, tC_b);
+        TSTORE(gC, tC_corner);
+        // TSTORE(gC, tC_b);
       }
     }
 
@@ -1873,7 +1873,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < LEN; k++) {
             auto gB = gIterB(k_base + k, col);
-            TCOPYIN(tB[k], gB);
+            TLOAD(tB[k], gB);
           }
 
           #pragma clang loop unroll(full)
@@ -1883,7 +1883,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeA tA;
               auto gA = gIterA(row, k_base + k);
-              TCOPYIN(tA, gA);
+              TLOAD(tA, gA);
               if (k == 0) MATMUL (tACC, tA, tB[k]);
               else        MATMACC(tACC, tA, tB[k]);
             }
@@ -1896,7 +1896,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeA_tcols tA;
               auto gA = gIterA(Mb, k_base + k);
-              TCOPYIN(tA, gA);
+              TLOAD(tA, gA);
               if (k == 0) MATMUL (tACC, tA, tB[k]);
               else        MATMACC(tACC, tA, tB[k]);
             }
@@ -1910,7 +1910,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for (int k = 0; k < LEN; k++) {
             auto gB = gIterB(k_base + k, Nb);
-            TCOPYIN(tB[k], gB);
+            TLOAD(tB[k], gB);
           }
 
           #pragma clang loop unroll(full)
@@ -1920,7 +1920,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeA tA;
               auto gA = gIterA(row, k_base + k);
-              TCOPYIN(tA, gA);
+              TLOAD(tA, gA);
               if (k == 0) MATMUL (tACC, tA, tB[k]);
               else        MATMACC(tACC, tA, tB[k]);
             }
@@ -1933,7 +1933,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
             for (int k = 0; k < LEN; k++) {
               tile_shapeA_tcols tA;
               auto gA = gIterA(Mb, k_base + k);
-              TCOPYIN(tA, gA);
+              TLOAD(tA, gA);
               if (k == 0) MATMUL (tACC, tA, tB[k]);
               else        MATMACC(tACC, tA, tB[k]);
             }
@@ -1956,7 +1956,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
             #pragma clang loop unroll(full)
             for (int k = 0; k < LEN; k++) {
               auto gB = gIterB(k_base + k, col);
-              TCOPYIN(tB[k], gB);
+              TLOAD(tB[k], gB);
             }
 
             #pragma clang loop unroll(full)
@@ -1966,7 +1966,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeA tA;
                 auto gA = gIterA(row, k_base + k);
-                TCOPYIN(tA, gA);
+                TLOAD(tA, gA);
                 if (k == 0) MATMUL (tACC, tA, tB[k]);
                 else        MATMACC(tACC, tA, tB[k]);
               }
@@ -1981,7 +1981,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeA_tcols tA;
                 auto gA = gIterA(Mb, k_base + k);
-                TCOPYIN(tA, gA);
+                TLOAD(tA, gA);
                 if (k == 0) MATMUL (tACC, tA, tB[k]);
                 else        MATMACC(tACC, tA, tB[k]);
               }
@@ -1996,7 +1996,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
             #pragma clang loop unroll(full)
             for (int k = 0; k < LEN; k++) {
               auto gB = gIterB(k_base + k, Nb);
-              TCOPYIN(tB[k], gB);
+              TLOAD(tB[k], gB);
             }
 
             #pragma clang loop unroll(full)
@@ -2006,7 +2006,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeA tA;
                 auto gA = gIterA(row, k_base + k);
-                TCOPYIN(tA, gA);
+                TLOAD(tA, gA);
                 if (k == 0) MATMUL (tACC, tA, tB[k]);
                 else        MATMACC(tACC, tA, tB[k]);
               }
@@ -2021,7 +2021,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
               for (int k = 0; k < LEN; k++) {
                 tile_shapeA_tcols tA;
                 auto gA = gIterA(Mb, k_base + k);
-                TCOPYIN(tA, gA);
+                TLOAD(tA, gA);
                 if (k == 0) MATMUL (tACC, tA, tB[k]);
                 else        MATMACC(tACC, tA, tB[k]);
               }
@@ -2049,7 +2049,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < LEN; k++) {
           auto gB = gIterB(k_base + k, col);
-          TCOPYIN(tB[k], gB);
+          TLOAD(tB[k], gB);
         }
 
         #pragma clang loop unroll(full)
@@ -2059,7 +2059,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeA tA;
             auto gA = gIterA(row, k_base + k);
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if (k == 0) MATMUL (tACC, tA, tB[k]);
             else        MATMACC(tACC, tA, tB[k]);
           }
@@ -2078,7 +2078,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeA_tcols tA;
             auto gA = gIterA(Mb, k_base + k);
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if (k == 0) MATMUL (tACC, tA, tB[k]);
             else        MATMACC(tACC, tA, tB[k]);
           }
@@ -2097,7 +2097,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for (int k = 0; k < LEN; k++) {
           auto gB = gIterB(k_base + k, Nb);
-          TCOPYIN(tB[k], gB);
+          TLOAD(tB[k], gB);
         }
 
         #pragma clang loop unroll(full)
@@ -2107,7 +2107,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeA tA;
             auto gA = gIterA(row, k_base + k);
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if (k == 0) MATMUL (tACC, tA, tB[k]);
             else        MATMACC(tACC, tA, tB[k]);
           }
@@ -2126,7 +2126,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           for (int k = 0; k < LEN; k++) {
             tile_shapeA_tcols tA;
             auto gA = gIterA(Mb, k_base + k);
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if (k == 0) MATMUL (tACC, tA, tB[k]);
             else        MATMACC(tACC, tA, tB[k]);
           }
@@ -2152,14 +2152,14 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
       for (int col = 0; col < Nb; col++) {
         tile_shapeB_tcols tB_rmdK;
         auto gB = gIterB(Kb, col);
-        TCOPYIN(tB_rmdK, gB);
+        TLOAD(tB_rmdK, gB);
 
         #pragma clang loop unroll(full)
         for (int row = 0; row < Mb; row++) {
           tile_shapeACC tACC;
           tile_shapeA_trows tA;
           auto gA = gIterA(row, Kb);
-          TCOPYIN(tA, gA);
+          TLOAD(tA, gA);
           MATMUL(tACC, tA, tB_rmdK);
 
           if constexpr (is_first) {
@@ -2175,7 +2175,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           tile_shapeACC_tcols tACC;
           tile_shapeA_tcorner tA;
           auto gA = gIterA(Mb, Kb);
-          TCOPYIN(tA, gA);
+          TLOAD(tA, gA);
           MATMUL(tACC, tA, tB_rmdK);
 
           if constexpr (is_first) {
@@ -2191,14 +2191,14 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
       if constexpr (rmd_N) {
         tile_shapeB_tcorner tB_rmdK;
         auto gB = gIterB(Kb, Nb);
-        TCOPYIN(tB_rmdK, gB);
+        TLOAD(tB_rmdK, gB);
 
         #pragma clang loop unroll(full)
         for (int row = 0; row < Mb; row++) {
           tile_shapeACC_trows tACC;
           tile_shapeA_trows tA;
           auto gA = gIterA(row, Kb);
-          TCOPYIN(tA, gA);
+          TLOAD(tA, gA);
           MATMUL(tACC, tA, tB_rmdK);
 
           if constexpr (is_first) {
@@ -2214,7 +2214,7 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
           tile_shapeACC_tcorner tACC;
           tile_shapeA_tcorner tA;
           auto gA = gIterA(Mb, Kb);
-          TCOPYIN(tA, gA);
+          TLOAD(tA, gA);
           MATMUL(tACC, tA, tB_rmdK);
 
           if constexpr (is_first) {
@@ -2238,15 +2238,15 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
         tile_C_bf16 tC_b;
         // TMOV_NZ2DN(tC_b, tC_main[m][n]);
         auto gC = gIterC(m, n);
-        // TCOPYOUT(gC, tC_b);
-        TCOPYOUT(gC, tC_main[m][n]);
+        // TSTORE(gC, tC_b);
+        TSTORE(gC, tC_main[m][n]);
       }
       if constexpr (rmd_N) {
         tile_C_bf16_trows tC_b;
         // TMOV_NZ2DN(tC_b, tC_rcol[m]);
         auto gC = gIterC(m, Nb);
-        TCOPYOUT(gC, tC_rcol[m]);
-        // TCOPYOUT(gC, tC_b);
+        TSTORE(gC, tC_rcol[m]);
+        // TSTORE(gC, tC_b);
       }
     }
     if constexpr (rmd_M) {
@@ -2255,15 +2255,15 @@ void matmul_mask_reuseB_OPT2(float *dst, dtype *src0, dtype *src1){
         tile_C_bf16_tcols tC_b;
         // TMOV_NZ2DN(tC_b, tC_rrow[n]);
         auto gC = gIterC(Mb, n);
-        TCOPYOUT(gC,  tC_rrow[n]);
-        // TCOPYOUT(gC, tC_b);
+        TSTORE(gC,  tC_rrow[n]);
+        // TSTORE(gC, tC_b);
       }
       if constexpr (rmd_N) {
         tile_C_bf16_tcorner tC_b;
         // TMOV_NZ2DN(tC_b, tC_corner);
         auto gC = gIterC(Mb, Nb);
-        TCOPYOUT(gC, tC_corner);
-        // TCOPYOUT(gC, tC_b);
+        TSTORE(gC, tC_corner);
+        // TSTORE(gC, tC_b);
       }
     }
 
@@ -2331,7 +2331,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
         // #pragma clang loop unroll(full)
         // for(int k=0;k<R.k;k++){
         //   auto gB = gIterB(k, ii+i*R.n);
-        //   TCOPYIN(tB[k][ii], gB);
+        //   TLOAD(tB[k][ii], gB);
         // }
 
         #pragma clang loop unroll(full)
@@ -2342,11 +2342,11 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gA = gIterA(j,k);
             tile_shapeA tA;
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if(j==0){
               // eliminate head cost
               auto gB = gIterB(k, ii+i*R.n);
-              TCOPYIN(tB[k][ii], gB);
+              TLOAD(tB[k][ii], gB);
             }
             if(k==0){
               MATMUL(tACC, tA, tB[k][ii]);
@@ -2363,8 +2363,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
               auto gB = gIterB(k,i*R.n+ii);
               auto gA = gIterA(j,k);
 
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               MATMACC(tACC, tA, tB);
             }
           }
@@ -2377,8 +2377,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             tile_shapeA_trows tA;
 
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -2387,7 +2387,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           }
 
           auto gC = gIterC(j, i*R.n+ii);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // [n, rmd_M, k]
@@ -2398,7 +2398,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gA = gIterA(Mb,k);
             tile_shapeA_tcols tA;
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if(k==0){
               MATMUL(tACC, tA, tB[k][ii]);
             }else{
@@ -2414,8 +2414,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
               auto gB = gIterB(k, i*R.n+ii);
               auto gA = gIterA(Mb,k);
 
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               MATMACC(tACC, tA, tB);
             }
           }
@@ -2428,8 +2428,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             tile_shapeB_tcols tB;
             tile_shapeA_tcorner tA;
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -2438,7 +2438,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           }
 
           auto gC = gIterC(Mb, i*R.n+ii);
-          TCOPYOUT_ACC(gC, tACC);       
+          TSTORE_ACC(gC, tACC);
         }
 
       }
@@ -2447,7 +2447,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
     // [rN, m, k]
     if constexpr(rN>0){
       tile_shapeB tB[R.k][rN];
-      
+
       #pragma clang loop unroll(full)
       for(int i=0;i<rN;i++){
 
@@ -2455,7 +2455,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for(int k=0;k<R.k;k++){
           auto gB = gIterB(k, i+dN*R.n);
-          TCOPYIN(tB[k][i], gB);
+          TLOAD(tB[k][i], gB);
         }
 
         #pragma clang loop unroll(full)
@@ -2466,7 +2466,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gA = gIterA(j,k);
             tile_shapeA tA;
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if(k==0){
               MATMUL(tACC, tA, tB[k][i]);
             }else{
@@ -2482,8 +2482,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
               auto gB = gIterB(k, i+dN*R.n);
               auto gA = gIterA(j, k);
 
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               if constexpr (R.k == 0) {
                 MATMUL(tACC, tA, tB);
               } else
@@ -2499,8 +2499,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             tile_shapeB_tcols tB;
             tile_shapeA_trows tA;
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -2508,7 +2508,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             }
           }
           auto gC = gIterC(j, i+dN*R.n);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
 
         // [rN, rmd_M, k]
@@ -2519,7 +2519,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           for(int k=0;k<R.k;k++){
             auto gA = gIterA(Mb,k);
             tile_shapeA_tcols tA;
-            TCOPYIN(tA, gA);
+            TLOAD(tA, gA);
             if(k==0){
               MATMUL(tACC, tA, tB[k][i]);
             }else{
@@ -2534,8 +2534,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
               tile_shapeA_tcols tA;
               auto gB = gIterB(k, i+dN*R.n);
               auto gA = gIterA(Mb,k);
-              TCOPYIN(tA,gA);
-              TCOPYIN(tB,gB);
+              TLOAD(tA,gA);
+              TLOAD(tB,gB);
               if constexpr (R.k == 0)
                 MATMUL(tACC, tA, tB);
               else
@@ -2551,8 +2551,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             tile_shapeB_tcols tB;
             tile_shapeA_tcorner tA;
 
-            TCOPYIN(tA, gA);
-            TCOPYIN(tB, gB);
+            TLOAD(tA, gA);
+            TLOAD(tB, gB);
             if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
             } else {
@@ -2560,7 +2560,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             }
           }
           auto gC = gIterC(Mb, i+dN*R.n);
-          TCOPYOUT_ACC(gC, tACC);        
+          TSTORE_ACC(gC, tACC);
         }
       }
     }
@@ -2568,11 +2568,11 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
     // [rmd_N, m, k]
     if constexpr (rmd_N) {
       tile_shapeB_trows tB[R.k];
-      
+
       #pragma clang loop unroll(full)
       for(int k=0;k<R.k;k++){
         auto gB = gIterB(k, Nb);
-        TCOPYIN(tB[k], gB);
+        TLOAD(tB[k], gB);
       }
 
       #pragma clang loop unroll(full)
@@ -2583,7 +2583,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
         for(int k=0;k<R.k;k++){
           auto gA = gIterA(j,k);
           tile_shapeA tA;
-          TCOPYIN(tA, gA);
+          TLOAD(tA, gA);
           if(k==0){
             MATMUL(tACC, tA, tB[k]);
           }else{
@@ -2598,8 +2598,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             tile_shapeA tA;
             auto gB = gIterB(k,Nb);
             auto gA = gIterA(j,k);
-            TCOPYIN(tA,gA);
-            TCOPYIN(tB,gB);
+            TLOAD(tA,gA);
+            TLOAD(tB,gB);
             if constexpr (R.k == 0)
               MATMUL(tACC, tA, tB);
             else
@@ -2615,8 +2615,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           tile_shapeB_tcorner tB;
           tile_shapeA_trows tA;
 
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
           MATMACC(tACC, tA, tB);
           } else {
@@ -2624,7 +2624,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           }
         }
         auto gC = gIterC(j, Nb);
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
 
       // [rmd_N, rmd_M, k]
@@ -2635,7 +2635,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
         for(int k=0;k<R.k;k++){
           auto gA = gIterA(Mb,k);
           tile_shapeA_tcols tA;
-          TCOPYIN(tA, gA);
+          TLOAD(tA, gA);
           if(k==0){
             MATMUL(tACC, tA, tB[k]);
           }else{
@@ -2650,8 +2650,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
             tile_shapeA_tcols tA;
             auto gB = gIterB(k,Nb);
             auto gA = gIterA(Mb,k);
-            TCOPYIN(tA,gA);
-            TCOPYIN(tB,gB);
+            TLOAD(tA,gA);
+            TLOAD(tB,gB);
             if constexpr (R.k == 0)
               MATMUL(tACC, tA, tB);
             else
@@ -2667,8 +2667,8 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           tile_shapeB_tcorner tB;
           tile_shapeA_tcorner tA;
 
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
           if constexpr(Kb>0){
           MATMACC(tACC, tA, tB);
           } else {
@@ -2676,7 +2676,7 @@ void matmul_mask_reuseB(float *dst, dtype *src0, dtype *src1){
           }
         }
         auto gC = gIterC(Mb,Nb);
-        TCOPYOUT_ACC(gC, tACC);        
+        TSTORE_ACC(gC, tACC);
       }
     }
 
@@ -2760,10 +2760,10 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for(int k=0;k<R.k;k++){
           auto gA = gIterA(m+i*R.m,k);
-          TCOPYIN(tA[m][k], gA);
+          TLOAD(tA[m][k], gA);
         }
       }
-      
+
       #pragma clang loop unroll(full)
       for(int j=0;j<dN;j++){
 
@@ -2773,7 +2773,7 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k, n+j*R.n);
-            TCOPYIN(tB[k][n], gB);
+            TLOAD(tB[k][n], gB);
           }
         }
 
@@ -2797,13 +2797,13 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
                 tile_shapeB tB;
                 auto gA = gIterA(i*R.m+ii,k);
                 auto gB = gIterB(k,j*R.n+jj);
-                TCOPYIN(tA,gA);
-                TCOPYIN(tB,gB);
+                TLOAD(tA,gA);
+                TLOAD(tB,gB);
                 MATMACC(tACC, tA, tB);
               }
             }
             auto gC = gIterC(i*R.m+ii,j*R.n+jj);
-            TCOPYOUT_ACC(gC, tACC);    
+            TSTORE_ACC(gC, tACC);
           }
         }
       }
@@ -2816,7 +2816,7 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k, n+dN*R.n);
-            TCOPYIN(tB[k][n], gB);
+            TLOAD(tB[k][n], gB);
           }
         }
 
@@ -2841,18 +2841,18 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
                 tile_shapeB tB;
                 auto gA = gIterA(i*R.m+ii,k);
                 auto gB = gIterB(k,dN*R.n+jj);
-                TCOPYIN(tA,gA);
-                TCOPYIN(tB,gB);
+                TLOAD(tA,gA);
+                TLOAD(tB,gB);
                 MATMACC(tACC, tA, tB);
               }
             }
             auto gC = gIterC(i*R.m+ii,dN*R.n+jj);
-            TCOPYOUT_ACC(gC, tACC);    
+            TSTORE_ACC(gC, tACC);
           }
         }
       }
     }
-    
+
     if constexpr(rM){
       tile_shapeA tA[rM][R.k];
       //copy in remaining M dimension A tile
@@ -2861,10 +2861,10 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
         #pragma clang loop unroll(full)
         for(int k=0;k<R.k;k++){
           auto gA = gIterA(m+dM*R.m,k);
-          TCOPYIN(tA[m][k], gA);
+          TLOAD(tA[m][k], gA);
         }
       }
-      
+
       #pragma clang loop unroll(full)
       for(int j=0;j<dN;j++){
 
@@ -2874,7 +2874,7 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k, n+j*R.n);
-            TCOPYIN(tB[k][n], gB);
+            TLOAD(tB[k][n], gB);
           }
         }
 
@@ -2898,13 +2898,13 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
                 tile_shapeB tB;
                 auto gA = gIterA(dM*R.m+ii,k);
                 auto gB = gIterB(k,j*R.n+jj);
-                TCOPYIN(tA,gA);
-                TCOPYIN(tB,gB);
+                TLOAD(tA,gA);
+                TLOAD(tB,gB);
                 MATMACC(tACC, tA, tB);
               }
             }
             auto gC = gIterC(dM*R.m+ii,j*R.n+jj);
-            TCOPYOUT_ACC(gC, tACC);    
+            TSTORE_ACC(gC, tACC);
           }
         }
       }
@@ -2917,7 +2917,7 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
           #pragma clang loop unroll(full)
           for(int k=0;k<R.k;k++){
             auto gB = gIterB(k, n+dN*R.n);
-            TCOPYIN(tB[k][n], gB);
+            TLOAD(tB[k][n], gB);
           }
         }
 
@@ -2942,13 +2942,13 @@ void matmul_mask_reuseAB(float *dst, dtype *src0, dtype *src1){
                 tile_shapeB tB;
                 auto gA = gIterA(dM*R.m+ii,k);
                 auto gB = gIterB(k,dN*R.n+jj);
-                TCOPYIN(tA,gA);
-                TCOPYIN(tB,gB);
+                TLOAD(tA,gA);
+                TLOAD(tB,gB);
                 MATMACC(tACC, tA, tB);
               }
             }
             auto gC = gIterC(dM*R.m+ii,dN*R.n+jj);
-            TCOPYOUT_ACC(gC, tACC);    
+            TSTORE_ACC(gC, tACC);
           }
         }
       }
@@ -2997,7 +2997,7 @@ void matmul_mask_multi4_B(float *dst, dtype *src0, dtype *src1){
             for(int k=0;k<Kb;k++){
               tile_shapeA tA;
               auto gA = gIterA(i,k);
-              TCOPYIN(tA, gA);
+              TLOAD(tA, gA);
               if(k==0){
                 MATMUL(tACC, tA, tB[k][jj]);
               }else{
@@ -3005,7 +3005,7 @@ void matmul_mask_multi4_B(float *dst, dtype *src0, dtype *src1){
               }
             }
             auto gC = gIterC(i,j+jj);
-            TCOPYOUT_ACC(gC, tACC);
+            TSTORE_ACC(gC, tACC);
           }
         }
       }
@@ -3067,7 +3067,7 @@ void matmul_mask_multi4_AB(float *dst, dtype *src0, dtype *src1){
             MATMACC(tACC, tA[k+3], tB[k+3][jj]);
           }
           auto gC = gIterC(i,j);
-          TCOPYOUT_ACC(gC, tACC);
+          TSTORE_ACC(gC, tACC);
         }
       }
     }
@@ -3102,15 +3102,15 @@ __attribute__((noinline)) void matmul_dynamic_new(float* dst, dtype* src0, dtype
                   int dyn_k = gK - k > tK ? tK : gK - k;
                   tile_shapeA tA(dyn_m, dyn_k);
                   tile_shapeB tB(dyn_k, dyn_n);
-                  TCOPYIN(tA, gA);
-                  TCOPYIN(tB, gB);
+                  TLOAD(tA, gA);
+                  TLOAD(tB, gB);
                   if(k==0){
                     MATMUL(tACC, tA, tB);
                   }else{
                     MATMACC(tACC, tA, tB);
                   }
               }
-              TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
+              TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
           }
       }
     }
@@ -3150,15 +3150,15 @@ __attribute__((noinline)) void matmul_dynamic(float* dst, dtype* src0, dtype* sr
                   int dyn_k = (k+1) * tK > gK ? rem_k:tK;
                   tile_shapeA tA(dyn_m, dyn_k);
                   tile_shapeB tB(dyn_k, dyn_n);
-                  TCOPYIN(tA, gA);
-                  TCOPYIN(tB, gB);
+                  TLOAD(tA, gA);
+                  TLOAD(tB, gB);
                   if(k==0){
                     MATMUL(tACC, tA, tB);
                   }else{
                     MATMACC(tACC, tA, tB);
                   }
               }
-              TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
+              TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
           }
       }
     }
@@ -3201,7 +3201,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
     int rem_k = gK % tK;
 
     ResA R = find_reuseA_dynamic(Mb, Kb, MAX_TILE_NUM);
-    
+
     int dM = R.m == 0? 0 : Mb / R.m;
     int rM = R.m == 0? 0 : Mb % R.m;
 
@@ -3223,7 +3223,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
         tile_shapeA tA[m_step][R.k];
 
         for (int mm=0;mm<m_step;mm++) {
-          for (int kk=0;kk<R.k;kk++) { 
+          for (int kk=0;kk<R.k;kk++) {
             if( (i+mm+1) * tM > gM ){
               tA[mm][kk]= tile_shapeA(rem_m, tK);
             }else{
@@ -3236,7 +3236,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
           for(int k=0;k<R.k;k++){
             size_t offset_A = (i+ii) * gK * tile_shapeA::Rows + k * tile_shapeA::Cols;
             gm_shapeA gA(src0+offset_A, gM, gK);
-            TCOPYIN(tA[ii][k], gA);
+            TLOAD(tA[ii][k], gA);
           }
 
           int dyn_m = (i+ii+1) * tM > gM? rem_m:tM;
@@ -3249,7 +3249,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
               size_t offset_B = k * gN * tile_shapeB::Rows + j * tile_shapeB::Cols;
               gm_shapeB gB(src1 + offset_B, gK, gN);
               tile_shapeB tB(tK, dyn_n);
-              TCOPYIN(tB, gB);
+              TLOAD(tB, gB);
               if(k==0){
                 MATMUL(tACC, tA[ii][k], tB);
               }else{
@@ -3268,8 +3268,8 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
                 tile_shapeA tA(dyn_m, dyn_k);
                 tile_shapeB tB(dyn_k, dyn_n);
 
-                TCOPYIN(tA, gA);
-                TCOPYIN(tB, gB);
+                TLOAD(tA, gA);
+                TLOAD(tB, gB);
                 if(k==0){
                   MATMUL(tACC, tA, tB);
                 }else{
@@ -3280,7 +3280,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
 
             size_t offset_C = (i+ii) * gN * tile_shapeACC::Rows + j * tile_shapeACC::Cols;
             gm_shapeC gC(dst + offset_C, gM, gN);
-            TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
+            TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
           }
         }
 
@@ -3313,7 +3313,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
     R.n = Ra.m;
     R.k = Ra.k;
     R.val = Ra.val;
-    
+
     int dN = R.n == 0? 0 : Nb / R.n;
     int rN = R.n == 0? 0 : Nb % R.n;
 
@@ -3333,7 +3333,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
         tile_shapeB tB[R.k][n_step];
 
         for (int nn=0;nn<n_step;nn++) {
-          for (int kk=0;kk<R.k;kk++) { 
+          for (int kk=0;kk<R.k;kk++) {
             if( (i+nn+1) * tN > gN ){
               tB[kk][nn]= tile_shapeB(tK, rem_n);
             }else{
@@ -3346,7 +3346,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
           for(int k=0;k<R.k;k++){
             size_t offset_B = k * gN * tile_shapeB::Rows + (i+ii) * tile_shapeB::Cols;
             gm_shapeB gB(src1+offset_B, gK, gN);
-            TCOPYIN(tB[k][ii], gB);
+            TLOAD(tB[k][ii], gB);
           }
 
           int dyn_n = (i+ii+1) * tN > gN? rem_n:tN;
@@ -3359,7 +3359,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
               size_t offset_A = j * gK * tile_shapeA::Rows + k * tile_shapeA::Cols;
               gm_shapeA gA(src0 + offset_A, gM, gK);
               tile_shapeA tA(dyn_m, tK);
-              TCOPYIN(tA, gA);
+              TLOAD(tA, gA);
               if(k==0){
                 MATMUL(tACC, tA, tB[k][ii]);
               }else{
@@ -3378,8 +3378,8 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
                 tile_shapeA tA(dyn_m, dyn_k);
                 tile_shapeB tB(dyn_k, dyn_n);
 
-                TCOPYIN(tA, gA);
-                TCOPYIN(tB, gB);
+                TLOAD(tA, gA);
+                TLOAD(tB, gB);
                 if(k==0){
                   MATMUL(tACC, tA, tB);
                 }else{
@@ -3390,7 +3390,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
 
             size_t offset_C =  j * gN * tile_shapeACC::Rows + (i+ii) * tile_shapeACC::Cols;
             gm_shapeC gC(dst + offset_C, gM, gN);
-            TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
+            TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol());
           }
         }
 
@@ -3446,8 +3446,8 @@ void matmul_mx(float *dst, dtype *src0, dtype *src1, uint8_t *src0_mx, uint8_t *
           tile_shapeB tB;
           tile_shapeAMX tAMX;
           tile_shapeBMX tBMX;
-          TCOPYIN(tA, gA);
-          TCOPYIN(tB, gB);
+          TLOAD(tA, gA);
+          TLOAD(tB, gB);
 
           blk_tload(tAMX.GetValidCol(), tAMX.GetValidRow(), tile_shapeAMX::Cols,
           type_traits<typename tile_shapeAMX::DType>::TypeCode,
@@ -3471,7 +3471,7 @@ void matmul_mx(float *dst, dtype *src0, dtype *src1, uint8_t *src0_mx, uint8_t *
             MATMACCMX(tACC, tA, tAMX, tB, tBMX);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
     }
   }
 }
@@ -3520,16 +3520,16 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
         auto gC = gCIter(i, j);
 
         tile_shapeACC tACC;
-        
+
         if constexpr(Kb>0){
           auto gA = gAIter(i, 0);
           auto gB = gBIter(0, j);
 
           tile_shapeA tA;
           tile_shapeB tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -3538,8 +3538,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA tA;
           tile_shapeB tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           MATMACC(tACC, tA, tB);
         }
 
@@ -3549,15 +3549,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_trows tA;
           tile_shapeB_tcols tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
       if constexpr (rmd_N) {
         auto gC = gCIter(i, Nb);
@@ -3569,9 +3569,9 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA tA;
           tile_shapeB_trows tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -3580,8 +3580,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA tA;
           tile_shapeB_trows tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -3590,15 +3590,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_trows tA;
           tile_shapeB_tcorner tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
     }
     if constexpr (rmd_M) {
@@ -3612,9 +3612,9 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -3623,8 +3623,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -3633,15 +3633,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcorner tA;
           tile_shapeB_tcols tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
       if constexpr (rmd_N) {
         auto gC = gCIter(Mb, Nb);
@@ -3653,9 +3653,9 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB_trows tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
-          MATMUL(tACC, tA, tB);        
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
+          MATMUL(tACC, tA, tB);
         }
         #pragma clang loop unroll(full)
         for (int k = 1; k < Kb; ++k) {
@@ -3664,8 +3664,8 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcols tA;
           tile_shapeB_trows tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           MATMACC(tACC, tA, tB);
         }
         if constexpr (rmd_K) {
@@ -3674,15 +3674,15 @@ void matmul_mask_2lvl(float *c_ptr, dtype *a_ptr, dtype *b_ptr) {
 
           tile_shapeA_tcorner tA;
           tile_shapeB_tcorner tB;
-          TCOPYIN_2LVL(tA, gA);
-          TCOPYIN_2LVL(tB, gB);
+          TLOAD_2LVL(tA, gA);
+          TLOAD_2LVL(tB, gB);
           if constexpr(Kb>0){
             MATMACC(tACC, tA, tB);
           } else {
             MATMUL(tACC, tA, tB);
           }
         }
-        TCOPYOUT_ACC(gC, tACC);
+        TSTORE_ACC(gC, tACC);
       }
     }
   }
@@ -3718,11 +3718,11 @@ void matmul_vec(float* dst, float* src0, float* src1){
                 auto gB = gBIter(k,j);
                 tile_shapeA tA;
                 tile_shapeB tB;
-                TCOPYIN(tA, gA);
-                TCOPYIN(tB, gB);
+                TLOAD(tA, gA);
+                TLOAD(tB, gB);
                 MATMACC(tACC, tA, tB);
             }
-            TCOPYOUT(gC, tACC);
+            TSTORE(gC, tACC);
         }
     }
 }
@@ -3745,10 +3745,10 @@ void matmul_tile_vec(float* dst, float* src0, float* src1) {
     tile_shape_B d1;
     tile_shape_C d2;
 
-    TCOPYIN(d0, s0);
-    TCOPYIN(d1, s1);
+    TLOAD(d0, s0);
+    TLOAD(d1, s1);
     MATMUL(d2, d0, d1);
-    TCOPYOUT(res, d2);
+    TSTORE(res, d2);
 }
 
 template <uint16_t M, uint16_t N, uint16_t K>
@@ -3769,10 +3769,10 @@ void matmul_tile_frac(float* dst, float* src0, float* src1) {
     tile_shape_B d1;
     tile_shape_C d2;
 
-    TCOPYIN(d0, s0);
-    TCOPYIN(d1, s1);
+    TLOAD(d0, s0);
+    TLOAD(d1, s1);
     MATMUL(d2, d0, d1);
-    TCOPYOUT_ACC(res, d2);
+    TSTORE_ACC(res, d2);
 }
 
 
diff --git a/kernels/other/matmul_dynamic_reuse.hpp b/kernels/other/matmul_dynamic_reuse.hpp
index 1600964..7a1693f 100644
--- a/kernels/other/matmul_dynamic_reuse.hpp
+++ b/kernels/other/matmul_dynamic_reuse.hpp
@@ -19,7 +19,7 @@
           for(int k=0;k<RK;k++){ \
             size_t offset_A = (i+ii) * gK * tile_shapeA::Rows + k * tile_shapeA::Cols; \
             gm_shapeA gA(src0+offset_A, gM, gK); \
-            TCOPYIN(tA[ii][k], gA); \
+            TLOAD(tA[ii][k], gA); \
           } \
  \
           int dyn_m = (i+ii+1) * tM > gM? rem_m:tM; \
@@ -32,7 +32,7 @@
               size_t offset_B = k * gN * tile_shapeB::Rows + j * tile_shapeB::Cols; \
               gm_shapeB gB(src1 + offset_B, gK, gN); \
               tile_shapeB tB(tK, dyn_n); \
-              TCOPYIN(tB, gB); \
+              TLOAD(tB, gB); \
               if(k==0){ \
                 MATMUL(tACC, tA[ii][k], tB); \
               }else{ \
@@ -51,8 +51,8 @@
                 tile_shapeA tA(dyn_m, dyn_k); \
                 tile_shapeB tB(dyn_k, dyn_n); \
  \
-                TCOPYIN(tA, gA); \
-                TCOPYIN(tB, gB); \
+                TLOAD(tA, gA); \
+                TLOAD(tB, gB); \
                 if(k==0){ \
                   MATMUL(tACC, tA, tB); \
                 }else{ \
@@ -63,7 +63,7 @@
  \
             size_t offset_C = (i+ii) * gN * tile_shapeACC::Rows + j * tile_shapeACC::Cols; \
             gm_shapeC gC(dst + offset_C, gM, gN); \
-            TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \
+            TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \
           } \
         }
 
@@ -114,7 +114,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
           for(int k=0;k<RK;k++){ \
             size_t offset_B = k * gN * tile_shapeB::Rows + (i+ii) * tile_shapeB::Cols; \
             gm_shapeB gB(src1+offset_B, gK, gN); \
-            TCOPYIN(tB[k][ii], gB); \
+            TLOAD(tB[k][ii], gB); \
           } \
  \
           int dyn_n = (i+ii+1) * tN > gN? rem_n:tN; \
@@ -127,7 +127,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
               size_t offset_A = j * gK * tile_shapeA::Rows + k * tile_shapeA::Cols; \
               gm_shapeA gA(src0 + offset_A, gM, gK); \
               tile_shapeA tA(dyn_m, tK); \
-              TCOPYIN(tA, gA); \
+              TLOAD(tA, gA); \
               if(k==0){ \
                 MATMUL(tACC, tA, tB[k][ii]); \
               }else{ \
@@ -146,8 +146,8 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
                 tile_shapeA tA(dyn_m, dyn_k); \
                 tile_shapeB tB(dyn_k, dyn_n); \
  \
-                TCOPYIN(tA, gA); \
-                TCOPYIN(tB, gB); \
+                TLOAD(tA, gA); \
+                TLOAD(tB, gB); \
                 if(k==0){ \
                   MATMUL(tACC, tA, tB); \
                 }else{ \
@@ -158,7 +158,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseA(float* dst, dtype* src0, dt
  \
             size_t offset_C =  j * gN * tile_shapeACC::Rows + (i+ii) * tile_shapeACC::Cols; \
             gm_shapeC gC(dst + offset_C, gM, gN); \
-            TCOPYOUT_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \
+            TSTORE_ACC_DYNAMIC(gC, tACC, tACC.GetValidRow(), tACC.GetValidCol()); \
           } \
         }
 
@@ -178,7 +178,7 @@ __attribute__((noinline)) void matmul_dynamic_reuseB(float* dst, dtype* src0, dt
     int rem_m = gM % tM;
     int rem_n = gN % tN;
     int rem_k = gK % tK;
-    
+
     for (int b=0;b<Batch;b++){
       for(int i=0;i<Nb;){
         if((i+RN) <= Nb){
diff --git a/kernels/other/normalization.hpp b/kernels/other/normalization.hpp
index b3bee22..5f125d8 100644
--- a/kernels/other/normalization.hpp
+++ b/kernels/other/normalization.hpp
@@ -7,9 +7,9 @@ template<typename dtype, const int kM, const int kN, const int kTM, const int kT
 void rmsnorm(dtype *dst, dtype *src){
     using gm_shape = global_tensor<dtype, RowMajor<kM, kN>>;
     using tile_shape = Tile<Location::Vec, dtype, kTM, kTN, BLayout::RowMajor>;
- 
+
     using tSum = Tile<Location::Vec, dtype, kTM, 1, BLayout::RowMajor>;
- 
+
     using gIter = global_iterator<gm_shape, tile_shape>;
 
     gIter giter_src(src);
@@ -26,19 +26,19 @@ void rmsnorm(dtype *dst, dtype *src){
         {
             auto gsrc = giter_src(i, j);
             tile_shape tsrc;
- 
-            TCOPYIN(tsrc, gsrc);
+
+            TLOAD(tsrc, gsrc);
 
             tSum tLocalSum;
             TMUL(tsrc, tsrc, tsrc);
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSquareSum, tAccSquareSum, tLocalSum);
         }
- 
+
         tSum gSqureMean;
         TDIVS(gSqureMean, tAccSquareSum, kN);
         TSQRT(gSqureMean, gSqureMean);
- 
+
         tile_shape gSqureMean_i;
         TEXPANDCOL(gSqureMean_i, gSqureMean);
 
@@ -46,12 +46,12 @@ void rmsnorm(dtype *dst, dtype *src){
         {
             auto  gsrc = giter_src(i,j);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
- 
+            TLOAD(tsrc, gsrc);
+
             TDIV(tsrc, tsrc, gSqureMean_i);
- 
+
             auto gdst = giter_dst(i,j);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
@@ -62,9 +62,9 @@ void layernorm(dtype *dst, dtype *src)
 {
     using gm_shape = global_tensor<dtype, RowMajor<kM, kN>>;
     using tile_shape = Tile<Location::Vec, dtype, kTM, kTN, BLayout::RowMajor>;
- 
+
     using tSum = Tile<Location::Vec, dtype, kTM, 1, BLayout::RowMajor>;
- 
+
     using gIter = global_iterator<gm_shape, tile_shape>;
 
     gIter giter_src(src);
@@ -77,23 +77,23 @@ void layernorm(dtype *dst, dtype *src)
     {
         tSum tAccSum(0);        // tiling sum
         tSum tAccSquareSum(0);  // tiling square sum
-  
+
         for(int j=0;j<Nb;j++)
         {
             auto gsrc = giter_src(i, j);
             tile_shape tsrc;
- 
-            TCOPYIN(tsrc, gsrc);
- 
+
+            TLOAD(tsrc, gsrc);
+
             tSum tLocalSum;
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSum, tAccSum, tLocalSum);
- 
+
             TMUL(tsrc, tsrc, tsrc);
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSquareSum, tAccSquareSum, tLocalSum);
         }
- 
+
         tSum gMean;        // Ex
         tSum gMeanSquare;  // (Ex)^2
         tSum gStdDev;      // Ex^2
@@ -102,7 +102,7 @@ void layernorm(dtype *dst, dtype *src)
         TDIVS(gStdDev, tAccSquareSum, kN);
         TSUB(gStdDev, gStdDev, gMeanSquare);
         TSQRT(gStdDev, gStdDev);
- 
+
         tile_shape gMean_i;
         tile_shape gStdDev_i;
         TEXPANDCOL(gMean_i, gMean);
@@ -112,13 +112,13 @@ void layernorm(dtype *dst, dtype *src)
         {
             auto  gsrc = giter_src(i,j);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
- 
+            TLOAD(tsrc, gsrc);
+
             TSUB(tsrc, tsrc, gMean_i);    // (x - Ex)
             TDIV(tsrc, tsrc, gStdDev_i);  // (x - Ex) / (Ex^2 - (Ex)^2)^.5
- 
+
             auto gdst = giter_dst(i,j);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
\ No newline at end of file
diff --git a/kernels/other/pooling.hpp b/kernels/other/pooling.hpp
index 1fb64a2..f5c2699 100644
--- a/kernels/other/pooling.hpp
+++ b/kernels/other/pooling.hpp
@@ -37,14 +37,14 @@ void max_pool_forward(dtype *out, dtype *pic, const pool_pm pool){
                     gm_pic gpic(pic+ n*C*H*W + c*H*W + h*pool.stride*W + w*pool.stride); //pic[n, c, h*pool.stride, w*pool.stride]
 
                     tile_filt tpic;
-                    TCOPYIN(tpic, gpic);
+                    TLOAD(tpic, gpic);
                     TROWMAXEXPAND(tpic, tpic);
                     TCOLMAXEXPAND(tpic, tpic);
                     TCOPY(tmp, tpic);
 
                     int offset = n*C*H_out*W_out + c*H_out*W_out + h*W_out + w;
                     gm_out gO(out+offset);
-                    TCOPYOUT(gO, tpic);
+                    TSTORE(gO, tpic);
                 }
             }
         }
@@ -75,7 +75,7 @@ void avg_pool_forward(dtype *out, dtype *pic, const pool_pm pool){
                     gm_pic gpic(pic+ n*C*H*W + c*H*W + h*pool.stride*W + w*pool.stride); //pic[n, c, h*pool.stride, w*pool.stride]
 
                     tile_filt tpic;
-                    TCOPYIN(tpic, gpic);
+                    TLOAD(tpic, gpic);
                     TROWSUMEXPAND(tpic, tpic);
                     TCOLSUMEXPAND(tpic, tpic);
                     TDIVS(tpic, tpic, HH*WW);
@@ -83,7 +83,7 @@ void avg_pool_forward(dtype *out, dtype *pic, const pool_pm pool){
 
                     int offset = n*C*H_out*W_out + c*H_out*W_out + h*W_out + w;
                     gm_out gO(out+offset);
-                    TCOPYOUT(gO, tpic);
+                    TSTORE(gO, tpic);
                 }
             }
         }
diff --git a/kernels/other/softmax.hpp b/kernels/other/softmax.hpp
index fa4c600..b651132 100644
--- a/kernels/other/softmax.hpp
+++ b/kernels/other/softmax.hpp
@@ -20,7 +20,7 @@ void softmax(dtype* dst, dtype* src){
             uint32_t offset = i*kTM*kN+j*kTN;
             gm_shape gsrc(src+offset);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
+            TLOAD(tsrc, gsrc);
 
             tMax tLocalMax;
             TROWMAX(tLocalMax, tsrc);
@@ -54,7 +54,7 @@ void softmax(dtype* dst, dtype* src){
             uint32_t offset = i*kTM*kN+j*kTN;
             gm_shape gsrc(src+offset);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
+            TLOAD(tsrc, gsrc);
 
             tile_shape gMax;
             tile_shape gSum;
@@ -66,7 +66,7 @@ void softmax(dtype* dst, dtype* src){
             TDIV(tsrc, tsrc, gSum);
 
             gm_shape gdst(dst+offset);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
diff --git a/kernels/reduction/cumsum_colvec.hpp b/kernels/reduction/cumsum_colvec.hpp
index e7ca443..02a8345 100644
--- a/kernels/reduction/cumsum_colvec.hpp
+++ b/kernels/reduction/cumsum_colvec.hpp
@@ -23,41 +23,41 @@ void __vec__ cumsum_col_kernel(
     typename tileSum::TileDType __out__ new_sum,
     typename tileData::TileDType __out__ out,
     const typename tileData::TileDType __in__ src,
-    const typename tileSum::TileDType __in__ old_sum    
+    const typename tileSum::TileDType __in__ old_sum
 )
 {
-    size_t i = blkv_get_index_x();   
-    size_t sum_idx = i * tileSum::RowStride;        
+    size_t i = blkv_get_index_x();
+    size_t sum_idx = i * tileSum::RowStride;
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
-    __vbuf__ typename tileData::DType *out_ptr = blkv_get_tile_ptr(out);    
+    __vbuf__ typename tileData::DType *out_ptr = blkv_get_tile_ptr(out);
     __vbuf__ typename tileData::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 
     typename tileSum::DType upd_sum = old_sum_ptr[i];
 //    printf("upd_sum = %d",upd_sum);
-//    typename tileData::DType upd_out = old_sum_ptr[i];    
-/*    
+//    typename tileData::DType upd_out = old_sum_ptr[i];
+/*
     for(size_t j=0;j<tileSrc::ValidRow;j+=4){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
         size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
-        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileSum::DType sum_0123 = sum_01 + sum_23; 
-        upd_sum = upd_sum + sum_0123;              
+        typename tileSum::DType sum_0123 = sum_01 + sum_23;
+        upd_sum = upd_sum + sum_0123;
     }
 */
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileData::ValidRow;j++){
         size_t idx =  i * tileData::ColStride + j * tileData::RowStride;
-        typename tileData::DType sum_out = upd_sum + src_ptr[idx];        
-        upd_sum = sum_out;              
+        typename tileData::DType sum_out = upd_sum + src_ptr[idx];
+        upd_sum = sum_out;
         out_ptr[idx] = static_cast<typename tileData::DType>(sum_out);
-    }    
-    new_sum_ptr[i] = upd_sum;    
+    }
+    new_sum_ptr[i] = upd_sum;
 }
 
 
@@ -66,35 +66,35 @@ void __vec__ cumsum_col_kernel(
 template<typename dtype, const int gIM, const int gIN, const int tM, const int tN>
 void cumsum_col_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
-    const int rmd_M = gIM % tM; 
-    const int rmd_N = gIN % tN; 
+    const int rmd_M = gIM % tM;
+    const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //将gm中的Tensor先声明为一维数据 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
 
-    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
+    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
     using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
     tile_shapeData OutTile;
@@ -105,19 +105,19 @@ void cumsum_col_rand(
 
     tile_shapeData_row dataTile_row;
     tile_shapeData_row OutTile_row;
-    tile_shapeData_cor dataTile_cor;    
-    tile_shapeData_cor OutTile_cor;   
-    
+    tile_shapeData_cor dataTile_cor;
+    tile_shapeData_cor OutTile_cor;
+
     tile_shapeSum_row SumTile_row;
-    tile_shapeSum_row oldSumTile_row;    
+    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
-//    using itZero = global_iterator<gm_shapeOut, tile_shapeData>;    
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
+//    using itZero = global_iterator<gm_shapeOut, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeData>;
-//    using itSum = global_iterator<gm_shapeOut, tile_shapeSum>;    
+//    using itSum = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
@@ -128,44 +128,44 @@ void cumsum_col_rand(
         TEXPANDSCALAR(oldSumTile, 0);//初始化为0
         for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, j);
-            auto gO = gOIter(i, j); 
-            TCOPYIN(dataTile, gI);
-//            printf("in0 : %d, %d\n",in_ptr[i*tM], i*tM);            
+            auto gO = gOIter(i, j);
+            TLOAD(dataTile, gI);
+//            printf("in0 : %d, %d\n",in_ptr[i*tM], i*tM);
             cumsum_col_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeData::ValidCol, 1, 1>>>(SumTile.data(), OutTile.data(), dataTile.data(), oldSumTile.data());
             oldSumTile = SumTile;
-            TCOPYOUT(gO, OutTile);
-//            printf("out0 : %d,%d\n", out_ptr[i*tM],i*tM);   
+            TSTORE(gO, OutTile);
+//            printf("out0 : %d,%d\n", out_ptr[i*tM],i*tM);
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, j);
             auto gO = gOIter(Mb, j);
-            TCOPYIN(dataTile_col, gI);
+            TLOAD(dataTile_col, gI);
             cumsum_col_kernel<tile_shapeData_col,tile_shapeSum><<<tile_shapeData_col::ValidCol, 1, 1>>>(SumTile.data(), OutTile_col.data(), dataTile_col.data(), oldSumTile.data());
             oldSumTile = SumTile;
-            TCOPYOUT(gO, OutTile_col);
+            TSTORE(gO, OutTile_col);
         }
-//        TCOPYOUT(gO, SumTile);
+//        TSTORE(gO, SumTile);
     }
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
 //        auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0        
-//        TCOPYIN(oldSumTile_row, gZero);//初始化为0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldSumTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
             auto gO = gOIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             cumsum_col_kernel<tile_shapeData_row,tile_shapeSum_row><<<tile_shapeData_row::ValidCol, 1, 1>>>(SumTile_row.data(), OutTile_row.data(), dataTile_row.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
-            TCOPYOUT(gO, OutTile_row);            
+            TSTORE(gO, OutTile_row);
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
             auto gO = gOIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             cumsum_col_kernel<tile_shapeData_cor,tile_shapeSum_row><<<tile_shapeData_cor::ValidCol, 1, 1>>>(SumTile_row.data(), OutTile_cor.data(), dataTile_cor.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
-            TCOPYOUT(gO, OutTile_cor);
+            TSTORE(gO, OutTile_cor);
         }
     }
 /*
diff --git a/kernels/reduction/cumsum_rowvec.hpp b/kernels/reduction/cumsum_rowvec.hpp
index ec998c1..eeedc93 100644
--- a/kernels/reduction/cumsum_rowvec.hpp
+++ b/kernels/reduction/cumsum_rowvec.hpp
@@ -16,42 +16,42 @@ using namespace pto;
 template<typename tileData, typename tileSum>
 void __vec__ cumsum_row_kernel(
     typename tileSum::TileDType __out__ new_sum,
-    const typename tileData::TileDType __out__ out,    
+    const typename tileData::TileDType __out__ out,
     const typename tileData::TileDType __in__ src,
-    const typename tileSum::TileDType __in__ old_sum    
+    const typename tileSum::TileDType __in__ old_sum
 )
 {
-//    size_t i = blkv_get_index_x();  
-    size_t j = blkv_get_index_y();  
-    size_t sum_idx = j * tileSum::RowStride;    
+//    size_t i = blkv_get_index_x();
+    size_t j = blkv_get_index_y();
+    size_t sum_idx = j * tileSum::RowStride;
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
-    __vbuf__ typename tileData::DType *out_ptr = blkv_get_tile_ptr(out);    
+    __vbuf__ typename tileData::DType *out_ptr = blkv_get_tile_ptr(out);
     __vbuf__ typename tileData::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 
     typename tileSum::DType upd_sum = old_sum_ptr[sum_idx];
-/*    
+/*
     for(size_t j=0;j<tileSrc::ValidRow;j+=4){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
         size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
-        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileSum::DType sum_0123 = sum_01 + sum_23; 
-        upd_sum = upd_sum + sum_0123;              
+        typename tileSum::DType sum_0123 = sum_01 + sum_23;
+        upd_sum = upd_sum + sum_0123;
     }
 */
     #pragma clang loop unroll(full)
     for(size_t i=0;i<tileData::ValidCol;i++){
         size_t idx =  i * tileData::ColStride + j * tileData::RowStride;
         typename tileData::DType sum_out = upd_sum + src_ptr[idx];
-        upd_sum = sum_out;   
-        out_ptr[idx] = static_cast<typename tileData::DType>(sum_out);           
-    }    
-    new_sum_ptr[sum_idx] = upd_sum;    
+        upd_sum = sum_out;
+        out_ptr[idx] = static_cast<typename tileData::DType>(sum_out);
+    }
+    new_sum_ptr[sum_idx] = upd_sum;
 }
 
 
@@ -60,105 +60,105 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void cumsum_row_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // todo 尾块怎么处理？
-    const int rmd_N = gIN % tN; // todo 尾块怎么处理？    
+    const int rmd_N = gIN % tN; // todo: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //将gm中的Tensor先声明为一维数据 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeSum = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // todo 这里的location，一定要是Vec吗？哪怕没有传入Vec
 
 
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // todo 尾块怎么处理？是否要作为参数写在这   
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // todo: how should the tail block be handled? Should it be written here as a parameter?
     using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; // todo 尾块怎么处理？是否要作为参数写在这
-    using tile_shapeSum_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;      
+    using tile_shapeSum_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
-    tile_shapeData dataTile;                
+    tile_shapeData dataTile;
     tile_shapeData_row dataTile_row;
     tile_shapeData_col dataTile_col;
-    tile_shapeData_cor dataTile_cor;    
+    tile_shapeData_cor dataTile_cor;
 
-    tile_shapeData OutTile;                
+    tile_shapeData OutTile;
     tile_shapeData_row OutTile_row;
     tile_shapeData_col OutTile_col;
-    tile_shapeData_cor OutTile_cor;      
-    
+    tile_shapeData_cor OutTile_cor;
+
     tile_shapeSum SumTile;
     tile_shapeSum oldSumTile;
     tile_shapeSum_col SumTile_col;
-    tile_shapeSum_col oldSumTile_col;    
+    tile_shapeSum_col oldSumTile_col;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeData>;
 
     itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    printf("tile_shapeSum::ValidCol = %d\n",  tile_shapeSum::ValidCol);
-//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);    
+//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);
 
     for (int j = 0; j < Mb; ++j) {
 //        auto gO = gOIter(j, 0);
         TEXPANDSCALAR(oldSumTile, 0);//初始化为0
-        //初始化old_sum的tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(j, i); 
-            auto gO = gOIter(j, i);                               
-            TCOPYIN(dataTile, gI);    
+            auto gI = gIIter(j, i);
+            auto gO = gOIter(j, i);
+            TLOAD(dataTile, gI);
             cumsum_row_kernel<tile_shapeData, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), OutTile.data(), dataTile.data(), oldSumTile.data());
 //            reducesum_row_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
             oldSumTile = SumTile;
-            TCOPYOUT(gO, OutTile);            
+            TSTORE(gO, OutTile);
         }
 //        printf("end for%d\n",j);
         //for row corner
         if constexpr (rmd_N > 0){
             auto gI = gIIter(j, Nb);
             auto gO = gOIter(j, Nb);
-            TCOPYIN(dataTile_row, gI);
-            cumsum_row_kernel<tile_shapeData_row, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), OutTile_row.data(), dataTile_row.data(), oldSumTile.data());            
+            TLOAD(dataTile_row, gI);
+            cumsum_row_kernel<tile_shapeData_row, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), OutTile_row.data(), dataTile_row.data(), oldSumTile.data());
 //            reducesum_row_kernel<tile_shapeData_row, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile_row.data(), oldSumTile.data());
             oldSumTile = SumTile;
-            TCOPYOUT(gO, OutTile_row);            
+            TSTORE(gO, OutTile_row);
         }
     }
     //for col cor
     if constexpr (rmd_M > 0){
         TEXPANDSCALAR(oldSumTile_col, 0);//初始化为0
-        //初始化old_sum的tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(Mb, i);   
-            auto gO = gOIter(Mb, i);   
-            TCOPYIN(dataTile_col, gI);                  
+            auto gI = gIIter(Mb, i);
+            auto gO = gOIter(Mb, i);
+            TLOAD(dataTile_col, gI);
             cumsum_row_kernel<tile_shapeData_col, tile_shapeSum_col><<<1, tile_shapeSum_col::ValidRow, 1>>>(SumTile_col.data(), OutTile_col.data(), dataTile_col.data(), oldSumTile_col.data());
             oldSumTile_col = SumTile_col;
-            TCOPYOUT(gO, OutTile_col);
+            TSTORE(gO, OutTile_col);
         }
         if constexpr (rmd_N > 0){
             auto gI = gIIter(Mb, Nb);
             auto gO = gOIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);             
+            TLOAD(dataTile_cor, gI);
             cumsum_row_kernel<tile_shapeData_cor, tile_shapeSum_col><<<1, tile_shapeSum_col::ValidRow, 1>>>(SumTile_col.data(), OutTile_cor.data(), dataTile_cor.data(), oldSumTile_col.data());
             oldSumTile_col = SumTile_col;
-            TCOPYOUT(gO, OutTile_cor);
-        }        
+            TSTORE(gO, OutTile_cor);
+        }
     }
 /*
     for(int i = 0; i < gIM; i++){
diff --git a/kernels/reduction/reducemax_colvec.hpp b/kernels/reduction/reducemax_colvec.hpp
index 502db96..4a87fd8 100644
--- a/kernels/reduction/reducemax_colvec.hpp
+++ b/kernels/reduction/reducemax_colvec.hpp
@@ -19,30 +19,30 @@ template<typename tileSrc, typename tileMax>
 void __vec__ reducemax_col_kernel(
     typename tileMax::TileDType __out__ new_max,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileMax::TileDType __in__ old_max    
+    const typename tileMax::TileDType __in__ old_max
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);   
+    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
 
     typename tileMax::DType upd_max = old_max_ptr[i];
-/*    
+/*
     for(size_t j=0;j<tileSrc::ValidRow;j+=4){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
         size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
-        typename tileMax::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename tileMax::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileMax::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileMax::DType sum_0123 = sum_01 + sum_23; 
-        upd_sum = upd_sum + sum_0123;              
+        typename tileMax::DType sum_0123 = sum_01 + sum_23;
+        upd_sum = upd_sum + sum_0123;
     }
 */
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
@@ -51,18 +51,18 @@ void __vec__ reducemax_col_kernel(
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);
         typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);
-        typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);    
-        typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);        
-        typename tileMax::DType max_0123 = blkv_max(max_01, max_23); 
+        typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);
+        typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);
+        typename tileMax::DType max_0123 = blkv_max(max_01, max_23);
         typename tileMax::DType max_4567 = blkv_max(max_45, max_67);
-        typename tileMax::DType max_tmp = blkv_max(max_0123, max_4567);         
-        upd_max = blkv_max(upd_max, max_tmp);              
+        typename tileMax::DType max_tmp = blkv_max(max_0123, max_4567);
+        upd_max = blkv_max(upd_max, max_tmp);
     }
 
-    new_max_ptr[i] = upd_max;    
+    new_max_ptr[i] = upd_max;
 }
 
 
@@ -70,58 +70,58 @@ void __vec__ reducemax_col_kernel(
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducemax_col_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
 
 
 
-    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeMax MaxTile;
     tile_shapeMax oldMaxTile;
 
     tile_shapeData_row dataTile_row;
-    tile_shapeData_cor dataTile_cor;    
+    tile_shapeData_cor dataTile_cor;
     tile_shapeMax_row MaxTile_row;
-    tile_shapeMax_row oldMaxTile_row;    
+    tile_shapeMax_row oldMaxTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeMax>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeMax>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -130,41 +130,41 @@ void reducemax_col_rand(
 //        auto gZero = gZeroIter(0, j);
         auto gO = gOIter(0, j);
         TEXPANDSCALAR(oldMaxTile, 0);//初始化为0
-//        TCOPYIN(oldSumTile, gZero);//初始化为0
-        //初始化old_sum的tile      
-        //need 
+//        TLOAD(oldSumTile, gZero);// initialize to 0
+        // initialize the old_sum tile
+        //need
         for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, j);
-            TCOPYIN(dataTile, gI);
+            TLOAD(dataTile, gI);
             reducemax_col_kernel<tile_shapeData, tile_shapeMax><<<tile_shapeMax::ValidCol, tile_shapeMax::ValidRow, 1>>>(MaxTile.data(), dataTile.data(), oldMaxTile.data());
             oldMaxTile = MaxTile;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, j);
-            TCOPYIN(dataTile_col, gI);
+            TLOAD(dataTile_col, gI);
             reducemax_col_kernel<tile_shapeData_col,tile_shapeMax><<<tile_shapeMax::ValidCol, tile_shapeMax::ValidRow, 1>>>(MaxTile.data(), dataTile_col.data(), oldMaxTile.data());
             oldMaxTile = MaxTile;
         }
-        TCOPYOUT(gO, MaxTile);
+        TSTORE(gO, MaxTile);
     }
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
         auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldMaxTile_row, 0);//初始化为0        
-//        TCOPYIN(oldSumTile_row, gZero);//初始化为0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldMaxTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             reducemax_col_kernel<tile_shapeData_row,tile_shapeMax_row><<<tile_shapeMax_row::ValidCol, tile_shapeMax_row::ValidRow, 1>>>(MaxTile_row.data(), dataTile_row.data(), oldMaxTile_row.data());
             oldMaxTile_row = MaxTile_row;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             reducemax_col_kernel<tile_shapeData_cor,tile_shapeMax_row><<<tile_shapeMax_row::ValidCol, tile_shapeMax_row::ValidRow, 1>>>(MaxTile_row.data(), dataTile_cor.data(), oldMaxTile_row.data());
             oldMaxTile_row = MaxTile_row;
         }
-        TCOPYOUT(gO, MaxTile_row);
+        TSTORE(gO, MaxTile_row);
     }
 }
 
diff --git a/kernels/reduction/reducemax_colvec_single.hpp b/kernels/reduction/reducemax_colvec_single.hpp
index 02806e6..7eea81e 100644
--- a/kernels/reduction/reducemax_colvec_single.hpp
+++ b/kernels/reduction/reducemax_colvec_single.hpp
@@ -19,30 +19,30 @@ template<typename tileSrc, typename tileMax>
 void __vec__ reducemax_col_kernel(
     typename tileMax::TileDType __out__ new_max,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileMax::TileDType __in__ old_max    
+    const typename tileMax::TileDType __in__ old_max
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);            
+    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
 
     typename tileMax::DType upd_max = old_max_ptr[i];
-/*    
+/*
     for(size_t j=0;j<tileSrc::ValidRow;j+=4){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
         size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
-        typename tileMax::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename tileMax::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileMax::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileMax::DType sum_0123 = sum_01 + sum_23; 
-        upd_sum = upd_sum + sum_0123;              
+        typename tileMax::DType sum_0123 = sum_01 + sum_23;
+        upd_sum = upd_sum + sum_0123;
     }
 */
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
@@ -51,15 +51,15 @@ void __vec__ reducemax_col_kernel(
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);
         typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);
-        typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);    
-        typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);        
-        typename tileMax::DType max_0123 = blkv_max(max_01, max_23); 
+        typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);
+        typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);
+        typename tileMax::DType max_0123 = blkv_max(max_01, max_23);
         typename tileMax::DType max_4567 = blkv_max(max_45, max_67);
-        typename tileMax::DType max_tmp = blkv_max(max_0123, max_4567);         
-        upd_max = blkv_max(upd_max, max_tmp);              
+        typename tileMax::DType max_tmp = blkv_max(max_0123, max_4567);
+        upd_max = blkv_max(upd_max, max_tmp);
     }
 
 /*
@@ -67,10 +67,10 @@ void __vec__ reducemax_col_kernel(
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         upd_max = blkv_max(upd_max, src_ptr[src_idx]);
-//        upd_max = upd_sum + src_ptr[src_idx];              
+//        upd_max = upd_sum + src_ptr[src_idx];
     }
-*/    
-    new_max_ptr[i] = upd_max;    
+*/
+    new_max_ptr[i] = upd_max;
 }
 
 
@@ -78,58 +78,58 @@ void __vec__ reducemax_col_kernel(
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducemax_col_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo: how should the tail block be handled?
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
 
 
 
-    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeMax MaxTile;
     tile_shapeMax oldMaxTile;
 
     tile_shapeData_row dataTile_row;
-    tile_shapeData_cor dataTile_cor;    
+    tile_shapeData_cor dataTile_cor;
     tile_shapeMax_row MaxTile_row;
-    tile_shapeMax_row oldMaxTile_row;    
+    tile_shapeMax_row oldMaxTile_row;
 
 //    int base = 0;// todo: generate a scalar
 //    int all_num = gOM; // total number of elements
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeMax>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeMax>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -138,42 +138,42 @@ void reducemax_col_rand(
 //        auto gZero = gZeroIter(0, j);
     auto gO = gOIter(0, 0);
     TEXPANDSCALAR(oldMaxTile, 0);// initialize to 0
-//        TCOPYIN(oldSumTile, gZero);// initialize to 0
-    // initialize the old_sum tile      
-    //need 
+//        TLOAD(oldSumTile, gZero);// initialize to 0
+    // initialize the old_sum tile
+    //need
     for (int i = 0; i < Mb; ++i) {
         auto gI = gIIter(i, 0);
-        TCOPYIN(dataTile, gI);
+        TLOAD(dataTile, gI);
         reducemax_col_kernel<tile_shapeData, tile_shapeMax><<<tile_shapeMax::ValidCol, tile_shapeMax::ValidRow, 1>>>(MaxTile.data(), dataTile.data(), oldMaxTile.data());
         oldMaxTile = MaxTile;
     }
-    if constexpr (rmd_M > 0){   
+    if constexpr (rmd_M > 0){
         auto gI = gIIter(Mb, 0);
-        TCOPYIN(dataTile_col, gI);
+        TLOAD(dataTile_col, gI);
         reducemax_col_kernel<tile_shapeData_col,tile_shapeMax><<<tile_shapeMax::ValidCol, tile_shapeMax::ValidRow, 1>>>(MaxTile.data(), dataTile_col.data(), oldMaxTile.data());
         oldMaxTile = MaxTile;
     }
-    TCOPYOUT(gO, MaxTile);
+    TSTORE(gO, MaxTile);
 //    }
 /*
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
         auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldMaxTile_row, 0);// initialize to 0        
-//        TCOPYIN(oldSumTile_row, gZero);// initialize to 0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldMaxTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             reducemax_col_kernel<tile_shapeData_row,tile_shapeMax_row><<<tile_shapeMax_row::ValidCol, tile_shapeMax_row::ValidRow, 1>>>(MaxTile_row.data(), dataTile_row.data(), oldMaxTile_row.data());
             oldMaxTile_row = MaxTile_row;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             reducemax_col_kernel<tile_shapeData_cor,tile_shapeMax_row><<<tile_shapeMax_row::ValidCol, tile_shapeMax_row::ValidRow, 1>>>(MaxTile_row.data(), dataTile_cor.data(), oldMaxTile_row.data());
             oldMaxTile_row = MaxTile_row;
         }
-        TCOPYOUT(gO, MaxTile_row);
+        TSTORE(gO, MaxTile_row);
     }
 */
 }
diff --git a/kernels/reduction/reducemax_colvec_single_8192.hpp b/kernels/reduction/reducemax_colvec_single_8192.hpp
index 0688726..eb25d6c 100644
--- a/kernels/reduction/reducemax_colvec_single_8192.hpp
+++ b/kernels/reduction/reducemax_colvec_single_8192.hpp
@@ -20,39 +20,39 @@ void __vec__ reducemax_col_kernel(
     typename tileTmpMax::TileDType __out__ new_max,
     const typename tileSrc::TileDType __in__ src,
     const typename tileTmpMax::TileDType __in__ old_max,
-    const size_t tile_idx  
+    const size_t tile_idx
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileTmpMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);    
+    __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpMax::ValidRow;j++){
-        size_t old_max_idx =  i * tileTmpMax::ColStride + j * tileTmpMax::RowStride;       
-        new_max_ptr[old_max_idx] = old_max_ptr[old_max_idx];          
+        size_t old_max_idx =  i * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
+        new_max_ptr[old_max_idx] = old_max_ptr[old_max_idx];
     }
-    
-    #pragma clang loop unroll(full) 
+
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0) * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
-        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;        
+        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename  tileSrc::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename  tileSrc::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);
         typename  tileSrc::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);
-        typename  tileSrc::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);    
-        typename  tileSrc::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);        
-        typename  tileSrc::DType max_0123 = blkv_max(max_01, max_23); 
+        typename  tileSrc::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);
+        typename  tileSrc::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);
+        typename  tileSrc::DType max_0123 = blkv_max(max_01, max_23);
         typename  tileSrc::DType max_4567 = blkv_max(max_45, max_67);
-        typename  tileSrc::DType max_all = blkv_max(max_0123, max_4567);   
-        src_ptr[src_idx_0] = max_all;          
+        typename  tileSrc::DType max_all = blkv_max(max_0123, max_4567);
+        src_ptr[src_idx_0] = max_all;
     }
 
     #pragma clang loop unroll(full)
@@ -60,17 +60,17 @@ void __vec__ reducemax_col_kernel(
         size_t tmp_idx_0 =  i * tileSrc::ColStride + (j + 0*8) * tileSrc::RowStride;
         size_t tmp_idx_1 =  i * tileSrc::ColStride + (j + 1*8) * tileSrc::RowStride;
         size_t tmp_idx_2 =  i * tileSrc::ColStride + (j + 2*8) * tileSrc::RowStride;
-        size_t tmp_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;        
+        size_t tmp_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;
         size_t tmp_idx_4 =  i * tileSrc::ColStride + (j + 4*8) * tileSrc::RowStride;
         size_t tmp_idx_5 =  i * tileSrc::ColStride + (j + 5*8) * tileSrc::RowStride;
         size_t tmp_idx_6 =  i * tileSrc::ColStride + (j + 6*8) * tileSrc::RowStride;
-        size_t tmp_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;  
+        size_t tmp_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;
         typename tileSrc::DType tmp_max_01 = blkv_max(src_ptr[tmp_idx_0], src_ptr[tmp_idx_1]);
-        typename tileSrc::DType tmp_max_23 = blkv_max(src_ptr[tmp_idx_2], src_ptr[tmp_idx_3]); 
-        typename tileSrc::DType tmp_max_45 = blkv_max(src_ptr[tmp_idx_4], src_ptr[tmp_idx_5]); 
-        typename tileSrc::DType tmp_max_67 = blkv_max(src_ptr[tmp_idx_6], src_ptr[tmp_idx_7]);  
-        typename tileSrc::DType tmp_max_0123 = blkv_max(tmp_max_01, tmp_max_23); 
-        typename tileSrc::DType tmp_max_4567 = blkv_max(tmp_max_45, tmp_max_67); 
+        typename tileSrc::DType tmp_max_23 = blkv_max(src_ptr[tmp_idx_2], src_ptr[tmp_idx_3]);
+        typename tileSrc::DType tmp_max_45 = blkv_max(src_ptr[tmp_idx_4], src_ptr[tmp_idx_5]);
+        typename tileSrc::DType tmp_max_67 = blkv_max(src_ptr[tmp_idx_6], src_ptr[tmp_idx_7]);
+        typename tileSrc::DType tmp_max_0123 = blkv_max(tmp_max_01, tmp_max_23);
+        typename tileSrc::DType tmp_max_4567 = blkv_max(tmp_max_45, tmp_max_67);
         typename tileSrc::DType tmp_max_all = blkv_max(tmp_max_0123, tmp_max_4567);
         src_ptr[tmp_idx_0] = tmp_max_all;
     };
@@ -80,29 +80,29 @@ void __vec__ reducemax_col_kernel(
     size_t tmp_idx_l2_0 =  i * tileSrc::ColStride + 0*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_1 =  i * tileSrc::ColStride + 1*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_2 =  i * tileSrc::ColStride + 2*64 * tileSrc::RowStride;
-    size_t tmp_idx_l2_3 =  i * tileSrc::ColStride + 3*64 * tileSrc::RowStride;        
+    size_t tmp_idx_l2_3 =  i * tileSrc::ColStride + 3*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_4 =  i * tileSrc::ColStride + 4*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_5 =  i * tileSrc::ColStride + 5*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_6 =  i * tileSrc::ColStride + 6*64 * tileSrc::RowStride;
-    size_t tmp_idx_l2_7 =  i * tileSrc::ColStride + 7*64 * tileSrc::RowStride;      
+    size_t tmp_idx_l2_7 =  i * tileSrc::ColStride + 7*64 * tileSrc::RowStride;
     typename tileTmpMax::DType tmp_max_l2_01 = blkv_max(src_ptr[tmp_idx_l2_0], src_ptr[tmp_idx_l2_1]);
-    typename tileTmpMax::DType tmp_max_l2_23 = blkv_max(src_ptr[tmp_idx_l2_2], src_ptr[tmp_idx_l2_3]);   
+    typename tileTmpMax::DType tmp_max_l2_23 = blkv_max(src_ptr[tmp_idx_l2_2], src_ptr[tmp_idx_l2_3]);
     typename tileTmpMax::DType tmp_max_l2_45 = blkv_max(src_ptr[tmp_idx_l2_4], src_ptr[tmp_idx_l2_5]);
-    typename tileTmpMax::DType tmp_max_l2_67 = blkv_max(src_ptr[tmp_idx_l2_6], src_ptr[tmp_idx_l2_7]);  
-    typename tileTmpMax::DType tmp_max_l2_0123 = blkv_max(tmp_max_l2_01, tmp_max_l2_23); 
-    typename tileTmpMax::DType tmp_max_l2_4567 = blkv_max(tmp_max_l2_45, tmp_max_l2_67); 
-    typename tileTmpMax::DType tmp_max_l2_all = blkv_max(tmp_max_l2_0123, tmp_max_l2_4567);          
+    typename tileTmpMax::DType tmp_max_l2_67 = blkv_max(src_ptr[tmp_idx_l2_6], src_ptr[tmp_idx_l2_7]);
+    typename tileTmpMax::DType tmp_max_l2_0123 = blkv_max(tmp_max_l2_01, tmp_max_l2_23);
+    typename tileTmpMax::DType tmp_max_l2_4567 = blkv_max(tmp_max_l2_45, tmp_max_l2_67);
+    typename tileTmpMax::DType tmp_max_l2_all = blkv_max(tmp_max_l2_0123, tmp_max_l2_4567);
 
 /*
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_max = upd_max + src_ptr[src_idx];              
+        upd_max = upd_max + src_ptr[src_idx];
     }
 */
-//    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);        
-//    new_max_ptr[i] = tmp_max_l2_all + old_max_ptr[i];  
-//    new_max_ptr[i] = tmp_max_l2_all;  
+//    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
+//    new_max_ptr[i] = tmp_max_l2_all + old_max_ptr[i];
+//    new_max_ptr[i] = tmp_max_l2_all;
 
     size_t  max_tile_idx = i * tileTmpMax::ColStride + tile_idx * tileTmpMax::RowStride;
     new_max_ptr[max_tile_idx] = tmp_max_l2_all;
@@ -117,25 +117,25 @@ void __vec__ reducemax_col_final_kernel(
     __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileTmpMax::DType *tmp_max_ptr = blkv_get_tile_ptr(tmp_max);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpMax::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileTmpMax::ColStride + (j + 0) * tileTmpMax::RowStride;
         size_t src_idx_1 =  i * tileTmpMax::ColStride + (j + 1) * tileTmpMax::RowStride;
         size_t src_idx_2 =  i * tileTmpMax::ColStride + (j + 2) * tileTmpMax::RowStride;
-        size_t src_idx_3 =  i * tileTmpMax::ColStride + (j + 3) * tileTmpMax::RowStride;        
+        size_t src_idx_3 =  i * tileTmpMax::ColStride + (j + 3) * tileTmpMax::RowStride;
         size_t src_idx_4 =  i * tileTmpMax::ColStride + (j + 4) * tileTmpMax::RowStride;
         size_t src_idx_5 =  i * tileTmpMax::ColStride + (j + 5) * tileTmpMax::RowStride;
         size_t src_idx_6 =  i * tileTmpMax::ColStride + (j + 6) * tileTmpMax::RowStride;
-        size_t src_idx_7 =  i * tileTmpMax::ColStride + (j + 7) * tileTmpMax::RowStride;        
-        typename  tileTmpMax::DType max_01 = blkv_max(tmp_max_ptr[src_idx_0], tmp_max_ptr[src_idx_1]);    
-        typename  tileTmpMax::DType max_23 = blkv_max(tmp_max_ptr[src_idx_2], tmp_max_ptr[src_idx_3]); 
-        typename  tileTmpMax::DType max_45 = blkv_max(tmp_max_ptr[src_idx_4], tmp_max_ptr[src_idx_5]);    
-        typename  tileTmpMax::DType max_67 = blkv_max(tmp_max_ptr[src_idx_6], tmp_max_ptr[src_idx_7]);        
-        typename  tileTmpMax::DType max_0123 = blkv_max(max_01, max_23); 
+        size_t src_idx_7 =  i * tileTmpMax::ColStride + (j + 7) * tileTmpMax::RowStride;
+        typename  tileTmpMax::DType max_01 = blkv_max(tmp_max_ptr[src_idx_0], tmp_max_ptr[src_idx_1]);
+        typename  tileTmpMax::DType max_23 = blkv_max(tmp_max_ptr[src_idx_2], tmp_max_ptr[src_idx_3]);
+        typename  tileTmpMax::DType max_45 = blkv_max(tmp_max_ptr[src_idx_4], tmp_max_ptr[src_idx_5]);
+        typename  tileTmpMax::DType max_67 = blkv_max(tmp_max_ptr[src_idx_6], tmp_max_ptr[src_idx_7]);
+        typename  tileTmpMax::DType max_0123 = blkv_max(max_01, max_23);
         typename  tileTmpMax::DType max_4567 = blkv_max(max_45, max_67);
-        typename  tileTmpMax::DType max_all = blkv_max(max_0123, max_4567);   
-        tmp_max_ptr[src_idx_0] = max_all;          
-    }   
+        typename  tileTmpMax::DType max_all = blkv_max(max_0123, max_4567);
+        tmp_max_ptr[src_idx_0] = max_all;
+    }
 
     size_t max_idx_0 = i * tileTmpMax::ColStride + 0*8 * tileTmpMax::RowStride;
     size_t max_idx_1 = i * tileTmpMax::ColStride + 1*8 * tileTmpMax::RowStride;
@@ -145,46 +145,46 @@ void __vec__ reducemax_col_final_kernel(
 
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducemax_col_rand(
-    dtype *in_ptr,  
+    dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo: how should the tail block be handled?
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //   
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
-    using tile_shapeTmpMax = Tile<Location::Vec, dtype, 16, tN, BLayout::RowMajor>; //      
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
+    using tile_shapeTmpMax = Tile<Location::Vec, dtype, 16, tN, BLayout::RowMajor>; //
 
 
-//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-//    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+//    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-    gm_shapeOut outGm(out_ptr); 
+    gm_shapeIn inGm(in_ptr);
+    gm_shapeOut outGm(out_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeMax MaxTile;
     tile_shapeTmpMax oldtmpMaxTile;
     tile_shapeTmpMax tmpMaxTile;
 //    tile_shapeTmpMax_l2 tmpMaxTile_l2;
 
 //    tile_shapeData_row dataTile_row;
-//    tile_shapeData_cor dataTile_cor;    
+//    tile_shapeData_cor dataTile_cor;
 //    tile_shapeMax_row MaxTile_row;
-//    tile_shapeMax_row oldMaxTile_row;    
+//    tile_shapeMax_row oldMaxTile_row;
 
 //    int base = 0;// todo: generate a scalar
 //    int all_num = gOM; // total number of elements
@@ -192,7 +192,7 @@ void reducemax_col_rand(
     using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeMax>;
 
-    itIn  gIIter(in_ptr);  
+    itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -202,19 +202,19 @@ void reducemax_col_rand(
     auto gO = gOIter(0, 0);
     TEXPANDSCALAR(oldtmpMaxTile, 0);// initialize to 0
 //    TEXPANDSCALAR(tmpMaxTile, 0);// initialize to 0
-//    TEXPANDSCALAR(tmpMaxTile_l2, 0);// initialize to 0        
+//    TEXPANDSCALAR(tmpMaxTile_l2, 0);// initialize to 0
     for (size_t i = 0; i < Mb; ++i){
         auto gI = gIIter(i, 0);
-        TCOPYIN(dataTile, gI);
-        reducemax_col_kernel<tile_shapeData, tile_shapeTmpMax><<<tile_shapeTmpMax::ValidCol, 1, 1>>>(tmpMaxTile.data(), 
+        TLOAD(dataTile, gI);
+        reducemax_col_kernel<tile_shapeData, tile_shapeTmpMax><<<tile_shapeTmpMax::ValidCol, 1, 1>>>(tmpMaxTile.data(),
                                                                                                      dataTile.data(),
-                                                                                                     oldtmpMaxTile.data(), 
+                                                                                                     oldtmpMaxTile.data(),
                                                                                                      i);
         oldtmpMaxTile = tmpMaxTile;
     }
-    reducemax_col_final_kernel<tile_shapeTmpMax, tile_shapeMax><<<tile_shapeMax::ValidCol, 1, 1>>>(MaxTile.data(), 
+    reducemax_col_final_kernel<tile_shapeTmpMax, tile_shapeMax><<<tile_shapeMax::ValidCol, 1, 1>>>(MaxTile.data(),
                                                                                                    tmpMaxTile.data());
-    TCOPYOUT(gO, MaxTile);
+    TSTORE(gO, MaxTile);
 }
 
 
diff --git a/kernels/reduction/reducemax_colvec_unalign_120_8.hpp b/kernels/reduction/reducemax_colvec_unalign_120_8.hpp
index adab096..4aa0ab1 100644
--- a/kernels/reduction/reducemax_colvec_unalign_120_8.hpp
+++ b/kernels/reduction/reducemax_colvec_unalign_120_8.hpp
@@ -18,18 +18,18 @@ using namespace pto;
 template<typename tileSrc, typename tileTmp>
 void __vec__ reducemax_col_tmp(
     typename tileTmp::TileDType __out__ tmp_max,
-    const typename tileSrc::TileDType __in__ src    
+    const typename tileSrc::TileDType __in__ src
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileTmp::DType *tmp_max_ptr = blkv_get_tile_ptr(tmp_max);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-//    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);   
+//    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
     typename tileTmp::DType upd_tmp_max = 0;
-   
-    #pragma clang loop unroll(full) 
+
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::Rows;j+=8){// non-valid rows are zero-padded and also take part, so the rows form full groups for the 8-way tree reduction
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
@@ -38,57 +38,57 @@ void __vec__ reducemax_col_tmp(
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename tileTmp::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename tileTmp::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);
         typename tileTmp::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);
-        typename tileTmp::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);    
-        typename tileTmp::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);        
-        typename tileTmp::DType max_0123 = blkv_max(max_01, max_23); 
+        typename tileTmp::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);
+        typename tileTmp::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);
+        typename tileTmp::DType max_0123 = blkv_max(max_01, max_23);
         typename tileTmp::DType max_4567 = blkv_max(max_45, max_67);
-        typename tileTmp::DType max_tmp = blkv_max(max_0123, max_4567);         
-        upd_tmp_max = blkv_max(upd_tmp_max, max_tmp);              
+        typename tileTmp::DType max_tmp = blkv_max(max_0123, max_4567);
+        upd_tmp_max = blkv_max(upd_tmp_max, max_tmp);
     }
 
-    tmp_max_ptr[i] = upd_tmp_max;   
+    tmp_max_ptr[i] = upd_tmp_max;
 }
 
 template<typename tileTmp, typename tileMax>
 void __vec__ reducemax_col_final(
     typename tileMax::TileDType __out__ new_max,
-    const typename tileTmp::TileDType __in__ src, 
-    const typename tileMax::TileDType __in__ old_max   
+    const typename tileTmp::TileDType __in__ src,
+    const typename tileMax::TileDType __in__ old_max
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileTmp::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);   
+    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
 
     typename tileMax::DType upd_max = old_max_ptr[i];
-   
+
 
     size_t src_idx_0 =  i * tileMax::ColStride + 0 * tileMax::ValidCol;
     size_t src_idx_1 =  i * tileMax::ColStride + 1 * tileMax::ValidCol;
     size_t src_idx_2 =  i * tileMax::ColStride + 2 * tileMax::ValidCol;
-    size_t src_idx_3 =  i * tileMax::ColStride + 3 * tileMax::ValidCol;  
+    size_t src_idx_3 =  i * tileMax::ColStride + 3 * tileMax::ValidCol;
     size_t src_idx_4 =  i * tileMax::ColStride + 4 * tileMax::ValidCol;
     size_t src_idx_5 =  i * tileMax::ColStride + 5 * tileMax::ValidCol;
     size_t src_idx_6 =  i * tileMax::ColStride + 6 * tileMax::ValidCol;
-    size_t src_idx_7 =  i * tileMax::ColStride + 7 * tileMax::ValidCol;       
-    typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);    
-    typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);    
-    typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);    
-    typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);        
-    typename tileMax::DType max_0123 = blkv_max(max_01, max_23); 
-    typename tileMax::DType max_4567 = blkv_max(max_45, max_67);      
-    typename tileMax::DType max_all = blkv_max(max_0123, max_4567); 
-              
-//        upd_max = upd_max + max_tmp;              
-
-
-    new_max_ptr[i] = blkv_max(max_all, upd_max);   
+    size_t src_idx_7 =  i * tileMax::ColStride + 7 * tileMax::ValidCol;
+    typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);
+    typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);
+    typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);
+    typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);
+    typename tileMax::DType max_0123 = blkv_max(max_01, max_23);
+    typename tileMax::DType max_4567 = blkv_max(max_45, max_67);
+    typename tileMax::DType max_all = blkv_max(max_0123, max_4567);
+
+//        upd_max = upd_max + max_tmp;
+
+
+    new_max_ptr[i] = blkv_max(max_all, upd_max);
 }
 
 
@@ -98,70 +98,70 @@ void __vec__ reducemax_col_final(
 template<typename dtype, int gIM, int gIN, int tM, int tN, int tM_VLD>
 void reducemax_col_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
-//    const int Mb = (gIM/8) / tM;  
+//    const int Mb = (gIM/8) / tM;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo: how should the tail block be handled?
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM/8, gIN*8>>;     // 
-//    using gm_shapeMax = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM/8, gIN*8>>;     //
+//    using gm_shapeMax = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM/8, tN*8, BLayout::RowMajor, tM_VLD/8, tN*8>; //
 //    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
-    using tile_shapeTmp = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor>; //      
-    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor, 1, tN>; // 
+    using tile_shapeTmp = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor>; //
+    using tile_shapeMax = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor, 1, tN>; //
 
 
 
-//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-//    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+//    using tile_shapeMax_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeMax olcMaxGm(old_max_ptr);    
+//    gm_shapeMax olcMaxGm(old_max_ptr);
 
     tile_shapeData dataTile;
 //    tile_shapeData_col dataTile_col;
-    tile_shapeTmp TmpTile;        
+    tile_shapeTmp TmpTile;
     tile_shapeMax MaxTile;
     tile_shapeMax oldMaxTile;
 
 //    tile_shapeData_row dataTile_row;
-//    tile_shapeData_cor dataTile_cor;    
+//    tile_shapeData_cor dataTile_cor;
 //    tile_shapeMax_row MaxTile_row;
-//    tile_shapeMax_row oldMaxTile_row;    
+//    tile_shapeMax_row oldMaxTile_row;
 
 //    int base = 0;// todo: generate a scalar
 //    int all_num = gOM; // total number of elements
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeMax>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeMax>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 
     auto gO = gOIter(0, 0);
     TEXPANDSCALAR(oldMaxTile, 0);// initialize to 0
     auto gI = gIIter(0, 0);
-    TCOPYIN(dataTile, gI);// TLOAD with zero padding
+    TLOAD(dataTile, gI);// TLOAD with zero padding
     reducemax_col_tmp<tile_shapeData, tile_shapeTmp><<<tile_shapeTmp::ValidCol, tile_shapeTmp::ValidRow, 1>>>(TmpTile.data(), dataTile.data());
     reducemax_col_final<tile_shapeTmp, tile_shapeMax><<<tile_shapeMax::ValidCol, tile_shapeMax::ValidRow, 1>>>(MaxTile.data(), TmpTile.data(), oldMaxTile.data());
     oldMaxTile = MaxTile;
-    TCOPYOUT(gO, MaxTile);
+    TSTORE(gO, MaxTile);
 }
 
 #endif
diff --git a/kernels/reduction/reducemax_rowvec.hpp b/kernels/reduction/reducemax_rowvec.hpp
index 6b95d70..57830c3 100644
--- a/kernels/reduction/reducemax_rowvec.hpp
+++ b/kernels/reduction/reducemax_rowvec.hpp
@@ -18,17 +18,17 @@ template<typename tileSrc, typename tileMax>
 void __vec__ reducemax_row_kernel(
     typename tileMax::TileDType __out__ new_max,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileMax::TileDType __in__ old_max    
+    const typename tileMax::TileDType __in__ old_max
 )
 {
-//    size_t i = blkv_get_index_x();  
-    size_t j = blkv_get_index_x();  
+//    size_t i = blkv_get_index_x();
+    size_t j = blkv_get_index_x();
 //    size_t j = blkv_get_index_y();
-    size_t idx = j * tileMax::RowStride;    
+    size_t idx = j * tileMax::RowStride;
 
     __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);   
+    __vbuf__ typename tileMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
 
     typename tileMax::DType upd_max = old_max_ptr[idx];
@@ -39,34 +39,34 @@ void __vec__ reducemax_row_kernel(
         size_t src_idx0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx1 =  (i+1) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx2 =  (i+2) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx3 =  (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;        
+        size_t src_idx3 =  (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx4 =  (i+4) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx5 =  (i+5) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx6 =  (i+6) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx7 =  (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; 
+        size_t src_idx7 =  (i+7) * tileSrc::ColStride + j * tileSrc::RowStride;
 
         typename tileMax::DType max_01 = blkv_max(src_ptr[src_idx0], src_ptr[src_idx1]);
-        typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx2], src_ptr[src_idx3]);   
-        typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx4], src_ptr[src_idx5]);  
-        typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx6], src_ptr[src_idx7]);    
+        typename tileMax::DType max_23 = blkv_max(src_ptr[src_idx2], src_ptr[src_idx3]);
+        typename tileMax::DType max_45 = blkv_max(src_ptr[src_idx4], src_ptr[src_idx5]);
+        typename tileMax::DType max_67 = blkv_max(src_ptr[src_idx6], src_ptr[src_idx7]);
 
         typename tileMax::DType max_0123 = blkv_max(max_01, max_23);
-        typename tileMax::DType max_4567 = blkv_max(max_45, max_67);        
+        typename tileMax::DType max_4567 = blkv_max(max_45, max_67);
 
         typename tileMax::DType max_tmp = blkv_max(max_0123, max_4567);
 
-        upd_max = blkv_max(upd_max, max_tmp);              
-    }        
+        upd_max = blkv_max(upd_max, max_tmp);
+    }
 
 
 /*
     #pragma clang loop unroll(full)
     for(size_t i=0;i<tileSrc::ValidCol;i++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_max = blkv_max(upd_max, src_ptr[src_idx]);              
+        upd_max = blkv_max(upd_max, src_ptr[src_idx]);
     }
-*/    
-    new_max_ptr[idx] = upd_max;    
+*/
+    new_max_ptr[idx] = upd_max;
 }
 
 
@@ -75,63 +75,63 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void reducemax_row_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // todo 尾块怎么处理？
-    const int rmd_N = gIN % tN; // todo 尾块怎么处理？    
+    const int rmd_N = gIN % tN; // todo: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //将gm中的Tensor先声明为一维数据 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in gm as one-dimensional data
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, 1>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeMax = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // todo 这里的location，一定要是Vec吗？哪怕没有传入Vec
 
 
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // todo 尾块怎么处理？是否要作为参数写在这   
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // todo: how should the tail block be handled? Should it be written here as a parameter?
     using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; // todo 尾块怎么处理？是否要作为参数写在这
-    using tile_shapeMax_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;      
+    using tile_shapeMax_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
-    tile_shapeData dataTile;                
+    tile_shapeData dataTile;
     tile_shapeData_row dataTile_row;
     tile_shapeData_col dataTile_col;
-    tile_shapeData_cor dataTile_cor;    
-    
+    tile_shapeData_cor dataTile_cor;
+
     tile_shapeMax MaxTile;
     tile_shapeMax oldMaxTile;
     tile_shapeMax_col MaxTile_col;
-    tile_shapeMax_col oldMaxTile_col;    
+    tile_shapeMax_col oldMaxTile_col;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeMax>;
 
     itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    printf("tile_shapeSum::ValidCol = %d\n",  tile_shapeSum::ValidCol);
-//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);    
+//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);
 //    printf("before for\n");
     for (int j = 0; j < Mb; ++j) {
         auto gO = gOIter(j, 0);
         TEXPANDSCALAR(oldMaxTile, 0);//初始化为0
-        //初始化old_sum的tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(j, i);   
-//            printf("before copy in , %d\n", i);                
-            TCOPYIN(dataTile, gI);    
+            auto gI = gIIter(j, i);
+//            printf("before copy in , %d\n", i);
+            TLOAD(dataTile, gI);
             reducemax_row_kernel<tile_shapeData, tile_shapeMax><<<tile_shapeMax::ValidRow, 1, 1>>>(MaxTile.data(), dataTile.data(), oldMaxTile.data());
 //            reducesum_row_kernel<tile_shapeData, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
 //            printf("kernel , %d\n", i);
@@ -141,40 +141,40 @@ void reducemax_row_rand(
         //for row corner
         if constexpr (rmd_N > 0){
             auto gI = gIIter(j, Nb);
-            TCOPYIN(dataTile_row, gI);
-            reducemax_row_kernel<tile_shapeData_row, tile_shapeMax><<<tile_shapeMax::ValidRow, 1, 1>>>(MaxTile.data(), dataTile_row.data(), oldMaxTile.data());            
+            TLOAD(dataTile_row, gI);
+            reducemax_row_kernel<tile_shapeData_row, tile_shapeMax><<<tile_shapeMax::ValidRow, 1, 1>>>(MaxTile.data(), dataTile_row.data(), oldMaxTile.data());
 //            reducesum_row_kernel<tile_shapeData_row, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile_row.data(), oldSumTile.data());
             oldMaxTile = MaxTile;
         }
-//        printf("before tcopyout\n");        
-        TCOPYOUT(gO, MaxTile);
-//        printf("end tcopyout\n"); 
+//        printf("before tstore\n");
+        TSTORE(gO, MaxTile);
+//        printf("end tstore\n");
     }
     //for col cor
     if constexpr (rmd_M > 0){
         auto gO = gOIter(Mb, 0);
         TEXPANDSCALAR(oldMaxTile_col, 0);//初始化为0
-        //初始化old_sum的tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
             auto gI = gIIter(Mb, i);
-            TCOPYIN(dataTile_col, gI);
+            TLOAD(dataTile_col, gI);
             reducemax_row_kernel<tile_shapeData_col, tile_shapeMax_col><<<tile_shapeMax_col::ValidRow, 1, 1>>>(MaxTile_col.data(), dataTile_col.data(), oldMaxTile_col.data());
             oldMaxTile_col = MaxTile_col;
         }
         if constexpr (rmd_N > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);             
+            TLOAD(dataTile_cor, gI);
             reducemax_row_kernel<tile_shapeData_cor, tile_shapeMax_col><<<tile_shapeMax_col::ValidRow, 1, 1>>>(MaxTile_col.data(), dataTile_cor.data(), oldMaxTile_col.data());
             oldMaxTile_col = MaxTile_col;
         }
-        TCOPYOUT(gO, MaxTile_col);
+        TSTORE(gO, MaxTile_col);
     }
 /*
     for(int i = 0; i < gIM; i++){
         printf("out%d = %d\n", i, out_ptr[i]);
     }
 */
-//    printf("end program\n"); 
+//    printf("end program\n");
 }
 
 #endif
diff --git a/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp b/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp
index 8f066d2..18cde97 100644
--- a/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp
+++ b/kernels/reduction/reducemax_rowvec_single_tree_opt_2.hpp
@@ -19,26 +19,26 @@ void __vec__ reducemax_row_kernel(
     const typename tileSrc::TileDType __in__ src,
     const typename tileSrcCol::TileDType __in__ src_col,
     const typename tileTmpMax::TileDType __in__ old_max,
-    const size_t tile_idx    
+    const size_t tile_idx
 )
 {
 
-    size_t j = blkv_get_index_x();  
-    size_t z = blkv_get_index_y();     
+    size_t j = blkv_get_index_x();
+    size_t z = blkv_get_index_y();
     size_t stride_src = z * (tileSrc::ValidCol/4) * tileSrc::ColStride;
-    size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride;    
-  
+    size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride;
+
     __vbuf__ typename tileTmpMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col);    
-    __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);   
+    __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col);
+    __vbuf__ typename tileTmpMax::DType *old_max_ptr = blkv_get_tile_ptr(old_max);
 
 /*
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpMax::ValidCol/4;i++){
-        size_t old_max_idx =  z * tileTmpMax::ValidCol/4 * tileTmpMax::ColStride + i * tileTmpMax::ColStride + j * tileTmpMax::RowStride;       
-        new_max_ptr[old_max_idx] = old_max_ptr[old_max_idx];          
-    }    
+        size_t old_max_idx =  z * tileTmpMax::ValidCol/4 * tileTmpMax::ColStride + i * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
+        new_max_ptr[old_max_idx] = old_max_ptr[old_max_idx];
+    }
 */
 
     #pragma clang loop unroll(full)
@@ -46,25 +46,25 @@ void __vec__ reducemax_row_kernel(
         size_t src_idx_0 =  stride_src + (i+0) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  stride_src + (i+1) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_2 =  stride_src + (i+2) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx_3 =  stride_src + (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;        
+        size_t src_idx_3 =  stride_src + (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_4 =  stride_src + (i+4) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_5 =  stride_src + (i+5) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_6 =  stride_src + (i+6) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx_7 =  stride_src + (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; 
+        size_t src_idx_7 =  stride_src + (i+7) * tileSrc::ColStride + j * tileSrc::RowStride;
 
         typename tileSrc::DType max_01 = blkv_max(src_ptr[src_idx_0], src_ptr[src_idx_1]);
-        typename tileSrc::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);   
-        typename tileSrc::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);  
-        typename tileSrc::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);    
+        typename tileSrc::DType max_23 = blkv_max(src_ptr[src_idx_2], src_ptr[src_idx_3]);
+        typename tileSrc::DType max_45 = blkv_max(src_ptr[src_idx_4], src_ptr[src_idx_5]);
+        typename tileSrc::DType max_67 = blkv_max(src_ptr[src_idx_6], src_ptr[src_idx_7]);
 
         typename tileSrc::DType max_0123 = blkv_max(max_01, max_23);
-        typename tileSrc::DType max_4567 = blkv_max(max_45, max_67);        
+        typename tileSrc::DType max_4567 = blkv_max(max_45, max_67);
 
         typename tileSrc::DType max_all = blkv_max(max_0123, max_4567);
 
         size_t src_col_idx_0 = stride_src_col + (i/8) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-        src_col_ptr[src_col_idx_0] = max_all;         
-    }        
+        src_col_ptr[src_col_idx_0] = max_all;
+    }
 
 
     #pragma clang loop unroll(full)
@@ -72,17 +72,17 @@ void __vec__ reducemax_row_kernel(
         size_t tmp_idx_0 =  stride_src_col + (i+0) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_1 =  stride_src_col + (i+1) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_2 =  stride_src_col + (i+2) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-        size_t tmp_idx_3 =  stride_src_col + (i+3) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;        
+        size_t tmp_idx_3 =  stride_src_col + (i+3) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_4 =  stride_src_col + (i+4) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_5 =  stride_src_col + (i+5) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_6 =  stride_src_col + (i+6) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-        size_t tmp_idx_7 =  stride_src_col + (i+7) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;  
+        size_t tmp_idx_7 =  stride_src_col + (i+7) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         typename tileSrcCol::DType tmp_max_01 = blkv_max(src_col_ptr[tmp_idx_0], src_col_ptr[tmp_idx_1]);
-        typename tileSrcCol::DType tmp_max_23 = blkv_max(src_col_ptr[tmp_idx_2], src_col_ptr[tmp_idx_3]); 
-        typename tileSrcCol::DType tmp_max_45 = blkv_max(src_col_ptr[tmp_idx_4], src_col_ptr[tmp_idx_5]); 
-        typename tileSrcCol::DType tmp_max_67 = blkv_max(src_col_ptr[tmp_idx_6], src_col_ptr[tmp_idx_7]);  
-        typename tileSrcCol::DType tmp_max_0123 = blkv_max(tmp_max_01, tmp_max_23); 
-        typename tileSrcCol::DType tmp_max_4567 = blkv_max(tmp_max_45, tmp_max_67); 
+        typename tileSrcCol::DType tmp_max_23 = blkv_max(src_col_ptr[tmp_idx_2], src_col_ptr[tmp_idx_3]);
+        typename tileSrcCol::DType tmp_max_45 = blkv_max(src_col_ptr[tmp_idx_4], src_col_ptr[tmp_idx_5]);
+        typename tileSrcCol::DType tmp_max_67 = blkv_max(src_col_ptr[tmp_idx_6], src_col_ptr[tmp_idx_7]);
+        typename tileSrcCol::DType tmp_max_0123 = blkv_max(tmp_max_01, tmp_max_23);
+        typename tileSrcCol::DType tmp_max_4567 = blkv_max(tmp_max_45, tmp_max_67);
         typename tileSrcCol::DType tmp_max_all = blkv_max(tmp_max_0123, tmp_max_4567);
         src_col_ptr[tmp_idx_0] = tmp_max_all;
     }
@@ -92,29 +92,29 @@ void __vec__ reducemax_row_kernel(
     size_t stride = 8;
     size_t iternum = __builtin_ctz(tileSrcCol::ValidCol/4) - 3;
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t k=0; k<iternum; k++){
-        //#pragma clang loop unroll(full) 
+        //#pragma clang loop unroll(full)
         for(size_t i=0; i<tileSrcCol::ValidCol/4; i+=(stride*2)){
             size_t src_idx_0 =  stride_src_col + (i + 0*stride) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
             size_t src_idx_1 =  stride_src_col + (i + 1*stride) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-            typename  tileSrcCol::DType max_01 = blkv_max(src_col_ptr[src_idx_0], src_col_ptr[src_idx_1]);           
-            src_col_ptr[src_idx_0] = max_01;          
+            typename  tileSrcCol::DType max_01 = blkv_max(src_col_ptr[src_idx_0], src_col_ptr[src_idx_1]);
+            src_col_ptr[src_idx_0] = max_01;
         }
         stride = stride*2;
     }
 
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpMax::ValidCol/4;i++){
-        size_t old_max_idx =  z * tileTmpMax::ValidCol/4 * tileTmpMax::ColStride + i * tileTmpMax::ColStride + j * tileTmpMax::RowStride;       
-        new_max_ptr[old_max_idx] = old_max_ptr[old_max_idx];          
-    }    
+        size_t old_max_idx =  z * tileTmpMax::ValidCol/4 * tileTmpMax::ColStride + i * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
+        new_max_ptr[old_max_idx] = old_max_ptr[old_max_idx];
+    }
 
 
     size_t src_max_idx = stride_src_col + j * tileSrcCol::RowStride;
     size_t max_tile_idx = z * tileTmpMax::ValidCol/4 * tileTmpMax::ColStride + tile_idx * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
-    new_max_ptr[max_tile_idx] = src_col_ptr[src_max_idx];  
+    new_max_ptr[max_tile_idx] = src_col_ptr[src_max_idx];
 }
 
 
@@ -129,39 +129,39 @@ void __vec__ reducemax_row_final_kernel(
     __vbuf__ typename tileMax::DType *new_max_ptr = blkv_get_tile_ptr(new_max);
     __vbuf__ typename tileTmpMax::DType *tmp_max_ptr = blkv_get_tile_ptr(tmp_max);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpMax::Cols;i+=8){
         size_t src_idx_0 =  (i+0) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
         size_t src_idx_1 =  (i+1) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
         size_t src_idx_2 =  (i+2) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
-        size_t src_idx_3 =  (i+3) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;        
+        size_t src_idx_3 =  (i+3) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
         size_t src_idx_4 =  (i+4) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
         size_t src_idx_5 =  (i+5) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
         size_t src_idx_6 =  (i+6) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
-        size_t src_idx_7 =  (i+7) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;        
-        typename  tileTmpMax::DType max_01 = blkv_max(tmp_max_ptr[src_idx_0], tmp_max_ptr[src_idx_1]);    
+        size_t src_idx_7 =  (i+7) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
+        typename  tileTmpMax::DType max_01 = blkv_max(tmp_max_ptr[src_idx_0], tmp_max_ptr[src_idx_1]);
         typename  tileTmpMax::DType max_23 = blkv_max(tmp_max_ptr[src_idx_2], tmp_max_ptr[src_idx_3]);
-        typename  tileTmpMax::DType max_45 = blkv_max(tmp_max_ptr[src_idx_4], tmp_max_ptr[src_idx_5]);    
-        typename  tileTmpMax::DType max_67 = blkv_max(tmp_max_ptr[src_idx_6], tmp_max_ptr[src_idx_7]);        
-        typename  tileTmpMax::DType max_0123 = blkv_max(max_01, max_23); 
+        typename  tileTmpMax::DType max_45 = blkv_max(tmp_max_ptr[src_idx_4], tmp_max_ptr[src_idx_5]);
+        typename  tileTmpMax::DType max_67 = blkv_max(tmp_max_ptr[src_idx_6], tmp_max_ptr[src_idx_7]);
+        typename  tileTmpMax::DType max_0123 = blkv_max(max_01, max_23);
         typename  tileTmpMax::DType max_4567 = blkv_max(max_45, max_67);
-        typename  tileTmpMax::DType max_all = blkv_max(max_0123, max_4567);   
-        tmp_max_ptr[src_idx_0] = max_all;          
-    }   
+        typename  tileTmpMax::DType max_all = blkv_max(max_0123, max_4567);
+        tmp_max_ptr[src_idx_0] = max_all;
+    }
 
     size_t stride = 8;
-    size_t iternum = __builtin_ctz(tileTmpMax::Cols) - 3;    
-    #pragma clang loop unroll(full) 
+    size_t iternum = __builtin_ctz(tileTmpMax::Cols) - 3;
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t i=0;i<tileTmpMax::Cols;i+=(stride*2)){
             size_t src_idx_0 =  (i + 0*stride) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
             size_t src_idx_1 =  (i + 1*stride) * tileTmpMax::ColStride + j * tileTmpMax::RowStride;
-            typename  tileTmpMax::DType max_01 = blkv_max(tmp_max_ptr[src_idx_0], tmp_max_ptr[src_idx_1]);           
-            tmp_max_ptr[src_idx_0] = max_01;          
+            typename  tileTmpMax::DType max_01 = blkv_max(tmp_max_ptr[src_idx_0], tmp_max_ptr[src_idx_1]);
+            tmp_max_ptr[src_idx_0] = max_01;
         }
         stride = stride*2;
-    }    
+    }
 
     size_t max_idx = j * tileTmpMax::RowStride;
     new_max_ptr[idx] = tmp_max_ptr[max_idx];
@@ -172,31 +172,31 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void reducemax_row_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // todo 尾块怎么处理？
-    const int rmd_N = gIN % tN; // todo 尾块怎么处理？    
+    const int rmd_N = gIN % tN; // todo: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //将gm中的Tensor先声明为一维数据 
-//    using gm_shapeMax = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in gm as one-dimensional data
+//    using gm_shapeMax = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, 1>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // todo 尾块怎么处理？是否要作为参数写在这
-    using tile_shapeDataCol = Tile<Location::Vec, dtype, tM, tN/8, BLayout::ColMajor>; // todo 尾块怎么处理？是否要作为参数写在这    
+    using tile_shapeDataCol = Tile<Location::Vec, dtype, tM, tN/8, BLayout::ColMajor>; // todo: how should the tail block be handled? Should it be written here as a parameter?
     using tile_shapeMax = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // todo 这里的location，一定要是Vec吗？哪怕没有传入Vec
     using tile_shapeTmpMax = Tile<Location::Vec, dtype, tM, 64, BLayout::ColMajor, tM, Nb*4>; // todo 这里的location，一定要是Vec吗？哪怕没有传入Vec
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeMax olcMaxGm(old_max_ptr);    
+//    gm_shapeMax olcMaxGm(old_max_ptr);
 
-    tile_shapeData dataTile;   
-    tile_shapeDataCol dataTile_col;                        
+    tile_shapeData dataTile;
+    tile_shapeDataCol dataTile_col;
     tile_shapeMax MaxTile;
     tile_shapeTmpMax oldtmpMaxTile;
     tile_shapeTmpMax tmpMaxTile;
@@ -204,7 +204,7 @@ void reducemax_row_rand(
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeMax>;
 
     itIn  gIIter(in_ptr);
@@ -212,21 +212,21 @@ void reducemax_row_rand(
 
 
     auto gO = gOIter(0, 0);
-    TEXPANDSCALAR(oldtmpMaxTile, 0);//初始化为0  
-    TEXPANDSCALAR(dataTile_col, 0);//初始化为0     
+    TEXPANDSCALAR(oldtmpMaxTile, 0);// initialize to 0
+    TEXPANDSCALAR(dataTile_col, 0);// initialize to 0
     for (int i = 0; i < Nb; ++i) {
-        auto gI = gIIter(0, i);                
-        TCOPYIN(dataTile, gI);    
-        reducemax_row_kernel<tile_shapeData, tile_shapeDataCol, tile_shapeTmpMax><<<tile_shapeTmpMax::ValidRow, 4, 1>>>(tmpMaxTile.data(), 
-                                                                                                                        dataTile.data(), 
-                                                                                                                        dataTile_col.data(), 
+        auto gI = gIIter(0, i);
+        TLOAD(dataTile, gI);
+        reducemax_row_kernel<tile_shapeData, tile_shapeDataCol, tile_shapeTmpMax><<<tile_shapeTmpMax::ValidRow, 4, 1>>>(tmpMaxTile.data(),
+                                                                                                                        dataTile.data(),
+                                                                                                                        dataTile_col.data(),
                                                                                                                         oldtmpMaxTile.data(),
                                                                                                                         i);
         oldtmpMaxTile = tmpMaxTile;
     }
-    reducemax_row_final_kernel<tile_shapeTmpMax, tile_shapeMax><<<tile_shapeTmpMax::ValidRow, 1, 1>>>(MaxTile.data(), 
-                                                                                                      tmpMaxTile.data());     
-    TCOPYOUT(gO, MaxTile);
+    reducemax_row_final_kernel<tile_shapeTmpMax, tile_shapeMax><<<tile_shapeTmpMax::ValidRow, 1, 1>>>(MaxTile.data(),
+                                                                                                      tmpMaxTile.data());
+    TSTORE(gO, MaxTile);
 }
 
 #endif
diff --git a/kernels/reduction/reduceprod_colvec.hpp b/kernels/reduction/reduceprod_colvec.hpp
index 15d12c1..15bab7d 100644
--- a/kernels/reduction/reduceprod_colvec.hpp
+++ b/kernels/reduction/reduceprod_colvec.hpp
@@ -19,14 +19,14 @@ template<typename tileSrc, typename timeProd>
 void __vec__ reduceprod_col_kernel(
     typename timeProd::TileDType __out__ new_prod,
     const typename tileSrc::TileDType __in__ src,
-    const typename timeProd::TileDType __in__ old_prod    
+    const typename timeProd::TileDType __in__ old_prod
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename timeProd::DType *new_prod_ptr = blkv_get_tile_ptr(new_prod);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename timeProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod);   
+    __vbuf__ typename timeProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod);
 
 
     typename timeProd::DType upd_prod = old_prod_ptr[i];
@@ -34,9 +34,9 @@ void __vec__ reduceprod_col_kernel(
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_prod = upd_prod * src_ptr[src_idx];          
-    }    
-    new_prod_ptr[i] = upd_prod;    
+        upd_prod = upd_prod * src_ptr[src_idx];
+    }
+    new_prod_ptr[i] = upd_prod;
 }
 
 
@@ -44,58 +44,58 @@ void __vec__ reduceprod_col_kernel(
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reduceprod_col_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; //     
-    using tile_shapeProd = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; //
+    using tile_shapeProd = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
 
 
 
-    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-    using tile_shapeProd_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+    using tile_shapeProd_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeProd ProdTile;
     tile_shapeProd oldProdTile;
 
     tile_shapeData_row dataTile_row;
-    tile_shapeData_cor dataTile_cor;    
+    tile_shapeData_cor dataTile_cor;
     tile_shapeProd_row ProdTile_row;
-    tile_shapeProd_row oldProdTile_row;    
+    tile_shapeProd_row oldProdTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeProd>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeProd>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -104,41 +104,41 @@ void reduceprod_col_rand(
 //        auto gZero = gZeroIter(0, j);
         auto gO = gOIter(0, j);
         TEXPANDSCALAR(oldProdTile, 0);//初始化为0
-//        TCOPYIN(oldSumTile, gZero);//初始化为0
-        //初始化old_sum的tile      
-        //need 
+//        TLOAD(oldSumTile, gZero);// initialize to 0
+        // initialize the old_sum tile
+        //need
         for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, j);
-            TCOPYIN(dataTile, gI);
+            TLOAD(dataTile, gI);
             reduceprod_col_kernel<tile_shapeData, tile_shapeProd><<<tile_shapeProd::ValidCol, tile_shapeProd::ValidRow, 1>>>(ProdTile.data(), dataTile.data(), oldProdTile.data());
             oldProdTile = ProdTile;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, j);
-            TCOPYIN(dataTile_col, gI);
+            TLOAD(dataTile_col, gI);
             reduceprod_col_kernel<tile_shapeData_col,tile_shapeProd><<<tile_shapeProd::ValidCol, tile_shapeProd::ValidRow, 1>>>(ProdTile.data(), dataTile_col.data(), oldProdTile.data());
             oldProdTile = ProdTile;
         }
-        TCOPYOUT(gO, ProdTile);
+        TSTORE(gO, ProdTile);
     }
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
         auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldProdTile_row, 0);//初始化为0        
-//        TCOPYIN(oldSumTile_row, gZero);//初始化为0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldProdTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             reduceprod_col_kernel<tile_shapeData_row,tile_shapeProd_row><<<tile_shapeProd_row::ValidCol, tile_shapeProd_row::ValidRow, 1>>>(ProdTile_row.data(), dataTile_row.data(), oldProdTile_row.data());
             oldProdTile_row = ProdTile_row;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             reduceprod_col_kernel<tile_shapeData_cor,tile_shapeProd_row><<<tile_shapeProd_row::ValidCol, tile_shapeProd_row::ValidRow, 1>>>(ProdTile_row.data(), dataTile_cor.data(), oldProdTile_row.data());
             oldProdTile_row = ProdTile_row;
         }
-        TCOPYOUT(gO, ProdTile_row);
+        TSTORE(gO, ProdTile_row);
     }
 }
 
diff --git a/kernels/reduction/reduceprod_rowvec.hpp b/kernels/reduction/reduceprod_rowvec.hpp
index 7494130..931b238 100644
--- a/kernels/reduction/reduceprod_rowvec.hpp
+++ b/kernels/reduction/reduceprod_rowvec.hpp
@@ -17,17 +17,17 @@ template<typename tileSrc, typename tileProd>
 void __vec__ reduceprod_row_kernel(
     typename tileProd::TileDType __out__ new_prod,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileProd::TileDType __in__ old_prod    
+    const typename tileProd::TileDType __in__ old_prod
 )
 {
-//    size_t i = blkv_get_index_x();  
-    size_t j = blkv_get_index_x();  
+//    size_t i = blkv_get_index_x();
+    size_t j = blkv_get_index_x();
 //    size_t j = blkv_get_index_y();
-    size_t idx = j * tileProd::RowStride;    
+    size_t idx = j * tileProd::RowStride;
 
     __vbuf__ typename tileProd::DType *new_prod_ptr = blkv_get_tile_ptr(new_prod);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod);   
+    __vbuf__ typename tileProd::DType *old_prod_ptr = blkv_get_tile_ptr(old_prod);
 
 
     typename tileProd::DType upd_prod = old_prod_ptr[idx];
@@ -35,9 +35,9 @@ void __vec__ reduceprod_row_kernel(
     #pragma clang loop unroll(full)
     for(size_t i=0;i<tileSrc::ValidCol;i++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_prod = upd_prod * src_ptr[src_idx];              
-    }    
-    new_prod_ptr[idx] = upd_prod;    
+        upd_prod = upd_prod * src_ptr[src_idx];
+    }
+    new_prod_ptr[idx] = upd_prod;
 }
 
 
@@ -46,63 +46,63 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void reduceprod_row_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // todo 尾块怎么处理？
-    const int rmd_N = gIN % tN; // todo 尾块怎么处理？    
+    const int rmd_N = gIN % tN; // todo: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //将gm中的Tensor先声明为一维数据 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // declare the tensor in GM as one-dimensional data first
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, 1>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // todo 尾块怎么处理？是否要作为参数写在这
     using tile_shapeProd = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // todo 这里的location，一定要是Vec吗？哪怕没有传入Vec
 
 
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // todo 尾块怎么处理？是否要作为参数写在这   
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // todo: how should the tail block be handled? Should it be passed as a parameter here?
     using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; // todo 尾块怎么处理？是否要作为参数写在这
-    using tile_shapeProd_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;      
+    using tile_shapeProd_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
-    tile_shapeData dataTile;                
+    tile_shapeData dataTile;
     tile_shapeData_row dataTile_row;
     tile_shapeData_col dataTile_col;
-    tile_shapeData_cor dataTile_cor;    
-    
+    tile_shapeData_cor dataTile_cor;
+
     tile_shapeProd ProdTile;
     tile_shapeProd oldProdTile;
     tile_shapeProd_col ProdTile_col;
-    tile_shapeProd_col oldProdTile_col;    
+    tile_shapeProd_col oldProdTile_col;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeProd>;
 
     itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    printf("tile_shapeSum::ValidCol = %d\n",  tile_shapeSum::ValidCol);
-//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);    
+//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);
 //    printf("before for\n");
     for (int j = 0; j < Mb; ++j) {
         auto gO = gOIter(j, 0);
         TEXPANDSCALAR(oldProdTile, 0);//初始化为0
-        //初始化old_sum的tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(j, i);   
-//            printf("before copy in , %d\n", i);                
-            TCOPYIN(dataTile, gI);    
+            auto gI = gIIter(j, i);
+//            printf("before copy in , %d\n", i);
+            TLOAD(dataTile, gI);
             reduceprod_row_kernel<tile_shapeData, tile_shapeProd><<<tile_shapeProd::ValidRow, 1, 1>>>(ProdTile.data(), dataTile.data(), oldProdTile.data());
 //            reducesum_row_kernel<tile_shapeData, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
 //            printf("kernel , %d\n", i);
@@ -112,40 +112,40 @@ void reduceprod_row_rand(
         //for row corner
         if constexpr (rmd_N > 0){
             auto gI = gIIter(j, Nb);
-            TCOPYIN(dataTile_row, gI);
-            reduceprod_row_kernel<tile_shapeData_row, tile_shapeProd><<<tile_shapeProd::ValidRow, 1, 1>>>(ProdTile.data(), dataTile_row.data(), oldProdTile.data());            
+            TLOAD(dataTile_row, gI);
+            reduceprod_row_kernel<tile_shapeData_row, tile_shapeProd><<<tile_shapeProd::ValidRow, 1, 1>>>(ProdTile.data(), dataTile_row.data(), oldProdTile.data());
 //            reducesum_row_kernel<tile_shapeData_row, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile_row.data(), oldSumTile.data());
             oldProdTile = ProdTile;
         }
-//        printf("before tcopyout\n");        
-        TCOPYOUT(gO, ProdTile);
-//        printf("end tcopyout\n"); 
+//        printf("before tstore\n");
+        TSTORE(gO, ProdTile);
+//        printf("end tstore\n");
     }
     //for col cor
     if constexpr (rmd_M > 0){
         auto gO = gOIter(Mb, 0);
         TEXPANDSCALAR(oldProdTile_col, 0);//初始化为0
-        //初始化old_sum的tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(Mb, i);   
-            TCOPYIN(dataTile_col, gI);                  
+            auto gI = gIIter(Mb, i);
+            TLOAD(dataTile_col, gI);
             reduceprod_row_kernel<tile_shapeData_col, tile_shapeProd_col><<<tile_shapeProd_col::ValidRow, 1, 1>>>(ProdTile_col.data(), dataTile_col.data(), oldProdTile_col.data());
             oldProdTile_col = ProdTile_col;
         }
         if constexpr (rmd_N > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);             
+            TLOAD(dataTile_cor, gI);
             reduceprod_row_kernel<tile_shapeData_cor, tile_shapeProd_col><<<tile_shapeProd_col::ValidRow, 1, 1>>>(ProdTile_col.data(), dataTile_cor.data(), oldProdTile_col.data());
             oldProdTile_col = ProdTile_col;
         }
-        TCOPYOUT(gO, ProdTile_col);
+        TSTORE(gO, ProdTile_col);
     }
 /*
     for(int i = 0; i < gIM; i++){
         printf("out%d = %d\n", i, out_ptr[i]);
     }
 */
-//    printf("end program\n"); 
+//    printf("end program\n");
 }
 
 #endif
diff --git a/kernels/reduction/reducesum_colvec.hpp b/kernels/reduction/reducesum_colvec.hpp
index 431d8d4..65742ed 100644
--- a/kernels/reduction/reducesum_colvec.hpp
+++ b/kernels/reduction/reducesum_colvec.hpp
@@ -19,20 +19,20 @@ template<typename tileSrc, typename tileSum>
 void __vec__ reducesum_col_kernel(
     typename tileSum::TileDType __out__ new_sum,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileSum::TileDType __in__ old_sum    
+    const typename tileSum::TileDType __in__ old_sum
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 
 //    typename tileSum::DType upd_sum = old_sum_ptr[i];
     typename tileSum::DType upd_sum = 0;
-   
-    #pragma clang loop unroll(full) 
+
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
@@ -41,25 +41,25 @@ void __vec__ reducesum_col_kernel(
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-        typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
-        typename tileSum::DType sum_0123 = sum_01 + sum_23; 
+        typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
+        typename tileSum::DType sum_0123 = sum_01 + sum_23;
         typename tileSum::DType sum_4567 = sum_45 + sum_67;
-        typename tileSum::DType sum_tmp = sum_0123 + sum_4567;         
-        upd_sum = upd_sum + sum_tmp;              
+        typename tileSum::DType sum_tmp = sum_0123 + sum_4567;
+        upd_sum = upd_sum + sum_tmp;
     }
 
 /*
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_sum = upd_sum + src_ptr[src_idx];              
+        upd_sum = upd_sum + src_ptr[src_idx];
     }
-*/        
-    new_sum_ptr[i] = upd_sum + old_sum_ptr[i];    
+*/
+    new_sum_ptr[i] = upd_sum + old_sum_ptr[i];
 }
 
 
@@ -68,58 +68,58 @@ void __vec__ reducesum_col_kernel(
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducesum_colsum_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
 
 
 
-    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeSum SumTile;
     tile_shapeSum oldSumTile;
 
     tile_shapeData_row dataTile_row;
-    tile_shapeData_cor dataTile_cor;    
+    tile_shapeData_cor dataTile_cor;
     tile_shapeSum_row SumTile_row;
-    tile_shapeSum_row oldSumTile_row;    
+    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeSum>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -128,41 +128,41 @@ void reducesum_colsum_rand(
 //        auto gZero = gZeroIter(0, j);
         auto gO = gOIter(0, j);
         TEXPANDSCALAR(oldSumTile, 0);//初始化为0
-//        TCOPYIN(oldSumTile, gZero);//初始化为0
-        //初始化old_sum的tile      
-        //need 
+//        TLOAD(oldSumTile, gZero);// initialize to 0
+        // initialize the old_sum tile
+        //need
         for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, j);
-            TCOPYIN(dataTile, gI);
+            TLOAD(dataTile, gI);
             reducesum_col_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeSum::ValidCol, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
             oldSumTile = SumTile;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, j);
-            TCOPYIN(dataTile_col, gI);
+            TLOAD(dataTile_col, gI);
             reducesum_col_kernel<tile_shapeData_col,tile_shapeSum><<<tile_shapeSum::ValidCol, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile_col.data(), oldSumTile.data());
             oldSumTile = SumTile;
         }
-        TCOPYOUT(gO, SumTile);
+        TSTORE(gO, SumTile);
     }
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
         auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0        
-//        TCOPYIN(oldSumTile_row, gZero);//初始化为0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldSumTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             reducesum_col_kernel<tile_shapeData_row,tile_shapeSum_row><<<tile_shapeSum_row::ValidCol, tile_shapeSum_row::ValidRow, 1>>>(SumTile_row.data(), dataTile_row.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             reducesum_col_kernel<tile_shapeData_cor,tile_shapeSum_row><<<tile_shapeSum_row::ValidCol, tile_shapeSum_row::ValidRow, 1>>>(SumTile_row.data(), dataTile_cor.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
         }
-        TCOPYOUT(gO, SumTile_row);
+        TSTORE(gO, SumTile_row);
     }
 }
 
diff --git a/kernels/reduction/reducesum_colvec_single.hpp b/kernels/reduction/reducesum_colvec_single.hpp
index f20e3ad..1070fe2 100644
--- a/kernels/reduction/reducesum_colvec_single.hpp
+++ b/kernels/reduction/reducesum_colvec_single.hpp
@@ -19,19 +19,19 @@ template<typename tileSrc, typename tileSum>
 void __vec__ reducesum_col_kernel(
     typename tileSum::TileDType __out__ new_sum,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileSum::TileDType __in__ old_sum    
+    const typename tileSum::TileDType __in__ old_sum
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 
     typename tileSum::DType upd_sum = old_sum_ptr[i];
-   
-    #pragma clang loop unroll(full) 
+
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
@@ -40,25 +40,25 @@ void __vec__ reducesum_col_kernel(
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-        typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
-        typename tileSum::DType sum_0123 = sum_01 + sum_23; 
+        typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
+        typename tileSum::DType sum_0123 = sum_01 + sum_23;
         typename tileSum::DType sum_4567 = sum_45 + sum_67;
-        typename tileSum::DType sum_tmp = sum_0123 + sum_4567;         
-        upd_sum = upd_sum + sum_tmp;                
+        typename tileSum::DType sum_tmp = sum_0123 + sum_4567;
+        upd_sum = upd_sum + sum_tmp;
     }
 
 /*
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_sum = upd_sum + src_ptr[src_idx];              
+        upd_sum = upd_sum + src_ptr[src_idx];
     }
-*/        
-    new_sum_ptr[i] = upd_sum;    
+*/
+    new_sum_ptr[i] = upd_sum;
 }
 
 
@@ -67,58 +67,58 @@ void __vec__ reducesum_col_kernel(
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducesum_colsum_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
 
 
 
-    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeSum SumTile;
     tile_shapeSum oldSumTile;
 
     tile_shapeData_row dataTile_row;
-    tile_shapeData_cor dataTile_cor;    
+    tile_shapeData_cor dataTile_cor;
     tile_shapeSum_row SumTile_row;
-    tile_shapeSum_row oldSumTile_row;    
+    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeSum>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -127,42 +127,42 @@ void reducesum_colsum_rand(
 //        auto gZero = gZeroIter(0, j);
     auto gO = gOIter(0, 0);
     TEXPANDSCALAR(oldSumTile, 0);//初始化为0
-//        TCOPYIN(oldSumTile, gZero);//初始化为0
-        //初始化old_sum的tile      
-        //need 
+//        TLOAD(oldSumTile, gZero);// initialize to 0
+        // initialize the old_sum tile
+        //need
     for (int i = 0; i < Mb; ++i) {
         auto gI = gIIter(i, 0);
-        TCOPYIN(dataTile, gI);
+        TLOAD(dataTile, gI);
         reducesum_col_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeSum::ValidCol, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
         oldSumTile = SumTile;
     }
-    if constexpr (rmd_M > 0){   
+    if constexpr (rmd_M > 0){
         auto gI = gIIter(Mb, 0);
-        TCOPYIN(dataTile_col, gI);
+        TLOAD(dataTile_col, gI);
         reducesum_col_kernel<tile_shapeData_col,tile_shapeSum><<<tile_shapeSum::ValidCol, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile_col.data(), oldSumTile.data());
         oldSumTile = SumTile;
     }
-    TCOPYOUT(gO, SumTile);
+    TSTORE(gO, SumTile);
 //    }
     /*
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
         auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0        
-//        TCOPYIN(oldSumTile_row, gZero);//初始化为0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldSumTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             reducesum_col_kernel<tile_shapeData_row,tile_shapeSum_row><<<tile_shapeSum_row::ValidCol, tile_shapeSum_row::ValidRow, 1>>>(SumTile_row.data(), dataTile_row.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             reducesum_col_kernel<tile_shapeData_cor,tile_shapeSum_row><<<tile_shapeSum_row::ValidCol, tile_shapeSum_row::ValidRow, 1>>>(SumTile_row.data(), dataTile_cor.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
         }
-        TCOPYOUT(gO, SumTile_row);
+        TSTORE(gO, SumTile_row);
     }
     */
 }
diff --git a/kernels/reduction/reducesum_colvec_single_8192.hpp b/kernels/reduction/reducesum_colvec_single_8192.hpp
index 266d1e5..055df83 100644
--- a/kernels/reduction/reducesum_colvec_single_8192.hpp
+++ b/kernels/reduction/reducesum_colvec_single_8192.hpp
@@ -20,39 +20,39 @@ void __vec__ reducesum_col_kernel(
     typename tileTmpSum::TileDType __out__ new_sum,
     const typename tileSrc::TileDType __in__ src,
     const typename tileTmpSum::TileDType __in__ old_sum,
-    const size_t tile_idx  
+    const size_t tile_idx
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);    
+    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpSum::ValidRow;j++){
-        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;       
-        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];          
+        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];
     }
-    
-    #pragma clang loop unroll(full) 
+
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0) * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
-        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;        
+        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
         size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
-        typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename  tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename  tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-        typename  tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
+        typename  tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename  tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
         typename  tileSrc::DType sum_0123 = sum_01 + sum_23;
         typename  tileSrc::DType sum_4567 = sum_45 + sum_67;
-        typename  tileSrc::DType sum_all = sum_0123 + sum_4567;   
-        src_ptr[src_idx_0] = sum_all;          
+        typename  tileSrc::DType sum_all = sum_0123 + sum_4567;
+        src_ptr[src_idx_0] = sum_all;
     }
 
     #pragma clang loop unroll(full)
@@ -60,17 +60,17 @@ void __vec__ reducesum_col_kernel(
         size_t tmp_idx_0 =  i * tileSrc::ColStride + (j + 0*8) * tileSrc::RowStride;
         size_t tmp_idx_1 =  i * tileSrc::ColStride + (j + 1*8) * tileSrc::RowStride;
         size_t tmp_idx_2 =  i * tileSrc::ColStride + (j + 2*8) * tileSrc::RowStride;
-        size_t tmp_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;        
+        size_t tmp_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;
         size_t tmp_idx_4 =  i * tileSrc::ColStride + (j + 4*8) * tileSrc::RowStride;
         size_t tmp_idx_5 =  i * tileSrc::ColStride + (j + 5*8) * tileSrc::RowStride;
         size_t tmp_idx_6 =  i * tileSrc::ColStride + (j + 6*8) * tileSrc::RowStride;
-        size_t tmp_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;  
+        size_t tmp_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;
         typename tileSrc::DType tmp_sum_01 = src_ptr[tmp_idx_0]+ src_ptr[tmp_idx_1];
-        typename tileSrc::DType tmp_sum_23 = src_ptr[tmp_idx_2]+ src_ptr[tmp_idx_3]; 
-        typename tileSrc::DType tmp_sum_45 = src_ptr[tmp_idx_4]+ src_ptr[tmp_idx_5]; 
-        typename tileSrc::DType tmp_sum_67 = src_ptr[tmp_idx_6]+ src_ptr[tmp_idx_7];  
-        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23; 
-        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67; 
+        typename tileSrc::DType tmp_sum_23 = src_ptr[tmp_idx_2]+ src_ptr[tmp_idx_3];
+        typename tileSrc::DType tmp_sum_45 = src_ptr[tmp_idx_4]+ src_ptr[tmp_idx_5];
+        typename tileSrc::DType tmp_sum_67 = src_ptr[tmp_idx_6]+ src_ptr[tmp_idx_7];
+        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23;
+        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67;
         typename tileSrc::DType tmp_sum_all = tmp_sum_0123 + tmp_sum_4567;
         src_ptr[tmp_idx_0] = tmp_sum_all;
     };
@@ -80,29 +80,29 @@ void __vec__ reducesum_col_kernel(
     size_t tmp_idx_l2_0 =  i * tileSrc::ColStride + 0*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_1 =  i * tileSrc::ColStride + 1*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_2 =  i * tileSrc::ColStride + 2*64 * tileSrc::RowStride;
-    size_t tmp_idx_l2_3 =  i * tileSrc::ColStride + 3*64 * tileSrc::RowStride;        
+    size_t tmp_idx_l2_3 =  i * tileSrc::ColStride + 3*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_4 =  i * tileSrc::ColStride + 4*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_5 =  i * tileSrc::ColStride + 5*64 * tileSrc::RowStride;
     size_t tmp_idx_l2_6 =  i * tileSrc::ColStride + 6*64 * tileSrc::RowStride;
-    size_t tmp_idx_l2_7 =  i * tileSrc::ColStride + 7*64 * tileSrc::RowStride;      
+    size_t tmp_idx_l2_7 =  i * tileSrc::ColStride + 7*64 * tileSrc::RowStride;
     typename tileTmpSum::DType tmp_sum_l2_01 = src_ptr[tmp_idx_l2_0] + src_ptr[tmp_idx_l2_1];
-    typename tileTmpSum::DType tmp_sum_l2_23 = src_ptr[tmp_idx_l2_2] + src_ptr[tmp_idx_l2_3];   
+    typename tileTmpSum::DType tmp_sum_l2_23 = src_ptr[tmp_idx_l2_2] + src_ptr[tmp_idx_l2_3];
     typename tileTmpSum::DType tmp_sum_l2_45 = src_ptr[tmp_idx_l2_4] + src_ptr[tmp_idx_l2_5];
-    typename tileTmpSum::DType tmp_sum_l2_67 = src_ptr[tmp_idx_l2_6] + src_ptr[tmp_idx_l2_7];  
-    typename tileTmpSum::DType tmp_sum_l2_0123 = tmp_sum_l2_01 + tmp_sum_l2_23; 
-    typename tileTmpSum::DType tmp_sum_l2_4567 = tmp_sum_l2_45 + tmp_sum_l2_67; 
-    typename tileTmpSum::DType tmp_sum_l2_all = tmp_sum_l2_0123 + tmp_sum_l2_4567;          
+    typename tileTmpSum::DType tmp_sum_l2_67 = src_ptr[tmp_idx_l2_6] + src_ptr[tmp_idx_l2_7];
+    typename tileTmpSum::DType tmp_sum_l2_0123 = tmp_sum_l2_01 + tmp_sum_l2_23;
+    typename tileTmpSum::DType tmp_sum_l2_4567 = tmp_sum_l2_45 + tmp_sum_l2_67;
+    typename tileTmpSum::DType tmp_sum_l2_all = tmp_sum_l2_0123 + tmp_sum_l2_4567;
 
 /*
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_sum = upd_sum + src_ptr[src_idx];              
+        upd_sum = upd_sum + src_ptr[src_idx];
     }
 */
-//    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);        
-//    new_sum_ptr[i] = tmp_sum_l2_all + old_sum_ptr[i];  
-//    new_sum_ptr[i] = tmp_sum_l2_all;  
+//    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
+//    new_sum_ptr[i] = tmp_sum_l2_all + old_sum_ptr[i];
+//    new_sum_ptr[i] = tmp_sum_l2_all;
 
     size_t  sum_tile_idx = i * tileTmpSum::ColStride + tile_idx * tileTmpSum::RowStride;
     new_sum_ptr[sum_tile_idx] = tmp_sum_l2_all;
@@ -117,25 +117,25 @@ void __vec__ reducesum_col_final_kernel(
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileTmpSum::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpSum::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileTmpSum::ColStride + (j + 0) * tileTmpSum::RowStride;
         size_t src_idx_1 =  i * tileTmpSum::ColStride + (j + 1) * tileTmpSum::RowStride;
         size_t src_idx_2 =  i * tileTmpSum::ColStride + (j + 2) * tileTmpSum::RowStride;
-        size_t src_idx_3 =  i * tileTmpSum::ColStride + (j + 3) * tileTmpSum::RowStride;        
+        size_t src_idx_3 =  i * tileTmpSum::ColStride + (j + 3) * tileTmpSum::RowStride;
         size_t src_idx_4 =  i * tileTmpSum::ColStride + (j + 4) * tileTmpSum::RowStride;
         size_t src_idx_5 =  i * tileTmpSum::ColStride + (j + 5) * tileTmpSum::RowStride;
         size_t src_idx_6 =  i * tileTmpSum::ColStride + (j + 6) * tileTmpSum::RowStride;
-        size_t src_idx_7 =  i * tileTmpSum::ColStride + (j + 7) * tileTmpSum::RowStride;        
-        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];    
+        size_t src_idx_7 =  i * tileTmpSum::ColStride + (j + 7) * tileTmpSum::RowStride;
+        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
         typename  tileTmpSum::DType sum_23 = tmp_sum_ptr[src_idx_2] + tmp_sum_ptr[src_idx_3];
-        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];    
-        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];        
-        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23; 
+        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];
+        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];
+        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23;
         typename  tileTmpSum::DType sum_4567 = sum_45 + sum_67;
-        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;   
-        tmp_sum_ptr[src_idx_0] = sum_all;          
-    }   
+        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;
+        tmp_sum_ptr[src_idx_0] = sum_all;
+    }
 
     size_t sum_idx_0 = i * tileTmpSum::ColStride + 0*8 * tileTmpSum::RowStride;
     size_t sum_idx_1 = i * tileTmpSum::ColStride + 1*8 * tileTmpSum::RowStride;
@@ -145,47 +145,47 @@ void __vec__ reducesum_col_final_kernel(
 
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducesum_colsum_rand(
-    dtype *in_ptr,  
+    dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //   
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
-    using tile_shapeTmpSum = Tile<Location::Vec, dtype, 16, tN, BLayout::RowMajor>; // 
-//    using tile_shapeTmpSum_l2 = Tile<Location::Vec, dtype, tM/64, tN, BLayout::RowMajor>; //     
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
+    using tile_shapeTmpSum = Tile<Location::Vec, dtype, 16, tN, BLayout::RowMajor>; //
+//    using tile_shapeTmpSum_l2 = Tile<Location::Vec, dtype, tM/64, tN, BLayout::RowMajor>; //
 
 
-//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-//    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+//    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-    gm_shapeOut outGm(out_ptr); 
+    gm_shapeIn inGm(in_ptr);
+    gm_shapeOut outGm(out_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeSum SumTile;
     tile_shapeTmpSum oldtmpSumTile;
     tile_shapeTmpSum tmpSumTile;
 //    tile_shapeTmpSum_l2 tmpSumTile_l2;
 
 //    tile_shapeData_row dataTile_row;
-//    tile_shapeData_cor dataTile_cor;    
+//    tile_shapeData_cor dataTile_cor;
 //    tile_shapeSum_row SumTile_row;
-//    tile_shapeSum_row oldSumTile_row;    
+//    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
@@ -193,7 +193,7 @@ void reducesum_colsum_rand(
     using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
-    itIn  gIIter(in_ptr);  
+    itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -203,24 +203,24 @@ void reducesum_colsum_rand(
     auto gO = gOIter(0, 0);
     TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0
 //    TEXPANDSCALAR(tmpSumTile, 0);//初始化为0
-//    TEXPANDSCALAR(tmpSumTile_l2, 0);//初始化为0        
+//    TEXPANDSCALAR(tmpSumTile_l2, 0);// initialize to 0
     for (size_t i = 0; i < Mb; ++i){
         auto gI = gIIter(i, 0);
-        TCOPYIN(dataTile, gI);
-        reducesum_col_kernel<tile_shapeData, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidCol, 1, 1>>>(tmpSumTile.data(), 
+        TLOAD(dataTile, gI);
+        reducesum_col_kernel<tile_shapeData, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidCol, 1, 1>>>(tmpSumTile.data(),
                                                                                                      dataTile.data(),
-                                                                                                     oldtmpSumTile.data(), 
+                                                                                                     oldtmpSumTile.data(),
                                                                                                      i);
         oldtmpSumTile = tmpSumTile;
     }
-    reducesum_col_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeSum::ValidCol, 1, 1>>>(SumTile.data(), 
+    reducesum_col_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeSum::ValidCol, 1, 1>>>(SumTile.data(),
                                                                                                    tmpSumTile.data());
-    TCOPYOUT(gO, SumTile);
+    TSTORE(gO, SumTile);
 }
 /*
-    if constexpr (rmd_M > 0){   
+    if constexpr (rmd_M > 0){
         auto gI = gIIter(Mb, 0);
-        TCOPYIN(dataTile_col, gI);
+        TLOAD(dataTile_col, gI);
         reducesum_col_kernel<tile_shapeData_col,tile_shapeSum><<<tile_shapeSum::ValidCol, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile_col.data(), oldSumTile.data());
         oldSumTile = SumTile;
     }
@@ -229,23 +229,23 @@ void reducesum_colsum_rand(
 //    }
     /*
     if constexpr (rmd_N > 0){
-//        auto gZero = gZeroIter(0, Nb);         
+//        auto gZero = gZeroIter(0, Nb);
         auto gO = gOIter(0, Nb);
-        TEXPANDSCALAR(oldSumTile_row, 0);//初始化为0        
-//        TCOPYIN(oldSumTile_row, gZero);//初始化为0
-        for (int i = 0; i < Mb; ++i) {   
+        TEXPANDSCALAR(oldSumTile_row, 0);// initialize to 0
+//        TLOAD(oldSumTile_row, gZero);// initialize to 0
+        for (int i = 0; i < Mb; ++i) {
             auto gI = gIIter(i, Nb);
-            TCOPYIN(dataTile_row, gI);
+            TLOAD(dataTile_row, gI);
             reducesum_col_kernel<tile_shapeData_row,tile_shapeSum_row><<<tile_shapeSum_row::ValidCol, tile_shapeSum_row::ValidRow, 1>>>(SumTile_row.data(), dataTile_row.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
         }
-        if constexpr (rmd_M > 0){   
+        if constexpr (rmd_M > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);
+            TLOAD(dataTile_cor, gI);
             reducesum_col_kernel<tile_shapeData_cor,tile_shapeSum_row><<<tile_shapeSum_row::ValidCol, tile_shapeSum_row::ValidRow, 1>>>(SumTile_row.data(), dataTile_cor.data(), oldSumTile_row.data());
             oldSumTile_row = SumTile_row;
         }
-        TCOPYOUT(gO, SumTile_row);
+        TSTORE(gO, SumTile_row);
     }
     */
 
diff --git a/kernels/reduction/reducesum_colvec_single_tree.hpp b/kernels/reduction/reducesum_colvec_single_tree.hpp
index f24fb56..462116d 100644
--- a/kernels/reduction/reducesum_colvec_single_tree.hpp
+++ b/kernels/reduction/reducesum_colvec_single_tree.hpp
@@ -21,40 +21,40 @@ void __vec__ reducesum_col_kernel(
     typename tileTmpSum::TileDType __out__ new_sum,
     const typename tileSrc::TileDType __in__ src,
     const typename tileTmpSum::TileDType __in__ old_sum,
-    const size_t tile_idx  
+    const size_t tile_idx
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);    
+    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpSum::ValidRow;j++){
-        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;       
-        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];          
+        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];
     }
 
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0) * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
-        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;        
+        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
         size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
-        typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename  tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename  tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-        typename  tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
+        typename  tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename  tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
         typename  tileSrc::DType sum_0123 = sum_01 + sum_23;
         typename  tileSrc::DType sum_4567 = sum_45 + sum_67;
-        typename  tileSrc::DType sum_all = sum_0123 + sum_4567;   
-        src_ptr[src_idx_0] = sum_all;          
+        typename  tileSrc::DType sum_all = sum_0123 + sum_4567;
+        src_ptr[src_idx_0] = sum_all;
     }
 
     #pragma clang loop unroll(full)
@@ -62,17 +62,17 @@ void __vec__ reducesum_col_kernel(
         size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0*8) * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1*8) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2*8) * tileSrc::RowStride;
-        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;        
+        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4*8) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5*8) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6*8) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;  
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;
         typename tileSrc::DType tmp_sum_01 = src_ptr[src_idx_0]+ src_ptr[src_idx_1];
-        typename tileSrc::DType tmp_sum_23 = src_ptr[src_idx_2]+ src_ptr[src_idx_3]; 
-        typename tileSrc::DType tmp_sum_45 = src_ptr[src_idx_4]+ src_ptr[src_idx_5]; 
-        typename tileSrc::DType tmp_sum_67 = src_ptr[src_idx_6]+ src_ptr[src_idx_7];  
-        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23; 
-        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67; 
+        typename tileSrc::DType tmp_sum_23 = src_ptr[src_idx_2]+ src_ptr[src_idx_3];
+        typename tileSrc::DType tmp_sum_45 = src_ptr[src_idx_4]+ src_ptr[src_idx_5];
+        typename tileSrc::DType tmp_sum_67 = src_ptr[src_idx_6]+ src_ptr[src_idx_7];
+        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23;
+        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67;
         typename tileSrc::DType tmp_sum_all = tmp_sum_0123 + tmp_sum_4567;
         src_ptr[src_idx_0] = tmp_sum_all;
     };
@@ -80,19 +80,19 @@ void __vec__ reducesum_col_kernel(
 
     size_t stride = 64;
     size_t iternum = __builtin_ctz(tileSrc::ValidRow) - 6;
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t j=0;j<tileSrc::ValidRow;j+=(stride*2)){
             size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0*stride) * tileSrc::RowStride;
             size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1*stride) * tileSrc::RowStride;
-            typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];           
-            src_ptr[src_idx_0] = sum_01;          
+            typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
+            src_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
     }
 
-        
+
     size_t src_sum_idx = i * tileSrc::ColStride;
     size_t  sum_tile_idx = i * tileTmpSum::ColStride + tile_idx * tileTmpSum::RowStride;
     new_sum_ptr[sum_tile_idx] = src_ptr[src_sum_idx];
@@ -109,39 +109,39 @@ void __vec__ reducesum_col_final_kernel(
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileTmpSum::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpSum::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileTmpSum::ColStride + (j + 0) * tileTmpSum::RowStride;
         size_t src_idx_1 =  i * tileTmpSum::ColStride + (j + 1) * tileTmpSum::RowStride;
         size_t src_idx_2 =  i * tileTmpSum::ColStride + (j + 2) * tileTmpSum::RowStride;
-        size_t src_idx_3 =  i * tileTmpSum::ColStride + (j + 3) * tileTmpSum::RowStride;        
+        size_t src_idx_3 =  i * tileTmpSum::ColStride + (j + 3) * tileTmpSum::RowStride;
         size_t src_idx_4 =  i * tileTmpSum::ColStride + (j + 4) * tileTmpSum::RowStride;
         size_t src_idx_5 =  i * tileTmpSum::ColStride + (j + 5) * tileTmpSum::RowStride;
         size_t src_idx_6 =  i * tileTmpSum::ColStride + (j + 6) * tileTmpSum::RowStride;
-        size_t src_idx_7 =  i * tileTmpSum::ColStride + (j + 7) * tileTmpSum::RowStride;        
-        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];    
+        size_t src_idx_7 =  i * tileTmpSum::ColStride + (j + 7) * tileTmpSum::RowStride;
+        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
         typename  tileTmpSum::DType sum_23 = tmp_sum_ptr[src_idx_2] + tmp_sum_ptr[src_idx_3];
-        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];    
-        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];        
-        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23; 
+        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];
+        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];
+        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23;
         typename  tileTmpSum::DType sum_4567 = sum_45 + sum_67;
-        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;   
-        tmp_sum_ptr[src_idx_0] = sum_all;          
-    }   
+        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;
+        tmp_sum_ptr[src_idx_0] = sum_all;
+    }
 
     size_t stride = 8;
-    size_t iternum = __builtin_ctz(tileTmpSum::ValidRow) - 3;    
-    #pragma clang loop unroll(full) 
+    size_t iternum = __builtin_ctz(tileTmpSum::ValidRow) - 3;
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t j=0;j<tileTmpSum::ValidRow;j+=(stride*2)){
             size_t src_idx_0 =  i * tileTmpSum::ColStride + (j + 0*stride) * tileTmpSum::RowStride;
             size_t src_idx_1 =  i * tileTmpSum::ColStride + (j + 1*stride) * tileTmpSum::RowStride;
-            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];           
-            tmp_sum_ptr[src_idx_0] = sum_01;          
+            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
+            tmp_sum_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
-    }    
+    }
 
     size_t sum_idx = i * tileTmpSum::ColStride;
     new_sum_ptr[i] = tmp_sum_ptr[sum_idx];
@@ -150,47 +150,47 @@ void __vec__ reducesum_col_final_kernel(
 
 template<typename dtype, int gIM, int gIN, int tM, int tN>
 void reducesum_colsum_rand(
-    dtype *in_ptr,  
+    dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //   
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; //
-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //     
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; // 
-    using tile_shapeTmpSum = Tile<Location::Vec, dtype, Mb, tN, BLayout::RowMajor>; // 
-//    using tile_shapeTmpSum_l2 = Tile<Location::Vec, dtype, tM/64, tN, BLayout::RowMajor>; //     
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor>; //
+    using tile_shapeTmpSum = Tile<Location::Vec, dtype, Mb, tN, BLayout::RowMajor>; //
+//    using tile_shapeTmpSum_l2 = Tile<Location::Vec, dtype, tM/64, tN, BLayout::RowMajor>; //
 
 
-//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-//    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+//    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-    gm_shapeOut outGm(out_ptr); 
+    gm_shapeIn inGm(in_ptr);
+    gm_shapeOut outGm(out_ptr);
 
     tile_shapeData dataTile;
-    tile_shapeData_col dataTile_col;    
+    tile_shapeData_col dataTile_col;
     tile_shapeSum SumTile;
     tile_shapeTmpSum oldtmpSumTile;
     tile_shapeTmpSum tmpSumTile;
 //    tile_shapeTmpSum_l2 tmpSumTile_l2;
 
 //    tile_shapeData_row dataTile_row;
-//    tile_shapeData_cor dataTile_cor;    
+//    tile_shapeData_cor dataTile_cor;
 //    tile_shapeSum_row SumTile_row;
-//    tile_shapeSum_row oldSumTile_row;    
+//    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// todo 生成一个标量
 //    int all_num = gOM; // 总元素数量
@@ -198,7 +198,7 @@ void reducesum_colsum_rand(
     using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
-    itIn  gIIter(in_ptr);  
+    itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
@@ -208,19 +208,19 @@ void reducesum_colsum_rand(
     auto gO = gOIter(0, 0);
     TEXPANDSCALAR(oldtmpSumTile, 0);//初始化为0
 //    TEXPANDSCALAR(tmpSumTile, 0);//初始化为0
-//    TEXPANDSCALAR(tmpSumTile_l2, 0);//初始化为0        
+//    TEXPANDSCALAR(tmpSumTile_l2, 0);// initialize to 0
     for (size_t i = 0; i < Mb; ++i){
         auto gI = gIIter(i, 0);
-        TCOPYIN(dataTile, gI);
-        reducesum_col_kernel<tile_shapeData, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidCol, 1, 1>>>(tmpSumTile.data(), 
+        TLOAD(dataTile, gI);
+        reducesum_col_kernel<tile_shapeData, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidCol, 1, 1>>>(tmpSumTile.data(),
                                                                                                      dataTile.data(),
-                                                                                                     oldtmpSumTile.data(), 
+                                                                                                     oldtmpSumTile.data(),
                                                                                                      i);
         oldtmpSumTile = tmpSumTile;
     }
-    reducesum_col_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeSum::ValidCol, 1, 1>>>(SumTile.data(), 
+    reducesum_col_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeSum::ValidCol, 1, 1>>>(SumTile.data(),
                                                                                                    tmpSumTile.data());
-    TCOPYOUT(gO, SumTile);
+    TSTORE(gO, SumTile);
 }
 
 
diff --git a/kernels/reduction/reducesum_colvec_unalign_120_8.hpp b/kernels/reduction/reducesum_colvec_unalign_120_8.hpp
index 45cb8f8..39ba724 100644
--- a/kernels/reduction/reducesum_colvec_unalign_120_8.hpp
+++ b/kernels/reduction/reducesum_colvec_unalign_120_8.hpp
@@ -18,18 +18,18 @@ using namespace pto;
 template<typename tileSrc, typename tileTmp>
 void __vec__ reducesum_col_tmp(
     typename tileTmp::TileDType __out__ tmp_sum,
-    const typename tileSrc::TileDType __in__ src    
+    const typename tileSrc::TileDType __in__ src
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileTmp::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-//    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+//    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
     typename tileTmp::DType upd_tmp_sum = 0;
-   
-    #pragma clang loop unroll(full) 
+
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::Rows;j+=8){//非valid处也参与计算补0，能凑出8元树形累加出来
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
@@ -38,64 +38,64 @@ void __vec__ reducesum_col_tmp(
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;        
-        typename tileTmp::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
+        typename tileTmp::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename tileTmp::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename tileTmp::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-        typename tileTmp::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
-        typename tileTmp::DType sum_0123 = sum_01 + sum_23; 
+        typename tileTmp::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename tileTmp::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
+        typename tileTmp::DType sum_0123 = sum_01 + sum_23;
         typename tileTmp::DType sum_4567 = sum_45 + sum_67;
-        typename tileTmp::DType sum_tmp = sum_0123 + sum_4567;         
-        upd_tmp_sum = upd_tmp_sum + sum_tmp;              
+        typename tileTmp::DType sum_tmp = sum_0123 + sum_4567;
+        upd_tmp_sum = upd_tmp_sum + sum_tmp;
     }
 
-    tmp_sum_ptr[i] = upd_tmp_sum;   
+    tmp_sum_ptr[i] = upd_tmp_sum;
 }
 
 template<typename tileTmp, typename tileSum>
 void __vec__ reducesum_col_final(
     typename tileSum::TileDType __out__ new_sum,
-    const typename tileTmp::TileDType __in__ src, 
-    const typename tileSum::TileDType __in__ old_sum   
+    const typename tileTmp::TileDType __in__ src,
+    const typename tileSum::TileDType __in__ old_sum
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileTmp::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 
     typename tileSum::DType upd_sum = old_sum_ptr[i];
-   
+
 
     size_t src_idx_0 =  i * tileSum::ColStride + 0 * tileSum::ValidCol;
     size_t src_idx_1 =  i * tileSum::ColStride + 1 * tileSum::ValidCol;
     size_t src_idx_2 =  i * tileSum::ColStride + 2 * tileSum::ValidCol;
-    size_t src_idx_3 =  i * tileSum::ColStride + 3 * tileSum::ValidCol;  
+    size_t src_idx_3 =  i * tileSum::ColStride + 3 * tileSum::ValidCol;
     size_t src_idx_4 =  i * tileSum::ColStride + 4 * tileSum::ValidCol;
     size_t src_idx_5 =  i * tileSum::ColStride + 5 * tileSum::ValidCol;
     size_t src_idx_6 =  i * tileSum::ColStride + 6 * tileSum::ValidCol;
-    size_t src_idx_7 =  i * tileSum::ColStride + 7 * tileSum::ValidCol;       
-    typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
-    typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];    
-    typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-    typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
-    typename tileSum::DType sum_0123 = sum_01 + sum_23; 
-    typename tileSum::DType sum_4567 = sum_45 + sum_67;      
-    typename tileSum::DType sum_all = sum_0123 + sum_4567; 
-              
-//        upd_sum = upd_sum + sum_tmp;              
+    size_t src_idx_7 =  i * tileSum::ColStride + 7 * tileSum::ValidCol;
+    typename tileSum::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
+    typename tileSum::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
+    typename tileSum::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+    typename tileSum::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
+    typename tileSum::DType sum_0123 = sum_01 + sum_23;
+    typename tileSum::DType sum_4567 = sum_45 + sum_67;
+    typename tileSum::DType sum_all = sum_0123 + sum_4567;
+
+//        upd_sum = upd_sum + sum_tmp;
 
 /*
     #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::ValidRow;j++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_sum = upd_sum + src_ptr[src_idx];              
+        upd_sum = upd_sum + src_ptr[src_idx];
     }
-*/        
-//    new_sum_ptr[i] = upd_sum; 
-    new_sum_ptr[i] = sum_all + upd_sum;   
+*/
+//    new_sum_ptr[i] = upd_sum;
+    new_sum_ptr[i] = sum_all + upd_sum;
 }
 
 
@@ -105,70 +105,70 @@ void __vec__ reducesum_col_final(
 template<typename dtype, int gIM, int gIN, int tM, int tN, int tM_VLD>
 void reducesum_colsum_rand(
     dtype *in_ptr,
-//    dtype *inzero_ptr,    
+//    dtype *inzero_ptr,
     dtype *out_ptr
-) 
+)
 {
 
-//    const int Mb = (gIM/8) / tM;  
+//    const int Mb = (gIM/8) / tM;
 
     const int rmd_M = gIM % tM;
     const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // todo 尾块怎么处理？
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM/8, gIN*8>>;     // 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM/8, gIN*8>>;     //
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM/8, tN*8, BLayout::RowMajor, tM_VLD/8, tN*8>; //
 //    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor,rmd_M, tN>; //
-    using tile_shapeTmp = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor>; //      
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor, 1, tN>; // 
+    using tile_shapeTmp = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN*8, BLayout::RowMajor, 1, tN>; //
 
 
 
-//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // 
-//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //     
-//    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; // 
+//    using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; //
+//    using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; //
+//    using tile_shapeSum_row = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, rmd_N>; //
     //need tM = 1;
 
 
-    gm_shapeIn inGm(in_ptr);   
-//    gm_shapeOut ZeroGm(inzero_ptr); 
+    gm_shapeIn inGm(in_ptr);
+//    gm_shapeOut ZeroGm(inzero_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
     tile_shapeData dataTile;
 //    tile_shapeData_col dataTile_col;
-    tile_shapeTmp TmpTile;        
+    tile_shapeTmp TmpTile;
     tile_shapeSum SumTile;
     tile_shapeSum oldSumTile;
 
 //    tile_shapeData_row dataTile_row;
-//    tile_shapeData_cor dataTile_cor;    
+//    tile_shapeData_cor dataTile_cor;
 //    tile_shapeSum_row SumTile_row;
-//    tile_shapeSum_row oldSumTile_row;    
+//    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// TODO: generate a scalar
 //    int all_num = gOM; // total number of elements
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;      
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itIn_row = global_iterator<gm_shapeIn, tile_shapeSum>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
     itIn_row  gIIter_rmd_row(in_ptr);
-//    itZero  gZeroIter(inzero_ptr);    
+//    itZero  gZeroIter(inzero_ptr);
     itOut gOIter(out_ptr);
 
 
     auto gO = gOIter(0, 0);
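     TEXPANDSCALAR(oldSumTile, 0);// initialize to 0
     auto gI = gIIter(0, 0);
-    TCOPYIN(dataTile, gI);// TLOAD should pad with 0; gfrun currently pads with 0 by default, but an interface is needed to handle this
+    TLOAD(dataTile, gI);// TLOAD should pad with 0; gfrun currently pads with 0 by default, but an interface is needed to handle this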
     reducesum_col_tmp<tile_shapeData, tile_shapeTmp><<<tile_shapeTmp::ValidCol, tile_shapeTmp::ValidRow, 1>>>(TmpTile.data(), dataTile.data());
     reducesum_col_final<tile_shapeTmp, tile_shapeSum><<<tile_shapeSum::ValidCol, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), TmpTile.data(), oldSumTile.data());
     oldSumTile = SumTile;
-    TCOPYOUT(gO, SumTile);
+    TSTORE(gO, SumTile);
 }
 
 #endif
diff --git a/kernels/reduction/reducesum_colvec_unalign_tree.hpp b/kernels/reduction/reducesum_colvec_unalign_tree.hpp
index 5e603d6..d2c13d9 100644
--- a/kernels/reduction/reducesum_colvec_unalign_tree.hpp
+++ b/kernels/reduction/reducesum_colvec_unalign_tree.hpp
@@ -21,38 +21,38 @@ void __vec__ reducesum_col_kernel(
     const typename tileSrc::TileDType __in__ src
 )
 {
-    size_t i = blkv_get_index_x();  
+    size_t i = blkv_get_index_x();
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-//    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);    
+//    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 /*
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpSum::ValidRow;j++){
-        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;       
-        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];          
+        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];
     }
 */
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileSrc::Rows;j+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0) * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2) * tileSrc::RowStride;
-        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;        
+        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3) * tileSrc::RowStride;
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6) * tileSrc::RowStride;
         size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7) * tileSrc::RowStride;
-        typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];    
+        typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
         typename  tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
-        typename  tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];    
-        typename  tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];        
+        typename  tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename  tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
         typename  tileSrc::DType sum_0123 = sum_01 + sum_23;
         typename  tileSrc::DType sum_4567 = sum_45 + sum_67;
-        typename  tileSrc::DType sum_all = sum_0123 + sum_4567;   
-        src_ptr[src_idx_0] = sum_all;          
+        typename  tileSrc::DType sum_all = sum_0123 + sum_4567;
+        src_ptr[src_idx_0] = sum_all;
     }
 
     #pragma clang loop unroll(full)
@@ -60,17 +60,17 @@ void __vec__ reducesum_col_kernel(
         size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0*8) * tileSrc::RowStride;
         size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1*8) * tileSrc::RowStride;
         size_t src_idx_2 =  i * tileSrc::ColStride + (j + 2*8) * tileSrc::RowStride;
-        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;        
+        size_t src_idx_3 =  i * tileSrc::ColStride + (j + 3*8) * tileSrc::RowStride;
         size_t src_idx_4 =  i * tileSrc::ColStride + (j + 4*8) * tileSrc::RowStride;
         size_t src_idx_5 =  i * tileSrc::ColStride + (j + 5*8) * tileSrc::RowStride;
         size_t src_idx_6 =  i * tileSrc::ColStride + (j + 6*8) * tileSrc::RowStride;
-        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;  
+        size_t src_idx_7 =  i * tileSrc::ColStride + (j + 7*8) * tileSrc::RowStride;
         typename tileSrc::DType tmp_sum_01 = src_ptr[src_idx_0]+ src_ptr[src_idx_1];
-        typename tileSrc::DType tmp_sum_23 = src_ptr[src_idx_2]+ src_ptr[src_idx_3]; 
-        typename tileSrc::DType tmp_sum_45 = src_ptr[src_idx_4]+ src_ptr[src_idx_5]; 
-        typename tileSrc::DType tmp_sum_67 = src_ptr[src_idx_6]+ src_ptr[src_idx_7];  
-        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23; 
-        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67; 
+        typename tileSrc::DType tmp_sum_23 = src_ptr[src_idx_2]+ src_ptr[src_idx_3];
+        typename tileSrc::DType tmp_sum_45 = src_ptr[src_idx_4]+ src_ptr[src_idx_5];
+        typename tileSrc::DType tmp_sum_67 = src_ptr[src_idx_6]+ src_ptr[src_idx_7];
+        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23;
+        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67;
         typename tileSrc::DType tmp_sum_all = tmp_sum_0123 + tmp_sum_4567;
         src_ptr[src_idx_0] = tmp_sum_all;
     };
@@ -78,18 +78,18 @@ void __vec__ reducesum_col_kernel(
 
     size_t stride = 64;
     size_t iternum = __builtin_ctz(tileSrc::Rows) - 6;
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t j=0;j<tileSrc::Rows;j+=(stride*2)){ //not valid rows
             size_t src_idx_0 =  i * tileSrc::ColStride + (j + 0*stride) * tileSrc::RowStride;
             size_t src_idx_1 =  i * tileSrc::ColStride + (j + 1*stride) * tileSrc::RowStride;
-            typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];           
-            src_ptr[src_idx_0] = sum_01;          
+            typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
+            src_ptr[src_idx_0] = sum_01;
         }
         stride = stride * 2;
     }
-        
+
     size_t src_sum_idx = i * tileSrc::ColStride;
 //    size_t  sum_tile_idx = i * tileTmpSum::ColStride + tile_idx * tileTmpSum::RowStride;
     new_sum_ptr[i] = src_ptr[src_sum_idx];
@@ -105,39 +105,39 @@ void __vec__ reducesum_col_final_kernel(
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileTmpSum::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t j=0;j<tileTmpSum::ValidRow;j+=8){
         size_t src_idx_0 =  i * tileTmpSum::ColStride + (j + 0) * tileTmpSum::RowStride;
         size_t src_idx_1 =  i * tileTmpSum::ColStride + (j + 1) * tileTmpSum::RowStride;
         size_t src_idx_2 =  i * tileTmpSum::ColStride + (j + 2) * tileTmpSum::RowStride;
-        size_t src_idx_3 =  i * tileTmpSum::ColStride + (j + 3) * tileTmpSum::RowStride;        
+        size_t src_idx_3 =  i * tileTmpSum::ColStride + (j + 3) * tileTmpSum::RowStride;
         size_t src_idx_4 =  i * tileTmpSum::ColStride + (j + 4) * tileTmpSum::RowStride;
         size_t src_idx_5 =  i * tileTmpSum::ColStride + (j + 5) * tileTmpSum::RowStride;
         size_t src_idx_6 =  i * tileTmpSum::ColStride + (j + 6) * tileTmpSum::RowStride;
-        size_t src_idx_7 =  i * tileTmpSum::ColStride + (j + 7) * tileTmpSum::RowStride;        
-        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];    
+        size_t src_idx_7 =  i * tileTmpSum::ColStride + (j + 7) * tileTmpSum::RowStride;
+        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
         typename  tileTmpSum::DType sum_23 = tmp_sum_ptr[src_idx_2] + tmp_sum_ptr[src_idx_3];
-        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];    
-        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];        
-        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23; 
+        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];
+        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];
+        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23;
         typename  tileTmpSum::DType sum_4567 = sum_45 + sum_67;
-        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;   
-        tmp_sum_ptr[src_idx_0] = sum_all;          
-    }   
+        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;
+        tmp_sum_ptr[src_idx_0] = sum_all;
+    }
 
     size_t stride = 8;
-    size_t iternum = __builtin_ctz(tileTmpSum::ValidRow) - 3;    
-    #pragma clang loop unroll(full) 
+    size_t iternum = __builtin_ctz(tileTmpSum::ValidRow) - 3;
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t j=0;j<tileTmpSum::ValidRow;j+=(stride*2)){
             size_t src_idx_0 =  i * tileTmpSum::ColStride + (j + 0*stride) * tileTmpSum::RowStride;
             size_t src_idx_1 =  i * tileTmpSum::ColStride + (j + 1*stride) * tileTmpSum::RowStride;
-            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];           
-            tmp_sum_ptr[src_idx_0] = sum_01;          
+            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
+            tmp_sum_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
-    }    
+    }
 
     size_t sum_idx = i * tileTmpSum::ColStride;
     new_sum_ptr[i] = tmp_sum_ptr[sum_idx];
@@ -146,38 +146,38 @@ void __vec__ reducesum_col_final_kernel(
 
 template<typename dtype, int gIM, int gIN, int tM, int tN, int tM_VLD>
 void reducesum_colsum_rand(
-    dtype *in_ptr,  
+    dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
 //    const int rmd_M = gIM % tM;
 //    const int rmd_N = gIN % tN;
 //    const int rmd_M = gOM % tM; // TODO: how should the tail block be handled?
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //   
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     //
     using gm_shapeOut = global_tensor<dtype, RowMajor<1, gIN>>;
-    using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM_VLD, gIN>; //   
-    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, gIN>; // 
+    using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM_VLD, gIN>; //
+    using tile_shapeSum = Tile<Location::Vec, dtype, 1, tN, BLayout::RowMajor, 1, gIN>; //
 
 
-    gm_shapeIn inGm(in_ptr);   
-    gm_shapeOut outGm(out_ptr); 
+    gm_shapeIn inGm(in_ptr);
+    gm_shapeOut outGm(out_ptr);
 
     tile_shapeData dataTile;
-//    tile_shapeData_col dataTile_col;    
+//    tile_shapeData_col dataTile_col;
     tile_shapeSum SumTile;
 //    tile_shapeTmpSum oldtmpSumTile;
 //    tile_shapeTmpSum tmpSumTile;
 //    tile_shapeTmpSum_l2 tmpSumTile_l2;
 
 //    tile_shapeData_row dataTile_row;
-//    tile_shapeData_cor dataTile_cor;    
+//    tile_shapeData_cor dataTile_cor;
 //    tile_shapeSum_row SumTile_row;
-//    tile_shapeSum_row oldSumTile_row;    
+//    tile_shapeSum_row oldSumTile_row;
 
 //    int base = 0;// TODO: generate a scalar
 //    int all_num = gOM; // total number of elements
@@ -185,21 +185,21 @@ void reducesum_colsum_rand(
     using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
-    itIn  gIIter(in_ptr);  
+    itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    dtype zero = 0;
 
 //    for (int j = 0; j < Nb; ++j) {
 //        auto gZero = gZeroIter(0, j);
-    auto gO = gOIter(0, 0);     
+    auto gO = gOIter(0, 0);
 //    for (size_t i = 0; i < Mb; ++i){
     auto gI = gIIter(0, 0);
     TLOAD_PAD_ZERO(dataTile, gI);
-    reducesum_col_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeSum::ValidCol, 1, 1>>>(SumTile.data(), 
+    reducesum_col_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeSum::ValidCol, 1, 1>>>(SumTile.data(),
                                                                                            dataTile.data()
                                                                                            );
-    TCOPYOUT(gO, SumTile);
+    TSTORE(gO, SumTile);
 }
 
 
diff --git a/kernels/reduction/reducesum_rowvec.hpp b/kernels/reduction/reducesum_rowvec.hpp
index be8b749..ea6abb4 100644
--- a/kernels/reduction/reducesum_rowvec.hpp
+++ b/kernels/reduction/reducesum_rowvec.hpp
@@ -17,17 +17,17 @@ template<typename tileSrc, typename tileSum>
 void __vec__ reducesum_row_kernel(
     typename tileSum::TileDType __out__ new_sum,
     const typename tileSrc::TileDType __in__ src,
-    const typename tileSum::TileDType __in__ old_sum    
+    const typename tileSum::TileDType __in__ old_sum
 )
 {
-//    size_t i = blkv_get_index_x();  
-    size_t j = blkv_get_index_x();  
+//    size_t i = blkv_get_index_x();
+    size_t j = blkv_get_index_x();
 //    size_t j = blkv_get_index_y();
-    size_t idx = j * tileSum::RowStride;    
+    size_t idx = j * tileSum::RowStride;
 
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 
     typename tileSum::DType upd_sum = old_sum_ptr[idx];
@@ -37,32 +37,32 @@ void __vec__ reducesum_row_kernel(
         size_t src_idx0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx1 =  (i+1) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx2 =  (i+2) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx3 =  (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;        
+        size_t src_idx3 =  (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx4 =  (i+4) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx5 =  (i+5) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx6 =  (i+6) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx7 =  (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; 
+        size_t src_idx7 =  (i+7) * tileSrc::ColStride + j * tileSrc::RowStride;
 
         typename tileSum::DType sum_01 = src_ptr[src_idx0] + src_ptr[src_idx1];
-        typename tileSum::DType sum_23 = src_ptr[src_idx2] + src_ptr[src_idx3];   
-        typename tileSum::DType sum_45 = src_ptr[src_idx4] + src_ptr[src_idx5];  
-        typename tileSum::DType sum_67 = src_ptr[src_idx6] + src_ptr[src_idx7];    
+        typename tileSum::DType sum_23 = src_ptr[src_idx2] + src_ptr[src_idx3];
+        typename tileSum::DType sum_45 = src_ptr[src_idx4] + src_ptr[src_idx5];
+        typename tileSum::DType sum_67 = src_ptr[src_idx6] + src_ptr[src_idx7];
 
         typename tileSum::DType sum_0123 = sum_01 + sum_23;
-        typename tileSum::DType sum_4567 = sum_45 + sum_67;        
+        typename tileSum::DType sum_4567 = sum_45 + sum_67;
 
         typename tileSum::DType sum_tmp = sum_0123 + sum_4567;
 
-        upd_sum = upd_sum + sum_tmp;              
-    }        
+        upd_sum = upd_sum + sum_tmp;
+    }
 
-/*    
+/*
     for(size_t i=0;i<tileSrc::ValidCol;i++){
         size_t src_idx =  i * tileSrc::ColStride + j * tileSrc::RowStride;
-        upd_sum = upd_sum + src_ptr[src_idx];              
+        upd_sum = upd_sum + src_ptr[src_idx];
     }
-*/        
-    new_sum_ptr[idx] = upd_sum;  
+*/
+    new_sum_ptr[idx] = upd_sum;
 
 }
 
@@ -72,63 +72,63 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void reducesum_trowsum_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // TODO: how should the tail block be handled?
-    const int rmd_N = gIN % tN; // TODO: how should the tail block be handled?    
+    const int rmd_N = gIN % tN; // TODO: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, 1>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // TODO: how should the tail block be handled? Should it be specified here as a parameter?
     using tile_shapeData_row = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, tM, rmd_N>; // TODO: how should the tail block be handled? Should it be specified here as a parameter?
     using tile_shapeSum = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // TODO: must the location here be Vec, even when no Vec is passed in?
 
 
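-    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // TODO: how should the tail block be handled? Should it be specified here as a parameter?   
+    using tile_shapeData_col = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, tN>; // TODO: how should the tail block be handled? Should it be specified here as a parameter?
     using tile_shapeData_cor = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor, rmd_M, rmd_N>; // TODO: how should the tail block be handled? Should it be specified here as a parameter?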
-    using tile_shapeSum_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;      
+    using tile_shapeSum_col =  Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, rmd_M, 1>;
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
-    tile_shapeData dataTile;                
+    tile_shapeData dataTile;
     tile_shapeData_row dataTile_row;
     tile_shapeData_col dataTile_col;
-    tile_shapeData_cor dataTile_cor;    
-    
+    tile_shapeData_cor dataTile_cor;
+
     tile_shapeSum SumTile;
     tile_shapeSum oldSumTile;
     tile_shapeSum_col SumTile_col;
-    tile_shapeSum_col oldSumTile_col;    
+    tile_shapeSum_col oldSumTile_col;
 
 //    int base = 0;// TODO: generate a scalar
 //    int all_num = gOM; // total number of elements
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
     itOut gOIter(out_ptr);
 
 //    printf("tile_shapeSum::ValidCol = %d\n",  tile_shapeSum::ValidCol);
-//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);    
+//    printf("tile_shapeSum::ValidRow = %d\n",  tile_shapeSum::ValidRow);
 //    printf("before for\n");
     for (int j = 0; j < Mb; ++j) {
         auto gO = gOIter(j, 0);
         TEXPANDSCALAR(oldSumTile, 0);// initialize to 0
-        // initialize the old_sum tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(j, i);   
-//            printf("before copy in , %d\n", i);                
-            TCOPYIN(dataTile, gI);    
+            auto gI = gIIter(j, i);
+//            printf("before copy in , %d\n", i);
+            TLOAD(dataTile, gI);
             reducesum_row_kernel<tile_shapeData, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
 //            reducesum_row_kernel<tile_shapeData, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
 //            printf("kernel , %d\n", i);
@@ -138,40 +138,40 @@ void reducesum_trowsum_rand(
         //for row corner
         if constexpr (rmd_N > 0){
             auto gI = gIIter(j, Nb);
-            TCOPYIN(dataTile_row, gI);
-            reducesum_row_kernel<tile_shapeData_row, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile_row.data(), oldSumTile.data());            
+            TLOAD(dataTile_row, gI);
+            reducesum_row_kernel<tile_shapeData_row, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile_row.data(), oldSumTile.data());
 //            reducesum_row_kernel<tile_shapeData_row, tile_shapeSum><<<tile_shapeSum::ValidRow, 1, 1>>>(SumTile.data(), dataTile_row.data(), oldSumTile.data());
             oldSumTile = SumTile;
         }
-//        printf("before tcopyout\n");        
-        TCOPYOUT(gO, SumTile);
-//        printf("end tcopyout\n"); 
+//        printf("before tstore\n");
+        TSTORE(gO, SumTile);
+//        printf("end tstore\n");
     }
     //for col cor
     if constexpr (rmd_M > 0){
         auto gO = gOIter(Mb, 0);
         TEXPANDSCALAR(oldSumTile_col, 0);// initialize to 0
-        // initialize the old_sum tile      
+        // initialize the old_sum tile
         for (int i = 0; i < Nb; ++i) {
-            auto gI = gIIter(Mb, i);   
-            TCOPYIN(dataTile_col, gI);                  
+            auto gI = gIIter(Mb, i);
+            TLOAD(dataTile_col, gI);
             reducesum_row_kernel<tile_shapeData_col, tile_shapeSum_col><<<tile_shapeSum_col::ValidRow, 1, 1>>>(SumTile_col.data(), dataTile_col.data(), oldSumTile_col.data());
             oldSumTile_col = SumTile_col;
         }
         if constexpr (rmd_N > 0){
             auto gI = gIIter(Mb, Nb);
-            TCOPYIN(dataTile_cor, gI);             
+            TLOAD(dataTile_cor, gI);
             reducesum_row_kernel<tile_shapeData_cor, tile_shapeSum_col><<<tile_shapeSum_col::ValidRow, 1, 1>>>(SumTile_col.data(), dataTile_cor.data(), oldSumTile_col.data());
             oldSumTile_col = SumTile_col;
         }
-        TCOPYOUT(gO, SumTile_col);
+        TSTORE(gO, SumTile_col);
     }
 /*
     for(int i = 0; i < gIM; i++){
         printf("out%d = %d\n", i, out_ptr[i]);
     }
 */
-//    printf("end program\n"); 
+//    printf("end program\n");
 }
 
 #endif
diff --git a/kernels/reduction/reducesum_rowvec_single_tree.hpp b/kernels/reduction/reducesum_rowvec_single_tree.hpp
index 8b8cabf..ac2738b 100644
--- a/kernels/reduction/reducesum_rowvec_single_tree.hpp
+++ b/kernels/reduction/reducesum_rowvec_single_tree.hpp
@@ -18,38 +18,38 @@ void __vec__ reducesum_row_kernel(
     typename tileTmpSum::TileDType __out__ new_sum,
     const typename tileSrc::TileDType __in__ src,
     const typename tileTmpSum::TileDType __in__ old_sum,
-    const size_t tile_idx    
+    const size_t tile_idx
 )
 {
 
-    size_t j = blkv_get_index_x();  
-  
+    size_t j = blkv_get_index_x();
+
     __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
     #pragma clang loop unroll(full)
     for(int i=0;i<tileSrc::ValidCol;i+=8){
         size_t src_idx_0 =  i * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  (i+1) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_2 =  (i+2) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx_3 =  (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;        
+        size_t src_idx_3 =  (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_4 =  (i+4) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_5 =  (i+5) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_6 =  (i+6) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx_7 =  (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; 
+        size_t src_idx_7 =  (i+7) * tileSrc::ColStride + j * tileSrc::RowStride;
 
         typename tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
-        typename tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];   
-        typename tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];  
-        typename tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];    
+        typename tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
+        typename tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
 
         typename tileSrc::DType sum_0123 = sum_01 + sum_23;
-        typename tileSrc::DType sum_4567 = sum_45 + sum_67;        
+        typename tileSrc::DType sum_4567 = sum_45 + sum_67;
 
         typename tileSrc::DType sum_all = sum_0123 + sum_4567;
-        src_ptr[src_idx_0] = sum_all;         
-    }        
+        src_ptr[src_idx_0] = sum_all;
+    }
 
 
     #pragma clang loop unroll(full)
@@ -57,17 +57,17 @@ void __vec__ reducesum_row_kernel(
         size_t tmp_idx_0 =  (i+0*8) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t tmp_idx_1 =  (i+1*8) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t tmp_idx_2 =  (i+2*8) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t tmp_idx_3 =  (i+3*8) * tileSrc::ColStride + j * tileSrc::RowStride;        
+        size_t tmp_idx_3 =  (i+3*8) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t tmp_idx_4 =  (i+4*8) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t tmp_idx_5 =  (i+5*8) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t tmp_idx_6 =  (i+6*8) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t tmp_idx_7 =  (i+7*8) * tileSrc::ColStride + j * tileSrc::RowStride;  
+        size_t tmp_idx_7 =  (i+7*8) * tileSrc::ColStride + j * tileSrc::RowStride;
         typename tileSrc::DType tmp_sum_01 = src_ptr[tmp_idx_0]+ src_ptr[tmp_idx_1];
-        typename tileSrc::DType tmp_sum_23 = src_ptr[tmp_idx_2]+ src_ptr[tmp_idx_3]; 
-        typename tileSrc::DType tmp_sum_45 = src_ptr[tmp_idx_4]+ src_ptr[tmp_idx_5]; 
-        typename tileSrc::DType tmp_sum_67 = src_ptr[tmp_idx_6]+ src_ptr[tmp_idx_7];  
-        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23; 
-        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67; 
+        typename tileSrc::DType tmp_sum_23 = src_ptr[tmp_idx_2]+ src_ptr[tmp_idx_3];
+        typename tileSrc::DType tmp_sum_45 = src_ptr[tmp_idx_4]+ src_ptr[tmp_idx_5];
+        typename tileSrc::DType tmp_sum_67 = src_ptr[tmp_idx_6]+ src_ptr[tmp_idx_7];
+        typename tileSrc::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23;
+        typename tileSrc::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67;
         typename tileSrc::DType tmp_sum_all = tmp_sum_0123 + tmp_sum_4567;
         src_ptr[tmp_idx_0] = tmp_sum_all;
     }
@@ -75,29 +75,29 @@ void __vec__ reducesum_row_kernel(
     size_t stride = 64;
     size_t iternum = __builtin_ctz(tileSrc::ValidCol) - 6;
 
-//    #pragma clang loop unroll(full) 
+//    #pragma clang loop unroll(full)
     for(size_t k=0; k<iternum; k++){
-        //#pragma clang loop unroll(full) 
+        //#pragma clang loop unroll(full)
         for(size_t i=0; i<tileSrc::ValidCol; i+=(stride*2)){
             size_t src_idx_0 =  (i + 0*stride) * tileSrc::ColStride + j * tileSrc::RowStride;
             size_t src_idx_1 =  (i + 1*stride) * tileSrc::ColStride + j * tileSrc::RowStride;
-            typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];           
-            src_ptr[src_idx_0] = sum_01;          
+            typename  tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
+            src_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
     }
 
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpSum::ValidCol;i++){
-        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;       
-        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];          
-    }    
+        size_t old_sum_idx =  i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];
+    }
 
 
     size_t src_sum_idx = j * tileSrc::RowStride;
     size_t  sum_tile_idx = tile_idx * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-    new_sum_ptr[sum_tile_idx] = src_ptr[src_sum_idx];  
+    new_sum_ptr[sum_tile_idx] = src_ptr[src_sum_idx];
 }
 
 
@@ -112,39 +112,39 @@ void __vec__ reducesum_row_final_kernel(
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileTmpSum::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpSum::ValidCol;i+=8){
         size_t src_idx_0 =  (i+0) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_1 =  (i+1) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_2 =  (i+2) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-        size_t src_idx_3 =  (i+3) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;        
+        size_t src_idx_3 =  (i+3) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_4 =  (i+4) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_5 =  (i+5) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_6 =  (i+6) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-        size_t src_idx_7 =  (i+7) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;        
-        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];    
+        size_t src_idx_7 =  (i+7) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
         typename  tileTmpSum::DType sum_23 = tmp_sum_ptr[src_idx_2] + tmp_sum_ptr[src_idx_3];
-        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];    
-        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];        
-        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23; 
+        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];
+        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];
+        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23;
         typename  tileTmpSum::DType sum_4567 = sum_45 + sum_67;
-        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;   
-        tmp_sum_ptr[src_idx_0] = sum_all;          
-    }   
+        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;
+        tmp_sum_ptr[src_idx_0] = sum_all;
+    }
 
     size_t stride = 8;
-    size_t iternum = __builtin_ctz(tileTmpSum::ValidCol) - 3;    
-    #pragma clang loop unroll(full) 
+    size_t iternum = __builtin_ctz(tileTmpSum::ValidCol) - 3;
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t i=0;i<tileTmpSum::ValidCol;i+=(stride*2)){
             size_t src_idx_0 =  (i + 0*stride) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
             size_t src_idx_1 =  (i + 1*stride) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];           
-            tmp_sum_ptr[src_idx_0] = sum_01;          
+            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
+            tmp_sum_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
-    }    
+    }
 
     size_t sum_idx = j * tileTmpSum::RowStride;
     new_sum_ptr[idx] = tmp_sum_ptr[sum_idx];
@@ -155,29 +155,29 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void reducesum_trowsum_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // TODO: how should the tail block be handled?
-    const int rmd_N = gIN % tN; // TODO: how should the tail block be handled?    
+    const int rmd_N = gIN % tN; // TODO: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, 1>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // TODO: how should the tail block be handled? Should it be passed here as a parameter?
     using tile_shapeSum = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // TODO: must the location here be Vec, even when Vec is not passed in?
     using tile_shapeTmpSum = Tile<Location::Vec, dtype, tM, Nb>; // TODO: must the location here be Vec, even when Vec is not passed in?
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
-    tile_shapeData dataTile;                      
+    tile_shapeData dataTile;
     tile_shapeSum SumTile;
     tile_shapeTmpSum oldtmpSumTile;
     tile_shapeTmpSum tmpSumTile;
@@ -185,7 +185,7 @@ void reducesum_trowsum_rand(
 //    int base = 0;// TODO: generate a scalar
 //    int all_num = gOM; // total number of elements
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
@@ -194,22 +194,22 @@ void reducesum_trowsum_rand(
 
 //    for (int j = 0; j < Mb; ++j) {
     auto gO = gOIter(0, 0);
-    TEXPANDSCALAR(oldtmpSumTile, 0);// initialize to 0      
+    TEXPANDSCALAR(oldtmpSumTile, 0);// initialize to 0
     for (int i = 0; i < Nb; ++i) {
-        auto gI = gIIter(0, i);   
-//       printf("before copy in , %d\n", i);                
-        TCOPYIN(dataTile, gI);    
-        reducesum_row_kernel<tile_shapeData, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidRow, 1, 1>>>(tmpSumTile.data(), 
-                                                                                                     dataTile.data(), 
+        auto gI = gIIter(0, i);
+//       printf("before copy in , %d\n", i);
+        TLOAD(dataTile, gI);
+        reducesum_row_kernel<tile_shapeData, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidRow, 1, 1>>>(tmpSumTile.data(),
+                                                                                                     dataTile.data(),
                                                                                                      oldtmpSumTile.data(),
                                                                                                      i);
 //      reducesum_row_kernel<tile_shapeData, tile_shapeSum><<<1, tile_shapeSum::ValidRow, 1>>>(SumTile.data(), dataTile.data(), oldSumTile.data());
 //      printf("kernel , %d\n", i);
         oldtmpSumTile = tmpSumTile;
     }
-    reducesum_row_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeTmpSum::ValidRow, 1, 1>>>(SumTile.data(), 
-                                                                                                      tmpSumTile.data());     
-    TCOPYOUT(gO, SumTile);
+    reducesum_row_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeTmpSum::ValidRow, 1, 1>>>(SumTile.data(),
+                                                                                                      tmpSumTile.data());
+    TSTORE(gO, SumTile);
 }
 //}
 
diff --git a/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp b/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp
index ddfcca7..dd76974 100644
--- a/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp
+++ b/kernels/reduction/reducesum_rowvec_single_tree_opt_2.hpp
@@ -19,26 +19,26 @@ void __vec__ reducesum_row_kernel(
     const typename tileSrc::TileDType __in__ src,
     const typename tileSrcCol::TileDType __in__ src_col,
     const typename tileTmpSum::TileDType __in__ old_sum,
-    const size_t tile_idx    
+    const size_t tile_idx
 )
 {
 
-    size_t j = blkv_get_index_x();  
-    size_t z = blkv_get_index_y();     
+    size_t j = blkv_get_index_x();
+    size_t z = blkv_get_index_y();
     size_t stride_src = z * (tileSrc::ValidCol/4) * tileSrc::ColStride;
-    size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride;    
-  
+    size_t stride_src_col = z * (tileSrcCol::ValidCol/4) * tileSrcCol::ColStride;
+
     __vbuf__ typename tileTmpSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileSrc::DType *src_ptr = blkv_get_tile_ptr(src);
-    __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col);    
-    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);   
+    __vbuf__ typename tileSrc::DType *src_col_ptr = blkv_get_tile_ptr(src_col);
+    __vbuf__ typename tileTmpSum::DType *old_sum_ptr = blkv_get_tile_ptr(old_sum);
 
 /*
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpSum::ValidCol/4;i++){
-        size_t old_sum_idx =  z * tileTmpSum::ValidCol/4 * tileTmpSum::ColStride + i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;       
-        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];          
-    }    
+        size_t old_sum_idx =  z * tileTmpSum::ValidCol/4 * tileTmpSum::ColStride + i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];
+    }
 */
 
     #pragma clang loop unroll(full)
@@ -46,25 +46,25 @@ void __vec__ reducesum_row_kernel(
         size_t src_idx_0 =  stride_src + (i+0) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_1 =  stride_src + (i+1) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_2 =  stride_src + (i+2) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx_3 =  stride_src + (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;        
+        size_t src_idx_3 =  stride_src + (i+3) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_4 =  stride_src + (i+4) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_5 =  stride_src + (i+5) * tileSrc::ColStride + j * tileSrc::RowStride;
         size_t src_idx_6 =  stride_src + (i+6) * tileSrc::ColStride + j * tileSrc::RowStride;
-        size_t src_idx_7 =  stride_src + (i+7) * tileSrc::ColStride + j * tileSrc::RowStride; 
+        size_t src_idx_7 =  stride_src + (i+7) * tileSrc::ColStride + j * tileSrc::RowStride;
 
         typename tileSrc::DType sum_01 = src_ptr[src_idx_0] + src_ptr[src_idx_1];
-        typename tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];   
-        typename tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];  
-        typename tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];    
+        typename tileSrc::DType sum_23 = src_ptr[src_idx_2] + src_ptr[src_idx_3];
+        typename tileSrc::DType sum_45 = src_ptr[src_idx_4] + src_ptr[src_idx_5];
+        typename tileSrc::DType sum_67 = src_ptr[src_idx_6] + src_ptr[src_idx_7];
 
         typename tileSrc::DType sum_0123 = sum_01 + sum_23;
-        typename tileSrc::DType sum_4567 = sum_45 + sum_67;        
+        typename tileSrc::DType sum_4567 = sum_45 + sum_67;
 
         typename tileSrc::DType sum_all = sum_0123 + sum_4567;
 
         size_t src_col_idx_0 = stride_src_col + (i/8) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-        src_col_ptr[src_col_idx_0] = sum_all;         
-    }        
+        src_col_ptr[src_col_idx_0] = sum_all;
+    }
 
 
     #pragma clang loop unroll(full)
@@ -72,17 +72,17 @@ void __vec__ reducesum_row_kernel(
         size_t tmp_idx_0 =  stride_src_col + (i+0) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_1 =  stride_src_col + (i+1) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_2 =  stride_src_col + (i+2) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-        size_t tmp_idx_3 =  stride_src_col + (i+3) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;        
+        size_t tmp_idx_3 =  stride_src_col + (i+3) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_4 =  stride_src_col + (i+4) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_5 =  stride_src_col + (i+5) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         size_t tmp_idx_6 =  stride_src_col + (i+6) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-        size_t tmp_idx_7 =  stride_src_col + (i+7) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;  
+        size_t tmp_idx_7 =  stride_src_col + (i+7) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
         typename tileSrcCol::DType tmp_sum_01 = src_col_ptr[tmp_idx_0]+ src_col_ptr[tmp_idx_1];
-        typename tileSrcCol::DType tmp_sum_23 = src_col_ptr[tmp_idx_2]+ src_col_ptr[tmp_idx_3]; 
-        typename tileSrcCol::DType tmp_sum_45 = src_col_ptr[tmp_idx_4]+ src_col_ptr[tmp_idx_5]; 
-        typename tileSrcCol::DType tmp_sum_67 = src_col_ptr[tmp_idx_6]+ src_col_ptr[tmp_idx_7];  
-        typename tileSrcCol::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23; 
-        typename tileSrcCol::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67; 
+        typename tileSrcCol::DType tmp_sum_23 = src_col_ptr[tmp_idx_2]+ src_col_ptr[tmp_idx_3];
+        typename tileSrcCol::DType tmp_sum_45 = src_col_ptr[tmp_idx_4]+ src_col_ptr[tmp_idx_5];
+        typename tileSrcCol::DType tmp_sum_67 = src_col_ptr[tmp_idx_6]+ src_col_ptr[tmp_idx_7];
+        typename tileSrcCol::DType tmp_sum_0123 = tmp_sum_01 + tmp_sum_23;
+        typename tileSrcCol::DType tmp_sum_4567 = tmp_sum_45 + tmp_sum_67;
         typename tileSrcCol::DType tmp_sum_all = tmp_sum_0123 + tmp_sum_4567;
         src_col_ptr[tmp_idx_0] = tmp_sum_all;
     }
@@ -92,29 +92,29 @@ void __vec__ reducesum_row_kernel(
     size_t stride = 8;
     size_t iternum = __builtin_ctz(tileSrcCol::ValidCol/4) - 3;
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t k=0; k<iternum; k++){
-        //#pragma clang loop unroll(full) 
+        //#pragma clang loop unroll(full)
         for(size_t i=0; i<tileSrcCol::ValidCol/4; i+=(stride*2)){
             size_t src_idx_0 =  stride_src_col + (i + 0*stride) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
             size_t src_idx_1 =  stride_src_col + (i + 1*stride) * tileSrcCol::ColStride + j * tileSrcCol::RowStride;
-            typename  tileSrcCol::DType sum_01 = src_col_ptr[src_idx_0] + src_col_ptr[src_idx_1];           
-            src_col_ptr[src_idx_0] = sum_01;          
+            typename  tileSrcCol::DType sum_01 = src_col_ptr[src_idx_0] + src_col_ptr[src_idx_1];
+            src_col_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
     }
 
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpSum::ValidCol/4;i++){
-        size_t old_sum_idx =  z * tileTmpSum::ValidCol/4 * tileTmpSum::ColStride + i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;       
-        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];          
-    }    
+        size_t old_sum_idx =  z * tileTmpSum::ValidCol/4 * tileTmpSum::ColStride + i * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        new_sum_ptr[old_sum_idx] = old_sum_ptr[old_sum_idx];
+    }
 
 
     size_t src_sum_idx = stride_src_col + j * tileSrcCol::RowStride;
     size_t sum_tile_idx = z * tileTmpSum::ValidCol/4 * tileTmpSum::ColStride + tile_idx * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-    new_sum_ptr[sum_tile_idx] = src_col_ptr[src_sum_idx];  
+    new_sum_ptr[sum_tile_idx] = src_col_ptr[src_sum_idx];
 }
 
 
@@ -129,39 +129,39 @@ void __vec__ reducesum_row_final_kernel(
     __vbuf__ typename tileSum::DType *new_sum_ptr = blkv_get_tile_ptr(new_sum);
     __vbuf__ typename tileTmpSum::DType *tmp_sum_ptr = blkv_get_tile_ptr(tmp_sum);
 
-    #pragma clang loop unroll(full) 
+    #pragma clang loop unroll(full)
     for(size_t i=0;i<tileTmpSum::Cols;i+=8){
         size_t src_idx_0 =  (i+0) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_1 =  (i+1) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_2 =  (i+2) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-        size_t src_idx_3 =  (i+3) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;        
+        size_t src_idx_3 =  (i+3) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_4 =  (i+4) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_5 =  (i+5) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
         size_t src_idx_6 =  (i+6) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-        size_t src_idx_7 =  (i+7) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;        
-        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];    
+        size_t src_idx_7 =  (i+7) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
+        typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
         typename  tileTmpSum::DType sum_23 = tmp_sum_ptr[src_idx_2] + tmp_sum_ptr[src_idx_3];
-        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];    
-        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];        
-        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23; 
+        typename  tileTmpSum::DType sum_45 = tmp_sum_ptr[src_idx_4] + tmp_sum_ptr[src_idx_5];
+        typename  tileTmpSum::DType sum_67 = tmp_sum_ptr[src_idx_6] + tmp_sum_ptr[src_idx_7];
+        typename  tileTmpSum::DType sum_0123 = sum_01 + sum_23;
         typename  tileTmpSum::DType sum_4567 = sum_45 + sum_67;
-        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;   
-        tmp_sum_ptr[src_idx_0] = sum_all;          
-    }   
+        typename  tileTmpSum::DType sum_all = sum_0123 + sum_4567;
+        tmp_sum_ptr[src_idx_0] = sum_all;
+    }
 
     size_t stride = 8;
-    size_t iternum = __builtin_ctz(tileTmpSum::Cols) - 3;    
-    #pragma clang loop unroll(full) 
+    size_t iternum = __builtin_ctz(tileTmpSum::Cols) - 3;
+    #pragma clang loop unroll(full)
     for(size_t k=0;k<iternum;k++){
-        #pragma clang loop unroll(full) 
+        #pragma clang loop unroll(full)
         for(size_t i=0;i<tileTmpSum::Cols;i+=(stride*2)){
             size_t src_idx_0 =  (i + 0*stride) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
             size_t src_idx_1 =  (i + 1*stride) * tileTmpSum::ColStride + j * tileTmpSum::RowStride;
-            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];           
-            tmp_sum_ptr[src_idx_0] = sum_01;          
+            typename  tileTmpSum::DType sum_01 = tmp_sum_ptr[src_idx_0] + tmp_sum_ptr[src_idx_1];
+            tmp_sum_ptr[src_idx_0] = sum_01;
         }
         stride = stride*2;
-    }    
+    }
 
     size_t sum_idx = j * tileTmpSum::RowStride;
     new_sum_ptr[idx] = tmp_sum_ptr[sum_idx];
@@ -172,31 +172,31 @@ template<typename dtype, const int gIM, const int gIN, const int tM, const int t
 void reducesum_trowsum_rand(
     dtype *in_ptr,
     dtype *out_ptr
-) 
+)
 {
 
     const int Mb = gIM / tM;
-    const int Nb = gIN / tN;    
+    const int Nb = gIN / tN;
 
     const int rmd_M = gIM % tM; // TODO: how should the tail block be handled?
-    const int rmd_N = gIN % tN; // TODO: how should the tail block be handled?    
+    const int rmd_N = gIN % tN; // TODO: how should the tail block be handled?
 
 
-    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data 
-//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;    
+    using gm_shapeIn = global_tensor<dtype, RowMajor<gIM, gIN>>;     // first declare the tensor in GM as one-dimensional data
+//    using gm_shapeSum = global_tensor<dtype, RowMajor<gIM, gIN>>;
     using gm_shapeOut = global_tensor<dtype, RowMajor<gIM, 1>>;
     using tile_shapeData = Tile<Location::Vec, dtype, tM, tN, BLayout::RowMajor>; // TODO: how should the tail block be handled? Should it be passed here as a parameter?
-    using tile_shapeDataCol = Tile<Location::Vec, dtype, tM, tN/8, BLayout::ColMajor>; // TODO: how should the tail block be handled? Should it be passed here as a parameter?    
+    using tile_shapeDataCol = Tile<Location::Vec, dtype, tM, tN/8, BLayout::ColMajor>; // TODO: how should the tail block be handled? Should it be passed here as a parameter?
     using tile_shapeSum = Tile<Location::Vec, dtype, tM, 8, BLayout::RowMajor, tM, 1>; // TODO: must the location here be Vec, even when Vec is not passed in?
     using tile_shapeTmpSum = Tile<Location::Vec, dtype, tM, 64, BLayout::ColMajor, tM, Nb*4>; // TODO: must the location here be Vec, even when Vec is not passed in?
 
 
-    gm_shapeIn inGm(in_ptr);    
+    gm_shapeIn inGm(in_ptr);
     gm_shapeOut outGm(out_ptr);
-//    gm_shapeSum olcSumGm(old_sum_ptr);    
+//    gm_shapeSum olcSumGm(old_sum_ptr);
 
-    tile_shapeData dataTile;   
-    tile_shapeDataCol dataTile_col;                        
+    tile_shapeData dataTile;
+    tile_shapeDataCol dataTile_col;
     tile_shapeSum SumTile;
     tile_shapeTmpSum oldtmpSumTile;
     tile_shapeTmpSum tmpSumTile;
@@ -204,7 +204,7 @@ void reducesum_trowsum_rand(
 //    int base = 0;// TODO: generate a scalar
 //    int all_num = gOM; // total number of elements
 
-    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;  
+    using itIn = global_iterator<gm_shapeIn, tile_shapeData>;
     using itOut = global_iterator<gm_shapeOut, tile_shapeSum>;
 
     itIn  gIIter(in_ptr);
@@ -212,21 +212,21 @@ void reducesum_trowsum_rand(
 
 
     auto gO = gOIter(0, 0);
-    TEXPANDSCALAR(oldtmpSumTile, 0);// initialize to 0  
-    TEXPANDSCALAR(dataTile_col, 0);// initialize to 0     
+    TEXPANDSCALAR(oldtmpSumTile, 0);// initialize to 0
+    TEXPANDSCALAR(dataTile_col, 0);// initialize to 0
     for (int i = 0; i < Nb; ++i) {
-        auto gI = gIIter(0, i);                
-        TCOPYIN(dataTile, gI);    
-        reducesum_row_kernel<tile_shapeData, tile_shapeDataCol, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidRow, 4, 1>>>(tmpSumTile.data(), 
-                                                                                                                        dataTile.data(), 
-                                                                                                                        dataTile_col.data(), 
+        auto gI = gIIter(0, i);
+        TLOAD(dataTile, gI);
+        reducesum_row_kernel<tile_shapeData, tile_shapeDataCol, tile_shapeTmpSum><<<tile_shapeTmpSum::ValidRow, 4, 1>>>(tmpSumTile.data(),
+                                                                                                                        dataTile.data(),
+                                                                                                                        dataTile_col.data(),
                                                                                                                         oldtmpSumTile.data(),
                                                                                                                         i);
         oldtmpSumTile = tmpSumTile;
     }
-    reducesum_row_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeTmpSum::ValidRow, 1, 1>>>(SumTile.data(), 
-                                                                                                      tmpSumTile.data());     
-    TCOPYOUT(gO, SumTile);
+    reducesum_row_final_kernel<tile_shapeTmpSum, tile_shapeSum><<<tile_shapeTmpSum::ValidRow, 1, 1>>>(SumTile.data(),
+                                                                                                      tmpSumTile.data());
+    TSTORE(gO, SumTile);
 }
 
 #endif
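
Review note on the reduction order used by the kernels above. Both
`reducesum_row_kernel` and `reducesum_row_final_kernel` appear to reduce a row
in two phases: groups of 8 neighbouring elements are folded into the first slot
of each group, then partial sums are merged pairwise with a stride that starts
at 8 and doubles each round (`iternum = __builtin_ctz(ValidCol) - 3` rounds).
The following is a minimal host-side sketch of that order, assuming a
contiguous row whose length is a power of two and at least 8. The name
`tree_row_sum` and the use of `std::vector<float>` are hypothetical and are not
part of this repository or of the TileOP API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not a repository function.
// Sums one row using the same two-phase order as the kernels in this diff.
// Assumes row.size() is a power of two and at least 8.
float tree_row_sum(std::vector<float> row) {
    const std::size_t n = row.size();
    assert(n >= 8 && (n & (n - 1)) == 0);

    // Phase 1: fold each group of 8 neighbours into the group's first slot,
    // pairing operands the same way the unrolled kernel body does.
    for (std::size_t i = 0; i < n; i += 8) {
        const float sum_01 = row[i + 0] + row[i + 1];
        const float sum_23 = row[i + 2] + row[i + 3];
        const float sum_45 = row[i + 4] + row[i + 5];
        const float sum_67 = row[i + 6] + row[i + 7];
        row[i] = (sum_01 + sum_23) + (sum_45 + sum_67);
    }

    // Phase 2: merge partial sums pairwise. The stride starts at 8 and
    // doubles each round, which gives log2(n) - 3 rounds in total.
    for (std::size_t stride = 8; stride < n; stride *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * stride) {
            row[i] += row[i + stride];
        }
    }

    // The full row sum ends up in slot 0.
    return row[0];
}
```

Calling `tree_row_sum` on a 64-element row holding 1..64 yields 2080. Because
floating-point addition is not associative, a reference check that sums
left to right may differ from this order in the last bits for general inputs,
so a tolerance-based comparison is the safer choice when validating the kernels.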
diff --git a/models/deepseekv3/mla.hpp b/models/deepseekv3/mla.hpp
index f309309..350d790 100644
--- a/models/deepseekv3/mla.hpp
+++ b/models/deepseekv3/mla.hpp
@@ -26,26 +26,26 @@ using namespace pto;
 
 //     const int Sb = seq_len / tS;
 //     for(int i=0;i<Sb;i++){
-//         Tile<Location::Vec, dtype, dim/2, 1, BLayout::RowMajor> freqs; 
+//         Tile<Location::Vec, dtype, dim/2, 1, BLayout::RowMajor> freqs;
 //         Tile<Location::Vec, dtype, tS, 1, BLayout::RowMajor> t;
 //         TARANGE(freqs, 0, dim, 2);
 //         TARANGE(t, 0, end, 1);
 //         freq_tshape tfreq_cis;
 //         freq_tshape tfreq_cis_real;
 //         freq_tshape tfreq_cis_imag;
-//         TOUTDOT(tfreq_cis, freqs, t); //outer product 
-//         TSIN(tfreq_cis_real, tfreq_cis); // 
+//         TOUTDOT(tfreq_cis, freqs, t); //outer product
+//         TSIN(tfreq_cis_real, tfreq_cis); //
 //         TCOS(tfreq_cis_imag, tfreq_cis);
 //         auto gO_real = gFreq_real(i,0);
 //         auto gO_imag = gFreq_imag(i,0);
-//         TCOPYOUT(gO_real, tfreq_cis_real);
-//         TCOPYOUT(gO_imag, tfreq_cis_imag);
+//         TSTORE(gO_real, tfreq_cis_real);
+//         TSTORE(gO_imag, tfreq_cis_imag);
 //     }
 // }
 
 template<typename dtype, const int bsz, const int seq_len, const int dim_in,const int dim_out>
 void projection(Tensor<dtype, bsz, seq_len , dim_out> &out,
-               Tensor<dtype, bsz, seq_len, dim_in> &x, 
+               Tensor<dtype, bsz, seq_len, dim_in> &x,
                Tensor<dtype, dim_in, dim_out> &proj){
 
     for(int i=0;i<bsz;i++){
@@ -59,9 +59,9 @@ template<typename dtype, const int kM, const int kN, const int kTM, const int kT
 void rmsnorm(dtype *dst, dtype *src){
     using gm_shape = global_tensor<dtype, RowMajor<kM, kN>>;
     using tile_shape = Tile<Location::Vec, dtype, kTM, kTN, BLayout::RowMajor>;
- 
+
     using tSum = Tile<Location::Vec, dtype, kTM, 16, BLayout::RowMajor, kTM, 1>;
- 
+
     using gIter = global_iterator<gm_shape, tile_shape>;
 
     gIter giter_src(src);
@@ -78,19 +78,19 @@ void rmsnorm(dtype *dst, dtype *src){
         {
             auto gsrc = giter_src(i, j);
             tile_shape tsrc;
- 
-            TCOPYIN(tsrc, gsrc);
+
+            TLOAD(tsrc, gsrc);
 
             tSum tLocalSum;
             TMUL(tsrc, tsrc, tsrc);
             TROWSUM(tLocalSum, tsrc);
             TADD(tAccSquareSum, tAccSquareSum, tLocalSum);
         }
- 
+
         tSum gSqureMean;
         TDIVS(gSqureMean, tAccSquareSum, kN);
         TSQRT(gSqureMean, gSqureMean);
- 
+
         tile_shape gSqureMean_i;
         TEXPANDCOL(gSqureMean_i, gSqureMean);
 
@@ -98,12 +98,12 @@ void rmsnorm(dtype *dst, dtype *src){
         {
             auto  gsrc = giter_src(i,j);
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc);
- 
+            TLOAD(tsrc, gsrc);
+
             TDIV(tsrc, tsrc, gSqureMean_i);
- 
+
             auto gdst = giter_dst(i,j);
-            TCOPYOUT(gdst, tsrc);
+            TSTORE(gdst, tsrc);
         }
     }
 }
@@ -157,12 +157,12 @@ void apply_rotary_emb(dtype *x, dtype *freqs_cis){
             gm_shape input(x+offset);
             tile_shape tin;
             tile_shape_rope resh_tin;
-            TCOPYIN(tin, input);   // 64*32
+            TLOAD(tin, input);   // 64*32
             TRESHAPE(resh_tin, tin); // 64*32 -> 1024*2
             //TTRANS();           // 128*2 -> 2*128
 
             tile_shape_half tin_real;
-            tile_shape_half tin_imag; 
+            tile_shape_half tin_imag;
             TEXTRACT(tin_real, resh_tin, 0, 0);        // real 1024*1
             TEXTRACT(tin_imag, resh_tin, 0, 1);        // imag 1024*1
 
@@ -170,7 +170,7 @@ void apply_rotary_emb(dtype *x, dtype *freqs_cis){
             gm_shape freqs(freqs_cis+offset);
             tile_shape tfreqs;
             tile_shape_rope tfreqs_resh;
-            TCOPYIN(tfreqs, freqs);
+            TLOAD(tfreqs, freqs);
             TRESHAPE(tfreqs_resh, tfreqs);
 
             tile_shape_half tfreqs_real;
@@ -203,7 +203,7 @@ void apply_rotary_emb(dtype *x, dtype *freqs_cis){
             tile_shape tout_resh;
             TRESHAPE(tout_resh, tout);
 
-            TCOPYOUT(input, tout_resh);
+            TSTORE(input, tout_resh);
         }
     }
 }
@@ -221,18 +221,18 @@ void split(dtype *out1, dtype *out2, dtype *in){
 
     uint32_t n_row =  row/trow;
     uint32_t n_col1 = dim1/tcol;
-    uint32_t n_col2 = dim2/tcol; 
+    uint32_t n_col2 = dim2/tcol;
 
     for(int i=0;i<n_row;i++){
         for(int j=0;j<n_col1;j++){
             uint32_t offset = i * (trow * (dim1+dim2)) + j * tcol;
             gm_in input(in+offset);
             tile_shape tmp;
-            TCOPYIN(tmp, input);
+            TLOAD(tmp, input);
 
             offset = i * (trow * dim1) + j * tcol;
             gm_out1 output1(out1+offset);
-            TCOPYOUT(output1,tmp);
+            TSTORE(output1,tmp);
         }
     }
 
@@ -241,11 +241,11 @@ void split(dtype *out1, dtype *out2, dtype *in){
             uint32_t offset = i * (trow * (dim1+dim2)) + j * tcol + dim1;
             gm_in input(in+offset);
             tile_shape tmp;
-            TCOPYIN(tmp, input);
+            TLOAD(tmp, input);
 
             offset = i * (trow * dim2) + j * tcol;
             gm_out2 output2(out2+offset);
-            TCOPYOUT(output2,tmp);
+            TSTORE(output2,tmp);
         }
     }
 }
@@ -271,11 +271,11 @@ void concat(dtype *out, dtype *in1, dtype *in2){
             uint32_t offset = i * (trow * dim1) + j * tcol;
             gm_in1 input1(in1 + offset);
             tile_shape tmp;
-            TCOPYIN(tmp, input1);
+            TLOAD(tmp, input1);
 
             offset = i * (trow * (dim1+dim2)) + j * tcol;
             gm_out output(out+offset);
-            TCOPYOUT(output, tmp);
+            TSTORE(output, tmp);
         }
     }
 
@@ -285,11 +285,11 @@ void concat(dtype *out, dtype *in1, dtype *in2){
 
             gm_in2 input2(in2 + offset);
             tile_shape tmp;
-            TCOPYIN(tmp, input2);
+            TLOAD(tmp, input2);
 
             offset = dim1 + i * (trow * (dim1+dim2)) + j * tcol;
             gm_out output(out+offset);
-            TCOPYOUT(output, tmp);
+            TSTORE(output, tmp);
         }
     }
 }
@@ -313,11 +313,11 @@ void permute(dtype *out, dtype *in){
                     tile_shape tmp;
                     gm_shape_in input(in+src_offset);
                     gm_shape_out ouput(out+dst_offset);
-                    TCOPYIN(tmp, input);
-                    TCOPYOUT(ouput, tmp);
+                    TLOAD(tmp, input);
+                    TSTORE(ouput, tmp);
             }
         }
-    }    
+    }
 }
 
 //[row, dim] -> [row, ext_dim, dim]
@@ -333,18 +333,18 @@ void expand(dtype *out, dtype *in){
         uint32_t offset = i * dim;
         gm_in input(in + offset);
         tile_shape tmp;
-        TCOPYIN(tmp, input);
+        TLOAD(tmp, input);
         for(int j=0;j<ext_dim;j++){
             offset = i * ext_dim * dim + j * dim;
             gm_out output(out+offset);
-            TCOPYOUT(output, tmp);
+            TSTORE(output, tmp);
         }
-    }   
+    }
 }
 
 template<typename dtype, const int bsz, const int seq_len, typename args>
-void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out, 
-        Tensor<dtype, bsz, seq_len, args::dim> &x, 
+void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out,
+        Tensor<dtype, bsz, seq_len, args::dim> &x,
         Tensor<dtype, seq_len, args::qk_rope_head_dim> & freqs_cis,
         Tensor<dtype, bsz, args::n_heads, seq_len>* atten_mask=nullptr){
     // do down projection to q_down then do up projection to q_up
@@ -374,7 +374,7 @@ void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out,
     Tensor<dtype, bsz, seq_len, args::n_heads*args::qk_nope_head_dim> q_nope;
     Tensor<dtype, bsz, seq_len, args::n_heads*args::qk_rope_head_dim> q_pe;
     split<dtype, bsz*seq_len*args::n_heads, args::qk_nope_head_dim, args::qk_rope_head_dim>(q_nope.data(), q_pe.data(), q_up.data());
-        
+
     //for q_pe doing rotary embedding
     Tensor<dtype, bsz, args::n_heads, seq_len, args::qk_rope_head_dim> q_perm;
     permute<dtype, bsz, seq_len, args::n_heads, args::qk_rope_head_dim>(q_perm.data(), q_pe.data());
@@ -386,7 +386,7 @@ void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out,
     permute<dtype, bsz, args::n_heads, seq_len, args::qk_rope_head_dim>(q_pe.data(), q_perm.data());
 
     //writeTensorToFile<dtype, bsz*seq_len, args::n_heads*args::qk_rope_head_dim>(q_pe.data(), "q_pe_cpp.txt");
-    
+
     Tensor<dtype, bsz, seq_len, args::n_heads*args::qk_head_dim> q_attn;
     //concat q_nope and q_pe to Q
     concat<dtype, bsz*seq_len*args::n_heads, args::qk_nope_head_dim, args::qk_rope_head_dim>(q_attn.data(), q_nope.data(), q_pe.data());
@@ -394,7 +394,7 @@ void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out,
     // do down projection to k_rope+k_lora_rank, and split to k_rope, k_lora_rank
     Tensor<dtype, bsz, seq_len, args::kv_lora_rank> kv;
     Tensor<dtype, bsz, seq_len, args::qk_rope_head_dim> k_pe;
-    { 
+    {
         Tensor<dtype, args::dim, args::kv_lora_rank+args::qk_rope_head_dim> Wkv_down(1);
         Tensor<dtype, bsz, seq_len, args::kv_lora_rank+args::qk_rope_head_dim> kv_down;
         projection<dtype, bsz, seq_len, args::dim, args::kv_lora_rank+args::qk_rope_head_dim>(kv_down, x, Wkv_down);
@@ -432,15 +432,15 @@ void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out,
     //writeTensorToFile<dtype, bsz*seq_len*args::n_heads, args::qk_head_dim>(k_attn.data(), "k_attn_cpp.txt");
     //writeTensorToFile<dtype, bsz*seq_len*args::n_heads, args::v_head_dim>(v_attn.data(), "v_attn_cpp.txt");
 
-    permute<dtype, bsz, seq_len, args::n_heads, args::qk_head_dim>(q_attn_pm.data(), q_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim] 
-    permute<dtype, bsz, seq_len, args::n_heads, args::qk_head_dim>(k_attn_pm.data(), k_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim] 
+    permute<dtype, bsz, seq_len, args::n_heads, args::qk_head_dim>(q_attn_pm.data(), q_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim]
+    permute<dtype, bsz, seq_len, args::n_heads, args::qk_head_dim>(k_attn_pm.data(), k_attn.data()); //[b,s,n_heads,qk_head_dim] permute to [b,n_heads,s,qk_head_dim]
     permute<dtype, bsz, seq_len, args::n_heads, args::v_head_dim>(v_attn_pm.data(), v_attn.data()); //[b,s,n_heads,v_head_dim]  permute to [b,n_heads,s,v_head_dim]
 
     Tensor<dtype, bsz, seq_len, args::n_heads*args::v_head_dim>  attn_out;
     Tensor<dtype, bsz, args::n_heads, seq_len, args::v_head_dim> attn_tmp;
     for(int i=0;i<bsz;i++){
         for(int j=0;j<args::n_heads;j++){
-            // NOTE: v_head_dim == qk_head_dim since flash_attention impl 
+            // NOTE: v_head_dim == qk_head_dim since flash_attention impl
             // FIXME: consider attn_mask in flash_attention before softmax
             // attn_mask: lower triangle matrix with upper fill with -inf
             flash_attention<seq_len,args::qk_head_dim,args::v_head_dim,32,32>(attn_tmp.data(i,j), q_attn_pm.data(i,j), k_attn_pm.data(i,j), v_attn_pm.data(i,j));
@@ -451,7 +451,7 @@ void MLA(Tensor<dtype, bsz, seq_len, args::dim> & out,
     permute<dtype, bsz, args::n_heads, seq_len, args::v_head_dim>(attn_out.data(), attn_tmp.data());
 
     //writeTensorToFile<dtype, bsz*seq_len*args::n_heads, args::v_head_dim>(attn_out.data(), "attn_out_cpp.txt");
-    
+
     //final output projection
     {
         Tensor<dtype, args::n_heads*args::v_head_dim, args::dim> Wout(1);
diff --git a/models/deepseekv3/moe.hpp b/models/deepseekv3/moe.hpp
index 4726706..da6387d 100644
--- a/models/deepseekv3/moe.hpp
+++ b/models/deepseekv3/moe.hpp
@@ -36,7 +36,7 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp(
     //             dst_ptr[i * tile_shape::Cols + tid] =
     //                 src_ptr[i * tile_shape::Cols + tid];
     //             dst_ptr[i * tile_shape::Cols + partner] =
-    //                 src_ptr[i * tile_shape::Cols + partner];                
+    //                 src_ptr[i * tile_shape::Cols + partner];
     //         }
     //     } else {
     //         if (src_ptr[i * tile_shape::Cols + tid] >
@@ -93,15 +93,15 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp(
     //     "l.sw  vt#1.sw, [to, vm#2.uh<<2]\n"           // dst[tid] = src[partner]
     //     "l.sw  vt#2.sw, [to, vm#1.uh<<2]\n"           // dst[partner] = src[tid]
     //     "l.addi t#1.ud, 0, ->p\n"                     //resave p from 3rd branch
-    //     "l.xori p, -1, ->p\n"                      
+    //     "l.xori p, -1, ->p\n"
     //     "l.and p, t#2.ud, ->p\n"                      //go else for 2nd branch
-    //     "l.cmp.lt vt#1.sw, vt#2.sw, -> vn.b\n"        // src[partner] < src[tid] 
+    //     "l.cmp.lt vt#1.sw, vt#2.sw, -> vn.b\n"        // src[partner] < src[tid]
     //     "l.addi p, 0 ->t.d\n"                         // save p for 4th branch
     //     "l.cmp.eqi vn#1.ub, 1,->p\n"                  // set p if(src[tid] < src[partner])
     //     "l.sw  vt#1.sw, [to, vm#2.uh<<2]\n"           //dst[tid] = src[partner]
     //     "l.sw  vt#2.sw, [to, vm#1.uh<<2]\n"           //dst[partner] = src[tid]
     //     "l.addi t#1.ud, 0, ->p\n"                     //resave p from 4rd branch
-    //     ""                                            //merge 2nd branch two result 
+    //     ""                                            //merge 2nd branch two result
     //     "l.addi t#3.ud, 0, ->p\n"                     //resave p from 2nd branch
     //     "l.addi t#4.ud, 0, ->p\n"                     //resave p from 1st branch
     //     "c.bstop\n"
@@ -122,16 +122,16 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp(
         "v.lw   [ta, vn#1.reuse.uh<<2],     ->vt.w\n"       // src[index_part+col/2] = partner_idx
         "v.lw   [ta, vm#2.reuse.uh<<2],     ->vt.w\n"       // src[index] = cur_value
         "v.lw   [ta, vm#1.reuse.uh<<2],     ->vt.w\n"       // src[index_part] = partner_value
-        "v.sw  vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n"           // dst[tid] = src[tid]   // copy first 
+        "v.sw  vt#2.reuse.sw, [to, vm#2.reuse.uh<<2]\n"           // dst[tid] = src[tid]   // copy first
         "v.sw  vt#1.reuse.sw, [to, vm#1.reuse.uh<<2]\n"           // dst[partner] = src[partner] // copy first
-        "v.sw  vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n"           // dst[tid+col/2] = src[tid+col/2]   // copy first 
+        "v.sw  vt#4.reuse.sw, [to, vn#2.reuse.uh<<2]\n"           // dst[tid+col/2] = src[tid+col/2]   // copy first
         "v.sw  vt#3.reuse.sw, [to, vn#1.reuse.uh<<2]\n"           // dst[partner+col/2] = src[partner+col/2] // copy first
         "v.cmp.lt lc0.uh, vu#1.reuse.uh, ->vn.b\n"          // tid < partner
         "v.and  vu#1.reuse.uh, ri0.uh, ->vn.h\n"            // partner & stage
         "v.cmp.eqi vn#1.reuse.uh, 0, ->vn.b\n"              // partner & stage == 0
         "v.cmp.lt vt#2.reuse.sw, vt#1.reuse.sw, ->vn.b\n"         // cur_value < partner_value
         "v.and vn#4.reuse.ub, vn#2.reuse.ub, ->vu.b\n"            // (tid < partner) & (partner & stage) == 0
-        "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n"             // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value) 
+        "v.and vu#1.reuse.ub, vn#1.reuse.ub ->vu.b\n"             // (tid < partner) & ((partner & stage) == 0) & (cur_value < partner_value)
         "v.cmp.eqi vu#1.ub, 1, ->vm.b\n"              // sort_descend
         ""
         "v.cmp.eqi vn#3.uh, 1, ->vn.b\n"                // partner & stage == 1
@@ -152,7 +152,7 @@ void __vec__ BitonicSortStepDescend_RowMajor_Imp(
         "v.sw  vt#3.sw, [to, vn#2.uh<<2]\n"           // dst[tid+col/2] = src[partner]
         "v.sw  vt#4.sw, [to, vn#1.uh<<2]\n"           // dst[partner+col/2] = src[tid]
         "l.addi t#1.ud, 0, ->p\n"                     // resave p from 1st branch
-        ""                                            // merge 2nd branch two result 
+        ""                                            // merge 2nd branch two result
         "c.bstop\n"
         :
         :"i"(tile_shape::ValidCol)
@@ -171,7 +171,7 @@ TRANGE_RowMajor(typename tile_shape::TileDType __out__ dst) {
 }
 
 template <is_tile_data_v tile_shape, bool ascending = true>
-__attribute__((always_inline)) 
+__attribute__((always_inline))
 void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) {
     static constexpr uint16_t row = tile_shape::ValidRow;
     static constexpr uint16_t col = tile_shape::ValidCol;
@@ -181,7 +181,7 @@ void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) {
 
     using tile_shape_sort = Tile<Location::Vec, dtype, tile_shape::Rows, 2*tile_shape::Cols, BLayout::RowMajor>;
     tile_shape_sort dst_sort;
-    tile_shape_sort src_sort; 
+    tile_shape_sort src_sort;
 
     TRANGE_RowMajor<tile_shape><<<col, row>>>(indices.data());
     tile_shape_sort padding(-1);
@@ -196,7 +196,7 @@ void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) {
                 BitonicSortStepDescend_RowMajor_Imp<tile_shape_sort><<<col, row>>>(dst_sort.data(), src_sort.data(), stage, step);
                 TCOPY(src_sort, dst_sort);
 
-                // TCOPYOUT(gIn, dst);
+                // TSTORE(gIn, dst);
                 // printf("stage:%d step:%d\n", stage, step);
                 // for (int j=0;j<col;j++) {
                 //     printf("%.0f ", tmp[j]);
@@ -246,7 +246,7 @@ void BitonicSortStepDescend_RowMajor_Imp(
                             src[i * tile_shape::Cols + col + partner];
                         dst[i * tile_shape::Cols + col + partner] =
                             src[i * tile_shape::Cols + col + tid];
-                        
+
                     }
                 } else {
                     if (src[i * tile_shape::Cols + tid] >
@@ -279,7 +279,7 @@ template <is_tile_data_v tile_shape, bool ascending = true>
 void TSORTROW(tile_shape &weight, tile_shape &indices, tile_shape &src) {
     using tile_shape_sort = Tile<Location::Vec, dtype, tile_shape::Rows, 2*tile_shape::Cols, BLayout::RowMajor>;
     tile_shape_sort dst_sort;
-    tile_shape_sort src_sort; 
+    tile_shape_sort src_sort;
 
     TRANGE_RowMajor<tile_shape>(indices.data());
     tile_shape_sort padding(0);
@@ -325,7 +325,7 @@ void __vec__ TScatterRow_Vec_RowMajor(
     __vbuf__ typename tile_shape_dst::DType *dst_ptr = blkv_get_tile_ptr(dst);
     __vbuf__ typename tile_shape_dst::DType *src_ptr = blkv_get_tile_ptr(src);
     __vbuf__ typename tile_shape_srci::DType *si_ptr = blkv_get_tile_ptr(srci);
-    dst_ptr[j*tile_shape_dst::RowStride + i] = src_ptr[j*tile_shape_dst::RowStride + i]; 
+    dst_ptr[j*tile_shape_dst::RowStride + i] = src_ptr[j*tile_shape_dst::RowStride + i];
     for(uint16_t k=0;k<tile_shape_srci::ValidCol;k++){
         uint16_t index = j * tile_shape_srci::RowStride + k;
         uint16_t idx = si_ptr[index];
@@ -339,7 +339,7 @@ void TScatterRow_Vec_RowMajor(
     const typename tile_shape_dst::TileDType  src,
     const typename tile_shape_srci::TileDType srci,
     const typename tile_shape_dst::DType s) {
-    
+
     for (uint16_t i = 0; i < tile_shape_dst::ValidRow; ++i){
         for (uint16_t j = 0; j < tile_shape_dst::ValidCol; ++j) {
             dst[i*tile_shape_dst::RowStride+j] = src[i*tile_shape_dst::RowStride+j];
@@ -363,7 +363,7 @@ void topk(dtype *weight, dtype* indices, dtype *x){
     using gmOut = global_tensor<dtype, RowMajor<tokens, num>>;
     using tileIn = Tile<Location::Vec, dtype, tS, scores, BLayout::RowMajor>;
     using tileOut = Tile<Location::Vec, dtype, tS, 32, BLayout::RowMajor, tS, num>; // num < 32
-   
+
     #ifdef __cpu_sim__
     //writeTensorToFile<dtype, tokens, scores>(x, "moe_topk_in_cpp.txt");
     #endif
@@ -374,7 +374,7 @@ void topk(dtype *weight, dtype* indices, dtype *x){
         tileIn tIn;
         tileIn tWeight;
         tileIn tIndice;
-        TCOPYIN(tIn, gIn);
+        TLOAD(tIn, gIn);
         TSORTROW(tWeight, tIndice, tIn);
         tileOut tWeightOut;
         TEXTRACT(tWeightOut, tWeight, 0, 0);
@@ -383,10 +383,10 @@ void topk(dtype *weight, dtype* indices, dtype *x){
         TEXTRACT(tIndiceOut, tIndice, 0, 0);
 
         gmOut gWeight(weight+i*tS*num);
-        TCOPYOUT(gWeight, tWeightOut);
+        TSTORE(gWeight, tWeightOut);
 
         gmOut gIndice(indices+i*tS*num);
-        TCOPYOUT(gIndice, tIndiceOut);
+        TSTORE(gIndice, tIndiceOut);
     }
 
     #ifdef __cpu_sim__
@@ -410,18 +410,18 @@ void sigmoid(dtype *out, dtype* in){
         for(int j=0;j<Nb;j++){
             tile_shape tmp;
             auto gIn = gIterIn(i,j);
-            TCOPYIN(tmp, gIn);
+            TLOAD(tmp, gIn);
             TEXP(tmp,tmp); //e^x
             TRECIP(tmp,tmp); // e^-x
             TADDS(tmp,tmp,static_cast<dtype>(1)); // 1+ e^(-x)
             TRECIP(tmp,tmp); // 1/ (1 + e^-x)
             auto gOut = gIterOut(i,j);
-            TCOPYOUT(gOut, tmp);
+            TSTORE(gOut, tmp);
         }
     }
 }
 
-//select idx array to index every row in "in", then mask with value and create last dim to ext_dim 
+//select idx array to index every row in "in", then mask with value and create last dim to ext_dim
 //out[idx] -> [tokens, in_dim]
 template<typename dtype, int tokens, int in_dim, int idx_dim, int ext_dim>
 void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){
@@ -441,11 +441,11 @@ void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){
     for(int i=0;i<blocks;i++){
         gmIn gIn(in + i*tS*in_dim);
         tileIn tIn;
-        TCOPYIN(tIn, gIn);
+        TLOAD(tIn, gIn);
 
         gmIdx gIdx(idx + i*tS*idx_dim);
         tileIdx tIdx;
-        TCOPYIN(tIdx, gIdx);
+        TLOAD(tIdx, gIdx);
 
         #ifdef __linx
         static constexpr uint16_t row = tS;
@@ -455,7 +455,7 @@ void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){
         #else
         TScatterRow_Vec_RowMajor<tileIn, tileIdx>(tIn.data(), tIn.data(), tIdx.data(), value);
         #endif
-        
+
         Tile<Location::Vec, dtype, tS*in_dim, 1, BLayout::RowMajor> tRe;
         TRESHAPE(tRe, tIn);
 
@@ -463,7 +463,7 @@ void scatter_expand(dtype *out, dtype*idx, dtype *in, dtype value){
         TEXPANDCOL(tOut, tRe);
 
         gmOut gOut(out + i*tS*in_dim*ext_dim);
-        TCOPYOUT(gOut, tOut);
+        TSTORE(gOut, tOut);
     }
 }
 
@@ -481,19 +481,19 @@ void mask_fill(dtype *data, dtype *mask, dtype mask_value){
 
         gm_shape gmask(mask+i*tS*dim);
         tile_shape tmask;
-        TCOPYIN(tdata, gdata);
-        TCOPYIN(tmask, gmask);
+        TLOAD(tdata, gdata);
+        TLOAD(tmask, gmask);
 
         tile_shape tmaskval(mask_value);
 
         TSELECT(tdata, tmask, tmaskval, tdata);
 
-        TCOPYOUT(gdata, tdata);
+        TSTORE(gdata, tdata);
     }
 }
 
 template<typename dtype, const int bsz, const int seq_len, typename args>
-void Gate(dtype *weights, 
+void Gate(dtype *weights,
           dtype *indices,
           dtype *x,
           dtype *bias=nullptr){
@@ -623,14 +623,14 @@ void Gate(dtype *weights,
                 uint64_t offset = i * tS * 2;
                 gm_shape gIn(group_weight.data()+offset);
                 tile_shape tmp;
-                TCOPYIN(tmp, gIn);
+                TLOAD(tmp, gIn);
 
                 tile_shape_out rowsum;
                 TROWSUM(rowsum,tmp);
 
                 offset = i * tS * 1;
                 gm_shape_out gOut(group_weight_sum.data()+offset);
-                TCOPYOUT(gOut, rowsum);
+                TSTORE(gOut, rowsum);
             }
         }
         #ifdef __cpu_sim__
@@ -671,7 +671,7 @@ void Gate(dtype *weights,
 
         //[b*s, n_expert_groups] all-1 matrix to index limit_groups_indices, and set to zero
         Tensor<dtype, tokens, args::n_expert_groups> mask(1);
-        Tensor<dtype, tokens*args::n_expert_groups, args::n_routed_experts/args::n_expert_groups> mask_expand;  
+        Tensor<dtype, tokens*args::n_expert_groups, args::n_routed_experts/args::n_expert_groups> mask_expand;
         scatter_expand<dtype, tokens, args::n_expert_groups, args::n_limited_groups, args::n_routed_experts/args::n_expert_groups>(mask_expand.data(), limit_group_indices.data(), mask.data(), 0); //TSCATTER(重新定义每行indices选对应行的某些列)
         //scores [b*s, n_expert_groups, n_routed_experts/n_expert_groups] mask [b*s, n_expert_groups]
         // to mask irelevant groups with "-inf" except selcted limited groups
@@ -710,7 +710,7 @@ void Gate(dtype *weights,
     writeTensorToFile<dtype, tokens, args::n_activated_experts>(indices, "moe_gate_indices_masked_cpp.txt");
     #endif
 
-    //weights = original_scores.gather(1, indices) should be same as above? 
+    //weights = original_scores.gather(1, indices) should be same as above?
     //gather is extract corresponding score on dim=1, weight([b*s, n_activated_experts])
     if constexpr(args::score_func == ScoreFunc::SIGMOID){
         // weights /= weights.sum weight sum normalization since we have extract some weight so the extract sum is not 1
@@ -722,13 +722,13 @@ void Gate(dtype *weights,
             uint64_t offset = i * tS * args::n_activated_experts;
             gm_shape gIn(weights+offset);
             tile_shape tmp;
-            TCOPYIN(tmp, gIn);
+            TLOAD(tmp, gIn);
 
             tile_shape rowsum;
             TROWSUMEXPAND(rowsum,tmp);
             TDIV(tmp, tmp, rowsum);
             gm_shape gOut(weights+offset);
-            TCOPYOUT(gOut, tmp);
+            TSTORE(gOut, tmp);
         }
     }
 }
@@ -750,7 +750,7 @@ void Gate(dtype *weights,
 //     uint16_t index = j * tile_shape::RowStride + i;
 //     uint16_t idx = src_ptr[index];
 //     sum[idx] = sum[idx] + 1;
-     
+
 //     dst_ptr[idx] = dst_ptr[idx] + 1;
 
 // }
@@ -765,7 +765,7 @@ void Gate(dtype *weights,
 //     for (uint16_t i = 0; i < tile_shape_src::ValidRow; ++i){
 //         for (uint16_t j = 0; j < tile_shape_src::ValidCol; ++j) {
 //             uint16_t idx = src[i * tile_shape_src::RowStride + j];
-//             sum[idx] = sum[idx] + 1;            
+//             sum[idx] = sum[idx] + 1;
 //         }
 //     }
 
@@ -792,13 +792,13 @@ void bincount(size_t *counts, dtype *indices, size_t size){
     // for (int i=0;i<block_tokens;i++) {
     //     gm_idx gidx(indices + i*tS*topk);
     //     tile_idx tidx;
-    //     TCOPYIN(tidx, gidx);
+    //     TLOAD(tidx, gidx);
 
     //     tile_cnt blk_cnt;
     //     TSUM(blk_cnt, gidx);
     //     TADD(tcnt, tcnt, blk_cnt);
     // }
-    // TCOPYOUT(gcnt, tcnt);
+    // TSTORE(gcnt, tcnt);
 
     //naive scalar implementaton
     for(int i=0;i<size;i++){
@@ -808,7 +808,7 @@ void bincount(size_t *counts, dtype *indices, size_t size){
     // for(int i=0;i<expert_cnt;i++){
     //     for (int j=0;j<block_tokens;j++) {
     //         register uint64_t s;
-    //         TCOPYIN(tile_idx, gidx)
+    //         TLOAD(tile_idx, gidx)
     //         rowcondset(tile_idx, tile_idx, j);
     //         TSUM(s, tile_idx);
     //         counts[i] += s;
@@ -822,7 +822,7 @@ void where(dtype *tokens_idx, dtype *topk_idx, dtype *indices, int idx){
 }
 
 // w2(silu(w1(in)) * w3(in))
-//silu : x / (1 + e^-x) 
+//silu : x / (1 + e^-x)
 template <typename dtype, const int S, const int dim, const int inter_dim>
 void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
     const int tS = 64;
@@ -872,8 +872,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
                     auto gW1 = gIterW1(0,k);
                     tileIO  tIn;
                     tileW13 tW1;
-                    TCOPYIN(tIn, gIn);
-                    TCOPYIN(tW1, gW1);
+                    TLOAD(tIn, gIn);
+                    TLOAD(tW1, gW1);
                     MATMUL(tACC_W1, tIn, tW1);
                 }
 
@@ -883,8 +883,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
                     auto gW1 = gIterW1(d,k);
                     tileIO  tIn;
                     tileW13 tW1;
-                    TCOPYIN(tIn, gIn);
-                    TCOPYIN(tW1, gW1);
+                    TLOAD(tIn, gIn);
+                    TLOAD(tW1, gW1);
                     MATMACC(tACC_W1, tIn, tW1);
                 }
 
@@ -897,8 +897,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
                     auto gW3 = gIterW3(0,k);
                     tileIO  tIn;
                     tileW13 tW3;
-                    TCOPYIN(tIn, gIn);
-                    TCOPYIN(tW3, gW3);
+                    TLOAD(tIn, gIn);
+                    TLOAD(tW3, gW3);
                     MATMUL(tACC_W3, tIn, tW3);
                 }
 
@@ -908,8 +908,8 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
                     auto gW3 = gIterW3(d,k);
                     tileIO  tIn;
                     tileW13 tW3;
-                    TCOPYIN(tIn, gIn);
-                    TCOPYIN(tW3, gW3);
+                    TLOAD(tIn, gIn);
+                    TLOAD(tW3, gW3);
                     MATMACC(tACC_W3, tIn, tW3);
                 }
 
@@ -930,7 +930,7 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
 
                 auto gW2 = gIterW2(k,j);
                 tileW2 tW2;
-                TCOPYIN(tW2, gW2);
+                TLOAD(tW2, gW2);
                 MATMUL(tACC2_W2, tACC_W13, tW2);
 
                 tileACC2_CVT tACC2_W2_CVT;
@@ -939,7 +939,7 @@ void MLP(dtype *out, dtype *in, dtype *w1, dtype *w2, dtype *w3){
             }
 
             auto gOut = gIterOut(i,j);
-            TCOPYOUT(gOut, tACC2_W2_OUT);
+            TSTORE(gOut, tACC2_W2_OUT);
         }
     }
 }
@@ -979,7 +979,7 @@ void TRowCondSet_Vec_RowMajor(
             dst[i*tile_shape::RowStride+j] = one;
         }else{
             typename tile_shape::DType zero = 0;
-            dst[i*tile_shape::RowStride+j] = zero;            
+            dst[i*tile_shape::RowStride+j] = zero;
         }
     }
   }
@@ -987,7 +987,7 @@ void TRowCondSet_Vec_RowMajor(
 #endif
 
 template<typename dtype, const int bsz, const int seq_len, typename args>
-void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out, 
+void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
         Tensor<dtype, bsz, seq_len, args::dim> &x){
     const int tokens = bsz*seq_len;
     //view(x, x); //[bsz, seq_len, dim] -> [b*s, dim]
@@ -1026,7 +1026,7 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
     Tensor<dtype, args::dim, args::moe_inter_dim> experts_w1[args::n_routed_experts];
     Tensor<dtype, args::moe_inter_dim, args::dim> experts_w2[args::n_routed_experts];
     Tensor<dtype, args::dim, args::moe_inter_dim> experts_w3[args::n_routed_experts];
-    
+
     #ifdef __cpu_sim__
     writeTensorToFile<dtype, args::dim, args::moe_inter_dim>(experts_w1[3].data(), "moe_tmp.txt");
     #endif
@@ -1039,14 +1039,14 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
         // printf("current idx is %d\n", idx);
         Tensor<dtype, args::dim, args::moe_inter_dim> expert_w1 =  experts_w1[idx];
         Tensor<dtype, args::moe_inter_dim, args::dim> expert_w2 =  experts_w2[idx];
-        Tensor<dtype, args::dim, args::moe_inter_dim> expert_w3 =  experts_w3[idx]; 
-        
+        Tensor<dtype, args::dim, args::moe_inter_dim> expert_w3 =  experts_w3[idx];
+
         Tensor<dtype, tokens, args::dim> x_mask_w_wt;
         //Tensor<dtype, tokens, args::dim> weight_expand;
         //generate condition matrix for indices that indices == i -> 1, indices !=i -> 0
         //[tokens, n_activated_experts] with corresponding tokens all zeros or all ones
         //finally get x_mask with unselect tokens row set to 0 and multiply with weight in advance;
-        {            
+        {
             const int tS = 64;
             const int tdim = 64;
             using gmIn = global_tensor<dtype, RowMajor<tokens, args::n_activated_experts>>;
@@ -1059,7 +1059,7 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
             for(int i=0;i<block_tokens;i++){
                 gmIn gidx(indices.data()+i*tS*args::n_activated_experts);
                 tileIn  tidx;
-                TCOPYIN(tidx, gidx);
+                TLOAD(tidx, gidx);
 
                 //judege ifeq "i"  to set 1 or 0
                 #ifdef __linx
@@ -1079,7 +1079,7 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
 
                 gmIn gweight(weights.data()+i*tS*args::n_activated_experts);
                 tileIn tweight;
-                TCOPYIN(tweight, gweight);
+                TLOAD(tweight, gweight);
                 TMUL(tweight, tweight, tidx);
                 Tile<Location::Vec, dtype, tS, 16, BLayout::RowMajor, tS, 1> tweight_sum;
                 TROWSUM(tweight_sum, tweight);
@@ -1090,12 +1090,12 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
                     uint64_t offset = i*(tS*args::dim)+j*tdim;
                     gmOut gIn(x.data()+offset);
                     tileOut tOut;
-                    TCOPYIN(tOut, gIn);
+                    TLOAD(tOut, gIn);
                     TMUL(tOut, tOut, tcond);
                     TMUL(tOut, tOut, tweight_expand);
-        
+
                     gmOut gOut(x_mask_w_wt.data()+offset);
-                    TCOPYOUT(gOut, tOut);
+                    TSTORE(gOut, tOut);
                 }
             }
         }
@@ -1124,7 +1124,7 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
         dtype tokens_idx[];
         dtype topk_idx[];
         where(tokens_idx, topk_idx, indices.data(), i);  //idx, top = torch.where(indices == i) 返回index=i的expert的行索引(idx)即哪些token属于这个专家，列索引(top)这个专家属于topk的哪个,需要看下ascend c++做法
-        
+
         //Gather 选出来的tokens，在Scatter到最终out token dim
         for(int i=0;i<tokens_idx.size();i++){
             //shared expert logics
@@ -1142,13 +1142,13 @@ void MoE(Tensor<dtype, bsz, seq_len, args::dim> & out,
         }
         */
     }
-    
+
     Tensor<dtype, args::dim, args::n_shared_experts*args::moe_inter_dim> shared_expert_w1(1);
     Tensor<dtype, args::n_shared_experts*args::moe_inter_dim, args::dim> shared_expert_w2(1);
     Tensor<dtype, args::dim, args::n_shared_experts*args::moe_inter_dim> shared_expert_w3(1);
     Tensor<dtype, bsz*seq_len, args::dim> shared_expert_out;
     MLP<dtype, bsz*seq_len, args::dim, args::n_shared_experts*args::moe_inter_dim>(shared_expert_out.data(), x.data(), shared_expert_w1.data(), shared_expert_w2.data(), shared_expert_w3.data());
-    
+
     matadd<bsz*seq_len, args::dim, 64, 64>(out.data(), y.data(), shared_expert_out.data());
     //reshape (bsz*seq_len, dim) -> (bsz, seq_len, dim)
 }
diff --git a/samples/README.md b/samples/README.md
new file mode 100644
index 0000000..194ec1d
--- /dev/null
+++ b/samples/README.md
@@ -0,0 +1,15 @@
+# Samples
+
+This tree keeps small checked-in compiler-output examples for quick inspection.
+Samples are reference artifacts, not build gates, and are intentionally separate
+from generated local output under `output/`.
+
+| Workload | Sample | Related SuperNPUBench source | Provenance |
+| --- | --- | --- | --- |
+| Flash attention | [`flash_attention/flash_attention_block_template.diss`](flash_attention/flash_attention_block_template.diss) | [`../benchmarks/npu/fusion`](../benchmarks/npu/fusion), [`../benchmarks/kernels/composite/src/flash_attention.cpp`](../benchmarks/kernels/composite/src/flash_attention.cpp) | `pto_objdump`/`llvm-objdump` disassembly of a larger compiler-produced flash attention object, including block-template TileOP sequences such as `BSTART.TLOAD`, `BSTART.TMATMUL`, `BSTART.TCVT`, and `BSTART.TSTORE`. |
+| GEMM | [`gemm/gemm_avs_tile_smoke.diss`](gemm/gemm_avs_tile_smoke.diss) | [`../benchmarks/npu/vec_simd/gemm_18x128x256`](../benchmarks/npu/vec_simd/gemm_18x128x256), [`../benchmarks/kernels/composite/src/gemm.cpp`](../benchmarks/kernels/composite/src/gemm.cpp) | `llvm-objdump -dl` of a compiler-produced `gemm.o` from the Linx superproject AVS tile smoke outputs. |
+
+A compatible Linx toolchain can disassemble the objects behind these samples,
+but compiling SuperNPUBench NPU sources directly may still require frontend
+support for the block-vector builtins and tile-register inline-assembly
+constraints used by the benchmark headers.
diff --git a/samples/flash_attention/README.md b/samples/flash_attention/README.md
new file mode 100644
index 0000000..011af42
--- /dev/null
+++ b/samples/flash_attention/README.md
@@ -0,0 +1,19 @@
+# Flash Attention Sample
+
+`flash_attention_block_template.diss` is a checked-in block-template TileOP
+disassembly of a compiler-produced Linx object for a larger flash attention
+case. It is intended to show representative `BSTART.*`, `B.ARG`, `B.IOR`, and
+`B.IOTI` sequences instead of the older scalar smoke output.
+
+Related SuperNPUBench sources:
+
+| Path | Role |
+| --- | --- |
+| [`../../benchmarks/npu/fusion`](../../benchmarks/npu/fusion) | Active NPU flash-attention-style fusion benchmark suite. |
+| [`../../benchmarks/kernels/composite/src/flash_attention.cpp`](../../benchmarks/kernels/composite/src/flash_attention.cpp) | Composite flash attention benchmark entrypoint. |
+
+Regenerate the sample from an object built by a compatible Linx compiler with:
+
+```sh
+llvm-objdump -dl flash_attention.o > flash_attention_block_template.diss
+```
diff --git a/samples/flash_attention/flash_attention_block_template.diss b/samples/flash_attention/flash_attention_block_template.diss
new file mode 100644
index 0000000..b4c7487
--- /dev/null
+++ b/samples/flash_attention/flash_attention_block_template.diss
@@ -0,0 +1,926 @@
+
+generated/pto_objdump/obj/flash_attention.o:	file format elf64-linx
+
+Disassembly of section .text:
+
+0000000000000000 <pto_flash_attention>:
+       0:      FENTRY	[ra ~ s8], sp!, 208
+       4:      C.BSTART	COND, 0x1c
+       6:      c.movr	a3,	->s1
+       8:      c.movr	a1,	->s7
+       a:      sdi	a0, [sp, 40]
+       e:      c.movr	zero,	->a0
+      10:      c.setc.eq	a4, a0
+      12:      C.BSTART	DIRECT, 0x24
+      14:      c.lwi	[a4, 4],	->t
+      16:      c.sdi	t#1, [sp, 24]
+      18:      lwi	[a4, 0],	->s0
+
+000000000000001c <.LBB0_1>:
+      1c:      C.BSTART.STD
+      1e:      c.movi	5,	->s0
+      20:      sdi	s0, [sp, 24]
+
+0000000000000024 <.LBB0_3>:
+      24:      C.BSTART	COND, 0x1e0
+      26:      c.sext.w	s0,	->t
+      28:      c.movi	5,	->a1
+      2a:      c.setc.ne	t#1, a1
+      2c:      C.BSTART	COND, 0x1e0
+      2e:      c.ldi	[sp, 24],	->t
+      30:      c.sext.w	t#1,	->t
+      32:      c.setc.ne	t#1, a1
+      34:      C.BSTART.STD
+      36:      addi	a2, 64,	->a1
+      3a:      addi	zero, 16,	->x1
+      3e:      addi	zero, 1024,	->a3
+      42:      addi	zero, 64,	->a4
+
+0000000000000046 <.LBB0_6>:
+      46:      C.BSTART.STD
+      48:      slli	a0, 8,	->t
+      4c:      add	s7, t#1,	->a5
+      50:      BSTART.TLOAD	INT32
+      54:      C.B.DIMI	4, 	->lb0
+      56:      C.B.DIMI	16, 	->lb1
+      58:      B.ARG	ND2ZN.normal
+      5c:      B.IOR	[a5,x1],[]
+      60:      B.IOTI	[], last	->t<4KB>
+      64:      BSTART.TLOAD	INT32
+      68:      C.B.DIMI	4, 	->lb0
+      6a:      C.B.DIMI	4, 	->lb1
+      6c:      B.ARG	DN2NZ.normal
+      70:      B.IOR	[a2,x1],[]
+      74:      B.IOTI	[], last	->t<4KB>
+      78:      BSTART.TLOAD	INT32
+      7c:      C.B.DIMI	4, 	->lb0
+      7e:      C.B.DIMI	16, 	->lb1
+      80:      B.ARG	DN2NZ.normal
+      84:      B.IOR	[s1,a3],[]
+      88:      B.IOTI	[], last	->t<4KB>
+      8c:      BSTART.TMATMUL	INT32
+      90:      C.B.DIMI	16, 	->lb0
+      92:      C.B.DIMI	4, 	->lb1
+      94:      C.B.DIMI	4, 	->lb2
+      96:      B.IOTI	[t#1, t#2], last	->acc<4KB>
+      9a:      BSTART.ACCCVT	INT32
+      9e:      B.IOTI	[], last	->m<4KB>
+      a2:
+      a6:      B.ARG	VV
+      aa:      B.IOTI	[m#1]	->t<4KB>
+      ae:      B.IOTI	[], last	->t<4KB>
+      b2:
+      b6:      B.ARG	VV
+      ba:      B.IOTI	[t#2]	->t<4KB>
+      be:      B.IOTI	[], last	->t<4KB>
+      c2:      BSTART.TMATMUL	INT32
+      c6:      C.B.DIMI	16, 	->lb0
+      c8:      C.B.DIMI	16, 	->lb1
+      ca:      C.B.DIMI	4, 	->lb2
+      cc:      B.IOTI	[t#2, t#3], last	->acc<4KB>
+      d0:      BSTART.ACCCVT	INT32
+      d4:      B.IOTI	[], last	->m<4KB>
+      d8:      C.BSTART.STD
+      da:      slli	a0, 10,	->a5
+      de:      c.movr	a1,	->a6
+      e0:      c.movr	x1,	->a7
+
+00000000000000e2 <.LBB0_15>:
+      e2:      BSTART.TLOAD	INT32
+      e6:      C.B.DIMI	4, 	->lb0
+      e8:      C.B.DIMI	4, 	->lb1
+      ea:      B.ARG	DN2NZ.normal
+      ee:      B.IOR	[a6,x1],[]
+      f2:      B.IOTI	[], last	->t<4KB>
+      f6:      C.BSTART.STD
+      f8:      add	s1, a7,	->x0
+      fc:      BSTART.TLOAD	INT32
+     100:      C.B.DIMI	4, 	->lb0
+     102:      C.B.DIMI	16, 	->lb1
+     104:      B.ARG	DN2NZ.normal
+     108:      B.IOR	[x0,a3],[]
+     10c:      B.IOTI	[], last	->t<4KB>
+     110:      BSTART.TMATMUL	INT32
+     114:      C.B.DIMI	16, 	->lb0
+     116:      C.B.DIMI	4, 	->lb1
+     118:      C.B.DIMI	4, 	->lb2
+     11a:      B.IOTI	[t#1, t#2], last	->acc<4KB>
+     11e:      BSTART.ACCCVT	INT32
+     122:      B.IOTI	[], last	->m<4KB>
+     126:
+     12a:      B.ARG	VV
+     12e:      B.IOTI	[m#2]	->t<4KB>
+     132:      B.IOTI	[], last	->t<4KB>
+     136:
+     13a:      B.ARG	VV
+     13e:      B.IOTI	[t#2]	->t<4KB>
+     142:      B.IOTI	[], last	->t<4KB>
+     146:      BSTART.TMATMUL	INT32
+     14a:      C.B.DIMI	16, 	->lb0
+     14c:      C.B.DIMI	16, 	->lb1
+     14e:      C.B.DIMI	4, 	->lb2
+     150:      B.IOTI	[t#2, t#3], last	->acc<4KB>
+     154:      BSTART.ACCCVT	INT32
+     158:      B.IOTI	[], last	->m<4KB>
+     15c:
+     160:      B.ARG	VV
+     164:      B.IOTI	[m#1]	->t<4KB>
+     168:      B.IOTI	[], last	->t<4KB>
+     16c:
+     170:      B.ARG	VV
+     174:      B.IOTI	[m#2]	->t<4KB>
+     178:      B.IOTI	[], last	->t<4KB>
+     17c:
+     180:      B.ARG	VV
+     184:      B.IOTI	[t#2, t#3]	->t<4KB>
+     188:      B.IOTI	[], last	->t<4KB>
+     18c:      C.BSTART.STD
+     18e:      addi	a6, 64,	->a6
+     192:
+     196:      B.ARG	VV
+     19a:      B.IOTI	[t#2]	->t<4KB>
+     19e:      B.IOTI	[], last	->m<4KB>
+     1a2:      C.BSTART	COND, 0xe2
+     1a4:      addi	a7, 16,	->a7
+     1a8:      c.setc.ne	a7, a3
+     1aa:      C.BSTART.STD
+     1ac:      c.ldi	[sp, 40],	->t
+     1ae:      add	t#1, a5,	->a5
+     1b2:
+     1b6:      B.ARG	VV
+     1ba:      B.IOTI	[m#1]	->t<4KB>
+     1be:      B.IOTI	[], last	->t<4KB>
+     1c2:      BSTART.TSTORE	INT32
+     1c6:      C.B.DIMI	16, 	->lb0
+     1c8:      C.B.DIMI	16, 	->lb1
+     1ca:      B.ARG	NORM.normal
+     1ce:      B.IOR	[a5,a4],[]
+     1d2:      B.IOTI	[t#1], last	->t<4KB>
+     1d6:      C.BSTART	COND, 0x306
+     1d8:      addi	a0, 1,	->a0
+     1dc:      c.setc.eq	a0, x1
+     1de:      C.BSTART	DIRECT, 0x46
+
+00000000000001e0 <.LBB0_7>:
+     1e0:      C.BSTART.STD
+     1e2:      c.movr	zero,	->u
+     1e4:      c.movi	1,	->s3
+     1e6:      c.movi	4,	->s5
+     1e8:      c.movi	4,	->t
+     1ea:      c.sdi	t#1, [sp, 72]
+     1ec:      addi	zero, 256,	->t
+     1f0:      c.sdi	t#1, [sp, 64]
+     1f2:      addi	zero, 16,	->t
+     1f6:      c.sdi	t#1, [sp, 16]
+     1f8:      c.movr	u#1,	->s4
+     1fa:      sdi	u#1, [sp, 96]
+     1fe:      c.movr	u#1,	->a3
+     200:      sdi	s1, [sp, 56]
+     204:      sdi	a2, [sp, 80]
+
+0000000000000208 <.LBB0_8>:
+     208:      C.BSTART.STD
+     20a:      sdi	a3, [sp, 8]
+     20e:      slliw	a3, 4,	->t
+     212:      c.sdi	t#1, [sp, 32]
+     214:      ldi	[sp, 96],	->s2
+
+0000000000000218 <.LBB0_9>:
+     218:      C.BSTART.STD
+     21a:      sdi	s2, [sp, 48]
+     21e:      slliw	s2, 8,	->t
+     222:      c.sdi	t#1, [sp, 88]
+     224:      c.ldi	[sp, 96],	->t
+     226:      c.movr	t#1,	->a2
+     228:      c.movr	t#1,	->s2
+     22a:      c.movr	t#1,	->a3
+
+000000000000022c <.LBB0_10>:
+     22c:      C.BSTART.STD
+     22e:      sdi	a2, [sp, 120]
+     232:      hl.sdip	s2, a3, [sp, 104]
+     238:      ldi	[sp, 96],	->s6
+     23c:      c.movr	s6,	->s1
+     23e:      ldi	[sp, 80],	->s2
+
+0000000000000242 <.LBB0_13>:
+     242:      HL.BSTART.STD	CALL, _ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi, ra=.LBB0_20
+     24a:      addw	s4, s1,	->a2
+     24e:      c.movr	s7,	->a0
+     250:      c.movr	s0,	->a1
+
+0000000000000252 <.LBB0_20>:
+     252:      HL.BSTART.STD	CALL, _ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi, ra=.LBB0_21
+     25a:      c.movr	s3,	->s8
+     25c:      c.movr	s5,	->s3
+     25e:      c.movr	s7,	->s5
+     260:      c.movr	a0,	->s7
+     262:      ldi	[sp, 120],	->a0
+     266:      addw	a0, s1,	->a2
+     26a:      c.movr	s2,	->a0
+     26c:      c.movr	s0,	->a1
+
+000000000000026e <.LBB0_21>:
+     26e:      C.BSTART	COND, 0x242
+     270:      mulw	a0, s7,	->t
+     274:      c.movr	s5,	->s7
+     276:      c.movr	s3,	->s5
+     278:      c.movr	s8,	->s3
+     27a:      addw	s1, s3,	->s1
+     27e:      addw	t#1, s6,	->s6
+     282:      c.sext.w	s1,	->t
+     284:      c.setc.ne	t#1, s5
+     286:      HL.BSTART.STD	CALL, _ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi, ra=.LBB0_19
+     28e:      ldi	[sp, 88],	->a0
+     292:      ldi	[sp, 104],	->s2
+     296:      addw	s2, a0,	->a2
+     29a:      c.ldi	[sp, 56],	->t
+     29c:      c.movr	t#1,	->a0
+     29e:      c.movr	s0,	->a1
+
+00000000000002a0 <.LBB0_19>:
+     2a0:      C.BSTART	COND, 0x22c
+     2a2:      ldi	[sp, 72],	->u
+     2a6:      c.ldi	[sp, 120],	->t
+     2a8:      addw	t#1, u#1,	->a2
+     2ac:      mulw	a0, s6,	->u
+     2b0:      c.ldi	[sp, 112],	->t
+     2b2:      addw	u#1, t#1,	->a3
+     2b6:      addw	s2, s3,	->s2
+     2ba:      addw	s2, zero,	->u
+     2be:      c.ldi	[sp, 64],	->t
+     2c0:      c.setc.ne	u#1, t#1
+     2c2:      HL.BSTART.STD	CALL, _ZN3pto7kernels16store_scalar_i32EPvNS0_9pto_dtypeEii, ra=.LBB0_18
+     2ca:      ldi	[sp, 32],	->a0
+     2ce:      ldi	[sp, 48],	->s2
+     2d2:      addw	s2, a0,	->a2
+     2d6:      ldi	[sp, 40],	->a0
+     2da:      ldi	[sp, 24],	->a1
+
+00000000000002de <.LBB0_18>:
+     2de:      C.BSTART	COND, 0x218
+     2e0:      addw	s2, s3,	->s2
+     2e4:      addw	s2, zero,	->u
+     2e8:      c.ldi	[sp, 16],	->t
+     2ea:      c.setc.ne	u#1, t#1
+     2ec:      C.BSTART	COND, 0x208
+     2ee:      c.ldi	[sp, 72],	->t
+     2f0:      addw	s4, t#1,	->s4
+     2f4:      c.ldi	[sp, 8],	->t
+     2f6:      addw	t#1, s3,	->a3
+     2fa:      addw	a3, zero,	->u
+     2fe:      ldi	[sp, 80],	->a2
+     302:      c.ldi	[sp, 64],	->t
+     304:      c.setc.ne	u#1, t#1
+
+0000000000000306 <.LBB0_17>:
+     306:      FRET.STK	[ra ~ s8], sp!, 208
+
+Disassembly of section .text._ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi:
+
+0000000000000000 <_ZN3pto7kernels15load_scalar_i32EPKvNS0_9pto_dtypeEi>:
+       0:      FENTRY	[ra ~ ra], sp!, 32
+       4:      C.BSTART	COND, 0x40
+       6:      c.movr	a0,	->a3
+       8:      c.movr	zero,	->a0
+       a:      c.sext.w	a1,	->t
+       c:      addw	a1, zero,	->a1
+      10:      c.movi	3,	->a4
+      12:      setc.lt	a4, t#1
+      16:      C.BSTART	COND, 0x5e
+      18:      setc.eqi	a1, 1
+      1c:      C.BSTART	COND, 0x82
+      1e:      setc.eqi	a1, 2
+      22:      C.BSTART	COND, 0x134
+      24:      c.setc.ne	a1, a4
+      26:      HL.BSTART.STD	CALL, _ZN3pto17fp8_e4m3_to_floatENS_10fp8_e4m3_tE, ra=.LBB1_21
+      2e:      addw	a2, zero,	->a0
+      32:      lb	[a3, a0],	->a0
+      36:      sbi	a0, [sp, 23]
+      3a:      addi	sp, 23,	->a0
+
+000000000000003e <.LBB1_21>:
+      3e:      C.BSTART	DIRECT, 0x12e
+
+0000000000000040 <.LBB1_5>:
+      40:      C.BSTART	COND, 0x66
+      42:      setc.eqi	a1, 4
+      46:      C.BSTART	COND, 0xf8
+      48:      setc.eqi	a1, 6
+      4c:      C.BSTART	COND, 0x134
+      4e:      setc.nei	a1, 5
+      52:      C.BSTART.STD
+      54:      c.sext.w	a2,	->t
+      56:      lw	[a3, t#1<<2],	->a0
+      5a:      FRET.STK	[ra ~ ra], sp!, 32
+
+000000000000005e <.LBB1_9>:
+      5e:      C.BSTART	DIRECT, 0x12e
+      60:      c.sext.w	a2,	->t
+      62:      lwu	[a3, t#1<<2],	->a0
+
+0000000000000066 <.LBB1_10>:
+      66:      C.BSTART	DIRECT, 0x12e
+      68:      c.sext.w	a2,	->t
+      6a:      lbu	[a3, t#1],	->t
+      6e:      andi	t#1, 15,	->t
+      72:      slli	t#1, 2,	->u
+      76:      addtpc	0,	->t
+      7a:      addi	t#1, 0,	->t
+      7e:      lwu	[u#1, t#1],	->a0
+
+0000000000000082 <.LBB1_11>:
+      82:      C.BSTART	COND, 0x104
+      84:      c.sext.w	a2,	->t
+      86:      lh	[a3, t#1<<1],	->a3
+      8a:      lui	524288,	->t
+      8e:      andw	a3, t#1,	->a0
+      92:      hl.lui	65535,	->t
+      98:      andw	a3, t#1,	->a1
+      9c:      srliw	a3, 10,	->t
+      a0:      andiw	t#1, 31,	->a2
+      a4:      addw	a2, zero,	->a4
+      a8:      setc.eqi	a4, 31
+      ac:      C.BSTART	COND, 0x116
+      ae:      andiw	a3, 1023,	->a3
+      b2:      c.movr	zero,	->a5
+      b4:      c.setc.ne	a4, a5
+      b6:      C.BSTART	COND, 0x12e
+      b8:      c.sext.w	a3,	->t
+      ba:      c.setc.eq	t#1, a5
+      bc:      C.BSTART	DIRECT, 0x128
+      be:      clz	a3, 31,	->a2
+      c2:      xoriw	a2, 31,	->u
+      c6:      c.movi	9,	->t
+      c8:      subw	t#1, u#1,	->t
+      cc:      addi	zero, 32,	->u
+      d0:      sll	t#1, u#1,	->t
+      d4:      srl	t#1, u#1,	->t
+      d8:      sllw	a1, t#1,	->t
+      dc:      slliw	t#1, 14,	->u
+      e0:      lui	2044,	->t
+      e4:      andw	u#1, t#1,	->u
+      e8:      slliw	a2, 23,	->t
+      ec:      subw	u#1, t#1,	->t
+      f0:      addw	t#1, a0,	->a0
+      f4:      lui	274432,	->a1
+
+00000000000000f8 <.LBB1_15>:
+      f8:      C.BSTART.STD
+      fa:      c.sext.w	a2,	->t
+      fc:      lb	[a3, t#1],	->a0
+     100:      FRET.STK	[ra ~ ra], sp!, 32
+
+0000000000000104 <.LBB1_16>:
+     104:      C.BSTART	DIRECT, 0x12e
+     106:      slliw	a1, 13,	->t
+     10a:      orw	a0, t#1,	->u
+     10e:      lui	522240,	->t
+     112:      orw	u#1, t#1,	->a0
+
+0000000000000116 <.LBB1_17>:
+     116:      C.BSTART.STD
+     118:      slliw	a3, 13,	->t
+     11c:      orw	t#1, a2<<23,	->t
+     120:      addw	t#1, a0,	->a0
+     124:      lui	229376,	->a1
+
+0000000000000128 <.LBB1_18>:
+     128:      C.BSTART.STD
+     12a:      addw	a0, a1,	->a0
+
+000000000000012e <.LBB1_19>:
+     12e:      C.BSTART.STD
+     130:      fcvtz.fs2sw	a0,	->a0
+
+0000000000000134 <.LBB1_20>:
+     134:      FRET.STK	[ra ~ ra], sp!, 32
+
+Disassembly of section .text._ZN3pto7kernels16store_scalar_i32EPvNS0_9pto_dtypeEii:
+
+0000000000000000 <_ZN3pto7kernels16store_scalar_i32EPvNS0_9pto_dtypeEii>:
+       0:      FENTRY	[ra ~ s8], sp!, 96
+       4:      C.BSTART	COND, 0x5e
+       6:      c.sext.w	a1,	->t
+       8:      addw	a1, zero,	->a1
+       c:      c.movi	3,	->a4
+       e:      setc.lt	a4, t#1
+      12:      C.BSTART	COND, 0x7c
+      14:      setc.eqi	a1, 1
+      18:      C.BSTART	COND, 0x114
+      1a:      setc.eqi	a1, 2
+      1e:      C.BSTART	COND, 0x110
+      20:      c.setc.ne	a1, a4
+      22:      C.BSTART	COND, 0x1c4
+      24:      addw	a3, zero,	->a1
+      28:      setc.eqi	a1, 0
+      2c:      C.BSTART	COND, 0x1c8
+      2e:      c.movr	a2,	->s0
+      30:      c.movr	a0,	->s1
+      32:      c.movr	a3,	->s3
+      34:      scvtf.sw2fs	a3,	->a0
+      38:      lui	524288,	->t
+      3c:      xorw	a0, t#1,	->u
+      40:      cmp.lti	a1, 0,	->t
+      44:      csel	t#1, a0, u#1,	->s4
+      48:      hl.lwu.pcr	[.rodata.cst4+0x4],	->t
+      4e:      fge.fs	s4, t#1,	->t
+      52:      xori	t#1, 1,	->t
+      56:      c.setc.eq	t#1, zero
+      58:      C.BSTART	DIRECT, 0x212
+      5a:      c.movr	zero,	->s2
+      5c:      c.movr	s4,	->a0
+
+000000000000005e <.LBB2_4>:
+      5e:      C.BSTART	COND, 0x8c
+      60:      setc.eqi	a1, 4
+      64:      C.BSTART	COND, 0x162
+      66:      setc.eqi	a1, 6
+      6a:      C.BSTART	COND, 0x110
+      6c:      setc.nei	a1, 5
+      70:      C.BSTART.STD
+      72:      c.sext.w	a2,	->t
+      74:      sw	a3, [a0, t#1<<2]
+      78:      FRET.STK	[ra ~ s8], sp!, 96
+
+000000000000007c <.LBB2_9>:
+      7c:      C.BSTART.STD
+      7e:      scvtf.sw2fs	a3,	->u
+      82:      c.sext.w	a2,	->t
+      84:      sw	u#1, [a0, t#1<<2]
+      88:      FRET.STK	[ra ~ s8], sp!, 96
+
+000000000000008c <.LBB2_46>:
+      8c:      C.BSTART.STD
+      8e:      c.movr	a2,	->s2
+      90:      c.movr	a0,	->s1
+      92:      scvtf.sw2fs	a3,	->a0
+      96:      lui	524288,	->t
+      9a:      xorw	a0, t#1,	->u
+      9e:      c.sext.w	a3,	->t
+      a0:      cmp.lti	t#1, 0,	->t
+      a4:      csel	t#1, a0, u#1,	->s4
+      a8:      c.movr	zero,	->s3
+      aa:      c.movi	1,	->s5
+      ac:      addtpc	0,	->t
+      b0:      addi	t#1, 0,	->s6
+      b4:      addi	zero, 32,	->s7
+      b8:      sll	a0, s7,	->t
+      bc:      srl	t#1, s7,	->s0
+      c0:      addi	zero, 16,	->s8
+
+00000000000000c4 <.LBB2_47>:
+      c4:      HL.BSTART.STD	CALL, __subsf3, ra=.LBB2_58
+      cc:      lwi	[s6, 0],	->a0
+      d0:      sll	a0, s7,	->a0
+      d4:      srl	a0, s7,	->a1
+      d8:      c.movr	s0,	->a0
+
+00000000000000da <.LBB2_58>:
+      da:      C.BSTART	COND, 0xc4
+      dc:      lui	524288,	->t
+      e0:      xorw	a0, t#1,	->u
+      e4:      hl.lwu.pcr	[.rodata.cst4],	->t
+      ea:      flt.fs	a0, t#1,	->t
+      ee:      csel	t#1, a0, u#1,	->u
+      f2:      flt.fs	u#1, s4,	->t
+      f6:      csel	t#1, s4, u#1,	->s4
+      fa:      csel	t#1, s3, s5,	->s3
+      fe:      addi	s6, 4,	->s6
+     102:      addi	s5, 1,	->s5
+     106:      c.setc.ne	s5, s8
+     108:      C.BSTART.STD
+     10a:      c.sext.w	s2,	->t
+     10c:      sb	s3, [s1, t#1]
+
+0000000000000110 <.LBB2_49>:
+     110:      FRET.STK	[ra ~ s8], sp!, 96
+
+0000000000000114 <.LBB2_10>:
+     114:      C.BSTART	COND, 0x16e
+     116:      scvtf.sw2fs	a3,	->u
+     11a:      srliw	u#1, 16,	->u
+     11e:      lui	8,	->t
+     122:      andw	u#1, t#1,	->a1
+     126:      hl.lui	8388607,	->t
+     12c:      andw	u#2, t#1,	->a3
+     130:      srliw	u#2, 23,	->a5
+     134:      andiw	a5, 255,	->a4
+     138:      addw	a4, zero,	->a7
+     13c:      setc.nei	a7, 255
+     140:      C.BSTART.STD
+     142:      c.sext.w	a3,	->t
+     144:      cmp.eqi	t#1, 0,	->a3
+     148:      addiw	zero, 512,	->u
+     14c:      c.movr	zero,	->t
+     14e:      csel	a3, u#1, t#1,	->t
+     152:      orw	t#1, a1,	->a1
+
+0000000000000156 <.LBB2_16>:
+     156:      C.BSTART	DIRECT, 0x3aa
+     158:      hl.lui	31744,	->t
+     15e:      orw	a1, t#1,	->a1
+
+0000000000000162 <.LBB2_8>:
+     162:      C.BSTART.STD
+     164:      c.sext.w	a2,	->t
+     166:      sb	a3, [a0, t#1]
+     16a:      FRET.STK	[ra ~ s8], sp!, 96
+
+000000000000016e <.LBB2_12>:
+     16e:      C.BSTART	COND, 0x2d6
+     170:      addi	zero, 32,	->a6
+     174:      sll	a4, a6,	->t
+     178:      srl	t#1, a6,	->x0
+     17c:      setc.geui	x0, 113
+     180:      C.BSTART	COND, 0x3aa
+     182:      setc.ltui	x0, 102
+     186:      C.BSTART	DIRECT, 0x3a4
+     188:      lui	2048,	->t
+     18c:      orw	a3, t#1,	->a3
+     190:      addiw	zero, 126,	->t
+     194:      subw	t#1, a4,	->u
+     198:      addiw	zero, 125,	->t
+     19c:      subw	t#1, a4,	->t
+     1a0:      sll	t#1, a6,	->t
+     1a4:      srl	t#1, a6,	->u
+     1a8:      sll	u#2, a6,	->t
+     1ac:      srl	t#1, a6,	->t
+     1b0:      srlw	a3, t#1,	->u
+     1b4:      srlw	a3, u#2,	->t
+     1b8:      andiw	t#1, 1,	->t
+     1bc:      addw	t#1, u#1,	->t
+     1c0:      andiw	t#1, 1023,	->a3
+
+00000000000001c4 <.LBB2_24>:
+     1c4:      C.BSTART	DIRECT, 0x3de
+     1c6:      c.movr	zero,	->a1
+
+00000000000001c8 <.LBB2_31>:
+     1c8:      C.BSTART.STD
+     1ca:      c.movr	zero,	->s8
+     1cc:      addi	zero, 32,	->s5
+     1d0:      c.movi	1,	->s6
+     1d2:      addi	zero, 29,	->s7
+     1d6:      c.movr	s4,	->a0
+
+00000000000001d8 <.LBB2_32>:
+     1d8:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB2_57
+     1e0:      sll	a0, s5,	->a0
+     1e4:      srl	a0, s5,	->a0
+     1e8:      lui	258048,	->a1
+
+00000000000001ec <.LBB2_57>:
+     1ec:      C.BSTART	COND, 0x212
+     1ee:      addw	s8, s6,	->s2
+     1f2:      hl.lwu.pcr	[.rodata.cst4+0x4],	->t
+     1f8:      fge.fs	a0, t#1,	->t
+     1fc:      xori	t#1, 1,	->t
+     200:      c.setc.ne	t#1, zero
+     202:      C.BSTART	COND, 0x1d8
+     204:      sll	s8, s5,	->t
+     208:      srl	t#1, s5,	->t
+     20c:      setc.ltu	t#1, s7
+     210:      c.movr	s2,	->s8
+
+0000000000000212 <.LBB2_27>:
+     212:      C.BSTART	COND, 0x2c6
+     214:      hl.lwu.pcr	[.rodata.cst4+0x8],	->t
+     21a:      flt.fs	a0, t#1,	->t
+     21e:      xori	t#1, 1,	->t
+     222:      c.setc.ne	t#1, zero
+     224:      C.BSTART.STD
+     226:      addi	zero, 32,	->s5
+     22a:      c.movi	-1,	->s6
+     22c:      subi	zero, 29,	->s7
+
+0000000000000230 <.LBB2_29>:
+     230:      HL.BSTART.STD	CALL, __addsf3, ra=.LBB2_56
+     238:      sll	a0, s5,	->a0
+     23c:      srl	a0, s5,	->a0
+     240:      c.movr	a0,	->a1
+
+0000000000000242 <.LBB2_56>:
+     242:      C.BSTART	COND, 0x262
+     244:      addw	s2, s6,	->a1
+     248:      hl.lwu.pcr	[.rodata.cst4+0x8],	->t
+     24e:      flt.fs	a0, t#1,	->t
+     252:      xori	t#1, 1,	->t
+     256:      c.setc.ne	t#1, zero
+     258:      C.BSTART	COND, 0x230
+     25a:      c.sext.w	s2,	->t
+     25c:      setc.lt	s7, t#1
+     260:      c.movr	a1,	->s2
+
+0000000000000262 <.LBB2_34>:
+     262:      C.BSTART	COND, 0x30c
+     264:      c.movi	7,	->s2
+     266:      c.movi	1,	->t
+     268:      addw	a1, t#1,	->u
+     26c:      sraiw	s3, 31,	->t
+     270:      andiw	t#1, -128,	->s3
+     274:      addw	u#1, zero,	->u
+     278:      c.movi	-6,	->t
+     27a:      setc.lt	t#1, u#1
+     27e:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB2_54
+     286:      sll	s4, s5,	->a0
+     28a:      srl	a0, s5,	->a0
+     28e:      lui	278528,	->a1
+
+0000000000000292 <.LBB2_54>:
+     292:      HL.BSTART.STD	CALL, __addsf3, ra=.LBB2_55
+     29a:      sll	a0, s5,	->a0
+     29e:      srl	a0, s5,	->a0
+     2a2:      lui	258048,	->a1
+
+00000000000002a6 <.LBB2_55>:
+     2a6:      C.BSTART	DIRECT, 0x326
+     2a8:      fcvtz.fs2sw	a0,	->a0
+     2ac:      c.sext.w	a0,	->t
+     2ae:      cmp.gei	t#1, 1,	->u
+     2b2:      c.movr	zero,	->t
+     2b4:      csel	u#1, t#1, a0,	->u
+     2b8:      c.sext.w	u#1,	->t
+     2ba:      cmp.lti	t#1, 7,	->t
+     2be:      csel	t#1, s2, u#1,	->t
+     2c2:      orw	s3, t#1,	->a1
+
+00000000000002c6 <.LBB2_37>:
+     2c6:      C.BSTART	DIRECT, 0x314
+     2c8:      c.movi	7,	->t
+     2ca:      addw	s2, t#1,	->s4
+     2ce:      sraiw	s3, 31,	->t
+     2d2:      andiw	t#1, -128,	->s3
+
+00000000000002d6 <.LBB2_15>:
+     2d6:      C.BSTART	COND, 0x156
+     2d8:      setc.geui	x0, 143
+     2dc:      C.BSTART	COND, 0x38e
+     2de:      sll	a3, a6,	->t
+     2e2:      srl	t#1, a6,	->u
+     2e6:      lui	2047,	->t
+     2ea:      setc.geu	u#1, t#1
+     2ee:      C.BSTART	DIRECT, 0x3a4
+     2f0:      slliw	a4, 10,	->u
+     2f4:      lui	4,	->t
+     2f8:      addw	u#1, t#1,	->u
+     2fc:      lui	1,	->t
+     300:      addw	a3, t#1,	->t
+     304:      srliw	t#1, 13,	->t
+     308:      orw	t#1, u#1,	->a3
+
+000000000000030c <.LBB2_35>:
+     30c:      C.BSTART.STD
+     30e:      addw	a1, s2,	->s4
+     312:      c.movr	a1,	->s2
+
+0000000000000314 <.LBB2_38>:
+     314:      C.BSTART	COND, 0x32c
+     316:      addw	s2, zero,	->s5
+     31a:      c.movi	8,	->s6
+     31c:      setc.lt	s5, s6
+     320:      C.BSTART.STD
+     322:      oriw	s3, 126,	->a1
+
+0000000000000326 <.LBB2_40>:
+     326:      C.BSTART	DIRECT, 0x3de
+     328:      c.movr	s1,	->a0
+     32a:      c.movr	s0,	->a2
+
+000000000000032c <.LBB2_41>:
+     32c:      HL.BSTART.STD	CALL, __addsf3, ra=.LBB2_51
+     334:      addi	zero, 32,	->s7
+     338:      hl.lui	3212836864,	->a1
+     33e:      bxu	a1, 31,	->a1
+     342:      sll	a0, s7,	->a0
+     346:      srl	a0, s7,	->a0
+
+000000000000034a <.LBB2_51>:
+     34a:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB2_52
+     352:      sll	a0, s7,	->a0
+     356:      srl	a0, s7,	->a0
+     35a:      lui	266240,	->a1
+
+000000000000035e <.LBB2_52>:
+     35e:      HL.BSTART.STD	CALL, __addsf3, ra=.LBB2_53
+     366:      sll	a0, s7,	->a0
+     36a:      srl	a0, s7,	->a0
+     36e:      lui	258048,	->a1
+
+0000000000000372 <.LBB2_53>:
+     372:      C.BSTART	COND, 0x3c0
+     374:      fcvtz.fs2sw	a0,	->a1
+     378:      c.sext.w	a1,	->t
+     37a:      setc.lt	t#1, s6
+     37e:      c.movr	s1,	->a0
+     380:      c.movr	s0,	->a2
+     382:      C.BSTART	COND, 0x3b6
+     384:      setc.nei	s5, 7
+     388:      C.BSTART	DIRECT, 0x3de
+     38a:      oriw	s3, 126,	->a1
+
+000000000000038e <.LBB2_18>:
+     38e:      C.BSTART	COND, 0x156
+     390:      setc.eqi	a7, 142
+     394:      C.BSTART.STD
+     396:      slliw	a5, 10,	->u
+     39a:      hl.lui	17408,	->t
+     3a0:      addw	u#1, t#1,	->a3
+
+00000000000003a4 <.LBB2_21>:
+     3a4:      C.BSTART.STD
+     3a6:      orw	a3, a1,	->a1
+
+00000000000003aa <.LBB2_22>:
+     3aa:      C.BSTART.STD
+     3ac:      c.sext.w	a2,	->t
+     3ae:      sh	a1, [a0, t#1<<1]
+     3b2:      FRET.STK	[ra ~ s8], sp!, 96
+
+00000000000003b6 <.LBB2_43>:
+     3b6:      C.BSTART.STD
+     3b8:      c.movr	zero,	->a1
+     3ba:      c.movi	8,	->t
+     3bc:      addw	s2, t#1,	->s4
+
+00000000000003c0 <.LBB2_44>:
+     3c0:      C.BSTART.STD
+     3c2:      c.sext.w	a1,	->t
+     3c4:      cmp.gei	t#1, 1,	->u
+     3c8:      c.movr	zero,	->t
+     3ca:      csel	u#1, t#1, a1,	->u
+     3ce:      andiw	s3, 255,	->u
+     3d2:      slliw	s4, 3,	->t
+     3d6:      orw	t#1, u#1,	->t
+     3da:      orw	t#1, u#2,	->a1
+
+00000000000003de <.LBB2_45>:
+     3de:      C.BSTART.STD
+     3e0:      c.sext.w	a2,	->t
+     3e2:      sb	a1, [a0, t#1]
+     3e6:      FRET.STK	[ra ~ s8], sp!, 96
+
+Disassembly of section .text._ZN3pto17fp8_e4m3_to_floatENS_10fp8_e4m3_tE:
+
+0000000000000000 <_ZN3pto17fp8_e4m3_to_floatENS_10fp8_e4m3_tE>:
+       0:      FENTRY	[ra ~ s5], sp!, 64
+       4:      C.BSTART	COND, 0x5a
+       6:      lbui	[a0, 0],	->u
+       a:      slliw	u#1, 24,	->t
+       e:      sraiw	t#1, 24,	->t
+      12:      c.sext.w	t#1,	->t
+      14:      cmp.gei	t#1, 0,	->u
+      18:      c.movr	zero,	->s1
+      1a:      c.movi	4,	->t
+      1c:      csel	u#1, s1, t#1,	->u
+      20:      addtpc	0,	->t
+      24:      addi	t#1, 0,	->t
+      28:      lwu	[t#1, u#1],	->s0
+      2c:      andiw	u#3, 7,	->s3
+      30:      srliw	u#3, 3,	->t
+      34:      andiw	t#1, 15,	->s4
+      38:      c.sext.w	s4,	->t
+      3a:      c.setc.eq	t#1, s1
+      3c:      C.BSTART	COND, 0xe0
+      3e:      setc.nei	s4, 15
+      42:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_27
+      4a:      addi	zero, 32,	->a0
+      4e:      bxu	s0, 31,	->a0
+      52:      lui	276224,	->a1
+
+0000000000000056 <.LBB3_27>:
+      56:      FRET.STK	[ra ~ s5], sp!, 64
+
+000000000000005a <.LBB3_3>:
+      5a:      C.BSTART	COND, 0x1be
+      5c:      c.sext.w	s3,	->t
+      5e:      c.setc.eq	t#1, s1
+      60:      C.BSTART.STD
+      62:      hl.lwu.pcr	[.rodata.cst4+0xc],	->a0
+      68:      c.movi	6,	->s4
+      6a:      addi	zero, 32,	->s2
+      6e:      c.movi	-1,	->s5
+
+0000000000000070 <.LBB3_5>:
+      70:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_26
+      78:      sll	a0, s2,	->a0
+      7c:      srl	a0, s2,	->a0
+      80:      lui	258048,	->a1
+
+0000000000000084 <.LBB3_26>:
+      84:      C.BSTART	COND, 0x70
+      86:      addw	s4, s5,	->s4
+      8a:      c.sext.w	s4,	->t
+      8c:      c.setc.ne	t#1, s1
+      8e:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_23
+      96:      ucvtf.uw2fs	s3,	->a1
+      9a:      sll	a1, s2,	->a1
+      9e:      srl	a1, s2,	->a2
+      a2:      lui	253952,	->a1
+      a6:      c.movr	a0,	->s1
+      a8:      c.movr	a2,	->a0
+
+00000000000000aa <.LBB3_23>:
+      aa:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_24
+      b2:      sll	s0, s2,	->a1
+      b6:      srl	a1, s2,	->a2
+      ba:      sll	a0, s2,	->a0
+      be:      srl	a0, s2,	->a1
+      c2:      c.movr	a2,	->a0
+
+00000000000000c4 <.LBB3_24>:
+      c4:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_25
+      cc:      sll	s1, s2,	->a1
+      d0:      sll	a0, s2,	->a0
+      d4:      srl	a1, s2,	->a1
+      d8:      srl	a0, s2,	->a0
+
+00000000000000dc <.LBB3_25>:
+      dc:      FRET.STK	[ra ~ s5], sp!, 64
+
+00000000000000e0 <.LBB3_7>:
+      e0:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_20
+      e8:      ucvtf.uw2fs	s3,	->a0
+      ec:      addi	zero, 32,	->s2
+      f0:      sll	a0, s2,	->a0
+      f4:      srl	a0, s2,	->a0
+      f8:      lui	253952,	->a1
+
+00000000000000fc <.LBB3_20>:
+      fc:      HL.BSTART.STD	CALL, __addsf3, ra=.LBB3_21
+     104:      sll	a0, s2,	->a0
+     108:      srl	a0, s2,	->a0
+     10c:      lui	260096,	->a1
+
+0000000000000110 <.LBB3_21>:
+     110:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_22
+     118:      sll	s0, s2,	->a1
+     11c:      srl	a1, s2,	->a2
+     120:      sll	a0, s2,	->a0
+     124:      srl	a0, s2,	->a1
+     128:      c.movr	a2,	->a0
+
+000000000000012a <.LBB3_22>:
+     12a:      C.BSTART	COND, 0x174
+     12c:      c.movr	a0,	->s0
+     12e:      c.movi	-7,	->t
+     130:      addw	s4, t#1,	->s3
+     134:      sll	s4, s2,	->t
+     138:      srl	t#1, s2,	->t
+     13c:      setc.ltui	t#1, 7
+     140:      C.BSTART	COND, 0x1a0
+     142:      hl.lwu.pcr	[.rodata.cst4+0xc],	->a0
+     148:      c.sext.w	s3,	->t
+     14a:      c.setc.eq	t#1, s1
+     14c:      C.BSTART.STD
+     14e:      hl.lwu.pcr	[.rodata.cst4+0xc],	->a0
+     154:      c.movi	-1,	->s4
+
+0000000000000156 <.LBB3_12>:
+     156:      HL.BSTART.STD	CALL, __addsf3, ra=.LBB3_19
+     15e:      sll	a0, s2,	->a0
+     162:      srl	a0, s2,	->a0
+     166:      c.movr	a0,	->a1
+
+0000000000000168 <.LBB3_19>:
+     168:      C.BSTART	COND, 0x1a0
+     16a:      addw	s3, s4,	->s3
+     16e:      c.sext.w	s3,	->t
+     170:      c.setc.eq	t#1, s1
+     172:      C.BSTART	DIRECT, 0x156
+
+0000000000000174 <.LBB3_8>:
+     174:      C.BSTART.STD
+     176:      hl.lwu.pcr	[.rodata.cst4+0xc],	->a0
+     17c:      c.movi	1,	->s1
+     17e:      c.movi	1,	->s4
+
+0000000000000180 <.LBB3_9>:
+     180:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_18
+     188:      sll	a0, s2,	->a0
+     18c:      srl	a0, s2,	->a0
+     190:      lui	258048,	->a1
+
+0000000000000194 <.LBB3_18>:
+     194:      C.BSTART	COND, 0x180
+     196:      addw	s3, s1,	->s3
+     19a:      c.sext.w	s3,	->t
+     19c:      c.cmp.eqi	0,	->t
+     19e:      c.setc.ne	t#1, s4
+
+00000000000001a0 <.LBB3_13>:
+     1a0:      HL.BSTART.STD	CALL, __mulsf3, ra=.LBB3_17
+     1a8:      sll	s0, s2,	->a1
+     1ac:      srl	a1, s2,	->a2
+     1b0:      sll	a0, s2,	->a0
+     1b4:      srl	a0, s2,	->a1
+     1b8:      c.movr	a2,	->a0
+
+00000000000001ba <.LBB3_17>:
+     1ba:      FRET.STK	[ra ~ s5], sp!, 64
+
+00000000000001be <.LBB3_16>:
+     1be:      C.BSTART.STD
+     1c0:      lui	524288,	->t
+     1c4:      andw	s0, t#1,	->a0
+     1c8:      FRET.STK	[ra ~ s5], sp!, 64
diff --git a/samples/gemm/README.md b/samples/gemm/README.md
new file mode 100644
index 0000000..f80c5b2
--- /dev/null
+++ b/samples/gemm/README.md
@@ -0,0 +1,18 @@
+# GEMM Sample
+
+`gemm_avs_tile_smoke.diss` is a checked-in `llvm-objdump -dl` disassembly of a
+compiler-produced Linx object for an AVS tile smoke GEMM case.
+
+Related SuperNPUBench sources:
+
+| Path | Role |
+| --- | --- |
+| [`../../benchmarks/npu/vec_simd/gemm_18x128x256`](../../benchmarks/npu/vec_simd/gemm_18x128x256) | Active NPU GEMM benchmark case. |
+| [`../../benchmarks/kernels/composite/src/gemm.cpp`](../../benchmarks/kernels/composite/src/gemm.cpp) | Composite GEMM benchmark entrypoint. |
+| [`../../benchmarks/kernels/gemm/matmul`](../../benchmarks/kernels/gemm/matmul) | Matmul/GEMM kernel benchmark suite. |
+
+To regenerate it, disassemble a `gemm.o` built by a compatible Linx compiler:
+
+```sh
+llvm-objdump -dl gemm.o > gemm_avs_tile_smoke.diss
+```
diff --git a/samples/gemm/gemm_avs_tile_smoke.diss b/samples/gemm/gemm_avs_tile_smoke.diss
new file mode 100644
index 0000000..a873ec9
--- /dev/null
+++ b/samples/gemm/gemm_avs_tile_smoke.diss
@@ -0,0 +1,48 @@
+
+generated/avs-tile-smoke/compiler/avs/obj/gemm.o:	file format elf64-linx
+
+Disassembly of section .text:
+
+0000000000000000 <gemm_i32>:
+; gemm_i32():
+       0: 41 00 d5 0a  FENTRY	[ra ~ s2], sp!, 40
+       4: 00 08        C.BSTART.STD
+       6: 06 28        c.movr	zero,	->a3
+       8: 15 03 00 04  addi	zero, 64,	->a4
+       c: 95 03 00 01  addi	zero, 16,	->a5
+      10: 46 41        c.movr	a3,	->a6
+
+0000000000000012 <.LBB0_1>:
+; .LBB0_1():
+      12: 00 08        C.BSTART.STD
+      14: 95 7f 64 00  slli	a6, 6,	->t
+      18: 85 04 82 07  add	a2, t#1,	->a7
+      1c: c6 a0        c.movr	a1,	->x0
+      1e: 46 a9        c.movr	a3,	->x1
+
+0000000000000020 <.LBB0_2>:
+; .LBB0_2():
+      20: 00 08        C.BSTART.STD
+      22: 09 ab 54 17  lw	[a7, x1<<2],	->x2
+      26: 95 ff 2a 00  slli	x1, 2,	->t
+      2a: 85 8b 84 07  add	a7, t#1,	->x3
+      2e: 46 59        c.movr	a3,	->s0
+
+0000000000000030 <.LBB0_5>:
+; .LBB0_5():
+      30: 04 00        C.BSTART	COND, 0x30
+      32: 09 2f b1 06  lw	[a0, s0],	->u
+      36: 89 2f ba 06  lw	[x0, s0],	->t
+      3a: 47 7b cc b1  maddw	t#1, u#1, x2,	->x2
+      3e: 95 85 45 00  addi	s0, 4,	->s0
+      42: f6 32        c.setc.ne	s0, a4
+      44: e4 fe        C.BSTART	COND, 0x20
+      46: 59 20 7b 01  swi	x2, [x3, 0]
+      4a: 15 0a 0a 04  addi	x0, 64,	->x0
+      4e: 95 8a 1a 00  addi	x1, 1,	->x1
+      52: 76 3d        c.setc.ne	x1, a5
+      54: f4 fd        C.BSTART	COND, 0x12
+      56: 15 01 01 04  addi	a0, 64,	->a0
+      5a: 15 04 14 00  addi	a6, 1,	->a6
+      5e: 36 3a        c.setc.ne	a6, a5
+      60: 41 30 d5 0a  FRET.STK	[ra ~ s2], sp!, 40
diff --git a/test/README.md b/test/README.md
deleted file mode 100644
index ce041c4..0000000
--- a/test/README.md
+++ /dev/null
@@ -1,101 +0,0 @@
-# Test Navigation
-
-The `test` tree contains focused API tests, kernel and accelerator suites,
-Python golden-comparison tests, and batch scripts. Most make-driven suites
-reuse [`common/Makefile.common`](common/Makefile.common), so the same
-`TESTCASE`, `PLAT`, `COMPILER_DIR`, and `QEMU` variables work across many
-directories.
-
-## Directory Map
-
-| Path | Use it for |
-| --- | --- |
-| [`common`](common) | Shared make rules, platform flags, output layout, and simulator targets. |
-| [`tileop_api`](tileop_api) | Small TileOP API tests. This is the best first stop for validating an individual API operation. |
-| [`py_api`](py_api) | Python extension build and golden-comparison tests. |
-| [`accelerator`](accelerator) | Accelerator-oriented suites such as cube, vector, DMA, fusion, and versioned target tests. |
-| [`kernel`](kernel) | Kernel suites for control, element-wise, fusion, GEMM, memory, reduction, sort, and related cases. |
-| [`other`](other) | Additional model, microbenchmark, TileOP, vector, and script-driven suites. |
-| [`script`](script) | Recursive compile/run helper for larger batch workflows. |
-
-## Common Build Pattern
-
-```sh
-cd test/tileop_api
-make clean
-make TESTCASE=TAdd PLAT=cpu COMPILER_DIR=/usr/bin
-make TESTCASE=TAdd PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin
-make TESTCASE=TAdd PLAT=linx QEMU=/path/to/qemu-linx sim
-```
-
-Platform values:
-
-| Platform | Backend |
-| --- | --- |
-| `PLAT=cpu` | CPU simulation backend with `__cpu_sim__`. |
-| `PLAT=linx` | Linx target backend with `__linx`. |
-| `PLAT=arm_sme` | Arm SME-oriented backend with `__ARM_FEATURE_SME`. |
-
-Common targets:
-
-```sh
-make TESTCASE=<case> all
-make TESTCASE=<case> diss
-make TESTCASE=<case> sim
-make TESTCASE=<case> debug
-make clean
-make clean_all
-```
-
-Build products are written below the repository-level `output/` directory.
-
-## Batch Runs
-
-Several suites include a local `compile.all` file. Run it from the suite
-directory:
-
-```sh
-cd test/tileop_api && bash compile.all
-cd test/py_api && bash compile.all
-cd test/kernel/gemm/matmul && bash compile.all
-cd test/accelerator/vec_simt && bash compile.all
-```
-
-For recursive compile/run automation, see [`script/README.md`](script/README.md).
-
-## Python Golden Comparison
-
-```sh
-cd test/py_api
-make clean
-make TESTCASE=tileop_py
-python3 golden_cmp/golden_cmp.py -i tadd
-```
-
-For adding golden-comparison cases, see
-[`py_api/golden_cmp/README.md`](py_api/golden_cmp/README.md).
-
-## Adding A Test Case
-
-For an existing make-driven suite:
-
-1. Add the source file under that suite's `src/` directory.
-2. Set `SRC_FILE`, `TARGET`, and any suite-specific variables in the local
-   `Makefile`.
-3. Include [`common/Makefile.common`](common/Makefile.common).
-4. Add the case to the local `compile.all` file if it belongs in batch runs.
-
-Minimal local makefile shape:
-
-```make
-SRC_FILE += $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp
-TARGET = $(ELF_HEAD)_$(TESTCASE).elf
-include ../common/Makefile.common
-```
-
-Adjust the relative include path when the suite is nested more deeply.
-
-For a new suite, create a directory with `src/`, a small local `Makefile`, and
-an optional `compile.all` batch entrypoint.
-
-Back to the repository overview: [`../README.md`](../README.md).
diff --git a/test/accelerator/vec_simt/compile.all b/test/accelerator/vec_simt/compile.all
deleted file mode 100755
index 0991a56..0000000
--- a/test/accelerator/vec_simt/compile.all
+++ /dev/null
@@ -1,5 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=accel_hashtable_insert_cmp_host
-make TESTCASE=accel_hashtable_lookup_cmp_host
-make TESTCASE=hashfind
diff --git a/test/accelerator/vec_simt/hashfind/data_obj/.gitignore b/test/accelerator/vec_simt/hashfind/data_obj/.gitignore
deleted file mode 100644
index b72b9e3..0000000
--- a/test/accelerator/vec_simt/hashfind/data_obj/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-*.s
-*.o
\ No newline at end of file
diff --git a/test/kernel/control/compile.all b/test/kernel/control/compile.all
deleted file mode 100755
index 9772a2f..0000000
--- a/test/kernel/control/compile.all
+++ /dev/null
@@ -1,12 +0,0 @@
-#! /bin/bash
-for debug in on off; do
-    if [[ "$debug" == "on" ]]; then
-        debug_define=""
-    else
-        debug_define="-DFOR_GFSIM"
-    fi
-    make TESTCASE=hashtable_lookup_simt SUFFIX=_kNum6144_kNumThreads6144_kMaxProbe512_break_debug_${debug} EXTRA_DEFINES="-DkNum=6144 -DkNumThreads=6144 -DMAX_PROBE=512 ${debug_define}" diss
-    for num_col in 256 512 1024; do
-        make TESTCASE=hashtable_lookup_simd SUFFIX=_kNum6144_kMaxProbe512_knum_col${num_col}_debug_${debug} EXTRA_DEFINES="-DkNum=6144 -DMAX_PROBE=512 -DNUM_COL=${num_col} ${debug_define}" diss
-    done
-done
diff --git a/test/kernel/control/hashtable_lookup_simd/data_obj/.gitignore b/test/kernel/control/hashtable_lookup_simd/data_obj/.gitignore
deleted file mode 100644
index b72b9e3..0000000
--- a/test/kernel/control/hashtable_lookup_simd/data_obj/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-*.s
-*.o
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile.sh b/test/kernel/orther/accelerator_compile.sh
deleted file mode 100755
index 414d95a..0000000
--- a/test/kernel/orther/accelerator_compile.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-#! /bin/bash
-
-./accelerator_compile_new/compile_matmul.all
-./accelerator_compile_new/compile_matmul_reuseA.all
-./accelerator_compile_new/compile_matmul_reuseB.all
-./accelerator_compile_new/compile_matmul_reuseAB.all
-
-./accelerator_compile_new/compile_matmul_dynamic.all
-./accelerator_compile_new/compile_matmul_dynamic_reuseA.all
-./accelerator_compile_new/compile_matmul_dynamic_reuseB.all
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul.all b/test/kernel/orther/accelerator_compile/compile_matmul.all
deleted file mode 100755
index 03a6f46..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8 M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic.all
deleted file mode 100755
index b7dc3b1..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuse.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuse.all
deleted file mode 100644
index 48de0d9..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuse.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSE M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseA.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseA.all
deleted file mode 100755
index a0ada86..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseA.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseB.all b/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseB.all
deleted file mode 100755
index 9fdd451..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_dynamic_reuseB.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_DYNAMIC_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_reuseA.all b/test/kernel/orther/accelerator_compile/compile_matmul_reuseA.all
deleted file mode 100755
index f5a39cb..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_reuseA.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEA M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_reuseAB.all b/test/kernel/orther/accelerator_compile/compile_matmul_reuseAB.all
deleted file mode 100755
index c313c6f..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_reuseAB.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEAB M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/orther/accelerator_compile/compile_matmul_reuseB.all b/test/kernel/orther/accelerator_compile/compile_matmul_reuseB.all
deleted file mode 100755
index 893c7fe..0000000
--- a/test/kernel/orther/accelerator_compile/compile_matmul_reuseB.all
+++ /dev/null
@@ -1,110 +0,0 @@
-#! /bin/bash
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=256   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=2048   K=2048  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=2048  N=2048   K=2048  tM=64 tK=64 tN=64
-
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=256 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=256 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=256
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=256
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=256 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=256 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=256 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=256 tK=64 tN=64
-
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=256  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=256   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=256  N=777   K=777  tM=64 tK=64 tN=64
-make TESTCASE=matmul MODE=MASK_FP8_REUSEB M=777  N=777   K=777  tM=64 tK=64 tN=64
\ No newline at end of file
diff --git a/test/kernel/sort/topk/.gitignore b/test/kernel/sort/topk/.gitignore
deleted file mode 100644
index f406623..0000000
--- a/test/kernel/sort/topk/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-*.o
-*.s
\ No newline at end of file
diff --git a/test/other/scripts/bench_all.sh b/test/other/scripts/bench_all.sh
deleted file mode 100755
index d0cf362..0000000
--- a/test/other/scripts/bench_all.sh
+++ /dev/null
@@ -1,42 +0,0 @@
-#!/bin/bash
-set -e
-set -x
-set -o pipefail
-
-cd $(dirname $0)/../..
-
-export CC_OPT=default
-
-rm -rf output/matmul_compile_output
-python3 test/ascpp/run_compile.py test/ascpp/matmul -o output/matmul_compile_output    -f test/ascpp/filter.conf | tee bench.log
-Total=$(grep "Total test cases:" bench.log | awk '{print $4}')
-PASS=$(grep "Suaccelssful:" bench.log | awk '{print $2}')
-if [[ x"$Total" != x"$PASS" ]]; then
-  exit 1
-fi
-
-rm -rf output/fa_normal_compile_output
-python3 test/ascpp/run_compile.py test/ascpp/fa     -o output/fa_normal_compile_output -f test/ascpp/filter.conf | tee bench.log
-Total=$(grep "Total test cases:" bench.log | awk '{print $4}')
-PASS=$(grep "Suaccelssful:" bench.log | awk '{print $2}')
-if [[ x"$Total" != x"$PASS" ]]; then
-  exit 1
-fi
-
-python3 test/scripts/run_compile.py
-ERRS=$(grep fail: cm_log/compile_summary.log | awk '{print $2}')
-PASS=$(($ERRS <= 4))
-if [[ x"$PASS" != x"1" ]]; then
-  cat cm_log/compile_summary.log
-  exit 1
-fi
-
-# ELF_LIST="output/tileop_test/elf/*.elf output/lmbench/elf/*.elf output/kernel/elf/*.elf output/deepseek/elf/*.elf"
-# 
-# realpath $ELF_LIST > tmp.list
-# 
-# if [[ -f $QEMU ]]; then
-#   ARGS="$ARGS -m $QEMU"
-# fi
-# python3 test/scripts/run_qemu.py -i tmp.list -o cm_log/qemu_run.log $ARGS
-# rm -f tmp.list
diff --git a/test/other/tileop_api/src/MatMul_e4m3.cpp b/test/other/tileop_api/src/MatMul_e4m3.cpp
deleted file mode 100644
index 7237b96..0000000
--- a/test/other/tileop_api/src/MatMul_e4m3.cpp
+++ /dev/null
@@ -1,89 +0,0 @@
-#include <common/pto_tileop.hpp>
-#include "../data.hpp"
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <typename TA, typename TB>
-void __vec__ test_cvt(typename TA::TileDType __out__ a,
-                      typename TB::TileDType __in__ b) {
-  using AType = typename TA::DType;
-  using BType = typename TB::DType;
-  BType *pb = blkv_get_tile_ptr(b);
-  AType *pa = blkv_get_tile_ptr(a);
-  int x = blkv_get_index_x();
-  int y = blkv_get_index_y();
-  int idx = index<TA>(y, x);
-  AType o = (AType)(pb[idx]);
-  pa[idx] = o;
-}
-
-template <uint16_t M, uint16_t N, uint16_t K>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
-  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
-  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
-
-  using tile_shape_A = TileLeft<float, M, K>;
-  using tile_shape_B = TileRight<float, K, N>;
-  using tile_shape_C = TileAcc<float, M, N>;
-  using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>;
-  using tile_shape_LB = TileRight<__fp8_e4m3, K, N>;
-
-  gm_shape_A s0(src0);
-  gm_shape_B s1(src1);
-  gm_shape_C res(dst);
-
-  tile_shape_A d0;
-  tile_shape_B d1;
-  tile_shape_C d2;
-  tile_shape_LA lda;
-  tile_shape_LB ldb;
-
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  test_cvt<tile_shape_LA, tile_shape_A><<<M, K, 1>>>(lda.data(), d0.data());
-  test_cvt<tile_shape_LB, tile_shape_B><<<K, N, 1>>>(ldb.data(), d1.data());
-  MATMUL(d2, lda, ldb);
-  TCOPYOUT(res, d2);
-}
-
-int main() {
-  const uint16_t M = 64;
-  const uint16_t K = 32;
-  const uint16_t N = 64;
-
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
-
-  float *dst = (float *)malloc(size_C * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_C);
-
-  float *src0 = (float *)malloc(size_A * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, size_A);
-  float *src1 = (float *)malloc(size_B * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, size_B);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<M, N, K>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_C);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/test/other/tileop_api/src/TAbs.cpp b/test/other/tileop_api/src/TAbs.cpp
deleted file mode 100644
index e3d06c7..0000000
--- a/test/other/tileop_api/src/TAbs.cpp
+++ /dev/null
@@ -1,151 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_RowMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_row; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TABS(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
- 
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_ColMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_col; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_row; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TABS(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 128;
-  const uint16_t gm_col = 128;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-
-  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(dst_f16);
-  init_dst(dst_f16, gm_size);
- 
-  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(src0_f16);
-  init_src_fp(src0_f16, gm_size);
-
-  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(dst_i8);
-  init_dst(dst_i8, gm_size);
- 
-  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(src0_i8);
-  init_src_int(src0_i8, gm_size);
- 
-  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(dst_i16);
-  init_dst(dst_i16, gm_size);
- 
-  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(src0_i16);
-  init_src_int(src0_i16, gm_size);
-  
-  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(dst_i32);
-  init_dst(dst_i32, gm_size);
- 
-  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(src0_i32);
-  init_src_int(src0_i32, gm_size);
- 
-  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(dst_i64);
-  init_dst(dst_i64, gm_size);
- 
-  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(src0_i64);
-  init_src_int(src0_i64, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-  OutArray(dst_i8, gm_size);
-  OutArray(dst_i16, gm_size);
-  OutArray(dst_i32, gm_size);
-  OutArray(dst_i64, gm_size);
- 
-  free(dst);
-  free(src0);
- 
-  free(dst_f16);
-  free(src0_f16);
- 
-  free(dst_i8);
-  free(src0_i8);
- 
-  free(dst_i16);
-  free(src0_i16);
- 
-  free(dst_i32);
-  free(src0_i32);
- 
-  free(dst_i64);
-  free(src0_i64);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TAdd_mask.cpp b/test/other/tileop_api/src/TAdd_mask.cpp
deleted file mode 100644
index 011a5dd..0000000
--- a/test/other/tileop_api/src/TAdd_mask.cpp
+++ /dev/null
@@ -1,119 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-using namespace pto;
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col>
-void test(float *c_ptr, float *a_ptr, float *b_ptr) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-  using glb_iterator = global_iterator<gm_shape, tile_shape>;
-
-  static constexpr int block_row = gm_row / tile_row;
-  static constexpr int block_col = gm_col / tile_col;
-  static constexpr int remainder_row = gm_row % tile_row;
-  static constexpr int remainder_col = gm_col % tile_col;
-
-  using trailing_rows_shape =
-      Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor, tile_row, remainder_col>;
-  using trailing_cols_shape =
-      Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor, remainder_row, tile_col>;
-  using trailing_corner_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor, remainder_row, remainder_col>;
-
-  glb_iterator gAIter(a_ptr);
-  glb_iterator gBIter(b_ptr);
-  glb_iterator gCIter(c_ptr);
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      auto gA = gAIter(i, j);
-      auto gB = gBIter(i, j);
-      auto gC = gCIter(i, j);
-
-      tile_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-    if constexpr (remainder_col) {
-      auto gA = gAIter(i, block_col);
-      auto gB = gBIter(i, block_col);
-      auto gC = gCIter(i, block_col);
-
-      trailing_rows_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-  }
-  if constexpr (remainder_row) {
-    for (int j = 0; j < block_col; ++j) {
-      auto gA = gAIter(block_row, j);
-      auto gB = gBIter(block_row, j);
-      auto gC = gCIter(block_row, j);
-
-      trailing_cols_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-    if constexpr (remainder_col) {
-      auto gA = gAIter(block_row, block_col);
-      auto gB = gBIter(block_row, block_col);
-      auto gC = gCIter(block_row, block_col);
-
-      trailing_corner_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 123;
-  const uint16_t gm_col = 123;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-  float *src1 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/test/other/tileop_api/src/TCopy.cpp b/test/other/tileop_api/src/TCopy.cpp
deleted file mode 100644
index f182e04..0000000
--- a/test/other/tileop_api/src/TCopy.cpp
+++ /dev/null
@@ -1,161 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_RowMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_row; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
- 
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_ColMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_col; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_row; ++j) {
-      int offset = i * (tile_col * gm_row) + j * tile_row;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TCOPY(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 128;
-  const uint16_t gm_col = 128;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-
-  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(dst_f16);
-  init_dst(dst_f16, gm_size);
- 
-  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(src0_f16);
-  init_src_fp(src0_f16, gm_size);
-
-  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(dst_i8);
-  init_dst(dst_i8, gm_size);
- 
-  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(src0_i8);
-  init_src_int(src0_i8, gm_size);
- 
-  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(dst_i16);
-  init_dst(dst_i16, gm_size);
- 
-  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(src0_i16);
-  init_src_int(src0_i16, gm_size);
-  
-  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(dst_i32);
-  init_dst(dst_i32, gm_size);
- 
-  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(src0_i32);
-  init_src_int(src0_i32, gm_size);
- 
-  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(dst_i64);
-  init_dst(dst_i64, gm_size);
- 
-  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(src0_i64);
-  init_src_int(src0_i64, gm_size);
-
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-  //OutArray(dst_f16, gm_size);
-  OutArray(dst_i8, gm_size);
-  OutArray(dst_i16, gm_size);
-  OutArray(dst_i32, gm_size);
-  OutArray(dst_i64, gm_size);
- 
-  free(dst);
-  free(src0);
- 
-  free(dst_f16);
-  free(src0_f16);
- 
-  free(dst_i8);
-  free(src0_i8);
- 
-  free(dst_i16);
-  free(src0_i16);
- 
-  free(dst_i32);
-  free(src0_i32);
- 
-  free(dst_i64);
-  free(src0_i64);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TCopyIn.cpp b/test/other/tileop_api/src/TCopyIn.cpp
deleted file mode 100644
index b99ea98..0000000
--- a/test/other/tileop_api/src/TCopyIn.cpp
+++ /dev/null
@@ -1,170 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_RowMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_row; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
-    }
-  }
-}
- 
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_ColMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_col; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_row; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-
-  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(dst_f16);
-  init_dst(dst_f16, gm_size);
- 
-  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(src0_f16);
-  init_src_fp(src0_f16, gm_size);
-
-  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(dst_i8);
-  init_dst(dst_i8, gm_size);
- 
-  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(src0_i8);
-  init_src_int(src0_i8, gm_size);
- 
-  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(dst_i16);
-  init_dst(dst_i16, gm_size);
- 
-  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(src0_i16);
-  init_src_int(src0_i16, gm_size);
-  
-  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(dst_i32);
-  init_dst(dst_i32, gm_size);
- 
-  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(src0_i32);
-  init_src_int(src0_i32, gm_size);
- 
-  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(dst_i64);
-  init_dst(dst_i64, gm_size);
- 
-  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(src0_i64);
-  init_src_int(src0_i64, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
- 
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-  //OutArray(dst_f16, gm_size);
-  OutArray(dst_i8, gm_size);
-  OutArray(dst_i16, gm_size);
-  OutArray(dst_i32, gm_size);
-  OutArray(dst_i64, gm_size);
- 
-  free(dst);
-  free(src0);
- 
-  free(dst_f16);
-  free(src0_f16);
- 
-  free(dst_i8);
-  free(src0_i8);
- 
-  free(dst_i16);
-  free(src0_i16);
- 
-  free(dst_i32);
-  free(src0_i32);
- 
-  free(dst_i64);
-  free(src0_i64);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TCopyOut.cpp b/test/other/tileop_api/src/TCopyOut.cpp
deleted file mode 100644
index b99ea98..0000000
--- a/test/other/tileop_api/src/TCopyOut.cpp
+++ /dev/null
@@ -1,170 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_RowMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::RowMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_row; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
-    }
-  }
-}
- 
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col, typename T>
-void test_ColMajor(T *dst, T *src0) {
-  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
- 
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  #pragma clang loop unroll(full)
-  for (int i = 0; i < block_col; ++i) {
-    #pragma clang loop unroll(full)
-    for (int j = 0; j < block_row; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape res(dst + offset);
-  
-      tile_shape d0;
-      TCOPYIN(d0, s0);
-      TCOPYOUT(res, d0);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-
-  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(dst_f16);
-  init_dst(dst_f16, gm_size);
- 
-  __half *src0_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(src0_f16);
-  init_src_fp(src0_f16, gm_size);
-
-  int8_t *dst_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(dst_i8);
-  init_dst(dst_i8, gm_size);
- 
-  int8_t *src0_i8 = (int8_t *)malloc(gm_size * sizeof(int8_t));
-  check_mem_alloc(src0_i8);
-  init_src_int(src0_i8, gm_size);
- 
-  int16_t *dst_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(dst_i16);
-  init_dst(dst_i16, gm_size);
- 
-  int16_t *src0_i16 = (int16_t *)malloc(gm_size * sizeof(int16_t));
-  check_mem_alloc(src0_i16);
-  init_src_int(src0_i16, gm_size);
-  
-  int32_t *dst_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(dst_i32);
-  init_dst(dst_i32, gm_size);
- 
-  int32_t *src0_i32 = (int32_t *)malloc(gm_size * sizeof(int32_t));
-  check_mem_alloc(src0_i32);
-  init_src_int(src0_i32, gm_size);
- 
-  int64_t *dst_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(dst_i64);
-  init_dst(dst_i64, gm_size);
- 
-  int64_t *src0_i64 = (int64_t *)malloc(gm_size * sizeof(int64_t));
-  check_mem_alloc(src0_i64);
-  init_src_int(src0_i64, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
- 
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, float>(dst, src0);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src0_f16);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int8_t>(dst_i8, src0_i8);
-
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int16_t>(dst_i16, src0_i16);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int32_t>(dst_i32, src0_i32);
- 
-  test_RowMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
-
-  test_ColMajor<gm_row, gm_col, tile_row, tile_col, int64_t>(dst_i64, src0_i64);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-  //OutArray(dst_f16, gm_size);
-  OutArray(dst_i8, gm_size);
-  OutArray(dst_i16, gm_size);
-  OutArray(dst_i32, gm_size);
-  OutArray(dst_i64, gm_size);
- 
-  free(dst);
-  free(src0);
- 
-  free(dst_f16);
-  free(src0_f16);
- 
-  free(dst_i8);
-  free(src0_i8);
- 
-  free(dst_i16);
-  free(src0_i16);
- 
-  free(dst_i32);
-  free(src0_i32);
- 
-  free(dst_i64);
-  free(src0_i64);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TCvt.cpp b/test/other/tileop_api/src/TCvt.cpp
deleted file mode 100644
index a4f3362..0000000
--- a/test/other/tileop_api/src/TCvt.cpp
+++ /dev/null
@@ -1,57 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void Test(float *dst, float *src) {
-  using gm_shape = global_tensor<float, RowMajor<row, col>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-  using tile_shape_out = TileLeft<float, row, col>;
-
-  gm_shape s0(src);
-  gm_shape res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TCVT(d1, d0);
-  TCVT(d0, d1);
-  TCOPYOUT(res, d0);
-}
-
-int main() {
-  const uint16_t row = 64;
-  const uint16_t col = 128;
-
-  size_t size = row * col;
-
-  float *dst = (float *)malloc(size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size);
-
-  float *src = (float *)malloc(size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  Test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TDiv.cpp b/test/other/tileop_api/src/TDiv.cpp
deleted file mode 100644
index aa3668b..0000000
--- a/test/other/tileop_api/src/TDiv.cpp
+++ /dev/null
@@ -1,70 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape s1(src1 + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
-      TDIV(d2, d1, d0);
-      TCOPYOUT(res, d2);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-  float *src1 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TDivs.cpp b/test/other/tileop_api/src/TDivs.cpp
deleted file mode 100644
index 0c798c4..0000000
--- a/test/other/tileop_api/src/TDivs.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src, float s) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TDIVS(d1, d0, s);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src, s_fp32);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TExp.cpp b/test/other/tileop_api/src/TExp.cpp
deleted file mode 100644
index 49f8645..0000000
--- a/test/other/tileop_api/src/TExp.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TEXP(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TExpandCol.cpp b/test/other/tileop_api/src/TExpandCol.cpp
deleted file mode 100644
index f2633e4..0000000
--- a/test/other/tileop_api/src/TExpandCol.cpp
+++ /dev/null
@@ -1,57 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<row, 1>>;
-  using gm_shape_out = global_tensor<float, RowMajor<row, col>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, 1, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TEXPANDCOL(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 128;
-  const uint16_t col = 64;
-
-  size_t size_in = row;
-  size_t size_out = row * col;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TExpandRow.cpp b/test/other/tileop_api/src/TExpandRow.cpp
deleted file mode 100644
index eb2902f..0000000
--- a/test/other/tileop_api/src/TExpandRow.cpp
+++ /dev/null
@@ -1,57 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<1, col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<row, col>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, 1, col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-  TCOPYIN(d0, s0);
-  TEXPANDROW(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 64;
-  const uint16_t col = 128;
-
-  size_t size_in = col;
-  size_t size_out = row * col;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TExpandScalar.cpp b/test/other/tileop_api/src/TExpandScalar.cpp
deleted file mode 100644
index 83ac6ed..0000000
--- a/test/other/tileop_api/src/TExpandScalar.cpp
+++ /dev/null
@@ -1,49 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float s) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-  gm_shape res(dst);
-
-  tile_shape d0;
-  TEXPANDSCALAR(d0, s);
-  TCOPYOUT(res, d0);
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 128;
-  const uint16_t tile_row = 64;
-  const uint16_t tile_col = 128;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, s_fp32);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TMax.cpp b/test/other/tileop_api/src/TMax.cpp
deleted file mode 100644
index 7d156c2..0000000
--- a/test/other/tileop_api/src/TMax.cpp
+++ /dev/null
@@ -1,70 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape s1(src1 + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
-      TMAX(d2, d1, d0);
-      TCOPYOUT(res, d2);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-  float *src1 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TMaxs.cpp b/test/other/tileop_api/src/TMaxs.cpp
deleted file mode 100644
index 5224e6e..0000000
--- a/test/other/tileop_api/src/TMaxs.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src, float s) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TMAXS(d1, d0, s);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src, s_fp32);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TMul.cpp b/test/other/tileop_api/src/TMul.cpp
deleted file mode 100644
index 6800ab7..0000000
--- a/test/other/tileop_api/src/TMul.cpp
+++ /dev/null
@@ -1,70 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape s1(src1 + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
-      TMUL(d2, d1, d0);
-      TCOPYOUT(res, d2);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-  float *src1 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TMuls.cpp b/test/other/tileop_api/src/TMuls.cpp
deleted file mode 100644
index bbf8f6f..0000000
--- a/test/other/tileop_api/src/TMuls.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src, float s) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TMULS(d1, d0, s);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src, s_fp32);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TRecip.cpp b/test/other/tileop_api/src/TRecip.cpp
deleted file mode 100644
index d30c423..0000000
--- a/test/other/tileop_api/src/TRecip.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TRECIP(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TReshape.cpp b/test/other/tileop_api/src/TReshape.cpp
deleted file mode 100644
index f96a5bc..0000000
--- a/test/other/tileop_api/src/TReshape.cpp
+++ /dev/null
@@ -1,60 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<gm_row * gm_col, 1>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, tile_row * tile_col, 1, BLayout::RowMajor>;
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-  TCOPYIN(d0, s0);
-  TRESHAPE(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TRowMax.cpp b/test/other/tileop_api/src/TRowMax.cpp
deleted file mode 100644
index 074f9c2..0000000
--- a/test/other/tileop_api/src/TRowMax.cpp
+++ /dev/null
@@ -1,58 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<row, col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<row, 1>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, row, 1, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TROWMAX(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 128;
-  const uint16_t col = 64;
-
-  size_t size_in = row * col;
-  size_t size_out = row;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TRowMaxExpand.cpp b/test/other/tileop_api/src/TRowMaxExpand.cpp
deleted file mode 100644
index b9cc1e1..0000000
--- a/test/other/tileop_api/src/TRowMaxExpand.cpp
+++ /dev/null
@@ -1,58 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<row, col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<row, col>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TROWMAXEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 64;
-  const uint16_t col = 128;
-
-  size_t size_in = row * col;
-  size_t size_out = row * col;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TRowSum.cpp b/test/other/tileop_api/src/TRowSum.cpp
deleted file mode 100644
index c242a41..0000000
--- a/test/other/tileop_api/src/TRowSum.cpp
+++ /dev/null
@@ -1,58 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<row, col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<row, 1>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, row, 1, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TROWSUM(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 128;
-  const uint16_t col = 64;
-
-  size_t size_in = row * col;
-  size_t size_out = row;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TRowSumExpand.cpp b/test/other/tileop_api/src/TRowSumExpand.cpp
deleted file mode 100644
index 32b0a78..0000000
--- a/test/other/tileop_api/src/TRowSumExpand.cpp
+++ /dev/null
@@ -1,58 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<row, col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<row, col>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TROWSUMEXPAND(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 64;
-  const uint16_t col = 128;
-
-  size_t size_in = row * col;
-  size_t size_out = row * col;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TSqrt.cpp b/test/other/tileop_api/src/TSqrt.cpp
deleted file mode 100644
index 68813b1..0000000
--- a/test/other/tileop_api/src/TSqrt.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TSQRT(d1, d0);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TSub.cpp b/test/other/tileop_api/src/TSub.cpp
deleted file mode 100644
index ac1c694..0000000
--- a/test/other/tileop_api/src/TSub.cpp
+++ /dev/null
@@ -1,70 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src0 + offset);
-      gm_shape s1(src1 + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1, d2;
-      TCOPYIN(d0, s0);
-      TCOPYIN(d1, s1);
-      TSUB(d2, d1, d0);
-      TCOPYOUT(res, d2);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-  float *src1 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TSubs.cpp b/test/other/tileop_api/src/TSubs.cpp
deleted file mode 100644
index 615f129..0000000
--- a/test/other/tileop_api/src/TSubs.cpp
+++ /dev/null
@@ -1,64 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col>
-void test(float *dst, float *src, float s) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-
-  uint16_t block_row = gm_row / tile_row;
-  uint16_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      int offset = i * (tile_row * gm_col) + j * tile_col;
-      gm_shape s0(src + offset);
-      gm_shape res(dst + offset);
-
-      tile_shape d0, d1;
-      TCOPYIN(d0, s0);
-      TSUBS(d1, d0, s);
-      TCOPYOUT(res, d1);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 64;
-  const uint16_t gm_col = 64;
-  const uint16_t tile_row = 32;
-  const uint16_t tile_col = 32;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src, s_fp32);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/TTrans.cpp b/test/other/tileop_api/src/TTrans.cpp
deleted file mode 100644
index a430236..0000000
--- a/test/other/tileop_api/src/TTrans.cpp
+++ /dev/null
@@ -1,58 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t row, uint16_t col> void test(float *dst, float *src) {
-  using gm_shape_in = global_tensor<float, RowMajor<row, col>>;
-  using gm_shape_out = global_tensor<float, RowMajor<col, row>>;
-
-  using tile_shape_in = Tile<Location::Vec, float, row, col, BLayout::RowMajor>;
-  using tile_shape_out = Tile<Location::Vec, float, col, row, BLayout::RowMajor>;
-
-  gm_shape_in s0(src);
-  gm_shape_out res(dst);
-  
-  tile_shape_in d0;
-  tile_shape_out d1;
-
-  TCOPYIN(d0, s0);
-  TTRANS(d1, d0);
-  TCOPYOUT(res, d1);
-}
-
-int main() {
-  const uint16_t row = 64;
-  const uint16_t col = 128;
-
-  size_t size_in = row * col;
-  size_t size_out = col * row;
-
-  float *dst = (float *)malloc(size_out * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_out);
-
-  float *src = (float *)malloc(size_in * sizeof(float));
-  check_mem_alloc(src);
-  init_src_fp(src, size_in);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<row, col>(dst, src);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_out);
-
-  free(dst);
-  free(src);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/test_MatMacc.cpp b/test/other/tileop_api/src/test_MatMacc.cpp
deleted file mode 100644
index 866c92c..0000000
--- a/test/other/tileop_api/src/test_MatMacc.cpp
+++ /dev/null
@@ -1,71 +0,0 @@
-#include <common/pto_tileop.hpp>
-
-#include "../data.hpp"
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t M, uint16_t N, uint16_t K>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
-  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
-  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
-
-  using tile_shape_A = TileLeft<float, M, K>;
-  using tile_shape_B = TileRight<float, K, N>;
-  using tile_shape_C = TileAcc<float, M, N>;
-
-  gm_shape_A s0(src0);
-  gm_shape_B s1(src1);
-  gm_shape_C res(dst);
-
-  tile_shape_A d0;
-  tile_shape_B d1;
-  tile_shape_C d2(0);
-
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  MATMACC(d2, d0, d1);
-  TCOPYOUT(res, d2);
-}
-
-int main() {
-  const uint16_t M = 64;
-  const uint16_t K = 32;
-  const uint16_t N = 128;
-
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
-
-  float *dst = (float *)malloc(size_C * sizeof(float));
-  check_mem_alloc(dst);
-  init_src_fp(dst, size_C);
-
-  float *src0 = (float *)malloc(size_A * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, size_A);
-  float *src1 = (float *)malloc(size_B * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, size_B);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<M, N, K>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_C);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/other/tileop_api/src/test_MatMul.cpp b/test/other/tileop_api/src/test_MatMul.cpp
deleted file mode 100644
index 92aa454..0000000
--- a/test/other/tileop_api/src/test_MatMul.cpp
+++ /dev/null
@@ -1,51 +0,0 @@
-#include <common/pto_tileop.hpp>
-
-#include "../../../kernels/matmul.hpp"
-#include "../data.hpp"
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-int main() {
-  const uint16_t M = 160;
-  const uint16_t K = 80;
-  const uint16_t N = 320;
-  const uint16_t TM = 32;
-  const uint16_t TK = 32;
-  const uint16_t TN = 32;
-
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
-
-  float *dst = (float *)malloc(size_C * sizeof(float));
-  check_mem_alloc(dst);
-  init_src_fp(dst, size_C);
-
-  float *src0 = (float *)malloc(size_A * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, size_A);
-  float *src1 = (float *)malloc(size_B * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, size_B);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  matmul<M, N, K, TM, TN, TK>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_C);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/test/tileop_api/src/MatMul_e4m3.cpp b/test/tileop_api/src/MatMul_e4m3.cpp
deleted file mode 100644
index 2392f07..0000000
--- a/test/tileop_api/src/MatMul_e4m3.cpp
+++ /dev/null
@@ -1,89 +0,0 @@
-#include <common/pto_tileop.hpp>
-#include "../data.hpp"
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <typename TA, typename TB>
-void __vec__ test_cvt(typename TA::TileDType __out__ a,
-                      typename TB::TileDType __in__ b) {
-  using AType = typename TA::DType;
-  using BType = typename TB::DType;
-  __vbuf__ BType *pb = blkv_get_tile_ptr(b);
-  __vbuf__ AType *pa = blkv_get_tile_ptr(a);
-  int x = blkv_get_index_x();
-  int y = blkv_get_index_y();
-  int idx = index<TA>(y, x);
-  AType o = (AType)(pb[idx]);
-  pa[idx] = o;
-}
-
-template <uint16_t M, uint16_t N, uint16_t K>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
-  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
-  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
-
-  using tile_shape_A = TileLeft<float, M, K>;
-  using tile_shape_B = TileRight<float, K, N>;
-  using tile_shape_C = TileAcc<float, M, N>;
-  using tile_shape_LA = TileLeft<__fp8_e4m3, M, K>;
-  using tile_shape_LB = TileRight<__fp8_e4m3, K, N>;
-
-  gm_shape_A s0(src0);
-  gm_shape_B s1(src1);
-  gm_shape_C res(dst);
-
-  tile_shape_A d0;
-  tile_shape_B d1;
-  tile_shape_C d2;
-  tile_shape_LA lda;
-  tile_shape_LB ldb;
-
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  test_cvt<tile_shape_LA, tile_shape_A><<<M, K, 1>>>(lda.data(), d0.data());
-  test_cvt<tile_shape_LB, tile_shape_B><<<K, N, 1>>>(ldb.data(), d1.data());
-  MATMUL(d2, lda, ldb);
-  TCOPYOUT(res, d2);
-}
-
-int main() {
-  const uint16_t M = 64;
-  const uint16_t K = 32;
-  const uint16_t N = 128;
-
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
-
-  float *dst = (float *)malloc(size_C * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_C);
-
-  float *src0 = (float *)malloc(size_A * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, size_A);
-  float *src1 = (float *)malloc(size_B * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, size_B);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<M, N, K>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_C);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/test/tileop_api/src/TAdd_mask.cpp b/test/tileop_api/src/TAdd_mask.cpp
deleted file mode 100644
index a0f6f40..0000000
--- a/test/tileop_api/src/TAdd_mask.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-using namespace pto;
-
-template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row,
-          uint16_t tile_col>
-void test(float *c_ptr, float *a_ptr, float *b_ptr) {
-  using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
-  using glb_iterator = global_iterator<gm_shape, tile_shape>;
-
-  static constexpr int block_row = gm_row / tile_row;
-  static constexpr int block_col = gm_col / tile_col;
-  static constexpr int remainder_row = gm_row % tile_row;
-  static constexpr int remainder_col = gm_col % tile_col;
-
-  using trailing_rows_shape =
-      Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor, tile_row, remainder_col>;
-  using trailing_cols_shape =
-      Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor, remainder_row, tile_col>;
-  using trailing_corner_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor,
-                                            remainder_row, remainder_col>;
-
-  glb_iterator gAIter(a_ptr);
-  glb_iterator gBIter(b_ptr);
-  glb_iterator gCIter(c_ptr);
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      auto gA = gAIter(i, j);
-      auto gB = gBIter(i, j);
-      auto gC = gCIter(i, j);
-
-      tile_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-    if constexpr (remainder_col) {
-      auto gA = gAIter(i, block_col);
-      auto gB = gBIter(i, block_col);
-      auto gC = gCIter(i, block_col);
-
-      trailing_rows_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-  }
-  if constexpr (remainder_row) {
-    for (int j = 0; j < block_col; ++j) {
-      auto gA = gAIter(block_row, j);
-      auto gB = gBIter(block_row, j);
-      auto gC = gCIter(block_row, j);
-
-      trailing_cols_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-    if constexpr (remainder_col) {
-      auto gA = gAIter(block_row, block_col);
-      auto gB = gBIter(block_row, block_col);
-      auto gC = gCIter(block_row, block_col);
-
-      trailing_corner_shape tA, tB, tC;
-      TCOPYIN(tA, gA);
-      TCOPYIN(tB, gB);
-      TADD(tC, tA, tB);
-      TCOPYOUT(gC, tC);
-    }
-  }
-}
-
-int main() {
-  const uint16_t gm_row = 66;
-  const uint16_t gm_col = 66;
-  const uint16_t tile_row = 16;
-  const uint16_t tile_col = 16;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  float *dst = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, gm_size);
-
-  float *src0 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, gm_size);
-  float *src1 = (float *)malloc(gm_size * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, gm_size);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<gm_row, gm_col, tile_row, tile_col>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, gm_size);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/test/tileop_api/src/TSqrt.cpp b/test/tileop_api/src/TSqrt.cpp
deleted file mode 100644
index 7fbc90f..0000000
--- a/test/tileop_api/src/TSqrt.cpp
+++ /dev/null
@@ -1,107 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col, typename T>
-void test_rm(T *dst, T *src) {
-  using gm_shape = global_tensor<T, RowMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col>;
-  using glb_iterator = global_iterator<gm_shape, tile_shape>;
-
-  glb_iterator gSIter(src);
-  glb_iterator gDIter(dst);
-
-  size_t block_row = gm_row / tile_row;
-  size_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_row; ++i) {
-    for (int j = 0; j < block_col; ++j) {
-      auto s0 = gSIter(i, j);
-      auto res = gDIter(i, j);
-
-      tile_shape t0, t1;
-      TCOPYIN(t0, s0);
-      TSQRT(t1, t0);
-      TCOPYOUT(res, t1);
-    }
-  }
-}
-
-template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row,
-          uint64_t tile_col, typename T>
-void test_cm(T *dst, T *src) {
-  using gm_shape = global_tensor<T, ColMajor<gm_row, gm_col>>;
-  using tile_shape = Tile<Location::Vec, T, tile_row, tile_col, BLayout::ColMajor>;
-  using glb_iterator = global_iterator<gm_shape, tile_shape>;
-
-  glb_iterator gSIter(src);
-  glb_iterator gDIter(dst);
-
-  size_t block_row = gm_row / tile_row;
-  size_t block_col = gm_col / tile_col;
-  for (int i = 0; i < block_col; ++i) {
-    for (int j = 0; j < block_row; ++j) {
-      auto s0 = gSIter(j, i);
-      auto res = gDIter(j, i);
-
-      tile_shape t0, t1;
-      TCOPYIN(t0, s0);
-      TSQRT(t1, t0);
-      TCOPYOUT(res, t1);
-    }
-  }
-}
-
-int main() {
-
-  const size_t gm_row = 32;
-  const size_t gm_col = 32;
-  const size_t tile_row = 16;
-  const size_t tile_col = 16;
-
-  size_t gm_size = gm_row * gm_col;
-  size_t tile_size = tile_row * tile_col;
-
-  // __half
-  __half *dst_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(dst_f16);
-  init_dst(dst_f16, gm_size);
-
-  __half *src_f16 = (__half *)malloc(gm_size * sizeof(__half));
-  check_mem_alloc(src_f16);
-  init_rows_fp(src_f16, gm_row, gm_col);
-
-  // __fp32
-  __fp32 *dst_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
-  check_mem_alloc(dst_f32);
-  init_dst(dst_f32, gm_size);
-
-  __fp32 *src_f32 = (__fp32 *)malloc(gm_size * sizeof(__fp32));
-  check_mem_alloc(src_f32);
-  init_rows_fp(src_f32, gm_row, gm_col);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test_rm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16);
-  test_cm<gm_row, gm_col, tile_row, tile_col, __half>(dst_f16, src_f16);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst_f16, gm_size);
-  OutArray(dst_f32, gm_size);
-
-  free(dst_f16);
-  free(src_f16);
-  free(dst_f32);
-  free(src_f32);
-
-  return 0;
-}
\ No newline at end of file
diff --git a/test/tileop_api/src/test_MatMacc.cpp b/test/tileop_api/src/test_MatMacc.cpp
deleted file mode 100644
index ead53c2..0000000
--- a/test/tileop_api/src/test_MatMacc.cpp
+++ /dev/null
@@ -1,75 +0,0 @@
-#include <common/pto_tileop.hpp>
-
-#include "../data.hpp"
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t M, uint16_t N, uint16_t K>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
-  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
-  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
-
-  using tile_shape_A = TileLeft<float, M, K>;
-  using tile_shape_B = TileRight<float, K, N>;
-  using tile_shape_C = TileAcc<float, M, N>;
-  using tile_shape_O = Tile<Location::Vec, float, M, N>;
-
-  gm_shape_A s0(src0);
-  gm_shape_B s1(src1);
-  gm_shape_C res(dst);
-
-  tile_shape_A d0;
-  tile_shape_B d1;
-  tile_shape_C d2;
-  tile_shape_O d3;
-
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  MATMUL(d2, d0, d1);
-  MATMACC(d2, d0, d1);
-  TCVT(d3, d2);
-  TCOPYOUT(res, d3);
-}
-
-int main() {
-  const uint16_t M = 16;
-  const uint16_t K = 8;
-  const uint16_t N = 32;
-
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
-
-  float *dst = (float *)malloc(size_C * sizeof(float));
-  check_mem_alloc(dst);
-  init_src_fp(dst, size_C);
-
-  float *src0 = (float *)malloc(size_A * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, size_A);
-  float *src1 = (float *)malloc(size_B * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, size_B);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<M, N, K>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_C);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/test/tileop_api/src/test_MatMul.cpp b/test/tileop_api/src/test_MatMul.cpp
deleted file mode 100644
index 56ad9c3..0000000
--- a/test/tileop_api/src/test_MatMul.cpp
+++ /dev/null
@@ -1,73 +0,0 @@
-#include "../data.hpp"
-#include <common/pto_tileop.hpp>
-
-#ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
-#endif
-
-template <uint16_t M, uint16_t N, uint16_t K>
-void test(float *dst, float *src0, float *src1) {
-  using gm_shape_A = global_tensor<float, RowMajor<M, K>>;
-  using gm_shape_B = global_tensor<float, ColMajor<K, N>>;
-  using gm_shape_C = global_tensor<float, RowMajor<M, N>>;
-
-  using tile_shape_A = TileLeft<float, M, K>;
-  using tile_shape_B = TileRight<float, K, N>;
-  using tile_shape_C = TileAcc<float, M, N>;
-  using tile_shape_O = TileLeft<float, M, K>;
-
-  gm_shape_A s0(src0);
-  gm_shape_B s1(src1);
-  gm_shape_C res(dst);
-
-  tile_shape_A d0;
-  tile_shape_B d1;
-  tile_shape_C d2;
-  tile_shape_O d3;
-
-  TCOPYIN(d0, s0);
-  TCOPYIN(d1, s1);
-  MATMUL(d2, d0, d1);
-  TCVT(d3, d2);
-  TCOPYOUT(res, d3);
-}
-
-int main() {
-  const uint16_t M = 16;
-  const uint16_t K = 8;
-  const uint16_t N = 32;
-
-  size_t size_A = M * K;
-  size_t size_B = K * N;
-  size_t size_C = M * N;
-
-  float *dst = (float *)malloc(size_C * sizeof(float));
-  check_mem_alloc(dst);
-  init_dst(dst, size_C);
-
-  float *src0 = (float *)malloc(size_A * sizeof(float));
-  check_mem_alloc(src0);
-  init_src_fp(src0, size_A);
-  float *src1 = (float *)malloc(size_B * sizeof(float));
-  check_mem_alloc(src1);
-  init_src_fp(src1, size_B);
-
-#ifdef LINX_PMC
-  PMC_START();
-#endif
-
-  test<M, N, K>(dst, src0, src1);
-
-#ifdef LINX_PMC
-  PMC_END();
-#endif
-
-  printf("Result:\n");
-  OutArray(dst, size_C);
-
-  free(dst);
-  free(src0);
-  free(src1);
-
-  return 0;
-}
diff --git a/tests/README.md b/tests/README.md
new file mode 100644
index 0000000..e0558f9
--- /dev/null
+++ b/tests/README.md
@@ -0,0 +1,71 @@
+# Tests
+
+This tree keeps correctness material that is not the primary Linx benchmark
+navigation surface. Active Linx benchmark entrypoints live under
+[`../benchmarks`](../benchmarks); add new benchmark suites there instead of
+recreating the old `test/` tree.
+
+## Directory Map
+
+| Path | Purpose |
+| --- | --- |
+| [`py_api`](py_api) | Active Python-facing TileOP correctness and golden-comparison flow. |
+| [`tileop_layout`](tileop_layout) | TileOP layout and behavior checks that are not cataloged as primary benchmark suites. |
+
+These directories still use the shared benchmark harness through
+[`../benchmarks/common/Makefile.common`](../benchmarks/common/Makefile.common),
+so the same `TESTCASE`, `PLAT`, `COMPILER_DIR`, and `QEMU` variables work here.
+
+## Common Build Pattern
+
+```sh
+cd tests/tileop_layout
+make clean
+make TESTCASE=TLOAD PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin
+make TESTCASE=TSTORE PLAT=linx COMPILER_DIR=/path/to/linx/compiler/bin
+```
+
+Platform values:
+
+| Platform | Backend |
+| --- | --- |
+| `PLAT=cpu` | CPU simulation backend with `__cpu_sim__`. |
+| `PLAT=linx` | Linx target backend with `__linx`. |
+| `PLAT=arm_sme` | Arm SME-oriented backend with `__ARM_FEATURE_SME`. |
+
+Common targets:
+
+```sh
+make TESTCASE=<case> all
+make TESTCASE=<case> diss
+make TESTCASE=<case> sim
+make TESTCASE=<case> debug
+make clean
+make clean_all
+```
+
+Build products are written below the repository-level `output/` directory.
+
+## Batch Runs
+
+Run batch files from their suite directory so relative paths and make variables
+resolve as intended:
+
+```sh
+cd tests/tileop_layout && bash compile.all
+cd tests/py_api && bash compile.all
+```
+
+## Python Golden Comparison
+
+```sh
+cd tests/py_api
+make clean
+make TESTCASE=tileop_py
+python3 golden_cmp/golden_cmp.py -i tadd
+```
+
+For adding golden-comparison cases, see
+[`py_api/golden_cmp/README.md`](py_api/golden_cmp/README.md).
+
+Back to the repository overview: [`../README.md`](../README.md).
diff --git a/test/py_api/Makefile b/tests/py_api/Makefile
similarity index 65%
rename from test/py_api/Makefile
rename to tests/py_api/Makefile
index 01a5e4e..21ffb78 100644
--- a/test/py_api/Makefile
+++ b/tests/py_api/Makefile
@@ -6,4 +6,5 @@ endif
 
 SRC_FILE +=  $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp
 
-include ../common/Makefile.common
\ No newline at end of file
+CATEGORY_ROOT := $(abspath ../..)
+include ../../benchmarks/common/Makefile.common
diff --git a/test/py_api/compile.all b/tests/py_api/compile.all
similarity index 100%
rename from test/py_api/compile.all
rename to tests/py_api/compile.all
diff --git a/test/py_api/golden_cmp/README.md b/tests/py_api/golden_cmp/README.md
similarity index 88%
rename from test/py_api/golden_cmp/README.md
rename to tests/py_api/golden_cmp/README.md
index 068ddb8..4d89087 100644
--- a/test/py_api/golden_cmp/README.md
+++ b/tests/py_api/golden_cmp/README.md
@@ -7,12 +7,12 @@
  · File path: PTOTileLib/include/cpu_sim/
 
  · Instructions:
-   
+
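    1. To add a new operation (for example, texp), create a new HPP file.
    2. To add a variant of an existing operation (for example, a different matrix size or tile size), add it directly to the corresponding HPP file.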
 
  · Standard function format:
-   
+
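    · The top of the file must include the necessary headers.
    · Follow the naming conventions when declaring variables and function names.
    · If there is a combined function, state its conditions clearly.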
@@ -57,16 +57,16 @@ void TADD(tile_shape &dst, tile_shape &src0, tile_shape &src1) {
 
 2. Create the HPP file for the calling function and its Python interface
 
-File path: PTOTileLib/test/py_api/src
+File path: PTOTileLib/tests/py_api/src
 
 Steps:
 
  1. Create the file:
-    
+
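     · Add the fixed file header.
     . Declare the parameters to be passed in.
     . Declare the matrix shape and layout.
-    . Perform the matrix operation using TCOPYIN, TCOPYOUT, and the function declared in the previous step. Make sure the matrix layout meets the requirements of TCOPYIN and TCOPYOUT.
+    . Perform the matrix operation using TLOAD, TSTORE, and the function declared in the previous step. Make sure the matrix layout meets the requirements of TLOAD and TSTORE.
     . Bind the function; the binding must convert the interface and pass the arguments through.
     . Then add the name of the file to compile to tileop_py.cpp.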
     ```
@@ -90,18 +90,18 @@ void tadd_py(float* dst, float* src0, float* src1){
             int offset = i * (tile_row * gm_col) + j * tile_col;
             gm_shape s0(src0 + offset);
             gm_shape s1(src1 + offset);
-            gm_shape res(dst + offset);  
+            gm_shape res(dst + offset);
 
             tile_shape d0, d1, d2;
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TADD(d2, d0, d1);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
 
-#ifdef __cpu_sim__ 
+#ifdef __cpu_sim__
     void bind_tadd(py::module_& m) {
         m.def("tadd", [](py::array_t<float> dst_py, py::array_t<float> src0_py, py::array_t<float> src1_py){
             float* dst = static_cast<float*>(dst_py.request().ptr);
@@ -119,12 +119,12 @@ void tadd_py(float* dst, float* src0, float* src1){
 
 3. Modify the CONFIG.JSON file and the ref_func_lib.py file
 
-File path: /PTOTileLib/test/py_api/golden_cmp/
+File path: /PTOTileLib/tests/py_api/golden_cmp/
 
 Steps:
 
  1. Add the new test case to cases in config.json:
-    
+
     · Add the new function's attributes in the following format.
         ```
         {
@@ -132,7 +132,7 @@ void tadd_py(float* dst, float* src0, float* src1){
             "group": "tadd",
             "input_shapes": [[16, 16], [16, 16]],
             "output_shape": [16, 16],
-            "ref_func":"lambda input: tadd(input[0], input[1])", 
+            "ref_func":"lambda input: tadd(input[0], input[1])",
             "test_func":"lambda res, input: tileop_py.tadd_api.tadd(res, input[0], input[1])"
         }
         ```
@@ -144,7 +144,7 @@ void tadd_py(float* dst, float* src0, float* src1){
     . **test_func**: calls the bound function.
 
  2. Add the required Python operation to ref_func_lib.py:
-    
+
     · Add the Python operation in the following format.
     ```
     def tadd(a, b):
@@ -158,12 +158,12 @@ void tadd_py(float* dst, float* src0, float* src1){
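 Adding a new test case
 Once the changes described in these steps are complete, the new test case is added. Make sure the changes in every file match the definitions in config.json to avoid runtime errors.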
 
-From /PTOTileLib/test/py_api/, run the following commands:
+From /PTOTileLib/tests/py_api/, run the following commands:
 ```
-make clean  
-make TESTCASE=tileop_py  
-python3 golden_cmp/golden_cmp.py -i tadd  
+make clean
+make TESTCASE=tileop_py
+python3 golden_cmp/golden_cmp.py -i tadd
 ```
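-The argument after -i is the function name; see config.json for the available names.  
+The argument after -i is the function name; see config.json for the available names.
 In the printed comparison result, the last two lines show the loss (error) and whether the case passed or failed.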
 
diff --git a/test/py_api/golden_cmp/config.json b/tests/py_api/golden_cmp/config.json
similarity index 100%
rename from test/py_api/golden_cmp/config.json
rename to tests/py_api/golden_cmp/config.json
diff --git a/test/py_api/golden_cmp/golden_cmp.py b/tests/py_api/golden_cmp/golden_cmp.py
similarity index 100%
rename from test/py_api/golden_cmp/golden_cmp.py
rename to tests/py_api/golden_cmp/golden_cmp.py
diff --git a/test/py_api/golden_cmp/ref_func_lib.py b/tests/py_api/golden_cmp/ref_func_lib.py
similarity index 100%
rename from test/py_api/golden_cmp/ref_func_lib.py
rename to tests/py_api/golden_cmp/ref_func_lib.py
diff --git a/test/py_api/golden_cmp/test.sh b/tests/py_api/golden_cmp/test.sh
similarity index 100%
rename from test/py_api/golden_cmp/test.sh
rename to tests/py_api/golden_cmp/test.sh
diff --git a/test/py_api/src/flash_attention_py.hpp b/tests/py_api/src/flash_attention_py.hpp
similarity index 100%
rename from test/py_api/src/flash_attention_py.hpp
rename to tests/py_api/src/flash_attention_py.hpp
diff --git a/test/py_api/src/tadd.hpp b/tests/py_api/src/tadd.hpp
similarity index 89%
rename from test/py_api/src/tadd.hpp
rename to tests/py_api/src/tadd.hpp
index 39bf976..3245e5c 100644
--- a/test/py_api/src/tadd.hpp
+++ b/tests/py_api/src/tadd.hpp
@@ -18,18 +18,18 @@ void tadd_py(float* dst, float* src0, float* src1){
             int offset = i * (tile_row * gm_col) + j * tile_col;
             gm_shape s0(src0 + offset);
             gm_shape s1(src1 + offset);
-            gm_shape res(dst + offset);  
+            gm_shape res(dst + offset);
 
             tile_shape d0, d1, d2;
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TADD(d2, d0, d1);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
 
-#ifdef __cpu_sim__ 
+#ifdef __cpu_sim__
     void bind_tadd(py::module_& m) {
         m.def("tadd", [](py::array_t<float> dst_py, py::array_t<float> src0_py, py::array_t<float> src1_py){
             float* dst = static_cast<float*>(dst_py.request().ptr);
diff --git a/test/py_api/src/tcvt.hpp b/tests/py_api/src/tcvt.hpp
similarity index 94%
rename from test/py_api/src/tcvt.hpp
rename to tests/py_api/src/tcvt.hpp
index 2931dd3..04bab23 100644
--- a/test/py_api/src/tcvt.hpp
+++ b/tests/py_api/src/tcvt.hpp
@@ -22,13 +22,13 @@ void tcvtnz2zn(float* dst, float* src) {
     tile_shape_nz d1;
     tile_shape_zn d2;
     tile_shape_out d3;
-    
 
-    TCOPYIN(d0, s);
+
+    TLOAD(d0, s);
     TRESHAPE(d1, d0);
     TCVT(d2, d1);
     TRESHAPE(d3, d2);
-    TCOPYOUT(res, d3);
+    TSTORE(res, d3);
 }
 
 
@@ -48,14 +48,14 @@ void tcvtzn2nz(float* dst, float* src) {
     tile_shape_zn d1;
     tile_shape_nz d2;
     tile_shape_out d3;
-    
 
-    TCOPYIN(d0, s);
+
+    TLOAD(d0, s);
     TRESHAPE(d1, d0);
     TCVT(d2, d1);
     TRESHAPE(d3, d2);
-    TCOPYOUT(res, d3);
-    
+    TSTORE(res, d3);
+
 }
 template <uint16_t K>
 void tcvtnz2rowmajor(float* dst, float* src) {
@@ -73,13 +73,13 @@ void tcvtnz2rowmajor(float* dst, float* src) {
     tile_shape_nz d1;
     tile_shape_rowmajor d2;
     tile_shape_out d3;
-    
 
-    TCOPYIN(d0, s);
+
+    TLOAD(d0, s);
     TRESHAPE(d1, d0);
     TCVT(d2, d1);
     TRESHAPE(d3, d2);
-    TCOPYOUT(res, d3);    
+    TSTORE(res, d3);
 }
 
 
@@ -100,13 +100,13 @@ void tcvtrowmajor2nz(float* dst, float* src) {
     tile_shape_rowmajor d1;
     tile_shape_nz d2;
     tile_shape_out d3;
-    
 
-    TCOPYIN(d0, s);
+
+    TLOAD(d0, s);
     TRESHAPE(d1, d0);
     TCVT(d2, d1);
     TRESHAPE(d3, d2);
-    TCOPYOUT(res, d3);      
+    TSTORE(res, d3);
 }
 
 // Python 接口绑定
diff --git a/test/py_api/src/texp.hpp b/tests/py_api/src/texp.hpp
similarity index 94%
rename from test/py_api/src/texp.hpp
rename to tests/py_api/src/texp.hpp
index b8d1566..7c5a577 100644
--- a/test/py_api/src/texp.hpp
+++ b/tests/py_api/src/texp.hpp
@@ -23,9 +23,9 @@ void texp_py(float* dst, float* src) {
 
             tile_shape d0, d1;
 
-            TCOPYIN(d0, s0);
+            TLOAD(d0, s0);
             TEXP(d1, d0);
-            TCOPYOUT(res, d1);
+            TSTORE(res, d1);
         }
     }
 }
diff --git a/test/py_api/src/tileop_py.cpp b/tests/py_api/src/tileop_py.cpp
similarity index 97%
rename from test/py_api/src/tileop_py.cpp
rename to tests/py_api/src/tileop_py.cpp
index 557801f..c40a9d6 100644
--- a/test/py_api/src/tileop_py.cpp
+++ b/tests/py_api/src/tileop_py.cpp
@@ -11,7 +11,7 @@
 #include "flash_attention_py.hpp"
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 namespace py = pybind11;
diff --git a/test/py_api/src/tmax.hpp b/tests/py_api/src/tmax.hpp
similarity index 93%
rename from test/py_api/src/tmax.hpp
rename to tests/py_api/src/tmax.hpp
index 5b07694..618fb15 100644
--- a/test/py_api/src/tmax.hpp
+++ b/tests/py_api/src/tmax.hpp
@@ -23,10 +23,10 @@ void tmax_py(float* dst, float* src0, float* src1){
 
             tile_shape d0, d1, d2;
 
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TMAX(d2, d1, d0);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
diff --git a/test/py_api/src/tsub.hpp b/tests/py_api/src/tsub.hpp
similarity index 89%
rename from test/py_api/src/tsub.hpp
rename to tests/py_api/src/tsub.hpp
index a13c69d..a47b0be 100644
--- a/test/py_api/src/tsub.hpp
+++ b/tests/py_api/src/tsub.hpp
@@ -18,18 +18,18 @@ void tsub_py(float* dst, float* src0, float* src1) {
             int offset = i * (tile_row * gm_col) + j * tile_col;
             gm_shape s0(src0 + offset);
             gm_shape s1(src1 + offset);
-            gm_shape res(dst + offset);  
+            gm_shape res(dst + offset);
 
             tile_shape d0, d1, d2;
-            TCOPYIN(d0, s0);
-            TCOPYIN(d1, s1);
+            TLOAD(d0, s0);
+            TLOAD(d1, s1);
             TSUB(d2, d0, d1);
-            TCOPYOUT(res, d2);
+            TSTORE(res, d2);
         }
     }
 }
 
-#ifdef __cpu_sim__ 
+#ifdef __cpu_sim__
     void bind_tsub(py::module_& m) {
         m.def("tsub", [](py::array_t<float> dst_py, py::array_t<float> src0_py, py::array_t<float> src1_py){
             float* dst = static_cast<float*>(dst_py.request().ptr);
diff --git a/test/other/tileop_test/Makefile b/tests/tileop_layout/Makefile
similarity index 97%
rename from test/other/tileop_test/Makefile
rename to tests/tileop_layout/Makefile
index 11112a1..f274aea 100644
--- a/test/other/tileop_test/Makefile
+++ b/tests/tileop_layout/Makefile
@@ -1,9 +1,9 @@
-ifeq ($(TESTCASE), TCOPYIN)
+ifeq ($(TESTCASE), TLOAD)
 DEFINES += -DGM_ROW=$(row) -DGM_COL=$(col) -DTR_ROW=$(trow) -DTR_COL=$(tcol)
 TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_r$(row)_c$(col)_tr$(trow)_tc$(tcol).elf
 endif
 
-ifeq ($(TESTCASE), TCOPYOUT)
+ifeq ($(TESTCASE), TSTORE)
 DEFINES += -DROW=$(row) -DCOL=$(col) -DTROW=$(trow) -DTCOL=$(tcol)
 TARGET = $(ELF_HEAD)_$(TESTCASE)_$(MODE)_r$(row)_c$(col)_tr$(trow)_tc$(tcol).elf
 endif
@@ -177,4 +177,5 @@ ifneq ($(MODE), )
 endif
 
 SRC_FILE +=  $(TEST_ROOT)/$(CASE_SRC_DIR)/$(TESTCASE).cpp
-include ../../common/Makefile.common
+CATEGORY_ROOT := $(abspath ../..)
+include ../../benchmarks/common/Makefile.common
diff --git a/test/other/tileop_test/compile.all b/tests/tileop_layout/compile.all
similarity index 85%
rename from test/other/tileop_test/compile.all
rename to tests/tileop_layout/compile.all
index 94ccd7d..af37eeb 100755
--- a/test/other/tileop_test/compile.all
+++ b/tests/tileop_layout/compile.all
@@ -1,39 +1,39 @@
-make TESTCASE=TCOPYIN MODE=ND2ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=ND2ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=ND2ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=ND2ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
-
-make TESTCASE=TCOPYIN MODE=ND2NZ row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=ND2NZ row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=ND2NZ row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=ND2NZ row=64  col=128 trow=16 tcol=32 PLAT=linx 
-
-make TESTCASE=TCOPYIN MODE=DN2ZN row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=DN2ZN row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=DN2ZN row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYIN MODE=DN2ZN row=64  col=128 trow=16 tcol=32 PLAT=linx 
-
-#Tcopyout
-make TESTCASE=TCOPYOUT MODE=ND2ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYOUT MODE=ND2ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYOUT MODE=ND2ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYOUT MODE=ND2ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
-
-make TESTCASE=TCOPYOUT MODE=NZ2ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYOUT MODE=NZ2ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYOUT MODE=NZ2ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TCOPYOUT MODE=NZ2ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
-
-# make TESTCASE=TCOPYOUT MODE=ZN2DN row=16  col=16  trow=16 tcol=16 PLAT=linx
-# make TESTCASE=TCOPYOUT MODE=ZN2DN row=64  col=64  trow=16 tcol=16 PLAT=linx
-# make TESTCASE=TCOPYOUT MODE=ZN2DN row=128 col=64  trow=32 tcol=16 PLAT=linx
-# make TESTCASE=TCOPYOUT MODE=ZN2DN row=64  col=128 trow=16 tcol=32 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2ND row=64  col=128 trow=16 tcol=32 PLAT=linx
+
+make TESTCASE=TLOAD MODE=ND2NZ row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2NZ row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2NZ row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=ND2NZ row=64  col=128 trow=16 tcol=32 PLAT=linx
+
+make TESTCASE=TLOAD MODE=DN2ZN row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=DN2ZN row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=DN2ZN row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TLOAD MODE=DN2ZN row=64  col=128 trow=16 tcol=32 PLAT=linx
+
+#Tstore
+make TESTCASE=TSTORE MODE=ND2ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TSTORE MODE=ND2ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TSTORE MODE=ND2ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TSTORE MODE=ND2ND row=64  col=128 trow=16 tcol=32 PLAT=linx
+
+make TESTCASE=TSTORE MODE=NZ2ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TSTORE MODE=NZ2ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TSTORE MODE=NZ2ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TSTORE MODE=NZ2ND row=64  col=128 trow=16 tcol=32 PLAT=linx
+
+# make TESTCASE=TSTORE MODE=ZN2DN row=16  col=16  trow=16 tcol=16 PLAT=linx
+# make TESTCASE=TSTORE MODE=ZN2DN row=64  col=64  trow=16 tcol=16 PLAT=linx
+# make TESTCASE=TSTORE MODE=ZN2DN row=128 col=64  trow=32 tcol=16 PLAT=linx
+# make TESTCASE=TSTORE MODE=ZN2DN row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #Tadd
-make TESTCASE=TADD MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TADD MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TADD MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TADD MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TADD MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TADD MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TADD MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TADD MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 # make TESTCASE=TADD MODE=NZ row=16  col=16  trow=16 tcol=16 PLAT=linx
 # make TESTCASE=TADD MODE=NZ row=64  col=64  trow=16 tcol=16 PLAT=linx
@@ -46,10 +46,10 @@ make TESTCASE=TADD MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 # make TESTCASE=TADD MODE=ZN row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #Texp
-make TESTCASE=TEXP MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TEXP MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TEXP MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TEXP MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TEXP MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TEXP MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TEXP MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TEXP MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 # make TESTCASE=TEXP MODE=NZ row=16  col=16  trow=16 tcol=16 PLAT=linx
 # make TESTCASE=TEXP MODE=NZ row=64  col=64  trow=16 tcol=16 PLAT=linx
@@ -62,10 +62,10 @@ make TESTCASE=TEXP MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 # make TESTCASE=TEXP MODE=ZN row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #Tcopy
-make TESTCASE=TCOPY  MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPY  MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TCOPY  MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TCOPY  MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TCOPY  MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TCOPY  MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TCOPY  MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TCOPY  MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 # make TESTCASE=TCOPY TEST_TYPE=tileop MODE=DN row=16  col=16  trow=16 tcol=16 PLAT=linx
 # make TESTCASE=TCOPY TEST_TYPE=tileop MODE=DN row=64  col=64  trow=16 tcol=16 PLAT=linx
@@ -139,10 +139,10 @@ make TESTCASE=TASSEMBLE ROW=128 COL1=16 COL2=8 COL3=8 PLAT=linx
 make TESTCASE=TASSEMBLE ROW=128 COL1=8 COL2=16 COL3=8 PLAT=linx
 
 #TADDCAST
-make TESTCASE=TADDCAST MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TADDCAST MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TADDCAST MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TADDCAST MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TADDCAST MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TADDCAST MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TADDCAST MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TADDCAST MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TABS
 make TESTCASE=TABS TEST_TYPE=tileop MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
@@ -151,10 +151,10 @@ make TESTCASE=TABS TEST_TYPE=tileop MODE=ND row=128 col=64  trow=32 tcol=16 PLAT
 make TESTCASE=TABS TEST_TYPE=tileop MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TADDMASK
-make TESTCASE=TADDMASK MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TADDMASK MODE=ND row=66  col=66  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TADDMASK MODE=ND row=130 col=66  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TADDMASK MODE=ND row=66  col=130 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TADDMASK MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TADDMASK MODE=ND row=66  col=66  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TADDMASK MODE=ND row=130 col=66  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TADDMASK MODE=ND row=66  col=130 trow=16 tcol=32 PLAT=linx
 
 #TAND
 make TESTCASE=TAND TEST_TYPE=tileop MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
@@ -181,22 +181,22 @@ make TESTCASE=TEXPANDSCALAR TEST_TYPE=tileop MODE=ND row=128 col=64  trow=32 tco
 make TESTCASE=TEXPANDSCALAR TEST_TYPE=tileop MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TDIV
-make TESTCASE=TDIV MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TDIV MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TDIV MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TDIV MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TDIV MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TDIV MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TDIV MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TDIV MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TMUL
-make TESTCASE=TMUL MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TMUL MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TMUL MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TMUL MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TMUL MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TMUL MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TMUL MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TMUL MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TSUB
-make TESTCASE=TSUB MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TSUB MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TSUB MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TSUB MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TSUB MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TSUB MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TSUB MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TSUB MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TCAST
 make TESTCASE=TCAST TEST_TYPE=tileop MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
@@ -229,10 +229,10 @@ make TESTCASE=TTRANS TEST_TYPE=tileop MODE=ND row=128 col=64  trow=32 tcol=16 PL
 make TESTCASE=TTRANS TEST_TYPE=tileop MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TMAKERANGE
-make TESTCASE=TMAKERANGE MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TMAKERANGE MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx 
-make TESTCASE=TMAKERANGE MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx 
-make TESTCASE=TMAKERANGE MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx 
+make TESTCASE=TMAKERANGE MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TMAKERANGE MODE=ND row=64  col=64  trow=16 tcol=16 PLAT=linx
+make TESTCASE=TMAKERANGE MODE=ND row=128 col=64  trow=32 tcol=16 PLAT=linx
+make TESTCASE=TMAKERANGE MODE=ND row=64  col=128 trow=16 tcol=32 PLAT=linx
 
 #TOR
 make TESTCASE=TOR TEST_TYPE=tileop MODE=ND row=16  col=16  trow=16 tcol=16 PLAT=linx
diff --git a/test/other/tileop_test/compile_fa_tileop.all b/tests/tileop_layout/compile_fa_tileop.all
similarity index 100%
rename from test/other/tileop_test/compile_fa_tileop.all
rename to tests/tileop_layout/compile_fa_tileop.all
diff --git a/test/other/tileop_test/src/CubeVecTrans.cpp b/tests/tileop_layout/src/CubeVecTrans.cpp
similarity index 93%
rename from test/other/tileop_test/src/CubeVecTrans.cpp
rename to tests/tileop_layout/src/CubeVecTrans.cpp
index b5aefb8..7c6837a 100644
--- a/test/other/tileop_test/src/CubeVecTrans.cpp
+++ b/tests/tileop_layout/src/CubeVecTrans.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <cstring> 
+#include <cstring>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef GM
@@ -64,8 +64,8 @@ void CubeVecTrans(float* dst, float* src0, float* src1){
             for(int k=0;k<Kb;k++){
                 tile_shapeA tA(i+j+k);
                 tile_shapeB tB(i+j+k);
-                // TCOPYIN(tA, gA);
-                // TCOPYIN(tB, gB);
+                // TLOAD(tA, gA);
+                // TLOAD(tB, gB);
                 MATMUL(tACC, tA, tB);
                 tile_shapeD tVecIn;
                 TCVT(tVecIn, tACC);
@@ -76,7 +76,7 @@ void CubeVecTrans(float* dst, float* src0, float* src1){
                 tile_shapeD_out tCubeOut;
                 MATMUL(tCubeOut, tCubeIn_left, tCubeIn_right);
             }
-            //TCOPYOUT(gC, tACC);
+            //TSTORE(gC, tACC);
         }
     }
 }
@@ -85,7 +85,7 @@ int main() {
     float src0[GM*GK];
     float src1[GK*GN];
     float dst[GM*GN];
-    
+
     #ifdef LINX_PMC
     PMC_START();
     #endif
diff --git a/test/other/tileop_test/src/MATMUL.cpp b/tests/tileop_layout/src/MATMUL.cpp
similarity index 100%
rename from test/other/tileop_test/src/MATMUL.cpp
rename to tests/tileop_layout/src/MATMUL.cpp
diff --git a/test/other/tileop_test/src/MGATHER.cpp b/tests/tileop_layout/src/MGATHER.cpp
similarity index 100%
rename from test/other/tileop_test/src/MGATHER.cpp
rename to tests/tileop_layout/src/MGATHER.cpp
diff --git a/test/other/tileop_test/src/MSCATTER.cpp b/tests/tileop_layout/src/MSCATTER.cpp
similarity index 100%
rename from test/other/tileop_test/src/MSCATTER.cpp
rename to tests/tileop_layout/src/MSCATTER.cpp
diff --git a/test/other/tileop_test/src/TABS.cpp b/tests/tileop_layout/src/TABS.cpp
similarity index 98%
rename from test/other/tileop_test/src/TABS.cpp
rename to tests/tileop_layout/src/TABS.cpp
index 4062283..5695a6b 100644
--- a/test/other/tileop_test/src/TABS.cpp
+++ b/tests/tileop_layout/src/TABS.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TADD.cpp b/tests/tileop_layout/src/TADD.cpp
similarity index 98%
rename from test/other/tileop_test/src/TADD.cpp
rename to tests/tileop_layout/src/TADD.cpp
index bae1b85..7c096f6 100644
--- a/test/other/tileop_test/src/TADD.cpp
+++ b/tests/tileop_layout/src/TADD.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TADDCAST.cpp b/tests/tileop_layout/src/TADDCAST.cpp
similarity index 99%
rename from test/other/tileop_test/src/TADDCAST.cpp
rename to tests/tileop_layout/src/TADDCAST.cpp
index f76949b..4772b76 100644
--- a/test/other/tileop_test/src/TADDCAST.cpp
+++ b/tests/tileop_layout/src/TADDCAST.cpp
@@ -3,7 +3,7 @@
 #include "jcore/TAddCast.hpp"
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TADDMASK.cpp b/tests/tileop_layout/src/TADDMASK.cpp
similarity index 90%
rename from test/other/tileop_test/src/TADDMASK.cpp
rename to tests/tileop_layout/src/TADDMASK.cpp
index c52dcf9..b11fad5 100644
--- a/test/other/tileop_test/src/TADDMASK.cpp
+++ b/tests/tileop_layout/src/TADDMASK.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -55,10 +55,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) {
             auto g2 = gdst(i, j);
 
             tile_shape td0(2*i+j), td1(i+2*j), td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TADD(td2, td0, td1);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
         if constexpr (remainder_col) {
             auto g0 = gsrc0(i, block_col);
@@ -66,10 +66,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) {
             auto g2 = gdst(i, block_col);
 
             trailing_rows_shape td0(2*i), td1(i), td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TADD(td2, td0, td1);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
     }
     if constexpr (remainder_row) {
@@ -79,10 +79,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) {
             auto g2 = gdst(block_row, j);
 
             trailing_cols_shape td0(j), td1(2*j), td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TADD(td2, td0, td1);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
         if constexpr (remainder_col) {
             auto g0 = gsrc0(block_row, block_col);
@@ -90,10 +90,10 @@ void tadd_mask_nd(float *dst, float *src0, float *src1) {
             auto g2 = gdst(block_row, block_col);
 
             trailing_corner_shape td0(0), td1(1), td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TADD(td2, td0, td1);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TAND.cpp b/tests/tileop_layout/src/TAND.cpp
similarity index 95%
rename from test/other/tileop_test/src/TAND.cpp
rename to tests/tileop_layout/src/TAND.cpp
index 6a4cbcf..3680161 100644
--- a/test/other/tileop_test/src/TAND.cpp
+++ b/tests/tileop_layout/src/TAND.cpp
@@ -1,9 +1,9 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 #include "jcore/TAnd.hpp"
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -47,10 +47,10 @@ void tand_nd(T *dst, T *src0, T *src1) {
             auto g2 = gdst(i, j);
 
             tile_shape td0(2*i+j), td1(i+2*j), td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TAND_Impl(td2, td1, td0);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TASSEMBLE.cpp b/tests/tileop_layout/src/TASSEMBLE.cpp
similarity index 100%
rename from test/other/tileop_test/src/TASSEMBLE.cpp
rename to tests/tileop_layout/src/TASSEMBLE.cpp
diff --git a/test/other/tileop_test/src/TCAST.cpp b/tests/tileop_layout/src/TCAST.cpp
similarity index 98%
rename from test/other/tileop_test/src/TCAST.cpp
rename to tests/tileop_layout/src/TCAST.cpp
index 33504fa..cdf679b 100644
--- a/test/other/tileop_test/src/TCAST.cpp
+++ b/tests/tileop_layout/src/TCAST.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TCOPY.cpp b/tests/tileop_layout/src/TCOPY.cpp
similarity index 99%
rename from test/other/tileop_test/src/TCOPY.cpp
rename to tests/tileop_layout/src/TCOPY.cpp
index 2ec924f..594f6d0 100644
--- a/test/other/tileop_test/src/TCOPY.cpp
+++ b/tests/tileop_layout/src/TCOPY.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TDIV.cpp b/tests/tileop_layout/src/TDIV.cpp
similarity index 98%
rename from test/other/tileop_test/src/TDIV.cpp
rename to tests/tileop_layout/src/TDIV.cpp
index b2f8a2b..a326d25 100644
--- a/test/other/tileop_test/src/TDIV.cpp
+++ b/tests/tileop_layout/src/TDIV.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TEXP.cpp b/tests/tileop_layout/src/TEXP.cpp
similarity index 98%
rename from test/other/tileop_test/src/TEXP.cpp
rename to tests/tileop_layout/src/TEXP.cpp
index 320ed9a..3897710 100644
--- a/test/other/tileop_test/src/TEXP.cpp
+++ b/tests/tileop_layout/src/TEXP.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TEXPANDCOL.cpp b/tests/tileop_layout/src/TEXPANDCOL.cpp
similarity index 93%
rename from test/other/tileop_test/src/TEXPANDCOL.cpp
rename to tests/tileop_layout/src/TEXPANDCOL.cpp
index d8f4e8d..6119706 100644
--- a/test/other/tileop_test/src/TEXPANDCOL.cpp
+++ b/tests/tileop_layout/src/TEXPANDCOL.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -47,9 +47,9 @@ void texpandcol_nd(float *dst, float *src) {
 
             tile_shape_in td0(2*i+j);
             tile_shape_out td1;
-            // TCOPYIN(td0, g0);
+            // TLOAD(td0, g0);
             TEXPANDCOL(td1, td0);
-            // TCOPYOUT(g1, td1);
+            // TSTORE(g1, td1);
         }
     }
 }
@@ -76,9 +76,9 @@ void texpandcol_dn(float *dst, float *src) {
 
             tile_shape_in td0(2*i+j);
             tile_shape_out td1;
-            // TCOPYIN(td0, g0);
+            // TLOAD(td0, g0);
             TEXPANDCOL(td1, td0);
-            // TCOPYOUT(g1, td1);
+            // TSTORE(g1, td1);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TEXPANDROW.cpp b/tests/tileop_layout/src/TEXPANDROW.cpp
similarity index 93%
rename from test/other/tileop_test/src/TEXPANDROW.cpp
rename to tests/tileop_layout/src/TEXPANDROW.cpp
index f5db59b..2eaea10 100644
--- a/test/other/tileop_test/src/TEXPANDROW.cpp
+++ b/tests/tileop_layout/src/TEXPANDROW.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -47,9 +47,9 @@ void texpandrow_nd(float *dst, float *src) {
 
             tile_shape_in td0(2*i+j);
             tile_shape_out td1;
-            // TCOPYIN(td0, g0);
+            // TLOAD(td0, g0);
             TEXPANDROW(td1, td0);
-            // TCOPYOUT(g1, td1);
+            // TSTORE(g1, td1);
         }
     }
 }
@@ -76,9 +76,9 @@ void texpandrow_dn(float *dst, float *src) {
 
             tile_shape_in td0(2*i+j);
             tile_shape_out td1;
-            // TCOPYIN(td0, g0);
+            // TLOAD(td0, g0);
             TEXPANDROW(td1, td0);
-            // TCOPYOUT(g1, td1);
+            // TSTORE(g1, td1);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TEXPANDSCALAR.cpp b/tests/tileop_layout/src/TEXPANDSCALAR.cpp
similarity index 94%
rename from test/other/tileop_test/src/TEXPANDSCALAR.cpp
rename to tests/tileop_layout/src/TEXPANDSCALAR.cpp
index 6834142..5621dfa 100644
--- a/test/other/tileop_test/src/TEXPANDSCALAR.cpp
+++ b/tests/tileop_layout/src/TEXPANDSCALAR.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -43,7 +43,7 @@ void texpandscalar_nd(float *dst, float s) {
 
             tile_shape td0;
             TEXPANDSCALAR(td0, s);
-            // TCOPYOUT(g0, td0);
+            // TSTORE(g0, td0);
         }
     }
 }
@@ -66,7 +66,7 @@ void texpandscalar_dn(float *dst, float s) {
 
             tile_shape td0;
             TEXPANDSCALAR(td0, s);
-            // TCOPYOUT(g0, td0);
+            // TSTORE(g0, td0);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TEXTRACT.cpp b/tests/tileop_layout/src/TEXTRACT.cpp
similarity index 100%
rename from test/other/tileop_test/src/TEXTRACT.cpp
rename to tests/tileop_layout/src/TEXTRACT.cpp
diff --git a/test/other/tileop_test/src/TFILLPAD.cpp b/tests/tileop_layout/src/TFILLPAD.cpp
similarity index 98%
rename from test/other/tileop_test/src/TFILLPAD.cpp
rename to tests/tileop_layout/src/TFILLPAD.cpp
index b80a8bd..f3f37de 100644
--- a/test/other/tileop_test/src/TFILLPAD.cpp
+++ b/tests/tileop_layout/src/TFILLPAD.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TGATHER.cpp b/tests/tileop_layout/src/TGATHER.cpp
similarity index 93%
rename from test/other/tileop_test/src/TGATHER.cpp
rename to tests/tileop_layout/src/TGATHER.cpp
index e3bda07..86e4917 100644
--- a/test/other/tileop_test/src/TGATHER.cpp
+++ b/tests/tileop_layout/src/TGATHER.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -55,10 +55,10 @@ void tgather_nd(float *dst, float *src, uint16_t *indices) {
             tile_shape_src td0(2*i+j);
             tile_shape_indices td1(1);
             tile_shape_dst td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TGATHER(td2, td0, td1);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
     }
 }
@@ -93,10 +93,10 @@ void tgather_dn(float *dst, float *src, uint16_t *indices) {
             tile_shape_src td0(2*i+j);
             tile_shape_indices td1(1);
             tile_shape_dst td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TGATHER(td2, td0, td1);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TCOPYIN.cpp b/tests/tileop_layout/src/TLOAD.cpp
similarity index 83%
rename from test/other/tileop_test/src/TCOPYIN.cpp
rename to tests/tileop_layout/src/TLOAD.cpp
index b3d38b6..c947f3c 100644
--- a/test/other/tileop_test/src/TCOPYIN.cpp
+++ b/tests/tileop_layout/src/TLOAD.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef GM_ROW
@@ -26,7 +26,7 @@
 #endif
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row, uint64_t tile_col>
-void copyin_nd2nd(float *src) {
+void load_nd2nd(float *src) {
   using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
   using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
   //带mask tile_shape = Tile<Location::Vec, float, tile_row, tile_col, 11, 11>;
@@ -36,14 +36,14 @@ void copyin_nd2nd(float *src) {
     for (int j = 0; j < block_col; ++j) {
       int offset = i * (tile_row * gm_col) + j * tile_col;
       gm_shape s0(src + offset);
-      tile_shape d0; 
-      TCOPYIN(d0, s0);
+      tile_shape d0;
+      TLOAD(d0, s0);
     }
   }
 }
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row, uint64_t tile_col>
-void copyin_nd2nz(float *src) {
+void load_nd2nz(float *src) {
   using gm_shape = global_tensor<float, RowMajor<gm_row, gm_col>>;
   using tile_shape = TileLeft<float, tile_row, tile_col>;
 
@@ -56,15 +56,15 @@ void copyin_nd2nz(float *src) {
 
   for (int i = 0; i < block_row; ++i) {
     for (int j = 0; j < block_col; ++j) {
-      tile_shape d0; 
+      tile_shape d0;
       auto g0 =  gsrc(i,j);
-      TCOPYIN(d0, g0);
+      TLOAD(d0, g0);
     }
   }
 }
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row, uint64_t tile_col>
-void copyin_dn2zn(float *src) {
+void load_dn2zn(float *src) {
   using gm_shape = global_tensor<float, ColMajor<gm_row, gm_col>>;
   using tile_shape = TileRight<float, tile_row, tile_col>;
 
@@ -77,9 +77,9 @@ void copyin_dn2zn(float *src) {
 
   for (int i = 0; i < block_col; ++i) {
     for (int j = 0; j < block_row; ++j) {
-      tile_shape d0; 
+      tile_shape d0;
       auto g0 =  gsrc(i,j);
-      TCOPYIN(d0, g0);
+      TLOAD(d0, g0);
     }
   }
 }
@@ -100,11 +100,11 @@ int main() {
   float src[gm_size];
 
   if(!strcmp(MODE, "ND2ND")){
-    copyin_nd2nd<gm_row, gm_col, tile_row, tile_col>(src);
+    load_nd2nd<gm_row, gm_col, tile_row, tile_col>(src);
   }else if(!strcmp(MODE, "ND2NZ")){
-    copyin_nd2nz<gm_row, gm_col, tile_row, tile_col>(src);
+    load_nd2nz<gm_row, gm_col, tile_row, tile_col>(src);
   }else if(!strcmp(MODE, "DN2ZN")){
-    copyin_dn2zn<gm_row, gm_col, tile_row, tile_col>(src);
+    load_dn2zn<gm_row, gm_col, tile_row, tile_col>(src);
   }
 
   #ifdef LINX_PMC
diff --git a/test/other/tileop_test/src/TMAKERANGE.cpp b/tests/tileop_layout/src/TMAKERANGE.cpp
similarity index 94%
rename from test/other/tileop_test/src/TMAKERANGE.cpp
rename to tests/tileop_layout/src/TMAKERANGE.cpp
index b63317b..66ec41d 100644
--- a/test/other/tileop_test/src/TMAKERANGE.cpp
+++ b/tests/tileop_layout/src/TMAKERANGE.cpp
@@ -1,9 +1,9 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 #include "jcore/TMakeRange.hpp"
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -44,7 +44,7 @@ void tmakerange_nd(float *dst, float s) {
 
             tile_shape td0;
             TMAKERANGE_Impl(td0, s);
-            // TCOPYOUT(g0, td0);
+            // TSTORE(g0, td0);
         }
     }
 }
@@ -67,7 +67,7 @@ void tmakerange_dn(float *dst, float s) {
 
             tile_shape td0;
             // TMAKERANGE_Impl(td0, s);
-            // TCOPYOUT(g0, td0);
+            // TSTORE(g0, td0);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TMUL.cpp b/tests/tileop_layout/src/TMUL.cpp
similarity index 98%
rename from test/other/tileop_test/src/TMUL.cpp
rename to tests/tileop_layout/src/TMUL.cpp
index 47cd391..0389cb3 100644
--- a/test/other/tileop_test/src/TMUL.cpp
+++ b/tests/tileop_layout/src/TMUL.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TOR.cpp b/tests/tileop_layout/src/TOR.cpp
similarity index 95%
rename from test/other/tileop_test/src/TOR.cpp
rename to tests/tileop_layout/src/TOR.cpp
index a31eb31..ae9e5e9 100644
--- a/test/other/tileop_test/src/TOR.cpp
+++ b/tests/tileop_layout/src/TOR.cpp
@@ -1,9 +1,9 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 #include "jcore/TOr.hpp"
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -47,10 +47,10 @@ void tor_nd(T *dst, T *src0, T *src1) {
             auto g2 = gdst(i, j);
 
             tile_shape td0(i%2), td1(j%2), td2;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
             TOR_Impl(td2, td1, td0);
-            // TCOPYOUT(g2, td2);
+            // TSTORE(g2, td2);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TRESHAPE.cpp b/tests/tileop_layout/src/TRESHAPE.cpp
similarity index 100%
rename from test/other/tileop_test/src/TRESHAPE.cpp
rename to tests/tileop_layout/src/TRESHAPE.cpp
diff --git a/test/other/tileop_test/src/TROWMAX.cpp b/tests/tileop_layout/src/TROWMAX.cpp
similarity index 98%
rename from test/other/tileop_test/src/TROWMAX.cpp
rename to tests/tileop_layout/src/TROWMAX.cpp
index b668481..a70bb93 100644
--- a/test/other/tileop_test/src/TROWMAX.cpp
+++ b/tests/tileop_layout/src/TROWMAX.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TROWMAXEXPAND.cpp b/tests/tileop_layout/src/TROWMAXEXPAND.cpp
similarity index 98%
rename from test/other/tileop_test/src/TROWMAXEXPAND.cpp
rename to tests/tileop_layout/src/TROWMAXEXPAND.cpp
index 8adb2a1..8c4af0f 100644
--- a/test/other/tileop_test/src/TROWMAXEXPAND.cpp
+++ b/tests/tileop_layout/src/TROWMAXEXPAND.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TROWSUM.cpp b/tests/tileop_layout/src/TROWSUM.cpp
similarity index 98%
rename from test/other/tileop_test/src/TROWSUM.cpp
rename to tests/tileop_layout/src/TROWSUM.cpp
index 7fd88b9..3ed3d08 100644
--- a/test/other/tileop_test/src/TROWSUM.cpp
+++ b/tests/tileop_layout/src/TROWSUM.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TROWSUMEXPAND.cpp b/tests/tileop_layout/src/TROWSUMEXPAND.cpp
similarity index 98%
rename from test/other/tileop_test/src/TROWSUMEXPAND.cpp
rename to tests/tileop_layout/src/TROWSUMEXPAND.cpp
index cb5bff7..437c50c 100644
--- a/test/other/tileop_test/src/TROWSUMEXPAND.cpp
+++ b/tests/tileop_layout/src/TROWSUMEXPAND.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TSELECT.cpp b/tests/tileop_layout/src/TSELECT.cpp
similarity index 91%
rename from test/other/tileop_test/src/TSELECT.cpp
rename to tests/tileop_layout/src/TSELECT.cpp
index 685aa80..519b55f 100644
--- a/test/other/tileop_test/src/TSELECT.cpp
+++ b/tests/tileop_layout/src/TSELECT.cpp
@@ -1,8 +1,8 @@
 #include <common/pto_tileop.hpp>
-#include <string> 
+#include <string>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -54,11 +54,11 @@ void tselect_nd(float *dst, float *src0, float *src1, uint16_t *cond) {
             tile_shape_fp32 td0(2*i+j), td1(i+2*j);
             tile_shape_uint16 td2(i%2);
             tile_shape_fp32 td3;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
-            // TCOPYIN(td2, g2);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
+            // TLOAD(td2, g2);
             TSELECT(td3, td2, td0, td1);
-            // TCOPYOUT(g3, td3);
+            // TSTORE(g3, td3);
         }
     }
 }
@@ -92,11 +92,11 @@ void tselect_dn(float *dst, float *src0, float *src1, uint16_t *cond) {
             tile_shape_fp32 td0(2*i+j), td1(i+2*j);
             tile_shape_uint16 td2(i%2);
             tile_shape_fp32 td3;
-            // TCOPYIN(td0, g0);
-            // TCOPYIN(td1, g1);
-            // TCOPYIN(td2, g2);
+            // TLOAD(td0, g0);
+            // TLOAD(td1, g1);
+            // TLOAD(td2, g2);
             TSELECT(td3, td2, td0, td1);
-            // TCOPYOUT(g3, td3);
+            // TSTORE(g3, td3);
         }
     }
 }
diff --git a/test/other/tileop_test/src/TCOPYOUT.cpp b/tests/tileop_layout/src/TSTORE.cpp
similarity index 85%
rename from test/other/tileop_test/src/TCOPYOUT.cpp
rename to tests/tileop_layout/src/TSTORE.cpp
index 7c81372..b8e7bc6 100644
--- a/test/other/tileop_test/src/TCOPYOUT.cpp
+++ b/tests/tileop_layout/src/TSTORE.cpp
@@ -1,7 +1,7 @@
 #include <common/pto_tileop.hpp>
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
@@ -21,7 +21,7 @@
 #endif
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row, uint64_t tile_col>
-void copyout_nd2nd(float *dst) {
+void store_nd2nd(float *dst) {
     using tile_shape = Tile<Location::Vec, float, tile_row, tile_col, BLayout::RowMajor>;
     using gm_shape   = global_tensor<float, RowMajor<gm_row, gm_col>>;
 
@@ -36,13 +36,13 @@ void copyout_nd2nd(float *dst) {
         for (int j = 0; j < block_col; ++j) {
         tile_shape d0(i+j);
         auto dstO = gdst(i,j);
-        TCOPYOUT(dstO, d0);
+        TSTORE(dstO, d0);
         }
     }
 }
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row, uint64_t tile_col>
-void copyout_nz2nd(float *dst) {
+void store_nz2nd(float *dst) {
     using tile_shape = TileLeft<float, tile_row, tile_col>;
     using gm_shape   = global_tensor<float, RowMajor<gm_row, gm_col>>;
 
@@ -56,13 +56,13 @@ void copyout_nz2nd(float *dst) {
         for (int j = 0; j < block_col; ++j) {
         tile_shape d0(i+j);
         auto dstO = gdst(i,j);
-        TCOPYOUT(dstO, d0);
+        TSTORE(dstO, d0);
         }
     }
 }
 
 template <uint64_t gm_row, uint64_t gm_col, uint64_t tile_row, uint64_t tile_col>
-void copyout_zn2dn(float *dst) {
+void store_zn2dn(float *dst) {
     // using tile_shape = TileRight<float, tile_row, tile_col>;
     // using gm_shape   = global_tensor<float, ColMajor<gm_row, gm_col>>;
 
@@ -76,7 +76,7 @@ void copyout_zn2dn(float *dst) {
     //     for (int j = 0; j < block_col; ++j) {
     //     tile_shape d0(i+j);
     //     auto dstO = gdst(i,j);
-    //     TCOPYOUT(dstO, d0);
+    //     TSTORE(dstO, d0);
     //     }
     // }
 }
@@ -92,15 +92,15 @@ int main() {
     #endif
 
     size_t gm_size = gm_row * gm_col;
-    
+
     float dst[gm_size];
 
     if(!strcmp(MODE, "ND2ND")){
-        copyout_nd2nd<gm_row, gm_col, tile_row, tile_col>(dst);
+        store_nd2nd<gm_row, gm_col, tile_row, tile_col>(dst);
     }else if(!strcmp(MODE, "NZ2ND")){
-        copyout_nz2nd<gm_row, gm_col, tile_row, tile_col>(dst);
+        store_nz2nd<gm_row, gm_col, tile_row, tile_col>(dst);
     }else if(!strcmp(MODE, "ZN2DN")){
-        copyout_zn2dn<gm_row, gm_col, tile_row, tile_col>(dst);
+        store_zn2dn<gm_row, gm_col, tile_row, tile_col>(dst);
     }
 
     #ifdef LINX_PMC
diff --git a/test/other/tileop_test/src/TSUB.cpp b/tests/tileop_layout/src/TSUB.cpp
similarity index 98%
rename from test/other/tileop_test/src/TSUB.cpp
rename to tests/tileop_layout/src/TSUB.cpp
index 9448a6e..afbed25 100644
--- a/test/other/tileop_test/src/TSUB.cpp
+++ b/tests/tileop_layout/src/TSUB.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/TTRANS.cpp b/tests/tileop_layout/src/TTRANS.cpp
similarity index 98%
rename from test/other/tileop_test/src/TTRANS.cpp
rename to tests/tileop_layout/src/TTRANS.cpp
index 8975633..e2852a0 100644
--- a/test/other/tileop_test/src/TTRANS.cpp
+++ b/tests/tileop_layout/src/TTRANS.cpp
@@ -2,7 +2,7 @@
 #include <string> 
 
 #ifdef LINX_PMC
-#include "../linxStartEnd.hpp"
+#include <linxStartEnd.hpp>
 #endif
 
 #ifndef ROW
diff --git a/test/other/tileop_test/src/fa_tileop.cpp b/tests/tileop_layout/src/fa_tileop.cpp
similarity index 92%
rename from test/other/tileop_test/src/fa_tileop.cpp
rename to tests/tileop_layout/src/fa_tileop.cpp
index 10812e2..d409687 100644
--- a/test/other/tileop_test/src/fa_tileop.cpp
+++ b/tests/tileop_layout/src/fa_tileop.cpp
@@ -38,7 +38,7 @@
     #endif
 #else
     typedef float dtype;
-#endif  
+#endif
 
 template <uint16_t gm_row, uint16_t gm_col, uint16_t tile_row, uint16_t tile_col>
 void tsub_nz_left(dtype *dst, dtype *src0, dtype *src1) {
@@ -57,11 +57,11 @@ void tsub_nz_left(dtype *dst, dtype *src0, dtype *src1) {
     for (int i = 0; i < block_row; ++i) {
         for (int j = 0; j < block_col; ++j) {
             tile_shape tsrc0,tsrc1;
-            TCOPYIN(tsrc0, gsrc0(i, j));
-            TCOPYIN(tsrc1, gsrc1(i, j));
+            TLOAD(tsrc0, gsrc0(i, j));
+            TLOAD(tsrc1, gsrc1(i, j));
             TSUB(tsrc0, tsrc0, tsrc1);
             auto gO = gdst(i, j);
-            TCOPYOUT(gO, tsrc0);
+            TSTORE(gO, tsrc0);
         }
     }
 }
@@ -86,11 +86,11 @@ void trowsum_nz_left(dtype *dst, dtype *src) {
         tile_shape_r trsum;
         for (int j = 0; j < block_col; ++j) {
             tile_shape_in tsrc;
-            TCOPYIN(tsrc, gsrc(i, j));
+            TLOAD(tsrc, gsrc(i, j));
             TROWSUM(trsum, tsrc);
         }
         auto gO = gdst(i, 0);
-        TCOPYOUT(gO, trsum);
+        TSTORE(gO, trsum);
     }
 }
 
@@ -115,10 +115,10 @@ void texpandcol_nz_left(dtype *dst, dtype *src) {
         for (int j = 0; j < block_col; ++j) {
             tile_shape_in tsrc;
             tile_shape_expand texpand;
-            TCOPYIN(tsrc, gsrc(i, 0));
+            TLOAD(tsrc, gsrc(i, 0));
             TEXPANDCOL(texpand, tsrc);
             auto gO = gdst(i, j);
-            TCOPYOUT(gO, texpand);
+            TSTORE(gO, texpand);
         }
     }
 }
@@ -140,11 +140,11 @@ void tmul_nz_out(dtype *dst, dtype *src0, dtype *src1) {
     for (int i = 0; i < block_row; ++i) {
         for (int j = 0; j < block_col; ++j) {
             tile_shape tsrc0,tsrc1;
-            TCOPYIN(tsrc0, gsrc0(i, j));
-            TCOPYIN(tsrc1, gsrc1(i, j));
+            TLOAD(tsrc0, gsrc0(i, j));
+            TLOAD(tsrc1, gsrc1(i, j));
             TMUL(tsrc0, tsrc0, tsrc1);
             auto gO = gdst(i, j);
-            TCOPYOUT(gO, tsrc0);
+            TSTORE(gO, tsrc0);
         }
     }
 }
@@ -170,10 +170,10 @@ void texpandcol_nz_out(dtype *dst, dtype *src) {
         for (int j = 0; j < block_col; ++j) {
             tile_shape_in tsrc;
             tile_shape_expand texpand;
-            TCOPYIN(tsrc, gsrc(i, 0));
+            TLOAD(tsrc, gsrc(i, 0));
             TEXPANDCOL(texpand, tsrc);
             auto gO = gdst(i, j);
-            TCOPYOUT(gO, texpand);
+            TSTORE(gO, texpand);
         }
     }
 }
@@ -199,11 +199,11 @@ void trowmax_nz_left(dtype *dst, dtype *src) {
         tile_shape_r trmax;
         for (int j = 0; j < block_col; ++j) {
             tile_shape_in tsrc;
-            TCOPYIN(tsrc, gsrc(i, j));
+            TLOAD(tsrc, gsrc(i, j));
             TROWMAX(trmax, tsrc);
         }
         auto gO = gdst(i, 0);
-        TCOPYOUT(gO, trmax);
+        TSTORE(gO, trmax);
     }
 }
 
@@ -223,10 +223,10 @@ void tmuls_nz_left(dtype *dst, dtype *src, dtype s) {
     for (int i = 0; i < block_row; ++i) {
         for (int j = 0; j < block_col; ++j) {
             tile_shape tsrc;
-            TCOPYIN(tsrc, gsrc(i, j));
+            TLOAD(tsrc, gsrc(i, j));
             TMULS(tsrc, tsrc, s);
             auto gO = gdst(i, j);
-            TCOPYOUT(gO, tsrc);
+            TSTORE(gO, tsrc);
         }
     }
 }
@@ -249,10 +249,10 @@ void tcvt_out_left(dtype *dst, dtype *src) {
         for (int j = 0; j < block_col; ++j) {
             tile_shape_in tsrc;
             tile_shape_out tout;
-            TCOPYIN(tsrc, gsrc(i, j));
+            TLOAD(tsrc, gsrc(i, j));
             TCVT(tout, tsrc);
             auto gO = gdst(i, j);
-            TCOPYOUT(gO, tout);
+            TSTORE(gO, tout);
         }
     }
 }