Translation and benchmarking of CCE paged attention highperf by MirkoDeVita98 · Pull Request #175 · hw-native-sys/pto-isa

MirkoDeVita98 · 2026-06-19T12:00:07Z

| shape                           | batch | seq_len | block_dim | jit_time_us | jit_tflops | jit_tflops_normalized | jit_bandwidth_tb_s |
|---------------------------------|------:|--------:|----------:|------------:|-----------:|----------------------:|-------------------:|
| b1_h32_kv8_s128_bs128_fp16      |     1 |     128 |        24 |    2806.880 |   0.000756 |              0.018141 |           0.000193 |
| b1_h32_kv8_s512_bs128_fp16      |     1 |     512 |        24 |     472.160 |   0.017975 |              0.431388 |           0.004476 |
| b1_h32_kv8_s4096_bs128_fp16     |     1 |    4096 |        24 |     442.020 |   0.153602 |              3.686452 |           0.037993 |
| b1_h32_kv8_s8192_bs128_fp16     |     1 |    8192 |        24 |     425.660 |   0.319012 |              7.656281 |           0.078868 |
| b1_h32_kv8_s16384_bs128_fp16    |     1 |   16384 |        24 |     448.560 |   0.605451 |             14.530825 |           0.149647 |
| b1_h32_kv8_s32768_bs128_fp16    |     1 |   32768 |        24 |     466.000 |   1.165584 |             27.974025 |           0.288058 |
| b1_h32_kv8_s65536_bs128_fp16    |     1 |   65536 |        24 |     690.600 |   1.573016 |             37.752379 |           0.388726 |
| b1_h32_kv8_s131072_bs128_fp16   |     1 |  131072 |        24 |    1018.260 |   2.133688 |             51.208518 |           0.527264 |
| b2_h32_kv8_s128_bs128_fp16      |     2 |     128 |        24 |     446.460 |   0.009504 |              0.228105 |           0.002422 |
| b2_h32_kv8_s512_bs128_fp16      |     2 |     512 |        24 |     428.880 |   0.039577 |              0.949843 |           0.009856 |
| b2_h32_kv8_s4096_bs128_fp16     |     2 |    4096 |        24 |     434.700 |   0.312377 |              7.497058 |           0.077266 |
| b2_h32_kv8_s8192_bs128_fp16     |     2 |    8192 |        24 |     440.580 |   0.616417 |             14.794011 |           0.152395 |
| b2_h32_kv8_s16384_bs128_fp16    |     2 |   16384 |        24 |     473.400 |   1.147364 |             27.536742 |           0.283590 |
| b2_h32_kv8_s32768_bs128_fp16    |     2 |   32768 |        24 |     672.740 |   1.614776 |             38.754632 |           0.399070 |
| b2_h32_kv8_s65536_bs128_fp16    |     2 |   65536 |        24 |    1032.760 |   2.103731 |             50.489546 |           0.519877 |
| b2_h32_kv8_s131072_bs128_fp16   |     2 |  131072 |        24 |    1481.100 |   2.933832 |             70.411974 |           0.724990 |
| b4_h32_kv8_s128_bs128_fp16      |     4 |     128 |        24 |     456.700 |   0.018583 |              0.445981 |           0.004736 |
| b4_h32_kv8_s512_bs128_fp16      |     4 |     512 |        24 |     445.740 |   0.076160 |              1.827831 |           0.018967 |
| b4_h32_kv8_s4096_bs128_fp16     |     4 |    4096 |        24 |     490.420 |   0.553772 |             13.290531 |           0.136974 |
| b4_h32_kv8_s8192_bs128_fp16     |     4 |    8192 |        24 |     518.540 |   1.047483 |             25.139604 |           0.258966 |
| b4_h32_kv8_s16384_bs128_fp16    |     4 |   16384 |        24 |     683.420 |   1.589542 |             38.148997 |           0.392881 |
| b4_h32_kv8_s32768_bs128_fp16    |     4 |   32768 |        24 |    1462.180 |   1.485897 |             35.661533 |           0.367219 |
| b4_h32_kv8_s65536_bs128_fp16    |     4 |   65536 |        24 |    1510.700 |   2.876348 |             69.032349 |           0.710807 |
| b4_h32_kv8_s131072_bs128_fp16   |     4 |  131072 |        24 |    2490.960 |   3.488855 |             83.732518 |           0.862144 |
| b8_h32_kv8_s128_bs128_fp16      |     8 |     128 |        24 |     500.720 |   0.033898 |              0.813547 |           0.008638 |
| b8_h32_kv8_s512_bs128_fp16      |     8 |     512 |        24 |     464.360 |   0.146212 |              3.509077 |           0.036412 |
| b8_h32_kv8_s4096_bs128_fp16     |     8 |    4096 |        24 |     532.000 |   1.020981 |             24.503542 |           0.252537 |
| b8_h32_kv8_s8192_bs128_fp16     |     8 |    8192 |        24 |     701.820 |   1.547867 |             37.148814 |           0.382674 |
| b8_h32_kv8_s16384_bs128_fp16    |     8 |   16384 |        24 |    1088.560 |   1.995893 |             47.901426 |           0.493318 |
| b8_h32_kv8_s32768_bs128_fp16    |     8 |   32768 |        24 |    1543.420 |   2.815370 |             67.568883 |           0.695780 |
| b8_h32_kv8_s65536_bs128_fp16    |     8 |   65536 |        24 |    2422.800 |   3.587006 |             86.088134 |           0.886425 |
| b8_h32_kv8_s131072_bs128_fp16   |     8 |  131072 |        24 |    4266.960 |   4.073437 |             97.762499 |           1.006602 |
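The columns above can be cross-checked with a little arithmetic. The sketch below uses two rows copied from the table and derives the bytes moved per launch (`jit_bandwidth_tb_s` times `jit_time_us`) and the ratio between the two TFLOPS columns. That ratio comes out at about 24 in these rows, which matches `block_dim`; whether the benchmark script actually defines the normalized figure that way is an assumption, and the helper name `implied_bytes` is hypothetical.

```python
# Rows copied from the benchmark table:
# (shape, jit_time_us, jit_tflops, jit_tflops_normalized, jit_bandwidth_tb_s)
ROWS = [
    ("b1_h32_kv8_s131072_bs128_fp16", 1018.260, 2.133688, 51.208518, 0.527264),
    ("b8_h32_kv8_s131072_bs128_fp16", 4266.960, 4.073437, 97.762499, 1.006602),
]


def implied_bytes(time_us: float, bandwidth_tb_s: float) -> float:
    """Bytes moved in one launch, recovered from bandwidth (TB/s) and time (us)."""
    return bandwidth_tb_s * 1e12 * time_us * 1e-6


for shape, time_us, tflops, tflops_norm, bw in ROWS:
    ratio = tflops_norm / tflops  # observed to be close to block_dim (24)
    megabytes = implied_bytes(time_us, bw) / 1e6
    print(f"{shape}: normalized/raw = {ratio:.3f}, implied traffic = {megabytes:.1f} MB")
```

For the batch-1, 131072-token row this gives roughly 537 MB of traffic per launch.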


gemini-code-assist

Code Review

This pull request introduces a high-performance Paged Attention Torch JIT demo, including Python JIT utilities, benchmarking scripts, tiling logic, and the underlying C++ PTO-ISA kernel implementation. The review feedback highlights two important improvements in jit_util_pa.py: first, avoiding host-device synchronization overhead by caching the tiling tensor using its Python object ID instead of converting it to a CPU tuple on every launch; second, resolving the kernel source path to an absolute path to prevent JIT compilation failures when running the scripts from different working directories.


gemini-code-assist · 2026-06-19T12:02:33Z

+        tiling_cpu = tuple(int(x) for x in tiling.detach().cpu().tolist())
+        sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
+        key = (str(device), sizes_key, tiling_cpu)
+        if workspace.get("key") == key:
+            return
+        workspace.clear()
+        workspace["key"] = key


Converting the tiling tensor to a CPU list/tuple on every single kernel launch introduces significant overhead and forces host-device synchronization (via .cpu()), which severely degrades performance and invalidates benchmark timing. Since the tiling tensor object is typically reused across iterations for the same shape, we can use the Python object ID id(tiling) as part of the cache key and store a reference to tiling in the workspace to keep it alive. This completely avoids the NPU-to-CPU transfer and list conversion overhead during the benchmark loop.

Suggested change

-        tiling_cpu = tuple(int(x) for x in tiling.detach().cpu().tolist())
-        sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
-        key = (str(device), sizes_key, tiling_cpu)
-        if workspace.get("key") == key:
-            return
-        workspace.clear()
-        workspace["key"] = key
+        sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
+        key = (str(device), sizes_key, id(tiling))
+        if workspace.get("key") == key:
+            return
+        workspace.clear()
+        workspace["key"] = key
+        workspace["tiling_ref"] = tiling
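A minimal, device-free sketch of the identity-keyed cache described in this comment. The function name `refresh_workspace`, its boolean return value, and the `FakeTiling` stand-in are hypothetical and exist only so the behaviour can be run without an NPU; the key construction mirrors the suggestion.

```python
def refresh_workspace(workspace: dict, device, workspace_sizes: dict, tiling) -> bool:
    """Return True if the workspace was (re)built, False on a cache hit."""
    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
    # id(tiling) is read on the host only, so no device-to-host copy is triggered.
    key = (str(device), sizes_key, id(tiling))
    if workspace.get("key") == key:
        return False
    workspace.clear()
    workspace["key"] = key
    # Holding a reference keeps the object alive, so its id cannot be reused
    # by a different object while this cache entry exists.
    workspace["tiling_ref"] = tiling
    return True


class FakeTiling:
    """Stand-in for the tiling tensor; only object identity matters here."""


ws = {}
tiling = FakeTiling()
print(refresh_workspace(ws, "npu:0", {"scratch": 1024}, tiling))  # first call builds
print(refresh_workspace(ws, "npu:0", {"scratch": 1024}, tiling))  # same object: hit
```

One trade-off of keying on identity: if the same tiling object is modified in place, the cache will not notice, so this pattern suits callers that build a fresh tiling tensor whenever the shape changes.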

gemini-code-assist · 2026-06-19T12:02:33Z

+
+
+def jit_compile_paged_attention(verbose: bool = False, clean_up: bool = True, kernel_cpp: str = "pa_kernel.cpp"):
+    lib_path = compile_paged_attention(kernel_cpp, verbose=verbose)


If the script is executed from a directory other than demos/torch_jit/paged_attention_highperf/ (for example, from the repository root), the relative path "pa_kernel.cpp" will not be found, causing the JIT compilation to fail. Resolving kernel_cpp to an absolute path relative to the script's directory when a relative path is provided makes the JIT compilation robust and runnable from any working directory.

    kernel_path = Path(kernel_cpp)
    if not kernel_path.is_absolute():
        kernel_path = Path(__file__).parent / kernel_path
    lib_path = compile_paged_attention(str(kernel_path), verbose=verbose)
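The path handling in this suggestion can be tried in isolation. The sketch below factors it into a hypothetical helper, `resolve_kernel_path`, that takes the base directory as a parameter so it can be exercised without the real module; inside `jit_util_pa.py` the base would presumably be `Path(__file__).parent`, as in the suggestion.

```python
from pathlib import Path


def resolve_kernel_path(kernel_cpp: str, base_dir: Path) -> Path:
    """Anchor a relative kernel path at base_dir; leave absolute paths untouched."""
    kernel_path = Path(kernel_cpp)
    if not kernel_path.is_absolute():
        kernel_path = base_dir / kernel_path
    return kernel_path


# Hypothetical directories, used only to illustrate the two branches.
print(resolve_kernel_path("pa_kernel.cpp", Path("/opt/demo")))
print(resolve_kernel_path("/abs/other_kernel.cpp", Path("/opt/demo")))
```

With this in place the result no longer depends on the current working directory, which is the failure mode the comment describes.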

first version of CCE paged attention highperf with torch jit benchmarking scripts

aca4ea1

gemini-code-assist Bot reviewed Jun 19, 2026


MirkoDeVita98 marked this pull request as draft June 19, 2026 12:29

mirkodevita added 7 commits June 24, 2026 13:16

working port of paged attention highperf

fe12224

faster working version

2e4d918

changed sync to use set flag and wait flags

61ccad0

400 GB/s version

0c5edc1

730 GB/s version

de960f2

vectorize combine scale part

c0c3263

900GB/s version

baa25c7

MirkoDeVita98 wants to merge 8 commits into hw-native-sys:main from MirkoDeVita98:paged_attention
