Translation and benchmarking of CCE paged attention highperf #175

Draft
MirkoDeVita98 wants to merge 8 commits into hw-native-sys:main from MirkoDeVita98:paged_attention

@MirkoDeVita98 MirkoDeVita98 commented Jun 19, 2026

| shape                           | batch | seq_len | block_dim | jit_time_us | jit_tflops | jit_tflops_normalized | jit_bandwidth_tb_s |
|---------------------------------|------:|--------:|----------:|------------:|-----------:|----------------------:|-------------------:|
| b1_h32_kv8_s128_bs128_fp16      |     1 |     128 |        24 |    2806.880 |   0.000756 |              0.018141 |           0.000193 |
| b1_h32_kv8_s512_bs128_fp16      |     1 |     512 |        24 |     472.160 |   0.017975 |              0.431388 |           0.004476 |
| b1_h32_kv8_s4096_bs128_fp16     |     1 |    4096 |        24 |     442.020 |   0.153602 |              3.686452 |           0.037993 |
| b1_h32_kv8_s8192_bs128_fp16     |     1 |    8192 |        24 |     425.660 |   0.319012 |              7.656281 |           0.078868 |
| b1_h32_kv8_s16384_bs128_fp16    |     1 |   16384 |        24 |     448.560 |   0.605451 |             14.530825 |           0.149647 |
| b1_h32_kv8_s32768_bs128_fp16    |     1 |   32768 |        24 |     466.000 |   1.165584 |             27.974025 |           0.288058 |
| b1_h32_kv8_s65536_bs128_fp16    |     1 |   65536 |        24 |     690.600 |   1.573016 |             37.752379 |           0.388726 |
| b1_h32_kv8_s131072_bs128_fp16   |     1 |  131072 |        24 |    1018.260 |   2.133688 |             51.208518 |           0.527264 |
| b2_h32_kv8_s128_bs128_fp16      |     2 |     128 |        24 |     446.460 |   0.009504 |              0.228105 |           0.002422 |
| b2_h32_kv8_s512_bs128_fp16      |     2 |     512 |        24 |     428.880 |   0.039577 |              0.949843 |           0.009856 |
| b2_h32_kv8_s4096_bs128_fp16     |     2 |    4096 |        24 |     434.700 |   0.312377 |              7.497058 |           0.077266 |
| b2_h32_kv8_s8192_bs128_fp16     |     2 |    8192 |        24 |     440.580 |   0.616417 |             14.794011 |           0.152395 |
| b2_h32_kv8_s16384_bs128_fp16    |     2 |   16384 |        24 |     473.400 |   1.147364 |             27.536742 |           0.283590 |
| b2_h32_kv8_s32768_bs128_fp16    |     2 |   32768 |        24 |     672.740 |   1.614776 |             38.754632 |           0.399070 |
| b2_h32_kv8_s65536_bs128_fp16    |     2 |   65536 |        24 |    1032.760 |   2.103731 |             50.489546 |           0.519877 |
| b2_h32_kv8_s131072_bs128_fp16   |     2 |  131072 |        24 |    1481.100 |   2.933832 |             70.411974 |           0.724990 |
| b4_h32_kv8_s128_bs128_fp16      |     4 |     128 |        24 |     456.700 |   0.018583 |              0.445981 |           0.004736 |
| b4_h32_kv8_s512_bs128_fp16      |     4 |     512 |        24 |     445.740 |   0.076160 |              1.827831 |           0.018967 |
| b4_h32_kv8_s4096_bs128_fp16     |     4 |    4096 |        24 |     490.420 |   0.553772 |             13.290531 |           0.136974 |
| b4_h32_kv8_s8192_bs128_fp16     |     4 |    8192 |        24 |     518.540 |   1.047483 |             25.139604 |           0.258966 |
| b4_h32_kv8_s16384_bs128_fp16    |     4 |   16384 |        24 |     683.420 |   1.589542 |             38.148997 |           0.392881 |
| b4_h32_kv8_s32768_bs128_fp16    |     4 |   32768 |        24 |    1462.180 |   1.485897 |             35.661533 |           0.367219 |
| b4_h32_kv8_s65536_bs128_fp16    |     4 |   65536 |        24 |    1510.700 |   2.876348 |             69.032349 |           0.710807 |
| b4_h32_kv8_s131072_bs128_fp16   |     4 |  131072 |        24 |    2490.960 |   3.488855 |             83.732518 |           0.862144 |
| b8_h32_kv8_s128_bs128_fp16      |     8 |     128 |        24 |     500.720 |   0.033898 |              0.813547 |           0.008638 |
| b8_h32_kv8_s512_bs128_fp16      |     8 |     512 |        24 |     464.360 |   0.146212 |              3.509077 |           0.036412 |
| b8_h32_kv8_s4096_bs128_fp16     |     8 |    4096 |        24 |     532.000 |   1.020981 |             24.503542 |           0.252537 |
| b8_h32_kv8_s8192_bs128_fp16     |     8 |    8192 |        24 |     701.820 |   1.547867 |             37.148814 |           0.382674 |
| b8_h32_kv8_s16384_bs128_fp16    |     8 |   16384 |        24 |    1088.560 |   1.995893 |             47.901426 |           0.493318 |
| b8_h32_kv8_s32768_bs128_fp16    |     8 |   32768 |        24 |    1543.420 |   2.815370 |             67.568883 |           0.695780 |
| b8_h32_kv8_s65536_bs128_fp16    |     8 |   65536 |        24 |    2422.800 |   3.587006 |             86.088134 |           0.886425 |
| b8_h32_kv8_s131072_bs128_fp16   |     8 |  131072 |        24 |    4266.960 |   4.073437 |             97.762499 |           1.006602 |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a high-performance Paged Attention Torch JIT demo, including Python JIT utilities, benchmarking scripts, tiling logic, and the underlying C++ PTO-ISA kernel implementation. The review feedback highlights two important improvements in jit_util_pa.py: first, avoiding host-device synchronization overhead by caching the tiling tensor using its Python object ID instead of converting it to a CPU tuple on every launch; second, resolving the kernel source path to an absolute path to prevent JIT compilation failures when running the scripts from different working directories.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +88 to +94
    tiling_cpu = tuple(int(x) for x in tiling.detach().cpu().tolist())
    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
    key = (str(device), sizes_key, tiling_cpu)
    if workspace.get("key") == key:
        return
    workspace.clear()
    workspace["key"] = key

high

Converting the tiling tensor to a CPU list/tuple on every single kernel launch introduces significant overhead and forces host-device synchronization (via .cpu()), which severely degrades performance and invalidates benchmark timing. Since the tiling tensor object is typically reused across iterations for the same shape, we can use the Python object ID id(tiling) as part of the cache key and store a reference to tiling in the workspace to keep it alive. This completely avoids the NPU-to-CPU transfer and list conversion overhead during the benchmark loop.

Suggested change
-    tiling_cpu = tuple(int(x) for x in tiling.detach().cpu().tolist())
-    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
-    key = (str(device), sizes_key, tiling_cpu)
-    if workspace.get("key") == key:
-        return
-    workspace.clear()
-    workspace["key"] = key
+    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
+    key = (str(device), sizes_key, id(tiling))
+    if workspace.get("key") == key:
+        return
+    workspace.clear()
+    workspace["key"] = key
+    workspace["tiling_ref"] = tiling
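The identity-keyed cache suggested above can be tried out in isolation. The following is a minimal sketch, not the PR's actual code: `ensure_workspace` and `FakeTiling` are hypothetical names, and a plain Python object stands in for the tiling tensor so the example runs without an NPU. It also shows one trade-off of keying on `id(tiling)`: an in-place change to the same object does not invalidate the cache.

```python
from typing import Any, Dict


def ensure_workspace(workspace: Dict[str, Any], device: str,
                     workspace_sizes: Dict[str, int], tiling: Any) -> bool:
    """Rebuild the workspace only when the cache key changes.

    Hypothetical helper: returns True on a rebuild, False on a cache hit.
    """
    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
    # Identity-based key: the tiling contents are never copied to the host.
    key = (str(device), sizes_key, id(tiling))
    if workspace.get("key") == key:
        return False
    workspace.clear()
    workspace["key"] = key
    # Keep the object alive so its id cannot be reused by another object.
    workspace["tiling_ref"] = tiling
    return True


class FakeTiling:
    """Stand-in for the tiling tensor (assumption, for illustration only)."""

    def __init__(self, values):
        self.values = list(values)


ws: Dict[str, Any] = {}
sizes = {"scratch": 1024}
t = FakeTiling([1, 2, 3])

first = ensure_workspace(ws, "npu:0", sizes, t)    # first call: rebuild
second = ensure_workspace(ws, "npu:0", sizes, t)   # same object: cache hit
t.values[0] = 99                                   # in-place change, same id
third = ensure_workspace(ws, "npu:0", sizes, t)    # still treated as a hit
fourth = ensure_workspace(ws, "npu:0", sizes, FakeTiling([1, 2, 3]))  # new object: rebuild
```

If the tiling tensor may be updated in place between launches, the caller would need some other invalidation signal (for example, clearing the workspace explicitly), since the identity key alone cannot detect that.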

def jit_compile_paged_attention(verbose: bool = False, clean_up: bool = True, kernel_cpp: str = "pa_kernel.cpp"):
    lib_path = compile_paged_attention(kernel_cpp, verbose=verbose)

medium

If the script is executed from a directory other than demos/torch_jit/paged_attention_highperf/ (for example, from the repository root), the relative path "pa_kernel.cpp" will not be found, causing the JIT compilation to fail. Resolving kernel_cpp to an absolute path relative to the script's directory when a relative path is provided makes the JIT compilation robust and runnable from any working directory.

    kernel_path = Path(kernel_cpp)
    if not kernel_path.is_absolute():
        kernel_path = Path(__file__).parent / kernel_path
    lib_path = compile_paged_attention(str(kernel_path), verbose=verbose)
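The path handling in this suggestion can be checked on its own. Below is a small sketch under stated assumptions: `resolve_kernel_path` is a hypothetical helper, and the base directory is passed in explicitly (in the module itself it would presumably be `Path(__file__).parent`, as in the suggestion). `PurePosixPath` is used only so the demonstration behaves the same on any platform; the `/repo` and `/tmp` paths are made up for illustration.

```python
from pathlib import PurePosixPath


def resolve_kernel_path(kernel_cpp: str, base_dir: PurePosixPath) -> PurePosixPath:
    """Return kernel_cpp unchanged if absolute, else anchored at base_dir."""
    kernel_path = PurePosixPath(kernel_cpp)
    if not kernel_path.is_absolute():
        # Relative names are resolved against the script's directory,
        # not the current working directory.
        kernel_path = base_dir / kernel_path
    return kernel_path


# Illustrative base directory; the repository prefix is an assumption.
base = PurePosixPath("/repo/demos/torch_jit/paged_attention_highperf")

relative = resolve_kernel_path("pa_kernel.cpp", base)
absolute = resolve_kernel_path("/tmp/custom_kernel.cpp", base)

print(relative)
print(absolute)
```

An absolute `kernel_cpp` passes through untouched, so callers that already supply a full path keep their current behavior.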

@MirkoDeVita98 MirkoDeVita98 marked this pull request as draft June 19, 2026 12:29