Translation and benchmarking of CCE paged attention highperf #175
MirkoDeVita98 wants to merge 8 commits into
Code Review
This pull request introduces a high-performance Paged Attention Torch JIT demo, including Python JIT utilities, benchmarking scripts, tiling logic, and the underlying C++ PTO-ISA kernel implementation. The review feedback highlights two important improvements in jit_util_pa.py: first, avoiding host-device synchronization overhead by caching the tiling tensor using its Python object ID instead of converting it to a CPU tuple on every launch; second, resolving the kernel source path to an absolute path to prevent JIT compilation failures when running the scripts from different working directories.
    tiling_cpu = tuple(int(x) for x in tiling.detach().cpu().tolist())
    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
    key = (str(device), sizes_key, tiling_cpu)
    if workspace.get("key") == key:
        return
    workspace.clear()
    workspace["key"] = key
Converting the tiling tensor to a CPU tuple on every kernel launch forces a host-device synchronization (via .cpu()) and adds list-conversion overhead, which degrades performance and distorts benchmark timing. Since the tiling tensor object is typically reused across iterations for the same shape, the cache key can use the Python object ID id(tiling) instead, with a reference to tiling stored in the workspace to keep the object alive. This avoids the NPU-to-CPU transfer and the list conversion inside the benchmark loop.
Suggested change
-    tiling_cpu = tuple(int(x) for x in tiling.detach().cpu().tolist())
-    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
-    key = (str(device), sizes_key, tiling_cpu)
-    if workspace.get("key") == key:
-        return
-    workspace.clear()
-    workspace["key"] = key
+    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
+    key = (str(device), sizes_key, id(tiling))
+    if workspace.get("key") == key:
+        return
+    workspace.clear()
+    workspace["key"] = key
+    workspace["tiling_ref"] = tiling
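The caching idea in this suggestion can be tried without an NPU. The following is a minimal sketch, not the code from jit_util_pa.py: `FakeTiling` and `ensure_workspace` are hypothetical names introduced here, and `FakeTiling` only counts simulated host copies in place of a real device tensor.

```python
class FakeTiling:
    """Hypothetical stand-in for a device tensor; counts simulated host copies."""

    def __init__(self, values):
        self.values = list(values)
        self.host_copies = 0

    def tolist(self):
        # In the real code this is where .cpu().tolist() would force a sync.
        self.host_copies += 1
        return list(self.values)


def ensure_workspace(workspace, device, workspace_sizes, tiling):
    """Rebuild the workspace only if device, sizes, or the tiling object changed.

    Returns True when the workspace was rebuilt, False on a cache hit.
    """
    sizes_key = tuple(sorted((name, int(size)) for name, size in workspace_sizes.items()))
    key = (str(device), sizes_key, id(tiling))
    if workspace.get("key") == key:
        return False
    workspace.clear()
    workspace["key"] = key
    # Holding a reference keeps the object alive, so its id cannot be
    # recycled by a different object while this cache entry exists.
    workspace["tiling_ref"] = tiling
    return True


workspace = {}
sizes = {"scratch": 1024}  # hypothetical workspace size
tiling = FakeTiling([1, 2, 3])

# Three launches with the same tiling object: only the first rebuilds.
rebuilds = [ensure_workspace(workspace, "npu:0", sizes, tiling) for _ in range(3)]

# A different object with equal contents is treated as a new key.
other = FakeTiling([1, 2, 3])
rebuilt_for_new_object = ensure_workspace(workspace, "npu:0", sizes, other)
```

One trade-off of keying on `id(tiling)`: the cache no longer sees the tensor's contents, so an in-place update to the same tiling object would not trigger a rebuild, and an equal-valued but distinct object always does.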
def jit_compile_paged_attention(verbose: bool = False, clean_up: bool = True, kernel_cpp: str = "pa_kernel.cpp"):
    lib_path = compile_paged_attention(kernel_cpp, verbose=verbose)
If the script is executed from a directory other than demos/torch_jit/paged_attention_highperf/ (for example, from the repository root), the relative path "pa_kernel.cpp" will not be found, causing the JIT compilation to fail. Resolving kernel_cpp to an absolute path relative to the script's directory when a relative path is provided makes the JIT compilation robust and runnable from any working directory.
Suggested change
    kernel_path = Path(kernel_cpp)
    if not kernel_path.is_absolute():
        kernel_path = Path(__file__).parent / kernel_path
    lib_path = compile_paged_attention(str(kernel_path), verbose=verbose)
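The path handling above can be isolated into a small helper and checked on its own. This is a sketch under assumptions: `resolve_kernel_path` and its `base_dir` parameter are hypothetical names, and `base_dir` stands in for `Path(__file__).parent` from the suggestion so the example runs outside the repository.

```python
from pathlib import Path


def resolve_kernel_path(kernel_cpp, base_dir):
    """Return kernel_cpp unchanged if absolute, else anchored at base_dir."""
    kernel_path = Path(kernel_cpp)
    if not kernel_path.is_absolute():
        kernel_path = Path(base_dir) / kernel_path
    return kernel_path


# Hypothetical script directory; the real code would use Path(__file__).parent.
base = Path.cwd() / "demo_dir"

relative = resolve_kernel_path("pa_kernel.cpp", base)     # anchored at base
absolute = resolve_kernel_path(base / "other.cpp", base)  # left untouched
```

Anchoring at the script's directory rather than the current working directory is what makes the default `"pa_kernel.cpp"` resolvable no matter where the script is launched from, while callers who pass an absolute path keep full control.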