Parallel Tile Operation (PTO) is a virtual ISA for tile-oriented programming defined by Ascend CANN. This repository provides PTO Tile instruction implementations, examples, tests, and documentation to help developers migrate and optimize operators more smoothly across different Ascend generations.
- 2025-12-27: PTO Tile Library is officially open-sourced.
- 2026-01-30: Added reduction instructions and MX instructions.
- 2026-02-28: Added convolution instructions, quantization instructions, and inter-kernel communication instructions.
- 2026-03-30: Added support for Ascend A5, asynchronous communication instructions, and CostModel performance simulation.
- 2026-04-02: Improved the local engineering workflow with pre-commit checks, documentation build verification, and CPU-SIM validation updates.
The PTO ISA is built on Ascend's underlying hardware and software abstractions and defines more than 90 standard tile instructions. It uses a higher-level tile programming model to bridge implementation differences across generations. Its goal is not to hide low-level capabilities, but to raise the abstraction level while preserving room for performance tuning.
- Unified cross-generation tile abstraction: reduces migration cost across different Ascend generations.
- Balances portability and performance: guarantees correct behavior under fixed tile shapes while preserving tuning dimensions such as tile size, tile shape, and instruction ordering.
- Designed for frameworks, operators, and toolchains: serves as a common interface for upper-layer frameworks, operator implementations, and compiler toolchains.
- Continuously extensible: defines 90+ standard operations today, with ongoing implementation and ecosystem integration.
In addition to compute and data-movement instructions, PTO ISA also provides a communication extension instruction set for inter-NPU data transfer and synchronization, covering point-to-point communication, signal synchronization, and collective communication.
These communication primitives follow the same tile-level abstraction and cross-platform design as the compute instructions, and can drive multiple data-movement hardware engines on Ascend to help users build deeply fused compute-communication kernels. For the communication ISA entry, see docs/isa/comm/README.md.
At present, PTO instructions have been integrated into the following frameworks:
- PyPTO
- TileLang Ascend
- Support for additional languages and frontends is under active development
- Unified Tile ISA abstraction: uses standard PTO instructions to describe tile-level computation and dataflow.
- Balances cross-generation migration and performance tuning: improves portability while retaining sufficient low-level control.
- Auto / Manual dual-mode workflow: validate logic quickly first, then refine the implementation step by step. Auto Mode is currently available in CPU simulation.
- CPU Simulator support: enables functional verification and development debugging on CPU.
- Covers key programming elements: supports tile shape, tile mask, event synchronization, fixed-function units, and pipeline modeling.
- Complete docs, tests, and examples: includes ISA docs, developer docs, test scripts, and performance case studies.
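As a rough mental model for the tile shape and tile mask items above, the following is a minimal conceptual sketch in plain Python. It is not the PTO API: `TILE_ROWS`, `TILE_COLS`, `masked_tile_add`, `valid_rows`, and `valid_cols` are hypothetical names, and leaving elements outside the valid region at zero is an assumption of this sketch rather than documented PTO behavior.

```python
TILE_ROWS, TILE_COLS = 4, 8  # static tile shape, fixed for every call (hypothetical values)


def masked_tile_add(a, b, valid_rows, valid_cols):
    """Add two tiles element-wise inside the valid region only.

    The tile shape never changes; only the valid region (the "mask") does.
    Elements outside the valid region stay at 0.0 (an assumption of this sketch).
    """
    if not (0 <= valid_rows <= TILE_ROWS and 0 <= valid_cols <= TILE_COLS):
        raise ValueError("valid region must fit inside the static tile shape")
    out = [[0.0] * TILE_COLS for _ in range(TILE_ROWS)]
    for r in range(valid_rows):
        for c in range(valid_cols):
            out[r][c] = a[r][c] + b[r][c]
    return out


# A 3x5 edge block of a larger matrix, carried in a full 4x8 tile.
a = [[1.0] * TILE_COLS for _ in range(TILE_ROWS)]
b = [[2.0] * TILE_COLS for _ in range(TILE_ROWS)]
result = masked_tile_add(a, b, valid_rows=3, valid_cols=5)
```

The point of the sketch is only the split between a compile-time shape and a run-time valid region; the real instruction semantics are described in the Tile Programming Model documentation.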
PTO Tile Lib is mainly intended for the following developers:
- Framework or compiler backend developers who interface directly with Ascend hardware
- High-performance operator developers who need to migrate and reuse implementations across platforms
- Performance engineers who need explicit control over tiles, buffers, and pipelines
- CPU path: requires Python, CMake, and a C++20-capable compiler; suitable for quick cross-platform validation.
- NPU path: requires Linux and the Ascend CANN toolkit; suitable for running on Ascend hardware or simulator.
- For detailed environment setup instructions, see the Getting Started Guide
# CPU Simulator (recommended first step)
python3 tests/run_cpu.py --clean --verbose
# Run GEMM demo
python3 tests/run_cpu.py --demo gemm --verbose
# Run Flash Attention demo
python3 tests/run_cpu.py --demo flash_attn --verbose
# Run a single ST testcase
python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64_64x64
# One-click build and run recommended tests
./build.sh --run_all --a3 --sim
# Build a wheel package (artifacts will be placed under dist/)
python3 -m build --wheel
For more complete build, test, and scripting details, see the Getting Started Guide and Test Guide.
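The CPU-simulator commands above can also be driven from a small script, for example in a local smoke test. The sketch below only assembles the three `tests/run_cpu.py` invocations shown above and prints them by default. The helper names `cpu_sim_command` and `run_all` are hypothetical, and flag combinations other than those shown above are not assumed to be supported.

```python
import shlex
import subprocess

RUN_CPU = "tests/run_cpu.py"  # script path as it appears in the commands above


def cpu_sim_command(demo=None):
    """Return the argv list for one CPU-simulator run.

    demo=None reproduces the clean full run; otherwise the named demo is run.
    """
    argv = ["python3", RUN_CPU]
    if demo is None:
        argv.append("--clean")
    else:
        argv += ["--demo", demo]
    argv.append("--verbose")
    return argv


def run_all(demos=("gemm", "flash_attn"), dry_run=True):
    """Print (and optionally execute) the full run followed by each demo."""
    commands = [cpu_sim_command()] + [cpu_sim_command(d) for d in demos]
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing command
    return commands


if __name__ == "__main__":
    run_all()  # dry run: prints the commands without executing them
```

Call `run_all(dry_run=False)` from the repository root to actually execute the commands.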
- Auto Mode Add example: a good first example for understanding how PTO instructions are organized
- GEMM performance example: useful for understanding tile-level operator optimization
- Flash Attention example: useful for understanding complex operators and performance tuning
- Start from simple examples to understand how PTO instructions organize tile-level computation and data movement.
- Verify functionality and correctness in CPU simulation to build intuition about instruction semantics and results.
- Port the code to Ascend hardware to validate correctness and collect performance data. See the msprof tool.
- Identify performance bottlenecks (CUBE Bound / MTE Bound / Vector Bound), then optimize and tune accordingly. See Performance Optimization.
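The last step above can be pictured as picking the pipeline that dominates the profile. The following is a minimal sketch under a simplifying assumption: each pipeline is summarized by a single busy time, and the largest one names the bottleneck. The function name, the dictionary keys, and the sample numbers are hypothetical, and real profiler output needs more careful interpretation than this.

```python
LABELS = {
    "cube": "CUBE Bound",
    "mte": "MTE Bound",
    "vector": "Vector Bound",
}


def classify_bottleneck(busy_us):
    """Return the bound category of the pipeline with the largest busy time.

    busy_us maps a pipeline key ("cube", "mte", "vector") to its busy time
    in microseconds. This single-number-per-pipeline rule is an assumption
    of the sketch, not a rule taken from the PTO documentation.
    """
    unknown = set(busy_us) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown pipeline keys: {sorted(unknown)}")
    if not busy_us:
        raise ValueError("busy_us must not be empty")
    dominant = max(busy_us, key=busy_us.get)
    return LABELS[dominant]


# Hypothetical profile: data movement dominates, so tuning should start there.
sample = {"cube": 10.0, "mte": 42.0, "vector": 7.5}
print(classify_bottleneck(sample))
```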
This repository also demonstrates how standard tile operations can be mapped to different pipeline implementations through template parameters:
- Tile Programming Model: understand static tile shapes, dynamic tile masks, and data organization
- Events and Synchronization: understand set/wait flag and pipeline synchronization
- General Conventions: understand general PTO programming rules and constraints
- PTO Instruction List: browse the standard operations defined by the PTO ISA
- ISA Overview: entry point and navigation for PTO ISA documentation
- PTO Instruction List: browse PTO standard operations by category
- Tile Programming Model: understand tile shapes, masks, and the programming model
- Events and Synchronization: understand event recording, waiting, and synchronization
- General Conventions: review naming, constraints, and common rules
- Developer Documentation Index: browse documentation for extending PTO Tile Lib
- Performance Optimization: review performance analysis and tuning guidance
- Documentation Build Guide: learn how to build the MkDocs site locally
- Reference implementation: kernels/manual/a2a3/gemm_performance/
- Detailed analysis and tuning notes: High-Performance GEMM Operator Example
- Operator implementation and tuning notes: A2/A3 version, A5 version
- A5 build guide, with A5 performance numbers still pending: Flash Attention Performance Kernel (A5)
- S0: query sequence length (number of rows in Q/O)
- S1: key/value sequence length (number of rows in K/V)
Ascend 910B2 multi-core comparison, using torch_npu as the baseline:
| Sequence length | PTO time (us) | torch_npu time (us) | PTO TFLOPS | torch_npu TFLOPS | PTO speedup |
|---|---|---|---|---|---|
| 1024 | 20.960 | 58.461 | 25.61 | 9.18 | 2.79x |
| 2048 | 32.461 | 70.801 | 66.16 | 30.33 | 2.18x |
| 4096 | 88.902 | 118.302 | 96.62 | 72.61 | 1.33x |
| 8192 | 292.626 | 353.147 | 117.42 | 97.30 | 1.21x |
| 16384 | 909.058 | 1118.462 | 151.19 | 122.88 | 1.23x |
| 32768 | 3262.645 | 3646.173 | 168.50 | 150.78 | 1.12x |
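The derived columns in the table can be recomputed from the two time columns. The speedup is the torch_npu time divided by the PTO time. The TFLOPS columns are consistent with a FLOP count of 4 x S0 x S1 x D with S0 = S1 = sequence length and D = 128. That head dimension is not stated in this section; it is an inference, used here because it reproduces the published figures to two decimals.

```python
# (sequence length, PTO time in us, torch_npu time in us), copied from the table above
ROWS = [
    (1024, 20.960, 58.461),
    (2048, 32.461, 70.801),
    (4096, 88.902, 118.302),
    (8192, 292.626, 353.147),
    (16384, 909.058, 1118.462),
    (32768, 3262.645, 3646.173),
]

HEAD_DIM = 128  # assumption: inferred from the TFLOPS columns, not stated in the text


def speedup(pto_us, baseline_us):
    """Baseline time divided by PTO time."""
    return baseline_us / pto_us


def tflops(seq_len, time_us, head_dim=HEAD_DIM):
    """Attention throughput, assuming S0 == S1 == seq_len.

    QK^T and PV are two matmuls, each costing 2 FLOPs per multiply-accumulate.
    """
    flops = 4 * seq_len * seq_len * head_dim
    return flops / (time_us * 1e-6) / 1e12


for seq_len, pto_us, torch_us in ROWS:
    print(
        f"{seq_len:>6}  PTO {tflops(seq_len, pto_us):7.2f} TFLOPS  "
        f"torch_npu {tflops(seq_len, torch_us):7.2f} TFLOPS  "
        f"speedup {speedup(pto_us, torch_us):.2f}x"
    )
```

The speedup shrinks from 2.79x to 1.12x as the sequence length grows, while absolute PTO throughput keeps rising.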
- Reference implementation: kernels/manual/a2a3/tget_bandwidth/
- Detailed analysis and build/run guide: TGET / TGET_ASYNC Bandwidth Comparison Example
This example measures point-to-point remote-read bandwidth on Ascend A2/A3 and compares TGET (synchronous, via UB staging) with TGET_ASYNC (asynchronous, direct transfer through the DMA engine).
- Reference implementation: kernels/manual/a2a3/gemm_ar/
- Detailed analysis and tuning notes: High-Performance GEMM AllReduce Fused Operator Example
This example shows how PTO communication primitives can be fused with compute kernels to overlap GEMM and AllReduce within one operator pipeline.
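To see why overlapping matters, consider an idealized two-stage pipeline in which the output is split into chunks and the communication of one chunk runs while the next chunk is being computed. The sketch below is a generic timing model, not the scheduling used in kernels/manual/a2a3/gemm_ar/. The function names and the sample timings are hypothetical, and the model assumes uniform per-chunk costs with no contention between compute and communication.

```python
def serial_time(n_chunks, compute_us, comm_us):
    """Total time when each chunk is computed and then communicated, with no overlap."""
    return n_chunks * (compute_us + comm_us)


def overlapped_time(n_chunks, compute_us, comm_us):
    """Total time for an ideal two-stage pipeline.

    The first compute and the last communication cannot be hidden; in between,
    each step is bounded by the slower of the two stages.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    return compute_us + (n_chunks - 1) * max(compute_us, comm_us) + comm_us


# Hypothetical per-chunk costs, in microseconds.
n, compute, comm = 4, 10.0, 6.0
print(serial_time(n, compute, comm), overlapped_time(n, compute, comm))
```

In this model the benefit is largest when compute and communication costs per chunk are similar, because then almost all of the cheaper stage is hidden behind the other.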
- Ascend A2 (Ascend 910B)
- Ascend A3 (Ascend 910C)
- Ascend A5 (Ascend 950)
- CPU (x86_64 / AArch64)
For more details, see include/README.md.
Planned future features:
| Feature | Description | Scope | Progress / target completion |
|---|---|---|---|
| PTO Auto Mode | BiSheng compiler support for automatic tile buffer allocation and synchronization insertion. | Compiler / toolchain | Ongoing |
| PTO Tile Fusion | BiSheng compiler support for automatic tile operation fusion. | Compiler / toolchain | Ongoing |
| PTO-AS | Bytecode support for PTO ISA. | Compiler / toolchain | Ongoing |
| Convolution extension | PTO ISA support for convolution kernels. | ISA extension | Ongoing |
| Collective communication extension | Add asynchronous communication instructions for CCU and RoCE, and add the TPREFETCH (AIV direct-drive) communication instruction. | Communication ISA extension | 2026 Q2 |
| System scheduling extension | PTO ISA support for SPMD/MPMD programming schedules. | ISA extension | Planned |
| Micro-instructions | Support expressing high-performance operators through micro-instructions, together with a foundational high-performance micro-instruction library. | ISA extension / operator development | 2026 Q2 |
| Base instructions | Further optimize A5 instruction performance, add Pooling-related base instructions, and enhance convolution, quantization, and Fixpipe instruction capabilities. | ISA extension | 2026 Q2 |
| CostModel | Support CostModel performance simulation for A5 instructions. | Toolchain / performance modeling | 2026 Q2 |
| CPU-SIM | Keep CPU-SIM in sync with instruction enhancements. | CPU simulation | 2026 Q2 |
Key directories are listed below:
├── include/                # Public PTO headers and interfaces
│   └── pto/                # Common types, ISA interfaces, and CPU/NPU implementations
├── kernels/                # Kernels and operator implementations
│   ├── manual/             # Hand-optimized implementations and performance examples
│   └── custom/             # Custom operator examples
├── docs/                   # ISA, programming model, getting started, and doc site sources
│   ├── isa/                # Instruction references and category indexes
│   ├── coding/             # Developer and performance optimization docs
│   ├── assembly/           # PTO-AS assembly syntax and specification
│   └── mkdocs/             # MkDocs config and source files
├── demos/                  # Auto Mode, baseline, and torch_jit examples
├── tests/                  # CPU / NPU tests, scripts, and test entry points
│   ├── cpu/                # CPU simulation tests
│   ├── npu/                # SoC-specific NPU tests
│   └── script/             # Test build and execution scripts
├── scripts/                # Build, install, and release scripts
├── cmake/                  # Shared CMake configuration and packaging logic
├── build.sh                # One-click build and run entry script
└── CMakeLists.txt          # Top-level CMake configuration
- Contributing Guide: contribution workflow and development guidelines
- Security and Vulnerability Disclosure: process for reporting security issues
- Release Notes: version updates and release history
- License: CANN Open Software License Agreement Version 2.0
- PyPTO: an upper-layer programming framework in the PTO ecosystem
- PTOAS: PTO assembler and compiler backend for PTO workflows
- pto-dsl: Pythonic frontend and JIT workflow exploration for PTO
- FAQ: common issues and solutions
- Issue reporting: submit problems through repository Issues
- Feature requests: share suggestions through Issues or discussion channels
- Code contributions: contribute through Pull Requests
