hw-native-sys/pto-isa

PTO Tile Library

Parallel Tile Operation (PTO) is a virtual ISA for tile-oriented programming defined by Ascend CANN. This repository provides PTO Tile instruction implementations, examples, tests, and documentation to help developers migrate and optimize operators more smoothly across different Ascend generations.

📰 News

  • 🎉 2025-12-27: PTO Tile Library is officially open-sourced.
  • ✨ 2026-01-30: Added reduction instructions and MX instructions.
  • 🚀 2026-02-28: Added convolution instructions, quantization instructions, and inter-kernel communication instructions.
  • 🔥 2026-03-30: Added support for Ascend A5, asynchronous communication instructions, and CostModel performance simulation.
  • 🛠️ 2026-04-02: Improved the local engineering workflow with pre-commit checks, documentation build verification, and CPU-SIM validation updates.

🎯 Project Positioning

The PTO ISA is built on Ascend's underlying hardware and software abstractions and defines more than 90 standard tile instructions. It uses a higher-level tile programming model to bridge implementation differences across generations. Its goal is not to hide low-level capabilities, but to raise the abstraction level while preserving room for performance tuning.

  • Unified cross-generation tile abstraction: reduces migration cost across different Ascend generations.
  • Balances portability and performance: guarantees correct behavior under fixed tile shapes while preserving tuning dimensions such as tile size, tile shape, and instruction ordering.
  • Designed for frameworks, operators, and toolchains: serves as a common interface for upper-layer frameworks, operator implementations, and compiler toolchains.
  • Continuously extensible: defines 90+ standard operations today, with ongoing implementation and ecosystem integration.
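To make the tile-size tuning dimension concrete, here is a minimal conceptual sketch in plain Python. It is not the PTO API: `tiled_add` and its parameters are hypothetical names used only for illustration. It walks a matrix tile by tile and checks that the result is the same for two different tile shapes, which is the property that lets tile shape be tuned without changing behavior.

```python
def tiled_add(a, b, tile_rows, tile_cols):
    """Element-wise add of two equally sized 2-D lists, processed tile by tile.

    Hypothetical helper for illustration only; real PTO kernels use the
    library's own tile types and instructions.
    """
    rows, cols = len(a), len(a[0])
    out = [[0.0] * cols for _ in range(rows)]
    # Outer loops pick the top-left corner of each tile.
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            # Clamp the tile at the matrix edge (a simple stand-in for a tile mask).
            r1 = min(r0 + tile_rows, rows)
            c1 = min(c0 + tile_cols, cols)
            # Inner loops are the per-tile computation.
            for r in range(r0, r1):
                for c in range(c0, c1):
                    out[r][c] = a[r][c] + b[r][c]
    return out


a = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
b = [[1.0] * 8 for _ in range(8)]

small = tiled_add(a, b, 2, 2)   # many small tiles
large = tiled_add(a, b, 4, 8)   # a few wide tiles
print(small == large)
```

Tile shape changes how the work is partitioned, and on real hardware how buffers and pipelines are used, but not the values produced.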

In addition to compute and data-movement instructions, PTO ISA also provides a communication extension instruction set for inter-NPU data transfer and synchronization, covering point-to-point communication, signal synchronization, and collective communication.

These communication primitives follow the same tile-level abstraction and cross-platform design as the compute instructions, and can drive multiple data-movement hardware engines on Ascend to help users build deeply fused compute-communication kernels. For the entry point to the communication ISA, see docs/isa/comm/README.md.

At present, PTO instructions have been integrated into the following frameworks:

✨ Core Features

  • Unified Tile ISA abstraction: uses standard PTO instructions to describe tile-level computation and dataflow.
  • Balances cross-generation migration and performance tuning: improves portability while retaining sufficient low-level control.
  • Auto / Manual dual-mode workflow: validate logic quickly first, then refine the implementation step by step. Auto Mode is currently available in CPU simulation.
  • CPU Simulator support: enables functional verification and development debugging on CPU.
  • Covers key programming elements: supports tile shape, tile mask, event synchronization, fixed-function units, and pipeline modeling.
  • Complete docs, tests, and examples: includes ISA docs, developer docs, test scripts, and performance case studies.

👥 Intended Audience

PTO Tile Lib is mainly intended for the following developers:

  • Framework or compiler backend developers who interface directly with Ascend hardware
  • High-performance operator developers who need to migrate and reuse implementations across platforms
  • Performance engineers who need explicit control over tiles, buffers, and pipelines

🚀 Quick Start

Environment Setup

  • CPU path: requires Python, CMake, and a C++20-capable compiler; suitable for quick cross-platform validation.
  • NPU path: requires Linux and the Ascend CANN toolkit; suitable for running on Ascend hardware or simulator.
  • For detailed environment setup instructions, see the Getting Started Guide.

Build and Run

# CPU Simulator (recommended first step)
python3 tests/run_cpu.py --clean --verbose

# Run GEMM demo
python3 tests/run_cpu.py --demo gemm --verbose

# Run Flash Attention demo
python3 tests/run_cpu.py --demo flash_attn --verbose

# Run a single ST testcase
python3 tests/script/run_st.py -r sim -v a3 -t tadd -g TADDTest.case_float_64x64_64x64

# One-click build and run recommended tests
./build.sh --run_all --a3 --sim

# Build a wheel package (artifacts will be placed under dist/)
python3 -m build --wheel

For more complete build, test, and scripting details, see the Getting Started Guide and Test Guide.
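If you want to script the CPU simulator runs above, the following is a minimal sketch. `run_cpu_command` is a hypothetical helper, not part of the repository. It only assembles the flags shown in the commands above (`--clean`, `--demo`, `--verbose`) and assumes they combine as shown there.

```python
import shlex


def run_cpu_command(demo=None, clean=False, verbose=True):
    """Build an argument list for tests/run_cpu.py (hypothetical helper)."""
    cmd = ["python3", "tests/run_cpu.py"]
    if clean:
        cmd.append("--clean")
    if demo is not None:
        cmd += ["--demo", demo]
    if verbose:
        cmd.append("--verbose")
    return cmd


# Print the three CPU simulator invocations from the README.
for cmd in (
    run_cpu_command(clean=True),
    run_cpu_command(demo="gemm"),
    run_cpu_command(demo="flash_attn"),
):
    print(shlex.join(cmd))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)` from the repository root.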

Recommended Examples

Recommended Learning Path

  1. Start with simple examples to understand how PTO instructions organize tile-level computation and data movement.
  2. Verify functionality and correctness in CPU simulation to build intuition about instruction semantics and results.
  3. Port the code to Ascend hardware to validate correctness and collect performance data. See the msprof tool.
  4. Identify performance bottlenecks (CUBE Bound / MTE Bound / Vector Bound) and start optimization and tuning. See Performance Optimization.
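Step 4 can be sketched as a simple comparison of per-pipeline busy time against kernel duration. This is an illustrative sketch only: `classify_bound` is a hypothetical helper, the input layout is an assumption rather than the msprof output format, and the numbers are placeholders, not measurements.

```python
def classify_bound(kernel_us, busy_us):
    """Return the busiest pipeline and each pipeline's utilization.

    kernel_us: total kernel duration in microseconds.
    busy_us:   dict mapping a pipeline name to its busy time in microseconds.
    Both inputs are assumed to come from profiling data you collected yourself.
    """
    if kernel_us <= 0:
        raise ValueError("kernel_us must be positive")
    ratios = {name: busy / kernel_us for name, busy in busy_us.items()}
    # The pipeline with the highest utilization is the likely bottleneck.
    bound = max(ratios, key=ratios.get)
    return bound, ratios


# Placeholder numbers for illustration only.
bound, ratios = classify_bound(100.0, {"CUBE": 35.0, "MTE": 80.0, "Vector": 20.0})
print(f"{bound} Bound, utilization {ratios[bound]:.0%}")
```

In this made-up case data movement dominates, so tuning would start with tile size and transfer overlap rather than with the compute instructions.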

This repository also demonstrates how standard tile operations can be mapped to different pipeline implementations through template parameters:

🗂️ Documentation Navigation

ISA and Programming Model

Development and Optimization

📊 Examples and Performance References

GEMM

GEMM performance reference (Ascend A3, 24 cores)

Flash Attention

Ascend 910B2 multi-core comparison, using torch_npu as the baseline:

| Sequence length | PTO time (µs) | torch_npu time (µs) | PTO TFLOPS | torch_npu TFLOPS | PTO speedup |
|---|---|---|---|---|---|
| 1024 | 20.960 | 58.461 | 25.61 | 9.18 | 2.79x |
| 2048 | 32.461 | 70.801 | 66.16 | 30.33 | 2.18x |
| 4096 | 88.902 | 118.302 | 96.62 | 72.61 | 1.33x |
| 8192 | 292.626 | 353.147 | 117.42 | 97.30 | 1.21x |
| 16384 | 909.058 | 1118.462 | 151.19 | 122.88 | 1.23x |
| 32768 | 3262.645 | 3646.173 | 168.50 | 150.78 | 1.12x |

[Figure: Flash Attention on Ascend 910B2, PTO vs. torch_npu]
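The speedup column in the Flash Attention comparison is the ratio of torch_npu time to PTO time. The short check below recomputes it from the timings listed above. The variable names are only for illustration.

```python
# (sequence length, PTO time in us, torch_npu time in us, reported speedup)
rows = [
    (1024, 20.960, 58.461, 2.79),
    (2048, 32.461, 70.801, 2.18),
    (4096, 88.902, 118.302, 1.33),
    (8192, 292.626, 353.147, 1.21),
    (16384, 909.058, 1118.462, 1.23),
    (32768, 3262.645, 3646.173, 1.12),
]

# Speedup = baseline time / PTO time, rounded to two decimals as in the table.
computed = [round(base_us / pto_us, 2) for _, pto_us, base_us, _ in rows]
reported = [speedup for *_, speedup in rows]

for (seq_len, *_), value in zip(rows, computed):
    print(f"{seq_len:>6}: {value:.2f}x")
print(computed == reported)
```

The advantage is largest at short sequence lengths and narrows as the sequence grows.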

Communication Instruction Bandwidth

This example measures point-to-point remote-read bandwidth on Ascend A2/A3 and compares TGET (synchronous, via UB staging) with TGET_ASYNC (asynchronous, direct transfer through the DMA engine).
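Bandwidth in a comparison like this is the number of bytes moved divided by the elapsed time. A minimal sketch follows, where `bandwidth_gb_per_s` is a hypothetical helper and the inputs are placeholder values, not measurements from this repository:

```python
def bandwidth_gb_per_s(num_bytes, elapsed_us):
    """Convert a transfer size and duration into GB/s (1 GB = 1e9 bytes)."""
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    # bytes per microsecond equals MB/s * 1e0 in units of 1e6 B/s; divide by 1e3 for GB/s.
    return num_bytes / elapsed_us / 1e3


# Placeholder example: 1,000,000 bytes moved in 100 microseconds.
print(bandwidth_gb_per_s(1_000_000, 100.0))
```

Applying the same formula to timings for both instructions at each message size gives a like-for-like bandwidth comparison.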

GEMM AllReduce Fused Compute-Communication

This example shows how PTO communication primitives can be fused with compute kernels to overlap GEMM and AllReduce within one operator pipeline.
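The benefit of overlapping can be estimated with a simple two-stage pipeline model. This is a sketch under stated assumptions, not the kernel's actual schedule: the work is split into chunks, computing chunk i+1 may proceed while chunk i is being communicated, and there is a single communication channel. The function names and timings are hypothetical.

```python
def serial_time(compute, comm):
    """Total time when every chunk is computed and then communicated in turn."""
    return sum(compute) + sum(comm)


def overlapped_time(compute, comm):
    """Makespan of a two-stage pipeline: compute stage feeding a comm stage."""
    compute_end = 0.0
    comm_end = 0.0
    for c, m in zip(compute, comm):
        # Compute for this chunk starts as soon as the previous compute finishes.
        compute_end += c
        # Communication needs both the chunk's result and a free channel.
        comm_end = max(compute_end, comm_end) + m
    return comm_end


# Placeholder per-chunk timings in arbitrary units.
compute = [2.0, 2.0, 2.0]
comm = [1.0, 1.0, 1.0]
print(serial_time(compute, comm), overlapped_time(compute, comm))
```

Under this model only the last chunk's communication stays exposed when compute dominates. When communication dominates, the first chunk's compute is the only part that cannot be hidden.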

🖥️ Platform Support

  • Ascend A2 (Ascend 910B)
  • Ascend A3 (Ascend 910C)
  • Ascend A5 (Ascend 950)
  • CPU (x86_64 / AArch64)

For more details, see include/README.md.

🛣️ Roadmap

Planned future features:

| Feature | Description | Scope | Progress / target completion |
|---|---|---|---|
| PTO Auto Mode | BiSheng compiler support for automatic tile buffer allocation and synchronization insertion. | Compiler / toolchain | Ongoing |
| PTO Tile Fusion | BiSheng compiler support for automatic tile operation fusion. | Compiler / toolchain | Ongoing |
| PTO-AS | Bytecode support for PTO ISA. | Compiler / toolchain | Ongoing |
| Convolution extension | PTO ISA support for convolution kernels. | ISA extension | Ongoing |
| Collective communication extension | Add asynchronous communication instructions for CCU and RoCE, and add the TPREFETCH (AIV direct-drive) communication instruction. | Communication ISA extension | 2026 Q2 |
| System scheduling extension | PTO ISA support for SPMD/MPMD programming schedules. | ISA extension | Planned |
| Micro-instructions | Support expressing high-performance operators through micro-instructions, together with a foundational high-performance micro-instruction library. | ISA extension / operator development | 2026 Q2 |
| Base instructions | Further optimize A5 instruction performance, add Pooling-related base instructions, and enhance convolution, quantization, and Fixpipe instruction capabilities. | ISA extension | 2026 Q2 |
| CostModel | Support CostModel performance simulation for A5 instructions. | Toolchain / performance modeling | 2026 Q2 |
| CPU-SIM | Keep CPU-SIM in sync with instruction enhancements. | CPU simulation | 2026 Q2 |

🗃️ Directory Structure

Key directories are listed below:

├── include/                     # Public PTO headers and interfaces
│   └── pto/                     # Common types, ISA interfaces, and CPU/NPU implementations
├── kernels/                     # Kernels and operator implementations
│   ├── manual/                  # Hand-optimized implementations and performance examples
│   └── custom/                  # Custom operator examples
├── docs/                        # ISA, programming model, getting started, and doc site sources
│   ├── isa/                     # Instruction references and category indexes
│   ├── coding/                  # Developer and performance optimization docs
│   ├── assembly/                # PTO-AS assembly syntax and specification
│   └── mkdocs/                  # MkDocs config and source files
├── demos/                       # Auto Mode, baseline, and torch_jit examples
├── tests/                       # CPU / NPU tests, scripts, and test entry points
│   ├── cpu/                     # CPU simulation tests
│   ├── npu/                     # SoC-specific NPU tests
│   └── script/                  # Test build and execution scripts
├── scripts/                     # Build, install, and release scripts
├── cmake/                       # Shared CMake configuration and packaging logic
├── build.sh                     # One-click build and run entry script
└── CMakeLists.txt               # Top-level CMake configuration

ℹ️ Related Information

  • Contributing Guide: contribution workflow and development guidelines
  • Security and Vulnerability Disclosure: process for reporting security issues
  • Release Notes: version updates and release history
  • License: CANN Open Software License Agreement Version 2.0
  • PyPTO: an upper-layer programming framework in the PTO ecosystem
  • PTOAS: PTO assembler and compiler backend for PTO workflows
  • pto-dsl: Pythonic frontend and JIT workflow exploration for PTO
  • FAQ: common issues and solutions

📬 Contact Us

  • Issue reporting: submit problems through repository Issues
  • Feature requests: share suggestions through Issues or discussion channels
  • Code contributions: contribute through Pull Requests
