pypto-tooling

Dockerfiles and runbooks for PyPTO and simpler development on Ascend 910B.

Repository Contents

README.md: Repository index and quick usage.
Dockerfile.hw-native-sys.cann9.0: Standalone PyPTO image that clones all sources from GitHub at build time.
Dockerfile.server.cann:9.0: Server/dev image that uses a local pypto build context.
Dockerfile.simpler.cann9.0: Standalone simpler image with pinned commit support and HCCL-safe runtime defaults.
Dockerfile.pytorch-hccl-tests.cann9.0: Standalone HCCL micro-benchmark image (torchrun + pytorch-hccl-tests fork).
Dockerfile.hw-native-sys.sim.ubuntu22.04: Standalone local simulation image for PyPTO (a2a3sim/a5sim) on x86_64 without NPU devices.
Dockerfile.simpler.sim.ubuntu22.04: Standalone simpler-only simulation image (a2a3sim/a5sim) for L3 worker STs without pypto or CANN.
docker-entrypoint-cann.sh: Runtime helper for workspace/runtime symlink handling.
bz910b-reproduce.md: Reproduction and test workflow guide.
profiling/: Personal collective benchmark harness for PyPTO vs simpler L3 collectives.
dockerfile_skills/SKILL.md: Internal Dockerfile construction/debugging notes.
dockerfile_skills/issue_0.md: Detailed issue log for comm_alloc_windows/HCCL environment mismatch.
dockerfile_skills/issue_pytorch_hccl_tests.md: pytorch-hccl-tests benchmark image — editable-install shadow, WORLD_SIZE Makefile shadow, fp64 HCCL reduce crash.

Purpose of Each File

README.md: Top-level index of the repository, with build commands and runtime guidance.
Dockerfile.hw-native-sys.cann9.0: Build a standalone PyPTO image by cloning repositories during build; best for reproducible CI-like setup without local source dependencies.
Dockerfile.server.cann:9.0: Build a server/dev image from a local hw-native-sys workspace where pypto is copied from build context; best for local iteration.
Dockerfile.simpler.cann9.0: Build a standalone simpler-only image for runtime and worker validation when pypto is not required.
Dockerfile.pytorch-hccl-tests.cann9.0: Build a standalone HCCL benchmark image for baseline latency/bandwidth/allreduce comparisons (not for pypto/simpler composite tests).
Dockerfile.hw-native-sys.sim.ubuntu22.04: Build a standalone simulation-only PyPTO image for local development/testing on CPU-hosted simulators (a2a3sim, a5sim).
Dockerfile.simpler.sim.ubuntu22.04: Build a standalone simpler-only simulation image for L3 worker/collective STs on a2a3sim without pypto or CANN.
docker-entrypoint-cann.sh: Runtime helper script that normalizes runtime layout (workspace/runtime symlink behavior) before launching the container command.
bz910b-reproduce.md: Operator runbook for reproducing tests and validating NPU execution on Ascend 910B hosts.
profiling/: Personal harness: equivalence-driven benchmark drivers, artifact bundles, and figure generation for collectives performance analysis.
dockerfile_skills/SKILL.md: Internal engineering notes and patterns for constructing/debugging these Dockerfiles.
dockerfile_skills/issue_0.md: Historical incident report and root-cause details for the HCCL IPC/driver compatibility issue.

Recommended Image: `Dockerfile.hw-native-sys.cann9.0`

Dockerfile.hw-native-sys.cann9.0 is now standalone:

No local source checkout is required.
Build works from any directory using stdin input.
Repositories (pypto, pto-isa) are cloned during build.
Key behavior is controlled via build args (CANN_VERSION, INSTALL_PREFIX, PYPTO_COMMIT, PTO_ISA_COMMIT).

Build Commands

Build standalone hw-native-sys image (defaults):

docker build -t pypto3-hw-native-sys:cann9 - < Dockerfile.hw-native-sys.cann9.0

Build standalone hw-native-sys image (custom commit/pinning):

docker build \
  --build-arg PYPTO_COMMIT=a1b066df02fc938f76b3d38b85fc9fbd0e036d07 \
  --build-arg PTO_ISA_COMMIT=016396b57e2c17093f1194e6acd89bb112b0ab24 \
  -t pypto3-hw-native-sys:cann9 \
  - < Dockerfile.hw-native-sys.cann9.0

Build server/dev image (requires local pypto in build context):

docker build --no-cache -f "Dockerfile.server.cann:9.0" -t pypto3-dev-env:cann9 .

Build standalone simpler image (defaults):

docker build -t simpler-cann9 - < Dockerfile.simpler.cann9.0

Build standalone simpler image (custom commit/pinning):

docker build \
  --build-arg SIMPLER_COMMIT=afb5c5a95cf05d5bb346eaef83a318c6f3164971 \
  --build-arg PTO_ISA_COMMIT=ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8 \
  -t simpler-cann9 \
  - < Dockerfile.simpler.cann9.0

Build standalone pytorch-hccl-tests image (defaults):

docker build -t pytorch-hccl-tests:cann9 - < Dockerfile.pytorch-hccl-tests.cann9.0

Build standalone pytorch-hccl-tests image (x86 host / custom ref):

docker build \
  --build-arg PT_HCCL_INSTALL_TARGET=install-npu-x86 \
  -t pytorch-hccl-tests:cann9 \
  - < Dockerfile.pytorch-hccl-tests.cann9.0

Build standalone local simulation image (no NPU required):

docker build -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .

Build standalone local simulation image (pinned commits):

docker build --build-arg PYPTO_COMMIT=b77b41fee7df8b700c6a6a8f0de12767d31be642 --build-arg PTO_ISA_COMMIT=016396b57e2c17093f1194e6acd89bb112b0ab24 -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .

Build standalone simpler simulation image (no NPU, no pypto):

docker build -t simpler-hw-native-sys:sim -f Dockerfile.simpler.sim.ubuntu22.04 .

Build standalone simpler simulation image (pinned commits):

docker build \
  --build-arg SIMPLER_COMMIT=286aa7e4100dc9fc71b77a0f1b279fb5a2207afa \
  --build-arg PTO_ISA_COMMIT=ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8 \
  -t simpler-hw-native-sys:sim \
  -f Dockerfile.simpler.sim.ubuntu22.04 .

Runtime Notes (Local Simulation)

This mode is for local CPU-hosted simulator workflows only (a2a3sim, a5sim):

docker run --rm -it pypto3-hw-native-sys:sim

Inside the container:

cd /opt/pypto
python -c "import pypto; print('pypto ok')"
which ptoas && ptoas --version

# Unit tests
pytest tests/ut -n auto --maxprocesses 8 -v

# Full ST matrix on simulators (long run)
pytest tests/st -v --forked --platform=a2a3sim,a5sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24

Optional CI-aligned simulator subset:

pytest tests/st/runtime/ops/test_assemble.py tests/st/runtime/ops/test_mscatter.py tests/st/runtime/framework_and_models/test_qwen3_decode_scope3_mixed.py tests/st/runtime/control_flow/test_dyn_orch_shape.py::TestDynOrchShapeOperations::test_dyn_orch_paged_attention -v --platform=a5sim --forked --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24 -k "not TestMscatter"
pytest tests/st/runtime/cross_core/test_cross_core.py -v --forked --platform=a5sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24
pytest tests/st/runtime/cross_core/test_cross_core.py -v --forked --platform=a2a3sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24

Simulation image scope notes:

No --privileged, no /dev mount, and no /usr/local/Ascend/driver mount are required.
Do not use this image for onboard NPU execution (a2a3, a5); use the CANN-based Dockerfiles instead.

Sim Dev Iteration Workflow (local code changes)

When iterating on C++/Python code changes, do not rebuild the image. Build it once, then bind-mount your workspace and reinstall:

# One-time: build the sim image (~15-30 min)
docker build -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .

# Every code change: mount workspace + pip install -e (~2-5 min)
docker run --rm \
  -v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
  pypto3-hw-native-sys:sim \
  bash -c "pip install --no-build-isolation -e '/opt/pypto[dev]' 2>&1 | tail -1 && \
           pytest tests/ut/codegen/test_orchestration_codegen.py -v"

# Pre-commit checks (MUST run inside Docker — host may have root-owned cache):
docker run --rm \
  -v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
  pypto3-hw-native-sys:sim \
  bash -c "ruff check ."

# Full unit test suite (before pushing):
docker run --rm \
  -v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
  pypto3-hw-native-sys:sim \
  bash -c "pip install --no-build-isolation -e '/opt/pypto[dev]' 2>&1 | tail -1 && \
           pytest tests/ut -n auto --maxprocesses 8 -v"

Key rules:

Never rebuild the image for code changes — -v mount + pip install -e is the iteration loop.
Always run ruff inside Docker — the host .ruff_cache gets root-owned from previous Docker runs.
Test targeted suites first, then full suite before pushing.

Simpler-only simulation (`Dockerfile.simpler.sim.ubuntu22.04`)

docker run --rm -it --shm-size=4g simpler-hw-native-sys:sim

Inside the container:

cd /opt/simpler
python -c "import simpler; print('simpler ok')"

# L3 distributed/collective STs on a2a3sim (CI-aligned)
/opt/pypto-tooling/scripts/run-simpler-l3-sim.sh distributed

Use --shm-size=4g on docker run when running L3 tests (forked workers + torch shared memory exhaust default /dev/shm).

Local branch (mount workspace simpler/):

docker run --rm -it --shm-size=4g -v /path/to/simpler:/opt/simpler simpler-hw-native-sys:sim
# inside: pip install --no-build-isolation -e '.[test]'
#         /opt/pypto-tooling/scripts/run-simpler-l3-sim.sh distributed

Runtime Notes (Ascend 910B)

Single-device / non-HCCL run:

docker run --rm -it \
  --privileged \
  --ipc=host \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
  -v /dev:/dev \
  pypto3-hw-native-sys:cann9

Multi-device distributed / HCCL run (must include --pid=host):

docker run --rm -it \
  --privileged \
  --ipc=host \
  --pid=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
  -v /dev:/dev \
  pypto3-hw-native-sys:cann9

The same runtime rule applies to simpler-cann9 and pytorch-hccl-tests:cann9: include --pid=host for distributed/HCCL workflows.

Before HCCL tests in pypto or simpler images, export LD_PRELOAD in your shell (not baked into the image):

export LD_PRELOAD=${CANN_HOME}/aarch64-linux/lib64/libhccl.so

The pytorch-hccl-tests image uses torch HCCL directly and does not need LD_PRELOAD; it still requires --pid=host for multi-rank NPU collectives.

Mount only the host driver path at runtime:

/usr/local/Ascend/driver

Do not mount /usr/local/Ascend from host, because it can shadow the image's baked CANN version.

HCCL Bandwidth Benchmarks (`pytorch-hccl-tests:cann9`)

OSU-style micro-benchmarks measuring HCCL bandwidth via torch_npu directly (no pypto/simpler). Launch the container with the multi-device / HCCL run flags above (--pid=host is mandatory). This image uses torch HCCL and does not need LD_PRELOAD.

Sanity-check the runtime first:

cd /opt/pytorch-hccl-tests
npu-smi info                       # confirm visible chip count
python -c "import torch_npu, torch.distributed as dist; \
  print('torch_npu', torch_npu.__version__); print('hccl', dist.is_hccl_available())"

Bidirectional bandwidth (`bibw`)

Point-to-point bidirectional bandwidth between two ranks — each rank sends and receives simultaneously, and the benchmark reports the aggregate GB/s across both directions. It is a 2-rank test (enforced internally), so always use --nproc_per_node 2:

# direct invocation
torchrun --nnodes 1 --nproc_per_node 2 \
  pytorch_hccl_tests/cli.py --benchmark bibw --device npu

# Makefile shortcut
make bidirectional-bw DEVICE=npu

Pin the exact pair of chips to measure a specific link (count must be 2):

ASCEND_RT_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 \
  pytorch_hccl_tests/cli.py --benchmark bibw --device npu

Results print per message size as Size (B) Bandwidth (GB/s) and are also written to a CSV in the working directory, e.g. osu_bibw_gbps-npu-float16-2.csv.

Tunables (all optional): --dtype (default float16; e.g. float32, bfloat16), --min / --max to bound the message-size sweep in bytes, --skip warmup iterations, --iterations timed iterations:

torchrun --nnodes 1 --nproc_per_node 2 \
  pytorch_hccl_tests/cli.py --benchmark bibw --device npu \
  --dtype float32 --min 1024 --max 4194304 --skip 10 --iterations 100

For the unidirectional counterpart use --benchmark bandwidth (Makefile: make bandwidth DEVICE=npu); both are part of the make p2p DEVICE=npu suite.

Collective bandwidth (allreduce, allgather, ...)

Collective benchmarks scale with rank count — set --nproc_per_node to the number of chips (start at 2, scale to your visible count):

torchrun --nnodes 1 --nproc_per_node 4 \
  pytorch_hccl_tests/cli.py --benchmark allreduce --device npu

# Makefile shortcut (canonical OSU-style API)
WORLD_SIZE=4 make allreduce DEVICE=npu

WORLD_SIZE caveat. The env-prefix form above is the intended API, but it only takes effect if the Makefile declares WORLD_SIZE ?= 2 (conditional). Older checkouts use export WORLD_SIZE = 2, which shadows the env prefix (it silently stays at 2) — there, pass it as a make argument instead: make allreduce WORLD_SIZE=4 DEVICE=npu, or use make -e. Fix tracked in huawei-csl/pytorch-hccl-tests PR #9.

Enumerate every benchmark and flag in the pinned build (bibw, bandwidth, latency, allreduce, allgather, alltoall, broadcast, reducescatter, barrier, ...):

python pytorch_hccl_tests/cli.py --help

A CPU smoke test (--device cpu) validates the harness without NPUs but does not measure NPU bandwidth.

Multi-pair bandwidth (`mbw-mr`)

The container ships an editable checkout on master, so switching branches takes effect with no reinstall. The multi-pair bandwidth / message-rate benchmark currently lives on the feat/osu-mbw-mr branch (PR #9):

cd /opt/pytorch-hccl-tests
git checkout feat/osu-mbw-mr
WORLD_SIZE=8 make mbw-mr DEVICE=npu        # world_size//2 concurrent sender/receiver pairs

(See the WORLD_SIZE caveat above — if the checkout still uses export WORLD_SIZE = 2, pass it as make mbw-mr WORLD_SIZE=8 … instead.)

Update this to plain master once PR #9 merges.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pypto-tooling

Repository Contents

Purpose of Each File

Recommended Image: `Dockerfile.hw-native-sys.cann9.0`

Build Commands

Runtime Notes (Local Simulation)

Sim Dev Iteration Workflow (local code changes)

Simpler-only simulation (`Dockerfile.simpler.sim.ubuntu22.04`)

Runtime Notes (Ascend 910B)

HCCL Bandwidth Benchmarks (`pytorch-hccl-tests:cann9`)

Bidirectional bandwidth (`bibw`)

Collective bandwidth (allreduce, allgather, ...)

Multi-pair bandwidth (`mbw-mr`)

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
build_skills		build_skills
debugging_skills		debugging_skills
dockerfile_skills		dockerfile_skills
mcp-hw-native-sys		mcp-hw-native-sys
profiling		profiling
scripts		scripts
.gitignore		.gitignore
.mcp.json		.mcp.json
Dockerfile.hw-native-sys.cann9.0		Dockerfile.hw-native-sys.cann9.0
Dockerfile.hw-native-sys.sim.ubuntu22.04		Dockerfile.hw-native-sys.sim.ubuntu22.04
Dockerfile.pytorch-hccl-tests.cann9.0		Dockerfile.pytorch-hccl-tests.cann9.0
Dockerfile.server.cann:9.0		Dockerfile.server.cann:9.0
Dockerfile.simpler.cann9.0		Dockerfile.simpler.cann9.0
Dockerfile.simpler.sim.ubuntu22.04		Dockerfile.simpler.sim.ubuntu22.04
README.md		README.md
bz910b-reproduce.md		bz910b-reproduce.md
diagnose_npu.py		diagnose_npu.py
docker-entrypoint-cann.sh		docker-entrypoint-cann.sh
personal_setup.md		personal_setup.md
pypto-tooling.code-workspace		pypto-tooling.code-workspace

Folders and files

Latest commit

History

Repository files navigation

pypto-tooling

Repository Contents

Purpose of Each File

Recommended Image: Dockerfile.hw-native-sys.cann9.0

Build Commands

Runtime Notes (Local Simulation)

Sim Dev Iteration Workflow (local code changes)

Simpler-only simulation (Dockerfile.simpler.sim.ubuntu22.04)

Runtime Notes (Ascend 910B)

HCCL Bandwidth Benchmarks (pytorch-hccl-tests:cann9)

Bidirectional bandwidth (bibw)

Collective bandwidth (allreduce, allgather, ...)

Multi-pair bandwidth (mbw-mr)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Recommended Image: `Dockerfile.hw-native-sys.cann9.0`

Simpler-only simulation (`Dockerfile.simpler.sim.ubuntu22.04`)

HCCL Bandwidth Benchmarks (`pytorch-hccl-tests:cann9`)

Bidirectional bandwidth (`bibw`)

Multi-pair bandwidth (`mbw-mr`)

Packages