Dockerfiles and runbooks for PyPTO and simpler development on Ascend 910B.
- README.md: Repository index and quick usage.
- Dockerfile.hw-native-sys.cann9.0: Standalone PyPTO image that clones all sources from GitHub at build time.
- Dockerfile.server.cann:9.0: Server/dev image that uses a local pypto build context.
- Dockerfile.simpler.cann9.0: Standalone simpler image with pinned commit support and HCCL-safe runtime defaults.
- Dockerfile.pytorch-hccl-tests.cann9.0: Standalone HCCL micro-benchmark image (
torchrun+ pytorch-hccl-tests fork). - Dockerfile.hw-native-sys.sim.ubuntu22.04: Standalone local simulation image for PyPTO (
a2a3sim/a5sim) on x86_64 without NPU devices. - Dockerfile.simpler.sim.ubuntu22.04: Standalone simpler-only simulation image (
a2a3sim/a5sim) for L3 worker STs without pypto or CANN. - docker-entrypoint-cann.sh: Runtime helper for workspace/runtime symlink handling.
- bz910b-reproduce.md: Reproduction and test workflow guide.
- profiling/: Personal collective benchmark harness for PyPTO vs simpler L3 collectives.
- dockerfile_skills/SKILL.md: Internal Dockerfile construction/debugging notes.
- dockerfile_skills/issue_0.md: Detailed issue log for comm_alloc_windows/HCCL environment mismatch.
- dockerfile_skills/issue_pytorch_hccl_tests.md: pytorch-hccl-tests benchmark image — editable-install shadow, WORLD_SIZE Makefile shadow, fp64 HCCL reduce crash.
- README.md: Top-level index of the repository, with build commands and runtime guidance.
- Dockerfile.hw-native-sys.cann9.0: Build a standalone PyPTO image by cloning repositories during build; best for reproducible CI-like setup without local source dependencies.
- Dockerfile.server.cann:9.0: Build a server/dev image from a local hw-native-sys workspace where pypto is copied from build context; best for local iteration.
- Dockerfile.simpler.cann9.0: Build a standalone simpler-only image for runtime and worker validation when pypto is not required.
- Dockerfile.pytorch-hccl-tests.cann9.0: Build a standalone HCCL benchmark image for baseline latency/bandwidth/allreduce comparisons (not for pypto/simpler composite tests).
- Dockerfile.hw-native-sys.sim.ubuntu22.04: Build a standalone simulation-only PyPTO image for local development/testing on CPU-hosted simulators (
a2a3sim,a5sim). - Dockerfile.simpler.sim.ubuntu22.04: Build a standalone simpler-only simulation image for L3 worker/collective STs on
a2a3simwithout pypto or CANN. - docker-entrypoint-cann.sh: Runtime helper script that normalizes runtime layout (workspace/runtime symlink behavior) before launching the container command.
- bz910b-reproduce.md: Operator runbook for reproducing tests and validating NPU execution on Ascend 910B hosts.
- profiling/: Personal harness: equivalence-driven benchmark drivers, artifact bundles, and figure generation for collectives performance analysis.
- dockerfile_skills/SKILL.md: Internal engineering notes and patterns for constructing/debugging these Dockerfiles.
- dockerfile_skills/issue_0.md: Historical incident report and root-cause details for the HCCL IPC/driver compatibility issue.
Dockerfile.hw-native-sys.cann9.0 is now standalone:
- No local source checkout is required.
- Build works from any directory using stdin input.
- Repositories (
pypto,pto-isa) are cloned during build. - Key behavior is controlled via build args (
CANN_VERSION,INSTALL_PREFIX,PYPTO_COMMIT,PTO_ISA_COMMIT).
Build standalone hw-native-sys image (defaults):
docker build -t pypto3-hw-native-sys:cann9 - < Dockerfile.hw-native-sys.cann9.0Build standalone hw-native-sys image (custom commit/pinning):
docker build \
--build-arg PYPTO_COMMIT=a1b066df02fc938f76b3d38b85fc9fbd0e036d07 \
--build-arg PTO_ISA_COMMIT=016396b57e2c17093f1194e6acd89bb112b0ab24 \
-t pypto3-hw-native-sys:cann9 \
- < Dockerfile.hw-native-sys.cann9.0Build server/dev image (requires local pypto in build context):
docker build --no-cache -f "Dockerfile.server.cann:9.0" -t pypto3-dev-env:cann9 .Build standalone simpler image (defaults):
docker build -t simpler-cann9 - < Dockerfile.simpler.cann9.0Build standalone simpler image (custom commit/pinning):
docker build \
--build-arg SIMPLER_COMMIT=afb5c5a95cf05d5bb346eaef83a318c6f3164971 \
--build-arg PTO_ISA_COMMIT=ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8 \
-t simpler-cann9 \
- < Dockerfile.simpler.cann9.0Build standalone pytorch-hccl-tests image (defaults):
docker build -t pytorch-hccl-tests:cann9 - < Dockerfile.pytorch-hccl-tests.cann9.0Build standalone pytorch-hccl-tests image (x86 host / custom ref):
docker build \
--build-arg PT_HCCL_INSTALL_TARGET=install-npu-x86 \
-t pytorch-hccl-tests:cann9 \
- < Dockerfile.pytorch-hccl-tests.cann9.0Build standalone local simulation image (no NPU required):
docker build -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .Build standalone local simulation image (pinned commits):
docker build --build-arg PYPTO_COMMIT=b77b41fee7df8b700c6a6a8f0de12767d31be642 --build-arg PTO_ISA_COMMIT=016396b57e2c17093f1194e6acd89bb112b0ab24 -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .Build standalone simpler simulation image (no NPU, no pypto):
docker build -t simpler-hw-native-sys:sim -f Dockerfile.simpler.sim.ubuntu22.04 .Build standalone simpler simulation image (pinned commits):
docker build \
--build-arg SIMPLER_COMMIT=286aa7e4100dc9fc71b77a0f1b279fb5a2207afa \
--build-arg PTO_ISA_COMMIT=ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8 \
-t simpler-hw-native-sys:sim \
-f Dockerfile.simpler.sim.ubuntu22.04 .This mode is for local CPU-hosted simulator workflows only (a2a3sim, a5sim):
docker run --rm -it pypto3-hw-native-sys:simInside the container:
cd /opt/pypto
python -c "import pypto; print('pypto ok')"
which ptoas && ptoas --version
# Unit tests
pytest tests/ut -n auto --maxprocesses 8 -v
# Full ST matrix on simulators (long run)
pytest tests/st -v --forked --platform=a2a3sim,a5sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24Optional CI-aligned simulator subset:
pytest tests/st/runtime/ops/test_assemble.py tests/st/runtime/ops/test_mscatter.py tests/st/runtime/framework_and_models/test_qwen3_decode_scope3_mixed.py tests/st/runtime/control_flow/test_dyn_orch_shape.py::TestDynOrchShapeOperations::test_dyn_orch_paged_attention -v --platform=a5sim --forked --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24 -k "not TestMscatter"
pytest tests/st/runtime/cross_core/test_cross_core.py -v --forked --platform=a5sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24
pytest tests/st/runtime/cross_core/test_cross_core.py -v --forked --platform=a2a3sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24Simulation image scope notes:
- No
--privileged, no/devmount, and no/usr/local/Ascend/drivermount are required. - Do not use this image for onboard NPU execution (
a2a3,a5); use the CANN-based Dockerfiles instead.
When iterating on C++/Python code changes, do not rebuild the image. Build it once, then bind-mount your workspace and reinstall:
# One-time: build the sim image (~15-30 min)
docker build -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .
# Every code change: mount workspace + pip install -e (~2-5 min)
docker run --rm \
-v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
pypto3-hw-native-sys:sim \
bash -c "pip install --no-build-isolation -e '/opt/pypto[dev]' 2>&1 | tail -1 && \
pytest tests/ut/codegen/test_orchestration_codegen.py -v"
# Pre-commit checks (MUST run inside Docker — host may have root-owned cache):
docker run --rm \
-v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
pypto3-hw-native-sys:sim \
bash -c "ruff check ."
# Full unit test suite (before pushing):
docker run --rm \
-v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
pypto3-hw-native-sys:sim \
bash -c "pip install --no-build-isolation -e '/opt/pypto[dev]' 2>&1 | tail -1 && \
pytest tests/ut -n auto --maxprocesses 8 -v"Key rules:
- Never rebuild the image for code changes —
-vmount +pip install -eis the iteration loop. - Always run ruff inside Docker — the host
.ruff_cachegets root-owned from previous Docker runs. - Test targeted suites first, then full suite before pushing.
docker run --rm -it --shm-size=4g simpler-hw-native-sys:simInside the container:
cd /opt/simpler
python -c "import simpler; print('simpler ok')"
# L3 distributed/collective STs on a2a3sim (CI-aligned)
/opt/pypto-tooling/scripts/run-simpler-l3-sim.sh distributedUse --shm-size=4g on docker run when running L3 tests (forked workers + torch shared memory exhaust default /dev/shm).
Local branch (mount workspace simpler/):
docker run --rm -it --shm-size=4g -v /path/to/simpler:/opt/simpler simpler-hw-native-sys:sim
# inside: pip install --no-build-isolation -e '.[test]'
# /opt/pypto-tooling/scripts/run-simpler-l3-sim.sh distributedSingle-device / non-HCCL run:
docker run --rm -it \
--privileged \
--ipc=host \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /dev:/dev \
pypto3-hw-native-sys:cann9Multi-device distributed / HCCL run (must include --pid=host):
docker run --rm -it \
--privileged \
--ipc=host \
--pid=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /dev:/dev \
pypto3-hw-native-sys:cann9The same runtime rule applies to simpler-cann9 and pytorch-hccl-tests:cann9: include --pid=host for distributed/HCCL workflows.
Before HCCL tests in pypto or simpler images, export LD_PRELOAD in your shell (not baked into the image):
export LD_PRELOAD=${CANN_HOME}/aarch64-linux/lib64/libhccl.soThe pytorch-hccl-tests image uses torch HCCL directly and does not need LD_PRELOAD; it still requires --pid=host for multi-rank NPU collectives.
Mount only the host driver path at runtime:
/usr/local/Ascend/driver
Do not mount /usr/local/Ascend from host, because it can shadow the image's baked CANN version.
OSU-style micro-benchmarks measuring HCCL bandwidth via torch_npu directly (no
pypto/simpler). Launch the container with the multi-device / HCCL run flags
above (--pid=host is mandatory). This image uses torch HCCL and does not
need LD_PRELOAD.
Sanity-check the runtime first:
cd /opt/pytorch-hccl-tests
npu-smi info # confirm visible chip count
python -c "import torch_npu, torch.distributed as dist; \
print('torch_npu', torch_npu.__version__); print('hccl', dist.is_hccl_available())"Point-to-point bidirectional bandwidth between two ranks — each rank sends and
receives simultaneously, and the benchmark reports the aggregate GB/s across
both directions. It is a 2-rank test (enforced internally), so always use
--nproc_per_node 2:
# direct invocation
torchrun --nnodes 1 --nproc_per_node 2 \
pytorch_hccl_tests/cli.py --benchmark bibw --device npu
# Makefile shortcut
make bidirectional-bw DEVICE=npuPin the exact pair of chips to measure a specific link (count must be 2):
ASCEND_RT_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 \
pytorch_hccl_tests/cli.py --benchmark bibw --device npuResults print per message size as Size (B) Bandwidth (GB/s) and are also
written to a CSV in the working directory, e.g. osu_bibw_gbps-npu-float16-2.csv.
Tunables (all optional): --dtype (default float16; e.g. float32,
bfloat16), --min / --max to bound the message-size sweep in bytes,
--skip warmup iterations, --iterations timed iterations:
torchrun --nnodes 1 --nproc_per_node 2 \
pytorch_hccl_tests/cli.py --benchmark bibw --device npu \
--dtype float32 --min 1024 --max 4194304 --skip 10 --iterations 100For the unidirectional counterpart use --benchmark bandwidth (Makefile:
make bandwidth DEVICE=npu); both are part of the make p2p DEVICE=npu suite.
Collective benchmarks scale with rank count — set --nproc_per_node to the
number of chips (start at 2, scale to your visible count):
torchrun --nnodes 1 --nproc_per_node 4 \
pytorch_hccl_tests/cli.py --benchmark allreduce --device npu
# Makefile shortcut (canonical OSU-style API)
WORLD_SIZE=4 make allreduce DEVICE=npu
WORLD_SIZEcaveat. The env-prefix form above is the intended API, but it only takes effect if the Makefile declaresWORLD_SIZE ?= 2(conditional). Older checkouts useexport WORLD_SIZE = 2, which shadows the env prefix (it silently stays at 2) — there, pass it as a make argument instead:make allreduce WORLD_SIZE=4 DEVICE=npu, or usemake -e. Fix tracked in huawei-csl/pytorch-hccl-tests PR #9.
Enumerate every benchmark and flag in the pinned build (bibw, bandwidth,
latency, allreduce, allgather, alltoall, broadcast, reducescatter,
barrier, ...):
python pytorch_hccl_tests/cli.py --helpA CPU smoke test (--device cpu) validates the harness without NPUs but does not
measure NPU bandwidth.
The container ships an editable checkout on master, so switching branches
takes effect with no reinstall. The multi-pair bandwidth / message-rate
benchmark currently lives on the feat/osu-mbw-mr branch (PR #9):
cd /opt/pytorch-hccl-tests
git checkout feat/osu-mbw-mr
WORLD_SIZE=8 make mbw-mr DEVICE=npu # world_size//2 concurrent sender/receiver pairs(See the WORLD_SIZE caveat above — if the checkout still uses export WORLD_SIZE = 2, pass it as make mbw-mr WORLD_SIZE=8 … instead.)
Update this to plain master once PR #9 merges.