Skip to content

georgebisbas/pypto-tooling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

115 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pypto-tooling

Dockerfiles and runbooks for PyPTO and simpler development on Ascend 910B.

Repository Contents

Purpose of Each File

  • README.md: Top-level index of the repository, with build commands and runtime guidance.
  • Dockerfile.hw-native-sys.cann9.0: Build a standalone PyPTO image by cloning repositories during build; best for reproducible CI-like setup without local source dependencies.
  • Dockerfile.server.cann:9.0: Build a server/dev image from a local hw-native-sys workspace where pypto is copied from build context; best for local iteration.
  • Dockerfile.simpler.cann9.0: Build a standalone simpler-only image for runtime and worker validation when pypto is not required.
  • Dockerfile.pytorch-hccl-tests.cann9.0: Build a standalone HCCL benchmark image for baseline latency/bandwidth/allreduce comparisons (not for pypto/simpler composite tests).
  • Dockerfile.hw-native-sys.sim.ubuntu22.04: Build a standalone simulation-only PyPTO image for local development/testing on CPU-hosted simulators (a2a3sim, a5sim).
  • Dockerfile.simpler.sim.ubuntu22.04: Build a standalone simpler-only simulation image for L3 worker/collective STs on a2a3sim without pypto or CANN.
  • docker-entrypoint-cann.sh: Runtime helper script that normalizes runtime layout (workspace/runtime symlink behavior) before launching the container command.
  • bz910b-reproduce.md: Operator runbook for reproducing tests and validating NPU execution on Ascend 910B hosts.
  • profiling/: Personal harness: equivalence-driven benchmark drivers, artifact bundles, and figure generation for collectives performance analysis.
  • dockerfile_skills/SKILL.md: Internal engineering notes and patterns for constructing/debugging these Dockerfiles.
  • dockerfile_skills/issue_0.md: Historical incident report and root-cause details for the HCCL IPC/driver compatibility issue.

Recommended Image: Dockerfile.hw-native-sys.cann9.0

Dockerfile.hw-native-sys.cann9.0 is now standalone:

  • No local source checkout is required.
  • Build works from any directory using stdin input.
  • Repositories (pypto, pto-isa) are cloned during build.
  • Key behavior is controlled via build args (CANN_VERSION, INSTALL_PREFIX, PYPTO_COMMIT, PTO_ISA_COMMIT).

Build Commands

Build standalone hw-native-sys image (defaults):

docker build -t pypto3-hw-native-sys:cann9 - < Dockerfile.hw-native-sys.cann9.0

Build standalone hw-native-sys image (custom commit/pinning):

docker build \
  --build-arg PYPTO_COMMIT=a1b066df02fc938f76b3d38b85fc9fbd0e036d07 \
  --build-arg PTO_ISA_COMMIT=016396b57e2c17093f1194e6acd89bb112b0ab24 \
  -t pypto3-hw-native-sys:cann9 \
  - < Dockerfile.hw-native-sys.cann9.0

Build server/dev image (requires local pypto in build context):

docker build --no-cache -f "Dockerfile.server.cann:9.0" -t pypto3-dev-env:cann9 .

Build standalone simpler image (defaults):

docker build -t simpler-cann9 - < Dockerfile.simpler.cann9.0

Build standalone simpler image (custom commit/pinning):

docker build \
  --build-arg SIMPLER_COMMIT=afb5c5a95cf05d5bb346eaef83a318c6f3164971 \
  --build-arg PTO_ISA_COMMIT=ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8 \
  -t simpler-cann9 \
  - < Dockerfile.simpler.cann9.0

Build standalone pytorch-hccl-tests image (defaults):

docker build -t pytorch-hccl-tests:cann9 - < Dockerfile.pytorch-hccl-tests.cann9.0

Build standalone pytorch-hccl-tests image (x86 host / custom ref):

docker build \
  --build-arg PT_HCCL_INSTALL_TARGET=install-npu-x86 \
  -t pytorch-hccl-tests:cann9 \
  - < Dockerfile.pytorch-hccl-tests.cann9.0

Build standalone local simulation image (no NPU required):

docker build -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .

Build standalone local simulation image (pinned commits):

docker build --build-arg PYPTO_COMMIT=b77b41fee7df8b700c6a6a8f0de12767d31be642 --build-arg PTO_ISA_COMMIT=016396b57e2c17093f1194e6acd89bb112b0ab24 -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .

Build standalone simpler simulation image (no NPU, no pypto):

docker build -t simpler-hw-native-sys:sim -f Dockerfile.simpler.sim.ubuntu22.04 .

Build standalone simpler simulation image (pinned commits):

docker build \
  --build-arg SIMPLER_COMMIT=286aa7e4100dc9fc71b77a0f1b279fb5a2207afa \
  --build-arg PTO_ISA_COMMIT=ddafa8da9c760ecd13fe9fe2833d6ee55fb20bd8 \
  -t simpler-hw-native-sys:sim \
  -f Dockerfile.simpler.sim.ubuntu22.04 .

Runtime Notes (Local Simulation)

This mode is for local CPU-hosted simulator workflows only (a2a3sim, a5sim):

docker run --rm -it pypto3-hw-native-sys:sim

Inside the container:

cd /opt/pypto
python -c "import pypto; print('pypto ok')"
which ptoas && ptoas --version

# Unit tests
pytest tests/ut -n auto --maxprocesses 8 -v

# Full ST matrix on simulators (long run)
pytest tests/st -v --forked --platform=a2a3sim,a5sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24

Optional CI-aligned simulator subset:

pytest tests/st/runtime/ops/test_assemble.py tests/st/runtime/ops/test_mscatter.py tests/st/runtime/framework_and_models/test_qwen3_decode_scope3_mixed.py tests/st/runtime/control_flow/test_dyn_orch_shape.py::TestDynOrchShapeOperations::test_dyn_orch_paged_attention -v --platform=a5sim --forked --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24 -k "not TestMscatter"
pytest tests/st/runtime/cross_core/test_cross_core.py -v --forked --platform=a5sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24
pytest tests/st/runtime/cross_core/test_cross_core.py -v --forked --platform=a2a3sim --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24

Simulation image scope notes:

  • No --privileged, no /dev mount, and no /usr/local/Ascend/driver mount are required.
  • Do not use this image for onboard NPU execution (a2a3, a5); use the CANN-based Dockerfiles instead.

Sim Dev Iteration Workflow (local code changes)

When iterating on C++/Python code changes, do not rebuild the image. Build it once, then bind-mount your workspace and reinstall:

# One-time: build the sim image (~15-30 min)
docker build -t pypto3-hw-native-sys:sim -f Dockerfile.hw-native-sys.sim.ubuntu22.04 .

# Every code change: mount workspace + pip install -e (~2-5 min)
docker run --rm \
  -v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
  pypto3-hw-native-sys:sim \
  bash -c "pip install --no-build-isolation -e '/opt/pypto[dev]' 2>&1 | tail -1 && \
           pytest tests/ut/codegen/test_orchestration_codegen.py -v"

# Pre-commit checks (MUST run inside Docker — host may have root-owned cache):
docker run --rm \
  -v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
  pypto3-hw-native-sys:sim \
  bash -c "ruff check ."

# Full unit test suite (before pushing):
docker run --rm \
  -v /home/gb4018/workspace/hw-native-sys/pypto:/opt/pypto \
  pypto3-hw-native-sys:sim \
  bash -c "pip install --no-build-isolation -e '/opt/pypto[dev]' 2>&1 | tail -1 && \
           pytest tests/ut -n auto --maxprocesses 8 -v"

Key rules:

  • Never rebuild the image for code changes-v mount + pip install -e is the iteration loop.
  • Always run ruff inside Docker — the host .ruff_cache gets root-owned from previous Docker runs.
  • Test targeted suites first, then full suite before pushing.

Simpler-only simulation (Dockerfile.simpler.sim.ubuntu22.04)

docker run --rm -it --shm-size=4g simpler-hw-native-sys:sim

Inside the container:

cd /opt/simpler
python -c "import simpler; print('simpler ok')"

# L3 distributed/collective STs on a2a3sim (CI-aligned)
/opt/pypto-tooling/scripts/run-simpler-l3-sim.sh distributed

Use --shm-size=4g on docker run when running L3 tests (forked workers + torch shared memory exhaust default /dev/shm).

Local branch (mount workspace simpler/):

docker run --rm -it --shm-size=4g -v /path/to/simpler:/opt/simpler simpler-hw-native-sys:sim
# inside: pip install --no-build-isolation -e '.[test]'
#         /opt/pypto-tooling/scripts/run-simpler-l3-sim.sh distributed

Runtime Notes (Ascend 910B)

Single-device / non-HCCL run:

docker run --rm -it \
  --privileged \
  --ipc=host \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
  -v /dev:/dev \
  pypto3-hw-native-sys:cann9

Multi-device distributed / HCCL run (must include --pid=host):

docker run --rm -it \
  --privileged \
  --ipc=host \
  --pid=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi:ro \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
  -v /dev:/dev \
  pypto3-hw-native-sys:cann9

The same runtime rule applies to simpler-cann9 and pytorch-hccl-tests:cann9: include --pid=host for distributed/HCCL workflows.

Before HCCL tests in pypto or simpler images, export LD_PRELOAD in your shell (not baked into the image):

export LD_PRELOAD=${CANN_HOME}/aarch64-linux/lib64/libhccl.so

The pytorch-hccl-tests image uses torch HCCL directly and does not need LD_PRELOAD; it still requires --pid=host for multi-rank NPU collectives.

Mount only the host driver path at runtime:

/usr/local/Ascend/driver

Do not mount /usr/local/Ascend from host, because it can shadow the image's baked CANN version.

HCCL Bandwidth Benchmarks (pytorch-hccl-tests:cann9)

OSU-style micro-benchmarks measuring HCCL bandwidth via torch_npu directly (no pypto/simpler). Launch the container with the multi-device / HCCL run flags above (--pid=host is mandatory). This image uses torch HCCL and does not need LD_PRELOAD.

Sanity-check the runtime first:

cd /opt/pytorch-hccl-tests
npu-smi info                       # confirm visible chip count
python -c "import torch_npu, torch.distributed as dist; \
  print('torch_npu', torch_npu.__version__); print('hccl', dist.is_hccl_available())"

Bidirectional bandwidth (bibw)

Point-to-point bidirectional bandwidth between two ranks — each rank sends and receives simultaneously, and the benchmark reports the aggregate GB/s across both directions. It is a 2-rank test (enforced internally), so always use --nproc_per_node 2:

# direct invocation
torchrun --nnodes 1 --nproc_per_node 2 \
  pytorch_hccl_tests/cli.py --benchmark bibw --device npu

# Makefile shortcut
make bidirectional-bw DEVICE=npu

Pin the exact pair of chips to measure a specific link (count must be 2):

ASCEND_RT_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 \
  pytorch_hccl_tests/cli.py --benchmark bibw --device npu

Results print per message size as Size (B) Bandwidth (GB/s) and are also written to a CSV in the working directory, e.g. osu_bibw_gbps-npu-float16-2.csv.

Tunables (all optional): --dtype (default float16; e.g. float32, bfloat16), --min / --max to bound the message-size sweep in bytes, --skip warmup iterations, --iterations timed iterations:

torchrun --nnodes 1 --nproc_per_node 2 \
  pytorch_hccl_tests/cli.py --benchmark bibw --device npu \
  --dtype float32 --min 1024 --max 4194304 --skip 10 --iterations 100

For the unidirectional counterpart use --benchmark bandwidth (Makefile: make bandwidth DEVICE=npu); both are part of the make p2p DEVICE=npu suite.

Collective bandwidth (allreduce, allgather, ...)

Collective benchmarks scale with rank count — set --nproc_per_node to the number of chips (start at 2, scale to your visible count):

torchrun --nnodes 1 --nproc_per_node 4 \
  pytorch_hccl_tests/cli.py --benchmark allreduce --device npu

# Makefile shortcut (canonical OSU-style API)
WORLD_SIZE=4 make allreduce DEVICE=npu

WORLD_SIZE caveat. The env-prefix form above is the intended API, but it only takes effect if the Makefile declares WORLD_SIZE ?= 2 (conditional). Older checkouts use export WORLD_SIZE = 2, which shadows the env prefix (it silently stays at 2) — there, pass it as a make argument instead: make allreduce WORLD_SIZE=4 DEVICE=npu, or use make -e. Fix tracked in huawei-csl/pytorch-hccl-tests PR #9.

Enumerate every benchmark and flag in the pinned build (bibw, bandwidth, latency, allreduce, allgather, alltoall, broadcast, reducescatter, barrier, ...):

python pytorch_hccl_tests/cli.py --help

A CPU smoke test (--device cpu) validates the harness without NPUs but does not measure NPU bandwidth.

Multi-pair bandwidth (mbw-mr)

The container ships an editable checkout on master, so switching branches takes effect with no reinstall. The multi-pair bandwidth / message-rate benchmark currently lives on the feat/osu-mbw-mr branch (PR #9):

cd /opt/pytorch-hccl-tests
git checkout feat/osu-mbw-mr
WORLD_SIZE=8 make mbw-mr DEVICE=npu        # world_size//2 concurrent sender/receiver pairs

(See the WORLD_SIZE caveat above — if the checkout still uses export WORLD_SIZE = 2, pass it as make mbw-mr WORLD_SIZE=8 … instead.)

Update this to plain master once PR #9 merges.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors