Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework #243
hexinw wants to merge 2 commits into NVIDIA:master from
Conversation
This patch introduces support for running multi-process, multi-node NCCL tests using the Torch c10d Gloo distributed framework.

Previously, running multi-node NCCL tests required MPI, which relies on SSH or Kubexec (in Kubernetes) to access worker nodes. This setup posed deployment and security challenges due to the need for maintaining SSH keys or Kubexec RBAC policies.

With the introduction of C10D Gloo, worker nodes communicate with the master node over TCP transport. This simplifies the process, making it similar to running multi-node PyTorch training jobs. Users only need to set the following environment variables to start the test:

- MASTER_ADDR
- RANK
- WORLD_SIZE

## Dependencies

PyTorch C++ APIs and libraries are required. Download LibTorch with the following commands:

```
cd /tmp/
wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip
sudo mv libtorch /usr/local/
```

## Build instructions

To build the NCCL test binaries supporting both MPI and C10D Gloo, use:

```
MPI=1 GLOO=1 make
```

## Usage

#### Run a Single 8-GPU Node NCCL Test

1. Set environment variables:

```
export NCCL_TOPO_FILE=<topo_file_location>
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
```

2. Execute the test:

```
#!/bin/bash
for i in {0..7}; do
  MASTER_ADDR=localhost RANK=$i WORLD_SIZE=8 ./all_reduce_perf -b1G -e2G -f2 -t1 -g1 &
done
wait
```

#### Run a Two-Node NCCL Test

Node 1:

1. Set environment variables:

```
export NCCL_TOPO_FILE=<topo_file_location>
export MASTER_ADDR=<master_node_ip_address>
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
```

2. Execute the test:

```
RANK=0 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
```

Node 2:

1. Set environment variables:

```
export NCCL_TOPO_FILE=<topo_file_location>
export MASTER_ADDR=<master_node_ip_address>
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
```

2. Execute the test:

```
RANK=1 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
```
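For illustration, here is a minimal sketch of how a test process could build a c10d Gloo process group from those three environment variables. This is not the patch's actual code: the helper name createGlooProcessGroup is made up, and the TCPStore/ProcessGroupGloo header paths and option fields vary between LibTorch releases.

```cpp
// Sketch only: wire MASTER_ADDR / RANK / WORLD_SIZE into a c10d Gloo
// process group. Treat this as an outline rather than drop-in code.
#include <cstdlib>
#include <string>
#include <torch/csrc/distributed/c10d/TCPStore.hpp>
#include <torch/csrc/distributed/c10d/ProcessGroupGloo.hpp>

static c10::intrusive_ptr<c10d::ProcessGroupGloo> createGlooProcessGroup() {
  // Assumes the three env variables are set, as described above.
  const std::string masterAddr = std::getenv("MASTER_ADDR");
  const int rank = std::atoi(std::getenv("RANK"));
  const int worldSize = std::atoi(std::getenv("WORLD_SIZE"));

  // Rank 0 hosts the TCP store; all other ranks connect to it over TCP.
  c10d::TCPStoreOptions storeOpts;
  storeOpts.isServer = (rank == 0);
  storeOpts.numWorkers = worldSize;
  auto store = c10::make_intrusive<c10d::TCPStore>(masterAddr, storeOpts);

  // Gloo options object, as in the patch; use the default TCP device here.
  auto options = c10d::ProcessGroupGloo::Options::create();
  options->devices.push_back(c10d::ProcessGroupGloo::createDefaultDevice());

  return c10::make_intrusive<c10d::ProcessGroupGloo>(store, rank, worldSize,
                                                     options);
}
```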
-NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
-CXXFLAGS  := -std=c++11
+NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++17
+CXXFLAGS  := -std=c++17
I don't think we can force all users to move to c++17 just for this feature.
Agreed. I can gate the C++17 compilation so it applies only to GLOO builds.
#ifdef MPI_SUPPORT
  MPI_Barrier(MPI_COMM_WORLD);
#endif
if (!use_c10d_gloo) {
I don't understand why we need a boolean and these new if statements.
We normally build separate binaries: standalone builds for single node, and MPI=1 builds for multi-node.
I expected we'd end up with standalone, MPI=1, and GLOO=1 binaries.
This boolean helps enforce that only one transport is picked at run time, in case a user builds a single binary with both MPI=1 and GLOO=1.
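To illustrate the point, here is a hedged sketch of that kind of run-time guard for a dual-transport binary. MPI_SUPPORT comes from the existing code; the GLOO_SUPPORT macro name, the env-based selection, and the glooPg handle are assumptions for the example, not the patch's actual logic.

```cpp
#include <cstdlib>
#ifdef MPI_SUPPORT
#include <mpi.h>
#endif
#ifdef GLOO_SUPPORT
#include <torch/csrc/distributed/c10d/ProcessGroupGloo.hpp>
// Hypothetical global: the process group created during test setup.
extern c10::intrusive_ptr<c10d::ProcessGroupGloo> glooPg;
#endif

// Illustrative selection rule: take the Gloo path only when the c10d env
// vars are present, so a binary built with both MPI=1 and GLOO=1 never
// mixes the two transports in the same run.
static bool use_c10d_gloo = std::getenv("MASTER_ADDR") != nullptr;

static void testBarrier() {
#ifdef MPI_SUPPORT
  if (!use_c10d_gloo) {
    MPI_Barrier(MPI_COMM_WORLD);  // classic mpirun/SSH launch path
    return;
  }
#endif
#ifdef GLOO_SUPPORT
  if (use_c10d_gloo) {
    // barrier() returns a c10d Work handle; wait() blocks until all ranks
    // have entered the barrier over the Gloo TCP transport.
    glooPg->barrier()->wait();
  }
#endif
}
```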
src/common.cu
auto options = c10d::ProcessGroupGloo::Options::create();
// Create Gloo device that binds to any interface.
::gloo::transport::tcp::attr tcp_attr;
tcp_attr.iface = "eth0";
Why is the interface name hardcoded?
Good catch. I will make it configurable via an env variable. Thanks.
Use the "GLOO_INTERFACE" env variable to specify the network interface.