I was running ZCCL on my CPU cluster using hpcx/v2.11.0/gcc-7.3.1 and ran into the following problems (which I have since managed to solve):
1. sizeof() works improperly for MPI_Datatype
There are some places in the project (e.g. line 32 of ZCCL_ring.c: recvtype_extent = sizeof(recvtype)) where sizeof() is used to acquire the size of an MPI_Datatype. This caused segmentation faults (specifically: bad mapping errors) at runtime.
I assume this is because sizeof(MPI_Datatype) returns the size of the handle itself, a pointer that is 8 B on 64-bit systems, whereas we want the size of the data it describes (for instance, 4 B for MPI_FLOAT). I therefore replaced all such uses with MPI_Type_size(datatype, &size), which I packaged as mpi_sizeof() in ZCCL_utils.h.
2. buffer allocation and the 'count' parameter might be insufficient
This is a rather minor issue. I was surprised to find that, when fed totally random inputs, compressors can produce output larger than the original input. In such cases, a handful of buffers in the project can overflow, and the 'count' values passed to various functions like MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) might also be too small. The project works fine for me after these amendments.
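To give an idea of the kind of change involved, here is a rough sketch of sizing a buffer with some headroom for compressor expansion. The helper name padded_capacity() and both slack constants are hypothetical, chosen only for illustration; the true worst case depends on the compressor, and this is not necessarily how my fix computes it.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical slack parameters; the real bound depends on the compressor. */
#define SLACK_DIVISOR 8u   /* assumed: allow 1/8 (12.5%) expansion */
#define HEADER_BYTES  64u  /* assumed: fixed per-message metadata */

/* Returns a byte capacity for a buffer that must hold the compressed form
 * of raw_bytes of input, or 0 if that capacity would not fit in the int
 * 'count' argument of calls such as MPI_Recv. */
static size_t padded_capacity(size_t raw_bytes)
{
    if (raw_bytes > (size_t)INT_MAX) {
        return 0;
    }
    size_t slack = raw_bytes / SLACK_DIVISOR + HEADER_BYTES;
    if (raw_bytes > (size_t)INT_MAX - slack) {
        return 0;
    }
    return raw_bytes + slack;
}
```

The same value would be used both for the allocation and for the 'count' passed to MPI_Recv with MPI_BYTE, so the two stay consistent. The receiver can then check how many bytes actually arrived with MPI_Get_count.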
If you're interested, I'd be happy to submit a Pull Request with my fixes. Please let me know! 😊