Problems with acquiring sizes for MPI_Datatype and buffer allocation #5

@konnyakucstdio

I was running ZCCL on my CPU cluster using hpcx/v2.11.0/gcc-7.3.1 and ran into the following problems (which I have since managed to solve):

1. sizeof() works improperly for MPI_Datatype

There are some places in the project (e.g. line 32 of ZCCL_ring.c: recvtype_extent = sizeof(recvtype)) where sizeof() is used to acquire the size of an MPI_Datatype. This caused segmentation faults (specifically: bad mapping errors) when running.

I assume this is because sizeof(recvtype) returns the size of the MPI_Datatype handle itself, which is presumably a pointer (8 bytes on 64-bit systems), rather than the size of the data it describes (for instance, we want 4 bytes for MPI_FLOAT). I therefore replaced all such uses with MPI_Type_size(datatype, &size), which I wrapped as mpi_sizeof() in ZCCL_utils.h.

2. buffer allocation and the 'count' parameter might be insufficient

This is a rather minor issue. To my surprise, I found that when fed totally random inputs, the compressors can produce output that is larger than the original input. In such cases, a handful of buffers in the project can overflow, and the 'count' values passed to functions such as MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) can also be too small. The project works fine for me after these amendments.
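To make the idea concrete, here is a rough sketch of how a buffer could be sized with headroom for incompressible input and then checked against the int 'count' parameter. The helper names and the margin (one eighth of the input plus a 64-byte header allowance) are purely hypothetical and are not taken from ZCCL or from any compressor's documented worst case; the real bound depends on the compressor in use.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allowance for compressor metadata; not a ZCCL constant. */
#define HYPOTHETICAL_HEADER_BYTES 64u

/* Assumed worst-case compressed size: the raw size plus 1/8 of it plus a
 * fixed header allowance. Returns 0 if the sum would overflow size_t. */
static size_t compressed_bound(size_t raw_bytes)
{
    size_t slack = raw_bytes / 8u + HYPOTHETICAL_HEADER_BYTES;

    if (raw_bytes > SIZE_MAX - slack) {
        return 0;
    }
    return raw_bytes + slack;
}

/* Converts a byte capacity into an int 'count' for a byte-typed receive.
 * Returns -1 when the capacity does not fit into an int. */
static int bytes_to_count(size_t bytes)
{
    if (bytes > (size_t)INT_MAX) {
        return -1;
    }
    return (int)bytes;
}
```

A receive buffer would then be allocated with compressed_bound(raw_bytes) bytes, and the value from bytes_to_count() passed as 'count', so that a message larger than the uncompressed payload still fits.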

If you're interested, I'd be happy to submit a Pull Request with my fixes. Please let me know! 😊