Hi Blosc Team,
I have observed a significant slowdown (2-5 times slower than on a workstation) in compression/decompression speed on DGX servers (the CPUs are Intel Xeon Platinum 8480C and 8570). Reducing the number of threads helps, but according to the doc below, the speed should be similar. That said, my test may be limited because the sizes are small: 32 (blocks), 64 (chunks).
From c-blosc2/README_THREADED.rst (line 9 in 6c16487):
In order to reduce the overhead of threads as much as possible, I've
Or is the current roadmap for 3.0 aimed at exactly this kind of optimization for DGX servers? I found the points below very relevant.
From c-blosc2/ROADMAP-TO-3.0.rst (lines 11 and 13 in 6c16487):
* Optimization for multi-socket machines: right now, C-Blosc2 is optimized for single-socket machines. However, in multi-socket machines, memory access is not uniform (NUMA architecture), so optimizations are needed to make sure that every thread is accessing to local memory as much as possible. This would require to use e.g. `numactl <https://linux.die.net/man/8/numactl>`_ or `libnuma <https://man7.org/linux/man-pages/man3/numa.3.html>`_ so as to pin threads and memory allocations to the local socket.

* Support for GPUs: nowadays, GPUs are becoming more and more powerful, and having support for them in C-Blosc2 would be a great addition. The idea is to offload the compression, but most importantly, decompression tasks to the GPU, so that the CPU is free to do other tasks. This would require to use e.g. `CUDA <https://developer.nvidia.com/cuda-toolkit>`_ or `ROCm <https://rocm.docs.amd.com/>`_ so as to access to the GPU capabilities.
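
To make the multi-socket point concrete, here is a minimal sketch of the kind of workaround I could try on my side in the meantime. It is not anything Blosc provides: the helper names `parse_cpulist` and `pin_to_numa_node` are my own, and it assumes a Linux system that exposes NUMA topology under `/sys/devices/system/node/`. It only restricts which CPUs are used; it does not control where memory is allocated.

```python
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_numa_node(node=0):
    """Restrict the calling thread to the CPUs of one NUMA node (Linux only).

    Threads created afterwards inherit the mask, so this would need to run
    before the compression library starts its worker threads.
    """
    # Assumption: the kernel exposes per-node CPU lists at this sysfs path.
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # 0 means "the caller"
    return cpus
```

Calling `pin_to_numa_node(0)` at the very start of the program would be the intended usage. Running the whole process under `numactl --cpunodebind=0 --membind=0` should be a rough equivalent that also binds memory, but I have not verified that either approach closes the gap on the DGX.
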
For now, it would be very helpful if you could confirm whether blosc2 compression/decompression on a DGX is expected to be slower than on a modern workstation. I would also appreciate any advice on reducing latency with the current blosc2 on DGX servers.
Btw, it seems clear that the per-core clock speed of the Xeon Platinum 8480C and 8570 is lower than that of the workstation CPU (AMD Threadripper Pro 5975WX).
