
Thank you for open-sourcing such impressive work. When I ran main.py on 4 RTX 3090 GPUs with the default settings, I hit an NCCL error in the multi-GPU part: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL. I haven't identified the cause yet, but I suspect it is a problem with my workstation, so I switched to running the training program on a single GPU. As shown in the image, GPU memory usage increases with each epoch and eventually leads to an out-of-memory error. Do you have any idea where a memory leak might occur in the code?