The GPU memory usage continues to increase as the number of epochs increases #18

![Weixin Screenshot_20240905135527](https://github.com/user-attachments/assets/ab66f7a0-aa1b-42b6-b7a0-1d45f1d40a3a)
Thank you for open-sourcing such impressive work. When I ran `main.py` on 4 RTX 3090 GPUs with the default settings, I hit an NCCL error in the multi-GPU part: `[Rank 2] Watchdog caught collective operation timeout: WorkNCCL`. I haven't fully identified the source of that error and suspect it may be a problem with my workstation, so I switched to running the training program on a single GPU.

As shown in the image, GPU memory usage increases with each epoch on a single GPU and eventually leads to an out-of-memory error. Do you have any idea where a memory leak might occur in the code?
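For anyone who wants to confirm the growth numerically rather than from a screenshot, here is a minimal sketch of how it could be measured, assuming the training code is a PyTorch loop. The helper names `gpu_memory_mib` and `grows_every_epoch` are hypothetical and not part of this repository; only `torch.cuda.is_available()` and `torch.cuda.memory_allocated()` are real PyTorch calls.

```python
from typing import List, Optional


def gpu_memory_mib() -> Optional[float]:
    """Return MiB currently held by live PyTorch tensors on the current GPU.

    Returns None when PyTorch or CUDA is not available.
    """
    try:
        import torch  # assumption: the training code uses PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / (1024 ** 2)


def grows_every_epoch(readings: List[float], min_step_mib: float = 1.0) -> bool:
    """True if every reading exceeds the previous one by at least min_step_mib."""
    if len(readings) < 2:
        return False
    return all(b - a >= min_step_mib for a, b in zip(readings, readings[1:]))


# Hypothetical usage inside a training loop (placeholders, not the repo's API):
# readings = []
# for epoch in range(num_epochs):
#     train_one_epoch(...)          # stands in for the real training step
#     mem = gpu_memory_mib()
#     if mem is not None:
#         readings.append(mem)
# print(grows_every_epoch(readings))
```

Note that `torch.cuda.memory_allocated()` counts only live tensors, while the figure reported by `nvidia-smi` also includes memory cached by PyTorch's allocator, so the two numbers will generally differ.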
