The GPU memory usage continues to increase as the number of epochs increases #18

@mxuai

[Image: Weixin Screenshot_20240905135527]
Thank you for open-sourcing such impressive work. When I ran main.py on 4 RTX 3090 GPUs with the default settings, I hit an NCCL error in the multi-GPU part: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL. I haven't identified the cause yet, but I suspect it is an issue with my workstation, so I switched to training on a single GPU. As shown in the image, GPU memory usage increases with each epoch and eventually leads to an out-of-memory error. Do you have any idea where a memory leak might occur in the code?
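To narrow this down, the allocated memory could be recorded once per epoch and the epoch-to-epoch growth inspected. Below is a minimal sketch of that idea, assuming the project is built on PyTorch (the NCCL watchdog message suggests so). The helper names `allocated_mib`, `growth_per_epoch`, and `looks_like_leak`, and the `train_one_epoch` placeholder in the usage comment, are hypothetical and not taken from the repository.

```python
from typing import List


def allocated_mib() -> float:
    """Return the GPU memory currently held by live tensors, in MiB."""
    import torch  # imported lazily so the helpers below also work without a GPU

    return torch.cuda.memory_allocated() / 2**20


def growth_per_epoch(readings: List[float]) -> List[float]:
    """Difference between consecutive per-epoch memory readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


def looks_like_leak(readings: List[float], tolerance_mib: float = 1.0) -> bool:
    """True if memory rose by more than the tolerance at every epoch boundary."""
    deltas = growth_per_epoch(readings)
    return bool(deltas) and all(delta > tolerance_mib for delta in deltas)


# Hypothetical usage inside a training loop (train_one_epoch is a placeholder):
#
#     readings = []
#     for epoch in range(num_epochs):
#         train_one_epoch(...)
#         readings.append(allocated_mib())
#     print(growth_per_epoch(readings), looks_like_leak(readings))
```

If the readings grow steadily, one common cause in PyTorch training loops is accumulating a loss tensor across iterations (for example `total_loss += loss`), which keeps every iteration's autograd graph alive. Accumulating `loss.item()` instead avoids that. Whether this applies here is only a guess.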
