This repository was archived by the owner on Aug 6, 2025. It is now read-only.

What is a good grad_clip value with a cosine lr_scheduler when training a DiT-based model? #104

@JohnHerry

Thanks for the work. I have a question that may not be directly related to this project. We are training a large DiT-based model with a cosine learning-rate scheduler. The training data is large and part of it is noisy, so I think we should add gradient clipping during training to improve stability. The model contains 20 DiT layers. What would be a good starting value for grad_clip (applied to the gradient norm)? Thanks.
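
For context, here is a minimal sketch of the two pieces the question combines: a cosine learning-rate schedule and clipping by the global gradient norm. It is plain Python for illustration only and is not taken from this repository. The helper names `cosine_lr` and `clip_by_global_norm` are hypothetical, and `max_norm=1.0` is just a placeholder starting value, not a recommendation for any particular model. In PyTorch the clipping step would normally be done with `torch.nn.utils.clip_grad_norm_`, which also returns the pre-clip norm.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at `step` (no warmup, hypothetical helper)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so their joint L2 norm is at most `max_norm`.

    Returns (clipped_grads, pre_clip_norm). Logging the pre-clip norm over
    many steps shows how often a given threshold would actually trigger.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Toy example: a "noisy batch" produces a gradient with L2 norm 5.0.
grads = [3.0, 4.0]
max_norm = 1.0  # assumed placeholder threshold
clipped, pre_clip_norm = clip_by_global_norm(grads, max_norm)

# The step size then combines the clipped gradient with the scheduled LR.
lr = cosine_lr(step=50, total_steps=100, base_lr=1e-4)
update = [-lr * g for g in clipped]
```

One way to approach the choice of threshold, under these assumptions, is to record `pre_clip_norm` during an initial run and pick a value that clips only the outlier steps rather than most steps.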
