What is a good grad_clip value with a cosine lr_scheduler when training a DiT-based model?

Thanks for the work. I have a question that may not be directly related to this project. We are training a large model based on DiT, and the training lr_scheduler is a cosine scheduler. The training data is large and part of it is noisy, so I think we should add grad_clip during training to improve stability. The model contains 20 DiT layers. What grad_clip value (applied to grad_norm) would be a good one to try first? Thanks.
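To make the setup concrete, here is a minimal pure-Python sketch of what I mean by clipping on grad_norm together with a cosine schedule. The function names `cosine_lr` and `clip_by_global_norm` are hypothetical, and `max_norm=1.0` and `base_lr=1e-4` are only placeholder values for illustration, not recommendations. The clipping rule (scale every gradient by `max_norm / (total_norm + eps)`, capped at 1) is meant to mirror what `torch.nn.utils.clip_grad_norm_` does, as far as I understand it.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given step (no warmup, an assumption)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so their joint L2 norm is at most max_norm.

    `grads` is a list of per-layer gradient lists (toy stand-in for tensors).
    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in layer] for layer in grads], total_norm

# Toy gradients for two "layers"; their joint L2 norm is 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)  # placeholder max_norm
norm_after = math.sqrt(sum(g * g for layer in clipped for g in layer))

# Learning rate at the start, middle, and end of a 1000-step cosine schedule.
lr_start = cosine_lr(0, 1000, base_lr=1e-4)
lr_mid = cosine_lr(500, 1000, base_lr=1e-4)
lr_end = cosine_lr(1000, 1000, base_lr=1e-4)
```

Logging `norm_before` at every step would show how often clipping actually triggers for a given `max_norm`, which is the value I am unsure how to choose.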

What is a good grad_clip value with a cosine lr_scheduler when training a DiT-based model? #104