Thanks for your work. I have a question that may not be directly related to this project. We are training a large DiT-based model with a cosine learning-rate scheduler. The training set is large and part of it is noisy, so I think we should add gradient clipping during training to improve stability. The model has 20 DiT layers. What would be a good starting value for grad_clip (clipping on grad_norm)? Thanks.
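To make clear what I mean by clipping on grad_norm, here is a minimal sketch in plain Python. As far as I understand, this is the same rule that `torch.nn.utils.clip_grad_norm_` applies in PyTorch. The function name `clip_by_global_norm`, the toy gradients, and `max_norm=1.0` are only placeholders for illustration, not our real setup or a recommended value.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads: list of lists of floats (one inner list per parameter tensor).
    Returns (clipped_grads, total_norm_before_clipping).
    """
    # Global norm: L2 norm over every gradient entry of every tensor.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # The scale factor is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Toy gradients for two hypothetical parameter tensors (placeholder values).
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, norm_after)
```

The returned pre-clipping norm is the quantity I would log during training, since the typical range of that value is what the clipping threshold has to be compared against.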